Facial Emotion Recognition (FER)

Norwegian University of Science and Technology

Description

Facial expression (FE) is the most natural and convincing means of communicating human emotions, providing valuable insights to an observer assessing emotional incongruities. In health care, the FEs of a patient, particularly one with a neurological disorder (ND) such as Parkinson's disease, stroke, or Alzheimer's disease, can assist the medical doctor in evaluating the patient's physical condition, such as fatigue, pain, and sadness. ND patients usually undergo prolonged observation and clinical tests, which are invasive, expensive, and time-consuming. In the proposed FER module, a vision transformer (ViT) based FE recognition framework is developed to classify the facial expressions of the user (patient) with high accuracy. Initially, raw images of FEs are acquired from publicly available datasets, covering the expressions most common in patients: neutral, happy, sad, and angry. The framework crops faces from the images with a face detector, extracts high-level facial features, and feeds them to dense layers for classification. The trained model can be exported to other environments through dockerization and evaluated for real-time performance. The FER module has been evaluated both qualitatively and quantitatively on standard datasets such as the Karolinska Directed Emotional Faces (KDEF), FER-2013, and CK+, showing promising results.

A pictorial representation of the FER module is given below.
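Complementing the diagram, the following is a minimal sketch of the inference flow in Python, assuming a PyTorch/timm ViT backbone and an OpenCV Haar-cascade face detector; the checkpoint file fer_vit.pth, the class order, and the preprocessing details are illustrative assumptions, not the shipped implementation.

    # Minimal sketch of the FER inference flow. Assumptions: PyTorch + timm ViT
    # backbone, OpenCV Haar-cascade face detector, hypothetical checkpoint
    # "fer_vit.pth"; the shipped module may differ in all of these details.
    import cv2
    import timm
    import torch

    CLASSES = ["angry", "happy", "sad", "neutral"]  # the module's four classes

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    model = timm.create_model("vit_base_patch16_224", num_classes=len(CLASSES))
    model.load_state_dict(torch.load("fer_vit.pth", map_location="cpu"))
    model.eval()

    def predict(image_path: str) -> dict:
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        x, y, w, h = detector.detectMultiScale(gray, 1.1, 5)[0]  # first face
        face = cv2.resize(gray[y:y + h, x:x + w], (224, 224))
        # Replicate the single grayscale channel to the 3 channels ViT expects.
        tensor = (torch.from_numpy(face).float().div(255.0)
                  .unsqueeze(0).repeat(3, 1, 1).unsqueeze(0))
        with torch.no_grad():
            scores = torch.softmax(model(tensor), dim=1)[0]
        return {c: float(s) for c, s in zip(CLASSES, scores)}

Here the dense classification head is simply the num_classes output layer attached to the ViT features, mirroring the feature-extraction-plus-dense-layers design described above.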

Documentation

Inputs and Outputs

The trained FER model takes an image in grayscale format and returns the score (a float) of each expression class per input image in JSON format. The class with the highest probability becomes the final expression label for the input image.
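For illustration, a per-image response and the label selection could look like the sketch below; the JSON field names are assumptions, since only the content (one float score per class) is specified above.

    # Hypothetical per-image JSON response; the field names are assumptions.
    import json

    scores = json.loads('{"angry": 0.04, "happy": 0.87, "sad": 0.02, "neutral": 0.07}')
    label = max(scores, key=scores.get)  # class with the highest probability
    print(label)  # -> happy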

Background Info

In the development of the facial expression recognition (FER) module, a variety of machine learning (ML) and deep learning (DL) algorithms was evaluated on a range of publicly available datasets. Among traditional ML techniques such as random forest, support vector machine, and decision tree, the random forest performed satisfactorily during training, but at test time it did not reach the required level of accuracy. To cope with this, the experimentation was extended to DL, covering both sequence-based (CNN-LSTM, DeepConvLSTM, video transformer, 3DCNN, etc.) and frame-based (2DCNN, pre-trained models, vision transformer, etc.) learning approaches.

The performance of any DL-based technique depends directly on a properly annotated dataset, and dataset acquisition is a challenging task in research work. In total, twenty-three datasets were identified in the literature, but only seven were publicly available and could be acquired. To cover DL-based sequence learning techniques, we preprocessed and organized the RAVDESS dataset for training and testing with sequence learning approaches. In the preprocessing step, both cascade-based and MTCNN face detectors were used to detect and crop the face area, removing inessential background from the input frames. While evaluating the DL-based approaches, we found that the fine-tuned vision transformer (ViT) model, which relies on a self-attention mechanism for parameter learning, outperformed the other DL models. Initially, we predict four emotions (angry, happy, sad, and neutral) with our trained ViT-based model. In the future, the ViT-based model will be extended to predict six or seven facial expressions, with training and testing conducted on a combination of different publicly available datasets.
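As a sketch of that detect-and-crop step, the snippet below uses the mtcnn package (one common MTCNN implementation; the module's actual detector code may differ) to keep only the largest face in a frame.

    # Face detect-and-crop sketch with the `mtcnn` package (tooling assumption;
    # the module also used cascade-based detectors).
    import cv2
    from mtcnn import MTCNN

    detector = MTCNN()

    def crop_face(frame_bgr):
        """Return the largest detected face region, or None if no face is found."""
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB
        detections = detector.detect_faces(rgb)
        if not detections:
            return None
        # Keep the largest box, discarding background faces and clutter.
        x, y, w, h = max(detections, key=lambda d: d["box"][2] * d["box"][3])["box"]
        x, y = max(x, 0), max(y, 0)  # boxes can extend slightly past the frame
        return frame_bgr[y:y + h, x:x + w]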

Datasets & Samples

For an extensive evaluation of the FER module, a literature review was conducted to identify all available datasets for facial expression recognition and analysis (FER&A). Twenty-three datasets were considered for the FER&A experimentation, of which we succeeded in acquiring five: FER-2013, CK+, KDEF, JAFFE, and RAVDESS. The first four datasets were already arranged by their providers into the universal facial expression (FE) classes: angry, disgust, fearful, happy, sad, neutral, and surprise. The last, RAVDESS, was recorded with 24 professional actors (12 male and 12 female, aged 21-33) in the psychology department of Ryerson University and contains 7356 recordings, comprising audio-only files (song and speech) and videos (speech plus a stream of frames) of human FEs (calm, happy, sad, angry, and fearful). We selected only the visual part and preprocessed it for our problem, arranging the FE videos of all actors into per-class folders (neutral, calm, happy, sad, angry, and fearful), then extracting frames and detecting and cropping faces to remove the possible effect of background content on the final output. The final preprocessed RAVDESS dataset contains 1978 videos, with approximately 350 videos per class. In addition, the RAVDESS dataset was further preprocessed for frame-based learning by removing redundant frames. The trained model generates output in JSON format containing the probabilities of the four classes, where the maximum probability indicates the predicted expression of the user (patient).
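A minimal sketch of that per-class frame-extraction pass is shown below; the folder layout, file extension, and the fixed sampling stride used to thin out redundant frames are illustrative assumptions.

    # Extract sub-sampled frames from per-class RAVDESS video folders.
    # Layout ("<root>/<class>/*.mp4") and STRIDE are illustrative assumptions.
    import cv2
    from pathlib import Path

    CLASSES = ["neutral", "calm", "happy", "sad", "angry", "fearful"]
    STRIDE = 5  # keep every 5th frame to drop near-duplicate (redundant) frames

    def extract_frames(dataset_root: str, out_root: str) -> None:
        for cls in CLASSES:
            for video in sorted(Path(dataset_root, cls).glob("*.mp4")):
                out_dir = Path(out_root, cls, video.stem)
                out_dir.mkdir(parents=True, exist_ok=True)
                cap = cv2.VideoCapture(str(video))
                index = saved = 0
                while True:
                    ok, frame = cap.read()
                    if not ok:
                        break
                    if index % STRIDE == 0:
                        cv2.imwrite(str(out_dir / f"{saved:04d}.png"), frame)
                        saved += 1
                    index += 1
                cap.release()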

Installation Instructions

FER has been packaged as Docker images (backend and frontend) that can be run with the following commands:

  • Login to container registry using guest account:

    docker login gitlab.telecom.ntua.gr:5050 -u alameda_ai_toolkit_registry_guest -p ByeYyNesUxqQGs91FzzW
  • Run backend docker image:

    docker run -d -p 8000:8000 gitlab.telecom.ntua.gr:5050/alameda/alameda-source-code/ai-toolkit/ai-toolkit-registry/fer_aitoolkit_backend:latest
  • Run frontend docker image:

    docker run -d -p 3000:3000 gitlab.telecom.ntua.gr:5050/alameda/alameda-source-code/ai-toolkit/ai-toolkit-registry/fer_aitoolkit_frontend:latest
  • Access the application:

    Type http://localhost:3000 in your browser.
  • Logout from container registry:

    docker logout gitlab.telecom.ntua.gr:5050
  • Visit the API documentation in your browser (a Python client sketch follows this list):

    localhost:8088/apidocs
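
Once the backend container is running, a request could look like the sketch below; the port, the endpoint path /predict, and the multipart field name image are all assumptions, so consult the Swagger documentation above for the actual contract.

    # Hypothetical client call; the port, "/predict", and the "image" field
    # name are assumptions -- verify them against the Swagger documentation.
    import requests

    with open("patient_face.png", "rb") as f:
        resp = requests.post("http://localhost:8000/predict", files={"image": f})
    resp.raise_for_status()
    print(resp.json())  # per-class probabilities in JSON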

OpenAPI Documentation

View Swagger Documentation