Emotional tracking of patients can play a significant role in healthcare, since it may assist clinicians to easily monitor the progress of their patients, remotely. With the use of smartphones or other smart devices, meaningful insights related to the health condition can be collected from patients.
To this end, we developed the Mood Estimation Android App, an application that monitors the face of the patient while they interact with other mobile applications, and predicts their emotions based on their facial expression.
Mood Estimation Android App’s main component is the computer vision algorithm, which takes as inputs the video frames coming from the front camera of the smartphone, and estimates the mood of the user. More specifically, we trained the mini-Xception deep learning model, which is trained to recognise in real-time four facial expressions: ‘positive’, ‘negative’, ‘neutral’ and ‘other’.
In this hub, we include only the docker image of the mood estimation algorithm (and not the whole android application).
The docker image takes as input a video frame or an image. The first step of the pipeline includes the localization of the user’s face within the video frame. For the face detection, we use the Multi-task Cascaded Convolutional Networks (MTCNN) algorithm. MTCNN model locates the face of the user and draws a bounding box around it. The region of interest is cropped and fed to the emotion recognition model, which outputs the prediction regarding the emotional state of the user.
The docker image of MEAA that we include in this hub, receives as input an image or a video frame (one at a time) and feeds the data into the model. The model outputs four prediction probabilities, one for each possible mood (‘positive’, ‘negative’, ‘neutral’, ‘other’), and the final mood prediction is the one that scores the highest probability.
For the design of our deep learning model, we experimented with different computer vision models, like MobileNetV1, MobileNetV2 and mini-Xception. Several different architectures and hyper-parameter combinations have been tested and assessed with regards to both their prediction accuracy and latency for real-time inferences. After comparison, we concluded that miniXCEPTION demonstrated the best prediction performance and the shortest latency.
Specifically, Mini-Xception proved to be the best choice, since it achieves the best trade-off between model size, inference time and prediction accuracy on test and real-time data. Its size is less than 1MB, the weighted F1 average score equals 0.72 and its prediction accuracy on live data reaches 70-75%.
The architecture of mini-Xception starts with two Convolution layers (which are followed by Batch Normalisation and ReLU layer), followed by four residual blocks. Each block contains a convolution layer on the skip connection side, and the other side consists of two separable convolutions followed by a Max Pooling layer. All convolutional layers are followed by Batch Normalization and ReLu layers. At last, follows a convolutional layer, a Global Average pooling layer and the final classification takes place at the softmax layer
The Mood Estimation Android App has been developed following industrial standards. In this sense, the toolkit has been packaged and uploaded to the AI Toolkit as a dockerized container. It requires the following commands to open ports and run the model:
Login to container registry using guest account:
docker login gitlab.telecom.ntua.gr:5050 -u alameda_ai_toolkit_registry_guest -p ByeYyNesUxqQGs91FzzW
Pull and run the docker image:
docker run -p 8000:8000 gitlab.telecom.ntua.gr:5050/alameda/alameda-source-code/ai-toolkit/ai-toolkit-registry/meaa
Logout from container registry:
docker logout gitlab.telecom.ntua.gr:5050
Our dataset includes multiple and various facial expressions within each category, in order to create a dataset that represents well the different human facial expressions. Facial images are hard to be found available online, due to the strict copyright licences. For our model training, we collected facial images of people with various facial expressions, from different ages and ethnicities, with and without accessories, and faces under different lighting conditions. Our dataset is a combination of data collected from different sources like Kaggle (FER 2013 dataset, Jafar Hussain Human emotions dataset) and other open source databases such as Unsplash, Pexels and Pixabay.
Following we provide also some samples from the dataset that we used.