Fusion of Attentional and Traditional Convolutional Networks for Facial Expression Recognition



Introduction
A CNN is built from basic modules such as convolutional layers (Conv), in which convolutional kernels are used to create feature maps, and sub-sampling layers, which retain the desired features while reducing the spatial size of the feature maps and hence the amount of computation in subsequent convolutional layers. Sub-sampling comes in two main types: max pooling and average pooling. Max pooling computes the maximum value of each patch of each feature map; average pooling computes the average of each patch. Focusing on the three datasets CK+, Oulu-CASIA, and FER2013, we observe that although the same emotion can be expressed with different facial appearances, the facial muscles involved in expressing a given emotion are essentially the same, so a CNN-based feature extractor only needs to focus on certain facial regions in order to distinguish the basic emotions. In CNNs, correlated spatial features are extracted using filters, but the feature maps extracted by CNNs contain a lot of information that is not useful. To minimize the effect of unhelpful feature maps during classification, it is necessary to evaluate the importance of each feature map. Jie Hu et al. [10] proposed the SE-Block technique to address this issue, and their experiments demonstrated its effectiveness. Based on the above observations, we decided to study the application of SE-Block in combination with CNNs for facial feature extraction in the facial expression recognition problem.
Based on these observations, we propose a framework for facial expression classification based on a deep learning model. Specifically, we apply the Multi-task Cascaded Convolutional Networks (MTCNNs) [11] model for face detection, then calibrate (expand) the bounding box, and combine methods such as image normalization, scaling, augmentation (training only), and data shuffling (training only) to create inputs for the feature extraction and classification phase. In particular, we propose a technique based on a combination of three end-to-end CNN models for expression classification. The proposed model achieves high accuracy: 99.10% on The Extended Cohn-Kanade (CK+) [12] dataset with seven basic emotions (using the last 3 frames), 94.20% on the Oulu-CASIA [13] dataset with six basic emotions (from the 7th frame), and 74.89% on FER2013 [14] with seven basic emotions. We tested the feasibility of the system on three datasets: (1) The Extended Cohn-Kanade (CK+), with seven basic emotions: Anger, Contempt, Disgust, Fear, Happiness, Sadness, and Surprise; (2) Oulu-CASIA, with six basic emotions; and (3) FER2013, with seven basic emotions. The main implementation techniques are:
• Initialize the weights from a pre-trained model generated from the MS-Celeb-1M dataset [17]; the loss functions are ArcFace [18] and Softmax Loss [19].
• Use methods such as data shuffling and data augmentation (rotation, random crop, random left-right flip).
• Use the validation method: choose the model with the highest accuracy on the validation set.
• Use ensemble learning to increase accuracy on the datasets.
• For the FER2013 dataset, use the ten-crop [20] validation method.
The rest of the paper is organized as follows: Section 1: Literature Review, Section 2: Proposed Methods, Section 3: Experiments and Discussions, Section 4: Conclusion.

Related research
Guoying Zhao et al. [21] proposed the Volume Local Binary Pattern (VLBP) for feature extraction, with classification performed by an SVM (Support Vector Machine). The experiments were performed on the CK dataset [22] using 10-fold cross-validation over all frames, with an accuracy of 96.26%. Caifeng Shan et al. [23] used LBP (Local Binary Pattern) in combination with AdaBoost to create Boosted-LBP. Specifically, the most discriminative LBP histograms were selected with AdaBoost for each expression, and an SVM classifier then performed the classification; experimenting on the CK+ dataset with 10-fold cross-validation over all frames, the proposed model reached 91.4% accuracy. Jie Cai et al. [24] proposed a new loss function, Island Loss (IL), to enhance the separability of features extracted by CNN models. In particular, IL is designed to reduce intra-class variation and maximize the distance between one class and the others. They experimented on the CK+ dataset, using the last three frames to create 981 images divided into 10 folds, and evaluated with cross-validation: 8 folds for training, 1 fold for validating, and 1 fold for testing. The experimental results achieved an accuracy of 98%. This approach belongs to the static-based methods. Yang et al. [25] proposed the De-expression Residue Learning (DeRL) method to classify emotions: a GAN (Generative Adversarial Network) [26] model generates a neutral face for each input face image. The feature maps from the convolutional layers of both the generator and the discriminator are each passed through a sub-classifier, and all sub-classifiers are combined into the final classifier, which determines the emotional state of the input image (7 emotional states). On the CK+ dataset, using the last 3 frames and 10-fold cross-validation, Yang et al. achieved 97%; on the Oulu-CASIA dataset, also using the last 3 frames and 10-fold cross-validation, they achieved 88% accuracy. Kim et al. [27] proposed an approach combining information from two types of data, aligned (XA) and non-aligned (XNA), to increase the accuracy of Facial Expression Recognition (FER). Specifically, for the aligned data, the authors located facial landmarks in the original data to align the faces, producing a set of aligned face data (ZA). Starting from XA, they also proposed an Alignment-Mapping Network (AMN) to produce a generated aligned face (Z̃A) and a set of feature vectors (hA); the non-aligned data XNA is likewise fed into the AMN, yielding the feature vectors hNA. Next, XA, ZA, and XNA are each fed into separate Deep Convolutional Neural Networks (DCNs) to determine the probability of each emotion class. In the meantime, hA and hNA are also fed into MLP (Multi-Layer Perceptron) networks to determine the probability of each emotion class.
Finally, an ensemble learning technique combines the models at the decision level and labels the emotional state of the input image based on the rules: (1) majority vote, (2) average of the network outputs. The experimental results on the FER2013 dataset achieved 73.31% accuracy with both rules (1) and (2). Isha Talegaonkar et al. [28] proposed a dedicated CNN architecture to classify emotions. First, Haar Cascades features are used for face detection. Second, a normalization technique is applied. Finally, the proposed CNN architecture extracts features and classifies the emotional state of the input images. Experimenting on the FER2013 dataset, the accuracy on the PublicTest set is 89.78% and on the PrivateTest set is 60.12%.
In summary, the published studies reviewed above largely focus on the embedded feature vectors already extracted from CNNs: they process these embedded vectors and propose improvements at that stage. In our approach, we instead focus on the feature extraction phase itself, improving the discriminability of the feature vectors by combining SE-Block with each module (fire-module, fire-module bypass, and transition module) of SqueezeNet.

SqueezeNet, SqueezeNet with Complex Bypass, Inception Resnet V1
Forrest N. Iandola et al. [15] proposed the SqueezeNet model, which has 50 times fewer parameters than AlexNet [20] while achieving comparable accuracy on the ImageNet (ILSVRC 2012) dataset: AlexNet reached 57.2%, SqueezeNet 57.5%. The authors also released SqueezeNet with Complex Bypass, a variation of SqueezeNet that adds a convolutional layer with a 1×1 kernel to perform skip connections (shortcut connections) [9] in the fire-modules. Experiments showed that SqueezeNet with Complex Bypass achieves 58.8% accuracy on the ImageNet (ILSVRC 2012) dataset, a 1.6% increase over AlexNet. Another CNN architecture is Inception-Resnet V1, which is made up of a combination of the Inception-A, Inception-B, and Inception-C modules [16]. Skip connections, which have been proven to help models get deeper [9], are also added to each module. Its accuracy on the ILSVRC 2012 validation set is 78.7%.

Squeeze and Excitation Block
The Squeeze-and-Excitation block [10] (SE-block, illustrated in Figure 5) re-calibrates channel-wise feature responses by explicitly modelling interdependencies between channels; stacking SE-blocks yields SE-Net, which won first place in the ILSVRC 2017 [8] classification challenge with a top-5 error of 2.251%.
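For concreteness, below is a minimal PyTorch sketch of an SE-block: global average pooling ("squeeze") followed by a two-layer bottleneck with sigmoid gating ("excitation"), whose output re-scales each input channel. The reduction ratio of 16 is the default suggested in [10] and is an assumption here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (after Hu et al. [10])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: B x C channel descriptors
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # re-scale the feature maps channel-wise
```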

Pre-trained Model
A pre-trained model is a model that has been trained on another dataset to solve a problem similar or related to the one we want to solve. The MS-Celeb-1M dataset, a Microsoft Research public dataset released in 2016, was crawled from the internet and contains about 10 million images of nearly 100,000 individuals. This dataset is often used to create pre-trained models for face recognition problems. Chi Jin et al. [29] filtered this dataset to improve its quality (removing overlapping and missing classes). We use the dataset filtered by [29], from which we selected images of subjects with 20-30 samples each to train the pre-trained model. Jiankang Deng et al. [18] proposed the ArcFace loss function to extract highly discriminative features for effective face verification. Therefore, we used the ArcFace loss function when training the CNNs to construct the pre-trained models.
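For reference, the ArcFace loss of [18] adds an angular margin $m$ to the angle $\theta_{y_i}$ between a feature embedding and its class weight vector and scales the logits by $s$ (the values of $s$ and $m$ used for our pre-training are not restated here):

$$L_{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$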

Ensemble Learning
Ensemble methods are machine learning techniques that combine several base models to produce an optimal predictive model; combining multiple models can increase accuracy in classification problems. Model combination can operate in two modes: (1) feature-level (early fusion) and (2) decision-level (late fusion). With combination (1), the features of the sub-models are merged into a single feature that represents the input and is fed into the classifier to assign the label. With combination (2), each model makes its own decision, and the combined model gives a final classification based on methods such as (a) voting; (b) averaging: averaging the probabilities at the outputs of all models and selecting the class with the highest probability (see the sketch below); (c) weighting: the output of each model is weighted to reflect its contribution. In this paper, we combine the models at both levels, feature-level and decision-level; for the decision level, we use the averaging technique for facial expression classification.
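As an illustration of rule (b), a minimal sketch of average fusion over per-model softmax outputs (the per-model probability vectors are assumed given):

```python
import numpy as np

def average_fusion(prob_outputs):
    """Decision-level (late) fusion by averaging.

    prob_outputs: list of arrays of shape (num_classes,), each holding one
    model's softmax probabilities for the same input. Returns the index of
    the class with the highest mean probability.
    """
    mean_probs = np.mean(np.stack(prob_outputs), axis=0)
    return int(np.argmax(mean_probs))

# e.g., three models voting on 7 emotion classes:
# label = average_fusion([p_model1, p_model2, p_model3])
```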

Proposed Method
In this section, we discuss the phases of the facial expression classification system described in Figure 4.

Face Detection
At the face detection step, we use the Multi-task Cascaded Convolutional Networks (MTCNNs) method of [11]. MTCNNs is divided into three stages, each with its own CNN: P-Net, R-Net, and O-Net. First, the input image is scaled to 5 different ratios to form the inputs for P-Net. P-Net then returns candidate face regions as output. These regions are adjusted (padding) and scaled to 24×24 pixels to become the input for R-Net. R-Net removes non-face regions and calibrates region coordinates using bounding box regression; its output, like P-Net's, is a set of candidate face regions. Calibration techniques (padding, Non-maximum Suppression, NMS) are applied to these regions. Next, the candidate face regions are scaled to 48×48 pixels and fed into O-Net, which again classifies regions as face or non-face. For face regions, O-Net returns a confidence score and adjusted bounding box coordinates. Finally, NMS is used to compute the bounding box coordinates of each detected face. After a face has been found, we enlarge its bounding box by 20 pixels (in both width and height). Finally, each face region is cropped and resized to 128×128 pixels.
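As a sketch, the detection-and-crop step might be implemented as follows with the facenet-pytorch MTCNN (the choice of implementation, and the reading of the 20-pixel expansion as 10 pixels per side, are assumptions; the paper does not state either):

```python
from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=True)               # keep every detected face
img = Image.open("input.jpg").convert("RGB")

boxes, probs = detector.detect(img)           # bounding boxes and confidence scores
faces = []
if boxes is not None:
    for box in boxes:
        x1, y1, x2, y2 = box
        # enlarge the box by 20 pixels in width and height (10 per side)
        x1, y1 = max(0, int(x1 - 10)), max(0, int(y1 - 10))
        x2, y2 = min(img.width, int(x2 + 10)), min(img.height, int(y2 + 10))
        # crop the face region and resize it to 128x128 pixels
        faces.append(img.crop((x1, y1, x2, y2)).resize((128, 128)))
```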

Down-Sampling
To use Inception-Resnet V1 effectively, the FER2013 face images, with an original size of 48×48 pixels, are resized to 86×86 pixels. For both the Oulu-CASIA and CK+ datasets, the images are resized to 128×128 pixels after face detection.

Augmentation
Affine transformations and other transformations are often used to generate more data in deep learning, such as rotation, scaling, translation, black-out, random crop, ten-crop [20], and brightness and colour transformations. Data augmentation helps avoid overfitting in deep learning [28]. In this paper, we use random rotation with an angle in the range [−15°, +15°], random left-right flipping, and random contrast.
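A possible torchvision realization of this augmentation pipeline (the contrast jitter magnitude is an assumption; only the ±15° rotation is stated above):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomRotation(degrees=15),    # random rotation in [-15°, +15°]
    T.RandomHorizontalFlip(p=0.5),   # random left-right flip
    T.ColorJitter(contrast=0.2),     # random contrast perturbation
    T.ToTensor(),
])
```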

Normalization
We apply the linear transformation method mentioned by Bishop in [30] to normalize the input image in both the training and the testing phase. A linear transformation of the pixel values of the input images is one of the most common and simple forms of input normalization pre-processing. The linear transformation used in this paper brings all the original values into the same smaller range. It ensures effective input normalization for images with the following properties: (1) images of the same subject but with different contrasts, (2) varied pixel values (different lighting conditions). The transformation is carried out in two steps:
• Step 1: compute the mean value x̄ and the variance σ² of the image, where x is the pixel value at coordinates (i, j).
• Step 2: apply the linear transformation to each pixel: x′ = (x − x̄) / σ.
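A minimal NumPy sketch of this per-image normalization (the small eps safeguard against constant images is our addition, not part of [30]):

```python
import numpy as np

def normalize_image(img, eps=1e-8):
    """Per-image linear normalization: subtract the image mean and
    divide by the standard deviation."""
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + eps)
```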

CNNs architectures proposed in this work
To effectively solve the problem of image classification on ImageNet [31], Iandola et al. [15] proposed two architectures: SqueezeNet and SqueezeNet with Complex Bypass. The SqueezeNet model has 50 times fewer parameters than AlexNet while achieving comparable accuracy on the ImageNet (ILSVRC 2012) dataset. Specifically, SqueezeNet consists of 8 fire-module blocks plus max-pooling, Global Average Pooling (GAP), and Fully Connected (FC) layers. Each fire-module consists of two consecutive layers: a squeeze layer and an expand layer. The squeeze layer is a convolutional layer with a 1×1 kernel; it reduces the number of feature maps and feeds its output to the expand layer. The expand layer is formed from a convolutional layer with 1×1 kernels mixed with a convolutional layer with 3×3 kernels. SqueezeNet with Complex Bypass is similar to SqueezeNet, with a few changes: a skip connection is added to the fire-modules, and a 1×1-kernel convolutional layer is added to some fire-modules to form transition modules. The purpose of a transition module is to adjust the number of feature maps from the previous fire-module so that it equals the number of feature maps of the current fire-module, making the skip connection possible. In this paper, we leverage the high performance of the two above models to propose two extended models of SqueezeNet and SqueezeNet with Complex Bypass, called (1) SqueezeNet-SE and (2) SqueezeNet Complex-SE, in order to effectively solve the problem of facial expression classification. Model (1) (SqueezeNet-SE) consists of 9 fire-module blocks, each combined with an SE-block (Figure 8 describes the proposed SqueezeNet-SE model; Figure 7 shows the combination of the fire-module and the SE-block). Applying the SE-block to the expand-layer output re-calibrates the feature maps to highlight the important ones. Thus, corresponding to the 9 fire-module blocks, 9 SE-blocks are used to re-calibrate feature maps. Model (2) (SqueezeNet Complex-SE) is illustrated in Figure 9; a sketch of the fire-module/SE-block combination is given below.
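Below is a sketch of how one fire-module can be combined with an SE-block in the spirit of Figure 7. SEBlock refers to the sketch in the Squeeze and Excitation Block section; the channel counts are left as parameters, since the per-module sizes are given in Figure 8 rather than in the text.

```python
import torch
import torch.nn as nn

class FireSE(nn.Module):
    """One fire-module followed by an SE-block, as in SqueezeNet-SE."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # squeeze layer: 1x1 convolution reduces the number of feature maps
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_ch, squeeze_ch, kernel_size=1), nn.ReLU(inplace=True))
        # expand layer: a mix of 1x1 and 3x3 convolutions, concatenated
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # SE-block re-calibrates the expand-layer output
        self.se = SEBlock(expand1x1_ch + expand3x3_ch)

    def forward(self, x):
        s = self.squeeze(x)
        out = torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
        return self.se(out)
```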

Ensemble for facial expression recognition
In this phase, we combine the models. The output of the SqueezeNet-SE model is a 128-dimensional embedding vector named E1. E1 is fed into a Fully Connected (FC) layer that generates the output O1, with 7 nodes corresponding to the 7 facial emotions. Similarly, the output of SqueezeNet Complex-SE is a 128-dimensional embedding vector E2, which is also used as input to an FC layer that generates O2 (7 nodes). Finally, to increase the efficiency of the facial emotion classification system, we additionally use the Inception-Resnet V1 model, with the architecture proposed in [16]. Its output is a 128-dimensional embedding vector E3, which is fed to an FC layer whose output O3 also has 7 nodes. We stack the three vectors E1, E2, and E3 into a 128×3 matrix, then use a convolutional layer with a 1×3 kernel to synthesize the information of the three vectors into feature maps (a new 128×3×128 feature space). An average pooling operator then creates a new feature vector named AEV (Average Embedding Vector), of dimension 128×1×1, which is fed into an FC layer to generate O4 (7 nodes). We combine O1, O2, O3, and O4 into a 7×4 matrix and average it to obtain a vector called AFO (Average Final Output), of size 7×1, corresponding to the 7 emotions.
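A PyTorch sketch of the fusion head under one reading of the text above; the exact tensor layout of the 1×3 convolution and the axis of the average pooling are assumptions, since only the resulting sizes (128×3×128 feature maps, a 128×1×1 AEV, and 7-node outputs) are stated.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse three 128-d embeddings E1, E2, E3 into the output O4."""
    def __init__(self, dim=128, n_filters=128, n_classes=7):
        super().__init__()
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(1, 3))  # fuse across the 3 models
        self.fc4 = nn.Linear(dim, n_classes)                     # produces O4

    def forward(self, e1, e2, e3):
        m = torch.stack([e1, e2, e3], dim=2).unsqueeze(1)  # B x 1 x 128 x 3
        f = self.conv(m)                                   # B x 128 x 128 x 1
        aev = f.mean(dim=1).squeeze(-1)                    # average pooling -> B x 128 (AEV)
        return self.fc4(aev)                               # O4: B x 7

# decision-level step: AFO = mean of the four 7-d outputs
# afo = torch.stack([o1, o2, o3, o4], dim=2).mean(dim=2)
```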
Datasets

CK+: The Extended Cohn-Kanade (CK+) [12] dataset was published in 2010 by Lucey et al., based on the CK dataset. CK+ contains 593 sequences belonging to 123 subjects, but only 327 sequences from 118 subjects are labelled with the seven basic expressions: anger, contempt, disgust, fear, happiness, sadness, and surprise. Figure 12 illustrates some images from the CK+ dataset. Oulu-CASIA [13]: This dataset contains video sequences of 80 subjects showing the six basic expressions, captured with NIR and VIS imaging systems under three illumination conditions (strong, weak, dark). FER2013: The FER2013 dataset was published during the ICML competition in 2013 [14]. It is a large dataset of human face images showing emotions, collected from the internet in an unconstrained, real-world context, with all images 48×48 pixels in size. The dataset is divided into three subsets: (1) a training set of 28,709 images, (2) a PublicTest set of 3,589 images, and (3) a PrivateTest set of 3,589 images. Figure 14 illustrates some images from the FER2013 dataset. In summary, the CK+ and Oulu-CASIA datasets consist of sequence data recorded in a lab context; each video sequence starts with a neutral state, the expression level increases over time, peaking at the last frame, and both databases are labelled with basic emotions. The common challenge of these two datasets is that the amount of data is very small compared to the datasets typically used with CNN methods, leading to low generalization. The data of the two datasets are varied: male and female subjects, different skin colours, and light levels that vary between videos (for CK+). The same emotion may be expressed with relatively different facial muscle groups, which leads to confusion in emotion classification. The challenges of the FER2013 dataset are: images collected from the internet, static images, a free context, different light levels (different grayscale levels), and faces at free angles. Subjects are distributed across all ages, some images are incorrectly labelled, and some images do not contain a human face; the dataset also contains faces with glasses and hats. Moreover, subjects of different ages show the same emotion somewhat differently.

Experiments
Based on the proposed models: (1) Inception-Resnet V1, (2) SqueezeNet-SE, (3) SqueezeNet Complex-SE, and (4) the ensemble model, we implemented an emotion recognition system for each model. The three datasets FER2013, CK+, and Oulu-CASIA were used to experiment with, evaluate, and analyse the feasibility of the implemented systems. Implementation details:
• CK+: The CK+ dataset is not pre-divided into training, test, and validation sets as FER2013 is, so we extract the last 3 frames of each sequence. Cross-validation is then applied: the images are divided into 10 folds in ascending order, based on subject identity (9 folds for training, 1 fold for testing). The validation set is split from the training data (10% of the samples in the 9 folds).
• Oulu-CASIA: We use images from the VIS system under strong lighting (Strong), from frame 7 to the end. A 10-fold subject-independent cross-validation is performed, as in the CK+ experimental setting. The validation set is split from the training data (15% of the samples in the 9 folds).
• FER2013: The images are resized to 96×96 pixels, then the ten-crop method generates data for the training, PublicTest, and PrivateTest sets. The size of each cropped image is 86×86 pixels. At the end of the ten-crop procedure, the original image is also resized to 86×86 pixels, forming 11 images in total; a voting method selects the final classification for the input image (for the validation and test sets). A sketch of this procedure follows this list.
• Generating pre-trained models: for the 2 proposed models and Inception-Resnet V1, we generate three pre-trained models respectively; the pre-training process is described in the Pre-trained Model section.
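A possible torchvision realization of the ten-crop-plus-original voting scheme (the model and data-loading wiring are assumed):

```python
import torch
import torchvision.transforms as T

ten_crop = T.Compose([
    T.TenCrop(86),  # 4 corner crops + center crop, plus their flips
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])
resize = T.Compose([T.Resize((86, 86)), T.ToTensor()])

def predict_ten_crop(model, img):            # img: a 96x96 PIL image
    crops = ten_crop(img)                    # 10 x C x 86 x 86
    full = resize(img).unsqueeze(0)          # 1 x C x 86 x 86 (resized original)
    with torch.no_grad():
        logits = model(torch.cat([crops, full]))   # 11 predictions
    votes = logits.argmax(dim=1)             # per-crop class votes
    return votes.mode().values.item()        # majority vote over the 11 crops
```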

• Training parameters for the 4 models: we use a Stochastic Gradient Descent (SGD) optimizer both for creating the pre-trained models and for fine-tuning. A learning rate schedule is used, changing the learning rate based on the accuracy on the validation set; weight decay = 5e-5, dropout rate = 0.5, batch size = 96, random seed = 777. The loss function used during FER training is Softmax Loss, which is simply Softmax activation followed by Cross-Entropy Loss [19]. Images are represented in the RGB colour space. A PyTorch sketch of these settings is given below.
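In PyTorch terms, the optimizer setup might look as follows; ReduceLROnPlateau is one way to realize an accuracy-driven learning rate schedule, and the initial learning rate, momentum, factor, and patience are assumptions not stated in the text (model stands for one of the networks above):

```python
import torch

torch.manual_seed(777)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5)  # mode="max": tracks accuracy

# per epoch, after computing the validation accuracy val_acc:
# scheduler.step(val_acc)
```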

Results
Tables 1 and 2 show the classification accuracy (%) for each emotion on the CK+ and Oulu-CASIA datasets. The results of the three proposed models on the three datasets are comparable with other authors' methods (see Tables 3, 4, and 5). Specifically, on the FER2013 dataset, the ensemble model (74.87%) has 0.55% lower accuracy than the model of [35] (75.42%), and 1-3% higher accuracy than the other models (Table 5). Similarly, the three proposed models have 2-4% higher results on the CK+ dataset than those in other authors' studies (Table 3). As Table 4 shows, the three proposed models also outperform the models of other authors on the Oulu-CASIA dataset.

Discussion
Analysing the problem of classifying facial emotions, researchers have shown that different facial features contribute differently to effective emotion classification; specifically, regions such as the eyes, nose, and mouth have a greater effect on the performance of expression classification than other regions [47]. Thus, using the SE-block in combination with CNNs to re-calibrate feature maps (weighting the feature maps) is valuable for facial expression classification. Thanks to the SE-block's calibration of the feature maps, important features are retained and the influence of unimportant features is reduced immediately after each module in the CNNs. This allows the CNNs to extract embedding vectors with highly discriminative features, improving the performance of the classifier. The values of the important feature maps are changed little (weights close to 1.0), while the unimportant feature maps are suppressed (weights close to 0.0). Therefore, we proposed two models combining the SE-block and CNNs: (1) SqueezeNet-SE, a combination of the SE-block and SqueezeNet; (2) SqueezeNet Complex-SE, a combination of the SE-block and SqueezeNet with Complex Bypass.
To be more specific, in [9] Patrick Lucey et al. described the Facial Action Coding System (FACS), which specifies the sets of facial muscle movements corresponding to the emotions displayed on a face. From the FACS, we observe that the action units defining the basic emotions are mainly muscle groups in the regions around the eyes, nose, and mouth; it is therefore necessary to focus on extracting features from these regions to achieve high accuracy in facial expression recognition. For more effective recognition, information from these regions (eyes, nose, and mouth) must be extracted by the CNNs. However, among the many feature maps generated by each Conv layer, a lot of weakly informative regions (far from the eyes, nose, and mouth) also participate in the classification process, so it is necessary to minimize the effect of feature maps containing unimportant information. Hence we apply the SE-Block, a way to automatically re-calibrate feature maps, to minimize the impact of unimportant feature maps while retaining the important ones. Finally, our model outperforms current models on the two datasets CK+ and Oulu-CASIA (Tables 3 and 4). Our method gives results on the FER2013 dataset with an accuracy close to that of model [42] (0.56% less); however, our model is simpler. In addition, Cireşan et al. [48] showed that combining models achieves higher classification accuracy than using a single model: they used an ensemble model to solve image classification problems on six datasets (MNIST, NIST SD 19, HWDB1.0 on/off, CIFAR10, traffic signs, NORB), and the experimental results showed that the ensemble model is more efficient than a single model. We also use an ensemble model in this paper (see the Ensemble section). Our experiments show that the ensemble model has better accuracy than the single models on two datasets (FER2013 and Oulu-CASIA) and equivalent accuracy on the CK+ dataset.

Conclusion
In this paper, we proposed two models, SqueezeNet-SE and SqueezeNet Complex-SE, which combine CNNs with the SE-block. Furthermore, an ensemble model combining the two proposed models and the Inception-Resnet V1 model is also proposed. The proposed models were experimentally evaluated on three complex and challenging datasets: FER2013, Oulu-CASIA, and CK+. The experimental results show the feasibility of the proposed models.
In the future, we will: (1) for video data, develop a method that further exploits the temporal relationships between frames; (2) develop, extend, and test the efficiency of the system on other classification problems; (3) combine both spatial attention and channel attention to enhance the ability to extract features.