A Lightweight Face Recognition Model based on MobileFaceNet for Limited Computation Environment

Face recognition methods based on deep convolutional neural networks are difficult to deploy on embedded devices. In this work, we optimize the MobileFaceNet face recognition network so that it can be deployed in embedded environments. Firstly, we reduce the model parameters by reducing the number of layers in MobileFaceNet. Then, the h-ReLU6 activation function is used to replace PReLU in the original model. Finally, the Efficient Channel Attention (ECA) module is introduced to learn the importance of each feature channel. After optimization, the MobileFaceNet model is compressed to 3.4 MB, smaller than the original model (4.9 MB); the accuracies reach 98.52%, 97.54% and 91.33% on the test sets of LFW, VGGFace2 and the self-built database, respectively, and the recognition time is about 85 ms per image. This shows that the proposed method achieves a good balance between model complexity and model performance.


Introduction
Face recognition, as a biometric identification technology, has been used in a wide range of applications, such as security systems and identity verification systems. Traditional face recognition methods are based on Principal Component Analysis (PCA) [2]. Turk et al. [1] proposed Eigenface, which extracts the main features and reduces the data dimension for face recognition. Fisherface, proposed by Belhumeur et al. [3], computes the minimum within-class scatter via Linear Discriminant Analysis (LDA) to distinguish faces. However, these algorithms are not robust to changes in illumination and face pose. Between 2009 and 2012, face recognition algorithms based on sparse representation [4] became a hot research topic because of their better robustness to occlusion. Researchers also proposed algorithms such as face super-resolution, face illumination normalization, and face pose correction to counter the effects of illumination and pose changes.
With the rapid development of deep learning technology, important breakthroughs have also been made in the field of face recognition [5]. This approach trains the model on a large dataset in advance, which enables it to extract more generalized face features [6]. The AlexNet network proposed by Krizhevsky et al. started a wave of research on deep convolutional neural networks for face recognition [7]. Facebook proposed the DeepFace algorithm, which used a detection-point-based face detection method and a 9-layer CNN to extract 4096-dimensional features, achieving an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) database. Florian Schroff et al. [8] proposed the FaceNet algorithm, which applies the triplet loss to the CNN structure and achieved 99.63% accuracy on LFW. However, these networks have a huge number of parameters and therefore low computational efficiency, so they cannot be deployed on mobile and embedded devices [9]; as a result, many researchers turned to lightweight neural networks. SqueezeNet [10], proposed by researchers at Berkeley, was an early lightweight model, and Google proposed the Xception model [11], which achieves a lightweight network by compressing the network parameters. Google also proposed MobileNet [12], a lightweight network based on depthwise separable convolution, which not only reduces the network weight parameters but also improves the computational speed. MobileNetV2 [13] improved on this by adopting an expansion-and-compression strategy, which improves accuracy while reducing the number of parameters. MobileNetV2 reduces the storage size and improves the computation speed, making it suitable for mobile devices.
The MobileFaceNet [14] model is a lightweight face recognition network based on MobileNetV2, with a size of only about 4 MB and high accuracy. It is tailored for mobile and embedded devices and is well suited for face recognition in weak computing environments. The model replaced the average pooling layer with a separable convolution, used the InsightFace loss function for training, and introduced batch normalization [15]. Zhang et al. [16] introduced the style attention mechanism into the MobileFaceNet network [17] to enhance the feature representation and trained the model with the AdaCos [18] face loss function to improve its accuracy and robustness. Hang et al. [19] introduced the SE module into MobileFaceNet [20] and trained the model successively with the softmax loss and the InsightFace loss [21] to improve recognition accuracy on mobile devices. Bihao et al. [22] optimized the network structure of MobileFaceNet and proposed a new loss function, Focal-angle Loss, to improve its recognition rate.
Although MobileFaceNet is already a lightweight neural network designed for embedded devices, it still has shortcomings when ported to embedded devices because of their hardware limitations. A higher-resolution input image requires a deeper network structure to gradually extract face features, which increases the computational effort. Therefore, in this paper we lighten the network structure of MobileFaceNet, and the accuracy of the model is preserved by replacing the activation function and introducing the Efficient Channel Attention (ECA) module [23].

MobileFaceNet Network Structure
As shown in Table 1, the MobileFaceNet network takes a 112 × 112 resolution image as input and extracts face features from it; the output feature is 512-dimensional. The model contains a total of 20 layers, and the input image is processed at the beginning with a fast downsampling strategy. The next five bottleneck stages are repeated 1, 5, 1, 6, and 1 times to extract shallow-to-deep face features; the last few convolutional layers perform downsampling, and a 1×1 linear convolutional layer is added after the linear global depthwise convolution layer as the feature output.
In the MobileFaceNet convolutional neural network, the average pooling layer is replaced with a global depthwise convolution (GDConv). The output of the global depthwise convolution is given by equation (1), where K is the depthwise convolution kernel, F is the input feature map, i and j index the spatial width and height dimensions, and r indexes the current channel:

G_r = Σ_{i,j} K_{i,j,r} · F_{i,j,r}    (1)
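As a concrete illustration, the GDConv operation above can be sketched in PyTorch as a depthwise convolution whose kernel covers the entire incoming feature map; the 7×7 spatial size and 512 channels here follow MobileFaceNet's final stage, but are assumptions of this sketch rather than details from the text.

```python
import torch
import torch.nn as nn

# Global depthwise convolution (GDConv): a depthwise Conv2d whose kernel
# size equals the spatial size of the incoming feature map, so each channel
# is reduced to a single value G_r = sum_{i,j} K_{i,j,r} * F_{i,j,r}.
class GDConv(nn.Module):
    def __init__(self, channels: int, spatial_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels,
            kernel_size=spatial_size,
            groups=channels,   # depthwise: one filter per channel
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)   # (N, C, 7, 7) -> (N, C, 1, 1)

x = torch.randn(1, 512, 7, 7)
out = GDConv(512)(x)
print(out.shape)  # torch.Size([1, 512, 1, 1])
```

Because `groups=channels`, each output channel depends only on its own input channel, exactly the per-channel weighted sum in equation (1).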
When the input feature map size and the convolution kernel size are both W × H × R, the output feature map G has a size of 1 × 1 × R. The ratio of the computational overhead of the global depthwise convolution layer to that of the global average pooling layer is given by equation (2), where W, H, and R denote the width, height, and number of channels, respectively, and Q is the number of filters.
The specific network structure of MobileFaceNet is shown in Table 1, where t denotes the expansion multiplier, i.e., the channel expansion factor of the inverted residual block, c denotes the number of output channels, n denotes the number of repetitions, and s denotes the stride. MobileFaceNet inherits the bottlenecks of MobileNetV2 [24] and uses them as the main building block of the network; the structure of the bottleneck layer in MobileFaceNet is shown in Figure 1. The expansion factor in the bottlenecks is also reduced, PReLU is chosen as the activation function, and the InsightFace loss is used as the loss function. In addition, the MobileFaceNet network introduces a 7×7 separable convolution before the fully connected layer to replace the original average pooling layer, so that the features extracted by the network generalize better.
Depthwise separable convolution [25] is an important part of the MobileFaceNet network, because mapping the channel and spatial dimensions of the convolutional layers separately gives better results. As shown in Figure 2, depthwise separable convolution decomposes a traditional convolution into a depthwise convolution followed by a 1×1 pointwise convolution.
The computational cost of a traditional convolution is then Dk · Dk · M · N · DF · DF, where Dk is the kernel size, M and N are the numbers of input and output channels, and DF is the spatial size of the feature map.
The computation of the depthwise separable convolution is illustrated in Figure 3. First, the depthwise convolution performs a per-channel convolution, with one convolution kernel responsible for one channel, so the number of channels does not change at this stage. Since each kernel only sees part of the information in the feature map, 1×1 kernels are then used to perform a traditional convolution that combines all the feature maps into new feature maps carrying the full information; at this stage the number of channels can change. The total computational cost is given by equation (5), Dk² · M · DF² + M · N · DF², which reduces the computation to 1/N + 1/Dk² of that of a traditional convolution.
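The cost reduction above can be checked numerically. The following sketch (with illustrative channel and kernel sizes, not taken from the paper) counts the parameters of a standard convolution versus a depthwise-plus-pointwise pair; the parameter ratio matches 1/N + 1/Dk² exactly.

```python
import torch.nn as nn

# Depthwise separable convolution as described above: a per-channel
# (depthwise) convolution followed by a 1x1 pointwise convolution.
M, N, Dk = 64, 128, 3          # in channels, out channels, kernel size

standard = nn.Conv2d(M, N, Dk, padding=1, bias=False)
depthwise = nn.Conv2d(M, M, Dk, padding=1, groups=M, bias=False)
pointwise = nn.Conv2d(M, N, 1, bias=False)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

ratio = n_params(depthwise, pointwise) / n_params(standard)
print(f"parameter ratio: {ratio:.6f}")
print(f"1/N + 1/Dk^2  : {1 / N + 1 / Dk**2:.6f}")  # same value
```

With M = 64, N = 128, Dk = 3, the separable pair has 8768 weights against 73728 for the standard convolution, a ratio of about 0.1189 = 1/128 + 1/9.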
Compared with MobileNet, MobileFaceNet is more lightweight, with only 0.99 million parameters. A model with a smaller size and higher accuracy is more suitable for computation-limited environments. Table 2 compares the accuracy and number of parameters of MobileFaceNet and MobileNet. In this paper, the neural network in the MobileFaceNet algorithm is optimized to make it better suited to embedded platforms with limited computation capacity. Since this paper transmits feature values to realize the overall face recognition process, the number of feature points used during recognition and the size of the output feature values can be reduced appropriately. However, doing so inevitably lowers the recognition accuracy. We therefore replace the activation function and introduce the Efficient Channel Attention (ECA) module to ensure that the model retains high recognition accuracy. The network model is optimized in the following aspects. 1) We adjust the model input to 96×96 resolution images, then reduce the number of network layers by reducing the number of repetitions of the first and third bottleneck stages, thereby reducing the number of network parameters and the computational effort.
2) We replace the PReLU activation function in the bottleneck layers with h-ReLU6 to improve network performance.
3) We introduce the ECA module in the first and last bottleneck layers to learn channel weights automatically.

Network Structure
Considering that this paper targets face recognition in a weak computing environment and does not need to process complex images, the network input is adjusted to a 96×96 resolution image. Since the biggest advantage of small embedded devices is their small size and portability, the face is close to the camera when face images are acquired, as shown in Figure 4. Because the input image contains fewer features, an excessively deep network structure is no longer necessary. Therefore, we remove one of the bottleneck layers to make the neural network smaller and lighter. To reduce the computation without affecting the accuracy too much, we introduce the ECA module in the first and last bottlenecks; the adjusted network structure is shown in Figure 5.

Activation function
MobileFaceNet uses a bottleneck architecture as its feature extraction network, i.e., a bottleneck layer design that adds a depthwise convolution between two pointwise convolutions. The Sigmoid [26] function involves exponential and division operations, which are computationally expensive, and embedded devices cannot afford this level of computational cost. The activation function in the bottleneck layer of the original model is PReLU, which is an improvement over the ReLU6 used in MobileNetV2, but the effect is limited. PReLU is used as the activation function after the first pointwise convolution and the depthwise convolution, and a linear activation function is used after the last pointwise convolution. Figure 6(a) shows the PReLU function, and Figure 6(b) shows the ReLU6 function. The ReLU6 function limits the upper bound to 6, which alleviates the gradient vanishing problem of ReLU [27] caused by large gradients flowing through the neurons during computation. Since this does not completely prevent the phenomenon, we replace it with the h-ReLU6 function, which is based on the Swish function proposed by Google [28][29]. Google has shown through extensive experiments that the Swish function outperforms many commonly used activation functions; its function image is shown in Figure 7. As shown in equation (6), the h-ReLU6 function is constructed by imitating the Swish function. As shown in Figure 7, its function image is morphologically close to that of the Swish function, but it is still essentially a ReLU6 function rather than a Sigmoid function, which enables excellent activation performance at a smaller computational cost.
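The exact form of equation (6) is not reproduced in this text. As a hedged sketch, the following assumes h-ReLU6 takes the common "hard" Swish form built from ReLU6, f(x) = x · ReLU6(x + 3) / 6, which matches the description of a Swish-like curve computed without a Sigmoid; the paper's actual definition may differ.

```python
import torch
import torch.nn.functional as F

# Assumed form of h-ReLU6 (not confirmed by the source text):
# a Swish-shaped activation built only from ReLU6, so it avoids the
# exponential and division of a true Sigmoid-based Swish.
def h_relu6(x: torch.Tensor) -> torch.Tensor:
    return x * F.relu6(x + 3.0) / 6.0

x = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
print(h_relu6(x))  # zero for x <= -3, ~x for large positive x
```

Note that for x ≥ 3 the function is exactly the identity and for x ≤ -3 it is exactly zero, so only the middle range incurs the extra multiply.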

ECA Module
The ECA module assigns weights to individual channels by introducing a small number of parameters, helping the neural network learn important features. Many studies have optimized neural networks from the spatial-dimension perspective [30]; we make an attempt from the perspective of channel-weight optimization. In the literature [19], the Squeeze-and-Excitation (SE) module [31][32] is introduced into MobileFaceNet. Its structure is shown in Figure 9, and the experimental results show that the channel attention mechanism using the feature recalibration [33] strategy improves the recognition rate of the MobileFaceNet model in some cases, but it also affects the memory and computational power occupied by the model.

Figure 9. SE module
In this paper, we choose the Efficient Channel Attention (ECA) module, whose structure is shown in Figure 10. It improves on SE by avoiding dimensionality reduction, effectively capturing cross-channel interactions, and computing the weight of each channel with a very simple structure. The module implements a local cross-channel interaction strategy without dimensionality reduction, and an adaptive choice of the one-dimensional convolution kernel size reduces the complexity of the model while retaining the performance gain of more complex attention modules. Figure 11 shows the process of the ECA module. First, global average pooling (GAP) is applied to the original input features to obtain a one-dimensional channel descriptor. ECA then captures local cross-channel interactions with a fast one-dimensional convolution of size k, where k is generated by an adaptive function according to the number of input channels C. Next, the shared channel weights are generated by the Sigmoid function, and the original input features are combined with the channel weights to obtain features with channel attention. Unlike the SE module, the ECA module aims to learn effective channel attention with low model complexity, minimizing its additional parameters and computation.
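The ECA process described above (GAP, a 1-D convolution of adaptive size k across channels, a Sigmoid gate, then channel reweighting) can be sketched in PyTorch as follows; the adaptive kernel-size rule with gamma = 2 and b = 1 follows the ECA paper's common setting and is an assumption here.

```python
import math
import torch
import torch.nn as nn

# Sketch of the ECA module: global average pooling, a 1-D convolution of
# adaptive kernel size k across the channel dimension, and a sigmoid gate.
class ECA(nn.Module):
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1            # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (N, C, H, W) -> (N, C, 1, 1): per-channel descriptor via GAP
        y = x.mean(dim=(2, 3), keepdim=True)
        # 1-D conv across channels captures local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(1, 2)).transpose(1, 2).unsqueeze(-1)
        return x * self.sigmoid(y)           # reweight the input channels

x = torch.randn(2, 64, 14, 14)
print(ECA(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```

For C = 64 channels, the rule yields k = 3, so the whole module adds only three weights, illustrating why its memory and computation overhead is so small.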
The ECA module has only one one-dimensional convolutional layer with kernel size k, which determines the coverage of local cross-channel interactions; it is unnecessary to compute the attention between every pair of channels as the SE module does. Two fully connected layers would introduce too many parameters and too much computation, which is unsuitable for the weak computing environment in this paper. Considering that any additional module inevitably adds a small amount of parameters and computation, the ECA module is embedded only in the very first and last bottlenecks, which extract the shallowest and deepest features, respectively. Figure 12 shows the structure of the bottleneck layer after the ECA module is embedded.
EAI Endorsed Transactions on Internet of Things, 02 2022 - 04 2022, Volume 7, Issue 27, e1

Experimental Configuration
The experiments were conducted using PyTorch as the framework for building the deep learning network, with Windows 10 as the operating system and Python 3.6 as the programming language for training and testing. The hardware platform is an Intel Core i7-10700 CPU with 32 GB of memory and an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of video memory. OpenCV is used for image preprocessing. The network model is trained for 500 epochs with a batch size of 32. The SGD optimizer is used with an initial learning rate of 0.1, which gradually decreases as training progresses, down to a minimum of 0.0001.
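A minimal sketch of this training configuration is shown below. The model is a stand-in, and the cosine schedule is an assumed way to realize the described decay from 0.1 down to 0.0001, since the text does not name the scheduler used.

```python
import torch
import torch.nn as nn

# Values from the text: SGD, initial lr 0.1, decaying to a floor of 1e-4
# over 500 epochs, batch size 32. Model and scheduler are placeholders.
model = nn.Linear(512, 10)                     # stand-in for the network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=1e-4)        # lr decays 0.1 -> 1e-4

for epoch in range(3):                         # 500 in the actual setup
    optimizer.step()    # placeholder for one epoch of batches of 32
    scheduler.step()
print(f"lr after 3 epochs: {optimizer.param_groups[0]['lr']:.6f}")
```

Whatever the real schedule, the key constraint from the text is that the learning rate stays within [0.0001, 0.1] for the whole run.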
In addition, we deployed the network structures of MobileFaceNet before and after optimization to the embedded platform in turn, and evaluated the models in terms of memory consumption, recognition speed, and recognition accuracy.

Data Processing
In this paper, we use the CASIA-WebFace [33] face dataset as the training dataset. The MTCNN [35] face detection method is used to re-detect the images in the dataset, and the detected face images are cropped to 96 × 96. As shown in Figure 13, the algorithm is based on the three-level neural networks P-Net, R-Net, and O-Net, which progressively generate, correct, and refine candidate boxes, and finally outputs the positions of five facial feature points: the left eye center, right eye center, nose tip, left mouth corner, and right mouth corner.
Because the captured faces have a certain angle of in-plane rotation and varying distances between the face and the camera, the face sizes are inconsistent, so the captured faces must be aligned. In this paper, we directly compute the transformation matrix from the positions of the five feature points localized by MTCNN and the feature-point positions of the standard face template using a similarity transformation.
Assume the original coordinates are (x, y) and the coordinates after the similarity transformation are (x1, y1). As shown in equation (7), s1 and s2 are the scaling factors, (t1, t2) is the translation factor, r = 1 means horizontal flip (otherwise no flip), and θ is the rotation angle. The transformation matrix is solved from the feature-point coordinates generated by MTCNN and the standard face template feature points to achieve face alignment.
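The alignment step above can be sketched as a least-squares similarity fit (ignoring the flip term r for simplicity). The landmark and template coordinates below are illustrative, not the paper's; the synthetic "template" is just a scaled and shifted copy of the source points so the recovered transform can be verified.

```python
import numpy as np

# Solve for the 2x3 similarity matrix [[a, -b, tx], [b, a, ty]] that maps
# the five MTCNN landmarks (src) onto the standard template (dst), where
# a = s*cos(theta) and b = s*sin(theta).
def similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    n = src.shape[0]
    A = np.zeros((2 * n, 4))
    A[0::2, 0], A[0::2, 1], A[0::2, 2] = src[:, 0], -src[:, 1], 1  # x rows
    A[1::2, 0], A[1::2, 1], A[1::2, 3] = src[:, 1], src[:, 0], 1   # y rows
    a, b, tx, ty = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

# Five landmarks (eyes, nose tip, mouth corners); values are made up.
src = np.array([[38.0, 51], [73, 51], [56, 71], [41, 92], [70, 92]])
dst = src * 0.9 + np.array([5.0, 3.0])          # known scale + shift
M = similarity_transform(src, dst)
aligned = src @ M[:, :2].T + M[:, 2]
print(np.allclose(aligned, dst))  # True
```

In practice the resulting 2×3 matrix would be passed to an image-warping routine (e.g. OpenCV's affine warp) to produce the aligned face crop.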

Accuracy Evaluation
The purpose of this experiment is to verify how much each optimization step improves recognition performance compared with the original MobileFaceNet network, so the original network is chosen as the reference. The public datasets LFW [36] and VGGFace2 [37] and a self-built test set were used, where the self-built test set contains 30 people with 12 photos each and is divided into two parts: two photos are randomly selected from each person, 60 photos in total, to form the face library, and the remaining 10 photos per person, 300 in total, are used as test samples. The number of people and images in each dataset is shown in Table 3, and the recognition accuracies are shown in Figure 14. The accuracies of the original network on LFW, VGGFace2, and the self-built dataset are 98.62%, 97.48%, and 90.67%, respectively, which become 98.14%, 96.64%, and 90.13% after adjusting the network structure. The accuracies after replacing the activation function are 98.35%, 96.25%, and 90.33%, and they reach 98.52%, 97.54%, and 91.33% after introducing the ECA module. In theory, reducing the number of network layers should cause a significant drop in recognition accuracy, but in the experiments the accuracy on the LFW, VGGFace2, and self-built test sets decreases only by 0.48%, 0.84%, and 0.52%, respectively. This modest decrease suggests that face recognition requires relatively few features: the resolution of the input image can be reduced appropriately, and the number of network layers and parameters can be reduced accordingly.

Figure 14. Recognition accuracy in different databases
After replacing the activation function, the model improves accuracy by 0.21% and 0.20% on LFW and the self-built dataset, respectively, compared with the previous step, although accuracy drops by 0.39% on VGGFace2. This shows that in some cases h-ReLU6 outperforms the original PReLU.
Finally, the accuracy on all three datasets improved after introducing the ECA module, especially on VGGFace2 and the self-built dataset, where it was even higher than before the lightweight optimization. Overall, the recognition rate of the optimized model remains good, reaching 98.52%, 97.54%, and 91.33% on LFW, VGGFace2, and the self-built dataset, respectively.

Complexity Test
In this paper, we use the memory occupied by the model and the recognition speed on the embedded device as the criteria for the complexity test. The results are shown in Table 4, where the recognition speed is the average time taken to recognize one face image on the embedded device. From Table 4, the original model occupies 4.9 MB of memory with a recognition time of 117 ms per image. With the optimized network structure, the number of layers and the amount of computation are significantly reduced, so the model occupies only 1.8 MB of memory and the per-image recognition time is reduced by 34 ms. After replacing the activation function, the memory occupancy increases by 0.1 MB and the per-image recognition time increases by 5 ms. Equation (8) gives the PReLU function, f(x) = max(0, x) + a·min(0, x), where a is a learnable slope; compared with it, the h-ReLU6 function in equation (6) is more complex and requires more computation, but the overall impact is small. Introducing the ECA module likewise adds 0.2 MB of occupied memory and 0.4 ms to the per-image recognition time. It increases the complexity of the model; however, compared with other channel attention modules, the ECA module occupies less memory and requires less computation.
Although replacing the activation function and introducing ECA both increase the model's memory footprint and recognition time, overall the optimized model occupies 1.5 MB less memory than before optimization and recognizes each image 32 ms faster.

Conclusion
In this work, we optimized the face recognition model MobileFaceNet for limited computation environments.
Firstly, we reduced the resolution of the input image and the number of network layers to reduce the parameters and computational complexity of the model. Then, the activation function in the bottleneck layer was replaced with h-ReLU6, and the ECA mechanism was introduced to learn channel weights automatically. Finally, the proposed model was tested on the public datasets LFW and VGGFace2 and on the self-built dataset. Moreover, we deployed the optimized model on embedded devices for testing. The experimental results show a slight decrease in the recognition rate on the LFW dataset for the optimized model, but it is negligible overall. In general, the lightweight model is more suitable for use in a weak computing environment than the original model.