An Efficient Face Mask Detector with PyTorch and Deep Learning

INTRODUCTION: The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a major effect on the way we view our environment and our daily lives. The number of people affected by COVID-19 is rising at a tremendous pace, and many countries face economic catastrophe and recession as a result. One response is to isolate ourselves, remain at home, and detach from the outside world, but that is no longer an option: people need to earn a living, and nobody can remain at home indefinitely. As a precaution, people should wear masks and keep social distance, but some ignore these measures. OBJECTIVES: To develop a face mask detector with OpenCV, PyTorch, and deep learning that detects whether or not a person is wearing a mask. METHODS: A neural network model called ResNet is trained on the dataset. After training, this work uses a built-in face detector to locate faces. Finally, we predict whether or not a person is wearing a mask, along with the percentage of the face covered or uncovered. RESULTS: The proposed model achieved a validation accuracy of 97%, higher than that of the other algorithms compared. CONCLUSION: This face mask detection system was found to be apt for detecting whether or not people wear masks in public places, which contributes to their own health and to the health of their contacts during the COVID-19 pandemic.


Research Article, EAI Endorsed Transactions on Pervasive Health and Technology, 12 2020 - 01 2021 | Volume 7 | Issue 25 | e4

Introduction
COVID-19 has restructured life as we know it. Many of us remain at home, avoiding people on the streets and modifying our everyday routines, like going to school or work, in ways we had never imagined. The World Health Organization (WHO) declared coronavirus disease a pandemic [1] in March 2020. WHO Situation Report 96 estimated that the coronavirus had infected more than 2.7 million people worldwide and caused more than 180,000 deaths. Although several similar serious respiratory diseases, such as severe acute respiratory syndrome (SARS) [2] and Middle East respiratory syndrome (MERS) [3], have arisen in recent years, COVID-19 has had a higher prevalence than SARS. As a result, more people are worried about their well-being, and public health is considered a top priority for governments. While we modify old habits, we also need to adopt new behaviors. First and foremost is the practice of wearing a mask or face covering every time we are in a public space. As reported cases of COVID-19 continue to grow, the CDC (Centers for Disease Control and Prevention) recommends that everyone wear a cloth mask when going out in public. Experts claim that masks for community use do not stop the wearer from getting infected, but they do help prevent the spread of the disease by those who carry the virus. Face mask detection has therefore become a crucial computer vision challenge with a clear benefit to global society, yet studies on face mask detection remain scarce.
With the rapid development of machine learning techniques, the problem of facial recognition seems to be well solved. Despite the impressive success of current work, however, building better face detectors is becoming increasingly difficult. In particular, the identification of masked faces, which can be very helpful for applications such as video surveillance and event analysis, remains a major challenge for many current models. In this paper, we propose a face mask detector capable of detecting face masks and thereby contributing to public health. The proposed method fine-tunes the Residual Neural Network, popularly called ResNet, to construct the model. After training the model, we perform face detection to extract the Region of Interest (ROI) from each image. Finally, we pass the image through the constructed model to determine whether the face has a mask or not. ResNet is a deep, multi-layer network, and we apply data augmentation to improve the effectiveness of its predictions. Earlier architectures such as AlexNet, VGGNet, GoogleNet, and Inception require considerable time, memory, and computation to train, and can deliver lower accuracy. The methodology proposed in this paper reduces this time and memory consumption.
This paper is organized as follows: Section 2 describes related work in this area, Section 3 explains the methodology, i.e., the stages involved in developing the proposed system, Section 4 describes the results, and Section 5 concludes the paper.

Literature Survey
Earlier research in this area focused on the edges and grey values of the face image, relying on pattern recognition combined with prior knowledge of a face model. AdaBoost [4] proved to be an effective method for training classifiers. Face detection technology made a breakthrough with the iconic Viola-Jones detector [5], which greatly improved real-time face detection. The Viola-Jones detector built on Haar-like features [6] but failed to solve real-world issues and was affected by factors such as facial brightness and orientation. While Viola-Jones could be used in well-lit frames, it did not perform well in dark environments or on non-frontal images. These problems led researchers to develop new deep-learning face detection models that produce better results under varied facial conditions. Instead of using hand-crafted features, deep learning-based detectors have recently demonstrated exceptional performance due to their robustness and strong feature extraction ability. There are two main categories: one-stage and two-stage object detectors. A two-stage detector generates region proposals in the first stage and then refines those proposals in the second stage. It can provide high detection accuracy, but at a low speed. The seminal R-CNN work was proposed by R. Girshick et al. [7]. R-CNN uses selective search to propose candidate regions that may contain objects. The proposals are then passed to a CNN model to extract features, and a support vector machine (SVM) is used to classify the objects. However, the second stage of R-CNN [8] is prohibitively costly, as the network must process the proposals one by one and use a separate SVM for the final classification task.
Quintessential architectures such as AlexNet [9] and VGGNet [10] consist of stacked convolutional layers. AlexNet won the ImageNet LSVRC-2012 competition with 5 convolutional layers and 3 fully connected layers, while VGGNet improves on AlexNet by replacing large kernels with multiple 3×3 kernels in sequence. GoogleNet [11], the winning architecture of ILSVRC-2014, uses parallel convolution kernels and concatenates their feature maps. It uses 1×1, 3×3, and 5×5 convolutions together with 3×3 max-pooling. Smaller convolutions extract fine local features, whereas larger convolutions extract higher-level features. ResNet, which we use, introduces skip connections that allow deep neural networks to avoid the saturation and degradation of training accuracy. These architectures are also used for the initial extraction of features in face detection networks. Face recognition has also been performed using adaptive K-nearest neighbour, adaptive weighted average, reverse weighted average, and exponentially weighted average methods [21].

Methodology
Our proposed technique for detecting face masks starts with pre-processing, followed by the other stages shown in the architecture of the proposed system (Fig. 1). We used the Real-World Masked Face Recognition Dataset (RMFRD), which is freely available on the internet [22]. Fig. 2 and Fig. 3 show images from RMFRD with and without a mask, respectively. RMFRD is currently the largest real-world masked face dataset. It is openly accessible to academia and industry, so that different applications involving masked faces can be built on it. The dataset contains 5,000 portraits of 525 individuals wearing masks and 90,000 pictures of the same 525 subjects without masks. The whole project was implemented in Python using deep learning libraries such as PyTorch [12] and Caffe [13], and computer vision libraries such as OpenCV.

Pre-processing
Once the most suitable raw input data has been chosen, it must be pre-processed; otherwise, the neural network will not generate reliable predictions. The decisions taken at this stage of development are vital to the success of the network. Transformation and normalization [14] are two commonly used pre-processing methods. Transformation changes the raw data inputs to create a new input to the network, while normalization is a transformation applied to that new input to distribute the data evenly and scale it to an appropriate range for the network. Domain knowledge is critical in choosing pre-processing methods that highlight the intrinsic features of the data, which can improve the ability of the network to learn the mapping between inputs and outputs.
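As a minimal sketch of the two pre-processing steps described above, the following NumPy snippet first transforms an 8-bit RGB image into floating-point values in [0, 1] and then normalizes each channel. The function name `preprocess` is hypothetical, and the mean and standard deviation are the commonly used ImageNet statistics, which is an assumption: the paper does not state which values were used.

```python
import numpy as np

# Assumed per-channel statistics (the widely used ImageNet values).
# The paper does not specify which statistics were applied.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """Turn an H x W x 3 uint8 RGB image into a normalized 3 x H x W float array."""
    # Transformation: rescale raw pixel values from [0, 255] to [0, 1].
    scaled = image_uint8.astype(np.float32) / 255.0
    # Normalization: center and scale each channel.
    normalized = (scaled - MEAN) / STD
    # Reorder to channels-first, the layout PyTorch models expect.
    return np.transpose(normalized, (2, 0, 1))


demo = np.full((224, 224, 3), 255, dtype=np.uint8)  # an all-white test image
out = preprocess(demo)
```

The same statistics would have to be applied at training time and at inference time, otherwise the network sees inputs from a different range than the one it was trained on.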

Data Augmentation
Data augmentation [15] is a method for artificially enlarging a training dataset by generating modified versions of the images it contains. Training deep neural network models on more data results in more robust models, and augmentation techniques generate image variations that improve the ability of the fitted models to generalize what they have learned to new images. We applied a variety of image manipulation operations to our dataset, such as shifts, flips, zooms, and mean subtraction. Fig. 4 illustrates an example of data augmentation.
The ResNet50 architecture has four stages, as shown in Fig. 6, Fig. 7, Fig. 8, and Fig. 9. The network accepts an input image whose height and width are multiples of 32 and which has 3 channels. We used an input size of 224 × 224 × 3. Every ResNet architecture performs the initial convolution and max pooling with 7×7 and 3×3 kernels, respectively. Stage 1 of the network then begins; it has 3 residual blocks containing 3 layers each. The number of filters used for the convolutions in the three layers of a Stage 1 block is 64, 64, and 256, respectively. The curved arrows in the figures denote the identity connection. The dashed arrow indicates that the convolution in the residual block is performed with stride 2, so the input is halved in height and width while the channel width is doubled. As we move from one stage to the next, the channel width is doubled and the spatial size of the input is halved.
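The residual block described above can be illustrated with a short PyTorch sketch. It is not the authors' implementation: the class name `Bottleneck` and the channel widths in the usage lines are illustrative assumptions that follow the standard three-layer ResNet50 block (1×1, 3×3, 1×1 convolutions). The shortcut is an identity connection when the shapes match and a strided 1×1 projection otherwise, which corresponds to the curved and dashed arrows in the figures.

```python
import torch
from torch import nn


class Bottleneck(nn.Module):
    """Three-layer residual block: 1x1 -> 3x3 -> 1x1 plus a shortcut connection."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut: matches the new spatial size and channel width.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Identity shortcut: the input is added back unchanged.
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + self.shortcut(x))


x = torch.randn(1, 256, 56, 56)
same = Bottleneck(256, 64, 256)(x)             # identity shortcut, shape preserved
down = Bottleneck(256, 128, 512, stride=2)(x)  # stride 2: half the size, double the width
```

With stride 2, the 56 × 56 feature map becomes 28 × 28 while the channel width goes from 256 to 512, which is the halving and doubling between stages described in the text.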

Figure 7. Stage 2 Architecture of ResNet50
Fine-tuning is a transfer learning technique: knowledge gained while training on one problem is reused to learn a similar task or domain. Fine-tuning ResNet is a 3-step process.
• Pre-train the ResNet on ImageNet [16].
• Freeze the ResNet base layers, so that the weights of these layers are not modified during backpropagation.
• Tune the weights of the head layer.
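The freezing and head-tuning steps listed above can be sketched in PyTorch as follows. To keep the example self-contained, `base` is a tiny hypothetical stand-in for a ResNet that would, in practice, be loaded with ImageNet weights, and the single-logit `head` and the learning rate are likewise assumptions rather than the authors' configuration.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained ResNet base (step 1 would load
# ImageNet weights into a real ResNet instead).
base = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 1)  # new head: one logit for mask vs. no mask

# Step 2: freeze the base so backpropagation leaves its weights untouched.
for param in base.parameters():
    param.requires_grad = False

model = nn.Sequential(base, head)

# Step 3: hand only the trainable (head) parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

before = base[0].weight.clone()
logits = model(torch.randn(4, 3, 32, 32))
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(4, 1))
loss.backward()
optimizer.step()
```

After the update step, the base weights are unchanged and only the weight and bias of the head have received gradients.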

Train the model
We use 75% of the available data for training and hold out the remaining 25% for testing. We first compile the model with the Adam optimizer, learning rate decay, and the binary cross-entropy loss, since this is a two-class problem. The model is then trained and validated on the training and testing sets.
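The training setup described above (a 75/25 split, Adam, learning rate decay, and binary cross-entropy) can be sketched as follows. The data are synthetic, the linear `model` is a hypothetical placeholder for the fine-tuned network, and the linear decay schedule and initial learning rate are assumptions: the paper does not specify them. The epoch count and batch size follow the values reported in the results.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in data: 80 feature vectors with binary labels.
features = torch.randn(80, 16)
labels = (features[:, 0] > 0).float().unsqueeze(1)
dataset = TensorDataset(features, labels)

# 75% of the data for training, the remainder for testing.
n_train = int(0.75 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = nn.Linear(16, 1)            # hypothetical placeholder model
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

epochs = 20
# Assumed schedule: decay the learning rate linearly towards zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: 1.0 - epoch / epochs
)

history = []
for epoch in range(epochs):
    running = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running += loss.item() * len(x)
    scheduler.step()                     # apply the decay once per epoch
    history.append(running / n_train)    # mean training loss for this epoch
```

`BCEWithLogitsLoss` applies the sigmoid internally, so the model outputs raw logits during training and a sigmoid is only needed when probabilities are reported.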

Perform Face Detection
Now that our model is trained, we can perform mask detection. First, however, we need to detect the face in the image. For this, we used the Caffe-based face detector [17], which is available in the face detector subdirectory of OpenCV's Deep Neural Network (DNN) samples. We set a parameter called confidence, a probability threshold of 50% by default that can be overridden, to filter out weak face detections. Once a face has been located in the image, we check that its detection confidence meets the threshold before extracting the Region of Interest from the face.
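The confidence filtering step can be illustrated without loading the detector itself. The sketch below assumes the output layout produced by OpenCV's SSD-style face detector: an array of shape (1, 1, N, 7) in which each row holds the confidence at index 2 and a box normalized to [0, 1] at indices 3 to 6. The function name `filter_detections` and the hand-made `fake` detections are hypothetical.

```python
import numpy as np


def filter_detections(detections, width, height, min_confidence=0.5):
    """Keep detections above the threshold and scale their boxes to pixels.

    `detections` is assumed to have shape (1, 1, N, 7), with each row laid out
    as [image_id, class_id, confidence, x1, y1, x2, y2] and the box normalized
    to the range [0, 1].
    """
    kept = []
    scale = np.array([width, height, width, height])
    for row in detections[0, 0]:
        confidence = float(row[2])
        if confidence > min_confidence:  # drop weak face detections
            box = tuple(int(v) for v in row[3:7] * scale)
            kept.append((confidence, box))
    return kept


# Two hand-made detections: one strong, one weak.
fake = np.zeros((1, 1, 2, 7), dtype=np.float32)
fake[0, 0, 0] = [0, 1, 0.9, 0.25, 0.25, 0.75, 0.75]
fake[0, 0, 1] = [0, 1, 0.3, 0.10, 0.10, 0.20, 0.20]

faces = filter_detections(fake, width=400, height=200)
```

Only the first detection survives the 50% threshold, and its box is returned in pixel coordinates of the 400 × 200 frame.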

Extract ROI from each Image
The face ROI [18] is a rectangle placed automatically to cover the face, hair, and neck of the subject, while the eye and mouth ROIs of each subject are assigned coordinates within that rectangle. We then pre-process the ROI in the same way as during training.
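Cropping the ROI from a detected box can be sketched with plain array slicing. The helper `extract_roi` is hypothetical; it assumes the box is given in pixel coordinates as (x1, y1, x2, y2) and clips it to the image, since a detector may return coordinates slightly outside the frame.

```python
import numpy as np


def extract_roi(image: np.ndarray, box) -> np.ndarray:
    """Crop the face region from an H x W x C image, clipping the box to the image."""
    height, width = image.shape[:2]
    x1, y1, x2, y2 = box
    # Clamp the corners so the slice never leaves the image.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width - 1, x2), min(height - 1, y2)
    return image[y1:y2, x1:x2]


frame = np.zeros((100, 200, 3), dtype=np.uint8)  # height 100, width 200
roi = extract_roi(frame, (-10, 20, 250, 80))     # box partly outside the frame
```

The cropped region would then be resized and normalized with the same pre-processing that was applied to the training images.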

Apply Face Mask Detection
We passed the image in Fig. 3 through our constructed model to detect whether the face had a mask. We determined the class of the image from the probabilities returned by the detector and assigned an associated color for annotation. We then drew a bounding box using OpenCV, together with the class label and the predicted probability. Fig. 10 and Fig. 11 depict the output of the designed system. Below is the proposed pseudocode for the implementation.
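Separately from the authors' pseudocode, the labeling step alone can be illustrated with a minimal sketch. It assumes the model returns a single logit per face that is mapped to a mask probability by a sigmoid; the function name `classify`, the 0.5 decision threshold, and the green and red colors (in OpenCV's BGR order) are assumptions.

```python
import math


def classify(logit: float, threshold: float = 0.5):
    """Map a model logit to an annotation text and a BGR color."""
    p_mask = 1.0 / (1.0 + math.exp(-logit))  # sigmoid: probability of "Mask"
    if p_mask >= threshold:
        label, probability, color = "Mask", p_mask, (0, 255, 0)        # green
    else:
        label, probability, color = "No Mask", 1.0 - p_mask, (0, 0, 255)  # red
    return f"{label}: {probability * 100:.2f}%", color


text, color = classify(-2.0)
```

The returned text and color could then be passed to OpenCV's `cv2.putText` and `cv2.rectangle` to annotate the frame around the detected face.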

Results and Discussions
As stated in the preceding section, we used RMFRD, which includes masked and unmasked images of people. Our work is experimental: it was implemented in Python on a real-world dataset, and this section describes the experimental results. The experiments with the proposed face mask detection scheme were implemented in Python 3 using deep learning libraries such as PyTorch and Caffe and computer vision libraries such as OpenCV, which are well suited to designing a deep neural network like ResNet. Fig. 12, Fig. 13, and Fig. 14 illustrate the training and validation loss graphs of ResNet9, ResNet15, and ResNet50, respectively. The accuracies achieved using ResNet9, ResNet15, and ResNet50 are 83%, 89%, and 97%, respectively, when the model is trained for 20 epochs with a batch size of 32. We plot a ROC (Receiver Operating Characteristic) [19] curve (Fig. 15), which illustrates the prediction capability. The ROC curve is obtained by plotting the true positive rate (TPR), often called sensitivity, against the false positive rate (FPR) [20].
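How a ROC curve is obtained from predicted scores can be shown with a small, dependency-free sketch. The labels and scores below are made-up illustrative values, not results from the paper, and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]               # 1 = mask, 0 = no mask (illustrative)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # illustrative predicted probabilities
curve = roc_points(labels, scores)
area = auc(curve)
```

A curve that rises quickly towards a TPR of 1 at a low FPR, and hence an area close to 1, indicates a classifier that separates the two classes well.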

Conclusion
This face mask detection system was found to be apt for detecting whether people wear masks in public places, which contributes to their own health and to the health of those around them during the COVID-19 pandemic. It can assist government authorities in the detection process by taking CCTV camera footage from public places as input. Our detection system performed very well when trained on RMFRD, the world's largest face mask dataset, and we presume that it would perform even better when trained on larger datasets in the future. We achieved an overall accuracy of 97%.

Recommendations
Face masks have become a common part of our lives, and wearing them is mandatory during the current pandemic to ensure our own safety and that of others. Hence, this system can be used by governments to enforce precautions and ensure that citizens wear masks. It can be applied to CCTV camera footage at traffic signals and on crowded streets to detect people who are not wearing a mask, so that appropriate action can be taken.

Future Research Work
This work can be further extended to detect face shields by changing the Region of Interest used in the current work.