Six-layer Optimized Convolutional Neural Network for Lip Language Identification

INTRODUCTION: Lip language is one of the most important communication methods in social life for people with hearing impairment and impaired expressive ability. This communication method relies on visual recognition to understand the meaning expressed in communication. OBJECTIVES: To improve the accuracy of this natural-language recognition, we propose a six-layer optimized convolutional neural network for lip recognition. METHODS: The convolutional-layer computation of the CNN model is used, and two pooling methods, maximum pooling and average pooling, are compared in order to extract the most important feature data in the image. To reduce over-fitting during model training, the drop rate has been optimized by introducing Dropout. RESULTS: The recognition accuracy of the six-layer convolutional neural network reaches 85.74% on average, showing that this method can effectively recognize lip language. CONCLUSION: We propose a six-layer optimized convolutional neural network method for lip language recognition; its identification of lip language features is better than that of three advanced methods: 3D DenseNet + 1×1 Conv + resBi-LSTM, 3D CNN, and ConvNet + 2×256-LSTM + VGG-16.


Introduction
Human perception of speech is a very complex multi-modal process: in addition to acoustic features, it also involves the comprehensive application of lexical, grammatical, semantic, and contextual knowledge. A large number of studies have shown that both acoustic language and visual language play a very important role when people understand the content of speech. In terms of the production mechanism of speech, both acoustic speech and visual speech are produced by vocal organs such as the vocal cords, soft palate, tongue, teeth, lips, jaw, and nasal cavity [1]. Some of these articulators can be observed visually. After multiple rounds of training, and after selecting appropriate thresholds based on experience, a better lip-segmentation binary image is obtained. Once the lip-shape binary image is obtained, a template is constructed to smooth the image and extract the edges, and an appropriate number of edge feature points is selected; finally, a neural network is applied to repeatedly train on the edge feature points to obtain a smooth edge-fitting curve.
The most advanced lip language recognition technology so far is an automatic lip-labelling system based on the pyramid LK (Lucas-Kanade) optical flow method. The system first uses speech-processing technology and facial lip-region localization to pre-process the video, and then uses the optical flow method to compute the movement information of the lips between adjacent frames, accurately marking the times at which the lips change.
Compared with labelling by speech recognition alone, the lip samples labelled by this system are more accurate and the resulting data set is of higher quality. To realize the recognition of Chinese lip language, a Chinese Phrase Lip Data Set (CPLDS) was established using this system. The construction of such a data set remains difficult. In general lip language recognition, therefore, the accuracy rises as the number of samples increases, but there is still a long way to go to reach accuracy as high as that of speech recognition [4].
A convolutional neural network (CNN) is a kind of feedforward neural network developed in recent years, and a recognition method that has attracted wide attention. A CNN is a feedforward multi-layer network: information flows in only one direction, from input to output.
Each layer uses a set of convolution kernels to perform multiple transformations. The CNN model mainly includes convolutional layers, pooling layers, and fully connected layers [5]. Its neurons are arranged hierarchically, and each neuron is connected only to neurons of the previous layer: it receives the output of the previous layer and passes its own output to the next layer. Its artificial neurons respond to surrounding units within a limited receptive field, which gives CNNs great advantages in processing large images. The network avoids complex image pre-processing and can take the original image directly as input. Based on the CNN model, combining multiple convolutional and pooling layers to form a new network model can improve the accuracy of the network structure [5]. Building on the results that previous researchers obtained with large-scale computing clusters, dedicated hardware, and vast amounts of data, convolutional neural networks have been widely and usefully applied to image classification and object recognition. Although they do not have the creativity of the human mind, their strong object-recognition ability is well worth drawing on.

Dataset
The pronunciation of Chinese characters is expressed in pinyin, and pinyin is made up of syllables and tones. Since 1955, Chinese pinyin has been used as a tool to assist the pronunciation of Chinese characters. It is similar to the English phonetic alphabet, but differs from it in many respects [6].
Research on Chinese shows that the pronunciations of Chinese characters can be represented by more than 1,300 syllables. A syllable is composed of an initial and a final: the initial is the beginning of the syllable, and the rest is the final. There are 23 initials, which can be divided into bilabial, labiodental, alveolar, alveolo-palatal, retroflex, and velar sounds [7]. The pronunciation classification is shown in Table 1. There are 32 Chinese phonemes in total, as shown in Table 2. On this basis, we collected experimental lip language samples (see Figure 1) for the convolutional neural networks [8]. Image recognition is not an easy task; a good approach is to apply metadata to unstructured data, and one way to do this is to use a convolutional neural network.

Convolutional Layer
The output of a convolutional layer can be written as

Y = W * X + B

where W is the weight (convolution-kernel) vector, X is the input feature-map vector, B is the bias, and Y is the output feature map. For the j-th output feature map, summing over the input feature maps gives

F_j = Σ_i F_i * W_ij + B_j

where F represents each feature map, W represents the convolution kernel, and * represents the convolution operator.
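As a concrete illustration of the operation above, the following is a minimal sketch of a single-channel 2-D convolution with valid padding and no activation; `conv2d` is a hypothetical helper written for this paper's notation, not the actual implementation (and, as in most deep-learning frameworks, it computes cross-correlation rather than a flipped-kernel convolution).

```python
import numpy as np

def conv2d(X, W, B, stride=1):
    """Naive valid 'convolution' Y = W * X + B for one input map and one kernel.

    X : 2-D input feature map
    W : 2-D convolution kernel
    B : scalar bias
    """
    kh, kw = W.shape
    oh = (X.shape[0] - kh) // stride + 1
    ow = (X.shape[1] - kw) // stride + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the kernel over the input and take the weighted sum.
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            Y[i, j] = np.sum(patch * W) + B
    return Y
```

For a 4 × 4 input and a 2 × 2 kernel with stride 1, the output is a 3 × 3 feature map.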

Fully Connected Layer
The fully connected layer flattens the feature maps produced by the convolutional and pooling layers into a vector and maps it to the output classes: each output neuron is connected to every input neuron, computing y = Wx + b.
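A fully connected layer is conventionally computed as a matrix-vector product plus a bias; the sketch below (the helper name `fully_connected` is ours, for illustration only) flattens a pooled feature map and applies y = Wx + b.

```python
import numpy as np

def fully_connected(x, W, b):
    """y = W x + b, flattening the pooled feature maps into a vector first."""
    return W @ x.ravel() + b
```

Here W has one row per output class and one column per flattened input element.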

Pooling Layer
The pooling layer, also known as the down-sampling layer, down-samples the input feature map and is generally used between successive convolutional layers to reduce the number of parameters [11]. The goal of the pooling layer is to bring a degree of invariance while summarizing local features. As shown in Figure 4, with a pooling kernel of size 2 * 2 and a stride of 2, the feature map after pooling is one quarter of its size before pooling. At the pixel level, a feature present in one region of an image is likely to apply to its neighbouring regions as well [13], so the pooling operation effectively reduces the number of parameters in the network model without reducing model accuracy.
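The two pooling variants compared in this work can be sketched as follows; `pool2d` is an illustrative helper (not the paper's code) using the 2 * 2 kernel and stride 2 described above, so the output has one quarter of the input's elements.

```python
import numpy as np

def pool2d(X, size=2, stride=2, mode="max"):
    """Down-sample a 2-D feature map by max or average pooling."""
    oh = (X.shape[0] - size) // stride + 1
    ow = (X.shape[1] - size) // stride + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = X[i * stride:i * stride + size, j * stride:j * stride + size]
            # Max pooling keeps the strongest response; average pooling
            # keeps the mean response of the window.
            Y[i, j] = patch.max() if mode == "max" else patch.mean()
    return Y
```

On a 4 × 4 map this produces a 2 × 2 map, i.e. the quarter-size reduction stated above.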

Dropout
During the training of a deep convolutional neural network, the number of network parameters grows as the network depth increases. If the model has too many parameters and the training samples are insufficient, the probability of over-fitting in the trained model is very high [13]. The error is very small on the training set but large when test data are provided to the network, a failure known as poor generalization to new data [14]. To solve this problem, Dropout was introduced; Dropout is a method proposed to overcome the over-fitting problem. The output of the Dropout layer is

y = k ∘ f(Wx)

where x = [x1, x2, …, xn]^T is the input of the fully connected layer, W ∈ R^(h×n) is a weight matrix, f is the activation function, ∘ denotes element-wise multiplication, and k is a binary vector of size h whose entries obey a Bernoulli distribution with parameter p.
Dropout works by "dropping" some neurons on each forward pass, which means it randomly sets some neurons to zero. Each unit is dropped with a fixed probability p, independent of the other units; p is usually set to 0.5 [15]. In this way, Dropout randomly samples many different network structures, which helps reduce over-fitting. Figure 5 is a graphical illustration of Dropout.

Statistical Results
This experiment was run on a computer with the Windows 10 system and an i5 CPU. The structure of our CNN network is shown in Figure 6. On this basis, the collected data are tested and analysed by means of the mean-iteration algorithm: a gray value T is obtained through iterative calculation that divides the image into two classes, A and B, satisfying the condition that the mean of the class-A mean and the class-B mean is exactly equal to T. In these experiments, CNN and the other advanced algorithms were each run 10 times, and the highest value, lowest value, and mean value were recorded. The average accuracy was 85.74%, as shown in Table 3, indicating a relatively high overall accuracy.
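The mean-iteration thresholding step described above can be sketched as follows; `iterative_mean_threshold` is an illustrative helper (assuming a non-constant gray-level image) rather than the experiment's actual code.

```python
import numpy as np

def iterative_mean_threshold(img, eps=0.5):
    """Iterate until the threshold T equals the mean of the two class means.

    Class A holds pixels with gray value <= T, class B those > T.
    """
    T = img.mean()  # initial guess: global mean gray value
    while True:
        a = img[img <= T]
        b = img[img > T]
        new_T = 0.5 * (a.mean() + b.mean())
        if abs(new_T - T) < eps:
            return new_T
        T = new_T
```

For a clearly bimodal image the iteration converges quickly to a value midway between the two class means.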
Dropout can alleviate the over-fitting problem during the experiment. The following table analyses the data obtained without using Dropout: over-fitting occurs when the training data are few. With Dropout, however, precision is improved.

Conclusions
On