Chinese fingerspelling recognition via gray-level co-occurrence matrix and fuzzy support vector machine

INTRODUCTION: Deaf and speech-impaired people in China communicate in their native language, Chinese sign language, which comprises gesture language and finger language. Chinese finger language conveys information through various finger movements, and its expression is accurate and convenient for classification and recognition. OBJECTIVES: In this paper, we propose a new model that uses the gray-level co-occurrence matrix (GLCM) and fuzzy support vector machine (FSVM) to improve sign language recognition accuracy. METHODS: Firstly, we acquired the sign language images directly with a digital camera or selected key frames from video as the data set, and segmented the hand shapes from the background. Secondly, we resized each image to N×N and converted it to a gray-level image. Thirdly, we reduced the dimension of the intensity values using Principal Component Analysis (PCA) and obtained the data features by creating the gray-level co-occurrence matrix. Finally, we sent the extracted, dimension-reduced features to the Fuzzy Support Vector Machine (FSVM) for the classification tests. RESULTS: We compared our method with similar algorithms, and the results show that it achieves the highest classification accuracy, up to 86.7%. CONCLUSION: The experimental results show that our model performs well in Chinese finger language recognition and has potential for further research.


Introduction
Communication exists in all aspects of our lives and plays a crucial role in them. However, some people cannot communicate with others through speech: people who are deaf or speech-impaired. The results of the second sample survey of disabled people in China in 2020 show that there are 27.8 million people with hearing impairment in China, accounting for 24.26% of the total, ranking first among the six categories of disability [1].
People with hearing impairment often cannot express their needs as readily as hearing people when they encounter difficulties, so communication becomes an obstacle when they defend their rights and interests. In real life, speech- and hearing-impaired people can communicate only with people who understand their language.
Thus, a translation device is needed to help speech- and hearing-impaired people communicate with hearing people, and the key requirement for such a device is accurate sign language recognition.
There are three common approaches to sign language recognition (SLR): vision-based SLR [2], sensor-based SLR [3,4] and hybrid SLR [5]. Hand shape, direction, position and movement trajectory are the common features in sign language recognition and are the focus of most research and experiments. To recognize sentence-level American Sign Language (ASL) continuously, Thad Starner et al. proposed two real-time hidden Markov model-based systems that used a single camera to track the user's unadorned hands [6]. Kishore et al. [7] proposed a 4-camera model to recognize gestures of Indian sign language. Building on a flexible sensor that combines an ARM9 processor with a 9-axis IMU, Li Lei et al. [8] introduced new data gloves and a sign language recognition system. Lionel Pigou et al. [9] considered using the Microsoft Kinect for recognition.
Moreover, based on Kinect depth data and skeleton joint data, Lubo Geng et al. [10] used a combined position and spherical-coordinate feature representation to construct feature vectors. Chophuk P et al. [11] used a compact and affordable 3D motion sensor, arguing that the palm-sized Leap Motion sensor has a higher overall rating than other devices used in existing research, such as the CyberGlove and Microsoft Kinect.
There are also many state-of-the-art recognition methods, including Wavelet Entropy (WE). Because texture makes full use of image information, it is an important basis for describing and recognizing images, both in theory and from common sense. Compared with other image features, it better captures both the macroscopic properties and the fine structure of images. Texture has therefore become an important feature to extract for target recognition. At the same time, SVM is a machine learning method with good robustness and generalization ability.
Consequently, in this paper, we chose to combine fuzzy SVM with GLCM for image classification and recognition.
We are committed to contributing to automatic sign language recognition. In this paper, we focus on Chinese finger language recognition. Although finger language has only 30 letter classes (26 basic pinyin letters and 4 retroflex letters), it is significant. Because different regions use different variants of Chinese sign language, the same meaning is often expressed in different, easily confused ways.
Even though China has published two versions of Chinese sign language and standard sign language, they are not widely used in practice among deaf people. Thus, as the only deterministic form of expression, finger language has a clear advantage. We therefore propose a model based on the gray-level co-occurrence matrix and fuzzy support vector machine (GLCM-FSVM) to improve the accuracy of Chinese finger language recognition. In our method, features of Chinese sign language images are extracted using the gray-level co-occurrence matrix (GLCM). GLCM describes texture through the gray-level relationship between two pixels in space, which reduces the difficulty of image recognition and has potential for further research.
The rest of this article is arranged as follows. Section 2 covers the dataset and the experimental methods, and describes in detail the GLCM feature extraction method, the FSVM classification method and the PCA dimension reduction method. Section 3 contains the experimental process and results. Section 4 discusses the pros and cons of the proposed method. Section 5 concludes the paper.

Method
We divided our method into the following four steps, as shown in Figure 1.
Step 1. Data set construction: The input database consists of 720 color images of 30 isolated Chinese finger-language letters, each of size length × width × color (length = 1080 px, width = 1080 px, color channels = 3 (RGB)).
Step 2. Normalization processing: Separate the hand gesture from the original sign language picture and set the background color to zero. Resize each image to N × N (N = 256). Keep the 3 color channels.
Step 3. Feature extraction and dimension reduction: We set all the images to the same size and background to ensure the validity of the experimental results, and then converted the color images to gray-level images using Matlab R2018a. The pseudocode is given in Table 1.
Two finger-language images are shown as examples: one is the letter K and the other is the retroflex letter ZH (see Figure 3).
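The resizing and gray-level conversion in Steps 2 and 3 can be illustrated with a minimal pure-Python sketch. It is not the authors' Matlab code: the function names are hypothetical, nearest-neighbor resizing is an assumption (the paper does not state the interpolation method), and the luminance weights are the common 0.2989/0.5870/0.1140 values. Hand segmentation is not shown.

```python
def rgb_to_gray(pixel):
    """Weighted luminance of one (R, G, B) pixel, rounded to an integer in 0..255."""
    r, g, b = pixel
    return int(round(0.2989 * r + 0.5870 * g + 0.1140 * b))


def resize_nearest(image, n):
    """Resize a 2-D list of pixels to n x n with nearest-neighbor sampling (assumed)."""
    rows, cols = len(image), len(image[0])
    return [[image[(y * rows) // n][(x * cols) // n] for x in range(n)]
            for y in range(n)]


def preprocess(rgb_image, n=256):
    """Resize to n x n, then convert every pixel to a gray level."""
    small = resize_nearest(rgb_image, n)
    return [[rgb_to_gray(p) for p in row] for row in small]


# Tiny 2 x 2 example: black background, one white pixel and one red pixel.
demo = [[(0, 0, 0), (255, 255, 255)],
        [(255, 0, 0), (0, 0, 0)]]
gray = preprocess(demo, n=4)
```

With N = 256 the same call would produce the 256 × 256 gray-level image that the later steps operate on.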
Chinese fingerspelling recognition via gray-level co-occurrence matrix and fuzzy support vector machine. EAI Endorsed Transactions on e-Learning, Online First.
The gray-level co-occurrence matrix (GLCM) captures the gray-level relationship between pixels in an image with texture information. Any two pixels in the image space have a certain spatial gray-level relationship, because texture is formed by gray levels recurring at different spatial positions. The gray-level co-occurrence matrix is a popular method for describing texture by studying the spatial correlation of gray levels [12,13].
The joint probability density of the pixels at two locations defines the gray-level co-occurrence matrix. Let $f(x, y)$ be a two-dimensional digital image of size $M \times N$ with $N_g$ gray levels. The gray-level co-occurrence matrix for a given spatial relationship can then be described as follows:

$$P(i, j) = \#\{(x_1, y_1), (x_2, y_2) \in M \times N \mid f(x_1, y_1) = i,\ f(x_2, y_2) = j\} \tag{1}$$

where $\#\{\cdot\}$ denotes the number of elements in the set and $P$ is an $N_g \times N_g$ matrix. If the distance between $(x_1, y_1)$ and $(x_2, y_2)$ is $d$ and $\theta$ is the angle between them, then the matrices $P(i, j, d, \theta)$ for various spacings and angles can be obtained.
Since the distribution of pixels may differ from one direction to another, it is advisable to compute the gray-level co-occurrence matrix over the 8-connected neighborhood, that is, with $d = 1$ and $\theta = 0°, 45°, 90°$ and $135°$.
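The counting behind the co-occurrence matrix can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the authors' implementation: the `glcm` function and `OFFSETS` table are hypothetical names, the matrix is not symmetrized, and the image is assumed to be already quantized to integer gray levels in `0..levels-1`.

```python
# Pixel offsets (row step, column step) for d = 1 at 0, 45, 90 and 135 degrees.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, offset):
    """Count how often gray level i has gray level j at the given offset."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Pairs whose second pixel falls outside the image are skipped.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


# 4 x 4 toy image with gray levels 0..3 (already quantized).
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
glcms = {angle: glcm(toy, 4, off) for angle, off in OFFSETS.items()}
```

For the toy image, the 0° matrix counts 12 horizontal pairs in total (4 rows with 3 neighboring pairs each), with most of the mass on the diagonal because neighboring pixels tend to share a gray level.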
The co-occurrence matrix contains the texture information of the image. The more slowly the texture of an image changes, the larger the values on the diagonal of the gray-level co-occurrence matrix, and vice versa.
To describe the texture condition with co-occurrence matrix more intuitively, the following characteristic quantities are usually used.
Let $i$ and $j$ be the gray values of the pixels at $(x_1, y_1)$ and $(x_2, y_2)$, let $P(i, j)$ be the number of occurrences of the pair $(i, j)$, and let $F$ be the sum of all values in the GLCM. Then it is defined:

$$p(i, j) = \frac{P(i, j)}{F} \tag{2}$$

In formula (2), $p(i, j)$ is the probability of the pair $(i, j)$ appearing in the image.
Based on formula (2), we can go further: as the spatial distribution of pixels becomes more uneven, the texture gets deeper, the contrast gets larger, and the visual effect gets clearer. Formula (3) expresses this relationship:

$$f_1 = \sum_i \sum_j (i - j)^2 \, p(i, j) \tag{3}$$

where $f_1$ is the contrast, which expresses the difference in brightness between a pixel and its neighboring pixels. The deeper the texture grooves, the greater the contrast and the clearer the visual effect. Formulas (4) and (5) describe, respectively, the lack of variation between different regions of the image and the evenness of the gray-level distribution and texture coarseness:

$$f_2 = \sum_i \sum_j \frac{p(i, j)}{1 + (i - j)^2} \tag{4}$$

$$f_3 = -\sum_i \sum_j p(i, j) \log p(i, j) \tag{5}$$

In formula (4), $f_2$ is the inverse difference moment, which reflects the homogeneity of the image texture and measures its local change. The larger the value, the smaller the change and the more uniform the local area. In formula (5), $f_3$ is the entropy, which represents the amount of information in the image: the greater the value, the more dispersed the elements of the co-occurrence matrix and the more uniform the distribution of the values in the GLCM.
The similarity of the elements of the gray-level co-occurrence matrix along rows, columns, or specific angular directions can be calculated by formula (6), in which $f_4$ is the correlation, reflecting the local gray-level correlation in the image:

$$f_4 = \frac{\sum_i \sum_j i \, j \, p(i, j) - \mu_x \mu_y}{\sigma_x \sigma_y} \tag{6}$$

When the matrix element values are uniformly equal, the correlation is large; conversely, when they differ greatly, the correlation is small.
Among them:

$$\mu_x = \sum_i i \sum_j p(i, j), \qquad \mu_y = \sum_j j \sum_i p(i, j)$$

$$\sigma_x^2 = \sum_i (i - \mu_x)^2 \sum_j p(i, j), \qquad \sigma_y^2 = \sum_j (j - \mu_y)^2 \sum_i p(i, j)$$

Finally, the above features are combined into a vector.
For example, when the offset $(a, b)$ takes four values, indexed by $k = 1, \ldots, 4$, the combined vector is

$$\left[ f_1^{(1)}, f_2^{(1)}, f_3^{(1)}, f_4^{(1)}, \ldots, f_1^{(4)}, f_2^{(4)}, f_3^{(4)}, f_4^{(4)} \right]$$

This vector can be regarded as a description of the image texture, which can be further used for classification, recognition, retrieval and so on.
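To make the four texture descriptors concrete, the following pure-Python sketch computes contrast, inverse difference moment, entropy and correlation from a co-occurrence count matrix, using the standard textbook definitions of these quantities. The function name `texture_features` is hypothetical, the natural logarithm is an assumption (the paper does not state the base), and returning 0 for the correlation of a constant image is a convention chosen here, not something the paper specifies.

```python
import math


def texture_features(counts):
    """Return (contrast, inverse difference moment, entropy, correlation)."""
    total = sum(sum(row) for row in counts)
    n = len(counts)
    # Normalize counts into probabilities p(i, j).
    p = [[counts[i][j] / total for j in range(n)] for i in range(n)]
    cells = [(i, j) for i in range(n) for j in range(n)]

    contrast = sum((i - j) ** 2 * p[i][j] for i, j in cells)
    idm = sum(p[i][j] / (1 + (i - j) ** 2) for i, j in cells)
    entropy = -sum(p[i][j] * math.log(p[i][j]) for i, j in cells if p[i][j] > 0)

    # Marginal means and standard deviations used by the correlation.
    mu_x = sum(i * p[i][j] for i, j in cells)
    mu_y = sum(j * p[i][j] for i, j in cells)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * p[i][j] for i, j in cells))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * p[i][j] for i, j in cells))
    if sd_x == 0 or sd_y == 0:
        correlation = 0.0  # assumption: constant image, correlation set to 0
    else:
        cov = sum(i * j * p[i][j] for i, j in cells) - mu_x * mu_y
        correlation = cov / (sd_x * sd_y)
    return contrast, idm, entropy, correlation


# Hypothetical 2-level co-occurrence counts: mostly same-level neighbors.
example = [[4, 1],
           [1, 4]]
features = texture_features(example)
```

Calling `texture_features` once per offset and concatenating the four results gives one possible form of the combined texture vector.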
We used the graycomatrix function in Matlab R2018a to create the GLCM.

Fuzzy Support Vector Machine
There are two common methods of data classification: supervised classification and unsupervised classification.
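Let $x_+$ and $x_-$ denote the mean of class +1 and the mean of class −1, respectively. The radius of each class is

$$r_+ = \max_{\{x_i : y_i = +1\}} \lVert x_+ - x_i \rVert \tag{12}$$

$$r_- = \max_{\{x_i : y_i = -1\}} \lVert x_- - x_i \rVert \tag{13}$$

In formula (12), $r_+$ denotes the radius of class +1, and in formula (13), $r_-$ denotes the radius of class −1. We define the fuzzy membership $s_i$ as a function of the radius and the mean of each class:

$$s_i = \begin{cases} 1 - \dfrac{\lVert x_+ - x_i \rVert}{r_+ + \delta}, & y_i = +1 \\[2ex] 1 - \dfrac{\lVert x_- - x_i \rVert}{r_- + \delta}, & y_i = -1 \end{cases}$$

where $\delta > 0$ is used to guarantee $s_i > 0$.
Fuzzy support vector machine (FSVM) can predict or classify real data more effectively than the standard support vector machine (SVM) because it no longer treats all training points as equally important [20].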
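The distance-based fuzzy membership can be sketched for a single class as follows. This is a minimal illustration under stated assumptions: the names `class_mean` and `fuzzy_memberships` are hypothetical, Euclidean distance is assumed, and the value of `delta` is arbitrary. It requires Python 3.8 or later for `math.dist`.

```python
import math


def class_mean(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def fuzzy_memberships(points, delta=1e-3):
    """Membership 1 - ||mean - x_i|| / (radius + delta) for the points of one class."""
    mean = class_mean(points)
    dists = [math.dist(mean, x) for x in points]
    radius = max(dists)  # distance of the farthest training point from the class mean
    return [1 - d / (radius + delta) for d in dists]


# Hypothetical class +1 with one outlier at (10, 10).
positive = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
weights = fuzzy_memberships(positive)
```

The outlier receives a membership close to zero, so it would contribute far less to the training penalty than the points near the class mean, which is the intended effect of the weighting.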

Principal component analysis
Based on this theory, we used PCA to reduce the number of intensity levels from 256 to 8. In this way, the vector space of each image was reduced to 8 × 8 × 1 = 64 dimensions, and a corresponding gray-level co-occurrence matrix was generated. Finally, the gray-level co-occurrence matrices of all images were collected.
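The arithmetic behind the 64-dimensional representation can be illustrated with a small sketch. Note the assumption: the code below uses plain uniform binning of gray values into 8 levels, which is a stand-in chosen for illustration and not the PCA step described in the text. The helper names `quantize` and `flatten` are hypothetical.

```python
def quantize(gray_image, levels=8, max_value=255):
    """Map gray values 0..max_value onto integer bins 0..levels-1 (uniform bins)."""
    return [[min(v * levels // (max_value + 1), levels - 1) for v in row]
            for row in gray_image]


def flatten(matrix):
    """Turn an L x L co-occurrence matrix into a feature vector of length L * L."""
    return [v for row in matrix for v in row]


sample = [[0, 31, 32, 255],
          [64, 100, 200, 128]]
bins = quantize(sample)  # every value now lies in 0..7

# An 8-level image yields an 8 x 8 co-occurrence matrix, i.e. 64 features.
empty_glcm = [[0] * 8 for _ in range(8)]
vector_length = len(flatten(empty_glcm))
```

With 8 levels, each bin covers 32 consecutive gray values, and the co-occurrence matrix of the quantized image has 8 × 8 = 64 entries regardless of the image size.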

Implementation
To convert the color images into gray-level images, we used Matlab. On this basis, PCA was used to reduce the image sizes by reducing the number of intensity levels from 256 to 8; at the same time, the gray-level matrix was produced.
Finally, we collected the gray-level matrices of all images to obtain the feature matrix. The gray-level co-occurrence matrix generated with default settings could not deliver high-precision recognition, so we tuned some parameter settings in the advanced SVM options to achieve the current accuracy.

Statistical Results of Proposed Method
Using 10-fold cross-validation, we compared the performance of several SVMs with default parameters. Table 2 shows that the accuracies of Quadratic SVM and Fuzzy SVM (FSVM) exceed 80%, and that the GLCM+FSVM method performs comparatively well.
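The 10-fold protocol can be sketched as an index split over the 720 images. This is a generic illustration, not the authors' Matlab workflow: the function name `k_fold_indices` and the fixed seed are assumptions, and the split is not stratified by letter class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(720, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each image is used for testing exactly once and for training in the other nine rounds; the reported accuracy would then be the average over the ten test folds.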
To further improve the recognition accuracy, we adjusted some parameters of the experimental environment. When we set the Multiclass method to 'One-vs-All', the accuracy of each method improved. On this basis, there was still room for the recognition accuracy to improve (see Table 3). After adjusting the parameters, the best accuracy obtained with FSVM classification was 86.7%. The accuracy varied with the manual kernel scale value (as shown in Figure 4); however, under the same manual kernel scale, different box constraint levels produced no change in accuracy (as shown in Figure 5). The experiment found that the optimal box constraint level was 5, the optimal manual kernel scale was 6, and the one-vs-one method was selected for multi-class classification.
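Because an SVM is a binary classifier, the 30 letter classes have to be decomposed into binary problems. The sketch below only counts the binary problems that the two multi-class strategies named above would create; the function names are hypothetical and no classifier is trained.

```python
from itertools import combinations


def one_vs_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


letters = list(range(30))  # 30 finger-letter classes
n_ova = len(one_vs_all_tasks(letters))
n_ovo = len(one_vs_one_tasks(letters))
```

One-vs-all needs 30 binary classifiers, each trained on all images, while one-vs-one needs 30 × 29 / 2 = 435 classifiers, each trained on the images of two letters only.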

Comparison to State-of-the-art Approaches
In this paper, we compare four classifiers. The first is SVM, which has also been proposed for pre-training on illegal images in order to directly verify and block illegal sites. As a gesture classifier, it performs well and has strong generalization ability. The last is Decision Trees. This method builds a tree-like structure by repeatedly splitting the data set according to a separation criterion [23][24][25].
The greedy construction process is one of the main disadvantages of Decision Trees: at each step, the method always selects the single best variable and the best split point, whereas considering combinations of variables several steps ahead may give different or better results [26]. The comparison is summarized in Table 4.
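The greedy step criticized above can be illustrated with a short sketch that searches for the single best (feature, threshold) pair at one node. Gini impurity is used as the split criterion here purely as an assumption, since the text does not name the criterion; the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Greedy search: the single (feature, threshold) with the lowest weighted Gini."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


# Hypothetical data: feature 1 separates the two classes, feature 0 does not.
rows = [(5, 1), (1, 2), (4, 8), (2, 9)]
labels = ["A", "A", "B", "B"]
score, feature, threshold = best_split(rows, labels)
```

The search looks only one split ahead, so it never evaluates combinations of variables over several steps, which is exactly the limitation noted in the text.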

Discussion
Sign language is a unique way for deaf and speech-impaired people to communicate, and it is an important means for them to stay in contact with the outside world. Therefore, the study of sign language recognition is meaningful. In this paper, we computed the gray levels of each image to obtain the gray-level co-occurrence matrix, and then derived several characteristic values of the matrix, each representing a texture feature of the image. In feature extraction, the efficiency of data processing should be improved as much as possible; for example, dimension reduction is adopted in this paper.
When using the SVM, the selection of the fuzzy membership values and some weights is critical, and sufficient data samples are required. Next, we will consider introducing different kernel functions, such as the Gaussian function and the polynomial function, into the SVM to further improve the recognition accuracy.
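For reference, the two kernel families mentioned here can be written down in a few lines. This is a sketch using common textbook forms; the parameter names `scale`, `degree` and `coef0` and the example vectors are illustrative assumptions, and other parameterizations of the Gaussian kernel exist.

```python
import math


def gaussian_kernel(x, z, scale=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / scale^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / scale ** 2)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


u, v = (1.0, 2.0), (2.0, 0.0)
k_rbf = gaussian_kernel(u, v, scale=6.0)    # illustrative scale value
k_quad = polynomial_kernel(u, v, degree=2)  # degree 2 gives a quadratic kernel
```

A larger `scale` makes the Gaussian kernel decay more slowly with distance, which smooths the decision boundary; the polynomial degree plays a comparable role in controlling model flexibility.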

Conclusion
This paper introduces the difference between finger language and sign language in Chinese sign language, highlighting the unique certainty and importance of finger language in research and application. In our research, we proposed a new model to recognize Chinese finger language using the gray-level co-occurrence matrix (GLCM) and fuzzy Gaussian support vector machine (FSVM), and it achieved higher recognition accuracy than other existing approaches.
Our future work will focus on the following aspects: (i) automatically segmenting key areas in sign language images using computer programs; (ii) further simplifying the calculation by converting the gray-level matrix into a sparse matrix; (iii) testing other advanced classifiers such as the extreme learning machine, kernel SVM [27], and convolutional neural network; (iv) testing other advanced feature extraction methods such as Motion Boundary Histogram (MBH) [28] and Scale Invariant Feature Transform (SIFT) [29,30]; (v) applying our method to the development of a sign language translator.