Hidden Markov Model for recognition of skeletal data-based hand movement gestures

Advances in computing technology continue to provide new methods for human-computer interaction. Hand gestures and hand motion are among the most basic ways for people to communicate with computers. Recently, the release of 3D cameras such as Microsoft Kinect and Leap Motion has provided many useful tools for exploring computer vision and virtual reality based on RGB-Depth images. This paper focuses on improving the approach for automatically detecting, training, and recognizing the state sequences of hand motions. The hand movements of three persons are recorded as the input of a recognition system. These hand movements correspond to five actions: sweeping right to left, sweeping top to bottom, circle motion, square motion, and triangle motion. The skeletal data of the hand joint are collected to build an observation database. The desired features of each hand action are extracted from skeleton video frames using the Principal Component Analysis (PCA) algorithm for training and recognition. A hidden Markov model (HMM) is applied to train the feature data and recognize the various states of hand movements. The experimental results show that the proposed method achieves average accuracies of nearly 95.66% and 91.00% for offline and online recognition, respectively.


Introduction
The development of interactive technology between humans and computer-based machines is creating many advanced applications. Years ago, a person communicated with computers primarily through basic access systems involving the keyboard, mouse, windows, icons, menus, and point-and-click devices. However, these approaches did not establish a natural interface between human and device in space. The development of virtual reality technology has changed human-machine interaction. In games, developers propose new ways to release players from the constraint of using a keyboard and mouse to control characters. Sensors such as Microsoft Kinect and Leap Motion are considered state-of-the-art devices for implementing fully natural interfaces in which gestures, body poses, and hand activities become the controllers. Kinect provides great opportunities not only in the development of game applications but also in academic fields such as automatic surveillance [1], human healthcare applications [2], and human-computer interaction applications [3].
* Corresponding author. Email: bcgiao@sgu.edu.vn
In the field of robotics, especially in self-propelled robot applications, the Kinect camera plays an important role as a visual sensor for capturing input signals quickly and accurately. Kinect-based systems are developed to detect objects using RGB and depth features. For example, an automatic exploration algorithm was used on a mobile robot equipped with a Kinect camera to generate a map automatically [4], and RGB-D features were used to follow a person [5]. In these designs, the depth sensor and RGB camera of the Kinect serve as data acquisition tools for generating a map of an unknown environment or for following a moving target.
For healthcare applications, the fall recognition system for the elderly [6] was used in various projects to detect and recognize the postures, gestures, activities, movements, body parts, gait, and frailty of elderly people. Furthermore, based on the advantages of depth data and skeletal data, a human fall recognition system [7] was developed to detect fall states and alert caregivers. These data are important for analysing and evaluating abnormalities in human health. The results of these studies showed that such systems could help doctors make diagnoses and warn patients about health hazards.
In human-computer interaction systems, Kinect is a useful tool for supporting communication between people and computers via voice, human postures, hand gestures, and object movements. Depth and color images are recorded to trigger a control program, which can launch a program on the computer. A non-touch input system [8] was proposed for tapping keys on a virtual keypad constructed from the depth data of the Kinect sensor in a virtual reality space. In addition, hand gesture recognition systems have been applied to detect the speed of hand motion [9], to recognize Arabic numerals (0-9) [10], and to simulate a mouse, allowing a person to perform the functions of a real mouse on the computer [11].
To build an object recognition model successfully, it is necessary to apply suitable algorithms in training and recognition to ensure the accuracy of the system. The HMM algorithm [13] was applied to build a human activity recognition system from skeletal data. The results showed a high recognition rate of 90%-100%.
This research aims to establish a model to recognize hand movements using a Kinect camera. The obtained dataset consists of five separate state sequences corresponding to five motions performed by the hand. The raw data are preprocessed by scaling and normalization. For gesture recognition, the Principal Component Analysis (PCA) algorithm is employed to extract the characteristic data of each motion. After that, the HMM algorithm is applied for training and classifying each state of hand movement.
The main contributions of the paper are as follows: (1) Skeletal data created by combining RGB and depth images are used for the training and recognition processes, which helps the recognition system achieve high accuracy. An advantage of skeletal data is that they can remove the influence of noise and unexpected factors from the recording environment. (2) The raw data recorded from the Kinect sensors are normalized in order to facilitate feature extraction from the database and to enhance accuracy in the recognition phase.
The paper is organized as follows. Section 2 reviews related works. Section 3 describes data acquisition and materials using the Kinect system. Section 4 then presents the proposed method. After that, Section 5 reports the experimental results. Finally, Section 6 gives conclusions.

Related Works
Three representative research works deal with the recognition of human activities using depth cameras. They are briefly described as follows.
Hai and Kha [14] proposed a human action recognition system (HARS) using a Kinect to collect video frames of human activities. The operation of the HARS consists of the following phases. First, the HARS uses the star skeleton algorithm [15] to extract the desired features of each human skeleton from the video frames. The HARS then maps the skeletal features to observation symbols, which constitute an observation sequence, for the training process. Subsequently, for the learning process, the HARS uses the Baum-Welch algorithm [16] to train, in succession, seven HMMs from the observation sequences, corresponding to seven prespecified human actions. Finally, for the recognition process, the HARS recognizes the current human activity by labelling the observation sequence with the most suitable of the seven trained HMMs. The HMMs are then used to classify human actions in both indoor and outdoor environments. The experimental results demonstrated that the HARS could recognize human actions with an accuracy of over 85% and a fast processing time.
Dubois and Charpillet [16] proposed a recognition system using an RGB-D camera to facilitate real-time analysis of the movement of a person. Furthermore, based on HMMs, the recognition system can detect falls of a person. The operation of the recognition system is described as follows. When the person is walking, the RGB-D camera records depth images of the activity. Using the depth images, the recognition system analyses the trajectory of the person's centre of mass to measure gait parameters. After that, the parameters are fed into the HMMs to produce a frailty evaluation for the person. The experimental results showed that the recognition system could offer daily information about the person's activities. Most importantly, the recognition system can detect falls and quickly alert emergency services. The recognition system is therefore very useful for keeping the elderly safe.
Uddin et al. [12] proposed a method for human activity recognition using the joint angles from a 3D model of a human body. The method estimates the joint angles directly from time-series activity images obtained with a single stereo camera by co-registering a 3D body model to the stereo information. Next, the estimated joint-angle features are mapped into codewords to generate discrete symbols for an HMM of each activity. Using these symbols, the method trains one HMM per activity. All the trained HMMs are then used for activity recognition. The system achieved high accuracy, from 92.8% to 98.1%, with input data of 3D body-joint-angle features.

Data Acquisition and Materials
In this research, a recognition system for hand movements is proposed with five main processes: data acquisition using the Kinect system, normalization of raw data, feature extraction, training, and recognition of hand movement gestures. Figure 1 shows these processes.
In particular, the joint data of the skeleton are produced using the Kinect camera, which contains an RGB camera and a laser-based depth sensor. The parameters obtained for each joint are the values along three axes x, y, z, in which the z coordinate is the depth information. The 3-dimensional skeletal data are therefore valuable information that makes it easier to detect and recognize hand motion gestures. After acquisition, the raw data are scaled and normalized. The PCA algorithm is applied to extract features for training. For the recognition of hand movement gestures, an HMM method is used to train samples and recognize separate gestures. The results demonstrate the accuracy of the proposed method.
The major requirement of the system is to detect and recognize the motion states of a hand. Each state of the hand movements involves a large amount of data, which are analyzed and processed before being used for training. Therefore, raw data acquisition is performed carefully to achieve high performance.

Data Acquisition
For data sampling, three people were invited to record patterns with different hand gestures. The subjects were instructed to perform separate actions matching the hand movement states. Each person performed the five hand gestures in turn, and 20 video samples were collected per gesture. The system collected five state sequences of hand motion for training and recognition: sweeping right to left, sweeping top to bottom, circle motion, square motion, and triangle motion. Figure 2 illustrates the process of obtaining sample data.
In order to minimize the factors that affect the quality and reliability of samples, the installation location of the Kinect system must meet the manufacturer's specifications. Figure 3 shows a suitable position for the Kinect, at a height of 1.2-1.4 m and a distance of 2-3 m from the subject, to ensure that the collected images and 3D skeletal data are of high quality.
The data acquisition program is mainly based on the Kinect for Xbox 360 motion sensor together with the Kinect for Windows SDK version 1.8, used with the C/C++ and MATLAB languages. The program supports 2D color images, depth images, and skeleton tracking. The advantages of the SDK toolkit are as follows: (i) noise is eliminated while the skeletal image is maintained with the characteristic coordinates of the main body parts; (ii) the Kinect can extract the parameters of each bone joint corresponding to the depth image; (iii) the 3D data of a joint are given by three coordinates (x, y, z), which refer to the (horizontal, vertical, depth) position of the joint.

Skeleton Tracking
The parts of a human body in front of the Kinect camera are detected from the depth stream of the Natural User Interface (NUI). The positions of twenty main joints are recorded relative to the origin to form a complete human skeleton. Each position is defined by a 3-dimensional coordinate (x, y, z), which expresses the location of a joint in real space relative to the origin (0, 0, 0) at the sensor. From the user's point of view, the axes are as described in Figure 5: the x-axis is the horizontal line pointing to the right, the y-axis is the vertical line pointing upward, and the z-axis is the line perpendicular to the (x, y) plane, extending from the sensor. The Kinect system tracks color images, depth images, and skeletal data automatically at the same time. Thus, the trajectory of each hand gesture is easily detected, and the location data of the joints on the skeleton are accurately obtained. The skeleton of the human body is built from body-part images, which are derived from specific regions of the depth image distinguished by the density of their feature data. Figure 6 depicts these specific regions.
The purpose of this study is to analyze and recognize the features of hand movement. Therefore, among the 20 joints of the human skeleton, the position of the "hand right" joint is used to build the 3D data. The position in each image frame is defined as the transpose of the row of three values (x, y, z) in the 3D coordinate system as follows.
$$p = [x \;\; y \;\; z]^{T}$$
For one sample video with n frames, indexed i = 1, 2, 3, ..., n, the matrix of this sample is $P = [p_1, p_2, \ldots, p_n]$. The matrix P can be expressed in detail as follows.
$$P = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{bmatrix}$$
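As a minimal sketch of how the per-frame positions could be assembled into the 3 × n sample matrix P, the Python snippet below stacks right-hand coordinates column by column. The function name `build_sample_matrix` and the sample values are illustrative assumptions, not part of the recording program described in this paper.

```python
def build_sample_matrix(frames):
    """Stack per-frame (x, y, z) hand positions into a 3 x n matrix (list of rows)."""
    if not frames:
        raise ValueError("at least one frame is required")
    for frame in frames:
        if len(frame) != 3:
            raise ValueError("each frame must hold exactly (x, y, z)")
    # zip(*frames) transposes n triples into three rows of length n.
    return [list(row) for row in zip(*frames)]


# Hypothetical right-hand positions (in metres) for a three-frame clip.
frames = [(0.10, 0.25, 1.90), (0.14, 0.27, 1.88), (0.19, 0.30, 1.87)]
P = build_sample_matrix(frames)
```

Each column of `P` is one frame and each row is one axis, which matches the 3 × n layout used for the sample matrix.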
In one frame of image I, the depth feature of a pixel q is defined by the function
$$f_{\theta}(I, q) = d_I\left(q + \frac{t_1}{d_I(q)}\right) - d_I\left(q + \frac{t_2}{d_I(q)}\right),$$
where $d_I(q)$ is the depth of pixel q in the image I and θ = (t1, t2) is the pair of offsets t1 and t2.
In this research, data acquisition is an important process for the training and recognition phases to achieve high accuracy. The obtained raw data are processed in the normalization stage prior to feature extraction using the PCA algorithm. Finally, the resulting data are recognized with the HMM method.

Pre-processing raw data
In the data collection step, three persons perform the hand motions for each gesture repeatedly at various times. These actions produce images with different movement trajectories and sizes, so the joint coordinates obtained after recording differ from one trial to another. Such recording circumstances lead to large variances in the characteristics of the 3D coordinates (x, y, z). As a result, the raw database is not reliable enough for training, and recognition would not reach the desired precision. Therefore, the proposed method scales the size of the images and normalizes the obtained coordinate values into a new coordinate system. Figure 7 illustrates the pre-processing of raw data.
The scaling step converts the coordinate of (x, y, z) to a new coordinate (x', y', z'). The new coordinate is determined as follows. where L denotes the dimension of the square output image compressed from the initial rectangle shape image into an L × L square image.
Then, the central point is normalized to the standard size as follows.
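The two pre-processing steps can be sketched as follows. This is only an illustration under stated assumptions: each axis is min-max scaled into the range [0, L], and the trajectory is then shifted so that its centroid lies at the origin. The function names, the default L = 100, and the sample coordinates are hypothetical, and the formulas are not claimed to be the ones used in this paper.

```python
def scale_to_square(points, L=100.0):
    """Min-max scale each axis of a trajectory into [0, L] (assumed formula)."""
    scaled_axes = []
    for values in zip(*points):
        lo, hi = min(values), max(values)
        span = hi - lo
        if span == 0:
            # A constant axis has no extent; map it to the middle of the range.
            scaled_axes.append([L / 2.0] * len(values))
        else:
            scaled_axes.append([(v - lo) / span * L for v in values])
    return [tuple(p) for p in zip(*scaled_axes)]


def center_on_centroid(points):
    """Shift a trajectory so that its centroid sits at the origin."""
    n = len(points)
    centroid = [sum(axis) / n for axis in zip(*points)]
    return [tuple(v - c for v, c in zip(p, centroid)) for p in points]


# Hypothetical raw (x, y, z) positions of the right hand.
raw = [(0.10, 0.20, 1.90), (0.30, 0.60, 1.80), (0.50, 0.40, 1.70)]
scaled = scale_to_square(raw, L=100.0)
normalized = center_on_centroid(scaled)
```

After these two steps, trajectories recorded at different sizes and positions share a common range and origin, which is the property the pre-processing stage aims for.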

Feature Extraction
The PCA algorithm is employed to extract features from the data sequence of each gesture. The joint data of the right-hand motion are projected into a new space of reduced dimension, which represents the data better and preserves the data variability along every dimension of the new space.
In the PCA algorithm, the calculation of eigenvectors and eigenvalues is a very important operation for extracting the features of the dataset precisely. The eigenvectors corresponding to the largest eigenvalues describe the most crucial features characterizing the data sample S. Hence, selecting and retaining the k eigenvectors with the largest eigenvalues creates a high-reliability space. The principal features of the five hand movement gestures are calculated in the following five steps, where $p_i$ denotes the position vector of frame i (the i-th column of the matrix P) and n is the number of frames:
(i) The average value of the row components is calculated as follows.
$$m = \frac{1}{n} \sum_{i=1}^{n} p_i$$
(ii) The Standard Deviation (SD) vector, i.e. the deviation of a frame from the average, is calculated as follows.
$$h_1 = p_1 - m$$
The remaining vectors $h_2, h_3, \ldots, h_n$ are determined in the same way.
(iii) The set of SD vectors is built into a matrix H as follows.
$$H = [h_1, h_2, \ldots, h_n]$$
A covariance matrix is computed to represent the correlation between the vectors in the vector space:
$$C = \frac{1}{n-1} H^{T} H$$
(iv) The eigenvalues λ needed to extract the prominent features of the model are the solutions of $\det(C - \lambda I) = 0$, where I is the identity matrix.
(v) The eigenvectors E are calculated as follows.
$$(C - \lambda I)\,E = 0$$
The main purpose of the process is to extract the prominent features of the five hand motion activities. The data set after the PCA process is a coefficient matrix of size 3300 × 3300 corresponding to the feature vectors k. These feature vectors are then trained using the HMM to recognize the five hand movement gestures separately.
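The five PCA steps can be sketched in NumPy as below. The sketch assumes that the columns of a 3 × n matrix are per-frame (x, y, z) positions and that the decomposition is applied to the frame-by-frame n × n matrix, which is one reading consistent with the 3300 × 3300 size quoted above. That reading, the function name `pca_frames`, and the sample trajectory are assumptions made for illustration. The nonzero eigenvalues of this n × n matrix coincide with those of the more common 3 × 3 feature covariance.

```python
import numpy as np


def pca_frames(P, k):
    """PCA on a 3 x n matrix P whose columns are per-frame (x, y, z) positions.

    Returns the k largest eigenvalues and the matching eigenvectors of the
    n x n matrix C = H^T H / (n - 1), where H holds the mean-centred columns.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[1]
    m = P.mean(axis=1, keepdims=True)      # (i) mean position over all frames
    H = P - m                              # (ii)-(iii) deviation vectors as columns
    C = H.T @ H / (n - 1)                  # n x n covariance-like matrix (assumed form)
    eigvals, eigvecs = np.linalg.eigh(C)   # (iv)-(v) symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    return eigvals[order], eigvecs[:, order]


# Hypothetical six-frame trajectory of the right hand.
P = np.array([
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],   # x
    [0.0, 0.9, 2.1, 2.9, 4.2, 5.1],   # y
    [2.0, 2.0, 2.1, 2.0, 1.9, 2.0],   # z
])
eigvals, eigvecs = pca_frames(P, k=2)
```

`np.linalg.eigh` is used because the matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering before the top k are kept.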

Hidden Markov Model Method
An HMM is a finite set of states linked to each other by transitions, in which each state is characterized by a set of transition probabilities and by conditional (emission) probabilities. These form two stochastic layers: the first is a Markov chain over the states, and the second describes the sequence of observations, while the states themselves remain "hidden" from the observer. The key purpose of HMM training is to improve the accuracy of these probabilities so that the actual state sequence can be inferred from the observation sequence.
There are three main algorithms used with HMMs: the forward algorithm for likelihood computation, the Viterbi algorithm for decoding, and the Baum-Welch algorithm for learning. In this research, the Baum-Welch algorithm is utilized to re-estimate the HMM parameters so as to locally maximize the likelihood of the training data. Each state model is trained for a certain movement to define the probability in given test tasks, and the presence or absence of the gesture under consideration is determined by a probability threshold. In addition, each state model is determined by the conditional probability of the output gestures derived from a hand motion. Figure 9 illustrates the transitional probability of various states in a random process.
Figure 9. The process of HMM transition state [15]
The transitional probability between the states of an HMM with N states S = {s1, s2, …, sN} is calculated as follows.
$$a(i, j) = \Pr(p_{t+1} = s_j \mid p_t = s_i), \quad 1 \le i, j \le N$$
where a(i, j) is the transitional probability from the state si to the state sj and pt is the state at time t.
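To make the role of a(i, j) concrete, the short sketch below multiplies transition probabilities along a state path of a first-order Markov chain. The three-state left-to-right matrix `A`, the initial distribution `pi`, and the function name `path_probability` are hypothetical values chosen for illustration and are not taken from the trained models.

```python
def path_probability(pi, A, path):
    """Probability of a state path under a first-order Markov chain."""
    prob = pi[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= A[i][j]   # a(i, j): probability of moving from state i to state j
    return prob


# Hypothetical 3-state left-to-right chain; each row of A sums to 1.
pi = [1.0, 0.0, 0.0]
A = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
rows_are_stochastic = all(abs(sum(row) - 1.0) < 1e-12 for row in A)
prob = path_probability(pi, A, [0, 0, 1, 2])   # 1.0 * 0.6 * 0.4 * 0.3
```

A left-to-right structure, in which a state can only persist or advance, is a common choice for gestures because a hand trajectory unfolds in a fixed order.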
An HMM is characterized by three parameters as follows.
(i) The initial state probability is π = {π_i}, where $\pi_i = \Pr(p_1 = s_i)$.
(ii) The state-to-state transitional probability is the matrix A = {a(i, j)}, 1 ≤ i, j ≤ N.
(iii) The observation probability is B = {b_i(O_t)}, where $b_i(O_t) = \Pr(O_t \mid p_t = s_i)$.
The HMM is a reliable method for the recognition of specific subjects such as gestures, poses, and speech, and well-developed algorithms are available to enhance its recognition performance. Each HMM is represented by a tuple of parameters λ = (A, B, π). The HMM process comprises two phases, training and recognition. In recognition, based on the observed sequence O and the trained model λ, the phase finds the most likely state sequence V*, which maximizes the joint likelihood Pr(O, V | λ) [18]. In training, the Baum-Welch algorithm is used to optimize the HMM parameters to achieve the highest state-conditional probability Pr(Ot | pt = si). The approach uses forward and backward passes to compute the forward and backward probabilities αt(i) and βt(i), respectively.
At time t, the probability of being in state si and the transition probability from state si to state sj are, respectively, as follows.
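In the standard forward-backward formulation, these two quantities are usually written as shown below. The symbols γ_t(i), ξ_t(i, j), and b_j(O_{t+1}) follow common textbook notation and are assumed here; they are not taken from the text of this paper.

```latex
\gamma_t(i) = \Pr(p_t = s_i \mid O, \lambda)
            = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{k=1}^{N} \alpha_t(k)\,\beta_t(k)}

\xi_t(i,j) = \Pr(p_t = s_i,\; p_{t+1} = s_j \mid O, \lambda)
           = \frac{\alpha_t(i)\, a(i,j)\, b_j(O_{t+1})\, \beta_{t+1}(j)}
                  {\sum_{k=1}^{N} \alpha_t(k)\,\beta_t(k)}
```

The common denominator equals Pr(O | λ), so both quantities are properly normalized probabilities.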
Because the forward and backward probabilities are not recomputed repeatedly during the process, the Baum-Welch algorithm is used to re-estimate the parameters in succession, with the following results.
A new model λ is produced from the results of the re-estimation process, starting from the initial parameters of the model λ, in order to better generate the observation sequences O.
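The re-estimation step is commonly written as follows in the standard Baum-Welch formulation. The notation is assumed rather than taken from this paper: γ_t(i) denotes the probability of being in state s_i at time t given O and λ, ξ_t(i, j) denotes the probability of moving from s_i at time t to s_j at time t + 1 given O and λ, T is the sequence length, and v is an observation symbol.

```latex
\bar{\pi}_i = \gamma_1(i), \qquad
\bar{a}(i,j) = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\bar{b}_j(v) = \frac{\sum_{t=1,\; O_t = v}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
```

Each update is a ratio of expected counts: expected transitions from s_i to s_j over expected visits to s_i, and expected emissions of v in s_j over expected visits to s_j.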
In this research, the HMM is set up with a set of five possible output symbols. In the training process, each observation sequence, e.g. the skeletal features of a person, is clustered and assigned to one of five corresponding HMMs to optimize all parameters of these HMMs. In the recognition process, the obtained sequence is compared against the trained HMMs to find the model with the highest probability, which identifies the state of the hand movement gesture.
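The recognition rule described above, scoring a sequence against every trained model and keeping the best one, can be sketched with the forward algorithm. The two toy models, their parameter values, and the names `forward_likelihood` and `classify` are hypothetical; the system in this paper uses five HMMs and five output symbols, and its trained parameters are not reproduced here.

```python
def forward_likelihood(obs, pi, A, B):
    """Pr(O | lambda) for a discrete HMM, computed with the forward algorithm."""
    n_states = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a(i, j)) * b_j(O_{t+1})
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: sum over the final states
    return sum(alpha)


def classify(obs, models):
    """Return the label of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Two hypothetical 2-state models (pi, A, B) over observation symbols 0, 1, 2.
models = {
    "sweep": ([1.0, 0.0],
              [[0.7, 0.3], [0.0, 1.0]],
              [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "circle": ([1.0, 0.0],
               [[0.5, 0.5], [0.0, 1.0]],
               [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
}
label = classify([0, 0, 1, 1], models)
```

For long sequences the raw product underflows, so a practical implementation would rescale alpha at every step or work with log probabilities; the unscaled form is kept here for readability.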

Experimental Results
In this research, the sample features were extracted from offline databases using the PCA algorithm, and these features were then trained using the HMM method for recognition. The experimental results demonstrate the effectiveness and efficiency of the proposed method. The experiments were conducted with three persons aged 21-50 years, with heights between 166 and 172 cm. Each person performed five hand gestures, "Sweeping right to left", "Sweeping top to bottom", "Circle motion", "Square motion", and "Triangle motion", corresponding to five states of the HMM, and each gesture was performed twenty times. Therefore, there are 300 video samples. From each sample, 8 to 15 feature frames were extracted, depending on the complexity of the gesture. The data obtained for the five gestures amounted to 3,300 image frames for training and recognition. The frames were then converted into a matrix of size 3 × 3,300. Table 1 shows the column data of the matrix. In practice, the hand movement activities were recorded as in Figure 10. In the images of Figure 10, the "hand right" joint, point 12 (see Figure 4), describes the most basic features of hand movement. For this reason, the data obtained at the "hand right" joint are more important than those of the remaining joints for analysing the different features of each hand gesture.
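The sample counts quoted above can be checked with a short calculation. The average of 11 frames per sample is derived here and is not stated in the text.

```latex
N_{\text{samples}} = 3 \times 5 \times 20 = 300, \qquad
\frac{3300 \ \text{frames}}{300 \ \text{samples}} = 11 \ \text{frames per sample}, \qquad 8 \le 11 \le 15
```

The derived average falls inside the reported range of 8 to 15 feature frames per sample.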

Gestures recognition using the HMM method
The proposed method using the HMM was implemented on 3,300 image frames of 300 video samples for training and 60 testing videos per hand gesture for recognition.

Online hand movement recognition
Another experiment in this research is online recognition. The recognition program was built with the thresholds obtained from the HMM method after the training process. Each gesture was tracked in a real-time setting and recognized immediately. Each gesture was tested with 60 samples. Table 3 shows the accuracy of this experiment. The statistics in the table show that the overall accuracy was about 91%. The highest result was 95%, equal to 57/60 correctly recognized samples, for the circle motion. The lowest result was 88.3%, corresponding to 53/60 correctly recognized samples, for the square motion. The remaining results were 90% for sweeping right to left, 91.7% for sweeping top to bottom, and 90% for triangle motion.
In this paper, the Kinect camera system was used to obtain hand motion data of three persons for the hand movement recognition system. The skeletal data of five gestures per person were collected as sample data from which prominent features were extracted. The HMM method was then used to train the feature data and to recognize hand motion gestures, with an average accuracy of 95.66% offline and 91.00% online. These results indicate that the proposed method can be applied effectively and efficiently to hand gesture recognition systems.
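The per-gesture online figures reported in this section can be cross-checked against the overall accuracy with a few lines of Python. Only the counts 57/60 and 53/60 are stated in the text; the other counts are inferred here from the reported percentages (54/60 = 90% and 55/60 ≈ 91.7%) and should be read as assumptions.

```python
TRIALS = 60  # online test samples per gesture

# Correct recognitions per gesture; 57 and 53 are stated in the text,
# the remaining counts are inferred from the reported percentages.
correct = {
    "sweeping right to left": 54,
    "sweeping top to bottom": 55,
    "circle motion": 57,
    "square motion": 53,
    "triangle motion": 54,
}

per_gesture = {name: 100.0 * hits / TRIALS for name, hits in correct.items()}
overall = 100.0 * sum(correct.values()) / (TRIALS * len(correct))
```

Under these inferred counts, 273 of 300 online samples are recognized correctly, which agrees with the reported average of about 91%.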

Conclusions
In this research, the images of three persons performing five gestures, obtained from a Kinect system, were converted into skeletal data. Crucial features were extracted from these data using the PCA method. For classification, an HMM method was employed to train on the 3D data of the right hand and then to recognize hand movement gestures. The experimental results showed that the average accuracies of offline and online recognition were 95.66% and 91.00%, respectively. The results demonstrate that the proposed method offers high-quality recognition.
For further applications, the hand gesture recognition system can be used to improve touch-free interfaces for controlling programs on the computer, such as virtual reality e-games, the Windows cursor, and document processors in the operating room...