Spatio-temporal weight Tai Chi motion feature extraction based on deep network cross-layer feature fusion

Tai Chi is a valuable exercise for human health, and research on Tai Chi helps improve people's exercise level. Traditional Tai Chi motion feature extraction suffers from low efficiency. Therefore, we propose a spatio-temporal weight Tai Chi motion feature extraction method based on deep network cross-layer feature fusion. From the selected motion spatio-temporal sample, the corresponding spatio-temporal motion key frame is extracted and output as a static image. The initial motion image is preprocessed by moving object detection and image enhancement. A traditional convolutional neural network extracts features from shallow to deep layers and builds a classifier for image classification, which tends to ignore shallow features. Based on the AlexNet network, a CL-AlexNet network is therefore proposed. Batch normalization (BN) is used for data normalization, a cross-connection structure is introduced and analyzed for sensitivity, and an Inception module is embedded for multi-scale depth feature extraction, so that deep and shallow features are fused. A spatio-temporal weight adaptive interpolation method is used to reduce the error of edge detection. From the edge features and the motion spatio-temporal features, the method extracts motion features and outputs the extraction results. Compared with state-of-the-art feature extraction algorithms, the experimental results show that the proposed algorithm extracts more effective features, with a recognition rate exceeding 90%. It can serve as guidance and evidence for Tai Chi training.


Introduction
As the essence of Chinese martial arts, Tai Chi is a national intangible cultural heritage. Studies have shown that Tai Chi can not only help people reduce blood pressure [1], enhance the functional level of the immune system, relieve physical stress and improve sleep quality [2], but also enhance muscle strength, improve flexibility and prevent falls [3,4]. As a result, it attracts more and more practitioners.
Gesture and motion recognition has always been a hot research topic in the field of computer vision, with important academic value in many fields such as video surveillance, motion analysis, sports events and medical diagnosis [5][6][7]. Recognizing human posture and motion requires appropriate algorithms and techniques. Methods based on spatio-temporal feature extraction and on motion trajectory analysis are currently the most frequently used approaches to posture and motion recognition. In order to improve the recognition accuracy of human motion posture behavior (Tai Chi), this paper optimizes the spatio-temporal feature extraction method of motion.
Spatio-temporal weight feature extraction combines computer vision and image processing technology. Computer vision technology is used to extract the relevant information of human spatio-temporal posture motion and to determine whether each point of a motion image belongs to a feature of the image [8]. By dividing the points in the image into different subsets to form continuous curves or regions, the feature extraction results of human posture motion can be obtained. In practice, human motion recognition results can be obtained by comparing the extracted features with the information in a standard database. Traditional spatio-temporal weight posture motion feature extraction algorithms include the regular grid [9], image content analysis [10] and Mel-frequency cepstral coefficients (MFCC) [11]. Because human posture movements change rapidly and behaviors are diverse, implementing a feature extraction algorithm is very difficult. In addition, objective factors such as lighting and viewing angle also affect the accuracy of the spatio-temporal weight motion feature extraction results.
In order to solve the above problems, the idea of deep network cross-layer feature fusion is introduced on the basis of the traditional motion feature extraction algorithm. The main contributions are as follows: 1) On the basis of the traditional extraction steps, an improved AlexNet is introduced to collect Tai Chi motion images, and the operating process of the AlexNet algorithm is followed to detect motion feature objects, improving the integrity and accuracy of the motion feature extraction results. 2) The collected motion images are processed, the threshold values and features are calculated from the data, and the matching blocks to be used are divided. 3) Motion feature fusion is completed according to weighted matching, realizing the spatio-temporal weight Tai Chi motion feature extraction algorithm, which indirectly improves the recognition accuracy of human spatio-temporal weight Tai Chi motion. 4) The weight of the Tai Chi motion is calculated to obtain the fusion results of multi-scale motion feature extraction. The structure of this paper is as follows. In section 2, we introduce the proposed Tai Chi motion feature extraction in detail. The experiments and analysis are presented in section 3. Section 4 concludes the paper.

Extracting the spatio-temporal motion key frame
Before extracting the spatio-temporal motion key frame, the corresponding spatio-temporal sample needs to be selected. Monitoring equipment is installed in Tai Chi sports venues, and the recorded video files serve as the selected spatio-temporal samples. The horizontal spatio-temporal slices of each shot are extracted from the selected motion video samples [12], and the spatio-temporal slices of the video are clustered. After clustering, frames of the motion video files that are discontinuous in time may nevertheless be clustered together. After spatio-temporal slicing and clustering, the motion samples are grouped into the corresponding sub-shots, so a key frame can be extracted as the object image with motion features according to preset rules. The constraints that the extracted spatio-temporal Tai Chi motion key frames need to satisfy are given in constraint formula (1), where s represents the range that can be selected by the motion video block v in the spatio-temporal sequence, and k represents the number of objects containing video frames in the current video sequence. Constraint formula (1) ensures the integrity of the extracted spatio-temporal Tai Chi motion key frame content.
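The slicing, clustering and key-frame selection steps above can be sketched in a few lines. The sketch below is illustrative rather than the paper's implementation: frames are summarized by grey-level histograms, greedily grouped by histogram distance (so temporally discontinuous frames can land in the same cluster, as the paper notes), and the middle frame of each cluster is taken as its key frame. The function names and the distance threshold are assumptions.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Summarize a greyscale frame (2-D array) by its normalized histogram."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def cluster_frames(frames, threshold=0.5):
    """Greedily group frames into sub-shots by histogram similarity.

    A frame joins the nearest existing cluster if its histogram distance
    to that cluster's running-mean centre is below `threshold`;
    otherwise it starts a new cluster.
    """
    clusters = []   # list of lists of frame indices
    centres = []    # running mean histogram per cluster
    for idx, frame in enumerate(frames):
        h = frame_histogram(frame)
        dists = [np.abs(h - c).sum() for c in centres]
        if dists and min(dists) < threshold:
            k = int(np.argmin(dists))
            clusters[k].append(idx)
            n = len(clusters[k])
            centres[k] = (centres[k] * (n - 1) + h) / n
        else:
            clusters.append([idx])
            centres.append(h)
    return clusters

def key_frames(clusters):
    """Pick the middle frame of each cluster as its key frame."""
    return [c[len(c) // 2] for c in clusters]
```

A usage example: three frames, two dark and one bright, yield two clusters and one key frame per cluster.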
The extracted spatio-temporal motion key frame is output in the form of a static image. Through moving object detection, image enhancement, morphological processing, image normalization and other steps, the Tai Chi motion image preprocessing results are obtained.

Moving object detection
The key problem of spatio-temporal weight posture motion feature extraction is to detect a high-quality image of the moving human object. This process involves background subtraction, image extraction and other techniques [13,14]. The specific moving object detection process is shown in figure 1.
Substituting the background model of equation (2) into equation (3) yields the foreground moving image of the video image. Finally, the foreground image is cropped to obtain the moving object [15]. Moving object detection and processing serve as the foundation for extracting the spatio-temporal weight motion features.
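Since equations (2) and (3) are not reproduced here, the sketch below shows one common instantiation of this pipeline, not necessarily the paper's exact model: a running-average background (a plausible form of eq. (2)), a thresholded absolute difference as the foreground test (a plausible form of eq. (3)), and a bounding-box crop of the detected object. The parameter values are assumptions.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background model: slowly blend each new frame in."""
    return (1 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25):
    """Flag pixels that deviate strongly from the background as moving."""
    return np.abs(frame - background) > threshold

def crop_to_object(frame, mask):
    """Crop the frame to the bounding box of the detected foreground."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return frame  # no motion detected; return the frame unchanged
    return frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

For a static background and a small bright moving patch, the mask isolates the patch and the crop returns just its bounding box.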

Motion image enhancement processing
Enhancement of moving images mainly includes the following two steps: Step 1: Take the foreground image in the collected image as the processing object for image denoising. This step prevents noise in the image from degrading the sharpness of the moving image [16].
Step 2: Use a filter to enhance the moving image. Assuming that the noise-reduction filter of the moving image is h(x, y), convolving it with the noisy image yields the image after noise elimination; the noise elimination process can be described by equation (4). A Gabor filter is chosen to sharpen and enhance the image because it attains the best joint resolution in both the spatial and the frequency domain.
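As a concrete reference for the filtering step, the sketch below builds a real-valued Gabor kernel directly from its textbook definition (a Gaussian envelope times a cosine carrier) and applies it with a naive 2-D convolution. The parameter values (size, sigma, theta, lambd, gamma, psi) are generic defaults, not values from the paper.

```python
import numpy as np

def gabor_kernel(size=9, sigma=2.0, theta=0.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: Gaussian envelope times cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates by theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier

def filter2d(image, kernel):
    """Naive 'same'-size 2-D filtering with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out
```

In practice a library routine (e.g. an optimized convolution) would replace `filter2d`; the loop version is kept here only to make the operation explicit.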

Image normalization
In the motion image, the location of the human body region changes constantly during movement, so the images need to be normalized to a uniform size. The movement region of the human body is placed at the same central position, so that the body positions in all images are aligned, which facilitates the subsequent extraction of the corresponding image features from the Tai Chi motion images. First, the human edges in the video sequence images are detected and the minimum and maximum boundary coordinates of the body region are recorded (formula (5)). The image is then cut to a fixed size, ensuring that the complete human movement region is retained in the trimmed image.
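A minimal sketch of this normalization step, under the assumption that the body region is given as a binary mask: the frame is cropped to the mask's bounding box and rescaled to a fixed output size with nearest-neighbour sampling. The function name and output size are illustrative.

```python
import numpy as np

def normalize_silhouette(image, mask, out_size=(128, 128)):
    """Crop the frame to the human bounding box, then resize to a fixed
    size with nearest-neighbour sampling so the body region is aligned
    across frames."""
    ys, xs = np.nonzero(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    oh, ow = out_size
    # Nearest-neighbour index maps from the output grid back to the crop.
    row_idx = np.arange(oh) * h // oh
    col_idx = np.arange(ow) * w // ow
    return crop[row_idx[:, None], col_idx[None, :]]
```

Because every frame is mapped to the same grid, joint positions become directly comparable between frames, which is what the later feature extraction relies on.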

Cross-layer feature fusion convolutional neural network
The AlexNet network proposed by Krizhevsky et al. in 2012 triggered a boom in deep neural network-based image processing. The network consists of five convolutional layers and three fully connected layers. It successfully trained on about 1.2 million images of 1000 categories, using ReLU as the activation function, multi-GPU parallel computing, local response normalization, overlapping pooling, and a Dropout layer to coordinate network performance. With 60 million parameters, it achieved a 17% top-5 error rate on the ILSVRC2012 dataset. After AlexNet, various deep neural network structures were put forward. The GoogLeNet network uses global pooling and the Inception module to cluster sparse matrices into dense sub-matrices, improving computing performance and optimizing parameters. The network has 22 layers and uses three loss outputs to counter the gradient problems caused by network depth. VGGNet-16 likewise shows that network depth is the key to the excellent performance of such algorithms.
This paper mainly studies the improvement of the AlexNet network for Tai Chi motion image recognition. The performance of the GoogLeNet and VGGNet-16 deep networks on Tai Chi image feature classification is compared for correlation analysis.

New network design
This paper proposes a new network based on the original AlexNet network. It consists of an input layer, four convolutional layers (each followed by a pooling layer), one Inception module, a cross-layer connection structure, two fully connected layers (followed by a Softmax loss function), and an output layer. It uses BN in place of Local Response Normalization (LRN), and the second pooling layer is cross-linked to the fully connected layer, where its output is fused with the deep features extracted by the backbone network before feeding into the classifier.
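As a concrete reference for the BN replacement, here is a minimal NumPy sketch of the batch-normalization forward pass (training-mode batch statistics only; the learnable scale gamma and shift beta are the standard BN parameters, not values from the paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis of an (N, D) activation
    matrix: zero-mean, unit-variance per feature, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

After this transform, each feature column has (approximately) zero mean and unit variance regardless of the scale of the incoming activations, which is what stabilizes and accelerates training relative to LRN.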
The new AlexNet network structure is shown in figure 2. Table 1 lists the specific parameters of the new AlexNet network, including the type, convolution kernel size, stride and output size of each network layer. After convolutional feature extraction, the original AlexNet network applies LRN, performing lateral suppression on the neurons adjacent to activated neurons to achieve local suppression and improve the model's generalization ability. BN, by contrast, effectively accelerates model convergence, prevents single samples from being frequently selected during batch training, and prevents gradient dispersion, while allowing the dropout layer and L2 regularization parameters to be abandoned [17]. In the new AlexNet network, the Inception-V1 module from the GoogLeNet network is introduced to extract deep features of Tai Chi images before the fully connected layer. The structure of the Inception-V1 module is shown in figure 3; it connects four branches with convolution kernels of different sizes in parallel. The first branch applies a 1×1 convolution to the upper layer's input. The second branch applies a 1×1 convolution to the previous layer, followed by a 3×3 convolution.
The third branch convolves the upper layer with a 1×1 kernel, followed by a 5×5 convolution. This continuous feature transformation broadens the dimension of feature expression. The fourth branch is a 3×3 maximum pooling that compresses the perceptual information. Finally, the four filtering branches are concatenated. The higher the layer at which the Inception module sits, the greater its efficiency [18]. The traditional convolutional neural network extracts features from shallow to deep layers, processes the features through classifiers, and outputs probabilities under different conditions. As the network deepens, this process cannot effectively fuse the low-level and high-level features to form the feature classifier. In this paper, the idea of cross-layer connection proposed in DeepID [19] is introduced to connect the second pooling layer to the fully connected layer for feature fusion. In general, the network first extracts layer features from 128×128 input images, where j represents the positive integer that is not greater than the number j(i) of output third-dimension channels in the i-th hidden layer. In the loss function J(w), δ₁₂ and δ₁₁ are the feedback errors of the output layer and the fully connected layer respectively, "∘" represents the Hadamard product, up(·) is the up-sampling operation, ⊕ represents the outer convolution operation, and W₁₂ is the weight between the output layer and the fully connected layer. The new AlexNet network in this paper uses the gradient descent algorithm [20] to update the weights and biases, given the training set D, momentum M and learning rate lr. The specific algorithm process is shown in figure 4.
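The update rule behind figure 4 can be sketched as standard momentum SGD. This is the usual form of the momentum update (velocity accumulation followed by a weight step), assumed here since the paper's own pseudocode is only given as a figure; the smoke test that follows the function simply minimizes a 1-D quadratic.

```python
import numpy as np

def momentum_sgd_step(w, velocity, grad, lr=0.001, momentum=0.9):
    """One momentum-SGD update: accumulate an exponentially decayed
    velocity from the gradient, then move the weights along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Smoke test: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = np.array([0.0]), np.array([0.0])
for _ in range(200):
    grad = 2 * (w - 3)
    w, v = momentum_sgd_step(w, v, grad, lr=0.05, momentum=0.9)
```

With the momentum term, the iterate overshoots and oscillates around the minimum before settling, but converges faster than plain gradient descent on ill-conditioned problems, which is why the paper pairs momentum with SGD in training.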

Adaptive interpolation of spatiotemporal weights
Adaptive interpolation of spatio-temporal weights can effectively reduce the interpolation errors caused by motion estimation errors and inaccurate edge detection. It has the advantage of automatically fusing edge-adaptive field interpolation [21,22]. First, the absolute differences between the pixels before and after the moving image element are calculated using the spatio-temporal weights, and the corresponding weight coefficients are derived. Then the weighted average of the adjacent pixels is taken to obtain the estimated value of the point to be interpolated. In the final expression for the spatio-temporal weight adaptive pixel P, P' is the pixel value after interpolation, and (X(i,j)−1, t) and (X(i,j)+1, t) are the two neighbouring pixels of the current field.
Here z is the change of the invariant matrix of the human body over a period.
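Since the interpolation formula itself is not reproduced here, the sketch below shows the general idea under stated assumptions: the spatial and temporal neighbour pairs are each averaged, and the two estimates are blended with weights inversely proportional to each pair's absolute difference, so the more self-consistent direction dominates. The exact weighting in the paper may differ.

```python
def adaptive_interpolate(above, below, prev_t, next_t, eps=1e-6):
    """Estimate a missing pixel from its spatial neighbours (above/below
    in the current field) and temporal neighbours (same position in the
    previous/next frame), weighting each pair by the inverse of its
    absolute difference."""
    spatial_diff = abs(above - below)
    temporal_diff = abs(prev_t - next_t)
    w_s = 1.0 / (spatial_diff + eps)
    w_t = 1.0 / (temporal_diff + eps)
    spatial_est = (above + below) / 2.0
    temporal_est = (prev_t + next_t) / 2.0
    return (w_s * spatial_est + w_t * temporal_est) / (w_s + w_t)
```

When the temporal neighbours agree exactly but the spatial ones disagree, the estimate follows the temporal pair, and vice versa, which is precisely the adaptive behavior that suppresses edge-detection errors.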

Temporal and spatial features of motion
The temporal and spatial features of Tai Chi posture movement include the skeletal features of the human body and the joint angles of the limbs [23]. Under the condition of a constant topological structure, the outer pixels of the gait image are stripped layer by layer through iteration, yielding a skeleton of single-pixel width, which is the extracted motion limb joint feature. The extraction results of the temporal and spatial features of the moving skeleton are shown in figure 5. The joint angles of the human body are expressed as coordinates, and the rotation angles of the limb joints at different times are calculated respectively. The calculated rotation angles are arranged in chronological order, and the temporal and spatial variation of the human movement is analyzed under the corresponding skeleton model. It can be seen from figure 5 that, during actual movement, the displacement of the human limb joints is small, so the spatio-temporal features of the motion can be directly represented by the joint angle features [24].
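Given the joint coordinates, each joint angle follows directly from the dot product of the two limb vectors meeting at the joint. A minimal sketch (the joint naming is illustrative):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c,
    e.g. shoulder-elbow-wrist for the elbow angle."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against rounding just outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

Computing this angle for each joint in every key frame and arranging the results in chronological order gives exactly the joint-angle time series described above.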

Motion features fusion
Motion feature fusion is realized by using the calculated spatio-temporal weights. The specific fusion process is shown in figure 6.

Figure 6. Process of motion feature fusion

Figure 6 shows that, during feature fusion, the reliability of the quantized matching values differs across features. Therefore, according to the distribution of the spatio-temporal weights, the fusion of the motion features is carried out at the feature layer, the data layer and the decision layer. The data of each video sequence are analyzed step by step, and the data threshold is obtained from the key frames to extract the area and joint angle. The motion feature fusion is completed according to the weighted matching. The overall proposed algorithm is summarized in figure 7.
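A minimal sketch of the weighted fusion idea, under the assumption that each feature source (e.g. area, joint angles) arrives as a vector with a reliability weight: the weights are normalized to sum to one and each vector is scaled by its weight before concatenation. The exact fusion operator in the paper may differ; this only illustrates reliability-weighted combination.

```python
import numpy as np

def fuse_features(features, weights):
    """Fuse several feature vectors by their spatio-temporal weights:
    normalize the weights to sum to one, then concatenate the weighted
    vectors so each source contributes in proportion to its reliability."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.concatenate([wi * np.asarray(f, dtype=float)
                           for wi, f in zip(w, features)])
```

A source judged three times more reliable thus contributes three times the magnitude of an equally sized but less reliable source.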


Experimental analysis and results
The experimental platform is the Ubuntu 16.04 operating system with the Caffe deep learning framework, the Python 2.7 interface language, a GTX 2080Ti GPU, an Intel Core i7-7820X CPU @ 3.60 GHz × 16, and 64 GB of memory. The initial learning rate is set to 0.001 with "step" decay, and the multi-class cross-entropy loss function is used. The comparison results with SVM and the traditional AlexNet are shown in figure 8.

Figure 8. Classification results
Comparing the results shows that the new AlexNet network constructed in this study significantly improves the average classification accuracy, average precision, average recall and average comprehensive index F1 compared with SVM [25] and the traditional AlexNet network [26]. In particular, the difference in recall indicates that the new AlexNet network has high classification accuracy, an outstanding classification effect on Tai Chi images, and strong model expressive power.

Cross-layer connection analysis
It is necessary to discuss the influence of different cross-connection modes on network performance. The cross-connection structure is introduced into the new AlexNet network to fuse deep and shallow features, with the fully connected layer fixed as the cross-connection terminal. A reliability analysis is therefore carried out on the front segment of the network, changing the cross-connection initial end to h2 and h6 respectively. Training is performed under the same conditions and compared with the test results of the new AlexNet network, as shown in Table 2. According to the test results, the h4 hidden layer has clear advantages as the initial end of the cross-connection: compared with h2 and h6, the average classification accuracy is improved by 3.62% and 2.72% respectively, and the other evaluation indicators also show obvious advantages. The feature map output by the h2 layer retains obvious edge information and overlaps heavily with the original input image. The output features of the h3 layer are more abstract than those of the previous layer, but still retain some specific contour edges. The output features of the h4 layer are already highly abstract. It can therefore be concluded that, compared with the new AlexNet without the cross-connection structure, the classification accuracy of the cross-connection network is significantly improved. However, when the shallow features output by the h2 layer are fused with the deep features at the connection terminal, the specific features are overemphasized and contribute too little to the classification results. The feature output of the h6 layer is too abstract, leading to parameter redundancy in the feature fusion and improving the classification accuracy less than the h4 layer does. Therefore, this paper chooses the h4 layer as the initial end of the cross-connection, which is the optimal choice.
The comparison of experimental results and the visualization of the training process verify that the new AlexNet achieves higher classification accuracy in image classification than the conventional SVM and AlexNet methods, with no need to extract image features manually. The sensitivity analysis and the visualization of the intermediate process verify the reliability of the cross-connection structure.
In order to verify the effectiveness of the proposed network in detecting and classifying Tai Chi motion features, this paper conducts experiments on the AlexNet network, the GoogLeNet network [27] with the Inception module, and the VGGNet-16 network [28] on the same motion data set. During model training, the control-variable method is adopted: the initial learning rate is set to 0.001 with "step" decay, and the same loss function, optimization function, maximum number of iterations (10,000) and parameter update method (momentum + SGD) are used for the different networks. The experimental results, shown in table 3, demonstrate the effectiveness of the CNN in independent feature extraction.

Comparison with other methods
In this subsection, we select the matching degree of feature extraction (MD), the weighted matching elasticity (WME) and the multi-scale motion feature fusion degree (MFD) as evaluation indicators.
After the AlexNet environment is set up, the proposed algorithm performs object detection, moving image enhancement and normalization, obtaining the extraction environment that best matches the motion features. Therefore, MD is set as one of the experimental indicators; in its calculation formula, x(t) represents the normalization result.
In the process of motion feature fusion, the key frames must be calculated to obtain the data threshold and to extract the area and joint angle. The WME is then obtained by dividing the matching blocks. The WME reflects the elasticity of the weighted matching, which in turn affects the recognition accuracy.
In order to realize the fusion of the extracted motion features at the feature layer, data layer and decision layer, the multi-scale motion feature fusion degree (MFD) of the three layers is compared. In the calculation formula of MFD, L is the general feature to be fused and M is the scale optimization degree.
The WME results are shown in figure 9. Under the limit of 25 iterations, the elasticity curve of the proposed method fluctuates considerably, but it is superior to the other methods in matching block distribution. This shows that the proposed method can complete motion feature fusion based on the weighted matching method and realize the feature extraction of spatio-temporal weight posture motion, which provides a basis for the recognition of Tai Chi spatio-temporal weight posture motion.

Figure 9. Test results of WME by different methods

The test results of MFD are shown in figure 10. The multi-scale motion feature fusion results of the proposed method are stronger than those of the TSF, CORR-OMP and SEMG methods at the feature layer, data layer and decision layer respectively. Therefore, the proposed method not only achieves higher efficiency than the traditional methods but also obtains more accurate extraction results for the spatio-temporal weight posture motion features.

Conclusion
This paper combines a cross-connection architecture and the Inception module to propose a new AlexNet network, realizing an optimal weight design over the traditional algorithm. BN is used for data normalization, and the gradient descent algorithm is used for optimization, accelerating the convergence of the network and avoiding gradient problems. By analyzing the temporal and spatial weights of the moving objects, the low extraction efficiency of traditional algorithms is overcome. Motion feature fusion is completed according to weighted matching, which solves the problems of poor extraction efficiency and low recognition accuracy that traditional feature extraction algorithms exhibit on Tai Chi motion, and provides a reference for related research in this field. However, the sample collection environment selected in the experiments is relatively simple, and each sample contains only one moving object, whereas real identification environments are complicated and contain many interference factors. Therefore, precise object localization will be an important future research direction, and we plan to apply this work to practical engineering applications.