CenterNet-SPP based on multi-feature fusion for basketball posture recognition

Existing posture recognition algorithms cannot fully reflect the dynamic characteristics of athletes' posture. To address this problem, this paper proposes a CenterNet-SPP model based on a multi-feature fusion algorithm for basketball posture recognition. First, motion posture images are collected by an optical image collector, and a gray-scale transformation is performed to improve image quality. Next, the body contour and the motion posture region are obtained using shadow elimination technology and the inter-frame difference method. Finally, the Radon transform and the discrete wavelet transform are used to extract features from the motion posture region and body contour; the two complementary features are fused and then input into the CenterNet-SPP network to realize the final posture recognition. Experimental results show that the recognition accuracy of the proposed method is higher than that of other recent methods.


Introduction
Accurate recognition of athletes' posture movements is essential both in high-level training and for key judgments in major competitions [1][2][3][4]. Several studies have explored this field. Reference [5] proposed a double-layer background modeling method based on the code-book and running-mean methods and used it to detect moving targets, achieving good results. Reference [6] proposed a posture recognition method based on motion regions and discussed a two-level posture modeling architecture in detail. Reference [7] proposed a diving posture recognition method based on visual technology and achieved accurate recognition results.
In recent years, optical image has gradually become an important branch of digital image processing, and the application of optical image in motion recognition is getting more and more attention [8]. Based on this, a basketball posture recognition algorithm based on CenterNet-SPP model and multi-feature fusion is proposed.
In this method, motion postures such as serving and receiving are collected by an optical image collector, and the image quality is improved by a gray-scale transformation. The body contour and the motion posture region are obtained by shadow elimination technology and the inter-frame difference method; features are then extracted from them using the Radon transform and the discrete wavelet transform, and the final posture recognition is achieved by fusing the two complementary features and training the network. A comparative experiment is designed to verify the effectiveness of the proposed method.

EAI Endorsed Transactions on Scalable Information Systems
The experimental results show that the proposed method achieves higher recognition accuracy than the traditional method.

Proposed posture recognition method
Existing posture recognition algorithms cannot fully reflect the dynamic characteristics of athletes' movement posture, so a posture recognition algorithm based on multi-feature fusion and the CenterNet-SPP model is proposed. The overall structure of the proposed method is shown in figure 1. Optical image acquisition and processing is a maturing technology well suited to supporting this research. To collect optical images of motion postures, this paper uses a CCD as the image sensor and a CPLD as the control core. Several basic postures are collected, including serving, receiving, dribbling, passing, and shooting.

Motion posture recognition area
Formula (1) defines the recognition area of the athlete's movement posture as a finite discrete set, so equation (1) can be rearranged as:

Image preprocessing
Image preprocessing is an important step in motion recognition. The athlete posture recognition algorithm based on the optical image collector divides preprocessing into the following three steps: (1) Image enhancement. To improve the final recognition accuracy, a gray-scale transformation of the collected image is carried out to improve the image quality. Let the number of image gray levels be L; the gray level r of the original image is mapped to the gray level s of the resulting image through a mapping transformation, namely: (2) Image segmentation. This is the key step in image processing. In the athlete posture recognition process, shadow elimination technology and the inter-frame difference method are used to obtain the body contour and the motion posture region.
(3) Image morphology processing. Morphological operations are applied to analyse the connectivity of the motion posture region.
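The first two preprocessing steps can be sketched as follows. The paper does not give the exact gray mapping or difference threshold, so `gray_transform` below uses a simple linear stretch and `frame_difference` uses a fixed threshold; both function names and parameter values are illustrative assumptions.

```python
import numpy as np

def gray_transform(img, s_min=0, s_max=255):
    """Linear gray-level mapping r -> s that stretches the image to [s_min, s_max]."""
    r_min, r_max = img.min(), img.max()
    if r_max == r_min:
        return np.full_like(img, s_min)
    s = (img.astype(np.float64) - r_min) * (s_max - s_min) / (r_max - r_min) + s_min
    return s.astype(np.uint8)

def frame_difference(prev_frame, cur_frame, threshold=25):
    """Inter-frame difference: binary mask of pixels whose gray level changed."""
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Tiny demonstration: a low-contrast patch is stretched to full range,
# and a single changed pixel is detected between two frames.
patch = np.array([[10, 20], [30, 40]], dtype=np.uint8)
stretched = gray_transform(patch)
mask = frame_difference(np.zeros((2, 2), dtype=np.uint8),
                        np.array([[0, 100], [0, 0]], dtype=np.uint8))
```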

Motion feature extraction
Motion posture region and body contour features are extracted using the Radon transform [9] and the discrete wavelet transform [10]. First, the motion posture region is extracted: the discrete binary image f(x, y) is normalized, and the following formula is used for feature extraction of the motion region. Then the body contour features are extracted; the contour line is unfolded into a one-dimensional feature of the Euclidean distances of the body contour points using the following formula:
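The contour-unfolding step can be sketched as below. The paper's exact formula is not reproduced here; a common choice, assumed for this sketch, is the Euclidean distance of each contour point from the contour centroid, normalized for scale invariance.

```python
import numpy as np

def contour_distance_feature(contour):
    """Unfold a closed body contour into a 1-D feature vector: the Euclidean
    distance of each contour point from the contour centroid, scale-normalized."""
    pts = np.asarray(contour, dtype=np.float64)   # shape (n, 2): (x, y) points
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1)
    return d / d.max() if d.max() > 0 else d

# The four corners of a square are all equidistant from the centroid,
# so the unfolded feature is constant.
square = [(0, 0), (0, 2), (2, 2), (2, 0)]
feat = contour_distance_feature(square)
```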

CenterNet-SPP model
The AlexNet model introduced the ReLU activation function, and models using ReLU converge much faster than those using Sigmoid [11], which became one of AlexNet's advantages. The model has 8 layers in total: the first 5 are convolution layers (the first two and the fifth are followed by pooling layers), and the last 3 are fully connected layers. Each component plays a different role: overlapping pooling improves accuracy and resists over-fitting, local response normalization improves accuracy, and data augmentation and dropout reduce over-fitting. The VGG16 model is essentially an enhanced version of the AlexNet structure that emphasizes the depth of convolutional neural network design [12], with each group of convolution layers followed by a pooling layer. The VGG network uses smaller convolution kernels, which reduces the number of parameters and saves computing resources; combined with its large number of layers, the small kernels give the whole network better feature extraction performance.
Inception V3 is a deep convolutional network developed by Google [13]. The main idea of the Inception structure is to approximate the optimal local sparse structure with dense components. The Inception structure uses convolution kernels of size 3×3 and 5×5, adds 1×1 kernels, and introduces the BN (Batch Normalization) method [14]. Using the branch structure, a large two-dimensional convolution is split into two small one-dimensional convolutions. Compared with symmetric splitting, this asymmetric convolution splitting handles richer spatial features, increases feature diversity, and reduces the amount of computation.
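The splitting idea can be verified numerically: a rank-1 n×n kernel factors exactly into an n×1 and a 1×n kernel, so one 3×3 convolution (9 weights) equals two 1-D convolutions (6 weights). This is a minimal sketch, not Inception V3's actual implementation; the loop-based `conv2d_valid` helper is defined here only for clarity.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation, loop version for clarity."""
    kh, kw = k.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

# A 3x3 kernel built as an outer product of a 3x1 and a 1x3 kernel.
col = np.array([[1.0], [2.0], [1.0]])      # 3x1 (3 weights)
row = np.array([[1.0, 0.0, -1.0]])         # 1x3 (3 weights)
k2d = col @ row                            # 3x3 (9 weights)

x = np.arange(36, dtype=np.float64).reshape(6, 6)
full = conv2d_valid(x, k2d)                # one 2-D convolution
split = conv2d_valid(conv2d_valid(x, col), row)  # two 1-D convolutions
```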
The ResNet50 model solves the problem that performance degrades as network depth and width increase [15]. Deep neural network models consume large amounts of computing resources, and their error rate can also rise; this is mainly because the vanishing-gradient phenomenon becomes more pronounced as the number of layers grows. ResNet50 adds a residual structure: an identity mapping transforms the original transformation function H(x) into F(x) + x, so the network is no longer a simple stack and the vanishing-gradient problem is alleviated. This superposition adds no extra parameters or computation to the network, yet improves the effectiveness and efficiency of training. The MobileNet model reduces model complexity and improves speed while maintaining performance [16]. Its basic unit is the depthwise separable convolution. Unlike standard convolution, where each kernel acts on all input channels, depthwise separable convolution applies a different kernel to each input channel; combined with BN and the ReLU activation function, this greatly reduces the amount of computation and the number of model parameters.
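The identity shortcut can be illustrated in a few lines. This is a schematic sketch of F(x) + x, not ResNet50's actual layers: when the residual branch F learns (near-)zero weights, the block reduces to the identity, so stacking blocks cannot make the mapping worse.

```python
import numpy as np

def residual_block(x, F):
    """Identity shortcut: output = F(x) + x, the core of the ResNet residual unit."""
    return F(x) + x

x = np.array([1.0, -2.0, 3.0])

# With a zero residual branch the block is exactly the identity.
identity_out = residual_block(x, lambda v: np.zeros_like(v))

# With a non-trivial branch the shortcut still adds the input back.
scaled_out = residual_block(x, lambda v: 0.5 * v)
```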
CenterNet is a detection network based on center points; it is simple and fast, and its accuracy is no lower than that of anchor-based detectors. On the COCO dataset, the AP of the CenterNet model is 20% higher than that of YOLOv2 and 6.9% higher than that of Faster-RCNN [17]. CenterNet is therefore selected for motion posture recognition in this paper. Unlike traditional target detection models, CenterNet represents each detection target as a center point instead of an anchor box, which avoids the positive/negative sample imbalance and the excessive computation associated with anchors. The algorithm first obtains a feature map through the feature extraction network, then finds local peaks on the feature map as center points, and regresses image attributes such as target size from each center point. During training, each target generates one center point without non-maximum suppression (NMS), which reduces computation and training time. Meanwhile, feature maps with higher resolution are used to improve the detection of small targets.
The CenterNet-SPP network structure adopted in this paper is a feature extraction network based on ResNet50 with a Spatial Pyramid Pooling (SPP) module added [18][19][20]. SPP enlarges the receptive field to widen the reception range of the backbone features, and at the same time separates important features to improve feature extraction.
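A minimal sketch of the SPP module as used in detection networks: the input feature map is concatenated with same-resolution max-pools of growing kernel size, which widens the receptive field without changing spatial resolution. This single-channel NumPy version is illustrative; the kernel sizes (5, 9, 13) are an assumption borrowed from common detector SPP blocks, not taken from the paper.

```python
import numpy as np

def same_maxpool(feat, k):
    """Max-pool with stride 1 and 'same' edge padding (output keeps input size)."""
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    H, W = feat.shape
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i+k, j:j+k].max()
    return out

def spp_block(feat, kernels=(5, 9, 13)):
    """Concatenate the input with max-pools of growing kernel size,
    widening the effective receptive field (detector-style SPP module)."""
    return np.stack([feat] + [same_maxpool(feat, k) for k in kernels])

feat = np.random.rand(8, 8)
pyramid = spp_block(feat)   # 4 'channels': input + three pooled maps
```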
Based on this network model, the original image is mapped to a keypoint heatmap, where N is the number of key points in the image. When predicting the center point offset, the image is down-sampled by a factor of R = 4, so the feature map introduces a discretization error when remapped to the original image; a local offset is therefore predicted for each center point, as in equation (9). Finally, the size of the target box is regressed for each predicted center point.

Feature fusion
The final posture recognition is achieved by fusing the two complementary features. First, the training sequence of movement postures is built, and then the test sequence, where m and n denote the numbers of frames in the motion posture sequences of the two athletes, and X_ij denotes the j-th feature vector in the i-th athlete's movement sequence. The distance between the movement period of the posture training set and the k-th subsequence of the test set is calculated with the following formula, and the similarity of all motion sequences is then computed as follows:
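The matching step can be sketched as below. The paper's exact distance and similarity formulas are not recoverable from the source, so this sketch assumes a common form: a sliding-window Euclidean distance over the test sequence, with the minimum distance converted to a similarity in (0, 1]. Both function names are illustrative.

```python
import numpy as np

def subsequence_distance(train_seq, test_seq, k, period):
    """Euclidean distance between a training posture period and the k-th
    length-`period` subsequence of the test sequence."""
    window = test_seq[k:k + period]
    return float(np.linalg.norm(train_seq[:period] - window))

def best_match_similarity(train_seq, test_seq, period):
    """Slide over the test sequence and map the minimum distance
    into a (0, 1] similarity score."""
    dists = [subsequence_distance(train_seq, test_seq, k, period)
             for k in range(len(test_seq) - period + 1)]
    return 1.0 / (1.0 + min(dists))

# A training period that appears exactly inside the test sequence
# yields zero distance and hence similarity 1.0.
train = np.array([1.0, 2.0, 3.0])
test = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
sim = best_match_similarity(train, test, period=3)
```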

Experiments and analysis
The experimental environment is as follows: Windows 10 (64-bit), CUDA 10.0, and the TensorFlow and Keras deep learning frameworks with the Python programming language. The PC is configured with a GeForce GTX 1060 graphics card with 6 GB of video memory and an Intel(R) Core(TM) i5-9400F processor at 2.90 GHz.

In this paper, the same data set is used to train basketball motion postures under five different convolutional neural network models (VGG16, ResNet50, Inception V3, MobileNet, AlexNet). For each model, transfer learning [21,22] is used to initialize the parameters from a model pre-trained on ImageNet classification, and the model is trained for 120 iterations with a cross-entropy loss function and the Adam optimizer, with an initial learning rate of 0.0001 and a momentum factor of 0.1. If model performance does not improve for 5 epochs, the learning rate is reduced. The final loss values of the five models tend to be stable, while the test-set accuracy stabilizes at a relatively high value. Table 1 shows the test-set accuracy of the five models, and figure 2 shows the training loss, test loss, and test accuracy.
For each model, 50 basketball motion images are randomly selected for verification; the resulting confusion matrices are shown in figure 3. The Inception V3 model has the highest average accuracy of 98.11% (Table 2) and takes 0.12 s on average to identify an image. The results show that all five models can recognize basketball motion images accurately.
In the detection experiment, LabelImg (an image annotation tool) is used to label basketball postures manually in PASCAL VOC2007 format. The CenterNet-SPP structure is used, with initialization parameters set from a model pre-trained on the VOC dataset. The model is trained for a total of 400 epochs, and the loss decreases rapidly in the first 100 epochs.
Since the backbone is unfrozen after 100 epochs (in the first 100 epochs only the layers after the backbone feature network are trained, and in the remaining 300 epochs the whole network is trained), the loss value decreases rapidly again and then gradually stabilizes. This indicates that the model trains well; its training loss curve is shown in figure 4(a).
To select a model with sufficiently high overall performance, posture targets with confidence greater than 0.5 are first retained, and the weight file with the highest mAP value is sought. The model is trained for 400 iterations, with one model saved every 10 iterations, yielding 40 models in total; the one with the highest mAP value is selected, as shown in figure 4(b). When the mAP value stabilizes at the end of training, the maximum is 90.03%, and this is the model selected in this paper.

Conclusion
An algorithm for athlete posture recognition based on multi-feature fusion is proposed. In this method, motion posture images are collected by an optical image collector, and the image quality is improved by a gray-scale transformation. The body contour and motion posture region are then obtained using shadow elimination technology and the inter-frame difference method. Finally, the Radon transform and discrete wavelet transform are used to extract features from the motion posture region and body contour, and the final posture recognition is achieved through the fusion of the two complementary features and CenterNet-SPP network training. Experiments show that the recognition accuracy of the proposed method is higher than that of other recent methods. In the future, we hope to extend the proposed method to movement recognition for other types of athletes, so as to better improve the overall training level of athletes.