sis 22(34): e1

Research Article

A spatio-temporal attention fusion model for students behaviour recognition

  • @ARTICLE{10.4108/eai.3-9-2021.170905,
        author={Xiaoli Wang},
        title={A spatio-temporal attention fusion model for students behaviour recognition},
        journal={EAI Endorsed Transactions on Scalable Information Systems},
        keywords={student behavior, spatio-temporal attention, channel information, multi-spatial attention, CNN},
  • Xiaoli Wang
    Year: 2021
    A spatio-temporal attention fusion model for students behaviour recognition
    DOI: 10.4108/eai.3-9-2021.170905
Xiaoli Wang1,*
  • 1: School of Continuing Education, SanMenXia College of Social Administration, SanMenXia, 472000, China


Student behaviour analysis reflects students' learning situation in real time and provides an important basis for optimizing classroom teaching strategies and improving teaching methods. Exploring how to use big data to detect and recognize student behaviour is an important task for the smart classroom. Traditional recognition methods suffer from defects such as low efficiency, blurred edges, and long processing time. In this paper, we propose a new student behaviour recognition method based on a spatio-temporal attention fusion model. It makes full use of the key spatio-temporal information in video and thereby addresses the problem of spatio-temporal information redundancy. First, a channel attention mechanism is introduced into the spatio-temporal network, and the channel information is recalibrated by modelling the dependencies between feature channels, which improves the expressive power of the features. Second, a temporal attention model based on a convolutional neural network (CNN) is proposed; it uses relatively few parameters to learn an attention score for each frame and focuses on the frames with pronounced behaviour amplitude. Meanwhile, a multi-spatial attention model computes the attention score of each position in each frame from different angles, extracts several salient regions of the behaviour, and fuses the spatio-temporal features to further enhance the feature representation of the video. Finally, the fused features are fed into the classification network, and the recognition result is obtained by combining the two output streams with different weights. Experimental results on the HMDB51 and UCF101 datasets and on eight typical classroom behaviours of students show that the proposed method can effectively recognize the behaviours in videos. The reported accuracy is higher than 90% on HMDB51, and likewise higher than 90% on UCF101 and the real classroom data.
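
The abstract does not give formulas, so the following is only a minimal Python sketch of two of the ideas it names: weighting frames by an attention score, and combining two output streams with different weights. It assumes that per-frame scores are normalized with a softmax and that the streams are fused as a convex combination of class probabilities; neither choice is confirmed by the abstract. In the paper the scores are learned by a CNN, whereas here they are plain inputs. The function names `temporal_attention_pool` and `fuse_two_streams` and the parameter `weight_a` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, frame_scores):
    """Pool per-frame feature vectors into one video-level vector.

    Assumption: attention weights are the softmax of the raw frame scores.
    Frames with higher scores contribute more to the pooled feature.
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for weight, feature in zip(weights, frame_features):
        for i in range(dim):
            pooled[i] += weight * feature[i]
    return pooled, weights


def fuse_two_streams(probs_a, probs_b, weight_a=0.5):
    """Late fusion of two streams' class probabilities.

    Assumption: the fused score is weight_a * a + (1 - weight_a) * b.
    Returns the fused probabilities and the index of the predicted class.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    fused = [weight_a * a + (1.0 - weight_a) * b
             for a, b in zip(probs_a, probs_b)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


if __name__ == "__main__":
    # Two frames with 2-dimensional features; the second frame scores higher.
    pooled, weights = temporal_attention_pool(
        [[1.0, 0.0], [3.0, 2.0]], [0.0, 1.0]
    )
    print("attention weights:", weights)
    print("pooled feature:", pooled)

    fused, predicted = fuse_two_streams([0.7, 0.3], [0.2, 0.8], weight_a=0.5)
    print("fused probabilities:", fused, "predicted class:", predicted)
```

With equal frame scores the pooling reduces to a plain average of the frame features, which gives a simple sanity check for the weighting step.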