11th EAI International Conference on Mobile Multimedia Communications

Research Article

Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition

  • @INPROCEEDINGS{10.4108/eai.21-6-2018.2276579,
        author={Yan Feng and Ge Li and Chunfeng Yuan},
        title={Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition},
        proceedings={11th EAI International Conference on Mobile Multimedia Communications},
        publisher={EAI},
        proceedings_a={MOBIMEDIA},
        year={2018},
        month={9},
        keywords={action recognition skeleton view-invariant attention model},
        doi={10.4108/eai.21-6-2018.2276579}
    }
    
  • Yan Feng
    Ge Li
    Chunfeng Yuan
    Year: 2018
    Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition
    MOBIMEDIA
    EAI
    DOI: 10.4108/eai.21-6-2018.2276579
Yan Feng¹, Ge Li¹,*, Chunfeng Yuan²
  • 1: School of Information Science & Technology, Qingdao University of Science & Technology
  • 2: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
*Contact email: lige420@126.com

Abstract

Skeleton-based human action recognition has been widely studied in recent years, driven by advances in depth-capturing devices. However, skeleton data captured from a single camera is view-dependent and noisy. In this paper, we propose a spatio-temporal and view attention based deep network model that reduces the influence of viewpoint and noise in skeleton data for human action recognition. Our model consists of two sub-networks built on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM). The view-specific sub-network incorporates spatio-temporal attention and learns discriminative features from a single input view by paying more attention to key joints and frames. The subsequent view attention sub-network learns common view-invariant representations shared among views and contains a view attention module that selects the discriminative views. Finally, we propose a regularized cross-entropy loss to ensure effective end-to-end training of the network. Experimental results on the NTU action recognition dataset, currently the largest such dataset, demonstrate the effectiveness of the proposed model.
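
The abstract does not specify how the attention weights are computed, so the following is only a minimal sketch of generic softmax attention pooling, the kind of weighting that could be used to emphasise key joints, frames, or views. It is not the authors' implementation. The function names, the feature vectors, and the scores are all hypothetical; in the proposed model the scores would presumably come from the LSTM-based sub-networks, whereas here they are supplied by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Return the softmax-weighted sum of equal-length feature vectors.

    features: list of vectors (one per joint, frame, or view)
    scores:   one unnormalised relevance score per vector (assumed given)
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Hypothetical example: three views with 2-dimensional features.
# Equal scores mean every view contributes equally to the pooled vector.
view_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_scores = [0.0, 0.0, 0.0]
pooled, weights = attention_pool(view_features, view_scores)
```

Raising one view's score relative to the others shifts the pooled vector toward that view's features, which is the sense in which an attention module "selects" discriminative views.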