Research Article
Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition
@INPROCEEDINGS{10.4108/eai.21-6-2018.2276579, author={Yan Feng and Ge Li and Chunfeng Yuan}, title={Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition}, proceedings={11th EAI International Conference on Mobile Multimedia Communications}, publisher={EAI}, proceedings_a={MOBIMEDIA}, year={2018}, month={9}, keywords={action recognition, skeleton, view-invariant, attention model}, doi={10.4108/eai.21-6-2018.2276579} }
Abstract
Skeleton-based human action recognition has been widely studied recently with the advancement of depth capturing devices. However, skeleton data captured from a single camera are view-dependent and noisy. In this paper, we propose a spatio-temporal and view attention based deep network model that mitigates the disturbance caused by viewpoint variation and noise in skeleton data for human action recognition. Our model consists of two sub-networks built on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM). The view-specific sub-network incorporates spatio-temporal attention to learn discriminative features from a single input view by paying more attention to key joints and key frames. The subsequent view attention sub-network learns view-invariant representations shared across views, using a view attention module to select the most discriminative views. Finally, we propose a regularized cross-entropy loss to ensure effective end-to-end training of the network. Experimental results demonstrate the effectiveness of the proposed model on the NTU RGB+D dataset, currently the largest action recognition dataset of its kind.
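The spatio-temporal attention described in the abstract — weighting key joints within each frame, then key frames within the sequence — can be illustrated with plain softmax scoring. The following is a minimal NumPy sketch under stated assumptions: the dimensions, random feature tensor, and scoring vectors are hypothetical placeholders standing in for the paper's LSTM features and learned attention parameters, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T frames, J joints, D-dim feature per joint.
rng = np.random.default_rng(0)
T, J, D = 10, 25, 8
features = rng.normal(size=(T, J, D))  # stand-in for LSTM skeleton features

# Spatial attention: score each joint in a frame, weight-sum over joints.
w_spatial = rng.normal(size=(D,))                # hypothetical scoring vector
alpha = softmax(features @ w_spatial, axis=1)    # (T, J) weights over joints
frame_feats = (alpha[..., None] * features).sum(axis=1)  # (T, D)

# Temporal attention: score each frame, weight-sum over frames.
w_temporal = rng.normal(size=(D,))               # hypothetical scoring vector
beta = softmax(frame_feats @ w_temporal)         # (T,) weights over frames
clip_feat = (beta[:, None] * frame_feats).sum(axis=0)    # (D,) clip descriptor
```

The same weighted-pooling pattern would apply one level up in the view attention sub-network, with softmax weights computed over per-view descriptors instead of joints or frames.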