11th EAI International Conference on Mobile Multimedia Communications

Research Article

Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition

  • @INPROCEEDINGS{10.4108/eai.21-6-2018.2276579,
        author={Yan Feng and Ge Li and Chunfeng Yuan},
        title={Spatio-Temporal and View Attention Deep Network for Skeleton based View-invariant Human Action Recognition},
        proceedings={11th EAI International Conference on Mobile Multimedia Communications},
        publisher={EAI},
        proceedings_a={MOBIMEDIA},
        year={2018},
        month={9},
        keywords={action recognition; skeleton; view-invariant; attention model},
        doi={10.4108/eai.21-6-2018.2276579}
    }
    
Yan Feng1, Ge Li1,*, Chunfeng Yuan2
  • 1: School of Information Science & Technology, Qingdao University of Science & Technology
  • 2: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
*Contact email: lige420@126.com

Abstract

Skeleton-based human action recognition has been widely studied in recent years with the advancement of depth-sensing devices. However, skeleton data captured from a single camera is view-dependent and noisy. In this paper, we propose a spatio-temporal and view attention based deep network model that suppresses the disturbance of view variation and noise in skeleton data for human action recognition. Our model consists of two sub-networks built on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM). The view-specific sub-network incorporates spatio-temporal attention and learns discriminative features from a single input view by attending to key joints and key frames. The subsequent view attention sub-network obtains common view-invariant representations shared among views, using a view attention module to select the most discriminative views. Finally, we propose a regularized cross-entropy loss to ensure effective end-to-end training of the network. Experimental results demonstrate the effectiveness of the proposed model on the NTU RGB+D dataset, currently the largest skeleton-based action recognition dataset.
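
To make the described architecture concrete, below is a minimal PyTorch sketch of the pipeline the abstract outlines: per-view LSTMs with spatial (joint) and temporal (frame) attention, a view attention module that fuses per-view features into a shared view-invariant representation, and a cross-entropy loss with an attention regularizer. All module names, tensor shapes, hyperparameters, and the exact form of the regularizer are illustrative assumptions, not the authors' implementation.

    # Minimal sketch under assumed shapes; not the paper's actual code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ViewSpecificNet(nn.Module):
        """One view: LSTM with spatial (joint) and temporal (frame) attention."""
        def __init__(self, num_joints, hidden=128):
            super().__init__()
            self.spatial_att = nn.Linear(3, 1)        # score each joint from its 3-D coordinates
            self.lstm = nn.LSTM(num_joints * 3, hidden, batch_first=True)
            self.temporal_att = nn.Linear(hidden, 1)  # score each frame from its hidden state

        def forward(self, x):                         # x: (B, T, J, 3)
            B, T, J, _ = x.shape
            a_s = F.softmax(self.spatial_att(x), dim=2)    # (B, T, J, 1) joint weights
            x = (x * a_s * J).reshape(B, T, J * 3)         # emphasize key joints, flatten
            h, _ = self.lstm(x)                            # (B, T, hidden)
            a_t = F.softmax(self.temporal_att(h), dim=1)   # (B, T, 1) frame weights
            return (h * a_t).sum(dim=1), a_s, a_t          # (B, hidden) view feature

    class STVAttentionNet(nn.Module):
        """View-specific sub-networks followed by attention over views."""
        def __init__(self, num_joints, num_views, num_classes, hidden=128):
            super().__init__()
            self.views = nn.ModuleList(
                ViewSpecificNet(num_joints, hidden) for _ in range(num_views))
            self.view_att = nn.Linear(hidden, 1)      # score each view's feature
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, xs):                        # xs: list of (B, T, J, 3), one per view
            outs = [net(x) for net, x in zip(self.views, xs)]
            feats = torch.stack([f for f, _, _ in outs], dim=1)  # (B, V, hidden)
            a_v = F.softmax(self.view_att(feats), dim=1)         # (B, V, 1) view weights
            shared = (feats * a_v).sum(dim=1)                    # view-invariant representation
            atts = [(a_s, a_t) for _, a_s, a_t in outs]
            return self.classifier(shared), atts, a_v

    def regularized_ce(logits, labels, atts, a_v, lam=1e-4):
        """Cross-entropy plus an L2 penalty on all attention weights (one plausible
        regularizer; the paper's exact formulation is not given in the abstract)."""
        reg = a_v.pow(2).mean()
        for a_s, a_t in atts:
            reg = reg + a_s.pow(2).mean() + a_t.pow(2).mean()
        return F.cross_entropy(logits, labels) + lam * reg

    # Usage: 2 views of NTU skeletons (25 joints), 60 classes, 30-frame clips.
    model = STVAttentionNet(num_joints=25, num_views=2, num_classes=60)
    xs = [torch.randn(4, 30, 25, 3) for _ in range(2)]
    logits, atts, a_v = model(xs)
    loss = regularized_ce(logits, torch.randint(0, 60, (4,)), atts, a_v)
    loss.backward()

In this sketch the view attention weights a_v let the network down-weight noisy or uninformative viewpoints before classification, which is the mechanism the abstract credits for view invariance.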