Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring

Takashi Higashi; Ryuto Ishibashi; Lin Meng

Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

Cite (BibTeX and plain text):

@INPROCEEDINGS{10.4108/eai.18-12-2025.2365275,
    author={Takashi Higashi and Ryuto Ishibashi and Lin Meng},
    title={Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={Video recognition, Action recognition, ViViT, Pruning},
    doi={10.4108/eai.18-12-2025.2365275}
}

Takashi Higashi, Ryuto Ishibashi, Lin Meng (2026). Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring. IIKI. EAI. DOI: 10.4108/eai.18-12-2025.2365275

Takashi Higashi¹, Ryuto Ishibashi², Lin Meng²,*

1: Graduate School of Science and Engineering, Ritsumeikan University
2: College of Science and Engineering, Ritsumeikan University

*Contact email: menglin@fc.ritsumei.ac.jp

Abstract

This study proposes a Multi-Modal Action Density Scoring (MADS) framework that integrates frame pruning into a ViViT-based multi-modal video recognition model to improve computational efficiency while maintaining accuracy. MADS introduces two frame selection strategies: a Learnable Threshold and a Top-K Segment Method. The Learnable Threshold adaptively determines the pruning level via a learnable parameter guided by a statistical loss, whereas the Top-K Segment Method divides a video into temporal segments and selects the most informative frames in each segment based on normalized importance scores. Experiments on the NTU RGB+D dataset show that the Top-K Segment Method achieves up to a 36% reduction in FLOPs with only a 0.5% drop in accuracy, outperforming the Learnable Threshold. Qualitative analysis further confirms that the Top-K Segment Method preserves temporally distributed and semantically rich frames, maintaining motion continuity and visual interpretability. These results highlight MADS as a flexible and efficient framework for real-time and resource-constrained video recognition.
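The segment-wise selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `topk_segment_select`, the min-max normalization, the equal-length segmentation, and the parameters `num_segments` and `k` are all assumptions made for illustration, since the abstract does not specify how scores are normalized or how segments are formed.

```python
from typing import List, Sequence


def topk_segment_select(scores: Sequence[float], num_segments: int, k: int) -> List[int]:
    """Keep the k highest-scoring frames from each temporal segment.

    Hypothetical helper: min-max normalization and equal-length segments
    are assumptions, not details taken from the paper.
    """
    if num_segments <= 0 or k <= 0:
        raise ValueError("num_segments and k must be positive")
    n = len(scores)
    if n == 0:
        return []

    # Normalize importance scores to [0, 1] (assumed min-max scheme).
    lo, hi = min(scores), max(scores)
    span = hi - lo
    norm = [(s - lo) / span if span > 0 else 0.0 for s in scores]

    kept: List[int] = []
    for seg in range(num_segments):
        # Integer arithmetic splits the n frames into near-equal segments.
        start = seg * n // num_segments
        end = (seg + 1) * n // num_segments
        # Rank the frames of this segment by normalized score, highest first.
        ranked = sorted(range(start, end), key=lambda i: norm[i], reverse=True)
        kept.extend(ranked[:k])

    # Return indices in temporal order so frame ordering is preserved.
    return sorted(kept)


frame_scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.5, 0.4]
selected = topk_segment_select(frame_scores, num_segments=4, k=1)
print(selected)
```

Because one frame is drawn from every segment, the kept frames stay spread across the whole clip rather than clustering around a single high-scoring moment, which matches the temporally distributed selection the abstract reports.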

Keywords: Video recognition, Action recognition, ViViT, Pruning

Published: 2026-06-17
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.18-12-2025.2365275
