
Research Article
Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring
@INPROCEEDINGS{10.4108/eai.18-12-2025.2365275,
  author={Takashi Higashi and Ryuto Ishibashi and Lin Meng},
  title={Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring},
  proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
  publisher={EAI},
  proceedings_a={IIKI},
  year={2026},
  month={6},
  keywords={Video recognition, Action recognition, ViViT, Pruning},
  doi={10.4108/eai.18-12-2025.2365275}
}
Takashi Higashi
Ryuto Ishibashi
Lin Meng
Year: 2026
IIKI
EAI
DOI: 10.4108/eai.18-12-2025.2365275
Abstract
This study proposes a Multi-Modal Action Density Scoring (MADS) framework that integrates frame pruning into a ViViT-based multi-modal video recognition model to improve computational efficiency while maintaining accuracy. MADS introduces two frame selection strategies: a Learnable Threshold and a Top-K Segment Method. The Learnable Threshold adaptively determines the pruning level via a learnable parameter guided by a statistical loss, whereas the Top-K Segment Method divides a video into temporal segments and selects the most informative frames based on normalized importance scores. Experiments on the NTU RGB+D dataset show that the Top-K Segment Method reduces FLOPs by up to 36% with only a 0.5% drop in accuracy, outperforming the Learnable Threshold. Qualitative analysis further confirms that the Top-K Segment Method preserves temporally distributed and semantically rich frames, maintaining motion continuity and visual interpretability. These results highlight MADS as a flexible and efficient framework for real-time and resource-constrained video recognition.
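The segment-wise selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how importance scores are computed, how they are normalized, or how segments are formed. The sketch assumes per-frame scores are already available, uses global min-max normalization, splits the frames into equal-length contiguous segments, and keeps the k highest-scoring frames in each segment. The function names `min_max_normalize` and `top_k_segment_select` and the parameters `num_segments` and `k` are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant input maps to all zeros.

    Assumption: min-max normalization. The abstract only says "normalized".
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def top_k_segment_select(
    scores: Sequence[float], num_segments: int, k: int
) -> List[int]:
    """Return the sorted indices of frames kept after per-segment top-k selection.

    Assumption: frames are split into contiguous, near-equal segments and the
    k best frames are kept in each one, so kept frames stay spread over time.
    """
    if num_segments < 1 or k < 1:
        raise ValueError("num_segments and k must be positive")
    n = len(scores)
    if n == 0:
        return []

    norm = min_max_normalize(scores)
    kept: List[int] = []
    for seg in range(num_segments):
        # Integer arithmetic gives contiguous segment boundaries covering all frames.
        start = seg * n // num_segments
        end = (seg + 1) * n // num_segments
        indices = list(range(start, end))
        # Highest score first; ties are broken in favor of the earlier frame.
        indices.sort(key=lambda i: (-norm[i], i))
        kept.extend(indices[:k])
    return sorted(kept)


if __name__ == "__main__":
    frame_scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.7, 0.05]
    print(top_k_segment_select(frame_scores, num_segments=4, k=1))
```

Because each segment contributes its own frames, the selection cannot collapse onto a single high-scoring burst of motion, which matches the abstract's observation that the kept frames remain temporally distributed. A global top-k over the whole clip would not have that property.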


