Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring

@INPROCEEDINGS{10.4108/eai.18-12-2025.2365275,
    author={Takashi Higashi and Ryuto Ishibashi and Lin Meng},
    title={Multi-Modal Video Action Recognition with Learnable Frame Pruning via Temporal Token Scoring},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={Video recognition, Action recognition, ViViT, Pruning},
    doi={10.4108/eai.18-12-2025.2365275}
}
Takashi Higashi¹, Ryuto Ishibashi², Lin Meng²,*
  • ¹ Graduate School of Science and Engineering, Ritsumeikan University
  • ² College of Science and Engineering, Ritsumeikan University
*Contact email: menglin@fc.ritsumei.ac.jp

Abstract

This study proposes a Multi-Modal Action Density Scoring (MADS) framework that integrates frame pruning into a ViViT-based multi-modal video recognition model to improve computational efficiency while maintaining accuracy. MADS introduces two frame selection strategies: a Learnable Threshold and a Top-K Segment Method. The Learnable Threshold adaptively sets the pruning level through a learnable parameter guided by a statistical loss, whereas the Top-K Segment Method divides a video into temporal segments and selects the most informative frames based on normalized importance scores. Experiments on the NTU RGB+D dataset show that the Top-K Segment Method achieves a FLOPs reduction of up to 36% with only a 0.5% drop in accuracy, outperforming the Learnable Threshold. Qualitative analysis further confirms that the Top-K Segment Method preserves temporally distributed and semantically rich frames, maintaining motion continuity and visual interpretability. These results highlight MADS as a flexible and efficient framework for real-time and resource-constrained video recognition.
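
The abstract does not give implementation details for the Top-K Segment Method, so the following is only a minimal sketch of the general idea under stated assumptions: per-frame importance scores are already available, they are min-max normalized over the whole clip, the clip is split into contiguous segments of near-equal length, and the k highest-scoring frames are kept in each segment. The names `normalize_scores`, `top_k_segment_select`, `num_segments`, and `k` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def normalize_scores(scores: Sequence[float]) -> List[float]:
    """Min-max normalize scores to [0, 1] (assumed normalization scheme).

    A constant input has no spread, so it maps to all zeros.
    """
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in scores]
    return [(s - lo) / span for s in scores]


def top_k_segment_select(
    scores: Sequence[float], num_segments: int, k: int
) -> List[int]:
    """Return sorted indices of the frames kept after pruning.

    Assumption: segments are contiguous and of near-equal length, and
    the k frames with the highest normalized score are kept per segment.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if num_segments < 1 or k < 1:
        raise ValueError("num_segments and k must be positive")

    n = len(scores)
    num_segments = min(num_segments, n)  # avoid empty segments
    norm = normalize_scores(scores)

    kept: List[int] = []
    for seg in range(num_segments):
        start = seg * n // num_segments
        end = (seg + 1) * n // num_segments
        # Rank frames in this segment by descending score; ties go to
        # the earlier frame so the result is deterministic.
        ranked = sorted(range(start, end), key=lambda i: (-norm[i], i))
        kept.extend(ranked[:k])
    return sorted(kept)


if __name__ == "__main__":
    frame_scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.05, 0.4]
    print(top_k_segment_select(frame_scores, num_segments=4, k=1))
```

With the eight illustrative scores above, four segments, and k = 1, the sketch keeps frames 1, 2, 4, and 7: one frame from each quarter of the clip, which is the kind of temporally distributed selection the abstract attributes to the method. Selecting purely by global rank would instead keep frames 1, 4, 5, and 7, clustering the kept frames where scores are highest.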

Keywords
Video recognition, Action recognition, ViViT, Pruning
Published
2026-06-17
Publisher
EAI
http://dx.doi.org/10.4108/eai.18-12-2025.2365275
Copyright © 2025–2026 EAI