Context-Aware Systems and Applications. 12th EAI International Conference, ICCASA 2023, Ho Chi Minh City, Vietnam, October 26-27, 2023, Proceedings

Research Article

3D CNN with BERT and Vision Transformer for Video Recognition

Cite
  • @INPROCEEDINGS{10.1007/978-3-031-58878-5_11,
        author={Bao Thai Duong and Thai Hoang Le},
        title={3D CNN with BERT and Vision Transformer for Video Recognition},
        proceedings={Context-Aware Systems and Applications. 12th EAI International Conference, ICCASA 2023, Ho Chi Minh City, Vietnam, October 26-27, 2023, Proceedings},
        proceedings_a={ICCASA},
        year={2024},
        month={8},
        keywords={Action Recognition, Video Recognition, Vision Transformers, 3D Convolution Neural Networks (3D CNNs)},
        doi={10.1007/978-3-031-58878-5_11}
    }
    
  • Bao Thai Duong
    Thai Hoang Le
    Year: 2024
    3D CNN with BERT and Vision Transformer for Video Recognition
    ICCASA
    Springer
    DOI: 10.1007/978-3-031-58878-5_11
Bao Thai Duong1, Thai Hoang Le1,*
  • 1: Faculty of Information Technology
*Contact email: lhthai@fit.hcmus.edu.vn

Abstract

With the development of monitoring systems, detection and recognition have become major areas of interest in computer vision. In recent years, owing to their capacity to extract spatiotemporal video features, 3D CNN architectures with BERT have proven to be the best solution to this problem. The Vision Transformer (ViT) has performed exceptionally well in recent benchmarks for image classification, object detection, and semantic image segmentation, among other computer vision applications. Transferring knowledge from such powerful ViT models is an intriguing opportunity for developing excellent video recognition models. In this work, we discuss and evaluate both methods on the HMDB-51 dataset to identify their advantages and disadvantages. The study shows that both methods improve the performance and accuracy of video recognition.

Keywords
Action Recognition; Video Recognition; Vision Transformers; 3D Convolution Neural Networks (3D CNNs)
Published
2024-08-19
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-58878-5_11