
Research Article
3D CNN with BERT and Vision Transformer for Video Recognition
@INPROCEEDINGS{10.1007/978-3-031-58878-5_11,
  author={Bao Thai Duong and Thai Hoang Le},
  title={3D CNN with BERT and Vision Transformer for Video Recognition},
  proceedings={Context-Aware Systems and Applications. 12th EAI International Conference, ICCASA 2023, Ho Chi Minh City, Vietnam, October 26-27, 2023, Proceedings},
  proceedings_a={ICCASA},
  year={2024},
  month={8},
  keywords={Action Recognition; Video Recognition; Vision Transformers; 3D Convolution Neural Networks (3D CNNs)},
  doi={10.1007/978-3-031-58878-5_11}
}
- Bao Thai Duong
- Thai Hoang Le
Year: 2024
3D CNN with BERT and Vision Transformer for Video Recognition
ICCASA
Springer
DOI: 10.1007/978-3-031-58878-5_11
Abstract
With the growth of video monitoring systems, detection and recognition have become major areas of interest within the field of computer vision. In recent years, 3D CNN architectures combined with BERT have proven to be among the most effective solutions to this problem, owing to their capacity to extract spatiotemporal video features. Vision Transformers (ViT) have also performed exceptionally well in recent benchmarks for image classification, object detection, and semantic image segmentation, among other computer vision applications. Transferring knowledge from such powerful ViT models is therefore an intriguing opportunity for building strong video recognition models. In this work, we discuss and evaluate both approaches on the HMDB-51 dataset to analyze their advantages and disadvantages. The results show that both methods improve the performance and accuracy of video recognition.
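
The paper's implementation details are not reproduced on this page, so the sketch below only illustrates the general "3D CNN with BERT" pattern the abstract refers to: a 3D convolutional backbone extracts per-frame spatiotemporal features, and a BERT-style Transformer encoder with a learnable classification token performs late temporal pooling before classification on the 51 HMDB-51 classes. The class name, the tiny backbone, and all layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Minimal illustrative sketch, assuming a PyTorch setup; not the authors' model.
import torch
import torch.nn as nn


class CNN3DWithBERT(nn.Module):
    def __init__(self, num_classes=51, num_frames=16, feat_dim=512,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Tiny 3D CNN backbone (a stand-in for backbones such as I3D or R(2+1)D).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(feat_dim), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the temporal axis
        )
        # BERT-style temporal encoder: [CLS] token + learned positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                     # video: (B, 3, T, H, W)
        feats = self.backbone(video)              # (B, C, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, C) sequence of frame tokens
        cls = self.cls_token.expand(feats.size(0), -1, -1)
        tokens = torch.cat([cls, feats], dim=1) + self.pos_embed[:, :feats.size(1) + 1]
        encoded = self.temporal_encoder(tokens)
        return self.classifier(encoded[:, 0])     # classify from the [CLS] token


clip = torch.randn(2, 3, 16, 112, 112)            # two 16-frame RGB clips
logits = CNN3DWithBERT(num_classes=51)(clip)      # (2, 51) scores for HMDB-51 classes
```

The key design point this sketch highlights is late temporal modeling: spatial information is pooled away inside the backbone, while the order of frame-level features is modeled afterwards by the Transformer encoder, and the prediction is read from the [CLS] token.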