
Research Article
3D CNN with BERT and Vision Transformer for Video Recognition
@INPROCEEDINGS{10.1007/978-3-031-58878-5_11,
  author={Bao Thai Duong and Thai Hoang Le},
  title={3D CNN with BERT and Vision Transformer for Video Recognition},
  proceedings={Context-Aware Systems and Applications. 12th EAI International Conference, ICCASA 2023, Ho Chi Minh City, Vietnam, October 26-27, 2023, Proceedings},
  proceedings_a={ICCASA},
  year={2024},
  month={8},
  keywords={Action Recognition; Video Recognition; Vision Transformers; 3D Convolution Neural Networks (3D CNNs)},
  doi={10.1007/978-3-031-58878-5_11}
}
- Bao Thai Duong
- Thai Hoang Le
Year: 2024
3D CNN with BERT and Vision Transformer for Video Recognition
ICCASA
Springer
DOI: 10.1007/978-3-031-58878-5_11
Abstract
With the growth of video monitoring systems, detection and recognition have become major areas of interest within the field of computer vision. In recent years, 3D CNN architectures combined with BERT have proven to be among the most effective solutions to this problem, owing to their capacity to extract spatiotemporal video features. Vision Transformers (ViT) have also performed exceptionally well in recent benchmarks for image classification, object detection, and semantic image segmentation, among other computer vision applications. Transferring knowledge from such powerful ViT models is therefore an intriguing opportunity for building strong video recognition models. In this work, we discuss and evaluate both approaches on the HMDB-51 dataset to analyze their advantages and disadvantages. The results show that both methods improve the performance and accuracy of video recognition.
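
The paper's implementation details are not reproduced on this page, so the sketch below only illustrates the general "3D CNN with BERT" pattern the abstract refers to: a 3D convolutional backbone extracts per-frame spatiotemporal features, and a BERT-style Transformer encoder with a learnable classification token performs late temporal pooling before classification on the 51 HMDB-51 classes. The class name, the tiny backbone, and all layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Minimal illustrative sketch, assuming a PyTorch setup; not the authors' model.
import torch
import torch.nn as nn


class CNN3DWithBERT(nn.Module):
    def __init__(self, num_classes=51, num_frames=16, feat_dim=512,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Tiny 3D CNN backbone (a stand-in for backbones such as I3D or R(2+1)D).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(feat_dim), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the temporal axis
        )
        # BERT-style temporal encoder: [CLS] token + learned positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                     # video: (B, 3, T, H, W)
        feats = self.backbone(video)              # (B, C, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, C) sequence of frame tokens
        cls = self.cls_token.expand(feats.size(0), -1, -1)
        tokens = torch.cat([cls, feats], dim=1) + self.pos_embed[:, :feats.size(1) + 1]
        encoded = self.temporal_encoder(tokens)
        return self.classifier(encoded[:, 0])     # classify from the [CLS] token


clip = torch.randn(2, 3, 16, 112, 112)            # two 16-frame RGB clips
logits = CNN3DWithBERT(num_classes=51)(clip)      # (2, 51) scores for HMDB-51 classes
```

The key design point this sketch highlights is late temporal modeling: spatial information is pooled away inside the backbone, while the order of frame-level features is modeled afterwards by the Transformer encoder, and the prediction is read from the [CLS] token.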