Attention Distillation for Accuracy Improvement of Vision Transformer

Taiga Tanaka; Ryuto Ishibashi; Yifan Xu; Lin Meng

Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.4108/eai.18-12-2025.2365289,
    author={Taiga Tanaka and Ryuto Ishibashi and Yifan Xu and Lin Meng},
    title={Attention Distillation for Accuracy Improvement of Vision Transformer},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={Computer Vision, CNN, Vision Transformer, Knowledge Distillation},
    doi={10.4108/eai.18-12-2025.2365289}
}

Taiga Tanaka¹, Ryuto Ishibashi¹, Yifan Xu¹, Lin Meng²,*

1: Graduate School of Science and Engineering, Ritsumeikan University
2: College of Science and Engineering, Ritsumeikan University

*Contact email: menglin@fc.ritsumei.ac.jp

Abstract

In recent years, rapid technological advances have driven progress in image recognition and natural language processing, leading to numerous applications and services. Deep learning in particular has greatly improved accuracy in these fields, and AI is becoming increasingly common in everyday life. Within image recognition, the Vision Transformer (ViT) has emerged as a robust architecture. However, ViT requires a large training dataset to reach high accuracy. To address this problem, DeiT, which uses knowledge distillation, has been proposed. DeiT, however, does not consider each token’s contribution to the classification token ([CLS]) during learning. This study proposes ADViT, a knowledge-distillation training method that uses the contribution of each token to [CLS]. Specifically, the attention map of each token for [CLS] is calculated, and in each layer the student model is trained to bring its attention map closer to that of the teacher model. Experimental results show the effectiveness of the proposed method: ADViT-Tiny(L1) achieves 98.35%, ADViT-Tiny(L2) 98.06%, ADViT-Small(L1) 99.00%, and ADViT-Small(L2) 98.75% accuracy, all higher than DeiT-Tiny (96.39%) and DeiT-Small (96.83%). Future work includes managing the number of heads and reducing computation for further optimization.
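
The layer-wise attention matching described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation; it rests on several assumptions the abstract does not confirm: that the "attention map for [CLS]" is the [CLS] query's row of scaled dot-product attention, that heads are averaged before comparison, that student and teacher layers are paired one-to-one with the same number of tokens, and that the (L1)/(L2) labels refer to L1 and L2 distances. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(q_cls, keys):
    """[CLS] attention row for one head (assumed scaled dot-product attention).

    q_cls: query vector of the [CLS] token (list of d floats).
    keys:  one key vector per token (N lists of d floats).
    Returns N weights that sum to 1.
    """
    d = len(q_cls)
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)


def head_average(rows):
    """Average per-head attention rows (assumption: heads are simply averaged)."""
    n_heads = len(rows)
    return [sum(col) / n_heads for col in zip(*rows)]


def attention_distillation_loss(student_layers, teacher_layers, norm="l1"):
    """Mean per-layer distance between student and teacher [CLS] attention rows.

    Each argument holds one attention row (list of N floats) per layer.
    norm="l1" uses the mean absolute difference, norm="l2" the mean squared
    difference; mapping these to the paper's (L1)/(L2) variants is an assumption.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("this sketch assumes a one-to-one layer pairing")
    total = 0.0
    for s_row, t_row in zip(student_layers, teacher_layers):
        if len(s_row) != len(t_row):
            raise ValueError("this sketch assumes equal token counts")
        diffs = [s - t for s, t in zip(s_row, t_row)]
        if norm == "l1":
            total += sum(abs(x) for x in diffs) / len(diffs)
        elif norm == "l2":
            total += sum(x * x for x in diffs) / len(diffs)
        else:
            raise ValueError("norm must be 'l1' or 'l2'")
    return total / len(student_layers)
```

In a training loop, a term like this would typically be added to the usual classification and distillation losses with a weighting factor; how ADViT combines the terms is not stated in the abstract.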

Keywords: Computer Vision, CNN, Vision Transformer, Knowledge Distillation

Published: 2026-06-17
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.18-12-2025.2365289
