
Research Article
Attention Distillation for Accuracy Improvement of Vision Transformer
@INPROCEEDINGS{10.4108/eai.18-12-2025.2365289,
    author={Taiga Tanaka and Ryuto Ishibashi and Yifan Xu and Lin Meng},
    title={Attention Distillation for Accuracy Improvement of Vision Transformer},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={Computer Vision CNN Vision Transformer Knowledge Distillation},
    doi={10.4108/eai.18-12-2025.2365289}
}
Taiga Tanaka
Ryuto Ishibashi
Yifan Xu
Lin Meng
Year: 2026
IIKI
EAI
DOI: 10.4108/eai.18-12-2025.2365289
Abstract
In recent years, rapid technological advances have driven progress in image recognition and natural language processing, leading to numerous applications and services. Deep learning in particular has greatly improved accuracy in these fields, and AI is becoming increasingly widespread in everyday life. Within image recognition, the Vision Transformer (ViT) has emerged as a robust architecture. However, ViT requires a large training dataset to reach high accuracy. To address this problem, DeiT, which uses knowledge distillation, has been proposed. DeiT, however, does not consider the contribution of each token to the classification token ([CLS]) during learning. This study proposes ADViT, a knowledge distillation training method that uses the contribution of each token to [CLS]. Specifically, the attention map of each token with respect to [CLS] is calculated, and in each layer the student model is trained so that its attention map approaches that of the teacher model. Experimental results show the effectiveness of the proposed method: ADViT-Tiny (L1) reaches 98.35%, ADViT-Tiny (L2) 98.06%, ADViT-Small (L1) 99.00%, and ADViT-Small (L2) 98.75% accuracy, all higher than DeiT-Tiny (96.39%) and DeiT-Small (96.83%). Future work includes managing the number of heads and reducing computation for further optimization.
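The abstract describes the layer-wise attention matching only in words, so the following is a minimal Python sketch of one way such a loss could look, not the authors' implementation. Several points are assumptions that the abstract does not confirm: [CLS] sits at token index 0, the student and teacher have the same number of layers and tokens, each layer's [CLS] attention has already been reduced to a single row (for example by averaging over heads with the hypothetical `head_average` helper), and the "(L1)" and "(L2)" labels refer to mean absolute and mean squared differences between the two attention rows. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(q_cls, keys):
    """Attention weights of the [CLS] query over all tokens for one head.

    q_cls: query vector of the [CLS] token (assumed to be token 0).
    keys:  one key vector per token, including [CLS] itself.
    """
    scale = math.sqrt(len(q_cls))
    scores = [sum(q * k for q, k in zip(q_cls, key)) / scale for key in keys]
    return softmax(scores)


def head_average(rows):
    """Average per-head [CLS] attention rows into one row (an assumption)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def attention_distill_loss(student_layers, teacher_layers, norm="l1"):
    """Mean per-layer distance between student and teacher [CLS] attention.

    Each argument is a list with one attention row per layer.
    norm="l1" uses mean absolute difference, norm="l2" mean squared difference.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("sketch assumes the same number of layers")
    total = 0.0
    for s_row, t_row in zip(student_layers, teacher_layers):
        if norm == "l1":
            total += sum(abs(s - t) for s, t in zip(s_row, t_row)) / len(s_row)
        elif norm == "l2":
            total += sum((s - t) ** 2 for s, t in zip(s_row, t_row)) / len(s_row)
        else:
            raise ValueError("norm must be 'l1' or 'l2'")
    return total / len(student_layers)
```

In a training loop, a term like this would typically be added to the usual classification or distillation loss with a weighting factor; the abstract does not state how the terms are combined, so that weighting is left out here.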


