Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

Attention Distillation for Accuracy Improvement of Vision Transformer

Taiga Tanaka^1, Ryuto Ishibashi^1, Yifan Xu^1, Lin Meng^2,*
  • 1: Graduate School of Science and Engineering, Ritsumeikan University
  • 2: College of Science and Engineering, Ritsumeikan University
*Contact email: menglin@fc.ritsumei.ac.jp

Abstract

In recent years, rapid technological progress has advanced image recognition and natural language processing and has led to numerous applications and services. Deep learning in particular has greatly improved accuracy in these fields, and AI is becoming more widespread in everyday life. Within image recognition, the Vision Transformer (ViT) has emerged as a robust architecture. However, ViT requires a large training dataset to reach high accuracy. DeiT, which uses knowledge distillation, has been proposed to address this problem, but DeiT does not consider each token’s contribution to the classification token ([CLS]) during learning. This study proposes ADViT, a knowledge distillation training method that uses the contribution of each token to [CLS]. Specifically, the attention map of each token for [CLS] is calculated, and in each layer the student model is trained to bring its attention map closer to that of the teacher model. Experimental results show the effectiveness of the proposed method: ADViT-Tiny(L1) reaches 98.35%, ADViT-Tiny(L2) 98.06%, ADViT-Small(L1) 99.00%, and ADViT-Small(L2) 98.75% accuracy, all higher than DeiT-Tiny (96.39%) and DeiT-Small (96.83%). Future work includes managing the number of heads and reducing computation for further optimization.
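
The per-layer attention matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the (L1) and (L2) labels refer to the distance applied between the student and teacher attention maps, that both models have the same number of layers and tokens, and that attention heads have already been averaged. The function names `cls_attention` and `attention_distillation_loss` are hypothetical.

```python
import math


def cls_attention(q_cls, keys):
    """Softmax attention weights of the [CLS] query over all token keys.

    q_cls: list of floats (the [CLS] query vector).
    keys:  list of key vectors, one per token, same dimension as q_cls.
    """
    d = len(q_cls)
    # Scaled dot-product scores between the [CLS] query and every key.
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d) for key in keys]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distillation_loss(student_maps, teacher_maps, norm="l1"):
    """Mean per-layer distance between student and teacher [CLS] attention rows.

    student_maps, teacher_maps: one attention row per layer.
    norm: "l1" (mean absolute error) or "l2" (mean squared error).
    Assumption: layers are matched one-to-one; the paper may match them differently.
    """
    if len(student_maps) != len(teacher_maps):
        raise ValueError("this sketch assumes the same number of layers")
    layer_losses = []
    for s_row, t_row in zip(student_maps, teacher_maps):
        if len(s_row) != len(t_row):
            raise ValueError("this sketch assumes the same number of tokens")
        if norm == "l1":
            diffs = [abs(s - t) for s, t in zip(s_row, t_row)]
        elif norm == "l2":
            diffs = [(s - t) ** 2 for s, t in zip(s_row, t_row)]
        else:
            raise ValueError("norm must be 'l1' or 'l2'")
        layer_losses.append(sum(diffs) / len(diffs))
    # Average over layers so the value does not grow with model depth.
    return sum(layer_losses) / len(layer_losses)


# Toy usage: one layer, three tokens, two-dimensional queries and keys.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
teacher = [cls_attention([1.0, 0.0], keys)]
student = [cls_attention([0.0, 1.0], keys)]
loss_l1 = attention_distillation_loss(student, teacher, norm="l1")
loss_l2 = attention_distillation_loss(student, teacher, norm="l2")
```

In a full training setup, a term like this would presumably be added to the usual classification and distillation losses with some weighting. The abstract does not state that weighting, so it is left out here.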

Keywords
Computer Vision, CNN, Vision Transformer, Knowledge Distillation
Published
2026-06-17
Publisher
EAI
http://dx.doi.org/10.4108/eai.18-12-2025.2365289
Copyright © 2025–2026 EAI