Collaborative Computing: Networking, Applications and Worksharing. 17th EAI International Conference, CollaborateCom 2021, Virtual Event, October 16-18, 2021, Proceedings, Part II

Research Article

MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance

Cite (BibTeX)
@INPROCEEDINGS{10.1007/978-3-030-92638-0_19,
    author={Jiahui Huang and Bin Cao and Jiaxing Wang and Jing Fan},
    title={MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance},
    proceedings={Collaborative Computing: Networking, Applications and Worksharing. 17th EAI International Conference, CollaborateCom 2021, Virtual Event, October 16-18, 2021, Proceedings, Part II},
    proceedings_a={COLLABORATECOM PART 2},
    year={2022},
    month={1},
    keywords={Pre-trained language model, BERT, Self-distillation, Multi-layer, EMD},
    doi={10.1007/978-3-030-92638-0_19}
}
Jiahui Huang1, Bin Cao1,*, Jiaxing Wang1, Jing Fan1
  • 1: College of Computer Science and Technology
*Contact email: bincao@zjut.edu.cn

Abstract

Over the past three years, pre-trained language models have been widely used in various natural language processing tasks and have achieved significant progress. However, their high computational cost seriously limits their efficiency, which severely impairs their application in resource-limited industries. To improve the efficiency of the model while preserving its accuracy, we propose MS-BERT, a multi-layer self-distillation approach for BERT compression based on Earth Mover’s Distance (EMD), which has the following features: (1) MS-BERT allows the lightweight network (student) to learn from all layers of the large model (teacher). In this way, the student can learn different levels of knowledge from the teacher, which enhances its performance. (2) Earth Mover’s Distance (EMD) is introduced to calculate the distance between the teacher layers and the student layers, achieving multi-layer knowledge transfer from teacher to student. (3) Two design strategies for the student layers and a top-K uncertainty calculation method are proposed to further improve MS-BERT’s performance. Extensive experiments conducted on different datasets show that our model can be 2 to 12 times faster than BERT under different accuracy losses.
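
To make the multi-layer transfer concrete, the following is a minimal sketch, not the authors’ implementation, of how an EMD-style loss between teacher and student layer representations could be computed. The uniform layer weights, the mean-squared-error cost, and the use of a generic linear-program solver are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def layer_cost(t_hidden, s_hidden):
    # Mean-squared error between one teacher layer and one student layer
    # (both shaped [seq_len, hidden]); the cost metric is an assumption.
    return float(np.mean((t_hidden - s_hidden) ** 2))

def emd_layer_loss(teacher_layers, student_layers):
    # Solve the optimal-transport (EMD) problem between M teacher layers
    # and N student layers; returns the minimal total transport cost.
    M, N = len(teacher_layers), len(student_layers)
    t_w = np.full(M, 1.0 / M)   # uniform teacher layer weights (assumed)
    s_w = np.full(N, 1.0 / N)   # uniform student layer weights (assumed)
    C = np.array([[layer_cost(t, s) for s in student_layers]
                  for t in teacher_layers])
    # Linear program over the flow matrix F (flattened row-major):
    # minimize sum_ij C[i,j] * F[i,j]
    # subject to: row sums = t_w, column sums = s_w, F >= 0.
    A_eq = np.zeros((M + N, M * N))
    for i in range(M):
        A_eq[i, i * N:(i + 1) * N] = 1.0   # flow out of teacher layer i
    for j in range(N):
        A_eq[M + j, j::N] = 1.0            # flow into student layer j
    b_eq = np.concatenate([t_w, s_w])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Toy usage: a 12-layer "teacher" and a 4-layer "student" with random hidden states.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(8, 16)) for _ in range(12)]
student = [rng.normal(size=(8, 16)) for _ in range(4)]
print("EMD layer-transfer loss:", emd_layer_loss(teacher, student))

Minimizing this transport cost lets every teacher layer contribute to every student layer in proportion to the learned flow, which is the intuition behind the multi-layer knowledge transfer described above.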

Keywords
Pre-trained language model, BERT, Self-distillation, Multi-layer, EMD
Published
2022-01-01
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-030-92638-0_19