
Research Article
MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance
@INPROCEEDINGS{10.1007/978-3-030-92638-0_19,
  author={Jiahui Huang and Bin Cao and Jiaxing Wang and Jing Fan},
  title={MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance},
  proceedings={Collaborative Computing: Networking, Applications and Worksharing. 17th EAI International Conference, CollaborateCom 2021, Virtual Event, October 16-18, 2021, Proceedings, Part II},
  proceedings_a={COLLABORATECOM PART 2},
  year={2022},
  month={1},
  keywords={Pre-trained language model, BERT, Self-distillation, Multi-layer, EMD},
  doi={10.1007/978-3-030-92638-0_19}
}
Jiahui Huang
Bin Cao
Jiaxing Wang
Jing Fan
Year: 2022
MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance
COLLABORATECOM PART 2
Springer
DOI: 10.1007/978-3-030-92638-0_19
Abstract
In the past three years, pre-trained language models have been widely used in various natural language processing tasks and have achieved significant progress. However, their high computational cost seriously limits their efficiency, which severely impairs their application in resource-limited industries. To improve the efficiency of the model while preserving its accuracy, we propose MS-BERT, a multi-layer self-distillation approach for BERT compression based on Earth Mover’s Distance (EMD), which has the following features: (1) MS-BERT allows the lightweight network (student) to learn from all layers of the large model (teacher). In this way, the student can learn different levels of knowledge from the teacher, which enhances the student’s performance. (2) Earth Mover’s Distance (EMD) is introduced to calculate the distance between the teacher layers and the student layers, achieving multi-layer knowledge transfer from teacher to student. (3) Two design strategies for the student layers and a top-K uncertainty calculation method are proposed to further improve MS-BERT’s performance. Extensive experiments conducted on different datasets show that our model can be 2 to 12 times faster than BERT under different accuracy losses.
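
To make the EMD-based layer matching concrete, the following is a minimal Python sketch of how a multi-layer transfer loss of this kind could be computed. It is not the authors’ released implementation: the function name emd_layer_loss, the uniform layer weights, the mean-pooled layer representations, and the per-pair MSE cost are all illustrative assumptions; the paper’s actual cost function and layer-weighting scheme may differ.

# Sketch of EMD-based multi-layer knowledge transfer (assumptions noted above):
# the cost of moving "knowledge" from each teacher layer to each student layer
# is the MSE between their pooled hidden states, and EMD finds the
# minimum-cost flow between the two layer sets.
import numpy as np
from scipy.optimize import linprog

def emd_layer_loss(teacher_layers, student_layers):
    """teacher_layers: list of T arrays, student_layers: list of S arrays,
    each of shape (hidden_dim,), e.g. mean-pooled hidden states."""
    T, S = len(teacher_layers), len(student_layers)
    # Cost matrix: per-pair mean squared error between layer representations.
    C = np.array([[np.mean((t - s) ** 2) for s in student_layers]
                  for t in teacher_layers])
    # Uniform layer weights (an assumption; MS-BERT may weight layers differently).
    w_t = np.full(T, 1.0 / T)
    w_s = np.full(S, 1.0 / S)
    # Transportation problem: minimize sum_ij f_ij * C_ij subject to
    # row sums of f equal w_t, column sums equal w_s, and f >= 0.
    A_eq = np.zeros((T + S, T * S))
    for i in range(T):
        A_eq[i, i * S:(i + 1) * S] = 1.0   # total flow out of teacher layer i
    for j in range(S):
        A_eq[T + j, j::S] = 1.0            # total flow into student layer j
    b_eq = np.concatenate([w_t, w_s])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                  method="highs")
    return res.fun  # EMD = minimum total transport cost (the distillation loss)

# Toy usage: a 12-layer teacher distilled into a 4-layer student.
rng = np.random.default_rng(0)
teacher = [rng.standard_normal(768) for _ in range(12)]
student = [rng.standard_normal(768) for _ in range(4)]
print(emd_layer_loss(teacher, student))

Because the flow is solved jointly over all teacher-student layer pairs, every teacher layer can contribute to every student layer, which is what lets the student absorb different levels of knowledge rather than being tied to a fixed one-to-one layer mapping.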