
Research Article
Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention
@INPROCEEDINGS{10.1007/978-3-031-47359-3_11,
  author    = {Phuong-Nam Tran and Thuy-Duong Thi Vu and Duc Ngoc Minh Dang and Nhat Truong Pham and Anh-Khoa Tran},
  title     = {Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention},
  booktitle = {Industrial Networks and Intelligent Systems. 9th EAI International Conference, INISCOM 2023, Ho Chi Minh City, Vietnam, August 2-3, 2023, Proceedings},
  publisher = {Springer},
  year      = {2023},
  month     = {10},
  keywords  = {3M-SER; Multi-modal analysis; Speech Emotion Recognition; Multi-head Attention; Multi-feature Embeddings},
  doi       = {10.1007/978-3-031-47359-3_11}
}
Phuong-Nam Tran
Thuy-Duong Thi Vu
Duc Ngoc Minh Dang
Nhat Truong Pham
Anh-Khoa Tran
Year: 2023
INISCOM
Springer
DOI: 10.1007/978-3-031-47359-3_11
Abstract
Recent research has shown that multi-modal learning is an effective way to improve classification performance by combining several forms of input, notably in speech emotion recognition (SER) tasks. However, discrepancies between the modalities can degrade SER performance. To overcome this problem, this paper proposes a novel multi-modal SER approach called 3M-SER. 3M-SER leverages multi-head attention to fuse information from multiple feature embeddings, including audio and text features. It builds on the SERVER approach but adds a fusion module that improves the integration of text and audio features, leading to better classification performance. To further strengthen the correlation between the modalities, a LayerNorm is applied to the audio features prior to fusion. The proposed approach achieved an unweighted accuracy (UA) and weighted accuracy (WA) of 79.96% and 80.66%, respectively, on the IEMOCAP benchmark dataset, outperforming SERVER and recent methods that follow similar approaches. These results highlight the effectiveness of incorporating an additional fusion module in multi-modal learning.
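To illustrate the fusion described in the abstract, below is a minimal PyTorch sketch of combining text and audio embeddings with multi-head attention, with LayerNorm applied to the audio features before fusion. The module name, the linear projection, the dimensionalities (128-d VGGish frames, 768-d BERT tokens), and the choice of text as query with audio as key/value are illustrative assumptions; the exact 3M-SER configuration in the paper may differ.

import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    # Sketch of a 3M-SER-style fusion block: text embeddings attend to
    # LayerNorm-ed audio embeddings through multi-head attention.
    def __init__(self, audio_dim=128, text_dim=768, num_heads=8):
        super().__init__()
        self.audio_norm = nn.LayerNorm(audio_dim)          # LayerNorm on audio prior to fusion
        self.audio_proj = nn.Linear(audio_dim, text_dim)   # dimension alignment (an assumption)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, text_len, 768), e.g. BERT token embeddings
        # audio_emb: (batch, audio_len, 128), e.g. VGGish frame embeddings
        audio = self.audio_proj(self.audio_norm(audio_emb))
        fused, _ = self.cross_attn(query=text_emb, key=audio, value=audio)
        return fused  # fused features to be passed to an emotion classifier head


# Toy usage with random tensors standing in for BERT/VGGish outputs
fusion = AttentionFusion()
text = torch.randn(4, 50, 768)    # stand-in for BERT output
audio = torch.randn(4, 30, 128)   # stand-in for VGGish output
out = fusion(text, audio)         # shape: (4, 50, 768)

Cross-attention in this direction lets each text token gather the audio frames most relevant to it; the paper's reported gains come from adding such a fusion module on top of the SERVER baseline.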