
Research Article
Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention
@INPROCEEDINGS{10.1007/978-3-031-47359-3_11,
  author    = {Phuong-Nam Tran and Thuy-Duong Thi Vu and Duc Ngoc Minh Dang and Nhat Truong Pham and Anh-Khoa Tran},
  title     = {Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention},
  booktitle = {Industrial Networks and Intelligent Systems. 9th EAI International Conference, INISCOM 2023, Ho Chi Minh City, Vietnam, August 2-3, 2023, Proceedings},
  publisher = {Springer},
  year      = {2023},
  month     = {10},
  keywords  = {3M-SER; Multi-modal analysis; Speech Emotion Recognition; Multi-head Attention; Multi-feature Embeddings},
  doi       = {10.1007/978-3-031-47359-3_11}
}
Phuong-Nam Tran
Thuy-Duong Thi Vu
Duc Ngoc Minh Dang
Nhat Truong Pham
Anh-Khoa Tran
Year: 2023
INISCOM
Springer
DOI: 10.1007/978-3-031-47359-3_11
Abstract
Recent research has shown that multi-modal learning is an effective way to improve classification performance by combining several forms of input, notably in speech emotion recognition (SER) tasks. However, discrepancies between the modalities can degrade SER performance. To overcome this problem, this paper proposes a novel multi-modal SER approach called 3M-SER. 3M-SER leverages multi-head attention to fuse information from multiple feature embeddings, including audio and text features. It builds on the SERVER approach but adds a fusion module that improves the integration of text and audio features, leading to better classification performance. To further strengthen the correlation between the modalities, a LayerNorm is applied to the audio features prior to fusion. The proposed approach achieved an unweighted accuracy (UA) and weighted accuracy (WA) of 79.96% and 80.66%, respectively, on the IEMOCAP benchmark dataset, outperforming SERVER and recent methods that follow similar approaches. These results highlight the effectiveness of incorporating an additional fusion module in multi-modal learning.
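To illustrate the fusion described in the abstract, below is a minimal PyTorch sketch of combining text and audio embeddings with multi-head attention, with LayerNorm applied to the audio features before fusion. The module name, the linear projection, the dimensionalities (128-d VGGish frames, 768-d BERT tokens), and the choice of text as query with audio as key/value are illustrative assumptions; the exact 3M-SER configuration in the paper may differ.

import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    # Sketch of a 3M-SER-style fusion block: text embeddings attend to
    # LayerNorm-ed audio embeddings through multi-head attention.
    def __init__(self, audio_dim=128, text_dim=768, num_heads=8):
        super().__init__()
        self.audio_norm = nn.LayerNorm(audio_dim)          # LayerNorm on audio prior to fusion
        self.audio_proj = nn.Linear(audio_dim, text_dim)   # dimension alignment (an assumption)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, text_len, 768), e.g. BERT token embeddings
        # audio_emb: (batch, audio_len, 128), e.g. VGGish frame embeddings
        audio = self.audio_proj(self.audio_norm(audio_emb))
        fused, _ = self.cross_attn(query=text_emb, key=audio, value=audio)
        return fused  # fused features to be passed to an emotion classifier head


# Toy usage with random tensors standing in for BERT/VGGish outputs
fusion = AttentionFusion()
text = torch.randn(4, 50, 768)    # stand-in for BERT output
audio = torch.randn(4, 30, 128)   # stand-in for VGGish output
out = fusion(text, audio)         # shape: (4, 50, 768)

Cross-attention in this direction lets each text token gather the audio frames most relevant to it; the paper's reported gains come from adding such a fusion module on top of the SERVER baseline.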