Industrial Networks and Intelligent Systems. 9th EAI International Conference, INISCOM 2023, Ho Chi Minh City, Vietnam, August 2-3, 2023, Proceedings

Research Article

Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention

Cite
BibTeX
@INPROCEEDINGS{10.1007/978-3-031-47359-3_11,
    author={Phuong-Nam Tran and Thuy-Duong Thi Vu and Duc Ngoc Minh Dang and Nhat Truong Pham and Anh-Khoa Tran},
    title={Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention},
    proceedings={Industrial Networks and Intelligent Systems. 9th EAI International Conference, INISCOM 2023, Ho Chi Minh City, Vietnam, August 2-3, 2023, Proceedings},
    proceedings_a={INISCOM},
    year={2023},
    month={10},
    keywords={3M-SER; Multi-modal analysis; Speech Emotion Recognition; Multi-head Attention; Multi-feature Embeddings},
    doi={10.1007/978-3-031-47359-3_11}
}

Plain text
Phuong-Nam Tran, Thuy-Duong Thi Vu, Duc Ngoc Minh Dang, Nhat Truong Pham, and Anh-Khoa Tran. 2023. Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention. INISCOM. Springer. DOI: 10.1007/978-3-031-47359-3_11
Phuong-Nam Tran1, Thuy-Duong Thi Vu1, Duc Ngoc Minh Dang1,*, Nhat Truong Pham2, Anh-Khoa Tran3
  • 1: Computing Fundamental Department
  • 2: Department of Integrative Biotechnology
  • 3: Modeling Evolutionary Algorithms Simulation and Artificial Intelligence, Faculty of Electrical and Electronics Engineering
*Contact email: ducdnm2@fe.edu.vn

Abstract

Recent research has shown that multi-modal learning is an effective way to improve classification performance by combining several forms of input, notably in speech emotion recognition (SER) tasks. However, discrepancies between the modalities can degrade SER performance. To address this problem, this paper proposes 3M-SER, a novel approach for multi-modal SER. 3M-SER leverages multi-head attention to fuse information from multiple feature embeddings, including audio and text features. It builds on the SERVER approach but adds a fusion module that improves the integration of text and audio features, leading to better classification performance. To further strengthen the correlation between the modalities, LayerNorm is applied to the audio features prior to fusion. On the IEMOCAP benchmark dataset, the proposed approach achieved an unweighted accuracy (UA) of 79.96% and a weighted accuracy (WA) of 80.66%, outperforming SERVER and recent methods with similar designs and highlighting the effectiveness of incorporating an extra fusion module in multi-modal learning.
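The fusion pattern the abstract describes can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the embedding sizes (128-d VGGish frames, 768-d BERT tokens), the shared projection dimension, the head count, the four-class output, and the choice of text queries attending over audio keys and values are all assumptions made for illustration. Only the overall flow, LayerNorm on the audio features, then multi-head attention fusion, then classification, follows the abstract.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative sketch of multi-head attention fusion of VGGish audio
    and BERT text embeddings, with LayerNorm applied to audio before fusion.
    Dimensions and fusion direction are assumptions, not the paper's spec."""

    def __init__(self, audio_dim=128, text_dim=768, embed_dim=256,
                 num_heads=4, num_classes=4):
        super().__init__()
        self.audio_norm = nn.LayerNorm(audio_dim)          # LayerNorm on audio prior to fusion
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # project both modalities
        self.text_proj = nn.Linear(text_dim, embed_dim)    # into a shared space
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, audio_len, 128); text_feats: (batch, text_len, 768)
        a = self.audio_proj(self.audio_norm(audio_feats))
        t = self.text_proj(text_feats)
        # One plausible fusion direction: text queries attend over audio keys/values.
        fused, _ = self.fusion(query=t, key=a, value=a)
        return self.classifier(fused.mean(dim=1))          # pool over time, then classify

model = FusionSketch()
logits = model(torch.randn(2, 10, 128), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point carried over from the abstract is that normalizing the audio embeddings before projection puts the two modalities on a more comparable scale, which is what the paper credits for strengthening the audio-text correlation during fusion.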

Keywords
3M-SER, Multi-modal analysis, Speech Emotion Recognition, Multi-head Attention, Multi-feature Embeddings
Published
2023-10-31
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-47359-3_11