Security and Privacy in New Computing Environments. 6th International Conference, SPNCE 2023, Guangzhou, China, November 25–26, 2023, Proceedings

Research Article

Speech Emotion Recognition Based on Recurrent Neural Networks with Conformer for Emotional Speech Synthesis

Cite
  • @INPROCEEDINGS{10.1007/978-3-031-73699-5_19,
        author={Xin Huang and Chenjing Sun and Jichen Yang and Xianhua Hou},
        title={Speech Emotion Recognition Based on Recurrent Neural Networks with Conformer for Emotional Speech Synthesis},
        proceedings={Security and Privacy in New Computing Environments. 6th International Conference, SPNCE 2023, Guangzhou, China, November 25--26, 2023, Proceedings},
        proceedings_a={SPNCE},
        year={2025},
        month={1},
        keywords={Speech emotion recognition; Emotional speech synthesis; Conformer},
        doi={10.1007/978-3-031-73699-5_19}
    }
    
  • Xin Huang, Chenjing Sun, Jichen Yang, Xianhua Hou. Speech Emotion Recognition Based on Recurrent Neural Networks with Conformer for Emotional Speech Synthesis. SPNCE. Springer, 2025. DOI: 10.1007/978-3-031-73699-5_19
Xin Huang1, Chenjing Sun1, Jichen Yang2,*, Xianhua Hou1
  • 1: School of Electronics and Information Engineering, South China Normal University
  • 2: School of Cyber Security, Guangdong Polytechnic Normal University
*Contact email: nisonyoung@163.com

Abstract

Speech emotion recognition is the basis of emotional speech synthesis: a good speech emotion recognition system can learn more of the emotional expression in speech and thereby help in synthesizing emotional speech. However, several issues make the speech emotion recognition task difficult, including background noise and the distinct speech characteristics of each speaker. The widely recognized speech emotion recognition system ACRNN extracts local features from speech signals using a CNN, and its attention mechanism concentrates on the emotional content of the speech data. However, because only a single attention module is used, it can neither attend simultaneously to information from distinct representation subspaces at different positions nor capture long-term global information. This paper proposes CoRNN, which applies a Conformer in place of the CNN and attention module in order to overcome these shortcomings of ACRNN. Experimental results on the IEMOCAP dataset demonstrate that the unweighted average recall of the proposed CoRNN reaches 65.53%, an improvement of 0.79% over ACRNN.
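To make the described architecture concrete, below is a minimal sketch of a Conformer-plus-recurrent emotion classifier using PyTorch and torchaudio's Conformer module. It is an illustrative assumption, not the authors' implementation: the 80-dimensional log-mel input, the four IEMOCAP emotion classes, all layer sizes, and the mean-pooling head are placeholder choices.

# Hypothetical sketch of a Conformer + RNN emotion classifier (not the paper's code).
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class CoRNNSketch(nn.Module):
    """Conformer encoder followed by a bidirectional LSTM and a linear classifier."""
    def __init__(self, input_dim=80, num_classes=4,
                 num_heads=4, ffn_dim=256, num_layers=4,
                 conv_kernel_size=31, lstm_hidden=128):
        super().__init__()
        # The Conformer stands in for ACRNN's CNN + attention stages.
        self.encoder = Conformer(
            input_dim=input_dim,
            num_heads=num_heads,
            ffn_dim=ffn_dim,
            num_layers=num_layers,
            depthwise_conv_kernel_size=conv_kernel_size,
        )
        # Recurrent layer retained from the ACRNN-style backbone.
        self.rnn = nn.LSTM(input_dim, lstm_hidden,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, feats, lengths):
        # feats: (batch, frames, input_dim) log-mel features; lengths: (batch,) valid frame counts.
        enc, lengths = self.encoder(feats, lengths)
        out, _ = self.rnn(enc)
        # Mean-pool over valid frames, then classify into emotion categories.
        mask = (torch.arange(out.size(1), device=out.device)[None, :] < lengths[:, None]).unsqueeze(-1)
        pooled = (out * mask).sum(dim=1) / lengths[:, None].clamp(min=1)
        return self.classifier(pooled)

# Toy usage: two utterances with 300 and 240 frames of 80-dim features.
model = CoRNNSketch()
feats = torch.randn(2, 300, 80)
lengths = torch.tensor([300, 240])
print(model(feats, lengths).shape)  # torch.Size([2, 4])

Running the toy example prints logits of shape (2, 4); in the paper's setting such a classifier would be trained on IEMOCAP features and evaluated by unweighted average recall.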

Keywords
Speech emotion recognition; Emotional speech synthesis; Conformer
Published
2025-01-01
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-73699-5_19
Copyright © 2023–2025 ICST