
Research Article
Two-Stage Multi-lingual Speech Emotion Recognition for Multi-lingual Emotional Speech Synthesis
@INPROCEEDINGS{10.1007/978-3-031-73699-5_14,
  author={Xin Huang and Zuqiang Zeng and Chenjing Sun and Jichen Yang},
  title={Two-Stage Multi-lingual Speech Emotion Recognition for Multi-lingual Emotional Speech Synthesis},
  proceedings={Security and Privacy in New Computing Environments. 6th International Conference, SPNCE 2023, Guangzhou, China, November 25--26, 2023, Proceedings},
  proceedings_a={SPNCE},
  year={2025},
  month={1},
  keywords={Speech emotion recognition, Multi-lingual, Emotional speech synthesis},
  doi={10.1007/978-3-031-73699-5_14}
}
- Xin Huang
- Zuqiang Zeng
- Chenjing Sun
- Jichen Yang
Year: 2025
SPNCE
Springer
DOI: 10.1007/978-3-031-73699-5_14
Abstract
In multi-lingual emotional speech synthesis, it is difficult to incorporate suitable emotional expression into the synthesis process because emotional expression differs across languages. To extract better emotional expressions for different languages and thereby assist multi-lingual emotional speech synthesis, this paper studies multi-lingual speech emotion recognition (SER). In current work on multi-lingual SER, the combining method (TCM) and the multi-task method (TMM) are the popular approaches. However, neither achieves good performance: TCM does not consider the emotional differences between languages, and for TMM it is not easy to train a good emotion recognition model and a good language recognition model at the same time. To address this issue, this paper proposes a two-stage multi-lingual SER method, in which language recognition identifies the language type at the first stage and emotion recognition is then applied at the second stage. In addition, wav2vec 2.0 is used as the input, and ResNet18 is selected as the model for language recognition and for emotion recognition. The experimental results show that the proposed method works for multi-lingual SER and performs better than TCM and TMM.
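The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes one emotion model per language, selected by the stage-one prediction, which the abstract does not state explicitly. The names `two_stage_ser`, `toy_language_model`, `toy_emotion_en`, and `toy_emotion_zh` are hypothetical, and the toy classifiers are simple threshold rules standing in for ResNet18 models over wav2vec 2.0 features.

```python
from typing import Callable, Dict, Sequence, Tuple

# Stand-in for a wav2vec 2.0 feature representation (assumption: a flat vector).
Features = Sequence[float]
# Any callable mapping features to a label; a ResNet18 wrapper would fit here.
Classifier = Callable[[Features], str]


def two_stage_ser(
    features: Features,
    language_model: Classifier,
    emotion_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Stage 1 predicts the language; stage 2 applies that language's emotion model."""
    language = language_model(features)
    if language not in emotion_models:
        raise KeyError(f"no emotion model registered for language {language!r}")
    emotion = emotion_models[language](features)
    return language, emotion


# Toy stand-ins (hypothetical rules, for illustration only).
def toy_language_model(features: Features) -> str:
    return "en" if features[0] >= 0 else "zh"


def toy_emotion_en(features: Features) -> str:
    return "happy" if sum(features) > 1 else "neutral"


def toy_emotion_zh(features: Features) -> str:
    return "angry" if max(features) > 0.5 else "neutral"


if __name__ == "__main__":
    models = {"en": toy_emotion_en, "zh": toy_emotion_zh}
    print(two_stage_ser([0.9, 0.8], toy_language_model, models))
    print(two_stage_ser([-0.2, 0.7], toy_language_model, models))
```

Under this assumed design, an error at the first stage sends the utterance to the wrong emotion model, so the accuracy of language recognition bounds the accuracy of the whole pipeline.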