Mobile Multimedia Communications. 14th EAI International Conference, Mobimedia 2021, Virtual Event, July 23-25, 2021, Proceedings

Research Article

Improved Speech Emotion Recognition Using LAM and CTC

Lingyuan Meng1, Zhe Sun1, Yang Liu1,*, Zhen Zhao1, Yongwei Li2
  • 1: School of Information Science and Technology, Qingdao University of Science and Technology
  • 2: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
*Contact email: yangliu@qust.edu.cn

Abstract

Time-sequence-based speech emotion recognition methods struggle to distinguish emotional frames of speech from non-emotional ones, and cannot quantify the amount of emotional information carried by each emotional frame. In this paper, we propose a speech emotion recognition method using a Local Attention Mechanism (LAM) and Connectionist Temporal Classification (CTC) to address these issues. First, we extract Variational Gammatone Cepstral Coefficient (VGFCC) emotional features from the speech as the input of a LAM-CTC shared encoder. Second, the CTC layer performs automatic hard alignment, which lets the network produce its largest activation at the emotional key frames of the utterance, while the LAM layer learns to weight emotional auxiliary frames to different degrees. Finally, a BP (back-propagation) neural network fuses the decoding outputs of the CTC and LAM layers to obtain the emotion prediction. Evaluation on IEMOCAP shows that the proposed model outperforms state-of-the-art methods, with a UAR of 68.5% and a WAR of 68.1%.
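The abstract outlines a two-branch topology: a shared encoder over VGFCC frames, a CTC head for hard alignment, a local-attention (LAM) head for soft frame weighting, and a BP network fusing the two decoding outputs. The following is a minimal PyTorch sketch of that topology, not the authors' implementation: the BiLSTM encoder, layer sizes, attention window, and fusion design are all assumptions.

import torch
import torch.nn as nn

class LAMCTC(nn.Module):
    """Sketch of the LAM-CTC topology described in the abstract."""

    def __init__(self, n_vgfcc=64, hidden=128, n_emotions=4, window=5):
        super().__init__()
        # Shared encoder over VGFCC frames (a BiLSTM is assumed here).
        self.encoder = nn.LSTM(n_vgfcc, hidden, batch_first=True,
                               bidirectional=True)
        # CTC head: per-frame scores over emotion labels plus a CTC blank.
        self.ctc_head = nn.Linear(2 * hidden, n_emotions + 1)
        # Local attention: each frame's score is computed from a small
        # window of neighboring frames (locality via a 1-D convolution).
        self.attn_score = nn.Conv1d(2 * hidden, 1, kernel_size=window,
                                    padding=window // 2)
        self.lam_head = nn.Linear(2 * hidden, n_emotions)
        # BP network fusing the two decoding outputs into one prediction.
        self.fusion = nn.Sequential(
            nn.Linear(2 * n_emotions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_emotions))

    def forward(self, x):                      # x: (batch, frames, n_vgfcc)
        h, _ = self.encoder(x)                 # (batch, frames, 2*hidden)
        ctc_logits = self.ctc_head(h)          # feed to nn.CTCLoss in training
        # Attention weights over frames, then a weighted utterance vector.
        scores = self.attn_score(h.transpose(1, 2)).transpose(1, 2)
        w = torch.softmax(scores, dim=1)       # (batch, frames, 1)
        lam_logits = self.lam_head((w * h).sum(dim=1))
        # Summarize the CTC branch (mean over frames, blank dropped).
        ctc_summary = ctc_logits[..., :-1].mean(dim=1)
        fused = self.fusion(torch.cat([lam_logits, ctc_summary], dim=-1))
        return ctc_logits, lam_logits, fused

A full training setup would presumably pair ctc_logits (after log_softmax) with nn.CTCLoss and the fused output with a cross-entropy loss; those training details are likewise assumptions, as the paper's exact configuration is not given on this page.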

Keywords
Speech emotion recognition, Attention, CTC, VGFCC, IEMOCAP
Published
2021-11-02
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-030-89814-4_55