14th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services

Research Article

Real Time Distant Speech Emotion Recognition in Indoor Environments

  • @INPROCEEDINGS{10.4108/eai.7-11-2017.2273791,
        author={Mohsin Ahmed and Zeya Chen and Emma Fass and John Stankovic},
        title={Real Time Distant Speech Emotion Recognition in Indoor Environments},
        proceedings={14th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services},
        publisher={ACM},
        proceedings_a={MOBIQUITOUS},
        year={2018},
        month={4},
        keywords={emotion, speech, noise and reverberation},
        doi={10.4108/eai.7-11-2017.2273791}
    }
    
Mohsin Ahmed1,*, Zeya Chen1, Emma Fass1, John Stankovic1
  • 1: University of Virginia
*Contact email: mya5dm@virginia.edu

Abstract

We develop solutions to challenges at each stage of the processing pipeline of a real-time indoor distant speech emotion recognition system, reducing the discrepancy between training and test conditions for distant emotion recognition. We use a novel combination of distorted-feature elimination, classifier optimization, several signal-cleaning techniques, and classifiers trained with synthetic reverberation obtained from a room impulse response generator to improve performance across a variety of rooms and source-to-microphone distances. Our comprehensive evaluation is based on a popular emotional corpus from the literature, two new customized datasets, and a dataset drawn from YouTube videos. The two new datasets are the first distance-aware emotional corpora; we created them by 1) injecting room impulse responses, collected in a variety of rooms at various source-to-microphone distances, into a public emotional corpus, and 2) re-recording the emotional corpus with microphones placed at different distances. The overall results show as much as a 15.51% improvement in distant emotion detection over baselines, with final emotion recognition accuracy ranging from 79.44% to 95.89% across different rooms, acoustic configurations, and source-to-microphone distances. We experimentally evaluate the CPU time of each system component and demonstrate the real-time capability of our system.
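
As an illustration of the reverberation-injection step described in the abstract, the sketch below shows how a clean utterance from an emotional corpus might be convolved with a room impulse response (RIR) to synthesize a reverberant training example. This is a minimal sketch, not the authors' implementation: the paper uses an RIR generator, whereas this code substitutes a toy exponentially decaying noise RIR, and the sample rate, RT60 value, and placeholder signal are illustrative assumptions.

    # Minimal sketch (not the paper's code): create reverberant training data
    # by convolving clean speech with a room impulse response (RIR).
    import numpy as np
    from scipy.signal import fftconvolve

    def synthetic_rir(fs=16000, rt60=0.5, length_s=1.0, seed=0):
        # Toy RIR: white noise shaped by an exponential decay whose rate
        # gives a 60 dB amplitude drop over rt60 seconds (k = 3*ln(10)/rt60).
        rng = np.random.default_rng(seed)
        t = np.arange(int(fs * length_s)) / fs
        return rng.standard_normal(t.size) * np.exp(-6.908 * t / rt60)

    def reverberate(clean, rir):
        # Convolve the utterance with the RIR, trim to the original
        # length, and renormalize to avoid clipping.
        wet = fftconvolve(clean, rir)[: len(clean)]
        return wet / (np.max(np.abs(wet)) + 1e-12)

    # Usage: substitute a real utterance from the emotional corpus here.
    fs = 16000
    clean = np.random.default_rng(1).standard_normal(2 * fs)  # 2 s placeholder
    wet = reverberate(clean, synthetic_rir(fs=fs, rt60=0.5))

Varying rt60 and the RIR (or using impulse responses measured at different source-to-microphone distances) yields the kind of distance-aware training set the abstract describes.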