Research Article
Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model
@INPROCEEDINGS{10.4108/ICST.ROBOCOMM2007.2088,
  author={Hiroyuki Sakai and Tobias Cincarek and Hiromichi Kawanami and Hiroshi Saruwatari and Kiyohiro Shikano and Akinobu Lee},
  title={Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model},
  proceedings={1st International ICST Conference on Robot Communication and Coordination},
  proceedings_a={ROBOCOMM},
  year={2010},
  month={5},
  keywords={},
  doi={10.4108/ICST.ROBOCOMM2007.2088}
}
Hiroyuki Sakai
Tobias Cincarek
Hiromichi Kawanami
Hiroshi Saruwatari
Kiyohiro Shikano
Akinobu Lee
Year: 2010
Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model
ROBOCOMM
ICST
DOI: 10.4108/ICST.ROBOCOMM2007.2088
Abstract
Speech recognition and speech-based dialogue are means for realizing communication between humans and robots. In a conventional system setup, a headset or a directional microphone is used to collect speech with a high signal-to-noise ratio (SNR). However, the user must wear a microphone or approach the system closely for interaction. It is therefore preferable to develop a hands-free speech recognition system that enables the user to speak to the system from a distance. To collect speech from distant speakers, a microphone array is usually employed. However, the SNR degrades in a real environment because of the presence of various kinds of background noise besides the user's utterance. This often decreases speech recognition performance, so that no reliable speech dialogue is possible. Voice Activity Detection (VAD) is a method for detecting the user's utterance in the input signal. If VAD fails, all subsequent processing steps, including speech recognition and dialogue, will not work. Conventional VAD based on amplitude level and zero-crossing count is difficult to apply to hands-free speech recognition, because speech detection will often fail due to the low SNR. This paper proposes a VAD method for hands-free speech recognition based on an acoustic model (AM) of background noise and the speech recognition algorithm itself. Since there are always non-speech segments at the beginning and end of each user utterance, the proposed VAD approach compares the likelihoods of phoneme and silence segments in the top recognition hypotheses during decoding. We implemented the proposed method in the open-source speech recognition engine Julius. Experimental results for various SNR conditions show that the proposed method attains higher VAD accuracy and a higher recognition rate than conventional VAD.
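The core decision described in the abstract, comparing the likelihood of speech (phoneme) hypotheses against silence hypotheses during decoding, can be sketched as follows. This is a minimal, hypothetical illustration of the general likelihood-comparison idea, not the authors' actual implementation in Julius; the function names, per-frame scores, and threshold are all assumptions for the example.

```python
# Hypothetical sketch of decoder-based VAD: for each frame, the decoder
# is assumed to supply a log-likelihood for the best phoneme (speech)
# hypothesis and for the silence model; a frame is labeled speech when
# the phoneme hypothesis wins by at least `threshold` in the log domain.

def vad_decision(speech_loglik, silence_loglik, threshold=0.0):
    """Return True (speech) when the phoneme hypothesis is more likely
    than the silence hypothesis by at least `threshold`."""
    return (speech_loglik - silence_loglik) > threshold

def segment_utterance(frame_scores, threshold=0.0):
    """frame_scores: list of (speech_loglik, silence_loglik) per frame.
    Returns a per-frame list of booleans marking speech frames."""
    return [vad_decision(s, n, threshold) for s, n in frame_scores]

# Toy scores: silence wins at the utterance boundaries, speech in the middle.
frames = [(-10.0, -5.0), (-4.0, -9.0), (-3.5, -8.0), (-12.0, -6.0)]
print(segment_utterance(frames))  # → [False, True, True, False]
```

A real system would additionally smooth these per-frame decisions (e.g. with minimum-duration constraints) before cutting utterance boundaries, since raw frame-level likelihood comparisons are noisy at low SNR.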