1st International ICST Conference on Robot Communication and Coordination

Research Article

Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model

  • @INPROCEEDINGS{10.4108/ICST.ROBOCOMM2007.2088,
        author={Hiroyuki Sakai and Tobias Cincarek and Hiromichi Kawanami and Hiroshi Saruwatari and Kiyohiro Shikano and Akinobu Lee},
        title={Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model},
        proceedings={1st International ICST Conference on Robot Communication and Coordination},
        proceedings_a={ROBOCOMM},
        year={2010},
        month={5},
        keywords={},
        doi={10.4108/ICST.ROBOCOMM2007.2088}
    }
    
  • Hiroyuki Sakai
    Tobias Cincarek
    Hiromichi Kawanami
    Hiroshi Saruwatari
    Kiyohiro Shikano
    Akinobu Lee
    Year: 2010
    Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model
    ROBOCOMM
    ICST
    DOI: 10.4108/ICST.ROBOCOMM2007.2088
Hiroyuki Sakai1,*, Tobias Cincarek1, Hiromichi Kawanami1, Hiroshi Saruwatari1, Kiyohiro Shikano1, Akinobu Lee2,*
  • 1: Graduate School of Information Science, Nara Institute of Science and Technology, 8916–5, Takayama-Cho, Ikoma-City, Nara, 630–0192, Japan
  • 2: Nagoya Institute of Technology, Japan
*Contact email: hiroyuki-s@is.naist.jp, ri@nitech.ac.jp

Abstract

Speech recognition and speech-based dialogue are means of realizing communication between humans and robots. In a conventional system setup, a headset or a directional microphone is used to collect speech with a high signal-to-noise ratio (SNR). However, the user must wear a microphone or approach the system closely for interaction. It is therefore preferable to develop a hands-free speech recognition system that enables the user to speak to the system from a distance. To collect speech from distant speakers, a microphone array is usually employed. However, the SNR degrades in a real environment because of the presence of various kinds of background noise besides the user's utterance. This often decreases speech recognition performance, making reliable speech dialogue impossible. Voice Activity Detection (VAD) is a method for detecting the user's utterance in the input signal. If VAD fails, all subsequent processing steps, including speech recognition and dialogue, will not work. Conventional VAD based on amplitude level and zero-crossing count is difficult to apply to hands-free speech recognition, because speech detection often fails at low SNR. This paper proposes a VAD method for hands-free speech recognition based on an acoustic model (AM) for background noise and the speech recognition algorithm itself. There are always non-speech segments at the beginning and end of each user utterance. The proposed VAD approach compares the likelihoods of phoneme and silence segments in the top recognition hypotheses during decoding. We implemented the proposed method for the open-source speech recognition engine Julius. Experimental results for various SNR conditions show that the proposed method attains higher VAD accuracy and a higher recognition rate than conventional VAD.
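The core idea of decoding-based VAD, as described in the abstract, is to compare model likelihoods rather than raw amplitude or zero-crossing statistics. A minimal sketch of that decision rule is shown below; it assumes per-frame log-likelihoods under a speech (phoneme) model and a silence/noise model have already been produced by the decoder, and it is an illustration of the general technique, not the actual Julius implementation.

```python
def vad_by_likelihood(speech_ll, silence_ll, threshold=0.0, min_run=3):
    """Label speech segments by comparing per-frame model log-likelihoods.

    speech_ll, silence_ll: per-frame log-likelihoods under the phoneme
        model and the silence/noise model (hypothetical inputs).
    threshold: margin by which the speech model must beat the silence model.
    min_run: shortest run of speech frames accepted as a speech segment,
        suppressing isolated spurious detections.
    Returns a list of (start, end) frame ranges, end-exclusive.
    """
    flags = [s - n > threshold for s, n in zip(speech_ll, silence_ll)]
    segments, start = [], None
    # Append a sentinel False so a trailing speech run is closed off.
    for i, is_speech in enumerate(flags + [False]):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_run:
                segments.append((start, i))
            start = None
    return segments
```

For example, with a clear speech region of six frames the function returns a single segment, while a one-frame noise spike shorter than `min_run` is discarded. In the paper's setting the two likelihood streams would come from the recognizer's top hypotheses during decoding, so VAD and recognition share the same acoustic and language models.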