A Robust Visual Feature Extraction based BTSM-LDA for Audio-Visual Speech Recognition

Guoyun Lv; Rongchun Zhao; Dongmei Jiang; Yan Li; H. Sahli

2nd International ICST Conference on Communications and Networking in China

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.1109/CHINACOM.2007.4469472,
    author={Guoyun Lv and Rongchun Zhao and Dongmei Jiang and Yan Li and H. Sahli},
    title={A Robust Visual Feature Extraction based BTSM-LDA for Audio-Visual Speech Recognition},
    proceedings={2nd International ICST Conference on Communications and Networking in China},
    publisher={IEEE},
    proceedings_a={CHINACOM},
    year={2008},
    month={3},
    keywords={Bayesian Tangent Shape Model, Dynamic Bayesian Networks, audio-visual, speech recognition},
    doi={10.1109/CHINACOM.2007.4469472}
}

Guoyun Lv¹,*, Rongchun Zhao¹,*, Dongmei Jiang¹, Yan Li¹, H. Sahli²,*

1: School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, P.R. China
2: Vrije Universiteit Brussel, Department ETRO, Pleinlaan 2, Brussels, B1050 Belgium

*Contact email: lvguoyun101@sohu.com, Rczhao@nwpu.edu.cn, Hsahli@etro.vub.ac.be

Abstract

Asynchrony between speech and lip movement is a key problem in audio-visual speech recognition (AVSR) systems. A Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is proposed for audio-visual speech recognition. Compared with the Multi-Stream HMM (MSHMM), the MS-ADBN model describes the asynchrony between the audio stream and the visual stream at the word level. In addition, starting from the lip profile obtained with the Bayesian Tangent Shape Model (BTSM), Linear Discriminant Analysis (LDA) is used for visual feature extraction; the resulting features describe the dynamics of the lip and remove the redundancy in the lip geometrical features. Experimental results on a continuous-digit audio-visual database show that the lip dynamic features based on BTSM and LDA are more stable and robust than direct lip geometrical features. In noisy environments with signal-to-noise ratios ranging from 0 dB to 30 dB, the MS-ADBN model with MFCC and LDA visual features achieves an average improvement of 4.92% in speech recognition rate over the MSHMM.

Keywords: Bayesian Tangent Shape Model, Dynamic Bayesian Networks, audio-visual, speech recognition
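
As a rough illustration of how LDA can compress redundant lip geometrical features into a few discriminative dimensions, the following is a minimal sketch and not the authors' pipeline. Everything in it is an assumption made for the example: the synthetic "lip width/height" data, the three class labels, the six-dimensional feature layout, and the helper names `make_toy_lip_features` and `fit_lda`. The paper's actual features come from BTSM lip contours, which are not reproduced here.

```python
import numpy as np


def make_toy_lip_features(n_per_class=50, seed=0):
    """Synthetic stand-in for lip geometry (hypothetical, not BTSM output).

    Each sample has 6 features: width, height, three exact linear
    combinations of them (deliberately redundant), and one noise feature.
    """
    rng = np.random.default_rng(seed)
    class_means = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 3.0]])
    X_parts, y_parts = [], []
    for label, mean in enumerate(class_means):
        wh = mean + 0.2 * rng.standard_normal((n_per_class, 2))
        w, h = wh[:, 0], wh[:, 1]
        noise = rng.standard_normal(n_per_class)
        X_parts.append(np.column_stack([w, h, w + h, w - h, 2.0 * w, noise]))
        y_parts.append(np.full(n_per_class, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


def fit_lda(X, y, n_components):
    """Fisher LDA: return a (d, n_components) projection matrix."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
    # A small ridge keeps Sw invertible when features are redundant.
    Sw += 1e-6 * np.eye(d)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    # Keep the directions with the largest between/within scatter ratio.
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real


if __name__ == "__main__":
    X, y = make_toy_lip_features()
    W = fit_lda(X, y, n_components=2)
    Z = X @ W  # projected, lower-dimensional visual features
    print("input dims:", X.shape[1], "-> projected dims:", Z.shape[1])
```

With C classes, LDA yields at most C - 1 useful directions, which is why the sketch projects three classes onto two dimensions.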

Published: 2008-03-07
Publisher: IEEE
Modified: 2011-07-13

DOI: http://dx.doi.org/10.1109/CHINACOM.2007.4469472
