
Research Article
Convolutional Recurrent Neural Network Based on Short-Time Discrete Cosine Transform for Monaural Speech Enhancement
@INPROCEEDINGS{10.1007/978-3-031-34790-0_13,
    author={Jinzuo Guo and Yi Zhou and Hongqing Liu and Yongbao Ma},
    title={Convolutional Recurrent Neural Network Based on Short-Time Discrete Cosine Transform for Monaural Speech Enhancement},
    proceedings={Communications and Networking. 17th EAI International Conference, Chinacom 2022, Virtual Event, November 19-20, 2022, Proceedings},
    proceedings_a={CHINACOM},
    year={2023},
    month={6},
    keywords={Speech enhancement, Deep learning, Convolutional recurrent neural network, Discrete cosine transform},
    doi={10.1007/978-3-031-34790-0_13}
}
Authors: Jinzuo Guo, Yi Zhou, Hongqing Liu, Yongbao Ma
Year: 2023
CHINACOM
Springer
DOI: 10.1007/978-3-031-34790-0_13
Abstract
Speech enhancement algorithms based on deep learning have greatly improved the perceptual quality and intelligibility of speech. Complex-valued neural networks, such as the deep complex convolution recurrent network (DCCRN), make full use of the phase information in audio signals and achieve superior performance, but complex-valued operations increase computational complexity. Inspired by the deep cosine transform convolutional recurrent network (DCTCRN) model, this paper uses the real-valued discrete cosine transform in place of the complex-valued Fourier transform. The ideal cosine mask is employed as the training target, and a real-valued convolutional recurrent network (CRNN) enhances the speech while reducing algorithmic complexity. In addition, a frequency-time-LSTM (F-T-LSTM) module is used for better temporal modeling, and a convolutional skip-connection module is introduced between the encoders and the decoders to integrate information across features. Moreover, an improved scale-invariant source-to-noise ratio (SI-SNR) is used as the loss function, which enables the model to focus more on the varying parts of the signal and thus obtain better noise suppression. With only 1.31M parameters, the proposed method achieves noise suppression performance that exceeds that of DCCRN and DCTCRN.
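To illustrate the signal-processing ideas the abstract relies on, the following is a minimal sketch, not the authors' implementation. It assumes an orthonormal DCT-II applied to a single frame, an ideal mask taken as the per-bin ratio of clean to noisy DCT coefficients (the abstract does not spell out the definition of the ideal cosine mask), and the standard SI-SNR rather than the improved variant the paper proposes. All function and variable names are hypothetical.

```python
import math

def dct2(frame):
    """Orthonormal DCT-II of one real frame (O(N^2), for illustration only)."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(frame)))
    return coeffs

def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    frame = []
    for i in range(n):
        frame.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)))
    return frame

def si_snr(estimate, target, eps=1e-8):
    """Standard SI-SNR in dB (assumption: not the paper's improved variant)."""
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    proj = [dot / tgt_energy * t for t in tgt]
    # Whatever is left after the projection is treated as noise.
    residual = [e - p for e, p in zip(est, proj)]
    num = sum(p * p for p in proj)
    den = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(num / den + eps)

# Toy frame: a clean sinusoid plus a deterministic interfering tone.
N = 64
clean = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
noise = [0.3 * math.cos(2.7 * i + 0.5) for i in range(N)]
noisy = [c + v for c, v in zip(clean, noise)]

# Every coefficient is a real number, so no complex arithmetic is needed.
clean_dct = dct2(clean)
noisy_dct = dct2(noisy)

# Assumed ideal mask: per-bin ratio of clean to noisy coefficients.
mask = [s / y if abs(y) > 1e-12 else 0.0
        for s, y in zip(clean_dct, noisy_dct)]
enhanced = idct2([m * y for m, y in zip(mask, noisy_dct)])

print("SI-SNR of noisy frame (dB):", si_snr(noisy, clean))
print("SI-SNR after ideal masking (dB):", si_snr(enhanced, clean))
```

In this toy setting the mask is computed from the clean signal itself, so it only shows the upper bound that a network predicting the mask would be trained toward. Because the DCT coefficients carry sign information in a single real number, the mask can take negative values, unlike a magnitude-only mask.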