
Research Article
Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution
@INPROCEEDINGS{10.1007/978-3-031-67162-3_23,
  author={Yi Wang and Hongqing Liu and Yu Zhao and Yi Zhou},
  title={Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution},
  proceedings={Communications and Networking. 18th EAI International Conference, ChinaCom 2023, Sanya, China, November 18--19, 2023, Proceedings},
  proceedings_a={CHINACOM},
  year={2024},
  month={8},
  keywords={Sound event localization and detection; Deep learning; Audio-visual fusion; Convolutional recurrent neural network},
  doi={10.1007/978-3-031-67162-3_23}
}
- Yi Wang
- Hongqing Liu
- Yu Zhao
- Yi Zhou
Year: 2024
CHINACOM
Springer
DOI: 10.1007/978-3-031-67162-3_23
Abstract
Sound event localization and detection (SELD) focuses on the simultaneous detection of various sound events along with their spatial and temporal localization. Recent work shows that audio-visual fusion methods, rarely explored in SELD research, achieve more promising results than single-modality approaches. To this end, we propose an audio-visual fusion mechanism for SELD based on a convolutional recurrent neural network (CRNN). Visual cues are acquired by applying object detection and a pre-trained model to the image corresponding to the start frame of the audio feature sequence, and are then passed into the model. In place of traditional convolution, we devise a depth-wise separable convolution block to better learn information relevant to the different sound event categories in the audio features. Experimental results on the STARSS23 dataset of DCASE 2023 indicate that introducing visual cues does improve SELD performance compared to the audio-only system. The convolution block devised in our work further enhances the model's performance, achieving a better SELD score.
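The depth-wise separable convolution the abstract refers to factors a standard convolution into a per-channel (depthwise) spatial convolution followed by a 1x1 (pointwise) convolution that mixes channels, cutting the parameter count from C_in*C_out*k*k to C_in*k*k + C_in*C_out. The paper's actual block design is not given here; the following is only a minimal NumPy sketch of the general technique (function name, shapes, and valid padding are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Sketch of a depth-wise separable convolution (valid padding, stride 1).

    x          : (C_in, H, W)   input feature map
    dw_kernels : (C_in, k, k)   one spatial kernel per input channel
    pw_weights : (C_out, C_in)  1x1 pointwise mixing weights
    returns    : (C_out, H-k+1, W-k+1)
    """
    C_in, H, W = x.shape
    k = dw_kernels.shape[-1]
    Ho, Wo = H - k + 1, W - k + 1

    # Depthwise step: each channel is convolved with its own kernel only.
    dw = np.zeros((C_in, Ho, Wo))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution mixes channels at every position.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

In a deep-learning framework this corresponds to a grouped convolution with `groups = C_in` followed by a 1x1 convolution; the loop form above is only for clarity.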