
Research Article
Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution
@INPROCEEDINGS{10.1007/978-3-031-67162-3_23,
  author={Yi Wang and Hongqing Liu and Yu Zhao and Yi Zhou},
  title={Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution},
  proceedings={Communications and Networking. 18th EAI International Conference, ChinaCom 2023, Sanya, China, November 18--19, 2023, Proceedings},
  proceedings_a={CHINACOM},
  year={2024},
  month={8},
  keywords={Sound event localization and detection; Deep learning; Audio-visual fusion; Convolutional recurrent neural network},
  doi={10.1007/978-3-031-67162-3_23}
}
- Yi Wang
- Hongqing Liu
- Yu Zhao
- Yi Zhou
Year: 2024
CHINACOM
Springer
DOI: 10.1007/978-3-031-67162-3_23
Abstract
Sound event localization and detection (SELD) focuses on the simultaneous detection of various sound events along with their spatial and temporal localization. Recent work shows that audio-visual fusion methods, rarely explored in SELD research, achieve more promising results than single-modality approaches. To this end, we propose an audio-visual fusion mechanism for SELD based on a convolutional recurrent neural network (CRNN). Visual cues are acquired by applying object detection and a pre-trained model to the image corresponding to the start frame of the audio feature sequence, and are then passed into the model. In place of traditional convolution, we devise a depth-wise separable convolution block to better learn information relevant to the different sound event categories in the audio features. Experimental results on the STARSS23 dataset of DCASE 2023 indicate that introducing visual cues does improve SELD performance compared to the audio-only system. The convolution block devised in our work further enhances the model's performance, achieving a better SELD score.
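The depth-wise separable convolution the abstract refers to factors a standard convolution into a per-channel (depthwise) spatial convolution followed by a 1x1 (pointwise) convolution that mixes channels, cutting the parameter count from C_in*C_out*k*k to C_in*k*k + C_in*C_out. The paper's actual block design is not given here; the following is only a minimal NumPy sketch of the general technique (function name, shapes, and valid padding are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Sketch of a depth-wise separable convolution (valid padding, stride 1).

    x          : (C_in, H, W)   input feature map
    dw_kernels : (C_in, k, k)   one spatial kernel per input channel
    pw_weights : (C_out, C_in)  1x1 pointwise mixing weights
    returns    : (C_out, H-k+1, W-k+1)
    """
    C_in, H, W = x.shape
    k = dw_kernels.shape[-1]
    Ho, Wo = H - k + 1, W - k + 1

    # Depthwise step: each channel is convolved with its own kernel only.
    dw = np.zeros((C_in, Ho, Wo))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution mixes channels at every position.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

In a deep-learning framework this corresponds to a grouped convolution with `groups = C_in` followed by a 1x1 convolution; the loop form above is only for clarity.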