Communications and Networking. 18th EAI International Conference, ChinaCom 2023, Sanya, China, November 18–19, 2023, Proceedings

Research Article

Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution

Cite
    @INPROCEEDINGS{10.1007/978-3-031-67162-3_23,
        author={Yi Wang and Hongqing Liu and Yu Zhao and Yi Zhou},
        title={Audio-Visual Sound Event Localization and Detection Based on CRNN Using Depth-Wise Separable Convolution},
        proceedings={Communications and Networking. 18th EAI International Conference, ChinaCom 2023, Sanya, China, November 18--19, 2023, Proceedings},
        proceedings_a={CHINACOM},
        year={2024},
        month={8},
        keywords={Sound event localization and detection; Deep learning; Audio-visual fusion; Convolutional recurrent neural network},
        doi={10.1007/978-3-031-67162-3_23}
    }
Yi Wang1,*, Hongqing Liu2, Yu Zhao1, Yi Zhou2
  • 1: School of Communication and Information Engineering
  • 2: Intelligent Speech and Audio Research Lab.
*Contact email: s210101136@stu.cqupt.edu.cn

Abstract

Sound event localization and detection (SELD) focuses on the simultaneous detection of various sound events along with their spatial and temporal localization. Recent work shows that audio-visual fusion methods, which are rarely explored in SELD research, achieve more promising results than single-modality approaches. Motivated by this, we propose an audio-visual fusion mechanism for SELD based on a convolutional recurrent neural network (CRNN). Object detection and pre-trained model processing are applied to the image corresponding to the start frame of the audio feature sequence to acquire the visual cues that are passed into the model. In place of traditional convolution, we devise a depth-wise separable convolution block to better learn the information relevant to different sound event categories in the audio features. Experimental results on the STARSS23 dataset from DCASE 2023 indicate that introducing visual cues does improve SELD performance compared to the audio-only system. The convolution block devised in our work further enhances the model's performance, achieving a higher SELD score.
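The efficiency motivation behind the depth-wise separable convolution block mentioned in the abstract can be illustrated with a simple parameter count. The sketch below is not the paper's implementation; the channel and kernel sizes are hypothetical, chosen only to show why factoring a standard convolution into a depth-wise step plus a 1x1 point-wise step shrinks the parameter budget.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameter count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def dws_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Depth-wise separable variant: one k x k filter per input channel
    (depth-wise), then a 1 x 1 point-wise convolution to mix channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 64 -> 128 channels with 3 x 3 kernels.
standard = conv_params(64, 128, 3)       # 64 * 128 * 9  = 73728
separable = dws_conv_params(64, 128, 3)  # 576 + 8192    = 8768
print(standard, separable, round(standard / separable, 1))
```

With these example sizes the separable block uses roughly 8x fewer parameters than the standard convolution, which is the usual argument for adopting it in CRNN-style audio front ends.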

Keywords
Sound event localization and detection, Deep learning, Audio-visual fusion, Convolutional recurrent neural network
Published
2024-08-06
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-67162-3_23
Copyright © 2023–2025 ICST