Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part II

Research Article

MSAM: Deep Semantic Interaction Network for Visual Question Answering

Cite
BibTeX
    @INPROCEEDINGS{10.1007/978-3-031-54528-3_3,
        author={Fan Wang and Bin Wang and Fuyong Xu and Jiaxin Li and Peiyu Liu},
        title={MSAM: Deep Semantic Interaction Network for Visual Question Answering},
        proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part II},
        proceedings_a={COLLABORATECOM PART 2},
        year={2024},
        month={2},
        keywords={Visual Question and Answering, Contrastive Learning, Semantic Alignment, Semantic Information},
        doi={10.1007/978-3-031-54528-3_3}
    }
    
Fan Wang1, Bin Wang1, Fuyong Xu1, Jiaxin Li1, Peiyu Liu1,*
  • 1: Shandong Normal University
*Contact email: liupy@sdnu.edu.cn

Abstract

In the Visual Question Answering (VQA) task, extracting semantic information from multiple modalities and effectively using that information for interaction is crucial. Existing VQA methods mostly rely on attention mechanisms to reason about answers, but they do not fully exploit the semantic information of the modalities. Furthermore, the question-image relation described by the attention mechanism may contain conflicting information, which weakens the relevance of multi-modal semantic information. To address these issues, this paper proposes a Multi-layer Semantics Awareness Model (MSAM) to fill the gap in multi-modal semantic understanding. We design a Bi-affine space projection method to construct a multi-modal semantic space that captures modal features at the semantic level. We then use contrastive learning to achieve semantic alignment, which brings modalities with the same semantics closer together and improves the relevance of multi-modal information. We conduct extensive experiments on the VQA2.0 dataset; compared with the baseline, our model further improves the metrics, boosting performance on the VQA task.
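As a rough illustration of the two ideas in the abstract, the PyTorch sketch below shows a bi-affine projection of image and question features into a shared semantic space, followed by an InfoNCE-style contrastive loss that pulls matched image-question pairs together. This is a minimal sketch based only on the abstract: the module names, feature dimensions, temperature, and exact loss form are assumptions, not the authors' released implementation.

    # Illustrative sketch only (assumptions, not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BiAffineProjection(nn.Module):
        """Project visual and textual features into a joint semantic space
        using a bi-affine (bilinear + affine) interaction."""
        def __init__(self, img_dim: int, txt_dim: int, sem_dim: int):
            super().__init__()
            self.bilinear = nn.Bilinear(img_dim, txt_dim, sem_dim)  # bi-affine term
            self.img_proj = nn.Linear(img_dim, sem_dim)             # affine terms
            self.txt_proj = nn.Linear(txt_dim, sem_dim)

        def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
            joint = (self.bilinear(img_feat, txt_feat)
                     + self.img_proj(img_feat) + self.txt_proj(txt_feat))
            # Also return per-modality embeddings in the same space for alignment.
            return torch.tanh(joint), self.img_proj(img_feat), self.txt_proj(txt_feat)

    def contrastive_alignment_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Symmetric InfoNCE loss: matched image/question pairs (the diagonal)
        are positives; all other pairs in the batch are negatives."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        B, img_dim, txt_dim, sem_dim = 8, 2048, 768, 512   # assumed sizes
        proj = BiAffineProjection(img_dim, txt_dim, sem_dim)
        img = torch.randn(B, img_dim)   # e.g. pooled region features
        txt = torch.randn(B, txt_dim)   # e.g. pooled question encoding
        joint, img_sem, txt_sem = proj(img, txt)
        loss = contrastive_alignment_loss(img_sem, txt_sem)
        print(joint.shape, loss.item())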

Keywords
Visual Question and Answering, Contrastive Learning, Semantic Alignment, Semantic Information
Published
2024-02-23
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-54528-3_3