Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part II

Research Article

MSAM: Deep Semantic Interaction Network for Visual Question Answering

Cite
BibTeX
    @INPROCEEDINGS{10.1007/978-3-031-54528-3_3,
        author={Fan Wang and Bin Wang and Fuyong Xu and Jiaxin Li and Peiyu Liu},
        title={MSAM: Deep Semantic Interaction Network for Visual Question Answering},
        proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part II},
        proceedings_a={COLLABORATECOM PART 2},
        year={2024},
        month={2},
        keywords={Visual Question and Answering, Contrastive Learning, Semantic Alignment, Semantic Information},
        doi={10.1007/978-3-031-54528-3_3}
    }
    
Fan Wang1, Bin Wang1, Fuyong Xu1, Jiaxin Li1, Peiyu Liu1,*
  • 1: Shandong Normal University
*Contact email: liupy@sdnu.edu.cn

Abstract

In the Visual Question Answering (VQA) task, extracting semantic information from multiple modalities and effectively using that information for interaction is crucial. Existing VQA methods mostly rely on attention mechanisms to reason about answers, but they do not fully exploit the semantic information of the modalities. Furthermore, the question-image relation described by the attention mechanism may contain conflicting information, which weakens the relevance of multi-modal semantic information. To address these issues, this paper proposes a Multi-layer Semantics Awareness Model (MSAM) to fill the gap in multi-modal semantic understanding. We design a Bi-affine space projection method to construct a multi-modal semantic space that captures modal features at the semantic level. We then use contrastive learning to achieve semantic alignment, which brings modalities with the same semantics closer together and improves the relevance of multi-modal information. We conduct extensive experiments on the VQA2.0 dataset; compared with the baseline, our model further improves the metrics, boosting performance on the VQA task.
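As a rough illustration of the two ideas in the abstract, the PyTorch sketch below shows a bi-affine projection of image and question features into a shared semantic space, followed by an InfoNCE-style contrastive loss that pulls matched image-question pairs together. This is a minimal sketch based only on the abstract: the module names, feature dimensions, temperature, and exact loss form are assumptions, not the authors' released implementation.

    # Illustrative sketch only (assumptions, not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BiAffineProjection(nn.Module):
        """Project visual and textual features into a joint semantic space
        using a bi-affine (bilinear + affine) interaction."""
        def __init__(self, img_dim: int, txt_dim: int, sem_dim: int):
            super().__init__()
            self.bilinear = nn.Bilinear(img_dim, txt_dim, sem_dim)  # bi-affine term
            self.img_proj = nn.Linear(img_dim, sem_dim)             # affine terms
            self.txt_proj = nn.Linear(txt_dim, sem_dim)

        def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
            joint = (self.bilinear(img_feat, txt_feat)
                     + self.img_proj(img_feat) + self.txt_proj(txt_feat))
            # Also return per-modality embeddings in the same space for alignment.
            return torch.tanh(joint), self.img_proj(img_feat), self.txt_proj(txt_feat)

    def contrastive_alignment_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Symmetric InfoNCE loss: matched image/question pairs (the diagonal)
        are positives; all other pairs in the batch are negatives."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        B, img_dim, txt_dim, sem_dim = 8, 2048, 768, 512   # assumed sizes
        proj = BiAffineProjection(img_dim, txt_dim, sem_dim)
        img = torch.randn(B, img_dim)   # e.g. pooled region features
        txt = torch.randn(B, txt_dim)   # e.g. pooled question encoding
        joint, img_sem, txt_sem = proj(img, txt)
        loss = contrastive_alignment_loss(img_sem, txt_sem)
        print(joint.shape, loss.item())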

Keywords
Visual Question and Answering, Contrastive Learning, Semantic Alignment, Semantic Information
Published
2024-02-23
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-54528-3_3