
Research Article
MSAM: Deep Semantic Interaction Network for Visual Question Answering
@INPROCEEDINGS{10.1007/978-3-031-54528-3_3,
  author={Fan Wang and Bin Wang and Fuyong Xu and Jiaxin Li and Peiyu Liu},
  title={MSAM: Deep Semantic Interaction Network for Visual Question Answering},
  proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part II},
  proceedings_a={COLLABORATECOM PART 2},
  year={2024},
  month={2},
  keywords={Visual Question and Answering, Contrastive Learning, Semantic Alignment, Semantic Information},
  doi={10.1007/978-3-031-54528-3_3}
}
Authors: Fan Wang, Bin Wang, Fuyong Xu, Jiaxin Li, Peiyu Liu
Year: 2024
COLLABORATECOM PART 2
Springer
DOI: 10.1007/978-3-031-54528-3_3
Abstract
In the Visual Question Answering (VQA) task, extracting semantic information from multiple modalities and using it effectively for cross-modal interaction is crucial. Most existing VQA methods rely on attention mechanisms to reason about answers but do not fully exploit the semantic information of each modality. Furthermore, the question-image relations captured by attention may contain conflicting information, which weakens the relevance of multi-modal semantic information. To address these issues, this paper proposes a Multi-layer Semantics Awareness Model (MSAM) that compensates for the lack of multi-modal semantic understanding. We design a Bi-affine space projection method that constructs a multi-modal semantic space in which modal features can be understood at the semantic level. We then use contrastive learning to achieve semantic alignment, which brings modalities with the same semantics closer together and improves multi-modal information relevance. Extensive experiments on the VQA2.0 dataset show that our model improves on the baseline.
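
The abstract does not give the form of the contrastive objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE loss over paired image and question embeddings. The function names, the temperature value, and the use of cosine similarity are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(image_embs, question_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row i of image_embs and row i of question_embs form a
    positive pair; every other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    qs = [l2_normalize(v) for v in question_embs]
    n = len(imgs)

    # Cosine similarity matrix, scaled by the (hypothetical) temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], qs[j])) / temperature
            for j in range(n)] for i in range(n)]

    def cross_entropy_diag(logits):
        # Mean of -log softmax(logits[i])[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / len(logits)

    # Average the image-to-question and question-to-image directions.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy 2-D embeddings: matched pairs should score a lower loss than mismatched ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each image embedding toward its own question embedding and pushes it away from the other questions in the batch, which is the sense in which modalities with the same semantics are brought closer together.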