
Research Article
Image Captioning: Enhance Visual Understanding
@INPROCEEDINGS{10.4108/eai.28-4-2025.2357851,
  author        = {G. Kalaiarasi and M. Sravya Sree and B. Sai Geetha and A. Yasaswi and I. Aravind and G. Nagavenkata Sreeja},
  title         = {Image Captioning: Enhance Visual Understanding},
  proceedings   = {Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I},
  publisher     = {EAI},
  proceedings_a = {ICITSM PART I},
  year          = {2025},
  month         = {10},
  keywords      = {convolutional neural networks (CNNs), long short-term memory (LSTM), vision transformer, GPT-2},
  doi           = {10.4108/eai.28-4-2025.2357851}
}
G. Kalaiarasi
M. Sravya Sree
B. Sai Geetha
A. Yasaswi
I. Aravind
G. Nagavenkata Sreeja
Year: 2025
Image Captioning: Enhance Visual Understanding
ICITSM PART I
EAI
DOI: 10.4108/eai.28-4-2025.2357851
Abstract
Image captioning fuses computer vision and natural language processing to produce natural language descriptions of images. Conventional methods used CNNs such as VGG16 as visual feature extractors and LSTM-based networks to generate captions, and were trained on datasets such as Flickr8k and Flickr30k. More recently, transformer-based models such as the Vision Transformer (ViT) and GPT-2 have produced a considerable leap in the state of the art by enabling shared representations and zero-shot learning. In this work, we develop a ViT-GPT2 based image captioning system and carry out experiments on the Flickr8k dataset. The model demonstrates substantial improvements in accuracy, diversity, and contextual relevance compared to CNN-LSTM baselines. Evaluation with BLEU, METEOR, and ROUGE metrics further confirms the improved descriptive precision and semantic alignment of the generated captions. These results indicate the effectiveness of transformer architectures in generating natural, human-like descriptions and their potential for real-world applications such as accessibility and multimedia systems.
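The sketch below illustrates the kind of ViT-GPT2 captioning pipeline the abstract describes, using the Hugging Face VisionEncoderDecoderModel API. It is a minimal illustration, not the authors' implementation: the "nlpconnect/vit-gpt2-image-captioning" checkpoint, the beam-search settings, and the example image path are assumptions standing in for the paper's Flickr8k-trained model.

```python
# Minimal ViT-GPT2 captioning sketch with Hugging Face Transformers.
# Assumptions: the public "nlpconnect/vit-gpt2-image-captioning" checkpoint
# is used in place of the paper's own Flickr8k-trained weights, and
# generation hyperparameters (beam size, max length) are illustrative.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_name = "nlpconnect/vit-gpt2-image-captioning"  # assumed stand-in checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_name)  # ViT encoder + GPT-2 decoder
processor = ViTImageProcessor.from_pretrained(model_name)      # resizes/normalizes images for ViT
tokenizer = AutoTokenizer.from_pretrained(model_name)          # GPT-2 tokenizer for decoding

def caption(image_path: str) -> str:
    """Generate a single caption for one image file."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(caption("example.jpg"))  # hypothetical image path
```

Generated captions would then be scored against the Flickr8k reference captions with BLEU, METEOR, and ROUGE, as described in the abstract.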