
Research Article
Image Captioning: Enhance Visual Understanding
@INPROCEEDINGS{10.4108/eai.28-4-2025.2357851,
  author        = {G. Kalaiarasi and M. Sravya Sree and B. Sai Geetha and A. Yasaswi and I. Aravind and G. Nagavenkata Sreeja},
  title         = {Image Captioning: Enhance Visual Understanding},
  proceedings   = {Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I},
  publisher     = {EAI},
  proceedings_a = {ICITSM PART I},
  year          = {2025},
  month         = {10},
  keywords      = {convolutional neural networks (CNNs), long short-term memory (LSTM), vision transformer, GPT-2},
  doi           = {10.4108/eai.28-4-2025.2357851}
}
G. Kalaiarasi
M. Sravya Sree
B. Sai Geetha
A. Yasaswi
I. Aravind
G. Nagavenkata Sreeja
Year: 2025
Image Captioning: Enhance Visual Understanding
ICITSM PART I
EAI
DOI: 10.4108/eai.28-4-2025.2357851
Abstract
Image captioning fuses computer vision and natural language processing to produce natural language descriptions of images. Conventional methods used CNNs such as VGG16 as visual feature extractors and LSTM-based networks to generate captions, and were trained on datasets such as Flickr8k and Flickr30k. More recently, transformer-based models such as the Vision Transformer (ViT) and GPT-2 have produced a considerable leap in the state of the art by enabling shared representations and zero-shot learning. In this work, we develop a ViT-GPT2 based image captioning system and carry out experiments on the Flickr8k dataset. The model demonstrates substantial improvements in accuracy, diversity, and contextual relevance compared to CNN-LSTM baselines. Evaluation with BLEU, METEOR, and ROUGE metrics further confirms the improved descriptive precision and semantic alignment of the generated captions. These results indicate the effectiveness of transformer architectures in generating natural, human-like descriptions and their potential for real-world applications such as accessibility and multimedia systems.
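The sketch below illustrates the kind of ViT-GPT2 captioning pipeline the abstract describes, using the Hugging Face VisionEncoderDecoderModel API. It is a minimal illustration, not the authors' implementation: the "nlpconnect/vit-gpt2-image-captioning" checkpoint, the beam-search settings, and the example image path are assumptions standing in for the paper's Flickr8k-trained model.

```python
# Minimal ViT-GPT2 captioning sketch with Hugging Face Transformers.
# Assumptions: the public "nlpconnect/vit-gpt2-image-captioning" checkpoint
# is used in place of the paper's own Flickr8k-trained weights, and
# generation hyperparameters (beam size, max length) are illustrative.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_name = "nlpconnect/vit-gpt2-image-captioning"  # assumed stand-in checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_name)  # ViT encoder + GPT-2 decoder
processor = ViTImageProcessor.from_pretrained(model_name)      # resizes/normalizes images for ViT
tokenizer = AutoTokenizer.from_pretrained(model_name)          # GPT-2 tokenizer for decoding

def caption(image_path: str) -> str:
    """Generate a single caption for one image file."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(caption("example.jpg"))  # hypothetical image path
```

Generated captions would then be scored against the Flickr8k reference captions with BLEU, METEOR, and ROUGE, as described in the abstract.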