Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I

Research Article

Image Captioning: Enhance Visual Understanding

Cite (BibTeX)
@INPROCEEDINGS{10.4108/eai.28-4-2025.2357851,
    author={G. Kalaiarasi and M. Sravya Sree and B. Sai Geetha and A. Yasaswi and I. Aravind and G. Nagavenkata Sreeja},
    title={Image Captioning: Enhance Visual Understanding},
    proceedings={Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I},
    publisher={EAI},
    proceedings_a={ICITSM PART I},
    year={2025},
    month={10},
    keywords={convolutional neural networks (CNNs), long short-term memory (LSTM), vision transformer, GPT-2},
    doi={10.4108/eai.28-4-2025.2357851}
}
    
G. Kalaiarasi1,*, M. Sravya Sree1, B. Sai Geetha1, A. Yasaswi1, I. Aravind1, G. Nagavenkata Sreeja1
  • 1: Vignan’s Foundation for Science, Technology and Research
*Contact email: kalaiibe@gmail.com

Abstract

Image captioning fuses computer vision and natural language processing to produce natural-language descriptions of images. Conventional approaches used CNNs such as VGG16 as visual feature extractors and LSTM-based networks as caption generators, trained on datasets such as Flickr8k and Flickr30k. More recently, transformer-based models such as the Vision Transformer (ViT) and GPT-2 have advanced the state of the art by enabling shared representations and zero-shot learning. In this work, we develop a ViT-GPT2 image captioning system and evaluate it on the Flickr8k dataset. The model shows substantial improvements in accuracy, diversity, and contextual sensitivity over CNN-LSTM baselines. Evaluation with BLEU, METEOR, and ROUGE metrics further confirms improved descriptive precision and semantic alignment. These results demonstrate the effectiveness of transformer architectures in generating natural, human-like descriptions and their potential for real-world applications such as accessibility and multimedia systems.
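As a concrete illustration of the ViT-GPT2 pipeline described above, the sketch below generates a caption with a publicly available ViT-GPT2 encoder-decoder checkpoint from Hugging Face and scores it against a reference sentence with BLEU. The checkpoint name, image path, and reference caption are illustrative assumptions, not the authors' exact model or data.

```python
# A minimal sketch of ViT-GPT2 caption generation, assuming the public
# Hugging Face checkpoint "nlpconnect/vit-gpt2-image-captioning" as a
# stand-in for the paper's model and "example.jpg" as a sample image.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)  # ViT encoder + GPT-2 decoder
processor = ViTImageProcessor.from_pretrained(ckpt)      # resizes/normalizes the image
tokenizer = AutoTokenizer.from_pretrained(ckpt)          # GPT-2 byte-pair tokenizer

# Encode the image into ViT patch embeddings, then decode a caption with beam search.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("caption:", caption)

# Score against one hypothetical reference with smoothed sentence-level BLEU;
# Flickr8k actually provides five reference captions per image.
reference = [["a", "dog", "runs", "across", "the", "grass"]]
bleu = sentence_bleu(reference, caption.lower().split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(bleu, 3))
```

In a full evaluation such as the one reported in the abstract, BLEU, METEOR, and ROUGE would be computed corpus-wide over all five Flickr8k references per image rather than against a single sentence.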

Keywords
convolutional neural networks (CNNs), long short-term memory (LSTM), vision transformer, GPT-2
Published
2025-10-13
Publisher
EAI
http://dx.doi.org/10.4108/eai.28-4-2025.2357851
Copyright © 2025 EAI
Indexed in: EBSCO, ProQuest, DBLP, DOAJ, Portico