
Research Article
Automated Image Caption Generation using CNN and LSTM
@INPROCEEDINGS{10.4108/eai.28-4-2025.2358102,
  author={M Vinodh Kumar and P Lakshmi Karthikeya and G Sai Chand and Sivadi Balakrishna},
  title={Automated Image Caption Generation using CNN and LSTM},
  proceedings={Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part II},
  publisher={EAI},
  proceedings_a={ICITSM PART II},
  year={2025},
  month={10},
  keywords={helmet infractions number plate recognition optical character recognition (ocr) traffic violations and you only look once (yolov11)},
  doi={10.4108/eai.28-4-2025.2358102}
}
M Vinodh Kumar
P Lakshmi Karthikeya
G Sai Chand
Sivadi Balakrishna
Year: 2025
ICITSM PART II
EAI
DOI: 10.4108/eai.28-4-2025.2358102
Abstract
Image captioning is the challenging task of automatically generating a textual description of an image by combining computer vision and natural language processing. This work integrates a convolutional neural network (CNN) and a long short-term memory (LSTM) network into a deep learning-based automatic captioning model. The CNN serves as a visual feature extractor, capturing important patterns from the input images. These features are then passed to the LSTM, which generates grammatically and semantically meaningful captions. The model is trained on the Flickr8k dataset, in which each image is paired with several human-written captions. Several preprocessing techniques, namely tokenization, sequence padding, and text embedding, are used to improve the linguistic representation. The model is evaluated by comparing the generated captions with the reference descriptions using the BLEU (Bilingual Evaluation Understudy) score, which can be computed as soon as captions have been generated. Experimental results indicate that the proposed model captures the visual semantics of the images and generates reasonable descriptions, illustrating the potential of deep learning for automatic image understanding.
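
The text-side steps named in the abstract (tokenization, sequence padding, and BLEU scoring) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation: the function names (`tokenize`, `build_vocab`, `encode_and_pad`, `bleu`), the punctuation handling, the reserved padding id 0, and the choice of unsmoothed sentence-level BLEU with uniform weights up to 4-grams are all assumptions made for illustration. The CNN encoder and LSTM decoder are not shown.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'"


def tokenize(caption):
    """Lowercase a caption and split it into word tokens (hypothetical helper)."""
    words = (w.strip(PUNCT) for w in caption.lower().split())
    return [w for w in words if w]


def build_vocab(captions):
    """Map each word to an integer id; id 0 is reserved for padding (assumption)."""
    vocab = {}
    for caption in captions:
        for word in tokenize(caption):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode_and_pad(caption, vocab, max_len):
    """Convert a caption to ids, truncate to max_len, and right-pad with 0."""
    ids = [vocab[w] for w in tokenize(caption) if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty, using the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    refs = [tokenize("A dog runs on the grass."),
            tokenize("A dog is running through a field.")]
    vocab = build_vocab(["A dog runs on the grass."])
    print(encode_and_pad("A dog runs", vocab, max_len=6))
    print(bleu(tokenize("A dog runs on the grass"), refs))
```

Without smoothing, any caption that shares no 4-gram with a reference scores 0, so short captions are penalized heavily. Practical evaluations usually report corpus-level BLEU or apply a smoothing method instead.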