Industrial Networks and Intelligent Systems. 10th EAI International Conference, INISCOM 2024, Da Nang, Vietnam, February 20–21, 2024, Proceedings

Research Article

CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing

Cite
BibTeX
@INPROCEEDINGS{10.1007/978-3-031-67357-3_14,
    author={Triet Minh Huynh and Duy Linh Nguyen and Thanh Tri Nguyen and Thuy-Duong Thi Vu and Hanh Dang-Ngoc and Duc Ngoc Minh Dang},
    title={CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing},
    proceedings={Industrial Networks and Intelligent Systems. 10th EAI International Conference, INISCOM 2024, Da Nang, Vietnam, February 20--21, 2024, Proceedings},
    proceedings_a={INISCOM},
    year={2024},
    month={7},
    keywords={GPT-2, image caption generation, CLIP, OpenCLIP, sentence transformer, zero-shot text classification},
    doi={10.1007/978-3-031-67357-3_14}
}
Triet Minh Huynh1, Duy Linh Nguyen1, Thanh Tri Nguyen1, Thuy-Duong Thi Vu1, Hanh Dang-Ngoc2, Duc Ngoc Minh Dang1,*
  • 1: Computing Fundamental Department
  • 2: Faculty of Electrical and Electronics Engineering
*Contact email: ducdnm2@fe.edu.vn

Abstract

Image caption generation resides at the intersection of computer vision and natural language processing, with its primary goal being the creation of descriptive and coherent textual narratives that faithfully depict the content of an image. This paper presents two models that leverage CLIP as the image encoder and fine-tune GPT-2 for caption generation on the Flickr30k and Flickr8k datasets. The first model utilizes a straightforward mapping network and outperforms the original architecture with a BLEU-1 score of 0.700, BLEU-4 score of 0.257, and ROUGE score of 0.569 on the Flickr8k dataset. The second model constitutes a new architecture exploring the boundaries of minimal visual information required for captioning. It incorporates CLIP’s text encoder to produce input for the generator, while the image embedding serves solely as a validation mechanism. Despite its relatively lower performance, with a BLEU-1 score of 0.546, BLEU-4 score of 0.108, and ROUGE score of 0.444 on the Flickr8k dataset, this model demonstrates the decoder’s ability to create captions based on keyword descriptions alone, without direct access to the context vector.
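To make the first model's pipeline concrete, the sketch below shows a CLIP-prefix captioner in PyTorch: a CLIP image embedding is mapped by a small network into a sequence of prefix embeddings that condition a fine-tuned GPT-2 decoder. This is a minimal reconstruction based on the abstract, not the authors' code; the class name, prefix length, two-layer MLP mapper, and model checkpoints are illustrative assumptions.

```python
# Minimal sketch of CLIP-prefix captioning (hyperparameters are illustrative,
# not taken from the paper). OpenCLIP could be swapped in for OpenAI CLIP.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel
import clip  # OpenAI CLIP package


class ClipPrefixCaptioner(nn.Module):
    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt2.config.n_embd  # 768 for base GPT-2
        # Straightforward mapping network: CLIP embedding -> prefix embeddings
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor, caption_ids: torch.Tensor):
        # clip_embed: (batch, clip_dim), caption_ids: (batch, seq_len)
        batch = clip_embed.size(0)
        prefix = self.mapper(clip_embed).view(batch, self.prefix_length, -1)
        token_embeds = self.gpt2.transformer.wte(caption_ids)
        inputs = torch.cat([prefix, token_embeds], dim=1)
        # Mask prefix positions with -100 so they do not contribute to the loss
        ignore = torch.full(
            (batch, self.prefix_length), -100,
            dtype=torch.long, device=caption_ids.device,
        )
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.gpt2(inputs_embeds=inputs, labels=labels)


# Usage sketch: encode an image with CLIP, then fine-tune GPT-2 on
# Flickr8k/Flickr30k caption pairs using the returned loss.
# device = "cuda" if torch.cuda.is_available() else "cpu"
# clip_model, preprocess = clip.load("ViT-B/32", device=device)
# with torch.no_grad():
#     image_embed = clip_model.encode_image(
#         preprocess(img).unsqueeze(0).to(device)).float()
# model = ClipPrefixCaptioner().to(device)
# out = model(image_embed, caption_ids)  # out.loss drives fine-tuning
```

In the paper's second model, the generator is instead conditioned on CLIP text-encoder output derived from keyword descriptions, and the image embedding is used only to validate the generated caption; the decoder itself never sees the image's context vector.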

Keywords
GPT-2, image caption generation, CLIP, OpenCLIP, sentence transformer, zero-shot text classification
Published
2024-07-31
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-67357-3_14
Copyright © 2024–2025 ICST