Look Twice as Much as You Say: Scene Graph Contrastive Learning for Self-Supervised Image Caption Generation

Chunhui Zhang, Chao Huang, Youhuan Li, Xiangliang Zhang, Yanfang Ye, Chuxu Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation


Images are widely used in information and knowledge applications such as advertising and recommendation, and automating image caption generation would significantly improve image accessibility. This cross-modal task, which takes an image as input and produces text as output, is difficult to learn. Although prior methods achieve good performance on image caption generation, they rely either on supervised learning, which requires sufficient labeled data, or on unsupervised learning, which needs an external dataset as a language pivot. In this paper, we propose SGCL, a novel Scene Graph Contrastive Learning model for self-supervised image caption generation. SGCL adopts a pre-training and fine-tuning pipeline. Specifically, we first apply scene graph generation and object detection methods to encode the scene graph and visual information in the image as feature representations. A decoder network based on a graph attention network and a recurrent neural network is then designed to generate sequential text as the caption. To enable contrastive learning in SGCL, we design scene graph augmentations that serve as contrastive views of images and train the model without ground-truth labels. Additionally, we introduce pre-trained word embeddings and a context projector to enrich the text representation in the decoder network, which benefits model pre-training. Once the pre-training phase is finished, we fine-tune the model for the image caption generation task with limited labeled data. Extensive experiments on a benchmark dataset demonstrate that SGCL outperforms state-of-the-art models (both supervised and unsupervised).
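The contrastive step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the augmentations or the loss, so everything here is an assumption: random edge dropping stands in for a scene graph augmentation, a standard InfoNCE objective stands in for the contrastive loss, and the names `drop_edges`, `cosine`, and `info_nce` are hypothetical, not taken from SGCL.

```python
import math
import random


def drop_edges(edges, drop_prob, rng):
    """Hypothetical augmentation: drop each (subject, relation, object)
    edge independently with probability drop_prob."""
    return [edge for edge in edges if rng.random() >= drop_prob]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(view_a, view_b, temperature=0.5):
    """Average InfoNCE loss over a batch of embeddings.

    view_a[i] and view_b[i] are embeddings of two augmented views of the
    same image (the positive pair); every view_b[j] with j != i acts as a
    negative for view_a[i].
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Two augmented views of one toy scene graph (assumed triple format).
rng = random.Random(0)
scene_graph = [("man", "riding", "horse"), ("horse", "on", "grass")]
view_1 = drop_edges(scene_graph, 0.3, rng)
view_2 = drop_edges(scene_graph, 0.3, rng)
```

In a full pipeline, an encoder would map each augmented graph to an embedding before `info_nce` is applied. The loss is low when matching views are more similar to each other than to views of other images in the batch, which is what lets training proceed without ground-truth captions.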
Original language: English (US)
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Publisher: Association for Computing Machinery
Number of pages: 10
ISBN (Print): 9781450392365
State: Published - Oct 17 2022
Externally published: Yes

Bibliographical note

Generated from Scopus record by KAUST IRTS on 2023-09-20

