TY - GEN
T1 - Look Twice as Much as You Say: Scene Graph Contrastive Learning for Self-Supervised Image Caption Generation
AU - Zhang, Chunhui
AU - Huang, Chao
AU - Li, Youhuan
AU - Zhang, Xiangliang
AU - Ye, Yanfang
AU - Zhang, Chuxu
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-20
PY - 2022/10/17
Y1 - 2022/10/17
AB - Images are commonly used in various information and knowledge applications, such as advertising and recommendation. Automating image caption generation would significantly improve image accessibility. This cross-modal task, which takes an image as input and produces text as output, is however difficult to learn. Although prior methods achieve good performance on image caption generation, they rely on either supervised learning, which requires sufficient labeled data, or unsupervised learning, which needs an external dataset as a language pivot. In this paper, we propose SGCL, a novel Scene Graph Contrastive Learning model for self-supervised image caption generation. SGCL adopts a pre-training and fine-tuning pipeline. Specifically, we first apply scene graph generation and object detection methods to encode the scene graph and visual information of an image as feature representations. A decoder network based on a graph attention network and a recurrent neural network is then designed to generate sequential text as the caption. To enable contrastive learning in SGCL, we design scene graph augmentations as contrastive views of images, so that the model can be trained effectively without ground-truth labels. Additionally, we introduce pre-trained word embeddings and a context projector to enrich the text representation in the decoder network, which benefits model pre-training. Once the pre-training phase is finished, we further fine-tune the model for the image caption generation task with limited labeled data. Extensive experiments on a benchmark dataset demonstrate that SGCL outperforms state-of-the-art models (both supervised and unsupervised).
UR - https://dl.acm.org/doi/10.1145/3511808.3557382
UR - http://www.scopus.com/inward/record.url?scp=85140846973&partnerID=8YFLogxK
U2 - 10.1145/3511808.3557382
DO - 10.1145/3511808.3557382
M3 - Conference contribution
SN - 9781450392365
SP - 2519
EP - 2528
BT - International Conference on Information and Knowledge Management, Proceedings
PB - Association for Computing Machinery
ER -