ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask a series of informative questions about images to BLIP-2, a strong vision question-answering model. In ChatCaptioner, we investigate whether two AI models, unable to individually describe images in detail, can collaborate through an automated, visually guided dialogue to generate a better and more enriched image description than a single AI model. We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner’s captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information. In addition, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone, as measured by WordNet synset matching. Code is available at https://github.com/Vision-CAIR/ChatCaptioner.
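The questioning loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). The function `run_dialogue` and the three callables are hypothetical names introduced here; in a real system the questioner and summarizer would wrap an LLM such as ChatGPT and the answerer would wrap a visual question-answering model such as BLIP-2. Toy stand-ins are used so the sketch runs without any model or network access.

```python
from typing import Callable, List, Tuple

# A dialogue history is a list of (question, answer) pairs.
History = List[Tuple[str, str]]


def run_dialogue(
    ask: Callable[[History], str],        # hypothetical: wraps the questioning LLM
    answer: Callable[[str], str],         # hypothetical: wraps the VQA model for one image
    summarize: Callable[[History], str],  # hypothetical: wraps the summarizing LLM
    n_rounds: int = 3,
) -> Tuple[str, History]:
    """Alternate question and answer for n_rounds, then summarize the dialogue."""
    history: History = []
    for _ in range(n_rounds):
        question = ask(history)           # the questioner sees the dialogue so far
        reply = answer(question)          # the answerer sees only the new question
        history.append((question, reply))
    return summarize(history), history


# Toy stand-ins (assumptions for illustration only, no real models involved).
def toy_ask(history: History) -> str:
    questions = ["What is in the image?", "What color is it?", "Where is it?"]
    return questions[len(history) % len(questions)]


def toy_answer(question: str) -> str:
    canned = {
        "What is in the image?": "a dog",
        "What color is it?": "brown",
        "Where is it?": "on a beach",
    }
    return canned.get(question, "not sure")


def toy_summarize(history: History) -> str:
    # A real summarizer would write fluent prose; this just joins the answers.
    return " ".join(reply for _, reply in history)


if __name__ == "__main__":
    caption, history = run_dialogue(toy_ask, toy_answer, toy_summarize)
    print(caption)
```

Passing the models in as plain callables keeps the loop independent of any particular API, so the same structure could be exercised with real model wrappers or with stubs like the ones above.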

Original language: English (US)
Journal: Transactions on Machine Learning Research
Volume: 2024
State: Published - 2024

Bibliographical note

Publisher Copyright:
© 2024, Transactions on Machine Learning Research. All rights reserved.

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
