VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

Research output: Contribution to conference › Paper › peer-review

4 Scopus citations

Abstract

We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. To address these gaps, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluate state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to contribute significantly to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://vrsbench.github.io.
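For illustration, the sketch below shows one plausible way to tally the three annotation types described in the abstract (captions, object references, and question-answer pairs) from a VRSBench-style annotation file. The file name (`VRSBench_train.json`) and field names (`caption`, `objects`, `qa_pairs`) are assumptions for this example only; the dataset's actual file layout and schema are documented at https://vrsbench.github.io.

```python
import json
from pathlib import Path

# Hypothetical annotation file; the real file names and schema are documented
# at https://vrsbench.github.io and may differ from this sketch.
ANNOTATION_FILE = Path("VRSBench_train.json")


def load_annotations(path: Path):
    """Load an annotation file assumed to be a JSON list of per-image records."""
    with path.open() as f:
        return json.load(f)


def summarize(records):
    """Count annotations for the three tasks: captioning, grounding, and VQA."""
    n_captions = n_refs = n_qa = 0
    for rec in records:
        # Assumed fields: one detailed caption, a list of referred objects,
        # and a list of question-answer pairs per image.
        if rec.get("caption"):
            n_captions += 1
        n_refs += len(rec.get("objects", []))
        n_qa += len(rec.get("qa_pairs", []))
    print(f"images: {len(records)}, captions: {n_captions}, "
          f"object references: {n_refs}, QA pairs: {n_qa}")


if __name__ == "__main__":
    summarize(load_annotations(ANNOTATION_FILE))
```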

Original language: English (US)
State: Published - 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: Dec 9, 2024 - Dec 15, 2024

Conference

Conference: 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 12/9/24 - 12/15/24

Bibliographical note

Publisher Copyright:
© 2024 Neural information processing systems foundation. All rights reserved.

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
