Abstract
Leveraging Large Language Models’ remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities such as vision and audio. However, progress in these directions has mostly focused on tasks that require only a coarse-grained understanding of audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, that comprises 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks, with a relative improvement of up to 37.12%.
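The abstract names optimal transport as the basis of the modality alignment module but does not give its formulation. As a rough illustration of the general idea only, the sketch below computes an entropy-regularised transport plan between two small sets of token embeddings with Sinkhorn iterations. Everything in it is an assumption made for illustration: the cosine cost, the uniform marginals, the `eps` and `n_iters` values, the toy 2-D embeddings, and the function names are hypothetical and are not taken from Meerkat.

```python
import math

def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity of xs[i] and ys[j]."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid dividing by zero.
        return math.sqrt(sum(a * a for a in v)) or 1.0
    return [[1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n          # assumed uniform mass over the first token set
    b = [1.0 / m] * m          # assumed uniform mass over the second token set
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost <plan, cost>, usable as a scalar alignment loss."""
    return sum(p * c for pr, cr in zip(plan, cost) for p, c in zip(pr, cr))

# Toy stand-ins for image-patch and audio-segment embeddings (hypothetical).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.9, 0.1], [0.1, 0.9]]

cost = cosine_cost(image_tokens, audio_tokens)
plan = sinkhorn(cost)
loss = transport_cost(plan, cost)
```

In this toy setup, `plan[i][j]` is the amount of mass moved from image token `i` to audio token `j`, so similar tokens receive most of the mass. In a trainable model, such a loss would normally be computed with an autodiff framework rather than plain Python lists.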
Original language | English (US) |
---|---|
Title of host publication | Computer Vision – ECCV 2024 - 18th European Conference, Proceedings |
Editors | Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 52-70 |
Number of pages | 19 |
ISBN (Print) | 9783031730382 |
State | Published - 2025 |
Event | 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy; Duration: Sep 29 2024 → Oct 4 2024 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 15122 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 18th European Conference on Computer Vision, ECCV 2024 |
---|---|
Country/Territory | Italy |
City | Milan |
Period | 09/29/24 → 10/04/24 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
Keywords
- Audio-Visual LLM
- AV Localization
- AVFIT Dataset
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science