Localizing Moments in Long Video Via Multimodal Guidance

Wayner Barrios*, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Scopus citations

Abstract

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model ana-lyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.

Original languageEnglish (US)
Title of host publicationProceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages13621-13632
Number of pages12
ISBN (Electronic)9798350307184
DOIs
StatePublished - 2023
Event2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: Oct 2 2023Oct 6 2023

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499

Conference

Conference2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/TerritoryFrance
CityParis
Period10/2/2310/6/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'Localizing Moments in Long Video Via Multimodal Guidance'. Together they form a unique fingerprint.

Cite this