VLG-Net: Video-Language Graph Matching Network for Video Grounding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Scopus citations


Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. Solving this challenging task demands understanding the semantic content of both videos and queries, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge as an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built atop video snippets and query tokens separately, which model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling, which fuses the enriched snippet features of each moment. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
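The cross-modal exchange described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `graph_matching_layer`, the plain dot-product affinity, and the residual fusion are illustrative assumptions standing in for the learned Graph Matching layer; the shapes (snippets × features, tokens × features) follow the paper's setup of snippet and token nodes.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax for normalizing edge weights
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_matching_layer(video, query):
    """Hypothetical cross-modal matching step: each video snippet
    attends over all query tokens (matching edges), aggregates a
    query context vector, and fuses it back with a residual sum."""
    affinity = video @ query.T            # (num_snippets, num_tokens)
    weights = softmax(affinity, axis=-1)  # matching-edge weights per snippet
    context = weights @ query             # aggregated query context
    return video + context                # simple residual multi-modal fusion

# toy example: 4 video snippets, 3 query tokens, 8-dim features
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
query = rng.normal(size=(3, 8))
fused = graph_matching_layer(video, query)
print(fused.shape)  # (4, 8): one enriched feature per snippet
```

In the actual model the affinities and fusion are learned, but the data flow is the same: cross-modal edges route query information into snippet representations before moment candidates are pooled.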

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 11
ISBN (Electronic): 9781665401913
State: Published - 2021
Event: 18th IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021 - Virtual, Online, Canada
Duration: Oct 11, 2021 - Oct 17, 2021

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499


Conference: 18th IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021
City: Virtual, Online

Bibliographical note

Funding Information:
This paper addresses the problem of text-to-video temporal grounding, which we cast as an algorithmic graph matching problem. We propose the Video-Language Graph Matching Network (VLG-Net) to match the video and language modalities. We represent each modality as a graph and explore four types of edges: Syntactic Edges, Ordering Edges, Semantic Edges, and Matching Edges, which encode local, non-local, and cross-modality relationships to align the video-query pair. Extensive experiments show that our VLG-Net can model inter- and intra-modality context, learn multi-modal fusion, and surpass the current state-of-the-art performance on three widely used datasets. Acknowledgments: This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
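Of the edge types listed above, Ordering Edges are the simplest to picture: they connect temporally (or sequentially) adjacent nodes within one modality. The sketch below is a hypothetical construction, not the paper's code; the window size and self-loops are illustrative assumptions.

```python
import numpy as np

def ordering_adjacency(n, window=1):
    """Hypothetical ordering-edge adjacency for n snippet/token nodes:
    each node links to neighbors within `window` steps, plus a self-loop."""
    A = np.eye(n)  # self-loops
    for k in range(1, window + 1):
        A += np.eye(n, k=k) + np.eye(n, k=-k)  # links k steps ahead/behind
    return A

print(ordering_adjacency(4).astype(int))
# [[1 1 0 0]
#  [1 1 1 0]
#  [0 1 1 1]
#  [0 0 1 1]]
```

Syntactic Edges would instead follow a dependency parse of the query, and Semantic/Matching Edges connect nodes by learned similarity, but all four feed the same graph-convolution machinery as adjacency structure.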

Publisher Copyright:
© 2021 IEEE.

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition


