Relation-aware Video Reading Comprehension for Temporal Language Grounding

Jialin Gao*, Xin Sun*, Mengmeng Xu, Xi Zhou, Bernard Ghanem

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

33 Scopus citations

Abstract

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture the dependencies among video moment choices for selecting the best choice. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available at https://github.com/Huntersxsx/RaNet.
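The abstract outlines two components: a coarse-and-fine choice-query interactor and a graph-convolution-based multi-choice relation constructor. The sketch below is a minimal PyTorch illustration of these two ideas only; the class names, layer sizes, and the similarity-based adjacency are assumptions for illustration and do not reproduce the released RaNet implementation (see the repository link above for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChoiceQueryInteractor(nn.Module):
    """Fuse each moment choice with the sentence-level query (coarse)
    and attend over token-level query features (fine)."""
    def __init__(self, dim):
        super().__init__()
        self.coarse = nn.Linear(2 * dim, dim)
        self.fine_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, choices, sent_feat, token_feats):
        # choices: (B, N, D) moment-choice features
        # sent_feat: (B, D) sentence-level query feature
        # token_feats: (B, T, D) token-level query features
        coarse = self.coarse(torch.cat(
            [choices, sent_feat.unsqueeze(1).expand_as(choices)], dim=-1))
        fine, _ = self.fine_attn(choices, token_feats, token_feats)
        return coarse + fine

class MultiChoiceRelationConstructor(nn.Module):
    """One graph-convolution layer over a fully connected choice graph,
    with an adjacency built from feature similarity (an assumption here)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, D); row-normalized similarity as soft adjacency
        adj = F.softmax(torch.bmm(x, x.transpose(1, 2)), dim=-1)
        return F.relu(self.proj(torch.bmm(adj, x)) + x)

# Usage sketch with random tensors (shapes are placeholders)
B, N, T, D = 2, 16, 10, 256
choices, sent, tokens = torch.randn(B, N, D), torch.randn(B, D), torch.randn(B, T, D)
fused = ChoiceQueryInteractor(D)(choices, sent, tokens)
related = MultiChoiceRelationConstructor(D)(fused)
scores = nn.Linear(D, 1)(related).squeeze(-1)  # stand-in per-choice scoring head

The scoring head at the end is only a stand-in; the point of the sketch is that choice-query fusion and choice-choice graph reasoning operate on the same set of candidate moments before the best choice is selected.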

Original language: English (US)
Title of host publication: EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 3978-3988
Number of pages: 11
ISBN (Electronic): 9781955917094
State: Published - 2021
Event: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 - Virtual, Punta Cana, Dominican Republic
Duration: Nov 7, 2021 to Nov 11, 2021

Publication series

Name: EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021
Country/Territory: Dominican Republic
City: Virtual, Punta Cana
Period: 11/7/21 to 11/11/21

Bibliographical note

Funding Information:
First of all, I would like to give my heartfelt thanks to all the people who have helped me with this paper. The support from CloudWalk Technology Co., Ltd is gratefully acknowledged. This work was also supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.

Publisher Copyright:
© 2021 Association for Computational Linguistics

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
