Abstract
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be available at https://github.com/Huntersxsx/RaNet.
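
The multiple-choice framing in the abstract can be illustrated with a toy sketch. It is not the RaNet implementation: the function names, the inclusive clip-index spans, the IoU threshold, and the neighbor-averaging step (a plain stand-in for the learned graph convolution) are all assumptions made for illustration, and the per-choice scores are supplied by hand rather than produced by a choice-query interactor.

```python
def enumerate_choices(num_clips):
    """Hypothetical answer set: every inclusive (start, end) clip span."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]


def temporal_iou(a, b):
    """Intersection over union of two inclusive clip spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union


def propagate(choices, scores, threshold=0.5):
    """One round of neighbor averaging over a graph that links choices
    whose temporal IoU is at least `threshold` (each choice links to itself).
    This is an assumed simplification, not a learned graph convolution."""
    smoothed = {}
    for a in choices:
        neighbors = [scores[b] for b in choices if temporal_iou(a, b) >= threshold]
        smoothed[a] = sum(neighbors) / len(neighbors)
    return smoothed


def select_best(choices, scores):
    """Pick the moment choice with the highest score."""
    return max(choices, key=lambda c: scores[c])


if __name__ == "__main__":
    choices = enumerate_choices(4)
    # Hand-made matching scores: overlap with an assumed target span (1, 2).
    raw = {c: temporal_iou(c, (1, 2)) for c in choices}
    refined = propagate(choices, raw)
    print(select_best(choices, raw), select_best(choices, refined))
```

In this sketch the relation step only smooths scores among overlapping candidates; a learned model would instead update feature vectors for each choice before scoring them.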
Original language | English (US) |
---|---|
Title of host publication | EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 3978-3988 |
Number of pages | 11 |
ISBN (Electronic) | 9781955917094 |
State | Published - 2021 |
Event | 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 - Virtual, Punta Cana, Dominican Republic. Duration: Nov 7 2021 → Nov 11 2021 |
Publication series
Name | EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings |
---|
Conference
Conference | 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 |
---|---|
Country/Territory | Dominican Republic |
City | Virtual, Punta Cana |
Period | 11/7/21 → 11/11/21 |
Bibliographical note
Funding Information: First of all, I would like to give my heartfelt thanks to all the people who have ever helped me in this paper. The support from CloudWalk Technology Co., Ltd is gratefully acknowledged. This work was also supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
Publisher Copyright:
© 2021 Association for Computational Linguistics
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems