Abstract
Active speaker detection requires a careful integration of multi-modal cues. Current methods focus on modeling and fusing short-term audio-visual features for individual speakers, often at the frame level. We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features (speakers) in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from local information can approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art performance on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.
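To make the abstract's idea concrete, below is a minimal sketch (not the authors' implementation) of such a local graph: one audio node per detected speech event is connected to the candidate speaker faces visible in the same window, and the graph is then extended temporally by linking faces along their tracks. The feature dimensions, the cosine-similarity affinity, the track-based temporal edges, and the greedy assignment rule are all illustrative assumptions.

```python
# Illustrative sketch only: approximates the instantaneous audio-visual
# assignment with a small graph, then extends it over time.
import numpy as np
import networkx as nx

def build_local_graph(audio_feat, face_feats, t):
    """One audio node connected to every visible face at timestep t."""
    G = nx.Graph()
    G.add_node(("audio", t), feat=audio_feat)
    for i, f in enumerate(face_feats):
        G.add_node(("face", t, i), feat=f)
        # Edge weight approximates audio-visual affinity
        # (cosine similarity here; an assumption, not the paper's choice).
        w = float(np.dot(audio_feat, f) /
                  (np.linalg.norm(audio_feat) * np.linalg.norm(f) + 1e-8))
        G.add_edge(("audio", t), ("face", t, i), weight=w)
    return G

def extend_temporally(graphs):
    """Union the per-timestep graphs and link each face to the same track
    index at t-1 (assumes identity tracks are available)."""
    G = nx.compose_all(graphs)
    for t in range(1, len(graphs)):
        for node in graphs[t].nodes:
            if node[0] == "face":
                prev = ("face", t - 1, node[2])
                if prev in G:
                    G.add_edge(node, prev, weight=1.0)  # same-track edge
    return G

def assign_speech(G, t):
    """Greedy instantaneous assignment: the face with the strongest
    audio edge is predicted as the active speaker."""
    audio = ("audio", t)
    faces = [n for n in G.neighbors(audio) if n[0] == "face"]
    return max(faces, key=lambda n: G[audio][n]["weight"]) if faces else None

# Toy usage with random 128-d embeddings for 2 faces over 3 timesteps.
rng = np.random.default_rng(0)
graphs = [build_local_graph(rng.normal(size=128),
                            [rng.normal(size=128) for _ in range(2)], t)
          for t in range(3)]
G = extend_temporally(graphs)
print(assign_speech(G, 1))  # e.g. ("face", 1, 0)
```

In the paper the assignment and temporal reasoning are learned rather than greedy, but the graph layout above mirrors the structure the abstract describes.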
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 265-274 |
Number of pages | 10 |
ISBN (Electronic) | 9781665428125 |
DOIs | |
State | Published - 2021 |
Event | 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada |
Duration | Oct 11 2021 → Oct 17 2021 |
Publication series
Name | Proceedings of the IEEE International Conference on Computer Vision |
---|---|
ISSN (Print) | 1550-5499 |
Conference
Conference | 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 |
---|---|
Country/Territory | Canada |
City | Virtual, Online |
Period | 10/11/21 → 10/17/21 |
Bibliographical note
Funding Information: This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
Publisher Copyright:
© 2021 IEEE
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition