Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

Kilichbek Haydarov*, Xiaoqian Shen, Avinash Madasu, Mahmoud Salem, Li Jia Li, Gamaleldin Elsayed, Mohamed Elhoseiny

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding constructed emotions in response to visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering (2) Dialog-based Emotion Prediction and (3) Affective explanation generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs as well as concluding emotion attributions and dialog-informed textual emotion explanations, resulting in a total of 27,180 working hours. Notably, the dataset spans a broad range of visual stimuli, covering human heritage and contemporary life, with an average per-turn answer length of about 12 words—5 times that of the VisDial dataset—and explanations exceeding 28 words on average. We explain our determining design decisions in collecting the dataset, data inclusion and exclusion criteria starting from over 100K dialogs for quality control, and introduce the questioner and answerer tasks that are associated with the participants in the conversation. We propose and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art multimodal models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. Our project page with the dataset is available through https://affective-visual-dialog.github.io.

Original languageEnglish (US)
Title of host publicationComputer Vision – ECCV 2024 - 18th European Conference, Proceedings
EditorsAleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
PublisherSpringer Science and Business Media Deutschland GmbH
Pages18-36
Number of pages19
ISBN (Print)9783031732256
DOIs
StatePublished - 2025
Event18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy
Duration: Sep 29 2024Oct 4 2024

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume15133 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference18th European Conference on Computer Vision, ECCV 2024
Country/TerritoryItaly
CityMilan
Period09/29/2410/4/24

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Keywords

  • affective computing
  • affective reasoning
  • vision and language

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations'. Together they form a unique fingerprint.

Cite this