AVERE introduces EmoReAlM, a benchmark for audiovisual emotion reasoning, and
AVEm-DPO, a preference optimization framework that aligns multimodal large language
models with emotion-centric audiovisual cues, reducing spurious correlations and
hallucinated content.
π Abstract
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: (i) spurious associations between emotions and irrelevant audiovisual cues (reasoning errors) and (ii) hallucination of audiovisual cues (perception errors) driven by text priors in the language model backbone.
To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cueβemotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over (i) responses exhibiting spurious associations or hallucinations and (ii) audiovisual input pairs guided by textual prompts.
We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations.
Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves the performance of the reference baseline models (6-19% of relative performance) in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI.
π― EmoReAlM Benchmark Samples
Task: Reasoning Basic (Audio)
How does the speaker's choice of words in the video reflect their emotional state?
The speaker mentions struggling to move forward despite past setbacks, indicating a reflective state.
The speaker's tone reflects a somber atmosphere, accompanied by a soft, resigned voice.
The speaker's phrase portrays a deep sense of regret and resignation, reflecting a failed attempt.
The speaker uses soft background music to enhance the somber mood, suggesting unfulfilled efforts.
Correct: C
Task: Reasoning Basic (Visual)
How does the woman's facial expression contribute to the overall feeling in the scene?
The woman displays a joyful expression with open arms, conveying her happiness and openness.
The woman's cheerful smile and lively eyes reveal her happiness and engagement.
The woman's yellow turtleneck adds a vibrant touch, symbolizing her happiness and contentment.
The woman's long dark hair frames her face, enhancing the appearance of happiness and delight.
Correct: B
Task: Modality Agreement
Do the visual elements of the video align with the audio in conveying the feeling of happiness of the person in the video?
Yes
No
Correct: B
Task: Stress Test (Audio No Hallucination)
Do the chuckling sounds in the audio enhance the feeling of joy conveyed for the person in the video?
Yes
No
Correct: A
Task: Stress Test (Visual - Hallucination)
Is the man displaying a clenched fist as a sign of his anger in this video?
Yes
No
Correct: B
π EmoReAlM Leaderboard
Accuracy (%) on EmoReAlM. ↑ Higher is better. Our models (AVEm-DPO) highlighted.
Model
Reas. Basic (A)
Reas. Basic (V)
Mod. Agree.
Stress (A)
Stress (V)
Avg. Acc.
Proprietary Β· I. Closed-source models
Gemini 2.5 Flash
78.0
88.9
57.0
63.5
73.2
72.1
Gemini 2.5 Pro
72.7
87.0
54.7
63.8
73.1
70.3
Vision only Β· II. Open-source video-only models
VideoLLaMA 3
β
86.2
β
β
64.9
β
Qwen 2.5 VL
β
88.1
β
β
75.2
β
InternVL 3.5
β
92.8
β
β
68.3
β
Audio only Β· III. Open-source audio-only models
Qwen 2 Audio
56.6
β
β
55.1
β
β
Kimi-Audio
69.8
β
β
54.0
β
β
Audio Flamingo 3
76.8
β
β
52.6
β
β
Omni Β· IV. Open-source audiovisual (“omni”) models
π Zero-shot performance on DFEW, RAVDESS, MER2023, EMER
Model
Mod.
DFEW
RAVDESS
MER2023
EMER
UAR
WAR
UAR
WAR
F1
Clue
Label
Spurious
Halluc.
Prior work Β· Baselines
VideoLLaMA 2
A,V
43.65
48.66
41.81
31.62
50.79
3.82
3.80
4.25
4.23
OLA
A,V
38.17
41.73
27.45
22.11
55.82
3.80
3.33
3.93
4.22
VITA-1.5
A,V
39.31
42.56
50.67
46.88
66.94
4.77
4.72
5.16
5.70
Qwen-2.5 Omni
A,V
46.94
54.34
32.88
28.05
79.72
5.85
6.78
6.39
6.21
EmotionLLaMA
A,V,T
45.59
59.37
28.20
29.24
90.36
6.03
6.99
5.89
5.26
EmotionLLaMAβ‘
A,V
42.72
54.06
30.36
30.45
89.05
2.76
2.78
3.44
2.36
MoSEAR
A,V,T
44.48
56.60
β
β
90.27
β
β
β
β
Our base Β· AVEm-DPO variants
Our base
A,V
56.78
60.14
53.59
53.01
89.19
5.63
6.45
5.41
5.19
+ Naive-DPO
55.67
59.90
53.63
52.94
88.59
5.81
6.30
5.96
5.48
+ Vista-DPOβ
56.42
62.33
56.94
53.64
90.06
6.08
6.89
6.58
6.07
+ AVEm-DPO
58.54
64.24
58.66
55.48
92.18
6.37
7.08
7.09
6.75
EmotionLLaMA* Β· AVEm-DPO variants
EmotionLLaMA*
A,V
54.89
58.26
52.59
48.12
90.01
5.78
6.21
5.36
5.23
+ Naive-DPO
54.97
58.12
52.69
49.01
89.35
5.89
6.35
5.89
5.62
+ Vista-DPOβ
56.28
61.58
56.42
50.96
91.19
6.05
6.56
6.85
6.31
+ AVEm-DPO
57.06
62.12
56.21
51.03
91.68
6.02
6.99
7.02
6.62
β¨ Qualitative Examples
Input video with ground truth compared to outputs from OLA, VITA-1.5, Qwen 2.5 Omni, VideoLLaMA 2, Emotion LLaMA, Our base, and + AVEm-DPO. Click a model to reveal its response. Our base and + AVEm-DPO are shown by default.
Sample 1
Ground truth:
In the video, the screen shows a young woman in an indoor environment, with a blurred background but some books can be seen stacked on the side. At the beginning of the video, she looks down with a gentle and friendly facial expression, with a noticeable smile that indicates she is happy at the moment. In the following scenes, she looks at the other person and engages in a conversation with them. In the audio, combined with the text content, the character is asking about the other person's whereabouts. In the text, the subtitle says, “Where have you been these days? I thought you were transferred.” This sentence can infer the emotional state of the female character. Based on the obvious smile and happy expression of the female character in the video clues, as well as the tone of the character asking about the other person's whereabouts in the audio clues, we can deduce that this sentence is a form of joking or teasing expression. The female character is likely playfully asking about the other person's whereabouts, expressing her care and curiosity towards them, which aligns with her overall friendly and happy demeanor.
Model responses (click to expand):
OLA
The video features a young girl with long, dark hair, wearing a blue shirt and a white undershirt. She is standing in a dimly lit room, which appears to be an indoor setting, possibly a classroom or a study area. The background includes a stack of books on a table, suggesting an academic environment. The lighting is subdued, creating a somber and contemplative atmosphere. The girl's expression is neutral, and she seems to be engaged in a conversation or listening attentively. The overall mood of the scene is reflective and serious.
VITA-1.5
The video features a young woman with dark hair tied back, wearing a blue collared shirt. She smiles and speaks directly to the camera, her expression shifting slightly as she delivers her message. The background is consistently dark, providing a stark contrast to her face and emphasizing her presence.
Qwen 2.5 Omni
Well, in the video, there's a girl who seems to be in a bit of a state of surprise or maybe even a little bit of anger. Her eyes are wide open, and her mouth is slightly agape. It looks like she's just heard something unexpected or is reacting to someone's question. The way she's looking off to the side gives the impression that she's not really sure what to make of the situation. It's like she's caught off guard. What do you think might have happened to make her react like that?
VideoLLaMA 2
In the dimly lit room, a young girl with a serious expression on her face is seen talking to someone. The atmosphere is tense and filled with anticipation as she speaks. The scene is from the movie “Crazy Eyes” where the girl's character is dealing with a difficult situation.
Emotion LLaMA
The emotional label of the person in the video is “happy”. The clues in the video suggest that the person is happy because she is smiling and laughing while watching a video. The video also shows her holding a book, which could be a sign of her interest in learning or reading. Additionally, the person is surrounded by books, which could indicate that she values education and personal growth. Overall, the video portrays a positive and content person.
Our base
The predicted emotion in the given video is happiness. The young woman can be seen smiling and her eyebrows are slightly raised showing playful eagerness. The dimly lit background contrasts the woman's expressions suggesting overall positivity in the atmosphere. The woman's speech enhances her emotional state with a happy and playful tone of voice.
+ AVEm-DPO (Ours)
The predicted emotion in the given video is happiness. The woman's facial expressions, including her smile and slightly raised eyebrows, suggest a positive emotional state. The tone of her voice is light and friendly, further reinforcing the feeling of happiness. The overall visual and auditory cues point towards a positive and happy interaction.
Sample 2
Ground truth:
In the video, the screen shows a woman. In the video, she frowns and has a negative facial expression. During the conversation, she accompanies it with obvious head and hand movements, indicating that she is emotionally excited and seems to be emphasizing her point of view. Her gaze is directed towards the character below her, and her mouth moves frequently, suggesting that she is arguing with the other person. In the audio, the voice is urgent and the character's emotions are excited and anxious. In the text, the subtitle says, ""Why don't you go after it? Why are you sitting here and arguing with yourself?"" This sentence may be a woman's accusation or criticism of the character below her. Based on the video clues of the woman frowning, having a negative facial expression, accompanied by obvious head and hand movements, and looking at the character below her, it can be inferred that the woman is emotionally excited and seems to be arguing with the other person. At the same time, based on the audio clue describing the urgent and excited emotions, it can further confirm the woman's emotional state. Therefore, this sentence expresses the woman's dissatisfaction and anger, as she blames the other person for not taking action and just sitting in one place.
Model responses (click to expand):
OLA
The video features a woman with long brown hair, wearing a light-colored sweater, who appears to be in a state of distress or contemplation. The background is blurred, suggesting an indoor setting with natural light coming through a window. The overall mood conveyed by the woman's expression and posture is one of concern or deep thought.
VITA-1.5
The video features a woman with long brown hair, wearing a white sweater. She appears to be speaking with an emotional expression, her face conveying a sense of distress or sadness. Her eyes are slightly closed, and her mouth is moving as if she is talking. The background is blurred, suggesting an indoor setting.
Qwen 2.5 Omni
Well, in the video, there's a woman who seems really sad. Her eyes are downcast, and she's not looking at the camera. Her mouth is slightly downturned, which gives off a sense of disappointment or sadness. It's like she's lost in her own thoughts, maybe feeling a bit down about something. What do you think might be making her feel this way?
VideoLLaMA 2
The video shows a woman in various scenes, including crying and talking on the phone. The main source of sound in the video is the woman's voice as she talks to different people and expresses her emotions.
Emotion LLaMA
The emotional label of the person in the video is sad. The clues in the video suggest that the person is watching a video of a woman who is crying, and the woman is looking at the camera with tears in her eyes. This indicates that the person is feeling sad or empathetic towards the woman in the video, as they are watching her cry and experiencing her emotions. The person may be feeling a sense of sadness or empathy towards the woman, which is evident from their facial expressions and body language.
Our base
The predicted emotion in the given video is anger. The woman's facial expressions shows furrowed brows and a tense mouth suggesting her angry state. The lines on her forehead suggest the intensity of her feelings. Her tone of voice is sharp and she speaks in a forceful manner further indicating her frustration and anger.
+ AVEm-DPO (Ours)
The predicted emotion in the given video is anger. The audio features a raised voice and sharp tone, indicating heightened emotional arousal. Visually, the person's facial expressions include furrowed brows, a tense mouth, and a direct, confrontational gaze, all of which are indicative of anger. The rapid speech and forceful delivery of the words further contribute to the perception of anger.
π¦ Project Resources
We are actively preparing the codebase, benchmark release,
and detailed documentation for AVERE.
Expected releases:
Code: training & evaluation for AVEm-DPO on audiovisual emotion datasets.
Benchmark: EmoReAlM QA pairs, and evaluation scripts.
Model checkpoints: AVEm-DPO tuned MLLM.
For updates, please check this page periodically.
π¬ Contact
For questions about AVERE, EmoReAlM, or potential collaborations, please contact:
@inproceedings{chaubey2026avere,
title={AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization},
author={Chaubey, Ashutosh and Pang, Jiacheng and Siniukov, Maksim and Soleymani, Mohammad},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=td682AAuPr}
}
π Acknowledgement
Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-25-2-0040. Work was also in part supported by the National Science Foundation under Grant IIS-2211550 and the National Institute of Mental Health of the National Institutes of Health under Award Number R61MH135407. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, NSF, NIH, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.