
AVERE introduces EmoReAlM, a benchmark for audiovisual emotion reasoning, and AVEm-DPO, a preference optimization framework that aligns multimodal large language models with emotion-centric audiovisual cues, reducing spurious correlations and hallucinated content.

πŸ“ Abstract

Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: (i) spurious associations between emotions and irrelevant audiovisual cues (reasoning errors) and (ii) hallucination of audiovisual cues (perception errors) driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue–emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over (i) responses exhibiting spurious associations or hallucinations and (ii) audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves the zero-shot performance of the reference baseline models, with relative gains of 6-19%. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI.
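For readers unfamiliar with preference optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the AVEm-DPO objective: the prompt-guided preferences over audiovisual inputs and the text-prior regularization term described above are not reproduced here. The function names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. `beta` (assumed value) scales the
    implicit reward margin.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference model does, so the loss is below log(2).
example = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

In this generic form, the loss equals log(2) when the policy and reference agree, and it decreases as the policy assigns relatively more probability to the preferred response.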

🎯 EmoReAlM Benchmark Samples

πŸ† EmoReAlM Leaderboard

Accuracy (%) on EmoReAlM. ↑ Higher is better. Our models (AVEm-DPO) are marked "(Ours)".

| Model | Reas. Basic (A) | Reas. Basic (V) | Mod. Agree. | Stress (A) | Stress (V) | Avg. Acc. |
|---|---|---|---|---|---|---|
| Proprietary · I. Closed-source models | | | | | | |
| Gemini 2.5 Flash | 78.0 | 88.9 | 57.0 | 63.5 | 73.2 | 72.1 |
| Gemini 2.5 Pro | 72.7 | 87.0 | 54.7 | 63.8 | 73.1 | 70.3 |
| Vision only · II. Open-source video-only models | | | | | | |
| VideoLLaMA 3 | - | 86.2 | - | - | 64.9 | - |
| Qwen 2.5 VL | - | 88.1 | - | - | 75.2 | - |
| InternVL 3.5 | - | 92.8 | - | - | 68.3 | - |
| Audio only · III. Open-source audio-only models | | | | | | |
| Qwen 2 Audio | 56.6 | - | - | 55.1 | - | - |
| Kimi-Audio | 69.8 | - | - | 54.0 | - | - |
| Audio Flamingo 3 | 76.8 | - | - | 52.6 | - | - |
| Omni · IV. Open-source audiovisual (“omni”) models | | | | | | |
| VideoLLaMA | 21.7 | 22.2 | 34.1 | 46.1 | 48.8 | 37.1 |
| PandaGPT | 37.4 | 35.7 | 53.7 | 45.8 | 47.1 | 44.0 |
| OneLLM | 42.0 | 55.6 | 54.8 | 56.8 | 62.0 | 54.2 |
| VideoLLaMA2 | 63.1 | 66.8 | 52.6 | 53.7 | 59.4 | 59.1 |
| OLA | 63.2 | 60.4 | 51.7 | 63.5 | 62.3 | 60.2 |
| VITA-1.5 | 63.1 | 84.3 | 51.7 | 63.0 | 66.1 | 65.6 |
| Qwen 2.5 Omni | 76.8 | 89.2 | 52.2 | 64.0 | 67.8 | 70.0 |
| Our base · AVEm-DPO variants | | | | | | |
| Our base | 69.2 | 85.3 | 51.4 | 53.1 | 66.4 | 65.1 |
| Our base + Naive-DPO | 71.3 | 85.9 | 57.3 | 55.6 | 70.6 | 68.1 |
| Our base + Vista-DPO† | 72.4 | 87.8 | 63.1 | 74.1 | 87.0 | 76.9 |
| Our base + AVEm-DPO (Ours) | 77.9 | 92.5 | 68.9 | 82.6 | 94.6 | 83.3 |
| EmotionLLaMA* · AVEm-DPO variants | | | | | | |
| Emot.-LLaMA* | 64.8 | 84.9 | 51.2 | 48.9 | 69.1 | 63.8 |
| Emot.-LLaMA* + Naive-DPO | 67.2 | 85.7 | 56.1 | 53.5 | 71.9 | 66.9 |
| Emot.-LLaMA* + Vista-DPO† | 69.0 | 86.9 | 58.2 | 69.2 | 87.6 | 74.2 |
| Emot.-LLaMA* + AVEm-DPO (Ours) | 76.5 | 89.1 | 65.6 | 77.3 | 91.8 | 80.1 |

Reas. Basic = Reasoning Basic; Stress = Reasoning Stress Test; (A) = Audio, (V) = Visual. *Emot.-LLaMA: emotion-tuned base. †Vista-DPO: video-only DPO baseline.
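The page does not state how Avg. Acc. is computed. As a sanity check under the assumption that it is the unweighted mean of the five task accuracies, the sketch below recomputes it for three leaderboard rows. The helper name `average_accuracy` is illustrative, and not every row necessarily follows this rule exactly.

```python
def average_accuracy(scores):
    """Unweighted mean of per-task accuracies, rounded to one decimal.

    Assumption: Avg. Acc. is a plain mean over the five task columns.
    """
    return round(sum(scores) / len(scores), 1)


# (five task accuracies, reported Avg. Acc.) copied from the leaderboard
rows = {
    "Gemini 2.5 Flash": ([78.0, 88.9, 57.0, 63.5, 73.2], 72.1),
    "Qwen 2.5 Omni": ([76.8, 89.2, 52.2, 64.0, 67.8], 70.0),
    "Our base + AVEm-DPO": ([77.9, 92.5, 68.9, 82.6, 94.6], 83.3),
}

for name, (scores, reported) in rows.items():
    print(f"{name}: recomputed {average_accuracy(scores)}, reported {reported}")
```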

📊 Results

📈 Zero-shot performance on DFEW, RAVDESS, MER2023, and EMER

| Model | Mod. | DFEW UAR | DFEW WAR | RAVDESS UAR | RAVDESS WAR | MER2023 F1 | EMER Clue | EMER Label | EMER Spurious | EMER Halluc. |
|---|---|---|---|---|---|---|---|---|---|---|
| Prior work · Baselines | | | | | | | | | | |
| VideoLLaMA 2 | A,V | 43.65 | 48.66 | 41.81 | 31.62 | 50.79 | 3.82 | 3.80 | 4.25 | 4.23 |
| OLA | A,V | 38.17 | 41.73 | 27.45 | 22.11 | 55.82 | 3.80 | 3.33 | 3.93 | 4.22 |
| VITA-1.5 | A,V | 39.31 | 42.56 | 50.67 | 46.88 | 66.94 | 4.77 | 4.72 | 5.16 | 5.70 |
| Qwen-2.5 Omni | A,V | 46.94 | 54.34 | 32.88 | 28.05 | 79.72 | 5.85 | 6.78 | 6.39 | 6.21 |
| EmotionLLaMA | A,V,T | 45.59 | 59.37 | 28.20 | 29.24 | 90.36 | 6.03 | 6.99 | 5.89 | 5.26 |
| EmotionLLaMA‡ | A,V | 42.72 | 54.06 | 30.36 | 30.45 | 89.05 | 2.76 | 2.78 | 3.44 | 2.36 |
| MoSEAR | A,V,T | 44.48 | 56.60 | - | - | 90.27 | - | - | - | - |
| Our base · AVEm-DPO variants | | | | | | | | | | |
| Our base | A,V | 56.78 | 60.14 | 53.59 | 53.01 | 89.19 | 5.63 | 6.45 | 5.41 | 5.19 |
| + Naive-DPO | | 55.67 | 59.90 | 53.63 | 52.94 | 88.59 | 5.81 | 6.30 | 5.96 | 5.48 |
| + Vista-DPO† | | 56.42 | 62.33 | 56.94 | 53.64 | 90.06 | 6.08 | 6.89 | 6.58 | 6.07 |
| + AVEm-DPO | | 58.54 | 64.24 | 58.66 | 55.48 | 92.18 | 6.37 | 7.08 | 7.09 | 6.75 |
| EmotionLLaMA* · AVEm-DPO variants | | | | | | | | | | |
| EmotionLLaMA* | A,V | 54.89 | 58.26 | 52.59 | 48.12 | 90.01 | 5.78 | 6.21 | 5.36 | 5.23 |
| + Naive-DPO | | 54.97 | 58.12 | 52.69 | 49.01 | 89.35 | 5.89 | 6.35 | 5.89 | 5.62 |
| + Vista-DPO† | | 56.28 | 61.58 | 56.42 | 50.96 | 91.19 | 6.05 | 6.56 | 6.85 | 6.31 |
| + AVEm-DPO | | 57.06 | 62.12 | 56.21 | 51.03 | 91.68 | 6.02 | 6.99 | 7.02 | 6.62 |
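The DFEW and RAVDESS columns report UAR and WAR. Under their usual definitions, UAR (unweighted average recall) is the mean of per-class recalls, and WAR (weighted average recall) weights each class recall by its support, which reduces to overall accuracy. The sketch below illustrates the difference on hypothetical labels; the function name `uar_war` and the toy data are assumptions, not part of the AVERE evaluation code.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    UAR: mean of per-class recall, so every class counts equally.
    WAR: support-weighted mean recall, equal to overall accuracy.
    """
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / count for label, count in support.items()]
    uar = sum(recalls) / len(recalls)
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Hypothetical imbalanced example: a model that always predicts "happy".
y_true = ["happy"] * 8 + ["angry"] * 2
y_pred = ["happy"] * 10
uar, war = uar_war(y_true, y_pred)
print(f"UAR = {100 * uar:.2f}%, WAR = {100 * war:.2f}%")
```

On class-imbalanced emotion datasets the two metrics can diverge sharply, which is why both are typically reported together.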

✨ Qualitative Examples

Input video with ground truth compared to outputs from OLA, VITA-1.5, Qwen 2.5 Omni, VideoLLaMA 2, Emotion LLaMA, Our base, and + AVEm-DPO.

Sample 1

Ground truth:

In the video, the screen shows a young woman in an indoor environment, with a blurred background but some books can be seen stacked on the side. At the beginning of the video, she looks down with a gentle and friendly facial expression, with a noticeable smile that indicates she is happy at the moment. In the following scenes, she looks at the other person and engages in a conversation with them. In the audio, combined with the text content, the character is asking about the other person's whereabouts. In the text, the subtitle says, “Where have you been these days? I thought you were transferred.” This sentence can infer the emotional state of the female character. Based on the obvious smile and happy expression of the female character in the video clues, as well as the tone of the character asking about the other person's whereabouts in the audio clues, we can deduce that this sentence is a form of joking or teasing expression. The female character is likely playfully asking about the other person's whereabouts, expressing her care and curiosity towards them, which aligns with her overall friendly and happy demeanor.

Model responses:

OLA
The video features a young girl with long, dark hair, wearing a blue shirt and a white undershirt. She is standing in a dimly lit room, which appears to be an indoor setting, possibly a classroom or a study area. The background includes a stack of books on a table, suggesting an academic environment. The lighting is subdued, creating a somber and contemplative atmosphere. The girl's expression is neutral, and she seems to be engaged in a conversation or listening attentively. The overall mood of the scene is reflective and serious.
VITA-1.5
The video features a young woman with dark hair tied back, wearing a blue collared shirt. She smiles and speaks directly to the camera, her expression shifting slightly as she delivers her message. The background is consistently dark, providing a stark contrast to her face and emphasizing her presence.
Qwen 2.5 Omni
Well, in the video, there's a girl who seems to be in a bit of a state of surprise or maybe even a little bit of anger. Her eyes are wide open, and her mouth is slightly agape. It looks like she's just heard something unexpected or is reacting to someone's question. The way she's looking off to the side gives the impression that she's not really sure what to make of the situation. It's like she's caught off guard. What do you think might have happened to make her react like that?
VideoLLaMA 2
In the dimly lit room, a young girl with a serious expression on her face is seen talking to someone. The atmosphere is tense and filled with anticipation as she speaks. The scene is from the movie “Crazy Eyes” where the girl's character is dealing with a difficult situation.
Emotion LLaMA
The emotional label of the person in the video is “happy”. The clues in the video suggest that the person is happy because she is smiling and laughing while watching a video. The video also shows her holding a book, which could be a sign of her interest in learning or reading. Additionally, the person is surrounded by books, which could indicate that she values education and personal growth. Overall, the video portrays a positive and content person.
Our base
The predicted emotion in the given video is happiness. The young woman can be seen smiling and her eyebrows are slightly raised showing playful eagerness. The dimly lit background contrasts the woman's expressions suggesting overall positivity in the atmosphere. The woman's speech enhances her emotional state with a happy and playful tone of voice.
+ AVEm-DPO (Ours)
The predicted emotion in the given video is happiness. The woman's facial expressions, including her smile and slightly raised eyebrows, suggest a positive emotional state. The tone of her voice is light and friendly, further reinforcing the feeling of happiness. The overall visual and auditory cues point towards a positive and happy interaction.
Sample 2

Ground truth:

In the video, the screen shows a woman. In the video, she frowns and has a negative facial expression. During the conversation, she accompanies it with obvious head and hand movements, indicating that she is emotionally excited and seems to be emphasizing her point of view. Her gaze is directed towards the character below her, and her mouth moves frequently, suggesting that she is arguing with the other person. In the audio, the voice is urgent and the character's emotions are excited and anxious. In the text, the subtitle says, “Why don't you go after it? Why are you sitting here and arguing with yourself?” This sentence may be a woman's accusation or criticism of the character below her. Based on the video clues of the woman frowning, having a negative facial expression, accompanied by obvious head and hand movements, and looking at the character below her, it can be inferred that the woman is emotionally excited and seems to be arguing with the other person. At the same time, based on the audio clue describing the urgent and excited emotions, it can further confirm the woman's emotional state. Therefore, this sentence expresses the woman's dissatisfaction and anger, as she blames the other person for not taking action and just sitting in one place.

Model responses:

OLA
The video features a woman with long brown hair, wearing a light-colored sweater, who appears to be in a state of distress or contemplation. The background is blurred, suggesting an indoor setting with natural light coming through a window. The overall mood conveyed by the woman's expression and posture is one of concern or deep thought.
VITA-1.5
The video features a woman with long brown hair, wearing a white sweater. She appears to be speaking with an emotional expression, her face conveying a sense of distress or sadness. Her eyes are slightly closed, and her mouth is moving as if she is talking. The background is blurred, suggesting an indoor setting.
Qwen 2.5 Omni
Well, in the video, there's a woman who seems really sad. Her eyes are downcast, and she's not looking at the camera. Her mouth is slightly downturned, which gives off a sense of disappointment or sadness. It's like she's lost in her own thoughts, maybe feeling a bit down about something. What do you think might be making her feel this way?
VideoLLaMA 2
The video shows a woman in various scenes, including crying and talking on the phone. The main source of sound in the video is the woman's voice as she talks to different people and expresses her emotions.
Emotion LLaMA
The emotional label of the person in the video is sad. The clues in the video suggest that the person is watching a video of a woman who is crying, and the woman is looking at the camera with tears in her eyes. This indicates that the person is feeling sad or empathetic towards the woman in the video, as they are watching her cry and experiencing her emotions. The person may be feeling a sense of sadness or empathy towards the woman, which is evident from their facial expressions and body language.
Our base
The predicted emotion in the given video is anger. The woman's facial expressions shows furrowed brows and a tense mouth suggesting her angry state. The lines on her forehead suggest the intensity of her feelings. Her tone of voice is sharp and she speaks in a forceful manner further indicating her frustration and anger.
+ AVEm-DPO (Ours)
The predicted emotion in the given video is anger. The audio features a raised voice and sharp tone, indicating heightened emotional arousal. Visually, the person's facial expressions include furrowed brows, a tense mouth, and a direct, confrontational gaze, all of which are indicative of anger. The rapid speech and forceful delivery of the words further contribute to the perception of anger.

📦 Project Resources

We are actively preparing the codebase, benchmark release, and detailed documentation for AVERE.

Expected releases:

  • Code: training and evaluation for AVEm-DPO on audiovisual emotion datasets.
  • Benchmark: EmoReAlM QA pairs and evaluation scripts.
  • Model checkpoints: AVEm-DPO-tuned MLLM.

For updates, please check this page periodically.

💬 Contact

For questions about AVERE, EmoReAlM, or potential collaborations, please contact:

Ashutosh Chaubey (achaubey@usc.edu)

📎 BibTeX

@inproceedings{chaubey2026avere,
  title={AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization},
  author={Chaubey, Ashutosh and Pang, Jiacheng and Siniukov, Maksim and Soleymani, Mohammad},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=td682AAuPr}
}

πŸ™ Acknowledgement

Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-25-2-0040. Work was also in part supported by the National Science Foundation under Grant IIS-2211550 and the National Institute of Mental Health of the National Institutes of Health under Award Number R61MH135407. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, NSF, NIH, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.