MT-Video-Bench

A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Introduction


The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. MT-Video-Bench assesses six core competencies centered on perceptivity and interactivity, spanning 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are closely aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate a range of state-of-the-art open-source and closed-source MLLMs, revealing significant performance discrepancies and limitations in handling multi-turn video dialogues.

Leaderboard

OR: Object Reference | MR: Memory Recall | CS: Content Summarization | AR: Answer Refusal | TS: Topic Shifting | PI: Proactive Interaction

By default, the leaderboard is sorted by the Overall score; click the corresponding cell to sort by another metric.

All scores are percentages. Perceptivity covers OR, MR, and CS; Interactivity covers AR, TS, and PI.

| Model | Overall | OR | MR | CS | AR | TS | PI |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 68.45 | 66.13 | 67.80 | 80.49 | 67.50 | 73.67 | 55.12 |
| Gemini 2.5 Flash | 63.30 | 63.44 | 63.41 | 73.48 | 64.32 | 68.12 | 57.04 |
| Doubao-Seed-1.6-vision | 58.55 | 66.19 | 60.85 | 68.95 | 43.84 | 65.99 | 45.50 |
| Qwen2.5-VL-72B | 58.48 | 60.60 | 56.40 | 74.20 | 57.07 | 64.27 | 38.35 |
| InternVL3.5-38B (Think) | 58.11 | 60.87 | 60.36 | 69.90 | 46.86 | 65.17 | 45.51 |
| Qwen2.5-VL-32B | 57.88 | 60.20 | 59.63 | 74.88 | 50.71 | 63.41 | 38.47 |
| InternVL3.5-8B (Think) | 56.29 | 57.81 | 54.82 | 73.18 | 47.62 | 62.50 | 41.84 |
| Qwen2.5-VL-7B | 53.12 | 56.18 | 49.99 | 67.21 | 52.20 | 57.20 | 35.92 |
| InternVL3.5-4B (Think) | 52.25 | 54.94 | 53.78 | 67.50 | 37.74 | 54.67 | 44.89 |
| InternVL3.5-38B (No Think) | 50.04 | 52.51 | 46.37 | 61.86 | 44.24 | 58.78 | 36.46 |
| InternVL3.5-8B (No Think) | 49.35 | 51.71 | 46.95 | 61.50 | 40.83 | 57.23 | 37.85 |
| LLaVA-Video-7B | 49.17 | 53.85 | 43.57 | 63.64 | 41.32 | 56.67 | 35.98 |
| MiniCPM-o | 48.41 | 55.06 | 43.27 | 61.59 | 34.58 | 57.53 | 38.43 |
| Qwen2.5-VL-3B | 48.07 | 50.64 | 43.54 | 65.82 | 46.80 | 50.33 | 31.30 |
| MiniCPM-V4.5 | 47.06 | 51.57 | 43.08 | 56.17 | 38.46 | 52.58 | 40.47 |
| InternVideo2.5-8B | 47.04 | 44.87 | 43.49 | 60.33 | 45.23 | 54.81 | 33.50 |
| VideoLLaMA3-7B | 46.06 | 52.06 | 42.40 | 55.74 | 45.23 | 48.25 | 32.69 |
| InternVL3.5-4B (No Think) | 45.90 | 46.03 | 46.19 | 61.30 | 30.41 | 55.72 | 35.74 |
| LLaVA-OneVision-7B | 45.75 | 50.01 | 43.36 | 59.34 | 32.79 | 55.44 | 33.56 |
| VideoChat-Flash-7B | 41.11 | 47.92 | 39.33 | 51.14 | 28.02 | 48.27 | 32.01 |
| LLaVA-NeXT-Video-7B | 38.04 | 43.05 | 36.04 | 48.58 | 27.60 | 42.94 | 30.00 |

Benchmark

Data Examples

Benchmark Statistics


Benchmark Comparison


Comparison with other benchmarks. Avg. Q/V: the average number of QA pairs per video. Long: whether the average video length is greater than 10 minutes. Cross-Scene: whether the dialogue covers more than 4 scenes.
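
To make these column definitions concrete, below is a minimal sketch of how they could be computed from per-dialogue metadata. The field names (`num_qa_pairs`, `duration_sec`, `num_scenes`) are hypothetical placeholders, not the released annotation schema.

```python
# Minimal sketch (assumed metadata fields, not the released schema):
# computes the comparison columns defined in the caption above.

def comparison_columns(dialogues):
    """dialogues: list of dicts with hypothetical keys
    'num_qa_pairs', 'duration_sec', and 'num_scenes' per video dialogue."""
    n = len(dialogues)
    avg_q_per_video = sum(d["num_qa_pairs"] for d in dialogues) / n
    avg_len_min = sum(d["duration_sec"] for d in dialogues) / n / 60.0
    return {
        "Avg. Q/V": round(avg_q_per_video, 1),   # average QA pairs per video
        "Long": avg_len_min > 10,                # average video length > 10 minutes
        "Cross-Scene": any(d["num_scenes"] > 4   # a dialogue covers more than 4 scenes
                           for d in dialogues),
    }
```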

Experimental Results

Comparison of Single-Scene and Cross-Scene


Different Video Lengths


Effect of Dialogue Context


Different Numbers of Frames


Different Resolutions


Citation


    @misc{pan2025mtvideobenchholisticvideounderstanding,
      title={MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues}, 
      author={Yaning Pan and Zekun Wang and Qianqian Xie and Yongqian Wen and Yuanxing Zhang and Guohui Zhang and Haoxuan Hu and Zhiyu Pan and Yibing Huang and Zhidong Gan and Yonghong Lin and An Ping and Tianhao Peng and Jiaheng Liu},
      year={2025},
      eprint={2510.17722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.17722}, 
    }