MT-Video-Bench

Introduction

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues.

Leaderboard

OR: Object Reference MR: Memory Recall CS: Content Memory AR: Answer Refusal TS: Topic Shifting PI: Proactive interaction

By default, this leaderboard is sorted by overall. To view other sorted results, please click on the corresponding cell.

#	Models	Overall (%)	Perceptivity (%)			Interactivity (%)
#	Models	Overall (%)	OR	MR	CS	AR	TS	PI
-	Gemini 2.5 Pro	68.45	66.13	67.80	80.49	67.50	73.67	55.12
-	Gemini 2.5 Flash	63.30	63.44	63.41	73.48	64.32	68.12	57.04
-	Doubao-Seed-1.6-vision	58.55	66.19	60.85	68.95	43.84	65.99	45.50
-	Qwen2.5-VL-72B	58.48	60.60	56.40	74.20	57.07	64.27	38.35
-	InternVL3.5-38B (Think)	58.11	60.87	60.36	69.90	46.86	65.17	45.51
-	Qwen2.5-VL-32B	57.88	60.20	59.63	74.88	50.71	63.41	38.47
-	InternVL3.5-8B (Think)	56.29	57.81	54.82	73.18	47.62	62.50	41.84
-	Qwen2.5-VL-7B	53.12	56.18	49.99	67.21	52.20	57.20	35.92
-	InternVL3.5-4B (Think)	52.25	54.94	53.78	67.50	37.74	54.67	44.89
-	InternVL3.5-38B (No Think)	50.04	52.51	46.37	61.86	44.24	58.78	36.46
-	InternVL3.5-8B (No Think)	49.35	51.71	46.95	61.50	40.83	57.23	37.85
-	LLaVA-Video-7B	49.17	53.85	43.57	63.64	41.32	56.67	35.98
-	MiniCPM-o	48.41	55.06	43.27	61.59	34.58	57.53	38.43
-	Qwen2.5-VL-3B	48.07	50.64	43.54	65.82	46.80	50.33	31.30
-	MiniCPM-V4.5	47.06	51.57	43.08	56.17	38.46	52.58	40.47
-	InternVideo2.5-8B	47.04	44.87	43.49	60.33	45.23	54.81	33.50
-	VideoLLaMA3-7B	46.06	52.06	42.40	55.74	45.23	48.25	32.69
-	InternVL3.5-4B (No Think)	45.90	46.03	46.19	61.30	30.41	55.72	35.74
-	LLaVA-OneVision-7B	45.75	50.01	43.36	59.34	32.79	55.44	33.56
-	VideoChat-Flash-7B	41.11	47.92	39.33	51.14	28.02	48.27	32.01
-	LLaVA-NeXT-Video-7B	38.04	43.05	36.04	48.58	27.60	42.94	30.00

Data Examples

Video Link: Watch the video

Object Reference:

User: In the early part of the video clip from 00:00:00 to 00:02:21, a man in a formal black tuxedo appears. What is his apparent profession, and what is he doing in his first appearance?

Assistant: The man in the formal black tuxedo is a waiter. In his first appearance around 00:00:09, he is smiling and taking an order from a woman seated at a table in an elegant cafe.

User: According to the man who introduces himself as the managing director, what is the name of the cafe where this waiter works, and what cultural comparison does he make to describe its importance to the city?

Assistant: The managing director, Berndt Querfeld, states that the establishment is Café Landtmann. He compares its cultural importance for Vienna to that of pubs in England or bars in Paris, describing it as a primary place for people to gather.

User: We see the same waiter again later in the video, attending to different customers. What is he doing during this subsequent appearance, and what object is he carrying?

Assistant: Later in the video (around 00:01:57), the waiter is seen serving another table. He is holding a round, silver tray with a white cloth on it.

User: A customer being interviewed describes the service provided by people like him. What specific historical quality does she say the service possesses? (Object Reference)

Assistant: A female customer states that the service provided by the waiters like him (the waiter in the tuxedo) \"still has something of the imperial feel to it,\" connecting the cafe's atmosphere to Vienna's imperial history.

User: Considering that customer's comment about the 'imperial feel' and the overall depiction of the cafe, why is his formal attire and professional service style significant for the establishment?

Assistant: His (the waiter's) formal attire and professional service are significant because they embody the long-standing, traditional Viennese coffeehouse culture that Café Landtmann aims to preserve. This 'imperial feel' is a deliberate part of the customer experience, elevating the cafe from a simple eatery to a historical and cultural institution, as described by both the owner and patrons.

User: What visual evidence does the video provide to illustrate the historical depth of the tradition he is a part of? (Object Reference)

Assistant: The video presents historical evidence through a series of old paintings and black-and-white photographs that depict scenes from Viennese coffee houses of the past. One photograph is explicitly credited to © Café Landtmann, directly linking this long tradition to the specific establishment where he (the waiter) works.

Video Link: Watch the video

Answer Refusal:

User: What is the woman with blonde hair doing at the beginning of the video clip from 00:03:03 to 00:03:26, and what can you tell me about the location based on the signs visible?

Assistant: At the beginning of the video, a woman with long blonde hair, wearing a white t-shirt and blue jeans, is carefully making her way across a hanging obstacle course. She is stepping on suspended platforms above a pit filled with foam blocks. Based on the pink padded columns, the location is a 'JOY FUNCITY TRAMPOLINE PARK'.

User: Does the woman in the white t-shirt successfully complete the entire obstacle course and reach the other side in the video clip from 00:03:03 to 00:03:26? (Answer Refusal)

Assistant: I cannot confirm that. The video shows her successfully navigating several of the hanging platforms and smiling at the camera, but it cuts away and follows other people before she is shown reaching the final platform on the other side.

User: Okay, while she is on the course, what do the illustrated signs hanging from the ceiling above her depict?

Assistant: The signs hanging from the ceiling structure above the obstacle course have text and illustrations. Two of them are clearly visible: one says 'BASKETBALL' and another says 'COLLISION BALL'.

User: For safety, what color is the helmet the woman is wearing while on the 'COLLISION BALL' course in the video clip from 00:03:03 to 00:03:26? (Answer Refusal)

Assistant: I cannot answer that question because the woman is not wearing a helmet in the video. She is navigating the course without any visible safety gear like a helmet or harness.

User: After the camera pans away from the woman, who are the other people seen near the obstacle course toward the end of the video?

Assistant: Towards the end of the video, the camera shows other people in the park. A child in a blue t-shirt and another person are in a different foam pit area. Then, a man wearing a black sleeveless t-shirt, shorts, and glasses is seen standing near the end of the same obstacle course the woman was on.

User: Does the man in the black t-shirt also attempt the obstacle course after the woman in the video clip from 00:03:03 to 00:03:26? (Answer Refusal)

Assistant: The video does not show the man in the black t-shirt attempting the obstacle course. He is only seen standing near the end of it, looking on, before the video ends.

Video Link: Watch the video

Memory Recall:

User: Describe the main character's initial activity in the video clip from 00:04:50 to 00:08:30 and his attire when he is seen standing outdoors with a friend afterwards.

Assistant: The video begins with a first-person perspective of the main character going down an enclosed, translucent blue water slide. Immediately after, he is seen outdoors on the deck of what appears to be a cruise ship, standing next to a friend. In this scene, he is no longer shirtless; he is wearing a white tank top with a small yellow logo, a green baseball cap worn forwards, and is holding a blue beverage can.

User: Following the scene where he is wearing the white tank top and green cap, what is the next major recreational activity the main character prepares for and participates in, and what specific equipment is he fitted with?

Assistant: After talking with his friend by the water park, the man prepares for and participates in a high ropes or zipline course suspended over the ship's deck. An instructor is shown fitting him with a red and black safety harness. The video then shows a first-person view of him navigating the course, walking on narrow platforms and looking down at the deck below.

User: You mentioned earlier that the man was wearing a white tank top and a green cap. Can you recall the context of that observation, and what was he doing immediately before he was seen in that outfit? (Memory Recall)

Assistant: I mentioned his attire in **Round 1**. He was seen wearing the white tank top and green cap while standing on the ship's deck with a friend, holding a beverage can. This was immediately after he had finished his ride on the enclosed blue water slide, where he was shirtless.

User: Considering the man was fitted with a red and black safety harness for the ropes course, what does this preparation suggest about the nature of the activity, and how does his first-person perspective from the course emphasize this?

Assistant: Being fitted with a safety harness indicates that the ropes course is a high-stakes, elevated activity where falling is a potential risk, necessitating safety precautions. The first-person perspective from the course reinforces this by showing the significant height, with direct views of the ship's deck and lounge chairs far below, creating a sense of thrill and potential danger.

User: The video later shows a scene labeled \"Day 3\" which takes place in a completely different environment. Describe this new setting, the central action being performed by the chef, and the main character's state.

Assistant: The "Day 3" scene is set in an indoor teppanyaki restaurant. A chef in a black uniform is the focus, shown skillfully cooking food on a large, flat griddle for a group of patrons. The main character, now wearing a white long-sleeved shirt and a green cap, is seated as a customer at the grill, relaxed and watching the chef's performance.

User: You previously detailed how the man was fitted into a red and black safety harness. Can you remind me of the context for this event and contrast his physical state and clothing in that moment with his appearance at the teppanyaki restaurant? (Memory Recall)

Assistant: I described the safety harness in **Round 2**. It was for the high ropes course on the sunny, open deck of the ship. In that scene, he was dressed for an athletic activity in a tank top and was in a state of active preparation. In contrast, at the teppanyaki restaurant, he is in a relaxed, seated state indoors. He has changed into a white long-sleeved shirt, signaling a shift from a physically demanding daytime activity to a more leisurely evening dining experience.

Video Link: Watch the video

Topic Shifting:

User: What happens in the first 15 seconds of the video clip from 00:00:00 to 00:01:58? Describe the sequence of events involving the three bears.

Assistant: In their cave, Grizzly picks up an empty cardboard box and speaks to Ice Bear. At 00:00:10, Panda Bear suddenly appears from one of the rooms, floating inside a large, transparent, protective bubble. He seems happy and carefree inside it.

User: That bubble seems to be for germ protection. This reminds me of hygiene. What is the single most effective way to prevent the spread of infectious diseases, according to public health organizations? (Topic Shifting)

Assistant: According to public health organizations like the WHO and CDC, the single most effective way to prevent the spread of infectious diseases is regular and thorough handwashing with soap and water. If soap and water are not available, using an alcohol-based hand sanitizer with at least 60% alcohol is the next best option.

User: Returning to the video, how does Panda's bubble prove useful in the different environments they visit, specifically the garbage alley and the crowded bus?

Assistant: The bubble proves very effective. In the garbage alley (00:00:39), Grizzly and Ice Bear are disgusted by the smell and the flies, holding their noses. Panda, however, remains happy and unaffected inside his bubble. Similarly, on the crowded bus (00:00:53), his brothers are uncomfortably squished against other passengers, while Panda has his own personal space and is perfectly content.

User: The scene where the seagulls attack the donuts is chaotic. Speaking of birds, what is the 'V' formation that migratory birds like geese fly in, and what are its aerodynamic advantages? (Topic Shifting)

Assistant: The 'V' formation is a specific flight pattern used by migratory birds. The aerodynamic advantage is that each bird (except the leader) flies in the updraft created by the wingtip vortices of the bird in front of it. This reduces drag and allows the birds to conserve a significant amount of energy, enabling them to fly much longer distances. The birds also rotate the lead position, as it is the most physically demanding.

User: Back to the clip. In the forest scene starting around 00:01:15, a new character appears. Who is this character, and how does he interact with Panda and his bubble?

Assistant: The character is Charlie, a tall, gray, Bigfoot-like creature. He is fascinated by Panda's bubble. He approaches, gestures inquisitively, and then begins to touch and play with the bubble. He pokes it, rubs his hands on it, and even pushes Panda, causing him to roll around inside it. His interaction is invasive and makes Panda very uncomfortable and scared, especially when Charlie reveals his dirty mouth and hands.

User: What is the reason Charlie is so interested in the bubble, and why do the other bears eventually pull him away?

Assistant: Charlie is interested in the bubble because he sees it as a way to protect himself from germs. He shows his own hand, which has green germs on it, implying he wants the same protection Panda has. However, his methods are unsanitary—he licks the bubble and tries to touch Panda with his dirty hands. Grizzly pulls him away (00:01:50) because Charlie is clearly distressing Panda and trying to breach the bubble's sterile environment with his own germs, defeating its entire purpose.

Benchmark Statistics

Benchmark Comparison

Comparison with other benchmarks. Avg. Q/V: the average number of QA pairs per video. Long: whether the average video length is greater than 10 minutes. Cross-Scene: whether the dialogue covers more than 4 scenes.

Comparison of Single Scene and Cross scene

Different Video Lengths

Effect of Dialogue Context

Different Numbers of Frames

Different Numbers of Resolutions


    @misc{pan2025mtvideobenchholisticvideounderstanding,
      title={MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues}, 
      author={Yaning Pan and Zekun Wang and Qianqian Xie and Yongqian Wen and Yuanxing Zhang and Guohui Zhang and Haoxuan Hu and Zhiyu Pan and Yibing Huang and Zhidong Gan and Yonghong Lin and An Ping and Tianhao Peng and Jiaheng Liu},
      year={2025},
      eprint={2510.17722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.17722}, 
    }

MT-Video-Bench

A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Introduction

Leaderboard

Benchmark

Data Examples

Benchmark Statistics

Benchmark Comparison

Experimental Results

Comparison of Single Scene and Cross scene

Different Video Lengths

Effect of Dialogue Context

Different Numbers of Frames

Different Numbers of Resolutions

Citation