Taming Hallucinations in Video Understanding with Counterfactual Video Generation

Published: Dec 30, 2025 14:53
1 min read
ArXiv

Analysis

This paper addresses a critical problem in Multimodal Large Language Models (MLLMs): visual hallucinations in video understanding, particularly in counterfactual scenarios where the depicted events contradict real-world expectations. The authors propose DualityForge, a framework for synthesizing counterfactual video data, and DNA-Train, a training regime that uses this data to mitigate hallucinations. The approach is significant because it addresses the imbalance between factual and counterfactual examples in existing training data and provides a method for generating high-quality training pairs, yielding improved performance on both hallucination and general-purpose benchmarks. The open-sourcing of the dataset and code further enhances the impact of this work.
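To make the data-synthesis idea concrete, the sketch below shows one way a paired factual/counterfactual training example could be represented. The schema and field names are hypothetical illustrations, not DualityForge's actual format.

```python
from dataclasses import dataclass

@dataclass
class CounterfactualPair:
    """A paired training example: a real clip and a generated
    counterfactual counterpart that share the same question.
    All field names are illustrative assumptions, not the
    paper's actual schema."""
    video_path: str             # original, factual clip
    cf_video_path: str          # synthesized counterfactual clip
    question: str               # identical question for both clips
    factual_answer: str         # grounded answer for the real clip
    counterfactual_answer: str  # answer describing the edited event

# Hypothetical example: the model must describe what is actually
# shown rather than default to real-world priors.
pair = CounterfactualPair(
    video_path="clips/cup_drop.mp4",
    cf_video_path="clips/cup_drop_cf.mp4",
    question="What happens to the cup after it leaves the hand?",
    factual_answer="It falls to the floor and shatters.",
    counterfactual_answer="It floats upward instead of falling.",
)
```

The intuition behind such pairing is that a model leaning on language priors would answer both clips the same way, so counterfactual examples push it to ground its answers in the actual visual evidence.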

Reference

The paper demonstrates a 24.0% relative reduction in model hallucinations on counterfactual videos compared to the Qwen2.5-VL-7B baseline.
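For context on how such a figure is computed, a relative improvement measures the change against the baseline value. The rates below are made-up placeholders chosen only to reproduce the 24.0% figure; the paper itself reports only the relative number.

```python
# Relative improvement = (baseline_rate - new_rate) / baseline_rate
baseline_rate = 0.50   # assumed hallucination rate of Qwen2.5-VL-7B
improved_rate = 0.38   # assumed rate after DualityForge + DNA-Train

relative_improvement = (baseline_rate - improved_rate) / baseline_rate
print(f"{relative_improvement:.1%}")  # -> 24.0%
```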