Taming Hallucinations in Video Understanding with Counterfactual Video Generation
Analysis
This paper addresses a critical problem in Multimodal Large Language Models (MLLMs): visual hallucinations in video understanding, particularly in counterfactual scenarios. The authors propose a novel framework, DualityForge, to synthesize counterfactual video data, and a training regime, DNA-Train, to mitigate these hallucinations. The approach is significant because it tackles the imbalance of counterfactual examples in training data and provides a method for generating high-quality training data, leading to improved performance on both hallucination and general-purpose benchmarks. The open-sourcing of the dataset and code further enhances the impact of this work.
Key Takeaways
- Addresses the problem of visual hallucinations in MLLMs for video understanding.
- Introduces DualityForge, a framework for synthesizing counterfactual video data.
- Proposes DNA-Train, a training regime to reduce hallucinations.
- Demonstrates significant improvements on hallucination and general-purpose benchmarks.
- Open-sources the dataset and code for broader accessibility.
“The paper demonstrates a 24.0% relative improvement in reducing model hallucinations on counterfactual videos compared to the Qwen2.5-VL-7B baseline.”