Process-Aware Evaluation for Video Reasoning
Published: Dec 31, 2025 16:31
•ArXiv
Analysis
This paper addresses a critical issue in evaluating video generation models: they can reach correct outcomes through incorrect reasoning processes (outcome-hacking). Its two main contributions are VIPER, a new benchmark built around a process-aware evaluation paradigm, and the Process-outcome Consistency (POC@r) metric. The findings highlight the limitations of current models and the need for more robust reasoning capabilities.
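To make the distinction between outcome accuracy and process-outcome consistency concrete, here is a minimal sketch. It is not the paper's definition of POC@r: this summary does not specify how the metric is computed or what the parameter r controls, so r is left out. The `Sample` type, its two boolean labels, and all function names are hypothetical, and the example data is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One generated video, with two hypothetical judge labels."""
    outcome_correct: bool  # assumed label: the final result is right
    process_valid: bool    # assumed label: the intermediate steps are right


def _fraction(samples, predicate):
    """Share of samples satisfying the predicate."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if predicate(s)) / len(samples)


def outcome_accuracy(samples):
    """Outcome-only score: ignores how the result was reached."""
    return _fraction(samples, lambda s: s.outcome_correct)


def process_outcome_consistency(samples):
    """Process-aware score: counts a sample only if both labels hold."""
    return _fraction(samples, lambda s: s.outcome_correct and s.process_valid)


def outcome_hacking_rate(samples):
    """Share of samples with a right outcome but a wrong process."""
    return _fraction(samples, lambda s: s.outcome_correct and not s.process_valid)


# Invented labels for five samples, for illustration only.
samples = [
    Sample(outcome_correct=True, process_valid=True),
    Sample(outcome_correct=True, process_valid=False),
    Sample(outcome_correct=True, process_valid=False),
    Sample(outcome_correct=False, process_valid=False),
    Sample(outcome_correct=False, process_valid=True),
]

print(outcome_accuracy(samples))
print(process_outcome_consistency(samples))
print(outcome_hacking_rate(samples))
```

Under these assumptions, the outcome-only score counts three of the five samples while the process-aware score counts only one; the two samples in between are the outcome-hacked cases that an outcome-only evaluation would miss.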
Key Takeaways
- Proposes VIPER, a new benchmark for evaluating Generative Video Reasoning (GVR).
- Introduces the Process-outcome Consistency (POC@r) metric to assess reasoning processes.
- Highlights the prevalence of outcome-hacking in current video generation models.
- Demonstrates a significant gap between current models and true generalized visual reasoning.
Reference
“State-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.”