Process-Aware Evaluation for Video Reasoning
Research Paper | Video Generation, Reasoning, Evaluation | Analyzed: Jan 3, 2026 06:19
Published: Dec 31, 2025 16:31
Source: ArXiv
Analysis
This paper addresses a critical problem in evaluating video generation models: a model can reach a correct outcome through an incorrect reasoning process, a failure mode termed outcome-hacking. Its main contributions are VIPER, a new benchmark built around a process-aware evaluation paradigm, and the Process-outcome Consistency (POC@r) metric. The findings highlight the limitations of current models and the need for more robust reasoning capabilities.
Key Takeaways
- Proposes VIPER, a new benchmark for evaluating Generative Video Reasoning (GVR).
- Introduces the Process-outcome Consistency (POC@r) metric to assess reasoning processes.
- Highlights the prevalence of outcome-hacking in current video generation models.
- Demonstrates a significant gap between current models and true generalized visual reasoning.
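To make the distinction between outcome accuracy and process-aware scoring concrete, here is a minimal Python sketch. It rests on stated assumptions and is not the paper's implementation: this summary does not define POC@r or the meaning of r, so the sketch does not model r at all. It assumes only that each generated video has already been judged twice, once for its final outcome and once for its intermediate process. The names `JudgedSample` and `consistency_rates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    """Hypothetical per-video judgment; both fields are assumed inputs."""
    outcome_correct: bool  # the final result matches the expected answer
    process_correct: bool  # the intermediate steps were judged valid


def consistency_rates(samples):
    """Return outcome accuracy, joint process-outcome rate, and hacking rate.

    A sample counts as consistent only when BOTH judgments are true.
    A sample counts as outcome-hacked when the outcome is correct but
    the process is not.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one judged sample")
    outcome = sum(s.outcome_correct for s in samples)
    both = sum(s.outcome_correct and s.process_correct for s in samples)
    hacked = outcome - both  # correct outcome reached by an invalid process
    return {
        "outcome_accuracy": outcome / n,
        "process_outcome_consistency": both / n,
        "outcome_hacking_rate": hacked / n,
    }


if __name__ == "__main__":
    # Made-up judgments, for illustration only (not results from the paper).
    demo = [
        JudgedSample(True, True),
        JudgedSample(True, False),
        JudgedSample(True, False),
        JudgedSample(False, False),
        JudgedSample(False, True),
    ]
    print(consistency_rates(demo))
```

In this toy input, three of five outcomes are correct but only one of them also has a valid process, so outcome accuracy alone would overstate reasoning ability. That gap is what a process-aware metric is meant to expose.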
Reference / Citation
"State-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking."