Factorized Learning for Video-Language Models
Published: Dec 30, 2025 09:13
•1 min read
•ArXiv
Analysis
This paper addresses the challenge of accurate temporal grounding in video-language models, a crucial aspect of video understanding. It proposes a novel framework, D^2VLM, that decouples temporal grounding from textual response generation while recognizing their hierarchical relationship. Key contributions include the introduction of evidence tokens, a factorized preference optimization (FPO) algorithm, and a synthetic dataset built for factorized preference learning. The paper's focus on event-level perception and its 'grounding then answering' paradigm are promising directions for improving video understanding.
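To make the factorized objective concrete, below is a minimal sketch of a DPO-style preference loss split into a grounding term and an answering term, mirroring the "grounding then answering" decomposition. The function name, arguments, and the beta/lambda weighting are illustrative assumptions for this sketch, not the paper's exact FPO formulation.

```python
import torch
import torch.nn.functional as F

def factorized_preference_loss(logp_ground_w, logp_ground_l,
                               logp_ans_w, logp_ans_l,
                               ref_logp_ground_w, ref_logp_ground_l,
                               ref_logp_ans_w, ref_logp_ans_l,
                               beta=0.1, lam=1.0):
    """Sketch of a factorized preference loss: separate DPO-style terms
    for the grounding step and for the answering step, so preferences
    over evidence grounding and over textual answers are optimized
    jointly but not conflated. All names/weights are illustrative."""
    # Grounding-level preference margin (chosen vs. rejected grounding),
    # measured relative to a frozen reference policy.
    margin_g = (logp_ground_w - ref_logp_ground_w) - (logp_ground_l - ref_logp_ground_l)
    # Answer-level preference margin, conditioned on the grounded evidence.
    margin_a = (logp_ans_w - ref_logp_ans_w) - (logp_ans_l - ref_logp_ans_l)
    loss_g = -F.logsigmoid(beta * margin_g).mean()
    loss_a = -F.logsigmoid(beta * margin_a).mean()
    return loss_g + lam * loss_a
```

Factorizing the loss this way lets preference signal about *where* the evidence lies flow separately from preference signal about *what* the answer says, which is the intuition behind decoupling the two stages.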
Key Takeaways
Reference
“The paper introduces evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation.”
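As a rough illustration of what evidence tokens could look like, the sketch below uses a small set of learnable query tokens that cross-attend over per-frame features to summarize event-level visual semantics, rather than regressing timestamps directly. The module name and the cross-attention design are generic assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EvidenceGrounder(nn.Module):
    """Illustrative sketch: learnable evidence tokens attend to frame
    features to capture event-level visual semantics (assumed design)."""
    def __init__(self, dim=512, num_evidence_tokens=8, num_heads=8):
        super().__init__()
        self.evidence_tokens = nn.Parameter(torch.randn(num_evidence_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):  # frame_feats: (B, T, dim)
        B = frame_feats.size(0)
        queries = self.evidence_tokens.unsqueeze(0).expand(B, -1, -1)
        # Evidence tokens query the video features; the outputs can be fed
        # to the language model before answer generation.
        evidence, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return evidence  # (B, num_evidence_tokens, dim)
```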