Generation Enhances Vision-Language Understanding at Scale
Published: Dec 29, 2025 14:49 · 1 min read · ArXiv
Analysis
This paper investigates the impact of generative tasks on vision-language models, particularly at large scale. It challenges the common assumption that adding generation always improves understanding, showing that what matters is semantic-level generation rather than pixel-level generation. The findings suggest that unified generation-understanding models scale and utilize data better than understanding-only models, and that autoregression over input embeddings is an effective way to capture visual details.
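The summary does not specify the paper's architecture, but a minimal sketch may help illustrate the semantic-level idea: a causal transformer (standing in for the LLM) regresses the next high-level visual embedding instead of decoding pixels. All names here (`SemanticAutoregressor`, `semantic_ar_loss`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SemanticAutoregressor(nn.Module):
    """Toy sketch: autoregress high-level visual embeddings inside a
    causal transformer, rather than generating pixels."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Stand-in for the LLM backbone (hypothetical; the summary does
        # not name the paper's actual backbone).
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Head that regresses the next semantic embedding.
        self.next_embed_head = nn.Linear(d_model, d_model)

    def forward(self, vis_embeds: torch.Tensor) -> torch.Tensor:
        # vis_embeds: (batch, seq, d_model) high-level features from a
        # vision encoder (e.g. ViT patch embeddings), not raw pixels.
        seq_len = vis_embeds.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        hidden = self.backbone(vis_embeds, mask=causal_mask)
        return self.next_embed_head(hidden)


def semantic_ar_loss(model: SemanticAutoregressor,
                     vis_embeds: torch.Tensor) -> torch.Tensor:
    """Regression loss: each position predicts the following embedding."""
    pred = model(vis_embeds)[:, :-1]   # predictions for positions 1..T-1
    target = vis_embeds[:, 1:]         # ground-truth next embeddings
    return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = SemanticAutoregressor()
    dummy = torch.randn(2, 16, 768)  # 2 images, 16 patch-level embeddings
    print(semantic_ar_loss(model, dummy).item())
```

The key contrast with pixel-level generation is the target: the model predicts compact, semantically rich embeddings, so the generative objective reinforces the same representations the understanding task relies on.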
Key Takeaways
- Generation improves understanding only when it operates at the semantic level, not the pixel level.
- Unified generation-understanding models show better data scaling and data utilization.
- Autoregressing high-level visual representations inside the LLM is effective for capturing visual details.
Reference
“Generation improves understanding only when it operates at the semantic level, i.e. when the model learns to autoregress high-level visual representations inside the LLM.”