Generation Enhances Vision-Language Understanding at Scale
Analysis
This paper investigates how generative tasks affect vision-language models, particularly at large scale. It challenges the common assumption that adding generation always improves understanding, showing that generation helps only when it operates at the semantic level rather than the pixel level. The findings indicate that unified generation-understanding models scale better with data and use it more efficiently, and that autoregressing high-level visual embeddings inside the LLM is an effective way to capture visual detail.
Key Takeaways
- Adding generation to a vision-language model does not automatically improve understanding; what matters is the level at which generation operates.
- Semantic-level generation improves understanding, while pixel-level generation does not.
- Unified generation-understanding models show better data scaling and data utilization than understanding-only training.
- Autoregressing high-level visual representations inside the LLM is an effective way to capture visual detail.
Reference / Citation
View Original"Generation improves understanding only when it operates at the semantic level, i.e. when the model learns to autoregress high-level visual representations inside the LLM."