ThinkGen: LLM-Driven Visual Generation
Published: Dec 29, 2025 16:08
• 1 min read
• ArXiv
Analysis
This paper introduces ThinkGen, a framework that leverages the Chain-of-Thought (CoT) reasoning capabilities of Multimodal Large Language Models (MLLMs) for visual generation tasks. It addresses the limitations of existing methods by proposing a decoupled architecture and a separable GRPO-based training paradigm, enabling generalization across diverse generation scenarios. The paper's significance lies in its potential to improve the quality and adaptability of image generation by incorporating advanced reasoning.
Key Takeaways
- ThinkGen is a novel framework for visual generation that utilizes the CoT reasoning of MLLMs.
- It employs a decoupled architecture with an MLLM and a Diffusion Transformer (DiT).
- A separable GRPO-based training paradigm (SepGRPO) is used for training.
- The framework achieves state-of-the-art performance across multiple generation benchmarks.
Reference
“ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions.”
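The decoupled flow described in the quote can be sketched as a two-stage pipeline. This is a minimal illustrative sketch, not the paper's actual implementation: all class and method names (`MLLMPlanner`, `DiTGenerator`, `thinkgen_pipeline`) are hypothetical stand-ins showing how the MLLM's instruction generation is separated from the DiT's image synthesis.

```python
# Hypothetical sketch of ThinkGen's decoupled architecture.
# Names and interfaces are illustrative assumptions, not the paper's API.

class MLLMPlanner:
    """Stands in for the pretrained MLLM that reasons over user intent."""

    def generate_instruction(self, user_prompt: str) -> str:
        # A real MLLM would apply chain-of-thought reasoning to produce a
        # tailored instruction; here we just tag the prompt to show the
        # interface between the two stages.
        return f"refined instruction for: {user_prompt}"


class DiTGenerator:
    """Stands in for the Diffusion Transformer guided by instructions."""

    def generate_image(self, instruction: str) -> dict:
        # A real DiT would run iterative denoising conditioned on the
        # instruction; here we return a placeholder record.
        return {"conditioning": instruction, "image": "<latent tensor>"}


def thinkgen_pipeline(user_prompt: str) -> dict:
    """Run the two decoupled stages: reason first, then generate."""
    instruction = MLLMPlanner().generate_instruction(user_prompt)
    return DiTGenerator().generate_image(instruction)


result = thinkgen_pipeline("a cat reading a newspaper")
print(result["conditioning"])
```

Because the two stages communicate only through the instruction string, each module can in principle be trained or swapped independently, which is what the separable training paradigm exploits.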