DiffThinker: Generative Multimodal Reasoning with Diffusion Models
Published: Dec 30, 2025 11:51 • 1 min read • ArXiv
Analysis
This paper introduces DiffThinker, a diffusion-based framework for multimodal reasoning that excels at vision-centric tasks. Instead of text-centric chain-of-thought, it reformulates reasoning as a generative image-to-image task, which the authors argue improves logical consistency and spatial precision. The paper's significance lies in exploring this new reasoning paradigm and in demonstrating stronger vision-centric performance than leading closed-source models such as GPT-5 and Gemini-3-Flash.
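To make the image-to-image formulation concrete, here is a minimal, self-contained Python sketch (NumPy only) of conditional denoising: a noisy "answer" canvas is iteratively refined while conditioned on the "question" image. The `toy_denoiser` function and its update rule are hypothetical stand-ins for illustration; the paper's actual network, noise schedule, and training objective are not reproduced here.

```python
# Toy sketch of "reasoning as image-to-image generation":
# the question is an input image, and the answer is produced by
# iteratively denoising a noisy canvas conditioned on that image.
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy_solution, condition_image, t):
    """Hypothetical denoiser: predicts a cleaner solution image from the
    current noisy canvas and the conditioning (question) image.
    A real model would be a learned network; here we simply nudge the
    canvas toward a function of the condition (identity) for illustration."""
    target = condition_image  # stand-in for the "reasoned" output
    return noisy_solution + 0.5 * (target - noisy_solution)

def image_to_image_reasoning(condition_image, steps=10):
    """Generative image-to-image 'reasoning': start from pure noise and
    iteratively refine it into a solution image, conditioned on the input."""
    canvas = rng.standard_normal(condition_image.shape)
    for t in reversed(range(steps)):
        canvas = toy_denoiser(canvas, condition_image, t)
    return canvas

if __name__ == "__main__":
    question = rng.random((8, 8))  # e.g. a rasterized puzzle or maze
    answer = image_to_image_reasoning(question)
    print("max |answer - question| =", float(np.abs(answer - question).max()))
```

This toy loop only illustrates the structural idea (conditioning on an input image while generating an output image); any resemblance to DiffThinker's actual architecture is an assumption.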
Key Takeaways
- Introduces DiffThinker, a diffusion-based framework for generative multimodal reasoning.
- Reformulates multimodal reasoning as a generative image-to-image task.
- Demonstrates superior performance in vision-centric tasks compared to leading MLLMs.
- Highlights four core properties: efficiency, controllability, native parallelism, and collaboration.
Reference
“DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.”