DiffThinker: Generative Multimodal Reasoning with Diffusion Models
Analysis
Key Takeaways
- Introduces DiffThinker, a diffusion-based framework for generative multimodal reasoning.
- Reformulates multimodal reasoning as a generative image-to-image task.
- Demonstrates superior performance on vision-centric tasks compared to leading MLLMs.
- Highlights four core properties: efficiency, controllability, native parallelism, and collaboration.
“DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.”
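The image-to-image reformulation in the takeaways can be pictured as a conditional denoising loop: start from noise, and at every step condition the update on the input "question" image until an "answer" image emerges. The toy sketch below illustrates only this general diffusion-style pattern; the function names, the hand-crafted denoiser, and the inversion target are illustrative assumptions, not DiffThinker's actual architecture or API.

```python
import numpy as np

def toy_denoiser(x, condition, t, total_steps):
    # Stand-in for a learned denoising network: nudges the noisy "answer
    # image" x toward a function of the conditioning image. Here the toy
    # "reasoning task" is simply inverting the input (an assumption made
    # purely for demonstration).
    target = 1.0 - condition
    weight = 1.0 / (total_steps - t + 1)
    return x + weight * (target - x)

def image_to_image_reason(condition, steps=50, seed=0):
    # Generative image-to-image reasoning, diffusion-style: begin with
    # pure noise and iteratively denoise, conditioning every step on the
    # input image rather than emitting text tokens.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(condition.shape)
    for t in range(steps):
        x = toy_denoiser(x, condition, t, steps)
    return x

if __name__ == "__main__":
    puzzle = np.zeros((8, 8))              # toy "question" image
    answer = image_to_image_reason(puzzle) # converges toward the inverted image
    print(np.abs(answer - (1.0 - puzzle)).max())
```

Because every denoising step sees the full conditioning image, many candidate answers can be sampled independently from different noise seeds, which is one way to read the "native parallelism" property listed above.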