DPAR: Dynamic Patchification for Efficient Image Generation
Analysis
This paper introduces DPAR, a novel approach to improve the efficiency of autoregressive image generation. It addresses the computational and memory limitations of fixed-length tokenization by dynamically aggregating image tokens into variable-sized patches. The core innovation lies in using next-token prediction entropy to guide the merging of tokens, leading to reduced token counts, lower FLOPs, faster convergence, and improved FID scores compared to baseline models. This is significant because it offers a way to scale autoregressive models to higher resolutions and potentially improve the quality of generated images.
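To make the entropy-guided merging idea concrete, here is a minimal sketch. It is not the paper's algorithm: the summary above does not state the exact merging rule, so the thresholded boundary rule, the `threshold` value, and the function names `entropy` and `patch_boundaries` are all assumptions made for illustration. The sketch starts a new patch whenever the next-token distribution is uncertain (high entropy) and otherwise keeps extending the current patch.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def patch_boundaries(next_token_probs, threshold):
    """Group token positions into variable-sized patches.

    next_token_probs[i] is the (hypothetical) distribution a small model
    predicts for token i given the tokens before it. Assumed rule: a token
    whose prediction entropy exceeds `threshold` opens a new patch;
    low-entropy (easy to predict) tokens are merged into the current patch.
    """
    patches, current = [], []
    for i, probs in enumerate(next_token_probs):
        if current and entropy(probs) > threshold:
            patches.append(current)
            current = []
        current.append(i)
    if current:
        patches.append(current)
    return patches


# Toy distributions over a 4-token vocabulary (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]    # uncertain prediction, entropy = ln 4
confident = [0.97, 0.01, 0.01, 0.01]  # predictable token, low entropy

dists = [uniform, confident, confident, uniform, confident, uniform]
patches = patch_boundaries(dists, threshold=1.0)
print(patches)                     # [[0, 1, 2], [3, 4], [5]]
print(len(dists) / len(patches))   # 2.0 tokens per patch in this toy case
```

In this toy input, six tokens collapse into three patches, so the autoregressive model would process half as many sequence positions. The compression ratio here is a property of the made-up distributions and says nothing about the figures reported in the paper.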
Key Takeaways
- DPAR dynamically aggregates image tokens into variable-sized patches for efficient autoregressive image generation.
- It uses next-token prediction entropy to guide token merging.
- DPAR reduces token count and FLOPs and improves FID scores compared to baselines.
- The method is compatible with multimodal generation frameworks.
“DPAR reduces token count by 1.81x and 2.06x on Imagenet 256 and 384 generation resolution respectively, leading to a reduction of up to 40% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.”