SIID: Scale Invariant Pixel-Space Diffusion Model for High-Resolution Digit Generation
Analysis
This post introduces SIID, a novel diffusion model architecture designed to address limitations of the UNet and DiT architectures when scaling image resolution. The core issues tackled are the degradation of feature detection in UNets, whose convolution kernels are tuned to a fixed pixel density, and the need for entirely new positional embeddings in DiT when resolution increases. SIID aims to generate high-resolution images with minimal artifacts by maintaining scale invariance. The author acknowledges the code's current state and promises updates, emphasizing that the model architecture itself is the primary focus. The model, trained on 64x64 MNIST, reportedly generates readable 1024x1024 digits, showcasing its potential for high-resolution image generation.
Key Takeaways
- SIID is a novel diffusion model architecture designed for scale-invariant image generation.
- It addresses limitations of UNet and DiT architectures in handling varying image resolutions.
- The model is trained on 64x64 MNIST and generates readable 1024x1024 digits.
“UNet heavily relies on convolution kernels, and convolution kernels are trained to a certain pixel density. Change the pixel density (by increasing the resolution of the image via upscaling) and your feature detector can no longer detect those same features.”
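The effect described in the quote can be illustrated with a minimal 1-D sketch (not from the post): a fixed finite-difference "edge detector" kernel fires strongly on a sharp step edge, but after 2x linear upsampling the same edge is spread across more pixels and the per-pixel response halves, so a detector thresholded at the original response magnitude would miss it. The kernel and signal here are illustrative assumptions, not SIID's actual filters.

```python
import numpy as np

# A fixed "edge detector" kernel, conceptually trained at one pixel density.
# (Illustrative example; SIID's real kernels are learned 2-D filters.)
kernel = np.array([-1.0, 1.0])

signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # a sharp step edge

# 2x upsampling via linear interpolation, as image resizing does per row.
x_old = np.arange(len(signal))
x_new = np.linspace(0, len(signal) - 1, 2 * len(signal) - 1)
upsampled = np.interp(x_new, x_old, signal)

# Correlate with the kernel (convolve against the flipped kernel).
resp_orig = np.convolve(signal, kernel[::-1], mode="valid")
resp_up = np.convolve(upsampled, kernel[::-1], mode="valid")

print(resp_orig.max())  # 1.0 -> the detector fires strongly
print(resp_up.max())    # 0.5 -> same edge, half the response after upscaling
```

The same feature is still present in the upsampled signal, but its representation at the new pixel density no longer matches what the fixed kernel was tuned to, which is the failure mode the post attributes to UNets.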