[P] S2ID: Scale Invariant Image Diffuser - trained on standard MNIST, generates 1024x1024 digits at arbitrary aspect ratios with almost no artifacts, using only 6.1M parameters
Analysis
This post introduces S2ID, a diffusion architecture designed to address limitations of existing models such as UNet and DiT. The core issue is that UNet's convolution kernels are sensitive to changes in pixel density: a kernel trained at one resolution spans a different fraction of the image when generating at a higher one, producing upscaling artifacts. DiT models, in turn, may not compress context effectively when handling upscaled images. The author argues that pixels, unlike tokens in LLMs, are not atomic, which calls for a different approach. Despite its small size, the model generates high-resolution images with minimal artifacts using only 6.1M parameters. The author acknowledges the code is still rough, and the post focuses on the architectural ideas instead.
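The pixel-density sensitivity described above can be illustrated with a small NumPy sketch. This is not S2ID's actual code: `box_blur` is a hypothetical stand-in for a learned convolution kernel, used only to show that a fixed-width filter covers a shrinking fraction of the image as resolution grows.

```python
import numpy as np

def box_blur(x, k=3):
    """A fixed-width discrete filter, standing in for a learned conv kernel."""
    return np.convolve(x, np.ones(k) / k, mode="same")

# The same underlying signal sampled at training and generation resolutions.
lo = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 28))    # MNIST-scale sampling
hi = np.sin(2 * np.pi * np.linspace(0.0, 1.0, 1024))  # upscaled sampling

lo_b = box_blur(lo)
hi_b = box_blur(hi)

# The kernel's span in image coordinates shrinks as pixel density grows,
# so the trained filter no longer "sees" the neighborhood it was fit to.
span_lo = 3 / 28    # ~10.7% of the image width at 28px
span_hi = 3 / 1024  # ~0.3% of the image width at 1024px
```

The same 3-tap kernel that smoothed a meaningful neighborhood at 28px barely touches adjacent pixels at 1024px, which is the mismatch the post attributes UNet's upscaling artifacts to.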
Key Takeaways
- S2ID addresses limitations of UNet and DiT architectures in image diffusion.
- The model aims to improve handling of pixel density changes during upscaling.
- S2ID achieves high-resolution image generation with minimal artifacts and a relatively small parameter count.
“Tokens in LLMs are atomic, pixels are not.”