Just Image Transformer: Flow Matching Model Predicting Real Images in Pixel Space
Analysis
The article introduces the Just Image Transformer (JiT), a flow-matching model that predicts real images directly in pixel space, bypassing the Variational Autoencoders (VAEs) used by latent-space approaches. The core idea is to have the network predict the clean image x (x-prediction) rather than the velocity v, which the article reports gives better performance. The training loss, however, is still computed on the velocity (a v-loss), derived from the real image x and the noisy image z. The article also highlights the shift away from the U-Net architectures that have dominated diffusion-based image generation, such as Stable Diffusion, and hints at further developments.
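A minimal sketch of this training setup, assuming the common linear (rectified-flow) interpolation z_t = (1 - t)·x + t·ε with t = 0 at the data and t = 1 at pure noise; the model, function names, and time convention here are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn as nn

def xpred_vloss(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """One training loss: predict the clean image x, supervise in velocity space."""
    b = x.shape[0]
    eps = torch.randn_like(x)                      # Gaussian noise
    # Clamp t away from 0 to avoid dividing by a tiny value below.
    t = torch.rand(b, device=x.device).clamp(min=1e-3)
    t_ = t.view(b, 1, 1, 1)                        # broadcast over C, H, W

    z_t = (1.0 - t_) * x + t_ * eps                # noisy image on the interpolation path
    v_target = eps - x                             # true velocity dz/dt for this path

    x_pred = model(z_t, t)                         # network predicts the clean image (x-pred)
    v_pred = (z_t - x_pred) / t_                   # convert the x-prediction to a velocity

    return ((v_pred - v_target) ** 2).mean()       # v-loss: MSE in velocity space
```

With this particular convention, the v-loss is algebraically an x-space MSE scaled by 1/t², so it implicitly upweights low-noise timesteps; the paper's exact interpolation and loss weighting may differ.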
Key Takeaways
“JiT (Just image Transformer) does not use VAE and performs flow-matching in pixel space. The model performs better by predicting the real image x (x-pred) rather than the velocity v.”
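To illustrate what "flow matching in pixel space" means at inference time, here is a hedged Euler-integration sketch that converts the x-prediction back into a velocity at each step; the sampler, step count, and schedule are assumptions for illustration, not JiT's actual sampling procedure.

```python
import torch

@torch.no_grad()
def sample(model, shape, steps: int = 50, device: str = "cpu") -> torch.Tensor:
    """Integrate the learned flow from noise (t = 1) back to an image (t = 0)."""
    z = torch.randn(shape, device=device)          # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), t.item(), device=device)
        x_pred = model(z, t_batch)                 # network predicts the clean image
        v = (z - x_pred) / t.clamp(min=1e-3)       # convert x-pred to a velocity
        z = z + (t_next - t) * v                   # Euler step toward t = 0
    return z                                       # approximate sample in pixel space
```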