Just Image Transformer: Flow Matching Model Predicting Real Images in Pixel Space

Research | Image Generation | Blog | Analyzed: Dec 29, 2025 01:43
Published: Dec 14, 2025 07:17
Zenn DL

Analysis

The article introduces the Just Image Transformer (JiT), a flow-matching model that generates images directly in pixel space, without a Variational Autoencoder (VAE). Its central idea is that the network predicts the real image x (x-pred) rather than the velocity v, which the article reports gives better performance. The loss, however, is still computed on the velocity (v-loss), which is derived from the real image x and the noisy image z. The article also notes the move away from the U-Net-based models common in diffusion-based image generation, such as Stable Diffusion, and hints at further developments.
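
The combination of x-prediction with a velocity loss can be illustrated with a short NumPy sketch. The summary gives no formulas, so this assumes the common linear flow-matching interpolation z = t·x + (1 − t)·ε, with ε Gaussian noise and t in [0, 1], for which the target velocity is v = x − ε and an image estimate converts to a velocity as (x − z) / (1 − t). The function names and the clamp `t_eps` are hypothetical, chosen only for illustration, and the network itself is left out: `x_pred` stands in for its output.

```python
import numpy as np

def make_noisy(x, eps, t):
    """Assumed interpolation: pure noise at t=0, clean image at t=1."""
    return t * x + (1.0 - t) * eps

def velocity_from_x(x_hat, z, t, t_eps=1e-2):
    """Convert an image estimate into a velocity, v = (x_hat - z) / (1 - t).

    The denominator is clamped (an assumption) so the division stays
    finite as t approaches 1.
    """
    return (x_hat - z) / np.maximum(1.0 - t, t_eps)

def v_loss(x_pred, x, eps, t):
    """Mean squared error in velocity space for an x-predicting model."""
    z = make_noisy(x, eps, t)                # noisy image the model would see
    v_target = x - eps                       # velocity implied by x and eps
    v_pred = velocity_from_x(x_pred, z, t)   # velocity implied by the x-prediction
    return float(np.mean((v_pred - v_target) ** 2))

# Toy batch of two 3-channel 4x4 "images"; t broadcasts over each image.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
eps = rng.standard_normal((2, 3, 4, 4))
t = np.full((2, 1, 1, 1), 0.3)

loss_perfect = v_loss(x, x, eps, t)          # model output equals the real image
loss_offset = v_loss(x + 0.7, x, eps, t)     # model output is off by a constant
```

Under these assumptions a perfect x-prediction gives a velocity loss of zero, and an error in x is scaled by 1 / (1 − t) in velocity space, so the same pixel error costs more as t approaches 1.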
Reference / Citation
"JiT (Just image Transformer) does not use VAE and performs flow-matching in pixel space. The model performs better by predicting the real image x (x-pred) rather than the velocity v."
Zenn DL, Dec 14, 2025 07:17
* Cited for critical analysis under Article 32.