Just Image Transformer: Flow Matching Model Predicting Real Images in Pixel Space
Analysis
The article introduces the Just Image Transformer (JiT), a flow-matching model that predicts real images directly in pixel space, bypassing the Variational Autoencoders (VAEs) used by latent-space approaches. The core idea is to have the network predict the clean image x (x-prediction) rather than the velocity v, which the article reports gives better performance. The training loss, however, is still computed on the velocity (a v-loss), derived from the real image x and the noisy image z. The article also highlights the shift away from the U-Net architectures that have dominated diffusion-based image generation, such as Stable Diffusion, and hints at further developments.
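A minimal sketch of this training setup, assuming the common linear (rectified-flow) interpolation z_t = (1 - t)·x + t·ε with t = 0 at the data and t = 1 at pure noise; the model, function names, and time convention here are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn as nn

def xpred_vloss(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """One training loss: predict the clean image x, supervise in velocity space."""
    b = x.shape[0]
    eps = torch.randn_like(x)                      # Gaussian noise
    # Clamp t away from 0 to avoid dividing by a tiny value below.
    t = torch.rand(b, device=x.device).clamp(min=1e-3)
    t_ = t.view(b, 1, 1, 1)                        # broadcast over C, H, W

    z_t = (1.0 - t_) * x + t_ * eps                # noisy image on the interpolation path
    v_target = eps - x                             # true velocity dz/dt for this path

    x_pred = model(z_t, t)                         # network predicts the clean image (x-pred)
    v_pred = (z_t - x_pred) / t_                   # convert the x-prediction to a velocity

    return ((v_pred - v_target) ** 2).mean()       # v-loss: MSE in velocity space
```

With this particular convention, the v-loss is algebraically an x-space MSE scaled by 1/t², so it implicitly upweights low-noise timesteps; the paper's exact interpolation and loss weighting may differ.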
Key Takeaways
“JiT (Just image Transformer) does not use VAE and performs flow-matching in pixel space. The model performs better by predicting the real image x (x-pred) rather than the velocity v.”
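To illustrate what "flow matching in pixel space" means at inference time, here is a hedged Euler-integration sketch that converts the x-prediction back into a velocity at each step; the sampler, step count, and schedule are assumptions for illustration, not JiT's actual sampling procedure.

```python
import torch

@torch.no_grad()
def sample(model, shape, steps: int = 50, device: str = "cpu") -> torch.Tensor:
    """Integrate the learned flow from noise (t = 1) back to an image (t = 0)."""
    z = torch.randn(shape, device=device)          # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), t.item(), device=device)
        x_pred = model(z, t_batch)                 # network predicts the clean image
        v = (z - x_pred) / t.clamp(min=1e-3)       # convert x-pred to a velocity
        z = z + (t_next - t) * v                   # Euler step toward t = 0
    return z                                       # approximate sample in pixel space
```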