RainFusion2.0: Hardware-Efficient Sparse Attention for Video and Image Generation
Published: Dec 30, 2025 08:55 · 1 min read · ArXiv
Analysis
This paper addresses the computational bottlenecks of Diffusion Transformer (DiT) models in video and image generation, particularly the high cost of the attention mechanism. It proposes RainFusion2.0, a novel sparse attention mechanism designed for efficiency and hardware generality. The key innovations are its online adaptive approach, low overhead, and spatiotemporal awareness, which make it suitable for hardware platforms beyond GPUs. Its significance lies in the potential to accelerate generative models and broaden their applicability across devices.
Key Takeaways
- Proposes RainFusion2.0, a sparse attention mechanism for accelerating video and image generation.
- Addresses limitations of existing sparse attention methods, including high overhead and lack of hardware generality.
- Employs block-wise mean values, spatiotemporal-aware token permutation, and a first-frame sink mechanism (see the sketch after this list).
- Achieves a 1.5-1.8x end-to-end speedup at 80% sparsity without sacrificing video quality.
- Demonstrates effectiveness across multiple generative models and hardware platforms.
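To make the block-wise estimation idea concrete, here is a minimal sketch of how a block-sparse attention mask can be derived from block-wise mean values with a first-frame sink. This is an illustration of the general technique under my own assumptions, not the paper's implementation: the function `block_sparse_mask`, its parameters (`block_size`, `keep_ratio`, `sink_blocks`), and the frame-major token layout are hypothetical, and the spatiotemporal-aware token permutation step is omitted for brevity.

```python
# Minimal sketch (NOT the paper's code): estimate a block-sparse attention
# mask from block-wise mean values, with a first-frame attention sink.
# Function name, parameters, and token layout are illustrative assumptions.
import torch

def block_sparse_mask(q, k, block_size=64, keep_ratio=0.2, sink_blocks=1):
    """q, k: (seq_len, dim) query/key tensors for one attention head.
    Returns a (num_blocks, num_blocks) bool mask selecting which key
    blocks each query block attends to."""
    seq_len, dim = q.shape
    num_blocks = seq_len // block_size

    # Block-wise mean pooling: one cheap representative vector per block.
    q_mean = q[: num_blocks * block_size].reshape(num_blocks, block_size, dim).mean(dim=1)
    k_mean = k[: num_blocks * block_size].reshape(num_blocks, block_size, dim).mean(dim=1)

    # Approximate block-to-block attention scores; this is O(num_blocks^2),
    # negligible next to the O(seq_len^2) dense attention it replaces.
    scores = (q_mean @ k_mean.T) / dim**0.5

    # Keep only the top-scoring key blocks per query block
    # (keep_ratio=0.2 corresponds to 80% sparsity).
    k_keep = max(1, int(keep_ratio * num_blocks))
    top_idx = scores.topk(k_keep, dim=-1).indices
    mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
    mask.scatter_(1, top_idx, True)

    # First-frame sink: always attend to the leading blocks, which cover
    # the first frame's tokens under a frame-major layout.
    mask[:, :sink_blocks] = True
    return mask
```

In an actual kernel the mask gates which key/value blocks are loaded and computed; part of what can make this style of sparsity hardware-friendly is that it needs only mean pooling, a small matmul, and a top-k selection rather than GPU-specific primitives.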
Reference
“RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality.”
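As a back-of-the-envelope check (my own Amdahl's-law estimate, not a figure from the paper): if attention accounts for a fraction $f$ of end-to-end runtime and 80% sparsity cuts attention cost to roughly 20% of dense, the overall speedup is

$$S = \frac{1}{(1-f) + 0.2f} = \frac{1}{1 - 0.8f},$$

so the reported 1.5-1.8x range corresponds to attention taking roughly 42-56% of runtime, a plausible share for long-sequence video DiTs.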