WeDLM: Faster LLM Inference with Diffusion Decoding and Causal Attention
Published: Dec 28, 2025 01:25 • 1 min read • ArXiv
Analysis
This paper addresses the inference speed bottleneck of Large Language Models (LLMs). It proposes WeDLM, a diffusion decoding framework that leverages causal attention to enable parallel generation while retaining the efficiency of prefix KV caching. The key contribution is Topological Reordering, a method that enables parallel decoding without breaking the causal attention structure. The paper reports significant speedups over optimized autoregressive (AR) baselines, showcasing the potential of diffusion-style decoding for practical LLM deployment.
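The summary does not spell out the mechanics, but the stated idea, reordering positions so that already-decoded tokens precede still-masked ones, can be sketched. The toy Python below is an assumption-laden illustration: `DecodeState`, `diffusion_step`, the confidence-threshold acceptance rule, and the stub predictor are all hypothetical simplifications, not WeDLM's actual algorithm. It only shows why such a reordering keeps a standard causal mask and a growing prefix KV cache valid while masked positions are filled in parallel.

```python
# Illustrative sketch only: data structures and acceptance rule are assumptions
# for exposition, not WeDLM's published implementation.

from dataclasses import dataclass
from typing import List, Optional
import random

MASK = None  # placeholder for a not-yet-decoded position


@dataclass
class DecodeState:
    prefix: List[int]            # committed tokens; their KV entries are cached
    window: List[Optional[int]]  # current decode window, MASK where undecided


def topological_reorder(window):
    """Place already-decoded tokens ahead of masked ones.

    Because every decoded token then precedes every masked token, a standard
    causal attention mask stays valid and the prefix KV cache can keep growing,
    while all masked positions are still predicted in one parallel pass.
    """
    decoded = [t for t in window if t is not MASK]
    masked = [t for t in window if t is MASK]
    return decoded + masked, len(decoded)


def toy_parallel_predict(num_masked):
    """Stand-in for the model: returns (token, confidence) per masked slot."""
    return [(random.randint(5, 50), random.random()) for _ in range(num_masked)]


def diffusion_step(state: DecodeState, accept_threshold: float = 0.7) -> DecodeState:
    window, n_decoded = topological_reorder(state.window)

    # Tokens now at the front of the window are safe to commit: their
    # keys/values can be appended to the prefix cache once and reused.
    state.prefix.extend(window[:n_decoded])
    masked = window[n_decoded:]

    # One parallel forward pass proposes a fill for every remaining mask;
    # only high-confidence proposals are kept, the rest stay masked.
    proposals = toy_parallel_predict(len(masked))
    state.window = [tok if conf >= accept_threshold else MASK
                    for tok, conf in proposals]
    return state


if __name__ == "__main__":
    random.seed(0)
    state = DecodeState(prefix=[1, 2, 3], window=[MASK] * 8)
    for _ in range(10):  # a small refinement budget
        state = diffusion_step(state)
    # Commit whatever has been decoded by the end of the budget.
    state.prefix.extend(t for t in state.window if t is not MASK)
    print("decoded tokens:", state.prefix)
```

In this toy version the window shrinks as tokens are committed, whereas a real system would also slide the window forward over new positions; the point is only that the reordering step is what lets parallel (diffusion-style) filling coexist with causal attention and prefix caching.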
Key Takeaways
- WeDLM introduces a diffusion decoding framework for LLMs that uses causal attention.
- Topological Reordering enables parallel decoding while preserving prefix KV caching.
- The method achieves significant speedups over optimized AR baselines.
- The results demonstrate the potential of diffusion-style decoding for practical LLM deployment.
Reference
“WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.”