Yggdrasil: Optimizing LLM Decoding with Tree-Based Speculation
Published: Dec 29, 2025 20:51
•ArXiv
Analysis
This paper addresses a performance bottleneck in LLM inference: speculative decoding is dynamic, while inference runtimes assume static execution. Yggdrasil is a co-designed system that bridges this gap and targets latency-optimal decoding. Its core contributions are context-aware tree drafting, compiler-friendly execution, and stage-based scheduling, which together yield a reported speedup of up to 3.98× over state-of-the-art baselines. The emphasis on practical latency improvements is noteworthy.
Key Takeaways
- Proposes Yggdrasil, a co-designed system for latency-optimal speculative decoding.
- Introduces an equal-growth tree structure that is compatible with static computation graphs.
- Employs a latency-aware optimization objective for draft selection.
- Uses stage-based scheduling to reduce overhead.
- Achieves up to 3.98× speedup over state-of-the-art baselines.
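The summary does not spell out how the equal-growth tree or the latency-aware objective work, so the following is only an illustrative sketch under stated assumptions, not the paper's implementation. It assumes that "equal growth" means every draft step adds the same number of nodes (so tensor shapes stay fixed for a static graph), and that the objective is expected tokens per unit of latency. The draft model `toy_draft`, the functions `grow_equal_tree` and `pick_steps`, and the latency figures are all hypothetical.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Node:
    token: int
    logp: float  # cumulative log-probability of the path from the root
    parent: int  # index of the parent in the tree list, -1 for the root


def toy_draft(token, k):
    """Hypothetical stand-in for a draft model: k (token, logprob) children."""
    weights = [0.5 ** (i + 1) for i in range(k)]
    total = sum(weights)
    return [(token * 10 + i, math.log(w / total)) for i, w in enumerate(weights)]


def grow_equal_tree(root_token, width, steps):
    """Assumed equal-growth rule: each step adds exactly `width` nodes."""
    tree = [Node(root_token, 0.0, -1)]
    frontier = [0]
    for _ in range(steps):
        # Score every child of the current frontier by its path log-probability.
        candidates = [
            (tree[p].logp + lp, p, tok)
            for p in frontier
            for tok, lp in toy_draft(tree[p].token, width)
        ]
        # Keep a fixed number of nodes so the tree size is known ahead of time.
        frontier = []
        for logp, p, tok in heapq.nlargest(width, candidates):
            tree.append(Node(tok, logp, p))
            frontier.append(len(tree) - 1)
    return tree


def tokens_per_ms(tree, steps, draft_ms, verify_ms):
    """Assumed objective: expected emitted tokens divided by step latency."""
    # Approximate acceptance of a node by the draft probability of its path;
    # the +1 counts the token the verifier always produces.
    expected = 1.0 + sum(math.exp(n.logp) for n in tree[1:])
    return expected / (steps * draft_ms + verify_ms)


def pick_steps(max_steps, width, draft_ms, verify_ms):
    """Choose the number of draft steps that maximizes the assumed objective."""
    return max(
        range(1, max_steps + 1),
        key=lambda s: tokens_per_ms(
            grow_equal_tree(0, width, s), s, draft_ms, verify_ms
        ),
    )
```

Under these assumptions a tree built with `width=4` and `steps=3` always has 1 + 4 × 3 = 13 nodes, whatever the draft model predicts, which is the property a static graph needs. The latency ratio also shows why a deeper draft is not always better: when drafting is expensive relative to verification, `pick_steps` falls back to a single draft step.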
Reference
“Yggdrasil achieves up to $3.98\times$ speedup over state-of-the-art baselines.”