WeDLM: Fast LLM Inference via Diffusion Decoding with Causal Attention
Analysis
This paper targets the inference-speed bottleneck of large language models (LLMs). It introduces WeDLM, a diffusion-decoding framework built on causal attention that enables parallel generation while retaining the efficiency of prefix KV caching. The key contribution is a technique called topological reordering, which lets tokens be decoded in parallel without breaking the causal attention structure. The paper reports substantial speedups over optimized autoregressive (AR) baselines, demonstrating the potential of diffusion-style decoding for practical LLM deployment.
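To make the topological-reordering idea concrete, here is a minimal, purely illustrative Python sketch. It assumes only the mechanism described above: already-decoded tokens are kept as a contiguous causal prefix (so their KV cache stays valid under causal attention), while still-masked positions are appended after that prefix, proposed in parallel, and written back to their original positions. All names (reorder_for_decoding, parallel_decode_step, the toy accept rule) are hypothetical and do not reflect WeDLM's actual implementation or API.

```python
# Conceptual sketch of diffusion-style parallel decoding with a causal prefix.
# This is a bookkeeping-only toy; a real system would replace `propose` with a
# batched causal-attention forward pass and `accept` with a confidence rule.

from dataclasses import dataclass
from typing import List, Optional

MASK = None  # placeholder for a not-yet-decoded token


@dataclass
class ReorderedBatch:
    tokens: List[Optional[int]]   # reordered sequence fed to the causal model
    perm: List[int]               # perm[i] = original position of reordered slot i


def reorder_for_decoding(tokens: List[Optional[int]]) -> ReorderedBatch:
    """Place all finalized tokens first (original order preserved), then the
    masked slots. The finalized block forms a valid causal prefix, so its KV
    cache can be reused across parallel-decoding steps."""
    known = [i for i, t in enumerate(tokens) if t is not MASK]
    unknown = [i for i, t in enumerate(tokens) if t is MASK]
    perm = known + unknown
    return ReorderedBatch(tokens=[tokens[i] for i in perm], perm=perm)


def parallel_decode_step(tokens, propose, accept):
    """One diffusion-style step: propose a token for every masked slot in
    parallel, accept a subset, and commit them to their original positions."""
    batch = reorder_for_decoding(tokens)
    n_prefix = sum(t is not MASK for t in tokens)
    # `propose` stands in for the model call over the reordered sequence;
    # it returns one candidate token per masked slot.
    candidates = propose(batch.tokens, n_prefix)
    for slot, cand in zip(range(n_prefix, len(batch.perm)), candidates):
        if accept(cand):                       # e.g. a confidence threshold
            tokens[batch.perm[slot]] = cand    # write back in original order
    return tokens


if __name__ == "__main__":
    seq = [11, 12, MASK, 14, MASK]             # toy sequence with two holes
    propose = lambda toks, n: [100 + i for i in range(len(toks) - n)]
    seq = parallel_decode_step(seq, propose, accept=lambda c: True)
    print(seq)  # [11, 12, 100, 14, 101]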
Key Points
- Topological reordering enables parallel, diffusion-style decoding while keeping the causal attention structure intact.
- Prefix KV caching remains usable, preserving the serving efficiency of the underlying AR backbone.
- Reported speedups approach 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes.
- Comparisons are made against AR baselines served by vLLM under matched deployment settings.
Quotes / Sources
From the paper: "WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice."