Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

Research | #llm | Blog | Analyzed: Dec 29, 2025 06:08
Published: Feb 4, 2025 07:23
Practical AI

Analysis

This article from Practical AI covers techniques for accelerating large language model (LLM) inference. Chris Lott of Qualcomm AI Research discusses the challenges of LLM encoding and decoding (generation) and how hardware constraints such as FLOPS, memory footprint, and memory bandwidth limit inference metrics like time-to-first-token, tokens per second, and tokens per joule. The techniques highlighted are KV compression, quantization, pruning, and speculative decoding. The discussion also touches on future directions, including on-device agentic experiences and software tools such as Qualcomm AI Orchestrator.
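To make the idea behind speculative decoding concrete, here is a minimal sketch of its greedy variant: a cheap draft model proposes several tokens, and the expensive target model verifies them, keeping the longest matching prefix. The toy `target_next` and `draft_next` functions, the `NextToken` interface, and the parameter `k` are all hypothetical stand-ins for illustration; the source does not describe a specific implementation, and production systems typically verify sampled tokens with a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large, slow target model."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small draft model: wrong whenever len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks every proposed position. On real hardware
        #    this is one batched forward pass; here it is simulated with a loop.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target contributes a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode(target_next, draft_next, [1, 2], 20)
    assert tokens == greedy_decode(target_next, [1, 2], 20)
    print(f"generated 20 tokens in {rounds} verification rounds")
```

Because every kept token is one the target model would have chosen itself, the output matches plain greedy decoding with the target model. Any speedup comes from needing fewer target passes than generated tokens, which depends on how often the draft model agrees with the target.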
Reference / Citation
"We explore the challenges presented by the LLM encoding and decoding (aka generation) and how these interact with various hardware constraints such as FLOPS, memory footprint and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule."
Practical AI, Feb 4, 2025 07:23
* Cited for critical analysis under Article 32.