Time-Budgeted Inference for LLMs
Published: Dec 26, 2025 04:49 • 1 min read • ArXiv
Analysis
This paper addresses a key obstacle to deploying Large Language Models (LLMs) in time-sensitive applications: their unpredictable execution time, which hinders their use in real-time systems. TimeBill tackles this by predicting end-to-end execution time and adaptively adjusting the inference process to meet a given time budget. This matters because it enables LLMs in settings where timing guarantees are crucial, such as robotics and autonomous driving, without sacrificing output quality.
Key Takeaways
- Addresses the challenge of time-critical LLM inference.
- Proposes TimeBill, a framework for time-budgeted inference.
- Uses a response length predictor (RLP) and an execution time estimator (ETE) to predict execution time.
- Adaptively adjusts the KV cache eviction ratio to fit the time budget.
- Demonstrates improved task completion rate and performance.
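The control loop implied by these takeaways can be sketched as follows. This is an illustrative toy, not the paper's method: the cost constants, the heuristic length predictor, and the function names (`predict_response_length`, `estimate_execution_time`, `choose_eviction_ratio`) are all assumptions, whereas TimeBill trains a fine-grained RLP and ETE.

```python
# Toy sketch of time-budgeted inference in the spirit of TimeBill.
# All constants and models here are illustrative assumptions.

def predict_response_length(prompt_tokens: int) -> int:
    """Stand-in for the response length predictor (RLP).
    A crude heuristic; the paper trains a fine-grained predictor."""
    return max(16, prompt_tokens // 2)

def estimate_execution_time(prompt_tokens: int, response_tokens: int,
                            eviction_ratio: float) -> float:
    """Stand-in for the execution time estimator (ETE).
    Assumes prefill cost linear in prompt length, and a per-token
    decode cost that shrinks as more of the KV cache is evicted."""
    prefill_s = 0.002 * prompt_tokens
    per_token_s = 0.010 * (1.0 - 0.5 * eviction_ratio)
    return prefill_s + per_token_s * response_tokens

def choose_eviction_ratio(prompt_tokens: int, budget_s: float) -> float:
    """Pick the smallest eviction ratio whose predicted end-to-end
    time fits the budget; evicting less preserves more context."""
    response_tokens = predict_response_length(prompt_tokens)
    for ratio in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        if estimate_execution_time(prompt_tokens, response_tokens, ratio) <= budget_s:
            return ratio
    return 0.5  # budget too tight: fall back to the most aggressive eviction

print(choose_eviction_ratio(prompt_tokens=512, budget_s=3.1))  # → 0.4
```

The design choice mirrored here is that eviction is a quality/latency knob: the scheduler searches for the least aggressive eviction ratio that still satisfies the deadline, rather than evicting a fixed fraction.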
Reference
“TimeBill proposes a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs.”