Splitwise: Adaptive Edge-Cloud LLM Inference with DRL
Analysis
This paper addresses the challenge of deploying large language models (LLMs) on edge devices while balancing latency, energy consumption, and accuracy. It proposes Splitwise, a novel framework that uses Lyapunov-assisted deep reinforcement learning (DRL) to dynamically partition LLMs across edge and cloud resources. The approach is significant because it offers a finer-grained and more adaptive solution than static partitioning methods, especially in environments with fluctuating bandwidth. The use of Lyapunov optimization ensures queue stability and robustness, which is crucial for real-world deployments. The experimental results demonstrate substantial improvements in latency and energy efficiency.
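To make the Lyapunov-assisted partitioning idea concrete, here is a minimal sketch of a drift-plus-penalty split-point selector: a virtual queue tracks edge energy overuse, and the chosen layer cut minimizes `backlog * energy + V * latency`. All cost constants, the function name, and the cost model are hypothetical placeholders for illustration, not the paper's actual formulation (which learns the policy with DRL rather than enumerating splits).

```python
def choose_split(energy_backlog, bandwidth_mbps, V=1.0, num_layers=12):
    """Drift-plus-penalty sketch: cut the model at layer `split`, running
    layers [0, split) on the edge and the rest in the cloud.

    Minimizes  energy_backlog * edge_energy(split) + V * latency(split),
    where energy_backlog is a Lyapunov virtual queue tracking edge energy
    overuse. All per-layer costs below are made-up placeholders.
    """
    EDGE_S, CLOUD_S = 0.05, 0.01   # hypothetical per-layer latency (s)
    EDGE_J = 0.4                   # hypothetical per-layer edge energy (J)
    ACT_MB, TX_J = 2.5, 2.0        # activation size (MB), transmit energy (J)

    best_split, best_score = 0, float("inf")
    for split in range(num_layers + 1):
        offload = split < num_layers   # full-edge execution needs no uplink
        latency = EDGE_S * split + CLOUD_S * (num_layers - split)
        latency += (ACT_MB * 8.0 / bandwidth_mbps) if offload else 0.0
        energy = EDGE_J * split + (TX_J if offload else 0.0)
        score = energy_backlog * energy + V * latency
        if score < best_score:
            best_split, best_score = split, score
    return best_split
```

Under this toy model the selector reproduces the qualitative behavior the paper targets: with ample bandwidth it offloads everything; when bandwidth collapses it keeps computation on the edge; and a large energy backlog pushes work back to the cloud to restabilize the virtual queue.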
Key Takeaways
- Proposes Splitwise, a DRL-based framework for adaptive LLM partitioning across edge and cloud.
- Employs Lyapunov optimization for queue stability and robustness.
- Achieves significant improvements in latency and energy efficiency compared to existing methods.
- Demonstrates performance on various hardware platforms and LLM sizes.
“Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners.”