Per-Axis Weight Deltas for Frequent Model Updates
Published: Dec 24, 2025 05:00
• 1 min read
• ArXiv ML
Analysis
This paper represents fine-tuned Large Language Model (LLM) weights as compressed deltas against a base model: a 1-bit scheme that stores only the sign of each weight difference together with per-axis (row/column) FP16 scaling factors. The goal is to tame the large checkpoint sizes and cold-start latency that come with serving many task-specialized LLM variants. Because the scales are per-axis rather than a single scalar, they capture how the delta's magnitude varies across rows and columns, which improves reconstruction quality over scalar alternatives. A streamlined loader design further reduces cold-start latency and storage overhead. The method is drop-in, requires only a small calibration set, and preserves inference efficiency, making it practical for frequent model updates; the released experimental setup and source code support reproducibility and further research.
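To make the scheme concrete, below is a minimal sketch of the compression side in NumPy. The paper learns the per-axis scales from a small calibration set; as a simplification, this sketch uses only a per-row scale and fits it in closed form as the mean absolute delta of each row (which minimizes the per-row squared reconstruction error for a fixed sign pattern). The function name and packing layout are illustrative, not the paper's implementation.

```python
import numpy as np

def compress_delta(w_base: np.ndarray, w_task: np.ndarray):
    """Sketch: 1-bit weight delta with a per-row FP16 scale.

    Simplification of the paper's scheme: scales here are closed-form
    per-row magnitudes rather than learned from calibration data.
    """
    delta = (w_task - w_base).astype(np.float32)                 # full-precision delta
    signs = delta >= 0                                           # 1 bit per weight
    row_scale = np.abs(delta).mean(axis=1).astype(np.float16)    # per-row FP16 scale
    packed = np.packbits(signs, axis=1)                          # 8 signs per stored byte
    return packed, row_scale
```

Storing one bit per weight plus one FP16 scale per row is roughly 1/16 the size of keeping the delta itself in FP16, which is where the checkpoint-size savings come from.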
Key Takeaways
Reference
“We propose a simple 1-bit delta scheme that stores only the sign of the weight difference together with lightweight per-axis (row/column) FP16 scaling factors, learned from a small calibration set.”
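For the loader side, a matching sketch shows how a task variant could be rebuilt from its 1-bit delta at cold start. It assumes the `(packed, row_scale)` pair produced by the compression sketch above; the paper's actual loader is only described as streamlined for cold-start latency and storage, so this is an illustrative reconstruction, not its design.

```python
import numpy as np

def load_task_weights(w_base: np.ndarray, packed: np.ndarray,
                      row_scale: np.ndarray) -> np.ndarray:
    """Sketch: reconstruct task weights as base + per-row scale * sign."""
    n_cols = w_base.shape[1]
    bits = np.unpackbits(packed, axis=1, count=n_cols).astype(np.float32)  # 0/1 per weight
    signs = bits * 2.0 - 1.0                                   # map {0,1} -> {-1,+1}
    delta = row_scale.astype(np.float32)[:, None] * signs      # per-row scale times sign
    return (w_base + delta).astype(w_base.dtype)
```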