Granite 4.0 Small: A Viable Option for Limited VRAM Systems with Large Contexts
Analysis
This post highlights the potential of hybrid transformer-Mamba models like Granite 4.0 Small to maintain performance with large context windows on resource-constrained hardware. The key insight is offloading the MoE expert weights to the CPU, which frees VRAM for the KV cache and so enables larger context sizes. This approach could democratize access to large-context LLMs for users with older or less powerful GPUs.
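As a minimal sketch of the idea (not the author's actual setup), the Python below shows the kind of name-pattern partitioning that runtimes such as llama.cpp expose through tensor-override options: MoE expert tensors are routed to system RAM while everything else stays on the GPU. The tensor names and the `assign_device` helper are hypothetical, loosely modeled on GGUF-style naming.

```python
import re

# Hypothetical tensor names, loosely modeled on GGUF-style naming.
TENSOR_NAMES = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",   # MoE expert weights: large, used sparsely
    "blk.1.ssm_in.weight",          # Mamba (SSM) block weights
    "blk.1.ffn_down_exps.weight",
]

# Expert weights dominate a 32B-total / 9B-active MoE model, but only a
# few experts fire per token, so running them on the CPU is tolerable.
EXPERT_PATTERN = re.compile(r"ffn_.*_exps")

def assign_device(name: str) -> str:
    """Route MoE expert tensors to CPU RAM; keep everything else on the GPU."""
    return "cpu" if EXPERT_PATTERN.search(name) else "cuda"

for name in TENSOR_NAMES:
    print(f"{name:32s} -> {assign_device(name)}")
```

Because only a fraction of experts are activated per token, the CPU-side work stays bounded, while the VRAM that the full expert set would have occupied becomes available for the KV cache.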
Key Takeaways
- Granite 4.0 Small (32B total parameters / 9B activated) sustains roughly 7 tokens/s with a 50k-token context on a ThinkPad P15 with 8 GB of VRAM.
- Offloading the MoE expert weights to the CPU frees VRAM for a larger KV cache, enabling larger context windows (see the partitioning sketch above).
- The hybrid transformer-Mamba architecture keeps generation speed steady as the context fills (see the estimate after this list).
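To make the last two points concrete, here is a rough back-of-the-envelope estimate of KV cache size versus context length. In a hybrid model only the attention layers contribute KV cache; the Mamba layers carry a fixed-size state regardless of context length. The layer counts, KV-head count, and head size below are illustrative assumptions, not Granite 4.0 Small's published configuration.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # K and V each store one vector per KV head, per token, per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

CTX = 50_000  # 50k-token context, as in the reported setup

# Assumed dimensions: a hybrid stack where only 4 of 40 layers use attention,
# versus a dense-attention stack of the same depth.
hybrid = kv_cache_bytes(n_attn_layers=4,  n_kv_heads=8, head_dim=128, ctx_len=CTX)
dense  = kv_cache_bytes(n_attn_layers=40, n_kv_heads=8, head_dim=128, ctx_len=CTX)

print(f"hybrid: {hybrid / 2**30:.2f} GiB")  # ~0.76 GiB
print(f"dense:  {dense  / 2**30:.2f} GiB")  # ~7.63 GiB
```

Under these assumed numbers, a dense-attention cache alone would nearly fill an 8 GB card at 50k tokens, while the hybrid's stays under 1 GiB, which is consistent with the reported behavior.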
Reference
“due to being a hybrid transformer+mamba model, it stays fast as context fills”