Krasis LLM Runtime Speeds Up Inference on Consumer GPUs, Outpacing llama.cpp

Tags: infrastructure, gpu
Blog | Analyzed: Mar 17, 2026 16:47
Published: Mar 17, 2026 15:58
1 min read
Source: r/LocalLLaMA

Analysis

Krasis is an LLM inference runtime focused on raising decode throughput and cutting system RAM usage. According to the cited benchmark, it can run Qwen3 models on consumer GPUs such as the RTX 5080 and 5090, with a single 16GB 5080 reportedly outpacing llama.cpp on a 32GB 5090 using layer offloading. If those numbers hold up, it would make local generative AI noticeably faster and more accessible on mid-range hardware.
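To put the quoted throughput figures in wall-clock terms, here is a minimal sketch. The rates (1801 tok/s prefill, 26.8 tok/s decode) come directly from the cited benchmark; the prompt and output sizes are illustrative assumptions, and the simple additive model (prefill time plus sequential decode time) is a common first-order approximation, not something the post itself specifies.

```python
# Rough wall-clock estimate from the throughput figures quoted in the post.
# Prefill processes the whole prompt in parallel; decode emits tokens
# one at a time, so total time is approximately the sum of the two phases.

def generation_time(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimate end-to-end generation time in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Example (assumed workload): 4,000-token prompt, 500-token completion,
# using the 5080 numbers quoted in the post.
t = generation_time(4000, 500, prefill_tps=1801.0, decode_tps=26.8)
print(f"{t:.1f} s")  # ~2.2 s prefill + ~18.7 s decode ≈ 20.9 s total
```

The estimate shows why the decode rate dominates for long completions: at 26.8 tok/s, the 500-token output accounts for nearly 90% of the total time in this example.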
Reference / Citation
"Krasis can now run Qwen3-Coder-Next on a single 16GB 5080 (1801 tok/sec prefill, 26.8 tok/sec decode) faster than Llama.cpp on a 32GB 5090 (layer offloading to GPU)."
— r/LocalLLaMA, Mar 17, 2026 15:58
* Cited for critical analysis under Article 32.