Uncovering the 18 t/s Mystery: Testing the Qwen3.6-35B Large Language Model (LLM) on an RTX 5090
infrastructure #gpu | 📝 Blog | Analyzed: Apr 22, 2026 02:52
Published: Apr 22, 2026 02:26 • 1 min read • Source: Zenn • LLM Analysis
This article offers a hands-on look at pushing the limits of consumer hardware by running a large language model (LLM) on NVIDIA's RTX 5090. The author's detective work to uncover the true cause of an unexpected 18 t/s inference bottleneck highlights the complexities of AI hardware optimization, and the post is a worthwhile read for anyone interested in high-performance local generative AI and custom quantization techniques.
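For context, throughput figures like 18 t/s and 214 t/s are normally computed as generated tokens divided by wall-clock generation time. Below is a minimal sketch of such a measurement, assuming llama-cpp-python and a hypothetical GGUF filename; the article's exact setup is not specified in this summary.

```python
import time

from llama_cpp import Llama  # assumes llama-cpp-python built with CUDA support

# Hypothetical GGUF filename; the article's exact UD quant file is not given here.
llm = Llama(model_path="Qwen3.6-35B-UD-Q4_K_XL.gguf", n_gpu_layers=-1)

prompt = "Explain the KV cache in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# tokens/sec = generated tokens / wall-clock generation time
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} t/s")
```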
Key Takeaways
- The Qwen3.6-35B model uses Unsloth Dynamic (UD) quantization to balance file size against output quality.
- Inference speed initially dropped to a surprising 18 t/s, compared with the previous generation's 214 t/s.
- Testing traced the slowdown to VRAM usage climbing past 30 GB on the 32 GB RTX 5090, a classic capacity-pressure puzzle (see the sketch after this list).
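Once VRAM fills, inference runtimes spill weights or KV cache into system RAM over PCIe, which is what typically collapses throughput from hundreds of t/s to double digits. A quick way to catch this is to poll GPU memory while generating; here is a minimal sketch using pynvml (an assumption; the article does not name its monitoring tool).

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU (the RTX 5090 here)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # byte counts: .used / .total / .free

used_gb = mem.used / 1024**3
total_gb = mem.total / 1024**3
print(f"VRAM: {used_gb:.1f} / {total_gb:.1f} GB")

# Past ~30 GB on a 32 GB card there is little headroom left for the
# KV cache, and any overflow into system RAM will tank tokens/sec.
if used_gb > 0.94 * total_gb:
    print("Warning: near VRAM capacity; expect throughput to degrade.")

pynvml.nvmlShutdown()
```

With llama.cpp-style runtimes, the usual remedies are lowering n_gpu_layers, shrinking the context window, or choosing a smaller quant so that weights plus KV cache stay under the 32 GB budget.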
Reference / Citation
"VRAM usage exceeded 30 GB. The cause was..."
Related Analysis
- infrastructure | Edge AI is Rewriting the Upper Limits of Real-Time Perception Efficiency (Apr 22, 2026 11:19)
- infrastructure | LinkedIn Unveils Cognitive Memory Agent: A Revolutionary Leap in Stateful AI Systems (Apr 22, 2026 04:12)
- infrastructure | Empowering AI as the Protagonist: A Practical Guide to File Structures and Sprints (Apr 22, 2026 10:24)