Benchmarking Local LLMs: Unexpected Vulkan Speedup for Select Models
Analysis
This article from r/LocalLLaMA details a user's benchmark of local large language models (LLMs) run through CUDA and Vulkan backends on an NVIDIA RTX 3080 GPU. While CUDA generally performed better, certain models saw a significant speedup under Vulkan, particularly when only partially offloaded to the GPU; GLM4 9B Q6, Qwen3 8B Q6, and Ministral3 14B 2512 Q4 showed notable improvements. The author acknowledges the informal nature of the testing and its potential limitations, but the findings suggest that Vulkan can be a viable alternative to CUDA for specific LLM configurations, and the factors behind the performance gap warrant further investigation. Understanding them could inform optimizations in LLM deployment and resource allocation.
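As a rough illustration of how such a comparison could be reproduced, the sketch below times token generation at a fixed partial offload using llama-cpp-python. It assumes the library was compiled against the backend under test (the CUDA-vs-Vulkan choice is made at build time, not via a runtime flag), and the model path and layer count are placeholders, not values from the original post.

```python
# Minimal sketch: measure decode throughput for a partially offloaded GGUF model.
# Assumes llama-cpp-python is installed and was built against the backend under
# test (CUDA or Vulkan). Model path and n_gpu_layers are hypothetical.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-8b-q6_k.gguf",  # placeholder path
    n_gpu_layers=20,   # partial offload: only some layers live on the GPU
    n_ctx=2048,
    verbose=False,
)

prompt = "Explain the difference between CUDA and Vulkan in one paragraph."

start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Running the same script against a CUDA build and a Vulkan build of the library, with identical model and settings, gives the kind of per-backend tokens/sec comparison the post describes.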
Key Takeaways
- Vulkan can offer a significant speedup over CUDA for specific LLMs when they are partially offloaded to the GPU.
- The size of the gap between CUDA and Vulkan varies significantly with model architecture and quantization (a sweep harness is sketched after this list).
- Further research is needed to understand why Vulkan outperforms CUDA in these scenarios.
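Because the gap depends on both model and offload level, a systematic sweep is the natural follow-up. The harness below is a hypothetical sketch that shells out to llama.cpp's llama-bench tool once per backend build; the binary paths, model list, and layer counts are all assumptions, while the -m, -ngl, and -n flags are llama-bench's standard options for model, GPU layers, and generation length.

```python
# Hypothetical sweep: run llama.cpp's llama-bench across offload levels for
# both a CUDA build and a Vulkan build, printing throughput for comparison.
# Binary paths, model files, and layer counts are placeholders.
import subprocess

BUILDS = {
    "cuda": "./build-cuda/bin/llama-bench",      # llama.cpp built with CUDA
    "vulkan": "./build-vulkan/bin/llama-bench",  # llama.cpp built with Vulkan
}
MODELS = ["models/glm4-9b-q6_k.gguf", "models/qwen3-8b-q6_k.gguf"]
OFFLOAD_LEVELS = [0, 10, 20, 99]  # 99 ~ fully offloaded for models this size

for backend, binary in BUILDS.items():
    for model in MODELS:
        for ngl in OFFLOAD_LEVELS:
            # -m model file, -ngl GPU layers, -n tokens generated per run
            result = subprocess.run(
                [binary, "-m", model, "-ngl", str(ngl), "-n", "128"],
                capture_output=True, text=True,
            )
            print(f"[{backend}] {model} ngl={ngl}")
            print(result.stdout)
```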
“The main findings is that when running certain models partially offloaded to GPU, some models perform much better on Vulkan than CUDA”