Llama.cpp Achieves Impressive Performance on M2 Max: 40 Tokens/Second, 0% CPU Usage
Published: Jun 4, 2023 17:24
• 1 min read
• Hacker News
Analysis
This Hacker News post reports a notable performance result for Llama.cpp on Apple's M2 Max: 40 tokens/second while using all 38 GPU cores. The reported 0% CPU usage suggests that inference is offloaded almost entirely to the GPU.
Key Takeaways
- Llama.cpp reportedly generates 40 tok/s on the M2 Max.
- The run uses all 38 GPU cores for computation.
- Reported CPU utilization is 0%, which indicates that the work is offloaded to the GPU.
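To put the reported throughput in perspective, the following is a minimal sketch that converts a tokens-per-second figure into average per-token latency and total generation time. The function names and the 500-token response length are illustrative assumptions, not figures from the post; only the 40 tok/s rate comes from the source.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady rate, in seconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    rate = 40.0  # tok/s, the figure quoted in the post
    response_tokens = 500  # hypothetical response length, for illustration only

    print(f"{per_token_latency_ms(rate):.1f} ms per token")
    print(f"{generation_time_s(response_tokens, rate):.1f} s for {response_tokens} tokens")
```

At 40 tok/s this works out to 25 ms per token, so a 500-token response would take about 12.5 seconds, assuming the rate stays constant and ignoring prompt processing time.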
Reference
“Llama.cpp can do 40 tok/s on M2 Max, 0% CPU usage, using all 38 GPU cores”