4 results
Infrastructure · #LLM · 📝 Blog · Analyzed: Jan 16, 2026 17:02

vLLM-MLX: Blazing Fast LLM Inference on Apple Silicon!

Published: Jan 16, 2026 16:54
1 min read
r/deeplearning

Analysis

vLLM-MLX brings fast LLM inference to the Mac by harnessing Apple's MLX framework for native GPU acceleration. The headline benchmark, 464 tok/s on Llama-3.2-1B-4bit, shows what unified memory and Metal can deliver on Apple Silicon without a discrete GPU. The open-source project targets developers and researchers who want local, high-throughput inference with minimal setup.
Reference

Llama-3.2-1B-4bit → 464 tok/s
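
The post doesn't include code, and vLLM-MLX's own serving API isn't shown in this excerpt. As a rough illustration of what MLX-native generation looks like on Apple Silicon, here is a minimal sketch using the separate mlx-lm package; the model repo and parameters are illustrative choices, not taken from the benchmark.

```python
# Sketch: MLX-native LLM generation on Apple Silicon via the mlx-lm package.
# The 4-bit model repo below matches the benchmarked size class; any
# MLX-converted checkpoint works. vLLM-MLX's own serving layer is not shown.
from mlx_lm import load, generate

# load() fetches an MLX-format checkpoint; weights live in unified memory,
# so the GPU reads them without a host-to-device copy.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

prompt = "Explain unified memory on Apple Silicon in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(text)
```

Unified memory is the architectural reason small 4-bit models can hit numbers like 464 tok/s here: there is no PCIe transfer between CPU and GPU to pay for.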

Research · #LLM · 📝 Blog · Analyzed: Dec 27, 2025 15:31

Achieving 262k Context Length on Consumer GPU with Triton/CUDA Optimization

Published: Dec 27, 2025 15:18
1 min read
r/learnmachinelearning

Analysis

This post describes an individual's success in optimizing memory usage for large language models: a 262k context length on roughly 12GB of consumer-GPU VRAM, built in preparation for the Blackwell/RTX 5090 architecture. The project, HSPMN v2.1, decouples memory from compute using FlexAttention and custom Triton kernels, and the author asks the community to review the kernel implementation. The result matters because it shows large-context models running on accessible hardware, and the open request for feedback is a good example of low-level optimization work being refined in public.
Reference

I've been trying to decouple memory from compute to prep for the Blackwell/RTX 5090 architecture. Surprisingly, I managed to get it running with 262k context on just ~12GB VRAM and 1.41M tok/s throughput.
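
The HSPMN kernels themselves aren't included in the excerpt. As a minimal sketch of the FlexAttention API the post names, here is a sliding-window block mask, one standard way to stop attention memory from growing with context length; the window size and tensor shapes are illustrative, not the author's configuration.

```python
# Sketch: bounding attention memory with PyTorch FlexAttention (torch >= 2.5).
# With a sliding-window mask, each query attends to at most WINDOW keys, so
# the materialized attention footprint no longer scales with full context.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 4096, 64   # illustrative shapes, not the post's config
WINDOW = 1024

def sliding_window(b, h, q_idx, kv_idx):
    # Causal, plus a fixed look-back window of WINDOW tokens.
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

block_mask = create_block_mask(sliding_window, B, H, S, S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```

FlexAttention skips fully masked blocks entirely, which is why a sparse mask like this saves compute and memory rather than just zeroing scores.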

Research · #LLM · 👥 Community · Analyzed: Jan 10, 2026 15:14

Mercury Coder: Diffusion LLM Breaks Speed Barriers on Commodity Hardware

Published: Feb 26, 2025 19:58
1 min read
Hacker News

Analysis

This article highlights Mercury Coder's claim of 1000+ tokens/second on commodity GPUs, a large jump over typical autoregressive decoders. Because Mercury is a diffusion LLM, it refines many token positions per forward pass instead of emitting one token at a time, which is where the speed advantage comes from. Pairing that with commodity hardware points toward broader access to high-performance AI.
Reference

Mercury Coder generates 1000+ tok/sec on commodity GPUs.
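
Mercury's implementation is proprietary, so the following is only a toy sketch of the confidence-based unmasking loop that masked-diffusion LLMs commonly use; every name here (diffusion_decode, model, mask_id) is hypothetical. It shows the mechanism behind the speed claim: each forward pass commits several tokens instead of one.

```python
# Toy sketch of masked-diffusion text decoding (all names hypothetical).
# Unlike autoregressive decoding, each model call scores ALL positions at
# once and commits the most confident ones, so a handful of passes can
# decode a long sequence.
import torch

def diffusion_decode(model, seq_len, steps=8, mask_id=0):
    # mask_id is a reserved "masked token" id in this toy vocabulary.
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(steps):
        still_masked = tokens.eq(mask_id)
        if not still_masked.any():
            break
        logits = model(tokens)                      # (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)   # per-position confidence
        conf = torch.where(still_masked, probs, torch.zeros_like(probs))
        # Unmask an increasing share of the remaining positions each step.
        k = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        commit = conf.topk(k, dim=-1).indices       # (1, k)
        tokens[0, commit[0]] = preds[0, commit[0]]
    return tokens
```

With eight passes over a 512-token sequence, each `model(tokens)` call commits dozens of tokens, which is how throughput figures like 1000+ tok/sec become plausible on ordinary GPUs.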

Infrastructure · #LLM · 👥 Community · Analyzed: Jan 10, 2026 16:08

Llama.cpp Achieves Impressive Performance on M2 Max: 40 Tokens/Second, 0% CPU Usage

Published: Jun 4, 2023 17:24
1 min read
Hacker News

Analysis

This Hacker News post highlights a notable efficiency milestone for Llama.cpp on Apple Silicon: 40 tokens/second on an M2 Max while the CPU sits at 0%, with all 38 GPU cores engaged. That profile indicates the entire model is offloaded to the Metal GPU, leaving the CPU idle during decoding.
Reference

Llama.cpp can do 40 tok/s on M2 Max, 0% CPU usage, using all 38 GPU cores
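
The 0% CPU figure comes from offloading every layer to the Metal GPU. Here is a minimal sketch using the llama-cpp-python bindings, where n_gpu_layers=-1 requests full offload; the model path is illustrative, not from the post.

```python
# Sketch: full GPU (Metal) offload with the llama-cpp-python bindings.
# n_gpu_layers=-1 asks llama.cpp to place every transformer layer on the
# GPU, which is what lets decoding run with near-zero CPU usage.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # illustrative GGUF model
    n_gpu_layers=-1,   # -1 = offload all layers (Metal on Apple Silicon)
    n_ctx=4096,
)
out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The same effect is available from the llama.cpp CLI via its -ngl (GPU layers) option; partial offload (a smaller layer count) trades GPU memory for CPU work.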