4 results
Infrastructure · #LLM · 📝 Blog · Analyzed: Jan 16, 2026 17:02

vLLM-MLX: Blazing Fast LLM Inference on Apple Silicon!

Published: Jan 16, 2026 16:54
1 min read
r/deeplearning

Analysis

vLLM-MLX brings fast LLM inference to the Mac by harnessing Apple's MLX framework for native GPU acceleration. The headline benchmark, 464 tok/s on Llama-3.2-1B-4bit, shows what unified memory and Metal can deliver on Apple Silicon without a discrete GPU. The open-source project targets developers and researchers who want local, high-throughput inference with minimal setup.
Reference

Llama-3.2-1B-4bit → 464 tok/s
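
The post doesn't include code, and vLLM-MLX's own serving API isn't shown in this excerpt. As a rough illustration of what MLX-native generation looks like on Apple Silicon, here is a minimal sketch using the separate mlx-lm package; the model repo and parameters are illustrative choices, not taken from the benchmark.

```python
# Sketch: MLX-native LLM generation on Apple Silicon via the mlx-lm package.
# The 4-bit model repo below matches the benchmarked size class; any
# MLX-converted checkpoint works. vLLM-MLX's own serving layer is not shown.
from mlx_lm import load, generate

# load() fetches an MLX-format checkpoint; weights live in unified memory,
# so the GPU reads them without a host-to-device copy.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

prompt = "Explain unified memory on Apple Silicon in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(text)
```

Unified memory is the architectural reason small 4-bit models can hit numbers like 464 tok/s here: there is no PCIe transfer between CPU and GPU to pay for.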

Research · #LLM · 📝 Blog · Analyzed: Dec 27, 2025 15:31

Achieving 262k Context Length on Consumer GPU with Triton/CUDA Optimization

Published: Dec 27, 2025 15:18
1 min read
r/learnmachinelearning

Analysis

This post describes an individual's success in optimizing memory usage for large language models: a 262k context length on roughly 12GB of consumer-GPU VRAM, built in preparation for the Blackwell/RTX 5090 architecture. The project, HSPMN v2.1, decouples memory from compute using FlexAttention and custom Triton kernels, and the author asks the community to review the kernel implementation. The result matters because it shows large-context models running on accessible hardware, and the open request for feedback is a good example of low-level optimization work being refined in public.
Reference

I've been trying to decouple memory from compute to prep for the Blackwell/RTX 5090 architecture. Surprisingly, I managed to get it running with 262k context on just ~12GB VRAM and 1.41M tok/s throughput.
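
The HSPMN kernels themselves aren't included in the excerpt. As a minimal sketch of the FlexAttention API the post names, here is a sliding-window block mask, one standard way to stop attention memory from growing with context length; the window size and tensor shapes are illustrative, not the author's configuration.

```python
# Sketch: bounding attention memory with PyTorch FlexAttention (torch >= 2.5).
# With a sliding-window mask, each query attends to at most WINDOW keys, so
# the materialized attention footprint no longer scales with full context.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 4096, 64   # illustrative shapes, not the post's config
WINDOW = 1024

def sliding_window(b, h, q_idx, kv_idx):
    # Causal, plus a fixed look-back window of WINDOW tokens.
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

block_mask = create_block_mask(sliding_window, B, H, S, S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```

FlexAttention skips fully masked blocks entirely, which is why a sparse mask like this saves compute and memory rather than just zeroing scores.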

Research · #LLM · 👥 Community · Analyzed: Jan 10, 2026 15:14

Mercury Coder: Diffusion LLM Breaks Speed Barriers on Commodity Hardware

Published: Feb 26, 2025 19:58
1 min read
Hacker News

Analysis

This article highlights Mercury Coder's claim of 1000+ tokens/second on commodity GPUs, a large jump over typical autoregressive decoders. Because Mercury is a diffusion LLM, it refines many token positions per forward pass instead of emitting one token at a time, which is where the speed advantage comes from. Pairing that with commodity hardware points toward broader access to high-performance AI.
Reference

Mercury Coder generates 1000+ tok/sec on commodity GPUs.
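
Mercury's implementation is proprietary, so the following is only a toy sketch of the confidence-based unmasking loop that masked-diffusion LLMs commonly use; every name here (diffusion_decode, model, mask_id) is hypothetical. It shows the mechanism behind the speed claim: each forward pass commits several tokens instead of one.

```python
# Toy sketch of masked-diffusion text decoding (all names hypothetical).
# Unlike autoregressive decoding, each model call scores ALL positions at
# once and commits the most confident ones, so a handful of passes can
# decode a long sequence.
import torch

def diffusion_decode(model, seq_len, steps=8, mask_id=0):
    # mask_id is a reserved "masked token" id in this toy vocabulary.
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(steps):
        still_masked = tokens.eq(mask_id)
        if not still_masked.any():
            break
        logits = model(tokens)                      # (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)   # per-position confidence
        conf = torch.where(still_masked, probs, torch.zeros_like(probs))
        # Unmask an increasing share of the remaining positions each step.
        k = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        commit = conf.topk(k, dim=-1).indices       # (1, k)
        tokens[0, commit[0]] = preds[0, commit[0]]
    return tokens
```

With eight passes over a 512-token sequence, each `model(tokens)` call commits dozens of tokens, which is how throughput figures like 1000+ tok/sec become plausible on ordinary GPUs.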

Infrastructure · #LLM · 👥 Community · Analyzed: Jan 10, 2026 16:08

Llama.cpp Achieves Impressive Performance on M2 Max: 40 Tokens/Second, 0% CPU Usage

Published: Jun 4, 2023 17:24
1 min read
Hacker News

Analysis

This Hacker News post highlights a notable efficiency milestone for Llama.cpp on Apple Silicon: 40 tokens/second on an M2 Max while the CPU sits at 0%, with all 38 GPU cores engaged. That profile indicates the entire model is offloaded to the Metal GPU, leaving the CPU idle during decoding.
Reference

Llama.cpp can do 40 tok/s on M2 Max, 0% CPU usage, using all 38 GPU cores
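
The 0% CPU figure comes from offloading every layer to the Metal GPU. Here is a minimal sketch using the llama-cpp-python bindings, where n_gpu_layers=-1 requests full offload; the model path is illustrative, not from the post.

```python
# Sketch: full GPU (Metal) offload with the llama-cpp-python bindings.
# n_gpu_layers=-1 asks llama.cpp to place every transformer layer on the
# GPU, which is what lets decoding run with near-zero CPU usage.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # illustrative GGUF model
    n_gpu_layers=-1,   # -1 = offload all layers (Metal on Apple Silicon)
    n_ctx=4096,
)
out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The same effect is available from the llama.cpp CLI via its -ngl (GPU layers) option; partial offload (a smaller layer count) trades GPU memory for CPU work.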