Building a Powerful CPU-only LLM Server: Taming 64GB RAM and Podman for a Dedicated ChatGPT
infrastructure · llm · 📝 Blog
Analyzed: Apr 26, 2026 03:09 · Published: Apr 26, 2026 03:07 · 1 min read · Source: Zenn
This is a highly practical guide for anyone looking to self-host a Large Language Model (LLM) without breaking the bank on expensive GPUs. The author demonstrates the real potential of CPU-based inference by successfully running two heavyweight MoE models on a 64GB RAM setup. It's a solid deep dive into open-source infrastructure that empowers engineers to build their own localized, privacy-focused AI environments.
Key Takeaways
- Achieved the impressive feat of running two heavy MoE models (Qwen3.6 35B-A3B and GLM-4.7-Flash) simultaneously using only an i9-13900 CPU and 64GB of RAM (see the preload sketch after this list).
- The final architecture combines direct systemd management of Caddy and Ollama with Open WebUI running in a rootful Podman container; a health-check sketch of this layout follows below.
- Demonstrates a tightly budgeted memory footprint, carefully accounting for OS, model weights, and KV cache to land at ~54-56GB of the available 64GB without relying on swap (see the back-of-the-envelope arithmetic below).
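As a rough illustration of the two-model-resident setup, the sketch below pre-loads both models through Ollama's HTTP API and pins them in memory with `keep_alive: -1`. Ollama's default port is real; the model tags are guesses, since the article names the models but not their exact tags.

```python
"""Minimal sketch: pin two models in Ollama's memory via its HTTP API."""
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # Ollama's default listen address
# Hypothetical tags; the article's model names may map to different ones.
MODELS = ["qwen3.6:35b-a3b", "glm-4.7-flash"]

def preload(model: str) -> None:
    # Per Ollama's API docs, a generate request with no prompt loads the
    # model, and keep_alive=-1 keeps it resident indefinitely instead of
    # unloading after the default 5 minutes.
    body = json.dumps({"model": model, "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(model, "->", resp.status)

for m in MODELS:
    preload(m)
```

For both models to stay resident at once, Ollama's `OLLAMA_MAX_LOADED_MODELS` environment variable (settable in the systemd unit) must also allow at least two loaded models.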
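To make the three-tier layout concrete, here is a hypothetical health check of each layer. It assumes Caddy fronting the LAN, Ollama on its default port, and Open WebUI published from its Podman container on 8080; the LAN IP and the Open WebUI port/endpoint are assumptions, not values from the article.

```python
"""Sketch of the described layout: Caddy and Ollama as systemd services
on the host, Open WebUI in a rootful Podman container. All endpoints
except Ollama's default port are assumptions."""
import urllib.request

CHECKS = {
    "caddy (LAN entrypoint)": "http://192.168.1.10/",  # hypothetical LAN IP
    "ollama (systemd)": "http://localhost:11434/api/version",
    "open-webui (podman)": "http://localhost:8080/health",  # assumed port
}

for name, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: DOWN ({exc})")
```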
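The ~54-56GB figure can be sanity-checked with arithmetic like the following; the per-component numbers are illustrative guesses chosen to sum into the article's stated range, not the author's exact measurements.

```python
# Back-of-the-envelope RAM budget for the 64GB box. Every line item is
# an illustrative assumption; only the ~54-56GB total comes from the post.
budget_gib = {
    "OS + services (Caddy, Open WebUI, ...)": 6,
    "model A weights (quantized MoE)": 20,
    "model B weights (quantized MoE)": 18,
    "KV cache, model A": 6,
    "KV cache, model B": 5,
}

used = sum(budget_gib.values())
for item, gib in budget_gib.items():
    print(f"{item:42s} {gib:3d} GiB")
print(f"{'total':42s} {used:3d} GiB of 64 GiB "
      f"(headroom: {64 - used} GiB, no swap)")
```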
Reference / Citation
View Original"CPU だけで動く LLM サーバを 1 台構築した。GPU は予算の都合で次のフェーズなので、まずは CPU 推論でどこまでやれるかの検証フェーズだ。 ハードウェアは i9-13900 + 64GB RAM。これで Qwen3.6 35B-A3B と GLM-4.7-Flash の 2 モデルを常駐させて、Open WebUI から LAN 経由でアクセスできるようにした、というのが今回のゴールである。"