Building a Powerful CPU-only LLM Server: Taming 64GB RAM and Podman for a Dedicated ChatGPT
infrastructure · llm · 📝 Blog
Analyzed: Apr 26, 2026 03:09 · Published: Apr 26, 2026 03:07 · 1 min read · Source: Zenn
This is a highly practical guide for anyone looking to self-host a Large Language Model (LLM) without breaking the bank on expensive GPUs. The author demonstrates the real potential of CPU-based inference by successfully running two heavyweight MoE models on a 64GB RAM setup. It's a solid deep dive into open-source infrastructure that empowers engineers to build their own localized, privacy-focused AI environments.
Key Takeaways
- Achieved the impressive feat of running two heavy MoE models (Qwen3.6 35B-A3B and GLM-4.7-Flash) simultaneously using only an i9-13900 CPU and 64GB of RAM (see the preload sketch after this list).
- The final architecture combines direct systemd management of Caddy and Ollama with Open WebUI running in a rootful Podman container; a health-check sketch of this layout follows below.
- Demonstrates a tightly budgeted memory footprint, carefully accounting for OS, model weights, and KV cache to land at ~54-56GB of the available 64GB without relying on swap (see the back-of-the-envelope arithmetic below).
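As a rough illustration of the two-model-resident setup, the sketch below pre-loads both models through Ollama's HTTP API and pins them in memory with `keep_alive: -1`. Ollama's default port is real; the model tags are guesses, since the article names the models but not their exact tags.

```python
"""Minimal sketch: pin two models in Ollama's memory via its HTTP API."""
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # Ollama's default listen address
# Hypothetical tags; the article's model names may map to different ones.
MODELS = ["qwen3.6:35b-a3b", "glm-4.7-flash"]

def preload(model: str) -> None:
    # Per Ollama's API docs, a generate request with no prompt loads the
    # model, and keep_alive=-1 keeps it resident indefinitely instead of
    # unloading after the default 5 minutes.
    body = json.dumps({"model": model, "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(model, "->", resp.status)

for m in MODELS:
    preload(m)
```

For both models to stay resident at once, Ollama's `OLLAMA_MAX_LOADED_MODELS` environment variable (settable in the systemd unit) must also allow at least two loaded models.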
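To make the three-tier layout concrete, here is a hypothetical health check of each layer. It assumes Caddy fronting the LAN, Ollama on its default port, and Open WebUI published from its Podman container on 8080; the LAN IP and the Open WebUI port/endpoint are assumptions, not values from the article.

```python
"""Sketch of the described layout: Caddy and Ollama as systemd services
on the host, Open WebUI in a rootful Podman container. All endpoints
except Ollama's default port are assumptions."""
import urllib.request

CHECKS = {
    "caddy (LAN entrypoint)": "http://192.168.1.10/",  # hypothetical LAN IP
    "ollama (systemd)": "http://localhost:11434/api/version",
    "open-webui (podman)": "http://localhost:8080/health",  # assumed port
}

for name, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: DOWN ({exc})")
```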
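The ~54-56GB figure can be sanity-checked with arithmetic like the following; the per-component numbers are illustrative guesses chosen to sum into the article's stated range, not the author's exact measurements.

```python
# Back-of-the-envelope RAM budget for the 64GB box. Every line item is
# an illustrative assumption; only the ~54-56GB total comes from the post.
budget_gib = {
    "OS + services (Caddy, Open WebUI, ...)": 6,
    "model A weights (quantized MoE)": 20,
    "model B weights (quantized MoE)": 18,
    "KV cache, model A": 6,
    "KV cache, model B": 5,
}

used = sum(budget_gib.values())
for item, gib in budget_gib.items():
    print(f"{item:42s} {gib:3d} GiB")
print(f"{'total':42s} {used:3d} GiB of 64 GiB "
      f"(headroom: {64 - used} GiB, no swap)")
```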
Reference / Citation
View Original"CPU だけで動く LLM サーバを 1 台構築した。GPU は予算の都合で次のフェーズなので、まずは CPU 推論でどこまでやれるかの検証フェーズだ。 ハードウェアは i9-13900 + 64GB RAM。これで Qwen3.6 35B-A3B と GLM-4.7-Flash の 2 モデルを常駐させて、Open WebUI から LAN 経由でアクセスできるようにした、というのが今回のゴールである。"