Strix Halo memory layout — UMA, GTT, and what fits

The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and GPU. How much the GPU can actually use is a stack of three knobs:

The three knobs

  1. Total system RAM — 128 GB on this box. Set by the SKU.
  2. UMA frame buffer (BIOS, "iGPU memory size") — a fixed carve-out reserved for the GPU at boot. The OS can't see or reclaim it. Pinned, guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the max selectable on Framework Desktop is 64 GB.
  3. GTT, Graphics Translation Table (kernel, runtime) — a dynamic pool the GPU can borrow from system RAM beyond UMA. Default size on recent Linux is roughly half of system RAM. Higher latency than UMA, evictable under memory pressure.

So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM plus up to ~64 GB of slower, borrowable GTT, and the OS shares the remaining 64 GB with whatever GTT is currently borrowing.
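
To confirm how the knobs landed on a given boot, the amdgpu driver exposes both pools in sysfs. A minimal sketch, assuming the iGPU enumerates as card0:

```python
from pathlib import Path

# amdgpu exposes both memory pools in sysfs, in bytes.
DEV = Path("/sys/class/drm/card0/device")  # adjust if the iGPU isn't card0

def gib(name: str) -> float:
    return int((DEV / name).read_text()) / 2**30

print(f"UMA (pinned VRAM): {gib('mem_info_vram_total'):5.1f} GiB, "
      f"{gib('mem_info_vram_used'):.1f} used")
print(f"GTT (borrowable):  {gib('mem_info_gtt_total'):5.1f} GiB, "
      f"{gib('mem_info_gtt_used'):.1f} used")
```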

Why the BIOS caps at 64 GB

Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling below total RAM so the OS doesn't get starved. Earlier BIOS revisions were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth re-checking https://knowledgebase.frame.work after firmware updates.

The "128 GB support" on Framework's spec page refers to total system RAM capacity, not the UMA maximum. Distinct knobs.

What 64 GB UMA actually fits

| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | yes, huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | yes |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | tight |
| Kimi-Linear-48B-A3B (when llama.cpp lands) | Q4-Q8 | ~30-50 GB | yes |
| Qwen3-235B-A22B | Q3 | ~100 GB | no |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | no |

The rows that don't fit wouldn't have been enjoyable on this hardware regardless — see "Bandwidth is the real ceiling" below.
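
The sizes above follow a simple rule of thumb: weights ≈ parameters × bits-per-weight ÷ 8, plus a few percent for tensors kept at higher precision. A sketch, with approximate bits-per-weight averages rather than exact GGUF figures:

```python
def weight_gib(params_b: float, bpw: float, overhead: float = 1.05) -> float:
    """Rough weight footprint: parameters × bits-per-weight ÷ 8, padded
    ~5 % for embeddings/norms kept at higher precision. A rule of
    thumb, not an exact GGUF file size."""
    return params_b * bpw / 8 * overhead

# bpw values are rough averages for common GGUF quant mixes (assumed)
for name, params_b, bpw in [
    ("Qwen3-Coder-30B-A3B @ Q8", 30, 8.5),
    ("Llama 3.3 70B @ Q4",       70, 4.5),
    ("Qwen3-235B-A22B @ Q3",     235, 3.3),
]:
    print(f"{name:26s} ~{weight_gib(params_b, bpw):4.0f} GiB")
```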

When GTT spillover saves you

If a model is bigger than UMA but ≤ UMA+GTT:

  • Vulkan (llama.cpp server-vulkan container) — happily uses GTT for weights. Slower than UMA but works.
  • ROCm (vLLM, Ollama:rocm) — picky; often refuses GTT for weight buffers. Effectively capped at UMA.

So if you want to run something bigger than 64 GB today, the path is: stick with the Vulkan llama.cpp container and accept the GTT latency hit, or wait for a BIOS that unlocks a larger UMA.
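
That decision rule is small enough to write down. A sketch using this box's strategy-A numbers (64 GB UMA plus up to ~64 GB GTT; the defaults are illustrative, adjust for your GTT ceiling):

```python
def placement(model_gib: float, uma_gib: float = 64, gtt_gib: float = 64) -> str:
    """The rule above as code: ROCm backends (vLLM, Ollama:rocm) are
    effectively capped at UMA; Vulkan llama.cpp can spill weights into
    GTT. Defaults are this box's strategy-A numbers (illustrative)."""
    if model_gib <= uma_gib:
        return "fits in UMA: any backend (ROCm or Vulkan)"
    if model_gib <= uma_gib + gtt_gib:
        return "needs GTT spillover: Vulkan llama.cpp only, slower decode"
    return "doesn't fit UMA+GTT: smaller quant or more memory"

print(placement(42))    # Llama 3.3 70B Q4  -> any backend
print(placement(100))   # Qwen3-235B Q3     -> Vulkan + GTT
```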

Two architectural strategies (UMA-first vs GTT-first)

There's a real choice in how you allocate memory between OS and GPU:

Strategy A — UMA-first (current setup)

  • BIOS UMA: 64 GB (the menu max on lfsp0.03.05)
  • Kernel amdgpu.gttsize: default (~50 % of remaining RAM, so ~32 GB)
  • Effective GPU pool: 64 GB pinned + small GTT spillover
  • OS sees: 64 GB
  • Pros: lowest-latency GPU memory; OS can't reclaim under pressure; ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
  • Cons: hard ceiling at the BIOS UMA max; OS lives in 64 GB

Strategy B — GTT-first (per Gygeek's setup guide)

  • BIOS UMA: 512 MB (the menu min)
  • Kernel amdgpu.gttsize=117760 (~115 GB)
  • Effective GPU pool: ~115 GB borrowable from the OS-visible pool
  • OS sees: ~127 GB
  • Pros: dynamic — GPU only takes what it needs; can address far more than the BIOS UMA cap; future-proof if you want >64 GB models
  • Cons: GTT pages are evictable under memory pressure (avoid running heavy CPU workloads alongside inference); ROCm stacks may refuse GTT for weight buffers — Vulkan is the safer backend with this strategy

Which to pick

  • Single-user, mostly LLMs, ROCm or Vulkan: Strategy A is fine. 64 GB covers everything that runs well on this box's bandwidth (see "Bandwidth is the real ceiling" below).
  • Models > 64 GB or you want headroom: Strategy B. Trade some worst-case latency for the ability to load anything that fits in 127 GB. Vulkan-first.
  • Mixed workload (Docker compiles + LLMs + dev tools): Strategy B also helps because the OS isn't permanently squeezed into 64 GB.

The pyinfra deploy currently sets amdgpu.gttsize=117760 via GRUB (strategy B-compatible) but leaves BIOS UMA to whatever you've configured. If you want full strategy B, drop BIOS UMA to 512 MB at the next boot. If staying on strategy A, the kernel param is benign — GTT just becomes a high ceiling you'll never reach.
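
Because the two halves of a strategy live in different places (BIOS knob vs kernel param), it's easy to end up in a mixed state. A quick sketch to see which strategy a boot is actually running, again assuming card0:

```python
from pathlib import Path

# UMA carve-out as the driver sees it (bytes -> GiB); card0 assumed.
uma_gib = int(Path("/sys/class/drm/card0/device/mem_info_vram_total")
              .read_text()) / 2**30

# 512 MB UMA is the BIOS half of strategy B; 64 GB is strategy A.
print(f"UMA = {uma_gib:.1f} GiB ->",
      "strategy B (GTT-first)" if uma_gib < 1 else "strategy A (UMA-first)")
```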

Bandwidth is the real ceiling, not VRAM

256 GB/s of memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ bytes of active weights streamed per token. Practical implications:

  • 3B active (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50-80 tok/s. Excellent.
  • 12B active (GLM-4.5-Air-style MoEs): ~10-20 tok/s. Usable.
  • 35B active (Qwen3-Coder-480B-A35B): ~3-7 tok/s. Painful.
  • 70B dense: ~2-3 tok/s. Slideshow.

This is why MoE models with small active-params dominate on this box even though the box has lots of memory. The 64 GB UMA cap doesn't hurt much because the bandwidth would have made the bigger-than-64 GB models unusable anyway.
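
The listed ranges are roughly a third to a half of the theoretical ceiling. A worked sketch of the rule of thumb, assuming ~4.5 bits per weight for a Q4-class quant:

```python
BW_GBS = 256  # LPDDR5x-8000, this box's memory bandwidth

def decode_ceiling(active_params_b: float, bpw: float = 4.5) -> float:
    """Theoretical ceiling: each decoded token streams the active
    weights once, so tok/s <= bandwidth / active-weight bytes.
    Measured numbers land at roughly a third to a half of this."""
    return BW_GBS / (active_params_b * bpw / 8)

for label, active_b in [("3B active", 3), ("12B active", 12),
                        ("35B active", 35), ("70B dense", 70)]:
    print(f"{label:10s} <= ~{decode_ceiling(active_b):3.0f} tok/s theoretical")
```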

Other tunings worth knowing

  • amd_iommu=off kernel param gives ~6 % memory-read improvement on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management. Reboot required to take effect.
  • TDP — configurable in BIOS (45-120 W). Gygeek suggests 85 W as a noise/perf sweet spot. BIOS-only, can't be automated.
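
Since the GRUB-managed params only apply after a reboot, a two-line check that they actually landed:

```python
from pathlib import Path

# Confirm the GRUB-managed params survived the reboot.
cmdline = Path("/proc/cmdline").read_text().split()
for param in ("amd_iommu=off", "amdgpu.gttsize=117760"):
    print(param, "->", "active" if param in cmdline else "MISSING")
```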

TL;DR

For most users on this hardware: stay on strategy A (UMA = 64 GB), let pyinfra apply amd_iommu=off + the GTT ceiling, reboot, done. The hardware's bandwidth makes anything bigger than ~70 GB of weights impractical anyway, so the 64 GB cap rarely bites.

Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically want to load >64 GB models on Vulkan, or if the OS feels squeezed.