Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StrixHaloMemory.md (new file, 134 lines):
# Strix Halo memory layout — UMA, GTT, and what fits
The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and
GPU. How much of it the GPU can actually use is governed by a stack of
three knobs:
## The three knobs
1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out
reserved for the GPU at boot. The OS can't see or reclaim it. Pinned,
guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the
max selectable on Framework Desktop is **64 GB**.
3. **GTT, the Graphics Translation Table (kernel, runtime)** — a *dynamic*
   pool the GPU can borrow from system RAM beyond UMA. On recent kernels
   it defaults to roughly half of the OS-visible RAM. Higher latency than
   UMA, and evictable under memory pressure.
So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast
VRAM, plus up to ~64 GB of slower, borrowable GTT once the ceiling is
raised; the OS shares whatever GTT isn't currently using.
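To check how a given boot actually carved things up, amdgpu reports both
pools through sysfs (the same interface the Beszel GPU dashboard reads).
A minimal sketch; `card0` is an assumption and may be a different index
on your box:
```python
# Read the amdgpu memory pools from sysfs (standard amdgpu files;
# adjust the card index if card0 isn't the GPU you expect).
from pathlib import Path

DEV = Path("/sys/class/drm/card0/device")

def gib(f: str) -> float:
    return int((DEV / f).read_text()) / 2**30

print(f"UMA/VRAM total: {gib('mem_info_vram_total'):6.1f} GiB")
print(f"GTT total:      {gib('mem_info_gtt_total'):6.1f} GiB")
print(f"VRAM used:      {gib('mem_info_vram_used'):6.1f} GiB")
print(f"GTT used:       {gib('mem_info_gtt_used'):6.1f} GiB")
```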
## Why the BIOS caps at 64 GB
Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling
below total RAM so the OS doesn't get starved. Earlier BIOS revisions
were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth
re-checking https://knowledgebase.frame.work after firmware updates.
The "128 GB support" on Framework's spec page refers to total system
RAM capacity, **not** the UMA maximum. Distinct knobs.
## What 64 GB UMA actually fits
| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp support lands) | Q4-Q8 | ~30-50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |
The ❌ row wouldn't be enjoyable on this hardware regardless — see
"Bandwidth is the real ceiling" below.
## When GTT spillover saves you
If a model is bigger than UMA but ≤ UMA+GTT:
- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT
for weights. Slower than UMA but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight
buffers. Effectively capped at UMA.
So if you want to run something bigger than 64 GB, the practical path
is: stick with the Vulkan llama.cpp container and accept the GTT latency
hit, *or* wait for a BIOS that unlocks a larger UMA.
## Two architectural strategies (UMA-first vs GTT-first)
There's a real choice in how you allocate memory between OS and GPU:
### Strategy A — UMA-first (current setup)
- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; OS can't reclaim under pressure;
ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; OS lives in 64 GB
### Strategy B — GTT-first (per Gygeek's setup guide)
- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — GPU only takes what it needs; can address far more
than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running
heavy CPU workloads alongside inference); ROCm stacks may refuse
GTT for weight buffers — Vulkan is the safer backend with this
strategy
### Which to pick
- **Single-user, mostly LLMs, ROCm or Vulkan**: Strategy A is fine.
64 GB covers everything that runs well on this box's bandwidth (see
"Bandwidth is the real ceiling" below).
- **Models > 64 GB or you want headroom**: Strategy B. Trade some
worst-case latency for the ability to load anything that fits in
127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: Strategy B
also helps because the OS isn't permanently squeezed into 64 GB.
The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB
(Strategy B-compatible) but leaves BIOS UMA at whatever you've
configured. If you want full Strategy B, drop BIOS UMA to 512 MB at
the next boot. If staying on Strategy A, the kernel param is benign —
GTT just becomes a high ceiling you'll never reach.
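For reference, the GRUB step looks roughly like this in pyinfra's idiom.
This is a sketch under assumptions, not the deploy's actual code (that
lives in pyinfra/framework/); the cmdline variable and the regen command
vary by distro:
```python
# Sketch only: set the kernel params via GRUB with pyinfra. The match
# pattern, the full cmdline value, and update-grub are distro-dependent.
from pyinfra.operations import files, server

files.line(
    name="Set amdgpu.gttsize and amd_iommu on the kernel cmdline",
    path="/etc/default/grub",
    line=r"^GRUB_CMDLINE_LINUX_DEFAULT=",
    replace='GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.gttsize=117760 amd_iommu=off"',
)

server.shell(
    name="Regenerate the GRUB config",
    commands=["update-grub"],  # grub2-mkconfig -o ... on Fedora-likes
)
```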
## Bandwidth is the real ceiling, not VRAM
256 GB/s of memory bandwidth. Decode speed is roughly tokens/sec ≈
bandwidth ÷ active-parameter bytes. Practical implications:
- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50-80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10-20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3-7 tok/s. Painful.
- **70B dense**: ~2-3 tok/s. Slideshow.
This is why MoE models with small active-params dominate on this box
even though the box has lots of memory. The 64 GB UMA cap doesn't hurt
much because the bandwidth would have made the bigger-than-64 GB models
unusable anyway.
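A back-of-envelope version of that rule: each generated token streams
every active weight byte through memory once, so decode speed is capped
at bandwidth ÷ active-parameter bytes. Real throughput lands well under
this ceiling, which is why the ranges above are lower; a sketch:
```python
# Theoretical decode ceiling: tokens/sec <= bandwidth / active bytes.
# Real-world numbers land well below (kernel overhead, KV cache reads).
BW_GBPS = 256  # Strix Halo LPDDR5x-8000 theoretical bandwidth

def ceiling_tok_s(active_params_b: float, bits_per_weight: float) -> float:
    active_gb = active_params_b * bits_per_weight / 8
    return BW_GBPS / active_gb

for name, active_b, bits in [
    ("3B active @ Q8",   3, 8.5),   # ceiling ~80 tok/s
    ("12B active @ Q4", 12, 4.8),   # ceiling ~36 tok/s
    ("35B active @ Q4", 35, 4.8),   # ceiling ~12 tok/s
    ("70B dense @ Q4",  70, 4.8),   # ceiling ~6 tok/s
]:
    print(f"{name:16s} <= {ceiling_tok_s(active_b, bits):5.1f} tok/s")
```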
## Other tunings worth knowing
- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement
on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management.
  Reboot required to take effect; a quick verification sketch follows
  this list.
- **TDP** — configurable in BIOS (45-120 W). Gygeek suggests 85 W as a
  noise/perf sweet spot. BIOS-only; can't be automated.
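The verification mentioned above: after rebooting, confirm both
GRUB-managed params made it onto the live kernel command line.
```python
# Post-reboot sanity check: are the GRUB-managed params actually live?
from pathlib import Path

cmdline = Path("/proc/cmdline").read_text()
for param in ("amdgpu.gttsize=117760", "amd_iommu=off"):
    print(f"{param}: {'OK' if param in cmdline else 'MISSING'}")
```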
## TL;DR
For most users on this hardware: stay on Strategy A (UMA = 64 GB), let
pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The
hardware's bandwidth makes anything bigger than ~70 GB of weights
impractical anyway, so the 64 GB cap rarely bites.
Switch to Strategy B (UMA = 512 MB, GTT-driven) if you specifically
want to load >64 GB models on Vulkan, or if the OS feels squeezed.