Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StrixHaloMemory.md (new file, 134 lines):
# Strix Halo memory layout — UMA, GTT, and what fits
The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and
GPU. How much of it the GPU can actually use is governed by a stack of
three knobs:
## The three knobs
1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out
reserved for the GPU at boot. The OS can't see or reclaim it. Pinned,
guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the
max selectable on Framework Desktop is **64 GB**.
3. **GTT, the Graphics Translation Table (kernel, runtime)** — a *dynamic*
   pool the GPU can borrow from system RAM beyond UMA. On recent kernels
   it defaults to roughly half of the OS-visible RAM. Higher latency than
   UMA, and evictable under memory pressure.
So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast
VRAM, plus up to ~64 GB of slower, borrowable GTT once the ceiling is
raised; the OS shares whatever GTT isn't currently using.
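To check how a given boot actually carved things up, amdgpu reports both
pools through sysfs (the same interface the Beszel GPU dashboard reads).
A minimal sketch; `card0` is an assumption and may be a different index
on your box:
```python
# Read the amdgpu memory pools from sysfs (standard amdgpu files;
# adjust the card index if card0 isn't the GPU you expect).
from pathlib import Path

DEV = Path("/sys/class/drm/card0/device")

def gib(f: str) -> float:
    return int((DEV / f).read_text()) / 2**30

print(f"UMA/VRAM total: {gib('mem_info_vram_total'):6.1f} GiB")
print(f"GTT total:      {gib('mem_info_gtt_total'):6.1f} GiB")
print(f"VRAM used:      {gib('mem_info_vram_used'):6.1f} GiB")
print(f"GTT used:       {gib('mem_info_gtt_used'):6.1f} GiB")
```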
## Why the BIOS caps at 64 GB
Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling
below total RAM so the OS doesn't get starved. Earlier BIOS revisions
were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth
re-checking https://knowledgebase.frame.work after firmware updates.
The "128 GB support" on Framework's spec page refers to total system
RAM capacity, **not** the UMA maximum. Distinct knobs.
## What 64 GB UMA actually fits
| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp support lands) | Q4-Q8 | ~30-50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |
The ❌ row wouldn't be enjoyable on this hardware regardless — see
"Bandwidth is the real ceiling" below.
## When GTT spillover saves you
If a model is bigger than UMA but ≤ UMA+GTT:
- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT
for weights. Slower than UMA but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight
buffers. Effectively capped at UMA.
So if you want to run something bigger than 64 GB, the practical path
is: stick with the Vulkan llama.cpp container and accept the GTT latency
hit, *or* wait for a BIOS that unlocks a larger UMA.
## Two architectural strategies (UMA-first vs GTT-first)
There's a real choice in how you allocate memory between OS and GPU:
### Strategy A — UMA-first (current setup)
- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; OS can't reclaim under pressure;
ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; OS lives in 64 GB
### Strategy B — GTT-first (per Gygeek's setup guide)
- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — GPU only takes what it needs; can address far more
than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running
heavy CPU workloads alongside inference); ROCm stacks may refuse
GTT for weight buffers — Vulkan is the safer backend with this
strategy
### Which to pick
- **Single-user, mostly LLMs, ROCm or Vulkan**: Strategy A is fine.
64 GB covers everything that runs well on this box's bandwidth (see
"Bandwidth is the real ceiling" below).
- **Models > 64 GB or you want headroom**: Strategy B. Trade some
worst-case latency for the ability to load anything that fits in
127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: Strategy B
also helps because the OS isn't permanently squeezed into 64 GB.
The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB
(Strategy B-compatible) but leaves BIOS UMA at whatever you've
configured. If you want full Strategy B, drop BIOS UMA to 512 MB at
the next boot. If staying on Strategy A, the kernel param is benign —
GTT just becomes a high ceiling you'll never reach.
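For reference, the GRUB step looks roughly like this in pyinfra's idiom.
This is a sketch under assumptions, not the deploy's actual code (that
lives in pyinfra/framework/); the cmdline variable and the regen command
vary by distro:
```python
# Sketch only: set the kernel params via GRUB with pyinfra. The match
# pattern, the full cmdline value, and update-grub are distro-dependent.
from pyinfra.operations import files, server

files.line(
    name="Set amdgpu.gttsize and amd_iommu on the kernel cmdline",
    path="/etc/default/grub",
    line=r"^GRUB_CMDLINE_LINUX_DEFAULT=",
    replace='GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.gttsize=117760 amd_iommu=off"',
)

server.shell(
    name="Regenerate the GRUB config",
    commands=["update-grub"],  # grub2-mkconfig -o ... on Fedora-likes
)
```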
## Bandwidth is the real ceiling, not VRAM
256 GB/s of memory bandwidth. Decode speed is roughly tokens/sec ≈
bandwidth ÷ active-parameter bytes. Practical implications:
- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50-80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10-20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3-7 tok/s. Painful.
- **70B dense**: ~2-3 tok/s. Slideshow.
This is why MoE models with small active-params dominate on this box
even though the box has lots of memory. The 64 GB UMA cap doesn't hurt
much because the bandwidth would have made the bigger-than-64 GB models
unusable anyway.
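A back-of-envelope version of that rule: each generated token streams
every active weight byte through memory once, so decode speed is capped
at bandwidth ÷ active-parameter bytes. Real throughput lands well under
this ceiling, which is why the ranges above are lower; a sketch:
```python
# Theoretical decode ceiling: tokens/sec <= bandwidth / active bytes.
# Real-world numbers land well below (kernel overhead, KV cache reads).
BW_GBPS = 256  # Strix Halo LPDDR5x-8000 theoretical bandwidth

def ceiling_tok_s(active_params_b: float, bits_per_weight: float) -> float:
    active_gb = active_params_b * bits_per_weight / 8
    return BW_GBPS / active_gb

for name, active_b, bits in [
    ("3B active @ Q8",   3, 8.5),   # ceiling ~80 tok/s
    ("12B active @ Q4", 12, 4.8),   # ceiling ~36 tok/s
    ("35B active @ Q4", 35, 4.8),   # ceiling ~12 tok/s
    ("70B dense @ Q4",  70, 4.8),   # ceiling ~6 tok/s
]:
    print(f"{name:16s} <= {ceiling_tok_s(active_b, bits):5.1f} tok/s")
```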
## Other tunings worth knowing
- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement
on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management.
  Reboot required to take effect; a quick verification sketch follows
  this list.
- **TDP** — configurable in BIOS (45-120 W). Gygeek suggests 85 W as a
  noise/perf sweet spot. BIOS-only; can't be automated.
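The verification mentioned above: after rebooting, confirm both
GRUB-managed params made it onto the live kernel command line.
```python
# Post-reboot sanity check: are the GRUB-managed params actually live?
from pathlib import Path

cmdline = Path("/proc/cmdline").read_text()
for param in ("amdgpu.gttsize=117760", "amd_iommu=off"):
    print(f"{param}: {'OK' if param in cmdline else 'MISSING'}")
```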
## TL;DR
For most users on this hardware: stay on Strategy A (UMA = 64 GB), let
pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The
hardware's bandwidth makes anything bigger than ~70 GB of weights
impractical anyway, so the 64 GB cap rarely bites.
Switch to Strategy B (UMA = 512 MB, GTT-driven) if you specifically
want to load >64 GB models on Vulkan, or if the OS feels squeezed.