Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StrixHaloMemory.md

# Strix Halo memory layout — UMA, GTT, and what fits

The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and GPU. How much the GPU can actually use is a stack of three knobs:

## The three knobs

1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out reserved for the GPU at boot. The OS can't see or reclaim it. Pinned, guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the max selectable on Framework Desktop is **64 GB**.
3. **GTT, Graphics Translation Table (kernel, runtime)** — a *dynamic* pool the GPU can borrow from system RAM beyond UMA. Default size on recent Linux is roughly half of system RAM. Higher latency than UMA, evictable under memory pressure.

So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM, plus up to ~64 GB of slower borrowable GTT, leaving the OS to share what GTT isn't currently using.
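
Both pools are visible at runtime through amdgpu's sysfs counters, which is the quickest way to confirm what the BIOS and kernel actually handed the GPU. A minimal sketch (the card index under `/sys/class/drm` varies per machine, so it just globs):

```python
# Report the UMA / GTT split as the amdgpu driver sees it.
# Sketch only: assumes the standard amdgpu mem_info_* sysfs files;
# the card index (card0, card1, ...) differs between machines.
from pathlib import Path

def mib(path: Path) -> float:
    return int(path.read_text()) / 2**20  # bytes -> MiB

for dev in Path("/sys/class/drm").glob("card*/device"):
    if not (dev / "mem_info_vram_total").exists():
        continue  # not an amdgpu device
    print(dev.parent.name)
    print(f"  VRAM (UMA carve-out) total: {mib(dev / 'mem_info_vram_total'):10.0f} MiB")
    print(f"  GTT ceiling:                {mib(dev / 'mem_info_gtt_total'):10.0f} MiB")
    print(f"  VRAM used:                  {mib(dev / 'mem_info_vram_used'):10.0f} MiB")
    print(f"  GTT used:                   {mib(dev / 'mem_info_gtt_used'):10.0f} MiB")
```
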
## Why the BIOS caps at 64 GB

Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling below total RAM so the OS doesn't get starved. Earlier BIOS revisions were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth re-checking https://knowledgebase.frame.work after firmware updates.

The "128 GB support" on Framework's spec page refers to total system RAM capacity, **not** the UMA maximum. Distinct knobs.

## What 64 GB UMA actually fits

| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp lands) | Q4-Q8 | ~30-50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |

The ❌ rows wouldn't be enjoyable on this hardware regardless — see "Bandwidth is the real ceiling" below.
## When GTT spillover saves you

If a model is bigger than UMA but ≤ UMA+GTT:

- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT for weights. Slower than UMA but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight buffers. Effectively capped at UMA.

So if you want to run something bigger than 64 GB and stay performant, the path is: stick with the Vulkan llama.cpp container and accept the GTT latency hit, *or* wait for a BIOS that unlocks a larger UMA.
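
A quick way to tell which case a given model falls into is to compare its GGUF size against the two ceilings amdgpu reports. Rough sketch only: on-disk size is a proxy for weight memory, and context / KV cache still needs headroom on top.

```python
# Rough "which backend" check: compare a GGUF's on-disk size against the
# pinned UMA pool and the UMA+GTT ceiling reported by amdgpu.
import sys
from pathlib import Path

dev = next(p for p in Path("/sys/class/drm").glob("card*/device")
           if (p / "mem_info_vram_total").exists())
uma = int((dev / "mem_info_vram_total").read_text())
gtt = int((dev / "mem_info_gtt_total").read_text())
model = Path(sys.argv[1]).stat().st_size   # path to a .gguf file

gib = lambda n: n / 2**30
print(f"model {gib(model):.1f} GiB | UMA {gib(uma):.1f} GiB | UMA+GTT {gib(uma + gtt):.1f} GiB")
if model < uma:
    print("fits pinned UMA: ROCm or Vulkan backends should both load it")
elif model < uma + gtt:
    print("needs GTT spillover: use the Vulkan llama.cpp container")
else:
    print("doesn't fit UMA+GTT on this box")
```
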
## Two architectural strategies (UMA-first vs GTT-first)

There's a real choice in how you allocate memory between OS and GPU:

### Strategy A — UMA-first (current setup)

- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; OS can't reclaim it under pressure; ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; OS lives in 64 GB

### Strategy B — GTT-first (per Gygeek's setup guide)

- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — GPU only takes what it needs; can address far more than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running heavy CPU workloads alongside inference); ROCm stacks may refuse GTT for weight buffers — Vulkan is the safer backend with this strategy

### Which to pick

- **Single-user, mostly LLMs, ROCm or Vulkan**: Strategy A is fine. 64 GB covers everything that runs well on this box's bandwidth (see "Bandwidth is the real ceiling" below).
- **Models > 64 GB or you want headroom**: Strategy B. Trade some worst-case latency for the ability to load anything that fits in 127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: Strategy B also helps because the OS isn't permanently squeezed into 64 GB.

The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB (strategy B-compatible) but leaves BIOS UMA to whatever you've configured. If you want full strategy B, drop BIOS UMA to 512 MB at the next boot. If staying on strategy A, the kernel param is benign — GTT just becomes a high ceiling you'll never reach.
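
For reference, a minimal pyinfra sketch of that GRUB step. This is not a copy of the repo's actual `pyinfra/framework/` deploy; the exact cmdline contents and the `update-grub` command are assumptions for a Debian/Ubuntu-style GRUB layout.

```python
# Minimal pyinfra sketch of the GRUB change described above. Not the
# repo's real deploy, just the same idea: put the GTT ceiling and
# amd_iommu=off on the kernel cmdline, then regenerate grub.cfg.
from pyinfra.operations import files, server

CMDLINE = "amdgpu.gttsize=117760 amd_iommu=off"

files.line(
    name="Set amdgpu GTT / IOMMU kernel parameters",
    path="/etc/default/grub",
    line=r"^GRUB_CMDLINE_LINUX_DEFAULT=.*",
    replace=f'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash {CMDLINE}"',
    _sudo=True,
)

server.shell(
    name="Regenerate grub.cfg (takes effect after reboot)",
    commands=["update-grub"],
    _sudo=True,
)
```
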
## Bandwidth is the real ceiling, not VRAM

256 GB/s memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ bytes of active parameters read per token. Practical implications:

- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50-80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10-20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3-7 tok/s. Painful.
- **70B dense**: ~2-3 tok/s. Slideshow.
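
The arithmetic behind those figures, with an assumed ~0.6 bytes per weight (Q4-ish quant) and an assumed ~50 % effective-bandwidth factor; both are rough fudge factors, not measurements:

```python
# Back-of-envelope decode-speed estimate behind the list above: each
# generated token streams every active parameter from memory once, so
# tok/s ≈ effective bandwidth / (active params × bytes per weight).
# BYTES_PER_WEIGHT (~Q4) and EFFECTIVE (~50 % of peak) are assumptions,
# and KV-cache reads eat further into the budget.
BANDWIDTH_GBPS = 256      # Strix Halo LPDDR5x-8000 peak
EFFECTIVE = 0.5           # fraction of peak bandwidth actually achieved
BYTES_PER_WEIGHT = 0.6    # roughly a Q4 quant

def decode_tps(active_params_billion: float) -> float:
    gb_per_token = active_params_billion * BYTES_PER_WEIGHT
    return BANDWIDTH_GBPS * EFFECTIVE / gb_per_token

for label, active in [("3B active", 3), ("12B active", 12),
                      ("35B active", 35), ("70B dense", 70)]:
    print(f"{label:10s} ~{decode_tps(active):5.1f} tok/s")
```
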
This is why MoE models with small active-params dominate on this box even though the box has lots of memory. The 64 GB UMA cap doesn't hurt much because the bandwidth would have made the bigger-than-64 GB models unusable anyway.

## Other tunings worth knowing

- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management. Reboot required to take effect.
- **TDP** — configurable in BIOS (45–120 W). Gygeek suggests 85 W as a noise/perf sweet spot. BIOS-only, can't be automated.
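
A quick post-reboot check that the cmdline tunings actually landed (it just greps `/proc/cmdline`; the parameter strings assume the values described above):

```python
# Post-reboot sanity check that the kernel-parameter tunings took effect.
# Adjust the expected strings if you chose different values.
from pathlib import Path

cmdline = Path("/proc/cmdline").read_text()
for param in ("amd_iommu=off", "amdgpu.gttsize=117760"):
    status = "present" if param in cmdline else "MISSING"
    print(f"{param:24s} {status}")
```
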
## TL;DR

For most users on this hardware: stay on strategy A (UMA = 64 GB), let pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The hardware's bandwidth makes anything bigger than ~70 GB of weights impractical anyway, so the 64 GB cap rarely bites.

Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically want to load >64 GB models on Vulkan, or if the OS feels squeezed.