# Strix Halo memory layout — UMA, GTT, and what fits

The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and GPU. How much the GPU can actually use is a stack of three knobs:

## The three knobs

1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out reserved for the GPU at boot. The OS can't see or reclaim it. Pinned, guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the max selectable on Framework Desktop is **64 GB**.
3. **GTT, Graphics Translation Table (kernel, runtime)** — a *dynamic* pool the GPU can borrow from system RAM beyond UMA. Default size on recent Linux is roughly half of system RAM. Higher latency than UMA, evictable under memory pressure.

So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM, plus up to ~64 GB of slower, borrowable GTT, and the OS keeps whatever GTT isn't currently in use.

## Why the BIOS caps at 64 GB

Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling below total RAM so the OS doesn't get starved. Earlier BIOS revisions were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth re-checking https://knowledgebase.frame.work after firmware updates.

The "128 GB support" on Framework's spec page refers to total system RAM capacity, **not** the UMA maximum. Distinct knobs.

## What 64 GB UMA actually fits

| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp support lands) | Q4–Q8 | ~30–50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |

The ❌ rows wouldn't be enjoyable on this hardware regardless — see "Bandwidth is the real ceiling" below.

## When GTT spillover saves you

If a model is bigger than UMA but ≤ UMA+GTT:

- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT for weights. Slower than UMA, but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight buffers. Effectively capped at UMA.

So if you want to run something bigger than 64 GB today, the path is: stick with the Vulkan llama.cpp container and accept the GTT latency hit, *or* wait for a BIOS that unlocks a larger UMA.
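To see which pool a loaded model actually landed in, the amdgpu driver exposes per-pool counters in sysfs. A minimal sketch in Python, assuming the iGPU enumerates as `card0` (on some setups it's `card1`):

```python
from pathlib import Path

# Sketch: read the amdgpu memory-pool counters from sysfs.
# The mem_info_* files are standard amdgpu sysfs attributes, reported
# in bytes; "vram" here is the pinned UMA carve-out, "gtt" the dynamic
# borrowable pool. Assumes the iGPU is card0.
DEV = Path("/sys/class/drm/card0/device")

def read_bytes(name: str) -> int:
    return int((DEV / name).read_text())

def gib(n: int) -> float:
    return n / 2**30

for pool in ("vram", "gtt"):
    total = read_bytes(f"mem_info_{pool}_total")
    used = read_bytes(f"mem_info_{pool}_used")
    print(f"{pool.upper():4s}: {gib(used):6.1f} / {gib(total):6.1f} GiB used")
```

On strategy A (below) you'd expect a VRAM total around 64 GiB; on strategy B, a VRAM total around 0.5 GiB with a GTT total around 115 GiB.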
## Two architectural strategies (UMA-first vs GTT-first)

There's a real choice in how you allocate memory between OS and GPU:

### Strategy A — UMA-first (current setup)

- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; the OS can't reclaim it under pressure; ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; the OS lives in 64 GB

### Strategy B — GTT-first (per Gygeek's setup guide)

- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — the GPU only takes what it needs; can address far more than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running heavy CPU workloads alongside inference); ROCm stacks may refuse GTT for weight buffers — Vulkan is the safer backend with this strategy

### Which to pick

- **Single-user, mostly LLMs, ROCm or Vulkan**: strategy A is fine. 64 GB covers everything that runs well on this box's bandwidth (see "Bandwidth is the real ceiling" below).
- **Models > 64 GB, or you want headroom**: strategy B. Trade some worst-case latency for the ability to load anything that fits in ~127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: strategy B also helps, because the OS isn't permanently squeezed into 64 GB.

The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB (strategy-B-compatible) but leaves BIOS UMA at whatever you've configured. If you want full strategy B, drop BIOS UMA to 512 MB at the next boot. If you stay on strategy A, the kernel param is benign — GTT just becomes a high ceiling you'll never reach.

## Bandwidth is the real ceiling, not VRAM

256 GB/s memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ active-params size, because every generated token has to stream the active weights through memory once. Practical implications:

- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50–80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10–20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3–7 tok/s. Painful.
- **70B dense**: ~2–3 tok/s. Slideshow.

(A small estimator sketch at the end of this note makes the arithmetic concrete.)

This is why MoE models with small active-params dominate on this box even though the box has lots of memory. The 64 GB UMA cap doesn't hurt much, because the bandwidth would have made the bigger-than-64 GB models unusable anyway.

## Other tunings worth knowing

- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management. Reboot required to take effect.
- **TDP** — configurable in BIOS (45–120 W). Gygeek suggests 85 W as a noise/perf sweet spot. BIOS-only, can't be automated.

## TL;DR

For most users on this hardware: stay on strategy A (UMA = 64 GB), let pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The hardware's bandwidth makes anything bigger than ~70 GB of weights impractical anyway, so the 64 GB cap rarely bites. Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically want to load >64 GB models on Vulkan, or if the OS feels squeezed.
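After the pyinfra run and a reboot, a quick sanity check that both kernel parameters actually took effect; this sketch just reads `/proc/cmdline`, so it works regardless of how the GPU enumerates:

```python
from pathlib import Path

# Sanity check for after the pyinfra run and a reboot: confirm the kernel
# actually booted with the two parameters this note relies on. The values
# come straight from the GRUB config described above.
cmdline = Path("/proc/cmdline").read_text().split()

for want in ("amd_iommu=off", "amdgpu.gttsize=117760"):
    state = "OK" if want in cmdline else "MISSING"
    print(f"{want:28s} {state}")
```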
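Finally, the estimator promised in the bandwidth section: a back-of-the-envelope sketch of the tokens/sec ≈ bandwidth ÷ active-params rule. The bits-per-weight figures (8 bits for Q8, ~4.5 bits for Q4 including quantization overhead) are illustrative assumptions, not measurements, and the results are theoretical ceilings; real numbers land below them.

```python
# Back-of-the-envelope decode speed: tok/s ceiling = bandwidth / bytes read
# per token, where bytes per token ~= active params x bytes per weight.
BANDWIDTH_GBPS = 256  # Strix Halo LPDDR5x-8000, per the note above

def decode_tps_ceiling(active_params_b: float, bits_per_weight: float) -> float:
    bytes_per_token_gb = active_params_b * bits_per_weight / 8
    return BANDWIDTH_GBPS / bytes_per_token_gb

# Bits-per-weight values below are assumptions for illustration.
for name, active_b, bits in [
    ("Qwen3-Coder-30B-A3B @ Q8", 3, 8),
    ("GLM-4.5-Air-style @ Q4", 12, 4.5),
    ("Qwen3-Coder-480B-A35B @ Q4", 35, 4.5),
    ("70B dense @ Q4", 70, 4.5),
]:
    print(f"{name:30s} ~{decode_tps_ceiling(active_b, bits):5.1f} tok/s ceiling")
```

The ceilings it prints (~85, ~38, ~13, ~6.5 tok/s) sit comfortably above the practical ranges quoted in the bandwidth section, which is the sanity check you'd expect.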