# Strix Halo memory layout — UMA, GTT, and what fits

The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and GPU. How much the GPU can actually use is a stack of three knobs:

## The three knobs

1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out reserved for the GPU at boot. The OS can't see or reclaim it. Pinned, guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the max selectable on Framework Desktop is **64 GB**.
3. **GTT, Graphics Translation Table (kernel, runtime)** — a *dynamic* pool the GPU can borrow from system RAM beyond UMA. Default size on recent Linux is roughly half of system RAM. Higher latency than UMA, evictable under memory pressure.

So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM, plus up to ~64 GB of slower, borrowable GTT, and the OS keeps whatever GTT isn't currently in use.

## Why the BIOS caps at 64 GB

Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling below total RAM so the OS doesn't get starved. Earlier BIOS revisions were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth re-checking https://knowledgebase.frame.work after firmware updates.

The "128 GB support" on Framework's spec page refers to total system RAM capacity, **not** the UMA maximum. Distinct knobs.

## What 64 GB UMA actually fits

| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp support lands) | Q4–Q8 | ~30–50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |

The ❌ rows wouldn't be enjoyable on this hardware regardless — see "Bandwidth is the real ceiling" below.

## When GTT spillover saves you

If a model is bigger than UMA but ≤ UMA+GTT:

- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT for weights. Slower than UMA, but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight buffers. Effectively capped at UMA.

So if you want to run something bigger than 64 GB today, the path is: stick with the Vulkan llama.cpp container and accept the GTT latency hit, *or* wait for a BIOS that unlocks a larger UMA.
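To see which pool a loaded model actually landed in, the amdgpu driver exposes per-pool counters in sysfs. A minimal sketch in Python, assuming the iGPU enumerates as `card0` (on some setups it's `card1`):

```python
from pathlib import Path

# Sketch: read the amdgpu memory-pool counters from sysfs.
# The mem_info_* files are standard amdgpu sysfs attributes, reported
# in bytes; "vram" here is the pinned UMA carve-out, "gtt" the dynamic
# borrowable pool. Assumes the iGPU is card0.
DEV = Path("/sys/class/drm/card0/device")

def read_bytes(name: str) -> int:
    return int((DEV / name).read_text())

def gib(n: int) -> float:
    return n / 2**30

for pool in ("vram", "gtt"):
    total = read_bytes(f"mem_info_{pool}_total")
    used = read_bytes(f"mem_info_{pool}_used")
    print(f"{pool.upper():4s}: {gib(used):6.1f} / {gib(total):6.1f} GiB used")
```

On strategy A (below) you'd expect a VRAM total around 64 GiB; on strategy B, a VRAM total around 0.5 GiB with a GTT total around 115 GiB.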
## Two architectural strategies (UMA-first vs GTT-first)

There's a real choice in how you allocate memory between OS and GPU:

### Strategy A — UMA-first (current setup)

- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; the OS can't reclaim it under pressure; ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; the OS lives in 64 GB

### Strategy B — GTT-first (per Gygeek's setup guide)

- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — the GPU only takes what it needs; can address far more than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running heavy CPU workloads alongside inference); ROCm stacks may refuse GTT for weight buffers — Vulkan is the safer backend with this strategy

### Which to pick

- **Single-user, mostly LLMs, ROCm or Vulkan**: strategy A is fine. 64 GB covers everything that runs well on this box's bandwidth (see "Bandwidth is the real ceiling" below).
- **Models > 64 GB, or you want headroom**: strategy B. Trade some worst-case latency for the ability to load anything that fits in ~127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: strategy B also helps, because the OS isn't permanently squeezed into 64 GB.

The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB (strategy-B-compatible) but leaves BIOS UMA at whatever you've configured. If you want full strategy B, drop BIOS UMA to 512 MB at the next boot. If you stay on strategy A, the kernel param is benign — GTT just becomes a high ceiling you'll never reach.

## Bandwidth is the real ceiling, not VRAM

256 GB/s memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ active-params size, because every generated token has to stream the active weights through memory once. Practical implications:

- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50–80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10–20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3–7 tok/s. Painful.
- **70B dense**: ~2–3 tok/s. Slideshow.

(A small estimator sketch at the end of this note makes the arithmetic concrete.)

This is why MoE models with small active-params dominate on this box even though the box has lots of memory. The 64 GB UMA cap doesn't hurt much, because the bandwidth would have made the bigger-than-64 GB models unusable anyway.

## Other tunings worth knowing

- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management. Reboot required to take effect.
- **TDP** — configurable in BIOS (45–120 W). Gygeek suggests 85 W as a noise/perf sweet spot. BIOS-only, can't be automated.

## TL;DR

For most users on this hardware: stay on strategy A (UMA = 64 GB), let pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The hardware's bandwidth makes anything bigger than ~70 GB of weights impractical anyway, so the 64 GB cap rarely bites. Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically want to load >64 GB models on Vulkan, or if the OS feels squeezed.
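After the pyinfra run and a reboot, a quick sanity check that both kernel parameters actually took effect; this sketch just reads `/proc/cmdline`, so it works regardless of how the GPU enumerates:

```python
from pathlib import Path

# Sanity check for after the pyinfra run and a reboot: confirm the kernel
# actually booted with the two parameters this note relies on. The values
# come straight from the GRUB config described above.
cmdline = Path("/proc/cmdline").read_text().split()

for want in ("amd_iommu=off", "amdgpu.gttsize=117760"):
    state = "OK" if want in cmdline else "MISSING"
    print(f"{want:28s} {state}")
```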
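Finally, the estimator promised in the bandwidth section: a back-of-the-envelope sketch of the tokens/sec ≈ bandwidth ÷ active-params rule. The bits-per-weight figures (8 bits for Q8, ~4.5 bits for Q4 including quantization overhead) are illustrative assumptions, not measurements, and the results are theoretical ceilings; real numbers land below them.

```python
# Back-of-the-envelope decode speed: tok/s ceiling = bandwidth / bytes read
# per token, where bytes per token ~= active params x bytes per weight.
BANDWIDTH_GBPS = 256  # Strix Halo LPDDR5x-8000, per the note above

def decode_tps_ceiling(active_params_b: float, bits_per_weight: float) -> float:
    bytes_per_token_gb = active_params_b * bits_per_weight / 8
    return BANDWIDTH_GBPS / bytes_per_token_gb

# Bits-per-weight values below are assumptions for illustration.
for name, active_b, bits in [
    ("Qwen3-Coder-30B-A3B @ Q8", 3, 8),
    ("GLM-4.5-Air-style @ Q4", 12, 4.5),
    ("Qwen3-Coder-480B-A35B @ Q4", 35, 4.5),
    ("70B dense @ Q4", 70, 4.5),
]:
    print(f"{name:30s} ~{decode_tps_ceiling(active_b, bits):5.1f} tok/s ceiling")
```

The ceilings it prints (~85, ~38, ~13, ~6.5 tok/s) sit comfortably above the practical ranges quoted in the bandwidth section, which is the sanity check you'd expect.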