# kimi-linear — resumption guide

Open this first when picking the work back up.

## What this project is

Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x, gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M) local inference using the model architecturally best-suited to the box's unified-memory shape.

Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools" (to be added after BIOS unblock).

Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` + `pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh, patch-tokenizer.sh, smoke.sh, README.md). Deploy push: `cd pyinfra/framework && ./run.sh`.

## Where we are (2026-05-10)

**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear generation on a Strix Halo iGPU. Smoke test passes: `/v1/models` returns `kimi-linear`; tiny generation returns "ok".

Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`, `--num-gpu-blocks-override 32`. The long-context point of the model is locked behind a **VRAM ceiling** — see "What blocks long context" below.

**P1-P4 — pending.**

## The gauntlet of fixes that got us here (don't re-derive)

All of these are baked into the repo. Reproducing P0 from a clean box means pushing the repo and running the steps in `compose/kimi-linear/README.md`.

1. **Image entrypoint missing.** kyuz0 toolboxes drop into a shell; compose's `command:` gets exec'd as the program. Fix: explicit `entrypoint: ["vllm", "serve"]` in the compose file, model path as the positional first arg.
2. **Tokenizer ImportError.** `tokenization_kimi.py` imports `bytes_to_unicode` from a transformers internal that has been removed. Fix: `patch-tokenizer.sh` inlines the function. Idempotent.
3. **Missing AITER gfx1151 GEMM configs.** The kyuz0 image is built for gfx1151 but doesn't ship the AITER autotuning JSONs for every op Kimi's MLA layers hit (validated against Qwen/MiniMax, not MLA-heavy models).
   Fix: the derived `Dockerfile` copies gfx1100 (RDNA3) configs into gfx1151-named slots — kernels compile and run; tile sizes are not optimal but functional.
4. **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.** On top of resident weights, that's ~58 GB needed and we have 31 GB. Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA path, keeps it for everything else.
5. **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`), so vLLM computes "Available KV cache memory: 73.6 GiB", but PyTorch's actual allocator stays capped at the GPU pool (~31 GB). vLLM then OOMs trying to allocate the budget it computed. Fix: turn XNACK *off* and live within the 31 GB pool until BIOS UMA gives a single bigger pool.
6. **`--swap-space` was removed in modern vLLM.** Don't pass it.
7. **`--num-gpu-blocks-override 32`** as belt-and-braces against vLLM's KV pool auto-discovery picking a too-big number even without XNACK.

## What blocks long context

PyTorch's discoverable VRAM equals the **BIOS UMA Frame Buffer Size**, not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.

- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB Kimi weights with little KV headroom.
- User has previously found Framework's BIOS caps UMA at 64 GB.

**Research outcome (2026-05-10):** the right unblock isn't a higher BIOS UMA cap — it's the inverse. Set UMA *small* and merge the two pools.

| Layer | Setting |
| --- | --- |
| BIOS | Update to **3.05 stable** (Apr 2026); set UMA Frame Buffer = **0.5 GB** or **8 GB** (counter-intuitive but documented — frees pages for GTT) |
| Kernel | **≥ 6.16.9.** Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` is counted in 4 KiB pages, so 33554432 pages = 128 GiB) |
| Env | `HSA_XNACK=1` **+** `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — the kyuz0:stable image already uses these |

Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki: "Kernel Parameters and Unified Memory") and the Framework community "Linux + ROCm: January 2026 Stable Configurations Update" thread. Single-process allocation budget after this: ~110 GB.

**The hard ceiling**, if all of the above doesn't yield enough: 96 GB direct UMA. That's an AMD AGESA / StrixHaloPI limit, not Framework's. No path past 96 GB on signed firmware as of May 2026; a 192 GB Strix Halo refresh is rumored but unreleased.

**Fallback if the merge approach doesn't work for vLLM specifically:** llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model — splits layers across CPU/GPU. Lower throughput, more flexible.

## When you come back

1. Read the research output for BIOS UMA limits. If it landed in the chat, it's the most recent note in this project's session. If not yet, re-dispatch (the prompt is in conversation history — Framework Desktop May 2026 UMA Frame Buffer cap research).
2. Decide path:
   - **BIOS bump available** → flash, reboot, drop the constraints in `compose/kimi-linear.yml` (max-model-len, num-gpu-blocks-override), re-run `./smoke.sh`, ramp context.
   - **BIOS capped at 64 GB** → pivot to the llama.cpp ROCm path or accept 4K context as the long-term reality.
3. Then advance through P1 → P2 → P3 per the roadmap.

## Files of record

- `pyinfra/framework/compose/kimi-linear.yml` — service def, all the flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop + asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a follow-up paragraph from this work (PyTorch caps at UMA; XNACK is a trap).

## Decisions worth not relitigating

- **vLLM 0.19.x via kyuz0:stable** chosen over a source build with the `v0.11.2` pin (build.sh exists as fallback). The recipe pin advice was based on an earlier Moonshot doc; current upstream works.
- **4-bit compressed-tensors** (cyankiwi) chosen over 8-bit. With the 31 GB ceiling, 8-bit wouldn't even fit the weights resident.
- **`VLLM_ROCM_USE_AITER_MLA=0`**, NOT `VLLM_MLA_DISABLE=1`. The granular disable preserves AITER for non-MLA paths. The full disable is the next escalation if needed.
- **No upstream filings.** Findings stay in this repo per project policy (memory: `feedback_private_findings`).
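For reference on fix 2 (the tokenizer ImportError): `patch-tokenizer.sh` inlines the classic GPT-2 byte↔unicode table builder that newer transformers versions no longer export. A sketch of the inlined function — this is the well-known canonical implementation; the exact patch wording lives in `patch-tokenizer.sh`:

```python
def bytes_to_unicode():
    """Build the reversible byte -> printable-codepoint map used by
    GPT-2-style BPE tokenizers. Printable bytes map to themselves; the
    remaining bytes are shifted to codepoints >= 256 so every byte has a
    visible, whitespace-free stand-in character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:          # non-printable byte: assign next free slot
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))
```

Inlining this is safe because it has no dependencies and its output is fixed, which is also why the patch can be idempotent.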
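Fix 3 (the gfx1100 → gfx1151 config copy) lives in the derived `Dockerfile`; the same idea as a Python sketch. Note the config directory layout and the `*gfx1100*.json` filename pattern here are assumptions for illustration, not the real AITER paths:

```python
import shutil
from pathlib import Path


def clone_gemm_configs(cfg_dir, src_arch="gfx1100", dst_arch="gfx1151"):
    """Copy every autotuning JSON named for src_arch into a dst_arch-named
    slot, skipping files that already exist (idempotent, like the Dockerfile
    fix). Tile sizes tuned for RDNA3 won't be optimal on gfx1151, but the
    kernels compile and run, which is what unblocked P0."""
    created = []
    for src in Path(cfg_dir).glob(f"*{src_arch}*.json"):
        dst = src.with_name(src.name.replace(src_arch, dst_arch))
        if not dst.exists():
            shutil.copyfile(src, dst)
            created.append(dst.name)
    return sorted(created)
```

The real fix is a `COPY`/`cp` in the derived image build; this sketch just makes the rename rule explicit.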
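A quick cross-check of the memory numbers used above. `amdgpu.gttsize` is specified in MiB and `ttm.pages_limit` in 4 KiB pages; the fix-4 figures (~28 GB of weights is inferred from the stated ~58 GB total) show why the FP8 BMM intermediate can't fit in the single ~31 GB pool:

```python
GiB = 2**30
MiB = 2**20

# Kernel cmdline unit conversions.
assert 117760 * MiB == 115 * GiB     # old cmdline: 115 GB GTT ceiling (fix 5)
assert 131072 * MiB == 128 * GiB     # target cmdline: full 128 GB as GTT
assert 33554432 * 4096 == 128 * GiB  # ttm.pages_limit agrees with gttsize

# Fix 4: resident weights plus the materialized FP8 BMM intermediate
# (~28 + 30 = ~58 GB) overflow the one ~31 GB pool PyTorch can use today.
weights_gb, intermediate_gb, pool_gb = 28, 30, 31
assert weights_gb + intermediate_gb == 58 > pool_gb
```

The same arithmetic explains the XNACK trap: vLLM budgets against the GTT ceiling while the allocator is bound to the pool.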
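The smoke test of record is `smoke.sh`. As an illustrative companion, a tiny validator for the `/v1/models` payload shape — the response layout is the standard OpenAI-compatible list format that vLLM serves, but the helper name, sample payload, and port 8000 endpoint are assumptions here, not taken from the repo:

```python
import json


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-compatible /v1/models response."""
    return [m["id"] for m in models_payload.get("data", [])]


# Shape returned by e.g. `curl http://localhost:8000/v1/models` (sketch):
sample = json.loads(
    '{"object": "list", "data": [{"id": "kimi-linear", "object": "model"}]}'
)
assert "kimi-linear" in served_model_ids(sample)
```

P0's smoke criterion is exactly this check plus a tiny generation returning "ok".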