# kimi-linear — resumption guide

Open this first when picking the work back up.

## What this project is
Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x, gfx1151. Sits beside Ollama + Qwen3-Coder. Goal: long-context (256K-1M) local inference using the model architecturally best-suited to the box's unified-memory shape.

Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools" (to be added after BIOS unblock).

Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` + `pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh, patch-tokenizer.sh, smoke.sh, README.md). Deploy push: `cd pyinfra/framework && ./run.sh`.
## Where we are (2026-05-10)
**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear generation on a Strix Halo iGPU. Smoke test passes: `/v1/models` returns `kimi-linear`; a tiny generation returns "ok". Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`, `--num-gpu-blocks-override 32`. The long-context point of the model is locked behind a VRAM ceiling — see "What blocks long context" below.
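For a quick manual check outside `smoke.sh`, a minimal sketch of the two assertions (assumes the server is on localhost:8000, vLLM's default port; adjust to the actual compose mapping):

```sh
# 1. Model registered: expect "kimi-linear" among the model IDs
curl -s http://localhost:8000/v1/models

# 2. Tiny generation: a few tokens is enough to prove the serve path works
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear", "prompt": "Say ok", "max_tokens": 4}'
```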
**P1-P4 — pending.**
## The gauntlet of fixes that got us here (don't re-derive)

All of these are baked into the repo. Reproducing P0 from a clean box amounts to: push the repo + run the steps in `compose/kimi-linear/README.md`. The launch these fixes add up to is sketched after the list.

- **Image entrypoint missing.** kyuz0 toolboxes drop into a shell; compose's `command:` gets exec'd as the program. Fix: explicit `entrypoint: ["vllm", "serve"]` in the compose file, model path as the positional first arg.
- **Tokenizer ImportError.** `tokenization_kimi.py` imports `bytes_to_unicode` from a transformers internal that has been removed. Fix: `patch-tokenizer.sh` inlines the function. Idempotent.
- **Missing AITER gfx1151 GEMM configs.** The kyuz0 image is built for gfx1151 but doesn't ship the AITER autotuning JSONs for every op Kimi's MLA layers hit (validated against Qwen/MiniMax, not MLA-heavy models). Fix: the derived `Dockerfile` copies gfx1100 (RDNA3) configs into gfx1151-named slots — kernels compile and run; tile sizes are not optimal but functional.
- **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.** On top of resident weights that's ~58 GB needed, and we have 31 GB. Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA path, keeps it for everything else.
- **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`), so vLLM computes "Available KV cache memory: 73.6 GiB", but PyTorch's actual allocator stays capped at the GPU pool (~31 GB). vLLM then OOMs trying to allocate the budget it computed. Fix: turn XNACK off and live within the 31 GB pool until BIOS UMA gives a single bigger pool.
- **`--swap-space` was removed in modern vLLM.** Don't pass it.
- **`--num-gpu-blocks-override 32`** as belt-and-braces against vLLM's KV-pool auto-discovery picking a too-big number even without XNACK.
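Put together, the working launch is roughly the sketch below. This is a hand-written reconstruction, not a copy of `kimi-linear.yml`; the model path is a placeholder, and the flags/env vars are the ones documented above:

```sh
# Disable AITER for the MLA path only; keep XNACK off (see the trap above)
export VLLM_ROCM_USE_AITER_MLA=0
export HSA_XNACK=0

# Compose equivalent: entrypoint ["vllm", "serve"] with the model path as
# the first positional arg. No --swap-space (removed in modern vLLM).
vllm serve /models/Kimi-Linear-48B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --num-gpu-blocks-override 32
```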
## What blocks long context

PyTorch's discoverable VRAM equals the BIOS UMA Frame Buffer Size, not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.

- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse-grained / Pool 2 fine-grained); see the inspection sketch after this list.
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB of Kimi weights with little KV headroom.
- The user has previously found Framework's BIOS caps UMA at 64 GB.
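To see the split directly (assumes ROCm tools on PATH; exact output layout varies by ROCm version):

```sh
# GPU memory pools as ROCm sees them: expect two ~31 GB GLOBAL pools
# (coarse- and fine-grained) until the UMA/GTT merge below is applied
rocminfo | grep -E 'Pool [0-9]|Segment|Size'
```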
Research outcome (2026-05-10): the right unblock isn't a higher BIOS UMA cap — it's the inverse. Set UMA small and merge the two pools.
| Layer | Setting |
|---|---|
| BIOS | Update to 3.05 stable (Apr 2026); set UMA Frame Buffer = 0.5 GB or 8 GB (counter-intuitive but documented — frees pages for GTT) |
| Kernel | ≥ 6.16.9. Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` in 4 KiB pages = 128 GiB) |
| Env | `HSA_XNACK=1` + `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — the kyuz0:stable image already uses these |
Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki: "Kernel Parameters and Unified Memory") and the Framework community "Linux + ROCm: January 2026 Stable Configurations Update" thread. Single-process allocation budget after this: ~110 GB.
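Applied on an Ubuntu-style box, a sketch (the GRUB workflow and file paths are assumptions; the parameter values are the ones from the table):

```sh
# Kernel cmdline: add to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# then sudo update-grub && reboot
#   amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432

# Environment for the vLLM process (equivalent of the compose environment: block)
export HSA_XNACK=1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"

# Sanity checks after reboot
uname -r           # expect >= 6.16.9
cat /proc/cmdline  # confirm gttsize / pages_limit took effect
```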
The hard ceiling, if all of the above doesn't yield enough: 96 GB direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh is rumored but unreleased.
Fallback if the merge approach doesn't work for vLLM specifically: llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits layers across CPU/GPU. Lower throughput, more flexible.
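If that pivot happens, the shape of it is roughly as follows. `llama-server`, `--model`, `--n-gpu-layers`, and `--ctx-size` are standard llama.cpp; the GGUF filename and the layer split are illustrative and untested on this box:

```sh
# Split layers so the GPU share fits the ~31 GB pool; the rest runs on CPU
llama-server \
  --model Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 30 \
  --ctx-size 32768
```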
## When you come back

- Read the research output for BIOS UMA limits. If it landed in the chat, it's the most recent note in this project's session. If not yet, re-dispatch (prompt is in conversation history — Framework Desktop May 2026 UMA Frame Buffer cap research).
- Decide path:
  - BIOS bump available → flash, reboot, drop the constraints in `compose/kimi-linear.yml` (`max-model-len`, `num-gpu-blocks-override`), re-run `./smoke.sh`, ramp context (sketch after this list).
  - BIOS capped at 64 GB → pivot to the llama.cpp ROCm path or accept 4K context as the long-term reality.
- Then advance through P1 → P2 → P3 per the roadmap.
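A hedged sketch of the ramp (step sizes are a suggestion, not a tested schedule; assumes `--max-model-len` appears literally in the compose file, as it does today):

```sh
# Re-deploy and re-smoke at each step before going higher; watch for KV OOMs.
# 4096 is today's cap; 262144 (256K) is the model's headline target.
cd pyinfra/framework
for LEN in 32768 131072 262144; do
  sed -i "s/--max-model-len [0-9]*/--max-model-len $LEN/" compose/kimi-linear.yml
  ./run.sh && ./compose/kimi-linear/smoke.sh || break
done
```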
## Files of record

- `pyinfra/framework/compose/kimi-linear.yml` — service def; all the flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop + asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a follow-up paragraph from this work (PyTorch caps at UMA; XNACK is a trap).
## Decisions worth not relitigating

- vLLM 0.19.x via kyuz0:stable chosen over a source build with the `v0.11.2` pin (`build.sh` exists as fallback). The recipe's pin advice was based on an earlier Moonshot doc; current upstream works.
- 4-bit compressed-tensors (cyankiwi) chosen over 8-bit. With the 31 GB ceiling, 8-bit wouldn't even fit weights resident.
- `VLLM_ROCM_USE_AITER_MLA=0`, NOT `VLLM_MLA_DISABLE=1`. The granular disable preserves AITER for non-MLA paths; the full disable is the next escalation if needed.
- No upstream filings. Findings stay in this repo per project policy (memory: `feedback_private_findings`).