
kimi-linear — resumption guide

Open this first when picking the work back up.

What this project is

Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x, gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M) local inference using the model architecturally best-suited to the box's unified-memory shape.

Roadmap entry: localgenai/Roadmap.md → "Layer 0: Inference + tools" (to be added after BIOS unblock).

Container artifacts: pyinfra/framework/compose/kimi-linear.yml + pyinfra/framework/compose/kimi-linear/ (Dockerfile, build.sh, patch-tokenizer.sh, smoke.sh, README.md). Deploy push: cd pyinfra/framework && ./run.sh.

Where we are (2026-05-10)

P0 — DONE, constrained. First-ever locally-served Kimi-Linear generation on a Strix Halo iGPU. Smoke test passes: /v1/models returns kimi-linear; tiny generation returns "ok".
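
The shape of that smoke check can be sketched offline. This is a minimal sketch, not smoke.sh itself; the payload below is canned and only mirrors the shape vLLM's /v1/models endpoint returns (on the box it would come from curl -s localhost:8000/v1/models, assuming the default port):

```python
import json

def served_model_ids(models_response: str) -> list[str]:
    """Pull model ids out of a GET /v1/models payload."""
    return [m["id"] for m in json.loads(models_response)["data"]]

# Canned payload mirroring the shape vLLM returns; on the box this would be
# the body of: curl -s localhost:8000/v1/models
canned = '{"object": "list", "data": [{"id": "kimi-linear", "object": "model"}]}'
assert served_model_ids(canned) == ["kimi-linear"]
print("smoke shape ok:", served_model_ids(canned)[0])
```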

Current runtime cap: --max-model-len 4096, --max-num-seqs 1, --num-gpu-blocks-override 32. The long-context point of the model is locked behind a VRAM ceiling — see "What blocks progress" below.

P1-P4 — pending.

The gauntlet of fixes that got us here (don't re-derive)

All of these are baked into the repo. Reproducing P0 from a clean box amounts to pushing the repo and running the steps in compose/kimi-linear/README.md.

  1. Image entrypoint missing. kyuz0 toolboxes drop into a shell, so compose's command: gets exec'd as the program. Fix: explicit entrypoint: ["vllm", "serve"] in the compose file, with the model path as the positional first arg.

  2. Tokenizer ImportError. tokenization_kimi.py imports bytes_to_unicode from a transformers internal that's been removed. Fix: patch-tokenizer.sh inlines the function. Idempotent.

  3. Missing AITER gfx1151 GEMM configs. kyuz0 image is built for gfx1151 but doesn't ship the AITER autotuning JSONs for every op Kimi's MLA layers hit (validated against Qwen/MiniMax, not MLA-heavy models). Fix: derived Dockerfile copies gfx1100 (RDNA3) configs into gfx1151-named slots — kernels compile + run, tile sizes not optimal but functional.

  4. MLA AITER FP8 BMM tries to materialize a 30 GB intermediate. On top of resident weights, that's ~58 GB needed and we have 31 GB. Fix: VLLM_ROCM_USE_AITER_MLA=0 — bypasses AITER for just the MLA path, keeps it for everything else.

  5. HSA_XNACK=1 is a trap with vLLM. It enables HIP demand-paging into GTT (a 115 GiB ceiling, per the kernel cmdline amdgpu.gttsize=117760, which is in MiB), so vLLM computes "Available KV cache memory: 73.6 GiB", but PyTorch's actual allocator stays capped at the GPU pool (~31 GB). vLLM then OOMs trying to allocate the budget it computed. Fix: turn XNACK off and live within the 31 GB pool until BIOS UMA gives a single bigger pool.

  6. --swap-space was removed in modern vLLM. Don't pass it.

  7. --num-gpu-blocks-override 32 as belt-and-braces against vLLM's KV pool auto-discovery picking a too-big number even without XNACK.
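
Taken together, fixes 1-7 land in the compose file roughly like this. This is an illustrative condensation, not the checked-in kimi-linear.yml; the model path is a placeholder, and the numbered comments refer to the fix list above:

```yaml
# Sketch only; see pyinfra/framework/compose/kimi-linear.yml for the real file.
services:
  kimi-linear:
    entrypoint: ["vllm", "serve"]              # 1: image ships no entrypoint
    command:
      - /models/Kimi-Linear-48B-A3B-Instruct   # positional model path (placeholder)
      - --max-model-len=4096                   # current VRAM-ceiling cap
      - --max-num-seqs=1
      - --num-gpu-blocks-override=32           # 7: clamp KV-pool auto-discovery
      # no --swap-space: removed in modern vLLM (6)
    environment:
      VLLM_ROCM_USE_AITER_MLA: "0"             # 4: skip AITER only on the MLA path
      HSA_XNACK: "0"                           # 5: XNACK inflates the KV budget
```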
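
For reference on fix 2: the function patch-tokenizer.sh inlines is, to the best of my knowledge, the standard GPT-2 byte-to-unicode table that transformers used to export from its internals. A sketch of its shape:

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style byte -> printable-character map (256 entries, reversible)."""
    # Printable bytes map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # ...unprintable bytes get shifted above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
assert len(table) == 256 and table[ord("A")] == "A"
```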

What blocks long context

PyTorch's discoverable VRAM equals the BIOS UMA Frame Buffer Size, not amdgpu.gttsize. The kernel cmdline is necessary but insufficient.

  • 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
  • rocminfo reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
  • PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB Kimi weights with little KV headroom.
  • Previously confirmed: Framework's BIOS caps UMA at 64 GB.

Research outcome (2026-05-10): the right unblock isn't a higher BIOS UMA cap — it's the inverse. Set UMA small and merge the two pools.

Layer-by-layer unblock recipe:

  • BIOS: update to 3.05 stable (Apr 2026); set UMA Frame Buffer = 0.5 GB or 8 GB (counter-intuitive but documented — frees pages for GTT)
  • Kernel: ≥ 6.16.9. Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's linux-generic-hwe-24.04 may be on 6.8/6.11 — verify with uname -r and upgrade if needed.
  • Cmdline: amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 (ttm.pages_limit in 4 KiB pages = 128 GiB)
  • Env: HSA_XNACK=1 + HSA_FORCE_FINE_GRAIN_PCIE=1 (the piece our earlier XNACK attempt was missing)
  • Env: PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9" (HIP variant, not CUDA)
  • PyTorch: TheRock gfx1151 wheels — kyuz0:stable image already uses these
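
The two cmdline sizes above encode the same 128 GiB budget in different units, which is worth sanity-checking whenever they change (amdgpu.gttsize is in MiB, ttm.pages_limit in 4 KiB pages):

```python
# Cross-check: both cmdline values above describe a 128 GiB budget.
gttsize_mib = 131072      # amdgpu.gttsize, in MiB
pages_limit = 33554432    # ttm.pages_limit, in 4 KiB pages

gtt_gib = gttsize_mib // 1024             # 131072 MiB -> 128 GiB
pages_gib = pages_limit * 4096 // 2**30   # 2^25 pages * 4 KiB -> 128 GiB
assert gtt_gib == pages_gib == 128
print(gtt_gib, "GiB")
```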

Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki: "Kernel Parameters and Unified Memory") and the Framework community "Linux + ROCm: January 2026 Stable Configurations Update" thread. Single-process allocation budget after this: ~110 GB.

The hard ceiling, if all of the above doesn't yield enough: 96 GB direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh is rumored but unreleased.

Fallback if the merge approach doesn't work for vLLM specifically: llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits layers across CPU/GPU. Lower throughput, more flexible.

When you come back

  1. Read research output for BIOS UMA limits. If it landed in the chat, it's the most recent note in this project's session. If not yet, re-dispatch (prompt is in conversation history — Framework Desktop May 2026 UMA Frame Buffer cap research).
  2. Decide path:
    • BIOS bump available → flash, reboot, drop the constraints in compose/kimi-linear.yml (max-model-len, num-gpu-blocks-override), re-run ./smoke.sh, then ramp context.
    • BIOS capped at 64 GB → pivot to llama.cpp ROCm path or accept 4K context as the long-term reality.
  3. Then advance through P1 → P2 → P3 per the roadmap.
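
If the BIOS path lands, the constraint-drop in step 2 is roughly this shape of change to the compose command (flag names are real vLLM flags; the values are untested guesses, meant to be ramped stepwise with a smoke run between steps):

```yaml
# Hypothetical post-unblock command fragment for kimi-linear.yml.
command:
  - --max-model-len=65536     # up from 4096; ramp stepwise toward 256K
  - --max-num-seqs=1          # keep single-sequence until KV headroom is proven
  # drop --num-gpu-blocks-override and let vLLM auto-size the KV pool
```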

Files of record

  • pyinfra/framework/compose/kimi-linear.yml — service def, all the flag/env tradeoffs documented inline.
  • pyinfra/framework/compose/kimi-linear/ — Dockerfile + scripts.
  • pyinfra/framework/deploy.py — wired into the service loop + asset-copy block.
  • Roadmap.md — strategy.
  • StrixHaloMemory.md — the UMA-vs-GTT discussion that needs a follow-up paragraph from this work (PyTorch caps at UMA, XNACK is a trap).

Decisions worth not relitigating

  • vLLM 0.19.x via kyuz0:stable chosen over source-build with v0.11.2 pin (build.sh exists as fallback). The recipe pin advice was based on an earlier Moonshot doc; current upstream works.
  • 4-bit compressed-tensors (cyankiwi) chosen over 8-bit. With the 31 GB ceiling, 8-bit wouldn't even fit weights resident.
  • VLLM_ROCM_USE_AITER_MLA=0, NOT VLLM_MLA_DISABLE=1. Granular disable preserves AITER for non-MLA paths. The full disable is the next escalation if needed.
  • No upstream filings. Findings stay in this repo per project policy (memory: feedback_private_findings).