# kimi-linear — resumption guide
Open this first when picking the work back up.
## What this project is
Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x,
gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M)
local inference using the model architecturally best-suited to the box's
unified-memory shape.
Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools"
(to be added after BIOS unblock).
Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` +
`pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh,
patch-tokenizer.sh, smoke.sh, README.md). Deploy push:
`cd pyinfra/framework && ./run.sh`.
## Where we are (2026-05-10)
**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear
generation on a Strix Halo iGPU. Smoke test passes:
`/v1/models` returns `kimi-linear`; tiny generation returns "ok".
Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`,
`--num-gpu-blocks-override 32`. The long-context point of the model is
locked behind a **VRAM ceiling** — see "What blocks progress" below.
**P1-P4 — pending.**
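For a hand re-check of both P0 conditions, a minimal sketch (host/port are assumptions; `smoke.sh` in the repo is the source of truth):
```bash
# Re-check the two P0 conditions by hand. Host/port are assumptions;
# compose/kimi-linear/smoke.sh is authoritative.
curl -s http://localhost:8000/v1/models | grep -q 'kimi-linear' && echo "model listed"

# Tiny generation through vLLM's OpenAI-compatible completions endpoint.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear", "prompt": "Reply with ok", "max_tokens": 4}'
```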
## The gauntlet of fixes that got us here (don't re-derive)
All of these are baked into the repo. Reproducing P0 on a clean box means
pushing the repo and running the steps in `compose/kimi-linear/README.md`.
A consolidated sketch of the resulting flags and env vars follows this list.
1. **Image entrypoint missing.** kyuz0 toolboxes drop into a shell;
compose's `command:` gets exec'd as the program. Fix: explicit
`entrypoint: ["vllm", "serve"]` in the compose file, model path as
positional first arg.
2. **Tokenizer ImportError.** `tokenization_kimi.py` imports
`bytes_to_unicode` from a transformers internal that's been removed.
Fix: `patch-tokenizer.sh` inlines the function. Idempotent.
3. **Missing AITER gfx1151 GEMM configs.** The kyuz0 image is built for
gfx1151 but doesn't ship the AITER autotuning JSONs for every op Kimi's
MLA layers hit (the image was validated against Qwen/MiniMax, which are
not MLA-heavy models). Fix: the derived `Dockerfile` copies gfx1100
(RDNA3) configs into gfx1151-named slots — kernels compile and run;
tile sizes are suboptimal but functional.
4. **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.**
On top of resident weights, that's ~58 GB needed and we have 31 GB.
Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA
path, keeps it for everything else.
5. **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging
into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`)
so vLLM computes "Available KV cache memory: 73.6 GiB", but
PyTorch's actual allocator stays capped at the GPU pool (~31 GB).
vLLM then OOMs trying to allocate the budget it computed. Fix: turn
XNACK *off*, live within the 31 GB pool until BIOS UMA gives a
single bigger pool.
6. **`--swap-space` was removed in modern vLLM.** Don't pass it.
7. **`--num-gpu-blocks-override 32`** as a belt-and-braces guard against
vLLM's KV-pool auto-discovery picking too big a number even with XNACK
off.
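For orientation, here is what the fixes add up to at runtime. The model path, and the idea of invoking this outside compose at all, are illustrative assumptions; `kimi-linear.yml` remains authoritative.
```bash
# Consolidated sketch of fixes 1-7 (illustrative; kimi-linear.yml is the
# source of truth, and the model path below is an assumption).

export HSA_XNACK=0                  # fix 5: stay inside the ~31 GB GPU pool
export VLLM_ROCM_USE_AITER_MLA=0    # fix 4: bypass AITER on the MLA path only

# fix 2: patch-tokenizer.sh must already have inlined bytes_to_unicode.
# fix 3: lives in the Dockerfile (gfx1100 AITER configs in gfx1151 slots).
# fix 1: compose runs entrypoint ["vllm", "serve"] with the model path as
#        the first positional arg, equivalent to:
vllm serve /models/Kimi-Linear-48B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --num-gpu-blocks-override 32      # fix 7; and per fix 6, no --swap-space
```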
## What blocks long context
PyTorch's discoverable VRAM equals the **BIOS UMA Frame Buffer Size**,
not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.
- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB
Kimi weights with little KV headroom.
- Previously established on this box: Framework's BIOS caps UMA at 64 GB.
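A quick sanity check of that pool picture (the greps are just filters, not an API; `rocminfo` output shape varies across ROCm versions):
```bash
# Show the GPU agent's memory pools; expect two ~31 GB pools pre-unblock.
rocminfo | grep -iA4 'pool' | grep -i 'size'
# What Linux itself sees after the UMA carve-out (~62 GB on the 64 GB setting).
free -g
```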
**Research outcome (2026-05-10):** the right unblock isn't a higher BIOS
UMA cap — it's the inverse. Set UMA *small* and merge the two pools.
| Layer | Setting |
| --- | --- |
| BIOS | Update to **3.05 stable** (Apr 2026); set UMA Frame Buffer = **0.5 GB** or **8 GB** (counter-intuitive but documented — frees pages for GTT) |
| Kernel | **≥ 6.16.9.** Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` in 4 KiB pages = 128 GiB) |
| Env | `HSA_XNACK=1` **+** `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — kyuz0:stable image already uses these |
Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki:
"Kernel Parameters and Unified Memory") and the Framework community
"Linux + ROCm: January 2026 Stable Configurations Update" thread.
Single-process allocation budget after this: ~110 GB.
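A sketch of applying the table on Ubuntu with GRUB. The sed and file paths are assumptions; review `/etc/default/grub` by hand and double-check the cmdline against the running kernel before rebooting.
```bash
# Kernel cmdline row (assumes GRUB; inspect the result before update-grub).
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 /' \
  /etc/default/grub
sudo update-grub

# Kernel row: must report >= 6.16.9 after reboot.
uname -r

# Env rows, as they would go in the service's environment:
export HSA_XNACK=1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
```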
**The hard ceiling**, if all of the above doesn't yield enough: 96 GB
direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path
past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh
is rumored but unreleased.
**Fallback if the merge approach doesn't work for vLLM specifically**:
llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits
layers across CPU/GPU. Lower throughput, more flexible.
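A hedged sketch of that fallback; the GGUF filename and the `-ngl` split are assumptions to tune against the 31 GB pool, not tested values.
```bash
# llama.cpp ROCm fallback. -ngl = layers offloaded to GPU (rest on CPU);
# -c = context length to ramp toward. Filename and values are assumptions.
llama-server -m Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 40 -c 65536 --host 0.0.0.0 --port 8080
```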
## When you come back
1. Read research output for BIOS UMA limits. If it landed in the chat,
it's the most recent note in this project's session. If not yet,
re-dispatch (prompt is in conversation history — Framework Desktop
May 2026 UMA Frame Buffer cap research).
2. Decide path:
- **BIOS bump available** → flash, reboot, drop the constraints in
`compose/kimi-linear.yml` (max-model-len, num-gpu-blocks-override),
re-run `./smoke.sh`, then ramp context (sketch after this list).
- **BIOS capped at 64 GB** → pivot to llama.cpp ROCm path or accept
4K context as the long-term reality.
3. Then advance through P1 → P2 → P3 per the roadmap.
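If the BIOS path lands, the context ramp in step 2 might look like the following. The sed and the length ladder are illustrative assumptions; in practice edit `kimi-linear.yml` by hand and redeploy.
```bash
# Context ramp after the memory unblock. Stop at the first failing smoke
# test and back off. Remove the --num-gpu-blocks-override cap by hand first.
cd pyinfra/framework
for LEN in 32768 131072 262144; do
  sed -i "s/--max-model-len [0-9]*/--max-model-len ${LEN}/" compose/kimi-linear.yml
  ./run.sh                                   # redeploy the service
  ./compose/kimi-linear/smoke.sh || break    # first failure ends the ramp
done
```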
## Files of record
- `pyinfra/framework/compose/kimi-linear.yml` — service def, all the
flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop +
asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a
follow-up paragraph from this work (PyTorch caps at UMA, XNACK is a
trap).
## Decisions worth not relitigating
- **vLLM 0.19.x via kyuz0:stable** chosen over source-build with
`v0.11.2` pin (build.sh exists as fallback). The recipe pin advice
was based on an earlier Moonshot doc; current upstream works.
- **4-bit compressed-tensors** (cyankiwi) chosen over 8-bit. With the
31 GB ceiling, 8-bit wouldn't even fit weights resident.
- **VLLM_ROCM_USE_AITER_MLA=0**, NOT `VLLM_MLA_DISABLE=1`. Granular
disable preserves AITER for non-MLA paths. The full disable is the
next escalation if needed.
- **No upstream filings.** Findings stay in this repo per project
policy (memory: `feedback_private_findings`).