# kimi-linear — resumption guide
Open this first when picking the work back up.
## What this project is
Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x,
gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M)
local inference using the model architecturally best-suited to the box's
unified-memory shape.
Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools"
(to be added after BIOS unblock).
Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` +
`pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh,
patch-tokenizer.sh, smoke.sh, README.md). Deploy push:
`cd pyinfra/framework && ./run.sh`.
## Where we are (2026-05-10)
**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear
generation on a Strix Halo iGPU. Smoke test passes:
`/v1/models` returns `kimi-linear`; tiny generation returns "ok".
Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`,
`--num-gpu-blocks-override 32`. The long-context point of the model is
locked behind a **VRAM ceiling** — see "What blocks progress" below.
**P1-P4 — pending.**
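For a hand re-check of both P0 conditions, a minimal sketch (host/port are assumptions; `smoke.sh` in the repo is the source of truth):
```bash
# Re-check the two P0 conditions by hand. Host/port are assumptions;
# compose/kimi-linear/smoke.sh is authoritative.
curl -s http://localhost:8000/v1/models | grep -q 'kimi-linear' && echo "model listed"

# Tiny generation through vLLM's OpenAI-compatible completions endpoint.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear", "prompt": "Reply with ok", "max_tokens": 4}'
```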
## The gauntlet of fixes that got us here (don't re-derive)
All of these are baked into the repo. Reproducing P0 on a clean box means
pushing the repo and running the steps in `compose/kimi-linear/README.md`.
A consolidated sketch of the resulting flags and env vars follows this list.
1. **Image entrypoint missing.** kyuz0 toolboxes drop into a shell;
compose's `command:` gets exec'd as the program. Fix: explicit
`entrypoint: ["vllm", "serve"]` in the compose file, model path as
positional first arg.
2. **Tokenizer ImportError.** `tokenization_kimi.py` imports
`bytes_to_unicode` from a transformers internal that's been removed.
Fix: `patch-tokenizer.sh` inlines the function. Idempotent.
3. **Missing AITER gfx1151 GEMM configs.** The kyuz0 image is built for
gfx1151 but doesn't ship the AITER autotuning JSONs for every op Kimi's
MLA layers hit (the image was validated against Qwen/MiniMax, which are
not MLA-heavy models). Fix: the derived `Dockerfile` copies gfx1100
(RDNA3) configs into gfx1151-named slots — kernels compile and run;
tile sizes are suboptimal but functional.
4. **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.**
On top of resident weights, that's ~58 GB needed and we have 31 GB.
Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA
path, keeps it for everything else.
5. **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging
into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`)
so vLLM computes "Available KV cache memory: 73.6 GiB", but
PyTorch's actual allocator stays capped at the GPU pool (~31 GB).
vLLM then OOMs trying to allocate the budget it computed. Fix: turn
XNACK *off*, live within the 31 GB pool until BIOS UMA gives a
single bigger pool.
6. **`--swap-space` was removed in modern vLLM.** Don't pass it.
7. **`--num-gpu-blocks-override 32`** as a belt-and-braces guard against
vLLM's KV-pool auto-discovery picking too big a number even with XNACK
off.
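For orientation, here is what the fixes add up to at runtime. The model path, and the idea of invoking this outside compose at all, are illustrative assumptions; `kimi-linear.yml` remains authoritative.
```bash
# Consolidated sketch of fixes 1-7 (illustrative; kimi-linear.yml is the
# source of truth, and the model path below is an assumption).

export HSA_XNACK=0                  # fix 5: stay inside the ~31 GB GPU pool
export VLLM_ROCM_USE_AITER_MLA=0    # fix 4: bypass AITER on the MLA path only

# fix 2: patch-tokenizer.sh must already have inlined bytes_to_unicode.
# fix 3: lives in the Dockerfile (gfx1100 AITER configs in gfx1151 slots).
# fix 1: compose runs entrypoint ["vllm", "serve"] with the model path as
#        the first positional arg, equivalent to:
vllm serve /models/Kimi-Linear-48B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --num-gpu-blocks-override 32      # fix 7; and per fix 6, no --swap-space
```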
## What blocks long context
PyTorch's discoverable VRAM equals the **BIOS UMA Frame Buffer Size**,
not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.
- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB
Kimi weights with little KV headroom.
- Previously established on this box: Framework's BIOS caps UMA at 64 GB.
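A quick sanity check of that pool picture (the greps are just filters, not an API; `rocminfo` output shape varies across ROCm versions):
```bash
# Show the GPU agent's memory pools; expect two ~31 GB pools pre-unblock.
rocminfo | grep -iA4 'pool' | grep -i 'size'
# What Linux itself sees after the UMA carve-out (~62 GB on the 64 GB setting).
free -g
```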
**Research outcome (2026-05-10):** the right unblock isn't a higher BIOS
UMA cap — it's the inverse. Set UMA *small* and merge the two pools.
| Layer | Setting |
| --- | --- |
| BIOS | Update to **3.05 stable** (Apr 2026); set UMA Frame Buffer = **0.5 GB** or **8 GB** (counter-intuitive but documented — frees pages for GTT) |
| Kernel | **≥ 6.16.9.** Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` in 4 KiB pages = 128 GiB) |
| Env | `HSA_XNACK=1` **+** `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — kyuz0:stable image already uses these |
Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki:
"Kernel Parameters and Unified Memory") and the Framework community
"Linux + ROCm: January 2026 Stable Configurations Update" thread.
Single-process allocation budget after this: ~110 GB.
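A sketch of applying the table on Ubuntu with GRUB. The sed and file paths are assumptions; review `/etc/default/grub` by hand and double-check the cmdline against the running kernel before rebooting.
```bash
# Kernel cmdline row (assumes GRUB; inspect the result before update-grub).
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 /' \
  /etc/default/grub
sudo update-grub

# Kernel row: must report >= 6.16.9 after reboot.
uname -r

# Env rows, as they would go in the service's environment:
export HSA_XNACK=1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
```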
**The hard ceiling**, if all of the above doesn't yield enough: 96 GB
direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path
past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh
is rumored but unreleased.
**Fallback if the merge approach doesn't work for vLLM specifically**:
llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits
layers across CPU/GPU. Lower throughput, more flexible.
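A hedged sketch of that fallback; the GGUF filename and the `-ngl` split are assumptions to tune against the 31 GB pool, not tested values.
```bash
# llama.cpp ROCm fallback. -ngl = layers offloaded to GPU (rest on CPU);
# -c = context length to ramp toward. Filename and values are assumptions.
llama-server -m Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 40 -c 65536 --host 0.0.0.0 --port 8080
```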
## When you come back
1. Read research output for BIOS UMA limits. If it landed in the chat,
it's the most recent note in this project's session. If not yet,
re-dispatch (prompt is in conversation history — Framework Desktop
May 2026 UMA Frame Buffer cap research).
2. Decide path:
- **BIOS bump available** → flash, reboot, drop the constraints in
`compose/kimi-linear.yml` (max-model-len, num-gpu-blocks-override),
re-run `./smoke.sh`, then ramp context (sketch after this list).
- **BIOS capped at 64 GB** → pivot to llama.cpp ROCm path or accept
4K context as the long-term reality.
3. Then advance through P1 → P2 → P3 per the roadmap.
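If the BIOS path lands, the context ramp in step 2 might look like the following. The sed and the length ladder are illustrative assumptions; in practice edit `kimi-linear.yml` by hand and redeploy.
```bash
# Context ramp after the memory unblock. Stop at the first failing smoke
# test and back off. Remove the --num-gpu-blocks-override cap by hand first.
cd pyinfra/framework
for LEN in 32768 131072 262144; do
  sed -i "s/--max-model-len [0-9]*/--max-model-len ${LEN}/" compose/kimi-linear.yml
  ./run.sh                                   # redeploy the service
  ./compose/kimi-linear/smoke.sh || break    # first failure ends the ramp
done
```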
## Files of record
- `pyinfra/framework/compose/kimi-linear.yml` — service def, all the
flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop +
asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a
follow-up paragraph from this work (PyTorch caps at UMA, XNACK is a
trap).
## Decisions worth not relitigating
- **vLLM 0.19.x via kyuz0:stable** chosen over source-build with
`v0.11.2` pin (build.sh exists as fallback). The recipe pin advice
was based on an earlier Moonshot doc; current upstream works.
- **4-bit compressed-tensors** (cyankiwi) chosen over 8-bit. With the
31 GB ceiling, 8-bit wouldn't even fit weights resident.
- **VLLM_ROCM_USE_AITER_MLA=0**, NOT `VLLM_MLA_DISABLE=1`. Granular
disable preserves AITER for non-MLA paths. The full disable is the
next escalation if needed.
- **No upstream filings.** Findings stay in this repo per project
policy (memory: `feedback_private_findings`).