Document current coding-workflow stack state
Snapshot of where opencode + Qwen3-Coder + MCPs + Kimi-Linear + voice + Phoenix tracing land today, plus in-flight (oc-tree, kimi-linear context ramp) and next (ComfyUI) items with pointers to per-project NEXT_STEPS.md guides.
kimi-linear/NEXT_STEPS.md (new file, 146 lines):

# kimi-linear — resumption guide

Open this first when picking the work back up.

## What this project is

Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x,
gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M)
local inference using the model architecturally best-suited to the box's
unified-memory shape.

Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools"
(to be added after BIOS unblock).

Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` +
`pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh,
patch-tokenizer.sh, smoke.sh, README.md). Deploy push:
`cd pyinfra/framework && ./run.sh`.

## Where we are (2026-05-10)

**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear
generation on a Strix Halo iGPU. Smoke test passes:
`/v1/models` returns `kimi-linear`; tiny generation returns "ok".
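
For quick re-verification, a minimal sketch of that smoke check (the repo's
`compose/kimi-linear/smoke.sh` is the file of record; the port and host here
are assumptions):

```bash
# List served models; expect a single entry named "kimi-linear".
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

# Tiny generation; expect a short completion along the lines of "ok".
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear", "prompt": "Reply with ok", "max_tokens": 4}'
```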

Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`,
`--num-gpu-blocks-override 32`. The long-context point of the model is
locked behind a **VRAM ceiling** — see "What blocks long context" below.

**P1-P4 — pending.**

## The gauntlet of fixes that got us here (don't re-derive)

All of these are baked into the repo. Reproducing P0 from a clean box means:
push the repo + run the steps in `compose/kimi-linear/README.md`. A
consolidated sketch of how the fixes land in the serve invocation follows the
list.

1. **Image entrypoint missing.** kyuz0 toolboxes drop into a shell;
   compose's `command:` gets exec'd as the program. Fix: explicit
   `entrypoint: ["vllm", "serve"]` in the compose file, model path as
   positional first arg.

2. **Tokenizer ImportError.** `tokenization_kimi.py` imports
   `bytes_to_unicode` from a transformers internal that's been removed.
   Fix: `patch-tokenizer.sh` inlines the function. Idempotent.

3. **Missing AITER gfx1151 GEMM configs.** kyuz0 image is built for
   gfx1151 but doesn't ship the AITER autotuning JSONs for every op
   Kimi's MLA layers hit (validated against Qwen/MiniMax, not
   MLA-heavy models). Fix: derived `Dockerfile` copies gfx1100 (RDNA3)
   configs into gfx1151-named slots — kernels compile + run, tile
   sizes not optimal but functional.

4. **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.**
   On top of resident weights, that's ~58 GB needed and we have 31 GB.
   Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA
   path, keeps it for everything else.

5. **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging
   into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`)
   so vLLM computes "Available KV cache memory: 73.6 GiB", but
   PyTorch's actual allocator stays capped at the GPU pool (~31 GB).
   vLLM then OOMs trying to allocate the budget it computed. Fix: turn
   XNACK *off*, live within the 31 GB pool until BIOS UMA gives a
   single bigger pool.

6. **`--swap-space` was removed in modern vLLM.** Don't pass it.

7. **`--num-gpu-blocks-override 32`** as belt-and-braces against
   vLLM's KV pool auto-discovery picking a too-big number even without
   XNACK.
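
The flags and env vars from fixes 1, 4, 5, 6 and 7 combine into one serve
invocation. A minimal sketch of the effective call (the compose file remains
the source of truth; the model path below is a placeholder):

```bash
# Fix 4: bypass AITER only for the MLA path, keep it for everything else.
export VLLM_ROCM_USE_AITER_MLA=0
# Fix 5: XNACK off; stay inside the ~31 GB GPU pool PyTorch can actually use.
unset HSA_XNACK

# Fix 1: vLLM serve as the entrypoint, model path as positional first arg.
# Fix 6: no --swap-space (removed in modern vLLM).
# Fix 7: pin KV blocks so auto-discovery can't overshoot.
vllm serve /models/Kimi-Linear-48B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --num-gpu-blocks-override 32
```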
## What blocks long context

PyTorch's discoverable VRAM equals the **BIOS UMA Frame Buffer Size**,
not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.

- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB
  Kimi weights with little KV headroom.
- User has previously found Framework's BIOS caps UMA at 64 GB.
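
To re-verify those numbers on the box, a quick sketch (field names in
`rocminfo` output vary slightly across ROCm releases, so the grep is
approximate):

```bash
# Expect two GPU memory pools of roughly 31 GB each.
rocminfo | grep -E 'Pool Info|Segment|Size' | head -n 30

# What PyTorch's allocator actually sees (the torch.cuda API is backed by HIP
# on ROCm builds).
python3 -c "import torch; p = torch.cuda.get_device_properties(0); print(round(p.total_memory / 2**30, 1), 'GiB visible to PyTorch')"
```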

**Research outcome (2026-05-10):** the right unblock isn't a higher BIOS
UMA cap — it's the inverse. Set UMA *small* and merge the two pools.

| Layer | Setting |
| --- | --- |
| BIOS | Update to **3.05 stable** (Apr 2026); set UMA Frame Buffer = **0.5 GB** or **8 GB** (counter-intuitive but documented — frees pages for GTT) |
| Kernel | **≥ 6.16.9.** Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` in 4 KiB pages = 128 GiB) |
| Env | `HSA_XNACK=1` **+** `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — kyuz0:stable image already uses these |
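
The BIOS and kernel rows are manual steps; the cmdline and env rows can be
staged ahead of time. A sketch, assuming the stock Ubuntu GRUB layout that
pyinfra deploys (verify paths before applying):

```bash
uname -r   # must already be >= 6.16.9 for the rest to matter

# Append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
sudo update-grub && sudo reboot

# In the container / service environment after reboot:
export HSA_XNACK=1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
```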

Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki:
"Kernel Parameters and Unified Memory") and the Framework community
"Linux + ROCm: January 2026 Stable Configurations Update" thread.
Single-process allocation budget after this: ~110 GB.

**The hard ceiling**, if all of the above doesn't yield enough: 96 GB
direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path
past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh
is rumored but unreleased.

**Fallback if the merge approach doesn't work for vLLM specifically**:
llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits
layers across CPU/GPU. Lower throughput, more flexible.
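
A sketch of that fallback, should it come to it (untested here; model path,
port and layer split are placeholders to tune on the box):

```bash
# llama.cpp's OpenAI-compatible server, ROCm build, Q4_K_M GGUF.
# --n-gpu-layers sets the CPU/GPU layer split; raise it until VRAM is full.
llama-server \
  --model /models/Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 30 \
  --ctx-size 65536 \
  --host 0.0.0.0 --port 8080
```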
## When you come back

1. Read the research output for BIOS UMA limits. If it landed in the chat,
   it's the most recent note in this project's session. If not yet,
   re-dispatch (prompt is in conversation history — Framework Desktop
   May 2026 UMA Frame Buffer cap research).
2. Decide path:
   - **BIOS bump available** → flash, reboot, drop the constraints in
     `compose/kimi-linear.yml` (max-model-len, num-gpu-blocks-override),
     re-run `./smoke.sh`, ramp context (sketch after this list).
   - **BIOS capped at 64 GB** → pivot to the llama.cpp ROCm path or accept
     4K context as the long-term reality.
3. Then advance through P1 → P2 → P3 per the roadmap.
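
A sketch of that ramp, assuming the BIOS path opens up (values illustrative,
not validated on the box; check where smoke.sh expects to run from):

```bash
cd pyinfra/framework
# In compose/kimi-linear.yml: raise --max-model-len (e.g. 65536 to start),
# drop --num-gpu-blocks-override, keep --max-num-seqs 1 for the first pass.
./run.sh                          # redeploy the service
./compose/kimi-linear/smoke.sh    # re-run the P0 smoke check before ramping further
```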
## Files of record

- `pyinfra/framework/compose/kimi-linear.yml` — service def, all the
  flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop +
  asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a
  follow-up paragraph from this work (PyTorch caps at UMA, XNACK is a
  trap).

## Decisions worth not relitigating

- **vLLM 0.19.x via kyuz0:stable** chosen over a source build with the
  `v0.11.2` pin (build.sh exists as fallback). The recipe's pin advice
  was based on an earlier Moonshot doc; current upstream works.
- **4-bit compressed-tensors** (cyankiwi) chosen over 8-bit. With the
  31 GB ceiling, 8-bit wouldn't even fit the weights resident.
- **VLLM_ROCM_USE_AITER_MLA=0**, NOT `VLLM_MLA_DISABLE=1`. The granular
  disable preserves AITER for non-MLA paths; the full disable is the
  next escalation if needed.
- **No upstream filings.** Findings stay in this repo per project
  policy (memory: `feedback_private_findings`).