Document current coding-workflow stack state
Snapshot of where opencode + Qwen3-Coder + MCPs + Kimi-Linear + voice + Phoenix tracing land today, plus in-flight (oc-tree, kimi-linear context ramp) and next (ComfyUI) items with pointers to per-project NEXT_STEPS.md guides.
kimi-linear/NEXT_STEPS.md (new file, 146 lines):

# kimi-linear — resumption guide

Open this first when picking the work back up.

## What this project is

Kimi-Linear-48B-A3B-Instruct on the Strix Halo box via vLLM, ROCm/TheRock 7.x,
gfx1151. Sits beside Ollama+Qwen3-Coder. Goal: long-context (256K-1M)
local inference using the model architecturally best-suited to the box's
unified-memory shape.

Roadmap entry: `localgenai/Roadmap.md` → "Layer 0: Inference + tools"
(to be added after BIOS unblock).

Container artifacts: `pyinfra/framework/compose/kimi-linear.yml` +
`pyinfra/framework/compose/kimi-linear/` (Dockerfile, build.sh,
patch-tokenizer.sh, smoke.sh, README.md). Deploy push:
`cd pyinfra/framework && ./run.sh`.

## Where we are (2026-05-10)

**P0 — DONE, constrained.** First-ever locally-served Kimi-Linear
generation on a Strix Halo iGPU. Smoke test passes:
`/v1/models` returns `kimi-linear`; tiny generation returns "ok".
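
For quick re-verification, a minimal sketch of that smoke check (the repo's
`compose/kimi-linear/smoke.sh` is the file of record; the port and host here
are assumptions):

```bash
# List served models; expect a single entry named "kimi-linear".
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

# Tiny generation; expect a short completion along the lines of "ok".
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear", "prompt": "Reply with ok", "max_tokens": 4}'
```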

Current runtime cap: `--max-model-len 4096`, `--max-num-seqs 1`,
`--num-gpu-blocks-override 32`. The long-context point of the model is
locked behind a **VRAM ceiling** — see "What blocks long context" below.

**P1-P4 — pending.**

## The gauntlet of fixes that got us here (don't re-derive)

All of these are baked into the repo. Reproducing P0 from a clean box means:
push the repo + run the steps in `compose/kimi-linear/README.md`. A
consolidated sketch of how the fixes land in the serve invocation follows the
list.

1. **Image entrypoint missing.** kyuz0 toolboxes drop into a shell;
   compose's `command:` gets exec'd as the program. Fix: explicit
   `entrypoint: ["vllm", "serve"]` in the compose file, model path as
   positional first arg.

2. **Tokenizer ImportError.** `tokenization_kimi.py` imports
   `bytes_to_unicode` from a transformers internal that's been removed.
   Fix: `patch-tokenizer.sh` inlines the function. Idempotent.

3. **Missing AITER gfx1151 GEMM configs.** kyuz0 image is built for
   gfx1151 but doesn't ship the AITER autotuning JSONs for every op
   Kimi's MLA layers hit (validated against Qwen/MiniMax, not
   MLA-heavy models). Fix: derived `Dockerfile` copies gfx1100 (RDNA3)
   configs into gfx1151-named slots — kernels compile + run, tile
   sizes not optimal but functional.

4. **MLA AITER FP8 BMM tries to materialize a 30 GB intermediate.**
   On top of resident weights, that's ~58 GB needed and we have 31 GB.
   Fix: `VLLM_ROCM_USE_AITER_MLA=0` — bypasses AITER for just the MLA
   path, keeps it for everything else.

5. **`HSA_XNACK=1` is a trap with vLLM.** It enables HIP demand-paging
   into GTT (115 GB ceiling per kernel cmdline `amdgpu.gttsize=117760`)
   so vLLM computes "Available KV cache memory: 73.6 GiB", but
   PyTorch's actual allocator stays capped at the GPU pool (~31 GB).
   vLLM then OOMs trying to allocate the budget it computed. Fix: turn
   XNACK *off*, live within the 31 GB pool until BIOS UMA gives a
   single bigger pool.

6. **`--swap-space` was removed in modern vLLM.** Don't pass it.

7. **`--num-gpu-blocks-override 32`** as belt-and-braces against
   vLLM's KV pool auto-discovery picking a too-big number even without
   XNACK.
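
The flags and env vars from fixes 1, 4, 5, 6 and 7 combine into one serve
invocation. A minimal sketch of the effective call (the compose file remains
the source of truth; the model path below is a placeholder):

```bash
# Fix 4: bypass AITER only for the MLA path, keep it for everything else.
export VLLM_ROCM_USE_AITER_MLA=0
# Fix 5: XNACK off; stay inside the ~31 GB GPU pool PyTorch can actually use.
unset HSA_XNACK

# Fix 1: vLLM serve as the entrypoint, model path as positional first arg.
# Fix 6: no --swap-space (removed in modern vLLM).
# Fix 7: pin KV blocks so auto-discovery can't overshoot.
vllm serve /models/Kimi-Linear-48B-A3B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --num-gpu-blocks-override 32
```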
## What blocks long context

PyTorch's discoverable VRAM equals the **BIOS UMA Frame Buffer Size**,
not `amdgpu.gttsize`. The kernel cmdline is necessary but insufficient.

- 128 GB physical → 64 GB UMA → ~62 GB visible to Linux.
- `rocminfo` reports two ~31 GB GPU pools (Pool 1 coarse / Pool 2 fine).
- PyTorch's allocator only uses one pool (~31 GB) → OOMs at ~30 GB
  Kimi weights with little KV headroom.
- User has previously found Framework's BIOS caps UMA at 64 GB.
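
To re-verify those numbers on the box, a quick sketch (field names in
`rocminfo` output vary slightly across ROCm releases, so the grep is
approximate):

```bash
# Expect two GPU memory pools of roughly 31 GB each.
rocminfo | grep -E 'Pool Info|Segment|Size' | head -n 30

# What PyTorch's allocator actually sees (the torch.cuda API is backed by HIP
# on ROCm builds).
python3 -c "import torch; p = torch.cuda.get_device_properties(0); print(round(p.total_memory / 2**30, 1), 'GiB visible to PyTorch')"
```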

**Research outcome (2026-05-10):** the right unblock isn't a higher BIOS
UMA cap — it's the inverse. Set UMA *small* and merge the two pools.

| Layer | Setting |
| --- | --- |
| BIOS | Update to **3.05 stable** (Apr 2026); set UMA Frame Buffer = **0.5 GB** or **8 GB** (counter-intuitive but documented — frees pages for GTT) |
| Kernel | **≥ 6.16.9.** Earlier kernels cap ROCm visibility at 15.5 GB. pyinfra's `linux-generic-hwe-24.04` may be on 6.8/6.11 — verify with `uname -r` and upgrade if needed. |
| Cmdline | `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432` (`ttm.pages_limit` in 4 KiB pages = 128 GiB) |
| Env | `HSA_XNACK=1` **+** `HSA_FORCE_FINE_GRAIN_PCIE=1` (the piece our earlier XNACK attempt was missing) |
| Env | `PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"` (HIP variant, not CUDA) |
| PyTorch | TheRock gfx1151 wheels — kyuz0:stable image already uses these |
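
The BIOS and kernel rows are manual steps; the cmdline and env rows can be
staged ahead of time. A sketch, assuming the stock Ubuntu GRUB layout that
pyinfra deploys (verify paths before applying):

```bash
uname -r   # must already be >= 6.16.9 for the rest to matter

# Append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
sudo update-grub && sudo reboot

# In the container / service environment after reboot:
export HSA_XNACK=1
export HSA_FORCE_FINE_GRAIN_PCIE=1
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True,garbage_collection_threshold:0.9"
```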

Confirmed working in kyuz0/amd-strix-halo-vllm-toolboxes (DeepWiki:
"Kernel Parameters and Unified Memory") and the Framework community
"Linux + ROCm: January 2026 Stable Configurations Update" thread.
Single-process allocation budget after this: ~110 GB.

**The hard ceiling**, if all of the above doesn't yield enough: 96 GB
direct UMA. AMD AGESA / StrixHaloPI limit, not Framework's. No path
past 96 GB on signed firmware as of May 2026; 192 GB Strix Halo refresh
is rumored but unreleased.

**Fallback if the merge approach doesn't work for vLLM specifically**:
llama.cpp ROCm + bartowski Q4_K_M GGUF. Different memory model, splits
layers across CPU/GPU. Lower throughput, more flexible.
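
A sketch of that fallback, should it come to it (untested here; model path,
port and layer split are placeholders to tune on the box):

```bash
# llama.cpp's OpenAI-compatible server, ROCm build, Q4_K_M GGUF.
# --n-gpu-layers sets the CPU/GPU layer split; raise it until VRAM is full.
llama-server \
  --model /models/Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 30 \
  --ctx-size 65536 \
  --host 0.0.0.0 --port 8080
```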
## When you come back

1. Read the research output for BIOS UMA limits. If it landed in the chat,
   it's the most recent note in this project's session. If not yet,
   re-dispatch (prompt is in conversation history — Framework Desktop
   May 2026 UMA Frame Buffer cap research).
2. Decide path:
   - **BIOS bump available** → flash, reboot, drop the constraints in
     `compose/kimi-linear.yml` (max-model-len, num-gpu-blocks-override),
     re-run `./smoke.sh`, ramp context (sketch after this list).
   - **BIOS capped at 64 GB** → pivot to the llama.cpp ROCm path or accept
     4K context as the long-term reality.
3. Then advance through P1 → P2 → P3 per the roadmap.
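
A sketch of that ramp, assuming the BIOS path opens up (values illustrative,
not validated on the box; check where smoke.sh expects to run from):

```bash
cd pyinfra/framework
# In compose/kimi-linear.yml: raise --max-model-len (e.g. 65536 to start),
# drop --num-gpu-blocks-override, keep --max-num-seqs 1 for the first pass.
./run.sh                          # redeploy the service
./compose/kimi-linear/smoke.sh    # re-run the P0 smoke check before ramping further
```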
## Files of record

- `pyinfra/framework/compose/kimi-linear.yml` — service def, all the
  flag/env tradeoffs documented inline.
- `pyinfra/framework/compose/kimi-linear/` — Dockerfile + scripts.
- `pyinfra/framework/deploy.py` — wired into the service loop +
  asset-copy block.
- `Roadmap.md` — strategy.
- `StrixHaloMemory.md` — the UMA-vs-GTT discussion that needs a
  follow-up paragraph from this work (PyTorch caps at UMA, XNACK is a
  trap).

## Decisions worth not relitigating

- **vLLM 0.19.x via kyuz0:stable** chosen over a source build with the
  `v0.11.2` pin (build.sh exists as fallback). The recipe's pin advice
  was based on an earlier Moonshot doc; current upstream works.
- **4-bit compressed-tensors** (cyankiwi) chosen over 8-bit. With the
  31 GB ceiling, 8-bit wouldn't even fit the weights resident.
- **VLLM_ROCM_USE_AITER_MLA=0**, NOT `VLLM_MLA_DISABLE=1`. The granular
  disable preserves AITER for non-MLA paths; the full disable is the
  next escalation if needed.
- **No upstream filings.** Findings stay in this repo per project
  policy (memory: `feedback_private_findings`).