# kimi-linear

Kimi-Linear-48B-A3B-Instruct on vLLM, ROCm/TheRock 7.x, gfx1151. Serves an OpenAI-compatible API on port 8000, alongside Ollama (Qwen3-Coder) on port 11434.
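Because the endpoint speaks the OpenAI API, any OpenAI-style client can talk to it. A minimal stdlib sketch; the served model name `kimi-linear` and the no-auth setup are assumptions consistent with what the smoke test below expects:

```python
import json
import urllib.request

BASE = "http://localhost:8000/v1"  # kimi-linear; Ollama stays on :11434

def chat_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Request body for POST /v1/chat/completions (OpenAI chat schema)."""
    return {
        "model": "kimi-linear",  # assumed served name; confirm via GET /v1/models
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one chat turn and return the generated text."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```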

This is the P0 verification stage — no public Strix Halo numbers exist for this model as of 2026-05. Three things are unverified until a first generation succeeds: the KDA Triton kernel on gfx1151, the compressed-tensors loader on ROCm, and AITER with the Kimi MoE topology. The smoke test below confirms all three at once.

## Prereqs

- The pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — it gives you `/srv/docker/kimi-linear/`, GPU group membership, the `/models/` layout, and `huggingface-cli` on the box.
- The Hugging Face CLI is authenticated (`huggingface-cli login`) if the weights repo gates downloads. cyankiwi's repo is currently public.

## Step 1 — Download weights

```shell
huggingface-cli download \
    cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
    --local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit
```

~35 GB. The repo is named AWQ-4bit, but the actual format is compressed-tensors int4 group-quantized — see `config.json`.
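Since the repo name says AWQ but the weights are compressed-tensors, it's worth confirming the format from `config.json` before burning a load attempt. A sketch assuming the usual Hugging Face `quantization_config` layout:

```python
import json
from pathlib import Path

WEIGHTS_DIR = Path("/models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit")

def quant_method(config_dir: Path) -> str:
    """Read quantization_config.quant_method out of config.json.

    Field names follow the common Hugging Face convention; if this
    repo's config.json differs, adjust the keys accordingly.
    """
    cfg = json.loads((config_dir / "config.json").read_text())
    return cfg.get("quantization_config", {}).get("quant_method", "none")

# Expect "compressed-tensors" here, not "awq", despite the repo name.
```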

## Step 2 — Try the upstream image first

```shell
cd /srv/docker/kimi-linear
docker compose pull       # ~8.5 GB
docker compose up -d
docker compose logs -f
```

Watch for one of three outcomes:

- Loads cleanly and the model serves on :8000 → P0 passes. Run `./smoke.sh`.
- `MLAModules.__init__() missing 'indexer_rotary_emb'` → the upstream image is on vLLM 0.12.x; you need the v0.11.2 source build. Skip to Step 3.
- KDA / Triton / fla-core compile error → the kernel doesn't work on gfx1151 yet. Fallback path: llama.cpp ROCm with the bartowski Q4_K_M GGUF in `compose/llama.yml`. Document the error in `localgenai/kimi-linear/NOTES.md` and stop.
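The three outcomes can be triaged mechanically while tailing the logs. The substring matches below come from the errors quoted above, and the "ready" markers are what uvicorn-based vLLM typically prints, so treat all of this as a heuristic sketch:

```python
def triage(line: str) -> str:
    """Map one vLLM startup log line to a Step-2 outcome (heuristic)."""
    low = line.lower()
    if "indexer_rotary_emb" in line:
        return "source-build"        # upstream is vLLM 0.12.x: go to Step 3
    if any(s in low for s in ("kda", "fla-core", "triton")) and "error" in low:
        return "llama.cpp-fallback"  # kernel broken on gfx1151: document and stop
    if "application startup complete" in low or "uvicorn running" in low:
        return "p0-pass"             # serving on :8000: run ./smoke.sh
    return "keep-watching"
```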

## Step 3 — Source build (if needed)

```shell
cd /srv/docker/kimi-linear
tmux new -s kimi-build
./build.sh        # multi-hour; detach with C-b d, reattach with `tmux a -t kimi-build`
```

This builds `kimi-linear-local:v0.11.2` from kyuz0 SHA e2288d6 with `VLLM_COMMIT=v0.11.2`. Then edit `docker-compose.yml`:

```yaml
    image: kimi-linear-local:v0.11.2
```

…and run `docker compose up -d` again.
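Before pointing the compose file at the local tag, you can confirm the build actually produced it. A small sketch wrapping plain `docker image inspect` (the tag matches the one the build step is described as producing):

```python
import subprocess

def image_exists(tag: str) -> bool:
    """True if `docker image inspect <tag>` succeeds on this host."""
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", tag],
            capture_output=True,
        )
    except FileNotFoundError:  # docker CLI not on PATH
        return False
    return result.returncode == 0

# Only switch docker-compose.yml to kimi-linear-local:v0.11.2 once this is True.
```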

## Step 4 — Smoke test

```shell
./smoke.sh
```

Expected: `/v1/models` returns `kimi-linear`, and a four-token generation returns "ok". If both pass, P0 is done. Update task #6 and proceed to P1.
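`smoke.sh` itself lives in the deploy; its two checks amount to validating the model listing and the tiny generation. A sketch of those checks, assuming standard OpenAI response shapes (the chat-shaped response is an assumption; the script may use `/v1/completions` instead):

```python
def models_ok(models_json: dict) -> bool:
    """True if a GET /v1/models response lists the kimi-linear model."""
    return any(m.get("id") == "kimi-linear" for m in models_json.get("data", []))

def generation_ok(completion_json: dict) -> bool:
    """True if a chat completion response contains the text "ok"."""
    choices = completion_json.get("choices", [])
    if not choices:
        return False
    text = choices[0].get("message", {}).get("content", "")
    return "ok" in text.lower()
```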

## Operations

```shell
docker compose logs -f kimi-linear         # tail logs
docker compose restart kimi-linear         # reload
docker compose down                        # stop
docker compose exec kimi-linear bash       # shell into the container
amdgpu_top                                 # on the host: GPU power, memory, utilization
```

## Pin manifest

| Component | Pin |
| --- | --- |
| kyuz0 toolbox commit | e2288d6 (2026-04-22) |
| vLLM tag | v0.11.2 (Moonshot recipe) |
| Image (default) | kyuz0/vllm-therock-gfx1151:stable |
| Image (pinned, if built) | kimi-linear-local:v0.11.2 |
| Weights | cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit (compressed-tensors int4) |
| ROCm | TheRock nightlies via the kyuz0 base |
| Python | 3.12 (hardcoded in the kyuz0 Dockerfile) |

Bump policy: don't move vLLM to 0.12.x; don't move the kyuz0 commit without re-running the smoke test; bump the weights only when an 8-bit A/B comparison is in scope (P3).

## Port collision warning

`compose/vllm.yml` is a placeholder stub that also binds :8000, so only one of kimi-linear and vllm can run at a time. Don't `docker compose up` both. Long term, either delete the stub or move it to a different port; that's not in scope here.
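A quick way to catch the collision before `up`: check whether anything is already listening on :8000. Stdlib-only sketch:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# port_in_use(8000) returning True before `docker compose up` means either
# the kimi-linear service or the vllm stub is already bound there.
```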

## Known issues / mitigations

- HIP graph capture is broken on gfx1151 (vllm-project/vllm#32180); `--enforce-eager` mitigates it at a throughput cost. Re-test without the flag once the upstream fix lands.
- vLLM 0.12.0 crashes with `Kimi-LinearMLAModules.__init__() missing 'indexer_rotary_emb'`. Hard-pin to 0.11.2.
- No published gfx1151 numbers exist — we are first. Findings stay private (no upstream filings) per project policy.

## Status

P0 in progress. Update oc-tree-style NEXT_STEPS.md if you set this aside mid-verification.