# kimi-linear

Kimi-Linear-48B-A3B-Instruct on vLLM, ROCm/TheRock 7.x, gfx1151. Sits beside Ollama (port 11434, Qwen3-Coder) on port 8000. OpenAI-compatible.

This is the **P0 verification stage** — no public Strix Halo numbers exist for this model as of 2026-05. Three things are unverified until a first generation succeeds: KDA Triton kernel on gfx1151, compressed-tensors loader on ROCm, and AITER + Kimi MoE topology. The smoke test below confirms all three at once.

## Prereqs

- Pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — gives you `/srv/docker/kimi-linear/`, GPU group membership, `/models/` layout, and `huggingface-cli` on the box.
- Hugging Face CLI authenticated (`huggingface-cli login`) if the weights repo gates downloads. cyankiwi's repo is currently public.

## Step 1 — Download weights

```sh
huggingface-cli download \
  cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
  --local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit
```

~35 GB. The repo is named `AWQ-4bit` but the actual format is `compressed-tensors` int4 group-quantized — see `config.json`.

## Step 2 — Try the upstream image first

```sh
cd /srv/docker/kimi-linear
docker compose pull   # ~8.5 GB
docker compose up -d
docker compose logs -f
```

Watch for one of three things:

- **Loads cleanly, model serves on :8000** → P0 passes. Run `./smoke.sh`.
- **`MLAModules.__init__() missing 'indexer_rotary_emb'`** → upstream image is on vLLM 0.12.x; need the v0.11.2 source build. Skip to Step 3.
- **KDA / Triton / fla-core compile error** → kernel doesn't work on gfx1151 yet. Fallback path: llama.cpp ROCm + bartowski Q4_K_M GGUF in `compose/llama.yml`. Document the error in `localgenai/kimi-linear/NOTES.md` and stop.

## Step 3 — Source build (if needed)

```sh
cd /srv/docker/kimi-linear
tmux new -s kimi-build
./build.sh   # multi-hour. Detach with C-b d; reattach with `tmux a -t kimi-build`
```

Builds `kimi-linear-local:v0.11.2` from kyuz0 SHA `e2288d6` with `VLLM_COMMIT=v0.11.2`.
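
Because the build runs for hours in a detached tmux session, it may be worth confirming that the tag actually exists before pointing the compose file at it. A minimal sketch, assuming only the standard `docker image inspect` command; the `image_exists` helper name is hypothetical and not part of this repo:

```sh
#!/bin/sh
# Hypothetical helper (not part of this repo): succeeds only if the
# given image tag is present in the local Docker image store.
image_exists() {
  docker image inspect "$1" >/dev/null 2>&1
}

# Tag taken from the build step above.
tag="kimi-linear-local:v0.11.2"

if image_exists "$tag"; then
  echo "found $tag"
else
  echo "missing $tag; check the build output in the tmux session" >&2
fi
```

If the tag is missing, reattach with `tmux a -t kimi-build` and read the tail of the build output before retrying.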

Then edit `docker-compose.yml`:

```yaml
image: kimi-linear-local:v0.11.2
```

…and `docker compose up -d` again.

## Step 4 — Smoke test

```sh
./smoke.sh
```

Expects: `/v1/models` returns `kimi-linear`; a four-token generation returns "ok". If both pass, **P0 is done**. Update task #6 and proceed to P1.

## Operations

```sh
docker compose logs -f kimi-linear     # tail
docker compose restart kimi-linear     # reload
docker compose down                    # stop
docker compose exec kimi-linear bash   # shell in
amdgpu_top                             # on host: GPU power, mem, util
```

## Pin manifest

| Component                | Pin                                                                        |
| ------------------------ | -------------------------------------------------------------------------- |
| kyuz0 toolbox            | commit `e2288d6` (2026-04-22)                                              |
| vLLM                     | tag `v0.11.2` (Moonshot recipe)                                            |
| Image (default)          | `kyuz0/vllm-therock-gfx1151:stable`                                        |
| Image (pinned, if built) | `kimi-linear-local:v0.11.2`                                                |
| Weights                  | `cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit` (compressed-tensors int4) |
| ROCm                     | TheRock nightlies via kyuz0 base                                           |
| Python                   | 3.12 (hardcoded in kyuz0 Dockerfile)                                       |

Bump policy: don't move vLLM to 0.12.x; don't move kyuz0 commit without re-running smoke; bump weights only when an 8-bit A/B is in scope (P3).

## Port collision warning

`compose/vllm.yml` is a placeholder stub that also binds `:8000`. Only one of `kimi-linear` and `vllm` can run at a time. Don't `docker compose up` both. Long term either delete the stub or move it to a different port; not in scope here.

## Known issues / mitigations

- **HIP graph capture broken on gfx1151** (vllm-project/vllm#32180) — `--enforce-eager` mitigates at a throughput cost. Re-test without it once the upstream fix lands.
- **vLLM 0.12.0 crash on Kimi-Linear** — `MLAModules.__init__() missing 'indexer_rotary_emb'`. Hard pin to 0.11.2.
- **No published gfx1151 numbers** — we are first. Findings stay private (no upstream filings) per project policy.

## Status

P0 in progress. Update `oc-tree`-style `NEXT_STEPS.md` if you set this aside mid-verification.
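
When verification is set aside mid-way, it helps to record which of the three Step 2 outcomes the logs showed. The sketch below triages a saved log file. The `classify_log` name and the outcome labels are hypothetical. Only the `MLAModules` error text comes from this README. The kernel-failure keywords and the `Application startup complete` line (which uvicorn normally prints once a server is ready) are assumptions about what this image logs, so treat the result as a hint rather than a verdict:

```sh
#!/bin/sh
# Hypothetical helper (not part of this repo): map a saved container log
# to one of the Step 2 outcomes. Match strings other than the MLAModules
# error are assumptions and may need adjusting for real logs.
classify_log() {
  log="$1"
  if grep -q "missing 'indexer_rotary_emb'" "$log"; then
    # Known vLLM 0.12.x incompatibility: go to the source build.
    echo "source-build"
  elif grep -iE 'error|traceback' "$log" | grep -qiE 'kda|triton|fla-core'; then
    # An error line that also mentions the kernel stack.
    echo "kernel-failure"
  elif grep -q "Application startup complete" "$log"; then
    # Server reported ready; run ./smoke.sh next.
    echo "serving"
  else
    echo "unknown"
  fi
}
```

One way to use it is to save the log with `docker compose logs --no-color kimi-linear > /tmp/kimi.log`, run `classify_log /tmp/kimi.log`, and paste the label into `NEXT_STEPS.md` alongside the raw error text.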