pyinfra/framework/compose/kimi-linear/README.md

# kimi-linear

Kimi-Linear-48B-A3B-Instruct on vLLM, ROCm/TheRock 7.x, gfx1151. Sits
beside Ollama (port 11434, Qwen3-Coder) on port 8000. OpenAI-compatible.

This is the **P0 verification stage** — no public Strix Halo numbers
exist for this model as of 2026-05. Three things are unverified until a
first generation succeeds: KDA Triton kernel on gfx1151,
compressed-tensors loader on ROCm, and AITER + Kimi MoE topology.
Smoke-test below confirms all three at once.

## Prereqs

- Pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — gives
  you `/srv/docker/kimi-linear/`, GPU group membership, `/models/`
  layout, and `huggingface-cli` on the box.
- Hugging Face CLI authenticated (`huggingface-cli login`) if the
  weights repo gates downloads. cyankiwi's repo is currently public.

## Step 1 — Download weights

```sh
huggingface-cli download \
    cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
    --local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit
```

~35 GB. The repo is named `AWQ-4bit` but the actual format is
`compressed-tensors` int4 group-quantized — see `config.json`.

## Step 2 — Try the upstream image first

```sh
cd /srv/docker/kimi-linear
docker compose pull       # ~8.5 GB
docker compose up -d
docker compose logs -f
```

Watch for one of three things:

- **Loads cleanly, model serves on :8000** → P0 passes. Run `./smoke.sh`.
- **`MLAModules.__init__() missing 'indexer_rotary_emb'`** → upstream
  image is on vLLM 0.12.x; need the v0.11.2 source build. Skip to
  Step 3.
- **KDA / Triton / fla-core compile error** → kernel doesn't work on
  gfx1151 yet. Fall back path: llama.cpp ROCm + bartowski Q4_K_M GGUF
  in `compose/llama.yml`. Document the error in
  `localgenai/kimi-linear/NOTES.md` and stop.

## Step 3 — Source build (if needed)

```sh
cd /srv/docker/kimi-linear
tmux new -s kimi-build
./build.sh        # multi-hour. Detach with C-b d; reattach with `tmux a -t kimi-build`
```

Builds `kimi-linear-local:v0.11.2` from kyuz0 SHA `e2288d6` with
`VLLM_COMMIT=v0.11.2`. Then edit `docker-compose.yml`:

```yaml
    image: kimi-linear-local:v0.11.2
```

…and `docker compose up -d` again.

## Step 4 — Smoke test

```sh
./smoke.sh
```

Expects: `/v1/models` returns `kimi-linear`; a four-token generation
returns "ok". If both pass, **P0 is done**. Update task #6 and proceed
to P1.

## Operations

```sh
docker compose logs -f kimi-linear         # tail
docker compose restart kimi-linear         # reload
docker compose down                        # stop
docker compose exec kimi-linear bash       # shell in
amdgpu_top                                 # on host: GPU power, mem, util
```

## Pin manifest

| Component                   | Pin                                |
| --------------------------- | ---------------------------------- |
| kyuz0 toolbox               | commit `e2288d6` (2026-04-22)      |
| vLLM                        | tag `v0.11.2` (Moonshot recipe)    |
| Image (default)             | `kyuz0/vllm-therock-gfx1151:stable`|
| Image (pinned, if built)    | `kimi-linear-local:v0.11.2`        |
| Weights                     | `cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit` (compressed-tensors int4) |
| ROCm                        | TheRock nightlies via kyuz0 base   |
| Python                      | 3.12 (hardcoded in kyuz0 Dockerfile) |

Bump policy: don't move vLLM to 0.12.x; don't move kyuz0 commit without
re-running smoke; bump weights only when an 8-bit A/B is in scope (P3).

## Port collision warning

`compose/vllm.yml` is a placeholder stub that also binds `:8000`. Only
one of `kimi-linear` and `vllm` can run at a time. Don't `docker compose
up` both. Long term either delete the stub or move it to a different
port; not in scope here.

## Known issues / mitigations

- **HIP graph capture broken on gfx1151** (vllm-project/vllm#32180) —
  `--enforce-eager` mitigates at a throughput cost. Re-test without it
  once the upstream fix lands.
- **vLLM 0.12.0 crash on Kimi-Linear** —
  `MLAModules.__init__() missing 'indexer_rotary_emb'`. Hard pin to
  0.11.2.
- **No published gfx1151 numbers** — we are first. Findings stay
  private (no upstream filings) per project policy.

## Status

P0 in progress. Update `oc-tree`-style `NEXT_STEPS.md` if you set this
aside mid-verification.
Document current coding-workflow stack state Snapshot of where opencode + Qwen3-Coder + MCPs + Kimi-Linear + voice + Phoenix tracing land today, plus in-flight (oc-tree, kimi-linear context ramp) and next (ComfyUI) items with pointers to per-project NEXT_STEPS.md guides. 2026-05-10 21:14:43 -04:00			`# kimi-linear`

			`Kimi-Linear-48B-A3B-Instruct on vLLM, ROCm/TheRock 7.x, gfx1151. Sits`
			`beside Ollama (port 11434, Qwen3-Coder) on port 8000. OpenAI-compatible.`

			`This is the P0 verification stage — no public Strix Halo numbers`
			`exist for this model as of 2026-05. Three things are unverified until a`
			`first generation succeeds: KDA Triton kernel on gfx1151,`
			`compressed-tensors loader on ROCm, and AITER + Kimi MoE topology.`
			`Smoke-test below confirms all three at once.`

			`## Prereqs`

			- Pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — gives
			you `/srv/docker/kimi-linear/`, GPU group membership, `/models/`
			layout, and `huggingface-cli` on the box.
			- Hugging Face CLI authenticated (`huggingface-cli login`) if the
			`weights repo gates downloads. cyankiwi's repo is currently public.`

			`## Step 1 — Download weights`

			```sh
			`huggingface-cli download \`
			`cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \`
			`--local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit`
			```

			~35 GB. The repo is named `AWQ-4bit` but the actual format is
			`compressed-tensors` int4 group-quantized — see `config.json`.

			`## Step 2 — Try the upstream image first`

			```sh
			`cd /srv/docker/kimi-linear`
			`docker compose pull # ~8.5 GB`
			`docker compose up -d`
			`docker compose logs -f`
			```

			`Watch for one of three things:`

			- Loads cleanly, model serves on :8000 → P0 passes. Run `./smoke.sh`.
			- `MLAModules.__init__() missing 'indexer_rotary_emb'` → upstream
			`image is on vLLM 0.12.x; need the v0.11.2 source build. Skip to`
			`Step 3.`
			`- KDA / Triton / fla-core compile error → kernel doesn't work on`
			`gfx1151 yet. Fall back path: llama.cpp ROCm + bartowski Q4_K_M GGUF`
			in `compose/llama.yml`. Document the error in
			`localgenai/kimi-linear/NOTES.md` and stop.

			`## Step 3 — Source build (if needed)`

			```sh
			`cd /srv/docker/kimi-linear`
			`tmux new -s kimi-build`
			./build.sh # multi-hour. Detach with C-b d; reattach with `tmux a -t kimi-build`
			```

			Builds `kimi-linear-local:v0.11.2` from kyuz0 SHA `e2288d6` with
			`VLLM_COMMIT=v0.11.2`. Then edit `docker-compose.yml`:

			```yaml
			`image: kimi-linear-local:v0.11.2`
			```

			…and `docker compose up -d` again.

			`## Step 4 — Smoke test`

			```sh
			`./smoke.sh`
			```

			Expects: `/v1/models` returns `kimi-linear`; a four-token generation
			`returns "ok". If both pass, P0 is done. Update task #6 and proceed`
			`to P1.`

			`## Operations`

			```sh
			`docker compose logs -f kimi-linear # tail`
			`docker compose restart kimi-linear # reload`
			`docker compose down # stop`
			`docker compose exec kimi-linear bash # shell in`
			`amdgpu_top # on host: GPU power, mem, util`
			```

			`## Pin manifest`

			`\| Component \| Pin \|`
			`\| --------------------------- \| ---------------------------------- \|`
			\| kyuz0 toolbox \| commit `e2288d6` (2026-04-22) \|
			\| vLLM \| tag `v0.11.2` (Moonshot recipe) \|
			\| Image (default) \| `kyuz0/vllm-therock-gfx1151:stable`\|
			\| Image (pinned, if built) \| `kimi-linear-local:v0.11.2` \|
			\| Weights \| `cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit` (compressed-tensors int4) \|
			`\| ROCm \| TheRock nightlies via kyuz0 base \|`
			`\| Python \| 3.12 (hardcoded in kyuz0 Dockerfile) \|`

			`Bump policy: don't move vLLM to 0.12.x; don't move kyuz0 commit without`
			`re-running smoke; bump weights only when an 8-bit A/B is in scope (P3).`

			`## Port collision warning`

			`compose/vllm.yml` is a placeholder stub that also binds `:8000`. Only
			one of `kimi-linear` and `vllm` can run at a time. Don't `docker compose
			up` both. Long term either delete the stub or move it to a different
			`port; not in scope here.`

			`## Known issues / mitigations`

			`- HIP graph capture broken on gfx1151 (vllm-project/vllm#32180) —`
			`--enforce-eager` mitigates at a throughput cost. Re-test without it
			`once the upstream fix lands.`
			`- vLLM 0.12.0 crash on Kimi-Linear —`
			`MLAModules.__init__() missing 'indexer_rotary_emb'`. Hard pin to
			`0.11.2.`
			`- No published gfx1151 numbers — we are first. Findings stay`
			`private (no upstream filings) per project policy.`

			`## Status`

			P0 in progress. Update `oc-tree`-style `NEXT_STEPS.md` if you set this
			`aside mid-verification.`