125 lines
4.2 KiB
Markdown
125 lines
4.2 KiB
Markdown
|
|
# kimi-linear
|
||
|
|
|
||
|
|
Kimi-Linear-48B-A3B-Instruct on vLLM, ROCm/TheRock 7.x, gfx1151. Sits
|
||
|
|
beside Ollama (port 11434, Qwen3-Coder) on port 8000. OpenAI-compatible.
|
||
|
|
|
||
|
|
This is the **P0 verification stage** — no public Strix Halo numbers
|
||
|
|
exist for this model as of 2026-05. Three things are unverified until a
|
||
|
|
first generation succeeds: KDA Triton kernel on gfx1151,
|
||
|
|
compressed-tensors loader on ROCm, and AITER + Kimi MoE topology.
|
||
|
|
Smoke-test below confirms all three at once.
|
||
|
|
|
||
|
|
## Prereqs
|
||
|
|
|
||
|
|
- Pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — gives
|
||
|
|
you `/srv/docker/kimi-linear/`, GPU group membership, `/models/`
|
||
|
|
layout, and `huggingface-cli` on the box.
|
||
|
|
- Hugging Face CLI authenticated (`huggingface-cli login`) if the
|
||
|
|
weights repo gates downloads. cyankiwi's repo is currently public.
|
||
|
|
|
||
|
|
## Step 1 — Download weights
|
||
|
|
|
||
|
|
```sh
|
||
|
|
huggingface-cli download \
|
||
|
|
cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
|
||
|
|
--local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit
|
||
|
|
```
|
||
|
|
|
||
|
|
~35 GB. The repo is named `AWQ-4bit` but the actual format is
|
||
|
|
`compressed-tensors` int4 group-quantized — see `config.json`.
|
||
|
|
|
||
|
|
## Step 2 — Try the upstream image first
|
||
|
|
|
||
|
|
```sh
|
||
|
|
cd /srv/docker/kimi-linear
|
||
|
|
docker compose pull # ~8.5 GB
|
||
|
|
docker compose up -d
|
||
|
|
docker compose logs -f
|
||
|
|
```
|
||
|
|
|
||
|
|
Watch for one of three things:
|
||
|
|
|
||
|
|
- **Loads cleanly, model serves on :8000** → P0 passes. Run `./smoke.sh`.
|
||
|
|
- **`MLAModules.__init__() missing 'indexer_rotary_emb'`** → upstream
|
||
|
|
image is on vLLM 0.12.x; need the v0.11.2 source build. Skip to
|
||
|
|
Step 3.
|
||
|
|
- **KDA / Triton / fla-core compile error** → kernel doesn't work on
|
||
|
|
gfx1151 yet. Fall back path: llama.cpp ROCm + bartowski Q4_K_M GGUF
|
||
|
|
in `compose/llama.yml`. Document the error in
|
||
|
|
`localgenai/kimi-linear/NOTES.md` and stop.
|
||
|
|
|
||
|
|
## Step 3 — Source build (if needed)
|
||
|
|
|
||
|
|
```sh
|
||
|
|
cd /srv/docker/kimi-linear
|
||
|
|
tmux new -s kimi-build
|
||
|
|
./build.sh # multi-hour. Detach with C-b d; reattach with `tmux a -t kimi-build`
|
||
|
|
```
|
||
|
|
|
||
|
|
Builds `kimi-linear-local:v0.11.2` from kyuz0 SHA `e2288d6` with
|
||
|
|
`VLLM_COMMIT=v0.11.2`. Then edit `docker-compose.yml`:
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
image: kimi-linear-local:v0.11.2
|
||
|
|
```
|
||
|
|
|
||
|
|
…and `docker compose up -d` again.
|
||
|
|
|
||
|
|
## Step 4 — Smoke test
|
||
|
|
|
||
|
|
```sh
|
||
|
|
./smoke.sh
|
||
|
|
```
|
||
|
|
|
||
|
|
Expects: `/v1/models` returns `kimi-linear`; a four-token generation
|
||
|
|
returns "ok". If both pass, **P0 is done**. Update task #6 and proceed
|
||
|
|
to P1.
|
||
|
|
|
||
|
|
## Operations
|
||
|
|
|
||
|
|
```sh
|
||
|
|
docker compose logs -f kimi-linear # tail
|
||
|
|
docker compose restart kimi-linear # reload
|
||
|
|
docker compose down # stop
|
||
|
|
docker compose exec kimi-linear bash # shell in
|
||
|
|
amdgpu_top # on host: GPU power, mem, util
|
||
|
|
```
|
||
|
|
|
||
|
|
## Pin manifest
|
||
|
|
|
||
|
|
| Component | Pin |
|
||
|
|
| --------------------------- | ---------------------------------- |
|
||
|
|
| kyuz0 toolbox | commit `e2288d6` (2026-04-22) |
|
||
|
|
| vLLM | tag `v0.11.2` (Moonshot recipe) |
|
||
|
|
| Image (default) | `kyuz0/vllm-therock-gfx1151:stable`|
|
||
|
|
| Image (pinned, if built) | `kimi-linear-local:v0.11.2` |
|
||
|
|
| Weights | `cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit` (compressed-tensors int4) |
|
||
|
|
| ROCm | TheRock nightlies via kyuz0 base |
|
||
|
|
| Python | 3.12 (hardcoded in kyuz0 Dockerfile) |
|
||
|
|
|
||
|
|
Bump policy: don't move vLLM to 0.12.x; don't move kyuz0 commit without
|
||
|
|
re-running smoke; bump weights only when an 8-bit A/B is in scope (P3).
|
||
|
|
|
||
|
|
## Port collision warning
|
||
|
|
|
||
|
|
`compose/vllm.yml` is a placeholder stub that also binds `:8000`. Only
|
||
|
|
one of `kimi-linear` and `vllm` can run at a time. Don't `docker compose
|
||
|
|
up` both. Long term either delete the stub or move it to a different
|
||
|
|
port; not in scope here.
|
||
|
|
|
||
|
|
## Known issues / mitigations
|
||
|
|
|
||
|
|
- **HIP graph capture broken on gfx1151** (vllm-project/vllm#32180) —
|
||
|
|
`--enforce-eager` mitigates at a throughput cost. Re-test without it
|
||
|
|
once the upstream fix lands.
|
||
|
|
- **vLLM 0.12.0 crash on Kimi-Linear** —
|
||
|
|
`MLAModules.__init__() missing 'indexer_rotary_emb'`. Hard pin to
|
||
|
|
0.11.2.
|
||
|
|
- **No published gfx1151 numbers** — we are first. Findings stay
|
||
|
|
private (no upstream filings) per project policy.
|
||
|
|
|
||
|
|
## Status
|
||
|
|
|
||
|
|
P0 in progress. Update `oc-tree`-style `NEXT_STEPS.md` if you set this
|
||
|
|
aside mid-verification.
|