# qwen3-235b

Qwen3-235B-A22B-Instruct-2507 on Strix Halo via `kyuz0:rocm-7.2.2`.
The "overnight long-task" model: bandwidth math says ~5-10 tok/s
decode, so this is for fire-and-forget runs (deep refactors, long-form
analysis), **not** interactive coding. Daily driver stays on Ollama /
llama 30B.

OpenAI-compatible endpoint at `http://framework:8081` once running.

## Coexistence notes (read first)

At ~88.8 GB weights this can't share the GPU with anything else:

| Concurrent service | Action |
|---|---|
| `llama` (Qwen3-Coder-30B, port 8080) | `docker compose down` in `/srv/docker/llama` first |
| `kimi-linear` (vLLM, port 8000) | `docker compose down` in `/srv/docker/kimi-linear` first |
| `ollama` (port 11434) | `docker exec ollama ollama stop qwen3-coder:30b` (Ollama itself can stay up) |
| `comfyui` (port 8188) | `docker compose down` in `/srv/docker/comfyui` first |

The stack reflects this with `restart: "no"`: it won't come back after a
box reboot. You start it deliberately.

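A quick pre-flight check for the table above can be sketched as follows. This is a minimal sketch, not part of the stack: `busy_ports` and `listening_ports` are hypothetical helper names, and it assumes the conflicting services publish their ports on the host so that `ss` (iproute2) can see them.

```sh
#!/bin/sh
# Ports of the GPU-heavy services from the table above. Ollama (11434) may
# stay up, so it is deliberately not listed.
CONFLICT_PORTS="8080 8000 8188"

# busy_ports: read one listening port per line on stdin and print those that
# belong to a conflicting service. Prints nothing when the GPU is clear.
busy_ports() {
    while read -r port; do
        for p in $CONFLICT_PORTS; do
            [ "$port" = "$p" ] && echo "$port"
        done
    done
    return 0
}

# listening_ports: list local TCP listening ports, one per line.
# Assumption: ports are published on the host and visible to `ss`.
listening_ports() {
    ss -Hltn | awk '{ n = split($4, a, ":"); print a[n] }' | sort -un
}
```

Run `listening_ports | busy_ports` before `docker compose up -d`; any port it prints still has a service to stop.
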
## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/qwen3-235b/` with the right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
  Verify: `grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline`.
- Other GPU services stopped per the table above.

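To see what `ttm.pages_limit=33554432` buys, the page count can be converted to GiB. A minimal sketch, assuming 4 KiB pages (check with `getconf PAGESIZE`); `ttm_limit_gib` is a hypothetical helper name:

```sh
#!/bin/sh
# ttm_limit_gib CMDLINE: print the ceiling implied by ttm.pages_limit, in GiB.
# Assumption: 4 KiB pages. Returns 1 if the parameter is absent.
ttm_limit_gib() {
    pages=$(printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^ttm\.pages_limit=//p' | tail -n 1)
    [ -n "$pages" ] || { echo "ttm.pages_limit not set" >&2; return 1; }
    # pages * 4096 bytes, then bytes -> GiB
    echo $(( pages * 4096 / 1024 / 1024 / 1024 ))
}
```

`ttm_limit_gib "$(cat /proc/cmdline)"` prints `128` when the parameter is set as above: 33554432 pages x 4 KiB = 128 GiB.
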
## Download weights (M0.1: ~88.8 GB, 2 shards)

```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwen3-235B-A22B-Instruct-2507

hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF \
  --include 'UD-Q2_K_XL/*' \
  --local-dir /models/qwen/Qwen3-235B-A22B-Instruct-2507

# Files land at:
#   /models/qwen/Qwen3-235B-A22B-Instruct-2507/UD-Q2_K_XL/
#     Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf  (~50 GB)
#     Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00002-of-00002.gguf  (~38.8 GB)
#
# llama.cpp auto-discovers shard 2 from shard 1; only point --model at
# the 00001-of-00002 file.
```

Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect
20-60 minutes on a fast home link.

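After the pull, it is worth confirming both shards landed before starting the server. A minimal sketch; `missing_shards` is a hypothetical helper, and it only checks that each file exists and is non-empty, not that it is complete or uncorrupted:

```sh
#!/bin/sh
# missing_shards DIR: print the name of every expected shard that is absent
# or empty in DIR. Prints nothing when both shards are present.
missing_shards() {
    base=Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL
    for n in 1 2; do
        f="$1/$base-0000$n-of-00002.gguf"
        [ -s "$f" ] || echo "${f##*/}"
    done
    return 0
}
```

Usage: `missing_shards /models/qwen/Qwen3-235B-A22B-Instruct-2507/UD-Q2_K_XL`. Empty output means both files are there; compare sizes against the ~50 GB / ~38.8 GB figures by hand.
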
## Bring up (M0.2: first generation)

```sh
cd /srv/docker/qwen3-235b
docker compose pull        # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f     # wait for "main: server is listening on http://0.0.0.0:8081"

./smoke.sh                 # /health + tiny generation + perf
```

Expect **2-5 minutes** for first start: llama.cpp has to load ~88 GB
of weights off disk into the merged arena. Subsequent starts are faster
if the page cache is warm.

If `./smoke.sh` reports `predicted_per_second` in the 5-10 tok/s range,
M0 is verified. Below 3 tok/s means something's wrong (likely the GPU
arena is < 100 GB; see "Troubleshooting").

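Instead of watching the logs for the "server is listening" line, the wait can be scripted against `/health`, which llama.cpp's server answers with a non-2xx status until the model is loaded. A minimal sketch; `wait_until` and `server_ready` are hypothetical helpers, and the retry budget is an assumption sized to the 2-5 minute load time above:

```sh
#!/bin/sh
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Returns 0 on success,
# 1 if every attempt failed.
wait_until() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# server_ready: succeeds once /health answers with a 2xx status
# (curl -f turns HTTP errors such as 503 into a non-zero exit).
server_ready() {
    curl -fsS -o /dev/null http://framework:8081/health
}
```

For example, `wait_until 60 5 server_ready && ./smoke.sh` polls every 5 seconds for up to about 5 minutes, then runs the smoke test.
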
## Ramping context

Defaults to 64K, chosen because opencode's auto-compaction triggers
at ~75-80 % of the stated limit, so a smaller ctx fires the
rewrite-the-conversation loop after only a handful of turns. 64K roughly
doubles how many turns fit compared with the previous 32K. Stages:

| Stage | `--ctx-size` | KV (q8_0) | Margin in arena |
|---|---|---|---|
| Previous (M0) | 32768 | ~4 GB | ~15 GB |
| **Current default** | **65536** | **~8 GB** | **~11 GB** |
| M0.4 stretch | 131072 | ~16 GB | ~3 GB (tight) |

Edit `--ctx-size` in `docker-compose.yml`, run
`docker compose down && docker compose up -d`, then re-run `./smoke.sh`.
If you see an alloc error in the logs, dial it back.

opencode's `limit.context` in `opencode.json` should match. Otherwise
opencode either compacts too early (limit lower than server) or sends
prompts longer than the server can handle (limit higher).

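The stage table is linear in context size, so intermediate values can be estimated before editing the compose file. A rough sketch that only interpolates the table's own numbers (about 1 GB of q8_0 KV per 8192 tokens, and about 19 GB of arena left before KV, since 4 + 15 = 8 + 11 = 16 + 3 = 19); the function names are hypothetical and real allocations may differ:

```sh
#!/bin/sh
# kv_gb CTX: approximate q8_0 KV cache size in GB for a given --ctx-size.
kv_gb() { echo $(( $1 / 8192 )); }

# margin_gb CTX: approximate arena headroom in GB left after the KV cache.
margin_gb() { echo $(( 19 - $1 / 8192 )); }

# compact_at CTX PCT: token count at which a client that compacts at PCT
# percent of its stated limit would start compacting.
compact_at() { echo $(( $1 * $2 / 100 )); }
```

For instance, `margin_gb 98304` gives 7, a possible halfway step between the current default and the M0.4 stretch, and `compact_at 65536 75` gives 49152 tokens for the lower end of the 75-80 % band.
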
## Troubleshooting

**OOM on startup.** Check the arena size first:

```sh
rocminfo | grep -A2 "Pool Info" | head -20
```

If it reports two ~31 GB pools instead of one ~110 GB arena, the
unified-memory recipe didn't apply. Verify (in order):

1. `cat /proc/cmdline` includes `amdgpu.gttsize=131072 ttm.pages_limit=33554432`
2. BIOS UMA Frame Buffer Size is **0.5 GB** (not 64 GB), Framework BIOS
   `lfsp0.03.05+`. Counter-intuitive: a tiny UMA frees more pages for GTT.
3. Container env shows `HSA_XNACK=1 HSA_FORCE_FINE_GRAIN_PCIE=1`:
   `docker compose exec qwen3-235b env | grep HSA`.

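Step 1 of the list above is easy to misread by eye on a long cmdline, so it can be checked mechanically. A minimal sketch; `missing_params` is a hypothetical helper that only looks for the two exact `name=value` pairs named above:

```sh
#!/bin/sh
# missing_params CMDLINE: print each required kernel parameter that does not
# appear as a whole word in CMDLINE. Prints nothing when both are present.
missing_params() {
    for want in amdgpu.gttsize=131072 ttm.pages_limit=33554432; do
        case " $1 " in
            *" $want "*) ;;          # present as an exact word
            *) echo "$want" ;;       # absent, or set to a different value
        esac
    done
}
```

Usage: `missing_params "$(cat /proc/cmdline)"`. Any output names the parameter to fix in the bootloader config before moving on to steps 2 and 3.
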
If all three are right and OOM persists, drop to Q2_K_L (~85.8 GB): edit
the model path in `docker-compose.yml` after a separate `hf download` of
that quant.

**`predicted_per_second` very low (<3 tok/s).** Likely cold page cache.
Re-run `./smoke.sh` once; the second run should be in band. If still slow,
verify the model file isn't being re-read from disk: `iostat -x 1` should
show ~0 read bandwidth during inference.

**Server starts but answers gibberish.** `--jinja` not picked up; check
`docker compose logs qwen3-235b | grep -i 'chat template'`. The log should
say "using chat template from gguf metadata".

## Operations

```sh
docker compose logs -f                  # tail
docker compose down                     # stop (always; coexists with nothing)
docker compose exec qwen3-235b bash     # shell in
./smoke.sh                              # health + perf
amdgpu_top                              # GPU view on host
```

Suggested cycle:

```
[evening]   stop llama 30B / kimi-linear; up qwen3-235b; submit batch tasks
[overnight] qwen3-235b grinds; results land in your harness state
[morning]   down qwen3-235b; up llama 30B / kimi-linear; back to interactive
```

M3 will automate this swap; M0 does it by hand.

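Until M3 exists, the manual swap can be wrapped in two small functions. This is only a sketch of the hand-run cycle above, not the planned M3 automation: `compose`, `evening`, and `morning` are hypothetical names, `DRY_RUN=1` just prints the commands, and `morning` brings back `llama` only (swap in `kimi-linear` if that is the interactive stack you want).

```sh
#!/bin/sh
# compose DIR ARGS...: run `docker compose ARGS` inside the stack directory
# DIR, or only print the command when DRY_RUN=1.
compose() {
    dir=$1; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "(cd $dir && docker compose $*)"
    else
        (cd "$dir" && docker compose "$@")
    fi
}

# evening: free the GPU, then start the long-task model.
evening() {
    compose /srv/docker/llama down &&
    compose /srv/docker/kimi-linear down &&
    compose /srv/docker/qwen3-235b up -d
}

# morning: stop the long-task model and return to the interactive stack.
morning() {
    compose /srv/docker/qwen3-235b down &&
    compose /srv/docker/llama up -d
}
```

Try `DRY_RUN=1 evening` first to see exactly what would run. Stopping Ollama's loaded model and `comfyui` is left out here and still follows the coexistence table.
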
## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
| Weights | `unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF` UD-Q2_K_XL |
| Default port | 8081 |
| Default context | 65536 (ramp to 131072 deliberately) |
| KV cache type | q8_0 (k and v) |

## Status

M0: compose artifacts written; awaiting box-side weight pull + bring-up.
M0.3-M0.4 (context ramp) follow once M0 boots cleanly. M1 wires this
endpoint as a 4th opencode/LiteLLM provider used by the long-task
orchestrator.

## Why Instruct-2507 not Thinking-2507

Both are published; Thinking emits a `<think>` block before every answer.
At ~7 tok/s decode, a 2K-token think block = ~5 min of wall time per
response, then the actual answer. For autonomous coding/refactor tasks
that's a tax we don't want. Thinking-2507 is worth adding as a separate
compose later for hard-reasoning one-shots; not the long-task default.

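The ~5 min figure is plain division, and it is worth redoing for whatever decode rate `./smoke.sh` actually reports. A minimal sketch; `think_minutes` is a hypothetical helper, and it assumes a steady decode rate and takes "2K" as 2048 tokens:

```sh
#!/bin/sh
# think_minutes TOKENS TOK_PER_SEC: wall-clock minutes spent decoding TOKENS
# at a steady TOK_PER_SEC, to one decimal place.
think_minutes() {
    awk -v t="$1" -v r="$2" 'BEGIN { printf "%.1f\n", t / r / 60 }'
}
```

`think_minutes 2048 7` prints `4.9`. At the ends of the expected 5-10 tok/s band the same block costs 6.8 and 3.4 minutes respectively.
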