Files
localgenai/pyinfra/framework/compose/qwen3-235b/README.md
2026-06-08 15:31:50 +01:00

6.2 KiB

qwen3-235b

Qwen3-235B-A22B-Instruct-2507 on Strix Halo via kyuz0:rocm-7.2.2. The "overnight long-task" model — bandwidth math says ~5-10 tok/s decode, so this is for fire-and-forget runs (deep refactors, long-form analysis), not interactive coding. Daily driver stays on Ollama / llama 30B.

OpenAI-compatible endpoint at http://framework:8081 once running.

Coexistence notes (read first)

At ~88.8 GB weights this can't share the GPU with anything else:

Concurrent service Action
llama (Qwen3-Coder-30B, port 8080) docker compose down in /srv/docker/llama first
kimi-linear (vLLM, port 8000) docker compose down in /srv/docker/kimi-linear first
ollama (port 11434) docker exec ollama ollama stop qwen3-coder:30b (Ollama itself can stay up)
comfyui (port 8188) docker compose down in /srv/docker/comfyui first

The stack reflects this: restart: "no" — won't come back after a box reboot. You start it deliberately.

Prereqs

  • Pyinfra deploy has run (creates /srv/docker/qwen3-235b/ with right perms).
  • BIOS UMA at 0.5 GB + ttm.pages_limit=33554432 kernel cmdline active. Verify: cat /proc/cmdline | grep ttm.pages_limit.
  • Other GPU services stopped per the table above.

Download weights (M0.1 — ~88.8 GB, 2 shards)

# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwen3-235B-A22B-Instruct-2507

hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF \
    --include 'UD-Q2_K_XL/*' \
    --local-dir /models/qwen/Qwen3-235B-A22B-Instruct-2507

# Files land at:
#   /models/qwen/Qwen3-235B-A22B-Instruct-2507/UD-Q2_K_XL/
#       Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf  (~50 GB)
#       Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00002-of-00002.gguf  (~38.8 GB)
#
# llama.cpp auto-discovers shard 2 from shard 1 — only point --model at
# the 00001-of-00002 file.

Disk: needs ~90 GB free on /models. Pull is bandwidth-bound; expect 20-60 minutes on a fast home link.

Bring up (M0.2 — first generation)

cd /srv/docker/qwen3-235b
docker compose pull       # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f    # wait for "main: server is listening on http://0.0.0.0:8081"

./smoke.sh                # /health + tiny generation + perf

Expect 2-5 minutes for first start — llama.cpp has to load ~88 GB of weights off disk into the merged arena. Subsequent starts are faster if the page cache is warm.

If ./smoke.sh reports predicted_per_second in the 5-10 tok/s range, M0 is verified. Lower than 3 tok/s = something's wrong (likely the GPU arena is < 100 GB — see "Troubleshooting").

Ramping context

Defaults to 64K — chosen because opencode's auto-compaction triggers at ~75-80 % of the stated limit, so a smaller ctx fires the rewrite- the-conversation loop after only a handful of turns. 64K roughly doubles how many turns fit. Stages:

Stage --ctx-size KV (q8_0) Margin in arena
Previous (M0) 32768 ~4 GB ~15 GB
Current default 65536 ~8 GB ~11 GB
M0.4 stretch 131072 ~16 GB ~3 GB (tight)

Edit --ctx-size in docker-compose.yml, docker compose down && up -d, re-run ./smoke.sh. If you see an alloc error in the logs, dial it back.

opencode's limit.context in opencode.json should match — otherwise opencode either compacts too early (limit lower than server) or sends prompts longer than the server can handle (limit higher).

Troubleshooting

OOM on startup. Check the arena size first:

rocminfo | grep -A2 "Pool Info" | head -20

If it reports two ~31 GB pools instead of one ~110 GB arena, the unified-memory recipe didn't apply. Verify (in order):

  1. cat /proc/cmdline includes amdgpu.gttsize=131072 ttm.pages_limit=33554432
  2. BIOS UMA Frame Buffer Size is 0.5 GB (not 64 GB) — Framework BIOS lfsp0.03.05+. Counter-intuitive: a tiny UMA frees more pages for GTT.
  3. Container env shows HSA_XNACK=1 HSA_FORCE_FINE_GRAIN_PCIE=1docker compose exec qwen3-235b env | grep HSA.

If all three are right and OOM persists, drop to Q2_K_L (~85.8 GB) — edit the model path in docker-compose.yml after a separate hf download of that quant.

predicted_per_second very low (<3 tok/s). Likely cold page cache. Re-run ./smoke.sh once — second run should be in band. If still slow, verify the model file isn't being swapped from disk: iostat -x 1 should show ~0 read bandwidth during inference.

Server starts but answers gibberish. --jinja not picked up; check docker compose logs qwen3-235b | grep -i 'chat template'. Should say "using chat template from gguf metadata".

Operations

docker compose logs -f                              # tail
docker compose down                                 # stop (always — coexists with nothing)
docker compose exec qwen3-235b bash                 # shell in
./smoke.sh                                          # health + perf
amdgpu_top                                          # GPU view on host

Suggested cycle:

[evening]  stop llama 30B / kimi-linear; up qwen3-235b; submit batch tasks
[overnight] qwen3-235b grinds; results land in your harness state
[morning]  down qwen3-235b; up llama 30B / kimi-linear; back to interactive

M3 will automate this swap; M0 does it by hand.

Pin manifest

Component Pin
Image kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2 (shared with llama)
Weights unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF UD-Q2_K_XL
Default port 8081
Default context 65536 (ramp to 131072 deliberately)
KV cache type q8_0 (k and v)

Status

M0 — compose artifacts written; awaiting box-side weight pull + bring-up. M0.3-M0.4 (context ramp) follow once M0 boots cleanly. M1 wires this endpoint as a 4th opencode/LiteLLM provider used by the long-task orchestrator.

Why Instruct-2507 not Thinking-2507

Both are published; Thinking emits a <think> block before every answer. At ~7 tok/s decode, a 2K-token think block = ~5 min of wall time per response, then the actual answer. For autonomous coding/refactor tasks that's a tax we don't want. Thinking-2507 is worth adding as a separate compose later for hard-reasoning one-shots; not the long-task default.