Files
localgenai/pyinfra/framework/compose/qwable/README.md

4.6 KiB

qwable

Qwable-3.6-27B on Strix Halo via kyuz0:rocm-7.2.2. A full fine-tune of Qwen3.6-27B trained on Fable-5-style reasoning traces — the dev collected examples formatted like Fable 5's deliberate, step-by-step answers and trained Qwen to reproduce that structured, explanatory output. Think of it as a local "thinks-like-Fable" model.

OpenAI-compatible endpoint at http://framework:8082 once running.

Dense, not MoE (read first)

Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a dense 27B — every weight loads per token. On this bandwidth-bound box (256 GB/s ÷ ~16.5 GB) that's ~10-15 tok/s decode, slower than the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you specifically want the Fable-style reasoning, not for raw throughput. The interactive daily driver stays on Ollama / llama 30B.

Coexistence notes

At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:

Concurrent service Coexists?
llama (Qwen3-Coder-30B, 8080) yes (~35 GB total)
ollama (11434) yes
kimi-linear (vLLM, 8000) yes (~47 GB total)
qwen3-235b (88.8 GB, 8081) no — too tight, swap-model stops it
comfyui (8188) no — swap-model stops it

restart: "no": you bring it up deliberately (via swap-model qwable), it won't auto-start after a reboot and surprise-collide with a big model.

Prereqs

  • Pyinfra deploy has run (creates /srv/docker/qwable/ with right perms).
  • BIOS UMA at 0.5 GB + ttm.pages_limit=33554432 kernel cmdline active. Verify: cat /proc/cmdline | grep ttm.pages_limit.

Download weights (~16.5 GB, single file)

# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwable-3.6-27b

hf download Mia-AiLab/Qwable-3.6-27b \
    'Qwable-27b_Q4_K_M.gguf' \
    --local-dir /models/qwen/Qwable-3.6-27b

# File lands at:
#   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)

Single-file GGUF (not sharded) — point --model straight at it. Disk: needs ~17 GB free on /models.

Abliterated variant (refusals removed) lives at huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF (Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf, ~18.3 GB). Not the default — no safety filtering, careful with it.

Bring up

Easy path — swap-model handles stop-conflicting-services + waits for /health:

ssh framework swap-model qwable     # ~1-2 min cold load (16.5 GB)
ssh framework /srv/docker/qwable/smoke.sh    # perf measure

Manual equivalent (first-ever bring-up, before the image is cached):

cd /srv/docker/qwable
docker compose pull       # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8082"

./smoke.sh                # /health + tiny generation + perf

First start is ~1-2 min (16.5 GB load off disk; much faster than the 235B). If ./smoke.sh reports predicted_per_second in the 10-15 tok/s band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see qwen3-235b/README.md "Troubleshooting" for the arena checks).

Ramping context

Defaults to 64K to match the other llama.cpp stacks (keeps opencode auto-compaction consistent across providers). The model is tiny relative to the arena, so there's plenty of room to push higher:

Stage --ctx-size Margin in arena
Current default 65536 huge (~90 GB free)
Stretch 131072+ still comfortable

Edit --ctx-size in docker-compose.yml, docker compose down && up -d, re-run ./smoke.sh. The real ceiling is Qwable's trained context length (inherits Qwen3.6-27B's), not arena memory — verify the model's max positions before going past 128K.

Operations

docker compose logs -f                  # tail
docker compose down                     # stop
docker compose exec qwable bash         # shell in
./smoke.sh                              # health + perf
amdgpu_top                              # GPU view on host

Pin manifest

Component Pin
Image kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2 (shared with llama)
Weights Mia-AiLab/Qwable-3.6-27bQwable-27b_Q4_K_M.gguf (~16.5 GB)
Default port 8082
Default context 65536
KV cache type q8_0 (k and v)
License MIT (model); Qwen3.6-27B base license also applies

Status

Compose artifacts written; awaiting box-side weight pull + bring-up. Wired as a swap-model qwable target. Wire as an opencode/LiteLLM provider only if the Fable-style reasoning proves useful in practice.