pyinfra/framework/compose/qwable/README.md

# qwable

Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
collected examples formatted like Fable 5's deliberate, step-by-step
answers and trained Qwen to reproduce that structured, explanatory
output. Think of it as a local "thinks-like-Fable" model.

OpenAI-compatible endpoint at `http://framework:8082` once running.

## Dense, not MoE (read first)

Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
**dense** 27B — every weight loads per token. On this bandwidth-bound
box (256 GB/s ÷ ~16.5 GB) that's **~10-15 tok/s** decode, slower than
the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you
specifically want the Fable-style reasoning, not for raw throughput.
The interactive daily driver stays on Ollama / llama 30B.

## Coexistence notes

At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:

| Concurrent service | Coexists? |
|---|---|
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
| `ollama` (11434) | ✅ yes |
| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
| `comfyui` (8188) | ❌ no — swap-model stops it |

`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
it won't auto-start after a reboot and surprise-collide with a big model.

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
  Verify: `cat /proc/cmdline | grep ttm.pages_limit`.

## Download weights (~16.5 GB, single file)

```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwable-3.6-27b

hf download Mia-AiLab/Qwable-3.6-27b \
    'Qwable-27b_Q4_K_M.gguf' \
    --local-dir /models/qwen/Qwable-3.6-27b

# File lands at:
#   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)
```

Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
needs ~17 GB free on `/models`.

> Abliterated variant (refusals removed) lives at
> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
> Not the default — no safety filtering, careful with it.

## Bring up

Easy path — `swap-model` handles stop-conflicting-services + waits for
`/health`:

```sh
ssh framework swap-model qwable     # ~1-2 min cold load (16.5 GB)
ssh framework /srv/docker/qwable/smoke.sh    # perf measure
```

Manual equivalent (first-ever bring-up, before the image is cached):

```sh
cd /srv/docker/qwable
docker compose pull       # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8082"

./smoke.sh                # /health + tiny generation + perf
```

First start is ~1-2 min (16.5 GB load off disk; much faster than the
235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
qwen3-235b/README.md "Troubleshooting" for the arena checks).

## Ramping context

Defaults to 64K to match the other llama.cpp stacks (keeps opencode
auto-compaction consistent across providers). The model is tiny relative
to the arena, so there's plenty of room to push higher:

| Stage | `--ctx-size` | Margin in arena |
|---|---|---|
| **Current default** | **65536** | huge (~90 GB free) |
| Stretch | 131072+ | still comfortable |

Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`,
re-run `./smoke.sh`. The real ceiling is Qwable's trained context length
(inherits Qwen3.6-27B's), not arena memory — verify the model's max
positions before going past 128K.

## Operations

```sh
docker compose logs -f                  # tail
docker compose down                     # stop
docker compose exec qwable bash         # shell in
./smoke.sh                              # health + perf
amdgpu_top                              # GPU view on host
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
| Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
| Default port | 8082 |
| Default context | 65536 |
| KV cache type | q8_0 (k and v) |
| License | MIT (model); Qwen3.6-27B base license also applies |

## Status

Compose artifacts written; awaiting box-side weight pull + bring-up.
Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
provider only if the Fable-style reasoning proves useful in practice.
added models, model-swap, ... 2026-06-26 08:13:33 -04:00			`# qwable`

			Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
			`of Qwen3.6-27B trained on Fable-5-style reasoning traces — the dev`
			`collected examples formatted like Fable 5's deliberate, step-by-step`
			`answers and trained Qwen to reproduce that structured, explanatory`
			`output. Think of it as a local "thinks-like-Fable" model.`

			OpenAI-compatible endpoint at `http://framework:8082` once running.

			`## Dense, not MoE (read first)`

			`Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a`
			`dense 27B — every weight loads per token. On this bandwidth-bound`
			`box (256 GB/s ÷ ~16.5 GB) that's ~10-15 tok/s decode, slower than`
			`the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you`
			`specifically want the Fable-style reasoning, not for raw throughput.`
			`The interactive daily driver stays on Ollama / llama 30B.`

			`## Coexistence notes`

			`At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:`

			`\| Concurrent service \| Coexists? \|`
			`\|---\|---\|`
			\| `llama` (Qwen3-Coder-30B, 8080) \| ✅ yes (~35 GB total) \|
			\| `ollama` (11434) \| ✅ yes \|
			\| `kimi-linear` (vLLM, 8000) \| ✅ yes (~47 GB total) \|
			\| `qwen3-235b` (88.8 GB, 8081) \| ❌ no — too tight, swap-model stops it \|
			\| `comfyui` (8188) \| ❌ no — swap-model stops it \|

			`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
			`it won't auto-start after a reboot and surprise-collide with a big model.`

			`## Prereqs`

			- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
			- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
			Verify: `cat /proc/cmdline \| grep ttm.pages_limit`.

			`## Download weights (~16.5 GB, single file)`

			```sh
			`# /models/qwen exists via pyinfra; just create the model subdir.`
			`mkdir -p /models/qwen/Qwable-3.6-27b`

			`hf download Mia-AiLab/Qwable-3.6-27b \`
			`'Qwable-27b_Q4_K_M.gguf' \`
			`--local-dir /models/qwen/Qwable-3.6-27b`

			`# File lands at:`
			`# /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf (~16.5 GB)`
			```

			Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
			needs ~17 GB free on `/models`.

			`> Abliterated variant (refusals removed) lives at`
			> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
			> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
			`> Not the default — no safety filtering, careful with it.`

			`## Bring up`

			Easy path — `swap-model` handles stop-conflicting-services + waits for
			`/health`:

			```sh
			`ssh framework swap-model qwable # ~1-2 min cold load (16.5 GB)`
			`ssh framework /srv/docker/qwable/smoke.sh # perf measure`
			```

			`Manual equivalent (first-ever bring-up, before the image is cached):`

			```sh
			`cd /srv/docker/qwable`
			`docker compose pull # already-cached image if you ran llama first`
			`docker compose up -d`
			`docker compose logs -f # wait for "server is listening on http://0.0.0.0:8082"`

			`./smoke.sh # /health + tiny generation + perf`
			```

			`First start is ~1-2 min (16.5 GB load off disk; much faster than the`
			235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
			`band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see`
			`qwen3-235b/README.md "Troubleshooting" for the arena checks).`

			`## Ramping context`

			`Defaults to 64K to match the other llama.cpp stacks (keeps opencode`
			`auto-compaction consistent across providers). The model is tiny relative`
			`to the arena, so there's plenty of room to push higher:`

			\| Stage \| `--ctx-size` \| Margin in arena \|
			`\|---\|---\|---\|`
			`\| Current default \| 65536 \| huge (~90 GB free) \|`
			`\| Stretch \| 131072+ \| still comfortable \|`

			Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`,
			re-run `./smoke.sh`. The real ceiling is Qwable's trained context length
			`(inherits Qwen3.6-27B's), not arena memory — verify the model's max`
			`positions before going past 128K.`

			`## Operations`

			```sh
			`docker compose logs -f # tail`
			`docker compose down # stop`
			`docker compose exec qwable bash # shell in`
			`./smoke.sh # health + perf`
			`amdgpu_top # GPU view on host`
			```

			`## Pin manifest`

			`\| Component \| Pin \|`
			`\|---\|---\|`
			\| Image \| `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) \|
			\| Weights \| `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) \|
			`\| Default port \| 8082 \|`
			`\| Default context \| 65536 \|`
			`\| KV cache type \| q8_0 (k and v) \|`
			`\| License \| MIT (model); Qwen3.6-27B base license also applies \|`

			`## Status`

			`Compose artifacts written; awaiting box-side weight pull + bring-up.`
			Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
			`provider only if the Fable-style reasoning proves useful in practice.`