added models, model-swap, ...
This commit is contained in:
130
pyinfra/framework/compose/qwable/README.md
Normal file
130
pyinfra/framework/compose/qwable/README.md
Normal file
@@ -0,0 +1,130 @@
|
||||
# qwable
|
||||
|
||||
Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
|
||||
of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
|
||||
collected examples formatted like Fable 5's deliberate, step-by-step
|
||||
answers and trained Qwen to reproduce that structured, explanatory
|
||||
output. Think of it as a local "thinks-like-Fable" model.
|
||||
|
||||
OpenAI-compatible endpoint at `http://framework:8082` once running.
|
||||
|
||||
## Dense, not MoE (read first)
|
||||
|
||||
Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
|
||||
**dense** 27B — every weight loads per token. On this bandwidth-bound
|
||||
box (256 GB/s ÷ ~16.5 GB) that's **~10-15 tok/s** decode, slower than
|
||||
the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you
|
||||
specifically want the Fable-style reasoning, not for raw throughput.
|
||||
The interactive daily driver stays on Ollama / llama 30B.
|
||||
|
||||
## Coexistence notes
|
||||
|
||||
At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
|
||||
|
||||
| Concurrent service | Coexists? |
|
||||
|---|---|
|
||||
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
|
||||
| `ollama` (11434) | ✅ yes |
|
||||
| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
|
||||
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
|
||||
| `comfyui` (8188) | ❌ no — swap-model stops it |
|
||||
|
||||
`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
|
||||
it won't auto-start after a reboot and surprise-collide with a big model.
|
||||
|
||||
## Prereqs
|
||||
|
||||
- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
|
||||
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
|
||||
Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
|
||||
|
||||
## Download weights (~16.5 GB, single file)
|
||||
|
||||
```sh
|
||||
# /models/qwen exists via pyinfra; just create the model subdir.
|
||||
mkdir -p /models/qwen/Qwable-3.6-27b
|
||||
|
||||
hf download Mia-AiLab/Qwable-3.6-27b \
|
||||
'Qwable-27b_Q4_K_M.gguf' \
|
||||
--local-dir /models/qwen/Qwable-3.6-27b
|
||||
|
||||
# File lands at:
|
||||
# /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf (~16.5 GB)
|
||||
```
|
||||
|
||||
Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
|
||||
needs ~17 GB free on `/models`.
|
||||
|
||||
> Abliterated variant (refusals removed) lives at
|
||||
> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
|
||||
> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
|
||||
> Not the default — no safety filtering, careful with it.
|
||||
|
||||
## Bring up
|
||||
|
||||
Easy path — `swap-model` handles stop-conflicting-services + waits for
|
||||
`/health`:
|
||||
|
||||
```sh
|
||||
ssh framework swap-model qwable # ~1-2 min cold load (16.5 GB)
|
||||
ssh framework /srv/docker/qwable/smoke.sh # perf measure
|
||||
```
|
||||
|
||||
Manual equivalent (first-ever bring-up, before the image is cached):
|
||||
|
||||
```sh
|
||||
cd /srv/docker/qwable
|
||||
docker compose pull # already-cached image if you ran llama first
|
||||
docker compose up -d
|
||||
docker compose logs -f # wait for "server is listening on http://0.0.0.0:8082"
|
||||
|
||||
./smoke.sh # /health + tiny generation + perf
|
||||
```
|
||||
|
||||
First start is ~1-2 min (16.5 GB load off disk; much faster than the
|
||||
235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
|
||||
band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
|
||||
qwen3-235b/README.md "Troubleshooting" for the arena checks).
|
||||
|
||||
## Ramping context
|
||||
|
||||
Defaults to 64K to match the other llama.cpp stacks (keeps opencode
|
||||
auto-compaction consistent across providers). The model is tiny relative
|
||||
to the arena, so there's plenty of room to push higher:
|
||||
|
||||
| Stage | `--ctx-size` | Margin in arena |
|
||||
|---|---|---|
|
||||
| **Current default** | **65536** | huge (~90 GB free) |
|
||||
| Stretch | 131072+ | still comfortable |
|
||||
|
||||
Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`,
|
||||
re-run `./smoke.sh`. The real ceiling is Qwable's trained context length
|
||||
(inherits Qwen3.6-27B's), not arena memory — verify the model's max
|
||||
positions before going past 128K.
|
||||
|
||||
## Operations
|
||||
|
||||
```sh
|
||||
docker compose logs -f # tail
|
||||
docker compose down # stop
|
||||
docker compose exec qwable bash # shell in
|
||||
./smoke.sh # health + perf
|
||||
amdgpu_top # GPU view on host
|
||||
```
|
||||
|
||||
## Pin manifest
|
||||
|
||||
| Component | Pin |
|
||||
|---|---|
|
||||
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
|
||||
| Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
|
||||
| Default port | 8082 |
|
||||
| Default context | 65536 |
|
||||
| KV cache type | q8_0 (k and v) |
|
||||
| License | MIT (model); Qwen3.6-27B base license also applies |
|
||||
|
||||
## Status
|
||||
|
||||
Compose artifacts written; awaiting box-side weight pull + bring-up.
|
||||
Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
|
||||
provider only if the Fable-style reasoning proves useful in practice.
|
||||
Reference in New Issue
Block a user