added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions
--- a/pyinfra/framework/compose/qwable/README.md
+++ b/pyinfra/framework/compose/qwable/README.md
@@ -0,0 +1,130 @@
+# qwable
+
+Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
+of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
+collected examples formatted like Fable 5's deliberate, step-by-step
+answers and trained Qwen to reproduce that structured, explanatory
+output. Think of it as a local "thinks-like-Fable" model.
+
+OpenAI-compatible endpoint at `http://framework:8082` once running.
+
+## Dense, not MoE (read first)
+
+Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
+**dense** 27B: every weight is read for each token. On this bandwidth-bound
+box (256 GB/s ÷ ~16.5 GB ≈ 15.5 tok/s ceiling) that's **~10-15 tok/s**
+decode, slower than the MoE 30B (~100 tok/s, only ~3B active). So Qwable
+is for when you specifically want the Fable-style reasoning, not for raw
+throughput. The interactive daily driver stays on Ollama / llama 30B.
+
+## Coexistence notes
+
+At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
+
+| Concurrent service | Coexists? |
+|---|---|
+| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
+| `ollama` (11434) | ✅ yes |
+| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
+| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
+| `comfyui` (8188) | ❌ no — swap-model stops it |
+
+`restart: "no"` means you bring it up deliberately (via `swap-model qwable`);
+it won't auto-start after a reboot and surprise-collide with a big model.
+
+## Prereqs
+
+- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
+- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
+  Verify: `grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline`.
+
+## Download weights (~16.5 GB, single file)
+
+```sh
+# /models/qwen exists via pyinfra; just create the model subdir.
+mkdir -p /models/qwen/Qwable-3.6-27b
+
+hf download Mia-AiLab/Qwable-3.6-27b \
+    'Qwable-27b_Q4_K_M.gguf' \
+    --local-dir /models/qwen/Qwable-3.6-27b
+
+# File lands at:
+#   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)
+```
+
+Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
+needs ~17 GB free on `/models`.
+
+> Abliterated variant (refusals removed) lives at
+> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
+> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
+> Not the default: it has no safety filtering, so use it with care.
+
+## Bring up
+
+Easy path: `swap-model` stops conflicting services and waits for
+`/health`:
+
+```sh
+ssh framework swap-model qwable     # ~1-2 min cold load (16.5 GB)
+ssh framework /srv/docker/qwable/smoke.sh    # perf measure
+```
+
+Manual equivalent (first-ever bring-up, before the image is cached):
+
+```sh
+cd /srv/docker/qwable
+docker compose pull       # already-cached image if you ran llama first
+docker compose up -d
+docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8082"
+
+./smoke.sh                # /health + tiny generation + perf
+```
+
+First start is ~1-2 min (16.5 GB load off disk; much faster than the
+235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
+band, it's healthy. Below 6 tok/s, investigate: likely arena < 100 GB (see
+qwen3-235b/README.md "Troubleshooting" for the arena checks).
+
+## Ramping context
+
+Defaults to 64K to match the other llama.cpp stacks (keeps opencode
+auto-compaction consistent across providers). The model is tiny relative
+to the arena, so there's plenty of room to push higher:
+
+| Stage | `--ctx-size` | Margin in arena |
+|---|---|---|
+| **Current default** | **65536** | huge (~90 GB free) |
+| Stretch | 131072+ | still comfortable |
+
+Edit `--ctx-size` in `docker-compose.yml`, then run
+`docker compose down && docker compose up -d` and re-run `./smoke.sh`.
+The real ceiling is Qwable's trained context length (inherited from
+Qwen3.6-27B), not arena memory; verify max positions before going past 128K.
+
+## Operations
+
+```sh
+docker compose logs -f                  # tail
+docker compose down                     # stop
+docker compose exec qwable bash         # shell in
+./smoke.sh                              # health + perf
+amdgpu_top                              # GPU view on host
+```
+
+## Pin manifest
+
+| Component | Pin |
+|---|---|
+| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
+| Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
+| Default port | 8082 |
+| Default context | 65536 |
+| KV cache type | q8_0 (k and v) |
+| License | MIT (model); Qwen3.6-27B base license also applies |
+
+## Status
+
+Compose artifacts written; awaiting box-side weight pull + bring-up.
+Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
+provider only if the Fable-style reasoning proves useful in practice.
--- a/pyinfra/framework/compose/qwable/smoke.sh
+++ b/pyinfra/framework/compose/qwable/smoke.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Smoke-test the running qwable llama-server (port 8082). Hits /health
+# for liveness, then a tiny OpenAI-compatible chat completion, then
+# measures decode and prompt tok/s via /completion. Dense 27B → expect ~10-15.
+set -euo pipefail
+
+HOST="${QWABLE_HOST:-127.0.0.1:8082}"
+MODEL="${QWABLE_MODEL:-qwable}"
+
+echo "[smoke] GET /health on $HOST"
+curl -fsS "http://$HOST/health" | python3 -m json.tool
+
+echo
+echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
+curl -fsS "http://$HOST/v1/chat/completions" \
+    -H 'Content-Type: application/json' \
+    -d "{
+        \"model\": \"$MODEL\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
+        \"max_tokens\": 16,
+        \"temperature\": 0.0
+    }" | python3 -m json.tool
+
+echo
+echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=128)"
+# 128 tokens — at ~10-15 tok/s the per-token warmup noise still matters,
+# but a dense 27B settles faster than the 235B so we don't need 64-only.
+curl -fsS "http://$HOST/completion" \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
+        "n_predict": 128,
+        "temperature": 0.0,
+        "stream": false
+    }' | python3 -c "
+import json, sys
+t = json.load(sys.stdin).get('timings', {})
+fmt = lambda v: f'{v:.2f}' if isinstance(v, (int, float)) else '?'  # '?' when the field is missing
+print(f'predicted_per_second:  {fmt(t.get(\"predicted_per_second\"))} tok/s')
+print(f'prompt_per_second:     {fmt(t.get(\"prompt_per_second\"))} tok/s')
+print(f'predicted_n:           {t.get(\"predicted_n\", \"?\")}')
+print(f'prompt_n:              {t.get(\"prompt_n\", \"?\")}')
+"
+
+echo
+echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"
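
The README's bandwidth arithmetic (256 GB/s ÷ ~16.5 GB) and the throughput bands that `smoke.sh` prints but does not enforce can be sketched in a few lines of Python. This is a minimal illustration, not part of the commit: the function names `decode_ceiling` and `classify` are hypothetical, and the "below-band" label for 6-10 tok/s is an assumption, since the README only names the 10-15 tok/s healthy band and the under-6 tok/s "investigate" threshold.

```python
# Sketch: decode-rate ceiling and health bands for a dense model on a
# bandwidth-bound box. The figures come from the README; the function names
# and the "below-band" label for 6-10 tok/s are illustrative assumptions.

BANDWIDTH_GB_S = 256.0   # memory bandwidth quoted in the README
MODEL_SIZE_GB = 16.5     # Q4_K_M weights, all read once per generated token


def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tok/s when every weight is streamed once per token."""
    return bandwidth_gb_s / model_size_gb


def classify(tok_per_s: float) -> str:
    """Map a measured predicted_per_second onto the README's bands."""
    if tok_per_s < 6.0:
        return "investigate"   # README: likely an arena problem
    if tok_per_s < 10.0:
        return "below-band"    # assumption: README leaves 6-10 tok/s unspecified
    return "healthy"           # README: 10-15 tok/s is the expected band


if __name__ == "__main__":
    ceiling = decode_ceiling(BANDWIDTH_GB_S, MODEL_SIZE_GB)
    print(f"ceiling: {ceiling:.1f} tok/s")
    for sample in (4.2, 8.0, 12.5):
        print(f"{sample:5.1f} tok/s -> {classify(sample)}")
```

If the band were ever enforced, one option would be to pipe the `predicted_per_second` value from the `/completion` response through something like `classify` and exit non-zero on `"investigate"`, so that the closing "passed" line reflects the measurement and not only the fact that the requests succeeded.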