added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions

@@ -72,6 +72,106 @@ curl localhost:8080/v1/models # smoke test
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
## Swapping GPU-resident models
The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
`HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
~88 GB-class model, and ROCm doesn't reclaim cleanly between consumers.
So switching between, say, the daily-driver 30B and the 235B long-task
model is a stop-then-start of whole compose stacks.
`scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
box) encodes the coexistence table + per-service health probes so the
swap is one command:
```sh
ssh framework swap-model 235b # stop conflicting svcs, start 235B, wait for /health
ssh framework swap-model coder # back to Qwen3-Coder-30B (Ollama)
ssh framework swap-model status # what's currently up?
ssh framework swap-model none # everything down, free the GPU arena
```
Coexistence table baked into the script:
| Target | What runs | What gets stopped |
|---|---|---|
| `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
| `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
| `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
| `comfyui` | comfyui | 235b, kimi (ollama can stay) |
| `none` | — | all five |
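
The table reduces to a lookup from target to the services that must be stopped first. A minimal sketch of that lookup follows; the function name `services_to_stop` and the short service labels are illustrative assumptions, not necessarily what `scripts/swap-model` uses internally.

```sh
# Hypothetical helper: print the GPU services that must be stopped before
# starting the given target, following the coexistence table above.
# The service labels are illustrative, not confirmed compose project names.
services_to_stop() {
  case "$1" in
    coder)   echo "235b comfyui" ;;
    235b)    echo "ollama llama kimi comfyui" ;;
    kimi)    echo "235b comfyui" ;;
    comfyui) echo "235b kimi" ;;
    none)    echo "ollama llama kimi 235b comfyui" ;;
    *)       echo "unknown target: $1" >&2; return 2 ;;
  esac
}
```

For example, `services_to_stop 235b` prints `ollama llama kimi comfyui`; an unknown target returns a non-zero status so a caller can abort before touching any stack.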
OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
cold load takes 3-5 min); override with `SWAP_WAIT_TIMEOUT=<seconds>`.
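
The wait step is a poll-until-healthy loop bounded by the timeout. The sketch below shows one way to write such a loop; `wait_until`, the 5-second interval and the probed URL are assumptions for illustration, not the script's actual code.

```sh
# wait_until <timeout_s> <interval_s> <probe command...>
# Returns 0 as soon as the probe command succeeds, 1 once the timeout has
# elapsed without a successful probe.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Hypothetical usage; the port and /health path are assumptions:
#   wait_until "${SWAP_WAIT_TIMEOUT:-600}" 5 \
#     curl -fsS -o /dev/null http://localhost:8081/health
```

Passing the probe as a command keeps the loop independent of any one service, so the same helper could wrap a `curl` against `/health` or any other per-service check.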
A Mac-side wrapper at `bin/swap-model` in the repo runs
`ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
PATH so `swap-model X` works from any shell:
```sh
ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
```
Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
Tailscale resolution is misbehaving.
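
The host override is ordinary shell parameter defaulting. A sketch of the idea, assuming the wrapper falls back to the `framework` SSH alias when `SWAP_MODEL_HOST` is unset; `remote_swap_command` is a hypothetical name and only prints the command it would run.

```sh
# Hypothetical sketch of the wrapper logic: build the ssh command line,
# using SWAP_MODEL_HOST when set and the "framework" alias otherwise.
remote_swap_command() {
  printf 'ssh %s /usr/local/bin/swap-model' "${SWAP_MODEL_HOST:-framework}"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# A real wrapper would execute instead of print, e.g.:
#   exec ssh "${SWAP_MODEL_HOST:-framework}" /usr/local/bin/swap-model "$@"
```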
### Inference engine consolidation (in progress)
The model count has reached the point where hand-maintaining the
`swap-model` script costs more than adopting a coordinator, so the
script is being replaced. The plan, phased:
**Framing.** Full consolidation to one engine isn't possible — **vLLM
stays** as the specialist for the architectures llama.cpp can't serve
(Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
memory problem — the cross-engine rule "235B can't share the arena
with anything" — survives any consolidation, because it's a constraint
*between* vLLM and the GGUF tier. So the real decision is only about
the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
same job) and which tool, if any, coordinates the swaps.
**M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
auto-loads/unloads; the only question is whether its bundled llama.cpp
keeps up with the gfx1151-tuned kyuz0 build on *this* box.
`scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
serves the identical GGUF on each engine in isolation and reports
decode/prefill t/s using each engine's own timing fields:
```sh
ssh framework bench-engines # runs both, prints a verdict line
```
**M1 — pick the end-state from the benchmark.** Both are two engines
(GGUF tier + vLLM); the difference is the coordinator:
- **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
**no llama-swap needed**; the one cross-engine rule ("stop Ollama
before 235B") is a few lines. Fewest moving parts.
- **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
**[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
Ollama. This is where llama-swap earns its place — a YAML-configured
OpenAI-compatible proxy whose **groups** encode the coexistence table
declaratively *and* can launch the vLLM backends too, so one tool
coordinates the whole arena and the `swap-model` script retires. It
also subsumes most of LiteLLM's routing role.
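
The decision rule is a single ratio test on decode throughput. A minimal sketch, assuming the two decode rates (tokens/s) have already been measured; `bench_verdict` and its output format are hypothetical and need not match what `bench-engines` prints.

```sh
# Hypothetical verdict helper: compare decode throughput (tokens/s) and
# apply the ~85 % threshold from Option 1. awk handles the float math.
bench_verdict() {
  awk -v ollama="$1" -v llamacpp="$2" 'BEGIN {
    if (llamacpp <= 0) { print "invalid"; exit 1 }
    ratio = ollama / llamacpp
    if (ratio >= 0.85) {
      printf "option-1 (ratio %.2f)\n", ratio
    } else {
      printf "option-2 (ratio %.2f)\n", ratio
    }
  }'
}
```

With made-up inputs, `bench_verdict 43 50` prints `option-1 (ratio 0.86)` and `bench_verdict 40 50` prints `option-2 (ratio 0.80)`.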
Two alternatives also considered, both weaker fits:
- **[LocalAI](https://github.com/mudler/LocalAI)** — bigger "replace
your whole stack" platform with 36+ backends; would force a deeper
rewrite for marginal gain on a 4-model stack.
- **[GPUStack](https://github.com/gpustack/gpustack)** — multi-node
cluster manager (DB-backed, UI-first); built for fleets, overkill
on a single Strix Halo box.
## Tunables
Top of `deploy.py`:
@@ -272,3 +372,77 @@ model server.
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.
## Remote IDE (code-server)
**code-server** (`/srv/docker/code-server`, http://framework:8443) —
full VS Code in the browser, served from the box. The point: Claude
Code (extension from Open VSX + CLI) runs *inside the container*, so
long agent tasks execute here and survive the laptop sleeping. Any
tailnet device gets the same editor and the same running session.
Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
see `/srv/docker/code-server/README.md` for the one-time extension
install + sign-in flow. No password is set (Tailscale-only trust
model, like everything else here) — set `HASHED_PASSWORD` before this
host ever sees other traffic; a browser terminal is a shell.
The workspace is container-scoped on purpose (no host filesystem, no
docker socket). For host-level Claude work, use the section below
instead.
## Workspace manager (Coder) — pilot
**Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
workspace manager: a web dashboard that stamps out per-project dev
containers from Terraform templates. The shipped `code-server` template
gives each workspace browser VS Code with the Claude Code extension
(from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
home volumes persist `~/.claude` creds and repos across stop/start.
Workspaces publish no host ports — everything tunnels through the
dashboard — and idle autostop hands parked projects' RAM back to the
inference stacks.
Bring-up + template push + first workspace:
`/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
Postgres password) is auto-generated by `./run.sh` — nothing to fill
in by hand.
This is a **pilot** against the standalone code-server stack above:
after a week of real use, one of them retires. Same security posture
as OpenHands (docker-socket holder, spawns code-running containers) —
Tailscale-only, never expose :7080 further.
## Claude Code on the box (remote-control)
For long unattended runs steered from a phone or browser — no IDE
involved — Claude Code's official Remote Control feature bridges a
session running on the box to claude.ai/code and the Claude mobile
app. Needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max plan;
API keys don't work for this). Traffic relays outbound-only over TLS
to Anthropic, so no inbound ports are needed.
```sh
ssh framework
tmux new -s rc
cd ~/src/whatever
claude remote-control --name "Strix Halo"
# prints a session URL → open on any device, or find it in the
# session list at claude.ai/code / the Claude app's Code tab
```
Operational notes (state of the feature as of 2026-06):
- The local process is the engine — tasks keep executing with **no
client attached**; clients are viewports. But if the process dies,
the session is gone (no `--resume` equivalent for remote-control
sessions yet, issue #30447). Hence tmux, always.
- It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
- `--spawn worktree` gives each connecting session its own git
worktree; `--spawn same-dir` (default) shares the directory.
- Headless boxes print the session URL but no QR (#34764) — copy the
URL or use the session list.
- Mobile clients sometimes drop after long idle (#28914, #34255);
reconnecting always works — the task on the box is unaffected.
- Run `claude` interactively once per project dir first to clear the
workspace-trust prompt before automating anything.

@@ -0,0 +1,55 @@
# code-server — VS Code in the browser, served from the box. The point:
# Claude Code (extension + CLI) runs *inside this container*, so long
# agent tasks keep running when the laptop sleeps. Any device on the
# tailnet gets the full VS Code experience at http://framework:8443.
#
# linuxserver image over codercom/code-server for the PUID/PGID
# convention (matches host `noise` UID 1000 so workspace files don't
# come out root-owned) and the s6 init that chowns /config.
#
# Extensions come from Open VSX (code-server's default registry). The
# official Claude Code extension is published there:
# https://open-vsx.org/extension/Anthropic/claude-code
#
# Persistent state: /config is the container user's $HOME — extensions,
# settings, ~/.claude (OAuth creds + session history), ~/.local/bin
# (claude CLI). Survives container recreation; back up one path.
services:
code-server:
image: lscr.io/linuxserver/code-server:latest
container_name: code-server
restart: unless-stopped
# No PASSWORD env set — auth is disabled, same trust model as every
# other service here: reachable over Tailscale only. If this host
# ever sees LAN/internet traffic, set HASHED_PASSWORD (and even then
# prefer "127.0.0.1:8443:8443" + tunnel — a browser terminal is a
# shell on the box's docker network).
ports:
- "8443:8443"
environment:
- PUID=1000
- PGID=1000
- TZ=Etc/UTC
# Open this folder by default instead of the "welcome" screen.
- DEFAULT_WORKSPACE=/workspace
# tmux inside the container: detachable terminals for long claude
# runs, so a browser-tab close or code-server reconnect hiccup
# can't orphan your view of a session (the process itself never
# depended on the tab — it's a child of the container).
- DOCKER_MODS=linuxserver/mods:universal-package-install
- INSTALL_PACKAGES=tmux
extra_hosts:
# Reach the sibling services (Ollama :11434, LiteLLM :4000,
# Phoenix :6006, ...) from the integrated terminal / extensions.
- "host.docker.internal:host-gateway"
volumes:
- /srv/docker/code-server/config:/config
# Default project area. Container-scoped on purpose — Claude Code
# in here can't touch /srv/docker compose stacks or the host.
# Mount additional repos explicitly when you want them editable:
# - /home/noise/src/somerepo:/workspace/somerepo
- /srv/docker/code-server/workspace:/workspace

@@ -0,0 +1,85 @@
# code-server — VS Code in the browser
Full VS Code served from the box at <http://framework:8443>. The reason
it exists in this stack: **Claude Code runs inside the container**, so
long agent tasks execute on the Framework Desktop and survive the
laptop sleeping, the browser tab closing, or you walking away with the
phone. Any device on the tailnet gets the same editor with the same
state.
## Bring-up
```sh
cd /srv/docker/code-server && docker compose up -d
```
First start pulls the image and runs the universal-package-install mod
(installs tmux), so give it a minute before the UI answers on :8443.
> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel included) need a secure context; over plain
> `http://framework:8443` they render blank. Serve it via Tailscale:
> `sudo tailscale serve --bg --https=8443 8443` →
> `https://framework.<tailnet>.ts.net:8443`. Localhost also counts as
> secure (`ssh -L 8443:localhost:8443 framework`).
## One-time setup (in the code-server UI)
1. **Install the Claude Code extension.** Extensions panel → search
"Claude Code". code-server uses the Open VSX registry, where
Anthropic publishes the official extension
(<https://open-vsx.org/extension/Anthropic/claude-code>).
2. **Sign in.** The OAuth flow in a browser context is a copy/paste
dance: the extension shows a URL → open it in another tab → approve
→ paste the code back. Credentials land in `/config/.claude/` and
persist across container recreates.
3. **(Optional) CLI in the terminal.** The extension bundles its own
CLI, but for tmux-based long runs install it explicitly:
```sh
curl -fsSL https://claude.ai/install.sh | bash
```
Lands in `~/.local/bin` (= `/config/.local/bin`, persisted).
## Long-running tasks
The Claude process is a child of the container, not of your browser
tab — closing the tab does nothing to it. Reattach from any device and
the session is where you left it. For multi-hour unattended runs,
prefer a tmux pane in the integrated terminal:
```sh
tmux new -s longtask
claude # kick off the task, detach with C-b d
```
If you want to steer from a phone or claude.ai/code instead of this UI,
use `claude remote-control` (see the framework README's
"Claude Code on the box" section) — that works in the host's tmux too,
no container needed.
## Scope and security
- **No password is set** — same trust model as every other service on
this Tailscale-only box. If the host ever sees LAN/internet traffic,
set `HASHED_PASSWORD` in the compose env and bind to localhost; a
browser terminal is a shell.
- The workspace is **container-scoped on purpose**: Claude Code in here
sees `/workspace` and `/config`, not the host filesystem or the
docker socket. Bind-mount specific repos into `/workspace/<name>`
when you want them editable (host UID 1000 == container PUID 1000,
so ownership just works).
- `host.docker.internal` reaches the sibling services — handy for
pointing scripts at Ollama (:11434) or LiteLLM (:4000) from the
integrated terminal.
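
As an illustration of the last point, a script inside the container can resolve sibling services through one small helper. This is a sketch only: `sibling_url` is a hypothetical name, and the ports are the ones listed above.

```sh
# Hypothetical helper: base URL of a sibling service as seen from inside
# the container, via the host.docker.internal alias.
sibling_url() {
  case "$1" in
    ollama)  echo "http://host.docker.internal:11434" ;;
    litellm) echo "http://host.docker.internal:4000" ;;
    *)       echo "unknown service: $1" >&2; return 2 ;;
  esac
}

# Example (run inside the container): list the models Ollama has pulled.
#   curl -fsS "$(sibling_url ollama)/api/tags"
```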
## State layout
| Path (host) | What lives there |
| -------------------------------------- | ------------------------------------------------- |
| `/srv/docker/code-server/config` | `$HOME`: extensions, settings, `~/.claude`, CLIs |
| `/srv/docker/code-server/workspace` | default project area (`DEFAULT_WORKSPACE`) |
Back up `config` if you care about extension state and Claude session
history; everything else is reproducible from the repo.

@@ -0,0 +1,64 @@
# Coder — self-hosted workspace manager (PILOT). Web dashboard at :7080
# that stamps out per-project dev containers from Terraform templates;
# each workspace gets browser code-server with the Claude Code extension
# pre-installed. Evaluating against the standalone code-server stack
# (compose/code-server.yml, kept as-is during the pilot) — verdict
# criteria in compose/coder/README.md.
#
# The server container holds the host docker socket and spawns workspace
# containers as siblings (same pattern + same security posture as
# OpenHands: fine Tailscale-only, never expose further). group_add needs
# the socket's host GID — host-specific, so it comes from the sibling
# .env, which pyinfra generates on the box on first deploy along with a
# random one-time Postgres password.
#
# Workspaces don't publish host ports — code-server and terminals are
# reached through the dashboard's tunnel. No port bookkeeping per
# project.
services:
coder:
# Pin and bump deliberately — Coder releases weekly and the workspace
# provisioner/agent protocol moves with it. Verify at
# https://github.com/coder/coder/releases.
image: ghcr.io/coder/coder:v2.33.8
container_name: coder
restart: unless-stopped
ports:
- "7080:7080"
environment:
CODER_HTTP_ADDRESS: "0.0.0.0:7080"
# The URL clients are told to use — tailnet MagicDNS name, same
# convention as every other service tile.
CODER_ACCESS_URL: "${CODER_ACCESS_URL}"
CODER_PG_CONNECTION_URL: "postgresql://coder:${POSTGRES_PASSWORD}@database/coder?sslmode=disable"
group_add:
# GID of /var/run/docker.sock — generated into .env by deploy.py.
- "${DOCKER_GROUP_ID}"
volumes:
- /var/run/docker.sock:/var/run/docker.sock
# Repo-shipped Terraform templates (source of truth:
# pyinfra/framework/compose/coder/templates/). Push after changes:
# docker compose exec coder coder templates push code-server \
# --directory /templates/code-server --yes
- /srv/docker/coder/templates:/templates:ro
depends_on:
database:
condition: service_healthy
database:
image: postgres:17
container_name: coder-db
restart: unless-stopped
environment:
POSTGRES_USER: coder
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: coder
volumes:
# Entrypoint starts as root and chowns this to the postgres uid;
# deploy.py just creates the mount point.
- /srv/docker/coder/postgres:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U coder -d coder"]
interval: 5s
timeout: 5s
retries: 5

@@ -0,0 +1,110 @@
# Coder — workspace manager (pilot)
Self-hosted control plane at <http://framework:7080> that creates
per-project dev containers from Terraform templates. Each workspace:
its own container, browser code-server with the Claude Code extension
pre-installed, no host-port bookkeeping (everything tunnels through the
dashboard), and idle autostop so parked projects give their RAM back to
the inference stacks.
**Pilot status.** Evaluating against the standalone code-server stack
(`/srv/docker/code-server`, kept as-is meanwhile). Verdict after a week
of real use: does the dashboard + autostop + create-from-template earn
an always-on control plane + Postgres? If yes, the standalone stack
retires; if no, Coder goes and a thin `workspace` spawner script
replaces it.
## Bring-up
```sh
cd /srv/docker/coder && docker compose up -d
```
The sibling `.env` (DOCKER_GROUP_ID, CODER_ACCESS_URL, random Postgres
password) is generated by deploy.py on first `./run.sh` — no hand-fill
needed. First visit to <http://framework:7080> creates the admin
account (pick anything; it's local to the box).
> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel, markdown preview, etc.) run on service workers,
> which browsers only allow in a secure context — over plain
> `http://framework:7080` the editor works but webview panels render
> blank. Fix via Tailscale Serve (real Let's Encrypt cert for the
> tailnet name; enable "HTTPS Certificates" once in the Tailscale
> admin console):
>
> ```sh
> sudo tailscale serve --bg 7080
> # then in .env: CODER_ACCESS_URL=https://framework.<tailnet>.ts.net
> docker compose up -d # recreate server with new access URL
> # restart any existing workspace — app URLs derive from the access URL
> ```
>
> Localhost is also a secure context, so
> `ssh -L 7080:localhost:7080 framework` + http://localhost:7080 works
> in a pinch.
## Push the template (one-time + after edits)
The `coder` CLI ships inside the server image; authenticate it once:
```sh
docker compose exec coder coder login http://localhost:7080
# prints a /cli-auth URL — open http://framework:7080/cli-auth in your
# browser, copy the session token, paste it back
```
Then push:
```sh
docker compose exec coder coder templates push code-server \
--directory /templates/code-server --yes
```
Template source of truth is the repo
(`pyinfra/framework/compose/coder/templates/code-server/main.tf`) —
edit there, `./run.sh`, re-push. Edits on the box get overwritten.
## First workspace
Dashboard → Workspaces → Create → `code-server` template → name it
after the project. Open the code-server app tile, then one-time
Claude sign-in: the extension shows an OAuth URL → open in another tab
→ approve → paste the code back. Credentials live in `~/.claude` on
the workspace's home volume and survive stop/start and rebuilds.
Set **idle autostop** under Template → Settings → Schedule (suggest
12 h inactivity). Activity = open code-server tab, SSH, web terminal.
## What persists where
| Thing | Where |
| -------------------------------- | -------------------------------------------- |
| Workspace home (repos, ~/.claude, extensions) | named volume `coder-<workspace-id>-home` |
| Control-plane state (users, templates, workspace defs) | Postgres → `/srv/docker/coder/postgres` |
| Template source | this repo, shipped to `/srv/docker/coder/templates` |
A stopped workspace's container is deleted; only the home volume
remains. `docker volume ls | grep coder-` to audit.
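
The plain `grep coder-` also matches any other volume whose name contains `coder-`. A slightly stricter filter, sketched here under the assumption that home volumes always follow the `coder-<workspace-id>-home` pattern from the table; `filter_home_volumes` is a hypothetical helper name.

```sh
# Keep only names shaped like coder-<workspace-id>-home, one per line on
# stdin. "|| true" makes "no matches" a normal, successful result.
filter_home_volumes() {
  grep -E '^coder-.+-home$' || true
}

# Usage on the box:
#   docker volume ls --format '{{.Name}}' | filter_home_volumes
```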
## Security
Same posture as OpenHands: the server container holds the docker
socket and spawns code-running containers — root-equivalent on the
box. Tailscale-only exposure is the mitigation; never forward :7080
anywhere else. Coder does have real auth (the admin account), but
treat that as defense-in-depth, not as permission to expose it.
## Notes
- Workspaces reach the sibling services via `host.docker.internal`
(Ollama :11434, LiteLLM :4000, Phoenix :6006, ...).
- Long Claude runs: same rules as anywhere — the process lives in the
workspace container, so it survives laptop/browser disconnects, but
**autostop will kill an idle-looking workspace mid-run**. For
multi-hour unattended tasks either bump the workspace's TTL in the
dashboard or use the host-side tmux + `claude remote-control`
pattern (framework README, "Claude Code on the box").
- The base image is `codercom/enterprise-base:ubuntu` (sudo-enabled,
common toolchain). Per-project images are a later refinement —
swap the `image` in main.tf or parameterize with `coder_parameter`.

@@ -0,0 +1,121 @@
# code-server workspace template — one dev container per project, with
# browser VS Code and Claude Code (extension + CLI) ready on first open.
#
# Source of truth is the repo copy at
# pyinfra/framework/compose/coder/templates/code-server/main.tf; pyinfra
# ships it to /srv/docker/coder/templates/, mounted read-only into the
# server container at /templates. Push after edits:
# cd /srv/docker/coder
# docker compose exec coder coder templates push code-server \
# --directory /templates/code-server --yes
terraform {
required_providers {
coder = {
source = "coder/coder"
}
docker = {
source = "kreuzwerker/docker"
}
}
}
provider "coder" {}
# Talks to the host daemon via the socket mounted into the server
# container (default unix:///var/run/docker.sock) — workspace containers
# are siblings of the compose stacks, not children.
provider "docker" {}
data "coder_workspace" "me" {}
data "coder_workspace_owner" "me" {}
resource "coder_agent" "main" {
arch = "amd64"
os = "linux"
# Claude Code CLI — native installer, lands in ~/.local/bin inside the
# persisted home volume, so this is a no-op after the first start. The
# code-server extension bundles its own CLI, but having `claude` on
# PATH enables tmux-based long runs in the workspace terminal.
startup_script = <<-EOT
set -e
command -v claude >/dev/null 2>&1 || curl -fsSL https://claude.ai/install.sh | bash
EOT
env = {
GIT_AUTHOR_NAME = data.coder_workspace_owner.me.full_name
GIT_AUTHOR_EMAIL = data.coder_workspace_owner.me.email
GIT_COMMITTER_NAME = data.coder_workspace_owner.me.full_name
GIT_COMMITTER_EMAIL = data.coder_workspace_owner.me.email
}
metadata {
display_name = "CPU"
key = "cpu"
script = "coder stat cpu"
interval = 10
timeout = 1
}
metadata {
display_name = "RAM"
key = "mem"
script = "coder stat mem"
interval = 10
timeout = 1
}
}
# Browser VS Code inside the workspace, surfaced as a dashboard app.
# Extensions install from Open VSX — anthropic.claude-code is the
# official Claude Code extension
# (https://open-vsx.org/extension/Anthropic/claude-code). OAuth creds
# land in ~/.claude inside the home volume and survive rebuilds.
module "code_server" {
count = data.coder_workspace.me.start_count
source = "registry.coder.com/coder/code-server/coder"
version = "~> 1.0"
agent_id = coder_agent.main.id
folder = "/home/coder/project"
extensions = [
"anthropic.claude-code",
]
}
# Home survives workspace stop/start AND template-driven rebuilds —
# ignore_changes keeps Terraform from recreating the volume (and wiping
# ~/.claude, extensions, repos) when template metadata shifts.
resource "docker_volume" "home" {
name = "coder-${data.coder_workspace.me.id}-home"
lifecycle {
ignore_changes = all
}
}
resource "docker_container" "workspace" {
# start_count is 0 when the workspace is stopped — the container is
# deleted but the home volume above persists. This is what idle
# autostop reclaims: RAM back to the inference stacks.
count = data.coder_workspace.me.start_count
image = "codercom/enterprise-base:ubuntu"
name = "coder-${data.coder_workspace_owner.me.name}-${lower(data.coder_workspace.me.name)}"
hostname = data.coder_workspace.me.name
# Agent bootstrap, rewritten to reach the control plane through the
# docker bridge (the workspace is a sibling container; "localhost"
# inside it isn't the Coder server).
entrypoint = ["sh", "-c", replace(coder_agent.main.init_script, "/localhost|127\\.0\\.0\\.1/", "host.docker.internal")]
env = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
host {
host = "host.docker.internal"
ip = "host-gateway"
}
volumes {
container_path = "/home/coder"
volume_name = docker_volume.home.name
read_only = false
}
}

@@ -95,6 +95,28 @@
server: localhost-docker
container: openhands
- code-server:
icon: code-server.svg
href: http://framework:8443
description: VS Code in the browser — Claude Code long tasks live here
server: localhost-docker
container: code-server
- Coder:
icon: coder.svg
href: http://framework:7080
description: Workspace manager (pilot) — per-project containers from templates
server: localhost-docker
container: coder
# /api/v2/buildinfo is unauthenticated — cheap liveness + version.
widget:
type: customapi
url: http://host.docker.internal:7080/api/v2/buildinfo
refreshInterval: 30000
mappings:
- field: version
label: Version
- Observability:
- Beszel:
icon: beszel.svg

@@ -0,0 +1,102 @@
# Qwable-3.6-27B (Qwen3.6-27B fine-tuned on Fable-5-style reasoning
# traces — "Qwen + Fable") via the kyuz0 rocm-7.2.2 Strix Halo toolbox.
# Same image + unified-memory recipe as compose/llama.yml; deltas are
# model path, port, alias.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
# Model: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b (MIT)
#
# What it's for. A "thinks-like-Fable-5" interactive model — structured,
# step-by-step explanatory output. Dense 27B (NOT MoE), so it's slower
# per token than the 30B-A3B MoE workhorses despite being smaller on
# disk: all 27B weights load per token. Bandwidth math (256 GB/s ÷
# ~16.5 GB) → ~10-15 tok/s decode. Interactive but not snappy.
#
# Coexistence. At ~16.5 GB (Q4_K_M) it's the smallest GPU resident here
# and fits alongside llama 30B (port 8080), Ollama, or Kimi in the
# ~110 GB merged arena. It does NOT fit alongside qwen3-235b (88.8 GB)
# or comfyui — swap-model tears those down for the `qwable` target.
# `restart: "no"`: you bring it up deliberately via swap-model, it won't
# auto-start after a reboot and surprise-collide with a big model.
#
# Weights. Single-file GGUF (not sharded). Download path on the box
# (see compose/qwable/README.md):
# hf download Mia-AiLab/Qwable-3.6-27b \
# 'Qwable-27b_Q4_K_M.gguf' \
# --local-dir /models/qwen/Qwable-3.6-27b
# Verify exact filename in the HF repo before downloading.
#
# Port 8082 — distinct from llama 30B (8080) and qwen3-235b (8081).
services:
qwable:
image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2
container_name: qwable
# Manual start only — see header note about GPU contention with
# the big models. swap-model brings it up/down.
restart: "no"
devices:
# ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan
# only needs dri. Don't drop kfd when on the rocm-* tag.
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
cap_add:
- SYS_PTRACE
security_opt:
- seccomp=unconfined
# Numeric GIDs of host's video (44) and render (991) groups —
# required for /dev/kfd + /dev/dri access from inside the container.
group_add:
- "44"
- "991"
shm_size: 8g
ipc: host
environment:
# Unified-memory recipe (same as compose/llama.yml + kimi-linear +
# qwen3-235b). BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these
# flags merge the rocminfo pools into one ~110 GB arena. kyuz0's
# image is native gfx1151 so no HSA_OVERRIDE_GFX_VERSION.
- HSA_XNACK=1
- HSA_FORCE_FINE_GRAIN_PCIE=1
volumes:
- /models:/models:ro
ports:
- "8082:8082"
entrypoint: ["llama-server"]
command:
- --model
- /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf
# OpenAI-compatible served name (matches what opencode/curl request
# as "model"). Provider-side name lives in opencode.json if/when
# this gets wired as a provider.
- --alias
- qwable
- --host
- 0.0.0.0
- --port
- "8082"
# Push all layers to GPU. "999" = all available. A 27B Q4 (~16.5 GB)
# fits the merged arena with huge headroom.
- --n-gpu-layers
- "999"
# 64K to match llama/qwen3-235b — keeps opencode auto-compaction
# behaviour consistent across providers. Tons of arena headroom
# here (model is small), so this can ramp far higher if a workflow
# needs it; see compose/qwable/README.md.
- --ctx-size
- "65536"
# No-mmap is the Strix Halo standard — forces full GPU load.
- --no-mmap
# Flash attention — required for q8_0 KV cache; modern llama-server
# takes a value (on/off/auto), bare --flash-attn is deprecated.
- --flash-attn
- "on"
# Quantize KV cache to int8 — halves KV memory at minor/no quality
# loss. Matches the other llama.cpp stacks.
- --cache-type-k
- q8_0
- --cache-type-v
- q8_0
# Use the model's embedded jinja chat template — Qwable inherits
# Qwen3.6's chat format, which the Fable-trace fine-tune relies on.
- --jinja
# Expose Prometheus metrics at /metrics — scraped by OpenLIT.
- --metrics

@@ -0,0 +1,130 @@
# qwable
Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
collected examples formatted like Fable 5's deliberate, step-by-step
answers and trained Qwen to reproduce that structured, explanatory
output. Think of it as a local "thinks-like-Fable" model.
OpenAI-compatible endpoint at `http://framework:8082` once running.
## Dense, not MoE (read first)
Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
**dense** 27B — every weight loads per token. On this bandwidth-bound
box (256 GB/s ÷ ~16.5 GB) that's **~10-15 tok/s** decode, slower than
the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you
specifically want the Fable-style reasoning, not for raw throughput.
The interactive daily driver stays on Ollama / llama 30B.
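
The bandwidth estimate can be reproduced directly: for a bandwidth-bound dense model, the decode ceiling is roughly memory bandwidth divided by the bytes read per token, which here is about the whole weight file. The helper below is a sketch of that arithmetic; `decode_ceiling_tps` is a hypothetical name, and real throughput lands below the ceiling.

```sh
# Upper bound on decode tokens/s for a bandwidth-bound dense model:
# memory bandwidth (GB/s) divided by the data read per token (GB).
decode_ceiling_tps() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}
```

`decode_ceiling_tps 256 16.5` prints `15.5`, which is why the expected band tops out around 15 tok/s.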
## Coexistence notes
At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
| Concurrent service | Coexists? |
|---|---|
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
| `ollama` (11434) | ✅ yes |
| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
| `comfyui` (8188) | ❌ no — swap-model stops it |
`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
it won't auto-start after a reboot and surprise-collide with a big model.
## Prereqs
- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
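
To sanity-check the value as well as its presence, the page count can be converted to GiB. This sketch assumes 4 KiB pages (the x86_64 default); `ttm_limit_gib` is a hypothetical helper.

```sh
# Extract ttm.pages_limit from a kernel command line string and convert
# it to GiB, assuming 4 KiB pages. Returns 1 if the parameter is absent.
ttm_limit_gib() {
  local pages
  pages=$(printf '%s\n' "$1" | tr ' ' '\n' \
    | sed -n 's/^ttm\.pages_limit=//p' | head -n 1)
  [ -n "$pages" ] || return 1
  echo $(( $pages * 4096 / 1073741824 ))
}

# Usage on the box:
#   ttm_limit_gib "$(cat /proc/cmdline)"
```

For `ttm.pages_limit=33554432` this prints `128`: the parameter is a 128 GiB cap, above the ~110 GB arena actually observed.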
## Download weights (~16.5 GB, single file)
```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwable-3.6-27b
hf download Mia-AiLab/Qwable-3.6-27b \
'Qwable-27b_Q4_K_M.gguf' \
--local-dir /models/qwen/Qwable-3.6-27b
# File lands at:
# /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf (~16.5 GB)
```
Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
needs ~17 GB free on `/models`.
> Abliterated variant (refusals removed) lives at
> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
> Not the default — no safety filtering, careful with it.
## Bring up
Easy path — `swap-model` handles stop-conflicting-services + waits for
`/health`:
```sh
ssh framework swap-model qwable # ~1-2 min cold load (16.5 GB)
ssh framework /srv/docker/qwable/smoke.sh # perf measure
```
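
For reference, the wait-for-`/health` step that `swap-model` performs can be sketched as a poll loop with a deadline. This is an illustrative Python rendering, not the script itself; the function names and the injectable `sleep`/`clock` parameters are assumptions added to keep the sketch testable.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """True if a GET on url returns a 2xx status within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it succeeds; give up once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while not probe():
        if clock() > deadline:
            return False
        sleep(interval_s)
    return True
```

On the box this would be called as `wait_healthy(lambda: http_ok("http://127.0.0.1:8082/health"))`, with the 600 s default mirroring `SWAP_WAIT_TIMEOUT`.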
Manual equivalent (first-ever bring-up, before the image is cached):
```sh
cd /srv/docker/qwable
docker compose pull # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f # wait for "server is listening on http://0.0.0.0:8082"
./smoke.sh # /health + tiny generation + perf
```
First start is ~1-2 min (16.5 GB load off disk; much faster than the
235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
qwen3-235b/README.md "Troubleshooting" for the arena checks).
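
The thresholds above can be written down as a small classifier over the `timings` object that `/completion` returns. This is a sketch: the "marginal" label for the 6-10 tok/s gap is an assumption (the text only names the healthy band and the investigate threshold), and the sample numbers are invented.

```python
import json


def classify_decode(timings: dict) -> str:
    """Map llama-server decode throughput onto the README's health bands."""
    tps = timings.get("predicted_per_second")
    if tps is None:
        return "unknown"      # field missing from the response
    if tps >= 10.0:
        return "healthy"      # inside (or above) the expected 10-15 tok/s band
    if tps < 6.0:
        return "investigate"  # likely an undersized arena
    return "marginal"         # assumption: 6-10 tok/s is not covered by the README


# Invented sample response, shaped like the fields smoke.sh reads.
sample = json.loads('{"timings": {"predicted_per_second": 12.4, "prompt_per_second": 310.0}}')
print(classify_decode(sample["timings"]))  # healthy
```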
## Ramping context
Defaults to 64K to match the other llama.cpp stacks (keeps opencode
auto-compaction consistent across providers). The model is tiny relative
to the arena, so there's plenty of room to push higher:
| Stage | `--ctx-size` | Margin in arena |
|---|---|---|
| **Current default** | **65536** | huge (~90 GB free) |
| Stretch | 131072+ | still comfortable |
Edit `--ctx-size` in `docker-compose.yml`, run
`docker compose down && docker compose up -d`, then re-run `./smoke.sh`.
The real ceiling is Qwable's trained context length (inherited from
Qwen3.6-27B), not arena memory, so verify the model's maximum position
count before going past 128K.
## Operations
```sh
docker compose logs -f # tail
docker compose down # stop
docker compose exec qwable bash # shell in
./smoke.sh # health + perf
amdgpu_top # GPU view on host
```
## Pin manifest
| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
| Weights | `Mia-AiLab/Qwable-3.6-27b`, file `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
| Default port | 8082 |
| Default context | 65536 |
| KV cache type | q8_0 (k and v) |
| License | MIT (model); Qwen3.6-27B base license also applies |
## Status
Compose artifacts written; awaiting box-side weight pull + bring-up.
Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
provider only if the Fable-style reasoning proves useful in practice.

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Smoke-test the running qwable llama-server (port 8082). Hits /health
# for liveness, then a tiny OpenAI-compatible chat completion, then
# measures eval_tps via /completion. Dense 27B → expect ~10-15 tok/s.
set -euo pipefail
HOST="${QWABLE_HOST:-127.0.0.1:8082}"
MODEL="${QWABLE_MODEL:-qwable}"
echo "[smoke] GET /health on $HOST"
curl -fsS "http://$HOST/health" | python3 -m json.tool
echo
echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
curl -fsS "http://$HOST/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d "{
\"model\": \"$MODEL\",
\"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
\"max_tokens\": 16,
\"temperature\": 0.0
}" | python3 -m json.tool
echo
echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=128)"
# 128 tokens — at ~10-15 tok/s the per-token warmup noise still matters,
# but a dense 27B settles faster than the 235B so we don't need 64-only.
curl -fsS "http://$HOST/completion" \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
"n_predict": 128,
"temperature": 0.0,
"stream": false
}' | python3 -c "
import json, sys
r = json.load(sys.stdin)
t = r.get('timings', {})
print(f'predicted_per_second: {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s')
print(f'prompt_per_second: {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s')
print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}')
print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}')
"
echo
echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"

@@ -53,6 +53,15 @@ Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect
## Bring up (M0.2 — first generation)
Easy path — `swap-model` handles the stop-conflicting-services dance + waits for `/health`:
```sh
ssh framework swap-model 235b # ~3-5 min on first cold load
ssh framework /srv/docker/qwen3-235b/smoke.sh # perf measure
```
Manual equivalent (for first-ever bring-up before the image is cached):
```sh
cd /srv/docker/qwen3-235b
docker compose pull # already-cached image if you ran llama first

@@ -80,6 +80,8 @@ apt.packages(
"ca-certificates",
"unzip",
"software-properties-common", # for add-apt-repository (g++-14 PPA)
"jq", # JSON parsing in operator scripts (bench-engines)
"bc", # float math in bench-engines
],
_sudo=True,
)
@@ -433,6 +435,7 @@ for svc in (
"ollama",
"kimi-linear",
"qwen3-235b",
"qwable",
"litellm",
"comfyui",
"openwebui",
@@ -440,6 +443,8 @@ for svc in (
"openlit",
"phoenix",
"openhands",
"code-server",
"coder",
"homepage",
"whisper",
"piper",
@@ -561,6 +566,89 @@ files.directory(
_sudo=True,
)
# code-server persistent state. The linuxserver image's s6 init drops to
# PUID/PGID 1000 and treats /config as the container user's $HOME —
# extensions, settings, ~/.claude (Claude Code OAuth creds + session
# history), ~/.local/bin. Owned 1000:1000 to match (same pattern as
# kokoro: the container user isn't in the docker group, so 2775
# root:docker wouldn't help it).
files.directory(
name="code-server config dir",
path=f"{COMPOSE_DIR}/code-server/config",
user="1000",
group="1000",
mode="0755",
_sudo=True,
)
# Default workspace. Host UID 1000 == container PUID 1000, so files
# created either side stay owned by the SSH user.
files.directory(
name="code-server workspace dir",
path=f"{COMPOSE_DIR}/code-server/workspace",
user=SSH_USER,
group=SSH_USER,
mode="2775",
_sudo=True,
)
files.put(
name="code-server: README.md",
src="compose/code-server/README.md",
dest=f"{COMPOSE_DIR}/code-server/README.md",
group="docker",
mode="0664",
_sudo=True,
)
# Coder workspace manager (pilot — see compose/coder/README.md for the
# evaluation criteria vs the standalone code-server stack). Postgres
# chowns its own data dir at first start (entrypoint runs as root); we
# just create the mount point. Templates are repo-sourced Terraform
# mounted read-only into the server container, pushed with
# `docker compose exec coder coder templates push`.
files.directory(
name="Coder postgres data dir",
path=f"{COMPOSE_DIR}/coder/postgres",
group="docker",
mode="2775",
_sudo=True,
)
files.directory(
name="Coder templates/code-server dir",
path=f"{COMPOSE_DIR}/coder/templates/code-server",
group="docker",
mode="2775",
_sudo=True,
)
for asset, mode in (
("templates/code-server/main.tf", "0664"),
("README.md", "0664"),
):
files.put(
name=f"coder: {asset}",
src=f"compose/coder/{asset}",
dest=f"{COMPOSE_DIR}/coder/{asset}",
group="docker",
mode=mode,
_sudo=True,
)
# Sibling .env: DOCKER_GROUP_ID (host-specific GID of the docker socket,
# needed for the server container's group_add), the access URL, and a
# random one-time Postgres password. Generated on the box rather than
# placeholder-then-hand-fill because every value is derivable. Never
# overwritten once present.
server.shell(
name="Coder .env (generate once)",
commands=[
f"test -f {COMPOSE_DIR}/coder/.env || {{ "
f"printf 'DOCKER_GROUP_ID=%s\\nCODER_ACCESS_URL=http://framework:7080\\nPOSTGRES_PASSWORD=%s\\n' "
f'"$(stat -c %g /var/run/docker.sock)" "$(openssl rand -hex 16)" '
f"> {COMPOSE_DIR}/coder/.env && "
f"chown root:docker {COMPOSE_DIR}/coder/.env && "
f"chmod 640 {COMPOSE_DIR}/coder/.env; }}",
],
_sudo=True,
)
# Homepage config. The compose loop above only copies homepage.yml; the
# YAML config files live in compose/homepage/ on the source side and at
# /srv/docker/homepage/config/ on the box. Source-of-truth is the repo —
@@ -622,6 +710,22 @@ for asset, mode in (
_sudo=True,
)
# Qwable operator assets. Same image as llama (kyuz0 rocm-7.2.2); dense
# 27B Qwen3.6 fine-tuned on Fable-5 traces. Weights live at /models/qwen/
# via manual `hf download` per the README. swap-model `qwable` target.
for asset, mode in (
("smoke.sh", "0775"),
("README.md", "0664"),
):
files.put(
name=f"qwable: {asset}",
src=f"compose/qwable/{asset}",
dest=f"{COMPOSE_DIR}/qwable/{asset}",
group="docker",
mode=mode,
_sudo=True,
)
# LiteLLM router assets. config.yaml is the source-of-truth model
# routing table — pyinfra syncs it on every run; edits on the box get
# overwritten. The .env file holds LITELLM_MASTER_KEY + LITELLM_SALT_KEY
@@ -758,6 +862,37 @@ files.directory(
_sudo=True,
)
# --- Operator scripts -------------------------------------------------------
# swap-model — one-command swap between which inference container is
# GPU-resident. Encodes the coexistence table (235B doesn't fit alongside
# anything; ollama+kimi do) + per-service health probes. Lives in
# /usr/local/bin so the SSH user (in the docker group) can run it
# directly: `ssh framework swap-model 235b`. See scripts/swap-model for
# the modes and bin/swap-model for the Mac-side wrapper.
files.put(
name="swap-model script (box-side)",
src="scripts/swap-model",
dest="/usr/local/bin/swap-model",
user="root",
group="root",
mode="0755",
_sudo=True,
)
# bench-engines — one-shot decision tool for the GGUF-tier consolidation
# (Ollama vs kyuz0 llama.cpp decode t/s on gfx1151). See the framework
# README "Inference engine consolidation".
files.put(
name="bench-engines script (box-side)",
src="scripts/bench-engines",
dest="/usr/local/bin/bench-engines",
user="root",
group="root",
mode="0755",
_sudo=True,
)
# --- Cleanup of artifacts from the prior native-build deploy ----------------
# All idempotent — `present=False` is a no-op when the target is absent.

@@ -0,0 +1,177 @@
#!/usr/bin/env bash
# bench-engines — compare decode/prefill throughput of Ollama vs
# llama.cpp (kyuz0 toolbox) on the SAME GGUF on gfx1151.
#
# Why this exists. The GGUF-tier consolidation decision (see the
# framework README "Inference engine consolidation") hinges on one
# hardware-specific unknown: how close is Ollama's bundled llama.cpp to
# the gfx1151-tuned kyuz0 build on *this* box? If decode t/s is within
# ~10-15 %, Ollama's convenience wins (it auto-swaps, so no llama-swap
# needed). If kyuz0's rocWMMA flash-attention lead is large, that argues
# for keeping llama.cpp behind llama-swap. This measures it.
#
# Method. Serves the identical GGUF on each engine in isolation (the
# other GGUF engine + 235b are stopped so nothing competes for the
# arena), warms up, then runs R raw-completion trials at a fixed decode
# length. Reads each engine's own authoritative timing fields — no
# token-counting guesswork:
# - llama.cpp /completion → .timings.{prompt,predicted}_per_second
# - Ollama /api/generate → {prompt_eval,eval}_{count,duration}
# Uses raw prompts (no chat template) on both for an apples-to-apples
# prompt-in / tokens-out measurement.
#
# Run ON THE BOX (hits localhost + docker). Requires jq, curl, and bc.
#
# Usage:
# bench-engines # bench the model llama.yml serves
# bench-engines status # show what's currently GPU-resident
# BENCH_RUNS=5 bench-engines # more trials (default 3)
#
# To bench a different model (e.g. Qwen3.6-27B): point compose/llama.yml
# at the new GGUF, set GGUF below to match, redeploy, rerun.
set -euo pipefail
COMPOSE_ROOT="/srv/docker"
RUNS="${BENCH_RUNS:-3}"
N_PREDICT="${BENCH_N_PREDICT:-256}"
WAIT_TIMEOUT="${BENCH_WAIT_TIMEOUT:-600}"
# Must match the GGUF that compose/llama.yml serves — this is the file
# registered into Ollama so both engines run identical weights. The
# path is the in-container path (/models is bind-mounted into both).
GGUF="${BENCH_GGUF:-/models/qwen/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf}"
OLLAMA_BENCH_MODEL="bench-engines"
# A fixed, moderately long prompt so prefill is measurable. Decode is the
# number that actually decides the consolidation (bandwidth-bound).
read -r -d '' PROMPT <<'EOF' || true
You are a careful systems engineer. Explain, in detail and step by step,
how a unified-memory APU shares a single physical RAM pool between the
CPU and an integrated GPU, what a GTT aperture is, why demand paging
matters for large language model weights, and how this differs from a
discrete GPU with dedicated VRAM. Be thorough and precise.
EOF
LLAMA_URL="http://127.0.0.1:8080"
OLLAMA_URL="http://127.0.0.1:11434"
need() { command -v "$1" >/dev/null 2>&1 || { echo "bench-engines: missing '$1'" >&2; exit 1; }; }
need jq
need curl
is_running() { docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null | grep -q true; }
is_healthy() { curl -fsS --max-time 5 "$1" >/dev/null 2>&1; }
down() {
local dir="$COMPOSE_ROOT/$1"
is_running "$1" || return 0
echo " stopping $1"
(cd "$dir" && docker compose down >/dev/null 2>&1)
}
up_wait() {
local svc="$1" health="$2" deadline=$(( SECONDS + WAIT_TIMEOUT ))
echo " starting $svc"
(cd "$COMPOSE_ROOT/$svc" && docker compose up -d >/dev/null 2>&1)
printf " waiting for health"
while ! is_healthy "$health"; do
(( SECONDS > deadline )) && { echo " TIMEOUT"; docker logs --tail 20 "$svc" >&2; exit 1; }
sleep 5; printf "."
done
echo " ok"
}
# --- isolation: only the engine under test is GPU-resident ------------
isolate_for() {
case "$1" in
llama) down ollama; down qwen3-235b; down kimi-linear; down qwable ;;
ollama) down llama; down qwen3-235b; down kimi-linear; down qwable ;;
esac
}
# --- register the GGUF into Ollama (idempotent) -----------------------
register_ollama_model() {
if docker exec ollama ollama list 2>/dev/null | grep -q "^${OLLAMA_BENCH_MODEL}"; then
echo " ollama model '${OLLAMA_BENCH_MODEL}' already registered"
return 0
fi
echo " registering ${OLLAMA_BENCH_MODEL} from ${GGUF}"
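# FROM the in-container GGUF path. Only num_ctx is pinned here to match
# llama.yml; the KV cache type is not set by this Modelfile, so check it
# separately if the comparison needs identical KV settings.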
printf 'FROM %s\nPARAMETER num_ctx 65536\n' "$GGUF" \
| docker exec -i ollama ollama create "${OLLAMA_BENCH_MODEL}" -f -
}
# --- one trial; echoes "prefill_tps decode_tps" -----------------------
trial_llama() {
local body resp
body=$(jq -n --arg p "$PROMPT" --argjson n "$N_PREDICT" \
'{prompt:$p, n_predict:$n, temperature:0, cache_prompt:false}')
resp=$(curl -fsS --max-time 300 "$LLAMA_URL/completion" \
-H 'Content-Type: application/json' -d "$body")
echo "$resp" | jq -r '"\(.timings.prompt_per_second) \(.timings.predicted_per_second)"'
}
trial_ollama() {
local body resp
body=$(jq -n --arg m "$OLLAMA_BENCH_MODEL" --arg p "$PROMPT" --argjson n "$N_PREDICT" \
'{model:$m, prompt:$p, raw:true, stream:false, options:{temperature:0, num_predict:$n}}')
resp=$(curl -fsS --max-time 300 "$OLLAMA_URL/api/generate" \
-H 'Content-Type: application/json' -d "$body")
# durations are ns; t/s = count / (duration/1e9)
echo "$resp" | jq -r '
"\(.prompt_eval_count / (.prompt_eval_duration/1e9)) \(.eval_count / (.eval_duration/1e9))"'
}
# --- run R trials, print per-trial + mean decode ----------------------
bench() {
local engine="$1" trialfn="$2"
echo " warmup..."; "$trialfn" >/dev/null
local sum_pp=0 sum_tg=0
for i in $(seq 1 "$RUNS"); do
read -r pp tg < <("$trialfn")
printf " trial %d: prefill %6.1f t/s decode %6.2f t/s\n" "$i" "$pp" "$tg"
sum_pp=$(echo "$sum_pp + $pp" | bc -l)
sum_tg=$(echo "$sum_tg + $tg" | bc -l)
done
MEAN_PP=$(echo "scale=1; $sum_pp / $RUNS" | bc -l)
MEAN_TG=$(echo "scale=2; $sum_tg / $RUNS" | bc -l)
printf " %s mean: prefill %s t/s decode %s t/s\n" "$engine" "$MEAN_PP" "$MEAN_TG"
}
if [[ "${1:-}" == "status" ]]; then
for c in ollama llama kimi-linear qwen3-235b; do
is_running "$c" && echo "$c: up" || echo "$c: down"
done
exit 0
fi
need bc
echo "== llama.cpp (kyuz0 ${GGUF##*/}) =="
isolate_for llama
up_wait llama "$LLAMA_URL/health"
bench "llama.cpp" trial_llama
LLAMA_TG="$MEAN_TG"
down llama
echo
echo "== Ollama (same GGUF) =="
isolate_for ollama
up_wait ollama "$OLLAMA_URL/api/tags"
register_ollama_model
bench "ollama" trial_ollama
OLLAMA_TG="$MEAN_TG"
echo
echo "== Verdict =="
# Ollama as % of llama.cpp decode throughput.
PCT=$(echo "scale=1; 100 * $OLLAMA_TG / $LLAMA_TG" | bc -l)
printf " llama.cpp decode: %s t/s\n ollama decode: %s t/s (%s%% of llama.cpp)\n" \
"$LLAMA_TG" "$OLLAMA_TG" "$PCT"
echo
echo " Guidance: Ollama >=85% of llama.cpp -> option 1 (Ollama + vLLM,"
echo " drop standalone llama.cpp; Ollama self-swaps, no llama-swap)."
echo " Larger gap -> option 2 (keep llama.cpp"
echo " behind llama-swap with coexistence groups; drop Ollama)."

@@ -0,0 +1,186 @@
#!/usr/bin/env bash
# swap-model — coordinate which inference container is GPU-resident on
# the Strix Halo box.
#
# Why this exists. The GPU's merged ~110 GB arena (BIOS UMA=0.5 GB +
# ttm.pages_limit + HSA_XNACK; see StrixHaloMemory.md) holds at most
# one 88 GB-class model at a time, and ROCm doesn't reclaim cleanly
# between consumers. So switching models means stop-then-start of
# whole compose stacks. This script encodes the per-target conflict
# table + per-service health probes so the swap is one command.
#
# Usage:
# swap-model coder # Qwen3-Coder-30B via Ollama (interactive)
# swap-model 235b # Qwen3-235B-A22B via llama.cpp (long-task)
# swap-model kimi # Kimi-Linear-48B-A3B via vLLM (long-context)
# swap-model qwable # Qwable-3.6-27B via llama.cpp (Fable-style)
# swap-model comfyui # ComfyUI (image generation)
# swap-model none # everything down — free the GPU
# swap-model status # show what's currently up
#
# Env knobs:
# SWAP_WAIT_TIMEOUT seconds to wait for /health after up; default 600
# (235B's 88 GB cold load can take 3-5 min)
#
# Out of scope (deliberately):
# - Always-on services (openwebui, litellm, phoenix, beszel, etc.) —
# no GPU footprint, left alone.
# - llama.cpp 30B (port 8080) — same weights as Ollama's qwen3-coder
# but still LL-P0 perf-evaluating. `coder` target uses Ollama only.
# - Multi-target combos (e.g. kimi+ollama coexist on the arena);
# for now run swap-model twice if you want both.
set -euo pipefail
COMPOSE_ROOT="/srv/docker"
WAIT_TIMEOUT="${SWAP_WAIT_TIMEOUT:-600}"
# --- Service table -----------------------------------------------------------
# Map short name → compose dir (under $COMPOSE_ROOT) and health URL.
# Container name == compose dir name in every case (intentional convention,
# enforced in compose/*.yml's container_name fields).
declare -A SVC_DIR=(
[ollama]=ollama
[llama]=llama
[kimi]=kimi-linear
[235b]=qwen3-235b
[qwable]=qwable
[comfyui]=comfyui
)
declare -A SVC_HEALTH=(
[ollama]="http://127.0.0.1:11434/api/tags"
[llama]="http://127.0.0.1:8080/health"
[kimi]="http://127.0.0.1:8000/v1/models"
[235b]="http://127.0.0.1:8081/health"
[qwable]="http://127.0.0.1:8082/health"
[comfyui]="http://127.0.0.1:8188/"
)
# --- Target → plan -----------------------------------------------------------
# UP = services that should be running after the swap
# DOWN = services that must be stopped to free the GPU arena
# (anything not in either list is left untouched — e.g. switching to coder
# leaves kimi alone, since kimi(30 GB) + ollama(30 GB) fit in the arena.)
plan() {
UP=() ; DOWN=()
case "$1" in
coder) UP=(ollama) ; DOWN=(235b comfyui) ;;
235b) UP=(235b) ; DOWN=(ollama llama kimi qwable comfyui) ;;
kimi) UP=(kimi) ; DOWN=(235b comfyui) ;;
qwable) UP=(qwable) ; DOWN=(235b comfyui) ;;
comfyui) UP=(comfyui) ; DOWN=(235b kimi qwable) ;;
none) UP=() ; DOWN=(ollama llama kimi 235b qwable comfyui) ;;
*) return 1 ;;
esac
}
# --- Probes ------------------------------------------------------------------
is_running() {
docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null | grep -q true
}
is_healthy() {
curl -fsS --max-time 5 "$1" >/dev/null 2>&1
}
wait_healthy() {
local svc="$1" url="${SVC_HEALTH[$1]}" deadline=$(( SECONDS + WAIT_TIMEOUT ))
printf " waiting for %s health (timeout %ss)" "$svc" "$WAIT_TIMEOUT"
while ! is_healthy "$url"; do
if (( SECONDS > deadline )); then
printf " TIMEOUT\n"
echo " last 20 lines of container log:" >&2
docker logs --tail 20 "${SVC_DIR[$svc]}" 2>&1 | sed 's/^/ /' >&2
return 1
fi
sleep 5
printf "."
done
printf " ok\n"
}
# --- Actions -----------------------------------------------------------------
down_svc() {
local svc="$1" dir="$COMPOSE_ROOT/${SVC_DIR[$1]}"
if ! is_running "${SVC_DIR[$svc]}"; then
echo " $svc: already down"
return 0
fi
echo " stopping $svc"
(cd "$dir" && docker compose down)
}
up_svc() {
local svc="$1" dir="$COMPOSE_ROOT/${SVC_DIR[$1]}"
if is_running "${SVC_DIR[$svc]}" && is_healthy "${SVC_HEALTH[$svc]}"; then
echo " $svc: already up + healthy"
return 0
fi
if [[ ! -d "$dir" ]]; then
echo " $svc: compose dir $dir missing — run pyinfra deploy first" >&2
return 1
fi
echo " starting $svc"
(cd "$dir" && docker compose up -d)
wait_healthy "$svc"
}
show_status() {
echo "Inference services:"
for svc in ollama llama kimi 235b qwable comfyui; do
local container="${SVC_DIR[$svc]}" state="down" health=""
if is_running "$container"; then
state="up"
if is_healthy "${SVC_HEALTH[$svc]}"; then
health=" (healthy)"
else
health=" (starting/unhealthy)"
fi
fi
printf " %-8s %s%s\n" "$svc" "$state" "$health"
done
}
usage() {
cat <<EOF
swap-model — coordinate inference containers on Strix Halo.
Usage:
swap-model coder # Qwen3-Coder-30B (Ollama, interactive daily-driver)
swap-model 235b # Qwen3-235B-A22B (llama.cpp, long-task, ~5-10 tok/s)
swap-model kimi # Kimi-Linear-48B (vLLM, long-context chat)
swap-model qwable # Qwable-3.6-27B (llama.cpp, Fable-style, ~10-15 tok/s)
swap-model comfyui # ComfyUI (image generation)
swap-model none # everything down (free the GPU arena)
swap-model status # show current state
Behaviour: stops conflicting services (frees the 110 GB GPU arena),
starts the target, polls its /health until it returns 200. Wait timeout
defaults to ${WAIT_TIMEOUT}s; override with SWAP_WAIT_TIMEOUT.
Coexistence: ollama(30B), kimi, and qwable(27B, 16.5 GB) coexist with
each other. 235B coexists with nothing. The comfyui target stops 235B,
kimi, and qwable but leaves ollama running; the coder target stops
comfyui. See compose/qwen3-235b/README.md for arena math.
EOF
}
# --- Main --------------------------------------------------------------------
TARGET="${1:-}"
case "$TARGET" in
coder|235b|kimi|qwable|comfyui|none) ;;
status) show_status ; exit 0 ;;
-h|--help|help|"") usage ; exit 0 ;;
*)
echo "swap-model: unknown target '$TARGET'" >&2
echo "Try: swap-model help" >&2
exit 2
;;
esac
plan "$TARGET"
echo "Plan: down=[${DOWN[*]:-}] up=[${UP[*]:-}]"
for svc in "${DOWN[@]}"; do down_svc "$svc"; done
for svc in "${UP[@]}"; do up_svc "$svc"; done
echo
show_status