added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions


@@ -0,0 +1,55 @@
# code-server — VS Code in the browser, served from the box. The point:
# Claude Code (extension + CLI) runs *inside this container*, so long
# agent tasks keep running when the laptop sleeps. Any device on the
# tailnet gets the full VS Code experience at http://framework:8443.
#
# linuxserver image over codercom/code-server for the PUID/PGID
# convention (matches host `noise` UID 1000 so workspace files don't
# come out root-owned) and the s6 init that chowns /config.
#
# Extensions come from Open VSX (code-server's default registry). The
# official Claude Code extension is published there:
# https://open-vsx.org/extension/Anthropic/claude-code
#
# Persistent state: /config is the container user's $HOME — extensions,
# settings, ~/.claude (OAuth creds + session history), ~/.local/bin
# (claude CLI). Survives container recreation; back up one path.
services:
  code-server:
    image: lscr.io/linuxserver/code-server:latest
    container_name: code-server
    restart: unless-stopped
    # No PASSWORD env set — auth is disabled, same trust model as every
    # other service here: reachable over Tailscale only. If this host
    # ever sees LAN/internet traffic, set HASHED_PASSWORD (and even then
    # prefer "127.0.0.1:8443:8443" + tunnel — a browser terminal is a
    # shell on the box's docker network).
    ports:
      - "8443:8443"
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      # Open this folder by default instead of the "welcome" screen.
      - DEFAULT_WORKSPACE=/workspace
      # tmux inside the container: detachable terminals for long claude
      # runs, so a browser-tab close or code-server reconnect hiccup
      # can't orphan your view of a session (the process itself never
      # depended on the tab — it's a child of the container).
      - DOCKER_MODS=linuxserver/mods:universal-package-install
      - INSTALL_PACKAGES=tmux
    extra_hosts:
      # Reach the sibling services (Ollama :11434, LiteLLM :4000,
      # Phoenix :6006, ...) from the integrated terminal / extensions.
      - "host.docker.internal:host-gateway"
    volumes:
      - /srv/docker/code-server/config:/config
      # Default project area. Container-scoped on purpose — Claude Code
      # in here can't touch /srv/docker compose stacks or the host.
      # Mount additional repos explicitly when you want them editable:
      #   - /home/noise/src/somerepo:/workspace/somerepo
      - /srv/docker/code-server/workspace:/workspace


@@ -0,0 +1,85 @@
# code-server — VS Code in the browser
Full VS Code served from the box at <http://framework:8443>. The reason
it exists in this stack: **Claude Code runs inside the container**, so
long agent tasks execute on the Framework Desktop and survive the
laptop sleeping, the browser tab closing, or you walking away with the
phone. Any device on the tailnet gets the same editor with the same
state.
## Bring-up
```sh
cd /srv/docker/code-server && docker compose up -d
```
First start pulls the image and runs the universal-package-install mod
(installs tmux), so give it a minute before the UI answers on :8443.
> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel included) need a secure context; over plain
> `http://framework:8443` they render blank. Serve it via Tailscale:
> `sudo tailscale serve --bg --https=8443 8443` →
> `https://framework.<tailnet>.ts.net:8443`. Localhost also counts as
> secure (`ssh -L 8443:localhost:8443 framework`).
## One-time setup (in the code-server UI)
1. **Install the Claude Code extension.** Extensions panel → search
"Claude Code". code-server uses the Open VSX registry, where
Anthropic publishes the official extension
(<https://open-vsx.org/extension/Anthropic/claude-code>).
2. **Sign in.** The OAuth flow in a browser context is a copy/paste
dance: the extension shows a URL → open it in another tab → approve
→ paste the code back. Credentials land in `/config/.claude/` and
persist across container recreates.
3. **(Optional) CLI in the terminal.** The extension bundles its own
CLI, but for tmux-based long runs install it explicitly:
```sh
curl -fsSL https://claude.ai/install.sh | bash
```
Lands in `~/.local/bin` (= `/config/.local/bin`, persisted).
## Long-running tasks
The Claude process is a child of the container, not of your browser
tab — closing the tab does nothing to it. Reattach from any device and
the session is where you left it. For multi-hour unattended runs,
prefer a tmux pane in the integrated terminal:
```sh
tmux new -s longtask
claude # kick off the task, detach with C-b d
```
If you want to steer from a phone or claude.ai/code instead of this UI,
use `claude remote-control` (see the framework README's
"Claude Code on the box" section) — that works in the host's tmux too,
no container needed.
## Scope and security
- **No password is set** — same trust model as every other service on
this Tailscale-only box. If the host ever sees LAN/internet traffic,
set `HASHED_PASSWORD` in the compose env and bind to localhost; a
browser terminal is a shell.
- The workspace is **container-scoped on purpose**: Claude Code in here
sees `/workspace` and `/config`, not the host filesystem or the
docker socket. Bind-mount specific repos into `/workspace/<name>`
when you want them editable (host UID 1000 == container PUID 1000,
so ownership just works).
- `host.docker.internal` reaches the sibling services — handy for
pointing scripts at Ollama (:11434) or LiteLLM (:4000) from the
integrated terminal.
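A quick reachability check from the integrated terminal might look like the sketch below. The `probe` helper is hypothetical (not part of this repo). The Ollama path `/api/tags` lists local models; the LiteLLM path `/health/liveliness` is assumed here and should be checked against the deployed version.

```sh
# Hypothetical helper: report whether a sibling service answers over HTTP.
probe() {
  # $1 = label, $2 = URL; prints "<label>: up" or "<label>: down"
  if curl -fsS --max-time 3 -o /dev/null "$2" 2>/dev/null; then
    echo "$1: up"
  else
    echo "$1: down"
  fi
}

probe ollama  http://host.docker.internal:11434/api/tags
probe litellm http://host.docker.internal:4000/health/liveliness   # assumed path
```

A "down" only means the HTTP request failed from inside this container; it says nothing about why (service stopped, wrong port, or `host-gateway` not resolving).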
## State layout
| Path (host) | What lives there |
| -------------------------------------- | ------------------------------------------------- |
| `/srv/docker/code-server/config` | `$HOME`: extensions, settings, `~/.claude`, CLIs |
| `/srv/docker/code-server/workspace` | default project area (`DEFAULT_WORKSPACE`) |
Back up `config` if you care about extension state and Claude session
history; everything else is reproducible from the repo.
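One way to take that backup is a plain tarball of the `config` directory. This is a minimal sketch: the `backup_dir` helper and the output location are illustrative choices, not something the repo ships.

```sh
# Hypothetical helper: archive one subdirectory of a parent directory
# into a gzip-compressed tarball.
backup_dir() {
  # $1 = parent directory, $2 = subdirectory name, $3 = output archive
  tar -czf "$3" -C "$1" "$2"
}

# On the box, as a user that can read the UID-1000 files:
# backup_dir /srv/docker/code-server config "$HOME/code-server-config-$(date +%F).tar.gz"
```

The archive contains `~/.claude` (OAuth credentials), so treat it like any other secret.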


@@ -0,0 +1,64 @@
# Coder — self-hosted workspace manager (PILOT). Web dashboard at :7080
# that stamps out per-project dev containers from Terraform templates;
# each workspace gets browser code-server with the Claude Code extension
# pre-installed. Evaluating against the standalone code-server stack
# (compose/code-server.yml, kept as-is during the pilot) — verdict
# criteria in compose/coder/README.md.
#
# The server container holds the host docker socket and spawns workspace
# containers as siblings (same pattern + same security posture as
# OpenHands: fine Tailscale-only, never expose further). group_add needs
# the socket's host GID — host-specific, so it comes from the sibling
# .env, which pyinfra generates on the box on first deploy along with a
# random one-time Postgres password.
#
# Workspaces don't publish host ports — code-server and terminals are
# reached through the dashboard's tunnel. No port bookkeeping per
# project.
services:
  coder:
    # Pin and bump deliberately — Coder releases weekly and the workspace
    # provisioner/agent protocol moves with it. Verify at
    # https://github.com/coder/coder/releases.
    image: ghcr.io/coder/coder:v2.33.8
    container_name: coder
    restart: unless-stopped
    ports:
      - "7080:7080"
    environment:
      CODER_HTTP_ADDRESS: "0.0.0.0:7080"
      # The URL clients are told to use — tailnet MagicDNS name, same
      # convention as every other service tile.
      CODER_ACCESS_URL: "${CODER_ACCESS_URL}"
      CODER_PG_CONNECTION_URL: "postgresql://coder:${POSTGRES_PASSWORD}@database/coder?sslmode=disable"
    group_add:
      # GID of /var/run/docker.sock — generated into .env by deploy.py.
      - "${DOCKER_GROUP_ID}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # Repo-shipped Terraform templates (source of truth:
      # pyinfra/framework/compose/coder/templates/). Push after changes:
      #   docker compose exec coder coder templates push code-server \
      #     --directory /templates/code-server --yes
      - /srv/docker/coder/templates:/templates:ro
    depends_on:
      database:
        condition: service_healthy
  database:
    image: postgres:17
    container_name: coder-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: coder
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: coder
    volumes:
      # Entrypoint starts as root and chowns this to the postgres uid;
      # deploy.py just creates the mount point.
      - /srv/docker/coder/postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U coder -d coder"]
      interval: 5s
      timeout: 5s
      retries: 5


@@ -0,0 +1,110 @@
# Coder — workspace manager (pilot)
Self-hosted control plane at <http://framework:7080> that creates
per-project dev containers from Terraform templates. Each workspace:
its own container, browser code-server with the Claude Code extension
pre-installed, no host-port bookkeeping (everything tunnels through the
dashboard), and idle autostop so parked projects give their RAM back to
the inference stacks.
**Pilot status.** Evaluating against the standalone code-server stack
(`/srv/docker/code-server`, kept as-is meanwhile). Verdict after a week
of real use: does the dashboard + autostop + create-from-template earn
an always-on control plane + Postgres? If yes, the standalone stack
retires; if no, Coder goes and a thin `workspace` spawner script
replaces it.
## Bring-up
```sh
cd /srv/docker/coder && docker compose up -d
```
The sibling `.env` (DOCKER_GROUP_ID, CODER_ACCESS_URL, random Postgres
password) is generated by deploy.py on first `./run.sh` — no hand-fill
needed. First visit to <http://framework:7080> creates the admin
account (pick anything; it's local to the box).
> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel, markdown preview, etc.) run on service workers,
> which browsers only allow in a secure context — over plain
> `http://framework:7080` the editor works but webview panels render
> blank. Fix via Tailscale Serve (real Let's Encrypt cert for the
> tailnet name; enable "HTTPS Certificates" once in the Tailscale
> admin console):
>
> ```sh
> sudo tailscale serve --bg 7080
> # then in .env: CODER_ACCESS_URL=https://framework.<tailnet>.ts.net
> docker compose up -d # recreate server with new access URL
> # restart any existing workspace — app URLs derive from the access URL
> ```
>
> Localhost is also a secure context, so
> `ssh -L 7080:localhost:7080 framework` + http://localhost:7080 works
> in a pinch.
## Push the template (one-time + after edits)
The `coder` CLI ships inside the server image; authenticate it once:
```sh
docker compose exec coder coder login http://localhost:7080
# prints a /cli-auth URL — open http://framework:7080/cli-auth in your
# browser, copy the session token, paste it back
```
Then push:
```sh
docker compose exec coder coder templates push code-server \
--directory /templates/code-server --yes
```
Template source of truth is the repo
(`pyinfra/framework/compose/coder/templates/code-server/main.tf`) —
edit there, `./run.sh`, re-push. Edits on the box get overwritten.
## First workspace
Dashboard → Workspaces → Create → `code-server` template → name it
after the project. Open the code-server app tile, then one-time
Claude sign-in: the extension shows an OAuth URL → open in another tab
→ approve → paste the code back. Credentials live in `~/.claude` on
the workspace's home volume and survive stop/start and rebuilds.
Set **idle autostop** under Template → Settings → Schedule (suggest
12 h inactivity). Activity = open code-server tab, SSH, web terminal.
## What persists where
| Thing | Where |
| -------------------------------- | -------------------------------------------- |
| Workspace home (repos, ~/.claude, extensions) | named volume `coder-<workspace-id>-home` |
| Control-plane state (users, templates, workspace defs) | Postgres → `/srv/docker/coder/postgres` |
| Template source | this repo, shipped to `/srv/docker/coder/templates` |
A stopped workspace's container is deleted; only the home volume
remains. `docker volume ls | grep coder-` to audit.
## Security
Same posture as OpenHands: the server container holds the docker
socket and spawns code-running containers — root-equivalent on the
box. Tailscale-only exposure is the mitigation; never forward :7080
anywhere else. Coder does have real auth (the admin account), but
treat that as defense-in-depth, not as permission to expose it.
## Notes
- Workspaces reach the sibling services via `host.docker.internal`
(Ollama :11434, LiteLLM :4000, Phoenix :6006, ...).
- Long Claude runs: same rules as anywhere — the process lives in the
workspace container, so it survives laptop/browser disconnects, but
**autostop will kill an idle-looking workspace mid-run**. For
multi-hour unattended tasks either bump the workspace's TTL in the
dashboard or use the host-side tmux + `claude remote-control`
pattern (framework README, "Claude Code on the box").
- The base image is `codercom/enterprise-base:ubuntu` (sudo-enabled,
common toolchain). Per-project images are a later refinement —
swap the `image` in main.tf or parameterize with `coder_parameter`.


@@ -0,0 +1,121 @@
# code-server workspace template — one dev container per project, with
# browser VS Code and Claude Code (extension + CLI) ready on first open.
#
# Source of truth is the repo copy at
# pyinfra/framework/compose/coder/templates/code-server/main.tf; pyinfra
# ships it to /srv/docker/coder/templates/, mounted read-only into the
# server container at /templates. Push after edits:
# cd /srv/docker/coder
# docker compose exec coder coder templates push code-server \
# --directory /templates/code-server --yes
terraform {
required_providers {
coder = {
source = "coder/coder"
}
docker = {
source = "kreuzwerker/docker"
}
}
}
provider "coder" {}
# Talks to the host daemon via the socket mounted into the server
# container (default unix:///var/run/docker.sock) — workspace containers
# are siblings of the compose stacks, not children.
provider "docker" {}
data "coder_workspace" "me" {}
data "coder_workspace_owner" "me" {}
resource "coder_agent" "main" {
arch = "amd64"
os = "linux"
# Claude Code CLI — native installer, lands in ~/.local/bin inside the
# persisted home volume, so this is a no-op after the first start. The
# code-server extension bundles its own CLI, but having `claude` on
# PATH enables tmux-based long runs in the workspace terminal.
startup_script = <<-EOT
set -e
command -v claude >/dev/null 2>&1 || curl -fsSL https://claude.ai/install.sh | bash
EOT
env = {
GIT_AUTHOR_NAME = data.coder_workspace_owner.me.full_name
GIT_AUTHOR_EMAIL = data.coder_workspace_owner.me.email
GIT_COMMITTER_NAME = data.coder_workspace_owner.me.full_name
GIT_COMMITTER_EMAIL = data.coder_workspace_owner.me.email
}
metadata {
display_name = "CPU"
key = "cpu"
script = "coder stat cpu"
interval = 10
timeout = 1
}
metadata {
display_name = "RAM"
key = "mem"
script = "coder stat mem"
interval = 10
timeout = 1
}
}
# Browser VS Code inside the workspace, surfaced as a dashboard app.
# Extensions install from Open VSX — anthropic.claude-code is the
# official Claude Code extension
# (https://open-vsx.org/extension/Anthropic/claude-code). OAuth creds
# land in ~/.claude inside the home volume and survive rebuilds.
module "code_server" {
count = data.coder_workspace.me.start_count
source = "registry.coder.com/coder/code-server/coder"
version = "~> 1.0"
agent_id = coder_agent.main.id
folder = "/home/coder/project"
extensions = [
"anthropic.claude-code",
]
}
# Home survives workspace stop/start AND template-driven rebuilds —
# ignore_changes keeps Terraform from recreating the volume (and wiping
# ~/.claude, extensions, repos) when template metadata shifts.
resource "docker_volume" "home" {
name = "coder-${data.coder_workspace.me.id}-home"
lifecycle {
ignore_changes = all
}
}
resource "docker_container" "workspace" {
# start_count is 0 when the workspace is stopped — the container is
# deleted but the home volume above persists. This is what idle
# autostop reclaims: RAM back to the inference stacks.
count = data.coder_workspace.me.start_count
image = "codercom/enterprise-base:ubuntu"
name = "coder-${data.coder_workspace_owner.me.name}-${lower(data.coder_workspace.me.name)}"
hostname = data.coder_workspace.me.name
# Agent bootstrap, rewritten to reach the control plane through the
# docker bridge (the workspace is a sibling container; "localhost"
# inside it isn't the Coder server).
entrypoint = ["sh", "-c", replace(coder_agent.main.init_script, "/localhost|127\\.0\\.0\\.1/", "host.docker.internal")]
env = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
host {
host = "host.docker.internal"
ip = "host-gateway"
}
volumes {
container_path = "/home/coder"
volume_name = docker_volume.home.name
read_only = false
}
}


@@ -95,6 +95,28 @@
server: localhost-docker
container: openhands
- code-server:
icon: code-server.svg
href: http://framework:8443
description: VS Code in the browser — Claude Code long tasks live here
server: localhost-docker
container: code-server
- Coder:
icon: coder.svg
href: http://framework:7080
description: Workspace manager (pilot) — per-project containers from templates
server: localhost-docker
container: coder
# /api/v2/buildinfo is unauthenticated — cheap liveness + version.
widget:
type: customapi
url: http://host.docker.internal:7080/api/v2/buildinfo
refreshInterval: 30000
mappings:
- field: version
label: Version
- Observability:
- Beszel:
icon: beszel.svg


@@ -0,0 +1,102 @@
# Qwable-3.6-27B (Qwen3.6-27B fine-tuned on Fable-5-style reasoning
# traces — "Qwen + Fable") via the kyuz0 rocm-7.2.2 Strix Halo toolbox.
# Same image + unified-memory recipe as compose/llama.yml; deltas are
# model path, port, alias.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
# Model: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b (MIT)
#
# What it's for. A "thinks-like-Fable-5" interactive model — structured,
# step-by-step explanatory output. Dense 27B (NOT MoE), so it's slower
# per token than the 30B-A3B MoE workhorses despite being smaller on
# disk: all 27B weights load per token. Bandwidth math (256 GB/s ÷
# ~16.5 GB) → ~10-15 tok/s decode. Interactive but not snappy.
#
# Coexistence. At ~16.5 GB (Q4_K_M) it's the smallest GPU resident here
# and fits alongside llama 30B (port 8080), Ollama, or Kimi in the
# ~110 GB merged arena. It does NOT fit alongside qwen3-235b (88.8 GB)
# or comfyui — swap-model tears those down for the `qwable` target.
# `restart: "no"`: you bring it up deliberately via swap-model, it won't
# auto-start after a reboot and surprise-collide with a big model.
#
# Weights. Single-file GGUF (not sharded). Download path on the box
# (see compose/qwable/README.md):
# hf download Mia-AiLab/Qwable-3.6-27b \
# 'Qwable-27b_Q4_K_M.gguf' \
# --local-dir /models/qwen/Qwable-3.6-27b
# Verify exact filename in the HF repo before downloading.
#
# Port 8082 — distinct from llama 30B (8080) and qwen3-235b (8081).
services:
  qwable:
    image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2
    container_name: qwable
    # Manual start only — see header note about GPU contention with
    # the big models. swap-model brings it up/down.
    restart: "no"
    devices:
      # ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan
      # only needs dri. Don't drop kfd when on the rocm-* tag.
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    # Numeric GIDs of host's video (44) and render (991) groups —
    # required for /dev/kfd + /dev/dri access from inside the container.
    group_add:
      - "44"
      - "991"
    shm_size: 8g
    ipc: host
    environment:
      # Unified-memory recipe (same as compose/llama.yml + kimi-linear +
      # qwen3-235b). BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these
      # flags merge the rocminfo pools into one ~110 GB arena. kyuz0's
      # image is native gfx1151 so no HSA_OVERRIDE_GFX_VERSION.
      - HSA_XNACK=1
      - HSA_FORCE_FINE_GRAIN_PCIE=1
    volumes:
      - /models:/models:ro
    ports:
      - "8082:8082"
    entrypoint: ["llama-server"]
    command:
      - --model
      - /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf
      # OpenAI-compatible served name (matches what opencode/curl request
      # as "model"). Provider-side name lives in opencode.json if/when
      # this gets wired as a provider.
      - --alias
      - qwable
      - --host
      - 0.0.0.0
      - --port
      - "8082"
      # Push all layers to GPU. "999" = all available. A 27B Q4 (~16.5 GB)
      # fits the merged arena with huge headroom.
      - --n-gpu-layers
      - "999"
      # 64K to match llama/qwen3-235b — keeps opencode auto-compaction
      # behaviour consistent across providers. Tons of arena headroom
      # here (model is small), so this can ramp far higher if a workflow
      # needs it; see compose/qwable/README.md.
      - --ctx-size
      - "65536"
      # No-mmap is the Strix Halo standard — forces full GPU load.
      - --no-mmap
      # Flash attention — required for q8_0 KV cache; modern llama-server
      # takes a value (on/off/auto), bare --flash-attn is deprecated.
      - --flash-attn
      - "on"
      # Quantize KV cache to int8 — halves KV memory at minor/no quality
      # loss. Matches the other llama.cpp stacks.
      - --cache-type-k
      - q8_0
      - --cache-type-v
      - q8_0
      # Use the model's embedded jinja chat template — Qwable inherits
      # Qwen3.6's chat format, which the Fable-trace fine-tune relies on.
      - --jinja
      # Expose Prometheus metrics at /metrics — scraped by OpenLIT.
      - --metrics


@@ -0,0 +1,130 @@
# qwable
Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
collected examples formatted like Fable 5's deliberate, step-by-step
answers and trained Qwen to reproduce that structured, explanatory
output. Think of it as a local "thinks-like-Fable" model.
OpenAI-compatible endpoint at `http://framework:8082` once running.
## Dense, not MoE (read first)
Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
**dense** 27B — every weight loads per token. On this bandwidth-bound
box (256 GB/s ÷ ~16.5 GB) that's **~10-15 tok/s** decode, slower than
the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you
specifically want the Fable-style reasoning, not for raw throughput.
The interactive daily driver stays on Ollama / llama 30B.
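The bandwidth math above, spelled out as a rough sketch. It assumes decode is purely memory-bound and every weight is read once per token, so the result is a ceiling rather than a prediction. `decode_ceiling` is an illustrative helper, not something in the repo.

```sh
# Upper bound on decode speed for a dense model:
#   tok/s <= memory bandwidth / resident weight size
decode_ceiling() {
  # $1 = memory bandwidth in GB/s, $2 = weights in GB
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

decode_ceiling 256 16.5   # figures from this README; prints 15.5
```

Real decode lands under the ceiling once KV-cache reads and kernel overhead are included, which is where the ~10-15 tok/s band comes from.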
## Coexistence notes
At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
| Concurrent service | Coexists? |
|---|---|
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
| `ollama` (11434) | ✅ yes |
| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
| `comfyui` (8188) | ❌ no — swap-model stops it |
`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
it won't auto-start after a reboot and surprise-collide with a big model.
## Prereqs
- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
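For reference, `ttm.pages_limit` is a page count, not a byte count. A small sketch of the conversion, assuming the usual 4 KiB page size on x86-64 (confirm with `getconf PAGESIZE`); `pages_to_gib` is an illustrative helper:

```sh
# Convert a TTM page limit to GiB, assuming 4096-byte pages.
pages_to_gib() {
  echo $(( $1 * 4096 / 1024 / 1024 / 1024 ))
}

pages_to_gib 33554432   # prints 128
```

So the cmdline value caps GPU-addressable memory at 128 GiB; the ~110 GB arena quoted elsewhere sits under that cap.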
## Download weights (~16.5 GB, single file)
```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwable-3.6-27b
hf download Mia-AiLab/Qwable-3.6-27b \
'Qwable-27b_Q4_K_M.gguf' \
--local-dir /models/qwen/Qwable-3.6-27b
# File lands at:
# /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf (~16.5 GB)
```
Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
needs ~17 GB free on `/models`.
> Abliterated variant (refusals removed) lives at
> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
> Not the default — no safety filtering, careful with it.
## Bring up
Easy path — `swap-model` handles stop-conflicting-services + waits for
`/health`:
```sh
ssh framework swap-model qwable # ~1-2 min cold load (16.5 GB)
ssh framework /srv/docker/qwable/smoke.sh # perf measure
```
Manual equivalent (first-ever bring-up, before the image is cached):
```sh
cd /srv/docker/qwable
docker compose pull # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f # wait for "server is listening on http://0.0.0.0:8082"
./smoke.sh # /health + tiny generation + perf
```
First start is ~1-2 min (16.5 GB load off disk; much faster than the
235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
qwen3-235b/README.md "Troubleshooting" for the arena checks).
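The thresholds above can be turned into a quick check. This is a sketch only: `classify_tps` is a hypothetical helper, `12.4` is a made-up sample reading, and the README does not say how to treat values between 6 and 10 or above 15, so those are simply reported as outside the band.

```sh
# Classify a predicted_per_second reading against the bands in this README.
classify_tps() {
  awk -v tps="$1" 'BEGIN {
    if (tps < 6)                    print "investigate";
    else if (tps >= 10 && tps <= 15) print "healthy";
    else                            print "outside-band";
  }'
}

classify_tps 12.4   # sample value, not a measurement
```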
## Ramping context
Defaults to 64K to match the other llama.cpp stacks (keeps opencode
auto-compaction consistent across providers). The model is tiny relative
to the arena, so there's plenty of room to push higher:
| Stage | `--ctx-size` | Margin in arena |
|---|---|---|
| **Current default** | **65536** | huge (~90 GB free) |
| Stretch | 131072+ | still comfortable |
Edit `--ctx-size` in `docker-compose.yml`, run
`docker compose down && docker compose up -d`, then
re-run `./smoke.sh`. The real ceiling is Qwable's trained context length
(inherits Qwen3.6-27B's), not arena memory — verify the model's max
positions before going past 128K.
## Operations
```sh
docker compose logs -f # tail
docker compose down # stop
docker compose exec qwable bash # shell in
./smoke.sh # health + perf
amdgpu_top # GPU view on host
```
## Pin manifest
| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
| Weights | `Mia-AiLab/Qwable-3.6-27b`, file `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
| Default port | 8082 |
| Default context | 65536 |
| KV cache type | q8_0 (k and v) |
| License | MIT (model); Qwen3.6-27B base license also applies |
## Status
Compose artifacts written; awaiting box-side weight pull + bring-up.
Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
provider only if the Fable-style reasoning proves useful in practice.


@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Smoke-test the running qwable llama-server (port 8082). Hits /health
# for liveness, then a tiny OpenAI-compatible chat completion, then
# measures eval_tps via /completion. Dense 27B → expect ~10-15 tok/s.
set -euo pipefail
HOST="${QWABLE_HOST:-127.0.0.1:8082}"
MODEL="${QWABLE_MODEL:-qwable}"
echo "[smoke] GET /health on $HOST"
curl -fsS "http://$HOST/health" | python3 -m json.tool
echo
echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
curl -fsS "http://$HOST/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d "{
\"model\": \"$MODEL\",
\"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
\"max_tokens\": 16,
\"temperature\": 0.0
}" | python3 -m json.tool
echo
echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=128)"
# 128 tokens: long enough to average out per-token warmup noise at
# ~10-15 tok/s. A dense 27B settles faster than the 235B, so there is
# no need to cap the sample at 64 tokens.
curl -fsS "http://$HOST/completion" \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
"n_predict": 128,
"temperature": 0.0,
"stream": false
}' | python3 -c "
import json, sys
r = json.load(sys.stdin)
t = r.get('timings', {})
# Default to NaN so the :.2f format cannot fail on a missing key.
print(f'predicted_per_second: {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s')
print(f'prompt_per_second: {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s')
print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}')
print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}')
"
echo
echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"


@@ -53,6 +53,15 @@ Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect
## Bring up (M0.2 — first generation)
Easy path — `swap-model` handles the stop-conflicting-services dance + waits for `/health`:
```sh
ssh framework swap-model 235b # ~3-5 min on first cold load
ssh framework /srv/docker/qwen3-235b/smoke.sh # perf measure
```
Manual equivalent (for first-ever bring-up before the image is cached):
```sh
cd /srv/docker/qwen3-235b
docker compose pull # already-cached image if you ran llama first