added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions
--- a/pyinfra/framework/compose/code-server.yml
+++ b/pyinfra/framework/compose/code-server.yml
@@ -0,0 +1,55 @@
+# code-server — VS Code in the browser, served from the box. The point:
+# Claude Code (extension + CLI) runs *inside this container*, so long
+# agent tasks keep running when the laptop sleeps. Any device on the
+# tailnet gets the full VS Code experience at http://framework:8443.
+#
+# linuxserver image over codercom/code-server for the PUID/PGID
+# convention (matches host `noise` UID 1000 so workspace files don't
+# come out root-owned) and the s6 init that chowns /config.
+#
+# Extensions come from Open VSX (code-server's default registry). The
+# official Claude Code extension is published there:
+# https://open-vsx.org/extension/Anthropic/claude-code
+#
+# Persistent state: /config is the container user's $HOME — extensions,
+# settings, ~/.claude (OAuth creds + session history), ~/.local/bin
+# (claude CLI). Survives container recreation; back up one path.
+services:
+  code-server:
+    image: lscr.io/linuxserver/code-server:latest
+    container_name: code-server
+    restart: unless-stopped
+
+    # No PASSWORD env set — auth is disabled, same trust model as every
+    # other service here: reachable over Tailscale only. If this host
+    # ever sees LAN/internet traffic, set HASHED_PASSWORD (and even then
+    # prefer "127.0.0.1:8443:8443" + tunnel — a browser terminal is a
+    # shell on the box's docker network).
+    ports:
+      - "8443:8443"
+
+    environment:
+      - PUID=1000
+      - PGID=1000
+      - TZ=Etc/UTC
+      # Open this folder by default instead of the "welcome" screen.
+      - DEFAULT_WORKSPACE=/workspace
+      # tmux inside the container: detachable terminals for long claude
+      # runs, so a browser-tab close or code-server reconnect hiccup
+      # can't orphan your view of a session (the process itself never
+      # depended on the tab — it's a child of the container).
+      - DOCKER_MODS=linuxserver/mods:universal-package-install
+      - INSTALL_PACKAGES=tmux
+
+    extra_hosts:
+      # Reach the sibling services (Ollama :11434, LiteLLM :4000,
+      # Phoenix :6006, ...) from the integrated terminal / extensions.
+      - "host.docker.internal:host-gateway"
+
+    volumes:
+      - /srv/docker/code-server/config:/config
+      # Default project area. Container-scoped on purpose — Claude Code
+      # in here can't touch /srv/docker compose stacks or the host.
+      # Mount additional repos explicitly when you want them editable:
+      #   - /home/noise/src/somerepo:/workspace/somerepo
+      - /srv/docker/code-server/workspace:/workspace
--- a/pyinfra/framework/compose/code-server/README.md
+++ b/pyinfra/framework/compose/code-server/README.md
@@ -0,0 +1,85 @@
+# code-server — VS Code in the browser
+
+Full VS Code served from the box at <http://framework:8443>. The reason
+it exists in this stack: **Claude Code runs inside the container**, so
+long agent tasks execute on the Framework Desktop and survive the
+laptop sleeping, the browser tab closing, or you walking away with the
+phone. Any device on the tailnet gets the same editor with the same
+state.
+
+## Bring-up
+
+```sh
+cd /srv/docker/code-server && docker compose up -d
+```
+
+First start pulls the image and runs the universal-package-install mod
+(installs tmux), so give it a minute before the UI answers on :8443.
+
+> **HTTPS is required for extension panels.** VS Code webviews (the
+> Claude Code panel included) need a secure context; over plain
+> `http://framework:8443` they render blank. Serve it via Tailscale:
+> `sudo tailscale serve --bg --https=8443 8443` →
+> `https://framework.<tailnet>.ts.net:8443`. Localhost also counts as
+> secure (`ssh -L 8443:localhost:8443 framework`).
+
+## One-time setup (in the code-server UI)
+
+1. **Install the Claude Code extension.** Extensions panel → search
+   "Claude Code". code-server uses the Open VSX registry, where
+   Anthropic publishes the official extension
+   (<https://open-vsx.org/extension/Anthropic/claude-code>).
+2. **Sign in.** The OAuth flow in a browser context is a copy/paste
+   dance: the extension shows a URL → open it in another tab → approve
+   → paste the code back. Credentials land in `/config/.claude/` and
+   persist across container recreates.
+3. **(Optional) CLI in the terminal.** The extension bundles its own
+   CLI, but for tmux-based long runs install it explicitly:
+
+   ```sh
+   curl -fsSL https://claude.ai/install.sh | bash
+   ```
+
+   Lands in `~/.local/bin` (= `/config/.local/bin`, persisted).
+
+## Long-running tasks
+
+The Claude process is a child of the container, not of your browser
+tab — closing the tab does nothing to it. Reattach from any device and
+the session is where you left it. For multi-hour unattended runs,
+prefer a tmux pane in the integrated terminal:
+
+```sh
+tmux new -s longtask
+claude    # kick off the task, detach with C-b d
+```
+
+If you want to steer from a phone or claude.ai/code instead of this UI,
+use `claude remote-control` (see the framework README's
+"Claude Code on the box" section) — that works in the host's tmux too,
+no container needed.
+
+## Scope and security
+
+- **No password is set** — same trust model as every other service on
+  this Tailscale-only box. If the host ever sees LAN/internet traffic,
+  set `HASHED_PASSWORD` in the compose env and bind to localhost; a
+  browser terminal is a shell.
+- The workspace is **container-scoped on purpose**: Claude Code in here
+  sees `/workspace` and `/config`, not the host filesystem or the
+  docker socket. Bind-mount specific repos into `/workspace/<name>`
+  when you want them editable (host UID 1000 == container PUID 1000,
+  so ownership just works).
+- `host.docker.internal` reaches the sibling services — handy for
+  pointing scripts at Ollama (:11434) or LiteLLM (:4000) from the
+  integrated terminal.
+
+## State layout
+
+| Path (host)                            | What lives there                                  |
+| -------------------------------------- | ------------------------------------------------- |
+| `/srv/docker/code-server/config`       | `$HOME`: extensions, settings, `~/.claude`, CLIs  |
+| `/srv/docker/code-server/workspace`    | default project area (`DEFAULT_WORKSPACE`)        |
+
+Back up `config` if you care about extension state and Claude session
+history; everything else is reproducible from the repo.
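Review note: the HTTPS callout in this README comes down to the browser's secure-context rule. A minimal sketch of that rule, assuming the simplified form "HTTPS, or plain HTTP to a loopback host" (real browsers have more cases), with `is_secure_context` and the tailnet name as hypothetical names used only for illustration:

```python
from urllib.parse import urlsplit

# Hosts that browsers treat as secure even over plain HTTP (simplified).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_secure_context(url: str) -> bool:
    """Approximate the browser rule: HTTPS, or HTTP to a loopback host."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    return parts.scheme == "http" and parts.hostname in LOOPBACK_HOSTS


if __name__ == "__main__":
    for url in (
        "http://framework:8443",
        "https://framework.example-tailnet.ts.net:8443",  # hypothetical tailnet name
        "http://localhost:8443",
    ):
        print(f"{url} -> {is_secure_context(url)}")
```

Under this approximation only the Tailscale Serve URL and the SSH-forwarded localhost URL qualify, which matches the two workarounds the README lists.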
--- a/pyinfra/framework/compose/coder.yml
+++ b/pyinfra/framework/compose/coder.yml
@@ -0,0 +1,64 @@
+# Coder — self-hosted workspace manager (PILOT). Web dashboard at :7080
+# that stamps out per-project dev containers from Terraform templates;
+# each workspace gets browser code-server with the Claude Code extension
+# pre-installed. Evaluating against the standalone code-server stack
+# (compose/code-server.yml, kept as-is during the pilot) — verdict
+# criteria in compose/coder/README.md.
+#
+# The server container holds the host docker socket and spawns workspace
+# containers as siblings (same pattern + same security posture as
+# OpenHands: fine Tailscale-only, never expose further). group_add needs
+# the socket's host GID — host-specific, so it comes from the sibling
+# .env, which pyinfra generates on the box on first deploy along with a
+# random one-time Postgres password.
+#
+# Workspaces don't publish host ports — code-server and terminals are
+# reached through the dashboard's tunnel. No port bookkeeping per
+# project.
+services:
+  coder:
+    # Pin and bump deliberately — Coder releases weekly and the workspace
+    # provisioner/agent protocol moves with it. Verify at
+    # https://github.com/coder/coder/releases.
+    image: ghcr.io/coder/coder:v2.33.8
+    container_name: coder
+    restart: unless-stopped
+    ports:
+      - "7080:7080"
+    environment:
+      CODER_HTTP_ADDRESS: "0.0.0.0:7080"
+      # The URL clients are told to use — tailnet MagicDNS name, same
+      # convention as every other service tile.
+      CODER_ACCESS_URL: "${CODER_ACCESS_URL}"
+      CODER_PG_CONNECTION_URL: "postgresql://coder:${POSTGRES_PASSWORD}@database/coder?sslmode=disable"
+    group_add:
+      # GID of /var/run/docker.sock — generated into .env by deploy.py.
+      - "${DOCKER_GROUP_ID}"
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock
+      # Repo-shipped Terraform templates (source of truth:
+      # pyinfra/framework/compose/coder/templates/). Push after changes:
+      #   docker compose exec coder coder templates push code-server \
+      #     --directory /templates/code-server --yes
+      - /srv/docker/coder/templates:/templates:ro
+    depends_on:
+      database:
+        condition: service_healthy
+
+  database:
+    image: postgres:17
+    container_name: coder-db
+    restart: unless-stopped
+    environment:
+      POSTGRES_USER: coder
+      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
+      POSTGRES_DB: coder
+    volumes:
+      # Entrypoint starts as root and chowns this to the postgres uid;
+      # deploy.py just creates the mount point.
+      - /srv/docker/coder/postgres:/var/lib/postgresql/data
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U coder -d coder"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
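Review note: the header says pyinfra generates the sibling `.env` (socket GID, access URL, random Postgres password). The deploy code is not part of this file, so the following is only a sketch of how those three values could be produced; `render_env` is a hypothetical helper, not the actual deploy.py logic.

```python
import os
import secrets


def render_env(socket_path: str, access_url: str) -> str:
    """Build .env contents for the Coder compose stack (illustrative only)."""
    docker_gid = os.stat(socket_path).st_gid  # group that owns the socket
    # token_urlsafe emits only [A-Za-z0-9_-], so the password needs no
    # percent-encoding inside CODER_PG_CONNECTION_URL.
    password = secrets.token_urlsafe(24)
    return (
        f"DOCKER_GROUP_ID={docker_gid}\n"
        f"CODER_ACCESS_URL={access_url}\n"
        f"POSTGRES_PASSWORD={password}\n"
    )


if __name__ == "__main__":
    # Any existing path works for a dry run; on the box this would be
    # /var/run/docker.sock.
    print(render_env(os.devnull, "http://framework:7080"))
```

A real generator would also have to skip the write when `.env` already exists, since the password is meant to be created once and Postgres keeps the original.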
--- a/pyinfra/framework/compose/coder/README.md
+++ b/pyinfra/framework/compose/coder/README.md
@@ -0,0 +1,110 @@
+# Coder — workspace manager (pilot)
+
+Self-hosted control plane at <http://framework:7080> that creates
+per-project dev containers from Terraform templates. Each workspace:
+its own container, browser code-server with the Claude Code extension
+pre-installed, no host-port bookkeeping (everything tunnels through the
+dashboard), and idle autostop so parked projects give their RAM back to
+the inference stacks.
+
+**Pilot status.** Evaluating against the standalone code-server stack
+(`/srv/docker/code-server`, kept as-is meanwhile). Verdict after a week
+of real use: does the dashboard + autostop + create-from-template earn
+an always-on control plane + Postgres? If yes, the standalone stack
+retires; if no, Coder goes and a thin `workspace` spawner script
+replaces it.
+
+## Bring-up
+
+```sh
+cd /srv/docker/coder && docker compose up -d
+```
+
+The sibling `.env` (DOCKER_GROUP_ID, CODER_ACCESS_URL, random Postgres
+password) is generated by deploy.py on first `./run.sh` — no hand-fill
+needed. First visit to <http://framework:7080> creates the admin
+account (pick anything; it's local to the box).
+
+> **HTTPS is required for extension panels.** VS Code webviews (the
+> Claude Code panel, markdown preview, etc.) run on service workers,
+> which browsers only allow in a secure context — over plain
+> `http://framework:7080` the editor works but webview panels render
+> blank. Fix via Tailscale Serve (real Let's Encrypt cert for the
+> tailnet name; enable "HTTPS Certificates" once in the Tailscale
+> admin console):
+>
+> ```sh
+> sudo tailscale serve --bg 7080
+> # then in .env: CODER_ACCESS_URL=https://framework.<tailnet>.ts.net
+> docker compose up -d        # recreate server with new access URL
+> # restart any existing workspace — app URLs derive from the access URL
+> ```
+>
+> Localhost is also a secure context, so
+> `ssh -L 7080:localhost:7080 framework` + http://localhost:7080 works
+> in a pinch.
+
+## Push the template (one-time + after edits)
+
+The `coder` CLI ships inside the server image; authenticate it once:
+
+```sh
+docker compose exec coder coder login http://localhost:7080
+# prints a /cli-auth URL — open http://framework:7080/cli-auth in your
+# browser, copy the session token, paste it back
+```
+
+Then push:
+
+```sh
+docker compose exec coder coder templates push code-server \
+  --directory /templates/code-server --yes
+```
+
+Template source of truth is the repo
+(`pyinfra/framework/compose/coder/templates/code-server/main.tf`) —
+edit there, `./run.sh`, re-push. Edits on the box get overwritten.
+
+## First workspace
+
+Dashboard → Workspaces → Create → `code-server` template → name it
+after the project. Open the code-server app tile, then one-time
+Claude sign-in: the extension shows an OAuth URL → open in another tab
+→ approve → paste the code back. Credentials live in `~/.claude` on
+the workspace's home volume and survive stop/start and rebuilds.
+
+Set **idle autostop** under Template → Settings → Schedule (suggest
+1–2 h inactivity). Activity = open code-server tab, SSH, web terminal.
+
+## What persists where
+
+| Thing                                                  | Where                                               |
+| ------------------------------------------------------ | --------------------------------------------------- |
+| Workspace home (repos, ~/.claude, extensions)          | named volume `coder-<workspace-id>-home`            |
+| Control-plane state (users, templates, workspace defs) | Postgres → `/srv/docker/coder/postgres`             |
+| Template source                                        | this repo, shipped to `/srv/docker/coder/templates` |
+
+A stopped workspace's container is deleted; only the home volume
+remains. `docker volume ls | grep coder-` to audit.
+
+## Security
+
+Same posture as OpenHands: the server container holds the docker
+socket and spawns code-running containers — root-equivalent on the
+box. Tailscale-only exposure is the mitigation; never forward :7080
+anywhere else. Coder does have real auth (the admin account), but
+treat that as defense-in-depth, not as permission to expose it.
+
+## Notes
+
+- Workspaces reach the sibling services via `host.docker.internal`
+  (Ollama :11434, LiteLLM :4000, Phoenix :6006, ...).
+- Long Claude runs: same rules as anywhere — the process lives in the
+  workspace container, so it survives laptop/browser disconnects, but
+  **autostop will kill an idle-looking workspace mid-run**. For
+  multi-hour unattended tasks either bump the workspace's TTL in the
+  dashboard or use the host-side tmux + `claude remote-control`
+  pattern (framework README, "Claude Code on the box").
+- The base image is `codercom/enterprise-base:ubuntu` (sudo-enabled,
+  common toolchain). Per-project images are a later refinement —
+  swap the `image` in main.tf or parameterize with `coder_parameter`.
--- a/pyinfra/framework/compose/coder/templates/code-server/main.tf
+++ b/pyinfra/framework/compose/coder/templates/code-server/main.tf
@@ -0,0 +1,121 @@
+# code-server workspace template — one dev container per project, with
+# browser VS Code and Claude Code (extension + CLI) ready on first open.
+#
+# Source of truth is the repo copy at
+# pyinfra/framework/compose/coder/templates/code-server/main.tf; pyinfra
+# ships it to /srv/docker/coder/templates/, mounted read-only into the
+# server container at /templates. Push after edits:
+#   cd /srv/docker/coder
+#   docker compose exec coder coder templates push code-server \
+#     --directory /templates/code-server --yes
+
+terraform {
+  required_providers {
+    coder = {
+      source = "coder/coder"
+    }
+    docker = {
+      source = "kreuzwerker/docker"
+    }
+  }
+}
+
+provider "coder" {}
+
+# Talks to the host daemon via the socket mounted into the server
+# container (default unix:///var/run/docker.sock) — workspace containers
+# are siblings of the compose stacks, not children.
+provider "docker" {}
+
+data "coder_workspace" "me" {}
+data "coder_workspace_owner" "me" {}
+
+resource "coder_agent" "main" {
+  arch = "amd64"
+  os   = "linux"
+
+  # Claude Code CLI — native installer, lands in ~/.local/bin inside the
+  # persisted home volume, so this is a no-op after the first start. The
+  # code-server extension bundles its own CLI, but having `claude` on
+  # PATH enables tmux-based long runs in the workspace terminal.
+  startup_script = <<-EOT
+    set -e
+    command -v claude >/dev/null 2>&1 || curl -fsSL https://claude.ai/install.sh | bash
+  EOT
+
+  env = {
+    GIT_AUTHOR_NAME     = data.coder_workspace_owner.me.full_name
+    GIT_AUTHOR_EMAIL    = data.coder_workspace_owner.me.email
+    GIT_COMMITTER_NAME  = data.coder_workspace_owner.me.full_name
+    GIT_COMMITTER_EMAIL = data.coder_workspace_owner.me.email
+  }
+
+  metadata {
+    display_name = "CPU"
+    key          = "cpu"
+    script       = "coder stat cpu"
+    interval     = 10
+    timeout      = 1
+  }
+  metadata {
+    display_name = "RAM"
+    key          = "mem"
+    script       = "coder stat mem"
+    interval     = 10
+    timeout      = 1
+  }
+}
+
+# Browser VS Code inside the workspace, surfaced as a dashboard app.
+# Extensions install from Open VSX — anthropic.claude-code is the
+# official Claude Code extension
+# (https://open-vsx.org/extension/Anthropic/claude-code). OAuth creds
+# land in ~/.claude inside the home volume and survive rebuilds.
+module "code_server" {
+  count    = data.coder_workspace.me.start_count
+  source   = "registry.coder.com/coder/code-server/coder"
+  version  = "~> 1.0"
+  agent_id = coder_agent.main.id
+  folder   = "/home/coder/project"
+
+  extensions = [
+    "anthropic.claude-code",
+  ]
+}
+
+# Home survives workspace stop/start AND template-driven rebuilds —
+# ignore_changes keeps Terraform from recreating the volume (and wiping
+# ~/.claude, extensions, repos) when template metadata shifts.
+resource "docker_volume" "home" {
+  name = "coder-${data.coder_workspace.me.id}-home"
+  lifecycle {
+    ignore_changes = all
+  }
+}
+
+resource "docker_container" "workspace" {
+  # start_count is 0 when the workspace is stopped — the container is
+  # deleted but the home volume above persists. This is what idle
+  # autostop reclaims: RAM back to the inference stacks.
+  count    = data.coder_workspace.me.start_count
+  image    = "codercom/enterprise-base:ubuntu"
+  name     = "coder-${data.coder_workspace_owner.me.name}-${lower(data.coder_workspace.me.name)}"
+  hostname = data.coder_workspace.me.name
+
+  # Agent bootstrap, rewritten to reach the control plane through the
+  # docker bridge (the workspace is a sibling container; "localhost"
+  # inside it isn't the Coder server).
+  entrypoint = ["sh", "-c", replace(coder_agent.main.init_script, "/localhost|127\\.0\\.0\\.1/", "host.docker.internal")]
+  env        = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
+
+  host {
+    host = "host.docker.internal"
+    ip   = "host-gateway"
+  }
+
+  volumes {
+    container_path = "/home/coder"
+    volume_name    = docker_volume.home.name
+    read_only      = false
+  }
+}
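Review note: the `replace()` call in the container entrypoint uses Terraform's regex form (pattern wrapped in slashes). A Python equivalent, as a sketch for checking the pattern outside Terraform; the sample script line is made up, since the real init script is generated by Coder:

```python
import re

# Same pattern as the Terraform replace() call: either loopback spelling.
LOOPBACK = re.compile(r"localhost|127\.0\.0\.1")


def rewrite_init_script(script: str, host: str = "host.docker.internal") -> str:
    """Point the agent bootstrap at the Docker host gateway name."""
    return LOOPBACK.sub(host, script)


if __name__ == "__main__":
    # Made-up fragment; not the actual generated init script.
    sample = "curl -fsSL http://localhost:7080/bin/coder -o coder"
    print(rewrite_init_script(sample))
```

The escaped dots matter: without them `127x0y0z1` would also match, which is why the Terraform string doubles the backslashes.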
--- a/pyinfra/framework/compose/homepage/services.yaml
+++ b/pyinfra/framework/compose/homepage/services.yaml
@@ -95,6 +95,28 @@
        server: localhost-docker
        container: openhands

+    - code-server:
+        icon: code-server.svg
+        href: http://framework:8443
+        description: VS Code in the browser — Claude Code long tasks live here
+        server: localhost-docker
+        container: code-server
+
+    - Coder:
+        icon: coder.svg
+        href: http://framework:7080
+        description: Workspace manager (pilot) — per-project containers from templates
+        server: localhost-docker
+        container: coder
+        # /api/v2/buildinfo is unauthenticated — cheap liveness + version.
+        widget:
+          type: customapi
+          url: http://host.docker.internal:7080/api/v2/buildinfo
+          refreshInterval: 30000
+          mappings:
+            - field: version
+              label: Version
+
 - Observability:
    - Beszel:
        icon: beszel.svg
--- a/pyinfra/framework/compose/qwable.yml
+++ b/pyinfra/framework/compose/qwable.yml
@@ -0,0 +1,102 @@
+# Qwable-3.6-27B (Qwen3.6-27B fine-tuned on Fable-5-style reasoning
+# traces — "Qwen + Fable") via the kyuz0 rocm-7.2.2 Strix Halo toolbox.
+# Same image + unified-memory recipe as compose/llama.yml; deltas are
+# model path, port, alias.
+# https://github.com/kyuz0/amd-strix-halo-toolboxes
+# Model: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b (MIT)
+#
+# What it's for. A "thinks-like-Fable-5" interactive model — structured,
+# step-by-step explanatory output. Dense 27B (NOT MoE), so it's slower
+# per token than the 30B-A3B MoE workhorses despite being smaller on
+# disk: every weight is read per token. Bandwidth ceiling: 256 GB/s ÷
+# ~16.5 GB ≈ 15.5 tok/s, so ~10-15 tok/s decode. Interactive, not snappy.
+#
+# Coexistence. At ~16.5 GB (Q4_K_M) it's the smallest GPU resident here
+# and fits alongside llama 30B (port 8080), Ollama, or Kimi in the
+# ~110 GB merged arena. It does NOT fit alongside qwen3-235b (88.8 GB)
+# or comfyui — swap-model tears those down for the `qwable` target.
+# `restart: "no"`: you bring it up deliberately via swap-model, it won't
+# auto-start after a reboot and surprise-collide with a big model.
+#
+# Weights. Single-file GGUF (not sharded). Download path on the box
+# (see compose/qwable/README.md):
+#   hf download Mia-AiLab/Qwable-3.6-27b \
+#       'Qwable-27b_Q4_K_M.gguf' \
+#       --local-dir /models/qwen/Qwable-3.6-27b
+# Verify exact filename in the HF repo before downloading.
+#
+# Port 8082 — distinct from llama 30B (8080) and qwen3-235b (8081).
+services:
+  qwable:
+    image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2
+    container_name: qwable
+    # Manual start only — see header note about GPU contention with
+    # the big models. swap-model brings it up/down.
+    restart: "no"
+    devices:
+      # ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan
+      # only needs dri. Don't drop kfd when on the rocm-* tag.
+      - /dev/kfd:/dev/kfd
+      - /dev/dri:/dev/dri
+    cap_add:
+      - SYS_PTRACE
+    security_opt:
+      - seccomp=unconfined
+    # Numeric GIDs of host's video (44) and render (991) groups —
+    # required for /dev/kfd + /dev/dri access from inside the container.
+    group_add:
+      - "44"
+      - "991"
+    shm_size: 8g
+    ipc: host
+    environment:
+      # Unified-memory recipe (same as compose/llama.yml + kimi-linear +
+      # qwen3-235b). BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these
+      # flags merge the rocminfo pools into one ~110 GB arena. kyuz0's
+      # image is native gfx1151 so no HSA_OVERRIDE_GFX_VERSION.
+      - HSA_XNACK=1
+      - HSA_FORCE_FINE_GRAIN_PCIE=1
+    volumes:
+      - /models:/models:ro
+    ports:
+      - "8082:8082"
+    entrypoint: ["llama-server"]
+    command:
+      - --model
+      - /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf
+      # OpenAI-compatible served name (matches what opencode/curl request
+      # as "model"). Provider-side name lives in opencode.json if/when
+      # this gets wired as a provider.
+      - --alias
+      - qwable
+      - --host
+      - 0.0.0.0
+      - --port
+      - "8082"
+      # Push all layers to GPU. "999" = all available. A 27B Q4 (~16.5 GB)
+      # fits the merged arena with huge headroom.
+      - --n-gpu-layers
+      - "999"
+      # 64K to match llama/qwen3-235b — keeps opencode auto-compaction
+      # behaviour consistent across providers. Tons of arena headroom
+      # here (model is small), so this can ramp far higher if a workflow
+      # needs it; see compose/qwable/README.md.
+      - --ctx-size
+      - "65536"
+      # No-mmap is the Strix Halo standard — forces full GPU load.
+      - --no-mmap
+      # Flash attention — required for q8_0 KV cache; modern llama-server
+      # takes a value (on/off/auto), bare --flash-attn is deprecated.
+      - --flash-attn
+      - "on"
+      # Quantize the KV cache to q8_0 (8-bit): roughly halves KV memory
+      # vs. f16 at minor/no quality loss. Matches the other llama.cpp stacks.
+      - --cache-type-k
+      - q8_0
+      - --cache-type-v
+      - q8_0
+      # Use the model's embedded jinja chat template — Qwable inherits
+      # Qwen3.6's chat format, which the Fable-trace fine-tune relies on.
+      - --jinja
+      # Expose Prometheus metrics at /metrics — scraped by OpenLIT.
+      - --metrics
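Review note: the bandwidth math in the header comment can be written out explicitly. This is a sketch of the usual upper-bound estimate for a dense model (every weight read once per generated token), using the two figures quoted in the comment; `decode_ceiling_tok_s` is a name chosen for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # 256 GB/s memory bandwidth, ~16.5 GB of Q4_K_M weights.
    print(f"{decode_ceiling_tok_s(256, 16.5):.1f} tok/s ceiling")
```

The result is a ceiling, not a prediction: KV-cache reads and kernel overhead are ignored, which is consistent with the expected band sitting below it.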
--- a/pyinfra/framework/compose/qwable/README.md
+++ b/pyinfra/framework/compose/qwable/README.md
@@ -0,0 +1,130 @@
+# qwable
+
+Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
+of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
+collected examples formatted like Fable 5's deliberate, step-by-step
+answers and trained Qwen to reproduce that structured, explanatory
+output. Think of it as a local "thinks-like-Fable" model.
+
+OpenAI-compatible endpoint at `http://framework:8082` once running.
+
+## Dense, not MoE (read first)
+
+Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
+**dense** 27B: every weight is read per token. On this bandwidth-bound
+box (256 GB/s ÷ ~16.5 GB ≈ 15.5 tok/s ceiling) that's **~10-15 tok/s**
+decode in practice, slower than
+specifically want the Fable-style reasoning, not for raw throughput.
+The interactive daily driver stays on Ollama / llama 30B.
+
+## Coexistence notes
+
+At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
+
+| Concurrent service | Coexists? |
+|---|---|
+| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
+| `ollama` (11434) | ✅ yes |
+| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
+| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
+| `comfyui` (8188) | ❌ no — swap-model stops it |
+
+`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
+it won't auto-start after a reboot and surprise-collide with a big model.
+
+## Prereqs
+
+- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
+- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
+  Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
+
+## Download weights (~16.5 GB, single file)
+
+```sh
+# /models/qwen exists via pyinfra; just create the model subdir.
+mkdir -p /models/qwen/Qwable-3.6-27b
+
+hf download Mia-AiLab/Qwable-3.6-27b \
+    'Qwable-27b_Q4_K_M.gguf' \
+    --local-dir /models/qwen/Qwable-3.6-27b
+
+# File lands at:
+#   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)
+```
+
+Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
+needs ~17 GB free on `/models`.
+
+> Abliterated variant (refusals removed) lives at
+> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
+> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
+> Not the default — no safety filtering, careful with it.
+
+## Bring up
+
+Easy path — `swap-model` handles stop-conflicting-services + waits for
+`/health`:
+
+```sh
+ssh framework swap-model qwable     # ~1-2 min cold load (16.5 GB)
+ssh framework /srv/docker/qwable/smoke.sh    # perf measure
+```
+
+Manual equivalent (first-ever bring-up, before the image is cached):
+
+```sh
+cd /srv/docker/qwable
+docker compose pull       # already-cached image if you ran llama first
+docker compose up -d
+docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8082"
+
+./smoke.sh                # /health + tiny generation + perf
+```
+
+First start is ~1-2 min (16.5 GB load off disk; much faster than the
+235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
+band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
+qwen3-235b/README.md "Troubleshooting" for the arena checks).
+
+## Ramping context
+
+Defaults to 64K to match the other llama.cpp stacks (keeps opencode
+auto-compaction consistent across providers). The model is tiny relative
+to the arena, so there's plenty of room to push higher:
+
+| Stage | `--ctx-size` | Margin in arena |
+|---|---|---|
+| **Current default** | **65536** | huge (~90 GB free) |
+| Stretch | 131072+ | still comfortable |
+
+Edit `--ctx-size` in `docker-compose.yml`, then run
+`docker compose down && docker compose up -d` and `./smoke.sh`. The real
+ceiling is Qwable's trained context length (inherits Qwen3.6-27B's), not
+arena memory; verify the model's max positions before going past 128K.
+
+## Operations
+
+```sh
+docker compose logs -f                  # tail
+docker compose down                     # stop
+docker compose exec qwable bash         # shell in
+./smoke.sh                              # health + perf
+amdgpu_top                              # GPU view on host
+```
+
+## Pin manifest
+
+| Component | Pin |
+|---|---|
+| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
+| Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
+| Default port | 8082 |
+| Default context | 65536 |
+| KV cache type | q8_0 (k and v) |
+| License | MIT (model); Qwen3.6-27B base license also applies |
+
+## Status
+
+Compose artifacts written; awaiting box-side weight pull + bring-up.
+Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
+provider only if the Fable-style reasoning proves useful in practice.
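Review note: the coexistence table can be sanity-checked with simple arithmetic. The sketch below uses only figures quoted in this README and the compose header (the llama and kimi-linear sizes are derived from the "total" columns) and counts weights only; KV cache and compute buffers are not modelled, and `free_after` is a hypothetical helper.

```python
ARENA_GB = 110.0  # "~110 GB merged arena" from the compose header

# Weights-only figures; llama and kimi-linear are derived from the
# "~35 GB total" and "~47 GB total" rows of the coexistence table.
RESIDENT_GB = {
    "qwable": 16.5,
    "llama": 35.0 - 16.5,
    "kimi-linear": 47.0 - 16.5,
    "qwen3-235b": 88.8,
}


def free_after(*services: str) -> float:
    """Arena GB left once the named services' weights are resident."""
    return ARENA_GB - sum(RESIDENT_GB[name] for name in services)


if __name__ == "__main__":
    print(f"qwable + llama:      {free_after('qwable', 'llama'):.1f} GB free")
    print(f"qwable + qwen3-235b: {free_after('qwable', 'qwen3-235b'):.1f} GB free")
```

On weights alone the 235B pairing still fits, with under 5 GB to spare; that leaves little room for two 64K KV caches, which is presumably why the table marks it as too tight.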
--- a/pyinfra/framework/compose/qwable/smoke.sh
+++ b/pyinfra/framework/compose/qwable/smoke.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Smoke-test the running qwable llama-server (port 8082). Hits /health
+# for liveness, then a tiny OpenAI-compatible chat completion, then
+# measures predicted_per_second via /completion. Dense 27B → ~10-15 tok/s.
+set -euo pipefail
+
+HOST="${QWABLE_HOST:-127.0.0.1:8082}"
+MODEL="${QWABLE_MODEL:-qwable}"
+
+echo "[smoke] GET /health on $HOST"
+curl -fsS "http://$HOST/health" | python3 -m json.tool
+
+echo
+echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
+curl -fsS "http://$HOST/v1/chat/completions" \
+    -H 'Content-Type: application/json' \
+    -d "{
+        \"model\": \"$MODEL\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
+        \"max_tokens\": 16,
+        \"temperature\": 0.0
+    }" | python3 -m json.tool
+
+echo
+echo "[smoke] perf measure: predicted_per_second and prompt_per_second (n_predict=128)"
+# 128 tokens — at ~10-15 tok/s the per-token warmup noise still matters,
+# but a dense 27B settles faster than the 235B so we don't need 64-only.
+curl -fsS "http://$HOST/completion" \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
+        "n_predict": 128,
+        "temperature": 0.0,
+        "stream": false
+    }' | python3 -c "
+import json, sys
+r = json.load(sys.stdin)
+t = r.get('timings', {})
+print(f'predicted_per_second:  {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s')
+print(f'prompt_per_second:     {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s')
+print(f'predicted_n:           {t.get(\"predicted_n\", \"?\")}')
+print(f'prompt_n:              {t.get(\"prompt_n\", \"?\")}')
+"
+
+echo
+echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"
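Review note: the inline `python3 -c` step reads four fields from the `timings` object of the `/completion` response. A standalone sketch of the same extraction that tolerates missing keys; `format_timings` is a hypothetical helper and the sample payload is invented to match the field names the script reads.

```python
import json
import math


def format_timings(response: dict) -> list[str]:
    """Render llama-server /completion timings, tolerating missing keys."""
    timings = response.get("timings", {})
    lines = []
    for key in ("predicted_per_second", "prompt_per_second"):
        # NaN keeps the :.2f format spec valid when a rate is absent.
        value = float(timings.get(key, math.nan))
        lines.append(f"{key}: {value:.2f} tok/s")
    for key in ("predicted_n", "prompt_n"):
        lines.append(f"{key}: {timings.get(key, '?')}")
    return lines


if __name__ == "__main__":
    # Invented payload; a real response carries many more fields.
    sample = json.loads(
        '{"timings": {"predicted_per_second": 12.3456, "predicted_n": 128}}'
    )
    print("\n".join(format_timings(sample)))
```

Keeping the numeric default numeric is the point of the design: a string placeholder such as `"?"` cannot be rendered with a float format spec.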
--- a/pyinfra/framework/compose/qwen3-235b/README.md
+++ b/pyinfra/framework/compose/qwen3-235b/README.md
@@ -53,6 +53,15 @@ Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect

 ## Bring up (M0.2 — first generation)

+Easy path — `swap-model` handles the stop-conflicting-services dance + waits for `/health`:
+
+```sh
+ssh framework swap-model 235b      # ~3-5 min on first cold load
+ssh framework /srv/docker/qwen3-235b/smoke.sh    # perf measure
+```
+
+Manual equivalent (for first-ever bring-up before the image is cached):
+
 ```sh
 cd /srv/docker/qwen3-235b
 docker compose pull       # already-cached image if you ran llama first