added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions
--- a/bin/swap-model
+++ b/bin/swap-model
@@ -0,0 +1,14 @@
 #!/usr/bin/env bash
 # Mac-side wrapper for swap-model on the Framework Desktop.
 #
 # Symlink or alias this so `swap-model X` works from any shell:
 #   ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
 # (or add ~/Documents/obsidian/localgenai/bin to PATH)
 #
 # Talks to whichever host name resolves the Framework Desktop. Default
 # `framework` works whether you're on the LAN (mDNS/etc/hosts) or
 # remote (Tailscale MagicDNS). Override per-shell when needed:
 #   SWAP_MODEL_HOST=10.0.0.70 swap-model 235b
 #
 # Assumes your SSH public key is in the box's authorized_keys.
 exec ssh "${SWAP_MODEL_HOST:-framework}" /usr/local/bin/swap-model "$@"
--- a/opencode/README.md
+++ b/opencode/README.md
@@ -1,253 +1,41 @@
-# opencode setup
+# Garmin Data Fetcher
-Canonical OpenCode config + Phoenix bridge plugin for the localgenai
+A Python script to download per-second activity data (heart rate, GPS, cadence, etc.) from Garmin Connect and extract it from FIT files for analysis.
 stack. `install.sh` deploys it to `~/.config/opencode/` on a Mac.
 ## What's wired up
 - **Local models**: two providers, manually switched via `/model`.
  - `framework/qwen3-coder:30b` — Qwen3-Coder 30B-A3B via Ollama, the
    daily-driver coding model. 128K context, 11434.
  - `framework-vllm/kimi-linear` — Kimi-Linear 48B-A3B via vLLM, the
    long-context play (hybrid KDA/MLA, MoE 3B active). 32K context for
    now (ramps further in P3 of the kimi-linear roadmap), 8000.
    **Tools disabled** (`tool_call: false`) — Kimi-Linear is a research
    architecture release and isn't strongly tool-trained; the model
    knows the Kimi-K2 tool tokens but emits non-structured output when
    given an MCP toolbox. Use it for chat / long-context reasoning;
    switch to `framework/qwen3-coder:30b` for agentic work.
 - **Playwright MCP** ([@playwright/mcp](https://github.com/microsoft/playwright-mcp)) —
  browser automation. The model can navigate pages, click, fill forms,
  read DOM snapshots. Closes the agentic-browsing gap.
 - **SearXNG MCP** ([mcp-searxng](https://github.com/ihor-sokoliuk/mcp-searxng)) —
  web search via your self-hosted instance at <https://searxng.n0n.io>.
  No external API keys, no rate-limit roulette.
 - **Serena MCP** ([oraios/serena](https://github.com/oraios/serena)) —
  LSP-backed semantic code navigation (find symbol, references, rename,
  insert before/after). Cuts the tokens a local 70B-class model burns on
  grep-style flailing by roughly an order of magnitude. Uses a **custom
  trimmed context** (`serena-ide-trim.yml`) that exposes only the 8
  unique-LSP-value tools — JetBrains tools, line-level edits redundant
  with opencode's `Edit`, Serena's own memory tools (basic-memory MCP is
  canonical), and onboarding/meta noise are all excluded. Down from 46
  raw → 41 ide-context-filtered → **8 active**. Scoped to the cwd via
  `--project-from-cwd`.
 - **basic-memory MCP** ([basicmachines-co/basic-memory](https://github.com/basicmachines-co/basic-memory)) —
  Markdown-backed persistent memory across sessions. Storage lives in
  `~/Documents/obsidian/AI-memory/` (symlinked from `~/basic-memory`),
  so notes are browsable in Obsidian's graph and search. Replaces
  Claude Code's auto-memory write-back, which opencode lacks natively.
 - **sequential-thinking MCP** ([modelcontextprotocol/servers/sequentialthinking](https://github.com/modelcontextprotocol/servers/tree/main/src/sequentialthinking)) —
  externalizes chain-of-thought as tool calls. Helps weaker local
  models stay on-plan over multi-step work; near-zero cost when not
  actively used.
 - **github MCP** ([github/github-mcp-server](https://github.com/github/github-mcp-server)) —
  GitHub repo / issue / PR / code-search access. Launched with
  `--read-only` and a narrowed `--toolsets repos,issues,pull_requests,code_security`
  allowlist. With a **classic** PAT (`ghp_…`), GitHub's auto-scope-filtering
  (Jan 2026) trims tools further by hiding ones whose scopes the token
  lacks — saves ~23k tokens of tool-list overhead, meaningful for a 70B's
  effective context. Requires `GITHUB_PERSONAL_ACCESS_TOKEN` to be exported
  in your shell env (not in opencode.json). Drop `--read-only` from
  `opencode.json` once you trust the model's tool calls.
  **Note**: this MCP is currently disabled, because this setup uses a self-hosted Gitea instance instead of GitHub.
 - **task-master MCP** ([eyaltoledano/claude-task-master](https://github.com/eyaltoledano/claude-task-master)) —
  Workflow / task-gate MCP. File-based: each project gets a
  `.taskmaster/` dir with tasks, complexity, and config — no DB, no
  external service. `OLLAMA_BASE_URL` is pre-set in `opencode.json` so
  task-master's AI features (parse-prd, expand-task) route through your
  framework Ollama. The npm-global install also provides a `task-master`
  CLI (`task-master init` to scaffold per-project). Replaces the
  workflow-gate role originally proposed for Archon, without Supabase.
 - **Phoenix bridge plugin** (`.opencode/plugin/phoenix-bridge.js`) —
  exports OpenTelemetry spans for every LLM call, tool call, and
  subagent invocation to the Phoenix container running on the Framework
  Desktop. Per-prompt waterfall / flamegraph viz at
  <http://framework:6006>.
 ## Setup
-```sh
+1. Create virtual environment:
-./install.sh
+```bash
 python -m venv venv
 source venv/bin/activate
 ```
-Idempotent — re-run after editing `opencode.json` or pulling changes to
+2. Install dependencies:
-the plugin. Each step checks before doing work. Specifically:
+```bash
-
+pip install garminconnect garmin_fit_sdk
 1. Verifies Homebrew is present (won't install it for you)
 2. `brew install node uv jq sst/tap/opencode` (skips if already at latest)
 3. Pre-caches Playwright's chromium so the first MCP call is instant
 4. `uv tool install serena-agent@latest --prerelease=allow` so opencode
   can launch Serena as a plain `serena` binary on PATH (faster than
   re-resolving via `uvx` on every session)
 5. Creates `~/Documents/obsidian/AI-memory/` and symlinks `~/basic-memory`
   to it, so basic-memory MCP writes into the Obsidian vault by default
 6. `brew install github-mcp-server` and warns if `GITHUB_PERSONAL_ACCESS_TOKEN`
   isn't set in your shell — the MCP needs it to authenticate
 7. `npm install -g task-master-ai` (workflow MCP, also exposes the
   `task-master` CLI for `task-master init` per project)
 8. `npm install` in `.opencode/plugin/` for the Phoenix bridge OTel deps
 9. Generates `~/.config/opencode/opencode.json` from the repo's
   `opencode.json`, rewriting relative plugin paths to absolute so
   OpenCode loads the plugin regardless of which directory it's launched
   from
 Step 9 is the reason the deployed config isn't a plain symlink. The
 repo's `opencode.json` uses a relative plugin path (`./...`) so it stays
 valid in place; the deployed copy is generated with that path resolved
 to an absolute one. Edits to the repo's `opencode.json` need a re-run
 of `./install.sh` to take effect.
 ## Verify
 ```sh
 # Local model reachable
 curl -s http://framework:11434/v1/models | jq '.data[].id'
 # SearXNG instance answers JSON
 curl -s 'https://searxng.n0n.io/search?q=test&format=json' | jq '.results | length'
 ```
-Then in opencode:
+3. Set environment variables:
-
+```bash
-```
+export GARMIN_EMAIL=your_email@example.com
-opencode
+export GARMIN_PASSWORD=your_password
 > /mcp        # should list playwright, searxng, serena, basic-memory,
              # sequential-thinking, github, task-master as connected
 > search the web for "qwen3-coder benchmarks"
 > open https://example.com and tell me the H1
 > use serena to find the definition of `parse_request`
 > remember: this project ships its memory into the Obsidian vault
 > /sequentialthinking think through the trade-offs of X vs Y
 > list my recent github PRs across all repos
 > task-master init   # then ask the model to plan tasks for this project
 ```
-For parallel agents, plain tmux + git worktree is enough at the 70B's
+## Usage
 ~2-pane concurrency ceiling. A two-line zsh helper covers the
 "new isolated worktree → split tmux pane → start opencode" loop:
-```sh
+Run the script to download and extract data:
-work() {
+```bash
-  local name="${1:?usage: work <branch-name>}"
+python scripts/fetch_garmin_data.py
  local wt="../$(basename "$PWD")-$name"
  git worktree add "$wt" -b "$name" && tmux split-window -h -c "$wt" "opencode"
 }
 unwork() { local wt="$PWD"; cd .. && git worktree remove --force "$wt"; }
 ```
-Serena's first invocation in a project may take a few seconds — it
+The script will:
-indexes the workspace via the language server. basic-memory's first
+- Authenticate with Garmin Connect
-write creates the project layout under `~/Documents/obsidian/AI-memory/`
+- Download the most recent activity's FIT file
-which Obsidian will pick up on its next vault scan.
+- Extract per-second record data (heart rate, GPS, cadence, etc.)
 - Save the data to a SQLite database (`garmin_data/garmin.db`) for analysis
-## Phoenix tracing
+## Output
-The plugin at `.opencode/plugin/phoenix-bridge.js` boots an OpenTelemetry
+Data is saved in `garmin_data/` as:
-SDK on OpenCode startup and ships every span to Phoenix on the Framework
+- `{activity_id}.fit` - Original FIT file
-Desktop. With `experimental.openTelemetry: true` (already set in
+- `garmin.db` - SQLite database holding the `activities` and per-second `records` tables
 `opencode.json`), OpenCode emits Vercel AI SDK spans that Phoenix renders
 as a per-turn waterfall: user prompt → main agent's `ai.streamText` →
 each tool call (built-in + MCP) with token counts and latencies inline.
 The plugin uses `@opentelemetry/exporter-trace-otlp-proto` (not `-http`)
 because Phoenix's OTLP receiver only speaks protobuf — the JSON variant
 returns 415.
 Spans go to Phoenix only. Earlier versions of this plugin dual-exported
 to OpenLIT as well, but OpenLIT's container doesn't currently host an
 OTLP receiver — the failing exporter cascaded into OpenCode's tool-call
 parsing pipeline and broke tool use. Re-enable once `openlit.yml` adds
 an `otel-collector` sidecar.
 Defaults can be overridden via env vars (set before launching opencode):
 | Variable | Default | Purpose |
 |---|---|---|
 | `PHOENIX_OTLP_ENDPOINT` | `http://framework:6006/v1/traces` | Phoenix HTTP target |
 | `PHOENIX_SERVICE_NAME` | `opencode` | Phoenix project name |
 | `PHOENIX_OTEL_DEBUG` | unset | `1` to surface OTel internal logs |
 ### Verifying
 ```sh
 : > /tmp/phoenix-bridge.log    # truncate prior runs
 opencode                       # any directory; CWD doesn't matter
 tail -f /tmp/phoenix-bridge.log
 ```
 Healthy startup looks like:
 ```
 plugin function entered
 endpoint=http://framework:6006/v1/traces serviceName=opencode
 OTel imports resolved
 sdk.start() returned
 tracer obtained
 boot span emitted (will flush within ~5s)
 ```
 Then open <http://framework:6006/projects> — an `opencode` project should
 appear with at least one `phoenix-bridge.boot` span. Send a prompt in
 OpenCode and real LLM-call traces follow.
 If the plugin's deps aren't installed, OpenCode logs a warning and the
 plugin no-ops — the rest of OpenCode still works fine.
 ### Known limitations
 - **Subagent nesting is best-effort.** The plugin opens a parent span
  per session and tries to stitch child sessions (Task-tool subagents)
  under their parent, but Vercel AI SDK spans live in their own OTel
  trace context. Until [sst/opencode#6142](https://github.com/sst/opencode/issues/6142)
  exposes `sessionID` in the `chat.system.transform` hook, child-session
  spans may show as separate traces in Phoenix.
 - **Console output from plugins is swallowed by OpenCode's TUI.** That's
  why init progress goes to `/tmp/phoenix-bridge.log` rather than stdout.
 ## Notes
 - **SearXNG JSON output** must be enabled on the instance for the MCP
  server to work. If `format=json` returns HTML or 403, edit
  `settings.yml` on the SearXNG box: `search.formats: [html, json]`,
  restart.
 - **Playwright first-run** downloads ~200 MB of browser binaries into
  `~/Library/Caches/ms-playwright/`. Subsequent runs are instant.
 - **Tool-calling reliability** with Qwen3-Coder is decent but not
  Claude-grade. If a tool call hangs or returns malformed JSON, the
  model is the culprit, not the MCP. Worth trying the same prompt
  against a hosted Claude or GPT-5 to confirm before debugging the
  server.
 - **Adding more MCP servers**: drop another entry under the `mcp` key
  using the same `type/command/enabled` shape. The
  [official MCP registry](https://registry.modelcontextprotocol.io/)
  and [Awesome MCP Servers](https://mcpservers.org/) catalog options.
 - **Tool-list bloat is real on a local 70B.** Every tool description
  costs context. Five MCP servers exposing ~10 tools each puts the
  active-tool list around 50 — manageable, but adding two more
  full-spectrum servers (e.g. GitHub MCP at ~70 tools without scope
  filtering, plus Context7) starts crowding effective context. Prefer
  servers with toolset filtering or per-agent allow-lists in opencode.
 - **basic-memory storage path.** The symlink `~/basic-memory` →
  `~/Documents/obsidian/AI-memory` is created by `install.sh` only if
  `~/basic-memory` doesn't already exist. If you'd previously run
  basic-memory before this setup, move that directory's contents into
  `AI-memory/` first, then delete `~/basic-memory` and re-run
  `install.sh`.
 - **Serena PATH gotcha.** `uv tool install` puts `serena` in
  `~/.local/bin/`. If your shell rc doesn't export that, `opencode`
  won't find the binary. The script warns; fix is one line in
  `~/.zshrc`: `export PATH="$HOME/.local/bin:$PATH"`.
 - **Serena tool trim** (`serena-ide-trim.yml`). The custom context
  excludes 28 tools beyond what the built-in `ide` context already
  filters. To re-expose any of them, edit
  [`serena-ide-trim.yml`](serena-ide-trim.yml) and remove the entry
  from `excluded_tools`, then re-run `./install.sh`. The path injection
  (`./serena-ide-trim.yml` → absolute) is handled by install.sh's jq
  pass at deploy time.
 - **GitHub PAT.** Use a **classic** PAT (`ghp_…`) — auto-scope-filtering
  only kicks in for classic tokens, not fine-grained ones. Without
  it, the GitHub MCP exposes its full ~70-tool surface, which costs
  ~23k tokens of context the local 70B can ill afford. Generate at
  <https://github.com/settings/tokens> with the scopes you actually
  want exposed.
--- a/opencode/opencode.json
+++ b/opencode/opencode.json
@@ -9,7 +9,7 @@
      "npm": "@ai-sdk/openai-compatible",
      "name": "Framework Desktop (Strix Halo) — Ollama",
      "options": {
-        "baseURL": "http://framework:11434/v1"
+        "baseURL": "http://10.0.0.70:11434/v1"
      },
      "models": {
        "qwen3-coder:30b": {
@@ -25,7 +25,7 @@
      "npm": "@ai-sdk/openai-compatible",
      "name": "Framework Desktop (Strix Halo) — vLLM",
      "options": {
-        "baseURL": "http://framework:8000/v1",
+        "baseURL": "http://10.0.0.70:8000/v1",
        "apiKey": "dummy"
      },
      "models": {
@@ -43,7 +43,7 @@
      "npm": "@ai-sdk/openai-compatible",
      "name": "Framework Desktop (Strix Halo) — llama.cpp (long-task)",
      "options": {
-        "baseURL": "http://framework:8081/v1",
+        "baseURL": "http://10.0.0.70:8081/v1",
        "apiKey": "dummy"
      },
      "models": {
@@ -77,7 +77,7 @@
      "command": ["uvx", "kagimcp"],
      "enabled": true,
      "environment": {
-        "KAGI_API_KEY": "${KAGI_API_KEY}"
+        "KAGI_API_KEY": "<redacted>"
      }
    },
    "serena": {
@@ -115,7 +115,7 @@
      "command": ["npx", "-y", "task-master-ai"],
      "enabled": true,
      "environment": {
-        "OLLAMA_BASE_URL": "http://framework:11434/v1"
+        "OLLAMA_BASE_URL": "http://10.0.0.70:11434/v1"
      }
    }
  },
--- a/opencode/scripts/fetch_garmin_data.py
+++ b/opencode/scripts/fetch_garmin_data.py
@@ -0,0 +1,198 @@
 import os
 import sqlite3
 from datetime import datetime
 from garminconnect import Garmin
 from garmin_fit_sdk import Decoder
 def get_db_path():
    """Return path to SQLite database."""
    return os.path.join('garmin_data', 'garmin.db')
 def init_database(db_path):
    """Initialize SQLite database with required tables."""
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    with sqlite3.connect(db_path) as conn:
        cursor = conn.cursor()
        # Create activities table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS activities (
                id INTEGER PRIMARY KEY,
                start_time TEXT,
                end_time TEXT,
                distance REAL,
                duration INTEGER,
                activity_type TEXT,
                avg_heart_rate INTEGER,
                max_heart_rate INTEGER,
                avg_speed REAL,
                max_speed REAL,
                calories INTEGER,
                climb INTEGER,
                UNIQUE(id)
            )
        ''')
        # Create records table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS records (
                id INTEGER PRIMARY KEY,
                activity_id INTEGER,
                timestamp TEXT,
                heart_rate INTEGER,
                cadence INTEGER,
                speed REAL,
                altitude REAL,
                latitude REAL,
                longitude REAL,
                power INTEGER,
                distance REAL,
                FOREIGN KEY (activity_id) REFERENCES activities (id)
            )
        ''')
        # Create indexes for better query performance
        cursor.execute('CREATE INDEX IF NOT EXISTS idx_records_activity ON records(activity_id)')
        cursor.execute('CREATE INDEX IF NOT EXISTS idx_records_time ON records(timestamp)')
        conn.commit()
 def get_garmin_client(email, password):
    """Authenticate with Garmin Connect."""
    try:
        client = Garmin(email, password)
        client.login()
        return client
    except Exception as e:
        print(f"Error authenticating: {e}")
        return None
 def download_activities(client):
    """Download activity list."""
    try:
        return client.get_activities(0, 1)  # Get most recent activity
    except Exception as e:
        print(f"Error downloading activity list: {e}")
        return None
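 def download_fit_file(client, activity_id, output_dir):
    """Download the original FIT file for an activity; return its path or None."""
    # Imported locally so this function carries everything it needs.
    import io
    import zipfile
    try:
        data = client.download_activity(activity_id, dl_fmt=client.ActivityDownloadFormat.ORIGINAL)
        # Garmin Connect usually serves ORIGINAL as a ZIP wrapping the .fit file.
        if zipfile.is_zipfile(io.BytesIO(data)):
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                fit_names = [n for n in archive.namelist() if n.lower().endswith('.fit')]
                if not fit_names:
                    print("No FIT file found in downloaded archive")
                    return None
                data = archive.read(fit_names[0])
        fit_path = os.path.join(output_dir, f"{activity_id}.fit")
        os.makedirs(output_dir, exist_ok=True)
        with open(fit_path, 'wb') as f:
            f.write(data)
        return fit_path
    except Exception as e:
        print(f"Error downloading FIT file: {e}")
        return None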
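 def extract_fit_data(fit_file_path):
    """Decode a FIT file; return (messages, errors), or (None, None) on failure."""
    # Imported locally so this function carries everything it needs.
    from garmin_fit_sdk import Stream
    try:
        # Decoder takes a Stream; read() returns a dict keyed by message
        # type (e.g. 'session_mesgs', 'record_mesgs') plus a list of errors.
        stream = Stream.from_file(fit_file_path)
        decoder = Decoder(stream)
        messages, errors = decoder.read()
        return messages, errors
    except Exception as e:
        print(f"Error decoding FIT file: {e}")
        return None, None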
 def save_to_database(db_path, messages, activity_id=None):
    """Save decoded FIT messages to the SQLite database.
    `messages` is the dict returned by Decoder.read(), keyed by message
    type ('session_mesgs', 'record_mesgs'). When `activity_id` is None,
    SQLite assigns activities.id itself.
    """
    sessions = (messages or {}).get('session_mesgs') or []
    if not sessions:
        print("No session message found")
        return
    session = sessions[0]
    def as_text(value):
        # Store datetimes as ISO 8601 text; other values pass through.
        return value.isoformat() if isinstance(value, datetime) else value
    def to_degrees(semicircles):
        # FIT stores positions in semicircles: degrees = semicircles * 180 / 2**31.
        return None if semicircles is None else semicircles * (180 / 2**31)
    with sqlite3.connect(db_path) as conn:
        cursor = conn.cursor()
        # Insert activity: 12 columns need 12 values, starting with the id.
        cursor.execute('''
            INSERT OR IGNORE INTO activities
            (id, start_time, end_time, distance, duration, activity_type,
             avg_heart_rate, max_heart_rate, avg_speed, max_speed,
             calories, climb)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            activity_id,
            as_text(session.get('start_time')),
            # Session messages have no 'end_time'; 'timestamp' marks the session end.
            as_text(session.get('timestamp')),
            session.get('total_distance'),
            session.get('total_elapsed_time'),
            session.get('sport'),
            session.get('avg_heart_rate'),
            session.get('max_heart_rate'),
            session.get('avg_speed'),
            session.get('max_speed'),
            session.get('total_calories'),
            session.get('total_ascent')
        ))
        if cursor.rowcount == 0:
            # The activity row already exists; avoid duplicating its records.
            print(f"Activity {activity_id} is already stored; skipping records")
            return
        if activity_id is None:
            activity_id = cursor.lastrowid
        # Insert records, linked to the activity by its id.
        record_messages = messages.get('record_mesgs') or []
        cursor.executemany('''
            INSERT INTO records
            (activity_id, timestamp, heart_rate, cadence, speed,
             altitude, latitude, longitude, power, distance)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', [(
            activity_id,
            as_text(record.get('timestamp')),
            record.get('heart_rate'),
            record.get('cadence'),
            record.get('speed'),
            record.get('altitude'),
            to_degrees(record.get('position_lat')),
            to_degrees(record.get('position_long')),
            record.get('power'),
            record.get('distance')
        ) for record in record_messages])
        conn.commit()
        print(f"Saved {len(record_messages)} records to database")
 def main():
    # Authentication credentials
    email = os.getenv('GARMIN_EMAIL')
    password = os.getenv('GARMIN_PASSWORD')
    if not email or not password:
        print("Please set GARMIN_EMAIL and GARMIN_PASSWORD environment variables")
        return
    # Initialize database
    db_path = get_db_path()
    init_database(db_path)
    # Initialize client
    client = get_garmin_client(email, password)
    if not client:
        return
    # Get activities
    activities = download_activities(client)
    if not activities:
        return
    activity_id = activities[0]['activityId']
    # Download FIT file
    fit_path = download_fit_file(client, activity_id, 'garmin_data')
    if not fit_path:
        return
    # Extract data
    messages, errors = extract_fit_data(fit_path)
    if errors:
        print(f"FIT file errors: {errors}")
    if not messages:
        return
    # Save to database, keyed by the Garmin Connect activity id
    save_to_database(db_path, messages, activity_id)
 if __name__ == "__main__":
    main()
--- a/pyinfra/framework/README.md
+++ b/pyinfra/framework/README.md
@@ -72,6 +72,106 @@ curl localhost:8080/v1/models                # smoke test
 Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
 needed — Ollama serves models on demand).
 ## Swapping GPU-resident models
 The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
 `HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
 ~88 GB-class model, and ROCm doesn't reclaim cleanly between consumers.
 So switching between, say, the daily-driver 30B and the 235B long-task
 model is a stop-then-start of whole compose stacks.
 `scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
 box) encodes the coexistence table + per-service health probes so the
 swap is one command:
 ```sh
 ssh framework swap-model 235b      # stop conflicting svcs, start 235B, wait for /health
 ssh framework swap-model coder     # back to Qwen3-Coder-30B (Ollama)
 ssh framework swap-model status    # what's currently up?
 ssh framework swap-model none      # everything down, free the GPU arena
 ```
 Coexistence table baked into the script:
 | Target | What runs | What gets stopped |
 |---|---|---|
 | `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
 | `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
 | `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
 | `comfyui` | comfyui | 235b, kimi (ollama can stay) |
 | `none` | — | all five |
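The table above can be read as plain data plus a set intersection. The following is a minimal Python sketch of that idea, not the deployed script: `plan_swap`, `STOPS`, and `STARTS` are hypothetical names, the service labels are copied from the table rather than from real compose stack names, and "all five" is assumed to mean the five services named in the other rows.

```python
# Coexistence table as data. Keys are swap targets; values are the
# services that must be stopped first (labels taken from the table).
STOPS = {
    "coder": {"235b", "comfyui"},
    "235b": {"ollama", "llama", "kimi", "comfyui"},
    "kimi": {"235b", "comfyui"},
    "comfyui": {"235b", "kimi"},
    "none": {"ollama", "llama", "kimi", "comfyui", "235b"},
}

# Service that each target brings up (None means nothing is started).
STARTS = {
    "coder": "ollama",
    "235b": "235b",
    "kimi": "kimi",
    "comfyui": "comfyui",
    "none": None,
}


def plan_swap(target, running):
    """Return (services_to_stop, services_to_start) for a swap target."""
    if target not in STOPS:
        raise ValueError(f"unknown target: {target}")
    running = set(running)
    # Only stop conflicting services that are actually up.
    to_stop = sorted(STOPS[target] & running)
    start = STARTS[target]
    # Skip the start step when the service is already running.
    to_start = [] if start is None or start in running else [start]
    return to_stop, to_start
```

For example, `plan_swap("kimi", {"ollama", "235b"})` stops only `235b` and leaves Ollama running, which matches the "ollama can stay" note in the table.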
 OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
 the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
 cold load takes 3-5 min); override with `SWAP_WAIT_TIMEOUT=<seconds>`.
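The wait step amounts to polling a health probe until it succeeds or the deadline passes. A rough Python sketch of that loop follows, under stated assumptions: `wait_until_healthy`, `timeout_from_env`, and the injected `probe` callable are hypothetical, and only the 600 s default and the `SWAP_WAIT_TIMEOUT` variable come from this README.

```python
import os
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def timeout_from_env(default: float = 600.0) -> float:
    """Read SWAP_WAIT_TIMEOUT (seconds), falling back to the default."""
    return float(os.environ.get("SWAP_WAIT_TIMEOUT", default))
```

Passing the probe in as a callable keeps the loop independent of how each service reports health, so the same function could wrap an HTTP check for one engine and a different check for another.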
 A Mac-side wrapper at `bin/swap-model` in the repo runs
 `ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
 PATH so `swap-model X` works from any shell:
 ```sh
 ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
 ```
 Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
 Tailscale resolution is misbehaving.
 ### Inference engine consolidation (in progress)
 The model count is now at the crossover point, so the manual
 `swap-model` script is being replaced. The plan, phased:
 **Framing.** Full consolidation to one engine isn't possible — **vLLM
 stays** as the specialist for the architectures llama.cpp can't serve
 (Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
 memory problem — the cross-engine rule "235B can't share the arena
 with anything" — survives any consolidation, because it's a constraint
 *between* vLLM and the GGUF tier. So the real decision is only about
 the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
 same job) and which tool, if any, coordinates the swaps.
 **M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
 auto-loads/unloads; the only question is whether its bundled llama.cpp
 keeps up with the gfx1151-tuned kyuz0 build on *this* box.
 `scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
 serves the identical GGUF on each engine in isolation and reports
 decode/prefill t/s using each engine's own timing fields:
 ```sh
 ssh framework bench-engines        # runs both, prints a verdict line
 ```
 **M1 — pick the end-state from the benchmark.** Both are two engines
 (GGUF tier + vLLM); the difference is the coordinator:
 - **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
  drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
  **no llama-swap needed**; the one cross-engine rule ("stop Ollama
  before 235B") is a few lines. Fewest moving parts.
 - **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
  **[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
  Ollama. This is where llama-swap earns its place — a YAML-configured
  OpenAI-compatible proxy whose **groups** encode the coexistence table
  declaratively *and* can launch the vLLM backends too, so one tool
  coordinates the whole arena and the `swap-model` script retires. It
  also subsumes most of LiteLLM's routing role.
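The M1 decision reduces to one ratio check on the two decode rates. A small Python sketch, assuming that any result below the ~85 % threshold counts as a "large" kyuz0 lead (the README does not pin that boundary down); `pick_end_state` and its return strings are hypothetical and are not the output format of `bench-engines`.

```python
def pick_end_state(ollama_decode_tps, llamacpp_decode_tps, threshold=0.85):
    """Choose between the two end-states from decode throughput (tokens/s)."""
    if llamacpp_decode_tps <= 0:
        raise ValueError("llama.cpp decode rate must be positive")
    # Fraction of the tuned llama.cpp build's decode speed that Ollama reaches.
    ratio = ollama_decode_tps / llamacpp_decode_tps
    if ratio >= threshold:
        return "option-1: Ollama + vLLM"
    return "option-2: llama.cpp behind llama-swap + vLLM"
```

Keeping the threshold as a parameter makes it easy to rerun the decision if the cut-off is revised after the benchmark.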
 Two alternatives also considered, both weaker fits:
 - **[LocalAI](https://github.com/mudler/LocalAI)** — bigger "replace
  your whole stack" platform with 36+ backends; would force a deeper
  rewrite for marginal gain on a 4-model stack.
 - **[GPUStack](https://github.com/gpustack/gpustack)** — multi-node
  cluster manager (DB-backed, UI-first); built for fleets, overkill
  on a single Strix Halo box.
 ## Tunables
 Top of `deploy.py`:
@@ -272,3 +372,77 @@ model server.
 The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
 gfx1151-optimized with rocWMMA flash attention (the latter only kicks
 in on the ROCm tags). Auto-rebuilt against llama.cpp master.
 ## Remote IDE (code-server)
 **code-server** (`/srv/docker/code-server`, http://framework:8443) —
 full VS Code in the browser, served from the box. The point: Claude
 Code (extension from Open VSX + CLI) runs *inside the container*, so
 long agent tasks execute here and survive the laptop sleeping. Any
 tailnet device gets the same editor and the same running session.
 Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
 see `/srv/docker/code-server/README.md` for the one-time extension
 install + sign-in flow. No password is set (Tailscale-only trust
 model, like everything else here) — set `HASHED_PASSWORD` before this
 host ever sees other traffic; a browser terminal is a shell.
 The workspace is container-scoped on purpose (no host filesystem, no
 docker socket). For host-level Claude work, use the section below
 instead.
 ## Workspace manager (Coder) — pilot
 **Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
 workspace manager: a web dashboard that stamps out per-project dev
 containers from Terraform templates. The shipped `code-server` template
 gives each workspace browser VS Code with the Claude Code extension
 (from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
 home volumes persist `~/.claude` creds and repos across stop/start.
 Workspaces publish no host ports — everything tunnels through the
 dashboard — and idle autostop hands parked projects' RAM back to the
 inference stacks.
 Bring-up + template push + first workspace:
 `/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
 Postgres password) is auto-generated by `./run.sh` — nothing to fill
 in by hand.
 This is a **pilot** against the standalone code-server stack above:
 after a week of real use, one of them retires. Same security posture
 as OpenHands (docker-socket holder, spawns code-running containers) —
 Tailscale-only, never expose :7080 further.
 ## Claude Code on the box (remote-control)
 For long unattended runs steered from a phone or browser — no IDE
 involved — Claude Code's official Remote Control feature bridges a
 session running on the box to claude.ai/code and the Claude mobile
 app. It needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max plan;
 API keys don't work for this). Traffic relays outbound-only over TLS
 to Anthropic, so no inbound ports are needed.
 ```sh
 ssh framework
 tmux new -s rc
 cd ~/src/whatever
 claude remote-control --name "Strix Halo"
 # prints a session URL → open on any device, or find it in the
 # session list at claude.ai/code / the Claude app's Code tab
 ```
 Operational notes (state of the feature as of 2026-06):
 - The local process is the engine — tasks keep executing with **no
  client attached**; clients are viewports. But if the process dies,
  the session is gone (no `--resume` equivalent for remote-control
  sessions yet, issue #30447). Hence tmux, always.
 - It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
 - `--spawn worktree` gives each connecting session its own git
  worktree; `--spawn same-dir` (default) shares the directory.
 - Headless boxes print the session URL but no QR (#34764) — copy the
  URL or use the session list.
 - Mobile clients sometimes drop after long idle (#28914, #34255);
  reconnecting always works — the task on the box is unaffected.
 - Run `claude` interactively once per project dir first to clear the
  workspace-trust prompt before automating anything.
--- a/pyinfra/framework/compose/code-server.yml
+++ b/pyinfra/framework/compose/code-server.yml
@@ -0,0 +1,55 @@
 # code-server — VS Code in the browser, served from the box. The point:
 # Claude Code (extension + CLI) runs *inside this container*, so long
 # agent tasks keep running when the laptop sleeps. Any device on the
 # tailnet gets the full VS Code experience at http://framework:8443.
 #
 # linuxserver image over codercom/code-server for the PUID/PGID
 # convention (matches host `noise` UID 1000 so workspace files don't
 # come out root-owned) and the s6 init that chowns /config.
 #
 # Extensions come from Open VSX (code-server's default registry). The
 # official Claude Code extension is published there:
 # https://open-vsx.org/extension/Anthropic/claude-code
 #
 # Persistent state: /config is the container user's $HOME — extensions,
 # settings, ~/.claude (OAuth creds + session history), ~/.local/bin
 # (claude CLI). Survives container recreation; back up one path.
 services:
  code-server:
    image: lscr.io/linuxserver/code-server:latest
    container_name: code-server
    restart: unless-stopped
    # No PASSWORD env set — auth is disabled, same trust model as every
    # other service here: reachable over Tailscale only. If this host
    # ever sees LAN/internet traffic, set HASHED_PASSWORD (and even then
    # prefer "127.0.0.1:8443:8443" + tunnel — a browser terminal is a
    # shell on the box's docker network).
    ports:
      - "8443:8443"
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      # Open this folder by default instead of the "welcome" screen.
      - DEFAULT_WORKSPACE=/workspace
      # tmux inside the container: detachable terminals for long claude
      # runs, so a browser-tab close or code-server reconnect hiccup
      # can't orphan your view of a session (the process itself never
      # depended on the tab — it's a child of the container).
      - DOCKER_MODS=linuxserver/mods:universal-package-install
      - INSTALL_PACKAGES=tmux
    extra_hosts:
      # Reach the sibling services (Ollama :11434, LiteLLM :4000,
      # Phoenix :6006, ...) from the integrated terminal / extensions.
      - "host.docker.internal:host-gateway"
    volumes:
      - /srv/docker/code-server/config:/config
      # Default project area. Container-scoped on purpose — Claude Code
      # in here can't touch /srv/docker compose stacks or the host.
      # Mount additional repos explicitly when you want them editable:
      #   - /home/noise/src/somerepo:/workspace/somerepo
      - /srv/docker/code-server/workspace:/workspace
--- a/pyinfra/framework/compose/code-server/README.md
+++ b/pyinfra/framework/compose/code-server/README.md
@@ -0,0 +1,85 @@
 # code-server — VS Code in the browser
 Full VS Code served from the box at <http://framework:8443>. The reason
 it exists in this stack: **Claude Code runs inside the container**, so
 long agent tasks execute on the Framework Desktop and survive the
 laptop sleeping, the browser tab closing, or you walking away with the
 phone. Any device on the tailnet gets the same editor with the same
 state.
 ## Bring-up
 ```sh
 cd /srv/docker/code-server && docker compose up -d
 ```
 First start pulls the image and runs the universal-package-install mod
 (installs tmux), so give it a minute before the UI answers on :8443.
 > **HTTPS is required for extension panels.** VS Code webviews (the
 > Claude Code panel included) need a secure context; over plain
 > `http://framework:8443` they render blank. Serve it via Tailscale:
 > `sudo tailscale serve --bg --https=8443 8443` →
 > `https://framework.<tailnet>.ts.net:8443`. Localhost also counts as
 > secure (`ssh -L 8443:localhost:8443 framework`).
 ## One-time setup (in the code-server UI)
 1. **Install the Claude Code extension.** Extensions panel → search
   "Claude Code". code-server uses the Open VSX registry, where
   Anthropic publishes the official extension
   (<https://open-vsx.org/extension/Anthropic/claude-code>).
 2. **Sign in.** The OAuth flow in a browser context is a copy/paste
   dance: the extension shows a URL → open it in another tab → approve
   → paste the code back. Credentials land in `/config/.claude/` and
   persist across container recreates.
 3. **(Optional) CLI in the terminal.** The extension bundles its own
   CLI, but for tmux-based long runs install it explicitly:
   ```sh
   curl -fsSL https://claude.ai/install.sh | bash
   ```
   Lands in `~/.local/bin` (= `/config/.local/bin`, persisted).
 ## Long-running tasks
 The Claude process is a child of the container, not of your browser
 tab — closing the tab does nothing to it. Reattach from any device and
 the session is where you left it. For multi-hour unattended runs,
 prefer a tmux pane in the integrated terminal:
 ```sh
 tmux new -s longtask
 claude    # kick off the task, detach with C-b d
 ```
 If you want to steer from a phone or claude.ai/code instead of this UI,
 use `claude remote-control` (see the framework README's
 "Claude Code on the box" section) — that works in the host's tmux too,
 no container needed.
 ## Scope and security
 - **No password is set** — same trust model as every other service on
  this Tailscale-only box. If the host ever sees LAN/internet traffic,
  set `HASHED_PASSWORD` in the compose env and bind to localhost; a
  browser terminal is a shell.
 - The workspace is **container-scoped on purpose**: Claude Code in here
  sees `/workspace` and `/config`, not the host filesystem or the
  docker socket. Bind-mount specific repos into `/workspace/<name>`
  when you want them editable (host UID 1000 == container PUID 1000,
  so ownership just works).
 - `host.docker.internal` reaches the sibling services — handy for
  pointing scripts at Ollama (:11434) or LiteLLM (:4000) from the
  integrated terminal.
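The `host.docker.internal` bullet can be made concrete with a small sketch run from the integrated terminal. It assumes the ports named above and Ollama's `/api/tags` model-listing endpoint; the `SERVICES` table and both function names are illustrative helpers, not part of this repo.

```python
import json
import urllib.request

# Ports as listed in this README; the dict itself is a hypothetical helper.
SERVICES = {"ollama": 11434, "litellm": 4000}


def service_url(name: str, path: str = "/") -> str:
    """Build a URL for a sibling service as seen from inside the container."""
    return f"http://host.docker.internal:{SERVICES[name]}{path}"


def list_ollama_models(timeout: float = 5.0) -> list[str]:
    """Return model names reported by Ollama's /api/tags endpoint."""
    url = service_url("ollama", "/api/tags")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]
```

Calling `list_ollama_models()` only works while the Ollama stack is up; a connection error here usually means the target service is stopped rather than that the container is misconfigured.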
 ## State layout
 | Path (host)                            | What lives there                                  |
 | -------------------------------------- | ------------------------------------------------- |
 | `/srv/docker/code-server/config`       | `$HOME`: extensions, settings, `~/.claude`, CLIs  |
 | `/srv/docker/code-server/workspace`    | default project area (`DEFAULT_WORKSPACE`)        |
 Back up `config` if you care about extension state and Claude session
 history; everything else is reproducible from the repo.
--- a/pyinfra/framework/compose/coder.yml
+++ b/pyinfra/framework/compose/coder.yml
@@ -0,0 +1,64 @@
 # Coder — self-hosted workspace manager (PILOT). Web dashboard at :7080
 # that stamps out per-project dev containers from Terraform templates;
 # each workspace gets browser code-server with the Claude Code extension
 # pre-installed. Evaluating against the standalone code-server stack
 # (compose/code-server.yml, kept as-is during the pilot) — verdict
 # criteria in compose/coder/README.md.
 #
 # The server container holds the host docker socket and spawns workspace
 # containers as siblings (same pattern + same security posture as
 # OpenHands: fine Tailscale-only, never expose further). group_add needs
 # the socket's host GID — host-specific, so it comes from the sibling
 # .env, which pyinfra generates on the box on first deploy along with a
 # random one-time Postgres password.
 #
 # Workspaces don't publish host ports — code-server and terminals are
 # reached through the dashboard's tunnel. No port bookkeeping per
 # project.
 services:
  coder:
    # Pin and bump deliberately — Coder releases weekly and the workspace
    # provisioner/agent protocol moves with it. Verify at
    # https://github.com/coder/coder/releases.
    image: ghcr.io/coder/coder:v2.33.8
    container_name: coder
    restart: unless-stopped
    ports:
      - "7080:7080"
    environment:
      CODER_HTTP_ADDRESS: "0.0.0.0:7080"
      # The URL clients are told to use — tailnet MagicDNS name, same
      # convention as every other service tile.
      CODER_ACCESS_URL: "${CODER_ACCESS_URL}"
      CODER_PG_CONNECTION_URL: "postgresql://coder:${POSTGRES_PASSWORD}@database/coder?sslmode=disable"
    group_add:
      # GID of /var/run/docker.sock — generated into .env by deploy.py.
      - "${DOCKER_GROUP_ID}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # Repo-shipped Terraform templates (source of truth:
      # pyinfra/framework/compose/coder/templates/). Push after changes:
      #   docker compose exec coder coder templates push code-server \
      #     --directory /templates/code-server --yes
      - /srv/docker/coder/templates:/templates:ro
    depends_on:
      database:
        condition: service_healthy
  database:
    image: postgres:17
    container_name: coder-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: coder
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: coder
    volumes:
      # Entrypoint starts as root and chowns this to the postgres uid;
      # deploy.py just creates the mount point.
      - /srv/docker/coder/postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U coder -d coder"]
      interval: 5s
      timeout: 5s
      retries: 5
--- a/pyinfra/framework/compose/coder/README.md
+++ b/pyinfra/framework/compose/coder/README.md
@@ -0,0 +1,110 @@
 # Coder — workspace manager (pilot)
 Self-hosted control plane at <http://framework:7080> that creates
 per-project dev containers from Terraform templates. Each workspace:
 its own container, browser code-server with the Claude Code extension
 pre-installed, no host-port bookkeeping (everything tunnels through the
 dashboard), and idle autostop so parked projects give their RAM back to
 the inference stacks.
 **Pilot status.** Evaluating against the standalone code-server stack
 (`/srv/docker/code-server`, kept as-is meanwhile). Verdict after a week
 of real use: does the dashboard + autostop + create-from-template earn
 an always-on control plane + Postgres? If yes, the standalone stack
 retires; if no, Coder goes and a thin `workspace` spawner script
 replaces it.
 ## Bring-up
 ```sh
 cd /srv/docker/coder && docker compose up -d
 ```
 The sibling `.env` (DOCKER_GROUP_ID, CODER_ACCESS_URL, random Postgres
 password) is generated by deploy.py on first `./run.sh` — no hand-fill
 needed. First visit to <http://framework:7080> creates the admin
 account (pick anything; it's local to the box).
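For orientation, the generated `.env` holds three keys. The sketch below mirrors in Python what the deploy step derives on the box; it is an illustration under stated assumptions, not the deploy code itself, and `render_env` / `generate_env` are hypothetical names.

```python
import os
import secrets


def render_env(docker_gid: int, password: str,
               access_url: str = "http://framework:7080") -> str:
    """Render the three keys the Coder compose file reads from .env."""
    return (
        f"DOCKER_GROUP_ID={docker_gid}\n"
        f"CODER_ACCESS_URL={access_url}\n"
        f"POSTGRES_PASSWORD={password}\n"
    )


def generate_env(sock: str = "/var/run/docker.sock") -> str:
    """Derive the host-specific values: socket GID and a random password."""
    gid = os.stat(sock).st_gid          # same value `stat -c %g` reports
    password = secrets.token_hex(16)    # 32 hex characters
    return render_env(gid, password)
```

Because every value is derivable from the host, nothing needs to be filled in by hand; only `CODER_ACCESS_URL` changes later if the dashboard moves behind HTTPS.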
 > **HTTPS is required for extension panels.** VS Code webviews (the
 > Claude Code panel, markdown preview, etc.) run on service workers,
 > which browsers only allow in a secure context — over plain
 > `http://framework:7080` the editor works but webview panels render
 > blank. Fix via Tailscale Serve (real Let's Encrypt cert for the
 > tailnet name; enable "HTTPS Certificates" once in the Tailscale
 > admin console):
 >
 > ```sh
 > sudo tailscale serve --bg 7080
 > # then in .env: CODER_ACCESS_URL=https://framework.<tailnet>.ts.net
 > docker compose up -d        # recreate server with new access URL
 > # restart any existing workspace — app URLs derive from the access URL
 > ```
 >
 > Localhost is also a secure context, so
 > `ssh -L 7080:localhost:7080 framework` + http://localhost:7080 works
 > in a pinch.
 ## Push the template (one-time + after edits)
 The `coder` CLI ships inside the server image; authenticate it once:
 ```sh
 docker compose exec coder coder login http://localhost:7080
 # prints a /cli-auth URL — open http://framework:7080/cli-auth in your
 # browser, copy the session token, paste it back
 ```
 Then push:
 ```sh
 docker compose exec coder coder templates push code-server \
  --directory /templates/code-server --yes
 ```
 Template source of truth is the repo
 (`pyinfra/framework/compose/coder/templates/code-server/main.tf`) —
 edit there, `./run.sh`, re-push. Edits on the box get overwritten.
 ## First workspace
 Dashboard → Workspaces → Create → `code-server` template → name it
 after the project. Open the code-server app tile, then one-time
 Claude sign-in: the extension shows an OAuth URL → open in another tab
 → approve → paste the code back. Credentials live in `~/.claude` on
 the workspace's home volume and survive stop/start and rebuilds.
 Set **idle autostop** under Template → Settings → Schedule (suggest
 1–2 h inactivity). Activity = open code-server tab, SSH, web terminal.
 ## What persists where
 | Thing                            | Where                                        |
 | -------------------------------- | -------------------------------------------- |
 | Workspace home (repos, ~/.claude, extensions) | named volume `coder-<workspace-id>-home` |
 | Control-plane state (users, templates, workspace defs) | Postgres → `/srv/docker/coder/postgres` |
 | Template source                  | this repo, shipped to `/srv/docker/coder/templates` |
 A stopped workspace's container is deleted; only the home volume
 remains. `docker volume ls | grep coder-` to audit.
 ## Security
 Same posture as OpenHands: the server container holds the docker
 socket and spawns code-running containers — root-equivalent on the
 box. Tailscale-only exposure is the mitigation; never forward :7080
 anywhere else. Coder does have real auth (the admin account), but
 treat that as defense-in-depth, not as permission to expose it.
 ## Notes
 - Workspaces reach the sibling services via `host.docker.internal`
  (Ollama :11434, LiteLLM :4000, Phoenix :6006, ...).
 - Long Claude runs: same rules as anywhere — the process lives in the
  workspace container, so it survives laptop/browser disconnects, but
  **autostop will kill an idle-looking workspace mid-run**. For
  multi-hour unattended tasks either bump the workspace's TTL in the
  dashboard or use the host-side tmux + `claude remote-control`
  pattern (framework README, "Claude Code on the box").
 - The base image is `codercom/enterprise-base:ubuntu` (sudo-enabled,
  common toolchain). Per-project images are a later refinement —
  swap the `image` in main.tf or parameterize with `coder_parameter`.
--- a/pyinfra/framework/compose/coder/templates/code-server/main.tf
+++ b/pyinfra/framework/compose/coder/templates/code-server/main.tf
@@ -0,0 +1,121 @@
 # code-server workspace template — one dev container per project, with
 # browser VS Code and Claude Code (extension + CLI) ready on first open.
 #
 # Source of truth is the repo copy at
 # pyinfra/framework/compose/coder/templates/code-server/main.tf; pyinfra
 # ships it to /srv/docker/coder/templates/, mounted read-only into the
 # server container at /templates. Push after edits:
 #   cd /srv/docker/coder
 #   docker compose exec coder coder templates push code-server \
 #     --directory /templates/code-server --yes
 terraform {
  required_providers {
    coder = {
      source = "coder/coder"
    }
    docker = {
      source = "kreuzwerker/docker"
    }
  }
 }
 provider "coder" {}
 # Talks to the host daemon via the socket mounted into the server
 # container (default unix:///var/run/docker.sock) — workspace containers
 # are siblings of the compose stacks, not children.
 provider "docker" {}
 data "coder_workspace" "me" {}
 data "coder_workspace_owner" "me" {}
 resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"
  # Claude Code CLI — native installer, lands in ~/.local/bin inside the
  # persisted home volume, so this is a no-op after the first start. The
  # code-server extension bundles its own CLI, but having `claude` on
  # PATH enables tmux-based long runs in the workspace terminal.
  startup_script = <<-EOT
    set -e
    command -v claude >/dev/null 2>&1 || curl -fsSL https://claude.ai/install.sh | bash
  EOT
  env = {
    GIT_AUTHOR_NAME     = data.coder_workspace_owner.me.full_name
    GIT_AUTHOR_EMAIL    = data.coder_workspace_owner.me.email
    GIT_COMMITTER_NAME  = data.coder_workspace_owner.me.full_name
    GIT_COMMITTER_EMAIL = data.coder_workspace_owner.me.email
  }
  metadata {
    display_name = "CPU"
    key          = "cpu"
    script       = "coder stat cpu"
    interval     = 10
    timeout      = 1
  }
  metadata {
    display_name = "RAM"
    key          = "mem"
    script       = "coder stat mem"
    interval     = 10
    timeout      = 1
  }
 }
 # Browser VS Code inside the workspace, surfaced as a dashboard app.
 # Extensions install from Open VSX — anthropic.claude-code is the
 # official Claude Code extension
 # (https://open-vsx.org/extension/Anthropic/claude-code). OAuth creds
 # land in ~/.claude inside the home volume and survive rebuilds.
 module "code_server" {
  count    = data.coder_workspace.me.start_count
  source   = "registry.coder.com/coder/code-server/coder"
  version  = "~> 1.0"
  agent_id = coder_agent.main.id
  folder   = "/home/coder/project"
  extensions = [
    "anthropic.claude-code",
  ]
 }
 # Home survives workspace stop/start AND template-driven rebuilds —
 # ignore_changes keeps Terraform from recreating the volume (and wiping
 # ~/.claude, extensions, repos) when template metadata shifts.
 resource "docker_volume" "home" {
  name = "coder-${data.coder_workspace.me.id}-home"
  lifecycle {
    ignore_changes = all
  }
 }
 resource "docker_container" "workspace" {
  # start_count is 0 when the workspace is stopped — the container is
  # deleted but the home volume above persists. This is what idle
  # autostop reclaims: RAM back to the inference stacks.
  count    = data.coder_workspace.me.start_count
  image    = "codercom/enterprise-base:ubuntu"
  name     = "coder-${data.coder_workspace_owner.me.name}-${lower(data.coder_workspace.me.name)}"
  hostname = data.coder_workspace.me.name
  # Agent bootstrap, rewritten to reach the control plane through the
  # docker bridge (the workspace is a sibling container; "localhost"
  # inside it isn't the Coder server).
  entrypoint = ["sh", "-c", replace(coder_agent.main.init_script, "/localhost|127\\.0\\.0\\.1/", "host.docker.internal")]
  env        = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
  host {
    host = "host.docker.internal"
    ip   = "host-gateway"
  }
  volumes {
    container_path = "/home/coder"
    volume_name    = docker_volume.home.name
    read_only      = false
  }
 }
--- a/pyinfra/framework/compose/homepage/services.yaml
+++ b/pyinfra/framework/compose/homepage/services.yaml
@@ -95,6 +95,28 @@
        server: localhost-docker
        container: openhands
    - code-server:
        icon: code-server.svg
        href: http://framework:8443
        description: VS Code in the browser — Claude Code long tasks live here
        server: localhost-docker
        container: code-server
    - Coder:
        icon: coder.svg
        href: http://framework:7080
        description: Workspace manager (pilot) — per-project containers from templates
        server: localhost-docker
        container: coder
        # /api/v2/buildinfo is unauthenticated — cheap liveness + version.
        widget:
          type: customapi
          url: http://host.docker.internal:7080/api/v2/buildinfo
          refreshInterval: 30000
          mappings:
            - field: version
              label: Version
 - Observability:
    - Beszel:
        icon: beszel.svg
--- a/pyinfra/framework/compose/qwable.yml
+++ b/pyinfra/framework/compose/qwable.yml
@@ -0,0 +1,102 @@
 # Qwable-3.6-27B (Qwen3.6-27B fine-tuned on Fable-5-style reasoning
 # traces — "Qwen + Fable") via the kyuz0 rocm-7.2.2 Strix Halo toolbox.
 # Same image + unified-memory recipe as compose/llama.yml; deltas are
 # model path, port, alias.
 # https://github.com/kyuz0/amd-strix-halo-toolboxes
 # Model: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b (MIT)
 #
 # What it's for. A "thinks-like-Fable-5" interactive model — structured,
 # step-by-step explanatory output. Dense 27B (NOT MoE), so it's slower
 # per token than the 30B-A3B MoE workhorses despite being smaller on
 # disk: all 27B weights load per token. Bandwidth math (256 GB/s ÷
 # ~16.5 GB) → ~10-15 tok/s decode. Interactive but not snappy.
 #
 # Coexistence. At ~16.5 GB (Q4_K_M) it's the smallest GPU resident here
 # and fits alongside llama 30B (port 8080), Ollama, or Kimi in the
 # ~110 GB merged arena. It does NOT fit alongside qwen3-235b (88.8 GB)
 # or comfyui — swap-model tears those down for the `qwable` target.
 # `restart: "no"`: you bring it up deliberately via swap-model, it won't
 # auto-start after a reboot and surprise-collide with a big model.
 #
 # Weights. Single-file GGUF (not sharded). Download path on the box
 # (see compose/qwable/README.md):
 #   hf download Mia-AiLab/Qwable-3.6-27b \
 #       'Qwable-27b_Q4_K_M.gguf' \
 #       --local-dir /models/qwen/Qwable-3.6-27b
 # Verify exact filename in the HF repo before downloading.
 #
 # Port 8082 — distinct from llama 30B (8080) and qwen3-235b (8081).
 services:
  qwable:
    image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2
    container_name: qwable
    # Manual start only — see header note about GPU contention with
    # the big models. swap-model brings it up/down.
    restart: "no"
    devices:
      # ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan
      # only needs dri. Don't drop kfd when on the rocm-* tag.
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    # Numeric GIDs of host's video (44) and render (991) groups —
    # required for /dev/kfd + /dev/dri access from inside the container.
    group_add:
      - "44"
      - "991"
    shm_size: 8g
    ipc: host
    environment:
      # Unified-memory recipe (same as compose/llama.yml + kimi-linear +
      # qwen3-235b). BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these
      # flags merge the rocminfo pools into one ~110 GB arena. kyuz0's
      # image is native gfx1151 so no HSA_OVERRIDE_GFX_VERSION.
      - HSA_XNACK=1
      - HSA_FORCE_FINE_GRAIN_PCIE=1
    volumes:
      - /models:/models:ro
    ports:
      - "8082:8082"
    entrypoint: ["llama-server"]
    command:
      - --model
      - /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf
      # OpenAI-compatible served name (matches what opencode/curl request
      # as "model"). Provider-side name lives in opencode.json if/when
      # this gets wired as a provider.
      - --alias
      - qwable
      - --host
      - 0.0.0.0
      - --port
      - "8082"
      # Push all layers to GPU. "999" = all available. A 27B Q4 (~16.5 GB)
      # fits the merged arena with huge headroom.
      - --n-gpu-layers
      - "999"
      # 64K to match llama/qwen3-235b — keeps opencode auto-compaction
      # behaviour consistent across providers. Tons of arena headroom
      # here (model is small), so this can ramp far higher if a workflow
      # needs it; see compose/qwable/README.md.
      - --ctx-size
      - "65536"
      # No-mmap is the Strix Halo standard — forces full GPU load.
      - --no-mmap
      # Flash attention — required for q8_0 KV cache; modern llama-server
      # takes a value (on/off/auto), bare --flash-attn is deprecated.
      - --flash-attn
      - "on"
      # Quantize the KV cache to 8-bit (q8_0): roughly halves KV memory
      # versus the f16 default at minor/no quality loss. Matches the
      # other llama.cpp stacks.
      - --cache-type-k
      - q8_0
      - --cache-type-v
      - q8_0
      # Use the model's embedded jinja chat template — Qwable inherits
      # Qwen3.6's chat format, which the Fable-trace fine-tune relies on.
      - --jinja
      # Expose Prometheus metrics at /metrics — scraped by OpenLIT.
      - --metrics
--- a/pyinfra/framework/compose/qwable/README.md
+++ b/pyinfra/framework/compose/qwable/README.md
@@ -0,0 +1,130 @@
 # qwable
 Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
 of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
 collected examples formatted like Fable 5's deliberate, step-by-step
 answers and trained Qwen to reproduce that structured, explanatory
 output. Think of it as a local "thinks-like-Fable" model.
 OpenAI-compatible endpoint at `http://framework:8082` once running.
 ## Dense, not MoE (read first)
 Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
 **dense** 27B — every weight loads per token. On this bandwidth-bound
 box (256 GB/s ÷ ~16.5 GB) that's **~10-15 tok/s** decode, slower than
 the MoE 30B (~100 tok/s, only ~3B active). So Qwable is for when you
 specifically want the Fable-style reasoning, not for raw throughput.
 The interactive daily driver stays on Ollama / llama 30B.
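The bandwidth arithmetic behind that estimate can be written out as a quick sketch. It uses only the two figures quoted above and assumes decode is purely memory-bound, with every token streaming the full weight file once; the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on decode speed when each token streams the weights once."""
    return bandwidth_gb_s / weights_read_gb


# Figures from this README: 256 GB/s memory bandwidth, ~16.5 GB of Q4_K_M weights.
ceiling = decode_ceiling_tok_s(256.0, 16.5)
print(f"dense 27B ceiling: {ceiling:.1f} tok/s")
```

The result is a theoretical ceiling of about 15.5 tok/s, which is why the realistic band sits at or below 15.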
 ## Coexistence notes
 At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:
 | Concurrent service | Coexists? |
 |---|---|
 | `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
 | `ollama` (11434) | ✅ yes |
 | `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
 | `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
 | `comfyui` (8188) | ❌ no — swap-model stops it |
 `restart: "no"`: you bring it up deliberately (via `swap-model qwable`);
 it won't auto-start after a reboot and surprise-collide with a big model.
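The coexistence table reduces to simple budget arithmetic, sketched below. It treats the table's "total" figures as plain sums (an assumption), back-derives the `llama` and `kimi-linear` footprints from them, and ignores KV cache and runtime overhead; `ARENA_GB`, `FOOTPRINT_GB`, and `free_gb` are illustrative names.

```python
ARENA_GB = 110.0  # approximate merged arena quoted in this README

# Approximate footprints; llama and kimi-linear are back-derived from the
# table totals (total minus Qwable's 16.5 GB), which is an assumption.
FOOTPRINT_GB = {
    "qwable": 16.5,
    "llama": 35.0 - 16.5,
    "kimi-linear": 47.0 - 16.5,
    "qwen3-235b": 88.8,
}


def free_gb(*services: str) -> float:
    """Arena left over after loading the named services."""
    return ARENA_GB - sum(FOOTPRINT_GB[s] for s in services)


for other in ("llama", "kimi-linear", "qwen3-235b"):
    print(f"qwable + {other}: {free_gb('qwable', other):.1f} GB free")
```

Pairing Qwable with `qwen3-235b` leaves under 5 GB for both KV caches and everything else, which is what "too tight" means in the table.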
 ## Prereqs
 - Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
 - BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
  Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
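To sanity-check what that `ttm.pages_limit` value means in memory terms, a short sketch follows. It assumes the standard 4 KiB page size on x86-64; the sample cmdline string and the function name are illustrative.

```python
import re
from typing import Optional

PAGE_SIZE_BYTES = 4096  # assumption: standard 4 KiB pages on x86-64


def ttm_limit_gib(cmdline: str) -> Optional[float]:
    """Convert ttm.pages_limit=<pages> from a kernel cmdline into GiB."""
    match = re.search(r"\bttm\.pages_limit=(\d+)", cmdline)
    if match is None:
        return None  # parameter not present on this boot
    return int(match.group(1)) * PAGE_SIZE_BYTES / 2**30


sample = "quiet splash ttm.pages_limit=33554432"  # illustrative cmdline
print(ttm_limit_gib(sample))
```

On the box, pass `open("/proc/cmdline").read()` instead of the sample string; a `None` result means the kernel parameter is not active.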
 ## Download weights (~16.5 GB, single file)
 ```sh
 # /models/qwen exists via pyinfra; just create the model subdir.
 mkdir -p /models/qwen/Qwable-3.6-27b
 hf download Mia-AiLab/Qwable-3.6-27b \
    'Qwable-27b_Q4_K_M.gguf' \
    --local-dir /models/qwen/Qwable-3.6-27b
 # File lands at:
 #   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)
 ```
 Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
 needs ~17 GB free on `/models`.
 > Abliterated variant (refusals removed) lives at
 > `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
 > (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
 > Not the default — no safety filtering, careful with it.
 ## Bring up
 Easy path — `swap-model` handles stop-conflicting-services + waits for
 `/health`:
 ```sh
 ssh framework swap-model qwable     # ~1-2 min cold load (16.5 GB)
 ssh framework /srv/docker/qwable/smoke.sh    # perf measure
 ```
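The "waits for `/health`" step amounts to a polling loop. The sketch below shows the idea only and is not the actual `swap-model` script; the function names, timeout, and interval are assumptions. It relies on llama-server answering `/health` with HTTP 200 once the model has finished loading.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """True when the endpoint answers 200; errors and non-2xx count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(url: str, timeout_s: float = 180.0, interval_s: float = 2.0,
                    probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll until the probe succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

A call such as `wait_for_health("http://127.0.0.1:8082/health")` would block through the cold load; the `probe` parameter exists so the loop can be exercised without a running server.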
 Manual equivalent (first-ever bring-up, before the image is cached):
 ```sh
 cd /srv/docker/qwable
 docker compose pull       # already-cached image if you ran llama first
 docker compose up -d
 docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8082"
 ./smoke.sh                # /health + tiny generation + perf
 ```
 First start is ~1-2 min (16.5 GB load off disk; much faster than the
 235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
 band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
 qwen3-235b/README.md "Troubleshooting" for the arena checks).
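Those thresholds can be captured in a tiny helper when scripting around the smoke test. This is a sketch: the README only defines the healthy band and the investigate threshold, so the middle "below band" label is an assumption.

```python
def classify_decode(tok_s: float) -> str:
    """Map a predicted_per_second reading onto the bands described above."""
    if tok_s < 6.0:
        return "investigate"   # README: likely arena < 100 GB
    if tok_s < 10.0:
        return "below band"    # assumption: worth a re-run before digging in
    return "healthy"           # README band is 10-15 tok/s


for reading in (4.2, 8.0, 12.5):
    print(reading, classify_decode(reading))
```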
 ## Ramping context
 Defaults to 64K to match the other llama.cpp stacks (keeps opencode
 auto-compaction consistent across providers). The model is tiny relative
 to the arena, so there's plenty of room to push higher:
 | Stage | `--ctx-size` | Margin in arena |
 |---|---|---|
 | **Current default** | **65536** | huge (~90 GB free) |
 | Stretch | 131072+ | still comfortable |
 Edit `--ctx-size` in `docker-compose.yml`, run
 `docker compose down && docker compose up -d`, then re-run `./smoke.sh`.
 The real ceiling is Qwable's trained context length (inherited from
 Qwen3.6-27B), not arena memory, so verify the model's max positions
 before going past 128K.
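How KV-cache memory scales with `--ctx-size` can be estimated with the usual formula: two tensors (K and V) per layer, each `n_kv_heads × head_dim` elements per token. The sketch below is a rough guide only. The layer and head counts are placeholders, NOT Qwable's real configuration, and the formula assumes full attention in every layer; hybrid architectures cache less. The q8_0 size follows from its block layout (32 values stored in 34 bytes).

```python
# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into 34 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(ctx: int, cache_type: str, n_layers: int = 64,
                 n_kv_heads: int = 8, head_dim: int = 128) -> float:
    """Estimate KV-cache size in GiB; the architecture defaults are hypothetical."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx  # K and V tensors
    return elements * BYTES_PER_ELEM[cache_type] / 2**30


for ctx in (65536, 131072):
    print(ctx, kv_cache_gib(ctx, "f16"), kv_cache_gib(ctx, "q8_0"))
```

With these placeholder dimensions the cache grows linearly with context, and q8_0 costs a little over half of f16. Substitute the values from the model's own config before trusting the absolute numbers.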
 ## Operations
 ```sh
 docker compose logs -f                  # tail
 docker compose down                     # stop
 docker compose exec qwable bash         # shell in
 ./smoke.sh                              # health + perf
 amdgpu_top                              # GPU view on host
 ```
 ## Pin manifest
 | Component | Pin |
 |---|---|
 | Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
 | Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
 | Default port | 8082 |
 | Default context | 65536 |
 | KV cache type | q8_0 (k and v) |
 | License | MIT (model); Qwen3.6-27B base license also applies |
 ## Status
 Compose artifacts written; awaiting box-side weight pull + bring-up.
 Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
 provider only if the Fable-style reasoning proves useful in practice.
--- a/pyinfra/framework/compose/qwable/smoke.sh
+++ b/pyinfra/framework/compose/qwable/smoke.sh
@@ -0,0 +1,46 @@
 #!/usr/bin/env bash
 # Smoke-test the running qwable llama-server (port 8082). Hits /health
 # for liveness, then a tiny OpenAI-compatible chat completion, then
 # measures eval_tps via /completion. Dense 27B → expect ~10-15 tok/s.
 set -euo pipefail
 HOST="${QWABLE_HOST:-127.0.0.1:8082}"
 MODEL="${QWABLE_MODEL:-qwable}"
 echo "[smoke] GET /health on $HOST"
 curl -fsS "http://$HOST/health" | python3 -m json.tool
 echo
 echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
 curl -fsS "http://$HOST/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{
        \"model\": \"$MODEL\",
        \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
        \"max_tokens\": 16,
        \"temperature\": 0.0
    }" | python3 -m json.tool
 echo
 echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=128)"
 # 128 tokens: at ~10-15 tok/s the per-token warmup noise still matters,
 # but a dense 27B settles faster than the 235B, so there is no need to
 # cap the run at 64 tokens.
 curl -fsS "http://$HOST/completion" \
    -H 'Content-Type: application/json' \
    -d '{
        "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
        "n_predict": 128,
        "temperature": 0.0,
        "stream": false
    }' | python3 -c "
 import json, sys
 r = json.load(sys.stdin)
 t = r.get('timings', {})
 # Format a rate; a missing key prints '?' instead of crashing on :.2f.
 def rate(key):
    v = t.get(key)
    return f'{v:.2f} tok/s' if isinstance(v, (int, float)) else '?'
 print(f'predicted_per_second:  {rate(\"predicted_per_second\")}')
 print(f'prompt_per_second:     {rate(\"prompt_per_second\")}')
 print(f'predicted_n:           {t.get(\"predicted_n\", \"?\")}')
 print(f'prompt_n:              {t.get(\"prompt_n\", \"?\")}')
 "
 echo
 echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"
--- a/pyinfra/framework/compose/qwen3-235b/README.md
+++ b/pyinfra/framework/compose/qwen3-235b/README.md
@@ -53,6 +53,15 @@ Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect
 ## Bring up (M0.2 — first generation)
 Easy path — `swap-model` handles the stop-conflicting-services dance + waits for `/health`:
 ```sh
 ssh framework swap-model 235b      # ~3-5 min on first cold load
 ssh framework /srv/docker/qwen3-235b/smoke.sh    # perf measure
 ```
 Manual equivalent (for first-ever bring-up before the image is cached):
 ```sh
 cd /srv/docker/qwen3-235b
 docker compose pull       # already-cached image if you ran llama first
--- a/pyinfra/framework/deploy.py
+++ b/pyinfra/framework/deploy.py
@@ -80,6 +80,8 @@ apt.packages(
        "ca-certificates",
        "unzip",
        "software-properties-common",   # for add-apt-repository (g++-14 PPA)
        "jq",                           # JSON parsing in operator scripts (bench-engines)
        "bc",                           # float math in bench-engines
    ],
    _sudo=True,
 )
@@ -433,6 +435,7 @@ for svc in (
    "ollama",
    "kimi-linear",
    "qwen3-235b",
    "qwable",
    "litellm",
    "comfyui",
    "openwebui",
@@ -440,6 +443,8 @@ for svc in (
    "openlit",
    "phoenix",
    "openhands",
    "code-server",
    "coder",
    "homepage",
    "whisper",
    "piper",
@@ -561,6 +566,89 @@ files.directory(
    _sudo=True,
 )
 # code-server persistent state. The linuxserver image's s6 init drops to
 # PUID/PGID 1000 and treats /config as the container user's $HOME —
 # extensions, settings, ~/.claude (Claude Code OAuth creds + session
 # history), ~/.local/bin. Owned 1000:1000 to match (same pattern as
 # kokoro: the container user isn't in the docker group, so 2775
 # root:docker wouldn't help it).
 files.directory(
    name="code-server config dir",
    path=f"{COMPOSE_DIR}/code-server/config",
    user="1000",
    group="1000",
    mode="0755",
    _sudo=True,
 )
 # Default workspace. Host UID 1000 == container PUID 1000, so files
 # created on either side stay owned by the SSH user.
 files.directory(
    name="code-server workspace dir",
    path=f"{COMPOSE_DIR}/code-server/workspace",
    user=SSH_USER,
    group=SSH_USER,
    mode="2775",
    _sudo=True,
 )
 files.put(
    name="code-server: README.md",
    src="compose/code-server/README.md",
    dest=f"{COMPOSE_DIR}/code-server/README.md",
    group="docker",
    mode="0664",
    _sudo=True,
 )
 # Coder workspace manager (pilot — see compose/coder/README.md for the
 # evaluation criteria vs the standalone code-server stack). Postgres
 # chowns its own data dir at first start (entrypoint runs as root); we
 # just create the mount point. Templates are repo-sourced Terraform
 # mounted read-only into the server container, pushed with
 # `docker compose exec coder coder templates push`.
 files.directory(
    name="Coder postgres data dir",
    path=f"{COMPOSE_DIR}/coder/postgres",
    group="docker",
    mode="2775",
    _sudo=True,
 )
 files.directory(
    name="Coder templates/code-server dir",
    path=f"{COMPOSE_DIR}/coder/templates/code-server",
    group="docker",
    mode="2775",
    _sudo=True,
 )
 for asset, mode in (
    ("templates/code-server/main.tf", "0664"),
    ("README.md", "0664"),
 ):
    files.put(
        name=f"coder: {asset}",
        src=f"compose/coder/{asset}",
        dest=f"{COMPOSE_DIR}/coder/{asset}",
        group="docker",
        mode=mode,
        _sudo=True,
    )
 # Sibling .env: DOCKER_GROUP_ID (host-specific GID of the docker socket,
 # needed for the server container's group_add), the access URL, and a
 # random one-time Postgres password. Generated on the box rather than
 # placeholder-then-hand-fill because every value is derivable. Never
 # overwritten once present.
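 # Resulting file shape (right-hand sides in <> are placeholders, not
 # real values; the GID is host-specific and the password is 32 random
 # hex characters from `openssl rand -hex 16`):
 #   DOCKER_GROUP_ID=<gid of /var/run/docker.sock>
 #   CODER_ACCESS_URL=http://framework:7080
 #   POSTGRES_PASSWORD=<32 hex chars>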
 server.shell(
    name="Coder .env (generate once)",
    commands=[
        f"test -f {COMPOSE_DIR}/coder/.env || {{ "
        f"printf 'DOCKER_GROUP_ID=%s\\nCODER_ACCESS_URL=http://framework:7080\\nPOSTGRES_PASSWORD=%s\\n' "
        f'"$(stat -c %g /var/run/docker.sock)" "$(openssl rand -hex 16)" '
        f"> {COMPOSE_DIR}/coder/.env && "
        f"chown root:docker {COMPOSE_DIR}/coder/.env && "
        f"chmod 640 {COMPOSE_DIR}/coder/.env; }}",
    ],
    _sudo=True,
 )
 # Homepage config. The compose loop above only copies homepage.yml; the
 # YAML config files live in compose/homepage/ on the source side and at
 # /srv/docker/homepage/config/ on the box. Source-of-truth is the repo —
@@ -622,6 +710,22 @@ for asset, mode in (
        _sudo=True,
    )
 # Qwable operator assets. Same image as llama (kyuz0 rocm-7.2.2); dense
 # 27B Qwen3.6 fine-tuned on Fable-5 traces. Weights live at /models/qwen/
 # via manual `hf download` per the README. swap-model `qwable` target.
 for asset, mode in (
    ("smoke.sh", "0775"),
    ("README.md", "0664"),
 ):
    files.put(
        name=f"qwable: {asset}",
        src=f"compose/qwable/{asset}",
        dest=f"{COMPOSE_DIR}/qwable/{asset}",
        group="docker",
        mode=mode,
        _sudo=True,
    )
 # LiteLLM router assets. config.yaml is the source-of-truth model
 # routing table — pyinfra syncs it on every run; edits on the box get
 # overwritten. The .env file holds LITELLM_MASTER_KEY + LITELLM_SALT_KEY
@@ -758,6 +862,37 @@ files.directory(
    _sudo=True,
 )
 # --- Operator scripts -------------------------------------------------------
 # swap-model — one-command swap between which inference container is
 # GPU-resident. Encodes the coexistence table (235B doesn't fit alongside
 # anything; ollama+kimi do) + per-service health probes. Lives in
 # /usr/local/bin so the SSH user (in the docker group) can run it
 # directly: `ssh framework swap-model 235b`. See scripts/swap-model for
 # the modes and bin/swap-model for the Mac-side wrapper.
 files.put(
    name="swap-model script (box-side)",
    src="scripts/swap-model",
    dest="/usr/local/bin/swap-model",
    user="root",
    group="root",
    mode="0755",
    _sudo=True,
 )
 # bench-engines — one-shot decision tool for the GGUF-tier consolidation
 # (Ollama vs kyuz0 llama.cpp decode t/s on gfx1151). See the framework
 # README "Inference engine consolidation".
 files.put(
    name="bench-engines script (box-side)",
    src="scripts/bench-engines",
    dest="/usr/local/bin/bench-engines",
    user="root",
    group="root",
    mode="0755",
    _sudo=True,
 )
 # --- Cleanup of artifacts from the prior native-build deploy ----------------
 # All idempotent — `present=False` is a no-op when the target is absent.
--- a/pyinfra/framework/scripts/bench-engines
+++ b/pyinfra/framework/scripts/bench-engines
@@ -0,0 +1,177 @@
 #!/usr/bin/env bash
 # bench-engines — compare decode/prefill throughput of Ollama vs
 # llama.cpp (kyuz0 toolbox) on the SAME GGUF on gfx1151.
 #
 # Why this exists. The GGUF-tier consolidation decision (see the
 # framework README "Inference engine consolidation") hinges on one
 # hardware-specific unknown: how close is Ollama's bundled llama.cpp to
 # the gfx1151-tuned kyuz0 build on *this* box? If decode t/s is within
 # ~10-15 %, Ollama's convenience wins (it auto-swaps, so no llama-swap
 # needed). If kyuz0's rocWMMA flash-attention lead is large, that argues
 # for keeping llama.cpp behind llama-swap. This measures it.
 #
 # Method. Serves the identical GGUF on each engine in isolation (every
 # other GPU consumer that isolate_for knows about is stopped first),
 # warms up, then runs $RUNS raw-completion trials at a fixed decode
 # length. Reads each engine's own authoritative timing fields — no
 # token-counting guesswork:
 #   - llama.cpp /completion → .timings.{prompt,predicted}_per_second
 #   - Ollama   /api/generate → {prompt_eval,eval}_{count,duration}
 # Uses raw prompts (no chat template) on both for an apples-to-apples
 # prompt-in / tokens-out measurement.
 #
 # Run ON THE BOX (hits localhost + docker). Requires jq, curl, and bc.
 #
 # Usage:
 #   bench-engines                 # bench the model llama.yml serves
 #   bench-engines status          # show what's currently GPU-resident
 #   BENCH_RUNS=5 bench-engines     # more trials (default 3)
 #
 # To bench a different model (e.g. Qwen3.6-27B): point compose/llama.yml
 # at the new GGUF, set GGUF below to match, redeploy, rerun.
 set -euo pipefail
 COMPOSE_ROOT="/srv/docker"
 RUNS="${BENCH_RUNS:-3}"
 N_PREDICT="${BENCH_N_PREDICT:-256}"
 WAIT_TIMEOUT="${BENCH_WAIT_TIMEOUT:-600}"
 # Must match the GGUF that compose/llama.yml serves — this is the file
 # registered into Ollama so both engines run identical weights. The
 # path is the in-container path (/models is bind-mounted into both).
 GGUF="${BENCH_GGUF:-/models/qwen/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf}"
 OLLAMA_BENCH_MODEL="bench-engines"
 # A fixed, moderately long prompt so prefill is measurable. Decode is the
 # number that actually decides the consolidation (bandwidth-bound).
 read -r -d '' PROMPT <<'EOF' || true
 You are a careful systems engineer. Explain, in detail and step by step,
 how a unified-memory APU shares a single physical RAM pool between the
 CPU and an integrated GPU, what a GTT aperture is, why demand paging
 matters for large language model weights, and how this differs from a
 discrete GPU with dedicated VRAM. Be thorough and precise.
 EOF
 LLAMA_URL="http://127.0.0.1:8080"
 OLLAMA_URL="http://127.0.0.1:11434"
 need() { command -v "$1" >/dev/null 2>&1 || { echo "bench-engines: missing '$1'" >&2; exit 1; }; }
 need jq
 need curl
 is_running() { docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null | grep -q true; }
 is_healthy() { curl -fsS --max-time 5 "$1" >/dev/null 2>&1; }
 down() {
    local dir="$COMPOSE_ROOT/$1"
    is_running "$1" || return 0
    echo "  stopping $1"
    (cd "$dir" && docker compose down >/dev/null 2>&1)
 }
 up_wait() {
    local svc="$1" health="$2" deadline=$(( SECONDS + WAIT_TIMEOUT ))
    echo "  starting $svc"
    (cd "$COMPOSE_ROOT/$svc" && docker compose up -d >/dev/null 2>&1)
    printf "    waiting for health"
    while ! is_healthy "$health"; do
        (( SECONDS > deadline )) && { echo " TIMEOUT"; docker logs --tail 20 "$svc" >&2; exit 1; }
        sleep 5; printf "."
    done
    echo " ok"
 }
 # --- isolation: only the engine under test is GPU-resident ------------
 isolate_for() {
    case "$1" in
        llama)  down ollama; down qwen3-235b; down kimi-linear; down qwable; down comfyui ;;
        ollama) down llama;  down qwen3-235b; down kimi-linear; down qwable; down comfyui ;;
    esac
 }
 # --- register the GGUF into Ollama (idempotent) -----------------------
 register_ollama_model() {
    if docker exec ollama ollama list 2>/dev/null | grep -q "^${OLLAMA_BENCH_MODEL}"; then
        echo "  ollama model '${OLLAMA_BENCH_MODEL}' already registered"
        return 0
    fi
    echo "  registering ${OLLAMA_BENCH_MODEL} from ${GGUF}"
    # FROM the in-container GGUF path; num_ctx/kv match llama.yml so the
    # comparison stays fair.
    printf 'FROM %s\nPARAMETER num_ctx 65536\n' "$GGUF" \
        | docker exec -i ollama ollama create "${OLLAMA_BENCH_MODEL}" -f -
 }
 # --- one trial; echoes "prefill_tps decode_tps" -----------------------
 trial_llama() {
    local body resp
    body=$(jq -n --arg p "$PROMPT" --argjson n "$N_PREDICT" \
        '{prompt:$p, n_predict:$n, temperature:0, cache_prompt:false}')
    resp=$(curl -fsS --max-time 300 "$LLAMA_URL/completion" \
        -H 'Content-Type: application/json' -d "$body")
    echo "$resp" | jq -r '"\(.timings.prompt_per_second) \(.timings.predicted_per_second)"'
 }
 trial_ollama() {
    local body resp
    body=$(jq -n --arg m "$OLLAMA_BENCH_MODEL" --arg p "$PROMPT" --argjson n "$N_PREDICT" \
        '{model:$m, prompt:$p, raw:true, stream:false, options:{temperature:0, num_predict:$n}}')
    resp=$(curl -fsS --max-time 300 "$OLLAMA_URL/api/generate" \
        -H 'Content-Type: application/json' -d "$body")
    # durations are ns; t/s = count / (duration/1e9)
    echo "$resp" | jq -r '
        "\(.prompt_eval_count / (.prompt_eval_duration/1e9)) \(.eval_count / (.eval_duration/1e9))"'
 }
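 # Illustrative sketch, not called by this script: the ns -> t/s
 # conversion used in trial_ollama, isolated so the arithmetic can be
 # checked by hand. `tps_from_ns` is a hypothetical helper name, and the
 # numbers in the usage line are made up, not measurements.
 #   tps_from_ns 256 8000000000   # 256 tokens in 8 s
 tps_from_ns() {
    local count="$1" duration_ns="$2"
    # Guard against a zero duration, which would otherwise divide by zero.
    awk -v c="$count" -v d="$duration_ns" \
        'BEGIN { if (d <= 0) { print "0.00"; exit } printf "%.2f\n", c / (d / 1e9) }'
 }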
 # --- run R trials, print per-trial + mean decode ----------------------
 bench() {
    local engine="$1" trialfn="$2"
    echo "  warmup..."; "$trialfn" >/dev/null
    local sum_pp=0 sum_tg=0 i pp tg
    for i in $(seq 1 "$RUNS"); do
        read -r pp tg < <("$trialfn")
        printf "  trial %d: prefill %6.1f t/s   decode %6.2f t/s\n" "$i" "$pp" "$tg"
        sum_pp=$(echo "$sum_pp + $pp" | bc -l)
        sum_tg=$(echo "$sum_tg + $tg" | bc -l)
    done
    MEAN_PP=$(echo "scale=1; $sum_pp / $RUNS" | bc -l)
    MEAN_TG=$(echo "scale=2; $sum_tg / $RUNS" | bc -l)
    printf "  %s mean: prefill %s t/s   decode %s t/s\n" "$engine" "$MEAN_PP" "$MEAN_TG"
 }
 if [[ "${1:-}" == "status" ]]; then
    for c in ollama llama kimi-linear qwen3-235b qwable comfyui; do
        is_running "$c" && echo "$c: up" || echo "$c: down"
    done
    exit 0
 fi
 need bc
 echo "== llama.cpp (kyuz0 ${GGUF##*/}) =="
 isolate_for llama
 up_wait llama "$LLAMA_URL/health"
 bench "llama.cpp" trial_llama
 LLAMA_TG="$MEAN_TG"
 down llama
 echo
 echo "== Ollama (same GGUF) =="
 isolate_for ollama
 up_wait ollama "$OLLAMA_URL/api/tags"
 register_ollama_model
 bench "ollama" trial_ollama
 OLLAMA_TG="$MEAN_TG"
 echo
 echo "== Verdict =="
 # Ollama as % of llama.cpp decode throughput.
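 # Worked example with made-up numbers (not measurements): llama.cpp at
 # 40.00 t/s and Ollama at 35.00 t/s give 100 * 35.00 / 40.00 = 87.5,
 # which clears the 85% bar in the guidance below.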
 PCT=$(echo "scale=1; 100 * $OLLAMA_TG / $LLAMA_TG" | bc -l)
 printf "  llama.cpp decode: %s t/s\n  ollama decode:    %s t/s  (%s%% of llama.cpp)\n" \
    "$LLAMA_TG" "$OLLAMA_TG" "$PCT"
 echo
 echo "  Guidance: Ollama >=85% of llama.cpp  -> option 1 (Ollama + vLLM,"
 echo "            drop standalone llama.cpp; Ollama self-swaps, no llama-swap)."
 echo "            Larger gap                  -> option 2 (keep llama.cpp"
 echo "            behind llama-swap with coexistence groups; drop Ollama)."
--- a/pyinfra/framework/scripts/swap-model
+++ b/pyinfra/framework/scripts/swap-model
@@ -0,0 +1,186 @@
 #!/usr/bin/env bash
 # swap-model — coordinate which inference container is GPU-resident on
 # the Strix Halo box.
 #
 # Why this exists. The GPU's merged ~110 GB arena (BIOS UMA=0.5 GB +
 # ttm.pages_limit + HSA_XNACK; see StrixHaloMemory.md) holds at most
 # one 88 GB-class model at a time, and ROCm doesn't reclaim cleanly
 # between consumers. So switching models means stop-then-start of
 # whole compose stacks. This script encodes the per-target conflict
 # table + per-service health probes so the swap is one command.
 #
 # Usage:
 #   swap-model coder        # Qwen3-Coder-30B via Ollama (interactive)
 #   swap-model 235b         # Qwen3-235B-A22B via llama.cpp (long-task)
 #   swap-model kimi         # Kimi-Linear-48B-A3B via vLLM (long-context)
 #   swap-model qwable       # Qwable-3.6-27B via llama.cpp (Fable-style)
 #   swap-model comfyui      # ComfyUI (image generation)
 #   swap-model none         # everything down — free the GPU
 #   swap-model status       # show what's currently up
 #
 # Env knobs:
 #   SWAP_WAIT_TIMEOUT       seconds to wait for the health probe after up; default 600
 #                           (235B's 88 GB cold load can take 3-5 min)
 #
 # Out of scope (deliberately):
 #   - Always-on services (openwebui, litellm, phoenix, beszel, etc.) —
 #     no GPU footprint, left alone.
 #   - llama.cpp 30B (port 8080) — same weights as Ollama's qwen3-coder
 #     but still LL-P0 perf-evaluating. `coder` target uses Ollama only.
 #   - Multi-target combos (e.g. kimi+ollama coexist on the arena);
 #     for now run swap-model twice if you want both.
 set -euo pipefail
 COMPOSE_ROOT="/srv/docker"
 WAIT_TIMEOUT="${SWAP_WAIT_TIMEOUT:-600}"
 # --- Service table -----------------------------------------------------------
 # Map short name → compose dir (under $COMPOSE_ROOT) and health URL.
 # Container name == compose dir name in every case (intentional convention,
 # enforced in compose/*.yml's container_name fields).
 declare -A SVC_DIR=(
    [ollama]=ollama
    [llama]=llama
    [kimi]=kimi-linear
    [235b]=qwen3-235b
    [qwable]=qwable
    [comfyui]=comfyui
 )
 declare -A SVC_HEALTH=(
    [ollama]="http://127.0.0.1:11434/api/tags"
    [llama]="http://127.0.0.1:8080/health"
    [kimi]="http://127.0.0.1:8000/v1/models"
    [235b]="http://127.0.0.1:8081/health"
    [qwable]="http://127.0.0.1:8082/health"
    [comfyui]="http://127.0.0.1:8188/"
 )
 # --- Target → plan -----------------------------------------------------------
 # UP   = services that should be running after the swap
 # DOWN = services that must be stopped to free the GPU arena
 # (anything not in either list is left untouched — e.g. switching to coder
 # leaves kimi alone, since kimi(30 GB) + ollama(30 GB) fit in the arena.)
 plan() {
    UP=() ; DOWN=()
    case "$1" in
        coder)   UP=(ollama)  ; DOWN=(235b comfyui) ;;
        235b)    UP=(235b)    ; DOWN=(ollama llama kimi qwable comfyui) ;;
        kimi)    UP=(kimi)    ; DOWN=(235b comfyui) ;;
        qwable)  UP=(qwable)  ; DOWN=(235b comfyui) ;;
        comfyui) UP=(comfyui) ; DOWN=(ollama llama 235b kimi qwable) ;;
        none)    UP=()        ; DOWN=(ollama llama kimi 235b qwable comfyui) ;;
        *)       return 1 ;;
    esac
 }
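 # Example, read off the table above: `swap-model kimi` sets UP=(kimi)
 # and DOWN=(235b comfyui), so the script prints
 #   Plan: down=[235b comfyui] up=[kimi]
 # and a running ollama or qwable is left untouched.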
 # --- Probes ------------------------------------------------------------------
 is_running() {
    docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null | grep -q true
 }
 is_healthy() {
    curl -fsS --max-time 5 "$1" >/dev/null 2>&1
 }
 wait_healthy() {
    local svc="$1" url="${SVC_HEALTH[$1]}" deadline=$(( SECONDS + WAIT_TIMEOUT ))
    printf "    waiting for %s health (timeout %ss)" "$svc" "$WAIT_TIMEOUT"
    while ! is_healthy "$url"; do
        if (( SECONDS > deadline )); then
            printf " TIMEOUT\n"
            echo "    last 20 lines of container log:" >&2
            docker logs --tail 20 "${SVC_DIR[$svc]}" 2>&1 | sed 's/^/      /' >&2
            return 1
        fi
        sleep 5
        printf "."
    done
    printf " ok\n"
 }
 # --- Actions -----------------------------------------------------------------
 down_svc() {
    local svc="$1" dir="$COMPOSE_ROOT/${SVC_DIR[$1]}"
    if ! is_running "${SVC_DIR[$svc]}"; then
        echo "  $svc: already down"
        return 0
    fi
    echo "  stopping $svc"
    (cd "$dir" && docker compose down)
 }
 up_svc() {
    local svc="$1" dir="$COMPOSE_ROOT/${SVC_DIR[$1]}"
    if is_running "${SVC_DIR[$svc]}" && is_healthy "${SVC_HEALTH[$svc]}"; then
        echo "  $svc: already up + healthy"
        return 0
    fi
    if [[ ! -d "$dir" ]]; then
        echo "  $svc: compose dir $dir missing — run pyinfra deploy first" >&2
        return 1
    fi
    echo "  starting $svc"
    (cd "$dir" && docker compose up -d)
    wait_healthy "$svc"
 }
 show_status() {
    echo "Inference services:"
    for svc in ollama llama kimi 235b qwable comfyui; do
        local container="${SVC_DIR[$svc]}" state="down" health=""
        if is_running "$container"; then
            state="up"
            if is_healthy "${SVC_HEALTH[$svc]}"; then
                health=" (healthy)"
            else
                health=" (starting/unhealthy)"
            fi
        fi
        printf "  %-8s %s%s\n" "$svc" "$state" "$health"
    done
 }
 usage() {
    cat <<EOF
 swap-model — coordinate inference containers on Strix Halo.
 Usage:
  swap-model coder        # Qwen3-Coder-30B  (Ollama, interactive daily-driver)
  swap-model 235b         # Qwen3-235B-A22B  (llama.cpp, long-task, ~5-10 tok/s)
  swap-model kimi         # Kimi-Linear-48B  (vLLM, long-context chat)
  swap-model qwable       # Qwable-3.6-27B   (llama.cpp, Fable-style, ~10-15 tok/s)
  swap-model comfyui      # ComfyUI          (image generation)
  swap-model none         # everything down  (free the GPU arena)
  swap-model status       # show current state
 Behaviour: stops conflicting services (frees the ~110 GB GPU arena),
 starts the target, then polls its health URL until it answers without
 an HTTP error. Wait timeout defaults to ${WAIT_TIMEOUT}s; override with
 SWAP_WAIT_TIMEOUT.
 Coexistence: ollama(30B), kimi, and qwable(27B, 16.5 GB) coexist with
 each other. 235B and comfyui coexist with nothing. See
 compose/qwen3-235b/README.md for arena math.
 EOF
 }
 # --- Main --------------------------------------------------------------------
 TARGET="${1:-}"
 case "$TARGET" in
    coder|235b|kimi|qwable|comfyui|none) ;;
    status)         show_status ; exit 0 ;;
    -h|--help|help|"") usage ; exit 0 ;;
    *)
        echo "swap-model: unknown target '$TARGET'" >&2
        echo "Try: swap-model help" >&2
        exit 2
        ;;
 esac
 plan "$TARGET"
 echo "Plan: down=[${DOWN[*]:-}] up=[${UP[*]:-}]"
 for svc in "${DOWN[@]}"; do down_svc "$svc"; done
 for svc in "${UP[@]}";   do up_svc   "$svc"; done
 echo
 show_status