added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions
--- a/pyinfra/framework/README.md
+++ b/pyinfra/framework/README.md
@@ -72,6 +72,106 @@ curl localhost:8080/v1/models                # smoke test
 Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
 needed — Ollama serves models on demand).

+## Swapping GPU-resident models
+
+The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
+`HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
+~88 GB-class model, and ROCm doesn't reclaim memory cleanly between
+consumers. Switching between, say, the daily-driver 30B and the 235B
+long-task model therefore means stopping and starting whole compose
+stacks.
+
+`scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
+box) encodes the coexistence table + per-service health probes so the
+swap is one command:
+
+```sh
+ssh framework swap-model 235b      # stop conflicting svcs, start 235B, wait for /health
+ssh framework swap-model coder     # back to Qwen3-Coder-30B (Ollama)
+ssh framework swap-model status    # what's currently up?
+ssh framework swap-model none      # everything down, free the GPU arena
+```
+
+Coexistence table baked into the script:
+
+| Target | What runs | What gets stopped |
+|---|---|---|
+| `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
+| `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
+| `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
+| `comfyui` | comfyui | 235b, kimi (ollama can stay) |
+| `none` | — | all five |
+
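A minimal sketch of how the table above could be encoded, assuming a plain `case` lookup. The real `scripts/swap-model` isn't shown here, so `conflicts_for` and the exact service names below are illustrative rather than the script's actual internals:

```sh
# Hypothetical helper, not taken from scripts/swap-model.
# Prints the services that must be stopped before starting the target.
conflicts_for() {
  case "$1" in
    coder)   echo "235b comfyui" ;;
    235b)    echo "ollama llama kimi comfyui" ;;
    kimi)    echo "235b comfyui" ;;
    comfyui) echo "235b kimi" ;;
    none)    echo "ollama llama kimi 235b comfyui" ;;
    *)       echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Dry run: list what a swap to the 235B would stop.
for svc in $(conflicts_for 235b); do
  echo "would stop: $svc"
done
```

Keeping the whole table in one function gives a single place to check it against this README.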
+OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
+the script ignores them. The wait timeout defaults to 600 s (the 235B's
+88 GB cold load takes 3-5 min); override it with
+`SWAP_WAIT_TIMEOUT=<seconds>`.
+
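The wait step can be pictured as a poll loop bounded by `SWAP_WAIT_TIMEOUT`. This is a sketch under assumptions: `wait_until` and `SWAP_POLL_INTERVAL` are made-up names, and the probe in the trailing comment is a guess based on the llama.cpp smoke test earlier in this README, not the script's real probe:

```sh
# wait_until CMD [ARGS...]: re-run CMD until it succeeds or the timeout
# expires. Elapsed time is approximated by counting sleep intervals.
wait_until() {
  timeout="${SWAP_WAIT_TIMEOUT:-600}"    # seconds, same default as the script
  interval="${SWAP_POLL_INTERVAL:-5}"    # hypothetical knob
  waited=0
  until "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example probe (port and path are assumptions):
#   wait_until curl -fsS -o /dev/null http://localhost:8080/health
```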
+A Mac-side wrapper at `bin/swap-model` in the repo runs
+`ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
+PATH so `swap-model X` works from any shell:
+
+```sh
+# Run from the repo root; prefix with sudo if /usr/local/bin is not writable.
+ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
+```
+
+Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
+Tailscale resolution is misbehaving.
+
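For reference, the wrapper's behaviour amounts to the following. This is a sketch reconstructed from the description above, not a copy of `bin/swap-model`; the function name `swap_model_remote` is hypothetical:

```sh
# Sketch of the Mac-side wrapper (hypothetical function name).
# SWAP_MODEL_HOST overrides the default ssh host alias "framework".
swap_model_remote() {
  ssh "${SWAP_MODEL_HOST:-framework}" /usr/local/bin/swap-model "$@"
}
```

ssh joins the arguments into one remote command line, which is fine for simple targets such as `235b` or `status`.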
+### Inference engine consolidation (in progress)
+
+The model count has reached the point where a hand-maintained
+coexistence table stops scaling, so the manual `swap-model` script is
+being replaced. The plan, phased:
+
+**Framing.** Full consolidation to one engine isn't possible — **vLLM
+stays** as the specialist for the architectures llama.cpp can't serve
+(Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
+memory problem — the cross-engine rule "235B can't share the arena
+with anything" — survives any consolidation, because it's a constraint
+*between* vLLM and the GGUF tier. So the real decision is only about
+the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
+same job) and which tool, if any, coordinates the swaps.
+
+**M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
+auto-loads/unloads; the only question is whether its bundled llama.cpp
+keeps up with the gfx1151-tuned kyuz0 build on *this* box.
+`scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
+serves the identical GGUF on each engine in isolation and reports
+decode/prefill t/s using each engine's own timing fields:
+
+```sh
+ssh framework bench-engines        # runs both, prints a verdict line
+```
+
+**M1 — pick the end-state from the benchmark.** Both options keep two
+engines (a GGUF tier plus vLLM); they differ in the coordinator:
+
+- **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
+  drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
+  **no llama-swap needed**; the one cross-engine rule ("stop Ollama
+  before 235B") is a few lines. Fewest moving parts.
+- **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
+  **[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
+  Ollama. This is where llama-swap earns its place — a YAML-configured
+  OpenAI-compatible proxy whose **groups** encode the coexistence table
+  declaratively *and* can launch the vLLM backends too, so one tool
+  coordinates the whole arena and the `swap-model` script retires. It
+  also subsumes most of LiteLLM's routing role.
+
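The M1 rule reduces to a ratio check on the two decode rates. A sketch, assuming the ~85 % threshold from Option 1; `pick_option` is a hypothetical helper and says nothing about how `bench-engines` formats its own verdict line:

```sh
# pick_option OLLAMA_TPS LLAMACPP_TPS
# Applies the ~85 % decode threshold; both inputs are tokens/second.
pick_option() {
  awk -v o="$1" -v l="$2" 'BEGIN {
    if (l <= 0) { print "invalid llama.cpp rate" > "/dev/stderr"; exit 2 }
    pct = 100 * o / l
    printf "option %d (ollama at %.0f%% of llama.cpp decode)\n", (pct >= 85) ? 1 : 2, pct
  }'
}

# Placeholder numbers for illustration only, not measurements from this box.
pick_option 45 50
```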
+Two alternatives also considered, both weaker fits:
+- **[LocalAI](https://github.com/mudler/LocalAI)** — bigger "replace
+  your whole stack" platform with 36+ backends; would force a deeper
+  rewrite for marginal gain on a 4-model stack.
+- **[GPUStack](https://github.com/gpustack/gpustack)** — multi-node
+  cluster manager (DB-backed, UI-first); built for fleets, overkill
+  on a single Strix Halo box.
+
 ## Tunables

 Top of `deploy.py`:
@@ -272,3 +372,77 @@ model server.
 The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
 gfx1151-optimized with rocWMMA flash attention (the latter only kicks
 in on the ROCm tags). Auto-rebuilt against llama.cpp master.
+
+## Remote IDE (code-server)
+
+**code-server** (`/srv/docker/code-server`, http://framework:8443) —
+full VS Code in the browser, served from the box. The point: Claude
+Code (extension from Open VSX + CLI) runs *inside the container*, so
+long agent tasks execute here and survive the laptop sleeping. Any
+tailnet device gets the same editor and the same running session.
+
+Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
+see `/srv/docker/code-server/README.md` for the one-time extension
+install + sign-in flow. No password is set (Tailscale-only trust
+model, like everything else here) — set `HASHED_PASSWORD` before this
+host ever sees other traffic; a browser terminal is a shell.
+
+The workspace is container-scoped on purpose (no host filesystem, no
+docker socket). For host-level Claude work, use the section below
+instead.
+
+## Workspace manager (Coder) — pilot
+
+**Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
+workspace manager: a web dashboard that stamps out per-project dev
+containers from Terraform templates. The shipped `code-server` template
+gives each workspace browser VS Code with the Claude Code extension
+(from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
+home volumes persist `~/.claude` creds and repos across stop/start.
+Workspaces publish no host ports — everything tunnels through the
+dashboard — and idle autostop hands parked projects' RAM back to the
+inference stacks.
+
+Bring-up + template push + first workspace:
+`/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
+Postgres password) is auto-generated by `./run.sh` — nothing to fill
+in by hand.
+
+This is a **pilot** against the standalone code-server stack above:
+after a week of real use, one of them retires. Same security posture
+as OpenHands (docker-socket holder, spawns code-running containers) —
+Tailscale-only, never expose :7080 further.
+
+## Claude Code on the box (remote-control)
+
+For long unattended runs steered from a phone or browser — no IDE
+involved — Claude Code's official Remote Control feature bridges a
+session running on the box to claude.ai/code and the Claude mobile
+app. It needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max
+plan; API keys don't work for this). Traffic relays outbound-only over
+TLS to Anthropic, so no inbound ports are needed.
+
+```sh
+ssh framework
+tmux new -s rc
+cd ~/src/whatever
+claude remote-control --name "Strix Halo"
+# prints a session URL → open on any device, or find it in the
+# session list at claude.ai/code / the Claude app's Code tab
+```
+
+Operational notes (state of the feature as of 2026-06):
+
+- The local process is the engine — tasks keep executing with **no
+  client attached**; clients are viewports. But if the process dies,
+  the session is gone (no `--resume` equivalent for remote-control
+  sessions yet, issue #30447). Hence tmux, always.
+- It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
+- `--spawn worktree` gives each connecting session its own git
+  worktree; `--spawn same-dir` (default) shares the directory.
+- Headless boxes print the session URL but no QR (#34764) — copy the
+  URL or use the session list.
+- Mobile clients sometimes drop after long idle (#28914, #34255);
+  reconnecting always works — the task on the box is unaffected.
+- Run `claude` interactively once per project dir first to clear the
+  workspace-trust prompt before automating anything.
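
Because the session dies with its process and needs a TTY, it can help to wrap the tmux start so that re-running it is harmless. A sketch; `start_rc` is a hypothetical helper, and the session name and `--name` value simply mirror the example above:

```sh
# start_rc [SESSION] [DIR]: start remote-control in a detached tmux
# session unless a session matching that name already exists.
start_rc() {
  session="${1:-rc}"
  dir="${2:-$PWD}"
  if tmux has-session -t "$session" 2>/dev/null; then
    echo "tmux session '$session' already exists; attach with: tmux attach -t $session"
    return 0
  fi
  # -d: start detached, -s: session name, -c: working directory
  tmux new-session -d -s "$session" -c "$dir" \
    'claude remote-control --name "Strix Halo"'
}
```

The session URL is printed inside the detached session, so attach once (`tmux attach -t rc`) to copy it, or pick the session from the list at claude.ai/code.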