added models, model-swap, ...

2026-06-26 08:13:33 -04:00
parent de1635872f
commit 224afbb3a6
18 changed files with 1659 additions and 243 deletions

@@ -72,6 +72,106 @@ curl localhost:8080/v1/models # smoke test
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
## Swapping GPU-resident models
The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
`HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
~88 GB-class model, and ROCm doesn't reclaim cleanly between consumers.
So switching between, say, the daily-driver 30B and the 235B long-task
model is a stop-then-start of whole compose stacks.
`scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
box) encodes the coexistence table + per-service health probes so the
swap is one command:
```sh
ssh framework swap-model 235b # stop conflicting svcs, start 235B, wait for /health
ssh framework swap-model coder # back to Qwen3-Coder-30B (Ollama)
ssh framework swap-model status # what's currently up?
ssh framework swap-model none # everything down, free the GPU arena
```
Coexistence table baked into the script:
| Target | What runs | What gets stopped |
|---|---|---|
| `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
| `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
| `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
| `comfyui` | comfyui | 235b, kimi (ollama can stay) |
| `none` | — | all five |
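One way the table could be encoded is a plain `case` statement that maps a target to the services it must stop. This is a minimal sketch, not the contents of `scripts/swap-model`: the function name `stop_list_for` is hypothetical, and the five service names for `none` are inferred from the rows above.

```sh
# Hypothetical encoding of the coexistence table; the real script may differ.
# Prints the services to stop before starting the given target.
stop_list_for() {
  case "$1" in
    coder)   echo "235b comfyui" ;;
    235b)    echo "ollama llama kimi comfyui" ;;
    kimi)    echo "235b comfyui" ;;                   # ollama can stay
    comfyui) echo "235b kimi" ;;                      # ollama can stay
    none)    echo "ollama llama 235b kimi comfyui" ;; # "all five", inferred
    *)       echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Usage: show what a swap to the 235B model would stop.
stop_list_for 235b
```

Keeping the table in one function means the stop logic and the documentation can be compared row by row.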
OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
cold load takes 3-5 min); override with `SWAP_WAIT_TIMEOUT=<seconds>`.
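The wait step can be pictured as a retry loop around a health probe. The sketch below is an assumption about the shape of that loop, not the script's actual code: `wait_until` is a hypothetical helper, and the port in the usage comment is taken from the smoke test earlier in this file rather than from the 235B stack's config.

```sh
# Hypothetical helper: rerun a probe command until it succeeds or the
# timeout is reached. Only sleep time is counted, so a slow probe
# stretches the real wall-clock wait.
wait_until() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1   # gave up: probe never succeeded
    fi
    sleep "$interval"
  done
  return 0
}

# Usage (assumed port; llama.cpp's server exposes GET /health):
# wait_until "${SWAP_WAIT_TIMEOUT:-600}" 5 curl -fsS http://localhost:8080/health
```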
A Mac-side wrapper at `bin/swap-model` in the repo runs
`ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
PATH so `swap-model X` works from any shell:
```sh
ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
```
Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
Tailscale resolution is misbehaving.
### Inference engine consolidation (in progress)
The model count has reached the point where a hand-maintained
coexistence table stops scaling, so the manual `swap-model` script is
being replaced. The plan, in phases:
**Framing.** Full consolidation to one engine isn't possible — **vLLM
stays** as the specialist for the architectures llama.cpp can't serve
(Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
memory problem — the cross-engine rule "235B can't share the arena
with anything" — survives any consolidation, because it's a constraint
*between* vLLM and the GGUF tier. So the real decision is only about
the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
same job) and which tool, if any, coordinates the swaps.
**M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
auto-loads/unloads; the only question is whether its bundled llama.cpp
keeps up with the gfx1151-tuned kyuz0 build on *this* box.
`scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
serves the identical GGUF on each engine in isolation and reports
decode/prefill t/s using each engine's own timing fields:
```sh
ssh framework bench-engines # runs both, prints a verdict line
```
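For reference, "each engine's own timing fields" can be turned into tokens per second with simple arithmetic. Ollama's `/api/generate` response reports `eval_count` and `eval_duration` (nanoseconds) for decode, plus `prompt_eval_count` and `prompt_eval_duration` for prefill; llama.cpp's server reports rates directly under `timings` (`predicted_per_second`, `prompt_per_second`). The helper below is a sketch of the Ollama-side conversion; `tokens_per_sec` is a hypothetical name and says nothing about how `bench-engines` is actually written.

```sh
# Convert a token count and a duration in nanoseconds to tokens/second.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1            # refuse to divide by zero
    printf "%.1f\n", n / (ns / 1e9)
  }'
}

# Usage with values copied from an Ollama response:
#   decode:  tokens_per_sec "$eval_count" "$eval_duration"
#   prefill: tokens_per_sec "$prompt_eval_count" "$prompt_eval_duration"
tokens_per_sec 100 2000000000   # 100 tokens in 2 s
```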
**M1 — pick the end-state from the benchmark.** Both end-states keep
two engines (a GGUF tier plus vLLM); they differ in what coordinates
the swaps:
- **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
**no llama-swap needed**; the one cross-engine rule ("stop Ollama
before 235B") is a few lines. Fewest moving parts.
- **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
**[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
Ollama. This is where llama-swap earns its place — a YAML-configured
OpenAI-compatible proxy whose **groups** encode the coexistence table
declaratively *and* can launch the vLLM backends too, so one tool
coordinates the whole arena and the `swap-model` script retires. It
also subsumes most of LiteLLM's routing role.
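The decision rule above reduces to one comparison of decode rates. The sketch below assumes the ~85 % threshold is applied as a hard cutoff and that anything below it counts as "kyuz0 lead is large"; both are simplifications, and `pick_end_state` is a hypothetical name, not the verdict logic of `bench-engines`.

```sh
# Print which end-state the decode numbers point to.
# $1 = Ollama decode t/s, $2 = llama.cpp decode t/s, $3 = threshold in % (default 85)
pick_end_state() {
  awk -v o="$1" -v l="$2" -v pct="${3:-85}" 'BEGIN {
    if (l <= 0) exit 2             # no usable llama.cpp baseline
    if (o * 100 >= l * pct) {
      print "option1"              # Ollama + vLLM
    } else {
      print "option2"              # llama.cpp behind llama-swap
    }
  }'
}

# Usage with made-up numbers: 45 t/s is 90 % of 50 t/s.
pick_end_state 45 50
```

In practice a result close to the threshold probably deserves a rerun before committing to either option.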
Two alternatives considered, both weaker fits:
- **[LocalAI](https://github.com/mudler/LocalAI)**: a bigger "replace
your whole stack" platform with 36+ backends; would force a deeper
rewrite for marginal gain on a 4-model stack.
- **[GPUStack](https://github.com/gpustack/gpustack)**: a multi-node
cluster manager (DB-backed, UI-first); built for fleets, overkill
on a single Strix Halo box.
## Tunables
Top of `deploy.py`:
@@ -272,3 +372,77 @@ model server.
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.
## Remote IDE (code-server)
**code-server** (`/srv/docker/code-server`, http://framework:8443) —
full VS Code in the browser, served from the box. The point: Claude
Code (extension from Open VSX + CLI) runs *inside the container*, so
long agent tasks execute here and survive the laptop sleeping. Any
tailnet device gets the same editor and the same running session.
Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
see `/srv/docker/code-server/README.md` for the one-time extension
install + sign-in flow. No password is set (Tailscale-only trust
model, like everything else here) — set `HASHED_PASSWORD` before this
host ever sees other traffic; a browser terminal is a shell.
The workspace is container-scoped on purpose (no host filesystem, no
docker socket). For host-level Claude work, use "Claude Code on the box
(remote-control)" further down instead.
## Workspace manager (Coder) — pilot
**Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
workspace manager: a web dashboard that stamps out per-project dev
containers from Terraform templates. The shipped `code-server` template
gives each workspace browser VS Code with the Claude Code extension
(from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
home volumes persist `~/.claude` creds and repos across stop/start.
Workspaces publish no host ports — everything tunnels through the
dashboard — and idle autostop hands parked projects' RAM back to the
inference stacks.
Bring-up + template push + first workspace:
`/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
Postgres password) is auto-generated by `./run.sh` — nothing to fill
in by hand.
This is a **pilot** against the standalone code-server stack above:
after a week of real use, one of them retires. Same security posture
as OpenHands (docker-socket holder, spawns code-running containers) —
Tailscale-only, never expose :7080 further.
## Claude Code on the box (remote-control)
For long unattended runs steered from a phone or browser — no IDE
involved — Claude Code's official Remote Control feature bridges a
session running on the box to claude.ai/code and the Claude mobile
app. It needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max
plan; API keys don't work for this). Traffic relays outbound-only over
TLS to Anthropic, so no inbound ports are required.
```sh
ssh framework
tmux new -s rc
cd ~/src/whatever
claude remote-control --name "Strix Halo"
# prints a session URL → open on any device, or find it in the
# session list at claude.ai/code / the Claude app's Code tab
```
Operational notes (state of the feature as of 2026-06):
- The local process is the engine — tasks keep executing with **no
client attached**; clients are viewports. But if the process dies,
the session is gone (no `--resume` equivalent for remote-control
sessions yet, issue #30447). Hence tmux, always.
- It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
- `--spawn worktree` gives each connecting session its own git
worktree; `--spawn same-dir` (default) shares the directory.
- Headless boxes print the session URL but no QR (#34764) — copy the
URL or use the session list.
- Mobile clients sometimes drop after long idle (#28914, #34255);
reconnecting always works — the task on the box is unaffected.
- Run `claude` interactively once per project dir first to clear the
workspace-trust prompt before automating anything.
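Since the session only lives as long as the process inside tmux, an idempotent launcher can help avoid starting a second copy by accident. The sketch below is an assumption-laden convenience, not part of the repo: `ensure_rc_session` is a hypothetical name, and it assumes `claude remote-control` behaves the same when started as the tmux window's command as it does when typed interactively. If the process exits, the tmux window closes with it.

```sh
# Start remote-control in a detached tmux session unless one already exists.
# $1 = tmux session name, $2 = project directory
ensure_rc_session() {
  session=$1; dir=$2
  if tmux has-session -t "$session" 2>/dev/null; then
    echo "tmux session '$session' already running"
    return 0
  fi
  # -d: detached, -s: session name, -c: start directory
  tmux new-session -d -s "$session" -c "$dir" \
    'claude remote-control --name "Strix Halo"'
}

# Usage:
# ensure_rc_session rc "$HOME/src/whatever"
# tmux attach -t rc    # attach to read the printed session URL
```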