added models, model-swap, ...
@@ -72,6 +72,106 @@ curl localhost:8080/v1/models # smoke test
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).

## Swapping GPU-resident models

The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
`HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
~88 GB-class model, and ROCm doesn't reclaim cleanly between consumers.
So switching between, say, the daily-driver 30B and the 235B long-task
model is a stop-then-start of whole compose stacks.

`scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
box) encodes the coexistence table + per-service health probes so the
swap is one command:

```sh
ssh framework swap-model 235b # stop conflicting svcs, start 235B, wait for /health
ssh framework swap-model coder # back to Qwen3-Coder-30B (Ollama)
ssh framework swap-model status # what's currently up?
ssh framework swap-model none # everything down, free the GPU arena
```

Coexistence table baked into the script:

| Target | What runs | What gets stopped |
|---|---|---|
| `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
| `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
| `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
| `comfyui` | comfyui | 235b, kimi (ollama can stay) |
| `none` | — | all five |

OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
cold load takes 3-5 min); override with `SWAP_WAIT_TIMEOUT=<seconds>`.
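As a rough illustration of how the coexistence table and the wait timeout could be expressed in shell, here is a minimal sketch. It is not the deployed `swap-model` script: the function names `conflicts_for` and `wait_until` are hypothetical, the service names are copied from the table (with "all five" read as ollama, llama, 235b, kimi, comfyui), and the health URL in the usage comment is an assumption.

```sh
# Hypothetical sketch: map a swap target to the services that must stop first.
# Rows mirror the coexistence table; the real script may name services differently.
conflicts_for() {
  case "$1" in
    coder)   echo "235b comfyui" ;;
    235b)    echo "ollama llama kimi comfyui" ;;
    kimi)    echo "235b comfyui" ;;
    comfyui) echo "235b kimi" ;;
    none)    echo "ollama llama 235b kimi comfyui" ;;
    *)       echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Hypothetical sketch: retry a probe command until it succeeds or the
# timeout (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout="$1"; interval="$2"; shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Usage idea (URL is an assumption, not taken from the script):
#   wait_until "${SWAP_WAIT_TIMEOUT:-600}" 5 curl -fsS http://localhost:8080/health
```

Keeping the table in one `case` statement means a new target is a one-line change, and passing the probe as a command lets each service use its own health check.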

A Mac-side wrapper at `bin/swap-model` in the repo runs
`ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
PATH so `swap-model X` works from any shell:

```sh
ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
```

Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
Tailscale resolution is misbehaving.
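For readers who want to see how the host override could fit together with the `ssh` call, a wrapper along these lines would behave as described. This is a sketch under assumptions, not the contents of `bin/swap-model`: the function name `swap_model_remote` and the `SWAP_MODEL_SSH` variable are hypothetical, the latter added only so the transport can be stubbed.

```sh
# Hypothetical sketch of the Mac-side wrapper logic; the real bin/swap-model may differ.
swap_model_remote() {
  # SWAP_MODEL_HOST is the override described above; "framework" is the default alias.
  host="${SWAP_MODEL_HOST:-framework}"
  # SWAP_MODEL_SSH is an assumption introduced here so ssh can be replaced in a dry run.
  "${SWAP_MODEL_SSH:-ssh}" "$host" /usr/local/bin/swap-model "$@"
}

# Dry run that prints the remote command instead of connecting:
#   SWAP_MODEL_SSH=echo swap_model_remote status
```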

### Inference engine consolidation (in progress)

The model count is now at the crossover point, so the manual
`swap-model` script is being replaced. The plan, phased:

**Framing.** Full consolidation to one engine isn't possible — **vLLM
stays** as the specialist for the architectures llama.cpp can't serve
(Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
memory problem — the cross-engine rule "235B can't share the arena
with anything" — survives any consolidation, because it's a constraint
*between* vLLM and the GGUF tier. So the real decision is only about
the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
same job) and which tool, if any, coordinates the swaps.

**M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
auto-loads/unloads; the only question is whether its bundled llama.cpp
keeps up with the gfx1151-tuned kyuz0 build on *this* box.
`scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
serves the identical GGUF on each engine in isolation and reports
decode/prefill t/s using each engine's own timing fields:

```sh
ssh framework bench-engines # runs both, prints a verdict line
```
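To make "each engine's own timing fields" concrete: Ollama's generate response reports `eval_count` (tokens) and `eval_duration` (nanoseconds), while the llama.cpp server response carries a `timings` object with a ready-made `predicted_per_second`. The sketch below only shows the Ollama-side arithmetic; the function name `ollama_decode_tps` is hypothetical and this is not how `bench-engines` is known to be written.

```sh
# Hypothetical helper: decode tokens/second from Ollama's timing fields.
#   $1 = eval_count     (number of generated tokens)
#   $2 = eval_duration  (nanoseconds spent generating them)
ollama_decode_tps() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1            # guard against division by zero
    printf "%.2f\n", n / (ns / 1e9)
  }'
}

# Example: 100 tokens generated in 2 s
#   ollama_decode_tps 100 2000000000
```

Normalising both engines to tokens per second before comparing keeps the verdict independent of how each server happens to report its timings.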

**M1 — pick the end-state from the benchmark.** Both are two engines
(GGUF tier + vLLM); the difference is the coordinator:

- **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
  drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
  **no llama-swap needed**; the one cross-engine rule ("stop Ollama
  before 235B") is a few lines. Fewest moving parts.
- **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
  **[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
  Ollama. This is where llama-swap earns its place — a YAML-configured
  OpenAI-compatible proxy whose **groups** encode the coexistence table
  declaratively *and* can launch the vLLM backends too, so one tool
  coordinates the whole arena and the `swap-model` script retires. It
  also subsumes most of LiteLLM's routing role.
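The decision rule above can be reduced to a single ratio check. The sketch below assumes a hard cut at 0.85, whereas the text gives "~85 %" for Option 1 and only "large" for Option 2, so the exact boundary is a judgment call; the function name `engine_verdict` and its output strings are hypothetical and not taken from `bench-engines`.

```sh
# Hypothetical sketch of the M1 decision rule.
#   $1 = Ollama decode t/s
#   $2 = llama.cpp (kyuz0 build) decode t/s
#   $3 = optional threshold ratio, default 0.85 (assumed hard cut)
engine_verdict() {
  awk -v o="$1" -v l="$2" -v t="${3:-0.85}" 'BEGIN {
    if (l <= 0) { print "invalid"; exit 1 }
    r = o / l
    if (r >= t) print "option-1"   # Ollama keeps up: Ollama + vLLM
    else        print "option-2"   # llama.cpp lead is large: llama-swap
  }'
}

# Example: engine_verdict 45 50
```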

Two alternatives also considered, both weaker fits:
- **[LocalAI](https://github.com/mudler/LocalAI)** — bigger "replace
  your whole stack" platform with 36+ backends; would force a deeper
  rewrite for marginal gain on a 4-model stack.
- **[GPUStack](https://github.com/gpustack/gpustack)** — multi-node
  cluster manager (DB-backed, UI-first); built for fleets, overkill
  on a single Strix Halo box.

## Tunables

Top of `deploy.py`:
@@ -272,3 +372,77 @@ model server.
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.

## Remote IDE (code-server)

**code-server** (`/srv/docker/code-server`, http://framework:8443) —
full VS Code in the browser, served from the box. The point: Claude
Code (extension from Open VSX + CLI) runs *inside the container*, so
long agent tasks execute here and survive the laptop sleeping. Any
tailnet device gets the same editor and the same running session.

Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
see `/srv/docker/code-server/README.md` for the one-time extension
install + sign-in flow. No password is set (Tailscale-only trust
model, like everything else here) — set `HASHED_PASSWORD` before this
host ever sees other traffic; a browser terminal is a shell.

The workspace is container-scoped on purpose (no host filesystem, no
docker socket). For host-level Claude work, use the section below
instead.

## Workspace manager (Coder) — pilot

**Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
workspace manager: a web dashboard that stamps out per-project dev
containers from Terraform templates. The shipped `code-server` template
gives each workspace browser VS Code with the Claude Code extension
(from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
home volumes persist `~/.claude` creds and repos across stop/start.
Workspaces publish no host ports — everything tunnels through the
dashboard — and idle autostop hands parked projects' RAM back to the
inference stacks.

Bring-up + template push + first workspace:
`/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
Postgres password) is auto-generated by `./run.sh` — nothing to fill
in by hand.

This is a **pilot** against the standalone code-server stack above:
after a week of real use, one of them retires. Same security posture
as OpenHands (docker-socket holder, spawns code-running containers) —
Tailscale-only, never expose :7080 further.

## Claude Code on the box (remote-control)

For long unattended runs steered from a phone or browser, with no IDE
involved, Claude Code's official Remote Control feature bridges a
session running on the box to claude.ai/code and the Claude mobile
app. It needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max plan;
API keys don't work for this). Traffic relays outbound-only over TLS
to Anthropic, so no inbound ports are needed.

```sh
ssh framework
tmux new -s rc
cd ~/src/whatever
claude remote-control --name "Strix Halo"
# prints a session URL → open on any device, or find it in the
# session list at claude.ai/code / the Claude app's Code tab
```

Operational notes (state of the feature as of 2026-06):

- The local process is the engine — tasks keep executing with **no
  client attached**; clients are viewports. But if the process dies,
  the session is gone (no `--resume` equivalent for remote-control
  sessions yet, issue #30447). Hence tmux, always.
- It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
- `--spawn worktree` gives each connecting session its own git
  worktree; `--spawn same-dir` (default) shares the directory.
- Headless boxes print the session URL but no QR (#34764) — copy the
  URL or use the session list.
- Mobile clients sometimes drop after long idle (#28914, #34255);
  reconnecting always works — the task on the box is unaffected.
- Run `claude` interactively once per project dir first to clear the
  workspace-trust prompt before automating anything.
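Since the notes above amount to "always run it inside tmux, and only once per session", a small guard like the following could start the session detached when it is not already running. This is a sketch, not part of the repo: the function name `start_rc_session` and the `RC_TMUX` variable are hypothetical (the latter exists only so tmux can be stubbed), and it assumes the workspace-trust prompt for the directory has already been cleared.

```sh
# Hypothetical sketch: start a detached tmux session running remote-control,
# unless a session with that name already exists.
#   $1 = tmux session name, $2 = project directory, $3 = remote-control display name
start_rc_session() {
  session="$1"; dir="$2"; name="$3"
  tmux_bin="${RC_TMUX:-tmux}"   # RC_TMUX is an assumption, used for dry runs
  if "$tmux_bin" has-session -t "$session" 2>/dev/null; then
    echo "session $session already running"
    return 0
  fi
  # tmux gives the command a TTY, which remote-control needs.
  # Note: a display name containing a single quote would break this quoting.
  "$tmux_bin" new-session -d -s "$session" -c "$dir" \
    "claude remote-control --name '$name'"
}

# Usage idea: start_rc_session rc ~/src/whatever "Strix Halo"
# then `tmux attach -t rc` to read the printed session URL.
```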