2026-05-08 11:35:10 -04:00
# pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.

## Manual prerequisites

1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ships ROCm packages only
   for jammy/noble; later Ubuntu releases install but break the host-side
   toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user
   `noise`. Recommended partitioning: ≥300 GB on `/`, big disk mounted at
   `/models`, plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
   if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
   thread sudo passwords. One-time setup (`-t` allocates a terminal so
   `sudo` can prompt for the password this one time):

   ```sh
   ssh -t noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
   ```

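A slightly more defensive variant of that one-time setup could validate the rule before installing it. The sketch below assumes `visudo` is present on the host (it ships with the `sudo` package); the function names and the temp-file flow are illustrative and not part of this repo.

```shell
#!/bin/sh
# Sketch only: build and syntax-check a NOPASSWD drop-in before installing it.
# Function names are illustrative, not part of the repo.

# Print the sudoers rule for a given user.
sudoers_rule() {
    printf '%s ALL=(ALL) NOPASSWD: ALL\n' "$1"
}

# Write the rule to a temp file, check it with visudo, then install it.
# Meant to be run on the host itself; uses sudo for the check and the install.
install_nopasswd() {
    user="$1"
    tmp="$(mktemp)" || return 1
    sudoers_rule "$user" > "$tmp"
    if sudo visudo -cf "$tmp"; then
        sudo install -m 0440 -o root -g root "$tmp" "/etc/sudoers.d/${user}-nopasswd"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage (on the host): install_nopasswd noise && sudo -n true
```

`sudo -n true` fails instead of prompting, so it doubles as a check that the rule took effect.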
## Run

```sh
uv tool install pyinfra
./run.sh           # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry     # any extra args are forwarded to pyinfra
```

Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.

## What the deploy does

- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, hf (huggingface_hub CLI)
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
  `amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
  Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
  and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml`
  dropped in, not auto-started — you edit the model path then
  `docker compose up -d`
  (OpenWebUI needs no edits — it's pre-configured to find Ollama at
  `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
  (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
  OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300,
  Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro
  OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses",
  "Front door", and "Voice" below)

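Because the GRUB change only takes effect after a reboot, it can help to confirm the running kernel actually picked the parameters up. A minimal sketch, assuming the two parameters listed above; the helper name is hypothetical:

```shell
#!/bin/sh
# Sketch only: report which expected kernel parameters are absent from a
# kernel command line. The helper name is illustrative.

# missing_params "<cmdline>" param...  -> prints each missing param on its own line
missing_params() {
    cmdline="$1"
    shift
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*) ;;                  # present as a whole word
            *) printf '%s\n' "$want" ;;      # absent: report it
        esac
    done
}

# Usage on the box after rebooting:
#   missing_params "$(cat /proc/cmdline)" amd_iommu=off amdgpu.gttsize=117760
```

Empty output means every listed parameter is active; anything printed still needs a reboot (or a look at `/etc/default/grub`).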
If a previous run installed the native llama.cpp build, the full ROCm
toolchain, or native Ollama, the next `./run.sh` removes them automatically.

## After the deploy: starting an inference service

```sh
ssh noise@10.0.0.237
sudo tailscale up                    # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml               # edit the --model path
docker compose up -d
curl localhost:8080/v1/models        # smoke test
```

Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).

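A model may still be loading right after `docker compose up -d`, so a single smoke-test `curl` can fail even though the service is healthy. One way to handle that is a small polling helper; this is a sketch with a hypothetical function name, not something the deploy installs:

```shell
#!/bin/sh
# Sketch only: poll a command until it succeeds or a deadline passes.

# wait_until <timeout_s> <interval_s> <command...>
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1                         # gave up
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Usage on the box:
#   wait_until 300 5 curl -fsS localhost:8080/v1/models && echo "llama.cpp is up"
```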
## Swapping GPU-resident models

The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + `ttm.pages_limit` + the
`HSA_*` env recipe — see `StrixHaloMemory.md`) holds at most one
~88 GB-class model, and ROCm doesn't reclaim memory cleanly between
consumers. So switching between, say, the daily-driver 30B and the 235B
long-task model is a stop-then-start of whole compose stacks.

`scripts/swap-model` (deployed to `/usr/local/bin/swap-model` on the
box) encodes the coexistence table + per-service health probes so the
swap is one command:

```sh
ssh framework swap-model 235b      # stop conflicting svcs, start 235B, wait for /health
ssh framework swap-model coder     # back to Qwen3-Coder-30B (Ollama)
ssh framework swap-model status    # what's currently up?
ssh framework swap-model none      # everything down, free the GPU arena
```

Coexistence table baked into the script:

| Target | What runs | What gets stopped |
|---|---|---|
| `coder` | ollama (Qwen3-Coder-30B) | 235b, comfyui |
| `235b` | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
| `kimi` | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
| `comfyui` | comfyui | 235b, kimi (ollama can stay) |
| `none` | — | all five |

OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
cold load takes 3-5 min); override with `SWAP_WAIT_TIMEOUT=<seconds>`.

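The coexistence table can be read as a plain lookup from target to stop-list. The sketch below is not the deployed `swap-model` script, only one way that table might be expressed; the function name is hypothetical, and the expansion of "all five" is inferred from the service names appearing in the table.

```shell
#!/bin/sh
# Sketch only: the coexistence table as a lookup. Not the deployed script.

# conflicts_for <target> -> prints the services to stop, space-separated
conflicts_for() {
    case "$1" in
        coder)   echo "235b comfyui" ;;
        235b)    echo "ollama llama kimi comfyui" ;;
        kimi)    echo "235b comfyui" ;;
        comfyui) echo "235b kimi" ;;
        # "all five" is assumed to mean every GPU service named in the table.
        none)    echo "ollama llama 235b kimi comfyui" ;;
        *)       echo "unknown target: $1" >&2; return 2 ;;
    esac
}

# Usage: for svc in $(conflicts_for 235b); do echo "would stop $svc"; done
```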
A Mac-side wrapper at `bin/swap-model` in the repo runs
`ssh framework /usr/local/bin/swap-model "$@"` — symlink it onto your
PATH so `swap-model X` works from any shell:

```sh
ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
```

Override host with `SWAP_MODEL_HOST=10.0.0.70 swap-model 235b` when
Tailscale resolution is misbehaving.

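For illustration, a wrapper that honours the `SWAP_MODEL_HOST` override might look roughly like this. It is a sketch under the assumption that the default host alias is `framework`; the real `bin/swap-model` may differ.

```shell
#!/bin/sh
# Sketch only: a Mac-side wrapper with an optional host override.
# The function name is illustrative; the repo's bin/swap-model may differ.

swap_model_remote() {
    host="${SWAP_MODEL_HOST:-framework}"     # fall back to the ssh alias
    ssh "$host" /usr/local/bin/swap-model "$@"
}

# Usage: SWAP_MODEL_HOST=10.0.0.70 swap_model_remote 235b
```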
### Inference engine consolidation (in progress)

The model count is now at the crossover point, so the manual
`swap-model` script is being replaced. The plan, phased:

**Framing.** Full consolidation to one engine isn't possible — **vLLM
stays** as the specialist for the architectures llama.cpp can't serve
(Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the
memory problem — the cross-engine rule "235B can't share the arena
with anything" — survives any consolidation, because it's a constraint
*between* vLLM and the GGUF tier. So the real decision is only about
the **GGUF tier** (Ollama and the standalone llama.cpp stack do the
same job) and which tool, if any, coordinates the swaps.

**M0 — benchmark the GGUF tier (the deciding unknown).** Ollama already
auto-loads/unloads; the only question is whether its bundled llama.cpp
keeps up with the gfx1151-tuned kyuz0 build on *this* box.
`scripts/bench-engines` (deployed to `/usr/local/bin/bench-engines`)
serves the identical GGUF on each engine in isolation and reports
decode/prefill t/s using each engine's own timing fields:

```sh
ssh framework bench-engines     # runs both, prints a verdict line
```

**M1 — pick the end-state from the benchmark.** Both end-states run two
engines (GGUF tier + vLLM); the difference is the coordinator:

- **Option 1 (Ollama ≥ ~85 % of llama.cpp decode):** Ollama + vLLM,
  drop the standalone llama.cpp stack. Ollama self-swaps its tier, so
  **no llama-swap needed**; the one cross-engine rule ("stop Ollama
  before 235B") is a few lines. Fewest moving parts.
- **Option 2 (kyuz0 lead is large):** keep llama.cpp behind
  **[llama-swap](https://github.com/mostlygeek/llama-swap)**, drop
  Ollama. This is where llama-swap earns its place — a YAML-configured
  OpenAI-compatible proxy whose **groups** encode the coexistence table
  declaratively *and* can launch the vLLM backends too, so one tool
  coordinates the whole arena and the `swap-model` script retires. It
  also subsumes most of LiteLLM's routing role.

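The Option 1 threshold is simple arithmetic on two decode rates. As a worked sketch (this is not `bench-engines`, and the function name and output labels are made up), the decision could be computed like so:

```shell
#!/bin/sh
# Sketch only: apply the ~85 % decode threshold to two measured rates.

# pick_option <ollama_tps> <llamacpp_tps> -> prints "option1" or "option2"
pick_option() {
    awk -v o="$1" -v l="$2" 'BEGIN {
        if (l <= 0) { print "invalid"; exit 1 }   # avoid dividing by zero
        ratio = o / l
        if (ratio >= 0.85) print "option1"; else print "option2"
    }'
}

# Usage: pick_option 45 50    # ratio 0.90, so Ollama is close enough
```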
Two alternatives also considered, both weaker fits:

- **[LocalAI](https://github.com/mudler/LocalAI)** — bigger "replace
  your whole stack" platform with 36+ backends; would force a deeper
  rewrite for marginal gain on a 4-model stack.
- **[GPUStack](https://github.com/gpustack/gpustack)** — multi-node
  cluster manager (DB-backed, UI-first); built for fleets, overkill
  on a single Strix Halo box.

## Tunables

Top of `deploy.py`:

- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
  release. The .deb filename has a build suffix that doesn't derive from
  the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
  https://github.com/Umio-Yasuno/amdgpu_top/releases.
- `NVTOP_VERSION` — built from source because apt's nvtop predates
  gfx1151 detection. Bump when a newer release lands at
  https://github.com/Syllo/nvtop/releases. Run `sudo nvtop` to see all
  GPU processes (non-root only sees the calling user's own).
- `BTOP_VERSION` — built from source because apt's btop has no AMD GPU
  support. 1.4+ requires C++23, hence the ubuntu-toolchain-r/test PPA
  for g++-14. The build links `librocm-smi-dev` for AMD GPU monitoring.
  Bump at https://github.com/aristocratos/btop/releases. In btop, Esc
  → Options → "show_gpu_info" → On to enable the GPU panel.

Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}.yml`
— pin tags here. Homepage's tile/layout config is in `compose/homepage/`
(`services.yaml`, `settings.yaml`, etc.); edit there, `./run.sh`, restart
the homepage container.

## Voice

Two parallel voice stacks, each speaking a different protocol so they
serve different clients without conflicting.

### Wyoming (Home Assistant Assist)

Wyoming-protocol speech servers. No web UI — TCP protocol servers
consumable by HA Assist, Wyoming satellites, or any Wyoming client.

- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — STT.
  Default model `tiny-int8` (~75 MB). Models download into
  `/srv/docker/whisper/data/` on first run.

- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — TTS.
  Default voice `en_US-lessac-medium` (~63 MB). Voice catalog at
  <https://github.com/rhasspy/piper/blob/master/VOICES.md>.

### OpenAI-compatible (OpenWebUI, Conduit, scripts)

OpenAI-API-compatible servers — `/v1/audio/transcriptions` and
`/v1/audio/speech`. Used by OpenWebUI's Audio settings (and through
OpenWebUI, by the Conduit Android app).

- **faster-whisper** (`/srv/docker/faster-whisper`,
  http://framework:8001) — STT, model `large-v3-turbo` by default
  (~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it
  real-time. Built-in web UI at <http://framework:8001/>.

- **Kokoro** (`/srv/docker/kokoro`, http://framework:8880) — TTS,
  Kokoro-82M (~340 MB). Apache 2.0 licensed; sounds much more natural
  than Piper.

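Scripts can call these endpoints directly with an OpenAI-style request body. The following is a sketch only: the model name `kokoro` is an assumption (check what the server actually advertises), the helper name is hypothetical, and the text is assumed to contain no double quotes or backslashes because no JSON escaping is done.

```shell
#!/bin/sh
# Sketch only: build an OpenAI-style /v1/audio/speech request body.
# "kokoro" as the model name is an assumption, not confirmed by this README.
# No JSON escaping: the text must not contain double quotes or backslashes.

# tts_payload <text> [voice]
tts_payload() {
    printf '{"model":"%s","input":"%s","voice":"%s"}\n' "kokoro" "$1" "${2:-af_bella}"
}

# Usage from any machine that can reach the box (the output file name is
# arbitrary; the audio format depends on the server's defaults):
#   curl -fsS http://framework:8880/v1/audio/speech \
#        -H 'Content-Type: application/json' \
#        -d "$(tts_payload 'Hello from Strix Halo')" -o hello.audio
```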
Bring-up (all four):

```sh
for svc in whisper piper faster-whisper kokoro; do
  ( cd "/srv/docker/$svc" && docker compose up -d )
done
```

### Wiring OpenWebUI for voice

OpenWebUI Admin → Settings → Audio:

- **STT**: `OpenAI` engine, URL `http://faster-whisper:8000/v1`,
  any non-empty API key, model `Systran/faster-whisper-large-v3-turbo`.
- **TTS**: `OpenAI` engine, URL `http://kokoro:8880/v1`, any non-empty
  API key, voice `af_bella` (or any from the Kokoro voices list).

The hostnames `faster-whisper` and `kokoro` resolve only when OpenWebUI
shares a user-defined Docker network with those containers; Docker's
default bridge does not provide container-name DNS. If OpenWebUI can't
resolve them by container name, use `host.docker.internal:8001` and
`:8880` instead.

After saving, the microphone icon in OpenWebUI activates and a
voice-call quick action appears. Conduit on Android (an OpenWebUI
client) inherits this configuration automatically — no app-side voice
setup.

See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).

## Front door

- **Homepage** (`/srv/docker/homepage`, http://framework:7575) — single
  page with one tile per service in this stack, with native widgets
  pulling live state (loaded models from Ollama, container status from
  the docker socket, etc.). Bookmarks for the reference docs across the
  bottom. Use as the default landing page on this network.

  Bring-up: `cd /srv/docker/homepage && docker compose up -d`. No
  first-run setup — it reads `/srv/docker/homepage/config/*.yaml` and
  renders.

  To customize: edit `pyinfra/framework/compose/homepage/*.yaml` in the
  repo, run `./run.sh`, then `docker compose restart homepage` on the
  box. Direct edits to `/srv/docker/homepage/config/*.yaml` will be
  overwritten on the next pyinfra deploy.

## Monitoring stack

Three compose stacks watch the inference services:

- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
  container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
  `/sys/class/drm/cardN/device/` directly; this is the only path that
  reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
  for util/power/temp on this APU.

  First-time bring-up:

  ```sh
  cd /srv/docker/beszel
  docker compose up -d beszel                          # 1. hub
  # 2. open http://framework:8090 → create admin → "Add system"
  # 3. copy the TOKEN and the SSH KEY from the dialog into .env
  #    (quoted 'EOF' keeps the shell from expanding anything in the values):
  sudo tee /srv/docker/beszel/.env >/dev/null <<'EOF'
  BESZEL_TOKEN=<token-from-dialog>
  BESZEL_KEY=ssh-ed25519 AAAA…
  EOF
  sudo chgrp docker /srv/docker/beszel/.env
  sudo chmod 640 /srv/docker/beszel/.env
  docker compose up -d --force-recreate beszel-agent   # 4. agent
  ```

- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
  metrics for the LLM stack (cost, tokens, latency aggregated across
  sessions and models). Auto-instruments Ollama and vLLM via
  OpenTelemetry without app-code changes; for llama.cpp, the compose
  file already passes `--metrics` so OpenLIT can scrape `/metrics`.
  ClickHouse-backed. OTLP receivers exposed on the host at **:4327
  (gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.

  Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
  to create an admin account on first load.

- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
  agent waterfall / flamegraph. Designed to answer "show me what one
  OpenCode turn actually did" — full call tree, nested LLM/tool calls,
  token counts inline. Single container, SQLite-backed. OTLP receivers
  on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
  Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.

  Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
  immediately available; no auth in the self-hosted single-user path.

  See [`localgenai/opencode/README.md`](../../opencode/README.md) for
  how OpenCode is configured to ship traces here.

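With two OTLP receivers on different ports, it is easy to point a client at the wrong one. The sketch below prints the standard OpenTelemetry exporter environment variables for each receiver listed above; the helper and its target names are hypothetical, while the variable names come from the OpenTelemetry specification.

```shell
#!/bin/sh
# Sketch only: print OpenTelemetry exporter settings for each receiver.
# Ports and the "framework" hostname are taken from this README; the helper
# and its target names are illustrative.

# otlp_env <phoenix-grpc|phoenix-http|openlit-grpc|openlit-http>
otlp_env() {
    case "$1" in
        phoenix-grpc)
            echo "OTEL_EXPORTER_OTLP_PROTOCOL=grpc"
            echo "OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4317" ;;
        phoenix-http)
            # Signal-specific variable: the URL is used as-is, path included.
            echo "OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf"
            echo "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://framework:6006/v1/traces" ;;
        openlit-grpc)
            echo "OTEL_EXPORTER_OTLP_PROTOCOL=grpc"
            echo "OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4327" ;;
        openlit-http)
            echo "OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf"
            echo "OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4328" ;;
        *)
            echo "unknown target: $1" >&2; return 2 ;;
    esac
}

# Usage: export $(otlp_env phoenix-http)   # then start the instrumented client
```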
The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.

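As a rough idea of what such a textfile collector could look like, the sketch below turns two amdgpu sysfs attributes into Prometheus text-format lines. The metric names and the function name are invented for illustration, and the textfile directory in the usage line is an assumption; `gpu_busy_percent` and `mem_info_vram_used` are standard amdgpu sysfs attributes.

```shell
#!/bin/sh
# Sketch only: emit Prometheus text-format gauges from amdgpu sysfs files.
# Metric names are made up for illustration.

# emit_amdgpu_metrics <device_dir>   e.g. /sys/class/drm/card0/device
emit_amdgpu_metrics() {
    dev="$1"
    if [ -r "$dev/gpu_busy_percent" ]; then
        printf 'amdgpu_busy_percent %s\n' "$(cat "$dev/gpu_busy_percent")"
    fi
    if [ -r "$dev/mem_info_vram_used" ]; then
        printf 'amdgpu_vram_used_bytes %s\n' "$(cat "$dev/mem_info_vram_used")"
    fi
}

# Usage (the output directory is an assumption; match node_exporter's
# --collector.textfile.directory setting):
#   emit_amdgpu_metrics /sys/class/drm/card0/device > /var/lib/node_exporter/textfile/amdgpu.prom
```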
## Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:

- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
  ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
  Best for casual chat and household-shared LLM access.

  Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.

- **OpenHands** (`/srv/docker/openhands`, http://framework:3030) —
  autonomous agent in a Docker sandbox. Spawns a per-conversation
  `agent-server` container that can write code, run tests, browse the
  web. Pre-configured for Ollama at `openai/qwen3-coder:30b` over the
  OpenAI-compatible endpoint; ships traces to Phoenix.

  Bring-up: `cd /srv/docker/openhands && docker compose up -d`. First
  run pulls the agent-server image (~2 GB) lazily on first conversation,
  not at startup, so the orchestrator comes up fast but your first
  message takes 30–60 s. Pre-0.44 state path was `~/.openhands-state`;
  not relevant on a fresh install.

  > **Security note:** the orchestrator container has docker-socket
  > access and spawns code-running sandboxes. Fine to expose on a
  > Tailscale-only box; **change the compose port mapping back to
  > `127.0.0.1:3030:3000` and tunnel in** if this host ever sees LAN
  > or internet traffic.

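If the OpenHands port is bound to loopback as the security note suggests, an SSH local forward is one way to "tunnel in". A small sketch follows; the helper name is hypothetical and `framework` is assumed to be a working ssh alias.

```shell
#!/bin/sh
# Sketch only: print the ssh command that forwards a local port to the same
# loopback-bound port on the box. The helper name is illustrative.

# tunnel_cmd <port> [host]
tunnel_cmd() {
    printf 'ssh -N -L %s:127.0.0.1:%s %s\n' "$1" "$1" "${2:-framework}"
}

# Usage on the laptop:
#   $(tunnel_cmd 3030) &      # then browse http://localhost:3030
```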
Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.

OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.

The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.

## Remote IDE (code-server)

**code-server** (`/srv/docker/code-server`, http://framework:8443) —
full VS Code in the browser, served from the box. The point: Claude
Code (extension from Open VSX + CLI) runs *inside the container*, so
long agent tasks execute here and survive the laptop sleeping. Any
tailnet device gets the same editor and the same running session.

Bring-up: `cd /srv/docker/code-server && docker compose up -d`, then
see `/srv/docker/code-server/README.md` for the one-time extension
install + sign-in flow. No password is set (Tailscale-only trust
model, like everything else here) — set `HASHED_PASSWORD` before this
host ever sees other traffic; a browser terminal is a shell.

The workspace is container-scoped on purpose (no host filesystem, no
docker socket). For host-level Claude work, use "Claude Code on the box
(remote-control)" below instead.

## Workspace manager (Coder) — pilot

**Coder** (`/srv/docker/coder`, http://framework:7080) — self-hosted
workspace manager: a web dashboard that stamps out per-project dev
containers from Terraform templates. The shipped `code-server` template
gives each workspace browser VS Code with the Claude Code extension
(from Open VSX) pre-installed and the `claude` CLI on PATH; workspace
home volumes persist `~/.claude` creds and repos across stop/start.
Workspaces publish no host ports — everything tunnels through the
dashboard — and idle autostop hands parked projects' RAM back to the
inference stacks.

Bring-up + template push + first workspace:
`/srv/docker/coder/README.md`. The `.env` (docker GID, access URL,
Postgres password) is auto-generated by `./run.sh` — nothing to fill
in by hand.

This is a **pilot** against the standalone code-server stack above:
after a week of real use, one of them retires. Same security posture
as OpenHands (docker-socket holder, spawns code-running containers) —
Tailscale-only, never expose :7080 further.

## Claude Code on the box (remote-control)

For long unattended runs steered from a phone or browser — no IDE
involved — Claude Code's official Remote Control feature bridges a
session running on the box to claude.ai/code and the Claude mobile
app. Needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max plan;
API keys don't work for this). Traffic relays outbound-only over TLS
to Anthropic — no inbound ports.

```sh
ssh framework
tmux new -s rc
cd ~/src/whatever
claude remote-control --name "Strix Halo"
# prints a session URL → open on any device, or find it in the
# session list at claude.ai/code / the Claude app's Code tab
```

Operational notes (state of the feature as of 2026-06):

- The local process is the engine — tasks keep executing with **no
  client attached**; clients are viewports. But if the process dies,
  the session is gone (no `--resume` equivalent for remote-control
  sessions yet, issue #30447). Hence tmux, always.
- It needs a TTY — `nohup`/systemd don't work. tmux satisfies this.
- `--spawn worktree` gives each connecting session its own git
  worktree; `--spawn same-dir` (default) shares the directory.
- Headless boxes print the session URL but no QR (#34764) — copy the
  URL or use the session list.
- Mobile clients sometimes drop after long idle (#28914, #34255);
  reconnecting always works — the task on the box is unaffected.
- Run `claude` interactively once per project dir first to clear the
  workspace-trust prompt before automating anything.

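Since the session has to live inside tmux, a small guard against starting it twice can be handy. This is a sketch with a hypothetical helper name; it assumes the `claude remote-control` invocation shown earlier and relies only on standard `tmux has-session` / `tmux new-session` options.

```shell
#!/bin/sh
# Sketch only: start the remote-control session in a detached tmux session
# unless one with that name already exists. The helper name is illustrative.

# start_rc [session_name] [project_dir]
start_rc() {
    session="${1:-rc}"
    dir="${2:-$HOME}"
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "session '$session' already running"
        return 0
    fi
    # -d: detached, -s: session name, -c: starting directory
    tmux new-session -d -s "$session" -c "$dir" 'claude remote-control --name "Strix Halo"'
}

# Usage on the box:
#   start_rc rc ~/src/whatever && tmux attach -t rc   # attach to read the URL
```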