pyinfra: Strix Halo bring-up
Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB). The host stays minimal — kernel + driver + Docker + diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker compose services, each shipping its own ROCm/Vulkan stack.
Manual prerequisites
- Phase 0 — update Framework BIOS, set GPU UMA carve-out (96 GB).
- OS install — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
  jammy/noble; later Ubuntus install but break the host-side toolchain
  (libxml2 ABI). Enable SSH, import your laptop key, create user
  noise. Recommended partitioning: ≥300 GB on /, big disk mounted at /models, plain ext4 (skip LVM).
- The host must be reachable at 10.0.0.237 over SSH (edit inventory.py if it moves).
- NOPASSWD sudo for noise — pyinfra's fact layer doesn't reliably thread sudo passwords. One-time setup:
ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
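For orientation, an inventory.py that matches these prerequisites might look like the sketch below. It is not the repo's actual file: the group name `framework` is an assumption, and while `ssh_user` and `_sudo` are pyinfra host-data keys, the real inventory may set different ones.

```python
# inventory.py (hypothetical sketch, not the repo's actual file).
# pyinfra treats each top-level list as a host group; a host is either a
# bare address or an (address, data) tuple.
framework = [
    (
        "10.0.0.237",  # edit here if the box moves
        {
            "ssh_user": "noise",  # the user created during OS install
            "_sudo": True,        # relies on the NOPASSWD sudoers entry
        },
    ),
]
```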
Run
uv tool install pyinfra
./run.sh # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry # any extra args are forwarded to pyinfra
Or run it ephemerally without installing: uvx pyinfra inventory.py deploy.py.
What the deploy does
- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, hf (huggingface_hub CLI)
- Tailscale (run sudo tailscale up on the box once, interactively)
- Docker engine + compose plugin, user added to docker group
- ROCm host diagnostics only (rocminfo) — no full toolchain
- GRUB kernel params for Strix Halo perf: amd_iommu=off, amdgpu.gttsize=117760 (per Gygeek/Framework-strix-halo-llm-setup). Requires a reboot to activate — pyinfra rewrites /etc/default/grub and runs update-grub, but won't reboot for you.
- /models/<vendor>/ layout
- /srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml dropped in, not auto-started — you edit the model path then docker compose up -d (OpenWebUI needs no edits — it's pre-configured to find Ollama at host.docker.internal:11434 and uses searxng.n0n.io for web search). Ports: Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006, OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300, Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses", "Front door", and "Voice" below.
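The GRUB step has to be idempotent, since ./run.sh is re-run often. The sketch below shows one way to merge the two kernel params into GRUB_CMDLINE_LINUX_DEFAULT without duplicating them. The function name and approach are illustrative assumptions, not the code in deploy.py.

```python
import re

# The two params named in this README.
STRIX_HALO_PARAMS = ["amd_iommu=off", "amdgpu.gttsize=117760"]


def merge_kernel_params(grub_text: str, params: list[str]) -> str:
    """Return grub_text with each param present exactly once in
    GRUB_CMDLINE_LINUX_DEFAULT (hypothetical helper)."""
    pattern = re.compile(r'^GRUB_CMDLINE_LINUX_DEFAULT="(.*)"$', re.MULTILINE)
    match = pattern.search(grub_text)
    current = match.group(1).split() if match else []
    for param in params:
        key = param.split("=", 1)[0]
        # Drop any older value for the same key, then append the wanted one.
        current = [p for p in current if p.split("=", 1)[0] != key]
        current.append(param)
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="{}"'.format(" ".join(current))
    if match:
        return grub_text[: match.start()] + line + grub_text[match.end():]
    # No existing line: append one.
    return grub_text.rstrip("\n") + "\n" + line + "\n"
```

Running it twice on the same text yields the same result, so a repeated deploy leaves the file unchanged. update-grub and the reboot are still needed afterwards.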
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time ./run.sh runs.
After the deploy: starting an inference service
ssh noise@10.0.0.237
sudo tailscale up # one-time, interactive
# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml # edit the --model path
docker compose up -d
curl localhost:8080/v1/models # smoke test
Same shape for vllm (port 8000) and ollama (port 11434, no model edit
needed — Ollama serves models on demand).
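The same smoke test can be scripted. This sketch assumes each engine answers the OpenAI-style /v1/models route on the port listed above and returns the usual {"data": [{"id": ...}]} shape. The helper names are made up for illustration.

```python
import json
import urllib.request

# Ports as listed in this README.
ENGINE_PORTS = {"llama": 8080, "vllm": 8000, "ollama": 11434}


def models_url(engine: str, host: str = "localhost") -> str:
    """Build the /v1/models URL for one engine."""
    return f"http://{host}:{ENGINE_PORTS[engine]}/v1/models"


def parse_model_ids(body: bytes) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(engine: str, host: str = "localhost", timeout: float = 5.0) -> list[str]:
    """Query a running engine; raises URLError if the service is down."""
    with urllib.request.urlopen(models_url(engine, host), timeout=timeout) as resp:
        return parse_model_ids(resp.read())
```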
Swapping GPU-resident models
The merged ~110 GB GPU arena (BIOS UMA=0.5 GB + ttm.pages_limit + the
HSA_* env recipe — see StrixHaloMemory.md) holds at most one
~88 GB-class model, and ROCm doesn't reclaim cleanly between consumers.
So switching between, say, the daily-driver 30B and the 235B long-task
model is a stop-then-start of whole compose stacks.
scripts/swap-model (deployed to /usr/local/bin/swap-model on the
box) encodes the coexistence table + per-service health probes so the
swap is one command:
ssh framework swap-model 235b # stop conflicting svcs, start 235B, wait for /health
ssh framework swap-model coder # back to Qwen3-Coder-30B (Ollama)
ssh framework swap-model status # what's currently up?
ssh framework swap-model none # everything down, free the GPU arena
Coexistence table baked into the script:
| Target | What runs | What gets stopped |
|---|---|---|
| coder | ollama (Qwen3-Coder-30B) | 235b, comfyui |
| 235b | qwen3-235b (llama.cpp) | ollama, llama, kimi, comfyui |
| kimi | kimi-linear (vLLM) | 235b, comfyui (ollama can stay) |
| comfyui | comfyui | 235b, kimi (ollama can stay) |
| none | — | all five |
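To make the table concrete, here is a minimal sketch of how it could be encoded as data and turned into a stop/start plan. This is not the contents of scripts/swap-model. The short service labels reuse the table's names, and `plan_swap` is a hypothetical helper.

```python
# The five GPU-resident services referred to as "all five" in the table.
GPU_SERVICES = ("ollama", "llama", "235b", "kimi", "comfyui")

# target -> service to start, services that must be stopped first.
COEXISTENCE = {
    "coder":   {"start": "ollama",  "stop": ("235b", "comfyui")},
    "235b":    {"start": "235b",    "stop": ("ollama", "llama", "kimi", "comfyui")},
    "kimi":    {"start": "kimi",    "stop": ("235b", "comfyui")},
    "comfyui": {"start": "comfyui", "stop": ("235b", "kimi")},
    "none":    {"start": None,      "stop": GPU_SERVICES},
}


def plan_swap(target: str, running: set[str]) -> tuple[list[str], list[str]]:
    """Return (services to stop, services to start) for a target,
    given the set of services currently up."""
    rule = COEXISTENCE[target]
    to_stop = [svc for svc in rule["stop"] if svc in running]
    start = rule["start"]
    to_start = [] if start is None or start in running else [start]
    return to_stop, to_start
```

For example, `plan_swap("235b", {"ollama", "comfyui"})` stops ollama and comfyui, then starts 235b. Services the table allows to coexist are left alone.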
OpenWebUI, LiteLLM, Phoenix, Beszel, etc. are non-GPU and always-on;
the script ignores them. Wait timeout defaults to 600 s (235B's 88 GB
cold load takes 3-5 min); override with SWAP_WAIT_TIMEOUT=<seconds>.
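The wait-with-timeout behaviour can be sketched as a simple poll loop. The real script's probe logic is not shown in this README, so `probe` here is a caller-supplied stand-in and the 5 s poll interval is an assumption. Only the 600 s default and the SWAP_WAIT_TIMEOUT variable come from the text above.

```python
import os
import time


def wait_timeout(env=None) -> float:
    """Read SWAP_WAIT_TIMEOUT, falling back to the 600 s default."""
    env = os.environ if env is None else env
    return float(env.get("SWAP_WAIT_TIMEOUT", "600"))


def wait_until_healthy(probe, timeout, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the timeout elapses.

    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```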
A Mac-side wrapper at bin/swap-model in the repo runs
ssh framework /usr/local/bin/swap-model "$@" — symlink it onto your
PATH so swap-model X works from any shell:
ln -s "$(pwd)/bin/swap-model" /usr/local/bin/swap-model
Override host with SWAP_MODEL_HOST=10.0.0.70 swap-model 235b when
Tailscale resolution is misbehaving.
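The wrapper's host override amounts to a few lines. The repo's bin/swap-model is a shell-style wrapper, so the Python below is only an illustration of the same logic, with a hypothetical function name.

```python
import os
import shlex


def build_ssh_command(args: list[str], env=None) -> list[str]:
    """Build the ssh argv the wrapper would run.

    SWAP_MODEL_HOST overrides the default host alias "framework".
    """
    env = os.environ if env is None else env
    host = env.get("SWAP_MODEL_HOST", "framework")
    # Quote the remote command so arguments survive the remote shell.
    remote = shlex.join(["/usr/local/bin/swap-model", *args])
    return ["ssh", host, remote]
```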
Inference engine consolidation (in progress)
The model count is now at the crossover point, so the manual
swap-model script is being replaced. The plan, phased:
Framing. Full consolidation to one engine isn't possible — vLLM stays as the specialist for the architectures llama.cpp can't serve (Kimi-Linear's KDA/MLA hybrid, the 235B). And the hard part of the memory problem — the cross-engine rule "235B can't share the arena with anything" — survives any consolidation, because it's a constraint between vLLM and the GGUF tier. So the real decision is only about the GGUF tier (Ollama and the standalone llama.cpp stack do the same job) and which tool, if any, coordinates the swaps.
M0 — benchmark the GGUF tier (the deciding unknown). Ollama already
auto-loads/unloads; the only question is whether its bundled llama.cpp
keeps up with the gfx1151-tuned kyuz0 build on this box.
scripts/bench-engines (deployed to /usr/local/bin/bench-engines)
serves the identical GGUF on each engine in isolation and reports
decode/prefill t/s using each engine's own timing fields:
ssh framework bench-engines # runs both, prints a verdict line
M1 — pick the end-state from the benchmark. Both end-states keep two engines (GGUF tier + vLLM); the difference is the coordinator:
- Option 1 (Ollama ≥ ~85 % of llama.cpp decode): Ollama + vLLM, drop the standalone llama.cpp stack. Ollama self-swaps its tier, so no llama-swap needed; the one cross-engine rule ("stop Ollama before 235B") is a few lines. Fewest moving parts.
- Option 2 (kyuz0 lead is large): keep llama.cpp behind llama-swap, drop Ollama. This is where llama-swap earns its place — a YAML-configured OpenAI-compatible proxy whose groups encode the coexistence table declaratively and can launch the vLLM backends too, so one tool coordinates the whole arena and the swap-model script retires. It also subsumes most of LiteLLM's routing role.
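The M1 decision rule reduces to a ratio check. The sketch below treats the README's "~85 %" as a hard 0.85 cut, which is a simplification: the text leaves the boundary deliberately fuzzy, and bench-engines may compute its verdict differently.

```python
def pick_end_state(ollama_decode_tps: float, llamacpp_decode_tps: float,
                   threshold: float = 0.85) -> str:
    """Map two decode rates (tokens/s) to Option 1 or Option 2.

    threshold mirrors the README's approximate 85 % figure.
    """
    if llamacpp_decode_tps <= 0:
        raise ValueError("llama.cpp decode rate must be positive")
    ratio = ollama_decode_tps / llamacpp_decode_tps
    return "option-1" if ratio >= threshold else "option-2"
```

With illustrative (not measured) numbers, 45 t/s on Ollama against 50 t/s on llama.cpp is a ratio of 0.9 and selects Option 1, while 30 t/s against 50 t/s selects Option 2.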
Two alternatives considered, both weaker fits:
- LocalAI — bigger "replace your whole stack" platform with 36+ backends; would force a deeper rewrite for marginal gain on a 4-model stack.
- GPUStack — multi-node cluster manager (DB-backed, UI-first); built for fleets, overkill on a single Strix Halo box.
Tunables
Top of deploy.py:
- ROCM_VERSION and AMDGPU_INSTALL_DEB — bump when AMD ships a newer release. The .deb filename has a build suffix that doesn't derive from the version; find it at https://repo.radeon.com/amdgpu-install/.
- AMDGPU_TOP_VERSION — bump when a newer release lands at https://github.com/Umio-Yasuno/amdgpu_top/releases.
- NVTOP_VERSION — built from source because apt's nvtop predates gfx1151 detection. Bump when a newer release lands at https://github.com/Syllo/nvtop/releases. Run sudo nvtop to see all GPU processes (non-root only sees the calling user's own).
- BTOP_VERSION — built from source because apt's btop has no AMD GPU support. 1.4+ requires C++23, hence the ubuntu-toolchain-r/test PPA for g++-14. The build links librocm-smi-dev for AMD GPU monitoring. Bump at https://github.com/aristocratos/btop/releases. In btop, Esc → Options → "show_gpu_info" → On to enable the GPU panel.
Compose images in
compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}.yml
— pin tags here. Homepage's tile/layout config is in compose/homepage/
(services.yaml, settings.yaml, etc.); edit there, ./run.sh, restart
the homepage container.
Voice
Two parallel voice stacks, each speaking a different protocol so they serve different clients without conflicting.
Wyoming (Home Assistant Assist)
Wyoming-protocol speech servers. No web UI — TCP protocol servers consumable by HA Assist, Wyoming satellites, or any Wyoming client.
- Whisper (/srv/docker/whisper, tcp://framework:10300) — STT. Default model tiny-int8 (~75 MB). Models download into /srv/docker/whisper/data/ on first run.
- Piper (/srv/docker/piper, tcp://framework:10200) — TTS. Default voice en_US-lessac-medium (~63 MB). Voice catalog at https://github.com/rhasspy/piper/blob/master/VOICES.md.
OpenAI-compatible (OpenWebUI, Conduit, scripts)
OpenAI-API-compatible servers — /v1/audio/transcriptions and
/v1/audio/speech. Used by OpenWebUI's Audio settings (and through
OpenWebUI, by the Conduit Android app).
- faster-whisper (/srv/docker/faster-whisper, http://framework:8001) — STT, model large-v3-turbo by default (~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it real-time. Built-in web UI at http://framework:8001/.
- Kokoro (/srv/docker/kokoro, http://framework:8880) — TTS, Kokoro-82M (~340 MB). Apache 2.0, much more natural than Piper.
Bring-up (all four):
for svc in whisper piper faster-whisper kokoro; do
( cd /srv/docker/$svc && docker compose up -d )
done
Wiring OpenWebUI for voice
OpenWebUI Admin → Settings → Audio:
- STT: OpenAI engine, URL http://faster-whisper:8000/v1, any non-empty API key, model Systran/faster-whisper-large-v3-turbo.
- TTS: OpenAI engine, URL http://kokoro:8880/v1, any non-empty API key, voice af_bella (or any from the Kokoro voices list).
The hostnames faster-whisper and kokoro resolve only when OpenWebUI shares a
user-defined Docker network with those containers; Docker's default bridge
provides no container-name DNS. If OpenWebUI can't resolve them, point it at
the host-published ports instead: host.docker.internal:8001 for
faster-whisper and host.docker.internal:8880 for Kokoro.
After saving, the microphone icon in OpenWebUI activates and a voice-call quick action appears. Conduit on Android (an OpenWebUI client) inherits this configuration automatically — no app-side voice setup.
See localgenai/VoiceModels.md for the
landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).
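Scripts can hit the same TTS endpoint that OpenWebUI uses. The sketch below builds an OpenAI-style /v1/audio/speech request against Kokoro's host port. The model name "kokoro" and the placeholder bearer token are assumptions; check the Kokoro container's docs for the values it expects.

```python
import json
import urllib.request


def build_speech_request(text: str, voice: str = "af_bella",
                         base_url: str = "http://framework:8880/v1",
                         model: str = "kokoro") -> urllib.request.Request:
    """Build (but do not send) a POST to /audio/speech.

    model="kokoro" is an assumed default, not confirmed by this README.
    """
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Any non-empty key, matching the OpenWebUI setting above.
            "Authorization": "Bearer unused",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` and writing the response bytes to a file would produce the audio, assuming the service is up.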
Front door
- Homepage (/srv/docker/homepage, http://framework:7575) — single page with one tile per service in this stack, with native widgets pulling live state (loaded models from Ollama, container status from the docker socket, etc.). Bookmarks for the reference docs across the bottom. Use as the default landing page on this network.
  Bring-up: cd /srv/docker/homepage && docker compose up -d. No first-run setup — it reads /srv/docker/homepage/config/*.yaml and renders.
  To customize: edit pyinfra/framework/compose/homepage/*.yaml in the repo, run ./run.sh, then docker compose restart homepage on the box. Direct edits to /srv/docker/homepage/config/*.yaml will be overwritten on the next pyinfra deploy.
Monitoring stack
Two compose stacks watch the inference services:
- Beszel (/srv/docker/beszel, http://framework:8090) — host + Docker container + AMD GPU dashboard. The agent's amd_sysfs collector reads /sys/class/drm/cardN/device/ directly; this is the only path that reports real numbers on Strix Halo (gfx1151) — amd-smi returns N/A for util/power/temp on this APU.
  First-time bring-up:
cd /srv/docker/beszel
docker compose up -d beszel   # 1. hub
# 2. open http://framework:8090 → create admin → "Add system"
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
BESZEL_TOKEN=<token-from-dialog>
BESZEL_KEY=ssh-ed25519 AAAA…
EOF
sudo chgrp docker /srv/docker/beszel/.env
sudo chmod 640 /srv/docker/beszel/.env
docker compose up -d --force-recreate beszel-agent   # 4. agent
- OpenLIT (/srv/docker/openlit, http://framework:3001) — fleet metrics for the LLM stack (cost, tokens, latency aggregated across sessions and models). Auto-instruments Ollama and vLLM via OpenTelemetry without app-code changes; for llama.cpp, the compose file already passes --metrics so OpenLIT can scrape /metrics. ClickHouse-backed. OTLP receivers exposed on the host at :4327 (gRPC) / :4328 (HTTP) — Phoenix owns 4317/4318.
  Bring-up: cd /srv/docker/openlit && docker compose up -d. UI prompts to create an admin account on first load.
- Phoenix (/srv/docker/phoenix, http://framework:6006) — per-trace agent waterfall / flamegraph. Designed to answer "show me what one OpenCode turn actually did" — full call tree, nested LLM/tool calls, token counts inline. Single container, SQLite-backed. OTLP receivers on the host at :4317 (gRPC) and :6006/v1/traces (HTTP) — Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.
  Bring-up: cd /srv/docker/phoenix && docker compose up -d. UI immediately available; no auth in the self-hosted single-user path.
  See localgenai/opencode/README.md for how OpenCode is configured to ship traces here.
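Since the two tracing backends listen on different OTLP ports, a client has to pick the right endpoint. The sketch below maps the ports listed above onto the standard OpenTelemetry exporter environment variables. The helper and its table are illustrative, not part of the repo.

```python
# Host-side OTLP receivers as listed in this README.
OTLP_ENDPOINTS = {
    ("phoenix", "grpc"): "http://framework:4317",
    ("phoenix", "http"): "http://framework:6006/v1/traces",
    ("openlit", "grpc"): "http://framework:4327",
    ("openlit", "http"): "http://framework:4328",
}


def otlp_env(backend: str, protocol: str) -> dict[str, str]:
    """Return OTEL_* variables for one backend/protocol pair."""
    endpoint = OTLP_ENDPOINTS[(backend, protocol)]
    wire = "grpc" if protocol == "grpc" else "http/protobuf"
    if endpoint.endswith("/v1/traces"):
        # A full traces URL goes in the signal-specific variable, which
        # exporters use as-is instead of appending /v1/traces.
        return {
            "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": endpoint,
            "OTEL_EXPORTER_OTLP_PROTOCOL": wire,
        }
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": wire,
    }
```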
The official AMD Prometheus exporters (amd-smi-exporter,
device-metrics-exporter) are intentionally not deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with node_exporter +
a textfile collector reading /sys/class/drm/cardN/device/ and
/sys/class/hwmon/.
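If that migration ever happens, the textfile collector could be as small as the sketch below. It reads a few amdgpu sysfs files and renders them in the Prometheus text format. The metric names are invented for illustration, and which sysfs files are populated on gfx1151 should be checked on the box.

```python
from pathlib import Path

# sysfs filename -> hypothetical Prometheus metric name.
SYSFS_METRICS = {
    "gpu_busy_percent": "amdgpu_busy_percent",
    "mem_info_vram_used": "amdgpu_vram_used_bytes",
    "mem_info_gtt_used": "amdgpu_gtt_used_bytes",
}


def render_textfile(device_dir: Path, card: str = "card0") -> str:
    """Render readable sysfs counters as Prometheus text-format lines.

    Files that are missing or non-numeric are skipped rather than reported.
    """
    lines = []
    for filename, metric in SYSFS_METRICS.items():
        try:
            value = int((device_dir / filename).read_text().strip())
        except (OSError, ValueError):
            continue
        lines.append(f'{metric}{{card="{card}"}} {value}')
    return "\n".join(lines) + "\n"
```

On the box this would be called with `Path("/sys/class/drm/card0/device")` and the output written to a .prom file in node_exporter's textfile directory.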
Agent harnesses
Two agent UIs in front of the local model stack — pick one (or both) based on how you like to drive the agent:
- OpenWebUI (/srv/docker/openwebui, http://framework:3000) — ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search. Best for casual chat and household-shared LLM access.
  Bring-up: cd /srv/docker/openwebui && docker compose up -d.
- OpenHands (/srv/docker/openhands, http://framework:3030) — autonomous agent in a Docker sandbox. Spawns a per-conversation agent-server container that can write code, run tests, browse the web. Pre-configured for Ollama at openai/qwen3-coder:30b over the OpenAI-compatible endpoint; ships traces to Phoenix.
  Bring-up: cd /srv/docker/openhands && docker compose up -d. First run pulls the agent-server image (~2 GB) lazily on first conversation, not at startup, so the orchestrator comes up fast but your first message takes 30–60 s. Pre-0.44 state path was ~/.openhands-state; not relevant on a fresh install.
  Security note: the orchestrator container has docker-socket access and spawns code-running sandboxes. Fine to expose on a Tailscale-only box; change the compose port mapping back to 127.0.0.1:3030:3000 and tunnel in if this host ever sees LAN or internet traffic.
  Tool-call quality with local models is much better when Ollama's context is bumped — the compose at /srv/docker/ollama already sets OLLAMA_CONTEXT_LENGTH=65536, which is plenty.
OpenCode (the terminal driver, configured in
localgenai/opencode/) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.
The llama.cpp image is kyuz0/amd-strix-halo-toolboxes:vulkan-radv,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.
Remote IDE (code-server)
code-server (/srv/docker/code-server, http://framework:8443) —
full VS Code in the browser, served from the box. The point: Claude
Code (extension from Open VSX + CLI) runs inside the container, so
long agent tasks execute here and survive the laptop sleeping. Any
tailnet device gets the same editor and the same running session.
Bring-up: cd /srv/docker/code-server && docker compose up -d, then
see /srv/docker/code-server/README.md for the one-time extension
install + sign-in flow. No password is set (Tailscale-only trust
model, like everything else here) — set HASHED_PASSWORD before this
host ever sees other traffic; a browser terminal is a shell.
The workspace is container-scoped on purpose (no host filesystem, no docker socket). For host-level Claude work, use the section below instead.
Workspace manager (Coder) — pilot
Coder (/srv/docker/coder, http://framework:7080) — self-hosted
workspace manager: a web dashboard that stamps out per-project dev
containers from Terraform templates. The shipped code-server template
gives each workspace browser VS Code with the Claude Code extension
(from Open VSX) pre-installed and the claude CLI on PATH; workspace
home volumes persist ~/.claude creds and repos across stop/start.
Workspaces publish no host ports — everything tunnels through the
dashboard — and idle autostop hands parked projects' RAM back to the
inference stacks.
Bring-up + template push + first workspace:
/srv/docker/coder/README.md. The .env (docker GID, access URL,
Postgres password) is auto-generated by ./run.sh — nothing to fill
in by hand.
This is a pilot against the standalone code-server stack above: after a week of real use, one of them retires. Same security posture as OpenHands (docker-socket holder, spawns code-running containers) — Tailscale-only, never expose :7080 further.
Claude Code on the box (remote-control)
For long unattended runs steered from a phone or browser — no IDE involved — Claude Code's official Remote Control feature bridges a session running on the box to claude.ai/code and the Claude mobile app. It needs Claude Code ≥ 2.1.51 with a claude.ai login (Pro/Max plan; API keys don't work for this). Traffic relays outbound-only over TLS to Anthropic, so no inbound ports are required.
ssh framework
tmux new -s rc
cd ~/src/whatever
claude remote-control --name "Strix Halo"
# prints a session URL → open on any device, or find it in the
# session list at claude.ai/code / the Claude app's Code tab
Operational notes (state of the feature as of 2026-06):
- The local process is the engine — tasks keep executing with no client attached; clients are viewports. But if the process dies, the session is gone (no --resume equivalent for remote-control sessions yet, issue #30447). Hence tmux, always.
- It needs a TTY — nohup/systemd don't work. tmux satisfies this.
- --spawn worktree gives each connecting session its own git worktree; --spawn same-dir (default) shares the directory.
- Headless boxes print the session URL but no QR (#34764) — copy the URL or use the session list.
- Mobile clients sometimes drop after long idle (#28914, #34255); reconnecting always works — the task on the box is unaffected.
- Run claude interactively once per project dir first to clear the workspace-trust prompt before automating anything.
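Because the session needs a TTY and must outlive the SSH connection, a launcher has to go through tmux. The sketch below only builds the argv for a detached tmux session running the command shown earlier. The helper name is hypothetical, and it assumes the workspace-trust prompt has already been cleared for that directory.

```python
import shlex


def tmux_remote_control_argv(project_dir: str, session: str = "rc",
                             name: str = "Strix Halo") -> list[str]:
    """Build a tmux command that starts remote-control in a detached session.

    project_dir should be an absolute path; tmux will not expand "~".
    """
    claude_cmd = shlex.join(["claude", "remote-control", "--name", name])
    return [
        "tmux", "new-session",
        "-d",               # detached: survives the SSH session ending
        "-s", session,      # session name, as in "tmux new -s rc"
        "-c", project_dir,  # working directory for the new session
        claude_cmd,
    ]
```

The list can be passed to `subprocess.run` on the box. Attach later with `tmux attach -t rc` to read the session URL.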