# pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.

## Manual prerequisites

1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
   jammy/noble; later Ubuntu releases install but break the host-side
   toolchain (libxml2 ABI). Enable SSH, import your laptop key, create
   user `noise`. Recommended partitioning: ≥300 GB on `/`, big disk
   mounted at `/models`, plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
   if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
   thread sudo passwords. One-time setup:
   ```sh
   ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
   ```

## Run

```sh
uv tool install pyinfra
./run.sh         # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry   # any extra args are forwarded to pyinfra
```

Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.

## What the deploy does

- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
  `amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
  Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
  and runs `update-grub`, but won't reboot for you. (A sketch of the
  resulting kernel command line follows this list.)
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml`
  dropped in, not auto-started — you edit the model path then
  `docker compose up -d`
  (OpenWebUI needs no edits — it's pre-configured to find Ollama at
  `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
  (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
  OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300,
  Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro
  OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses",
  "Front door", and "Voice" below)

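A sketch of what the GRUB change amounts to (illustrative; exactly how the
existing `GRUB_CMDLINE_LINUX_DEFAULT` line gets rewritten lives in `deploy.py`):

```sh
# /etc/default/grub after the deploy: existing params kept, two appended
GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=off amdgpu.gttsize=117760"

# after rebooting, confirm they took effect
tr ' ' '\n' < /proc/cmdline | grep -E 'amd_iommu|amdgpu\.gttsize'
```
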
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.

## After the deploy: starting an inference service

```sh
ssh noise@10.0.0.237
sudo tailscale up        # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml   # edit the --model path
docker compose up -d
curl localhost:8080/v1/models   # smoke test
```

Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).

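For a deeper smoke test than listing models, the same llama.cpp server
answers OpenAI-style chat completions (a sketch; the prompt is arbitrary and
the model field is mostly cosmetic for a single-model llama.cpp server):

```sh
curl -s localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "whichever-gguf-you-loaded",
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 32
      }'
```

The same request shape works against `localhost:8000` (vLLM) and
`localhost:11434` (Ollama, with a real model name).
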
## Tunables

Top of `deploy.py`:
- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
  release. The .deb filename has a build suffix that doesn't derive from
  the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
  https://github.com/Umio-Yasuno/amdgpu_top/releases.

Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}.yml`
— pin tags here. Homepage's tile/layout config is in `compose/homepage/`
(`services.yaml`, `settings.yaml`, etc.); edit there, `./run.sh`, restart
the homepage container.

## Voice

Two parallel voice stacks, each speaking a different protocol so they
serve different clients without conflicting.

### Wyoming (Home Assistant Assist)

Wyoming-protocol speech servers. No web UI — TCP protocol servers
consumable by HA Assist, Wyoming satellites, or any Wyoming client.

- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — STT.
  Default model `tiny-int8` (~75 MB). Models download into
  `/srv/docker/whisper/data/` on first run.

- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — TTS.
  Default voice `en_US-lessac-medium` (~63 MB). Voice catalog at
  <https://github.com/rhasspy/piper/blob/master/VOICES.md>.

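Wyoming is a plain TCP protocol, so a quick reachability check from any
machine on the tailnet is enough to confirm both servers are up (the
`framework` hostname is assumed resolvable; substitute the IP otherwise):

```sh
nc -zv framework 10300   # Wyoming Whisper (STT)
nc -zv framework 10200   # Wyoming Piper (TTS)
```

In Home Assistant, each one is added as its own "Wyoming Protocol"
integration pointing at the corresponding host/port.
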
### OpenAI-compatible (OpenWebUI, Conduit, scripts)

OpenAI-API-compatible servers — `/v1/audio/transcriptions` and
`/v1/audio/speech`. Used by OpenWebUI's Audio settings (and through
OpenWebUI, by the Conduit Android app).

- **faster-whisper** (`/srv/docker/faster-whisper`,
  http://framework:8001) — STT, model `large-v3-turbo` by default
  (~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it
  real-time. Built-in web UI at <http://framework:8001/>.

- **Kokoro** (`/srv/docker/kokoro`, http://framework:8880) — TTS,
  Kokoro-82M (~340 MB). Apache 2.0, much more natural than Piper.

Bring-up (all four):
```sh
for svc in whisper piper faster-whisper kokoro; do
  ( cd /srv/docker/$svc && docker compose up -d )
done
```

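Quick end-to-end checks of both endpoints from any machine on the tailnet
(a sketch: `sample.wav` is any short recording, the STT model and `af_bella`
voice match the wiring section below, and the `kokoro` model string plus mp3
output are assumptions to verify against the Kokoro container's docs):

```sh
# STT: multipart upload, OpenAI transcription shape
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=Systran/faster-whisper-large-v3-turbo

# TTS: JSON body, OpenAI speech shape; writes an audio file
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "input": "Voice stack is up.", "voice": "af_bella"}' \
  -o check.mp3
```
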
### Wiring OpenWebUI for voice

OpenWebUI Admin → Settings → Audio:

- **STT**: `OpenAI` engine, URL `http://faster-whisper:8000/v1` (container
  port; the host publishes it as :8001), any non-empty API key, model
  `Systran/faster-whisper-large-v3-turbo`.
- **TTS**: `OpenAI` engine, URL `http://kokoro:8880/v1`, any non-empty
  API key, voice `af_bella` (or any from the Kokoro voices list).

The hostnames `faster-whisper` and `kokoro` resolve when OpenWebUI shares a
user-defined Docker network with those containers (container-name DNS works
on user-defined bridges, not the default one). If OpenWebUI can't resolve
them by name, use `host.docker.internal:8001` and `:8880` instead.

After saving, the microphone icon in OpenWebUI activates and a
voice-call quick action appears. Conduit on Android (an OpenWebUI
client) inherits this configuration automatically — no app-side voice
setup.

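If you'd rather persist these settings in compose than click through the
admin UI, OpenWebUI also reads its audio configuration from environment
variables. The names below are taken from OpenWebUI's env-var documentation
at the time of writing and may drift between releases, so treat this as a
sketch and verify against the current docs:

```yaml
# excerpt of an openwebui compose service: audio settings via env
environment:
  - AUDIO_STT_ENGINE=openai
  - AUDIO_STT_OPENAI_API_BASE_URL=http://faster-whisper:8000/v1
  - AUDIO_STT_OPENAI_API_KEY=anything-non-empty
  - AUDIO_STT_MODEL=Systran/faster-whisper-large-v3-turbo
  - AUDIO_TTS_ENGINE=openai
  - AUDIO_TTS_OPENAI_API_BASE_URL=http://kokoro:8880/v1
  - AUDIO_TTS_OPENAI_API_KEY=anything-non-empty
  - AUDIO_TTS_VOICE=af_bella
```

Values saved in the admin UI persist and win over the env defaults on later
starts, so pick one place and stick with it.
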
See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).

## Front door

- **Homepage** (`/srv/docker/homepage`, http://framework:7575) — single
  page with one tile per service in this stack, with native widgets
  pulling live state (loaded models from Ollama, container status from
  the docker socket, etc.). Bookmarks for the reference docs across the
  bottom. Use as the default landing page on this network.

  Bring-up: `cd /srv/docker/homepage && docker compose up -d`. No
  first-run setup — it reads `/srv/docker/homepage/config/*.yaml` and
  renders.

  To customize: edit `pyinfra/framework/compose/homepage/*.yaml` in the
  repo, run `./run.sh`, then `docker compose restart homepage` on the
  box. Direct edits to `/srv/docker/homepage/config/*.yaml` will be
  overwritten on the next pyinfra deploy.

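  For orientation when editing, a `services.yaml` entry is a group with
  tiles underneath; the shape below is illustrative (group and tile names,
  and the `ollama` widget type, should be checked against the Homepage docs
  and the config this repo actually ships):

  ```yaml
  # compose/homepage/services.yaml (illustrative excerpt)
  - Inference:
      - Ollama:
          href: http://framework:11434
          description: model server
          widget:
            type: ollama
            url: http://framework:11434
  ```
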
## Monitoring stack

Two compose stacks watch the inference services:

- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
  container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
  `/sys/class/drm/cardN/device/` directly; this is the only path that
  reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
  for util/power/temp on this APU.

  First-time bring-up:
  ```sh
  cd /srv/docker/beszel
  docker compose up -d beszel   # 1. hub
  # 2. open http://framework:8090 → create admin → "Add system"
  # 3. copy the TOKEN and the SSH KEY from the dialog into .env:
  sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
  BESZEL_TOKEN=<token-from-dialog>
  BESZEL_KEY=ssh-ed25519 AAAA…
  EOF
  sudo chgrp docker /srv/docker/beszel/.env
  sudo chmod 640 /srv/docker/beszel/.env
  docker compose up -d --force-recreate beszel-agent   # 4. agent
  ```

- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
  metrics for the LLM stack (cost, tokens, latency aggregated across
  sessions and models). Auto-instruments Ollama and vLLM via
  OpenTelemetry without app-code changes; for llama.cpp, the compose
  file already passes `--metrics` so OpenLIT can scrape `/metrics`.
  ClickHouse-backed. OTLP receivers exposed on the host at **:4327
  (gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.

  Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
  to create an admin account on first load.

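  Anything that already speaks OTLP can be pointed at these receivers with
  the stock OpenTelemetry exporter variables; a minimal sketch (the
  `framework` hostname and receiver ports come from this setup, the env
  vars are the standard OTel SDK ones):

  ```sh
  export OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4328
  export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  # or, for the gRPC receiver:
  # export OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4327
  # export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
  ```
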
- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
  agent waterfall / flamegraph. Designed to answer "show me what one
  OpenCode turn actually did" — full call tree, nested LLM/tool calls,
  token counts inline. Single container, SQLite-backed. OTLP receivers
  on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
  Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.

  Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
  immediately available; no auth in the self-hosted single-user path.

  See [`localgenai/opencode/README.md`](../../opencode/README.md) for
  how OpenCode is configured to ship traces here.

The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.

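A minimal sketch of what such a textfile collector could look like, run from
cron or a systemd timer (paths and metric names are illustrative; which sysfs
files are actually populated varies by kernel, so check the card directory
first):

```sh
#!/bin/sh
# Emit amdgpu sysfs readings in Prometheus textfile-collector format.
CARD=/sys/class/drm/card0/device
OUT=/var/lib/node_exporter/textfile/amdgpu.prom

{
  echo "# TYPE amdgpu_gpu_busy_percent gauge"
  echo "amdgpu_gpu_busy_percent $(cat "$CARD/gpu_busy_percent")"
  echo "# TYPE amdgpu_vram_used_bytes gauge"
  echo "amdgpu_vram_used_bytes $(cat "$CARD/mem_info_vram_used")"
} > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"   # atomic swap so node_exporter never reads a partial file
```
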
## Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:

- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
  ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
  Best for casual chat and household-shared LLM access.

  Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.

- **OpenHands** (`/srv/docker/openhands`, http://framework:3030) —
  autonomous agent in a Docker sandbox. Spawns a per-conversation
  `agent-server` container that can write code, run tests, browse the
  web. Pre-configured for Ollama at `openai/qwen3-coder:30b` over the
  OpenAI-compatible endpoint; ships traces to Phoenix.

  Bring-up: `cd /srv/docker/openhands && docker compose up -d`. The
  agent-server image (~2 GB) is pulled lazily on the first conversation
  rather than at startup, so the orchestrator comes up fast but your
  first message takes 30–60 s. Pre-0.44 state path was
  `~/.openhands-state`; not relevant on a fresh install.

  > **Security note:** the orchestrator container has docker-socket
  > access and spawns code-running sandboxes. Fine to expose on a
  > Tailscale-only box; **change the compose port mapping back to
  > `127.0.0.1:3030:3000` and tunnel in** if this host ever sees LAN
  > or internet traffic.

Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.

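In compose terms that's just an environment entry on the Ollama service
(roughly; the shipped file may differ in layout):

```yaml
services:
  ollama:
    environment:
      - OLLAMA_CONTEXT_LENGTH=65536
```
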
OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.

The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.