# localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.
## Topology

```
┌─────────────────┐       Tailscale       ┌──────────────────────────────┐
│ Mac             │ ────────────────────▶ │ Framework Desktop            │
│                 │                       │ (Ubuntu 26.04, gfx1151 iGPU) │
│ • OpenCode      │                       │ • Inference: Ollama,         │
│   + Phoenix     │                       │   llama.cpp, vLLM            │
│   bridge        │                       │ • Agent UIs: OpenWebUI,      │
│ • install.sh    │                       │   OpenHands                  │
│                 │                       │ • Monitoring: Beszel,        │
│                 │                       │   OpenLIT, Phoenix           │
└─────────────────┘                       └──────────────────────────────┘
```
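Everything below addresses the box by its tailnet name. A quick connectivity check from the Mac side (a sketch; it assumes the machine is named `framework` on the tailnet, matching the URLs used throughout):

```sh
tailscale status | grep -i framework                              # box visible on the tailnet?
curl -s -o /dev/null -w '%{http_code}\n' http://framework:7575/   # Homepage front door answers?
```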
## Repo layout

| Path | What's there |
|---|---|
| `pyinfra/` | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See `pyinfra/framework/README.md`. |
| `opencode/` | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| `StrixHaloSetup.md` | Original phased bring-up plan for the box. |
| `StrixHaloMemory.md` | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| `Roadmap.md` | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| `TODO.md` | Open questions and follow-ups. |
| `testing/qwen3-coder-30b/` | Small evaluation harness for Qwen3-Coder. |
| `VoiceModels.md` | STT / TTS landscape and upgrade paths from the Wyoming defaults. |
## Service ports (on `framework`)

| Port | Service | Notes |
|---|---|---|
| 7575 | Homepage | Front door — start here. Tile per service with live widgets. |
| 8080 | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| 8000 | vLLM | ROCm; gfx1151 support varies |
| 11434 | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| 3000 | OpenWebUI | ChatGPT-style UI in front of Ollama |
| 3001 | OpenLIT | LLM fleet metrics dashboard |
| 3030 | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
| 4317 | Phoenix OTLP/gRPC | Trace ingestion |
| 6006 | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| 8001 | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
| 8090 | Beszel | Host + container + AMD GPU dashboard |
| 8880 | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
| 10200 | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
| 10300 | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
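Both voice services speak the OpenAI audio API, so they can be smoke-tested with curl before wiring them into OpenWebUI. A minimal sketch, assuming a local `sample.wav`; the Kokoro voice name is an assumption, so check the container's voice list if it differs:

```sh
# STT: send a WAV through faster-whisper's OpenAI-style endpoint.
# Omitting -F model=... falls back to the server default (large-v3-turbo here).
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@sample.wav

# TTS: ask Kokoro for speech; "af_bella" is one of Kokoro-82M's stock voices.
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "input": "Voice stack is up.", "voice": "af_bella"}' \
  -o speech.mp3
```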
## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama  && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel  && docker compose up -d   # then add system per beszel.yml
```

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```

Send a prompt; watch it land in Phoenix at http://framework:6006.
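If nothing lands in Phoenix, split the problem by hitting Ollama directly first; `/api/generate` is Ollama's native endpoint, and the model tag below is an assumption, so swap in whatever `ollama list` reports on the box:

```sh
# A 200 with a "response" field means inference works and the issue is in the bridge.
curl -s http://framework:11434/api/generate \
  -d '{"model": "qwen3-coder:30b", "prompt": "Say hi.", "stream": false}'
```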
## Why this stack

- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s of memory bandwidth means MoE models with small active parameter counts (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate; see the back-of-envelope sketch below and StrixHaloMemory.md.
- AMD APU monitoring is broken upstream. `amd-smi` returns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works. See the monitoring section of `pyinfra/framework/README.md`.
- Two-layer observability. OpenLIT is fleet metrics across sessions; Phoenix is a per-trace waterfall for "what did this one prompt do."
- Reproducible. Every host-side config lives in `pyinfra/`; every Mac-side config in `opencode/install.sh`. Re-running either is safe.
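The bandwidth claim in rough numbers: decode is memory-bound, so every generated token re-reads the active weights, and tokens/s tops out near bandwidth divided by bytes touched per token. The Q4 sizes below are approximations, not measurements:

```
dense 30B   @ Q4 ≈ 17 GB read per token → 256 GB/s ÷ 17 GB ≈  15 tok/s ceiling
30B-A3B MoE @ Q4 ≈  2 GB read per token → 256 GB/s ÷  2 GB ≈ 130 tok/s ceiling
```

Real throughput lands well below either ceiling (KV cache reads, compute overhead), but the order-of-magnitude gap is why the small-active-param MoE picks win on this box.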