# localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.
## Topology

```
┌─────────────────┐       Tailscale       ┌──────────────────────────────┐
│ Mac             │ ────────────────────▶ │ Framework Desktop            │
│                 │                       │ (Ubuntu 26.04, gfx1151 iGPU) │
│ • OpenCode      │                       │ • Inference: Ollama,         │
│   + Phoenix     │                       │   llama.cpp, vLLM            │
│   bridge        │                       │ • Agent UIs: OpenWebUI,      │
│ • install.sh    │                       │   OpenHands                  │
│                 │                       │ • Monitoring: Beszel,        │
│                 │                       │   OpenLIT, Phoenix           │
└─────────────────┘                       └──────────────────────────────┘
```
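Everything below addresses the box by its tailnet name. A quick connectivity check from the Mac side (a sketch; it assumes the machine is named `framework` on the tailnet, matching the URLs used throughout):

```sh
tailscale status | grep -i framework                              # box visible on the tailnet?
curl -s -o /dev/null -w '%{http_code}\n' http://framework:7575/   # Homepage front door answers?
```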
## Repo layout

| Path | What's there |
|---|---|
| `pyinfra/` | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See `pyinfra/framework/README.md`. |
| `opencode/` | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| `StrixHaloSetup.md` | Original phased bring-up plan for the box. |
| `StrixHaloMemory.md` | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| `Roadmap.md` | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| `TODO.md` | Open questions and follow-ups. |
| `testing/qwen3-coder-30b/` | Small evaluation harness for Qwen3-Coder. |
| `VoiceModels.md` | STT / TTS landscape and upgrade paths from the Wyoming defaults. |
## Service ports (on `framework`)

| Port | Service | Notes |
|---|---|---|
| 7575 | Homepage | Front door — start here. Tile per service with live widgets. |
| 8080 | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| 8000 | vLLM | ROCm; gfx1151 support varies |
| 11434 | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| 3000 | OpenWebUI | ChatGPT-style UI in front of Ollama |
| 3001 | OpenLIT | LLM fleet metrics dashboard |
| 3030 | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
| 4317 | Phoenix OTLP/gRPC | Trace ingestion |
| 6006 | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| 8001 | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
| 8090 | Beszel | Host + container + AMD GPU dashboard |
| 8880 | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
| 10200 | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
| 10300 | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
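Both voice services speak the OpenAI audio API, so they can be smoke-tested with curl before wiring them into OpenWebUI. A minimal sketch, assuming a local `sample.wav`; the Kokoro voice name is an assumption, so check the container's voice list if it differs:

```sh
# STT: send a WAV through faster-whisper's OpenAI-style endpoint.
# Omitting -F model=... falls back to the server default (large-v3-turbo here).
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@sample.wav

# TTS: ask Kokoro for speech; "af_bella" is one of Kokoro-82M's stock voices.
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "input": "Voice stack is up.", "voice": "af_bella"}' \
  -o speech.mp3
```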
## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama  && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel  && docker compose up -d   # then add system per beszel.yml
```

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```

Send a prompt; watch it land in Phoenix at http://framework:6006.
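If nothing lands in Phoenix, split the problem by hitting Ollama directly first; `/api/generate` is Ollama's native endpoint, and the model tag below is an assumption, so swap in whatever `ollama list` reports on the box:

```sh
# A 200 with a "response" field means inference works and the issue is in the bridge.
curl -s http://framework:11434/api/generate \
  -d '{"model": "qwen3-coder:30b", "prompt": "Say hi.", "stream": false}'
```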
## Why this stack

- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s of memory bandwidth means MoE models with small active parameter counts (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate; see the back-of-envelope sketch below and StrixHaloMemory.md.
- AMD APU monitoring is broken upstream. `amd-smi` returns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works. See the monitoring section of `pyinfra/framework/README.md`.
- Two-layer observability. OpenLIT is fleet metrics across sessions; Phoenix is a per-trace waterfall for "what did this one prompt do."
- Reproducible. Every host-side config lives in `pyinfra/`; every Mac-side config in `opencode/install.sh`. Re-running either is safe.
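The bandwidth claim in rough numbers: decode is memory-bound, so every generated token re-reads the active weights, and tokens/s tops out near bandwidth divided by bytes touched per token. The Q4 sizes below are approximations, not measurements:

```
dense 30B   @ Q4 ≈ 17 GB read per token → 256 GB/s ÷ 17 GB ≈  15 tok/s ceiling
30B-A3B MoE @ Q4 ≈  2 GB read per token → 256 GB/s ÷  2 GB ≈ 130 tok/s ceiling
```

Real throughput lands well below either ceiling (KV cache reads, compute overhead), but the order-of-magnitude gap is why the small-active-param MoE picks win on this box.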