Add OpenAI-compatible voice servers (faster-whisper + Kokoro)

Path B from VoiceModels.md — adds two new compose stacks alongside the
Wyoming pair so OpenWebUI/Conduit get voice without a Wyoming shim:

- compose/faster-whisper.yml — fedirz/faster-whisper-server CPU image,
  large-v3-turbo by default, OpenAI /v1/audio/transcriptions on :8001.
  Built-in web UI for ad-hoc transcription.
- compose/kokoro.yml — ghcr.io/remsky/kokoro-fastapi-cpu, Kokoro-82M,
  OpenAI /v1/audio/speech on :8880.
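
As a sketch, the Kokoro stack described above could look like this as a
compose file — the image and host port are the ones named in this
message; the service name, tag, and restart policy are illustrative
assumptions, not the committed file:

```yaml
# Illustrative sketch of compose/kokoro.yml (NOT the committed file).
# Image and host port come from the commit message; the rest is assumed.
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8880:8880"   # OpenAI-compatible /v1/audio/speech
    restart: unless-stopped
```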

Both run alongside (not instead of) Wyoming Whisper + Piper — Wyoming
keeps serving HA Assist, the OpenAI-compatible pair serves OpenWebUI /
Conduit. Memory
budget on Strix Halo accommodates everything plus Qwen3-Coder loaded
concurrently with plenty of headroom.

Homepage gets dedicated tiles for both. README documents the
OpenWebUI Audio configuration that wires the new endpoints. Conduit
inherits voice via OpenWebUI without app-side setup.
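
Because these are plain OpenAI audio routes, scripts can hit them
directly as well as OpenWebUI. A minimal client sketch, assuming the
standard OpenAI `/v1/audio/speech` request shape (`model`, `input`,
`voice` fields) and the `af_bella` voice — it only builds the request,
it does not send it:

```python
import json
from urllib.request import Request

def tts_request(text: str, voice: str = "af_bella") -> Request:
    """Build a POST for Kokoro's OpenAI-compatible /v1/audio/speech.

    Host and port are the ones this commit exposes; the model and
    voice names are assumptions, not taken from the committed config.
    """
    body = json.dumps({"model": "kokoro", "input": text, "voice": voice})
    return Request(
        "http://framework:8880/v1/audio/speech",
        data=body.encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer unused",  # any non-empty key works
        },
        method="POST",
    )

req = tts_request("Hello from Strix Halo")
```

Sending it is `urllib.request.urlopen(req)`; the response body is raw
audio bytes.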

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:42:45 -04:00
parent 36b8cfe835
commit 6db46d8f6a
7 changed files with 157 additions and 27 deletions


@@ -42,14 +42,15 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
 Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
 and runs `update-grub`, but won't reboot for you.
 - `/models/<vendor>/` layout
-- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper}/docker-compose.yml`
+- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml`
   dropped in, not auto-started — you edit the model path then
   `docker compose up -d`
   (OpenWebUI needs no edits — it's pre-configured to find Ollama at
   `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
   (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
-  OpenHands UI at :3030, Homepage at :7575, Whisper STT at :10300,
-  Piper TTS at :10200 — see "Monitoring stack", "Agent harnesses",
+  OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300,
+  Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro
+  OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses",
   "Front door", and "Voice" below)
 If a previous run installed the native llama.cpp build / full ROCm /
@@ -88,28 +89,65 @@ the homepage container.
 ## Voice
-Wyoming-protocol speech servers (Home Assistant's voice stack). No web
-UI — these are TCP protocol servers consumable by HA Assist, Wyoming
-satellites, or any Wyoming client.
-- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) —
-  speech-to-text. Default model is `tiny-int8` (~75 MB). Models
-  download into `/srv/docker/whisper/data/` on first run.
-- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) —
-  text-to-speech. Default voice is `en_US-lessac-medium` (~63 MB).
-  Browse alternatives at <https://github.com/rhasspy/piper/blob/master/VOICES.md>.
-Bring-up:
+Two parallel voice stacks, each speaking a different protocol so they
+serve different clients without conflicting.
+### Wyoming (Home Assistant Assist)
+Wyoming-protocol speech servers. No web UI — TCP protocol servers
+consumable by HA Assist, Wyoming satellites, or any Wyoming client.
+- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — STT.
+  Default model `tiny-int8` (~75 MB). Models download into
+  `/srv/docker/whisper/data/` on first run.
+- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — TTS.
+  Default voice `en_US-lessac-medium` (~63 MB). Voice catalog at
+  <https://github.com/rhasspy/piper/blob/master/VOICES.md>.
+### OpenAI-compatible (OpenWebUI, Conduit, scripts)
+OpenAI-API-compatible servers — `/v1/audio/transcriptions` and
+`/v1/audio/speech`. Used by OpenWebUI's Audio settings (and through
+OpenWebUI, by the Conduit Android app).
+- **faster-whisper** (`/srv/docker/faster-whisper`,
+  http://framework:8001) — STT, model `large-v3-turbo` by default
+  (~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it
+  real-time. Built-in web UI at <http://framework:8001/>.
+- **Kokoro** (`/srv/docker/kokoro`, http://framework:8880) — TTS,
+  Kokoro-82M (~340 MB). Apache 2.0, much more natural than Piper.
+Bring-up (all four):
 ```sh
-cd /srv/docker/whisper && docker compose up -d
-cd /srv/docker/piper && docker compose up -d
+for svc in whisper piper faster-whisper kokoro; do
+  ( cd /srv/docker/$svc && docker compose up -d )
+done
 ```
-Both are conservative defaults sized for "small command-style usage";
-on Strix Halo's hardware budget you can run substantially better
-models. See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
-landscape and upgrade options.
+### Wiring OpenWebUI for voice
+OpenWebUI Admin → Settings → Audio:
+- **STT**: `OpenAI` engine, URL `http://faster-whisper:8000/v1`,
+  any non-empty API key, model `Systran/faster-whisper-large-v3-turbo`.
+- **TTS**: `OpenAI` engine, URL `http://kokoro:8880/v1`, any non-empty
+  API key, voice `af_bella` (or any from the Kokoro voices list).
+The hostnames `faster-whisper` and `kokoro` resolve only if OpenWebUI
+shares a user-defined Docker network with those containers (Docker's
+embedded DNS doesn't cover the default bridge); otherwise point it at
+`host.docker.internal:8001` and `:8880` instead.
+After saving, the microphone icon in OpenWebUI activates and a
+voice-call quick action appears. Conduit on Android (an OpenWebUI
+client) inherits this configuration automatically — no app-side voice
+setup.
+See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
+landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).
 ## Front door