Add OpenAI-compatible voice servers (faster-whisper + Kokoro)
Path B from VoiceModels.md: adds two new compose stacks alongside the
Wyoming pair so OpenWebUI/Conduit get voice without a Wyoming shim:

- compose/faster-whisper.yml: fedirz/faster-whisper-server CPU image,
  large-v3-turbo by default, OpenAI /v1/audio/transcriptions on :8001.
  Built-in web UI for ad-hoc transcription.
- compose/kokoro.yml: ghcr.io/remsky/kokoro-fastapi-cpu, Kokoro-82M,
  OpenAI /v1/audio/speech on :8880.

Both run alongside (not instead of) Wyoming Whisper + Piper: Wyoming
keeps serving HA Assist, OpenAI-API serves OpenWebUI / Conduit. Memory
budget on Strix Halo accommodates everything plus Qwen3-Coder loaded
concurrently with plenty of headroom.

Homepage gets dedicated tiles for both. README documents the OpenWebUI
Audio configuration that wires the new endpoints. Conduit inherits
voice via OpenWebUI without app-side setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -45,9 +45,11 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted.
 | `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
 | `4317` | Phoenix OTLP/gRPC | Trace ingestion |
 | `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
+| `8001` | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
 | `8090` | Beszel | Host + container + AMD GPU dashboard |
-| `10200` | Piper | TTS (Wyoming protocol) — used by Home Assistant Assist |
-| `10300` | Whisper | STT (Wyoming protocol) — used by Home Assistant Assist |
+| `8880` | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
+| `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
+| `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
 
 ## Quick start
 
@@ -163,6 +163,12 @@ weekend project (gives you voice in the existing browser UIs without
 forcing the Wyoming dependency on every consumer). Skip C/D until you
 have a concrete use case that demands them.
 
+**Status:** Path B is now live in pyinfra — `compose/faster-whisper.yml`
+and `compose/kokoro.yml` deploy alongside Wyoming. Wire OpenWebUI's
+Audio settings at them per the "Wiring OpenWebUI for voice" section in
+[`pyinfra/framework/README.md`](pyinfra/framework/README.md). The
+Conduit Android app inherits voice transitively through OpenWebUI.
+
 ## Reference catalogs
 
 - [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
@@ -42,14 +42,15 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
 Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
 and runs `update-grub`, but won't reboot for you.
 - `/models/<vendor>/` layout
-- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper}/docker-compose.yml`
+- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml`
 dropped in, not auto-started — you edit the model path then
 `docker compose up -d`
 (OpenWebUI needs no edits — it's pre-configured to find Ollama at
 `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
 (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
-OpenHands UI at :3030, Homepage at :7575, Whisper STT at :10300,
-Piper TTS at :10200 — see "Monitoring stack", "Agent harnesses",
+OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300,
+Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro
+OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses",
 "Front door", and "Voice" below)
 
 If a previous run installed the native llama.cpp build / full ROCm /
@@ -88,28 +89,65 @@ the homepage container.
 
 ## Voice
 
-Wyoming-protocol speech servers (Home Assistant's voice stack). No web
-UI — these are TCP protocol servers consumable by HA Assist, Wyoming
-satellites, or any Wyoming client.
+Two parallel voice stacks, each speaking a different protocol so they
+serve different clients without conflicting.
 
-- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) —
-speech-to-text. Default model is `tiny-int8` (~75 MB). Models
-download into `/srv/docker/whisper/data/` on first run.
+### Wyoming (Home Assistant Assist)
 
-- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) —
-text-to-speech. Default voice is `en_US-lessac-medium` (~63 MB).
-Browse alternatives at <https://github.com/rhasspy/piper/blob/master/VOICES.md>.
+Wyoming-protocol speech servers. No web UI — TCP protocol servers
+consumable by HA Assist, Wyoming satellites, or any Wyoming client.
 
-Bring-up:
+- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — STT.
+Default model `tiny-int8` (~75 MB). Models download into
+`/srv/docker/whisper/data/` on first run.
 
+- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — TTS.
+Default voice `en_US-lessac-medium` (~63 MB). Voice catalog at
+<https://github.com/rhasspy/piper/blob/master/VOICES.md>.
+
+### OpenAI-compatible (OpenWebUI, Conduit, scripts)
+
+OpenAI-API-compatible servers — `/v1/audio/transcriptions` and
+`/v1/audio/speech`. Used by OpenWebUI's Audio settings (and through
+OpenWebUI, by the Conduit Android app).
+
+- **faster-whisper** (`/srv/docker/faster-whisper`,
+http://framework:8001) — STT, model `large-v3-turbo` by default
+(~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it
+real-time. Built-in web UI at <http://framework:8001/>.
+
+- **Kokoro** (`/srv/docker/kokoro`, http://framework:8880) — TTS,
+Kokoro-82M (~340 MB). Apache 2.0, much more natural than Piper.
+
+Bring-up (all four):
 ```sh
-cd /srv/docker/whisper && docker compose up -d
-cd /srv/docker/piper && docker compose up -d
+for svc in whisper piper faster-whisper kokoro; do
+  ( cd /srv/docker/$svc && docker compose up -d )
+done
 ```
 
-Both are conservative defaults sized for "small command-style usage";
-on Strix Halo's hardware budget you can run substantially better
-models. See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
-landscape and upgrade options.
+### Wiring OpenWebUI for voice
+
+OpenWebUI Admin → Settings → Audio:
+
+- **STT**: `OpenAI` engine, URL `http://faster-whisper:8000/v1`,
+any non-empty API key, model `Systran/faster-whisper-large-v3-turbo`.
+- **TTS**: `OpenAI` engine, URL `http://kokoro:8880/v1`, any non-empty
+API key, voice `af_bella` (or any from the Kokoro voices list).
+
+The container names `faster-whisper` and `kokoro` resolve only when
+OpenWebUI is attached to the same user-defined Docker network as these
+containers. Docker's default bridge has no container-name DNS, and each
+Compose project gets its own network by default, so otherwise use
+`http://host.docker.internal:8001/v1` and `http://host.docker.internal:8880/v1`.
+
+After saving, the microphone icon in OpenWebUI activates and a
+voice-call quick action appears. Conduit on Android (an OpenWebUI
+client) inherits this configuration automatically — no app-side voice
+setup.
+
+See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
+landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).
+
 ## Front door
 
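The two URL forms in the wiring notes differ in port as well as hostname, because a Compose `ports:` entry maps a host port to a container port. The sketch below derives both forms from a short-form mapping string. The names `PortMapping`, `parse_port_mapping`, and `base_urls` are hypothetical helpers for illustration, not part of this commit, and only the simple `HOST:CONTAINER` form is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortMapping:
    """Host and container side of a short-form Compose port entry."""

    host_port: int
    container_port: int


def parse_port_mapping(spec: str) -> PortMapping:
    """Parse "HOST:CONTAINER" (for example "8001:8000").

    Assumption: no bind address or protocol suffix, which is all the
    compose files in this commit use.
    """
    host, container = spec.split(":")
    return PortMapping(int(host), int(container))


def base_urls(service: str, spec: str, host: str = "host.docker.internal") -> dict[str, str]:
    """Return the base URL as seen by container name and as seen via the host."""
    mapping = parse_port_mapping(spec)
    return {
        # Container-to-container traffic bypasses the published port,
        # so it targets the container-side port.
        "by_container_name": f"http://{service}:{mapping.container_port}/v1",
        # Traffic routed through the host uses the published host port.
        "via_host": f"http://{host}:{mapping.host_port}/v1",
    }
```

For faster-whisper this gives `http://faster-whisper:8000/v1` by container name and `http://host.docker.internal:8001/v1` via the host. For Kokoro both sides use 8880.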
30
pyinfra/framework/compose/faster-whisper.yml
Normal file
@@ -0,0 +1,30 @@
+# faster-whisper-server — OpenAI-compatible STT.
+# https://github.com/fedirz/faster-whisper-server
+#
+# Speaks `/v1/audio/transcriptions` (and `/v1/audio/translations`) so any
+# client that talks to OpenAI's audio API works without changes —
+# OpenWebUI, Conduit (via OpenWebUI), arbitrary scripts.
+#
+# Runs alongside (not instead of) Wyoming Whisper. Wyoming stays for
+# Home Assistant Assist; this server is for OpenAI-API consumers.
+#
+# CPU mode: Strix Halo's 16 Zen 5 cores comfortably real-time even on
+# large-v3-turbo. CTranslate2's ROCm support for gfx1151 is unreliable;
+# CPU sidesteps that.
+services:
+  faster-whisper:
+    image: fedirz/faster-whisper-server:latest-cpu
+    container_name: faster-whisper
+    restart: unless-stopped
+    ports:
+      - "8001:8000"
+    environment:
+      # Default model loaded on first request. Auto-downloads on use.
+      WHISPER__MODEL: Systran/faster-whisper-large-v3-turbo
+      WHISPER__INFERENCE_DEVICE: cpu
+      WHISPER__COMPUTE_TYPE: int8
+      # Built-in web UI at /
+      ENABLE_UI: "true"
+    volumes:
+      # Persist model downloads across container recreates.
+      - /srv/docker/faster-whisper/cache:/root/.cache/huggingface
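A script can exercise the `/v1/audio/transcriptions` endpoint with nothing but the standard library. The following is a minimal sketch that only builds the multipart request. The helper name `build_transcription_request` and the file name `sample.wav` are hypothetical, the `framework:8001` host is taken from the README, and the `file` and `model` form fields follow the OpenAI audio API that this server is described as compatible with.

```python
import urllib.request
import uuid


def build_transcription_request(
    audio_bytes: bytes,
    filename: str = "sample.wav",  # hypothetical file name
    base_url: str = "http://framework:8001/v1",  # host port from the compose file
    model: str = "Systran/faster-whisper-large-v3-turbo",
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for the transcription endpoint."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio_bytes + b"\r\n" + closing
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. On an OpenAI-compatible server the default response is expected to be JSON with a `text` field, though that is worth confirming against the running container.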
@@ -74,20 +74,35 @@
         container: phoenix
 
 - Voice:
-    # Wyoming-protocol services have no web UI; tiles are informational
-    # (container status + port). Click-through goes nowhere meaningful.
-    - Whisper:
+    # Wyoming-protocol services have no web UI; tiles are informational.
+    # The OpenAI-compatible servers (faster-whisper, Kokoro) have UIs /
+    # APIs you can hit directly.
+    - Whisper (Wyoming):
         icon: mdi-microphone-message
-        description: Speech-to-text (Wyoming :10300)
+        description: STT for Home Assistant Assist (Wyoming :10300)
         server: localhost-docker
         container: wyoming-whisper
 
-    - Piper:
+    - Piper (Wyoming):
         icon: mdi-account-voice
-        description: Text-to-speech (Wyoming :10200)
+        description: TTS for Home Assistant Assist (Wyoming :10200)
         server: localhost-docker
         container: wyoming-piper
 
+    - faster-whisper:
+        icon: mdi-microphone
+        href: http://framework:8001
+        description: STT (OpenAI API) — large-v3-turbo, used by OpenWebUI/Conduit
+        server: localhost-docker
+        container: faster-whisper
+
+    - Kokoro:
+        icon: mdi-account-music
+        href: http://framework:8880/web
+        description: TTS (OpenAI API) — Kokoro-82M, used by OpenWebUI/Conduit
+        server: localhost-docker
+        container: kokoro
+
 - External:
     - SearXNG:
         icon: searxng.svg
20
pyinfra/framework/compose/kokoro.yml
Normal file
@@ -0,0 +1,20 @@
+# Kokoro-FastAPI — OpenAI-compatible TTS in front of Kokoro-82M.
+# https://github.com/remsky/Kokoro-FastAPI
+#
+# Speaks `/v1/audio/speech`. Pair with faster-whisper-server for a full
+# OpenAI-compatible voice loop driving OpenWebUI / Conduit.
+#
+# Kokoro-82M (hexgrad, Jan 2025) is small (~340 MB) but produces
+# noticeably more natural prosody than Piper. Apache 2.0 licence.
+# CPU image is plenty fast for this model size on Strix Halo.
+services:
+  kokoro:
+    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
+    container_name: kokoro
+    restart: unless-stopped
+    ports:
+      # 8880 is Kokoro-FastAPI's own default — keeping the same on the
+      # host side so docs/tutorials line up.
+      - "8880:8880"
+    volumes:
+      - /srv/docker/kokoro/models:/app/api/src/models
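The `/v1/audio/speech` endpoint takes a JSON body, so a smoke test from a script is short. This sketch only constructs the request. The helper name `build_speech_request` is hypothetical, the voice `af_bella` comes from the README, and the model identifier `"kokoro"` is an assumption about what Kokoro-FastAPI accepts that should be checked against the running server.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str = "af_bella",  # voice named in the README wiring notes
    base_url: str = "http://framework:8880/v1",  # host port from the compose file
    model: str = "kokoro",  # assumption: verify the accepted model name
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for the OpenAI-style speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen` and writing the response bytes to a file should yield playable audio if the server follows the OpenAI contract. The container format depends on the server's default `response_format`.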
@@ -341,6 +341,8 @@ for svc in (
     "homepage",
     "whisper",
     "piper",
+    "faster-whisper",
+    "kokoro",
 ):
     files.directory(
         name=f"compose/{svc} dir",
@@ -490,6 +492,23 @@ files.directory(
     _sudo=True,
 )
 
+# OpenAI-compatible voice servers — alternative path to Wyoming, used by
+# OpenWebUI's Audio settings (and through it, Conduit on Android).
+files.directory(
+    name="faster-whisper cache dir",
+    path=f"{COMPOSE_DIR}/faster-whisper/cache",
+    group="docker",
+    mode="2775",
+    _sudo=True,
+)
+files.directory(
+    name="Kokoro models dir",
+    path=f"{COMPOSE_DIR}/kokoro/models",
+    group="docker",
+    mode="2775",
+    _sudo=True,
+)
+
 # --- Cleanup of artifacts from the prior native-build deploy ----------------
 # All idempotent — `present=False` is a no-op when the target is absent.
 