diff --git a/README.md b/README.md
index 0308bae..18771b4 100644
--- a/README.md
+++ b/README.md
@@ -45,9 +45,11 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted.
 | `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
 | `4317` | Phoenix OTLP/gRPC | Trace ingestion |
 | `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
+| `8001` | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
 | `8090` | Beszel | Host + container + AMD GPU dashboard |
-| `10200` | Piper | TTS (Wyoming protocol) — used by Home Assistant Assist |
-| `10300` | Whisper | STT (Wyoming protocol) — used by Home Assistant Assist |
+| `8880` | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
+| `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
+| `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
 
 ## Quick start
diff --git a/VoiceModels.md b/VoiceModels.md
index 39caa0d..dc5a715 100644
--- a/VoiceModels.md
+++ b/VoiceModels.md
@@ -163,6 +163,12 @@ weekend project (gives you voice in the existing browser UIs without
 forcing the Wyoming dependency on every consumer). Skip C/D until you
 have a concrete use case that demands them.
+
+**Status:** Path B is now live in pyinfra — `compose/faster-whisper.yml`
+and `compose/kokoro.yml` deploy alongside Wyoming. Point OpenWebUI's
+Audio settings at them per the "Wiring OpenWebUI for voice" section in
+[`pyinfra/framework/README.md`](pyinfra/framework/README.md). The
+Conduit Android app inherits voice transitively through OpenWebUI.
+
 ## Reference catalogs
 
 - [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
diff --git a/pyinfra/framework/README.md b/pyinfra/framework/README.md
index 8528c22..5f133ac 100644
--- a/pyinfra/framework/README.md
+++ b/pyinfra/framework/README.md
@@ -42,14 +42,15 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
   Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
   and runs `update-grub`, but won't reboot for you.
 - `/models//` layout
-- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper}/docker-compose.yml`
+- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper,faster-whisper,kokoro}/docker-compose.yml`
   dropped in, not auto-started — you edit the model path then
   `docker compose up -d` (OpenWebUI needs no edits — it's
   pre-configured to find Ollama at `host.docker.internal:11434` and
   uses `searxng.n0n.io` for web search)
   (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
-  OpenHands UI at :3030, Homepage at :7575, Whisper STT at :10300,
-  Piper TTS at :10200 — see "Monitoring stack", "Agent harnesses",
+  OpenHands UI at :3030, Homepage at :7575, Whisper Wyoming :10300,
+  Piper Wyoming :10200, faster-whisper OpenAI-API :8001, Kokoro
+  OpenAI-API :8880 — see "Monitoring stack", "Agent harnesses",
   "Front door", and "Voice" below)
 
 If a previous run installed the native llama.cpp build / full ROCm /
@@ -88,28 +89,65 @@ the homepage container.
 
 ## Voice
 
-Wyoming-protocol speech servers (Home Assistant's voice stack). No web
-UI — these are TCP protocol servers consumable by HA Assist, Wyoming
-satellites, or any Wyoming client.
+Two parallel voice stacks, each speaking a different protocol so they
+serve different clients without conflicting.
 
-- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) —
-  speech-to-text.
-  Default model is `tiny-int8` (~75 MB). Models
-  download into `/srv/docker/whisper/data/` on first run.
+
+### Wyoming (Home Assistant Assist)
 
-- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) —
-  text-to-speech. Default voice is `en_US-lessac-medium` (~63 MB).
-  Browse alternatives at <https://rhasspy.github.io/piper-samples/>.
+Wyoming-protocol speech servers. No web UI — TCP protocol servers
+consumable by HA Assist, Wyoming satellites, or any Wyoming client.
 
-Bring-up:
+- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — STT.
+  Default model `tiny-int8` (~75 MB). Models download into
+  `/srv/docker/whisper/data/` on first run.
+
+- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — TTS.
+  Default voice `en_US-lessac-medium` (~63 MB). Voice catalog at
+  <https://rhasspy.github.io/piper-samples/>.
+
+### OpenAI-compatible (OpenWebUI, Conduit, scripts)
+
+OpenAI-API-compatible servers — `/v1/audio/transcriptions` and
+`/v1/audio/speech`. Used by OpenWebUI's Audio settings (and, through
+OpenWebUI, by the Conduit Android app).
+
+- **faster-whisper** (`/srv/docker/faster-whisper`,
+  http://framework:8001) — STT, model `large-v3-turbo` by default
+  (~810 MB). CPU-mode container; Strix Halo's 16 Zen 5 cores keep it
+  real-time. Built-in web UI at <http://framework:8001>.
+
+- **Kokoro** (`/srv/docker/kokoro`, http://framework:8880) — TTS,
+  Kokoro-82M (~340 MB). Apache 2.0, much more natural than Piper.
+
+Bring-up (all four):
 
 ```sh
-cd /srv/docker/whisper && docker compose up -d
-cd /srv/docker/piper && docker compose up -d
+for svc in whisper piper faster-whisper kokoro; do
+  ( cd /srv/docker/$svc && docker compose up -d )
+done
 ```
 
-Both are conservative defaults sized for "small command-style usage";
-on Strix Halo's hardware budget you can run substantially better
-models. See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
-landscape and upgrade options.
+### Wiring OpenWebUI for voice
+
+OpenWebUI Admin → Settings → Audio:
+
+- **STT**: `OpenAI` engine, URL `http://faster-whisper:8000/v1`,
+  any non-empty API key, model `Systran/faster-whisper-large-v3-turbo`.
+- **TTS**: `OpenAI` engine, URL `http://kokoro:8880/v1`, any non-empty
+  API key, voice `af_bella` (or any from the Kokoro voices list).
+
+The hostnames `faster-whisper` and `kokoro` resolve only when OpenWebUI
+shares a user-defined Docker network with those containers; Docker's
+default bridge provides no name-based DNS. If OpenWebUI can't resolve
+them, point at the published host ports instead:
+`host.docker.internal:8001` (STT) and `host.docker.internal:8880` (TTS).
+
+After saving, the microphone icon in OpenWebUI activates and a
+voice-call quick action appears. Conduit on Android (an OpenWebUI
+client) inherits this configuration automatically — no app-side voice
+setup.
+
+See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
+landscape and further upgrade options (Sesame CSM, F5-TTS, etc.).
 
 ## Front door
diff --git a/pyinfra/framework/compose/faster-whisper.yml b/pyinfra/framework/compose/faster-whisper.yml
new file mode 100644
index 0000000..ef287d5
--- /dev/null
+++ b/pyinfra/framework/compose/faster-whisper.yml
@@ -0,0 +1,30 @@
+# faster-whisper-server — OpenAI-compatible STT.
+# https://github.com/fedirz/faster-whisper-server
+#
+# Speaks `/v1/audio/transcriptions` (and `/v1/audio/translations`) so any
+# client that talks to OpenAI's audio API works without changes —
+# OpenWebUI, Conduit (via OpenWebUI), arbitrary scripts.
+#
+# Runs alongside (not instead of) Wyoming Whisper. Wyoming stays for
+# Home Assistant Assist; this server is for OpenAI-API consumers.
+#
+# CPU mode: Strix Halo's 16 Zen 5 cores stay comfortably real-time even
+# on large-v3-turbo. CTranslate2's ROCm support for gfx1151 is
+# unreliable; CPU sidesteps that.
+services:
+  faster-whisper:
+    image: fedirz/faster-whisper-server:latest-cpu
+    container_name: faster-whisper
+    restart: unless-stopped
+    ports:
+      - "8001:8000"
+    environment:
+      # Default model loaded on first request. Auto-downloads on use.
+      WHISPER__MODEL: Systran/faster-whisper-large-v3-turbo
+      WHISPER__INFERENCE_DEVICE: cpu
+      WHISPER__COMPUTE_TYPE: int8
+      # Built-in web UI at /
+      ENABLE_UI: "true"
+    volumes:
+      # Persist model downloads across container recreates.
+      - /srv/docker/faster-whisper/cache:/root/.cache/huggingface
diff --git a/pyinfra/framework/compose/homepage/services.yaml b/pyinfra/framework/compose/homepage/services.yaml
index f67d156..f2e510f 100644
--- a/pyinfra/framework/compose/homepage/services.yaml
+++ b/pyinfra/framework/compose/homepage/services.yaml
@@ -74,20 +74,35 @@
         container: phoenix
 
 - Voice:
-    # Wyoming-protocol services have no web UI; tiles are informational
-    # (container status + port). Click-through goes nowhere meaningful.
-    - Whisper:
+    # Wyoming-protocol services have no web UI; tiles are informational.
+    # The OpenAI-compatible servers (faster-whisper, Kokoro) have UIs /
+    # APIs you can hit directly.
+    - Whisper (Wyoming):
         icon: mdi-microphone-message
-        description: Speech-to-text (Wyoming :10300)
+        description: STT for Home Assistant Assist (Wyoming :10300)
         server: localhost-docker
         container: wyoming-whisper
 
-    - Piper:
+    - Piper (Wyoming):
         icon: mdi-account-voice
-        description: Text-to-speech (Wyoming :10200)
+        description: TTS for Home Assistant Assist (Wyoming :10200)
         server: localhost-docker
         container: wyoming-piper
 
+    - faster-whisper:
+        icon: mdi-microphone
+        href: http://framework:8001
+        description: STT (OpenAI API) — large-v3-turbo, used by OpenWebUI/Conduit
+        server: localhost-docker
+        container: faster-whisper
+
+    - Kokoro:
+        icon: mdi-account-music
+        href: http://framework:8880/web
+        description: TTS (OpenAI API) — Kokoro-82M, used by OpenWebUI/Conduit
+        server: localhost-docker
+        container: kokoro
+
 - External:
     - SearXNG:
         icon: searxng.svg
diff --git a/pyinfra/framework/compose/kokoro.yml b/pyinfra/framework/compose/kokoro.yml
new file mode 100644
index 0000000..1118fce
--- /dev/null
+++ b/pyinfra/framework/compose/kokoro.yml
@@ -0,0 +1,20 @@
+# Kokoro-FastAPI — OpenAI-compatible TTS in front of Kokoro-82M.
+# https://github.com/remsky/Kokoro-FastAPI
+#
+# Speaks `/v1/audio/speech`. Pair with faster-whisper-server for a full
+# OpenAI-compatible voice loop driving OpenWebUI / Conduit.
+#
+# Kokoro-82M (hexgrad, Jan 2025) is small (~340 MB) but produces
+# noticeably more natural prosody than Piper. Apache 2.0 licence.
+# The CPU image is plenty fast for this model size on Strix Halo.
+services:
+  kokoro:
+    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
+    container_name: kokoro
+    restart: unless-stopped
+    ports:
+      # 8880 is Kokoro-FastAPI's own default — keeping the same on the
+      # host side so docs/tutorials line up.
+      - "8880:8880"
+    volumes:
+      - /srv/docker/kokoro/models:/app/api/src/models
diff --git a/pyinfra/framework/deploy.py b/pyinfra/framework/deploy.py
index e249b45..f2fe7eb 100644
--- a/pyinfra/framework/deploy.py
+++ b/pyinfra/framework/deploy.py
@@ -341,6 +341,8 @@ for svc in (
     "homepage",
     "whisper",
     "piper",
+    "faster-whisper",
+    "kokoro",
 ):
     files.directory(
         name=f"compose/{svc} dir",
@@ -490,6 +492,23 @@ files.directory(
     _sudo=True,
 )
 
+# OpenAI-compatible voice servers — alternative path to Wyoming, used by
+# OpenWebUI's Audio settings (and through it, Conduit on Android).
+files.directory(
+    name="faster-whisper cache dir",
+    path=f"{COMPOSE_DIR}/faster-whisper/cache",
+    group="docker",
+    mode="2775",
+    _sudo=True,
+)
+files.directory(
+    name="Kokoro models dir",
+    path=f"{COMPOSE_DIR}/kokoro/models",
+    group="docker",
+    mode="2775",
+    _sudo=True,
+)
+
 # --- Cleanup of artifacts from the prior native-build deploy ----------------
 # All idempotent — `present=False` is a no-op when the target is absent.
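
Outside the diff itself, the new TTS endpoint is easy to smoke-test once the containers are up. A minimal stdlib-only sketch: the `framework` hostname, port `8880`, and the `af_bella` voice simply mirror the compose files and OpenWebUI wiring above, while `speech_payload` and `tts_request` are hypothetical helpers, not part of Kokoro-FastAPI.

```python
import json
import urllib.request


def speech_payload(text: str, voice: str = "af_bella") -> bytes:
    """JSON body in the OpenAI /v1/audio/speech shape."""
    return json.dumps({"model": "kokoro", "input": text, "voice": voice}).encode()


def tts_request(text: str, base: str = "http://framework:8880") -> urllib.request.Request:
    """Build (but don't send) the request, so the sketch runs without the server."""
    return urllib.request.Request(
        f"{base}/v1/audio/speech",
        data=speech_payload(text),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",  # any non-empty key is accepted
        },
        method="POST",
    )


if __name__ == "__main__":
    # Sends the request only when run directly, with Kokoro reachable.
    req = tts_request("Hello from Strix Halo.")
    with urllib.request.urlopen(req) as resp, open("hello.wav", "wb") as out:
        out.write(resp.read())
```

The same pattern covers STT, except `/v1/audio/transcriptions` expects a multipart file upload rather than JSON.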