diff --git a/README.md b/README.md index e80ff39..0308bae 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted. | [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. | | [`TODO.md`](TODO.md) | Open questions and follow-ups. | | [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. | -| `piper-compose.yaml`, `whisper-compose.yaml` | Voice (TTS / STT) compose files; not yet wired into the pyinfra deploy. | +| [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. | ## Service ports (on `framework`) @@ -46,6 +46,8 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted. | `4317` | Phoenix OTLP/gRPC | Trace ingestion | | `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) | | `8090` | Beszel | Host + container + AMD GPU dashboard | +| `10200` | Piper | TTS (Wyoming protocol) — used by Home Assistant Assist | +| `10300` | Whisper | STT (Wyoming protocol) — used by Home Assistant Assist | ## Quick start diff --git a/VoiceModels.md b/VoiceModels.md new file mode 100644 index 0000000..39caa0d --- /dev/null +++ b/VoiceModels.md @@ -0,0 +1,174 @@ +# Voice models — landscape and upgrade paths + +State of the OSS speech stack as of May 2026, with concrete upgrade +options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s +bandwidth — plenty of headroom alongside a 30 GB code model). + +## What's deployed today + +- **Whisper** (`tcp://framework:10300`) — `rhasspy/wyoming-whisper`, + Wyoming protocol, model `tiny-int8` (~75 MB). Picked for HA defaults, + not quality. +- **Piper** (`tcp://framework:10200`) — `rhasspy/wyoming-piper`, + Wyoming protocol, voice `en_US-lessac-medium` (~63 MB, VITS-based). + +These are **convenient HA defaults**, not state-of-the-art. 
The Wyoming +protocol is from the Rhasspy / Home Assistant ecosystem; it's good for +HA Assist integration but doesn't expose an OpenAI-compatible HTTP API, +so direct integration with OpenWebUI / OpenCode / agentic stacks +requires a wrapper. + +## Speech-to-text (STT) landscape + +| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes | +|---|---|---|---|---|---|---| +| **whisper tiny** | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. | +| **whisper small** | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. | +| **whisper large-v3** | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. | +| **whisper large-v3-turbo** | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | **Best Whisper for this box.** Quality of large-v3 at small-model speeds. | +| **faster-whisper** (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimpl. Drop-in via `fedirz/faster-whisper-server` (OpenAI-compatible). | +| **distil-whisper** | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. | +| **WhisperX** | m-bain (2024) | + Whisper | 5 + alignment | slow | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. | +| **Parakeet-TDT-0.6B** | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. | +| **Canary-1B** | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-NC-4.0 (non-commercial) | Multilingual; license is the catch. | +| **Moonshine** | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. | + +**Recommendation for this box:** `whisper-large-v3-turbo` via +`faster-whisper-server`. Best quality/speed/ecosystem balance, OpenAI +API compat lets non-Wyoming clients use it directly.
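+A sketch of what the OpenAI-API compatibility buys: once such a server is
+up (Path B below stands one up on port 8001 — that port, the endpoint
+path, and the model name here are assumptions from that plan, nothing
+running today), any stdlib-only script can transcribe audio:
+
+```python
+import io
+import json
+import urllib.request
+
+
+def multipart_body(boundary: str, model: str, filename: str, audio: bytes) -> bytes:
+    """Build a multipart/form-data body with `model` and `file` fields —
+    the shape the OpenAI-style transcription endpoint expects."""
+    buf = io.BytesIO()
+    buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
+              f'name="model"\r\n\r\n{model}\r\n'.encode())
+    buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
+              f'name="file"; filename="{filename}"\r\n'
+              f'Content-Type: audio/wav\r\n\r\n'.encode())
+    buf.write(audio)
+    buf.write(f"\r\n--{boundary}--\r\n".encode())
+    return buf.getvalue()
+
+
+def transcribe(path: str, base: str = "http://framework:8001/v1") -> str:
+    """POST a WAV file to /audio/transcriptions, return the transcript text."""
+    boundary = "voicestackboundary"  # any token not appearing in the payload
+    with open(path, "rb") as f:
+        body = multipart_body(boundary, "large-v3-turbo", "audio.wav", f.read())
+    req = urllib.request.Request(
+        f"{base}/audio/transcriptions",
+        data=body,
+        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
+    )
+    with urllib.request.urlopen(req) as resp:
+        return json.load(resp)["text"]
+
+
+# Usage (once the server exists): transcribe("/tmp/meeting.wav")
+```
+
+The same request shape is what OpenWebUI sends when pointed at the STT
+URL, which is why one server covers both browser and script consumers.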
+ +## Text-to-speech (TTS) landscape + +| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes | +|---|---|---|---|---|---|---| +| **Piper** (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly, prosody is okay but recognizably synthetic. Default in HA Assist. | +| **Kokoro-82M** | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | **Sweet spot for this box.** Tiny model, surprisingly natural. OpenAI-API server: `Kokoro-FastAPI`. | +| **Sesame CSM-1B** | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. | +| **F5-TTS** | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice cloning state-of-the-art for OSS. Non-commercial license. | +| **XTTS-v2** | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. | +| **MeloTTS** | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. | +| **StyleTTS2** | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. | +| **OpenVoice v2** | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. | + +**Recommendation for this box:** **Kokoro** for general TTS (fast, +high quality, permissive license), **Sesame CSM** layered alongside +for conversational use cases. + +Commercial that nothing OSS reliably beats yet: **ElevenLabs** (general +quality / voice cloning), **Cartesia Sonic** (real-time streaming +latency). + +## Wyoming vs OpenAI-compatible servers + +The protocol choice matters depending on what you want to drive the +voice stack with. 
+ +| Consumer | Wyoming | OpenAI-compatible | +|---|---|---| +| Home Assistant Assist | ✅ native | possible via custom integration | +| Wyoming satellite hardware (ESP32-S3-Box etc.) | ✅ native | ❌ | +| OpenWebUI Voice (built-in) | ❌ | ✅ | +| OpenCode / OpenHands voice add-ons | needs wrapper | ✅ direct | +| Custom Python/Node scripts | possible | trivial | + +If you only ever want HA voice assistant, Wyoming is fine. If you also +want voice in OpenWebUI or to script "hey local model, summarize this +audio file" outside HA, **add OpenAI-compatible servers alongside** — +they don't conflict (different ports). + +## Use cases for this stack + +In rough order of reach for a single-user homelab: + +1. **Home Assistant voice assistant.** Point HA's Assist pipeline at + Whisper:10300 and Piper:10200, add `wyoming-openwakeword`, plug a + Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). Self-hosted + Alexa replacement that talks to your local LLM. + +2. **Voice OpenWebUI.** With OpenAI-compatible STT/TTS servers, you can + speak to Qwen3-Coder in the browser and hear the response. OpenWebUI + has the hooks; Wyoming alone doesn't, hence the wrapper need. + +3. **Standalone transcription.** Drop a meeting recording on the box, + get a transcript out. Pair Whisper with WhisperX for diarization + ("who said what") and timestamps. Useful for podcast transcription, + interview indexing, lecture notes. + +4. **TTS notifications and announcements.** Pipe automation events + through Piper (or Kokoro). Voice notifications throughout the house + if you have HA + speakers. + +5. **Voice agent loops.** Mic input → Whisper → local LLM → Piper / + Kokoro → speakers. The "talk to your local assistant" loop that + neither Claude Code nor ChatGPT Code Interpreter natively does. + Roadmap.md item #8. + +6. **Meeting bots.** With Whisper + a custom integration, drop a bot + into Zoom/Google Meet, get a real-time transcript and post-meeting + summary from the local LLM. 
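+
+The agent loop in item 5 is three hops and needs no framework. A minimal
+sketch — the transports are stubbed out with lambdas; the real `stt` /
+`llm` / `tts` callables would wrap whatever HTTP or Wyoming clients you
+deploy, so everything below the type signatures is illustrative:
+
+```python
+from typing import Callable
+
+
+def voice_turn(
+    audio: bytes,
+    stt: Callable[[bytes], str],   # e.g. POST to the Whisper server
+    llm: Callable[[str], str],     # e.g. /v1/chat/completions
+    tts: Callable[[str], bytes],   # e.g. /v1/audio/speech
+) -> bytes:
+    """One mic-to-speaker turn: transcribe, think, speak."""
+    text = stt(audio)
+    reply = llm(text)
+    return tts(reply)
+
+
+# Stubs show the shape; swap in real transports later:
+audio_out = voice_turn(
+    b"<wav bytes>",
+    stt=lambda a: "turn off the lights",
+    llm=lambda t: f"Okay: {t}.",
+    tts=lambda s: s.encode(),
+)
+print(audio_out)  # b'Okay: turn off the lights.'
+```
+
+Keeping the loop as three swappable callables is also what makes the
+upgrade paths below cheap: bumping Piper to Kokoro changes one function.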
+ +## Concrete upgrade paths + +### Path A — Stay Wyoming, bump model sizes (lowest effort) + +Edit the existing compose files to upgrade the models: + +- `compose/whisper.yml`: `--model tiny-int8` → `--model large-v3-turbo` + (or `medium-int8` for a middle step). Cost: ~810 MB / 1.5 GB on disk, + significant quality jump on dictation. +- `compose/piper.yml`: `en_US-lessac-medium` → another voice. Browse at + . + +`./run.sh` from the Mac, recreate the containers on the box. Same +Wyoming protocol, better quality. + +### Path B — Add OpenAI-compatible servers alongside (mid effort) + +Stand up `faster-whisper-server` + `Kokoro-FastAPI` as additional +compose stacks (different ports). Wyoming stays for HA, OpenAI-API +servers serve OpenWebUI / OpenCode / scripts. Both pull from the same +model files if you want to avoid duplication. + +Sample compose entries to add: +- `compose/faster-whisper.yml` — `fedirz/faster-whisper-server`, + port 8001 with `/v1/audio/transcriptions` endpoint. +- `compose/kokoro.yml` — `ghcr.io/remsky/kokoro-fastapi-cpu` (or `-gpu`), + port 8002 with `/v1/audio/speech`. + +Then OpenWebUI Settings → Audio → STT URL `http://framework:8001/v1`, +TTS URL `http://framework:8002/v1`, voice `af_bella` (Kokoro default). + +### Path C — Full SOTA conversational voice (higher effort) + +For "talk to my agent" with the best prosody available in OSS, run +**Sesame CSM-1B** as the TTS instead of Kokoro/Piper. Heavier +(~3-4 GB GPU at inference, ~5 GB on disk) but conversational quality +is markedly better. No first-class OpenAI-API wrapper exists yet — +expect a custom HTTP server or use the reference inference scripts. + +Pair with `whisper-large-v3-turbo` on STT for the matching jump in +input quality. + +### Path D — Voice cloning (specialized) + +If you want a custom voice, **F5-TTS** or **OpenVoice v2** clone from +a 5–10 second reference clip. F5 has the best quality, OpenVoice has +the better license (MIT). 
Self-hosted; no first-party HTTP server — +script-driven for now. + +## What I'd actually do + +Path A as the immediate quick win (5 minutes), Path B as the next +weekend project (gives you voice in the existing browser UIs without +forcing the Wyoming dependency on every consumer). Skip C/D until you +have a concrete use case that demands them. + +## Reference catalogs + +- [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS +- [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) — STT WER benchmarks +- [Awesome STT / TTS](https://github.com/topics/text-to-speech) on GitHub +- [Rhasspy / Wyoming](https://github.com/rhasspy) ecosystem — Home Assistant voice +- [Kokoro repo](https://github.com/hexgrad/kokoro) and [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) +- [Sesame CSM-1B](https://github.com/SesameAILabs/csm) +- [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) diff --git a/piper-compose.yaml b/piper-compose.yaml deleted file mode 100644 index 37bc515..0000000 --- a/piper-compose.yaml +++ /dev/null @@ -1,12 +0,0 @@ -services: - piper: - image: rhasspy/wyoming-piper:latest - container_name: wyoming-piper - restart: unless-stopped - ports: - - "10200:10200" - volumes: - - ./piper-data:/data - command: - - --voice - - en_US-lessac-medium diff --git a/pyinfra/framework/README.md b/pyinfra/framework/README.md index df3bc97..8528c22 100644 --- a/pyinfra/framework/README.md +++ b/pyinfra/framework/README.md @@ -42,14 +42,15 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`. Requires a reboot to activate — pyinfra rewrites `/etc/default/grub` and runs `update-grub`, but won't reboot for you. 
- `/models//` layout -- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}/docker-compose.yml` +- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper}/docker-compose.yml` dropped in, not auto-started — you edit the model path then `docker compose up -d` (OpenWebUI needs no edits — it's pre-configured to find Ollama at `host.docker.internal:11434` and uses `searxng.n0n.io` for web search) (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006, - OpenHands UI at :3030, Homepage at :7575 — see "Monitoring stack", - "Agent harnesses", and "Front door" below) + OpenHands UI at :3030, Homepage at :7575, Whisper STT at :10300, + Piper TTS at :10200 — see "Monitoring stack", "Agent harnesses", + "Front door", and "Voice" below) If a previous run installed the native llama.cpp build / full ROCm / native Ollama, those are auto-cleaned the next time `./run.sh` runs. @@ -85,6 +86,31 @@ Compose images in (`services.yaml`, `settings.yaml`, etc.); edit there, `./run.sh`, restart the homepage container. +## Voice + +Wyoming-protocol speech servers (Home Assistant's voice stack). No web +UI — these are TCP protocol servers consumable by HA Assist, Wyoming +satellites, or any Wyoming client. + +- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) — + speech-to-text. Default model is `tiny-int8` (~75 MB). Models + download into `/srv/docker/whisper/data/` on first run. + +- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) — + text-to-speech. Default voice is `en_US-lessac-medium` (~63 MB). + Browse alternatives at . + +Bring-up: +```sh +cd /srv/docker/whisper && docker compose up -d +cd /srv/docker/piper && docker compose up -d +``` + +Both are conservative defaults sized for "small command-style usage"; +on Strix Halo's hardware budget you can run substantially better +models. See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the +landscape and upgrade options. 
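+
+With no web UI, the quickest health check is the Wyoming handshake
+itself. A sketch from any tailnet machine — this assumes the framing
+matches the published Wyoming spec (newline-terminated JSON event
+headers, with `describe` answered by an `info` event); host and ports
+are the deployed defaults above:
+
+```python
+import json
+import socket
+
+
+def wyoming_event(event_type: str, data: dict | None = None) -> bytes:
+    """Encode one Wyoming event: a JSON header terminated by a newline."""
+    header: dict = {"type": event_type}
+    if data is not None:
+        header["data"] = data
+    return (json.dumps(header) + "\n").encode("utf-8")
+
+
+def describe(host: str, port: int) -> dict:
+    """Send `describe`; return the reply header (expected type: `info`)."""
+    with socket.create_connection((host, port), timeout=5) as sock:
+        sock.sendall(wyoming_event("describe"))
+        reply = sock.makefile("rb").readline()
+    return json.loads(reply)
+
+
+# Usage against the deployed defaults:
+#   describe("framework", 10300)  # Whisper STT — advertised ASR models
+#   describe("framework", 10200)  # Piper TTS  — advertised voices
+```
+
+If either call times out, the container is down or the port mapping is
+wrong — faster to spot than digging through `docker logs`.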
+ ## Front door - **Homepage** (`/srv/docker/homepage`, http://framework:7575) — single diff --git a/pyinfra/framework/compose/homepage/services.yaml b/pyinfra/framework/compose/homepage/services.yaml index 436b644..f67d156 100644 --- a/pyinfra/framework/compose/homepage/services.yaml +++ b/pyinfra/framework/compose/homepage/services.yaml @@ -73,6 +73,21 @@ server: localhost-docker container: phoenix +- Voice: + # Wyoming-protocol services have no web UI; tiles are informational + # (container status + port). Click-through goes nowhere meaningful. + - Whisper: + icon: mdi-microphone-message + description: Speech-to-text (Wyoming :10300) + server: localhost-docker + container: wyoming-whisper + + - Piper: + icon: mdi-account-voice + description: Text-to-speech (Wyoming :10200) + server: localhost-docker + container: wyoming-piper + - External: - SearXNG: icon: searxng.svg diff --git a/pyinfra/framework/compose/homepage/settings.yaml b/pyinfra/framework/compose/homepage/settings.yaml index d8184ac..35eefb2 100644 --- a/pyinfra/framework/compose/homepage/settings.yaml +++ b/pyinfra/framework/compose/homepage/settings.yaml @@ -21,6 +21,9 @@ layout: Observability: style: row columns: 3 + Voice: + style: row + columns: 3 External: style: row columns: 3 diff --git a/pyinfra/framework/compose/piper.yml b/pyinfra/framework/compose/piper.yml new file mode 100644 index 0000000..a837fc6 --- /dev/null +++ b/pyinfra/framework/compose/piper.yml @@ -0,0 +1,22 @@ +# Wyoming Piper — text-to-speech over the Wyoming protocol. +# https://github.com/rhasspy/wyoming-piper +# +# Wyoming is Home Assistant's voice protocol; it's also consumable by any +# Wyoming client. No web UI — this is a protocol server on TCP :10200. +# +# Voice selection: en_US-lessac-medium is the most balanced English voice +# (~63 MB, natural prosody). Browse alternatives at +# https://github.com/rhasspy/piper/blob/master/VOICES.md — pulled into +# /srv/docker/piper/data on first start. 
+services: + piper: + image: rhasspy/wyoming-piper:latest + container_name: wyoming-piper + restart: unless-stopped + ports: + - "10200:10200" + volumes: + - /srv/docker/piper/data:/data + command: + - --voice + - en_US-lessac-medium diff --git a/pyinfra/framework/compose/whisper.yml b/pyinfra/framework/compose/whisper.yml new file mode 100644 index 0000000..a431a37 --- /dev/null +++ b/pyinfra/framework/compose/whisper.yml @@ -0,0 +1,26 @@ +# Wyoming Whisper — speech-to-text over the Wyoming protocol. +# https://github.com/rhasspy/wyoming-whisper +# +# Wyoming is Home Assistant's voice protocol; it's also consumable by any +# Wyoming client. No web UI — this is a protocol server on TCP :10300. +# +# Model selection: `tiny-int8` is the smallest viable model (~75 MB), +# fast and good enough for command-style transcription. Bump to +# `base-int8` (140 MB) or `small-int8` (480 MB) for general dictation. +# Models are downloaded into /srv/docker/whisper/data on first start. +services: + whisper: + image: rhasspy/wyoming-whisper:latest + container_name: wyoming-whisper + restart: unless-stopped + ports: + - "10300:10300" + volumes: + - /srv/docker/whisper/data:/data + command: + - --model + - tiny-int8 + - --language + - en + - --beam-size + - "1" diff --git a/pyinfra/framework/deploy.py b/pyinfra/framework/deploy.py index c71ece0..e249b45 100644 --- a/pyinfra/framework/deploy.py +++ b/pyinfra/framework/deploy.py @@ -339,6 +339,8 @@ for svc in ( "phoenix", "openhands", "homepage", + "whisper", + "piper", ): files.directory( name=f"compose/{svc} dir", @@ -470,6 +472,24 @@ for cfg in ( _sudo=True, ) +# Voice stack — Wyoming-protocol Whisper (STT) and Piper (TTS). Models +# are downloaded on first start; bind-mounting these dirs survives +# container recreation. 
+files.directory( + name="Whisper data dir", + path=f"{COMPOSE_DIR}/whisper/data", + group="docker", + mode="2775", + _sudo=True, +) +files.directory( + name="Piper data dir", + path=f"{COMPOSE_DIR}/piper/data", + group="docker", + mode="2775", + _sudo=True, +) + # --- Cleanup of artifacts from the prior native-build deploy ---------------- # All idempotent — `present=False` is a no-op when the target is absent. diff --git a/whisper-compose.yaml b/whisper-compose.yaml deleted file mode 100644 index 1ee337a..0000000 --- a/whisper-compose.yaml +++ /dev/null @@ -1,16 +0,0 @@ -services: - whisper: - image: rhasspy/wyoming-whisper:latest - container_name: wyoming-whisper - restart: unless-stopped - ports: - - "10300:10300" - volumes: - - ./whisper-data:/data - command: - - --model - - tiny-int8 - - --language - - en - - --beam-size - - "1"