Add Wyoming voice stack to pyinfra + landscape doc
- Move piper-compose.yaml / whisper-compose.yaml from repo root into
pyinfra/framework/compose/{piper,whisper}.yml; bind paths shifted to
/srv/docker/{piper,whisper}/data on the box.
- deploy.py registers both stacks and provisions the data dirs.
- Homepage gets a "Voice" group with informational tiles (Wyoming has
no web UI, so tiles show container status without click-through).
- New VoiceModels.md captures the May 2026 STT/TTS landscape, why the
current Wyoming defaults aren't SOTA, and concrete upgrade paths
(whisper-large-v3-turbo + faster-whisper-server, Kokoro, Sesame CSM,
F5-TTS for cloning).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -30,7 +30,7 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted.

| [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| [`TODO.md`](TODO.md) | Open questions and follow-ups. |
| [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. |
| `piper-compose.yaml`, `whisper-compose.yaml` | Voice (TTS / STT) compose files; not yet wired into the pyinfra deploy. |
| [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. |

## Service ports (on `framework`)

@@ -46,6 +46,8 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted.

| `4317` | Phoenix OTLP/gRPC | Trace ingestion |
| `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| `8090` | Beszel | Host + container + AMD GPU dashboard |
| `10200` | Piper | TTS (Wyoming protocol) — used by Home Assistant Assist |
| `10300` | Whisper | STT (Wyoming protocol) — used by Home Assistant Assist |

## Quick start

VoiceModels.md (new file, 174 lines)
@@ -0,0 +1,174 @@

# Voice models — landscape and upgrade paths

State of the OSS speech stack as of May 2026, with concrete upgrade
options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s
bandwidth — plenty of headroom alongside a 30 GB code model).
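As a sanity check on that headroom claim, a back-of-envelope sum of the approximate sizes quoted in this document (the 30 GB code model plus the recommended STT/TTS picks below; runtime overhead such as KV cache and activations is not counted):

```sh
# Rough UMA budget in MB, using sizes quoted in this doc:
# 30 GB code model + whisper-large-v3-turbo (810 MB)
# + Kokoro (~340 MB) + Sesame CSM (~3 GB).
used=$((30000 + 810 + 340 + 3000))
echo "used ≈ ${used} MB of 64000 MB UMA"
echo "headroom ≈ $((64000 - used)) MB"
```

Roughly half the UMA stays free even with the heaviest recommended combination loaded.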

## What's deployed today

- **Whisper** (`tcp://framework:10300`) — `rhasspy/wyoming-whisper`,
  Wyoming protocol, model `tiny-int8` (~75 MB). Picked for HA defaults,
  not quality.
- **Piper** (`tcp://framework:10200`) — `rhasspy/wyoming-piper`,
  Wyoming protocol, voice `en_US-lessac-medium` (~63 MB, VITS-based).

These are **convenient HA defaults**, not state-of-the-art. The Wyoming
protocol is from the Rhasspy / Home Assistant ecosystem; it's good for
HA Assist integration but doesn't expose an OpenAI-compatible HTTP API,
so direct integration with OpenWebUI / OpenCode / agentic stacks
requires a wrapper.

## Speech-to-text (STT) landscape

| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
|---|---|---|---|---|---|---|
| **whisper tiny** | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| **whisper small** | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| **whisper large-v3** | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| **whisper large-v3-turbo** | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | **Best Whisper for this box.** Quality of large-v3 at small-model speeds. |
| **faster-whisper** (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimpl. Drop-in via `fedirz/faster-whisper-server` (OpenAI-compatible). |
| **distil-whisper** | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| **WhisperX** | m-bain (2024) | + Whisper | 5 + alignment | slow | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| **Parakeet-TDT-0.6B** | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| **Canary-1B** | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-NC-4.0 (non-commercial) | Multilingual; license is the catch. |
| **Moonshine** | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |

**Recommendation for this box:** `whisper-large-v3-turbo` via
`faster-whisper-server`. Best quality/speed/ecosystem balance, and its
OpenAI API compatibility lets non-Wyoming clients use it directly.

## Text-to-speech (TTS) landscape

| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
|---|---|---|---|---|---|---|
| **Piper** (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly, prosody is okay but recognizably synthetic. Default in HA Assist. |
| **Kokoro-82M** | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | **Sweet spot for this box.** Tiny model, surprisingly natural. OpenAI-API server: `Kokoro-FastAPI`. |
| **Sesame CSM-1B** | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| **F5-TTS** | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice-cloning state of the art for OSS. Non-commercial license. |
| **XTTS-v2** | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| **MeloTTS** | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| **StyleTTS2** | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| **OpenVoice v2** | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. |

**Recommendation for this box:** **Kokoro** for general TTS (fast,
high quality, permissive license), with **Sesame CSM** layered alongside
for conversational use cases.

Commercial offerings that nothing OSS reliably beats yet: **ElevenLabs**
(general quality / voice cloning) and **Cartesia Sonic** (real-time
streaming latency).

## Wyoming vs OpenAI-compatible servers

The protocol choice matters depending on what you want to drive the
voice stack with.

| Consumer | Wyoming | OpenAI-compatible |
|---|---|---|
| Home Assistant Assist | ✅ native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | ✅ native | ❌ |
| OpenWebUI Voice (built-in) | ❌ | ✅ |
| OpenCode / OpenHands voice add-ons | needs wrapper | ✅ direct |
| Custom Python/Node scripts | possible | trivial |

If you only ever want the HA voice assistant, Wyoming is fine. If you also
want voice in OpenWebUI or to script "hey local model, summarize this
audio file" outside HA, **add OpenAI-compatible servers alongside** —
they don't conflict (different ports).

## Use cases for this stack

In rough order of reach for a single-user homelab:

1. **Home Assistant voice assistant.** Point HA's Assist pipeline at
   Whisper:10300 and Piper:10200, add `wyoming-openwakeword`, plug in a
   Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). A self-hosted
   Alexa replacement that talks to your local LLM.

2. **Voice OpenWebUI.** With OpenAI-compatible STT/TTS servers, you can
   speak to Qwen3-Coder in the browser and hear the response. OpenWebUI
   has the hooks; Wyoming alone doesn't, hence the need for a wrapper.

3. **Standalone transcription.** Drop a meeting recording on the box,
   get a transcript out. Pair Whisper with WhisperX for diarization
   ("who said what") and timestamps. Useful for podcast transcription,
   interview indexing, and lecture notes.

4. **TTS notifications and announcements.** Pipe automation events
   through Piper (or Kokoro). Voice notifications throughout the house
   if you have HA + speakers.

5. **Voice agent loops.** Mic input → Whisper → local LLM → Piper /
   Kokoro → speakers. The "talk to your local assistant" loop that
   neither Claude Code nor ChatGPT Code Interpreter natively does.
   Roadmap.md item #8.

6. **Meeting bots.** With Whisper + a custom integration, drop a bot
   into Zoom/Google Meet and get a real-time transcript and post-meeting
   summary from the local LLM.
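The item-5 loop can be sketched as three HTTP hops once OpenAI-compatible servers are up (Path B below). Everything here is illustrative, not deployed config: hostnames, ports, model names, and the `jq`/`arecord`/`mpv` tooling are assumptions.

```sh
# Hypothetical one-shot voice loop: record 5 s, transcribe, ask the
# LLM, speak the answer. Assumes faster-whisper-server on :8001, an
# OpenAI-compatible chat server on :8000, Kokoro-FastAPI on :8002.
arecord -f S16_LE -r 16000 -d 5 /tmp/q.wav

TEXT=$(curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@/tmp/q.wav -F model=large-v3-turbo | jq -r .text)

REPLY=$(curl -s http://framework:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"model\": \"qwen3-coder\", \"messages\": [{\"role\": \"user\", \"content\": \"$TEXT\"}]}" \
  | jq -r '.choices[0].message.content')

curl -s http://framework:8002/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d "{\"model\": \"kokoro\", \"voice\": \"af_bella\", \"input\": \"$REPLY\"}" \
  --output /tmp/a.mp3 && mpv /tmp/a.mp3
```

A real implementation would stream instead of round-tripping files, but the hop structure is the same.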

## Concrete upgrade paths

### Path A — Stay Wyoming, bump model sizes (lowest effort)

Edit the existing compose files to upgrade the models:

- `compose/whisper.yml`: `--model tiny-int8` → `--model large-v3-turbo`
  (or `medium-int8` as a middle step). Cost: ~810 MB / 1.5 GB on disk,
  a significant quality jump on dictation.
- `compose/piper.yml`: `en_US-lessac-medium` → another voice. Browse at
  <https://github.com/rhasspy/piper/blob/master/VOICES.md>.
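Concretely, the Whisper change is one token in the compose `command` list (fragment only, rest of the service stays as-is; whether the image's bundled faster-whisper accepts the `large-v3-turbo` model name is an assumption to verify):

```yaml
# compose/whisper.yml — Path A model bump (illustrative fragment)
    command:
      - --model
      - large-v3-turbo   # was: tiny-int8
      - --language
      - en
      - --beam-size
      - "1"
```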

`./run.sh` from the Mac, recreate the containers on the box. Same
Wyoming protocol, better quality.

### Path B — Add OpenAI-compatible servers alongside (mid effort)

Stand up `faster-whisper-server` + `Kokoro-FastAPI` as additional
compose stacks (on different ports). Wyoming stays for HA; the
OpenAI-API servers serve OpenWebUI / OpenCode / scripts. Both can pull
from the same model files if you want to avoid duplication.

Sample compose entries to add:

- `compose/faster-whisper.yml` — `fedirz/faster-whisper-server`,
  port 8001 with a `/v1/audio/transcriptions` endpoint.
- `compose/kokoro.yml` — `ghcr.io/remsky/kokoro-fastapi-cpu` (or `-gpu`),
  port 8002 with `/v1/audio/speech`.
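A sketch of what those two files could look like, following the pattern of the existing compose files. Image tags, internal ports, and cache paths are assumptions to check against each project's README, not tested config:

```yaml
# compose/faster-whisper.yml — OpenAI-compatible STT on :8001 (sketch)
services:
  faster-whisper:
    image: fedirz/faster-whisper-server:latest-cpu
    container_name: faster-whisper
    restart: unless-stopped
    ports:
      - "8001:8000"          # assumed internal port
    volumes:
      - /srv/docker/faster-whisper/data:/root/.cache/huggingface
```

```yaml
# compose/kokoro.yml — OpenAI-compatible TTS on :8002 (sketch)
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    container_name: kokoro
    restart: unless-stopped
    ports:
      - "8002:8880"          # assumed internal port
```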

Then OpenWebUI Settings → Audio → STT URL `http://framework:8001/v1`,
TTS URL `http://framework:8002/v1`, voice `af_bella` (Kokoro default).
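Quick smoke tests from any shell once both are up (endpoint paths follow the OpenAI audio API; the file names and model identifiers are placeholders):

```sh
# STT: POST a wav, expect JSON with a "text" field back
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@sample.wav -F model=large-v3-turbo

# TTS: POST text, save the synthesized audio
curl -s http://framework:8002/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "voice": "af_bella", "input": "Voice stack is up."}' \
  --output speech.mp3
```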

### Path C — Full SOTA conversational voice (higher effort)

For "talk to my agent" with the best prosody available in OSS, run
**Sesame CSM-1B** as the TTS instead of Kokoro/Piper. It's heavier
(~3–4 GB of GPU memory at inference, ~5 GB on disk) but the
conversational quality is markedly better. No first-class OpenAI-API
wrapper exists yet — expect a custom HTTP server or use the reference
inference scripts.

Pair it with `whisper-large-v3-turbo` on STT for the matching jump in
input quality.

### Path D — Voice cloning (specialized)

If you want a custom voice, **F5-TTS** or **OpenVoice v2** can clone it
from a 5–10 second reference clip. F5 has the better quality, OpenVoice
the better license (MIT). Self-hosted; no first-party HTTP server —
script-driven for now.

## What I'd actually do

Path A as the immediate quick win (5 minutes), Path B as the next
weekend project (gives you voice in the existing browser UIs without
forcing the Wyoming dependency on every consumer). Skip C/D until you
have a concrete use case that demands them.

## Reference catalogs

- [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
- [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) — STT WER benchmarks
- [Awesome STT / TTS](https://github.com/topics/text-to-speech) on GitHub
- [Rhasspy / Wyoming](https://github.com/rhasspy) ecosystem — Home Assistant voice
- [Kokoro repo](https://github.com/hexgrad/kokoro) and [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
- [Sesame CSM-1B](https://github.com/SesameAILabs/csm)
- [faster-whisper-server](https://github.com/fedirz/faster-whisper-server)

piper-compose.yaml (deleted)

@@ -1,12 +0,0 @@
services:
  piper:
    image: rhasspy/wyoming-piper:latest
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - ./piper-data:/data
    command:
      - --voice
      - en_US-lessac-medium
@@ -42,14 +42,15 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.

Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}/docker-compose.yml`
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage,whisper,piper}/docker-compose.yml`
  dropped in, not auto-started — you edit the model path then
  `docker compose up -d`
  (OpenWebUI needs no edits — it's pre-configured to find Ollama at
  `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
  (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
  OpenHands UI at :3030, Homepage at :7575 — see "Monitoring stack",
  "Agent harnesses", and "Front door" below)
  OpenHands UI at :3030, Homepage at :7575, Whisper STT at :10300,
  Piper TTS at :10200 — see "Monitoring stack", "Agent harnesses",
  "Front door", and "Voice" below)

If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.
@@ -85,6 +86,31 @@ Compose images in

(`services.yaml`, `settings.yaml`, etc.); edit there, `./run.sh`, restart
the homepage container.

## Voice

Wyoming-protocol speech servers (Home Assistant's voice stack). No web
UI — these are TCP protocol servers consumable by HA Assist, Wyoming
satellites, or any Wyoming client.

- **Whisper** (`/srv/docker/whisper`, `tcp://framework:10300`) —
  speech-to-text. Default model is `tiny-int8` (~75 MB). Models
  download into `/srv/docker/whisper/data/` on first run.
- **Piper** (`/srv/docker/piper`, `tcp://framework:10200`) —
  text-to-speech. Default voice is `en_US-lessac-medium` (~63 MB).
  Browse alternatives at <https://github.com/rhasspy/piper/blob/master/VOICES.md>.

Bring-up:

```sh
cd /srv/docker/whisper && docker compose up -d
cd /srv/docker/piper && docker compose up -d
```

Both are conservative defaults sized for "small command-style usage";
on Strix Halo's hardware budget you can run substantially better
models. See [`localgenai/VoiceModels.md`](../../VoiceModels.md) for the
landscape and upgrade options.

## Front door

- **Homepage** (`/srv/docker/homepage`, http://framework:7575) — single
@@ -73,6 +73,21 @@

        server: localhost-docker
        container: phoenix

- Voice:
    # Wyoming-protocol services have no web UI; tiles are informational
    # (container status + port). Click-through goes nowhere meaningful.
    - Whisper:
        icon: mdi-microphone-message
        description: Speech-to-text (Wyoming :10300)
        server: localhost-docker
        container: wyoming-whisper

    - Piper:
        icon: mdi-account-voice
        description: Text-to-speech (Wyoming :10200)
        server: localhost-docker
        container: wyoming-piper

- External:
    - SearXNG:
        icon: searxng.svg
@@ -21,6 +21,9 @@ layout:

  Observability:
    style: row
    columns: 3
  Voice:
    style: row
    columns: 3
  External:
    style: row
    columns: 3
pyinfra/framework/compose/piper.yml (new file, 22 lines)
@@ -0,0 +1,22 @@

# Wyoming Piper — text-to-speech over the Wyoming protocol.
# https://github.com/rhasspy/wyoming-piper
#
# Wyoming is Home Assistant's voice protocol; it's also consumable by any
# Wyoming client. No web UI — this is a protocol server on TCP :10200.
#
# Voice selection: en_US-lessac-medium is the most balanced English voice
# (~63 MB, natural prosody). Browse alternatives at
# https://github.com/rhasspy/piper/blob/master/VOICES.md — pulled into
# /srv/docker/piper/data on first start.
services:
  piper:
    image: rhasspy/wyoming-piper:latest
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - /srv/docker/piper/data:/data
    command:
      - --voice
      - en_US-lessac-medium
pyinfra/framework/compose/whisper.yml (new file, 26 lines)
@@ -0,0 +1,26 @@

# Wyoming Whisper — speech-to-text over the Wyoming protocol.
# https://github.com/rhasspy/wyoming-whisper
#
# Wyoming is Home Assistant's voice protocol; it's also consumable by any
# Wyoming client. No web UI — this is a protocol server on TCP :10300.
#
# Model selection: `tiny-int8` is the smallest viable model (~75 MB),
# fast and good enough for command-style transcription. Bump to
# `base-int8` (140 MB) or `small-int8` (480 MB) for general dictation.
# Models are downloaded into /srv/docker/whisper/data on first start.
services:
  whisper:
    image: rhasspy/wyoming-whisper:latest
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - /srv/docker/whisper/data:/data
    command:
      - --model
      - tiny-int8
      - --language
      - en
      - --beam-size
      - "1"
@@ -339,6 +339,8 @@ for svc in (

    "phoenix",
    "openhands",
    "homepage",
    "whisper",
    "piper",
):
    files.directory(
        name=f"compose/{svc} dir",
@@ -470,6 +472,24 @@ for cfg in (

        _sudo=True,
    )

# Voice stack — Wyoming-protocol Whisper (STT) and Piper (TTS). Models
# are downloaded on first start; bind-mounting these dirs survives
# container recreation.
files.directory(
    name="Whisper data dir",
    path=f"{COMPOSE_DIR}/whisper/data",
    group="docker",
    mode="2775",
    _sudo=True,
)
files.directory(
    name="Piper data dir",
    path=f"{COMPOSE_DIR}/piper/data",
    group="docker",
    mode="2775",
    _sudo=True,
)

# --- Cleanup of artifacts from the prior native-build deploy ----------------
# All idempotent — `present=False` is a no-op when the target is absent.
whisper-compose.yaml (deleted)

@@ -1,16 +0,0 @@
services:
  whisper:
    image: rhasspy/wyoming-whisper:latest
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data
    command:
      - --model
      - tiny-int8
      - --language
      - en
      - --beam-size
      - "1"