localgenai/VoiceModels.md
noisedestroyers 36b8cfe835 Add Wyoming voice stack to pyinfra + landscape doc
- Move piper-compose.yaml / whisper-compose.yaml from repo root into
  pyinfra/framework/compose/{piper,whisper}.yml; bind paths shifted to
  /srv/docker/{piper,whisper}/data on the box.
- deploy.py registers both stacks and provisions the data dirs.
- Homepage gets a "Voice" group with informational tiles (Wyoming has
  no web UI, so tiles show container status without click-through).
- New VoiceModels.md captures the May 2026 STT/TTS landscape, why the
  current Wyoming defaults aren't SOTA, and concrete upgrade paths
  (whisper-large-v3-turbo + faster-whisper-server, Kokoro, Sesame CSM,
  F5-TTS for cloning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:33:17 -04:00


Voice models — landscape and upgrade paths

State of the OSS speech stack as of May 2026, with concrete upgrade options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s bandwidth — plenty of headroom alongside a 30 GB code model).

What's deployed today

  • Whisper (tcp://framework:10300) — rhasspy/wyoming-whisper, Wyoming protocol, model tiny-int8 (~75 MB). Picked for HA defaults, not quality.
  • Piper (tcp://framework:10200) — rhasspy/wyoming-piper, Wyoming protocol, voice en_US-lessac-medium (~63 MB, VITS-based).

These are convenient HA defaults, not state-of-the-art. The Wyoming protocol is from the Rhasspy / Home Assistant ecosystem; it's good for HA Assist integration but doesn't expose an OpenAI-compatible HTTP API, so direct integration with OpenWebUI / OpenCode / agentic stacks requires a wrapper.

Speech-to-text (STT) landscape

| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| whisper tiny | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| whisper small | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| whisper large-v3 | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| whisper large-v3-turbo | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | Best Whisper for this box. Quality of large-v3 at small-model speeds. |
| faster-whisper (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimpl. Drop-in via fedirz/faster-whisper-server (OpenAI-compatible). |
| distil-whisper | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| WhisperX | m-bain (2024) | + Whisper | 5 | slow (+ alignment) | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| Parakeet-TDT-0.6B | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| Canary-1B | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-4.0 (non-commercial) | Multilingual; license is the catch. |
| Moonshine | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |

Recommendation for this box: whisper-large-v3-turbo via faster-whisper-server. Best quality/speed/ecosystem balance, OpenAI API compat lets non-Wyoming clients use it directly.

Text-to-speech (TTS) landscape

| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Piper (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly, prosody is okay but recognizably synthetic. Default in HA Assist. |
| Kokoro-82M | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | Sweet spot for this box. Tiny model, surprisingly natural. OpenAI-API server: Kokoro-FastAPI. |
| Sesame CSM-1B | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| F5-TTS | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice-cloning state of the art for OSS. Non-commercial license. |
| XTTS-v2 | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| MeloTTS | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| StyleTTS2 | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| OpenVoice v2 | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. |

Recommendation for this box: Kokoro for general TTS (fast, high quality, permissive license), Sesame CSM layered alongside for conversational use cases.

Commercial services that nothing OSS reliably beats yet: ElevenLabs (general quality / voice cloning), Cartesia Sonic (real-time streaming latency).

Wyoming vs OpenAI-compatible servers

The protocol choice matters depending on what you want to drive the voice stack with.

| Consumer | Wyoming | OpenAI-compatible |
| --- | --- | --- |
| Home Assistant Assist | native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | native | — |
| OpenWebUI | — | Voice (built-in) |
| OpenCode / OpenHands voice add-ons | needs wrapper | direct |
| Custom Python/Node scripts | possible | trivial |

If you only ever want HA voice assistant, Wyoming is fine. If you also want voice in OpenWebUI or to script "hey local model, summarize this audio file" outside HA, add OpenAI-compatible servers alongside — they don't conflict (different ports).

Use cases for this stack

In rough order of reach for a single-user homelab:

  1. Home Assistant voice assistant. Point HA's Assist pipeline at Whisper:10300 and Piper:10200, add wyoming-openwakeword, plug a Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). Self-hosted Alexa replacement that talks to your local LLM.

  2. Voice in OpenWebUI. With OpenAI-compatible STT/TTS servers, you can speak to Qwen3-Coder in the browser and hear the response. OpenWebUI has the hooks; Wyoming alone doesn't, hence the wrapper need.

  3. Standalone transcription. Drop a meeting recording on the box, get a transcript out. Pair Whisper with WhisperX for diarization ("who said what") and timestamps. Useful for podcast transcription, interview indexing, lecture notes.

  4. TTS notifications and announcements. Pipe automation events through Piper (or Kokoro). Voice notifications throughout the house if you have HA + speakers.

  5. Voice agent loops. Mic input → Whisper → local LLM → Piper / Kokoro → speakers. The "talk to your local assistant" loop that neither Claude Code nor ChatGPT Code Interpreter natively does. Roadmap.md item #8.

  6. Meeting bots. With Whisper + a custom integration, drop a bot into Zoom/Google Meet, get a real-time transcript and post-meeting summary from the local LLM.
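The agent loop in item 5 reduces to two HTTP calls once OpenAI-compatible servers are running (see Path B). Here is a minimal sketch of the TTS half, assuming a Kokoro-FastAPI-style server on port 8002 with the af_bella voice (both taken from the Path B plan); the "kokoro" model name and exact payload fields follow the OpenAI /v1/audio/speech shape, but verify them against whatever server you actually deploy:

```python
import json
import urllib.request

# Endpoints assume the Path B servers; adjust host/ports to your deployment.
STT_URL = "http://framework:8001/v1/audio/transcriptions"
TTS_URL = "http://framework:8002/v1/audio/speech"

def tts_payload(text: str, voice: str = "af_bella") -> bytes:
    """Build an OpenAI-style /v1/audio/speech request body.

    The model name is server-dependent; "kokoro" is an assumption here.
    """
    return json.dumps({
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }).encode()

def speak(text: str) -> bytes:
    """POST text to the TTS server; returns raw WAV bytes to pipe to a player."""
    req = urllib.request.Request(
        TTS_URL,
        data=tts_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The STT half is the mirror image: a multipart file upload of mic audio to /v1/audio/transcriptions on port 8001, with the transcript feeding the local LLM whose reply goes back through speak().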

Concrete upgrade paths

Path A — Stay Wyoming, bump model sizes (lowest effort)

Edit the existing compose files to upgrade the models:

  • compose/whisper.yml: --model tiny-int8 → --model large-v3-turbo (or medium-int8 for a middle step). Cost: ~810 MB / 1.5 GB on disk, significant quality jump on dictation.
  • compose/piper.yml: en_US-lessac-medium → another voice. Browse at https://github.com/rhasspy/piper/blob/master/VOICES.md.

Then ./run.sh from the Mac to recreate the containers on the box. Same Wyoming protocol, better quality.
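For concreteness, a sketch of what the upgraded compose/whisper.yml might look like. The image, port, and bind path mirror the deployed rhasspy/wyoming-whisper setup; the flag names and the --language option are assumptions to check against the image's docs, and this is not the deployed file verbatim:

```yaml
# compose/whisper.yml (hedged sketch)
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model large-v3-turbo --language en   # verify flag names against the image
    volumes:
      - /srv/docker/whisper/data:/data
    ports:
      - "10300:10300"
    restart: unless-stopped
```

The Piper file changes the same way: swap the voice name in its command line and recreate the container.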

Path B — Add OpenAI-compatible servers alongside (mid effort)

Stand up faster-whisper-server + Kokoro-FastAPI as additional compose stacks (different ports). Wyoming stays for HA, OpenAI-API servers serve OpenWebUI / OpenCode / scripts. Both pull from the same model files if you want to avoid duplication.

Sample compose entries to add:

  • compose/faster-whisper.yml: fedirz/faster-whisper-server, port 8001 with a /v1/audio/transcriptions endpoint.
  • compose/kokoro.yml: ghcr.io/remsky/kokoro-fastapi-cpu (or -gpu), port 8002 with /v1/audio/speech.
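A hedged sketch of those two files (shown together here, deployed as separate compose files). The image tags, environment variable name, and container ports are assumptions to verify against each project's README before deploying:

```yaml
# compose/faster-whisper.yml (sketch)
services:
  faster-whisper:
    image: fedirz/faster-whisper-server:latest-cpu   # a -cuda variant exists; check tags
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-large-v3-turbo   # env var name assumed
    ports:
      - "8001:8000"   # container port assumed; verify
    restart: unless-stopped

# compose/kokoro.yml (sketch, separate file)
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8002:8880"   # container port assumed from the project's docs; verify
    restart: unless-stopped
```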

Then OpenWebUI Settings → Audio → STT URL http://framework:8001/v1, TTS URL http://framework:8002/v1, voice af_bella (Kokoro default).

Path C — Full SOTA conversational voice (higher effort)

For "talk to my agent" with the best prosody available in OSS, run Sesame CSM-1B as the TTS instead of Kokoro/Piper. Heavier (~3-4 GB GPU at inference, ~5 GB on disk) but conversational quality is markedly better. No first-class OpenAI-API wrapper exists yet — expect a custom HTTP server or use the reference inference scripts.

Pair with whisper-large-v3-turbo on STT for the matching jump in input quality.

Path D — Voice cloning (specialized)

If you want a custom voice, F5-TTS or OpenVoice v2 can clone from a 5–10 second reference clip. F5 has the best quality; OpenVoice has the better license (MIT). Self-hosted; no first-party HTTP server — script-driven for now.

What I'd actually do

Path A as the immediate quick win (5 minutes), Path B as the next weekend project (gives you voice in the existing browser UIs without forcing the Wyoming dependency on every consumer). Skip C/D until you have a concrete use case that demands them.

Reference catalogs