localgenai/VoiceModels.md
noisedestroyers 36b8cfe835 Add Wyoming voice stack to pyinfra + landscape doc
- Move piper-compose.yaml / whisper-compose.yaml from repo root into
  pyinfra/framework/compose/{piper,whisper}.yml; bind paths shifted to
  /srv/docker/{piper,whisper}/data on the box.
- deploy.py registers both stacks and provisions the data dirs.
- Homepage gets a "Voice" group with informational tiles (Wyoming has
  no web UI, so tiles show container status without click-through).
- New VoiceModels.md captures the May 2026 STT/TTS landscape, why the
  current Wyoming defaults aren't SOTA, and concrete upgrade paths
  (whisper-large-v3-turbo + faster-whisper-server, Kokoro, Sesame CSM,
  F5-TTS for cloning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:33:17 -04:00


Voice models — landscape and upgrade paths

State of the OSS speech stack as of May 2026, with concrete upgrade options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s bandwidth — plenty of headroom alongside a 30 GB code model).

What's deployed today

  • Whisper (tcp://framework:10300) — rhasspy/wyoming-whisper, Wyoming protocol, model tiny-int8 (~75 MB). Picked for HA defaults, not quality.
  • Piper (tcp://framework:10200) — rhasspy/wyoming-piper, Wyoming protocol, voice en_US-lessac-medium (~63 MB, VITS-based).

These are convenient HA defaults, not state-of-the-art. The Wyoming protocol is from the Rhasspy / Home Assistant ecosystem; it's good for HA Assist integration but doesn't expose an OpenAI-compatible HTTP API, so direct integration with OpenWebUI / OpenCode / agentic stacks requires a wrapper.

Speech-to-text (STT) landscape

| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| whisper tiny | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| whisper small | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| whisper large-v3 | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| whisper large-v3-turbo | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | Best Whisper for this box. Quality of large-v3 at small-model speeds. |
| faster-whisper (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimpl. Drop-in via fedirz/faster-whisper-server (OpenAI-compatible). |
| distil-whisper | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| WhisperX | m-bain (2024) | + Whisper | 5 | slow (+ alignment) | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| Parakeet-TDT-0.6B | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| Canary-1B | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-4.0 (non-commercial) | Multilingual; license is the catch. |
| Moonshine | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |

Recommendation for this box: whisper-large-v3-turbo via faster-whisper-server. Best quality/speed/ecosystem balance, OpenAI API compat lets non-Wyoming clients use it directly.

Text-to-speech (TTS) landscape

| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Piper (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly, prosody is okay but recognizably synthetic. Default in HA Assist. |
| Kokoro-82M | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | Sweet spot for this box. Tiny model, surprisingly natural. OpenAI-API server: Kokoro-FastAPI. |
| Sesame CSM-1B | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| F5-TTS | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice-cloning state of the art for OSS. Non-commercial license. |
| XTTS-v2 | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| MeloTTS | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| StyleTTS2 | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| OpenVoice v2 | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. |

Recommendation for this box: Kokoro for general TTS (fast, high quality, permissive license), Sesame CSM layered alongside for conversational use cases.

Commercial services that nothing OSS reliably beats yet: ElevenLabs (general quality / voice cloning), Cartesia Sonic (real-time streaming latency).

Wyoming vs OpenAI-compatible servers

The protocol choice matters depending on what you want to drive the voice stack with.

| Consumer | Wyoming | OpenAI-compatible |
| --- | --- | --- |
| Home Assistant Assist | native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | native | — |
| OpenWebUI | — | Voice (built-in) |
| OpenCode / OpenHands voice add-ons | needs wrapper | direct |
| Custom Python/Node scripts | possible | trivial |

If you only ever want HA voice assistant, Wyoming is fine. If you also want voice in OpenWebUI or to script "hey local model, summarize this audio file" outside HA, add OpenAI-compatible servers alongside — they don't conflict (different ports).

Use cases for this stack

In rough order of reach for a single-user homelab:

  1. Home Assistant voice assistant. Point HA's Assist pipeline at Whisper:10300 and Piper:10200, add wyoming-openwakeword, plug a Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). Self-hosted Alexa replacement that talks to your local LLM.

  2. Voice in OpenWebUI. With OpenAI-compatible STT/TTS servers, you can speak to Qwen3-Coder in the browser and hear the response. OpenWebUI has the hooks; Wyoming alone doesn't, hence the wrapper need.

  3. Standalone transcription. Drop a meeting recording on the box, get a transcript out. Pair Whisper with WhisperX for diarization ("who said what") and timestamps. Useful for podcast transcription, interview indexing, lecture notes.

  4. TTS notifications and announcements. Pipe automation events through Piper (or Kokoro). Voice notifications throughout the house if you have HA + speakers.

  5. Voice agent loops. Mic input → Whisper → local LLM → Piper / Kokoro → speakers. The "talk to your local assistant" loop that neither Claude Code nor ChatGPT Code Interpreter natively does. Roadmap.md item #8.

  6. Meeting bots. With Whisper + a custom integration, drop a bot into Zoom/Google Meet, get a real-time transcript and post-meeting summary from the local LLM.
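The agent loop in item 5 reduces to two HTTP calls once OpenAI-compatible servers are running (see Path B). Here is a minimal sketch of the TTS half, assuming a Kokoro-FastAPI-style server on port 8002 with the af_bella voice (both taken from the Path B plan); the "kokoro" model name and exact payload fields follow the OpenAI /v1/audio/speech shape, but verify them against whatever server you actually deploy:

```python
import json
import urllib.request

# Endpoints assume the Path B servers; adjust host/ports to your deployment.
STT_URL = "http://framework:8001/v1/audio/transcriptions"
TTS_URL = "http://framework:8002/v1/audio/speech"

def tts_payload(text: str, voice: str = "af_bella") -> bytes:
    """Build an OpenAI-style /v1/audio/speech request body.

    The model name is server-dependent; "kokoro" is an assumption here.
    """
    return json.dumps({
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }).encode()

def speak(text: str) -> bytes:
    """POST text to the TTS server; returns raw WAV bytes to pipe to a player."""
    req = urllib.request.Request(
        TTS_URL,
        data=tts_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The STT half is the mirror image: a multipart file upload of mic audio to /v1/audio/transcriptions on port 8001, with the transcript feeding the local LLM whose reply goes back through speak().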

Concrete upgrade paths

Path A — Stay Wyoming, bump model sizes (lowest effort)

Edit the existing compose files to upgrade the models:

  • compose/whisper.yml: --model tiny-int8 → --model large-v3-turbo (or medium-int8 for a middle step). Cost: ~810 MB / 1.5 GB on disk, significant quality jump on dictation.
  • compose/piper.yml: en_US-lessac-medium → another voice. Browse at https://github.com/rhasspy/piper/blob/master/VOICES.md.

Then ./run.sh from the Mac to recreate the containers on the box. Same Wyoming protocol, better quality.
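For concreteness, a sketch of what the upgraded compose/whisper.yml might look like. The image, port, and bind path mirror the deployed rhasspy/wyoming-whisper setup; the flag names and the --language option are assumptions to check against the image's docs, and this is not the deployed file verbatim:

```yaml
# compose/whisper.yml (hedged sketch)
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model large-v3-turbo --language en   # verify flag names against the image
    volumes:
      - /srv/docker/whisper/data:/data
    ports:
      - "10300:10300"
    restart: unless-stopped
```

The Piper file changes the same way: swap the voice name in its command line and recreate the container.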

Path B — Add OpenAI-compatible servers alongside (mid effort)

Stand up faster-whisper-server + Kokoro-FastAPI as additional compose stacks (different ports). Wyoming stays for HA, OpenAI-API servers serve OpenWebUI / OpenCode / scripts. Both pull from the same model files if you want to avoid duplication.

Sample compose entries to add:

  • compose/faster-whisper.yml: fedirz/faster-whisper-server, port 8001 with a /v1/audio/transcriptions endpoint.
  • compose/kokoro.yml: ghcr.io/remsky/kokoro-fastapi-cpu (or -gpu), port 8002 with /v1/audio/speech.
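A hedged sketch of those two files (shown together here, deployed as separate compose files). The image tags, environment variable name, and container ports are assumptions to verify against each project's README before deploying:

```yaml
# compose/faster-whisper.yml (sketch)
services:
  faster-whisper:
    image: fedirz/faster-whisper-server:latest-cpu   # a -cuda variant exists; check tags
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-large-v3-turbo   # env var name assumed
    ports:
      - "8001:8000"   # container port assumed; verify
    restart: unless-stopped

# compose/kokoro.yml (sketch, separate file)
services:
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8002:8880"   # container port assumed from the project's docs; verify
    restart: unless-stopped
```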

Then OpenWebUI Settings → Audio → STT URL http://framework:8001/v1, TTS URL http://framework:8002/v1, voice af_bella (Kokoro default).

Path C — Full SOTA conversational voice (higher effort)

For "talk to my agent" with the best prosody available in OSS, run Sesame CSM-1B as the TTS instead of Kokoro/Piper. Heavier (~3-4 GB GPU at inference, ~5 GB on disk) but conversational quality is markedly better. No first-class OpenAI-API wrapper exists yet — expect a custom HTTP server or use the reference inference scripts.

Pair with whisper-large-v3-turbo on STT for the matching jump in input quality.

Path D — Voice cloning (specialized)

If you want a custom voice, F5-TTS or OpenVoice v2 can clone from a 5–10 second reference clip. F5 has the best quality; OpenVoice has the better license (MIT). Self-hosted; no first-party HTTP server — script-driven for now.

What I'd actually do

Path A as the immediate quick win (5 minutes), Path B as the next weekend project (gives you voice in the existing browser UIs without forcing the Wyoming dependency on every consumer). Skip C/D until you have a concrete use case that demands them.

Reference catalogs