# Voice models — landscape and upgrade paths

State of the OSS speech stack as of May 2026, with concrete upgrade options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s bandwidth — plenty of headroom alongside a 30 GB code model).

## What's deployed today

- **Whisper** (`tcp://framework:10300`) — `rhasspy/wyoming-whisper`, Wyoming protocol, model `tiny-int8` (~75 MB). Picked because it's the HA default, not for quality.
- **Piper** (`tcp://framework:10200`) — `rhasspy/wyoming-piper`, Wyoming protocol, voice `en_US-lessac-medium` (~63 MB, VITS-based).

These are **convenient HA defaults**, not state-of-the-art. The Wyoming protocol comes from the Rhasspy / Home Assistant ecosystem; it's good for HA Assist integration but doesn't expose an OpenAI-compatible HTTP API, so direct integration with OpenWebUI / OpenCode / agentic stacks requires a wrapper.

## Speech-to-text (STT) landscape

| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
|---|---|---|---|---|---|---|
| **whisper tiny** | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| **whisper small** | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| **whisper large-v3** | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| **whisper large-v3-turbo** | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | **Best Whisper for this box.** Quality of large-v3 at small-model speeds. |
| **faster-whisper** (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimplementation. Drop-in via `fedirz/faster-whisper-server` (OpenAI-compatible). |
| **distil-whisper** | Hugging Face (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| **WhisperX** | m-bain (2024) | adds to Whisper | 5 + alignment | slow | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| **Parakeet-TDT-0.6B** | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| **Canary-1B** | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-NC-4.0 | Multilingual; the non-commercial license is the catch. |
| **Moonshine** | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |

**Recommendation for this box:** `whisper-large-v3-turbo` via `faster-whisper-server`. Best quality/speed/ecosystem balance, and OpenAI API compatibility lets non-Wyoming clients use it directly.

## Text-to-speech (TTS) landscape

| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
|---|---|---|---|---|---|---|
| **Piper** (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly; prosody is okay but recognizably synthetic. Default in HA Assist. |
| **Kokoro-82M** | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | **Sweet spot for this box.** Tiny model, surprisingly natural. OpenAI-API server: `Kokoro-FastAPI`. |
| **Sesame CSM-1B** | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| **F5-TTS** | SJTU (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice-cloning state of the art for OSS. Non-commercial license. |
| **XTTS-v2** | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| **MeloTTS** | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| **StyleTTS2** | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| **OpenVoice v2** | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. |

**Recommendation for this box:** **Kokoro** for general TTS (fast, high quality, permissive license), with **Sesame CSM** layered alongside for conversational use cases. Commercial services that nothing OSS reliably beats yet: **ElevenLabs** (general quality / voice cloning) and **Cartesia Sonic** (real-time streaming latency).

## Wyoming vs OpenAI-compatible servers

The protocol choice matters depending on what you want to drive the voice stack with.

| Consumer | Wyoming | OpenAI-compatible |
|---|---|---|
| Home Assistant Assist | ✅ native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | ✅ native | ❌ |
| OpenWebUI Voice (built-in) | ❌ | ✅ |
| OpenCode / OpenHands voice add-ons | needs wrapper | ✅ direct |
| Custom Python/Node scripts | possible | trivial |

If you only ever want an HA voice assistant, Wyoming is fine. If you also want voice in OpenWebUI or to script "hey local model, summarize this audio file" outside HA, **add OpenAI-compatible servers alongside** — they don't conflict (different ports).
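The "trivial" cell in that table is cheap to demonstrate: a minimal sketch driving both Path B servers (described below) from a plain Python script with the stock `openai` client. The host and ports match the Path B compose entries; the model and voice identifiers are assumptions to check against each server's `/v1/models`.

```python
# Sketch: both servers speak the OpenAI audio API shape, so the stock client
# works once base_url points at the box. Ports per the Path B compose entries;
# model ids ("large-v3-turbo", "kokoro") are assumptions (check /v1/models).
from openai import OpenAI

stt = OpenAI(base_url="http://framework:8001/v1", api_key="unused")  # local servers ignore the key
tts = OpenAI(base_url="http://framework:8002/v1", api_key="unused")

# STT: POST /v1/audio/transcriptions, same call shape as the hosted API.
with open("meeting.wav", "rb") as f:
    transcript = stt.audio.transcriptions.create(model="large-v3-turbo", file=f)
print(transcript.text)

# TTS: POST /v1/audio/speech with a Kokoro voice.
speech = tts.audio.speech.create(model="kokoro", voice="af_bella", input=transcript.text)
with open("reply.wav", "wb") as out:
    out.write(speech.content)
```

These are the same two URLs OpenWebUI gets pointed at in Path B; anything that speaks the OpenAI audio API can share them.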
## Use cases for this stack

In rough order of reach for a single-user homelab:

1. **Home Assistant voice assistant.** Point HA's Assist pipeline at Whisper:10300 and Piper:10200, add `wyoming-openwakeword`, plug in a Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). Self-hosted Alexa replacement that talks to your local LLM.
2. **Voice OpenWebUI.** With OpenAI-compatible STT/TTS servers, you can speak to Qwen3-Coder in the browser and hear the response. OpenWebUI has the hooks; Wyoming alone doesn't, hence the need for a wrapper.
3. **Standalone transcription.** Drop a meeting recording on the box, get a transcript out. Pair Whisper with WhisperX for diarization ("who said what") and timestamps. Useful for podcast transcription, interview indexing, lecture notes.
4. **TTS notifications and announcements.** Pipe automation events through Piper (or Kokoro). Voice notifications throughout the house if you have HA + speakers.
5. **Voice agent loops.** Mic input → Whisper → local LLM → Piper / Kokoro → speakers (sketched after this list). The "talk to your local assistant" loop that neither Claude Code nor ChatGPT Code Interpreter natively does. Roadmap.md item #8.
6. **Meeting bots.** With Whisper + a custom integration, drop a bot into Zoom/Google Meet, get a real-time transcript and a post-meeting summary from the local LLM.
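Item 5 end-to-end, as a push-to-talk sketch rather than a definitive implementation. It assumes the Path B servers on ports 8001/8002, ALSA's `arecord`/`aplay` for audio I/O, and an OpenAI-compatible LLM server already running on the box; the LLM port (8000) and model id (`qwen3-coder`) are placeholders to adjust.

```python
# Sketch of use case 5: push-to-talk agent loop on a Linux box with ALSA.
# Endpoints and model names are assumptions matching Path B, plus a local
# OpenAI-compatible LLM server (port and model id are placeholders).
import subprocess

import requests

STT = "http://framework:8001/v1/audio/transcriptions"
LLM = "http://framework:8000/v1/chat/completions"  # assumed local LLM port
TTS = "http://framework:8002/v1/audio/speech"

while True:
    input("Press Enter, then speak for 5 seconds...")
    subprocess.run(
        ["arecord", "-d", "5", "-f", "S16_LE", "-r", "16000", "-c", "1", "/tmp/in.wav"],
        check=True,
    )

    # Whisper: multipart upload, OpenAI transcription shape.
    with open("/tmp/in.wav", "rb") as f:
        text = requests.post(STT, files={"file": f}, data={"model": "large-v3-turbo"}).json()["text"]
    print(">>", text)

    # Local LLM: one-shot chat completion (no history kept in this sketch).
    reply = requests.post(LLM, json={
        "model": "qwen3-coder",  # assumed model id
        "messages": [{"role": "user", "content": text}],
    }).json()["choices"][0]["message"]["content"]
    print("<<", reply)

    # Kokoro: synthesize the reply and play it.
    audio = requests.post(TTS, json={"model": "kokoro", "voice": "af_bella", "input": reply})
    with open("/tmp/out.wav", "wb") as out:
        out.write(audio.content)
    subprocess.run(["aplay", "/tmp/out.wav"], check=True)
```

No wake word and no conversation history here; use case 1's `wyoming-openwakeword` and a rolling `messages` list are the obvious next steps.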
## Concrete upgrade paths

### Path A — Stay Wyoming, bump model sizes (lowest effort)

Edit the existing compose files to upgrade the models:

- `compose/whisper.yml`: `--model tiny-int8` → `--model large-v3-turbo` (or `medium-int8` for a middle step). Cost: ~810 MB (turbo) / 1.5 GB (medium) on disk, significant quality jump on dictation.
- `compose/piper.yml`: `en_US-lessac-medium` → another voice. Browse samples at https://rhasspy.github.io/piper-samples/.

`./run.sh` from the Mac, recreate the containers on the box. Same Wyoming protocol, better quality.

### Path B — Add OpenAI-compatible servers alongside (mid effort)

Stand up `faster-whisper-server` + `Kokoro-FastAPI` as additional compose stacks (different ports). Wyoming stays for HA; the OpenAI-API servers serve OpenWebUI / OpenCode / scripts. Both can pull from the same model files if you want to avoid duplication.

Sample compose entries to add:

- `compose/faster-whisper.yml` — `fedirz/faster-whisper-server`, port 8001 with a `/v1/audio/transcriptions` endpoint.
- `compose/kokoro.yml` — `ghcr.io/remsky/kokoro-fastapi-cpu` (or `-gpu`), port 8002 with `/v1/audio/speech`.

Then point OpenWebUI at them: Settings → Audio → STT URL `http://framework:8001/v1`, TTS URL `http://framework:8002/v1`, voice `af_bella` (Kokoro default).

### Path C — Full SOTA conversational voice (higher effort)

For "talk to my agent" with the best prosody available in OSS, run **Sesame CSM-1B** as the TTS instead of Kokoro/Piper. Heavier (~3–4 GB GPU at inference, ~5 GB on disk), but conversational quality is markedly better. No first-class OpenAI-API wrapper exists yet — expect a custom HTTP server or use the reference inference scripts (a sketch of such a shim closes this page). Pair with `whisper-large-v3-turbo` on STT for the matching jump in input quality.

### Path D — Voice cloning (specialized)

If you want a custom voice, **F5-TTS** or **OpenVoice v2** clone from a 5–10 second reference clip. F5 has the best quality; OpenVoice has the better license (MIT). Self-hosted; no first-party HTTP server — script-driven for now.

## What I'd actually do

Path A as the immediate quick win (5 minutes), Path B as the next weekend project (it gives you voice in the existing browser UIs without forcing the Wyoming dependency on every consumer). Skip C/D until you have a concrete use case that demands them.

**Status:** Path B is now live in pyinfra — `compose/faster-whisper.yml` and `compose/kokoro.yml` deploy alongside Wyoming. Point OpenWebUI's Audio settings at them per the "Wiring OpenWebUI for voice" section in [`pyinfra/framework/README.md`](pyinfra/framework/README.md). The Conduit Android app inherits voice transitively through OpenWebUI.

## Reference catalogs

- [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
- [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) — STT WER benchmarks
- [Awesome STT / TTS](https://github.com/topics/text-to-speech) on GitHub
- [Rhasspy / Wyoming](https://github.com/rhasspy) ecosystem — Home Assistant voice
- [Kokoro repo](https://github.com/hexgrad/kokoro) and [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
- [Sesame CSM-1B](https://github.com/SesameAILabs/csm)
- [faster-whisper-server](https://github.com/fedirz/faster-whisper-server)
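A closing sketch for Path C, promised above: roughly what the custom HTTP wrapper around CSM could look like if it mimics the OpenAI `/v1/audio/speech` shape, so clients such as OpenWebUI can point at it unchanged. Hypothetical throughout; `load_csm_1b` and `generate` follow the reference inference scripts in the CSM repo and should be verified against the current code.

```python
# Hypothetical shim: expose Sesame CSM behind an OpenAI-shaped
# /v1/audio/speech endpoint. load_csm_1b / generate mirror the repo's
# reference scripts; verify against the current SesameAILabs/csm code.
import io

import torchaudio
from fastapi import FastAPI, Response
from pydantic import BaseModel

from generator import load_csm_1b  # from the csm repo checkout

app = FastAPI()
generator = load_csm_1b(device="cuda")  # ~3-4 GB at inference, per Path C


class SpeechRequest(BaseModel):
    input: str           # text to speak (OpenAI field name)
    model: str = "csm-1b"
    voice: str = "0"     # CSM uses integer speaker ids, not named voices


@app.post("/v1/audio/speech")
def speech(req: SpeechRequest) -> Response:
    # No conversational context in this sketch; CSM shines when you pass
    # prior turns, which the OpenAI request shape has no field for.
    audio = generator.generate(
        text=req.input, speaker=int(req.voice), context=[], max_audio_length_ms=30_000
    )
    buf = io.BytesIO()
    torchaudio.save(buf, audio.unsqueeze(0).cpu(), generator.sample_rate, format="wav")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```

Carrying real conversation context through that wrapper is exactly the gap the OpenAI request shape leaves open, and part of why Path C is filed under "higher effort".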