# Voice models — landscape and upgrade paths

State of the OSS speech stack as of May 2026, with concrete upgrade
options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s
bandwidth — plenty of headroom alongside a 30 GB code model).

## What's deployed today

- **Whisper** (`tcp://framework:10300`) — `rhasspy/wyoming-whisper`,
  Wyoming protocol, model `tiny-int8` (~75 MB). Picked for HA defaults,
  not quality.
- **Piper** (`tcp://framework:10200`) — `rhasspy/wyoming-piper`,
  Wyoming protocol, voice `en_US-lessac-medium` (~63 MB, VITS-based).

These are **convenient HA defaults**, not state-of-the-art. The Wyoming
protocol is from the Rhasspy / Home Assistant ecosystem; it's good for
HA Assist integration but doesn't expose an OpenAI-compatible HTTP API,
so direct integration with OpenWebUI / OpenCode / agentic stacks
requires a wrapper.

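The wrapper requirement comes down to framing: Wyoming is a raw-socket protocol that interleaves JSON event headers with binary audio, not HTTP. A minimal sketch of that style of framing, simplified and assumption-laden (the real `wyoming` library's wire format has additional fields, e.g. a separate `data_length`):

```python
import io
import json

def write_event(stream, event_type, data, payload=b""):
    # One JSON header per line, then the raw payload bytes (if any).
    header = {"type": event_type, "data": data}
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode() + b"\n")
    stream.write(payload)

def read_event(stream):
    # Mirror image: parse the header line, then read exactly payload_length bytes.
    header = json.loads(stream.readline())
    payload = stream.read(header.get("payload_length", 0))
    return header, payload

buf = io.BytesIO()
write_event(buf, "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00" * 320)
buf.seek(0)
header, payload = read_event(buf)
print(header["type"], len(payload))  # audio-chunk 320
```

An OpenAI-compatible shim has to translate between this event stream and a single multipart HTTP upload — that translation layer is the wrapper.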
## Speech-to-text (STT) landscape

| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
|---|---|---|---|---|---|---|
| **whisper tiny** | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| **whisper small** | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| **whisper large-v3** | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| **whisper large-v3-turbo** | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | **Best Whisper for this box.** Quality of large-v3 at small-model speeds. |
| **faster-whisper** (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimplementation. Drop-in via `fedirz/faster-whisper-server` (OpenAI-compatible). |
| **distil-whisper** | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| **WhisperX** | m-bain (2024) | + Whisper | 5 + alignment | slow | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| **Parakeet-TDT-0.6B** | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| **Canary-1B** | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-NC-4.0 | Multilingual; the non-commercial license is the catch. |
| **Moonshine** | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |

**Recommendation for this box:** `whisper-large-v3-turbo` via
`faster-whisper-server`. Best quality/speed/ecosystem balance, and the
OpenAI API compatibility lets non-Wyoming clients use it directly.

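Once an OpenAI-compatible server is up (Path B below), transcription is a plain multipart POST. A stdlib-only sketch that builds the request without sending it; the host, port, and model identifier are assumptions, and in practice curl or the `openai` client is less fiddly:

```python
import urllib.request

STT_BASE = "http://framework:8001/v1"  # assumed faster-whisper-server endpoint

def transcription_request(audio_bytes, model="large-v3-turbo"):
    # Hand-rolled multipart/form-data body for /audio/transcriptions.
    boundary = "whisper-boundary"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="clip.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    body = head + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{STT_BASE}/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

req = transcription_request(b"...")  # placeholder, not a real WAV
print(req.full_url)  # http://framework:8001/v1/audio/transcriptions
# urllib.request.urlopen(req) would return JSON with a "text" field.
```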
## Text-to-speech (TTS) landscape

| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
|---|---|---|---|---|---|---|
| **Piper** (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly; prosody is okay but recognizably synthetic. Default in HA Assist. |
| **Kokoro-82M** | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | **Sweet spot for this box.** Tiny model, surprisingly natural. OpenAI-API server: `Kokoro-FastAPI`. |
| **Sesame CSM-1B** | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| **F5-TTS** | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice-cloning state of the art in OSS. Non-commercial license. |
| **XTTS-v2** | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| **MeloTTS** | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| **StyleTTS2** | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| **OpenVoice v2** | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality; more permissive license than F5/XTTS. |

**Recommendation for this box:** **Kokoro** for general TTS (fast,
high quality, permissive license), with **Sesame CSM** layered alongside
for conversational use cases.

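For the TTS side, the OpenAI-style `/v1/audio/speech` call is a small JSON POST whose response body is the audio itself. A hedged stdlib sketch; the host, port, and the `model`/`voice` values are assumptions taken from the Path B setup below:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://framework:8002/v1/audio/speech",  # assumed Kokoro-FastAPI endpoint
    data=json.dumps({
        "model": "kokoro",  # model name as exposed by Kokoro-FastAPI (assumed)
        "voice": "af_bella",
        "input": "Deployment finished without errors.",
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)  # http://framework:8002/v1/audio/speech
# With the server running:
# with urllib.request.urlopen(req) as resp, open("announce.mp3", "wb") as f:
#     f.write(resp.read())  # response body is the audio stream
```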
Commercial offerings that nothing OSS reliably beats yet: **ElevenLabs**
(general quality / voice cloning) and **Cartesia Sonic** (real-time
streaming latency).

## Wyoming vs OpenAI-compatible servers

The protocol choice matters depending on what you want to drive the
voice stack with.

| Consumer | Wyoming | OpenAI-compatible |
|---|---|---|
| Home Assistant Assist | ✅ native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | ✅ native | ❌ |
| OpenWebUI Voice (built-in) | ❌ | ✅ |
| OpenCode / OpenHands voice add-ons | needs wrapper | ✅ direct |
| Custom Python/Node scripts | possible | trivial |

If you only ever want the HA voice assistant, Wyoming is fine. If you also
want voice in OpenWebUI or to script "hey local model, summarize this
audio file" outside HA, **add OpenAI-compatible servers alongside** —
they don't conflict (different ports).

## Use cases for this stack

In rough order of reach for a single-user homelab:

1. **Home Assistant voice assistant.** Point HA's Assist pipeline at
   Whisper:10300 and Piper:10200, add `wyoming-openwakeword`, and plug in a
   Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). A self-hosted
   Alexa replacement that talks to your local LLM.

2. **Voice OpenWebUI.** With OpenAI-compatible STT/TTS servers, you can
   speak to Qwen3-Coder in the browser and hear the response. OpenWebUI
   has the hooks; Wyoming alone doesn't, hence the need for a wrapper.

3. **Standalone transcription.** Drop a meeting recording on the box,
   get a transcript out. Pair Whisper with WhisperX for diarization
   ("who said what") and timestamps. Useful for podcast transcription,
   interview indexing, and lecture notes.

4. **TTS notifications and announcements.** Pipe automation events
   through Piper (or Kokoro). Voice notifications throughout the house
   if you have HA + speakers.

5. **Voice agent loops.** Mic input → Whisper → local LLM → Piper /
   Kokoro → speakers. The "talk to your local assistant" loop that
   neither Claude Code nor ChatGPT Code Interpreter natively does.
   Roadmap.md item #8.

6. **Meeting bots.** With Whisper + a custom integration, drop a bot
   into Zoom/Google Meet and get a real-time transcript and a post-meeting
   summary from the local LLM.

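The voice-agent loop in item 5 is mostly glue. A sketch of its shape with hypothetical stand-in callables; `transcribe`, `complete`, and `speak` are placeholders for HTTP calls to faster-whisper-server, the local LLM, and Kokoro-FastAPI respectively:

```python
def agent_turn(audio_in, transcribe, complete, speak):
    """One loop iteration: mic audio -> text -> LLM reply -> spoken audio."""
    text = transcribe(audio_in)   # e.g. POST /v1/audio/transcriptions
    reply = complete(text)        # e.g. POST /v1/chat/completions
    return speak(reply)           # e.g. POST /v1/audio/speech

# Trivial stand-ins just to show the data flow:
out = agent_turn(
    b"fake-mic-bytes",
    transcribe=lambda a: "what's the uptime?",
    complete=lambda t: f"You asked: {t}",
    speak=lambda r: r.encode(),
)
print(out.decode())  # You asked: what's the uptime?
```

In a real loop you would wrap this in a wake-word trigger and stream the returned audio to the speakers.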
## Concrete upgrade paths
### Path A — Stay Wyoming, bump model sizes (lowest effort)

Edit the existing compose files to upgrade the models:

- `compose/whisper.yml`: `--model tiny-int8` → `--model large-v3-turbo`
  (or `medium-int8` as a middle step). Cost: ~810 MB / ~1.5 GB on disk,
  with a significant quality jump on dictation.
- `compose/piper.yml`: `en_US-lessac-medium` → another voice. Browse at
  <https://github.com/rhasspy/piper/blob/master/VOICES.md>.

Run `./run.sh` from the Mac to recreate the containers on the box. Same
Wyoming protocol, better quality.

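A sketch of what the relevant `compose/whisper.yml` lines might look like afterward; the service name and flags are assumed to mirror the existing file, and only the `--model` value actually changes:

```yaml
services:
  whisper:
    image: rhasspy/wyoming-whisper
    # was: --model tiny-int8
    command: --model large-v3-turbo --language en
    ports:
      - "10300:10300"
```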
### Path B — Add OpenAI-compatible servers alongside (mid effort)

Stand up `faster-whisper-server` + `Kokoro-FastAPI` as additional
compose stacks (on different ports). Wyoming stays for HA; the OpenAI-API
servers serve OpenWebUI / OpenCode / scripts. Both can pull from the same
model files if you want to avoid duplication.

Sample compose entries to add:

- `compose/faster-whisper.yml` — `fedirz/faster-whisper-server`,
  port 8001 with the `/v1/audio/transcriptions` endpoint.
- `compose/kokoro.yml` — `ghcr.io/remsky/kokoro-fastapi-cpu` (or `-gpu`),
  port 8002 with `/v1/audio/speech`.

Then in OpenWebUI: Settings → Audio → STT URL `http://framework:8001/v1`,
TTS URL `http://framework:8002/v1`, voice `af_bella` (Kokoro default).

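A sketch of the two added stacks. Image tags and container-side ports are assumptions (faster-whisper-server is believed to listen on 8000 internally and Kokoro-FastAPI on 8880); verify against the upstream READMEs before deploying:

```yaml
services:
  faster-whisper:
    image: fedirz/faster-whisper-server:latest-cpu
    ports:
      - "8001:8000"   # OpenAI-style /v1/audio/transcriptions
  kokoro:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8002:8880"   # OpenAI-style /v1/audio/speech
```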
### Path C — Full SOTA conversational voice (higher effort)

For "talk to my agent" with the best prosody available in OSS, run
**Sesame CSM-1B** as the TTS instead of Kokoro/Piper. It's heavier
(~3–4 GB of GPU memory at inference, ~5 GB on disk), but conversational
quality is markedly better. No first-class OpenAI-API wrapper exists
yet — expect to write a custom HTTP server or use the reference
inference scripts.

Pair it with `whisper-large-v3-turbo` on STT for the matching jump in
input quality.

### Path D — Voice cloning (specialized)

If you want a custom voice, **F5-TTS** or **OpenVoice v2** can clone from
a 5–10 second reference clip. F5 has the best quality; OpenVoice has
the better license (MIT). Both are self-hosted with no first-party HTTP
server — script-driven for now.

## What I'd actually do

Path A is the immediate quick win (5 minutes); Path B is the next
weekend project (it gives you voice in the existing browser UIs without
forcing the Wyoming dependency on every consumer). Skip C/D until you
have a concrete use case that demands them.

**Status:** Path B is now live in pyinfra — `compose/faster-whisper.yml`
and `compose/kokoro.yml` deploy alongside Wyoming. Point OpenWebUI's
Audio settings at them per the "Wiring OpenWebUI for voice" section in
[`pyinfra/framework/README.md`](pyinfra/framework/README.md). The
Conduit Android app inherits voice transitively through OpenWebUI.

## Reference catalogs

- [Hugging Face TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
- [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) — STT WER benchmarks
- [`text-to-speech` topic](https://github.com/topics/text-to-speech) on GitHub
- [Rhasspy / Wyoming](https://github.com/rhasspy) ecosystem — Home Assistant voice
- [Kokoro repo](https://github.com/hexgrad/kokoro) and [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
- [Sesame CSM-1B](https://github.com/SesameAILabs/csm)
- [faster-whisper-server](https://github.com/fedirz/faster-whisper-server)