
# Voice models — landscape and upgrade paths
State of the OSS speech stack as of May 2026, with concrete upgrade
options for the Framework Desktop (Strix Halo iGPU, 64 GB UMA, 256 GB/s
bandwidth — plenty of headroom alongside a 30 GB code model).
## What's deployed today
- **Whisper** (`tcp://framework:10300`) — `rhasspy/wyoming-whisper`,
Wyoming protocol, model `tiny-int8` (~75 MB). Chosen because it's the
HA default, not for quality.
- **Piper** (`tcp://framework:10200`) — `rhasspy/wyoming-piper`,
Wyoming protocol, voice `en_US-lessac-medium` (~63 MB, VITS-based).
These are **convenient HA defaults**, not state-of-the-art. The Wyoming
protocol is from the Rhasspy / Home Assistant ecosystem; it's good for
HA Assist integration but doesn't expose an OpenAI-compatible HTTP API,
so direct integration with OpenWebUI / OpenCode / agentic stacks
requires a wrapper.
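To make the protocol gap concrete: Wyoming speaks newline-delimited
JSON events over raw TCP, not HTTP. A rough probe of the deployed
Whisper service; the framing details here are an assumption from the
protocol's documented shape, not a tested client:
```python
# Rough sketch: probe the Wyoming Whisper service directly over TCP.
# Wyoming is newline-delimited JSON events, not HTTP, which is why
# OpenAI-style clients need a wrapper in front of it.
import json
import socket

with socket.create_connection(("framework", 10300)) as sock:
    # Ask the service to describe itself.
    sock.sendall(json.dumps({"type": "describe"}).encode() + b"\n")
    # The first line back should be an event header such as
    # {"type": "info", ...}; larger payloads follow on later bytes.
    header = sock.makefile("rb").readline()
    print(json.loads(header).get("type"))
```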
## Speech-to-text (STT) landscape
| Model | Provenance | Size | Quality (1–5) | Speed | License | Notes |
|---|---|---|---|---|---|---|
| **whisper tiny** | OpenAI (2022) | 75 MB | 2 | very fast | MIT | What we have. Commands OK; dictation rough. |
| **whisper small** | OpenAI | 480 MB | 3 | fast | MIT | Reasonable middle ground. |
| **whisper large-v3** | OpenAI (Nov 2023) | 3.1 GB | 5 | slow | MIT | Long the gold standard. |
| **whisper large-v3-turbo** | OpenAI (Oct 2024) | 810 MB | 5 (≈v3) | ~6× v3 | MIT | **Best Whisper for this box.** Quality of large-v3 at small-model speeds. |
| **faster-whisper** (any model) | SYSTRAN | same | same as base | ~4× faster | MIT | CTranslate2 reimpl. Drop-in via `fedirz/faster-whisper-server` (OpenAI-compatible). |
| **distil-whisper** | HF (2024) | ~750 MB | 4 | fast | MIT | Distilled v2/v3. English-only. Lower latency than turbo. |
| **WhisperX** | m-bain (2024) | + Whisper | 5 + alignment | slow | BSD | Whisper + word-level timestamps + speaker diarization. The right pick for meeting-transcription use cases. |
| **Parakeet-TDT-0.6B** | NVIDIA (2024) | ~1.2 GB | 5 | fast | Apache 2.0 | Trades blows with Whisper-large on benchmarks. Smaller ecosystem. |
| **Canary-1B** | NVIDIA (2024) | ~3.5 GB | 5 | medium | CC-BY-NC-4.0 (non-commercial) | Multilingual; license is the catch. |
| **Moonshine** | Useful Sensors (2024) | ~190 MB tiny / ~620 MB base | 4 | very fast | MIT | Optimized for edge/streaming. Newer, less mature ecosystem than Whisper. |
**Recommendation for this box:** `whisper-large-v3-turbo` via
`faster-whisper-server`. Best quality/speed/ecosystem balance, OpenAI
API compat lets non-Wyoming clients use it directly.
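Assuming the Path B stack below (faster-whisper-server on port 8001),
transcription is one multipart POST in the OpenAI audio API shape; the
exact model identifier depends on how the server is configured:
```python
# Minimal sketch: transcribe a clip against the OpenAI-compatible STT
# endpoint. Host/port assume the Path B compose stack; the model id is
# whatever the server was configured to load.
import requests

STT_URL = "http://framework:8001/v1/audio/transcriptions"

with open("meeting.wav", "rb") as f:
    resp = requests.post(
        STT_URL,
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"model": "large-v3-turbo"},  # server-dependent identifier
    )
resp.raise_for_status()
print(resp.json()["text"])
```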
## Text-to-speech (TTS) landscape
| Model | Provenance | Size | Quality (1–5) | Voices | License | Notes |
|---|---|---|---|---|---|---|
| **Piper** (current) | Rhasspy / Home Assistant | ~60–80 MB / voice | 3 | many pre-trained | MIT | Fast, CPU-friendly, prosody is okay but recognizably synthetic. Default in HA Assist. |
| **Kokoro-82M** | hexgrad (Jan 2025) | ~340 MB | 4–5 | ~10 English | Apache 2.0 | **Sweet spot for this box.** Tiny model, surprisingly natural. OpenAI-API server: `Kokoro-FastAPI`. |
| **Sesame CSM-1B** | Sesame AI Labs (open-sourced Mar 2025) | ~3 GB | 5 | conversational, context-aware | Apache 2.0 | Best emotional prosody in OSS. Designed for back-and-forth dialogue, not announcements. |
| **F5-TTS** | F5-TTS group (2024) | ~1.4 GB | 5 | zero-shot voice cloning from 5–10 s reference | CC-BY-NC-4.0 | Voice cloning state-of-the-art for OSS. Non-commercial license. |
| **XTTS-v2** | Coqui (2023) | ~1.8 GB | 4 | zero-shot voice cloning | Coqui PL (non-commercial) | Older but well-supported. Tortoise-TTS lineage. |
| **MeloTTS** | MyShell (2024) | ~140 MB | 4 | multilingual | MIT | Strong multilingual support. |
| **StyleTTS2** | Yinghao Aaron Li et al. (2023) | ~700 MB | 5 | many | MIT | Very natural; heavier inference path than Kokoro. |
| **OpenVoice v2** | MyShell (2024) | ~1 GB | 4 | voice cloning + style transfer | MIT | Good cloning quality, more permissive license than F5/XTTS. |
**Recommendation for this box:** **Kokoro** for general TTS (fast,
high quality, permissive license), **Sesame CSM** layered alongside
for conversational use cases.
Commercial offerings that nothing OSS reliably beats yet: **ElevenLabs**
(general quality / voice cloning) and **Cartesia Sonic** (real-time
streaming latency).
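Back on the OSS side, Kokoro is just as scriptable as the STT path. A
minimal sketch, assuming the Path B port (8002) and the `af_bella`
voice noted there; the response body is raw audio:
```python
# Minimal sketch: synthesize speech via the OpenAI-compatible TTS
# endpoint (Kokoro-FastAPI from Path B). The model field is accepted
# but loosely checked by most of these servers.
import requests

TTS_URL = "http://framework:8002/v1/audio/speech"

resp = requests.post(
    TTS_URL,
    json={"model": "kokoro", "input": "Deployment finished.",
          "voice": "af_bella"},
)
resp.raise_for_status()
# Response format defaults to mp3 on most such servers (an assumption).
with open("announce.mp3", "wb") as f:
    f.write(resp.content)
```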
## Wyoming vs OpenAI-compatible servers
The protocol choice matters depending on what you want to drive the
voice stack with.
| Consumer | Wyoming | OpenAI-compatible |
|---|---|---|
| Home Assistant Assist | ✅ native | possible via custom integration |
| Wyoming satellite hardware (ESP32-S3-Box etc.) | ✅ native | ❌ |
| OpenWebUI Voice (built-in) | ❌ | ✅ |
| OpenCode / OpenHands voice add-ons | needs wrapper | ✅ direct |
| Custom Python/Node scripts | possible | trivial |
If you only ever want the HA voice assistant, Wyoming is fine. If you
also want voice in OpenWebUI or to script "hey local model, summarize
this audio file" outside HA, **add OpenAI-compatible servers
alongside**; they don't conflict (different ports).
## Use cases for this stack
In rough order of reach for a single-user homelab:
1. **Home Assistant voice assistant.** Point HA's Assist pipeline at
Whisper:10300 and Piper:10200, add `wyoming-openwakeword`, plug a
Wyoming satellite (S3-Box, M5Stack Atom Echo, RPi). Self-hosted
Alexa replacement that talks to your local LLM.
2. **Voice in OpenWebUI.** With OpenAI-compatible STT/TTS servers, you can
speak to Qwen3-Coder in the browser and hear the response. OpenWebUI
has the hooks; Wyoming alone doesn't, hence the wrapper need.
3. **Standalone transcription.** Drop a meeting recording on the box,
get a transcript out. Pair Whisper with WhisperX for diarization
("who said what") and timestamps. Useful for podcast transcription,
interview indexing, lecture notes.
4. **TTS notifications and announcements.** Pipe automation events
through Piper (or Kokoro). Voice notifications throughout the house
if you have HA + speakers.
5. **Voice agent loops.** Mic input → Whisper → local LLM → Piper /
Kokoro → speakers. The "talk to your local assistant" loop that
neither Claude Code nor ChatGPT Code Interpreter natively does.
Roadmap.md item #8; sketched after this list.
6. **Meeting bots.** With Whisper + a custom integration, drop a bot
into Zoom/Google Meet, get a real-time transcript and post-meeting
summary from the local LLM.
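Use case 5 reduces to three HTTP calls once the OpenAI-compatible
servers are up. A minimal sketch of the middle of the loop, leaving out
the hardware-specific audio capture and playback; the LLM port and
model alias are assumptions, so substitute whatever actually serves
Qwen3-Coder on this box:
```python
# Minimal sketch of the voice agent loop (use case 5), minus audio I/O.
# STT/TTS ports come from Path B; the LLM URL and model alias are
# assumptions; point them at whatever serves Qwen3-Coder here.
import requests

STT = "http://framework:8001/v1/audio/transcriptions"
LLM = "http://framework:8000/v1/chat/completions"  # assumed LLM port
TTS = "http://framework:8002/v1/audio/speech"

def ask(clip_path: str, out_path: str = "reply.mp3") -> str:
    # 1. Speech to text.
    with open(clip_path, "rb") as f:
        text = requests.post(
            STT, files={"file": f}, data={"model": "large-v3-turbo"}
        ).json()["text"]
    # 2. Ask the local LLM.
    reply = requests.post(LLM, json={
        "model": "qwen3-coder",  # assumed alias
        "messages": [{"role": "user", "content": text}],
    }).json()["choices"][0]["message"]["content"]
    # 3. Text back to speech.
    audio = requests.post(
        TTS, json={"model": "kokoro", "input": reply, "voice": "af_bella"}
    )
    with open(out_path, "wb") as f:
        f.write(audio.content)
    return reply
```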
## Concrete upgrade paths
### Path A — Stay Wyoming, bump model sizes (lowest effort)
Edit the existing compose files to upgrade the models:
- `compose/whisper.yml`: `--model tiny-int8` → `--model large-v3-turbo`
(or `medium-int8` for a middle step). Cost: ~810 MB / 1.5 GB on disk,
significant quality jump on dictation.
- `compose/piper.yml`: `en_US-lessac-medium` → another voice. Browse at
<https://github.com/rhasspy/piper/blob/master/VOICES.md>.
`./run.sh` from the Mac, recreate the containers on the box. Same
Wyoming protocol, better quality.
### Path B — Add OpenAI-compatible servers alongside (mid effort)
Stand up `faster-whisper-server` + `Kokoro-FastAPI` as additional
compose stacks (different ports). Wyoming stays for HA, OpenAI-API
servers serve OpenWebUI / OpenCode / scripts. Both pull from the same
model files if you want to avoid duplication.
Sample compose entries to add:
- `compose/faster-whisper.yml`: `fedirz/faster-whisper-server`,
port 8001 with `/v1/audio/transcriptions` endpoint.
- `compose/kokoro.yml`: `ghcr.io/remsky/kokoro-fastapi-cpu` (or `-gpu`),
port 8002 with `/v1/audio/speech`.
Then OpenWebUI Settings → Audio → STT URL `http://framework:8001/v1`,
TTS URL `http://framework:8002/v1`, voice `af_bella` (Kokoro default).
### Path C — Full SOTA conversational voice (higher effort)
For "talk to my agent" with the best prosody available in OSS, run
**Sesame CSM-1B** as the TTS instead of Kokoro/Piper. Heavier
(~3-4 GB GPU at inference, ~5 GB on disk) but conversational quality
is markedly better. No first-class OpenAI-API wrapper exists yet —
expect to write a custom HTTP server or drive the reference inference
scripts directly.
Pair with `whisper-large-v3-turbo` on STT for the matching jump in
input quality.
### Path D — Voice cloning (specialized)
If you want a custom voice, **F5-TTS** or **OpenVoice v2** clone from
a 5–10 second reference clip. F5 has the best quality, OpenVoice has
the better license (MIT). Self-hosted; no first-party HTTP server —
script-driven for now.
## What I'd actually do
Path A as the immediate quick win (5 minutes), Path B as the next
weekend project (gives you voice in the existing browser UIs without
forcing the Wyoming dependency on every consumer). Skip C/D until you
have a concrete use case that demands them.
**Status:** Path B is now live in pyinfra — `compose/faster-whisper.yml`
and `compose/kokoro.yml` deploy alongside Wyoming. Wire OpenWebUI's
Audio settings at them per the "Wiring OpenWebUI for voice" section in
[`pyinfra/framework/README.md`](pyinfra/framework/README.md). The
Conduit Android app inherits voice transitively through OpenWebUI.
## Reference catalogs
- [Hugging Face TTS model leaderboard](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) — blind preference scores across SOTA TTS
- [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) — STT WER benchmarks
- [Awesome STT / TTS](https://github.com/topics/text-to-speech) on GitHub
- [Rhasspy / Wyoming](https://github.com/rhasspy) ecosystem — Home Assistant voice
- [Kokoro repo](https://github.com/hexgrad/kokoro) and [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI)
- [Sesame CSM-1B](https://github.com/SesameAILabs/csm)
- [faster-whisper-server](https://github.com/fedirz/faster-whisper-server)