# localgenai

Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix
Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over
Tailscale. Coding agents, monitoring, voice — all self-hosted.

## Topology

```
┌─────────────────┐    Tailscale    ┌──────────────────────────────┐
│  Mac            │ ──────────────▶ │  Framework Desktop           │
│                 │                 │  (Ubuntu 26.04, gfx1151 iGPU)│
│  • OpenCode     │                 │  • Inference: Ollama,        │
│    + Phoenix    │                 │    llama.cpp, vLLM           │
│    bridge       │                 │  • Agent UIs: OpenWebUI,     │
│  • install.sh   │                 │    OpenHands                 │
│                 │                 │  • Monitoring: Beszel,       │
│                 │                 │    OpenLIT, Phoenix          │
└─────────────────┘                 └──────────────────────────────┘
```

## Repo layout

| Path | What's there |
|---|---|
| [`pyinfra/`](pyinfra/) | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See [`pyinfra/framework/README.md`](pyinfra/framework/README.md). |
| [`opencode/`](opencode/) | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| [`StrixHaloSetup.md`](StrixHaloSetup.md) | Original phased bring-up plan for the box. |
| [`StrixHaloMemory.md`](StrixHaloMemory.md) | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| [`TODO.md`](TODO.md) | Open questions and follow-ups. |
| [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. |
| [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. |

|
## Service ports (on `framework`)

| Port | Service | Notes |
|---|---|---|
| `7575` | **Homepage** | **Front door — start here.** Tile per service with live widgets. |
| `8080` | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| `8000` | vLLM | ROCm; gfx1151 support varies |
| `11434` | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| `3000` | OpenWebUI | ChatGPT-style UI in front of Ollama |
| `3001` | OpenLIT | LLM fleet metrics dashboard |
| `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
| `4317` | Phoenix OTLP/gRPC | Trace ingestion |
| `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| `8001` | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
| `8090` | Beszel | Host + container + AMD GPU dashboard |
| `8880` | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
| `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
| `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |

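A quick smoke test of the two OpenAI-style voice endpoints from any machine on the tailnet. The routes are the standard OpenAI audio paths; the model and voice identifiers below (`kokoro`, `af_heart`, `large-v3-turbo`) are assumptions — check each container's docs or its `/v1/models` listing for the exact names your deploy exposes.

```shell
# Smoke-test the OpenAI-compatible voice endpoints over Tailscale.
# Model/voice names are assumed; confirm against each service's /v1/models.
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "voice": "af_heart", "input": "Voice stack is up."}' \
  -o /tmp/tts-check.mp3

# Round-trip: feed the generated audio back through faster-whisper STT.
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@/tmp/tts-check.mp3 -F model=large-v3-turbo
```

OpenWebUI takes the same base URLs in its Audio settings (STT and TTS each set to an OpenAI-compatible engine pointed at `http://framework:8001/v1` and `http://framework:8880/v1`); exact menu names vary by version.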
|
## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel && docker compose up -d   # then add system per beszel.yml
```

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```

Send a prompt; watch it land in Phoenix at <http://framework:6006>.

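Before wiring clients up, it's worth confirming the core services answer over Tailscale. A minimal sketch, assuming the `framework` hostname from the topology above (`/api/tags` is Ollama's model-list endpoint):

```shell
# Sanity-check the stack from the Mac before sending real prompts.
curl -s http://framework:11434/api/tags                            # Ollama: pulled models
curl -s -o /dev/null -w '%{http_code}\n' http://framework:6006     # Phoenix UI
curl -s -o /dev/null -w '%{http_code}\n' http://framework:3000     # OpenWebUI
```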
## Why this stack

- **Bandwidth, not VRAM, is the ceiling on Strix Halo.** 256 GB/s memory
  bandwidth means MoE models with small active-parameter counts
  (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate. See
  `StrixHaloMemory.md`.
- **AMD APU monitoring is broken upstream.** `amd-smi` returns N/A on
  gfx1151, which kills the official Prometheus/Grafana exporters. Beszel
  reads sysfs directly and works. See the monitoring section of
  `pyinfra/framework/README.md`.
- **Two-layer observability.** OpenLIT is fleet metrics across sessions;
  Phoenix is the per-trace waterfall for "what did this one prompt do."
- **Reproducible.** Every host-side config lives in `pyinfra/`; every
  Mac-side config in `opencode/install.sh`. Re-running either is safe.
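
The sysfs counters the monitoring point above refers to can be read by hand on the box. These are the standard amdgpu sysfs nodes; the `card0` index is an assumption and may differ on your kernel:

```shell
# Raw amdgpu telemetry straight from sysfs — no amd-smi required.
cat /sys/class/drm/card0/device/gpu_busy_percent    # GPU load, percent
cat /sys/class/drm/card0/device/mem_info_vram_used  # carved-out VRAM, bytes
cat /sys/class/drm/card0/device/mem_info_gtt_used   # GTT (UMA spill), bytes
```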