# localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix
Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over
Tailscale. Coding agents, monitoring, voice — all self-hosted.
## Topology
```
┌─────────────────┐      Tailscale      ┌──────────────────────────────┐
│  Mac            │ ──────────────────▶ │  Framework Desktop           │
│                 │                     │  (Ubuntu 26.04, gfx1151 iGPU)│
│  • OpenCode     │                     │  • Inference: Ollama,        │
│    + Phoenix    │                     │      llama.cpp, vLLM         │
│      bridge     │                     │  • Agent UIs: OpenWebUI,     │
│  • install.sh   │                     │      OpenHands               │
│                 │                     │  • Monitoring: Beszel,       │
│                 │                     │      OpenLIT, Phoenix        │
└─────────────────┘                     └──────────────────────────────┘
```
## Repo layout
| Path | What's there |
|---|---|
| [`pyinfra/`](pyinfra/) | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See [`pyinfra/framework/README.md`](pyinfra/framework/README.md). |
| [`opencode/`](opencode/) | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| [`StrixHaloSetup.md`](StrixHaloSetup.md) | Original phased bring-up plan for the box. |
| [`StrixHaloMemory.md`](StrixHaloMemory.md) | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| [`TODO.md`](TODO.md) | Open questions and follow-ups. |
| [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. |
| [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. |
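
For context on how the `pyinfra/` deploy is driven, here is pyinfra's generic CLI pattern only (a sketch; the repo's actual inventory and deploy file names are documented in [`pyinfra/framework/README.md`](pyinfra/framework/README.md)):

```sh
# Generic pyinfra invocation: pyinfra <inventory> <operations/deploy file>
# File names below are placeholders, not this repo's actual layout.
cd pyinfra
pyinfra inventory.py deploy.py
```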
## Service ports (on `framework`)
| Port | Service | Notes |
|---|---|---|
| `7575` | **Homepage** | **Front door — start here.** Tile per service with live widgets. |
| `8080` | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| `8000` | vLLM | ROCm; gfx1151 support varies |
| `11434` | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| `3000` | OpenWebUI | ChatGPT-style UI in front of Ollama |
| `3001` | OpenLIT | LLM fleet metrics dashboard |
| `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
| `4317` | Phoenix OTLP/gRPC | Trace ingestion |
| `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| `8001` | faster-whisper | STT (OpenAI-compatible API) — large-v3-turbo, for OpenWebUI/Conduit |
| `8090` | Beszel | Host + container + AMD GPU dashboard |
| `8880` | Kokoro | TTS (OpenAI-compatible API) — Kokoro-82M, for OpenWebUI/Conduit |
| `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
| `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
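
A quick way to confirm the core endpoints are reachable from the Mac over Tailscale (a minimal sketch; `framework` is the Tailscale hostname from the topology above, and `/health` assumes llama.cpp's `llama-server` defaults):

```sh
curl -s  http://framework:8080/health            # llama.cpp server health
curl -s  http://framework:11434/api/tags         # Ollama: list pulled models
curl -s  http://framework:8000/v1/models         # vLLM OpenAI-compatible model list
curl -sI http://framework:6006 | head -n 1       # Phoenix UI responds
```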
## Quick start
On the box (after the pyinfra deploy):
```sh
cd /srv/docker/ollama && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel && docker compose up -d # then add system per beszel.yml
```
On the Mac:
```sh
cd opencode && ./install.sh
opencode
```
Send a prompt; watch it land in Phoenix at <http://framework:6006>.
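
To sanity-check the model itself without opencode in the loop, a sketch against Ollama's native generate API (this bypasses the Phoenix bridge, which only traces opencode traffic):

```sh
curl -s http://framework:11434/api/generate -d '{
  "model": "qwen3-coder:30b",
  "prompt": "Write a one-line Python hello world.",
  "stream": false
}'
```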
## Coding workflows — state of the stack (2026-05-10)
| Component | Status | Notes |
|---|---|---|
| **Primary**: opencode → `framework/qwen3-coder:30b` (Ollama) | Working daily-driver | Full MCP toolbox (playwright, searxng, serena, basic-memory, sequential-thinking, task-master). Phoenix bridge emits OTel traces for every call. |
| **Secondary**: opencode → `framework-vllm/kimi-linear` (vLLM) | Configured, experimental | `tool_call: false` — Kimi-Linear is a research model and isn't strongly tool-trained. Long-context chat only. Switch via `/model framework-vllm/kimi-linear`. |
| **Chat front-end for Kimi-Linear**: OpenWebUI | Working | vLLM endpoint wired via `OPENAI_API_BASE_URLS`. Pick `kimi-linear` from the model selector in OpenWebUI. |
| **Voice loop**: faster-whisper + Kokoro | Deployed | OpenAI-compatible endpoints; not yet wired into a hands-free chat loop. |
| **Observability**: Phoenix per-trace + OpenLIT fleet | Working | All opencode prompts traced via `.opencode/plugin/phoenix-bridge.js`. |
| **In flight**: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; tmux-side companion to Phoenix's post-hoc traces. See [`oc-tree/NEXT_STEPS.md`](oc-tree/NEXT_STEPS.md). |
| **In flight**: Kimi-Linear context ramp | P0 done at 32K, BIOS+GTT recipe applied | Next: ramp `--max-model-len` toward 1M and benchmark. See [`kimi-linear/NEXT_STEPS.md`](kimi-linear/NEXT_STEPS.md). |
| **Configured but not deployed**: VS Code Continue | YAML ready | Drop [`vscode-continue-config.yml`](vscode-continue-config.yml) into `~/.continue/config.yaml`; `ollama pull nomic-embed-text` on the box for `@codebase` indexing. |
| **Planned next**: ComfyUI for image generation | Phase plan written | Flux.1-Dev via kyuz0 toolbox, same pyinfra pattern as kimi-linear. See task list (CF-P0..P4). |
| **Roadmap-only** (not started): Aider, Cline, OpenHands, LiteLLM, Multica | — | See `Roadmap.md`. |
**Single-line summary**: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; oc-tree and ComfyUI are the next two infra additions.
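
Two rows above have one-line setup steps worth spelling out (a sketch; exact compose file locations and URLs may differ in your deploy):

```sh
# OpenWebUI → vLLM: the OpenWebUI compose environment carries something like
#   OPENAI_API_BASE_URLS=http://framework:8000/v1   (or the compose-internal service name)
# so kimi-linear shows up in OpenWebUI's model selector.

# VS Code Continue: pull the embedding model used for @codebase indexing (run on the box).
ollama pull nomic-embed-text
```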
## Why this stack
- **Bandwidth, not VRAM, is the ceiling on Strix Halo.** 256 GB/s memory
bandwidth means MoE models with small active-params (Qwen3-Coder-30B-A3B,
Kimi-Linear-48B-A3B) dominate. See `StrixHaloMemory.md`.
- **AMD APU monitoring is broken upstream.** `amd-smi` returns N/A on
gfx1151, which kills the official Prometheus/Grafana exporters. Beszel
reads sysfs directly and works (see the sysfs sketch after this list).
See the monitoring section of `pyinfra/framework/README.md`.
- **Two-layer observability.** OpenLIT is fleet metrics across sessions;
Phoenix is per-trace waterfall for "what did this one prompt do."
- **Reproducible.** Every host-side config lives in `pyinfra/`; every
Mac-side config in `opencode/install.sh`. Re-running either is safe.
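
For reference, the kind of sysfs reads Beszel relies on (illustrative only; Beszel's exact collection path may differ, and the `card*` index depends on the kernel/driver):

```sh
# GPU busy % and unified-memory (GTT) usage straight from the amdgpu sysfs interface
cat /sys/class/drm/card*/device/gpu_busy_percent
cat /sys/class/drm/card*/device/mem_info_gtt_used
cat /sys/class/drm/card*/device/mem_info_gtt_total
```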