# localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.
## Topology

```
┌─────────────────┐     Tailscale     ┌──────────────────────────────┐
│ Mac             │ ────────────────▶ │ Framework Desktop            │
│                 │                   │ (Ubuntu 26.04, gfx1151 iGPU) │
│ • OpenCode      │                   │ • Inference: Ollama,         │
│   + Phoenix     │                   │   llama.cpp, vLLM            │
│   bridge        │                   │ • Agent UIs: OpenWebUI,      │
│ • install.sh    │                   │   OpenHands                  │
│                 │                   │ • Monitoring: Beszel,        │
│                 │                   │   OpenLIT, Phoenix           │
└─────────────────┘                   └──────────────────────────────┘
```
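Before starting anything, it's worth confirming the Mac can actually reach the box over the tailnet. A minimal sketch, assuming the machine is named `framework` in Tailscale (ports are from the service table below):

```sh
# Run from the Mac. Assumes the box is `framework` on the tailnet.
tailscale ping framework                    # confirms a direct Tailscale path
curl -s http://framework:11434/api/version  # Ollama answers with its version JSON
curl -s http://framework:8080/health        # llama.cpp server health endpoint
```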
## Repo layout

| Path | What's there |
|---|---|
| `pyinfra/` | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See `pyinfra/framework/README.md`; example invocation below the table. |
| `opencode/` | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| `StrixHaloSetup.md` | Original phased bring-up plan for the box. |
| `StrixHaloMemory.md` | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| `Roadmap.md` | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| `TODO.md` | Open questions and follow-ups. |
| `testing/qwen3-coder-30b/` | Small evaluation harness for Qwen3-Coder. |
| `VoiceModels.md` | STT / TTS landscape and upgrade paths from the Wyoming defaults. |
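Bringing the box up is a normal pyinfra run. A sketch of the invocation; the actual inventory and deploy file names live in `pyinfra/framework/README.md`, so treat these as placeholders:

```sh
# Hypothetical file names — check pyinfra/framework/README.md for the real ones.
cd pyinfra
pyinfra inventory.py deploy.py   # host setup + docker-compose files onto the Framework Desktop
```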
## Service ports (on `framework`)

| Port | Service | Notes |
|---|---|---|
| 7575 | Homepage | Front door — start here. Tile per service with live widgets. |
| 8080 | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| 8000 | vLLM | ROCm; gfx1151 support varies |
| 11434 | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| 3000 | OpenWebUI | ChatGPT-style UI in front of Ollama |
| 3001 | OpenLIT | LLM fleet metrics dashboard |
| 3030 | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
| 4317 | Phoenix OTLP/gRPC | Trace ingestion |
| 6006 | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| 8001 | faster-whisper | STT (OpenAI API) — `large-v3-turbo`, for OpenWebUI/Conduit; smoke test below the table |
| 8090 | Beszel | Host + container + AMD GPU dashboard |
| 8880 | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
| 10200 | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
| 10300 | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
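Both OpenAI-compatible voice services can be smoke-tested from any machine on the tailnet with curl. A sketch, assuming model and voice names (`kokoro`, `af_bella`) that these servers commonly accept; `large-v3-turbo` is from the table above, but the whisper server may want the full HF model id, so check each container's docs if these 404:

```sh
# TTS: ask Kokoro (port 8880) for a WAV. Model/voice names are assumptions.
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "input": "Stack is up.", "voice": "af_bella", "response_format": "wav"}' \
  -o /tmp/say.wav

# STT: round-trip the same audio through faster-whisper (port 8001).
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@/tmp/say.wav -F model=large-v3-turbo
```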
## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama  && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel  && docker compose up -d   # then add system per beszel.yml
```

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```

Send a prompt; watch it land in Phoenix at `http://framework:6006`.
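If nothing lands, bypass opencode and poke Phoenix's OTLP/HTTP ingest directly. A minimal hand-rolled span (IDs and timestamps are arbitrary) that should show up in the UI under service `smoke-test`:

```sh
curl -s http://framework:6006/v1/traces \
  -H 'Content-Type: application/json' \
  -d '{
    "resourceSpans": [{
      "resource": { "attributes": [{ "key": "service.name", "value": { "stringValue": "smoke-test" } }] },
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId":  "eee19b7ec3c1b174",
          "name": "manual-ping",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano":   "1700000001000000000"
        }]
      }]
    }]
  }'
```

If the JSON encoding is rejected, the OTLP/gRPC ingest on port 4317 is the sure path.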
## Coding workflows — state of the stack (2026-05-10)

| Component | Status | Notes |
|---|---|---|
| Primary: opencode → `framework/qwen3-coder:30b` (Ollama) | Working daily-driver | Full MCP toolbox (playwright, searxng, serena, basic-memory, sequential-thinking, task-master). Phoenix bridge emits OTel traces for every call. |
| Secondary: opencode → `framework-vllm/kimi-linear` (vLLM) | Configured, experimental | `tool_call: false` — Kimi-Linear is a research model and isn't strongly tool-trained. Long-context chat only. Switch via `/model framework-vllm/kimi-linear`, or hit vLLM directly (sketch after this table). |
| Chat front-end for Kimi-Linear: OpenWebUI | Working | vLLM endpoint wired via `OPENAI_API_BASE_URLS`. Pick `kimi-linear` from the model selector in OpenWebUI. |
| Voice loop: faster-whisper + Kokoro | Deployed | OpenAI-compatible endpoints; not yet wired into a hands-free chat loop. |
| Observability: Phoenix per-trace + OpenLIT fleet | Working | All opencode prompts traced via `.opencode/plugin/phoenix-bridge.js`. |
| In flight: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; tmux-side companion to Phoenix's post-hoc traces. See `oc-tree/NEXT_STEPS.md`. |
| In flight: Kimi-Linear context ramp | P0 done at 32K, BIOS+GTT recipe applied | Next: ramp `--max-model-len` toward 1M and benchmark. See `kimi-linear/NEXT_STEPS.md`. |
| Configured but not deployed: VS Code Continue | YAML ready | Drop `vscode-continue-config.yml` into `~/.continue/config.yaml`; `ollama pull nomic-embed-text` on the box for `@codebase` indexing. |
| Planned next: ComfyUI for image generation | Phase plan written | Flux.1-Dev via kyuz0 toolbox, same pyinfra pattern as kimi-linear. See task list (CF-P0..P4). |
| Roadmap-only (not started): Aider, Cline, OpenHands, LiteLLM, Multica | — | See `Roadmap.md`. |
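With tool-calling off anyway, Kimi-Linear can also be exercised straight against vLLM's OpenAI-compatible endpoint on :8000. A sketch; the served model name depends on how vLLM was launched, and `kimi-linear` here is an assumption:

```sh
curl -s http://framework:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "kimi-linear",
    "messages": [{ "role": "user", "content": "Summarize UMA vs GTT in two sentences." }]
  }'
```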
Single-line summary: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; oc-tree and ComfyUI are the next two infra additions.
## Why this stack

- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s of memory bandwidth means MoE models with small active-parameter counts (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate. See `StrixHaloMemory.md`; a back-of-envelope sketch follows this list.
- AMD APU monitoring is broken upstream. `amd-smi` returns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works. See the monitoring section of `pyinfra/framework/README.md`.
- Two-layer observability. OpenLIT is fleet metrics across sessions; Phoenix is the per-trace waterfall for "what did this one prompt do."
- Reproducible. Every host-side config lives in `pyinfra/`; every Mac-side config in `opencode/install.sh`. Re-running either is safe.
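Why bandwidth picks the winners, in one line of arithmetic: each decoded token has to stream roughly active-params × bytes-per-weight from memory, so at 256 GB/s an A3B expert slice at ~8-bit (~3 GB per token) tops out near 85 tok/s, while a dense 30B at the same precision (~30 GB per token) is capped near 8. A rough sketch, ignoring KV-cache reads and any overlap:

```sh
# Decode ceiling ≈ bandwidth / bytes-read-per-token; numbers are rough assumptions.
BW=256                                     # GB/s, Strix Halo LPDDR5x
echo "A3B MoE  @ ~Q8: $((BW / 3)) tok/s"   # ~3 GB per token  → ~85 tok/s ceiling
echo "Dense 30B @ Q8: $((BW / 30)) tok/s"  # ~30 GB per token → ~8 tok/s ceiling
```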