de1635872ffbb47671853ef13f85e3cdfdb66f57
localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.
Topology
┌─────────────────┐ Tailscale ┌──────────────────────────────┐
│ Mac │ ───────────────▶ │ Framework Desktop │
│ │ │ (Ubuntu 26.04, gfx1151 iGPU)│
│ • OpenCode │ │ • Inference: Ollama, │
│ + Phoenix │ │ llama.cpp, vLLM │
│ bridge │ │ • Agent UIs: OpenWebUI, │
│ • install.sh │ │ OpenHands │
│ │ │ • Monitoring: Beszel, │
│ │ │ OpenLIT, Phoenix │
└─────────────────┘ └──────────────────────────────┘
Repo layout
| Path | What's there |
|---|---|
pyinfra/ |
pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See pyinfra/framework/README.md. |
opencode/ |
OpenCode config + Phoenix bridge plugin. install.sh deploys it to ~/.config/opencode/ on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
StrixHaloSetup.md |
Original phased bring-up plan for the box. |
StrixHaloMemory.md |
UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
Roadmap.md |
What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
TODO.md |
Open questions and follow-ups. |
testing/qwen3-coder-30b/ |
Small evaluation harness for Qwen3-Coder. |
VoiceModels.md |
STT / TTS landscape and upgrade paths from the Wyoming defaults. |
ImageGen.md |
Image generation workflows on the ComfyUI stack — resolution envelope, hires fix, Redux/IP-Adapter for reference-driven generation. |
Service ports (on framework)
| Port | Service | Notes |
|---|---|---|
7575 |
Homepage | Front door — start here. Tile per service with live widgets. |
8080 |
llama.cpp | Vulkan backend, --metrics for Prometheus |
8000 |
vLLM | ROCm; gfx1151 support varies |
11434 |
Ollama | ROCm with HSA_OVERRIDE_GFX_VERSION=11.0.0 |
3000 |
OpenWebUI | ChatGPT-style UI in front of Ollama |
3001 |
OpenLIT | LLM fleet metrics dashboard |
4000 |
LiteLLM | OpenAI-compatible router in front of all local backends — Bearer $LITELLM_MASTER_KEY |
3030 |
OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
4317 |
Phoenix OTLP/gRPC | Trace ingestion |
6006 |
Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also :6006/v1/traces) |
8001 |
faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
8090 |
Beszel | Host + container + AMD GPU dashboard |
8081 |
llama.cpp (Qwen3-235B) | Long-task model, ~5-10 tok/s — manual start only (can't coexist with 30B/Kimi/ComfyUI) |
8188 |
ComfyUI | Image generation web UI — manual start only (GPU contention) |
8880 |
Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
10200 |
Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
10300 |
Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
Quick start
On the box (after the pyinfra deploy):
cd /srv/docker/ollama && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel && docker compose up -d # then add system per beszel.yml
On the Mac:
cd opencode && ./install.sh
opencode
Send a prompt; watch it land in Phoenix at http://framework:6006.
Coding workflows — state of the stack (2026-05-10)
| Component | Status | Notes |
|---|---|---|
Primary: opencode → framework/qwen3-coder:30b (Ollama) |
Working daily-driver | Full MCP toolbox (playwright, searxng, serena, basic-memory, sequential-thinking, task-master). Phoenix bridge emits OTel traces for every call. |
Secondary: opencode → framework-vllm/kimi-linear (vLLM) |
Configured, experimental | tool_call: false — Kimi-Linear is a research model and isn't strongly tool-trained. Long-context chat only. Switch via /model framework-vllm/kimi-linear. |
| Chat front-end for Kimi-Linear: OpenWebUI | Working | vLLM endpoint wired via OPENAI_API_BASE_URLS. Pick kimi-linear from the model selector in OpenWebUI. |
| Voice loop: faster-whisper + Kokoro | Deployed | OpenAI-compatible endpoints; not yet wired into a hands-free chat loop. |
| Observability: Phoenix per-trace + OpenLIT fleet | Working | All opencode prompts traced via .opencode/plugin/phoenix-bridge.js. |
| In flight: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; tmux-side companion to Phoenix's post-hoc traces. See oc-tree/NEXT_STEPS.md. |
| In flight: Kimi-Linear context ramp | P0 done at 32K, BIOS+GTT recipe applied | Next: ramp --max-model-len toward 1M and benchmark. See kimi-linear/NEXT_STEPS.md. |
| Configured but not deployed: VS Code Continue | YAML ready | Drop vscode-continue-config.yml into ~/.continue/config.yaml; ollama pull nomic-embed-text on the box for @codebase indexing. |
| In flight: ComfyUI for image generation | CF-P0 — artifacts written, awaiting box-side bring-up | Flux.1-Dev via kyuz0/amd-strix-halo-comfyui. Manual start (restart: "no") so it doesn't contend with kimi-linear at boot. See pyinfra/framework/compose/comfyui/README.md. |
| In flight: Qwen3-235B long-task model | M0 — artifacts written, awaiting weight pull + bring-up | Unsloth UD-Q2_K_XL (~88.8 GB) via kyuz0:rocm-7.2.2. The "overnight" model at ~5-10 tok/s — can't coexist with the 30B/Kimi/ComfyUI. See pyinfra/framework/compose/qwen3-235b/README.md. Opencode wired via direct framework-llama provider. |
| In flight: LiteLLM router | M1 — artifacts written, awaiting box-side bring-up | Single OpenAI-compatible endpoint (port 4000) in front of all 4 local backends; routing table in pyinfra/framework/compose/litellm/config.yaml. Opencode stays direct; OpenHands + future orchestrator point here. |
| Roadmap-only (not started): Aider, Cline, OpenHands integration, Multica | — | See Roadmap.md. |
Single-line summary: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; Qwen3-235B is being added as the overnight long-task model, fronted by a LiteLLM router for the future orchestrator; oc-tree and ComfyUI are the next two infra additions.
Why this stack
- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s memory
bandwidth means MoE models with small active-params (Qwen3-Coder-30B-A3B,
Kimi-Linear-48B-A3B) dominate. See
StrixHaloMemory.md. - AMD APU monitoring is broken upstream.
amd-smireturns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works. See the monitoring section ofpyinfra/framework/README.md. - Two-layer observability. OpenLIT is fleet metrics across sessions; Phoenix is per-trace waterfall for "what did this one prompt do."
- Reproducible. Every host-side config lives in
pyinfra/; every Mac-side config inopencode/install.sh. Re-running either is safe.
Description
Languages
Python
60.8%
Shell
29.2%
JavaScript
8.3%
Dockerfile
1.7%