# localgenai
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.
## Topology
```
┌─────────────────┐      Tailscale      ┌──────────────────────────────┐
│ Mac             │ ───────────────▶    │ Framework Desktop            │
│                 │                     │ (Ubuntu 26.04, gfx1151 iGPU) │
│ • OpenCode      │                     │ • Inference: Ollama,         │
│   + Phoenix     │                     │   llama.cpp, vLLM            │
│   bridge        │                     │ • Agent UIs: OpenWebUI,      │
│ • install.sh    │                     │   OpenHands                  │
│                 │                     │ • Monitoring: Beszel,        │
│                 │                     │   OpenLIT, Phoenix           │
└─────────────────┘                     └──────────────────────────────┘
```
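Assuming the box is reachable as `framework` on the tailnet (the hostname used throughout this README), a quick check from the Mac:

```sh
# Confirm the tailnet sees the box, then hit the Homepage front door (port 7575).
tailscale status | grep -i framework
curl -s -o /dev/null -w '%{http_code}\n' http://framework:7575/
```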
## Repo layout

| Path | What's there |
|---|---|
| `pyinfra/` | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See `pyinfra/framework/README.md`; a sample invocation follows this table. |
| `opencode/` | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
| `StrixHaloSetup.md` | Original phased bring-up plan for the box. |
| `StrixHaloMemory.md` | UMA / GTT / bandwidth notes: "what fits, what's slow, why." |
| `Roadmap.md` | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| `TODO.md` | Open questions and follow-ups. |
| `testing/qwen3-coder-30b/` | Small evaluation harness for Qwen3-Coder. |
| `piper-compose.yaml`, `whisper-compose.yaml` | Voice (TTS / STT) compose files; not yet wired into the pyinfra deploy. |
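To (re)apply the host setup, pyinfra takes an inventory and a deploy file; the file names below are illustrative, the real ones live under `pyinfra/` (see its README):

```sh
# Hypothetical file names; check pyinfra/framework/README.md for the actual ones.
cd pyinfra
pyinfra inventory.py deploy.py
```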
## Service ports (on framework)

| Port | Service | Notes |
|---|---|---|
| 7575 | Homepage | Front door, start here. Tile per service with live widgets. |
| 8080 | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
| 8000 | vLLM | ROCm; gfx1151 support varies |
| 11434 | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
| 3000 | OpenWebUI | ChatGPT-style UI in front of Ollama |
| 3001 | OpenLIT | LLM fleet metrics dashboard |
| 3030 | OpenHands | Autonomous agent + sandbox runtime; Tailscale-only by design |
| 4317 | Phoenix OTLP/gRPC | Trace ingestion |
| 6006 | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
| 8090 | Beszel | Host + container + AMD GPU dashboard |
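A rough reachability sweep over the table above, run from the Mac. Port 4317 speaks gRPC, so a plain-HTTP probe failing there is expected:

```sh
# Probe each service port; "up" only means something answered HTTP on /.
for port in 7575 8080 8000 11434 3000 3001 3030 6006 8090; do
  curl -s -o /dev/null --max-time 2 "http://framework:$port/" \
    && echo "$port: up" || echo "$port: no HTTP response"
done

# llama.cpp's Prometheus endpoint (enabled by --metrics):
curl -s http://framework:8080/metrics | head
```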
## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama  && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel  && docker compose up -d   # then add the system per beszel.yml
```

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```
Send a prompt; watch it land in Phoenix at http://framework:6006.
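To exercise the inference layer without OpenCode in the loop, you can hit Ollama's API directly. The model tag below is an assumption; use whatever you have pulled:

```sh
# Direct call to Ollama's generate endpoint; swap the model tag for one you have.
curl -s http://framework:11434/api/generate \
  -d '{"model": "qwen3-coder:30b", "prompt": "Say hello in five words.", "stream": false}'
```

Note this bypasses the Phoenix bridge, so no trace will appear for it; traces come from prompts sent through OpenCode.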
## Why this stack

- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s of memory bandwidth means MoE models with small active-param counts (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate; see `StrixHaloMemory.md` and the back-of-envelope sketch after this list.
- AMD APU monitoring is broken upstream. `amd-smi` returns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works (example after this list). See the monitoring section of `pyinfra/framework/README.md`.
- Two-layer observability. OpenLIT is fleet metrics across sessions; Phoenix is a per-trace waterfall for "what did this one prompt do."
- Reproducible. Every host-side config lives in `pyinfra/`; every Mac-side config in `opencode/install.sh`. Re-running either is safe.
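The bandwidth ceiling falls out of one division: each decoded token must stream every active weight through memory once, so tokens/s is bounded by bandwidth divided by bytes per token. A back-of-envelope sketch, assuming roughly 1 byte per active parameter (Q8-ish) and ignoring KV-cache traffic:

```sh
# tokens/s upper bound ~= bandwidth (GB/s) / active weight bytes per token (GB)
echo "256 / 3"  | bc -l   # ~85 tok/s ceiling for a 3B-active MoE (A3B)
echo "256 / 30" | bc -l   # ~8.5 tok/s ceiling for a dense 30B at the same quant
```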
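And what "reads sysfs directly" looks like in practice: the amdgpu driver exposes standard utilization and memory counters under `/sys` that keep working where `amd-smi` reports N/A. The card index below is an assumption and may differ per host:

```sh
# Straight from the amdgpu driver; no exporter required.
cat /sys/class/drm/card0/device/gpu_busy_percent     # GPU utilization, percent
cat /sys/class/drm/card0/device/mem_info_vram_used   # bytes of carved-out VRAM in use
cat /sys/class/drm/card0/device/mem_info_gtt_used    # bytes of GTT in use (UMA spillover)
```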