diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d036dad
--- /dev/null
+++ b/README.md
@@ -0,0 +1,78 @@
+# localgenai
+
+Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix
+Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over
+Tailscale. Coding agents, monitoring, voice — all self-hosted.
+
+## Topology
+
+```
+┌─────────────────┐    Tailscale     ┌──────────────────────────────┐
+│  Mac            │ ───────────────▶ │  Framework Desktop           │
+│                 │                  │  (Ubuntu 26.04, gfx1151 iGPU)│
+│  • OpenCode     │                  │  • Inference: Ollama,        │
+│    + Phoenix    │                  │    llama.cpp, vLLM           │
+│      bridge     │                  │  • Agent UIs: OpenWebUI,     │
+│  • install.sh   │                  │    OpenHands                 │
+│                 │                  │  • Monitoring: Beszel,       │
+│                 │                  │    OpenLIT, Phoenix          │
+└─────────────────┘                  └──────────────────────────────┘
+```
+
+## Repo layout
+
+| Path | What's there |
+|---|---|
+| [`pyinfra/`](pyinfra/) | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See [`pyinfra/framework/README.md`](pyinfra/framework/README.md). |
+| [`opencode/`](opencode/) | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
+| [`StrixHaloSetup.md`](StrixHaloSetup.md) | Original phased bring-up plan for the box. |
+| [`StrixHaloMemory.md`](StrixHaloMemory.md) | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
+| [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
+| [`TODO.md`](TODO.md) | Open questions and follow-ups. |
+| [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. |
+| `piper-compose.yaml`, `whisper-compose.yaml` | Voice (TTS / STT) compose files; not yet wired into the pyinfra deploy. |
+
+## Service ports (on `framework`)
+
+| Port | Service | Notes |
+|---|---|---|
+| `8080` | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
+| `8000` | vLLM | ROCm; gfx1151 support varies |
+| `11434` | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
+| `3000` | OpenWebUI | ChatGPT-style UI in front of Ollama |
+| `3001` | OpenLIT | LLM fleet metrics dashboard |
+| `3030` | OpenHands | Loopback only — SSH-tunnel from your laptop |
+| `4317` | Phoenix OTLP/gRPC | Trace ingestion |
+| `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
+| `8090` | Beszel | Host + container + AMD GPU dashboard |
+
+## Quick start
+
+On the box (after the pyinfra deploy):
+```sh
+cd /srv/docker/ollama && docker compose up -d
+cd /srv/docker/phoenix && docker compose up -d
+cd /srv/docker/beszel && docker compose up -d   # then add system per beszel.yml
+```
+
+On the Mac:
+```sh
+cd opencode && ./install.sh
+opencode
+```
+
+Send a prompt; watch it land in the Phoenix UI (port `6006` on `framework`).
+
+## Why this stack
+
+- **Bandwidth, not VRAM, is the ceiling on Strix Halo.** At 256 GB/s, decode
+  speed tracks bytes read per token, so MoE models with few active parameters
+  (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate. See `StrixHaloMemory.md`.
+- **AMD APU monitoring is broken upstream.** `amd-smi` returns N/A on
+  gfx1151, which kills the official Prometheus/Grafana exporters. Beszel
+  reads sysfs directly and works. See the monitoring section of
+  `pyinfra/framework/README.md`.
+- **Two-layer observability.** OpenLIT is fleet metrics across sessions;
+  Phoenix is per-trace waterfall for "what did this one prompt do."
+- **Reproducible.** Every host-side config lives in `pyinfra/`; every
+  Mac-side config in `opencode/install.sh`. Re-running either is safe.
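
The bandwidth point in "Why this stack" can be made concrete with back-of-envelope arithmetic. If decoding is memory-bound and every active weight is read once per generated token, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is illustrative only: `decode_ceiling` is a hypothetical helper, not part of this repo, and 4.5 bits per weight is an assumed figure for a typical 4-bit quantization. It ignores KV-cache reads and compute, so real throughput will be lower.

```sh
#!/bin/sh
# Rough upper bound on decode tokens/s for a memory-bound model:
#   tokens/s <= bandwidth / (active_params * bytes_per_weight)
# Usage: decode_ceiling <bandwidth_GB_per_s> <active_params_billions> <bits_per_weight>
decode_ceiling() {
    awk -v bw="$1" -v params="$2" -v bits="$3" 'BEGIN {
        gb_per_token = params * bits / 8   # GB of weights read per token
        printf "%.0f\n", bw / gb_per_token
    }'
}

# 256 GB/s is the bandwidth quoted in the README; 4.5 bits/weight is an assumption.
echo "3B active (A3B-style MoE): $(decode_ceiling 256 3 4.5) tok/s ceiling"
echo "30B dense:                 $(decode_ceiling 256 30 4.5) tok/s ceiling"
```

Under these assumptions the ceiling is about 152 tok/s for roughly 3B active parameters versus about 15 tok/s for a 30B dense model. That tenfold gap is the reason A3B-style MoE models suit this hardware.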