107 lines
8.1 KiB
Markdown
107 lines
8.1 KiB
Markdown
# localgenai
|
|
|
|
Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix
|
|
Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over
|
|
Tailscale. Coding agents, monitoring, voice — all self-hosted.
|
|
|
|
## Topology
|
|
|
|
```
|
|
┌─────────────────┐ Tailscale ┌──────────────────────────────┐
|
|
│ Mac │ ───────────────▶ │ Framework Desktop │
|
|
│ │ │ (Ubuntu 26.04, gfx1151 iGPU)│
|
|
│ • OpenCode │ │ • Inference: Ollama, │
|
|
│ + Phoenix │ │ llama.cpp, vLLM │
|
|
│ bridge │ │ • Agent UIs: OpenWebUI, │
|
|
│ • install.sh │ │ OpenHands │
|
|
│ │ │ • Monitoring: Beszel, │
|
|
│ │ │ OpenLIT, Phoenix │
|
|
└─────────────────┘ └──────────────────────────────┘
|
|
```
|
|
|
|
## Repo layout
|
|
|
|
| Path | What's there |
|
|
|---|---|
|
|
| [`pyinfra/`](pyinfra/) | pyinfra deploy targeting the Framework Desktop. Host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See [`pyinfra/framework/README.md`](pyinfra/framework/README.md). |
|
|
| [`opencode/`](opencode/) | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac. Wires up Playwright / SearXNG MCP and ships every prompt's spans to Phoenix. |
|
|
| [`StrixHaloSetup.md`](StrixHaloSetup.md) | Original phased bring-up plan for the box. |
|
|
| [`StrixHaloMemory.md`](StrixHaloMemory.md) | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
|
|
| [`Roadmap.md`](Roadmap.md) | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
|
|
| [`TODO.md`](TODO.md) | Open questions and follow-ups. |
|
|
| [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. |
|
|
| [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. |
|
|
| [`ImageGen.md`](ImageGen.md) | Image generation workflows on the ComfyUI stack — resolution envelope, hires fix, Redux/IP-Adapter for reference-driven generation. |
|
|
|
|
## Service ports (on `framework`)
|
|
|
|
| Port | Service | Notes |
|
|
|---|---|---|
|
|
| `7575` | **Homepage** | **Front door — start here.** Tile per service with live widgets. |
|
|
| `8080` | llama.cpp | Vulkan backend, `--metrics` for Prometheus |
|
|
| `8000` | vLLM | ROCm; gfx1151 support varies |
|
|
| `11434` | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` |
|
|
| `3000` | OpenWebUI | ChatGPT-style UI in front of Ollama |
|
|
| `3001` | OpenLIT | LLM fleet metrics dashboard |
|
|
| `4000` | LiteLLM | OpenAI-compatible router in front of all local backends — `Bearer $LITELLM_MASTER_KEY` |
|
|
| `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design |
|
|
| `4317` | Phoenix OTLP/gRPC | Trace ingestion |
|
|
| `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) |
|
|
| `8001` | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit |
|
|
| `8090` | Beszel | Host + container + AMD GPU dashboard |
|
|
| `8081` | llama.cpp (Qwen3-235B) | Long-task model, ~5-10 tok/s — manual start only (can't coexist with 30B/Kimi/ComfyUI) |
|
|
| `8188` | ComfyUI | Image generation web UI — manual start only (GPU contention) |
|
|
| `8880` | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit |
|
|
| `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist |
|
|
| `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist |
|
|
|
|
## Quick start
|
|
|
|
On the box (after the pyinfra deploy):
|
|
```sh
|
|
cd /srv/docker/ollama && docker compose up -d
|
|
cd /srv/docker/phoenix && docker compose up -d
|
|
cd /srv/docker/beszel && docker compose up -d # then add system per beszel.yml
|
|
```
|
|
|
|
On the Mac:
|
|
```sh
|
|
cd opencode && ./install.sh
|
|
opencode
|
|
```
|
|
|
|
Send a prompt; watch it land in Phoenix at <http://framework:6006>.
|
|
|
|
## Coding workflows — state of the stack (2026-05-10)
|
|
|
|
| Component | Status | Notes |
|
|
|---|---|---|
|
|
| **Primary**: opencode → `framework/qwen3-coder:30b` (Ollama) | Working daily-driver | Full MCP toolbox (playwright, searxng, serena, basic-memory, sequential-thinking, task-master). Phoenix bridge emits OTel traces for every call. |
|
|
| **Secondary**: opencode → `framework-vllm/kimi-linear` (vLLM) | Configured, experimental | `tool_call: false` — Kimi-Linear is a research model and isn't strongly tool-trained. Long-context chat only. Switch via `/model framework-vllm/kimi-linear`. |
|
|
| **Chat front-end for Kimi-Linear**: OpenWebUI | Working | vLLM endpoint wired via `OPENAI_API_BASE_URLS`. Pick `kimi-linear` from the model selector in OpenWebUI. |
|
|
| **Voice loop**: faster-whisper + Kokoro | Deployed | OpenAI-compatible endpoints; not yet wired into a hands-free chat loop. |
|
|
| **Observability**: Phoenix per-trace + OpenLIT fleet | Working | All opencode prompts traced via `.opencode/plugin/phoenix-bridge.js`. |
|
|
| **In flight**: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; tmux-side companion to Phoenix's post-hoc traces. See [`oc-tree/NEXT_STEPS.md`](oc-tree/NEXT_STEPS.md). |
|
|
| **In flight**: Kimi-Linear context ramp | P0 done at 32K, BIOS+GTT recipe applied | Next: ramp `--max-model-len` toward 1M and benchmark. See [`kimi-linear/NEXT_STEPS.md`](kimi-linear/NEXT_STEPS.md). |
|
|
| **Configured but not deployed**: VS Code Continue | YAML ready | Drop [`vscode-continue-config.yml`](vscode-continue-config.yml) into `~/.continue/config.yaml`; `ollama pull nomic-embed-text` on the box for `@codebase` indexing. |
|
|
| **In flight**: ComfyUI for image generation | CF-P0 — artifacts written, awaiting box-side bring-up | Flux.1-Dev via `kyuz0/amd-strix-halo-comfyui`. Manual start (`restart: "no"`) so it doesn't contend with kimi-linear at boot. See [`pyinfra/framework/compose/comfyui/README.md`](pyinfra/framework/compose/comfyui/README.md). |
|
|
| **In flight**: Qwen3-235B long-task model | M0 — artifacts written, awaiting weight pull + bring-up | Unsloth UD-Q2_K_XL (~88.8 GB) via `kyuz0:rocm-7.2.2`. The "overnight" model at ~5-10 tok/s — can't coexist with the 30B/Kimi/ComfyUI. See [`pyinfra/framework/compose/qwen3-235b/README.md`](pyinfra/framework/compose/qwen3-235b/README.md). Opencode wired via direct `framework-llama` provider. |
|
|
| **In flight**: LiteLLM router | M1 — artifacts written, awaiting box-side bring-up | Single OpenAI-compatible endpoint (port 4000) in front of all 4 local backends; routing table in [`pyinfra/framework/compose/litellm/config.yaml`](pyinfra/framework/compose/litellm/config.yaml). Opencode stays direct; OpenHands + future orchestrator point here. |
|
|
| **Roadmap-only** (not started): Aider, Cline, OpenHands integration, Multica | — | See `Roadmap.md`. |
|
|
|
|
**Single-line summary**: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; Qwen3-235B is being added as the overnight long-task model, fronted by a LiteLLM router for the future orchestrator; oc-tree and ComfyUI are the next two infra additions.
|
|
|
|
## Why this stack
|
|
|
|
- **Bandwidth, not VRAM, is the ceiling on Strix Halo.** 256 GB/s memory
|
|
bandwidth means MoE models with small active-params (Qwen3-Coder-30B-A3B,
|
|
Kimi-Linear-48B-A3B) dominate. See `StrixHaloMemory.md`.
|
|
- **AMD APU monitoring is broken upstream.** `amd-smi` returns N/A on
|
|
gfx1151, which kills the official Prometheus/Grafana exporters. Beszel
|
|
reads sysfs directly and works. See the monitoring section of
|
|
`pyinfra/framework/README.md`.
|
|
- **Two-layer observability.** OpenLIT is fleet metrics across sessions;
|
|
Phoenix is per-trace waterfall for "what did this one prompt do."
|
|
- **Reproducible.** Every host-side config lives in `pyinfra/`; every
|
|
Mac-side config in `opencode/install.sh`. Re-running either is safe.
|