Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pyinfra/framework/README.md (new file, 184 lines):

# pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.

## Manual prerequisites

1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
jammy/noble; later Ubuntu releases install, but they break the host-side
toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user `noise`.
Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`,
plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
thread sudo passwords. One-time setup:
```sh
ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
```
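
To sanity-check it (this should print `NOPASSWD OK` without ever prompting;
`sudo -n` fails instead of asking for a password):
```sh
# Hypothetical check, not part of the deploy; just confirms passwordless sudo.
ssh noise@10.0.0.237 'sudo -n true' && echo "NOPASSWD OK"
```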

## Run

```sh
uv tool install pyinfra
./run.sh        # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry  # any extra args are forwarded to pyinfra
```

Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
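
For reference, `run.sh` is just a thin wrapper around that same invocation;
roughly this shape (a sketch, not the literal script):
```sh
#!/usr/bin/env sh
# Forward any extra arguments straight to pyinfra, e.g. ./run.sh --dry
exec pyinfra inventory.py deploy.py "$@"
```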

## What the deploy does

- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
`amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
and runs `update-grub`, but won't reboot for you (a quick post-reboot check
appears after this list).
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml`
dropped in, not auto-started — you edit the model path then
`docker compose up -d`
(OpenWebUI needs no edits — it's pre-configured to find Ollama at
`host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
(Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
OpenHands UI at :3030 — see "Monitoring stack" and "Agent harnesses"
below)
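
After the post-deploy reboot, a quick way to confirm the kernel params took
effect (plain greps against standard locations, not repo tooling):
```sh
# Both params should show up in the running kernel's command line,
# and in the GRUB default that pyinfra rewrote.
grep -o 'amd_iommu=off\|amdgpu\.gttsize=117760' /proc/cmdline
grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
```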

If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.

## After the deploy: starting an inference service

```sh
ssh noise@10.0.0.237
sudo tailscale up               # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml          # edit the --model path
docker compose up -d
curl localhost:8080/v1/models   # smoke test
```

Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
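
For a fuller smoke test than listing models, the same OpenAI-compatible
surface serves chat completions. A sketch (llama-server serves whatever
`--model` points at, so the request doesn't need a model name):
```sh
# One-shot completion against llama.cpp's OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with one word."}],"max_tokens":8}'
```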

## Tunables

Top of `deploy.py`:
- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
release. The .deb filename has a build suffix that doesn't derive from
the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
https://github.com/Umio-Yasuno/amdgpu_top/releases.

Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`
— pin tags here.
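
To see which tags are actually deployed on the box, a plain grep over the
rendered compose files is enough (adjust the glob if you only deployed a subset):
```sh
# List the image each deployed compose service is pinned to.
grep 'image:' /srv/docker/*/docker-compose.yml
```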

## Monitoring stack

Two compose stacks watch the inference services:

- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
`/sys/class/drm/cardN/device/` directly; this is the only path that
reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
for util/power/temp on this APU.

First-time bring-up:
```sh
cd /srv/docker/beszel
docker compose up -d beszel   # 1. hub
# 2. open http://framework:8090 → create admin → "Add system"
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
BESZEL_TOKEN=<token-from-dialog>
BESZEL_KEY=ssh-ed25519 AAAA…
EOF
sudo chgrp docker /srv/docker/beszel/.env
sudo chmod 640 /srv/docker/beszel/.env
docker compose up -d --force-recreate beszel-agent   # 4. agent
```
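
To spot-check the same sysfs counters by hand (the card index varies, so glob;
values are percent, bytes, and millidegrees C respectively):
```sh
# The agent reads these amdgpu nodes directly; amd-smi can't on this APU.
cat /sys/class/drm/card*/device/gpu_busy_percent
cat /sys/class/drm/card*/device/mem_info_vram_used
cat /sys/class/hwmon/hwmon*/temp1_input 2>/dev/null
```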

- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
metrics for the LLM stack (cost, tokens, latency aggregated across
sessions and models). Auto-instruments Ollama and vLLM via
OpenTelemetry without app-code changes; for llama.cpp, the compose
file already passes `--metrics` so OpenLIT can scrape `/metrics`.
ClickHouse-backed. OTLP receivers exposed on the host at **:4327
(gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.

Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
to create an admin account on first load.
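
If you ever wire another OpenTelemetry-instrumented client into OpenLIT, the
standard OTel exporter variables are all it takes (a sketch; ports as above):
```sh
# Send OTLP to OpenLIT's remapped HTTP receiver (Phoenix owns 4317/4318).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://framework:4328
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```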

- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
agent waterfall / flamegraph. Designed to answer "show me what one
OpenCode turn actually did" — full call tree, nested LLM/tool calls,
token counts inline. Single container, SQLite-backed. OTLP receivers
on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.

Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
immediately available; no auth in the self-hosted single-user path.

See [`localgenai/opencode/README.md`](../../opencode/README.md) for
how OpenCode is configured to ship traces here.
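
A quick probe that the HTTP OTLP route on the UI port is actually listening
(an empty protobuf POST should get an HTTP status back rather than a refused
connection; a sketch, not something the deploy runs):
```sh
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://framework:6006/v1/traces \
  -H 'Content-Type: application/x-protobuf' --data-binary ''
```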

The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.
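
If you do go that route, the textfile collector can stay tiny; a sketch, where
the card index and output directory are assumptions to adjust:
```sh
#!/bin/sh
# Emit the same amdgpu sysfs counters as Prometheus metrics for node_exporter's
# textfile collector. Adjust card1 and the output path for your setup.
dev=/sys/class/drm/card1/device
out=/var/lib/node_exporter/textfile_collector/amdgpu.prom
{
  echo "amdgpu_gpu_busy_percent $(cat "$dev/gpu_busy_percent")"
  echo "amdgpu_vram_used_bytes $(cat "$dev/mem_info_vram_used")"
  echo "amdgpu_vram_total_bytes $(cat "$dev/mem_info_vram_total")"
} > "$out.tmp" && mv "$out.tmp" "$out"
```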

## Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:

- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
Best for casual chat and household-shared LLM access.

Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.

- **OpenHands** (`/srv/docker/openhands`, http://framework:3030,
loopback-only) — autonomous agent in a Docker sandbox. Spawns a
per-conversation `agent-server` container that can write code, run
tests, browse the web. Pre-configured for Ollama at
`openai/qwen3-coder:30b` over the OpenAI-compatible endpoint;
ships traces to Phoenix.

Bring-up:
```sh
cd /srv/docker/openhands && docker compose up -d
# Tunnel the loopback-bound UI from your laptop:
ssh -L 3030:127.0.0.1:3030 noise@framework
open http://localhost:3030
```

The agent-server image (~2 GB) is pulled lazily on the first conversation,
not at startup, so the orchestrator comes up fast but your first message
takes 30–60 s. Pre-0.44 state path was `~/.openhands-state`; not relevant
on a fresh install.

Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.
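
To confirm the bumped context is live inside the container (assuming the
compose service is named `ollama`):
```sh
cd /srv/docker/ollama
docker compose exec ollama env | grep OLLAMA_CONTEXT_LENGTH
```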

OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.
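
A quick reachability check from the Mac side (Ollama also exposes an
OpenAI-compatible `/v1` surface, so this doubles as a model listing):
```sh
curl -s http://framework:11434/v1/models
```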

The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized, with rocWMMA flash attention (which only kicks in
on the ROCm tags). Auto-rebuilt against llama.cpp master.