Files
localgenai/pyinfra/framework/README.md
noisedestroyers 2c4bfefa95 Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:35:10 -04:00

185 lines
7.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# pyinfra: Strix Halo bring-up
Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.
## Manual prerequisites
1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
jammy/noble; later Ubuntus install but break the host-side toolchain
(libxml2 ABI). Enable SSH, import your laptop key, create user `noise`.
Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`,
plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
thread sudo passwords. One-time setup:
```sh
ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
```
## Run
```sh
uv tool install pyinfra
./run.sh # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry # any extra args are forwarded to pyinfra
```
Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
## What the deploy does
- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
`amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml`
dropped in, not auto-started — you edit the model path then
`docker compose up -d`
(OpenWebUI needs no edits — it's pre-configured to find Ollama at
`host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
(Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
OpenHands UI at :3030 — see "Monitoring stack" and "Agent harnesses"
below)
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.
## After the deploy: starting an inference service
```sh
ssh noise@10.0.0.237
sudo tailscale up # one-time, interactive
# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml # edit the --model path
docker compose up -d
curl localhost:8080/v1/models # smoke test
```
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
## Tunables
Top of `deploy.py`:
- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
release. The .deb filename has a build suffix that doesn't derive from
the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
https://github.com/Umio-Yasuno/amdgpu_top/releases.
Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`
— pin tags here.
## Monitoring stack
Two compose stacks watch the inference services:
- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
`/sys/class/drm/cardN/device/` directly; this is the only path that
reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
for util/power/temp on this APU.
First-time bring-up:
```sh
cd /srv/docker/beszel
docker compose up -d beszel # 1. hub
# 2. open http://framework:8090 → create admin → "Add system"
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
BESZEL_TOKEN=<token-from-dialog>
BESZEL_KEY=ssh-ed25519 AAAA…
EOF
sudo chgrp docker /srv/docker/beszel/.env
sudo chmod 640 /srv/docker/beszel/.env
docker compose up -d --force-recreate beszel-agent # 4. agent
```
- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
metrics for the LLM stack (cost, tokens, latency aggregated across
sessions and models). Auto-instruments Ollama and vLLM via
OpenTelemetry without app-code changes; for llama.cpp, the compose
file already passes `--metrics` so OpenLIT can scrape `/metrics`.
ClickHouse-backed. OTLP receivers exposed on the host at **:4327
(gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.
Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
to create an admin account on first load.
- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
agent waterfall / flamegraph. Designed to answer "show me what one
OpenCode turn actually did" — full call tree, nested LLM/tool calls,
token counts inline. Single container, SQLite-backed. OTLP receivers
on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.
Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
immediately available; no auth in the self-hosted single-user path.
See [`localgenai/opencode/README.md`](../../opencode/README.md) for
how OpenCode is configured to ship traces here.
The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.
## Agent harnesses
Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:
- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
Best for casual chat and household-shared LLM access.
Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.
- **OpenHands** (`/srv/docker/openhands`, http://framework:3030,
loopback-only) — autonomous agent in a Docker sandbox. Spawns a
per-conversation `agent-server` container that can write code, run
tests, browse the web. Pre-configured for Ollama at
`openai/qwen3-coder:30b` over the OpenAI-compatible endpoint;
ships traces to Phoenix.
Bring-up:
```sh
cd /srv/docker/openhands && docker compose up -d
# Tunnel the loopback-bound UI from your laptop:
ssh -L 3030:127.0.0.1:3030 noise@framework
open http://localhost:3030
```
First run pulls the agent-server image (~2 GB) lazily on first
conversation, not at startup, so the orchestrator comes up fast but
your first message takes 3060 s. Pre-0.44 state path was
`~/.openhands-state`; not relevant on a fresh install.
Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.
OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.