# pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB). The host stays minimal: kernel + driver + Docker + diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker compose services, each shipping its own ROCm/Vulkan stack.

## Manual prerequisites

1. **Phase 0**: update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install**: Ubuntu Server 24.04 LTS. AMD ROCm only ships for jammy/noble; later Ubuntus install but break the host-side toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user `noise`. Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`, plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py` if it moves).
4. **NOPASSWD sudo for `noise`**: pyinfra's fact layer doesn't reliably thread sudo passwords. One-time setup (`-t` gives sudo a TTY for the password prompt):

   ```sh
   ssh -t noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
   ```

## Run

```sh
uv tool install pyinfra
./run.sh              # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry        # any extra args are forwarded to pyinfra
```

Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.

## What the deploy does

- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`), no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`, `amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)). Requires a reboot to activate: pyinfra rewrites `/etc/default/grub` and runs `update-grub`, but won't reboot for you.
- `/models//` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml` dropped in, not auto-started: you edit the model path, then run `docker compose up -d`
  (OpenWebUI needs no edits: it's pre-configured to find Ollama at `host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
  (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006, OpenHands UI at :3030; see "Monitoring stack" and "Agent harnesses" below)

If a previous run installed the native llama.cpp build / full ROCm / native Ollama, those are auto-cleaned the next time `./run.sh` runs.

## After the deploy: starting an inference service

```sh
ssh noise@10.0.0.237
sudo tailscale up                       # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml                  # edit the --model path
docker compose up -d
curl localhost:8080/v1/models           # smoke test
```

Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit needed, since Ollama serves models on demand).

## Tunables

Top of `deploy.py`:

- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB`: bump when AMD ships a newer release. The .deb filename has a build suffix that doesn't derive from the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION`: bump when a newer release lands at https://github.com/Umio-Yasuno/amdgpu_top/releases.

Compose images in `compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`: pin tags here.

## Monitoring stack

Three compose stacks watch the inference services:

- **Beszel** (`/srv/docker/beszel`, http://framework:8090): host + Docker container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads `/sys/class/drm/cardN/device/` directly; this is the only path that reports real numbers on Strix Halo (gfx1151), because `amd-smi` returns N/A for util/power/temp on this APU. First-time bring-up:

  ```sh
  cd /srv/docker/beszel
  docker compose up -d beszel                          # 1. hub
  # 2. open http://framework:8090 → create admin → "Add system"
  # 3. copy the TOKEN and the SSH KEY from the dialog into .env
  #    (the token entry goes alongside BESZEL_KEY, under the variable
  #    name that docker-compose.yml expects):
  sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
  BESZEL_KEY=ssh-ed25519 AAAA…
  EOF
  sudo chgrp docker /srv/docker/beszel/.env
  sudo chmod 640 /srv/docker/beszel/.env
  docker compose up -d --force-recreate beszel-agent   # 4. agent
  ```

- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001): fleet metrics for the LLM stack (cost, tokens, latency aggregated across sessions and models). Auto-instruments Ollama and vLLM via OpenTelemetry without app-code changes; for llama.cpp, the compose file already passes `--metrics` so OpenLIT can scrape `/metrics`. ClickHouse-backed. OTLP receivers exposed on the host at **:4327 (gRPC) / :4328 (HTTP)** (Phoenix owns 4317/4318). Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts to create an admin account on first load.

- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006): per-trace agent waterfall / flamegraph. Designed to answer "show me what one OpenCode turn actually did": full call tree, nested LLM/tool calls, token counts inline. Single container, SQLite-backed. OTLP receivers on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)**; Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318. Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI immediately available; no auth in the self-hosted single-user path. See [`localgenai/opencode/README.md`](../../opencode/README.md) for how OpenCode is configured to ship traces here.

The official AMD Prometheus exporters (`amd-smi-exporter`, `device-metrics-exporter`) are intentionally **not** deployed: they're broken on gfx1151 (ROCm#6035). If you ever want full Prometheus + Grafana, the migration path is to replace Beszel with `node_exporter` + a textfile collector reading `/sys/class/drm/cardN/device/` and `/sys/class/hwmon/`.
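
The Prometheus migration path mentioned above could start from something like the following. This is a minimal sketch of a textfile-collector script, not part of this repo. It assumes the amdgpu driver exposes `gpu_busy_percent`, `mem_info_vram_used`, and `mem_info_gtt_used` under `/sys/class/drm/cardN/device/` on your kernel; the `amdgpu_*` metric names and the `collect` helper are hypothetical, and `/sys/class/hwmon/` is left out for brevity.

```python
#!/usr/bin/env python3
"""Sketch: dump amdgpu sysfs counters in Prometheus textfile format."""
import re
import sys
from pathlib import Path

# sysfs file -> hypothetical metric name (not an established naming scheme)
SYSFS_METRICS = {
    "gpu_busy_percent": "amdgpu_busy_percent",
    "mem_info_vram_used": "amdgpu_vram_used_bytes",
    "mem_info_gtt_used": "amdgpu_gtt_used_bytes",
}


def collect(drm_root: Path) -> list[str]:
    """Return one exposition line per readable counter under drm_root."""
    if not drm_root.is_dir():
        return []
    lines = []
    for card in sorted(drm_root.iterdir()):
        # Keep cardN only; connector entries such as card0-DP-1 are skipped.
        if not re.fullmatch(r"card\d+", card.name):
            continue
        for filename, metric in SYSFS_METRICS.items():
            try:
                value = int((card / "device" / filename).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable here: leave it out
            lines.append(f'{metric}{{card="{card.name}"}} {value}')
    return lines


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/sys/class/drm")
    print("\n".join(collect(root)))
```

Run it from cron or a systemd timer and write the output to a `.prom` file in the directory passed to node_exporter via `--collector.textfile.directory`. Writing to a temporary file and renaming it keeps node_exporter from reading a half-written file.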
## Agent harnesses

Two agent UIs in front of the local model stack. Pick one (or both) based on how you like to drive the agent:

- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000): ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search. Best for casual chat and household-shared LLM access. Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.

- **OpenHands** (`/srv/docker/openhands`, http://framework:3030, loopback-only): autonomous agent in a Docker sandbox. Spawns a per-conversation `agent-server` container that can write code, run tests, browse the web. Pre-configured for Ollama at `openai/qwen3-coder:30b` over the OpenAI-compatible endpoint; ships traces to Phoenix. Bring-up:

  ```sh
  cd /srv/docker/openhands && docker compose up -d
  # Tunnel the loopback-bound UI from your laptop:
  ssh -L 3030:127.0.0.1:3030 noise@framework
  open http://localhost:3030
  ```

  First run pulls the agent-server image (~2 GB) lazily on first conversation, not at startup, so the orchestrator comes up fast but your first message takes 30–60 s. Pre-0.44 state path was `~/.openhands-state`; not relevant on a fresh install.

Tool-call quality with local models is much better when Ollama's context is bumped. The compose at `/srv/docker/ollama` already sets `OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.

OpenCode (the terminal driver, configured in [`localgenai/opencode/`](../../opencode/)) runs on the Mac and points at the same Ollama endpoint, so all three harnesses share the same model server.

The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`, gfx1151-optimized with rocWMMA flash attention (the latter only kicks in on the ROCm tags). Auto-rebuilt against llama.cpp master.
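
Because all three harnesses depend on the same Ollama endpoint, it can be worth checking that the endpoint is up and lists the expected model before debugging a harness. This is a minimal Python sketch under two assumptions: that the endpoint answers `/v1/models` with an OpenAI-style list (`{"data": [{"id": ...}]}`), and that the hostname `framework` resolves from where you run it. The helper names `model_ids` and `fetch_model_ids` are hypothetical.

```python
import json
import urllib.request

# Assumption: the box resolves as "framework" and Ollama listens on 11434.
OLLAMA_MODELS_URL = "http://framework:11434/v1/models"


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(url: str = OLLAMA_MODELS_URL, timeout: float = 5.0) -> list[str]:
    """Query the endpoint and return the IDs it reports.

    Raises OSError (including urllib.error.URLError) if it is unreachable.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_model_ids()` from the Mac and checking whether `qwen3-coder:30b` is in the result is one way to confirm the model has been pulled, assuming IDs are reported in the same `name:tag` form. The same function should work against the llama.cpp (8080) and vLLM (8000) ports by passing a different `url`.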