Files
localgenai/pyinfra/framework/README.md
noisedestroyers 178d7d3c0f Add Homepage dashboard + dual-export OpenCode traces
Homepage as the front door: single page at framework:7575 with one tile
per service, live widgets where the upstream supports it (Ollama loaded
models, container state via docker.sock, etc.), bookmarks for reference
docs. Config files are pyinfra-managed — source of truth lives in
compose/homepage/, sync by editing there and re-running ./run.sh.

OpenCode plugin now dual-exports spans to Phoenix and OpenLIT in
parallel. Phoenix remains the per-trace waterfall view; OpenLIT picks
up the same data for fleet-level metrics. Each destination has its own
batch processor so a hiccup at one doesn't block the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:00:05 -04:00

8.7 KiB
Raw Blame History

pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB). The host stays minimal — kernel + driver + Docker + diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker compose services, each shipping its own ROCm/Vulkan stack.

Manual prerequisites

  1. Phase 0 — update Framework BIOS, set GPU UMA carve-out (96 GB).
  2. OS install — Ubuntu Server 24.04 LTS. AMD ROCm only ships for jammy/noble; later Ubuntus install but break the host-side toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user noise. Recommended partitioning: ≥300 GB on /, big disk mounted at /models, plain ext4 (skip LVM).
  3. The host must be reachable at 10.0.0.237 over SSH (edit inventory.py if it moves).
  4. NOPASSWD sudo for noise — pyinfra's fact layer doesn't reliably thread sudo passwords. One-time setup:
    ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
    

Run

uv tool install pyinfra
./run.sh           # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry     # any extra args are forwarded to pyinfra

Or run it ephemerally without installing: uvx pyinfra inventory.py deploy.py.

What the deploy does

  • Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
  • Tailscale (run sudo tailscale up on the box once, interactively)
  • Docker engine + compose plugin, user added to docker group
  • ROCm host diagnostics only (rocminfo) — no full toolchain
  • GRUB kernel params for Strix Halo perf: amd_iommu=off, amdgpu.gttsize=117760 (per Gygeek/Framework-strix-halo-llm-setup). Requires a reboot to activate — pyinfra rewrites /etc/default/grub and runs update-grub, but won't reboot for you.
  • /models/<vendor>/ layout
  • /srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}/docker-compose.yml dropped in, not auto-started — you edit the model path then docker compose up -d (OpenWebUI needs no edits — it's pre-configured to find Ollama at host.docker.internal:11434 and uses searxng.n0n.io for web search) (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006, OpenHands UI at :3030, Homepage at :7575 — see "Monitoring stack", "Agent harnesses", and "Front door" below)

If a previous run installed the native llama.cpp build / full ROCm / native Ollama, those are auto-cleaned the next time ./run.sh runs.

After the deploy: starting an inference service

ssh noise@10.0.0.237
sudo tailscale up                            # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml                       # edit the --model path
docker compose up -d
curl localhost:8080/v1/models                # smoke test

Same shape for vllm (port 8000) and ollama (port 11434, no model edit needed — Ollama serves models on demand).

Tunables

Top of deploy.py:

Compose images in compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands,homepage}.yml — pin tags here. Homepage's tile/layout config is in compose/homepage/ (services.yaml, settings.yaml, etc.); edit there, ./run.sh, restart the homepage container.

Front door

  • Homepage (/srv/docker/homepage, http://framework:7575) — single page with one tile per service in this stack, with native widgets pulling live state (loaded models from Ollama, container status from the docker socket, etc.). Bookmarks for the reference docs across the bottom. Use as the default landing page on this network.

    Bring-up: cd /srv/docker/homepage && docker compose up -d. No first-run setup — it reads /srv/docker/homepage/config/*.yaml and renders.

    To customize: edit pyinfra/framework/compose/homepage/*.yaml in the repo, run ./run.sh, then docker compose restart homepage on the box. Direct edits to /srv/docker/homepage/config/*.yaml will be overwritten on the next pyinfra deploy.

Monitoring stack

Two compose stacks watch the inference services:

  • Beszel (/srv/docker/beszel, http://framework:8090) — host + Docker container + AMD GPU dashboard. The agent's amd_sysfs collector reads /sys/class/drm/cardN/device/ directly; this is the only path that reports real numbers on Strix Halo (gfx1151) — amd-smi returns N/A for util/power/temp on this APU.

    First-time bring-up:

    cd /srv/docker/beszel
    docker compose up -d beszel                                 # 1. hub
    # 2. open http://framework:8090 → create admin → "Add system"
    # 3. copy the TOKEN and the SSH KEY from the dialog into .env:
    sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
    BESZEL_TOKEN=<token-from-dialog>
    BESZEL_KEY=ssh-ed25519 AAAA…
    EOF
    sudo chgrp docker /srv/docker/beszel/.env
    sudo chmod 640 /srv/docker/beszel/.env
    docker compose up -d --force-recreate beszel-agent          # 4. agent
    
  • OpenLIT (/srv/docker/openlit, http://framework:3001) — fleet metrics for the LLM stack (cost, tokens, latency aggregated across sessions and models). Auto-instruments Ollama and vLLM via OpenTelemetry without app-code changes; for llama.cpp, the compose file already passes --metrics so OpenLIT can scrape /metrics. ClickHouse-backed. OTLP receivers exposed on the host at :4327 (gRPC) / :4328 (HTTP) — Phoenix owns 4317/4318.

    Bring-up: cd /srv/docker/openlit && docker compose up -d. UI prompts to create an admin account on first load.

  • Phoenix (/srv/docker/phoenix, http://framework:6006) — per-trace agent waterfall / flamegraph. Designed to answer "show me what one OpenCode turn actually did" — full call tree, nested LLM/tool calls, token counts inline. Single container, SQLite-backed. OTLP receivers on the host at :4317 (gRPC) and :6006/v1/traces (HTTP) — Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.

    Bring-up: cd /srv/docker/phoenix && docker compose up -d. UI immediately available; no auth in the self-hosted single-user path.

    See localgenai/opencode/README.md for how OpenCode is configured to ship traces here.

The official AMD Prometheus exporters (amd-smi-exporter, device-metrics-exporter) are intentionally not deployed — they're broken on gfx1151 (ROCm#6035). If you ever want full Prometheus + Grafana, the migration path is to replace Beszel with node_exporter + a textfile collector reading /sys/class/drm/cardN/device/ and /sys/class/hwmon/.

Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both) based on how you like to drive the agent:

  • OpenWebUI (/srv/docker/openwebui, http://framework:3000) — ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search. Best for casual chat and household-shared LLM access.

    Bring-up: cd /srv/docker/openwebui && docker compose up -d.

  • OpenHands (/srv/docker/openhands, http://framework:3030, loopback-only) — autonomous agent in a Docker sandbox. Spawns a per-conversation agent-server container that can write code, run tests, browse the web. Pre-configured for Ollama at openai/qwen3-coder:30b over the OpenAI-compatible endpoint; ships traces to Phoenix.

    Bring-up:

    cd /srv/docker/openhands && docker compose up -d
    # Tunnel the loopback-bound UI from your laptop:
    ssh -L 3030:127.0.0.1:3030 noise@framework
    open http://localhost:3030
    

    First run pulls the agent-server image (~2 GB) lazily on first conversation, not at startup, so the orchestrator comes up fast but your first message takes 3060 s. Pre-0.44 state path was ~/.openhands-state; not relevant on a fresh install.

    Tool-call quality with local models is much better when Ollama's context is bumped — the compose at /srv/docker/ollama already sets OLLAMA_CONTEXT_LENGTH=65536, which is plenty.

OpenCode (the terminal driver, configured in localgenai/opencode/) runs on the Mac and points at the same Ollama endpoint, so all three harnesses share the same model server.

The llama.cpp image is kyuz0/amd-strix-halo-toolboxes:vulkan-radv, gfx1151-optimized with rocWMMA flash attention (the latter only kicks in on the ROCm tags). Auto-rebuilt against llama.cpp master.