pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB). The host stays minimal — kernel + driver + Docker + diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker compose services, each shipping its own ROCm/Vulkan stack.

Manual prerequisites

  1. Phase 0 — update Framework BIOS, set GPU UMA carve-out (96 GB).
  2. OS install — Ubuntu Server 24.04 LTS. AMD ROCm only ships for jammy/noble; on later Ubuntu releases it installs but breaks the host-side toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user noise. Recommended partitioning: ≥300 GB on /, big disk mounted at /models, plain ext4 (skip LVM).
  3. The host must be reachable at 10.0.0.237 over SSH (edit inventory.py if it moves).
  4. NOPASSWD sudo for noise — pyinfra's fact layer doesn't reliably thread sudo passwords. One-time setup:
    ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
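
    A quick sanity check that the rule took effect (optional):
    ssh noise@10.0.0.237 'sudo -n true && echo passwordless sudo OK'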
    

Run

uv tool install pyinfra
./run.sh           # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry     # any extra args are forwarded to pyinfra

Or run it ephemerally without installing: uvx pyinfra inventory.py deploy.py.
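
Extra arguments are passed straight through, so any standard pyinfra flag works; for example, a more verbose dry run to preview what would change (plain pyinfra CLI flags, nothing repo-specific):

./run.sh --dry -vv     # preview the operations with per-op detail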

What the deploy does

  • Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
  • Tailscale (run sudo tailscale up on the box once, interactively)
  • Docker engine + compose plugin, user added to docker group
  • ROCm host diagnostics only (rocminfo) — no full toolchain
  • GRUB kernel params for Strix Halo perf: amd_iommu=off, amdgpu.gttsize=117760 (per Gygeek/Framework-strix-halo-llm-setup). Requires a reboot to activate — pyinfra rewrites /etc/default/grub and runs update-grub, but won't reboot for you.
  • /models/<vendor>/ layout
  • /srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml dropped in but not auto-started — edit the model path, then docker compose up -d. OpenWebUI needs no edits: it's pre-configured to find Ollama at host.docker.internal:11434 and uses searxng.n0n.io for web search. Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006, OpenHands UI at :3030 — see "Monitoring stack" and "Agent harnesses" below.

If a previous run installed the native llama.cpp build / full ROCm / native Ollama, those are auto-cleaned the next time ./run.sh runs.
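
After the reboot, a quick way to confirm the kernel picked up the GRUB parameters (values as set by the deploy):

grep -o 'amd_iommu=off' /proc/cmdline           # should print amd_iommu=off
grep -o 'amdgpu.gttsize=[0-9]*' /proc/cmdline   # should print amdgpu.gttsize=117760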

After the deploy: starting an inference service

ssh noise@10.0.0.237
sudo tailscale up                            # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml                       # edit the --model path
docker compose up -d
curl localhost:8080/v1/models                # smoke test

Same shape for vllm (port 8000) and ollama (port 11434, no model edit needed — Ollama serves models on demand).
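
As a concrete example of that same flow for Ollama, pull a model into the running container and hit the OpenAI-compatible endpoint (the compose service name is assumed to be ollama; the model tag matches what the agent harnesses below are configured for):

cd /srv/docker/ollama && docker compose up -d
docker compose exec ollama ollama pull qwen3-coder:30b       # fetch a model on demand
curl localhost:11434/v1/models                               # same smoke-test shape as llama.cpp
curl localhost:8000/v1/models                                # vLLM equivalent, once its model path is edited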

Tunables

Tunables live in two places: the top of deploy.py, and the compose templates in compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml — pin image tags in the latter.

Monitoring stack

Two compose stacks watch the inference services:

  • Beszel (/srv/docker/beszel, http://framework:8090) — host + Docker container + AMD GPU dashboard. The agent's amd_sysfs collector reads /sys/class/drm/cardN/device/ directly; this is the only path that reports real numbers on Strix Halo (gfx1151) — amd-smi returns N/A for util/power/temp on this APU.

    First-time bring-up:

    cd /srv/docker/beszel
    docker compose up -d beszel                                 # 1. hub
    # 2. open http://framework:8090 → create admin → "Add system"
    # 3. copy the TOKEN and the SSH KEY from the dialog into .env:
    sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
    BESZEL_TOKEN=<token-from-dialog>
    BESZEL_KEY=ssh-ed25519 AAAA…
    EOF
    sudo chgrp docker /srv/docker/beszel/.env
    sudo chmod 640 /srv/docker/beszel/.env
    docker compose up -d --force-recreate beszel-agent          # 4. agent
    
  • OpenLIT (/srv/docker/openlit, http://framework:3001) — fleet metrics for the LLM stack (cost, tokens, latency aggregated across sessions and models). Auto-instruments Ollama and vLLM via OpenTelemetry without app-code changes; for llama.cpp, the compose file already passes --metrics so OpenLIT can scrape /metrics. ClickHouse-backed. OTLP receivers exposed on the host at :4327 (gRPC) / :4328 (HTTP) — Phoenix owns 4317/4318.

    Bring-up: cd /srv/docker/openlit && docker compose up -d. UI prompts to create an admin account on first load.

  • Phoenix (/srv/docker/phoenix, http://framework:6006) — per-trace agent waterfall / flamegraph. Designed to answer "show me what one OpenCode turn actually did" — full call tree, nested LLM/tool calls, token counts inline. Single container, SQLite-backed. OTLP receivers on the host at :4317 (gRPC) and :6006/v1/traces (HTTP) — Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.

    Bring-up: cd /srv/docker/phoenix && docker compose up -d. UI immediately available; no auth in the self-hosted single-user path.

    See localgenai/opencode/README.md for how OpenCode is configured to ship traces here.
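
With both collectors up, an app-side OpenTelemetry exporter picks its backend purely by endpoint. A minimal sketch using the standard OTel environment variables, assuming the receivers follow the usual OTLP/HTTP paths (hosts and ports as described above):

export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://framework:4328/v1/metrics   # OpenLIT, HTTP OTLP
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://framework:6006/v1/traces     # Phoenix, HTTP OTLP on the UI port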

The official AMD Prometheus exporters (amd-smi-exporter, device-metrics-exporter) are intentionally not deployed — they're broken on gfx1151 (ROCm#6035). If you ever want full Prometheus + Grafana, the migration path is to replace Beszel with node_exporter + a textfile collector reading /sys/class/drm/cardN/device/ and /sys/class/hwmon/.
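
For reference, the sysfs files in question (read today by Beszel's amd_sysfs collector, and what a textfile collector would read later) look roughly like this; the card index varies per machine, so treat the exact paths as a sketch:

cat /sys/class/drm/card1/device/gpu_busy_percent            # GPU utilization, percent
cat /sys/class/drm/card1/device/mem_info_vram_used          # carve-out memory used, bytes
cat /sys/class/drm/card1/device/hwmon/hwmon*/temp1_input    # edge temperature, millidegrees C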

Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both) based on how you like to drive the agent:

  • OpenWebUI (/srv/docker/openwebui, http://framework:3000) — ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search. Best for casual chat and household-shared LLM access.

    Bring-up: cd /srv/docker/openwebui && docker compose up -d.

  • OpenHands (/srv/docker/openhands, http://framework:3030, loopback-only) — autonomous agent in a Docker sandbox. Spawns a per-conversation agent-server container that can write code, run tests, browse the web. Pre-configured for Ollama at openai/qwen3-coder:30b over the OpenAI-compatible endpoint; ships traces to Phoenix.

    Bring-up:

    cd /srv/docker/openhands && docker compose up -d
    # Tunnel the loopback-bound UI from your laptop:
    ssh -L 3030:127.0.0.1:3030 noise@framework
    open http://localhost:3030
    

    The agent-server image (~2 GB) is pulled lazily on the first conversation, not at startup, so the orchestrator comes up fast but your first message takes 30–60 s. The pre-0.44 state path was ~/.openhands-state; not relevant on a fresh install.

    Tool-call quality with local models is much better when Ollama's context is bumped — the compose at /srv/docker/ollama already sets OLLAMA_CONTEXT_LENGTH=65536, which is plenty.
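
    To confirm the setting actually reached the container (the compose service name is assumed to be ollama):

    cd /srv/docker/ollama && docker compose exec ollama printenv OLLAMA_CONTEXT_LENGTH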

OpenCode (the terminal driver, configured in localgenai/opencode/) runs on the Mac and points at the same Ollama endpoint, so all three harnesses share the same model server.
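
A quick check from the Mac that the shared endpoint is reachable (assumes the box resolves as framework, e.g. via Tailscale MagicDNS or an /etc/hosts entry):

curl http://framework:11434/v1/models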

The llama.cpp image is kyuz0/amd-strix-halo-toolboxes:vulkan-radv, gfx1151-optimized with rocWMMA flash attention (the latter only kicks in on the ROCm tags). Auto-rebuilt against llama.cpp master.