Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:35:10 -04:00
commit 2c4bfefa95
36 changed files with 5265 additions and 0 deletions

pyinfra/HowToScan.md Normal file

@@ -0,0 +1,82 @@
# How to onboard an existing host into pyinfra
There's no auto-generator that reads a running system and emits a
`deploy.py` — not in pyinfra, not in Ansible, not in any mainstream IaC
tool. Reverse-engineering what's intentional vs. incidental needs human
judgment. Below is the workflow that actually works.
## What pyinfra gives you for free
The **facts** API queries live state without writing any deploy:
```sh
pyinfra @host fact deb.DebPackages # installed packages
pyinfra @host fact apt.AptSources # extra apt repos
pyinfra @host fact systemd.SystemdServices # service state
pyinfra @host fact server.Crontab # cron
pyinfra @host fact server.Users # users
```
Output is JSON. There's no fact → op translator, but the data is easy
to grep through and tells you what's actually on the box.
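For example, to pull just the installed-package names out of the fact
output (assuming `jq` on your workstation; the JSON is keyed by host):
```sh
pyinfra noise@10.0.0.237 fact deb.DebPackages > packages.json
# First (only) host's fact payload, then package names, filtered:
jq -r 'to_entries[0].value | keys[]' packages.json | grep -i rocm
```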
## Practical workflow
1. Run a scanner against the live host that dumps the *meaningful*
state — packages explicitly installed, enabled services, custom
configs, dotfiles, sudoers fragments, fstab non-defaults. Skip the
distro-default noise.
2. Hand-pick the bits worth managing into a new `<station>/deploy.py`.
3. Run with `--dry` against the live box; pyinfra's diff shows what's
still drifting.
4. Iterate until `--dry` reports "no changes."
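Steps 3–4 of the workflow above, concretely, for a hypothetical
`station/` folder following this repo's layout:
```sh
cd pyinfra/station
pyinfra inventory.py deploy.py --dry   # diff only; nothing is changed
# edit deploy.py, re-run --dry until it reports no changes, then apply:
pyinfra inventory.py deploy.py
```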
## One-shot scanner
```sh
ssh you@host '
echo "=== manually-installed packages ==="
apt-mark showmanual
echo "=== enabled services ==="
systemctl list-unit-files --state=enabled --type=service --no-legend | awk "{print \$1}"
echo "=== extra apt repos ==="
ls /etc/apt/sources.list.d/
echo "=== sudoers fragments ==="
ls /etc/sudoers.d/
echo "=== fstab non-default ==="
grep -vE "^#|^$|/proc|/sys|/dev/pts" /etc/fstab
echo "=== users with login shells ==="
getent passwd | awk -F: "\$3>=1000 && \$3<60000 {print \$1}"
echo "=== docker stacks ==="
ls /srv/docker/ 2>/dev/null
echo "=== home dotfiles ==="
sudo find /home -maxdepth 3 -name ".*rc" -o -name ".*config" 2>/dev/null
'
```
That surfaces roughly 80% of what's worth managing, and the result
typically fits in ~30 lines of pyinfra. The remaining 20% is
case-by-case (kernel cmdline tweaks, hand-edited /etc files, dotfile
contents) that no scanner fully captures — you write those as you
re-discover the box's quirks.
## What scanners can't catch
- The *intent* behind a config (why this `sysctl` value, why this
cron entry).
- Hand-edited fragments mixed into otherwise-default files (e.g. a
custom line in `/etc/ssh/sshd_config`).
- Out-of-band state: data in databases, container volumes,
`/var/lib/<service>`.
For these, treat pyinfra as managing the *bones* (packages, services,
users, configs) and leave the data alone. Re-deploying onto fresh
hardware should give you a working system; restoring data is a
separate concern.
## If "scan and reproduce" is a hard requirement
The native answer is **NixOS**. The whole OS is declarative;
`nixos-generate-config` reads your machine and writes a full config
that recreates it. Different paradigm, big switch — but the right
answer if reproducible-from-a-file is core to your workflow and you
aren't already deep in Ubuntu-land.
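For scale, the NixOS equivalent of this whole document is one command,
run from the installer with `/mnt` as the target root:
```sh
# Writes configuration.nix plus hardware-configuration.nix describing
# the detected filesystems, kernel modules, and firmware quirks.
nixos-generate-config --root /mnt
```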

pyinfra/README.md Normal file

@@ -0,0 +1,12 @@
# pyinfra
One folder per station. Each subfolder is a self-contained pyinfra
deploy: `inventory.py`, `deploy.py`, `run.sh`, plus any compose files
or assets that ship to the host.
| Station | Host | Notes |
|---------|------|-------|
| [`framework/`](framework/README.md) | `10.0.0.237` | Framework Desktop (Strix Halo, 128 GB) — local LLM box |
To bring up a station, `cd` into its folder and run `./run.sh`. See the
station's own README for prerequisites.

pyinfra/framework/README.md Normal file

@@ -0,0 +1,184 @@
# pyinfra: Strix Halo bring-up
Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.
## Manual prerequisites
1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
jammy/noble; it installs on later Ubuntu releases but breaks the
host-side toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user `noise`.
Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`,
plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
thread sudo passwords. One-time setup:
```sh
ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
```
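Verify it took: `sudo -n` fails instead of prompting if NOPASSWD isn't in effect.
```sh
ssh noise@10.0.0.237 'sudo -n true && echo NOPASSWD OK'
```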
## Run
```sh
uv tool install pyinfra
./run.sh # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry # any extra args are forwarded to pyinfra
```
Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
## What the deploy does
- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
`amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml`
dropped in, not auto-started — you edit the model path then
`docker compose up -d`. OpenWebUI needs no edits: it's pre-configured to
find Ollama at `host.docker.internal:11434` and uses `searxng.n0n.io`
for web search. Beszel's hub is at :8090, OpenLIT's UI at :3001,
Phoenix's UI at :6006, and OpenHands' UI at :3030; see "Monitoring
stack" and "Agent harnesses" below.
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.
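After the post-deploy reboot, confirm the kernel picked up the GRUB tuning:
```sh
ssh noise@10.0.0.237 cat /proc/cmdline
# expect: ... amd_iommu=off amdgpu.gttsize=117760
```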
## After the deploy: starting an inference service
```sh
ssh noise@10.0.0.237
sudo tailscale up # one-time, interactive
# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml # edit the --model path
docker compose up -d
curl localhost:8080/v1/models # smoke test
```
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
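A quick smoke test across all three engines once they're up:
```sh
curl -s localhost:8080/v1/models    # llama.cpp
curl -s localhost:8000/v1/models    # vLLM
curl -s localhost:11434/api/tags    # Ollama native API: lists pulled models
```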
## Tunables
Top of `deploy.py`:
- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
release. The .deb filename has a build suffix that doesn't derive from
the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
https://github.com/Umio-Yasuno/amdgpu_top/releases.
Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`
— pin tags here.
## Monitoring stack
Two compose stacks watch the inference services:
- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
`/sys/class/drm/cardN/device/` directly; this is the only path that
reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
for util/power/temp on this APU.
First-time bring-up:
```sh
cd /srv/docker/beszel
docker compose up -d beszel # 1. hub
# 2. open http://framework:8090 → create admin → "Add system"
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
BESZEL_TOKEN=<token-from-dialog>
BESZEL_KEY=ssh-ed25519 AAAA…
EOF
sudo chgrp docker /srv/docker/beszel/.env
sudo chmod 640 /srv/docker/beszel/.env
docker compose up -d --force-recreate beszel-agent # 4. agent
```
- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
metrics for the LLM stack (cost, tokens, latency aggregated across
sessions and models). Auto-instruments Ollama and vLLM via
OpenTelemetry without app-code changes; for llama.cpp, the compose
file already passes `--metrics` so OpenLIT can scrape `/metrics`.
ClickHouse-backed. OTLP receivers exposed on the host at **:4327
(gRPC) / :4328 (HTTP)** — Phoenix owns the canonical :4317 and serves
HTTP OTLP on :6006.
Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
to create an admin account on first load.
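To confirm the remapped receivers are reachable from off-box (any HTTP
status code proves the listener is up; an empty OTLP batch should
return 200):
```sh
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://framework:4328/v1/traces \
  -H 'Content-Type: application/json' -d '{}'
```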
- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
agent waterfall / flamegraph. Designed to answer "show me what one
OpenCode turn actually did" — full call tree, nested LLM/tool calls,
token counts inline. Single container, SQLite-backed. OTLP receivers
on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.
Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
immediately available; no auth in the self-hosted single-user path.
See [`localgenai/opencode/README.md`](../../opencode/README.md) for
how OpenCode is configured to ship traces here.
The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.
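If you ever make that migration, the textfile collector is a small
cron-able script — a sketch, with hypothetical metric names and the
usual node_exporter textfile directory assumed:
```sh
#!/usr/bin/env bash
set -euo pipefail
card=/sys/class/drm/card1/device                 # adjust card index
out=/var/lib/node_exporter/textfile/amdgpu.prom  # assumed collector dir
{
  echo "amdgpu_busy_percent $(cat "$card/gpu_busy_percent")"
  echo "amdgpu_vram_used_bytes $(cat "$card/mem_info_vram_used")"
  echo "amdgpu_gtt_used_bytes $(cat "$card/mem_info_gtt_used")"
} > "$out.tmp"
mv "$out.tmp" "$out"   # atomic rename so scrapes never see partial files
```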
## Agent harnesses
Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:
- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
Best for casual chat and household-shared LLM access.
Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.
- **OpenHands** (`/srv/docker/openhands`, http://framework:3030,
loopback-only) — autonomous agent in a Docker sandbox. Spawns a
per-conversation `agent-server` container that can write code, run
tests, browse the web. Pre-configured for Ollama at
`openai/qwen3-coder:30b` over the OpenAI-compatible endpoint;
ships traces to Phoenix.
Bring-up:
```sh
cd /srv/docker/openhands && docker compose up -d
# Tunnel the loopback-bound UI from your laptop:
ssh -L 3030:127.0.0.1:3030 noise@framework
open http://localhost:3030
```
The agent-server image (~2 GB) is pulled lazily on the first
conversation, not at startup, so the orchestrator comes up fast but
your first message takes 30–60 s. The pre-0.44 state path was
`~/.openhands-state`; not relevant on a fresh install.
Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.
OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.
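All three ride the same OpenAI-compatible endpoint, so any client works
against it — e.g. from the Mac (model name per the OpenHands compose
config):
```sh
curl -s http://framework:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3-coder:30b", "messages": [{"role": "user", "content": "hello"}]}'
```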
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.

pyinfra/framework/compose/beszel.yml Normal file

@@ -0,0 +1,66 @@
# Beszel — host + container + GPU dashboard.
# https://beszel.dev
#
# Picked over Prometheus+Grafana for this box because:
# - The agent's `amd_sysfs` collector reads /sys/class/drm/card*/device/
# directly, which is the only reliable GPU metric source on Strix Halo
# (gfx1151). AMD's amd-smi / Device Metrics Exporter return N/A for
# util/power/temp on this APU (ROCm#6035), so the official Prometheus
# exporter path is dead.
# - Two containers vs six.
#
# First-time setup (WebSocket connection model — current Beszel default):
# 1. `docker compose up -d beszel` (start the hub)
# 2. Open http://framework:8090, create the admin account
# 3. Click "Add system" — the dialog gives you a TOKEN and an SSH KEY.
# 4. Edit /srv/docker/beszel/.env (created empty by pyinfra; pyinfra
# doesn't overwrite). Add:
# BESZEL_TOKEN=<token-from-dialog>
# BESZEL_KEY=ssh-ed25519 AAAA…
# 5. `docker compose up -d --force-recreate beszel-agent`
#
# Docker Compose auto-reads the sibling .env file for ${VAR} interpolation
# in the environment block below — so secrets stay out of the compose
# file (which pyinfra overwrites) but the env-var names match exactly
# what the agent expects.
#
# Why both TOKEN and KEY: TOKEN identifies which system this agent is,
# KEY authenticates the agent (the SSH key is reused as the auth secret
# in the WebSocket handshake). Rotate either by editing the .env and
# `docker compose up -d --force-recreate`.
services:
  beszel:
    image: henrygd/beszel:latest
    container_name: beszel
    restart: unless-stopped
    ports:
      - "8090:8090"
    volumes:
      - /srv/docker/beszel/data:/beszel_data

  beszel-agent:
    image: henrygd/beszel-agent:latest
    container_name: beszel-agent
    restart: unless-stopped
    # Host networking so the agent sees real CPU/memory/network counters
    # without bridge-NAT distortion.
    network_mode: host
    volumes:
      # Read-only Docker socket for per-container CPU/mem/net.
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Sysfs paths the AMD GPU collector reads.
      - /sys/class/drm:/sys/class/drm:ro
      - /sys/class/hwmon:/sys/class/hwmon:ro
    environment:
      # Pulled from /srv/docker/beszel/.env at compose-parse time.
      TOKEN: "${BESZEL_TOKEN:-}"
      KEY: "${BESZEL_KEY:-}"
      # WebSocket dial-out target — the hub on this same host. The agent
      # is on host networking, so localhost is the host machine, where
      # the hub container exposes port 8090.
      HUB_URL: "http://localhost:8090"
      # Optional fallback: legacy SSH listener for hub-initiated probing.
      # Harmless to keep — hub only uses it if WebSocket is unreachable.
      LISTEN: "45876"
      # Enable the AMD sysfs GPU collector.
      GPU: "true"

pyinfra/framework/compose/llama.yml Normal file

@@ -0,0 +1,44 @@
# llama.cpp server, gfx1151-optimized via kyuz0's Strix Halo toolboxes.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
#
# Tag options on docker.io/kyuz0/amd-strix-halo-toolboxes:
# vulkan-radv — most stable, recommended default (this one)
# vulkan-amdvlk — alternate Vulkan driver, sometimes faster
# rocm-7.2.2 — ROCm 7.x; needs /dev/kfd + render group_add (see vllm.yml pattern)
# rocm-6.4.4 — ROCm 6.x fallback
# rocm7-nightlies — avoid: caps memory allocation to 64 GB (May 2026)
#
# Toolbox images use a shell entrypoint, so we override to launch
# llama-server directly. Edit the --model path before `docker compose up -d`.
services:
  llama:
    image: kyuz0/amd-strix-halo-toolboxes:vulkan-radv
    container_name: llama
    restart: unless-stopped
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /models:/models:ro
    ports:
      - "8080:8080"
    entrypoint: ["llama-server"]
    command:
      - --model
      - /models/REPLACE/ME/model.gguf
      - --host
      - 0.0.0.0
      - --port
      - "8080"
      - --n-gpu-layers
      - "999"
      - --ctx-size
      - "32768"
      # Required for GPU backends on Strix Halo per Gygeek's setup
      # guide. Forces full load into GPU memory rather than mmap.
      - --no-mmap
      # Flash attention — works on Vulkan too; the big win is on the
      # ROCm tag where kyuz0's build has rocWMMA acceleration.
      - --flash-attn
      # Expose Prometheus metrics at /metrics — scraped by OpenLIT for
      # tokens/sec, KV-cache use, queue depth, and request latency.
      - --metrics

pyinfra/framework/compose/ollama.yml Normal file

@@ -0,0 +1,38 @@
# Ollama, ROCm backend. Serves models on demand — safe to start before
# you've put anything in /models.
#
# Storage: Ollama's content-addressed blob store is bind-mounted under
# /models/ollama so all model data on the host lives under /models.
# Note: Ollama's blobs are SHA256-named, not raw GGUFs — llama.cpp/vLLM
# can't load them directly. Keep curated GGUFs at /models/<vendor>/...
# for those engines.
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    # Numeric GIDs of host's video (44) and render (991) groups — names
    # don't exist inside the container, but the GIDs need to match the
    # host so /dev/kfd + /dev/dri are accessible.
    group_add:
      - "44"
      - "991"
    environment:
      # Strix Halo's iGPU is gfx1151 (RDNA 3.5), which Ollama's bundled
      # ROCm runtime doesn't recognize — without this override it falls
      # back to CPU silently. 11.0.0 = gfx1100 (Navi 31); the RDNA 3.x
      # ISAs are close enough that gfx1100 kernels run on gfx1151.
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
      # Default context. 256K (the upstream default for Qwen3-Coder)
      # blows the KV cache up to ~25-30 GB and forces Ollama to split
      # layers between GPU and CPU. 64K keeps the model fully on GPU
      # while still being plenty for coding contexts.
      - OLLAMA_CONTEXT_LENGTH=65536
    volumes:
      - /models/ollama:/root/.ollama
      - /models:/models:ro
    ports:
      - "11434:11434"

pyinfra/framework/compose/openhands.yml Normal file

@@ -0,0 +1,94 @@
# OpenHands 1.7 (May 2026) — autonomous agent in a Docker sandbox.
# https://docs.openhands.dev — repo: github.com/OpenHands/OpenHands
#
# Architecture: this container is a thin orchestrator. Per conversation
# it spawns a separate `agent-server` container on the host Docker daemon
# (that's what the docker.sock mount is for) and talks to it over REST.
# AGENT_SERVER_IMAGE_TAG below pins the per-session sandbox image.
#
# Complements OpenCode: OpenCode is the interactive terminal driver,
# OpenHands is for autonomous loops (write code, run tests, browse the
# web in a sandbox, report back).
services:
  openhands:
    # Org rebranded All-Hands-AI → OpenHands at v1.0 (Dec 2025); the old
    # docker.all-hands.dev/all-hands-ai/openhands image is gone.
    image: docker.openhands.dev/openhands/openhands:1.7
    container_name: openhands
    restart: unless-stopped
    # 3030 host-side because :3000 is OpenWebUI and :3001 is OpenLIT.
    # Loopback-only — reach via SSH tunnel or Tailscale, don't expose
    # this directly.
    ports:
      - "127.0.0.1:3030:3000"
    volumes:
      # Required: orchestrator spawns sandbox containers via the host daemon.
      - /var/run/docker.sock:/var/run/docker.sock
      # State, settings, conversation history, MCP config, secrets.
      # Pre-0.44 used ~/.openhands-state — N/A on a fresh install.
      - /srv/docker/openhands/state:/.openhands
      # Workspace the sandbox reads/writes. The host path on the LEFT must
      # match SANDBOX_VOLUMES below — the sandbox container is spawned by
      # the host daemon, so its bind mount is resolved on the host, not
      # via this container's filesystem.
      - /srv/docker/openhands/workspace:/srv/docker/openhands/workspace
    # Linux Docker doesn't auto-provide host.docker.internal; this fixes it.
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      # ---- Sandbox / agent-server image pin ----
      # Replaces the V0.x SANDBOX_RUNTIME_CONTAINER_IMAGE. 1.19.1-python is
      # the agent-server tag the 1.7 main image expects; bumping the main
      # image will likely want a newer agent-server tag — check the
      # upstream docker-compose.yml on each upgrade.
      AGENT_SERVER_IMAGE_REPOSITORY: ghcr.io/openhands/agent-server
      AGENT_SERVER_IMAGE_TAG: 1.19.1-python
      # ---- Workspace mount into the per-session sandbox ----
      # SANDBOX_VOLUMES is the V1 replacement for the deprecated
      # WORKSPACE_BASE / WORKSPACE_MOUNT_PATH variables.
      SANDBOX_VOLUMES: /srv/docker/openhands/workspace:/workspace:rw
      # Match the host's `noise` UID so files the agent writes aren't
      # owned by root.
      SANDBOX_USER_ID: "1000"
      # ---- LLM: host Ollama via OpenAI-compatible endpoint ----
      # Per the official local-llms doc, the recommended path is the
      # /v1 OpenAI-compatible endpoint with the `openai/` LiteLLM prefix
      # — NOT `ollama/...`, which has worse tool-call behaviour.
      LLM_MODEL: "openai/qwen3-coder:30b"
      LLM_BASE_URL: "http://host.docker.internal:11434/v1"
      LLM_API_KEY: "ollama"  # any non-empty string; Ollama doesn't auth.
      # Default tool-calling renderer mismatches Qwen3-Coder's training
      # format and produces malformed calls (issue #8140). Forcing false
      # falls back to OpenHands' prompt-based protocol — costs some token
      # efficiency, gains reliability with local models.
      LLM_NATIVE_TOOL_CALLING: "false"
      LOG_ALL_EVENTS: "true"
      # ---- Optional: ship traces to Phoenix ----
      # OpenHands V1 uses LiteLLM + OpenTelemetry; standard OTLP env vars
      # are honoured. Phoenix 15.x serves HTTP OTLP on its UI port
      # (:6006/v1/traces), not a separate 4318. Comment out to disable.
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://host.docker.internal:6006"
      OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
      OTEL_SERVICE_NAME: "openhands"
    # Per-session agent-server containers spawn headless chromium for
    # browser tasks; default 64 MB shm causes silent crashes.
    shm_size: "2gb"
    # Playwright/chromium needs higher fd limits than Docker's default.
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
    # Bridge networking is correct here. Don't switch to network_mode: host
    # — the spawned sandbox containers reach this orchestrator via Docker
    # bridge DNS, which only works on a bridge network.

pyinfra/framework/compose/openlit.yml Normal file

@@ -0,0 +1,59 @@
# OpenLIT — LLM observability (traces, costs, KV-cache, prompt/decode
# latencies, tokens/sec). https://openlit.io
#
# Two services:
# - clickhouse : columnar store for traces (internal only, no host port)
# - openlit : Next.js UI on :3001 (3000 is OpenWebUI)
#
# Why OpenLIT vs Langfuse/Phoenix/Laminar: it's the only OSS dashboard
# (May 2026) that auto-instruments Ollama AND vLLM via OpenTelemetry
# without adding code to client apps. For llama.cpp, start the server
# with --metrics (see ../llama/docker-compose.yml) and OpenLIT can scrape
# /metrics.
#
# To send traces from a Python script calling Ollama/vLLM:
# pip install openlit
# python -c "import openlit; openlit.init(otlp_endpoint='http://framework:4318')"
#
# To wire OpenWebUI → OpenLIT, install OpenLIT's pipeline middleware
# in OpenWebUI per https://openlit.io/blogs/openlit-openwebui.
services:
  clickhouse:
    image: clickhouse/clickhouse-server:25.3-alpine
    container_name: openlit-clickhouse
    restart: unless-stopped
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: OPENLIT
      CLICKHOUSE_DB: openlit
    volumes:
      - /srv/docker/openlit/clickhouse:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  openlit:
    image: ghcr.io/openlit/openlit:latest
    container_name: openlit
    restart: unless-stopped
    depends_on:
      - clickhouse
    ports:
      # Host:container — UI on 3001 (OpenWebUI owns 3000).
      - "3001:3000"
      # OTLP receivers exposed on the host so SDKs running off-box can
      # ship traces here. gRPC + HTTP. Remapped (4327/4328 → 4317/4318)
      # because Phoenix owns the canonical :4317 for OpenCode traces
      # (its HTTP OTLP rides on :6006) — OpenLIT here is a
      # secondary/fleet-metrics destination.
      - "4327:4317"
      - "4328:4318"
    environment:
      INIT_DB_HOST: clickhouse
      INIT_DB_PORT: "8123"
      INIT_DB_USERNAME: default
      INIT_DB_PASSWORD: OPENLIT
      INIT_DB_DATABASE: openlit
      SQLITE_DATABASE_URL: file:/app/client/data/data.db
    volumes:
      - /srv/docker/openlit/data:/app/client/data

pyinfra/framework/compose/openwebui.yml Normal file

@@ -0,0 +1,25 @@
# OpenWebUI — ChatGPT-like web UI in front of Ollama. Pre-configured to
# use the host's Ollama instance and the project's SearXNG for web
# search. Default port 3000.
#
# Persistent state (users, conversations, uploaded docs, RAG vector
# index) lives at /srv/docker/openwebui/data so backups touch one path.
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    ports:
      - "3000:8080"
    extra_hosts:
      # Lets the container reach Ollama on the host's :11434 without
      # needing to share Docker networks.
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      # Built-in web search via the project's SearXNG instance.
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_WEB_SEARCH_ENGINE=searxng
      - SEARXNG_QUERY_URL=https://searxng.n0n.io/search?q=<query>&format=json
    volumes:
      - /srv/docker/openwebui/data:/app/backend/data

pyinfra/framework/compose/phoenix.yml Normal file

@@ -0,0 +1,35 @@
# Arize Phoenix — per-trace agent waterfall / flamegraph viz.
# https://github.com/Arize-ai/phoenix
#
# Picked over Langfuse for "show me one OpenCode turn as a tree":
# - Single container vs Langfuse's six (Postgres+ClickHouse+Redis+MinIO+web+worker).
# - First-class ingestion of Vercel AI SDK spans (which is what OpenCode
# emits under the hood when experimental.openTelemetry=true).
# - Best-in-class waterfall + agent-graph view for nested LLM/tool calls.
#
# Complements OpenLIT, doesn't replace it: OpenLIT is the fleet-metrics
# layer (cost / tokens / latency aggregated across sessions). Phoenix is
# the per-prompt debugger (see what one turn actually did).
#
# Bring-up: `docker compose up -d` — no first-run setup needed; UI prompts
# for project name on first trace ingest. Storage is SQLite at /data.
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    container_name: phoenix
    restart: unless-stopped
    ports:
      # UI + OTLP/HTTP both ride on 6006 in Phoenix 15.x — HTTP traces go
      # to http://framework:6006/v1/traces. (Pre-15 had a separate 4318;
      # the consolidation happened in Phoenix v15.0.)
      - "6006:6006"
      # OTLP/gRPC stays separate.
      - "4317:4317"
    environment:
      PHOENIX_WORKING_DIR: /data
      # Phoenix listens on all interfaces by default; explicit for clarity.
      PHOENIX_HOST: 0.0.0.0
      PHOENIX_PORT: "6006"
      PHOENIX_GRPC_PORT: "4317"
    volumes:
      - /srv/docker/phoenix/data:/data

pyinfra/framework/compose/vllm.yml Normal file

@@ -0,0 +1,36 @@
# vLLM, ROCm backend.
#
# NOTE: vLLM's official ROCm support targets datacenter cards (MI300X /
# gfx942). Strix Halo is gfx1151 — support varies by image tag and
# release. If `rocm/vllm:latest` doesn't run on this iGPU, try
# `rocm/vllm-dev:nightly` or build from source against ROCm 7.x.
services:
  vllm:
    image: rocm/vllm:latest
    container_name: vllm
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    # Numeric GIDs of host's video (44) and render (991) groups — names
    # don't exist inside the container.
    group_add:
      - "44"
      - "991"
    shm_size: 16g
    ipc: host
    volumes:
      - /models:/models:ro
    ports:
      - "8000:8000"
    command:
      - --model
      - /models/REPLACE/ME
      - --host
      - 0.0.0.0
      - --port
      - "8000"

pyinfra/framework/deploy.py Normal file

@@ -0,0 +1,487 @@
"""
pyinfra deploy for Framework Desktop (Ryzen AI Max+ 395 / Strix Halo).
Containerized layout: inference engines (llama.cpp, vLLM, Ollama) run as
docker compose services, each shipping its own ROCm/Vulkan stack. The
host stays minimal — kernel + driver + diagnostics + Docker.
Run with:
./run.sh # equivalent to pyinfra inventory.py deploy.py
Idempotent — re-running only does work that's actually needed. Includes
one-shot cleanup steps for artifacts left over from the earlier native
build (llama.cpp git checkout, symlinks, systemd unit, full ROCm).
"""
from io import StringIO
from pyinfra.operations import apt, files, server, systemd
from pyinfra import host
# --- Tunables ----------------------------------------------------------------
# Latest stable as of 2026-04-30. Verify at https://repo.radeon.com/amdgpu-install/.
ROCM_VERSION = "7.2.3"
# The .deb filename has an opaque build suffix — find the exact name at
# the URL above and paste it here.
AMDGPU_INSTALL_DEB = "amdgpu-install_7.2.3.70203-1_all.deb"
# amdgpu_top — modern GPU monitor that handles new AMD cards / APUs
# (Strix Halo / gfx1151). Replaces radeontop, which doesn't know about
# RDNA 3.5. Verify at https://github.com/Umio-Yasuno/amdgpu_top/releases.
AMDGPU_TOP_VERSION = "0.11.4-1"
AMDGPU_TOP_DEB = f"amdgpu-top_without_gui_{AMDGPU_TOP_VERSION}_amd64.deb"
SSH_USER = host.data.get("ssh_user", "noise")
MODELS_DIR = "/models"
# /srv is the FHS-blessed location for "data and configuration for
# services this system provides." Owned root:docker with setgid +
# group-write so any docker-group member can manage compose stacks.
COMPOSE_DIR = "/srv/docker"
# --- Phase 1 — Base OS basics ------------------------------------------------
apt.update(name="apt update", _sudo=True)
apt.packages(
name="HWE kernel (>=6.11 for ROCm 7)",
packages=["linux-generic-hwe-24.04"],
_sudo=True,
)
# User basics + monitoring tools.
apt.packages(
name="Base CLI tools",
# radeontop intentionally omitted — it predates RDNA 3.5 / Strix Halo
# and just errors with "no VRAM support". amdgpu_top installed below.
packages=[
"tmux",
"vim",
"htop",
"btop",
"nvtop",
"git",
"curl",
"ca-certificates",
"unzip",
],
_sudo=True,
)
# Vulkan diagnostics on the host (containers ship their own runtime).
apt.packages(
name="Vulkan host diagnostics",
packages=["mesa-vulkan-drivers", "vulkan-tools"],
_sudo=True,
)
# uv (Python package/tool manager).
server.shell(
name="Install / upgrade uv",
commands=[
"curl -LsSf https://astral.sh/uv/install.sh | "
"env UV_INSTALL_DIR=/usr/local/bin UV_UNMANAGED_INSTALL=1 sh",
],
_sudo=True,
)
# huggingface_hub CLI for `huggingface-cli download <repo> --local-dir ...`
# Lands at ~/.local/bin/huggingface-cli for the SSH user. Other users can
# repeat the command themselves.
server.shell(
name="Install / upgrade huggingface_hub CLI",
commands=["uv tool install --upgrade 'huggingface_hub[cli]'"],
_sudo=True,
_sudo_user=SSH_USER,
)
# gpakosz/.tmux config (Oh My Tmux!).
TMUX_CONF_DIR = f"/home/{SSH_USER}/.tmux"
server.shell(
name="Clone gpakosz/.tmux",
commands=[
f"test -d {TMUX_CONF_DIR}/.git || "
f"git clone --depth 1 https://github.com/gpakosz/.tmux.git {TMUX_CONF_DIR}",
],
_sudo=True,
_sudo_user=SSH_USER,
)
files.link(
name="Symlink ~/.tmux.conf -> ~/.tmux/.tmux.conf",
path=f"/home/{SSH_USER}/.tmux.conf",
target=f"{TMUX_CONF_DIR}/.tmux.conf",
user=SSH_USER,
group=SSH_USER,
_sudo=True,
)
# Seed the user override file only if absent — preserves customizations.
server.shell(
name="Seed ~/.tmux.conf.local (if missing)",
commands=[
f"test -f /home/{SSH_USER}/.tmux.conf.local || "
f"cp {TMUX_CONF_DIR}/.tmux.conf.local /home/{SSH_USER}/",
],
_sudo=True,
_sudo_user=SSH_USER,
)
# --- Tailscale ---------------------------------------------------------------
# Tailscale's noble repo works fine on later Ubuntus — the package is
# self-contained Go binaries.
files.download(
name="Fetch Tailscale apt key",
src="https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg",
dest="/usr/share/keyrings/tailscale-archive-keyring.gpg",
mode="644",
_sudo=True,
)
files.download(
name="Fetch Tailscale apt list",
src="https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list",
dest="/etc/apt/sources.list.d/tailscale.list",
mode="644",
_sudo=True,
)
apt.update(name="apt update (Tailscale)", _sudo=True)
apt.packages(name="Install Tailscale", packages=["tailscale"], _sudo=True)
systemd.service(
name="Enable tailscaled",
service="tailscaled",
running=True,
enabled=True,
_sudo=True,
)
# `sudo tailscale up` is interactive (browser auth) — run manually once.
# --- Docker -----------------------------------------------------------------
apt.packages(
name="Install Docker + compose plugin",
packages=["docker.io", "docker-compose-v2"],
_sudo=True,
)
systemd.service(
name="Enable docker daemon",
service="docker",
running=True,
enabled=True,
_sudo=True,
)
# lazydocker (TUI for docker). System-wide install via official script.
server.shell(
name="Install / upgrade lazydocker",
commands=[
"curl -fsSL "
"https://raw.githubusercontent.com/jesseduffield/lazydocker/master/scripts/install_update_linux.sh "
"| DIR=/usr/local/bin bash",
],
_sudo=True,
)
# --- GPU access (host kernel/driver bits only) ------------------------------
# AMD's amdgpu-install package adds the ROCm apt repo; we use it just to
# get rocminfo for host-side diagnostics. Containers ship
# their own ROCm.
files.directory(
name="apt keyring dir",
path="/etc/apt/keyrings",
mode="755",
_sudo=True,
)
amdgpu_deb = f"/tmp/{AMDGPU_INSTALL_DEB}"
amdgpu_url = (
f"https://repo.radeon.com/amdgpu-install/{ROCM_VERSION}/ubuntu/noble/"
f"{AMDGPU_INSTALL_DEB}"
)
server.shell(
name="Fetch amdgpu-install .deb",
commands=[f"test -f {amdgpu_deb} || curl -fsSL {amdgpu_url} -o {amdgpu_deb}"],
_sudo=True,
)
server.shell(
name="Install amdgpu-install package",
commands=[f"apt install -y {amdgpu_deb}"],
_sudo=True,
)
# Idempotent cleanup: if a prior run installed the full ROCm userspace
# (~25 GB), tear it down before installing the diagnostic-only subset.
# On a fresh box this is a no-op.
server.shell(
name="Remove full ROCm install if present",
commands=[
"if dpkg -l rocm-dev 2>/dev/null | grep -q '^ii'; then "
" amdgpu-install -y --uninstall || true; "
"fi",
],
_sudo=True,
)
apt.packages(
name="ROCm host diagnostics (rocminfo)",
# rocminfo is the stable diagnostic. The SMI tool's package name has
# churned across ROCm releases (rocm-smi-lib → amd-smi-lib in 7.x);
# install on demand if you need it.
packages=["rocminfo"],
_sudo=True,
)
# amdgpu_top — fetch + install the .deb if not already at the pinned
# version. Replaces radeontop for newer AMD cards (Strix Halo / gfx1151).
amdgpu_top_deb = f"/tmp/{AMDGPU_TOP_DEB}"
amdgpu_top_url = (
f"https://github.com/Umio-Yasuno/amdgpu_top/releases/download/"
f"v{AMDGPU_TOP_VERSION.split('-')[0]}/{AMDGPU_TOP_DEB}"
)
server.shell(
name="Install / upgrade amdgpu_top",
commands=[
f"dpkg -s amdgpu-top 2>/dev/null | grep -q '^Version: {AMDGPU_TOP_VERSION}' || "
f"(test -f {amdgpu_top_deb} || curl -fsSL {amdgpu_top_url} -o {amdgpu_top_deb}; "
f"apt install -y {amdgpu_top_deb})",
],
_sudo=True,
)
# Group membership for /dev/kfd + /dev/dri access (needed for GPU passthrough
# into containers, and for unprivileged host-side rocminfo).
server.group(name="ensure render group", group="render", _sudo=True)
server.group(name="ensure video group", group="video", _sudo=True)
server.user(
name="Add login user to render/video/docker",
user=SSH_USER,
groups=["render", "video", "docker"],
append=True,
_sudo=True,
)
# Kernel cmdline tuning per Gygeek/Framework-strix-halo-llm-setup:
# - amd_iommu=off — ~6 % memory-read improvement on Strix Halo
# - amdgpu.gttsize=117760 — ~115 GB GTT ceiling so the GPU can borrow
# most of system RAM dynamically. Acts as a
# ceiling, not an allocation. See ../../StrixHaloMemory.md
# for the UMA-vs-GTT trade-off discussion.
# Requires a reboot to take effect; pyinfra leaves that to you.
files.line(
name="GRUB cmdline (amd_iommu, gttsize)",
path="/etc/default/grub",
line=r"^GRUB_CMDLINE_LINUX_DEFAULT=.*",
replace='GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"',
_sudo=True,
)
server.shell(
name="Regenerate GRUB config",
commands=["update-grub"],
_sudo=True,
)
# --- Storage layout ---------------------------------------------------------
files.directory(
name="/models root",
path=MODELS_DIR,
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
for sub in ("moonshotai", "qwen", "deepseek", "zai", "mistralai"):
files.directory(
name=f"{MODELS_DIR}/{sub}",
path=f"{MODELS_DIR}/{sub}",
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
# Ollama bind-mounts its content-addressed store here.
files.directory(
name=f"{MODELS_DIR}/ollama",
path=f"{MODELS_DIR}/ollama",
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
# --- Compose files for inference services ----------------------------------
files.directory(
name="Compose root",
path=COMPOSE_DIR,
group="docker",
mode="2775", # setgid: new files inherit docker group
_sudo=True,
)
# Idempotent cleanup of earlier locations.
files.directory(
name="Remove old /srv/compose",
path="/srv/compose",
present=False,
_sudo=True,
)
files.directory(
name="Remove old ~/docker compose dir",
path=f"/home/{SSH_USER}/docker",
present=False,
_sudo=True,
)
for svc in (
"llama",
"vllm",
"ollama",
"openwebui",
"beszel",
"openlit",
"phoenix",
"openhands",
):
files.directory(
name=f"compose/{svc} dir",
path=f"{COMPOSE_DIR}/{svc}",
group="docker",
mode="2775",
_sudo=True,
)
files.put(
name=f"compose/{svc}/docker-compose.yml",
src=f"compose/{svc}.yml",
dest=f"{COMPOSE_DIR}/{svc}/docker-compose.yml",
group="docker",
mode="664",
_sudo=True,
)
# OpenWebUI persistent state (users, conversations, uploaded docs,
# RAG vector index) — bind-mounted into the container.
files.directory(
name="OpenWebUI data dir",
path=f"{COMPOSE_DIR}/openwebui/data",
group="docker",
mode="2775",
_sudo=True,
)
# Beszel persistent state (admin account, system list, metric history).
files.directory(
name="Beszel data dir",
path=f"{COMPOSE_DIR}/beszel/data",
group="docker",
mode="2775",
_sudo=True,
)
# Sibling .env file Docker Compose auto-reads for variable interpolation.
# Pyinfra creates it empty if missing and never overwrites — the user
# fills in BESZEL_TOKEN= and BESZEL_KEY= from the hub UI on first setup.
# Mode 640 root:docker so docker group members can read it at compose
# parse time but it isn't world-readable.
files.file(
name="Beszel .env (placeholder)",
path=f"{COMPOSE_DIR}/beszel/.env",
present=True,
mode="640",
user="root",
group="docker",
_sudo=True,
)
# Tear down the previous /etc/default/beszel-agent location if a prior
# deploy created it (idempotent — no-op on a fresh box).
files.file(
name="Remove legacy /etc/default/beszel-agent",
path="/etc/default/beszel-agent",
present=False,
_sudo=True,
)
# OpenLIT persistent state — Next.js app config + ClickHouse trace store.
files.directory(
name="OpenLIT data dir",
path=f"{COMPOSE_DIR}/openlit/data",
group="docker",
mode="2775",
_sudo=True,
)
files.directory(
name="OpenLIT ClickHouse data dir",
path=f"{COMPOSE_DIR}/openlit/clickhouse",
group="docker",
mode="2775",
_sudo=True,
)
# Phoenix persistent state (SQLite-backed trace store).
files.directory(
name="Phoenix data dir",
path=f"{COMPOSE_DIR}/phoenix/data",
group="docker",
mode="2775",
_sudo=True,
)
# OpenHands state + workspace. Owned by the SSH user (UID 1000) so the
# sandbox containers — which run as SANDBOX_USER_ID=1000 — can write to
# the shared workspace without root-owned files leaking out.
files.directory(
name="OpenHands state dir",
path=f"{COMPOSE_DIR}/openhands/state",
user=SSH_USER,
group=SSH_USER,
mode="2775",
_sudo=True,
)
files.directory(
name="OpenHands workspace dir",
path=f"{COMPOSE_DIR}/openhands/workspace",
user=SSH_USER,
group=SSH_USER,
mode="2775",
_sudo=True,
)
# --- Cleanup of artifacts from the prior native-build deploy ----------------
# All idempotent — `present=False` is a no-op when the target is absent.
server.shell(
name="Stop & disable old native llama-server.service",
commands=[
"systemctl disable --now llama-server.service 2>/dev/null || true",
],
_sudo=True,
)
files.file(
name="Remove old llama-server.service",
path="/etc/systemd/system/llama-server.service",
present=False,
_sudo=True,
)
files.link(
name="Remove old llama-server-vulkan symlink",
path="/usr/local/bin/llama-server-vulkan",
present=False,
_sudo=True,
)
files.link(
name="Remove old llama-server-rocm symlink",
path="/usr/local/bin/llama-server-rocm",
present=False,
_sudo=True,
)
files.directory(
name="Remove old llama.cpp checkout",
path="/opt/llama.cpp",
present=False,
_sudo=True,
)
server.shell(
name="Stop & remove native Ollama install",
commands=[
"systemctl disable --now ollama.service 2>/dev/null || true",
"rm -f /etc/systemd/system/ollama.service /usr/local/bin/ollama",
"userdel ollama 2>/dev/null || true",
],
_sudo=True,
)
systemd.daemon_reload(name="systemctl daemon-reload", _sudo=True)

pyinfra/framework/inventory.py Normal file

@@ -0,0 +1,5 @@
# pyinfra inventory for the Framework Desktop / Strix Halo box.
framework_desktop = [
("framework", {"ssh_hostname": "10.0.0.70", "ssh_user": "noise"}),
]

pyinfra/framework/run.sh Executable file

@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Wrapper: invokes pyinfra against the Framework Desktop, forwarding any
# extra args (e.g. --dry, -v) to pyinfra. Assumes NOPASSWD sudo on the box.
set -euo pipefail
cd "$(dirname "$0")"
exec pyinfra inventory.py deploy.py "$@"