pyinfra: Strix Halo bring-up
Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB). The host stays minimal — kernel + driver + Docker + diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker compose services, each shipping its own ROCm/Vulkan stack.
Manual prerequisites
- Phase 0: update Framework BIOS, set GPU UMA carve-out (96 GB).
- OS install: Ubuntu Server 24.04 LTS. AMD ROCm only ships for jammy/noble; later Ubuntus install but break the host-side toolchain (libxml2 ABI). Enable SSH, import your laptop key, create user noise. Recommended partitioning: ≥300 GB on /, big disk mounted at /models, plain ext4 (skip LVM).
- The host must be reachable at 10.0.0.237 over SSH (edit inventory.py if it moves).
- NOPASSWD sudo for noise: pyinfra's fact layer doesn't reliably thread sudo passwords. One-time setup:
  ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
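For orientation, here is a minimal sketch of what an inventory.py matching the prerequisites above could look like. The group name strix_halo and the use of _sudo in host data are assumptions for illustration; the real file in this repo may be laid out differently.

```python
# inventory.py (hypothetical sketch, not the repo's actual file)

# pyinfra treats each module-level list as a host group.
# Each entry is a hostname string or a (hostname, host_data) tuple.
strix_halo = [
    (
        "10.0.0.237",  # change this if the box moves
        {
            "ssh_user": "noise",  # user created during the OS install
            "_sudo": True,  # assumption: operations run via the NOPASSWD sudo rule
        },
    ),
]
```

With the NOPASSWD rule in place, no sudo password needs to be stored in the inventory or typed during a run.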
Run
uv tool install pyinfra
./run.sh # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry # any extra args are forwarded to pyinfra
Or run it ephemerally without installing: uvx pyinfra inventory.py deploy.py.
What the deploy does
- Base CLI: tmux, vim, htop, btop, nvtop, radeontop, uv
- Tailscale (run sudo tailscale up on the box once, interactively)
- Docker engine + compose plugin, user added to docker group
- ROCm host diagnostics only (rocminfo, rocm-smi), no full toolchain
- /models/<vendor>/ layout
- ~/docker/{llama,vllm,ollama}/docker-compose.yml dropped in, not auto-started; you edit the model path, then docker compose up -d
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time ./run.sh runs.
After the deploy: starting an inference service
ssh noise@10.0.0.237
sudo tailscale up # one-time, interactive
# Drop a GGUF somewhere under /models, then:
cd ~/docker/llama
vim docker-compose.yml # edit the --model path
docker compose up -d
curl localhost:8080/v1/models # smoke test
Same shape for vllm (port 8000) and ollama (port 11434, no model edit
needed — Ollama serves models on demand).
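The curl smoke test can also be scripted across all three services. The sketch below uses only the Python standard library and the ports listed above. It assumes each service answers on /v1/models, following the "same shape" note; adjust the path if a given image exposes a different endpoint. The helper names models_url and is_up are made up for this example.

```python
import http.client
import urllib.error
import urllib.request

# Ports as listed in this README.
SERVICE_PORTS = {"llama": 8080, "vllm": 8000, "ollama": 11434}


def models_url(service: str, host: str = "localhost") -> str:
    """Build the model-listing URL for one service."""
    return f"http://{host}:{SERVICE_PORTS[service]}/v1/models"


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


if __name__ == "__main__":
    for name in SERVICE_PORTS:
        url = models_url(name)
        print(f"{name:7s} {url} {'ok' if is_up(url) else 'DOWN'}")
```

Run it on the box itself, or pass host="10.0.0.237" to models_url from another machine if the compose files publish the ports beyond localhost.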
Tunables
Top of deploy.py:
ROCM_VERSION and AMDGPU_INSTALL_DEB: bump when AMD ships a newer release. The .deb filename has a build suffix that doesn't derive from the version; find it at https://repo.radeon.com/amdgpu-install/.
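Because the build suffix cannot be computed from the version, the two values have to be kept as separate constants. The sketch below shows one way they might combine into a download URL. The values are placeholders, installer_url is a hypothetical helper, and the <version>/ubuntu/<codename>/ directory layout is an assumption to verify against the repository listing.

```python
# Placeholder values: substitute the real ones from repo.radeon.com.
ROCM_VERSION = "X.Y.Z"
AMDGPU_INSTALL_DEB = "amdgpu-install_X.Y.BUILD-1_all.deb"  # copy the exact filename
UBUNTU_CODENAME = "noble"  # Ubuntu 24.04


def installer_url(version: str, deb: str, codename: str = UBUNTU_CODENAME) -> str:
    """Join version, Ubuntu codename, and .deb filename into a download URL.

    The directory layout used here is an assumption, not taken from deploy.py.
    """
    base = "https://repo.radeon.com/amdgpu-install"
    return f"{base}/{version}/ubuntu/{codename}/{deb}"


if __name__ == "__main__":
    print(installer_url(ROCM_VERSION, AMDGPU_INSTALL_DEB))
```

When bumping, change both constants together; a new version paired with the old filename would point at a file that does not exist.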
Compose images in compose/{llama,vllm,ollama}.yml — pin tags here.
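To check that the tags really are pinned, a small script can flag image references that have no tag or use the floating latest tag. This is a rough sketch: it scans for image: lines with a regular expression instead of parsing YAML, and the function name and sample image names are made up for illustration.

```python
import re

# Matches "image: <ref>" lines; commented-out lines do not match.
IMAGE_LINE = re.compile(r"^\s*image:\s*(\S+)", re.MULTILINE)


def unpinned_images(compose_text: str) -> list[str]:
    """Return image references with no tag or with the floating 'latest' tag."""
    loose = []
    for ref in IMAGE_LINE.findall(compose_text):
        ref = ref.strip("\"'")
        if "@" in ref:
            continue  # pinned by digest
        # Look only at the last path component so a registry host:port
        # is not mistaken for a tag.
        name = ref.rsplit("/", 1)[-1]
        tag = name.partition(":")[2]
        if tag in ("", "latest"):
            loose.append(ref)
    return loose


if __name__ == "__main__":
    sample = """\
services:
  a:
    image: example/app:1.2.3
  b:
    image: example/other:latest
"""
    print(unpinned_images(sample))
```

Feeding it the text of each file under compose/ gives a quick list of images that would drift on the next pull.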