Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with Kimi Linear as the eventual target.
Target hardware
- CPU: Ryzen AI Max+ 395 (16 Zen 5 cores)
- iGPU: Radeon 8060S (40 RDNA 3.5 CUs, gfx1151)
- Memory: 128 GB LPDDR5x-8000, unified CPU/GPU
- GPU-addressable: ~96 GB (configurable in BIOS)
- Memory bandwidth: ~256 GB/s (the binding constraint for inference)
OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for gfx1151.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add xubuntu-core or lubuntu-core only when/if you decide you want a desktop.
Phase 0 — Pre-install
- Update Framework BIOS / firmware to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- Set GPU memory carve-out in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
Phase 1 — Base OS
- Ubuntu Server 24.04 LTS installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install:
  - apt install linux-generic-hwe-24.04 for a newer kernel (ROCm 7.x wants ≥ 6.11).
  - Install Tailscale for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed. Both steps are sketched below.
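A sketch of those two post-install steps, assuming a stock Ubuntu 24.04 server (the Tailscale install script and the --ssh flag are current as of writing; check their docs):

```bash
# Newer HWE kernel for ROCm 7.x (wants >= 6.11), then reboot into it
sudo apt install -y linux-generic-hwe-24.04
sudo reboot

# Tailscale via the official install script, then join the tailnet with SSH enabled
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh
```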
Deliverable: SSH-accessible headless box, reachable by hostname over Tailscale.
Phase 2 — Optional slim DE (skip until needed)
- apt install xubuntu-core (XFCE) or lubuntu-core (LXQt) — both tiny.
- Reach it via xrdp or NoMachine rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
Phase 3 — GPU stack
- Install ROCm 7.x for gfx1151 per AMD's Ubuntu 24.04 instructions. Verify with rocminfo and rocm-smi.
- Mesa RADV (Vulkan) is already on Ubuntu — verify with vulkaninfo --summary.
- Keep both backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
Deliverable: Both rocminfo and vulkaninfo see the 8060S.
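A quick verification pass once ROCm and Mesa are in place (tools as named above; exact output varies by ROCm release):

```bash
# ROCm should enumerate the iGPU as gfx1151
rocminfo | grep -i gfx
rocm-smi --showmeminfo vram    # should reflect the BIOS UMA carve-out

# Mesa RADV should list the Radeon 8060S as a Vulkan device
vulkaninfo --summary | grep -i radeon
```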
Phase 4 — Inference engines
- llama.cpp — build twice into separate binaries: llama-server-vulkan and llama-server-rocm. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- Ollama — convenient front for llama.cpp; uses Vulkan/ROCm under the hood.
- vLLM with ROCm — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
Deliverable: A llama.cpp server that loads a known-good GGUF and serves on localhost:8080 over an OpenAI-compatible API.
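A sketch of the dual build and a first serve, assuming a checkout under ~/src/llama.cpp; the GGML_VULKAN / GGML_HIP flags and the gfx1151 target reflect current llama.cpp build docs, and the model path is just an example of the Phase 5 layout, so verify both before relying on them:

```bash
cd ~/src/llama.cpp

# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
sudo cp build-vulkan/bin/llama-server /usr/local/bin/llama-server-vulkan

# ROCm/HIP build targeting gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j
sudo cp build-rocm/bin/llama-server /usr/local/bin/llama-server-rocm

# Serve a known-good GGUF on the OpenAI-compatible endpoint (example path)
llama-server-vulkan -m /models/qwen/qwen3-coder-30b/Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080
```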
Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated /models directory; if the Framework Desktop has a second NVMe slot, put models there (mount sketch below).
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
  /models/<vendor>/<model>/<quant>.gguf
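If the second NVMe slot is populated, a sketch of mounting it at /models (the /dev/nvme1n1 device name is an assumption; check lsblk, and prefer the blkid UUID in a real fstab entry):

```bash
# Assumes the second drive is /dev/nvme1n1; confirm with lsblk before formatting
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /models
echo '/dev/nvme1n1 /models ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab
sudo mount /models

# Example of the shared layout
sudo mkdir -p /models/qwen/qwen3-coder-30b
```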
Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):
- Qwen3-Coder
- DeepSeek-Coder-V3.x
- GLM-4.6
- Devstral (smaller, faster)
- Kimi-K2 (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:
- Continue.dev or Cline (VS Code) pointed at the local server.
- Aider for terminal-based pair programming.
- OpenWebUI for a ChatGPT-style UI when desired.
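A smoke test of the endpoint all three front-ends will point at (llama-server exposes /v1/chat/completions; the model field is largely ignored since it serves whatever GGUF it loaded):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Reverse a string in Python, one line."}]
      }'
```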
Phase 8 — Monitoring & guardrails
- nvtop or radeontop for GPU; btop for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot (sketched below).
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.
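A minimal sketch of that unit, written as a heredoc so it stays copy-pasteable; the binary and model paths are illustrative and follow the earlier phases:

```bash
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server-vulkan \
  -m /models/qwen/qwen3-coder-30b/Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```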
Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to READY TO TRY, the bring-up should be ~30 minutes:
- git pull the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
- Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to /models/moonshotai/kimi-linear/.
- Swap the systemd unit's --model path and restart; the whole pass is sketched below.
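The whole pass as shell, reusing the Phase 4 build tree and the Phase 8 unit; the Hugging Face repo name is a placeholder, so substitute the actual bartowski / ymcki repo at the time:

```bash
# Rebuild llama.cpp with Kimi Linear support merged
cd ~/src/llama.cpp && git pull
cmake --build build-vulkan --config Release -j

# Fetch the quant into the standard layout (repo name is a placeholder)
huggingface-cli download <bartowski-kimi-linear-repo> \
  --include '*Q4_K_M*.gguf' --local-dir /models/moonshotai/kimi-linear/

# Point the unit's -m path at the new GGUF and restart
sudo systemctl edit --full llama-server
sudo systemctl restart llama-server
```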
Status tracking
- Weekly Kimi Linear support check: routine trig_01XPweKNZaGTsFjEsF3Ce6HH — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- Baseline (2026-05-01): llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
Key references
- llama.cpp PR #17592 — Kimi Linear support
- llama.cpp Issue #16930 — Kimi Linear feature request
- llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack
- Strix Halo backend benchmarks (kyuz0)
- strixhalo.wiki llama.cpp performance
- Lychee-Technology/llama-cpp-for-strix-halo
- bartowski Kimi-Linear GGUF
- ymcki Kimi-Linear GGUF (fork-required)
- vLLM Kimi-Linear recipe