# Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.
## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)
## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.
---
## Phase 0 — Pre-install
- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
## Phase 1 — Base OS
- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.
**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.
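A minimal sketch of the Phase 1 post-install steps, assuming Ubuntu's stock HWE kernel package and Tailscale's official install script (double-check package and flag names against current docs):
```bash
# Tailscale: join the tailnet and enable Tailscale SSH (no exposed ports)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh

# HWE kernel for ROCm 7.x (wants >= 6.11), then reboot into it
sudo apt update
sudo apt install -y linux-generic-hwe-24.04
sudo reboot
```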
## Phase 2 — Optional slim DE (skip until needed)
- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
## Phase 3 — GPU stack
- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi`.
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.
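A quick verification pass once ROCm and Mesa are in place (the group names below are the standard ROCm ones; AMD's installer may handle this for you):
```bash
# Let the login user talk to the GPU without sudo (log out/in afterwards)
sudo usermod -aG render,video "$USER"

# ROCm should report the iGPU as gfx1151
rocminfo | grep -i gfx
rocm-smi

# Vulkan (Mesa RADV) should list the Radeon 8060S
vulkaninfo --summary | grep -i deviceName
```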
## Phase 4 — Inference engines
- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm`. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — a convenient front end for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
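A sketch of the dual-backend build and an endpoint smoke test, assuming recent llama.cpp CMake options (`GGML_VULKAN`, `GGML_HIP`, `AMDGPU_TARGETS`) and a placeholder model path; check the upstream build docs for the exact HIP toolchain setup:
```bash
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j"$(nproc)"
cp build-vulkan/bin/llama-server ~/bin/llama-server-vulkan

# ROCm/HIP build targeting gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j"$(nproc)"
cp build-rocm/bin/llama-server ~/bin/llama-server-rocm

# Serve a known-good GGUF (placeholder path) and smoke-test the API
~/bin/llama-server-vulkan --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 &
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```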
## Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4-6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
```
/models/<vendor>/<model>/<quant>.gguf
```
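A sketch of populating that layout and letting Ollama reuse the same file (repo, model, and file names below are placeholders):
```bash
# Per-vendor tree matching the layout above
sudo mkdir -p /models/{qwen/qwen3-coder,moonshotai/kimi-linear}
sudo chown -R "$USER": /models

# Pull a quant into the shared tree (rename the downloaded file to match
# the <quant>.gguf convention if needed)
huggingface-cli download <vendor>/<model>-GGUF --include "*Q4_K_M*" \
  --local-dir /models/qwen/qwen3-coder

# Ollama can import the same GGUF instead of keeping a private copy
printf 'FROM /models/qwen/qwen3-coder/Q4_K_M.gguf\n' > Modelfile
ollama create qwen3-coder-local -f Modelfile
```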
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4-Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
## Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.
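For example, Aider and OpenWebUI pointed at the Phase 4 endpoint (the `framework` hostname and model name are placeholders; Aider's generic OpenAI-compatible settings accept any dummy key):
```bash
# Terminal pair programming against the local llama.cpp server
export OPENAI_API_BASE=http://framework:8080/v1
export OPENAI_API_KEY=local-no-key
aider --model openai/qwen3-coder

# OpenWebUI in Docker, pointed at the same endpoint
docker run -d --name open-webui -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://framework:8080/v1 \
  ghcr.io/open-webui/open-webui:main
```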
## Phase 8 — Monitoring & guardrails
- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot.
- Strix Halo is power-flexible (45-120 W TDP); pick a TDP that matches your noise/perf preference.
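A sketch of that auto-start unit, written via heredoc so it stays a shell snippet; binary path, model path, and service user are placeholders:
```bash
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
ExecStart=/home/llm/bin/llama-server-vulkan \
  --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
Restart=on-failure
User=llm

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```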
## Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes:
1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.
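The same three steps as a shell sketch (assumes the Phase 8 unit and the Phase 5 layout; the bartowski repo is the one listed in the references):
```bash
# 1. Rebuild llama.cpp with Kimi Linear support (post-merge of PR #17592);
#    repeat for the ROCm build if you want both backends
cd ~/llama.cpp && git pull
cmake --build build-vulkan --config Release -j"$(nproc)"

# 2. Fetch the quant into the shared layout
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir /models/moonshotai/kimi-linear

# 3. Edit the unit's --model path to the new GGUF, then restart
sudo systemctl edit --full llama-server
sudo systemctl restart llama-server
```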
---
## Status tracking
- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
## Key references
- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork-required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)