# Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.
## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)
## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.
---
## Phase 0 — Pre-install
- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
## Phase 1 — Base OS
- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.
**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.
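A minimal sketch of the Phase 1 post-install steps, assuming Ubuntu's stock HWE kernel package and Tailscale's official install script (double-check package and flag names against current docs):
```bash
# Tailscale: join the tailnet and enable Tailscale SSH (no exposed ports)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh

# HWE kernel for ROCm 7.x (wants >= 6.11), then reboot into it
sudo apt update
sudo apt install -y linux-generic-hwe-24.04
sudo reboot
```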
## Phase 2 — Optional slim DE (skip until needed)
- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
## Phase 3 — GPU stack
- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi`.
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.
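A quick verification pass once ROCm and Mesa are in place (the group names below are the standard ROCm ones; AMD's installer may handle this for you):
```bash
# Let the login user talk to the GPU without sudo (log out/in afterwards)
sudo usermod -aG render,video "$USER"

# ROCm should report the iGPU as gfx1151
rocminfo | grep -i gfx
rocm-smi

# Vulkan (Mesa RADV) should list the Radeon 8060S
vulkaninfo --summary | grep -i deviceName
```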
## Phase 4 — Inference engines
- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm`. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — a convenient front end for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
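A sketch of the dual-backend build and an endpoint smoke test, assuming recent llama.cpp CMake options (`GGML_VULKAN`, `GGML_HIP`, `AMDGPU_TARGETS`) and a placeholder model path; check the upstream build docs for the exact HIP toolchain setup:
```bash
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j"$(nproc)"
cp build-vulkan/bin/llama-server ~/bin/llama-server-vulkan

# ROCm/HIP build targeting gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j"$(nproc)"
cp build-rocm/bin/llama-server ~/bin/llama-server-rocm

# Serve a known-good GGUF (placeholder path) and smoke-test the API
~/bin/llama-server-vulkan --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 &
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```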
## Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4-6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
```
/models/<vendor>/<model>/<quant>.gguf
```
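A sketch of populating that layout and letting Ollama reuse the same file (repo, model, and file names below are placeholders):
```bash
# Per-vendor tree matching the layout above
sudo mkdir -p /models/{qwen/qwen3-coder,moonshotai/kimi-linear}
sudo chown -R "$USER": /models

# Pull a quant into the shared tree (rename the downloaded file to match
# the <quant>.gguf convention if needed)
huggingface-cli download <vendor>/<model>-GGUF --include "*Q4_K_M*" \
  --local-dir /models/qwen/qwen3-coder

# Ollama can import the same GGUF instead of keeping a private copy
printf 'FROM /models/qwen/qwen3-coder/Q4_K_M.gguf\n' > Modelfile
ollama create qwen3-coder-local -f Modelfile
```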
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4-Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
## Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.
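For example, Aider and OpenWebUI pointed at the Phase 4 endpoint (the `framework` hostname and model name are placeholders; Aider's generic OpenAI-compatible settings accept any dummy key):
```bash
# Terminal pair programming against the local llama.cpp server
export OPENAI_API_BASE=http://framework:8080/v1
export OPENAI_API_KEY=local-no-key
aider --model openai/qwen3-coder

# OpenWebUI in Docker, pointed at the same endpoint
docker run -d --name open-webui -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://framework:8080/v1 \
  ghcr.io/open-webui/open-webui:main
```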
## Phase 8 — Monitoring & guardrails
- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot.
- Strix Halo is power-flexible (45-120 W TDP); pick a TDP that matches your noise/perf preference.
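A sketch of that auto-start unit, written via heredoc so it stays a shell snippet; binary path, model path, and service user are placeholders:
```bash
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
ExecStart=/home/llm/bin/llama-server-vulkan \
  --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
Restart=on-failure
User=llm

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```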
## Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes:
1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.
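The same three steps as a shell sketch (assumes the Phase 8 unit and the Phase 5 layout; the bartowski repo is the one listed in the references):
```bash
# 1. Rebuild llama.cpp with Kimi Linear support (post-merge of PR #17592);
#    repeat for the ROCm build if you want both backends
cd ~/llama.cpp && git pull
cmake --build build-vulkan --config Release -j"$(nproc)"

# 2. Fetch the quant into the shared layout
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir /models/moonshotai/kimi-linear

# 3. Edit the unit's --model path to the new GGUF, then restart
sudo systemctl edit --full llama-server
sudo systemctl restart llama-server
```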
---
## Status tracking
- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
## Key references
- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork-required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)