# Strix Halo Setup Plan — Local Coding Model Server

Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.

## Target hardware

- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)

## OS choice: Ubuntu 24.04 LTS Server

- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.

---

## Phase 0 — Pre-install

- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.

## Phase 1 — Base OS

- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.

**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.

## Phase 2 — Optional slim DE (skip until needed)

- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.

## Phase 3 — GPU stack

- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi`.
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.

**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.

## Phase 4 — Inference engines

- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm`. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — convenient front-end for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.

**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.

## Phase 5 — Storage & model layout

- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths (a launch sketch using this layout follows below):

```
/models/<org>/<model>/<quant>.gguf
```
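To make the Phase 4 deliverable and the layout above concrete, here is a minimal launch sketch. It assumes the Vulkan build from Phase 4 is on `$PATH` as `llama-server-vulkan`; the model path, quant, and flag values are illustrative placeholders, not recommendations.

```bash
# Serve a GGUF from the /models layout on the OpenAI-compatible endpoint.
# Binary name, model path, and flag values below are placeholders.
llama-server-vulkan \
  --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 32768
```

A quick smoke test against the OpenAI-compatible API once it's up:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a hello-world in Python."}]}'
```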
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)

Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):

- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)

→ Open question: research current Strix Halo benchmarks for these before committing to one.

## Phase 7 — Front-end

All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:

- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.

## Phase 8 — Monitoring & guardrails

- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot (a unit sketch appears in the appendix at the end of this plan).
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.

## Phase 9 — Kimi Linear readiness checklist

When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes:

1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.

---

## Status tracking

- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.

## Key references

- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)
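## Appendix: example systemd unit (sketch)

Referenced from Phases 8 and 9. A minimal sketch, assuming the unit is installed as `/etc/systemd/system/llama-server.service` and reusing the illustrative binary name and model path from the Phase 5 sketch; adjust `ExecStart`, flags, and `User=` to the real install.

```ini
[Unit]
Description=llama.cpp OpenAI-compatible server
Wants=network-online.target
After=network-online.target

[Service]
# Binary path, model path, and flags are illustrative placeholders.
ExecStart=/usr/local/bin/llama-server-vulkan \
    --model /models/qwen/qwen3-coder/Q4_K_M.gguf \
    --host 127.0.0.1 --port 8080 \
    --n-gpu-layers 999
Restart=on-failure
# Run as an unprivileged user that can read /models.
User=llama

[Install]
WantedBy=multi-user.target
```

Enable with `sudo systemctl daemon-reload && sudo systemctl enable --now llama-server`. The Phase 9 swap then amounts to editing the `--model` path and running `sudo systemctl restart llama-server`.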