Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.
## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)
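Back-of-envelope (illustrative arithmetic, not benchmarks): decode is memory-bound, so tokens/s ≈ bandwidth ÷ bytes of weights read per token. At ~256 GB/s, a dense Q4 model with ~30 GB of weights tops out near 256/30 ≈ 8–9 tok/s, while an MoE with ~3B active parameters (~2 GB at Q4) has a ceiling north of 100 tok/s. That gap is why MoE models like Kimi Linear (48B total, 3B active) are the attractive targets here.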
## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.
---
## Phase 0 — Pre-install
- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
## Phase 1 — Base OS
- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed (commands sketched below).
**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.
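A sketch of the post-install pass, assuming stock Ubuntu 24.04 package names and Tailscale's official install script:
```
sudo apt update
sudo apt install -y linux-generic-hwe-24.04   # HWE kernel; ROCm 7.x wants >= 6.11
sudo reboot

# Tailscale, then bring the node up with Tailscale SSH enabled
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh
```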
## Phase 2 — Optional slim DE (skip until needed)
- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
## Phase 3 — GPU stack
- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi` (checks sketched below).
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.
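Quick checks once both stacks are installed (`vulkan-tools` provides `vulkaninfo`; the grep patterns are just illustrative):
```
rocminfo | grep -i gfx                        # should list gfx1151
rocm-smi                                      # utilization / VRAM view
sudo apt install -y vulkan-tools
vulkaninfo --summary | grep -iA2 deviceName   # should show the 8060S (RADV)
```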
## Phase 4 — Inference engines
- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm` (build sketch below). Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — convenient front-end for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
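A build sketch assuming a recent llama.cpp checkout; the CMake flags (`GGML_VULKAN`, `GGML_HIP`, `AMDGPU_TARGETS`) match current upstream docs but are worth re-checking, and the install paths are arbitrary:
```
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
install build-vulkan/bin/llama-server ~/bin/llama-server-vulkan

# ROCm/HIP build targeting gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j
install build-rocm/bin/llama-server ~/bin/llama-server-rocm

# Smoke test: serve a known-good GGUF on the OpenAI-compatible API
~/bin/llama-server-vulkan -m /models/<vendor>/<model>/<quant>.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99
```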
## Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
```
/models/<vendor>/<model>/<quant>.gguf
```
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
## Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about (smoke test after this list):
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.
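One curl smoke test validates the wiring for every client (the `model` field is typically ignored by a single-model llama-server, so any string works):
```
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Write a hello world in Python."}]}'
```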
## Phase 8 — Monitoring & guardrails
- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot (unit sketch below).
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.
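A minimal unit sketch, assuming the Phase 4 binary path; the user and model path are placeholders to substitute:
```
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
# Placeholder user and model path; substitute real values
User=youruser
ExecStart=/home/youruser/bin/llama-server-vulkan \
  -m /models/<vendor>/<model>/<quant>.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```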
## Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes (commands sketched after the list):
1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.
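The same checklist as commands, assuming the Phase 4 build tree and the Phase 8 unit; the `--include` pattern is illustrative since the exact quant filenames aren't known yet:
```
# 1. Rebuild with Kimi Linear support merged
cd ~/llama.cpp && git pull
cmake --build build-vulkan --config Release -j

# 2. Fetch a quant into the standard layout
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include '*Q4_K_M*' --local-dir /models/moonshotai/kimi-linear/

# 3. Repoint the unit and restart
sudo systemctl edit llama-server   # override ExecStart's -m path
sudo systemctl restart llama-server
```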
---
## Status tracking
- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
## Key references
- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork-required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)