Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.
## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)
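Back-of-envelope (illustrative arithmetic, not benchmarks): decode is memory-bound, so tokens/s ≈ bandwidth ÷ bytes of weights read per token. At ~256 GB/s, a dense Q4 model with ~30 GB of weights tops out near 256/30 ≈ 8–9 tok/s, while an MoE with ~3B active parameters (~2 GB at Q4) has a ceiling north of 100 tok/s. That gap is why MoE models like Kimi Linear (48B total, 3B active) are the attractive targets here.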
## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.
---
## Phase 0 — Pre-install
- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
## Phase 1 — Base OS
- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed (commands sketched below).
**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.
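A sketch of the post-install pass, assuming stock Ubuntu 24.04 package names and Tailscale's official install script:
```
sudo apt update
sudo apt install -y linux-generic-hwe-24.04   # HWE kernel; ROCm 7.x wants >= 6.11
sudo reboot

# Tailscale, then bring the node up with Tailscale SSH enabled
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh
```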
## Phase 2 — Optional slim DE (skip until needed)
- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
## Phase 3 — GPU stack
- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi` (checks sketched below).
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.
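Quick checks once both stacks are installed (`vulkan-tools` provides `vulkaninfo`; the grep patterns are just illustrative):
```
rocminfo | grep -i gfx                        # should list gfx1151
rocm-smi                                      # utilization / VRAM view
sudo apt install -y vulkan-tools
vulkaninfo --summary | grep -iA2 deviceName   # should show the 8060S (RADV)
```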
## Phase 4 — Inference engines
- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm` (build sketch below). Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — convenient front-end for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
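A build sketch assuming a recent llama.cpp checkout; the CMake flags (`GGML_VULKAN`, `GGML_HIP`, `AMDGPU_TARGETS`) match current upstream docs but are worth re-checking, and the install paths are arbitrary:
```
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
install build-vulkan/bin/llama-server ~/bin/llama-server-vulkan

# ROCm/HIP build targeting gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j
install build-rocm/bin/llama-server ~/bin/llama-server-rocm

# Smoke test: serve a known-good GGUF on the OpenAI-compatible API
~/bin/llama-server-vulkan -m /models/<vendor>/<model>/<quant>.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99
```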
## Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
```
/models/<vendor>/<model>/<quant>.gguf
```
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
## Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about (smoke test after this list):
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.
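One curl smoke test validates the wiring for every client (the `model` field is typically ignored by a single-model llama-server, so any string works):
```
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Write a hello world in Python."}]}'
```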
## Phase 8 — Monitoring & guardrails
- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot (unit sketch below).
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.
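A minimal unit sketch, assuming the Phase 4 binary path; the user and model path are placeholders to substitute:
```
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
# Placeholder user and model path; substitute real values
User=youruser
ExecStart=/home/youruser/bin/llama-server-vulkan \
  -m /models/<vendor>/<model>/<quant>.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```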
## Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes (commands sketched after the list):
1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.
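The same checklist as commands, assuming the Phase 4 build tree and the Phase 8 unit; the `--include` pattern is illustrative since the exact quant filenames aren't known yet:
```
# 1. Rebuild with Kimi Linear support merged
cd ~/llama.cpp && git pull
cmake --build build-vulkan --config Release -j

# 2. Fetch a quant into the standard layout
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include '*Q4_K_M*' --local-dir /models/moonshotai/kimi-linear/

# 3. Repoint the unit and restart
sudo systemctl edit llama-server   # override ExecStart's -m path
sudo systemctl restart llama-server
```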
---
## Status tracking
- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
## Key references
- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork-required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)