Strix Halo Setup Plan — Local Coding Model Server

Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with Kimi Linear as the eventual target.

Target hardware

  • CPU: Ryzen AI Max+ 395 (16 Zen 5 cores)
  • iGPU: Radeon 8060S (40 RDNA 3.5 CUs, gfx1151)
  • Memory: 128 GB LPDDR5x-8000, unified CPU/GPU
  • GPU-addressable: ~96 GB (configurable in BIOS)
  • Memory bandwidth: ~256 GB/s (the binding constraint for inference)

OS choice: Ubuntu 24.04 LTS Server

  • Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
  • AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for gfx1151.
  • Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add xubuntu-core or lubuntu-core only when/if you decide you want a desktop.

Phase 0 — Pre-install

  • Update Framework BIOS / firmware to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
  • Set the GPU memory carve-out in BIOS to the maximum UMA frame buffer (typically 96 GB), leaving 32 GB for the OS + everything else.

Phase 1 — Base OS

  • Ubuntu Server 24.04 LTS installed via USB.
  • During install: enable SSH server, import OpenSSH key from your laptop.
  • Post-install: apt install linux-generic-hwe-24.04 for a newer kernel (ROCm 7.x wants ≥ 6.11).
  • Install Tailscale for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.
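
A minimal post-install sketch covering both bullets; the kernel metapackage is Ubuntu's stock HWE one, and the Tailscale lines are the official install script plus Tailscale SSH:

    # newer kernel for ROCm 7.x (wants >= 6.11)
    sudo apt update && sudo apt install -y linux-generic-hwe-24.04
    sudo reboot

    # Tailscale: official installer, then join the tailnet with
    # Tailscale SSH enabled; nothing exposed on the LAN
    curl -fsSL https://tailscale.com/install.sh | sh
    sudo tailscale up --ssh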

Deliverable: SSH-accessible headless box, reachable by hostname over Tailscale.

Phase 2 — Optional slim DE (skip until needed)

  • apt install xubuntu-core (XFCE) or lubuntu-core (LXQt) — both tiny.
  • Reach it via xrdp or NoMachine rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.

Phase 3 — GPU stack

  • Install ROCm 7.x for gfx1151 per AMD's Ubuntu 24.04 instructions. Verify with rocminfo and rocm-smi.
  • Mesa RADV (Vulkan) is already on Ubuntu — verify with vulkaninfo --summary.
  • Keep both backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
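
Quick sanity check that both backends enumerate the iGPU (stock ROCm / Mesa tools; exact output varies by driver version):

    # ROCm: the agent list should include gfx1151
    rocminfo | grep -i gfx
    rocm-smi

    # Vulkan: RADV should report the Radeon 8060S
    vulkaninfo --summary | grep -i deviceName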

Deliverable: Both rocminfo and vulkaninfo see the 8060S.

Phase 4 — Inference engines

  • llama.cpp — build twice into separate binaries: llama-server-vulkan and llama-server-rocm (build sketch after this list). Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
  • Ollama — convenient front end for llama.cpp; uses Vulkan/ROCm under the hood.
  • vLLM with ROCm — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
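
A sketch of the dual build from one checkout, using the cmake flags llama.cpp documents for its Vulkan and HIP backends; depending on the ROCm install you may also need HIPCXX / HIP_PATH set per llama.cpp's HIP build notes:

    # Vulkan build
    cmake -B build-vulkan -DGGML_VULKAN=ON
    cmake --build build-vulkan --config Release -j
    sudo cp build-vulkan/bin/llama-server /usr/local/bin/llama-server-vulkan

    # ROCm/HIP build pinned to gfx1151
    cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
    cmake --build build-rocm --config Release -j
    sudo cp build-rocm/bin/llama-server /usr/local/bin/llama-server-rocm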

Deliverable: A llama.cpp server that loads a known-good GGUF and serves on localhost:8080 over an OpenAI-compatible API.
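
Smoke test for the deliverable (model path is a placeholder in the Phase 5 layout):

    # serve a known-good GGUF
    llama-server-vulkan -m /models/<vendor>/<model>/<quant>.gguf \
        --host 127.0.0.1 --port 8080 -ngl 99

    # OpenAI-compatible chat completion from another shell
    curl -s http://localhost:8080/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"messages":[{"role":"user","content":"Say hi"}]}'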

Phase 5 — Storage & model layout

  • Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
  • Use a dedicated /models directory; if the Framework Desktop has a second NVMe slot, put models there.
  • Standard layout so Ollama, llama.cpp, and editor configs can share paths:
    /models/<vendor>/<model>/<quant>.gguf
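
To pull a GGUF into that layout, the Hugging Face CLI works; repo and file names here are placeholders, not verified releases:

    # inside a venv (Ubuntu 24.04 blocks system-wide pip)
    pip install -U "huggingface_hub[cli]"
    sudo mkdir -p /models/<vendor>/<model>
    huggingface-cli download <repo-id> <quant>.gguf \
        --local-dir /models/<vendor>/<model>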
    

Phase 6 — Daily coding model (use now, swap to Kimi Linear later)

Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):

  • Qwen3-Coder
  • DeepSeek-Coder-V3.x
  • GLM-4.6
  • Devstral (smaller, faster)
  • Kimi-K2 (stay in the Moonshot family while waiting for Linear)

→ Open question: research current Strix Halo benchmarks for these before committing to one.

Phase 7 — Front-end

All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:

  • Continue.dev or Cline (VS Code) pointed at the local server.
  • Aider for terminal-based pair programming (example below).
  • OpenWebUI for a ChatGPT-style UI when desired.
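
For example, Aider against the Phase 4 server via its documented OpenAI-compatible setup (hostname and model name are placeholders; the key just needs to be non-empty, since the local server ignores it):

    export OPENAI_API_BASE=http://<tailnet-hostname>:8080/v1
    export OPENAI_API_KEY=local
    aider --model openai/<model-name>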

Phase 8 — Monitoring & guardrails

  • nvtop or radeontop for GPU; btop for CPU/RAM.
  • Small systemd unit to auto-start the llama.cpp server on boot (sketch below).
  • Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.
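
Sketch of that unit; binary, model path, and flags are placeholders to adapt to whichever backend wins:

    # /etc/systemd/system/llama-server.service
    [Unit]
    Description=llama.cpp OpenAI-compatible server
    After=network-online.target
    Wants=network-online.target

    [Service]
    ExecStart=/usr/local/bin/llama-server-vulkan \
        -m /models/<vendor>/<model>/<quant>.gguf \
        --host 0.0.0.0 --port 8080 -ngl 99
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Enable with sudo systemctl enable --now llama-server.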

Phase 9 — Kimi Linear readiness checklist

When the weekly routine flips to READY TO TRY, the bring-up should be ~30 minutes:

  1. git pull the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
  2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to /models/moonshotai/kimi-linear/.
  3. Swap the systemd unit's --model path. Restart.
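
Assuming the Phase 8 unit, the swap can be a drop-in override instead of an edit to the base file:

    sudo systemctl edit llama-server
    # in the override:
    #   [Service]
    #   ExecStart=
    #   ExecStart=/usr/local/bin/llama-server-rocm \
    #       -m /models/moonshotai/kimi-linear/<quant>.gguf \
    #       --host 0.0.0.0 --port 8080 -ngl 99
    sudo systemctl restart llama-server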

Status tracking

  • Weekly Kimi Linear support check: routine trig_01XPweKNZaGTsFjEsF3Ce6HH — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
  • Baseline (2026-05-01): llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.

Key references