Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:35:10 -04:00
commit 2c4bfefa95
36 changed files with 5265 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,29 @@
# macOS
.DS_Store
# Logs
*.log
pyinfra-debug.log
# Python build artefacts
__pycache__/
*.pyc
# Local agent state files (binary, machine-specific)
agentdb.rvf
agentdb.rvf.lock
ruvector.db
**/ruvector.db
# Test outputs
testing/**/test_results/
testing/**/__pycache__/
# Node deps in the OpenCode plugin — install.sh's `npm install` regenerates
opencode/.opencode/plugin/node_modules/
# Phoenix bridge diagnostic log (Mac path, written at runtime)
/tmp/phoenix-bridge.log
# Per-machine Claude Code permissions (each clone gets its own)
.claude/settings.local.json

Roadmap.md Normal file

@@ -0,0 +1,263 @@
# Local GenAI roadmap
What to build on top of the Framework Desktop / Strix Halo stack to
narrow the functional gap with Claude Code + Anthropic's broader
ecosystem (Cowork, Skills, Design). Captured 2026-05-07; revisit
quarterly.
## Where we stand
- Containerized inference at `/srv/docker/{llama,vllm,ollama}`
- Models on the unified-memory iGPU at 64 GB UMA
- OpenCode on the Mac, pointed at the box over Tailscale
- Playwright + SearXNG MCP wired into OpenCode (`localgenai/opencode/`)
- Self-hosted SearXNG instance at `https://searxng.n0n.io`
- Solid floor for single-user terminal coding with Qwen3-Coder-30B-A3B
## The mental model: three layers
The OSS ecosystem stratifies cleanly once you stop thinking about
individual products:
```
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3: Management & observability │
│ Dispatch tickets to agents, watch spend, track skill │
│ accumulation. Where Claude Cowork lives. (Multica, Plane.) │
└────────────────────────────┬────────────────────────────────────┘
│ assigns work to ↓
┌────────────────────────────┴────────────────────────────────────┐
│ Layer 2: Orchestration frameworks │
│ Code that defines who does what, in what order. Reach for │
│ when you want to *encode* a pipeline, not drive interactively. │
│ (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK.) │
└────────────────────────────┬────────────────────────────────────┘
│ runs ↓
┌────────────────────────────┴────────────────────────────────────┐
│ Layer 1: Harnesses │
│ The thing you actually drive — terminal or IDE agent that │
│ takes a prompt and uses tools. (OpenCode, Aider, Cline, │
│ OpenHands, OpenWork, Goose, Pi.) │
└────────────────────────────┬────────────────────────────────────┘
│ calls ↓
┌────────────────────────────┴────────────────────────────────────┐
│ Layer 0: Inference + tools │
│ Models, MCP servers, embedding stores, voice/vision pipelines. │
│ (Ollama, vLLM, llama.cpp, MCP catalog.) │
└─────────────────────────────────────────────────────────────────┘
```
Pick what you need from each layer; you don't need anything from a
higher layer if the one below is doing the job.
## Layer 1: Harnesses
The interactive driver's seat. We use **OpenCode**; alternatives worth
having in your back pocket:
- **[Aider](https://aider.chat)** — terminal, git-native, every change
is a commit. Strong on multi-file refactors. Often outperforms
OpenCode on small-context local models.
- **[Cline](https://github.com/cline/cline)** — VS Code agent, 5M+
installs. v3.58 added native subagents and a CI/CD headless mode.
Closest "Claude Code in the IDE" feel.
- **[OpenHands](https://github.com/All-Hands-AI/OpenHands)** (was
OpenDevin) — autonomous agent in a Docker sandbox; edits code, runs
tests, browses. Heavier than OpenCode, more autonomous.
- **[OpenWork](https://supergok.com/openwork-open-source-ai-agent-framework/)**
— desktop GUI agent built on `deepagentsjs`. Multi-step planning,
filesystem access, **subagent delegation** baked in. `npx` to start.
- **[Goose](https://github.com/block/goose)** by Block — MCP-native,
opinionated about toolkits. Strong fit if you're going all-in on MCP.
- **[Pi](https://github.com/bradAGI/pi-mono)** — minimal terminal
harness; small surface area, easy to read and extend.
- Catalog: <https://github.com/bradAGI/awesome-cli-coding-agents>
When to add another: when you hit a workflow OpenCode doesn't handle
well. Aider for refactors, Cline for IDE-bound work, OpenHands for
autonomous loops, OpenWork for subagent-heavy planning.
## Layer 2: Orchestration frameworks
When you want to *build* a pipeline rather than drive a prebuilt one —
write code that defines agents, tools, and a flow. Overkill for solo
coding; on-target for personal automation pipelines.
- **[LangGraph](https://github.com/langchain-ai/langgraph)** — directed
graph with conditional edges, built-in checkpointing with time-travel.
Surpassed CrewAI in stars in early 2026; production-grade.
- **[CrewAI](https://github.com/crewAIInc/crewAI)** — role-based crews,
agents/tasks/crew in <20 lines of Python. Lowest learning curve,
~5× faster than LangGraph in many cases. A2A protocol support.
- **[AutoGen / AG2](https://github.com/microsoft/autogen)** —
event-driven, async-first; GroupChat coordination with a selector
deciding who speaks next.
- **[OpenAI Agents SDK](https://github.com/openai/openai-agents-python)** —
production replacement for Swarm. Explicit agent-to-agent handoffs
carrying conversation context.
- **[OpenAgents](https://github.com/openagents/openagents)** — only
framework with native support for **both MCP and A2A** protocols.
All five are model-agnostic — point them at the local Ollama endpoint
or LiteLLM and they work the same as if you were using GPT-5.
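As a sanity check of that claim, here's the raw OpenAI-style request each of these frameworks ends up making under the hood, pointed at the box's Ollama endpoint (hostname and model name as configured elsewhere in this repo):
```sh
curl -s http://framework:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen3-coder:30b",
        "messages": [{"role": "user", "content": "Say hello in one word."}]
      }' | jq -r '.choices[0].message.content'
```
If that works, any of the five frameworks should too; they differ in orchestration, not transport.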
## Layer 3: Management & observability
The Claude Cowork analog layer. Dispatch work, track progress, monitor
spend. The piece I missed in earlier roadmap drafts:
- **[Multica](https://github.com/multica-ai/multica)** — open-source
platform for managing AI coding agents *as teammates*. File issues,
assign to agents, they pick up work, write code, report blockers,
update statuses. Skill accumulation across tasks. Dispatches to
Claude Code, Codex, Copilot CLI, OpenCode, Pi, Cursor Agent, Hermes,
Gemini, Kiro, etc. **#1 GitHub TypeScript Trending in April 2026,
10.7 k stars in three months.** Closest existing OSS analog to
Cowork; works *above* OpenCode rather than replacing it.
- **[Plane](https://plane.so/)** — AI-native project management with
MCP server and full Agent Run lifecycle tracking. Velocity, workload,
blockers auto-populated. Self-hostable.
- **[Mission Control](https://github.com/builderz-labs/mission-control)** —
self-hosted dashboard for agent fleets. Token-spend per model, trend
charts, real-time posture scoring. SQLite, zero deps. Useful even
solo just to watch your local model's spend pattern.
- **[ClawDeck](https://github.com/builderz-labs/clawdeck)** — focused
kanban for OpenClaw agents. Smaller scope, more opinionated.
- **[OpenWebUI](https://github.com/open-webui/open-webui)** — multi-user
chat UI in front of Ollama. Doesn't manage *agent sessions*, but
covers shared-LLM-access for a household.
**Still a gap (May 2026):** *live* shared agent sessions — multiple humans
driving one agent on one repo in real time, Cowork's multiplayer mode.
Multica is async-ticket-flow, not live-collaborative. Watch for
projects that close this.
## Layer 0: Cross-cutting capabilities
Things you wire in once, useful from any harness above.
### Aggregation & routing
- **[LiteLLM](https://github.com/BerriAI/litellm)** — proxy that exposes
one OpenAI-compatible endpoint over many backends. Rules like "code
task → local Qwen3, image input → vision model, fast chat → Haiku."
Worth standing up early so all higher-layer tools just point at one
URL.
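A rough sketch of what "one URL in front of many backends" looks like in practice: a single-model LiteLLM config routed at the box's Ollama, run as a container. Image tag, config schema, and paths are assumptions to verify against LiteLLM's docs before standing this up.
```sh
mkdir -p /srv/docker/litellm
cat > /srv/docker/litellm/config.yaml <<'EOF'
model_list:
  - model_name: qwen3-coder             # name clients request
    litellm_params:
      model: ollama/qwen3-coder:30b     # backend route
      api_base: http://framework:11434
EOF
docker run -d --name litellm -p 4000:4000 \
  -v /srv/docker/litellm/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest --config /app/config.yaml --port 4000
# Higher-layer tools then point at http://<host>:4000/v1 and never care
# which backend actually served the request.
```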
### Code intelligence / RAG
The local model can't see beyond what fits in its prompt. Claude Code
papers over this with long-context recall + WebFetch; locally we need
explicit retrieval.
- **[Continue.dev](https://continue.dev)** — VS Code / JetBrains plugin
with built-in `@codebase` indexer. Pairs with `nomic-embed-text` via
Ollama for local embeddings. **Biggest single-shot quality lift** for
local-model coding.
- **[Tabby](https://github.com/TabbyML/tabby)** — self-hosted
Copilot-style inline completion. Different niche from agentic coding,
covers the "as I type" loop OpenCode doesn't.
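Before pointing Continue.dev at the box, it's worth confirming the embedding side works end-to-end. A minimal sketch against Ollama's HTTP API; endpoint shapes are per current Ollama docs, so verify against the version deployed on the box:
```sh
# Pull the embedding model (streams JSON progress lines), then embed a string.
curl -s http://framework:11434/api/pull -d '{"name": "nomic-embed-text"}'
curl -s http://framework:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "hello strix halo"}' \
  | jq '.embedding | length'    # expect 768 dimensions for nomic-embed-text
```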
### Web access — done
Already wired in `localgenai/opencode/`:
- **[@playwright/mcp](https://github.com/microsoft/playwright-mcp)** —
browser automation. Navigate, click, fill forms, read DOM snapshots.
- **[mcp-searxng](https://github.com/ihor-sokoliuk/mcp-searxng)** —
web search via your `searxng.n0n.io` instance. No API keys.
- **[mcp-server-fetch](https://github.com/modelcontextprotocol/servers/tree/main/src/fetch)** —
not yet wired; useful if Playwright feels heavy for plain page reads.
### Persistent memory
OpenCode's memory is per-session unless you wire MCP for it.
- **`mcp-server-memory`** (official) — knowledge-graph memory across
any MCP client. Crude but works.
- **[mem0](https://github.com/mem0ai/mem0)** — sophisticated memory,
self-hostable with a local embedding model.
### Multi-modal
- **Vision**: `qwen2.5-vl:7b` or `llama3.2-vision:11b` via Ollama.
Route via LiteLLM so OpenCode auto-picks vision for image-bearing
prompts. Closes the screenshot gap.
- **Voice**: `whisper-compose.yaml` and `piper-compose.yaml` are
already in `localgenai/`. STT in, TTS out — a "talk to your local
assistant" loop is something Claude Code itself doesn't do.
### Design (genuine new capability)
Anthropic ships **Claude Design** (Anthropic Labs); the OSS analog is
real and active:
- **[nexu-io/open-design](https://github.com/nexu-io/open-design)** —
local-first OSS alternative to Claude Design. 19 Skills, 71
brand-grade Design Systems, generates web/desktop/mobile prototypes,
slides, images, videos with sandboxed preview and HTML/PDF/PPTX/MP4
export. **Ships a stdio MCP server** exposing design source — tokens,
JSX components, entry HTML — as a structured API to any MCP-aware
agent. Pairs naturally with OpenCode + the local model.
- **[Penpot](https://penpot.app)** — full OSS Figma alternative.
Different scope (interactive design tool, not agentic).
- **[OpenPencil](https://github.com/ZSeven-W/openpencil)** — AI-native
vector design tool, "design-as-code." Newer; track.
### Skills / MCP catalog
- **[Official MCP registry](https://registry.modelcontextprotocol.io/)**
- **[Awesome MCP Servers](https://github.com/punkpeye/awesome-mcp-servers)**
(1200+ entries)
- **[mcpservers.org](https://mcpservers.org/)** searchable directory
Build the skills you miss; publish them.
## Prioritized next steps
In rough order of impact-per-hour-spent:
1. **Continue.dev + `nomic-embed-text` in VS Code.** Biggest single
quality jump for local-model coding via codebase indexing.
2. **Symlink `localgenai/opencode/opencode.json`** into place so the
Playwright + SearXNG MCP servers activate. (Already written;
one `ln -sf` away.)
3. **Stand up LiteLLM** as the aggregation layer in front of Ollama —
even at one model, the abstraction pays off the moment you add a
second.
4. **Stand up Multica** for ticket-style work dispatch. Closest Cowork
analog; useful even solo for tracking what the agent has done across
sessions.
5. **Add a vision model** (`qwen2.5-vl:7b`) and route image prompts via
LiteLLM. Closes the screenshot gap.
6. **Stand up OpenDesign** as a compose stack. Connect its MCP server
to OpenCode. Genuinely new capability versus Claude Code.
7. **Try Aider on a real refactor** — different agentic loop than
OpenCode, often better with smaller-context local models.
8. **Wire whisper + piper into a voice loop.** Not a Claude Code parity
item — a *new* capability the local stack uniquely enables.
9. **Mission Control on the box** for token-spend visibility once
multi-model routing through LiteLLM is real.
10. **Evaluate OpenHands or Goose** for autonomous agentic loops where
OpenCode's per-task UX feels too hands-on.
11. **OpenWebUI for household-shared chat access.** Adds value for
non-coding LLM use, fills a small gap in the multi-user story.
## Reference catalogs to monitor
- Official MCP registry: <https://registry.modelcontextprotocol.io/>
- Awesome CLI coding agents: <https://github.com/bradAGI/awesome-cli-coding-agents>
- Awesome MCP servers: <https://github.com/punkpeye/awesome-mcp-servers>
- mcpservers.org directory: <https://mcpservers.org/>
- Anthropic plugin catalog (parity-tracking): <https://claude.com/plugins>
## Open gaps to keep watching
- **Live shared agent sessions** — Cowork's multiplayer mode (multiple
humans driving one agent on one repo, real-time). No direct OSS
equivalent yet; Multica is async-ticket-flow.
- **Cohesive "skills" library** as polished as Anthropic's plugin
catalog. The MCP ecosystem has the parts but lacks the curation.
- **AMD ROCm packages for Ubuntu 26.04** — relevant if we ever decide
to abandon containerization for native ROCm tooling. (Tracked
separately in `TODO.md`.)

StrixHaloMemory.md Normal file

@@ -0,0 +1,134 @@
# Strix Halo memory layout — UMA, GTT, and what fits
The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and
GPU. How much the GPU can actually use is a stack of three knobs:
## The three knobs
1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out
reserved for the GPU at boot. The OS can't see or reclaim it. Pinned,
guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the
max selectable on Framework Desktop is **64 GB**.
3. **GTT, Graphics Translation Table (kernel, runtime)** — a *dynamic*
pool the GPU can borrow from system RAM beyond UMA. Default size on
recent Linux is roughly half of system RAM. Higher latency than UMA,
evictable under memory pressure.
So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM,
plus up to ~64 GB of slower, borrowable GTT, and the OS keeps whatever GTT
the GPU isn't currently using.
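Both pools are visible from the running system, so you can check what a given BIOS + kernel combination actually produced. A quick sysfs read (card index may differ; check `ls /sys/class/drm/`):
```sh
cat /sys/class/drm/card0/device/mem_info_vram_total   # UMA carve-out, bytes
cat /sys/class/drm/card0/device/mem_info_gtt_total    # GTT ceiling, bytes
```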
## Why the BIOS caps at 64 GB
Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling
below total RAM so the OS doesn't get starved. Earlier BIOS revisions
were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth
re-checking https://knowledgebase.frame.work after firmware updates.
The "128 GB support" on Framework's spec page refers to total system
RAM capacity, **not** the UMA maximum. Distinct knobs.
## What 64 GB UMA actually fits
| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp lands) | Q4-Q8 | ~30-50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |
The ❌ rows wouldn't be enjoyable on this hardware regardless — see
"Bandwidth is the real ceiling" below.
## When GTT spillover saves you
If a model is bigger than UMA but ≤ UMA+GTT:
- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT
for weights. Slower than UMA but works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight
buffers. Effectively capped at UMA.
So if you want to run something bigger than 64 GB and stay performant,
the path is: stick with the Vulkan llama.cpp container and accept the
GTT latency hit, *or* wait for a BIOS that unlocks a larger UMA.
## Two architectural strategies (UMA-first vs GTT-first)
There's a real choice in how you allocate memory between OS and GPU:
### Strategy A — UMA-first (current setup)
- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; OS can't reclaim under pressure;
ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; OS lives in 64 GB
### Strategy B — GTT-first (per Gygeek's setup guide)
- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — GPU only takes what it needs; can address far more
than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running
heavy CPU workloads alongside inference); ROCm stacks may refuse
GTT for weight buffers — Vulkan is the safer backend with this
strategy
### Which to pick
- **Single-user, mostly LLMs, ROCm or Vulkan**: Strategy A is fine.
64 GB covers everything that runs well on this box's bandwidth (see
"Bandwidth is the real ceiling" below).
- **Models > 64 GB or you want headroom**: Strategy B. Trade some
worst-case latency for the ability to load anything that fits in
127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: Strategy B
also helps because the OS isn't permanently squeezed into 64 GB.
The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB
(strategy B-compatible) but leaves BIOS UMA to whatever you've
configured. If you want full strategy B, drop BIOS UMA to 512 MB at
the next boot. If staying on strategy A, the kernel param is benign —
GTT just becomes a high ceiling you'll never reach.
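To see which strategy a booted box is actually running, compare the live kernel command line against what's staged in GRUB (the latter only takes effect after the next reboot):
```sh
tr ' ' '\n' < /proc/cmdline | grep -E 'amdgpu|iommu'
grep GRUB_CMDLINE_LINUX /etc/default/grub
```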
## Bandwidth is the real ceiling, not VRAM
256 GB/s memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ active-params
size. Practical implications:
- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B): ~50-80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10-20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3-7 tok/s. Painful.
- **70B dense**: ~2-3 tok/s. Slideshow.
This is why MoE models with small active-params dominate on this box
even though the box has lots of memory. The 64 GB UMA cap doesn't hurt
much because the bandwidth would have made the bigger-than-64 GB models
unusable anyway.
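Making the formula concrete: a throwaway upper-bound estimate assuming ~1 byte per active parameter (Q8-ish), ignoring KV-cache traffic and other overhead, which is why the real-world ranges above sit somewhat lower.
```sh
for gb in 3 12 35 70; do
  awk -v bw=256 -v gb="$gb" 'BEGIN { printf "%2d GB active -> ~%d tok/s ceiling\n", gb, bw/gb }'
done
```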
## Other tunings worth knowing
- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement
on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management.
Reboot required to take effect.
- **TDP** — configurable in BIOS (45–120 W). Gygeek suggests 85 W as a
noise/perf sweet spot. BIOS-only, can't be automated.
## TL;DR
For most users on this hardware: stay on strategy A (UMA = 64 GB), let
pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The
hardware's bandwidth makes anything bigger than ~70 GB of weights
impractical anyway, so the 64 GB cap rarely bites.
Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically
want to load >64 GB models on Vulkan, or if the OS feels squeezed.

StrixHaloSetup.md Normal file

@@ -0,0 +1,112 @@
# Strix Halo Setup Plan — Local Coding Model Server
Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.
## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)
## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.
---
## Phase 0 — Pre-install
- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.
## Phase 1 — Base OS
- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable SSH server, import OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.
**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.
## Phase 2 — Optional slim DE (skip until needed)
- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.
## Phase 3 — GPU stack
- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi`.
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.
**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.
## Phase 4 — Inference engines
- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm`. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — convenient front for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.
**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
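The dual build from the llama.cpp bullet above, as a sketch. Flag names track llama.cpp master and have been renamed before, so cross-check the repo's build docs before running:
```sh
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j
# Install under the names Phase 4 expects:
sudo install build-vulkan/bin/llama-server /usr/local/bin/llama-server-vulkan
sudo install build-rocm/bin/llama-server   /usr/local/bin/llama-server-rocm
```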
## Phase 5 — Storage & model layout
- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:
```
/models/<vendor>/<model>/<quant>.gguf
```
## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)
Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)
→ Open question: research current Strix Halo benchmarks for these before committing to one.
## Phase 7 — Front-end
All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.
## Phase 8 — Monitoring & guardrails
- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- Small systemd unit to auto-start the llama.cpp server on boot (sketch below).
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.
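The auto-start unit is small enough to show in full. A sketch with placeholder paths (binary name, model path, and quant are assumptions; match them to whatever Phase 4 installed and Phase 6 picked):
```sh
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server-vulkan \
  --model /models/qwen/qwen3-coder-30b/q6_k.gguf \
  --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now llama-server
```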
## Phase 9 — Kimi Linear readiness checklist
When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes:
1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.
---
## Status tracking
- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.
## Key references
- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork-required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)

TODO.md Normal file

@@ -0,0 +1,44 @@
# TODO
## ROCm / vLLM on Strix Halo (gfx1151)
The Framework Desktop runs **Ubuntu 26.04 LTS**; AMD only ships ROCm
7.2.3 packages for jammy (22.04) and noble (24.04). We installed the
noble repo but pulled only `rocminfo` + `rocm-smi-lib` for host-side
diagnostics — all heavy ROCm work runs in containers, which ship their
own ROCm stack. This sidesteps the host-side libxml2 ABI mismatch (noble
ships `libxml2.so.2`, 26.04 ships `libxml2.so.16`) that broke the native
HIP toolchain.
### Open questions
- **Does `rocm/vllm:latest` actually run on Strix Halo's iGPU?** vLLM's
AMD support officially targets datacenter cards (MI300X / gfx942).
gfx1151 (RDNA 3.5 consumer) is a different ISA. If the stock image
doesn't initialize the device, try `rocm/vllm-dev:nightly` or build
from source against ROCm 7.x with `-DAMDGPU_TARGETS=gfx1151`. (A quick
container smoke test is sketched after this list.)
- **AMD support for 26.04** — watch https://repo.radeon.com/amdgpu-install/<latest>/ubuntu/
for a directory matching the box's codename. AMD historically lags
Ubuntu LTS by 6–12 months for ROCm packaging.
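A cheap way to attack the first open question above is to ask the stock image itself whether it can see the iGPU. Device flags are the standard ROCm-container set; the check assumes `rocminfo` ships inside the image:
```sh
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  rocm/vllm:latest rocminfo | grep -i gfx
# gfx1151 in the output means the runtime at least enumerates the device;
# no output (or an HSA init error) means start with rocm/vllm-dev:nightly.
```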
### When 26.04 ROCm packages land
If you ever want to do native ROCm work on the host (rather than via
containers):
1. Bump `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` in `pyinfra/deploy.py`
to the new release.
2. Update the apt source URL path in `deploy.py` if AMD adds a new
release codename (currently hardcoded to `noble`).
3. Add a step that runs `amdgpu-install -y --usecase=rocm --no-dkms`
(the current deploy explicitly avoids this to stay slim).
4. `./run.sh`.
For container-only workflows (current default), no action is needed —
container images update independently of the host.
## Pick a coding model (StrixHaloSetup Phase 6)
Open question — research current Strix Halo benchmarks before
committing. Candidates: Qwen3-Coder, DeepSeek-Coder-V3.x, GLM-4.6,
Devstral, Kimi-K2. Track Kimi Linear separately via the weekly routine
referenced in `StrixHaloSetup.md`.

opencode/.gitignore vendored Normal file

@@ -0,0 +1,7 @@
# Plugin deps — install.sh's `npm install` repopulates these.
.opencode/plugin/node_modules/
# Diagnostic log written by phoenix-bridge.js (Mac path).
/tmp/phoenix-bridge.log
.DS_Store

opencode/.opencode/plugin/package-lock.json generated Normal file

File diff suppressed because it is too large

opencode/.opencode/plugin/package.json Normal file

@@ -0,0 +1,13 @@
{
"name": "localgenai-opencode-plugins",
"version": "0.1.0",
"private": true,
"type": "module",
"description": "OpenCode plugins for the localgenai stack. Run `npm install` here once.",
"dependencies": {
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/exporter-trace-otlp-proto": "^0.205.0",
"@opentelemetry/sdk-node": "^0.205.0",
"@opentelemetry/sdk-trace-base": "^2.0.0"
}
}

opencode/.opencode/plugin/phoenix-bridge.js Normal file

@@ -0,0 +1,198 @@
// phoenix-bridge — register an OpenTelemetry SDK so the spans OpenCode
// already emits (when experimental.openTelemetry=true) actually leave
// the process and land in Arize Phoenix on the Framework Desktop.
//
// What it does:
// - Boots @opentelemetry/sdk-node with OTLP/HTTP exporter pointed at
// http://framework:4318/v1/traces (override via PHOENIX_OTLP_ENDPOINT).
// - Inserts @arizeai/openinference-vercel's OpenInferenceSpanProcessor so
// Phoenix's LLM/tool-aware UI gets richer attributes than vanilla OTel.
// - Opens a parent span around each OpenCode session, with sub-sessions
// (subagents spawned via the Task tool) nested under the parent. This
// gives Phoenix a unified tree per top-level conversation.
//
// Caveats (May 2026):
// - Subagent OTel context propagation across session boundaries is best
// effort; OpenCode's experimental.chat.system.transform doesn't expose
// sessionID yet (issue sst/opencode#6142). When that lands, swap the
// manual stitching here for proper traceparent injection.
// - Dynamic imports below are deliberate — per OpenCode's plugin guide,
// direct imports of optional deps freeze the harness if they're absent.
// Run `npm install` in this directory first.
export const PhoenixBridge = async ({ project, directory, worktree }) => {
// Direct file logging — OpenCode swallows stdout/stderr from plugin
// code, so this is how we get visibility into init progress. Tail with:
// tail -f /tmp/phoenix-bridge.log
const { appendFileSync } = await import("node:fs");
const log = (msg) => {
try {
appendFileSync(
"/tmp/phoenix-bridge.log",
`[${new Date().toISOString()}] ${msg}\n`,
);
} catch (_) {
/* best effort */
}
};
log("plugin function entered");
// Phoenix 15.x serves OTLP/HTTP at /v1/traces on the same port as the UI
// (6006). Earlier versions used a separate 4318 — override here if you
// ever pin Phoenix < 15.0.
const endpoint =
process.env.PHOENIX_OTLP_ENDPOINT || "http://framework:6006/v1/traces";
const serviceName = process.env.PHOENIX_SERVICE_NAME || "opencode";
// NodeSDK's `serviceName` constructor option is ignored in some
// versions; setting OTEL_SERVICE_NAME forces the resource attribute
// through the standard env-var path. Has to happen before sdk.start().
if (!process.env.OTEL_SERVICE_NAME) {
process.env.OTEL_SERVICE_NAME = serviceName;
}
log(`endpoint=${endpoint} serviceName=${serviceName}`);
// Dynamic imports so a missing dep produces a warning, not a freeze.
let NodeSDK,
OTLPTraceExporter,
BatchSpanProcessor,
SimpleSpanProcessor,
trace,
context,
diag,
DiagLogLevel;
try {
({ NodeSDK } = await import("@opentelemetry/sdk-node"));
// Use the protobuf-over-HTTP exporter — Phoenix's OTLP receiver
// only speaks protobuf (returns 415 for the JSON variant from
// @opentelemetry/exporter-trace-otlp-http).
({ OTLPTraceExporter } = await import(
"@opentelemetry/exporter-trace-otlp-proto"
));
({ BatchSpanProcessor, SimpleSpanProcessor } = await import(
"@opentelemetry/sdk-trace-base"
));
({ trace, context, diag, DiagLogLevel } = await import(
"@opentelemetry/api"
));
log("OTel imports resolved");
} catch (err) {
console.warn(
"[phoenix-bridge] OTel deps not installed — skipping. " +
"Run `npm install` in .opencode/plugin/ to enable.",
err && err.message,
);
return {};
}
// Pipe OTel's internal diagnostics into our log so export errors are
// visible. Without this, BatchSpanProcessor swallows exporter failures.
// Default to WARN — bump to DEBUG via PHOENIX_OTEL_DEBUG=1 when chasing
// a regression.
const otelLevel =
process.env.PHOENIX_OTEL_DEBUG === "1"
? DiagLogLevel.DEBUG
: DiagLogLevel.WARN;
diag.setLogger(
{
verbose: (m) => log(`[otel:verbose] ${m}`),
debug: (m) => log(`[otel:debug] ${m}`),
info: (m) => log(`[otel:info] ${m}`),
warn: (m) => log(`[otel:warn] ${m}`),
error: (m) => log(`[otel:error] ${m}`),
},
otelLevel,
);
// NodeSDK accepts `serviceName` directly, sidestepping the Resource API
// (which broke between @opentelemetry/resources v1.x and v2.x).
// SimpleSpanProcessor (vs BatchSpanProcessor) exports each span
// immediately — easier to debug while we sort out the pipeline.
const exporter = new OTLPTraceExporter({ url: endpoint });
const sdk = new NodeSDK({
serviceName,
spanProcessors: [new SimpleSpanProcessor(exporter)],
});
sdk.start();
log("sdk.start() returned");
const tracer = trace.getTracer("opencode.phoenix-bridge");
log("tracer obtained");
// Smoke-test span. If this one doesn't show up in Phoenix, the export
// path is broken and there's no point chasing missing session spans.
// Should appear under service "opencode" within ~5 s of OpenCode start.
const bootSpan = tracer.startSpan("phoenix-bridge.boot", {
attributes: {
"opencode.endpoint": endpoint,
"opencode.service_name": serviceName,
},
});
bootSpan.end();
log("boot span emitted (will flush within ~5s)");
// sessionId -> { span, ctx }
const sessions = new Map();
const closeSession = (sessionId) => {
const rec = sessions.get(sessionId);
if (!rec) return;
rec.span.end();
sessions.delete(sessionId);
};
const onShutdown = async () => {
for (const id of [...sessions.keys()]) closeSession(id);
try {
await sdk.shutdown();
} catch (_) {
/* ignore */
}
};
process.once("beforeExit", onShutdown);
process.once("SIGINT", onShutdown);
process.once("SIGTERM", onShutdown);
return {
event: async ({ event }) => {
if (!event || !event.type) return;
switch (event.type) {
case "session.created": {
const session = event.properties?.session || event.properties || {};
const id = session.id;
const parentID = session.parentID;
if (!id) return;
const parentRec = parentID ? sessions.get(parentID) : null;
const parentCtx = parentRec ? parentRec.ctx : context.active();
const span = tracer.startSpan(
parentID ? `subagent ${id}` : `session ${id}`,
{
attributes: {
"opencode.session.id": id,
"opencode.session.parent_id": parentID || "",
"opencode.directory": directory || "",
"opencode.worktree": worktree || "",
"opencode.project": project?.name || "",
},
},
parentCtx,
);
sessions.set(id, { span, ctx: trace.setSpan(parentCtx, span) });
break;
}
case "session.idle":
case "session.deleted":
case "session.compacted": {
const id =
event.properties?.session?.id || event.properties?.id || null;
if (id) closeSession(id);
break;
}
default:
break;
}
},
};
};

opencode/README.md Normal file

@@ -0,0 +1,138 @@
# opencode setup
Canonical OpenCode config + Phoenix bridge plugin for the localgenai
stack. `install.sh` deploys it to `~/.config/opencode/` on a Mac.
## What's wired up
- **Local model**: `framework/qwen3-coder:30b` served by Ollama on the
Framework Desktop, reachable over Tailscale.
- **Playwright MCP** ([@playwright/mcp](https://github.com/microsoft/playwright-mcp)) —
browser automation. The model can navigate pages, click, fill forms,
read DOM snapshots. Closes the agentic-browsing gap.
- **SearXNG MCP** ([mcp-searxng](https://github.com/ihor-sokoliuk/mcp-searxng)) —
web search via your self-hosted instance at <https://searxng.n0n.io>.
No external API keys, no rate-limit roulette.
- **Phoenix bridge plugin** (`.opencode/plugin/phoenix-bridge.js`) —
exports OpenTelemetry spans for every LLM call, tool call, and
subagent invocation to the Phoenix container running on the Framework
Desktop. Per-prompt waterfall / flamegraph viz at
<http://framework:6006>.
## Setup
```sh
./install.sh
```
Idempotent — re-run after editing `opencode.json` or pulling changes to
the plugin. Each step checks before doing work. Specifically:
1. Verifies Homebrew is present (won't install it for you)
2. `brew install node uv jq sst/tap/opencode` (skips if already at latest)
3. Pre-caches Playwright's chromium so the first MCP call is instant
4. `npm install` in `.opencode/plugin/` for the Phoenix bridge OTel deps
5. Generates `~/.config/opencode/opencode.json` from the repo's
`opencode.json`, rewriting relative plugin paths to absolute so
OpenCode loads the plugin regardless of which directory it's launched
from
Step 5 is the reason the deployed config isn't a plain symlink. The
repo's `opencode.json` uses a relative plugin path (`./...`) so it stays
valid in place; the deployed copy is generated with that path resolved
to an absolute one. Edits to the repo's `opencode.json` need a re-run
of `./install.sh` to take effect.
## Verify
```sh
# Local model reachable
curl -s http://framework:11434/v1/models | jq '.data[].id'
# SearXNG instance answers JSON
curl -s 'https://searxng.n0n.io/search?q=test&format=json' | jq '.results | length'
```
Then in opencode:
```
opencode
> /mcp # should list playwright and searxng as connected
> search the web for "qwen3-coder benchmarks"
> open https://example.com and tell me the H1
```
## Phoenix tracing
The plugin at `.opencode/plugin/phoenix-bridge.js` boots an OpenTelemetry
SDK on OpenCode startup and ships every span to Phoenix on the Framework
Desktop. With `experimental.openTelemetry: true` (already set in
`opencode.json`), OpenCode emits Vercel AI SDK spans that Phoenix renders
as a per-turn waterfall: user prompt → main agent's `ai.streamText` →
each tool call (built-in + MCP) with token counts and latencies inline.
The plugin uses `@opentelemetry/exporter-trace-otlp-proto` (not `-http`)
because Phoenix's OTLP receiver only speaks protobuf — the JSON variant
returns 415.
Defaults can be overridden via env vars (set before launching opencode):
| Variable | Default | Purpose |
|---|---|---|
| `PHOENIX_OTLP_ENDPOINT` | `http://framework:6006/v1/traces` | OTLP/HTTP target |
| `PHOENIX_SERVICE_NAME` | `opencode` | Phoenix project name |
| `PHOENIX_OTEL_DEBUG` | unset | `1` to surface OTel internal logs |
### Verifying
```sh
: > /tmp/phoenix-bridge.log # truncate prior runs
opencode # any directory; CWD doesn't matter
tail -f /tmp/phoenix-bridge.log
```
Healthy startup looks like:
```
plugin function entered
endpoint=http://framework:6006/v1/traces serviceName=opencode
OTel imports resolved
sdk.start() returned
tracer obtained
boot span emitted (will flush within ~5s)
```
Then open <http://framework:6006/projects> — an `opencode` project should
appear with at least one `phoenix-bridge.boot` span. Send a prompt in
OpenCode and real LLM-call traces follow.
If the plugin's deps aren't installed, OpenCode logs a warning and the
plugin no-ops — the rest of OpenCode still works fine.
### Known limitations
- **Subagent nesting is best-effort.** The plugin opens a parent span
per session and tries to stitch child sessions (Task-tool subagents)
under their parent, but Vercel AI SDK spans live in their own OTel
trace context. Until [sst/opencode#6142](https://github.com/sst/opencode/issues/6142)
exposes `sessionID` in the `chat.system.transform` hook, child-session
spans may show as separate traces in Phoenix.
- **Console output from plugins is swallowed by OpenCode's TUI.** That's
why init progress goes to `/tmp/phoenix-bridge.log` rather than stdout.
## Notes
- **SearXNG JSON output** must be enabled on the instance for the MCP
server to work. If `format=json` returns HTML or 403, edit
`settings.yml` on the SearXNG box: `search.formats: [html, json]`,
restart.
- **Playwright first-run** downloads ~200 MB of browser binaries into
`~/Library/Caches/ms-playwright/`. Subsequent runs are instant.
- **Tool-calling reliability** with Qwen3-Coder is decent but not
Claude-grade. If a tool call hangs or returns malformed JSON, the
model is the culprit, not the MCP. Worth trying the same prompt
against a hosted Claude or GPT-5 to confirm before debugging the
server.
- **Adding more MCP servers**: drop another entry under the `mcp` key
using the same `type/command/enabled` shape. The
[official MCP registry](https://registry.modelcontextprotocol.io/)
and [Awesome MCP Servers](https://github.com/punkpeye/awesome-mcp-servers) catalog options.

opencode/install.sh Executable file

@@ -0,0 +1,145 @@
#!/usr/bin/env bash
# Bootstrap or re-sync the OpenCode harness on this Mac.
#
# Idempotent — re-run after pulling changes to opencode.json or the
# Phoenix bridge plugin. Each step checks before doing work.
#
# What this does:
# 1. Verify Homebrew is present
# 2. Install node, uv, opencode, jq (skips if already at latest)
# 3. Pre-cache Playwright's chromium so the first MCP call is instant
# 4. Install the Phoenix bridge plugin's OTel deps
# 5. Generate ~/.config/opencode/opencode.json from the repo's
# opencode.json with relative plugin paths rewritten to absolute,
# so opencode loads the plugin regardless of where it's launched.
#
# Usage: ./install.sh (from this directory)
set -euo pipefail
cd "$(dirname "$0")"
HERE="$(pwd)"
# --- Pretty printing ---------------------------------------------------------
bold() { printf '\033[1m%s\033[0m\n' "$*"; }
ok() { printf ' \033[32m✓\033[0m %s\n' "$*"; }
info() { printf ' → %s\n' "$*"; }
warn() { printf ' \033[33m!\033[0m %s\n' "$*"; }
fail() { printf ' \033[31m✗\033[0m %s\n' "$*"; exit 1; }
# --- 1. Homebrew -------------------------------------------------------------
bold "[1/5] Homebrew"
if ! command -v brew >/dev/null 2>&1; then
fail "brew not found. Install from https://brew.sh, then re-run."
fi
ok "brew $(brew --version | head -1 | awk '{print $2}')"
# --- 2. CLI deps -------------------------------------------------------------
bold "[2/5] CLI dependencies"
brew_install_if_missing() {
local pkg="$1"
local bin="${2:-$1}"
if command -v "$bin" >/dev/null 2>&1; then
ok "$pkg already installed ($(command -v "$bin"))"
else
info "installing $pkg"
brew install "$pkg"
ok "$pkg installed"
fi
}
brew_install_if_missing node node
brew_install_if_missing uv uv
brew_install_if_missing jq jq
# opencode is in a tap; check the binary, not the formula name.
if command -v opencode >/dev/null 2>&1; then
ok "opencode already installed ($(command -v opencode))"
else
info "tapping sst/tap and installing opencode"
brew install sst/tap/opencode
ok "opencode installed"
fi
# --- 3. Playwright browsers --------------------------------------------------
bold "[3/5] Playwright browser cache"
PW_CACHE="${HOME}/Library/Caches/ms-playwright"
if [[ -d "$PW_CACHE" ]] && find "$PW_CACHE" -name "chrome" -o -name "Chromium*" 2>/dev/null | grep -q .; then
ok "browsers already cached at $PW_CACHE"
else
info "downloading chromium (~200 MB) — first run only"
npx -y @playwright/mcp@latest --help >/dev/null 2>&1 || true
ok "browsers cached"
fi
# --- 4. Phoenix bridge plugin deps ------------------------------------------
bold "[4/5] Phoenix bridge plugin deps"
if [[ -d ".opencode/plugin/node_modules" && -f ".opencode/plugin/package-lock.json" ]]; then
# Re-run npm install if package.json is newer than the lockfile, otherwise skip.
if [[ ".opencode/plugin/package.json" -nt ".opencode/plugin/package-lock.json" ]]; then
info "package.json is newer than lockfile — running npm install"
( cd .opencode/plugin && npm install )
ok "deps updated"
else
ok "deps already installed"
fi
else
info "installing OTel deps (one-time, ~40 MB)"
( cd .opencode/plugin && npm install )
ok "deps installed"
fi
# --- 5. Generate ~/.config/opencode/opencode.json ---------------------------
# The repo's opencode.json uses relative plugin paths so it stays valid
# in-place. Rewriting them to absolute paths here makes opencode find the
# plugin regardless of which directory it was launched from. Re-run this
# script after editing opencode.json.
bold "[5/5] Deploy global config"
mkdir -p "${HOME}/.config/opencode"
src="${HERE}/opencode.json"
dst="${HOME}/.config/opencode/opencode.json"
# If the user previously had a symlink from the old install.sh, replace it.
if [[ -L "$dst" ]]; then
info "removing stale symlink at $dst"
rm "$dst"
fi
# And the old .opencode dir symlink — no longer needed now that plugin
# paths are absolute.
if [[ -L "${HOME}/.config/opencode/.opencode" ]]; then
info "removing stale ~/.config/opencode/.opencode symlink"
rm "${HOME}/.config/opencode/.opencode"
fi
# Rewrite any relative plugin path (./foo, ../foo) to an absolute path
# rooted at this directory. Absolute paths and npm-package refs pass
# through untouched.
jq --arg here "$HERE" '
.plugin = (
(.plugin // [])
| map(
if type == "string" and (startswith("./") or startswith("../"))
then ($here + "/" + ltrimstr("./") | gsub("/\\./"; "/"))
else .
end
)
)
' "$src" > "$dst.tmp"
mv "$dst.tmp" "$dst"
ok "wrote $dst"
info "plugin paths resolved to:"
jq -r '.plugin[]?' "$dst" | sed 's/^/ /'
echo
bold "Done."
cat <<EOF
Re-run this script after editing opencode.json — the deployed copy at
~/.config/opencode/opencode.json is generated, not symlinked.
Next steps:
- Verify the model server: curl -s http://framework:11434/v1/models | jq '.data[].id'
- Verify Phoenix is up: curl -sf http://framework:6006/
- Run opencode and send one prompt. A trace should appear at
http://framework:6006 within a couple of seconds.
- If nothing appears, check ~/.local/share/opencode/log/*.log for a
line starting with "[phoenix-bridge]".
EOF

opencode/opencode.json Normal file

@@ -0,0 +1,41 @@
{
"$schema": "https://opencode.ai/config.json",
"experimental": {
"openTelemetry": true
},
"plugin": ["./.opencode/plugin/phoenix-bridge.js"],
"provider": {
"framework": {
"npm": "@ai-sdk/openai-compatible",
"name": "Framework Desktop (Strix Halo)",
"options": {
"baseURL": "http://framework:11434/v1"
},
"models": {
"qwen3-coder:30b": {
"name": "Qwen3 Coder 30B (local)",
"limit": {
"context": 131072,
"output": 16384
}
}
}
}
},
"mcp": {
"playwright": {
"type": "local",
"command": ["npx", "-y", "@playwright/mcp@latest"],
"enabled": true
},
"searxng": {
"type": "local",
"command": ["npx", "-y", "mcp-searxng"],
"enabled": true,
"environment": {
"SEARXNG_URL": "https://searxng.n0n.io"
}
}
},
"model": "framework/qwen3-coder:30b"
}

piper-compose.yaml Normal file

@@ -0,0 +1,12 @@
services:
piper:
image: rhasspy/wyoming-piper:latest
container_name: wyoming-piper
restart: unless-stopped
ports:
- "10200:10200"
volumes:
- ./piper-data:/data
command:
- --voice
- en_US-lessac-medium

pyinfra/HowToScan.md Normal file

@@ -0,0 +1,82 @@
# How to onboard an existing host into pyinfra
There's no auto-generator that reads a running system and emits a
deploy.py — not in pyinfra, not in Ansible, not in any IaC tool.
Reverse-engineering "what's intentional vs incidental" needs human
judgment. Below is the workflow that actually works.
## What pyinfra gives you for free
The **facts** API queries live state without writing any deploy:
```sh
pyinfra @host fact deb.DebPackages # installed packages
pyinfra @host fact apt.AptSources # extra apt repos
pyinfra @host fact systemd.SystemdServices # service state
pyinfra @host fact server.Crontab # cron
pyinfra @host fact server.Users # users
```
Output is JSON. There's no fact → op translator, but the data is easy
to grep through and tells you what's actually on the box.
## Practical workflow
1. Run a scanner against the live host that dumps the *meaningful*
state — packages explicitly installed, enabled services, custom
configs, dotfiles, sudoers fragments, fstab non-defaults. Skip the
distro-default noise.
2. Hand-pick the bits worth managing into a new `<station>/deploy.py`.
3. Run with `--dry` against the live box; pyinfra's diff shows what's
still drifting.
4. Iterate until `--dry` reports "no changes."
## One-shot scanner
```sh
ssh you@host '
echo "=== manually-installed packages ==="
apt-mark showmanual
echo "=== enabled services ==="
systemctl list-unit-files --state=enabled --type=service --no-legend | awk "{print \$1}"
echo "=== extra apt repos ==="
ls /etc/apt/sources.list.d/
echo "=== sudoers fragments ==="
ls /etc/sudoers.d/
echo "=== fstab non-default ==="
grep -vE "^#|^$|/proc|/sys|/dev/pts" /etc/fstab
echo "=== users with login shells ==="
getent passwd | awk -F: "\$3>=1000 && \$3<60000 {print \$1}"
echo "=== docker stacks ==="
ls /srv/docker/ 2>/dev/null
echo "=== home dotfiles ==="
sudo find /home -maxdepth 3 -name ".*rc" -o -name ".*config" 2>/dev/null
'
```
That gets you ~80% of what's worth managing in roughly 30 lines of
pyinfra. The remaining 20% is case-by-case (kernel cmdline tweaks,
hand-edited /etc files, dotfile contents) that no scanner fully
captures — you write those as you re-discover the box's quirks.
## What scanners can't catch
- The *intent* behind a config (why this `sysctl` value, why this
cron entry).
- Hand-edited fragments mixed into otherwise-default files (e.g. a
custom line in `/etc/ssh/sshd_config`).
- Out-of-band state: data in databases, container volumes,
`/var/lib/<service>`.
For these, treat pyinfra as managing the *bones* (packages, services,
users, configs) and leave the data alone. Re-deploying onto fresh
hardware should give you a working system; restoring data is a
separate concern.
## If "scan and reproduce" is a hard requirement
The native answer is **NixOS**. The whole OS is declarative;
`nixos-generate-config` reads your machine and writes a full config
that recreates it. Different paradigm, big switch — but the right
answer if reproducible-from-a-file is core to your workflow and you
aren't already deep in Ubuntu-land.

pyinfra/README.md Normal file

@@ -0,0 +1,12 @@
# pyinfra
One folder per station. Each subfolder is a self-contained pyinfra
deploy: `inventory.py`, `deploy.py`, `run.sh`, plus any compose files
or assets that ship to the host.
| Station | Host | Notes |
|---------|------|-------|
| [`framework/`](framework/README.md) | `10.0.0.237` | Framework Desktop (Strix Halo, 128 GB) — local LLM box |
To bring up a station, `cd` into its folder and run `./run.sh`. See the
station's own README for prerequisites.

pyinfra/framework/README.md Normal file

@@ -0,0 +1,184 @@
# pyinfra: Strix Halo bring-up
Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.
## Manual prerequisites
1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
jammy/noble; later Ubuntus install but break the host-side toolchain
(libxml2 ABI). Enable SSH, import your laptop key, create user `noise`.
Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`,
plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
thread sudo passwords. One-time setup:
```sh
ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
```
## Run
```sh
uv tool install pyinfra
./run.sh # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry # any extra args are forwarded to pyinfra
```
Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.
## What the deploy does
- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
`amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml`
dropped in, not auto-started — you edit the model path then
`docker compose up -d`
(OpenWebUI needs no edits — it's pre-configured to find Ollama at
`host.docker.internal:11434` and uses `searxng.n0n.io` for web search)
(Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI at :6006,
OpenHands UI at :3030 — see "Monitoring stack" and "Agent harnesses"
below)
If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.
## After the deploy: starting an inference service
```sh
ssh noise@10.0.0.237
sudo tailscale up # one-time, interactive
# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml # edit the --model path
docker compose up -d
curl localhost:8080/v1/models # smoke test
```
Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).
## Tunables
Top of `deploy.py`:
- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
release. The .deb filename has a build suffix that doesn't derive from
the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
https://github.com/Umio-Yasuno/amdgpu_top/releases.
Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`
— pin tags here.
## Monitoring stack
Two compose stacks watch the inference services:
- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
`/sys/class/drm/cardN/device/` directly; this is the only path that
reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
for util/power/temp on this APU.
First-time bring-up:
```sh
cd /srv/docker/beszel
docker compose up -d beszel # 1. hub
# 2. open http://framework:8090 → create admin → "Add system"
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
BESZEL_TOKEN=<token-from-dialog>
BESZEL_KEY=ssh-ed25519 AAAA…
EOF
sudo chgrp docker /srv/docker/beszel/.env
sudo chmod 640 /srv/docker/beszel/.env
docker compose up -d --force-recreate beszel-agent # 4. agent
```
- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
metrics for the LLM stack (cost, tokens, latency aggregated across
sessions and models). Auto-instruments Ollama and vLLM via
OpenTelemetry without app-code changes; for llama.cpp, the compose
file already passes `--metrics` so OpenLIT can scrape `/metrics`.
ClickHouse-backed. OTLP receivers exposed on the host at **:4327
(gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.
Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
to create an admin account on first load.
- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
agent waterfall / flamegraph. Designed to answer "show me what one
OpenCode turn actually did" — full call tree, nested LLM/tool calls,
token counts inline. Single container, SQLite-backed. OTLP receivers
on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.
Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
immediately available; no auth in the self-hosted single-user path.
See [`localgenai/opencode/README.md`](../../opencode/README.md) for
how OpenCode is configured to ship traces here.
The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.
## Agent harnesses
Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:
- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
Best for casual chat and household-shared LLM access.
Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.
- **OpenHands** (`/srv/docker/openhands`, http://framework:3030,
loopback-only) — autonomous agent in a Docker sandbox. Spawns a
per-conversation `agent-server` container that can write code, run
tests, browse the web. Pre-configured for Ollama at
`openai/qwen3-coder:30b` over the OpenAI-compatible endpoint;
ships traces to Phoenix.
Bring-up:
```sh
cd /srv/docker/openhands && docker compose up -d
# Tunnel the loopback-bound UI from your laptop:
ssh -L 3030:127.0.0.1:3030 noise@framework
open http://localhost:3030
```
First run pulls the agent-server image (~2 GB) lazily on first
conversation, not at startup, so the orchestrator comes up fast but
your first message takes 30–60 s. Pre-0.44 state path was
`~/.openhands-state`; not relevant on a fresh install.
Tool-call quality with local models is much better when Ollama's
context is bumped — the compose at `/srv/docker/ollama` already sets
`OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.
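  A quick way to confirm the model stays resident on the GPU after the
  first request — the PROCESSOR column should read 100% GPU; a CPU/GPU
  split means the KV cache spilled:
  ```sh
  docker exec ollama ollama ps
  ```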
OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.
The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized. The toolbox builds also ship rocWMMA flash attention,
but that path only kicks in on the ROCm tags. Auto-rebuilt against
llama.cpp master.
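Once a model path is set in the llama compose file and the stack is up, a
quick smoke test (assuming the default :8080 mapping):
```sh
curl -s http://framework:8080/health            # {"status":"ok"} once the model is loaded
curl -s http://framework:8080/metrics | head    # the Prometheus endpoint OpenLIT scrapes
```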

View File

@@ -0,0 +1,66 @@
# Beszel — host + container + GPU dashboard.
# https://beszel.dev
#
# Picked over Prometheus+Grafana for this box because:
# - The agent's `amd_sysfs` collector reads /sys/class/drm/card*/device/
# directly, which is the only reliable GPU metric source on Strix Halo
# (gfx1151). AMD's amd-smi / Device Metrics Exporter return N/A for
# util/power/temp on this APU (ROCm#6035), so the official Prometheus
# exporter path is dead.
# - Two containers vs six.
#
# First-time setup (WebSocket connection model — current Beszel default):
# 1. `docker compose up -d beszel` (start the hub)
# 2. Open http://framework:8090, create the admin account
# 3. Click "Add system" — the dialog gives you a TOKEN and an SSH KEY.
# 4. Edit /srv/docker/beszel/.env (created empty by pyinfra; pyinfra
# doesn't overwrite). Add:
# BESZEL_TOKEN=<token-from-dialog>
# BESZEL_KEY=ssh-ed25519 AAAA…
# 5. `docker compose up -d --force-recreate beszel-agent`
#
# Docker Compose auto-reads the sibling .env file for ${VAR} interpolation
# in the environment block below — so secrets stay out of the compose
# file (which pyinfra overwrites) but the env-var names match exactly
# what the agent expects.
#
# Why both TOKEN and KEY: TOKEN identifies which system this agent is,
# KEY authenticates the agent (the SSH key is reused as the auth secret
# in the WebSocket handshake). Rotate either by editing the .env and
# `docker compose up -d --force-recreate`.
services:
beszel:
image: henrygd/beszel:latest
container_name: beszel
restart: unless-stopped
ports:
- "8090:8090"
volumes:
- /srv/docker/beszel/data:/beszel_data
beszel-agent:
image: henrygd/beszel-agent:latest
container_name: beszel-agent
restart: unless-stopped
# Host networking so the agent sees real CPU/memory/network counters
# without bridge-NAT distortion.
network_mode: host
volumes:
# Read-only Docker socket for per-container CPU/mem/net.
- /var/run/docker.sock:/var/run/docker.sock:ro
# Sysfs paths the AMD GPU collector reads.
- /sys/class/drm:/sys/class/drm:ro
- /sys/class/hwmon:/sys/class/hwmon:ro
environment:
# Pulled from /srv/docker/beszel/.env at compose-parse time.
TOKEN: "${BESZEL_TOKEN:-}"
KEY: "${BESZEL_KEY:-}"
# WebSocket dial-out target — the hub on this same host. The agent
# is on host networking, so localhost is the host machine, where
# the hub container exposes port 8090.
HUB_URL: "http://localhost:8090"
# Optional fallback: legacy SSH listener for hub-initiated probing.
# Harmless to keep — hub only uses it if WebSocket is unreachable.
LISTEN: "45876"
# Enable the AMD sysfs GPU collector.
GPU: "true"

View File

@@ -0,0 +1,44 @@
# llama.cpp server, gfx1151-optimized via kyuz0's Strix Halo toolboxes.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
#
# Tag options on docker.io/kyuz0/amd-strix-halo-toolboxes:
# vulkan-radv — most stable, recommended default (this one)
# vulkan-amdvlk — alternate Vulkan driver, sometimes faster
# rocm-7.2.2 — ROCm 7.x; needs /dev/kfd + render group_add (see vllm.yml pattern)
# rocm-6.4.4 — ROCm 6.x fallback
# rocm7-nightlies — avoid: caps memory allocation to 64 GB (May 2026)
#
# Toolbox images use a shell entrypoint, so we override to launch
# llama-server directly. Edit the --model path before `docker compose up -d`.
services:
llama:
image: kyuz0/amd-strix-halo-toolboxes:vulkan-radv
container_name: llama
restart: unless-stopped
devices:
- /dev/dri:/dev/dri
volumes:
- /models:/models:ro
ports:
- "8080:8080"
entrypoint: ["llama-server"]
command:
- --model
- /models/REPLACE/ME/model.gguf
- --host
- 0.0.0.0
- --port
- "8080"
- --n-gpu-layers
- "999"
- --ctx-size
- "32768"
# Required for GPU backends on Strix Halo per Gygeek's setup
# guide. Forces full load into GPU memory rather than mmap.
- --no-mmap
# Flash attention — works on Vulkan too; the big win is on the
# ROCm tag where kyuz0's build has rocWMMA acceleration.
- --flash-attn
# Expose Prometheus metrics at /metrics — scraped by OpenLIT for
# tokens/sec, KV-cache use, queue depth, and request latency.
- --metrics

View File

@@ -0,0 +1,38 @@
# Ollama, ROCm backend. Serves models on demand — safe to start before
# you've put anything in /models.
#
# Storage: Ollama's content-addressed blob store is bind-mounted under
# /models/ollama so all model data on the host lives under /models.
# Note: Ollama's blobs are SHA256-named, not raw GGUFs — llama.cpp/vLLM
# can't load them directly. Keep curated GGUFs at /models/<vendor>/...
# for those engines.
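#
# After bring-up, pull a model through the container, e.g. the tag the
# agent harnesses in this repo point at:
#   docker exec ollama ollama pull qwen3-coder:30b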
services:
ollama:
image: ollama/ollama:rocm
container_name: ollama
restart: unless-stopped
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
# Numeric GIDs of host's video (44) and render (991) groups — names
# don't exist inside the container, but the GIDs need to match the
# host so /dev/kfd + /dev/dri are accessible.
group_add:
- "44"
- "991"
environment:
# Strix Halo's iGPU is gfx1151 (RDNA 3.5), which Ollama's bundled
# ROCm runtime doesn't recognize — without this override it falls
# back to CPU silently. 11.0.0 = gfx1100 (Navi 31); the RDNA 3.x
# ISAs are close enough that gfx1100 kernels run on gfx1151.
- HSA_OVERRIDE_GFX_VERSION=11.0.0
# Default context. 256K (the upstream default for Qwen3-Coder)
# blows the KV cache up to ~25-30 GB and forces ollama to split
# layers between GPU and CPU. 64K keeps the model fully on GPU
# while still being plenty for coding contexts.
- OLLAMA_CONTEXT_LENGTH=65536
volumes:
- /models/ollama:/root/.ollama
- /models:/models:ro
ports:
- "11434:11434"

View File

@@ -0,0 +1,94 @@
# OpenHands 1.7 (May 2026) — autonomous agent in a Docker sandbox.
# https://docs.openhands.dev — repo: github.com/OpenHands/OpenHands
#
# Architecture: this container is a thin orchestrator. Per conversation
# it spawns a separate `agent-server` container on the host Docker daemon
# (that's what the docker.sock mount is for) and talks to it over REST.
# AGENT_SERVER_IMAGE_TAG below pins the per-session sandbox image.
#
# Complements OpenCode: OpenCode is the interactive terminal driver,
# OpenHands is for autonomous loops (write code, run tests, browse the
# web in a sandbox, report back).
services:
openhands:
# Org rebranded All-Hands-AI → OpenHands at v1.0 (Dec 2025); the old
# docker.all-hands.dev/all-hands-ai/openhands image is gone.
image: docker.openhands.dev/openhands/openhands:1.7
container_name: openhands
restart: unless-stopped
# 3030 host-side because :3000 is OpenWebUI and :3001 is OpenLIT.
# Loopback-only — reach via SSH tunnel or Tailscale, don't expose
# this directly.
ports:
- "127.0.0.1:3030:3000"
volumes:
# Required: orchestrator spawns sandbox containers via the host daemon.
- /var/run/docker.sock:/var/run/docker.sock
# State, settings, conversation history, MCP config, secrets.
# Pre-0.44 used ~/.openhands-state — N/A on a fresh install.
- /srv/docker/openhands/state:/.openhands
# Workspace the sandbox reads/writes. The host path on the LEFT must
# match SANDBOX_VOLUMES below — the sandbox container is spawned by
# the host daemon, so its bind mount is resolved on the host, not
# via this container's filesystem.
- /srv/docker/openhands/workspace:/srv/docker/openhands/workspace
# Linux Docker doesn't auto-provide host.docker.internal; this fixes it.
extra_hosts:
- "host.docker.internal:host-gateway"
environment:
# ---- Sandbox / agent-server image pin ----
# Replaces the V0.x SANDBOX_RUNTIME_CONTAINER_IMAGE. 1.19.1-python is
# the agent-server tag the 1.7 main image expects; bumping the main
# image will likely want a newer agent-server tag — check the
# upstream docker-compose.yml on each upgrade.
AGENT_SERVER_IMAGE_REPOSITORY: ghcr.io/openhands/agent-server
AGENT_SERVER_IMAGE_TAG: 1.19.1-python
# ---- Workspace mount into the per-session sandbox ----
# SANDBOX_VOLUMES is the V1 replacement for the deprecated
# WORKSPACE_BASE / WORKSPACE_MOUNT_PATH variables.
SANDBOX_VOLUMES: /srv/docker/openhands/workspace:/workspace:rw
# Match the host's `noise` UID so files the agent writes aren't
# owned by root.
SANDBOX_USER_ID: "1000"
# ---- LLM: host Ollama via OpenAI-compatible endpoint ----
# Per the official local-llms doc, the recommended path is the
# /v1 OpenAI-compatible endpoint with the `openai/` LiteLLM prefix
# — NOT `ollama/...`, which has worse tool-call behaviour.
LLM_MODEL: "openai/qwen3-coder:30b"
LLM_BASE_URL: "http://host.docker.internal:11434/v1"
LLM_API_KEY: "ollama" # any non-empty string; Ollama doesn't auth.
# Default tool-calling renderer mismatches Qwen3-Coder's training
# format and produces malformed calls (issue #8140). Forcing false
# falls back to OpenHands' prompt-based protocol — costs some token
# efficiency, gains reliability with local models.
LLM_NATIVE_TOOL_CALLING: "false"
LOG_ALL_EVENTS: "true"
# ---- Optional: ship traces to Phoenix ----
# OpenHands V1 uses LiteLLM + OpenTelemetry; standard OTLP env vars are
# honoured. Phoenix 15.x takes OTLP/HTTP on its UI port (:6006 — see
# ../phoenix/docker-compose.yml), and the SDK appends /v1/traces to this
# base endpoint. Comment out to disable.
OTEL_EXPORTER_OTLP_ENDPOINT: "http://host.docker.internal:6006"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_SERVICE_NAME: "openhands"
# Per-session agent-server containers spawn headless chromium for
# browser tasks; default 64 MB shm causes silent crashes.
shm_size: "2gb"
# Playwright/chromium needs higher fd limits than Docker's default.
ulimits:
nofile:
soft: 65536
hard: 65536
# Bridge networking is correct here. Don't switch to network_mode: host
# — the spawned sandbox containers reach this orchestrator via Docker
# bridge DNS, which only works on a bridge network.

View File

@@ -0,0 +1,59 @@
# OpenLIT — LLM observability (traces, costs, KV-cache, prompt/decode
# latencies, tokens/sec). https://openlit.io
#
# Two services:
# - clickhouse : columnar store for traces (internal only, no host port)
# - openlit : Next.js UI on :3001 (3000 is OpenWebUI)
#
# Why OpenLIT vs Langfuse/Phoenix/Laminar: it's the only OSS dashboard
# (May 2026) that auto-instruments Ollama AND vLLM via OpenTelemetry
# without adding code to client apps. For llama.cpp, start the server
# with --metrics (see ../llama/docker-compose.yml) and OpenLIT can scrape
# /metrics.
#
# To send traces from a Python script calling Ollama/vLLM:
# pip install openlit
#   python -c "import openlit; openlit.init(otlp_endpoint='http://framework:4328')"
#   (4328 = this stack's remapped OTLP/HTTP receiver — see the ports block below)
#
# To wire OpenWebUI → OpenLIT, install OpenLIT's pipeline middleware
# in OpenWebUI per https://openlit.io/blogs/openlit-openwebui.
services:
clickhouse:
image: clickhouse/clickhouse-server:25.3-alpine
container_name: openlit-clickhouse
restart: unless-stopped
environment:
CLICKHOUSE_USER: default
CLICKHOUSE_PASSWORD: OPENLIT
CLICKHOUSE_DB: openlit
volumes:
- /srv/docker/openlit/clickhouse:/var/lib/clickhouse
ulimits:
nofile:
soft: 262144
hard: 262144
openlit:
image: ghcr.io/openlit/openlit:latest
container_name: openlit
restart: unless-stopped
depends_on:
- clickhouse
ports:
# Host:container — UI on 3001 (OpenWebUI owns 3000).
- "3001:3000"
# OTLP receivers exposed on the host so SDKs running off-box can
# ship traces here. gRPC + HTTP. Remapped (4327/4328 → 4317/4318)
# because Phoenix owns the canonical 4317/4318 ports for OpenCode
# traces — OpenLIT here is a secondary/fleet-metrics destination.
- "4327:4317"
- "4328:4318"
environment:
INIT_DB_HOST: clickhouse
INIT_DB_PORT: "8123"
INIT_DB_USERNAME: default
INIT_DB_PASSWORD: OPENLIT
INIT_DB_DATABASE: openlit
SQLITE_DATABASE_URL: file:/app/client/data/data.db
volumes:
- /srv/docker/openlit/data:/app/client/data

View File

@@ -0,0 +1,25 @@
# OpenWebUI — ChatGPT-like web UI in front of Ollama. Pre-configured to
# use the host's Ollama instance and the project's SearXNG for web
# search. Default port 3000.
#
# Persistent state (users, conversations, uploaded docs, RAG vector
# index) lives at /srv/docker/openwebui/data so backups touch one path.
services:
openwebui:
image: ghcr.io/open-webui/open-webui:main
container_name: openwebui
restart: unless-stopped
ports:
- "3000:8080"
extra_hosts:
# Lets the container reach Ollama on the host's :11434 without
# needing to share Docker networks.
- "host.docker.internal:host-gateway"
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:11434
# Built-in web search via the project's SearXNG instance.
- ENABLE_RAG_WEB_SEARCH=true
- RAG_WEB_SEARCH_ENGINE=searxng
- SEARXNG_QUERY_URL=https://searxng.n0n.io/search?q=<query>&format=json
volumes:
- /srv/docker/openwebui/data:/app/backend/data

View File

@@ -0,0 +1,35 @@
# Arize Phoenix — per-trace agent waterfall / flamegraph viz.
# https://github.com/Arize-ai/phoenix
#
# Picked over Langfuse for "show me one OpenCode turn as a tree":
# - Single container vs Langfuse's six (Postgres+ClickHouse+Redis+MinIO+web+worker).
# - First-class ingestion of Vercel AI SDK spans (which is what OpenCode
# emits under the hood when experimental.openTelemetry=true).
# - Best-in-class waterfall + agent-graph view for nested LLM/tool calls.
#
# Complements OpenLIT, doesn't replace it: OpenLIT is the fleet-metrics
# layer (cost / tokens / latency aggregated across sessions). Phoenix is
# the per-prompt debugger (see what one turn actually did).
#
# Bring-up: `docker compose up -d` — no first-run setup needed; UI prompts
# for project name on first trace ingest. Storage is SQLite at /data.
services:
phoenix:
image: arizephoenix/phoenix:latest
container_name: phoenix
restart: unless-stopped
ports:
# UI + OTLP/HTTP both ride on 6006 in Phoenix 15.x — HTTP traces go
# to http://framework:6006/v1/traces. (Pre-15 had a separate 4318;
# the consolidation happened in Phoenix v15.0.)
- "6006:6006"
# OTLP/gRPC stays separate.
- "4317:4317"
environment:
PHOENIX_WORKING_DIR: /data
# Phoenix listens on all interfaces by default; explicit for clarity.
PHOENIX_HOST: 0.0.0.0
PHOENIX_PORT: "6006"
PHOENIX_GRPC_PORT: "4317"
volumes:
- /srv/docker/phoenix/data:/data

View File

@@ -0,0 +1,36 @@
# vLLM, ROCm backend.
#
# NOTE: vLLM's official ROCm support targets datacenter cards (MI300X /
# gfx942). Strix Halo is gfx1151 — support varies by image tag and
# release. If `rocm/vllm:latest` doesn't run on this iGPU, try
# `rocm/vllm-dev:nightly` or build from source against ROCm 7.x.
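#
# Smoke test once a model path is set and the container is up — vLLM
# speaks the OpenAI API, so the model list endpoint should answer:
#   curl -s http://framework:8000/v1/models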
services:
vllm:
image: rocm/vllm:latest
container_name: vllm
restart: unless-stopped
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
cap_add:
- SYS_PTRACE
security_opt:
- seccomp=unconfined
# Numeric GIDs of host's video (44) and render (991) groups — names
# don't exist inside the container.
group_add:
- "44"
- "991"
shm_size: 16g
ipc: host
volumes:
- /models:/models:ro
ports:
- "8000:8000"
command:
- --model
- /models/REPLACE/ME
- --host
- 0.0.0.0
- --port
- "8000"

487
pyinfra/framework/deploy.py Normal file
View File

@@ -0,0 +1,487 @@
"""
pyinfra deploy for Framework Desktop (Ryzen AI Max+ 395 / Strix Halo).
Containerized layout: inference engines (llama.cpp, vLLM, Ollama) run as
docker compose services, each shipping its own ROCm/Vulkan stack. The
host stays minimal — kernel + driver + diagnostics + Docker.
Run with:
./run.sh # equivalent to pyinfra inventory.py deploy.py
Idempotent — re-running only does work that's actually needed. Includes
one-shot cleanup steps for artifacts left over from the earlier native
build (llama.cpp git checkout, symlinks, systemd unit, full ROCm).
"""
from io import StringIO
from pyinfra.operations import apt, files, server, systemd
from pyinfra import host
# --- Tunables ----------------------------------------------------------------
# Latest stable as of 2026-04-30. Verify at https://repo.radeon.com/amdgpu-install/.
ROCM_VERSION = "7.2.3"
# The .deb filename has an opaque build suffix — find the exact name at
# the URL above and paste it here.
AMDGPU_INSTALL_DEB = "amdgpu-install_7.2.3.70203-1_all.deb"
# amdgpu_top — modern GPU monitor that handles new AMD cards / APUs
# (Strix Halo / gfx1151). Replaces radeontop, which doesn't know about
# RDNA 3.5. Verify at https://github.com/Umio-Yasuno/amdgpu_top/releases.
AMDGPU_TOP_VERSION = "0.11.4-1"
AMDGPU_TOP_DEB = f"amdgpu-top_without_gui_{AMDGPU_TOP_VERSION}_amd64.deb"
SSH_USER = host.data.get("ssh_user", "noise")
MODELS_DIR = "/models"
# /srv is the FHS-blessed location for "data and configuration for
# services this system provides." Owned root:docker with setgid +
# group-write so any docker-group member can manage compose stacks.
COMPOSE_DIR = "/srv/docker"
# --- Phase 1 — Base OS basics ------------------------------------------------
apt.update(name="apt update", _sudo=True)
apt.packages(
name="HWE kernel (>=6.11 for ROCm 7)",
packages=["linux-generic-hwe-24.04"],
_sudo=True,
)
# User basics + monitoring tools.
apt.packages(
name="Base CLI tools",
# radeontop intentionally omitted — it predates RDNA 3.5 / Strix Halo
# and just errors with "no VRAM support". amdgpu_top installed below.
packages=[
"tmux",
"vim",
"htop",
"btop",
"nvtop",
"git",
"curl",
"ca-certificates",
"unzip",
],
_sudo=True,
)
# Vulkan diagnostics on the host (containers ship their own runtime).
apt.packages(
name="Vulkan host diagnostics",
packages=["mesa-vulkan-drivers", "vulkan-tools"],
_sudo=True,
)
# uv (Python package/tool manager).
server.shell(
name="Install / upgrade uv",
commands=[
"curl -LsSf https://astral.sh/uv/install.sh | "
"env UV_INSTALL_DIR=/usr/local/bin UV_UNMANAGED_INSTALL=1 sh",
],
_sudo=True,
)
# huggingface_hub CLI for `huggingface-cli download <repo> --local-dir ...`
# Lands at ~/.local/bin/huggingface-cli for the SSH user. Other users can
# repeat the command themselves.
server.shell(
name="Install / upgrade huggingface_hub CLI",
commands=["uv tool install --upgrade 'huggingface_hub[cli]'"],
_sudo=True,
_sudo_user=SSH_USER,
)
# gpakosz/.tmux config (Oh My Tmux!).
TMUX_CONF_DIR = f"/home/{SSH_USER}/.tmux"
server.shell(
name="Clone gpakosz/.tmux",
commands=[
f"test -d {TMUX_CONF_DIR}/.git || "
f"git clone --depth 1 https://github.com/gpakosz/.tmux.git {TMUX_CONF_DIR}",
],
_sudo=True,
_sudo_user=SSH_USER,
)
files.link(
name="Symlink ~/.tmux.conf -> ~/.tmux/.tmux.conf",
path=f"/home/{SSH_USER}/.tmux.conf",
target=f"{TMUX_CONF_DIR}/.tmux.conf",
user=SSH_USER,
group=SSH_USER,
_sudo=True,
)
# Seed the user override file only if absent — preserves customizations.
server.shell(
name="Seed ~/.tmux.conf.local (if missing)",
commands=[
f"test -f /home/{SSH_USER}/.tmux.conf.local || "
f"cp {TMUX_CONF_DIR}/.tmux.conf.local /home/{SSH_USER}/",
],
_sudo=True,
_sudo_user=SSH_USER,
)
# --- Tailscale ---------------------------------------------------------------
# Tailscale's noble repo works fine on later Ubuntus — the package is
# self-contained Go binaries.
files.download(
name="Fetch Tailscale apt key",
src="https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg",
dest="/usr/share/keyrings/tailscale-archive-keyring.gpg",
mode="644",
_sudo=True,
)
files.download(
name="Fetch Tailscale apt list",
src="https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list",
dest="/etc/apt/sources.list.d/tailscale.list",
mode="644",
_sudo=True,
)
apt.update(name="apt update (Tailscale)", _sudo=True)
apt.packages(name="Install Tailscale", packages=["tailscale"], _sudo=True)
systemd.service(
name="Enable tailscaled",
service="tailscaled",
running=True,
enabled=True,
_sudo=True,
)
# `sudo tailscale up` is interactive (browser auth) — run manually once.
# --- Docker -----------------------------------------------------------------
apt.packages(
name="Install Docker + compose plugin",
packages=["docker.io", "docker-compose-v2"],
_sudo=True,
)
systemd.service(
name="Enable docker daemon",
service="docker",
running=True,
enabled=True,
_sudo=True,
)
# lazydocker (TUI for docker). System-wide install via official script.
server.shell(
name="Install / upgrade lazydocker",
commands=[
"curl -fsSL "
"https://raw.githubusercontent.com/jesseduffield/lazydocker/master/scripts/install_update_linux.sh "
"| DIR=/usr/local/bin bash",
],
_sudo=True,
)
# --- GPU access (host kernel/driver bits only) ------------------------------
# AMD's amdgpu-install package adds the ROCm apt repo; we use it just to
# get rocminfo for host-side diagnostics. Containers ship
# their own ROCm.
files.directory(
name="apt keyring dir",
path="/etc/apt/keyrings",
mode="755",
_sudo=True,
)
amdgpu_deb = f"/tmp/{AMDGPU_INSTALL_DEB}"
amdgpu_url = (
f"https://repo.radeon.com/amdgpu-install/{ROCM_VERSION}/ubuntu/noble/"
f"{AMDGPU_INSTALL_DEB}"
)
server.shell(
name="Fetch amdgpu-install .deb",
commands=[f"test -f {amdgpu_deb} || curl -fsSL {amdgpu_url} -o {amdgpu_deb}"],
_sudo=True,
)
server.shell(
name="Install amdgpu-install package",
commands=[f"apt install -y {amdgpu_deb}"],
_sudo=True,
)
# Idempotent cleanup: if a prior run installed the full ROCm userspace
# (~25 GB), tear it down before installing the diagnostic-only subset.
# On a fresh box this is a no-op.
server.shell(
name="Remove full ROCm install if present",
commands=[
"if dpkg -l rocm-dev 2>/dev/null | grep -q '^ii'; then "
" amdgpu-install -y --uninstall || true; "
"fi",
],
_sudo=True,
)
apt.packages(
name="ROCm host diagnostics (rocminfo)",
# rocminfo is the stable diagnostic. The SMI tool's package name has
# churned across ROCm releases (rocm-smi-lib → amd-smi-lib in 7.x);
# install on demand if you need it.
packages=["rocminfo"],
_sudo=True,
)
# amdgpu_top — fetch + install the .deb if not already at the pinned
# version. Replaces radeontop for newer AMD cards (Strix Halo / gfx1151).
amdgpu_top_deb = f"/tmp/{AMDGPU_TOP_DEB}"
amdgpu_top_url = (
f"https://github.com/Umio-Yasuno/amdgpu_top/releases/download/"
f"v{AMDGPU_TOP_VERSION.split('-')[0]}/{AMDGPU_TOP_DEB}"
)
server.shell(
name="Install / upgrade amdgpu_top",
commands=[
f"dpkg -s amdgpu-top 2>/dev/null | grep -q '^Version: {AMDGPU_TOP_VERSION}' || "
f"(test -f {amdgpu_top_deb} || curl -fsSL {amdgpu_top_url} -o {amdgpu_top_deb}; "
f"apt install -y {amdgpu_top_deb})",
],
_sudo=True,
)
# Group membership for /dev/kfd + /dev/dri access (needed for GPU passthrough
# into containers, and for unprivileged host-side rocminfo).
server.group(name="ensure render group", group="render", _sudo=True)
server.group(name="ensure video group", group="video", _sudo=True)
server.user(
name="Add login user to render/video/docker",
user=SSH_USER,
groups=["render", "video", "docker"],
append=True,
_sudo=True,
)
# Kernel cmdline tuning per Gygeek/Framework-strix-halo-llm-setup:
# - amd_iommu=off — ~6 % memory-read improvement on Strix Halo
# - amdgpu.gttsize=117760 — ~115 GB GTT ceiling so the GPU can borrow
# most of system RAM dynamically. Acts as a
# ceiling, not an allocation. See ../../StrixHaloMemory.md
# for the UMA-vs-GTT trade-off discussion.
# Requires a reboot to take effect; pyinfra leaves that to you.
files.line(
name="GRUB cmdline (amd_iommu, gttsize)",
path="/etc/default/grub",
line=r"^GRUB_CMDLINE_LINUX_DEFAULT=.*",
replace='GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"',
_sudo=True,
)
server.shell(
name="Regenerate GRUB config",
commands=["update-grub"],
_sudo=True,
)
# --- Storage layout ---------------------------------------------------------
files.directory(
name="/models root",
path=MODELS_DIR,
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
for sub in ("moonshotai", "qwen", "deepseek", "zai", "mistralai"):
files.directory(
name=f"{MODELS_DIR}/{sub}",
path=f"{MODELS_DIR}/{sub}",
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
# Ollama bind-mounts its content-addressed store here.
files.directory(
name=f"{MODELS_DIR}/ollama",
path=f"{MODELS_DIR}/ollama",
user=SSH_USER,
group=SSH_USER,
mode="755",
_sudo=True,
)
# --- Compose files for inference services ----------------------------------
files.directory(
name="Compose root",
path=COMPOSE_DIR,
group="docker",
mode="2775", # setgid: new files inherit docker group
_sudo=True,
)
# Idempotent cleanup of earlier locations.
files.directory(
name="Remove old /srv/compose",
path="/srv/compose",
present=False,
_sudo=True,
)
files.directory(
name="Remove old ~/docker compose dir",
path=f"/home/{SSH_USER}/docker",
present=False,
_sudo=True,
)
for svc in (
"llama",
"vllm",
"ollama",
"openwebui",
"beszel",
"openlit",
"phoenix",
"openhands",
):
files.directory(
name=f"compose/{svc} dir",
path=f"{COMPOSE_DIR}/{svc}",
group="docker",
mode="2775",
_sudo=True,
)
files.put(
name=f"compose/{svc}/docker-compose.yml",
src=f"compose/{svc}.yml",
dest=f"{COMPOSE_DIR}/{svc}/docker-compose.yml",
group="docker",
mode="664",
_sudo=True,
)
# OpenWebUI persistent state (users, conversations, uploaded docs,
# RAG vector index) — bind-mounted into the container.
files.directory(
name="OpenWebUI data dir",
path=f"{COMPOSE_DIR}/openwebui/data",
group="docker",
mode="2775",
_sudo=True,
)
# Beszel persistent state (admin account, system list, metric history).
files.directory(
name="Beszel data dir",
path=f"{COMPOSE_DIR}/beszel/data",
group="docker",
mode="2775",
_sudo=True,
)
# Sibling .env file Docker Compose auto-reads for variable interpolation.
# Pyinfra creates it empty if missing and never overwrites — the user
# fills in BESZEL_TOKEN= and BESZEL_KEY= from the hub UI on first setup.
# Mode 640 root:docker so docker group members can read it at compose
# parse time but it isn't world-readable.
files.file(
name="Beszel .env (placeholder)",
path=f"{COMPOSE_DIR}/beszel/.env",
present=True,
mode="640",
user="root",
group="docker",
_sudo=True,
)
# Tear down the previous /etc/default/beszel-agent location if a prior
# deploy created it (idempotent — no-op on a fresh box).
files.file(
name="Remove legacy /etc/default/beszel-agent",
path="/etc/default/beszel-agent",
present=False,
_sudo=True,
)
# OpenLIT persistent state — Next.js app config + ClickHouse trace store.
files.directory(
name="OpenLIT data dir",
path=f"{COMPOSE_DIR}/openlit/data",
group="docker",
mode="2775",
_sudo=True,
)
files.directory(
name="OpenLIT ClickHouse data dir",
path=f"{COMPOSE_DIR}/openlit/clickhouse",
group="docker",
mode="2775",
_sudo=True,
)
# Phoenix persistent state (SQLite-backed trace store).
files.directory(
name="Phoenix data dir",
path=f"{COMPOSE_DIR}/phoenix/data",
group="docker",
mode="2775",
_sudo=True,
)
# OpenHands state + workspace. Owned by the SSH user (UID 1000) so the
# sandbox containers — which run as SANDBOX_USER_ID=1000 — can write to
# the shared workspace without root-owned files leaking out.
files.directory(
name="OpenHands state dir",
path=f"{COMPOSE_DIR}/openhands/state",
user=SSH_USER,
group=SSH_USER,
mode="2775",
_sudo=True,
)
files.directory(
name="OpenHands workspace dir",
path=f"{COMPOSE_DIR}/openhands/workspace",
user=SSH_USER,
group=SSH_USER,
mode="2775",
_sudo=True,
)
# --- Cleanup of artifacts from the prior native-build deploy ----------------
# All idempotent — `present=False` is a no-op when the target is absent.
server.shell(
name="Stop & disable old native llama-server.service",
commands=[
"systemctl disable --now llama-server.service 2>/dev/null || true",
],
_sudo=True,
)
files.file(
name="Remove old llama-server.service",
path="/etc/systemd/system/llama-server.service",
present=False,
_sudo=True,
)
files.link(
name="Remove old llama-server-vulkan symlink",
path="/usr/local/bin/llama-server-vulkan",
present=False,
_sudo=True,
)
files.link(
name="Remove old llama-server-rocm symlink",
path="/usr/local/bin/llama-server-rocm",
present=False,
_sudo=True,
)
files.directory(
name="Remove old llama.cpp checkout",
path="/opt/llama.cpp",
present=False,
_sudo=True,
)
server.shell(
name="Stop & remove native Ollama install",
commands=[
"systemctl disable --now ollama.service 2>/dev/null || true",
"rm -f /etc/systemd/system/ollama.service /usr/local/bin/ollama",
"userdel ollama 2>/dev/null || true",
],
_sudo=True,
)
systemd.daemon_reload(name="systemctl daemon-reload", _sudo=True)

View File

@@ -0,0 +1,5 @@
# pyinfra inventory for the Framework Desktop / Strix Halo box.
framework_desktop = [
("framework", {"ssh_hostname": "10.0.0.70", "ssh_user": "noise"}),
]

7
pyinfra/framework/run.sh Executable file
View File

@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Wrapper: invokes pyinfra against the Framework Desktop, forwarding any
# extra args (e.g. --dry, -v) to pyinfra. Assumes NOPASSWD sudo on the box.
set -euo pipefail
cd "$(dirname "$0")"
exec pyinfra -v --ssh-password-prompt inventory.py deploy.py "$@"

View File

@@ -0,0 +1,311 @@
# Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory
This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on your AMD Strix Halo hardware with 128GB unified memory.
## Hardware Specifications
- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CU RDNA 3.5)
- **Architecture**: Zen 5 CPU with unified memory access
## Prerequisites
### Hardware Requirements
1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
2. Minimum 128GB RAM (as specified)
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration
### Software Requirements
1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. HuggingFace CLI client
## Deployment Steps
### Phase 1: Hardware Configuration
#### 1.1 BIOS Configuration
Before installation, you must set these BIOS options:
- **UMA Frame Buffer Size**: 512MB (for display)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance
#### 1.2 Kernel Configuration
Ensure you're running kernel 6.16.9 or later:
```bash
# Check current kernel
uname -r
# If upgrading needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```
#### 1.3 GRUB Boot Parameters
Edit `/etc/default/grub`:
```bash
sudo nano /etc/default/grub
```
Add to `GRUB_CMDLINE_LINUX_DEFAULT`:
```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```
Apply the changes:
```bash
sudo update-grub
sudo reboot
```
#### 1.4 GPU Access Configuration
Create udev rules for proper GPU permissions:
```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```
### Phase 2: ROCm Installation
#### 2.1 Install ROCm from AMD Repository
```bash
# Add AMD ROCm repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```
#### 2.2 Verify ROCm Installation
```bash
# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"
# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
```
### Phase 3: Model Setup
#### 3.1 Download Kimi Linear GGUF Model
```bash
# Install huggingface-cli
pip install "huggingface_hub[cli]"
# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear
# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
--include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
--local-dir ./
# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
# --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
# --local-dir ./
```
#### 3.2 Validate Download
```bash
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
```
### Phase 4: llama.cpp Setup for AMD ROCm
#### 4.1 Build llama.cpp with ROCm Support
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y
# Build for ROCm
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151"
cmake --build build --config Release -j$(nproc)
```
#### 4.2 Alternative: Use Pre-built Container
Instead of building, you can use a container with ROCm pre-configured:
```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh
# Create LLM container with ROCm
distrobox create llama-rocm \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```
### Phase 5: Test Inference
#### 5.1 Run a Basic Test
```bash
# If building locally:
./build/bin/llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
# If using container:
distrobox enter llama-rocm
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
```
### Phase 6: Benchmarking and Optimization
#### 6.1 Run Performance Benchmarks
```bash
# Run llama-bench for performance testing
llama-bench \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
-mmp 0 \
-ngl 99 \
-p 512 \
-n 128
```
#### 6.2 Monitor GPU Resources
```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi
# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```
## Performance Optimization Tips
### Memory Management
1. **Use the full GTT ceiling**: with `amdgpu.gttsize=117760` set in GRUB, the GPU can borrow up to ~115GB of system RAM for compute
2. **Choose appropriate quantization**:
- Q4_K_M: Good balance of quality and size (~30GB)
- IQ4_XS: Smaller but with some quality loss (~26GB)
- Q5_K_M: Higher quality but larger (~35GB)
### Inference Parameters
```bash
# Optimal flags for AMD Strix Halo
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Your prompt text" \
-n 512 \
--temp 0.7 \
--top-p 0.9 \
--repeat-penalty 1.1
```
## Expected Performance on Strix Halo
| Model Quantization | Predicted Performance | Notes |
| ------------------ | --------------------- | ----- |
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient |
| Q5_K_M (~35GB) | 12-18 tokens/sec | Higher quality |
## Troubleshooting
### Common Issues
1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Check udev rules
2. **Slow model loading**: Add `--no-mmap` flag
3. **Memory allocation errors**: Verify GTT size in GRUB
4. **Container GPU access**: Check device permissions
### Verification Commands
```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"
# Test basic GPU functionality
rocminfo
```
## Additional Tools and Integration
### For Development/Testing
1. **llama-cpp-python**: For Python integration
```python
# Install the package first (shell): pip install llama-cpp-python
# Test with Python
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf"
)
```
2. **OpenWebUI**: Web interface for LLMs
3. **Ollama**: Alternative for simpler deployment
4. **vLLM**: High-throughput inference (for API deployment)
## Monitoring and Resource Usage
### System Resources to Monitor
- **GPU Utilization**: Use `watch -n 1 rocm-smi`
- **Memory Usage**: Check `/sys/class/drm/card*/device/mem_info_gtt_total`
- **CPU Load**: Monitor with `htop`
- **Temperature**: Monitor GPU temperature
### Storage Considerations
- Model files require ~30-50GB on disk
- Additional swap space recommended for complex tasks
- Consider SSD storage for optimal I/O performance
## Recommendations for Different Use Cases
### For Coding Tasks
- Use `--temp 0.3` for more deterministic outputs
- Enable `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for long context generation
### For Research/Testing
- Use `Q4_K_M` for reliable performance
- Try `IQ4_XS` to test memory efficiency
- Monitor with `llama-bench` for reproducible results
### For Production Use
- Consider containerized deployment for stability
- Set up automated monitoring for resource usage
- Create model-specific configuration files
## Conclusion
Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.
With this setup, you should be able to run Kimi Linear and achieve performance that rivals much more expensive hardware setups.

View File

@@ -0,0 +1,109 @@
# Local LLM Performance Testing Framework
A comprehensive framework for comparing the performance and quality of different locally hosted LLMs on coding tasks.
## Features
- **Performance Metrics**: Response time, tokens per second, CPU/memory/GPU usage
- **Quality Assessment**: Automated code quality evaluation
- **Multi-Model Testing**: Test multiple LLMs simultaneously
- **Parallel Execution**: Run tests efficiently
- **Comprehensive Reporting**: Detailed test results and comparisons
## Installation
```bash
# Install required dependencies
pip install psutil
```
## Usage
### Quick Start
1. Create a test suite and add models:
```python
from llm_test_framework import TestSuite, MockLLM
# Create test suite
suite = TestSuite()
# Add models to test
suite.add_model(MockLLM("gpt4all", {"model_path": "/path/to/model"}))
suite.add_model(MockLLM("llama.cpp", {"model_path": "/path/to/llama"}))
# Define test tasks
tasks = [
{
"name": "Simple function",
"prompt": "Write a Python function to add two numbers",
"expected_output": "return a + b"
}
]
# Run tests
suite.run_all_tests(tasks)
suite.save_results()
```
### Running the Example
```bash
python llm_test_framework.py
```
This will run sample tests with mock models and save results to JSON.
## Components
### 1. TestSuite
Main class managing the testing process, including:
- Model registration
- Test execution
- Result collection and saving
### 2. LLMInterface
Abstract base class for LLM implementations with methods:
- `generate()`: Generate text from prompt
- `get_model_info()`: Get model information
### 3. TestResult
Data class storing individual test results with:
- Performance metrics
- Quality scores
- Raw outputs
- Success status
## Extending the Framework
To add support for a new LLM:
1. Create a new class inheriting from `LLMInterface`
2. Implement `generate()` and `get_model_info()` methods
3. Add your model to the test suite
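For instance, a hypothetical adapter for any OpenAI-compatible local
endpoint (Ollama, llama.cpp, vLLM) — the endpoint URL and model name below
are placeholders to adjust; only the standard library is used:
```python
import json
import urllib.request
from typing import Any, Dict

from llm_test_framework import LLMInterface


class OpenAICompatibleLLM(LLMInterface):
    """Calls a local /v1/chat/completions endpoint (Ollama, llama.cpp, vLLM)."""

    def generate(self, prompt: str, **kwargs) -> str:
        # Build a standard chat-completions request from the task prompt.
        body = json.dumps({
            "model": self.config.get("model", "qwen3-coder:30b"),
            "messages": [{"role": "user", "content": prompt}],
            "temperature": kwargs.get("temperature", 0.3),
        }).encode()
        req = urllib.request.Request(
            self.config.get("base_url", "http://framework:11434/v1") + "/chat/completions",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=600) as resp:
            data = json.loads(resp.read())
        return data["choices"][0]["message"]["content"]

    def get_model_info(self) -> Dict[str, Any]:
        return {"name": self.model_name, "type": "openai-compatible", **self.config}
```
Register it like any other model, e.g.
`suite.add_model(OpenAICompatibleLLM("qwen3-coder:30b", {"base_url": "http://framework:11434/v1"}))`.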
## Test Types
The framework supports multiple types of coding tasks:
- Basic function implementations
- Code refactoring examples
- Algorithm challenges
- System design questions
## Output
Results are saved in JSON format with:
- Performance metrics (response time, TPS)
- Resource usage (CPU, memory, GPU)
- Quality assessment scores
- Raw model outputs
- Success/failure indicators
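An illustrative record (field names follow `TestResult` in
`llm_test_framework.py`; the values here are made up):
```json
{
  "model_name": "Model_A",
  "task_name": "Simple function",
  "response_time": 0.52,
  "tps": 11.5,
  "cpu_usage": 55.0,
  "memory_usage": 1342177280,
  "gpu_usage": 0.0,
  "quality_score": 80.0,
  "raw_output": "def add(a, b):\n    return a + b",
  "expected_output": "return a + b",
  "success": true,
  "timestamp": 1746700000.0
}
```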
## Future Enhancements
- Integration with local LLM APIs (text-generation-webui, llama.cpp, etc.)
- Advanced code quality evaluation using static analysis
- Web-based dashboard for results visualization
- Database storage for historical comparisons
- Automated statistical analysis of results

View File

@@ -0,0 +1,158 @@
# Local LLM Performance Testing Plan
## Overview
This document outlines a comprehensive testing framework to compare the performance and quality of different locally hosted LLMs on coding tasks. The framework will measure both speed and quality metrics to provide actionable insights for model selection and optimization.
## Objectives
- Evaluate response time and throughput of different LLMs
- Assess code quality and correctness of generated outputs
- Measure resource utilization during code generation
- Provide standardized comparison between different models
- Enable reproducible testing environments
## Key Metrics to Measure
### Performance Metrics
1. **Response Time**: Time from request to first token
2. **Tokens Per Second (TPS)**: Generation speed
3. **Total Latency**: Complete generation time
4. **Resource Usage**:
- CPU utilization
- Memory consumption
- GPU utilization (for GPU-enabled models)
- VRAM usage (for GPU-enabled models)
### Quality Metrics
1. **Code Syntax Accuracy**: Valid syntax according to language rules
2. **Functional Correctness**: Code produces expected outputs
3. **Code Efficiency**: Algorithmic efficiency and best practices
4. **Code Readability**: Code structure and documentation
5. **Task Completion Rate**: Properly solving the given coding challenge
## Testing Framework Architecture
### 1. Test Runner Component
- Initialize and configure different LLM instances
- Manage test execution workflow
- Coordinate parallel execution of multiple models
- Handle test parameters and configurations
### 2. Performance Monitoring
- Real-time resource usage tracking
- Timing measurements for each request
- Memory allocation monitoring
- GPU utilization tracking
### 3. Task Definition Catalog
- Collection of coding challenges at various difficulty levels
- Standardized test cases with expected outputs
- Task categorization by programming language and complexity
- Reproducible test scenarios
### 4. Evaluation Engine
- Automated code quality assessment
- Functional correctness validation
- Syntax checking and linting
- Comparison against expected outputs
### 5. Reporting System
- Performance comparison dashboards
- Quality assessment reports
- Resource usage visualizations
- Summary statistics and trend analysis
## Test Scenarios
### Basic Programming Tasks
- Variable assignments and data structures
- Function implementations
- Control flow (loops, conditionals)
- Error handling and validation
### Intermediate Complexity
- Algorithm implementation
- Data structure manipulation
- API integration examples
- Modular code design
### Advanced Challenges
- Multi-threading and concurrency
- Mathematical and scientific computing
- System design and architecture
- Code optimization and refactoring
## Implementation Approach
### Phase 1: Core Framework Development
1. Create base testing infrastructure
2. Implement model interface abstraction
3. Build test runner and execution engine
4. Develop basic performance monitoring
### Phase 2: Task and Evaluation System
1. Design coding challenge catalog
2. Implement automated quality assessment
3. Create validation mechanisms
4. Develop reporting templates
### Phase 3: Advanced Features
1. Parallel execution support
2. Advanced visualization capabilities
3. Statistical analysis and confidence intervals
4. Configuration management and customization
## Expected Output
### Test Reports
- Detailed performance metrics per model
- Quality assessment scores
- Resource usage comparisons
- Statistical significance analysis
- Recommendations for model selection
### Visualization Components
- Performance comparison charts
- Resource utilization graphs
- Quality trend analysis
- Multi-model side-by-side comparisons
## Configuration Options
### Model Parameters
- Context window size
- Sampling parameters (temperature, top-p, etc.)
- System prompts and templates
- Hardware specifications
### Test Parameters
- Number of iterations per task
- Timeout thresholds
- Resource monitoring intervals
- Quality evaluation criteria weights
## Integration Points
### Existing Tools
- Local LLM APIs (e.g., text-generation-webui)
- Python-based LLM libraries
- System monitoring tools (htop, nvidia-smi)
- Code linting and validation tools
### Data Storage
- CSV/JSON output files
- Database for historical comparisons
- Version control integration
- Cloud storage for sharing results
## Success Criteria
- Reproducible test results
- Clear performance differentiation between models
- Comprehensive quality assessment coverage
- Scalable architecture for additional models
- User-friendly interface for result interpretation
## Risks and Mitigation
1. **Hardware Variability**: Address with controlled test environments
2. **Inconsistent Results**: Use multiple iterations and statistical analysis
3. **Resource Limitations**: Implement proper monitoring and resource allocation
4. **Quality Assessment Bias**: Use multiple evaluation criteria and human validation

View File

@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Example usage of the LLM Test Framework with real models
"""
import time
import json
from pathlib import Path
from llm_test_framework import TestSuite, MockLLM
# Example tasks for testing
SAMPLE_TASKS = [
{
"name": "Hello World Function",
"prompt": "Write a Python function that prints 'Hello, World!'",
"expected_output": "print('Hello, World!')",
"parameters": {"temperature": 0.3}
},
{
"name": "Sum Function",
"prompt": "Write a Python function to calculate the sum of two numbers",
"expected_output": "return a + b",
"parameters": {"temperature": 0.5}
},
{
"name": "FizzBuzz",
"prompt": "Write a Python function to solve the FizzBuzz problem",
"expected_output": "if i % 3 == 0 and i % 5 == 0",
"parameters": {"temperature": 0.7}
}
]
def main():
print("Starting LLM Performance Testing Framework")
print("=" * 50)
# Create test suite
suite = TestSuite("sample_test_results")
# Add some sample models
print("Adding sample models...")
suite.add_model(MockLLM("Model_A", {"context": 2048}))
suite.add_model(MockLLM("Model_B", {"context": 4096}))
# Run tests
print("Running sample tests...")
suite.run_all_tests(SAMPLE_TASKS)
# Save results
print("Saving results...")
suite.save_results("sample_results.json")
# Show summary
print("\n" + "=" * 50)
print("TEST SUMMARY")
print("=" * 50)
for result in suite.results:
print(f"Model: {result.model_name}")
print(f"Task: {result.task_name}")
print(f"Response Time: {result.response_time:.2f}s")
print(f"Quality Score: {result.quality_score:.1f}/100")
print(f"Success: {'Yes' if result.success else 'No'}")
print("-" * 30)
print(f"\nTotal tests run: {len(suite.results)}")
print(f"Results saved to: {suite.output_dir}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Find duplicate files in a directory tree by content hash, with parallelization support.
"""
import hashlib
import os
import sys
from collections import defaultdict
from pathlib import Path
from multiprocessing import Pool, cpu_count
from typing import Dict, List, Tuple
def get_file_hash(filepath: Path, chunk_size: int = 8192) -> str:
"""
Calculate the SHA256 hash of a file.
Args:
filepath: Path to the file
chunk_size: Size of chunks to read at a time
Returns:
SHA256 hash of the file as a hexadecimal string
"""
hash_sha256 = hashlib.sha256()
try:
with open(filepath, 'rb') as f:
# Read file in chunks to handle large files efficiently
for chunk in iter(lambda: f.read(chunk_size), b""):
hash_sha256.update(chunk)
return hash_sha256.hexdigest()
except Exception as e:
print(f"Error reading {filepath}: {e}")
return ""
def process_file(filepath: Path) -> Tuple[str, Path]:
"""
Process a single file and return its hash and path.
Args:
filepath: Path to the file
Returns:
Tuple of (file_hash, file_path)
"""
file_hash = get_file_hash(filepath)
return (file_hash, filepath)
def find_duplicates_parallel(root_dir: Path, num_processes: int = None) -> Dict[str, List[Path]]:
"""
Find duplicate files in a directory tree by content hash using parallel processing.
Args:
root_dir: Root directory to search
num_processes: Number of processes to use (default: CPU count)
Returns:
Dictionary mapping file hashes to lists of file paths with that hash
"""
if num_processes is None:
num_processes = cpu_count()
# Get all files in the directory tree
files = []
for file_path in root_dir.rglob('*'):
if file_path.is_file():
files.append(file_path)
print(f"Found {len(files)} files to process")
# Process files in parallel
with Pool(processes=num_processes) as pool:
results = pool.map(process_file, files)
# Group files by hash
hash_to_files = defaultdict(list)
for file_hash, file_path in results:
if file_hash: # Only add files that were successfully hashed
hash_to_files[file_hash].append(file_path)
# Filter out unique files (only keep duplicates)
duplicates = {hash_val: file_list for hash_val, file_list in hash_to_files.items()
if len(file_list) > 1}
return duplicates
def print_duplicates(duplicates: Dict[str, List[Path]]) -> None:
"""
Print the duplicate files in a formatted way.
Args:
duplicates: Dictionary mapping file hashes to lists of file paths
"""
if not duplicates:
print("No duplicates found.")
return
print(f"Found {len(duplicates)} sets of duplicate files:")
for hash_val, file_list in duplicates.items():
print(f"\nHash: {hash_val}")
for file_path in file_list:
print(f" {file_path}")
def main():
"""Main function to run the duplicate file finder."""
if len(sys.argv) < 2:
print("Usage: python find_duplicates.py <root_directory> [num_processes]")
sys.exit(1)
root_dir = Path(sys.argv[1])
if not root_dir.exists():
print(f"Error: Directory {root_dir} does not exist")
sys.exit(1)
num_processes = None
if len(sys.argv) > 2:
try:
num_processes = int(sys.argv[2])
except ValueError:
print("Error: num_processes must be an integer")
sys.exit(1)
print(f"Searching for duplicates in {root_dir} using {num_processes or cpu_count()} processes")
duplicates = find_duplicates_parallel(root_dir, num_processes)
print_duplicates(duplicates)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,246 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework
"""
import time
import psutil
import json
import subprocess
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path
@dataclass
class TestResult:
"""Container for test results"""
model_name: str
task_name: str
response_time: float
tps: float
cpu_usage: float
memory_usage: float
gpu_usage: float
quality_score: float
raw_output: str
expected_output: str
success: bool
timestamp: float
@dataclass
class PerformanceMetrics:
"""Container for performance metrics"""
response_time: float
tps: float
cpu_avg: float
memory_avg: float
gpu_avg: float
class LLMInterface(ABC):
"""Abstract base class for LLM interfaces"""
def __init__(self, model_name: str, config: Dict[str, Any]):
self.model_name = model_name
self.config = config
@abstractmethod
def generate(self, prompt: str, **kwargs) -> str:
"""Generate text from prompt"""
pass
@abstractmethod
def get_model_info(self) -> Dict[str, Any]:
"""Get model information"""
pass
class TestSuite:
"""Main test suite class"""
def __init__(self, output_dir: str = "test_results"):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
self.results: List[TestResult] = []
self.models: List[LLMInterface] = []
def add_model(self, model: LLMInterface):
"""Add a model to test"""
self.models.append(model)
def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
"""Run a single test on a model"""
# Sample resources before starting the clock — psutil.cpu_percent with
# interval=1 blocks for a second, which would otherwise inflate the
# measured response time.
cpu_start = psutil.cpu_percent(interval=1)
memory_start = psutil.virtual_memory().used
gpu_start = self._get_gpu_usage()
start_time = time.time()
# Generate response
try:
response = model.generate(task["prompt"], **task.get("parameters", {}))
success = True
except Exception as e:
response = f"Error: {str(e)}"
success = False
# Measure completion
end_time = time.time()
response_time = end_time - start_time
# Calculate TPS
tps = 0
if success and response:
# Estimate tokens (very rough approximation)
tps = len(response.split()) / response_time if response_time > 0 else 0
# Monitor end resources
cpu_end = psutil.cpu_percent(interval=1)
memory_end = psutil.virtual_memory().used
gpu_end = self._get_gpu_usage()
# Calculate averages
cpu_avg = (cpu_start + cpu_end) / 2
memory_avg = (memory_start + memory_end) / 2
gpu_avg = (gpu_start + gpu_end) / 2
# Quality score - this is a placeholder
quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))
result = TestResult(
model_name=model.model_name,
task_name=task["name"],
response_time=response_time,
tps=tps,
cpu_usage=cpu_avg,
memory_usage=memory_avg,
gpu_usage=gpu_avg,
quality_score=quality_score,
raw_output=response,
expected_output=task.get("expected_output", ""),
success=success,
timestamp=time.time()
)
self.results.append(result)
return result
def run_all_tests(self, tasks: List[Dict[str, Any]]):
"""Run all tests across all models"""
for model in self.models:
print(f"Running tests on {model.model_name}")
for task in tasks:
result = self.run_single_test(model, task)
print(f" {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")
def _get_gpu_usage(self) -> float:
"""Get GPU usage if available"""
try:
# This is a placeholder - in real implementation,
# you'd use nvidia-smi or similar
return 0.0
except:
return 0.0
def _calculate_quality_score(self, response: str, expected: str) -> float:
"""Calculate quality score based on response"""
# This is a placeholder implementation
# In practice, this could use static analysis, code execution, etc.
if not response:
return 0.0
# Very basic quality scoring
score = 50 # Base score
# Add points for presence of expected patterns
if expected and expected.lower() in response.lower():
score += 20
# Add points for valid syntax (simplified)
if not response.startswith("Error:") and len(response) > 10:
score += 10
# Cap score at 100
return min(score, 100.0)
def save_results(self, filename: str = None):
"""Save test results to JSON file"""
if filename is None:
filename = f"results_{int(time.time())}.json"
results_data = []
for result in self.results:
results_data.append({
"model_name": result.model_name,
"task_name": result.task_name,
"response_time": result.response_time,
"tps": result.tps,
"cpu_usage": result.cpu_usage,
"memory_usage": result.memory_usage,
"gpu_usage": result.gpu_usage,
"quality_score": result.quality_score,
"raw_output": result.raw_output,
"expected_output": result.expected_output,
"success": result.success,
"timestamp": result.timestamp
})
with open(self.output_dir / filename, 'w') as f:
json.dump(results_data, f, indent=2)
print(f"Results saved to {self.output_dir / filename}")
# Example model implementations
class MockLLM(LLMInterface):
"""Mock LLM for testing purposes"""
def generate(self, prompt: str, **kwargs) -> str:
# Simulate generation time
time.sleep(0.5)
return f"Mock response for: {prompt}"
def get_model_info(self) -> Dict[str, Any]:
return {
"name": self.model_name,
"type": "mock",
"version": "1.0"
}
class ExampleTask:
"""Example task class for testing"""
@staticmethod
def get_sample_tasks() -> List[Dict[str, Any]]:
return [
{
"name": "Simple function",
"prompt": "Write a Python function to add two numbers",
"expected_output": "return a + b",
"parameters": {}
},
{
"name": "Loop example",
"prompt": "Write a Python loop that prints numbers 1 to 10",
"expected_output": "for i in range(1, 11)",
"parameters": {}
}
]
if __name__ == "__main__":
# Create test suite
suite = TestSuite()
# Add mock models
suite.add_model(MockLLM("mock1", {}))
suite.add_model(MockLLM("mock2", {}))
# Run sample tests
tasks = ExampleTask.get_sample_tasks()
suite.run_all_tests(tasks)
# Save results
suite.save_results()
print("Testing completed!")

View File

@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework (Simplified Version)
"""
import time
import json
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path
@dataclass
class TestResult:
"""Container for test results"""
model_name: str
task_name: str
response_time: float
tps: float
cpu_usage: float
memory_usage: float
quality_score: float
raw_output: str
expected_output: str
success: bool
timestamp: float
@dataclass
class PerformanceMetrics:
"""Container for performance metrics"""
response_time: float
tps: float
cpu_avg: float
memory_avg: float
class LLMInterface(ABC):
"""Abstract base class for LLM interfaces"""
def __init__(self, model_name: str, config: Dict[str, Any]):
self.model_name = model_name
self.config = config
@abstractmethod
def generate(self, prompt: str, **kwargs) -> str:
"""Generate text from prompt"""
pass
@abstractmethod
def get_model_info(self) -> Dict[str, Any]:
"""Get model information"""
pass
class TestSuite:
"""Main test suite class"""
def __init__(self, output_dir: str = "test_results"):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
self.results: List[TestResult] = []
self.models: List[LLMInterface] = []
def add_model(self, model: LLMInterface):
"""Add a model to test"""
self.models.append(model)
def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
"""Run a single test on a model"""
# Start timing
start_time = time.time()
# Simulate resource usage
cpu_start = 50 # Percentage (simulated)
memory_start = 1024 # MB (simulated)
# Generate response
try:
response = model.generate(task["prompt"], **task.get("parameters", {}))
success = True
except Exception as e:
response = f"Error: {str(e)}"
success = False
# Measure completion
end_time = time.time()
response_time = end_time - start_time
# Calculate TPS (simulated)
tps = 0
if success and response:
# Estimate tokens (very rough approximation)
tps = len(response.split()) / response_time if response_time > 0 else 0
# Simulate end resources
cpu_end = 60 # Percentage (simulated)
memory_end = 1536 # MB (simulated)
# Calculate averages
cpu_avg = (cpu_start + cpu_end) / 2
memory_avg = (memory_start + memory_end) / 2
# Quality score - this is a placeholder
quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))
result = TestResult(
model_name=model.model_name,
task_name=task["name"],
response_time=response_time,
tps=tps,
cpu_usage=cpu_avg,
memory_usage=memory_avg,
quality_score=quality_score,
raw_output=response,
expected_output=task.get("expected_output", ""),
success=success,
timestamp=time.time()
)
self.results.append(result)
return result
def run_all_tests(self, tasks: List[Dict[str, Any]]):
"""Run all tests across all models"""
for model in self.models:
print(f"Running tests on {model.model_name}")
for task in tasks:
result = self.run_single_test(model, task)
print(f" {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")
def _calculate_quality_score(self, response: str, expected: str) -> float:
"""Calculate quality score based on response"""
# This is a placeholder implementation
# In practice, this could use static analysis, code execution, etc.
if not response:
return 0.0
# Very basic quality scoring
score = 50 # Base score
# Add points for presence of expected patterns
if expected and expected.lower() in response.lower():
score += 20
# Add points for valid syntax (simplified)
if not response.startswith("Error:") and len(response) > 10:
score += 10
# Cap score at 100
return min(score, 100.0)
    def save_results(self, filename: Optional[str] = None):
"""Save test results to JSON file"""
if filename is None:
filename = f"results_{int(time.time())}.json"
results_data = []
for result in self.results:
results_data.append({
"model_name": result.model_name,
"task_name": result.task_name,
"response_time": result.response_time,
"tps": result.tps,
"cpu_usage": result.cpu_usage,
"memory_usage": result.memory_usage,
"quality_score": result.quality_score,
"raw_output": result.raw_output,
"expected_output": result.expected_output,
"success": result.success,
"timestamp": result.timestamp
})
with open(self.output_dir / filename, 'w') as f:
json.dump(results_data, f, indent=2)
print(f"Results saved to {self.output_dir / filename}")
# Example model implementations
class MockLLM(LLMInterface):
"""Mock LLM for testing purposes"""
def generate(self, prompt: str, **kwargs) -> str:
# Simulate generation time
time.sleep(0.1)
return f"Mock response for: {prompt}"
def get_model_info(self) -> Dict[str, Any]:
return {
"name": self.model_name,
"type": "mock",
"version": "1.0"
}
class ExampleTask:
"""Example task class for testing"""
@staticmethod
def get_sample_tasks() -> List[Dict[str, Any]]:
return [
{
"name": "Simple function",
"prompt": "Write a Python function to add two numbers",
"expected_output": "return a + b",
"parameters": {}
},
{
"name": "Loop example",
"prompt": "Write a Python loop that prints numbers 1 to 10",
"expected_output": "for i in range(1, 11)",
"parameters": {}
}
]
if __name__ == "__main__":
# Create test suite
suite = TestSuite()
# Add mock models
suite.add_model(MockLLM("mock1", {}))
suite.add_model(MockLLM("mock2", {}))
# Run sample tests
tasks = ExampleTask.get_sample_tasks()
suite.run_all_tests(tasks)
# Save results
suite.save_results()
print("Testing completed!")

View File

@@ -0,0 +1,3 @@
# LLM Performance Testing Framework Requirements
psutil>=5.8.0

16
whisper-compose.yaml Normal file
View File

@@ -0,0 +1,16 @@
services:
whisper:
image: rhasspy/wyoming-whisper:latest
container_name: wyoming-whisper
restart: unless-stopped
ports:
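      # 10300 is the default Wyoming protocol port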
- "10300:10300"
volumes:
- ./whisper-data:/data
command:
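      # tiny-int8 with beam size 1 favours latency over accuracy;
      # bump to e.g. small-int8 / a larger beam if transcription quality matters more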
- --model
- tiny-int8
- --language
- en
- --beam-size
- "1"