Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.gitignore (vendored, Normal file, 29 lines)
@@ -0,0 +1,29 @@
# macOS
.DS_Store

# Logs
*.log
pyinfra-debug.log

# Python build artefacts
__pycache__/
*.pyc

# Local agent state files (binary, machine-specific)
agentdb.rvf
agentdb.rvf.lock
ruvector.db
**/ruvector.db

# Test outputs
testing/**/test_results/
testing/**/__pycache__/

# Node deps in the OpenCode plugin — install.sh's `npm install` regenerates
opencode/.opencode/plugin/node_modules/

# Phoenix bridge diagnostic log (Mac path, written at runtime)
/tmp/phoenix-bridge.log

# Per-machine Claude Code permissions (each clone gets its own)
.claude/settings.local.json
Roadmap.md (Normal file, 263 lines)
@@ -0,0 +1,263 @@
# Local GenAI roadmap

What to build on top of the Framework Desktop / Strix Halo stack to
narrow the functional gap with Claude Code + Anthropic's broader
ecosystem (Cowork, Skills, Design). Captured 2026-05-07; revisit
quarterly.

## Where we stand

- Containerized inference at `/srv/docker/{llama,vllm,ollama}`
- Models on the unified-memory iGPU at 64 GB UMA
- OpenCode on the Mac, pointed at the box over Tailscale
- Playwright + SearXNG MCP wired into OpenCode (`localgenai/opencode/`)
- Self-hosted SearXNG instance at `https://searxng.n0n.io`
- Solid floor for single-user terminal coding with Qwen3-Coder-30B-A3B

## The mental model: three layers

The OSS ecosystem stratifies cleanly once you stop thinking about
individual products:

```
┌────────────────────────────────────────────────────────────────┐
│ Layer 3: Management & observability                            │
│ Dispatch tickets to agents, watch spend, track skill           │
│ accumulation. Where Claude Cowork lives. (Multica, Plane.)     │
└────────────────────────────┬───────────────────────────────────┘
                             │ assigns work to ↓
┌────────────────────────────┴───────────────────────────────────┐
│ Layer 2: Orchestration frameworks                              │
│ Code that defines who does what, in what order. Reach for      │
│ these when you want to *encode* a pipeline, not drive one      │
│ interactively. (LangGraph, CrewAI, AutoGen, OpenAI Agents      │
│ SDK.)                                                          │
└────────────────────────────┬───────────────────────────────────┘
                             │ runs ↓
┌────────────────────────────┴───────────────────────────────────┐
│ Layer 1: Harnesses                                             │
│ The thing you actually drive — terminal or IDE agent that      │
│ takes a prompt and uses tools. (OpenCode, Aider, Cline,        │
│ OpenHands, OpenWork, Goose, Pi.)                               │
└────────────────────────────┬───────────────────────────────────┘
                             │ calls ↓
┌────────────────────────────┴───────────────────────────────────┐
│ Layer 0: Inference + tools                                     │
│ Models, MCP servers, embedding stores, voice/vision            │
│ pipelines. (Ollama, vLLM, llama.cpp, MCP catalog.)             │
└────────────────────────────────────────────────────────────────┘
```

Pick what you need from each layer; you don't need anything from a
higher layer if the one below is doing the job.

## Layer 1: Harnesses

The interactive driver's seat. We use **OpenCode**; alternatives worth
having in your back pocket:

- **[Aider](https://aider.chat)** — terminal, git-native, every change
  is a commit. Strong on multi-file refactors. Often outperforms
  OpenCode on small-context local models.
- **[Cline](https://github.com/cline/cline)** — VS Code agent, 5M+
  installs. v3.58 added native subagents and a CI/CD headless mode.
  Closest "Claude Code in the IDE" feel.
- **[OpenHands](https://github.com/All-Hands-AI/OpenHands)** (was
  OpenDevin) — autonomous agent in a Docker sandbox; edits code, runs
  tests, browses. Heavier than OpenCode, more autonomous.
- **[OpenWork](https://supergok.com/openwork-open-source-ai-agent-framework/)**
  — desktop GUI agent built on `deepagentsjs`. Multi-step planning,
  filesystem access, **subagent delegation** baked in. `npx` to start.
- **[Goose](https://github.com/block/goose)** by Block — MCP-native,
  opinionated about toolkits. Strong fit if you're going all-in on MCP.
- **[Pi](https://github.com/bradAGI/pi-mono)** — minimal terminal
  harness; small surface area, easy to read and extend.
- Catalog: <https://github.com/bradAGI/awesome-cli-coding-agents>

When to add another: when you hit a workflow OpenCode doesn't handle
well. Aider for refactors, Cline for IDE-bound work, OpenHands for
autonomous loops, OpenWork for subagent-heavy planning.

## Layer 2: Orchestration frameworks

When you want to *build* a pipeline rather than drive a prebuilt one —
write code that defines agents, tools, and a flow. Overkill for solo
coding; on-target for personal automation pipelines.

- **[LangGraph](https://github.com/langchain-ai/langgraph)** — directed
  graph with conditional edges, built-in checkpointing with time-travel.
  Surpassed CrewAI in stars in early 2026; production-grade.
- **[CrewAI](https://github.com/crewAIInc/crewAI)** — role-based crews;
  agents/tasks/crew in <20 lines of Python. Lowest learning curve,
  ~5× faster than LangGraph in many cases. A2A protocol support.
- **[AutoGen / AG2](https://github.com/microsoft/autogen)** —
  event-driven, async-first; GroupChat coordination with a selector
  deciding who speaks next.
- **[OpenAI Agents SDK](https://github.com/openai/openai-agents-python)** —
  production replacement for Swarm. Explicit agent-to-agent handoffs
  carrying conversation context.
- **[OpenAgents](https://github.com/openagents/openagents)** — only
  framework with native support for **both MCP and A2A** protocols.

All five are model-agnostic — point them at the local Ollama endpoint
or LiteLLM and they work the same as if you were using GPT-5.
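The sequential-handoff pattern they all encode can be sketched framework-free. This toy is purely illustrative (the names are invented, not any of these libraries' APIs); it just shows the shape of "agents in order, each receiving the previous output":

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Toy agent: a name plus a function from input text to output text."""
    name: str
    run: callable

@dataclass
class Pipeline:
    """Runs agents in order, handing each one the previous output."""
    agents: list = field(default_factory=list)

    def run(self, task: str) -> str:
        for agent in self.agents:
            task = agent.run(task)
        return task

# In a real framework, each lambda would be an LLM call with tools.
planner = Agent("planner", lambda t: f"plan({t})")
coder = Agent("coder", lambda t: f"code({t})")
result = Pipeline([planner, coder]).run("add retry logic")
print(result)  # code(plan(add retry logic))
```

What the frameworks add on top of this skeleton is exactly what distinguishes them: conditional edges and checkpointing (LangGraph), role metadata (CrewAI), async group chat (AutoGen), explicit handoff objects (Agents SDK).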

## Layer 3: Management & observability

The Claude Cowork analog layer. Dispatch work, track progress, monitor
spend. The piece I missed in earlier roadmap drafts:

- **[Multica](https://github.com/multica-ai/multica)** — open-source
  platform for managing AI coding agents *as teammates*. File issues,
  assign to agents, they pick up work, write code, report blockers,
  update statuses. Skill accumulation across tasks. Dispatches to
  Claude Code, Codex, Copilot CLI, OpenCode, Pi, Cursor Agent, Hermes,
  Gemini, Kiro, etc. **#1 GitHub TypeScript Trending in April 2026,
  10.7 k stars in three months.** Closest existing OSS analog to
  Cowork; works *above* OpenCode rather than replacing it.
- **[Plane](https://plane.so/)** — AI-native project management with an
  MCP server and full Agent Run lifecycle tracking. Velocity, workload,
  blockers auto-populated. Self-hostable.
- **[Mission Control](https://github.com/builderz-labs/mission-control)** —
  self-hosted dashboard for agent fleets. Token spend per model, trend
  charts, real-time posture scoring. SQLite, zero deps. Useful even
  solo, just to watch your local model's spend pattern.
- **[ClawDeck](https://github.com/builderz-labs/clawdeck)** — focused
  kanban for OpenClaw agents. Smaller scope, more opinionated.
- **[OpenWebUI](https://github.com/open-webui/open-webui)** — multi-user
  chat UI in front of Ollama. Doesn't manage *agent sessions*, but
  covers shared LLM access for a household.

**Still a gap (May 2026):** *live* shared agent sessions — multiple humans
driving one agent on one repo in real time, Cowork's multiplayer mode.
Multica is async ticket flow, not live-collaborative. Watch for
projects that close this.

## Layer 0: Cross-cutting capabilities

Things you wire in once, useful from any harness above.

### Aggregation & routing

- **[LiteLLM](https://github.com/BerriAI/litellm)** — proxy that exposes
  one OpenAI-compatible endpoint over many backends. Rules like "code
  task → local Qwen3, image input → vision model, fast chat → Haiku."
  Worth standing up early so all higher-layer tools just point at one
  URL.
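To make the routing idea concrete, here is the rule above as pure Python. This is illustrative only: the model names are placeholders, and in a real LiteLLM deployment this logic lives in the proxy's config, not in application code.

```python
def pick_model(task: str, has_image: bool = False) -> str:
    """Toy routing rule mirroring the example above.

    Placeholder model names; a LiteLLM proxy would express this
    declaratively and expose one endpoint regardless of the choice.
    """
    if has_image:
        return "ollama/qwen2.5-vl:7b"     # image input -> local vision model
    if "code" in task:
        return "ollama/qwen3-coder"       # code task -> local Qwen3
    return "anthropic/claude-haiku"       # fast chat -> Haiku

print(pick_model("refactor this code"))   # ollama/qwen3-coder
print(pick_model("what's for dinner?"))   # anthropic/claude-haiku
```

The payoff is that every higher-layer tool sees one URL and one model namespace, so swapping the backing model never touches harness configs.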

### Code intelligence / RAG

The local model can't see beyond what fits in its prompt. Claude Code
papers over this with long-context recall + WebFetch; locally we need
explicit retrieval.

- **[Continue.dev](https://continue.dev)** — VS Code / JetBrains plugin
  with a built-in `@codebase` indexer. Pairs with `nomic-embed-text` via
  Ollama for local embeddings. **Biggest single-shot quality lift** for
  local-model coding.
- **[Tabby](https://github.com/TabbyML/tabby)** — self-hosted
  Copilot-style inline completion. A different niche from agentic coding;
  it covers the "as I type" loop OpenCode doesn't.

### Web access — done

Already wired in `localgenai/opencode/`:

- **[@playwright/mcp](https://github.com/microsoft/playwright-mcp)** —
  browser automation. Navigate, click, fill forms, read DOM snapshots.
- **[mcp-searxng](https://github.com/ihor-sokoliuk/mcp-searxng)** —
  web search via your `searxng.n0n.io` instance. No API keys.
- **[mcp-server-fetch](https://github.com/modelcontextprotocol/servers/tree/main/src/fetch)** —
  not yet wired; useful if Playwright feels heavy for plain page reads.

### Persistent memory

OpenCode's memory is per-session unless you wire up MCP for it.

- **`mcp-server-memory`** (official) — knowledge-graph memory across
  any MCP client. Crude, but it works.
- **[mem0](https://github.com/mem0ai/mem0)** — sophisticated memory,
  self-hostable with a local embedding model.

### Multi-modal

- **Vision**: `qwen2.5-vl:7b` or `llama3.2-vision:11b` via Ollama.
  Route via LiteLLM so OpenCode auto-picks vision for image-bearing
  prompts. Closes the screenshot gap.
- **Voice**: `whisper-compose.yaml` and `piper-compose.yaml` are
  already in `localgenai/`. STT in, TTS out — a "talk to your local
  assistant" loop is something Claude Code itself doesn't do.

### Design (genuine new capability)

Anthropic ships **Claude Design** (Anthropic Labs); the OSS analog is
real and active:

- **[nexu-io/open-design](https://github.com/nexu-io/open-design)** —
  local-first OSS alternative to Claude Design. 19 Skills, 71
  brand-grade Design Systems; generates web/desktop/mobile prototypes,
  slides, images, videos, with sandboxed preview and HTML/PDF/PPTX/MP4
  export. **Ships a stdio MCP server** exposing design source — tokens,
  JSX components, entry HTML — as a structured API to any MCP-aware
  agent. Pairs naturally with OpenCode + the local model.
- **[Penpot](https://penpot.app)** — full OSS Figma alternative.
  Different scope (interactive design tool, not agentic).
- **[OpenPencil](https://github.com/ZSeven-W/openpencil)** — AI-native
  vector design tool, "design-as-code." Newer; one to track.

### Skills / MCP catalog

- **[Official MCP registry](https://registry.modelcontextprotocol.io/)**
- **[Awesome MCP Servers](https://github.com/punkpeye/awesome-mcp-servers)**
  (1200+ entries)
- **[mcpservers.org](https://mcpservers.org/)** — searchable directory

Build the skills you miss; publish them.

## Prioritized next steps

In rough order of impact per hour spent:

1. **Continue.dev + `nomic-embed-text` in VS Code.** Biggest single
   quality jump for local-model coding, via codebase indexing.
2. **Symlink `localgenai/opencode/opencode.json`** into place so the
   Playwright + SearXNG MCP servers activate. (Already written;
   one `ln -sf` away.)
3. **Stand up LiteLLM** as the aggregation layer in front of Ollama —
   even at one model, the abstraction pays off the moment you add a
   second.
4. **Stand up Multica** for ticket-style work dispatch. Closest Cowork
   analog; useful even solo for tracking what the agent has done across
   sessions.
5. **Add a vision model** (`qwen2.5-vl:7b`) and route image prompts via
   LiteLLM. Closes the screenshot gap.
6. **Stand up OpenDesign** as a compose stack. Connect its MCP server
   to OpenCode. Genuinely new capability versus Claude Code.
7. **Try Aider on a real refactor** — a different agentic loop than
   OpenCode, often better with smaller-context local models.
8. **Wire whisper + piper into a voice loop.** Not a Claude Code parity
   item — a *new* capability the local stack uniquely enables.
9. **Mission Control on the box** for token-spend visibility once
   multi-model routing through LiteLLM is real.
10. **Evaluate OpenHands or Goose** for autonomous agentic loops where
    OpenCode's per-task UX feels too hands-on.
11. **OpenWebUI for household-shared chat access.** Adds value for
    non-coding LLM use; fills a small gap in the multi-user story.

## Reference catalogs to monitor

- Official MCP registry: <https://registry.modelcontextprotocol.io/>
- Awesome CLI coding agents: <https://github.com/bradAGI/awesome-cli-coding-agents>
- Awesome MCP servers: <https://github.com/punkpeye/awesome-mcp-servers>
- mcpservers.org directory: <https://mcpservers.org/>
- Anthropic plugin catalog (parity-tracking): <https://claude.com/plugins>

## Open gaps to keep watching

- **Live shared agent sessions** — Cowork's multiplayer mode (multiple
  humans driving one agent on one repo, real-time). No direct OSS
  equivalent yet; Multica is async ticket flow.
- **Cohesive "skills" library** as polished as Anthropic's plugin
  catalog. The MCP ecosystem has the parts but lacks the curation.
- **AMD ROCm packages for Ubuntu 26.04** — relevant if we ever decide
  to abandon containerization for native ROCm tooling. (Tracked
  separately in `TODO.md`.)
StrixHaloMemory.md (Normal file, 134 lines)
@@ -0,0 +1,134 @@
# Strix Halo memory layout — UMA, GTT, and what fits

The Framework Desktop has 128 GB of LPDDR5x-8000 unified between CPU and
GPU. How much the GPU can actually use is a stack of three knobs:

## The three knobs

1. **Total system RAM** — 128 GB on this box. Set by the SKU.
2. **UMA frame buffer (BIOS, "iGPU memory size")** — a *fixed* carve-out
   reserved for the GPU at boot. The OS can't see or reclaim it. Pinned,
   guaranteed, lowest latency. As of BIOS lfsp0.03.05 (May 2026), the
   max selectable on the Framework Desktop is **64 GB**.
3. **GTT, the Graphics Translation Table (kernel, runtime)** — a *dynamic*
   pool the GPU can borrow from system RAM beyond UMA. The default size on
   recent Linux is roughly half of system RAM. Higher latency than UMA,
   evictable under memory pressure.

So on a 128 GB box with 64 GB UMA: the GPU sees 64 GB of pinned, fast VRAM,
plus up to ~64 GB of slower, borrowable GTT, leaving the OS to share what
GTT isn't currently using.
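As a back-of-envelope sketch of that split, using the numbers above:

```python
# Back-of-envelope GPU memory split for the 128 GB box described above.
total_ram_gb = 128
uma_gb = 64                            # BIOS carve-out, pinned for the GPU
os_visible_gb = total_ram_gb - uma_gb  # what Linux sees at boot
gtt_ceiling_gb = os_visible_gb         # GTT borrows from the OS-visible pool

print(f"pinned VRAM: {uma_gb} GB")
print(f"borrowable GTT (shared with the OS): up to ~{gtt_ceiling_gb} GB")
```

The key point the arithmetic makes visible: UMA and GTT together can reach the full 128 GB on paper, but every GTT page the GPU uses is a page the OS doesn't have.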

## Why the BIOS caps at 64 GB

Not a hardware limit. AMD/Framework deliberately keeps the UMA ceiling
below total RAM so the OS doesn't get starved. Earlier BIOS revisions
were stricter (16 / 24 GB). Newer ones may unlock 96 / 110 GB — worth
re-checking https://knowledgebase.frame.work after firmware updates.

The "128 GB support" on Framework's spec page refers to total system
RAM capacity, **not** the UMA maximum. Distinct knobs.

## What 64 GB UMA actually fits

| Model | Quant | Size | Fits 64 GB? |
|-------|-------|------|-------------|
| Qwen3-Coder-30B-A3B | Q8 | ~32 GB | ✅ huge headroom |
| Llama 3.3 70B (dense) | Q4 | ~42 GB | ✅ |
| GLM-4.5-Air-style 100B-A12B MoE | Q4 | ~60 GB | ✅ tight |
| Kimi-Linear-48B-A3B (when llama.cpp lands) | Q4–Q8 | ~30–50 GB | ✅ |
| Qwen3-235B-A22B | Q3 | ~100 GB | ❌ |
| DeepSeek-V3 / Kimi-K2 (600B+ MoE) | Q4 | 200 GB+ | ❌ |

The ❌ rows wouldn't be enjoyable on this hardware regardless — see
"Bandwidth is the real ceiling" below.
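A quick scripted sanity check for new quants against the carve-out. The ~4 GB headroom for KV cache and runtime buffers is an assumed round figure, not a measured one, so treat borderline results with suspicion:

```python
def fits_uma(weights_gb: float, uma_gb: float = 64,
             headroom_gb: float = 4) -> bool:
    """True if the quantized weights plus an *assumed* headroom for
    KV cache and runtime buffers fit the pinned UMA pool."""
    return weights_gb + headroom_gb <= uma_gb

for name, size_gb in [("Qwen3-Coder-30B-A3B Q8", 32),
                      ("Llama 3.3 70B Q4", 42),
                      ("100B-A12B MoE Q4", 60),
                      ("Qwen3-235B-A22B Q3", 100)]:
    verdict = "fits" if fits_uma(size_gb) else "needs GTT / bigger UMA"
    print(f"{name}: {verdict}")
```

With these numbers the 60 GB MoE lands exactly at the ceiling, which matches the table's "✅ tight."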

## When GTT spillover saves you

If a model is bigger than UMA but ≤ UMA+GTT:

- **Vulkan (llama.cpp `server-vulkan` container)** — happily uses GTT
  for weights. Slower than UMA, but it works.
- **ROCm (vLLM, Ollama:rocm)** — picky; often refuses GTT for weight
  buffers. Effectively capped at UMA.

So if you want to run something bigger than 64 GB and stay performant,
the path is: stick with the Vulkan llama.cpp container and accept the
GTT latency hit, *or* wait for a BIOS that unlocks a larger UMA.

## Two architectural strategies (UMA-first vs GTT-first)

There's a real choice in how you allocate memory between the OS and the GPU:

### Strategy A — UMA-first (current setup)

- BIOS UMA: **64 GB** (the menu max on lfsp0.03.05)
- Kernel `amdgpu.gttsize`: default (~50 % of remaining RAM, so ~32 GB)
- Effective GPU pool: **64 GB pinned + small GTT spillover**
- OS sees: 64 GB
- Pros: lowest-latency GPU memory; the OS can't reclaim it under
  pressure; ROCm-based stacks (vLLM, Ollama:rocm) get the full pool
- Cons: hard ceiling at the BIOS UMA max; the OS lives in 64 GB

### Strategy B — GTT-first (per Gygeek's setup guide)

- BIOS UMA: **512 MB** (the menu min)
- Kernel `amdgpu.gttsize=117760` (~115 GB)
- Effective GPU pool: **~115 GB borrowable from the OS-visible pool**
- OS sees: ~127 GB
- Pros: dynamic — the GPU only takes what it needs; can address far more
  than the BIOS UMA cap; future-proof if you want >64 GB models
- Cons: GTT pages are evictable under memory pressure (avoid running
  heavy CPU workloads alongside inference); ROCm stacks may refuse
  GTT for weight buffers — Vulkan is the safer backend with this
  strategy

### Which to pick

- **Single-user, mostly LLMs, ROCm or Vulkan**: Strategy A is fine.
  64 GB covers everything that runs well on this box's bandwidth (see
  "Bandwidth is the real ceiling" below).
- **Models > 64 GB, or you want headroom**: Strategy B. Trade some
  worst-case latency for the ability to load anything that fits in
  127 GB. Vulkan-first.
- **Mixed workload (Docker compiles + LLMs + dev tools)**: Strategy B
  also helps, because the OS isn't permanently squeezed into 64 GB.

The pyinfra deploy currently sets `amdgpu.gttsize=117760` via GRUB
(strategy B-compatible) but leaves BIOS UMA at whatever you've
configured. If you want full strategy B, drop BIOS UMA to 512 MB at
the next boot. If you stay on strategy A, the kernel param is benign —
GTT just becomes a high ceiling you'll never reach.
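For reference, `amdgpu.gttsize` is specified in MiB, which is where the 117760 figure comes from:

```python
# amdgpu.gttsize takes MiB; 117760 MiB is exactly 115 GiB.
gttsize_mib = 117760
print(gttsize_mib / 1024)   # 115.0

# Going the other way: pick a target in GiB, convert to MiB for GRUB.
target_gib = 115
print(target_gib * 1024)    # 117760
```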

## Bandwidth is the real ceiling, not VRAM

256 GB/s memory bandwidth. Decode tokens/sec ≈ bandwidth ÷ active-params
size. Practical implications:

- **3B active** (e.g. Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B):
  ~50–80 tok/s. Excellent.
- **12B active** (GLM-4.5-Air-style MoEs): ~10–20 tok/s. Usable.
- **35B active** (Qwen3-Coder-480B-A35B): ~3–7 tok/s. Painful.
- **70B dense**: ~2–3 tok/s. Slideshow.
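The rule of thumb, as arithmetic. Treating it as an upper bound at roughly one byte per active parameter (a Q8-ish assumption; the real rate lands below this) roughly reproduces the figures above:

```python
def est_decode_tok_s(active_params_billion: float,
                     bytes_per_param: float = 1.0,  # ~Q8; assumption
                     bandwidth_gb_s: float = 256) -> float:
    """Rough upper bound: each decoded token streams all active
    weights through the memory bus once."""
    return bandwidth_gb_s / (active_params_billion * bytes_per_param)

for label, active in [("3B active", 3), ("12B active", 12),
                      ("35B active", 35), ("70B dense", 70)]:
    print(f"{label}: <= ~{est_decode_tok_s(active):.0f} tok/s")
```

Prompt processing is compute-bound and behaves differently; this estimate is for decode only.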

This is why MoE models with small active-params dominate on this box
even though the box has lots of memory. The 64 GB UMA cap doesn't hurt
much because the bandwidth would have made the bigger-than-64 GB models
unusable anyway.

## Other tunings worth knowing

- **`amd_iommu=off`** kernel param gives ~6 % memory-read improvement
  on Strix Halo (per Gygeek). Wired into pyinfra's GRUB management.
  Reboot required to take effect.
- **TDP** — configurable in BIOS (45–120 W). Gygeek suggests 85 W as a
  noise/perf sweet spot. BIOS-only; can't be automated.

## TL;DR

For most users on this hardware: stay on strategy A (UMA = 64 GB), let
pyinfra apply `amd_iommu=off` + the GTT ceiling, reboot, done. The
hardware's bandwidth makes anything bigger than ~70 GB of weights
impractical anyway, so the 64 GB cap rarely bites.

Switch to strategy B (UMA = 512 MB, GTT-driven) if you specifically
want to load >64 GB models on Vulkan, or if the OS feels squeezed.
StrixHaloSetup.md (Normal file, 112 lines)
@@ -0,0 +1,112 @@
# Strix Halo Setup Plan — Local Coding Model Server

Preparing a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S iGPU, 128 GB unified LPDDR5x) to serve coding models locally, with **Kimi Linear** as the eventual target.

## Target hardware
- **CPU:** Ryzen AI Max+ 395 (16 Zen 5 cores)
- **iGPU:** Radeon 8060S (40 RDNA 3.5 CUs, `gfx1151`)
- **Memory:** 128 GB LPDDR5x-8000, unified CPU/GPU
- **GPU-addressable:** ~96 GB (configurable in BIOS)
- **Memory bandwidth:** ~256 GB/s (the binding constraint for inference)

## OS choice: Ubuntu 24.04 LTS Server
- Headless-first, with the option to add a slim DE (XFCE / LXQt) later.
- AMD's official ROCm support and the Lemonade SDK target Ubuntu LTS — fewest sharp edges for `gfx1151`.
- Skip Ubuntu Desktop (full GNOME fights you on a headless box). Add `xubuntu-core` or `lubuntu-core` only when/if you decide you want a desktop.

---

## Phase 0 — Pre-install

- **Update Framework BIOS / firmware** to the latest before installing anything. Strix Halo firmware updates have meaningfully changed memory bandwidth and iGPU stability; don't chase ghost issues on stale firmware.
- **Set GPU memory carve-out** in BIOS to the maximum UMA frame buffer (typically 96 GB). Leaves 32 GB for the OS + everything else.

## Phase 1 — Base OS

- **Ubuntu Server 24.04 LTS** installed via USB.
- During install: enable the SSH server, import the OpenSSH key from your laptop.
- Post-install: `apt install linux-generic-hwe-24.04` for a newer kernel (ROCm 7.x wants ≥ 6.11).
- Install **Tailscale** for SSH/HTTP tunnel access from anywhere on your tailnet — no exposed ports needed.

**Deliverable:** SSH-accessible headless box, reachable by hostname over Tailscale.

## Phase 2 — Optional slim DE (skip until needed)

- `apt install xubuntu-core` (XFCE) or `lubuntu-core` (LXQt) — both tiny.
- Reach it via **xrdp** or **NoMachine** rather than wiring up a monitor/keyboard. NoMachine is the smoother experience for occasional GUI sessions.

## Phase 3 — GPU stack

- Install **ROCm 7.x** for `gfx1151` per AMD's Ubuntu 24.04 instructions. Verify with `rocminfo` and `rocm-smi`.
- **Mesa RADV** (Vulkan) is already on Ubuntu — verify with `vulkaninfo --summary`.
- Keep **both** backends available. Vulkan currently beats ROCm for MoE decode on Strix Halo, but that may invert as ROCm matures or Kimi Linear lands.

**Deliverable:** Both `rocminfo` and `vulkaninfo` see the 8060S.

## Phase 4 — Inference engines

- **llama.cpp** — build twice into separate binaries: `llama-server-vulkan` and `llama-server-rocm`. Run whichever wins for the model in question. This is the path for Kimi Linear once PR #17592 lands.
- **Ollama** — convenient front for llama.cpp; uses Vulkan/ROCm under the hood.
- **vLLM with ROCm** — install only if you actually need batched serving on an OpenAI-compatible endpoint. Heavier setup; skip until needed.

**Deliverable:** A llama.cpp server that loads a known-good GGUF and serves on `localhost:8080` over an OpenAI-compatible API.
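"OpenAI-compatible" means any plain HTTP client can drive it. A sketch of the request shape, built but deliberately not sent (the model name is a placeholder — a single-model llama.cpp server typically treats that field as informational):

```python
import json
import urllib.request

payload = {
    "model": "local",   # placeholder; informational on a one-model server
    "messages": [{"role": "user", "content": "Write hello-world in Go."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would return OpenAI-style JSON with the
# reply at choices[0].message.content — requires the server to be up.
print(req.full_url)
```

Because the wire format is the standard one, every Phase 7 front-end (Continue.dev, Aider, OpenWebUI) can reuse this same endpoint unchanged.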

## Phase 5 — Storage & model layout

- Big GGUFs add up fast (Kimi Linear Q4_K_M alone is ~30 GB; plan for 4–6 concurrent models).
- Use a dedicated `/models` directory; if the Framework Desktop has a second NVMe slot, put models there.
- Standard layout so Ollama, llama.cpp, and editor configs can share paths:

  ```
  /models/<vendor>/<model>/<quant>.gguf
  ```
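The convention is easy to generate programmatically, which keeps systemd units and tool configs pointing at the same place; a small sketch:

```python
from pathlib import PurePosixPath

def model_path(vendor: str, model: str, quant: str,
               root: str = "/models") -> PurePosixPath:
    """Build the canonical /models/<vendor>/<model>/<quant>.gguf path."""
    return PurePosixPath(root) / vendor / model / f"{quant}.gguf"

print(model_path("moonshotai", "kimi-linear", "Q4_K_M"))
# /models/moonshotai/kimi-linear/Q4_K_M.gguf
```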

## Phase 6 — Daily coding model (use now, swap to Kimi Linear later)

Worth its own decision pass. Strong contenders for a 96 GB carve-out at Q4–Q6 (as of May 2026):
- **Qwen3-Coder**
- **DeepSeek-Coder-V3.x**
- **GLM-4.6**
- **Devstral** (smaller, faster)
- **Kimi-K2** (stay in the Moonshot family while waiting for Linear)

→ Open question: research current Strix Halo benchmarks for these before committing to one.

## Phase 7 — Front-end

All of these speak OpenAI-compatible APIs, so the Phase 4 server is the only thing they need to know about:
- **Continue.dev** or **Cline** (VS Code) pointed at the local server.
- **Aider** for terminal-based pair programming.
- **OpenWebUI** for a ChatGPT-style UI when desired.

## Phase 8 — Monitoring & guardrails

- `nvtop` or `radeontop` for GPU; `btop` for CPU/RAM.
- A small systemd unit to auto-start the llama.cpp server on boot.
- Strix Halo is power-flexible (45–120 W TDP); pick a TDP that matches your noise/perf preference.

## Phase 9 — Kimi Linear readiness checklist

When the weekly routine flips to **READY TO TRY**, the bring-up should be ~30 minutes:

1. `git pull` the llama.cpp checkout (or rebuild from PR #17592 / its merged commit).
2. Download a Q4_K_M or Q6_K GGUF from bartowski / ymcki to `/models/moonshotai/kimi-linear/`.
3. Swap the systemd unit's `--model` path. Restart.

---

## Status tracking

- **Weekly Kimi Linear support check:** routine `trig_01XPweKNZaGTsFjEsF3Ce6HH` — runs Fridays at 13:00 UTC. View at https://claude.ai/code/routines/trig_01XPweKNZaGTsFjEsF3Ce6HH
- **Baseline (2026-05-01):** llama.cpp PR #17592 OPEN/unmerged, CUDA-only validation. GGUFs exist but require a patched build. No Strix Halo (Vulkan/ROCm) reports yet.

## Key references

- [llama.cpp PR #17592 — Kimi Linear support](https://github.com/ggml-org/llama.cpp/pull/17592)
- [llama.cpp Issue #16930 — Kimi Linear feature request](https://github.com/ggml-org/llama.cpp/issues/16930)
- [llama.cpp Discussion #20856 — Known-Good Strix Halo ROCm Stack](https://github.com/ggml-org/llama.cpp/discussions/20856)
- [Strix Halo backend benchmarks (kyuz0)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [strixhalo.wiki llama.cpp performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Lychee-Technology/llama-cpp-for-strix-halo](https://github.com/Lychee-Technology/llama-cpp-for-strix-halo)
- [bartowski Kimi-Linear GGUF](https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF)
- [ymcki Kimi-Linear GGUF (fork required)](https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF)
- [vLLM Kimi-Linear recipe](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-Linear.html)
TODO.md (Normal file, 44 lines)
@@ -0,0 +1,44 @@
# TODO

## ROCm / vLLM on Strix Halo (gfx1151)

The Framework Desktop runs **Ubuntu 26.04 LTS**; AMD only ships ROCm
7.2.3 packages for jammy (22.04) and noble (24.04). We installed the
noble repo but pulled only `rocminfo` + `rocm-smi-lib` for host-side
diagnostics — all heavy ROCm work runs in containers, which ship their
own ROCm stack. This sidesteps the host-side libxml2 ABI mismatch (noble
ships `libxml2.so.2`, 26.04 ships `libxml2.so.16`) that broke the native
HIP toolchain.

### Open questions

- **Does `rocm/vllm:latest` actually run on Strix Halo's iGPU?** vLLM's
  AMD support officially targets datacenter cards (MI300X / gfx942).
  gfx1151 (RDNA 3.5, consumer) is a different ISA. If the stock image
  doesn't initialize the device, try `rocm/vllm-dev:nightly` or build
  from source against ROCm 7.x with `-DAMDGPU_TARGETS=gfx1151`.
- **AMD support for 26.04** — watch https://repo.radeon.com/amdgpu-install/<latest>/ubuntu/
  for a directory matching the box's codename. AMD historically lags
  Ubuntu LTS by 6–12 months for ROCm packaging.

### When 26.04 ROCm packages land

If you ever want to do native ROCm work on the host (rather than via
containers):
1. Bump `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` in `pyinfra/deploy.py`
   to the new release.
2. Update the apt source URL path in `deploy.py` if AMD adds a new
   release codename (currently hardcoded to `noble`).
3. Add a step that runs `amdgpu-install -y --usecase=rocm --no-dkms`
   (the current deploy explicitly avoids this to stay slim).
4. `./run.sh`.

For container-only workflows (the current default), no action is needed —
container images update independently of the host.

## Pick a coding model (StrixHaloSetup Phase 6)

Open question — research current Strix Halo benchmarks before
committing. Candidates: Qwen3-Coder, DeepSeek-Coder-V3.x, GLM-4.6,
Devstral, Kimi-K2. Track Kimi Linear separately via the weekly routine
referenced in `StrixHaloSetup.md`.
opencode/.gitignore (vendored, Normal file, 7 lines)
@@ -0,0 +1,7 @@
# Plugin deps — install.sh's `npm install` repopulates these.
.opencode/plugin/node_modules/

# Diagnostic log written by phoenix-bridge.js (Mac path).
/tmp/phoenix-bridge.log

.DS_Store
opencode/.opencode/plugin/package-lock.json (generated, Normal file, 1685 lines)
File diff suppressed because it is too large
13
opencode/.opencode/plugin/package.json
Normal file
@@ -0,0 +1,13 @@
{
  "name": "localgenai-opencode-plugins",
  "version": "0.1.0",
  "private": true,
  "type": "module",
  "description": "OpenCode plugins for the localgenai stack. Run `npm install` here once.",
  "dependencies": {
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/exporter-trace-otlp-proto": "^0.205.0",
    "@opentelemetry/sdk-node": "^0.205.0",
    "@opentelemetry/sdk-trace-base": "^2.0.0"
  }
}
198
opencode/.opencode/plugin/phoenix-bridge.js
Normal file
@@ -0,0 +1,198 @@
// phoenix-bridge — register an OpenTelemetry SDK so the spans OpenCode
// already emits (when experimental.openTelemetry=true) actually leave
// the process and land in Arize Phoenix on the Framework Desktop.
//
// What it does:
// - Boots @opentelemetry/sdk-node with an OTLP/HTTP exporter pointed at
//   http://framework:6006/v1/traces (override via PHOENIX_OTLP_ENDPOINT).
// - (Planned) Insert @arizeai/openinference-vercel's
//   OpenInferenceSpanProcessor so Phoenix's LLM/tool-aware UI gets
//   richer attributes than vanilla OTel — not yet wired in below.
// - Opens a parent span around each OpenCode session, with sub-sessions
//   (subagents spawned via the Task tool) nested under the parent. This
//   gives Phoenix a unified tree per top-level conversation.
//
// Caveats (May 2026):
// - Subagent OTel context propagation across session boundaries is best
//   effort; OpenCode's experimental.chat.system.transform doesn't expose
//   sessionID yet (issue sst/opencode#6142). When that lands, swap the
//   manual stitching here for proper traceparent injection.
// - Dynamic imports below are deliberate — per OpenCode's plugin guide,
//   direct imports of optional deps freeze the harness if they're absent.
//   Run `npm install` in this directory first.

export const PhoenixBridge = async ({ project, directory, worktree }) => {
  // Direct file logging — OpenCode swallows stdout/stderr from plugin
  // code, so this is how we get visibility into init progress. Tail with:
  //   tail -f /tmp/phoenix-bridge.log
  const { appendFileSync } = await import("node:fs");
  const log = (msg) => {
    try {
      appendFileSync(
        "/tmp/phoenix-bridge.log",
        `[${new Date().toISOString()}] ${msg}\n`,
      );
    } catch (_) {
      /* best effort */
    }
  };
  log("plugin function entered");

  // Phoenix 15.x serves OTLP/HTTP at /v1/traces on the same port as the UI
  // (6006). Earlier versions used a separate 4318 — override here if you
  // ever pin Phoenix < 15.0.
  const endpoint =
    process.env.PHOENIX_OTLP_ENDPOINT || "http://framework:6006/v1/traces";
  const serviceName = process.env.PHOENIX_SERVICE_NAME || "opencode";
  // NodeSDK's `serviceName` constructor option is ignored in some
  // versions; setting OTEL_SERVICE_NAME forces the resource attribute
  // through the standard env-var path. Has to happen before sdk.start().
  if (!process.env.OTEL_SERVICE_NAME) {
    process.env.OTEL_SERVICE_NAME = serviceName;
  }
  log(`endpoint=${endpoint} serviceName=${serviceName}`);

  // Dynamic imports so a missing dep produces a warning, not a freeze.
  let NodeSDK,
    OTLPTraceExporter,
    BatchSpanProcessor,
    SimpleSpanProcessor,
    trace,
    context,
    diag,
    DiagLogLevel;
  try {
    ({ NodeSDK } = await import("@opentelemetry/sdk-node"));
    // Use the protobuf-over-HTTP exporter — Phoenix's OTLP receiver
    // only speaks protobuf (returns 415 for the JSON variant from
    // @opentelemetry/exporter-trace-otlp-http).
    ({ OTLPTraceExporter } = await import(
      "@opentelemetry/exporter-trace-otlp-proto"
    ));
    ({ BatchSpanProcessor, SimpleSpanProcessor } = await import(
      "@opentelemetry/sdk-trace-base"
    ));
    ({ trace, context, diag, DiagLogLevel } = await import(
      "@opentelemetry/api"
    ));
    log("OTel imports resolved");
  } catch (err) {
    console.warn(
      "[phoenix-bridge] OTel deps not installed — skipping. " +
        "Run `npm install` in .opencode/plugin/ to enable.",
      err && err.message,
    );
    return {};
  }

  // Pipe OTel's internal diagnostics into our log so export errors are
  // visible. Without this, BatchSpanProcessor swallows exporter failures.
  // Default to WARN — bump to DEBUG via PHOENIX_OTEL_DEBUG=1 when chasing
  // a regression.
  const otelLevel =
    process.env.PHOENIX_OTEL_DEBUG === "1"
      ? DiagLogLevel.DEBUG
      : DiagLogLevel.WARN;
  diag.setLogger(
    {
      verbose: (m) => log(`[otel:verbose] ${m}`),
      debug: (m) => log(`[otel:debug] ${m}`),
      info: (m) => log(`[otel:info] ${m}`),
      warn: (m) => log(`[otel:warn] ${m}`),
      error: (m) => log(`[otel:error] ${m}`),
    },
    otelLevel,
  );

  // NodeSDK accepts `serviceName` directly, sidestepping the Resource API
  // (which broke between @opentelemetry/resources v1.x and v2.x).
  // SimpleSpanProcessor (vs BatchSpanProcessor) exports each span
  // immediately — easier to debug while we sort out the pipeline.
  const exporter = new OTLPTraceExporter({ url: endpoint });
  const sdk = new NodeSDK({
    serviceName,
    spanProcessors: [new SimpleSpanProcessor(exporter)],
  });
  sdk.start();
  log("sdk.start() returned");

  const tracer = trace.getTracer("opencode.phoenix-bridge");
  log("tracer obtained");

  // Smoke-test span. If this one doesn't show up in Phoenix, the export
  // path is broken and there's no point chasing missing session spans.
  // Should appear under service "opencode" within ~5 s of OpenCode start.
  const bootSpan = tracer.startSpan("phoenix-bridge.boot", {
    attributes: {
      "opencode.endpoint": endpoint,
      "opencode.service_name": serviceName,
    },
  });
  bootSpan.end();
  log("boot span emitted (will flush within ~5s)");
  // sessionId -> { span, ctx }
  const sessions = new Map();

  const closeSession = (sessionId) => {
    const rec = sessions.get(sessionId);
    if (!rec) return;
    rec.span.end();
    sessions.delete(sessionId);
  };

  const onShutdown = async () => {
    for (const id of [...sessions.keys()]) closeSession(id);
    try {
      await sdk.shutdown();
    } catch (_) {
      /* ignore */
    }
  };
  process.once("beforeExit", onShutdown);
  process.once("SIGINT", onShutdown);
  process.once("SIGTERM", onShutdown);

  return {
    event: async ({ event }) => {
      if (!event || !event.type) return;

      switch (event.type) {
        case "session.created": {
          const session = event.properties?.session || event.properties || {};
          const id = session.id;
          const parentID = session.parentID;
          if (!id) return;

          const parentRec = parentID ? sessions.get(parentID) : null;
          const parentCtx = parentRec ? parentRec.ctx : context.active();
          const span = tracer.startSpan(
            parentID ? `subagent ${id}` : `session ${id}`,
            {
              attributes: {
                "opencode.session.id": id,
                "opencode.session.parent_id": parentID || "",
                "opencode.directory": directory || "",
                "opencode.worktree": worktree || "",
                "opencode.project": project?.name || "",
              },
            },
            parentCtx,
          );
          sessions.set(id, { span, ctx: trace.setSpan(parentCtx, span) });
          break;
        }

        case "session.idle":
        case "session.deleted":
        case "session.compacted": {
          const id =
            event.properties?.session?.id || event.properties?.id || null;
          if (id) closeSession(id);
          break;
        }

        default:
          break;
      }
    },
  };
};
138
opencode/README.md
Normal file
@@ -0,0 +1,138 @@
# opencode setup

Canonical OpenCode config + Phoenix bridge plugin for the localgenai
stack. `install.sh` deploys it to `~/.config/opencode/` on a Mac.

## What's wired up

- **Local model**: `framework/qwen3-coder:30b` served by Ollama on the
  Framework Desktop, reachable over Tailscale.
- **Playwright MCP** ([@playwright/mcp](https://github.com/microsoft/playwright-mcp)) —
  browser automation. The model can navigate pages, click, fill forms,
  and read DOM snapshots. Closes the agentic-browsing gap.
- **SearXNG MCP** ([mcp-searxng](https://github.com/ihor-sokoliuk/mcp-searxng)) —
  web search via your self-hosted instance at <https://searxng.n0n.io>.
  No external API keys, no rate-limit roulette.
- **Phoenix bridge plugin** (`.opencode/plugin/phoenix-bridge.js`) —
  exports OpenTelemetry spans for every LLM call, tool call, and
  subagent invocation to the Phoenix container running on the Framework
  Desktop. Per-prompt waterfall / flamegraph viz at
  <http://framework:6006>.

## Setup

```sh
./install.sh
```

Idempotent — re-run after editing `opencode.json` or pulling changes to
the plugin. Each step checks before doing work. Specifically:

1. Verifies Homebrew is present (won't install it for you)
2. `brew install node uv jq sst/tap/opencode` (skips if already at latest)
3. Pre-caches Playwright's chromium so the first MCP call is instant
4. `npm install` in `.opencode/plugin/` for the Phoenix bridge OTel deps
5. Generates `~/.config/opencode/opencode.json` from the repo's
   `opencode.json`, rewriting relative plugin paths to absolute so
   OpenCode loads the plugin regardless of which directory it's launched
   from

Step 5 is the reason the deployed config isn't a plain symlink. The
repo's `opencode.json` uses a relative plugin path (`./...`) so it stays
valid in place; the deployed copy is generated with that path resolved
to an absolute one. Edits to the repo's `opencode.json` need a re-run
of `./install.sh` to take effect.

## Verify

```sh
# Local model reachable
curl -s http://framework:11434/v1/models | jq '.data[].id'

# SearXNG instance answers JSON
curl -s 'https://searxng.n0n.io/search?q=test&format=json' | jq '.results | length'
```

Then in opencode:

```
opencode
> /mcp          # should list playwright and searxng as connected
> search the web for "qwen3-coder benchmarks"
> open https://example.com and tell me the H1
```

## Phoenix tracing

The plugin at `.opencode/plugin/phoenix-bridge.js` boots an OpenTelemetry
SDK on OpenCode startup and ships every span to Phoenix on the Framework
Desktop. With `experimental.openTelemetry: true` (already set in
`opencode.json`), OpenCode emits Vercel AI SDK spans that Phoenix renders
as a per-turn waterfall: user prompt → main agent's `ai.streamText` →
each tool call (built-in + MCP) with token counts and latencies inline.

The plugin uses `@opentelemetry/exporter-trace-otlp-proto` (not `-http`)
because Phoenix's OTLP receiver only speaks protobuf — the JSON variant
returns 415.

Defaults can be overridden via env vars (set before launching opencode):

| Variable | Default | Purpose |
|---|---|---|
| `PHOENIX_OTLP_ENDPOINT` | `http://framework:6006/v1/traces` | OTLP/HTTP target |
| `PHOENIX_SERVICE_NAME` | `opencode` | Phoenix project name |
| `PHOENIX_OTEL_DEBUG` | unset | `1` to surface OTel internal logs |

### Verifying

```sh
: > /tmp/phoenix-bridge.log   # truncate prior runs
opencode                      # any directory; CWD doesn't matter
tail -f /tmp/phoenix-bridge.log
```

Healthy startup looks like:
```
plugin function entered
endpoint=http://framework:6006/v1/traces serviceName=opencode
OTel imports resolved
sdk.start() returned
tracer obtained
boot span emitted (will flush within ~5s)
```

Then open <http://framework:6006/projects> — an `opencode` project should
appear with at least one `phoenix-bridge.boot` span. Send a prompt in
OpenCode and real LLM-call traces follow.

If the plugin's deps aren't installed, OpenCode logs a warning and the
plugin no-ops — the rest of OpenCode still works fine.

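The same log check can be scripted. A small sketch (the marker string is the one the plugin writes on a healthy boot; the path argument exists only so the function is testable against an arbitrary file):

```shell
# bridge_ok: succeeds if the Phoenix bridge got through startup, i.e. the
# boot-span line made it into the log. Defaults to the real log path.
bridge_ok() {
  grep -q 'boot span emitted' "${1:-/tmp/phoenix-bridge.log}" 2>/dev/null
}

if bridge_ok; then
  echo "bridge ok"
else
  echo "bridge broken - re-run npm install and check the endpoint"
fi
```

Handy as a one-liner before blaming Phoenix for missing traces.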
### Known limitations

- **Subagent nesting is best-effort.** The plugin opens a parent span
  per session and tries to stitch child sessions (Task-tool subagents)
  under their parent, but Vercel AI SDK spans live in their own OTel
  trace context. Until [sst/opencode#6142](https://github.com/sst/opencode/issues/6142)
  exposes `sessionID` in the `chat.system.transform` hook, child-session
  spans may show as separate traces in Phoenix.
- **Console output from plugins is swallowed by OpenCode's TUI.** That's
  why init progress goes to `/tmp/phoenix-bridge.log` rather than stdout.

## Notes

- **SearXNG JSON output** must be enabled on the instance for the MCP
  server to work. If `format=json` returns HTML or 403, edit
  `settings.yml` on the SearXNG box (`search.formats: [html, json]`)
  and restart.
- **Playwright first-run** downloads ~200 MB of browser binaries into
  `~/Library/Caches/ms-playwright/`. Subsequent runs are instant.
- **Tool-calling reliability** with Qwen3-Coder is decent but not
  Claude-grade. If a tool call hangs or returns malformed JSON, the
  model is the culprit, not the MCP. Worth trying the same prompt
  against a hosted Claude or GPT-5 to confirm before debugging the
  server.
- **Adding more MCP servers**: drop another entry under the `mcp` key
  using the same `type/command/enabled` shape. The
  [official MCP registry](https://registry.modelcontextprotocol.io/)
  and [Awesome MCP Servers](https://mcpservers.org/) catalog options.
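For example, a new stdio server entry follows the same shape as the existing `playwright` and `searxng` blocks (the server name and package `@example/some-mcp` below are placeholders, not a real server):

```json
{
  "mcp": {
    "some-server": {
      "type": "local",
      "command": ["npx", "-y", "@example/some-mcp"],
      "enabled": true,
      "environment": {
        "SOME_MCP_OPTION": "value"
      }
    }
  }
}
```

Merge the new key into the existing `mcp` object in `opencode.json`, then re-run `./install.sh` so the deployed copy picks it up.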
145
opencode/install.sh
Executable file
@@ -0,0 +1,145 @@
#!/usr/bin/env bash
# Bootstrap or re-sync the OpenCode harness on this Mac.
#
# Idempotent — re-run after pulling changes to opencode.json or the
# Phoenix bridge plugin. Each step checks before doing work.
#
# What this does:
#   1. Verify Homebrew is present
#   2. Install node, uv, opencode, jq (skips if already at latest)
#   3. Pre-cache Playwright's chromium so the first MCP call is instant
#   4. Install the Phoenix bridge plugin's OTel deps
#   5. Generate ~/.config/opencode/opencode.json from the repo's
#      opencode.json with relative plugin paths rewritten to absolute,
#      so opencode loads the plugin regardless of where it's launched.
#
# Usage: ./install.sh   (from this directory)

set -euo pipefail

cd "$(dirname "$0")"
HERE="$(pwd)"

# --- Pretty printing ---------------------------------------------------------
bold() { printf '\033[1m%s\033[0m\n' "$*"; }
ok()   { printf '  \033[32m✓\033[0m %s\n' "$*"; }
info() { printf '  → %s\n' "$*"; }
warn() { printf '  \033[33m!\033[0m %s\n' "$*"; }
fail() { printf '  \033[31m✗\033[0m %s\n' "$*"; exit 1; }

# --- 1. Homebrew -------------------------------------------------------------
bold "[1/5] Homebrew"
if ! command -v brew >/dev/null 2>&1; then
  fail "brew not found. Install from https://brew.sh, then re-run."
fi
ok "brew $(brew --version | head -1 | awk '{print $2}')"

# --- 2. CLI deps -------------------------------------------------------------
bold "[2/5] CLI dependencies"
brew_install_if_missing() {
  local pkg="$1"
  local bin="${2:-$1}"
  if command -v "$bin" >/dev/null 2>&1; then
    ok "$pkg already installed ($(command -v "$bin"))"
  else
    info "installing $pkg"
    brew install "$pkg"
    ok "$pkg installed"
  fi
}
brew_install_if_missing node node
brew_install_if_missing uv uv
brew_install_if_missing jq jq
# opencode is in a tap; check the binary, not the formula name.
if command -v opencode >/dev/null 2>&1; then
  ok "opencode already installed ($(command -v opencode))"
else
  info "tapping sst/tap and installing opencode"
  brew install sst/tap/opencode
  ok "opencode installed"
fi

# --- 3. Playwright browsers --------------------------------------------------
bold "[3/5] Playwright browser cache"
PW_CACHE="${HOME}/Library/Caches/ms-playwright"
if [[ -d "$PW_CACHE" ]] && find "$PW_CACHE" -name "chrome" -o -name "Chromium*" 2>/dev/null | grep -q .; then
  ok "browsers already cached at $PW_CACHE"
else
  info "downloading chromium (~200 MB) — first run only"
  npx -y @playwright/mcp@latest --help >/dev/null 2>&1 || true
  ok "browsers cached"
fi

# --- 4. Phoenix bridge plugin deps ------------------------------------------
bold "[4/5] Phoenix bridge plugin deps"
if [[ -d ".opencode/plugin/node_modules" && -f ".opencode/plugin/package-lock.json" ]]; then
  # Re-run npm install if package.json is newer than the lockfile, otherwise skip.
  if [[ ".opencode/plugin/package.json" -nt ".opencode/plugin/package-lock.json" ]]; then
    info "package.json is newer than lockfile — running npm install"
    ( cd .opencode/plugin && npm install )
    ok "deps updated"
  else
    ok "deps already installed"
  fi
else
  info "installing OTel deps (one-time, ~40 MB)"
  ( cd .opencode/plugin && npm install )
  ok "deps installed"
fi

# --- 5. Generate ~/.config/opencode/opencode.json ---------------------------
# The repo's opencode.json uses relative plugin paths so it stays valid
# in-place. Rewriting them to absolute paths here makes opencode find the
# plugin regardless of which directory it was launched from. Re-run this
# script after editing opencode.json.
bold "[5/5] Deploy global config"
mkdir -p "${HOME}/.config/opencode"
src="${HERE}/opencode.json"
dst="${HOME}/.config/opencode/opencode.json"

# If the user previously had a symlink from the old install.sh, replace it.
if [[ -L "$dst" ]]; then
  info "removing stale symlink at $dst"
  rm "$dst"
fi
# And the old .opencode dir symlink — no longer needed now that plugin
# paths are absolute.
if [[ -L "${HOME}/.config/opencode/.opencode" ]]; then
  info "removing stale ~/.config/opencode/.opencode symlink"
  rm "${HOME}/.config/opencode/.opencode"
fi

# Rewrite any relative plugin path (./foo, ../foo) to an absolute path
# rooted at this directory. Absolute paths and npm-package refs pass
# through untouched.
jq --arg here "$HERE" '
  .plugin = (
    (.plugin // [])
    | map(
        if type == "string" and (startswith("./") or startswith("../"))
        then ($here + "/" + ltrimstr("./") | gsub("/\\./"; "/"))
        else .
        end
      )
  )
' "$src" > "$dst.tmp"
mv "$dst.tmp" "$dst"
ok "wrote $dst"
info "plugin paths resolved to:"
jq -r '.plugin[]?' "$dst" | sed 's/^/    /'

echo
bold "Done."
cat <<EOF

Re-run this script after editing opencode.json — the deployed copy at
~/.config/opencode/opencode.json is generated, not symlinked.

Next steps:
  - Verify the model server:  curl -s http://framework:11434/v1/models | jq '.data[].id'
  - Verify Phoenix is up:     curl -sf http://framework:6006/
  - Run opencode and send one prompt. A trace should appear at
    http://framework:6006 within a couple of seconds.
  - If nothing appears, check ~/.local/share/opencode/log/*.log for a
    line starting with "[phoenix-bridge]".
EOF
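The step-5 rewrite can be sanity-checked in isolation by feeding a minimal config through the same jq filter install.sh uses (the `--arg here` value below is a made-up example path; requires `jq`):

```shell
echo '{"plugin":["./.opencode/plugin/phoenix-bridge.js"]}' |
jq -c --arg here "/Users/you/localgenai/opencode" '
  .plugin = (
    (.plugin // [])
    | map(
        if type == "string" and (startswith("./") or startswith("../"))
        then ($here + "/" + ltrimstr("./") | gsub("/\\./"; "/"))
        else .
        end
      )
  )
'
# → {"plugin":["/Users/you/localgenai/opencode/.opencode/plugin/phoenix-bridge.js"]}
```

Relative `./` is stripped and the absolute prefix prepended; absolute paths and npm-package names would pass through unchanged.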
41
opencode/opencode.json
Normal file
@@ -0,0 +1,41 @@
{
  "$schema": "https://opencode.ai/config.json",
  "experimental": {
    "openTelemetry": true
  },
  "plugin": ["./.opencode/plugin/phoenix-bridge.js"],
  "provider": {
    "framework": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Framework Desktop (Strix Halo)",
      "options": {
        "baseURL": "http://framework:11434/v1"
      },
      "models": {
        "qwen3-coder:30b": {
          "name": "Qwen3 Coder 30B (local)",
          "limit": {
            "context": 131072,
            "output": 16384
          }
        }
      }
    }
  },
  "mcp": {
    "playwright": {
      "type": "local",
      "command": ["npx", "-y", "@playwright/mcp@latest"],
      "enabled": true
    },
    "searxng": {
      "type": "local",
      "command": ["npx", "-y", "mcp-searxng"],
      "enabled": true,
      "environment": {
        "SEARXNG_URL": "https://searxng.n0n.io"
      }
    }
  },
  "model": "framework/qwen3-coder:30b"
}
12
piper-compose.yaml
Normal file
@@ -0,0 +1,12 @@
services:
  piper:
    image: rhasspy/wyoming-piper:latest
    container_name: wyoming-piper
    restart: unless-stopped
    ports:
      - "10200:10200"
    volumes:
      - ./piper-data:/data
    command:
      - --voice
      - en_US-lessac-medium
82
pyinfra/HowToScan.md
Normal file
@@ -0,0 +1,82 @@
# How to onboard an existing host into pyinfra

There's no auto-generator that reads a running system and emits a
deploy.py — not in pyinfra, not in Ansible, not in any IaC tool.
Reverse-engineering "what's intentional vs incidental" needs human
judgment. Below is the workflow that actually works.

## What pyinfra gives you for free

The **facts** API queries live state without writing any deploy:

```sh
pyinfra @host fact deb.DebPackages          # installed packages
pyinfra @host fact apt.AptSources           # extra apt repos
pyinfra @host fact systemd.SystemdServices  # service state
pyinfra @host fact server.Crontab           # cron
pyinfra @host fact server.Users             # users
```

Output is JSON. There's no fact → op translator, but the data is easy
to grep through and tells you what's actually on the box.

## Practical workflow

1. Run a scanner against the live host that dumps the *meaningful*
   state — packages explicitly installed, enabled services, custom
   configs, dotfiles, sudoers fragments, fstab non-defaults. Skip the
   distro-default noise.
2. Hand-pick the bits worth managing into a new `<station>/deploy.py`.
3. Run with `--dry` against the live box; pyinfra's diff shows what's
   still drifting.
4. Iterate until `--dry` reports "no changes."

## One-shot scanner

```sh
ssh you@host '
  echo "=== manually-installed packages ==="
  apt-mark showmanual
  echo "=== enabled services ==="
  systemctl list-unit-files --state=enabled --type=service --no-legend | awk "{print \$1}"
  echo "=== extra apt repos ==="
  ls /etc/apt/sources.list.d/
  echo "=== sudoers fragments ==="
  ls /etc/sudoers.d/
  echo "=== fstab non-default ==="
  grep -vE "^#|^$|/proc|/sys|/dev/pts" /etc/fstab
  echo "=== users with login shells ==="
  getent passwd | awk -F: "\$3>=1000 && \$3<60000 {print \$1}"
  echo "=== docker stacks ==="
  ls /srv/docker/ 2>/dev/null
  echo "=== home dotfiles ==="
  sudo find /home -maxdepth 3 -name ".*rc" -o -name ".*config" 2>/dev/null
'
```

That gets you ~80% of what's worth managing in roughly 30 lines of
pyinfra. The remaining 20% is case-by-case (kernel cmdline tweaks,
hand-edited /etc files, dotfile contents) that no scanner fully
captures — you write those as you re-discover the box's quirks.

## What scanners can't catch

- The *intent* behind a config (why this `sysctl` value, why this
  cron entry).
- Hand-edited fragments mixed into otherwise-default files (e.g. a
  custom line in `/etc/ssh/sshd_config`).
- Out-of-band state: data in databases, container volumes,
  `/var/lib/<service>`.

For these, treat pyinfra as managing the *bones* (packages, services,
users, configs) and leave the data alone. Re-deploying onto fresh
hardware should give you a working system; restoring data is a
separate concern.

## If "scan and reproduce" is a hard requirement

The native answer is **NixOS**. The whole OS is declarative;
`nixos-generate-config` reads your machine and writes a full config
that recreates it. Different paradigm, big switch — but the right
answer if reproducible-from-a-file is core to your workflow and you
aren't already deep in Ubuntu-land.
12
pyinfra/README.md
Normal file
@@ -0,0 +1,12 @@
# pyinfra

One folder per station. Each subfolder is a self-contained pyinfra
deploy: `inventory.py`, `deploy.py`, `run.sh`, plus any compose files
or assets that ship to the host.

| Station | Host | Notes |
|---------|------|-------|
| [`framework/`](framework/README.md) | `10.0.0.237` | Framework Desktop (Strix Halo, 128 GB) — local LLM box |

To bring up a station, `cd` into its folder and run `./run.sh`. See the
station's own README for prerequisites.
184
pyinfra/framework/README.md
Normal file
@@ -0,0 +1,184 @@
# pyinfra: Strix Halo bring-up

Containerized setup for the Framework Desktop (Ryzen AI Max+ 395, Radeon
8060S, 128 GB). The host stays minimal — kernel + driver + Docker +
diagnostics. Inference engines (llama.cpp, vLLM, Ollama) run as docker
compose services, each shipping its own ROCm/Vulkan stack.

## Manual prerequisites

1. **Phase 0** — update Framework BIOS, set GPU UMA carve-out (96 GB).
2. **OS install** — Ubuntu Server 24.04 LTS. AMD ROCm only ships for
   jammy/noble; later Ubuntus install but break the host-side toolchain
   (libxml2 ABI). Enable SSH, import your laptop key, create user `noise`.
   Recommended partitioning: ≥300 GB on `/`, big disk mounted at `/models`,
   plain ext4 (skip LVM).
3. The host must be reachable at `10.0.0.237` over SSH (edit `inventory.py`
   if it moves).
4. **NOPASSWD sudo for `noise`** — pyinfra's fact layer doesn't reliably
   thread sudo passwords. One-time setup:
   ```sh
   ssh noise@10.0.0.237 'echo "noise ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/noise-nopasswd && sudo chmod 440 /etc/sudoers.d/noise-nopasswd'
   ```

## Run

```sh
uv tool install pyinfra
./run.sh          # equivalent to: pyinfra inventory.py deploy.py
./run.sh --dry    # any extra args are forwarded to pyinfra
```

Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`.

## What the deploy does

- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli
- Tailscale (run `sudo tailscale up` on the box once, interactively)
- Docker engine + compose plugin, user added to `docker` group
- ROCm host diagnostics only (`rocminfo`) — no full toolchain
- GRUB kernel params for Strix Halo perf: `amd_iommu=off`,
  `amdgpu.gttsize=117760` (per [Gygeek/Framework-strix-halo-llm-setup](https://github.com/Gygeek/Framework-strix-halo-llm-setup)).
  Requires a reboot to activate — pyinfra rewrites `/etc/default/grub`
  and runs `update-grub`, but won't reboot for you.
- `/models/<vendor>/` layout
- `/srv/docker/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}/docker-compose.yml`
  dropped in, not auto-started — you edit the model path then run
  `docker compose up -d`. OpenWebUI needs no edits — it's pre-configured
  to find Ollama at `host.docker.internal:11434` and uses `searxng.n0n.io`
  for web search. (Beszel hub at :8090, OpenLIT UI at :3001, Phoenix UI
  at :6006, OpenHands UI at :3030 — see "Monitoring stack" and "Agent
  harnesses" below.)

If a previous run installed the native llama.cpp build / full ROCm /
native Ollama, those are auto-cleaned the next time `./run.sh` runs.

## After the deploy: starting an inference service

```sh
ssh noise@10.0.0.237
sudo tailscale up   # one-time, interactive

# Drop a GGUF somewhere under /models, then:
cd /srv/docker/llama
vim docker-compose.yml   # edit the --model path
docker compose up -d
curl localhost:8080/v1/models   # smoke test
```

Same shape for `vllm` (port 8000) and `ollama` (port 11434, no model edit
needed — Ollama serves models on demand).

## Tunables

Top of `deploy.py`:

- `ROCM_VERSION` and `AMDGPU_INSTALL_DEB` — bump when AMD ships a newer
  release. The .deb filename has a build suffix that doesn't derive from
  the version; find it at https://repo.radeon.com/amdgpu-install/.
- `AMDGPU_TOP_VERSION` — bump when a newer release lands at
  https://github.com/Umio-Yasuno/amdgpu_top/releases.

Compose images in
`compose/{llama,vllm,ollama,openwebui,beszel,openlit,phoenix,openhands}.yml`
— pin tags here.

## Monitoring stack
|
||||
|
||||
Two compose stacks watch the inference services:
|
||||
|
||||
- **Beszel** (`/srv/docker/beszel`, http://framework:8090) — host + Docker
|
||||
container + AMD GPU dashboard. The agent's `amd_sysfs` collector reads
|
||||
`/sys/class/drm/cardN/device/` directly; this is the only path that
|
||||
reports real numbers on Strix Halo (gfx1151) — `amd-smi` returns N/A
|
||||
for util/power/temp on this APU.
|
||||
|
||||
First-time bring-up:
|
||||
```sh
|
||||
cd /srv/docker/beszel
|
||||
docker compose up -d beszel # 1. hub
|
||||
# 2. open http://framework:8090 → create admin → "Add system"
|
||||
# 3. copy the TOKEN and the SSH KEY from the dialog into .env:
|
||||
sudo tee /srv/docker/beszel/.env >/dev/null <<EOF
|
||||
BESZEL_TOKEN=<token-from-dialog>
|
||||
BESZEL_KEY=ssh-ed25519 AAAA…
|
||||
EOF
|
||||
sudo chgrp docker /srv/docker/beszel/.env
|
||||
sudo chmod 640 /srv/docker/beszel/.env
|
||||
docker compose up -d --force-recreate beszel-agent # 4. agent
|
||||
```
|
||||
|
||||
- **OpenLIT** (`/srv/docker/openlit`, http://framework:3001) — fleet
|
||||
metrics for the LLM stack (cost, tokens, latency aggregated across
|
||||
sessions and models). Auto-instruments Ollama and vLLM via
|
||||
OpenTelemetry without app-code changes; for llama.cpp, the compose
|
||||
file already passes `--metrics` so OpenLIT can scrape `/metrics`.
|
||||
ClickHouse-backed. OTLP receivers exposed on the host at **:4327
|
||||
(gRPC) / :4328 (HTTP)** — Phoenix owns 4317/4318.
|
||||
|
||||
Bring-up: `cd /srv/docker/openlit && docker compose up -d`. UI prompts
|
||||
to create an admin account on first load.
|
||||
|
||||
- **Phoenix** (`/srv/docker/phoenix`, http://framework:6006) — per-trace
|
||||
agent waterfall / flamegraph. Designed to answer "show me what one
|
||||
OpenCode turn actually did" — full call tree, nested LLM/tool calls,
|
||||
token counts inline. Single container, SQLite-backed. OTLP receivers
|
||||
on the host at **:4317 (gRPC)** and **:6006/v1/traces (HTTP)** —
|
||||
Phoenix 15.x serves HTTP OTLP on the UI port, not a separate 4318.
|
||||
|
||||
Bring-up: `cd /srv/docker/phoenix && docker compose up -d`. UI
|
||||
immediately available; no auth in the self-hosted single-user path.
|
||||
|
||||
See [`localgenai/opencode/README.md`](../../opencode/README.md) for
|
||||
how OpenCode is configured to ship traces here.
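Other OTel-instrumented processes on the network can point at the same
receivers with the standard OTLP environment variables — a sketch
assuming a stock OTLP/HTTP exporter, with a placeholder service name
(target :4328 instead to send to OpenLIT):

```sh
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://framework:6006/v1/traces"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="my-agent"   # placeholder
```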

The official AMD Prometheus exporters (`amd-smi-exporter`,
`device-metrics-exporter`) are intentionally **not** deployed — they're
broken on gfx1151 (ROCm#6035). If you ever want full Prometheus +
Grafana, the migration path is to replace Beszel with `node_exporter` +
a textfile collector reading `/sys/class/drm/cardN/device/` and
`/sys/class/hwmon/`.
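That textfile collector can be sketched in a few lines of shell — a
hypothetical helper, not part of the deploy. The sysfs file names
(`gpu_busy_percent`, `mem_info_gtt_used`) are ones the amdgpu driver
exposes; the demo runs against a mock directory so the sketch works
anywhere:

```sh
# Emit Prometheus textfile-collector lines from an amdgpu sysfs device dir.
emit_metrics() {
  dev="$1"
  echo "amdgpu_busy_percent $(cat "$dev/gpu_busy_percent")"
  echo "amdgpu_gtt_used_bytes $(cat "$dev/mem_info_gtt_used")"
}

# Demo with mock values; on the real box, point at
# /sys/class/drm/card0/device and redirect the output into
# node_exporter's --collector.textfile.directory.
mock=$(mktemp -d)
echo 37 > "$mock/gpu_busy_percent"
echo 1073741824 > "$mock/mem_info_gtt_used"
emit_metrics "$mock"
```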

## Agent harnesses

Two agent UIs in front of the local model stack — pick one (or both)
based on how you like to drive the agent:

- **OpenWebUI** (`/srv/docker/openwebui`, http://framework:3000) —
  ChatGPT-style web UI. Pre-wired to Ollama + SearXNG web search.
  Best for casual chat and household-shared LLM access.

  Bring-up: `cd /srv/docker/openwebui && docker compose up -d`.

- **OpenHands** (`/srv/docker/openhands`, http://framework:3030,
  loopback-only) — autonomous agent in a Docker sandbox. Spawns a
  per-conversation `agent-server` container that can write code, run
  tests, and browse the web. Pre-configured for Ollama at
  `openai/qwen3-coder:30b` over the OpenAI-compatible endpoint;
  ships traces to Phoenix.

  Bring-up:

  ```sh
  cd /srv/docker/openhands && docker compose up -d
  # Tunnel the loopback-bound UI from your laptop:
  ssh -L 3030:127.0.0.1:3030 noise@framework
  open http://localhost:3030
  ```

  The agent-server image (~2 GB) is pulled lazily on the first
  conversation, not at startup, so the orchestrator comes up fast but
  your first message takes 30–60 s. The pre-0.44 state path was
  `~/.openhands-state`; not relevant on a fresh install.

  Tool-call quality with local models is much better when Ollama's
  context is bumped — the compose at `/srv/docker/ollama` already sets
  `OLLAMA_CONTEXT_LENGTH=65536`, which is plenty.
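The context/KV-cache trade-off behind that 64K choice is easy to
sanity-check with shell arithmetic. The model dimensions here (48
layers, 4 KV heads of head-dim 128, fp16 cache) are illustrative
assumptions for a Qwen3-Coder-class model, not values read from the
deploy:

```sh
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 (fp16)
per_token=$((2 * 48 * 4 * 128 * 2))
echo "$((per_token * 256 * 1024 / 1024 / 1024 / 1024)) GiB at 256K context"
echo "$((per_token * 64 * 1024 / 1024 / 1024 / 1024)) GiB at 64K context"
```

Under those assumptions the 256K default lands in the same ballpark as
the ~25–30 GB observed above, while 64K stays small enough for the
model to remain fully GPU-resident.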

OpenCode (the terminal driver, configured in
[`localgenai/opencode/`](../../opencode/)) runs on the Mac and points
at the same Ollama endpoint, so all three harnesses share the same
model server.

The llama.cpp image is `kyuz0/amd-strix-halo-toolboxes:vulkan-radv`,
gfx1151-optimized with rocWMMA flash attention (the latter only kicks
in on the ROCm tags). Auto-rebuilt against llama.cpp master.
66 pyinfra/framework/compose/beszel.yml Normal file
@@ -0,0 +1,66 @@
# Beszel — host + container + GPU dashboard.
# https://beszel.dev
#
# Picked over Prometheus+Grafana for this box because:
# - The agent's `amd_sysfs` collector reads /sys/class/drm/card*/device/
#   directly, which is the only reliable GPU metric source on Strix Halo
#   (gfx1151). AMD's amd-smi / Device Metrics Exporter return N/A for
#   util/power/temp on this APU (ROCm#6035), so the official Prometheus
#   exporter path is dead.
# - Two containers vs six.
#
# First-time setup (WebSocket connection model — current Beszel default):
#   1. `docker compose up -d beszel` (start the hub)
#   2. Open http://framework:8090, create the admin account
#   3. Click "Add system" — the dialog gives you a TOKEN and an SSH KEY.
#   4. Edit /srv/docker/beszel/.env (created empty by pyinfra; pyinfra
#      doesn't overwrite). Add:
#        BESZEL_TOKEN=<token-from-dialog>
#        BESZEL_KEY=ssh-ed25519 AAAA…
#   5. `docker compose up -d --force-recreate beszel-agent`
#
# Docker Compose auto-reads the sibling .env file for ${VAR} interpolation
# in the environment block below — so secrets stay out of the compose
# file (which pyinfra overwrites) but the env-var names match exactly
# what the agent expects.
#
# Why both TOKEN and KEY: TOKEN identifies which system this agent is,
# KEY authenticates the agent (the SSH key is reused as the auth secret
# in the WebSocket handshake). Rotate either by editing the .env and
# `docker compose up -d --force-recreate`.
services:
  beszel:
    image: henrygd/beszel:latest
    container_name: beszel
    restart: unless-stopped
    ports:
      - "8090:8090"
    volumes:
      - /srv/docker/beszel/data:/beszel_data

  beszel-agent:
    image: henrygd/beszel-agent:latest
    container_name: beszel-agent
    restart: unless-stopped
    # Host networking so the agent sees real CPU/memory/network counters
    # without bridge-NAT distortion.
    network_mode: host
    volumes:
      # Read-only Docker socket for per-container CPU/mem/net.
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Sysfs paths the AMD GPU collector reads.
      - /sys/class/drm:/sys/class/drm:ro
      - /sys/class/hwmon:/sys/class/hwmon:ro
    environment:
      # Pulled from /srv/docker/beszel/.env at compose-parse time.
      TOKEN: "${BESZEL_TOKEN:-}"
      KEY: "${BESZEL_KEY:-}"
      # WebSocket dial-out target — the hub on this same host. The agent
      # is on host networking, so localhost is the host machine, where
      # the hub container exposes port 8090.
      HUB_URL: "http://localhost:8090"
      # Optional fallback: legacy SSH listener for hub-initiated probing.
      # Harmless to keep — hub only uses it if WebSocket is unreachable.
      LISTEN: "45876"
      # Enable the AMD sysfs GPU collector.
      GPU: "true"
44 pyinfra/framework/compose/llama.yml Normal file
@@ -0,0 +1,44 @@
# llama.cpp server, gfx1151-optimized via kyuz0's Strix Halo toolboxes.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
#
# Tag options on docker.io/kyuz0/amd-strix-halo-toolboxes:
#   vulkan-radv     — most stable, recommended default (this one)
#   vulkan-amdvlk   — alternate Vulkan driver, sometimes faster
#   rocm-7.2.2      — ROCm 7.x; needs /dev/kfd + render group_add (see vllm.yml pattern)
#   rocm-6.4.4      — ROCm 6.x fallback
#   rocm7-nightlies — avoid: caps memory allocation to 64 GB (May 2026)
#
# Toolbox images use a shell entrypoint, so we override to launch
# llama-server directly. Edit the --model path before `docker compose up -d`.
services:
  llama:
    image: kyuz0/amd-strix-halo-toolboxes:vulkan-radv
    container_name: llama
    restart: unless-stopped
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /models:/models:ro
    ports:
      - "8080:8080"
    entrypoint: ["llama-server"]
    command:
      - --model
      - /models/REPLACE/ME/model.gguf
      - --host
      - 0.0.0.0
      - --port
      - "8080"
      - --n-gpu-layers
      - "999"
      - --ctx-size
      - "32768"
      # Required for GPU backends on Strix Halo per Gygeek's setup
      # guide. Forces full load into GPU memory rather than mmap.
      - --no-mmap
      # Flash attention — works on Vulkan too; the big win is on the
      # ROCm tag where kyuz0's build has rocWMMA acceleration.
      - --flash-attn
      # Expose Prometheus metrics at /metrics — scraped by OpenLIT for
      # tokens/sec, KV-cache use, queue depth, and request latency.
      - --metrics
38 pyinfra/framework/compose/ollama.yml Normal file
@@ -0,0 +1,38 @@
# Ollama, ROCm backend. Serves models on demand — safe to start before
# you've put anything in /models.
#
# Storage: Ollama's content-addressed blob store is bind-mounted under
# /models/ollama so all model data on the host lives under /models.
# Note: Ollama's blobs are SHA256-named, not raw GGUFs — llama.cpp/vLLM
# can't load them directly. Keep curated GGUFs at /models/<vendor>/...
# for those engines.
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    # Numeric GIDs of host's video (44) and render (991) groups — names
    # don't exist inside the container, but the GIDs need to match the
    # host so /dev/kfd + /dev/dri are accessible.
    group_add:
      - "44"
      - "991"
    environment:
      # Strix Halo's iGPU is gfx1151 (RDNA 3.5), which Ollama's bundled
      # ROCm runtime doesn't recognize — without this override it falls
      # back to CPU silently. 11.0.0 = gfx1100 (Navi 31); the RDNA 3.x
      # ISAs are close enough that gfx1100 kernels run on gfx1151.
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
      # Default context. 256K (the upstream default for Qwen3-Coder)
      # blows the KV cache up to ~25-30 GB and forces ollama to split
      # layers between GPU and CPU. 64K keeps the model fully on GPU
      # while still being plenty for coding contexts.
      - OLLAMA_CONTEXT_LENGTH=65536
    volumes:
      - /models/ollama:/root/.ollama
      - /models:/models:ro
    ports:
      - "11434:11434"
94 pyinfra/framework/compose/openhands.yml Normal file
@@ -0,0 +1,94 @@
# OpenHands 1.7 (May 2026) — autonomous agent in a Docker sandbox.
# https://docs.openhands.dev — repo: github.com/OpenHands/OpenHands
#
# Architecture: this container is a thin orchestrator. Per conversation
# it spawns a separate `agent-server` container on the host Docker daemon
# (that's what the docker.sock mount is for) and talks to it over REST.
# AGENT_SERVER_IMAGE_TAG below pins the per-session sandbox image.
#
# Complements OpenCode: OpenCode is the interactive terminal driver,
# OpenHands is for autonomous loops (write code, run tests, browse the
# web in a sandbox, report back).
services:
  openhands:
    # Org rebranded All-Hands-AI → OpenHands at v1.0 (Dec 2025); the old
    # docker.all-hands.dev/all-hands-ai/openhands image is gone.
    image: docker.openhands.dev/openhands/openhands:1.7
    container_name: openhands
    restart: unless-stopped

    # 3030 host-side because :3000 is OpenWebUI and :3001 is OpenLIT.
    # Loopback-only — reach via SSH tunnel or Tailscale, don't expose
    # this directly.
    ports:
      - "127.0.0.1:3030:3000"

    volumes:
      # Required: orchestrator spawns sandbox containers via the host daemon.
      - /var/run/docker.sock:/var/run/docker.sock
      # State, settings, conversation history, MCP config, secrets.
      # Pre-0.44 used ~/.openhands-state — N/A on a fresh install.
      - /srv/docker/openhands/state:/.openhands
      # Workspace the sandbox reads/writes. The host path on the LEFT must
      # match SANDBOX_VOLUMES below — the sandbox container is spawned by
      # the host daemon, so its bind mount is resolved on the host, not
      # via this container's filesystem.
      - /srv/docker/openhands/workspace:/srv/docker/openhands/workspace

    # Linux Docker doesn't auto-provide host.docker.internal; this fixes it.
    extra_hosts:
      - "host.docker.internal:host-gateway"

    environment:
      # ---- Sandbox / agent-server image pin ----
      # Replaces the V0.x SANDBOX_RUNTIME_CONTAINER_IMAGE. 1.19.1-python is
      # the agent-server tag the 1.7 main image expects; bumping the main
      # image will likely want a newer agent-server tag — check the
      # upstream docker-compose.yml on each upgrade.
      AGENT_SERVER_IMAGE_REPOSITORY: ghcr.io/openhands/agent-server
      AGENT_SERVER_IMAGE_TAG: 1.19.1-python

      # ---- Workspace mount into the per-session sandbox ----
      # SANDBOX_VOLUMES is the V1 replacement for the deprecated
      # WORKSPACE_BASE / WORKSPACE_MOUNT_PATH variables.
      SANDBOX_VOLUMES: /srv/docker/openhands/workspace:/workspace:rw
      # Match the host's `noise` UID so files the agent writes aren't
      # owned by root.
      SANDBOX_USER_ID: "1000"

      # ---- LLM: host Ollama via OpenAI-compatible endpoint ----
      # Per the official local-llms doc, the recommended path is the
      # /v1 OpenAI-compatible endpoint with the `openai/` LiteLLM prefix
      # — NOT `ollama/...`, which has worse tool-call behaviour.
      LLM_MODEL: "openai/qwen3-coder:30b"
      LLM_BASE_URL: "http://host.docker.internal:11434/v1"
      LLM_API_KEY: "ollama"  # any non-empty string; Ollama doesn't auth.

      # Default tool-calling renderer mismatches Qwen3-Coder's training
      # format and produces malformed calls (issue #8140). Forcing false
      # falls back to OpenHands' prompt-based protocol — costs some token
      # efficiency, gains reliability with local models.
      LLM_NATIVE_TOOL_CALLING: "false"

      LOG_ALL_EVENTS: "true"

      # ---- Optional: ship traces to Phoenix over OTLP/HTTP ----
      # OpenHands V1 uses LiteLLM + OpenTelemetry; standard OTLP env vars
      # are honoured. Phoenix 15.x serves HTTP OTLP on the UI port (6006),
      # so the exporter appends /v1/traces to this base URL. Comment out
      # to disable.
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://host.docker.internal:6006"
      OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
      OTEL_SERVICE_NAME: "openhands"

    # Per-session agent-server containers spawn headless chromium for
    # browser tasks; default 64 MB shm causes silent crashes.
    shm_size: "2gb"

    # Playwright/chromium needs higher fd limits than Docker's default.
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

# Bridge networking is correct here. Don't switch to network_mode: host
# — the spawned sandbox containers reach this orchestrator via Docker
# bridge DNS, which only works on a bridge network.
59 pyinfra/framework/compose/openlit.yml Normal file
@@ -0,0 +1,59 @@
# OpenLIT — LLM observability (traces, costs, KV-cache, prompt/decode
# latencies, tokens/sec). https://openlit.io
#
# Two services:
#   - clickhouse : columnar store for traces (internal only, no host port)
#   - openlit    : Next.js UI on :3001 (3000 is OpenWebUI)
#
# Why OpenLIT vs Langfuse/Phoenix/Laminar: it's the only OSS dashboard
# (May 2026) that auto-instruments Ollama AND vLLM via OpenTelemetry
# without adding code to client apps. For llama.cpp, start the server
# with --metrics (see ../llama/docker-compose.yml) and OpenLIT can scrape
# /metrics.
#
# To send traces from a Python script calling Ollama/vLLM (note the
# remapped host port — 4328, not 4318; see the ports block below):
#   pip install openlit
#   python -c "import openlit; openlit.init(otlp_endpoint='http://framework:4328')"
#
# To wire OpenWebUI → OpenLIT, install OpenLIT's pipeline middleware
# in OpenWebUI per https://openlit.io/blogs/openlit-openwebui.
services:
  clickhouse:
    image: clickhouse/clickhouse-server:25.3-alpine
    container_name: openlit-clickhouse
    restart: unless-stopped
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: OPENLIT
      CLICKHOUSE_DB: openlit
    volumes:
      - /srv/docker/openlit/clickhouse:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  openlit:
    image: ghcr.io/openlit/openlit:latest
    container_name: openlit
    restart: unless-stopped
    depends_on:
      - clickhouse
    ports:
      # Host:container — UI on 3001 (OpenWebUI owns 3000).
      - "3001:3000"
      # OTLP receivers exposed on the host so SDKs running off-box can
      # ship traces here. gRPC + HTTP. Remapped (4327/4328 → 4317/4318)
      # because Phoenix owns the canonical 4317/4318 ports for OpenCode
      # traces — OpenLIT here is a secondary/fleet-metrics destination.
      - "4327:4317"
      - "4328:4318"
    environment:
      INIT_DB_HOST: clickhouse
      INIT_DB_PORT: "8123"
      INIT_DB_USERNAME: default
      INIT_DB_PASSWORD: OPENLIT
      INIT_DB_DATABASE: openlit
      SQLITE_DATABASE_URL: file:/app/client/data/data.db
    volumes:
      - /srv/docker/openlit/data:/app/client/data
25 pyinfra/framework/compose/openwebui.yml Normal file
@@ -0,0 +1,25 @@
# OpenWebUI — ChatGPT-like web UI in front of Ollama. Pre-configured to
# use the host's Ollama instance and the project's SearXNG for web
# search. Default port 3000.
#
# Persistent state (users, conversations, uploaded docs, RAG vector
# index) lives at /srv/docker/openwebui/data so backups touch one path.
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    ports:
      - "3000:8080"
    extra_hosts:
      # Lets the container reach Ollama on the host's :11434 without
      # needing to share Docker networks.
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      # Built-in web search via the project's SearXNG instance.
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_WEB_SEARCH_ENGINE=searxng
      - SEARXNG_QUERY_URL=https://searxng.n0n.io/search?q=<query>&format=json
    volumes:
      - /srv/docker/openwebui/data:/app/backend/data
35 pyinfra/framework/compose/phoenix.yml Normal file
@@ -0,0 +1,35 @@
# Arize Phoenix — per-trace agent waterfall / flamegraph viz.
# https://github.com/Arize-ai/phoenix
#
# Picked over Langfuse for "show me one OpenCode turn as a tree":
#   - Single container vs Langfuse's six (Postgres+ClickHouse+Redis+MinIO+web+worker).
#   - First-class ingestion of Vercel AI SDK spans (which is what OpenCode
#     emits under the hood when experimental.openTelemetry=true).
#   - Best-in-class waterfall + agent-graph view for nested LLM/tool calls.
#
# Complements OpenLIT, doesn't replace it: OpenLIT is the fleet-metrics
# layer (cost / tokens / latency aggregated across sessions). Phoenix is
# the per-prompt debugger (see what one turn actually did).
#
# Bring-up: `docker compose up -d` — no first-run setup needed; UI prompts
# for project name on first trace ingest. Storage is SQLite at /data.
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    container_name: phoenix
    restart: unless-stopped
    ports:
      # UI + OTLP/HTTP both ride on 6006 in Phoenix 15.x — HTTP traces go
      # to http://framework:6006/v1/traces. (Pre-15 had a separate 4318;
      # the consolidation happened in Phoenix v15.0.)
      - "6006:6006"
      # OTLP/gRPC stays separate.
      - "4317:4317"
    environment:
      PHOENIX_WORKING_DIR: /data
      # Phoenix listens on all interfaces by default; explicit for clarity.
      PHOENIX_HOST: 0.0.0.0
      PHOENIX_PORT: "6006"
      PHOENIX_GRPC_PORT: "4317"
    volumes:
      - /srv/docker/phoenix/data:/data
36 pyinfra/framework/compose/vllm.yml Normal file
@@ -0,0 +1,36 @@
# vLLM, ROCm backend.
#
# NOTE: vLLM's official ROCm support targets datacenter cards (MI300X /
# gfx942). Strix Halo is gfx1151 — support varies by image tag and
# release. If `rocm/vllm:latest` doesn't run on this iGPU, try
# `rocm/vllm-dev:nightly` or build from source against ROCm 7.x.
services:
  vllm:
    image: rocm/vllm:latest
    container_name: vllm
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    # Numeric GIDs of host's video (44) and render (991) groups — names
    # don't exist inside the container.
    group_add:
      - "44"
      - "991"
    shm_size: 16g
    ipc: host
    volumes:
      - /models:/models:ro
    ports:
      - "8000:8000"
    command:
      - --model
      - /models/REPLACE/ME
      - --host
      - 0.0.0.0
      - --port
      - "8000"
487 pyinfra/framework/deploy.py Normal file
@@ -0,0 +1,487 @@
"""
pyinfra deploy for Framework Desktop (Ryzen AI Max+ 395 / Strix Halo).

Containerized layout: inference engines (llama.cpp, vLLM, Ollama) run as
docker compose services, each shipping its own ROCm/Vulkan stack. The
host stays minimal — kernel + driver + diagnostics + Docker.

Run with:
    ./run.sh    # equivalent to pyinfra inventory.py deploy.py

Idempotent — re-running only does work that's actually needed. Includes
one-shot cleanup steps for artifacts left over from the earlier native
build (llama.cpp git checkout, symlinks, systemd unit, full ROCm).
"""

from io import StringIO

from pyinfra.operations import apt, files, server, systemd
from pyinfra import host

# --- Tunables ----------------------------------------------------------------

# Latest stable as of 2026-04-30. Verify at https://repo.radeon.com/amdgpu-install/.
ROCM_VERSION = "7.2.3"
# The .deb filename has an opaque build suffix — find the exact name at
# the URL above and paste it here.
AMDGPU_INSTALL_DEB = "amdgpu-install_7.2.3.70203-1_all.deb"

# amdgpu_top — modern GPU monitor that handles new AMD cards / APUs
# (Strix Halo / gfx1151). Replaces radeontop, which doesn't know about
# RDNA 3.5. Verify at https://github.com/Umio-Yasuno/amdgpu_top/releases.
AMDGPU_TOP_VERSION = "0.11.4-1"
AMDGPU_TOP_DEB = f"amdgpu-top_without_gui_{AMDGPU_TOP_VERSION}_amd64.deb"

SSH_USER = host.data.get("ssh_user", "noise")
MODELS_DIR = "/models"
# /srv is the FHS-blessed location for "data and configuration for
# services this system provides." Owned root:docker with setgid +
# group-write so any docker-group member can manage compose stacks.
COMPOSE_DIR = "/srv/docker"

# --- Phase 1 — Base OS basics ------------------------------------------------

apt.update(name="apt update", _sudo=True)

apt.packages(
    name="HWE kernel (>=6.11 for ROCm 7)",
    packages=["linux-generic-hwe-24.04"],
    _sudo=True,
)

# User basics + monitoring tools.
apt.packages(
    name="Base CLI tools",
    # radeontop intentionally omitted — it predates RDNA 3.5 / Strix Halo
    # and just errors with "no VRAM support". amdgpu_top installed below.
    packages=[
        "tmux",
        "vim",
        "htop",
        "btop",
        "nvtop",
        "git",
        "curl",
        "ca-certificates",
        "unzip",
    ],
    _sudo=True,
)

# Vulkan diagnostics on the host (containers ship their own runtime).
apt.packages(
    name="Vulkan host diagnostics",
    packages=["mesa-vulkan-drivers", "vulkan-tools"],
    _sudo=True,
)

# uv (Python package/tool manager).
server.shell(
    name="Install / upgrade uv",
    commands=[
        "curl -LsSf https://astral.sh/uv/install.sh | "
        "env UV_INSTALL_DIR=/usr/local/bin UV_UNMANAGED_INSTALL=1 sh",
    ],
    _sudo=True,
)

# huggingface_hub CLI for `huggingface-cli download <repo> --local-dir ...`
# Lands at ~/.local/bin/huggingface-cli for the SSH user. Other users can
# repeat the command themselves.
server.shell(
    name="Install / upgrade huggingface_hub CLI",
    commands=["uv tool install --upgrade 'huggingface_hub[cli]'"],
    _sudo=True,
    _sudo_user=SSH_USER,
)

# gpakosz/.tmux config (Oh My Tmux!).
TMUX_CONF_DIR = f"/home/{SSH_USER}/.tmux"
server.shell(
    name="Clone gpakosz/.tmux",
    commands=[
        f"test -d {TMUX_CONF_DIR}/.git || "
        f"git clone --depth 1 https://github.com/gpakosz/.tmux.git {TMUX_CONF_DIR}",
    ],
    _sudo=True,
    _sudo_user=SSH_USER,
)
files.link(
    name="Symlink ~/.tmux.conf -> ~/.tmux/.tmux.conf",
    path=f"/home/{SSH_USER}/.tmux.conf",
    target=f"{TMUX_CONF_DIR}/.tmux.conf",
    user=SSH_USER,
    group=SSH_USER,
    _sudo=True,
)
# Seed the user override file only if absent — preserves customizations.
server.shell(
    name="Seed ~/.tmux.conf.local (if missing)",
    commands=[
        f"test -f /home/{SSH_USER}/.tmux.conf.local || "
        f"cp {TMUX_CONF_DIR}/.tmux.conf.local /home/{SSH_USER}/",
    ],
    _sudo=True,
    _sudo_user=SSH_USER,
)

# --- Tailscale ---------------------------------------------------------------

# Tailscale's noble repo works fine on later Ubuntus — the package is
# self-contained Go binaries.
files.download(
    name="Fetch Tailscale apt key",
    src="https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg",
    dest="/usr/share/keyrings/tailscale-archive-keyring.gpg",
    mode="644",
    _sudo=True,
)
files.download(
    name="Fetch Tailscale apt list",
    src="https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list",
    dest="/etc/apt/sources.list.d/tailscale.list",
    mode="644",
    _sudo=True,
)
apt.update(name="apt update (Tailscale)", _sudo=True)
apt.packages(name="Install Tailscale", packages=["tailscale"], _sudo=True)
systemd.service(
    name="Enable tailscaled",
    service="tailscaled",
    running=True,
    enabled=True,
    _sudo=True,
)
# `sudo tailscale up` is interactive (browser auth) — run manually once.

# --- Docker -----------------------------------------------------------------

apt.packages(
    name="Install Docker + compose plugin",
    packages=["docker.io", "docker-compose-v2"],
    _sudo=True,
)
systemd.service(
    name="Enable docker daemon",
    service="docker",
    running=True,
    enabled=True,
    _sudo=True,
)

# lazydocker (TUI for docker). System-wide install via official script.
server.shell(
    name="Install / upgrade lazydocker",
    commands=[
        "curl -fsSL "
        "https://raw.githubusercontent.com/jesseduffield/lazydocker/master/scripts/install_update_linux.sh "
        "| DIR=/usr/local/bin bash",
    ],
    _sudo=True,
)

# --- GPU access (host kernel/driver bits only) ------------------------------

# AMD's amdgpu-install package adds the ROCm apt repo; we use it just to
# get rocminfo for host-side diagnostics. Containers ship their own ROCm.
files.directory(
    name="apt keyring dir",
    path="/etc/apt/keyrings",
    mode="755",
    _sudo=True,
)

amdgpu_deb = f"/tmp/{AMDGPU_INSTALL_DEB}"
amdgpu_url = (
    f"https://repo.radeon.com/amdgpu-install/{ROCM_VERSION}/ubuntu/noble/"
    f"{AMDGPU_INSTALL_DEB}"
)
server.shell(
    name="Fetch amdgpu-install .deb",
    commands=[f"test -f {amdgpu_deb} || curl -fsSL {amdgpu_url} -o {amdgpu_deb}"],
    _sudo=True,
)
server.shell(
    name="Install amdgpu-install package",
    commands=[f"apt install -y {amdgpu_deb}"],
    _sudo=True,
)

# Idempotent cleanup: if a prior run installed the full ROCm userspace
# (~25 GB), tear it down before installing the diagnostic-only subset.
# On a fresh box this is a no-op.
server.shell(
    name="Remove full ROCm install if present",
    commands=[
        "if dpkg -l rocm-dev 2>/dev/null | grep -q '^ii'; then "
        "  amdgpu-install -y --uninstall || true; "
        "fi",
    ],
    _sudo=True,
)
apt.packages(
    name="ROCm host diagnostics (rocminfo)",
    # rocminfo is the stable diagnostic. The SMI tool's package name has
    # churned across ROCm releases (rocm-smi-lib → amd-smi-lib in 7.x);
    # install on demand if you need it.
    packages=["rocminfo"],
    _sudo=True,
)

# amdgpu_top — fetch + install the .deb if not already at the pinned
# version. Replaces radeontop for newer AMD cards (Strix Halo / gfx1151).
amdgpu_top_deb = f"/tmp/{AMDGPU_TOP_DEB}"
amdgpu_top_url = (
    f"https://github.com/Umio-Yasuno/amdgpu_top/releases/download/"
    f"v{AMDGPU_TOP_VERSION.split('-')[0]}/{AMDGPU_TOP_DEB}"
)
server.shell(
    name="Install / upgrade amdgpu_top",
    commands=[
        f"dpkg -s amdgpu-top 2>/dev/null | grep -q '^Version: {AMDGPU_TOP_VERSION}' || "
        f"(test -f {amdgpu_top_deb} || curl -fsSL {amdgpu_top_url} -o {amdgpu_top_deb}; "
        f"apt install -y {amdgpu_top_deb})",
    ],
    _sudo=True,
)

# Group membership for /dev/kfd + /dev/dri access (needed for GPU passthrough
# into containers, and for unprivileged host-side rocminfo).
server.group(name="ensure render group", group="render", _sudo=True)
server.group(name="ensure video group", group="video", _sudo=True)
server.user(
    name="Add login user to render/video/docker",
    user=SSH_USER,
    groups=["render", "video", "docker"],
    append=True,
    _sudo=True,
)

# Kernel cmdline tuning per Gygeek/Framework-strix-halo-llm-setup:
#   - amd_iommu=off — ~6 % memory-read improvement on Strix Halo
#   - amdgpu.gttsize=117760 — ~115 GB GTT ceiling so the GPU can borrow
#     most of system RAM dynamically. Acts as a ceiling, not an
#     allocation. See ../../StrixHaloMemory.md for the UMA-vs-GTT
#     trade-off discussion.
# Requires a reboot to take effect; pyinfra leaves that to you.
files.line(
    name="GRUB cmdline (amd_iommu, gttsize)",
    path="/etc/default/grub",
    line=r"^GRUB_CMDLINE_LINUX_DEFAULT=.*",
    replace='GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"',
    _sudo=True,
)
server.shell(
    name="Regenerate GRUB config",
    commands=["update-grub"],
    _sudo=True,
)

# --- Storage layout ---------------------------------------------------------

files.directory(
    name="/models root",
    path=MODELS_DIR,
    user=SSH_USER,
    group=SSH_USER,
    mode="755",
    _sudo=True,
)
for sub in ("moonshotai", "qwen", "deepseek", "zai", "mistralai"):
    files.directory(
        name=f"{MODELS_DIR}/{sub}",
        path=f"{MODELS_DIR}/{sub}",
        user=SSH_USER,
        group=SSH_USER,
        mode="755",
        _sudo=True,
    )
# Ollama bind-mounts its content-addressed store here.
files.directory(
    name=f"{MODELS_DIR}/ollama",
    path=f"{MODELS_DIR}/ollama",
    user=SSH_USER,
    group=SSH_USER,
    mode="755",
    _sudo=True,
)

# --- Compose files for inference services ----------------------------------

files.directory(
    name="Compose root",
    path=COMPOSE_DIR,
    group="docker",
    mode="2775",  # setgid: new files inherit docker group
    _sudo=True,
)
# Idempotent cleanup of earlier locations.
files.directory(
    name="Remove old /srv/compose",
    path="/srv/compose",
    present=False,
    _sudo=True,
)
files.directory(
    name="Remove old ~/docker compose dir",
    path=f"/home/{SSH_USER}/docker",
    present=False,
    _sudo=True,
)
for svc in (
    "llama",
    "vllm",
    "ollama",
    "openwebui",
    "beszel",
    "openlit",
    "phoenix",
    "openhands",
):
    files.directory(
        name=f"compose/{svc} dir",
|
||||
path=f"{COMPOSE_DIR}/{svc}",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
files.put(
|
||||
name=f"compose/{svc}/docker-compose.yml",
|
||||
src=f"compose/{svc}.yml",
|
||||
dest=f"{COMPOSE_DIR}/{svc}/docker-compose.yml",
|
||||
group="docker",
|
||||
mode="664",
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# OpenWebUI persistent state (users, conversations, uploaded docs,
|
||||
# RAG vector index) — bind-mounted into the container.
|
||||
files.directory(
|
||||
name="OpenWebUI data dir",
|
||||
path=f"{COMPOSE_DIR}/openwebui/data",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# Beszel persistent state (admin account, system list, metric history).
|
||||
files.directory(
|
||||
name="Beszel data dir",
|
||||
path=f"{COMPOSE_DIR}/beszel/data",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
# Sibling .env file Docker Compose auto-reads for variable interpolation.
|
||||
# Pyinfra creates it empty if missing and never overwrites — the user
|
||||
# fills in BESZEL_TOKEN= and BESZEL_KEY= from the hub UI on first setup.
|
||||
# Mode 640 root:docker so docker group members can read it at compose
|
||||
# parse time but it isn't world-readable.
|
||||
files.file(
|
||||
name="Beszel .env (placeholder)",
|
||||
path=f"{COMPOSE_DIR}/beszel/.env",
|
||||
present=True,
|
||||
mode="640",
|
||||
user="root",
|
||||
group="docker",
|
||||
_sudo=True,
|
||||
)
|
||||
# Tear down the previous /etc/default/beszel-agent location if a prior
|
||||
# deploy created it (idempotent — no-op on a fresh box).
|
||||
files.file(
|
||||
name="Remove legacy /etc/default/beszel-agent",
|
||||
path="/etc/default/beszel-agent",
|
||||
present=False,
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# OpenLIT persistent state — Next.js app config + ClickHouse trace store.
|
||||
files.directory(
|
||||
name="OpenLIT data dir",
|
||||
path=f"{COMPOSE_DIR}/openlit/data",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
files.directory(
|
||||
name="OpenLIT ClickHouse data dir",
|
||||
path=f"{COMPOSE_DIR}/openlit/clickhouse",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# Phoenix persistent state (SQLite-backed trace store).
|
||||
files.directory(
|
||||
name="Phoenix data dir",
|
||||
path=f"{COMPOSE_DIR}/phoenix/data",
|
||||
group="docker",
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# OpenHands state + workspace. Owned by the SSH user (UID 1000) so the
|
||||
# sandbox containers — which run as SANDBOX_USER_ID=1000 — can write to
|
||||
# the shared workspace without root-owned files leaking out.
|
||||
files.directory(
|
||||
name="OpenHands state dir",
|
||||
path=f"{COMPOSE_DIR}/openhands/state",
|
||||
user=SSH_USER,
|
||||
group=SSH_USER,
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
files.directory(
|
||||
name="OpenHands workspace dir",
|
||||
path=f"{COMPOSE_DIR}/openhands/workspace",
|
||||
user=SSH_USER,
|
||||
group=SSH_USER,
|
||||
mode="2775",
|
||||
_sudo=True,
|
||||
)
|
||||
|
||||
# --- Cleanup of artifacts from the prior native-build deploy ----------------
|
||||
# All idempotent — `present=False` is a no-op when the target is absent.
|
||||
|
||||
server.shell(
|
||||
name="Stop & disable old native llama-server.service",
|
||||
commands=[
|
||||
"systemctl disable --now llama-server.service 2>/dev/null || true",
|
||||
],
|
||||
_sudo=True,
|
||||
)
|
||||
files.file(
|
||||
name="Remove old llama-server.service",
|
||||
path="/etc/systemd/system/llama-server.service",
|
||||
present=False,
|
||||
_sudo=True,
|
||||
)
|
||||
files.link(
|
||||
name="Remove old llama-server-vulkan symlink",
|
||||
path="/usr/local/bin/llama-server-vulkan",
|
||||
present=False,
|
||||
_sudo=True,
|
||||
)
|
||||
files.link(
|
||||
name="Remove old llama-server-rocm symlink",
|
||||
path="/usr/local/bin/llama-server-rocm",
|
||||
present=False,
|
||||
_sudo=True,
|
||||
)
|
||||
files.directory(
|
||||
name="Remove old llama.cpp checkout",
|
||||
path="/opt/llama.cpp",
|
||||
present=False,
|
||||
_sudo=True,
|
||||
)
|
||||
server.shell(
|
||||
name="Stop & remove native Ollama install",
|
||||
commands=[
|
||||
"systemctl disable --now ollama.service 2>/dev/null || true",
|
||||
"rm -f /etc/systemd/system/ollama.service /usr/local/bin/ollama",
|
||||
"userdel ollama 2>/dev/null || true",
|
||||
],
|
||||
_sudo=True,
|
||||
)
|
||||
systemd.daemon_reload(name="systemctl daemon-reload", _sudo=True)
|
||||
5
pyinfra/framework/inventory.py
Normal file
@@ -0,0 +1,5 @@
# pyinfra inventory for the Framework Desktop / Strix Halo box.

framework_desktop = [
    ("framework", {"ssh_hostname": "10.0.0.70", "ssh_user": "noise"}),
]
7
pyinfra/framework/run.sh
Executable file
@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Wrapper: invokes pyinfra against the Framework Desktop, forwarding any
# extra args (e.g. --dry, -v) to pyinfra. Assumes NOPASSWD sudo on the box.

set -euo pipefail
cd "$(dirname "$0")"
exec pyinfra -v --ssh-password-prompt inventory.py deploy.py "$@"
311
testing/qwen3-coder-30b/DEPLOYMENT_PLAN.md
Normal file
@@ -0,0 +1,311 @@
# Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on AMD Strix Halo hardware with 128GB unified memory.

## Hardware Specifications

- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CU RDNA 3.5)
- **Architecture**: Zen 5 CPU with unified memory access

## Prerequisites

### Hardware Requirements
1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
2. Minimum 128GB RAM (as specified)
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration

### Software Requirements
1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. HuggingFace CLI client

## Deployment Steps

### Phase 1: Hardware Configuration

#### 1.1 BIOS Configuration
Before installation, set these BIOS options:
- **UMA Frame Buffer Size**: 512MB (for display)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance

#### 1.2 Kernel Configuration
Ensure you're running kernel 6.16.9 or later:
```bash
# Check current kernel
uname -r

# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```

#### 1.3 GRUB Boot Parameters
Edit `/etc/default/grub`:
```bash
sudo nano /etc/default/grub
```

Add to `GRUB_CMDLINE_LINUX_DEFAULT`:
```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```

Apply the changes:
```bash
sudo update-grub
sudo reboot
```

#### 1.4 GPU Access Configuration
Create udev rules for proper GPU permissions:
```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```

### Phase 2: ROCm Installation

#### 2.1 Install ROCm from AMD Repository
```bash
# Add AMD ROCm repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```

#### 2.2 Verify ROCm Installation
```bash
# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"

# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
```

### Phase 3: Model Setup

#### 3.1 Download Kimi Linear GGUF Model
```bash
# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for this hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
    --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
    --local-dir ./

# Or download a smaller quantization if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#     --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#     --local-dir ./
```

#### 3.2 Validate Download
```bash
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
```

### Phase 4: llama.cpp Setup for AMD ROCm

#### 4.1 Build llama.cpp with ROCm Support
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Build for ROCm
cmake -B build -S . \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS="gfx1151"

cmake --build build --config Release -j$(nproc)
```

#### 4.2 Alternative: Use Pre-built Container
Instead of building, you can use a container with ROCm pre-configured:
```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create LLM container with ROCm
distrobox create llama-rocm \
    --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
    --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```

### Phase 5: Test Inference

#### 5.1 Run a Basic Test
```bash
# If built locally:
./build/bin/llama-cli \
    -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
    --no-mmap \
    -ngl 99 \
    -p "Explain quantum computing in simple terms:" \
    -n 256

# If using the container:
distrobox enter llama-rocm
llama-cli \
    -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
    --no-mmap \
    -ngl 99 \
    -p "Explain quantum computing in simple terms:" \
    -n 256
```

### Phase 6: Benchmarking and Optimization

#### 6.1 Run Performance Benchmarks
```bash
# Run llama-bench for performance testing
llama-bench \
    -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
    -mmp 0 \
    -ngl 99 \
    -p 512 \
    -n 128
```

#### 6.2 Monitor GPU Resources
```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```

## Performance Optimization Tips

### Memory Management
1. **Use the full GTT ceiling**: with `amdgpu.gttsize=117760`, the GPU can borrow up to ~115GB of system memory for compute
2. **Choose appropriate quantization**:
   - Q4_K_M: Good balance of quality and size (~30GB)
   - IQ4_XS: Smaller but with some quality loss (~26GB)
   - Q5_K_M: Higher quality but larger (~35GB)

### Inference Parameters
```bash
# Suggested flags for AMD Strix Halo
llama-cli \
    -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
    --no-mmap \
    -ngl 99 \
    -p "Your prompt text" \
    -n 512 \
    --temp 0.7 \
    --top-p 0.9 \
    --repeat-penalty 1.1
```

## Expected Performance on Strix Halo

| Model Quantization | Predicted Performance | Notes |
| ------------------ | --------------------- | ----- |
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient |
| Q5_K_M (~35GB) | 12-18 tokens/sec | Higher quality |

## Troubleshooting

### Common Issues
1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Check udev rules
2. **Slow model loading**: Add `--no-mmap` flag
3. **Memory allocation errors**: Verify GTT size in GRUB
4. **Container GPU access**: Check device permissions

### Verification Commands
```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"

# Test basic GPU functionality
rocminfo
```

## Additional Tools and Integration

### For Development/Testing
1. **llama-cpp-python**: For Python integration
   ```python
   # Install the package first: pip install llama-cpp-python
   from llama_cpp import Llama
   llm = Llama.from_pretrained(
       repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
       filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf"
   )
   ```

2. **OpenWebUI**: Web interface for LLMs
3. **Ollama**: Alternative for simpler deployment
4. **vLLM**: High-throughput inference (for API deployment)

## Monitoring and Resource Usage

### System Resources to Monitor
- **GPU Utilization**: Use `watch -n 1 rocm-smi`
- **Memory Usage**: Check `/sys/class/drm/card*/device/mem_info_gtt_used` against `mem_info_gtt_total`
- **CPU Load**: Monitor with `htop`
- **Temperature**: GPU temperature (reported by `rocm-smi`)

### Storage Considerations
- Model files require ~30-50GB on disk
- Additional swap space recommended for complex tasks
- Consider SSD storage for optimal I/O performance

## Recommendations for Different Use Cases

### For Coding Tasks
- Use `--temp 0.3` for more deterministic outputs
- Enable `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for longer generations

### For Research/Testing
- Use `Q4_K_M` for reliable performance
- Try `IQ4_XS` to test memory efficiency
- Monitor with `llama-bench` for reproducible results

### For Production Use
- Consider containerized deployment for stability
- Set up automated monitoring for resource usage
- Create model-specific configuration files

## Conclusion

A Strix Halo system with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.

With this setup, you should be able to run Kimi Linear and achieve performance that rivals much more expensive hardware setups.
109
testing/qwen3-coder-30b/README.md
Normal file
@@ -0,0 +1,109 @@
# Local LLM Performance Testing Framework

A comprehensive framework for comparing the performance and quality of different locally hosted LLMs on coding tasks.

## Features

- **Performance Metrics**: Response time, tokens per second, CPU/memory/GPU usage
- **Quality Assessment**: Automated code quality evaluation
- **Multi-Model Testing**: Test multiple LLMs simultaneously
- **Parallel Execution**: Run tests efficiently
- **Comprehensive Reporting**: Detailed test results and comparisons

## Installation

```bash
# Install required dependencies
pip install psutil
```

## Usage

### Quick Start

1. Create a test suite and add models:

```python
from llm_test_framework import TestSuite, MockLLM

# Create test suite
suite = TestSuite()

# Add models to test
suite.add_model(MockLLM("gpt4all", {"model_path": "/path/to/model"}))
suite.add_model(MockLLM("llama.cpp", {"model_path": "/path/to/llama"}))

# Define test tasks
tasks = [
    {
        "name": "Simple function",
        "prompt": "Write a Python function to add two numbers",
        "expected_output": "return a + b"
    }
]

# Run tests
suite.run_all_tests(tasks)
suite.save_results()
```

### Running the Example

```bash
python llm_test_framework.py
```

This will run sample tests with mock models and save results to JSON.

## Components

### 1. TestSuite
Main class managing the testing process, including:
- Model registration
- Test execution
- Result collection and saving

### 2. LLMInterface
Abstract base class for LLM implementations with methods:
- `generate()`: Generate text from prompt
- `get_model_info()`: Get model information

### 3. TestResult
Data class storing individual test results with:
- Performance metrics
- Quality scores
- Raw outputs
- Success status

## Extending the Framework

To add support for a new LLM:

1. Create a new class inheriting from `LLMInterface`
2. Implement `generate()` and `get_model_info()` methods
3. Add your model to the test suite
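
To make those steps concrete, here is a minimal sketch of a new backend. The `LLMInterface` stub below only mirrors the two methods named above; the real base class in `llm_test_framework` may have different signatures, and `EchoLLM` is a hypothetical backend invented for illustration:

```python
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Stand-in for the framework's abstract base class."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def get_model_info(self) -> dict: ...


class EchoLLM(LLMInterface):
    """Trivial backend that echoes the prompt — useful for wiring tests."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config

    def generate(self, prompt: str) -> str:
        # A real backend would call its model here.
        return f"# response from {self.name}\n{prompt}"

    def get_model_info(self) -> dict:
        return {"name": self.name, **self.config}


model = EchoLLM("echo", {"context": 2048})
```

Once the class exists, registration is the same `suite.add_model(...)` call shown in the Quick Start.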

## Test Types

The framework supports multiple types of coding tasks:
- Basic function implementations
- Code refactoring examples
- Algorithm challenges
- System design questions

## Output

Results are saved in JSON format with:
- Performance metrics (response time, TPS)
- Resource usage (CPU, memory, GPU)
- Quality assessment scores
- Raw model outputs
- Success/failure indicators
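
For orientation, one saved record might look like the following; the field names here are illustrative rather than the framework's exact schema:

```python
import json

# Hypothetical shape of one saved test record; the framework's actual
# field names may differ.
record = {
    "model_name": "Model_A",
    "task_name": "Sum Function",
    "response_time_s": 1.42,
    "tokens_per_second": 18.3,
    "cpu_percent": 62.0,
    "memory_mb": 5120,
    "quality_score": 85.0,
    "raw_output": "def add(a, b):\n    return a + b",
    "success": True,
}
serialized = json.dumps(record, indent=2)
```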

## Future Enhancements

- Integration with local LLM APIs (text-generation-webui, llama.cpp, etc.)
- Advanced code quality evaluation using static analysis
- Web-based dashboard for results visualization
- Database storage for historical comparisons
- Automated statistical analysis of results
158
testing/qwen3-coder-30b/TestingPlan.md
Normal file
@@ -0,0 +1,158 @@
# Local LLM Performance Testing Plan

## Overview
This document outlines a comprehensive testing framework to compare the performance and quality of different locally hosted LLMs on coding tasks. The framework will measure both speed and quality metrics to provide actionable insights for model selection and optimization.

## Objectives
- Evaluate response time and throughput of different LLMs
- Assess code quality and correctness of generated outputs
- Measure resource utilization during code generation
- Provide standardized comparison between different models
- Enable reproducible testing environments

## Key Metrics to Measure

### Performance Metrics
1. **Response Time**: Time from request to first token
2. **Tokens Per Second (TPS)**: Generation speed
3. **Total Latency**: Complete generation time
4. **Resource Usage**:
   - CPU utilization
   - Memory consumption
   - GPU utilization (for GPU-enabled models)
   - VRAM usage (for GPU-enabled models)
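
As a sketch, total latency and TPS fall out of two timestamps and a token count. The `generate` callable and whitespace tokenization below are stand-ins for whatever backend and tokenizer the harness actually uses:

```python
import time


def timed_generate(generate, prompt: str) -> dict:
    """Wrap a generate() callable and report latency plus throughput.

    Counting tokens by whitespace split is a rough stand-in for the
    model's real tokenizer.
    """
    start = time.perf_counter()
    output = generate(prompt)
    latency = time.perf_counter() - start
    tokens = len(output.split())
    return {
        "total_latency_s": latency,
        "tokens": tokens,
        "tokens_per_second": tokens / latency if latency > 0 else 0.0,
    }


# Example with a dummy backend that "generates" instantly:
metrics = timed_generate(lambda p: "def add(a, b): return a + b", "add two numbers")
```

Time-to-first-token needs a streaming API and is measured the same way, timestamping the first chunk instead of the full response.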

### Quality Metrics
1. **Code Syntax Accuracy**: Valid syntax according to language rules
2. **Functional Correctness**: Code produces expected outputs
3. **Code Efficiency**: Algorithmic efficiency and best practices
4. **Code Readability**: Code structure and documentation
5. **Task Completion Rate**: Properly solving the given coding challenge

## Testing Framework Architecture

### 1. Test Runner Component
- Initialize and configure different LLM instances
- Manage test execution workflow
- Coordinate parallel execution of multiple models
- Handle test parameters and configurations

### 2. Performance Monitoring
- Real-time resource usage tracking
- Timing measurements for each request
- Memory allocation monitoring
- GPU utilization tracking

### 3. Task Definition Catalog
- Collection of coding challenges at various difficulty levels
- Standardized test cases with expected outputs
- Task categorization by programming language and complexity
- Reproducible test scenarios

### 4. Evaluation Engine
- Automated code quality assessment
- Functional correctness validation
- Syntax checking and linting
- Comparison against expected outputs
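
A minimal version of the syntax check plus expected-output comparison could look like this; the 50/50 weighting is an arbitrary choice for illustration, not a prescribed scoring rule:

```python
import ast


def score_python_output(code: str, expected_fragment: str) -> float:
    """Score generated Python on syntax validity and an expected substring.

    Returns 0-100: 50 points if the code parses, 50 if it contains the
    task's expected_output fragment. Real functional-correctness checks
    would execute the code against test cases instead.
    """
    score = 0.0
    try:
        ast.parse(code)
        score += 50.0
    except SyntaxError:
        pass
    if expected_fragment in code:
        score += 50.0
    return score
```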

### 5. Reporting System
- Performance comparison dashboards
- Quality assessment reports
- Resource usage visualizations
- Summary statistics and trend analysis

## Test Scenarios

### Basic Programming Tasks
- Variable assignments and data structures
- Function implementations
- Control flow (loops, conditionals)
- Error handling and validation

### Intermediate Complexity
- Algorithm implementation
- Data structure manipulation
- API integration examples
- Modular code design

### Advanced Challenges
- Multi-threading and concurrency
- Mathematical and scientific computing
- System design and architecture
- Code optimization and refactoring

## Implementation Approach

### Phase 1: Core Framework Development
1. Create base testing infrastructure
2. Implement model interface abstraction
3. Build test runner and execution engine
4. Develop basic performance monitoring

### Phase 2: Task and Evaluation System
1. Design coding challenge catalog
2. Implement automated quality assessment
3. Create validation mechanisms
4. Develop reporting templates

### Phase 3: Advanced Features
1. Parallel execution support
2. Advanced visualization capabilities
3. Statistical analysis and confidence intervals
4. Configuration management and customization

## Expected Output

### Test Reports
- Detailed performance metrics per model
- Quality assessment scores
- Resource usage comparisons
- Statistical significance analysis
- Recommendations for model selection

### Visualization Components
- Performance comparison charts
- Resource utilization graphs
- Quality trend analysis
- Multi-model side-by-side comparisons

## Configuration Options

### Model Parameters
- Context window size
- Sampling parameters (temperature, top-p, etc.)
- System prompts and templates
- Hardware specifications

### Test Parameters
- Number of iterations per task
- Timeout thresholds
- Resource monitoring intervals
- Quality evaluation criteria weights

## Integration Points

### Existing Tools
- Local LLM APIs (e.g., text-generation-webui)
- Python-based LLM libraries
- System monitoring tools (htop, nvidia-smi)
- Code linting and validation tools

### Data Storage
- CSV/JSON output files
- Database for historical comparisons
- Version control integration
- Cloud storage for sharing results

## Success Criteria
- Reproducible test results
- Clear performance differentiation between models
- Comprehensive quality assessment coverage
- Scalable architecture for additional models
- User-friendly interface for result interpretation

## Risks and Mitigation
1. **Hardware Variability**: Address with controlled test environments
2. **Inconsistent Results**: Use multiple iterations and statistical analysis
3. **Resource Limitations**: Implement proper monitoring and resource allocation
4. **Quality Assessment Bias**: Use multiple evaluation criteria and human validation
70
testing/qwen3-coder-30b/example_usage.py
Executable file
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Example usage of the LLM Test Framework (shown here with mock models)
"""

from llm_test_framework import TestSuite, MockLLM

# Example tasks for testing
SAMPLE_TASKS = [
    {
        "name": "Hello World Function",
        "prompt": "Write a Python function that prints 'Hello, World!'",
        "expected_output": "print('Hello, World!')",
        "parameters": {"temperature": 0.3}
    },
    {
        "name": "Sum Function",
        "prompt": "Write a Python function to calculate the sum of two numbers",
        "expected_output": "return a + b",
        "parameters": {"temperature": 0.5}
    },
    {
        "name": "FizzBuzz",
        "prompt": "Write a Python function to solve the FizzBuzz problem",
        "expected_output": "if i % 3 == 0 and i % 5 == 0",
        "parameters": {"temperature": 0.7}
    }
]

def main():
    print("Starting LLM Performance Testing Framework")
    print("=" * 50)

    # Create test suite
    suite = TestSuite("sample_test_results")

    # Add some sample models
    print("Adding sample models...")
    suite.add_model(MockLLM("Model_A", {"context": 2048}))
    suite.add_model(MockLLM("Model_B", {"context": 4096}))

    # Run tests
    print("Running sample tests...")
    suite.run_all_tests(SAMPLE_TASKS)

    # Save results
    print("Saving results...")
    suite.save_results("sample_results.json")

    # Show summary
    print("\n" + "=" * 50)
    print("TEST SUMMARY")
    print("=" * 50)

    for result in suite.results:
        print(f"Model: {result.model_name}")
        print(f"Task: {result.task_name}")
        print(f"Response Time: {result.response_time:.2f}s")
        print(f"Quality Score: {result.quality_score:.1f}/100")
        print(f"Success: {'Yes' if result.success else 'No'}")
        print("-" * 30)

    print(f"\nTotal tests run: {len(suite.results)}")
    print(f"Results saved to: {suite.output_dir}")

if __name__ == "__main__":
    main()
129
testing/qwen3-coder-30b/find_duplicates.py
Executable file
@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Find duplicate files in a directory tree by content hash, with parallelization support.
"""

import hashlib
import sys
from collections import defaultdict
from multiprocessing import Pool, cpu_count
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def get_file_hash(filepath: Path, chunk_size: int = 8192) -> str:
    """
    Calculate the SHA-256 hash of a file.

    Args:
        filepath: Path to the file
        chunk_size: Size of chunks to read at a time

    Returns:
        SHA-256 hash of the file as a hexadecimal string, or "" on read error
    """
    hash_sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            # Read the file in chunks to handle large files efficiently
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hash_sha256.update(chunk)
        return hash_sha256.hexdigest()
    except OSError as e:
        print(f"Error reading {filepath}: {e}", file=sys.stderr)
        return ""


def process_file(filepath: Path) -> Tuple[str, Path]:
    """
    Process a single file and return its hash and path.

    Args:
        filepath: Path to the file

    Returns:
        Tuple of (file_hash, file_path)
    """
    return (get_file_hash(filepath), filepath)


def find_duplicates_parallel(root_dir: Path, num_processes: Optional[int] = None) -> Dict[str, List[Path]]:
    """
    Find duplicate files in a directory tree by content hash using parallel processing.

    Args:
        root_dir: Root directory to search
        num_processes: Number of processes to use (default: CPU count)

    Returns:
        Dictionary mapping file hashes to lists of file paths with that hash
    """
    if num_processes is None:
        num_processes = cpu_count()

    # Collect all regular files in the directory tree
    files = [p for p in root_dir.rglob('*') if p.is_file()]
    print(f"Found {len(files)} files to process")

    # Hash the files in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_file, files)

    # Group files by hash
    hash_to_files: Dict[str, List[Path]] = defaultdict(list)
    for file_hash, file_path in results:
        if file_hash:  # Only keep files that were successfully hashed
            hash_to_files[file_hash].append(file_path)

    # Filter out unique files (only keep duplicates)
    return {hash_val: file_list for hash_val, file_list in hash_to_files.items()
            if len(file_list) > 1}


def print_duplicates(duplicates: Dict[str, List[Path]]) -> None:
    """
    Print the duplicate files in a formatted way.

    Args:
        duplicates: Dictionary mapping file hashes to lists of file paths
    """
    if not duplicates:
        print("No duplicates found.")
        return

    print(f"Found {len(duplicates)} sets of duplicate files:")
    for hash_val, file_list in duplicates.items():
        print(f"\nHash: {hash_val}")
        for file_path in file_list:
            print(f"  {file_path}")


def main():
    """Main entry point for the duplicate file finder."""
    if len(sys.argv) < 2:
        print("Usage: python find_duplicates.py <root_directory> [num_processes]")
        sys.exit(1)

    root_dir = Path(sys.argv[1])
    if not root_dir.exists():
        print(f"Error: Directory {root_dir} does not exist")
        sys.exit(1)

    num_processes = None
    if len(sys.argv) > 2:
        try:
            num_processes = int(sys.argv[2])
        except ValueError:
            print("Error: num_processes must be an integer")
            sys.exit(1)

    print(f"Searching for duplicates in {root_dir} using {num_processes or cpu_count()} processes")
    duplicates = find_duplicates_parallel(root_dir, num_processes)
    print_duplicates(duplicates)


if __name__ == "__main__":
    main()
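The core of find_duplicates.py is the content-hash comparison: two files are considered duplicates exactly when their SHA-256 digests match. A minimal standalone sketch of that check (the `get_file_hash` body is copied from the file above; the temp-file names are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path

def get_file_hash(filepath: Path, chunk_size: int = 8192) -> str:
    """SHA-256 of a file, read in chunks so large files stay cheap on memory."""
    hash_sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()

with tempfile.TemporaryDirectory() as d:
    a, b, c = Path(d) / "a.txt", Path(d) / "b.txt", Path(d) / "c.txt"
    a.write_bytes(b"same content")
    b.write_bytes(b"same content")  # byte-identical to a.txt
    c.write_bytes(b"different content")
    assert get_file_hash(a) == get_file_hash(b)   # duplicates collide
    assert get_file_hash(a) != get_file_hash(c)   # distinct content does not
    print("content-hash duplicate check OK")
```

Note that this detects byte-identical files only; renamed copies are caught, but re-encoded or metadata-touched variants are not.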
246 testing/qwen3-coder-30b/llm_test_framework.py (Executable file)
@@ -0,0 +1,246 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework
"""

import json
import time
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional

import psutil


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    gpu_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float
    gpu_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from a prompt"""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        start_time = time.time()

        # Sample resources before generation
        cpu_start = psutil.cpu_percent(interval=1)
        memory_start = psutil.virtual_memory().used
        gpu_start = self._get_gpu_usage()

        # Generate the response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {e}"
            success = False

        end_time = time.time()
        response_time = end_time - start_time

        # Estimate tokens per second (word count is a very rough token proxy)
        tps = 0.0
        if success and response and response_time > 0:
            tps = len(response.split()) / response_time

        # Sample resources after generation
        cpu_end = psutil.cpu_percent(interval=1)
        memory_end = psutil.virtual_memory().used
        gpu_end = self._get_gpu_usage()

        # Average the before/after samples
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2
        gpu_avg = (gpu_start + gpu_end) / 2

        # Quality score (placeholder heuristic)
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            gpu_usage=gpu_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time(),
        )
        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _get_gpu_usage(self) -> float:
        """Get GPU usage if available.

        Placeholder: a real implementation would query nvidia-smi,
        rocm-smi, or sysfs.
        """
        return 0.0

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate a quality score for the response.

        Placeholder implementation; in practice this could use static
        analysis, code execution, or an LLM judge.
        """
        if not response:
            return 0.0

        score = 50.0  # Base score
        # Points for the presence of the expected pattern
        if expected and expected.lower() in response.lower():
            score += 20
        # Points for a non-error response of plausible length
        if not response.startswith("Error:") and len(response) > 10:
            score += 10
        # Cap the score at 100
        return min(score, 100.0)

    def save_results(self, filename: Optional[str] = None):
        """Save test results to a JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = [asdict(result) for result in self.results]
        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)
        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.5)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0",
        }


class ExampleTask:
    """Example task definitions for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {},
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {},
            },
        ]


if __name__ == "__main__":
    # Create the test suite and register mock models
    suite = TestSuite()
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run the sample tasks and persist results
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)
    suite.save_results()

    print("Testing completed!")
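The placeholder `_calculate_quality_score` heuristic in llm_test_framework.py is simple enough to verify by hand: 50 base points, +20 when the expected pattern appears in the response, +10 for a non-error response longer than 10 characters, capped at 100. Reproduced as a standalone function (the free-function name is mine; the logic is copied from the class above):

```python
def calculate_quality_score(response: str, expected: str) -> float:
    """Standalone copy of the placeholder scoring heuristic from TestSuite."""
    if not response:
        return 0.0
    score = 50.0  # base score
    if expected and expected.lower() in response.lower():
        score += 20  # expected pattern present
    if not response.startswith("Error:") and len(response) > 10:
        score += 10  # non-error response of plausible length
    return min(score, 100.0)

print(calculate_quality_score("", "return a + b"))                                  # 0.0
print(calculate_quality_score("def add(a, b):\n    return a + b", "return a + b"))  # 80.0
print(calculate_quality_score("Error: timeout", "return a + b"))                    # 50.0
```

The three calls cover the branches: empty response, full match (50 + 20 + 10), and an error response that earns only the base score.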
228 testing/qwen3-coder-30b/llm_test_framework_simple.py (Executable file)
@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework (Simplified Version)

Resource usage is simulated, so this variant runs without psutil or GPU tooling.
"""

import json
import time
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from a prompt"""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        start_time = time.time()

        # Simulated resource usage (this variant does not sample real metrics)
        cpu_start = 50       # percent (simulated)
        memory_start = 1024  # MB (simulated)

        # Generate the response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {e}"
            success = False

        end_time = time.time()
        response_time = end_time - start_time

        # Estimate tokens per second (word count is a very rough token proxy)
        tps = 0.0
        if success and response and response_time > 0:
            tps = len(response.split()) / response_time

        # Simulated end-of-run resources
        cpu_end = 60       # percent (simulated)
        memory_end = 1536  # MB (simulated)

        # Average the before/after samples
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2

        # Quality score (placeholder heuristic)
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time(),
        )
        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate a quality score for the response.

        Placeholder implementation; in practice this could use static
        analysis, code execution, or an LLM judge.
        """
        if not response:
            return 0.0

        score = 50.0  # Base score
        # Points for the presence of the expected pattern
        if expected and expected.lower() in response.lower():
            score += 20
        # Points for a non-error response of plausible length
        if not response.startswith("Error:") and len(response) > 10:
            score += 10
        # Cap the score at 100
        return min(score, 100.0)

    def save_results(self, filename: Optional[str] = None):
        """Save test results to a JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = [asdict(result) for result in self.results]
        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)
        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.1)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0",
        }


class ExampleTask:
    """Example task definitions for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {},
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {},
            },
        ]


if __name__ == "__main__":
    # Create the test suite and register mock models
    suite = TestSuite()
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run the sample tasks and persist results
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)
    suite.save_results()

    print("Testing completed!")
3 testing/qwen3-coder-30b/requirements.txt (Normal file)
@@ -0,0 +1,3 @@
# LLM Performance Testing Framework Requirements

psutil>=5.8.0
16 whisper-compose.yaml (Normal file)
@@ -0,0 +1,16 @@
services:
  whisper:
    image: rhasspy/wyoming-whisper:latest
    container_name: wyoming-whisper
    restart: unless-stopped
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data
    command:
      - --model
      - tiny-int8
      - --language
      - en
      - --beam-size
      - "1"