# localgenai

Local LLM stack on a Framework Desktop (AMD Ryzen AI Max+ 395 / "Strix Halo", Radeon 8060S, 128 GB unified LPDDR5x), driven from a Mac over Tailscale. Coding agents, monitoring, voice — all self-hosted.

## Topology

```
┌─────────────────┐    Tailscale     ┌──────────────────────────────┐
│  Mac            │ ───────────────▶ │  Framework Desktop           │
│                 │                  │  (Ubuntu 26.04, gfx1151 iGPU)│
│  • OpenCode     │                  │  • Inference: Ollama,        │
│    + Phoenix    │                  │    llama.cpp, vLLM           │
│    bridge       │                  │  • Agent UIs: OpenWebUI,     │
│  • install.sh   │                  │    OpenHands                 │
│                 │                  │  • Monitoring: Beszel,       │
│                 │                  │    OpenLIT, Phoenix          │
└─────────────────┘                  └──────────────────────────────┘
```
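
From the Mac, a quick way to confirm the tailnet path before anything else (a sketch; it assumes the box's MagicDNS name is `framework`, as used throughout this README):

```sh
# Confirm the Framework Desktop is reachable over Tailscale
tailscale ping framework

# Homepage (the front door) should answer with 200
curl -s -o /dev/null -w '%{http_code}\n' http://framework:7575/
```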

## Repo layout

| Path | What's there |
| --- | --- |
| `pyinfra/` | pyinfra deploy targeting the Framework Desktop: host setup (kernel params, Docker, AMD driver bits) plus all docker-compose files for the services below. See `pyinfra/framework/README.md`. |
| `opencode/` | OpenCode config + Phoenix bridge plugin. `install.sh` deploys it to `~/.config/opencode/` on a Mac, wires up the Playwright / SearXNG MCPs, and ships every prompt's spans to Phoenix. |
| `StrixHaloSetup.md` | Original phased bring-up plan for the box. |
| `StrixHaloMemory.md` | UMA / GTT / bandwidth notes — "what fits, what's slow, why." |
| `Roadmap.md` | What to build on top of the stack to narrow the gap with hosted Claude Code / Cowork / Skills / Design. |
| `TODO.md` | Open questions and follow-ups. |
| `testing/qwen3-coder-30b/` | Small evaluation harness for Qwen3-Coder. |
| `VoiceModels.md` | STT / TTS landscape and upgrade paths from the Wyoming defaults. |

## Service ports (on `framework`)

| Port | Service | Notes |
| --- | --- | --- |
| 7575 | Homepage | Front door — start here. One tile per service, with live widgets. |
| 8080 | llama.cpp | Vulkan backend; `--metrics` for Prometheus. |
| 8000 | vLLM | ROCm; gfx1151 support varies. |
| 11434 | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0`. |
| 3000 | OpenWebUI | ChatGPT-style UI in front of Ollama. |
| 3001 | OpenLIT | LLM fleet-metrics dashboard. |
| 3030 | OpenHands | Autonomous agent + sandbox runtime; Tailscale-only by design. |
| 4317 | Phoenix OTLP/gRPC | Trace ingestion. |
| 6006 | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`). |
| 8001 | faster-whisper | STT (OpenAI API), `large-v3-turbo`; for OpenWebUI/Conduit. |
| 8090 | Beszel | Host + container + AMD GPU dashboard. |
| 8880 | Kokoro | TTS (OpenAI API), Kokoro-82M; for OpenWebUI/Conduit. |
| 10200 | Piper | TTS (Wyoming protocol); for Home Assistant Assist. |
| 10300 | Whisper | STT (Wyoming protocol); for Home Assistant Assist. |
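
Most of these speak plain HTTP, so a few `curl` probes from the Mac make a decent smoke test. A sketch, assuming the `framework` hostname and the standard OpenAI-style paths these servers expose (exact model and voice names may differ per deployment):

```sh
# llama.cpp: health check, plus the Prometheus endpoint enabled by --metrics
curl -s http://framework:8080/health
curl -s http://framework:8080/metrics | head

# Ollama: list pulled models
curl -s http://framework:11434/api/tags | jq '.models[].name'

# Kokoro TTS (OpenAI-style /v1/audio/speech); voice name is an example
curl -s http://framework:8880/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model": "kokoro", "voice": "af_heart", "input": "stack check"}' \
  -o /tmp/check.wav

# faster-whisper STT (OpenAI-style /v1/audio/transcriptions): round-trip the clip
curl -s http://framework:8001/v1/audio/transcriptions \
  -F file=@/tmp/check.wav -F model=large-v3-turbo
```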

## Quick start

On the box (after the pyinfra deploy):

```sh
cd /srv/docker/ollama  && docker compose up -d
cd /srv/docker/phoenix && docker compose up -d
cd /srv/docker/beszel  && docker compose up -d   # then add system per beszel.yml
```
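
Then sanity-check that everything actually came up (container names are whatever the compose files assign, so the generic views below are safest):

```sh
docker compose ls        # one row per running compose project
docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'

# Phoenix UI should answer locally before you point the Mac at it
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:6006
```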

On the Mac:

```sh
cd opencode && ./install.sh
opencode
```

Send a prompt; watch it land in Phoenix at http://framework:6006.

## Coding workflows — state of the stack (2026-05-10)

| Component | Status | Notes |
| --- | --- | --- |
| Primary: opencode → `framework/qwen3-coder:30b` (Ollama) | Working daily-driver | Full MCP toolbox (playwright, searxng, serena, basic-memory, sequential-thinking, task-master). The Phoenix bridge emits OTel traces for every call. |
| Secondary: opencode → `framework-vllm/kimi-linear` (vLLM) | Configured, experimental | `tool_call: false` — Kimi-Linear is a research model and isn't strongly tool-trained. Long-context chat only. Switch via `/model framework-vllm/kimi-linear`. |
| Chat front-end for Kimi-Linear: OpenWebUI | Working | vLLM endpoint wired in via `OPENAI_API_BASE_URLS`; pick `kimi-linear` from the model selector in OpenWebUI. |
| Voice loop: faster-whisper + Kokoro | Deployed | OpenAI-compatible endpoints; not yet wired into a hands-free chat loop. |
| Observability: Phoenix per-trace + OpenLIT fleet | Working | All opencode prompts traced via `.opencode/plugin/phoenix-bridge.js`. |
| In flight: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; a tmux-side companion to Phoenix's post-hoc traces. See `oc-tree/NEXT_STEPS.md`. |
| In flight: Kimi-Linear context ramp | P0 done at 32K; BIOS + GTT recipe applied | Next: ramp `--max-model-len` toward 1M and benchmark. See `kimi-linear/NEXT_STEPS.md`. |
| Configured but not deployed: VS Code Continue | YAML ready | Drop `vscode-continue-config.yml` into `~/.continue/config.yaml`; `ollama pull nomic-embed-text` on the box for `@codebase` indexing. |
| Planned next: ComfyUI for image generation | Phase plan written | Flux.1-Dev via the kyuz0 toolbox, same pyinfra pattern as kimi-linear. See the task list (CF-P0..P4). |
| Roadmap-only (not started): Aider, Cline, OpenHands, LiteLLM, Multica | — | See `Roadmap.md`. |

Single-line summary: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; oc-tree and ComfyUI are the next two infra additions.
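
For the secondary vLLM lane, a minimal smoke test from the Mac (a sketch assuming vLLM's standard OpenAI-compatible server on `:8000` and a served model id of `kimi-linear`, matching the `/model framework-vllm/kimi-linear` alias):

```sh
# What is vLLM serving?
curl -s http://framework:8000/v1/models | jq '.data[].id'

# One chat turn; tool calling stays off (tool_call: false), so plain text only
curl -s http://framework:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "kimi-linear",
       "messages": [{"role": "user", "content": "Say hello in one line."}],
       "max_tokens": 32}' | jq -r '.choices[0].message.content'
```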

## Why this stack

- Bandwidth, not VRAM, is the ceiling on Strix Halo. 256 GB/s of memory bandwidth means MoE models with a small active-parameter count (Qwen3-Coder-30B-A3B, Kimi-Linear-48B-A3B) dominate; see the back-of-envelope numbers after this list, and `StrixHaloMemory.md` for the full story.
- AMD APU monitoring is broken upstream. `amd-smi` returns N/A on gfx1151, which kills the official Prometheus/Grafana exporters. Beszel reads sysfs directly and works. See the monitoring section of `pyinfra/framework/README.md`.
- Two-layer observability. OpenLIT gives fleet metrics across sessions; Phoenix gives the per-trace waterfall for "what did this one prompt do?"
- Reproducible. Every host-side config lives in `pyinfra/`; every Mac-side config in `opencode/install.sh`. Re-running either is safe.
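
The back-of-envelope numbers behind that first point (illustrative only; ~4.5 bits/param approximates a Q4 quant with overhead, and real decode speed lands below the ceiling):

```
decode ceiling ≈ memory bandwidth / bytes read per token
  MoE, ~3B active @ ~4.5 bits/param ≈ 1.7 GB/token → 256 / 1.7 ≈ 150 tok/s
  dense 30B       @ ~4.5 bits/param ≈ 17 GB/token  → 256 / 17  ≈ 15 tok/s
```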