diff --git a/ImageGen.md b/ImageGen.md new file mode 100644 index 0000000..f1de653 --- /dev/null +++ b/ImageGen.md @@ -0,0 +1,231 @@ +# Image generation — landscape and workflows + +State of the OSS image-gen stack as of 2026-05 on the Framework Desktop +(Strix Halo gfx1151, 128 GB unified, ComfyUI via +`kyuz0/amd-strix-halo-comfyui`). Practical operator guide, not a tour +of the diffusion literature. + +## What's deployed today + +- **ComfyUI** at `http://framework:8188` (compose stack: `comfyui`). + Manual-start only; competes with kimi-linear for GPU. +- **Flux.1-Dev** (`/srv/docker/comfyui/models/diffusion_models/flux1-dev.safetensors`) + + CLIPs + VAE. Workhorse text-to-image model. +- **t5xxl** in both fp8 and fp16 variants — templates may reference either. + +See [`pyinfra/framework/compose/comfyui/README.md`](pyinfra/framework/compose/comfyui/README.md) +for bring-up and model dir layout. + +## Hardware envelope + +Flux.1-Dev was trained on 1024×1024. Peak memory grows roughly linearly +with pixel count, that is, quadratically with edge length. With 128 GB +unified the **quality wall hits well before the memory wall**: + +| Resolution | Memory peak | Quality reality | +|---|---|---| +| 1024×1024 (native) | ~14 GB | Sharp, composition correct | +| 1024×1536, 1536×1024 (native ratios) | ~18 GB | Native-class — trained on these | +| 1536×1536 | ~30 GB | Soft drift in composition | +| 2048×2048 | ~50-60 GB | Memory fits; duplicated subjects, anatomy errors | +| 3072×3072 | ~120 GB | At the wall; quality degrades meaningfully | +| 4096+ | OOM territory | Don't direct-render here | + +**Rule of thumb**: render at 1024-class, *upscale* to anything larger. +Direct 2K+ rendering on Flux is technically possible on this box but +produces visibly worse images than upscaling. For native >1K rendering, +use a model trained for it (HiDream-O1 — see [Model landscape](#model-landscape)). + +## Workflow patterns + +### 1. Native render + +The baseline. Template: **Workflow → Browse Templates → Flux.1 Dev**.
+1024×1024, fp8 t5xxl, ~20-30 sampling steps. First generation takes +~5 min (kernel JIT); subsequent ~30-60 sec. + +### 2. Hires fix — same seed, higher final resolution + +The standard "I like the small render, give me the same thing bigger" +pattern. Latent upscale + second sampler pass: + +``` +[KSampler 1024×1024 @ seed] + ↓ +[Upscale Latent By 2.0] + ↓ +[KSampler 2048×2048 @ same seed, denoise=0.4–0.6] + ↓ +[VAE Decode] +``` + +Denoise knob: +- **0.3–0.4** — stays very close to the small render, adds detail +- **0.5–0.6** — preserves composition, reinterprets details +- **0.7+** — composition drifts; treat as "variation at higher res" + +Bundled ComfyUI templates include "Flux Hires Fix" — load that and edit +the prompt/seed. + +### 3. Image upscale + img2img refine + +Cheaper than hires fix, less faithful to the original prompt. Decode +the small image, run an ESRGAN-class upscaler on it, then img2img with +low denoise to sharpen: + +``` +[Render 1024×1024] + ↓ +[Upscale Image with Model: 4x-UltraSharp or 4x_NMKD-Siax] + ↓ +[VAE Encode] → [KSampler denoise=0.2–0.3] + ↓ +[VAE Decode] +``` + +Faster than hires fix because the upscaler is a small model; the +diffusion pass only refines existing detail rather than re-rendering. +Best when the original render is already correct and you just want +sharper output. + +### 4. Tiled upscale (>2× scaling) + +For 4K / 8K targets without OOM, process the upscaled image in +overlapping tiles. Custom node: +[`ComfyUI_UltimateSDUpscale`](https://github.com/ssitu/ComfyUI_UltimateSDUpscale). +Each tile is its own sampler pass; tile overlap blends seams. Slower +end-to-end but uses constant memory regardless of final resolution. 
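To get a feel for why tiled upscaling is slower end-to-end while staying memory-flat, the sketch below counts sampler passes for a given output size. It is a rough illustration only: the `tile_grid` helper, the 1024 px tile size, and the 64 px overlap are assumptions made for this example, not the actual parameters or algorithm of `ComfyUI_UltimateSDUpscale`.

```python
import math


def tile_grid(width, height, tile=1024, overlap=64):
    """Return (columns, rows, passes) for an overlapping tile layout.

    Hypothetical helper: tile size and overlap are illustrative defaults,
    not values read from any custom node.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    # Each tile after the first contributes (tile - overlap) fresh pixels.
    stride = tile - overlap
    cols = max(1, math.ceil((width - overlap) / stride))
    rows = max(1, math.ceil((height - overlap) / stride))
    return cols, rows, cols * rows


# One sampler pass per tile; per-pass memory depends on the tile size,
# not on the final image size.
print(tile_grid(4096, 4096))
print(tile_grid(8192, 8192))
```

Under these assumed settings, the pass count grows with output area while the per-pass footprint stays at roughly the 1024-class level from the hardware envelope table.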
+ +## Reference / style conditioning + +Three distinct tools, picked by what you want preserved from the +reference: + +| Tool | Reference role | When to use | +|---|---|---| +| **Flux Redux** | "In the style of this" | Single inspiration image + text prompt — start here | +| **IP-Adapter Flux** | "Blend these N references with weights" | Multiple inspirations, fine control over each | +| **ControlNet (Flux)** | "Match this structure" — edges, depth, pose | Composition matters more than style | + +### Flux Redux (recommended starting point) + +Black Forest Labs' official "remix this image" adapter. Pass an +inspiration image + a text prompt describing what you want generated; +Redux blends the visual cues into the text-conditioned output. + +``` +[Load Reference Image] → [Redux Style Model] + ↓ +[Text Prompt] ────────→ [Conditioning combiner] + ↓ + [KSampler] → output +``` + +Models needed: +- `black-forest-labs/FLUX.1-Redux-dev` → `flux1-redux-dev.safetensors` + (gated — accept license like Flux.1-Dev) +- `google/siglip-so400m-patch14-384` → image encoder for Redux + +Drops in `/srv/docker/comfyui/models/style_models/` and +`/srv/docker/comfyui/models/clip_vision/`. + +### IP-Adapter Flux + +Community-built reference-conditioning adapter with per-image weight +sliders. Better when you want to blend multiple inspirations or +control intensity precisely. Custom node: +[`ComfyUI_IPAdapter_plus`](https://github.com/cubiq/ComfyUI_IPAdapter_plus) +(supports Flux as of 2025). Heavier setup than Redux. + +### ControlNet Flux + +Geometric/structural conditioning — pass a canny edge map / depth map +/ pose skeleton extracted from a reference, and Flux follows that +structure. Different question from "style of": this preserves +*composition*, not look. + +[`XLabs-AI/flux-controlnet-collections`](https://huggingface.co/XLabs-AI/flux-controlnet-collections) +ships Canny, Depth, HED variants. 
+ +## Model landscape + +| Model | Strength | When | +|---|---|---| +| **Flux.1-Dev** | Prompt adherence, current open SOTA on most tasks | Default; already deployed | +| **Flux.1-Schnell** | Distilled 4-step Flux | When you want 4× speed and can tolerate quality drop | +| **HiDream-O1-Image-Dev** | Pixel-native 2K, better text rendering | When native >1K matters (signs, posters, UI mockups) | +| **Wan-Image-2** | Stylized / illustrative range | Specific aesthetics not Flux's strong suit | +| **SDXL** + community LoRAs | Mature LoRA ecosystem | When a specific LoRA exists only for SDXL | +| **Flux.2** | (Newer, presumably better) | ⚠ **Crashes on Strix Halo ROCm** as of 2026-05; skip until upstream patch | + +**HiDream-O1** is the recommended next pull when you want higher +native resolution. Quote from the May 2026 multimodal research brief: +"native 2K, much better text rendering than Flux, beats Flux.2 on +benchmarks at 1/7 the params." Image gen at >1K native. + +## Custom nodes worth installing + +Drop these in `/srv/docker/comfyui/custom_nodes/` (RW bind-mount — +survives container recreate): + +| Repo | Why | +|---|---| +| [`ComfyUI_essentials`](https://github.com/cubiq/ComfyUI_essentials) | Already bundled in kyuz0 image. General-purpose utility nodes. | +| [`ComfyUI-AMDGPUMonitor`](https://github.com/general/ComfyUI-AMDGPUMonitor) | Already bundled. VRAM monitoring in the UI. | +| [`ComfyUI-GGUF`](https://github.com/city96/ComfyUI-GGUF) | Already bundled. Lets you use GGUF-quantized Flux variants. | +| [`ComfyUI_UltimateSDUpscale`](https://github.com/ssitu/ComfyUI_UltimateSDUpscale) | Tiled upscale for 4K+ outputs. Manual install. | +| [`ComfyUI_IPAdapter_plus`](https://github.com/cubiq/ComfyUI_IPAdapter_plus) | Multi-reference style conditioning. Manual install. | +| [`ComfyUI-Manager`](https://github.com/ltdrdata/ComfyUI-Manager) | Install/update custom nodes from inside the UI. Quality-of-life. 
| + +Install pattern for a custom node: + +```sh +cd /srv/docker/comfyui/custom_nodes +git clone https://github.com/ssitu/ComfyUI_UltimateSDUpscale +# Restart ComfyUI to pick up the new node +cd /srv/docker/comfyui && docker compose restart +``` + +## Practical "starter kit" downloads + +If you want to unlock the four main workflow patterns at once: + +```sh +# Upscaler (for hires fix + img2img refine) +hf download Kim2091/UltraSharp \ + 4x-UltraSharp.pth \ + --local-dir /srv/docker/comfyui/models/upscale_models + +# Flux Redux for "style of these inspirations" (gated — accept license) +hf download black-forest-labs/FLUX.1-Redux-dev \ + flux1-redux-dev.safetensors \ + --local-dir /srv/docker/comfyui/models/style_models + +# Image encoder Redux needs +hf download google/siglip-so400m-patch14-384 \ + model.safetensors \ + --local-dir /srv/docker/comfyui/models/clip_vision + +# Optional: HiDream for native 2K rendering +# hf download HiDream-AI/HiDream-I1-Dev ... (verify path on HF) +``` + +After download, refresh the browser tab so ComfyUI rescans the dirs. + +## Operational notes + +- **First generation is slow** (kernel JIT). Subsequent ones much + faster. +- **`--disable-mmap` is mandatory** for gfx1151 — already set in + compose. Don't remove. +- **VAE OOMs at high res** — `--bf16-vae` mitigates (set). For >2K + outputs, decode in tiles. +- **Concurrent generation** — ComfyUI queues requests, so a busy + workflow can chain N images on one prompt. +- **Output lands in `/srv/docker/comfyui/output/`** as well as in the UI. + +## Pointers to other docs + +- Bring-up + model dir layout: [`pyinfra/framework/compose/comfyui/README.md`](pyinfra/framework/compose/comfyui/README.md) +- Top-level coding-workflow status: [`README.md`](README.md) +- Strategic landscape (Layer 0 image/video/audio): [`Roadmap.md`](Roadmap.md) diff --git a/README.md b/README.md index 5cf53ea..9446195 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,7 @@ Tailscale. 
Coding agents, monitoring, voice — all self-hosted. | [`TODO.md`](TODO.md) | Open questions and follow-ups. | | [`testing/qwen3-coder-30b/`](testing/qwen3-coder-30b/) | Small evaluation harness for Qwen3-Coder. | | [`VoiceModels.md`](VoiceModels.md) | STT / TTS landscape and upgrade paths from the Wyoming defaults. | +| [`ImageGen.md`](ImageGen.md) | Image generation workflows on the ComfyUI stack — resolution envelope, hires fix, Redux/IP-Adapter for reference-driven generation. | ## Service ports (on `framework`) @@ -42,11 +43,14 @@ Tailscale. Coding agents, monitoring, voice — all self-hosted. | `11434` | Ollama | ROCm with `HSA_OVERRIDE_GFX_VERSION=11.0.0` | | `3000` | OpenWebUI | ChatGPT-style UI in front of Ollama | | `3001` | OpenLIT | LLM fleet metrics dashboard | +| `4000` | LiteLLM | OpenAI-compatible router in front of all local backends — `Bearer $LITELLM_MASTER_KEY` | | `3030` | OpenHands | Autonomous agent + sandbox runtime — Tailscale-only by design | | `4317` | Phoenix OTLP/gRPC | Trace ingestion | | `6006` | Phoenix UI / OTLP/HTTP | Per-trace agent waterfall (also `:6006/v1/traces`) | | `8001` | faster-whisper | STT (OpenAI API) — large-v3-turbo, for OpenWebUI/Conduit | | `8090` | Beszel | Host + container + AMD GPU dashboard | +| `8081` | llama.cpp (Qwen3-235B) | Long-task model, ~5-10 tok/s — manual start only (can't coexist with 30B/Kimi/ComfyUI) | +| `8188` | ComfyUI | Image generation web UI — manual start only (GPU contention) | | `8880` | Kokoro | TTS (OpenAI API) — Kokoro-82M, for OpenWebUI/Conduit | | `10200` | Piper | TTS (Wyoming protocol) — for Home Assistant Assist | | `10300` | Whisper | STT (Wyoming protocol) — for Home Assistant Assist | @@ -80,10 +84,12 @@ Send a prompt; watch it land in Phoenix at . | **In flight**: oc-tree TUI sidecar | M0 skeleton done, awaiting probe | Live tree view of opencode's SSE event stream; tmux-side companion to Phoenix's post-hoc traces. See [`oc-tree/NEXT_STEPS.md`](oc-tree/NEXT_STEPS.md). 
| | **In flight**: Kimi-Linear context ramp | P0 done at 32K, BIOS+GTT recipe applied | Next: ramp `--max-model-len` toward 1M and benchmark. See [`kimi-linear/NEXT_STEPS.md`](kimi-linear/NEXT_STEPS.md). | | **Configured but not deployed**: VS Code Continue | YAML ready | Drop [`vscode-continue-config.yml`](vscode-continue-config.yml) into `~/.continue/config.yaml`; `ollama pull nomic-embed-text` on the box for `@codebase` indexing. | -| **Planned next**: ComfyUI for image generation | Phase plan written | Flux.1-Dev via kyuz0 toolbox, same pyinfra pattern as kimi-linear. See task list (CF-P0..P4). | -| **Roadmap-only** (not started): Aider, Cline, OpenHands, LiteLLM, Multica | — | See `Roadmap.md`. | +| **In flight**: ComfyUI for image generation | CF-P0 — artifacts written, awaiting box-side bring-up | Flux.1-Dev via `kyuz0/amd-strix-halo-comfyui`. Manual start (`restart: "no"`) so it doesn't contend with kimi-linear at boot. See [`pyinfra/framework/compose/comfyui/README.md`](pyinfra/framework/compose/comfyui/README.md). | +| **In flight**: Qwen3-235B long-task model | M0 — artifacts written, awaiting weight pull + bring-up | Unsloth UD-Q2_K_XL (~88.8 GB) via `kyuz0:rocm-7.2.2`. The "overnight" model at ~5-10 tok/s — can't coexist with the 30B/Kimi/ComfyUI. See [`pyinfra/framework/compose/qwen3-235b/README.md`](pyinfra/framework/compose/qwen3-235b/README.md). Opencode wired via direct `framework-llama` provider. | +| **In flight**: LiteLLM router | M1 — artifacts written, awaiting box-side bring-up | Single OpenAI-compatible endpoint (port 4000) in front of all 4 local backends; routing table in [`pyinfra/framework/compose/litellm/config.yaml`](pyinfra/framework/compose/litellm/config.yaml). Opencode stays direct; OpenHands + future orchestrator point here. | +| **Roadmap-only** (not started): Aider, Cline, OpenHands integration, Multica | — | See `Roadmap.md`. 
| -**Single-line summary**: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; oc-tree and ComfyUI are the next two infra additions. +**Single-line summary**: opencode + Qwen3-Coder + the MCP toolbox is the working coding harness; Kimi-Linear is alongside for long-context chat; Qwen3-235B is being added as the overnight long-task model, fronted by a LiteLLM router for the future orchestrator; oc-tree and ComfyUI are the next two infra additions. ## Why this stack diff --git a/opencode/install.sh b/opencode/install.sh index 87b4aa3..8934a12 100755 --- a/opencode/install.sh +++ b/opencode/install.sh @@ -11,9 +11,11 @@ # 4. Install Serena (uv tool — LSP-backed code navigation MCP) # 5. Wire basic-memory's storage to the Obsidian vault's AI-memory folder # 6. Install github-mcp-server + check for GITHUB_PERSONAL_ACCESS_TOKEN -# 7. Install task-master-ai (workflow MCP) -# 8. Install the Phoenix bridge plugin's OTel deps -# 9. Generate ~/.config/opencode/opencode.json from the repo's +# 7. Install kagimcp (uv tool) + check for KAGI_API_KEY — replaces SearXNG +# as the primary search MCP; same key powers OpenWebUI's web search +# 8. Install task-master-ai (workflow MCP) +# 9. Install the Phoenix bridge plugin's OTel deps +# 10. Generate ~/.config/opencode/opencode.json from the repo's # opencode.json with relative plugin paths rewritten to absolute, # so opencode loads the plugin regardless of where it's launched. # @@ -32,14 +34,14 @@ warn() { printf ' \033[33m!\033[0m %s\n' "$*"; } fail() { printf ' \033[31m✗\033[0m %s\n' "$*"; exit 1; } # --- 1. Homebrew ------------------------------------------------------------- -bold "[1/9] Homebrew" +bold "[1/10] Homebrew" if ! command -v brew >/dev/null 2>&1; then fail "brew not found. Install from https://brew.sh, then re-run." fi ok "brew $(brew --version | head -1 | awk '{print $2}')" # --- 2. 
CLI deps ------------------------------------------------------------- -bold "[2/9] CLI dependencies" +bold "[2/10] CLI dependencies" brew_install_if_missing() { local pkg="$1" local bin="${2:-$1}" @@ -64,7 +66,7 @@ else fi # --- 3. Playwright browsers -------------------------------------------------- -bold "[3/9] Playwright browser cache" +bold "[3/10] Playwright browser cache" PW_CACHE="${HOME}/Library/Caches/ms-playwright" if [[ -d "$PW_CACHE" ]] && find "$PW_CACHE" -name "chrome" -o -name "Chromium*" 2>/dev/null | grep -q .; then ok "browsers already cached at $PW_CACHE" @@ -79,7 +81,7 @@ fi # start-mcp-server ...` without paying uvx's resolution cost on every # session start. --prerelease=allow is required because serena-agent # ships pre-1.0 versions. -bold "[4/9] Serena MCP" +bold "[4/10] Serena MCP" if uv tool list 2>/dev/null | awk '{print $1}' | grep -qx 'serena-agent'; then ok "serena-agent already installed ($(serena --version 2>/dev/null | head -1 || echo 'version unknown'))" else @@ -97,7 +99,7 @@ fi # at a folder inside the Obsidian vault via symlink so the AI's notes # show up in Obsidian's graph and search. Symlink (not env var) chosen # because it's stable across basic-memory's evolving config schema. -bold "[5/9] basic-memory storage" +bold "[5/10] basic-memory storage" AI_MEM_PATH="${HOME}/Documents/obsidian/AI-memory" if [[ ! -d "$AI_MEM_PATH" ]]; then info "creating $AI_MEM_PATH" @@ -129,7 +131,7 @@ fi # for a local 70B's effective context. Token itself is NOT in opencode.json # (it's git-tracked); github-mcp-server inherits GITHUB_PERSONAL_ACCESS_TOKEN # from the user's shell env. -bold "[6/9] github-mcp-server" +bold "[6/10] github-mcp-server" brew_install_if_missing github-mcp-server github-mcp-server if [[ -z "${GITHUB_PERSONAL_ACCESS_TOKEN:-}" ]]; then warn "GITHUB_PERSONAL_ACCESS_TOKEN is not set in your shell environment." @@ -145,12 +147,41 @@ else ok "GITHUB_PERSONAL_ACCESS_TOKEN is set in this shell" fi -# --- 7. 
claude-task-master (workflow MCP, npm global) ----------------------- +# --- 7. Kagi MCP (web search via Kagi's Search API) ------------------------- +# Official kagimcp from kagisearch/kagimcp. Installed as a uv tool so opencode +# can launch it as `uvx kagimcp` without per-session resolution cost. Replaces +# the SearXNG MCP as the primary search provider (SearXNG kept disabled in +# opencode.json as fallback). The same KAGI_API_KEY also powers +# RAG_WEB_SEARCH_ENGINE=kagi in OpenWebUI on the framework box — one key, +# two consumers. +bold "[7/10] Kagi MCP" +if uv tool list 2>/dev/null | awk '{print $1}' | grep -qx 'kagimcp'; then + ok "kagimcp already installed" +else + info "installing kagimcp via uv tool (~20s first run)" + uv tool install kagimcp + ok "kagimcp installed" +fi +if [[ -z "${KAGI_API_KEY:-}" ]]; then + warn "KAGI_API_KEY is not set in your shell environment." + warn " opencode's kagi MCP will fail to launch." + warn " Fix:" + warn " 1. Get an API key at https://kagi.com/settings?p=api (paid)." + warn " 2. Add to ~/.zshrc:" + warn " export KAGI_API_KEY=" + warn " 3. Re-source the shell (or open a new terminal) before launching opencode." + warn " 4. Put the same key in /srv/docker/openwebui/.env as KAGI_API_KEY=" + warn " on the framework box so OpenWebUI's web search works too." +else + ok "KAGI_API_KEY is set in this shell" +fi + +# --- 8. claude-task-master (workflow MCP, npm global) ----------------------- # task-master-ai: workflow/task-gate MCP. File-based (.taskmaster/ in each # project), no DB, no external service. opencode.json launches it via # `npx -y task-master-ai`; the global install also provides the `task-master` # CLI (`task-master init` to scaffold a project's tasks). 
-bold "[7/9] claude-task-master" +bold "[8/10] claude-task-master" if command -v task-master >/dev/null 2>&1; then ok "task-master-ai already installed ($(task-master --version 2>/dev/null | head -1 || echo 'version unknown'))" else @@ -160,7 +191,7 @@ else fi # --- 8. Phoenix bridge plugin deps ------------------------------------------ -bold "[8/9] Phoenix bridge plugin deps" +bold "[9/10] Phoenix bridge plugin deps" if [[ -d ".opencode/plugin/node_modules" && -f ".opencode/plugin/package-lock.json" ]]; then # Re-run npm install if package.json is newer than the lockfile, otherwise skip. if [[ ".opencode/plugin/package.json" -nt ".opencode/plugin/package-lock.json" ]]; then @@ -181,7 +212,7 @@ fi # in-place. Rewriting them to absolute paths here makes opencode find the # plugin regardless of which directory it was launched from. Re-run this # script after editing opencode.json. -bold "[9/9] Deploy global config" +bold "[10/10] Deploy global config" mkdir -p "${HOME}/.config/opencode" src="${HERE}/opencode.json" dst="${HOME}/.config/opencode/opencode.json" diff --git a/opencode/opencode.json b/opencode/opencode.json index ebc3432..6c130c0 100644 --- a/opencode/opencode.json +++ b/opencode/opencode.json @@ -38,6 +38,24 @@ "tool_call": false } } + }, + "framework-llama": { + "npm": "@ai-sdk/openai-compatible", + "name": "Framework Desktop (Strix Halo) — llama.cpp (long-task)", + "options": { + "baseURL": "http://framework:8081/v1", + "apiKey": "dummy" + }, + "models": { + "qwen3-235b": { + "name": "Qwen3-235B-A22B Instruct 2507 (~5-10 tok/s, manual start)", + "limit": { + "context": 65536, + "output": 16384 + }, + "tool_call": true + } + } } }, "mcp": { @@ -49,11 +67,19 @@ "searxng": { "type": "local", "command": ["npx", "-y", "mcp-searxng"], - "enabled": true, + "enabled": false, "environment": { "SEARXNG_URL": "https://searxng.n0n.io" } }, + "kagi": { + "type": "local", + "command": ["uvx", "kagimcp"], + "enabled": true, + "environment": { + "KAGI_API_KEY": 
"${KAGI_API_KEY}" + } + }, "serena": { "type": "local", "command": [ @@ -82,7 +108,7 @@ "--read-only", "--toolsets", "repos,issues,pull_requests,code_security" ], - "enabled": false + "enabled": true }, "task-master": { "type": "local", diff --git a/pyinfra/framework/README.md b/pyinfra/framework/README.md index 9460606..7719c7c 100644 --- a/pyinfra/framework/README.md +++ b/pyinfra/framework/README.md @@ -33,7 +33,7 @@ Or run it ephemerally without installing: `uvx pyinfra inventory.py deploy.py`. ## What the deploy does -- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, huggingface-cli +- Base CLI: tmux, vim, htop, btop, nvtop, amdgpu_top, uv, lazydocker, hf (huggingface_hub CLI) - Tailscale (run `sudo tailscale up` on the box once, interactively) - Docker engine + compose plugin, user added to `docker` group - ROCm host diagnostics only (`rocminfo`) — no full toolchain diff --git a/pyinfra/framework/compose/comfyui.yml b/pyinfra/framework/compose/comfyui.yml new file mode 100644 index 0000000..0309b4e --- /dev/null +++ b/pyinfra/framework/compose/comfyui.yml @@ -0,0 +1,77 @@ +# ComfyUI on Strix Halo gfx1151 via kyuz0/amd-strix-halo-comfyui. +# +# Toolbox-style image (Fedora rawhide + ROCm) with /bin/bash as CMD. +# We override entrypoint to launch ComfyUI's main.py with the flag set +# gfx1151 needs (--disable-mmap because mmap >64 GB is slow on ROCm; +# --bf16-vae avoids VAE OOM; --cache-none keeps unified-memory pressure +# manageable). +# +# Coexistence with other services. ComfyUI competes for GPU with +# kimi-linear (always-resident) and ollama (loads-on-demand). To avoid +# silent contention this stack is NOT set to restart automatically — +# bring it up manually (`docker compose up -d`) when you need image gen, +# and `docker compose down` after. Mid-term we'll add a +# load-shed/coordination layer; this comment is the binding for now. +# +# Pin: kyuz0/amd-strix-halo-comfyui:20260213-143435 (sha-7242b4d). 
Bump +# deliberately after re-validating Flux/HiDream/LTX2 still work. +services: + comfyui: + image: kyuz0/amd-strix-halo-comfyui:20260213-143435 + container_name: comfyui + # Explicit no auto-restart — see header note about GPU contention. + restart: "no" + devices: + - /dev/kfd:/dev/kfd + - /dev/dri:/dev/dri + cap_add: + - SYS_PTRACE + security_opt: + - seccomp=unconfined + # Numeric GIDs of host's video (44) and render (991) groups — names + # don't exist inside the Fedora-rawhide base, but GIDs need to match + # the host for /dev/kfd + /dev/dri access. + group_add: + - "44" + - "991" + shm_size: 16g + ipc: host + environment: + # Same unified-memory recipe as kimi-linear.yml: BIOS UMA=0.5 GB + + # ttm.pages_limit=33554432 cmdline + this triple. Without these, + # PyTorch's HIP allocator only sees the tiny 0.5 GB UMA pool and + # can't reach GTT. The kyuz0 image is built against native gfx1151 + # so HSA_OVERRIDE_GFX_VERSION isn't needed. + - HSA_XNACK=1 + - HSA_FORCE_FINE_GRAIN_PCIE=1 + - PYTORCH_HIP_ALLOC_CONF=backend:native,expandable_segments:True,garbage_collection_threshold:0.9 + volumes: + # All ComfyUI state lives under /srv/docker/comfyui/ on the host. + # Image's $HOME is /root (Fedora rawhide). Models go in subdirs + # under comfy-models/ (text_encoders/, vae/, checkpoints/, + # diffusion_models/, unet/, loras/, clip_vision/) — kyuz0's image + # populates extra_model_paths.yaml pointing at $HOME/comfy-models. + - /srv/docker/comfyui/models:/root/comfy-models + - /srv/docker/comfyui/output:/root/comfy-outputs + - /srv/docker/comfyui/custom_nodes:/opt/ComfyUI/custom_nodes + - /srv/docker/comfyui/workflows:/opt/ComfyUI/user/default/workflows + ports: + # 8188 = standard ComfyUI port. kyuz0's banner alias uses 8000 but + # that would collide with vLLM (compose/kimi-linear.yml). 
+ - "8188:8188" + # bash -lc loads /etc/profile.d/01-rocm-envs.sh (TORCH_ROCM_AOTRITON, + # TORCH_BLAS_PREFER_HIPBLASLT) — without a login shell those don't + # apply and ROCm perf regresses. + entrypoint: ["/bin/bash", "-lc"] + # set_extra_paths.sh writes /opt/ComfyUI/extra_model_paths.yaml so + # ComfyUI finds models under $HOME/comfy-models. Idempotent — safe + # to run every start. Without it, model dropdowns in the UI are + # empty and templates report "missing model". + command: + - > + /opt/set_extra_paths.sh && + cd /opt/ComfyUI && python main.py + --listen 0.0.0.0 --port 8188 + --output-directory /root/comfy-outputs + --disable-mmap --gpu-only --disable-smart-memory + --cache-none --bf16-vae diff --git a/pyinfra/framework/compose/comfyui/README.md b/pyinfra/framework/compose/comfyui/README.md new file mode 100644 index 0000000..a322c64 --- /dev/null +++ b/pyinfra/framework/compose/comfyui/README.md @@ -0,0 +1,121 @@ +# comfyui + +ComfyUI for image generation on the Strix Halo box via +`kyuz0/amd-strix-halo-comfyui` (battle-tested gfx1151 toolbox). Web UI +at `http://framework:8188`. **Not** auto-started — bring up manually so +it doesn't contend with kimi-linear for GPU. + +## Coexistence notes (read first) + +ComfyUI competes for GPU memory with the always-resident kimi-linear +(vLLM) and on-demand ollama. The stack reflects this: + +- `restart: "no"` — won't come back on box reboot. You start it. +- Stop kimi-linear before heavy ComfyUI work, or accept slower swap. +- Use case order: `docker compose up -d` here → use UI → `docker compose down`. + +## Prereqs + +- Pyinfra deploy has run (creates `/srv/docker/comfyui/{models,output, + custom_nodes,workflows}` with the right perms). +- BIOS UMA at 0.5 GB + ttm.pages_limit cmdline active (same recipe as + kimi-linear). Verify with `cat /proc/cmdline | grep ttm.pages_limit`. 
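The `ttm.pages_limit` value is expressed in pages, so it helps to convert it to a size when checking the prerequisite. A minimal sketch, assuming 4 KiB pages (the usual x86-64 default); the `ttm_limit_gib` helper and the sample command line are illustrative and not part of the deploy:

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def ttm_limit_gib(cmdline, page_size=PAGE_SIZE):
    """Return ttm.pages_limit converted to GiB, or None if the parameter is absent."""
    match = re.search(r"(?:^|\s)ttm\.pages_limit=(\d+)", cmdline)
    if match is None:
        return None
    return int(match.group(1)) * page_size / 2**30


# Sample kernel command line using the value from the compose comments.
sample = "quiet splash ttm.pages_limit=33554432"
print(ttm_limit_gib(sample))  # 33554432 pages * 4 KiB = 128.0 GiB
```

On the box itself, the same function can be fed the contents of `/proc/cmdline`; a `None` result means the kernel parameter is not active.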
+ +## Bring up + +```sh +cd /srv/docker/comfyui +docker compose pull # ~8-12 GB image +docker compose up -d +docker compose logs -f # wait for "To see the GUI go to: http://0.0.0.0:8188" +./smoke.sh +``` + +Open `http://framework:8188` in a browser. The blank workflow loads. +Empty without models — see next section. + +## Adding Flux.1-Dev (CF-P1) + +Flux.1-Dev is gated on HF — accept the license at + first, set +`HF_TOKEN` in your shell. + +```sh +export HF_TOKEN=... + +# Diffusion model (UNet) +hf download black-forest-labs/FLUX.1-dev \ + flux1-dev.safetensors \ + --local-dir /srv/docker/comfyui/models/diffusion_models + +# VAE +hf download black-forest-labs/FLUX.1-dev \ + ae.safetensors \ + --local-dir /srv/docker/comfyui/models/vae + +# Text encoders (CLIP-L + T5XXL) +hf download comfyanonymous/flux_text_encoders \ + clip_l.safetensors t5xxl_fp8_e4m3fn.safetensors \ + --local-dir /srv/docker/comfyui/models/text_encoders +``` + +fp8 t5xxl (~5 GB) over fp16 (~9 GB) — generally indistinguishable +quality, much faster. Refresh the UI; the canonical Flux txt2img +workflow is at File → Load → flux-txt2img. + +## Model dir layout + +``` +/srv/docker/comfyui/models/ +├── checkpoints/ # full pipeline checkpoints (SDXL, SD1.5) +├── diffusion_models/ # standalone UNet/transformer (Flux, HiDream, etc.) +├── vae/ # VAEs +├── text_encoders/ # CLIP-L, T5XXL, etc. +├── loras/ +├── controlnet/ +├── clip_vision/ +└── upscale_models/ +``` + +The image's `extra_model_paths.yaml` maps these into ComfyUI's load +paths automatically. 
+ +## Operations + +```sh +docker compose logs -f # tail +docker compose restart comfyui # reload +docker compose down # stop +docker compose exec comfyui bash # shell in +amdgpu_top # GPU view on host +./smoke.sh # health check +``` + +## Pin manifest + +| Component | Pin | +|---|---| +| Image | `kyuz0/amd-strix-halo-comfyui:20260213-143435` (sha-7242b4d) | +| Default port | 8188 (host) → 8188 (container) | +| Flux text encoder quant | fp8_e4m3fn (override to fp16 if you observe quality loss) | + +Bump the image pin deliberately after re-validating that Flux + any +custom nodes still work. + +## Known caveats + +- **`--disable-mmap` is mandatory** (kyuz0 README: "mmap above 64 GB is + currently very slow due to a ROCm issue"). Already set in compose. +- **No `controlnet/` subdir created by `set_extra_paths.sh`** — we add + it manually via pyinfra. If kyuz0 adds it to their script, our + pyinfra dir creation is harmless idempotent overhead. +- **Custom nodes can write to `/opt/ComfyUI/custom_nodes`** — that dir + is bind-mounted RW so installs persist across container recreates. +- **Flux.2 crashes on Strix Halo ROCm** as of 2026-05; stick with + Flux.1-Dev or HiDream-O1 until upstream patch lands. + +## Status + +CF-P0 in progress (this commit). CF-P1 (Flux first image), CF-P2 +(productionize), CF-P3 (workflow library + OpenWebUI hook), CF-P4 +(HiDream, LTX-2, LoRAs) — see top-level task list. diff --git a/pyinfra/framework/compose/comfyui/smoke.sh b/pyinfra/framework/compose/comfyui/smoke.sh new file mode 100755 index 0000000..d9f6b12 --- /dev/null +++ b/pyinfra/framework/compose/comfyui/smoke.sh @@ -0,0 +1,20 @@ +#!/usr/bin/env bash +# Smoke-test the running ComfyUI container. /system_stats should return +# device + memory info; /object_info should list registered nodes. 
+set -euo pipefail + +HOST="${COMFYUI_HOST:-127.0.0.1:8188}" + +echo "[smoke] GET /system_stats on $HOST" +curl -fsS "http://$HOST/system_stats" | python3 -m json.tool + +echo +echo "[smoke] GET /object_info — node count" +curl -fsS "http://$HOST/object_info" | python3 -c " +import json, sys +data = json.load(sys.stdin) +print(f'{len(data)} nodes registered') +" + +echo +echo "[smoke] passed" diff --git a/pyinfra/framework/compose/homepage.yml b/pyinfra/framework/compose/homepage.yml index 1a259e2..47c855b 100644 --- a/pyinfra/framework/compose/homepage.yml +++ b/pyinfra/framework/compose/homepage.yml @@ -17,6 +17,14 @@ services: # 7575 picked to avoid the soup of 30xx ports already in use # (OpenWebUI 3000, OpenLIT 3001, OpenHands 3030). - "7575:3000" + extra_hosts: + # Required for customapi widgets to reach host services. Inside the + # container, `framework` resolves to 127.0.1.1 (Ubuntu hostname + # loopback alias), which is the container itself. Use + # host.docker.internal in widget.url fields to route via the host + # gateway. The user-facing `href` fields keep `framework:` since + # those resolve on the user's browser, not the container. + - "host.docker.internal:host-gateway" volumes: - /srv/docker/homepage/config:/app/config # Read-only docker socket so homepage can render container status diff --git a/pyinfra/framework/compose/homepage/services.yaml b/pyinfra/framework/compose/homepage/services.yaml index f2e510f..9ebd0c6 100644 --- a/pyinfra/framework/compose/homepage/services.yaml +++ b/pyinfra/framework/compose/homepage/services.yaml @@ -10,9 +10,21 @@ description: Local model server (Qwen3-Coder-30B and friends) server: localhost-docker container: ollama + # Built-in `type: ollama` widget is missing on the installed + # Homepage version. customapi against /api/ps gives a better + # signal anyway: actually-loaded model + its VRAM footprint. 
+ # When no model is loaded the models array is empty and fields + # render as N/A — that itself is useful state. widget: - type: ollama - url: http://framework:11434 + type: customapi + url: http://host.docker.internal:11434/api/ps + refreshInterval: 30000 + mappings: + - field: models.0.name + label: Loaded + - field: models.0.size_vram + label: VRAM + format: bytes - llama.cpp: icon: si-llama @@ -23,18 +35,50 @@ # No native widget; a ping check confirms liveness. widget: type: customapi - url: http://framework:8080/health + url: http://host.docker.internal:8080/health refreshInterval: 30000 mappings: - field: status label: Status - - vLLM: + - vLLM (Kimi-Linear): icon: mdi-server-network href: http://framework:8000 - description: Batched OpenAI-compatible serving (ROCm) + description: Batched OpenAI-compatible serving — Kimi-Linear-48B-A3B (long-context) server: localhost-docker - container: vllm + # Actual vLLM container is `kimi-linear` (compose/kimi-linear.yml). + # The legacy `vllm` container in compose/vllm.yml is an unused stub. + container: kimi-linear + widget: + type: customapi + url: http://host.docker.internal:8000/v1/models + refreshInterval: 30000 + mappings: + - field: data.0.id + label: Served + - field: data.0.max_model_len + label: Context + format: number + + - ComfyUI: + icon: mdi-image-edit + href: http://framework:8188 + description: Image generation (Flux.1-Dev via kyuz0 gfx1151 toolbox) + server: localhost-docker + container: comfyui + # ComfyUI's /system_stats returns nested {system, devices[0]}. + # Surfacing version + free VRAM gives a quick "is it healthy + # and does it have memory" read at a glance. 
+ widget: + type: customapi + url: http://host.docker.internal:8188/system_stats + refreshInterval: 30000 + mappings: + - field: system.comfyui_version + label: Version + - field: devices.0.vram_free + label: VRAM Free + format: bytes - Agent UIs: - OpenWebUI: diff --git a/pyinfra/framework/compose/kimi-linear.yml b/pyinfra/framework/compose/kimi-linear.yml index 2037fca..1498adf 100644 --- a/pyinfra/framework/compose/kimi-linear.yml +++ b/pyinfra/framework/compose/kimi-linear.yml @@ -17,7 +17,7 @@ # Weights. Despite their HF name, cyankiwi's "AWQ" Kimi-Linear weights # are actually `compressed-tensors` int4 group-quantized — see config.json. # Download with: -# huggingface-cli download cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \ +# hf download cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \ # --local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit # Size: ~35 GB on disk (4-bit). 8-bit variant is ~54 GB if quality drives # us up later; both fit 128 GB unified comfortably. diff --git a/pyinfra/framework/compose/kimi-linear/README.md b/pyinfra/framework/compose/kimi-linear/README.md index 1d9b68d..449c996 100644 --- a/pyinfra/framework/compose/kimi-linear/README.md +++ b/pyinfra/framework/compose/kimi-linear/README.md @@ -13,14 +13,14 @@ Smoke-test below confirms all three at once. - Pyinfra deploy has run (`./run.sh` from `pyinfra/framework/`) — gives you `/srv/docker/kimi-linear/`, GPU group membership, `/models/` - layout, and `huggingface-cli` on the box. -- Hugging Face CLI authenticated (`huggingface-cli login`) if the + layout, and the `hf` CLI on the box. +- Hugging Face CLI authenticated (`hf auth login`) if the weights repo gates downloads. cyankiwi's repo is currently public. 
## Step 1 — Download weights ```sh -huggingface-cli download \ +hf download \ cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \ --local-dir /models/moonshotai/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit ``` diff --git a/pyinfra/framework/compose/litellm.yml b/pyinfra/framework/compose/litellm.yml new file mode 100644 index 0000000..cd485bc --- /dev/null +++ b/pyinfra/framework/compose/litellm.yml @@ -0,0 +1,65 @@ +# LiteLLM proxy — single OpenAI-compatible endpoint in front of all the +# local model backends on this box (Ollama 11434, llama.cpp 30B 8080, +# vLLM Kimi-Linear 8000, llama.cpp Qwen3-235B 8081). +# +# Why this exists. With ≥3 backends running and ≥2 client harnesses +# (opencode on Mac, OpenHands on the box, future orchestrator on another +# server), each client otherwise carries its own per-backend config. +# LiteLLM centralizes: model_name → backend_url mapping lives here once, +# clients just speak "model: qwen3-235b" to a single URL. +# +# Routing model is documented in compose/litellm/README.md — opencode +# stays direct-wired for now (fewer hops, simpler debug); OpenHands + +# the future orchestrator will point here. +# +# Backend reachability. `extra_hosts: host.docker.internal:host-gateway` +# resolves to the host's docker0 IP from inside this container, which +# is how it reaches the other compose services published on host ports. +# Don't use container_name-based DNS — those containers live on separate +# bridge networks (each compose stack has its own). +services: + litellm: + image: ghcr.io/berriai/litellm:main-stable + container_name: litellm + restart: unless-stopped + extra_hosts: + # On Linux, `host-gateway` is Docker's magic alias for the host's + # docker0 IP — equivalent to host.docker.internal on Mac/Windows. + # Lets LiteLLM dial localhost-bound backends as + # http://host.docker.internal:. + - "host.docker.internal:host-gateway" + environment: + # Master key. 
LiteLLM requires one for admin endpoints + serves + # as the default Bearer for client requests. Sibling .env file + # holds the value (created by pyinfra as a placeholder; you fill + # it in on first deploy). Same pattern as compose/beszel.yml. + - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY} + # Optional: salt for hashing virtual keys at rest. Unused in the + # single-user setup but LiteLLM logs a warning without it. + - LITELLM_SALT_KEY=${LITELLM_SALT_KEY:-sk-localgenai-salt} + volumes: + # Source-of-truth config lives in the repo; pyinfra syncs it to + # /srv/docker/litellm/config.yaml on every `./run.sh`. Don't edit + # on the box — drift gets overwritten. + - /srv/docker/litellm/config.yaml:/app/config.yaml:ro + ports: + - "4000:4000" + command: + - --config + - /app/config.yaml + - --port + - "4000" + # --num_workers 1 keeps memory minimal; LiteLLM is I/O-bound here, + # not CPU-bound. Bump if you see queueing. + - --num_workers + - "1" + healthcheck: + # LiteLLM exposes both /health (verifies all backends are reachable + # — heavy) and /health/readiness (just the proxy itself — cheap). + # Use readiness for the compose healthcheck so a stopped backend + # doesn't mark LiteLLM unhealthy. + test: ["CMD", "curl", "-fsS", "http://127.0.0.1:4000/health/readiness"] + interval: 30s + timeout: 5s + retries: 3 + start_period: 30s diff --git a/pyinfra/framework/compose/litellm/README.md b/pyinfra/framework/compose/litellm/README.md new file mode 100644 index 0000000..120604f --- /dev/null +++ b/pyinfra/framework/compose/litellm/README.md @@ -0,0 +1,122 @@ +# litellm + +OpenAI-compatible router in front of all the local model backends on +this box. One endpoint (`http://framework:4000/v1`) → four backends +(Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from +[`config.yaml`](config.yaml) in this repo; edits land on the box via +`./run.sh` + `docker compose restart litellm`. 
+ +## When to route through LiteLLM vs direct + +| Client | Recommended | Why | +|---|---|---| +| **opencode (Mac)** | Direct providers in `opencode.json` | One Mac, four providers fits cleanly in opencode's config schema. Adds no debugging burden. | +| **OpenHands (the box)** | Via LiteLLM | OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. | +| **Future orchestrator (other server)** | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. | +| **OpenWebUI** | Direct (already wired to vLLM) | No reason to add a hop. | +| **Quick curl / scripts** | Either | LiteLLM is convenient because you don't need to remember port numbers. | + +The trade-off: a hop adds a few ms per request (negligible at our tok/s) +and one more thing to debug when something breaks. The win is config +centralization. + +## Prereqs + +- Pyinfra deploy has run (creates `/srv/docker/litellm/{config.yaml,.env}`). +- Fill in `/srv/docker/litellm/.env`: + ``` + LITELLM_MASTER_KEY=sk- + LITELLM_SALT_KEY=sk- + ``` + Mode 640 root:docker — readable by docker group at compose-parse time, + not world-readable. +- At least one backend running (otherwise `/v1/models` is empty and + routing 404s — LiteLLM itself comes up regardless). 
+ +## Bring up + +```sh +cd /srv/docker/litellm +docker compose up -d +docker compose logs -f # wait for "Application startup complete" + +./smoke.sh # /health/readiness + /v1/models +``` + +## Usage + +From any client — pass `LITELLM_MASTER_KEY` as the Bearer token: + +```sh +# List configured models +curl -fsS http://framework:4000/v1/models \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + | jq '.data[].id' + +# Chat completion (qwen3-coder via the qwen3-coder model_name in config) +curl -fsS http://framework:4000/v1/chat/completions \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + -H "Content-Type: application/json" \ + -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \ + | jq '.choices[0].message.content' + +# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up) +curl -fsS http://framework:4000/v1/chat/completions \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + -H "Content-Type: application/json" \ + -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}' +``` + +`/health` (no `/readiness`) hits every backend — useful for diagnosing +which one is down: + +```sh +curl -fsS http://framework:4000/health \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq +``` + +## Backend down semantics + +LiteLLM does **not** chain fallbacks across our model_list — each +`model_name` maps to exactly one backend. If `qwen3-235b` is down, +requests for `model: qwen3-235b` return 503. That's deliberate: + +- Silent fallback `qwen3-235b → qwen3-coder` would be confusing + (different model, different quality, different latency profile). +- The 235B has a manual-start workflow on purpose (GPU contention). + Fail-fast surfaces the "you forgot to bring it up" case immediately. + +If you want explicit fallback for a specific harness, add it client-side +(OpenHands has `LLM_FALLBACKS=...`). 
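
For a harness with no fallback setting of its own, the client-side approach could look roughly like the sketch below. It is not part of this repo: `chat_with_fallback`, `BackendDown`, and the `send` callable are hypothetical names, and the sketch assumes the transport turns the proxy's error for a stopped backend into an exception. The model that actually answered is returned alongside the reply, so the fallback is never silent.

```python
from typing import Callable, Sequence


class BackendDown(Exception):
    """Hypothetical error raised by `send` when a backend is unavailable."""


def chat_with_fallback(
    send: Callable[[str, str], str],
    models: Sequence[str],
    prompt: str,
) -> tuple[str, str]:
    """Try each model_name in order and return (model_used, reply).

    `send(model, prompt)` is an assumed transport, for example a thin
    wrapper around POST /v1/chat/completions on the LiteLLM endpoint.
    """
    errors = []
    for model in models:
        try:
            return model, send(model, prompt)
        except BackendDown as exc:
            # Remember why this model failed, then try the next one.
            errors.append(f"{model}: {exc}")
    raise BackendDown("all backends down: " + "; ".join(errors))


def fake_send(model: str, prompt: str) -> str:
    """Stand-in transport for illustration: pretends qwen3-235b is stopped."""
    if model == "qwen3-235b":
        raise BackendDown("503 from proxy")
    return f"[{model}] echo: {prompt}"


if __name__ == "__main__":
    used, reply = chat_with_fallback(
        fake_send, ["qwen3-235b", "qwen3-coder"], "ping"
    )
    print(used, reply)
```

Keeping the ordered model list in the caller means each harness decides for itself whether a smaller model is an acceptable substitute.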
+ +## Adding a model + +Edit [`config.yaml`](config.yaml) → `./run.sh` from `pyinfra/framework/` +on the Mac → `docker compose restart litellm` on the box. New model +appears in `/v1/models` immediately. No state to migrate (database_url +is null). + +## Operations + +```sh +docker compose logs -f # tail +docker compose restart litellm # reload after config edit +docker compose down # stop +./smoke.sh # health + model list +``` + +## Pin manifest + +| Component | Pin | +|---|---| +| Image | `ghcr.io/berriai/litellm:main-stable` | +| Default port | 4000 | +| Auth | Master key only (no virtual keys, no database) | +| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b | + +## Status + +M1 — compose + config artifacts written; awaiting box-side bring-up. +M2 will wire OpenHands' `LLM_BASE_URL` to this endpoint. M3 (the +orchestrator on the other server) will point at this endpoint over +Tailscale. diff --git a/pyinfra/framework/compose/litellm/config.yaml b/pyinfra/framework/compose/litellm/config.yaml new file mode 100644 index 0000000..8765132 --- /dev/null +++ b/pyinfra/framework/compose/litellm/config.yaml @@ -0,0 +1,73 @@ +# LiteLLM model routing. model_name is what clients request; the +# litellm_params block is how LiteLLM reaches the backend. +# +# `model: openai/` tells LiteLLM to use its +# openai-compatible adapter and forward to the backend. +# api_base is the backend's /v1 root reachable from inside the LiteLLM +# container (host.docker.internal = host's docker0 IP via the +# extra_hosts entry in litellm.yml). +# +# Backend running-state matters: requests to a stopped backend return +# 503/connection-refused. By design — no fallback chain, since these +# backends compete for GPU and silently routing "qwen3-235b" to the 30B +# would be more confusing than failing fast. +# +# Edits here require `./run.sh` on the Mac to push to the box, then +# `docker compose restart litellm` on the box to reload. 
+ +model_list: + # Daily-driver coding model. Ollama with gfx1100-coerced ROCm — + # currently the default opencode provider. Always-resident + # (OLLAMA_KEEP_ALIVE=24h). + - model_name: qwen3-coder + litellm_params: + model: openai/qwen3-coder:30b + api_base: http://host.docker.internal:11434/v1 + api_key: dummy + + # Same weights as qwen3-coder above but served via llama.cpp on the + # kyuz0 rocm-7.2.2 image (native gfx1151 + rocWMMA). LL-P0 measures + # whether the eval_tps win justifies switching default opencode to + # this. Manual start until then. + - model_name: qwen3-coder-llama + litellm_params: + model: openai/qwen3-coder + api_base: http://host.docker.internal:8080/v1 + api_key: dummy + + # Long-context chat (no tool calling) via vLLM. P0 verified at 32K; + # context ramp tracked in kimi-linear/NEXT_STEPS.md. + - model_name: kimi-linear + litellm_params: + model: openai/kimi-linear + api_base: http://host.docker.internal:8000/v1 + api_key: dummy + + # Long-task model — Qwen3-235B-A22B-Instruct-2507 UD-Q2_K_XL via + # llama.cpp on port 8081. ~5-10 tok/s decode; manual start only + # (can't coexist with the other GPU services). Requests will fail + # with connection refused when the container is down — that's the + # intended UX: a stopped service is a clear signal. + - model_name: qwen3-235b + litellm_params: + model: openai/qwen3-235b + api_base: http://host.docker.internal:8081/v1 + api_key: dummy + +litellm_settings: + # Forward all client params to the backend, even unrecognized ones. + # Default drops unknown OpenAI params — fine for hosted models, but + # our backends vary (vLLM, llama.cpp, Ollama) and each accepts a + # slightly different superset. Let the backend reject what it can't + # use rather than LiteLLM silently filtering. + drop_params: false + # No /v1/models caching — the list is short and we want stop/start + # of backends to reflect immediately. + cache: false + # Log proxy requests at info; tail with `docker compose logs litellm`. 
+ set_verbose: false + +general_settings: + # Disable LiteLLM's database mode — we're stateless. No user/key/spend + # tracking needed for a single-user trusted LAN setup. + database_url: null diff --git a/pyinfra/framework/compose/litellm/smoke.sh b/pyinfra/framework/compose/litellm/smoke.sh new file mode 100644 index 0000000..8c0fe2d --- /dev/null +++ b/pyinfra/framework/compose/litellm/smoke.sh @@ -0,0 +1,50 @@ +#!/usr/bin/env bash +# Smoke-test the LiteLLM proxy. /health/readiness for liveness, /v1/models +# for the configured backends, then /health (which dials every backend) +# to surface which ones are actually reachable right now. +set -euo pipefail + +HOST="${LITELLM_HOST:-127.0.0.1:4000}" +# Read master key from the environment if set, otherwise from the sibling .env. +if [[ -z "${LITELLM_MASTER_KEY:-}" && -f "$(dirname "$0")/.env" ]]; then + # shellcheck disable=SC1091 + source "$(dirname "$0")/.env" +fi +if [[ -z "${LITELLM_MASTER_KEY:-}" ]]; then + echo "[smoke] LITELLM_MASTER_KEY not set — export it or populate /srv/docker/litellm/.env" >&2 + exit 1 +fi + +echo "[smoke] GET /health/readiness on $HOST (proxy alive?)" +curl -fsS "http://$HOST/health/readiness" \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + | python3 -m json.tool + +echo +echo "[smoke] GET /v1/models (configured model_names)" +curl -fsS "http://$HOST/v1/models" \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + | python3 -c " +import json, sys +r = json.load(sys.stdin) +for m in r.get('data', []): + print(f\" - {m.get('id', '?')}\")" + +echo +echo "[smoke] GET /health (each backend's reachability — slow, ~10s)" +curl -fsS "http://$HOST/health" \ + -H "Authorization: Bearer $LITELLM_MASTER_KEY" \ + | python3 -c " +import json, sys +r = json.load(sys.stdin) +healthy = r.get('healthy_endpoints', []) +unhealthy = r.get('unhealthy_endpoints', []) +print(f' healthy: {len(healthy)}') +for e in healthy: + print(f' + {e.get(\"model\", \"?\")}') +print(f' unhealthy:
{len(unhealthy)}') +for e in unhealthy: + print(f' - {e.get(\"model\", \"?\")}: {e.get(\"error\", \"?\")[:80]}')" + +echo +echo "[smoke] passed — proxy up, model list populated. Unhealthy backends are expected if their compose stacks are down." diff --git a/pyinfra/framework/compose/llama.yml b/pyinfra/framework/compose/llama.yml index 990458c..e13d1f0 100644 --- a/pyinfra/framework/compose/llama.yml +++ b/pyinfra/framework/compose/llama.yml @@ -1,44 +1,101 @@ -# llama.cpp server, gfx1151-optimized via kyuz0's Strix Halo toolboxes. +# llama.cpp server, gfx1151-native via kyuz0's Strix Halo toolbox. # https://github.com/kyuz0/amd-strix-halo-toolboxes # # Tag options on docker.io/kyuz0/amd-strix-halo-toolboxes: -# vulkan-radv — most stable, recommended default (this one) -# vulkan-amdvlk — alternate Vulkan driver, sometimes faster -# rocm-7.2.2 — ROCm 7.x; needs /dev/kfd + render group_add (see vllm.yml pattern) -# rocm-6.4.4 — ROCm 6.x fallback +# rocm-7.2.2 — ROCm 7.x, native gfx1151 + rocWMMA (this one; +# best perf for Qwen3-Coder-class models) +# vulkan-radv — most-stable Vulkan; fallback if ROCm regresses +# vulkan-amdvlk — alternate Vulkan driver +# rocm-6.4.4 — older ROCm; only if 7.2.2 breaks # rocm7-nightlies — avoid: caps memory allocation to 64 GB (May 2026) # -# Toolbox images use a shell entrypoint, so we override to launch -# llama-server directly. Edit the --model path before `docker compose up -d`. +# Weights: Unsloth "dynamic" quant — UD-Q4_K_XL preserves more important +# weights at higher precision than naive Q4_K_M, closer to Q5 quality at +# Q4 size. Download path on the box (see compose/llama/README.md): +# hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \ +# 'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \ +# --local-dir /models/qwen +# Verify exact filename in the HF repo before downloading — Unsloth's +# file naming varies (sometimes split into shards). +# +# Coexists with Ollama (11434) and vLLM (8000). Port 8080 here. 
Ollama +# stays the default opencode provider until LL-P0 confirms the eval_tps +# bump is real on this box. services: llama: - image: kyuz0/amd-strix-halo-toolboxes:vulkan-radv + image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2 container_name: llama restart: unless-stopped devices: + # ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan + # only needs dri. Don't drop kfd when on the rocm-* tag. + - /dev/kfd:/dev/kfd - /dev/dri:/dev/dri + cap_add: + - SYS_PTRACE + security_opt: + - seccomp=unconfined + # Numeric GIDs of host's video (44) and render (991) groups — + # required for /dev/kfd + /dev/dri access from inside the container. + group_add: + - "44" + - "991" + shm_size: 8g + ipc: host + environment: + # Unified-memory recipe (same as compose/kimi-linear.yml + + # compose/comfyui.yml + compose/ollama.yml). BIOS UMA=0.5 GB + + # ttm.pages_limit cmdline → these flags merge the rocminfo pools + # into one ~110 GB arena via the HIP allocator's demand-paging. + # kyuz0's image is native gfx1151 so no HSA_OVERRIDE. + - HSA_XNACK=1 + - HSA_FORCE_FINE_GRAIN_PCIE=1 volumes: - /models:/models:ro ports: - "8080:8080" + # Toolbox image drops to shell by default; explicit entrypoint. entrypoint: ["llama-server"] command: - --model - - /models/REPLACE/ME/model.gguf + - /models/qwen/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf + # OpenAI-compatible served name (matches what opencode/llm/curl + # request as "model"). Keep simple — provider-side name lives + # in opencode.json. + - --alias + - qwen3-coder - --host - 0.0.0.0 - --port - "8080" + # Push all layers to GPU. "999" is shorthand for "all available." + # gfx1151 with 110 GB merged arena fits 30B-class models easily. - --n-gpu-layers - "999" + # Match Ollama's OLLAMA_CONTEXT_LENGTH so opencode behaves the + # same across providers. Bump if a workflow needs more; KV cost + # at this size is small with q8_0 cache. 
- --ctx-size - - "32768" - # Required for GPU backends on Strix Halo per Gygeek's setup - # guide. Forces full load into GPU memory rather than mmap. + - "65536" + # No-mmap is the Strix Halo standard — mmap >64 GB is slow on + # ROCm. Forces full GPU load. - --no-mmap - # Flash attention — works on Vulkan too; the big win is on the - # ROCm tag where kyuz0's build has rocWMMA acceleration. + # Flash attention — biggest single win, ~20-40 % faster on MoE. + # Modern llama-server takes a value (on/off/auto); bare --flash-attn + # is deprecated and consumes the next arg as its value. - --flash-attn + - "on" + # Quantize KV cache to int8 — halves KV memory at minor / no + # quality loss; sometimes faster due to smaller working set. + # Matches OLLAMA_KV_CACHE_TYPE=q8_0 in compose/ollama.yml. + - --cache-type-k + - q8_0 + - --cache-type-v + - q8_0 + # Use the model's embedded jinja chat template (rather than + # llama.cpp's hardcoded default). Important for Qwen3-Coder which + # has a specific chat format. + - --jinja # Expose Prometheus metrics at /metrics — scraped by OpenLIT for - # tokens/sec, KV-cache use, queue depth, and request latency. + # tokens/sec, KV-cache use, queue depth, request latency. - --metrics diff --git a/pyinfra/framework/compose/llama/README.md b/pyinfra/framework/compose/llama/README.md new file mode 100644 index 0000000..58f6adf --- /dev/null +++ b/pyinfra/framework/compose/llama/README.md @@ -0,0 +1,92 @@ +# llama + +llama.cpp server with **native gfx1151** kernels via kyuz0's ROCm 7.2.2 +toolbox. Sits beside Ollama (11434) and vLLM (8000) on port 8080. Same +Qwen3-Coder model as Ollama, faster path. + +## Why this exists + +Ollama's bundled ROCm doesn't ship native gfx1151 — we coerce gfx1100 +kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`. kyuz0's image is built +against gfx1151 with rocWMMA acceleration. Expected eval_tps delta on +Qwen3-Coder-30B-A3B-Q4: **~30-50 % faster**, with ~2× prefill speedup. 
+The compose stub used to be vulkan-radv with a placeholder model path; +this rewrite makes it the second working coding endpoint. + +## Bring up (LL-P0 verification) + +```sh +# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box. +# Verify the actual filename in the HF repo first — Unsloth's naming +# sometimes splits into shards. As of 2026-05 the single-file +# UD-Q4_K_XL is ~17-19 GB. +hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \ + 'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \ + --local-dir /models/qwen + +# 2. Stand up the container. +cd /srv/docker/llama +docker compose pull # ~6-10 GB image +docker compose up -d +docker compose logs -f # wait for "main: server is listening on http://0.0.0.0:8080" + +# 3. Smoke + perf measure. +./smoke.sh +``` + +If `predicted_per_second` is meaningfully higher than what Ollama +reports for the same prompt, the migration is justified. If it's the +same or worse, leave Ollama as the default and treat llama.cpp as a +secondary option. + +## Comparison test (vs Ollama) + +Run the same prompt against both for a clean A/B: + +```sh +# Ollama +curl -s http://framework:11434/api/generate \ + -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false}' \ + | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}' + +# llama.cpp (this stack) +curl -s http://framework:8080/completion \ + -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \ + | jq '.timings | {predicted_per_second, prompt_per_second}' +``` + +## Coexistence with Ollama + +Both can run simultaneously — different ports, different model files on +disk (Ollama's content-addressed store at `/models/ollama/` vs the raw +GGUF at `/models/qwen/`). They will compete for GPU memory if both have +their models hot. 
With `OLLAMA_KEEP_ALIVE=24h` Ollama keeps Qwen3 +resident; if you want to A/B without contention, `docker exec ollama +ollama stop qwen3-coder:30b` while testing llama.cpp. + +If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode +provider (`framework-llama/qwen3-coder` alongside `framework/qwen3-coder:30b` +and `framework-vllm/kimi-linear`). + +## Pin manifest + +| Component | Pin | +|---|---| +| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` | +| Weights | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` (UD-Q4_K_XL variant) | +| Default port | 8080 | +| Context | 65536 (matches Ollama config) | + +## Operations + +```sh +docker compose logs -f # tail +docker compose restart llama # reload +docker compose down # stop +docker compose exec llama bash # shell in +./smoke.sh # health + perf check +``` + +## Status + +LL-P0 in progress. LL-P1 (opencode provider wire-up) pending verification. diff --git a/pyinfra/framework/compose/llama/smoke.sh b/pyinfra/framework/compose/llama/smoke.sh new file mode 100755 index 0000000..67f4f51 --- /dev/null +++ b/pyinfra/framework/compose/llama/smoke.sh @@ -0,0 +1,45 @@ +#!/usr/bin/env bash +# Smoke-test the running llama-server (kyuz0 rocm-7.2.2). Hits /health +# for liveness, then a tiny OpenAI-compatible chat completion. Also +# prints eval_tps so you can compare to Ollama directly. 
+set -euo pipefail + +HOST="${LLAMA_HOST:-127.0.0.1:8080}" +MODEL="${LLAMA_MODEL:-qwen3-coder}" + +echo "[smoke] GET /health on $HOST" +curl -fsS "http://$HOST/health" | python3 -m json.tool + +echo +echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation" +curl -fsS "http://$HOST/v1/chat/completions" \ + -H 'Content-Type: application/json' \ + -d "{ + \"model\": \"$MODEL\", + \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}], + \"max_tokens\": 16, + \"temperature\": 0.0 + }" | python3 -m json.tool + +echo +echo "[smoke] perf measure — eval_tps and prompt_tps" +# Use llama.cpp's native /completion endpoint which returns timings. +curl -fsS "http://$HOST/completion" \ + -H 'Content-Type: application/json' \ + -d '{ + "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.", + "n_predict": 200, + "temperature": 0.0, + "stream": false + }' | python3 -c " +import json, sys +r = json.load(sys.stdin) +t = r.get('timings', {}) +print(f'predicted_per_second: {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s') +print(f'prompt_per_second: {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s') +print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}') +print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}') +" + +echo +echo "[smoke] passed" diff --git a/pyinfra/framework/compose/ollama.yml b/pyinfra/framework/compose/ollama.yml index 15ad4ae..a809185 100644 --- a/pyinfra/framework/compose/ollama.yml +++ b/pyinfra/framework/compose/ollama.yml @@ -31,6 +31,31 @@ services: # layers between GPU and CPU. 64K keeps the model fully on GPU # while still being plenty for coding contexts. - OLLAMA_CONTEXT_LENGTH=65536 + # Perf tuning. Flash attention is the biggest single win on MoE + # models at long context (20-40 % faster generation). q8_0 KV + # cache halves KV memory at minor / no quality loss; sometimes + # faster due to smaller working set.
The parallel/loaded-models + # caps avoid Ollama slicing memory across speculative concurrent + # requests we never have. + - OLLAMA_FLASH_ATTENTION=1 + - OLLAMA_KV_CACHE_TYPE=q8_0 + - OLLAMA_NUM_PARALLEL=1 + - OLLAMA_MAX_LOADED_MODELS=1 + # Keep the model resident for 24h instead of the default 5 min. + # Avoids cold-start latency between sessions; safe because we cap + # max_loaded_models above so memory doesn't drift. + - OLLAMA_KEEP_ALIVE=24h + # Unified-memory recipe. With BIOS UMA=0.5 GB the dedicated VRAM + # pool is tiny; the model lives in GTT (system RAM the GPU borrows + # via ttm.pages_limit=33554432 on the kernel cmdline). XNACK + + # FINE_GRAIN_PCIE put the HIP allocator into demand-paging mode so + # it treats the merged VRAM+GTT pool as one arena. Same flags as + # compose/kimi-linear.yml and compose/comfyui.yml — Ollama uses + # ggml/llama.cpp underneath but its allocator goes through HIP. + # PYTORCH_HIP_ALLOC_CONF is intentionally absent (Ollama isn't + # PyTorch). + - HSA_XNACK=1 + - HSA_FORCE_FINE_GRAIN_PCIE=1 volumes: - /models/ollama:/root/.ollama - /models:/models:ro diff --git a/pyinfra/framework/compose/openwebui.yml b/pyinfra/framework/compose/openwebui.yml index 3dc248e..f3e3cac 100644 --- a/pyinfra/framework/compose/openwebui.yml +++ b/pyinfra/framework/compose/openwebui.yml @@ -1,9 +1,14 @@ # OpenWebUI — ChatGPT-like web UI in front of Ollama. Pre-configured to -# use the host's Ollama instance and the project's SearXNG for web -# search. Default port 3000. +# use the host's Ollama instance and Kagi for web search. Default port +# 3000. # # Persistent state (users, conversations, uploaded docs, RAG vector # index) lives at /srv/docker/openwebui/data so backups touch one path. +# +# Sibling .env file holds KAGI_API_KEY (single shared key — opencode's +# kagimcp MCP on the Mac and this container use the same one). Same +# placeholder-then-fill-in-by-hand pattern as compose/beszel.yml and +# compose/litellm.yml. 
services: openwebui: image: ghcr.io/open-webui/open-webui:main @@ -22,9 +27,16 @@ services: # prompt confuses it. OpenWebUI's plain chat UI is the right home. - OPENAI_API_BASE_URLS=http://host.docker.internal:8000/v1 - OPENAI_API_KEYS=dummy - # Built-in web search via the project's SearXNG instance. + # Built-in web search via Kagi. Kagi-specific env var name is + # KAGI_SEARCH_API_KEY (OpenWebUI's convention); kagimcp on the Mac + # uses KAGI_API_KEY (Kagi's official convention). We standardize on + # KAGI_API_KEY in the .env file and let compose interpolate it + # into OpenWebUI's expected name here. - ENABLE_RAG_WEB_SEARCH=true - - RAG_WEB_SEARCH_ENGINE=searxng - - SEARXNG_QUERY_URL=https://searxng.n0n.io/search?q=&format=json + - RAG_WEB_SEARCH_ENGINE=kagi + - KAGI_SEARCH_API_KEY=${KAGI_API_KEY} + # Fallback (commented): the self-hosted SearXNG path. Re-enable by + # swapping RAG_WEB_SEARCH_ENGINE back to searxng and uncommenting: + # - SEARXNG_QUERY_URL=https://searxng.n0n.io/search?q=&format=json volumes: - /srv/docker/openwebui/data:/app/backend/data diff --git a/pyinfra/framework/compose/qwen3-235b.yml b/pyinfra/framework/compose/qwen3-235b.yml new file mode 100644 index 0000000..1e9f416 --- /dev/null +++ b/pyinfra/framework/compose/qwen3-235b.yml @@ -0,0 +1,97 @@ +# Qwen3-235B-A22B-Instruct-2507 (Unsloth UD-Q2_K_XL ~88.8 GB) via the +# kyuz0 rocm-7.2.2 Strix Halo toolbox. Same image + unified-memory +# recipe as compose/llama.yml; the only deltas are model path, port, +# alias, context, and the no-coexist `restart: "no"`. +# https://github.com/kyuz0/amd-strix-halo-toolboxes +# +# Coexistence. At ~88.8 GB weights this CANNOT coexist with the +# 30B llama service or Kimi-Linear (vLLM) — the merged GPU arena is +# only ~110 GB. Stop those before bringing this up. Same pattern as +# compose/comfyui.yml: `restart: "no"`, manual start, swap workflow +# documented in compose/qwen3-235b/README.md. +# +# Weights. 
UD-Q2_K_XL is Unsloth's "Dynamic" quant — important tensors +# kept at higher precision; closer to Q3 quality than naive Q2. 2 shards +# (~50 GB + 38.8 GB); llama.cpp auto-discovers shard 2 from shard 1. +# Download path on the box (see compose/qwen3-235b/README.md): +# hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF \ +# --include 'UD-Q2_K_XL/*' \ +# --local-dir /models/qwen/Qwen3-235B-A22B-Instruct-2507 +# +# Port 8081 — distinct from llama 30B (8080) so opencode/curl/etc. can +# address either explicitly even though only one runs at a time. +# +# Performance target. Bandwidth-bound: 256 GB/s ÷ ~22 GB active-bytes → +# ~5-10 tok/s decode. This is the "overnight long-task" model, NOT the +# interactive driver — see StrixHaloMemory.md for the bandwidth math. +services: + qwen3-235b: + image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2 + container_name: qwen3-235b + # Manual start only — see header note about GPU contention. + restart: "no" + devices: + - /dev/kfd:/dev/kfd + - /dev/dri:/dev/dri + cap_add: + - SYS_PTRACE + security_opt: + - seccomp=unconfined + # Numeric GIDs of host's video (44) and render (991) groups — + # required for /dev/kfd + /dev/dri access from inside the container. + group_add: + - "44" + - "991" + shm_size: 8g + ipc: host + environment: + # Unified-memory recipe (same as compose/llama.yml + kimi-linear). + # BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these flags merge the + # rocminfo pools into one ~110 GB arena. kyuz0's image is native + # gfx1151 so no HSA_OVERRIDE_GFX_VERSION. + - HSA_XNACK=1 + - HSA_FORCE_FINE_GRAIN_PCIE=1 + volumes: + - /models:/models:ro + ports: + - "8081:8081" + entrypoint: ["llama-server"] + command: + - --model + - /models/qwen/Qwen3-235B-A22B-Instruct-2507/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf + # OpenAI-compatible served name. Provider-side name lives in + # opencode.json once M0 perf is verified and we wire it up. 
+ - --alias + - qwen3-235b + - --host + - 0.0.0.0 + - --port + - "8081" + - --n-gpu-layers + - "999" + # 64K — opencode auto-compaction triggers at ~75-80 % of the + # stated context limit, so a small ctx fires the summarize-and- + # rewind loop after only a few turns. 64K roughly doubles how + # many turns fit. KV at q8_0 ≈ 8 GB (94 layers × 8 kv-heads × 128 + # head-dim × 2 × 65536 × 1 byte); arena headroom still ~11 GB. + # Stretch goal 131072 documented in compose/qwen3-235b/README.md + # but tight — verify allocator behaviour first. + - --ctx-size + - "65536" + - --no-mmap + # Flash attention is required for q8_0 KV cache in llama.cpp. + - --flash-attn + - "on" + - --cache-type-k + - q8_0 + - --cache-type-v + - q8_0 + # Qwen3-235B-Instruct-2507 ships its own chat template; let + # llama.cpp use it rather than the hardcoded default. + - --jinja + # Single sequence — KV pool isn't sliced across speculative + # concurrent requests we'll never have (long-task model, one + # request at a time). + - --parallel + - "1" + - --metrics diff --git a/pyinfra/framework/compose/qwen3-235b/README.md b/pyinfra/framework/compose/qwen3-235b/README.md new file mode 100644 index 0000000..48da806 --- /dev/null +++ b/pyinfra/framework/compose/qwen3-235b/README.md @@ -0,0 +1,163 @@ +# qwen3-235b + +Qwen3-235B-A22B-Instruct-2507 on Strix Halo via `kyuz0:rocm-7.2.2`. +The "overnight long-task" model — bandwidth math says ~5-10 tok/s +decode, so this is for fire-and-forget runs (deep refactors, long-form +analysis), **not** interactive coding. Daily driver stays on Ollama / +llama 30B. + +OpenAI-compatible endpoint at `http://framework:8081` once running. 
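
The bandwidth estimate behind the ~5-10 tok/s figure can be sketched as below. The 256 GB/s and ~22 GB inputs come from the header of `compose/qwen3-235b.yml`; the function names and the efficiency band are assumptions picked for illustration, not measurements from this box.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """Upper bound on decode tok/s if each token streams the active weights once."""
    return bandwidth_gb_s / active_gb_per_token


def decode_tps_range(
    bandwidth_gb_s: float,
    active_gb_per_token: float,
    efficiency: tuple[float, float] = (0.45, 0.85),  # assumed, not measured
) -> tuple[float, float]:
    """Scale the ceiling by an assumed achievable-bandwidth band."""
    ceiling = decode_tps_ceiling(bandwidth_gb_s, active_gb_per_token)
    low, high = efficiency
    return ceiling * low, ceiling * high


if __name__ == "__main__":
    ceiling = decode_tps_ceiling(256, 22)
    low, high = decode_tps_range(256, 22)
    print(f"ceiling {ceiling:.1f} tok/s, plausible {low:.1f}-{high:.1f} tok/s")
```

The ceiling is the number to compare against `./smoke.sh` output: a measured rate far below the low end of the band points at a configuration problem rather than the model.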
+ +## Coexistence notes (read first) + +At ~88.8 GB weights this can't share the GPU with anything else: + +| Concurrent service | Action | +|---|---| +| `llama` (Qwen3-Coder-30B, port 8080) | `docker compose down` in `/srv/docker/llama` first | +| `kimi-linear` (vLLM, port 8000) | `docker compose down` in `/srv/docker/kimi-linear` first | +| `ollama` (port 11434) | `docker exec ollama ollama stop qwen3-coder:30b` (Ollama itself can stay up) | +| `comfyui` (port 8188) | `docker compose down` in `/srv/docker/comfyui` first | + +The stack reflects this: `restart: "no"` — won't come back after a box +reboot. You start it deliberately. + +## Prereqs + +- Pyinfra deploy has run (creates `/srv/docker/qwen3-235b/` with right perms). +- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active. + Verify: `cat /proc/cmdline | grep ttm.pages_limit`. +- Other GPU services stopped per the table above. + +## Download weights (M0.1 — ~88.8 GB, 2 shards) + +```sh +# /models/qwen exists via pyinfra; just create the model subdir. +mkdir -p /models/qwen/Qwen3-235B-A22B-Instruct-2507 + +hf download unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF \ + --include 'UD-Q2_K_XL/*' \ + --local-dir /models/qwen/Qwen3-235B-A22B-Instruct-2507 + +# Files land at: +# /models/qwen/Qwen3-235B-A22B-Instruct-2507/UD-Q2_K_XL/ +# Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf (~50 GB) +# Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00002-of-00002.gguf (~38.8 GB) +# +# llama.cpp auto-discovers shard 2 from shard 1 — only point --model at +# the 00001-of-00002 file. +``` + +Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect +20-60 minutes on a fast home link. 
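The disk and download-time figures above can be sanity-checked with a short sketch. It assumes decimal gigabytes and a sustained link rate; the helper names and the sample link speeds are illustrative, not part of the repo.

```python
import shutil

WEIGHTS_GB = 88.8  # combined size of both shards, per the section above


def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Minutes to transfer size_gb (decimal GB) at a sustained link_mbps."""
    bits = size_gb * 1e9 * 8
    return bits / (link_mbps * 1e6) / 60


def free_gb(path: str) -> float:
    """Free space on the filesystem holding path, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


# Sample sustained rates (assumed values) to bracket the estimate.
for mbps in (200, 600, 1000):
    print(f"{mbps:>5} Mbit/s -> {download_minutes(WEIGHTS_GB, mbps):.0f} min")
```

Under these assumptions the 20-60 minute estimate corresponds to roughly 200-600 Mbit/s sustained. On the box, `free_gb("/models") >= 90` is a quick pre-flight check before starting the pull.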
+ +## Bring up (M0.2 — first generation) + +```sh +cd /srv/docker/qwen3-235b +docker compose pull # already-cached image if you ran llama first +docker compose up -d +docker compose logs -f # wait for "main: server is listening on http://0.0.0.0:8081" + +./smoke.sh # /health + tiny generation + perf +``` + +Expect **2-5 minutes** for first start — llama.cpp has to load ~88 GB +of weights off disk into the merged arena. Subsequent starts are faster +if the page cache is warm. + +If `./smoke.sh` reports `predicted_per_second` in the 5-10 tok/s range, +M0 is verified. Lower than 3 tok/s = something's wrong (likely the GPU +arena is < 100 GB — see "Troubleshooting"). + +## Ramping context + +Defaults to 64K — chosen because opencode's auto-compaction triggers +at ~75-80 % of the stated limit, so a smaller ctx fires the rewrite- +the-conversation loop after only a handful of turns. 64K roughly +doubles how many turns fit. Stages: + +| Stage | `--ctx-size` | KV (q8_0) | Margin in arena | +|---|---|---|---| +| Previous (M0) | 32768 | ~4 GB | ~15 GB | +| **Current default** | **65536** | **~8 GB** | **~11 GB** | +| M0.4 stretch | 131072 | ~16 GB | ~3 GB (tight) | + +Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && docker compose up -d`, +re-run `./smoke.sh`. If you see an alloc error in the logs, dial it back. + +opencode's `limit.context` in `opencode.json` should match — otherwise +opencode either compacts too early (limit lower than server) or sends +prompts longer than the server can handle (limit higher). + +## Troubleshooting + +**OOM on startup.** Check the arena size first: +```sh +rocminfo | grep -A2 "Pool Info" | head -20 +``` +If it reports two ~31 GB pools instead of one ~110 GB arena, the +unified-memory recipe didn't apply. Verify (in order): + +1. `cat /proc/cmdline` includes `amdgpu.gttsize=131072 ttm.pages_limit=33554432` +2. BIOS UMA Frame Buffer Size is **0.5 GB** (not 64 GB) — Framework BIOS + `lfsp0.03.05+`.
Counter-intuitive: a tiny UMA frees more pages for GTT. +3. Container env shows `HSA_XNACK=1 HSA_FORCE_FINE_GRAIN_PCIE=1` — + `docker compose exec qwen3-235b env | grep HSA`. + +If all three are right and OOM persists, drop to Q2_K_L (~85.8 GB) — edit +the model path in `docker-compose.yml` after a separate `hf download` of +that quant. + +**`predicted_per_second` very low (<3 tok/s).** Likely cold page cache. +Re-run `./smoke.sh` once — second run should be in band. If still slow, +verify the model file isn't being swapped from disk: `iostat -x 1` should +show ~0 read bandwidth during inference. + +**Server starts but answers gibberish.** `--jinja` not picked up; check +`docker compose logs qwen3-235b | grep -i 'chat template'`. Should +say "using chat template from gguf metadata". + +## Operations + +```sh +docker compose logs -f # tail +docker compose down # stop (always — coexists with nothing) +docker compose exec qwen3-235b bash # shell in +./smoke.sh # health + perf +amdgpu_top # GPU view on host +``` + +Suggested cycle: +``` +[evening] stop llama 30B / kimi-linear; up qwen3-235b; submit batch tasks +[overnight] qwen3-235b grinds; results land in your harness state +[morning] down qwen3-235b; up llama 30B / kimi-linear; back to interactive +``` + +M3 will automate this swap; M0 does it by hand. + +## Pin manifest + +| Component | Pin | +|---|---| +| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) | +| Weights | `unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF` UD-Q2_K_XL | +| Default port | 8081 | +| Default context | 65536 (ramp to 131072 deliberately) | +| KV cache type | q8_0 (k and v) | + +## Status + +M0 — compose artifacts written; awaiting box-side weight pull + bring-up. +M0.3-M0.4 (context ramp) follow once M0 boots cleanly. M1 wires this +endpoint as a 4th opencode/LiteLLM provider used by the long-task +orchestrator. + +## Why Instruct-2507 not Thinking-2507 + +Both are published; Thinking emits a `<think>` block before every answer.
+At ~7 tok/s decode, a 2K-token think block = ~5 min of wall time per +response, then the actual answer. For autonomous coding/refactor tasks +that's a tax we don't want. Thinking-2507 is worth adding as a separate +compose later for hard-reasoning one-shots; not the long-task default. diff --git a/pyinfra/framework/compose/qwen3-235b/smoke.sh b/pyinfra/framework/compose/qwen3-235b/smoke.sh new file mode 100644 index 0000000..f587564 --- /dev/null +++ b/pyinfra/framework/compose/qwen3-235b/smoke.sh @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +# Smoke-test the running qwen3-235b llama-server (port 8081). Hits +# /health for liveness, then a tiny OpenAI-compatible chat completion, +# then measures eval_tps via /completion. Generation is bigger than +# llama's smoke (n_predict=64) because at 5-10 tok/s the per-token +# noise floor swamps a 16-token sample. +set -euo pipefail + +HOST="${QWEN235_HOST:-127.0.0.1:8081}" +MODEL="${QWEN235_MODEL:-qwen3-235b}" + +echo "[smoke] GET /health on $HOST" +curl -fsS "http://$HOST/health" | python3 -m json.tool + +echo +echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation" +curl -fsS "http://$HOST/v1/chat/completions" \ + -H 'Content-Type: application/json' \ + -d "{ + \"model\": \"$MODEL\", + \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}], + \"max_tokens\": 16, + \"temperature\": 0.0 + }" | python3 -m json.tool + +echo +echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=64)" +# Bigger sample than llama/smoke.sh — at ~7 tok/s the first few tokens' +# warmup dominates a 16-token measurement. +curl -fsS "http://$HOST/completion" \ + -H 'Content-Type: application/json' \ + -d '{ + "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. 
Include type hints and a brief docstring.", + "n_predict": 64, + "temperature": 0.0, + "stream": false + }' | python3 -c " +import json, sys +r = json.load(sys.stdin) +t = r.get('timings', {}) +print(f'predicted_per_second: {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s') +print(f'prompt_per_second: {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s') +print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}') +print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}') +" + +echo +echo "[smoke] passed — expected band 5-10 tok/s decode; <3 tok/s = investigate" diff --git a/pyinfra/framework/deploy.py b/pyinfra/framework/deploy.py index 16f0c0d..786c7b6 100644 --- a/pyinfra/framework/deploy.py +++ b/pyinfra/framework/deploy.py @@ -101,9 +101,10 @@ server.shell( _sudo=True, ) -# huggingface_hub CLI for `huggingface-cli download --local-dir ...` -# Lands at ~/.local/bin/huggingface-cli for the SSH user. Other users can -# repeat the command themselves. +# huggingface_hub CLI for `hf download --local-dir ...`. Lands as +# ~/.local/bin/hf for the SSH user (the old `huggingface-cli` name is +# still installed as a deprecated alias). Other users can repeat the +# command themselves. server.shell( name="Install / upgrade huggingface_hub CLI", commands=["uv tool install --upgrade 'huggingface_hub[cli]'"], @@ -378,7 +379,14 @@ files.directory( mode="755", _sudo=True, ) -for sub in ("moonshotai", "qwen", "deepseek", "zai", "mistralai"): +for sub in ( + "moonshotai", + "qwen", + "deepseek", + "zai", + "mistralai", + "black-forest-labs", # Flux.1-Dev for ComfyUI (gated on HF) +): files.directory( name=f"{MODELS_DIR}/{sub}", path=f"{MODELS_DIR}/{sub}", @@ -424,6 +432,9 @@ for svc in ( "vllm", "ollama", "kimi-linear", + "qwen3-235b", + "litellm", + "comfyui", "openwebui", "beszel", "openlit", @@ -460,6 +471,19 @@ files.directory( mode="2775", _sudo=True, ) +# Sibling .env for KAGI_API_KEY (used by RAG_WEB_SEARCH_ENGINE=kagi). +# Placeholder created here; fill in by hand on first deploy.
Same shared +# key the kagimcp MCP on the Mac uses. Mode 640 root:docker so the +# docker group can read it at compose parse time, not world-readable. +files.file( + name="OpenWebUI .env (placeholder — fill in KAGI_API_KEY)", + path=f"{COMPOSE_DIR}/openwebui/.env", + present=True, + mode="640", + user="root", + group="docker", + _sudo=True, +) # Beszel persistent state (admin account, system list, metric history). files.directory( @@ -565,6 +589,114 @@ for cfg in ( _sudo=True, ) +# llama.cpp (kyuz0 rocm-7.2.2) operator assets — smoke + readme. The +# image and weights live elsewhere; this dir is just the scripts + +# doc that ride alongside the compose file. +for asset, mode in ( + ("smoke.sh", "0775"), + ("README.md", "0664"), +): + files.put( + name=f"llama: {asset}", + src=f"compose/llama/{asset}", + dest=f"{COMPOSE_DIR}/llama/{asset}", + group="docker", + mode=mode, + _sudo=True, + ) + +# Qwen3-235B operator assets. Same image as llama (kyuz0 rocm-7.2.2); +# this dir just ships the smoke script + README documenting the +# coexistence/swap pattern. Weights live at /models/qwen/ via manual +# `hf download` per the README. +for asset, mode in ( + ("smoke.sh", "0775"), + ("README.md", "0664"), +): + files.put( + name=f"qwen3-235b: {asset}", + src=f"compose/qwen3-235b/{asset}", + dest=f"{COMPOSE_DIR}/qwen3-235b/{asset}", + group="docker", + mode=mode, + _sudo=True, + ) + +# LiteLLM router assets. config.yaml is the source-of-truth model +# routing table — pyinfra syncs it on every run; edits on the box get +# overwritten. The .env file holds LITELLM_MASTER_KEY + LITELLM_SALT_KEY +# (same pattern as compose/beszel.yml — placeholder file created here, +# filled in by hand on first deploy). 
+for asset, mode in ( + ("config.yaml", "0664"), + ("smoke.sh", "0775"), + ("README.md", "0664"), +): + files.put( + name=f"litellm: {asset}", + src=f"compose/litellm/{asset}", + dest=f"{COMPOSE_DIR}/litellm/{asset}", + group="docker", + mode=mode, + _sudo=True, + ) +files.file( + name="LiteLLM .env (placeholder — fill in LITELLM_MASTER_KEY)", + path=f"{COMPOSE_DIR}/litellm/.env", + present=True, + mode="640", + user="root", + group="docker", + _sudo=True, +) + +# ComfyUI persistent state. The toolbox image runs as root and writes +# into these dirs (models read RW so custom nodes can install, output +# for generated images, custom_nodes for plugin installs that survive +# container recreate, workflows for user-saved graphs). Owned 2775 +# root:docker so docker group members can drop in model files. +for sub in ("models", "output", "custom_nodes", "workflows"): + files.directory( + name=f"ComfyUI {sub} dir", + path=f"{COMPOSE_DIR}/comfyui/{sub}", + group="docker", + mode="2775", + _sudo=True, + ) +# ComfyUI's expected model subdir layout — ComfyUI itself creates these +# at first start, but pre-creating means `hf download --local-dir ...` +# works without needing the container running. +for model_sub in ( + "checkpoints", + "diffusion_models", + "vae", + "text_encoders", + "loras", + "controlnet", + "clip_vision", + "upscale_models", +): + files.directory( + name=f"ComfyUI models/{model_sub} dir", + path=f"{COMPOSE_DIR}/comfyui/models/{model_sub}", + group="docker", + mode="2775", + _sudo=True, + ) +# Smoke test + operator doc. +for asset, mode in ( + ("smoke.sh", "0775"), + ("README.md", "0664"), +): + files.put( + name=f"comfyui: {asset}", + src=f"compose/comfyui/{asset}", + dest=f"{COMPOSE_DIR}/comfyui/{asset}", + group="docker", + mode=mode, + _sudo=True, + ) + # Kimi-Linear container assets (build script, smoke test, operator doc). 
# The compose file itself is copied by the for-loop above; the rest of # the build context lives under compose/kimi-linear/ on the source side