added models, model-swap, ...
55
pyinfra/framework/compose/code-server.yml
Normal file
@@ -0,0 +1,55 @@
# code-server — VS Code in the browser, served from the box. The point:
# Claude Code (extension + CLI) runs *inside this container*, so long
# agent tasks keep running when the laptop sleeps. Any device on the
# tailnet gets the full VS Code experience at http://framework:8443.
#
# linuxserver image over codercom/code-server for the PUID/PGID
# convention (matches host `noise` UID 1000 so workspace files don't
# come out root-owned) and the s6 init that chowns /config.
#
# Extensions come from Open VSX (code-server's default registry). The
# official Claude Code extension is published there:
#   https://open-vsx.org/extension/Anthropic/claude-code
#
# Persistent state: /config is the container user's $HOME — extensions,
# settings, ~/.claude (OAuth creds + session history), ~/.local/bin
# (claude CLI). Survives container recreation; back up one path.
services:
  code-server:
    image: lscr.io/linuxserver/code-server:latest
    container_name: code-server
    restart: unless-stopped

    # No PASSWORD env set — auth is disabled, same trust model as every
    # other service here: reachable over Tailscale only. If this host
    # ever sees LAN/internet traffic, set HASHED_PASSWORD (and even then
    # prefer "127.0.0.1:8443:8443" + tunnel — a browser terminal is a
    # shell on the box's docker network).
    ports:
      - "8443:8443"

    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      # Open this folder by default instead of the "welcome" screen.
      - DEFAULT_WORKSPACE=/workspace
      # tmux inside the container: detachable terminals for long claude
      # runs, so a browser-tab close or code-server reconnect hiccup
      # can't orphan your view of a session (the process itself never
      # depended on the tab — it's a child of the container).
      - DOCKER_MODS=linuxserver/mods:universal-package-install
      - INSTALL_PACKAGES=tmux

    extra_hosts:
      # Reach the sibling services (Ollama :11434, LiteLLM :4000,
      # Phoenix :6006, ...) from the integrated terminal / extensions.
      - "host.docker.internal:host-gateway"

    volumes:
      - /srv/docker/code-server/config:/config
      # Default project area. Container-scoped on purpose — Claude Code
      # in here can't touch /srv/docker compose stacks or the host.
      # Mount additional repos explicitly when you want them editable:
      #   - /home/noise/src/somerepo:/workspace/somerepo
      - /srv/docker/code-server/workspace:/workspace
85
pyinfra/framework/compose/code-server/README.md
Normal file
@@ -0,0 +1,85 @@
# code-server — VS Code in the browser

Full VS Code served from the box at <http://framework:8443>. The reason
it exists in this stack: **Claude Code runs inside the container**, so
long agent tasks execute on the Framework Desktop and survive the
laptop sleeping, the browser tab closing, or you walking away with the
phone. Any device on the tailnet gets the same editor with the same
state.

## Bring-up

```sh
cd /srv/docker/code-server && docker compose up -d
```

First start pulls the image and runs the universal-package-install mod
(installs tmux), so give it a minute before the UI answers on :8443.

> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel included) need a secure context; over plain
> `http://framework:8443` they render blank. Serve it via Tailscale:
> `sudo tailscale serve --bg --https=8443 8443` →
> `https://framework.<tailnet>.ts.net:8443`. Localhost also counts as
> secure (`ssh -L 8443:localhost:8443 framework`).
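
To predict whether a given URL will render webview panels, the secure-context rule from the note above can be applied by hand. This is a rough sketch, not the full browser algorithm: `is_secure_context` is a hypothetical helper, `framework.example.ts.net` is a placeholder tailnet name, and only the two cases named above (HTTPS, or plain HTTP to localhost) are modelled.

```python
from urllib.parse import urlsplit

# Hostnames treated as loopback here; an assumption covering the common cases.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_secure_context(url: str) -> bool:
    """Approximate the browser rule: HTTPS anywhere, or HTTP on loopback."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    return parts.scheme == "http" and parts.hostname in LOOPBACK_HOSTS


for candidate in (
    "http://framework:8443",                  # plain HTTP over the tailnet
    "http://localhost:8443",                  # via the ssh -L tunnel
    "https://framework.example.ts.net:8443",  # via tailscale serve (placeholder name)
):
    print(candidate, "->", is_secure_context(candidate))
```

Only the first URL comes back `False`, which matches the blank-panel symptom described above.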

## One-time setup (in the code-server UI)

1. **Install the Claude Code extension.** Extensions panel → search
   "Claude Code". code-server uses the Open VSX registry, where
   Anthropic publishes the official extension
   (<https://open-vsx.org/extension/Anthropic/claude-code>).
2. **Sign in.** The OAuth flow in a browser context is a copy/paste
   dance: the extension shows a URL → open it in another tab → approve
   → paste the code back. Credentials land in `/config/.claude/` and
   persist across container recreates.
3. **(Optional) CLI in the terminal.** The extension bundles its own
   CLI, but for tmux-based long runs install it explicitly:

   ```sh
   curl -fsSL https://claude.ai/install.sh | bash
   ```

   Lands in `~/.local/bin` (= `/config/.local/bin`, persisted).

## Long-running tasks

The Claude process is a child of the container, not of your browser
tab — closing the tab does nothing to it. Reattach from any device and
the session is where you left it. For multi-hour unattended runs,
prefer a tmux pane in the integrated terminal:

```sh
tmux new -s longtask
claude  # kick off the task, detach with C-b d
```

If you want to steer from a phone or claude.ai/code instead of this UI,
use `claude remote-control` (see the framework README's
"Claude Code on the box" section) — that works in the host's tmux too,
no container needed.

## Scope and security

- **No password is set** — same trust model as every other service on
  this Tailscale-only box. If the host ever sees LAN/internet traffic,
  set `HASHED_PASSWORD` in the compose env and bind to localhost; a
  browser terminal is a shell.
- The workspace is **container-scoped on purpose**: Claude Code in here
  sees `/workspace` and `/config`, not the host filesystem or the
  docker socket. Bind-mount specific repos into `/workspace/<name>`
  when you want them editable (host UID 1000 == container PUID 1000,
  so ownership just works).
- `host.docker.internal` reaches the sibling services — handy for
  pointing scripts at Ollama (:11434) or LiteLLM (:4000) from the
  integrated terminal.
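
As an illustration of the last bullet, a script inside the container can build sibling-service URLs from the ports listed in this stack. This is a minimal sketch: `sibling_url` and `SIBLING_PORTS` are hypothetical names, and the ports are simply the ones quoted in the compose comments.

```python
# Ports as quoted in this stack's compose comments.
SIBLING_PORTS = {"ollama": 11434, "litellm": 4000, "phoenix": 6006}


def sibling_url(service: str, path: str = "") -> str:
    """Build a URL that reaches a host-published sibling service from the container."""
    port = SIBLING_PORTS[service]
    return f"http://host.docker.internal:{port}{path}"


# /api/tags is Ollama's model-listing endpoint.
print(sibling_url("ollama", "/api/tags"))
print(sibling_url("litellm"))
```

The name only resolves because the compose file maps `host.docker.internal` to `host-gateway`; without that `extra_hosts` entry the lookup would fail on Linux.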

## State layout

| Path (host)                         | What lives there                                 |
| ----------------------------------- | ------------------------------------------------ |
| `/srv/docker/code-server/config`    | `$HOME`: extensions, settings, `~/.claude`, CLIs |
| `/srv/docker/code-server/workspace` | default project area (`DEFAULT_WORKSPACE`)       |

Back up `config` if you care about extension state and Claude session
history; everything else is reproducible from the repo.
64
pyinfra/framework/compose/coder.yml
Normal file
@@ -0,0 +1,64 @@
# Coder — self-hosted workspace manager (PILOT). Web dashboard at :7080
# that stamps out per-project dev containers from Terraform templates;
# each workspace gets browser code-server with the Claude Code extension
# pre-installed. Evaluating against the standalone code-server stack
# (compose/code-server.yml, kept as-is during the pilot) — verdict
# criteria in compose/coder/README.md.
#
# The server container holds the host docker socket and spawns workspace
# containers as siblings (same pattern + same security posture as
# OpenHands: fine Tailscale-only, never expose further). group_add needs
# the socket's host GID — host-specific, so it comes from the sibling
# .env, which pyinfra generates on the box on first deploy along with a
# random one-time Postgres password.
#
# Workspaces don't publish host ports — code-server and terminals are
# reached through the dashboard's tunnel. No port bookkeeping per
# project.
services:
  coder:
    # Pin and bump deliberately — Coder releases weekly and the workspace
    # provisioner/agent protocol moves with it. Verify at
    # https://github.com/coder/coder/releases.
    image: ghcr.io/coder/coder:v2.33.8
    container_name: coder
    restart: unless-stopped
    ports:
      - "7080:7080"
    environment:
      CODER_HTTP_ADDRESS: "0.0.0.0:7080"
      # The URL clients are told to use — tailnet MagicDNS name, same
      # convention as every other service tile.
      CODER_ACCESS_URL: "${CODER_ACCESS_URL}"
      CODER_PG_CONNECTION_URL: "postgresql://coder:${POSTGRES_PASSWORD}@database/coder?sslmode=disable"
    group_add:
      # GID of /var/run/docker.sock — generated into .env by deploy.py.
      - "${DOCKER_GROUP_ID}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # Repo-shipped Terraform templates (source of truth:
      # pyinfra/framework/compose/coder/templates/). Push after changes:
      #   docker compose exec coder coder templates push code-server \
      #     --directory /templates/code-server --yes
      - /srv/docker/coder/templates:/templates:ro
    depends_on:
      database:
        condition: service_healthy

  database:
    image: postgres:17
    container_name: coder-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: coder
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: coder
    volumes:
      # Entrypoint starts as root and chowns this to the postgres uid;
      # deploy.py just creates the mount point.
      - /srv/docker/coder/postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U coder -d coder"]
      interval: 5s
      timeout: 5s
      retries: 5
110
pyinfra/framework/compose/coder/README.md
Normal file
@@ -0,0 +1,110 @@
# Coder — workspace manager (pilot)

Self-hosted control plane at <http://framework:7080> that creates
per-project dev containers from Terraform templates. Each workspace:
its own container, browser code-server with the Claude Code extension
pre-installed, no host-port bookkeeping (everything tunnels through the
dashboard), and idle autostop so parked projects give their RAM back to
the inference stacks.

**Pilot status.** Evaluating against the standalone code-server stack
(`/srv/docker/code-server`, kept as-is meanwhile). Verdict after a week
of real use: does the dashboard + autostop + create-from-template earn
an always-on control plane + Postgres? If yes, the standalone stack
retires; if no, Coder goes and a thin `workspace` spawner script
replaces it.

## Bring-up

```sh
cd /srv/docker/coder && docker compose up -d
```

The sibling `.env` (DOCKER_GROUP_ID, CODER_ACCESS_URL, random Postgres
password) is generated by deploy.py on first `./run.sh` — no hand-fill
needed. First visit to <http://framework:7080> creates the admin
account (pick anything; it's local to the box).
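
For orientation, the three values in that `.env` could be produced along the following lines. This is only a sketch of the idea, not the contents of deploy.py (which is not shown here): `render_env`, `socket_gid`, and `new_password` are hypothetical helpers, and the GID `999` in the demo call is a placeholder.

```python
import os
import secrets


def socket_gid(path: str = "/var/run/docker.sock") -> int:
    """GID that owns the docker socket on this host (what group_add needs)."""
    return os.stat(path).st_gid


def new_password() -> str:
    # token_urlsafe only emits [A-Za-z0-9_-], so the value needs no
    # percent-encoding inside the postgresql:// connection URL.
    return secrets.token_urlsafe(24)


def render_env(docker_gid: int, access_url: str, password: str) -> str:
    """Render the .env body using the variable names the compose file reads."""
    lines = [
        f"DOCKER_GROUP_ID={docker_gid}",
        f"CODER_ACCESS_URL={access_url}",
        f"POSTGRES_PASSWORD={password}",
    ]
    return "\n".join(lines) + "\n"


# Demo with a placeholder GID; on the box you would pass socket_gid() instead.
print(render_env(999, "http://framework:7080", new_password()), end="")
```

Generating the password once and never rotating it matters here: the Postgres data directory remembers the first password, so a regenerated `.env` would lock the server out of its own database.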

> **HTTPS is required for extension panels.** VS Code webviews (the
> Claude Code panel, markdown preview, etc.) run on service workers,
> which browsers only allow in a secure context — over plain
> `http://framework:7080` the editor works but webview panels render
> blank. Fix via Tailscale Serve (real Let's Encrypt cert for the
> tailnet name; enable "HTTPS Certificates" once in the Tailscale
> admin console):
>
> ```sh
> sudo tailscale serve --bg 7080
> # then in .env: CODER_ACCESS_URL=https://framework.<tailnet>.ts.net
> docker compose up -d  # recreate server with new access URL
> # restart any existing workspace — app URLs derive from the access URL
> ```
>
> Localhost is also a secure context, so
> `ssh -L 7080:localhost:7080 framework` + http://localhost:7080 works
> in a pinch.

## Push the template (one-time + after edits)

The `coder` CLI ships inside the server image; authenticate it once:

```sh
docker compose exec coder coder login http://localhost:7080
# prints a /cli-auth URL — open http://framework:7080/cli-auth in your
# browser, copy the session token, paste it back
```

Then push:

```sh
docker compose exec coder coder templates push code-server \
  --directory /templates/code-server --yes
```

Template source of truth is the repo
(`pyinfra/framework/compose/coder/templates/code-server/main.tf`) —
edit there, `./run.sh`, re-push. Edits on the box get overwritten.

## First workspace

Dashboard → Workspaces → Create → `code-server` template → name it
after the project. Open the code-server app tile, then one-time
Claude sign-in: the extension shows an OAuth URL → open in another tab
→ approve → paste the code back. Credentials live in `~/.claude` on
the workspace's home volume and survive stop/start and rebuilds.

Set **idle autostop** under Template → Settings → Schedule (suggest
1–2 h inactivity). Activity = open code-server tab, SSH, web terminal.

## What persists where

| Thing                                                  | Where                                               |
| ------------------------------------------------------ | --------------------------------------------------- |
| Workspace home (repos, ~/.claude, extensions)          | named volume `coder-<workspace-id>-home`            |
| Control-plane state (users, templates, workspace defs) | Postgres → `/srv/docker/coder/postgres`             |
| Template source                                        | this repo, shipped to `/srv/docker/coder/templates` |

A stopped workspace's container is deleted; only the home volume
remains. `docker volume ls | grep coder-` to audit.
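
The audit can also be scripted. The sketch below filters a list of volume names down to workspace home volumes using the `coder-<workspace-id>-home` naming from the table above. `coder_home_volumes` is a hypothetical helper and the sample names are made up; on the box the input could come from `docker volume ls --format '{{.Name}}'`.

```python
import re

# Matches the naming scheme the template uses for home volumes.
HOME_VOLUME = re.compile(r"^coder-(?P<workspace_id>.+)-home$")


def coder_home_volumes(volume_names):
    """Map workspace id -> volume name for every Coder home volume in the list."""
    found = {}
    for raw in volume_names:
        name = raw.strip()
        match = HOME_VOLUME.match(name)
        if match:
            found[match.group("workspace_id")] = name
    return found


# Made-up sample; real names carry the workspace UUID.
sample = ["coder-1f2e3d-home", "ollama_data", "coder-abc-home\n"]
print(coder_home_volumes(sample))
```

Comparing the resulting ids against the workspaces listed in the dashboard shows which volumes belong to deleted workspaces and can be removed.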

## Security

Same posture as OpenHands: the server container holds the docker
socket and spawns code-running containers — root-equivalent on the
box. Tailscale-only exposure is the mitigation; never forward :7080
anywhere else. Coder does have real auth (the admin account), but
treat that as defense-in-depth, not as permission to expose it.

## Notes

- Workspaces reach the sibling services via `host.docker.internal`
  (Ollama :11434, LiteLLM :4000, Phoenix :6006, ...).
- Long Claude runs: same rules as anywhere — the process lives in the
  workspace container, so it survives laptop/browser disconnects, but
  **autostop will kill an idle-looking workspace mid-run**. For
  multi-hour unattended tasks either bump the workspace's TTL in the
  dashboard or use the host-side tmux + `claude remote-control`
  pattern (framework README, "Claude Code on the box").
- The base image is `codercom/enterprise-base:ubuntu` (sudo-enabled,
  common toolchain). Per-project images are a later refinement —
  swap the `image` in main.tf or parameterize with `coder_parameter`.
121
pyinfra/framework/compose/coder/templates/code-server/main.tf
Normal file
@@ -0,0 +1,121 @@
# code-server workspace template — one dev container per project, with
# browser VS Code and Claude Code (extension + CLI) ready on first open.
#
# Source of truth is the repo copy at
# pyinfra/framework/compose/coder/templates/code-server/main.tf; pyinfra
# ships it to /srv/docker/coder/templates/, mounted read-only into the
# server container at /templates. Push after edits:
#   cd /srv/docker/coder
#   docker compose exec coder coder templates push code-server \
#     --directory /templates/code-server --yes

terraform {
  required_providers {
    coder = {
      source = "coder/coder"
    }
    docker = {
      source = "kreuzwerker/docker"
    }
  }
}

provider "coder" {}

# Talks to the host daemon via the socket mounted into the server
# container (default unix:///var/run/docker.sock) — workspace containers
# are siblings of the compose stacks, not children.
provider "docker" {}

data "coder_workspace" "me" {}
data "coder_workspace_owner" "me" {}

resource "coder_agent" "main" {
  arch = "amd64"
  os   = "linux"

  # Claude Code CLI — native installer, lands in ~/.local/bin inside the
  # persisted home volume, so this is a no-op after the first start. The
  # code-server extension bundles its own CLI, but having `claude` on
  # PATH enables tmux-based long runs in the workspace terminal.
  startup_script = <<-EOT
    set -e
    command -v claude >/dev/null 2>&1 || curl -fsSL https://claude.ai/install.sh | bash
  EOT

  env = {
    GIT_AUTHOR_NAME     = data.coder_workspace_owner.me.full_name
    GIT_AUTHOR_EMAIL    = data.coder_workspace_owner.me.email
    GIT_COMMITTER_NAME  = data.coder_workspace_owner.me.full_name
    GIT_COMMITTER_EMAIL = data.coder_workspace_owner.me.email
  }

  metadata {
    display_name = "CPU"
    key          = "cpu"
    script       = "coder stat cpu"
    interval     = 10
    timeout      = 1
  }
  metadata {
    display_name = "RAM"
    key          = "mem"
    script       = "coder stat mem"
    interval     = 10
    timeout      = 1
  }
}

# Browser VS Code inside the workspace, surfaced as a dashboard app.
# Extensions install from Open VSX — anthropic.claude-code is the
# official Claude Code extension
# (https://open-vsx.org/extension/Anthropic/claude-code). OAuth creds
# land in ~/.claude inside the home volume and survive rebuilds.
module "code_server" {
  count    = data.coder_workspace.me.start_count
  source   = "registry.coder.com/coder/code-server/coder"
  version  = "~> 1.0"
  agent_id = coder_agent.main.id
  folder   = "/home/coder/project"

  extensions = [
    "anthropic.claude-code",
  ]
}

# Home survives workspace stop/start AND template-driven rebuilds —
# ignore_changes keeps Terraform from recreating the volume (and wiping
# ~/.claude, extensions, repos) when template metadata shifts.
resource "docker_volume" "home" {
  name = "coder-${data.coder_workspace.me.id}-home"
  lifecycle {
    ignore_changes = all
  }
}

resource "docker_container" "workspace" {
  # start_count is 0 when the workspace is stopped — the container is
  # deleted but the home volume above persists. This is what idle
  # autostop reclaims: RAM back to the inference stacks.
  count    = data.coder_workspace.me.start_count
  image    = "codercom/enterprise-base:ubuntu"
  name     = "coder-${data.coder_workspace_owner.me.name}-${lower(data.coder_workspace.me.name)}"
  hostname = data.coder_workspace.me.name

  # Agent bootstrap, rewritten to reach the control plane through the
  # docker bridge (the workspace is a sibling container; "localhost"
  # inside it isn't the Coder server).
  entrypoint = ["sh", "-c", replace(coder_agent.main.init_script, "/localhost|127\\.0\\.0\\.1/", "host.docker.internal")]
  env        = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]

  host {
    host = "host.docker.internal"
    ip   = "host-gateway"
  }

  volumes {
    container_path = "/home/coder"
    volume_name    = docker_volume.home.name
    read_only      = false
  }
}
@@ -95,6 +95,28 @@
        server: localhost-docker
        container: openhands

    - code-server:
        icon: code-server.svg
        href: http://framework:8443
        description: VS Code in the browser — Claude Code long tasks live here
        server: localhost-docker
        container: code-server

    - Coder:
        icon: coder.svg
        href: http://framework:7080
        description: Workspace manager (pilot) — per-project containers from templates
        server: localhost-docker
        container: coder
        # /api/v2/buildinfo is unauthenticated — cheap liveness + version.
        widget:
          type: customapi
          url: http://host.docker.internal:7080/api/v2/buildinfo
          refreshInterval: 30000
          mappings:
            - field: version
              label: Version

- Observability:
    - Beszel:
        icon: beszel.svg
102
pyinfra/framework/compose/qwable.yml
Normal file
@@ -0,0 +1,102 @@
# Qwable-3.6-27B (Qwen3.6-27B fine-tuned on Fable-5-style reasoning
# traces — "Qwen + Fable") via the kyuz0 rocm-7.2.2 Strix Halo toolbox.
# Same image + unified-memory recipe as compose/llama.yml; deltas are
# model path, port, alias.
# https://github.com/kyuz0/amd-strix-halo-toolboxes
# Model: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b (MIT)
#
# What it's for. A "thinks-like-Fable-5" interactive model — structured,
# step-by-step explanatory output. Dense 27B (NOT MoE), so it's slower
# per token than the 30B-A3B MoE workhorses despite being smaller on
# disk: all 27B weights load per token. Bandwidth math: 256 GB/s ÷
# ~16.5 GB ≈ 15.5 tok/s ceiling, so ~10-15 tok/s decode in practice.
# Interactive but not snappy.
#
# Coexistence. At ~16.5 GB (Q4_K_M) it's the smallest GPU resident here
# and fits alongside llama 30B (port 8080), Ollama, or Kimi in the
# ~110 GB merged arena. It does NOT fit alongside qwen3-235b (88.8 GB)
# or comfyui — swap-model tears those down for the `qwable` target.
# `restart: "no"`: you bring it up deliberately via swap-model, it won't
# auto-start after a reboot and surprise-collide with a big model.
#
# Weights. Single-file GGUF (not sharded). Download path on the box
# (see compose/qwable/README.md):
#   hf download Mia-AiLab/Qwable-3.6-27b \
#     'Qwable-27b_Q4_K_M.gguf' \
#     --local-dir /models/qwen/Qwable-3.6-27b
# Verify exact filename in the HF repo before downloading.
#
# Port 8082 — distinct from llama 30B (8080) and qwen3-235b (8081).
services:
  qwable:
    image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2
    container_name: qwable
    # Manual start only — see header note about GPU contention with
    # the big models. swap-model brings it up/down.
    restart: "no"
    devices:
      # ROCm needs both kfd (kernel fusion driver) and dri (DRM); Vulkan
      # only needs dri. Don't drop kfd when on the rocm-* tag.
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    # Numeric GIDs of host's video (44) and render (991) groups —
    # required for /dev/kfd + /dev/dri access from inside the container.
    group_add:
      - "44"
      - "991"
    shm_size: 8g
    ipc: host
    environment:
      # Unified-memory recipe (same as compose/llama.yml + kimi-linear +
      # qwen3-235b). BIOS UMA=0.5 GB + ttm.pages_limit cmdline → these
      # flags merge the rocminfo pools into one ~110 GB arena. kyuz0's
      # image is native gfx1151 so no HSA_OVERRIDE_GFX_VERSION.
      - HSA_XNACK=1
      - HSA_FORCE_FINE_GRAIN_PCIE=1
    volumes:
      - /models:/models:ro
    ports:
      - "8082:8082"
    entrypoint: ["llama-server"]
    command:
      - --model
      - /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf
      # OpenAI-compatible served name (matches what opencode/curl request
      # as "model"). Provider-side name lives in opencode.json if/when
      # this gets wired as a provider.
      - --alias
      - qwable
      - --host
      - 0.0.0.0
      - --port
      - "8082"
      # Push all layers to GPU. "999" = all available. A 27B Q4 (~16.5 GB)
      # fits the merged arena with huge headroom.
      - --n-gpu-layers
      - "999"
      # 64K to match llama/qwen3-235b — keeps opencode auto-compaction
      # behaviour consistent across providers. Tons of arena headroom
      # here (model is small), so this can ramp far higher if a workflow
      # needs it; see compose/qwable/README.md.
      - --ctx-size
      - "65536"
      # No-mmap is the Strix Halo standard — forces full GPU load.
      - --no-mmap
      # Flash attention — required for q8_0 KV cache; modern llama-server
      # takes a value (on/off/auto), bare --flash-attn is deprecated.
      - --flash-attn
      - "on"
      # Quantize KV cache to 8-bit (q8_0): roughly halves KV memory versus
      # the f16 default at minor/no quality loss. Matches the other
      # llama.cpp stacks.
      - --cache-type-k
      - q8_0
      - --cache-type-v
      - q8_0
      # Use the model's embedded jinja chat template — Qwable inherits
      # Qwen3.6's chat format, which the Fable-trace fine-tune relies on.
      - --jinja
      # Expose Prometheus metrics at /metrics — scraped by OpenLIT.
      - --metrics
130
pyinfra/framework/compose/qwable/README.md
Normal file
@@ -0,0 +1,130 @@
# qwable

Qwable-3.6-27B on Strix Halo via `kyuz0:rocm-7.2.2`. A full fine-tune
of Qwen3.6-27B trained on **Fable-5-style reasoning traces** — the dev
collected examples formatted like Fable 5's deliberate, step-by-step
answers and trained Qwen to reproduce that structured, explanatory
output. Think of it as a local "thinks-like-Fable" model.

OpenAI-compatible endpoint at `http://framework:8082` once running.

## Dense, not MoE (read first)

Despite being smaller on disk than the 30B-A3B workhorses, Qwable is a
**dense** 27B — every weight loads per token. On this bandwidth-bound
box (256 GB/s ÷ ~16.5 GB ≈ 15.5 tok/s ceiling) that's **~10-15 tok/s**
decode in practice, slower than the MoE 30B (~100 tok/s, only ~3B
active). So Qwable is for when you specifically want the Fable-style
reasoning, not for raw throughput.
The interactive daily driver stays on Ollama / llama 30B.
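
The bandwidth argument is a single division, sketched here so the numbers can be re-run for other quantizations. It is an upper bound under a simplifying assumption (decode reads every resident weight once per token and nothing else limits throughput); `decode_ceiling_tok_s` is a hypothetical helper name.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    return bandwidth_gb_s / gb_read_per_token


# Dense model: the whole ~16.5 GB Q4_K_M file is read for every token.
ceiling = decode_ceiling_tok_s(256, 16.5)
print(f"dense 27B Q4 ceiling: {ceiling:.1f} tok/s")  # prints 15.5
```

Real decode lands below the ceiling because KV-cache reads and kernel overhead also consume bandwidth, which is why the quoted band is 10-15 tok/s. An MoE model divides by the active weights only, hence the much higher figure for the 30B-A3B.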

## Coexistence notes

At ~16.5 GB (Q4_K_M) Qwable is the smallest GPU resident here:

| Concurrent service | Coexists? |
|---|---|
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes (~35 GB total) |
| `ollama` (11434) | ✅ yes |
| `kimi-linear` (vLLM, 8000) | ✅ yes (~47 GB total) |
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — too tight, swap-model stops it |
| `comfyui` (8188) | ❌ no — swap-model stops it |

`restart: "no"`: you bring it up deliberately (via `swap-model qwable`),
it won't auto-start after a reboot and surprise-collide with a big model.
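
The table boils down to a budget check against the ~110 GB arena. The sketch below reproduces its verdicts under one loud assumption: a 10 GB reserve for KV cache and compute buffers, which is an illustrative figure and not a measured one. `fits` is a hypothetical helper, not part of `swap-model`.

```python
ARENA_GB = 110.0  # approximate merged arena quoted in this stack


def fits(resident_gb, arena_gb: float = ARENA_GB, headroom_gb: float = 10.0) -> bool:
    """True if the listed weights plus a reserve fit in the arena.

    headroom_gb is an assumed reserve for KV cache and compute buffers.
    """
    return sum(resident_gb) + headroom_gb <= arena_gb


print(fits([16.5, 88.8]))  # qwable + qwen3-235b: 105.3 GB of weights alone
print(fits([35.0]))        # qwable + llama, combined total from the table
print(fits([47.0]))        # qwable + kimi-linear, combined total from the table
```

The first pairing fails even though the raw weights sum to less than 110 GB, which is the "too tight" case in the table.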

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/qwable/` with right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
  Verify: `cat /proc/cmdline | grep ttm.pages_limit`.
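
To see what that limit means in memory terms, the page count can be converted to GiB. This is a small sketch assuming the usual 4 KiB page size on x86-64; `ttm_pages_limit` and `pages_to_gib` are hypothetical helpers, and the sample command line is made up apart from the `ttm.pages_limit` value.

```python
import re

PAGE_SIZE = 4096  # bytes; assumes standard 4 KiB pages


def ttm_pages_limit(cmdline: str):
    """Extract ttm.pages_limit from a kernel command line, or None if absent."""
    match = re.search(r"(?:^|\s)ttm\.pages_limit=(\d+)", cmdline)
    return int(match.group(1)) if match else None


def pages_to_gib(pages: int) -> float:
    return pages * PAGE_SIZE / 2**30


# Made-up command line; on the box, read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ttm.pages_limit=33554432 quiet"
pages = ttm_pages_limit(sample)
print(pages, "pages =", pages_to_gib(pages), "GiB")
```

33554432 pages works out to 128 GiB, so the parameter lifts the TTM cap to the full physical memory and leaves the usable arena to be decided by what the OS keeps for itself.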

## Download weights (~16.5 GB, single file)

```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Qwable-3.6-27b

hf download Mia-AiLab/Qwable-3.6-27b \
  'Qwable-27b_Q4_K_M.gguf' \
  --local-dir /models/qwen/Qwable-3.6-27b

# File lands at:
#   /models/qwen/Qwable-3.6-27b/Qwable-27b_Q4_K_M.gguf  (~16.5 GB)
```

Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
needs ~17 GB free on `/models`.

> Abliterated variant (refusals removed) lives at
> `huihui-ai/Huihui-Qwable-3.6-27b-abliterated-GGUF`
> (`Huihui-Qwable-3.6-27b-abliterated-Q4_K_M_Q8.gguf`, ~18.3 GB).
> Not the default — no safety filtering, careful with it.

## Bring up

Easy path — `swap-model` handles stop-conflicting-services + waits for
`/health`:

```sh
ssh framework swap-model qwable            # ~1-2 min cold load (16.5 GB)
ssh framework /srv/docker/qwable/smoke.sh  # perf measure
```

Manual equivalent (first-ever bring-up, before the image is cached):

```sh
cd /srv/docker/qwable
docker compose pull     # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f  # wait for "server is listening on http://0.0.0.0:8082"

./smoke.sh              # /health + tiny generation + perf
```

First start is ~1-2 min (16.5 GB load off disk; much faster than the
235B). If `./smoke.sh` reports `predicted_per_second` in the 10-15 tok/s
band, it's healthy. <6 tok/s = investigate (likely arena < 100 GB — see
qwen3-235b/README.md "Troubleshooting" for the arena checks).
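
Those thresholds can be turned into a quick triage on the `timings` object that llama-server returns from `/completion`. This is a sketch: `classify_decode` and `classify_response` are hypothetical helpers, and the "below band" label for 6-10 tok/s is an assumption, since the text above only defines the healthy band and the investigate threshold.

```python
def classify_decode(tok_s: float) -> str:
    """Bucket a decode rate using the thresholds quoted above."""
    if tok_s < 6:
        return "investigate"
    if tok_s < 10:
        return "below band"  # assumed label; not defined in the text above
    return "healthy"


def classify_response(response: dict) -> str:
    """Classify a parsed /completion response by its predicted_per_second."""
    rate = response.get("timings", {}).get("predicted_per_second")
    if not isinstance(rate, (int, float)):
        return "no timings"
    return classify_decode(rate)


# Made-up sample responses.
print(classify_response({"timings": {"predicted_per_second": 12.4}}))
print(classify_response({"timings": {"predicted_per_second": 4.1}}))
```

A missing `timings` field is reported separately instead of being treated as slow, so a schema change is not mistaken for a performance problem.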

## Ramping context

Defaults to 64K to match the other llama.cpp stacks (keeps opencode
auto-compaction consistent across providers). The model is tiny relative
to the arena, so there's plenty of room to push higher:

| Stage | `--ctx-size` | Margin in arena |
|---|---|---|
| **Current default** | **65536** | huge (~90 GB free) |
| Stretch | 131072+ | still comfortable |

Edit `--ctx-size` in `docker-compose.yml`, run
`docker compose down && docker compose up -d`, then re-run `./smoke.sh`.
The real ceiling is Qwable's trained context length
(inherits Qwen3.6-27B's), not arena memory — verify the model's max
positions before going past 128K.
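
For a feel of how KV-cache memory scales with `--ctx-size`, the standard estimate is sketched below for a q8_0 cache. The layer count, KV-head count, and head dimension are placeholders and not Qwable's real architecture (read those from the GGUF metadata); the formula also assumes ordinary full attention in every layer, which may overestimate for hybrid designs.

```python
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes per 32 elements.
Q8_0_BLOCK_BYTES = 34
Q8_0_BLOCK_ELEMS = 32


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, ctx_size: int) -> int:
    """Estimate q8_0 KV-cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_size
    return elements * Q8_0_BLOCK_BYTES // Q8_0_BLOCK_ELEMS


# Placeholder dimensions for illustration only.
for ctx in (65536, 131072):
    gib = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_size=ctx) / 2**30
    print(f"ctx={ctx}: ~{gib:.1f} GiB KV cache")
```

The cost is linear in context, so doubling `--ctx-size` doubles the cache; with the placeholder dimensions even 128K stays far below the free arena, consistent with the table above.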

## Operations

```sh
docker compose logs -f           # tail
docker compose down              # stop
docker compose exec qwable bash  # shell in
./smoke.sh                       # health + perf
amdgpu_top                       # GPU view on host
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`) |
| Weights | `Mia-AiLab/Qwable-3.6-27b` → `Qwable-27b_Q4_K_M.gguf` (~16.5 GB) |
| Default port | 8082 |
| Default context | 65536 |
| KV cache type | q8_0 (k and v) |
| License | MIT (model); Qwen3.6-27B base license also applies |

## Status

Compose artifacts written; awaiting box-side weight pull + bring-up.
Wired as a `swap-model qwable` target. Wire as an opencode/LiteLLM
provider only if the Fable-style reasoning proves useful in practice.
46
pyinfra/framework/compose/qwable/smoke.sh
Executable file
@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Smoke-test the running qwable llama-server (port 8082). Hits /health
# for liveness, then a tiny OpenAI-compatible chat completion, then
# measures eval_tps via /completion. Dense 27B → expect ~10-15 tok/s.
set -euo pipefail

HOST="${QWABLE_HOST:-127.0.0.1:8082}"
MODEL="${QWABLE_MODEL:-qwable}"

echo "[smoke] GET /health on $HOST"
curl -fsS "http://$HOST/health" | python3 -m json.tool

echo
echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
curl -fsS "http://$HOST/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
    \"max_tokens\": 16,
    \"temperature\": 0.0
  }" | python3 -m json.tool

echo
echo "[smoke] perf measure — eval_tps and prompt_tps (n_predict=128)"
# 128 tokens — at ~10-15 tok/s the per-token warmup noise still matters,
# but a dense 27B settles faster than the 235B so we don't need 64-only.
curl -fsS "http://$HOST/completion" \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
    "n_predict": 128,
    "temperature": 0.0,
    "stream": false
  }' | python3 -c "
import json, sys
r = json.load(sys.stdin)
t = r.get('timings', {})
def rate(key):
    # Fall back to a placeholder when the key is missing, so the float format cannot raise.
    v = t.get(key)
    return f'{v:.2f} tok/s' if isinstance(v, (int, float)) else '?'
print(f'predicted_per_second: {rate(\"predicted_per_second\")}')
print(f'prompt_per_second: {rate(\"prompt_per_second\")}')
print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}')
print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}')
"

echo
echo "[smoke] passed — expected band 10-15 tok/s decode (dense 27B Q4)"
@@ -53,6 +53,15 @@ Disk: needs ~90 GB free on `/models`. Pull is bandwidth-bound; expect

## Bring up (M0.2 — first generation)

Easy path — `swap-model` handles the stop-conflicting-services dance + waits for `/health`:

```sh
ssh framework swap-model 235b                  # ~3-5 min on first cold load
ssh framework /srv/docker/qwen3-235b/smoke.sh  # perf measure
```

Manual equivalent (for first-ever bring-up before the image is cached):

```sh
cd /srv/docker/qwen3-235b
docker compose pull  # already-cached image if you ran llama first