pyinfra/framework/compose/ornith/README.md

# ornith

Ornith-1.0-35B on Strix Halo via `kyuz0:rocm-7.2.2`. DeepReinforce's
MIT-licensed **agentic-coding** model — a self-improving RL fine-tune of
**Qwen3.5-35B-A3B** that co-trains its own task scaffolds with the policy.
Strong on Terminal-Bench 2.1 / SWE-Bench Verified, emits OpenAI-style
`tool_calls`, opens each answer with a `<think>` reasoning block.

OpenAI-compatible endpoint at `http://framework:8083` once running.

## MoE, not dense (read first — this is why it's worth a slot)

Despite "35B" in the name, Ornith-1.0-35B is **MoE with only ~3B active
params per token** (256 routed experts, 8 active + a shared expert, 40
layers). On this bandwidth-bound box (256 GB/s) decode speed tracks
*active* params, so it runs like the 30B-A3B workhorse (**~80-100 tok/s**),
not like a dense 27/31B (~10-15 tok/s). That's the whole point: near
frontier-class agentic-coding quality at interactive speed. Candidate to
replace `qwen3-coder:30b` (Ollama) as the opencode daily driver — A/B
before promoting.

## Quant choice moves speed here

For MoE, decode bandwidth ∝ *active bytes per token*, so quant tier
changes t/s (~2x across the range), unlike a model where everything is
read every token:

| Quant | Size | When |
|---|---|---|
| **Q4_K_M** | **21.2 GB** | **default** — fastest, huge arena headroom |
| Q6_K | 28.5 GB | bump here only if Q4 quality disappoints (~slower) |
| Q8_0 | 36.9 GB | max quality, ~half the decode speed — rarely worth it for A3B |

## Coexistence notes

At ~21.2 GB (Q4_K_M) Ornith fits the merged arena easily:

| Concurrent service | Coexists? |
|---|---|
| `llama` (Qwen3-Coder-30B, 8080) | ✅ yes |
| `ollama` (11434) | ✅ yes |
| `kimi-linear` (vLLM, 8000) | ✅ yes |
| `qwable` (8082) | ✅ yes (~38 GB total) |
| `qwen3-235b` (88.8 GB, 8081) | ❌ no — swap-model stops it |
| `comfyui` (8188) | ❌ no — swap-model stops it |

`restart: "no"`: you bring it up deliberately (via `swap-model ornith`),
it won't auto-start after a reboot and surprise-collide with a big model.

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/ornith/` with right perms).
- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
  Verify: `cat /proc/cmdline | grep ttm.pages_limit`.

## Download weights (~21.2 GB, single file)

```sh
# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Ornith-1.0-35B

hf download deepreinforce-ai/Ornith-1.0-35B-GGUF \
    'ornith-1.0-35b-Q4_K_M.gguf' \
    --local-dir /models/qwen/Ornith-1.0-35B

# File lands at:
#   /models/qwen/Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf  (~21.2 GB)
```

Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
needs ~22 GB free on `/models`. Verify the exact filename in the HF repo
before downloading (casing matters).

## Bring up

Easy path — `swap-model` handles stop-conflicting-services + waits for
`/health`:

```sh
ssh framework swap-model ornith     # ~1-2 min cold load (21.2 GB)
ssh framework /srv/docker/ornith/smoke.sh    # /health + perf
```

Manual equivalent (first-ever bring-up, before the image is cached):

```sh
cd /srv/docker/ornith
docker compose pull       # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8083"

./smoke.sh                # /health + tiny generation + perf
```

If `./smoke.sh` reports `predicted_per_second` in the ~80-100 tok/s band,
it's healthy. <30 tok/s = investigate (likely arena < 100 GB — see
qwen3-235b/README.md "Troubleshooting" for the arena checks).

## Reasoning + tool calls

Ornith emits a `<think>...</think>` block before the final answer and
OpenAI-style `tool_calls`. `--jinja` (set in the compose file) uses the
model's embedded Qwen3.5 chat template, which both rely on. If opencode
shows raw `<think>` content in responses, the box's llama.cpp build is
too old to split reasoning — bump the `kyuz0` image tag or add the
build's reasoning-format flag. Recommended sampling (set server-side):
temp 0.6 / top_p 0.95 / top_k 20.

## Ramping context

Defaults to 64K to match the other llama.cpp stacks (keeps opencode
auto-compaction consistent across providers). Ornith's native context is
262144, and the model is small relative to the arena, so there's room to
push far higher:

| Stage | `--ctx-size` | Margin in arena |
|---|---|---|
| **Current default** | **65536** | huge |
| Stretch | 131072 | comfortable |
| Native max | 262144 | watch KV cache size (q8_0 KV helps) |

Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`,
re-run `./smoke.sh`.

## Operations

```sh
docker compose logs -f                  # tail
docker compose down                     # stop
docker compose exec ornith bash         # shell in
./smoke.sh                              # health + perf
amdgpu_top                              # GPU view on host
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`/`qwable`) |
| Weights | `deepreinforce-ai/Ornith-1.0-35B-GGUF` → `ornith-1.0-35b-Q4_K_M.gguf` (~21.2 GB) |
| Base | Qwen3.5-35B-A3B (MoE: 256 experts, 8 active + shared, 40 layers) |
| Default port | 8083 |
| Default context | 65536 (native 262144) |
| KV cache type | q8_0 (k and v) |
| License | MIT (model); Qwen3.5 base license also applies |

## Status

Compose artifacts written; awaiting box-side weight pull + bring-up.
Wired as a `swap-model ornith` target and as the `framework-ornith`
opencode provider. A/B against `qwen3-coder:30b`; promote to opencode
default if the agentic-coding quality proves out.
added qwable and orinth 2026-06-26 11:33:35 -04:00			`# ornith`

			Ornith-1.0-35B on Strix Halo via `kyuz0:rocm-7.2.2`. DeepReinforce's
			`MIT-licensed agentic-coding model — a self-improving RL fine-tune of`
			`Qwen3.5-35B-A3B that co-trains its own task scaffolds with the policy.`
			`Strong on Terminal-Bench 2.1 / SWE-Bench Verified, emits OpenAI-style`
			`tool_calls`, opens each answer with a `<think>` reasoning block.

			OpenAI-compatible endpoint at `http://framework:8083` once running.

			`## MoE, not dense (read first — this is why it's worth a slot)`

			`Despite "35B" in the name, Ornith-1.0-35B is **MoE with only ~3B active`
			`params per token** (256 routed experts, 8 active + a shared expert, 40`
			`layers). On this bandwidth-bound box (256 GB/s) decode speed tracks`
			`active params, so it runs like the 30B-A3B workhorse (~80-100 tok/s),`
			`not like a dense 27/31B (~10-15 tok/s). That's the whole point: near`
			`frontier-class agentic-coding quality at interactive speed. Candidate to`
			replace `qwen3-coder:30b` (Ollama) as the opencode daily driver — A/B
			`before promoting.`

			`## Quant choice moves speed here`

			`For MoE, decode bandwidth ∝ active bytes per token, so quant tier`
			`changes t/s (~2x across the range), unlike a model where everything is`
			`read every token:`

			`\| Quant \| Size \| When \|`
			`\|---\|---\|---\|`
			`\| Q4_K_M \| 21.2 GB \| default — fastest, huge arena headroom \|`
			`\| Q6_K \| 28.5 GB \| bump here only if Q4 quality disappoints (~slower) \|`
			`\| Q8_0 \| 36.9 GB \| max quality, ~half the decode speed — rarely worth it for A3B \|`

			`## Coexistence notes`

			`At ~21.2 GB (Q4_K_M) Ornith fits the merged arena easily:`

			`\| Concurrent service \| Coexists? \|`
			`\|---\|---\|`
			\| `llama` (Qwen3-Coder-30B, 8080) \| ✅ yes \|
			\| `ollama` (11434) \| ✅ yes \|
			\| `kimi-linear` (vLLM, 8000) \| ✅ yes \|
			\| `qwable` (8082) \| ✅ yes (~38 GB total) \|
			\| `qwen3-235b` (88.8 GB, 8081) \| ❌ no — swap-model stops it \|
			\| `comfyui` (8188) \| ❌ no — swap-model stops it \|

			`restart: "no"`: you bring it up deliberately (via `swap-model ornith`),
			`it won't auto-start after a reboot and surprise-collide with a big model.`

			`## Prereqs`

			- Pyinfra deploy has run (creates `/srv/docker/ornith/` with right perms).
			- BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active.
			Verify: `cat /proc/cmdline \| grep ttm.pages_limit`.

			`## Download weights (~21.2 GB, single file)`

			```sh
			`# /models/qwen exists via pyinfra; just create the model subdir.`
			`mkdir -p /models/qwen/Ornith-1.0-35B`

			`hf download deepreinforce-ai/Ornith-1.0-35B-GGUF \`
			`'ornith-1.0-35b-Q4_K_M.gguf' \`
			`--local-dir /models/qwen/Ornith-1.0-35B`

			`# File lands at:`
			`# /models/qwen/Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf (~21.2 GB)`
			```

			Single-file GGUF (not sharded) — point `--model` straight at it. Disk:
			needs ~22 GB free on `/models`. Verify the exact filename in the HF repo
			`before downloading (casing matters).`

			`## Bring up`

			Easy path — `swap-model` handles stop-conflicting-services + waits for
			`/health`:

			```sh
			`ssh framework swap-model ornith # ~1-2 min cold load (21.2 GB)`
			`ssh framework /srv/docker/ornith/smoke.sh # /health + perf`
			```

			`Manual equivalent (first-ever bring-up, before the image is cached):`

			```sh
			`cd /srv/docker/ornith`
			`docker compose pull # already-cached image if you ran llama first`
			`docker compose up -d`
			`docker compose logs -f # wait for "server is listening on http://0.0.0.0:8083"`

			`./smoke.sh # /health + tiny generation + perf`
			```

			If `./smoke.sh` reports `predicted_per_second` in the ~80-100 tok/s band,
			`it's healthy. <30 tok/s = investigate (likely arena < 100 GB — see`
			`qwen3-235b/README.md "Troubleshooting" for the arena checks).`

			`## Reasoning + tool calls`

			Ornith emits a `<think>...</think>` block before the final answer and
			OpenAI-style `tool_calls`. `--jinja` (set in the compose file) uses the
			`model's embedded Qwen3.5 chat template, which both rely on. If opencode`
			shows raw `<think>` content in responses, the box's llama.cpp build is
			too old to split reasoning — bump the `kyuz0` image tag or add the
			`build's reasoning-format flag. Recommended sampling (set server-side):`
			`temp 0.6 / top_p 0.95 / top_k 20.`

			`## Ramping context`

			`Defaults to 64K to match the other llama.cpp stacks (keeps opencode`
			`auto-compaction consistent across providers). Ornith's native context is`
			`262144, and the model is small relative to the arena, so there's room to`
			`push far higher:`

			\| Stage \| `--ctx-size` \| Margin in arena \|
			`\|---\|---\|---\|`
			`\| Current default \| 65536 \| huge \|`
			`\| Stretch \| 131072 \| comfortable \|`
			`\| Native max \| 262144 \| watch KV cache size (q8_0 KV helps) \|`

			Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`,
			re-run `./smoke.sh`.

			`## Operations`

			```sh
			`docker compose logs -f # tail`
			`docker compose down # stop`
			`docker compose exec ornith bash # shell in`
			`./smoke.sh # health + perf`
			`amdgpu_top # GPU view on host`
			```

			`## Pin manifest`

			`\| Component \| Pin \|`
			`\|---\|---\|`
			\| Image \| `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`/`qwable`) \|
			\| Weights \| `deepreinforce-ai/Ornith-1.0-35B-GGUF` → `ornith-1.0-35b-Q4_K_M.gguf` (~21.2 GB) \|
			`\| Base \| Qwen3.5-35B-A3B (MoE: 256 experts, 8 active + shared, 40 layers) \|`
			`\| Default port \| 8083 \|`
			`\| Default context \| 65536 (native 262144) \|`
			`\| KV cache type \| q8_0 (k and v) \|`
			`\| License \| MIT (model); Qwen3.5 base license also applies \|`

			`## Status`

			`Compose artifacts written; awaiting box-side weight pull + bring-up.`
			Wired as a `swap-model ornith` target and as the `framework-ornith`
			opencode provider. A/B against `qwen3-coder:30b`; promote to opencode
			`default if the agentic-coding quality proves out.`