Files
localgenai/pyinfra/framework/compose/ornith/README.md

5.5 KiB

ornith

Ornith-1.0-35B on Strix Halo via kyuz0:rocm-7.2.2. DeepReinforce's MIT-licensed agentic-coding model — a self-improving RL fine-tune of Qwen3.5-35B-A3B that co-trains its own task scaffolds with the policy. Strong on Terminal-Bench 2.1 / SWE-Bench Verified, emits OpenAI-style tool_calls, opens each answer with a <think> reasoning block.

OpenAI-compatible endpoint at http://framework:8083 once running.

MoE, not dense (read first — this is why it's worth a slot)

Despite "35B" in the name, Ornith-1.0-35B is MoE with only ~3B active params per token (256 routed experts, 8 active + a shared expert, 40 layers). On this bandwidth-bound box (256 GB/s) decode speed tracks active params, so it runs like the 30B-A3B workhorse (~80-100 tok/s), not like a dense 27/31B (~10-15 tok/s). That's the whole point: near frontier-class agentic-coding quality at interactive speed. Candidate to replace qwen3-coder:30b (Ollama) as the opencode daily driver — A/B before promoting.

Quant choice moves speed here

For MoE, decode bandwidth ∝ active bytes per token, so quant tier changes t/s (~2x across the range), unlike a model where everything is read every token:

Quant Size When
Q4_K_M 21.2 GB default — fastest, huge arena headroom
Q6_K 28.5 GB bump here only if Q4 quality disappoints (~slower)
Q8_0 36.9 GB max quality, ~half the decode speed — rarely worth it for A3B

Coexistence notes

At ~21.2 GB (Q4_K_M) Ornith fits the merged arena easily:

Concurrent service Coexists?
llama (Qwen3-Coder-30B, 8080) yes
ollama (11434) yes
kimi-linear (vLLM, 8000) yes
qwable (8082) yes (~38 GB total)
qwen3-235b (88.8 GB, 8081) no — swap-model stops it
comfyui (8188) no — swap-model stops it

restart: "no": you bring it up deliberately (via swap-model ornith), it won't auto-start after a reboot and surprise-collide with a big model.

Prereqs

  • Pyinfra deploy has run (creates /srv/docker/ornith/ with right perms).
  • BIOS UMA at 0.5 GB + ttm.pages_limit=33554432 kernel cmdline active. Verify: cat /proc/cmdline | grep ttm.pages_limit.

Download weights (~21.2 GB, single file)

# /models/qwen exists via pyinfra; just create the model subdir.
mkdir -p /models/qwen/Ornith-1.0-35B

hf download deepreinforce-ai/Ornith-1.0-35B-GGUF \
    'ornith-1.0-35b-Q4_K_M.gguf' \
    --local-dir /models/qwen/Ornith-1.0-35B

# File lands at:
#   /models/qwen/Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf  (~21.2 GB)

Single-file GGUF (not sharded) — point --model straight at it. Disk: needs ~22 GB free on /models. Verify the exact filename in the HF repo before downloading (casing matters).

Bring up

Easy path — swap-model handles stop-conflicting-services + waits for /health:

ssh framework swap-model ornith     # ~1-2 min cold load (21.2 GB)
ssh framework /srv/docker/ornith/smoke.sh    # /health + perf

Manual equivalent (first-ever bring-up, before the image is cached):

cd /srv/docker/ornith
docker compose pull       # already-cached image if you ran llama first
docker compose up -d
docker compose logs -f    # wait for "server is listening on http://0.0.0.0:8083"

./smoke.sh                # /health + tiny generation + perf

If ./smoke.sh reports predicted_per_second in the ~80-100 tok/s band, it's healthy. <30 tok/s = investigate (likely arena < 100 GB — see qwen3-235b/README.md "Troubleshooting" for the arena checks).

Reasoning + tool calls

Ornith emits a <think>...</think> block before the final answer and OpenAI-style tool_calls. --jinja (set in the compose file) uses the model's embedded Qwen3.5 chat template, which both rely on. If opencode shows raw <think> content in responses, the box's llama.cpp build is too old to split reasoning — bump the kyuz0 image tag or add the build's reasoning-format flag. Recommended sampling (set server-side): temp 0.6 / top_p 0.95 / top_k 20.

Ramping context

Defaults to 64K to match the other llama.cpp stacks (keeps opencode auto-compaction consistent across providers). Ornith's native context is 262144, and the model is small relative to the arena, so there's room to push far higher:

Stage --ctx-size Margin in arena
Current default 65536 huge
Stretch 131072 comfortable
Native max 262144 watch KV cache size (q8_0 KV helps)

Edit --ctx-size in docker-compose.yml, docker compose down && up -d, re-run ./smoke.sh.

Operations

docker compose logs -f                  # tail
docker compose down                     # stop
docker compose exec ornith bash         # shell in
./smoke.sh                              # health + perf
amdgpu_top                              # GPU view on host

Pin manifest

Component Pin
Image kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2 (shared with llama/qwable)
Weights deepreinforce-ai/Ornith-1.0-35B-GGUFornith-1.0-35b-Q4_K_M.gguf (~21.2 GB)
Base Qwen3.5-35B-A3B (MoE: 256 experts, 8 active + shared, 40 layers)
Default port 8083
Default context 65536 (native 262144)
KV cache type q8_0 (k and v)
License MIT (model); Qwen3.5 base license also applies

Status

Compose artifacts written; awaiting box-side weight pull + bring-up. Wired as a swap-model ornith target and as the framework-ornith opencode provider. A/B against qwen3-coder:30b; promote to opencode default if the agentic-coding quality proves out.