# ornith Ornith-1.0-35B on Strix Halo via `kyuz0:rocm-7.2.2`. DeepReinforce's MIT-licensed **agentic-coding** model — a self-improving RL fine-tune of **Qwen3.5-35B-A3B** that co-trains its own task scaffolds with the policy. Strong on Terminal-Bench 2.1 / SWE-Bench Verified, emits OpenAI-style `tool_calls`, opens each answer with a `` reasoning block. OpenAI-compatible endpoint at `http://framework:8083` once running. ## MoE, not dense (read first — this is why it's worth a slot) Despite "35B" in the name, Ornith-1.0-35B is **MoE with only ~3B active params per token** (256 routed experts, 8 active + a shared expert, 40 layers). On this bandwidth-bound box (256 GB/s) decode speed tracks *active* params, so it runs like the 30B-A3B workhorse (**~80-100 tok/s**), not like a dense 27/31B (~10-15 tok/s). That's the whole point: near frontier-class agentic-coding quality at interactive speed. Candidate to replace `qwen3-coder:30b` (Ollama) as the opencode daily driver — A/B before promoting. ## Quant choice moves speed here For MoE, decode bandwidth ∝ *active bytes per token*, so quant tier changes t/s (~2x across the range), unlike a model where everything is read every token: | Quant | Size | When | |---|---|---| | **Q4_K_M** | **21.2 GB** | **default** — fastest, huge arena headroom | | Q6_K | 28.5 GB | bump here only if Q4 quality disappoints (~slower) | | Q8_0 | 36.9 GB | max quality, ~half the decode speed — rarely worth it for A3B | ## Coexistence notes At ~21.2 GB (Q4_K_M) Ornith fits the merged arena easily: | Concurrent service | Coexists? | |---|---| | `llama` (Qwen3-Coder-30B, 8080) | ✅ yes | | `ollama` (11434) | ✅ yes | | `kimi-linear` (vLLM, 8000) | ✅ yes | | `qwable` (8082) | ✅ yes (~38 GB total) | | `qwen3-235b` (88.8 GB, 8081) | ❌ no — swap-model stops it | | `comfyui` (8188) | ❌ no — swap-model stops it | `restart: "no"`: you bring it up deliberately (via `swap-model ornith`), it won't auto-start after a reboot and surprise-collide with a big model. ## Prereqs - Pyinfra deploy has run (creates `/srv/docker/ornith/` with right perms). - BIOS UMA at 0.5 GB + `ttm.pages_limit=33554432` kernel cmdline active. Verify: `cat /proc/cmdline | grep ttm.pages_limit`. ## Download weights (~21.2 GB, single file) ```sh # /models/qwen exists via pyinfra; just create the model subdir. mkdir -p /models/qwen/Ornith-1.0-35B hf download deepreinforce-ai/Ornith-1.0-35B-GGUF \ 'ornith-1.0-35b-Q4_K_M.gguf' \ --local-dir /models/qwen/Ornith-1.0-35B # File lands at: # /models/qwen/Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf (~21.2 GB) ``` Single-file GGUF (not sharded) — point `--model` straight at it. Disk: needs ~22 GB free on `/models`. Verify the exact filename in the HF repo before downloading (casing matters). ## Bring up Easy path — `swap-model` handles stop-conflicting-services + waits for `/health`: ```sh ssh framework swap-model ornith # ~1-2 min cold load (21.2 GB) ssh framework /srv/docker/ornith/smoke.sh # /health + perf ``` Manual equivalent (first-ever bring-up, before the image is cached): ```sh cd /srv/docker/ornith docker compose pull # already-cached image if you ran llama first docker compose up -d docker compose logs -f # wait for "server is listening on http://0.0.0.0:8083" ./smoke.sh # /health + tiny generation + perf ``` If `./smoke.sh` reports `predicted_per_second` in the ~80-100 tok/s band, it's healthy. <30 tok/s = investigate (likely arena < 100 GB — see qwen3-235b/README.md "Troubleshooting" for the arena checks). ## Reasoning + tool calls Ornith emits a `...` block before the final answer and OpenAI-style `tool_calls`. `--jinja` (set in the compose file) uses the model's embedded Qwen3.5 chat template, which both rely on. If opencode shows raw `` content in responses, the box's llama.cpp build is too old to split reasoning — bump the `kyuz0` image tag or add the build's reasoning-format flag. Recommended sampling (set server-side): temp 0.6 / top_p 0.95 / top_k 20. ## Ramping context Defaults to 64K to match the other llama.cpp stacks (keeps opencode auto-compaction consistent across providers). Ornith's native context is 262144, and the model is small relative to the arena, so there's room to push far higher: | Stage | `--ctx-size` | Margin in arena | |---|---|---| | **Current default** | **65536** | huge | | Stretch | 131072 | comfortable | | Native max | 262144 | watch KV cache size (q8_0 KV helps) | Edit `--ctx-size` in `docker-compose.yml`, `docker compose down && up -d`, re-run `./smoke.sh`. ## Operations ```sh docker compose logs -f # tail docker compose down # stop docker compose exec ornith bash # shell in ./smoke.sh # health + perf amdgpu_top # GPU view on host ``` ## Pin manifest | Component | Pin | |---|---| | Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` (shared with `llama`/`qwable`) | | Weights | `deepreinforce-ai/Ornith-1.0-35B-GGUF` → `ornith-1.0-35b-Q4_K_M.gguf` (~21.2 GB) | | Base | Qwen3.5-35B-A3B (MoE: 256 experts, 8 active + shared, 40 layers) | | Default port | 8083 | | Default context | 65536 (native 262144) | | KV cache type | q8_0 (k and v) | | License | MIT (model); Qwen3.5 base license also applies | ## Status Compose artifacts written; awaiting box-side weight pull + bring-up. Wired as a `swap-model ornith` target and as the `framework-ornith` opencode provider. A/B against `qwen3-coder:30b`; promote to opencode default if the agentic-coding quality proves out.