progress 235b

2026-06-08 15:31:50 +01:00
parent a29793032d
commit de1635872f
25 changed files with 1598 additions and 53 deletions
--- a/pyinfra/framework/compose/llama/README.md
+++ b/pyinfra/framework/compose/llama/README.md
@@ -0,0 +1,92 @@
+# llama
+
+llama.cpp server with **native gfx1151** kernels via kyuz0's ROCm 7.2.2
+toolbox. Sits beside Ollama (11434) and vLLM (8000) on port 8080. Serves
+the same Qwen3-Coder model as Ollama; expected to be the faster path.
+
+## Why this exists
+
+Ollama's bundled ROCm doesn't ship native gfx1151 kernels, so we coerce
+gfx1100 kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`. kyuz0's image is
+built against gfx1151 with rocWMMA acceleration. Expected gain on
+Qwen3-Coder-30B-A3B-Q4: **~30-50% higher eval_tps** and ~2× faster prefill.
+The compose stub used to be vulkan-radv with a placeholder model path;
+this rewrite makes it the second working coding endpoint.
+
+## Bring up (LL-P0 verification)
+
+```sh
+# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box.
+#    Verify the actual filename in the HF repo first — Unsloth's naming
+#    sometimes splits into shards. As of 2026-05 the single-file
+#    UD-Q4_K_XL is ~17-19 GB.
+hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
+    'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \
+    --local-dir /models/qwen
+
+# 2. Stand up the container.
+cd /srv/docker/llama
+docker compose pull       # ~6-10 GB image
+docker compose up -d
+docker compose logs -f    # wait for "main: server is listening on http://0.0.0.0:8080"
+
+# 3. Smoke + perf measure.
+./smoke.sh
+```
+
+If `predicted_per_second` is clearly higher than what Ollama reports for
+the same prompt (the expectation above is ~30-50%), the migration is
+justified. If it's the same or worse, leave Ollama as the default and
+treat llama.cpp as a secondary option.
+
+## Comparison test (vs Ollama)
+
+Run the same prompt against both for a clean A/B:
+
+```sh
+# Ollama
+curl -s http://framework:11434/api/generate \
+  -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false,"options":{"num_predict":200,"temperature":0}}' \
+  | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}'
+
+# llama.cpp (this stack)
+curl -s http://framework:8080/completion \
+  -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
+  | jq '.timings | {predicted_per_second, prompt_per_second}'
+```
+
+## Coexistence with Ollama
+
+Both can run simultaneously: different ports, different model files on
+disk (Ollama's content-addressed store at `/models/ollama/` vs the raw
+GGUF at `/models/qwen/`). They will compete for GPU memory if both have
+their models loaded. With `OLLAMA_KEEP_ALIVE=24h` Ollama keeps Qwen3-Coder
+resident; to A/B without contention, run
+`docker exec ollama ollama stop qwen3-coder:30b` before testing llama.cpp.
+
+If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode
+provider (`framework-llama/qwen3-coder` alongside `framework/qwen3-coder:30b`
+and `framework-vllm/kimi-linear`).
+
+## Pin manifest
+
+| Component | Pin |
+|---|---|
+| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` |
+| Weights | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` (UD-Q4_K_XL variant) |
+| Default port | 8080 |
+| Context | 65536 (matches Ollama config) |
+
+## Operations
+
+```sh
+docker compose logs -f                              # tail
+docker compose restart llama                        # reload
+docker compose down                                 # stop
+docker compose exec llama bash                      # shell in
+./smoke.sh                                          # health + perf check
+```
+
+## Status
+
+LL-P0 in progress. LL-P1 (opencode provider wire-up) pending verification.
--- a/pyinfra/framework/compose/llama/smoke.sh
+++ b/pyinfra/framework/compose/llama/smoke.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+# Smoke-test the running llama-server (kyuz0 rocm-7.2.2): hits /health
+# for liveness, runs a tiny OpenAI-compatible chat completion, then
+# prints predicted_per_second (llama.cpp's eval_tps) to compare with Ollama.
+set -euo pipefail
+
+HOST="${LLAMA_HOST:-127.0.0.1:8080}"
+MODEL="${LLAMA_MODEL:-qwen3-coder}"
+
+echo "[smoke] GET /health on $HOST"
+curl -fsS "http://$HOST/health" | python3 -m json.tool
+
+echo
+echo "[smoke] POST /v1/chat/completions ($MODEL) — tiny generation"
+curl -fsS "http://$HOST/v1/chat/completions" \
+    -H 'Content-Type: application/json' \
+    -d "{
+        \"model\": \"$MODEL\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
+        \"max_tokens\": 16,
+        \"temperature\": 0.0
+    }" | python3 -m json.tool
+
+echo
+echo "[smoke] perf measure — eval_tps and prompt_tps"
+# Use llama.cpp's native /completion endpoint which returns timings.
+curl -fsS "http://$HOST/completion" \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
+        "n_predict": 200,
+        "temperature": 0.0,
+        "stream": false
+    }' | python3 -c "
+import json, sys
+r = json.load(sys.stdin)
+t = r.get('timings', {})
+print(f'predicted_per_second:  {t.get(\"predicted_per_second\", float(\"nan\")):.2f} tok/s')
+print(f'prompt_per_second:     {t.get(\"prompt_per_second\", float(\"nan\")):.2f} tok/s')
+print(f'predicted_n:           {t.get(\"predicted_n\", \"?\")}')
+print(f'prompt_n:              {t.get(\"prompt_n\", \"?\")}')
+"
+
+echo
+echo "[smoke] passed"