progress 235b
92
pyinfra/framework/compose/llama/README.md
Normal file
@@ -0,0 +1,92 @@
# llama

llama.cpp server with **native gfx1151** kernels via kyuz0's ROCm 7.2.2
toolbox. It listens on port 8080, beside Ollama (11434) and vLLM (8000),
and serves the same Qwen3-Coder model as Ollama over a faster path.

## Why this exists

Ollama's bundled ROCm doesn't ship native gfx1151 kernels, so we coerce
gfx1100 kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`. kyuz0's image is
built against gfx1151 with rocWMMA acceleration. Expected eval_tps gain on
Qwen3-Coder-30B-A3B-Q4: **~30-50% faster**, with ~2× prefill speedup.
The compose stub used to be vulkan-radv with a placeholder model path;
this rewrite makes it the second working coding endpoint.

## Bring up (LL-P0 verification)

```sh
# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box.
#    Verify the actual filename in the HF repo first: Unsloth's naming
#    sometimes splits into shards. As of 2026-05 the single-file
#    UD-Q4_K_XL is ~17-19 GB.
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \
  --local-dir /models/qwen

# 2. Stand up the container.
cd /srv/docker/llama
docker compose pull        # ~6-10 GB image
docker compose up -d
docker compose logs -f     # wait for "main: server is listening on http://0.0.0.0:8080"

# 3. Smoke + perf measure.
./smoke.sh
```
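
Instead of watching the logs by hand in step 2, readiness can be polled. The following is a minimal sketch, assuming `/health` (the endpoint `smoke.sh` checks) fails until the model has finished loading. `wait_until` and its arguments are hypothetical names for illustration, not something this stack ships.

```sh
# wait_until TIMEOUT_S INTERVAL_S CMD...
# Rerun CMD every INTERVAL_S seconds until it succeeds; give up with
# status 1 once more than TIMEOUT_S seconds of waiting have accumulated.
# Hypothetical helper, not part of this repo.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + interval_s))
    if [ "$elapsed" -gt "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Usage (assumed host and timeout; adjust to your box):
#   wait_until 300 2 curl -fsS http://127.0.0.1:8080/health && ./smoke.sh
```

Taking the probe as an argument keeps the loop independent of curl, so the same helper can wait on any command.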

If `predicted_per_second` is meaningfully higher than what Ollama
reports for the same prompt, the migration is justified. If it's the
same or worse, leave Ollama as the default and treat llama.cpp as a
secondary option.
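
To put a number on "meaningfully higher", the relative change between the two tok/s readings can be computed. This is a minimal sketch: `speedup_pct` is a hypothetical helper, the inputs below are made-up values rather than measurements, and the threshold for "meaningful" remains a judgment call.

```sh
# speedup_pct BASELINE_TPS CANDIDATE_TPS
# Print the relative change of CANDIDATE over BASELINE, in percent.
# Hypothetical helper for illustration only.
speedup_pct() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    if (base == 0) exit 1            # avoid division by zero
    printf "%.1f\n", (cand - base) / base * 100
  }'
}

# Illustrative numbers only: Ollama at 40 tok/s, llama.cpp at 52 tok/s.
speedup_pct 40 52   # prints 30.0
```

A negative result means llama.cpp was slower on that prompt, which is the "leave Ollama as the default" case.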

## Comparison test (vs Ollama)

Run the same prompt against both, with matching sampling settings, for a clean A/B:

```sh
# Ollama (cap generation at 200 tokens and pin temperature to match the llama.cpp call)
curl -s http://framework:11434/api/generate \
  -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false,"options":{"num_predict":200,"temperature":0}}' \
  | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}'

# llama.cpp (this stack)
curl -s http://framework:8080/completion \
  -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
  | jq '.timings | {predicted_per_second, prompt_per_second}'
```
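
The two `jq` filters differ because Ollama reports token counts and durations in nanoseconds, so tok/s has to be derived, while llama.cpp's `timings` object already carries per-second figures. The arithmetic behind the Ollama filter can be sketched as follows; `tps_from_ns` is a hypothetical name and the sample values are illustrative, not measured.

```sh
# tps_from_ns TOKEN_COUNT DURATION_NS
# Tokens per second from a nanosecond duration, mirroring the jq
# expression eval_count / (eval_duration / 1e9).
# Hypothetical helper for illustration only.
tps_from_ns() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# Illustrative: 200 tokens generated in 4 s (4000000000 ns).
tps_from_ns 200 4000000000   # prints 50.00
```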

## Coexistence with Ollama

Both can run simultaneously: they use different ports and different model
files on disk (Ollama's content-addressed store at `/models/ollama/` vs the
raw GGUF at `/models/qwen/`). They will compete for GPU memory if both have
their models hot. With `OLLAMA_KEEP_ALIVE=24h` Ollama keeps Qwen3
resident; to A/B without contention, run
`docker exec ollama ollama stop qwen3-coder:30b` before testing llama.cpp.

If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode
provider (`framework-llama/qwen3-coder` alongside `framework/qwen3-coder:30b`
and `framework-vllm/kimi-linear`).

## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` |
| Weights | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` (UD-Q4_K_XL variant) |
| Default port | 8080 |
| Context | 65536 (matches Ollama config) |

## Operations

```sh
docker compose logs -f           # tail
docker compose restart llama     # reload
docker compose down              # stop
docker compose exec llama bash   # shell in
./smoke.sh                       # health + perf check
```

## Status

LL-P0 in progress. LL-P1 (opencode provider wire-up) is pending LL-P0 verification.
45
pyinfra/framework/compose/llama/smoke.sh
Executable file
@@ -0,0 +1,45 @@
#!/usr/bin/env bash
# Smoke-test the running llama-server (kyuz0 rocm-7.2.2). Hits /health
# for liveness, then a tiny OpenAI-compatible chat completion. Also
# prints eval_tps so you can compare to Ollama directly.
set -euo pipefail

HOST="${LLAMA_HOST:-127.0.0.1:8080}"
MODEL="${LLAMA_MODEL:-qwen3-coder}"

echo "[smoke] GET /health on $HOST"
curl -fsS "http://$HOST/health" | python3 -m json.tool

echo
echo "[smoke] POST /v1/chat/completions ($MODEL): tiny generation"
curl -fsS "http://$HOST/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Reply with exactly: ok\"}],
    \"max_tokens\": 16,
    \"temperature\": 0.0
  }" | python3 -m json.tool

echo
echo "[smoke] perf measure: eval_tps and prompt_tps"
# Use llama.cpp's native /completion endpoint which returns timings.
curl -fsS "http://$HOST/completion" \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a Python function that computes the Fibonacci sequence iteratively. Include type hints and a brief docstring.",
    "n_predict": 200,
    "temperature": 0.0,
    "stream": false
  }' | python3 -c "
import json, sys
r = json.load(sys.stdin)
t = r.get('timings', {})
# Default to NaN rather than a string: applying :.2f to a str raises ValueError.
nan = float('nan')
print(f'predicted_per_second: {t.get(\"predicted_per_second\", nan):.2f} tok/s')
print(f'prompt_per_second: {t.get(\"prompt_per_second\", nan):.2f} tok/s')
print(f'predicted_n: {t.get(\"predicted_n\", \"?\")}')
print(f'prompt_n: {t.get(\"prompt_n\", \"?\")}')
"

echo
echo "[smoke] passed"