localgenai/pyinfra/framework/compose/llama/README.md
2026-06-08 15:31:50 +01:00

# llama
llama.cpp server with **native gfx1151** kernels via kyuz0's ROCm 7.2.2
toolbox. Listens on port 8080, beside Ollama (11434) and vLLM (8000).
Serves the same Qwen3-Coder model as Ollama over what is expected to be
a faster path.
## Why this exists
Ollama's bundled ROCm doesn't ship native gfx1151 kernels, so we coerce
gfx1100 kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`. kyuz0's image is
built against gfx1151 with rocWMMA acceleration. Expected gain on
Qwen3-Coder-30B-A3B-Q4: **~30-50 % higher eval_tps**, with ~2× faster
prefill.
The compose stub used to be vulkan-radv with a placeholder model path;
this rewrite makes it the second working coding endpoint.
## Bring up (LL-P0 verification)
```sh
# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box.
#    Verify the actual filename in the HF repo first; Unsloth's naming
#    sometimes splits into shards. As of 2026-05 the single-file
#    UD-Q4_K_XL is ~17-19 GB.
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \
  --local-dir /models/qwen

# 2. Stand up the container.
cd /srv/docker/llama
docker compose pull # ~6-10 GB image
docker compose up -d
docker compose logs -f # wait for "main: server is listening on http://0.0.0.0:8080"
# 3. Smoke + perf measure.
./smoke.sh
```
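
Step 2 waits on the log line by eye. As a rough alternative, a small retry helper can poll the server instead. This is a sketch, not part of `smoke.sh`: `wait_until` is a hypothetical helper, and the example probe assumes the llama.cpp build in the image exposes the usual `/health` endpoint on port 8080.

```sh
# Retry a command until it succeeds or the timeout (in seconds) is reached.
# usage: wait_until <timeout_seconds> <interval_seconds> <command...>
# Hypothetical helper; variable names are prefixed to avoid clobbering callers.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1   # gave up: the command never succeeded
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
  return 0
}

# Example (assumes /health returns a non-2xx status while the model loads):
#   wait_until 300 5 curl -fsS http://framework:8080/health && ./smoke.sh
```
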
If `predicted_per_second` is meaningfully higher than what Ollama
reports for the same prompt, the migration is justified. If it's the
same or worse, leave Ollama as the default and treat llama.cpp as a
secondary option.
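
"Meaningfully higher" can be made concrete by turning the two tokens-per-second figures into a percentage gain. The helpers below are a minimal sketch: the function names are hypothetical, and the 30 % default threshold is an assumption taken from the low end of the expected range above, not a measured cutoff.

```sh
# Percentage gain of a candidate over a baseline, both in tokens/s.
# usage: speedup_pct <ollama_tps> <llama_tps>
speedup_pct() {
  awk -v base="$1" -v cand="$2" \
    'BEGIN { printf "%.1f\n", (cand - base) * 100 / base }'
}

# Succeeds (exit 0) when the gain is at least <min_pct> percent.
# usage: meets_threshold <ollama_tps> <llama_tps> [min_pct]
# The default of 30 is an assumed threshold, not a project decision.
meets_threshold() {
  awk -v base="$1" -v cand="$2" -v min="${3:-30}" \
    'BEGIN { exit !((cand - base) * 100 / base >= min) }'
}

# Example with made-up numbers:
#   speedup_pct 40 56        # prints 40.0
#   meets_threshold 40 56 && echo "worth migrating"
```
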
## Comparison test (vs Ollama)
Run the same prompt against both for a clean A/B:
```sh
# Ollama
curl -s http://framework:11434/api/generate \
  -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false}' \
  | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}'

# llama.cpp (this stack)
curl -s http://framework:8080/completion \
  -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
  | jq '.timings | {predicted_per_second, prompt_per_second}'
```
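
A single request can be noisy, so one option is to repeat the llama.cpp call a few times and average the generation rate. The sketch below assumes nothing beyond the request shown above; `mean_tps` is a hypothetical helper, and the run count of 5 is arbitrary. Prefill figures on repeated identical prompts may be skewed if the server reuses a prompt cache, so this only averages `predicted_per_second`.

```sh
# Read one tokens-per-second sample per line on stdin; print count and mean.
# Blank lines are skipped. Hypothetical helper, not part of smoke.sh.
mean_tps() {
  awk 'NF { sum += $1; n++ } END { if (n) printf "n=%d mean=%.2f\n", n, sum / n }'
}

# Example: five runs against llama.cpp, averaged.
#   for i in 1 2 3 4 5; do
#     curl -s http://framework:8080/completion \
#       -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
#       | jq '.timings.predicted_per_second'
#   done | mean_tps
```
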
## Coexistence with Ollama
Both can run simultaneously: they use different ports and different
model files on disk (Ollama's content-addressed store at
`/models/ollama/` vs the raw GGUF at `/models/qwen/`). They will compete
for GPU memory if both have their models hot. With
`OLLAMA_KEEP_ALIVE=24h` Ollama keeps Qwen3-Coder resident, so to A/B
without contention, run
`docker exec ollama ollama stop qwen3-coder:30b` before testing llama.cpp.
If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode
provider (`framework-llama/qwen3-coder` alongside `framework/qwen3-coder:30b`
and `framework-vllm/kimi-linear`).
## Pin manifest
| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` |
| Weights | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` (UD-Q4_K_XL variant) |
| Default port | 8080 |
| Context | 65536 tokens (matches Ollama config) |
## Operations
```sh
docker compose logs -f # tail
docker compose restart llama # reload
docker compose down # stop
docker compose exec llama bash # shell in
./smoke.sh # health + perf check
```
## Status
LL-P0 in progress. LL-P1 (opencode provider wire-up) pending verification.