localgenai/pyinfra/framework/compose/llama/README.md
2026-06-08 15:31:50 +01:00

llama

llama.cpp server with native gfx1151 kernels via kyuz0's ROCm 7.2.2 toolbox. It listens on port 8080, alongside Ollama (11434) and vLLM (8000), and serves the same Qwen3-Coder model as Ollama over what is expected to be a faster path.

Why this exists

Ollama's bundled ROCm doesn't ship native gfx1151 kernels, so we coerce the gfx1100 kernels via HSA_OVERRIDE_GFX_VERSION=11.0.0. kyuz0's image is built against gfx1151 with rocWMMA acceleration. Expected gain on Qwen3-Coder-30B-A3B-Q4: ~30-50% higher eval_tps, with ~2× prefill speedup. The compose stub used to be vulkan-radv with a placeholder model path; this rewrite makes it the second working coding endpoint.
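To see which gfx target a given runtime actually reports, the output of rocminfo can be filtered down to its distinct target names. This is a minimal sketch: `list_gfx_targets` is a hypothetical helper, and it assumes rocminfo is available on the host or in the container and prints targets as tokens such as gfx1151.

```shell
#!/usr/bin/env bash
# List the distinct gfx targets named in rocminfo-style text read from stdin.
# Assumption: targets appear as tokens like "gfx1151" or "gfx1100".
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Hypothetical usage:
#   rocminfo | list_gfx_targets
```

Running it once in each environment gives a quick side-by-side of the targets the two stacks see.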

Bring up (LL-P0 verification)

# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box.
#    Verify the actual filename in the HF repo first: Unsloth
#    sometimes splits a quant into shards. As of 2026-05 the
#    single-file UD-Q4_K_XL is ~17-19 GB.
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
    'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \
    --local-dir /models/qwen

# 2. Stand up the container.
cd /srv/docker/llama
docker compose pull       # ~6-10 GB image
docker compose up -d
docker compose logs -f    # wait for "main: server is listening on http://0.0.0.0:8080"

# 3. Smoke + perf measure.
./smoke.sh

If predicted_per_second is meaningfully higher than the eval rate Ollama reports for the same prompt (the expectation is ~30-50%), the migration is justified. If it's the same or worse, leave Ollama as the default and treat llama.cpp as a secondary option.
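The decision rule can be made explicit with a small helper. This is a sketch, not part of smoke.sh: `speedup_pct` and `worth_migrating` are hypothetical names, and the default 30% threshold is an assumption taken from the low end of the ~30-50% expectation stated earlier.

```shell
#!/usr/bin/env bash
# Percent change of llama.cpp decode speed relative to Ollama.
# $1 = Ollama tokens/s, $2 = llama.cpp predicted_per_second
# Note: $1 must be non-zero.
speedup_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Exit 0 when the gain meets the threshold in $3 (default 30, an assumption).
worth_migrating() {
    awk -v base="$1" -v new="$2" -v min="${3:-30}" \
        'BEGIN { exit !((new - base) / base * 100 >= min) }'
}
```

For example, `speedup_pct 40 56` prints 40.0, and `worth_migrating 40 56` exits 0.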

Comparison test (vs Ollama)

Run the same prompt against both for a clean A/B:

# Ollama (options mirror the llama.cpp request: 200 tokens, temperature 0)
curl -s http://framework:11434/api/generate \
  -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false,"options":{"num_predict":200,"temperature":0}}' \
  | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}'

# llama.cpp (this stack)
curl -s http://framework:8080/completion \
  -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
  | jq '.timings | {predicted_per_second, prompt_per_second}'
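A single run can be noisy, so one option is to repeat each request a few times and compare medians. The helper below is a sketch; `median` is a hypothetical name, and it expects one plain decimal number per line on stdin.

```shell
#!/usr/bin/env bash
# Median of newline-separated numbers on stdin; exits 1 on empty input.
median() {
    sort -n | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Hypothetical usage against this stack (decode rate only):
#   for i in 1 2 3 4 5; do
#       curl -s http://framework:8080/completion \
#         -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
#         | jq '.timings.predicted_per_second'
#   done | median
```

Repeating an identical prompt may hit the server's prompt cache, so this approach suits the decode rate better than the prefill rate.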

Coexistence with Ollama

Both can run simultaneously: they use different ports and different model files on disk (Ollama's content-addressed store at /models/ollama/ vs the raw GGUF at /models/qwen/). They will compete for GPU memory if both have their models hot. With OLLAMA_KEEP_ALIVE=24h Ollama keeps Qwen3 resident, so to A/B without contention, run docker exec ollama ollama stop qwen3-coder:30b before testing llama.cpp.
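Before timing llama.cpp, it can help to confirm that Ollama really has unloaded the model. The sketch below parses `ollama ps`-style output; `model_is_loaded` is a hypothetical helper, and it assumes the first line is a header and the model name is the first column.

```shell
#!/usr/bin/env bash
# Exit 0 if the model named in $1 appears in `ollama ps`-style output on stdin.
# Assumption: line 1 is a header; the model name is the first column.
model_is_loaded() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Hypothetical usage, with the container name used elsewhere in this README:
#   docker exec ollama ollama ps | model_is_loaded qwen3-coder:30b \
#       && echo "qwen3-coder:30b is still resident"
```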

If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode provider (framework-llama/qwen3-coder alongside framework/qwen3-coder:30b and framework-vllm/kimi-linear).

Pin manifest

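| Component    | Pin                                                             |
|--------------|-----------------------------------------------------------------|
| Image        | kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2                       |
| Weights      | unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF (UD-Q4_K_XL variant)  |
| Default port | 8080                                                            |
| Context      | 65536 (matches Ollama config)                                   |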

Operations

docker compose logs -f                              # tail
docker compose restart llama                        # reload
docker compose down                                 # stop
docker compose exec llama bash                      # shell in
./smoke.sh                                          # health + perf check
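After a restart the server needs time to load the model before it answers requests. llama.cpp's server exposes a /health endpoint that returns HTTP 200 once it is ready, so a script can poll it instead of tailing logs. This is a sketch under stated assumptions, not the contents of smoke.sh: `wait_for_http_ok` is a hypothetical helper and the defaults are arbitrary.

```shell
#!/usr/bin/env bash
# Poll a URL until it answers with HTTP 200 or the timeout expires.
# $1 = URL, $2 = timeout in whole seconds (default 120),
# $3 = poll interval in whole seconds (default 2)
wait_for_http_ok() {
    local url="$1" timeout="${2:-120}" interval="${3:-2}" waited=0 code
    while [ "$waited" -lt "$timeout" ]; do
        # curl reports 000 when the connection fails outright.
        code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url" 2>/dev/null) || code=000
        [ "$code" = "200" ] && return 0
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Hypothetical usage:
#   docker compose restart llama && wait_for_http_ok http://framework:8080/health 300
```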

Status

LL-P0 in progress. LL-P1 (opencode provider wire-up) is pending LL-P0 verification.