# llama

llama.cpp server with **native gfx1151** kernels via kyuz0's ROCm 7.2.2 toolbox. Sits beside Ollama (11434) and vLLM (8000) on port 8080. Same Qwen3-Coder model as Ollama, faster path.

## Why this exists

Ollama's bundled ROCm doesn't ship native gfx1151, so we coerce gfx1100 kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`. kyuz0's image is built against gfx1151 with rocWMMA acceleration. Expected eval_tps delta on Qwen3-Coder-30B-A3B-Q4: **~30-50 % faster**, with ~2× prefill speedup.

The compose stub used to be vulkan-radv with a placeholder model path; this rewrite makes it the second working coding endpoint.

## Bring up (LL-P0 verification)

```sh
# 1. Pull the Unsloth UD-Q4_K_XL Qwen3-Coder GGUF on the box.
#    Verify the actual filename in the HF repo first; Unsloth's naming
#    sometimes splits into shards. As of 2026-05 the single-file
#    UD-Q4_K_XL is ~17-19 GB.
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  'Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf' \
  --local-dir /models/qwen

# 2. Stand up the container.
cd /srv/docker/llama
docker compose pull            # ~6-10 GB image
docker compose up -d
docker compose logs -f         # wait for "main: server is listening on http://0.0.0.0:8080"

# 3. Smoke + perf measure.
./smoke.sh
```

If `predicted_per_second` is meaningfully higher than what Ollama reports for the same prompt, the migration is justified. If it's the same or worse, leave Ollama as the default and treat llama.cpp as a secondary option.

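
To make "meaningfully higher" concrete, the two throughput numbers can be reduced to a single percentage. The helper below is a minimal sketch, not part of this repo: the function names `tps_delta_pct` and `worth_migrating` are hypothetical, and the default 30 % threshold is an assumption borrowed from the low end of the expected range above, not a measured cut-off.

```sh
# Hypothetical helper: percentage change of the llama.cpp rate relative to Ollama's.
# Usage: tps_delta_pct <ollama_tps> <llama_tps>
tps_delta_pct() {
  awk -v old="$1" -v cur="$2" 'BEGIN {
    if (old <= 0) { print "baseline must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (cur - old) / old * 100
  }'
}

# Hypothetical decision rule: succeeds when the delta meets the threshold.
# The threshold (third argument, default 30) is an assumption; tune it to taste.
# Usage: worth_migrating <ollama_tps> <llama_tps> [threshold_pct]
worth_migrating() {
  awk -v d="$(tps_delta_pct "$1" "$2")" -v t="${3:-30}" 'BEGIN { exit !(d >= t) }'
}
```

With illustrative numbers (not measurements), `tps_delta_pct 40 52` prints `30.0`, and `worth_migrating 40 52 && echo migrate` prints `migrate`. Feed it Ollama's `eval_tps` and llama.cpp's `predicted_per_second` from the same prompt.
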
## Comparison test (vs Ollama)

Run the same prompt against both for a clean A/B:

```sh
# Ollama
curl -s http://framework:11434/api/generate \
  -d '{"model":"qwen3-coder:30b","prompt":"Write a Python fibonacci function with type hints.","stream":false}' \
  | jq '{eval_tps:(.eval_count/(.eval_duration/1e9)), prompt_tps:(.prompt_eval_count/(.prompt_eval_duration/1e9))}'

# llama.cpp (this stack)
curl -s http://framework:8080/completion \
  -d '{"prompt":"Write a Python fibonacci function with type hints.","n_predict":200,"temperature":0}' \
  | jq '.timings | {predicted_per_second, prompt_per_second}'
```

## Coexistence with Ollama

Both can run simultaneously: different ports, different model files on disk (Ollama's content-addressed store at `/models/ollama/` vs the raw GGUF at `/models/qwen/`). They will compete for GPU memory if both have their models hot. With `OLLAMA_KEEP_ALIVE=24h` Ollama keeps Qwen3 resident; if you want to A/B without contention, `docker exec ollama ollama stop qwen3-coder:30b` while testing llama.cpp.

If LL-P0 confirms the perf win, LL-P1 wires this as a third opencode provider (`framework-llama/qwen3-coder` alongside `framework/qwen3-coder:30b` and `framework-vllm/kimi-linear`).

## Pin manifest

| Component | Pin |
|---|---|
| Image | `kyuz0/amd-strix-halo-toolboxes:rocm-7.2.2` |
| Weights | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` (UD-Q4_K_XL variant) |
| Default port | 8080 |
| Context | 65536 (matches Ollama config) |

## Operations

```sh
docker compose logs -f           # tail
docker compose restart llama     # reload
docker compose down              # stop
docker compose exec llama bash   # shell in
./smoke.sh                       # health + perf check
```

## Status

LL-P0 in progress. LL-P1 (opencode provider wire-up) pending verification.