litellm

OpenAI-compatible router in front of all the local model backends on this box. One endpoint (http://framework:4000/v1) → four backends (Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from config.yaml in this repo; edits land on the box via ./run.sh + docker compose restart litellm.

When to route through LiteLLM vs direct

Client	Recommended	Why
opencode (Mac)	Direct providers in `opencode.json`	One Mac, four providers fits cleanly in opencode's config schema. Adds no debugging burden.
OpenHands (the box)	Via LiteLLM	OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire.
Future orchestrator (other server)	Via LiteLLM	Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose.
OpenWebUI	Direct (already wired to vLLM)	No reason to add a hop.
Quick curl / scripts	Either	LiteLLM is convenient because you don't need to remember port numbers.

The trade-off: a hop adds a few ms per request (negligible at our tok/s) and one more thing to debug when something breaks. The win is config centralization.

Prereqs

Pyinfra deploy has run (creates /srv/docker/litellm/{config.yaml,.env}).
Fill in /srv/docker/litellm/.env:
```
LITELLM_MASTER_KEY=sk-<choose-something-stable>
LITELLM_SALT_KEY=sk-<choose-something-stable>
```
Set the file to mode 640, owned by root:docker: the docker group can read it when compose parses the file, and it is not world-readable.
At least one backend running. LiteLLM itself comes up regardless, but requests routed to a backend that is down fail (see Backend down semantics).
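
One way to fill in the two .env values is to generate them from the kernel's random source. This is a sketch under assumptions, not something shipped in this repo: gen_key is a hypothetical helper, and the sk- prefix simply follows the placeholders shown above.

```shell
#!/bin/sh
# Sketch: print ready-to-paste lines for /srv/docker/litellm/.env.
# gen_key is a hypothetical helper, not part of this repo.

gen_key() {
    # 32 random bytes, hex-encoded (64 characters), behind the sk- prefix
    # used by the placeholders in .env.
    printf 'sk-%s\n' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

printf 'LITELLM_MASTER_KEY=%s\n' "$(gen_key)"
printf 'LITELLM_SALT_KEY=%s\n' "$(gen_key)"

# After pasting the two lines into .env on the box, apply the ownership
# and mode described above (needs root):
#   chown root:docker /srv/docker/litellm/.env
#   chmod 640 /srv/docker/litellm/.env
```

Run it once and keep the output; regenerating the master key means updating every client that sends it as a Bearer token.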

Bring up

cd /srv/docker/litellm
docker compose up -d
docker compose logs -f          # wait for "Application startup complete"

./smoke.sh                      # /health/readiness + /v1/models
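
Instead of watching the logs by hand, a script can poll the readiness endpoint until LiteLLM answers. The following is a sketch, not the contents of smoke.sh; wait_ready and its default retry count and delay are hypothetical.

```shell
#!/bin/sh
# wait_ready URL [TRIES] [DELAY]: poll URL until curl gets a non-error answer.
# Returns 0 as soon as one probe succeeds, 1 after TRIES failed probes.
wait_ready() {
    url=$1
    tries=${2:-30}
    delay=${3:-2}
    attempt=0
    while [ "$attempt" -lt "$tries" ]; do
        # -f turns HTTP errors (status >= 400) into a non-zero exit status;
        # --max-time keeps a hung connection from stalling the loop.
        if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example (from any host that resolves "framework"):
#   wait_ready http://framework:4000/health/readiness && ./smoke.sh
```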

Usage

From any client — pass LITELLM_MASTER_KEY as the Bearer token:

# List configured models
curl -fsS http://framework:4000/v1/models \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (routed to the backend behind model_name qwen3-coder in config.yaml)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'

/health (as opposed to /health/readiness) probes every backend, which is useful for diagnosing which one is down:

curl -fsS http://framework:4000/health \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
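
To reduce that output to just the failing backends, a small jq filter helps. This sketch assumes the /health response carries an unhealthy_endpoints array whose entries include model and api_base fields; check the shape against the LiteLLM version you run. unhealthy_backends is a hypothetical helper name.

```shell
#!/bin/sh
# unhealthy_backends: read a /health JSON response on stdin and print one
# "model<TAB>api_base" line per entry in .unhealthy_endpoints.
# Assumption: the response has that array; if it is absent, nothing is printed.
unhealthy_backends() {
    jq -r '.unhealthy_endpoints[]? | "\(.model)\t\(.api_base // "-")"'
}

# Example:
#   curl -fsS http://framework:4000/health \
#       -H "Authorization: Bearer $LITELLM_MASTER_KEY" | unhealthy_backends
```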

Backend down semantics

LiteLLM does not chain fallbacks across our model_list — each model_name maps to exactly one backend. If qwen3-235b is down, requests for model: qwen3-235b return 503. That's deliberate:

Silent fallback qwen3-235b → qwen3-coder would be confusing (different model, different quality, different latency profile).
The 235B has a manual-start workflow on purpose (GPU contention). Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side (OpenHands has LLM_FALLBACKS=...).
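
For scripts that have no fallback setting of their own, explicit client-side fallback can be sketched in a few lines of shell. chat and chat_first_available are hypothetical helpers; the only behavior relied on is that curl -f exits non-zero on an HTTP error response.

```shell
#!/bin/sh
# chat MODEL PROMPT: one chat completion through LiteLLM.
# curl -f makes any HTTP error (such as the 503 for a stopped backend)
# a non-zero exit status with no response body on stdout.
chat() {
    curl -fsS http://framework:4000/v1/chat/completions \
        -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
        -H "Content-Type: application/json" \
        -d "$(jq -n --arg m "$1" --arg p "$2" \
              '{model: $m, messages: [{role: "user", content: $p}]}')"
}

# chat_first_available PROMPT MODEL...: try each MODEL in order and stop at
# the first one that answers. Every model that fails is reported on stderr,
# so the fallback is never silent.
chat_first_available() {
    prompt=$1
    shift
    for model in "$@"; do
        if chat "$model" "$prompt"; then
            return 0
        fi
        echo "chat_first_available: $model did not answer" >&2
    done
    return 1
}

# Example: prefer the 235B, fall back to qwen3-coder if it is not up.
#   chat_first_available "Plan a refactor of foo.py" qwen3-235b qwen3-coder
```

Keeping the model order in the caller leaves the router fail-fast, and only the harness that opted in sees a different model.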

Adding a model

Edit config.yaml, run ./run.sh from pyinfra/framework/ on the Mac, then docker compose restart litellm on the box. The new model appears in /v1/models as soon as the container is back up. There is no state to migrate (database_url is null).
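
To confirm the new entry is being served after the restart, check the model list for its id. has_model is a hypothetical helper and my-new-model is a placeholder for whatever model_name you added.

```shell
#!/bin/sh
# has_model ID: read a /v1/models JSON response on stdin;
# exit 0 if some entry in .data has that id, non-zero otherwise.
has_model() {
    jq -e --arg id "$1" 'any(.data[]?; .id == $id)' >/dev/null
}

# Example:
#   curl -fsS http://framework:4000/v1/models \
#       -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#     | has_model my-new-model && echo "my-new-model is listed"
```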

Operations

docker compose logs -f                       # tail
docker compose restart litellm               # reload after config edit
docker compose down                          # stop
./smoke.sh                                   # health + model list

Pin manifest

Component	Pin
Image	`ghcr.io/berriai/litellm:main-stable`
Default port	4000
Auth	Master key only (no virtual keys, no database)
Backends	qwen3-coder (Ollama), qwen3-coder-llama (llama.cpp 30B), kimi-linear (vLLM), qwen3-235b (llama.cpp 235B)

Status

M1 — compose + config artifacts written; awaiting box-side bring-up. M2 will wire OpenHands' LLM_BASE_URL to this endpoint. M3 (the orchestrator on the other server) will point at this endpoint over Tailscale.