litellm

OpenAI-compatible router in front of all the local model backends on this box. One endpoint (http://framework:4000/v1) → four backends (Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from config.yaml in this repo; edits land on the box via ./run.sh + docker compose restart litellm.

When to route through LiteLLM vs direct

| Client | Recommended | Why |
|---|---|---|
| opencode (Mac) | Direct providers in opencode.json | One Mac, four providers fits cleanly in opencode's config schema. Adds no debugging burden. |
| OpenHands (the box) | Via LiteLLM | OpenHands wants a single LLM_BASE_URL + LLM_MODEL; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. |
| Future orchestrator (other server) | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. |
| OpenWebUI | Direct (already wired to vLLM) | No reason to add a hop. |
| Quick curl / scripts | Either | LiteLLM is convenient because you don't need to remember port numbers. |

The trade-off: a hop adds a few ms per request (negligible at our tok/s) and one more thing to debug when something breaks. The win is config centralization.
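To make the "config flip" concrete, here is a minimal sketch of the two settings a single-endpoint client would carry. LLM_BASE_URL and LLM_MODEL are the names used in the table above; the helper name `litellm_client_env` and the `openai/` model prefix are assumptions for illustration, so check what your client actually expects.

```shell
# Sketch only: print the two settings for a client that takes one base URL
# and one model name. The "openai/" prefix is an assumption (some clients
# want a provider prefix for OpenAI-compatible endpoints); adjust or drop it.
litellm_client_env() {
  local model="$1"
  printf 'LLM_BASE_URL=%s\n' "http://framework:4000/v1"
  printf 'LLM_MODEL=%s\n' "openai/${model}"
}

# Switching models changes exactly one line of output:
#   litellm_client_env qwen3-coder
#   litellm_client_env qwen3-235b
```

The base URL stays the same for every model, which is the centralization the trade-off above is buying.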

Prereqs

  • Pyinfra deploy has run (creates /srv/docker/litellm/{config.yaml,.env}).
  • Fill in /srv/docker/litellm/.env:
    LITELLM_MASTER_KEY=sk-<choose-something-stable>
    LITELLM_SALT_KEY=sk-<choose-something-stable>
    
    Mode 640 root:docker — readable by docker group at compose-parse time, not world-readable.
  • At least one backend running (otherwise /v1/models is empty and routing 404s — LiteLLM itself comes up regardless).
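One possible way to fill in the .env from the second prerequisite, as a sketch: generate two random keys once and write them with the stated ownership and mode. The function names `gen_key` and `gen_litellm_env` are hypothetical, and the key length is an arbitrary choice; only the variable names, path, mode, and owner come from this README.

```shell
# Sketch only: emit the two .env lines with freshly generated keys.
gen_key() {
  # 32 random bytes rendered as 64 hex characters, with the sk- prefix
  printf 'sk-%s' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

gen_litellm_env() {
  printf 'LITELLM_MASTER_KEY=%s\n' "$(gen_key)"
  printf 'LITELLM_SALT_KEY=%s\n' "$(gen_key)"
}

# Usage on the box (run once; re-running rotates both keys):
#   gen_litellm_env | sudo tee /srv/docker/litellm/.env >/dev/null
#   sudo chown root:docker /srv/docker/litellm/.env
#   sudo chmod 640 /srv/docker/litellm/.env
```

Because the keys are meant to stay stable, keep a copy of the generated values somewhere safe rather than regenerating them on each deploy.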

Bring up

cd /srv/docker/litellm
docker compose up -d
docker compose logs -f          # wait for "Application startup complete"

./smoke.sh                      # /health/readiness + /v1/models
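smoke.sh itself is not reproduced here. As a rough sketch of the two checks its comment names (readiness, then model list), assuming curl is installed and LITELLM_MASTER_KEY is exported; the function names `check` and `smoke` and the BASE_URL variable are hypothetical:

```shell
# Sketch only: not the contents of smoke.sh, just equivalent checks.
BASE_URL="${BASE_URL:-http://framework:4000}"

check() {
  # check <label> <path>: GET the path, report ok/FAIL, non-zero on failure
  local label="$1" path="$2"
  if curl -fsS --max-time 10 "${BASE_URL}${path}" \
       -H "Authorization: Bearer ${LITELLM_MASTER_KEY:-}" >/dev/null; then
    printf 'ok %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label"
    return 1
  fi
}

smoke() {
  # Run both checks even if the first fails, so one run shows the full picture
  local rc=0
  check readiness /health/readiness || rc=1
  check models /v1/models || rc=1
  return "$rc"
}

# Usage: smoke && echo "router is up"
```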

Usage

From any client — pass LITELLM_MASTER_KEY as the Bearer token:

# List configured models
curl -fsS http://framework:4000/v1/models \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (qwen3-coder via the qwen3-coder model_name in config)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'

/health (as opposed to /health/readiness) probes every backend, which is useful for diagnosing which one is down:

curl -fsS http://framework:4000/health \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
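When the full /health output is noisy, a small jq filter can narrow it to the failing backends. This is a sketch that assumes the response carries an `unhealthy_endpoints` array whose entries have `model` and `api_base` fields; confirm those names against the raw output on the box before relying on it. The function name `unhealthy_backends` is hypothetical.

```shell
# Sketch only: field names are assumed, verify against real /health output.
unhealthy_backends() {
  # Reads a /health JSON body on stdin and prints one tab-separated
  # "model  api_base" line per unhealthy endpoint (nothing if all are healthy).
  jq -r '.unhealthy_endpoints[]? | [.model, (.api_base // "-")] | @tsv'
}

# Usage:
#   curl -fsS http://framework:4000/health \
#       -H "Authorization: Bearer $LITELLM_MASTER_KEY" | unhealthy_backends
```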

Backend down semantics

LiteLLM does not chain fallbacks across our model_list — each model_name maps to exactly one backend. If qwen3-235b is down, requests for model: qwen3-235b return 503. That's deliberate:

  • Silent fallback qwen3-235b → qwen3-coder would be confusing (different model, different quality, different latency profile).
  • The 235B has a manual-start workflow on purpose (GPU contention). Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side (OpenHands has LLM_FALLBACKS=...).
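For scripts that have no fallback setting of their own, explicit client-side fallback could look like the sketch below: try each model in the order given and announce every switch on stderr, so the fallback is never silent. The function names `chat` and `chat_with_fallback` and the BASE_URL variable are hypothetical; curl and jq are assumed to be installed.

```shell
# Sketch only: ordered, visible fallback on the client side.
BASE_URL="${BASE_URL:-http://framework:4000}"

chat() {
  # chat <model> <prompt>: one chat completion; non-zero exit on HTTP errors
  local model="$1" prompt="$2"
  curl -fsS --max-time 600 "${BASE_URL}/v1/chat/completions" \
    -H "Authorization: Bearer ${LITELLM_MASTER_KEY:-}" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg m "$model" --arg p "$prompt" \
          '{model: $m, messages: [{role: "user", content: $p}]}')"
}

chat_with_fallback() {
  # chat_with_fallback <prompt> <model> [<model>...]
  local prompt="$1" model
  shift
  for model in "$@"; do
    if chat "$model" "$prompt"; then
      return 0
    fi
    echo "model ${model} failed, trying next" >&2
  done
  return 1
}

# Usage: chat_with_fallback "Plan a refactor of foo.py" qwen3-235b qwen3-coder
```

Keeping the model order in the caller's hands preserves the fail-fast behaviour of the router while letting an individual script opt in to a weaker model.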

Adding a model

Edit config.yaml → ./run.sh from pyinfra/framework/ on the Mac → docker compose restart litellm on the box. The new model appears in /v1/models as soon as the container is back up. No state to migrate (database_url is null).
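To confirm the last step worked, a script can poll /v1/models until the new name shows up. A minimal sketch, assuming curl and jq are installed; `model_listed`, `wait_for_model`, and BASE_URL are hypothetical names, and the retry count is arbitrary:

```shell
# Sketch only: check that a model_name is visible through the router.
BASE_URL="${BASE_URL:-http://framework:4000}"

model_listed() {
  # model_listed <name>: succeeds if <name> is one of the ids in /v1/models
  curl -fsS --max-time 10 "${BASE_URL}/v1/models" \
      -H "Authorization: Bearer ${LITELLM_MASTER_KEY:-}" \
    | jq -e --arg id "$1" 'any(.data[]; .id == $id)' >/dev/null
}

wait_for_model() {
  # wait_for_model <name> [tries]: poll once per second while the container restarts
  local name="$1" tries="${2:-10}" i=0
  while [ "$i" -lt "$tries" ]; do
    if model_listed "$name"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Usage (after docker compose restart litellm):
#   wait_for_model my-new-model 30 && echo "listed"
```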

Operations

docker compose logs -f                       # tail
docker compose restart litellm               # reload after config edit
docker compose down                          # stop
./smoke.sh                                   # health + model list

Pin manifest

| Component | Pin |
|---|---|
| Image | ghcr.io/berriai/litellm:main-stable |
| Default port | 4000 |
| Auth | Master key only (no virtual keys, no database) |
| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b |

Status

M1 — compose + config artifacts written; awaiting box-side bring-up. M2 will wire OpenHands' LLM_BASE_URL to this endpoint. M3 (the orchestrator on the other server) will point at this endpoint over Tailscale.