# litellm

OpenAI-compatible router in front of all the local model backends on this box. One endpoint (`http://framework:4000/v1`) → four backends (Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B).

Maintained from [`config.yaml`](config.yaml) in this repo; edits land on the box via `./run.sh` + `docker compose restart litellm`.

## When to route through LiteLLM vs direct

| Client | Recommended | Why |
|---|---|---|
| **opencode (Mac)** | Direct providers in `opencode.json` | One Mac, four providers fits cleanly in opencode's config schema. Adds no debugging burden. |
| **OpenHands (the box)** | Via LiteLLM | OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. |
| **Future orchestrator (other server)** | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. |
| **OpenWebUI** | Direct (already wired to vLLM) | No reason to add a hop. |
| **Quick curl / scripts** | Either | LiteLLM is convenient because you don't need to remember port numbers. |

The trade-off: a hop adds a few ms per request (negligible at our tok/s) and one more thing to debug when something breaks. The win is config centralization.

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/litellm/{config.yaml,.env}`).
- Fill in `/srv/docker/litellm/.env`:
  ```
  LITELLM_MASTER_KEY=sk-
  LITELLM_SALT_KEY=sk-
  ```
  Mode 640 root:docker: readable by the docker group at compose-parse time, not world-readable.
- At least one backend running (otherwise `/v1/models` is empty and routing 404s; LiteLLM itself comes up regardless).

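
The `.env` step above needs two `sk-`-prefixed secrets. A minimal sketch of one way to generate them follows; the helper names `gen_key` and `render_env` are hypothetical, and the 64-hex-character length is an assumption, not something this repo prescribes.

```sh
# Hypothetical helper: print "sk-" followed by 64 random hex characters.
# Reads 32 bytes from /dev/urandom and hex-encodes them with od.
gen_key() {
  printf 'sk-%s\n' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

# Hypothetical helper: print the two lines expected in the .env file,
# each with its own freshly generated key.
render_env() {
  printf 'LITELLM_MASTER_KEY=%s\n' "$(gen_key)"
  printf 'LITELLM_SALT_KEY=%s\n' "$(gen_key)"
}

render_env
```

The output could be redirected into `/srv/docker/litellm/.env`, followed by `chown root:docker` and `chmod 640` on that file to match the mode described above.
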
## Bring up

```sh
cd /srv/docker/litellm
docker compose up -d
docker compose logs -f      # wait for "Application startup complete"
./smoke.sh                  # /health/readiness + /v1/models
```

## Usage

From any client, pass `LITELLM_MASTER_KEY` as the Bearer token:

```sh
# List configured models
curl -fsS http://framework:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (qwen3-coder via the qwen3-coder model_name in config)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'
```

`/health` (no `/readiness`) hits every backend, which is useful for diagnosing which one is down:

```sh
curl -fsS http://framework:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```

## Backend down semantics

LiteLLM does **not** chain fallbacks across our model_list: each `model_name` maps to exactly one backend. If `qwen3-235b` is down, requests for `model: qwen3-235b` return 503. That's deliberate:

- Silent fallback `qwen3-235b → qwen3-coder` would be confusing (different model, different quality, different latency profile).
- The 235B has a manual-start workflow on purpose (GPU contention). Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side (OpenHands has `LLM_FALLBACKS=...`).

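
For scripts that do want explicit client-side fallback, the idea can be sketched as follows. The helper names `send_chat` and `chat_with_fallback` are hypothetical, and the sketch assumes `curl` and `jq` are installed on the client and that `LITELLM_MASTER_KEY` is exported. It relies on `curl -f` turning an HTTP error from the router into a non-zero exit status.

```sh
# Hypothetical helper: send one prompt to one model through the router.
# Prints the reply text on success; returns non-zero if the request fails.
send_chat() {
  # Build the JSON payload with jq so the prompt is escaped correctly.
  body=$(jq -n --arg m "$1" --arg p "$2" \
    '{model: $m, messages: [{role: "user", content: $p}]}') || return 1
  # Capture curl's output separately so its exit status is not masked by a pipe.
  reply=$(curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d "$body") || return 1
  printf '%s\n' "$reply" | jq -r '.choices[0].message.content'
}

# Hypothetical helper: try each model in the order given and stop at the
# first one that answers. Every skipped model is reported on stderr, so the
# fallback is never silent.
chat_with_fallback() {
  prompt=$1
  shift
  for model in "$@"; do
    if send_chat "$model" "$prompt"; then
      return 0
    fi
    echo "fallback: $model failed, trying next" >&2
  done
  return 1
}
```

A call such as `chat_with_fallback "Plan a refactor of foo.py" qwen3-235b qwen3-coder` would try the 235B first and only then the coder model. Keeping the order in the caller's hands leaves the router itself fail-fast.
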
## Adding a model

Edit [`config.yaml`](config.yaml) → `./run.sh` from `pyinfra/framework/` on the Mac → `docker compose restart litellm` on the box. New model appears in `/v1/models` immediately. No state to migrate (database_url is null).

## Operations

```sh
docker compose logs -f            # tail
docker compose restart litellm    # reload after config edit
docker compose down               # stop
./smoke.sh                        # health + model list
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `ghcr.io/berriai/litellm:main-stable` |
| Default port | 4000 |
| Auth | Master key only (no virtual keys, no database) |
| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b |

## Status

M1: compose + config artifacts written; awaiting box-side bring-up. M2 will wire OpenHands' `LLM_BASE_URL` to this endpoint. M3 (the orchestrator on the other server) will point at this endpoint over Tailscale.
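
For the box-side bring-up, a check along these lines could confirm that every model name in the pin manifest is listed by `/v1/models`. It is a sketch only: `missing_models` is a hypothetical helper and is not the contents of `smoke.sh`.

```sh
# Hypothetical helper: read model ids on stdin (one per line) and report
# which of the expected names given as arguments are absent.
# Returns 0 if all expected names are present, 1 otherwise.
missing_models() {
  listed=$(cat)
  status=0
  for want in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$listed" | grep -qxF -- "$want"; then
      echo "missing: $want"
      status=1
    fi
  done
  return $status
}
```

It would typically be fed from the model list, for example `curl -fsS http://framework:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq -r '.data[].id' | missing_models qwen3-coder qwen3-coder-llama kimi-linear qwen3-235b`.
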