pyinfra/framework/compose/litellm/README.md

# litellm

OpenAI-compatible router in front of all the local model backends on
this box. One endpoint (`http://framework:4000/v1`) → four backends
(Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from
[`config.yaml`](config.yaml) in this repo; edits land on the box via
`./run.sh` + `docker compose restart litellm`.

## When to route through LiteLLM vs direct

| Client | Recommended | Why |
|---|---|---|
| **opencode (Mac)** | Direct providers in `opencode.json` | Four providers on one Mac fit cleanly in opencode's config schema, and going direct adds no extra hop to debug. |
| **OpenHands (the box)** | Via LiteLLM | OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. |
| **Future orchestrator (other server)** | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. |
| **OpenWebUI** | Direct (already wired to vLLM) | No reason to add a hop. |
| **Quick curl / scripts** | Either | LiteLLM is convenient because you don't need to remember port numbers. |

The trade-off: a hop adds a few ms per request (negligible at our tok/s)
and one more thing to debug when something breaks. The win is config
centralization.

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/litellm/{config.yaml,.env}`).
- Fill in `/srv/docker/litellm/.env`:
  ```
  LITELLM_MASTER_KEY=sk-<choose-something-stable>
  LITELLM_SALT_KEY=sk-<choose-something-stable>
  ```
  Mode 640 root:docker — readable by docker group at compose-parse time,
  not world-readable.
- At least one backend running. LiteLLM itself comes up regardless, but
  requests for a model whose backend is down will fail (see "Backend
  down semantics" below).
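
One way to produce the two values and the file mode described above is sketched here. It assumes `head -c`, `od` and `/dev/urandom` are available; `gen_key` and `write_env` are hypothetical helper names, not part of this repo, and the `chown` step is left as a comment because it needs root.

```sh
#!/bin/sh
# Hypothetical helpers for filling in .env; adjust to taste.

# Print "sk-" followed by 32 random hex characters.
gen_key() {
    printf 'sk-%s\n' "$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

# write_env FILE: write both keys to FILE with mode 640.
# Runs in a subshell so the umask change does not leak to the caller.
write_env() (
    env_file=$1
    umask 027
    {
        printf 'LITELLM_MASTER_KEY=%s\n' "$(gen_key)"
        printf 'LITELLM_SALT_KEY=%s\n' "$(gen_key)"
    } > "$env_file"
    chmod 640 "$env_file"
)

# On the box, as root:
#   write_env /srv/docker/litellm/.env
#   chown root:docker /srv/docker/litellm/.env
```

Run it once and keep the result: regenerating the keys later invalidates whatever clients already hold the master key.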

## Bring up

```sh
cd /srv/docker/litellm
docker compose up -d
docker compose logs -f          # wait for "Application startup complete"

./smoke.sh                      # /health/readiness + /v1/models
```
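
If you would rather not watch the logs, the wait can be scripted. This is a minimal sketch, not the contents of `smoke.sh`: `wait_ready` is a hypothetical helper that polls a URL until `curl -f` succeeds, with the retry count and delay as illustrative defaults.

```sh
#!/bin/sh
# wait_ready URL [TRIES] [DELAY_SECONDS]
# Poll URL until it answers with a non-error HTTP status, or give up.
wait_ready() {
    url=$1
    tries=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
        if curl -fs -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $tries attempts: $url" >&2
    return 1
}

# Usage:
#   wait_ready http://framework:4000/health/readiness && ./smoke.sh
```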

## Usage

From any client — pass `LITELLM_MASTER_KEY` as the Bearer token:

```sh
# List configured models
curl -fsS http://framework:4000/v1/models \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (qwen3-coder via the qwen3-coder model_name in config)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'
```

`/health` (as opposed to `/health/readiness`) probes every configured
backend, which is useful for diagnosing which one is down:

```sh
curl -fsS http://framework:4000/health \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```
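
To reduce that output to just the failing backends, a small `jq` filter is usually enough. The sketch below assumes the response carries an `unhealthy_endpoints` array whose entries have `model` and `api_base` fields, which is the shape the LiteLLM docs show for `/health`; check it against the real output before relying on it. `unhealthy` is a hypothetical helper name.

```sh
#!/bin/sh
# Read a /health response on stdin and print one line per unhealthy
# backend: "<model> <api_base>". Prints nothing if all are healthy.
unhealthy() {
    jq -r '.unhealthy_endpoints[]? | "\(.model) \(.api_base)"'
}

# Usage:
#   curl -fsS http://framework:4000/health \
#       -H "Authorization: Bearer $LITELLM_MASTER_KEY" | unhealthy
```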

## Backend down semantics

LiteLLM does **not** chain fallbacks across our model_list — each
`model_name` maps to exactly one backend. If `qwen3-235b` is down,
requests for `model: qwen3-235b` return 503. That's deliberate:

- Silent fallback `qwen3-235b → qwen3-coder` would be confusing
  (different model, different quality, different latency profile).
- The 235B has a manual-start workflow on purpose (GPU contention).
  Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side
(OpenHands has `LLM_FALLBACKS=...`).
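
For scripts, an explicit client-side fallback can be as small as a loop over model names. This is a sketch under assumptions: `chat_with_fallback` is a hypothetical helper, it relies on `curl -f` returning non-zero when the router answers with an HTTP error, and it builds the JSON with `printf`, so the prompt must not contain double quotes or backslashes.

```sh
#!/bin/sh
# chat_with_fallback PROMPT MODEL...
# Try each model in order; print the first successful response body on
# stdout and report which model answered on stderr.
chat_with_fallback() {
    prompt=$1
    shift
    for model in "$@"; do
        body=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
            "$model" "$prompt")
        if curl -fsS http://framework:4000/v1/chat/completions \
                -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
                -H "Content-Type: application/json" \
                -d "$body"; then
            echo "answered by: $model" >&2
            return 0
        fi
        echo "request for $model failed, trying next" >&2
    done
    return 1
}

# Usage:
#   chat_with_fallback "ping" qwen3-235b qwen3-coder | jq '.choices[0].message.content'
```

Because the fallback order is spelled out at the call site and the answering model is logged, the downgrade stays visible rather than silent.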

## Adding a model

Edit [`config.yaml`](config.yaml) → `./run.sh` from `pyinfra/framework/`
on the Mac → `docker compose restart litellm` on the box. The new model
appears in `/v1/models` as soon as the container is back up. No state
to migrate (database_url is null).
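
A quick way to confirm the new entry landed is to check for its id in the model list. A minimal sketch, assuming `jq` 1.5 or newer; `model_listed` is a hypothetical helper name.

```sh
#!/bin/sh
# model_listed NAME
# Succeed if NAME is among the ids returned by /v1/models.
model_listed() {
    curl -fsS http://framework:4000/v1/models \
        -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
      | jq -e --arg name "$1" 'any(.data[]; .id == $name)' > /dev/null
}

# Usage:
#   model_listed qwen3-235b && echo "listed" || echo "missing"
```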

## Operations

```sh
docker compose logs -f                       # tail
docker compose restart litellm               # reload after config edit
docker compose down                          # stop
./smoke.sh                                   # health + model list
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `ghcr.io/berriai/litellm:main-stable` |
| Default port | 4000 |
| Auth | Master key only (no virtual keys, no database) |
| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b |

## Status

M1 — compose + config artifacts written; awaiting box-side bring-up.
M2 will wire OpenHands' `LLM_BASE_URL` to this endpoint. M3 (the
orchestrator on the other server) will point at this endpoint over
Tailscale.