progress 235b
pyinfra/framework/compose/litellm/README.md
# litellm

OpenAI-compatible router in front of all the local model backends on
this box. One endpoint (`http://framework:4000/v1`) → four backends
(Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from
[`config.yaml`](config.yaml) in this repo; edits land on the box via
`./run.sh` + `docker compose restart litellm`.
## When to route through LiteLLM vs direct

| Client | Recommended | Why |
|---|---|---|
| **opencode (Mac)** | Direct providers in `opencode.json` | One Mac, four providers fits cleanly in opencode's config schema. Adds no debugging burden. |
| **OpenHands (the box)** | Via LiteLLM | OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. |
| **Future orchestrator (other server)** | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. |
| **OpenWebUI** | Direct (already wired to vLLM) | No reason to add a hop. |
| **Quick curl / scripts** | Either | LiteLLM is convenient because you don't need to remember port numbers. |

The trade-off: a hop adds a few ms per request (negligible at our tok/s)
and one more thing to debug when something breaks. The win is config
centralization.
## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/litellm/{config.yaml,.env}`).
- Fill in `/srv/docker/litellm/.env`:
  ```
  LITELLM_MASTER_KEY=sk-<choose-something-stable>
  LITELLM_SALT_KEY=sk-<choose-something-stable>
  ```
  Set mode 640, owner root:docker, so the file is readable by the docker
  group at compose-parse time but not world-readable.
- At least one backend running. LiteLLM itself comes up regardless, but
  requests routed to a stopped backend fail (see "Backend down semantics").
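
One way to produce the two `.env` values is to generate them rather than invent them by hand. The sketch below is only an illustration: `gen_key` and `render_env` are hypothetical helper names, and it assumes `head -c`, `od`, and `/dev/urandom` are available on the box.

```sh
# Sketch: print a LiteLLM .env body with two random keys.
# gen_key and render_env are hypothetical helpers, not part of this repo.

gen_key() {
  # 32 random bytes, hex-encoded (64 characters), with the sk- prefix
  printf 'sk-%s\n' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

render_env() {
  printf 'LITELLM_MASTER_KEY=%s\n' "$(gen_key)"
  printf 'LITELLM_SALT_KEY=%s\n' "$(gen_key)"
}

# On the box, run once (re-running changes the key every client uses):
#   render_env | sudo tee /srv/docker/litellm/.env >/dev/null
#   sudo chown root:docker /srv/docker/litellm/.env
#   sudo chmod 640 /srv/docker/litellm/.env
```

The functions only print to stdout; the commented `tee`/`chown`/`chmod` lines are what actually touch the file, so the permission step stays explicit.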
## Bring up

```sh
cd /srv/docker/litellm
docker compose up -d
docker compose logs -f      # wait for "Application startup complete"

./smoke.sh                  # /health/readiness + /v1/models
```
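If watching the logs by hand is inconvenient (for example in a script), the wait can be expressed as a polling loop. This is a minimal sketch, not what `smoke.sh` does: `retry` and `wait_for_litellm` are hypothetical names, and it assumes `/health/readiness` answers without the Bearer token (add the `Authorization` header if it does not on your setup).

```sh
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or attempts run out.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1            # gave up after ATTEMPTS tries
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Hypothetical wrapper: poll readiness every 2 s, up to 30 tries.
wait_for_litellm() {
  retry 30 2 curl -fsS -o /dev/null http://framework:4000/health/readiness
}

# Usage:
#   docker compose up -d && wait_for_litellm && ./smoke.sh
```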
## Usage

From any client, pass `LITELLM_MASTER_KEY` as the Bearer token:

```sh
# List configured models
curl -fsS http://framework:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (qwen3-coder via the qwen3-coder model_name in config)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'
```
`/health` (without the `/readiness` suffix) probes every backend, which
is useful for diagnosing which one is down:

```sh
curl -fsS http://framework:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```
## Backend down semantics

LiteLLM does **not** chain fallbacks across our `model_list`: each
`model_name` maps to exactly one backend. If `qwen3-235b` is down,
requests for `model: qwen3-235b` return 503. That's deliberate:

- Silent fallback `qwen3-235b → qwen3-coder` would be confusing
  (different model, different quality, different latency profile).
- The 235B has a manual-start workflow on purpose (GPU contention).
  Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side
(OpenHands has `LLM_FALLBACKS=...`).
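
For scripts, the fail-fast behaviour described above can be surfaced by looking at the HTTP status instead of the response body. The following is a rough sketch under that assumption: `explain_status` and `probe_model` are hypothetical helpers, and only the 200 and 503 cases come from this README; any other code is just reported as-is.

```sh
# Map an HTTP status to a short hint (hypothetical helper).
explain_status() {
  case $1 in
    200) echo "ok" ;;
    503) echo "backend down: start it, or pick another model" ;;
    *)   echo "unexpected HTTP $1" ;;
  esac
}

# Send a one-token request and report the outcome for MODEL.
# No -f here: we want the status code even when the request fails.
probe_model() {
  model=$1
  code=$(curl -sS -o /dev/null -w '%{http_code}' \
    http://framework:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 1}")
  printf '%s: %s\n' "$model" "$(explain_status "$code")"
}

# Usage:
#   probe_model qwen3-235b
```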
## Adding a model

Edit [`config.yaml`](config.yaml) → `./run.sh` from `pyinfra/framework/`
on the Mac → `docker compose restart litellm` on the box. New model
appears in `/v1/models` immediately. No state to migrate (`database_url`
is null).
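
After the restart, a quick check that the new `model_name` is actually listed can save a confused debugging session. A minimal sketch, assuming the usual OpenAI-style `{"data": [{"id": ...}]}` response shape; `model_listed` is a hypothetical helper, and it uses `tr`/`grep` instead of `jq` only to stay dependency-free (a `jq` filter would be stricter).

```sh
# model_listed NAME: read a /v1/models JSON response on stdin and succeed
# if some entry has exactly "id": "NAME". Whitespace is stripped first so
# the match does not depend on how the JSON is formatted.
model_listed() {
  tr -d ' \n\t' | grep -Fq "\"id\":\"$1\""
}

# Usage (the model name here is a placeholder for whatever you added):
#   curl -fsS http://framework:4000/v1/models \
#     -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#     | model_listed my-new-model && echo "listed"
```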
## Operations

```sh
docker compose logs -f            # tail
docker compose restart litellm    # reload after config edit
docker compose down               # stop
./smoke.sh                        # health + model list
```
## Pin manifest

| Component | Pin |
|---|---|
| Image | `ghcr.io/berriai/litellm:main-stable` |
| Default port | 4000 |
| Auth | Master key only (no virtual keys, no database) |
| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b |
## Status

M1: compose + config artifacts written; awaiting box-side bring-up.
M2 will wire OpenHands' `LLM_BASE_URL` to this endpoint. M3 (the
orchestrator on the other server) will point at this endpoint over
Tailscale.