progress 235b
122
pyinfra/framework/compose/litellm/README.md
Normal file
@@ -0,0 +1,122 @@
# litellm

OpenAI-compatible router in front of all the local model backends on
this box. One endpoint (`http://framework:4000/v1`) → four backends
(Ollama, llama.cpp 30B, vLLM Kimi, llama.cpp 235B). Maintained from
[`config.yaml`](config.yaml) in this repo; edits land on the box via
`./run.sh` + `docker compose restart litellm`.

## When to route through LiteLLM vs direct

| Client | Recommended | Why |
|---|---|---|
| **opencode (Mac)** | Direct providers in `opencode.json` | One Mac with four providers fits cleanly in opencode's config schema, and going direct adds no debugging burden. |
| **OpenHands (the box)** | Via LiteLLM | OpenHands wants a single `LLM_BASE_URL` + `LLM_MODEL`; switching between qwen3-coder and qwen3-235b for different tasks is a config flip, not a re-wire. |
| **Future orchestrator (other server)** | Via LiteLLM | Same reason as OpenHands. Also: one URL to firewall/Tailscale-expose. |
| **OpenWebUI** | Direct (already wired to vLLM) | No reason to add a hop. |
| **Quick curl / scripts** | Either | LiteLLM is convenient because you don't need to remember port numbers. |

The trade-off: a hop adds a few ms per request (negligible at our tok/s)
and one more thing to debug when something breaks. The win is config
centralization.

## Prereqs

- Pyinfra deploy has run (creates `/srv/docker/litellm/{config.yaml,.env}`).
- Fill in `/srv/docker/litellm/.env`:
  ```
  LITELLM_MASTER_KEY=sk-<choose-something-stable>
  LITELLM_SALT_KEY=sk-<choose-something-stable>
  ```
  Mode 640 root:docker: readable by the docker group at compose-parse time,
  not world-readable.
- At least one backend running. LiteLLM itself comes up regardless and
  `/v1/models` still lists every configured `model_name`, but requests
  routed to a stopped backend fail (see "Backend down semantics").
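One way to pick stable values for the two keys is to generate them once and paste them into `.env`. The following is a minimal sketch, assuming Python 3 is available; the `make_key` helper is hypothetical (not part of this repo), and the `sk-` prefix simply mirrors the placeholders above.

```python
import secrets


def make_key(prefix: str = "sk-", nbytes: int = 32) -> str:
    """Return a random URL-safe key with the given prefix (hypothetical helper)."""
    return prefix + secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Print lines in the same KEY=value shape the .env file expects.
    print(f"LITELLM_MASTER_KEY={make_key()}")
    print(f"LITELLM_SALT_KEY={make_key()}")
```

Run it once and keep the output: regenerating the master key later means updating every client that sends it as a Bearer token.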

## Bring up

```sh
cd /srv/docker/litellm
docker compose up -d
docker compose logs -f    # wait for "Application startup complete"

./smoke.sh                # /health/readiness + /v1/models + /health
```

## Usage

From any client, pass `LITELLM_MASTER_KEY` as the Bearer token:

```sh
# List configured models
curl -fsS http://framework:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[].id'

# Chat completion (qwen3-coder via the qwen3-coder model_name in config)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.choices[0].message.content'

# Long-task model (assumes you stopped 30B/Kimi and brought qwen3-235b up)
curl -fsS http://framework:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-235b", "messages": [{"role": "user", "content": "Plan a refactor of foo.py"}], "max_tokens": 2000}'
```
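For scripts that would rather not shell out to curl, the same chat completion call can be sketched with the Python standard library. This is an illustrative sketch, not something shipped in this commit: `build_chat_request` and `chat` are hypothetical helper names, and the endpoint and model names are the ones used in the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://framework:4000/v1"  # endpoint from this README


def build_chat_request(model: str, prompt: str, api_key: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(model: str, prompt: str, api_key: str, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message content."""
    req = build_chat_request(model, prompt, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

A call such as `chat("qwen3-coder", "ping", os.environ["LITELLM_MASTER_KEY"])` mirrors the second curl example. Splitting request construction from sending keeps the payload and headers easy to inspect without a running proxy.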

`/health` (without `/readiness`) dials every backend, which is useful
for diagnosing which one is down:

```sh
curl -fsS http://framework:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```
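To turn that response into a short "what is down" list, a small helper can pick out the unhealthy entries. This is a sketch under one assumption: the response uses the `healthy_endpoints` / `unhealthy_endpoints` keys, each holding objects with a `model` field, which is the shape `smoke.sh` in this commit reads. The `unhealthy_models` name is hypothetical.

```python
from typing import Any


def unhealthy_models(health: dict[str, Any]) -> list[str]:
    """Return the `model` field of every entry under `unhealthy_endpoints`."""
    return [e.get("model", "?") for e in health.get("unhealthy_endpoints", [])]
```

Feed it the parsed JSON from the curl call above; an empty list means every configured backend answered.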

## Backend down semantics

LiteLLM does **not** chain fallbacks across our model_list: each
`model_name` maps to exactly one backend. If `qwen3-235b` is down,
requests for `model: qwen3-235b` fail fast (503 / connection refused)
instead of being rerouted. That's deliberate:

- Silent fallback `qwen3-235b → qwen3-coder` would be confusing
  (different model, different quality, different latency profile).
- The 235B has a manual-start workflow on purpose (GPU contention).
  Fail-fast surfaces the "you forgot to bring it up" case immediately.

If you want explicit fallback for a specific harness, add it client-side
(OpenHands has `LLM_FALLBACKS=...`).
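For a harness without a built-in fallback setting, explicit client-side fallback can be sketched as a small wrapper. This is a hedged illustration, not part of this repo: `complete_with_fallback` is a hypothetical name, and `send` stands for whatever function the harness uses to call the proxy for a given model.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    models: Sequence[str],
    send: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success.

    `send` takes a model name and returns the reply text, raising an
    exception when the backend is unavailable.
    """
    errors: list[str] = []
    for model in models:
        try:
            return model, send(model)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model that actually answered keeps the fallback explicit rather than silent, so the caller can log or surface that a different model was used.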

## Adding a model

Edit [`config.yaml`](config.yaml) → `./run.sh` from `pyinfra/framework/`
on the Mac → `docker compose restart litellm` on the box. The new model
appears in `/v1/models` as soon as the restart completes. No state to
migrate (`database_url` is null).
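Before pushing an edited `config.yaml`, a quick sanity check of the `model_list` entries can catch common slips. The sketch below is an assumption-laden illustration: `check_model_list` is a hypothetical helper, it takes the already-parsed `model_list` (loading the YAML is left to the caller), and the rules it checks are the conventions described in this commit (unique `model_name`, `model: openai/<served-name>`, `api_base` pointing at the backend's `/v1` root).

```python
def check_model_list(model_list: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: list[str] = []
    seen: set[str] = set()
    for entry in model_list:
        name = entry.get("model_name", "")
        params = entry.get("litellm_params", {})
        if not name:
            problems.append("entry without model_name")
        elif name in seen:
            problems.append(f"duplicate model_name: {name}")
        seen.add(name)
        # Convention from config.yaml: route through the openai-compatible adapter.
        if not str(params.get("model", "")).startswith("openai/"):
            problems.append(f"{name}: model should look like openai/<served-name>")
        # Convention from config.yaml: api_base is the backend's /v1 root.
        if not str(params.get("api_base", "")).rstrip("/").endswith("/v1"):
            problems.append(f"{name}: api_base should be the backend's /v1 root")
    return problems
```

These checks only cover the shape of the entries; whether the backend is actually reachable is still a job for `/health`.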

## Operations

```sh
docker compose logs -f            # tail
docker compose restart litellm    # reload after config edit
docker compose down               # stop
./smoke.sh                        # health + model list
```

## Pin manifest

| Component | Pin |
|---|---|
| Image | `ghcr.io/berriai/litellm:main-stable` |
| Default port | 4000 |
| Auth | Master key only (no virtual keys, no database) |
| Backends | qwen3-coder (Ollama), qwen3-coder-llama, kimi-linear (vLLM), qwen3-235b |

## Status

M1: compose + config artifacts written; awaiting box-side bring-up.
M2 will wire OpenHands' `LLM_BASE_URL` to this endpoint. M3 (the
orchestrator on the other server) will point at this endpoint over
Tailscale.
73
pyinfra/framework/compose/litellm/config.yaml
Normal file
@@ -0,0 +1,73 @@
# LiteLLM model routing. model_name is what clients request; the
# litellm_params block is how LiteLLM reaches the backend.
#
# `model: openai/<served-name>` tells LiteLLM to use its
# openai-compatible adapter and forward <served-name> to the backend.
# api_base is the backend's /v1 root reachable from inside the LiteLLM
# container (host.docker.internal = host's docker0 IP via the
# extra_hosts entry in litellm.yml).
#
# Backend running-state matters: requests to a stopped backend return
# 503/connection-refused. By design: no fallback chain, since these
# backends compete for GPU and silently routing "qwen3-235b" to the 30B
# would be more confusing than failing fast.
#
# Edits here require `./run.sh` on the Mac to push to the box, then
# `docker compose restart litellm` on the box to reload.

model_list:
  # Daily-driver coding model. Ollama with gfx1100-coerced ROCm;
  # currently the default opencode provider. Always-resident
  # (OLLAMA_KEEP_ALIVE=24h).
  - model_name: qwen3-coder
    litellm_params:
      model: openai/qwen3-coder:30b
      api_base: http://host.docker.internal:11434/v1
      api_key: dummy

  # Same weights as qwen3-coder above but served via llama.cpp on the
  # kyuz0 rocm-7.2.2 image (native gfx1151 + rocWMMA). LL-P0 measures
  # whether the eval_tps win justifies switching default opencode to
  # this. Manual start until then.
  - model_name: qwen3-coder-llama
    litellm_params:
      model: openai/qwen3-coder
      api_base: http://host.docker.internal:8080/v1
      api_key: dummy

  # Long-context chat (no tool calling) via vLLM. P0 verified at 32K;
  # context ramp tracked in kimi-linear/NEXT_STEPS.md.
  - model_name: kimi-linear
    litellm_params:
      model: openai/kimi-linear
      api_base: http://host.docker.internal:8000/v1
      api_key: dummy

  # Long-task model: Qwen3-235B-A22B-Instruct-2507 UD-Q2_K_XL via
  # llama.cpp on port 8081. ~5-10 tok/s decode; manual start only
  # (can't coexist with the other GPU services). Requests will fail
  # with connection refused when the container is down; that's the
  # intended UX: a stopped service is a clear signal.
  - model_name: qwen3-235b
    litellm_params:
      model: openai/qwen3-235b
      api_base: http://host.docker.internal:8081/v1
      api_key: dummy

litellm_settings:
  # Never let LiteLLM silently strip request params. With
  # drop_params: true it drops params it considers unsupported for the
  # target provider; that is fine for hosted models, but our backends
  # vary (vLLM, llama.cpp, Ollama) and each accepts a slightly different
  # superset. Let the backend reject what it can't use rather than
  # LiteLLM silently filtering.
  drop_params: false
  # Response caching off: every request should reach its backend, so a
  # stopped or restarted backend is visible immediately.
  cache: false
  # Verbose (debug) logging off; tail the regular proxy logs with
  # `docker compose logs litellm`.
  set_verbose: false

general_settings:
  # Disable LiteLLM's database mode: we're stateless. No user/key/spend
  # tracking needed for a single-user trusted LAN setup.
  database_url: null
50
pyinfra/framework/compose/litellm/smoke.sh
Normal file
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Smoke-test the LiteLLM proxy. /health/readiness for liveness, /v1/models
# for the configured backends, then /health (which dials every backend)
# to surface which ones are actually reachable right now.
set -euo pipefail

HOST="${LITELLM_HOST:-127.0.0.1:4000}"
# Use LITELLM_MASTER_KEY from the environment if set; otherwise read it
# from the sibling .env (the README runs ./smoke.sh from
# /srv/docker/litellm, next to .env).
if [[ -z "${LITELLM_MASTER_KEY:-}" && -f "$(dirname "$0")/.env" ]]; then
  # shellcheck disable=SC1091
  source "$(dirname "$0")/.env"
fi
if [[ -z "${LITELLM_MASTER_KEY:-}" ]]; then
  echo "[smoke] LITELLM_MASTER_KEY not set: export it or populate /srv/docker/litellm/.env" >&2
  exit 1
fi

echo "[smoke] GET /health/readiness on $HOST (proxy alive?)"
curl -fsS "http://$HOST/health/readiness" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | python3 -m json.tool

echo
echo "[smoke] GET /v1/models (configured model_names)"
curl -fsS "http://$HOST/v1/models" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | python3 -c "
import json, sys
r = json.load(sys.stdin)
for m in r.get('data', []):
    print(f\" - {m.get('id', '?')}\")"

echo
echo "[smoke] GET /health (each backend's reachability; slow, ~10s)"
curl -fsS "http://$HOST/health" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | python3 -c "
import json, sys
r = json.load(sys.stdin)
healthy = r.get('healthy_endpoints', [])
unhealthy = r.get('unhealthy_endpoints', [])
print(f' healthy: {len(healthy)}')
for e in healthy:
    print(f' + {e.get(\"model\", \"?\")}')
print(f' unhealthy: {len(unhealthy)}')
for e in unhealthy:
    print(f' - {e.get(\"model\", \"?\")}: {e.get(\"error\", \"?\")[:80]}')"

echo
echo "[smoke] passed: proxy up, model list populated. Unhealthy backends are expected if their compose stacks are down."