Soon you are juggling vLLM, llama.cpp, and more—each stack on its own port. Everything downstream still wants one /v1 base URL; otherwise you keep shuffling ports, profiles, and one-off scripts. llama-swap is the single /v1 proxy that sits in front of those stacks.
llama-swap provides one OpenAI- and Anthropic-compatible front door, with a YAML file that maps each model name to the command that starts the right upstream. Request a model and the proxy starts or swaps to it; configure TTLs and groups when VRAM is tight or several models must coexist. This guide covers install paths, a practical config.yaml, the HTTP surface, and the failure modes that show up once streaming and reverse proxies enter the picture.
For a broader comparison of LLM hosting options, see LLM Hosting in 2026: Local, Self-Hosted & Cloud Infrastructure Compared.
llama-swap model switcher overview for OpenAI-compatible local LLM APIs
llama-swap is a lightweight proxy server built around a simple operational model: one binary, one YAML config file, no dependencies. It's written in Go, which means a single static binary beside the rest of the stack—no Python runtime or desktop app required. It sits in front of any OpenAI- and Anthropic-compatible upstream as the model-swapping layer.
Conceptually, this answers a very practical question that comes up in local LLM stacks:
How do I switch models with an OpenAI-compatible client?
With llama-swap you keep using normal /v1/... requests, but you change the model you request. llama-swap reads that model value, loads the matching server configuration, and if the "wrong" upstream is running it swaps it out for the correct one.
A few design details matter for production-ish setups:
llama-swap is MIT-licensed with no telemetry—still worth confirming for any host that sees real prompts.
It is built for on-demand loading of backends like llama.cpp, vLLM, Whisper, and stable-diffusion.cpp, not for locking you to a single inference engine.
Out of the box (no special grouping), it runs one model at a time: request a different model and it stops the current upstream and starts the right one. For more than one resident model or finer control over coexistence, configure groups.
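That request-time rule can be sketched as a tiny decision function. This is a hypothetical illustration of the behaviour just described, not llama-swap's actual code; `plan_request` and its arguments are invented names:

```python
def plan_request(requested, running, coexist=False):
    """Sketch of the per-request decision llama-swap makes (illustrative only)."""
    if running is None:
        return f"start {requested}"                  # nothing loaded yet: cold start
    if requested == running or coexist:
        return f"route to {requested}"               # already resident, or group allows coexistence
    return f"stop {running}, start {requested}"      # default: one model at a time, so swap

print(plan_request("qwen-coder", None))              # start qwen-coder
print(plan_request("qwen-coder", "qwen-coder"))      # route to qwen-coder
print(plan_request("llama-chat", "qwen-coder"))      # stop qwen-coder, start llama-chat
```

The third call is the interesting one: a plain OpenAI-style request triggers a process swap, which is exactly what a runtime like llama-server alone does not do.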
Here's the mental model most developers find useful:
flowchart LR
C[Your app or SDK\nOpenAI-compatible client] -->|/v1/chat/completions\nmodel = qwen-coder| LS[llama-swap proxy\nsingle endpoint]
LS -->|starts or routes to| U1[Upstream server A\nllama-server]
LS -->|starts or routes to| U2[Upstream server B\nvLLM OpenAI server]
LS --> M[Management endpoints\nrunning, unload, events, metrics]
This is also why a model switcher proxy is different from "just running a model": it's orchestration and routing on top of one or more inference servers.
llama-swap vs Ollama vs LM Studio vs llama.cpp server
All four options can give you a "local LLM API", but they optimise for different workflows. The fastest way to choose is to decide whether you want a runtime (model download + execution) or a router/proxy (switching + orchestration across runtimes).
llama-swap
llama-swap focuses on being a transparent proxy that supports OpenAI-compatible endpoints (including /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models) and routes requests to the correct upstream based on the requested model. It also provides non-inference operational endpoints such as /running, /logs/stream, and a Web UI at /ui.
Ollama
Ollama exposes its own HTTP API (POST /api/chat, POST /api/generate, and the usual local default on port 11434).
keep_alive controls how long a model stays loaded, including 0 to unload immediately.
It fits users who want to pull a model and chat with minimal wiring. llama-swap, by contrast, fits per-model commands, mixed backends, and one OpenAI-shaped URL for every client; orchestrating vLLM next to llama-server with different flags per model is outside Ollama's scope.
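For contrast with llama-swap's config-driven TTLs, Ollama's unload knob travels in the request body. A minimal sketch, assuming a model named llama3.2: POSTing this payload to /api/generate with keep_alive set to 0 asks Ollama to unload the model immediately after the call.

```python
import json

# Body for POST http://localhost:11434/api/generate
# keep_alive: 0 tells Ollama to unload the model right after this request.
payload = json.dumps({"model": "llama3.2", "keep_alive": 0})
print(payload)
```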
LM Studio
LM Studio is a desktop app with a local API server from the Developer tab (localhost or LAN), including OpenAI-compatible and Anthropic-compatible modes, plus lms server start from the terminal.
It suits a GUI-first loop: browse models, click, test. llama-swap suits a server-style role: YAML, process supervision, mixed upstreams, no desktop session.
llama.cpp server
llama-server exposes /v1/completions, /v1/chat/completions, /v1/responses, and the usual pattern is to point an OpenAI client at it via base_url.
llama.cpp also ships a router mode: run llama-server as a router with --models-dir, then POST /models/load and POST /models/unload to juggle GGUF models without a separate proxy.
If every model sits under one llama.cpp router, an extra proxy is often unnecessary. When llama.cpp must sit beside vLLM or other OpenAI-shaped servers, llama-swap provides one /v1 surface and many processes behind it.
For similar OpenAI-compatible hosting solutions, see LocalAI QuickStart: Run OpenAI-Compatible LLMs Locally or SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API.
Install llama-swap model switcher with Docker, Homebrew, WinGet, or binaries
Linux, macOS, and Windows are all first-class: Docker, Homebrew, WinGet, GitHub binaries, or build from source. Common choices: Docker on headless servers, Homebrew or WinGet on workstations, standalone binaries when the install footprint should stay minimal.
Docker install
Pull an image that matches your hardware. Images track upstream closely (nightly builds) and cover CUDA, Vulkan, Intel, MUSA, and CPU—pick the one that matches how you actually accelerate, not "latest" by habit.
# Example platform pulls
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa
docker pull ghcr.io/mostlygeek/llama-swap:cpu
Prefer the non-root image variants when you can: less to regret if the container boundary is ever wrong.
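On a headless server, a compose file keeps the run flags reproducible. The sketch below is illustrative only: the container-internal config path, the models mount, and the listen port are assumptions to verify against the image documentation, and GPU device wiring depends on your container runtime.

```yaml
# docker-compose.yaml (illustrative; verify paths and port against the image docs)
services:
  llama-swap:
    image: ghcr.io/mostlygeek/llama-swap:cuda
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/app/config.yaml   # assumed config location inside the container
      - ./models:/models                 # weights visible to the upstream commands
    # For the CUDA image, add your runtime's GPU device reservation here
    # (e.g. the NVIDIA Container Toolkit's device requests).
```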
Homebrew install
On macOS and Linux, use the tap and install:
brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
WinGet install
On Windows:
winget install llama-swap
winget upgrade llama-swap
Pre-built binaries and releases
GitHub Releases ships Linux, macOS, Windows, and FreeBSD binaries if you do not want a package manager.
Release numbers move fast (for example v198, v197 around early 2026)—pin a version in automation rather than floating "whatever was there yesterday".
Configure llama-swap with config.yaml for model swapping, TTL, and groups
Everything in llama-swap is configuration-driven. The minimal viable configuration is simply a models: dictionary and a cmd for each model, often launching llama-server with ${PORT} substituted per model.
The configuration system goes much further than just "start a process", and a few options are worth understanding early because they directly answer common FAQ-style problems (auto-unloading, security, and clients that rely on /v1/models).
Global settings you will actually use
healthCheckTimeout is how long llama-swap waits for a model to become healthy after startup (default 120s, minimum 15s). For multi‑GB loads on slow disks, bump this before you blame the proxy.
globalTTL is idle seconds before auto-unload; default 0 means "never unload" unless you set it—explicitly choose TTLs for anything beyond a toy setup so VRAM does not fill with forgotten models.
startPort seeds the ${PORT} macro (default 5800); assignment is deterministic by alphabetical model ID, which is a feature when you debug "who grabbed which port" and a footgun if you rename models carelessly.
includeAliasesInList decides whether aliases show up as separate rows in /v1/models; turn it on if your UI only offers enumerated models.
apiKeys gates anything reachable off localhost: Basic, Bearer, or x-api-key. llama-swap strips those headers before forwarding so upstream logs are less likely to retain client secrets.
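The deterministic port rule is easier to reason about once written down. A sketch under the rule as stated above (`assign_ports` is a hypothetical helper, not llama-swap code): sort model IDs alphabetically and hand out ports from startPort upwards.

```python
def assign_ports(model_ids, start_port=5800):
    # ${PORT} values follow alphabetical model-ID order, seeded by startPort.
    return {mid: start_port + i for i, mid in enumerate(sorted(model_ids))}

ports = assign_ports(["qwen-coder", "llama-chat", "vllm-coder"])
print(ports)  # {'llama-chat': 5800, 'qwen-coder': 5801, 'vllm-coder': 5802}
```

This also shows the footgun: rename llama-chat to zeta-chat and every model after it in sort order shifts to a different port.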
Model-level settings that unlock production ergonomics
Per model, cmd is the only required field.
proxy defaults to http://localhost:${PORT}—that is the forwarding target for that model's upstream.
checkEndpoint defaults to /health; set "none" when the backend has no health route or cold start exceeds what you are willing to wait for—do not leave a broken /health and wonder why nothing reaches ready.
ttl: -1 inherits globalTTL, 0 never unloads, N > 0 unloads after N seconds idle—use per-model TTL when one model should linger and another should vanish quickly.
aliases and useModelName keep stable client-facing names while satisfying upstreams that require a specific identifier.
cmdStop is non-optional for containers: map it to docker stop (or equivalent); without it you get POSIX SIGTERM / Windows taskkill against whatever PID llama-swap started—fine for a bare binary, wrong for a container name.
concurrencyLimit caps parallel requests per model with HTTP 429 when exceeded—set it when you would rather shed load than queue forever.
groups cover coexistence (swap, exclusive) and always-on models (persistent). Hooks can preload on startup; if you preload several models at once without a group, expect them to fight—define a group first so preloading matches how you want models to share the GPU.
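The TTL semantics above reduce to a couple of lines. A hypothetical sketch of the rule as described, not llama-swap's implementation:

```python
def effective_ttl(model_ttl, global_ttl):
    # -1 inherits globalTTL; 0 means never unload; N > 0 is idle seconds.
    return global_ttl if model_ttl == -1 else model_ttl

def should_unload(idle_seconds, ttl):
    return ttl > 0 and idle_seconds >= ttl

print(effective_ttl(-1, 900))     # 900  (inherits globalTTL)
print(should_unload(1200, 900))   # True  (idle past the TTL)
print(should_unload(1200, 0))     # False (pinned, never unloads)
```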
Minimal config.yaml example for llama.cpp and vLLM
This example aims to be "just enough" to illustrate best-practice knobs: a default TTL, explicit health checking, stable aliases, and a group that keeps a small "always-on" model running while bigger chat models swap.
# config.yaml
healthCheckTimeout: 180
globalTTL: 900              # 15 minutes idle then unload
includeAliasesInList: true
startPort: 5800

# Optional but recommended for anything beyond localhost development
apiKeys:
  - "${env.LLAMASWAP_API_KEY}"

models:
  llama-chat:
    cmd: |
      llama-server --port ${PORT} --model /models/llama-chat.gguf
        --ctx-size 8192
    aliases:
      - "llama-chat-latest"
    # Uses defaults:
    #   proxy: http://localhost:${PORT}
    #   checkEndpoint: /health
    #   ttl: -1 (inherit globalTTL)

  qwen-coder:
    cmd: |
      llama-server --port ${PORT} --model /models/qwen-coder.gguf
        --ctx-size 8192
    aliases:
      - "qwen-coder-latest"

  vllm-coder:
    # Illustrative pattern: manage a containerised OpenAI-compatible server
    proxy: "http://127.0.0.1:${PORT}"
    cmd: |
      docker run --name ${MODEL_ID} --init --rm -p ${PORT}:8000 vllm/vllm-openai:latest
    cmdStop: docker stop ${MODEL_ID}
    checkEndpoint: "none"
    ttl: 0                  # never auto-unload (e.g. keep on GPU)

groups:
  chat-models:
    swap: true
    exclusive: true
    members: ["llama-chat", "qwen-coder"]

  always-on:
    persistent: true
    swap: false
    exclusive: false
    members: ["vllm-coder"]
None of this is decorative: cmd drives the process, proxy/checkEndpoint/ttl control routing and lifecycle, cmdStop is what makes Docker-based upstreams actually stop, and groups are what separate "one big model at a time" from "the two chat models swap between themselves while the always-on server stays pinned".
Run and swap models via OpenAI-compatible endpoints
Once llama-swap is running, you interact with it like any other OpenAI-compatible endpoint. The API surface includes core endpoints such as /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models, and llama-swap uses the requested model to decide which upstream to run and route to.
A practical quickstart flow:
# 1) Start llama-swap
llama-swap --config ./config.yaml --listen localhost:8080
# 2) Discover available models
curl http://localhost:8080/v1/models
Model listing is a first-class management feature and includes behaviour like sorting by ID, excluding unlisted models, and optionally including aliases.
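That listing behaviour can be sketched as a pure function. The helper and the per-model "unlisted" field name below are illustrative, not the proxy's actual schema: sort by ID, drop unlisted models, optionally append aliases.

```python
def listed_models(models, include_aliases=False):
    # models: {model_id: {"unlisted": bool, "aliases": [...]}} (hypothetical shape)
    ids = []
    for mid, cfg in models.items():
        if cfg.get("unlisted"):
            continue                      # excluded from /v1/models
        ids.append(mid)
        if include_aliases:
            ids.extend(cfg.get("aliases", []))
    return sorted(ids)                    # listing is sorted by ID

cfg = {
    "qwen-coder": {"aliases": ["qwen-coder-latest"]},
    "llama-chat": {"aliases": ["llama-chat-latest"]},
    "scratch":    {"unlisted": True},
}
print(listed_models(cfg, include_aliases=True))
# ['llama-chat', 'llama-chat-latest', 'qwen-coder', 'qwen-coder-latest']
```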
# 3) Make a chat completion request for a specific model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${LLAMASWAP_API_KEY}" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role":"user","content":"Write a TypeScript function that retries fetch with backoff."}]
  }'
If you now repeat the call with "model": "llama-chat", llama-swap will swap upstream processes (unless your group configuration allows them to coexist) because it extracts the requested model from the request and loads the appropriate server configuration.
If you're using an SDK, point the client at http://localhost:8080/v1—same trick as aiming the OpenAI Python library at llama-server, except the stable URL is now llama-swap and the model field picks the upstream.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-your-llamaswap-key",
)

resp = client.chat.completions.create(
    model="qwen-coder",
    messages=[{"role": "user", "content": "Explain the difference between mutexes and semaphores."}],
)
print(resp.choices[0].message.content)
To warm a model before the first real request (hiding cold-start latency), use /upstream/<model>—it auto-loads the model if needed and forwards straight to that upstream. It's a straightforward way to ensure weights are resident before a benchmark or scripted test.
Control and monitor llama-swap via management API endpoints and SSE events
llama-swap isn't just "a proxy"; it also exposes operational control endpoints that let you build tooling around model lifecycle and observability.
Check what is running
GET /running returns runtime state for loaded models, including state values like ready, starting, stopping, stopped, and shutdown.
curl http://localhost:8080/running
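In scripts it is handy to reduce that response to "what is not ready yet". The payload shape below is an assumption for illustration; check the actual JSON your version returns.

```python
# Illustrative /running payload; the real schema may differ.
sample = {
    "running": [
        {"model": "vllm-coder", "state": "ready"},
        {"model": "qwen-coder", "state": "starting"},
    ]
}

def not_ready(resp):
    # Models still starting, stopping, or otherwise not serving traffic.
    return [m["model"] for m in resp["running"] if m["state"] != "ready"]

print(not_ready(sample))  # ['qwen-coder']
```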
Unload models to free VRAM
To unload everything immediately, use the API-versioned endpoint POST /api/models/unload. To unload a single model (by ID or alias), use POST /api/models/unload/<model>. A legacy GET /unload exists for backwards compatibility.
# unload all
curl -X POST http://localhost:8080/api/models/unload
# unload one model
curl -X POST http://localhost:8080/api/models/unload/qwen-coder
Use these endpoints when VRAM is needed back now instead of waiting for TTL—benchmarks, quick model switches, or after loading a much larger checkpoint than intended.
Stream live events via SSE
GET /api/events establishes a Server-Sent Events stream and is designed for real-time updates that include model status changes, logs, metrics, and in-flight request counts.
curl -N http://localhost:8080/api/events
SSE and token streaming break when any middle box buffers. llama-swap sets X-Accel-Buffering: no on its SSE responses, but that header is not a substitute for a correct proxy config: disable buffering on nginx (or your equivalent) for /api/events and /v1/chat/completions.
Metrics and request captures
GET /api/metrics returns token usage metrics, with in-memory retention controlled by metricsMaxInMemory (default 1000).
GET /api/captures/<id> can retrieve full request/response captures, but only when captureBuffer > 0 is configured.
Logs and the Web UI
llama-swap exposes /ui for a web interface, and operational log endpoints such as /logs and /logs/stream for real-time monitoring.
If you enable apiKeys, assume defence in depth: /health and pieces of /ui stay reachable without a key—fine for local trust boundaries, not fine if the host is on a shared network. Put llama-swap behind something that enforces your real policy; the built-in auth is for keeping casual clients honest, not for a public multi-tenant API.
Troubleshooting llama-swap model switching in production
Most llama-swap problems fall into a small set of operational categories: streaming through a reverse proxy, health checks during cold starts, ports and process lifecycle, and authentication.
Streaming breaks behind nginx or another reverse proxy
nginx will happily buffer away your SSE and streamed completions. Disable proxy_buffering (and proxy_cache) for /api/events and /v1/chat/completions. llama-swap emits X-Accel-Buffering: no on SSE, which helps—fix the proxy anyway.
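A minimal nginx sketch of that fix. The upstream address and the exact location match are assumptions; adapt them to your site config:

```nginx
# Stream-friendly locations: no buffering, no cache, long read timeout
location ~ ^/(v1/chat/completions|api/events)$ {
    proxy_pass         http://127.0.0.1:8080;
    proxy_buffering    off;
    proxy_cache        off;
    proxy_http_version 1.1;
    proxy_set_header   Connection "";
    proxy_read_timeout 1h;
}
```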
A model never becomes ready
By default, per-model checkEndpoint is /health and must return HTTP 200 for the process to be considered ready. You can set checkEndpoint to another path or to "none" to disable health checks entirely.
If large models take longer to load, increase healthCheckTimeout (default 120s), or use tailored health checks for your specific upstream.
Switching models leaves an old container running
If the upstream is Docker or Podman, set cmdStop—otherwise llama-swap stops the wrapper process while the container keeps chewing VRAM in the background.
I get 401 responses after enabling security
When apiKeys is configured, llama-swap requires a valid key and accepts three methods (Basic auth, Bearer token, x-api-key). It also strips authentication headers before forwarding upstream.
I get 429 Too Many Requests
concurrencyLimit returns 429 when exceeded—by design. Raise the cap if you under-provisioned it, or drop the limit if you did not mean to throttle.
Port conflicts or odd routing problems
Avoid hard-coded ports in cmd; use ${PORT} and move startPort if 5800+ collides with something else. Remember ports are assigned in alphabetical order by model ID—rename a model and the port mapping shifts.
Operational debugging checklist
/running for truth, /logs/stream when startup is opaque, POST /api/models/unload when VRAM is needed back now. That trio covers most "why is the GPU full" sessions.