Unify all your AI models - local and cloud - behind a single OpenAI-compatible API with LiteLLM and Ollama.
LiteLLM is a proxy server that exposes 100+ LLM providers through one endpoint. Connect it to Ollama for local inference, and you get load balancing, cost tracking, rate limits, and automatic fallback routing.
What You Need
- Python 3.9+
- Ollama installed and running, with the model pulled (ollama pull qwen3:14b)
- About 20 minutes
Setup
1. Install LiteLLM
pip install 'litellm[proxy]'
2. Create config.yaml
model_list:
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      rpm: 30
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
general_settings:
  master_key: sk-your-key
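The `os.environ/OPENAI_API_KEY` value tells LiteLLM to read the key from the environment instead of storing it in the file, so `OPENAI_API_KEY` has to be exported before the proxy starts. As a rough illustration of that convention (the helper name `resolve_config_value` is hypothetical and this is not LiteLLM's own code), the lookup could be sketched like this:

```python
import os

# Prefix used in config.yaml to mark "read this value from the environment".
ENV_PREFIX = "os.environ/"


def resolve_config_value(value: str) -> str:
    """Hypothetical helper: return the literal value, or the environment
    variable it points to when it starts with 'os.environ/'."""
    if value.startswith(ENV_PREFIX):
        name = value[len(ENV_PREFIX):]
        try:
            return os.environ[name]
        except KeyError:
            # Fail early with a readable message instead of a bare KeyError.
            raise RuntimeError(f"Set {name} before starting the proxy") from None
    return value
```

A plain value such as `sk-your-key` passes through unchanged, while `os.environ/OPENAI_API_KEY` is replaced by whatever the shell exported.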
3. Start the Proxy
litellm --config config.yaml --port 4000
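Before pointing real traffic at the proxy, it can help to confirm that it is up and sees both models. The sketch below assumes the proxy serves the OpenAI-style model listing at `/v1/models` and that the response has the usual `{"data": [{"id": ...}]}` shape. The function names are made up for this example.

```python
import json
import urllib.request


def models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the OpenAI-style model listing."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def model_names(payload: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style listing (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_names(base_url: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Ask the running proxy which models it exposes."""
    request = models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

With the proxy running, `fetch_model_names("http://localhost:4000/v1", "sk-your-key")` should list the `model_name` entries from config.yaml. A connection error usually means the proxy is not listening on port 4000.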
4. Use It
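from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy.
client = OpenAI(
    api_key="sk-your-key",  # the master_key from config.yaml
    base_url="http://localhost:4000/v1",
)
response = client.chat.completions.create(
    model="qwen3-local",  # any model_name from config.yaml
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)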
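Because the proxy speaks the OpenAI wire format, the SDK is optional. The same call can be made with the standard library alone, which is handy for tools that cannot install extra packages. This is a minimal sketch: the function names are invented for illustration, and the response parsing assumes the standard chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-format chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Switching from the local model to the cloud one only changes the `model` argument, for example `send(build_chat_request("http://localhost:4000/v1", "sk-your-key", "gpt-4o-mini", "Hello!"))`.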
Key Features
- Smart fallback - if the local model fails, requests are rerouted to a cloud model
- Load balancing - distribute across multiple GPU instances
- Cost tracking - per-model spend dashboard
- Rate limiting - control requests per user/key
- One API - use any tool that supports OpenAI format
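To make the fallback idea concrete, here is a client-side sketch of the same logic: try the local model first, then move down the list when a call fails. It illustrates the behaviour and is not how LiteLLM implements it. The proxy does this server-side, and the config.yaml above does not enable it by itself, so check the LiteLLM docs for the fallback settings. The `send` callable is a stand-in for whatever function actually makes the request.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    send: Callable[[str, list], str],
    models: Sequence[str],
    messages: list,
) -> tuple[str, str]:
    """Try each model in order and return (model_used, reply).

    `send(model, messages)` is a caller-supplied function that raises
    on failure, e.g. when the local Ollama instance is unreachable.
    """
    errors = []
    for model in models:
        try:
            return model, send(model, messages)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing `["qwen3-local", "gpt-4o-mini"]` as `models` keeps traffic local whenever possible and only spends on the cloud model when the local call raises.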
Cost vs Cloud
| | LiteLLM + Ollama | Direct Cloud APIs |
|---|---|---|
| Gateway | Free, self-hosted | Free |
| Local inference | $0 per token (you supply the hardware) | N/A |
| Model switching | One endpoint | Multiple SDKs |
| Failover | Automatic, once fallbacks are configured | Manual |
Full guide with advanced config examples: https://everylocalai.com/stack/litellm-ollama-gateway