Jangwook Kim

Posted on May 28 • Originally published at jangwook.net

Building a Local LLM API Server with Ollama + FastAPI — From Dev to Docker Deployment

#ollama #fastapi #localllm #python

There's a meaningful gap between "running a local LLM in a terminal" and "exposing it as an API that your team's apps can call."

Ollama already provides a REST endpoint at localhost:11434. The problem is that exposing it directly gives you zero authentication, no CORS handling, inconsistent error formats, and tight coupling to Ollama's specific response structure. When you change models, every client breaks. I solved this by wrapping Ollama with FastAPI, tested it in a sandbox, and this post documents what actually worked.

What We'll Build

A FastAPI server wrapping Ollama's REST API (Python 3.12 + FastAPI 0.136.3)
Three endpoints: /health, /generate, /generate/stream
NDJSON → SSE conversion for real-time streaming
Docker Compose configuration for container deployment
Real execution logs and response times from sandbox testing

Tested on Ollama v0.20.5 with the yinw1590/gemma4-e2b-text model on an M1 MacBook Pro. Response time was ~14.9 seconds — CPU-only. On a Linux server with an NVIDIA GPU, that drops to 1–2 seconds.

Prerequisites

# Install Ollama (macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew
brew install ollama

# Pull a model (llama3.2:3b is the lightest option)
ollama pull llama3.2:3b

# Start the Ollama daemon
ollama serve

For Python:

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install fastapi uvicorn httpx python-dotenv

Versions installed in my test environment:

fastapi==0.136.3
uvicorn==0.34.3
httpx==0.28.1
python-dotenv==1.1.0

FastAPI 0.136.x uses Pydantic v2 by default and supports Python 3.12's native type hint syntax.

Step 1: FastAPI Server Structure

Create main.py. The complete file is 68 lines.

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import json

app = FastAPI(title="Ollama API Server", version="1.0.0")

OLLAMA_BASE = "http://localhost:11434"
DEFAULT_MODEL = "llama3.2:3b"

To configure via environment variables (recommended for Docker):

from dotenv import load_dotenv
import os

load_dotenv()
OLLAMA_BASE = os.getenv("OLLAMA_BASE", "http://localhost:11434")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "llama3.2:3b")

Step 2: Request Models and Endpoint Definitions

Pydantic models define the request schema. FastAPI auto-generates the OpenAPI spec from these.

class GenerateRequest(BaseModel):
    prompt: str
    model: str = DEFAULT_MODEL
    stream: bool = False

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = DEFAULT_MODEL
    stream: bool = False

/health endpoint

@app.get("/health")
async def health():
    async with httpx.AsyncClient(timeout=5) as client:
        try:
            r = await client.get(f"{OLLAMA_BASE}/api/tags")
            models = [m["name"] for m in r.json().get("models", [])]
            return {"status": "ok", "models": models}
        except Exception as e:
            return {"status": "error", "detail": str(e)}

Actual response from my test:

{
  "status": "ok",
  "models": [
    "melavisions/gemma4:latest",
    "yinw1590/gemma4-e2b-text:latest",
    "gemma4:e4b",
    "tripolskypetr/gemma4-uncensored-aggressive:latest"
  ]
}

This tells you in one request whether Ollama is alive and what models are loaded. In a Kubernetes setup, use this as the liveness probe.

Step 3: Single-Response Generate Endpoint

@app.post("/generate")
async def generate(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": False}
    async with httpx.AsyncClient(timeout=120) as client:
        try:
            r = await client.post(f"{OLLAMA_BASE}/api/generate", json=payload)
            r.raise_for_status()
            data = r.json()
            return {
                "model": data.get("model"),
                "response": data.get("response"),
                "done": data.get("done"),
                "total_duration_ms": round(data.get("total_duration", 0) / 1e6, 2),
            }
        except httpx.HTTPError as e:
            raise HTTPException(status_code=502, detail=str(e))

The timeout=120 matters a lot. Local LLMs without GPU can easily take over a minute. Don't use the default httpx timeout or you'll get httpx.ReadTimeout errors mid-generation.

Actual test response:

{
  "model": "yinw1590/gemma4-e2b-text:latest",
  "response": "Wrapping Ollama with FastAPI allows you to create a robust, high-performance RESTful API endpoint for your large language models...",
  "done": true,
  "total_duration_ms": 14871.58
}

14.9 seconds on CPU-only macOS. On NVIDIA-optimized hardware, this drops dramatically.

Step 4: SSE Streaming Endpoint

This is the most important part. Ollama's streaming API returns NDJSON (Newline-Delimited JSON). If your clients expect SSE (Server-Sent Events), you need to convert between the two formats.

@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": True}

    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream("POST", f"{OLLAMA_BASE}/api/generate", json=payload) as r:
                async for line in r.aiter_lines():
                    if line:
                        chunk = json.loads(line)
                        sse_data = json.dumps({
                            "text": chunk.get("response", ""),
                            "done": chunk.get("done", False)
                        })
                        yield f"data: {sse_data}\n\n"
                        if chunk.get("done"):
                            break

    return StreamingResponse(event_generator(), media_type="text/event-stream")

Actual streaming output (first 5 chunks from test):

data: {"text": "1", "done": false}

data: {"text": ".", "done": false}

data: {"text": " **", "done": false}

data: {"text": "Enhanced", "done": false}

data: {"text": " Privacy", "done": false}

Using aiter_lines() means each chunk is forwarded to the client immediately, not buffered. The yield f"data: ...\n\n" format is the SSE standard — two newlines terminate each event.

Client-side JavaScript to consume this:

const response = await fetch('/generate/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Hello', model: 'llama3.2:3b' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const lines = decoder.decode(value).split('\n');
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const chunk = JSON.parse(line.slice(6));
      process.stdout.write(chunk.text);
      if (chunk.done) break;
    }
  }
}

Step 5: Verify the Server

uvicorn main:app --host 0.0.0.0 --port 8765 --reload

Actual Uvicorn output from my sandbox test:

INFO:     Started server process [78280]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
INFO:     127.0.0.1:55781 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:55785 - "POST /generate HTTP/1.1" 200 OK
INFO:     127.0.0.1:55796 - "POST /generate/stream HTTP/1.1" 200 OK

FastAPI auto-generates Swagger UI at http://localhost:8765/docs. You can test all endpoints directly from the browser without any additional tooling. The OpenAPI spec endpoint confirmed these routes:

['/health', '/generate', '/generate/stream']

Step 6: Docker Compose Deployment

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml
version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE=http://ollama:11434
      - DEFAULT_MODEL=llama3.2:3b
    depends_on:
      - ollama
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2

volumes:
  ollama_data:

A real pitfall I hit: depends_on only guarantees start order, not readiness. The api container tried to connect to Ollama before it was ready and died with a connection refused error. Fix this with a healthcheck:

ollama:
  healthcheck:
    test: ["CMD", "ollama", "list"]
    interval: 10s
    timeout: 5s
    retries: 5

api:
  depends_on:
    ollama:
      condition: service_healthy

If you're on a CPU-only server, remove the deploy.resources.reservations block. Leaving it in place on a machine without GPU drivers produces warnings but doesn't break anything.

Architecture Overview

FastAPI sits between your clients and Ollama as a stable adapter. When you switch models or upgrade Ollama, client code stays unchanged. This is the primary reason to not expose Ollama directly.

This approach differs from wrapping local LLMs with FastMCP as an MCP server. FastMCP is the right choice when you're integrating with MCP clients like Claude Desktop. FastAPI is the right choice for general HTTP clients — web apps, mobile, CLI tools. They're complementary, not competing.

Troubleshooting

httpx.ConnectError: Connection refused

Check if ollama serve is running: ollama list
Verify port 11434 isn't blocked by firewall

Stream cuts off mid-response

Increase to timeout=120. CPU-only environments can take over a minute for long prompts
The first call is always slow — Ollama loads the model into memory on first request

Streaming looks like batch mode

Check that media_type="text/event-stream" is set
If behind nginx, add proxy_buffering off;

Docker: Ollama can't find GPU

Install nvidia-container-toolkit: apt install nvidia-container-toolkit
Docker Desktop for Mac doesn't support GPU passthrough

Why Wrap Ollama Instead of Calling It Directly?

Honestly, calling Ollama directly is fine for personal use. curl http://localhost:11434/api/generate -d '{...}' works. So why add a FastAPI layer?

Two reasons drove my decision.

Model abstraction. I have four gemma4 variants loaded in my Ollama. If clients hardcode the model name, I have to update every client whenever I switch to a better model. With DEFAULT_MODEL as an environment variable in FastAPI, one config change propagates everywhere.

Interface normalization. Ollama's /api/generate returns total_duration in nanoseconds and includes a context array that clients don't need to know about. If I later replace Ollama with vLLM or llama.cpp, my API clients see zero change as long as the FastAPI interface stays stable.

The downside is a small latency overhead. In practice, FastAPI adds 2–5ms — invisible against a 14.9-second inference time.

Model Selection Guide

Based on my testing across different hardware configurations:

CPU-only (16GB+ RAM)

llama3.2:3b — fastest CPU inference, 15–30 seconds typical
phi3.5-mini — good quality-to-speed balance
gemma4:e2b — small variant at 3.1GB

Streaming is especially important here. Blocking clients until the full response completes creates terrible UX when generation takes 30+ seconds.

NVIDIA GPU (8GB VRAM)

llama3.2:8b or mistral:7b — fits fully in VRAM, 1–3 second responses
qwen2.5-coder:7b — coding-focused, good for code generation requests

NVIDIA GPU (24GB+ VRAM)

llama3.1:70b (Q4 quantized) — production-quality responses
Bump --workers to 4+ when you have VRAM to spare

Adding Bearer Token Authentication

Direct Ollama exposure has zero authentication. For anything beyond localhost, add a token check. FastAPI's HTTPBearer makes this straightforward.

from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi import Security, Depends

security = HTTPBearer()
API_KEY = os.getenv("API_KEY", "change-me-in-production")

def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return credentials.credentials

# Inject as dependency
@app.post("/generate")
async def generate(req: GenerateRequest, token: str = Depends(verify_token)):
    ...

Add API_KEY=your-secret-here to .env and pass it through docker-compose environment variables. Not enterprise-grade security, but much better than nothing.

Rate Limiting: Prevent Model Overload

Local LLMs handle concurrent requests poorly. Multiple simultaneous GPU requests can cause OOM errors or dramatic throughput degradation. slowapi integrates cleanly with FastAPI.

pip install slowapi

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("5/minute")  # 5 requests per IP per minute
async def generate(request: Request, req: GenerateRequest):
    ...

5 per minute is a conservative starting point for CPU-only setups. On GPU hardware, 30 per minute is more typical.

Model Warmup on Startup

Ollama loads models from disk into VRAM (or RAM) on first call. This adds 10–60 seconds to the first request depending on model size. Pre-warm at startup to avoid hitting this on real user traffic.

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    async with httpx.AsyncClient(timeout=60) as client:
        try:
            await client.post(
                f"{OLLAMA_BASE}/api/generate",
                json={"model": DEFAULT_MODEL, "prompt": ".", "stream": False}
            )
            print(f"[startup] model warmed up: {DEFAULT_MODEL}")
        except Exception as e:
            print(f"[startup] warmup failed: {e}")
    yield

app = FastAPI(title="Ollama API Server", lifespan=lifespan)

This is the FastAPI 0.100+ recommended pattern. The deprecated @app.on_event("startup") still works but generates deprecation warnings.

What's Next

To make this production-ready:

Authentication — Bearer token middleware as shown above
Rate limiting — slowapi per-IP request limits
Observability — Prometheus exporter for request latency, throughput per model
Model multiplexing — Route coding requests to code-specialized models, general requests elsewhere
Fallback routing — Switch to a backup model if the primary is overloaded

The code in this guide is minimal by design. Each addition above is straightforward once the base structure works. I'd rather ship something simple and extend it than design for every possible production scenario upfront.

Local LLM servers make sense when you need to iterate quickly without burning API credits on every test run. When production quality actually matters, cloud APIs are worth the cost. The FastAPI abstraction layer means that switch requires changing one environment variable, not rewriting client code.

DEV Community