DEV Community

Jangwook Kim
Jangwook Kim

Posted on • Originally published at jangwook.net

Building a Local LLM API Server with Ollama + FastAPI — From Dev to Docker Deployment

There's a meaningful gap between "running a local LLM in a terminal" and "exposing it as an API that your team's apps can call."

Ollama already provides a REST endpoint at localhost:11434. The problem is that exposing it directly gives you zero authentication, no CORS handling, inconsistent error formats, and tight coupling to Ollama's specific response structure. When you change models, every client breaks. I solved this by wrapping Ollama with FastAPI, tested it in a sandbox, and this post documents what actually worked.

What We'll Build

  • A FastAPI server wrapping Ollama's REST API (Python 3.12 + FastAPI 0.136.3)
  • Three endpoints: /health, /generate, /generate/stream
  • NDJSON → SSE conversion for real-time streaming
  • Docker Compose configuration for container deployment
  • Real execution logs and response times from sandbox testing

Tested on Ollama v0.20.5 with the yinw1590/gemma4-e2b-text model on an M1 MacBook Pro. Response time was ~14.9 seconds — CPU-only. On a Linux server with an NVIDIA GPU, that drops to 1–2 seconds.

Prerequisites

# Install Ollama (macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew
brew install ollama

# Pull a model (llama3.2:3b is the lightest option)
ollama pull llama3.2:3b

# Start the Ollama daemon
ollama serve
Enter fullscreen mode Exit fullscreen mode

For Python:

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install fastapi uvicorn httpx python-dotenv
Enter fullscreen mode Exit fullscreen mode

Versions installed in my test environment:

fastapi==0.136.3
uvicorn==0.34.3
httpx==0.28.1
python-dotenv==1.1.0
Enter fullscreen mode Exit fullscreen mode

FastAPI 0.136.x uses Pydantic v2 by default and supports Python 3.12's native type hint syntax.

Step 1: FastAPI Server Structure

Create main.py. The complete file is 68 lines.

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import json

app = FastAPI(title="Ollama API Server", version="1.0.0")

OLLAMA_BASE = "http://localhost:11434"
DEFAULT_MODEL = "llama3.2:3b"
Enter fullscreen mode Exit fullscreen mode

To configure via environment variables (recommended for Docker):

from dotenv import load_dotenv
import os

load_dotenv()
OLLAMA_BASE = os.getenv("OLLAMA_BASE", "http://localhost:11434")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "llama3.2:3b")
Enter fullscreen mode Exit fullscreen mode

Step 2: Request Models and Endpoint Definitions

Pydantic models define the request schema. FastAPI auto-generates the OpenAPI spec from these.

class GenerateRequest(BaseModel):
    prompt: str
    model: str = DEFAULT_MODEL
    stream: bool = False

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = DEFAULT_MODEL
    stream: bool = False
Enter fullscreen mode Exit fullscreen mode

/health endpoint

@app.get("/health")
async def health():
    async with httpx.AsyncClient(timeout=5) as client:
        try:
            r = await client.get(f"{OLLAMA_BASE}/api/tags")
            models = [m["name"] for m in r.json().get("models", [])]
            return {"status": "ok", "models": models}
        except Exception as e:
            return {"status": "error", "detail": str(e)}
Enter fullscreen mode Exit fullscreen mode

Actual response from my test:

{
  "status": "ok",
  "models": [
    "melavisions/gemma4:latest",
    "yinw1590/gemma4-e2b-text:latest",
    "gemma4:e4b",
    "tripolskypetr/gemma4-uncensored-aggressive:latest"
  ]
}
Enter fullscreen mode Exit fullscreen mode

This tells you in one request whether Ollama is alive and what models are loaded. In a Kubernetes setup, use this as the liveness probe.

Step 3: Single-Response Generate Endpoint

@app.post("/generate")
async def generate(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": False}
    async with httpx.AsyncClient(timeout=120) as client:
        try:
            r = await client.post(f"{OLLAMA_BASE}/api/generate", json=payload)
            r.raise_for_status()
            data = r.json()
            return {
                "model": data.get("model"),
                "response": data.get("response"),
                "done": data.get("done"),
                "total_duration_ms": round(data.get("total_duration", 0) / 1e6, 2),
            }
        except httpx.HTTPError as e:
            raise HTTPException(status_code=502, detail=str(e))
Enter fullscreen mode Exit fullscreen mode

The timeout=120 matters a lot. Local LLMs without GPU can easily take over a minute. Don't use the default httpx timeout or you'll get httpx.ReadTimeout errors mid-generation.

Actual test response:

{
  "model": "yinw1590/gemma4-e2b-text:latest",
  "response": "Wrapping Ollama with FastAPI allows you to create a robust, high-performance RESTful API endpoint for your large language models...",
  "done": true,
  "total_duration_ms": 14871.58
}
Enter fullscreen mode Exit fullscreen mode

14.9 seconds on CPU-only macOS. On NVIDIA-optimized hardware, this drops dramatically.

Step 4: SSE Streaming Endpoint

This is the most important part. Ollama's streaming API returns NDJSON (Newline-Delimited JSON). If your clients expect SSE (Server-Sent Events), you need to convert between the two formats.

@app.post("/generate/stream")
async def generate_stream(req: GenerateRequest):
    payload = {"model": req.model, "prompt": req.prompt, "stream": True}

    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream("POST", f"{OLLAMA_BASE}/api/generate", json=payload) as r:
                async for line in r.aiter_lines():
                    if line:
                        chunk = json.loads(line)
                        sse_data = json.dumps({
                            "text": chunk.get("response", ""),
                            "done": chunk.get("done", False)
                        })
                        yield f"data: {sse_data}\n\n"
                        if chunk.get("done"):
                            break

    return StreamingResponse(event_generator(), media_type="text/event-stream")
Enter fullscreen mode Exit fullscreen mode

Actual streaming output (first 5 chunks from test):

data: {"text": "1", "done": false}

data: {"text": ".", "done": false}

data: {"text": " **", "done": false}

data: {"text": "Enhanced", "done": false}

data: {"text": " Privacy", "done": false}
Enter fullscreen mode Exit fullscreen mode

Using aiter_lines() means each chunk is forwarded to the client immediately, not buffered. The yield f"data: ...\n\n" format is the SSE standard — two newlines terminate each event.

Client-side JavaScript to consume this:

const response = await fetch('/generate/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Hello', model: 'llama3.2:3b' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const lines = decoder.decode(value).split('\n');
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const chunk = JSON.parse(line.slice(6));
      process.stdout.write(chunk.text);
      if (chunk.done) break;
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step 5: Verify the Server

uvicorn main:app --host 0.0.0.0 --port 8765 --reload
Enter fullscreen mode Exit fullscreen mode

Actual Uvicorn output from my sandbox test:

INFO:     Started server process [78280]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
INFO:     127.0.0.1:55781 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:55785 - "POST /generate HTTP/1.1" 200 OK
INFO:     127.0.0.1:55796 - "POST /generate/stream HTTP/1.1" 200 OK
Enter fullscreen mode Exit fullscreen mode

FastAPI auto-generates Swagger UI at http://localhost:8765/docs. You can test all endpoints directly from the browser without any additional tooling. The OpenAPI spec endpoint confirmed these routes:

['/health', '/generate', '/generate/stream']
Enter fullscreen mode Exit fullscreen mode

Step 6: Docker Compose Deployment

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Enter fullscreen mode Exit fullscreen mode
# docker-compose.yml
version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE=http://ollama:11434
      - DEFAULT_MODEL=llama3.2:3b
    depends_on:
      - ollama
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2

volumes:
  ollama_data:
Enter fullscreen mode Exit fullscreen mode

A real pitfall I hit: depends_on only guarantees start order, not readiness. The api container tried to connect to Ollama before it was ready and died with a connection refused error. Fix this with a healthcheck:

ollama:
  healthcheck:
    test: ["CMD", "ollama", "list"]
    interval: 10s
    timeout: 5s
    retries: 5

api:
  depends_on:
    ollama:
      condition: service_healthy
Enter fullscreen mode Exit fullscreen mode

If you're on a CPU-only server, remove the deploy.resources.reservations block. Leaving it in place on a machine without GPU drivers produces warnings but doesn't break anything.

Architecture Overview

Ollama FastAPI architecture diagram

FastAPI sits between your clients and Ollama as a stable adapter. When you switch models or upgrade Ollama, client code stays unchanged. This is the primary reason to not expose Ollama directly.

This approach differs from wrapping local LLMs with FastMCP as an MCP server. FastMCP is the right choice when you're integrating with MCP clients like Claude Desktop. FastAPI is the right choice for general HTTP clients — web apps, mobile, CLI tools. They're complementary, not competing.

Troubleshooting

httpx.ConnectError: Connection refused

  • Check if ollama serve is running: ollama list
  • Verify port 11434 isn't blocked by firewall

Stream cuts off mid-response

  • Increase to timeout=120. CPU-only environments can take over a minute for long prompts
  • The first call is always slow — Ollama loads the model into memory on first request

Streaming looks like batch mode

  • Check that media_type="text/event-stream" is set
  • If behind nginx, add proxy_buffering off;

Docker: Ollama can't find GPU

  • Install nvidia-container-toolkit: apt install nvidia-container-toolkit
  • Docker Desktop for Mac doesn't support GPU passthrough

Why Wrap Ollama Instead of Calling It Directly?

Honestly, calling Ollama directly is fine for personal use. curl http://localhost:11434/api/generate -d '{...}' works. So why add a FastAPI layer?

Two reasons drove my decision.

Model abstraction. I have four gemma4 variants loaded in my Ollama. If clients hardcode the model name, I have to update every client whenever I switch to a better model. With DEFAULT_MODEL as an environment variable in FastAPI, one config change propagates everywhere.

Interface normalization. Ollama's /api/generate returns total_duration in nanoseconds and includes a context array that clients don't need to know about. If I later replace Ollama with vLLM or llama.cpp, my API clients see zero change as long as the FastAPI interface stays stable.

The downside is a small latency overhead. In practice, FastAPI adds 2–5ms — invisible against a 14.9-second inference time.

Model Selection Guide

Based on my testing across different hardware configurations:

CPU-only (16GB+ RAM)

  • llama3.2:3b — fastest CPU inference, 15–30 seconds typical
  • phi3.5-mini — good quality-to-speed balance
  • gemma4:e2b — small variant at 3.1GB

Streaming is especially important here. Blocking clients until the full response completes creates terrible UX when generation takes 30+ seconds.

NVIDIA GPU (8GB VRAM)

  • llama3.2:8b or mistral:7b — fits fully in VRAM, 1–3 second responses
  • qwen2.5-coder:7b — coding-focused, good for code generation requests

NVIDIA GPU (24GB+ VRAM)

  • llama3.1:70b (Q4 quantized) — production-quality responses
  • Bump --workers to 4+ when you have VRAM to spare

Adding Bearer Token Authentication

Direct Ollama exposure has zero authentication. For anything beyond localhost, add a token check. FastAPI's HTTPBearer makes this straightforward.

from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi import Security, Depends

security = HTTPBearer()
API_KEY = os.getenv("API_KEY", "change-me-in-production")

def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return credentials.credentials

# Inject as dependency
@app.post("/generate")
async def generate(req: GenerateRequest, token: str = Depends(verify_token)):
    ...
Enter fullscreen mode Exit fullscreen mode

Add API_KEY=your-secret-here to .env and pass it through docker-compose environment variables. Not enterprise-grade security, but much better than nothing.

Rate Limiting: Prevent Model Overload

Local LLMs handle concurrent requests poorly. Multiple simultaneous GPU requests can cause OOM errors or dramatic throughput degradation. slowapi integrates cleanly with FastAPI.

pip install slowapi
Enter fullscreen mode Exit fullscreen mode
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("5/minute")  # 5 requests per IP per minute
async def generate(request: Request, req: GenerateRequest):
    ...
Enter fullscreen mode Exit fullscreen mode

5 per minute is a conservative starting point for CPU-only setups. On GPU hardware, 30 per minute is more typical.

Model Warmup on Startup

Ollama loads models from disk into VRAM (or RAM) on first call. This adds 10–60 seconds to the first request depending on model size. Pre-warm at startup to avoid hitting this on real user traffic.

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    async with httpx.AsyncClient(timeout=60) as client:
        try:
            await client.post(
                f"{OLLAMA_BASE}/api/generate",
                json={"model": DEFAULT_MODEL, "prompt": ".", "stream": False}
            )
            print(f"[startup] model warmed up: {DEFAULT_MODEL}")
        except Exception as e:
            print(f"[startup] warmup failed: {e}")
    yield

app = FastAPI(title="Ollama API Server", lifespan=lifespan)
Enter fullscreen mode Exit fullscreen mode

This is the FastAPI 0.100+ recommended pattern. The deprecated @app.on_event("startup") still works but generates deprecation warnings.

What's Next

To make this production-ready:

  1. Authentication — Bearer token middleware as shown above
  2. Rate limiting — slowapi per-IP request limits
  3. Observability — Prometheus exporter for request latency, throughput per model
  4. Model multiplexing — Route coding requests to code-specialized models, general requests elsewhere
  5. Fallback routing — Switch to a backup model if the primary is overloaded

The code in this guide is minimal by design. Each addition above is straightforward once the base structure works. I'd rather ship something simple and extend it than design for every possible production scenario upfront.

Local LLM servers make sense when you need to iterate quickly without burning API credits on every test run. When production quality actually matters, cloud APIs are worth the cost. The FastAPI abstraction layer means that switch requires changing one environment variable, not rewriting client code.

Top comments (0)