I used to have a ridiculous local AI setup. Ollama running as a service. A separate Python venv for LangChain experiments. Another terminal with llama.cpp because I wanted to test quantized models. Three different API formats, three different port numbers, three things that broke independently every time I updated macOS.
Then Docker shipped Model Runner and I deleted all of it.
What Model Runner Actually Is
It's built into Docker Desktop. No separate install. You pull models the same way you pull images:
docker model pull ai/llama3.1
docker model pull ai/phi3-mini
docker model pull ai/mistral
Run inference:
docker model run ai/llama3.1 "Explain NUMA topology in two sentences"
Or hit the API endpoint, which is OpenAI-compatible:
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.1",
    "messages": [{"role": "user", "content": "What is OKE?"}],
    "max_tokens": 100
  }'
That's it. No Python, no venv, no pip install, no CUDA drivers (it uses Metal on Mac, CPU elsewhere). It just runs.
Why I Switched From Ollama
Ollama is fine. I used it for months. But a few things bugged me:
Port conflicts. Ollama defaults to 11434. I kept forgetting it was running and then wondering why that port was taken. Model Runner runs inside the Docker Desktop VM, and the host-side endpoint on 12434 is managed by Docker Desktop itself, not by yet another background daemon I have to remember is running.
Update management. Ollama is a separate binary I have to update separately. Model Runner updates come with Docker Desktop. One less thing to think about.
API compatibility. I'm deploying vLLM in production on OKE. vLLM exposes an OpenAI-compatible API. Model Runner also exposes an OpenAI-compatible API. My client code works unchanged between local and production. With Ollama I was constantly converting between Ollama's native format and OpenAI's format.
Docker context. Model Runner models can be referenced from Docker Compose files and Dockerfiles. That means my local dev stack can include an LLM as a service alongside my API server, database, and cache — all in one docker compose up.
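That compatibility is also easy to verify: both Model Runner and vLLM answer the standard OpenAI /models listing, so a few lines of Go against either base URL confirm the backend is up and speaking the same dialect. A minimal sketch (LLM_ENDPOINT is the same variable used throughout this post; the file and names are just illustrative):
// healthcheck.go: confirm that whatever LLM_ENDPOINT points at
// (Model Runner locally, vLLM on OKE) speaks the OpenAI API.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    // e.g. LLM_ENDPOINT=http://localhost:12434/engines/llama.cpp/v1 locally,
    // or the vLLM service URL on OKE in production
    resp, err := http.Get(os.Getenv("LLM_ENDPOINT") + "/models")
    if err != nil {
        fmt.Fprintln(os.Stderr, "endpoint unreachable:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)
    fmt.Println(string(body)) // JSON list of the models the backend serves
}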
Using Model Runner in Docker Compose
This is the part that actually changed my workflow:
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - LLM_ENDPOINT=http://host.docker.internal:12434/engines/llama.cpp/v1
    depends_on:
      - db
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: dev
My API server talks to Model Runner at host.docker.internal:12434. In production on OKE, that env var points to my vLLM service instead. Same client code, same prompt format, different backend.
// Same code works with Model Runner locally and vLLM on OKE
func callLLM(prompt string) (string, error) {
    endpoint := os.Getenv("LLM_ENDPOINT") + "/chat/completions"
    body := map[string]interface{}{
        // "model" is required by the OpenAI-compatible API on both backends.
        // LLM_MODEL (name is illustrative) carries ai/llama3.1 locally and the
        // served model name on vLLM in production.
        "model": os.Getenv("LLM_MODEL"),
        "messages": []map[string]string{
            {"role": "user", "content": prompt},
        },
        "max_tokens": 200,
    }
    jsonBody, err := json.Marshal(body)
    if err != nil {
        return "", err
    }
    resp, err := http.Post(endpoint, "application/json", bytes.NewBuffer(jsonBody))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    // decode choices[0].message.content from the response
    // ...
}
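The part I elided is just decoding the standard chat-completions JSON. Both backends return the same shape, so one small struct is enough; a minimal sketch (parseLLMResponse and the struct name are mine):
// Minimal decode of the OpenAI-style chat completions response;
// only the fields actually read are declared.
type chatResponse struct {
    Choices []struct {
        Message struct {
            Content string `json:"content"`
        } `json:"message"`
    } `json:"choices"`
}

func parseLLMResponse(r io.Reader) (string, error) {
    var cr chatResponse
    if err := json.NewDecoder(r).Decode(&cr); err != nil {
        return "", err
    }
    if len(cr.Choices) == 0 {
        return "", fmt.Errorf("no choices in response")
    }
    return cr.Choices[0].Message.Content, nil
}
callLLM can hand resp.Body straight to this and return the string.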
The Dev Loop That Actually Works
Before Model Runner, testing a prompt change meant:
- Edit prompt template
- Rebuild Docker image
- Push to OCIR
- Wait for OKE to pull the 8GB GPU image
- Test
- Realize the prompt is wrong
- Repeat from step 1
That's 15-20 minutes per iteration. With Model Runner:
- Edit prompt template
- docker compose up --build (rebuilds in seconds, model already cached)
- Test
- Fix prompt
- Repeat from step 1
Two minutes per iteration, maybe less. The model is already running locally, and the API server is a Go binary that rebuilds in about 3 seconds. I can iterate on prompts 10x faster.
Models I Actually Use
# Code assistance (great for generating test data)
docker model pull ai/codellama
# General purpose (good balance of speed and quality)
docker model pull ai/llama3.1
# Small and fast (for quick experiments)
docker model pull ai/phi3-mini
# List what's cached
docker model list
On my M3 MacBook Pro, phi3-mini generates ~30 tokens/sec. Llama 3.1 8B does about 15 tokens/sec. Not blazing fast, but fast enough for development. I'm not benchmarking model performance locally — I'm testing that my application handles LLM responses correctly.
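When I want an actual number instead of a feel, I time one completion and read the usage block from the response. This assumes the backend reports usage.completion_tokens (vLLM does, and OpenAI-compatible servers generally do; if yours doesn't, this just returns zero), and the elapsed time includes prompt processing, so treat it as a rough floor rather than a benchmark:
// Rough tokens/sec: time one completion and read usage.completion_tokens
// from the OpenAI-style response. Elapsed time includes prompt processing.
func roughTokensPerSec(prompt string) (float64, error) {
    reqBody, _ := json.Marshal(map[string]interface{}{
        "model":      os.Getenv("LLM_MODEL"),
        "messages":   []map[string]string{{"role": "user", "content": prompt}},
        "max_tokens": 200,
    })
    start := time.Now()
    resp, err := http.Post(os.Getenv("LLM_ENDPOINT")+"/chat/completions",
        "application/json", bytes.NewReader(reqBody))
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    var out struct {
        Usage struct {
            CompletionTokens int `json:"completion_tokens"`
        } `json:"usage"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return 0, err
    }
    return float64(out.Usage.CompletionTokens) / time.Since(start).Seconds(), nil
}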
Limitations I've Hit
No fine-tuned models. You can't load your own fine-tuned LoRA adapters into Model Runner. For that I still need vLLM or llama.cpp. This is a local dev tool, not a training platform.
Model selection. The catalog is growing but it's not as large as Ollama's. The main open models are there (Llama, Mistral, Phi, CodeLlama) but if you need something obscure, check first.
No GPU on Linux Docker Desktop. On Mac it uses Metal. On Linux it's CPU-only through Docker Desktop (you'd need to run vLLM directly for GPU inference). Fine for dev, not for benchmarking.
My Current Setup
# Start everything
docker compose up -d
# API server at localhost:8080
# Model Runner at localhost:12434
# Postgres at localhost:5432
# Test the AI endpoint
curl localhost:8080/api/summarize \
-d '{"text": "Long document here..."}'
# Check which models are loaded
docker model list
One command, full stack, LLM included. When it's time to deploy to OKE, I change the LLM_ENDPOINT env var to point at my vLLM service and everything else stays the same.
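That summarize endpoint is a thin wrapper around the callLLM function from earlier; a hypothetical sketch of its shape (the handler name, prompt, and JSON fields are illustrative):
// Hypothetical shape of the handler behind /api/summarize:
// decode the request, build a prompt, delegate to callLLM.
func summarizeHandler(w http.ResponseWriter, r *http.Request) {
    var req struct {
        Text string `json:"text"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    summary, err := callLLM("Summarize the following text in three sentences:\n\n" + req.Text)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    json.NewEncoder(w).Encode(map[string]string{"summary": summary})
}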
I deleted Ollama, uninstalled llama.cpp, removed the Python venv. My local AI setup is now just Docker.
Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate