
Pavan Madduri


Running Ollama on OCI Container Instances - Private LLM API in 5 Minutes, No Kubernetes

A colleague asked me to set up a private LLM endpoint their team could use for code review suggestions. Requirements: OpenAI-compatible API, runs inside our cloud (no data leaving the tenancy), and "I don't want to learn Kubernetes."

That last requirement ruled out OKE. And honestly, for a single-model inference endpoint serving 10 people, Kubernetes is overkill anyway.

I had Ollama running on an OCI Container Instance with a GPU in about 5 minutes. Here's the whole thing.

Why Ollama Instead of vLLM

For a small team endpoint, Ollama wins on simplicity:

  • Single binary, no Python dependencies
  • Downloads models automatically on first run
  • Manages multiple models with a simple ollama pull
  • Built-in OpenAI-compatible API at /v1/chat/completions
  • Handles model loading/unloading from GPU memory automatically

vLLM is better for high-throughput production (continuous batching, PagedAttention), but this isn't that. This is "10 developers hitting it a few times an hour."

The Deployment

One CLI command:

oci container-instances container-instance create \
  --compartment-id "$COMPARTMENT_ID" \
  --availability-domain "Uocm:US-ASHBURN-AD-1" \
  --display-name "team-ollama" \
  --shape "CI.Standard.GPU.A10.1" \
  --shape-config '{"ocpus": 15, "memoryInGBs": 240}' \
  --containers '[{
    "imageUrl": "docker.io/ollama/ollama:latest",
    "displayName": "ollama",
    "resourceConfig": {
      "vcpusLimit": 15,
      "memoryLimitInGBs": 240
    },
    "environmentVariables": {
      "OLLAMA_HOST": "0.0.0.0"
    },
    "healthChecks": [{
      "healthCheckType": "HTTP",
      "port": 11434,
      "path": "/",
      "intervalInSeconds": 30
    }]
  }]' \
  --vnics '[{
    "subnetId": "'"$PRIVATE_SUBNET_ID"'",
    "isPublicIpAssigned": false
  }]'

A few things to note:

  • GPU shape: CI.Standard.GPU.A10.1 gives you an A10 GPU with 24GB VRAM, enough for most quantized 7-13B parameter models.
  • Private subnet: No public IP. The endpoint is only reachable from within the VCN, so the team connects through a bastion or VPN.
  • OLLAMA_HOST=0.0.0.0: By default Ollama only listens on localhost. Inside a container, you need it to listen on all interfaces.
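
To sanity-check whether a model fits in the A10's 24GB, here is a rough back-of-the-envelope sketch. It assumes weight size is about parameters × bits per weight / 8, and it ignores the KV cache and runtime overhead, so treat the output as a lower bound. The function name estimate_weights_gb is my own, not part of Ollama or the OCI CLI.

```shell
#!/bin/bash
# Rough size of the model weights alone, in GB (1 GB = 10^9 bytes).
# Assumption: size ~= parameters (billions) * bits per weight / 8.
# Ignores KV cache, activations, and runtime overhead.
estimate_weights_gb() {
  local params_billions="$1"
  local bits_per_weight="$2"
  awk -v p="$params_billions" -v b="$bits_per_weight" \
    'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 8 4    # 8B model, 4-bit quantization  -> 4.0
estimate_weights_gb 13 4   # 13B model, 4-bit quantization -> 6.5
estimate_weights_gb 13 16  # 13B model, unquantized FP16   -> 26.0
```

The last line is the reason quantization matters here: a 13B model at FP16 would not fit in 24GB, while the same model at roughly 4 bits per weight leaves plenty of headroom.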

The container starts in about 10 seconds. But no model is loaded yet.

Loading the Model

The API can't serve a model until it has been pulled, so I SSH'd through the bastion and triggered the first pull:

# From a VM in the same VCN
OLLAMA_IP=10.0.1.42  # private IP of the Container Instance

# Pull a model (downloads to container's filesystem)
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "llama3.1:8b"}'

# Test it
curl http://$OLLAMA_IP:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Review this Go function for bugs: func add(a, b int) int { return a - b }"}]
  }'

The initial model download takes 3-4 minutes (8B model, ~5GB). After that, responses start within a second or two.

The Problem: Model Persistence

Here's the catch I should have thought about earlier. Container Instances don't have persistent storage by default. If the container restarts, the downloaded model is gone. You have to pull it again.

My fix was mounting an EMPTYDIR volume, backed by the instance's ephemeral storage, at Ollama's model directory:

oci container-instances container-instance create \
  ... \
  --containers '[{
    "imageUrl": "docker.io/ollama/ollama:latest",
    "volumeMounts": [{
      "mountPath": "/root/.ollama",
      "volumeName": "model-storage"
    }]
  }]' \
  --volumes '[{
    "name": "model-storage",
    "volumeType": "EMPTYDIR",
    "backingStore": "EPHEMERAL_STORAGE"
  }]'

For true persistence across container recreations, you'd use an OCI File Storage (NFS) mount. But for this use case, the ephemeral storage survives restarts (not recreations), and I have a simple curl script that re-pulls the model if it's missing:

#!/bin/bash
# warmup.sh: run after container instance creation
# Usage: ./warmup.sh <private-ip-of-container-instance>
OLLAMA_IP="${1:?usage: $0 <ollama-private-ip>}"
MODEL="llama3.1:8b"

# Wait for Ollama to be ready
until curl -sf "http://$OLLAMA_IP:11434/" > /dev/null; do
  sleep 2
done

# Pull the model only if /api/show fails (curl -f exits non-zero on an HTTP error)
if ! curl -sf "http://$OLLAMA_IP:11434/api/show" -d "{\"name\": \"$MODEL\"}" > /dev/null; then
  echo "Pulling model..."
  curl "http://$OLLAMA_IP:11434/api/pull" -d "{\"name\": \"$MODEL\"}"
fi

echo "Ready at http://$OLLAMA_IP:11434"
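
One weakness of the readiness loop in that script is that it waits forever if the instance never comes up. A bounded retry is a small change. This is a minimal sketch, not what I actually run; the retry helper and the attempt counts are assumptions for illustration.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while :; do
    "$@" && return 0
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical use in warmup.sh (gives up after roughly 5 minutes):
#   retry 150 2 curl -sf -o /dev/null "http://$OLLAMA_IP:11434/" \
#     || { echo "Ollama not reachable" >&2; exit 1; }
```

The helper sleeps only between attempts, so a failure on the last try returns immediately instead of waiting one more interval.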

What the Team Uses It For

The endpoint has been running for three weeks. The team uses it for:

  • Code review suggestions — paste a function, ask for review
  • Commit message generation — describe changes, get a conventional commit
  • Documentation drafts — generate docstrings and README sections
  • SQL query help — describe what they want, get a query back

Traffic is light — maybe 50-100 requests/day total. The A10 GPU sits at 5-15% utilization most of the time. It's overkill, but even overkill on OCI is only ~$1,094/month, and the team finds it useful enough to justify the cost.

Cost vs. Alternatives

Option                             | Monthly Cost         | Setup Time
-----------------------------------|----------------------|--------------
OCI Container Instance + A10 GPU   | ~$1,094              | 5 minutes
OpenAI API (estimated 100 req/day) | ~$30-150             | N/A
Self-hosted on OKE                 | ~$1,094 + complexity | 30-60 minutes

Yeah, OpenAI is cheaper for this volume. But the team's requirement was "no data leaving our cloud." Compliance rules. The Container Instance approach gave them a private endpoint with zero Kubernetes complexity. Sometimes you pay for simplicity and privacy.
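
To put a number on what that simplicity and privacy costs per call, here is the arithmetic as a quick sketch. It assumes a 30-day month, the flat ~$1,094 monthly figure, and the 50-100 requests/day range mentioned earlier; cost_per_request is just an illustrative helper name.

```shell
#!/bin/bash
# cost_per_request MONTHLY_COST REQUESTS_PER_DAY
# Assumption: a 30-day month and a flat monthly price.
cost_per_request() {
  awk -v cost="$1" -v rpd="$2" \
    'BEGIN { printf "%.2f\n", cost / (rpd * 30) }'
}

cost_per_request 1094 50    # -> 0.73
cost_per_request 1094 100   # -> 0.36
```

So at this traffic level each request works out to somewhere between about $0.36 and $0.73, which is the real price of keeping the data inside the tenancy.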

If I Were Doing This Again

I'd use OCI File Storage instead of ephemeral storage so models survive container recreation. And I'd put an OCI API Gateway in front of it for rate limiting and auth, instead of relying on network-level access control. The gateway adds about $50/month but gives you proper API keys and request logging.

For teams larger than ~20 people or with higher throughput needs, I'd switch to vLLM on OKE with the setup I described in my earlier posts. But for a small team that just wants a private LLM without touching Kubernetes? Ollama on Container Instances is hard to beat for simplicity.


Pavan Madduri, Oracle ACE Associate, CNCF Golden Kubestronaut.
