A colleague asked me to set up a private LLM endpoint their team could use for code review suggestions. Requirements: OpenAI-compatible API, runs inside our cloud (no data leaving the tenancy), and "I don't want to learn Kubernetes."
That last requirement ruled out OKE. And honestly, for a single-model inference endpoint serving 10 people, Kubernetes is overkill anyway.
I had Ollama running on an OCI Container Instance with a GPU in about 5 minutes. Here's the whole thing.
Why Ollama Instead of vLLM
For a small team endpoint, Ollama wins on simplicity:
- Single binary, no Python dependencies
- Downloads models automatically on first run
- Manages multiple models with a simple `ollama pull`
- Built-in OpenAI-compatible API at `/v1/chat/completions`
- Handles model loading/unloading from GPU memory automatically
vLLM is better for high-throughput production (continuous batching, PagedAttention), but this isn't that. This is "10 developers hitting it a few times an hour."
The Deployment
One CLI command:
```shell
oci container-instances container-instance create \
  --compartment-id "$COMPARTMENT_ID" \
  --availability-domain "Uocm:US-ASHBURN-AD-1" \
  --display-name "team-ollama" \
  --shape "CI.Standard.GPU.A10.1" \
  --shape-config '{"ocpus": 15, "memoryInGBs": 240}' \
  --containers '[{
    "imageUrl": "docker.io/ollama/ollama:latest",
    "displayName": "ollama",
    "resourceConfig": {
      "vcpusLimit": 15,
      "memoryLimitInGBs": 240
    },
    "environmentVariables": {
      "OLLAMA_HOST": "0.0.0.0"
    },
    "healthChecks": [{
      "healthCheckType": "HTTP",
      "port": 11434,
      "path": "/",
      "intervalInSeconds": 30
    }]
  }]' \
  --vnics '[{
    "subnetId": "'"$PRIVATE_SUBNET_ID"'",
    "isPublicIpAssigned": false
  }]'
```
Few things to note:
- GPU shape: `CI.Standard.GPU.A10.1` gives you an A10 GPU with 24GB VRAM. Enough for most 7-13B parameter models.
- Private subnet: No public IP. The endpoint is only accessible from within the VCN. I added a bastion or VPN for the team to reach it.
- `OLLAMA_HOST=0.0.0.0`: By default Ollama only listens on localhost. Inside a container, you need it to listen on all interfaces.
The container starts in about 10 seconds. But no model is loaded yet.
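The commands that follow need the instance's private IP. Here is a minimal sketch for looking it up with the OCI CLI instead of copying it from the console. It assumes the instance OCID is in a hypothetical `CONTAINER_INSTANCE_ID` variable, and that the `get` output exposes the first VNIC's OCID at `data.vnics[0]."vnic-id"`, which is worth confirming against your CLI version. The function name `ci_private_ip` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: resolve the private IP of a Container Instance via its first VNIC.
# CONTAINER_INSTANCE_ID is a hypothetical variable holding the instance OCID.

ci_private_ip() {
  local instance_id="$1" vnic_id
  # The instance record lists its VNICs; take the OCID of the first one.
  # (Assumed JSON path; verify with a plain `get` first.)
  vnic_id=$(oci container-instances container-instance get \
    --container-instance-id "$instance_id" \
    --query 'data.vnics[0]."vnic-id"' --raw-output) || return 1
  # The VNIC record carries the private IP address.
  oci network vnic get --vnic-id "$vnic_id" \
    --query 'data."private-ip"' --raw-output
}

# Usage: OLLAMA_IP=$(ci_private_ip "$CONTAINER_INSTANCE_ID")
```

If the first lookup fails (wrong OCID, instance still provisioning), the function returns non-zero rather than querying a VNIC with an empty ID.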
Loading the Model
Ollama downloads models on first use. I SSH'd through the bastion and triggered the first pull:
```shell
# From a VM in the same VCN
OLLAMA_IP=10.0.1.42  # private IP of the Container Instance

# Pull a model (downloads to container's filesystem)
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "llama3.1:8b"}'

# Test it
curl http://$OLLAMA_IP:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Review this Go function for bugs: func add(a, b int) int { return a - b }"}]
  }'
```
The initial model download takes 3-4 minutes (8B model, just under 5GB). After that, once the model is loaded into GPU memory, responses start within a second or two.
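One caveat with traffic this light: by default Ollama unloads a model after about five idle minutes, so the first request after a quiet spell pays the load time again. Here is a sketch for preloading the model and asking Ollama to keep it resident, using the `keep_alive` parameter on `/api/generate` (a request with no prompt only loads the model). The helper names `preload_payload` and `preload_model` are made up for illustration, and `$OLLAMA_IP` is assumed to be set as above.

```shell
#!/usr/bin/env bash
# Sketch: load a model into GPU memory and keep it resident.
# A /api/generate request without a prompt only loads the model;
# keep_alive -1 asks Ollama not to unload it when idle.

preload_payload() {
  # $1 = model tag, e.g. llama3.1:8b
  printf '{"model": "%s", "keep_alive": -1}' "$1"
}

preload_model() {
  # $1 = host or IP of the Ollama endpoint, $2 = model tag
  curl -sf "http://$1:11434/api/generate" -d "$(preload_payload "$2")"
}

# Usage: preload_model "$OLLAMA_IP" "llama3.1:8b"
```

Setting `OLLAMA_KEEP_ALIVE` in the container's environment variables should give the same effect server-wide. With a single 8B model on a 24GB card there is little reason to let it unload.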
The Problem: Model Persistence
Here's the catch I should have thought about earlier. Container Instances don't have persistent storage by default. If the container restarts, the downloaded model is gone. You have to pull it again.
My fix was mounting an emptyDir volume, backed by the instance's ephemeral storage, at Ollama's model directory (/root/.ollama):
```shell
oci container-instances container-instance create \
  ... \
  --containers '[{
    "imageUrl": "docker.io/ollama/ollama:latest",
    "volumeMounts": [{
      "mountPath": "/root/.ollama",
      "volumeName": "model-storage"
    }]
  }]' \
  --volumes '[{
    "name": "model-storage",
    "volumeType": "EMPTYDIR",
    "backingStore": "EPHEMERAL_STORAGE"
  }]'
```
For true persistence across container recreations, you'd use an OCI File Storage (NFS) mount. But for this use case, the ephemeral storage survives restarts (not recreations), and I have a simple curl script that re-pulls the model if it's missing:
```shell
#!/bin/bash
# warmup.sh: run after container instance creation
# Usage: ./warmup.sh <private-ip>
OLLAMA_IP=${1:?usage: warmup.sh <private-ip>}

# Wait for Ollama to be ready
until curl -sf "http://$OLLAMA_IP:11434/" > /dev/null; do
  sleep 2
done

# Pull the model only if /api/show reports it missing (curl -f fails on HTTP errors)
if ! curl -sf "http://$OLLAMA_IP:11434/api/show" -d '{"name": "llama3.1:8b"}' > /dev/null 2>&1; then
  echo "Pulling model..."
  curl "http://$OLLAMA_IP:11434/api/pull" -d '{"name": "llama3.1:8b"}'
fi

echo "Ready at http://$OLLAMA_IP:11434"
```
What the Team Uses It For
The endpoint has been running for three weeks. The team uses it for:
- Code review suggestions — paste a function, ask for review
- Commit message generation — describe changes, get a conventional commit
- Documentation drafts — generate docstrings and README sections
- SQL query help — describe what they want, get a query back
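Because the API is OpenAI-compatible, most existing OpenAI clients can be pointed at the endpoint by overriding the base URL. Here is a sketch of what a developer might put in their shell profile, assuming a client that reads `OPENAI_BASE_URL` and `OPENAI_API_KEY` (the official OpenAI Python and Node SDKs do). The helper name `ollama_client_env` and the placeholder key value are made up for illustration; Ollama does not check the key, but SDKs generally require a non-empty one.

```shell
#!/usr/bin/env bash
# Sketch: emit the environment an OpenAI-style client needs to talk to Ollama.

ollama_client_env() {
  # $1 = private IP or hostname of the endpoint
  printf 'export OPENAI_BASE_URL="http://%s:11434/v1"\n' "$1"
  # Placeholder value: Ollama ignores it, the SDK just needs something set.
  printf 'export OPENAI_API_KEY="ollama"\n'
}

# Usage: eval "$(ollama_client_env 10.0.1.42)"
```

Tools that hard-code the OpenAI hostname will not pick this up, so it is worth checking each tool's own base-URL setting.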
Traffic is light — maybe 50-100 requests/day total. The A10 GPU sits at 5-15% utilization most of the time. It's overkill, but even overkill on OCI is only ~$1,094/month, and the team finds it useful enough to justify the cost.
Cost vs. Alternatives
| Option | Monthly Cost | Setup Time |
|---|---|---|
| OCI Container Instance + A10 GPU | ~$1,094 | 5 minutes |
| OpenAI API (estimated 100 req/day) | ~$30-150 | N/A |
| Self-hosted on OKE | ~$1,094 + complexity | 30-60 minutes |
Yeah, OpenAI is cheaper for this volume. But the team's requirement was "no data leaving our cloud." Compliance rules. The Container Instance approach gave them a private endpoint with zero Kubernetes complexity. Sometimes you pay for simplicity and privacy.
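To put the price of that privacy in per-request terms, here is a quick back-of-the-envelope calculation from the figures above. It assumes a 30-day month and the 50-100 requests/day estimate; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope cost per request from the figures in the post.

cost_per_request() {
  # $1 = monthly cost in USD, $2 = requests per day (30-day month assumed)
  awk -v cost="$1" -v rpd="$2" 'BEGIN { printf "%.2f\n", cost / (rpd * 30) }'
}

cost_per_request 1094 50    # low end of the traffic estimate
cost_per_request 1094 100   # high end of the traffic estimate
```

That works out to roughly $0.36 to $0.73 per request at current traffic. Since the GPU is mostly idle, the per-request figure drops almost linearly as usage grows.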
If I Were Doing This Again
I'd use OCI File Storage instead of ephemeral storage so models survive container recreation. And I'd put an OCI API Gateway in front of it for rate limiting and auth, instead of relying on network-level access control. The gateway adds about $50/month but gives you proper API keys and request logging.
For teams larger than ~20 people or with higher throughput needs, I'd switch to vLLM on OKE with the setup I described in my earlier posts. But for a small team that just wants a private LLM without touching Kubernetes? Ollama on Container Instances is hard to beat for simplicity.
Pavan Madduri, Oracle ACE Associate, CNCF Golden Kubestronaut.