chiruwonder
How I built an OpenAI-compatible API layer on top of Ollama (and what broke along the way)

I've been building NestAI for the past few months — a platform that deploys private Ollama + Open WebUI servers for teams in about 33 minutes. I recently shipped an OpenAI-compatible API layer on top of it and want to share what the journey looked like, including the parts that broke silently at 2am.

Why OpenAI-compatible
The obvious reason: adoption.
Most developers already have OpenAI code. LangChain integrations, existing chatbots, internal tools. If switching to a private AI stack means rewriting everything, most teams won't bother.
So we made it a one-line change:
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — everything else stays identical
client = OpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="YOUR_NESTAI_KEY"
)
```
Same SDK. Same methods. Same response format. Just your own infrastructure.
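Because the wire format is OpenAI's, you don't even need the SDK; plain stdlib HTTP works too. Here's a minimal sketch (the helper names are mine; the `/chat/completions` path under the base URL is the standard OpenAI one):

```python
import json
import urllib.request

def build_chat_request(model, prompt):
    """Build an OpenAI-style /chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send_chat(base_url, api_key, payload):
    """POST the payload to <base_url>/chat/completions and return parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]
```

Usage would be `extract_reply(send_chat("https://nestai.chirai.dev/api/v1", "YOUR_NESTAI_KEY", build_chat_request("llama3.1", "Hello")))` — handy for curl-level debugging when the SDK hides what's on the wire.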

The stack
Each NestAI server is a dedicated Hetzner Cloud VM running:

- Ollama — local model inference
- Open WebUI — chat interface + API layer
- nginx — reverse proxy + SSL termination
- certbot — SSL certificates

The backend that provisions these is Express/Node on another Hetzner server, using the Hetzner Cloud API to spin up VMs via cloud-init.
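The create call itself is essentially one POST to the Hetzner Cloud API with a cloud-init script as `user_data`. A simplified sketch in Python; the server type, image, and cloud-init contents here are illustrative, not NestAI's actual script:

```python
import json
import urllib.request

# Illustrative cloud-init: the real script also sets up nginx, certbot, Open WebUI, etc.
CLOUD_INIT = """#cloud-config
runcmd:
  - curl -fsSL https://ollama.com/install.sh | sh
"""

def build_server_payload(name, server_type="cpx31", image="ubuntu-22.04"):
    """Payload for POST https://api.hetzner.cloud/v1/servers."""
    return {
        "name": name,
        "server_type": server_type,
        "image": image,
        "user_data": CLOUD_INIT,
    }

def create_server(token, payload):
    """Fire the provisioning request and return Hetzner's JSON response."""
    req = urllib.request.Request(
        "https://api.hetzner.cloud/v1/servers",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Everything interesting happens inside `user_data` — the VM boots, cloud-init runs, and from that point on you only hear back via whatever callbacks you wired in.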

What actually broke

1. Certbot rewrites your nginx config silently

This one got me badly. The flow was:

- Install nginx, write config with proxy_pass to Open WebUI
- Run certbot --nginx --redirect
- Certbot rewrites the config — and silently removes the location blocks you added

Fix: after certbot runs, rewrite the nginx config programmatically with the correct SSL + proxy setup. Don't trust certbot to preserve your config.
```bash
certbot --nginx --redirect -d $DOMAIN --non-interactive --agree-tos -m $EMAIL

# Immediately overwrite with the correct config after certbot
cat > /etc/nginx/sites-available/webui << 'NGINX'
server {
    listen 443 ssl;
    # ssl_certificate / ssl_certificate_key paths go here
    location / {
        proxy_pass http://localhost:3000;
        proxy_read_timeout 600;
        proxy_buffering off;  # critical for streaming
    }
}
NGINX
```

2. Model pull completes but Open WebUI doesn't know yet

Ollama pulls the model fine. But Open WebUI caches the model list on startup. So you'd pull llama3.1 and it wouldn't appear in the UI until a restart.

Fix: restart Open WebUI after the model pull completes, or hit the /api/tags endpoint to trigger a refresh.
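To make that fix deterministic, you can poll Ollama's /api/tags (which lists locally available models) until the pulled model actually shows up, and only then restart Open WebUI. A sketch, assuming Ollama's default port 11434 and that tag names come back like `llama3.1:latest`:

```python
import json
import time
import urllib.request

def model_present(tags_json, name):
    """Check an Ollama /api/tags response for a model, ignoring the tag suffix."""
    return any(
        m["name"] == name or m["name"].split(":")[0] == name
        for m in tags_json.get("models", [])
    )

def wait_for_model(name, timeout=300):
    """Poll /api/tags until the model is listed; then it's safe to restart Open WebUI."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
            if model_present(json.load(resp), name):
                return True
        time.sleep(5)  # pulls can take minutes on a fresh VM
    return False
```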
3. Streaming breaks without proxy_buffering off

If you're using the streaming API and responses cut off randomly, it's almost certainly nginx buffering. One line fixes it (plus a longer read timeout for slow generations):

```nginx
proxy_buffering off;
proxy_read_timeout 600;
```

Without proxy_buffering off, nginx collects the whole streamed response before sending it to the client. Looks broken. Took me an embarrassingly long time to find this.
4. First user becomes WebUI admin unintentionally

Open WebUI makes the first signup the admin. If a team member visits before the owner sets up their account — they become admin.

Fix: auto-create an admin account via the API immediately after WebUI starts, before any users can reach it:

```bash
SIGNUP_RESP=$(curl -sf -X POST http://localhost:3000/api/v1/auths/signup \
  -H "Content-Type: application/json" \
  -d '{"name":"Admin","email":"admin@yourserver.local","password":"'$ADMIN_PASS'"}')
```

Store those credentials somewhere — you'll need them for analytics collection later.
5. Swap space or the model just won't load

Running a 7B model on a server with 8GB RAM and no swap configured? It'll silently fail or crash mid-inference with no useful error. Always configure swap before starting Ollama:

```bash
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

The API in practice
Once it's running, the OpenAI compatibility means you can drop it into basically anything:
LangChain:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="your-key",
    model="llama3.1"
)
```
Streaming in Node.js:
```javascript
import OpenAI from "openai"

const client = new OpenAI({
  baseURL: "https://nestai.chirai.dev/api/v1",
  apiKey: "your-key"
})

const stream = await client.chat.completions.create({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "")
}
```
No rate limits — unlike OpenAI's RPM/TPM caps, you're hitting your own server. The only limit is what your VM can handle. A 7B model on 8 cores does about 20-30 tok/s, which is fine for most internal tooling.

What I'd do differently
Start with the API layer from day one. I added it later and had to retrofit some things. If you're building on Ollama for teams, the API is the product — the chat UI is just one consumer of it.
Log everything during provisioning. Cloud-init runs in the dark. Add verbose logging at every step, send it back to your backend via callbacks. You'll thank yourself at 2am when a deployment fails silently.
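Concretely, that meant wrapping each provisioning step in a small reporter. A sketch of the idea; the `send` callback is injected so it can be an HTTP POST back to the backend, a log file, or anything else:

```python
import time

def run_step(name, fn, send):
    """Run one provisioning step, time it, and report the outcome via send()."""
    start = time.time()
    event = {"step": name, "status": "ok", "error": None}
    try:
        fn()
    except Exception as exc:  # report the failure instead of dying silently
        event["status"] = "failed"
        event["error"] = str(exc)
    event["seconds"] = round(time.time() - start, 2)
    send(event)
    return event
```

Each step reports its own success or failure, so when a deployment stalls you can see exactly which step it died on instead of staring at a VM that never came up.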
Test on a fresh VM every time. Your local Ollama setup has state accumulated over months. A fresh VM will surface issues your local environment hides.

Where it is now
NestAI is live at nestai.chirai.dev — deploys a private Ollama server for your team in ~33 minutes, OpenAI-compatible API included, starts at $40/month (₹3,499). There's a $2/₹99 trial if you want to kick the tyres.
The full API docs are at nestai.chirai.dev/docs/api.
Happy to answer questions about any of the above — especially the nginx/certbot stuff which I've seen trip up a lot of people.
