Your startup is spending $500–2,000/month on OpenAI API calls. You've hit rate limits during a demo. You've had a customer ask where their data goes when they use your AI feature.
This post shows you how to swap out the OpenAI API for your own private server — using the same SDK, the same code, and zero per-token billing. No infrastructure expertise required.
I'll cover three approaches (from easiest to most hands-on), with real code you can copy-paste.
Why replace the OpenAI API?
Before we get into the how, here's why teams are making this switch in 2026:
Cost predictability. OpenAI charges per token. At low volume that's fine. At 10M+ tokens/month, you're paying $150–$1,500/month — and one viral feature can blow your budget overnight. A private server costs a flat monthly fee regardless of usage.
Data privacy. Every prompt and response you send to OpenAI's API transits their infrastructure. If you're building for legal, healthcare, finance, or any regulated industry, that's a compliance risk. Self-hosted means your data never leaves your server.
No rate limits. OpenAI's API has RPM (requests per minute) and TPM (tokens per minute) caps. When your app scales, you hit walls. Your own server has no artificial limits — throughput is bounded only by your hardware.
Vendor independence. OpenAI can change pricing, deprecate models, or modify their ToS at any time. Your own infrastructure, your own rules.
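To make the cost point concrete, here is a quick break-even sketch. The prices are illustrative (a hypothetical flat fee vs. per-token billing), not quotes:

```python
def break_even_tokens(flat_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which a flat-fee server costs the
    same as per-token API billing."""
    return flat_monthly_usd / usd_per_million_tokens * 1_000_000

# At an assumed $39/mo flat fee vs. $15 per 1M tokens:
print(break_even_tokens(39, 15))  # 2600000.0 -> ~2.6M tokens/month
```

Past roughly 2.6M tokens/month at those assumed prices the flat fee wins; below it, the API is cheaper, which is why low-volume apps should think twice before switching.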
The two-line switch
Here's the punchline. If your app uses the OpenAI Python or Node.js SDK, the entire migration is two lines:
Before (OpenAI):
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-openai-key"
)
```
After (your own server):
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-server.example.com/api/v1",
    api_key="your-private-key"
)
```
Everything else — client.chat.completions.create(), streaming, the response format — stays identical. That's because the server exposes an OpenAI-compatible API using the same request/response schema.
This works with:

- OpenAI Python SDK
- OpenAI Node.js SDK
- LangChain (Python and JS)
- LlamaIndex
- AutoGen / CrewAI
- Flowise
- n8n
- Any tool with a "custom OpenAI base URL" setting
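For tools in that list that don't expose a `base_url` argument directly, the OpenAI SDKs also fall back to the `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables, so you can often redirect them without touching code. A minimal sketch (the server URL and key are placeholders):

```python
import os

# The OpenAI Python/Node SDKs read these env vars when the client
# constructor is given no explicit base_url / api_key, so setting them
# points any SDK-based tool at your private server.
os.environ["OPENAI_BASE_URL"] = "https://your-server.example.com/api/v1"
os.environ["OPENAI_API_KEY"] = "your-private-key"

# from openai import OpenAI
# client = OpenAI()  # now targets your server with no code change
```

This is handy in CI and container setups, where swapping an env var is easier than patching third-party code.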
Approach 1: Managed deployment (fastest — 33 minutes)
If you don't want to touch servers, Docker, or nginx configs, a managed service handles everything.
Full disclosure: I built NestAI for exactly this purpose, so I'll use it as the example. But the concepts apply to any managed Ollama hosting service (Elestio, Railway templates, etc.).
How it works
1. Sign up at nestai.chirai.dev
2. Choose a model (Llama 3.3, Mistral, Qwen 3.5, DeepSeek R1, etc.)
3. Choose a region (Germany, US East, or Singapore)
4. Pay → server deploys automatically in ~33 minutes
5. Go to Dashboard → API → Generate API Key
You get a nai-xxxx bearer token and a base URL:
```
Base URL: https://nestai.chirai.dev/api/v1
API Key:  nai-xxxxxxxxxxxxxxxxxxxx
```
Full working example (Python)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="nai-your-key-here"
)

# Non-streaming
response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful legal assistant."},
        {"role": "user", "content": "Summarise the key risks in this NDA."}
    ]
)
print(response.choices[0].message.content)
```
Streaming
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="nai-your-key-here"
)

# Pass stream=True and iterate over the chunks as they arrive
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Explain transformer attention"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
Node.js
```javascript
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'https://nestai.chirai.dev/api/v1',
  apiKey: 'nai-your-key-here',
})

const response = await client.chat.completions.create({
  model: 'mistral',
  messages: [{ role: 'user', content: 'Draft a follow-up email for a sales call' }],
})
console.log(response.choices[0].message.content)
```
cURL
```bash
curl https://nestai.chirai.dev/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer nai-your-key-here" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
RAG (query your own documents)
If you've uploaded documents to the knowledge base, you can reference them via the API:
```json
{
  "model": "mistral",
  "messages": [{"role": "user", "content": "What does our refund policy say?"}],
  "files": [{"type": "collection", "id": "YOUR_COLLECTION_ID"}]
}
```
This runs retrieval-augmented generation against your private document store. No data leaves your server.
What you get
- Zero token limits — no RPM, TPM, or daily caps
- Flat pricing — ₹3,499/mo (~$39) for Solo, ₹11,999/mo (~$135) for Team (10 seats)
- Data residency — choose EU (Germany), US East, or Singapore
- Dedicated resources — optional AMD EPYC upgrade up to 48 vCPU, 192GB RAM
- Full OpenAI SDK compatibility — streaming, models list, chat completions
Approach 2: Self-hosted on a VPS (30–60 minutes, more control)
If you want full control and are comfortable with SSH and Docker, you can set up your own OpenAI-compatible endpoint on any VPS.
Stack
- VPS: Hetzner CX43 (~$10/mo), DigitalOcean, or any Ubuntu server with 16GB+ RAM
- LLM engine: Ollama
- Web UI + API proxy: Open WebUI
- Reverse proxy: nginx + Let's Encrypt SSL
Step 1: Install Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
```
Step 2: Install Open WebUI
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Step 3: Set up nginx + SSL
```nginx
server {
    listen 80;
    server_name ai.yourcompany.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 600;
    }
}
```
Then:
```bash
sudo certbot --nginx -d ai.yourcompany.com
```
Step 4: Use the API
Open WebUI exposes an OpenAI-compatible API at /api/chat/completions. Generate an API key from the Open WebUI admin panel, then:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourcompany.com/api",
    api_key="your-open-webui-api-key"
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
Downsides of self-hosting
- You manage updates, security patches, and SSL renewals
- Exposing Ollama to the internet requires careful firewall config
- No automatic health monitoring or alerting
- Model pulling, swap management, and disk cleanup are your job
This is where a managed service saves time. But if you want full control, this works.
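To take some of the sting out of the "no monitoring" point, a minimal health check you can run from cron is only a few lines. This is a sketch, not a monitoring system, and the URL is a placeholder for your own endpoint:

```python
import urllib.request
import urllib.error

def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status, False on
    any connection error, timeout, or non-2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

# Example cron usage (every 5 minutes):
#   */5 * * * * python3 healthcheck.py || notify-send "AI server down"
```

Pair it with whatever alerting you already have (email, Slack webhook, PagerDuty) and you cover the most common failure mode: the box silently going down over a weekend.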
Approach 3: Hybrid (cheapest at scale)
Use a private server for routine tasks and fall back to OpenAI for complex reasoning:
```python
from openai import OpenAI

# Private server for 90% of requests
private = OpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="nai-your-key"
)

# OpenAI for the remaining 10%
cloud = OpenAI(api_key="sk-your-openai-key")

def smart_route(messages, complexity="low"):
    if complexity == "high":
        return cloud.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
    return private.chat.completions.create(
        model="mistral",
        messages=messages
    )
```
This cuts your OpenAI bill by 80–90% while keeping GPT-4o available for tasks that genuinely need it.
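What counts as "high" complexity is up to you. A crude heuristic based on prompt length plus reasoning keywords is often enough to start; the threshold and keyword list below are made-up defaults to tune against your own traffic:

```python
def estimate_complexity(messages: list[dict]) -> str:
    """Classify a request as 'high' (route to the cloud model) or 'low'
    (route to the private server) using length and keyword heuristics."""
    text = " ".join(m.get("content", "") for m in messages).lower()
    # Hypothetical markers of reasoning-heavy requests; tune for your app
    reasoning_markers = ("step by step", "prove", "compare", "trade-off", "analyze")
    if len(text) > 2000 or any(marker in text for marker in reasoning_markers):
        return "high"
    return "low"
```

You would call it as `smart_route(messages, complexity=estimate_complexity(messages))`. A more robust option is to let a small local model classify the request first, but the static heuristic is free and surprisingly effective.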
Speed benchmarks (honest numbers)
These are CPU-only numbers. No GPU. Single-user, sequential requests.
| Model | Server spec | Tokens/sec | Response time (200 words) |
|---|---|---|---|
| Qwen 3.5 4B | 8 vCPU, 16GB RAM | ~15–20 tok/s | ~4 seconds |
| Mistral 7B | 8 vCPU, 16GB RAM | ~10–15 tok/s | ~6 seconds |
| DeepSeek R1 7B | 8 vCPU, 32GB RAM | ~10–14 tok/s | ~6 seconds |
| Llama 3.3 70B | 16 vCPU, 64GB RAM | ~2–3 tok/s | ~30 seconds |
For comparison, GPT-4o streams at ~50–60 tok/s. So private server inference is 3–5x slower for small models and 20x slower for 70B. The tradeoff is: slower responses, but zero per-token cost, full privacy, and no rate limits.
For most use cases — document analysis, internal knowledge bases, batch processing, async workflows — 10–15 tok/s is perfectly usable.
LangChain integration
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://nestai.chirai.dev/api/v1",
    api_key="nai-your-key",
    model="mistral",
)

response = llm.invoke("Summarise the quarterly report")
print(response.content)
```
That's it. LangChain's ChatOpenAI class accepts a custom base_url. Every chain, agent, and tool you've built on LangChain works identically.
When to stay on OpenAI
Private servers aren't always the right choice. Stay on OpenAI if:
- You need GPT-4o/o1-level reasoning quality (open-source models are good but not frontier-tier yet)
- Your usage is under 1M tokens/month (at that volume the API is cheaper than running a server)
- You need function calling with complex tool schemas (Ollama's function calling support is model-dependent)
- You need image generation (open-source image models exist, but they are a separate stack from the text-only setup described here)
For everything else — internal tools, document Q&A, customer support bots, data processing pipelines, code generation with CodeLlama/Qwen — a private server delivers the same results at a fraction of the cost.
TL;DR
| | OpenAI API | Private server |
|---|---|---|
| Cost | Per-token ($15/M tokens for GPT-4o) | Flat monthly ($39–$299/mo) |
| Data privacy | Data transits OpenAI servers | Data stays on your server |
| Rate limits | Yes (RPM/TPM caps) | None |
| Speed | ~50–60 tok/s | ~10–20 tok/s (7B, CPU) |
| SDK compatibility | Native | Same (OpenAI SDK, LangChain, etc.) |
| Setup time | Minutes | 33 min (managed) or ~1 hr (self-hosted) |
| Best for | Complex reasoning, low volume | Privacy, high volume, predictable costs |
Get started
Managed (easiest): nestai.chirai.dev — deploy a private AI server with an OpenAI-compatible API in 33 minutes. API docs at nestai.chirai.dev/docs/api.
Self-hosted: Install Ollama + Open WebUI on any VPS.
Questions? Drop a comment below or reach me at nestaisupport@chirai.dev.
I'm Chiranjiv, a solo founder from India building NestAI — managed private AI hosting for teams. If you're spending too much on OpenAI API calls or can't send client data to public AI, this is what I built to solve that problem.