I Got Tired of API Outages Killing My Chatbot
I run a small chatbot for a community Discord. Nothing massive — maybe 200 messages a day. But every time OpenAI had an outage or rate-limited us, the bot just... died. Users would get error messages, I'd get pings at 2am, and I'd scramble to restart things.
So I built a stack that falls back automatically: NanoGPT as primary, Ollama as local fallback. If NanoGPT's API is down, it switches to a local model. If both are down, it queues messages and retries. No more 2am pages.
There's a great guide to NanoGPT setup if you want the full installation walkthrough. Here I'm focusing on the Python integration.
The Architecture
The idea is simple: try NanoGPT first (it's cheap and fast), fall back to Ollama (local, free, always available), and queue if both fail.
User message -> NanoGPT API (primary)
                     |
                     v (on failure)
                Ollama local (fallback)
                     |
                     v (on failure)
                Message queue (retry later)
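Stripped of provider details, the flow above is just "try each backend in order, return the first answer". Here is a minimal sketch of that pattern. The names `first_successful`, `flaky`, and `local` are hypothetical stand-ins for illustration, not part of the bot's real code.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

# A provider is a (name, async callable) pair; the names used here are illustrative.
Provider = Tuple[str, Callable[[str], Awaitable[str]]]

async def first_successful(
    prompt: str, providers: Sequence[Provider]
) -> Optional[Tuple[str, str]]:
    """Try providers in order; return (provider_name, reply) or None if all fail."""
    for name, call in providers:
        try:
            return name, await call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            print(f"{name} failed: {exc}")
    return None  # nothing answered: the caller should queue the message

async def flaky(prompt: str) -> str:
    """Simulates a primary API that is currently down."""
    raise ConnectionError("simulated outage")

async def local(prompt: str) -> str:
    """Simulates a local model that always answers."""
    return f"echo: {prompt}"

if __name__ == "__main__":
    chain = [("primary", flaky), ("fallback", local)]
    print(asyncio.run(first_successful("hi", chain)))
```

Returning `None` instead of raising keeps the "queue it" decision in the caller, which is roughly how the class below handles the case where both backends fail.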
Setting Up the Environment
# Create the project
mkdir private-chatbot && cd private-chatbot
python -m venv venv && source venv/bin/activate
# Dependencies
pip install requests aiohttp python-dotenv
# .env file
cat > .env << 'EOF'
NANOGPT_API_KEY=your_key_here
NANOGPT_API_URL=https://api.nanogpt.io/v1
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
EOF
Make sure Ollama is running locally. If you haven't set it up:
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:8b
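Before wiring Ollama in as the fallback, it helps to confirm the server is reachable and the model is actually pulled. The sketch below uses only the standard library and Ollama's `/api/tags` endpoint, which lists local models. The helper names `model_available` and `ollama_has_model` are my own, and the response shape (a `models` list whose entries carry a `name`) is assumed from Ollama's API documentation.

```python
import json
import urllib.request

def model_available(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(m.get("name") == model for m in tags.get("models", []))

def ollama_has_model(host: str, model: str, timeout: float = 5.0) -> bool:
    """Ask a running Ollama server whether `model` is available locally."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            return model_available(json.load(resp), model)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return False
```

Calling `ollama_has_model("http://localhost:11434", "llama3.1:8b")` at startup gives a quick yes/no before the bot ever needs the fallback.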
The Chatbot Class
This is the core — a class that handles both providers with automatic fallback:
import os
import json
import time
import asyncio
import aiohttp
from dataclasses import dataclass
from typing import List, AsyncIterator
from dotenv import load_dotenv

load_dotenv()  # read NANOGPT_* and OLLAMA_* settings from .env

@dataclass
class Message:
    role: str
    content: str

@dataclass
class ChatResponse:
    content: str
    model: str
    provider: str
    tokens_used: int = 0
    latency_ms: float = 0
class PrivateChatBot:
    def __init__(self):
        self.nanogpt_key = os.environ.get("NANOGPT_API_KEY")
        self.nanogpt_url = os.environ.get("NANOGPT_API_URL", "https://api.nanogpt.io/v1")
        self.ollama_host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        self.ollama_model = os.environ.get("OLLAMA_MODEL", "llama3.1:8b")
        self.message_queue: List[dict] = []
        self.conversation_history: List[Message] = []
        self.max_history = 20

    async def chat(self, user_message: str) -> ChatResponse:
        """Send a message, auto-fallback on failure."""
        self.conversation_history.append(Message("user", user_message))
        if len(self.conversation_history) > self.max_history:
            self.conversation_history = self.conversation_history[-self.max_history:]

        # Try NanoGPT first
        try:
            response = await self._call_nanogpt(user_message)
            self.conversation_history.append(Message("assistant", response.content))
            return response
        except Exception as e:
            print(f"NanoGPT failed: {e}, falling back to Ollama")

        # Fallback to Ollama
        try:
            response = await self._call_ollama(user_message)
            self.conversation_history.append(Message("assistant", response.content))
            return response
        except Exception as e:
            print(f"Ollama also failed: {e}, queuing message")
            self.message_queue.append({
                "message": user_message,
                "timestamp": time.time(),
                "retries": 0
            })
            return ChatResponse(
                content="I'm having trouble reaching my AI backends. Your message is queued and I'll respond when things are back up.",
                model="none",
                provider="queue"
            )
    async def _call_nanogpt(self, user_message: str) -> ChatResponse:
        """Call NanoGPT API."""
        start = time.time()
        # The full history (which already includes user_message) is sent each time
        messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.nanogpt_url}/chat/completions",
                headers={
                    "Authorization": f"Bearer {self.nanogpt_key}",
                    "Content-Type": "application/json"
                },
                # Note: no "model" is specified here; add one if your endpoint requires it
                json={
                    "messages": messages,
                    "max_tokens": 1000,
                    "temperature": 0.7
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                if resp.status != 200:
                    body = await resp.text()
                    raise Exception(f"NanoGPT API error {resp.status}: {body}")
                data = await resp.json()
        latency = (time.time() - start) * 1000
        return ChatResponse(
            content=data["choices"][0]["message"]["content"],
            model=data.get("model", "nanogpt"),
            provider="nanogpt",
            tokens_used=data.get("usage", {}).get("total_tokens", 0),
            latency_ms=latency
        )
    async def _call_ollama(self, user_message: str) -> ChatResponse:
        """Call local Ollama instance."""
        start = time.time()
        messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.ollama_host}/api/chat",
                json={
                    "model": self.ollama_model,
                    "messages": messages,
                    "stream": False
                },
                timeout=aiohttp.ClientTimeout(total=120)
            ) as resp:
                if resp.status != 200:
                    body = await resp.text()
                    raise Exception(f"Ollama error {resp.status}: {body}")
                data = await resp.json()
        latency = (time.time() - start) * 1000
        return ChatResponse(
            content=data["message"]["content"],
            model=data.get("model", self.ollama_model),
            provider="ollama",
            tokens_used=data.get("eval_count", 0),
            latency_ms=latency
        )
    async def stream_chat(self, user_message: str) -> AsyncIterator[str]:
        """Stream tokens from NanoGPT (falls back to non-streaming chat())."""
        # Record the user's message first so it is part of the request payload
        self.conversation_history.append(Message("user", user_message))
        if len(self.conversation_history) > self.max_history:
            self.conversation_history = self.conversation_history[-self.max_history:]
        full_response = ""
        try:
            messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self.nanogpt_url}/chat/completions",
                    headers={
                        "Authorization": f"Bearer {self.nanogpt_key}",
                        "Content-Type": "application/json"
                    },
                    json={
                        "messages": messages,
                        "max_tokens": 1000,
                        "temperature": 0.7,
                        "stream": True
                    },
                    timeout=aiohttp.ClientTimeout(total=60)
                ) as resp:
                    if resp.status != 200:
                        body = await resp.text()
                        raise Exception(f"NanoGPT API error {resp.status}: {body}")
                    # Each server-sent event arrives as a line: "data: {json}"
                    async for line in resp.content:
                        line = line.decode().strip()
                        if not line or line == "data: [DONE]":
                            continue
                        if line.startswith("data: "):
                            try:
                                chunk = json.loads(line[6:])
                                delta = chunk["choices"][0].get("delta", {})
                                token = delta.get("content")
                                if token:  # skip role-only or empty deltas
                                    full_response += token
                                    yield token
                            except json.JSONDecodeError:
                                continue
            self.conversation_history.append(Message("assistant", full_response))
        except Exception as e:
            print(f"Streaming failed ({e}), using non-streaming fallback")
            # chat() appends the user message itself, so drop our copy first
            self.conversation_history.pop()
            response = await self.chat(user_message)
            yield response.content
Testing It
import asyncio

async def main():
    bot = PrivateChatBot()

    # Simple chat
    response = await bot.chat("What's the capital of France?")
    print(f"[{response.provider}] {response.content}")
    print(f"Latency: {response.latency_ms:.0f}ms, Tokens: {response.tokens_used}")

    # Streaming
    print("\nStreaming response:")
    async for token in bot.stream_chat("Explain quantum computing in 3 sentences"):
        print(token, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(main())
What I Learned Running This in Production
NanoGPT latency varies. Sometimes 200ms, sometimes 2s. The Ollama fallback is actually faster on my server (running llama3.1:8b on a 3090) because there's no network hop.
Ollama's context window matters. The 8B model handles 4K context fine, but quality drops fast above that. I truncate history to 20 messages for a reason.
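A fixed message count only loosely bounds the context size, since one long message can outweigh twenty short ones. As a rough sketch of a size-aware alternative, the hypothetical helper below keeps the newest messages that fit an estimated token budget. The 4-characters-per-token ratio is a crude rule of thumb for English text, not something measured against any particular tokenizer.

```python
from typing import Dict, List

def trim_to_budget(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    chars_per_token: int = 4,
) -> List[Dict[str, str]]:
    """Keep the most recent messages whose estimated size fits the budget.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = len(msg["content"]) // chars_per_token + 1  # crude token estimate
        if kept and used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

In the class above, this could be applied to the `messages` list just before the request is sent, leaving `conversation_history` itself untouched.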
The queue saves you. When both providers are down (happened twice in 3 months), the queue means users get eventual responses instead of errors.
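The class shown earlier only appends to `message_queue`; the retry side is not shown. Here is a minimal sketch of what one retry pass might look like, using the same queue item shape (`message`, `timestamp`, `retries`). The names `drain_queue`, `echo`, and `down` are hypothetical, and `send` is assumed to be any coroutine that raises when no backend answers.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

async def drain_queue(
    queue: List[Dict],
    send: Callable[[str], Awaitable[str]],
    max_retries: int = 5,
) -> List[str]:
    """Retry each queued message once and return the replies that succeeded.

    Failed items stay queued with `retries` incremented; once `retries`
    exceeds `max_retries` the item is dropped.
    """
    replies: List[str] = []
    still_pending: List[Dict] = []
    for item in queue:
        try:
            replies.append(await send(item["message"]))
        except Exception:
            item["retries"] += 1
            if item["retries"] <= max_retries:
                still_pending.append(item)
    queue[:] = still_pending  # update in place so the caller's list reflects the result
    return replies

async def echo(message: str) -> str:
    """Stand-in for a backend call that succeeds (illustrative only)."""
    return f"re: {message}"

async def down(message: str) -> str:
    """Stand-in for a backend call that fails (illustrative only)."""
    raise RuntimeError("backends unavailable")

if __name__ == "__main__":
    pending = [{"message": "hello", "timestamp": 0.0, "retries": 0}]
    print(asyncio.run(drain_queue(pending, echo)), pending)
```

A background task could call this every minute or so. Note that `chat()` itself never raises (it re-queues on failure), so `send` would need to wrap the provider calls directly rather than `chat()`.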
Cost comparison. NanoGPT at ~$0.001 per request vs OpenAI at ~$0.01 — for 200 messages/day that's $0.20/day vs $2/day. Over a year, $73 vs $730. The Ollama fallback is free.
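The arithmetic behind those figures, as a quick sketch you can rerun with your own volume. The per-request prices are the rough estimates quoted above, not published rates, and `yearly_cost` is just an illustrative helper.

```python
def yearly_cost(messages_per_day: int, cost_per_request: float, days: int = 365) -> float:
    """Projected yearly spend in dollars, rounded to cents."""
    return round(messages_per_day * cost_per_request * days, 2)

print(yearly_cost(200, 0.001))  # NanoGPT estimate: 73.0
print(yearly_cost(200, 0.01))   # OpenAI estimate: 730.0
```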
If you want to explore NanoGPT's full capabilities, here's where to start. Their pricing model is genuinely the best I've found for API access.
Why Not Just Use Ollama for Everything?
I tried. Two problems: my server's GPU is shared with other workloads, so under load, Ollama gets slow. And the 8B model just isn't as good as NanoGPT's models for complex reasoning. The hybrid approach gives you the best of both — cheap primary with a guaranteed fallback.
The Full Setup Checklist
- Get a NanoGPT API key
- Install Ollama and pull your model
- Copy the code above
- Set your .env variables
- Run and test with python chatbot.py
- Wrap it in a Discord bot / Telegram bot / whatever frontend you want
The whole thing is maybe 150 lines of Python. No frameworks, no magic, just API calls with fallback logic. Works on any machine with Python 3.10+.
Originally published at https://nano-gpt-guide.vercel.app