Build a Private AI Chatbot Stack: NanoGPT + Ollama + Python

#ai #privacy #python #opensource

I Got Tired of API Outages Killing My Chatbot

I run a small chatbot for a community Discord. Nothing massive — maybe 200 messages a day. But every time OpenAI had an outage or rate-limited us, the bot just... died. Users would get error messages, I'd get pings at 2am, and I'd scramble to restart things.

So I built a stack that falls back automatically: NanoGPT as primary, Ollama as local fallback. If NanoGPT's API is down, it switches to a local model. If both are down, it queues messages and retries. No more 2am pages.

There's a great guide to NanoGPT setup if you want the full installation walkthrough. Here I'm focusing on the Python integration.

The Architecture

The idea is simple: try NanoGPT first (it's cheap and fast), fall back to Ollama (local, free, always available), and queue if both fail.

User message -> NanoGPT API (primary)
                  |
                  v (on failure)
               Ollama local (fallback)
                  |
                  v (on failure)
               Message queue (retry later)

Setting Up the Environment

# Create the project
mkdir private-chatbot && cd private-chatbot
python -m venv venv && source venv/bin/activate

# Dependencies
pip install aiohttp python-dotenv

# .env file
cat > .env << 'EOF'
NANOGPT_API_KEY=your_key_here
NANOGPT_API_URL=https://api.nanogpt.io/v1
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
EOF

Make sure Ollama is running locally. If you haven't set it up:

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:8b
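Before starting the bot, it can help to confirm that the Ollama server answers and that the model has actually been pulled. Here's a minimal sketch using only the standard library. It assumes Ollama's `/api/tags` endpoint, which lists local models; the helper names `model_is_available` and `check_ollama` are made up for this example.

```python
import json
import urllib.request


def model_is_available(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an /api/tags response body."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return model in names


def check_ollama(host: str, model: str, timeout: float = 5.0) -> bool:
    """Ask the Ollama server which models it has and look for ours."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:
        # OSError covers refused connections and timeouts,
        # ValueError covers a body that is not valid JSON.
        print(f"Ollama not reachable at {host}: {exc}")
        return False
    return model_is_available(payload, model)
```

Calling `check_ollama("http://localhost:11434", "llama3.1:8b")` at startup should tell you whether the fallback is usable before the first user message arrives. The match is exact, so a model pulled without a tag would show up under a different name (typically with `:latest`).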

The Chatbot Class

This is the core — a class that handles both providers with automatic fallback:

import os
import json
import time
import asyncio
import aiohttp
from dataclasses import dataclass
from typing import List, AsyncIterator
from dotenv import load_dotenv

load_dotenv()

@dataclass
class Message:
    role: str
    content: str

@dataclass
class ChatResponse:
    content: str
    model: str
    provider: str
    tokens_used: int = 0
    latency_ms: float = 0

class PrivateChatBot:
    def __init__(self):
        self.nanogpt_key = os.environ.get("NANOGPT_API_KEY")
        self.nanogpt_url = os.environ.get("NANOGPT_API_URL", "https://api.nanogpt.io/v1")
        self.ollama_host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        self.ollama_model = os.environ.get("OLLAMA_MODEL", "llama3.1:8b")
        self.message_queue: List[dict] = []
        self.conversation_history: List[Message] = []
        self.max_history = 20

    async def chat(self, user_message: str) -> ChatResponse:
        """Send a message, auto-fallback on failure."""
        self.conversation_history.append(Message("user", user_message))
        if len(self.conversation_history) > self.max_history:
            self.conversation_history = self.conversation_history[-self.max_history:]

        # Try NanoGPT first
        try:
            response = await self._call_nanogpt(user_message)
            self.conversation_history.append(Message("assistant", response.content))
            return response
        except Exception as e:
            print(f"NanoGPT failed: {e}, falling back to Ollama")

        # Fallback to Ollama
        try:
            response = await self._call_ollama(user_message)
            self.conversation_history.append(Message("assistant", response.content))
            return response
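        except Exception as e:
            print(f"Ollama also failed: {e}, queuing message")
            # Drop the unanswered user turn so a later retry does not duplicate it.
            self.conversation_history.pop()
            self.message_queue.append({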
                "message": user_message,
                "timestamp": time.time(),
                "retries": 0
            })
            return ChatResponse(
                content="I'm having trouble reaching my AI backends. Your message is queued and I'll respond when things are back up.",
                model="none",
                provider="queue"
            )

    async def _call_nanogpt(self, user_message: str) -> ChatResponse:
        """Call NanoGPT API."""
        start = time.time()
        messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]

        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.nanogpt_url}/chat/completions",
                headers={
                    "Authorization": f"Bearer {self.nanogpt_key}",
                    "Content-Type": "application/json"
                },
                json={
                    "messages": messages,
                    "max_tokens": 1000,
                    "temperature": 0.7
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                if resp.status != 200:
                    body = await resp.text()
                    raise Exception(f"NanoGPT API error {resp.status}: {body}")

                data = await resp.json()
                latency = (time.time() - start) * 1000

                return ChatResponse(
                    content=data["choices"][0]["message"]["content"],
                    model=data.get("model", "nanogpt"),
                    provider="nanogpt",
                    tokens_used=data.get("usage", {}).get("total_tokens", 0),
                    latency_ms=latency
                )

    async def _call_ollama(self, user_message: str) -> ChatResponse:
        """Call local Ollama instance."""
        start = time.time()
        messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]

        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.ollama_host}/api/chat",
                json={
                    "model": self.ollama_model,
                    "messages": messages,
                    "stream": False
                },
                timeout=aiohttp.ClientTimeout(total=120)
            ) as resp:
                if resp.status != 200:
                    body = await resp.text()
                    raise Exception(f"Ollama error {resp.status}: {body}")

                data = await resp.json()
                latency = (time.time() - start) * 1000

                return ChatResponse(
                    content=data["message"]["content"],
                    model=data.get("model", self.ollama_model),
                    provider="ollama",
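                    # eval_count is output tokens only; add prompt tokens for a total.
                    tokens_used=data.get("prompt_eval_count", 0) + data.get("eval_count", 0),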
                    latency_ms=latency
                )

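    async def stream_chat(self, user_message: str) -> AsyncIterator[str]:
        """Stream tokens from NanoGPT; on failure, fall back to non-streaming chat()."""
        # Record the user turn first so it is actually part of the request.
        self.conversation_history.append(Message("user", user_message))
        self.conversation_history = self.conversation_history[-self.max_history:]
        full_response = ""
        try:
            messages = [{"role": m.role, "content": m.content} for m in self.conversation_history]

            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self.nanogpt_url}/chat/completions",
                    headers={
                        "Authorization": f"Bearer {self.nanogpt_key}",
                        "Content-Type": "application/json"
                    },
                    json={
                        "messages": messages,
                        "max_tokens": 1000,
                        "temperature": 0.7,
                        "stream": True
                    },
                    timeout=aiohttp.ClientTimeout(total=60)
                ) as resp:
                    # Without this check, an error body is read as an empty stream.
                    if resp.status != 200:
                        body = await resp.text()
                        raise Exception(f"NanoGPT API error {resp.status}: {body}")

                    async for line in resp.content:
                        line = line.decode().strip()
                        if not line or line == "data: [DONE]":
                            continue
                        if line.startswith("data: "):
                            try:
                                chunk = json.loads(line[6:])
                                delta = chunk["choices"][0].get("delta", {})
                            except (json.JSONDecodeError, KeyError, IndexError):
                                continue
                            token = delta.get("content")
                            if token:  # skip missing or null content
                                full_response += token
                                yield token

            self.conversation_history.append(Message("assistant", full_response))
        except Exception as e:
            print(f"Streaming failed ({e}), using non-streaming fallback")
            # chat() appends the user turn itself, so remove the copy added above.
            self.conversation_history.pop()
            # Note: if the stream broke midway, tokens already yielded are
            # followed by the complete fallback reply.
            response = await self.chat(user_message)
            yield response.content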

Testing It

import asyncio

async def main():
    bot = PrivateChatBot()

    # Simple chat
    response = await bot.chat("What's the capital of France?")
    print(f"[{response.provider}] {response.content}")
    print(f"Latency: {response.latency_ms:.0f}ms, Tokens: {response.tokens_used}")

    # Streaming
    print("\nStreaming response:")
    async for token in bot.stream_chat("Explain quantum computing in 3 sentences"):
        print(token, end="", flush=True)
    print()

asyncio.run(main())

What I Learned Running This in Production

NanoGPT latency varies. Sometimes 200ms, sometimes 2s. The Ollama fallback is actually faster on my server (running llama3.1:8b on a 3090) because there's no network hop.
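If you want numbers instead of a gut feeling, you can feed each `ChatResponse.latency_ms` into a small tracker. This is just a sketch: `LatencyTracker` is a made-up helper, and the sample values are illustrative, not measurements.

```python
from collections import defaultdict
from statistics import median


class LatencyTracker:
    """Keeps per-provider latency samples (in ms) and summarizes them."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)

    def record(self, provider: str, latency_ms: float) -> None:
        self.samples[provider].append(latency_ms)

    def summary(self) -> dict:
        """Return {provider: (median_ms, max_ms)} for every provider seen."""
        return {p: (median(v), max(v)) for p, v in self.samples.items()}


tracker = LatencyTracker()
# Illustrative samples only.
for provider, ms in [("nanogpt", 200), ("nanogpt", 2000), ("nanogpt", 450), ("ollama", 300)]:
    tracker.record(provider, ms)
print(tracker.summary())
```

In the bot you'd call `tracker.record(response.provider, response.latency_ms)` after each `chat()`.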

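Ollama's context window matters. The 8B model handles 4K context fine, but quality drops fast above that. That's why I cap history at 20 messages (max_history in the class). It's a crude proxy for tokens, since a few long messages can still blow past the window.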
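Counting messages is only a proxy for context size. One alternative is to trim by an estimated token budget instead. The sketch below assumes roughly 4 characters per token, which is a rule of thumb and not the model's real tokenizer; `estimate_tokens` and `trim_to_budget` are hypothetical helpers, not part of the class above.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Keep the most recent messages whose estimated total fits in max_tokens."""
    kept: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if kept and used + cost > max_tokens:
            break                       # older messages no longer fit
        kept.append(msg)                # the newest message is always kept
        used += cost
    kept.reverse()                      # restore chronological order
    return kept
```

You could run the message list through `trim_to_budget(messages, 3000)` just before sending it to Ollama, leaving some headroom for the reply.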

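The queue saves you. When both providers are down (happened twice in 3 months), users get an eventual response instead of an error. One caveat: the class above only appends to message_queue, so you still need a background task that drains it and retries.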
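Here's one way the retry side could look. It's a sketch, not what runs on my server: `drain_queue` is a hypothetical helper, and `send` is assumed to be an async callable that raises when every backend is down. Note that `chat()` doesn't behave that way (it re-queues and returns a placeholder), so you'd wrap the two `_call_*` methods rather than pass `chat` directly.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def drain_queue(
    queue: List[Dict],
    send: Callable[[str], Awaitable[str]],
    max_retries: int = 5,
) -> List[str]:
    """Try each queued message once; re-queue failures until max_retries."""
    delivered: List[str] = []
    pending = list(queue)   # snapshot of what is waiting right now
    queue.clear()           # failures get appended back below
    for item in pending:
        try:
            delivered.append(await send(item["message"]))
        except Exception:
            item["retries"] += 1
            if item["retries"] < max_retries:
                queue.append(item)  # keep it for the next pass
    return delivered


async def _demo() -> None:
    async def fake_send(text: str) -> str:  # stand-in for a real backend call
        if "fail" in text:
            raise RuntimeError("backend down")
        return f"echo: {text}"

    queue = [
        {"message": "hello", "timestamp": 0.0, "retries": 0},
        {"message": "fail me", "timestamp": 0.0, "retries": 0},
    ]
    print(await drain_queue(queue, fake_send))
    print(f"still queued: {len(queue)}")


asyncio.run(_demo())
```

In a long-running bot you'd call this from a loop with an `await asyncio.sleep(...)` between passes, and deliver each returned reply to whoever sent the original message.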

Cost comparison. NanoGPT at ~$0.001 per request vs OpenAI at ~$0.01 — for 200 messages/day that's $0.20/day vs $2/day. Over a year, $73 vs $730. The Ollama fallback is free.
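The arithmetic behind those figures, as a quick sketch. The per-request prices are the rough estimates from the paragraph above, not quoted rates, and `yearly_cost` is just a helper for this example.

```python
def yearly_cost(per_request_usd: float, requests_per_day: int, days: int = 365) -> float:
    """Total spend over `days` at a flat per-request price."""
    return per_request_usd * requests_per_day * days


nanogpt = yearly_cost(0.001, 200)
openai = yearly_cost(0.01, 200)
print(f"NanoGPT: ${nanogpt:.2f}/year, OpenAI: ${openai:.2f}/year")
# prints: NanoGPT: $73.00/year, OpenAI: $730.00/year
```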

If you want to explore NanoGPT's full capabilities, here's where to start. Their pricing model is genuinely the best I've found for API access.

Why Not Just Use Ollama for Everything?

I tried. Two problems: my server's GPU is shared with other workloads, so under load, Ollama gets slow. And the 8B model just isn't as good as NanoGPT's models for complex reasoning. The hybrid approach gives you the best of both — cheap primary with a guaranteed fallback.