
Kirill Strelnikov

Posted on • Originally published at kirweb.site

Build a Production RAG Chatbot with Django + pgvector + OpenAI (Full Guide)

I'm Kirill Strelnikov, a freelance AI/Django developer in Barcelona. I've built RAG chatbots that automated 70% of customer support for e-commerce clients. This is a practical guide to building a production-ready RAG chatbot with Django, pgvector, and OpenAI — not a toy demo, but the actual architecture I use in client projects.

What is RAG and Why It Matters

RAG (Retrieval-Augmented Generation) = vector search + LLM. Instead of hoping the LLM "knows" your business data, you:

  1. Store your documents as vector embeddings
  2. When a user asks a question, find the most relevant documents
  3. Feed those documents to the LLM as context
  4. The LLM generates an answer based on YOUR data

Why RAG beats fine-tuning for business chatbots:

  • No retraining when data changes (just re-embed)
  • Works with any LLM (swap GPT-4 for Claude without rebuilding)
  • Answers are grounded in real documents (reduces hallucination)
  • You can show sources ("Based on: Return Policy, Section 3")

Step 1: Set Up pgvector in Django

pgvector is a PostgreSQL extension for vector similarity search. No separate vector database needed — your embeddings live alongside your regular data.

# Install pgvector on PostgreSQL
# Ubuntu/Debian:
sudo apt install postgresql-16-pgvector

# Or via Docker:
# Use image: pgvector/pgvector:pg16
# Install the Python client (ships the pgvector.django integration)
pip install pgvector
# models.py
from django.db import models
from pgvector.django import VectorField

class Document(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    source = models.CharField(max_length=255)  # "faq", "product", "policy"
    embedding = VectorField(dimensions=1536)    # text-embedding-3-small
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            models.Index(fields=["source"]),
        ]

    def __str__(self):
        return self.title
# migration: enable pgvector extension
# NOTE: this must run BEFORE the migration that adds the VectorField
# column, or that migration fails with "type vector does not exist"
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = []  # make this the app's first migration
    operations = [
        migrations.RunSQL(
            "CREATE EXTENSION IF NOT EXISTS vector;",
            reverse_sql="DROP EXTENSION IF EXISTS vector;"
        ),
    ]

Step 2: Embed Your Documents

# embeddings.py
from openai import OpenAI

from .models import Document

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str) -> list[float]:
    """Get embedding vector for a text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count (a rough proxy for tokens)."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

def embed_document(title: str, content: str, source: str):
    """Chunk a document and store each chunk with its embedding."""
    chunks = chunk_text(content)
    documents = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        documents.append(
            Document(
                title=f"{title} (part {i+1})",
                content=chunk,
                source=source,
                embedding=embedding,
            )
        )
    Document.objects.bulk_create(documents)
    return len(documents)

Chunking matters. I use 300-token chunks with 50-token overlap. Too small = lost context. Too large = diluted relevance. This size works well for FAQ and product data. For longer documents (legal, technical docs), I increase to 500 tokens.
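The chunk/overlap mechanics are easy to sanity-check in isolation. Here is a standalone sketch of the word-based splitter from Step 2 (words stand in for tokens), plus per-source presets matching the sizes above — the 75-word overlap for longer documents is my assumption, tune it per project:

```python
# Standalone sketch of the word-based chunker; words approximate tokens.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Per-source presets: 300/50 for FAQ and product data, 500 for longer
# documents (the 75-word overlap for long documents is an assumption).
CHUNK_PRESETS = {
    "faq": (300, 50),
    "product": (300, 50),
    "policy": (500, 75),
}

def chunk_for_source(text: str, source: str) -> list[str]:
    size, overlap = CHUNK_PRESETS.get(source, (300, 50))
    return chunk_text(text, size, overlap)
```

With 300/50, each chunk starts 250 words after the previous one, so consecutive chunks share their boundary words — that overlap is what keeps a sentence split across a boundary retrievable from both sides.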

Step 3: Vector Search

# search.py
from pgvector.django import CosineDistance

from .embeddings import get_embedding
from .models import Document

def search_documents(query: str, top_k: int = 5, source: str | None = None):
    """Find the most relevant documents for a query."""
    query_embedding = get_embedding(query)

    qs = Document.objects.annotate(
        distance=CosineDistance("embedding", query_embedding)
    )

    if source:
        qs = qs.filter(source=source)

    return qs.order_by("distance")[:top_k]

pgvector's cosine distance search is fast enough for most business chatbots (sub-100ms for 100K documents). For larger datasets, add an IVFFlat or HNSW index:

# migration for HNSW index (faster search for large datasets)
migrations.RunSQL(
    "CREATE INDEX ON chatbot_document USING hnsw (embedding vector_cosine_ops);",
)
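If you go with IVFFlat instead (faster to build, approximate recall controlled by the `lists` parameter), the migration is analogous — the `lists = 100` value here is a placeholder to tune against your row count:

```python
# migration for an IVFFlat index (build it after the table has data;
# a common starting point for "lists" is rows / 1000)
migrations.RunSQL(
    "CREATE INDEX ON chatbot_document "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);",
)
```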

Step 4: Generate Answers

# chatbot.py
from openai import OpenAI

from .search import search_documents

client = OpenAI()

SYSTEM_PROMPT = """You are a helpful customer support assistant.
Answer questions using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that. Let me connect you with a human agent."
Always be concise and helpful. Cite the source document when relevant."""

def get_chatbot_response(user_message: str, conversation_history: list | None = None):
    """Generate a chatbot response using RAG."""
    # 1. Retrieve relevant documents
    relevant_docs = search_documents(user_message, top_k=5)

    # 2. Build context string
    context_parts = []
    sources = []
    for doc in relevant_docs:
        context_parts.append(f"[{doc.title}]: {doc.content}")
        sources.append(doc.title)

    context = "\n\n".join(context_parts)

    # 3. Build messages
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT + f"\n\nContext:\n{context}"}
    ]

    # Add conversation history for multi-turn
    if conversation_history:
        messages.extend(conversation_history[-6:])  # Last 3 exchanges

    messages.append({"role": "user", "content": user_message})

    # 4. Generate response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0.3,  # Low temperature = more factual
        max_tokens=500,
    )

    answer = response.choices[0].message.content

    # 5. Confidence check
    min_distance = relevant_docs[0].distance if relevant_docs else 1.0
    needs_escalation = min_distance > 0.4  # Threshold tuned per project

    return {
        "answer": answer,
        "sources": sources,
        "needs_escalation": needs_escalation,
        "confidence": round(1 - min_distance, 2),
    }

Step 5: Django REST API

# views.py
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .chatbot import get_chatbot_response
# get_session_history, save_to_history and notify_agent are small
# project-specific helpers (session storage + agent notification)

@api_view(["POST"])
def chat(request):
    message = request.data.get("message", "").strip()
    session_id = request.data.get("session_id")

    if not message:
        return Response({"error": "Message required"}, status=400)

    # Get conversation history from session
    history = get_session_history(session_id)

    # Generate response
    result = get_chatbot_response(message, history)

    # Save to history
    save_to_history(session_id, message, result["answer"])

    # If low confidence, notify human agent
    if result["needs_escalation"]:
        notify_agent(session_id, message, result)

    return Response({
        "answer": result["answer"],
        "sources": result["sources"],
        "confidence": result["confidence"],
    })

Step 6: Keep Embeddings Fresh

# tasks.py (Celery)
from celery import shared_task

from .embeddings import embed_document
from .models import Document

@shared_task
def refresh_product_embeddings():
    """Re-embed products that changed since last sync."""
    from shop.models import Product

    # last_sync_time() is a project-specific helper returning the
    # timestamp of the previous successful sync
    for product in Product.objects.filter(updated_at__gte=last_sync_time()):
        content = f"{product.name}. {product.description}. Price: {product.price} EUR."
        # Delete old embeddings
        Document.objects.filter(
            source="product",
            title__startswith=product.name
        ).delete()
        # Create new ones
        embed_document(product.name, content, source="product")

Schedule this with Celery Beat to run hourly or on product updates.
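A Celery Beat entry for the hourly variant might look like this — the task path and interval are assumptions, and `app` is the Celery instance from your project's celery.py:

```python
# celery.py — Beat schedule for the refresh task
from celery.schedules import crontab

app.conf.beat_schedule = {
    "refresh-product-embeddings": {
        "task": "chatbot.tasks.refresh_product_embeddings",
        "schedule": crontab(minute=0),  # top of every hour
    },
}
```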

Production Results

From my e-commerce client project:

| Metric | Value |
| --- | --- |
| Documents embedded | ~2,000 chunks |
| Avg search latency | 45 ms |
| Answer accuracy | ~92% (human-evaluated) |
| Support automation rate | 70% |
| Conversion rate increase | +35% |
| Monthly API cost | EUR 50-80 |

The 70% automation rate means 7 out of 10 customer questions are answered correctly without human intervention. The remaining 30% get escalated with full context, so the human agent can resolve them faster too.

Common Pitfalls

  1. Embedding stale data. If your product catalog changes, your chatbot answers are wrong. Automate re-embedding.

  2. No confidence threshold. Without escalation logic, the chatbot will confidently hallucinate. Always add a distance threshold.

  3. Ignoring conversation context. A user asking "what about the blue one?" after asking about a dress needs multi-turn context. Pass conversation history.

  4. Using a separate vector DB for small datasets. pgvector handles 100K+ documents easily. You don't need Pinecone or Weaviate until you hit millions.

Cost to Build

| Scope | Timeline | Cost |
| --- | --- | --- |
| Basic RAG chatbot (FAQ only) | 1-2 weeks | EUR 800-1,500 |
| RAG + product catalog + CRM | 2-4 weeks | EUR 1,500-3,000 |
| Multi-channel (web + Telegram + WhatsApp) | 4-6 weeks | EUR 3,000-5,000 |

Detailed pricing: AI Chatbot Development Cost


I'm Kirill Strelnikov — I build production RAG chatbots, SaaS platforms, and Telegram bots as a freelance developer in Barcelona, Spain. 15+ projects delivered, EU-based, GDPR-compliant.
