Madhu Dadi

Posted on May 25 • Originally published at madhudadi.in on May 11

Building a RAG Chat System: From Zero to Production (part of the series "Building This Blog: A Production AI Platform")

#technology #softwareengineering #advanced #rag

Building a RAG Chat System From Zero

The "Ask AI" page on this blog is not a generic chatbot. It's a Retrieval-Augmented Generation system that answers questions using only the content from this site's posts, and it shows you exactly which post each part of the answer came from.

Here's how it works, from embedding to response.

Why RAG Instead of Fine-Tuning

Fine-tuning a model on blog content would:

Require retraining every time a new post is published
Risk hallucinating facts not present in the training data
Give no way to cite sources in the response

RAG solves all three: query the content at runtime, inject the relevant chunks into the prompt, and return the source citations alongside the answer. No retraining needed.

Architecture

User Question
    │
    ▼
[Embedding Model] ──→ Question Vector
    │
    ▼
[pgvector HNSW Index] ──→ Top-K Similar Chunks (by vector distance)
    │
    ▼
[tsvector Full-Text Search] ──→ Top-K Chunks (by keyword relevance)
    │
    ▼
[Hybrid Scorer] ──→ Weighted + Reranked Results
    │
    ▼
[Context Assembly] ──→ Prompt with top chunks + question
    │
    ▼
[LLM] ──→ Generated Answer + Source Citations
    │
    ▼
[Source Verification] ──→ Verify citations match chunks
    │
    ▼
[Streaming Response] ──→ SSE to frontend

Step 1: The Embedding Pipeline

Every published post is split into chunks and embedded. The chunks are stored in the rag_chunks table:

CREATE TABLE rag_chunks (
    id UUID PRIMARY KEY,
    post_id UUID REFERENCES posts(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

The embedding dimension (1536) comes from the model: text-embedding-3-small from OpenAI. The choice was pragmatic: it is among the cheapest per token of the high-quality embedding models, and its 1536-dimensional vectors stay under the 2,000-dimension limit that pgvector's HNSW index supports for the vector type.
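For a rough sense of what that dimension costs in storage: pgvector documents the size of a vector as 4 bytes per dimension plus 8 bytes of overhead. The helper below is only a back-of-the-envelope sketch. The function name is made up for illustration, the row count is an example figure, and the estimate leaves out the index, the text column, and per-row overhead.

```python
def vector_storage_bytes(dimensions: int, rows: int) -> int:
    """Approximate raw storage for `rows` pgvector values.

    Uses pgvector's documented size of 4 * dimensions + 8 bytes per vector.
    Index size, TOAST, and tuple headers are NOT included.
    """
    per_vector = 4 * dimensions + 8
    return per_vector * rows


# 1536-dimensional embeddings, with 50,000 rows as an example count
total = vector_storage_bytes(1536, 50_000)
print(f"{total / 1_000_000:.1f} MB")  # 307.6 MB
```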

Chunking Strategy

Posts are split on paragraph boundaries, not fixed token counts:

def chunk_post(content: str, max_tokens: int = 500) -> list[str]:
    """Group paragraphs into chunks of roughly max_tokens each."""
    paragraphs = content.split("\n\n")
    chunks = []
    current = []

    for p in paragraphs:
        # Word count is used as a cheap stand-in for the real token count.
        estimated_tokens = len(p.split())
        current_token_count = sum(len(c.split()) for c in current)

        # Close the current chunk before it would exceed the budget.
        if current_token_count + estimated_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current = [p]
        else:
            current.append(p)

    # Flush whatever is left as the final chunk.
    if current:
        chunks.append("\n\n".join(current))

    return chunks

Why paragraph boundaries? Code blocks, lists, and blockquotes are semantic units. Splitting mid-paragraph would separate a code example from its explanation, making the chunk useless for both retrieval and generation.
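One caveat with splitting on blank lines: a code block that contains a blank line would itself be cut in two. The sketch below shows one way to guard against that, assuming posts are stored as Markdown with triple-backtick fences (an assumption, the storage format is not stated here). split_paragraphs is a hypothetical helper, not the function this site uses.

```python
FENCE = "`" * 3  # Markdown code fence marker (assumed post format)


def split_paragraphs(content: str) -> list[str]:
    """Split on blank lines, but keep fenced code blocks in one piece."""
    paragraphs: list[str] = []
    buffer: list[str] = []
    in_fence = False

    for line in content.split("\n"):
        # A fence line toggles between "inside code" and "outside code".
        if line.strip().startswith(FENCE):
            in_fence = not in_fence

        if line.strip() == "" and not in_fence:
            # Blank line outside a fence: the current paragraph ends here.
            if buffer:
                paragraphs.append("\n".join(buffer))
                buffer = []
        else:
            buffer.append(line)

    if buffer:
        paragraphs.append("\n".join(buffer))
    return paragraphs


post = "Intro text.\n\n" + FENCE + "python\nx = 1\n\ny = 2\n" + FENCE + "\n\nOutro."
parts = split_paragraphs(post)
print(len(parts))  # 3
```

The output of a helper like this could replace the plain content.split("\n\n") call, with the rest of the chunking loop unchanged.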

Each chunk stores its chunk_index so the frontend can link back to the correct section of the post. Metadata includes the post slug, title, section heading, and URL.
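As a rough illustration of what one stored row might look like, here is a sketch that pairs each chunk with its index and metadata. The keys of the post dictionary and the embed_fn callable are hypothetical stand-ins. The section heading is left out because how it is derived is not shown here.

```python
import uuid
from typing import Callable


def build_chunk_rows(
    post: dict,
    chunks: list[str],
    embed_fn: Callable[[str], list[float]],
) -> list[dict]:
    """Turn a post's chunks into dicts shaped like rag_chunks rows."""
    rows = []
    for index, chunk_text in enumerate(chunks):
        rows.append({
            "id": uuid.uuid4(),
            "post_id": post["id"],
            "chunk_index": index,  # position of the chunk within the post
            "content": chunk_text,
            "embedding": embed_fn(chunk_text),
            "metadata": {
                "slug": post["slug"],
                "title": post["title"],
                "url": post["url"],
            },
        })
    return rows


def fake_embed(chunk_text: str) -> list[float]:
    """Placeholder for a real embedding call; returns a tiny dummy vector."""
    return [0.0, 0.0, 0.0]


example_post = {"id": "post-1", "slug": "demo", "title": "Demo", "url": "https://example.com/demo"}
rows = build_chunk_rows(example_post, ["first chunk", "second chunk"], fake_embed)
print([row["chunk_index"] for row in rows])  # [0, 1]
```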

Step 2: The HNSW Index

pgvector supports two index types for approximate nearest neighbor search: IVFFlat and HNSW. I chose HNSW for three reasons:

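Incremental updates: HNSW adds new vectors to the graph as rows are inserted. IVFFlat computes its cluster centers once, at index build time, so recall can drift as new posts arrive until the index is rebuilt. The trade-off is that HNSW takes longer to build and uses more memory than IVFFlat.
Better recall at the same speed: HNSW consistently achieves 99% recall at 10 ms query time with my dataset size (~50K chunks).
No training step: IVFFlat needs a clustering step that depends on representative data already being in the table. HNSW has no such step, so the index can be created on an empty table. It still has tuning parameters, covered below.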
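Recall here means the share of the true nearest neighbors that the approximate index also returns. A minimal sketch of how such a figure can be computed, assuming you have the IDs from an exact search (one that bypasses the index) and the IDs from the indexed search for the same query. The function name and the sample IDs are illustrative only.

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


exact = ["c1", "c2", "c3", "c4", "c5"]   # e.g. from a sequential scan
approx = ["c1", "c3", "c2", "c9", "c5"]  # e.g. from the HNSW index
print(recall_at_k(approx, exact, k=5))   # 0.8
```

Averaging this over a sample of real questions gives a recall figure that can be tracked as the index parameters change.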

CREATE INDEX idx_rag_chunks_embedding ON rag_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

The parameters:

m = 16: the maximum number of connections per node in each layer of the graph. Higher = better recall, slower build, more memory. 16 is pgvector's default and the sweet spot for datasets under 100K vectors.
ef_construction = 200: the size of the dynamic candidate list during construction. Higher = better index quality, slower build. 200 is conservative; pgvector's default is 64.

At query time, the search uses SET hnsw.ef_search = 40, which is also pgvector's default. This controls the size of the candidate list during search. Higher = better recall, slower query.
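The index is built with vector_cosine_ops, so it serves queries ordered by pgvector's <=> operator, which is cosine distance. For intuition, here is a plain-Python sketch of that quantity. It is for illustration only; in the real system PostgreSQL computes it.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # approximately 0.0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2.0
```

Distance runs from 0 (same direction) to 2 (opposite), so 1 minus the distance is the similarity score, and smaller distances sort first.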

Step 3: Hybrid Search

Vector search alone can miss exact keyword matches. "How do I install FastAPI?" lands close to the vector of "FastAPI installation guide", but a chunk that contains the literal phrase is not guaranteed to rank near the top. Full-text search via tsvector catches what vector search misses.

The hybrid query combines both:

from sqlalchemy import text

# `embed` and `db` (assumed to be an async SQLAlchemy session) live elsewhere in the app.
async def hybrid_search(query: str, limit: int = 10):
    query_embedding = await embed(query)

    # Semantic candidates: <=> is cosine distance, so 1 - distance is similarity.
    vector_results = await db.execute(
        text("""
            SELECT id, content, post_id, chunk_index,
                   1 - (embedding <=> :query_emb) AS vector_score
            FROM rag_chunks
            ORDER BY embedding <=> :query_emb
            LIMIT :limit
        """),
        {"query_emb": query_embedding, "limit": limit * 2}
    )

    # Keyword candidates: only rows whose text matches the parsed query.
    # to_tsvector() runs per row here; a stored tsvector column with a GIN index avoids that.
    fts_results = await db.execute(
        text("""
            SELECT id, content, post_id, chunk_index,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', :query)) AS fts_score
            FROM rag_chunks
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
            ORDER BY fts_score DESC
            LIMIT :limit
        """),
        {"query": query, "limit": limit * 2}
    )

    # Each side fetches 2x the requested rows; the two rankings are then fused.
    return hybrid_rank(vector_results, fts_results, alpha=0.7)

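The alpha parameter controls the weight between the vector ranking and the keyword ranking. 0.7 means 70% vector, 30% keyword: biased toward semantic understanding while still rewarding exact matches. The weights apply to each result's rank position; the vector_score and fts_score columns are not used in the fusion itself.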

Step 4: Hybrid Ranking

Results from both searches are combined using a weighted variant of Reciprocal Rank Fusion (RRF):

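def hybrid_rank(vector_results, fts_results, alpha=0.7, k=60):
    """Fuse two ranked result lists with weighted Reciprocal Rank Fusion."""
    scores = {}

    # rank is 0-based, so the top row contributes alpha / (k + 1).
    for rank, row in enumerate(vector_results):
        scores[row.id] = scores.get(row.id, 0) + alpha * (1 / (k + rank + 1))

    # Rows found by both searches accumulate both contributions.
    for rank, row in enumerate(fts_results):
        scores[row.id] = scores.get(row.id, 0) + (1 - alpha) * (1 / (k + rank + 1))

    # Highest fused score first; only the chunk IDs of the top 10 are returned.
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [chunk_id for chunk_id, _ in ranked[:10]]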

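RRF is simple, fast, and doesn't require training a learned ranker. The constant k=60 flattens the gap between adjacent ranks so the top few positions of a single list do not dominate; 60 is the value used in the original RRF paper.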
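To see how the weights play out, here is the same arithmetic worked for a few cases. This is a sketch that mirrors the formula in hybrid_rank with alpha = 0.7 and k = 60. rrf_term is a helper name introduced only for this example, and the 20-row list length assumes the default limit of 10 doubled by the search function.

```python
def rrf_term(rank: int, weight: float, k: int = 60) -> float:
    """One list's contribution for an item at 0-based `rank`."""
    return weight * (1 / (k + rank + 1))


ALPHA = 0.7

top_vector_only = rrf_term(0, ALPHA)                         # 1st in vector list only
top_fts_only = rrf_term(0, 1 - ALPHA)                        # 1st in full-text list only
third_in_both = rrf_term(2, ALPHA) + rrf_term(2, 1 - ALPHA)  # 3rd in both lists
last_vector_only = rrf_term(19, ALPHA)                       # 20th row of a 20-row list

print(round(top_vector_only, 5))   # 0.01148
print(round(top_fts_only, 5))      # 0.00492
print(round(third_in_both, 5))     # 0.01587
print(round(last_vector_only, 5))  # 0.00875
```

Two things stand out. A chunk ranked third in both lists beats the top vector-only hit. With 20 candidates per list, even the last vector-only hit outscores the best full-text-only hit, so under these settings keyword matches mostly act as a boost for chunks that vector search also found. A lower alpha or a longer candidate list would change that balance.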

Step 5: Context Assembly

The top 5-10 chunks are assembled into a prompt. The system prompt is:

You are a technical assistant for Madhu Dadi — AI, Python & Analytics Hub.
Answer the user's question based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Always cite the source post title and section for each claim.
Format citations as [Source: Post Title → Section].

The user prompt includes the question and the chunk content:

Context:
[1] Post: "Understanding Python Classes" → Section: "Class Methods"
Content: Class methods are functions defined inside a class...

[2] Post: "FastAPI Routes" → Section: "Path Parameters"
Content: Path parameters are declared using Python type hints...

Question: How do I define a class method in Python?
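The layout above can be produced with a few lines of string assembly. A minimal sketch, assuming each chunk arrives as a dict with title, section, and content keys. Those key names and the function name are hypothetical, not taken from this site's code.

```python
def build_user_prompt(question: str, chunks: list[dict]) -> str:
    """Render numbered context chunks followed by the user's question."""
    lines = ["Context:"]
    for number, chunk in enumerate(chunks, start=1):
        title = chunk["title"]
        section = chunk["section"]
        lines.append(f'[{number}] Post: "{title}" → Section: "{section}"')
        lines.append(f"Content: {chunk['content']}")
        lines.append("")  # blank line between chunks
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_user_prompt(
    "How do I define a class method in Python?",
    [
        {
            "title": "Understanding Python Classes",
            "section": "Class Methods",
            "content": "Class methods are functions defined inside a class...",
        }
    ],
)
print(prompt)
```

Numbering the chunks gives the model a stable handle for each source, and the same title and section labels can later be reused when checking citations.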

Step 6: Source Verification

After the LLM generates a response, a verification step checks that each cited source actually exists in the provided chunks:

import re

def verify_citations(response: str, chunks: list[dict]) -> dict:
    """Split the citations in an LLM response into verified and unverified."""
    # Pull out every "[Source: ...]" marker the model produced.
    citations_found = re.findall(r'\[Source: (.+?)\]', response)
    valid_citations = []
    missing_citations = []

    for citation in citations_found:
        # Verified means the citation text appears in some chunk's source label.
        matched = any(citation in chunk["source"] for chunk in chunks)
        if matched:
            valid_citations.append(citation)
        else:
            missing_citations.append(citation)

    return {
        "verified_response": response,
        "citations": valid_citations,
        "unverified_claims": missing_citations
    }

Unverified citations are flagged but not removed from the response; they're marked with a warning icon in the frontend. This happens rarely (less than 2% of queries), usually when the LLM rephrases a source name.
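Since most misses come from rephrased source names, one option is to normalize both sides before comparing. The sketch below is an idea, not the check this site runs. normalize_source and citation_matches are hypothetical helpers, and the normalization rules (lowercasing, dropping punctuation and arrows) are assumptions about what "rephrased" usually looks like.

```python
import re


def normalize_source(name: str) -> str:
    """Lowercase, turn punctuation and arrows into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def citation_matches(citation: str, sources: list[str]) -> bool:
    """True if the normalized citation appears inside any normalized source."""
    wanted = normalize_source(citation)
    if not wanted:
        return False  # an empty citation would otherwise match everything
    return any(wanted in normalize_source(source) for source in sources)


sources = ["FastAPI Routes → Path Parameters"]
print(citation_matches("fastapi routes -> path parameters", sources))  # True
print(citation_matches("Django Views → Routing", sources))             # False
```

Looser matching trades some false negatives for a risk of false positives, so it is worth keeping the strict check as the default and logging which citations only pass after normalization.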

Step 7: The Database Model

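import uuid
from datetime import datetime
from typing import Optional
from pgvector.sqlalchemy import Vector
from sqlalchemy import DateTime, ForeignKey, Text, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column

class RagChunk(Base):  # Base is the project's declarative base class
    __tablename__ = "rag_chunks"

    id: Mapped[uuid.UUID] = mapped_column(UUID, primary_key=True, default=uuid.uuid4)
    post_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("posts.id", ondelete="CASCADE"))
    chunk_index: Mapped[int]
    content: Mapped[str] = mapped_column(Text)
    embedding: Mapped[Optional[Vector]] = mapped_column(Vector(1536))
    # "metadata" is a reserved attribute name in declarative models, so map the column explicitly.
    chunk_metadata: Mapped[Optional[dict]] = mapped_column("metadata", JSONB)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())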

The Vector type comes from pgvector.sqlalchemy. It maps directly to PostgreSQL's vector extension type.

Cold Start: First User Experience

When a user visits the Ask AI page for the first time, they face an empty chat box and a cold embedding cache. The solution: a pre-computed set of seed questions and answers for each published post, generated during the embedding pipeline.

SEED_QUESTIONS = {
    "why-i-built-yet-another-blog-but-not-really": [
        "Why did you build your own blog platform?",
        "What features does this blog have that others don't?"
    ],
    "the-monorepo-that-runs-29-services": [
        "How is the monorepo structured?",
        "What are the 29 API routers?"
    ]
}

These seed questions are embedded and stored alongside the post chunks. On the first page load, the frontend fetches 3-5 seed questions as suggestions. When the user clicks one, it triggers a RAG query, which populates the embedding cache. Subsequent queries hit the cache.
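Picking the suggestions can be as simple as sampling from the flattened seed list. A minimal sketch, assuming a mapping shaped like SEED_QUESTIONS. The function name, the default count, and the optional rng argument are illustrative choices, not this site's code.

```python
import random
from typing import Optional


def pick_seed_suggestions(
    seed_questions: dict[str, list[str]],
    count: int = 4,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Return up to `count` distinct seed questions drawn across all posts."""
    rng = rng or random.Random()
    pool = [q for questions in seed_questions.values() for q in questions]
    return rng.sample(pool, k=min(count, len(pool)))


seeds = {
    "post-a": ["Question A1?", "Question A2?"],
    "post-b": ["Question B1?", "Question B2?"],
}
suggestions = pick_seed_suggestions(seeds, count=3, rng=random.Random(0))
print(len(suggestions))  # 3
```

Passing a seeded random.Random keeps the selection reproducible in tests, while the default gives each visitor a different mix.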

What's Next

In the next post, I'll cover the production RAG pipeline: streaming responses via SSE, progressive rendering, citation badges, fallback strategies, rate limiting, and the cold-start UX flow in detail.

Built with FastAPI, Next.js 16, PostgreSQL, Redis, and zero third-party CMS. Deployed on a $12/month VPS.

