Mobasshir Khan

Posted on Jun 15

I Rebuilt My RAG Pipeline From Scratch. Here's What Actually Made It Better.

#machinelearning #ai #programming #datascience

Spoiler: it wasn't a bigger model, a better embedding, or a longer prompt.

When I first built my RAG pipeline, it was about as simple as it gets.

Take a topic, fetch some relevant text, feed it to the model, generate an answer. Classic RAG. The kind of thing every tutorial walks you through in twenty minutes.

And honestly? It worked. For a while.

But I wasn't building a generic Q&A bot. I was building a debate learning system, and that's where the cracks started showing fast.

The output felt generic. The sources didn't match what each part of the lesson actually needed. And the system had no idea that "background knowledge," "debate framing," "rebuttal material," and "vocabulary support" are not the same thing.

It was retrieving text. It just wasn't understanding what kind of help each piece of the pipeline needed.

So I tore it apart and rebuilt it.

What came out the other side wasn't just "better RAG." It was a layered retrieval architecture that plans its own queries, routes them by intent, ranks and packs context intelligently, remembers what worked before, and evaluates itself against real lessons.

Here's how I got there, and what I'd tell anyone trying to do the same.

The Problem With "Embed → Retrieve → Pray"

Most basic RAG systems follow the same formula:

embed the query → retrieve top chunks → pass them to the model → hope the answer improves

For simple document Q&A, this is fine. It's even good.
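For reference, that baseline fits in a few lines. This is only a toy sketch: the bag-of-words `embed` function stands in for a real embedding model, and `naive_retrieve` is a hypothetical name, not something from the system described here.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # One query, one flat search over every chunk, top-k by similarity.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "feminism is a movement for gender equality",
    "trade tariffs raise import prices",
    "waves of feminism differ in focus",
]
top = naive_retrieve("feminism", chunks, k=2)
```

Notice that nothing in this code knows who is asking or why. Every caller gets the same kind of search.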

But for debate learning, it falls apart, because different parts of the output need fundamentally different kinds of evidence:

Pre-knowledge needs definitions and background
Debate building needs mechanisms, clashes, and framing
Vocabulary needs words that actually fit the topic
Language support needs words that reinforce the lesson itself
Coaching needs structured, distilled explanations

A single chunk-search pipeline has no concept of any of this. It doesn't know the difference between a strong debate example and a generic background article. The result is noisy retrieval, repetitive patterns, and weaker final output, no matter how good your embeddings are.

The Real Shift Wasn't Technical. It Was Mental.

Here's the thing that changed everything for me, and it had nothing to do with code at first.

I stopped thinking of retrieval as "find text" and started thinking of it as "make decisions about evidence."

That single reframe pushed me toward a completely different architecture:

topic → plan → route → preselect → retrieve → rerank → pack → teach → evaluate

Suddenly retrieval wasn't a single step anymore. It was a pipeline of decisions, each one with a clear job.

Here's the high-level flow I ended up with:

flowchart LR
    A["Topic / Node Intent"] --> B["Query Planner"]
    B --> C["Intent Router"]
    C --> D["Document Preselection"]
    D --> E["Chunk Retrieval"]
    E --> F["Reranking"]
    F --> G["Context Packing"]
    G --> H["Structured Evidence Lanes"]
    H --> I["Downstream Lesson Nodes"]
    I --> J["Trace Logging / Memory"]
    J --> K["Real-Trace Evaluation"]
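One way to read that flow in code is as a list of small stage functions that each take the current state and return an updated one. The sketch below is an assumption about shape only: the stage bodies are stubs, and `run_pipeline`, `plan`, `route`, and `retrieve` are hypothetical names.

```python
from typing import Callable

State = dict
Stage = Callable[[State], State]


def plan(state: State) -> State:
    # Stub planner: turn the topic into a single subquery.
    return {**state, "plan": {"subqueries": [f"what is {state['topic']}"]}}


def route(state: State) -> State:
    # Stub router: decide which kind of evidence this node needs.
    return {**state, "evidence_types": ["definitions"]}


def retrieve(state: State) -> State:
    # Stub retriever: pretend each subquery returns one chunk.
    chunks = [f"(chunk for: {q})" for q in state["plan"]["subqueries"]]
    return {**state, "chunks": chunks}


def run_pipeline(state: State, stages: list[Stage]) -> State:
    # Run stages in order and record which ones ran, for later inspection.
    for stage in stages:
        state = stage(state)
        state.setdefault("trace", []).append(stage.__name__)
    return state


result = run_pipeline(
    {"node": "pre_knowledge", "topic": "feminism"},
    [plan, route, retrieve],
)
```

Because each stage only reads and writes named keys, a single stage can be swapped or tuned without touching the others.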

Let's break down what each of these pieces actually does, and why it mattered.

1. Query Planning, Per Node

Not every node in the pipeline needs the same kind of evidence. A pre-knowledge node and an argument-generation node should never be searching the same way.

So I added a query planner that expands the raw topic into something far more useful for retrieval. Instead of handing the system a bare topic like "feminism" or "international relations," the planner turns it into structured search intent: expanded terms, subqueries, and source preferences tailored to the node that's asking.

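def build_query_plan(node_name: str, topic: str) -> dict:
    """Expand a raw topic into structured search intent for one node."""
    # Each [...] is a placeholder for values the planner fills in per node.
    return {
        "node": node_name,            # which lesson node is asking
        "topic": topic,               # the raw topic as received
        "expanded_terms": [...],      # related terms to search alongside the topic
        "subqueries": [...],          # narrower searches derived from the topic
        "source_preferences": [...],  # source types this node should favor
    }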

This alone improved recall noticeably. The system finally started asking the right question before it even searched.
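To make the placeholders concrete, here is one possible filling-in. Everything in `NODE_PROFILES` (the node names, templates, extra terms, and source labels) is invented for illustration. A real planner might use a model call or a curated taxonomy instead of fixed templates.

```python
# Hypothetical per-node profiles; all values are illustrative assumptions.
NODE_PROFILES = {
    "pre_knowledge": {
        "templates": ["what is {topic}", "{topic} definition", "history of {topic}"],
        "extra_terms": ["definition", "background", "overview"],
        "source_preferences": ["encyclopedia", "explainer"],
    },
    "argument_generation": {
        "templates": ["arguments for {topic}", "arguments against {topic}"],
        "extra_terms": ["mechanism", "impact", "rebuttal"],
        "source_preferences": ["debate_transcript", "op_ed"],
    },
}


def build_query_plan(node_name: str, topic: str) -> dict:
    # Look up how this node likes to search, then specialize it to the topic.
    profile = NODE_PROFILES[node_name]
    return {
        "node": node_name,
        "topic": topic,
        "expanded_terms": [topic] + profile["extra_terms"],
        "subqueries": [t.format(topic=topic) for t in profile["templates"]],
        "source_preferences": list(profile["source_preferences"]),
    }


background_plan = build_query_plan("pre_knowledge", "feminism")
argument_plan = build_query_plan("argument_generation", "feminism")
```

The same topic now produces two different plans, which is the whole point: the node, not just the topic, shapes the search.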

2. Intent-Aware Routing

Once the query is planned, the router decides what type of evidence the node actually needs: definitions, mechanisms, examples, clash material, style cues, or vocabulary support.

This sounds small, but it fixes one of the most common mistakes in basic RAG: treating every retrieval need as identical. They're not, and pretending otherwise is where a lot of quality gets lost.

Routing also made the whole system more inspectable. Every retrieval path now has a clear, debuggable purpose, which made every later improvement easier to reason about.
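A router can be as simple as an explicit table from node to evidence types. The sketch below assumes that shape. The node names and the specific pairings in `ROUTES` are hypothetical, chosen only to mirror the evidence types listed above.

```python
from enum import Enum


class EvidenceType(Enum):
    DEFINITIONS = "definitions"
    MECHANISMS = "mechanisms"
    EXAMPLES = "examples"
    CLASH = "clash"
    STYLE = "style"
    VOCABULARY = "vocabulary"


# Hypothetical routing table: which evidence types each node asks for.
ROUTES = {
    "pre_knowledge": [EvidenceType.DEFINITIONS, EvidenceType.EXAMPLES],
    "debate_building": [EvidenceType.MECHANISMS, EvidenceType.CLASH, EvidenceType.EXAMPLES],
    "vocabulary": [EvidenceType.VOCABULARY],
    "coaching": [EvidenceType.STYLE, EvidenceType.MECHANISMS],
}


def route(node_name: str) -> list[EvidenceType]:
    # Fail loudly on unknown nodes so a missing route is caught early.
    try:
        return ROUTES[node_name]
    except KeyError:
        raise ValueError(f"no route defined for node {node_name!r}") from None


wanted = route("debate_building")
```

An explicit table is easy to read in a debugger, which is where the inspectability comes from.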

3. Hierarchical Retrieval (Stop Searching Everything, Every Time)

This was one of the biggest quality jumps in the whole rebuild.

Instead of searching every chunk across the entire corpus for every query, the system now first identifies the most likely documents, and only then searches inside them.

before:  topic → search all chunks
after:   topic → choose good documents → search chunks inside them

The difference is bigger than it sounds. Retrieval no longer wastes effort scanning irrelevant corners of the corpus, and chunk search starts from a much stronger position every single time.
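The two-stage idea can be sketched with a toy scorer. Here `overlap` counts shared words and stands in for whatever similarity measure a real system uses. The document shape (`id`, `summary`, `chunks`) and the function names are assumptions.

```python
def overlap(query: str, text: str) -> int:
    # Toy relevance score: how many distinct query words appear in the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(query: str, docs: list[dict],
                          top_docs: int = 2, top_chunks: int = 3) -> list[tuple[str, str]]:
    # Stage 1: shortlist documents using their summaries only.
    shortlisted = sorted(
        docs, key=lambda d: overlap(query, d["summary"]), reverse=True
    )[:top_docs]

    # Stage 2: score chunks, but only inside the shortlisted documents.
    candidates = [
        (overlap(query, chunk), doc["id"], chunk)
        for doc in shortlisted
        for chunk in doc["chunks"]
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, chunk) for score, doc_id, chunk in candidates[:top_chunks] if score > 0]


docs = [
    {"id": "fem", "summary": "overview of feminism and gender equality",
     "chunks": ["feminism seeks gender equality", "first wave focused on suffrage"]},
    {"id": "trade", "summary": "tariffs and trade policy",
     "chunks": ["equality of tariffs across goods"]},
]
hits = hierarchical_retrieve("gender equality", docs, top_docs=1)
```

The chunk in the trade document mentions "equality" too, but it never gets scored because its document was not shortlisted.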

4. Reranking and Context Packing

Relevant isn't the same as useful.

I added a reranking stage that reorders retrieved evidence based on usefulness, source class, and role, then a context packer that arranges the final selection in a way the model can actually reason with.

This is easy to underestimate, but a good retrieval system isn't just about finding the right information. It's about presenting it in a shape the model can use well.
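A rough sketch of both steps follows. The scoring formula, the `SOURCE_WEIGHT` values, and the word budget are all assumptions picked for illustration. The source only says that reranking considers usefulness, source class, and role.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    relevance: float   # score from chunk retrieval, assumed to be in 0..1
    source_class: str  # e.g. "curated", "transcript", "web"
    role: str          # e.g. "definition", "mechanism"


# Hypothetical trust weights per source class.
SOURCE_WEIGHT = {"curated": 1.0, "transcript": 0.8, "web": 0.6}


def rerank(items: list[Evidence], wanted_role: str) -> list[Evidence]:
    def usefulness(e: Evidence) -> float:
        # Evidence in the wrong role is kept but discounted.
        role_factor = 1.0 if e.role == wanted_role else 0.5
        return e.relevance * SOURCE_WEIGHT.get(e.source_class, 0.5) * role_factor
    return sorted(items, key=usefulness, reverse=True)


def pack(ranked: list[Evidence], max_words: int) -> str:
    # Greedily add evidence in rank order until the word budget is used up.
    lines, used = [], 0
    for e in ranked:
        n = len(e.text.split())
        if used + n > max_words:
            continue
        lines.append(f"[{e.role} | {e.source_class}] {e.text}")
        used += n
    return "\n".join(lines)


items = [
    Evidence("generic background article about the topic", 0.9, "web", "definition"),
    Evidence("tariffs raise prices which reduces demand", 0.7, "curated", "mechanism"),
]
ranked = rerank(items, wanted_role="mechanism")
context = pack(ranked, max_words=6)
```

The web article has the higher raw relevance, yet the curated mechanism wins once role and source class are factored in. That is the "relevant isn't the same as useful" distinction in miniature.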

5. Structured Evidence Lanes

Instead of dumping every retrieved chunk into one giant context block, the system now separates evidence into distinct lanes:

Definitions
Mechanisms
Examples
Debate framing
Vocabulary
Style and coaching notes

raw evidence → structured evidence lanes → better lesson output

Each downstream node reads only the lane it actually needs, instead of wading through everything else. The result is output that feels noticeably more coherent and "crafted," rather than a wall of loosely related text.
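In code, lanes can be a plain dictionary keyed by role, with each node declaring which keys it reads. The lane identifiers and the `NODE_LANES` mapping below are hypothetical and only mirror the list above.

```python
LANES = ("definitions", "mechanisms", "examples",
         "debate_framing", "vocabulary", "style_coaching")

# Hypothetical mapping of which lanes each downstream node reads.
NODE_LANES = {
    "pre_knowledge": ("definitions",),
    "debate_building": ("mechanisms", "examples", "debate_framing"),
    "vocabulary": ("vocabulary",),
    "coaching": ("style_coaching",),
}


def build_lanes(evidence: list[dict]) -> dict[str, list[str]]:
    # Sort each piece of evidence into its lane; unknown lanes are an error.
    lanes: dict[str, list[str]] = {lane: [] for lane in LANES}
    for item in evidence:
        if item["lane"] not in lanes:
            raise ValueError(f"unknown lane: {item['lane']!r}")
        lanes[item["lane"]].append(item["text"])
    return lanes


def context_for(node_name: str, lanes: dict[str, list[str]]) -> dict[str, list[str]]:
    # Hand a node only the lanes it declared, nothing else.
    return {lane: lanes[lane] for lane in NODE_LANES[node_name]}


evidence = [
    {"lane": "definitions", "text": "Suffrage is the right to vote."},
    {"lane": "vocabulary", "text": "enfranchise"},
]
lanes = build_lanes(evidence)
pre_knowledge_context = context_for("pre_knowledge", lanes)
```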

6. Smarter YouTube Ingestion

A lot of the best debate and explanation content lives on YouTube, but blindly ingesting every transcript is a recipe for noise.

So ingestion now:

Scans channel inventory
Caches metadata locally
Scores relevance from title and description
Uses thumbnail text as a ranking signal
Fetches transcripts only for videos worth fetching

The strongest signal about what a video covers often sits in its title, thumbnail, or short description rather than in the transcript itself. Making ingestion selective meant the pipeline got both faster and more useful at the same time, which doesn't happen often.
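The selective part can be sketched without touching any real API. In the example below, `VideoMeta`, the 3/2/1 weights, and the threshold are all assumptions, and `fetch_transcript` is a callable you would supply yourself. Inventory scanning and metadata caching are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VideoMeta:
    video_id: str
    title: str
    description: str
    thumbnail_text: str = ""  # text already extracted from the thumbnail, if any


def relevance(meta: VideoMeta, topic_terms: list[str]) -> int:
    # topic_terms are expected in lowercase. Weights are illustrative.
    def hits(text: str) -> int:
        words = set(text.lower().split())
        return sum(1 for term in topic_terms if term in words)
    return 3 * hits(meta.title) + 2 * hits(meta.thumbnail_text) + hits(meta.description)


def ingest(videos: list[VideoMeta], topic_terms: list[str],
           fetch_transcript: Callable[[str], str], min_score: int = 3) -> dict[str, str]:
    # Only pay the cost of a transcript fetch for videos that score well on metadata.
    transcripts = {}
    for meta in videos:
        if relevance(meta, topic_terms) >= min_score:
            transcripts[meta.video_id] = fetch_transcript(meta.video_id)
    return transcripts


videos = [
    VideoMeta("a", "Feminism debate full round", "motion on feminism", "FEMINISM CLASH"),
    VideoMeta("b", "Channel trailer", "welcome to the channel"),
]
fetched_ids: list[str] = []


def fake_fetch(video_id: str) -> str:
    # Stand-in for a real transcript download.
    fetched_ids.append(video_id)
    return f"transcript-{video_id}"


transcripts = ingest(videos, ["feminism"], fake_fetch)
```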

7. Retrieval Memory

This is the upgrade that turned the system from a one-off lookup tool into something that actually improves over time.

The pipeline now remembers which lessons, sources, and retrieval choices worked well in the past, and reuses those patterns instead of repeating the same weak sources again and again.

Of everything I built, this is the idea I'd point to as the most quietly powerful.
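One simple way to model this kind of memory is a running record of how each source performed for each node, blended into future scores. The class below is a sketch under that assumption. The `RetrievalMemory` name, the 0..1 outcome scale, and the blend weight are hypothetical, and persistence to disk is omitted.

```python
from collections import defaultdict


class RetrievalMemory:
    def __init__(self) -> None:
        # (node, source_id) -> list of past outcome scores in 0..1
        self._history: dict[tuple[str, str], list[float]] = defaultdict(list)

    def record(self, node: str, source_id: str, outcome: float) -> None:
        self._history[(node, source_id)].append(outcome)

    def prior(self, node: str, source_id: str, default: float = 0.5) -> float:
        # Average past outcome, or a neutral default for unseen sources.
        past = self._history.get((node, source_id))
        return sum(past) / len(past) if past else default

    def adjust(self, node: str, candidates: list[tuple[str, float]],
               weight: float = 0.3) -> list[tuple[str, float]]:
        # Blend the fresh retrieval score with what memory says about the source.
        rescored = [
            (source_id, (1 - weight) * score + weight * self.prior(node, source_id))
            for source_id, score in candidates
        ]
        return sorted(rescored, key=lambda item: item[1], reverse=True)


memory = RetrievalMemory()
memory.record("debate_building", "good_src", 0.9)
memory.record("debate_building", "weak_src", 0.1)
ranked = memory.adjust("debate_building", [("weak_src", 0.6), ("good_src", 0.55)])
```

Here the weak source scores slightly higher on the fresh query, but its track record pulls it below the source that has worked before.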

8. Evaluation Against Real Lessons

I didn't want to rely on "it feels more advanced now," so I added evaluation: both synthetic retrieval checks and real trace-based scoring from saved lessons.

That gave me a way to actually measure:

Whether the right evidence lanes are being populated
Whether the retrieval plan makes sense for the node
Whether real lessons are improving over time
Whether source reuse is getting smarter or going stale

Put as plain questions about each saved lesson:

- Does the lesson contain the right evidence lanes?
- Are the best sources showing up consistently?
- Is the output more specific than before?
- Are we repeating weak material too often?

Without this step, it's incredibly easy to convince yourself a system is better just because it looks more sophisticated. Evaluation is what turned "I think this is better" into "I can show you it's better."
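Two of those checks are easy to sketch as functions over saved traces. The trace shape used here (a `lanes` dict and a `sources` list per lesson) is an assumption, as are the metric names. Real scoring would likely cover more than this.

```python
from collections import Counter


def lane_coverage(trace: dict, required_lanes: list[str]) -> float:
    # Fraction of the required lanes that actually contain evidence.
    if not required_lanes:
        return 1.0
    filled = sum(1 for lane in required_lanes if trace["lanes"].get(lane))
    return filled / len(required_lanes)


def source_repetition(traces: list[dict]) -> float:
    # Share of source uses that are repeats of an already-used source.
    counts = Counter(src for trace in traces for src in trace["sources"])
    total = sum(counts.values())
    return (total - len(counts)) / total if total else 0.0


traces = [
    {"lanes": {"definitions": ["x"], "mechanisms": []}, "sources": ["s1", "s2"]},
    {"lanes": {"definitions": ["y"], "mechanisms": ["z"]}, "sources": ["s1", "s3"]},
]
required = ["definitions", "mechanisms"]
coverage = [lane_coverage(trace, required) for trace in traces]
repetition = source_repetition(traces)
```

Tracked across lessons, numbers like these show whether a change actually filled more lanes or just reshuffled the same sources.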

What Actually Changed

After all of this, the difference wasn't subtle:

Better relevance: retrieval results now match each node's actual purpose, not a generic average.

Better structure: output is split into useful, labeled sections instead of one long blob of text.

Better efficiency: far less effort wasted on low-value sources and weak chunks.

Better debate usefulness: the system surfaces material that actually helps build arguments, not just background reading.

Better consistency: because the architecture is layered, I can tune one piece without breaking everything else.

Better learning value: the final output now contains pre-knowledge, debate angles, vocabulary, and coaching as distinct, usable layers.

The Architecture, in One Line

topic → planner → router → document selection → chunk retrieval → rerank → pack → evidence lanes → lesson output → trace evaluation

Every step in that line exists for a reason. Not because it looks advanced on a diagram, but because debate learning was never a single retrieval problem to begin with. It's a retrieval orchestration problem.

Why This Matters Specifically for Debate Learning

A good debate learning system doesn't just need articles. It needs background, concepts, argument structure, rebuttal logic, examples, vocabulary, coaching, repetition control, and topic freshness, all at once, all tailored to the same lesson.

That's a much richer problem than "answer the question." Advanced RAG gave me a way to support all of those layers without the output collapsing into chaos.

It also opened the door to real personalization:

Vocabulary pulled from the lesson itself
Debate framing tied directly to the chosen article
Pre-knowledge sourced from curated material
Coaching grounded in the evidence that was actually retrieved

That's what makes the difference between a tool that answers and a tool that actually teaches.

The Biggest Lesson

If there's one thing I'd want someone to take from all of this, it's this:

Advanced RAG isn't about adding more model calls. It's about making retrieval more deliberate.

It's tempting to think RAG quality comes from bigger embeddings, bigger models, or just more text in the context window. Those things can help. They're not the main answer.

The real gains came from:

Planning better queries
Routing by intent
Selecting better documents first
Reranking evidence
Packing context intelligently
Separating source roles
Remembering what worked
Evaluating on real traces

That's the list that actually made the system feel "advanced," not the model behind it.

If You're Building a Retrieval Pipeline

Don't stop at chunk search. That's the beginning, not the destination.

Once you start thinking in terms of intent, routing, document-level selection, evidence role separation, memory, and evaluation, RAG stops being a clever prompt trick and starts becoming a real system, one that can be debugged, tuned, and trusted.

In my case, that shift was big enough to materially change both efficiency and output quality. If you're stuck wondering why your RAG pipeline "works but feels generic," there's a good chance the answer isn't your model.

It's your architecture.

If this was useful, I'd genuinely love to hear what you've tried in your own RAG pipelines, especially if you've found a different lever that moved the needle for you. Drop a comment, I read every one.

DEV Community