Argha Sarkar

I Built a RAG System. Then I Broke It With One Question!

I was testing my own RAG application.

I'd spent weeks building it — .NET 8, Qdrant, OpenAI, Clean Architecture. It worked well. Upload documents, ask questions, get cited answers. I was happy with it.

So I loaded up some public annual reports and research papers, and started stress-testing it.

Most answers were solid. Then I asked:

"What are the common risk factors mentioned across these annual reports, and do any of them overlap?"

The system responded in seconds. Confident. Cited. Clean.

But when I cross-checked manually, I realised it had only pulled chunks from one report. The others hadn't been touched. No warning. No caveat. Just a quietly incomplete answer dressed up as a complete one.

That was the moment I stopped and thought: this isn't a retrieval bug. This is an architectural ceiling.


What Single-Shot RAG Actually Does

Here's the pipeline most RAG systems run:

User question
    → Generate embedding
    → Vector search (one query, one pass)
    → Take top-K chunks
    → Stuff into prompt
    → Generate answer

It's fast, cheap, and works well for direct factual questions. "What is the refund policy?" — great. One search finds it.

But for anything that requires:

  • Searching across multiple documents with different angles
  • Comparing information from two sources
  • First understanding what documents exist, then drilling in

...single-shot RAG fails silently. The LLM gets whatever the one search returned and does its best. It has no way to say "I think I need more context from a different source." It just answers.
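To make that ceiling concrete, here's a minimal sketch of the fixed pipeline, with the service shapes passed in as delegates. These names are illustrative assumptions, not the article's actual code:

```csharp
using System;
using System.Threading.Tasks;

public static class SingleShotRag
{
    // One embedding, one search, one completion. There is no branch where
    // the model can decide "I need another search before answering".
    public static async Task<string> AskAsync(
        string question,
        Func<string, Task<float[]>> embed,               // generate embedding
        Func<float[], int, Task<string[]>> vectorSearch, // one query, one pass
        Func<string, Task<string>> complete)             // LLM call
    {
        var embedding = await embed(question);
        var chunks = await vectorSearch(embedding, 5);   // top-K, fixed

        // Stuff whatever came back into the prompt; if the retrieval was
        // incomplete, the answer will be confidently incomplete too.
        var prompt = $"Context:\n{string.Join("\n---\n", chunks)}\n\nQuestion: {question}";
        return await complete(prompt);
    }
}
```

Note the control flow: question in, answer out, no loop and no decision point in between.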


The Fix: Give the System the Ability to Think Before It Answers

What I needed wasn't better retrieval. I needed the system to plan its retrieval.

This is the ReAct pattern — Reason + Act. Instead of a fixed pipeline, the agent runs a loop:

Reason about what to do next
    → Act (call a tool)
    → Observe the result
    → Reason again
    → Act again
    → ... until it has enough to answer

At each step, the agent decides: do I have enough information, or do I need to search more?


How I Implemented It

The agent is powered by the same LLM already in the stack. The trick is the system prompt — instead of asking the LLM to answer the question directly, you tell it to output a structured decision at every step:

At EVERY step, output ONLY valid JSON:
{
  "thought": "your reasoning about what to do next",
  "action": "search_documents | get_document_summary | compare_chunks | answer_directly",
  "action_input": "input for the chosen action"
}
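On the .NET side, that contract deserialises cleanly with System.Text.Json. A minimal sketch — the record and class names here are mine, not necessarily the repo's:

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Mirrors the JSON contract from the system prompt above.
public record AgentAction(
    [property: JsonPropertyName("thought")] string Thought,
    [property: JsonPropertyName("action")] string Action,
    [property: JsonPropertyName("action_input")] string ActionInput);

public static class AgentActionParser
{
    public static AgentAction Parse(string llmResponse) =>
        JsonSerializer.Deserialize<AgentAction>(llmResponse)
            ?? throw new InvalidOperationException("LLM returned no JSON object");
}
```

This is the happy path only; what happens when the model doesn't produce clean JSON is covered at the end of the post.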

The loop then:

  1. Sends this prompt + the conversation history to the LLM
  2. Parses the JSON response
  3. Executes the chosen tool
  4. Appends the result to the conversation history
  5. Repeats — until the agent calls answer_directly or hits the iteration limit

Here's the core loop in C#, simplified:

while (iteration < _config.MaxIterations)
{
    // Ask the LLM for its next structured decision
    var llmResponse = await _chatService.GenerateResponseAsync(systemPrompt, messages, ct);
    var parsed = ParseAgentAction(llmResponse);

    trace.Steps.Add(new AgentStep { StepType = "reasoning", Reasoning = parsed.Thought });

    // The agent has decided it has enough context to answer
    if (parsed.Action == "answer_directly")
    {
        finalAnswer = parsed.ActionInput;
        break;
    }

    // Otherwise, execute the chosen tool and feed the result back
    var toolResult = await _tools.ExecuteAsync(parsed.Action, parsed.ActionInput, ct);

    messages.Add(new ChatMessage { Role = "assistant", Content = llmResponse });
    messages.Add(new ChatMessage { Role = "user", Content = $"Tool result: {toolResult.Text}" });

    iteration++;
}

Simple. The LLM drives the loop. The code just executes whatever it decides.


The Four Tools

The agent has four tools to choose from:

  • search_documents — vector search; the same semantic search the existing RAG system uses
  • get_document_summary — retrieves chunks for a document and asks the LLM to summarise it
  • compare_chunks — takes two text segments and asks the LLM to identify agreements and contradictions
  • answer_directly — signals that the agent has enough context and is ready to answer

These aren't new infrastructure. They're thin wrappers over what already existed. The intelligence is in the loop, not the tools.
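A dispatcher over those wrappers can be a single switch expression. Here's a sketch with stubbed tools standing in for the real wrappers; the method and type names are my assumptions:

```csharp
using System.Threading;
using System.Threading.Tasks;

public record ToolResult(string Text);

public class AgentTools
{
    public async Task<ToolResult> ExecuteAsync(string action, string input, CancellationToken ct) =>
        action switch
        {
            "search_documents"     => await SearchAsync(input, ct),
            "get_document_summary" => await SummariseAsync(input, ct),
            "compare_chunks"       => await CompareAsync(input, ct),
            // Returning the error as a tool result lets the agent read it
            // and pick a valid tool on its next reasoning step.
            _ => new ToolResult($"Unknown tool '{action}'.")
        };

    // Stubs standing in for the thin wrappers described above.
    private Task<ToolResult> SearchAsync(string q, CancellationToken ct) =>
        Task.FromResult(new ToolResult($"[chunks for: {q}]"));
    private Task<ToolResult> SummariseAsync(string doc, CancellationToken ct) =>
        Task.FromResult(new ToolResult($"[summary of: {doc}]"));
    private Task<ToolResult> CompareAsync(string pair, CancellationToken ct) =>
        Task.FromResult(new ToolResult($"[comparison of: {pair}]"));
}
```

The unknown-tool branch matters: feeding the error back as an observation gives the loop a chance to self-correct instead of crashing.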


The Reasoning Trace

Every response from the agent includes the full reasoning trace — a step-by-step log of every decision it made:

{
  "answer": "Both reports flag supply chain disruption as a key risk...",
  "iterationsUsed": 3,
  "maxIterationsReached": false,
  "trace": {
    "steps": [
      { "stepType": "reasoning", "reasoning": "I need to search for risk factors in the first report" },
      { "stepType": "tool_call", "toolName": "search_documents", "toolInput": "risk factors annual report 2023" },
      { "stepType": "tool_result", "toolOutput": "Supply chain disruption, interest rate exposure..." },
      { "stepType": "reasoning", "reasoning": "Now I need to check the second report for overlap" },
      { "stepType": "tool_call", "toolName": "search_documents", "toolInput": "risk factors annual report 2024" },
      { "stepType": "tool_result", "toolOutput": "Supply chain risk, inflation, regulatory pressure..." },
      { "stepType": "answer", "toolOutput": "Both reports flag supply chain disruption..." }
    ]
  }
}

This isn't just debugging information. It's the answer to the question every enterprise user eventually asks: "How did it arrive at this?"


What Changed

The question that broke the original system — "What risk factors overlap across these reports?" — now works correctly. The agent searches each report separately, compares the results, and synthesises a grounded answer.

More importantly, it tells you exactly how it got there.

Two new endpoints:

  • POST /api/agent/query — runs the full loop, returns the complete response + trace
  • POST /api/agent/stream — an SSE stream, so you can watch the agent reason in real time

And one safety valve: with Agent:Enabled = false, the agent endpoints return a 503 instantly and make no AI calls. Useful for cost control.


What I'd Do Differently

The weakest part is JSON parsing. LLMs — especially smaller local models via Ollama — don't always produce clean JSON. I added fallback handling (strip code fences, fall back to answer_directly if parsing fails entirely), but a production system would benefit from structured output / function calling if the model supports it.
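The fence-stripping fallback looks roughly like this — a sketch, with the record and method names mine rather than the repo's:

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

public record AgentAction(
    [property: JsonPropertyName("thought")] string Thought,
    [property: JsonPropertyName("action")] string Action,
    [property: JsonPropertyName("action_input")] string ActionInput);

public static class ResilientParser
{
    public static AgentAction Parse(string llmResponse)
    {
        var cleaned = llmResponse.Trim();

        // Local models often wrap their JSON in ```json ... ``` fences.
        if (cleaned.StartsWith("```"))
        {
            var firstNewline = cleaned.IndexOf('\n');
            var lastFence = cleaned.LastIndexOf("```", StringComparison.Ordinal);
            if (firstNewline >= 0 && lastFence > firstNewline)
                cleaned = cleaned[(firstNewline + 1)..lastFence].Trim();
        }

        try
        {
            var parsed = JsonSerializer.Deserialize<AgentAction>(cleaned);
            if (parsed is not null) return parsed;
        }
        catch (JsonException)
        {
            // Fall through to the safe default below.
        }

        // If the model ignored the contract entirely, treat its raw text
        // as a direct answer rather than failing the whole request.
        return new AgentAction("(unparseable)", "answer_directly", llmResponse);
    }
}
```

Falling back to answer_directly degrades gracefully: the user still gets the model's text, just without the tool-use step.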

The iteration limit (default 5) is also a balance. Higher means more thorough answers but more cost. For complex multi-document questions, 3–4 iterations is usually enough.
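Both knobs live in configuration. Assuming standard appsettings binding — the exact section shape is my guess from the Agent:Enabled and MaxIterations names used above:

```json
{
  "Agent": {
    "Enabled": true,
    "MaxIterations": 5
  }
}
```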


The Full Implementation

The complete source is open — .NET 8, Clean Architecture, Qdrant, OpenAI/Ollama/Azure OpenAI support:

👉 https://github.com/Argha713/dotnet-rag-api

If you're building RAG in .NET and hitting the same ceiling, the agentic layer is the natural next step. It's additive — the existing /api/chat endpoints are completely untouched.

Top comments (1)

jidong

nice writeup. we hit the exact same wall with code-context retrieval for AI coding agents — single-pass embedding search misses structural relationships between files. ended up building a dependency graph first so the agent knows which files actually need to be loaded together. same idea as your ReAct loop but pre-computed at indexing time rather than runtime. the "planning retrieval" vs "just searching" distinction is the key insight here