I was testing my own RAG application.
I'd spent weeks building it — .NET 8, Qdrant, OpenAI, Clean Architecture. It worked well. Upload documents, ask questions, get cited answers. I was happy with it.
So I loaded up some public annual reports and research papers, and started stress-testing it.
Most answers were solid. Then I asked:
"What are the common risk factors mentioned across these annual reports, and do any of them overlap?"
The system responded in seconds. Confident. Cited. Clean.
But when I cross-checked manually, I realised it had only pulled chunks from one report. The others hadn't been touched. No warning. No caveat. Just a quietly incomplete answer dressed up as a complete one.
That was the moment I stopped and thought: this isn't a retrieval bug. This is an architectural ceiling.
What Single-Shot RAG Actually Does
Here's the pipeline most RAG systems run:
User question
→ Generate embedding
→ Vector search (one query, one pass)
→ Take top-K chunks
→ Stuff into prompt
→ Generate answer
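The pipeline above can be sketched in a few lines of C#. This is a hypothetical sketch, not the repo's actual code — service names like `_embeddings`, `_vectorStore`, and `_chatService` are illustrative placeholders:

```csharp
// Hypothetical single-shot RAG pipeline; the injected services are
// illustrative placeholders, not the repo's actual interfaces.
public async Task<string> AskAsync(string question, CancellationToken ct)
{
    // 1. Embed the question once
    float[] queryVector = await _embeddings.EmbedAsync(question, ct);

    // 2. One vector search, one pass
    var topChunks = await _vectorStore.SearchAsync(queryVector, topK: 5, ct);

    // 3. Stuff whatever came back into the prompt
    var context = string.Join("\n---\n", topChunks.Select(c => c.Text));
    var prompt = $"Answer using ONLY this context:\n{context}\n\nQuestion: {question}";

    // 4. Generate — the LLM never gets a second chance to retrieve
    return await _chatService.GenerateResponseAsync(prompt, ct);
}
```

Note there is exactly one retrieval step: whatever that single search misses, the answer misses.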
It's fast, cheap, and works well for direct factual questions. "What is the refund policy?" — great. One search finds it.
But for anything that requires:
- Searching across multiple documents with different angles
- Comparing information from two sources
- First understanding what documents exist, then drilling in
...single-shot RAG fails silently. The LLM gets whatever the one search returned and does its best. It has no way to say "I think I need more context from a different source." It just answers.
The Fix: Give the System the Ability to Think Before It Answers
What I needed wasn't better retrieval. I needed the system to plan its retrieval.
This is the ReAct pattern — Reason + Act. Instead of a fixed pipeline, the agent runs a loop:
Reason about what to do next
→ Act (call a tool)
→ Observe the result
→ Reason again
→ Act again
→ ... until it has enough to answer
At each step, the agent decides: do I have enough information, or do I need to search more?
How I Implemented It
The agent is powered by the same LLM already in the stack. The trick is the system prompt — instead of asking the LLM to answer the question directly, you tell it to output a structured decision at every step:
At EVERY step, output ONLY valid JSON:

```json
{
  "thought": "your reasoning about what to do next",
  "action": "search_documents | get_document_summary | compare_chunks | answer_directly",
  "action_input": "input for the chosen action"
}
```
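That contract maps naturally onto a small C# DTO. A sketch, assuming `System.Text.Json` — the class name is my own, only the property names come from the JSON above:

```csharp
using System.Text.Json.Serialization;

// Sketch of a DTO for the agent's per-step decision; property names
// match the JSON contract, the type name is hypothetical.
public sealed class AgentAction
{
    [JsonPropertyName("thought")]
    public string Thought { get; set; } = "";

    [JsonPropertyName("action")]
    public string Action { get; set; } = "";

    [JsonPropertyName("action_input")]
    public string ActionInput { get; set; } = "";
}
```

With this in place, each step is one `JsonSerializer.Deserialize<AgentAction>(llmResponse)` call.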
The loop then:
- Sends this prompt + the conversation history to the LLM
- Parses the JSON response
- Executes the chosen tool
- Appends the result to the conversation history
- Repeats — until the agent calls `answer_directly` or hits the iteration limit
Here's the core loop in C#, simplified:
```csharp
while (iteration < _config.MaxIterations)
{
    var llmResponse = await _chatService.GenerateResponseAsync(systemPrompt, messages, ct);
    var parsed = ParseAgentAction(llmResponse);

    trace.Steps.Add(new AgentStep { StepType = "reasoning", Reasoning = parsed.Thought });

    if (parsed.Action == "answer_directly")
    {
        finalAnswer = parsed.ActionInput;
        break;
    }

    var toolResult = await _tools.ExecuteAsync(parsed.Action, parsed.ActionInput, ct);

    messages.Add(new ChatMessage { Role = "assistant", Content = llmResponse });
    messages.Add(new ChatMessage { Role = "user", Content = $"Tool result: {toolResult.Text}" });

    iteration++;
}
```
Simple. The LLM drives the loop. The code just executes whatever it decides.
The Four Tools
The agent has four tools to choose from:
| Tool | What it does |
|---|---|
| `search_documents` | Vector search — the same semantic search the existing RAG system uses |
| `get_document_summary` | Retrieves chunks for a document and asks the LLM to summarise it |
| `compare_chunks` | Takes two text segments and asks the LLM to identify agreements and contradictions |
| `answer_directly` | Signals that the agent has enough context and is ready to answer |
These aren't new infrastructure. They're thin wrappers over what already existed. The intelligence is in the loop, not the tools.
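The dispatch itself can be as simple as a switch over the action name. A sketch — the wrapped services (`_search`, `_summariser`, `_comparer`) stand in for whatever retrieval and summarisation code already exists:

```csharp
// Hypothetical tool dispatcher; the injected services are stand-ins
// for the search, summary, and comparison code the system already has.
public async Task<ToolResult> ExecuteAsync(string action, string input, CancellationToken ct)
{
    return action switch
    {
        "search_documents"     => await _search.SearchAsync(input, ct),
        "get_document_summary" => await _summariser.SummariseAsync(input, ct),
        "compare_chunks"       => await _comparer.CompareAsync(input, ct),

        // Unknown action: feed the error back to the LLM so it can self-correct
        _ => new ToolResult
        {
            Text = $"Unknown tool '{action}'. Valid tools: search_documents, " +
                   "get_document_summary, compare_chunks, answer_directly."
        }
    };
}
```

Returning the error as a tool result, rather than throwing, keeps the loop alive and gives the model a chance to pick a valid tool on the next iteration.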
The Reasoning Trace
Every response from the agent includes the full reasoning trace — a step-by-step log of every decision it made:
```json
{
  "answer": "Both reports flag supply chain disruption as a key risk...",
  "iterationsUsed": 3,
  "maxIterationsReached": false,
  "trace": {
    "steps": [
      { "stepType": "reasoning", "reasoning": "I need to search for risk factors in the first report" },
      { "stepType": "tool_call", "toolName": "search_documents", "toolInput": "risk factors annual report 2023" },
      { "stepType": "tool_result", "toolOutput": "Supply chain disruption, interest rate exposure..." },
      { "stepType": "reasoning", "reasoning": "Now I need to check the second report for overlap" },
      { "stepType": "tool_call", "toolName": "search_documents", "toolInput": "risk factors annual report 2024" },
      { "stepType": "tool_result", "toolOutput": "Supply chain risk, inflation, regulatory pressure..." },
      { "stepType": "answer", "toolOutput": "Both reports flag supply chain disruption..." }
    ]
  }
}
```
This isn't just debugging information. It's the answer to the question every enterprise user eventually asks: "How did it arrive at this?"
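The trace shape maps onto a couple of small classes. A sketch — property names mirror the JSON, but these are not necessarily the repo's actual types:

```csharp
// Sketch of the trace model; one step per decision the agent made.
public sealed class AgentStep
{
    public string StepType { get; set; } = "";  // "reasoning" | "tool_call" | "tool_result" | "answer"
    public string? Reasoning { get; set; }
    public string? ToolName { get; set; }
    public string? ToolInput { get; set; }
    public string? ToolOutput { get; set; }
}

public sealed class AgentTrace
{
    public List<AgentStep> Steps { get; } = new();
}
```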
What Changed
The question that broke the original system — "What risk factors overlap across these reports?" — now works correctly. The agent searches each report separately, compares the results, and synthesises a grounded answer.
More importantly, it tells you exactly how it got there.
Two new endpoints:

- `POST /api/agent/query` — runs the full loop, returns the complete response + trace
- `POST /api/agent/stream` — SSE stream, so you watch the agent reason in real time

And one safety valve: `Agent:Enabled = false` returns a 503 instantly, no AI calls made. Useful for cost control.
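In ASP.NET Core's colon-delimited key convention, that switch presumably lives in `appsettings.json` — something like this fragment, where every key other than `Agent:Enabled` is my guess at the shape:

```json
{
  "Agent": {
    "Enabled": false,
    "MaxIterations": 5
  }
}
```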
What I'd Do Differently
The weakest part is JSON parsing. LLMs — especially smaller local models via Ollama — don't always produce clean JSON. I added fallback handling (strip code fences, fall back to answer_directly if parsing fails entirely), but a production system would benefit from structured output / function calling if the model supports it.
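The fence-stripping fallback can look roughly like this. A sketch of what a defensive `ParseAgentAction` might do, assuming the `AgentAction` DTO described earlier — not the repo's actual implementation:

```csharp
using System.Text.Json;

// Sketch of defensive parsing for LLM output: strip markdown code
// fences, then fall back to answer_directly if the JSON is unusable.
private AgentAction ParseAgentAction(string llmResponse)
{
    var text = llmResponse.Trim();

    // Models often wrap JSON in ```json ... ``` fences
    if (text.StartsWith("```"))
    {
        text = text.Trim('`');
        if (text.StartsWith("json")) text = text[4..];
        text = text.Trim();
    }

    try
    {
        return JsonSerializer.Deserialize<AgentAction>(text)
               ?? throw new JsonException("null payload");
    }
    catch (JsonException)
    {
        // Last resort: treat the raw text as the final answer so the
        // loop terminates instead of crashing on malformed output.
        return new AgentAction { Action = "answer_directly", ActionInput = llmResponse };
    }
}
```

Structured output or native function calling, where the provider supports it, removes most of this fragility by having the API enforce the schema instead of the prompt.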
The iteration limit (default 5) is also a balance. Higher means more thorough answers but more cost. For complex multi-document questions, 3–4 iterations is usually enough.
The Full Implementation
The complete source is open — .NET 8, Clean Architecture, Qdrant, OpenAI/Ollama/Azure OpenAI support:
👉 https://github.com/Argha713/dotnet-rag-api
If you're building RAG in .NET and hitting the same ceiling, the agentic layer is the natural next step. It's additive — the existing /api/chat endpoints are completely untouched.