<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Argha Sarkar</title>
    <description>The latest articles on DEV Community by Argha Sarkar (@argha_dev).</description>
    <link>https://dev.to/argha_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3801528%2Fbf7a13ef-386f-4cac-92cb-dae0117c309d.jpg</url>
      <title>DEV Community: Argha Sarkar</title>
      <link>https://dev.to/argha_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/argha_dev"/>
    <language>en</language>
    <item>
      <title>I Built a RAG System. Then I Broke It With One Question!</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 24 Mar 2026 08:09:12 +0000</pubDate>
      <link>https://dev.to/argha_dev/i-built-a-rag-system-then-i-broke-it-with-one-question-4aan</link>
      <guid>https://dev.to/argha_dev/i-built-a-rag-system-then-i-broke-it-with-one-question-4aan</guid>
      <description>&lt;p&gt;I was testing my own RAG application.&lt;/p&gt;

&lt;p&gt;I'd spent weeks building it — .NET 8, Qdrant, OpenAI, Clean Architecture. It worked well. Upload documents, ask questions, get cited answers. I was happy with it.&lt;/p&gt;

&lt;p&gt;So I loaded up some public annual reports and research papers, and started stress-testing it.&lt;/p&gt;

&lt;p&gt;Most answers were solid. Then I asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What are the common risk factors mentioned across these annual reports, and do any of them overlap?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system responded in seconds. Confident. Cited. Clean.&lt;/p&gt;

&lt;p&gt;But when I cross-checked manually, I realised it had only pulled chunks from one report. The others hadn't been touched. No warning. No caveat. Just a quietly incomplete answer dressed up as a complete one.&lt;/p&gt;

&lt;p&gt;That was the moment I stopped and thought: this isn't a retrieval bug. This is an architectural ceiling.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Single-Shot RAG Actually Does
&lt;/h2&gt;

&lt;p&gt;Here's the pipeline most RAG systems run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question
    → Generate embedding
    → Vector search (one query, one pass)
    → Take top-K chunks
    → Stuff into prompt
    → Generate answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's fast, cheap, and works well for direct factual questions. &lt;em&gt;"What is the refund policy?"&lt;/em&gt; — great. One search finds it.&lt;/p&gt;

&lt;p&gt;But for anything that requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searching across multiple documents with different angles&lt;/li&gt;
&lt;li&gt;Comparing information from two sources&lt;/li&gt;
&lt;li&gt;First understanding what documents exist, then drilling in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...single-shot RAG fails silently. The LLM gets whatever the one search returned and does its best. It has no way to say "I think I need more context from a different source." It just answers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Give the System the Ability to Think Before It Answers
&lt;/h2&gt;

&lt;p&gt;What I needed wasn't better retrieval. I needed the system to &lt;strong&gt;plan&lt;/strong&gt; its retrieval.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;ReAct pattern&lt;/strong&gt; — Reason + Act. Instead of a fixed pipeline, the agent runs a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reason about what to do next
    → Act (call a tool)
    → Observe the result
    → Reason again
    → Act again
    → ... until it has enough to answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each step, the agent decides: do I have enough information, or do I need to search more?&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Implemented It
&lt;/h2&gt;

&lt;p&gt;The agent is powered by the same LLM already in the stack. The trick is the system prompt — instead of asking the LLM to answer the question directly, you tell it to output a structured decision at every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;At&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;EVERY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;step,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ONLY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thought"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your reasoning about what to do next"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents | get_document_summary | compare_chunks | answer_directly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action_input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"input for the chosen action"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sends this prompt + the conversation history to the LLM&lt;/li&gt;
&lt;li&gt;Parses the JSON response&lt;/li&gt;
&lt;li&gt;Executes the chosen tool&lt;/li&gt;
&lt;li&gt;Appends the result to the conversation history&lt;/li&gt;
&lt;li&gt;Repeats — until the agent calls &lt;code&gt;answer_directly&lt;/code&gt; or hits the iteration limit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the core loop in C#, simplified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxIterations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;llmResponse&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ParseAgentAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llmResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;AgentStep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;StepType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reasoning&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thought&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"answer_directly"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;finalAnswer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ActionInput&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;toolResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ActionInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llmResponse&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;$"Tool result: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;toolResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. The LLM drives the loop. The code just executes whatever it decides.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Tools
&lt;/h2&gt;

&lt;p&gt;The agent has four tools to choose from:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector search — the same semantic search the existing RAG system uses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_document_summary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retrieves chunks for a document and asks the LLM to summarise it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_chunks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Takes two text segments and asks the LLM to identify agreements and contradictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;answer_directly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Signals that the agent has enough context and is ready to answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't new infrastructure. They're thin wrappers over what already existed. The intelligence is in the loop, not the tools.&lt;/p&gt;
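&lt;p&gt;A sketch of what that dispatch can look like: one switch from action name to existing service. The &lt;code&gt;ToolResult&lt;/code&gt; type and the injected service names here are illustrative, not the exact ones in the repo.&lt;/p&gt;

```csharp
// Illustrative dispatcher: the agent's chosen action routes to services the
// RAG system already has. ToolResult and the injected services are assumed names.
public async Task&lt;ToolResult&gt; ExecuteAsync(string action, string input, CancellationToken ct)
{
    return action switch
    {
        "search_documents"     =&gt; await _searchService.SearchAsync(input, topK: 5, ct),
        "get_document_summary" =&gt; await _summaryService.SummariseAsync(input, ct),
        "compare_chunks"       =&gt; await _compareService.CompareAsync(input, ct),
        // Unknown tool: fail soft so the agent can read the error and recover.
        _ =&gt; new ToolResult { Text = $"Unknown tool: {action}" }
    };
}
```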




&lt;h2&gt;
  
  
  The Reasoning Trace
&lt;/h2&gt;

&lt;p&gt;Every response from the agent includes the full reasoning trace — a step-by-step log of every decision it made:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Both reports flag supply chain disruption as a key risk..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iterationsUsed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxIterationsReached"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I need to search for risk factors in the first report"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolInput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"risk factors annual report 2023"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Supply chain disruption, interest rate exposure..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Now I need to check the second report for overlap"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolInput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"risk factors annual report 2024"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Supply chain risk, inflation, regulatory pressure..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Both reports flag supply chain disruption..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just debugging information. It's the answer to the question every enterprise user eventually asks: &lt;em&gt;"How did it arrive at this?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;The question that broke the original system — &lt;em&gt;"What risk factors overlap across these reports?"&lt;/em&gt; — now works correctly. The agent searches each report separately, compares the results, and synthesises a grounded answer.&lt;/p&gt;

&lt;p&gt;More importantly, it tells you exactly how it got there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two new endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/agent/query&lt;/code&gt; — runs the full loop, returns the complete response + trace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/agent/stream&lt;/code&gt; — SSE stream, so you can watch the agent reason in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And one safety valve: with &lt;code&gt;Agent:Enabled = false&lt;/code&gt;, the endpoints return a &lt;code&gt;503&lt;/code&gt; immediately and no AI calls are made. Useful for cost control.&lt;/p&gt;
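&lt;p&gt;A minimal sketch of that guard, assuming the flag is read through &lt;code&gt;IConfiguration&lt;/code&gt; (the endpoint wiring and type names are illustrative):&lt;/p&gt;

```csharp
// Sketch of the kill switch: check the flag before anything touches a model.
// The config key matches the article; AgentRequest and AgentService are assumed names.
app.MapPost("/api/agent/query", async (AgentRequest req, IConfiguration config,
                                       AgentService agent, CancellationToken ct) =&gt;
{
    if (!config.GetValue&lt;bool&gt;("Agent:Enabled"))
        return Results.StatusCode(StatusCodes.Status503ServiceUnavailable); // no AI calls made

    var result = await agent.RunAsync(req.Question, ct);
    return Results.Ok(result);
});
```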




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The weakest part is JSON parsing. LLMs — especially smaller local models via Ollama — don't always produce clean JSON. I added fallback handling (strip code fences, fall back to &lt;code&gt;answer_directly&lt;/code&gt; if parsing fails entirely), but a production system would benefit from structured output / function calling if the model supports it.&lt;/p&gt;
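&lt;p&gt;A minimal sketch of that fallback, with &lt;code&gt;AgentAction&lt;/code&gt; as an assumed record type:&lt;/p&gt;

```csharp
using System.Text.Json;

// Sketch of the fallback parse: strip markdown fences, then try JSON; if that
// still fails, treat the whole response as a direct answer. AgentAction is an
// assumed record with Thought, Action, and ActionInput properties.
static AgentAction ParseAgentAction(string llmResponse)
{
    var text = llmResponse.Trim();

    // Smaller models often wrap JSON in ```json ... ``` fences; strip them.
    if (text.StartsWith("```", StringComparison.Ordinal))
    {
        var firstNewline = text.IndexOf('\n');
        if (firstNewline &gt;= 0) text = text[(firstNewline + 1)..];
        var closingFence = text.LastIndexOf("```", StringComparison.Ordinal);
        if (closingFence &gt;= 0) text = text[..closingFence];
    }

    try
    {
        return JsonSerializer.Deserialize&lt;AgentAction&gt;(text,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;
    }
    catch (JsonException)
    {
        // Unparseable output: fail soft and surface the raw text as the answer.
        return new AgentAction { Action = "answer_directly", ActionInput = llmResponse };
    }
}
```

&lt;p&gt;Failing soft matters here: a malformed response ends the loop with a visible answer instead of an exception.&lt;/p&gt;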

&lt;p&gt;The iteration limit (default &lt;code&gt;5&lt;/code&gt;) is also a trade-off. A higher limit allows more thorough answers but costs more tokens per query. For complex multi-document questions, 3–4 iterations are usually enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Implementation
&lt;/h2&gt;

&lt;p&gt;The complete source is open — .NET 8, Clean Architecture, Qdrant, OpenAI/Ollama/Azure OpenAI support:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building RAG in .NET and hitting the same ceiling, the agentic layer is the natural next step. It's additive — the existing &lt;code&gt;/api/chat&lt;/code&gt; endpoints are completely untouched.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>agents</category>
    </item>
    <item>
      <title>Standard RAG Is Blind — Building Multimodal RAG in .NET to Fix It</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 17 Mar 2026 03:10:29 +0000</pubDate>
      <link>https://dev.to/argha_dev/standard-rag-is-blind-building-multimodal-rag-in-net-to-fix-it-4998</link>
      <guid>https://dev.to/argha_dev/standard-rag-is-blind-building-multimodal-rag-in-net-to-fix-it-4998</guid>
      <description>&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;A developer builds a RAG system. A user uploads a 60-page service manual — dense with wiring diagrams, installation schematics, and annotated screenshots. They ask: &lt;em&gt;"How do I replace the filter assembly?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer is entirely in Figure 7.&lt;/p&gt;

&lt;p&gt;RAG returns three paragraphs of unrelated text. The image was never ingested. It does not exist to the system.&lt;/p&gt;

&lt;p&gt;This is not a bug. It is the expected behaviour of every standard RAG pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Standard RAG Fails on Images
&lt;/h2&gt;

&lt;p&gt;A standard RAG pipeline does one thing: convert text into searchable vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[PDF / DOCX Upload] --&amp;gt; B[Text Extraction]
    B --&amp;gt; C[Chunk]
    C --&amp;gt; D[Embed]
    D --&amp;gt; E[(Vector Store)]
    A -. images discarded .-&amp;gt; X[❌]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Images are either skipped entirely or reduced to their alt-text — which is usually empty. The pipeline was not designed to understand visual content. There is no text to extract from a schematic, no words to embed from a photograph, no paragraph to chunk from a technical diagram.&lt;/p&gt;

&lt;p&gt;The result: any knowledge that exists only in images is permanently invisible to retrieval. For documents like technical manuals, medical imaging reports, architectural drawings, or slide decks, this is not a minor gap. It is a fundamental failure of coverage.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Multimodal RAG Needs to Do Differently
&lt;/h2&gt;

&lt;p&gt;Three things must change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; — pull image bytes out of documents alongside text, not instead of text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Describe&lt;/strong&gt; — pass each image to a vision model and get back a text description that captures what the image &lt;em&gt;means&lt;/em&gt;, not just what it looks like&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve and Render&lt;/strong&gt; — when a retrieval query matches an image description, return both the description as context &lt;em&gt;and&lt;/em&gt; the original image to the user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that vision models act as a translation layer. They convert visual content into the semantic space that the rest of the RAG pipeline already understands. Chunking, embedding, and vector search require no changes. The pipeline gains a new input channel — it does not need a new architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The multimodal pipeline extends the standard RAG system at two seams: ingestion gains a parallel image track, and retrieval gains an image rendering step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[PDF / DOCX Upload] --&amp;gt; B[Text Extraction\nexisting]
    A --&amp;gt; C[Image Extraction\nPdfPig · OpenXml]
    B --&amp;gt; D[Chunk &amp;amp; Embed\nexisting]
    C --&amp;gt; E[Vision Model\nGPT-4o]
    E --&amp;gt; F[Image Description\ntext]
    E --&amp;gt; G[Image Bytes\nPostgreSQL]
    F --&amp;gt; H[Embed Description\nas chunk + imageId]
    D --&amp;gt; I[(Qdrant\nVector Store)]
    H --&amp;gt; I
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The upload triggers two parallel tracks. The text track is unchanged. The image track extracts raw bytes per page or document part, sends each to a vision model, stores the bytes in PostgreSQL, and embeds the returned description as a standard chunk — with one addition: the chunk carries an &lt;code&gt;imageId&lt;/code&gt; reference in its metadata.&lt;/p&gt;

&lt;p&gt;Image descriptions live in the same vector space as text chunks. They compete on equal terms during retrieval.&lt;/p&gt;
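&lt;p&gt;A sketch of that one addition, with illustrative type names for the chunk and the pipeline call:&lt;/p&gt;

```csharp
// Sketch of the image-track chunk: the vision model's description is embedded
// like any text chunk, plus an imageId pointing back to the stored bytes.
// DocumentChunk and EmbedAndStoreAsync are illustrative names.
var descriptionChunk = new DocumentChunk
{
    DocumentId = documentId,
    Text = imageDescription, // what the vision model returned
    Metadata = new Dictionary&lt;string, string&gt;
    {
        ["chunkType"]  = "image",
        ["imageId"]    = imageId.ToString(),    // key back to the bytes in PostgreSQL
        ["pageNumber"] = pageNumber.ToString()
    }
};
await _embeddingPipeline.EmbedAndStoreAsync(descriptionChunk, ct);
```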

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[User Query] --&amp;gt; B[Vector Search\nQdrant]
    B --&amp;gt; C{Chunk Type?}
    C --&amp;gt;|text| D[Text Context]
    C --&amp;gt;|image description| E[Image Description\n+ imageId]
    D --&amp;gt; F[LLM Response]
    E --&amp;gt; F
    E --&amp;gt; G[GET /api/images/id\nimage bytes]
    F --&amp;gt; H[Answer Text]
    G --&amp;gt; H
    H --&amp;gt; I[Chat UI\ntext + inline images]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieval requires no changes to the search layer. When a query matches an image-description chunk, the chunk's metadata surfaces the &lt;code&gt;imageId&lt;/code&gt;. A dedicated endpoint streams the image bytes from PostgreSQL. The chat UI renders the LLM answer alongside the relevant image — in the same response panel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Stage Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extract
&lt;/h3&gt;

&lt;p&gt;Two document types, two libraries, one output contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    PDF --&amp;gt; PdfPig --&amp;gt; ExtractedImage
    DOCX --&amp;gt; OpenXml --&amp;gt; ExtractedImage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PDF image extraction uses PdfPig's per-page image enumeration. DOCX extraction enumerates &lt;code&gt;MainDocumentPart.ImageParts&lt;/code&gt; via the OpenXml SDK. Both apply a 100×100px minimum dimension threshold — images below it are treated as decorative and skipped — and a 20MB safety cap. The output in both cases is an &lt;code&gt;ExtractedImage&lt;/code&gt; record carrying bytes, MIME type, and dimension metadata. Text and image extraction run on the same upload; no second pass is required.&lt;/p&gt;
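&lt;p&gt;A sketch of that shared contract and the filter both extractors apply. The field names are illustrative; the thresholds match the ones above.&lt;/p&gt;

```csharp
// The shared output contract and the filter both extractors apply.
// Field names are illustrative; the thresholds match the article.
public record ExtractedImage(byte[] Bytes, string MimeType, int Width, int Height);

static bool IsWorthIngesting(ExtractedImage img)
{
    const int MinDimension = 100;            // below this: likely decorative (icons, bullets)
    const long MaxBytes = 20L * 1024 * 1024; // 20MB safety cap

    return img.Width &gt;= MinDimension
        &amp;&amp; img.Height &gt;= MinDimension
        &amp;&amp; img.Bytes.LongLength &lt;= MaxBytes;
}
```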

&lt;h3&gt;
  
  
  Describe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    ExtractedImage --&amp;gt; B[IVisionService\nDescribeAsync] --&amp;gt; C[Text Description]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each extracted image is base64-encoded and sent to GPT-4o Vision via &lt;code&gt;IVisionService&lt;/code&gt;. The response is a plain-text description of what the image contains and means in context. This is the only pipeline stage that calls an external vision model. Descriptions are generated once at ingest time — not at query time — so retrieval latency is unaffected.&lt;/p&gt;
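&lt;p&gt;A sketch of the describe step, assuming a thin &lt;code&gt;_visionClient&lt;/code&gt; wrapper (the method name and prompt text are illustrative, not the repo's exact API):&lt;/p&gt;

```csharp
// Sketch of the describe step: encode the image as a data URL and ask the
// vision model for a context-aware description. _visionClient and its
// CompleteWithImageAsync method are assumed names for a thin wrapper.
public async Task&lt;string&gt; DescribeAsync(ExtractedImage img, CancellationToken ct)
{
    var dataUrl = $"data:{img.MimeType};base64,{Convert.ToBase64String(img.Bytes)}";

    // One call per image, at ingest time only; the underlying request uses the
    // chat completions API with an image_url content part pointing at dataUrl.
    return await _visionClient.CompleteWithImageAsync(
        prompt: "Describe what this image shows and what it means in the context " +
                "of a technical document. Include any labels, values, or steps visible in it.",
        imageDataUrl: dataUrl,
        ct);
}
```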

&lt;h3&gt;
  
  
  Store
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    ExtractedImage --&amp;gt; A[IImageStore] --&amp;gt; B[(PostgreSQL\nDocumentImages)]
    B --&amp;gt; C[imageId]
    C --&amp;gt; D[Chunk Metadata]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Image bytes are persisted to a &lt;code&gt;DocumentImages&lt;/code&gt; table in PostgreSQL via &lt;code&gt;IImageStore&lt;/code&gt;. The returned &lt;code&gt;imageId&lt;/code&gt; is attached to the description chunk before it enters the embedding pipeline. The bytes never travel to Qdrant — only the description text and the &lt;code&gt;imageId&lt;/code&gt; reference flow through the vector store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieve
&lt;/h3&gt;

&lt;p&gt;No change to the vector search layer. When a query matches an image-description chunk, the chunk's metadata carries &lt;code&gt;imageId&lt;/code&gt; and &lt;code&gt;pageNumber&lt;/code&gt;. The existing search response shape is extended with an optional image reference — source chunks now carry a type field (&lt;code&gt;text&lt;/code&gt; or &lt;code&gt;image&lt;/code&gt;) alongside the relevant text excerpt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Render
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;GET /api/images/{id}&lt;/code&gt; endpoint streams image bytes directly from PostgreSQL. The Blazor chat UI inspects each source chunk's type: text sources render as before, image sources fetch the endpoint and render the image inline. The user receives the LLM answer and the relevant diagram in the same response — no separate step, no external image hosting.&lt;/p&gt;
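&lt;p&gt;A minimal sketch of that endpoint, assuming &lt;code&gt;IImageStore&lt;/code&gt; exposes a lookup by id (the method and record shape are assumptions):&lt;/p&gt;

```csharp
// Minimal sketch of the image endpoint: look the bytes up by id and stream
// them back with the stored MIME type. IImageStore is the abstraction from
// the article; GetAsync and the returned record shape are assumed.
app.MapGet("/api/images/{id:guid}", async (Guid id, IImageStore store, CancellationToken ct) =&gt;
{
    var image = await store.GetAsync(id, ct);
    return image is null
        ? Results.NotFound()
        : Results.File(image.Bytes, image.MimeType); // binary response, straight from PostgreSQL
});
```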




&lt;h2&gt;
  
  
  GitHub
&lt;/h2&gt;

&lt;p&gt;The full source, issue tracker, and phase roadmap are public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop Blaming Your LLM: Fix RAG Retrieval Quality With Better Chunking in .NET</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Thu, 12 Mar 2026 08:11:14 +0000</pubDate>
      <link>https://dev.to/argha_dev/stop-blaming-your-llm-fix-rag-retrieval-quality-with-better-chunking-in-net-25ke</link>
      <guid>https://dev.to/argha_dev/stop-blaming-your-llm-fix-rag-retrieval-quality-with-better-chunking-in-net-25ke</guid>
      <description>&lt;p&gt;You swap to a better model. Still wrong answers. You tune your prompt. Still hallucinations. You increase the temperature — no, lower it — still garbage. Sound familiar?&lt;/p&gt;

&lt;p&gt;Here are three failure modes I hit repeatedly while building a RAG API in .NET:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The confident wrong answer.&lt;/strong&gt; The LLM states a fact with full certainty. The document says the opposite. You look at the retrieved chunk — it was cut in the middle of a sentence, and the half that made it into context was the setup, not the conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "I don't know" on an obvious question.&lt;/strong&gt; The user asks something the document clearly answers. The LLM shrugs. You trace it: the exact answer spans the last two words of chunk N and the first sentence of chunk N+1. Neither chunk scores high enough on its own to make the retrieval cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bloated non-answer.&lt;/strong&gt; The LLM returns 400 words of vague summary when the user needed a number. The retrieved chunk was an entire page. There were five relevant sentences in it and 900 tokens of noise.&lt;/p&gt;

&lt;p&gt;The LLM isn't the problem. The chunks are.&lt;/p&gt;




&lt;h2&gt;
  
  
  Root Cause: Chunk Boundaries Define Retrieval Quality
&lt;/h2&gt;

&lt;p&gt;In RAG, the pipeline works like this: you embed each chunk into a vector, store those vectors, and at query time you find the chunks most similar to the user's question. The LLM never sees your document — it sees only the chunks you hand it.&lt;/p&gt;

&lt;p&gt;This means every embedding is only as good as the text it encodes. And the text it encodes is defined entirely by where you drew the chunk boundaries.&lt;/p&gt;

&lt;p&gt;Too large: the chunk contains the answer buried in irrelevant context. The embedding drifts toward the noise. Token cost spikes. The LLM has to wade through padding to find the signal.&lt;/p&gt;

&lt;p&gt;Too small: the answer spans two chunks. Each chunk, alone, doesn't capture enough meaning to rank high. Both miss the similarity threshold. The answer is never retrieved.&lt;/p&gt;

&lt;p&gt;Wrong boundary: you cut mid-sentence. The embedding captures a dangling clause, not a complete thought. Semantic similarity breaks down.&lt;/p&gt;

&lt;p&gt;The defaults in this project — &lt;code&gt;ChunkSize: 1000&lt;/code&gt; characters, &lt;code&gt;ChunkOverlap: 200&lt;/code&gt; — are a starting point, not gospel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Application/Models/DocumentProcessingOptions.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DocumentProcessingOptions&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;DefaultChunkingStrategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Fixed"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ChunkSize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// characters&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ChunkOverlap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the number that matters more than the size is &lt;em&gt;how&lt;/em&gt; you draw the boundaries. That's what the three strategies below address.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Before diving into strategies, here's where chunking lives in the full upload flow. &lt;code&gt;DocumentService.UploadDocumentAsync&lt;/code&gt; runs four sequential steps: extract text from the raw file (PDF, DOCX, TXT, Markdown), chunk the text using the selected strategy, generate embeddings for every chunk, and upsert those embeddings into the vector store. Chunking is step 2 — everything after it depends on getting step 2 right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 1: Extract text from document&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_documentProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExtractTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 2: Chunk the text&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_documentProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ChunkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunkingOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 3: Generate embeddings for all chunks&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkTexts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 4: Store chunks in vector database&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UpsertChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_workspaceContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CollectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's look at each strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 1: Fixed-Size With Paragraph-Aware Overlap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; mixed document corpora, financial reports, legal docs, anything where you don't know the structure in advance.&lt;/p&gt;

&lt;p&gt;The "fixed" in the name is slightly misleading. This strategy doesn't blindly slice at character N. It splits at paragraph boundaries first, then accumulates paragraphs until adding the next paragraph would exceed &lt;code&gt;ChunkSize&lt;/code&gt;. At that point it saves the current chunk and begins the next one with &lt;code&gt;ChunkOverlap&lt;/code&gt; characters carried over from the tail of the previous chunk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkByFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SeparatorPattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringBuilder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraph&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt;&lt;span class="p"&gt;++,&lt;/span&gt; &lt;span class="p"&gt;...));&lt;/span&gt;

            &lt;span class="c1"&gt;// Start new chunk with overlap&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;overlapText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GetOverlapText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkOverlap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;overlapText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AppendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... flush final chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overlap is word-boundary aware — &lt;code&gt;GetOverlapText&lt;/code&gt; finds the first space after &lt;code&gt;text.Length - overlapSize&lt;/code&gt; rather than slicing at a raw character index. This prevents a chunk starting with &lt;code&gt;"...ompany reported a record"&lt;/code&gt; when it should start with &lt;code&gt;"company reported a record"&lt;/code&gt;.&lt;/p&gt;
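
&lt;p&gt;&lt;code&gt;GetOverlapText&lt;/code&gt; itself isn't shown in the excerpt above; a plausible reconstruction of the word-boundary logic (mine, not necessarily the repo's exact code) looks like this:&lt;/p&gt;

```csharp
using System;

// Reconstruction of the helper described above, not the repo's exact code:
// take roughly the last overlapSize characters, but advance past the first
// space so the overlap starts on a whole word instead of a sliced one.
static string GetOverlapText(string text, int overlapSize)
{
    if (overlapSize <= 0) return string.Empty;
    if (text.Length <= overlapSize) return text;

    var start = text.Length - overlapSize;
    var firstSpace = text.IndexOf(' ', start);

    // No space in the tail means one giant word; skip the overlap entirely.
    return firstSpace >= 0 ? text[(firstSpace + 1)..] : string.Empty;
}

Console.WriteLine(GetOverlapText("the company reported a record", 10)); // → a record
```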

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Predictable token budget — you know your maximum chunk size&lt;/li&gt;
&lt;li&gt;✅ Works on any document type without structure assumptions&lt;/li&gt;
&lt;li&gt;✅ Overlap means a sentence at a chunk boundary still has context in the next chunk&lt;/li&gt;
&lt;li&gt;❌ A single paragraph larger than &lt;code&gt;ChunkSize&lt;/code&gt; will still be split mid-paragraph&lt;/li&gt;
&lt;li&gt;❌ Overlap is character-based, not semantic — the carried-over text might not be the most relevant part&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; your default for any document corpus where you don't know the structure in advance. Switch to one of the targeted strategies once you know what you're ingesting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 2: Sentence-Based
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; factual Q&amp;amp;A over research papers, product manuals, FAQs — text where the answer to a question is typically one or two complete sentences.&lt;/p&gt;

&lt;p&gt;The key insight is that embedding quality peaks when the encoded text is a complete thought. A sentence is the smallest unit of complete meaning. This strategy splits on &lt;code&gt;.!?&lt;/code&gt; boundaries, accumulates sentences until the next sentence would overflow &lt;code&gt;ChunkSize&lt;/code&gt;, and carries the last sentence of the previous chunk into the next one as overlap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkBySentence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;@"(?&amp;lt;=[.!?])\s+"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringBuilder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastSentence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Empty&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// one-sentence overlap&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt;&lt;span class="p"&gt;++,&lt;/span&gt; &lt;span class="p"&gt;...));&lt;/span&gt;

            &lt;span class="c1"&gt;// Start next chunk with the last sentence as overlap&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastSentence&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastSentence&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;lastSentence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... flush final chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One-sentence overlap means the answer sentence is never the very first token of a chunk with no preceding context. The chunk before it and the chunk after it both have at least one connecting sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Embeddings capture complete semantic units — cosine similarity is more reliable&lt;/li&gt;
&lt;li&gt;✅ Best retrieval precision for direct factual questions&lt;/li&gt;
&lt;li&gt;❌ Chunk sizes vary wildly — a three-word sentence and a 200-word sentence get equal weight&lt;/li&gt;
&lt;li&gt;❌ The regex splitter breaks on abbreviations: &lt;code&gt;"Mr. Smith arrived"&lt;/code&gt; becomes two sentences. Same for &lt;code&gt;"e.g."&lt;/code&gt;, &lt;code&gt;"i.e."&lt;/code&gt;, decimal numbers. Good enough for most corpora; not production-grade for scientific text&lt;/li&gt;
&lt;/ul&gt;
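
&lt;p&gt;The abbreviation failure is easy to see by running the same split pattern directly:&lt;/p&gt;

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same split pattern as ChunkBySentence. "Mr." ends with a period followed
// by whitespace, so the lookbehind fires and one sentence becomes two.
var sentences = Regex.Split("Mr. Smith arrived. He sat down.", @"(?<=[.!?])\s+")
    .Where(s => !string.IsNullOrWhiteSpace(s))
    .ToArray();

Console.WriteLine(string.Join(" | ", sentences));
// → Mr. | Smith arrived. | He sat down.
```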

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; FAQ documents, product manuals, research papers, anything dense with discrete facts where users ask direct questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 3: Paragraph-Based
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; well-structured prose — internal wikis, policy PDFs, documentation sites — where a paragraph is a coherent topic unit.&lt;/p&gt;

&lt;p&gt;This is the simplest strategy: split on blank lines, make each paragraph exactly one chunk, no size cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkByParagraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;@"\n\n|\r\n\r\n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No size cap is intentional. The paragraph boundary &lt;em&gt;is&lt;/em&gt; the semantic boundary. Imposing an artificial size limit would require introducing mid-paragraph cuts, which is exactly the failure mode we're trying to avoid. You accept variable sizes in exchange for zero mid-thought cuts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Most semantically coherent chunks&lt;/li&gt;
&lt;li&gt;✅ Zero mid-thought cuts — every chunk is a complete idea&lt;/li&gt;
&lt;li&gt;❌ Sizes vary wildly — a one-liner and a 2000-word section get equal treatment&lt;/li&gt;
&lt;li&gt;❌ Very large paragraphs overflow the LLM context window. There is no safety size cap in this implementation — something to add if you're ingesting documents with monster paragraphs&lt;/li&gt;
&lt;/ul&gt;
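
&lt;p&gt;If you need that safety cap, one minimal approach (a sketch, not part of the repo) is to keep paragraphs intact when they fit and fall back to word-boundary slicing only for the oversized ones:&lt;/p&gt;

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch of a safety cap for the paragraph strategy, assuming the same
// blank-line split as ChunkByParagraph. Paragraphs that fit stay whole;
// only oversized ones are sliced, and always at a space.
static List<string> SplitWithCap(string text, int maxSize)
{
    var paragraphs = Regex.Split(text, @"\n\n|\r\n\r\n")
        .Select(p => p.Trim())
        .Where(p => !string.IsNullOrWhiteSpace(p));

    var pieces = new List<string>();
    foreach (var para in paragraphs)
    {
        if (para.Length <= maxSize) { pieces.Add(para); continue; }

        var remaining = para;
        while (remaining.Length > maxSize)
        {
            // Cut at the last space before the cap so no word is split.
            var cut = remaining.LastIndexOf(' ', maxSize);
            if (cut <= 0) cut = maxSize; // single giant word: hard cut
            pieces.Add(remaining[..cut].TrimEnd());
            remaining = remaining[cut..].TrimStart();
        }
        if (remaining.Length > 0) pieces.Add(remaining);
    }
    return pieces;
}

Console.WriteLine(string.Join(" | ", SplitWithCap("short one\n\nalpha beta gamma delta", 11)));
// → short one | alpha beta | gamma delta
```

&lt;p&gt;This keeps the "paragraph boundary is the semantic boundary" property for every paragraph that fits, and only compromises on the rare outlier.&lt;/p&gt;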

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; well-structured prose where the author already did the work of organizing information into coherent blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Boundary&lt;/th&gt;
&lt;th&gt;Overlap&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Watch out for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;td&gt;Paragraph&lt;/td&gt;
&lt;td&gt;Character (word-safe)&lt;/td&gt;
&lt;td&gt;Mixed/unknown docs, legal&lt;/td&gt;
&lt;td&gt;Long single paragraphs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentence&lt;/td&gt;
&lt;td&gt;Sentence &lt;code&gt;.!?&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Last sentence&lt;/td&gt;
&lt;td&gt;Factual Q&amp;amp;A, manuals, research&lt;/td&gt;
&lt;td&gt;Abbreviations, lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paragraph&lt;/td&gt;
&lt;td&gt;Blank line&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Structured prose, wikis, policy&lt;/td&gt;
&lt;td&gt;Huge paragraphs, no size cap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;The decision is simpler than it looks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't know your document structure?&lt;/strong&gt; → use &lt;code&gt;Fixed&lt;/code&gt;. It's the safe default and handles the widest range of inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users ask specific factual questions?&lt;/strong&gt; → use &lt;code&gt;Sentence&lt;/code&gt;. Precision beats coverage for Q&amp;amp;A workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documents are well-structured prose where paragraphs are deliberate?&lt;/strong&gt; → use &lt;code&gt;Paragraph&lt;/code&gt;. Let the author's structure do the work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixing document types across a workspace?&lt;/strong&gt; → use &lt;code&gt;Fixed&lt;/code&gt; as the default, and override per upload using the &lt;code&gt;chunkingStrategy&lt;/code&gt; parameter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is important. Every upload can override the default strategy without touching the server config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// DocumentService.UploadDocumentAsync signature&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;UploadDocumentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Stream&lt;/span&gt; &lt;span class="n"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ChunkingStrategy&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;chunkingStrategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// override per upload&lt;/span&gt;
    &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass &lt;code&gt;ChunkingStrategy.Sentence&lt;/code&gt; for a product manual, &lt;code&gt;ChunkingStrategy.Paragraph&lt;/code&gt; for a policy doc, and &lt;code&gt;null&lt;/code&gt; (uses config default) for everything else. The strategy is resolved at call time — no restart required.&lt;/p&gt;
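On the caller side, that per-upload choice can be as simple as a small heuristic that returns an override or defers to the default. A sketch, assuming nothing about the real codebase beyond the null-means-default convention (the `PickStrategy` helper and the file-name patterns are mine, not the repo's):

```csharp
using System;

// Caller-side sketch: pick a per-upload override from simple file-name
// heuristics; null defers to the server's configured default strategy.
// PickStrategy and the name patterns are illustrative, not the repo's API.
string? PickStrategy(string fileName) =>
    fileName.EndsWith("-manual.pdf")  ? "Sentence"  :
    fileName.EndsWith("-policy.docx") ? "Paragraph" :
    null; // everything else: config default

Console.WriteLine(PickStrategy("printer-manual.pdf") ?? "config default"); // Sentence
Console.WriteLine(PickStrategy("notes.txt") ?? "config default");          // config default
```

The same shape works with richer signals (content type, tags, document length); the point is that the decision lives with the upload, not in server config.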




&lt;h2&gt;
  
  
  Where This Lives
&lt;/h2&gt;

&lt;p&gt;All three strategies are implemented in &lt;a href="https://github.com/Argha713/dotnet-rag-api/blob/main/src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs" rel="noopener noreferrer"&gt;&lt;code&gt;DocumentProcessor.cs&lt;/code&gt;&lt;/a&gt; in the open-source &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;dotnet-rag-api&lt;/a&gt; project — a full RAG API built on .NET 8 with Clean Architecture, Qdrant for vector storage, and support for OpenAI, Azure OpenAI, or local Ollama as the AI provider.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;IDocumentProcessor&lt;/code&gt; interface is decoupled from the vector store: you can run it against Qdrant Cloud, Azure AI Search, or a local Qdrant instance and the chunking logic doesn't change. The same three strategies work regardless of which backend you use for embeddings and retrieval.&lt;/p&gt;

&lt;p&gt;If you're hitting retrieval quality issues, look at your chunks before you look at your model.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building Production-Grade RAG in .NET: Language Is Not a Barrier</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:03:02 +0000</pubDate>
      <link>https://dev.to/argha_dev/building-production-grade-rag-in-net-language-is-not-a-barrier-bdd</link>
      <guid>https://dev.to/argha_dev/building-production-grade-rag-in-net-language-is-not-a-barrier-bdd</guid>
      <description>&lt;h2&gt;
  
  
  Building Production-Grade RAG in .NET 8: Language Is Not a Barrier
&lt;/h2&gt;

&lt;p&gt;Every AI tutorial you find starts with Python. Every LangChain walkthrough, every vector database quickstart, every "build your own ChatGPT" guide — all Python. If you are a .NET developer, you are used to searching for a C# equivalent and finding either a thin wrapper someone wrote last week, a GitHub issue from 2022 asking "is there a .NET SDK?", or nothing at all.&lt;/p&gt;

&lt;p&gt;I got tired of that. So I built a full Retrieval-Augmented Generation (RAG) API in .NET 8 from scratch: Clean Architecture, Qdrant vector database, OpenAI/Azure OpenAI/Ollama provider switching, hybrid search with Reciprocal Rank Fusion, MMR re-ranking, multi-tenancy, Server-Sent Events streaming, a Blazor WASM frontend, and 279 tests. Deployed to Azure Container Apps and Azure Static Web Apps.&lt;/p&gt;

&lt;p&gt;This article walks through how I built it, and why .NET is a first-class citizen in the AI ecosystem — not a workaround.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live API:&lt;/strong&gt; &lt;a href="https://rag-api.calmsand-4a05cfa0.eastus.azurecontainerapps.io" rel="noopener noreferrer"&gt;https://rag-api.calmsand-4a05cfa0.eastus.azurecontainerapps.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live UI:&lt;/strong&gt; &lt;a href="https://ambitious-glacier-0b62ea10f.6.azurestaticapps.net" rel="noopener noreferrer"&gt;https://ambitious-glacier-0b62ea10f.6.azurestaticapps.net&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Is RAG and Why Does Architecture Matter?
&lt;/h2&gt;

&lt;p&gt;RAG is a pattern that improves LLM responses by grounding them in your own documents. Instead of relying on the model's training data, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingest&lt;/strong&gt; — parse documents, split into chunks, generate vector embeddings, store in a vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve&lt;/strong&gt; — embed the user's query, find the most similar chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — inject those chunks as context into the LLM prompt, get a grounded answer&lt;/li&gt;
&lt;/ol&gt;
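The three steps above can be compressed into a toy end-to-end sketch. A real system uses an embedding model and a vector database; here a bag-of-words vector over a fixed vocabulary and an in-memory list stand in, so only the pipeline shape is faithful, not the math:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy embedding: a 0/1 vector over a fixed vocabulary. Stands in for a
// real embedding model purely to make the pipeline runnable.
string[] vocab = { "refunds", "issued", "days", "warranty", "covers", "years", "long", "take" };

double[] Embed(string text)
{
    var words = text.ToLowerInvariant().Split(' ');
    return vocab.Select(w => words.Contains(w) ? 1.0 : 0.0).ToArray();
}

double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return na == 0 || nb == 0 ? 0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// 1. Ingest: chunk the "document", embed each chunk, store the pairs.
var store = new List<(double[] Vec, string Text)>();
foreach (var chunk in new[] { "refunds are issued within 14 days", "the warranty covers two years" })
    store.Add((Embed(chunk), chunk));

// 2. Retrieve: embed the query, rank stored chunks by similarity.
var queryVec = Embed("how long do refunds take");
var best = store.OrderByDescending(c => Cosine(queryVec, c.Vec)).First().Text;

// 3. Generate: inject the retrieved chunk as context (LLM call stubbed out).
Console.WriteLine($"Answer grounded in: \"{best}\"");
// prints: Answer grounded in: "refunds are issued within 14 days"
```

Every production concern in the rest of this article (chunking, hybrid search, re-ranking, multi-tenancy) is an elaboration of one of these three steps.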

&lt;p&gt;Most RAG tutorials implement this in ~50 lines of Python using LangChain. That is fine for a demo. For production — where you need testability, provider flexibility, multi-tenancy, and maintainability — architecture matters enormously. And that is where .NET's ecosystem genuinely shines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The .NET AI Ecosystem in 2025
&lt;/h2&gt;

&lt;p&gt;Before I show the implementation, let us be honest about the landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The packages you actually need:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB (Qdrant)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qdrant.Client 1.12.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Official .NET SDK, full gRPC support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF parsing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PdfPig 0.1.9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pure .NET, no native deps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOCX/XLSX parsing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DocumentFormat.OpenXml 3.0.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Microsoft's own SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL / EF Core&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL 8.0.8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rock solid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure AI Search&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Azure.Search.Documents 11.6.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Swap Qdrant for Azure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Serilog.AspNetCore 8.0.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Industry standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FluentValidation 11.9.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Better than DataAnnotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Microsoft.Extensions.Diagnostics.HealthChecks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Built-in, excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI / Ollama&lt;/td&gt;
&lt;td&gt;Direct &lt;code&gt;HttpClient&lt;/code&gt; calls&lt;/td&gt;
&lt;td&gt;You don't need Semantic Kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The notable absence: I did not use Microsoft Semantic Kernel. Semantic Kernel is a legitimate option, especially if you want abstractions over multiple AI providers and memory stores out of the box. I chose to build the abstractions myself for two reasons: (1) it makes the architecture explicit and teachable, and (2) it demonstrates that you do not need a framework — the primitives are sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is genuinely missing compared to Python:&lt;/strong&gt; LangChain's ecosystem of 300+ integrations. Python dominates experimental ML research. If you need to run a custom fine-tuned model or use bleeding-edge retrieval research, Python is still where that lives first. For production API work with mainstream providers and standard vector databases? .NET is fully capable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Clean Architecture Meets AI
&lt;/h2&gt;

&lt;p&gt;The project follows Clean Architecture strictly, with four layers plus a Blazor WASM front end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Domain          — entities only, zero dependencies
Application     — interfaces + services, depends on Domain
Infrastructure  — Qdrant, OpenAI, PostgreSQL/EF Core, parsers
Api             — ASP.NET Core controllers, middleware, Serilog
BlazorUI        — Blazor WASM frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters for AI systems specifically because &lt;strong&gt;the AI provider is an infrastructure detail&lt;/strong&gt;. Your business logic (how to chunk, how to rank results, what prompt template to use) should not be coupled to whether you are using OpenAI today and Azure OpenAI tomorrow.&lt;/p&gt;

&lt;p&gt;The Application layer defines two key interfaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Application/Interfaces/IEmbeddingService.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IEmbeddingService&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;]&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;Dimensions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Application/Interfaces/IChatService.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IChatService&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;IAsyncEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateResponseStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Infrastructure has three concrete implementations of each: &lt;code&gt;OpenAiChatService&lt;/code&gt;, &lt;code&gt;AzureOpenAiChatService&lt;/code&gt;, and &lt;code&gt;OllamaChatService&lt;/code&gt;. DI wires the correct one based on &lt;code&gt;appsettings.json&lt;/code&gt;. &lt;strong&gt;Switching providers is a config change, not a code change.&lt;/strong&gt;&lt;/p&gt;
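A sketch of what that config-driven switch amounts to. The real project does this inside the DI container reading `appsettings.json`; here a plain lookup table stands in for the container, and the model names are placeholders, not the repo's configuration:

```csharp
using System;
using System.Collections.Generic;

// Config-driven provider selection: the key comes from configuration, and
// each entry stands in for registering a concrete IChatService implementation.
// Lookup table and model names are illustrative, not the repo's DI setup.
var registrations = new Dictionary<string, Func<(string Provider, string Model)>>
{
    ["OpenAI"]      = () => ("OpenAiChatService", "gpt-4o-mini"),
    ["AzureOpenAI"] = () => ("AzureOpenAiChatService", "gpt-4o"),
    ["Ollama"]      = () => ("OllamaChatService", "llama3"),
};

// "A config change, not a code change": only this key varies per environment.
var configuredProvider = "Ollama"; // would be read from appsettings.json
var service = registrations[configuredProvider]();
Console.WriteLine($"{service.Provider} -> {service.Model}"); // prints: OllamaChatService -> llama3
```

Everything downstream of the interface (prompting, retrieval, streaming) is unaware of which entry was selected, which is the whole point of keeping the provider in Infrastructure.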




&lt;h2&gt;
  
  
  The Vector Store Abstraction
&lt;/h2&gt;

&lt;p&gt;This is where most RAG tutorials stop being useful — they assume a single collection in a single database. In a multi-tenant system, each workspace needs isolated storage.&lt;/p&gt;

&lt;p&gt;Here is the &lt;code&gt;IVectorStore&lt;/code&gt; interface (the real one from the codebase):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;/// All methods accept collectionName as the first parameter so callers&lt;/span&gt;
&lt;span class="c1"&gt;/// (Scoped services) can pass the workspace's collection without violating&lt;/span&gt;
&lt;span class="c1"&gt;/// the Singleton lifetime of the implementation.&lt;/span&gt;
&lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IVectorStore&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;DeleteCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;UpsertChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;KeywordSearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;DeleteDocumentChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;VectorStoreStats&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetStatsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a critical DI lifetime design decision embedded here. &lt;code&gt;IVectorStore&lt;/code&gt; is registered as &lt;strong&gt;Singleton&lt;/strong&gt; — it wraps a gRPC channel that should be long-lived. But the workspace context (which collection to use) is &lt;strong&gt;Scoped&lt;/strong&gt; (per HTTP request).&lt;/p&gt;

&lt;p&gt;The solution: pass &lt;code&gt;collectionName&lt;/code&gt; as an explicit first parameter to every method. Scoped services resolve it from &lt;code&gt;IWorkspaceContext.Current.CollectionName&lt;/code&gt; and pass it in. The Singleton never holds any per-request state. This avoids the classic "cannot consume scoped service from singleton" exception that catches .NET developers out.&lt;/p&gt;
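Reduced to its essentials, the lifetime rule looks like this. The names below are illustrative, not the repo's API; the one thing being demonstrated is that the long-lived store holds no per-request state, so every call must carry its workspace's collection name:

```csharp
using System;
using System.Collections.Generic;

// Stands in for the gRPC-backed singleton's side effects.
var upsertLog = new List<string>();

// "Singleton": stateless with respect to requests; the collection name
// arrives as an argument on every call, never as a stored field.
void UpsertChunks(string collectionName, string chunk)
    => upsertLog.Add($"{collectionName}: {chunk}");

// Two request-scoped callers from different workspaces share the one store
// safely, because the collection travels with each call.
UpsertChunks("workspace-a-chunks", "alpha"); // request from workspace A
UpsertChunks("workspace-b-chunks", "beta");  // request from workspace B

Console.WriteLine(string.Join(" | ", upsertLog));
// prints: workspace-a-chunks: alpha | workspace-b-chunks: beta
```

Had the collection name been a field on the store instead, the second request would have silently written into the first workspace's collection — the multi-tenant failure this design prevents.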

&lt;p&gt;Two implementations ship: &lt;code&gt;QdrantVectorStore&lt;/code&gt; and &lt;code&gt;AzureAiSearchVectorStore&lt;/code&gt;. Swap via config. Same interface, same tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RAG Pipeline: Hybrid Search + RRF Fusion + MMR Re-ranking
&lt;/h2&gt;

&lt;p&gt;Plain semantic (embedding) search has a known weakness: it finds conceptually similar chunks but misses exact keyword matches. For &lt;em&gt;"What is the RFC 2119 MUST keyword definition?"&lt;/em&gt;, semantic search surfaces documents about requirements in general, while keyword search finds the exact definition.&lt;/p&gt;

&lt;p&gt;Hybrid search solves this by running semantic and keyword search in parallel and fusing the results. The fusion algorithm is &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score(chunk) = Σ  1 / (60 + rank_in_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
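Plugging concrete ranks into the formula (k = 60, ranks 1-based) makes the fusion effect visible:

```csharp
using System;

// Worked RRF numbers with k = 60: a chunk found by both searches beats a
// chunk found by only one, even when its second-list rank is worse.
double Rrf(params int[] ranks)
{
    double score = 0;
    foreach (var rank in ranks) score += 1.0 / (60 + rank);
    return score;
}

var bothLists = Rrf(1, 3); // 1st in semantic, 3rd in keyword: 1/61 + 1/63 ≈ 0.0323
var oneList   = Rrf(1);    // 1st in semantic only:            1/61        ≈ 0.0164
Console.WriteLine(bothLists > oneList); // prints: True
```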



&lt;p&gt;A chunk ranked 1st in semantic and 3rd in keyword scores higher than one ranked 1st in only semantic. Here is the real implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;FuseWithRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;semanticResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;keywordResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SearchResult&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;)&amp;gt;();&lt;/span&gt;

    &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;rrfScore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rrfScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
                &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrfScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywordResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Values&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OrderByDescending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MMR (Maximal Marginal Relevance)&lt;/strong&gt; re-ranking tackles a different problem: result redundancy. If your top-5 chunks are all from the same paragraph of the same document, your LLM context window is wasted. MMR balances relevance against diversity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MMR(chunk) = λ · similarity(chunk, query) - (1-λ) · max_similarity(chunk, selected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
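&lt;p&gt;The formula translates into a short greedy loop. The sketch below is illustrative, not the project's actual &lt;code&gt;MmrReRanker&lt;/code&gt;; the &lt;code&gt;Candidate&lt;/code&gt; type, its &lt;code&gt;Embedding&lt;/code&gt; property, and the &lt;code&gt;Cosine&lt;/code&gt; helper are stand-ins for the real types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Greedy MMR: repeatedly pick the candidate with the best
// lambda-weighted balance of query relevance and novelty.
static List&lt;Candidate&gt; MmrSelect(
    IReadOnlyList&lt;Candidate&gt; candidates, float[] queryEmbedding, int topK, double lambda)
{
    var selected = new List&lt;Candidate&gt;();
    var remaining = new List&lt;Candidate&gt;(candidates);

    while (selected.Count &lt; topK &amp;&amp; remaining.Count &gt; 0)
    {
        Candidate? best = null;
        var bestScore = double.MinValue;

        foreach (var c in remaining)
        {
            var relevance = Cosine(c.Embedding, queryEmbedding);
            // Penalise similarity to anything already chosen.
            var redundancy = selected.Count == 0
                ? 0.0
                : selected.Max(s =&gt; Cosine(c.Embedding, s.Embedding));

            var mmr = lambda * relevance - (1 - lambda) * redundancy;
            if (mmr &gt; bestScore) { bestScore = mmr; best = c; }
        }

        selected.Add(best!);
        remaining.Remove(best!);
    }

    return selected;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;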



&lt;p&gt;Lambda controls the relevance/diversity tradeoff: a higher lambda favours relevance, a lower one favours diversity. The retrieval pipeline wires these together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In RagService.RetrieveChunksAsync&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;useHybrid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;useReRanking&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;// needs vectors for MMR&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Hybrid: run semantic + keyword in parallel&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;semanticTask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;useReRanking&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;KeywordSearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WhenAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticTask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;FuseWithRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidateCount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;useReRanking&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MmrReRanker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_searchOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MmrLambda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every option is per-request configurable. The caller passes &lt;code&gt;useHybridSearch&lt;/code&gt; and &lt;code&gt;useReRanking&lt;/code&gt; booleans; if null, the config default is used. This makes A/B testing retrieval strategies trivial.&lt;/p&gt;




&lt;h2&gt;
  
  
  Streaming with IAsyncEnumerable and SSE
&lt;/h2&gt;

&lt;p&gt;LLM responses are slow. Users notice. Server-Sent Events (SSE) let you stream tokens to the browser as they arrive from the model.&lt;/p&gt;

&lt;p&gt;The streaming pipeline uses &lt;code&gt;IAsyncEnumerable&amp;lt;string&amp;gt;&lt;/code&gt; all the way from the HTTP client to the controller response — no buffering, no polling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;IAsyncEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;StreamEvent&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChatStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;conversationHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EnumeratorCancellation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;searchResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;RetrieveChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

    &lt;span class="c1"&gt;// Yield sources FIRST — client renders citations while tokens stream in&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StreamEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;BuildSources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;BuildContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SystemPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Stream each LLM token as it arrives&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateResponseStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StreamEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller writes each event to the response with &lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;. The Blazor UI consumes it with &lt;code&gt;HttpClient&lt;/code&gt; + a manual SSE parser. No SignalR, no WebSockets, no extra infrastructure.&lt;/p&gt;
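&lt;p&gt;A minimal version of that controller endpoint might look like the following. The route, request type, and serializer call are assumptions, since the article only describes the behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;[HttpPost("stream")]
public async Task Stream([FromBody] ChatRequest request, CancellationToken ct)
{
    Response.ContentType = "text/event-stream";

    await foreach (var evt in _ragService.ChatStreamAsync(request.Query, request.History, ct))
    {
        // One SSE frame per event: "data: {json}\n\n"
        await Response.WriteAsync($"data: {JsonSerializer.Serialize(evt)}\n\n", ct);
        await Response.Body.FlushAsync(ct);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;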




&lt;h2&gt;
  
  
  Multi-tenancy: Per-Workspace Qdrant Collections
&lt;/h2&gt;

&lt;p&gt;The system supports multiple isolated workspaces. Each workspace gets its own Qdrant collection. A workspace is identified by an API key sent in the &lt;code&gt;X-Api-Key&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;The middleware pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ApiKeyMiddleware → resolves Workspace from DB by hashed key
                → sets IWorkspaceContext.Current for the request scope
                → all downstream services use Current.CollectionName
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
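&lt;p&gt;In code, that middleware is roughly the following sketch. The repository interface, &lt;code&gt;GetByKeyHashAsync&lt;/code&gt;, and the settable &lt;code&gt;Current&lt;/code&gt; property are assumed names; the real project's signatures may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public class ApiKeyMiddleware
{
    private readonly RequestDelegate _next;

    public ApiKeyMiddleware(RequestDelegate next) =&gt; _next = next;

    public async Task InvokeAsync(
        HttpContext context, IWorkspaceRepository workspaces, IWorkspaceContext workspaceContext)
    {
        if (!context.Request.Headers.TryGetValue("X-Api-Key", out var apiKey))
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        // Look the workspace up by the *hashed* key; plaintext is never stored.
        var workspace = await workspaces.GetByKeyHashAsync(
            WorkspaceService.ComputeSha256(apiKey.ToString()));
        if (workspace is null)
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        workspaceContext.Current = workspace;  // scoped to this request
        await _next(context);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;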



&lt;p&gt;&lt;code&gt;WorkspaceService.ComputeSha256(key)&lt;/code&gt; is the only place API keys are hashed before storage. The plaintext key is never persisted — only shown to the user once at creation. This mirrors standard API key security practices.&lt;/p&gt;
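&lt;p&gt;A hash helper of that shape is only a few lines. Hex encoding is an assumption here, since the article does not specify the output format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Security.Cryptography;
using System.Text;

public static class ApiKeyHasher
{
    // One-way hash: the stored value cannot be reversed into the key.
    public static string ComputeSha256(string key)
    {
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Convert.ToHexString(bytes);  // 64 hex characters
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;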

&lt;p&gt;When a workspace is created, &lt;code&gt;IVectorStore.EnsureCollectionAsync(collectionName)&lt;/code&gt; is called immediately, creating the Qdrant collection with the correct vector dimensions. When a workspace is deleted, &lt;code&gt;DeleteCollectionAsync(collectionName)&lt;/code&gt; cascades the cleanup. No manual Qdrant operations required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Qdrant Reliability: The Auto-Reinitialise Pattern
&lt;/h2&gt;

&lt;p&gt;Qdrant's managed cloud can delete a collection if it has been inactive (free tier). When that happens, every vector operation throws a gRPC &lt;code&gt;RpcException&lt;/code&gt; with &lt;code&gt;StatusCode.NotFound&lt;/code&gt;. A restart would fix it — but that is a terrible production experience.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;QdrantVectorStore&lt;/code&gt; implements an auto-reinitialise pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ExecuteWithReinitAsync&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RpcException&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogWarning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Collection {Name} not found, reinitialising..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_initLock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;_initLock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// retry once&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;SemaphoreSlim(1,1)&lt;/code&gt; ensures that concurrent requests hitting a missing collection do not trigger a thundering herd of reinitialise calls. The operation retries exactly once. No restart required. The collection is back in seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing AI Systems in .NET: 279 Tests
&lt;/h2&gt;

&lt;p&gt;This is where Python RAG tutorials truly fall short. Most ship with zero tests. Production software needs tests. Here is how AI-dependent code is tested in .NET:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mock interfaces, never concrete AI classes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Good — mock the interface&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;mockEmbedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Mock&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IEmbeddingService&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;span class="n"&gt;mockEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;It&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAny&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(),&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReturnsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Bad — RagService is a concrete class; instantiate it with mocked deps&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sut&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RagService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mockVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockChat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SearchOptions&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;mockWorkspaceContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EF Core InMemory for repository tests:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;DbContextOptionsBuilder&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RagApiDbContext&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UseInMemoryDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;NewGuid&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;// unique per test&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RagApiDbContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DocumentRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workspaceContext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test coverage breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests: &lt;code&gt;RagService&lt;/code&gt;, &lt;code&gt;DocumentService&lt;/code&gt;, &lt;code&gt;WorkspaceService&lt;/code&gt;, all repositories&lt;/li&gt;
&lt;li&gt;Controller tests: &lt;code&gt;ChatController&lt;/code&gt;, &lt;code&gt;DocumentsController&lt;/code&gt;, &lt;code&gt;WorkspacesController&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Middleware tests: &lt;code&gt;ApiKeyMiddleware&lt;/code&gt;, &lt;code&gt;GlobalExceptionMiddleware&lt;/code&gt;, &lt;code&gt;RateLimitMiddleware&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Infrastructure tests: &lt;code&gt;QdrantVectorStore&lt;/code&gt;, &lt;code&gt;AzureAiSearchVectorStore&lt;/code&gt;, all parsers&lt;/li&gt;
&lt;li&gt;Integration tests: chunking strategies, hybrid search, MMR re-ranking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test project targets &lt;code&gt;net10.0&lt;/code&gt; while the production code targets &lt;code&gt;net8.0&lt;/code&gt; — the test framework takes the newer runtime while production stays on the stable LTS version.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI/CD: GitHub Actions → Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Three workflows:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ci.yml&lt;/code&gt; — runs on every PR: &lt;code&gt;dotnet build&lt;/code&gt; + &lt;code&gt;dotnet test&lt;/code&gt;. PRs cannot merge without green CI.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;deploy.yml&lt;/code&gt; — runs on push to main: builds a Docker image, pushes to Azure Container Registry, deploys to Azure Container Apps via &lt;code&gt;az containerapp update&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;swa-deploy.yml&lt;/code&gt; — runs on push to main: deploys the Blazor WASM output to Azure Static Web Apps.&lt;/p&gt;

&lt;p&gt;One important Azure Container Apps gotcha: pushing a new &lt;code&gt;:latest&lt;/code&gt; image does not automatically restart existing revisions. You must force a new revision with &lt;code&gt;--revision-suffix&lt;/code&gt;. The CI pipeline does this explicitly.&lt;/p&gt;
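&lt;p&gt;The deploy step therefore looks something like this (resource and registry names are placeholders; the suffix just has to be unique per deploy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Force a new revision so the fresh :latest image is actually pulled
az containerapp update \
  --name rag-api \
  --resource-group rag-rg \
  --image ragregistry.azurecr.io/rag-api:latest \
  --revision-suffix "sha-${GITHUB_SHA::8}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;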

&lt;p&gt;The Dockerfile is multi-stage: a build stage with the .NET SDK image, a publish stage, and a final runtime stage using the ASP.NET Core runtime image (~220 MB).&lt;/p&gt;
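&lt;p&gt;In outline it looks like this (project names are placeholders, and the build and publish stages are collapsed into one here for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dockerfile"&gt;&lt;code&gt;# Build stage: full .NET SDK
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish src/RagApi/RagApi.csproj -c Release -o /app/publish

# Final stage: ASP.NET Core runtime only (much smaller image)
FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "RagApi.dll"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;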




&lt;h2&gt;
  
  
  What the Python World Has That We Need to Build
&lt;/h2&gt;

&lt;p&gt;Intellectual honesty: here is what I had to build or wire together that Python's ecosystem gives you out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Document loaders&lt;/strong&gt; — PdfPig, DocumentFormat.OpenXml, and a custom chunking pipeline. LangChain has 50+ loaders. We have three parsers. Good enough for 90% of use cases; extensible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model variety&lt;/strong&gt; — Python can swap to any HuggingFace model running locally. In .NET, Ollama is the practical local option. It works well, but your model selection is narrower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid prototyping&lt;/strong&gt; — Jupyter notebooks with real-time output remain Python's killer app for exploration. .NET interactive notebooks exist but are less mature.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For production API work, these gaps matter less than they sound.&lt;/p&gt;




&lt;h2&gt;
  
  
  What .NET Gets Right That Python Usually Doesn't
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type safety across the entire stack&lt;/strong&gt; — from the vector store interface to the controller to the DTO. No silent dict type errors at runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency injection that enforces architecture&lt;/strong&gt; — the DI lifetime system (Singleton/Scoped/Transient) makes lifetime violations a runtime error, not a subtle bug. Python has no equivalent guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IAsyncEnumerable&amp;lt;T&amp;gt;&lt;/code&gt; for streaming&lt;/strong&gt; — first-class language support for async streams makes the SSE pipeline clean and composable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EF Core migrations&lt;/strong&gt; — &lt;code&gt;MigrateAsync()&lt;/code&gt; at startup, automatic schema evolution, full LINQ query support. SQLAlchemy is good; EF Core is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testability by default&lt;/strong&gt; — interfaces, constructor injection, and Moq make mocking AI dependencies straightforward. The 279 tests run in ~8 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production runtime&lt;/strong&gt; — ASP.NET Core's performance is consistently near the top of TechEmpower benchmarks. Your RAG API will not be the bottleneck.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the interface, not the implementation.&lt;/strong&gt; &lt;code&gt;IVectorStore&lt;/code&gt;, &lt;code&gt;IChatService&lt;/code&gt;, and &lt;code&gt;IEmbeddingService&lt;/code&gt; were defined before any concrete implementation. This forced the architecture to stay clean and made every provider swappable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DI lifetimes are architecture decisions.&lt;/strong&gt; Making &lt;code&gt;IVectorStore&lt;/code&gt; Singleton was the right call (gRPC channel reuse), but it forced the &lt;code&gt;collectionName&lt;/code&gt; parameter pattern. Understanding &lt;em&gt;why&lt;/em&gt; that tradeoff exists is more important than the pattern itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid search is not optional.&lt;/strong&gt; Pure semantic search fails on exact terms, acronyms, and proper nouns. Hybrid with RRF costs one extra Qdrant call per query and meaningfully improves recall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your retrieval separately from your LLM.&lt;/strong&gt; &lt;code&gt;RagService.SearchAsync&lt;/code&gt; is a pure retrieval method that returns ranked chunks with no LLM call. Write tests against it. Your prompting and your retrieval are separate problems.&lt;/p&gt;
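&lt;p&gt;Such a test needs no LLM behaviour at all. A sketch using xUnit and Moq, with assumed signatures (&lt;code&gt;CreateRagService&lt;/code&gt; is a hypothetical helper that wires the other mocks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;[Fact]
public async Task SearchAsync_ReturnsRankedChunks_WithoutTouchingTheLlm()
{
    var mockChat = new Mock&lt;IChatService&gt;();      // will assert zero calls
    var sut = CreateRagService(chat: mockChat);   // other deps mocked inside

    var results = await sut.SearchAsync("overlapping risk factors", topK: 5);

    Assert.True(results.Count &lt;= 5);
    // Retrieval quality is asserted on ordering, not on generated text
    Assert.Equal(results.OrderByDescending(r =&gt; r.Score), results);
    mockChat.VerifyNoOtherCalls();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;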

&lt;p&gt;&lt;strong&gt;The Python/AI ecosystem is ahead on research, not on production engineering.&lt;/strong&gt; For a maintainable, tested, observable API that a .NET team can own and operate — .NET is the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 11 — Agentic RAG:&lt;/strong&gt; A ReAct loop where the agent plans retrieval, chooses tools (&lt;code&gt;search_documents&lt;/code&gt;, &lt;code&gt;compare_chunks&lt;/code&gt;, &lt;code&gt;answer_directly&lt;/code&gt;), and reasons across iterations. &lt;code&gt;POST /api/agent/query&lt;/code&gt; + SSE streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 12 — Expanded Document Intelligence:&lt;/strong&gt; URL ingestion, XLSX/CSV parsing, PDF table extraction, auto-tagging via LLM, document summarization with caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 13 — Analytics &amp;amp; Observability:&lt;/strong&gt; QueryLog entity, cost estimation per query, OpenTelemetry + Application Insights, Prometheus &lt;code&gt;/metrics&lt;/code&gt; for Grafana.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is buildable in .NET. All of it will have tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The API is live. The UI is live. The code is open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are a .NET developer who has been told that AI work requires Python — I hope this project makes that claim feel a lot less true.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading. If this was useful, drop a reaction or a comment — it helps.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>ai</category>
      <category>rag</category>
    </item>
    <item>
      <title>SQLite on Azure Files SMB: A Debugging Story With a Humbling Ending</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 03 Mar 2026 17:59:38 +0000</pubDate>
      <link>https://dev.to/argha_dev/sqlite-on-azure-files-smb-a-debugging-story-with-a-humbling-ending-1p93</link>
      <guid>https://dev.to/argha_dev/sqlite-on-azure-files-smb-a-debugging-story-with-a-humbling-ending-1p93</guid>
      <description>&lt;h1&gt;
  
  
  SQLite on Azure Files SMB: A Debugging Story With a Humbling Ending
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Three hours of debugging. One line of code. A lesson I won't forget.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was building a &lt;strong&gt;.NET 8 RAG API&lt;/strong&gt; deployed to &lt;strong&gt;Azure Container Apps&lt;/strong&gt;. The idea was simple: use SQLite with EF Core to store document chunks and embeddings, mount the database file from &lt;strong&gt;Azure Files&lt;/strong&gt; so it would survive container restarts.&lt;/p&gt;

&lt;p&gt;Simple plan. Clean architecture. What could go wrong?&lt;/p&gt;

&lt;p&gt;Everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;The container kept restarting. Over and over. The log was painfully consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQLite Error 5: database is locked.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No crash stack. No underlying exception. Just that one line, mocking me every single time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 1 — Retry Logic
&lt;/h2&gt;

&lt;p&gt;My first instinct: this is a transient lock. Maybe another process briefly holds it during startup. Classic race condition.&lt;/p&gt;

&lt;p&gt;I added exponential backoff — &lt;code&gt;2s → 4s → 8s → 16s → 32s&lt;/code&gt; — with five retries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;retryPolicy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SqliteException&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SqliteErrorCode&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitAndRetryAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sleepDurationProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TimeSpan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FromSeconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;onRetry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeSpan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogWarning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SQLite locked. Retry {Attempt} in {Delay}s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeSpan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalSeconds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container dutifully retried five times, logged politely, then gave up with the same error.&lt;/p&gt;

&lt;p&gt;Not a transient lock. Something deeper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 2 — Journal Mode
&lt;/h2&gt;

&lt;p&gt;I started digging. EF Core's &lt;code&gt;EnsureCreatedAsync&lt;/code&gt; runs &lt;code&gt;PRAGMA journal_mode = 'wal'&lt;/code&gt; when it creates a new database. WAL (Write-Ahead Logging) mode requires &lt;strong&gt;memory-mapped I/O&lt;/strong&gt; and &lt;strong&gt;POSIX byte-range locking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Azure Files SMB supports neither.&lt;/p&gt;

&lt;p&gt;The write would hang for 30 seconds, then time out with — you guessed it — &lt;code&gt;SQLite Error 5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Solution: set &lt;code&gt;Journal Mode=Delete&lt;/code&gt; in the connection string to prevent WAL from ever being set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Source=/mnt/azure/ragapi.db;Journal Mode=Delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Except:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ArgumentException: Connection string keyword 'journal mode' is not supported.
Microsoft.Data.Sqlite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Microsoft.Data.Sqlite&lt;/code&gt; only accepts a handful of valid keywords. &lt;code&gt;Journal Mode&lt;/code&gt; is not one of them. You have to set it via &lt;code&gt;PRAGMA&lt;/code&gt; after opening the connection.&lt;/p&gt;

&lt;p&gt;Dead end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 3 — Pre-Create the Database File
&lt;/h2&gt;

&lt;p&gt;EF Core skips the &lt;code&gt;Create()&lt;/code&gt; step (which sets WAL mode) if the database file &lt;strong&gt;already exists&lt;/strong&gt;. So I came up with a workaround:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a raw &lt;code&gt;SqliteConnection&lt;/code&gt; before &lt;code&gt;EnsureCreatedAsync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manually run &lt;code&gt;PRAGMA journal_mode=DELETE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Close it&lt;/li&gt;
&lt;li&gt;Let EF Core run &lt;code&gt;EnsureCreatedAsync&lt;/code&gt; — it sees the file, skips &lt;code&gt;Create()&lt;/code&gt;, goes straight to &lt;code&gt;CreateTablesAsync()&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Pre-create the DB and set journal mode before EF Core touches it&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SqliteConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OpenAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateCommand&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommandText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"PRAGMA journal_mode=DELETE;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteNonQueryAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CloseAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Now let EF Core take over&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;dbContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EnsureCreatedAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EF Core skipped &lt;code&gt;Create()&lt;/code&gt;. ✅&lt;/p&gt;

&lt;p&gt;It still failed. ❌&lt;/p&gt;

&lt;p&gt;Turns out &lt;code&gt;COMMIT&lt;/code&gt; inside &lt;code&gt;CreateTablesAsync()&lt;/code&gt; &lt;strong&gt;also&lt;/strong&gt; hits the SMB locking issue. The problem wasn't just WAL mode — it was the entire SMB locking model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix — Accept Reality
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Source=ragapi.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Switch to local ephemeral storage inside the container. No mount, no SMB, no network. Standard POSIX locking. Works instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// appsettings.json&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"ConnectionStrings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"DefaultConnection"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Data Source=ragapi.db"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documents reset on container restart. For a demo, that's perfectly fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Went Wrong
&lt;/h2&gt;

&lt;p&gt;SQLite's locking model is built on one assumption: &lt;strong&gt;a local filesystem with proper POSIX byte-range locking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;SMB (Server Message Block) — the protocol Azure Files uses — doesn't provide that. Neither do many NFS implementations. Neither do most cloud-mounted volumes.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;documented behaviour&lt;/strong&gt;, not a bug. Right there in the SQLite FAQ:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"SQLite uses reader/writer locks to control access to the database. [...] If the filesystem does not support POSIX advisory locks, SQLite cannot properly serialize concurrent database accesses."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn't read it. I assumed "it's just a file, it'll work anywhere." It doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Options for SQLite Persistence on Serverless Containers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1 — Accept Ephemeral Storage
&lt;/h3&gt;

&lt;p&gt;Use local storage (&lt;code&gt;Data Source=ragapi.db&lt;/code&gt;). For demos, prototypes, or read-heavy apps where data can be regenerated, this is the simplest and most reliable choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2 — Startup Migration from Blob Storage
&lt;/h3&gt;

&lt;p&gt;On container startup, copy the database file from Azure Blob Storage to local disk, use it locally, and optionally write it back on shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Startup: copy DB from blob to local&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;BlobClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"databases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;DownloadToAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/app/ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// On graceful shutdown: push it back&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UploadAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/app/ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
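&lt;p&gt;One way to wire the write-back step is the host's shutdown hook. A sketch, assuming an ASP.NET Core minimal-API app and the &lt;code&gt;Azure.Storage.Blobs&lt;/code&gt; package; the configuration key and paths are illustrative:&lt;/p&gt;

```csharp
using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Illustrative config key for the storage account connection string.
var storageConnection = builder.Configuration["Storage:ConnectionString"];

var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();
lifetime.ApplicationStopping.Register(() =>
{
    // Deliberately synchronous: ApplicationStopping callbacks should finish
    // before the host continues shutting down, or the container may exit
    // before the upload completes.
    var blobClient = new BlobClient(storageConnection, "databases", "ragapi.db");
    blobClient.Upload("/app/ragapi.db", overwrite: true);
});

app.Run();
```

&lt;p&gt;Caveat: a container that is killed without a graceful shutdown (OOM, abrupt scale-in) never runs this callback, so the most recent writes can still be lost. Option 2 softens the durability problem; it doesn't solve it.&lt;/p&gt;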



&lt;h3&gt;
  
  
  Option 3 — Switch to a Proper Networked Database
&lt;/h3&gt;

&lt;p&gt;If you need persistence, concurrency, and reliability in a containerised environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL&lt;/strong&gt; (managed SQL Server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Database for PostgreSQL&lt;/strong&gt; with &lt;code&gt;pgvector&lt;/code&gt; extension (great for RAG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Cosmos DB&lt;/strong&gt; (document storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EF Core supports all of these with minimal provider swapping.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;SQLite and network file systems are fundamentally incompatible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not partially. Not in certain modes. &lt;strong&gt;Fundamentally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lost three hours to a problem that was documented, well-known, and completely avoidable. The fix was one line. The lesson cost an afternoon.&lt;/p&gt;

&lt;p&gt;If you're building RAG APIs in .NET, my recommendation: &lt;strong&gt;start with PostgreSQL + pgvector&lt;/strong&gt;. It's containerisation-friendly, Azure-native, and EF Core's pgvector support (via &lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL&lt;/code&gt; plus the &lt;code&gt;Pgvector.EntityFrameworkCore&lt;/code&gt; package) is excellent.&lt;/p&gt;

&lt;p&gt;SQLite is a phenomenal database — for local dev, testing, and embedded apps. Just not on a network mount.&lt;/p&gt;
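&lt;p&gt;For a sense of what the PostgreSQL route looks like, here is a minimal pgvector mapping. A sketch, assuming the &lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL&lt;/code&gt; and &lt;code&gt;Pgvector.EntityFrameworkCore&lt;/code&gt; packages; the entity shape, connection string, and 1536 dimension are illustrative:&lt;/p&gt;

```csharp
using Microsoft.EntityFrameworkCore;
using Pgvector;
using Pgvector.EntityFrameworkCore;

public class Chunk
{
    public int Id { get; set; }
    public string Text { get; set; } = "";
    public Vector Embedding { get; set; } = null!; // pgvector column
}

public class RagDbContext : DbContext
{
    public DbSet<Chunk> Chunks => Set<Chunk>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        // Illustrative connection string; UseVector() enables pgvector mapping.
        => options.UseNpgsql("Host=localhost;Database=rag", o => o.UseVector());

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.HasPostgresExtension("vector");
        modelBuilder.Entity<Chunk>()
            .Property(c => c.Embedding)
            .HasColumnType("vector(1536)"); // match your embedding model's dimension
    }
}
```

&lt;p&gt;Nearest-neighbour retrieval then becomes an ordinary LINQ &lt;code&gt;OrderBy&lt;/code&gt; on &lt;code&gt;Embedding.CosineDistance(...)&lt;/code&gt;, executed inside PostgreSQL — no network-mounted database file, and no locking model to fight.&lt;/p&gt;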




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I tried&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exponential backoff retry&lt;/td&gt;
&lt;td&gt;❌ Not a transient lock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Journal Mode=Delete&lt;/code&gt; in connection string&lt;/td&gt;
&lt;td&gt;❌ Not a valid &lt;code&gt;Microsoft.Data.Sqlite&lt;/code&gt; keyword&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-create DB + set &lt;code&gt;PRAGMA journal_mode=DELETE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;❌ &lt;code&gt;COMMIT&lt;/code&gt; still fails on SMB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch to local &lt;code&gt;Data Source=ragapi.db&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅ Works instantly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Azure Files SMB doesn't support POSIX byte-range locking. SQLite requires it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use local ephemeral storage, copy from Blob on startup, or switch to PostgreSQL.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: &lt;code&gt;dotnet&lt;/code&gt; &lt;code&gt;csharp&lt;/code&gt; &lt;code&gt;sqlite&lt;/code&gt; &lt;code&gt;azure&lt;/code&gt; &lt;code&gt;rag&lt;/code&gt; &lt;code&gt;entityframework&lt;/code&gt; &lt;code&gt;debugging&lt;/code&gt; &lt;code&gt;cloudnative&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>azure</category>
      <category>sqlite</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
