<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Argha Sarkar</title>
    <description>The latest articles on DEV Community by Argha Sarkar (@argha_dev).</description>
    <link>https://dev.to/argha_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3801528%2Fbf7a13ef-386f-4cac-92cb-dae0117c309d.jpg</url>
      <title>DEV Community: Argha Sarkar</title>
      <link>https://dev.to/argha_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/argha_dev"/>
    <language>en</language>
    <item>
      <title>I Built a RAG System. Then I Broke It With One Question!</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 24 Mar 2026 08:09:12 +0000</pubDate>
      <link>https://dev.to/argha_dev/i-built-a-rag-system-then-i-broke-it-with-one-question-4aan</link>
      <guid>https://dev.to/argha_dev/i-built-a-rag-system-then-i-broke-it-with-one-question-4aan</guid>
      <description>&lt;p&gt;I was testing my own RAG application.&lt;/p&gt;

&lt;p&gt;I'd spent weeks building it — .NET 8, Qdrant, OpenAI, Clean Architecture. It worked well. Upload documents, ask questions, get cited answers. I was happy with it.&lt;/p&gt;

&lt;p&gt;So I loaded up some public annual reports and research papers, and started stress-testing it.&lt;/p&gt;

&lt;p&gt;Most answers were solid. Then I asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What are the common risk factors mentioned across these annual reports, and do any of them overlap?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system responded in seconds. Confident. Cited. Clean.&lt;/p&gt;

&lt;p&gt;But when I cross-checked manually, I realised it had only pulled chunks from one report. The others hadn't been touched. No warning. No caveat. Just a quietly incomplete answer dressed up as a complete one.&lt;/p&gt;

&lt;p&gt;That was the moment I stopped and thought: this isn't a retrieval bug. This is an architectural ceiling.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Single-Shot RAG Actually Does
&lt;/h2&gt;

&lt;p&gt;Here's the pipeline most RAG systems run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question
    → Generate embedding
    → Vector search (one query, one pass)
    → Take top-K chunks
    → Stuff into prompt
    → Generate answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's fast, cheap, and works well for direct factual questions. &lt;em&gt;"What is the refund policy?"&lt;/em&gt; — great. One search finds it.&lt;/p&gt;

&lt;p&gt;But for anything that requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searching across multiple documents with different angles&lt;/li&gt;
&lt;li&gt;Comparing information from two sources&lt;/li&gt;
&lt;li&gt;First understanding what documents exist, then drilling in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...single-shot RAG fails silently. The LLM gets whatever the one search returned and does its best. It has no way to say "I think I need more context from a different source." It just answers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Give the System the Ability to Think Before It Answers
&lt;/h2&gt;

&lt;p&gt;What I needed wasn't better retrieval. I needed the system to &lt;strong&gt;plan&lt;/strong&gt; its retrieval.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;ReAct pattern&lt;/strong&gt; — Reason + Act. Instead of a fixed pipeline, the agent runs a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reason about what to do next
    → Act (call a tool)
    → Observe the result
    → Reason again
    → Act again
    → ... until it has enough to answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each step, the agent decides: do I have enough information, or do I need to search more?&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Implemented It
&lt;/h2&gt;

&lt;p&gt;The agent is powered by the same LLM already in the stack. The trick is the system prompt — instead of asking the LLM to answer the question directly, you tell it to output a structured decision at every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;At&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;EVERY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;step,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ONLY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thought"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your reasoning about what to do next"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents | get_document_summary | compare_chunks | answer_directly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action_input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"input for the chosen action"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sends this prompt + the conversation history to the LLM&lt;/li&gt;
&lt;li&gt;Parses the JSON response&lt;/li&gt;
&lt;li&gt;Executes the chosen tool&lt;/li&gt;
&lt;li&gt;Appends the result to the conversation history&lt;/li&gt;
&lt;li&gt;Repeats — until the agent calls &lt;code&gt;answer_directly&lt;/code&gt; or hits the iteration limit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the core loop in C#, simplified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxIterations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;llmResponse&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ParseAgentAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llmResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;AgentStep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;StepType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reasoning&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thought&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"answer_directly"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;finalAnswer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ActionInput&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;toolResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ActionInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llmResponse&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;$"Tool result: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;toolResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. The LLM drives the loop. The code just executes whatever it decides.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Tools
&lt;/h2&gt;

&lt;p&gt;The agent has four tools to choose from:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector search — the same semantic search the existing RAG system uses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_document_summary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retrieves chunks for a document and asks the LLM to summarise it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_chunks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Takes two text segments and asks the LLM to identify agreements and contradictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;answer_directly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Signals that the agent has enough context and is ready to answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't new infrastructure. They're thin wrappers over what already existed. The intelligence is in the loop, not the tools.&lt;/p&gt;
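&lt;p&gt;A sketch of what that dispatch can look like: one switch from action name to existing service. The &lt;code&gt;ToolResult&lt;/code&gt; type and the injected service names here are illustrative, not the exact ones in the repo.&lt;/p&gt;

```csharp
// Illustrative dispatcher: the agent's chosen action routes to services the
// RAG system already has. ToolResult and the injected services are assumed names.
public async Task&lt;ToolResult&gt; ExecuteAsync(string action, string input, CancellationToken ct)
{
    return action switch
    {
        "search_documents"     =&gt; await _searchService.SearchAsync(input, topK: 5, ct),
        "get_document_summary" =&gt; await _summaryService.SummariseAsync(input, ct),
        "compare_chunks"       =&gt; await _compareService.CompareAsync(input, ct),
        // Unknown tool: fail soft so the agent can read the error and recover.
        _ =&gt; new ToolResult { Text = $"Unknown tool: {action}" }
    };
}
```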




&lt;h2&gt;
  
  
  The Reasoning Trace
&lt;/h2&gt;

&lt;p&gt;Every response from the agent includes the full reasoning trace — a step-by-step log of every decision it made:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Both reports flag supply chain disruption as a key risk..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iterationsUsed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxIterationsReached"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I need to search for risk factors in the first report"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolInput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"risk factors annual report 2023"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Supply chain disruption, interest rate exposure..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Now I need to check the second report for overlap"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_documents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolInput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"risk factors annual report 2024"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Supply chain risk, inflation, regulatory pressure..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stepType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"toolOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Both reports flag supply chain disruption..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just debugging information. It's the answer to the question every enterprise user eventually asks: &lt;em&gt;"How did it arrive at this?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;The question that broke the original system — &lt;em&gt;"What risk factors overlap across these reports?"&lt;/em&gt; — now works correctly. The agent searches each report separately, compares the results, and synthesises a grounded answer.&lt;/p&gt;

&lt;p&gt;More importantly, it tells you exactly how it got there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two new endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/agent/query&lt;/code&gt; — runs the full loop, returns the complete response + trace&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/agent/stream&lt;/code&gt; — SSE stream, so you can watch the agent reason in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And one safety valve: with &lt;code&gt;Agent:Enabled = false&lt;/code&gt;, the endpoints return a &lt;code&gt;503&lt;/code&gt; immediately and no AI calls are made. Useful for cost control.&lt;/p&gt;
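&lt;p&gt;A minimal sketch of that guard, assuming the flag is read through &lt;code&gt;IConfiguration&lt;/code&gt; (the endpoint wiring and type names are illustrative):&lt;/p&gt;

```csharp
// Sketch of the kill switch: check the flag before anything touches a model.
// The config key matches the article; AgentRequest and AgentService are assumed names.
app.MapPost("/api/agent/query", async (AgentRequest req, IConfiguration config,
                                       AgentService agent, CancellationToken ct) =&gt;
{
    if (!config.GetValue&lt;bool&gt;("Agent:Enabled"))
        return Results.StatusCode(StatusCodes.Status503ServiceUnavailable); // no AI calls made

    var result = await agent.RunAsync(req.Question, ct);
    return Results.Ok(result);
});
```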




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The weakest part is JSON parsing. LLMs — especially smaller local models via Ollama — don't always produce clean JSON. I added fallback handling (strip code fences, fall back to &lt;code&gt;answer_directly&lt;/code&gt; if parsing fails entirely), but a production system would benefit from structured output / function calling if the model supports it.&lt;/p&gt;
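&lt;p&gt;A minimal sketch of that fallback, with &lt;code&gt;AgentAction&lt;/code&gt; as an assumed record type:&lt;/p&gt;

```csharp
using System.Text.Json;

// Sketch of the fallback parse: strip markdown fences, then try JSON; if that
// still fails, treat the whole response as a direct answer. AgentAction is an
// assumed record with Thought, Action, and ActionInput properties.
static AgentAction ParseAgentAction(string llmResponse)
{
    var text = llmResponse.Trim();

    // Smaller models often wrap JSON in ```json ... ``` fences; strip them.
    if (text.StartsWith("```", StringComparison.Ordinal))
    {
        var firstNewline = text.IndexOf('\n');
        if (firstNewline &gt;= 0) text = text[(firstNewline + 1)..];
        var closingFence = text.LastIndexOf("```", StringComparison.Ordinal);
        if (closingFence &gt;= 0) text = text[..closingFence];
    }

    try
    {
        return JsonSerializer.Deserialize&lt;AgentAction&gt;(text,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;
    }
    catch (JsonException)
    {
        // Unparseable output: fail soft and surface the raw text as the answer.
        return new AgentAction { Action = "answer_directly", ActionInput = llmResponse };
    }
}
```

&lt;p&gt;Failing soft matters here: a malformed response ends the loop with a visible answer instead of an exception.&lt;/p&gt;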

&lt;p&gt;The iteration limit (default &lt;code&gt;5&lt;/code&gt;) is also a trade-off. A higher limit allows more thorough answers but costs more tokens per query. For complex multi-document questions, 3–4 iterations are usually enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Implementation
&lt;/h2&gt;

&lt;p&gt;The complete source is open — .NET 8, Clean Architecture, Qdrant, OpenAI/Ollama/Azure OpenAI support:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building RAG in .NET and hitting the same ceiling, the agentic layer is the natural next step. It's additive — the existing &lt;code&gt;/api/chat&lt;/code&gt; endpoints are completely untouched.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>agents</category>
    </item>
    <item>
      <title>Standard RAG Is Blind — Building Multimodal RAG in .NET to Fix It</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 17 Mar 2026 03:10:29 +0000</pubDate>
      <link>https://dev.to/argha_dev/standard-rag-is-blind-building-multimodal-rag-in-net-to-fix-it-4998</link>
      <guid>https://dev.to/argha_dev/standard-rag-is-blind-building-multimodal-rag-in-net-to-fix-it-4998</guid>
      <description>&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;A developer builds a RAG system. A user uploads a 60-page service manual — dense with wiring diagrams, installation schematics, and annotated screenshots. They ask: &lt;em&gt;"How do I replace the filter assembly?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer is entirely in Figure 7.&lt;/p&gt;

&lt;p&gt;RAG returns three paragraphs of unrelated text. The image was never ingested. It does not exist to the system.&lt;/p&gt;

&lt;p&gt;This is not a bug. It is the expected behaviour of every standard RAG pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Standard RAG Fails on Images
&lt;/h2&gt;

&lt;p&gt;A standard RAG pipeline does one thing: convert text into searchable vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[PDF / DOCX Upload] --&amp;gt; B[Text Extraction]
    B --&amp;gt; C[Chunk]
    C --&amp;gt; D[Embed]
    D --&amp;gt; E[(Vector Store)]
    A -. images discarded .-&amp;gt; X[❌]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Images are either skipped entirely or reduced to their alt-text — which is usually empty. The pipeline was not designed to understand visual content. There is no text to extract from a schematic, no words to embed from a photograph, no paragraph to chunk from a technical diagram.&lt;/p&gt;

&lt;p&gt;The result: any knowledge that exists only in images is permanently invisible to retrieval. For documents like technical manuals, medical imaging reports, architectural drawings, or slide decks, this is not a minor gap. It is a fundamental failure of coverage.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Multimodal RAG Needs to Do Differently
&lt;/h2&gt;

&lt;p&gt;Three things must change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; — pull image bytes out of documents alongside text, not instead of text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Describe&lt;/strong&gt; — pass each image to a vision model and get back a text description that captures what the image &lt;em&gt;means&lt;/em&gt;, not just what it looks like&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve and Render&lt;/strong&gt; — when a retrieval query matches an image description, return both the description as context &lt;em&gt;and&lt;/em&gt; the original image to the user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that vision models act as a translation layer. They convert visual content into the semantic space that the rest of the RAG pipeline already understands. Chunking, embedding, and vector search require no changes. The pipeline gains a new input channel — it does not need a new architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The multimodal pipeline extends the standard RAG system at two seams: ingestion gains a parallel image track, and retrieval gains an image rendering step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[PDF / DOCX Upload] --&amp;gt; B[Text Extraction\nexisting]
    A --&amp;gt; C[Image Extraction\nPdfPig · OpenXml]
    B --&amp;gt; D[Chunk &amp;amp; Embed\nexisting]
    C --&amp;gt; E[Vision Model\nGPT-4o]
    E --&amp;gt; F[Image Description\ntext]
    E --&amp;gt; G[Image Bytes\nPostgreSQL]
    F --&amp;gt; H[Embed Description\nas chunk + imageId]
    D --&amp;gt; I[(Qdrant\nVector Store)]
    H --&amp;gt; I
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The upload triggers two parallel tracks. The text track is unchanged. The image track extracts raw bytes per page or document part, sends each to a vision model, stores the bytes in PostgreSQL, and embeds the returned description as a standard chunk — with one addition: the chunk carries an &lt;code&gt;imageId&lt;/code&gt; reference in its metadata.&lt;/p&gt;

&lt;p&gt;Image descriptions live in the same vector space as text chunks. They compete on equal terms during retrieval.&lt;/p&gt;
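&lt;p&gt;A sketch of that one addition, with illustrative type names for the chunk and the pipeline call:&lt;/p&gt;

```csharp
// Sketch of the image-track chunk: the vision model's description is embedded
// like any text chunk, plus an imageId pointing back to the stored bytes.
// DocumentChunk and EmbedAndStoreAsync are illustrative names.
var descriptionChunk = new DocumentChunk
{
    DocumentId = documentId,
    Text = imageDescription, // what the vision model returned
    Metadata = new Dictionary&lt;string, string&gt;
    {
        ["chunkType"]  = "image",
        ["imageId"]    = imageId.ToString(),    // key back to the bytes in PostgreSQL
        ["pageNumber"] = pageNumber.ToString()
    }
};
await _embeddingPipeline.EmbedAndStoreAsync(descriptionChunk, ct);
```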

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[User Query] --&amp;gt; B[Vector Search\nQdrant]
    B --&amp;gt; C{Chunk Type?}
    C --&amp;gt;|text| D[Text Context]
    C --&amp;gt;|image description| E[Image Description\n+ imageId]
    D --&amp;gt; F[LLM Response]
    E --&amp;gt; F
    E --&amp;gt; G[GET /api/images/id\nimage bytes]
    F --&amp;gt; H[Answer Text]
    G --&amp;gt; H
    H --&amp;gt; I[Chat UI\ntext + inline images]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieval requires no changes to the search layer. When a query matches an image-description chunk, the chunk's metadata surfaces the &lt;code&gt;imageId&lt;/code&gt;. A dedicated endpoint streams the image bytes from PostgreSQL. The chat UI renders the LLM answer alongside the relevant image — in the same response panel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Stage Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extract
&lt;/h3&gt;

&lt;p&gt;Two document types, two libraries, one output contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    PDF --&amp;gt; PdfPig --&amp;gt; ExtractedImage
    DOCX --&amp;gt; OpenXml --&amp;gt; ExtractedImage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PDF image extraction uses PdfPig's per-page image enumeration. DOCX extraction enumerates &lt;code&gt;MainDocumentPart.ImageParts&lt;/code&gt; via the OpenXml SDK. Both apply a 100×100px minimum dimension threshold — images below it are treated as decorative and skipped — and a 20MB safety cap. The output in both cases is an &lt;code&gt;ExtractedImage&lt;/code&gt; record carrying bytes, MIME type, and dimension metadata. Text and image extraction run on the same upload; no second pass is required.&lt;/p&gt;
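&lt;p&gt;A sketch of that shared contract and the filter both extractors apply. The field names are illustrative; the thresholds match the ones above.&lt;/p&gt;

```csharp
// The shared output contract and the filter both extractors apply.
// Field names are illustrative; the thresholds match the article.
public record ExtractedImage(byte[] Bytes, string MimeType, int Width, int Height);

static bool IsWorthIngesting(ExtractedImage img)
{
    const int MinDimension = 100;            // below this: likely decorative (icons, bullets)
    const long MaxBytes = 20L * 1024 * 1024; // 20MB safety cap

    return img.Width &gt;= MinDimension
        &amp;&amp; img.Height &gt;= MinDimension
        &amp;&amp; img.Bytes.LongLength &lt;= MaxBytes;
}
```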

&lt;h3&gt;
  
  
  Describe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    ExtractedImage --&amp;gt; B[IVisionService\nDescribeAsync] --&amp;gt; C[Text Description]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each extracted image is base64-encoded and sent to GPT-4o Vision via &lt;code&gt;IVisionService&lt;/code&gt;. The response is a plain-text description of what the image contains and means in context. This is the only pipeline stage that calls an external vision model. Descriptions are generated once at ingest time — not at query time — so retrieval latency is unaffected.&lt;/p&gt;
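&lt;p&gt;A sketch of the describe step, assuming a thin &lt;code&gt;_visionClient&lt;/code&gt; wrapper (the method name and prompt text are illustrative, not the repo's exact API):&lt;/p&gt;

```csharp
// Sketch of the describe step: encode the image as a data URL and ask the
// vision model for a context-aware description. _visionClient and its
// CompleteWithImageAsync method are assumed names for a thin wrapper.
public async Task&lt;string&gt; DescribeAsync(ExtractedImage img, CancellationToken ct)
{
    var dataUrl = $"data:{img.MimeType};base64,{Convert.ToBase64String(img.Bytes)}";

    // One call per image, at ingest time only; the underlying request uses the
    // chat completions API with an image_url content part pointing at dataUrl.
    return await _visionClient.CompleteWithImageAsync(
        prompt: "Describe what this image shows and what it means in the context " +
                "of a technical document. Include any labels, values, or steps visible in it.",
        imageDataUrl: dataUrl,
        ct);
}
```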

&lt;h3&gt;
  
  
  Store
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    ExtractedImage --&amp;gt; A[IImageStore] --&amp;gt; B[(PostgreSQL\nDocumentImages)]
    B --&amp;gt; C[imageId]
    C --&amp;gt; D[Chunk Metadata]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Image bytes are persisted to a &lt;code&gt;DocumentImages&lt;/code&gt; table in PostgreSQL via &lt;code&gt;IImageStore&lt;/code&gt;. The returned &lt;code&gt;imageId&lt;/code&gt; is attached to the description chunk before it enters the embedding pipeline. The bytes never travel to Qdrant — only the description text and the &lt;code&gt;imageId&lt;/code&gt; reference flow through the vector store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieve
&lt;/h3&gt;

&lt;p&gt;No change to the vector search layer. When a query matches an image-description chunk, the chunk's metadata carries &lt;code&gt;imageId&lt;/code&gt; and &lt;code&gt;pageNumber&lt;/code&gt;. The existing search response shape is extended with an optional image reference — source chunks now carry a type field (&lt;code&gt;text&lt;/code&gt; or &lt;code&gt;image&lt;/code&gt;) alongside the relevant text excerpt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Render
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;GET /api/images/{id}&lt;/code&gt; endpoint streams image bytes directly from PostgreSQL. The Blazor chat UI inspects each source chunk's type: text sources render as before, image sources fetch the endpoint and render the image inline. The user receives the LLM answer and the relevant diagram in the same response — no separate step, no external image hosting.&lt;/p&gt;
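&lt;p&gt;A minimal sketch of that endpoint, assuming &lt;code&gt;IImageStore&lt;/code&gt; exposes a lookup by id (the method and record shape are assumptions):&lt;/p&gt;

```csharp
// Minimal sketch of the image endpoint: look the bytes up by id and stream
// them back with the stored MIME type. IImageStore is the abstraction from
// the article; GetAsync and the returned record shape are assumed.
app.MapGet("/api/images/{id:guid}", async (Guid id, IImageStore store, CancellationToken ct) =&gt;
{
    var image = await store.GetAsync(id, ct);
    return image is null
        ? Results.NotFound()
        : Results.File(image.Bytes, image.MimeType); // binary response, straight from PostgreSQL
});
```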




&lt;h2&gt;
  
  
  GitHub
&lt;/h2&gt;

&lt;p&gt;The full source, issue tracker, and phase roadmap are public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop Blaming Your LLM: Fix RAG Retrieval Quality With Better Chunking in .NET</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Thu, 12 Mar 2026 08:11:14 +0000</pubDate>
      <link>https://dev.to/argha_dev/stop-blaming-your-llm-fix-rag-retrieval-quality-with-better-chunking-in-net-25ke</link>
      <guid>https://dev.to/argha_dev/stop-blaming-your-llm-fix-rag-retrieval-quality-with-better-chunking-in-net-25ke</guid>
      <description>&lt;p&gt;You swap to a better model. Still wrong answers. You tune your prompt. Still hallucinations. You increase the temperature — no, lower it — still garbage. Sound familiar?&lt;/p&gt;

&lt;p&gt;Here are three failure modes I hit repeatedly while building a RAG API in .NET:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The confident wrong answer.&lt;/strong&gt; The LLM states a fact with full certainty. The document says the opposite. You look at the retrieved chunk — it was cut in the middle of a sentence, and the half that made it into context was the setup, not the conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "I don't know" on an obvious question.&lt;/strong&gt; The user asks something the document clearly answers. The LLM shrugs. You trace it: the exact answer spans the last two words of chunk N and the first sentence of chunk N+1. Neither chunk scores high enough on its own to make the retrieval cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bloated non-answer.&lt;/strong&gt; The LLM returns 400 words of vague summary when the user needed a number. The retrieved chunk was an entire page. There were five relevant sentences in it and 900 tokens of noise.&lt;/p&gt;

&lt;p&gt;The LLM isn't the problem. The chunks are.&lt;/p&gt;




&lt;h2&gt;
  
  
  Root Cause: Chunk Boundaries Define Retrieval Quality
&lt;/h2&gt;

&lt;p&gt;In RAG, the pipeline works like this: you embed each chunk into a vector, store those vectors, and at query time you find the chunks most similar to the user's question. The LLM never sees your document — it sees only the chunks you hand it.&lt;/p&gt;

&lt;p&gt;This means every embedding is only as good as the text it encodes. And the text it encodes is defined entirely by where you drew the chunk boundaries.&lt;/p&gt;

&lt;p&gt;Too large: the chunk contains the answer buried in irrelevant context. The embedding drifts toward the noise. Token cost spikes. The LLM has to wade through padding to find the signal.&lt;/p&gt;

&lt;p&gt;Too small: the answer spans two chunks. Each chunk, alone, doesn't capture enough meaning to rank high. Both miss the similarity threshold. The answer is never retrieved.&lt;/p&gt;

&lt;p&gt;Wrong boundary: you cut mid-sentence. The embedding captures a dangling clause, not a complete thought. Semantic similarity breaks down.&lt;/p&gt;

&lt;p&gt;The defaults in this project — &lt;code&gt;ChunkSize: 1000&lt;/code&gt; characters, &lt;code&gt;ChunkOverlap: 200&lt;/code&gt; — are a starting point, not gospel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Application/Models/DocumentProcessingOptions.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DocumentProcessingOptions&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;DefaultChunkingStrategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Fixed"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ChunkSize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// characters&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ChunkOverlap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the number that matters more than the size is &lt;em&gt;how&lt;/em&gt; you draw the boundaries. That's what the three strategies below address.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Before diving into strategies, here's where chunking lives in the full upload flow. &lt;code&gt;DocumentService.UploadDocumentAsync&lt;/code&gt; runs four sequential steps: extract text from the raw file (PDF, DOCX, TXT, Markdown), chunk the text using the selected strategy, generate embeddings for every chunk, and upsert those embeddings into the vector store. Chunking is step 2 — everything after it depends on getting step 2 right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 1: Extract text from document&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_documentProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExtractTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 2: Chunk the text&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_documentProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ChunkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunkingOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 3: Generate embeddings for all chunks&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkTexts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Step 4: Store chunks in vector database&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UpsertChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_workspaceContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CollectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's look at each strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 1: Fixed-Size With Paragraph-Aware Overlap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; mixed document corpora, financial reports, legal docs, anything where you don't know the structure in advance.&lt;/p&gt;

&lt;p&gt;The "fixed" in the name is slightly misleading. This strategy doesn't blindly slice at character N. It splits at paragraph boundaries first, then accumulates paragraphs until adding the next paragraph would exceed &lt;code&gt;ChunkSize&lt;/code&gt;. At that point it saves the current chunk and begins the next one with &lt;code&gt;ChunkOverlap&lt;/code&gt; characters carried over from the tail of the previous chunk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkByFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SeparatorPattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringBuilder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraph&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt;&lt;span class="p"&gt;++,&lt;/span&gt; &lt;span class="p"&gt;...));&lt;/span&gt;

            &lt;span class="c1"&gt;// Start new chunk with overlap&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;overlapText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;GetOverlapText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkOverlap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;overlapText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AppendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... flush final chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overlap is word-boundary aware — &lt;code&gt;GetOverlapText&lt;/code&gt; finds the first space after &lt;code&gt;text.Length - overlapSize&lt;/code&gt; rather than slicing at a raw character index. This prevents a chunk starting with &lt;code&gt;"...ompany reported a record"&lt;/code&gt; when it should start with &lt;code&gt;"company reported a record"&lt;/code&gt;.&lt;/p&gt;
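
&lt;p&gt;&lt;code&gt;GetOverlapText&lt;/code&gt; itself isn't shown in the excerpt above; a plausible reconstruction of the word-boundary logic (mine, not necessarily the repo's exact code) looks like this:&lt;/p&gt;

```csharp
using System;

// Reconstruction of the helper described above, not the repo's exact code:
// take roughly the last overlapSize characters, but advance past the first
// space so the overlap starts on a whole word instead of a sliced one.
static string GetOverlapText(string text, int overlapSize)
{
    if (overlapSize <= 0) return string.Empty;
    if (text.Length <= overlapSize) return text;

    var start = text.Length - overlapSize;
    var firstSpace = text.IndexOf(' ', start);

    // No space in the tail means one giant word; skip the overlap entirely.
    return firstSpace >= 0 ? text[(firstSpace + 1)..] : string.Empty;
}

Console.WriteLine(GetOverlapText("the company reported a record", 10)); // → a record
```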

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Predictable token budget — you know your maximum chunk size&lt;/li&gt;
&lt;li&gt;✅ Works on any document type without structure assumptions&lt;/li&gt;
&lt;li&gt;✅ Overlap means a sentence at a chunk boundary still has context in the next chunk&lt;/li&gt;
&lt;li&gt;❌ A single paragraph larger than &lt;code&gt;ChunkSize&lt;/code&gt; will still be split mid-paragraph&lt;/li&gt;
&lt;li&gt;❌ Overlap is character-based, not semantic — the carried-over text might not be the most relevant part&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; your default for any document corpus where you don't know the structure in advance. Switch to one of the targeted strategies once you know what you're ingesting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 2: Sentence-Based
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; factual Q&amp;amp;A over research papers, product manuals, FAQs — text where the answer to a question is typically one or two complete sentences.&lt;/p&gt;

&lt;p&gt;The key insight is that embedding quality peaks when the encoded text is a complete thought. A sentence is the smallest unit of complete meaning. This strategy splits on &lt;code&gt;.!?&lt;/code&gt; boundaries, accumulates sentences until the next sentence would overflow &lt;code&gt;ChunkSize&lt;/code&gt;, and carries the last sentence of the previous chunk into the next one as overlap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkBySentence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;@"(?&amp;lt;=[.!?])\s+"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringBuilder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastSentence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Empty&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// one-sentence overlap&lt;/span&gt;

    &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;chunkIndex&lt;/span&gt;&lt;span class="p"&gt;++,&lt;/span&gt; &lt;span class="p"&gt;...));&lt;/span&gt;

            &lt;span class="c1"&gt;// Start next chunk with the last sentence as overlap&lt;/span&gt;
            &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastSentence&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastSentence&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;lastSentence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... flush final chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One-sentence overlap means the answer sentence is never the very first token of a chunk with no preceding context. The chunk before it and the chunk after it both have at least one connecting sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Embeddings capture complete semantic units — cosine similarity is more reliable&lt;/li&gt;
&lt;li&gt;✅ Best retrieval precision for direct factual questions&lt;/li&gt;
&lt;li&gt;❌ Chunk sizes vary wildly — a three-word sentence and a 200-word sentence get equal weight&lt;/li&gt;
&lt;li&gt;❌ The regex splitter breaks on abbreviations: &lt;code&gt;"Mr. Smith arrived"&lt;/code&gt; becomes two sentences. Same for &lt;code&gt;"e.g."&lt;/code&gt;, &lt;code&gt;"i.e."&lt;/code&gt;, decimal numbers. Good enough for most corpora; not production-grade for scientific text&lt;/li&gt;
&lt;/ul&gt;
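
&lt;p&gt;The abbreviation failure is easy to see by running the same split pattern directly:&lt;/p&gt;

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same split pattern as ChunkBySentence. "Mr." ends with a period followed
// by whitespace, so the lookbehind fires and one sentence becomes two.
var sentences = Regex.Split("Mr. Smith arrived. He sat down.", @"(?<=[.!?])\s+")
    .Where(s => !string.IsNullOrWhiteSpace(s))
    .ToArray();

Console.WriteLine(string.Join(" | ", sentences));
// → Mr. | Smith arrived. | He sat down.
```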

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; FAQ documents, product manuals, research papers, anything dense with discrete facts where users ask direct questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 3: Paragraph-Based
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; well-structured prose — internal wikis, policy PDFs, documentation sites — where a paragraph is a coherent topic unit.&lt;/p&gt;

&lt;p&gt;This is the simplest strategy: split on blank lines, make each paragraph exactly one chunk, no size cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChunkByParagraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;@"\n\n|\r\n\r\n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;CreateChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No size cap is intentional. The paragraph boundary &lt;em&gt;is&lt;/em&gt; the semantic boundary. Imposing an artificial size limit would require introducing mid-paragraph cuts, which is exactly the failure mode we're trying to avoid. You accept variable sizes in exchange for zero mid-thought cuts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Most semantically coherent chunks&lt;/li&gt;
&lt;li&gt;✅ Zero mid-thought cuts — every chunk is a complete idea&lt;/li&gt;
&lt;li&gt;❌ Sizes vary wildly — a one-liner and a 2000-word section get equal treatment&lt;/li&gt;
&lt;li&gt;❌ Very large paragraphs overflow the LLM context window. There is no safety size cap in this implementation — something to add if you're ingesting documents with monster paragraphs&lt;/li&gt;
&lt;/ul&gt;
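
&lt;p&gt;If you need that safety cap, one minimal approach (a sketch, not part of the repo) is to keep paragraphs intact when they fit and fall back to word-boundary slicing only for the oversized ones:&lt;/p&gt;

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch of a safety cap for the paragraph strategy, assuming the same
// blank-line split as ChunkByParagraph. Paragraphs that fit stay whole;
// only oversized ones are sliced, and always at a space.
static List<string> SplitWithCap(string text, int maxSize)
{
    var paragraphs = Regex.Split(text, @"\n\n|\r\n\r\n")
        .Select(p => p.Trim())
        .Where(p => !string.IsNullOrWhiteSpace(p));

    var pieces = new List<string>();
    foreach (var para in paragraphs)
    {
        if (para.Length <= maxSize) { pieces.Add(para); continue; }

        var remaining = para;
        while (remaining.Length > maxSize)
        {
            // Cut at the last space before the cap so no word is split.
            var cut = remaining.LastIndexOf(' ', maxSize);
            if (cut <= 0) cut = maxSize; // single giant word: hard cut
            pieces.Add(remaining[..cut].TrimEnd());
            remaining = remaining[cut..].TrimStart();
        }
        if (remaining.Length > 0) pieces.Add(remaining);
    }
    return pieces;
}

Console.WriteLine(string.Join(" | ", SplitWithCap("short one\n\nalpha beta gamma delta", 11)));
// → short one | alpha beta | gamma delta
```

&lt;p&gt;This keeps the "paragraph boundary is the semantic boundary" property for every paragraph that fits, and only compromises on the rare outlier.&lt;/p&gt;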

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; well-structured prose where the author already did the work of organizing information into coherent blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Boundary&lt;/th&gt;
&lt;th&gt;Overlap&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Watch out for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;td&gt;Paragraph&lt;/td&gt;
&lt;td&gt;Character (word-safe)&lt;/td&gt;
&lt;td&gt;Mixed/unknown docs, legal&lt;/td&gt;
&lt;td&gt;Long single paragraphs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentence&lt;/td&gt;
&lt;td&gt;Sentence &lt;code&gt;.!?&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Last sentence&lt;/td&gt;
&lt;td&gt;Factual Q&amp;amp;A, manuals, research&lt;/td&gt;
&lt;td&gt;Abbreviations, lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paragraph&lt;/td&gt;
&lt;td&gt;Blank line&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Structured prose, wikis, policy&lt;/td&gt;
&lt;td&gt;Huge paragraphs, no size cap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;The decision is simpler than it looks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't know your document structure?&lt;/strong&gt; → use &lt;code&gt;Fixed&lt;/code&gt;. It's the safe default and handles the widest range of inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users ask specific factual questions?&lt;/strong&gt; → use &lt;code&gt;Sentence&lt;/code&gt;. Precision beats coverage for Q&amp;amp;A workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documents are well-structured prose where paragraphs are deliberate?&lt;/strong&gt; → use &lt;code&gt;Paragraph&lt;/code&gt;. Let the author's structure do the work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixing document types across a workspace?&lt;/strong&gt; → use &lt;code&gt;Fixed&lt;/code&gt; as the default, and override per upload using the &lt;code&gt;chunkingStrategy&lt;/code&gt; parameter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is important. Every upload can override the default strategy without touching the server config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// DocumentService.UploadDocumentAsync signature&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;UploadDocumentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Stream&lt;/span&gt; &lt;span class="n"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ChunkingStrategy&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;chunkingStrategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// override per upload&lt;/span&gt;
    &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass &lt;code&gt;ChunkingStrategy.Sentence&lt;/code&gt; for a product manual, &lt;code&gt;ChunkingStrategy.Paragraph&lt;/code&gt; for a policy doc, and &lt;code&gt;null&lt;/code&gt; (uses config default) for everything else. The strategy is resolved at call time — no restart required.&lt;/p&gt;
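On the caller side, that per-upload choice can be as simple as a small heuristic that returns an override or defers to the default. A sketch, assuming nothing about the real codebase beyond the null-means-default convention (the `PickStrategy` helper and the file-name patterns are mine, not the repo's):

```csharp
using System;

// Caller-side sketch: pick a per-upload override from simple file-name
// heuristics; null defers to the server's configured default strategy.
// PickStrategy and the name patterns are illustrative, not the repo's API.
string? PickStrategy(string fileName) =>
    fileName.EndsWith("-manual.pdf")  ? "Sentence"  :
    fileName.EndsWith("-policy.docx") ? "Paragraph" :
    null; // everything else: config default

Console.WriteLine(PickStrategy("printer-manual.pdf") ?? "config default"); // Sentence
Console.WriteLine(PickStrategy("notes.txt") ?? "config default");          // config default
```

The same shape works with richer signals (content type, tags, document length); the point is that the decision lives with the upload, not in server config.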




&lt;h2&gt;
  
  
  Where This Lives
&lt;/h2&gt;

&lt;p&gt;All three strategies are implemented in &lt;a href="https://github.com/Argha713/dotnet-rag-api/blob/main/src/RagApi.Infrastructure/DocumentProcessing/DocumentProcessor.cs" rel="noopener noreferrer"&gt;&lt;code&gt;DocumentProcessor.cs&lt;/code&gt;&lt;/a&gt; in the open-source &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;dotnet-rag-api&lt;/a&gt; project — a full RAG API built on .NET 8 with Clean Architecture, Qdrant for vector storage, and support for OpenAI, Azure OpenAI, or local Ollama as the AI provider.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;IDocumentProcessor&lt;/code&gt; interface is decoupled from the vector store: you can run it against Qdrant Cloud, Azure AI Search, or a local Qdrant instance and the chunking logic doesn't change. The same three strategies work regardless of which backend you use for embeddings and retrieval.&lt;/p&gt;

&lt;p&gt;If you're hitting retrieval quality issues, look at your chunks before you look at your model.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building Production-Grade RAG in .NET: Language Is Not a Barrier</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:03:02 +0000</pubDate>
      <link>https://dev.to/argha_dev/building-production-grade-rag-in-net-language-is-not-a-barrier-bdd</link>
      <guid>https://dev.to/argha_dev/building-production-grade-rag-in-net-language-is-not-a-barrier-bdd</guid>
      <description>&lt;h2&gt;
  
  
  Building Production-Grade RAG in .NET 8: Language Is Not a Barrier
&lt;/h2&gt;

&lt;p&gt;Every AI tutorial you find starts with Python. Every LangChain walkthrough, every vector database quickstart, every "build your own ChatGPT" guide — all Python. If you are a .NET developer, you are used to searching for a C# equivalent and finding either a thin wrapper someone wrote last week, a GitHub issue from 2022 asking "is there a .NET SDK?", or nothing at all.&lt;/p&gt;

&lt;p&gt;I got tired of that. So I built a full Retrieval-Augmented Generation (RAG) API in .NET 8 from scratch: Clean Architecture, Qdrant vector database, OpenAI/Azure OpenAI/Ollama provider switching, hybrid search with Reciprocal Rank Fusion, MMR re-ranking, multi-tenancy, Server-Sent Events streaming, a Blazor WASM frontend, and 279 tests. Deployed to Azure Container Apps and Azure Static Web Apps.&lt;/p&gt;

&lt;p&gt;This article walks through how I built it, and why .NET is a first-class citizen in the AI ecosystem — not a workaround.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live API:&lt;/strong&gt; &lt;a href="https://rag-api.calmsand-4a05cfa0.eastus.azurecontainerapps.io" rel="noopener noreferrer"&gt;https://rag-api.calmsand-4a05cfa0.eastus.azurecontainerapps.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live UI:&lt;/strong&gt; &lt;a href="https://ambitious-glacier-0b62ea10f.6.azurestaticapps.net" rel="noopener noreferrer"&gt;https://ambitious-glacier-0b62ea10f.6.azurestaticapps.net&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Is RAG and Why Does Architecture Matter?
&lt;/h2&gt;

&lt;p&gt;RAG is a pattern that improves LLM responses by grounding them in your own documents. Instead of relying on the model's training data, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingest&lt;/strong&gt; — parse documents, split into chunks, generate vector embeddings, store in a vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve&lt;/strong&gt; — embed the user's query, find the most similar chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — inject those chunks as context into the LLM prompt, get a grounded answer&lt;/li&gt;
&lt;/ol&gt;
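The three steps above can be compressed into a toy end-to-end sketch. A real system uses an embedding model and a vector database; here a bag-of-words vector over a fixed vocabulary and an in-memory list stand in, so only the pipeline shape is faithful, not the math:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy embedding: a 0/1 vector over a fixed vocabulary. Stands in for a
// real embedding model purely to make the pipeline runnable.
string[] vocab = { "refunds", "issued", "days", "warranty", "covers", "years", "long", "take" };

double[] Embed(string text)
{
    var words = text.ToLowerInvariant().Split(' ');
    return vocab.Select(w => words.Contains(w) ? 1.0 : 0.0).ToArray();
}

double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return na == 0 || nb == 0 ? 0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// 1. Ingest: chunk the "document", embed each chunk, store the pairs.
var store = new List<(double[] Vec, string Text)>();
foreach (var chunk in new[] { "refunds are issued within 14 days", "the warranty covers two years" })
    store.Add((Embed(chunk), chunk));

// 2. Retrieve: embed the query, rank stored chunks by similarity.
var queryVec = Embed("how long do refunds take");
var best = store.OrderByDescending(c => Cosine(queryVec, c.Vec)).First().Text;

// 3. Generate: inject the retrieved chunk as context (LLM call stubbed out).
Console.WriteLine($"Answer grounded in: \"{best}\"");
// prints: Answer grounded in: "refunds are issued within 14 days"
```

Every production concern in the rest of this article (chunking, hybrid search, re-ranking, multi-tenancy) is an elaboration of one of these three steps.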

&lt;p&gt;Most RAG tutorials implement this in ~50 lines of Python using LangChain. That is fine for a demo. For production — where you need testability, provider flexibility, multi-tenancy, and maintainability — architecture matters enormously. And that is where .NET's ecosystem genuinely shines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The .NET AI Ecosystem in 2025
&lt;/h2&gt;

&lt;p&gt;Before I show the implementation, let us be honest about the landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The packages you actually need:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB (Qdrant)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qdrant.Client 1.12.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Official .NET SDK, full gRPC support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF parsing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PdfPig 0.1.9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pure .NET, no native deps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOCX/XLSX parsing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DocumentFormat.OpenXml 3.0.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Microsoft's own SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL / EF Core&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL 8.0.8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rock solid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure AI Search&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Azure.Search.Documents 11.6.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Swap Qdrant for Azure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Serilog.AspNetCore 8.0.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Industry standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FluentValidation 11.9.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Better than DataAnnotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health checks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Microsoft.Extensions.Diagnostics.HealthChecks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Built-in, excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI / Ollama&lt;/td&gt;
&lt;td&gt;Direct &lt;code&gt;HttpClient&lt;/code&gt; calls&lt;/td&gt;
&lt;td&gt;You don't need Semantic Kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The notable absence: I did not use Microsoft Semantic Kernel. Semantic Kernel is a legitimate option, especially if you want abstractions over multiple AI providers and memory stores out of the box. I chose to build the abstractions myself for two reasons: (1) it makes the architecture explicit and teachable, and (2) it demonstrates that you do not need a framework — the primitives are sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is genuinely missing compared to Python:&lt;/strong&gt; LangChain's ecosystem of 300+ integrations. Python dominates experimental ML research. If you need to run a custom fine-tuned model or use bleeding-edge retrieval research, Python is still where that lives first. For production API work with mainstream providers and standard vector databases? .NET is fully capable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Clean Architecture Meets AI
&lt;/h2&gt;

&lt;p&gt;The project follows Clean Architecture strictly, with four layers plus a Blazor WASM front end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Domain          — entities only, zero dependencies
Application     — interfaces + services, depends on Domain
Infrastructure  — Qdrant, OpenAI, PostgreSQL/EF Core, parsers
Api             — ASP.NET Core controllers, middleware, Serilog
BlazorUI        — Blazor WASM frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters for AI systems specifically because &lt;strong&gt;the AI provider is an infrastructure detail&lt;/strong&gt;. Your business logic (how to chunk, how to rank results, what prompt template to use) should not be coupled to whether you are using OpenAI today and Azure OpenAI tomorrow.&lt;/p&gt;

&lt;p&gt;The Application layer defines two key interfaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Application/Interfaces/IEmbeddingService.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IEmbeddingService&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;]&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;Dimensions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Application/Interfaces/IChatService.cs&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IChatService&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;IAsyncEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GenerateResponseStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Infrastructure has three concrete implementations of each: &lt;code&gt;OpenAiChatService&lt;/code&gt;, &lt;code&gt;AzureOpenAiChatService&lt;/code&gt;, and &lt;code&gt;OllamaChatService&lt;/code&gt;. DI wires the correct one based on &lt;code&gt;appsettings.json&lt;/code&gt;. &lt;strong&gt;Switching providers is a config change, not a code change.&lt;/strong&gt;&lt;/p&gt;
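A sketch of what that config-driven switch amounts to. The real project does this inside the DI container reading `appsettings.json`; here a plain lookup table stands in for the container, and the model names are placeholders, not the repo's configuration:

```csharp
using System;
using System.Collections.Generic;

// Config-driven provider selection: the key comes from configuration, and
// each entry stands in for registering a concrete IChatService implementation.
// Lookup table and model names are illustrative, not the repo's DI setup.
var registrations = new Dictionary<string, Func<(string Provider, string Model)>>
{
    ["OpenAI"]      = () => ("OpenAiChatService", "gpt-4o-mini"),
    ["AzureOpenAI"] = () => ("AzureOpenAiChatService", "gpt-4o"),
    ["Ollama"]      = () => ("OllamaChatService", "llama3"),
};

// "A config change, not a code change": only this key varies per environment.
var configuredProvider = "Ollama"; // would be read from appsettings.json
var service = registrations[configuredProvider]();
Console.WriteLine($"{service.Provider} -> {service.Model}"); // prints: OllamaChatService -> llama3
```

Everything downstream of the interface (prompting, retrieval, streaming) is unaware of which entry was selected, which is the whole point of keeping the provider in Infrastructure.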




&lt;h2&gt;
  
  
  The Vector Store Abstraction
&lt;/h2&gt;

&lt;p&gt;This is where most RAG tutorials stop being useful — they assume a single collection in a single database. In a multi-tenant system, each workspace needs isolated storage.&lt;/p&gt;

&lt;p&gt;Here is the &lt;code&gt;IVectorStore&lt;/code&gt; interface (the real one from the codebase):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;/// All methods accept collectionName as the first parameter so callers&lt;/span&gt;
&lt;span class="c1"&gt;/// (Scoped services) can pass the workspace's collection without violating&lt;/span&gt;
&lt;span class="c1"&gt;/// the Singleton lifetime of the implementation.&lt;/span&gt;
&lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;IVectorStore&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;DeleteCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;UpsertChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocumentChunk&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;KeywordSearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByDocumentId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;filterByTags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;DeleteDocumentChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;VectorStoreStats&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetStatsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a critical DI lifetime design decision embedded here. &lt;code&gt;IVectorStore&lt;/code&gt; is registered as &lt;strong&gt;Singleton&lt;/strong&gt; — it wraps a gRPC channel that should be long-lived. But the workspace context (which collection to use) is &lt;strong&gt;Scoped&lt;/strong&gt; (per HTTP request).&lt;/p&gt;

&lt;p&gt;The solution: pass &lt;code&gt;collectionName&lt;/code&gt; as an explicit first parameter to every method. Scoped services resolve it from &lt;code&gt;IWorkspaceContext.Current.CollectionName&lt;/code&gt; and pass it in. The Singleton never holds any per-request state. This avoids the classic "cannot consume scoped service from singleton" exception that catches .NET developers out.&lt;/p&gt;
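Reduced to its essentials, the lifetime rule looks like this. The names below are illustrative, not the repo's API; the one thing being demonstrated is that the long-lived store holds no per-request state, so every call must carry its workspace's collection name:

```csharp
using System;
using System.Collections.Generic;

// Stands in for the gRPC-backed singleton's side effects.
var upsertLog = new List<string>();

// "Singleton": stateless with respect to requests; the collection name
// arrives as an argument on every call, never as a stored field.
void UpsertChunks(string collectionName, string chunk)
    => upsertLog.Add($"{collectionName}: {chunk}");

// Two request-scoped callers from different workspaces share the one store
// safely, because the collection travels with each call.
UpsertChunks("workspace-a-chunks", "alpha"); // request from workspace A
UpsertChunks("workspace-b-chunks", "beta");  // request from workspace B

Console.WriteLine(string.Join(" | ", upsertLog));
// prints: workspace-a-chunks: alpha | workspace-b-chunks: beta
```

Had the collection name been a field on the store instead, the second request would have silently written into the first workspace's collection — the multi-tenant failure this design prevents.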

&lt;p&gt;Two implementations ship: &lt;code&gt;QdrantVectorStore&lt;/code&gt; and &lt;code&gt;AzureAiSearchVectorStore&lt;/code&gt;. Swap via config. Same interface, same tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RAG Pipeline: Hybrid Search + RRF Fusion + MMR Re-ranking
&lt;/h2&gt;

&lt;p&gt;Plain semantic (embedding) search has a known weakness: it finds conceptually similar chunks but misses exact keyword matches. For &lt;em&gt;"What is the RFC 2119 MUST keyword definition?"&lt;/em&gt;, semantic search surfaces documents about requirements in general, while keyword search finds the exact definition.&lt;/p&gt;

&lt;p&gt;Hybrid search solves this by running semantic and keyword search in parallel and fusing the results. The fusion algorithm is &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score(chunk) = Σ  1 / (60 + rank_in_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
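Plugging concrete ranks into the formula (k = 60, ranks 1-based) makes the fusion effect visible:

```csharp
using System;

// Worked RRF numbers with k = 60: a chunk found by both searches beats a
// chunk found by only one, even when its second-list rank is worse.
double Rrf(params int[] ranks)
{
    double score = 0;
    foreach (var rank in ranks) score += 1.0 / (60 + rank);
    return score;
}

var bothLists = Rrf(1, 3); // 1st in semantic, 3rd in keyword: 1/61 + 1/63 ≈ 0.0323
var oneList   = Rrf(1);    // 1st in semantic only:            1/61        ≈ 0.0164
Console.WriteLine(bothLists > oneList); // prints: True
```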



&lt;p&gt;A chunk ranked 1st in semantic and 3rd in keyword scores higher than one ranked 1st in only semantic. Here is the real implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;FuseWithRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;semanticResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;keywordResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SearchResult&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;)&amp;gt;();&lt;/span&gt;

    &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;rrfScore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rrfScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
                &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrfScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;AccumulateRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywordResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Values&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OrderByDescending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MMR (Maximal Marginal Relevance)&lt;/strong&gt; re-ranking tackles a different problem: result redundancy. If your top-5 chunks are all from the same paragraph of the same document, your LLM context window is wasted. MMR balances relevance against diversity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MMR(chunk) = λ · similarity(chunk, query) - (1-λ) · max_similarity(chunk, selected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
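&lt;p&gt;The formula translates into a short greedy loop. The sketch below is illustrative, not the project's actual &lt;code&gt;MmrReRanker&lt;/code&gt;; the &lt;code&gt;Candidate&lt;/code&gt; type, its &lt;code&gt;Embedding&lt;/code&gt; property, and the &lt;code&gt;Cosine&lt;/code&gt; helper are stand-ins for the real types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Greedy MMR: repeatedly pick the candidate with the best
// lambda-weighted balance of query relevance and novelty.
static List&lt;Candidate&gt; MmrSelect(
    IReadOnlyList&lt;Candidate&gt; candidates, float[] queryEmbedding, int topK, double lambda)
{
    var selected = new List&lt;Candidate&gt;();
    var remaining = new List&lt;Candidate&gt;(candidates);

    while (selected.Count &lt; topK &amp;&amp; remaining.Count &gt; 0)
    {
        Candidate? best = null;
        var bestScore = double.MinValue;

        foreach (var c in remaining)
        {
            var relevance = Cosine(c.Embedding, queryEmbedding);
            // Penalise similarity to anything already chosen.
            var redundancy = selected.Count == 0
                ? 0.0
                : selected.Max(s =&gt; Cosine(c.Embedding, s.Embedding));

            var mmr = lambda * relevance - (1 - lambda) * redundancy;
            if (mmr &gt; bestScore) { bestScore = mmr; best = c; }
        }

        selected.Add(best!);
        remaining.Remove(best!);
    }

    return selected;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;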



&lt;p&gt;Lambda controls the relevance/diversity tradeoff: a higher lambda favours relevance, a lower one favours diversity. The retrieval pipeline wires these together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In RagService.RetrieveChunksAsync&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;useHybrid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;useReRanking&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;// needs vectors for MMR&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Hybrid: run semantic + keyword in parallel&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;semanticTask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;useReRanking&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchWithEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;KeywordSearchAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WhenAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticTask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;FuseWithRrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semanticTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordTask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidateCount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;useReRanking&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MmrReRanker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_searchOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MmrLambda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every option is per-request configurable. The caller passes &lt;code&gt;useHybridSearch&lt;/code&gt; and &lt;code&gt;useReRanking&lt;/code&gt; booleans; if null, the config default is used. This makes A/B testing retrieval strategies trivial.&lt;/p&gt;




&lt;h2&gt;
  
  
  Streaming with IAsyncEnumerable and SSE
&lt;/h2&gt;

&lt;p&gt;LLM responses are slow. Users notice. Server-Sent Events (SSE) let you stream tokens to the browser as they arrive from the model.&lt;/p&gt;

&lt;p&gt;The streaming pipeline uses &lt;code&gt;IAsyncEnumerable&amp;lt;string&amp;gt;&lt;/code&gt; all the way from the HTTP client to the controller response — no buffering, no polling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;IAsyncEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;StreamEvent&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ChatStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;conversationHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EnumeratorCancellation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;searchResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;RetrieveChunksAsync&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

    &lt;span class="c1"&gt;// Yield sources FIRST — client renders citations while tokens stream in&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StreamEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;BuildSources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;BuildContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SystemPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Stream each LLM token as it arrives&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateResponseStreamAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StreamEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller writes each event to the response with &lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;. The Blazor UI consumes it with &lt;code&gt;HttpClient&lt;/code&gt; + a manual SSE parser. No SignalR, no WebSockets, no extra infrastructure.&lt;/p&gt;
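&lt;p&gt;A minimal version of that controller endpoint might look like the following. The route, request type, and serializer call are assumptions, since the article only describes the behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;[HttpPost("stream")]
public async Task Stream([FromBody] ChatRequest request, CancellationToken ct)
{
    Response.ContentType = "text/event-stream";

    await foreach (var evt in _ragService.ChatStreamAsync(request.Query, request.History, ct))
    {
        // One SSE frame per event: "data: {json}\n\n"
        await Response.WriteAsync($"data: {JsonSerializer.Serialize(evt)}\n\n", ct);
        await Response.Body.FlushAsync(ct);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;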




&lt;h2&gt;
  
  
  Multi-tenancy: Per-Workspace Qdrant Collections
&lt;/h2&gt;

&lt;p&gt;The system supports multiple isolated workspaces. Each workspace gets its own Qdrant collection. A workspace is identified by an API key sent in the &lt;code&gt;X-Api-Key&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;The middleware pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ApiKeyMiddleware → resolves Workspace from DB by hashed key
                → sets IWorkspaceContext.Current for the request scope
                → all downstream services use Current.CollectionName
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
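&lt;p&gt;In code, that middleware is roughly the following sketch. The repository interface, &lt;code&gt;GetByKeyHashAsync&lt;/code&gt;, and the settable &lt;code&gt;Current&lt;/code&gt; property are assumed names; the real project's signatures may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public class ApiKeyMiddleware
{
    private readonly RequestDelegate _next;

    public ApiKeyMiddleware(RequestDelegate next) =&gt; _next = next;

    public async Task InvokeAsync(
        HttpContext context, IWorkspaceRepository workspaces, IWorkspaceContext workspaceContext)
    {
        if (!context.Request.Headers.TryGetValue("X-Api-Key", out var apiKey))
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        // Look the workspace up by the *hashed* key; plaintext is never stored.
        var workspace = await workspaces.GetByKeyHashAsync(
            WorkspaceService.ComputeSha256(apiKey.ToString()));
        if (workspace is null)
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        workspaceContext.Current = workspace;  // scoped to this request
        await _next(context);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;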



&lt;p&gt;&lt;code&gt;WorkspaceService.ComputeSha256(key)&lt;/code&gt; is the only place API keys are hashed before storage. The plaintext key is never persisted — only shown to the user once at creation. This mirrors standard API key security practices.&lt;/p&gt;
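&lt;p&gt;A hash helper of that shape is only a few lines. Hex encoding is an assumption here, since the article does not specify the output format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Security.Cryptography;
using System.Text;

public static class ApiKeyHasher
{
    // One-way hash: the stored value cannot be reversed into the key.
    public static string ComputeSha256(string key)
    {
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Convert.ToHexString(bytes);  // 64 hex characters
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;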

&lt;p&gt;When a workspace is created, &lt;code&gt;IVectorStore.EnsureCollectionAsync(collectionName)&lt;/code&gt; is called immediately, creating the Qdrant collection with the correct vector dimensions. When a workspace is deleted, &lt;code&gt;DeleteCollectionAsync(collectionName)&lt;/code&gt; cascades the cleanup. No manual Qdrant operations required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Qdrant Reliability: The Auto-Reinitialise Pattern
&lt;/h2&gt;

&lt;p&gt;Qdrant's managed cloud can delete a collection if it has been inactive (free tier). When that happens, every vector operation throws a gRPC &lt;code&gt;RpcException&lt;/code&gt; with &lt;code&gt;StatusCode.NotFound&lt;/code&gt;. A restart would fix it — but that is a terrible production experience.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;QdrantVectorStore&lt;/code&gt; implements an auto-reinitialise pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ExecuteWithReinitAsync&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RpcException&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogWarning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Collection {Name} not found, reinitialising..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_initLock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collectionName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;_initLock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// retry once&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;SemaphoreSlim(1,1)&lt;/code&gt; ensures that concurrent requests hitting a missing collection do not trigger a thundering herd of reinitialise calls. The operation retries exactly once. No restart required. The collection is back in seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing AI Systems in .NET: 279 Tests
&lt;/h2&gt;

&lt;p&gt;This is where Python RAG tutorials truly fall short. Most ship with zero tests. Production software needs tests. Here is how AI-dependent code is tested in .NET:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mock interfaces, never concrete AI classes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Good — mock the interface&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;mockEmbedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Mock&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IEmbeddingService&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;span class="n"&gt;mockEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;It&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAny&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(),&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReturnsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Bad — RagService is a concrete class; instantiate it with mocked deps&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sut&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RagService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mockVectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockChat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mockLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SearchOptions&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;mockWorkspaceContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EF Core InMemory for repository tests:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;DbContextOptionsBuilder&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RagApiDbContext&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UseInMemoryDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;NewGuid&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;// unique per test&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RagApiDbContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DocumentRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workspaceContext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test coverage breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests: &lt;code&gt;RagService&lt;/code&gt;, &lt;code&gt;DocumentService&lt;/code&gt;, &lt;code&gt;WorkspaceService&lt;/code&gt;, all repositories&lt;/li&gt;
&lt;li&gt;Controller tests: &lt;code&gt;ChatController&lt;/code&gt;, &lt;code&gt;DocumentsController&lt;/code&gt;, &lt;code&gt;WorkspacesController&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Middleware tests: &lt;code&gt;ApiKeyMiddleware&lt;/code&gt;, &lt;code&gt;GlobalExceptionMiddleware&lt;/code&gt;, &lt;code&gt;RateLimitMiddleware&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Infrastructure tests: &lt;code&gt;QdrantVectorStore&lt;/code&gt;, &lt;code&gt;AzureAiSearchVectorStore&lt;/code&gt;, all parsers&lt;/li&gt;
&lt;li&gt;Integration tests: chunking strategies, hybrid search, MMR re-ranking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test project targets &lt;code&gt;net10.0&lt;/code&gt; while the production code targets &lt;code&gt;net8.0&lt;/code&gt; — the test framework takes the newer runtime while production stays on the stable LTS version.&lt;/p&gt;




&lt;h2&gt;
  
  
  CI/CD: GitHub Actions → Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Three workflows:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ci.yml&lt;/code&gt; — runs on every PR: &lt;code&gt;dotnet build&lt;/code&gt; + &lt;code&gt;dotnet test&lt;/code&gt;. PRs cannot merge without green CI.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;deploy.yml&lt;/code&gt; — runs on push to main: builds a Docker image, pushes to Azure Container Registry, deploys to Azure Container Apps via &lt;code&gt;az containerapp update&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;swa-deploy.yml&lt;/code&gt; — runs on push to main: deploys the Blazor WASM output to Azure Static Web Apps.&lt;/p&gt;

&lt;p&gt;One important Azure Container Apps gotcha: pushing a new &lt;code&gt;:latest&lt;/code&gt; image does not automatically restart existing revisions. You must force a new revision with &lt;code&gt;--revision-suffix&lt;/code&gt;. The CI pipeline does this explicitly.&lt;/p&gt;
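&lt;p&gt;The deploy step therefore looks something like this (resource and registry names are placeholders; the suffix just has to be unique per deploy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Force a new revision so the fresh :latest image is actually pulled
az containerapp update \
  --name rag-api \
  --resource-group rag-rg \
  --image ragregistry.azurecr.io/rag-api:latest \
  --revision-suffix "sha-${GITHUB_SHA::8}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;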

&lt;p&gt;The Dockerfile is multi-stage: a build stage with the .NET SDK image, a publish stage, and a final runtime stage using the ASP.NET Core runtime image (~220 MB).&lt;/p&gt;
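&lt;p&gt;In outline it looks like this (project names are placeholders, and the build and publish stages are collapsed into one here for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dockerfile"&gt;&lt;code&gt;# Build stage: full .NET SDK
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish src/RagApi/RagApi.csproj -c Release -o /app/publish

# Final stage: ASP.NET Core runtime only (much smaller image)
FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "RagApi.dll"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;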




&lt;h2&gt;
  
  
  What the Python World Has That We Need to Build
&lt;/h2&gt;

&lt;p&gt;Intellectual honesty: here is what I had to build or wire together that Python's ecosystem gives you out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Document loaders&lt;/strong&gt; — PdfPig, DocumentFormat.OpenXml, and a custom chunking pipeline. LangChain has 50+ loaders. We have three parsers. Good enough for 90% of use cases; extensible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model variety&lt;/strong&gt; — Python can swap to any HuggingFace model running locally. In .NET, Ollama is the practical local option. It works well, but your model selection is narrower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid prototyping&lt;/strong&gt; — Jupyter notebooks with real-time output remain Python's killer app for exploration. .NET interactive notebooks exist but are less mature.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For production API work, these gaps matter less than they sound.&lt;/p&gt;




&lt;h2&gt;
  
  
  What .NET Gets Right That Python Usually Doesn't
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type safety across the entire stack&lt;/strong&gt; — from the vector store interface to the controller to the DTO. No silent dict type errors at runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency injection that enforces architecture&lt;/strong&gt; — the DI lifetime system (Singleton/Scoped/Transient) makes lifetime violations a runtime error, not a subtle bug. Python has no equivalent guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IAsyncEnumerable&amp;lt;T&amp;gt;&lt;/code&gt; for streaming&lt;/strong&gt; — first-class language support for async streams makes the SSE pipeline clean and composable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EF Core migrations&lt;/strong&gt; — &lt;code&gt;MigrateAsync()&lt;/code&gt; at startup, automatic schema evolution, full LINQ query support. SQLAlchemy is good; EF Core is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testability by default&lt;/strong&gt; — interfaces, constructor injection, and Moq make mocking AI dependencies straightforward. The 279 tests run in ~8 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production runtime&lt;/strong&gt; — ASP.NET Core's performance is consistently near the top of TechEmpower benchmarks. Your RAG API will not be the bottleneck.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the interface, not the implementation.&lt;/strong&gt; &lt;code&gt;IVectorStore&lt;/code&gt;, &lt;code&gt;IChatService&lt;/code&gt;, and &lt;code&gt;IEmbeddingService&lt;/code&gt; were defined before any concrete implementation. This forced the architecture to stay clean and made every provider swappable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DI lifetimes are architecture decisions.&lt;/strong&gt; Making &lt;code&gt;IVectorStore&lt;/code&gt; Singleton was the right call (gRPC channel reuse), but it forced the &lt;code&gt;collectionName&lt;/code&gt; parameter pattern. Understanding &lt;em&gt;why&lt;/em&gt; that tradeoff exists is more important than the pattern itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid search is not optional.&lt;/strong&gt; Pure semantic search fails on exact terms, acronyms, and proper nouns. Hybrid with RRF costs one extra Qdrant call per query and meaningfully improves recall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your retrieval separately from your LLM.&lt;/strong&gt; &lt;code&gt;RagService.SearchAsync&lt;/code&gt; is a pure retrieval method that returns ranked chunks with no LLM call. Write tests against it. Your prompting and your retrieval are separate problems.&lt;/p&gt;
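&lt;p&gt;Such a test needs no LLM behaviour at all. A sketch using xUnit and Moq, with assumed signatures (&lt;code&gt;CreateRagService&lt;/code&gt; is a hypothetical helper that wires the other mocks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;[Fact]
public async Task SearchAsync_ReturnsRankedChunks_WithoutTouchingTheLlm()
{
    var mockChat = new Mock&lt;IChatService&gt;();      // will assert zero calls
    var sut = CreateRagService(chat: mockChat);   // other deps mocked inside

    var results = await sut.SearchAsync("overlapping risk factors", topK: 5);

    Assert.True(results.Count &lt;= 5);
    // Retrieval quality is asserted on ordering, not on generated text
    Assert.Equal(results.OrderByDescending(r =&gt; r.Score), results);
    mockChat.VerifyNoOtherCalls();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;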

&lt;p&gt;&lt;strong&gt;The Python/AI ecosystem is ahead on research, not on production engineering.&lt;/strong&gt; For a maintainable, tested, observable API that a .NET team can own and operate — .NET is the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 11 — Agentic RAG:&lt;/strong&gt; A ReAct loop where the agent plans retrieval, chooses tools (&lt;code&gt;search_documents&lt;/code&gt;, &lt;code&gt;compare_chunks&lt;/code&gt;, &lt;code&gt;answer_directly&lt;/code&gt;), and reasons across iterations. &lt;code&gt;POST /api/agent/query&lt;/code&gt; + SSE streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 12 — Expanded Document Intelligence:&lt;/strong&gt; URL ingestion, XLSX/CSV parsing, PDF table extraction, auto-tagging via LLM, document summarization with caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 13 — Analytics &amp;amp; Observability:&lt;/strong&gt; QueryLog entity, cost estimation per query, OpenTelemetry + Application Insights, Prometheus &lt;code&gt;/metrics&lt;/code&gt; for Grafana.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is buildable in .NET. All of it will have tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The API is live. The UI is live. The code is open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Argha713/dotnet-rag-api" rel="noopener noreferrer"&gt;https://github.com/Argha713/dotnet-rag-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are a .NET developer who has been told that AI work requires Python — I hope this project makes that claim feel a lot less true.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading. If this was useful, drop a reaction or a comment — it helps.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>ai</category>
      <category>rag</category>
    </item>
    <item>
      <title>SQLite on Azure Files SMB: A Debugging Story With a Humbling Ending</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 03 Mar 2026 17:59:38 +0000</pubDate>
      <link>https://dev.to/argha_dev/sqlite-on-azure-files-smb-a-debugging-story-with-a-humbling-ending-1p93</link>
      <guid>https://dev.to/argha_dev/sqlite-on-azure-files-smb-a-debugging-story-with-a-humbling-ending-1p93</guid>
      <description>&lt;h1&gt;
  
  
  SQLite on Azure Files SMB: A Debugging Story With a Humbling Ending
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Three hours of debugging. One line of code. A lesson I won't forget.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was building a &lt;strong&gt;.NET 8 RAG API&lt;/strong&gt; deployed to &lt;strong&gt;Azure Container Apps&lt;/strong&gt;. The idea was simple: use SQLite with EF Core to store document chunks and embeddings, mount the database file from &lt;strong&gt;Azure Files&lt;/strong&gt; so it would survive container restarts.&lt;/p&gt;

&lt;p&gt;Simple plan. Clean architecture. What could go wrong?&lt;/p&gt;

&lt;p&gt;Everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;The container kept restarting. Over and over. The log was painfully consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQLite Error 5: database is locked.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No crash stack. No underlying exception. Just that one line, mocking me every single time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 1 — Retry Logic
&lt;/h2&gt;

&lt;p&gt;My first instinct: this is a transient lock. Maybe another process briefly holds it during startup. Classic race condition.&lt;/p&gt;

&lt;p&gt;I added exponential backoff — &lt;code&gt;2s → 4s → 8s → 16s → 32s&lt;/code&gt; — with five retries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;retryPolicy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SqliteException&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SqliteErrorCode&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitAndRetryAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sleepDurationProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TimeSpan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FromSeconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;onRetry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeSpan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LogWarning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SQLite locked. Retry {Attempt} in {Delay}s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeSpan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalSeconds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container dutifully retried five times, logged politely, then gave up with the same error.&lt;/p&gt;

&lt;p&gt;Not a transient lock. Something deeper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 2 — Journal Mode
&lt;/h2&gt;

&lt;p&gt;I started digging. EF Core's &lt;code&gt;EnsureCreatedAsync&lt;/code&gt; runs &lt;code&gt;PRAGMA journal_mode = 'wal'&lt;/code&gt; when it creates a new database. WAL (Write-Ahead Logging) mode requires &lt;strong&gt;memory-mapped I/O&lt;/strong&gt; and &lt;strong&gt;POSIX byte-range locking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Azure Files SMB supports neither.&lt;/p&gt;

&lt;p&gt;The write would hang for 30 seconds, then time out with — you guessed it — &lt;code&gt;SQLite Error 5&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Solution: set &lt;code&gt;Journal Mode=Delete&lt;/code&gt; in the connection string to prevent WAL from ever being set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Source=/mnt/azure/ragapi.db;Journal Mode=Delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Except:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ArgumentException: Connection string keyword 'journal mode' is not supported.
Microsoft.Data.Sqlite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Microsoft.Data.Sqlite&lt;/code&gt; only accepts a handful of valid keywords. &lt;code&gt;Journal Mode&lt;/code&gt; is not one of them. You have to set it via &lt;code&gt;PRAGMA&lt;/code&gt; after opening the connection.&lt;/p&gt;

&lt;p&gt;Dead end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 3 — Pre-Create the Database File
&lt;/h2&gt;

&lt;p&gt;EF Core skips the &lt;code&gt;Create()&lt;/code&gt; step (which sets WAL mode) if the database file &lt;strong&gt;already exists&lt;/strong&gt;. So I came up with a workaround:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a raw &lt;code&gt;SqliteConnection&lt;/code&gt; before &lt;code&gt;EnsureCreatedAsync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manually run &lt;code&gt;PRAGMA journal_mode=DELETE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Close it&lt;/li&gt;
&lt;li&gt;Let EF Core run &lt;code&gt;EnsureCreatedAsync&lt;/code&gt; — it sees the file, skips &lt;code&gt;Create()&lt;/code&gt;, goes straight to &lt;code&gt;CreateTablesAsync()&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Pre-create the DB and set journal mode before EF Core touches it&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SqliteConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OpenAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateCommand&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommandText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"PRAGMA journal_mode=DELETE;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteNonQueryAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;preConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CloseAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Now let EF Core take over&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;dbContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EnsureCreatedAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;EF Core skipped &lt;code&gt;Create()&lt;/code&gt;. ✅&lt;/p&gt;

&lt;p&gt;It still failed. ❌&lt;/p&gt;

&lt;p&gt;Turns out &lt;code&gt;COMMIT&lt;/code&gt; inside &lt;code&gt;CreateTablesAsync()&lt;/code&gt; &lt;strong&gt;also&lt;/strong&gt; hits the SMB locking issue. The problem wasn't just WAL mode — it was the entire SMB locking model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix — Accept Reality
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Source=ragapi.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Switch to local ephemeral storage inside the container. No mount, no SMB, no network. Standard POSIX locking. Works instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// appsettings.json&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"ConnectionStrings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"DefaultConnection"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Data Source=ragapi.db"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documents reset on container restart. For a demo, that's perfectly fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Went Wrong
&lt;/h2&gt;

&lt;p&gt;SQLite's locking model is built on one assumption: &lt;strong&gt;a local filesystem with proper POSIX byte-range locking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;SMB (Server Message Block) — the protocol Azure Files uses — doesn't provide that. Neither do many NFS implementations. Neither do most cloud-mounted volumes.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;documented behaviour&lt;/strong&gt;, not a bug. Right there in the SQLite FAQ:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"SQLite uses reader/writer locks to control access to the database. [...] If the filesystem does not support POSIX advisory locks, SQLite cannot properly serialize concurrent database accesses."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn't read it. I assumed "it's just a file, it'll work anywhere." It doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Options for SQLite Persistence on Serverless Containers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1 — Accept Ephemeral Storage
&lt;/h3&gt;

&lt;p&gt;Use local storage (&lt;code&gt;Data Source=ragapi.db&lt;/code&gt;). For demos, prototypes, or read-heavy apps where data can be regenerated, this is the simplest and most reliable choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2 — Startup Migration from Blob Storage
&lt;/h3&gt;

&lt;p&gt;On container startup, copy the database file from Azure Blob Storage to local disk, use it locally, and optionally write it back on shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Startup: copy DB from blob to local&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;BlobClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"databases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;DownloadToAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/app/ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// On graceful shutdown: push it back&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;blobClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UploadAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/app/ragapi.db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
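&lt;p&gt;One way to wire the write-back step is the host's shutdown hook. A sketch, assuming an ASP.NET Core minimal-API app and the &lt;code&gt;Azure.Storage.Blobs&lt;/code&gt; package; the configuration key and paths are illustrative:&lt;/p&gt;

```csharp
using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Illustrative config key for the storage account connection string.
var storageConnection = builder.Configuration["Storage:ConnectionString"];

var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();
lifetime.ApplicationStopping.Register(() =>
{
    // Deliberately synchronous: ApplicationStopping callbacks should finish
    // before the host continues shutting down, or the container may exit
    // before the upload completes.
    var blobClient = new BlobClient(storageConnection, "databases", "ragapi.db");
    blobClient.Upload("/app/ragapi.db", overwrite: true);
});

app.Run();
```

&lt;p&gt;Caveat: a container that is killed without a graceful shutdown (OOM, abrupt scale-in) never runs this callback, so the most recent writes can still be lost. Option 2 softens the durability problem; it doesn't solve it.&lt;/p&gt;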



&lt;h3&gt;
  
  
  Option 3 — Switch to a Proper Networked Database
&lt;/h3&gt;

&lt;p&gt;If you need persistence, concurrency, and reliability in a containerised environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL&lt;/strong&gt; (managed SQL Server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Database for PostgreSQL&lt;/strong&gt; with &lt;code&gt;pgvector&lt;/code&gt; extension (great for RAG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Cosmos DB&lt;/strong&gt; (document storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EF Core supports all of these with minimal provider swapping.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;SQLite and network file systems are fundamentally incompatible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not partially. Not in certain modes. &lt;strong&gt;Fundamentally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lost three hours to a problem that was documented, well-known, and completely avoidable. The fix was one line. The lesson cost an afternoon.&lt;/p&gt;

&lt;p&gt;If you're building RAG APIs in .NET, my recommendation: &lt;strong&gt;start with PostgreSQL + pgvector&lt;/strong&gt;. It's containerisation-friendly, Azure-native, and EF Core's pgvector support (via &lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL&lt;/code&gt; plus the &lt;code&gt;Pgvector.EntityFrameworkCore&lt;/code&gt; package) is excellent.&lt;/p&gt;

&lt;p&gt;SQLite is a phenomenal database — for local dev, testing, and embedded apps. Just not on a network mount.&lt;/p&gt;
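&lt;p&gt;For a sense of what the PostgreSQL route looks like, here is a minimal pgvector mapping. A sketch, assuming the &lt;code&gt;Npgsql.EntityFrameworkCore.PostgreSQL&lt;/code&gt; and &lt;code&gt;Pgvector.EntityFrameworkCore&lt;/code&gt; packages; the entity shape, connection string, and 1536 dimension are illustrative:&lt;/p&gt;

```csharp
using Microsoft.EntityFrameworkCore;
using Pgvector;
using Pgvector.EntityFrameworkCore;

public class Chunk
{
    public int Id { get; set; }
    public string Text { get; set; } = "";
    public Vector Embedding { get; set; } = null!; // pgvector column
}

public class RagDbContext : DbContext
{
    public DbSet<Chunk> Chunks => Set<Chunk>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        // Illustrative connection string; UseVector() enables pgvector mapping.
        => options.UseNpgsql("Host=localhost;Database=rag", o => o.UseVector());

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.HasPostgresExtension("vector");
        modelBuilder.Entity<Chunk>()
            .Property(c => c.Embedding)
            .HasColumnType("vector(1536)"); // match your embedding model's dimension
    }
}
```

&lt;p&gt;Nearest-neighbour retrieval then becomes an ordinary LINQ &lt;code&gt;OrderBy&lt;/code&gt; on &lt;code&gt;Embedding.CosineDistance(...)&lt;/code&gt;, executed inside PostgreSQL — no network-mounted database file, and no locking model to fight.&lt;/p&gt;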




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I tried&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exponential backoff retry&lt;/td&gt;
&lt;td&gt;❌ Not a transient lock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Journal Mode=Delete&lt;/code&gt; in connection string&lt;/td&gt;
&lt;td&gt;❌ Not a valid &lt;code&gt;Microsoft.Data.Sqlite&lt;/code&gt; keyword&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-create DB + set &lt;code&gt;PRAGMA journal_mode=DELETE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;❌ &lt;code&gt;COMMIT&lt;/code&gt; still fails on SMB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch to local &lt;code&gt;Data Source=ragapi.db&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅ Works instantly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Azure Files SMB doesn't support POSIX byte-range locking. SQLite requires it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use local ephemeral storage, copy from Blob on startup, or switch to PostgreSQL.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: &lt;code&gt;dotnet&lt;/code&gt; &lt;code&gt;csharp&lt;/code&gt; &lt;code&gt;sqlite&lt;/code&gt; &lt;code&gt;azure&lt;/code&gt; &lt;code&gt;rag&lt;/code&gt; &lt;code&gt;entityframework&lt;/code&gt; &lt;code&gt;debugging&lt;/code&gt; &lt;code&gt;cloudnative&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>azure</category>
      <category>sqlite</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
