
Argha Sarkar

Building Production-Grade RAG in .NET 8: Language Is Not a Barrier

Every AI tutorial you find starts with Python. Every LangChain walkthrough, every vector database quickstart, every "build your own ChatGPT" guide — all Python. If you are a .NET developer, you are used to searching for a C# equivalent and finding either a thin wrapper someone wrote last week, a GitHub issue from 2022 asking "is there a .NET SDK?", or nothing at all.

I got tired of that. So I built a full Retrieval-Augmented Generation (RAG) API in .NET 8 from scratch: Clean Architecture, Qdrant vector database, OpenAI/Azure OpenAI/Ollama provider switching, hybrid search with Reciprocal Rank Fusion, MMR re-ranking, multi-tenancy, Server-Sent Events streaming, a Blazor WASM frontend, and 279 tests. Deployed to Azure Container Apps and Azure Static Web Apps.

This article walks through how I built it, and why .NET is a first-class citizen in the AI ecosystem, not a workaround.


What Is RAG and Why Does Architecture Matter?

RAG is a pattern that improves LLM responses by grounding them in your own documents. Instead of relying on the model's training data, you:

  1. Ingest — parse documents, split into chunks, generate vector embeddings, store in a vector database
  2. Retrieve — embed the user's query, find the most similar chunks
  3. Generate — inject those chunks as context into the LLM prompt, get a grounded answer

Most RAG tutorials implement this in ~50 lines of Python using LangChain. That is fine for a demo. For production — where you need testability, provider flexibility, multi-tenancy, and maintainability — architecture matters enormously. And that is where .NET's ecosystem genuinely shines.


The .NET AI Ecosystem in 2025

Before I show the implementation, let us be honest about the landscape.

The packages you actually need:

| Concern | Package | Notes |
| --- | --- | --- |
| Vector DB (Qdrant) | Qdrant.Client 1.12.0 | Official .NET SDK, full gRPC support |
| PDF parsing | PdfPig 0.1.9 | Pure .NET, no native deps |
| DOCX/XLSX parsing | DocumentFormat.OpenXml 3.0.2 | Microsoft's own SDK |
| PostgreSQL / EF Core | Npgsql.EntityFrameworkCore.PostgreSQL 8.0.8 | Rock solid |
| Azure AI Search | Azure.Search.Documents 11.6.0 | Swap Qdrant for Azure |
| Structured logging | Serilog.AspNetCore 8.0.3 | Industry standard |
| Validation | FluentValidation 11.9.2 | Better than DataAnnotations |
| Health checks | Microsoft.Extensions.Diagnostics.HealthChecks | Built-in, excellent |
| OpenAI / Ollama | Direct HttpClient calls | You don't need Semantic Kernel |

The notable absence: I did not use Microsoft Semantic Kernel. Semantic Kernel is a legitimate option, especially if you want abstractions over multiple AI providers and memory stores out of the box. I chose to build the abstractions myself for two reasons: (1) it makes the architecture explicit and teachable, and (2) it demonstrates that you do not need a framework — the primitives are sufficient.

What is genuinely missing compared to Python: LangChain's ecosystem of 300+ integrations. Python dominates experimental ML research. If you need to run a custom fine-tuned model or use bleeding-edge retrieval research, Python is still where that lives first. For production API work with mainstream providers and standard vector databases? .NET is fully capable.


Architecture: Clean Architecture Meets AI

The project follows Clean Architecture strictly across four layers, plus a Blazor WASM client:

Domain          — entities only, zero dependencies
Application     — interfaces + services, depends on Domain
Infrastructure  — Qdrant, OpenAI, PostgreSQL/EF Core, parsers
Api             — ASP.NET Core controllers, middleware, Serilog
BlazorUI        — Blazor WASM frontend

This matters for AI systems specifically because the AI provider is an infrastructure detail. Your business logic (how to chunk, how to rank results, what prompt template to use) should not be coupled to whether you are using OpenAI today and Azure OpenAI tomorrow.

The Application layer defines two key interfaces:

// Application/Interfaces/IEmbeddingService.cs
public interface IEmbeddingService
{
    Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken ct = default);
    int Dimensions { get; }
    string ModelName { get; }
}

// Application/Interfaces/IChatService.cs
public interface IChatService
{
    Task<string> GenerateResponseAsync(string systemPrompt,
        List<ChatMessage> messages, CancellationToken ct = default);
    IAsyncEnumerable<string> GenerateResponseStreamAsync(string systemPrompt,
        List<ChatMessage> messages, CancellationToken ct = default);
    string ModelName { get; }
}

Infrastructure has three concrete implementations of each: OpenAiChatService, AzureOpenAiChatService, and OllamaChatService. DI wires the correct one based on appsettings.json. Switching providers is a config change, not a code change.
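As an illustration, the Program.cs wiring for that switch can look roughly like the following. The `Ai:ChatProvider` key and the use of `ActivatorUtilities` are my assumptions; the project's actual registration code may differ.

```csharp
// Sketch only: select the IChatService implementation from configuration.
// "Ai:ChatProvider" is a hypothetical key name, not necessarily the real one.
var chatProvider = builder.Configuration["Ai:ChatProvider"]; // "OpenAI" | "AzureOpenAI" | "Ollama"

builder.Services.AddSingleton<IChatService>(sp => chatProvider switch
{
    "AzureOpenAI" => ActivatorUtilities.CreateInstance<AzureOpenAiChatService>(sp),
    "Ollama"      => ActivatorUtilities.CreateInstance<OllamaChatService>(sp),
    _             => ActivatorUtilities.CreateInstance<OpenAiChatService>(sp),
});
```

The same pattern applies to IEmbeddingService, which is why swapping providers never touches the Application layer.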


The Vector Store Abstraction

This is where most RAG tutorials stop being useful — they assume a single collection in a single database. In a multi-tenant system, each workspace needs isolated storage.

Here is the IVectorStore interface (the real one from the codebase):

/// <summary>
/// All methods accept collectionName as the first parameter so callers
/// (Scoped services) can pass the workspace's collection without violating
/// the Singleton lifetime of the implementation.
/// </summary>
public interface IVectorStore
{
    Task EnsureCollectionAsync(string collectionName, CancellationToken ct = default);
    Task DeleteCollectionAsync(string collectionName, CancellationToken ct = default);
    Task UpsertChunksAsync(string collectionName,
        List<DocumentChunk> chunks, CancellationToken ct = default);
    Task<List<SearchResult>> SearchAsync(string collectionName,
        float[] queryEmbedding, int topK = 5,
        Guid? filterByDocumentId = null, List<string>? filterByTags = null,
        CancellationToken ct = default);
    Task<List<SearchResult>> SearchWithEmbeddingsAsync(string collectionName,
        float[] queryEmbedding, int topK = 5,
        Guid? filterByDocumentId = null, List<string>? filterByTags = null,
        CancellationToken ct = default);
    Task<List<SearchResult>> KeywordSearchAsync(string collectionName,
        string query, int topK = 5,
        Guid? filterByDocumentId = null, List<string>? filterByTags = null,
        CancellationToken ct = default);
    Task DeleteDocumentChunksAsync(string collectionName,
        Guid documentId, CancellationToken ct = default);
    Task<VectorStoreStats> GetStatsAsync(string collectionName,
        CancellationToken ct = default);
}

There is a critical DI lifetime design decision embedded here. IVectorStore is registered as Singleton — it wraps a gRPC channel that should be long-lived. But the workspace context (which collection to use) is Scoped (per HTTP request).

The solution: pass collectionName as an explicit first parameter to every method. Scoped services resolve it from IWorkspaceContext.Current.CollectionName and pass it in. The Singleton never holds any per-request state. This avoids the classic "Scoped service resolved from root scope" exception that catches .NET developers out.

Two implementations ship: QdrantVectorStore and AzureAiSearchVectorStore. Swap via config. Same interface, same tests.


The RAG Pipeline: Hybrid Search + RRF Fusion + MMR Re-ranking

Plain semantic (embedding) search has a known weakness: it finds conceptually similar chunks but misses exact keyword matches. "What is the RFC 2119 MUST keyword definition?" semantically finds documents about requirements, but keyword search finds the exact definition.

Hybrid search solves this by running semantic and keyword search in parallel and fusing the results. The fusion algorithm is Reciprocal Rank Fusion (RRF):

score(chunk) = Σ  1 / (60 + rank_in_list)

A chunk ranked 1st in semantic and 3rd in keyword scores higher than one ranked 1st in only semantic. Here is the real implementation:

private static List<SearchResult> FuseWithRrf(
    List<SearchResult> semanticResults,
    List<SearchResult> keywordResults,
    int topK,
    int k = 60)
{
    var scores = new Dictionary<Guid, (double Score, SearchResult Result)>();

    void AccumulateRrf(List<SearchResult> results)
    {
        for (int i = 0; i < results.Count; i++)
        {
            var r = results[i];
            var rrfScore = 1.0 / (k + i + 1);
            if (scores.TryGetValue(r.ChunkId, out var existing))
                scores[r.ChunkId] = (existing.Score + rrfScore, existing.Result);
            else
                scores[r.ChunkId] = (rrfScore, r);
        }
    }

    AccumulateRrf(semanticResults);
    AccumulateRrf(keywordResults);

    return scores.Values
        .OrderByDescending(x => x.Score)
        .Take(topK)
        .Select(x => { x.Result.Score = x.Score; return x.Result; })
        .ToList();
}
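A quick numeric sanity check of the fusion, using the conventional k = 60:

```csharp
using System;

// Worked RRF check with k = 60: a chunk ranked 1st in the semantic list and
// 3rd in the keyword list outscores a chunk ranked 1st in only one list.
double k = 60;
double fusedBoth  = 1.0 / (k + 1) + 1.0 / (k + 3); // ≈ 0.01639 + 0.01587
double singleList = 1.0 / (k + 1);                  // ≈ 0.01639
Console.WriteLine($"{fusedBoth:F5} vs {singleList:F5}");
```

The constant 60 dampens the gap between adjacent ranks, so appearing in both lists reliably beats a slightly better rank in one list.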

MMR (Maximal Marginal Relevance) re-ranking tackles a different problem: result redundancy. If your top-5 chunks are all from the same paragraph of the same document, your LLM context window is wasted. MMR balances relevance against diversity:

MMR(chunk) = λ · similarity(chunk, query) - (1-λ) · max_similarity(chunk, selected)

Lambda controls the relevance/diversity tradeoff. Higher lambda = more relevant. Lower = more diverse. The retrieval pipeline wires these together:

// In RagService.RetrieveChunksAsync
if (!useHybrid)
{
    candidates = useReRanking
        ? await _vectorStore.SearchWithEmbeddingsAsync(...)  // needs vectors for MMR
        : await _vectorStore.SearchAsync(...);
}
else
{
    // Hybrid: run semantic + keyword in parallel
    var semanticTask = useReRanking
        ? _vectorStore.SearchWithEmbeddingsAsync(...)
        : _vectorStore.SearchAsync(...);
    var keywordTask = _vectorStore.KeywordSearchAsync(...);

    await Task.WhenAll(semanticTask, keywordTask);

    candidates = FuseWithRrf(semanticTask.Result, keywordTask.Result, candidateCount);
}

if (useReRanking && candidates.Count > 0)
    return MmrReRanker.Rerank(candidates, queryEmbedding, topK, _searchOptions.MmrLambda);

return candidates;
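MmrReRanker itself is not shown above, so here is a minimal self-contained sketch of the greedy MMR loop. This is my reconstruction of the standard algorithm, not the project's code; the real implementation works on SearchResult objects rather than bare vectors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity over raw embedding vectors.
static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
}

// Greedy MMR: at each step pick the candidate maximising
//   lambda * sim(c, query) - (1 - lambda) * max sim(c, alreadySelected).
static List<float[]> MmrRerank(List<float[]> candidates, float[] query, int topK, double lambda)
{
    var selected  = new List<float[]>();
    var remaining = new List<float[]>(candidates);
    while (selected.Count < topK && remaining.Count > 0)
    {
        var best = remaining.OrderByDescending(c =>
                lambda * Cosine(c, query)
                - (1 - lambda) * (selected.Count == 0 ? 0.0 : selected.Max(s => Cosine(c, s))))
            .First();
        selected.Add(best);
        remaining.Remove(best);
    }
    return selected;
}

// With a low lambda, a near-duplicate of the best chunk loses to a diverse one.
var query   = new float[] { 1f, 0f };
var dup     = new float[] { 1f, 0f };
var nearDup = new float[] { 0.99f, 0.1f };
var diverse = new float[] { 0f, 1f };
var picked = MmrRerank(new List<float[]> { dup, nearDup, diverse }, query, topK: 2, lambda: 0.3);
Console.WriteLine(ReferenceEquals(picked[1], diverse)); // True: diversity beat the near-duplicate
```

With lambda = 0.3 the near-duplicate's high similarity to the already-selected chunk pushes its MMR score negative, so the orthogonal chunk wins the second slot.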

Every option is per-request configurable. The caller passes useHybridSearch and useReRanking booleans; if null, the config default is used. This makes A/B testing retrieval strategies trivial.


Streaming with IAsyncEnumerable and SSE

LLM responses are slow. Users notice. Server-Sent Events (SSE) let you stream tokens to the browser as they arrive from the model.

The streaming pipeline uses IAsyncEnumerable<string> all the way from the HTTP client to the controller response — no buffering, no polling:

public async IAsyncEnumerable<StreamEvent> ChatStreamAsync(
    string query,
    List<ChatMessage>? conversationHistory = null,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query, cancellationToken);
    var searchResults = await RetrieveChunksAsync(...);

    // Yield sources FIRST — client renders citations while tokens stream in
    yield return new StreamEvent { Type = "sources", Sources = BuildSources(searchResults) };

    var context = BuildContext(searchResults);
    var systemPrompt = string.Format(SystemPromptTemplate, context);

    // Stream each LLM token as it arrives
    await foreach (var token in _chatService.GenerateResponseStreamAsync(
        systemPrompt, messages, cancellationToken))
    {
        yield return new StreamEvent { Type = "token", Content = token };
    }
}

The controller writes each event to the response with Content-Type: text/event-stream. The Blazor UI consumes it with HttpClient + a manual SSE parser. No SignalR, no WebSockets, no extra infrastructure.
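On the wire, each event is just text. A minimal sketch of the framing: the `event:` and `data:` fields come from the SSE specification, while the concrete payload shapes here are illustrative.

```csharp
using System;

// SSE wire format: "event: <type>\ndata: <payload>\n\n" per event.
// The blank line terminates the event; the client parser splits on it.
static string FormatSseEvent(string eventType, string jsonPayload)
    => $"event: {eventType}\ndata: {jsonPayload}\n\n";

// One "sources" event up front, then a stream of "token" events.
Console.WriteLine(FormatSseEvent("sources", "[{\"title\":\"doc.pdf\"}]"));
Console.WriteLine(FormatSseEvent("token", "{\"content\":\"Hel\"}"));
```

This is why a manual parser on the Blazor side is enough: split the stream on blank lines, read the `event:` type, and JSON-deserialize the `data:` field.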


Multi-tenancy: Per-Workspace Qdrant Collections

The system supports multiple isolated workspaces. Each workspace gets its own Qdrant collection. A workspace is identified by an API key sent in the X-Api-Key header.

The middleware pipeline:

ApiKeyMiddleware → resolves Workspace from DB by hashed key
                → sets IWorkspaceContext.Current for the request scope
                → all downstream services use Current.CollectionName

WorkspaceService.ComputeSha256(key) is the only place API keys are hashed before storage. The plaintext key is never persisted — only shown to the user once at creation. This mirrors standard API key security practices.
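A plausible sketch of that helper, assuming UTF-8 input and an uppercase hex digest; the real ComputeSha256 may differ in those details:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: hash the plaintext key once, store only the digest.
// The plaintext itself is never persisted.
static string ComputeSha256(string apiKey)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(apiKey));
    return Convert.ToHexString(digest); // 64 uppercase hex characters
}

Console.WriteLine(ComputeSha256("abc"));
// BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD
```

Lookup then works by hashing the incoming `X-Api-Key` header value and comparing digests, so a database leak never exposes usable keys.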

When a workspace is created, IVectorStore.EnsureCollectionAsync(collectionName) is called immediately, creating the Qdrant collection with the correct vector dimensions. When a workspace is deleted, DeleteCollectionAsync(collectionName) cascades the cleanup. No manual Qdrant operations required.


Qdrant Reliability: The Auto-Reinitialise Pattern

Qdrant's managed cloud can delete a collection if it has been inactive (free tier). This would cause every vector operation to throw a gRPC RpcException(StatusCode.NotFound). A restart would fix it — but that is a terrible production experience.

The QdrantVectorStore implements an auto-reinitialise pattern:

private async Task<T> ExecuteWithReinitAsync<T>(
    string collectionName,
    Func<Task<T>> operation)
{
    try
    {
        return await operation();
    }
    catch (RpcException ex) when (ex.StatusCode == StatusCode.NotFound)
    {
        _logger.LogWarning("Collection {Name} not found, reinitialising...", collectionName);
        await _initLock.WaitAsync();
        try
        {
            await EnsureCollectionAsync(collectionName);
        }
        finally
        {
            _initLock.Release();
        }
        return await operation();  // retry once
    }
}

A SemaphoreSlim(1,1) ensures that concurrent requests hitting a missing collection do not trigger a thundering herd of reinitialise calls. The operation retries exactly once. No restart required. The collection is back in seconds.


Testing AI Systems in .NET: 279 Tests

This is where Python RAG tutorials truly fall short: most ship with zero tests. Production software needs tests. Here is how AI-dependent code is tested in .NET:

Mock interfaces, never concrete AI classes:

// Good — mock the interface
var mockEmbedding = new Mock<IEmbeddingService>();
mockEmbedding.Setup(e => e.GenerateEmbeddingAsync(It.IsAny<string>(), default))
    .ReturnsAsync(new float[1536]);

// The system under test stays concrete: instantiate RagService with mocked deps
var sut = new RagService(
    mockVectorStore.Object,
    mockEmbedding.Object,
    mockChat.Object,
    mockLogger.Object,
    Options.Create(new SearchOptions()),
    mockWorkspaceContext.Object);

EF Core InMemory for repository tests:

var options = new DbContextOptionsBuilder<RagApiDbContext>()
    .UseInMemoryDatabase(Guid.NewGuid().ToString())  // unique per test
    .Options;
using var context = new RagApiDbContext(options);
var repo = new DocumentRepository(context, workspaceContext);

Test coverage breakdown:

  • Unit tests: RagService, DocumentService, WorkspaceService, all repositories
  • Controller tests: ChatController, DocumentsController, WorkspacesController
  • Middleware tests: ApiKeyMiddleware, GlobalExceptionMiddleware, RateLimitMiddleware
  • Infrastructure tests: QdrantVectorStore, AzureAiSearchVectorStore, all parsers
  • Integration tests: chunking strategies, hybrid search, MMR re-ranking

The test project targets net10.0 while the production code targets net8.0 — the test framework takes the newer runtime while production stays on the stable LTS version.
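In project-file terms that split looks roughly like this; paths and project names are illustrative:

```xml
<!-- tests/Tests.csproj (illustrative) -->
<PropertyGroup>
  <TargetFramework>net10.0</TargetFramework>
</PropertyGroup>

<!-- src/Api/Api.csproj (illustrative) -->
<PropertyGroup>
  <TargetFramework>net8.0</TargetFramework>
</PropertyGroup>
```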


CI/CD: GitHub Actions → Azure Container Apps

Three workflows:

ci.yml — runs on every PR: dotnet build + dotnet test. PRs cannot merge without green CI.

deploy.yml — runs on push to main: builds a Docker image, pushes to Azure Container Registry, deploys to Azure Container Apps via az containerapp update.

swa-deploy.yml — runs on push to main: deploys the Blazor WASM output to Azure Static Web Apps.

One important Azure Container Apps gotcha: pushing a new :latest image does not automatically restart existing revisions. You must force a new revision with --revision-suffix. The CI pipeline does this explicitly.
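The fix looks something like the following; the resource names and suffix scheme are illustrative, but the flags are real `az containerapp update` options:

```shell
# Deploy the freshly pushed image AND force a new revision.
# Without --revision-suffix, ACA may keep serving the old :latest revision.
az containerapp update \
  --name rag-api \
  --resource-group rag-rg \
  --image myregistry.azurecr.io/rag-api:latest \
  --revision-suffix "run-${GITHUB_RUN_NUMBER}"
```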

The Dockerfile is multi-stage: a build stage with the .NET SDK image, a publish stage, and a final runtime stage using the ASP.NET Core runtime image (~220 MB).
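A representative sketch of that layout; the project path and assembly name are illustrative, the base images are the standard Microsoft ones:

```dockerfile
# Build stage: full SDK image, only used to compile and publish.
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish src/Api/Api.csproj -c Release -o /app/publish

# Final stage: slim ASP.NET Core runtime, no SDK on board.
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS final
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "Api.dll"]
```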


What the Python World Has That We Need to Build

Intellectual honesty: here is what I had to build or wire together that Python's ecosystem gives you out of the box:

  1. Document loaders — PdfPig, DocumentFormat.OpenXml, and a custom chunking pipeline. LangChain has 50+ loaders. We have three parsers. Good enough for 90% of use cases; extendable.
  2. Embedding model variety — Python can swap to any HuggingFace model running locally. In .NET, Ollama is the practical local option. It works well, but your model selection is narrower.
  3. Rapid prototyping — Jupyter notebooks with real-time output remain Python's killer app for exploration. .NET interactive notebooks exist but are less mature.

For production API work, these gaps matter less than they sound.


What .NET Gets Right That Python Usually Doesn't

  1. Type safety across the entire stack — from the vector store interface to the controller to the DTO. No silent dict type errors at runtime.
  2. Dependency injection that enforces architecture — the DI lifetime system (Singleton/Scoped/Transient) makes lifetime violations a runtime error, not a subtle bug. Python has no equivalent guard.
  3. IAsyncEnumerable<T> for streaming — first-class language support for async streams makes the SSE pipeline clean and composable.
  4. EF Core migrations — MigrateAsync() at startup, automatic schema evolution, full LINQ query support. SQLAlchemy is good; EF Core is better.
  5. Testability by default — interfaces, constructor injection, and Moq make mocking AI dependencies straightforward. The 279 tests run in ~8 seconds.
  6. Production runtime — ASP.NET Core's performance is consistently near the top of TechEmpower benchmarks. Your RAG API will not be the bottleneck.

Lessons Learned

Start with the interface, not the implementation. IVectorStore, IChatService, and IEmbeddingService were defined before any concrete implementation. This forced the architecture to stay clean and made every provider swappable.

DI lifetimes are architecture decisions. Making IVectorStore Singleton was the right call (gRPC channel reuse), but it forced the collectionName parameter pattern. Understanding why that tradeoff exists is more important than the pattern itself.

Hybrid search is not optional. Pure semantic search fails on exact terms, acronyms, and proper nouns. Hybrid with RRF costs one extra Qdrant call per query and meaningfully improves recall.

Test your retrieval separately from your LLM. RagService.SearchAsync is a pure retrieval method that returns ranked chunks with no LLM call. Write tests against it. Your prompting and your retrieval are separate problems.

The Python/AI ecosystem is ahead on research, not on production engineering. For a maintainable, tested, observable API that a .NET team can own and operate — .NET is the right call.


What's Next

The roadmap includes:

  • Phase 11 — Agentic RAG: A ReAct loop where the agent plans retrieval, chooses tools (search_documents, compare_chunks, answer_directly), and reasons across iterations. POST /api/agent/query + SSE streaming.
  • Phase 12 — Expanded Document Intelligence: URL ingestion, XLSX/CSV parsing, PDF table extraction, auto-tagging via LLM, document summarization with caching.
  • Phase 13 — Analytics & Observability: QueryLog entity, cost estimation per query, OpenTelemetry + Application Insights, Prometheus /metrics for Grafana.

All of this is buildable in .NET. All of it will have tests.


Try It

The API is live. The UI is live. The code is open source.

GitHub: https://github.com/Argha713/dotnet-rag-api

If you are a .NET developer who has been told that AI work requires Python — I hope this project makes that claim feel a lot less true.


Thanks for reading. If this was useful, drop a reaction or a comment — it helps.
