Архипов Владимир

Posted on Jun 5

Stop Calling External Vector DBs for Every RAG Query: A .NET Embedded Alternative

#ai #csharp #vectordatabase #rag

Latency comparison included | When embedded wins | When cloud still wins

You have a .NET 8 application. You need RAG (Retrieval-Augmented Generation). Your first instinct is to spin up Pinecone, Qdrant, or Weaviate.

They work. But they add:

15–50 ms per query just for network roundtrips

Serialization overhead (JSON/gRPC)

An extra distributed system to monitor

For some applications, this is fine.

For others — high-frequency agent loops, real-time copilots, edge devices, on-prem compliance — it's a dealbreaker.

So I built an embedded vector database that runs entirely inside your .NET process: VectorRAG.Net.

No network. No external service. Microsecond latency.

But I'm not going to tell you it's always better. Let's be honest about when it wins, and when it doesn't.
🔍 What Is VectorRAG.Net?

VectorRAG.Net is a .NET 8+ library that implements a full RAG pipeline inside your application:

Vector search (ANN via LSH + SIMD reranking)

Hybrid search (vector + BM25)

Automatic chunking

Metadata filtering

Snapshot persistence (save/load to file)

Runtime metrics

NuGet: VectorRAG.Net
GitHub: https://github.com/likeslines-maker/VectorRAG.Net

You create an instance and start adding documents — no separate database, no containers, no API keys (for the DB part).
📊 Honest Benchmarks: Embedded vs Cloud vs Others

Test environment: Windows 11, Intel Core i5-11400F, .NET 8, 10k vectors, dim=64 (synthetic for reproducibility)
| Operation | VectorRAG.Net | Pinecone / Qdrant | Other .NET libs |
|---|---|---|---|
| Vector search (TopK=5) | 15 μs | 8-15 ms | 0.5-5 ms |
| Hybrid search | 117 μs | 10-20 ms (limited support) | rarely exists |
| Network overhead | 0 | 10-40 ms | 0 |
| Allocations per query | 5.7 KB | 50-200 KB (JSON) | variable |
| Automatic chunking | built-in | you implement | rare |
| Offline capable | ✅ | ❌ | varies |

Real-world takeaway: For dim=768 (real embeddings), multiply the VectorRAG.Net vector-search latency by 3-5x: 15 μs becomes roughly 45-75 μs. Against the 8-15 ms cloud figure above, that's still roughly 100-300x faster.

But speed isn't everything. Here's when each approach actually makes sense.
✅ When Embedded RAG (VectorRAG.Net) Wins

High-Frequency Retrieval Loops

Scenario: An AI agent that performs 10-50 retrieval steps per user request.

Cloud math: 30 steps × 20 ms = 600 ms just for retrieval.
Embedded math: 30 steps × 0.1 ms = 3 ms.

User experience difference: 600 ms of added wait on every request vs an imperceptible 3 ms.
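
The arithmetic above is easy to rerun with your own numbers. Here is a minimal sketch; the helper name `RetrievalBudgetMs` is made up for illustration, and the per-step latencies are this article's round figures, not measurements from your environment.

```csharp
using System;

// Retrieval-only latency for a request that performs several retrieval steps.
// Hypothetical helper for illustration; plug in your own measured per-step latency.
static double RetrievalBudgetMs(int steps, double perStepMs) => steps * perStepMs;

double cloudMs = RetrievalBudgetMs(steps: 30, perStepMs: 20.0);    // network-bound retrieval
double embeddedMs = RetrievalBudgetMs(steps: 30, perStepMs: 0.1);  // in-process retrieval

Console.WriteLine($"Cloud:    {cloudMs:F0} ms");
Console.WriteLine($"Embedded: {embeddedMs:F0} ms");
Console.WriteLine($"Ratio:    {cloudMs / embeddedMs:F0}x");
```

Swap in the step count and latencies from your own traces before drawing conclusions; the ratio only matters if retrieval is a meaningful share of the total request time.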

Air-Gapped / Offline Systems

Real-world examples:

Defense and intelligence systems (data cannot leave the process)

Avionics (no network in flight)

Medical devices (offline compliance)

Industrial SCADA (network segmentation for safety)

Cloud vector DBs are simply not allowed here.

Edge Devices & Desktop Apps

You ship a single .exe to a customer's laptop or an IoT device. No Docker. No cloud dependency. The database must be inside your binary or a local file.

Embedded RAG works. Cloud doesn't.

Deterministic Latency Requirements

Network calls introduce jitter: one query takes 10 ms, the next 200 ms, the next times out. An in-process call has no network hop, so latency stays in the microsecond range and varies far less.
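
If predictability matters to you, measure percentiles rather than averages. Below is a rough sketch of a p50/p99 measurement loop using `Stopwatch`. The `Measure` helper and the summing workload are stand-ins I made up for illustration; replace the lambda with your real search call.

```csharp
using System;
using System.Diagnostics;

// Times `operation` repeatedly and returns the median and 99th-percentile latency in microseconds.
static (double P50Us, double P99Us) Measure(Action operation, int iterations)
{
    var samples = new double[iterations];
    for (int i = 0; i < iterations; i++)
    {
        long start = Stopwatch.GetTimestamp();
        operation();
        samples[i] = Stopwatch.GetElapsedTime(start).TotalMicroseconds;
    }
    Array.Sort(samples);
    int p99Index = Math.Min(iterations - 1, (int)(iterations * 0.99));
    return (samples[iterations / 2], samples[p99Index]);
}

// Stand-in workload (assumption): summing a 1536-element array instead of a real search.
var data = new float[1536];
float sink = 0;
var (p50, p99) = Measure(() =>
{
    float s = 0;
    foreach (var x in data) s += x;
    sink += s; // keep the result alive so the loop is not optimized away
}, iterations: 10_000);

Console.WriteLine($"p50 = {p50:F2} μs, p99 = {p99:F2} μs (sink = {sink})");
```

A wide gap between p50 and p99 is the jitter described above. For publishable numbers a benchmarking harness is a better tool; this loop is only meant for a quick sanity check.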

Prototyping & Testing

You want to iterate fast — change chunk size, adjust LSH parameters, rebuild the index. With cloud, you're cleaning collections via API, waiting for consistency. With embedded, you recreate the database in milliseconds.
❌ When Cloud Vector DBs (Pinecone/Qdrant/Weaviate) Still Win

Let's be fair. Embedded is not always better.

Multi-Tenant Shared Knowledge Bases

You have one index that needs to be shared across 50 microservices written in different languages (Python, Go, Java, .NET). A centralized cloud vector DB is the right tool.

Massive Scale (Billions of Vectors)

Embedded solutions scale vertically — one machine's RAM. If you need 10 billion vectors, you need distributed sharding, replication, and horizontal scaling. That's cloud territory.

Teams Without .NET Expertise

Your data team knows Python, your infra team knows Go. Forcing them to touch a .NET library for retrieval is unnecessary friction. Stick with a cloud API.

Managed HA and Backups

You don't want to worry about snapshot schedules, disk corruption, or DR. Cloud DBs handle this for you.

Already Invested in Cloud Infrastructure

If you're deep in AWS/GCP/Azure and your whole stack is serverless, adding an in-process .NET library might break your architectural patterns.
🔄 Hybrid Approach That Actually Works

You don't have to choose one. Many teams use both:
| Use Case | Solution |
|---|---|
| Prototyping / local dev | Embedded (VectorRAG.Net) |
| Production low-latency agent loops | Embedded (in the agent process) |
| Shared knowledge base across teams | Cloud (Pinecone/Qdrant) |
| Batch analytics / offline jobs | Embedded (on the compute cluster) |

Same calling code, different backend, provided retrieval sits behind an interface you own. VectorRAG.Net's persistence format is just a file, so you can move snapshots between environments.
🚀 Quick Start (Full Example)

Here's everything you need to get running.
Install

dotnet add package VectorRAG.Net --version 0.1.17

Minimal Working Example

using SlidingRank.FastOps;
using VectorRAG.Net;

// 1. Configure LSH
var lshConfig = new EmbeddingLshConfig(Bands: 24, BitsPerBand: 12, MaxCandidates: 2048);

// 2. Create database (1536-dim embeddings)
var db = new VectorRAGDatabase(dimension: 1536, lshConfig: lshConfig);

// 3. Add your embedding model (OpenAI example)
IEmbeddingModel embedder = new OpenAIEmbeddingModel(apiKey: "sk-...");

// 4. Add a document (auto-chunking)
await db.UpsertTextDocumentAsync(
    externalId: "doc_001",
    text: "Your long document text here...",
    metadata: new DocumentMetadata { Department = "Support" },
    embeddingModel: embedder
);

// 5. Search
var queryVector = await embedder.GenerateEmbeddingAsync("How to reset password?");
var results = db.Search(queryVector, new SearchOptions { TopK = 5 });

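foreach (var result in results)
{
    // Preview at most 100 characters; a plain [..100] slice throws on shorter chunks
    var preview = result.Text[..Math.Min(100, result.Text.Length)];
    Console.WriteLine($"{result.Score:F2}: {preview}");
}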

That's it. No external database. No containers. Just code.
📈 Performance Deep Dive
Why It's Fast

No serialization — vectors stay as float[] in memory

SIMD reranking — exact distance calculation uses CPU vector instructions

LSH candidate filtering — only rerank ~2000 candidates, not all vectors

Pooled arrays — minimal GC pressure in hot path
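
To make the two-stage idea concrete, here is a self-contained sketch of LSH candidate filtering followed by an exact SIMD rerank. It is not VectorRAG.Net's implementation. I'm assuming random-hyperplane (sign) LSH with banding, and the names `Bands`, `BitsPerBand` and `MaxCandidates` only mirror the config parameters by analogy. The sizes are demo-sized, not tuned.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

const int Dim = 64, Bands = 8, BitsPerBand = 8, MaxCandidates = 256; // demo values (assumption)
var rng = new Random(42);

// SIMD dot product: Vector<float>.Count lanes per step, then a scalar tail.
static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    var acc = Vector<float>.Zero;
    int width = Vector<float>.Count, i = 0;
    for (; i <= a.Length - width; i += width)
        acc += new Vector<float>(a.Slice(i, width)) * new Vector<float>(b.Slice(i, width));
    float sum = Vector.Sum(acc);
    for (; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

// Gaussian components (Box-Muller), scaled to unit length.
float[] RandomUnitVector()
{
    var v = new float[Dim];
    for (int i = 0; i < Dim; i++)
        v[i] = (float)(Math.Sqrt(-2.0 * Math.Log(1.0 - rng.NextDouble()))
                       * Math.Cos(2.0 * Math.PI * rng.NextDouble()));
    float norm = MathF.Sqrt(Dot(v, v));
    for (int i = 0; i < Dim; i++) v[i] /= norm;
    return v;
}

// One random hyperplane per signature bit; the sign of the projection is the bit.
var planes = new float[Bands * BitsPerBand][];
for (int p = 0; p < planes.Length; p++) planes[p] = RandomUnitVector();

int BandKey(float[] v, int band)
{
    int key = 0;
    for (int bit = 0; bit < BitsPerBand; bit++)
        if (Dot(v, planes[band * BitsPerBand + bit]) >= 0f) key |= 1 << bit;
    return key;
}

var vectors = new List<float[]>();
var tables = new Dictionary<int, List<int>>[Bands]; // one hash table per band
for (int t = 0; t < Bands; t++) tables[t] = new Dictionary<int, List<int>>();

void Add(float[] v)
{
    vectors.Add(v);
    for (int band = 0; band < Bands; band++)
    {
        int key = BandKey(v, band);
        if (!tables[band].TryGetValue(key, out var bucket))
            tables[band][key] = bucket = new List<int>();
        bucket.Add(vectors.Count - 1);
    }
}

List<(int Id, float Score)> Search(float[] q, int topK)
{
    // Stage 1: union of the query's buckets across bands, capped at MaxCandidates.
    var candidates = new HashSet<int>();
    for (int band = 0; band < Bands; band++)
        if (tables[band].TryGetValue(BandKey(q, band), out var bucket))
            foreach (int id in bucket)
                if (candidates.Count < MaxCandidates) candidates.Add(id);

    // Stage 2: exact rerank of the candidates only, never the full collection.
    var scored = new List<(int Id, float Score)>();
    foreach (int id in candidates) scored.Add((id, Dot(q, vectors[id])));
    scored.Sort((x, y) => y.Score.CompareTo(x.Score));
    return scored.GetRange(0, Math.Min(topK, scored.Count));
}

for (int n = 0; n < 1000; n++) Add(RandomUnitVector());
var hits = Search(vectors[123], topK: 5);
Console.WriteLine($"Top hit: id={hits[0].Id}, score={hits[0].Score:F3}");
```

Querying with a stored vector should return that vector first, because identical vectors hash to the same bucket in every band. The trade-off is the usual one for LSH: more bands raise recall and candidate count, more bits per band shrink buckets and lower recall.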

Benchmark Numbers (dim=1536, real embeddings)
| Operation | Latency (mean) | Allocations |
|---|---|---|
| Vector search (10k vectors) | 180 μs | 18 KB |
| Hybrid search (BM25 + vector) | 450 μs | 32 KB |
| Insert single vector | 12 μs | 4 KB |
| Save snapshot (10k vectors) | 120 ms | 8 MB |

Throughput: ~5,500 vector searches per second per core.
Scaling Limits

Memory: ~8 bytes per dimension per vector + overhead. 1M vectors × 1,536 dims × 8 bytes ≈ 12 GB RAM.

Search time: O(log N) for LSH candidates + O(candidates × dim) for reranking. 10M vectors: ~2-3 ms.
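
Before committing to a collection size, it helps to run the memory and throughput arithmetic yourself. This is a back-of-the-envelope sketch. `FootprintGB` and `SearchesPerSecond` are hypothetical helpers, the 8 bytes per dimension is this article's figure, and the 4-byte line shows the raw float32 payload for comparison.

```csharp
using System;

// Bytes needed for `vectors` embeddings of `dim` dimensions, in decimal gigabytes.
static double FootprintGB(long vectors, int dim, int bytesPerDim) =>
    (double)vectors * dim * bytesPerDim / 1e9;

// Searches per second on one core if each search takes `meanMicros` on average.
static double SearchesPerSecond(double meanMicros) => 1_000_000.0 / meanMicros;

double oneMillion = FootprintGB(1_000_000, 1536, bytesPerDim: 8);  // article's ~8 bytes/dim
double rawFloats = FootprintGB(1_000_000, 1536, bytesPerDim: 4);   // float32 payload alone
double tenMillion = FootprintGB(10_000_000, 1536, bytesPerDim: 8); // the 10M case, same figure

Console.WriteLine($"1M vectors at 8 B/dim:  {oneMillion:F1} GB");
Console.WriteLine($"1M vectors at 4 B/dim:  {rawFloats:F1} GB");
Console.WriteLine($"10M vectors at 8 B/dim: {tenMillion:F1} GB");
Console.WriteLine($"Throughput at 180 μs:   {SearchesPerSecond(180):F0} searches/s");
```

By this estimate 1M vectors need about 12.3 GB, and the 10M-vector case needs roughly 123 GB of RAM on a single machine. That is worth knowing before you plan around the 2-3 ms search figure.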

🔁 Real-World Agent Loop Example

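public async Task<string> ResearchAsync(string topic)
{
    var findings = new List<string>();

    for (int step = 0; step < 5; step++)
    {
        // Embedding time depends on the model (a hosted API adds its own network call);
        // the in-process search itself takes ~200 microseconds
        var queryVec = await _embedder.GenerateEmbeddingAsync(topic);
        var results = _db.Search(queryVec, new SearchOptions { TopK = 3 });

        // Collect what was retrieved so the final synthesis has material to work with
        var stepTexts = new List<string>();
        foreach (var result in results)
            stepTexts.Add(result.Text);
        if (stepTexts.Count == 0)
            break; // nothing retrieved; stop instead of indexing an empty result
        findings.AddRange(stepTexts);

        // Generate next question
        var nextQuestion = await _llm.GenerateAsync(
            $"Context: {stepTexts[0]}\nWhat to ask next?"
        );

        topic = nextQuestion;
    }

    return await _llm.GenerateAsync($"Synthesize: {string.Join("\n", findings)}");
}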

Latency per iteration: ~200 μs for retrieval, plus the embedding and LLM calls.
Cloud equivalent per iteration: ~20 ms for retrieval, plus the same embedding and LLM calls.

For 5 iterations, that's 1 ms vs 100 ms of pure retrieval overhead.
🧩 Integration with ASP.NET Core

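// Program.cs
var lshConfig = new EmbeddingLshConfig(24, 12, 2048);
var db = new VectorRAGDatabase(1536, lshConfig);

// Load the snapshot once at startup; awaiting here avoids blocking a thread with .Wait()
if (File.Exists("vectors.vdb"))
    await db.LoadAsync("vectors.vdb");

builder.Services.AddSingleton(db);
// An IEmbeddingModel must be registered as well so the controller can resolve it

// Controller
[ApiController]
public class SearchController : ControllerBase
{
    private readonly VectorRAGDatabase _db;
    private readonly IEmbeddingModel _embedder;

    public SearchController(VectorRAGDatabase db, IEmbeddingModel embedder)
    {
        _db = db;
        _embedder = embedder;
    }

    [HttpPost("search")]
    public async Task<IActionResult> Search([FromBody] SearchRequest req)
    {
        var vector = await _embedder.GenerateEmbeddingAsync(req.Query);
        var results = _db.Search(vector, new SearchOptions { TopK = req.TopK });
        return Ok(results);
    }
}

// Request body (shape assumed from how the controller uses it)
public record SearchRequest(string Query, int TopK = 5);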

📦 Persistence: Snapshots, Not a Database Engine

VectorRAG.Net doesn't run a separate storage engine. It saves snapshots:

await db.SaveAsync("knowledge_base.vdb");   // Save
var db2 = new VectorRAGDatabase(1536, lshConfig);
await db2.LoadAsync("knowledge_base.vdb");  // Load elsewhere

This is intentionally simple. You manage:

Snapshot frequency (every hour, every day)

Backup strategy (copy files)

Versioning (keep multiple snapshots)

No schema migrations. No connection strings. No replication config.
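
Since snapshot management is on you, here is one way the frequency, backup and versioning points above could look in practice. This is a sketch, not part of VectorRAG.Net. `WriteSnapshotAsync`, the `kb_` file prefix and the `keep` count are all made up for illustration, and the `save` delegate stands in for a call such as `db.SaveAsync(path)`.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: writes a timestamped snapshot and keeps only the newest `keep` files.
static async Task<string> WriteSnapshotAsync(string dir, Func<string, Task> save, int keep)
{
    Directory.CreateDirectory(dir);
    string finalPath = Path.Combine(dir, $"kb_{DateTime.UtcNow:yyyyMMdd_HHmmss_fff}.vdb");
    string tempPath = finalPath + ".tmp";

    await save(tempPath);                             // write to a temp file first
    File.Move(tempPath, finalPath, overwrite: true);  // then publish under the final name

    // Timestamped names sort chronologically, so ordinal order is age order.
    var expired = Directory.GetFiles(dir, "kb_*.vdb")
        .OrderByDescending(p => p, StringComparer.Ordinal)
        .Skip(keep);
    foreach (var path in expired) File.Delete(path);
    return finalPath;
}

// Demo with a fake save delegate (assumption); in real code pass p => db.SaveAsync(p).
string dir = Path.Combine(Path.GetTempPath(), "vdb_snapshots_" + Guid.NewGuid().ToString("N"));
Func<string, Task> fakeSave = p => File.WriteAllTextAsync(p, "snapshot bytes");

for (int n = 0; n < 5; n++)
{
    await WriteSnapshotAsync(dir, fakeSave, keep: 3);
    await Task.Delay(20); // keep the demo's millisecond timestamps distinct
}

int remaining = Directory.GetFiles(dir, "kb_*.vdb").Length;
Console.WriteLine($"Snapshots kept: {remaining}");
Directory.Delete(dir, recursive: true); // clean up the demo directory
```

Writing to a temp file and renaming means a crash mid-save never leaves a half-written file under a name your loader will pick up. Call the helper from a timer or a background service at whatever interval matches how much data you can afford to lose.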
🆚 Comparison Table: Embedded vs Cloud vs Local DB
| Criterion | VectorRAG.Net | Pinecone/Qdrant | SQLite + Vector |
|---|---|---|---|
| Latency (p50) | 50-200 μs | 10-20 ms | 5-15 ms |
| Network required | No | Yes | No |
| Hybrid search | Yes (BM25) | Limited | DIY |
| Chunking | Built-in | DIY | DIY |
| Multi-language support | .NET only | All | All |
| Horizontal scaling | No (vertical only) | Yes | No |
| Snapshot management | You manage | Built-in | You manage |
| Memory footprint | ~12GB/1M vectors (1536-dim) | Cloud-managed | ~15GB/1M vectors |
| Offline capable | ✅ | ❌ | ✅ |
🎯 Final Verdict

Use VectorRAG.Net when:

You need sub-millisecond retrieval

You're building high-frequency agent loops

Your system must run offline or air-gapped

You're shipping a desktop or edge .NET application

You want to avoid network and serialization overhead

You're prototyping and need fast iteration

Use cloud vector DBs when:

You have multi-language services sharing one index

You're at billion-vector scale with horizontal sharding

You don't want to manage snapshots and backups

Your team doesn't work primarily in .NET

Use both when:

Prototype with embedded, deploy cloud for shared indices

Edge devices use embedded, central knowledge uses cloud

📚 Resources

NuGet: https://www.nuget.org/packages/VectorRAG.Net

GitHub (benchmarks & examples): https://github.com/likeslines-maker/VectorRAG.Net

Documentation: See GitHub README

💬 Let's Discuss

Have you hit latency walls with external vector DBs?

Are you building agent loops that need microsecond retrieval?

Would an embedded approach work for your use case?

Drop a comment below. I'm happy to share implementation details, benchmark scripts, or help with integration.

Tags: dotnet, csharp, ai, rag, performance, vector-database