If you're building AI apps today, you eventually need a Vector Database, whether for RAG (Retrieval-Augmented Generation) or for giving your agents long-term memory.
The current ecosystem’s answer to this problem usually involves one of three things:
- Pay for a cloud service.
- Spin up a massive Rust, Go, or Python Docker container locally (like Qdrant, Chroma, or Milvus).
- Use a bloated wrapper library that pulls in 50MB of dependencies just to do math.
I've been a software engineer for 40 years. I despise bloat. I don't use heavy data-access frameworks or massive client-side libraries, and I wasn't about to spin up a Docker container or rely on Python interop just to do math. At its core, a vector database is just a massive 2D array of floats and a tight math loop.
So, I built Glacier.Vector.
It is a purely native, zero-dependency, hardware-accelerated Vector Database for .NET 10. And it is fast enough to literally hit the physical limits of my motherboard.
Here is how I squeezed every drop of performance out of the .NET runtime.
1. The Goal: Zero Allocations
If you have 100,000 documents, and each has a 1536-dimensional embedding (the OpenAI text-embedding-3-small standard), you are looking at 153.6 million floats (~614 MB of RAM).
If you store this as a float[][] (an array of arrays), the .NET runtime scatters 100,000 separate array objects across the heap, each with its own object header. Your cache locality is ruined, and your GC pause times will be brutal.
Instead, Glacier.Vector uses a zero-copy memory model. It allocates flat arrays in massive chunks, or memory-maps files from disk, leaving the GC almost nothing to track.
When searching, the engine pins the memory and uses fixed pointers and ReadOnlySpan<float>. No bounds checking. No object allocations in the hot path.
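The core idea is easy to sketch. Here is a minimal illustration of the flat layout; the names are mine, not Glacier.Vector's actual internals: one pinned buffer holds every vector, and a read is just a span slice over it.

// One flat, pinned buffer instead of 100,000 small heap arrays.
// Allocating on the Pinned Object Heap means the GC never moves it,
// so the SIMD kernel can take raw pointers for free.
const int Dim = 1536;
const int Count = 100_000;
float[] store = GC.AllocateUninitializedArray<float>(Count * Dim, pinned: true);

// Zero-copy read of vector k: a span slice, no allocation, no copying.
static ReadOnlySpan<float> GetVector(float[] buf, int k, int dim) =>
    new ReadOnlySpan<float>(buf, k * dim, dim);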
2. Pushing the CPU: 4-Way SIMD Unrolling
Searching a vector database requires comparing the user's query against every single document using Cosine Similarity (which, for normalized vectors, is just the Dot Product).
In standard C#, a scalar for loop over 150 million floats crawls: each iteration does one multiply and one add while the CPU's SIMD units sit completely idle.
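For reference, the scalar baseline everyone starts from looks like this; for normalized vectors this dot product is the cosine similarity:

// Naive scalar dot product: one multiply and one add per element.
static float DotScalar(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float sum = 0f;
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * b[i];
    return sum;
}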
To fix this, I wrote a custom compute kernel using .NET hardware intrinsics (System.Runtime.Intrinsics). But just using Vector256 wasn't enough. I unrolled the loop 4-way so the CPU's out-of-order execution pipeline always has four independent Fused-Multiply-Add (FMA) chains in flight:
// Unrolled AVX2 + FMA fast path (32 floats per loop iteration)
var acc0 = Vector256<float>.Zero;
var acc1 = Vector256<float>.Zero;
var acc2 = Vector256<float>.Zero;
var acc3 = Vector256<float>.Zero;
int i = 0;
for (; i <= length - 32; i += 32)
{
    acc0 = Fma.MultiplyAdd(Vector256.Load(pTarget + i), Vector256.Load(pDb + i), acc0);
    acc1 = Fma.MultiplyAdd(Vector256.Load(pTarget + i + 8), Vector256.Load(pDb + i + 8), acc1);
    acc2 = Fma.MultiplyAdd(Vector256.Load(pTarget + i + 16), Vector256.Load(pDb + i + 16), acc2);
    acc3 = Fma.MultiplyAdd(Vector256.Load(pTarget + i + 24), Vector256.Load(pDb + i + 24), acc3);
}
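The epilogue isn't shown above, so here is a hedged sketch of how the accumulators might be folded down (the repo's actual code may differ): sum the four registers, reduce the lanes, then handle any scalar tail.

// Fold the four independent accumulators back together...
var sum = Vector256.Add(Vector256.Add(acc0, acc1), Vector256.Add(acc2, acc3));
// ...and reduce the 8 lanes to a single float.
float dot = Vector256.Sum(sum);

// Scalar tail for dimensions that aren't a multiple of 32.
// (1536 = 48 * 32, so OpenAI embeddings never hit this loop.)
for (; i < length; i++)
    dot += pTarget[i] * pDb[i];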
3. The Benchmark: Hitting the Memory Wall
I ran a benchmark pumping 100,000 vectors (1536 dimensions each) into the engine. Here is the raw console output:
==========================================
Glacier.Vector | SIMD Performance Engine
==========================================
[1] Initializing In-Memory Storage...
Dimensions: 1536
Target Count: 100,000
[2] Generating and loading synthetic vectors...
Done! Loaded 100,000 vectors in 547 ms.
[3] Preparing search query...
[4] Executing SIMD brute-force search...
==========================================
SEARCH COMPLETED IN: 7.152 ms
Vectors scanned: 100,000
Operations/sec: 13,982,298
==========================================
Top 5 Results:
Rank 1 | Score: 0.1056 | ID: 51030 | Meta: Document_Chunk_51030
Rank 2 | Score: 0.1019 | ID: 87632 | Meta: Document_Chunk_87632
Rank 3 | Score: 0.1003 | ID: 52591 | Meta: Document_Chunk_52591
Rank 4 | Score: 0.0994 | ID: 96139 | Meta: Document_Chunk_96139
Rank 5 | Score: 0.0990 | ID: 29879 | Meta: Document_Chunk_29879
Searching the entire database, scanning all 153.6 million floats, took 7.15 milliseconds.
At that speed, the engine is demanding roughly 85 Gigabytes per second of memory bandwidth. The dual-channel DDR5 RAM on my motherboard physically maxes out right around 85-90 GB/s.
I literally cannot make this C# code any faster without buying faster RAM. We hit the physical memory wall.
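The back-of-the-envelope math checks out; here it is as a three-line sanity check you can run yourself:

// Bandwidth implied by the benchmark: bytes scanned / elapsed time.
double bytesScanned = 100_000L * 1536 * sizeof(float); // 614.4 MB, read once
double gbPerSecond = bytesScanned / 0.007152 / 1e9;    // 7.152 ms elapsed
Console.WriteLine($"{gbPerSecond:F1} GB/s");           // ~85.9 GB/s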
4. Built-in AI Integration (MCP)
A vector database isn't useful if AI agents can't talk to it.
Instead of building a bloated REST API or requiring gRPC, I built a native Model Context Protocol (MCP) server directly into the engine. It runs entirely over standard I/O (stdio).
You can configure Claude Desktop, Cursor, or your own autonomous agents to point directly at the Glacier.Vector.Host.dll. The AI instantly understands how to call the add_vector and search_vectors tools via JSON-RPC. Zero Python, zero external API keys, zero network latency.
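To make that concrete, here is a rough sketch of driving the host from plain C# over stdio. The JSON-RPC framing follows the standard MCP tools/call shape, and the argument names are my assumptions, not the host's documented schema (a real client would also send an initialize handshake first):

using System.Diagnostics;

// Launch the MCP host as a child process; all traffic is stdin/stdout.
var psi = new ProcessStartInfo("dotnet", "Glacier.Vector.Host.dll")
{
    RedirectStandardInput = true,
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using var host = Process.Start(psi)!;

// A standard MCP "tools/call" request invoking the search_vectors tool.
// (Argument names here are illustrative assumptions.)
string request = """{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"search_vectors","arguments":{"query":[0.1,0.2,0.3],"top_k":5}}}""";
host.StandardInput.WriteLine(request);
host.StandardInput.Flush();

// MCP stdio transport is newline-delimited JSON: one message per line.
Console.WriteLine(host.StandardOutput.ReadLine());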
Try it out
If you are a .NET developer building RAG pipelines, AI agents, or just dealing with heavy data, and you hate bloated frameworks as much as I do, try it out.
NuGet:
dotnet add package Glacier.Vector
GitHub: ian-cowley/Glacier.Vector
It pairs perfectly with my other recent project, AgentDevKit (a native C# LLM orchestration library).
Drop a star on the repo, throw a few million vectors at the memory storage, and let me know how many milliseconds it takes to saturate your RAM! Let's prove C# belongs in the AI ecosystem.