Kurotsuba
Building a Vector Database from Scratch

I built a vector database from scratch in Rust to learn how such databases work under the hood. It is functional, though not production-ready. Building it taught me the basics of vector databases, and I want to write those lessons down.

Introduction

Recently, LLM-powered tools have made a huge difference in how people work, and many of them are based on RAG. As a developer, I wanted to build something customized and useful with LLMs, but I struggled with concepts like embeddings, RAG, and alignment. Building a fundamental piece of RAG systems, such as semantic search, seemed like a good starting point.

Retrieval-Augmented Generation (RAG) is a context-providing layer on top of an LLM, aiming for more precise answers grounded in information supplied by the user. Instead of simply passing the query to the LLM, a RAG system searches a database for related information, combines it with the query, and sends both to the LLM.

Searching for related information with traditional keyword matching can work in some situations, but RAG systems usually use semantic search: retrieving the results whose meaning, not wording, is most similar to the query. To let a machine "understand" the meaning of a string, or at least compare two strings by meaning, embeddings are used. An embedding model converts an input (in our scope, a string) into a fixed-length array of numbers, a vector; the length of that array is called the vector's dimension. After embedding, strings with similar meanings map to vectors that lie close together, so semantic search turns into vector arithmetic.
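As a toy illustration of "closeness in meaning becomes closeness of vectors", here is cosine similarity in Rust. The 3-d vectors are made up for the example; real embeddings come from an ML model and have hundreds of dimensions.

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Pretend these are embeddings of "cat", "kitten", "airplane".
    let cat = [0.9_f32, 0.1, 0.0];
    let kitten = [0.8_f32, 0.2, 0.1];
    let airplane = [0.0_f32, 0.1, 0.9];
    // Semantically closer strings should score higher.
    assert!(cosine_similarity(&cat, &kitten) > cosine_similarity(&cat, &airplane));
}
```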

A vector database is where this vector matching happens. It is the foundation of a RAG system, and it can also power semantic search over an in-house database or other tasks.

       User Query
           │
           ▼
┌───────────────────────┐
│       Embedding       │
│   (Query → Vector)    │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    Vector Database    │  <-- What I built
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│   Retrieve Related    │
│   Documents by ID     │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    Combine Query +    │
│   Retrieved Context   │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│          LLM          │
│   (Generate Answer)   │
└───────────┬───────────┘
            │
            ▼
       Final Answer

What I Built

The project is for learning the basics of vector databases, so to keep it simple I did not add ANN indexing or quantization. It currently handles about 100K 786-d f32 records. There is no hard storage or dimension limit, but more data means slower searches. I implemented the following features:

  • Insert, Delete, Search
  • Save, Load
  • CLI

The code is on GitHub: Kurotsuba/kvdb

Persistence and the CLI are not the point of this project, and to be honest, those parts were mostly written by Claude Code. So I will focus on the implementation of the database itself.

The project is only about vector search, so embeddings and real semantic search are out of scope, although I use them for demonstration. A record consists of an ID and a vector; a semantic search application can use the ID to fetch the related document from other data sources.
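A minimal sketch of what such a record and database could look like. The names `Record` and `VectorDb` are my own for illustration; the actual kvdb types may differ.

```rust
// Hypothetical shapes for illustration; the real kvdb code may differ.
struct Record {
    id: u64,          // points at the full document in some other store
    vector: Vec<f32>, // the embedding, normalized on insert
}

struct VectorDb {
    dim: usize,           // every vector must have this dimension
    records: Vec<Record>, // flat storage, scanned linearly on search
}

fn main() {
    let db = VectorDb {
        dim: 3,
        records: vec![Record { id: 42, vector: vec![0.6, 0.8, 0.0] }],
    };
    assert_eq!(db.records.len(), 1);
    assert_eq!(db.records[0].vector.len(), db.dim);
}
```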

All vectors are normalized on insert. This project uses L2 normalization, which simply divides each component by the vector's geometric length. In 2D:

L2_norm(x, y) = (x / sqrt(x^2 + y^2), y / sqrt(x^2 + y^2))

This brings two benefits. First, normalization keeps records consistent and improves search quality: we don't want "big cat" to score higher against "big dog" than against "tiger" just because of vector magnitudes. Second, after L2 normalization, the cosine similarity calculation reduces to a plain dot product, and CPUs are great at multiply-and-add. Normalization is done on insert so it adds no cost to searches.
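A sketch of these two operations (the function names are mine, not necessarily kvdb's):

```rust
// L2-normalize a vector in place: divide each component by the length.
fn l2_normalize(v: &mut [f32]) {
    let len = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if len > 0.0 {
        for x in v.iter_mut() {
            *x /= len;
        }
    }
}

// For unit-length vectors, cosine similarity is just the dot product:
// a multiply-and-add loop that CPUs execute very efficiently.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut v = vec![3.0_f32, 4.0];
    l2_normalize(&mut v);
    // (3, 4) has length 5, so the normalized vector is (0.6, 0.8).
    assert!((v[0] - 0.6).abs() < 1e-6);
    assert!((v[1] - 0.8).abs() < 1e-6);
    // A unit vector dotted with itself gives similarity 1.
    assert!((dot(&v, &v) - 1.0).abs() < 1e-6);
}
```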

To keep things simple, cosine similarity is the only similarity metric. After L2 normalization this makes more sense than raw distance anyway (for unit vectors, the two produce the same ranking).

For each search, the query vector is first L2-normalized. The database then does a linear scan over the stored data, computes the cosine similarity with each vector, and keeps the top-K results (K is set by the user), which are returned with both their IDs and similarity scores. This is just a brute-force scan, but it works well at small scale: I measured 35 ms per search over 100K 786-d f32 vectors with top-10 results, on an i7-13700KF with 64 GB / 2400 MT/s memory. Adding indexing would let the database scale to millions of records or more, but that is out of scope for this project, at least for now.

Query Vector
     │
     ▼
┌───────────┐
│    L2     │
│ Normalize │
└─────┬─────┘
      │
      ▼
┌───────────┐
│  Linear   │
│   Scan    │──▶ Calculate dot product with each stored vector
└─────┬─────┘
      │
      ▼
┌───────────┐
│   Top-K   │
│  Select   │──▶ Keep K highest similarity scores
└─────┬─────┘
      │
      ▼
   Results
 (ID + Score)
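The steps above can be sketched as follows. This simplified version sorts all scores; my actual code keeps a running top-K during the scan instead, and the names here are illustrative.

```rust
// Brute-force top-K search. Assumes both the stored vectors and the
// query have already been L2-normalized, so dot product == cosine
// similarity. A bounded heap would avoid sorting everything.
fn search(records: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = records
        .iter()
        .map(|(id, v)| (*id, v.iter().zip(query).map(|(a, b)| a * b).sum()))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); // highest first
    scored.truncate(k);
    scored
}

fn main() {
    let records = vec![
        (1_u64, vec![1.0_f32, 0.0]),
        (2, vec![0.0, 1.0]),
        (3, vec![0.707, 0.707]),
    ];
    let query = [1.0_f32, 0.0]; // already unit length
    let top = search(&records, &query, 2);
    // id 1 matches exactly (score 1.0), id 3 is second (score ~0.707).
    assert_eq!(top[0].0, 1);
    assert_eq!(top[1].0, 3);
}
```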

What I Learned

The concepts of RAG, vector databases, and vector math are covered in the Introduction, so here I will talk about the Rust side.

This is my first Rust project after Rustlings. I left the CLI and persistence parts to Claude Code, since they couple heavily with existing crates and the OS and add complexity I was not confident handling. The database and math parts are clear and almost fully decoupled once the logic is understood, so I focused on them. That turned out to be a good decision.

Designing the data layout and the search method was the biggest challenge. It forced me to recall knowledge about cache lines, CPU pipelines, and other relatively low-level topics. I got really excited when I brought the search time down from 100+ ms to 35 ms, and I thank Rust for giving me that level of control. It showed me that high-level design decisions can bring big performance wins. Rust also opens the door to SIMD optimization, which is my next big step.

Also, after building this, I can confidently say I understand how semantic search works. Maybe other fancy things are also conceptually complex but simple in implementation? Building things really speeds up learning, and it gives more confidence than reading alone.

What's Next

As mentioned before, I would like to add SIMD optimization to the hot paths. I have read the generated assembly and found that unrolling and SIMD are not fully utilized, so this will be another dive into low-level development. I hope to get the search time under 20 ms on my machine.

Also, I want to build a semantic search engine, including the embedding part, based on this vector database. Building things layer by layer feels good.

Conclusion

Building a vector database from scratch turned out to be one of the most rewarding ways to learn, both about how RAG systems work and about Rust itself. I hope this post gives you a clearer picture of what goes on inside a vector database beneath today's AI-based tools. If you have questions, suggestions, or spotted something I got wrong, feel free to leave a comment. This is my first tech blog, and I'd love to hear your feedback.
