
K Yadav


I Built a Vector Database Project from Scratch — Here’s What Actually Happened

A few weeks ago, I decided to stop just reading about vector databases and actually build something with them.

Not a tutorial clone. Not a copy-paste project.

Something messy, slightly broken, and real.

This is a write-up of what I built, what I learned the hard way, and what I would not do again.



Why I Started This

Everywhere I looked, people were talking about embeddings, semantic search, and AI-powered retrieval.

But most tutorials felt shallow. They showed how to use a library, not why things work or what breaks when you go off-script.

So I set a simple goal:

Build a small vector database project that actually solves a problem.


The Project Idea

I built a semantic search engine for personal notes.

Instead of keyword search, I wanted to search like this:

  • “notes about scaling backend”
  • “ideas I wrote about startups”
  • “that thing about caching I wrote last week”

Even if those exact words weren’t in the note.


The Basic Setup

Here’s what I used:

  • Python
  • An embedding model (I started with OpenAI embeddings, later tried local ones)
  • A vector store (initially FAISS)
  • A simple API layer

The flow looked like this:

  1. Take a note
  2. Convert it into an embedding (vector)
  3. Store it in the database
  4. When searching:
  • Convert query to embedding
  • Find closest vectors
  • Return matching notes
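The flow above can be sketched end to end. This is a minimal, assumption-heavy version: `embed()` here is a toy stand-in for a real embedding model (OpenAI, a local sentence-transformer, etc.), and the "database" is just a NumPy matrix searched by brute force instead of FAISS.

```python
import zlib

import numpy as np

# NOTE: embed() is a toy stand-in for a real embedding model, used only so
# the flow is runnable. It hashes each word into one of `dim` buckets --
# there are no semantics here at all.
def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Steps 1-3: take each note, convert it into a vector, store it.
notes = [
    "How to scale a backend with load balancers",
    "Startup ideas worth exploring this year",
    "Caching strategies for read-heavy APIs",
]
index = np.stack([embed(n) for n in notes])

# Step 4: convert the query to a vector, find the closest stored vectors,
# return the matching notes.
def search(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)  # cosine similarity (rows are unit length)
    ranked = np.argsort(scores)[::-1][:k]
    return [notes[i] for i in ranked]

print(search("notes about scaling backend"))
```

Swapping the matrix for a FAISS index changes the storage and lookup calls, not the shape of the flow.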

Simple in theory.

Not so simple in practice.


What I Learned (The Real Stuff)

1. Embeddings Are Not Magic

At first, I thought:

“Same meaning = always similar vectors”

Not true.

Small wording differences sometimes gave weird results.

Example:

  • “How to scale a backend” worked well
  • “Handling traffic spikes” returned unrelated notes

Lesson:
You can't rely on embeddings alone. You sometimes also need:

  • better chunking
  • metadata
  • re-ranking

2. Chunking Matters More Than I Expected

Initially, I stored full notes as single vectors.

Bad idea.

Long notes diluted meaning.

When I switched to smaller chunks (like 200–500 words), search improved a lot.

But then a new problem appeared:

Too many chunks = noisy results

So there’s a balance:

  • Too big → vague results
  • Too small → fragmented context
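A simple way to hold that balance is word-based chunking with a small overlap, so sentences cut at a boundary still appear whole in one chunk. The sizes here are illustrative, not tuned values:

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of ~`size` words, with `overlap`
    words repeated between neighbours so boundary context isn't lost."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

Raising `size` pushes you toward the "vague results" end; lowering it pushes toward "fragmented context" — which is exactly the trade-off above.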

3. Vector Search Alone Is Not Enough

I assumed nearest neighbor search would solve everything.

It didn’t.

Sometimes results were “technically similar” but not useful.

What helped:

  • adding metadata filters
  • sorting by both similarity and recency
  • sometimes even mixing keyword search
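Those three fixes can be folded into one blended score. This is a sketch, not the project's actual ranking: the weights and the 30-day recency decay are arbitrary numbers I'm using for illustration, and you'd tune them against real queries.

```python
import math
import time

def hybrid_score(similarity: float, created_at: float, query: str, text: str,
                 now=None, w_sim: float = 0.7, w_recency: float = 0.2,
                 w_kw: float = 0.1) -> float:
    """Blend vector similarity with recency decay and raw keyword overlap.
    All weights are illustrative defaults, not tuned values."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - created_at) / 86400)
    recency = math.exp(-age_days / 30)  # decays to ~0.37 after a month
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    keyword = len(q_words & t_words) / len(q_words) if q_words else 0.0
    return w_sim * similarity + w_recency * recency + w_kw * keyword
```

Sorting candidates by this score instead of raw similarity is what let "technically similar" results lose to results that were actually useful.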

4. Performance Sneaks Up on You

At the beginning, everything felt fast.

Then I added more data.

Search slowed down. Memory usage increased.

Things I learned:

  • Index type matters (FAISS has options, and they behave very differently)
  • You don’t notice problems until you scale even a little
  • Testing with 50 records is meaningless

5. Local vs API Embeddings Is a Tradeoff

I tried both:

API-based embeddings

  • Easy
  • High quality
  • Costs money

Local models

  • Free after setup
  • Slower (on my machine)
  • Sometimes lower quality

There’s no “best” option. It depends on:

  • your budget
  • latency requirements
  • privacy needs

Mistakes I Made (So You Don’t Have To)

Treating It Like a Normal Database

This is not SQL.

You don’t query exact matches. You deal with probabilities.

That mindset shift takes time.


Ignoring Evaluation

At first, I just “felt” like results were good.

That’s dangerous.

You need test queries like:

  • expected input → expected output

Otherwise you’re just guessing.
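Even a tiny recall@k harness beats guessing. A minimal sketch, assuming your search function returns a ranked list of note ids (the note ids and queries below are hypothetical):

```python
def recall_at_k(search_fn, cases, k: int = 3) -> float:
    """Fraction of test cases whose expected note shows up in the top k
    results. `search_fn(query)` must return a ranked list of note ids."""
    hits = sum(1 for query, expected in cases if expected in search_fn(query)[:k])
    return hits / len(cases)

# Hypothetical test cases: expected input -> expected output.
cases = [
    ("notes about scaling backend", "note-12"),
    ("that thing about caching", "note-7"),
]
```

Run it after every change to chunking or ranking; if the number drops, you broke something, no gut feeling required.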


Overengineering Too Early

I wasted time trying:

  • fancy pipelines
  • multiple models
  • complex ranking

When a simple setup worked fine.


Not Logging Results

I didn’t log queries at first.

Big mistake.

Logs help you understand:

  • what users search
  • where results fail
  • patterns you didn’t expect
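The fix is cheap: append one JSON line per search. A minimal sketch (the field names are my own choice, not a standard):

```python
import json
import time

def log_query(path: str, query: str, results: list) -> None:
    """Append one JSON line per search so failed queries can be replayed
    and patterns spotted later."""
    entry = {
        "ts": time.time(),
        "query": query,
        "results": results,        # top-k note ids or titles
        "n_results": len(results),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

A JSONL file is enough at this scale; grep it for queries with `"n_results": 0` and you have an instant list of where search is failing.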

What I’d Do Differently

If I started again, I’d:

  • Start with a very small dataset
  • Add proper evaluation early
  • Keep chunking simple
  • Avoid premature optimization
  • Focus on real use cases, not benchmarks

Final Thoughts

Building this project changed how I think about search systems.

Vector databases are powerful, but they’re not plug-and-play magic.

You still need:

  • good data
  • thoughtful design
  • constant iteration

If you’re planning to build something similar, my advice is simple:

Don’t try to be perfect. Try to be real.

Build something small. Break it. Fix it. Repeat.

That’s where the actual learning happens.


Read in detail how vector databases work:

Understanding Vector Databases for LLM Applications: A Deep Dive

