I replaced my 500MB vector database Docker stack with a 3MB embedded engine

Most vector database tutorials start the same way:

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

That's 500MB+ of Docker image, a running server process, a REST API to talk to, and a container to babysit in production. For what? Storing a few thousand embeddings and doing similarity search.

I've been building AI features for a project where everything runs locally: no cloud, no Docker, no external dependencies. I needed a vector store that I could pip install and forget about. So I built VelesDB, an embedded database written in Rust.

Here's what it looks like in practice.

Setup: one line

pip install velesdb

That's it. No Docker. No config files. No server to start. The entire engine is a ~3MB native binary that ships inside the Python wheel.

Create a database and index documents

import velesdb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

db = velesdb.Database("./my_vectors")
collection = db.create_collection("documents", dimension=384, metric="cosine")

texts = [
    "Transformers use self-attention to process sequences in parallel.",
    "HNSW is a graph-based algorithm for approximate nearest neighbor search.",
    "RAG combines retrieval with generation to ground LLM responses in facts.",
    "Vector databases store high-dimensional embeddings for similarity search.",
    "Knowledge graphs represent relationships between entities as edges.",
    "Local-first software works offline and syncs when connectivity returns.",
    "Embedding models convert text into dense vector representations.",
]

vectors = model.encode(texts).tolist()

collection.upsert([
    {"id": i, "vector": v, "payload": {"text": t}}
    for i, (v, t) in enumerate(zip(vectors, texts))
])

That's a working vector database. On disk, it's just a regular directory with a few files: vectors, indexes, and a WAL for crash recovery. No server PID files, no config sprawl.

Search

query = "How does similarity search work?"
query_vec = model.encode(query).tolist()

results = collection.search(vector=query_vec, top_k=3)

for r in results:
    print(f"score={r['score']:.2f} → {r['payload']['text']}")

score=0.82 → Vector databases store high-dimensional embeddings for similarity search.
score=0.71 → HNSW is a graph-based algorithm for approximate nearest neighbor search.
score=0.64 → Embedding models convert text into dense vector representations.

Standard cosine similarity search. Under the hood, VelesDB uses HNSW (Hierarchical Navigable Small World), the same approximate nearest neighbor algorithm as most production vector databases.
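
For intuition, here's the exact computation HNSW approximates: brute-force cosine scoring over every stored vector. This is just a NumPy sketch of the math, not VelesDB's internals:

import numpy as np

def cosine_top_k(query, vectors, k=3):
    """Exact cosine top-k over all stored vectors (the answer HNSW approximates)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                 # one cosine similarity per stored vector
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]

# Usage: exact scan over the same embeddings indexed above
# exact = cosine_top_k(np.array(query_vec), np.array(vectors), k=3)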

Where it gets interesting: hybrid search

Most embedded vector stores stop at basic similarity search. VelesDB also includes a BM25 full-text index, so you can combine keyword matching with semantic search:

# Pure keyword search (BM25)
results = collection.text_search("vector database embeddings", top_k=3)

# Hybrid: 70% semantic similarity + 30% keyword matching
results = collection.hybrid_search(
    vector=model.encode("fast nearest neighbor algorithms").tolist(),
    query="HNSW vector search algorithm",
    top_k=3,
    vector_weight=0.7
)

This is the same hybrid search pattern that Pinecone and Weaviate's hosted tiers charge for. Here it's built into the engine.
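
Conceptually, vector_weight=0.7 is a convex blend of two normalized score lists. A sketch of that blend (my own illustration of the idea, not VelesDB's internal code; the engine's exact normalization may differ):

def blend_scores(semantic, keyword, vector_weight=0.7):
    """Fuse two {doc_id: score} dicts into one ranking.

    Scores are min-max normalized first so the two scales are comparable;
    a document missing on one side scores 0 there.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    fused = {d: vector_weight * sem.get(d, 0.0)
                + (1.0 - vector_weight) * kw.get(d, 0.0)
             for d in set(sem) | set(kw)}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)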

Batch and multi-query search

If you're building a RAG pipeline, you often need to run multiple queries at once (maybe one per reformulated question). VelesDB handles this natively:

# Parallel batch search
batch_results = collection.batch_search([
    {"vector": model.encode("machine learning models").tolist(), "top_k": 3},
    {"vector": model.encode("graph databases").tolist(), "top_k": 3},
])

# Multi-query with result fusion (Reciprocal Rank Fusion)
fused = collection.multi_query_search(
    vectors=[
        model.encode("vector similarity search").tolist(),
        model.encode("nearest neighbor algorithms").tolist(),
        model.encode("embedding databases").tolist(),
    ],
    top_k=5,
    fusion=velesdb.FusionStrategy.rrf(k=60)
)

RRF is the same fusion technique Elasticsearch uses for its hybrid queries. It combines rankings from multiple queries into a single, more robust result set.
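
The technique itself is a few lines of Python: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in. A standalone sketch:

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first). Returns fused ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse three per-query result lists by document id
# fused_ids = reciprocal_rank_fusion([[1, 3, 6], [3, 1, 4], [6, 3, 1]])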

The feature nobody else has: built-in knowledge graph

This is why I built VelesDB instead of using Chroma or LanceDB. It has a native graph engine alongside the vector store.

graph = db.create_graph_collection("knowledge", dimension=384)

# Store node metadata
graph.store_node_payload(1, {"name": "Python", "type": "language"})
graph.store_node_payload(2, {"name": "Guido van Rossum", "type": "person"})
graph.store_node_payload(3, {"name": "Rust", "type": "language"})
graph.store_node_payload(4, {"name": "VelesDB", "type": "database"})

# Create edges
graph.add_edge({"id": 1, "source": 1, "target": 2, "label": "CREATED_BY",
                "properties": {"year": 1991}})
graph.add_edge({"id": 2, "source": 4, "target": 3, "label": "WRITTEN_IN",
                "properties": {"year": 2024}})
graph.add_edge({"id": 3, "source": 4, "target": 1, "label": "HAS_SDK",
                "properties": {"version": "1.7.2"}})

# Traverse the graph
outgoing = graph.get_outgoing(4)  # What is VelesDB connected to?
for edge in outgoing:
    print(f"VelesDB →[{edge['label']}]→ node {edge['target']}")

# BFS traversal
reachable = graph.traverse_bfs(source_id=4, max_depth=2, limit=10)

Why does this matter? Because GraphRAG (combining vector search with graph traversal) is how you get AI agents that understand relationships, not just similarity. Vector search finds documents that look like your query. Graph traversal finds documents that are connected to your results.

With Qdrant or Pinecone, you'd need to bolt on Neo4j or a separate graph database. Here it's one engine, one pip install.
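
Put together, a GraphRAG retrieval loop looks roughly like this. A sketch against the API shown above, assuming search results carry their ids, that those ids double as graph node ids, and that traverse_bfs returns reachable node ids (all assumptions of this example):

def graph_rag_retrieve(collection, graph, query_vec, top_k=3, max_depth=2):
    # 1. Vector search: documents that *look like* the query
    seeds = collection.search(vector=query_vec, top_k=top_k)
    context_ids = {r["id"] for r in seeds}
    # 2. Graph traversal: documents *connected to* those seeds
    for r in seeds:
        context_ids.update(graph.traverse_bfs(source_id=r["id"],
                                              max_depth=max_depth, limit=10))
    return context_ids  # feed these into the prompt as grounded context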

Real benchmarks

I ran these on the actual VelesDB engine (v1.7.2), not synthetic numbers.

Test config: Intel Core i9-14900KF, 64 GB RAM, Windows 11, Python 3.11 (benchmarked 2026-03-26)

Operation                 10K vectors (384D)    50K vectors (384D)
Bulk insert               ~9,000 vectors/sec    ~5,400 vectors/sec
Search (top-10, avg)      ~438 µs               ~1,463 µs
Search (top-10, p50)      ~409 µs               ~1,117 µs
Search (top-10, p99)      ~1,058 µs             ~3,017 µs
Database size on disk     31 MB                 162 MB

Sub-millisecond search at 10K vectors, ~1ms at 50K. Zero infrastructure, zero network calls.
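
If you want to check these numbers on your own hardware, a minimal harness looks something like this (my own sketch, not the script that produced the table above):

import random
import statistics
import time

def bench_search(collection, dim=384, queries=1000, top_k=10):
    """Time top-k searches against an already populated collection."""
    latencies = []
    for _ in range(queries):
        q = [random.random() for _ in range(dim)]
        start = time.perf_counter()
        collection.search(vector=q, top_k=top_k)
        latencies.append((time.perf_counter() - start) * 1e6)  # microseconds
    latencies.sort()
    print(f"avg={statistics.mean(latencies):.0f}µs  "
          f"p50={latencies[len(latencies) // 2]:.0f}µs  "
          f"p99={latencies[int(len(latencies) * 0.99)]:.0f}µs")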

For comparison, a Qdrant Docker container at rest uses ~200MB of RAM and requires a running gRPC server. VelesDB uses exactly as much memory as your vectors need, and the process exits when your script does.

When to use this (and when not to)

Use VelesDB when:

  • Your dataset fits on a single machine (up to a few hundred thousand vectors)
  • You need offline/local-first capability
  • You can't send data to the cloud (GDPR, healthcare, finance)
  • You want zero infrastructure to manage
  • You're building GraphRAG or need relationship traversal

Use Qdrant/Pinecone/Weaviate when:

  • You need distributed scaling across machines
  • You have millions of vectors with multi-tenant isolation
  • You want a managed service with built-in monitoring

Getting started

pip install velesdb
import velesdb

db = velesdb.Database("./my_data")
collection = db.create_collection("docs", dimension=384, metric="cosine")
collection.upsert([{"id": 1, "vector": [...], "payload": {"text": "hello"}}])
results = collection.search(vector=[...], top_k=5)

Full docs: velesdb.com/en
GitHub: github.com/cyberlife-coder/VelesDB


If you try it, I'd like to hear what you think, especially if you're coming from a Docker-based vector store. What's your current setup, and what made you choose it?
