Vector Databases Explained: What They Don’t Tell You

Everyone working in AI reaches a moment where they search a document and get back something that looks right but means nothing — or search for a concept and get back noise. That moment is when they discover vector databases. This guide covers everything: the math, the architecture, the algorithms, the top tools, and — most importantly — what vector search alone cannot do for an AI agent that needs to remember.

What is a vector? Embeddings from first principles
A vector is just a list of numbers. [0.12, -0.87, 0.44, ...]. What makes vectors powerful for AI is what the numbers represent: the meaning of a piece of content, encoded by a neural network into a point in high-dimensional space.

When an embedding model (like OpenAI’s text-embedding-3-large, Cohere's embed-v3, or a local model like nomic-embed-text) processes a sentence, it outputs a vector of typically 768 to 3072 dimensions. Every dimension captures some latent feature of the content. You don't choose what the dimensions mean - the model learns them during training.

The key property: semantically similar content ends up close together in this space. “The capital of France” and “Paris” produce vectors that are close. “My favourite sandwich” and “Paris” produce vectors that are far apart. Distance in vector space ≈ conceptual distance.

embedding example — node.js

import OpenAI from "openai";

const openai = new OpenAI();

// Every piece of content becomes a point in space
const embedding = await openai.embeddings.create({
  model: "text-embedding-3-large",
  input: "What is a vector database?"
});

// Returns 3072 numbers: the meaning of that sentence
const vector = embedding.data[0].embedding;
// [0.0023, -0.0187, 0.0441, ... 3072 values]

The distance between two vectors is typically measured with cosine similarity (angle between vectors), Euclidean distance (straight-line distance), or dot product (magnitude-weighted angle). Cosine similarity is the most common for text because it ignores vector magnitude and focuses purely on direction — i.e., conceptual alignment.
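To make those three measures concrete, here is a minimal JavaScript sketch — the helper names are illustrative, not from any particular library:

similarity metrics — node.js

// Illustrative helpers for comparing two equal-length vectors
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

function cosineSimilarity(a, b) {
  const magnitude = (v) => Math.sqrt(dot(v, v));
  // 1 = same direction (conceptually aligned), 0 = orthogonal, -1 = opposite
  return dot(a, b) / (magnitude(a) * magnitude(b));
}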

[Charts: typical vector dimensions · ANN search latency at scale · vectors in production deployments]

How vector databases work: ingestion, indexing, retrieval
A vector database has two workflows: ingestion (turning content into stored vectors) and retrieval (finding the most similar vectors to a query). Here’s how each works.

Ingestion pipeline
Step 1 — Embed. Your content (text, images, audio, code) is passed through an embedding model. The output is a dense numerical vector. Each piece of content produces one vector (or, if chunked, several).

Step 2 — Store. The vector is stored alongside its metadata: IDs, timestamps, source, category, or any structured fields you want to filter on later. Most vector databases store vectors and metadata separately in optimised structures.

Step 3 — Index. Raw storage is not enough for fast retrieval. The database builds a vector index that organises vectors so nearest-neighbour search can skip brute-force comparisons. More on this in the next section.
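A rough sketch of the three ingestion steps, assuming an OpenAI embedding model and a self-hosted Qdrant instance with a collection named docs (the collection name, ids, and metadata fields are placeholders):

ingestion sketch — node.js

import OpenAI from "openai";
import { QdrantClient } from "@qdrant/js-client-rest";

const openai = new OpenAI();
const qdrant = new QdrantClient({ url: "http://localhost:6333" });

async function ingest(doc) {
  // Step 1 — Embed the content
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: doc.text,
  });

  // Step 2 — Store the vector alongside its metadata
  // Step 3 — the database updates its vector index on upsert
  await qdrant.upsert("docs", {
    points: [{
      id: doc.id,
      vector: res.data[0].embedding,
      payload: { text: doc.text, userId: doc.userId, createdAt: doc.createdAt },
    }],
  });
}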

Retrieval pipeline
Step 1 — Embed the query. The user’s query is embedded with the same model used at ingestion time. This is critical — mismatched embedding models produce meaningless distance comparisons.

Step 2 — ANN search. The query vector is compared against stored vectors using an Approximate Nearest Neighbour (ANN) algorithm. ANN trades a small amount of accuracy for enormous speed gains — returning the top-k most similar vectors in milliseconds rather than seconds.

Step 3 — Metadata filtering. If you passed filters (e.g., “only documents from this user” or “created after 2025”), the database applies them. Some systems pre-filter (narrow the candidate set before ANN), some post-filter (filter ANN results). Hybrid approaches are becoming standard.

Step 4 — Return ranked results. The top-k results, ranked by similarity score, are returned. Your application decides what to do with them: feed them to an LLM, display them to a user, or use them to trigger further actions.
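Continuing the sketch above, retrieval mirrors ingestion: embed the query with the same model, then run a filtered ANN search against the same hypothetical docs collection:

retrieval sketch — node.js

async function retrieve(query, userId, k = 5) {
  // Step 1 — Embed the query with the same model used at ingestion
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // Steps 2–4 — ANN search, metadata filter, ranked top-k results
  return qdrant.search("docs", {
    vector: res.data[0].embedding,
    limit: k,
    filter: { must: [{ key: "userId", match: { value: userId } }] },
  });
}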

⚙ How RAG uses vector databases

Retrieval-Augmented Generation (RAG) is the most common pattern: embed all your documents at ingestion time, then at query time embed the user’s question, retrieve the top-k most relevant chunks, and pass them to an LLM as context. The LLM generates its answer grounded in retrieved content rather than hallucinating from training data alone.
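A minimal sketch of the pattern, reusing the hypothetical retrieve() function from above (the model name and prompt are placeholders):

rag sketch — node.js

async function answer(question, userId) {
  // Retrieve the top-k most relevant chunks
  const hits = await retrieve(question, userId, 5);
  const context = hits.map((hit) => hit.payload.text).join("\n---\n");

  // Ground the LLM's answer in the retrieved content
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content;
}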

Indexing algorithms: HNSW, IVF, PQ, LSH
The choice of indexing algorithm determines speed, accuracy, and memory usage. Here are the four you’ll encounter most:

HNSW — Hierarchical Navigable Small-World
The dominant algorithm in production vector databases today. HNSW builds a multi-layer graph where each vector is connected to its approximate nearest neighbours. Search starts at the top (sparse) layer and progressively zooms in to the bottom (dense) layer. This gives sub-linear search time — even with 100M vectors, an HNSW query typically completes in single-digit milliseconds.

Best for: High-throughput, low-latency production search. Used by Qdrant, Weaviate, pgvector, and Milvus.
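To see the knobs in practice, here is a small in-memory example assuming the hnswlib-node bindings; M, efConstruction, and efSearch are the parameters that trade recall against speed and memory:

hnsw index — node.js

import { HierarchicalNSW } from "hnswlib-node";

const dim = 384;                                  // match your embedding model
const index = new HierarchicalNSW("cosine", dim);

// maxElements, M (links per node), efConstruction (build-time search width)
index.initIndex(10_000, 16, 200);

// Insert vectors under integer labels (ids)
const randomVector = () => Array.from({ length: dim }, () => Math.random());
for (let label = 0; label < 1_000; label++) index.addPoint(randomVector(), label);

// efSearch: higher = better recall, slower queries
index.setEf(64);
const { neighbors, distances } = index.searchKnn(randomVector(), 10);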

IVF — Inverted File Index
IVF clusters the vector space into Voronoi cells (using k-means). At query time, only the nearest clusters are searched, skipping most of the index. IVF is memory-efficient and highly parallelisable, making it the backbone of large-scale FAISS deployments.

Best for: Very large datasets (billions of vectors) where memory matters more than per-query latency.

PQ — Product Quantisation
PQ compresses vectors by splitting them into sub-vectors and quantising each. A 3072-dimensional float32 vector (12,288 bytes, ~12KB) can be compressed to ~96 bytes — roughly a 128× reduction. The trade-off is a small accuracy loss. PQ is almost always combined with IVF (IVF-PQ) for large-scale deployments that need both speed and memory efficiency.

Best for: Deployments where storing raw vectors at full precision would require dozens of terabytes.
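The arithmetic behind that compression ratio, assuming 96 sub-vectors and 256-entry codebooks (a common but purely illustrative configuration):

pq memory maths — node.js

const dims = 3072;
const rawBytes = dims * 4;                // float32 → 12,288 bytes (~12 KB) per vector

// Split into 96 sub-vectors of 32 dims; each maps to a 1-byte centroid id
// drawn from a 256-entry codebook learned by k-means
const subVectors = 96;
const compressedBytes = subVectors * 1;   // 96 bytes per vector

console.log(rawBytes / compressedBytes);  // 128 — a ~128× reduction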

LSH — Locality-Sensitive Hashing
LSH projects vectors into hash buckets such that similar vectors land in the same bucket with high probability. Approximate search becomes a hash lookup. LSH is fast but less accurate than HNSW for high-dimensional dense vectors, so it’s more commonly used for sparse vectors and certain specialised tasks.

Best for: Very high-dimensional sparse data, deduplication tasks, and situations where approximate accuracy is sufficient.

⚠ Accuracy vs speed trade-off

Every ANN algorithm trades some recall accuracy for speed. HNSW typically achieves 95–99% recall at query speeds 100× faster than brute-force. For most production applications, this is the right trade-off. For safety-critical applications requiring exact nearest neighbours, exact k-NN search (brute force) is still supported by most databases — just much slower at scale.

Real-world use cases
Vector databases are not a niche technology. They are now infrastructure-layer components in production AI applications serving millions of users. Here are the dominant use cases:

Semantic search
Instead of matching keywords, semantic search finds content that means what the user is asking. A search for “can’t access my account” surfaces the password-reset documentation even though the two share almost no keywords. This powers enterprise document search, e-commerce, support portals, and knowledge bases.

Retrieval-Augmented Generation (RAG)
The dominant architecture for grounding LLM outputs in real data. All reference material is embedded at ingestion. At query time, the most relevant chunks are retrieved and injected into the LLM’s context window. This is how most enterprise AI assistants, chatbots, and copilots are built today.

Recommendation systems
User preferences, item features, and behavioural history are embedded into the same space. Recommendations become nearest-neighbour queries: “find products whose vector is closest to this user’s preference vector.” Spotify, Netflix, and most e-commerce platforms run some version of this.

Image and multimodal search
Models like CLIP embed images and text into a shared vector space. You can search image libraries with text queries, find visually similar products, power content moderation, or detect near-duplicate images — all via the same k-NN retrieval mechanism.

Anomaly detection
Embed sequences of actions, log events, or network packets. Normal behaviour clusters tightly in vector space. Anomalies appear as outliers far from any cluster. This pattern is used in fraud detection, intrusion detection, and predictive maintenance.

Long-term AI agent memory
This is the use case that separates serious agent deployments from demos. Agents need to remember what they’ve done, what users have told them, and how the world has changed since their last session. Vector databases are the obvious answer — and they work, up to a point. We’ll get to the limits in “The gap” section below.

Top vector databases compared
[Table: the most widely used vector databases compared — architecture, hosting model, and when to choose each]

The vector database market is crowded. The table above breaks down the most widely used options, their architecture, and when to choose each; the “How to choose the right vector layer for your stack” section later in this post walks through the same decision in prose.

★ = Stack used by VEKTOR Slipstream — better-sqlite3 loads sqlite-vec as a native extension, giving agent-memory workloads in-process vector search with full SQL and no separate server to run.

The gap: why vector search alone is not agent memory
Here is the thing nobody mentions in the “vector databases explained” articles: storing vectors and retrieving similar ones is not the same as remembering.

Vector search answers one question: “What stored content is most similar to this query?” That is powerful. But an AI agent needs to answer different questions:

What has changed since I last spoke to this user?
Is this new fact consistent with what I already know?
How are these two facts related — not in text similarity, but in logical causality?
What should I forget because it’s stale or contradicted?
What is the narrative arc of this user’s project over time?
None of these questions are cosine similarity problems. They are graph traversal, contradiction resolution, temporal reasoning, and compression problems. A vector database is necessary but not sufficient.


This is the architectural gap that motivated VEKTOR. We needed persistent agent memory that did more than retrieve similar chunks — we needed a system that could reason about what it knows, resolve conflicts, and stay clean over thousands of interactions.

“Every long-running agent eventually accumulates contradictory, stale, redundant memory. Vector search doesn’t fix this. A compression-aware memory graph does.”

MAGMA: four-layer memory graph beyond cosine similarity
MAGMA (Multi-layer Associative Graph Memory Architecture) is the memory model at the core of VEKTOR Slipstream. Instead of a single flat vector store, MAGMA maintains four distinct memory layers — each capturing a different type of relationship between facts.

Layer 1 — Semantic layer
Standard vector embeddings. The familiar cosine-similarity retrieval layer that surfaces content close in meaning to the query. This is what every RAG system has. In MAGMA it is the entry point, not the whole system.

Layer 2 — Causal layer
Directed edges between facts that have a cause-and-effect relationship. “User changed jobs” → “budget constraints changed” → “paused subscription.” Vector similarity would never surface this chain from a query about subscription status. Causal traversal does.

Layer 3 — Temporal layer
Every memory node carries a timestamp and a decay weight. Facts become less authoritative over time unless reinforced. Contradicting facts trigger the AUDN loop for resolution. This is how VEKTOR avoids the “hairball problem” — the entropy accumulation that kills long-running agents.

Layer 4 — Entity layer
Named entities (people, projects, tools, companies) are indexed as first-class nodes. Queries can traverse entity relationships: “everything this user’s company uses,” “all decisions made in this project,” “everyone who worked on this problem.” This is graph traversal, not vector search.

Every new memory ingestion triggers the AUDN decision: Add (new fact, store it), Update (existing fact has changed, modify the node), Delete (fact is contradicted or obsolete, remove it), or None (redundant, discard). This loop is what keeps MAGMA’s graph coherent over time. Without it, vector stores accumulate contradictions silently. Deep dive →
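As a purely illustrative sketch — not VEKTOR’s actual implementation — an AUDN-style decision could be structured roughly like this, where findRelated, isDuplicate, and contradicts are hypothetical helpers:

audn-style decision — node.js (illustrative)

// Illustrative only: a rough shape for an Add/Update/Delete/None decision.
async function audnDecision(incoming, graph) {
  const related = await graph.findRelated(incoming);        // semantic + entity lookup

  if (related.some((fact) => isDuplicate(fact, incoming))) {
    return { action: "none" };                               // redundant, discard
  }

  const conflict = related.find((fact) => contradicts(fact, incoming));
  if (conflict && incoming.timestamp >= conflict.timestamp) {
    // Newer information supersedes the stored fact: update it in place,
    // or delete it outright if it is now fully obsolete
    return { action: "update", target: conflict.id, value: incoming };
  }
  if (conflict) {
    return { action: "none" };                               // incoming fact is older/stale
  }

  return { action: "add" };                                  // genuinely new, store it
}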

Periodically, VEKTOR runs a REM (Recall-Evaluate-Merge) compression cycle inspired by how biological memory consolidates during sleep. Redundant memories are merged. Stale knowledge is decayed. Contradictions are resolved. The graph stays small and signal-rich even after thousands of interactions. Read the REM deep dive →

VEKTOR Slipstream & Cloak: vector memory in practice
VEKTOR Slipstream is the single-package implementation of MAGMA. It ships as an npm package that installs in minutes, runs entirely on local hardware, and exposes its full capability as an MCP server — meaning Claude, Cursor, Windsurf, VS Code, and Groq Desktop all get access to persistent, graph-backed memory through the standard MCP protocol.

Under the hood, Slipstream stores memory in a local SQLite database. No cloud dependency, no per-call API fees, no data leaving your machine. The vector index lives inside the same .db file as the graph. It is truly sovereign infrastructure.

terminal — install slipstream

# Install globally
npm install -g vektor-slipstream

# Start the MCP server
vektor mcp

# Your AI apps now have persistent memory
✓ Claude Desktop connected
✓ Cursor connected
✓ Windsurf connected

Cloak is the stealth browser and SSH orchestration layer inside Slipstream. It is where agent actions meet real-world execution: fetching URLs with human-realistic browser fingerprints, executing SSH commands on remote servers with automatic backup and rollback, managing credentials in an AES-256 encrypted vault, and running multi-step operations as atomic transactions with a single approval gate.

Cloak’s memory integration means every action it takes can be remembered: the server it SSH’d into, the config it changed, the page it fetched, the credential it used. That context accumulates in MAGMA and becomes available to future agent sessions. This is the difference between an agent that executes and an agent that knows what it has done.

vektor slipstream — remember + recall

// Store a memory with graph relationships
await vektor.store("User migrated from Pinecone to LanceDB in March", {
  entities: ["user:alex", "tool:pinecone", "tool:lancedb"],
  causal: "cost_reduction",
  temporal: new Date()
});

// Recall with graph traversal, not just similarity
const memory = await vektor.recall("what vector db is this user running?");
// Returns: LanceDB (March migration) - not Pinecone
// Standard RAG would return both, with no preference

Vex & Vek-Sync: open-source memory tooling
Two open-source tools from the VEKTOR ecosystem solve problems that every developer building on vector databases eventually hits.

Vex — Vector Exchange Format
Switching vector databases is painful because every database uses its own export format. Moving from Pinecone to Qdrant means writing a one-off migration script. Moving from Weaviate to LanceDB means writing another. The ecosystem has no interchange standard.

Vex is the open interchange format for agent memory. A .vex file contains vectors, metadata, and graph relationships in a portable schema that any vector database can import. Write one migration path to Vex, then go anywhere. GitHub →

Vek-Sync — One config file to rule them all
Every AI app on your machine stores its MCP server config in a different directory. Update your VEKTOR Slipstream path and you have to edit five JSON files. Vek-Sync maintains one canonical mcp-sync.json and propagates it to every detected AI app with a single vek-sync push command. Credential rotation, server updates, new tool additions - all handled in one place. Read the article → · GitHub →

How to choose the right vector layer for your stack
The right answer depends on three variables: scale, sovereignty, and what kind of memory your agent actually needs.

For prototypes and research
Use LanceDB (embedded, no server) or pgvector (if you’re already on Postgres). Zero ops overhead. Both support HNSW. Good enough for tens of millions of vectors.

For production RAG at scale
Use Qdrant (self-hosted, Rust performance, sparse + dense) or Weaviate (strong hybrid search, good ecosystem). Both are battle-tested at hundreds of millions of vectors with sub-10ms query latency.

For teams who want zero infrastructure
Use Pinecone. Fully managed, reliable, expensive. Worth it if engineering time costs more than the bill.

For AI agents that need persistent, graph-backed memory
A flat vector store is the wrong abstraction. You need a system that handles contradiction resolution, temporal decay, entity relationships, and compression. That is what VEKTOR Slipstream + MAGMA is designed for. It uses an embedded vector index for semantic recall as its foundation, then adds the four memory layers on top. Local-first, MCP-native, no cloud dependency.

Storing documents for semantic search? Any vector database works. Pick based on scale and ops preference.

Building a RAG pipeline for an LLM app? Qdrant or Weaviate self-hosted, Pinecone if managed. Use a chunking strategy and metadata filters.

Building an AI agent that needs to remember across sessions? You need more than a vector store. You need MAGMA. See VEKTOR Slipstream →

Originally published at https://vektormemory.com on May 7, 2026.
