Aman Puri

Posted on Jun 27 • Originally published at hydradb.com

B-Trees, Vectors, and Graphs: Why Hybrid Search Breaks at the Storage Layer

#database #ai #memory

Agent memory exposes a storage-layer problem beneath retrieval: metadata, semantic similarity, relationships, permissions, and time have to work together in one query path. Vector search with filters covers only one part of that workload.

TL;DR

AI agents do not need another database to store embeddings. Instead, they require a purpose-built context database that can preserve memory across users, sessions, permissions, relationships, and time.

A single memory query may need to filter by tenant, workspace, user, status, or permissions. It may also need to follow relationships across users, files, sessions, entities, and versions, then rank the surviving context by semantic similarity.

Under the hood, that forces three physical access patterns into one query path: B-tree range access for metadata, ANN vector traversal for semantic similarity, and graph traversal for contextual relationships.

Each one wants different I/O, cache behavior, maintenance policy, and execution priority. Hybrid search breaks down when these structures are added as separate features but planned, cached, and maintained as if they were independent.

HydraDB is the context and memory layer for AI applications. It gives agents structured, time-aware memory so they can retrieve the right context for the right user, workflow, tenant, and moment. The HydraDB docs describe this as graph-native, temporal memory infrastructure for AI applications, where memories, knowledge, metadata, and graph context work together instead of being stitched across disconnected retrieval tools.

Agent memory exposes a storage-layer problem

For years, data retrieval improved through specialization.

Relational databases handled structured predicates with B-tree indexes. Search engines handled keyword relevance with inverted indexes. Graph databases handled relationship traversal. Vector databases optimized approximate similarity search over embeddings.

AI agents combine retrieval patterns that traditional systems usually isolate.

Static RAG over documents is giving way to persistent agent memory: a dynamic, interconnected store of facts, conversations, user preferences, permissions, source documents, and versioned state. HydraDB’s guide to AI agent memory frames this shift as the difference between stateless assistants that reset every session and memory-aware agents that can learn, personalize, and maintain continuity over time.

This workload goes beyond finding the most semantically similar chunk. The system must preserve context well enough to answer what is relevant, what changed, when it changed, who it affects, and why the change matters.

Agent memory requires infrastructure that can reconcile multiple physical access patterns without letting one dominate query latency, recall, or write throughput.

Why vector search alone is insufficient

Agent memory combines several structures that databases usually optimize separately.

At the center are raw memories: text, images, events, messages, files, and data records. These are often indexed semantically using vector embeddings.

Around those memories is relational metadata: timestamps, user IDs, tenant IDs, source documents, permissions, version markers, ownership, and compliance attributes.

Connecting those memories is an explicit relationship graph. A user authored a document. That document revised a previous version. The revision was discussed in a specific meeting. The meeting changed a user preference. The new preference applies only to one account, workflow, or project.

Vector search does not directly solve these problems. A vector index answers a nearest-neighbor question in embedding space. It does not natively encode tenant boundaries, permission rules, version validity, authorship, dependency paths, or temporal state. Those properties can be stored alongside vectors, but the vector index itself still optimizes for distance, not correctness under scope, time, policy, and relationships.

That is why vector search alone is insufficient for agent memory. It can retrieve semantically close memories, but it cannot decide whether a memory is allowed, current, causally connected, or applicable to the current workflow without coordinated metadata filtering, temporal state, and relationship traversal.

The three access patterns: B-tree, ANN, graph

Access pattern	Common structure	What it does well	Physical behavior	Main risk in hybrid search
Range and point access	B-tree	Tenant filters, timestamps, ordered scans, equality predicates	Page-oriented and relatively predictable	Can flood cache with metadata and heap pages
Semantic similarity	HNSW or another ANN index	Approximate nearest-neighbor retrieval over embeddings	Random-access-heavy graph traversal	Can lose recall under selective filters
Relationship traversal	Context graph edges	Context expansion across users, documents, sessions, and versions	Dependent reads across nodes, pages, shards, or objects	Can produce high fan-out and weak locality

B-trees support metadata filtering

B-trees are widely used in OLTP databases because they support point lookups, ordered scans, and range predicates efficiently.

They work well for queries such as:

WHERE tenant_id = 'abc'

AND created_at > '2024-01-01'

B-trees perform best when the index order matches the predicate shape. Their I/O is page-oriented and relatively predictable compared to graph traversal, making them a strong fit for buffer pools, page caches, and disk-backed execution.

HNSW vector search behaves differently

Approximate nearest-neighbor search, often implemented with an HNSW graph, has a different physical profile.

The original HNSW paper by Malkov and Yashunin describes a hierarchical proximity graph used for efficient approximate nearest-neighbor search. Query execution traverses graph neighborhoods and compares candidate vectors. Depending on the implementation and dataset layout, those graph nodes and vector payloads may not have the same locality as an ordered B-tree scan.

Hybrid search often depends on fast access to graph neighborhoods, candidate vectors, and metadata predicates inside the same query path.

Relationship graphs add another access pattern

Relationship traversal, such as user -> interacted_with -> document, is a form of dependent access.

Each hop can point to a different node, page, shard, or storage object. With enough fan-out, traversal becomes a sequence of small reads where the next access depends on the current result.

This weakens batching and prefetching. It also creates a different cache profile from both B-tree scans and ANN traversal.

HydraDB’s full recall response format reflects this difference. Recall responses can include retrieved chunks, source metadata, graph paths, chunk relations, and additional related context. That response shape exists because useful agent context is not just a ranked list of nearest chunks.

Where hybrid execution breaks down

Supporting B-trees, vectors, and graph edges as logical features is only the first step. The storage engine still has to make its I/O behavior predictable when a single query needs all three.

A hybrid query may start with a tenant filter, expand through relationships, and rank results by vector similarity. Each structure is reasonable in isolation. Together, they can create cache churn, random reads, high tail latency, and recall instability.

Whether the system starts with a relational core and adds vectors, or starts with a vector core and adds metadata filtering, the same constraint appears: the physical design optimized for one workload can limit the others.

Postgres + pgvector: useful, but filtered ANN has trade-offs

Postgres was a natural place to add vector search because it has mature transactional behavior, a strong extension ecosystem, and broad operational adoption.

pgvector has gained adoption because it stores embeddings alongside transactional data and avoids adding another operational system. This helps teams that want vector search near application data.

But it introduces trade-offs in high-recall, filtered ANN workloads.

The core issue is that Postgres was built around relational operators, cost estimates, indexes, and tuples. Vector similarity search has a different selectivity behavior. The planner can reason about ordinary predicates such as tenant, timestamp, or status, but high-dimensional nearest-neighbor search does not behave like a normal B-tree lookup.

Filtered vector search can reduce recall

Filtered vector search exposes the mismatch between relational predicates and ANN traversal.

In approximate index scans, filtering is typically applied after the vector index produces candidates. If the filter is selective, many of the nearest candidates may be discarded. The query can return fewer than k results unless the scan explores more of the index.

That creates a recall cliff. The system may look fast because it searched a small candidate set, but it did not search enough of the graph to satisfy the filtered query.

The pgvector documentation addresses this with iterative scans. Starting with version 0.8.0, iterative index scans can continue scanning an HNSW or IVFFlat index until enough filtered results are found or a configured limit is reached.

This gives operators another control surface for trading latency, recall, and scan depth, but the tension remains: relational filtering and approximate graph traversal must still be coordinated within a single execution path.

HNSW index builds can be operationally expensive

Building an HNSW index is CPU-intensive and sensitive to memory settings such as maintenance_work_mem.

The pgvector HNSW docs note that indexes build faster when the graph fits into maintenance_work_mem, and that build time can increase significantly when it no longer fits.

On shared Postgres systems, an HNSW build can compete with transactional workloads for CPU, memory, and I/O unless it is isolated carefully.

Vector-native systems: stronger ANN, but filtering and relationships still require coordination

Dedicated vector databases avoid some Postgres planner constraints, but filtered vector search still requires careful execution planning.

Stronger systems build payload indexes, filter-aware HNSW variants, or query planners that switch strategies based on filter selectivity. Qdrant, for example, documents payload indexes and filterable HNSW, plus ACORN search for stricter filtered-search cases.

These approaches address the filtered-ANN problem by changing when filters are applied, how much of the vector graph is explored, and how candidates are validated.

For common filters, this can work well. For tenant isolation, permissions, time windows, version constraints, or low-cardinality intersections, the system still has to choose how deeply to search and how much candidate expansion to tolerate.

A reliable design applies metadata constraints during traversal or planning, not only after retrieval. That requires tighter coordination between the metadata index, the vector index, and the executor.

The storage engine has to know which graph paths are worth exploring under the active predicate. That requires the executor to coordinate metadata selectivity, vector traversal, and candidate validation during the query rather than treating them as separate phases.

Relationship graphs conflict with page-oriented I/O

Layering a relationship graph on either a B-tree-centric or a vector-centric foundation introduces another physical problem.

B-trees are page-oriented. Vector indexes often rely on memory locality for graph layers and candidate vectors. Relationship traversal can break both assumptions.

A graph hop from a user node to a document node to a session node may require fetching records from different pages or storage regions. With enough fan-out, traversal becomes a chain of dependent reads.

The system cannot always batch or prefetch those reads because each hop determines the next set of addresses.

Hybrid search becomes expensive when metadata filtering, relationship expansion, and vector ranking share a single query path, even when each index is well designed.

Production symptoms: cache contention, tail latency, index maintenance, write interference

Hybrid search friction appears as I/O amplification, cache contention, index maintenance cost, and write-path interference. These implementation details directly affect recall, latency, concurrency, and operating cost.

`mmap` can create unpredictable latency

Some vector systems use memory-mapped files, or mmap, to access index structures.

The programming model is simple. Hot data can run near memory speed, and the operating system handles paging. For read-mostly workloads with enough memory and predictable locality, this can work well.

The trade-off is control. In "Are You Sure You Want to Use MMAP in Your Database Management System?", Crotty, Leis, and Pavlo argue that mmap is not a suitable replacement for a traditional DBMS buffer pool because it reduces the database engine’s direct control over paging and I/O behavior.

Eviction, prefetching, fault handling, and I/O scheduling affect latency when queries depend on random access.

When a query touches a non-resident HNSW node, vector page, metadata page, or graph edge, the system must fetch it before execution can continue. On local NVMe, that may be manageable. On networked storage or object-backed designs, cold reads can become more visible in tail latency.

mmap moves key storage decisions away from the database engine at the point where hybrid search needs more control.

A managed buffer pool, potentially paired with async I/O mechanisms, can give the engine more explicit control over admission, eviction, prefetching, and backpressure.

That control comes with implementation complexity. In a hybrid database, the complexity may be justified because the engine must balance multiple access patterns rather than delegate paging decisions to the operating system.

B-tree buffers and HNSW layers compete for cache

In a unified system, the B-tree index and the HNSW graph compete for the same memory budget unless the engine isolates them.

Consider a common hybrid query:

Find documents matching tenant_id = X that are semantically similar to this query.

If tenant_id is selective but still matches a large working set, the relational portion of the query may read many B-tree and heap pages into cache. That can evict parts of the HNSW graph, including upper layers or frequently visited neighborhoods used as entry points for ANN search.

When the vector portion of the query runs, it may encounter cache misses where it expected hot graph structures.

The reverse can also happen. A burst of vector queries can keep HNSW pages hot while pushing relational metadata pages out of memory. Later, ordinary transactional or filtered queries pay the price.

B-trees and HNSW can coexist. Their working sets have different shapes, and a generic cache policy may not understand which pages are latency-critical for each phase of a hybrid query.

MVCC adds maintenance pressure

The concurrency challenges of hybrid systems become sharper under MVCC.

In a system like Postgres, updates create new row versions. PostgreSQL’s routine vacuuming documentation explains that VACUUM removes dead row versions in tables and indexes and marks space available for reuse. For agent memory workloads, where state, conversation history, metadata, and embeddings may change frequently, this can lead to table and index bloat unless vacuum, partitioning, and retention policies are carefully tuned.

Vector indexes add their own maintenance pressure. HNSW is a graph-like structure, so inserts and updates are not just local tuple changes. The engine has to maintain graph connectivity while preserving query correctness and concurrency guarantees.

Deletes and updates can leave tombstones, stale candidates, or deferred cleanup work depending on the implementation.

In a hybrid engine, a filtered search query can collide with background indexing, compaction, vacuum, or a high-throughput write stream.

If concurrency control is too coarse, long-running reads can delay maintenance or writes. If it is too permissive, queries may see inconsistent candidate sets or pay extra validation costs.

Relational transactions, vector index maintenance, and graph updates all want different consistency and scheduling policies.

Mitigation patterns: tiered storage, quantization, read/write separation

Several patterns reduce physical-layer friction in hybrid search. Each one helps, but none removes the underlying trade-off.

The design still has to choose where to spend memory, latency, recall, and operational complexity.

Pattern	What it improves	What it costs	Best fit
Tiered storage and lazy loading	Reduces memory and local disk pressure	Adds first-query and cache-miss latency	Large collections, skewed access, analytical retrieval
Quantization	Reduces vector storage and memory footprint	Can reduce recall without re-ranking	Vector-heavy systems with two-stage retrieval
Read/write separation	Isolates indexing from serving	Adds freshness, replication, and recovery complexity	Interactive workloads that need stable query latency

Tiered storage reduces memory pressure

One common pattern is separating compute from storage and using low-cost object storage as the durable backing layer.

Systems such as Milvus support tiered storage to avoid loading every field and index into each query node up front. Instead, the query node can load lightweight metadata first, then fetch field data and indexes on demand through a local caching layer.

This is a critique of object-backed lazy-loading as the primary serving pattern for interactive agent memory, not a critique of Milvus as a vector database. HydraDB uses Milvus-backed indexing for configured metadata fields, as reflected in the tenant metadata schema API, but that is different from treating a standalone object-backed vector retrieval layer as the full memory substrate. In HydraDB, vector and metadata indexing sit underneath a memory layer that also coordinates graph context, temporal state, and recall behavior.

The trade-off is first-query and cache-miss latency. Object storage is durable and cost-efficient, but it is a poor fit for workloads that require many dependent random reads unless the system adds caching, warm-up, prefetching, and careful segment management.

Milvus documentation reflects this trade-off. Lazy loading and partial loading reduce memory and disk pressure, while warm-up and tiered-storage cache policy help control query latency.

Tiered storage is useful for large collections, skewed access patterns, analytical workloads, batch retrieval, and mixed workloads that can tolerate occasional cache misses.

Interactive agent memory has less room for cache misses because a query may need fresh context, relationship expansion, and vector ranking inside a tight response budget.

Quantization trades precision for I/O efficiency

Quantization reduces the representation size of vectors so the system can keep more candidates in memory, read fewer bytes from storage, and compare more vectors per CPU cache line.

The foundational work on Product Quantization by Jégou, Douze, and Schmid established the basic trade-off. Smaller vector representations reduce memory and storage pressure, but they also reduce precision.

Production systems usually compensate by over-fetching compressed candidates and re-ranking them using full-precision vectors.

The byte savings can be large. A 1024-dimensional vector stored as 32-bit floats requires 4096 bytes. Binary quantization can represent that same dimensionality in 128 bytes, a 32x reduction before accounting for index overhead.

With 2048 dimensions, the float32 representation is 8192 bytes, and the binary representation is 256 bytes.

Those reductions can decide whether an index fits in memory or spills into a colder storage tier.

The cost is recall risk. Aggressive quantization can distort distances enough to drop relevant candidates from the first-stage result set.

Quantization works best as part of a two-stage retrieval pipeline, with full-precision re-ranking.

Read/write separation stabilizes serving

Another common pattern is separating read and write workloads.

Writes go to an ingestion or indexing path. Queries go to read-optimized replicas. This prevents index builds, compaction, and heavy write bursts from directly competing with low-latency search.

That separation helps with resource isolation. It keeps CPU-heavy index construction away from serving nodes. It also gives query replicas a more stable cache profile, which matters for HNSW traversal and metadata filters.

For interactive workloads, predictable cache residency can be more valuable than peak throughput.

The cost is freshness and operational complexity. The system must manage replication lag, index handoff, failure recovery, and version visibility.

Agent memory raises the cost of replication lag because stale context can be semantically wrong, not just slightly out of date. If an agent acts on superseded preferences or an old conversation state, the storage delay becomes a product bug.

What memory-native infrastructure should optimize for

The physical-layer friction between B-trees, vector graphs, and relationship graphs shows why agent memory is hard to build on top of existing retrieval assumptions.

Each access pattern is well understood on its own. The difficulty lies in combining them into a single low-latency query path without allowing one structure to sabotage the others.

Persistent agent memory adds versioned history, temporal context, user-specific state, permissions, semantic similarity, and explicit relationships. These complex requirements highlight the core differences between an agent memory layer and a traditional vector database. They demand a memory-native context layer, not a stack of disconnected retrieval components.

A memory-native architecture should treat metadata, vector similarity, relationships, temporal state, permissions, and version history as primary primitives. Cache policy, index maintenance, query planning, and I/O scheduling need to be aware of all of them.

HydraDB is built around this shape of retrieval. Its Memories model stores user-scoped context for personalization across sessions. Its Knowledge flow handles shared, tenant-wide context. Its metadata filters narrow recall to known scopes before ranking. Its context graph captures entities, relationships, temporal signals, and graph paths so agents can reason about how information connects.

At the architectural level, HydraDB addresses this through three primitives. The Sliding Window Inference Pipeline enriches each ingested chunk using a window of surrounding segments to resolve pronouns and capture user preferences, so retrieved chunks are semantically self-contained rather than disconnected fragments. The Git-Style Versioned Temporal Graph keeps relationships, state changes, and historical versions connected so agents can reason about what changed, when it changed, and which context is still valid. Multi-Stage Retrieval coordinates metadata constraints, graph expansion, and semantic ranking so filtering and relationship context are part of retrieval rather than post-processing after vector search.

HydraDB’s quickstart also shows the operational shape of this model: content is ingested, processed, embedded, and used to build graph context before recall. For knowledge sources, applications use full_recall. For user memories, they use recall_preferences. The response can include chunks, source metadata, graph context, and additional related context that can be passed into an LLM through a clean context builder.

Reliable agent memory depends on predictable latency, high recall, controlled permissions, historical traceability, and context that reflects how users and organizations change over time.

To see how HydraDB approaches agent memory, context graphs, and time-aware retrieval, visit hydradb.com or explore the HydraDB documentation.

DEV Community

B-Trees, Vectors, and Graphs: Why Hybrid Search Breaks at the Storage Layer

TL;DR

Agent memory exposes a storage-layer problem

Why vector search alone is insufficient