DEV Community: Aman Puri

Best OLAP databases for real-time analytics in 2026 (compared)

Aman Puri — Tue, 14 Jul 2026 17:58:08 +0000

The architectural split that defined the last decade of data engineering is breaking down. You know the pattern: run a transactional rowstore for applications, batch everything to a cloud data warehouse overnight for reporting.

But modern applications don't work that way anymore. Analytics are embedded directly into user-facing dashboards, ad-tech decisioning, and IoT telemetry. The demand for sub-second analytical query latency on fresh, high-volume data has outgrown traditional architectures.

PostgreSQL and MySQL hit physical I/O walls when aggregating at scale, reading whole rows off disk even for narrow analytical scans. Traditional batch cloud data warehouses solve different problems. Snowflake virtual warehouses add queueing and resume behavior, while BigQuery adds job scheduling, dynamic concurrency, query queues, slot allocation, and reservation behavior. Both make sub-second P99 latency hard to sustain under high concurrency.

Matching your workload shape to the right database requires moving beyond marketing claims. You need to evaluate purpose-built open-source OLAP databases against proprietary cloud warehouses based on architectural realities, open reproducible benchmarks like ClickBench, and proofs of concept on your own workload.

Key takeaways

Best overall for real-time analytics (sub-second, high concurrency): ClickHouse is the fastest general-purpose columnar engine, with strong compression, Kafka ingestion, flexible SQL, and predictable scaling.
Best for ultra-low-latency known access patterns on denormalized data: Apache Pinot provides star-tree indexing and other indexes for predictable aggregation and lookup SLAs.
Best for time-series event streams: Apache Druid is optimized for time-window aggregations and streaming ingestion.
Best for normalized, join-heavy MPP workloads: StarRocks is built around a cost-based optimizer (CBO), and Doris has MPP shuffle joins and runtime filters. ClickHouse now has full JOIN support, automatic join reordering, runtime filters, and spill-capable algorithms.
Best for embedded/single-node analytics: DuckDB offers an in-process OLAP for local workloads.
Best for warehouse-native governance, data sharing, and latency-tolerant batch BI: Snowflake and BigQuery remain strong when teams prioritize existing warehouse ecosystems, elastic batch/ad-hoc analytics, and seconds-level dashboard latency. ClickHouse is the better fit for real-time data warehousing, BI, and application serving when freshness, concurrency, and cost predictability matter.

Evaluation criteria for real-time OLAP databases (latency, concurrency, ingestion)

Architecture determines the physics of real-time analytics. Latency, concurrency, and data freshness are set by an engine's underlying design, not by its marketing. Engines must be evaluated against strict real-time constraints: the ability to sustain sub-second P99 query latency, serve hundreds to thousands of concurrent users or requests depending on query shape without sharp P99 degradation, and ingest real-time streaming data with second- or sub-second freshness.

Benchmark data from ClickBench must be weighed along with current architectural capabilities like native JSON handling, vector search, joins, and streaming ingestion.

Selection criteria for real-time OLAP in 2026

Query latency and P99 tail behavior: Real-time OLAP engines must target sub-second P99 latency for serving workloads. Tens-of-milliseconds latency usually requires narrow queries, pre-aggregation, indexes, or a small hot working set. Assess whether an engine achieves this via core columnar execution or requires explicitly provisioned memory reservations. What matters is stability under load, not the average. A system that averages 200ms but spikes to 5 seconds during heavy ingestion is unusable for real-time serving.
Concurrency limits: Analyze whether each architecture can sustain target concurrency without sharp P99 degradation, accounting for query shape, queueing, reservations, and replica or cluster scaling.
Ingestion and data freshness: Support for real-time Kafka streaming with sub-second to second-level freshness, examining how engines handle row-level mutations versus delayed micro-batching.
SQL completeness and joins: Handling of complex aggregations and large-to-large table joins. Both modeling approaches are valid: use normalized patterns when operational flexibility, governance, schema evolution, and ad-hoc analysis matter; use denormalized patterns when a single dominant access pattern, strict latency SLAs, append-only event streams, or ultra-low-latency lookups matter. Evaluate the maturity of each engine's join algorithms and query optimizer for both approaches.
AI and vector search capabilities: Native vector search is now a meaningful differentiator for RAG workloads, but implementations differ in indexing, refresh, memory, filtering, and SQL integration.
Operational complexity and TCO: Factor in deployment overhead, JVM dependencies, storage compression efficiency, and whether compute-based pricing stays cheaper than per-query or per-scan pricing as concurrency grows. At high concurrency, cost comes from compute time, duplicated clusters, memory reservations, and scanned data.
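To make the "stability under load, not the average" criterion concrete, here is a minimal sketch on synthetic latencies. The numbers are invented for illustration (985 requests at 130 ms and 15 stalled requests at 5 seconds), and the nearest-rank percentile helper is one common definition, not the only one.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies in milliseconds (made-up numbers, not a measurement)
latencies_ms = [130] * 985 + [5000] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands near 200 ms and the median looks healthy, yet one request in a hundred waits five seconds. That gap is why the criteria above ask for P99 under load rather than an average.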

How to benchmark real-time OLAP databases yourself (and common mistakes)

Relying too heavily on vendor benchmarks is a mistake. Many competitor benchmarks are closed, lack a published methodology, cherry-pick configurations, and apply optimizations inconsistently. ClickBench, by contrast, is maintained by ClickHouse but is open-source, reproducible, and industry-recognized. Even so, the only benchmark that settles an evaluation is one you run on your own data and workload. A reproducible proof of concept requires:

Generate representative data. Use skewed, real-world-shaped data, not uniform random rows, and size it well beyond RAM (roughly 10x) so you are testing disk and cache behavior, not an in-memory toy.
Test concurrency, not a single user. The most common failure mode in analytical pilots is benchmarking against a single user and extrapolating. Ramp virtual users beyond your own expected peak volume to account for bursts and future growth, and watch the P95/P99 curve: does it stay flat, or does it hockey-stick as queueing kicks in? A flat curve under load is the core requirement for user-facing analytics.
Run ingestion and queries simultaneously. Push your maximum ingestion rate while the concurrency test is running. If the ingestion client starts timing out or query latency doubles, the engine's read/write isolation is failing under backpressure. This is exactly the pathology that averages hide, and that breaks real-time serving in production.
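The concurrency step above can be sketched as a small load harness. This is an illustration under stated assumptions, not a benchmarking tool: run_query is a hypothetical stand-in that simulates a server with two execution slots and 5 ms of work per query, and you would replace it with a real client call against the engine you are testing.

```python
import math
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the system under test: 2 execution slots, 5 ms of work per query.
# Replace run_query with a real client call against your own engine.
_slots = threading.Semaphore(2)

def run_query():
    start = time.perf_counter()
    with _slots:               # waiting here models server-side queueing
        time.sleep(0.005)      # simulated execution time
    return (time.perf_counter() - start) * 1000  # latency in ms

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def measure(virtual_users, queries_per_user=10):
    """Run queries from `virtual_users` threads at once and summarize latency."""
    def user_session():
        return [run_query() for _ in range(queries_per_user)]
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        futures = [pool.submit(user_session) for _ in range(virtual_users)]
        latencies = [ms for f in futures for ms in f.result()]
    return {"p50_ms": percentile(latencies, 50), "p99_ms": percentile(latencies, 99)}

# Ramp the number of virtual users and watch how the tail moves
results = {users: measure(users) for users in (1, 4, 16)}
for users, stats in results.items():
    print(f"{users:>3} users  p50={stats['p50_ms']:.1f} ms  p99={stats['p99_ms']:.1f} ms")
```

With only two simulated slots, latency climbs as virtual users are added because requests queue behind each other. Against a real engine, the same ramp shows whether P99 stays flat or hockey-sticks, and running your ingestion job during the ramp covers the third step.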

Quick comparison of OLAP databases, HTAP engines, and cloud warehouses (2026)

ClickHouse is the best overall choice for high-concurrency, sub-second analytics and cost-effective real-time serving. Choose Apache Pinot for ultra-low-latency indexed aggregations and lookups on highly denormalized data, or Apache Druid for heavy time-series event streams. For single-node or embedded analytics, DuckDB is the top choice.

Snowflake and BigQuery remain strong options for warehouse-native governance, data sharing, and latency-tolerant batch or ad-hoc analytics. ClickHouse is a compelling alternative for reporting workloads where latency, concurrency, and cost matter: it can be added as a "speed layer" to an existing data warehouse or used for analytical consolidation after workload validation. For high-concurrency serving, its compute-based model and compression make costs more predictable than per-scan billing.

How to choose an OLAP database for real-time analytics

Latency-tolerant BI, governed sharing, and batch reporting: Snowflake or BigQuery. Internal BI and ad-hoc reporting, where seconds are acceptable and warehouse-native governance matters.
Low concurrency, low latency, small active dataset: Postgres. Use SingleStore when you specifically want one HTAP engine for operational writes and analytical reads in the same proprietary system. You don't need a distributed OLAP system until the data or the concurrency grows. When you do outgrow PostgreSQL's capabilities, ClickPipes for Postgres CDC provides an integration path for replicating operational data from Postgres into ClickHouse.
High concurrency, low latency: real-time OLAP (ClickHouse, Druid, Pinot, StarRocks, Doris). User-facing analytics, observability, and data apps live here.

If you are evaluating specific architectural requirements:

Need sub-second P99 analytical query latency? Real-time OLAP required.
Serving hundreds to thousands of concurrent users with low-latency analytics? Real-time OLAP required.
Active analytical dataset above ~500 GB with repeated scans? Columnar storage becomes the right default.
Need sub-minute freshness plus sub-second serving? Real-time OLAP required.
Joins plus ad-hoc queries on semi-structured data? ClickHouse offers the highest SQL flexibility among the real-time OLAP engines.
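The checklist above can be written down as a small function, which is handy when you want to record why a workload was classified the way it was. This is a sketch: the Workload fields are hypothetical names, and the cutoffs (100 users for "hundreds", 60 seconds for "sub-minute") are one reading of the article's rough thresholds rather than hard limits.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    p99_target_ms: float        # required P99 query latency
    concurrent_users: int       # peak simultaneous users or requests
    active_dataset_gb: float    # data that is scanned repeatedly
    freshness_target_s: float   # max acceptable ingest-to-query delay

def realtime_olap_reasons(w: Workload) -> list:
    """Return the checklist items that this workload triggers."""
    reasons = []
    if w.p99_target_ms < 1000:
        reasons.append("sub-second P99 latency")
    if w.concurrent_users >= 100:
        reasons.append("hundreds to thousands of concurrent users")
    if w.active_dataset_gb > 500:
        reasons.append("active dataset above ~500 GB")
    if w.freshness_target_s < 60 and w.p99_target_ms < 1000:
        reasons.append("sub-minute freshness plus sub-second serving")
    return reasons

# Two hypothetical workloads
dashboard = Workload(p99_target_ms=300, concurrent_users=2000,
                     active_dataset_gb=1200, freshness_target_s=5)
internal_bi = Workload(p99_target_ms=5000, concurrent_users=20,
                       active_dataset_gb=50, freshness_target_s=3600)

print(realtime_olap_reasons(dashboard))
print(realtime_olap_reasons(internal_bi))
```

An empty list corresponds to the latency-tolerant BI or Postgres tiers described earlier; any non-empty list is a prompt to run a proof of concept, not a final verdict.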

Architecture comparison: real-time OLAP and analytical alternatives

Tool name	Best for	Architecture type	Serving latency profile	Concurrency handling	Deployment model
ClickHouse	High-concurrency serving	Real-time OLAP (vectorized columnar)	Sub-second serving	High via replicas and service scaling	Open-source, Managed Cloud
Apache Pinot	Known-pattern serving	Real-time OLAP (star-tree indexed)	Sub-second for indexed patterns	High for known indexed patterns	Open-source, Managed Cloud
Apache Druid	Time-series events	Real-time OLAP (time-partitioned segments)	Sub-second for time-series serving	High via brokers and historical nodes	Open-source, Managed Cloud
SingleStore	HTAP workloads	Hybrid OLTP/OLAP	Sub-second for hot hybrid workloads	Medium	Proprietary, Managed Cloud
DuckDB	Embedded analytics	Embedded OLAP	Sub-second local analytics	Low	Open-source, In-process
Snowflake	Governed warehouse BI and data sharing	Cloud Data Warehouse	Seconds for standard warehouses	Scales via warehouse size or multi-cluster	Proprietary SaaS
BigQuery	Serverless ad-hoc and batch analytics	Cloud Data Warehouse	Seconds for standard queries	High throughput, slot/reservation-bound	Proprietary SaaS
StarRocks	Real-time complex joins	Real-time OLAP (CBO-first MPP)	Sub-second serving	High, workload-dependent	Open-source, Managed Cloud
Apache Doris	Unified ad-hoc reporting	Real-time OLAP (MPP, FE/BE)	Sub-second serving	High, workload-dependent	Open-source, Managed Cloud

Openness rarely makes the feature matrix, but tends to matter most over a system's lifetime. Snowflake and BigQuery are closed, proprietary platforms. Your data and query engine are tied to the vendor's cloud, the billing units are abstract by design, and there is no way to run the engine outside that environment. Most purpose-built OLAP engines here take the opposite approach.

ClickHouse, Druid, Pinot, StarRocks, and Doris are open source, which means architectural transparency, no licensing lock-in, and a verifiable exit strategy. It is worth applying a simple "laptop test": you can download ClickHouse or DuckDB and run the same core engine locally in minutes, while ClickHouse Cloud adds service features such as shared storage and compute separation. Cloud warehouses have no real local equivalent and depend on emulators or sandboxes. Among the open-source engines, openness alone isn't enough. Check contributor trends and release cadence: an open-source license on a project with declining commit activity is a different proposition than one shipping monthly releases with a growing contributor base. For teams evaluating long-term maintainability alongside cost and lock-in, community health matters as much as the license.

1. ClickHouse for high-concurrency real-time analytics

Best for

Versatile real-time, high-concurrency analytics at petabyte scale, including user-facing dashboards, ad-tech decisioning, and observability platforms.
Data engineering teams needing sub-second latency combined with exceptional data compression and predictable infrastructure costs.

Overview

ClickHouse is the fastest open-source columnar database for real-time analytics. Built entirely in C++, it uses tuned vectorized execution and a columnar storage architecture that groups similar data. This enables effective compression and dramatically reduces I/O when executing analytical queries on massive datasets.

It offers broad deployment versatility: it can run as a local binary, a single-node server, or a distributed cluster. Because the same engine runs in each mode, local development and testing are easier for developers and production operations are easier for ops teams. It is available as self-hosted open-source software or as a fully managed, serverless platform via ClickHouse Cloud, which features modern separation of storage and compute.

Key features

Vectorized execution and columnar storage: Processes data in blocks using SIMD instructions, drastically reducing I/O and CPU overhead.
Native JSON type: As of version 25.3, ClickHouse features a production-ready native JSON type that dynamically stores JSON fields as individual sub-columns. This enables low-latency schema-less querying, automatic type inference, and dynamic paths, with max_dynamic_paths controlling how many paths are stored as sub-columns before excess paths move to shared data. For a deeper dive into these advanced capabilities, see how the JSON data type gets even better.
Vector similarity search: Native vector similarity indexes (such as HNSW) let ClickHouse serve hybrid AI and RAG workloads alongside traditional OLAP. For ANN search, the vector index must fit in memory during search, and index design, vector type, filtering behavior, and cluster sizing determine production fit.
Advanced join capabilities: Employs automatic global join reordering, runtime bloom filter push-downs, and grace hash joins (which safely spill to disk for right-side tables that exceed available RAM) to support complex star-schema workloads. On TPC-H SF100, automatic global join reordering cut one query from over an hour to roughly 2.7 seconds (about 1,450x faster) while using 25x less memory, and runtime bloom filters deliver a further ~2x speedup at a fraction of the memory (see the coffeeshop benchmark).
Lightweight updates/deletes: Supports row-level changes without rewriting entire data parts. Lightweight deletes mark rows with a delete mask and physically remove them during later merges. Lightweight updates use patch parts that make updated values visible immediately and materialize them during merges. Both avoid full part rewrites for small row-level changes, while heavy ALTER TABLE mutations remain the right tool for larger partition-aligned operations.
Streaming ingestion: Natively integrates with Kafka and features an async_insert capability that batches concurrent inserts server-side. Freshness depends on async insert flush thresholds and workload configuration.
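The link between flush thresholds and freshness in the streaming ingestion item can be seen in a toy model of server-side batching. This is not ClickHouse's implementation: the class, the max_rows and max_wait_s parameters, and the explicit clock are all hypothetical, chosen only to show the size-or-age flush rule.

```python
class BatchingBuffer:
    """Toy server-side insert buffer: flush on size or age, whichever comes first."""

    def __init__(self, max_rows, max_wait_s):
        self.max_rows = max_rows
        self.max_wait_s = max_wait_s
        self.rows = []
        self.first_arrival = None
        self.flushed_batches = []   # each flush becomes one written batch

    def insert(self, row, now):
        self.tick(now)              # flush anything that has already waited too long
        if not self.rows:
            self.first_arrival = now
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self._flush()

    def tick(self, now):
        """Called periodically; flushes if the oldest buffered row is too old."""
        if self.rows and now - self.first_arrival >= self.max_wait_s:
            self._flush()

    def _flush(self):
        self.flushed_batches.append(self.rows)
        self.rows = []
        self.first_arrival = None

buf = BatchingBuffer(max_rows=3, max_wait_s=1.0)
for t, row in [(0.0, "a"), (0.1, "b"), (0.2, "c"), (0.3, "d")]:
    buf.insert(row, now=t)
buf.tick(now=1.5)   # "d" has waited 1.2 s, so the time threshold flushes it
print(buf.flushed_batches)
```

In this model a row becomes queryable only after its batch is flushed, so worst-case freshness equals the age threshold. Larger batches mean fewer, bigger writes at the cost of staler data, which is the trade-off you tune.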

Pros

Exceptional query speed, consistently topping open benchmarks like ClickBench. In a join benchmark published by ClickHouse against Snowflake and Databricks, ClickHouse executed a 1.44 billion row join in roughly 0.5 seconds compared to 5-13 seconds on traditional cloud data warehouses. As with any vendor-published benchmark, it is worth validating against ClickBench or your own workload before treating it as a general result.
Strong compression, around 10x for common analytical data and higher for compressible logs. Compression lowers storage TCO and reduces scanned bytes, which turns into faster I/O and cheaper compute.
Read throughput scales by adding replicas, bounded by query shape and cluster resources, while avoiding the per-query scan billing spikes common in cloud data warehouses.
Pricing is grounded in hardware rather than abstract credits. Compute is metered based on actual resource usage and only while running. In ClickHouse Cloud, multiple services can share the same storage, which is billed once, and compute scales per service. There are no per-query scan penalties. Intra-query parallelism also speeds up individual queries, not just aggregate throughput. On a cost-per-query basis, ClickHouse Cloud is significantly more cost-efficient than cloud data warehouses for high-concurrency serving workloads.
One of the largest open-source database communities: 2,600+ contributors, monthly release cadence, and over 48,000 GitHub stars. Active contributor growth reduces long-term maintenance and lock-in risk.

Cons

Not a direct replacement for OLTP databases (like PostgreSQL) that require high-rate, strictly ACID single-row transactions.
Large distributed joins across large fact tables or high-cardinality dimensions still require schema design, partitioning, and workload testing. ClickHouse now supports automatic join reordering, runtime bloom filters, and spill-capable algorithms, so this is a sizing and data-layout constraint rather than a lack of join support.

Pricing

100% Open Source (Apache 2.0 license).
ClickHouse Cloud offers consumption-based pricing across compute, storage, data transfer, and optional managed ingestion, with autoscaling and auto-idling to zero.

2. Apache Pinot for ultra-low-latency user-facing analytics

Best for

Ultra-low-latency, user-facing analytics on highly denormalized (flat) datasets.
Kafka-first architectures requiring segment-level commits and predictable low-latency SLAs.

Overview

Apache Pinot is a real-time distributed OLAP datastore originally developed at LinkedIn to serve interactive analytics directly to external users. It focuses squarely on ingesting data from streaming sources and serving known aggregation and lookup patterns at very low latency under high concurrency.

Key features

Star-Tree index: A specialized data structure that pre-aggregates known dimension and metric combinations, allowing matching aggregation and group-by queries to hit the index directly rather than scanning raw rows.
Deep Kafka integration: Supports Kafka stream ingestion, segment lifecycle controls, and queryability soon after events are emitted.
Pluggable indexing: Supports inverted, sorted, and text indexes for optimized filtering on flat tables.
Scatter-gather execution: Optimized for concurrent, simple queries mapped across distributed JVM server nodes.
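The idea behind star-tree pre-aggregation can be illustrated with a much simpler structure: a dictionary keyed by dimension values, where a wildcard stands for "any value". This is a sketch of the concept only. Pinot's actual star-tree is a tree with configurable split order and thresholds that bound its size, whereas this toy materializes every combination.

```python
from collections import defaultdict
from itertools import product

STAR = "*"   # wildcard meaning "any value of this dimension"

def build_preaggregates(rows, dimensions, metric):
    """Pre-sum `metric` for every combination of concrete/wildcard dimension values."""
    index = defaultdict(float)
    for row in rows:
        # Each dimension is either kept as-is or collapsed to STAR
        choices = [(row[d], STAR) for d in dimensions]
        for key in product(*choices):
            index[key] += row[metric]
    return dict(index)

rows = [
    {"country": "US", "browser": "chrome", "clicks": 10},
    {"country": "US", "browser": "safari", "clicks": 4},
    {"country": "DE", "browser": "chrome", "clicks": 7},
]
index = build_preaggregates(rows, ["country", "browser"], "clicks")

# SUM(clicks) WHERE country = 'US' becomes one lookup instead of a row scan
print(index[("US", STAR)])
# SUM(clicks) WHERE browser = 'chrome'
print(index[(STAR, "chrome")])
```

A query is fast only if its filters and group-by columns match dimensions that were pre-aggregated, which is why the approach suits known access patterns better than ad-hoc exploration.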

Pros

Exceptional for known, well-indexed aggregation and lookup patterns on denormalized data.
Star-Tree indexes provide a unique architectural solution for consistent, sub-second latency when query predicates, group-by columns, and aggregation functions match the index design.
Proven at massive scale in ad-tech and social media use cases where predictable sub-second latency matters for known indexed serving patterns.

Cons

Pinot supports joins through its query engine, but its lowest-latency serving path still favors denormalized schemas and known access patterns.
Complex operational architecture requiring ZooKeeper, Helix controllers, brokers, servers, and minions.
Upserts require primary-key design, partitioning discipline, and additional memory for record-location bookkeeping. This works well for CDC-style streams, but it is less flexible than ClickHouse patch parts or mutation-based workflows.

Pricing

Open Source (Apache 2.0 license).
Commercial managed offerings available via StarTree.

3. Apache Druid for time-series event analytics

Best for

Heavy time-series event streams where queries are primarily slice-and-dice time-based aggregations.
Engineering organizations comfortable operating multi-service JVM clusters with deep storage, metadata storage, and ZooKeeper.

Overview

Apache Druid is an open-source, real-time analytics database designed for fast analytics on event data. Originally built to handle massive-scale clickstream and observability data, it uses a distributed, Java-based architecture. Druid is one of the original real-time OLAP engines, with roots going back to 2011, but its open-source community has contracted in recent years.

Druid separates node types for ingestion, querying, and data management, with distinct real-time and historical data paths that are merged at query time.

Key features

Tiered ingestion architecture: Separates real-time indexing from historical data storage. Real-time data is queryable immediately in memory before being finalized as immutable segments and pushed to deep storage.
Time-partitioned segments: Heavily optimized for queries filtered by time ranges, making Druid efficient for rolling-window aggregations.
Native search indexes: Uses inverted indexes and bitmap compression for fast, high-cardinality filtering.
Pluggable architecture: Features deep native integrations with HDFS, Kafka, and cloud object stores.
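Time-partitioned segments pay off because a time filter can rule out whole segments before any rows are read. The sketch below models that pruning with hour-based intervals. The Segment class and its fields are hypothetical and far simpler than Druid's real segment format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_hour: int      # inclusive
    end_hour: int        # exclusive
    events: list         # (hour, value) pairs stored in this segment

def query_sum(segments, from_hour, to_hour):
    """Sum values in [from_hour, to_hour), skipping segments outside the range."""
    total, scanned = 0, 0
    for seg in segments:
        if seg.end_hour <= from_hour or seg.start_hour >= to_hour:
            continue                     # pruned by time interval alone
        scanned += 1
        total += sum(v for h, v in seg.events if from_hour <= h < to_hour)
    return total, scanned

segments = [
    Segment(0, 6, [(1, 5), (4, 2)]),
    Segment(6, 12, [(7, 3), (11, 8)]),
    Segment(12, 18, [(13, 1)]),
    Segment(18, 24, [(20, 9)]),
]
total, scanned = query_sum(segments, from_hour=10, to_hour=14)
print(f"sum={total}, segments scanned={scanned} of {len(segments)}")
```

Only the two segments overlapping the 10:00 to 14:00 window are opened. Queries without a time filter get no such benefit, which is one reason this layout favors rolling-window workloads.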

Pros

Excellent performance for real-time streaming ingestion and strict time-series aggregations on immutable event data.
Scalable for massive datasets when properly partitioned and tuned.
Combines streaming and historical data paths at query time.

Cons

Heavy infrastructure footprint. Running Druid requires deploying multiple JVM node types (Coordinator, Overlord, Historical, Broker) and maintaining a strict dependency on ZooKeeper. This leads to high operational complexity.
Druid supports SQL joins, but native joins use a broadcast hash join, and non-left inputs must fit in memory. Join-heavy workloads perform better with lookup tables, denormalized data, or pre-joined tables.
JVM tuning, GC behavior, and multi-service operations add operational overhead compared with ClickHouse's native C++ engine.
Druid's contributor base and commit velocity have declined year-over-year. Quarterly commits dropped roughly 40% compared to the same period the prior year, and per-release contributor counts have fallen from 60+ to around 29. For a system with this operational complexity, a contracting contributor base raises long-term maintenance risk.
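The memory constraint on broadcast hash joins mentioned in the list above follows from how the algorithm works: one input is loaded entirely into a hash table, and the other is streamed past it. Here is a minimal single-process sketch of a left join built that way. It is a generic illustration of the technique, not Druid code, and the table and column names are made up.

```python
def broadcast_hash_join(left_rows, right_rows, key):
    """Left join: build a hash table on the right input, then stream the left input."""
    # Build phase: the entire right side is held in memory at once
    table = {}
    for r in right_rows:
        table.setdefault(r[key], []).append(r)
    # Probe phase: left rows are processed one at a time
    for l in left_rows:
        matches = table.get(l[key])
        if matches:
            for r in matches:
                yield {**r, **l}
        else:
            yield dict(l)   # a left join keeps unmatched left rows

events = [
    {"user_id": 1, "clicks": 3},
    {"user_id": 2, "clicks": 5},
    {"user_id": 9, "clicks": 1},
]
users = [{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": "free"}]

joined = list(broadcast_hash_join(events, users, "user_id"))
for row in joined:
    print(row)
```

In a distributed setting, "broadcast" means every node receives its own copy of the right-side table, so memory use grows with both table size and node count. That is why small lookup tables join well and large-to-large joins do not.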

Pricing

Open Source (Apache 2.0 license).
Commercial managed offerings available via vendors like Imply.

4. SingleStore for HTAP (hybrid OLTP and OLAP)

Best for

Hybrid Transactional/Analytical Processing (HTAP) workloads.
Applications that need fast single-row OLTP writes alongside large OLAP aggregations within a single engine.

Overview

SingleStore is a proprietary, distributed SQL database built to unify transactional and analytical workloads. It bridges the gap between traditional row stores and modern column stores, combining in-memory row-oriented indexing for transactional operations with persistent columnar disk storage for broad analytics.

Key features

Universal storage model: An on-disk columnstore extended with transactional features — row-level locking, upserts, and hash indexes — so a single table type serves both analytical scans and operational access, alongside an in-memory rowstore for the highest-write workloads.
MySQL wire compatibility: Integrates with existing BI tools, ORMs, and application frameworks built for the MySQL ecosystem.
Bottomless storage: Separates hot local storage from colder object-storage-backed capacity in managed deployments, with warm data cached locally on SSDs.
High-speed ingestion: Features native Kafka Pipelines and object store ingestion to stream data directly into the database.

Pros

Can reduce the architectural complexity of running separate OLTP (e.g., PostgreSQL) and OLAP (e.g., ClickHouse) databases for specific hybrid use cases like gaming leaderboards or real-time financial decisioning.
Familiar MySQL ecosystem integrations significantly reduce the learning curve for application developers.

Cons

Proprietary and closed-source, locking users into a specific vendor ecosystem.
While capable as a hybrid engine, pure OLAP engines like ClickHouse remain stronger for analytical serving workloads. Pure OLTP databases remain stronger for transaction-heavy workloads.
Commercial licensing limits cost transparency and exit flexibility compared with open-source alternatives.

Pricing

Proprietary software licensing.
Managed cloud service with compute-based hourly pricing.

5. DuckDB for embedded analytics (in-process OLAP)

Best for

Single-node, local data processing, and embedded analytical applications.
Data engineering workflows, local testing, and desktop data science.

Overview

DuckDB is an in-process SQL OLAP database management system. Often described as the "SQLite for Analytics," it runs completely embedded within a host process (such as Python, Node.js, or C++) without requiring any external server management, JVMs, or cluster configuration.

Key features

In-process execution: Operates entirely within the host application. No network overhead, socket communication, or server management required.
Columnar-vectorized engine: Optimized for executing analytical queries using local CPU caches.
Zero-copy integration: Reads directly from Pandas DataFrames, Apache Arrow, and Parquet files in memory without expensive serialization or data duplication.

Pros

Zero operational complexity or infrastructure to manage.
Incredibly fast for datasets that fit on a single machine. Its in-process execution avoids network overhead, and its columnar-vectorized engine is built for analytical scans, aggregations, and joins over local data.

Cons

Not designed as a distributed, multi-node serving database for thousands of concurrent external clients.
DuckDB supports concurrency within a single read-write process, or across multiple read-only processes, but it lacks the built-in high availability, replication, and write-heavy streaming ingestion features required for enterprise-grade real-time serving.

Pricing

100% Free and Open Source (MIT License).
Commercial cloud scaling and hyper-tenancy options available via MotherDuck.

6. Snowflake for enterprise cloud data warehousing

Best for

Governed warehouse BI, enterprise data sharing, complex batch ELT, and ad-hoc analytics where second-level latency is acceptable.
Organizations prioritizing Snowflake's managed governance and warehouse ecosystem over native sub-second OLAP serving.

Overview

Snowflake is a major proprietary cloud data warehouse. It pioneered the separation of compute and storage for elastic batch analytics, allowing multiple independent compute clusters to access the same underlying object storage.

While optimized for ad-hoc queries over massive historical datasets, recent additions attempt to bridge the gap toward real-time AI and operational workloads.

Key features

Elastic virtual warehouses: Start, stop, resize, and isolate compute warehouses over shared data, reducing contention between ETL and BI workloads.
Snowflake Cortex and hybrid tables: Cortex Search provides hybrid vector and keyword search for RAG and enterprise search. Hybrid Tables use a row-oriented primary layout for high-concurrency operational reads and writes, while standard Snowflake tables remain columnar and are better suited to large analytical scans.
Snowflake Horizon: Deep, enterprise-grade governance, RBAC, and data-sharing features.
Snowpipe Streaming: Low-latency streaming ingestion with data available for query in as little as 5 seconds.

Pros

Fully managed service with mature governance, workload isolation, and enterprise administration.
Large ecosystem of native BI integrations, marketplace data sharing, and robust cross-organization governance.

Cons

Standard warehouses queue queries when compute resources are unavailable, and Snowflake recommends multi-cluster warehouses for concurrency scaling. Achieving low-latency operational reads and writes uses Hybrid Tables, which are a separate operational table type rather than a general accelerator for standard analytical tables.
Serving thousands of simultaneous users requires scaling warehouse capacity or multi-cluster warehouses, and each running cluster bills its own credits per hour. This makes high-concurrency serving cost scale with provisioned compute.
Credit-based billing can become unpredictable for high-concurrency or bursty workloads. Warehouses bill by uptime with a one-minute minimum charge on resume, and compute is metered in abstract credits rather than hardware units, making direct cost comparison difficult.
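The effect of the one-minute minimum on bursty workloads is easy to work through. The sketch below assumes per-second metering after the minimum and ignores idle time before auto-suspend, which in practice adds further billed seconds. The credits_per_hour value is a parameter you would take from your own warehouse size, and the usage pattern is hypothetical.

```python
def billed_seconds(run_seconds, minimum_seconds=60):
    """Billable time for one resume-to-suspend cycle with a minimum charge."""
    return max(run_seconds, minimum_seconds)

def credits_used(cycles, credits_per_hour):
    """Total credits for a list of uptime cycles (seconds each)."""
    return sum(billed_seconds(s) for s in cycles) * credits_per_hour / 3600

# Hypothetical bursty pattern: 30 resumes, each running for 10 seconds
bursty = [10] * 30
# The same 300 seconds of work done in a single uptime window
steady = [300]

bursty_credits = credits_used(bursty, credits_per_hour=1)
steady_credits = credits_used(steady, credits_per_hour=1)
print(f"bursty: {bursty_credits:.3f} credits, steady: {steady_credits:.3f} credits")
```

Both patterns do 300 seconds of work, but the bursty one is billed for 1,800 seconds, six times as much. Keeping a warehouse warm avoids the minimum but bills for idle time instead, so neither option makes spiky serving traffic cheap.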

Pricing

Proprietary credit-based model billed on warehouse size and uptime. Credit-based pricing can become unpredictable for high-concurrency or always-on workloads.

7. BigQuery for serverless cloud data warehousing

Best for

Serverless batch analytics, ad-hoc data exploration, and organizations deeply embedded in the Google Cloud ecosystem.
Analyzing petabyte-scale historical datasets without managing any underlying infrastructure.

Overview

Google BigQuery is a fully managed, serverless enterprise data warehouse on GCP. It uses a distributed architecture (historically known as Dremel) to execute columnar queries dynamically across thousands of Google-managed nodes, abstracting away all cluster sizing and hardware management from the user.

Key features

Serverless architecture: No clusters to provision or size. Queries automatically scale across available compute slots behind the scenes.
BI Engine and vector search: Provides vector search for AI integrations. To mitigate baseline dashboard latency, BigQuery offers BI Engine, an explicitly configured in-memory acceleration layer for selected dashboard workloads.
Storage Write API: Combines streaming and batch ingestion for row-level appends and near-real-time data availability.
Built-in ML/AI: Allows data scientists to execute machine learning models directly via standard SQL.

Pros

Fully managed, automatic scaling for massive, infrequent ad-hoc queries over petabytes of data.
Integrated with the Google Cloud ecosystem, including Google Analytics, Looker, and Vertex AI.

Cons

BigQuery's job-oriented execution, query queues, and slot/reservation behavior make it a poor fit for high-concurrency, millisecond-latency user serving unless teams explicitly configure BI Engine or another serving layer.
BigQuery has quota and reservation-specific concurrency behavior, including queued interactive query limits and special limits for remote functions and UDFs. Slot contention and reservation design can make latency less predictable for user-facing SLAs.
The on-demand model charges per TiB processed, so broad scans and poorly pruned queries can become expensive even when result sets are small. Slot reservations avoid per-scan billing but require committing compute upfront, which can lead to underutilization during off-peak periods.
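The trade-off between per-scan and reserved pricing can be framed as simple arithmetic. Every number below is a placeholder chosen for round results, not a quoted BigQuery price or a measured workload. Substitute current list prices and your own scan sizes before drawing conclusions.

```python
def on_demand_cost(tib_scanned_per_query, queries_per_day, price_per_tib):
    """Daily cost when billing is proportional to bytes processed."""
    return tib_scanned_per_query * queries_per_day * price_per_tib

def reservation_cost(slots, hours, price_per_slot_hour):
    """Daily cost when billing is proportional to reserved compute."""
    return slots * hours * price_per_slot_hour

PRICE_PER_TIB = 5.0          # placeholder, not a quoted price
PRICE_PER_SLOT_HOUR = 0.05   # placeholder, not a quoted price

# A dashboard query that scans 0.02 TiB but returns only a handful of rows
unpruned = on_demand_cost(0.02, queries_per_day=50_000, price_per_tib=PRICE_PER_TIB)
# The same query after partition pruning cuts the scan to 0.001 TiB
pruned = on_demand_cost(0.001, queries_per_day=50_000, price_per_tib=PRICE_PER_TIB)
# A fixed reservation that is billed whether or not it is busy
reserved = reservation_cost(slots=500, hours=24, price_per_slot_hour=PRICE_PER_SLOT_HOUR)

print(f"on-demand, unpruned: {unpruned:.0f} per day")
print(f"on-demand, pruned:   {pruned:.0f} per day")
print(f"reservation:         {reserved:.0f} per day")
```

On-demand cost tracks bytes scanned times query count, so result size is irrelevant and pruning matters a great deal. The reservation is flat regardless of traffic, which is where the underutilization risk during off-peak periods comes from.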

Pricing

On-demand (per TiB processed) or Capacity pricing (compute slots or BI Engine reservations).

8. StarRocks for join-heavy real-time OLAP (MPP)

Best for

Workloads requiring complex, multi-table real-time joins without flattening data pipelines.
Data engineering teams specifically comparing ClickHouse and StarRocks for high-concurrency BI and join-heavy normalized schemas.

Overview

StarRocks is an open-source, high-performance analytical database that delivers real-time query speeds on large datasets without requiring heavy upstream denormalization. It employs a fully vectorized architecture tailored to efficiently handle both star and snowflake schemas.

Key features

Cost-Based Optimizer (CBO): Uses statistics to plan complex join queries and optimize execution paths for normalized tables.
Fully vectorized engine: Maximizes CPU efficiency for large-scale analytical processing across distributed nodes.
Vector search: Supports beta vector indexes, including HNSW and IVFPQ, for approximate nearest neighbor search alongside analytical queries.
Direct data lake querying: Supports federated queries directly across Apache Iceberg, Apache Hudi, and Hive formats without moving data.
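A cost-based optimizer's use of statistics to plan joins can be illustrated with a toy model. The sketch below enumerates left-deep join orders for a star schema and picks the one that pushes the fewest rows through the pipeline. It is a teaching model, not StarRocks's optimizer: real planners also weigh join algorithms, data distribution, and bushy plans, and the table names and selectivities here are hypothetical.

```python
from itertools import permutations

def plan_cost(fact_rows, dim_order, selectivity):
    """Cost = total rows flowing through the join pipeline for a left-deep plan."""
    rows, cost = fact_rows, 0
    for dim in dim_order:
        cost += rows                      # rows probed against this dimension
        rows = rows * selectivity[dim]    # rows surviving the join and its filter
    return cost

def choose_join_order(fact_rows, selectivity):
    """Pick the dimension order with the lowest estimated cost."""
    return min(permutations(selectivity),
               key=lambda order: plan_cost(fact_rows, order, selectivity))

# Hypothetical statistics: fraction of fact rows that survive each dimension join
selectivity = {"date_dim": 0.5, "store_dim": 0.01, "product_dim": 0.2}
FACT_ROWS = 1_000_000

best = choose_join_order(FACT_ROWS, selectivity)
worst = max(permutations(selectivity),
            key=lambda order: plan_cost(FACT_ROWS, order, selectivity))

print("best :", best, plan_cost(FACT_ROWS, best, selectivity))
print("worst:", worst, plan_cost(FACT_ROWS, worst, selectivity))
```

Joining the most selective dimension first shrinks the intermediate result early, so later joins touch far fewer rows. The quality of the plan depends entirely on the statistics: stale or missing selectivity estimates lead the same search to a poor order.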

Pros

Strong join performance out of the box for star-schema workloads, with a CBO that reduces the need for upstream data flattening.
Supports high-concurrency queries on real-time data streams, with explicit support for Primary Key upserts.

Cons

Distributed joins and spill behavior still require memory, partitioning, and workload tuning.
Smaller global ecosystem and third-party integration footprint than ClickHouse, with less community support and less mature operational tooling.

Pricing

Open Source (Apache 2.0 license).
Managed options for enterprises available via CelerData.

9. Apache Doris for MySQL-compatible real-time OLAP

Best for

Unified real-time reporting and ad-hoc analysis where simplified cluster deployment is a priority.
Teams evaluating ClickHouse against Doris who require deep, native MySQL compatibility.

Overview

Apache Doris is an easy-to-use, open-source real-time analytical database originally built at Baidu. It centers on a simplified massively parallel processing (MPP) architecture consisting entirely of Frontend (FE) and Backend (BE) nodes, completely removing reliance on external coordination systems.

Key features

Simplified architecture: Eliminates dependencies on external distributed systems like ZooKeeper, reducing the effort of deployment, scaling, and day-two operations.
Rich join algorithms: Supports broadcast, shuffle, and colocate joins, providing flexibility for executing complex analytical queries over normalized data.
Native materialized views: Can transparently rewrite supported SELECT, PROJECT, JOIN, GROUP BY (SPJG) queries, with async views providing eventual consistency.
MySQL protocol support: Integrates with standard BI tools, drivers, and visualization platforms without requiring custom connectors.

Pros

Easy to deploy, scale, and maintain compared to heavier, multi-role Java-based systems like Apache Druid or Pinot.
Delivers strong performance for point queries, reporting dashboards, and large ad-hoc analytical workloads across lakehouse data.

Cons

Write-heavy streaming workloads and continuous high-volume mutations can bottleneck query performance if backend nodes are not explicitly sized and tuned.

Pricing

Open Source (Apache 2.0 license).
Commercial support and managed cloud availability via SelectDB.

Conclusion: choosing the best OLAP database for real-time analytics

Choosing the right OLAP database in 2026 comes down to matching your platform to your workload shape. Snowflake and BigQuery remain strong for warehouse-native governance, data sharing, and latency-tolerant batch or ad-hoc analytics. But serving data directly to applications requires a purpose-built columnar engine.

For architectures that need sub-second latencies, native vector search, and the ability to handle hundreds to thousands of concurrent queries without spiraling compute costs, ClickHouse stands out as the premier serving layer. It is a strong option for analytical warehouse consolidation where OLTP semantics and warehouse-specific governance features are not required. ClickHouse Cloud billing is based on the compute resources actually consumed, so cost stays predictable as concurrency grows. Since the core database is open source, there is no licensing lock-in to design around.

Validate these architectural realities against your own data shapes. Spin up a free trial of ClickHouse Cloud, load as much of your own data as possible, run an evaluation at a realistic scale, and compare the results against your existing system.

FAQs about real-time OLAP databases (2026)

What is a real-time OLAP database?

A real-time OLAP database is a columnar analytics engine designed for sub-second queries on fresh, continuously ingested data under high concurrency.

What is the difference between a cloud data warehouse and a real-time OLAP database?

Cloud data warehouses (like Snowflake and BigQuery) are designed for warehouse-native batch processing, complex ELT, governed BI, and broad data sharing, with query latency measured in seconds for many interactive analytical workloads. Real-time OLAP databases (like ClickHouse, Druid, StarRocks, and Pinot) are built for sub-second query latencies, continuous streaming ingestion, and high-concurrency user-facing applications.

Is Snowflake a real-time database?

Not for general-purpose real-time OLAP serving. Snowflake has Snowpipe Streaming for second-level ingestion and Hybrid Tables for operational reads and writes, but standard Snowflake warehouses remain optimized for governed analytics, batch processing, and complex aggregations. Under high concurrency, standard warehouse latency is less predictable than that of purpose-built serving engines because queries can queue when compute resources are unavailable.

What is a good lower-latency Snowflake alternative for real-time apps?

When evaluating alternatives, note that Snowflake and BigQuery are strong for warehouse-native governance, data sharing, and latency-tolerant batch or ad-hoc analytics. ClickHouse can be added as a complementary "speed layer" to an existing data warehouse to achieve sub-second latency, or used for analytical consolidation where a dedicated OLTP layer or warehouse-specific governance is not needed. Organizations route high-concurrency user-facing analytics and hybrid search workloads to ClickHouse when Snowflake or BigQuery becomes too slow or too expensive for serving.

Which OLAP database is best for Kafka streaming?

ClickHouse, Apache Pinot, and Apache Druid all feature exceptional native integrations for Kafka. ClickHouse is widely favored for its versatility, offering deduplication patterns, advanced JSON parsing, and high compression via Kafka ingestion, table engines, and materialized views.

Which databases sustain sub-second queries under heavy concurrency?

ClickHouse, Pinot, Druid, StarRocks, and Doris make up the relevant real-time OLAP class. Pinot is strongest for known indexed patterns. StarRocks and Doris are strong alternatives for join-heavy normalized schemas. ClickHouse is the best default for high-concurrency dashboards because it combines performance, compression, SQL flexibility, and operational maturity.

Which OLAP databases support complex joins on normalized schemas?

ClickHouse, StarRocks, and Apache Doris all support complex joins on normalized schemas with modern optimizers and join algorithms. Pinot generally requires denormalized tables.

What does "sub-second P99 latency" mean, and why does it matter?

P99 latency is the response time that 99% of queries complete within; only the slowest 1% take longer. Keeping P99 under a second is critical for user-facing analytics where tail latency drives perceived performance.
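
To make the definition concrete, here is a minimal sketch that computes percentiles with the nearest-rank method. The latency values are made up for illustration, and real monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least `pct` percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast queries and 2 slow ones.
latencies_ms = [120] * 98 + [900, 2500]

p50 = percentile_nearest_rank(latencies_ms, 50)  # 120 ms
p99 = percentile_nearest_rank(latencies_ms, 99)  # 900 ms
print(f"p50={p50} ms, p99={p99} ms, max={max(latencies_ms)} ms")
```

The median looks healthy at 120 ms, while the P99 shows how close the tail is to the one-second budget. That gap is why serving SLAs are usually written against a high percentile rather than an average.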

Which OLAP database is best for vector search and RAG workloads?

ClickHouse, StarRocks, Doris, and Pinot all have vector search or vector index capabilities. ClickHouse is the best default when vector search must sit inside the same high-concurrency OLAP system as JSON analytics, joins, streaming ingestion, and dashboard serving.

Can I use Snowflake or BigQuery for real-time user-facing analytics?

You can, but they have higher baseline latency and cost under high concurrency. Teams use them for batch/BI and add a real-time OLAP serving layer for interactive workloads, or use ClickHouse for analytical consolidation that does not require OLTP semantics or warehouse-specific governance features.

What is the best database for stateful AI agents in 2026?

Aman Puri — Tue, 14 Jul 2026 11:44:18 +0000

Engineering teams building stateful AI agents usually start by bolting context onto their existing relational databases, like Postgres with pgvector. It's familiar territory.

This works fine for simple retrieval. But it hits walls fast. Extracting personalized data from the relational store requires multiple complex queries and manual sifting, and workloads that demand multi-hop reasoning, temporal state tracking, or permission-aware graph traversal break it further.

As application-layer memory frameworks mature, the underlying database infrastructure needs to evolve too. The shift is away from flat storage models, relational or vector, toward systems that can track complex entity relationships over time.

Key takeaways

The best database depends on the agent's workload. Use vector databases for simple RAG, Postgres for your system-of-record, and graph-native context infrastructure with temporal versioning for stateful, multi-hop, or regulated agents.
Relational databases like Postgres struggle with stateful agent context because recursive JOINs (CTEs) are slow for multi-hop traversal and relational schemas require complex migrations to evolve an agent's ontology over time.
Vector databases fail at relationship-aware workloads. They have no concept of entity relationships or how facts change over time.
Engineering teams need to distinguish the infrastructure layer (the physical database) from the application layer (agent memory frameworks). Choose your database foundation first.
Graph-native infrastructure on object storage gives you multi-hop traversal, versioned temporal state, and a much lower-cost model for retaining long-term agent history.

Quick answer: database routing matrix for AI agent workloads

Picking the right context substrate means matching your agent's reasoning requirements to the physical capabilities of the underlying database.

Agent workload profile	Recommended database architecture	Engineering rationale
Simple RAG & stateless assistants	Dedicated vector database or relational + vector extension (pgvector)	When you only need semantic similarity without temporal reasoning or entity relationships, dedicated vector stores give you the lowest latency. Postgres with pgvector can serve smaller-scale vector workloads, though performance characteristics depend on index type, hardware, and query patterns.
Long-running stateful agents	Graph-native context infrastructure on object storage	Long-term memory requires versioned state. Agents need to know what changed over months of interaction. Object storage economics let you retain temporal graph data indefinitely without aggressively pruning history to save costs.
Multi-agent orchestration	Graph-native context infrastructure or traditional graph database	When multiple specialized agents read and write to a shared context pool, the database has to map permissions, handoffs, and tool dependencies directly. Property graphs execute these multi-hop dependencies in a single traversal pass.
Regulated enterprise agents	Graph-native context infrastructure	Enterprise agents require strict multi-tenant isolation, bitemporal audit trails, and concrete provenance. A versioned graph lets compliance teams reconstruct exactly what an agent knew at any historical timestamp.

This matrix isolates the storage primitive from application logic. If you try running a long-running stateful agent on a flat vector store, you'll end up building a custom graph traversal and temporal versioning engine in the application layer.

That's an anti-pattern, which leads to consistency drift and data governance failures.

Why relational and vector databases break for stateful agents at scale

Relational databases like Postgres and MySQL are the undeniable systems of record for canonical business data (billing, permissions, orders, transactions, audit) and metadata. These belong in a relational schema with strict ACID guarantees.

When engineers force highly connected, temporal agent context into these systems, the architecture fractures under query load.

Agent context is inherently graph-shaped. When an agent tries to answer a question spanning multiple connected entities, it needs multi-hop resolution.

In Postgres, you stitch this together with recursive Common Table Expressions (CTEs). The problem is that Postgres evaluates a recursive CTE iteratively and never inlines it into the outer query, so it acts as an optimization fence: the query planner can't push predicates from the outer query down into the recursion.

As context scales, recursively joining entities, metadata, and timestamps drives inference-time query latency up sharply.
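
For readers who haven't written one, here is a rough sketch of what a multi-hop lookup looks like as a recursive CTE. It runs against SQLite through Python's standard library as a stand-in for Postgres (the WITH RECURSIVE form should carry over, but treat that as an assumption), and the edges table, its columns, and the node IDs are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("user:ana", "member_of", "org:acme"),
        ("org:acme", "owns", "doc:policy"),
        ("doc:policy", "processed_by", "tool:summarizer"),
        ("tool:summarizer", "triggered", "action:notify"),
    ],
)

# Each recursion step joins the rows found so far back onto the edges table.
# The depth bound is written inside the recursive term on purpose: a filter
# placed only on the outer SELECT would run after the recursion has finished.
MULTI_HOP = """
WITH RECURSIVE reachable(node, depth) AS (
    SELECT ?, 0
    UNION ALL
    SELECT e.dst, r.depth + 1
    FROM reachable AS r
    JOIN edges AS e ON e.src = r.node
    WHERE r.depth < ?
)
SELECT node, depth FROM reachable ORDER BY depth
"""

rows = conn.execute(MULTI_HOP, ("user:ana", 4)).fetchall()
for node, depth in rows:
    print(depth, node)
```

Four hops over four edges is trivial. The cost shows up when every step also has to join metadata and timestamp tables across millions of edges.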

Flat vector databases introduce a different failure mode. Dedicated vector stores are great at retrieving semantically similar chunks. But they're completely blind to relationships, exact identifiers, and temporal drift.

Pure vector retrieval for every query is unsafe because it misses exact IDs, names, dates, and policy clauses, making hybrid retrieval the baseline. Furthermore, pure vector retrieval can't represent what changed, who has access, or how a specific document clause relates to an organizational policy across time. Flat chunks drop the connective tissue.

Relying on pure cosine similarity means your agent retrieves context that's semantically relevant but factually obsolete. That can drive up hallucination rates in production.

Infrastructure layer vs. application layer: what to store where

The single largest architectural mistake engineering teams make when designing agentic systems is conflating the application layer with the infrastructure layer.

The infrastructure layer is the physical database foundation. Systems like Postgres, Apache AGE, Pinecone, Neo4j, FalkorDB, Elasticsearch, OpenSearch, and HydraDB live here. They handle durable storage, indexing, and multi-signal retrieval execution. Your physical context resides here.

The application layer, often branded as "agent memory," consists of frameworks and products like Mem0, Zep, Letta, Graphiti, and Supermemory. These tools run on top of the infrastructure layer. They provide orchestration logic, chunking strategies, entity extraction, and retention rules that dictate what gets written to the database and how it's summarized.

You have to choose your database foundation before you choose or build an application memory layer.

Adopting an application-layer product out of the box often forces your team to inherit the vendor's underlying infrastructure choices and predefined memory schemas. If a memory framework hard-codes its entity extraction to a flat relational schema, you lose the ability to model your specific domain ontology.

Separating these layers lets you deploy a graph-native database substrate while keeping the freedom to build a bespoke application layer tailored to your product logic.

Five database architectures for AI agent context

Evaluating databases as context substrates means looking past general-purpose benchmarks. Focus strictly on how they handle stateful, multi-hop agent workloads.

Architecture	Schema flexibility	Native multi-hop traversal	Temporal/bitemporal	Hybrid retrieval	Cost model
Relational databases with vector extensions (Postgres + pgvector, MySQL)	Rigid (ontology changes need migrations)	None (recursive JOINs only)	Bolt-on (triggers and audit tables)	Partial (pgvector add-on, no fusion)	Co-located with business data (scales on block/SSD)
Dedicated vector databases (Pinecone, Qdrant, Weaviate)	Flat (vectors only)	None (no relationships)	None	Partial (semantic-first, weak on keyword/graph/time)	SSD storage (pricey for long retention)
Document stores and hot caches (MongoDB, Redis)	Flexible (schema-on-read)	None (no graph traversal)	None	Partial (bolt-on vector, no fusion)	Low-latency cache (consistency drift for durable context)
Traditional graph databases (Neo4j, Amazon Neptune)	Semi-rigid (strict node/edge models)	Native (Cypher, ms traversals)	Weak (schema gymnastics)	None (no multi-signal fusion)	RAM/storage costs prohibitive at high volume
Graph-native context infrastructure on object storage (HydraDB)	Flexible (bring-your-own ontology, dynamic edge metadata)	Native (graph-native traversal)	Native (valid-time + commit-time on edges)	Native (semantic + keyword + graph + temporal + rerank)	Object-storage economics (retain full history)

Relational databases with vector extensions (Postgres + pgvector, MySQL)

Using Postgres with pgvector or pgvectorscale is the default starting point for most engineering teams.

Strengths: Universal familiarity, mature operational tooling, strict ACID compliance, and the ability to co-locate agent memory with canonical business data.
Failure modes for agents: Recursive JOINs choke on multi-hop entity resolution. Relational schemas are rigid, so evolving an agent's ontology requires painful database migrations. Maintaining temporal versioning (tracking what an agent knew at a specific past date) requires complex trigger systems and audit tables that bloat storage and slow down ingestion.

Dedicated vector databases (Pinecone, Qdrant, Weaviate)

Vector databases were the primary infrastructure choice for the first wave of Retrieval-Augmented Generation (RAG).

Strengths: Fast semantic retrieval and high-throughput approximate nearest-neighbor (ANN) search algorithms optimized for massive embedding workloads.
Failure modes for agents: Context in a vector database is flat. These systems lack relationship modeling. You can't answer questions like "Which tool did the user who uploaded this document use last week?" without heavy application-side glue code. They also lack built-in provenance tracking and bitemporal state, making it hard to verify why an agent retrieved a specific fact. Storage costs scale with data volume on block storage or SSDs, making infinite retention of conversational history expensive compared to architectures built on object storage.

Document stores and hot caches (MongoDB, Redis)

Teams frequently use flexible JSON stores and in-memory databases in the caching layer of agentic systems.

Strengths: Schema-on-read flexibility makes them excellent for capturing unstructured episodic logs and raw agent event payloads. They provide low-latency session caching for active working memory.
Failure modes for agents: They don't support graph traversal. Vector search extensions have been bolted onto these systems, but they're often computationally expensive and lack multi-signal fusion. Relying on them for long-term durable context leads to eventual consistency drift between the agent's semantic memory and the actual system of record.

Traditional graph databases (Neo4j, Amazon Neptune)

Property graphs explicitly map nodes (entities) and edges (relationships).

Strengths: Excellent for multi-hop reasoning. Query languages like Cypher let developers execute complex dependency traversals in milliseconds, enforcing strict node and edge models by design.
Failure modes for agents: Traditional graph databases have historically bottlenecked on ingestion throughput. Newer engines like Memgraph can bulk-import over 1 million nodes and edges per second in batch loads, but legacy systems and live transactional ingestion often trail far behind, so they struggle to keep pace with agents that continuously stream new facts. Scaling costs become prohibitive for high-volume storage. Traditional graphs also struggle with bitemporal state: managing "what was true then vs. now" requires significant schema gymnastics.

Graph-native context infrastructure on object storage (HydraDB)

This emerging architecture decouples graph compute from physical storage by writing directly to object storage layers like Amazon S3.

Strengths: This architecture combines a Git-style versioned graph, vector search, and standard B-tree indexes in a single system, layering explicit graph traversal and hybrid multi-signal retrieval on top. Valid-time metadata attaches directly to graph edges.
Advantage: By building on object storage, storage costs drop sharply. Teams can retain full, immutable histories of core agent context and traverse temporal timelines without aggressively pruning high-value data to save on block storage costs.

How to evaluate databases for AI agent storage (five lenses)

The baseline design rule is that every context record should carry tenant_id, user_id, agent_id, source, ACL, created_at, valid_from, valid_to, confidence, and retention/TTL. This prevents cross-tenant leakage and lets agents tell current facts from stale context.
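
As a sketch of that baseline record, here is one way the fields could be modeled in Python. The field names follow the list above. The payload field, the two helper methods, and the sample values are assumptions added for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ContextRecord:
    tenant_id: str
    user_id: str
    agent_id: str
    source: str                    # where the fact came from (provenance)
    acl: frozenset                 # principals allowed to read this record
    created_at: datetime
    valid_from: datetime
    valid_to: Optional[datetime]   # None means the fact is still current
    confidence: float
    ttl: Optional[timedelta]       # None means retain indefinitely
    payload: str                   # hypothetical: the fact itself

    def is_current(self, at: datetime) -> bool:
        """True if the fact is valid at time `at`."""
        return self.valid_from <= at and (self.valid_to is None or at < self.valid_to)

    def readable_by(self, tenant_id: str, principal: str) -> bool:
        """True only for a matching tenant AND a principal listed in the ACL."""
        return self.tenant_id == tenant_id and principal in self.acl


now = datetime(2026, 7, 14, tzinfo=timezone.utc)
record = ContextRecord(
    tenant_id="t1", user_id="u1", agent_id="support-bot",
    source="ticket:881", acl=frozenset({"u1", "support-team"}),
    created_at=now - timedelta(days=30),
    valid_from=now - timedelta(days=30), valid_to=None,
    confidence=0.92, ttl=None, payload="prefers email contact",
)
print(record.is_current(now), record.readable_by("t2", "u1"))
```

Freezing the dataclass matches the append-over-overwrite pattern: a changed fact becomes a new record instead of a mutation of the old one.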

Schema flexibility and multi-hop traversal

AI systems deal with highly dynamic information. The database needs to let developers bring their own ontology rather than forcing them into a predefined schema.

It also has to traverse dependencies as a core operation, mapping a User to an Organization to a Document to a Tool to an Action, without recursive SQL JOINs.

Production Schema Design: Relational databases force developers to cram metadata into rigid, sprawling tables or overly fragmented normalized setups. A graph-native approach lets architects assign mandatory metadata fields dynamically on edges and nodes. When a new entity type emerges in the application layer, the infrastructure can absorb it without a schema migration.
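
Here is a minimal in-memory sketch of the idea, assuming a toy property graph where node types and edge metadata are plain values rather than a fixed schema. The class, method names, and sample IDs are hypothetical. A real graph engine would add indexing, persistence, and a query language.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny property graph: types are labels on nodes, metadata lives on edges."""

    def __init__(self):
        self.nodes = {}                     # node_id -> {"type": ..., other properties}
        self.out_edges = defaultdict(list)  # node_id -> [(rel, dst_id, metadata)]

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, src, rel, dst, **metadata):
        self.out_edges[src].append((rel, dst, metadata))

    def follow(self, start, rels):
        """Follow a sequence of relationship types from `start`; return the nodes reached."""
        frontier = [start]
        for rel in rels:
            frontier = [dst for node in frontier
                        for (r, dst, _meta) in self.out_edges[node] if r == rel]
        return frontier


g = PropertyGraph()
g.add_node("user:ana", "User")
g.add_node("org:acme", "Organization")
g.add_node("doc:policy", "Document")
g.add_node("tool:summarizer", "Tool")
g.add_node("action:notify", "Action")
g.add_edge("user:ana", "member_of", "org:acme", source="hr_sync", confidence=0.99)
g.add_edge("org:acme", "owns", "doc:policy", source="drive_crawl", confidence=0.95)
g.add_edge("doc:policy", "processed_by", "tool:summarizer", source="agent_log", confidence=0.9)
g.add_edge("tool:summarizer", "triggered", "action:notify", source="agent_log", confidence=0.9)

# User -> Organization -> Document -> Tool -> Action in one traversal call.
actions = g.follow("user:ana", ["member_of", "owns", "processed_by", "triggered"])

# A new entity type needs no migration: it is just another node with a new type label.
g.add_node("incident:42", "Incident", severity="high")
g.add_edge("action:notify", "raised", "incident:42", source="pager", confidence=0.8)
incidents = g.follow("user:ana", ["member_of", "owns", "processed_by", "triggered", "raised"])
print(actions, incidents)
```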

Temporal and bitemporal state

Agents need to track what's true now versus what was true in the past to maintain accurate decision-making over time. Bitemporal modeling tracks both transaction time (when the database recorded the fact) and valid time (when the fact was actually true in the real world).
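
A small sketch can make the two time axes easier to see. It assumes an append-only list of fact versions and a simplified rule where the most recently recorded matching version wins. The FactVersion type, the as_of helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    subject: str
    predicate: str
    value: str
    valid_from: datetime           # valid time: when the fact became true in the world
    valid_to: Optional[datetime]   # None means "still true"
    recorded_at: datetime          # transaction time: when the store learned it


def as_of(log, subject, predicate, valid_at, known_at):
    """Return what the store believed at `known_at` about the world at `valid_at`."""
    candidates = [
        v for v in log
        if v.subject == subject and v.predicate == predicate
        and v.recorded_at <= known_at
        and v.valid_from <= valid_at
        and (v.valid_to is None or valid_at < v.valid_to)
    ]
    if not candidates:
        return None
    # Simplification: the latest recorded version supersedes earlier ones.
    return max(candidates, key=lambda v: v.recorded_at).value


log = [
    # Recorded on Jan 10: Ana has been on the free plan since Jan 1.
    FactVersion("user:ana", "plan", "free", datetime(2026, 1, 1), None, datetime(2026, 1, 10)),
    # Recorded late, on Mar 5: Ana actually moved to pro on Feb 1.
    FactVersion("user:ana", "plan", "pro", datetime(2026, 2, 1), None, datetime(2026, 3, 5)),
]

mid_feb = datetime(2026, 2, 15)
believed_in_feb = as_of(log, "user:ana", "plan", valid_at=mid_feb, known_at=datetime(2026, 2, 20))
believed_in_mar = as_of(log, "user:ana", "plan", valid_at=mid_feb, known_at=datetime(2026, 3, 10))
print(believed_in_feb, believed_in_mar)  # free pro
```

Both answers are about the same real-world date. The difference is what the store knew when it was asked, which is the question an audit of a past agent decision has to answer.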

Compliance Requirement: An immutable, append-only architecture inherently conflicts with GDPR Article 17 (Right to Erasure). You can't simply delete a node in an immutable ledger. Production teams need to implement tombstoning combined with crypto-shredding. Encrypt personal data with a per-user key, then destroy that key when an erasure request comes in.
Production Schema Design: Context records need valid_from and valid_to timestamps. This lets agents differentiate current facts from stale historical context and enforces a pattern of appending state (versioning) over destructive overwrites (CRUD).
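
The tombstoning and crypto-shredding pattern described above can be sketched with the standard library alone. Two loud caveats: the XOR keystream below is a toy cipher used only to keep the example dependency-free, and the ShreddableLog class is hypothetical. A production system would use a vetted authenticated cipher and a managed key service, and whether a given design satisfies an erasure request is a question for your compliance team.

```python
import hashlib
import secrets


def _xor_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # TOY cipher for illustration only: XOR against a SHAKE-256 keystream.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableLog:
    """Append-only log whose personal data becomes unreadable once a user's key is destroyed."""

    def __init__(self):
        self._keys = {}           # user_id -> per-user key
        self._tombstones = set()  # erased users
        self.entries = []         # (user_id, nonce, ciphertext); never rewritten

    def append(self, user_id: str, text: str) -> None:
        if user_id in self._tombstones:
            raise ValueError("user has been erased")
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        self.entries.append((user_id, nonce, _xor_keystream(key, nonce, text.encode("utf-8"))))

    def read(self, user_id: str) -> list:
        key = self._keys.get(user_id)
        if key is None:
            return []  # no key, nothing recoverable
        return [_xor_keystream(key, nonce, blob).decode("utf-8")
                for uid, nonce, blob in self.entries if uid == user_id]

    def erase(self, user_id: str) -> None:
        # Tombstone the user and destroy the key; the ciphertext stays in the log.
        self._tombstones.add(user_id)
        self._keys.pop(user_id, None)


store = ShreddableLog()
store.append("u1", "prefers email contact")
before = store.read("u1")
store.erase("u1")
after = store.read("u1")
print(before, after, len(store.entries))
```

The log keeps its append-only shape: the entry count is unchanged after erasure, but the plaintext can no longer be recovered without the destroyed key.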

Provenance and decision traceability

When an agent takes an autonomous action, engineering and compliance teams need to audit the exact context that drove that decision. Flat vectors lose this traceability immediately upon ingestion.

Production Schema Design: Every edge and node needs to support tracing data. Mandatory schema fields should include source, created_at, and confidence built into the relationships. If an agent retrieves a policy document, the database should also return the edge connecting that policy to the user. That edge documents the relationship and exposes the confidence score of the extraction at inference time.

Hybrid retrieval and multi-tenant access control

The database needs to natively support multi-signal retrieval, fusing semantic vectors, sparse keyword indexing, graph traversal paths, and temporal metadata into a single ranked result.
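
The text doesn't name a fusion method, so as an illustration here is reciprocal rank fusion (RRF), one common way to merge ranked lists from different signals. Each item scores the sum of 1 / (k + rank) over the lists it appears in. The signal names, chunk IDs, and the k=60 default are assumptions for this sketch, not a description of any specific product.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list, best first."""
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical per-signal results for one query.
signals = {
    "semantic": ["c3", "c1", "c7"],
    "keyword": ["c1", "c9"],
    "graph": ["c1", "c3"],
}

fused = reciprocal_rank_fusion(signals)
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunk c1 wins because three signals agree on it, even though the semantic ranker alone preferred c3. RRF uses only ranks, so the signals don't need comparable score scales.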

Production Schema Design: Enforce multi-tenant security in the database itself, not just in application logic. Treat the baseline tenant_id, user_id, agent_id, and ACL as strictly mandatory fields on every single context record. The database should enforce this separation at the query layer, so that a multi-hop traversal halts immediately at any edge that lacks the appropriate tenant identifier.
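
As a rough sketch of that rule, the traversal below refuses to cross any edge whose tenant_id is missing or different from the caller's. The edge dictionaries and the function name are hypothetical. In a real system this check would live inside the database engine rather than in Python.

```python
from collections import deque


def tenant_scoped_reach(edges, start, tenant_id, max_hops=4):
    """Breadth-first traversal that only follows edges tagged with the caller's tenant."""
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge["src"], []).append(edge)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for edge in adjacency.get(node, []):
            # Missing or mismatched tenant_id: stop along this path.
            if edge.get("tenant_id") != tenant_id:
                continue
            if edge["dst"] not in seen:
                seen.add(edge["dst"])
                queue.append((edge["dst"], hops + 1))
    return seen


edges = [
    {"src": "user:ana", "dst": "org:acme", "tenant_id": "t1"},
    {"src": "org:acme", "dst": "doc:policy", "tenant_id": "t1"},
    {"src": "doc:policy", "dst": "doc:other-tenant", "tenant_id": "t2"},
    {"src": "doc:policy", "dst": "tool:untagged"},  # no tenant_id at all
    {"src": "doc:other-tenant", "dst": "doc:secret", "tenant_id": "t2"},
]

reachable = tenant_scoped_reach(edges, "user:ana", "t1")
print(sorted(reachable))  # ['doc:policy', 'org:acme', 'user:ana']
```

Treating an untagged edge the same as a foreign one is the safer default: a record that skipped the mandatory fields fails closed instead of leaking.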

Inference-time latency and cost

Agentic workflows often require dozens of database calls per user interaction. The database has to meet sub-second latency requirements during inference while maintaining viable storage economics.

If storing dense vectors on premium solid-state drives costs too much, teams are forced to aggressively prune context history, effectively giving their agents amnesia.

Production Schema Design: Using a decoupled compute-and-storage architecture via object storage prevents these cost overruns. The schema should also support a TTL (Time to Live) field. This lets engineers intelligently expire episodic logs or transient session caches without bloating the primary infrastructure. High-value graph relationships persist indefinitely, while low-value chat logs gracefully age out.
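
A TTL sweep along those lines can be as simple as the sketch below. The record shape, the IDs, and the seven-day TTL are made-up values. A record with no TTL is treated as retain-forever.

```python
from datetime import datetime, timedelta


def expire(records, now):
    """Keep records with no TTL, or whose created_at + ttl is still in the future."""
    return [r for r in records
            if r["ttl"] is None or r["created_at"] + r["ttl"] > now]


now = datetime(2026, 7, 14)
records = [
    # High-value relationship: no TTL, persists indefinitely.
    {"id": "edge:ana-member_of-acme", "created_at": datetime(2025, 1, 1), "ttl": None},
    # Transient chat logs: age out after seven days.
    {"id": "chat:session-17", "created_at": datetime(2026, 7, 1), "ttl": timedelta(days=7)},
    {"id": "chat:session-42", "created_at": datetime(2026, 7, 12), "ttl": timedelta(days=7)},
]

kept = [r["id"] for r in expire(records, now)]
print(kept)  # ['edge:ana-member_of-acme', 'chat:session-42']
```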

Why HydraDB fits stateful AI agent context (graph + temporal + hybrid retrieval)

HydraDB is graph-native context infrastructure for stateful AI applications, built for agent architectures that have outgrown simple RAG.

It's purpose-built to sit underneath memory frameworks, company brains, and autonomous workflows. It provides the substrate these applications require to function reliably at high volume.

HydraDB differentiates itself through a versioned temporal graph architecture. Instead of destructively overwriting facts, it appends new state. Graph edges carry relation type, explicit commit time, and valid-time metadata.

It uses a Sliding Window Inference Pipeline that makes chunks self-contained before retrieval by resolving entity references and embedding contextual bridges. It also uses a unified multi-signal retrieval engine that combines semantic search, sparse keywords, latent signals, metadata, graph traversal, temporal bounds, and cross-encoder reranking into a single query path.

Since it's built on an object-storage foundation, long-term retention costs are driven by object-storage pricing rather than block storage or SSDs.

It also enforces multi-tenant isolation at the storage layer, scoping every node and edge by tenant so a single deployment can serve many customers without cross-tenant leakage.

HydraDB's performance on the LongMemEval-s benchmark confirms the effectiveness of multi-signal retrieval:

LongMemEval-s category	HydraDB score
Overall accuracy	90.79%
Temporal reasoning	90.97%
Knowledge updates	97.4%
Single-session user & assistant recall	100%
Single-session preference extraction	96.67%

The overall score is averaged across all question categories, including harder multi-session reasoning categories that are not broken out above.

HydraDB used Gemini 3.0 Pro as a judge and GPT-5 mini for inference during these benchmarks. Alternatives like Zep reported scores using GPT-4o. Engineering teams should always run multi-hop evaluations on their own proprietary corpus rather than relying exclusively on vendor-reported benchmarks.

Engineering teams evaluating graph-native context infrastructure should map out how it will integrate with their chosen application-layer memory products or orchestration frameworks.

Conclusion: choosing a database for stateful AI agent context

Building reliable, stateful AI agents requires drawing a hard line between canonical business records and agent context.

Keep your canonical data safely inside your relational database. But stop trying to force multi-hop, temporal AI context into rigid relational tables or isolated, flat vector stores. The friction of recursive JOINs and the loss of temporal traceability will break your production workflows.

When choosing your context infrastructure, evaluate your database on schema flexibility, bitemporal truth enforcement, decision traceability, hybrid retrieval, multi-tenant security, and cost-to-scale.

If your AI agents need to reason over relationships, temporal state, and cross-source context, HydraDB provides the graph-native context infrastructure on object storage to build your own memory layers and ontologies, without inheriting someone else's schema.

Check out HydraDB’s documentation and architecture benchmarks to see the temporal graph in action.

FAQ: AI agent database architecture

What's the best database for AI agents in 2026?

It depends on the workload. Use Postgres for canonical business data, a vector database for simple RAG, and a graph-native database with temporal versioning for long-running, multi-hop, and permission-aware agents.

When is Postgres + pgvector enough for an AI agent?

Postgres with pgvector works for small-to-mid-scale semantic retrieval and simple assistants. It breaks down when agents require complex multi-hop traversal, evolving schemas, or temporal queries to determine "what was true then vs. now."

What is "graph-native context infrastructure on object storage"?

It's a graph database architecture where graph state is stored in an object storage layer (like Amazon S3), while compute for traversal and retrieval runs separately. This model enables cheaper long-term data retention and a fully versioned temporal history.

Do I need a graph database if I already have a vector database?

It depends on your agent's workload. If your agent requires relationship-aware reasoning, multi-hop queries, permission checks across entities, or temporal state tracking, then yes, a vector database alone is not sufficient.

What is hybrid retrieval for agent context?

Hybrid retrieval combines multiple search signals into one ranked result set: vectors for semantics, keywords for sparse terms, metadata filters for tenants or time, and graph traversal for relationships. This improves accuracy and reduces incorrect matches from stale or cross-tenant data.

How should multi-tenant permissions be enforced for agent memory?

Enforce permissions (tenant_id, user_id, ACLs) at the database query layer, not just in application code. This ensures data traversals and retrieval operations can't cross tenant boundaries by design.

What's the difference between temporal and bitemporal modeling for agents?

Temporal modeling tracks when a fact is valid in the real world (valid time). Bitemporal modeling tracks both valid time and transaction time (when the system recorded the fact). This lets you reconstruct exactly what the agent knew at any point in the past.

How do I migrate from Postgres/pgvector to a graph-based agent context store?

Keep Postgres as the system-of-record for canonical data. Stream events and data changes from Postgres into the graph context store. Gradually shift agent read operations (retrieval and traversal) to the graph while maintaining write-ahead logs or audit links for provenance.

How much agent history should I retain, and what should expire?

Retain high-value data indefinitely: core entities, their relationships, and provenance records. Use a Time-to-Live (TTL) policy to automatically expire low-value, transient data like episodic logs or session caches. This controls costs without causing long-term amnesia.

When should I use Neo4j/Neptune vs a graph-on-object-storage approach?

Use traditional graph databases like Neo4j or Neptune for smaller-scale graphs or when you need mature, out-of-the-box tooling and integrations. Go with graph-on-object-storage when you need to retain massive volumes of versioned history at a lower storage cost.

Every AI Company Needs a Context Graph. None of Them Need the Same One.

Aman Puri — Tue, 07 Jul 2026 13:14:47 +0000

AI agents that can't carry structured state across sessions degrade on repeated work. They lose what worked, what failed, which sources were dead ends, and how users corrected them.

Perplexity's Brain is one response: a context graph that reported +25% answer correctness and +16% recall on previously-seen tasks (internal metrics). Other teams have arrived at the same problem with radically different architectures.

This is not a knowledge-graph revival. The schemas diverge too sharply, and the workload is too different. If you're building AI products, the question isn't whether to invest in structured context. It's whether to adopt an off-the-shelf context model or build on infrastructure that lets you define your own.

Key Takeaways

Multiple teams are building context layers for AI agents, but their schemas diverge sharply. Same need, radically different architectures.
Context graphs are a distinct database workload: append-heavy, bitemporal, provenance-rich, permission-aware, and latency-sensitive on the inference path.
Good context graph infrastructure is opinionated about how context is stored, traversed, versioned, and retrieved, and silent about what your graph means.
Object-storage economics make it possible to retain every relationship, version, invalidated fact, and evidence chain instead of pruning what agents need most.
If your context model is where your domain intelligence compounds, don't outsource it. Build on infrastructure that provides the primitives, not the schema.

Why every context graph schema looks different

Perplexity Brain organizes around sessions, tasks, files, corrections, and outcomes. Its graph makes one agent (Computer) better at repeated work by tracking what worked, what failed, which sources were dead ends, and how corrections propagate. The ontology is episodic and execution-adjacent.

Capital One's engineering team published an architecture for agent memory using a context graph: entities as nodes, claims as directed edges with confidence scores and lifecycle states. Every claim links back to the conversation that produced it, and contradicting evidence deprecates older claims with a timestamp. The ontology is evidence-centric.
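
To show the shape of an evidence-centric model like this, here is a loose sketch of claims as edges that get deprecated when contradicted. This is not Capital One's implementation. The Claim type, the "active" and "deprecated" state names, and the contradiction rule (same subject and predicate, different object) are all assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    confidence: float
    conversation_id: str                  # link back to the producing conversation
    state: str = "active"                 # hypothetical lifecycle: active -> deprecated
    deprecated_at: Optional[datetime] = None


def assert_claim(claims, new, now):
    """Append `new` and deprecate any active claim it contradicts, stamping the time."""
    updated = []
    for c in claims:
        contradicts = (c.state == "active" and c.subject == new.subject
                       and c.predicate == new.predicate and c.obj != new.obj)
        updated.append(replace(c, state="deprecated", deprecated_at=now) if contradicts else c)
    updated.append(new)
    return updated


t0 = datetime(2026, 7, 1, tzinfo=timezone.utc)
t1 = datetime(2026, 7, 5, tzinfo=timezone.utc)
claims = assert_claim([], Claim("customer:17", "preferred_channel", "email", 0.8, "conv:101"), t0)
claims = assert_claim(claims, Claim("customer:17", "preferred_channel", "sms", 0.9, "conv:140"), t1)

for c in claims:
    print(c.obj, c.state, c.conversation_id)
```

The older claim stays in the graph with its conversation link intact, so the system can still explain what it once believed and which conversation changed that.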

Glean maps content, people, activity, and permissions. It's optimized for enterprise findability: who authored what, who viewed it, who shared it, and who's allowed to see it. The ontology is retrieval-centric.

Dust describes its context layer in terms of semantic search, data sources, tools, and agent orchestration, not graph-native traversal. The ontology is agent-operational: instructions, tools, skills, workspaces, governance.

Product	Core schema elements	Ontology type
Perplexity Brain	Sessions, tasks, files, corrections, outcomes	Episodic
Capital One	Entities, claims, confidence scores, lifecycle states	Evidence-centric
Glean	Content, people, activity, permissions	Retrieval-centric
Dust	Semantic search, data sources, tools, agent orchestration	Agent-operational

Same need, radically different schemas, and these are just four examples.

The ontology trap: why building on someone else's schema breaks

Schema divergence is healthy at the application layer. It becomes dangerous when you mistake one application's schema for infrastructure.

When you reach for an existing "company brain" or memory layer as infrastructure, you don't just adopt a storage engine. You adopt a worldview.

That worldview decides what counts as a node, which relationships are first-class, what time means, how evidence attaches to facts, how permissions propagate, and what gets preserved versus summarized away.

In a demo, this feels like acceleration. In production, it becomes a ceiling.

Salesforce's object model (Account, Contact, Lead, Opportunity) standardizes sales motion, but it also shapes how organizations think about customers and pipeline. Jira's issue types and workflow schemes define what "work" means so completely that teams building unique CD pipelines end up fighting the state machine.

Once a vendor schema becomes specific enough to be useful, a 1:1 export to a neutral format becomes practically impossible.

A security-incident graph forced into a people/document/activity model leaks or distorts. A robotics state machine forced into a decisions/commitments model doesn't fit. A clinical-trial audit trail forced into an agent-episodic schema loses its regulatory structure. A support team's SLA-and-escalation lineage doesn't map onto any of them.

These applications should have opinions, and they do. Perplexity should have a Perplexity-shaped ontology. Glean should have a Glean-shaped one.

Strong product opinions make great applications and terrible substrates. If your ontology is where your domain intelligence compounds (and it is), your ontology is part of your product. You don't outsource it.

Context graphs are a distinct database workload

Context graphs aren't knowledge graphs with more hype. They're a new serving workload.

A context graph is append-heavy: every session, correction, tool call, and permission change adds or invalidates nodes and edges.

It needs temporal state. Production systems increasingly need full bitemporal semantics, tracking both when a fact was true in the world and when the system recorded it. Without temporal invalidation, agents inherit contradictory or stale facts, which drives hallucination and reasoning failures.
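As a rough illustration of bitemporal semantics, the sketch below stores two time axes on every edge: when the fact held in the world and when the system believed it. The `Edge` fields, the integer timestamps, and the `as_of` helper are all hypothetical names chosen for this example, not any particular engine's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    src: str
    rel: str
    dst: str
    valid_from: int                      # when the fact became true in the world
    valid_to: Optional[int]              # when it stopped being true (None = still true)
    recorded_at: int                     # when the system recorded this version
    superseded_at: Optional[int] = None  # when the system replaced this version


def visible(edge: Edge, valid_at: int, known_at: int) -> bool:
    """True if the edge held at valid_at according to what was recorded by known_at."""
    in_valid = edge.valid_from <= valid_at and (
        edge.valid_to is None or valid_at < edge.valid_to
    )
    in_known = edge.recorded_at <= known_at and (
        edge.superseded_at is None or known_at < edge.superseded_at
    )
    return in_valid and in_known


def as_of(edges: List[Edge], valid_at: int, known_at: int) -> List[Edge]:
    return [e for e in edges if visible(e, valid_at, known_at)]


# Ownership changed at t=30, but the system only learned about it at t=35.
# Nothing is deleted: the old version is superseded and corrected versions are appended.
edges = [
    Edge("alice", "owns", "billing", 10, None, 12, superseded_at=35),
    Edge("alice", "owns", "billing", 10, 30, 35),
    Edge("bob", "owns", "billing", 30, None, 35),
]

print([e.src for e in as_of(edges, valid_at=32, known_at=33)])  # what the system believed at t=33
print([e.src for e in as_of(edges, valid_at=32, known_at=40)])  # what it knows after the correction
```

Because corrections append new versions instead of overwriting, an agent can ask both "who owned this at t=32?" and "what did we think the answer was before the correction?".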

It's provenance-rich: facts and relationships should carry evidence of why they exist, linked to their source session, document, conversation, or author.

It's permission-aware: access control is evaluated during traversal rather than applied as a post-filter. Glean's architecture treats permissions as part of indexing and search, and the same principle applies to context graphs.

Relationship traversal can leak context (a public ticket referring to a private postmortem), so forbidden nodes should never be expanded.
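A minimal sketch of the difference, assuming a simple adjacency map and a per-caller set of permitted node IDs (both hypothetical): the traversal checks permission before a node joins the frontier, so nothing behind a forbidden node is ever reached through it.

```python
from collections import deque
from typing import Dict, List, Set


def permitted_neighborhood(
    adjacency: Dict[str, List[str]],
    allowed: Set[str],
    start: str,
    max_hops: int,
) -> List[str]:
    """Breadth-first traversal that never expands a node the caller cannot see."""
    if start not in allowed:
        return []
    seen = {start}
    order = [start]
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in adjacency.get(node, []):
            # The permission check happens before the node enters the frontier.
            if neighbor in seen or neighbor not in allowed:
                continue
            seen.add(neighbor)
            order.append(neighbor)
            frontier.append((neighbor, depth + 1))
    return order


# A public ticket links to a private postmortem, which links to a root-cause note.
adjacency = {
    "ticket-1": ["postmortem-9", "runbook-3"],
    "postmortem-9": ["root-cause-7"],
    "runbook-3": ["service-a"],
}
allowed = {"ticket-1", "runbook-3", "service-a", "root-cause-7"}

print(permitted_neighborhood(adjacency, allowed, "ticket-1", max_hops=3))
```

In this example `root-cause-7` is individually readable, but it is reachable only through the private postmortem. A traverse-then-filter approach would return it and reveal the hidden link; checking during traversal does not.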

And it's latency-sensitive enough to sit in the inference path. An agent's tool-call loop needs working context in the low hundreds of milliseconds. Sub-200ms p95 is a common target for production agent memory systems.

No existing database category cleanly and economically combines all of this. OLTP graph databases handle transactions over relatively stable graphs but weren't designed for append-heavy temporal workloads with per-edge provenance.

Vector databases handle similarity and can store metadata, but their core abstraction doesn't encode directional relationships, temporal validity, provenance chains, or permission-aware traversal.

A vector index tells you what is semantically similar. A context graph tells you what is structurally connected, when it was true, why it matters, and whether you're permitted to know it.

What good context graph infrastructure looks like

The most durable infrastructure products share a pattern: fierce technical opinions, weak domain opinions.

S3 has hard convictions about durability (eleven nines), object semantics, consistency, and request models. It has zero opinions about whether your bytes are product photos, fraud features, genomic sequences, or agent traces.

Postgres makes strong bets on ACID, MVCC, extensibility, and query planning. It doesn't decide your tables. Kafka owns append-only ordered logs and consumer offsets. It doesn't care what's in the messages.

Separating compute from storage is a proven architectural pattern used by modern relational databases such as Amazon Aurora, Google Cloud AlloyDB, and Neon. Building on this concept, turbopuffer rethought vector search around object-storage economics, keeping all durable state in object storage and using NVMe and memory as a cache layer. The cost curve collapsed so dramatically that Cursor reportedly cut costs 95% after migration while improving code-retrieval accuracy by up to 23.5%.

Context-graph infrastructure should work the same way. An engine should hold hard opinions about storage layout, adjacency organization, hot/cold separation, temporal edge compaction, traversal planning, multi-hop read batching, provenance indexing, permission-aware traversal, and hybrid symbolic-plus-semantic retrieval.

It should hold no opinion about whether a node is a session, customer, repo, meeting, device, care plan, ticket, claim, policy, or tool call.

HydraDB is the only graph database that natively separates compute from storage, delivering an engine purpose-built for the context-graph workload: object-storage-native durability, NVMe/RAM-cached hot neighborhoods, append-only writes, bitemporal state, provenance on every edge, and permission-aware traversal. It provides the mechanics of a context graph without renting someone else's ontology.

Why graph database pricing forces you to forget

Managed graph databases typically price per gigabyte of provisioned RAM, not stored data. Leading offerings run $65-146 per GB of RAM per month. Object storage runs ~$0.023/GB/month.

These are not equivalent comparisons: managed graph offerings bundle compute, availability, support, and operations. But the structural implication is real.

When your working graph must live in provisioned RAM at those rates, you prune and curate, losing the very provenance and temporal state that agents need.

When durable graph state lives on object storage, you can afford to remember: every relationship, every version, every invalidated fact, every evidence chain stays intact.
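To make the gap concrete, here is the arithmetic for a hypothetical 500 GB graph using the per-GB figures quoted above. The graph size is an assumption for illustration, and as noted, the two prices do not buy the same thing: the RAM figure bundles compute and operations.

```python
RAM_PRICE_PER_GB_MONTH = (65.0, 146.0)     # provisioned-RAM range quoted above
OBJECT_STORAGE_PRICE_PER_GB_MONTH = 0.023  # object storage figure quoted above


def monthly_cost(graph_gb: float, price_per_gb: float) -> float:
    return graph_gb * price_per_gb


graph_gb = 500  # hypothetical graph size

ram_low = monthly_cost(graph_gb, RAM_PRICE_PER_GB_MONTH[0])
ram_high = monthly_cost(graph_gb, RAM_PRICE_PER_GB_MONTH[1])
object_storage = monthly_cost(graph_gb, OBJECT_STORAGE_PRICE_PER_GB_MONTH)

print(f"Provisioned RAM: ${ram_low:,.0f} to ${ram_high:,.0f} per month")
print(f"Object storage:  ${object_storage:,.2f} per month")
print(f"Ratio:           {ram_low / object_storage:,.0f}x to {ram_high / object_storage:,.0f}x")
```

The storage line item differs by three orders of magnitude, which is why retention policy, not engineering effort, is usually what gets cut first.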

This is the same dynamic turbopuffer exploited for vector search: cheaper storage unlocks more usage. Cursor reportedly cut costs 95% and scaled its vector footprint once the economics made retention viable.

The context-graph workload has the same shape: it is append-heavy, grows continuously, and is economically better served by retaining everything.

The hard engineering challenge is making this work for graph traversal, a nastier access pattern than vector search. A naive graph engine on object storage issues one remote fetch per hop, and often one per visited node. At ~100-200ms per round-trip, a 3-hop traversal costs 300-600ms even in the best case and stretches into seconds once fanout multiplies the fetches.

The credible answer involves co-locating adjacency with nodes, batching the traversal frontier, keeping hot neighborhoods in NVMe/RAM, bounding fanout, and separating cold durable truth from hot serving state.
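A back-of-the-envelope latency model shows why these techniques matter. It assumes the naive engine fetches each visited node's adjacency separately and sequentially, while the batched engine issues one round trip per hop for the whole frontier; the fanout of 10 and the 150 ms round trip are illustrative values, not measurements.

```python
def nodes_expanded(fanout: int, hops: int) -> int:
    """Nodes whose adjacency lists must be read: the start node plus depths 1..hops-1."""
    return sum(fanout ** depth for depth in range(hops))


def naive_latency_s(fanout: int, hops: int, rtt_s: float) -> float:
    """One sequential remote fetch per expanded node."""
    return nodes_expanded(fanout, hops) * rtt_s


def batched_latency_s(hops: int, rtt_s: float, cached_hops: int = 0) -> float:
    """One round trip per hop for the whole frontier; cached hops cost no remote fetch."""
    return max(hops - cached_hops, 0) * rtt_s


fanout, hops, rtt_s = 10, 3, 0.15

print(f"naive, per-node fetches:   {naive_latency_s(fanout, hops, rtt_s):.2f} s")
print(f"batched frontier:          {batched_latency_s(hops, rtt_s):.2f} s")
print(f"batched, two hops cached:  {batched_latency_s(hops, rtt_s, cached_hops=2):.2f} s")
```

Bounding fanout shrinks the first number, batching replaces it with the second, and hot-neighborhood caching is what brings the result under an inference-path budget.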

Existing systems point in the same direction. Dgraph, a widely used graph database, builds on an LSM-tree store with predicate-based partitioning, and the LSMGraph research system combines LSM-tree write friendliness with CSR-like read locality for exactly this class of problem. Applying these ideas on object storage requires an engine purpose-built for the workload.

Own your context graph schema

The next thousand company brains will be built for support intelligence, sales coordination, code reasoning, security response, clinical workflows, compliance lineage, agent memory, and domains nobody has built yet. Each with its own nodes, edges, temporal semantics, and permission model.

The market for application-layer brains is real and growing. Every company building AI products will need one. The mistake is assuming every company needs the same one, or that you should inherit your context model from a vendor whose domain assumptions will diverge from yours within the first quarter of production use.

The durable infrastructure position, the one HydraDB is built around, is the one that lets all of these exist. Opinionated about the mechanics, silent about the meaning. Cheap enough that you don't have to choose what to forget. Fast enough that context graphs can sit in the inference path. Structurally incapable of locking anyone into a schema that isn't theirs.

Your ontology is your product. Don't outsource it. Use infrastructure that has hard opinions about how context is stored, traversed, versioned, secured, and retrieved, and no opinion at all about what your graph means.


Frequently asked questions

What is a context graph?

A context graph is a structured representation of an AI agent's working state: sessions, corrections, tool calls, decisions, sources, and the relationships between them.

Unlike flat memory stores or embedding indexes, a context graph preserves provenance (where a fact came from), temporal validity (when it was true), structured relationships between entities, and permission-aware access control. It sits in the inference path and gives agents structured context for every run.

How is a context graph different from a knowledge graph?

A knowledge graph maps what is known: entities, attributes, taxonomies, and stable relationships. It's curated and relatively static. A context graph maps what is in motion: what happened, what changed, what worked, what failed.

Think of context graphs as a superset of, or a dynamic upgrade to, traditional knowledge graphs. They are session-centric, append-heavy, continuously mutating, and updated after every agent run. While they build on foundational graph capabilities, their workload profile is fundamentally different from that of traditional, static knowledge graphs.

Why do AI agents need context graphs?

Agents without persistent structured state degrade on repeated work. They can't carry forward what worked, what failed, which sources were dead ends, or how users corrected them.

Embeddings tell you what's semantically nearby, but they don't tell you what depends on what, what superseded what, what broke last time, or what's permitted. Context graphs answer relationship and provenance questions that flat retrieval can't.

What makes context graphs a distinct database workload?

Context graphs are append-heavy, bitemporal (tracking when facts were true and when they were recorded), provenance-rich, permission-aware during traversal, and latency-sensitive enough to sit in the inference path. No existing database category cleanly combines all of these.

OLTP graph databases weren't designed for append-heavy temporal workloads. Vector databases don't encode directional relationships, temporal validity, provenance chains, or permission-aware traversal.

What is the ontology trap in AI infrastructure?

The ontology trap occurs when you adopt an off-the-shelf context layer and unknowingly inherit its schema assumptions. That schema decides what counts as a node, which relationships are first-class, how time works, and how permissions propagate. In a demo this feels like acceleration. In production it becomes a ceiling, because your domain's needs will diverge from the vendor's assumptions.

How does object storage reduce context graph costs?

Managed graph databases typically price per gigabyte of provisioned RAM ($65-146/GB/month). Object storage runs ~$0.023/GB/month. When your graph must live in RAM at those rates, you prune and curate, losing the provenance and temporal state that agents need. Object-storage-native architecture lets you retain every relationship, version, invalidated fact, and evidence chain.

What should I look for in context graph infrastructure?

An engine with strong technical opinions about storage layout, traversal planning, provenance indexing, temporal edge compaction, permission-aware retrieval, and hot/cold separation, but no opinion about what your nodes and edges mean. The infrastructure should handle the mechanics of a context graph without forcing you into a schema that isn't yours.

ClickHouse Concurrency: How to Size for User-Facing Analytics

Aman Puri — Fri, 03 Jul 2026 10:16:12 +0000

Sizing ClickHouse for a customer-facing analytics product starts with the workload. An application can have thousands of active users, generate bursts of dashboard requests, ingest data continuously, and still require sub-second query latency. The number of queries it can serve depends on the cost of those requests, the available resources, and the required latency.

ClickHouse has no fixed 100-query concurrency ceiling. The server-level max_concurrent_queries setting defaults to 0, which means unlimited. In ClickHouse Cloud, max_concurrent_queries_for_all_users defaults to 1,000 per replica. Both are configurable admission controls that protect a deployment during overload, and each server or Cloud replica evaluates them independently.

Our concurrency documentation covers analytical workloads serving more than 10,000 queries per second with latency below 10 milliseconds on petabyte-scale databases. Capacity varies with the query mix, data layout, cache state, ingestion load, latency targets, hardware, and cluster topology. A selective query that reads a small number of granules has a very different cost from a large aggregation, sort, or join.

Measure representative traffic at increasing concurrency. Use the results to configure resource limits, admission controls, and scaling.

TL;DR: How many concurrent queries can ClickHouse handle?

ClickHouse has no fixed architectural ceiling for concurrent queries. Sustainable concurrency is the number of simultaneous queries that meet the latency target for a specific query mix and deployment. Configured query limits protect the service during overload; they do not define the engine's capacity.

To size a deployment:

Translate active users and dashboard fan-out into peak queries per second and simultaneous queries.
Benchmark with production-like conditions using the representative query mix, real ingestion load, and cache state.
Configure limits based on benchmark results covering per-query parallelism, memory, admission controls, and workload scheduling.
Add replicas to meet throughput and availability targets when measured per-replica capacity falls short.

Measure the workload behind concurrent users

The number of active users becomes useful for sizing only after it is translated into database traffic.

A product can have 10,000 active users while only 50 queries execute at once. A single dashboard load can also fan out into 20 queries, causing 500 users to produce a burst of 10,000 requests.

Define these measurements before sizing the deployment:

Peak requests per second sent to ClickHouse
Number of queries generated by each page or dashboard load
Number of simultaneously executing queries
Query mix, including filters, aggregations, joins, sorts, and exports
p50, p95, and p99 latency targets for each query class
Peak result size
Ingestion rate and insert batch size
Required capacity during a replica failure

Under steady-state conditions, Little's Law relates throughput, average time in the measured system, and average concurrency. At 1,000 queries per second and an average ClickHouse execution time of 50 milliseconds (not end-to-end request latency), the average number of queries executing is approximately 50.

At the same request rate with 500 milliseconds of ClickHouse execution time, it is approximately 500. Using end-to-end request latency instead would include time spent in application queues, network transit, and other components outside query execution.
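The fan-out and Little's Law figures above can be reproduced with a few lines of arithmetic. This is a sizing sketch only; the function names are made up for illustration, and the inputs should come from your own measurements.

```python
def burst_queries(active_users: int, queries_per_dashboard_load: int) -> int:
    """Queries produced when every active user loads a dashboard at once."""
    return active_users * queries_per_dashboard_load


def average_concurrency(queries_per_second: float, avg_execution_ms: float) -> float:
    """Little's Law: average in-flight queries = arrival rate x average time in system."""
    return queries_per_second * avg_execution_ms / 1000


print(burst_queries(500, 20))          # dashboard fan-out example from the text
print(average_concurrency(1_000, 50))  # 50 ms ClickHouse execution time
print(average_concurrency(1_000, 500)) # 500 ms ClickHouse execution time
```

Little's Law gives an average under steady state. Bursts such as the 10,000-request fan-out sit well above that average, which is why peak and simultaneous-query measurements are listed separately.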

How ClickHouse executes concurrent queries

ClickHouse executes queries through parallel processing pipelines. The selected data is divided across processing lanes, and each lane processes blocks through operations such as filtering and aggregation.

max_threads controls the maximum number of query-processing threads available to a query and defaults to the number of hardware threads available to ClickHouse. This setting defines an upper bound. The actual number of processing lanes depends on the amount of data selected and the work available in each pipeline stage. A selective query can use fewer lanes than its max_threads value. Under concurrency control, a query may start with limited parallelism and scale up when more CPU slots become available.

The same workload dependency that shapes single-query parallelism applies across concurrent queries.

A server with 64 available hardware threads and max_threads=8 can execute more than eight queries simultaneously. Individual queries may use fewer than eight threads, and CPU scheduling allows additional queries to make progress. The sustainable query count depends on how much CPU time, memory, and I/O the workload consumes while meeting its latency target.

Memory consumption follows a similar pattern, but it is often nonlinear, as more processing lanes activate additional buffers and intermediate states. Aggregation cardinality, join build sides, sorting, decompression, result size, and query shape can drive most of the memory consumption.

Use EXPLAIN PIPELINE to inspect processing lanes for a query. Use system.processes to inspect active queries, current and peak memory, and peak thread usage.

Configure ClickHouse concurrency and resource limits

High-concurrency deployments need separate controls for per-query cost, total admitted work, and resource allocation between workloads.

`max_threads`: Control per-query parallelism

max_threads limits how many query-processing threads a query can use. Reducing it can improve throughput under load by preventing one query from consuming most available CPU parallelism, but it can also increase latency for scans that benefit from parallel execution.

Do not set max_threads=1 or 2 as a universal dashboard rule. Test values such as 1, 2, 4, and the deployment default against the real query mix, and select the value that produces the required tail latency and aggregate throughput.

In practice, a selective lookup may not reach its configured maximum, while a large scan can benefit materially from a higher value. Separate profiles let the application use a lower tested limit, while batch analytics use a different limit.

`max_memory_usage`: Limit memory per query

max_memory_usage limits the RAM used by one query on one server. A value of 0 means unlimited in both self-managed and Cloud deployments. In ClickHouse Cloud, the default is set based on replica memory.

Set this limit from measured peak memory for each query class, with headroom for data growth and parameter variation. If the limit is too low, it converts valid traffic into query failures, but if it is too high, it allows several expensive queries to exhaust the server together.

Per-query memory limits alone do not control the aggregate memory across concurrent queries. Configure max_memory_usage_for_user and server memory limits accordingly.
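A quick worst-case check illustrates the gap between per-query and aggregate limits. The 8 GiB per-query limit, 20 concurrent queries, and 96 GiB server budget are hypothetical numbers; substitute measured peaks for each query class.

```python
def worst_case_memory_gib(per_query_limit_gib: float, concurrent_queries: int) -> float:
    """Memory needed if every admitted query reaches its per-query limit at once."""
    return per_query_limit_gib * concurrent_queries


def max_queries_within_budget(server_budget_gib: float, per_query_limit_gib: float) -> int:
    """How many limit-sized queries fit into the memory reserved for queries."""
    return int(server_budget_gib // per_query_limit_gib)


per_query_limit_gib = 8   # hypothetical max_memory_usage for the dashboard profile
concurrent_queries = 20   # hypothetical admission limit
server_budget_gib = 96    # hypothetical memory reserved for query execution

print(worst_case_memory_gib(per_query_limit_gib, concurrent_queries))
print(max_queries_within_budget(server_budget_gib, per_query_limit_gib))
```

Here the per-query limit alone would allow 160 GiB of simultaneous demand against a 96 GiB budget, so either the admission limit or a per-user and server-level memory cap has to close the difference.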

ClickHouse concurrency limits and query overflow behavior

ClickHouse provides multiple admission controls:

Server-wide max_concurrent_queries
Server-wide max_concurrent_select_queries
Server-wide max_concurrent_insert_queries
Per-user max_concurrent_queries_for_user
Cross-user max_concurrent_queries_for_all_users

These limits are enforced by the local server process, and in a replicated deployment, each replica applies its configured limits to the queries it receives.

When the server-wide max_concurrent_queries limit is reached, a new query can wait for a slot for up to queue_max_wait_ms. Its default value is 0, which means no wait. If no slot becomes available before the configured timeout, ClickHouse rejects the query with TOO_MANY_SIMULTANEOUS_QUERIES.

The select, insert, per-user, and cross-user concurrency limits reject a new query when their threshold is reached. The bounded wait associated with max_concurrent_queries provides simple overflow handling. Use a QUERY resource for workload-aware query-slot scheduling.
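On the client side, a rejected query is usually worth a short, bounded retry rather than an immediate error page. The sketch below is one way to do that under a request deadline. `QueryRejected` is a hypothetical stand-in for whatever exception your driver raises for TOO_MANY_SIMULTANEOUS_QUERIES, and `run_query` is any callable that executes the query.

```python
import time


class QueryRejected(Exception):
    """Hypothetical stand-in for a driver error carrying TOO_MANY_SIMULTANEOUS_QUERIES."""


def run_with_deadline(run_query, deadline_s, base_delay_s=0.05, max_delay_s=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry a rejected query with exponential backoff until the deadline would be exceeded.

    Returns (result, attempts). Re-raises QueryRejected when the next wait
    would not fit into the remaining request budget.
    """
    start = clock()
    delay = base_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return run_query(), attempts
        except QueryRejected:
            remaining = deadline_s - (clock() - start)
            if remaining <= delay:
                raise  # give up: surface the overload to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay_s)


# Usage with a fake query that is rejected once and then succeeds.
state = {"calls": 0}


def fake_query():
    state["calls"] += 1
    if state["calls"] == 1:
        raise QueryRejected("TOO_MANY_SIMULTANEOUS_QUERIES")
    return "rows"


print(run_with_deadline(fake_query, deadline_s=1.0))
```

Retries add load during overload, so keep the deadline tight and consider adding random jitter to the delay when many clients retry at once.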

Set admission limits from per-replica load-test results and operational headroom. The limit should keep each server below the point where p99 latency rises sharply, memory pressure causes failures, or ingestion and background work fall behind.

Use measured values for each deployment. A limit of 100 can be too high for memory-heavy joins and too low for selective dashboard queries.

Concurrent thread scheduling in ClickHouse

For self-managed deployments, concurrent_threads_soft_limit_num and concurrent_threads_soft_limit_ratio_to_cores define a soft limit for query-processing threads across concurrent queries. The ratio defaults to 2, so when the absolute limit remains 0, the effective soft limit is twice the number of CPU cores available to ClickHouse. If both settings are nonzero, ClickHouse uses the lower limit.
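The rule described above can be written as a small helper for capacity planning. This mirrors the prose only; it is not ClickHouse source code, and treating both settings at 0 as "no soft limit" is an assumption of this sketch.

```python
from typing import Optional


def effective_thread_soft_limit(
    cores: int,
    limit_num: int = 0,         # concurrent_threads_soft_limit_num
    ratio_to_cores: float = 2,  # concurrent_threads_soft_limit_ratio_to_cores
) -> Optional[float]:
    """Return the effective soft limit on query-processing threads, or None if unset."""
    candidates = []
    if limit_num > 0:
        candidates.append(limit_num)
    if ratio_to_cores > 0:
        candidates.append(ratio_to_cores * cores)
    # When both settings are nonzero, the lower value applies.
    return min(candidates) if candidates else None


print(effective_thread_soft_limit(cores=64))                 # ratio default only
print(effective_thread_soft_limit(cores=64, limit_num=100))  # absolute limit is lower
print(effective_thread_soft_limit(cores=64, limit_num=200))  # ratio-derived limit is lower
```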

The concurrent thread scheduler distributes CPU slots among queries that use concurrency control. An admitted query always receives at least one thread and may scale up as slots become available, while query admission remains governed separately.

These controls operate at different scopes:

max_threads limits one query.
Concurrent-thread scheduling allocates processing threads across queries.
Admission limits cap the number of accepted queries.

Keeping these scopes separate makes the resulting capacity model easier to test and operate.

ClickHouse workload scheduling for CPU, I/O, and query slots

Workload scheduling uses RESOURCE and WORKLOAD objects. Queries are assigned to a workload through the workload setting.

Declaring a CPU resource disables the effect of concurrent_threads_soft_limit_num and concurrent_threads_soft_limit_ratio_to_cores. Participating queries then use workload settings such as max_concurrent_threads or max_concurrent_threads_ratio_to_cores, with the workload scheduler distributing their CPU slots instead of concurrent_threads_scheduler.

The framework can schedule:

CPU resources
Remote disk I/O
Query slots

CPU workloads can define thread limits, CPU shares, weights, and priorities, but CPU throttling through max_cpus and max_cpu_share is active only when cpu_slot_preemption is enabled.

Query-slot scheduling can define limits on concurrent queries, query start rate, bursts, and waiting queries. When a query-slot constraint is full, the query waits until capacity becomes available. Waiting queries remain outside SHOW PROCESSLIST until they start. If max_waiting_queries is reached, ClickHouse returns SERVER_OVERLOADED.

Query-slot waits have no server-side timeout, so applications should enforce request deadlines and cancellation. Asynchronous inserts and some administrative queries, including KILL, are excluded from workload query-slot accounting.

Use query, user, and server settings for memory protection. CPU scheduling currently covers query workloads and excludes merges and mutations, so separate compute remains the strongest form of resource isolation.

When interactive and batch traffic share compute, configure separate workloads with weights, priorities, and query-slot limits derived from mixed-workload tests. Large scans and exports can then operate under limits appropriate to their latency and resource profile.

Configure ClickHouse settings from benchmark results

Assign measured limits to a dedicated application user with ALTER USER or through an XML profile in self-managed deployments. Set max_threads from measured throughput and tail latency, max_memory_usage above the observed peak for valid dashboard queries, and max_concurrent_queries_for_user below the measured per-replica overload point with operational headroom.

max_execution_time can provide an additional server-side guard, but it is not a substitute for the application's request deadline. By default, ClickHouse begins estimating total execution time after timeout_before_checking_execution_speed, which defaults to 10 seconds. Setting it to 0 makes max_execution_time use elapsed clock time. Enforcement occurs only at designated processing points, so actual runtime can exceed the configured limit. Enforce the user-facing deadline in the client and propagate cancellation to ClickHouse.

Use dedicated profiles for workloads with different resource and latency requirements.

Prevent inserts from degrading query performance

High-frequency small synchronous inserts can create data parts faster than background merges can consolidate them. The resulting merge pressure can degrade read performance even when SELECT and INSERT queries use different settings.

The simplest fix is client-side batching. We recommend batches of at least 1,000 rows and ideally 10,000 to 100,000 rows for synchronous inserts.
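A minimal client-side batcher might look like the following. The `flush` callable is a placeholder for whatever sends one synchronous INSERT through your driver, and the class name and thresholds are assumptions for illustration. Note that this sketch checks batch age only when a row arrives; a production version would also flush from a timer and on shutdown.

```python
import time


class InsertBatcher:
    """Buffer rows and hand them to `flush` in batches by row count or age."""

    def __init__(self, flush, max_rows=10_000, max_age_s=1.0, clock=time.monotonic):
        self._flush = flush          # callable that sends one INSERT for a list of rows
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock
        self._rows = []
        self._first_at = 0.0

    def add(self, row):
        if not self._rows:
            self._first_at = self._clock()  # age is measured from the first buffered row
        self._rows.append(row)
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)


# Usage: collect batches in a list instead of sending them to ClickHouse.
sent = []
batcher = InsertBatcher(sent.append, max_rows=3, max_age_s=60.0)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()  # drain the remainder
print([len(batch) for batch in sent])
```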

When client-side batching is not practical, enable asynchronous inserts:

ALTER USER ingest_user SETTINGS async_insert = 1, wait_for_async_insert = 1;

With async_insert=1, ClickHouse buffers compatible inserts and flushes them when a size, time, or query-count threshold is reached. This reduces part creation and ingestion overhead.

With wait_for_async_insert=1, ClickHouse acknowledges the insert only after the buffer is flushed successfully. This is the documented default and the recommended production mode. The client waits, receives flush errors, and retains reliable backpressure.

Setting wait_for_async_insert=0 acknowledges data when it enters memory. That mode can lose buffered data and hide flush errors from the client.

Asynchronous inserts improve batching, while flushes, part creation, and merges continue to use server resources.

When workload scheduling cannot meet the required service-level objective, self-managed deployments can use separate nodes or clusters. In ClickHouse Cloud, separate services in a warehouse let read and write workloads use separate compute.

Use the ClickHouse query cache for repeated dashboard queries

The query cache can reduce work when the same deterministic SELECT query runs repeatedly and slightly stale results are acceptable.

The cache is opt-in through use_query_cache=true, exists once per ClickHouse server process, and is not shared between users by default. Entries become stale after 60 seconds by default.

By default, cache eligibility excludes queries that use nondeterministic functions such as now() and today(). A dashboard query written as event_time >= now() - INTERVAL 24 HOUR therefore behaves differently from a query with fixed time boundaries.

Use explicit time buckets or application parameters when cached results can follow a defined refresh interval:

SELECT 
    customer_id, 
    sum(revenue) 
FROM hourly_revenue 
WHERE hour >= {window_start:DateTime} 
  AND hour < {window_end:DateTime} 
GROUP BY customer_id 
SETTINGS 
    use_query_cache = true, 
    query_cache_ttl = 30;
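For the cache to hit, the application has to send identical parameter values across requests. One way to do that, sketched here, is to align the window to a fixed bucket so every request within the same bucket produces the same window_start and window_end. The one-hour bucket and 24-hour lookback are assumptions chosen to match the hourly table in the example query.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def aligned_window(now: datetime, bucket: timedelta, lookback: timedelta):
    """Return (window_start, window_end) with window_end rounded down to a bucket boundary."""
    completed_buckets = (now - EPOCH) // bucket  # timedelta // timedelta gives an int
    window_end = EPOCH + completed_buckets * bucket
    return window_end - lookback, window_end


first = aligned_window(
    datetime(2026, 7, 3, 10, 16, 12, tzinfo=timezone.utc),
    bucket=timedelta(hours=1),
    lookback=timedelta(hours=24),
)
second = aligned_window(
    datetime(2026, 7, 3, 10, 48, 5, tzinfo=timezone.utc),
    bucket=timedelta(hours=1),
    lookback=timedelta(hours=24),
)

print(first)
print(first == second)  # both requests fall in the same bucket, so the parameters match
```

The trade-off is freshness: results lag by up to one bucket, so pick the bucket size from the dashboard's acceptable staleness.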

Measure the hit rate through system.query_log, system.events, and system.metrics. Calculate uncached capacity independently because a cold cache, a fresh deployment, or a changed query shape can remove expected hits immediately.

Scale ClickHouse read concurrency with replicas

Replicas add read capacity only when traffic is distributed across them.

In a self-managed replicated cluster, configure the client, proxy, or Distributed table routing so that read queries reach different replicas. Replicas also perform replication and background merges, so reserve capacity for that work.

Shards and replicas solve different problems:

Shards split data and let a distributed query process different data subsets across servers.
Replicas store copies of the same data and can serve independent read queries.
Parallel replicas let one query use multiple replicas. They can reduce latency for suitable queries while consuming capacity that could otherwise serve independent requests.

For high-concurrency dashboards, load balancing independent requests across replicas is the primary throughput mechanism. Benchmark parallel replicas by query class because coordination overhead can slow down small queries, complex queries, and high-cardinality aggregations.
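Once per-replica capacity has been measured, the replica count follows from simple arithmetic. The sketch below assumes evenly balanced traffic, a percentage of each replica reserved for replication and merges, and a number of replica failures the deployment must absorb; all input values are hypothetical.

```python
import math


def replicas_needed(
    peak_qps: float,
    per_replica_qps: float,
    reserve_pct: int = 20,        # share of each replica kept for replication and merges
    tolerated_failures: int = 1,  # replicas that may be lost while still meeting peak
) -> int:
    usable_qps = per_replica_qps * (100 - reserve_pct) / 100
    return math.ceil(peak_qps / usable_qps) + tolerated_failures


# Hypothetical inputs: 4,000 QPS peak, 900 QPS measured per replica at the latency target.
print(replicas_needed(peak_qps=4_000, per_replica_qps=900))
```

The per-replica figure must come from a load test at the target p99 latency, not from peak throughput at any latency.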

Scale concurrent queries in ClickHouse Cloud

ClickHouse Cloud uses SharedMergeTree over shared object storage. Compute replicas do not need to maintain independent full copies of table data and metadata, which supports faster scale operations than a deployment that must copy local data to each new replica.

New SharedMergeTree replicas still require CPU, memory, and local cache population, which should be included in scale-up and failover tests.

Vertical autoscaling in ClickHouse Cloud

ClickHouse Cloud Scale and Enterprise services support vertical autoscaling based on CPU and memory usage. Administrators configure minimum and maximum sizes, and the service scales within those bounds.

Size the maximum high enough for the tested peak workload. Autoscaling reacts to measured load and takes time, so capacity testing and admission controls remain part of the design.

Horizontal replica scaling in ClickHouse Cloud

For regular horizontal scaling, administrators change the replica count through the Cloud console or API. Scale and Enterprise services can use additional replicas up to the documented service limits.

Scheduled scaling can change replica count or memory tier at configured times. Metric-based autoscaling applies to vertical service size, while scheduled scaling handles predictable capacity changes.

Isolate workloads with ClickHouse Cloud warehouses

Warehouses provide compute-compute separation, allowing multiple services to use separate compute and endpoints while sharing the same data.

A user-facing deployment can use:

A read-write service for ingestion and write operations
A read-only service for dashboard queries
A separate service for batch analytics or exports

Read-only services do not perform background merges outside system tables, so their CPU and memory stay focused on read queries.

Warehouses provide strong compute isolation. Services still share storage and ClickHouse Keeper, and our warehouse documentation lists edge cases for shared operations. Measure each service independently and set its scaling bounds based on its workload.

Warehouses are available on Scale and Enterprise plans.

Real-World Deployments at Scale

These deployments show how different organizations combine ingestion and analytics with workload-specific data models, infrastructure, and resource controls.

Cloudflare: Analytics at scale with ClickHouse

Cloudflare uses ClickHouse for HTTP and DNS analytics, customer dashboards, Firewall Analytics, and Cloudflare Radar. At their 2023 community event, Cloudflare shared that its deployment had grown to more than 1,000 active replicas processing hundreds of millions of inserted rows per second.

An earlier account of its HTTP analytics pipeline described a 36-node ClickHouse cluster processing an average of 6 million HTTP requests per second, with peaks up to 8 million. The associated Zone Analytics API served about 40 queries per second and reached about 150 queries per second in load testing on that deployment.

GitLab: Sub-second analytics for 50 million users

GitLab uses ClickHouse for product-facing analytics across GitLab.com, GitLab Dedicated, and self-managed deployments. ClickHouse powers workloads such as Contribution Analytics, GitLab Duo analytics, and SDLC trends for a platform serving 50 million registered users.

Queries over 100 million rows that previously took 30 to 40 seconds now return in under a second. GitLab standardized on ClickHouse as its OLAP engine while continuing to use Postgres for transactional workloads.

Laravel Nightwatch: Real-time observability on ClickHouse Cloud

Laravel Nightwatch uses ClickHouse Cloud for the analytical layer of its first-party observability platform. The service processes more than 1 billion events per day while maintaining sub-second query latency for real-time dashboards.

At launch, Nightwatch processed 500 million events on day one and reported 97 ms average dashboard request latency. Its architecture uses Amazon MSK, ClickPipes, materialized views, and ClickHouse Cloud to separate streaming ingestion from analytical queries.

Mintlify: Customer-facing analytics with sub-one-second dashboards

Mintlify uses ClickHouse Cloud to provide real-time analytics for documentation and knowledge sites used by tens of thousands of companies and tens of millions of developers each month.

After moving from PostHog to ClickHouse, Mintlify reduced analytics dashboard load times from tens of seconds to under one second. The architecture uses ClickHouse materialized views to keep multi-tenant customer dashboards responsive as traffic grows.

Best Practices for Improving Concurrency

Reducing the amount of data and intermediate state processed by each request is the most effective way to improve concurrency.

Design the ORDER BY key around common filters and the expected data-pruning behavior.
Read only required columns.
Use projections when a second access pattern justifies additional storage and write work.
Use dictionaries or direct joins for suitable lookup workloads.
Use incremental materialized views and AggregatingMergeTree for repeated aggregations.
Bound export queries separately from interactive requests.
Return only the rows and bytes the application needs.

Pre-aggregation can transform a repeated large scan into a much smaller aggregation over stored states. Measure the resulting access pattern, processing lanes, and resource use, because the query may still scan many rows and execute in parallel.

Query complexity changes the concurrency curve. Large joins, high-cardinality aggregations, full scans, and large sorts consume more CPU, memory, and I/O per request. ClickHouse can execute them concurrently, but the sustainable concurrent query count at a fixed latency target will be lower than for selective queries.

A Step-by-Step Benchmarking and Sizing Guide

1. Define query classes

Group traffic into classes such as:

Interactive dashboard queries
Drill-down queries
API lookups
Large exports
Batch analytics
Inserts

Record expected peak QPS, result size, and latency target for each class.
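One way to keep these expectations in a form a load generator can consume is a small table of query classes. The sketch below is illustrative only: the class names follow the list above, but every number is a hypothetical placeholder, and `QueryClass` and `traffic_mix` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryClass:
    name: str
    peak_qps: float        # expected peak requests per second
    max_result_rows: int   # largest result the application should receive
    p99_target_ms: int     # tail-latency target for this class


# Hypothetical numbers for illustration; replace them with your own expectations.
QUERY_CLASSES = [
    QueryClass("interactive_dashboard", peak_qps=300, max_result_rows=1_000, p99_target_ms=500),
    QueryClass("drill_down", peak_qps=60, max_result_rows=10_000, p99_target_ms=1_000),
    QueryClass("api_lookup", peak_qps=600, max_result_rows=100, p99_target_ms=200),
    QueryClass("large_export", peak_qps=2, max_result_rows=5_000_000, p99_target_ms=60_000),
    QueryClass("batch_analytics", peak_qps=8, max_result_rows=100_000, p99_target_ms=30_000),
    QueryClass("insert", peak_qps=30, max_result_rows=0, p99_target_ms=2_000),
]


def traffic_mix(classes):
    """Return each class's share of total peak QPS, for replaying a mixed load."""
    total = sum(c.peak_qps for c in classes)
    return {c.name: c.peak_qps / total for c in classes}


mix = traffic_mix(QUERY_CLASSES)
```

The shares are only a starting point for building the mixed workload in step 5. They do not replace measuring the classes together.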

2. Build representative data and traffic

Use production-scale cardinalities, partition counts, part counts, and skew. Run ingestion during the test, and include the parameter values that produce the largest valid scans and aggregation states.

Test both warm and cold cache conditions.

3. Measure query latency and resource usage

For each query class, record:

query_duration_ms
read_rows and read_bytes
memory_usage
peak_threads_usage
ProfileEvents
Result rows and bytes

Use EXPLAIN indexes = 1 to confirm data pruning and EXPLAIN PIPELINE to inspect parallelism.

4. Increase load with `clickhouse-benchmark`

Use clickhouse-benchmark with --concurrency or --max_concurrency to increase parallel load.

Find the point where one of these conditions occurs:

p95 or p99 latency exceeds the target
Throughput stops increasing
Memory pressure or query failures appear
CPU wait grows sharply
Insert latency rises
Background merges fall behind

If the deployment uses an admission limit, set it below the measured overload point with enough headroom for traffic variation and background work.
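The search for the overload point can be sketched as a small post-processing step over benchmark results. This is a minimal sketch under assumptions: `LoadStep`, `max_sustainable_concurrency`, `admission_limit`, the 5% throughput-gain threshold, and the 20% headroom are all hypothetical choices for illustration, and the sample numbers are invented rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadStep:
    concurrency: int
    qps: float
    p99_ms: float
    failures: int = 0


def max_sustainable_concurrency(steps, p99_target_ms, min_qps_gain=0.05):
    """Return the highest tested concurrency before the first overload signal."""
    best = None
    prev_qps = None
    for step in sorted(steps, key=lambda s: s.concurrency):
        # Overload signal: query failures or tail latency above the target.
        if step.failures > 0 or step.p99_ms > p99_target_ms:
            break
        # Overload signal: throughput no longer grows meaningfully.
        if prev_qps is not None and step.qps < prev_qps * (1 + min_qps_gain):
            break
        best = step.concurrency
        prev_qps = step.qps
    return best


def admission_limit(sustainable, headroom=0.2):
    """Place the limit below the measured point, keeping a fraction as headroom."""
    return max(1, int(sustainable * (1 - headroom)))


# Invented sample results, one entry per concurrency level.
steps = [
    LoadStep(8, qps=400, p99_ms=120),
    LoadStep(16, qps=780, p99_ms=180),
    LoadStep(32, qps=1400, p99_ms=310),
    LoadStep(64, qps=1450, p99_ms=620),
    LoadStep(128, qps=1430, p99_ms=1400),
]
sustainable = max_sustainable_concurrency(steps, p99_target_ms=500)
limit = admission_limit(sustainable)
```

The sketch covers only the first three conditions in the list. CPU wait, insert latency, and merge backlog need their own thresholds from server metrics.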

5. Test the production workload mix

Do not size each query class in isolation and add the results. Run the production traffic mix, including ingestion, scheduled jobs, and exports.

Repeat the test with one replica unavailable. A high-availability deployment must meet its minimum service objective during a failure, not only when every replica is healthy.
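The failure requirement translates into simple arithmetic once a per-replica figure has been measured under the mixed workload. The sketch below assumes throughput scales roughly linearly across replicas and that traffic is routed evenly, which the retest is meant to confirm. The function name and the sample numbers are hypothetical.

```python
import math


def replicas_for_objective(objective_qps, per_replica_qps, tolerated_failures=1):
    """Replicas needed so the objective still holds with some replicas down.

    per_replica_qps should come from the mixed-workload benchmark, not from
    summing isolated query-class results.
    """
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    needed_when_degraded = math.ceil(objective_qps / per_replica_qps)
    return needed_when_degraded + tolerated_failures


# Invented example: 1,000 QPS objective, 300 QPS sustained per replica.
replica_count = replicas_for_objective(1000, 300)
```

With these inputs, four replicas carry the objective and the fifth covers a single failure. Treat the result as a starting hypothesis for the one-replica-down test, not as a substitute for it.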

6. Configure limits, queues, and workloads

Apply these settings based on the test results:

Set per-class max_threads, memory, result, and server-side execution guards.
Set user and server admission limits.
Configure workloads for interactive and batch traffic.
Set queue length and overload behavior for query-slot scheduling.
Set client timeouts, retry behavior, and backpressure.

7. Add replicas and retest capacity

If one replica cannot meet the target after query and schema optimization, add replicas and distribute traffic across them. In ClickHouse Cloud, set vertical autoscaling bounds and configure the replica count required for peak throughput through the console, API, or a scheduled scaling policy.

Retest after each topology change because additional replicas change routing, cache locality, coordination, and failure behavior.

Monitoring and Analyzing Capacity

Use system.processes for active-query state and system.query_log for historical analysis. Both tables are local to the server where they are queried.

A useful node-level report groups client-initiated queries by normalized query hash and compares latency, memory, and thread usage:

-- Latency and resource profile per normalized query shape, last hour, this node only.
SELECT
    normalized_query_hash,
    count() AS executions,
    quantile(0.50)(query_duration_ms) AS p50_ms,
    quantile(0.95)(query_duration_ms) AS p95_ms,
    quantile(0.99)(query_duration_ms) AS p99_ms,
    max(memory_usage) AS max_memory,
    max(peak_threads_usage) AS max_peak_threads,
    sum(read_rows) AS total_read_rows,
    sum(read_bytes) AS total_read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'        -- completed queries only
  AND is_initial_query = 1        -- client-initiated queries, not distributed child queries
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY normalized_query_hash
ORDER BY p99_ms DESC;

The is_initial_query = 1 filter excludes child queries created by distributed execution, giving one record per client-initiated query for request counts and latency. The memory_usage and peak_threads_usage values in those records describe the initiating query process and do not include the separate child-query processes running on remote replicas. For distributed resource analysis, query every replica and correlate initial and child records through initial_query_id. Retain the initial record for end-to-end query latency.
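The correlation step can be sketched in client code once the query_log rows from every replica have been fetched. The example below is a minimal sketch, assuming the rows arrive as dictionaries keyed by the query_log column names used above. The function name and the sample rows are hypothetical, and summing per-process peak memory gives an upper bound on the combined footprint, not a simultaneous peak.

```python
from collections import defaultdict


def summarize_distributed(rows):
    """Group query_log rows by initial_query_id.

    End-to-end latency comes from the initial record only. Memory is summed
    across the initial and child processes as an upper bound.
    """
    summary = defaultdict(
        lambda: {"duration_ms": None, "summed_peak_memory": 0, "processes": 0}
    )
    for row in rows:
        entry = summary[row["initial_query_id"]]
        entry["summed_peak_memory"] += row["memory_usage"]
        entry["processes"] += 1
        if row["is_initial_query"]:
            entry["duration_ms"] = row["query_duration_ms"]
    return dict(summary)


# Invented rows: one client query and two child queries on remote replicas.
rows = [
    {"query_id": "q1", "initial_query_id": "q1", "is_initial_query": 1,
     "query_duration_ms": 420, "memory_usage": 100},
    {"query_id": "c1", "initial_query_id": "q1", "is_initial_query": 0,
     "query_duration_ms": 300, "memory_usage": 250},
    {"query_id": "c2", "initial_query_id": "q1", "is_initial_query": 0,
     "query_duration_ms": 350, "memory_usage": 150},
]
report = summarize_distributed(rows)
```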

In ClickHouse Cloud, query every replica in the current service with clusterAllReplicas('default', merge('system', '^query_log')) in place of system.query_log, and set skip_unavailable_shards=1 so an unavailable replica does not block the report. Within a warehouse, the default cluster covers only the current service. Use clusterAllReplicas('all_groups.default', merge('system', '^query_log')) to query all services in the warehouse.

Workload query-slot waits remain outside system.processes until execution starts. Monitor system.scheduler and its queue_length column for workload queues. This table is also local. Use clusterAllReplicas('default', system.scheduler) for a service-wide view in Cloud, or the all_groups.default cluster for all services in a warehouse.

Monitor capacity signals at the workload level:

Active queries in system.processes
Workload queue depth in system.scheduler
Rejected queries, including TOO_MANY_SIMULTANEOUS_QUERIES and SERVER_OVERLOADED errors
CPU utilization and OS CPU wait
Query memory and total server memory
Merge backlog and active part counts
Insert latency and asynchronous insert failures
Query cache hit rate
Per-replica QPS and latency

Capacity planning is continuous. Data distribution, query parameters, and product usage change after launch. Re-run the benchmark when the schema, query mix, replica size, or service-level objective changes.

ClickHouse concurrency follows the cost of the workload and the resources available to execute it. The deployments described above, with more than 1,000 replicas, hundreds of millions of inserted rows per second, and sub-second dashboards on a platform with 50 million registered users, are not special cases. They are the result of measuring workloads, controlling per-query cost, isolating competing workloads, and scaling compute when throughput requires it.

Configured limits can protect each server or replica at its measured operating boundary. They do not define the capacity of ClickHouse’s execution engine.

Frequently Asked Questions (FAQ)

Is ClickHouse limited to 100 concurrent queries?

No. The server-level max_concurrent_queries default is 0, which means unlimited. A configured value controls admission independently on each server or Cloud replica.

What should `max_threads` be for dashboard queries?

Set it from benchmark results. Test multiple values against the production query mix. Lower values can improve aggregate throughput, while higher values can reduce latency for scans. There is no universal dashboard value.

Does `max_threads=2` mean each query reserves two cores?

No. max_threads sets an upper bound on query-processing parallelism. Actual processing lanes depend on available work and server scheduling.

Does ClickHouse queue queries after reaching `max_concurrent_queries`?

The server-wide max_concurrent_queries limit supports a bounded wait through queue_max_wait_ms. The default for queue_max_wait_ms is 0, which means immediate rejection. With a nonzero value, ClickHouse rejects the query if the configured wait expires. A QUERY resource provides workload-aware query-slot scheduling. Queries wait in the workload queue until capacity becomes available, and once the queue holds max_waiting_queries, further queries are rejected.
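The difference between immediate rejection and a bounded wait can be pictured with a small in-process analogy. This is not how ClickHouse implements admission. It is a sketch using a Python semaphore, and `AdmissionGate` and its parameters are hypothetical names chosen to mirror the settings described above.

```python
import threading


class AdmissionGate:
    """Toy model: a fixed number of slots and an optional bounded wait."""

    def __init__(self, max_concurrent, max_wait_ms=0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_wait_s = max_wait_ms / 1000

    def try_admit(self):
        # A wait of 0 models immediate rejection when no slot is free.
        if self._max_wait_s <= 0:
            return self._slots.acquire(blocking=False)
        # Otherwise wait up to the bound, then give up.
        return self._slots.acquire(timeout=self._max_wait_s)

    def release(self):
        self._slots.release()


gate = AdmissionGate(max_concurrent=1)
first = gate.try_admit()    # takes the only slot
second = gate.try_admit()   # rejected at once, no slot and no wait allowed
gate.release()
third = gate.try_admit()    # admitted after the slot was freed
```

A client facing either behavior still needs its own timeout and retry policy, because a rejected or expired request is returned to the caller.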

How do I prevent ingestion from degrading dashboard latency?

Batch inserts and monitor merge pressure. Use asynchronous inserts when client-side batching is not practical. Use workload scheduling for resource allocation on shared compute, and separate nodes, clusters, or ClickHouse Cloud services when stronger isolation is required.

Do ClickHouse read replicas increase query concurrency?

Yes, when requests are distributed across them. Adding replicas without changing routing does not increase application throughput.

Can the ClickHouse query cache improve concurrency?

It reduces repeated work for identical deterministic queries when cached results are acceptable. Size the deployment to meet its service objective under the expected uncached workload.

Does ClickHouse Cloud automatically add replicas during traffic spikes?

Metric-based autoscaling changes service size vertically within configured bounds. Administrators manage replica count through the console or API. Scheduled scaling can adjust replica count for predictable time periods.

How should I size ClickHouse for concurrent queries?

Benchmark the representative mixed workload at increasing concurrency. Select the largest load that meets the required tail latency with headroom for ingestion, background work, traffic bursts, and replica failure.

AI Decision Traceability for Agent Compliance

Aman Puri — Sat, 27 Jun 2026 15:41:47 +0000

A customer pings you weeks after a piece of content shipped, asking why the agent wrote what it wrote. The agent was right at the time. By the time the question reaches you, the source has moved on. You don't know what the agent saw at the moment it decided.

At Zenith we run a fleet of agents that watch customer product documentation and rewrite derived marketing content as the source evolves. Two years in, "What did the agent see at the moment it decided?" is the question we've engineered every architectural decision around. Most teams shipping stateful agents will hit it. The infrastructure for answering it is what this piece is about.

It is also the question your compliance team will ask. And unlike an engineer who can dig through application logs and piece together a plausible reconstruction, an auditor needs a deterministic answer. They need the exact source artifact the agent read, the exact policy it applied, and the exact reasoning path it followed. "We think this is probably what happened" doesn't satisfy a regulator.

The question is unanswerable on most agent infrastructure because agent state operates across two distinct planes that get conflated: the current transactional state and the decision trace. Conflate them and you'll break production systems. Ignore the trace plane entirely and you'll fail audits.

If your agents mutate state that matters, this isn't optional. You need an immutable audit trail for what they did and why.

Key takeaways

The problem: Most AI agent architectures can't reconstruct what an agent saw or used at the exact moment of a decision. This makes debugging failures and satisfying audit requirements nearly impossible.
The concept: Agent state operates on two planes: the current transactional state (what's true now, stored in a database like Postgres) and the decision trace (the immutable history of why a decision was made). Conflating these planes breaks production systems and forecloses auditability.
The requirement: A decision trace must be an immutable, time-ordered record of provenance. That includes exact source artifact versions, tool arguments, environment responses, policy/prompt versions, and bitemporal timestamps.
The gap: Operational databases overwrite state. Vector stores are flat indexes without provenance. Neither was designed for this workload.
The solution: Purpose-built memory architectures like HydraDB that treat decision traces as a first-class primitive. HydraDB's Git-style versioned temporal graph natively encodes decision traces as part of every state transition, with bitemporality, append-only immutability, and full provenance metadata built into the storage model.
When you need it: A dedicated trace plane is mandatory when agents mutate critical state, have delayed consequences, coordinate across sessions, or must meet compliance and audit requirements.

The two planes of AI agent state: transactional state vs. decision trace

The first plane is the current transactional state. This represents what's true about an entity right now.

When an agent updates a customer's seat count, applies a billing discount, or modifies a marketing asset, that resulting ground truth belongs in a transactional store like Postgres. Operational databases excel at immediate consistency, enforcing referential integrity, and returning the single valid snapshot of the present moment.

The second plane is the decision trace. This represents the exact sequence of contexts, tool invocations, and steps that led to that current state. The trace isn't a snapshot. It's an immutable history of reasoning.

At Zenith we built around separating these planes from day one. Without that separation, teams end up with the final marketing copy stored safely while the context that generated it is overwritten.

In my previous piece, Agents Are Just State Machines, I established that pushing durable state into an operational database solves single-run context failures. The decision trace plane is the necessary next layer for cross-run auditability and multi-agent coordination.

Forcing both planes into a single system creates real performance trade-offs at scale, and forecloses the audit and replay capability the trace plane is built for. An operational database optimized for sub-millisecond point lookups on user records shouldn't also be asked to scan millions of reasoning traces for analytical replay. The two access patterns compete for the same resources, regardless of which database you use.

What is a decision trace in AI agents?

Decision traces are a distinct data primitive. They're not system logs tracking CPU usage. They're not application telemetry measuring endpoint latency. And they're not framework checkpoints designed simply to resume a paused execution node.

Logs capture telemetry. Decision traces capture testimony.

A proper decision trace payload must include strict provenance of the agent's decision. It must record the exact version of the source-of-truth artifact the agent read, the specific arguments passed to its tools, the raw response returned by the environment, and the specific policy or prompt configuration applied at that moment.
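As a rough illustration of that payload, the sketch below models one trace record as an immutable value with a content digest. It is a minimal sketch under assumptions: the field names, the `DecisionTrace` class, and the sample values are hypothetical, and a real system would also need durable append-only storage behind it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DecisionTrace:
    source_artifact_id: str
    source_artifact_version: str   # exact version read, e.g. a content hash
    tool_name: str
    tool_arguments: str            # JSON-encoded arguments passed to the tool
    environment_response: str      # raw response returned by the environment
    policy_version: str            # policy or prompt configuration applied
    recorded_at: str               # ISO 8601 timestamp, UTC

    def digest(self):
        """Stable SHA-256 over all fields, usable to detect later tampering."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Invented example values.
trace = DecisionTrace(
    source_artifact_id="docs/pricing.md",
    source_artifact_version="sha256:placeholder",
    tool_name="fetch_document",
    tool_arguments=json.dumps({"path": "docs/pricing.md"}, sort_keys=True),
    environment_response="<raw document body>",
    policy_version="rewrite-policy-v7",
    recorded_at="2026-06-16T09:30:00+00:00",
)
# Any change to a field produces a different record with a different digest.
amended = replace(trace, policy_version="rewrite-policy-v8")
```

Freezing the record means a correction is a new trace, not an edit to the old one, which matches the immutability the article asks for.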

Without these elements, you can't accurately reconstruct the execution context. And without an accurate reconstruction, you can't satisfy an auditor who asks, "Why did your agent approve this discount for this customer on this date?"

The industry has muddied this requirement with buzzwords. To clarify what a decision trace actually is, we need to differentiate it from abstract concepts:

Concept	Storage engine	Mutability	Primary use case
Knowledge graph	Graph database (Neo4j)	Mutable	Mapping static relationships between entities for retrieval
Event log	System logger / SIEM	Immutable	Infrastructure debugging, error tracking, security auditing
Vector store	Embedding index (Pinecone, Weaviate)	Mutable	Semantic similarity search; no provenance, no temporality
Context graph	Purpose-built memory layer (e.g., HydraDB)	Immutable (append-only)	Organizational context encoding with temporal provenance, semantic search, decision traceability, and cross-run auditability

Knowledge graphs map static entity relationships but are mutable and lack temporal ordering. Vector stores optimize for similarity retrieval without provenance. A context graph built on immutability, bitemporality, and provenance metadata provides the structural foundation for capturing decision traces as a first-class primitive.

To achieve auditability, you need physical decision traces: the observable digital trail of every state transition an agent commits. By securely storing these atomic traces, you lay the concrete foundation for multi-agent coordination, cross-run learning, and regulatory compliance over time.

Why traditional stacks can't provide AI decision traceability

Operational databases: built for current state, not historical reasoning

Postgres is unmatched at immediate consistency, schema constraints, and transactional updates. If you need to know a user's current subscription tier or verify referential integrity between an order and a customer record, you query Postgres.

But operational databases are architecturally opposed to decision trace storage. They're designed to overwrite. When a customer moves from New York to London, Postgres updates the row. The previous state is gone unless you've manually engineered an event-sourcing pattern on top.

You can build bitemporality, append-only event logs, provenance metadata, and retention policies on Postgres. But you're assembling these primitives yourself, on a system whose core abstraction is mutable rows. That assembly cost is the real problem. You own the integration surface, you maintain the custom temporal query layer, you build the retention policies, and you debug the edge cases when bitemporal filters interact with your application logic in ways Postgres was never designed to anticipate.

Vector stores: semantic search without provenance

Vector databases solve retrieval. They don't solve auditability.

A vector store reduces all knowledge to a flat index. HydraDB's research team describes it as "a high-dimensional soup of embeddings where the only retrieval primitive is cosine similarity." There's no temporal ordering, no versioning, no relationship tracking between entities, and no provenance metadata linking a retrieved chunk to the decision it influenced. This is why autonomous agents require a dedicated agent memory layer instead of a stateless vector database.

When an auditor asks "which specific document version did the agent read before generating this output?", a vector store can tell you which chunks were semantically similar to a query. It can't tell you which chunks were actually retrieved during that specific execution, what version they were at that moment, or how they related to the decision payload the agent committed.

How memory layers make AI agent decisions traceable

In my Agents Are Just State Machines piece, I argued that agent memory should be treated as a database problem, not a model problem. The logical extension: decision traces should be a first-class primitive in that memory layer, rather than assembled from components not designed for it.

Purpose-built agent memory architectures implement the decision trace plane natively, rather than requiring teams to assemble it from infrastructure components that weren't designed for the workload.

HydraDB isn't alone in this category. Zep's Graphiti implements a temporal knowledge graph with valid_at and invalid_at markers. Mem0 optimizes for token-efficient memory with single-pass extraction. Letta takes an LLM-managed memory approach. Each addresses a piece of the agent memory problem. For a deeper comparison, see our guide to Mem0 and Zep alternatives. For the decision trace use case specifically, where you need append-only immutability, bitemporality on every edge, and provenance metadata captured at commit time, HydraDB's architecture is the most direct fit I've seen.

Immutable, append-only state transitions

HydraDB implements what it calls a Git-Style Versioned Temporal Graph. The core model is an append-only, immutable edge-based knowledge graph where every state change is committed as a new edge, never overwritten.

If a user moves from New York to London, HydraDB doesn't update a row. It commits a new edge with fresh temporal metadata. The previous state remains queryable. This guarantees zero data loss and enables queries that are impossible in systems that destructively resolve state: "What places did I visit last year?" or "From where and why did I make a career switch?"

For compliance, this means every historical state is preserved exactly as it existed at the time the agent made its decision. No reconstruction required. No forensic log-stitching. The trace is the storage model.

Provenance metadata on every edge

Each edge in HydraDB's graph carries a tuple of (semantic_relation, t_commit, t_valid, C_meta):

semantic_relation: the typed relationship (WORKS_AT, PREFERS, CAUSED_BY, BLOCKED_BY)
t_commit: the ingestion timestamp (when the system recorded the fact)
t_valid: the extracted temporal validity (when the fact was actually true in the real world)
C_meta: auxiliary metadata preserving the reasoning context, sentiment, and situational factors surrounding the transition

That C_meta field is doing the heavy lifting for auditability. HydraDB records not merely that a user changed their preference. It records why they changed it, what alternatives were considered, and what outcome they were optimizing for. This is the provenance chain an auditor needs.
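To make the tuple concrete, here is a toy in-memory model of an append-only edge log. It is not HydraDB's API or storage format. `Edge`, `AppendOnlyGraph`, the `LIVES_IN` relation, and the sample data are hypothetical, chosen to mirror the New York to London example from earlier in the article.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Edge:
    source: str
    semantic_relation: str
    target: str
    t_commit: datetime          # when the system recorded the fact
    t_valid: datetime           # when the fact became true in the world
    c_meta: Mapping[str, str]   # reasoning context around the transition

    def __post_init__(self):
        # Freeze the metadata so a committed edge cannot be edited in place.
        object.__setattr__(self, "c_meta", MappingProxyType(dict(self.c_meta)))


class AppendOnlyGraph:
    def __init__(self):
        self._edges = []

    def commit(self, edge):
        """Record a new edge. Existing edges are never updated or removed."""
        self._edges.append(edge)

    def history(self, source, semantic_relation):
        """All edges for a source and relation, in commit order."""
        return tuple(
            e for e in self._edges
            if e.source == source and e.semantic_relation == semantic_relation
        )


utc = timezone.utc
graph = AppendOnlyGraph()
graph.commit(Edge("user-42", "LIVES_IN", "New York",
                  t_commit=datetime(2024, 3, 1, tzinfo=utc),
                  t_valid=datetime(2024, 3, 1, tzinfo=utc),
                  c_meta={"reason": "initial profile"}))
graph.commit(Edge("user-42", "LIVES_IN", "London",
                  t_commit=datetime(2026, 6, 16, tzinfo=utc),
                  t_valid=datetime(2026, 5, 15, tzinfo=utc),
                  c_meta={"reason": "relocated for a new role"}))
history = graph.history("user-42", "LIVES_IN")
```

Both locations remain in `history`, so the earlier state stays queryable after the move.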

Deterministic, multi-hop decision lineage

Because entities and relationships are first-class graph primitives, HydraDB enables deterministic, multi-hop traversal that traces causal chains across the full decision history.

Consider a query like "Why is the authentication service behaving differently since last month?" HydraDB's graph can traverse auth-service → DEPENDS_ON → user-db → MODIFIED_BY → migration-v2 → AUTHORED_BY → alice → CAUSED_BY → schema-change-ticket, recovering the full causal chain without any of these hops being co-located in embedding space.

A vector store would need all of those facts to appear in semantically similar chunks. A relational database would need them manually joined across tables. The graph makes distant but causally connected facts retrievable as a native operation.
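The traversal in that example can be sketched over a plain list of typed edges. This is an illustration of deterministic multi-hop lookup in general, not HydraDB's query interface. `trace_lineage` is a hypothetical helper, and the edges are copied from the chain above.

```python
EDGES = [
    ("auth-service", "DEPENDS_ON", "user-db"),
    ("user-db", "MODIFIED_BY", "migration-v2"),
    ("migration-v2", "AUTHORED_BY", "alice"),
    ("alice", "CAUSED_BY", "schema-change-ticket"),
]


def trace_lineage(edges, start, relations):
    """Follow one typed relation per hop and return the nodes visited.

    Raises LookupError when a hop is missing or ambiguous, so the result is
    either one exact chain or an explicit failure.
    """
    index = {}
    for source, relation, target in edges:
        index.setdefault((source, relation), []).append(target)

    path = [start]
    node = start
    for relation in relations:
        targets = index.get((node, relation), [])
        if len(targets) != 1:
            raise LookupError(f"{node} has {len(targets)} {relation} edges")
        node = targets[0]
        path.append(node)
    return path


lineage = trace_lineage(
    EDGES,
    "auth-service",
    ["DEPENDS_ON", "MODIFIED_BY", "AUTHORED_BY", "CAUSED_BY"],
)
```

Each hop is an exact key lookup, so the chain does not depend on any of these facts being similar to one another in embedding space.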

For audit purposes, this means you can trace any agent decision back through its full dependency lineage. Not "the agent probably read something about the auth service." The specific chain of state transitions that led to the output.

Graph-derived inferences with traceable reasoning

HydraDB can synthesize conclusions from the graph's topology, independent of any single retrieved chunk. If an agent observes edges like user → REJECTED → cloud-vendor-A, user → REJECTED → cloud-vendor-B, user → OPTIMIZES_FOR → data-sovereignty, the system infers a vendor preference that was never explicitly stated.

For compliance, these inferences are traceable. You can point to the specific edges that generated the conclusion. The reasoning path is deterministic and auditable, unlike a black-box LLM output where you can't reconstruct which retrieved context influenced the generation.

Where HydraDB is today vs. where it's headed

The temporal graph captures both system time and valid time per edge, but a SQL-like queryable interface for bitemporal axes (like XTDB's FOR VALID_TIME AS OF) isn't exposed in public docs yet. The graph provides relational context at read time but doesn't enforce relational constraints at write time. ACID-style isolation levels and commit-time MVCC are on the roadmap. The append-only temporal substrate is production-grade. The full database-grade query semantics are still maturing.

Why AI agent compliance requires two time axes, not one

Most of the audit failures I've seen come down to one question: what did the agent believe was true at the moment it decided?

This is a bitemporality problem. You need two distinct time axes:

System time (t_commit): the exact millisecond the trace was recorded by the infrastructure. When the system learned the fact.

Valid time (t_valid): the temporal context the agent assumed was true about the world when it made its decision. When the fact was actually true in reality.

These two clocks diverge constantly in production. A customer tells your agent on Tuesday that they moved to London last month. The system time is Tuesday. The valid time is last month. If another agent needs to reconstruct what was true about that customer's location as of three weeks ago, it needs both axes to get the right answer.

HydraDB implements bitemporality as a first-class primitive on every graph edge. Every state transition carries both timestamps natively. You don't schema it yourself. You don't build a custom temporal query layer on top of Postgres. The storage model enforces it.

This is what makes the "as-of context replay" query pattern work. When you need to reconstruct the exact source-of-truth state that existed at the specific millisecond an agent made its decision, you filter on both t_commit and t_valid. Even if another process subsequently overwrote the underlying operational data, the trace preserves the agent's exact viewpoint.
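The as-of pattern can be sketched with two filters over a list of bitemporal facts. This is a minimal sketch, not HydraDB's query syntax. `LocationFact` and `as_of` are hypothetical names, and the dates are invented to match the customer example above, where the move is reported on a Tuesday but took effect the month before.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LocationFact(NamedTuple):
    city: str
    t_valid: datetime    # when the fact became true in the world
    t_commit: datetime   # when the system recorded it


def as_of(facts, valid_at, known_at):
    """Latest fact valid at `valid_at`, using only what was recorded by `known_at`."""
    visible = [
        f for f in facts
        if f.t_commit <= known_at and f.t_valid <= valid_at
    ]
    if not visible:
        return None
    return max(visible, key=lambda f: (f.t_valid, f.t_commit))


utc = timezone.utc
facts = [
    LocationFact("New York", datetime(2024, 3, 1, tzinfo=utc), datetime(2024, 3, 1, tzinfo=utc)),
    # Reported on Tuesday 16 June 2026, valid since 15 May 2026.
    LocationFact("London", datetime(2026, 5, 15, tzinfo=utc), datetime(2026, 6, 16, tzinfo=utc)),
]

three_weeks_ago = datetime(2026, 5, 26, tzinfo=utc)
# What an agent deciding on 15 June believed about 26 May.
believed = as_of(facts, valid_at=three_weeks_ago, known_at=datetime(2026, 6, 15, tzinfo=utc))
# What is known about 26 May after the customer's report was recorded.
corrected = as_of(facts, valid_at=three_weeks_ago, known_at=datetime(2026, 6, 17, tzinfo=utc))
```

The same valid-time question returns New York or London depending on the system-time cutoff. That difference is exactly what an auditor needs to see when asking what the agent believed at the moment it decided.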

Knowing what the agent did is the snapshot. Knowing what the agent saw is the trace. HydraDB stores both.

From black-box LLM outputs to explainable AI agent decisions

The enterprise adoption barrier for AI agents isn't capability. It's explainability.

Executives refuse to rely on AI agent outputs for business decisions because the reasoning is opaque. The agent says "approve this discount" or "escalate this ticket" or "rewrite this paragraph," and nobody can trace why. The output looks confident. The provenance is invisible.

This is the gap between "what is the current status" (which most agent architectures handle well) and "how did we get here" or "what decision led to this outcome" (which most architectures can't answer at all).

Purpose-built memory layers with native decision traces close this gap by making every generated insight explainable and fully traceable to the source data. The reasoning chain isn't reconstructed from fragmented application telemetry after the fact. It's captured at commit time as a structural property of the storage model.

HydraDB's benchmark results bear this out in the dimensions that matter most for auditability. On the LongMemEval-s benchmark (Wu et al. 2025, ICLR 2025, 500 question-conversation stacks averaging over 115,000 tokens each), HydraDB scored 97.43% on knowledge updates (correctly distinguishing current from historical state) and 90.97% on temporal reasoning (accurately preserving and reasoning over the chronology of stored information). The overall accuracy of 90.79% represents a 5-point improvement over the next strongest system and a 30-point gain over full-context baselines.

These aren't retrieval benchmarks. They're state-correctness benchmarks. They measure whether the system can tell you what was true at a specific point in time and what changed since then. That's exactly what compliance requires.

When do AI agents need decision traceability?

Not every agent application requires a dedicated trace plane from day one.

If your agents perform stateless retrieval, simple text classification, or internal semantic search against static documentation, Postgres alone is sufficient. You can handle standard application logging and push current state updates without introducing the complexity of a secondary storage layer.

But you reach the tipping point when your agents begin to mutate critical state, generate delayed consequences, or coordinate across multiple independent sessions. (For a broader checklist, see 7 signs your AI agent needs a memory layer.)

Agents mutate state that matters. Once an agent dictates billing logic, modifies customer-facing assets, or executes multi-step workflows, you need an immutable record of its logic. If an agent approves a transaction today but the downstream impact isn't visible until next month's billing cycle, a snapshot of the current database won't help you understand why.

Delayed consequences require historical context. When our customer pinged us about that outdated blog post three weeks later, the source document had moved on. The trace had to live somewhere that captured it at commit time.

Multi-agent coordination requires shared provenance. When a secondary agent needs to know why a primary agent escalated a ticket two seconds ago, the trace must be immediately queryable.

Regulated industries require deterministic auditability. Finance, healthcare, and enterprise software operate under strict auditability standards that are difficult to meet using overwritten operational state alone. If an auditor asks why a pricing algorithm executed a specific trade or approved a discount, producing the chronological reasoning trace lets teams address these inquiries transparently.

When evaluating this architectural decision, compare the cost of engineering delay during incident response against the infrastructure cost of a purpose-built trace layer. If a bad agent decision takes your senior engineering team three days to untangle because they have to manually reconstruct overwritten context logs from fragmented application telemetry, the cost of a single incident far exceeds the infrastructure investment. A dedicated trace plane turns auditability from an operational headache into a solvable query.

Decision traceability is an infrastructure problem

Agent failures often trace back to architecture. When the current operational state and the historical reasoning trace live in the same store, the context that generated a decision gets overwritten by the next one. You can't debug what you can't reconstruct. And you can't pass an audit on reconstructions.

A deliberate two-plane architecture aligns infrastructure with workload. Operational databases handle current transactional state, where they excel. Purpose-built memory layers like HydraDB handle the decision trace plane, where append-only immutability, bitemporality, and graph-based provenance tracking are native to the storage model rather than assembled on top of it.

The difference between assembling decision trace infrastructure yourself and using a purpose-built memory layer is the same as the difference between building your own transactional database and using Postgres. You can do it. You probably shouldn't. The primitives (immutable append-only edges, bitemporal timestamps, typed semantic relationships, contextual metadata on every state transition) need to work together as a coherent system, not as independent components wired together with custom middleware.

Don't throw away your decision traces. The next decision your agents commit should be one you can replay, explain, and defend under audit.

Frequently asked questions

What is a decision trace for AI agents?

An immutable, time-ordered record of an agent's execution context: what it read, which tools it called (with inputs and outputs), which policy or prompt version it used, and what decision it committed. Unlike system logs (which track infrastructure behavior) or framework checkpoints (which enable execution replay), decision traces capture the provenance needed to reconstruct why an agent made a specific decision.

How is a decision trace different from logs, telemetry, or observability?

Logs and telemetry focus on system behavior: errors, latency, CPU utilization. Observability platforms like Datadog tell you that an agent failed. A decision trace tells you why it made the decision it did, even when it didn't fail. The distinction matters for compliance: an auditor doesn't ask "did the agent error out?" They ask "what information drove this specific output?"

Why can't I store decision traces in Postgres alongside current state?

At high volume, append-only traces introduce write contention, index bloat, and expensive scans that compete with the transactional workload Postgres is optimized for. More fundamentally, Postgres is designed to overwrite state. Building bitemporality, append-only event sourcing, and retention policies on top of it means assembling the trace plane yourself in a system optimized for a different access pattern.

Why can't I use a vector database for decision traceability?

Vector stores solve semantic retrieval, not provenance. They can tell you which chunks are similar to a query. They can't tell you which chunks were retrieved during a specific execution, what version those chunks were at that moment, or how they causally relate to the decision the agent committed. There's no temporal ordering, no relationship tracking, and no guarantee of immutability.

What is bitemporality and why does it matter for compliance?

Bitemporality separates two time axes: system time (when the trace was recorded) and valid time (when the fact was actually true in the world). An agent might learn on Tuesday that a customer moved to London last month. System time is Tuesday. Valid time is last month. Storing both lets you replay decisions accurately even when underlying operational data changes later. That's exactly what an auditor needs.
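
The London example can be sketched as a query over both time axes. This is an illustrative model only, not any vendor's schema. The `Fact` fields, the `as_of` helper, the earlier Paris address, and the specific dates are all assumptions chosen to match the "learned on Tuesday, true since last month" scenario.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: str
    valid_from: date    # valid time: when the fact became true in the world
    recorded_on: date   # system time: when the store learned about it


def as_of(facts: list[Fact], entity: str, attribute: str,
          valid_on: date, known_on: date) -> Optional[str]:
    """What did we believe on `known_on` about the world as it was on `valid_on`?"""
    candidates = [
        f for f in facts
        if f.entity == entity and f.attribute == attribute
        and f.valid_from <= valid_on      # already true in the world
        and f.recorded_on <= known_on     # already known to the system
    ]
    if not candidates:
        return None
    # The most recent valid_from wins; later recordings break ties.
    return max(candidates, key=lambda f: (f.valid_from, f.recorded_on)).value


facts = [
    Fact("customer-42", "city", "Paris", date(2025, 1, 1), date(2025, 1, 1)),
    # Learned on Tuesday 9 June 2026 that the move happened on 10 May 2026.
    Fact("customer-42", "city", "London", date(2026, 5, 10), date(2026, 6, 9)),
]

# Replaying a decision made on 1 June: the system still believed Paris.
print(as_of(facts, "customer-42", "city", date(2026, 5, 20), date(2026, 6, 1)))  # Paris
# Same world date, but with what is known after Tuesday's update.
print(as_of(facts, "customer-42", "city", date(2026, 5, 20), date(2026, 6, 9)))  # London
```

The first query is the one an auditor cares about: it reproduces what the agent could have known at decision time, even though the record was corrected later.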

How does HydraDB provide native decision traceability?

HydraDB implements a Git-style versioned temporal graph where every state change is committed as a new immutable edge carrying bitemporal timestamps and contextual metadata. The append-only model guarantees zero data loss. The graph structure enables deterministic, multi-hop traversal of decision lineage. And the C_meta field on every edge preserves the reasoning context, sentiment, and situational factors surrounding each state transition.

When do I need a dedicated decision trace plane?

When agents mutate important state, make decisions with delayed consequences, or coordinate across sessions or agents, or when you face audit and compliance requirements. For simple stateless retrieval or classification, a single operational database plus standard logging is usually enough. The tipping point is when you can't afford to lose the reasoning context behind a decision.

What fields must a decision trace include for reliable provenance?

At minimum: entity and trace identifiers, system time, valid time (the agent's assumed world time), source artifact ID and version, tool name and inputs, tool or environment response, decision payload, and policy or prompt version. Without these elements, you can't accurately reconstruct the execution context.
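
As a rough illustration, the minimum field list above could map onto a record type like the one below. The field names and the `missing_fields` helper are hypothetical, chosen only to mirror the list; they are not a standard schema or any product's format.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class DecisionTrace:
    trace_id: str
    entity_id: str
    system_time: datetime          # when the trace was recorded
    valid_time: datetime           # the agent's assumed world time
    source_artifact_id: str
    source_artifact_version: str
    tool_name: str
    tool_inputs: dict[str, Any]
    tool_response: Any             # tool or environment response
    decision_payload: dict[str, Any]
    policy_version: str            # policy or prompt version


REQUIRED = tuple(f.name for f in fields(DecisionTrace))


def missing_fields(raw: dict[str, Any]) -> list[str]:
    """List required fields that are absent or None in a raw trace payload."""
    return [name for name in REQUIRED if raw.get(name) is None]
```

Running `missing_fields` at write time is one way to reject incomplete traces before they reach storage, since a trace with a missing policy version or artifact version cannot be fully replayed later.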

How does a purpose-built memory layer differ from assembling trace infrastructure myself?

You can build bitemporality, event sourcing, retention policies, and graph-based provenance on top of Postgres, Kafka, and a columnar store. But you're wiring together independent components with custom middleware, and you own the integration surface. A purpose-built layer like HydraDB ships these primitives as a coherent system: immutable append-only edges, bitemporal timestamps, typed semantic relationships, and contextual metadata on every state transition, all working together natively.

5 Best Time-Aware Memory Layers for Long-Term AI Agents (2026 Guide)

Aman Puri — Sat, 27 Jun 2026 15:41:19 +0000

Moving from stateless LLM interactions to long-horizon autonomous agents has exposed some serious cracks in standard RAG architectures. For agents that need to operate over months or years, time isn't just metadata. It's the product.

Standard RAG and naive vector append systems treat memory as a flat, chronology-agnostic blob. You end up with unlinked records floating in vector space. They completely fall apart when an agent needs to answer the foundational question: "What was true when?"

Developers often assume that million-token context windows make specialized memory infrastructure unnecessary. If the full session history fits in the context window, why not just inject it and let the model's attention sort out the timeline?

The ICLR 2025 LongMemEval benchmark shows why. Commercial chat assistants and long-context models suffer a 30 percent accuracy drop when memorizing information across sustained multi-session interactions, even when the entire history fits in the context window. External memory routing, temporal invalidation, and deterministic state management are still required for production systems.

Key takeaways

Time-aware memory layers help long-term AI agents answer "what was true when" using validity windows or bitemporal modeling. Flat RAG and naive vector append systems still fail at this over long horizons, even with million-token context windows.
Pick HydraDB if you want a unified context and memory layer with built-in bitemporal modeling, multi-signal retrieval, and structured ingestion, without running an external graph database.
Pick Zep (Graphiti) if temporal correctness and auditability are top priorities, and you can operate a graph DB backend.
Pick Mem0 for the easiest managed integration when deep point-in-time audits aren't required, Letta if you want an agent framework that manages its own tiered memory, and Supermemory for lightweight recency-aware context.
Benchmarks cited are vendor-reported. Validate against your data, latency, and compliance needs.

How to interpret AI agent memory benchmarks in 2026

Treat self-reported metrics in the agent memory space with skepticism. The category is in a benchmark arms race right now, filled with self-published figures that lack independent replication or standardized testing methodologies.

You'll frequently see vendors claiming near-perfect retrieval statistics on their landing pages. Systems like OMEGA advertise a 95.4% score on LongMemEval. Frameworks like Evermind's EverMemOS claim approximately 93% on LoCoMo. Without rigorous independent verification, these numbers often reflect optimized, over-fitted test runs rather than expected production performance across real, messy data streams.

The volatility here is well-documented. When Zep initially published its performance metrics, it claimed an impressive 84% on LoCoMo. After a re-evaluation by competing vendor Mem0 restricted accuracy to the first four validated LoCoMo categories and averaged results across ten independent runs, that figure dropped to 58.44%. Both vendors have since contested each other's methodology.

This benchmark volatility exists because resolving conflicting state changes over time is significantly more complex than standard semantic similarity matching. Since independent, peer-reviewed evaluation frameworks for bitemporal AI memory are still maturing, every metric in this evaluation must be read as vendor-reported. That includes HydraDB's vendor-reported 90.79% overall accuracy on LongMemEval-s and sub-200 millisecond retrieval latency. HydraDB stands by its internal testing methodologies and architectural design. Technical buyers should validate all performance claims from any vendor against their specific production payloads and query patterns before committing to an architecture.

Evaluation criteria for time-aware memory layers in 2026

Temporal modeling depth (valid time vs system time)

The system must go beyond simple insertion timestamps. True temporal modeling requires validity windows attached to facts.

Bitemporal modeling tracks both system time, when the database recorded the fact, and valid time, when the fact was true in the real world. This approach handles supersession without resorting to destructive overwrites or false-positive deletions.

Memory lifecycle operations (ingestion, revision, forgetting, retrieval)

As defined in the 2026 academic paper Is Agent Memory a Database?, agent memory should be evaluated through the Governed Evolving Memory framework. This framework demands that memory systems replace record-level operations with state-level operators: ingestion, revision, forgetting, and retrieval.

Consider a user stating "I love the JS framework Next.js" in 2025, and later stating "I love the JS framework Angular" in 2026. A naive system relying on LLM-resolved deletion might purge the 2025 preference entirely. A continuous state trajectory preserves the historical nuance, recognizing that both facts coexist across different validity windows.
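
The Next.js and Angular example can be sketched with validity windows. This is a simplified illustration, not the Governed Evolving Memory framework's formal operators. The `Preference` type, the `supersede` and `value_on` helpers, and the exact dates within each year are assumptions.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Preference:
    subject: str
    value: str
    valid_from: date
    valid_until: Optional[date] = None   # None means "still current"


def supersede(history: list[Preference], new: Preference) -> list[Preference]:
    """Revision without deletion: close the open window, then append the new fact."""
    updated = [
        replace(p, valid_until=new.valid_from)
        if p.subject == new.subject and p.valid_until is None else p
        for p in history
    ]
    return updated + [new]


def value_on(history: list[Preference], subject: str, day: date) -> Optional[str]:
    """Return the value whose validity window contains `day`, if any."""
    for p in history:
        if (p.subject == subject and p.valid_from <= day
                and (p.valid_until is None or day < p.valid_until)):
            return p.value
    return None


history: list[Preference] = []
history = supersede(history, Preference("favorite_js_framework", "Next.js", date(2025, 1, 1)))
history = supersede(history, Preference("favorite_js_framework", "Angular", date(2026, 1, 1)))

print(value_on(history, "favorite_js_framework", date(2025, 6, 1)))  # Next.js
print(value_on(history, "favorite_js_framework", date(2026, 6, 1)))  # Angular
```

Both records survive the revision, so the 2025 preference stays queryable instead of being purged.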

Operational footprint and dependencies (graph DBs, structured output)

Evaluate whether the platform is a unified runtime or requires standing up separate graph databases.

Extracting bitemporal graphs often requires reliable structured output generation. Smaller models frequently cause schema extraction failures, so ingestion often has to be routed through premium models like GPT-4o or Gemini 3.0 Pro to prevent memory graph corruption.

Security and deployment requirements (SOC 2, HIPAA, BYOC)

For B2B deployments, the memory layer must support multi-tenant and sub-tenant data isolation. Technical buyers need to verify if the vendor provides SOC 2 Type 2, HIPAA compliance, Bring Your Own Cloud capabilities, or air-gapped deployments to meet enterprise governance standards.

Integrations and framework compatibility (LangChain, LangGraph, LlamaIndex, CrewAI)

The layer must align with your engineering stack, whether that's operating as a standalone API, providing native SDKs, or coupling with frameworks like LangChain, LangGraph, LlamaIndex, or CrewAI.

Quick comparison of time-aware memory layers (at-a-glance)

Platform name	Best for	Temporal approach	Operational footprint	Security & deployment	Framework compatibility	Key differentiator
HydraDB	Unified context and memory layer without external DB dependencies	Bitemporal Graph (System + Valid Time)	Unified runtime, standalone API, internal DB engine	RBAC, SSO, Multi-tenant isolation, Managed Cloud, BYOC	Standalone API/SDK, LangChain, LlamaIndex, CrewAI	Graph-native temporal memory with multi-signal retrieval and Sliding Window Inference Pipeline
Zep (Graphiti)	Temporal correctness and rigorous historical invalidation	Validity Windows (valid_at, invalid_at)	Requires external graph DB (Neo4j, FalkorDB, Kuzu)	SOC 2 Type 2, HIPAA, ABAC, Managed Cloud	API/SDK, LangChain, LangGraph, LlamaIndex, CrewAI	Apache 2.0 Graphiti core; rigorous invalidation
Mem0	Broad integration with a mature managed cloud offering	Hierarchical distillation and semantic traces	Drop-in managed service, lightweight local footprint	Kubernetes, air-gapped, zero-trust enterprise options	SDKs, LangChain, LlamaIndex, CrewAI	High-speed memory compression engine
Letta	Autonomous agents managing their own tiered memory lifecycles	Tiered OS-style virtual context management	Tightly coupled framework runtime	Open-source self-hosted, Letta Cloud hosted options	Letta-native only; limited external framework drop-in	Agent-managed memory blocks and sleep-time compute
Supermemory	Fast, lightweight recency-aware memory via dynamic merging	Dynamic fact merging and semantic traces	Managed context cloud, low operational overhead	VPC, On-premise enterprise tiers	API, native connectors, browser extractors	POSIX-style filesystem context mounting

## HydraDB: time-aware memory layer for long-term agents

Best for: unified context and memory infrastructure without external graph databases

Engineering teams building personalized agents, copilots, or internal company brains that require unified memory and knowledge within a single runtime. HydraDB is the right choice when your application must answer "what was true when" and resolve implicit context references via the Sliding Window Inference Pipeline, but your infrastructure team wants to avoid standing up, tuning, and maintaining a separate graph database backend.

Overview: how HydraDB models valid time and system time

HydraDB is a graph-native context and memory infrastructure layer. Under the hood, it combines a Git-style append-only temporal graph, a vector database, and a database supporting standard B-tree indexes.

HydraDB isn't merely "Git for vector indexes." It operates as a single, fully managed or self-hostable API that exposes user memories, organizational knowledge, and graph-enriched retrieval over a continuously changing state.

Key features: bitemporal graph, sliding window inference, unified retrieval

Git-style versioned temporal graph: HydraDB uses an append-only temporal graph that tracks both commit time and valid time. This preserves the complete historical decision tree, allowing agents to query what was true at any specific historical point. Unlike flat vector databases, destructive overwrites are entirely eliminated.
Sliding window inference pipeline: To solve the prevalent "meaningless chunk" problem in standard RAG, HydraDB resolves entities, pronouns, shifting user preferences, and implicit references using surrounding context before committing data to storage.
Multi-signal, context-aware retrieval: The platform combines semantic similarity, sparse keyword matching, dense vector retrieval, metadata filtering, graph traversal, entity-based search, chunk-level graph expansion, and reranking into a single retrieval layer. You don't need to wire together separate instances of Neo4j, PostgreSQL, and Pinecone to achieve temporal and contextual awareness.
Multi-tenant and shared organizational memory: The system implements explicit multi-tenant and sub-tenant data isolation, operating alongside shared organizational knowledge pools for enterprise security compliance.
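
This article does not describe how HydraDB fuses its retrieval signals internally. As a generic illustration of how several ranked signals can be merged into one result list, here is a minimal sketch using reciprocal rank fusion, a common technique swapped in purely for illustration. The signal names, document IDs, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Merge per-signal rankings; each signal adds 1 / (k + rank) per document."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


rankings = {
    "dense_vector": ["a", "b", "c"],
    "sparse_keyword": ["b", "a", "d"],
    "graph_traversal": ["b", "c"],
}
print(reciprocal_rank_fusion(rankings))  # ['b', 'a', 'c', 'd']
```

Rank-based fusion avoids having to normalize raw scores across signals that use incompatible scales, which is one reason it is a common baseline before a dedicated reranker.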

Pros: temporal correctness with low operational overhead

Eliminates the multi-system DIY stack typically required for temporal memory, combining vector, graph, relational, and custom memory logic into a single runtime.
Features optimized, isolated read and write paths that ensure stable multi-signal retrieval. Vendor-reported results show 90.79 percent overall accuracy on LongMemEval-s and 90.97 percent on the temporal reasoning sub-task, with sub-200 millisecond retrieval latency.

Cons: architectural shift and roadmap database features

As a unified runtime, HydraDB's internal graph, vector, and relational components are not independently swappable. Teams accustomed to choosing best-of-breed components for each layer should weigh the operational simplicity against reduced component-level flexibility.
While the core graph is resilient, advanced transactional database features, such as ACID-style isolation levels and commit-time multi-version concurrency control, are currently roadmap items.

Pricing and deployment options (managed cloud vs self-hosted)

HydraDB is available as a managed cloud offering for scaling or as a BYOC deployment running inside the customer's AWS VPC for environments with strict deployment control requirements.

## Zep (Graphiti): temporal knowledge graph with validity windows

Best for: maximum temporal correctness with external graph DB support

Engineering teams where temporal correctness is the non-negotiable priority for agent behavior. Zep is ideal when your organization already has the infrastructure resources to run the necessary external graph databases, and when your team is comfortable navigating the rapid deprecation cadences inherent to early-stage open-source tools.

Overview: how Zep invalidates facts over time

Zep, powered by the Apache 2.0 open-source Graphiti engine, is a temporal knowledge graph designed for long-term agent memory. Zep is among the strongest architectures for temporal correctness currently available.

Instead of deleting old or contradictory facts when new information arrives, Zep invalidates them, while the complete historical memory graph remains auditable and structurally intact.

Key features: validity windows, provenance, governance

Validity windows: The system attaches exact valid_at and invalid_at timestamps to every node and edge within the graph, preventing temporal hallucination during retrieval.
Auditable history: Every extracted fact traces back to a specific conversational episode or document ingestion event. This ensures strict data provenance and provides a deterministic method for handling direct contradictions.
Context lake abstraction: Zep orchestrates millions of isolated temporal graphs as a single governed enterprise system, simplifying massive-scale agent deployments.
Managed governance: The Zep Cloud tier handles attribute-based access control, strict data retention policies, and compliance audit logs natively at the infrastructure substrate level.

Pros: rigorous invalidation and auditability

Strong performance on temporal resolution tasks relative to generalized memory layers. Vendor benchmarks in the Zep paper report a 63.8% overall score on LongMemEval.
Uncompromising approach to temporal invalidation rather than destructive data overwrites.
Ships with out-of-the-box SOC 2 Type 2 and HIPAA certifications for the managed cloud offering.

Cons: operational footprint and deprecation risk

Imposes a heavy operational footprint. The open-source Graphiti core can't run in isolation and requires a separate, dedicated graph database backend (Neo4j, FalkorDB, or Kuzu) to function.
Zep Community Edition was deprecated in April 2025, with further feature retirements in February 2026. Teams seeking a self-hosted route must manually configure and manage the raw Graphiti engine.

Pricing (open-source core vs managed cloud)

The core Graphiti engine is free to self-host. The managed Zep Cloud platform starts at $125 per month on a credit-based model.

## Mem0: managed agent memory with compression (limited temporal auditing)

Best for: fast integration and context compression

Teams prioritizing the easiest broad-integration path paired with mature managed-cloud polish. Mem0 is the right architectural choice when fast deployment and general context reduction are critical, but deep temporal reasoning, bitemporal supersession, and point-in-time historical audits aren't your primary operational requirements.

Overview: how Mem0 stores and distills long-term context

Mem0 is an accessible, general-purpose memory layer with a drop-in service and SDK. It stores, compresses, and retrieves user and agent context across extended sessions.

Its primary utility is acting as a multi-signal retrieval engine aimed at reducing overall token spend and preventing the continuous re-injection of massive, uncompressed context windows.

Key features: hierarchical distillation and optional relationship tracking

Memory compression engine: The platform relies on hierarchical distillation. Instead of storing massive raw graphs, Mem0 summarizes and compresses conversational context to keep retrieval payloads lightweight.
Optional graph layer: While focused on semantic compression, Mem0 has introduced an optional relational tracking layer to map basic entity relationships. This layer offers less temporal modeling than dedicated bitemporal engines.

Pros: onboarding speed and enterprise deployment options

Fast developer onboarding with a refined, mature cloud dashboard and comprehensive observability tools.
Robust enterprise deployment options, including native support for Kubernetes deployments, air-gapped environments, and zero-trust security architectures.
Continues to invest in capability upgrades, with incremental gains in temporal reasoning.

Cons: limited temporal auditing and configurability

Lacks native bitemporal modeling. Historical point-in-time state reconstruction and valid-time tracking are not supported, regardless of retrieval accuracy improvements.
Managed-first architecture limits low-level configurability. Teams requiring fine-grained control over graph indexing, storage backends, or retrieval tuning will find Mem0 more restrictive than self-hostable graph-native alternatives.

Pricing and self-hosting options

Mem0 uses usage-based cloud pricing for its managed service, while offering an Apache 2.0 self-hostable core for localized deployments.

## Letta (MemGPT): agent-managed memory framework

Best for: building autonomous agents with model-managed memory

Developers engineering autonomous agents from scratch who want the agent itself to manage, page, and consolidate its own memory lifecycle over long operational horizons. Letta is the right framework when you prefer operating system-inspired memory management over relying on a passive backend database layer.

Overview: tiered memory and virtual context management

Letta is a comprehensive agent framework built around the concept of memory-first agents that continuously learn from ongoing experience. Originating from the influential UC Berkeley MemGPT research, the platform treats memory management as an OS-level operation dynamically performed by the language model itself through virtual context management.

Key features: tiered memory blocks, self-editing, sleep-time compute

Tiered memory blocks: The framework divides agent state into distinct tiers, separating core memory, persona definitions, fast-recall memory, and deep archival storage.
Self-editing agents: The agent uses defined tools to read, write, edit, and page memory blocks in and out of its active context window as the task requires.
Sleep-time compute: To mitigate the computational latency of continuous context summarization, Letta allows agents to asynchronously consolidate memory and "dream" in the background during idle periods.

Pros: portability and continual learning

Delivers complete model-agnostic memory portability, letting you move complex agent states between different language model providers without data loss.
Arguably the strongest out-of-the-box continual-learning capabilities for autonomous execution, backed by rigorous academic research.
Powerful for persistent agents that must synthesize deep domain expertise organically over time.

Cons: less explicit temporal validity modeling and limited framework portability

Less turnkey temporal modeling compared to the explicit validity windows or bitemporal querying found in HydraDB or Zep.
Since Letta is a tightly coupled framework and paradigm, retrofitting or dropping it into an existing tech stack using LangChain or LlamaIndex is difficult.

Pricing and hosting options

The core Letta Code and CLIs are open-source, with Letta Cloud providing hosted execution environments.

## Supermemory: lightweight recency-aware memory with dynamic merging

Best for: low-overhead conversational memory and fast retrieval

Applications that need a fast, lightweight, recency-aware memory layer to replace basic flat vector databases with intelligent semantic traces. Supermemory is ideal when you need to maintain conversational state but don't require heavy graph traversal overhead or strict, auditable bitemporal event logs.

Overview: time-annotated semantic traces and merging

Supermemory is a streamlined, managed context cloud that treats agent memory as time-annotated semantic traces. It operates as a fast, accessible alternative to legacy vector database infrastructure, automatically prioritizing continuous fact updates and data merging over rigid structural graphing.

Key features: dynamic fact merging, connectors, filesystem-style mounting

Dynamic fact merging: Instead of blindly returning raw, unlinked data chunks during retrieval, the platform automatically updates, merges, and resolves contradictions within incoming data streams at ingestion time.
POSIX-style filesystems: Uses a unique paradigm that lets autonomous agents virtually "mount" user context like an operating system mounts a filesystem. This standardizes how models interact with historical data.
Native connectors and extractors: Ships with an extensive library of built-in tooling designed to pull continuous memory from external applications, databases, and browser sessions.

Pros: speed and minimal ops

Provides fast access to recent context. Vendor-reported sub-300 millisecond retrieval times keep agent response latency low.
Low operational overhead with a fast initial setup for newly onboarded developer teams.
An efficient deduplicated billing model saves enterprise clients significant compute and storage costs by preventing redundant context ingestion.

Cons: no bitemporal auditing for point-in-time queries

Lacks the deep, structured temporal graph capabilities required to separate system time from valid time. Complex historical point-in-time audits are practically impossible.
The underlying data resolution mechanics lack the transparent audit trail that bitemporal event logs provide.

Pricing and enterprise deployment tiers

Operates on a usage-based model priced by actual memories saved and tokens processed, while offering enterprise-grade VPC and on-premise deployment tiers.

Emerging time-aware memory tools to watch (2026)

Beyond the established platforms, several specialized tools and framework-native solutions introduce alternative architectural approaches to temporal validity.

Alternatives and frameworks (MinnsDB, MenteDB, Cognee, LangMem)

MinnsDB and MenteDB: These represent the next wave of native bitemporal databases. MinnsDB operates as a single Rust binary with the MinnsQL query language, enforcing temporal validity on every ingested fact, and uses ontology-driven cascade invalidation and six storage modalities to automatically manage complex data deprecation. MenteDB takes a different approach, using strict valid_from and valid_until timestamps, point-in-time queries, and invalidation instead of deletion. Both remain in early-stage development and lack the production battle-testing of the larger platforms.
Cognee: An open-source solution that blends traditional property graphs with vector memory, operating primarily over documents, chats, code, and unstructured enterprise data. Cognee provides an effective split between short-term session memory and permanent graph storage for knowledge-heavy agents. Its temporal-validity modeling is currently less developed than that of specialized engines such as HydraDB.
LangGraph and LangMem: These are the framework-native options for developers entrenched in the LangChain ecosystem. The stack provides checkpointers for immediate thread memory and cross-session stores. The critical tradeoff is that you must design, implement, and maintain the underlying bitemporal timestamps, supersession logic, and retention policies yourself.

Conclusion: choosing a time-aware memory layer for long-term agents

Vector retrieval helps agents find similar information. Time-aware memory layers help agents reason about time. For long-running agents that must answer "what was true when?" and prove what they did, time-aware memory isn't a feature. It's the foundation.

Before committing to a proof of concept, technical buyers need to map their specific temporal requirements. Weigh the necessity of validity windows and bitemporal auditing against your capacity to manage external operational footprints. Failing to align these criteria will lead to compounding context drift and destructive memory overwrites within your application.

For developers and architects ready to evaluate a unified runtime that resolves temporal state without the overhead of external graph databases, review HydraDB's documentation on temporal graph retrieval and Sliding Window Inference to see how memory-native infrastructure transforms long-term agent reliability.

Frequently asked questions (FAQ)

What is a time-aware memory layer for AI agents?

A time-aware memory layer stores facts with explicit time semantics, such as validity windows or bitemporal timestamps, so an agent can reconstruct what was true at a specific point in time, not just what is most semantically similar.

What does "bitemporal" mean in agent memory?

Bitemporal memory tracks valid time (when a fact was true in the real world) and system time (when the system recorded the fact), enabling point-in-time queries, auditability, and non-destructive supersession.

Do long context windows (millions of tokens) eliminate the need for external memory?

No. Even when history fits in context, long-horizon multi-session recall and temporal consistency degrade. External memory is still needed for deterministic retrieval, invalidation, and state management.

How do validity windows prevent "memory contradictions"?

Instead of overwriting or deleting old facts, the system marks them as valid until a specific time and stores the newer fact with a new validity range. Retrieval can then answer based on the requested time.

Which platform should I choose if I need "what was true when" without running a graph database?

Choose HydraDB if you want temporal querying with both commit-time and valid-time tracking in a unified runtime without operating an external graph DB.

Which platform is best if temporal correctness and audit trails are the top priority?

Choose Zep (Graphiti) if you want rigorous invalidation and an auditable history, and you can manage the required graph database backend.

When is Mem0 a better fit than a bitemporal system?

Mem0 is a better fit when you primarily need easy integration and context compression across sessions and don't require strict point-in-time audits or deep temporal reasoning.

Is Letta a memory database or an agent framework?

Letta is primarily an agent framework where the model actively manages tiered memory (paging, editing, and consolidation) rather than a standalone drop-in memory database API.

How should I evaluate time-aware memory layers beyond vendor benchmarks?

Test with your own contradiction-heavy timelines (preference changes, policy updates, and entity attribute changes), measure point-in-time accuracy, and verify latency, retention, and compliance requirements in a staging environment.
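
A point-in-time accuracy check can be as small as the harness sketched below. It is a toy under stated assumptions: the `answer_as_of` callable stands in for whatever client your memory layer exposes, and the two toy memories, the timeline, and the test cases are hypothetical.

```python
from datetime import date
from typing import Callable, Optional

Case = tuple[str, date, str]   # (question key, as-of date, expected answer)


def point_in_time_accuracy(answer_as_of: Callable[[str, date], Optional[str]],
                           cases: list[Case]) -> float:
    """Fraction of cases where the memory returns the fact valid on the given date."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(1 for key, day, expected in cases
                  if answer_as_of(key, day) == expected)
    return correct / len(cases)


# A contradiction-heavy toy timeline: the preference changes over time.
TIMELINE = [(date(2025, 1, 1), "Next.js"), (date(2026, 1, 1), "Angular")]


def latest_only(key: str, day: date) -> Optional[str]:
    """Recency-aware stand-in: always returns the newest fact, ignoring `day`."""
    return TIMELINE[-1][1]


def time_aware(key: str, day: date) -> Optional[str]:
    """Time-aware stand-in: returns the fact whose window contains `day`."""
    current = None
    for start, value in TIMELINE:   # TIMELINE is sorted by start date
        if start <= day:
            current = value
    return current


CASES: list[Case] = [
    ("favorite_js_framework", date(2025, 6, 1), "Next.js"),
    ("favorite_js_framework", date(2026, 6, 1), "Angular"),
]

print(point_in_time_accuracy(latest_only, CASES))  # 0.5
print(point_in_time_accuracy(time_aware, CASES))   # 1.0
```

In a real evaluation you would replace the toy functions with calls to each candidate system and build the cases from your own preference changes, policy updates, and entity attribute changes.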

What is the difference between "recency-aware memory" and "time-aware memory"?

Recency-aware memory prioritizes newer information for relevance, while time-aware memory supports explicit historical reconstruction (e.g., "as of March 2025") using validity or bitemporal timestamps.

HydraDB vs Traditional Vector Databases: Why AI Agents Need a True Memory Layer

Aman Puri — Sat, 27 Jun 2026 15:40:51 +0000

Teams deploying autonomous agents keep running into the same wall. Standard RAG stacks suffer from context rot, the gradual degradation of retrieval usefulness as irrelevant, stale, or conflicting information accumulates, leading to diminished recall, incorrect reasoning, and confident but wrong outputs over time.

The root cause is treating memory as a retrieval problem when it's fundamentally a state management problem. Vector databases solve retrieval, while memory layers solve state. A true memory layer includes vector search as one signal among many, and teams can either assemble that themselves or adopt a unified primitive.

Key takeaways

Vector databases are optimized for embedding similarity search, fast and useful for static-content RAG, but not a substitute for agent memory.
AI agents need state management across identity and relationships, temporal correctness, memory lifecycle (store, update, merge, deprecate, delete, forget, audit), context assembly, and multi-tenant isolation.
HydraDB combines a Git-styled temporal graph, a native vector index, and standard B-tree structures into one unified memory layer, available as a managed or self-hostable API.
Use a vector database for static-content semantic search and a memory layer when the agent must operate over time, covering use cases such as customer support copilots, coding agents, healthcare companions, and internal knowledge brains.

What's the difference between a vector database and an agent memory layer?

The core difference is that a vector database retrieves content by similarity, while a memory layer manages evolving state across sessions, entities, and time.

Standard semantic search runs on vector databases. For teams evaluating alternatives to Pinecone, Qdrant, ChromaDB, or Weaviate, the root issue is usually the difference between AI memory and vector database capabilities, particularly relationship-aware filtering. Vector databases are well-engineered for embedding similarity search, metadata filtering, CRUD, scalability, and security. These systems handle isolated text chunks well and match user queries to the nearest semantic neighbor in a vast, mostly static corpus.

But traditional vector databases are stateless by design. They treat every indexed embedding as an isolated artifact and don't have native primitives for modeling relational entity graphs, tracking evolving user state, or maintaining decision histories across multiple sessions.

A memory layer works differently. It's a unified state infrastructure built for agents. Instead of persisting isolated text chunks, memory layers capture entities, relationships, temporal context, and full context lifecycles. They govern how facts connect, when specific assertions were true, and who they apply to.

A vector database might return a text snippet because it semantically resembles the prompt. A memory layer returns an assembled, scoped picture of the agent's current state. It understands not just semantic similarity, but the ongoing narrative of an autonomous agent's interaction with a specific user over time.

Drawing this line between stateless semantic search and stateful memory operations allows teams to choose the right tool for stateful, production-grade agent deployments.

Why vector search breaks for stateful agents (identity, updates, and temporal correctness)

The limitations of flat vector retrieval become obvious when you watch how LLMs and autonomous agents behave over extended timelines. The LongMemEval study, presented at ICLR, evaluates commercial assistants on five long-term memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

When tested on sustained interactions requiring long-term memory, commercial models and long-context LLMs showed roughly a 30% accuracy drop. This points to a fundamental gap in treating memory as a stateless retrieval problem rather than a managed lifecycle, highlighting why agent memory is ultimately a database problem that requires transactional isolation and bitemporality.

Failure mode: identity and tenant boundaries aren't first-class

Think about a customer support agent resolving an ongoing enterprise account issue. A vector database can find similar past support tickets based on embedding proximity, but it lacks native constructs to track the user's current, evolving state.

If the customer requested phone contact instead of email yesterday, a vector database might still retrieve older interactions stating an email preference. This surfaces conflicting information to the language model.

A flat index also can't natively record that a specific billing penalty was waived for this user last month, or that the user belongs to a high-value tenant account that warrants different operational rules.

Failure mode: unlinked facts and destructive upserts lose temporal truth

In a typical RAG pipeline, developers ingest conversational data by chunking text and generating new unique identifiers for every new session. The result is a disorganized, append-only pile of unlinked facts. Previous and current facts coexist in the index with no linkage between them, leaving no deterministic way to resolve the latest truth.

As detailed in the academic research investigating whether agent memory is a database, these naive pipelines inevitably suffer from unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. To address this, the paper proposes core memory-level operators, including ingestion, revision, forgetting, and retrieval.

When teams try to circumvent this by directly overwriting existing vectors, they hit the semantic equivalence problem: deciding whether a new chunk of text completely replaces an older embedding or merely adds nuance to it. For example, if a developer tells their coding copilot they love the JavaScript framework Next.js, and a year later says they prefer Angular, naive vector replacement destroys the history. Both facts need to coexist temporally. The transition between frameworks is vital context for an agent making architectural recommendations, and a naive LLM-resolved delete loses it.

A flat vector index can't determine if a retrieved fact is still true today, identify who the fact applies to within a multi-tenant environment, resolve which newer information replaced an earlier belief, or decide which specific memory should be forgotten to save context space.

The system just retrieves what's semantically closest to the query, regardless of whether it's currently valid.

Failure mode: application-layer timestamp filtering adds latency and complexity

Teams often try to solve temporal correctness by appending metadata timestamps to every vector and executing complex post-retrieval filtering logic in the application layer. This approach introduces latency penalties at scale and offloads database responsibilities into fragile application glue code.

A real solution requires pushing these temporal and relational operators down into the database layer itself. That way, the agent retrieves a single, verified truth rather than forcing the language model to synthesize conflicting timelines in its already constrained context window.

Flat vector search works well as a foundation for finding text, but without temporal reasoning and semantic revision operators, it falls short when agents must maintain coherent, evolving state over months of interaction.

The industry is converging on memory as infrastructure

The enterprise AI industry is increasingly recognizing this architectural gap. Cloudflare's 2026 Agent Memory launch ships a managed service that gives agents persistent memory, allowing them to "recall what matters, forget what doesn't, and get smarter over time." The fact that a major infrastructure provider is building dedicated memory services, with its own extraction pipeline, supersession chains, and multi-channel retrieval, underscores that flat vector search alone is insufficient for production agents.

This pattern extends beyond any single vendor. The shift from stateless retrieval to stateful memory infrastructure is becoming a recognized requirement for production agent deployments.

What a true agent memory layer includes (identity, temporal correctness, lifecycle management, and context assembly)

Identity and relationship modeling (entities and graphs)

A true memory layer links users, tenants, and projects contextually. Retrieved information is bound to the specific entities involved in the ongoing interaction, rather than floating as disconnected embeddings.

Temporal correctness (versioning, supersession, and auditability)

If an earlier preference of 'I use npm' is superseded by 'switch all projects to pnpm,' a memory layer resolves this conflict by preserving the continuous timeline. It understands that pnpm is the current preference, while safely retaining the historical fact that the user previously used npm, allowing the agent to reference past codebase decisions accurately.

Memory lifecycle management (merge, deprecate, forget, and audit)

Since a versioned temporal graph grows continuously, a memory layer provides dedicated lifecycle management primitives to store, update, merge, deprecate, delete, forget, and audit memories. These primitives eliminate the need for brittle, custom application-side glue code and prevent storage bloat and context degradation.

Context assembly and multi-tenant isolation

Instead of returning a disjointed pile of semi-related text snippets based solely on a distance metric, a memory layer returns compact, ranked, and tenant-isolated context.

This assembled context is scoped to the user and workspace boundary, meeting enterprise security requirements while providing the agent with a coherent picture of the current operational state.
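As a rough illustration of scope-then-rank assembly, the sketch below filters memories to one tenant and user before ranking and trimming them to a size budget. The `Memory` class, the `score` field (assumed to come from an upstream retriever), and the character budget are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    tenant_id: str
    user_id: str
    text: str
    score: float  # relevance score from an upstream retriever (assumed)


def assemble_context(memories, tenant_id, user_id, max_chars=200):
    """Scope first, then rank, then trim to a size budget."""
    scoped = [m for m in memories if m.tenant_id == tenant_id and m.user_id == user_id]
    ranked = sorted(scoped, key=lambda m: m.score, reverse=True)
    picked, used = [], 0
    for m in ranked:
        if used + len(m.text) > max_chars:
            continue  # skip items that would exceed the budget
        picked.append(m.text)
        used += len(m.text)
    return picked


memories = [
    Memory("acme", "u1", "Prefers phone contact", 0.9),
    Memory("acme", "u1", "Billing penalty waived last month", 0.7),
    Memory("globex", "u9", "Prefers email contact", 0.95),
]
context = assemble_context(memories, "acme", "u1")
```

The highest-scoring memory belongs to another tenant and never enters the candidate set, because scoping runs before ranking rather than after it.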

HydraDB architecture: versioned graph + vector search + B-tree indexes

Building a system that truly understands relationships, time, and semantic meaning requires moving beyond single-index database designs. HydraDB achieves this through an architectural triad of a Git-styled temporal graph, a native vector index, and standard B-tree structures.

The temporal graph stores entities, relationships, and versioned state transitions. The vector index handles semantic similarity search across memory objects. B-tree indexes support structured lookups, metadata filtering, and ordered access patterns. Together, these three storage paradigms back three core primitives: semantic knowledge (documents and facts that agents reason over), user memories (preferences, history, and identity that persist across sessions), and episodic experiences (time-ordered events from every agent interaction). All of this is accessible through one managed or self-hostable database API.

HydraDB isn't merely a Git interface for vectors. It's a converged infrastructure designed for stateful AI workloads.

Structured context ingestion (sliding window inference)

The foundation of this architecture is structured context ingestion, powered by a sliding window inference pipeline. In a standard retrieval setup, a fragmented chunk of text like "he fixed the bug yesterday" is essentially meaningless when retrieved out of context weeks later.

The pipeline resolves entities, pronouns, and implicit conversational references before the data ever reaches the storage layer. This turns fragmented conversation chunks into fully self-contained, contextualized memory objects. The stored state remains immediately useful for future reasoning tasks.
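The windowing part of that idea can be sketched in a few lines. This is not HydraDB's pipeline: `sliding_windows`, `contextualize`, and especially `toy_resolve` are hypothetical names, and `toy_resolve` is a hard-coded stand-in for whatever model call actually resolves references. A real pipeline would also need to resolve relative dates such as "yesterday", which this sketch leaves untouched.

```python
from typing import Callable, List


def sliding_windows(messages: List[str], size: int = 3):
    """Yield (preceding_context, target) pairs, one per message."""
    for i, target in enumerate(messages):
        start = max(0, i - size)
        yield messages[start:i], target


def contextualize(
    messages: List[str],
    resolve: Callable[[List[str], str], str],
    size: int = 3,
) -> List[str]:
    """Rewrite each message into a self-contained memory using its window."""
    return [resolve(ctx, target) for ctx, target in sliding_windows(messages, size)]


def toy_resolve(context: List[str], target: str) -> str:
    """Hypothetical stand-in for a model call: swaps a leading 'he' for a name."""
    for line in reversed(context):
        if line.startswith("Ravi") and target.startswith("he "):
            return "Ravi " + target[len("he "):]
    return target


messages = ["Ravi picked up the login bug.", "Thanks!", "he fixed the bug yesterday"]
memories = contextualize(messages, toy_resolve)
```

The point of the window is that the resolver sees the preceding turns at write time, so the stored object no longer depends on them at read time.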

Hybrid retrieval (semantic search + keyword + graph traversal)

When an autonomous agent needs to recall historical information, HydraDB executes multi-signal context graph retrieval. Instead of returning context that is merely mathematically similar to the prompt, the system returns operationally useful context.

The retrieval engine combines semantic similarity, sparse keyword matching, latent inferred meaning, metadata filtering, graph traversal, temporal signals, and entity-based search. These signals are unified through adaptive query expansion, chunk-level graph expansion, and triple-tier reranking with graph-vector fusion. This hybrid approach ensures that exact lookups, semantic similarities, and relational paths are all evaluated dynamically and simultaneously.
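The article does not specify how the signals are fused, so the sketch below swaps in reciprocal rank fusion (RRF), a common generic technique for merging ranked lists. It is an illustration of multi-signal fusion in general, not a description of HydraDB's reranker; the memory IDs and the three input rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-signal rankings of memory IDs.
semantic = ["m3", "m1", "m7"]
keyword = ["m1", "m9", "m3"]
graph = ["m1", "m3", "m4"]

fused = reciprocal_rank_fusion([semantic, keyword, graph])
```

Items that several signals agree on rise to the top, even if no single signal ranked them first.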

Versioned memory updates (Git-styled history and provenance)

When new information is ingested, the database creates new forward-linked versions within the temporal graph. It preserves historical states while returning the most temporally relevant truth to the querying agent.

The Git-style versioning ensures that every memory mutation is traceable. Teams can audit the decision trace of an agent, reviewing exactly which memory version influenced a specific output. This level of provenance is nearly impossible to reconstruct in a flat vector store, but it is critical for debugging enterprise AI systems where changing preferences and historical state matter just as much as current facts.
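A forward-linked version chain can be sketched with a small in-memory structure. The class and method names below are hypothetical and are not HydraDB's API; the npm and pnpm values reuse the earlier example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    version_id: int
    key: str
    value: str
    parent: Optional[int] = None     # version this one supersedes
    successor: Optional[int] = None  # forward link to the newer version


@dataclass
class VersionedMemory:
    versions: Dict[int, Version] = field(default_factory=dict)
    heads: Dict[str, int] = field(default_factory=dict)  # key -> current version id

    def write(self, key: str, value: str) -> int:
        """Append a new version and forward-link the old head to it."""
        vid = len(self.versions) + 1
        parent = self.heads.get(key)
        self.versions[vid] = Version(vid, key, value, parent=parent)
        if parent is not None:
            self.versions[parent].successor = vid
        self.heads[key] = vid
        return vid

    def current(self, key: str) -> Version:
        return self.versions[self.heads[key]]

    def history(self, key: str) -> List[str]:
        """Walk parent links from the head back to the first version."""
        out, vid = [], self.heads.get(key)
        while vid is not None:
            out.append(self.versions[vid].value)
            vid = self.versions[vid].parent
        return out[::-1]


mem = VersionedMemory()
v1 = mem.write("package_manager", "npm")
v2 = mem.write("package_manager", "pnpm")
```

Nothing is overwritten: `current` answers "what is true now", while `history` and the version IDs give an audit trail that an agent's output can be traced back to.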

Performance and evaluation results (LongMemEval-s accuracy and latency)

The system abstracts this architectural complexity away from the application developer, yielding significant performance advantages for long-running agent deployments.

As detailed in published benchmarks, HydraDB achieves 90.79% accuracy on the rigorous LongMemEval-s evaluation suite, outperforming the next strongest system, Supermemory, by five absolute points.

Although each query evaluates graph relationships, vector similarity, and metadata constraints together, the system sustains sub-200ms retrieval latency for real-time agent workloads by explicitly isolating read and write paths.

By fusing these three distinct database paradigms, HydraDB provides the infrastructure required to solve context rot without sacrificing the rapid execution speed expected from traditional semantic search operations.

When to use a vector database vs. an agent memory layer (decision framework)

Capability dimension	Traditional vector database	Agent memory layer
Statefulness	Stateless; processes isolated chunks	Stateful; tracks entities and relationships
Update mechanism	Append-only unlinked facts; no native supersession	Version chains and temporal supersession
Temporal tracking	Application layer responsibility	Native timelines and historical persistence
Multi-tenant isolation	Manual namespace filtering	Native boundary isolation and context assembly
Ideal workloads	Static document RAG, product catalogs	Support agents, coding copilots, AI companions

Use a vector database when your data is mostly static

If you're building enterprise document search, querying massive product catalogs, indexing extensive static codebases, or routing static FAQs, a flat vector index remains the correct architectural choice.

In these scenarios, the underlying data doesn't have an evolving temporal state. The operational simplicity, ultra-low latency, and attractive cost-performance profile of zero-operations vector infrastructure match the demands of the workload.

Use a memory layer when your agent needs long-term state

Customer support agents, coding copilots, personalized healthcare companions, and internal enterprise knowledge brains require continuous context.

If the agent's utility degrades when it forgets what a user said last week, or if the agent hallucinates because it can't distinguish between an old, deprecated preference and a new directive, integrating a memory layer becomes essential.

Procurement checklist: how to evaluate memory layer vendors

When evaluating memory layer options, particularly if your team is exploring Mem0 and Zep alternatives, assess your application's need for temporal reasoning and the underlying complexity of relationship mapping in your data model. Consider the risk that unlinked or overwritten facts pose to your application logic.

If losing the connection between an old fact and its replacement permanently destroys critical user history, standard vector pipelines will inevitably break your product experience.

Teams must also weigh the operational realities of adoption. Standard vector databases compete aggressively on managed service simplicity and massive horizontal scale. Memory infrastructure that maintains version chains and relational temporal graphs introduces its own operational considerations. Teams building equivalent capabilities from separate vector, graph, and relational systems typically face comparable or greater operational complexity spread across multiple systems.

When selecting a memory layer, scrutinize how the vendor handles long-term storage bloat. Make sure the system supports the full lifecycle primitive set, including store, update, merge, deprecate, delete, forget, and audit, so superseded facts can be retired cleanly as long-term memory grows.

The evaluation comes down to whether your AI system acts as a static research tool or a collaborative participant. Research tools need semantic retrieval to find existing documents. Collaborative participants need memory layers to understand the evolving relationship they share with the user, keeping context personalized and temporally accurate.

By aligning your core architectural choice with your specific engineering need for state, you ensure your deployed agents remain coherent, performant, and reliable in production environments.

Conclusion

Flat embeddings and endlessly expanding context windows can't substitute for structured, time-aware state. Vector databases are engineered for finding similar information within static datasets, but memory layers are essential for maintaining the correct, evolving state of an autonomous agent.

If you're engineering personalized, stateful AI systems where context continuity directly dictates product quality, standard semantic search will eventually bottleneck your development.

Vector databases help agents find information. Memory layers help agents maintain state. For production AI agents, state is the product.

For teams ready to solve context rot and build reliable multi-session agents, explore how HydraDB provides the context and memory layer infrastructure required for enterprise production.

Frequently asked questions (FAQ)

What is the difference between an agent memory layer and a vector database?

A vector database retrieves content by embedding similarity and is typically stateless. An agent memory layer manages state over time, linking facts to entities (user, tenant, and project) and tracking versions, timelines, and supersession so an agent can recall the most current, applicable truth.

When should I use a vector database instead of an agent memory layer?

Use a vector database when you need semantic search over mostly static data like documents, product catalogs, or reference knowledge where facts don't frequently change, and you don't need multi-session user state.

When do I need an agent memory layer for an AI agent?

You need a memory layer when the agent must stay correct across sessions, including support copilots, personalized assistants, and coding agents, where preferences, policies, and user context change over time and must apply to the right identity and tenant.

Why do vector databases struggle with "latest truth" in long-running agents?

Typical pipelines store conversation as append-only chunks or rely on destructive upserts, which makes determining what was superseded difficult. Retrieval then returns semantically similar but potentially outdated or conflicting facts, forcing the LLM to guess.

Can I add timestamps and metadata to vectors to solve temporal correctness?

Timestamps help, but they usually push the hard work into application-side filtering and logic, increasing complexity and latency. A memory layer handles temporal reasoning natively (version chains, supersession, and lifecycle rules) so retrieval returns a coherent state.

Do I need a separate vector database alongside memory infrastructure?

Not necessarily. Purpose-built memory infrastructure like HydraDB includes vector search as one retrieval signal alongside graph traversal, metadata filtering, and temporal reasoning. If your workload only requires semantic search over static documents, a standalone vector database may suffice. But for agents that need both retrieval and state, a unified memory infrastructure eliminates the integration burden of maintaining separate systems.

What capabilities should I look for in an agent memory layer?

Look for entity and identity modeling, temporal versioning and supersession, multi-tenant isolation, memory lifecycle controls (merge, deprecate, forget, and audit), and structured context assembly that returns compact, scoped state.

How is an agent memory layer different from a knowledge graph?

Knowledge graphs model entities and relationships, but many don't provide LLM-oriented retrieval, semantic similarity, or memory lifecycle and versioning needed for evolving agent state. A memory layer typically combines graph structure with retrieval and time-aware updates.

What is "context rot" and how do memory layers reduce it?

Context rot is the gradual degradation in the usefulness of earlier context as conversations grow. As irrelevant, stale, or conflicting information accumulates, models struggle to distinguish salient facts from noise, leading to diminished recall, incorrect reasoning, and unstable behavior. Memory layers reduce context rot by maintaining versioned, scoped, and lifecycle-managed memories so the agent retrieves the most relevant current state.

How should I evaluate memory layer vendors for production use?

Evaluate latency, accuracy on long-memory tasks, auditability and provenance, lifecycle controls (versioning, supersession, and forgetting) for storage growth, and security and multi-tenant boundaries. Also, confirm how updates preserve history without losing temporal nuance.

B-Trees, Vectors, and Graphs: Why Hybrid Search Breaks at the Storage Layer

Aman Puri — Sat, 27 Jun 2026 15:40:23 +0000

Agent memory exposes a storage-layer problem beneath retrieval: metadata, semantic similarity, relationships, permissions, and time have to work together in one query path. Vector search with filters covers only one part of that workload.

TL;DR

AI agents do not need another database to store embeddings. Instead, they require a purpose-built context database that can preserve memory across users, sessions, permissions, relationships, and time.

A single memory query may need to filter by tenant, workspace, user, status, or permissions. It may also need to follow relationships across users, files, sessions, entities, and versions, then rank the surviving context by semantic similarity.

Under the hood, that forces three physical access patterns into one query path: B-tree range access for metadata, ANN vector traversal for semantic similarity, and graph traversal for contextual relationships.

Each one wants different I/O, cache behavior, maintenance policy, and execution priority. Hybrid search breaks down when these structures are added as separate features but planned, cached, and maintained as if they were independent.

HydraDB is the context and memory layer for AI applications. It gives agents structured, time-aware memory so they can retrieve the right context for the right user, workflow, tenant, and moment. The HydraDB docs describe this as graph-native, temporal memory infrastructure for AI applications, where memories, knowledge, metadata, and graph context work together instead of being stitched across disconnected retrieval tools.

Agent memory exposes a storage-layer problem

For years, data retrieval improved through specialization.

Relational databases handled structured predicates with B-tree indexes. Search engines handled keyword relevance with inverted indexes. Graph databases handled relationship traversal. Vector databases optimized approximate similarity search over embeddings.

AI agents combine retrieval patterns that traditional systems usually isolate.

Static RAG over documents is giving way to persistent agent memory: a dynamic, interconnected store of facts, conversations, user preferences, permissions, source documents, and versioned state. HydraDB’s guide to AI agent memory frames this shift as the difference between stateless assistants that reset every session and memory-aware agents that can learn, personalize, and maintain continuity over time.

This workload goes beyond finding the most semantically similar chunk. The system must preserve context well enough to answer what is relevant, what changed, when it changed, who it affects, and why the change matters.

Agent memory requires infrastructure that can reconcile multiple physical access patterns without letting one dominate query latency, recall, or write throughput.

Why vector search alone is insufficient

Agent memory combines several structures that databases usually optimize separately.

At the center are raw memories: text, images, events, messages, files, and data records. These are often indexed semantically using vector embeddings.

Around those memories is relational metadata: timestamps, user IDs, tenant IDs, source documents, permissions, version markers, ownership, and compliance attributes.

Connecting those memories is an explicit relationship graph. A user authored a document. That document revised a previous version. The revision was discussed in a specific meeting. The meeting changed a user preference. The new preference applies only to one account, workflow, or project.

Vector search does not directly solve these problems. A vector index answers a nearest-neighbor question in embedding space. It does not natively encode tenant boundaries, permission rules, version validity, authorship, dependency paths, or temporal state. Those properties can be stored alongside vectors, but the vector index itself still optimizes for distance, not correctness under scope, time, policy, and relationships.

That is why vector search alone is insufficient for agent memory. It can retrieve semantically close memories, but it cannot decide whether a memory is allowed, current, causally connected, or applicable to the current workflow without coordinated metadata filtering, temporal state, and relationship traversal.

The three access patterns: B-tree, ANN, graph

Access pattern	Common structure	What it does well	Physical behavior	Main risk in hybrid search
Range and point access	B-tree	Tenant filters, timestamps, ordered scans, equality predicates	Page-oriented and relatively predictable	Can flood cache with metadata and heap pages
Semantic similarity	HNSW or another ANN index	Approximate nearest-neighbor retrieval over embeddings	Random-access-heavy graph traversal	Can lose recall under selective filters
Relationship traversal	Context graph edges	Context expansion across users, documents, sessions, and versions	Dependent reads across nodes, pages, shards, or objects	Can produce high fan-out and weak locality

B-trees support metadata filtering

B-trees are widely used in OLTP databases because they support point lookups, ordered scans, and range predicates efficiently.

They work well for queries such as:

WHERE tenant_id = 'abc'
  AND created_at > '2024-01-01'

B-trees perform best when the index order matches the predicate shape. Their I/O is page-oriented and relatively predictable compared to graph traversal, making them a strong fit for buffer pools, page caches, and disk-backed execution.
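The "index order matches the predicate shape" point can be checked with Python's built-in `sqlite3` module, used here only as a convenient stand-in for any B-tree-backed engine. The `memories` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, created_at TEXT, body TEXT)"
)
# Equality column first, range column second, matching the predicate shape.
conn.execute("CREATE INDEX idx_tenant_created ON memories (tenant_id, created_at)")
conn.executemany(
    "INSERT INTO memories (tenant_id, created_at, body) VALUES (?, ?, ?)",
    [
        ("abc", "2023-12-30", "old note"),
        ("abc", "2024-02-01", "new note"),
        ("xyz", "2024-03-01", "other tenant"),
    ],
)

query = (
    "SELECT body FROM memories "
    "WHERE tenant_id = 'abc' AND created_at > '2024-01-01'"
)
rows = [r[0] for r in conn.execute(query)]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail string.
plan = " ".join(str(r[-1]) for r in conn.execute("EXPLAIN QUERY PLAN " + query))
```

Printing `plan` should show a search on `idx_tenant_created` rather than a full table scan. Reversing the index column order to `(created_at, tenant_id)` is a quick way to see how a mismatch changes the plan.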

HNSW vector search behaves differently

Approximate nearest-neighbor search, often implemented with an HNSW graph, has a different physical profile.

The original HNSW paper by Malkov and Yashunin describes a hierarchical proximity graph used for efficient approximate nearest-neighbor search. Query execution traverses graph neighborhoods and compares candidate vectors. Depending on the implementation and dataset layout, those graph nodes and vector payloads may not have the same locality as an ordered B-tree scan.

Hybrid search often depends on fast access to graph neighborhoods, candidate vectors, and metadata predicates inside the same query path.

Relationship graphs add another access pattern

Relationship traversal, such as user -> interacted_with -> document, is a form of dependent access.

Each hop can point to a different node, page, shard, or storage object. With enough fan-out, traversal becomes a sequence of small reads where the next access depends on the current result.

This weakens batching and prefetching. It also creates a different cache profile from both B-tree scans and ANN traversal.
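The dependency between hops is easy to see in a small breadth-first expansion that counts how many rounds of reads must happen in sequence. The graph contents and the `expand` helper are hypothetical; each adjacency-list lookup stands in for a storage read.

```python
from typing import Dict, List, Set


def expand(graph: Dict[str, List[str]], start: str, max_hops: int):
    """Breadth-first expansion; returns visited nodes, dependent rounds, and reads."""
    visited: Set[str] = {start}
    frontier = [start]
    rounds = 0
    reads = 0
    for _ in range(max_hops):
        if not frontier:
            break
        rounds += 1  # this round cannot start until the previous one returned
        next_frontier = []
        for node in frontier:
            reads += 1  # one adjacency-list fetch per frontier node
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited, rounds, reads


graph = {
    "user:1": ["doc:a", "doc:b"],
    "doc:a": ["session:x"],
    "doc:b": ["session:y", "session:z"],
}
visited, rounds, reads = expand(graph, "user:1", max_hops=2)
```

Reads within one round can be issued in parallel, but the addresses for round two are unknown until round one returns. That is why hop count and fan-out, not just total read volume, drive traversal latency.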

HydraDB’s full recall response format reflects this difference. Recall responses can include retrieved chunks, source metadata, graph paths, chunk relations, and additional related context. That response shape exists because useful agent context is not just a ranked list of nearest chunks.

Where hybrid execution breaks down

Supporting B-trees, vectors, and graph edges as logical features is only the first step. The storage engine still has to make its I/O behavior predictable when a single query needs all three.

A hybrid query may start with a tenant filter, expand through relationships, and rank results by vector similarity. Each structure is reasonable in isolation. Together, they can create cache churn, random reads, high tail latency, and recall instability.

Whether the system starts with a relational core and adds vectors, or starts with a vector core and adds metadata filtering, the same constraint appears: the physical design optimized for one workload can limit the others.

Postgres + pgvector: useful, but filtered ANN has trade-offs

Postgres was a natural place to add vector search because it has mature transactional behavior, a strong extension ecosystem, and broad operational adoption.

pgvector has gained adoption because it stores embeddings alongside transactional data and avoids adding another operational system. This helps teams that want vector search near application data.

But it introduces trade-offs in high-recall, filtered ANN workloads.

The core issue is that Postgres was built around relational operators, cost estimates, indexes, and tuples. Vector similarity search has a different selectivity behavior. The planner can reason about ordinary predicates such as tenant, timestamp, or status, but high-dimensional nearest-neighbor search does not behave like a normal B-tree lookup.

Filtered vector search can reduce recall

Filtered vector search exposes the mismatch between relational predicates and ANN traversal.

In approximate index scans, filtering is typically applied after the vector index produces candidates. If the filter is selective, many of the nearest candidates may be discarded. The query can return fewer than k results unless the scan explores more of the index.

That creates a recall cliff. The system may look fast because it searched a small candidate set, but it did not search enough of the graph to satisfy the filtered query.

The pgvector documentation addresses this with iterative scans. Starting with version 0.8.0, iterative index scans can continue scanning an HNSW or IVFFlat index until enough filtered results are found or a configured limit is reached.

This gives operators another control surface for trading latency, recall, and scan depth, but the tension remains: relational filtering and approximate graph traversal must still be coordinated within a single execution path.
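The difference between a fixed candidate budget and an iterative scan can be simulated without a real index. In this sketch the integer IDs stand in for index entries already sorted by distance to the query, and `max_scan` plays the role of a configured scan limit. It is a conceptual model only, not pgvector's implementation.

```python
def post_filter(ordered_ids, allowed, k, candidates):
    """Take a fixed candidate budget from the index, then apply the filter."""
    return [i for i in ordered_ids[:candidates] if i in allowed][:k]


def iterative_scan(ordered_ids, allowed, k, batch, max_scan):
    """Keep pulling candidates until k pass the filter or max_scan is reached."""
    limit = min(max_scan, len(ordered_ids))
    results, scanned = [], 0
    while len(results) < k and scanned < limit:
        end = min(scanned + batch, limit)
        for i in ordered_ids[scanned:end]:
            if i in allowed:
                results.append(i)
        scanned = end
    return results[:k], scanned


# IDs 0..99 stand in for index entries already ordered by distance to the query.
ordered_ids = list(range(100))
# A selective filter: only every 20th entry belongs to the active tenant.
allowed = {i for i in ordered_ids if i % 20 == 0}

fixed = post_filter(ordered_ids, allowed, k=3, candidates=10)
widened, scanned = iterative_scan(ordered_ids, allowed, k=3, batch=10, max_scan=80)
```

With a budget of 10 candidates, the fixed approach finds a single match for `k=3`. The iterative version reaches three matches, but only after scanning half of the entries, which is the latency-for-recall trade described above.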

HNSW index builds can be operationally expensive

Building an HNSW index is CPU-intensive and sensitive to memory settings such as maintenance_work_mem.

The pgvector HNSW docs note that indexes build faster when the graph fits into maintenance_work_mem, and that build time can increase significantly when it no longer fits.

On shared Postgres systems, an HNSW build can compete with transactional workloads for CPU, memory, and I/O unless it is isolated carefully.

Vector-native systems: stronger ANN, but filtering and relationships still require coordination

Dedicated vector databases avoid some Postgres planner constraints, but filtered vector search still requires careful execution planning.

Stronger systems build payload indexes, filter-aware HNSW variants, or query planners that switch strategies based on filter selectivity. Qdrant, for example, documents payload indexes and filterable HNSW, plus ACORN search for stricter filtered-search cases.

These approaches address the filtered-ANN problem by changing when filters are applied, how much of the vector graph is explored, and how candidates are validated.

For common filters, this can work well. For tenant isolation, permissions, time windows, version constraints, or low-cardinality intersections, the system still has to choose how deeply to search and how much candidate expansion to tolerate.

A reliable design applies metadata constraints during traversal or planning, not only after retrieval. That requires tighter coordination between the metadata index, the vector index, and the executor.

In practice, the engine has to know which graph paths are worth exploring under the active predicate, so the executor weighs metadata selectivity, vector traversal, and candidate validation together during the query rather than treating them as separate phases.

Relationship graphs conflict with page-oriented I/O

Layering a relationship graph on either a B-tree-centric or a vector-centric foundation introduces another physical problem.

B-trees are page-oriented. Vector indexes often rely on memory locality for graph layers and candidate vectors. Relationship traversal can break both assumptions.

A graph hop from a user node to a document node to a session node may require fetching records from different pages or storage regions. With enough fan-out, traversal becomes a chain of dependent reads.

The system cannot always batch or prefetch those reads because each hop determines the next set of addresses.

Hybrid search becomes expensive when metadata filtering, relationship expansion, and vector ranking share a single query path, even when each index is well designed.

Production symptoms: cache contention, tail latency, index maintenance, write interference

Hybrid search friction appears as I/O amplification, cache contention, index maintenance cost, and write-path interference. These implementation details directly affect recall, latency, concurrency, and operating cost.

`mmap` can create unpredictable latency

Some vector systems use memory-mapped files, or mmap, to access index structures.

The programming model is simple. Hot data can run near memory speed, and the operating system handles paging. For read-mostly workloads with enough memory and predictable locality, this can work well.

The trade-off is control. In "Are You Sure You Want to Use MMAP in Your Database Management System?", Crotty, Leis, and Pavlo argue that mmap is not a suitable replacement for a traditional DBMS buffer pool because it reduces the database engine’s direct control over paging and I/O behavior.

Eviction, prefetching, fault handling, and I/O scheduling affect latency when queries depend on random access.

When a query touches a non-resident HNSW node, vector page, metadata page, or graph edge, the system must fetch it before execution can continue. On local NVMe, that may be manageable. On networked storage or object-backed designs, cold reads can become more visible in tail latency.

mmap moves key storage decisions away from the database engine at the point where hybrid search needs more control.

A managed buffer pool, potentially paired with async I/O mechanisms, can give the engine more explicit control over admission, eviction, prefetching, and backpressure.

That control comes with implementation complexity. In a hybrid database, the complexity may be justified because the engine must balance multiple access patterns rather than delegate paging decisions to the operating system.

B-tree buffers and HNSW layers compete for cache

In a unified system, the B-tree index and the HNSW graph compete for the same memory budget unless the engine isolates them.

Consider a common hybrid query:

Find documents matching tenant_id = X that are semantically similar to this query.

If tenant_id is selective but still matches a large working set, the relational portion of the query may read many B-tree and heap pages into cache. That can evict parts of the HNSW graph, including upper layers or frequently visited neighborhoods used as entry points for ANN search.

When the vector portion of the query runs, it may encounter cache misses where it expected hot graph structures.

The reverse can also happen. A burst of vector queries can keep HNSW pages hot while pushing relational metadata pages out of memory. Later, ordinary transactional or filtered queries pay the price.

B-trees and HNSW can coexist, but their working sets have different shapes, and a generic cache policy may not understand which pages are latency-critical for each phase of a hybrid query.
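A small simulation makes the eviction effect concrete. It assumes a single shared LRU cache with a hypothetical capacity of 100 pages and made-up page identifiers; real buffer managers use more sophisticated policies, so treat this as an illustration of the failure shape rather than a model of any specific engine.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal shared page cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.misses = 0

    def touch(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)      # hit: mark as most recently used
            return
        self.misses += 1                      # miss: page must be fetched
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page

cache = LRUCache(capacity=100)
hnsw_hot = [("hnsw", i) for i in range(50)]   # hypothetical upper layers and entry points

for page in hnsw_hot:                         # warm the graph pages
    cache.touch(page)
cache.misses = 0
for page in hnsw_hot:                         # a vector query on a warm cache
    cache.touch(page)
warm_misses = cache.misses

for i in range(500):                          # tenant filter reads many heap and B-tree pages
    cache.touch(("heap", i))

cache.misses = 0
for page in hnsw_hot:                         # the same vector query after the relational scan
    cache.touch(page)
cold_misses = cache.misses

print(f"misses on warm cache: {warm_misses}, misses after relational scan: {cold_misses}")
```

An engine that pins or separately budgets the latency-critical graph pages avoids this outcome, at the cost of a cache policy that has to understand both access patterns.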

MVCC adds maintenance pressure

The concurrency challenges of hybrid systems become sharper under MVCC.

In a system like Postgres, updates create new row versions. PostgreSQL’s routine vacuuming documentation explains that VACUUM removes dead row versions in tables and indexes and marks space available for reuse. For agent memory workloads, where state, conversation history, metadata, and embeddings may change frequently, this can lead to table and index bloat unless vacuum, partitioning, and retention policies are carefully tuned.

Vector indexes add their own maintenance pressure. HNSW is a graph-like structure, so inserts and updates are not just local tuple changes. The engine has to maintain graph connectivity while preserving query correctness and concurrency guarantees.

Deletes and updates can leave tombstones, stale candidates, or deferred cleanup work depending on the implementation.

In a hybrid engine, a filtered search query can collide with background indexing, compaction, vacuum, or a high-throughput write stream.

If concurrency control is too coarse, long-running reads can delay maintenance or writes. If it is too permissive, queries may see inconsistent candidate sets or pay extra validation costs.

Relational transactions, vector index maintenance, and graph updates all want different consistency and scheduling policies.

Mitigation patterns: tiered storage, quantization, read/write separation

Several patterns reduce physical-layer friction in hybrid search. Each one helps, but none removes the underlying trade-off.

The design still has to choose where to spend memory, latency, recall, and operational complexity.

Pattern	What it improves	What it costs	Best fit
Tiered storage and lazy loading	Reduces memory and local disk pressure	Adds first-query and cache-miss latency	Large collections, skewed access, analytical retrieval
Quantization	Reduces vector storage and memory footprint	Can reduce recall without re-ranking	Vector-heavy systems with two-stage retrieval
Read/write separation	Isolates indexing from serving	Adds freshness, replication, and recovery complexity	Interactive workloads that need stable query latency

Tiered storage reduces memory pressure

One common pattern is separating compute from storage and using low-cost object storage as the durable backing layer.

Systems such as Milvus support tiered storage to avoid loading every field and index into each query node up front. Instead, the query node can load lightweight metadata first, then fetch field data and indexes on demand through a local caching layer.

This is a critique of object-backed lazy-loading as the primary serving pattern for interactive agent memory, not a critique of Milvus as a vector database. HydraDB uses Milvus-backed indexing for configured metadata fields, as reflected in the tenant metadata schema API, but that is different from treating a standalone object-backed vector retrieval layer as the full memory substrate. In HydraDB, vector and metadata indexing sit underneath a memory layer that also coordinates graph context, temporal state, and recall behavior.

The trade-off is first-query and cache-miss latency. Object storage is durable and cost-efficient, but it is a poor fit for workloads that require many dependent random reads unless the system adds caching, warm-up, prefetching, and careful segment management.

Milvus documentation reflects this trade-off. Lazy loading and partial loading reduce memory and disk pressure, while warm-up and tiered-storage cache policy help control query latency.

Tiered storage is useful for large collections, skewed access patterns, analytical workloads, batch retrieval, and mixed workloads that can tolerate occasional cache misses.

Interactive agent memory has less room for cache misses because a query may need fresh context, relationship expansion, and vector ranking inside a tight response budget.

Quantization trades precision for I/O efficiency

Quantization reduces the representation size of vectors so the system can keep more candidates in memory, read fewer bytes from storage, and compare more vectors per CPU cache line.

The foundational work on Product Quantization by Jégou, Douze, and Schmid established the basic trade-off. Smaller vector representations reduce memory and storage pressure, but they also reduce precision.

Production systems usually compensate by over-fetching compressed candidates and re-ranking them using full-precision vectors.

The byte savings can be large. A 1024-dimensional vector stored as 32-bit floats requires 4096 bytes. Binary quantization can represent that same dimensionality in 128 bytes, a 32x reduction before accounting for index overhead.

With 2048 dimensions, the float32 representation is 8192 bytes, and the binary representation is 256 bytes.

Those reductions can decide whether an index fits in memory or spills into a colder storage tier.
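The arithmetic above is easy to reproduce. The helper names and the 100-million-vector collection size below are assumptions for illustration, and the estimate covers raw vector bytes only, not index overhead.

```python
def float32_bytes(dims):
    """Bytes for one vector stored as 32-bit floats (4 bytes per dimension)."""
    return dims * 4

def binary_bytes(dims):
    """Bytes for one vector stored at 1 bit per dimension, rounded up to whole bytes."""
    return (dims + 7) // 8

def collection_gib(num_vectors, bytes_per_vector):
    """Raw vector storage in GiB, ignoring index overhead."""
    return num_vectors * bytes_per_vector / 2**30

for dims in (1024, 2048):
    full, packed = float32_bytes(dims), binary_bytes(dims)
    print(f"{dims} dims: float32={full} B, binary={packed} B, ratio={full // packed}x")

num_vectors = 100_000_000  # hypothetical collection size
print(f"float32: {collection_gib(num_vectors, float32_bytes(1024)):.1f} GiB")
print(f"binary:  {collection_gib(num_vectors, binary_bytes(1024)):.1f} GiB")
```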

The cost is recall risk. Aggressive quantization can distort distances enough to drop relevant candidates from the first-stage result set.

Quantization works best as part of a two-stage retrieval pipeline, with full-precision re-ranking.
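A minimal sketch of that two-stage pipeline, in pure Python. It assumes sign-bit binary quantization (one of several possible schemes), inner-product scoring, random Gaussian vectors as stand-in data, and a hypothetical `overfetch` multiplier; production systems would use vectorized code and a real index rather than a full sort.

```python
import random

def binarize(vec):
    """Pack a vector into an int code: one bit per dimension, set when the component is positive."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Ground truth: rank every vector at full precision."""
    return sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))[:k]

def two_stage_top_k(query, vectors, codes, k, overfetch):
    q_code = binarize(query)
    # Stage 1: cheap Hamming ranking over compressed codes, keeping k * overfetch candidates.
    shortlist = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))[: k * overfetch]
    # Stage 2: re-rank only the shortlist using full-precision vectors.
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]

random.seed(7)
dims, n, k = 64, 500, 10
vectors = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(n)]
codes = [binarize(v) for v in vectors]
query = [random.gauss(0, 1) for _ in range(dims)]

truth = set(exact_top_k(query, vectors, k))
for overfetch in (1, 5, 50):
    found = set(two_stage_top_k(query, vectors, codes, k, overfetch))
    print(f"overfetch={overfetch:>2}  recall@{k}={len(found & truth) / k:.1f}")
```

The over-fetch factor is the tuning knob: a larger shortlist recovers candidates that the compressed distances misranked, at the cost of more full-precision reads in the second stage.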

Read/write separation stabilizes serving

Another common pattern is separating read and write workloads.

Writes go to an ingestion or indexing path. Queries go to read-optimized replicas. This prevents index builds, compaction, and heavy write bursts from directly competing with low-latency search.

That separation helps with resource isolation. It keeps CPU-heavy index construction away from serving nodes. It also gives query replicas a more stable cache profile, which matters for HNSW traversal and metadata filters.

For interactive workloads, predictable cache residency can be more valuable than peak throughput.

The cost is freshness and operational complexity. The system must manage replication lag, index handoff, failure recovery, and version visibility.

Agent memory raises the cost of replication lag because stale context can be semantically wrong, not just slightly out of date. If an agent acts on superseded preferences or an old conversation state, the storage delay becomes a product bug.

What memory-native infrastructure should optimize for

The physical-layer friction between B-trees, vector graphs, and relationship graphs shows why agent memory is hard to build on top of existing retrieval assumptions.

Each access pattern is well understood on its own. The difficulty lies in combining them into a single low-latency query path without allowing one structure to sabotage the others.

Persistent agent memory adds versioned history, temporal context, user-specific state, permissions, semantic similarity, and explicit relationships. These complex requirements highlight the core differences between an agent memory layer and a traditional vector database. They demand a memory-native context layer, not a stack of disconnected retrieval components.

A memory-native architecture should treat metadata, vector similarity, relationships, temporal state, permissions, and version history as primary primitives. Cache policy, index maintenance, query planning, and I/O scheduling need to be aware of all of them.

HydraDB is built around this shape of retrieval. Its Memories model stores user-scoped context for personalization across sessions. Its Knowledge flow handles shared, tenant-wide context. Its metadata filters narrow recall to known scopes before ranking. Its context graph captures entities, relationships, temporal signals, and graph paths so agents can reason about how information connects.

At the architectural level, HydraDB addresses this through three primitives. The Sliding Window Inference Pipeline enriches each ingested chunk using a window of surrounding segments to resolve pronouns and capture user preferences, so retrieved chunks are semantically self-contained rather than disconnected fragments. The Git-Style Versioned Temporal Graph keeps relationships, state changes, and historical versions connected so agents can reason about what changed, when it changed, and which context is still valid. Multi-Stage Retrieval coordinates metadata constraints, graph expansion, and semantic ranking so filtering and relationship context are part of retrieval rather than post-processing after vector search.

HydraDB’s quickstart also shows the operational shape of this model: content is ingested, processed, embedded, and used to build graph context before recall. For knowledge sources, applications use full_recall. For user memories, they use recall_preferences. The response can include chunks, source metadata, graph context, and additional related context that can be passed into an LLM through a clean context builder.

Reliable agent memory depends on predictable latency, high recall, controlled permissions, historical traceability, and context that reflects how users and organizations change over time.

To see how HydraDB approaches agent memory, context graphs, and time-aware retrieval, visit hydradb.com or explore the HydraDB documentation.

AI Agent Memory Is a Database Problem

Aman Puri — Sat, 27 Jun 2026 15:39:55 +0000

TL;DR

Most "agent hallucinations" in production are database problems: stale reads, race conditions, missing audit trails, broken referential integrity. The model didn't fail. The infrastructure did.
Make the agent stateless. Push durable state into a purpose-built memory layer. The model reasons. The database handles versioning, isolation, conflict detection, and durability.
The primitives needed are familiar database engineering: read isolation, MVCC, event sourcing, bitemporality, relational constraints, schema evolution, retention, provenance.
Vector indexes are search tools, not state stores. Framework checkpoints solve single-run replay, not cross-run audit. Postgres + pgvector gets you ACID and joins, but you assemble bitemporality, event sourcing, and retention yourself.
Pay for assembly, or pay a vendor to ship the assembly as a single system. Pretending it's a model problem isn't defensible.

A customer downgrades from 500 to 200 seats on a Wednesday morning. The support agent processes the change. Two hours later, the renewal agent reads a cached account snapshot and sends a 500-seat renewal at full price. The customer's VP of Engineering replies asking if anyone at the company talks to each other.

The model didn't hallucinate. It reasoned correctly. The infrastructure handed it a stale snapshot.

This is a database problem, not a model problem.

At Zenith, we've been building AI agents for two years. We've run an AI-native marketing agency on them for over a year, and shipped agents into other products across industries. I've watched this hit every deployment. Different customer, different agents, different industry. Every time, the team blames the model first.

Why teams misdiagnose agent failures

When agents fail in production, engineers tweak the prompt, drop the temperature, swap Claude for ChatGPT. Sometimes the model did confabulate. Most of the time it didn't.

Andrej Karpathy has called reliable agents a "decade project," with "reliable memory" and continuous action over time as the bottleneck. He's right about the timescale. Agent memory isn't a novel research problem, though much of the industry treats it as one. The failures teams blame on it are stale reads, race conditions, missing audit trails, and broken referential integrity. We've solved these before. A few teams are starting to apply that experience to the agentic stack, but the frame isn't widely recognized yet.

Agents should be stateless. Memory should be a database.

Web engineering made the same shift twenty years ago. Stateful application servers don't scale, don't fail gracefully, and don't compose. The fix: push session and durable state out of the app server, into databases (Postgres, MySQL) and session stores (Memcached, Redis). The app server became a stateless transition function. Read state, process, write state, forget. Scaling, failover, and concurrency stopped being application-layer hacks.

LLMs are stateless compute. A model call is a pure function from (system prompt + context + user input) to (output + tool calls). The model is good at state transition: reasoning, resolving ambiguity, and merging signals. It's bad at mechanical guarantees: versioning writes, enforcing referential integrity, providing consistent snapshots, detecting concurrent write conflicts, and managing retention. That's the database's job.

This is a separation of concerns, not a capability boundary.

The database detects conflicts. The model resolves them. When two agents write conflicting preferences for the same customer, the database catches it via optimistic locking, version checks, or serializable transactions. The model reasons about the resolution: re-read both inputs, weigh recency against authority, escalate to a human.

The database enforces structure. The model interprets meaning. The database enforces that an Order references a valid Customer. The model interprets whether the extracted "order" from a conversation is an order or a casual mention.

The database versions every state change as an immutable event. The model decides what's worth committing.

The database provides isolation. The model assumes consistency. If the snapshot is stale, the model's reasoning is correct, but its premises are wrong.

The agentic stack today treats the context window as both compute and state. The prompt is where the agent reasons and where its memory lives, stuffed with retrieval results, conversation history, system instructions, and cached fragments. The context window is the new $_SESSION: mutable, ephemeral, bounded, unversioned, unauditable, with no isolation between reads and writes. It works until you need a second server, a restart, a concurrent user, or an audit trail.

The vector database next to it is a search index, not a state store. Hamel Husain on naive RAG: "What's dead is the 2023 marketing version of RAG. Chuck documents into a vector database, do cosine similarity, call it a day. This approach fails because compressing entire documents into single vectors loses critical information." Husain's critique is about retrieval. The write path has parallel problems. A vector store optimized for similarity search doesn't give you typed entities with enforced relationships, versioned writes, transactional read isolation, multi-writer concurrency control, or retention semantics.

The same shift, mapped:

Web (2005 → 2015)	Agents (2024 → 2026)
Stateful app servers (session in memory)	Stateful context windows (memory in prompt)
`$_SESSION`, sticky sessions, in-process state	Conversation history, retrieval-stuffed prompts, LRU caches in app code
The fix: stateless servers + relational DB + session store	The fix: stateless agents + purpose-built memory/context DB
PostgreSQL, MySQL, Redis, Memcached	HydraDB, Postgres + event sourcing, XTDB, custom assembly
Scaling = add servers, DB handles consistency	Scaling = add agents, memory layer handles consistency
Failover = any server picks up from DB state	Failover = any agent picks up from committed memory state
Session store with TTL, GC, eviction	Memory layer with retention, GC, temporal partitioning
DB detects constraint violations; app logic resolves	DB detects write conflicts; model reasons about resolution

Make the agent a stateless transition function. Make the memory layer a database.

Before each step, the agent reads relevant state with explicit isolation semantics (snapshot, read-committed, or stronger). The database guarantees consistency. The model trusts it. The agent reasons. The model produces outputs, decisions, tool calls. After each step, the agent writes resulting changes back as immutable events. The database handles versioning, conflict detection, durability. If a write conflicts with a concurrent write, the database surfaces it. The model or a human resolves it.

Between steps, between sessions, between agents, the database is the source of truth. No agent holds state. Any agent picks up any workflow from the last committed state.
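The read, reason, write loop described above can be sketched as a function that keeps nothing between calls. Everything here is hypothetical: `MemoryStore` is an in-memory stand-in for a real memory layer, and the `decide` callables stand in for model calls.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """In-memory stand-in for a durable memory layer (hypothetical)."""
    events: list = field(default_factory=list)   # append-only (entity_id, event) pairs

    def read_state(self, entity_id):
        """Fold committed events for one entity into its current state."""
        state = {}
        for eid, event in self.events:
            if eid == entity_id:
                state.update(event["changes"])
        return state

    def append(self, entity_id, event):
        self.events.append((entity_id, event))

def run_step(store, entity_id, agent_name, decide):
    """One stateless step: read committed state, reason, write the result back as an event."""
    state = store.read_state(entity_id)      # read
    changes = decide(state)                  # reason (stand-in for a model call)
    if changes:
        store.append(entity_id, {"agent": agent_name, "changes": changes})  # write
    # Nothing is retained here: the next step starts from the store again.

store = MemoryStore()
store.append("acct-1", {"agent": "seed", "changes": {"seats": 500}})

run_step(store, "acct-1", "support", lambda state: {"seats": 200})
run_step(store, "acct-1", "renewal", lambda state: {"renewal_quote_seats": state["seats"]})

final_state = store.read_state("acct-1")
print(final_state)
```

Because the renewal step reads committed state instead of a cached snapshot, it quotes the downgraded seat count. The sketch omits isolation levels and conflict detection, which a real memory layer would have to provide.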

The analogy isn't perfect. Web requests are short and don't reason about their own history. Agent workflows are multi-step and reason about what they believed three steps ago. The memory layer needs more than a session store: temporal queries, event sourcing, graph-aware context.

Four agent failure modes (only one is the model)

Most production "agent failures" get lumped together as hallucinations. They aren't. There are four failure classes. Only one lives in the model.

1. Model confabulation. The LLM invents facts not present in any context. Memory architecture can't fix it. The fix lives in the model layer: better grounding, structured output, retrieval before generation, evals.
2. Stale-context reads. The agent acts on memory another process has since invalidated, or on a snapshot assembled from cached fragments without transactional coherence. The agent equivalent of a stale read or read skew. The opening incident: the renewal agent read a 500-seat snapshot two hours after the support agent committed the downgrade.
3. State-continuity failure. The agent's durable session state is missing, partially checkpointed, or interpreted under a changed schema between turns. Restart-and-lose-context, or schema-migration-broke-replay. A durability and schema-evolution problem.
4. Cross-agent corruption. Two agents writing the same shared state without coordination, with the system silently accepting last-write-wins. A concurrency-control problem.

The four get conflated as "the AI hallucinated again" because the surface symptom is the same: the agent confidently outputs something wrong. The root causes are different. Three of them (2, 3, 4) are infrastructure problems commonly manifesting as the State Confusion Problem.

The infrastructure primitives that address those failures are read isolation, concurrency control, event sourcing, bitemporality, relational constraint enforcement, schema evolution, retention, and provenance.

Stale reads vs. dirty reads in agent memory

One agent commits a change. Another reads a cached snapshot from before the commit and acts on outdated state. That's a stale read.

Tacnode argues that stale context is the root cause of a class of agent failures usually blamed on the model: "Wheel-spin isn't a model problem. It's an infrastructure problem."

This isn't a dirty read. A dirty read means reading data written by a concurrent uncommitted transaction. A stale read in agent memory is reading old committed state, or a snapshot assembled from cached fragments with no shared transactional boundary. In database terms, that is closer to read skew or a non-repeatable read than to a dirty read. The precision matters because the fix differs.

Production agents need at minimum read-committed semantics. Snapshot or stronger is better. Prompt context windows have undefined isolation. The prompt is assembled from a vector retrieval, a Redis cache lookup, a session JSON blob, and a tool output, with no transactional coordination. Whatever the agent reads is whatever each subsystem happened to return.

Many vector databases are eventually consistent. Writes propagate to indexes asynchronously, with documented lag windows. That's a known shape, not a one-off bug. It's fine for retrieval-augmented generation over a static corpus. It's not fine for an agent reading and writing to the same store. The agent needs at least a consistent snapshot of committed writes, ordered against a known transaction-time horizon. Most agent memory implementations weren't built to give that guarantee.

The fix is to give the agent a defined isolation level when it reads. Isolation level is a property of a single transactional system, not a federation of stores stitched at the application layer (vector DB + cache + JSON blob).

Multi-agent write conflicts: the database detects, the model resolves

Two agents read the same record at the same time. Both write. Under last-write-wins (the default in most agent memory implementations), one intent silently disappears. The audit log shows the survivor. No trace of the conflict.

This isn't theoretical. Tian Pan documented the failure mode in production: "Three agents processed a customer account update concurrently. All three logged success. The final database state was wrong in three different ways simultaneously... The team spent two weeks blaming the model. It wasn't the model. It was a race condition." A LangGraph practitioner running five Claude Code agents on the same repo: "about 24% of intended changes disappeared while the build still passed, and one agent's auth routes were completely lost." The surface symptom looks correct: a passing build, a database that reads back consistently. The error is in what's missing, not what's wrong.

Most agent frameworks solve multi-agent coordination via queue-based orchestration (Temporal, Inngest), turn-taking protocols, or last-write-wins acceptance. Not via database concurrency primitives. Queues aren't wrong. They're a reasonable answer for some workflows. But the agent state layer should give you the option of database-grade conflict detection when you need it, not force you to build coordination outside the data layer.

Under MVCC plus expected-version checks, or under serializable transactions where the conflict is expressible to the database, the second writer's commit is rejected. The database surfaces both intents in the audit log. The model decides which wins, weighing source authority, recency, and specificity, or escalates to a human. A model is well-suited to that judgment. It can't make it if the infrastructure silently swallowed the conflict.

MVCC alone gives you snapshot reads. It doesn't detect every semantic conflict at the application level. Conflict detection requires explicit optimistic concurrency control: expected-version checks, compare-and-swap, serializable isolation, or stream-version checks. That layer needs to exist, exposed to the agent, with semantics that the agent or a human can reason about.
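An expected-version check is small enough to sketch directly. The `VersionedStore` class, key names, and agent names below are hypothetical, and the example is single-threaded: a real store would perform the compare and the write atomically inside a transaction or with a compare-and-swap primitive.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version no longer matches the stored one."""

class VersionedStore:
    def __init__(self):
        self._rows = {}        # key -> (version, value)
        self.audit_log = []    # both committed and rejected intents are recorded

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version, writer):
        current_version, _ = self.read(key)
        if expected_version != current_version:
            # Reject instead of overwriting, and keep the losing intent visible.
            self.audit_log.append(("rejected", writer, key, value))
            raise VersionConflict(
                f"{writer} expected v{expected_version} of {key}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        self.audit_log.append(("committed", writer, key, value))

store = VersionedStore()
store.write("acct-1/contact_pref", "email", expected_version=0, writer="seed")

# Two agents read the same version before either writes.
version_a, _ = store.read("acct-1/contact_pref")
version_b, _ = store.read("acct-1/contact_pref")

store.write("acct-1/contact_pref", "phone", expected_version=version_a, writer="support-agent")
try:
    store.write("acct-1/contact_pref", "sms", expected_version=version_b, writer="billing-agent")
    conflict = None
except VersionConflict as exc:
    conflict = exc   # surfaced to the model or a human for resolution

print(conflict)
print(store.audit_log)
```

Under last-write-wins, the second write would have silently replaced the first. Here the second intent is rejected and recorded, so whoever resolves the conflict can see both.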

Event sourcing vs. checkpointing for agent state

A customer disputes an outcome that an agent committed months ago. A regulator asks how a decision was made. An internal review needs to understand why the agent acted on stale data. The team has to reconstruct what the agent read, what was said in the conversation, what the agent committed, and what downstream systems consumed the commitment. If memory only stores the current state, there is no trail for proper decision traceability and auditability.

Aaron Cheiffetz at Finextra frames it as the audit-trail problem: conventional logs are "telemetry, not testimony." They capture what an agent did, not why.

At Zenith, we run this audit pattern often. When customers flag content that's gone stale, we need to know whether the agent had the right source documentation at decision time, or whether something else caused the mis-grounding. Without an immutable trail, the answer is a guess.

LangGraph, CrewAI, and AutoGen ship state checkpointing. LangGraph's persistence layer saves graph state as snapshots and supports time-travel by replaying from a checkpoint. CrewAI persists workflow state to SQLite or vector stores.

Event sourcing is the broader primitive. The argument is structural, not stylistic. Checkpointing answers "redo this run from step N." It captures one moment in one run. Event sourcing answers four things checkpointing alone does not.

First, queries that span multiple agent runs and sessions. Such a dispute spans months, multiple agents, and downstream systems. No single LangGraph checkpoint covers it. An event log scoped to the affected entity does.

Second, disciplined schema evolution. Events are immutable. Readers tolerate version envelopes. Old events project into the new schema via upcasters. Event sourcing gives you a path, not a free lunch. This is how you keep historical records interpretable when the customer model changes. Snapshots don't. When the snapshot schema changes, the old snapshots are dead weight or require migration.

Third, rebuilding derived views without reprocessing every agent execution. Per-customer summaries, per-tenant aggregates, per-tool usage analytics: all are projections over the event stream. New view, new projection. Not a new run of every historical agent.

Fourth, cross-agent and cross-session audit trails. Reconstructing the dispute requires tracing a decision across multiple agents and downstream systems. Framework checkpoints are scoped to a workflow run. Event logs are scoped to the entity. The disputed entity's history spans every agent and system that touched it.

For single-agent, single-session workflows, framework checkpoints are a reasonable lighter-weight choice. For anything that touches the same entity across runs (which is most production agent work), event sourcing is the broader primitive, and checkpointing is a special case of it. Tian Pan's framing of "agent state as event stream" scales with cross-run, cross-agent reasoning.
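The difference in scope shows up in a few lines of code. The event shapes, run identifiers, and projection names below are invented for illustration; the point is only that an entity-scoped query and a new derived view are both plain reads over the same log.

```python
from collections import Counter

# Hypothetical append-only log shared by several agents and runs.
event_log = [
    {"seq": 1, "entity": "acct-1", "run": "support-17", "agent": "support",
     "type": "SeatsChanged", "data": {"seats": 200}},
    {"seq": 2, "entity": "acct-2", "run": "support-18", "agent": "support",
     "type": "SeatsChanged", "data": {"seats": 50}},
    {"seq": 3, "entity": "acct-1", "run": "renewal-04", "agent": "renewal",
     "type": "RenewalQuoted", "data": {"seats": 200}},
    {"seq": 4, "entity": "acct-1", "run": "billing-09", "agent": "billing",
     "type": "InvoiceIssued", "data": {"seats": 200}},
]

def entity_history(log, entity):
    """Everything that touched one entity, across every run and agent."""
    return [e for e in log if e["entity"] == entity]

def current_seats(log):
    """Projection: latest seat count per entity."""
    seats = {}
    for e in log:
        if e["type"] == "SeatsChanged":
            seats[e["entity"]] = e["data"]["seats"]
    return seats

def events_per_agent(log):
    """A second projection added later, without re-running any agent."""
    return Counter(e["agent"] for e in log)

print([e["run"] for e in entity_history(event_log, "acct-1")])
print(current_seats(event_log))
print(events_per_agent(event_log))
```

A per-run checkpoint would hold only one of those three runs for `acct-1`. The entity-scoped history holds all of them in order.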

Bitemporality: system time vs. valid time

An agent acts on a fact. Later, the fact changes, or gets retroactively corrected. Two questions arise. What did the agent see when it acted? And what was actually true at that moment?

These are different questions. The first is system time: the value the database held when the agent read it. The second is valid time: the value that was true in the world. Without bitemporality, you can't answer either cleanly. You replay logs by hand and hope nothing was overwritten.

The pattern is well established in regulated industries. In financial trading, trades get amended after the fact, and without bitemporal records of those amendments, firms face regulatory fines. The standard examples (backdated insurance policies, invoice corrections, retroactive HR changes) all need the same primitive.

At Zenith we run agents that watch product documentation and update derived customer content when the source changes. When the system slips, the post-mortem needs both axes: when our knowledge base recorded the change versus when it actually happened in the world. The gap tells us whether we were late or whether the content was wrong from the start.

That distinction separates agents that support auditable replay, root-cause debugging, and post-hoc correctness checks from black-box stateful blobs. Most agent-memory work doesn't draw it.

Bitemporal databases go back to Snodgrass in the early 1990s. Datomic productionized transaction-time history with [as-of](https://docs.datomic.com/reference/filters.html), [since](https://docs.datomic.com/reference/filters.html), and [history](https://docs.datomic.com/reference/filters.html) filters. XTDB v2 went further: "Unlike other SQL databases, XTDB tracks both 'system time' and 'valid time' automatically... All tables are bitemporal tables." Queries like SELECT * FROM customers FOR VALID_TIME AS OF '2026-03-12' are a single SQL statement, not a custom event-log replay scaffold built in app code. An immutable temporal substrate gives you "history that cannot lie, queries that cannot race, and a model you can reason about in daylight". An agent memory layer worth auditing needs the same.

Postgres can do this with discipline: temporal_tables for system-period support, application-maintained valid time, careful trigger plumbing. Bitemporality isn't exotic. Any production agent making decisions worth auditing needs both axes. Most teams are still assembling them by hand on stores that don't expose them natively.
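To make the two axes concrete, here is a deliberately simplified in-memory sketch. The `Fact` shape, the dates, and the `as_of` helper are assumptions for illustration: real bitemporal stores also track valid-to and system-to bounds and handle corrections more carefully.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Fact:
    entity: str
    seats: int
    valid_from: date     # valid time: when it became true in the world
    recorded_at: date    # system time: when the store learned about it

def as_of(facts, entity, valid_at, known_at):
    """Value true at valid_at, according to what the store had recorded by known_at."""
    visible = [
        f for f in facts
        if f.entity == entity and f.recorded_at <= known_at and f.valid_from <= valid_at
    ]
    if not visible:
        return None
    # The latest valid_from wins; among equals, the most recently recorded correction wins.
    return max(visible, key=lambda f: (f.valid_from, f.recorded_at)).seats

facts = [
    Fact("acct-1", 500, valid_from=date(2026, 1, 1), recorded_at=date(2026, 1, 1)),
    # The downgrade took effect on March 10 but was only recorded on March 12.
    Fact("acct-1", 200, valid_from=date(2026, 3, 10), recorded_at=date(2026, 3, 12)),
]

acted_on = date(2026, 3, 11)
agent_saw = as_of(facts, "acct-1", valid_at=acted_on, known_at=acted_on)
actually_true = as_of(facts, "acct-1", valid_at=acted_on, known_at=date(2026, 3, 15))
print(f"agent saw {agent_saw} seats; {actually_true} seats were actually in effect")
```

The gap between the two answers is the post-mortem finding: the agent reasoned correctly over what the store knew, and the store was two days late.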

Relational integrity, not vector metadata filters

An Order must reference a valid Customer. A Renewal must link to an existing Contract. A Discount must apply to a specific tier. These are constraints. The database enforces them at write time. A vector index can't.
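Write-time enforcement looks like this in practice. The sketch uses SQLite from the Python standard library only because it runs anywhere; the table and column names are hypothetical, and note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK enforcement off by default

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        seats INTEGER NOT NULL CHECK (seats > 0)
    )
    """
)

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders (customer_id, seats) VALUES (1, 200)")   # valid reference

try:
    # An agent extracted an "order" for a customer that does not exist.
    conn.execute("INSERT INTO orders (customer_id, seats) VALUES (999, 50)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print(f"write rejected: {exc}")

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"orders stored: {order_count}")
```

The invalid write fails at the database boundary, before any agent can read it back as if it were a fact.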

Vector databases aren't untyped blob stores. Pinecone supports metadata filtering. Weaviate has schema definitions and cross-references. Qdrant has typed payload indexes. The mechanism exists. It isn't relational integrity.

When an agent pulls a contract and the linked seat-count and tier records, a vector-only layer can't guarantee those records are from the same coherent state. Two records that match a query by similarity have no enforced relationship to each other. The agent gets an inconsistent profile and reasons over it confidently. Nothing manages state across those records, so the context rots, which is why autonomous agents need a dedicated memory layer rather than a stateless vector database.

No foreign keys across entity boundaries. No referential integrity enforced on write. No schema migrations with constraint validation. No transactional invariants spanning multiple records. A Weaviate forum thread, "Messing up search results under parallel write operations," reports results that don't reflect concurrent updates. The fix isn't "add more retrieval." It's a typed entity graph with enforced constraints.

Schema evolution and retention in event-sourced memory

A schema upgrade hits months into production. "Contact preference" splits into "support" and "billing" channels. A new "churn risk score" field is added. Product wants to query historical data with the new shape. Events from before the upgrade don't have churn risk scores. They have one "contact preference" field, not two. If events are immutable with versioned envelopes and tolerant readers (or upcasters), old events project into the new schema on read. If the team ran a destructive migration, the original context is lost and the projection is a guess.
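The upcaster idea can be sketched in a few lines. This is an illustrative example, not code from any event-store library: the envelope layout and the field names (`contact_preference`, `support_channel`, `billing_channel`, `churn_risk_score`) are assumptions. So is the choice to copy the old single preference into both new channels.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def upcast_v1_to_v2(event: Event) -> Event:
    """Project a v1 event into the v2 shape without touching the stored original."""
    payload = dict(event["payload"])  # copy, so the immutable original is untouched
    preference = payload.pop("contact_preference", None)
    # Assumption: the single v1 preference applied to both channels.
    payload["support_channel"] = preference
    payload["billing_channel"] = preference
    # v1 events predate the score; mark it unknown instead of inventing a value.
    payload["churn_risk_score"] = None
    return {**event, "version": 2, "payload": payload}


UPCASTERS: Dict[int, Callable[[Event], Event]] = {1: upcast_v1_to_v2}
CURRENT_VERSION = 2


def read_event(stored: Event) -> Event:
    """Tolerant reader: apply upcasters until the event reaches the current version."""
    event = stored
    while event["version"] < CURRENT_VERSION:
        event = UPCASTERS[event["version"]](event)
    return event


old_event = {
    "version": 1,
    "type": "ProfileUpdated",
    "payload": {"contact_preference": "email"},
}
projected = read_event(old_event)
print(projected["payload"])
```

The stored event keeps its original shape. The new shape exists only on read, which is what lets a later upcaster revise the projection if the first mapping turns out to be wrong.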

This is event-sourcing 101 applied to agent belief schemas. Greg Young and Vaughn Vernon have been writing about event versioning for over a decade. The patterns (version envelopes, weak schema, upcasters, copy-and-replace) are well established. They require discipline up front. Most agent memory implementations don't commit to it because they were built around current-state representations, not append-only event logs.

Retention is the other side of immutability. Different parts of an agent system have different retention requirements. A data processing agreement requires deleting conversation transcripts after 12 months. Compliance keeps reasoning traces for 7 years. Quality monitoring retains embeddings for 30 days. Can the memory layer express "keep the event log, compact the embeddings, crypto-shred the transcripts" without a manual migration?

Immutability is the substrate, but legal deletion and tenant isolation cut across it. For GDPR, EventStoreDB requires destructive "scavenge" operations that permanently remove deleted or expired events from the global stream. Datomic uses excision. Postgres-backed event stores typically use crypto-shredding: encrypt PII fields with per-record keys, then delete the key when the record must be erased. Postgres also gives you VACUUM, MVCC visibility horizons, partition-based retention, and TTL extensions. Agent memory layers need analogous primitives: temporal partitioning, retention policies for different data classes, a vacuum equivalent that reclaims space without breaking replay, and crypto-shredding for fields that can't remain even in compacted history.

The tension is well-documented in event-sourcing literature. Most agent memory implementations have none of the primitives. They're running on borrowed time before the storage bill, the latency profile, or the first compliance audit forces a rewrite.

Provenance and auditability in agent memory

Database primitives ensure state correctness: what was true, when, and how it changed. Epistemic correctness is a separate concern: why an assertion is true. Agent memory isn't just facts. It's assertions, evidence, confidence, source, extraction method, and revision history. A typed graph can enforce that an Order references a valid Customer. It can't enforce that the extracted "order" from a conversation is an order rather than a casual mention. A database enforces structure, not reality.

Every assertion should carry provenance: the source utterance, the extraction model and version, the confidence, the valid-time claim, and a link to the prior assertion it supersedes. A versioned belief ledger over evidence, not a fact store. When a customer says "switch billing contact to the IT admin," the stored assertion is the customer's stated billing-contact preference at that conversation turn, linked to the turn itself. Later updates create new assertions linked to their sources, not destructive overwrites.
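One way to picture such a ledger is the sketch below. It is illustrative only: the `Assertion` and `BeliefLedger` classes, their field names, and the sample identifiers (`turn-0007`, `extractor-v1`, and so on) are hypothetical, not an API from any product mentioned here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Assertion:
    assertion_id: str
    subject: str
    predicate: str
    value: str
    source_turn_id: str               # conversation turn the claim was extracted from
    extractor: str                    # extraction model and version
    confidence: float
    valid_from: datetime              # valid-time claim
    recorded_at: datetime             # when the ledger recorded the assertion
    supersedes: Optional[str] = None  # id of the prior assertion, if any


class BeliefLedger:
    """Append-only list of assertions; nothing is updated in place."""

    def __init__(self) -> None:
        self._entries: List[Assertion] = []

    def append(self, assertion: Assertion) -> None:
        self._entries.append(assertion)

    def current(self, subject: str, predicate: str) -> Optional[Assertion]:
        """Return the newest assertion that no other assertion supersedes."""
        matching = [a for a in self._entries
                    if a.subject == subject and a.predicate == predicate]
        superseded = {a.supersedes for a in matching if a.supersedes}
        live = [a for a in matching if a.assertion_id not in superseded]
        return max(live, key=lambda a: a.recorded_at) if live else None

    def lineage(self, assertion_id: str) -> List[Assertion]:
        """Follow supersedes links from an assertion back to its origin."""
        by_id: Dict[str, Assertion] = {a.assertion_id: a for a in self._entries}
        chain: List[Assertion] = []
        cursor = by_id.get(assertion_id)
        while cursor is not None:
            chain.append(cursor)
            cursor = by_id.get(cursor.supersedes) if cursor.supersedes else None
        return chain


ledger = BeliefLedger()
ledger.append(Assertion("a1", "customer-42", "billing_contact", "finance lead",
                        "turn-0007", "extractor-v1", 0.90,
                        datetime(2026, 1, 5), datetime(2026, 1, 5)))
ledger.append(Assertion("a2", "customer-42", "billing_contact", "IT admin",
                        "turn-0311", "extractor-v2", 0.80,
                        datetime(2026, 3, 12), datetime(2026, 3, 12),
                        supersedes="a1"))

latest = ledger.current("customer-42", "billing_contact")
history = [a.assertion_id for a in ledger.lineage(latest.assertion_id)]
print(latest.value, latest.source_turn_id, history)
```

The current belief and the trail that led to it come from the same append-only list. An audit reads the lineage; the agent reads the head.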

This changes what the agent can do. Reason about its own beliefs. Distinguish high-confidence facts from low-confidence inferences. Audit decisions back to evidence. It also changes what the system can claim about its outputs. The database stores provenance. The model uses it to reason.

Simon Willison's "lethal trifecta" (private data + untrusted content + external action) makes auditability the floor for any agent system handling production data. Provenance is how you reach that floor. A graph without provenance is a belief ledger you can't audit.

Four agent memory architectures compared

Architecture	Examples	ACID + Isolation	Bitemporality	Multi-writer concurrency	Audit / Event sourcing	Best for
Vector indexes	Pinecone, Weaviate, Qdrant, Milvus	Eventual consistency	None	None	Limited	RAG with stateless agents
Postgres + pgvector	—	Native	Assemble (temporal_tables)	Native (Postgres MVCC)	Assemble	Teams with DB engineering bandwidth
Framework checkpointing	LangGraph, CrewAI, AutoGen, LlamaIndex	Inherits backing store	Single-run only	Queue-based or last-write-wins	Per-run snapshots	Single-agent workflows
Purpose-built memory	HydraDB, Mem0, Zep, Letta, Cognee, Graphiti	Mostly not exposed	Both axes in leaders (HydraDB, Zep).	Versioned edges in HydraDB; varies elsewhere	Append-only graphs	Multi-agent, cross-run state

Dedicated vector indexes (Pinecone, Weaviate, Qdrant, Milvus) solve similarity search well. They have schema definitions, cross-references, typed payload indexes, metadata filtering. They don't provide relational integrity across entity boundaries, multi-entity transactional invariants, bitemporal queries, or multi-writer concurrency control. Acceptable for retrieval-augmented generation with stateless agents. Unsuitable as the canonical store for stateful agents.

Postgres + pgvector is a credible canonical store. It inherits Postgres ACID, joins, foreign keys, point-in-time recovery, and schema migrations. The right starting point if your team has database engineering bandwidth. The tradeoff: you're assembling bitemporality, event sourcing, graph projections, retention, and agent-specific ergonomics yourself, on a stack where schema migrations cross engines and transaction boundaries split across stores. The pieces exist. The integration cost is months of senior engineering, and the seams break under production load. Bitemporal correctness becomes your team's problem to maintain.

Agent framework checkpointing (LangGraph, CrewAI, AutoGen, LlamaIndex Workflows) solves replay within a single run. Persistence layers serialize graph state to SQLite, Postgres, or Redis. Doesn't solve cross-run bitemporal queries, multi-agent concurrency on shared entities, or schema evolution across belief versions. Acceptable for single-agent or queue-orchestrated workflows. The right complement to a database, not a replacement.

Purpose-built agent-memory systems (Mem0, Zep, Letta, Cognee, Graphiti, HydraDB) are an emerging category. As teams evaluate Mem0 and Zep alternatives for production, it becomes clear that each makes a different bet:

Mem0: token-efficient memory with single-pass ADD-only extraction. Optimized for cost and throughput, with less emphasis on long-context coherence. Not pitched as a transactional system.
Zep / Graphiti: temporal context graph with valid_at and invalid_at markers, sub-200ms retrieval. Bitemporal-leaning, without explicit ACID/MVCC claims.
Letta (formerly MemGPT): memory managed by the LLM in an "OS paradigm." Innovative on agent ergonomics, less aligned with database-grade semantics.
Cognee: knowledge engine combining graphs and vectors. Less emphasis on bitemporality.
HydraDB: append-only temporal graph with versioned edges (writes preserved with system and event timestamps), automatic graph extraction without manual schema, Sliding Window Inference Pipeline for chunk enrichment, shared cross-agent memory ("Hive Memories"). Read/write paths isolated; ACID-style isolation levels and commit-time MVCC on the roadmap.

Each addresses a piece of the agent-memory shape. None, including HydraDB, exposes the full set of database-grade primitives: ACID transaction guarantees, explicit isolation levels, full Snodgrass-style bitemporality (system time + valid time as queryable dimensions), MVCC with optimistic concurrency control surfaced to the agent, and disciplined schema-evolution tooling.

A vector DB isn't enough. A general-purpose OLTP database isn't enough either. It gets you ACID and relational integrity, but you assemble the rest. Furthermore, forcing B-tree metadata filtering, vector search, and graph traversal into a single query path creates a storage-layer crisis that degrades performance. Every team running a stateful agent in production is paying for the missing pieces somewhere: in glue code, in query latency, in audit gaps, in rewrites. Pay for assembly, or pay a vendor to ship the assembly as a single system.

When agents don't need a memory database

Not every agent needs database-grade memory. An agent that runs once, doesn't share state, and doesn't write back to anything external can get away with a vector index and a session blob. A conversational agent with no consequential actions outside the chat doesn't need durable, audited, coordinated state. A coding agent running on one developer's machine has no multi-agent concurrency to worry about. A research-summary agent whose value is the output, not the persisted facts, doesn't need a belief ledger.

The frame matters when an agent is stateful, multi-step, and shares state with other agents or systems. That's most production agent work. It isn't all of it.

Where HydraDB fits

HydraDB sits in the purpose-built lane. It ships a Git-style append-only temporal graph, automatic entity and relationship extraction without manual schema definition, a Sliding Window Inference Pipeline that makes chunks self-contained before retrieval by resolving entity references and embedding contextual bridges, multi-tenant and sub-tenant isolation, metadata-filtered and graph-enriched recall, recency-biased search, and shared cross-agent memory through Hive Memories. The temporal graph is real append-only history, not destructive overwrites: the substrate the event-sourcing reframe demands. Concurrent writes between two entities are preserved as separate versioned edges with system-time and event-time metadata, not silently overwritten. Read and write paths are explicitly isolated, so heavy ingestion doesn't degrade read latency. Each edge carries reasoning context, sentiment, and situational factors as metadata, encoding the "why" of every state transition. The graph extraction provides relational context at retrieval time (entity paths, relationship types, temporal signals) that vector-only stores don't.

In production, HydraDB sustains 2.5 million tokens per minute of ingestion with sub-500ms wait times at peak, and has handled bursts of 50 million tokens per hour under noisy-neighbor conditions. The write side is what makes a memory layer viable as a canonical store, not just a retrieval cache.

The architectural choices show up in benchmarks. On LongMemEval-s, the long-context conversational memory test (500 question-conversation stacks, 115k tokens average per stack), HydraDB scores 90.79% overall: state of the art by +5 points over the next-best purpose-built system, with 100% on single-session recall and 90.97% on temporal reasoning.

The edge model captures both transaction time and valid time per relationship. A SQL-like queryable interface for these axes (like XTDB's FOR VALID_TIME AS OF) isn't shown in public docs. The graph gives relational context at read time but doesn't enforce relational constraints at write time. Compliance covers RBAC, SSO, audit logs, and encryption, not first-party SOC 2 or GDPR attestations. Deployment options include managed cloud and BYOC: a HydraDB cluster running inside the customer's AWS VPC, with observability, CI/CD, and scale-to-zero baked in.

HydraDB's roadmap targets several of these layers directly: weighted nodes for graph-level reasoning, state-space-model architectures for more efficient memory tracking, orchestration harnesses for switching between file-system memory and long-term memory, file-system memory APIs for cloud-native deployments, end-to-end observability into ingestion processing, and a production version of the Bio-Mimetic Decay Engine for retention.

I think the category will mature toward database-grade semantics, not get displaced by Postgres + pgvector + assembled primitives. HydraDB is one of the systems making that bet credibly today.

Agent memory is a database problem

The renewal agent in the opening didn't fail because it was a bad agent. It read what the infrastructure handed it. The same is true of every other failure here: stale reads, dropped writes, lost audit trails, retroactive corrections with no replay, multi-record incoherence, missing provenance. Database problems with database fixes.

The compute layer is stateless. The model reasons. The infrastructure underneath has to give it consistent snapshots, durable writes, conflict detection, multi-record consistency, bitemporal queries, retention controls, and provenance. Most of the pieces exist somewhere. They don't ship together as one system that an agent team can adopt without months of integration work.

The question for engineering leaders building stateful agents: pay the assembly cost on Postgres and a stack of glue, or pay a vendor to ship the assembly. Both are defensible answers. What isn't defensible is pretending the problem isn't a database problem.

Git for Context: Versioned Temporal Graphs for AI Agent Memory

Aman Puri — Sat, 27 Jun 2026 15:39:10 +0000

TL;DR

Most agent memory failures look like hallucinations. They are not. The model reasons correctly over a stale fact that the memory layer fed it. That is a database failure, not a model failure.
Destructive updates create the State Confusion Problem. The seemingly obvious fix (have an LLM resolve facts at write time) breaks two ways: it silently purges history when the resolution model hallucinates equivalence, and it adds an LLM call to every ingested chunk.
The architecture that works borrows the shape of Git. Edges are append-only commits, each carrying both a transaction time and a valid time. Nothing is overwritten. The current state of any relationship is determined by the edge log.
On LongMemEval-s, this architecture scores 90.79% overall, more than 5 points absolute over the next-strongest published system, with the largest gains in Knowledge Update (97.43%) and Temporal Reasoning (90.97%): the categories where temporal versioning matters most.
The architecture is not free. Storage grows. Write-path enrichment costs cycles. Query plans get more involved. The paper argues these costs are paid efficiently and scale correctly with workload.

A user moved from New York to London in October 2024. The next month, an agent planning their weekend reads memory and suggests a Brooklyn dinner reservation, citing a 2022 conversation about a favorite spot in Williamsburg.

The model didn't hallucinate. It reasoned correctly. The memory layer handed it a stale fact and silently dropped the correction.

This is a database failure, not a model failure.

In my previous piece, I argued that agent memory is a database problem: agents should be stateless, state should live in a purpose-built store with the primitives database engineers already know how to build. One of those primitives does the most work in production: a versioned, bitemporal graph for relationships and facts that change over time.

The shape HydraDB has ended up with is a composite architecture: it combines traditional vector search and B-tree indexes with a foundational structure that looks a lot like Git. Every state transition, whether a user's preference shifts, a customer's seat count drops, or an internal system's owner changes, is committed as an append-only edge with both a transaction time and a valid time. The current state of any relationship is a function over the full commit history of that edge: "all commits to this edge, ordered by time, filtered to t ≤ now." Reading state at any historical point is a function over the same edge log, not a custom replay scaffold built in app code. The "what changed and why" question has the same shape.

The architecture and the empirical numbers come from the HydraDB paper, Beyond Context Windows for Long-Term Agentic Memory.

The State Confusion Problem: when agent memory overwrites correct facts

Most memory layers handle a state change by overwriting the old value. A user's location, a customer's tier, an account's primary contact: the new value replaces the old one in storage. While vector stores can do this via explicit ID-based upserts, more commonly they simply store the new chunk alongside the old one, completely unlinked. This leaves no deterministic way to fetch the "latest" entry without building custom application logic. Standard graph databases do this by updating the edge in place. KV stores do it by writing to the same key. The old value is gone unless someone wrote a separate audit log.

For most application workloads, this is the correct default. For agent memory, it is destructive.

Consider a user who in 2022 says, "I live in New York because I work at startup XYZ, headquartered in NYC." In 2024 the same user says, "I live in London because I switched to Meta. I moved to be closer to my parents."

In a destructive store, one of two things can happen. The system overwrites NYC with London, and the trail of why the move happened, when it happened, and what the world looked like at the earlier point disappears. Or the system stores both as separate facts with no temporal ordering, and at retrieval time the agent gets two competing values without knowing which is current. Either way, the agent loses the timeline, the reasoning, and the decision tree.

The paper calls this the State Confusion Problem. It is not a corner case. Every long-running agent runs into it within weeks of production traffic.

The cost is concrete. A scheduling agent suggests dinner in a city the user moved away from. A renewal agent quotes the prior year's seat count after a downgrade has been committed. A health-tracking agent acts on a dietary preference the user reversed three months ago. The model reasoned correctly. The memory layer fed it the wrong premise.

Why the Iterative Resolution Loop breaks at scale

The first instinct most teams have is to add an LLM-mediated update step on the write path. For every incoming fact, vector-search for similar existing facts, then ask an LLM to decide whether the new fact updates, supersedes, or coexists with the old one. This pattern, called the Iterative Resolution Loop in the paper, shows up in nearly every agent memory codebase that has tried to handle changing state.

It breaks two ways.

The first failure mode is what the paper calls Instability via False Positives. Semantic similarity does not imply factual redundancy. For example, if a user says "I love the Next.js framework" and a year later says "I love Angular," it is incredibly difficult for an LLM to conclusively determine if the user has abandoned Next.js (requiring an overwrite) or simply loves both. Asking an LLM to "resolve" these at write time produces a False Positive Delete whenever the resolution model makes the wrong probabilistic guess and hallucinates a replacement: a destructive overwrite driven by a judgment call. The system silently purges history. There is no principled way to recover the lost prior state.

The second failure mode is the O(N) Latency Trap. Every ingested chunk triggers a retrieval-and-reasoning step. For a system ingesting tens of thousands of facts per day per tenant, that is tens of thousands of vector searches and tens of thousands of LLM calls on the write path. The cost and latency profile is incompatible with any production write throughput target.

Both failure modes share a root cause. The Iterative Resolution Loop tries to maintain a single canonical "current state" by reconciling at write time. Reconciliation requires inference. Inference at write time is expensive and unsafe. The cleaner move is to stop reconciling at write time entirely.

How HydraDB stores entity state as Git-style commits

Borrow the shape from version control. A Git repository does not store a single canonical state of the codebase that is overwritten on every change. It stores an append-only log of commits, each carrying a parent reference, a timestamp, and metadata describing the change. The current state of any file is a function over the commit log: take all commits touching this file, ordered by time, and resolve.

HydraDB takes the same shape for relationship and entity state. Edges are not single records updated in place. They are append-only, time-ordered sequences of commits. The paper formalizes this directly. For two entities u and v, the set of all edges between them is E(u,v). Each edge e_k is a tuple:

e_k = (r_k, t_commit, t_valid, C_meta)

Four fields, each load-bearing:

r_k is the semantic relation: LOCATED_IN, PREFERS, WORKS_AT, CAUSED_BY, BLOCKED_BY. The graph is typed.
t_commit is the ingestion timestamp. When the system learned this fact.
t_valid is the real-world validity time. When the fact actually became true in the world.
C_meta carries the contextual metadata: the source utterance, the sentiment, the reasoning, the alternatives considered, the situational factors. The "why" of the state change, not just the "what."

The split between t_commit and t_valid is the bitemporal axis. A user might tell the agent in March that they moved in October. The transaction-time history records that the commit happened in March. The valid-time history records that the move happened in October. Most memory layers conflate these. A bitemporal architecture answers two distinct questions cleanly: what did the agent know on March 12, and what was actually true on March 12. The first is an audit query. The second is a correctness query. Both matter, and they are not the same query.

When state changes, no edge is mutated. A new edge is appended to E(u,v) with a fresh t_commit and t_valid. The previous edge stays. The current state of the relationship is a function over the full edge history:

ΔState(u, v) = SortByTime(E(u, v)), filtered to t_valid ≤ t_now

Reading state at any historical point is the same function with a different t_now. The append-only log also answers the historical "blame" question: what facts did the agent commit between dates X and Y, and what conversation context drove them? Both fall out of the same data model.
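A minimal sketch of that resolution function follows, assuming a hypothetical `Edge` record and invented dates. The user's move took effect in October 2024 but was only reported on 20 March 2025. This is not HydraDB's implementation or API. The optional `known_by` argument is an assumption added here to express the audit query on the t_commit axis.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    relation: str
    target: str
    t_commit: date   # when the system learned the fact
    t_valid: date    # when the fact became true in the world
    context: str     # stand-in for C_meta


def resolve(edges: List[Edge], t_now: date, known_by: Optional[date] = None) -> Optional[Edge]:
    """Latest edge with t_valid <= t_now, optionally limited to commits seen by known_by."""
    visible = [e for e in edges
               if e.t_valid <= t_now and (known_by is None or e.t_commit <= known_by)]
    return max(visible, key=lambda e: (e.t_valid, e.t_commit)) if visible else None


def blame(edges: List[Edge], start: date, end: date) -> List[Edge]:
    """Edges committed between two dates, in commit order."""
    return sorted((e for e in edges if start <= e.t_commit <= end),
                  key=lambda e: e.t_commit)


edge_log = [
    Edge("LOCATED_IN", "New York", date(2022, 6, 1), date(2022, 6, 1),
         "works at startup XYZ"),
    Edge("LOCATED_IN", "London", date(2025, 3, 20), date(2024, 10, 15),
         "moved to be closer to parents"),
]

march_12 = date(2025, 3, 12)
true_on_march_12 = resolve(edge_log, t_now=march_12)                      # correctness query
known_on_march_12 = resolve(edge_log, t_now=march_12, known_by=march_12)  # audit query
commits_in_march = blame(edge_log, date(2025, 3, 1), date(2025, 3, 31))

print(true_on_march_12.target, known_on_march_12.target, len(commits_in_march))
```

With these dates the correctness query returns London and the audit query returns New York. The same log yields both answers with no second data structure.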

This is a familiar shape. Bitemporal databases have been a research line since Richard Snodgrass formalized the model in the early 1990s. Datomic productionized transaction-time history with as-of, since, and history filters: a single-axis temporal model. XTDB v2 extended this to full bitemporality, exposing both system time and valid time as first-class queryable dimensions directly in SQL: SELECT * FROM customers FOR VALID_TIME AS OF '2024-10-15' is a single statement, not a replay scaffold built in app code. HydraDB applies the same bitemporal model at the edge level. It exposes the temporal axis through its recall API (with parameters like recency bias and valid-time filtering on graph traversal), not through a generic SQL-like temporal query language. The novelty in the agent memory layer is not the bitemporal model. It is applying that model to the substrate that LLMs read from. Vector indexes were not built for this. Standard graph stores were not built for this. The agent memory layer needs to be built for this.

What HydraDB's temporal graph enables: multi-hop reasoning, implicit preferences, and cross-session memory

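Bitemporal append-only graphs are not free, but they buy three capabilities that flat retrieval cannot provide: multi-hop traversal over causally connected facts, inference of unstated preferences, and preference accumulation across sessions.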

Multi-hop graph traversal for causally connected facts

Vector retrieval treats each chunk as an independent point in embedding space. Two facts are retrievable together only if they are semantically close. In production, the facts an agent needs are often causally connected but semantically distant.

Consider the query: "Why is the authentication service behaving differently this month?" The relevant facts include the auth service, the user database it depends on, a migration that touched the user database, the engineer who authored the migration, and the schema-change ticket that justified it. Vector retrieval might surface recent logs that mention the auth service. It will not connect those logs to the migration, the engineer, or the ticket, because none of those entities sit close to the auth service in the embedding space.

The graph traversal is direct:
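A minimal sketch of such a traversal, written as a breadth-first walk over an in-memory edge list. The entity identifiers, relation names (DEPENDS_ON, MODIFIED_BY, AUTHORED_BY, JUSTIFIED_BY), and dates are all hypothetical. This shows the technique and is not HydraDB's query interface.

```python
from collections import deque
from datetime import date

# (source, relation, target, t_valid); all names and dates are hypothetical.
EDGES = [
    ("auth-service", "DEPENDS_ON", "user-db", date(2023, 1, 10)),
    ("user-db", "MODIFIED_BY", "migration-0042", date(2025, 5, 2)),
    ("migration-0042", "AUTHORED_BY", "engineer-dana", date(2025, 5, 2)),
    ("migration-0042", "JUSTIFIED_BY", "ticket-SCHEMA-17", date(2025, 4, 28)),
    # Not yet valid at query time, so the walk must skip it.
    ("user-db", "MODIFIED_BY", "migration-0051", date(2025, 9, 1)),
]


def traverse(start, as_of, max_hops=3):
    """Breadth-first walk over typed edges, keeping only edges valid at as_of."""
    paths = []
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue
        for src, relation, dst, t_valid in EDGES:
            if src != node or t_valid > as_of or dst in seen:
                continue
            seen.add(dst)
            step = path + [(src, relation, dst)]
            paths.append(step)
            queue.append((dst, step))
    return paths


chains = traverse("auth-service", as_of=date(2025, 6, 1))
reached = [chain[-1][2] for chain in chains]
print(reached)
```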

Each hop is on a typed relation, with valid-time filters applied at each step. The agent recovers the full causal chain in a single retrieval pass. None of the intermediate hops were co-located in embedding space, and a flat vector index could not have surfaced them together.

Inferring unstated preferences from graph topology

Some of the most useful preferences a user has are never stated explicitly. They are visible in the topology of the user's decisions over time.

A user rejects two cloud vendors and accepts a third. The rejection conversations cite different reasons each time, but the accepted vendor shares one property the rejected ones lack: data residency in the user's home jurisdiction. The user has expressed a preference for data sovereignty, but never in those words. A flat retrieval system cannot surface this preference because no chunk contains the words "data sovereignty" attached to the user.

A versioned graph encodes the rejections and the acceptance as typed edges:

The graph then compares what cloud-vendor-C offers that A and B lack (in this case, data residency in the user's home jurisdiction), and synthesizes a higher-level preference edge.
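The decision edges and the comparison can be sketched together. This is an illustration under stated assumptions: the relation names (REJECTED, ACCEPTED, PREFERS), the vendor property sets, and the set-difference rule are hypothetical. The paper's actual inference procedure is not reproduced here.

```python
# Hypothetical decision edges and vendor properties.
DECISIONS = [
    ("user", "REJECTED", "cloud-vendor-A"),
    ("user", "REJECTED", "cloud-vendor-B"),
    ("user", "ACCEPTED", "cloud-vendor-C"),
]
VENDOR_PROPERTIES = {
    "cloud-vendor-A": {"soc2", "low_cost"},
    "cloud-vendor-B": {"soc2", "managed_kubernetes"},
    "cloud-vendor-C": {"soc2", "home_jurisdiction_data_residency"},
}


def infer_preferences(decisions, properties):
    """Derive PREFERS edges from what accepted vendors share and rejected ones lack."""
    accepted = [v for _, rel, v in decisions if rel == "ACCEPTED"]
    rejected = [v for _, rel, v in decisions if rel == "REJECTED"]
    if not accepted or not rejected:
        return []
    shared_by_accepted = set.intersection(*(properties[v] for v in accepted))
    offered_by_rejected = set.union(*(properties[v] for v in rejected))
    distinguishing = shared_by_accepted - offered_by_rejected
    # Each inferred edge keeps a pointer to the decisions it was derived from.
    return [("user", "PREFERS", prop, {"derived_from": accepted + rejected})
            for prop in sorted(distinguishing)]


inferred = infer_preferences(DECISIONS, VENDOR_PROPERTIES)
print(inferred)
```

Keeping `derived_from` on the inferred edge matters. The preference is a conclusion, not an utterance, so it should stay auditable back to the decisions that produced it.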

The inferred preference is now retrievable across every future conversation that touches vendor selection, even if those conversations never use the words "data sovereignty." The graph grew smarter through use.

Preference accumulation across sessions

The third consequence falls out of the first two. When preferences are typed edges with provenance and outcome metadata, they accumulate across sessions instead of decaying with the prompt window. A user who repeatedly accepts open-source recommendations, declines SaaS suggestions, and expresses cost-sensitivity across unrelated sessions builds a preference subgraph:
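A minimal sketch of what such a subgraph could hold, assuming hypothetical relation names (ACCEPTED, DECLINED, EXPRESSED), invented session dates, and a 180-day half-life for the recency weighting. None of these values come from the paper.

```python
from collections import defaultdict
from datetime import date

# (relation, target, session date, outcome); all values are hypothetical.
OBSERVATIONS = [
    ("ACCEPTED", "open-source", date(2025, 1, 10), "acted_on"),
    ("ACCEPTED", "open-source", date(2025, 6, 1), "acted_on"),
    ("DECLINED", "saas", date(2025, 3, 5), "no_action"),
    ("EXPRESSED", "cost-sensitivity", date(2025, 6, 20), "no_action"),
]
HALF_LIFE_DAYS = 180  # assumed: an observation loses half its weight every 180 days


def build_subgraph(observations, today):
    """Fold per-session observations into one aggregated edge per (relation, target)."""
    edges = defaultdict(lambda: {"count": 0, "recency_weight": 0.0, "outcomes": []})
    for relation, target, seen_on, outcome in observations:
        age_days = (today - seen_on).days
        edge = edges[(relation, target)]
        edge["count"] += 1
        edge["recency_weight"] += 0.5 ** (age_days / HALF_LIFE_DAYS)
        edge["outcomes"].append(outcome)
    return dict(edges)


today = date(2025, 7, 1)
subgraph = build_subgraph(OBSERVATIONS, today)
for key, edge in subgraph.items():
    print(key, edge["count"], round(edge["recency_weight"], 3), edge["outcomes"])
```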

Each edge has a count, a confidence weighted by recency, and outcome annotations: did the user act on the recommendation, did the plan succeed, was the decision later reversed. The memory layer is no longer a passive record of stated preferences. It is an active model of demonstrated preferences, retrievable as structured priors for downstream reasoning.

When these preference subgraphs are shared across agents operating on behalf of the same organization, HydraDB exposes the result as Hive Memories: cross-agent shared learning where one agent's observed preference becomes available to the next agent that touches the same entity.

How a temporal graph handles a four-year preference change

The paper's Figure 1 illustrates the model with a dietary preference: omnivore in 2021 to vegan in 2025. The edge structure follows the paper; specific dates and metadata fields are illustrative.

In January 2021, the user mentions in passing that they enjoy cooking steak on weekends. The system commits an edge.

In March 2024, the user is diagnosed with high cholesterol and tells the agent they are cutting back on red meat. The system commits a new edge.

In November 2025, after a year of progressively cutting back, the user tells the agent they have decided to go fully vegan and have already stopped buying meat. The system commits a third edge.

Three edges, no overwrites.

"What is the user's current dietary preference?" Resolve E(user, cuisine) by sorting on t_valid and filtering to t_valid ≤ now. Returns e_3: vegan.

"What was the user's dietary preference in 2023?" The same resolution with t_valid ≤ '2023-12-31'. Returns e_1: omnivore. The agent suggesting a steakhouse for a 2023 anniversary dinner is now reasoning over the right premise.

"When did the user's dietary preference change?" Walk the edge sequence. Two transitions: 2024-03-08 (omnivore to reducing red meat) and 2025-10-15 (reducing to vegan).

"Why did the user change?" Read C_meta on e_2 and e_3. Medical advice from a cardiologist drove the first transition. The second was driven by both medical and ethical motivations, discussed with the user's physician and spouse. The agent now understands the state change well enough to navigate adjacent decisions. It should not push the user toward beef-heavy recipes or restaurants whose menus are concentrated around red meat, even if older interactions show enthusiasm for those cuisines.

"What did the agent know about the user's diet on 2024-04-01?" Resolve with t_commit ≤ '2024-04-01' rather than t_valid. Returns e_1 and e_2. The agent at that point knew the user was an omnivore reducing red meat. Any decisions it made then can be audited against that state.

Stale recommendations, contradictory profile fields, missing transition reasoning, and unauditable decisions all collapse into resolutions of the same edge sequence with different temporal filters.
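The five resolutions above can be reproduced with a short sketch. The edge labels, states, and the two transition dates follow the walkthrough. The exact day for e_1, the November commit day for e_3, and the abbreviated reason strings are assumptions filled in for illustration.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class DietEdge(NamedTuple):
    label: str
    state: str
    t_valid: date
    t_commit: date
    reason: str  # abbreviated stand-in for C_meta


E_USER_DIET = [
    DietEdge("e_1", "omnivore", date(2021, 1, 15), date(2021, 1, 15),
             "enjoys cooking steak on weekends"),
    DietEdge("e_2", "reducing red meat", date(2024, 3, 8), date(2024, 3, 8),
             "high cholesterol diagnosis"),
    DietEdge("e_3", "vegan", date(2025, 10, 15), date(2025, 11, 3),
             "medical and ethical motivations"),
]


def state_at(edges: List[DietEdge], valid_as_of: date) -> Optional[DietEdge]:
    """Correctness query: the latest edge whose t_valid is on or before the date."""
    valid = [e for e in edges if e.t_valid <= valid_as_of]
    return max(valid, key=lambda e: e.t_valid) if valid else None


def known_at(edges: List[DietEdge], commit_as_of: date) -> List[DietEdge]:
    """Audit query: every edge the system had committed by the date."""
    return [e for e in sorted(edges, key=lambda e: e.t_commit)
            if e.t_commit <= commit_as_of]


def transitions(edges: List[DietEdge]):
    """Walk the edge sequence and report each change with its reason."""
    ordered = sorted(edges, key=lambda e: e.t_valid)
    return [(b.t_valid, a.state, b.state, b.reason)
            for a, b in zip(ordered, ordered[1:])]


current = state_at(E_USER_DIET, date(2026, 1, 1))      # fixed date, not the real clock
in_2023 = state_at(E_USER_DIET, date(2023, 12, 31))
changes = transitions(E_USER_DIET)
known_april_2024 = known_at(E_USER_DIET, date(2024, 4, 1))

print(current.state, in_2023.state)
print([(when.isoformat(), before, after) for when, before, after, _ in changes])
print([e.label for e in known_april_2024])
```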

Benchmark: HydraDB scores 90.79% on LongMemEval-s

The architecture's claims are testable. The HydraDB paper evaluates the system against LongMemEval-s, a 500-question benchmark for long-term agent memory introduced at ICLR 2025. Each question-conversation stack exceeds 115,000 tokens, simulating roughly 50 continuous user sessions. That is well past the point where naive context-stuffing or flat vector retrieval falls apart.

HydraDB scores 90.79% overall, +5.59 points absolute over the next strongest published system (Supermemory at 85.20%) and roughly +30 points over a GPT-4o full-context baseline (60.2%). The category breakdown:

Category	HydraDB	Supermemory	Zep	Full-context (GPT-4o)	Mem0-oss
Single-session (User)	100.00%	98.57%	92.9%	81.4%	38.71%
Single-session (Assistant)	100.00%	98.21%	80.4%	94.6%	8.93%
Single-session (Preference)	96.67%	70.00%	56.7%	20.0%	40.00%
Knowledge Update	97.43%	89.74%	83.3%	78.2%	52.56%
Temporal Reasoning	90.97%	81.95%	62.4%	45.1%	25.56%
Multi-session Reasoning	76.69%	76.69%	57.9%	44.3%	20.30%
Overall	90.79%	85.20%	71.2%	60.2%	29.07%

The largest absolute gains land where the bitemporal versioned graph architecture should help most. Knowledge Update at 97.43% measures whether the system can correctly handle conflicting or evolving facts, the precise failure mode the State Confusion Problem causes. Temporal Reasoning at 90.97% measures queries that depend on knowing what was true when. Single-session Preference Extraction at 96.67% measures implicit preference modeling, exactly the topology-derived inference the architecture exposes.

Multi-session Reasoning tells a less flattering story. HydraDB and Supermemory both score 76.69%. This is a tie, not a win. It is the hardest category in the suite. The architectural advantages don't close this gap. Combining facts distributed across many sessions remains the open frontier for every published system, HydraDB included.

The architecture's wins also do not depend on the largest available model. Re-running the same evaluation with GPT-5 mini lands at 85.80% overall, and GPT-5.2 at 84.73%. Both numbers are level with the strongest non-HydraDB system in Table 2 (85.80% edges past Supermemory's 85.20%, 84.73% sits just below it) and well above every other system. The paper's conclusion: long-term memory quality is governed primarily by preprocessing and representation design, not raw model capacity. Users can select backbone models based on operational constraints (cost, latency, throughput) without compromising memory reliability.

Costs of a bitemporal agent memory: storage, write-cost, query complexity

Append-only versioned graphs are not free.

Storage growth. Append-only is monotonically growing by design. Every edge committed sticks. For a long-running multi-tenant system, the storage profile under retention-free assumptions is unsustainable. The paper introduces a Bio-Mimetic Decay Engine that scores memory nodes by initial salience, exponential decay over chronological age, and reinforcement boost from successful retrievals. Memories with low retention scores migrate through tiered storage and eventually evict. The decay engine is experimental in the current paper. The general shape (retention policies for different data classes, plus a vacuum-equivalent that reclaims space without breaking replay) is a known pattern from event-sourcing literature. It is solvable. It is not free.
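To make the three scoring inputs concrete, here is a minimal sketch of how a retention score of this kind could be composed. The multiplicative formula, the 30-day half-life, the per-retrieval boost, and the tier thresholds are all illustrative assumptions; the paper's actual scoring function and parameters are not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryNode:
    salience: float       # initial importance in [0, 1], assigned at ingestion
    age_days: float       # chronological age of the memory
    retrievals: int = 0   # successful retrievals so far (reinforcement signal)


def retention_score(node: MemoryNode,
                    half_life_days: float = 30.0,
                    boost_per_hit: float = 0.2) -> float:
    """Hypothetical score: salience, decayed exponentially, boosted by use."""
    decay = math.exp(-math.log(2) * node.age_days / half_life_days)
    reinforcement = 1.0 + boost_per_hit * node.retrievals
    return node.salience * decay * reinforcement


def storage_tier(score: float) -> str:
    """Map a score to an assumed storage tier; thresholds are placeholders."""
    if score >= 0.5:
        return "hot"
    if score >= 0.1:
        return "cold"
    return "evict"


fresh = MemoryNode(salience=0.8, age_days=0)
stale = MemoryNode(salience=0.8, age_days=120)
reused = MemoryNode(salience=0.8, age_days=120, retrievals=10)

for name, node in [("fresh", fresh), ("stale", stale), ("reused", reused)]:
    print(name, storage_tier(retention_score(node)))
```

Under these assumed parameters, an old memory that is never retrieved drifts toward eviction, while the same memory with repeated successful retrievals stays in a colder tier instead of being dropped.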

Write-path inference cost. Sliding-window enrichment, entity resolution, and preference mapping happen at ingestion time, not retrieval time. That ingestion overhead is the explicit price of having self-contained, structured memory chunks at retrieval time. In production, HydraDB sustains 2.5 million tokens per minute of ingestion with sub-500ms wait times at peak, and has handled bursts of 50 million tokens per hour under noisy-neighbor conditions. The cost is real, and it is paid efficiently at scale. The tradeoff is principled: write-time enrichment scales with the number of facts ingested; query-time enrichment scales with the number of queries multiplied by the recall window. For any read-heavy system (roughly ten or more reads per write), paying the cost on the write path is the cheaper option.
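The scaling argument can be written as a back-of-the-envelope cost model. This is a sketch under simplifying assumptions: both unit costs are placeholders set to 1.0, and the function names are made up for illustration. The real break-even point depends on the actual cost of enriching one fact versus one recalled item.

```python
def write_time_cost(facts_ingested: int, cost_per_fact: float = 1.0) -> float:
    """Enrichment paid once per ingested fact."""
    return facts_ingested * cost_per_fact


def query_time_cost(queries: int, recall_window: int,
                    cost_per_item: float = 1.0) -> float:
    """Enrichment paid on every query, for every item in the recall window."""
    return queries * recall_window * cost_per_item


def cheaper_path(facts_ingested: int, queries: int, recall_window: int) -> str:
    """Return which path costs less under the placeholder unit costs."""
    write = write_time_cost(facts_ingested)
    query = query_time_cost(queries, recall_window)
    return "write" if write <= query else "query"


# Read-heavy workload: 1,000 facts, 10,000 queries, 20 items recalled per query.
read_heavy = cheaper_path(facts_ingested=1_000, queries=10_000, recall_window=20)

# Write-heavy workload: 1,000 facts, only 10 queries.
write_heavy = cheaper_path(facts_ingested=1_000, queries=10, recall_window=20)

print(read_heavy, write_heavy)
```

With equal unit costs, the read-heavy case pays 1,000 units on the write path against 200,000 on the query path, while the write-heavy case pays 1,000 against 200.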

Query complexity. Multi-hop graph traversal with cross-encoder reranking and query expansion is more involved than nearest-neighbor vector lookup. The query path has more moving pieces: adaptive query expansion, hybrid semantic search, entity-anchored graph search, chunk-level graph expansion, triple-tier reranking. The complexity is the bug fix. The flat retrieval pipeline is simpler precisely because it is not solving the problems the multi-stage pipeline is built to solve.

The paper does not pitch the architecture as a free lunch. It pitches it as the correct shape for a problem the industry has been trying to solve with the wrong tools.

Agent memory is a database engineering problem

Agent memory is a database engineering problem. The shape that handles changing facts, multi-hop relational reasoning, preference inference, and auditability is a versioned, bitemporal, typed graph with structured context attached at the edge.

The Git-Style Versioned Temporal Graph is one pillar of HydraDB's broader Composite Context architecture. The Sliding Window Inference Pipeline makes ingested chunks self-contained before they hit the graph. Multi-Stage Retrieval fuses hybrid semantic search with graph traversal. The temporal graph alone is necessary. The composite is what makes the system work end-to-end.

If you are running into stale recommendations, lost preferences, contradictory profile state, or unauditable agent decisions in production, the question is not which retrieval heuristic to swap in next. The question is whether your memory layer is architecturally capable of distinguishing what was true at the time the agent acted from what is true now. If the answer is no, the failures will keep coming, and the next tweak to the retrieval scoring function will not fix them.

Frequently Asked Questions

What is the State Confusion Problem in AI agent memory?

The State Confusion Problem occurs when an agent's memory layer overwrites or fails to time-order facts about a user, customer, or system. When state changes (a user moving cities, a customer downgrading their tier), a destructive store either loses the old value entirely or stores both without temporal ordering, leaving the agent unable to reason about what is current. The model reasons correctly over what it is given; the memory layer feeds it the wrong premise.

Why isn't a vector database enough for AI agent memory?

Vector databases are search indexes, not state stores. They retrieve by semantic similarity, which works for retrieval-augmented generation over static corpora but fails for stateful agents that need typed entities, versioned writes, transactional read isolation, and multi-record consistency. Two facts that should retrieve together are often causally connected but semantically distant in embedding space, and a flat vector index has no primitive for that.

What is a bitemporal knowledge graph?

A bitemporal knowledge graph stores every fact with two timestamps: a transaction time (when the system learned the fact) and a valid time (when the fact actually became true in the world). This lets the system answer two different questions about any historical point: what did the system know at time T, and what was actually true at time T. Append-only versioned graphs apply this model to typed entity relationships rather than rows in a relational table.

How does HydraDB handle conflicting or evolving facts?

HydraDB never mutates an existing edge. When a fact changes, HydraDB appends a new edge with a fresh transaction time and valid time, leaving the prior edge intact. The current state of any relationship is a function over the full edge log, so historical state and the reasoning behind each transition remain queryable. On LongMemEval-s, this architecture scores 97.43% on the Knowledge Update category, the benchmark axis that tests handling conflicting or evolving facts.

What is the difference between transaction time and valid time?

Transaction time records when the system ingested a fact. Valid time records when the fact actually became true in the world. A user might tell an agent in March that they moved in October: transaction time captures the March commit, valid time captures the October move. Conflating the two makes audit queries and correctness queries indistinguishable, so a bitemporal architecture keeps them as separate axes.
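The March/October example can be sketched with a small append-only edge log. The `Edge` and `EdgeLog` names, the field names, and the specific dates are assumptions made for illustration; this is not HydraDB's schema or API, only a minimal instance of the two-axis idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: date    # valid time: when the fact became true in the world
    recorded_at: date   # transaction time: when the system learned it


class EdgeLog:
    """Append-only log: edges are added, never mutated or deleted."""

    def __init__(self) -> None:
        self._edges: List[Edge] = []

    def append(self, edge: Edge) -> None:
        self._edges.append(edge)

    def as_of(self, subject: str, predicate: str,
              valid_at: date, known_at: date) -> Optional[str]:
        """What held at `valid_at`, using only edges recorded by `known_at`."""
        candidates = [
            e for e in self._edges
            if e.subject == subject and e.predicate == predicate
            and e.valid_from <= valid_at and e.recorded_at <= known_at
        ]
        if not candidates:
            return None
        # The most recent valid_from wins; later recordings break ties.
        return max(candidates, key=lambda e: (e.valid_from, e.recorded_at)).obj


log = EdgeLog()
log.append(Edge("user:1", "lives_in", "Austin", date(2024, 1, 1), date(2024, 1, 5)))
# In March 2026 the user mentions a move that happened in October 2025.
log.append(Edge("user:1", "lives_in", "Denver", date(2025, 10, 15), date(2026, 3, 10)))

december = date(2025, 12, 1)
truth_in_december = log.as_of("user:1", "lives_in", december, known_at=date(2026, 6, 1))
belief_in_december = log.as_of("user:1", "lives_in", december, known_at=december)
print(truth_in_december, belief_in_december)
```

The first query answers the correctness question (where the user actually lived in December), and the second answers the audit question (what the system believed at that moment, which is what an agent acting in December would have used).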

How does HydraDB compare to Mem0, Zep, and other agent memory systems on LongMemEval-s?

On LongMemEval-s, HydraDB scores 90.79% overall, +5 points absolute over the next-best system (Supermemory at 85.20%) and roughly +30 points over a GPT-4o full-context baseline (60.2%). Mem0-oss scores 29.07%, Zep 71.2%. HydraDB's largest gains are in Knowledge Update (97.43%) and Temporal Reasoning (90.97%), the categories where bitemporal versioning matters most. The two top systems tie on Multi-session Reasoning at 76.69%, the suite's hardest category.

What are Hive Memories in HydraDB?

Hive Memories are HydraDB's cross-agent shared learning layer. When preference subgraphs accumulate across agents operating on behalf of the same organization, one agent's observed preference becomes available to the next agent that touches the same entity. Preferences propagate across the agent network instead of being relearned each session.

What are the tradeoffs of an append-only versioned memory graph?

Three costs are real. Storage grows monotonically and needs retention policies to stay sustainable; HydraDB introduces a Bio-Mimetic Decay Engine to score and evict low-salience memories. Write-path enrichment costs compute since sliding-window inference happens at ingestion; HydraDB sustains 2.5 million tokens per minute in production with sub-500ms wait times at peak. Query plans are more involved than nearest-neighbor lookup since traversal combines hybrid semantic search, graph paths, and cross-encoder reranking.

Best Mem0 and Zep Alternatives for AI Agent Memory (2026 Guide)

Aman Puri — Sat, 27 Jun 2026 15:29:15 +0000

Early tools like Mem0, Zep, and Supermemory solved real problems for generative AI applications, handling fast fact extraction, basic document recall, and episodic chat memory. These made bounded, single-application use cases work, and some have since evolved significantly to support more complex enterprise deployments. But if you're reading this guide, your engineering team has probably hit a wall.

Key takeaways

If your team has hit architectural walls with destructive updates, temporal conflicts, or multi-tenant isolation issues, pick the alternative that matches your workload:

Need graph-native temporal versioning, structured ingestion, and a unified context layer across agents and apps? → HydraDB
Need an agent that manages its own memory, including paging and eviction? → Letta (MemGPT)
Need a focused chat history/profile store? → Memento
Need DIY semantic retrieval infrastructure? → Qdrant or Weaviate
Need memory tightly coupled with your orchestration layer? → Framework-Native Memory (LangGraph, LlamaIndex)

The architectural rule that decides the rest is simple. Avoid destructive updates and choose a platform with temporal versioning or valid-time windows so new facts don't silently overwrite old ones. Among the alternatives here, only HydraDB provides temporal versioning natively.

Why production teams outgrow Mem0, Zep, and Supermemory

Engineering teams rarely abandon their initial memory stack because they lack basic features. They hit architectural walls when trying to map complex, evolving institutional knowledge. The most common failure mode in flat-vector memory systems is the destructive update problem. Memory systems that rely exclusively on flat vector architectures often flatten old and new facts together. When a user's preference or a critical company policy changes, standard chunking methods accrete invalid atomic facts without native conflict resolution. The agent eventually suffers from ghost knowledge, confidently synthesizing outdated vectors with current ones because both remain semantically similar to the user's query. A simple recency filter on created_at partially helps, but doesn't resolve cases where both old and new facts are genuinely recent, or where the conflict spans multiple entities.

Some platforms have addressed this limitation directly. Zep, for example, has evolved into a capable temporal knowledge graph through its Graphiti architecture, offering sub-200ms retrieval latency and multi-signal capabilities fusing semantic search, sparse keywords, and breadth-first graph traversal, along with strict valid and invalid temporal windows. Teams evaluating alternatives to Zep aren't always leaving because of capability deficits. They may be seeking differences in deployment flexibility, ingestion architecture, unified context-layer design, or developer-first infrastructure primitives that platforms like HydraDB provide natively. Mem0, Zep, and Supermemory all remain popular solutions. Mem0 boasts 14M+ downloads and $24M in funding; Zep successfully handles complex enterprise role-based access control and archive flows; and Supermemory holds SOC 2, HIPAA, and GDPR compliance. However, the friction typically comes from structural architectural choices that lack full database-grade semantics, or from critical compliance and temporal features being locked behind steep pricing discontinuities. Teams face unpredictable scaling costs and significant operational overhead when trying to push past the limits of standard API wrappers.

When Mem0, Zep, and Supermemory still make sense

Don't prematurely abandon these tools if your scope is intentionally limited. They remain excellent choices for teams building single-application chat experiences that only require quick, episodic context. Indie developers and teams building MVPs using managed RAG APIs for bounded personal knowledge management or basic document recall will find them a solid fit. If your use case involves isolated, static documents where historical state changes or evolving user preferences don't matter, flat semantic extraction works fine, too.

How to evaluate AI agent memory architecture in 2026

The evaluation criteria for agent memory have shifted over the past year. First-generation comparisons relied almost exclusively on standard academic benchmarks such as LoCoMo and LongMemEval to rank tools and secure product citations. These benchmarks help establish baseline capabilities, but they predominantly test static retrieval over fixed datasets. They don't fully measure how a system handles dynamic conflict resolution, continuous ingestion, or evolving user states over a prolonged deployment lifecycle. Instead of indexing your engineering decisions against static leaderboards, you need to rigorously test deterministic CRUD primitives, ingestion mechanics, multi-signal retrieval fusion, and temporal reasoning.

Evaluate the underlying context representation. Modern agent memory must operate beyond flat vector similarity, providing cross-agent or user-centric memory rather than a single-app chat scope. Look for architectures that support true multi-signal retrieval, fusing dense semantic search, sparse metadata keyword matching, and deterministic graph traversal. An agent needs to understand the explicit, structured relationships between distinct entities, not just their semantic proximity within a dense latent space.

Mandate strict temporal priority and versioning. The underlying system must handle evolving context like a version control history. Track what changed, when it changed, and why it changed rather than blindly overwriting an existing database row or appending a contradictory embedding into the index. Bitemporal modeling or explicit valid-time windows let the agent reason accurately about past states and current truths without hallucinating blended facts from stale data.

Scrutinize the structured ingestion pipelines. The most sophisticated retrieval algorithm can't salvage poorly ingested, unstructured data. Evaluate whether the platform actively resolves entities and ambiguous pronouns at write time rather than deferring all interpretation to retrieval. Systems that enrich context during ingestion, such as those that use sliding-window approaches to link pronouns and preferences to their referent entities, prevent the creation of meaningless, isolated text chunks. Linking entities before generating the final embedding ensures that every node in your memory graph has an explicit, verifiable contextual weight.

Assess operational observability, indexing latency, and productivity-tool connectors (Notion, Google Docs). The memory layer should offer direct local-model support without forcing your infrastructure to proxy every call through an external gateway or proprietary bottleneck. Crucially, your engineering team must be able to validate decision traces. You need complete observability into the entire retrieval pipeline to track exactly which specific memories were injected into the prompt and the precise routing logic that selected them over others. Without deterministic decision traces, debugging a confident hallucination in a live production environment becomes exceptionally difficult, which can destroy end-user trust and limit enterprise adoption.
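When testing multi-signal retrieval fusion and decision traces, it helps to have a reference point. One widely used generic technique for merging ranked lists is reciprocal rank fusion (RRF). It is shown here as a minimal sketch, not as the method any platform in this guide necessarily uses. The signal names and memory IDs are made up, and `k = 60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_with_trace(rankings: Dict[str, List[str]], k: int = 60) -> List[dict]:
    """Reciprocal rank fusion that also records which signal ranked what."""
    scores: Dict[str, float] = defaultdict(float)
    trace: Dict[str, Dict[str, int]] = defaultdict(dict)
    for signal, ranked_ids in rankings.items():
        for rank, memory_id in enumerate(ranked_ids, start=1):
            scores[memory_id] += 1.0 / (k + rank)
            trace[memory_id][signal] = rank   # decision trace: signal -> rank
    # Highest fused score first; ties broken by ID for determinism.
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return [{"id": m, "score": scores[m], "trace": trace[m]} for m in ordered]


rankings = {
    "dense": ["m1", "m2", "m3"],   # semantic similarity results
    "sparse": ["m2", "m4"],        # keyword / metadata matches
    "graph": ["m2", "m1"],         # entities reached by graph traversal
}

fused = fuse_with_trace(rankings)
for row in fused:
    print(row["id"], round(row["score"], 4), row["trace"])
```

The `trace` field is the part worth insisting on in an evaluation: for every memory injected into a prompt, it records which retrieval signals surfaced it and at what rank.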

Quick comparison of AI agent memory platforms

Alternative name	Architecture focus	Best for	Temporal reasoning	Enterprise tenancy
HydraDB	Context Layer / Graph-Native	Production AI agents, copilots, and enterprise applications requiring persistent context and complex state tracking	Yes (Graph-native)	Native
Letta (MemGPT)	OS-Style Tiered Memory	Agent-driven autonomous context management	No	Custom
Memento	Episodic Memory Server	Focused chat history and basic user profiles	No	Custom
Qdrant / Weaviate	Vector DB / Build-it-yourself	High-scale pure semantic similarity infrastructure	No (App-level)	Native
Framework-Native	Orchestration-Integrated	Tightly coupled state management within code	No	Custom

In-depth reviews of AI agent memory alternatives to Mem0 and Zep

HydraDB: graph-native, time-aware context layer

Who HydraDB is best for

Software companies and applied AI teams building production-grade agents, copilots, personalized assistants, company brains, and multi-agent workflows where context quality dictates product quality and reliability. It fits teams that have outgrown a chat-history store, need deterministic control over what the agent remembers, and don't want to assemble memory infrastructure by hand or lock it to a single orchestration framework.

HydraDB overview

HydraDB operates as a dedicated context and memory layer for AI applications. It provides the core developer-first infrastructure required to build personalized, stateful agents without forcing engineering teams to assemble a vector database, a graph database, a parser, a temporal system, and custom memory logic by hand. By treating memory as a first-class infrastructural primitive rather than an afterthought, HydraDB centralizes context management across multiple interconnected agent applications. It delivers a unified retrieval experience that inherently understands both semantic meaning and structured relationships. Even on static benchmarks, which don't fully capture dynamic conflict resolution or continuous ingestion, HydraDB posts a state-of-the-art result: 90.79% overall accuracy on LongMemEval-s with Gemini 3.0 Pro, a +5 point gain over the strongest competing system, with 90.97% on temporal reasoning and 96.67% on preference extraction.

HydraDB key differentiators vs. Mem0 and Zep

Time-Aware Temporal Graph: HydraDB uses a Git-style versioned temporal graph to preserve entities, relationships, and state changes as an append-only history. Unlike flat vector stores that destructively overwrite existing records or duplicate conflicting information across disparate chunks, HydraDB actively tracks temporal validity. Because updates are appended rather than overwritten, no historical state is ever lost. Agents can query what was true a month ago, what's true right now, and the exact sequence of events that caused the state to change.

Structured Context Ingestion: Standard recursive chunking leaves nearly 40% of chunks semantically invisible, stripped of the entity or pronoun they depend on. To address this, HydraDB employs a Sliding Window Inference Pipeline. It resolves entities, pronouns, preferences, and implicit references at write time, before the embedding is generated. Every retrieved context block is semantically complete and anchored to the correct global entity within the system.

Multi-Signal Retrieval: The platform executes advanced hybrid retrieval by default. HydraDB combines semantic similarity, sparse keyword matching, latent inferred meaning, metadata, graph traversal, temporal signals, entity-based search, chunk-level graph expansion, and reranking into a unified retrieval pipeline, rather than relying on single-dimensional vector similarity.

Model-Agnostic Backbone: HydraDB holds strong results across backbone models (90.79% on Gemini 3.0 Pro, 85.80% on GPT-5 mini, and 84.73% on GPT-5.2) because memory quality is driven by preprocessing and representation design, not raw model capacity. Teams can pick a backbone based on cost, latency, and throughput without sacrificing memory reliability.

What you gain with HydraDB

You gain a system of record for AI context that spans cross-session conversations, enterprise documents, and decision history. HydraDB enables tracking of evolving user states and relationships without requiring your backend engineering team to write manual conflict-resolution scripts. It also provides enterprise tenancy and scoped retrieval, access controls, and historical traceability for injected context. Because the graph is append-only, every state change carries the reasoning behind it: why a preference changed, what alternatives were rejected, and what outcome the user was optimizing for. That gives you queryable decision traces for any injected context, not just the final fact.

HydraDB trade-offs

HydraDB isn't a simple plug-and-play chat widget you can deploy in minutes. It requires a serious developer-first mindset to implement effectively as core backend infrastructure. For teams building simple, stateless hobbyist chatbots or temporary single-session wrappers, this architecture represents unnecessary overhead and excessive complexity.

Migrating to HydraDB from Mem0 or Zep

Migration difficulty: Moderate. Migrating involves shifting from maintaining prompt-injected fact arrays to pushing data through a structured, entity-aware ingestion API. HydraDB provides comprehensive SDKs to handle the entity mapping logic, but you'll need to redirect your application's core write paths to fully use the new sliding window ingestion methods.

Letta (MemGPT): agent-managed, tiered memory

Who Letta is best for

Engineering teams building autonomous, long-running agents that manage their own memory and run with minimal human supervision.

Letta overview

Originating from the widely cited MemGPT research paper, Letta approaches AI agent memory from an operating system perspective. Instead of relying on passive semantic search pipelines triggered externally by the backend application, Letta gives the LLM explicit programmatic tools to page information between a constrained working memory and a virtually infinite archival memory database.

Letta key differentiators vs. Mem0 and Zep

Agent-managed memory: The defining characteristic of Letta's architecture is that the LLM itself actively decides what to save, what to evict, and what to page into context. It uses native function calls to dynamically interact with its memory tiers during live execution, closely mirroring how a traditional CPU manages physical RAM and disk storage.

Tiered architecture: Letta maintains a strict, native structural separation between the core active context and external storage, forcing developers to clearly define operational boundaries for their agents.
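The paging idea can be illustrated with a toy two-tier store. This is a hypothetical sketch, not Letta's API: the class, the method names, and the substring search are stand-ins. In an agent-managed design these methods would be exposed to the model as tools, and the model itself would choose when to call them.

```python
from typing import List


class CoreMemoryFull(Exception):
    """Raised when the bounded working tier has no room left."""


class TieredMemory:
    def __init__(self, core_limit: int = 2) -> None:
        self.core_limit = core_limit
        self.core: List[str] = []      # small working set, always in the prompt
        self.archive: List[str] = []   # unbounded external store

    def remember(self, fact: str) -> None:
        """Tool: add a fact to core memory; fails if the tier is full."""
        if len(self.core) >= self.core_limit:
            raise CoreMemoryFull("evict something before adding more")
        self.core.append(fact)

    def evict(self, fact: str) -> None:
        """Tool: page a fact out of core memory into the archive."""
        self.core.remove(fact)
        self.archive.append(fact)

    def recall(self, keyword: str) -> List[str]:
        """Tool: search the archive (substring match stands in for real search)."""
        return [f for f in self.archive if keyword.lower() in f.lower()]


mem = TieredMemory(core_limit=2)
mem.remember("User prefers metric units")
mem.remember("Project deadline is Friday")
try:
    mem.remember("User is vegetarian")
    overflowed = False
except CoreMemoryFull:
    overflowed = True                        # the agent must decide what to evict
    mem.evict("Project deadline is Friday")
    mem.remember("User is vegetarian")

print(mem.core, mem.recall("deadline"))
```

The sketch also makes the trade-off visible: which fact gets evicted is a decision made at run time by whatever is calling the tools, so two runs over the same conversation can end with different memory contents.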

What you gain with Letta

This architecture excels for long-running agent loops where the agent requires deep self-correction mechanisms, continuous autonomous execution, and the ability to dictate its own context management strategy over thousands of sequential iterations. Letta works best in environments where human supervision is minimal and context needs to be continuously curated by the reasoning engine itself.

Letta trade-offs

You surrender deterministic control. Since the model independently decides what to store and what to evict based on probabilistic reasoning, it inevitably makes autonomous mistakes that are hard to reproduce and reliably debug. Letta also requires specific prompting frameworks and specialized agent runtimes, making it less flexible if you simply want to attach a backend context layer to an existing standard application.

Migrating to Letta from Mem0 or Zep

Migration difficulty: Complex. Adopting Letta isn't a simple database swap or API change. It requires re-architecting your core agent loop, discarding legacy system prompts, and restructuring your execution runtime to fully adopt Letta's operating system-style framework and function-calling memory paradigms.

Memento: episodic memory server for chat history

Who Memento is best for

Teams that need a focused memory server for chat history, conversational state, and user profiles in a single application.

Memento overview

Memento acts as a dedicated memory server specifically optimized for conversational AI applications. It sits between your application code and the LLM to handle continuous interaction state, episodic chat history, and basic user profiling natively. This prevents conflating complex memory logic with standard application routing logic.

Memento key differentiators vs. Mem0 and Zep

Memento is frequently adopted as a direct structural alternative to first-generation memory tools for teams focused heavily on managing raw conversation threads and user attributes. It intentionally strips away the complexity of heavy graph platforms and temporal engines to provide a focused, lightweight API designed for rapid conversational persistence.

What you gain with Memento

Adopting Memento provides a clean architectural separation of conversational state from your core backend application logic. It excels at the easy, structured handling of user profiles, session tokens, and chronologically ordered chat threads. Developers can treat conversation history as a reliable, distinct microservice.

Memento trade-offs

The platform intentionally lacks the deep multi-signal retrieval and complex graph-vector fusion required to map heavy institutional knowledge or track intricate, dynamically evolving state changes over long periods.

Memento may also struggle with non-conversational context ingestion, like executing large-scale asynchronous document processing or handling bulk batch data pipelines.

Migrating to Memento from Mem0 or Zep

Migration difficulty: Easy to Moderate. Since Memento maps so closely to basic episodic storage patterns, migration primarily involves writing straightforward API scripts to securely translate existing chat episode histories and structured user profiles directly into Memento's required payload structure.

Qdrant and Weaviate: build-your-own vector memory stack

Who Qdrant and Weaviate are best for

Enterprise infrastructure teams that want pure semantic similarity at scale and prefer to build their own memory logic on top.

Qdrant and Weaviate overview

Qdrant and Weaviate aren't agent memory platforms out of the box, but these highly optimized vector-native databases remain the most common do-it-yourself alternatives. Well-resourced infrastructure teams use them to store raw embeddings securely while building custom application middleware to handle all semantic retrieval, temporal logic, and agent routing logic in-house.

How Qdrant and Weaviate differ from Mem0 and Zep

The primary differentiator is pure infrastructural flexibility and full control. By operating exclusively at the database level, your engineering team owns the complete retrieval pipeline, your specific chunking strategy, and the implementation of metadata filtering. Both Qdrant and Weaviate offer massive enterprise-grade scaling, mature surrounding ecosystems, and optimized vector search latency that higher-level memory APIs often struggle to match consistently under severe load.

What you gain with Qdrant or Weaviate

You gain ultimate control over the specific embedding models used, the exact similarity metrics applied, and the intricate database scaling parameters. You can deeply tune dense and sparse hybrid search configurations specifically to your unique domain. You also benefit heavily from strong native multi-tenancy and enterprise security capabilities. Qdrant features native tiered sharding and advanced payload filters, while Weaviate integrates with enterprise identity providers via OIDC and provides granular role-based access control schemas.

Trade-offs of building on Qdrant or Weaviate

You must build your entire conceptual memory logic layer from scratch. Since the database itself only natively handles storage, indexing, and tenancy, the engineering debt comes from having to architect the temporal event store, perform complex entity resolution during data ingestion, and write the custom graph traversal logic yourself. This results in high ongoing maintenance overhead compared to deploying a fully integrated context layer.

Migrating from Mem0 or Zep to Qdrant or Weaviate

Migration difficulty: Complex. Moving to a raw vector database requires extracting all data from your current managed memory provider, generating new embeddings, defining a new payload schema, and writing custom application middleware to replace the memory API functions you previously relied on.

Framework-native memory: LangGraph and LlamaIndex

Who framework-native memory is best for

Developers committed to a single orchestration framework who want built-in state management without running a separate memory service.

Framework-native memory overview

Instead of deploying and maintaining dedicated external memory servers, engineering teams can rely on the built-in memory primitives provided directly by their chosen agent orchestration frameworks. Using tools like LangGraph checkpointers for strict thread state persistence or LlamaIndex memory patterns lets developers evaluate native framework capabilities against the operational overhead of provisioning dedicated standalone memory servers.

How framework-native memory differs from Mem0 and Zep

Orchestration Integration: Memory operations and state updates are embedded directly into the agent framework rather than acting as a standalone third-party microservice queried over a network API boundary.

Simplified Stack: This approach eliminates the need to provision, manage, and scale a separate memory server for basic conversation state management.

What you gain with framework-native memory

Your overall system architecture benefits from having fewer moving parts. You achieve tight coupling with framework-specific state machines, conditional routing logic, and parallel execution threads. This localized state management typically delivers ultra-low latency during complex, multi-step agent reasoning loops.

Framework-native memory trade-offs

The critical cost of this tight integration is the loss of portability. Your custom memory logic becomes locked into that framework's ecosystem. If you transition away from LangGraph, you lose your entire memory implementation. You also sacrifice advanced enterprise features natively found in dedicated context layers, like multi-tenant physical scoping, deep temporal versioning architectures, and unified graph-native relationship tracking.

Migrating from Mem0 or Zep to LangGraph or LlamaIndex memory

Migration difficulty: Moderate. This migration requires removing all standalone API network calls to your previous memory provider and manually wiring the raw conversation history directly into the orchestration framework's native checkpointer configurations or distinct memory class structures.

Security, compliance, and data governance for agent memory

As AI agents rapidly transition from internal, low-risk experiments to high-stakes, customer-facing enterprise deployments, agent memory can no longer operate as an opaque black box. Evaluating serious alternatives requires a rigorous, thorough assessment of data governance, security postures, and compliance mechanisms.

Consider the strict legal requirements governing the right to be forgotten under comprehensive privacy regulations such as the GDPR. When a user formally requests data deletion, standard vector databases often leave generated embeddings silently persisting in the index as orphaned data points after the source document is deleted. Ensuring data cannot be recovered from disk requires additional infrastructure that most teams don't implement by default. That creates severe compliance liabilities. You must have the operational capability to cleanly hard-delete a user's entire trace from both the semantic index and the relationship graph without breaking the database structure or corrupting adjacent entities. Production memory layers should support lineage-aware deletion so that when a primary source is purged, downstream memory artifacts derived from it can also be removed. Techniques like crypto-shredding can help ensure deleted data is unrecoverable.

Beyond strict data deletion, tenant isolation represents another critical vulnerability point in enterprise memory architectures. You must evaluate whether your compliance needs dictate dedicated physical isolation or if logical isolation is sufficient. If using a shared memory architecture, ensure it is deliberately designed for explicit multi-tenant scoping from day one. Every memory write operation and retrieval query must be scoped by standard parameters like tenant_id and user_id deep within the infrastructure layer. This prevents cross-contamination during retrieval and provides cautious enterprise buyers with guarantees that an agent interacting with Tenant A can't inadvertently hallucinate or leak isolated, sensitive memories belonging to Tenant B.

You must also design mitigations against poisoning and prompt injection directly within the memory layer itself. Malicious actors can exploit loosely governed ingestion pipelines by intentionally feeding false, adversarial, or manipulative facts into an agent's long-term memory store. Over time, these injected memories can subtly manipulate the agent's behavior, bypassing traditional application firewalls and safety guardrails. Implementing structured ingestion protocols, demanding high confidence thresholds for automated fact extraction, and establishing strict permission layers filter out untrusted sources before they ever reach the context graph.

Highly regulated enterprise environments demand deep auditability. Storing a fact within a database isn't sufficient. The system must provide a comprehensive, immutable historical log detailing exactly who altered a specific memory, the previous state before the alteration, and the precise timestamp of the modification. Without an immutable audit trail, tracing the exact origin of a flawed agent decision back to a specific corrupted context injection becomes impossible.
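As a rough illustration of tenant-scoped access combined with an append-only audit log, here is a minimal in-memory sketch. The class and field names (`ScopedMemoryStore`, `AuditEntry`, `actor`) are hypothetical and do not come from any product discussed here. A real system would enforce the scope in the storage engine and write the log to tamper-evident storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    """One change record: who changed what, the prior state, and when."""
    actor: str
    tenant_id: str
    user_id: str
    key: str
    previous: Optional[str]
    new: Optional[str]
    at: datetime


class ScopedMemoryStore:
    """Toy store where every read and write must carry tenant_id and user_id."""

    def __init__(self):
        self._facts = {}   # (tenant_id, user_id, key) -> value
        self._audit = []   # append-only list of AuditEntry

    def write(self, tenant_id, user_id, key, value, actor):
        scope = (tenant_id, user_id, key)
        previous = self._facts.get(scope)
        self._facts[scope] = value
        self._audit.append(
            AuditEntry(actor, tenant_id, user_id, key, previous, value,
                       datetime.now(timezone.utc))
        )

    def read(self, tenant_id, user_id, key):
        # The scope is part of the lookup key, so another tenant's data
        # is unreachable rather than filtered out after retrieval.
        return self._facts.get((tenant_id, user_id, key))

    def audit_trail(self, tenant_id):
        return tuple(e for e in self._audit if e.tenant_id == tenant_id)


store = ScopedMemoryStore()
store.write("tenant_a", "u1", "plan", "pro", actor="agent-1")
store.write("tenant_a", "u1", "plan", "enterprise", actor="admin")
print(store.read("tenant_b", "u1", "plan"))  # nothing stored under tenant_b
```

Making the scope part of the key, rather than a filter applied after retrieval, is the design choice that matters here: a missing or wrong tenant_id yields no result instead of someone else's memory.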

Conclusion: choosing the right Mem0 or Zep alternative

Moving beyond standard first-generation tools like Mem0, Zep, or Supermemory requires shifting away from a basic-feature-checkbox mentality and adopting a real, production-grade memory infrastructure. The core decision rule is simple. Pick a focused, bounded API if your use case is inherently limited in scope and complexity. If context accuracy, deep temporal reasoning, and complex state management directly dictate your product's overall success, you need a dedicated context layer. Evaluate your options based on structured ingestion capabilities, multi-tenant security, and the system's ability to handle continuously evolving states. Engineering teams must rigorously test multi-signal retrieval on real enterprise data sets rather than blindly relying on static academic benchmarks. If your engineering team is ready to build graph-native, time-aware enterprise agents, get started with HydraDB's SDKs and see how a dedicated context and memory layer transforms AI agent reliability.

FAQ: Mem0 alternatives 2026, Zep alternatives, and Supermemory alternatives

What's the difference between a vector database and an AI agent memory/context layer?

A vector database retrieves semantically similar text. An agent context layer also models entities, relationships, evolving user state, decision traces, and temporal truth so agents can retrieve the right facts for the right tenant and time period.

How do you prevent "destructive updates" and outdated facts in agent memory?

Use temporal versioning (valid-time windows or bitemporal history) so new facts don't overwrite old ones, and the agent can query what was true "then" vs "now."
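A minimal sketch of valid-time versioning, assuming facts are asserted in chronological order. The names `FactVersion`, `TemporalFact`, and `as_of` are illustrative only, and a bitemporal design would additionally record when each version was written.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactVersion:
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still current"


class TemporalFact:
    """Keeps every version of a fact instead of overwriting the old value."""

    def __init__(self):
        self.versions = []

    def assert_value(self, value, at):
        # Close the currently open window, then open a new one.
        if self.versions:
            self.versions[-1].valid_to = at
        self.versions.append(FactVersion(value, valid_from=at))

    def as_of(self, when):
        for v in self.versions:
            if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
                return v.value
        return None


city = TemporalFact()
city.assert_value("Berlin", datetime(2024, 1, 1))
city.assert_value("Lisbon", datetime(2025, 6, 1))
print(city.as_of(datetime(2024, 7, 1)), city.as_of(datetime(2025, 7, 1)))
```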

When should I keep using Mem0 or Zep instead of switching?

Keep using them if you only need lightweight episodic chat memory for a single app, or if you are already using Zep's Graphiti architecture for temporal tracking and don't require HydraDB's structured context ingestion or enterprise tenancy.

Which alternative is best for long-running autonomous agents that manage their own memory?

Letta (MemGPT), because the agent can page information between working and archival memory and decide what to store or evict during execution.

What's the best option if I just want a DIY semantic search infrastructure?

Qdrant or Weaviate. Use them as the vector layer, but expect to build your own entity resolution, temporal logic, and retrieval routing.

What's Memento best for compared to Mem0/Zep?

Memento is best for chronological chat history and basic user profiles when you want a focused episodic memory service without graph/temporal complexity.

Can you self-host these tools for SOC2/HIPAA/GDPR requirements?

It depends. Qdrant/Weaviate are commonly self-hosted. Enterprise context layers may offer VPC/single-tenant or on-prem options for regulated environments.

How do I migrate from Mem0 or Zep to a new memory layer?

Export existing memories, re-ingest them through the new system's ingestion pipeline (often re-embedding and entity linking), and update write paths so that new facts are versioned rather than overwritten.

How do you ensure tenant isolation so one customer's memory can't leak to another?

Enforce tenant-scoped reads and writes at the storage/query layer (e.g., tenant_id/user_id), and add access controls and audit logs for every retrieval and update.


Build vs Buy a Managed Streaming Platform for Real-Time RAG in 2026

Aman Puri — Sat, 27 Jun 2026 15:17:32 +0000

Moving a retrieval-augmented generation (RAG) prototype from a Python notebook into production isn't an API orchestration challenge. It's a distributed systems problem. For engineering managers and data platform leads, the build-versus-buy decision on streaming infrastructure will dictate your artificial intelligence (AI) feature velocity for the next three to five years.

This guide assumes you've already prototyped a RAG pipeline. The question we tackle here is what changes when you put it in front of customers, where the real cost lives, and how to choose a streaming foundation that won't trap your team in maintenance work for the next decade.

Executive Summary

The problem. Production real-time RAG is a streaming-systems problem, not an API-orchestration problem. DIY pipelines accumulate an integration tax that compounds over time, slowing AI feature velocity to a crawl.

The recommendation. For most enterprises, buying a unified managed streaming platform that delivers stream, connect, process, and govern under a single service-level agreement (SLA) is the correct choice. It should ship with AI-native primitives built in: in-flight embedding generation, Streaming Agents, and context served via the Model Context Protocol (MCP).

The evidence.

A single production change data capture (CDC) connector typically takes three to six engineering months to build and stabilize
DIY paths break against the serverless ceiling (e.g., AWS Lambda's 15-minute execution limit) and bleed cross-availability zone (AZ) egress at $0.01 per GB
Confluent customers like Henry Schein One, Notion, and Palmerston North City Council credit the platform for moving high-quality data fast enough to power production AI

The build. A production-grade platform powered by the Kora engine (GBps+ throughput, 99.99% SLA, fully compatible with Apache Kafka® APIs), more than 120 connectors with more than 80 fully managed (PostgreSQL Debezium, Oracle CDC and XStream, Snowflake, S3), Confluent Cloud for Apache Flink® with ML_PREDICT and AI_COMPLETE for in-flight embeddings, Stream Governance (Schema Registry, Data Contracts, Stream Catalog, Stream Lineage), and Confluent Intelligence (Streaming Agents, Real-Time Context Engine, and built-in ML functions) for agentic AI.

Scope. This guide is for engineering managers and data platform leads weighing build versus buy for a real-time RAG initiative. Build is still the right answer if you're air-gapped, have extreme customization needs, or have a large platform team to staff ongoing operations.

What Real-Time RAG Looks Like in Production

Production RAG is never just a stateless app calling a vector database. When you shift from static file uploads to enterprise real-time context, the architecture becomes a persistent, stateful streaming data problem.

The invisible components in this diagram demand continuous synchronization. CDC ingestion from operational databases translates complex, high-throughput row-level updates into event streams. Those change events need to be normalized, chunked, and routed to embedding APIs (OpenAI, Cohere, Amazon Bedrock, Voyage AI, or self-hosted models). The generated vectors must then be securely upserted into your vector database (Pinecone, Weaviate, Milvus, or PostgreSQL using pgvector) while you continuously monitor end-to-end freshness.

Operating this pipeline exposes teams to demanding day two distributed system operations. You need to handle late-arriving data via precise stream watermarking without corrupting the vector index. You need to gracefully process upstream schema changes, like a suddenly dropped column, without breaking downstream chunking logic. And when your AI team upgrades their foundation model, you face the challenge of dual-writing to new indexes and re-embedding millions of historical records without triggering application downtime.

These aren't problems you can solve with simple Python scripts or basic batch cron jobs. They require handling continuous database updates, maintaining strict idempotency to prevent duplicate embeddings, and executing high-throughput writes. If you don't treat RAG synchronization as a hardened data layer reality, you'll end up with index bloat, stale context, and degraded AI output quality.
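One common way to get idempotent writes is to derive each vector's ID deterministically from the source row, so a replayed change event overwrites the same entry instead of adding a duplicate. The sketch below assumes one chunk per row and uses a hypothetical in-memory index. The event shape (`op`, `table`, `pk`, `text`) is also an assumption, not a Debezium or vendor format.

```python
import hashlib


def embedding_id(source_table, primary_key, chunk_index, model_version):
    """Deterministic ID: the same chunk of the same row always maps to the same key."""
    raw = "|".join([source_table, str(primary_key), str(chunk_index), model_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Stand-in for a vector database that supports upsert and delete by ID."""

    def __init__(self):
        self.rows = {}

    def upsert(self, vector_id, vector):
        self.rows[vector_id] = vector

    def delete(self, vector_id):
        self.rows.pop(vector_id, None)


def apply_change(index, event, embed, model_version="model-v1"):
    """Apply one change event. Replaying the same event leaves the index unchanged."""
    vector_id = embedding_id(event["table"], event["pk"], 0, model_version)
    if event["op"] == "delete":
        index.delete(vector_id)
    else:
        index.upsert(vector_id, embed(event["text"]))


index = InMemoryIndex()
fake_embed = lambda text: [float(len(text))]  # placeholder for a real embedding call
event = {"op": "upsert", "table": "tickets", "pk": 42, "text": "printer on fire"}
apply_change(index, event, fake_embed)
apply_change(index, event, fake_embed)  # duplicate delivery
print(len(index.rows))
```

Including the model version in the ID also gives each embedding model its own key space, which is one way to keep old and new vectors apart while a re-embedding backfill is running.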

Faced with these realities, teams pick one of two paths. Build is the natural starting point. Here's why it usually doesn't end there.

Building Real-Time RAG Pipelines: Hidden TCO and the Integration Tax

Engineering teams initially lean toward building their own streaming infrastructure for valid reasons. Extreme customizability, specialized networking protocols, strict air-gapped GovCloud compliance, and a mandate to avoid perceived vendor lock-in often drive the decision to assemble raw open source components.

But these architectures rapidly hit the "serverless ceiling."

Initial RAG pipelines built on serverless functions or batch jobs buckle under continuous CDC ingestion. Standard serverless limits, such as AWS Lambda's strict 15-minute execution limit, break long-running streaming state. Lambda's Kafka Event Source Mapping (ESM) handles polling for free, but you still pay $0.0000166667 per GB-second plus request fees on every invocation, and the stateless invocation model leaves no room for the stateful joins, watermarks, or exactly-once guarantees that production CDC pipelines need.

The architectural breaking point arrives when your team stops shipping differentiated AI features and starts maintaining fragile infrastructure. Highly paid engineers spend their sprints tuning Kafka partitions, managing distributed dead letter queues (DLQs), rewriting broken connector scripts, and orchestrating complex re-embedding workflows when a large language model (LLM) is upgraded.

This operational drag is the "integration tax."

Stitching together best-of-breed raw cloud components comes with an ever-growing maintenance burden that stalls feature velocity. Building and stabilizing a single production-grade CDC connector typically consumes three to six engineering months of labor. That's because building a connector involves navigating single-threaded snapshot bottlenecks, handling complex state management, and overcoming performance barriers. For example, the Debezium PostgreSQL connector is architecturally limited to one streaming task, meaning a single thread captures all changes in order. Under high write volumes, this causes lag and requires multiple connectors to scale, adding to the complexity of partitioning and reassembly.

The total cost of ownership (TCO) formula has three components: infrastructure (compute, storage, network), operations (labor), and hidden costs (downtime, opportunity cost, cross-AZ traffic). Self-managed deployments also incur a "state tax." Managing Flink requires tuning RocksDB block caches and remote durable storage for checkpoints. Multi-AZ open source Kafka deployments silently rack up massive AWS cross-AZ data transfer fees at $0.01 per GB.
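To make the three-part formula concrete, here is a small sketch that sums the buckets and estimates the cross-AZ line item at the $0.01 per GB rate quoted above. The workload numbers are placeholders, and the share of traffic that crosses zones depends on your topology, so treat `cross_az_fraction` as an input you measure rather than a constant.

```python
CROSS_AZ_USD_PER_GB = 0.01  # rate quoted in this article; confirm against current AWS pricing


def cross_az_monthly_usd(gb_per_day, cross_az_fraction, days=30):
    """Monthly transfer cost for the share of traffic that crosses zones."""
    return gb_per_day * cross_az_fraction * days * CROSS_AZ_USD_PER_GB


def annual_tco_usd(infrastructure, operations, hidden):
    """Sum of the three TCO buckets, each expressed as annual USD."""
    return infrastructure + operations + hidden


# Hypothetical workload: 2,000 GB/day through the cluster, half of it crossing zones.
monthly_transfer = cross_az_monthly_usd(gb_per_day=2000, cross_az_fraction=0.5)
print(f"cross-AZ transfer: ${monthly_transfer:,.2f} per month")
```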

The table below maps each of those three buckets to where DIY teams pay versus what a unified managed platform absorbs.

TCO Comparison by Cost Component: Custom Build vs Unified Managed Platform

Cost component	Self-managed (open source Kafka, Flink, and connectors)	Unified managed platform (e.g., Confluent)
Broker infrastructure	Self-managed VMs, 24/7 on-call, multi-AZ egress at $0.01 per GB	Fully managed, 99.99% SLA, optimized cross-AZ paths
Connectors	Three to six engineering months per source for the first version, plus ongoing schema-drift fixes	More than 80 fully managed connectors out of the box, no source-side maintenance
Stream processing	Self-managed Flink: RocksDB tuning, checkpoint storage, JVM upgrades	Serverless Flink, billed per Confluent Unit for Flink (CFU) consumed, hard spending caps available
Embedding tier	Separate fleet of Python embedding workers, plus queue and retry logic	`ML_PREDICT` and `AI_COMPLETE` inside the stream processor, no separate worker tier
Governance and lineage	Build your own schema registry, lineage tracker, and role-based access control (RBAC) layer	Schema Registry, Data Contracts, Stream Catalog, Stream Lineage included
Operational labor	0.5 to 2 dedicated platform FTEs at small or medium scale, multiple teams at enterprise	Capacity reclaimed for AI feature work

Specific dollar values vary widely by workload, region, and data volume. Anyone who hands you a single annual figure without your topology in hand is selling you a number. Forrester's Total Economic Impact study of Confluent Cloud is a defensible starting point for benchmarking your own scenario against a self-managed open source build, and Confluent's public cost estimator lets you size a workload directly.

Generating embeddings natively inside the stream processor eliminates the need to provision, scale, and monitor a separate fleet of Python embedding workers, reducing both your cloud bill and operational headcount.

How to Evaluate Managed Streaming Platforms for Real-Time RAG in 2026

With the cost of building mapped, the next question is what a managed alternative actually needs to deliver to absorb that complexity. Evaluating managed streaming platforms for RAG workloads requires moving beyond basic throughput benchmarks. In 2026, production-grade data streaming infrastructure must natively execute four foundational capabilities: stream, connect, process, and govern. On top of those four, it needs dedicated AI-native primitives (in-flight embedding, MCP-served context, agent runtime) under a single SLA.

The four subsections below cover the foundational capabilities. The fifth covers the AI-native layer that sits on top of them.

Stream: Throughput, Latency, and Uptime Requirements

Your foundational messaging layer must support GBps+ throughput, ultra-low tail latency, and a 99.99% uptime SLA, without manual partition rebalancing.

Modern cloud-native engines, like the Kora engine, which powers Confluent Cloud, decouple compute from storage to deliver 10x faster autoscaling and 10x lower tail latencies than self-managed Kafka while staying fully compatible with Apache Kafka® at the protocol level. Your existing producers and consumers keep working as they are. Cluster Linking creates real-time replicas of existing Kafka data and metadata for zero-downtime migration when you move away from open-source Kafka. The decoupled architecture means a cluster absorbs sudden ingestion spikes (common during a backfill or re-embedding window) without you having to lift a finger.

Connect: Fully Managed CDC and Connector Coverage

Evaluate platforms strictly on the breadth and depth of their fully managed connector ecosystem. You need out-of-the-box support for complex CDC workloads, software-as-a-service (SaaS) applications, and object storage.

A platform offering more than 120 connectors, where more than 80 are fully managed (including complex integrations like Postgres Debezium, Oracle CDC, and Snowflake), lets your engineers provision reliable data pipelines in minutes rather than dedicating months to custom development.

Process: Stateful Stream Processing and In-Flight Embeddings

Stream processing must be serverless, support stateful joins, and execute in-flight machine learning (ML) inference. Transforming a text column into a vector embedding directly inside the stream processor simplifies your architecture.

Engines like Confluent Cloud for Apache Flink ship SQL functions like ML_PREDICT and AI_COMPLETE that replace a separate embedding worker tier. Your data engineer writes one ANSI SQL statement to turn a text column in a Kafka topic into a continuous stream of vector embeddings, and the platform handles batching, retries, and rate limits against the embedding API. The same engine supports Python and Java for cases where SQL isn't expressive enough, useful for custom chunking strategies or hybrid retrieval logic.

What's distinctive about Confluent Cloud for Apache Flink is the combination of three languages, native AI functions, and a managed runtime sharing one SLA with the broker. The closest AWS path pairs Amazon Managed Streaming for Apache Kafka (MSK) with Amazon Managed Service for Apache Flink (MSF), which delivers a real Flink runtime supporting SQL, Python, and Java but ships no ML_PREDICT or AI_COMPLETE equivalent and sits on a separate SLA from MSK. MSK paired with Lambda is simpler for short enrichment, but Lambda's 15-minute execution wall breaks long-running streaming state. Open source Flink demands deep Java fluency and a self-managed cluster, and Redpanda has no native Flink at all (its in-broker WebAssembly transforms are sandboxed and limited, by Redpanda's own admission, to "trivial and stateless" cases).

The processing engine must guarantee exactly-once semantics. Without advanced two-phase commit protocols, retry loops will push duplicate embeddings or miss delete commands, permanently corrupting your RAG context.

The processor must also offer robust failure handling (configurable backpressure, buffer debloating, exponential retries, and dead letter queues) to safely navigate strict API rate limits from LLM embedding providers.
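The retry and dead letter behavior described above can be sketched in a few lines. This is a generic illustration, not how Flink or any managed platform implements it. `RateLimitError`, `embed_with_retry`, and the list used as a dead letter queue are all hypothetical.

```python
import time


class RateLimitError(Exception):
    """Raised by the (hypothetical) embedding client when the provider throttles."""


def embed_with_retry(text, embed_fn, dead_letters,
                     max_attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Call embed_fn with exponential backoff; park the record in a DLQ on exhaustion."""
    for attempt in range(max_attempts):
        try:
            return embed_fn(text)
        except RateLimitError:
            if attempt == max_attempts - 1:
                break
            # Delay doubles on each retry: base, 2x base, 4x base, ...
            sleep(base_delay_s * (2 ** attempt))
    dead_letters.append(text)  # keep the record for later replay instead of dropping it
    return None
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Production code would usually add jitter to the delay and cap the maximum backoff.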

Govern: Data Contracts, Catalog, Lineage, and Access Control for RAG

AI outputs are only as trustworthy as their inputs. You need enterprise-grade governance to keep RAG indexes secure, traceable, and accurate.

Start with a Schema Registry that enforces strict Data Contracts, preventing an upstream database change from silently breaking your downstream embedding pipeline. Pair it with a Stream Catalog that organizes Kafka topics as discoverable data products with metadata tagging, search, and self-service access requests, so AI teams can find and adopt trusted streams without bottlenecking on a central data engineering team.

Stream Lineage gives you the audit trail every AI agent's context source needs, answering "where did this RAG document come from, and what schema version produced its embedding?" RBAC, client-side field-level encryption (CSFLE), and masking ensure personally identifiable information (PII) is masked before it ever reaches the vector database.
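As a simplified picture of what contract enforcement and masking do before a record reaches the embedding step, consider the sketch below. It is plain Python, not the Schema Registry, Data Contracts, or CSFLE APIs. The contract dictionary, field names, and email pattern are assumptions chosen for illustration.

```python
import re

# Hypothetical contract: required field name -> expected Python type.
TICKET_CONTRACT = {"ticket_id": int, "body": str}

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contract_violations(record, contract):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in contract.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    return problems


def mask_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def prepare_for_embedding(record, contract=TICKET_CONTRACT):
    """Reject records that break the contract, then mask PII in the text to embed."""
    problems = contract_violations(record, contract)
    if problems:
        raise ValueError("; ".join(problems))
    return mask_emails(record["body"])
```

A dropped upstream column then fails loudly at `prepare_for_embedding` instead of silently producing embeddings from incomplete text.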

AI-Native: Streaming Agents, MCP Context, and Built-In ML

A modern streaming platform must speak the language of agentic AI. The four foundational capabilities above keep your data plane reliable. The AI-native layer on top is what turns it into a substrate for production agents.

Confluent Intelligence is the dedicated AI layer of the data streaming platform and ships three components on top of Kafka and Flink:

Streaming Agents. Agents that run as Flink jobs inside the stream processing pipeline, with always-on state, tool calling via MCP and Agent2Agent (A2A), and replayable, governed event flows. Because they are Flink jobs, the same exactly-once and lineage guarantees apply to agent decisions.
Real-Time Context Engine. A fully managed service that serves structured context to AI apps and agents over the Model Context Protocol, with built-in authentication, RBAC, and audit logging. MCP integrations include LangChain, Amazon Bedrock, Salesforce Agentforce, and Anthropic Claude.
Built-in ML functions. Native Flink SQL functions for embedding, anomaly detection, fraud prevention, forecasting, and sentiment analysis, with hooks to invoke remote AI/ML models or custom ones.

Tableflow extends these same Kafka topics into open table formats (Apache Iceberg™ and Delta Lake), so the streams that feed your real-time RAG pipeline form the bronze and silver layers of an analytics medallion stack. Tableflow eliminates separate ETL pipelines and shifts processing and governance left, an approach Confluent reports cuts analytical compute costs by up to 30% and reduces data quality issues by up to 60%, while giving AI agents readily queryable historical context alongside their real-time streams.

Streaming Platform Comparison: Custom Build, MSK, Redpanda, Confluent

Apply those evaluation criteria to the market, and the practical streaming choices for a real-time RAG initiative narrow to four. You can roll your own with open source components, lean on a hyperscaler-managed broker like MSK, pick a Kafka-compatible alternative like Redpanda, or buy a complete data streaming platform like Confluent. Each has a defensible use case. Only one was designed end-to-end for production agentic AI.

At a Glance: How Each Option Covers the Four Capabilities Plus AI-Native Primitives

Option	Stream	Connect	Process	Govern	AI-native
Custom build (self-managed Kafka, Flink, and connectors)	Self-managed	Self-managed	Self-managed	Self-managed	DIY
AWS MSK + Glue + MSF/Lambda	✓ Managed broker, 99.9% SLA (infrastructure only)	Bring your own connectors, limited managed CDC	Bolt-on via MSF (separate SLA from MSK, no `ML_PREDICT`/`AI_COMPLETE`) or Lambda (15-min cap)	Piecemeal (Glue Schema Registry is primarily Java-focused, no unified catalog or lineage)	Bring your own
Redpanda	✓ C++ Kafka-compatible broker, 99.99% multi-zone / 99.5% single-zone, bring your own cloud (BYOC) option	More than 10 fully managed connectors	No native Flink (in-broker WebAssembly only)	Basic schema registry, no Stream Catalog or Stream Lineage	Bring your own
Confluent	✓ Kora engine, 99.99% SLA covering infrastructure and Kafka software	✓ More than 120 connectors, more than 80 fully managed	✓ Serverless Flink with `ML_PREDICT` and `AI_COMPLETE`	✓ Schema Registry, Data Contracts, Stream Catalog, Stream Lineage, CSFLE, bring your own key (BYOK)	✓ Confluent Intelligence (Streaming Agents, Real-Time Context Engine, built-in ML functions)

The subsections below profile the best fit and the trade-offs for each option. The decision matrix later in the article maps these options to specific organizational profiles.

Custom Build: Self-managed Kafka, Flink, and Connectors

The traditional self-managed approach involves provisioning open source Kafka, managing KRaft (or legacy ZooKeeper) quorums, deploying Flink clusters, and writing custom Python workers for chunking and vector embeddings.

Best for: massive enterprises with dedicated, heavily staffed infrastructure teams, extensive legacy on-premises deployments, unique networking constraints, and extreme customization requirements.

Trade-offs: you assume the maximum possible operational burden and get zero vendor SLAs on integrations, which means your team handles all edge cases, schema evolutions, and scaling events. This path incurs the highest hidden labor costs and delays time-to-market for AI features.

AWS MSK: AWS-Native Broker With Bolt-On Processing

MSK provides a managed broker experience. Teams often pair MSK with MSF or Lambda for processing and AWS Glue for schema management.

Best for: organizations under strict mandates to use only native AWS services for billing consolidation, or teams already deeply entrenched in the AWS ecosystem and willing to absorb significant day 2 operational burden.

Trade-offs: for production real-time RAG, the gaps add up fast.

First, the ZooKeeper-to-KRaft migration. Apache Kafka removed ZooKeeper entirely in Kafka 4.0. For any MSK customer still running on a ZooKeeper-based cluster (which covers most clusters spun up before AWS added KRaft support to MSK), this is a forced cluster rebuild: MSK has no in-place upgrade path from ZooKeeper to KRaft, so those customers must spin up a new cluster and migrate their data and applications. The technical effort to migrate from ZooKeeper-based MSK to KRaft-based MSK is roughly the same as migrating to Confluent Cloud.

Second, the SLA gap is structural. MSK provides 99.9% uptime covering infrastructure only, with Kafka and ZooKeeper software failures explicitly excluded. That works out to 7.9 additional hours (or more due to exclusions) of potential downtime per year compared to Confluent Cloud's 99.99%, which covers both infrastructure and Kafka software. For a real-time RAG pipeline feeding production AI, the gap of nearly eight hours is the difference between a minor incident and a stale-context outage.
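The downtime figure above follows directly from the SLA percentages. A short sketch of the arithmetic, assuming a 365-day year and ignoring SLA exclusions:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def max_downtime_hours(sla_percent):
    """Hours per year a service can be down while still meeting its SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def extra_downtime_hours(lower_sla, higher_sla):
    """Additional allowed downtime under the lower SLA compared with the higher one."""
    return max_downtime_hours(lower_sla) - max_downtime_hours(higher_sla)


print(round(extra_downtime_hours(99.9, 99.99), 1))  # 8.76 h minus 0.876 h
```

The same function applies to any pair of SLAs, for example a 99.5% single-zone tier against a 99.99% multi-zone one.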

Third, the hidden costs compound. MSK's apparent low price expands once you account for monitoring beyond CloudWatch's basic tier (topic-level metrics cost extra), a Kafka UI (MSK ships none), Cruise Control for partition rebalancing on Standard clusters, schema registry self-management (Glue Schema Registry primarily supports Java clients), proxy infrastructure, and a Private Certificate Authority for mTLS. Layer on a processing tier you assemble yourself: MSF runs on its own SLA separate from MSK and ships no ML_PREDICT or AI_COMPLETE equivalents, and Lambda is bound by a 15-minute execution wall that breaks long-running streaming state. Add a piecemeal governance story across Glue, Identity and Access Management (IAM), and CloudWatch with no unified Stream Catalog or Stream Lineage equivalent, and you're stitching multiple disparate services together with no single SLA, no Kafka-specific support, and AWS-only deployment with no multi-cloud or hybrid path.

Companies like Square, Instacart, iFood, SmartThings, and SecurityScorecard switched from MSK to Confluent because the operational burden and feature gaps became intolerable at scale. SecurityScorecard alone reports more than $1 million in savings after switching from MSK to Confluent.

Redpanda: Kafka-Compatible Broker Without a Full RAG Platform

Redpanda is a C++ Kafka clone with high (but not 100%) Kafka API compatibility, packaged across community on-premises, BYOC, dedicated, and serverless tiers.

Best for: small teams running simple event logging or edge workloads where C++ thread-per-core architecture and broker-level p99 latency are the primary constraints.

Trade-offs: Redpanda is a broker, not a data streaming platform, and the platform gap matters most for production RAG.

First, it isn't fully compatible with the Kafka API. Partial compatibility means edge cases break with tools that the open-source Kafka community treats as standard. Redpanda's "225 connectors" headline counts processors, which are equivalent to Kafka's single-message transforms (SMTs). The genuine production-ready connector count is a fraction of that figure, none of which are offered as a managed service, compared with Confluent's more than 120 connectors, with more than 80 fully managed.

Second, performance claims deserve scrutiny. Redpanda's "10x faster than Kafka" headline holds in synthetic, single-producer benchmarks. It degrades in real production workloads with larger producer groups, record keys, and long-running tests. Confluent's Kora engine, on production-shaped workloads, has been measured up to 10x faster than self-managed Kafka and delivers GBps+ throughput with elastic scaling rather than tier-based manual sizing.

Third, compliance and reliability are uneven. Redpanda lists two production-grade certifications (SOC 2 and GDPR readiness, plus a recent HIPAA self-attestation) against Confluent's 10 (SOC 1/2/3, ISO 27001/27701, PCI DSS, CSA Star, TISAX, HITRUST, HIPAA). The single-zone Redpanda BYOC and Dedicated SLA is 99.5%, equivalent to approximately 43 more hours of potential downtime per year than Confluent Cloud. Redpanda BYOC additionally requires installing an agent inside your virtual private cloud (VPC) with break-glass support access for Redpanda engineers, a model that enterprise security teams with strict data sovereignty requirements may find concerning.

Fourth, stream processing is bolt-on. Redpanda's in-broker WebAssembly transforms are sandboxed and, by Redpanda's own admission, limited to "trivial and stateless" cases. There is no native Flink, no ML_PREDICT or AI_COMPLETE equivalent, no Stream Lineage, no Stream Catalog, no client-side field-level encryption, and no BYOK. Customers building real-time RAG end up assembling external processing and governance, which puts them back at the integration tax we already mapped.

Real customer migrations underscore the gap. Elemental Cognition, an AI digital native, switched from Redpanda to Confluent Cloud for mission-critical real-time workloads.

Confluent: Unified Streaming Platform for Real-Time RAG

Confluent delivers a complete data streaming platform that encompasses the Kora engine, Confluent Cloud for Apache Flink, more than 120 connectors (more than 80 of them fully managed), Stream Governance, Tableflow, and Confluent Intelligence under one SLA.

Best for: enterprises that need to stream, connect, process, and govern data under a single 99.99% SLA covering both infrastructure and Kafka software, and especially for teams building production-grade agentic AI applications who want first-class AI primitives natively integrated into the data plane.

Trade-offs: Confluent's list price can feel premium for basic, low-volume logging use cases. For complex, multi-source RAG architectures, the consolidated ecosystem typically yields the lowest TCO once connector development time, embedding worker tier consolidation, and avoided governance build-out are included. Forrester's Total Economic Impact study reports 257% ROI and $2.58M in savings over self-managed Apache Kafka, and Confluent's migration cost analysis shows up to 60% TCO reduction.

The Confluent advantage stack is concrete. Kora delivers GBps+ throughput with full Kafka protocol compatibility, so your existing producers and consumers don't change. Cluster Linking gives you a zero-downtime migration path from MSK or self-managed Kafka. Stream Governance bundles Schema Registry, Data Contracts, Stream Catalog, and Stream Lineage into a single suite, and CSFLE and BYOK lock down PII before it reaches the vector index.

The people and the AI layer round it out. Confluent was founded by the original co-creators of Apache Kafka. It’s one of the largest contributors to the Apache Kafka open source project, and offers committer-led support with a 60-minute contractual P1 response. On top of that foundation, Confluent Intelligence ships Streaming Agents, the Real-Time Context Engine, and built-in ML functions as native primitives, which is exactly the surface area a production RAG pipeline needs.

Customer evidence backs the position. Henry Schein One frames it directly: "Everyone wants AI, but the hard part is getting high-quality data moving in real time. The Confluent data streaming platform makes that possible for us." Notion attributes its ability to keep AI tools fed with up-to-the-second context to Confluent's managed connector and streaming layer. The Palmerston North City Council team summarizes the AI-data dependency clearly: "Good AI needs good data. Confluent is our trusted source of truth. The data streaming platform provides context and orchestration for our AI agents to automate workflows and accelerate our smart city transformation." SecurityScorecard reports more than $1 million in savings after switching from MSK to Confluent. The pattern is consistent: when teams move from a piecemeal stack to a unified platform, the AI roadmap unlocks.

Decision Matrix: Which Streaming Approach Fits Your Real-Time RAG Needs?

Choosing the right streaming infrastructure requires an assessment of your organizational constraints, existing engineering headcount, and strategic AI goals.

Organizational constraints and engineering profile	Recommended approach
If you have: Strict air-gapped environments, unique networking protocols, a dedicated team of more than 20 infrastructure engineers, and a mandate to avoid commercial software.	Choose: Custom build. The heavy integration tax and high labor costs are justified by absolute architectural control.
If you have: Predominantly simple event logging needs, low data volume, edge or single-zone deployments where the 99.5% single-zone SLA is acceptable, and a preference for a C++ broker.	Choose: Redpanda. Redpanda provides a low-footprint Kafka-compatible broker for targeted workloads, though you sacrifice platform completeness, governance, and a managed connector ecosystem.
If you have: A strict mandate to consolidate cloud billing within AWS, existing expertise in AWS Glue, AWS-only deployment with no multi-cloud or hybrid plans, and a willingness to absorb a forced ZooKeeper-to-KRaft migration.	Choose: AWS MSK. MSK offers native billing integration, provided you accept the 99.9% infrastructure-only SLA, several categories of hidden costs, and heavier orchestration overhead.
If you have: Multiple complex data sources, strict enterprise data governance requirements, the need to inject real-time context into AI agents, and a strategic mandate to ship fast.	Choose: Confluent. Confluent eliminates the integration tax, delivers stream, connect, process, govern, and AI-native primitives under one 99.99% SLA, and supports zero-downtime migration from MSK or self-managed Kafka via Cluster Linking.

Build vs Buy: Making the Call

Real-time RAG is a streaming systems problem before it is an AI problem. That single reframe is what separates teams who ship production AI from teams who stall in pilot purgatory.

The case for building is narrow and well-defined. If you operate in an air-gapped or sovereign environment, have unique networking constraints, or already staff a team of more than 20 engineers dedicated to Kafka and Flink operations, the upfront flexibility of open source components can justify the integration tax.

For most enterprises, that case doesn't apply. The cost math in this article is not subtle: three to six engineering months per CDC connector, a serverless ceiling that breaks long-running streaming state, and cross-AZ egress fees that compound silently. None of those costs shows up in a vendor proposal. They show up two years in, when your AI roadmap is being held hostage by day-two operations on infrastructure your team didn't set out to own.

A unified managed streaming platform shifts that math. Stream, connect, process, and govern collapse into one SLA. The embedding worker tier disappears into Confluent Cloud for Apache Flink. Schema Registry, Data Contracts, and Stream Lineage replace governance you would otherwise build yourself. And on top of those four foundational capabilities, AI-native primitives (Streaming Agents, Real-Time Context Engine, and built-in ML functions) give your agent teams a substrate they can actually ship against.

If your organization is building agentic AI and needs continuous, trusted context, Confluent is the streaming foundation that absorbs the integration tax instead of charging you for it. To go deeper, explore Confluent's ML_PREDICT and AI_COMPLETE model-inference functions inside Confluent Cloud for Apache Flink, or model your own infrastructure savings with Confluent's cost estimator.

Frequently Asked Questions

What is "real-time RAG" and why does it require streaming infrastructure?

Real-time RAG continuously syncs changes from operational systems into a vector index so LLM responses use fresh context. That requires CDC ingestion, stateful processing, and reliable delivery, not periodic batch jobs.

How do you keep a vector database in sync with Postgres or Oracle changes?

Use CDC connectors to capture inserts, updates, and deletes. Process those events to chunk text and generate embeddings, then apply the matching upserts and deletes to the vector database to prevent drift.

What is the "integration tax" in a DIY RAG pipeline?

The integration tax is the ongoing engineering cost of stitching together and operating connectors, stream processing, retries and dead letter queues (DLQs), schema evolution handling, and re-embedding workflows. It often dwarfs the initial build effort.
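
To make one slice of that tax concrete, here is a minimal sketch of the retry and dead letter queue pattern a DIY pipeline has to own. The function name `process_with_dlq`, the `handler` callback, and the dead-letter record shape are illustrative assumptions, not any library's API.

```python
from typing import Any, Callable


def process_with_dlq(
    events: list[Any],
    handler: Callable[[Any], Any],
    max_attempts: int = 3,
) -> tuple[list[Any], list[dict]]:
    """Run handler on each event, retrying failures and dead-lettering the rest."""
    delivered: list[Any] = []
    dead_letters: list[dict] = []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                delivered.append(handler(event))
                break  # success: move on to the next event
            except Exception as exc:
                # A production system would back off between attempts;
                # this sketch retries immediately to stay self-contained.
                if attempt == max_attempts:
                    dead_letters.append(
                        {"event": event, "error": str(exc), "attempts": attempt}
                    )
    return delivered, dead_letters
```

Even this toy version raises the questions that consume engineering time in practice: which errors are retryable, who drains the dead letter queue, and how replayed events avoid double-writing downstream.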

Where do real-time analytics databases fit in a real-time RAG architecture?

Real-time analytics databases serve a different role from streaming platforms. The streaming platform handles ingestion, processing, governance, and delivery. A real-time analytics database sits downstream as a query engine, powering sub-second dashboards, operational monitoring, and ad-hoc investigation over the same governed event streams. In architectures that use Tableflow, the analytics engine can query Kafka topics directly as Iceberg tables without a separate ETL pipeline.

How long does it take to build a production-grade CDC connector?

Commonly, three to six engineering months per connector, once you include snapshots, backfills, failure handling, schema changes, and operational runbooks.

Why do exactly-once semantics matter for embeddings and vector upserts?

Without exactly-once semantics, retries can create duplicate embeddings or miss deletes, corrupting the vector index and leading to stale or incorrect retrieval results.
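
Exactly-once delivery is a property of the streaming platform, but the sink can also be made tolerant of replays. The sketch below shows that complementary technique, idempotent upserts keyed by a deterministic ID, rather than exactly-once semantics themselves. The `vector_id` and `upsert` helpers and the dictionary index are hypothetical.

```python
import uuid


def vector_id(source_key: str, chunk_no: int) -> str:
    """Derive a stable ID from the source row key and chunk position."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_key}/{chunk_no}"))


def upsert(index: dict[str, list[float]], source_key: str,
           chunk_no: int, vector: list[float]) -> None:
    """Write a vector under its deterministic ID, overwriting any earlier copy."""
    index[vector_id(source_key, chunk_no)] = vector


index: dict[str, list[float]] = {}
for _ in range(3):  # simulate the same event being redelivered three times
    upsert(index, "orders/42", 0, [0.1, 0.2])
```

Because the ID depends only on the source key and chunk number, a redelivered event overwrites its earlier copy instead of adding a duplicate. Random IDs generated per write would not have this property. Missed deletes still need delivery guarantees upstream.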

What happens when the source schema changes (schema evolution)?

Pipelines can break or silently produce wrong embeddings unless schemas are governed with contracts and a registry, and downstream processors are built to tolerate additive changes and to handle breaking ones explicitly.
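
As a rough illustration of the additive versus breaking distinction, the sketch below classifies a change between two schema versions. It assumes a simplified schema represented as a mapping of field name to type name. Real registry compatibility rules are richer (defaults, optional fields, type promotion), so treat `classify_change` as a hypothetical teaching aid only.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Label a schema change as 'breaking', 'additive', or 'unchanged'."""
    removed = old.keys() - new.keys()
    retyped = {field for field in old.keys() & new.keys() if old[field] != new[field]}
    if removed or retyped:
        # Consumers that read the old fields would fail or misread data.
        return "breaking"
    if new.keys() - old.keys():
        # Only new fields appeared; existing readers can ignore them.
        return "additive"
    return "unchanged"


v1 = {"id": "long", "body": "string"}
v2 = {"id": "long", "body": "string", "lang": "string"}
v3 = {"id": "long", "content": "string"}
```

Here `classify_change(v1, v2)` is additive, while `classify_change(v1, v3)` is breaking: a renamed text field is exactly the kind of change that makes an embedding job silently embed empty strings.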

How do you handle re-embedding when you change models or chunking logic?

You typically dual-write to a new index, backfill historical records, and cut over once parity is verified. This requires orchestration, lineage, and careful rollback planning.
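
The dual-write, backfill, and cutover sequence can be sketched as a small state machine. Everything here is an illustrative assumption: the `DualIndex` class, the in-memory dictionaries, and a parity check that only compares key sets. A real parity check would also compare retrieval quality between the two indexes.

```python
from typing import Callable

Embedder = Callable[[str], list]


class DualIndex:
    """Two in-memory indexes plus a pointer to the one serving reads."""

    def __init__(self, embed_old: Embedder, embed_new: Embedder) -> None:
        self.embed_old, self.embed_new = embed_old, embed_new
        self.old: dict = {}
        self.new: dict = {}
        self.active = "old"

    def write(self, key: str, text: str) -> None:
        """Dual-write: every live change lands in both indexes."""
        self.old[key] = self.embed_old(text)
        self.new[key] = self.embed_new(text)

    def backfill(self, records: dict) -> None:
        """Embed historical records into the new index, skipping fresher live writes."""
        for key, text in records.items():
            if key not in self.new:
                self.new[key] = self.embed_new(text)

    def parity_ok(self) -> bool:
        """Crude parity check: both indexes hold the same set of keys."""
        return self.old.keys() == self.new.keys()

    def cut_over(self) -> None:
        """Switch reads to the new index only once parity holds."""
        if not self.parity_ok():
            raise RuntimeError("new index is missing records; refusing to cut over")
        self.active = "new"
```

Keeping the old index untouched until after cutover is what makes rollback cheap: flipping `active` back to `"old"` restores the previous behavior without re-embedding anything.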

When is "build" the right choice for real-time RAG streaming?

When you must run in air-gapped or sovereign environments, need extreme customization, or already have a large platform team to own Kafka, Flink, connectors, and 24/7 operations.

Is AWS MSK enough for production real-time RAG?

MSK can cover the broker layer, but teams often still need to assemble connectors, processing, governance, and reliability patterns across multiple services. That raises operational complexity.

What should I look for in a managed streaming platform for RAG in 2026?

Native support for stream, connect, process, and govern, plus AI-ready capabilities like in-flight embedding generation, strong SLAs, schema governance, lineage, and secure PII handling.

How does a unified platform reduce cost compared to separate embedding workers?

If embeddings are generated within the stream processor, you can eliminate the need for a separate fleet of Python workers and the associated scaling, monitoring, retries, and queue management overhead.

How do you prevent PII from entering the vector database?

Apply governance controls (RBAC, masking, data minimization) and enforce policies in-stream before embedding or upserting, so sensitive fields never reach the index.
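
A minimal sketch of in-stream masking and data minimization, applied before a record is embedded, might look like the following. The blocked field names, the email pattern, and the `scrub` helper are illustrative assumptions. A production policy would be driven by schema tags or a governance catalog rather than a hard-coded set.

```python
import re

# Simplified email pattern for illustration; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical fields that must never reach the embedding step.
BLOCKED_FIELDS = {"ssn", "email"}


def scrub(record: dict) -> dict:
    """Return a copy of record that is safe to embed."""
    clean = {}
    for field, value in record.items():
        if field in BLOCKED_FIELDS:
            continue  # data minimization: drop the field entirely
        if isinstance(value, str):
            # Masking: redact emails that appear inside free text.
            value = EMAIL.sub("[REDACTED_EMAIL]", value)
        clean[field] = value
    return clean
```

Running `scrub` as a step ahead of chunking and embedding means the sensitive values are never vectorized, which matters because text cannot be reliably removed from an embedding after the fact.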