DEV Community

PracticeOverflow

Agent Native Data Infrastructure

The Database Didn't Change. The User Did.

At Databricks, agents now create 80% of new databases.

Not schemas. Not tables. Entire databases. Some agent-driven projects have reached 500+ nested branch depths—a topology that no human would create, manage, or even conceptually organize. At PingCAP, over 90% of new TiDB Cloud clusters are provisioned by agents. The primary consumer of database infrastructure is no longer a person.

And here's what nobody wants to admit: every optimization we've built into databases for the last 40 years assumed a human was asking the questions.

Humans have intuition. They know that a sampled trace is "probably fine" because they recognize the pattern from last quarter. They get tired at 2 AM and stop branching their investigation. They read an error message, sigh, and open a runbook they've used before.

Agents do none of this.

An agent won't stop branching after 10 experiments. It'll branch 500 times. It won't accept sampled data as "good enough"—it can't compensate for the discarded traces with gut feeling. It won't tolerate a 20-second query response in a reasoning loop that needs sub-second feedback. And it will absolutely not read your helpful error message and "figure it out."

This week, I tracked six independent announcements from Databricks, CockroachDB, ClickHouse, Confluent, RisingWave, and PingCAP. None of them coordinated. All of them arrived at the same conclusion: the database stack must be redesigned for a non-human consumer. What emerged are six design principles that define agent-native data infrastructure.


5-Minute Skim

If you are short on time, here is the entire argument in six bullets:

  • Copy-on-write everything. Agents need cheap isolation, not expensive duplication. Databricks Lakebase creates database branches in milliseconds via O(1) metadata copy-on-write. Agents create ~4x more databases than humans.
  • SQL as the universal agent interface. LLMs generate SQL fluently. PostgreSQL wire protocol is becoming the lingua franca. CockroachDB, ClickHouse, RisingWave, and Confluent Flink all converge on SQL as the agent-facing surface.
  • Full-fidelity as default. Sampling, rollups, and short retention windows are human compromises that become agent poison. ClickHouse's object storage economics ($0.0005/GB/month effective) make 30-365 day full-fidelity retention the baseline.
  • Scale-to-zero economics. Agents create ephemeral workloads. Billing must match. Lakebase, TiDB, and ClickHouse all offer scale-to-zero or request-unit pricing.
  • MCP as control plane. CockroachDB ships a managed MCP server. Confluent integrates MCP-based tool calling into Flink. ClickHouse exposes an MCP server for constrained SQL. MCP is becoming the de facto agent-to-infrastructure protocol.
  • Agent Experience (AX) as design discipline. ClickHouse's CLI ships "CONTEXT FOR AGENTS" sections in help text. CockroachDB's ccloud CLI is agent-ready. Tools must now be designed for two audiences simultaneously.

What Does "Agent-Native" Actually Mean?

I want to be precise about this term because it's already getting diluted by marketing.

Agent-native doesn't mean "we added an API." It means the infrastructure was designed—or redesigned—around the assumption that autonomous software agents are the primary consumer. The distinction matters because it changes everything: branching models, retention policies, billing granularity, error surfaces, even help text.

Here's the architecture model: six principles, each addressing a specific failure mode when agents hit traditional infrastructure.

Each of these principles emerged independently from different companies solving different problems. That's what makes this interesting. Nobody designed this framework. It crystallized.


Why Do Agents Need Git-for-Databases?

PingCAP published a scale model that reframes how we should think about database provisioning: 10 million databases = 100,000 users x 10 agent tasks x 10 experimental branches.

That number isn't theoretical. Manus 1.5 goes from prompt to code to deploy to database in minutes. Each agent task might spin up multiple experimental branches, test hypotheses against isolated copies of state, and discard 90% of them. The database isn't a place you carefully design and migrate. It's a scratchpad.
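The scale model is simple multiplication, but writing it out makes the compounding explicit. The tier sizes below are the figures PingCAP quotes, not assumptions of mine:

```python
# PingCAP's provisioning scale model: each layer multiplies the one below it.
users = 100_000            # active users on the platform
tasks_per_user = 10        # agent tasks each user kicks off
branches_per_task = 10     # experimental branches per agent task

total_databases = users * tasks_per_user * branches_per_task
print(f"{total_databases:,} databases")  # 10,000,000 databases
```

Note that the branching factor, not the user count, is what traditional provisioning never planned for.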

Traditional database cloning can't handle this. A full clone takes minutes to hours, costs real storage, and requires cleanup. Databricks Lakebase solves this with O(1) metadata copy-on-write branching—the same principle behind Git, but applied to PostgreSQL 17 storage.

New branches inherit schema and data from the parent but share underlying storage via pointers. Only when an agent actually mutates data does the system write new pages. Branch creation takes milliseconds. 97% of dev/test copies use this mechanism.

The insight: agents don't need full database clones. They need isolated views of state with lazy materialization. That's a fundamentally different storage primitive than anything we've built for human consumers.
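To make the primitive concrete, here is a toy sketch of copy-on-write branching—shared pages resolved through parent pointers, with a private page materialized only on first write. This is an illustration of the principle, not Lakebase's actual implementation:

```python
# Toy copy-on-write branch store: branches share parent pages by reference
# and only materialize a private copy of a page on first write.

class BranchStore:
    def __init__(self):
        self.pages = {}   # branch -> {page_id: bytes} (pages owned by branch)
        self.parent = {}  # branch -> parent branch name

    def create_root(self, name, pages):
        self.pages[name] = dict(pages)
        self.parent[name] = None

    def branch(self, parent, child):
        # O(1): record a parent pointer; copy no data at all.
        self.pages[child] = {}
        self.parent[child] = parent

    def read(self, branch, page_id):
        # Walk up the ancestry until some branch owns the page.
        while branch is not None:
            if page_id in self.pages[branch]:
                return self.pages[branch][page_id]
            branch = self.parent[branch]
        raise KeyError(page_id)

    def write(self, branch, page_id, data):
        # Lazy materialization: the new page lands only in this branch.
        self.pages[branch][page_id] = data

store = BranchStore()
store.create_root("main", {1: b"customers", 2: b"orders"})
store.branch("main", "experiment-42")         # instant, no data copied
store.write("experiment-42", 2, b"orders-v2")
print(store.read("experiment-42", 2))  # b'orders-v2'
print(store.read("main", 2))           # b'orders' -- parent untouched
```

Discarding a failed experiment is just dropping the child's page dict—which is also why garbage collection of abandoned branches becomes the real operational question at scale.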

Lakebase supports up to 8TB per instance, scale-to-zero timeout with usage-based billing, and pgvector for AI-driven vector search. Data written via Postgres is immediately queryable by Spark and Databricks SQL—no ETL pipeline required.

The trade-off is real. You're tightly coupled to the Databricks platform to get the lakehouse integration benefit. AWS is GA, Azure is preview, GCP comes later in 2026. And readable secondaries are only available in the provisioned tier, not autoscaling.

But the directional bet is clear: agents treat database state the way developers treat code branches.


Why Does Multi-Region SQL Matter for Agents?

Here's a failure mode nobody talks about: an AI SRE agent investigating an incident in us-east fires a diagnostic query that silently escalates to a cross-region join against eu-west. The query technically succeeds. But the latency spike it introduced just made the incident worse.

CockroachDB filed three patents to prevent exactly this.

The core innovation is locality-aware query planning. Traditional cost models account for CPU, I/O, cardinality, and network. They don't weight WAN latency as a first-class cost factor. CockroachDB's new optimizer generates multiple candidate plans with inter-region latency as an explicit dimension.

The killer feature for agents is enforce_home_region. It's a session-level setting that creates a hard boundary: queries either complete locally or error immediately. No silent cross-region escalation. No "technically correct but operationally surprising" behavior.

This matters because agents don't notice when a query takes 200ms instead of 2ms. They don't have the human instinct to say "that felt slow, something's wrong." Without explicit enforcement, agents will silently degrade cross-region performance in ways that compound unpredictably.
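The semantics of that hard boundary are worth spelling out. This toy guard (my illustration, not CockroachDB code) shows the fail-fast contract an agent relies on: a local query runs, a query touching remote regions errors immediately instead of silently paying WAN latency:

```python
# Toy model of home-region enforcement: a query either completes entirely
# in the session's home region or errors immediately -- never a silent
# cross-region escalation.

class CrossRegionError(Exception):
    pass

def run_query(home_region, table_regions, enforce=True):
    remote = [r for r in table_regions if r != home_region]
    if remote and enforce:
        # Fail fast with an explicit, actionable error for the agent.
        raise CrossRegionError(f"query touches remote regions: {remote}")
    # Without enforcement the query "succeeds" but pays WAN latency.
    return "ok (cross-region)" if remote else "ok (local)"

print(run_query("us-east", ["us-east"]))        # ok (local)
try:
    run_query("us-east", ["us-east", "eu-west"])
except CrossRegionError as e:
    print("rejected:", e)
```

The explicit error is the point: an agent can catch it and rewrite the query, whereas a slow success teaches it nothing.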

CockroachDB also shipped a managed MCP server and an agent-ready ccloud CLI in March 2026. This isn't a bolt-on. It's production database operations exposed to AI agents with enterprise security out of the box.


Why Are Sampling and Rollups Poison for Agents?

ClickHouse published a piece this week that names what I've been feeling for months. They call retention limits, sampling, and rollups "the three villains of agentic observability." I think the framing is exactly right.

These aren't bad engineering decisions. They're architectural constraints masquerading as best practices. They emerged from storage cost limitations, not from what operators actually needed. For humans with institutional memory, they're acceptable compromises. For agents, they're active sabotage.

Retention. Organizations enforce 7-14 day log retention because SSD-backed storage is expensive. An AI SRE investigating a checkout failure today can't see the same failure pattern from six weeks ago. Seasonal patterns, rare edge cases, long-tail incidents—all invisible. The agent has no institutional memory to compensate.

Sampling. Head-based sampling decides at ingestion time which traces to keep. Tail-based sampling waits for trace completion. Both permanently discard data. An agent trying to correlate error patterns with deployment events can't do it if the connecting traces were sampled away. The data loss is irreversible and invisible to the agent.

Rollups. Time-series systems pre-aggregate metrics to handle high-cardinality labels. But pre-aggregation requires predicting future queries. If you aggregated away userId, you can never retroactively break down by customer. For a human analyst, that's an inconvenience. For an agent trying to reason about causality, it's a dead end.

ClickHouse's counter-argument is economic. Object storage at ~$0.025/GB/month with 50x columnar compression yields an effective cost of ~$0.0005/GB/month. At that price, 30-365 day full-fidelity retention becomes the default, not the luxury.
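The arithmetic is short enough to check directly. The storage price and compression ratio are the figures above; the 1 TB/day ingest rate is an assumption of mine for illustration:

```python
# Effective cost of full-fidelity retention: raw object-storage price
# divided by columnar compression ratio.
object_storage_per_gb = 0.025   # $/GB/month, S3-class pricing (from the post)
compression_ratio = 50          # columnar compression on observability data

effective = object_storage_per_gb / compression_ratio
print(f"${effective:.4f}/GB/month")   # $0.0005/GB/month

# Illustrative: 90 days of full-fidelity logs at 1 TB/day of raw ingest.
stored_gb = 90 * 1000
monthly_cost = stored_gb * effective
print(f"~${monthly_cost:.0f}/month")  # ~$45/month
```

At those numbers, the "keep everything" default stops being a budget conversation at all.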

Their AI SRE reference architecture measures 6-27 database queries per investigation. At 20-30 seconds per query on legacy systems, the AI workflow is actually slower than a human. ClickHouse delivers sub-second responses. Character.AI reported that queries against the last 10 minutes dropped from 1-2 minutes to instant after switching.

The architecture is clean: observability data in MergeTree tables, context tables for deployments and topology, an MCP server providing constrained SQL access, and copilot logic for iterative SQL generation grounded in deployment and historical context.
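ClickHouse hasn't published the internals of its MCP server, but "constrained SQL access" generally means a read-only gate in front of the database. A minimal sketch of the idea, with hypothetical rules of my own choosing:

```python
import re

# Toy read-only SQL gate of the kind an MCP server puts in front of a
# database: agents get SELECT statements with a forced row cap, nothing else.

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant)\b", re.IGNORECASE
)

def constrain(sql: str, max_rows: int = 1000) -> str:
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("statement contains a forbidden keyword")
    if not re.search(r"\blimit\b", stripped, re.IGNORECASE):
        stripped += f" LIMIT {max_rows}"  # cap result size for the agent
    return stripped

print(constrain("SELECT service, count() FROM logs GROUP BY service"))
# -> SELECT service, count() FROM logs GROUP BY service LIMIT 1000
```

A production gate would parse the statement properly rather than pattern-match, but the shape is the same: the agent iterates on SQL freely inside a boundary it cannot cross.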


How Do Streaming Pipelines Become Agent-Aware?

Confluent announced Streaming Agents this quarter—event-driven AI agents built natively on Apache Flink within Confluent Cloud. This is not "connect your agent to Kafka." This is agents embedded directly inside data pipelines for real-time reasoning.

Every input is logged immutably. Every decision is replayable. This solves three problems simultaneously: failure recovery, logic testing, and decision auditing.

The technical surface is rich. Native model inference against remote LLM endpoints in Flink SQL queries. Continuous vector generation for RAG. Tool calling via MCP. Built-in anomaly detection and Auto-ARIMA forecasting directly on time-series streams. Stream governance with lineage tracking and schema enforcement.

But the architectural move I find most interesting is the A2A protocol integration. Agent-to-Agent is an open protocol now wired directly into Flink. Streaming Agents can connect, orchestrate, and collaborate with agents on any A2A-capable platform—LangChain, SAP, Salesforce. Communication happens over replayable Kafka event streams.

The multi-agent orchestrator pattern uses Kafka as short-term shared memory and Flink for real-time routing. Agents are essentially stateful microservices with a brain. No hard-coded dependencies between them.

Confluent also shipped KIP-932 share groups (the same feature I wrote about last week in the context of queue semantics). For agent workloads, this is critical. Agents produce bursty, parallel work. The old 1:1 partition-to-consumer constraint was designed for predictable human-scale throughput. Share groups shatter that limitation with elastic many-to-many consumption.
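The contrast is easy to simulate. In a classic consumer group, a consumer binds to a partition, so extra consumers idle; in a KIP-932-style share group, any free consumer takes the next record. This is a toy model of the semantics, not Kafka client code:

```python
from collections import deque

# Classic consumer groups vs. KIP-932-style share groups, simulated.

def classic_assignment(partitions, consumers):
    # At most one consumer per partition: surplus consumers sit idle.
    return {c: p for c, p in zip(consumers, partitions)}

def share_group(records, consumers):
    # Many-to-many: records are dealt to whichever consumer is free.
    queue, work = deque(records), {c: [] for c in consumers}
    while queue:
        for c in consumers:
            if queue:
                work[c].append(queue.popleft())
    return work

consumers = ["agent-1", "agent-2", "agent-3", "agent-4"]
print(classic_assignment(["p0", "p1"], consumers))  # agent-3/4 get nothing
print(share_group(range(8), consumers))             # all four agents busy
```

With two partitions and four bursty agents, the classic model caps parallelism at two; the share group keeps all four working regardless of partition count.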

On the streaming database side, RisingWave's economics tell a compelling story. State storage on S3 costs ~$23/month per TB. The same state on EBS-backed systems runs $100-300/month per TB. For agents that spin up materialized views as ephemeral feature stores, that 4-13x cost difference determines whether the architecture is viable.


What Does Agent Experience (AX) Look Like in Practice?

This is the principle that surprised me most. ClickHouse's Alasdair Brown wrote a post about building clickhousectl—their CLI—and the core argument is that Agent Experience mirrors Developer Experience. LLMs are the new primary users of your CLI.

Four design decisions stood out:

Self-discovery through convention. Comprehensive --help output specifically targeting agent comprehension. Not terse Unix-style help. Detailed, workflow-oriented help.

Agent-specific context. Each command includes a "CONTEXT FOR AGENTS" section with workflow guidance. For example: "Typical local workflow: chv install stable && chv use stable && chv run server." The agent doesn't need to explore or experiment. The happy path is documented explicitly.

Predictability over cleverness. Boring, conventional patterns. Unexpected behavior wastes tokens in recovery loops. Every surprising edge case is a wasted API call.

Guidance reduces exploration. Encode common workflows upfront. Every exploratory API call an agent makes is a cost and a latency penalty. Good AX eliminates the need to explore.
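The same pattern is cheap to adopt in any CLI. Here's a sketch using Python's argparse, with illustrative wording of my own (not ClickHouse's actual help text or tool names):

```python
import argparse

# Agent-oriented help: document the happy path explicitly so an agent
# doesn't burn tokens discovering it by trial and error.
AGENT_CONTEXT = """\
CONTEXT FOR AGENTS:
  Typical local workflow: mytool install stable && mytool run server
  All commands are idempotent; re-running a step is always safe.
  On error, stderr contains a machine-readable 'hint:' line.
"""

parser = argparse.ArgumentParser(
    prog="mytool",
    description="Manage local service instances.",
    epilog=AGENT_CONTEXT,
    formatter_class=argparse.RawDescriptionHelpFormatter,  # keep layout as-is
)
parser.add_argument("command", choices=["install", "run", "status"])

print(parser.format_help())
```

One `--help` call now replaces the exploratory loop the agent would otherwise run, which is exactly the token economics Brown describes.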

Brown validated this by deploying a full Google Analytics competitor using only voice commands via OpenClaw on mobile. The agent autonomously installed ClickHouse, bootstrapped a Next.js app, created schemas, populated test data, built dashboards, and deployed to ClickHouse Cloud.

ClickHouse also shipped 28 packaged Agent Skills—schema design, query optimization, data ingestion, partitioning strategies—installable via npx skills add clickhouse/agent-skills. Auto-detects Claude Code, Cursor, and Copilot.

The unresolved problem: authentication, user management, and billing remain deeply human-oriented workflows. Nobody has a good answer for how agents should handle these. This is where MCP-as-control-plane needs to evolve next.


Where Does This Fall Apart?

I want to be honest about the tensions because this space is moving fast and the marketing is moving faster.

Isolation versus cost. Copy-on-write branching is elegant but adds metadata complexity. At 500+ branch depths, garbage collection of abandoned branches becomes its own operational problem. Lakebase is young. We don't yet know where the metadata overhead hits a wall.

Full-fidelity versus budget. $0.0005/GB/month sounds negligible, and even at 1 PB of observability data storage runs only ~$500/month. Storage isn't the problem; the query compute against unsampled petabyte-scale data is the real cost. ClickHouse's columnar architecture handles this well, but you need to size your compute pools correctly.

Agent autonomy versus governance. PingCAP's 10-million-database model raises an obvious question: who pays? Per-agent governance with statement-level metering and budget controls is essential, but the tooling is immature. A runaway agent loop creating databases at scale-to-zero pricing could still generate a surprising bill.

SQL universality versus performance. SQL is the agent-friendliest interface, but not every workload fits SQL semantics cleanly. Graph traversals, time-series downsampling, and geospatial queries all have specialized languages that outperform SQL. The risk is that "SQL everywhere" becomes a performance ceiling for agents that need specialized operations.

Streaming agents versus debuggability. Embedding LLM reasoning inside Flink pipelines sounds powerful until you need to debug why Agent #47 in your multi-agent orchestrator made a bad routing decision at 3:47 AM. Kafka's immutable log helps with replay, but reasoning traces inside streaming pipelines are a new observability challenge that nobody has fully solved.


What Should You Do With This?

If your infrastructure team is starting a new project this quarter, here are the concrete moves:

Evaluate copy-on-write branching for any agent-driven workflow. If your agents create test environments, run experiments, or need isolated state, traditional database cloning is an anti-pattern. Lakebase, Neon's branching, or PlanetScale's branching should be on your shortlist.

Audit your observability retention and sampling. If you're enforcing 7-day retention and head-based sampling, you've built an infrastructure that agents cannot effectively use. Run the cost model on ClickHouse-style object storage. You might find that 90-day full-fidelity costs less than you think.

Ship an MCP server for your internal data services. CockroachDB, ClickHouse, and Confluent all converged on MCP as the agent access layer. If you have internal services that agents need to query, an MCP server with constrained access is the pattern to adopt now.

Add "CONTEXT FOR AGENTS" to your CLIs and internal tools. This is the cheapest, highest-leverage change on this list. Your internal tooling was built for humans. Adding agent-oriented documentation to help text, error messages, and README files costs almost nothing and dramatically reduces agent exploration waste.

Model your billing for agent-scale workloads. Run the math on 100x your current database creation rate. If the number is terrifying, you need scale-to-zero or request-unit billing before agents hit production.
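That "run the math" step fits in a few lines. Every figure below is an illustrative assumption, not vendor pricing; substitute your own rates:

```python
# Back-of-envelope: what agent-scale provisioning does to a bill.
# All figures are illustrative assumptions, not real vendor pricing.

dbs_per_month_today = 200        # databases your team creates now
agent_multiplier = 100           # agents provisioning at 100x that rate
always_on_cost = 30.0            # $/month per idle always-on instance
scale_to_zero_cost = 0.50        # $/month per instance that idles to ~zero

dbs = dbs_per_month_today * agent_multiplier
print(f"always-on:     ${dbs * always_on_cost:>12,.0f}/month")
print(f"scale-to-zero: ${dbs * scale_to_zero_cost:>12,.0f}/month")
```

At these (made-up) rates the gap is $600,000/month versus $10,000/month—the difference between a viable architecture and a postmortem.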


Deep Dive Resources

  • Databricks: Agentic Development Will Change Databases (origin story for O(1) database branching)
  • ClickHouse: Three Villains of Agentic Observability (the definitive case against sampling for agents)
  • ClickHouse: AI SRE Architecture (reference architecture with query-count benchmarks)
  • Alasdair Brown: Agent Experience — Building a CLI (practical AX design principles)
  • CockroachDB: Multi-Region SQL Patents (locality-aware query planning deep dive)
  • Confluent: Introducing Streaming Agents (event-driven AI agents on Flink)
  • Confluent: Multi-Agent Orchestrator (Kafka as shared memory for agent swarms)
  • PingCAP: Agentic AI Database Trends 2026 (10-million-database scale model)
  • RisingWave: Streaming Database Landscape 2026 (state storage cost comparison)
  • Cloud Security Alliance: Cybersecurity Needs a New Data Architecture (federated satellite model for agent-specialized access)

Sources

  1. Databricks Blog — "How Agentic Software Development Will Change Databases" (2026)
  2. Databricks Blog — "Database Branching in Postgres with Lakebase" (2026)
  3. InfoQ — "Databricks Introduces Lakebase" (Feb 2026)
  4. CockroachDB Blog — "Multi-Region Database Architecture & SQL Placement Locality" (2026)
  5. ClickHouse Blog — "Three Villains of Agentic Observability" (Apr 2026)
  6. ClickHouse Blog — "AI SRE Observability Architecture" (2026)
  7. ClickHouse Blog — "Introducing Agent Skills" (2026)
  8. Alasdair Brown — "Agent Experience: Building a CLI for ClickHouse" (2026)
  9. RisingWave Blog — "Streaming Database Landscape 2026 Complete Guide" (2026)
  10. RisingWave Blog — "CDC Stream Processing Complete Guide" (2026)
  11. Confluent Blog — "Q1 2026 Cloud Launch" (2026)
  12. Confluent Blog — "Introducing Streaming Agents" (2026)
  13. Confluent Blog — "Multi-Agent Orchestrator Using Flink and Kafka" (2026)
  14. Cloud Security Alliance — "Cybersecurity Needs a New Data Architecture" (Apr 2026)
  15. PingCAP Blog — "Agentic AI Database Trends That Will Define 2026" (2026)

Top comments (1)

mote

This is a great crystallization of patterns that are genuinely emerging from production deployments.

The point about "full fidelity by default" resonates deeply with me. The 7-day retention + sampling pattern was designed for human dashboards — humans fill in gaps with intuition. Agents don't. When an agent hits a data gap, it either hallucinates a guess or stalls entirely. Neither is acceptable in a production pipeline.

The MCP-as-control-plane point is fascinating. I've been thinking about this from the embedded/edge side — what happens when the agent isn't cloud-connected? Edge AI systems (think robotics, drones, autonomous vehicles) need agent-native data infrastructure that works offline first, persists sensor streams locally, and syncs selectively. The copy-on-write branching model is interesting here too — you'd want immutable snapshots for replay and debugging without the network assumption.

SQL as the universal interface makes sense for cloud-side workloads, but for edge agents with multimodal data (video frames, IMU readings, vector embeddings all timestamped together), it starts to feel like fitting a square peg. Curious whether you see a split emerging: SQL-native for cloud orchestration, something more storage-efficient for the actual agent's memory layer?

What's your take on the 2PC coordination problem in agent workflows — when you have multi-step tool calls that partially succeed?