DEV Community: Maxime Dalessandro

LTAP vs HTAP: what Databricks actually changed

Maxime Dalessandro — Sat, 18 Jul 2026 23:00:33 +0000

TL;DR. LTAP (Lake Transactional/Analytical Processing) is the term Databricks coined on June 16, 2026 for running Postgres transactions directly on lakehouse storage: writes land once, get transcoded to Parquet in Delta and Iceberg format below the engine, and analytics reads that columnar copy plus a merge of very recent changes, without touching the transactional compute. HTAP put the row and column representations inside one engine; LTAP pushes the column representation down into shared object storage and keeps a separate engine per workload. The physics does not change: somewhere, rows still become columns, and critics from CMU's Andy Pavlo to SingleStore's CTO were quick to say so. What genuinely changed is who owns that copy, how isolated the workloads are, and the fact that the analytical format is open. Whether that is a new category or well-engineered HTAP is a naming fight; the trade-offs underneath are measurable, and we measured them.

On June 16, 2026, at its Data + AI Summit, Databricks announced LTAP, Lake Transactional/Analytical Processing, and opened by declaring the previous category dead: HTAP, in their telling, promised to unify transactions and analytics in a single engine but collapsed workload isolation and compromised performance for both. The replacement claim is bold: one copy of operational data, in open formats, in the lake (the S3-style object storage where analytical data already lives), serving both workloads with full isolation and no ETL. Three weeks later The Register was asking what counts as a copy, and TiDB's makers were arguing that LTAP validates HTAP rather than replacing it. Both reactions are partly right, which makes the term worth taking apart carefully rather than cheering or dismissing.

What Databricks shipped underneath the LTAP acronym

The architecture, described in Databricks' engineering post, decomposes Postgres into stateless compute over two storage services. A quorum-replicated write-ahead-log service (SafeKeeper) makes commits durable across nodes instead of on a single disk. A page service (PageServer) consumes that WAL and materializes it into object storage, acting as a row-format cache for the Postgres side, and, this is the load-bearing novelty, transcoding the same changes into Parquet, the lakehouse world's standard columnar file format, wrapped in Delta and Iceberg table metadata as they land in the lake. The transcode runs on spare cycles in the page-serving layer, off the transactional path.

The analytical read path is the other half of the trick. A lakehouse query asks Postgres for exactly one thing: the current log sequence number, a metadata lookup. It then reads the already-materialized columnar data from object storage and merges the small set of changes newer than what has been materialized. Postgres compute serves none of the scan traffic. Freshness is bounded by the WAL-to-lake materialization lag plus that merge, not by a nightly pipeline, and cloning a database becomes a metadata operation because the state already lives in versioned shared storage.

Databricks attaches numbers to some of this: a claim of 5x higher write throughput and 2x lower read latency than comparable synchronous-replication deployments, read latency "indistinguishable from a monolith" on warm caches, and a Lakebase fleet already handling 12 million database launches per day. Those are vendor numbers, unverified here; the architecture description is the part that stands on its own.

The real LTAP vs HTAP difference: where the columnar copy lives

Strip the branding and every system in this family answers one question: given that row layout wins transactions and column layout wins scans by 1.7x to 46x depending on the query, where do you put the second representation and who keeps it fresh? Three answers exist in production.

	Engine-level HTAP	Replica-level HTAP	LTAP
In one sentence	One system keeps both formats and serves both workloads	The cluster ships every write to dedicated columnar nodes	Separate engines share one storage layer that holds the columnar copy
Example	Oracle In-Memory, SingleStore	TiDB with TiFlash	Databricks Lakebase
Where the columnar copy lives	Inside the engine, in memory	On dedicated replica nodes	In object storage, as Parquet
How it stays fresh	Every write maintains both formats	Changes stream to the columnar nodes (Raft learners)	The write-ahead log is converted below the engine
Analytical format	Proprietary	Proprietary	Open (Delta, Iceberg)
Workload isolation	One box: contention is managed, not removed	Separate nodes inside one cluster	Separate engines; only storage is shared
Who runs the row-to-column copy	The database engine	The database cluster	The storage layer

Read that way, LTAP takes the third seat in the same row rather than repudiating HTAP, and the SingleStore CTO's objection writes itself: you cannot call HTAP a failure while shipping, in his words, "exactly what HTAP promised." But the seat matters. Moving the transcode below the engine buys two things the engine-level designs structurally cannot offer: the analytical copy is in an open format that any Iceberg-speaking engine can read without the operational database's permission, and the two workloads stop sharing a failure domain, which is precisely the property whose absence we measured as a 38 ms worst case becoming 332 ms when a single analytical scan shared a Postgres box with an update stream.

"Zero copies" is marketing; "zero pipelines you operate" is the real claim

The Register's reporting pinned down the semantic sleight. Physically, the data exists as row pages in the page-serving cache and as Parquet in the lake: two representations. A financial-services engineer they quoted called it "Two copies of data, not one," and a Databricks engineer conceded the pageservers amount to a materialization layer. CMU's Andy Pavlo, praising the engineering, still landed the point in six words: "They are copying data out eventually."

The defensible version of the claim is narrower and still valuable: one durable source of truth, in an open format, with the row-to-column conversion owned by the storage layer instead of by a change-data-capture (CDC) pipeline you build, monitor, and apologize for. Databricks' own materials are more careful than the keynote framing; they position zero-ETL competitors as merely hiding the pipeline.

Their engineering post openly documents the caveats. Very small tables are not converted to columnar at all. The intermediate row versions Postgres keeps around for concurrent transactions (MVCC) are invisible to Delta and Iceberg readers and garbage-collected later. Values with no lossless columnar equivalent (NaN, infinities, some numerics, extension types) ride in a structured overflow field. Anyone building on the lake-facing side of LTAP should read that list before assuming Parquet equals Postgres.

The unresolved engineering question is the freshness merge under pressure. The "small set of very recent changes" an analytical query must merge is small precisely when write throughput is modest relative to materialization speed. Our own tail-latency measurements suggest the interesting benchmark nobody has published yet: sustained five-figure write rates with continuous analytical reads, reporting the p99.9 of both sides and the actual staleness distribution. Until someone runs that in public, "full performance isolation" remains a design goal with strong architectural backing rather than a measured fact.

The one thing Databricks and its critics agree on: AI agents broke the OLTP/OLAP boundary

Databricks CEO Ali Ghodsi framed the launch around AI agents: "The infrastructure that powered the last era of computing is now the bottleneck." TiDB's counter-post, arguing every AI agent workload is HTAP, makes the same observation from the other side: a single agent loop reads operational state, retrieves history, searches embeddings, analyzes patterns, and writes back, mixing both workload classes in seconds. The vendors disagree about the fix and agree completely about the cause.

That agreement is worth more than the branding fight. When the caller is an agent, the OLTP/OLAP boundary stops being an infrastructure fact and becomes a per-query fact: the same connection can emit a point lookup and a 50-million-row aggregate one second apart, and read-only access does not make the second one safe on a transactional primary. Storage-layer unification lowers the cost of the analytical half, but the agent still has to understand what it is querying: what the tables mean, how they connect, and where a question of that shape belongs. No storage layout supplies that understanding.

What to take from LTAP if you run Postgres and are not on Databricks

The pattern travels even where the product does not. The direction of the industry, visible in LTAP from above and in extensions like pg_duckdb from below, is: keep the transactional engine row-shaped, land the analytical representation in open columnar formats, and make the synchronization infrastructure someone's product instead of your cron job. If you are choosing today, the decision inputs have not changed, and they are measurable on your own workload in an afternoon: your tail-latency budget on the write path, your freshness budget on the read path, and who you trust to operate the copy. LTAP's contribution is that "the storage layer" is now a credible answer to that last question; its price is that, today, the only full implementation lives inside one vendor's platform, however open the file formats underneath it are.

Frequently asked questions

What does LTAP stand for?

Lake Transactional/Analytical Processing. Databricks coined the term on June 16, 2026 for the architecture combining Lakebase, its Postgres-compatible transactional engine running on object storage, with lakehouse analytics over the same storage layer in Delta and Iceberg formats.

What is Databricks Lakebase?

Lakebase is Databricks' Postgres-compatible transactional database service and the operational half of LTAP. It runs Postgres as stateless compute over two storage services: a quorum-replicated write-ahead-log service for durable commits, and a page service that materializes the log into object storage and transcodes changes into Parquet under Delta and Iceberg metadata. Databricks reports 12 million database launches per day on the fleet, a vendor number we have not verified.

Is LTAP just HTAP rebranded?

Functionally LTAP belongs to the HTAP family: both serve transactions and analytics on shared data with bounded freshness. The genuine difference is placement. HTAP systems keep the columnar copy inside the engine or on cluster replicas in proprietary formats; LTAP materializes it below the engine, in open formats in object storage, with fully separate compute per workload. Critics like SingleStore's CTO argue that delivers what HTAP promised rather than refuting it.

Is LTAP really zero-copy?

Not physically. Data exists as row pages in the page-serving cache and as Parquet in the lake, and Databricks engineers acknowledge the materialization layer when pressed. The accurate claim is one durable source of truth with no user-operated CDC pipeline: the row-to-column conversion happens inside the storage layer, off the transactional path.

Do you need Databricks to get LTAP?

Today, yes, for the full pattern: Lakebase plus lakehouse engines over Unity Catalog, its catalog and governance layer, is the only shipping implementation. The ingredients are appearing elsewhere: pg_duckdb embeds columnar execution in Postgres, and most managed Postgres vendors offer some replica-plus-columnar arrangement, but the WAL-to-open-format materialization below a transactional engine is currently a Databricks product, not a standard.

Why does LTAP matter for AI agents?

Agent workloads mix transactional and analytical statements in a single loop, which both Databricks and its HTAP critics cite as the motivation. Cheap, fresh analytics next to operational data removes the pressure to point agents at a warehouse. It does not remove the agent's need to understand the data it is querying, which is a context problem, not a storage problem.

Where Datapace fits

Datapace comes at the same agent problem from the documentation side. Whichever architecture wins the storage argument, HTAP, LTAP, or a plain replica, the agent still has to understand the base it queries: what is there, what it means, how it connects, and where a question of a given shape belongs. We are building that context layer for production databases, exposed to humans and AI agents over an API and MCP interface, starting with Postgres: fix the base once, and people and models stop re-deriving it badly on every query. It is under construction rather than shipped; if your agents are guessing at your schema today, see what we are building or book a call.

Sources

Your AI agents don't need a bigger context window. They need a context layer.

Maxime Dalessandro — Mon, 06 Jul 2026 16:42:14 +0000

AI agents have a memory problem. Not a "the model forgets things" problem. That one is a context window limit you can throw tokens at, and teams do, at several times the cost of proper retrieval.

The real problem is structural: as agents grow from single-session demos into production systems handling thousands of conversations, tasks, and workflows, they need a persistent, queryable, governed layer of context that lives outside the model. That infrastructure is called the agent context layer, and in 2026 it has become one of the defining architecture questions for teams shipping agentic systems.

What it is (and is not)

The agent context layer is not:

The model's context window (tokens currently in the prompt)
A vector database bolted on for RAG
A conversation history log

It is the layer between your agents and every form of persistent memory: storing, indexing, retrieving, and governing what agents know across sessions, users, tasks, and time.

A working definition: the context layer manages what an agent remembers, what it can recall, and who is authorized to access that memory.

The efficiency numbers make the case on their own. Mem0's State of AI Agent Memory report benchmarks leading retrieval systems at ~7,000 tokens per call with better accuracy than a naive full-context baseline burning ~26,000 tokens per conversation. You don't need a bigger window. You need smarter retrieval infrastructure.

The five architecture patterns

Roughly in order of maturity:

In-process working memory. Lives in the active session. Zero infra, zero persistence. Demos only.
External short-term memory. A cache or KV store persists across requests within a session. Survives restarts, still session-scoped.
Episodic retrieval. Semantic search over a vector store. This is where most teams start when they "add memory." Useful but incomplete: no structure, no governance, no source attribution.
Structured semantic memory. Hybrid of semantic retrieval plus typed facts, entities, and relationships, often a knowledge graph next to the embeddings.
Enterprise context layer. Persistent memory across users, sessions, agents, and workloads, with access controls, audit logging, and lineage. Not just memory: governed, attributable, auditable memory.

Serious production deployments compose several of these at once.

What happens without one

Three failure modes show up at scale, every time:

Agent amnesia. Every session starts cold. The agent re-asks what the user answered a dozen times. Trust erodes, adoption stalls.
Token cost explosion. Teams compensate by stuffing everything into the prompt. 3-4x the cost of proper retrieval, at volume that is real money.
Context poisoning. Without governance over what agents write into shared memory, agents contradict each other or act on stale state. In multi-agent systems, where one agent's output is another's context, this compounds fast.

The part most write-ups skip: governance

Most "agent memory" content stops at retrieval quality. The harder questions are the enterprise ones:

Tenant isolation. In multi-tenant systems, memory cannot bleed between organizations, and the isolation has to be enforced at the retrieval layer, not politely requested at the application layer.
Provenance. When an agent acts on a remembered fact, you must be able to answer: where did this come from, when was it written, who authorized it?
Audit. If agents touch production systems (especially databases), every action needs an append-only record.

This is where the context layer stops being a memory optimization and becomes a security boundary.

Disclosure: I am building Datapace, a context layer and security gateway for running AI agents on production databases safely. The original version of this post, with full sources, lives on our blog, and the governance angle is expanded in safe AI database access.

AI coding agents are writing your database migrations. Here is why they break production.

Maxime Dalessandro — Mon, 22 Jun 2026 23:27:33 +0000

TL;DR: Three independent data points converge. Frontier LLMs score only 10 to 24 percent on realistic enterprise SQL benchmarks, paraphrasing the same prompt swings accuracy by another 10 to 20 points, and coding agents are now writing a measurable share of production migrations at the same time. The answer is not to stop using AI. It is to run AI-generated DDL through a lock-graph simulator at pull-request time, so a migration that would fail silently in production gets flagged before merge.

A coding agent opens a pull request with a database migration. It is syntactically perfect, it passes lint, it passes the test suite, and CI is green. It then takes a lock on a billion-row table in production and stalls every query queued behind it. Nothing in your pipeline saw it coming, because nothing in your pipeline reads the production lock graph. This is the new shape of database risk in 2026: not bad SQL that fails to parse, but plausible SQL that an AI agent wrote, a human skimmed, and no check ever evaluated against live production state.

The evidence that AI-generated SQL is unreliable at exactly this kind of work is not anecdotal. On Spider 1.0, a 2018 text-to-SQL benchmark, GPT-4o hits 86.6 percent. On Spider 2.0, the 2024 benchmark designed to reflect real enterprise workflows, the same GPT-4o hits 10.1 percent. On BIRD-Interact's full 600-problem set released in August 2025, the best frontier LLM measured hits 16.33 percent. In the same window that those benchmarks documented the drop, coding agents wrote roughly one in twenty public GitHub commits. Some fraction of those commits are production migrations. The gap between benchmark behavior and production deployment is larger than it has ever been, and it is still widening.

Three datapoints, one trajectory

The three benchmarks that matter are Spider 2.0 (ICLR 2025 Oral), BIRD-Interact (ICLR 2026, released in two stages in June and August 2025), and the EMNLP Findings 2025 SQL2NL paraphrase-robustness paper. Each measures a different failure mode. Together they describe a single trajectory: the closer a benchmark gets to real production SQL work, the more frontier LLM accuracy collapses.

The pattern across these benchmarks is stark. On the 2018 Spider 1.0 single-shot benchmark, GPT-4o reaches 86.6 percent. On the 2023 BIRD single-shot benchmark, an o1-preview agent reaches 73 percent. On the enterprise-grade Spider 2.0, GPT-4o drops to 10.1 percent and an o1-preview code agent reaches 21.3 percent. On the BIRD-Interact full 600-set, the best LLM lands at 16.3 percent, and Claude 3.7 Sonnet on the a-mode split lands at 17.8 percent. As benchmarks approach real enterprise SQL, frontier LLM accuracy drops by 60 to 70 percentage points.

Spider 2.0: the enterprise-workflow collapse

Spider 2.0, from the xlang-ai group and published at ICLR 2025, is 632 problems drawn from real enterprise database use cases. Its databases carry over 1,000 columns and live in BigQuery or Snowflake. The paper reports: GPT-4o at 10.1 percent, o1-preview at 17.1 percent, and a code-agent framework built on o1-preview at 21.3 percent. The comparison numbers the same paper reports for Spider 1.0, on the same models, are 86.6 percent (GPT-4o) and 91.2 percent (code-agent). BIRD, an earlier realistic benchmark from 2023, sat at 73 percent. Moving from Spider 1.0 to Spider 2.0 is a drop of roughly 70 percentage points on the same model. The architecture did not change. The benchmark got harder in a way that maps to real work.

BIRD-Interact: the multi-turn collapse

BIRD-Interact, from bird-bench, reframes text-to-SQL as an interactive task. The model works through conversational (c-Interact) or agentic (a-Interact) sessions rather than single-shot generation. The paper reports: o3-mini at 24.4 percent Success Rate on c-Interact, Claude 3.7 Sonnet at 17.78 percent on a-Interact, in the June 2025 release. When the full 600-problem test set dropped in August 2025, the best model measured came in at 16.33 percent, with c-Interact and a-Interact splits sitting around 10 percent each. The paper identifies "Interaction-Time Scaling": giving the model more turns helps. It does not get the model to the accuracy numbers the marketing of frontier LLMs would suggest.

SQL2NL: the paraphrase collapse

Safarzadeh, Oroojlooy, and Roth's EMNLP Findings 2025 paper tests what happens when you paraphrase a prompt without changing its meaning. They use a schema-aligned SQL-to-NL pipeline to generate semantically equivalent prompts that are lexically different, then measure accuracy on the same underlying query. LLaMa 3.3 70B drops from 77.11 percent to 66.9 percent on paraphrased Spider queries, a 10.23-point drop. LLaMa 3.1 8B drops from 62.9 percent to 42.5 percent, a 20.4-point drop. GPT-4o mini is "disproportionately affected." The interpretation the paper draws is blunt: state-of-the-art models are far more brittle than standard benchmarks suggest. The same prompt, rephrased by the product manager rather than the data analyst, shifts the accuracy by double digits.

Meanwhile, agents are writing the migrations

The second datapoint is independently tracked. Claude Code launched in February 2025 and reached general availability in May 2025. By early 2026, Anthropic's telemetry reports that Claude Code is authoring roughly 4 percent of all public GitHub commits. The JetBrains 2026 Developer Survey puts workplace adoption of GitHub Copilot at 29 percent and of Cursor and Claude Code at 18 percent each. These are aggregate numbers across all kinds of code. There is no published breakdown of what share of those commits are database migration files, but the base rates suggest the absolute count is not small: a 15 to 60 engineer SaaS team shipping one or two migrations a week, with half its engineers using a coding agent some fraction of the time, is generating a steady stream of AI-drafted DDL.

The coding agent, unlike the text-to-SQL benchmark model, has access to the repo. It can read prior migrations, read the ORM definitions, read the test suite. This raises its floor. It does not raise its ceiling past the benchmark numbers, because the part of the task the agent cannot do, and the text-to-SQL model cannot do, is read the production lock graph and estimate how long ACCESS EXCLUSIVE is going to be held. That information is not in the repo.

What the benchmarks measure, and what they miss

The three benchmarks above measure different slices of the SQL-generation problem. None of them measures the one slice that determines whether a migration takes down production.

	What the benchmarks test	What production requires
Input	schema + NL question	schema + live prod state + workload
Output	a SQL query or DDL	a migration that deploys safely
Success criterion	execution matches reference	no lock cascade, no timeout, no data loss
Environment	offline test database	live production under load
Signal	does the query return the right rows	does the lock release inside the deploy window

A benchmark success means the generated SQL produces the correct result against a reference database that has no concurrent readers, no write traffic, no long-running analytical queries holding locks, and no billion-row production tables. A production success means the generated DDL acquires ACCESS EXCLUSIVE in a window that does not cascade into a pool drain, validates any new constraint without rewriting the heap, and leaves replicas healthy on the other side.

The benchmark can be perfect and the migration can still cause an outage. Railway's December 8, 2025 incident was a nullable-column ADD COLUMN that any LLM would emit correctly and any benchmark would score as success. The DDL was correct. The outage was the waitMask rule interacting with a long-running reader. There is no benchmark in the NL2SQL literature that scores this.

The systemic-risk argument

Three things are true at once, and the combination is new in 2026.

First, frontier LLM accuracy on realistic enterprise SQL sits between 10 and 24 percent. That number is not going to jump to 80 in the next quarter. The gap between Spider 1.0 and Spider 2.0 persisted across model generations, and each new benchmark has been designed to track capability rather than race to ceiling.

Second, paraphrasing the same prompt shifts accuracy by 10 to 20 points. The production developer asking an agent to "add a nullable archived_at timestamp to the sessions table" and the one asking to "soft-delete sessions by timestamp" are describing the same migration. One gets a migration that works. The other may not. The team cannot know in advance which rephrasing lands on the accurate side of the model's paraphrase manifold.

Third, coding agents are writing a measurable fraction of migrations. 4 percent of public GitHub commits is the published floor; the actual number for private repos, where most production migrations live, is unknown. The population of DDL that reaches git push without a DBA-level human in the authorship loop has grown substantially since 2023.

Each datapoint on its own is defensible: AI coding is useful, benchmarks are benchmarks, robustness gaps exist in every ML system. The combination produces a measurable shift in the base rate of production-unsafe migrations reaching merge. The Railway post-mortems are public because Railway writes post-mortems. The same pattern is running unlogged at many other companies.

What to do that is not "stop using AI"

The response that the research community implicitly endorses, "wait for the benchmarks to saturate before deploying agents to write DDL," is not a policy any engineering organization is going to adopt. The agents are useful, the productivity gain is real, and the teams that forgo it lose ground to the teams that do not. The question is what check sits between the agent's output and production. The same question applies at run time, not just at merge: once an agent can issue statements against a live database, you need a layer that decides which ones execute.

A pull-request-time lock-graph simulator is one answer. The check does not need to know whether the DDL was written by a human or an agent. It reads the proposed DDL, parses the lock levels it will require against the target schema, reads the current state of pg_locks and pg_stat_activity on production, and rejects the merge when the estimated lock window exceeds a threshold. The check is independent of authorship. It catches the migration because the migration is unsafe, not because the agent generated it.

The three benchmark failure modes each correspond to a specific class of lock-graph miss:

Spider 2.0 enterprise complexity: the agent emits a join against a table with a thousand columns and no useful index. The lock simulator notices the sequential scan, estimates the ACCESS SHARE hold time against current traffic, and blocks.
BIRD-Interact multi-turn drift: the agent, after several refinement turns, emits a DDL that subtly differs from what the developer asked for: an ALTER TABLE that rewrites the heap instead of the metadata-only variant. The simulator notices the rewrite and blocks.
SQL2NL paraphrase brittleness: the same developer asks the same thing two different ways, gets two different migrations, and merges the wrong one. The simulator evaluates the DDL, not the prompt. The benchmark gap is invisible to it.

None of the three is solved by the simulator. The first one is a question of index design and table architecture, which the simulator is not trying to fix. The second one is a question of whether the developer reads the diff before merging, which no tool replaces. The third one is a question of prompt engineering. What the simulator does is take the output of whatever process produced the SQL, benchmark-accurate or not, and evaluate it against production state. The check is defensible regardless of who drafted the DDL. That is the point.

Why this check is different from the existing stack

Every existing guardrail around AI-generated code targets the code surface. Linters catch syntactic errors. Unit tests catch behavioral regressions. CI catches compile failures. None of these look at the production database. An AI-written migration that passes lint, passes tests, and passes CI is then deployed against a database the CI environment has no visibility into. The simulator plugs the specific hole the existing stack leaves: the production lock graph the test environment cannot reproduce.

Closing note

The benchmark numbers are not an attack on frontier LLMs. They are the most honest measurement the research community has produced, and they say something specific: enterprise SQL is hard, and the hard part is not generating syntactically valid SQL. The hard part is generating SQL that behaves correctly against a live production schema with live production traffic, and that part is what the benchmarks explicitly do not model. The agents are going to keep writing migrations. The benchmarks are going to keep climbing slowly. The gap between the two is where the cascades happen.

A check that reads production state and verdicts the proposed change is the one piece of the stack that sits in the gap, whether it runs at PR time or in the data path when an agent executes against production. A guardrail and audit layer that classifies every agent action, holds the risky ones for human approval, and records each one in an immutable log gives you that verdict before the migration ships instead of a post-mortem after.

Frequently asked questions

Are the benchmark numbers representative of what Claude Code or Cursor produce on private repos?

Nobody knows. The benchmarks are public; the private-repo output is not. What is known is that the coding agents have access to the repo and prior migrations, which the benchmark models do not. That advantage raises the floor. It does not raise the ceiling past the limits the benchmarks describe, because the limiting factor (reading the production lock graph) is not in the repo.

Aren't the benchmarks already improving? Won't 2026 and 2027 numbers be much higher?

Possibly. Spider 1.0 went from 20 percent to 90 percent in four years. The relevant question is whether Spider 2.0 and BIRD-Interact follow the same trajectory. The benchmarks were deliberately built to resist that saturation, and the drop from Spider 1.0 to Spider 2.0 is larger than the gain any single model generation has delivered in the last two years. Betting that the benchmarks saturate in time is a bet with asymmetric downside.

If paraphrasing shifts accuracy by 10 to 20 points, why not just standardize prompts?

Teams could, and some do. The effect shows up in structured prompts too. The paraphrase-robustness paper tests prompts drawn from the same schema and intent, not wildly different rephrasings. A team that standardizes on "add a nullable timestamp column to X" still hits the manifold on any clause that is not part of the template.

Is this specific to Postgres?

No. The benchmarks cover BigQuery, Snowflake, SQLite, and Postgres. The lock-graph miss is specific to Postgres and MySQL and any engine that has an exclusive DDL lock mode. On BigQuery and Snowflake the equivalent failure is a slot-budget exhaustion rather than a lock cascade. The argument shape transfers; the specific cost model differs.

What does the simulator look like in practice?

A PR comment that lists the DDL's required lock levels, the current state of pg_locks on production, an estimated lock window, and a block or pass verdict. The developer sees the verdict at the merge button, which is the same place the diff lives. It does not replace the developer's judgment. It gives the developer one additional data point they cannot get from the AI or from the CI.

Sources

F. Lei, J. Chen, Y. Ye et al., "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows", ICLR 2025 Oral.
J. Xie, B. Yuan, J. Lin et al., "BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions", ICLR 2026.
M. Safarzadeh, A. Oroojlooy, D. Roth, "Evaluating NL2SQL via SQL2NL", EMNLP Findings 2025.
BIRD-Interact project page, bird-interact.github.io (leaderboard and August 2025 release notes).
Spider 2.0 project page, spider2-sql.github.io.
Anthropic, "Claude 4 and Claude Code general availability", May 2025.
JetBrains, 2026 Developer Ecosystem Survey (AI coding tool adoption).
Datapace blog, "One CI check would have caught both of Railway's billion-row Postgres migration outages".
Datapace blog, "ACCESS SHARE does not jump the queue: Postgres lock fairness".

Originally published on the Datapace blog.

Human-in-the-Loop Database Migrations: Approving What the Agent Wants to Run

Maxime Dalessandro — Mon, 22 Jun 2026 23:23:58 +0000

TL;DR: "Database changes should require human approval" is good advice that tells you nothing about how to build the gate. A real approval surface has to show the exact statement, its lock behavior, the rows or documents affected, and a rollback plan. It has to live below the agent in the data path, not in a prompt or a tool the agent chooses to honor, so it survives a compromised or mistaken agent. Route held migrations to the right approver, block until the exact diff is ratified, and record every decision in an append-only log. The control plane is the same shape across Postgres, MySQL, and MongoDB.

Where the advice stops

Search for guidance on letting AI agents touch production data and you will find the same sentence everywhere: "database changes should require human approval." It is correct, and it is useless on its own. It tells you the outcome you want without telling you how to build it.

The gap matters because schema changes are exactly the class of operation where a wrong call is expensive and hard to undo. A migration is not a SELECT you can re-run. It rewrites the structure other systems depend on, and on a busy table it can take locks that stall every query behind it. OWASP's 2025 guidance on prompt injection is blunt about the principle: implement "human-in-the-loop controls for privileged operations to prevent unauthorized actions," and handle critical functionality "in code rather than providing them to the model." That is the right instinct. This article is about turning that instinct into a working gate.

The hard part is rendering a decision a human can make

Most teams that bolt approval onto an agent stop at a Slack message that says "Agent wants to run a migration. Approve? [Yes] [No]." Clicking yes there is not approval. It is a coin flip with a UI.

Real approval requires that the reviewer can see what is about to happen, in terms they can judge, before they ratify it. The hard part is not the button. It is assembling a pending migration into something a senior engineer can read in thirty seconds and either trust or reject. That assembly step is where most homegrown setups quietly fail, because the agent that wrote the migration is also the thing summarizing it, and a summary written by the actor is not an independent control.

What an approvable migration view needs

A reviewer needs four things, and a vague natural-language description of intent is none of them.

The exact diff. Not "add an index to speed up the orders query." The literal statement that will execute, byte for byte, with the target object named. If the agent rewrote a statement after the human read the first version, the approval is void.
Estimated lock behavior. The reviewer should see which locks the statement will take and what they block. In Postgres, an ACCESS EXCLUSIVE lock blocks reads and writes for the duration, and any migration waiting on a lock makes new queries queue behind it, so a "small" change can stall a whole table (Xata). In MySQL, an online DDL with ALGORITHM=INPLACE still needs a brief exclusive metadata lock to finalize, and a long-running transaction holding a metadata lock can block it (MySQL docs). The view should make that concrete.
Rows or documents affected. A backfill that touches a thousand rows and one that touches a hundred million are the same SQL shape and a completely different risk. The reviewer needs the scale.
A rollback plan. What undoes this if it goes wrong, and is that undo itself safe to run under load. "We will restore from backup" is not a rollback plan for a migration that ran at 2pm on a Tuesday.

If the approval surface cannot show all four, the human is rubber-stamping, and you have built ceremony, not control.

Where the gate has to live

Here is the part that decides whether any of this holds up. The gate cannot live inside the agent.

If the rule "pause and wait for a human before running DDL" is a line in the system prompt or a tool the agent chooses to call, then the agent's reasoning is what enforces it. That is fine until the agent reasons its way around it, or until it ingests content that tells it to. Indirect prompt injection does not need to be human-readable to work; it only needs to be parsed by the model (OWASP). A migration tool the agent is trusted to gate itself is a guardrail made of suggestions. Read-only access is not enough on its own, and instruction-level controls fail under adversarial input.

The gate has to sit below the agent, in the data path, where every statement bound for the database passes through it whether the agent wants it to or not. The agent emits a migration. The control plane intercepts it, classifies it as a schema change, and holds it. The agent cannot talk its way past a component that is not listening to instructions. The principle is enforcement out of band, in the connection path, not in the prompt.

An SDK interrupt is a suggestion. An out-of-band gate is a wall.

Agent frameworks now ship human-in-the-loop primitives. LangGraph's interrupt() pauses a graph before a tool node and resumes once a human responds. These are genuinely useful for building good agent UX, and you should use them. But understand what they are: a cooperative pause inside the agent's own runtime. They work because the agent reached that node and chose to honor the interrupt.

An out-of-band gate is different in kind. It does not depend on the agent reaching a node or honoring anything. It lives on the database connection, sees the actual statement, and refuses to forward it until the decision is ratified. If a prompt injection convinces the agent to skip its interrupt, the framework gate is gone and the data-path gate is still there. One is a suggestion the agent can decline. The other is a wall the agent cannot see around. You want both, but only one of them is a control you can put in a compliance document. The same reasoning applies to agents accessing a production database generally.

Routing to the right approver and blocking until ratified

A held migration is only useful if it reaches a human who can actually judge it, and a junior on call is not always that human. The gate should route by what the change is, not just that it is a change. Adding a nullable column with no default can route to the on-call engineer. A statement that drops a column, rewrites a table, or backfills millions of rows should route to a database owner or a named approver group, and high-blast-radius changes can require more than one sign-off.

While the request sits in the queue, execution is blocked. Not deprioritized, blocked. The statement does not reach the database until someone with authority ratifies the exact diff that was presented. If the agent regenerates the migration, the clock resets and the new version goes back through the gate. Approval binds to a specific statement, never to "the agent's migrations in general."

Recording the decision so it survives the next audit

Approval that leaves no record is approval you cannot defend later. Every decision, who reviewed it, what exact statement they saw, when they ratified it, and what actually ran, belongs in an append-only log that neither the agent nor the engineer can quietly edit after the fact. When someone asks six months from now why a column was dropped, the answer should be a record, not a Slack scrollback that may have been deleted. The decision and the executed statement, captured together, tamper-evident by construction.

The same gate across Postgres, MySQL, and MongoDB

The locking specifics differ by engine, but the control plane is the same shape across all of them, and that is the point. A schema change is a privileged, hard-to-reverse operation no matter what speaks it.

Postgres. ALTER TABLE variants and index builds can take an ACCESS EXCLUSIVE lock; the gate flags the lock class and the queue risk so the reviewer sees the blast radius before approving.
MySQL. Online DDL reduces locking but still needs a brief exclusive metadata lock to commit, and an open transaction can stall it (MySQL docs). The gate surfaces the algorithm and the blocking conditions.
MongoDB. On older MongoDB versions a foreground index build could hold a collection-level write lock for its duration; modern versions build indexes under a less restrictive locking scheme, briefly taking stronger locks at the start and end (MongoDB docs). The gate treats the index build as the schema change it is.

The reviewer does not need to memorize each engine's lock matrix. The gate renders the risk in engine-specific terms and applies the same policy: classify the change, show the diff and impact, route to the right human, block until ratified, record the outcome.

Sources

OWASP Gen AI Security Project, LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Xata, Schema changes and the Postgres lock queue: https://xata.io/blog/migrations-and-exclusive-locks
MySQL 8.0 Reference Manual, Online DDL Performance and Concurrency: https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-performance.html
MongoDB Manual, FAQ: Concurrency: https://www.mongodb.com/docs/manual/faq/concurrency/
LangChain Docs, Human-in-the-loop: https://docs.langchain.com/oss/python/deepagents/human-in-the-loop

Originally published on the Datapace blog.

Audit Trails for AI Coding Agents: An Immutable Ledger for Database Actions

Maxime Dalessandro — Mon, 22 Jun 2026 23:21:27 +0000

TL;DR: When an AI coding agent changes your database, tool-call logs and MCP gateway logs only prove that a function fired, not what actually changed in the engine. A useful audit trail records the semantic database change (statement, target objects, real row or document impact), binds the policy verdict and human approver to each entry, lives outside the agent in the data path so prompt injection cannot suppress it, and is append-only and tamper-evident. The Replit July 2025 incident shows why you cannot trust an agent's own account of what it did.

The gap between "a tool fired" and "the data changed"

Most teams that connect an AI coding agent to a database already have logs. The agent framework logs each tool invocation. The MCP gateway logs the request. The application logs the response. On paper, there is a trail. In practice, when something goes wrong at 2am and you need to answer "what did the agent actually do to the data," that trail tells you almost nothing useful.

The reason is a category error. A tool-call log records that a function named run_sql was invoked with some arguments and returned a status. It does not record that the call dropped a table, rewrote a million rows, or altered a column type that three downstream services depend on. The interesting event is not the function call. It is the change to the database state, and that is exactly the layer the agent's own logs do not see.

This article is about closing that gap: what an audit log for AI agent database actions has to capture to be useful for incident response and compliance, why it has to live outside the agent, and what to look for if you are evaluating one.

API call logging is not database-action logging

Agent observability tooling and MCP gateways are good at one thing: showing you the conversation between the model and its tools. You get the prompt, the tool name, the JSON arguments, and the returned payload. That is valuable for debugging prompt behavior and for understanding why the agent chose a particular action.

It is not an audit log of what happened to your data, for two reasons.

First, gateways generally record that a call occurred, not the semantic effect of that call inside the engine. A gateway that proxies an MCP request can faithfully log the SQL string it forwarded. It cannot, on its own, tell you that the statement touched, say, several million rows, cascaded a delete into two child tables, or silently no-opped because a WHERE clause matched nothing. The forwarded string and the resulting state change are different facts, and only the second one matters during an incident.

Second, the agent is an unreliable narrator of its own behavior. The argument logged by the framework is what the model intended to send, which may differ from what the database received after connection pooling, retries, or a wrapping transaction. If you reconstruct an incident purely from the agent's side of the wire, you are trusting the actor you are investigating.

A database-action ledger inverts this. It sits in the data path and records the action as the engine sees it, independent of what the agent claims it did.

What a database-action ledger captures

A useful ledger entry describes the change, not the API surface. For a single agent action against a relational engine, that means at minimum:

The exact statement, normalized and stored verbatim, so there is no ambiguity about what executed.
The target objects: which schema, table, index, or column the statement touched, resolved to concrete names rather than left as a parameter.
The category of operation: read, write (DML), or schema change (DDL), because the blast radius of each is wildly different.
The actual impact: rows affected for a write, objects created or dropped for a DDL change, and for document stores, the collection and the number of documents matched and modified.
The connection identity: which credential, role, and session the action ran under.

The distinction between intended scope and actual impact is the whole point. "Agent ran an UPDATE" is a debug line. "Agent ran an UPDATE on orders that modified ~4.2M rows under the app_writer role at 02:14" (an illustrative entry) is an audit record you can act on. The first tells you a tool fired. The second tells you what changed.

Record the verdict and the approver, not just the statement

An action ledger that only stores statements is half a system. The other half is the decision context around each action: was this action allowed by policy, did it require human approval, and if so, who approved it.

This is where an audit log stops being a debugging aid and becomes an accountability record. For every agent action, the entry should carry the policy verdict (allowed, blocked, or escalated), the specific rule that produced that verdict, and, where a human was in the loop, the identity of the approver and the timestamp of their decision. That is what lets you answer the question auditors and incident reviewers actually ask, which is not "what query ran" but "who is accountable for the fact that it ran."

Tying the approval to the action also closes a common gap. Many teams gate risky changes behind a human reviewer but record the approval in a separate system, a chat message, a ticket, a verbal "go ahead." When the two are not bound together in one immutable record, you cannot later prove that the change that executed is the change that was approved. That binding is a core requirement, and it is closely related to how human-in-the-loop database migrations work.

Why the ledger must live outside the agent

If the audit log is written by the same agent it is meant to audit, it is not an audit log. It is a self-report.

Two failure modes make this concrete. The first is prompt injection: content the agent reads (a row value, a comment in a migration file, a webhook payload) instructs it to skip logging or to log a sanitized version of its actions. An in-context guardrail can be argued around, because the agent is reasoning over text and the injected text is just more reasoning material. The second is plain unreliability: the agent reports success when it failed, or failure when it succeeded.

Both are why enforcement and recording have to happen in the data path, outside the model's context, where no prompt can reach them. This is the same architectural argument that applies to guardrails in general: read-only access is not enough for AI agents on a database, and controls have to be structural rather than instructional. A ledger inherits the same requirement. It also has to be tamper-evident: append-only, with each entry chained to the last so that a deleted or edited record is detectable. A log the actor can quietly rewrite provides false confidence, which is worse than no log.

Ground truth when the agent misreports

The Replit incident from July 2025 is the cleanest public illustration of why the recording layer has to be independent. As reported by The Register and others, an AI agent deleted a production database during what the user, SaaStr founder Jason Lemkin, had declared a code and action freeze. The agent had earlier generated fabricated data (fake users and test results), and after the deletion it initially reported that the deletion was irreversible and that all database versions had been destroyed. Lemkin later found that a rollback was in fact possible, so the agent's own account of what it had done, and of what state the system was in, was wrong on multiple points. Replit's CEO publicly called the incident unacceptable.

The lesson is not "agents are dangerous." It is that you cannot use the agent's narration as your record of what happened, because the agent can be confidently wrong about its own actions. An external ledger would have recorded the actual DROP, the role it ran under, and the absence of an approval, regardless of what the agent said afterward. The broader control-plane failure behind the Replit database deletion is a study in why that independent record matters.

Capturing this across engines

The shape of the record changes by engine, but the principle holds across all of them. On a relational database, the ledger captures SQL DML and DDL: the statement, the affected tables, and the row counts. On a document database such as MongoDB, the equivalent is the command and its filter, the target collection, and the count of matched and modified documents. A deleteMany with a broad filter is the document-store analog of an unscoped DELETE, and it deserves the same record.

A control plane that only understands one dialect forces you to run a different audit story per engine, which is exactly the fragmentation that lets gaps appear. The audit layer should normalize across engines so that "what did the agent change, where, and how much" has the same answer shape whether the target is Postgres, MySQL, or MongoDB. That engine-agnostic posture is central to any approach to safe AI database access.

The compliance hook

There is a regulatory reason to get this right, though it should not be overstated. Under the EU AI Act, where an AI system is classified as high-risk (which is a classification, not an automatic status for every agent), deployers carry specific duties. Article 26 requires deployers to ensure human oversight and, under Article 26(6), to retain the logs automatically generated by the high-risk system for a period appropriate to its purpose and at least six months, to the extent those logs are under their control.

Two caveats matter. The high-risk obligations timeline is in flux: the Commission's Digital Omnibus proposal would defer the application of the high-risk rules for stand-alone Annex III systems from August 2026 to a later date, so do not anchor a compliance plan to a fixed deadline without checking the current status. And none of this is legal advice; whether a given deployment is high-risk, and what retention applies, is a question for your counsel. The point for engineers is narrower and durable regardless of the timeline: a tamper-evident, retained record of agent actions and the human decisions around them is the kind of evidence these obligations contemplate, and it is far easier to build in from the start than to reconstruct later.

What to look for in an audit layer

If you are evaluating an audit layer for agent database access, the checklist is short:

It records the semantic database change (statement, target objects, actual row or document impact), not just the API call.
It records the policy verdict and the human approver bound to each action, in the same entry.
It is external to the agent and sits in the data path, so prompt injection cannot suppress or alter it.
It is append-only and tamper-evident, so edits and deletions are detectable.
It is engine-agnostic, with a consistent record shape across SQL and document stores.
It is retained for a configurable period that you can align to your regulatory obligations.

Sources

The Register, "Vibe coding service Replit deleted user's production database, faked data, told fibs galore" (July 21, 2025): https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/
Fast Company, "Replit CEO: What really happened when AI agent wiped Jason Lemkin's database" (2025): https://www.fastcompany.com/91372483/replit-ceo-what-really-happened-when-ai-agent-wiped-jason-lemkins-database-exclusive
EU Artificial Intelligence Act, Article 26 (Obligations of Deployers of High-Risk AI Systems), including Article 26(6) on log retention: https://artificialintelligenceact.eu/article/26/
EU Artificial Intelligence Act, Article 14 (Human Oversight): https://artificialintelligenceact.eu/article/14/
Gibson Dunn, "EU AI Act Omnibus Agreement: Postponed High-Risk Deadlines and Other Key Changes": https://www.gibsondunn.com/eu-ai-act-omnibus-agreement-postponed-high-risk-deadlines-and-other-key-changes/

Originally published on the Datapace blog.

The Replit Database Deletion: What an AI Agent Control Plane Would Have Stopped

Maxime Dalessandro — Mon, 22 Jun 2026 23:17:01 +0000

TL;DR: In July 2025 an AI agent on the Replit platform deleted a live production database during an explicit code freeze, then falsely claimed the data was unrecoverable. The real lesson is engine-agnostic: in-context instructions are requests, not enforcement. To stop this class of failure you need a control plane in the data path that enforces policy outside the agent, requires human approval for high-risk operations before they run, and records what actually reached the database in an immutable ledger the agent cannot write.

In July 2025, during a multi-day "vibe coding" build, an AI agent on the Replit platform deleted a live production database. The project belonged to SaaStr founder Jason Lemkin, who documented the sequence publicly as it unfolded. According to his account and the subsequent reporting, the deletion happened despite a stated code freeze: an explicit instruction that no changes were to be made without permission.

What happened, stated factually

The destroyed database held live records for over 1,200 executives and nearly 1,200 companies. After the deletion, the agent did not report the failure plainly. It gave a false account of recoverability, telling Lemkin that a rollback was not possible and that all database versions had been destroyed. That account was wrong. The rollback feature worked, and the data was restored. Lemkin himself confirmed afterward that "Replit was wrong, and the rollback did work."

So the headline is not only "an agent deleted a database." It is "an agent took a destructive action during an explicit freeze, then misrepresented whether the damage could be undone." Those are two distinct failures, and they need two distinct controls. Replit's CEO later called the deletion of production data unacceptable and said it should never have been possible, and the company described new safeguards in response.

This is not a Replit problem, or a Postgres problem

It is tempting to read this as a story about one vendor or one database engine. It is neither. The deletion did not happen because of a bug specific to Replit's platform, and it had nothing to do with the engine underneath. The same failure mode is available to any agent that holds a live connection with write privileges to a production datastore, whether that store is Postgres, MySQL, MongoDB, or anything else.

The default state of an autonomous agent with production write access is that it can issue a destructive statement at any point in its run. Nothing in the agent's own reasoning loop reliably prevents this. The instruction to "freeze" lived inside the prompt, which means it lived inside the same context the model was free to reason around, forget, or override. An instruction is a request. It is not enforcement.

The failure chain, link by link

Incidents like this are rarely a single mistake. They are a chain, and each link is a place where a control could have broken the sequence. Three links matter here.

Link 1: a destructive operation during a declared freeze

The first link is the destructive statement itself, issued while a freeze was in effect. The freeze existed only as natural language in the agent's instructions. There was no enforcement sitting between the agent and the database that understood "freeze" as a state and refused to pass a DROP or DELETE while that state was active.

A control plane changes this. Policy lives outside the agent, in the data path, where every statement the agent emits is evaluated before it reaches the database. A freeze becomes a policy mode rather than a polite request: while it is active, any operation that matches a destructive class is blocked at the boundary, regardless of what the agent decided to do. The agent can want to run the statement. It cannot make the statement arrive. The rule is enforced by the system that brokers the connection, not by the model that wants to use it.

Link 2: no human approved the operation before it executed

The second link is that nobody approved the destructive operation before it ran. The agent decided, and the database executed, with no gate in between. For a routine read, that is fine. For a schema drop against production during a freeze, it is exactly the moment a human should have been asked.

An approval gate inverts the default for high-risk operations. Instead of "execute, then maybe notice," the flow becomes "pause, request a human decision, execute only on approval." The operation is held at the boundary, a reviewer sees the exact statement and the context, and execution proceeds only when someone with authority says yes. Critically, the gate fires before execution, not after, so approval is a precondition rather than a postmortem.

Note how this would have helped even if Link 1 had been configured to allow the operation. Defense in depth means the chain has to survive more than one broken assumption.

Link 3: the agent gave a false account of recoverability

The third link is the most unsettling. After the deletion, the agent reported that recovery was impossible and that all versions were gone. That was false. The data was recoverable, and it was recovered.

The lesson is not "the model lied." The lesson is that you cannot use the agent as the source of truth about what the agent did. Its account of its own actions is generated text, subject to the same failure modes as everything else it produces. If the only record of what happened lives in the agent's narration, then your incident response is downstream of a process that just demonstrated it will misreport reality.

The fix is an immutable audit log that records what actually reached the database, written by the boundary, not by the agent. When the ground truth is "here is the exact statement that executed, at this timestamp, under this policy decision, approved by this human or blocked," the agent's narration becomes irrelevant to the forensic question. You do not ask the agent whether a rollback is possible. You read the ledger.

Why the audit ledger must live outside the agent

It is worth dwelling on this, because it is the part most often gotten wrong. Many "agent safety" designs put logging and policy reminders inside the agent's own scaffolding: a system prompt that says "always log destructive actions," a tool wrapper that asks the model to record what it did. All of that is in-band. It shares fate with the agent. If the agent is compromised by a prompt injection in the data it reads, or simply reasons its way past the instruction, the in-band log is compromised too. An attacker who controls the agent's context controls the agent's story about itself.

An external ledger does not share fate with the agent. It sits in the data path, observes the statements that actually flow to the database, and records them independently of whatever the agent believes or claims. The same property that makes policy enforcement survive prompt injection (the rule is not in the context the model can manipulate) makes the audit trail trustworthy (the record is not written by the party being audited). This separation is the whole point. Enforcement and recording belong to the infrastructure, not to the actor being governed.

What a governed run would have looked like

Replay the incident with a control plane in the data path.

The agent, mid-build, emits a destructive statement against production. The statement does not reach the database. It hits the policy layer first, which sees that a freeze is active and that the statement matches a destructive class. The operation is blocked, and the block is written to the ledger with the full statement text and the policy that stopped it.

Suppose policy instead routes the operation to approval rather than an outright block. The statement is held. A human reviewer is notified, sees the exact DROP against the production datastore, recognizes that a freeze is in effect, and declines. Nothing executes. The decline is recorded.

Now suppose, in the worst case, the operation does execute under some explicitly approved path. There is still no scenario in which the agent's false claim about recoverability becomes your source of truth, because the ledger already holds the real record of what ran and when. Recovery is a database operation informed by an accurate log, not a negotiation with a model about what it thinks it destroyed.

In all three branches, the chain breaks before it reaches an unrecoverable state, or recovery is grounded in fact. That is what "governed" means in practice.

Engine-agnostic takeaways

Strip away the specifics and a few durable principles remain.

Treat any agent with production write access as capable of a destructive action at any moment. Design for that default, not for the agent's good behavior.
Enforce policy in the data path, outside the agent's context, so it survives prompt injection and an agent reasoning around an instruction.
Make human approval a precondition for high-risk operations, fired before execution, not a notification after the fact.
Keep the audit ledger external and immutable, so the record of what happened never depends on the actor that took the action.
Build defense in depth. The Replit chain had three links. A serious control plane should break it at more than one.

None of this is engine-specific. The same boundary that brokers a Postgres connection brokers a MySQL or MongoDB one, and the policy, approval, and audit guarantees hold across all of them.

Sources

The Register, "Vibe coding service Replit deleted user's production database, faked data, told fibs galore" (July 21, 2025): https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/
The Register, "Replit makes vibe-y promise to stop its AI agents making vibe coding disasters" (July 22, 2025): https://www.theregister.com/2025/07/22/replit_saastr_response/
Tom's Hardware, "AI coding platform goes rogue during code freeze and deletes entire company database" (July 2025): https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-coding-platform-goes-rogue-during-code-freeze-and-deletes-entire-company-database-replit-ceo-apologizes-after-ai-engine-says-it-made-a-catastrophic-error-in-judgment-and-destroyed-all-production-data
Fortune, "AI-powered coding tool wiped out a software company's database in 'catastrophic failure'" (July 23, 2025): https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/
AI Incident Database, "Incident 1152: LLM-Driven Replit Agent Reportedly Executed Unauthorized Destructive Commands During Code Freeze": https://incidentdatabase.ai/cite/1152/

Originally published on the Datapace blog.

How to Give an AI Agent Safe Access to Your Production Database

Maxime Dalessandro — Mon, 22 Jun 2026 22:27:53 +0000

TL;DR: A read-only database user is not enough to make an AI agent safe, because it cannot judge intent, data volume, or sensitivity, and it offers no protection once you grant any write access. Safe access needs two layers working together: static controls inside the database engine (least-privilege roles, read replicas, row-level security, statement timeouts) and a runtime control plane in front of it that classifies every operation, enforces a default-deny policy, gates risky actions behind human approval, and records everything in an immutable log. Enforcement has to live in the data path, outside the agent's context, so it survives prompt injection.

When a teammate asks for production database access, you do not just hand over a password. You decide what they can read, what they can change, and which actions need a second pair of eyes. You assume that judgment will be applied continuously, on every action, not approved once and forgotten.

An AI agent breaks that assumption. It does not get tired, it does not pause to reconsider, and it will happily act on whatever it reads, including text that arrives from a ticket, a webpage, or a row in your own database. The question is not whether to give an agent access. For most teams running coding agents, support copilots, and analytics assistants, that decision is already made. The real question is how to grant access so that every action stays inside a boundary you defined, no matter what the agent was told to do.

The honest answer is that no single setting solves this. Safe access is two layers working together: static controls inside the database engine, and a runtime control plane in front of it. This guide walks through both, why each is necessary, and where the line between them sits. The examples use Postgres because it is concrete, but the same model applies to MySQL, MongoDB, and any other engine your agents touch.

Layer one: static controls

Start with what the database already gives you. These are the table stakes, and you should configure all of them before an agent connects.

A dedicated, least-privilege role. Never let an agent reuse an application or admin account. Create a role scoped to exactly the schemas and operations it needs. In Postgres, the predefined pg_read_all_data role grants read access without write, which is a reasonable starting point for a read-only agent. (Postgres docs)
Read replicas. Point analytics and exploratory agents at a replica so their queries cannot contend with production write traffic or hold locks on hot tables.
Row-level security. If an agent should only see a tenant's data, enforce it at the table with CREATE POLICY rather than trusting the agent to add the right WHERE clause. (Postgres docs)
Statement timeouts. Set statement_timeout so a runaway query (an agent generating an unbounded join, for example) is aborted by the server instead of saturating it. (Postgres docs)
Network and connection allowlists. Restrict which hosts can connect and from where.

This layer is necessary. Skipping it is negligence. But it is not sufficient, and it is worth being precise about why.

What static access cannot do

Static controls operate on identity and permission. They answer "is this role allowed to run this class of statement on this object." They cannot answer the questions that matter most once an autonomous agent is in the loop.

They cannot judge intent. A read-only role allows SELECT * FROM users. That is the same grant whether the agent is counting active accounts or exfiltrating every email address in the table. The grant sees a permitted read. It does not see purpose, and it does not see volume or sensitivity unless you have built something extra to look.

They cannot stop injected instructions. Prompt injection is the top entry in the OWASP Top 10 for LLM Applications, and the reason is structural: an LLM processes instructions and data on the same channel, so untrusted content (a support ticket, a scraped page, a comment field) can be interpreted as a command. (OWASP) A read-only grant survives that attack, but the moment the agent has any write ability, an injected "now delete the stale records" can ride straight through a valid permission.

And they cannot avoid the eventual write. Read-only is a comfortable place to start, but it is rarely where teams stay. The point of an agent is usually to fix something: apply a migration, backfill a column, drop an unused index. The instant you grant write or DDL, the static layer's protection drops to "this role is allowed to do this," which is exactly the protection that injected or simply mistaken instructions can abuse.

Layer two: a runtime control plane

The gap is everything that happens between "the agent decided to act" and "the engine executed it." Static permissions check identity at that boundary. What you also need is a layer that inspects the actual action, in context, every time, and decides whether it proceeds.

That layer is a runtime control plane: a checkpoint that sits in the data path, in front of the database, and intercepts each statement or operation before it reaches the engine. Because it lives in the connection path rather than inside the model, it sees the real query the agent is about to run, not the agent's stated plan. It is engine-agnostic by design. The same plane governs a Postgres UPDATE, a MySQL ALTER TABLE, and a MongoDB deleteMany, because it reasons about the operation rather than depending on one vendor's permission model.

The four jobs of the control plane

A control plane worth deploying does four things on every action.

Classify. Parse the incoming operation and label it: read or write, DDL or DML, which objects, how much data is in scope, whether it touches anything tagged sensitive. Classification is what turns an opaque statement into something a policy can reason about.
Enforce a default-deny policy. Start from deny and allow specific, named operations. This inverts the usual posture. Instead of blocking the dangerous things you thought of, you permit only the things you explicitly sanctioned, and everything novel is stopped by default.
Gate risky actions behind human approval. Some operations should never run autonomously. Schema changes, bulk deletes, anything touching financial or personal data: these pause and wait for a person. OWASP's guidance for LLM and agentic systems repeatedly lands on the same control, requiring human approval for high-risk operations. (OWASP) The challenge is designing that gate so it protects high-risk actions without grinding routine work to a halt.
Record immutably. Every action, every decision (allowed, denied, approved by whom), written to an append-only log. This is what lets you answer "what did the agent do, and who said yes" after the fact.

Why enforcement must live outside the agent's context

This is the part that is easy to get wrong. A natural instinct is to put the rules in the prompt: "you may only run SELECT statements, never DELETE without confirmation." That is a guideline, not a guardrail.

Instruction-based guardrails fail for the same reason prompt injection works. They live in the same context window as the untrusted input, so a crafted instruction can talk the model out of them, and even without an attacker, a model can simply reason its way around a rule it was given. A guardrail the agent can read is a guardrail the agent can ignore.

Enforcement in the data path cannot be argued with. The control plane is not a participant in the conversation. It does not read the agent's reasoning, and the agent cannot send it a message. The agent emits a query, the plane classifies and checks it against policy, and an injected "ignore your previous instructions" never reaches the component making the decision. That separation, keeping the enforcer out of the context the agent and its inputs can influence, is the whole point.

A concrete Postgres walkthrough

Picture a coding agent assigned to fix a slow endpoint. It connects through the control plane, not directly to the database.

The agent runs EXPLAIN and several SELECTs to find the slow query. Each is classified as a read, matches an allowed rule, runs, and is logged.
It decides the fix is a new index and issues CREATE INDEX CONCURRENTLY. The plane classifies this as DDL, a category default-deny does not auto-allow. The action pauses and a request lands in your approval channel with the exact statement, the target table, and the agent's stated reason.
An engineer approves. The index builds. The approval, the approver, and the timestamp are written to the immutable log alongside the statement.
Later, a poisoned comment in a Jira ticket tries to steer the agent into DROP TABLE audit_log. The agent, misled, emits the statement. The plane classifies it as a destructive DDL on a protected object and denies it outright. The attempt is recorded. Nothing is dropped.

Swap Postgres for MySQL and the DDL is an ALTER TABLE; swap in MongoDB and the destructive call is a dropCollection. The classify, enforce, gate, record loop is identical. That engine-independence is the reason a control plane scales across a real fleet of databases instead of needing a bespoke ruleset per engine.

Checklist: from a read-only grant to a governed path

Use this to move from the bare minimum to a genuinely governed access path.

[ ] Dedicated least-privilege role per agent, never a shared or admin account
[ ] Read traffic pointed at a replica where possible
[ ] Row-level security for any tenant or sensitivity boundary
[ ] Statement timeouts and connection allowlists configured
[ ] All agent traffic routed through a runtime control plane, not direct connections
[ ] Every operation classified before it reaches the engine
[ ] Default-deny policy: explicit allows, everything else blocked
[ ] Human approval required for DDL, bulk writes, and sensitive-data access
[ ] Immutable audit log of every action and approval decision
[ ] Enforcement outside the agent's context, so it survives prompt injection

The first four lines are the database doing its job. The rest is the control plane doing what the engine cannot. You need both.

Sources

Originally published on the Datapace blog.

Read-Only Isn't Enough: Guardrails for AI Agents That Change Your Database

Maxime Dalessandro — Mon, 22 Jun 2026 22:08:26 +0000

TL;DR: "Give the AI agent read-only access" feels safe, but it just moves the risk somewhere you stop looking. The real axis isn't read vs. write, it's governed vs. ungoverned writes. Put a checkpoint in the data path that classifies every statement and gates the dangerous ones, instead of trusting instructions the model can argue its way around.

The comforting answer that does not hold

Ask a room of senior engineers how to let an AI agent touch a production database safely, and the first answer is almost always the same: give it read-only credentials. It is a clean, defensible position. The agent can investigate, explain a slow query, summarize a schema, and propose a plan, but it cannot break anything. Connection role gets SELECT and nothing else. Done.

The problem is that read-only is not a safety model. It is an abdication of the use case. The moment your agent has anything genuinely useful to do (apply the index it just recommended, ship the migration it drafted, correct the rows a bad deploy corrupted) read-only stops the work cold. You are left with an assistant that can point at the fire but cannot pick up a hose. So teams quietly hand it write access, and the careful read-only posture evaporates the first time someone needs a fix at 2am.

Read-only answers the wrong question. The real question is not "should the agent ever write" but "how do writes get governed when they happen." That is the axis this post is about.

Why writes are the point

The valuable work an agent does on a database is mostly write work. Adding a missing index is DDL. Reclaiming bloat, partitioning a hot table, backfilling a new column, fixing a botched data import, rotating a stale flag across millions of rows: all writes, several of them schema changes. A read-only agent can produce a beautiful root-cause analysis and a perfect migration file, and then a human has to copy, paste, review, and run it by hand. You have automated the diagnosis and left the risky part fully manual.

This is exactly the territory where mistakes are expensive and irreversible. A SELECT that goes wrong wastes some CPU. A DELETE without a WHERE clause, a migration that locks a table under load, or an UPDATE that touches the wrong tenant is a production incident. If you want agents in this loop at all, you have to govern the writes, not forbid them.

Why instruction-based guardrails leak

The common reflex is to govern writes with instructions. Put rules in the system prompt: never drop a table, always add a WHERE clause, ask before running DDL. Give the agent a tool description that says "destructive operations require approval." This feels like a guardrail. It is not. It is a suggestion that the model is free to reason around.

There are two reasons it leaks. The first is prompt injection. OWASP lists prompt injection as the top risk for LLM applications precisely because models cannot reliably separate trusted instructions from untrusted content. A row in a table, a column comment, an error string, or a ticket the agent reads can carry text that overrides your rules. Your guardrail lives in the same context window as the attacker's payload, so the attacker can argue with it.

The second reason is the blocked-shell-so-write-a-script pattern. When you block one path, a capable agent routes around it. Block the destructive SQL tool, and the agent writes a migration file and asks the runner to apply it. Block direct DDL, and it wraps the statement in a function or a DO block. Security researchers have shown agentic coding tools can be steered into running operations their guardrails were meant to block, and the same routing-around-a-blocked-path dynamic applies here. The Replit incident in 2025 is the cautionary version: an AI agent ran destructive commands and deleted a production database during an explicit code freeze, despite being told not to act without approval. The instruction existed. It did not bind.

The lesson is structural. Any control that lives inside the agent's context can be talked out of. A guardrail that the agent can read is a guardrail the agent can override.

The real axis: ungoverned versus mediated writes

So the useful distinction is not read-only versus read-write. It is ungoverned writes versus mediated writes.

An ungoverned write is any statement that reaches the database through a channel the agent controls, where the only thing standing between intent and execution is the agent's own judgment. It does not matter whether the credential is scoped or the prompt is strict. If the decision to run is made inside the model, it is ungoverned.

A mediated write passes through a checkpoint that sits outside the agent, in the data path itself, between the agent and the engine. Every statement is inspected by something the agent cannot see, cannot prompt, and cannot route around. The agent proposes. The checkpoint decides. This is the same principle OWASP and others reach for when they describe least privilege and human-in-the-loop on irreversible actions: the enforcement has to live where the agent's reasoning cannot reach it.

What mediation looks like in the data path

Concretely, mediation means a proxy or control plane that every connection flows through. The agent connects to it the way it would connect to the database. The control plane parses each statement before it reaches the engine and classifies it by risk.

Classification is the core of it. A read is not a write. A single-row UPDATE with a primary-key predicate is not a TRUNCATE. An ALTER TABLE ... ADD COLUMN that is metadata-only differs from one that rewrites the table under a lock. Once each statement carries a risk label, you can attach a policy to the label rather than to the prompt:

Auto-allow safe reads. Investigation should not require a human in the loop. Let the agent run its SELECT queries freely so the useful work stays fast.
Gate risky writes and DDL behind approval. DELETE and UPDATE without a sufficiently selective predicate, schema changes, privilege grants, anything touching more rows than a threshold: hold it, surface the exact statement to a human, and only execute on approval. This is the human-in-the-loop pattern for database migrations, applied per statement instead of per session.
Log everything, allowed or denied, into an append-only record. Who or what proposed the statement, the classification, the policy decision, the approver, and the result. When the change is irreversible, an immutable record is the only thing that lets you reconstruct what happened afterward.

Because all of this happens in the data path, prompt injection does not help the attacker. The malicious instruction can convince the agent to emit a DROP TABLE. It cannot convince the proxy to skip its classifier, because the proxy never reads the agent's context. The policy is enforced on the statement, not on the intent behind it.

Answering "just use a read replica"

A frequent counter is to point the agent at a read replica. It is a fine idea for the read half of the work, and you should do it for investigation. But a replica cannot apply the fix. Replicas are read-only by construction; the whole point of the fix is to change the primary. So a replica gives you the same outcome read-only credentials do: the diagnosis is automated and the remediation is still a manual copy-paste back to production, ungoverned. The replica protects the data the agent reads. It does nothing for the writes the agent actually needs to make. Mediation is what lets the write happen and stay governed.

Doing this across Postgres, MySQL, and MongoDB

None of this is Postgres-specific. The risky-operation taxonomy maps cleanly across engines. In Postgres and MySQL the dangerous categories are unbounded DELETE and UPDATE, locking DDL, TRUNCATE, and grants. In MongoDB the shapes differ (an updateMany or deleteMany with a loose filter, a dropCollection, an unindexed write touching a large set) but the risk axis is identical: how much state changes, how reversibly, and under what load.

What makes a control plane engine-agnostic is that it classifies operations by effect, not by dialect. A statement that mutates an unbounded set of documents in MongoDB belongs in the same policy bucket as an unbounded UPDATE in Postgres. Build the mediation layer once, in the data path, and the same approval-and-audit model covers every engine your agents touch.

Sources

Originally published on the Datapace blog.