D1 Schema Migrations with AI Agents: The DDL-in-Transaction Trap That Kills Zero-Downtime Deploys

#ai #aiagents #mcp #cloudflare

Running an AI agent to execute your D1 migrations will silently wreck your database — unless you explicitly forbid it from wrapping DDL in a transaction.

Claude Code, when handed a migration task, defaults to wrapping everything in BEGIN TRANSACTION / COMMIT because it looks safer. On D1 (SQLite under the hood), mixing DDL and DML in one transaction produces unpredictable behavior. The error you get — D1_ERROR: cannot start a transaction within a transaction — doesn't even hint at the real cause until you attach wrangler tail and watch the raw logs. I spent the better part of a day diagnosing this before I understood it was the agent being "helpful."

My ad analytics SaaS runs 12 Workers sharing 3 D1 databases. One bad migration cascades fast. The pattern I've settled on after 8 months is four discrete stages: add a nullable column (online, zero downtime), dual-write to both old and new columns for at least 24 hours, chunked backfill via Durable Object scheduler, then manual table recreation for the column drop. The backfill chunk size matters more than you'd think — I pushed 120k rows in one shot and hit the 30-second Worker timeout. Splitting into 1,000-row batches fixed it:

const batch = await db
  .prepare(
    `SELECT rowid, source FROM ad_events
     WHERE source_hash IS NULL
     ORDER BY rowid LIMIT 1000`
  )
  .all();
await db.batch(
  batch.results.map((row: any) =>
    db.prepare(`UPDATE ad_events SET source_hash = ? WHERE rowid = ?`)
      .bind(hashSource(row.source), row.rowid)
  )
);

The trickiest failure mode isn't the DDL error — it's what happens between stage 2 and 3. If a Workers deployment rolls back while you're in dual-write mode, the old code silently stops populating the new column. I've been caught twice thinking backfill was complete, only to find NULL rows still present before the table recreation step. A null-count check immediately before stage 3 is now a hard gate in my agent hook, not a suggestion.

The 120k-row backfill added roughly $0.80 to my D1 write costs. Workers execution time was the bigger line item. When D1 reads suddenly doubled in one billing cycle, the culprit was a dual-write window I'd accidentally left open for 48 hours instead of 24.

I wrote up the full breakdown — including the exact agent context schema I pass to prevent the transaction wrapping, the KV flag structure for multi-stage coordination, and why I still do stage 4 manually — over on riversealab.com.

Full post →

Top comments (1)

Alex Shev • Jul 2

This is exactly where agentic coding needs environment-specific memory. A generic migration plan can be technically plausible and still fail because the runtime has one sharp constraint.