
Nathan Schram

Originally published at littlebearapps.com

My $5/month Cloudflare bill hit $4,868 because of an infinite loop

The invoice said $4,868.00. My Cloudflare account usually costs $5 a month.

In January 2026, two bugs in two different workers wrote billions of rows to D1. I'm a solo developer on the Workers Paid plan. I don't have a billing department. I have a credit card and a vague hope that nothing goes catastrophically wrong. That hope cost me 18 days of stress, a near-suspension of my entire account, and a spam folder I should have been checking more carefully.

TL;DR: Two code bugs wrote 4.83 billion rows to Cloudflare D1 in January 2026, generating a ~$4,868 overage on a $5/month account. After 18 days and four escalation channels, Cloudflare waived the full $4,586.64 invoice. I then built a three-layer circuit breaker system so it can't happen again.

[Figure: Bar chart showing D1 write operations spiking to 1.42 billion in January 2026, with two colour-coded bug periods highlighted]

What went wrong with D1?

Two separate bugs, two separate projects, both writing to D1 without anything to stop them.

The embedding worker that couldn't stop writing

Semantic Librarian is my Australian heritage records project. 1.4 million historical records from the National Library of Australia's Trove archive, searchable via Workers AI embeddings stored in Vectorize, backed by a D1 database. The worker runs on a cron schedule, processing documents in batches: fetch a batch of records, generate embeddings through Workers AI, write the vectors and metadata to D1, move to the next batch.

The bug was in the "move to the next batch" part. There was no deduplication check. The worker would process a batch of documents, write the embeddings, and on the next cron tick, process the exact same batch again. No offset tracking. No "already processed" flag. Every cycle wrote the same records. And the next cycle wrote them again.
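A minimal sketch of the missing piece, with hypothetical names: persist a cursor between cron ticks so each run resumes where the last one stopped. In a real worker the cursor would live in KV or D1; here an in-memory store stands in so the pattern is visible on its own.

```typescript
// A stand-in for a KV namespace holding the batch cursor.
// In a worker this would be env.KV with get/put on a cursor key.
type CursorStore = { get(): Promise<number>; put(v: number): Promise<void> };

function memoryStore(): CursorStore {
  let cursor = 0;
  return {
    get: async () => cursor,
    put: async (n) => { cursor = n; },
  };
}

// One cron tick: read the cursor, take the next batch, advance the cursor.
// Without the final put(), every tick reprocesses the same batch forever.
async function processTick(
  store: CursorStore,
  records: string[],
  batchSize: number,
): Promise<string[]> {
  const offset = await store.get();
  const batch = records.slice(offset, offset + batchSize);
  // ...generate embeddings and write to D1 here...
  await store.put(offset + batch.length); // the fix: move past processed records
  return batch;
}
```

The same effect can be had with an "already processed" flag per record; the cursor version just costs one KV read and one KV write per tick.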

For four days, from January 11 to 14, the worker ran on autopilot while I was focused on building other things. I wasn't watching the Cloudflare dashboard. Why would I? The worker was deployed, running on a cron, no errors in the logs.

3.45 billion D1 writes in four days. Here's how that breaks down:

| Date | D1 Writes | Cost |
| --- | ---: | ---: |
| Jan 11 | 479,873,853 | $479.91 |
| Jan 12 | 1,335,107,674 | $1,259.99 |
| Jan 13 | 1,424,638,592 | $1,411.40 |
| Jan 14 | 282,900,856 | $282.65 |

Peak day was January 13: 1.42 billion writes in 24 hours. Storage spiked to 10 GB. After I killed the worker and cleaned up, it dropped to 2.4 GB, confirming most of it was duplicate data.

I didn't notice for four days because the worker was running silently. No errors. No alerts from Cloudflare. No email saying "hey, your D1 writes are 7,000x above normal." Just a worker doing exactly what I told it to do, over and over and over.

The harvester without ON CONFLICT

A second project, a GitHub data harvesting tool I was deploying for the first time, had a different version of the same problem. During the initial data seeding phase in early January (Jan 1-4), each scan cycle re-inserted existing records instead of updating them. The INSERT statements had no ON CONFLICT clause. So every time the harvester ran, it tried to insert records that already existed, and D1 happily accepted every one. About 910 million redundant writes in four days.

I found this one faster and fixed it on January 5 with proper ON CONFLICT DO UPDATE clauses. The Semantic Librarian bug started six days later.
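For reference, the shape of that fix, with hypothetical table and column names: the INSERT becomes an upsert, so a re-scan updates the existing row instead of inserting a duplicate.

```typescript
// Hypothetical schema; the point is the ON CONFLICT clause, which turns a
// blind INSERT into an upsert keyed on the primary key.
const upsert = `
  INSERT INTO repos (id, stars, updated_at)
  VALUES (?1, ?2, ?3)
  ON CONFLICT(id) DO UPDATE SET
    stars = excluded.stars,
    updated_at = excluded.updated_at
`;

// In a worker this would run as:
//   await env.DB.prepare(upsert).bind(id, stars, now).run();
```

With the original clause-free INSERT, D1 (SQLite underneath) accepts the row as long as no uniqueness constraint fires, and each accepted row is a billed write.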

Between the two bugs: 4.83 billion D1 writes in January. To put that in perspective, my normal usage across all 9 databases is maybe 200 writes per hour. The D1 pricing page says $1 per million rows written beyond the 50 million included. 4.83 billion rows at that rate is $4,779 in write charges alone, plus storage, requests, and AI inference costs that pushed the total to $4,868.

How do you fight a $4,868 bill on a $5/month account?

Slowly. Across multiple channels. With a detailed audit document and more patience than I thought I had.

7 days of silence

On February 1, I submitted support ticket #01953111. I didn't just write "please waive this." I attached a full usage audit as a PDF: daily D1 write counts broken down by project, spike period analysis with exact dates and row counts, root cause analysis for each bug, and a list of every fix and architectural improvement I'd deployed to prevent recurrence.

I wanted to make it easy for whoever reviewed it. Here's exactly what happened, here's exactly why, and here's what I built to make sure it doesn't happen again. If you're going to ask a company to waive $4,868, you should come prepared.

No response by February 7. Six days. I sent a follow-up asking if it had been assigned to the billing team. Nothing.

Finding the right human

On February 7, I posted to the Cloudflare Community Forum. A CF Community MVP called neiljay responded quickly and pointed me to a post by cherryjimbo (CF MVP '23-'26) who had shared a direct email for the Head of Billing.

On February 8, I emailed Dmitry Alexeenko (Head of Billing) directly, referencing my ticket number and cherryjimbo's referral.

Then I waited.

Marta from support had actually replied on February 11. The case had been raised with Engineering and was on temporary hold. That should have been reassuring.

The spam folder that almost killed my account

On February 18, I found an automated email in my junk folder. It was dated February 17. Cloudflare's billing system had sent a suspension warning: paid services would be disabled for the unpaid invoice. I ran a full account audit. R2 object storage and Analytics Engine were already disabled. My 34 workers were still running, the 8 D1 databases were still accessible, KV and Queues were fine. Partial suspension, not full. Not yet.

The human support team had my case on hold with Engineering, actively working on it. The automated billing system operated on its own timeline and didn't check whether a human being was already handling the dispute. Two parallel systems, zero coordination between them.

I sent urgent follow-ups to both Marta and Dmitry. Dmitry's autoresponder came back: the Portugal office was closed for Carnival, and he'd included his mobile number for urgent matters. I texted him. That same day, I posted to Reddit r/CloudFlare: "Support said my $4.8k billing dispute was on hold, but the automated system just suspended me anyway."

By that evening, four escalation channels were active: the original support ticket, the community forum post, the direct email to Dmitry, and the Reddit post. Within hours, things moved. Dmitry responded despite the holiday. Akash Das, Director of Customer Support, personally took the case. He'd read my audit document and accurately identified both root causes: the infinite write loop in the heritage records worker and the missing conflict handling in the data harvesting tool. The case was upgraded to urgent priority.

Did Cloudflare do the right thing?

Yes. Daniel Anselmo (Technical Support Shift Engineer) confirmed the full waiver on February 19: $4,586.64, invoice IN 56608827. Account unlocked. All services restored. I re-subscribed to the Workers Paid plan and verified everything: 34 workers running, 8 D1 databases accessible, KV, Queues, R2, Analytics Engine all back online.

The $4,586.64 was the actual invoice total, slightly different from my $4,868 estimate because of how Cloudflare calculates final billing. Either way, the full amount was waived as a one-time courtesy.

18 days from first ticket to resolution. That feels long when you're living it, and fair when you look back at it. I want to credit the specific people who made the resolution happen: Akash Das (Director of Customer Support) for personally reviewing the case and identifying both technical root causes accurately from my audit. Dmitry Alexeenko (Head of Billing) for responding and escalating despite a public holiday in Portugal. neiljay and cherryjimbo on the Cloudflare Community Forum for pointing me to the right contact when the ticket queue was silent. And Daniel Anselmo for closing it out cleanly.

My critique isn't of the people. The people were good. It's of the gap between the human support process (which was thorough once it engaged) and the automated billing system (which nearly suspended my entire account while that support was actively investigating my case). Those two systems don't talk to each other fast enough. A billing dispute that's actively being reviewed by the Director of Support should probably not trigger an automated suspension at the same time.

Why doesn't D1 have write rate limits?

Mine isn't an isolated case.

In 2025, ofsecman.io documented a $5,000+ D1 overage caused by a missing WHERE clause in an update statement. A single-row update became a full-table update on every incoming request. Over $5,000 in under 10 seconds. Their conclusion was blunt: "Don't ever use Cloudflare D1 as a Database."

On the Cloudflare Community Forum, a first-time database user reported a $3,200 bill because they didn't set up an index. Their credit card was overdrawn before they noticed anything. "Cloudflare did not give me any notice or reminder."

Three different bugs, three different accounts, same outcome. D1 charges per row written with no caps, no write rate limits, and no billing alerts granular enough for D1 write operations specifically. Cloudflare's billing notifications exist, but they're not designed to catch a worker writing a billion rows in 24 hours.

Scale-to-zero billing is D1's selling point. You pay nothing when your database is idle. That's genuinely great for solo developers and small projects, and it's why I chose Cloudflare's stack in the first place. Scale-to-zero also means scale-to-infinity when a bug amplifies, because the same billing model that charges you nothing at rest charges you per operation at scale with no ceiling.

D1 hit general availability in April 2024. The billing model shipped before the billing safeguards did. This is a platform maturity gap, not malice, and I expect Cloudflare will address it. It's a gap that has already cost at least three people real, documented money, though, and probably more who paid the bill without writing about it.

I'm not saying don't use D1. I still use it across 9 databases for multiple projects. I'm saying don't use it without your own circuit breakers, because the platform doesn't have them yet.

What did I build to prevent this from happening again?

After the invoice was waived, I spent a week doing nothing except cost safety. I had been building features. Now I was building guardrails. Nine improvements across three tiers of priority, all shipped and deployed. Tier 1 was critical one-line fixes I could push immediately. Tier 2 was the anomaly detection that would have caught the January incident within an hour. Tier 3 was the longer-term monitoring improvements.

Everything described here lives in an infrastructure SDK I built after the incident. Two TypeScript packages: a consumer SDK that goes into each worker, and an admin backend with a monitoring dashboard and the telemetry pipeline that feeds it. The admin side runs on Cloudflare Pages, so I can check budget state from my phone - a meaningful upgrade from my previous approach of noticing the invoice a month later.

I open-sourced it because three people hitting the same $4,000+ wall suggests this isn't just my problem.

Update (March 2026): The original circuit breaker infrastructure described above (Platform SDKs) worked but was too complex - 10+ workers, 61 D1 migrations, cross-account HMAC forwarding. I've since replaced it with CF Monitor, a much simpler rewrite: one worker per account, Analytics Engine + KV only, zero D1. It's open source and available as an npm package (@littlebearapps/cf-monitor).

Each worker imports the consumer SDK, wraps its environment bindings on startup, and the tracking happens automatically. No per-project instrumentation.

Three layers of circuit breakers

The circuit breakers work at three levels of granularity.

Feature-level is the most precise. Each distinct function in each project gets its own budget. A GitHub scanner, a document embedder, an API endpoint - each has a defined daily limit for D1 writes, KV operations, Workers AI neurons, whatever resources it consumes. If the document embedder goes haywire, it gets disabled. The GitHub scanner keeps running.

Project-level aggregates all features for a project. Individual features might stay within their budgets while the project total is too high.

Global emergency stop is the nuclear option. It kills everything across all projects immediately. I haven't had to use it. I hope I never do.

Each level enforces its budget progressively: at 70%, a Slack warning; at 90%, a critical alert; at 100%, the feature is automatically disabled. The breakers auto-reset after 1 hour via KV TTL, so a tripped feature doesn't sit dead until I manually re-enable it at 3am.
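That progressive enforcement reduces to a small pure function. The thresholds are the ones above; the names are mine, not the SDK's.

```typescript
// Map a usage ratio onto the escalation ladder: warn at 70%,
// critical alert at 90%, trip the breaker at 100%.
type BreakerAction = "ok" | "warn" | "critical" | "disable";

function evaluateBudget(used: number, budget: number): BreakerAction {
  const ratio = used / budget;
  if (ratio >= 1.0) return "disable";  // 100%: auto-disable the feature
  if (ratio >= 0.9) return "critical"; // 90%: critical alert
  if (ratio >= 0.7) return "warn";     // 70%: Slack warning
  return "ok";
}
```

The same function serves all three levels; only the `used` and `budget` inputs change (per feature, per project, or account-wide).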

Counting writes before they become a bill

The tracking can't use D1 writes to count D1 writes. That would be self-defeating. If my monitoring system writes usage data to D1, it's consuming the exact resource it's trying to protect. The January incident itself proved this: my original monitoring infrastructure was writing ~200 rows per hour to D1 just to track usage across all projects. That's not a lot in isolation, but the principle is wrong.

The consumer SDK wraps your Cloudflare environment bindings with proxies that automatically count every operation. D1 reads and writes, KV gets and puts, R2 uploads, Workers AI inference calls, Vectorize queries - all tracked transparently. When your worker calls env.DB.prepare(...).run(), the proxy intercepts it, increments a counter, and forwards the call. Your code doesn't change. You call createTrackedEnv(env) at startup and the counting happens behind the scenes.
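A stripped-down sketch of that proxy idea. The real SDK has to handle chained calls like prepare(...).run() and per-resource counters; this version just counts direct method calls on a single binding, which is enough to show the mechanism.

```typescript
// Wrap a binding so that calls to the named methods increment a counter
// before being forwarded unchanged. The caller's code does not change.
function trackedBinding<T extends object>(
  binding: T,
  counter: { n: number },
  methods: Set<string>,
): T {
  return new Proxy(binding, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value === "function" && methods.has(String(prop))) {
        return (...args: unknown[]) => {
          counter.n += 1;                   // count the operation
          return value.apply(target, args); // forward the original call
        };
      }
      return value; // non-tracked properties pass through untouched
    },
  });
}
```

A hypothetical createTrackedEnv(env) would apply this to each binding (DB, KV, R2, AI) with the right method set and a per-resource counter.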

The counters get flushed to Analytics Engine via a Cloudflare Queue. Analytics Engine is free for the first 25 million data points per month and it's designed for exactly this kind of high-volume telemetry. Zero D1 write overhead for the tracking itself.

The budget checker queries Analytics Engine roughly every 30 seconds, sums up recent writes per feature, and compares them against the budget defined in a YAML config file. If a feature crosses 70%, Slack warning. 90%, critical alert. 100%, the feature's circuit breaker trips and the SDK starts rejecting operations for that feature until the breaker resets.

Detection latency: about 30 seconds from a write happening to the circuit breaker evaluating it. In January, my runaway worker ran for four days. Now it would run for about 30 seconds before getting shut down automatically.

The monitoring that ties it together

On top of the circuit breakers, the nine hardening improvements across three tiers:

  1. Auto-reset circuit breakers via KV TTL (1-hour expiry, no manual intervention needed)
  2. Workers AI cost monitoring added to the sentinel (previously untracked)
  3. Investigation SQL column fix (the monitoring was querying the wrong column name)
  4. Hourly D1 write anomaly detection using a 168-hour rolling window with 3-sigma threshold
  5. Per-project anomaly detection, not just account-wide (previously only checked totals)
  6. Budget warning thresholds at 70% and 90% with Slack alerts and 1-hour deduplication
  7. Monthly budget tracking with progressive alerts at 70%, 90%, and exceeded
  8. Batch resource snapshot inserts, reduced from ~200 individual D1 writes per hour to ~8 batch transactions
  9. Six missing budget overrides for features that were falling back to overly generous defaults

All nine shipped and deployed within a week. The batch insert change alone (item 8) cut monitoring D1 write overhead by 96%, from ~200 individual writes per hour down to ~8 batch transactions. The hourly anomaly detection (item 4) would have caught the January spike within its first hour: 480 million writes in a single day is roughly 15,000 standard deviations above my normal baseline of ~200 writes per hour. The 3-sigma threshold would have tripped before the first hour was up.
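The 3-sigma check in item 4 can be sketched as a pure function over a rolling window of hourly write counts. This is a simplification under my own naming; the deployed version uses a 168-hour window and per-project baselines.

```typescript
// Flag the latest hourly count as anomalous if it exceeds the window's
// mean by more than `sigmas` standard deviations.
function isAnomalous(window: number[], latest: number, sigmas = 3): boolean {
  const mean = window.reduce((a, b) => a + b, 0) / window.length;
  const variance =
    window.reduce((a, b) => a + (b - mean) ** 2, 0) / window.length;
  const stddev = Math.sqrt(variance);
  // Guard: a perfectly flat baseline (stddev 0) makes any increase anomalous.
  if (stddev === 0) return latest > mean;
  return latest > mean + sigmas * stddev;
}
```

Against a baseline of ~200 writes per hour, a runaway hour in the tens of millions clears the threshold by a huge margin, which is why the January spike would have tripped this within its first evaluated hour.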

What pattern do serverless billing failures share?

Three cases. Same billing model. Same outcome.

| | My infinite loop | ofsecman.io | CF Community |
| --- | --- | --- | --- |
| Root cause | Missing deduplication | Missing WHERE clause | Missing index |
| Time to cost | 4 days | 10 seconds | Days (unclear) |
| Bill | $4,868 | $5,000+ | $3,200 |
| Warning from CF | None | None | None |

The common denominator isn't the code. Bugs happen. I wrote about this in my last post on dogfooding: the bugs that matter most are the ones that live in the gaps between states, not in the states themselves. A worker that runs correctly once will also run correctly a billion times. The bug isn't in the execution; it's in the assumption that anything would stop it.

The common denominator is that D1's billing model has no safety net between "working correctly" and "catastrophic overage." No write rate limit. No anomaly detection. No automatic pause when usage spikes 10,000x above normal. The billing system faithfully counts every row, generates an invoice, and sends it to your credit card.

Every serverless database with per-operation billing has this exposure. D1 isn't unique in charging per write. Most managed databases give you some combination of connection pools, query timeouts, billing caps, or at minimum a usage alert that fires before you hit four figures. D1 currently offers none of those for write operations.

If you're building on D1 in production, build your own circuit breakers. I did. The infrastructure described in this post took about a week to build and deploy. The January invoice would have taken me considerably longer to pay off. That's a pretty clear cost-benefit calculation.


Common questions about D1 billing and cost protection

Can you set a billing cap on Cloudflare D1?

No. As of March 2026, Cloudflare doesn't offer a hard billing cap for D1 write operations. You can set up billing notifications, but they're not granular enough to catch a worker writing a billion rows overnight. Application-level circuit breakers are currently the only option.

How do you detect a runaway worker before the bill arrives?

Monitor D1 write counts at sub-hourly intervals. I use Analytics Engine (free tier, 25 million events per month) to track every D1 write via proxied environment bindings, with a budget checker that evaluates every 30 seconds. Anomaly detection with a 168-hour rolling window catches spikes that exceed 3 standard deviations from normal. The whole system adds zero D1 write overhead because telemetry goes through Analytics Engine, not D1.

Is this just a D1 problem?

The billing exposure exists on any serverless platform with per-operation pricing and no rate limits. D1 is the most visible example right now because it's relatively new (GA 2024) and write-heavy workloads can accumulate cost fast. DynamoDB, Firestore, and PlanetScale all have their own versions of this risk, though most offer billing alerts or auto-scaling limits that D1 currently lacks.


The timeline, dollar amounts, and technical details in this post are reconstructed from support ticket #01953111, Reddit r/CloudFlare, Cloudflare dashboard data, and internal audit documents.
