Shantanu Gonade

Posted on Jun 25 • Originally published at towardsaws.com on Jun 17

The EventBridge Myth That Hid a Double-Registration Bug and a Race Condition in My Serverless App

#architecture #aws #cloudcomputing #ai

How a routine architecture review turned up a team myth about EventBridge and a double-booking bug hiding in plain sight — and why the two were the same problem wearing different hats.

Three weeks ago, I closed out a 14-week build of TEMS (Terrapin Events Management System) — a serverless event management platform for the University of Maryland. Students browse and register for campus events, organizers manage capacity and waitlists, admins approve events and watch a metrics dashboard. Under the hood: a single-table DynamoDB design, an AppSync GraphQL API, Cognito auth, EventBridge wiring eight backend services together, 54 Lambda functions, the works.

TEMS high-level architecture: AppSync → Lambda → DynamoDB, with EventBridge wiring eight backend services and Cognito handling auth. 54 functions, one custom bus, one table.

The completion report I wrote for myself said what every completion report says:

- 100% of planned features implemented

- AWS Well-Architected Framework compliance

- Production-grade security and scalability

Project Status: COMPLETE AND PRODUCTION-READY

I believed it. I’d written tests. I’d followed the patterns from the AWS docs I’d read. I’d even left myself meeting notes during the build with little architecture decisions and “why we did it this way” rationales — including one note that EventBridge would “automatically deduplicate events within 24 hours,” so a few rough edges in the registration flow weren’t a big deal.

Then, mostly out of curiosity, I pointed an AI architecture audit at the codebase and asked it to compare three things against each other: what the docs claimed, what the code actually did, and what AWS’s current (2026) guidance recommends for each pattern.

It came back with a numbered list of findings. Two of them stopped me cold — not because they were exotic, but because they were the kind of thing a senior reviewer catches in five minutes and a solo developer (me, several months deep in feature work) walks past a hundred times without seeing.

This is the story of those two findings, and why they turned out to be the same bug.

The setup: as-described vs. as-built vs. best-practice

Before the findings, a quick word on the audit itself, because the shape of it is as useful as the content.

I didn’t ask for a generic “review my AWS architecture” pass — that tends to produce a list of platitudes (“consider adding monitoring,” “review your IAM policies”) that are true of literally every AWS account on Earth. Instead, I gave it three things to triangulate:

- What the project documentation claims: architecture docs, meeting notes, the completion report.
- What the code actually does: every serverless.yml, every handler, the GraphQL schema, the DynamoDB table definition.
- What AWS currently recommends for each pattern in use (EventBridge, DynamoDB single-table design, idempotency, CQRS-style read models).

Then I asked it to flag every place those three things disagreed, with file and line citations for every claim.

That last part matters. An AI that says “your event-driven architecture could be more robust” is not useful. An AI that says “register.ts:161-199 writes two PutCommands keyed on a freshly generated registrationId, with no ConditionExpression enforcing uniqueness on (eventId, userId) — here's the AWS doc on conditional writes that addresses this" is useful, because you can go open that file and check it yourself in thirty seconds. Good audits — human or AI — are falsifiable. If you can't verify a finding against the actual repo, throw it out.

With that framing, here’s what came back.

Finding 1: “EventBridge automatically deduplicates events”

Buried in my own meeting notes from the week I wired up the registration flow was this line, paraphrased: EventBridge automatically deduplicates events, so duplicate RegistrationCreated events within a 24-hour window won't cause problems downstream.

I have no idea anymore where I picked that up. Maybe I was thinking of SQS FIFO queues and their 5-minute deduplication window. Maybe I half-remembered something about EventBridge Pipes. Either way, it had the structure of a fact: specific (24 hours), plausible (AWS services do dedupe things sometimes), and — crucially — comforting. If it’s true, a bunch of edge cases I hadn’t fully thought through just… go away.

The audit’s response was blunt: Amazon EventBridge provides at-least-once delivery and has no native event-level deduplication. None. Not in 24 hours, not ever. If a producer calls PutEvents twice with logically identical payloads, EventBridge will happily deliver both, to every matching rule, as two separate events. (There's PutEvents request-level retry behavior if the API call itself fails and you retry it, but that's a different thing from "the same business event was generated twice by my own application code," which is the case that actually matters here.)

For TEMS specifically, that means every consumer of RegistrationCreated, WaitlistAdded, WaitlistPromoted, and friends — published from register.ts:224-242 and waitlist-manager.ts onto our custom bus (TemsEventBus, with a 30-day replay archive and a DLQ behind it) — has to assume it might receive the same logical event twice and either be naturally idempotent (a second "send confirmation email" is harmless-ish, if annoying) or actively de-duplicate (a second "decrement available capacity" is not harmless at all).
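To make "actively de-duplicate" concrete, here is a minimal sketch of a consumer that claims a business-level key before applying a side effect. Everything in it is an assumption for illustration: the event shape, the CapacityConsumer and ProcessedStore names, and the in-memory Set, which stands in for what would more plausibly be a conditional write against a "processed events" item in DynamoDB. It is not the TEMS consumer code.

```typescript
// Hypothetical event shape; field names are illustrative, not taken from TEMS.
interface RegistrationCreated {
  detailType: "RegistrationCreated";
  eventId: string;
  userId: string;
}

// In-memory stand-in for a conditional write (attribute_not_exists) on a
// "processed events" item. claim() returns true only the first time a key is seen.
class ProcessedStore {
  private seen = new Set<string>();

  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

class CapacityConsumer {
  seatsTaken = 0;
  private store: ProcessedStore;

  constructor(store: ProcessedStore) {
    this.store = store;
  }

  handle(evt: RegistrationCreated): "applied" | "skipped" {
    // Key on the business identity rather than the EventBridge event id:
    // two PutEvents calls for the same registration get two different event ids.
    const key = `${evt.detailType}#${evt.eventId}#${evt.userId}`;
    if (!this.store.claim(key)) return "skipped";
    this.seatsTaken += 1; // the non-idempotent side effect, now applied at most once
    return "applied";
  }
}

const consumer = new CapacityConsumer(new ProcessedStore());
const evt: RegistrationCreated = {
  detailType: "RegistrationCreated",
  eventId: "evt-1",
  userId: "user-1",
};
const results = [consumer.handle(evt), consumer.handle(evt)];
console.log(results, consumer.seatsTaken);
```

One design caveat: a key built only from (eventId, userId) would also swallow a legitimate re-registration after a cancellation, so a real consumer would likely need the registration's own identity or a version in the key.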

On its own, this is a “fix the docs, add a code comment” finding. Mildly embarrassing, low stakes. I’d have fixed the meeting note, added a // NOTE: EventBridge does NOT dedupe — see comment near the PutEventsCommand calls, and moved on.

Except the audit’s very next finding was about the registration handler. And once I read it, the meeting-note myth stopped being mildly embarrassing and started looking like the reason the second bug had survived this long.

Finding 2: the double-registration bug

Here’s a simplified version of what register.ts does when a student registers for an event (real file: backend/services/registrations/handlers/register.ts):

Read in isolation, every individual piece looks reasonable. There’s an idempotency check! There’s an “already registered” check! There’s an atomic counter increment using DynamoDB’s ADD update expression, which really is atomic! It reads like code written by someone who knew about these problems and was actively guarding against them. (It was. That someone was me, four months ago, clearly aware idempotency was a thing, just not finishing the thought.)

But walk through what actually happens when a student double-clicks “Register” — or, more realistically, taps it once on flaky dorm wifi, the request hangs, they tap again:

1. Request A arrives. No idempotencyKey (the frontend doesn't currently send one; it's optional, remember). isUserRegistered runs its Query, finds nothing, returns false. Capacity check passes. A is now mid-flight, about to write.
2. Request B arrives a few hundred milliseconds later, before A’s PutCommands have landed. isUserRegistered runs again, and because A hasn't written yet, B also gets false. Capacity check passes again.
3. Both requests generate their own fresh registrationId via nanoid(16). Both write two PutCommands each (USER#{userId}/REGISTRATION#{id} and EVENT#{eventId}/REGISTRATION#{id}) with completely different sort keys, because the IDs are different random strings. Nothing in DynamoDB rejects this. There's no ConditionExpression. There's no uniqueness constraint anywhere that says "a given (eventId, userId) pair may only have one active registration row."
4. Both requests call atomicIncrementRegistered(eventId, 1). This part is atomic (it's a DynamoDB UpdateItem with ADD registeredCount :inc), so the counter correctly goes up by 2. Correctly, for the wrong reason: it's accurately counting two registrations that should never have both existed.
5. Both requests publish a RegistrationCreated event to EventBridge.

End state: one student, two registration records, two QR codes, two confirmation emails, and an event capacity counter that now overstates real attendance by one. Do this across a popular event during a registration rush (say, the first ten minutes after a big lecture’s extra-credit event opens up) and you get a slow leak of phantom registrations that nobody notices until check-in day, when the room is emptier than the headcount predicted and phantom registrations are still holding seats in a now-“full” event that capacity checks are blocking new registrants from.

The race window is small — milliseconds — but it doesn’t need to be wide. It needs to be non-zero, and it needs traffic. A campus event platform at registration-rush time has both.
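The walkthrough above can be reproduced without AWS at all. The sketch below is an in-memory model of the check-then-write flow, not the actual register.ts: the array standing in for the table, the zero-delay tick() standing in for a network round trip, and the reg-N ids standing in for nanoid(16) are all assumptions, and the capacity check is left out for brevity. Only the names isUserRegistered and registeredCount come from the prose.

```typescript
type Item = { PK: string; SK: string; userId: string };

// Simulated network round trip: yields to the event loop like a real call would.
const tick = (): Promise<void> => new Promise((resolve) => setTimeout(resolve, 0));

function createNaiveRegistrar() {
  const table: Item[] = []; // stand-in for the DynamoDB table
  let registeredCount = 0;  // stand-in for the ADD-updated counter
  let nextId = 0;

  async function isUserRegistered(eventId: string, userId: string): Promise<boolean> {
    await tick(); // the Query
    return table.some((i) => i.PK === `EVENT#${eventId}` && i.userId === userId);
  }

  async function register(eventId: string, userId: string): Promise<string> {
    // Check first...
    if (await isUserRegistered(eventId, userId)) return "already-registered";
    nextId += 1;
    const registrationId = `reg-${nextId}`; // stands in for nanoid(16)
    await tick(); // ...then write, one round trip later
    // Unconditional puts with distinct sort keys: nothing here can collide.
    table.push({ PK: `USER#${userId}`, SK: `REGISTRATION#${registrationId}`, userId });
    table.push({ PK: `EVENT#${eventId}`, SK: `REGISTRATION#${registrationId}`, userId });
    registeredCount += 1; // correct arithmetic on an input that should not exist
    return registrationId;
  }

  return {
    register,
    eventRows: (eventId: string) => table.filter((i) => i.PK === `EVENT#${eventId}`).length,
    count: () => registeredCount,
  };
}

// Two overlapping requests for the same (eventId, userId), like a double tap.
async function simulateDoubleTap() {
  const registrar = createNaiveRegistrar();
  const outcomes = await Promise.all([
    registrar.register("evt-1", "user-1"),
    registrar.register("evt-1", "user-1"),
  ]);
  return { outcomes, eventRows: registrar.eventRows("evt-1"), registeredCount: registrar.count() };
}

simulateDoubleTap().then((result) => console.log(result));
```

Both calls pass the check before either write lands, so the event partition ends up with two registration rows for one student and the counter reads 2. Running the same two calls sequentially (awaiting the first before starting the second) returns "already-registered" for the second one, which is why a test suite that never overlaps requests does not see the problem.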

Where the two findings meet

Here’s the part that actually changed how I think about this codebase.

The “fix” for the double-registration bug is, fundamentally, idempotency at the write layer: make it so that no matter how many times “register user X for event Y” is requested, at most one registration record can ever exist for that pair. That’s a DynamoDB-side guarantee — a conditional write or a transaction — and it’s completely independent of anything downstream.

The myth said: “don’t worry about duplicate events, EventBridge will collapse them within 24 hours.” If I’d genuinely believed that and stopped there, I’d have been solving the wrong layer entirely — even in a world where EventBridge did dedupe events, that would do nothing to stop the DynamoDB writes in steps 3 and 4 above from happening twice. The duplication isn’t an EventBridge problem. It’s not even really a “two Lambda invocations” problem. It’s that the database has no opinion about whether a given (eventId, userId) pair is allowed to appear more than once, and nothing upstream of the database closes that gap.

The myth didn’t cause the bug. The bug would exist with or without it. But I think the myth is a big part of why the bug survived a full build cycle, a test suite with >80% coverage, and a “production-ready” sign-off — because every time some part of my brain noticed “hm, this registration flow doesn’t have a hard uniqueness guarantee,” another part of my brain had a half-remembered, comforting, wrong answer ready: that’s fine, it gets cleaned up downstream. Nothing gets cleaned up downstream. There is no downstream cleanup. There never was.

That’s the actual lesson, more than either finding individually: a wrong belief about how your infrastructure behaves doesn’t just cost you when you act on it directly — it costs you every time it talks you out of fixing something else.

The fix (briefly — full tutorial is next)

The shape of the fix is: give every (eventId, userId) pair a single, deterministic "claim" item

{
  PK: `EVENT#${eventId}`,
  SK: `REGCLAIM#${userId}`,
  // ...
}

and write it with ConditionExpression: 'attribute_not_exists(PK)', ideally inside the same TransactWriteItems call that creates the two registration records and the idempotency record. The second concurrent request's conditional write fails (with a ConditionalCheckFailedException for a standalone put, or a TransactionCanceledException carrying a ConditionalCheckFailed cancellation reason inside a transaction), the handler catches that and returns the existing registration instead of creating a new one, and the race window closes: not because it got narrower, but because DynamoDB itself now rejects the collision regardless of timing.
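As a rough sketch of what that transaction input could look like: the function below only builds a plain object in the shape that TransactWriteCommand from @aws-sdk/lib-dynamodb accepts, so it runs without credentials. The table name, the non-key attributes, and the helper names buildRegistrationTransaction and isDuplicateClaim are hypothetical, and the idempotency record and counter update are omitted to keep it short.

```typescript
interface RegistrationRequest {
  eventId: string;
  userId: string;
  registrationId: string;
}

interface PutOp {
  Put: {
    TableName: string;
    Item: Record<string, string>;
    ConditionExpression?: string;
  };
}

// Builds the input for one all-or-nothing write: the claim item plus both
// registration records. Intended to be passed to `new TransactWriteCommand(input)`.
function buildRegistrationTransaction(
  tableName: string,
  req: RegistrationRequest
): { TransactItems: PutOp[] } {
  const { eventId, userId, registrationId } = req;
  return {
    TransactItems: [
      {
        Put: {
          TableName: tableName,
          // Deterministic key: every request for this pair targets the same item.
          Item: { PK: `EVENT#${eventId}`, SK: `REGCLAIM#${userId}`, registrationId },
          // Fails if an item with this PK/SK already exists.
          ConditionExpression: "attribute_not_exists(PK)",
        },
      },
      {
        Put: {
          TableName: tableName,
          Item: { PK: `USER#${userId}`, SK: `REGISTRATION#${registrationId}`, eventId },
        },
      },
      {
        Put: {
          TableName: tableName,
          Item: { PK: `EVENT#${eventId}`, SK: `REGISTRATION#${registrationId}`, userId },
        },
      },
    ],
  };
}

// True when an error means "someone already holds the claim" rather than a real failure.
function isDuplicateClaim(err: unknown): boolean {
  const e = err as { name?: string; CancellationReasons?: { Code?: string }[] } | null;
  if (!e) return false;
  if (e.name === "ConditionalCheckFailedException") return true; // standalone put
  return (
    e.name === "TransactionCanceledException" &&
    (e.CancellationReasons ?? []).some((r) => r.Code === "ConditionalCheckFailed")
  );
}

const input = buildRegistrationTransaction("tems-table", {
  eventId: "evt-1",
  userId: "user-1",
  registrationId: "reg-abc",
});
console.log(JSON.stringify(input, null, 2));
```

In a handler, the catch branch for isDuplicateClaim(err) would read the existing claim item and return its registrationId, so the second tap gets the same registration back instead of an error.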

There’s also a more turnkey option: the Idempotency utility in Powertools for AWS Lambda, which wraps a handler with makeIdempotent() and a DynamoDB-backed persistence layer, handling the idempotency-record lifecycle, TTLs, and detection of requests that are still in progress for you. Making idempotencyKey a required argument (instead of optional, as it is today at register.ts:31 and in the GraphQL schema's registerForEvent(eventId: ID!, idempotencyKey: String)) is part of this too: an idempotency mechanism that callers can opt out of by simply not passing a key isn't really a guarantee, it's a suggestion.

I’m walking through both approaches (deterministic conditional writes and the Powertools utility, with before/after code and a way to actually test for the race condition) in the companion piece, coming soon: Stop Relying on EventBridge to Dedupe Your Events: A Practical Guide to Idempotent DynamoDB Writes.

What I’d tell you if you’re about to run this audit on your own project

A few things I’d do differently next time, and would suggest if you try this on your own codebase:

- Ask for citations, not opinions. “Is this secure?” gets you an essay. “Show me every IAM policy with a Resource: '*' and the line number” gets you a punch list. The second one is useful at 11pm before a launch.
- Separate “as described” from “as built” from “current best practice,” explicitly. A surprising amount of value in this audit came from the gap between my docs and my code, not from either one alone. My own completion report claimed 5 backend services; the code has 8. That’s not a security issue, but it’s a signal that the docs were written from memory, not from the repo, which made me trust the other claims in those docs less, including the EventBridge note.
- Verify anything time-sensitive against current docs, especially for fast-moving services. AWS behavior and tooling change. The Lambda Powertools Idempotency utility I’m using in the next article didn’t always look like it does now. An audit’s value decays; treat findings as a snapshot, not gospel, and re-check the load-bearing ones.
- The boring findings are often connected to the interesting ones. I almost filed the EventBridge note under “fix the comment, move on.” It turned out to be load-bearing for a much bigger miss. If an audit turns up something that smells like “minor docs inaccuracy,” ask what decisions were made on the assumption that the inaccurate thing was true.
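The punch-list query from the first tip is also easy to check by hand, which is part of what makes it a good audit question. A minimal sketch, assuming policies are written inline in YAML or JSON with the wildcard on the same line as the Resource key; the findWildcardResources name, the sample policy, and the tems-assets bucket are made up for illustration:

```typescript
interface Finding {
  file: string;
  line: number; // 1-based, so it can be pasted into an editor's go-to-line
  text: string;
}

// Matches lines such as `Resource: '*'`, `Resource: "*"`, or `"Resource": "*",`.
// It does not catch list form (`Resource:` followed by `- '*'` on the next line).
const WILDCARD_RESOURCE = /["']?Resource["']?\s*:\s*["']?\*["']?\s*,?\s*$/;

function findWildcardResources(file: string, contents: string): Finding[] {
  const findings: Finding[] = [];
  const lines = contents.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    if (WILDCARD_RESOURCE.test(lines[i])) {
      findings.push({ file, line: i + 1, text: lines[i].trim() });
    }
  }
  return findings;
}

// Hypothetical serverless.yml fragment.
const sample = [
  "iamRoleStatements:",
  "  - Effect: Allow",
  "    Action: dynamodb:PutItem",
  "    Resource: '*'",
  "  - Effect: Allow",
  "    Action: s3:GetObject",
  "    Resource: arn:aws:s3:::tems-assets/*",
].join("\n");

for (const f of findWildcardResources("serverless.yml", sample)) {
  console.log(`${f.file}:${f.line}  ${f.text}`);
}
```

On the sample, only the fourth line is reported; the scoped S3 ARN ending in /* is left alone. A finding from any auditor, human or AI, that cannot be reduced to a file and line like this is the kind worth discarding.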

TEMS is still “production-ready” in the sense that matters most: it’s a real system, serving a real use case, and I know exactly what’s wrong with it and how to fix it — which is a state most production systems never quite reach. The completion report needed an asterisk, not a rewrite. But I’m glad I asked the question before someone’s registration silently doubled during finals week.

TEMS is a serverless event management platform built on AWS Lambda, AppSync, DynamoDB, EventBridge, and Cognito. This is the first in a short series on what a structured AI architecture audit found in a “finished” project — next up, the idempotency fix in detail, and a look at how the same FIFO waitlist that handles oversold events is built entirely on DynamoDB.

A Note of Appreciation

This project, and this write-up, wouldn’t exist without the people who supported it along the way.

To Dr. Tony D. B. — thank you for your guidance, your patience, and for holding the work to a standard that made it worth doing properly. Your feedback shaped not just the system, but how I think about building one.

To Rohin Vaidya, Sanskar Vidyarthi, Syed Muhammad Affan and Vishal Patil — thank you for your contributions, your collaboration, and for showing up consistently through every phase of this build. The best parts of TEMS reflect what we figured out together.

I’m grateful to all of you for the time, thoughtfulness, and care you brought to this. It meant more than I can neatly put into a completion report.

Git Repo: https://github.com/shantanu-gonade/terrapin-events