Srinivasa Rao

Posted on Jun 28

Fixing a Double-Send Bug Taught Me Idempotency Keys Aren't Enough

#python #fastapi #backend #webdev

A reader comment on my last article did more code review than I expected from a Dev.to comment section.

After I published the multi-tenant email piece — Resend + FastAPI, one backend, per-customer domains — Alex Shev left this:

"Multi-tenant email gets tricky because the hard part is not sending one message; it is preserving tenant boundaries across templates, sender identity, suppression lists, audit logs, and retries. The docs usually cover the happy-path API call. The production system is all the isolation around it."

Five gaps in one comment. I went looking for the smallest one to fix first (retries), assuming it'd be a quick follow-up. It wasn't. Tracing it properly surfaced two more bugs I had written into the code without noticing, and one habit I'm changing going forward: when I add a safety mechanism, I have to prove it actually fires before I get to call it done.

This is that trace, in the order I actually found things — not cleaned up into a tutorial.

The bug retries actually have

The original /send endpoint batches 100 emails per Resend API call and loops until every contact is covered. If you have 250 subscribers, that's 3 calls. If call #2 throws — rate limit, network blip, Resend having a bad five minutes — the loop stops. The first 100 are sent. The last 150 are not. The endpoint returns a 502.

Nothing about that response tells the caller which 100 went out. The obvious move — retry the request — re-sends to all 250, because the retry has no memory of the first attempt. The 100 who already got the email get it twice.
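Here's a minimal sketch of that failure, with no network involved. FlakyProvider and send_campaign are hypothetical stand-ins for illustration, not the service's real code; the only things taken from the article are the batch size of 100, the 250 contacts, and the failure on call #2.

```python
from collections import Counter


class FlakyProvider:
    """Hypothetical stand-in for the email API: records deliveries and
    raises on the call numbers it is told to fail on."""

    def __init__(self, fail_on_calls):
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0
        self.delivered = []

    def send_batch(self, recipients):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise RuntimeError("simulated rate limit")
        self.delivered.extend(recipients)


def send_campaign(provider, contacts, batch_size=100):
    # Same shape as the endpoint's loop: chunk, send, stop on the first error.
    for i in range(0, len(contacts), batch_size):
        provider.send_batch(contacts[i:i + batch_size])


contacts = [f"user{n}@example.com" for n in range(250)]
provider = FlakyProvider(fail_on_calls={2})

try:
    send_campaign(provider, contacts)   # call 1 succeeds, call 2 raises
except RuntimeError:
    pass                                # the endpoint would return a 502 here

send_campaign(provider, contacts)       # naive retry: starts again at contact 0

counts = Counter(provider.delivered)
duplicates = [addr for addr, n in counts.items() if n > 1]
print(len(provider.delivered), len(duplicates))  # 350 100
```

250 people, 350 deliveries: the first chunk went out twice, and nothing in the retry could have known.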

That's the gap Alex's comment was naming. The fix everyone reaches for is idempotency keys. I reached for them too. They were necessary. They were not sufficient, and the way they weren't sufficient is the actual point of this article.

Attempt 1: an Idempotency-Key header

Resend supports an Idempotency-Key header on both /emails and /emails/batch — send the same key with the same payload twice, get the same response back, no second email. I wired that into send_single and send_batch:

# app/services/resend_service.py
def _auth_headers(idempotency_key: str = "") -> dict:
    headers = {
        "Authorization": f"Bearer {resend.api_key}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key
    return headers

The detail that matters more than the header itself: where the key comes from. My first instinct was to generate one server-side per request — uuid4(), done. That's wrong, and it's wrong in a way that's easy to miss because the code runs fine and the tests pass. A server-generated key changes every time the server creates a new request object, which includes every retry. The client resending the exact same logical request gets a fresh key each time, and Resend — correctly, by design — treats a new key as a new send. You've built idempotency infrastructure that's idempotent against nothing, because the one thing that's supposed to stay constant across a retry is the one thing you're regenerating on every attempt.
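To make the difference concrete, here's a toy sketch. DedupingProvider is a hypothetical in-memory stand-in that dedups on a key the way the paragraph above describes; it is not Resend's client, and the two handler functions are illustrative names only.

```python
import uuid


class DedupingProvider:
    """Hypothetical provider that returns the stored response for a repeated key."""

    def __init__(self):
        self.seen = {}
        self.deliveries = 0

    def send(self, key, payload):
        if key in self.seen:
            return self.seen[key]        # same key: no second delivery
        self.deliveries += 1
        response = {"id": f"email-{self.deliveries}"}
        self.seen[key] = response
        return response


def handle_with_server_key(provider, payload):
    # Wrong: a fresh key per attempt, so every retry looks like a new send.
    return provider.send(str(uuid.uuid4()), payload)


def handle_with_caller_key(provider, payload, idempotency_key):
    # Right: the caller's key identifies the request, not the attempt.
    return provider.send(idempotency_key, payload)


payload = {"to": "a@example.com", "subject": "Hi"}

server_side = DedupingProvider()
handle_with_server_key(server_side, payload)
handle_with_server_key(server_side, payload)      # the client's retry

caller_side = DedupingProvider()
key = str(uuid.uuid4())                           # generated once by the client
first = handle_with_caller_key(caller_side, payload, key)
second = handle_with_caller_key(caller_side, payload, key)

print(server_side.deliveries, caller_side.deliveries)  # 2 1
```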

The key has to come from the caller, survive across their retry, and identify the request, not the attempt:

# app/routers/email.py
async def send_email(
    customer_id: str,
    body: SendEmailRequest,
    idempotency_key: Optional[str] = Header(None, alias="Idempotency-Key"),
    db: AsyncSession = Depends(get_db)
):
    ...
    campaign_id = idempotency_key or new_id()

    existing_result = await db.execute(select(Campaign).where(Campaign.id == campaign_id))
    existing_campaign = existing_result.scalar_one_or_none()
    if existing_campaign:
        if existing_campaign.status == "sent":
            return { "success": True, "campaign_id": existing_campaign.id, ... }
        raise HTTPException(status_code=409, detail={...})

Same Idempotency-Key, same campaign — if it already finished, return the cached result instead of sending again. If it's still mid-flight or only partially done, 409 instead of a silent re-send. That part worked exactly as intended once I tested it against the real API: same key plus same payload returns the identical response on both /emails and /emails/batch, no duplicate delivery. Different payload under the same key gets a 409 invalid_idempotent_request from Resend itself.

But notice what the 409 branch does for a campaign stuck in "partial": it tells the caller there's unfinished work and stops. It does not tell them how to finish it. That gap is the next bug.

Attempt 2: the 409 was pointing at a worse bug than the one it caught

My first version of that error message read:

"A campaign with this Idempotency-Key already exists but did not finish. Start a new campaign with a different Idempotency-Key to retry."

Read that literally and follow the advice: a new key means a new campaign, which means the send loop runs again from contact #1 — including the ones who already got the email in the failed attempt. The error message I wrote specifically to prevent a double-send was itself instructing the caller into one. I'd fixed the retry path and left the recovery path wide open.

Fixing it properly meant tracking who was actually emailed, not just how many. I added a CampaignRecipient row per contact per chunk, written the moment Resend's batch call returns an ID for them:

# app/models.py
class CampaignRecipient(Base):
    """One row per contact per campaign. Populated when the batch call returns
    email IDs, then updated by the webhook handler as delivery events arrive."""
    __tablename__ = "campaign_recipients"

    id = Column(String, primary_key=True, default=new_id)
    campaign_id = Column(String, ForeignKey("campaigns.id", ondelete="CASCADE"), nullable=False)
    contact_id = Column(String, ForeignKey("contacts.id", ondelete="SET NULL"), nullable=True)
    resend_email_id = Column(String(255), nullable=True, index=True)
    status = Column(String(50), default="queued")
    updated_at = Column(DateTime, default=datetime.utcnow)

Then a resume_from_campaign_id field that excludes anyone already in that table for the failed campaign:

# app/routers/email.py
if body.resume_from_campaign_id:
    resume_campaign = ...  # fetch + verify it belongs to this customer
    if resume_campaign.status != "partial":
        raise HTTPException(status_code=400, detail=(
            f"Cannot resume a campaign with status '{resume_campaign.status}'. "
            "Only 'partial' campaigns can be resumed."
        ))
    already_sent_subq = select(CampaignRecipient.contact_id).where(
        and_(
            CampaignRecipient.campaign_id == body.resume_from_campaign_id,
            CampaignRecipient.contact_id.isnot(None),
        )
    )
    query = query.where(Contact.id.notin_(already_sent_subq))

Two details in that snippet earned their place the hard way, not by design upfront.

The .isnot(None) line. CampaignRecipient.contact_id is nullable: the foreign key uses ondelete="SET NULL", so it becomes NULL if the contact is deleted later. SQL's NOT IN uses three-valued logic: if the subquery returns even one NULL, then for any contact that isn't explicitly in the list the comparison evaluates to unknown rather than true, and WHERE drops the row. Contacts that are in the list evaluate to false and are dropped too. Without that filter, one deleted contact silently turns "exclude everyone already emailed" into "match no one," and resume would have sent to zero people for reasons that wouldn't show up in any test using a normal dataset. It only shows up once a contact is deleted between the failed send and the resume, which is exactly the kind of thing that doesn't happen in a demo and does happen eventually in production.

The != "partial" check. I originally allowed resuming anything that wasn't "sent" — which included "sending". That's a race: resume an in-flight campaign and you now have two processes building exclusion lists against a table that's still being written to mid-send. Restricting resume to "partial" only closes that window.

Both came from asking "what's the actual claim this code is making, and what has to be true for the claim to hold" rather than "does this pass the test I already wrote." The NULL trap especially wouldn't show up under the conditions most people test with.
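The NULL trap can be reproduced in isolation. This sketch uses the standard library's sqlite3 with a stripped-down, hypothetical version of the two tables (the real service goes through SQLAlchemy, and its database may differ); the NOT IN behavior shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id TEXT PRIMARY KEY);
    CREATE TABLE campaign_recipients (campaign_id TEXT, contact_id TEXT);
    INSERT INTO contacts VALUES ('c1'), ('c2'), ('c3');
    -- c1 was emailed; another emailed contact was deleted later (contact_id is NULL)
    INSERT INTO campaign_recipients VALUES ('camp1', 'c1'), ('camp1', NULL);
""")

# Without the IS NOT NULL filter: the NULL in the subquery poisons NOT IN.
unfiltered = conn.execute("""
    SELECT id FROM contacts
    WHERE id NOT IN (
        SELECT contact_id FROM campaign_recipients WHERE campaign_id = 'camp1'
    )
    ORDER BY id
""").fetchall()

# With the filter: only contacts that were actually emailed are excluded.
filtered = conn.execute("""
    SELECT id FROM contacts
    WHERE id NOT IN (
        SELECT contact_id FROM campaign_recipients
        WHERE campaign_id = 'camp1' AND contact_id IS NOT NULL
    )
    ORDER BY id
""").fetchall()

print(unfiltered)  # []
print(filtered)    # [('c2',), ('c3',)]
```

The unfiltered query returns nothing at all, so a resume built on it would quietly send to nobody.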

Lesson 1: an idempotency key at the wrong layer is theater

Here's the one I didn't expect. Each chunk inside the send loop gets its own key:

for i in range(0, len(recipients), 100):
    chunk_key = f"{campaign_id}/batch-{i // 100}"
    chunk_results = await send_batch(..., idempotency_key=chunk_key)

The idea: if the chunk-level call to Resend gets retried, Resend's own dedup catches it independently of anything my database is doing. Defense in depth. It reads like good defensive engineering, and I believed it was protecting something right up until I traced what actually calls this code with a repeated chunk_key.

Nothing does.

The top-level check at the start of the endpoint short-circuits on campaign_id before the loop is ever reached a second time — a retry with the same Idempotency-Key either returns the cached "sent" result or a 409, full stop. resume_from_campaign_id, the other path that re-enters the loop, always builds a new campaign_id and therefore new chunk keys, every time. There is no code path, anywhere in this service, that calls send_batch twice with the same chunk_key. Resend's per-chunk dedup is sitting behind a door nothing in my codebase ever knocks on.

It's not actively harmful. It's just inert. But inert safety code is worse than no code, because it occupies the mental slot of "this is handled" without doing anything. If a future change to the resume logic accidentally did start reusing a campaign_id, I'd reach for "the chunk keys protect us there" and be wrong, because the lack of replay was never actually guaranteed by the chunk keys. It was guaranteed by the resume logic always minting fresh ones, which is a property of completely different code.

The lesson I'm taking into the next thing I build: an idempotency key only protects you at the layer where a retry can actually recur. If you can't point to the specific code path that would call the same operation twice with the same key, you haven't added protection — you've added a key that looks like protection. Worth tracing before you trust it, not after.
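One way to do that tracing on paper: derive the chunk keys each path would produce and check whether any of them ever repeat. The chunk_keys helper below is hypothetical; it only mirrors the f-string from the loop above, and the campaign IDs are made-up values.

```python
def chunk_keys(campaign_id: str, recipient_count: int, batch_size: int = 100) -> list[str]:
    """Keys the send loop would generate for one campaign."""
    return [
        f"{campaign_id}/batch-{i // batch_size}"
        for i in range(0, recipient_count, batch_size)
    ]


# First attempt: 250 recipients. The resume gets a new campaign_id and
# covers only the 150 contacts that were never reached.
original = chunk_keys("camp-original", 250)
resumed = chunk_keys("camp-resume", 150)

print(original)
# ['camp-original/batch-0', 'camp-original/batch-1', 'camp-original/batch-2']
print(set(original) & set(resumed))  # set()
```

The intersection is empty, which is the whole finding: no key is ever presented twice. The keys are also derived from the chunk index rather than the chunk's contents, so if a campaign_id were reused on resume, batch-0 would carry different recipients under an old key. Going by the 409 behavior described earlier for a different payload under the same key, that would likely be rejected rather than deduplicated.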

Lesson 2: a forensic log is only real once you've checked it's emitting

The chunk-key finding leaves a real gap, not a cosmetic one: if a chunk's httpx call to Resend times out after Resend has already accepted the batch and started sending it, the exception handler marks the campaign "partial", but no CampaignRecipient rows exist for that chunk, because the response carrying the email IDs never arrived. Resume can't exclude contacts it has no record of. Building the full fix (a CampaignChunk table tracking attempt status independent of the response) felt like more machinery than this risk currently justifies, so I made a deliberate call to document it as a known v1 limitation instead of building it. More on why below.

As a cheap mitigation in the meantime, I added a log line before each chunk dispatch, specifically so a human could reconstruct what happened if a duplicate complaint ever surfaced:

logger.info(
    "send_batch attempt campaign_id=%s chunk_key=%s contact_ids=%s",
    campaign_id, chunk_key, [c.id for c in chunk_contacts],
)

I logged the contact IDs, not just a count — a count can't be turned back into "which specific people" later, especially after the contact table has drifted since the incident. IDs can.

Then I checked whether the line was actually showing up anywhere, mostly out of habit. It wasn't. Nowhere in the project did anything call logging.basicConfig() or set a level — which means Python's root logger was sitting at its default, WARNING, and silently swallowing every single logger.info() call in the file, and in the webhook handler too. The mitigation I'd built specifically to cover the one failure window I couldn't close in code was never running. Adding logging.basicConfig(level=logging.INFO, ...) to app/main.py and watching the output actually appear in stdout was the fix — one line, after the fact.

This is the same shape of mistake as the chunk-key one: code that's structurally correct and semantically inert. The difference is that this one would have been invisible right up until the moment I actually needed the log — which is the worst possible time to discover a safety net was never connected. The same incident that needs the forensic trail is the same incident that would have revealed it wasn't there. I'm treating "did I confirm this actually executes" as a separate, mandatory step now, distinct from "did I write code that should execute" — they're not the same claim, and only one of them is checked by reading the diff.
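The silent drop is easy to reproduce with the standard library alone. In this sketch the function name and the "app.email" logger name are made up for illustration; it resets the root logger to the state of a fresh interpreter (level WARNING), logs one INFO line, and reports whether the line reached a stream. A handler is attached in both cases so that the level is the only variable.

```python
import io
import logging


def info_line_emitted(configure_root: bool) -> bool:
    """Log one INFO line and report whether it reached the output stream."""
    root = logging.getLogger()
    saved_level, saved_handlers = root.level, root.handlers[:]
    buf = io.StringIO()
    try:
        # Recreate the unconfigured state: no handlers, root level WARNING.
        root.handlers.clear()
        root.setLevel(logging.WARNING)
        if configure_root:
            # The one-line fix: installs a handler and lowers the level to INFO.
            logging.basicConfig(level=logging.INFO, stream=buf)
        else:
            # A handler exists, but the level is untouched, so INFO is filtered.
            root.addHandler(logging.StreamHandler(buf))
        logging.getLogger("app.email").info(
            "send_batch attempt campaign_id=%s", "camp-1"
        )
        return "send_batch attempt" in buf.getvalue()
    finally:
        # Put the process-wide logging state back the way it was.
        root.handlers[:] = saved_handlers
        root.setLevel(saved_level)


print(info_line_emitted(configure_root=False))  # False
print(info_line_emitted(configure_root=True))   # True
```

A check like this, run once against the real app's startup path, is the "did I watch it fire" step in miniature.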

What's still open, on purpose

A couple of things I'm leaving unfixed, with the reasoning written down rather than left implicit:

The timeout-after-accept window. Closing it properly means a CampaignChunk table — one row per chunk attempt, written before the Resend call so an attempt is on record even if the response never comes back, with its own status independent of whether CampaignRecipient rows exist yet. That's a real schema change, not a one-line fix, and it's solving for a narrow race (network failure in the few-hundred-millisecond window between Resend accepting and the response reaching the client) that hasn't actually happened yet in this project. Documenting the gap and the mitigation felt like the more honest call than guessing at a design for a failure mode I haven't observed. Chasing it properly would be its own investigation.
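For the record, the core idea is small even if the schema change isn't. The sketch below is purely illustrative and in-memory: ChunkAttempt, send_chunk, and unresolved_contacts are hypothetical names, a dict stands in for the table, and this is a guess at the shape, not a design the project has committed to. The only property it shows is "write the attempt before the call".

```python
from dataclasses import dataclass, field


@dataclass
class ChunkAttempt:
    chunk_key: str
    contact_ids: list
    status: str = "attempted"              # recorded before any network I/O
    email_ids: list = field(default_factory=list)


def send_chunk(ledger, chunk_key, contact_ids, provider_call):
    # The attempt is on record even if the response never comes back.
    ledger[chunk_key] = ChunkAttempt(chunk_key, list(contact_ids))
    email_ids = provider_call(contact_ids)  # may raise after the provider accepted
    ledger[chunk_key].email_ids = email_ids
    ledger[chunk_key].status = "confirmed"


def unresolved_contacts(ledger):
    """Contacts in chunks that were attempted but never confirmed."""
    return {
        cid
        for attempt in ledger.values()
        if attempt.status == "attempted"
        for cid in attempt.contact_ids
    }


def ok_call(contact_ids):
    return [f"email-{cid}" for cid in contact_ids]


def lost_response(contact_ids):
    raise TimeoutError("response lost after the provider accepted the batch")


ledger = {}
send_chunk(ledger, "camp1/batch-0", ["c1", "c2"], ok_call)
try:
    send_chunk(ledger, "camp1/batch-1", ["c3", "c4"], lost_response)
except TimeoutError:
    pass

print(sorted(unresolved_contacts(ledger)))  # ['c3', 'c4']
```

What a resume should do with the unresolved set (skip them, or hold them for a human to check) is a policy question this sketch deliberately leaves open.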

Concurrent duplicate keys. Two requests with the same Idempotency-Key arriving close enough together can both pass the "no existing campaign" check before either one commits a row. Low probability for how this service is actually called, but unhandled — it would currently surface as a raw 500 instead of a clean 409. Noted, not fixed yet.
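A rough sketch of the race and of one common way to turn it into a clean 409: let the primary key constraint be the arbiter and catch the violation. This uses sqlite3 and a hypothetical two-column campaigns table to stay self-contained; in the real service the equivalent would presumably be catching SQLAlchemy's IntegrityError around the insert, which is an assumption about code not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (id TEXT PRIMARY KEY, status TEXT)")


def campaign_exists(campaign_id):
    row = conn.execute(
        "SELECT 1 FROM campaigns WHERE id = ?", (campaign_id,)
    ).fetchone()
    return row is not None


def create_campaign(campaign_id):
    """Return an HTTP-style status: 201 on insert, 409 if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO campaigns VALUES (?, 'sending')", (campaign_id,)
            )
    except sqlite3.IntegrityError:
        return 409
    return 201


key = "client-key-123"

# Interleaving that the check-then-insert pattern allows:
# both requests run the existence check before either one inserts.
a_saw_existing = campaign_exists(key)
b_saw_existing = campaign_exists(key)
a_status = create_campaign(key)
b_status = create_campaign(key)

print(a_saw_existing, b_saw_existing, a_status, b_status)  # False False 201 409
```

Both checks pass, but only one insert can win; the loser gets a deliberate 409 instead of an unhandled exception.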

The actual takeaway

Idempotency keys are the right tool. They're also not a single fix you add once — they're a claim about a specific code path, and the claim is only true if you can trace the exact retry that would hit it. The header on /emails and /emails/batch protects the layer where Resend itself might see a duplicate call. The campaign_id check protects the layer where a client retries the whole request. Neither one automatically covers the layer in between — per-chunk attempts — and I only found that out by asking "what specifically calls this twice" instead of "does this look right."

Same with the logging fix: a safety net you haven't watched fire isn't a safety net, it's an assumption with good intentions.

Code's on GitHub if you want to see the rest: github.com/srinivaspavuluri/resend-fastapi

That comment named five gaps. Retries are the one I've actually closed the loop on here. The remaining four (templates, sender identity, suppression lists, and audit logs), plus the timeout-after-accept window the chunk-key finding left open, are candidates for what comes next, in some order. Sender identity isolation is the one I understand the least; if anyone has a strong opinion on it, I'd like to hear it before I try to write about it.

Found a hole in this one too? That's kind of how the last one went, and it made the article better. Comments are open.
