Renaldi for AWS Community Builders

Idempotency Architecture for Lambda-Driven Systems on AWS

Duplicate processing is one of those problems that looks small in a diagram and becomes very expensive in production.

I have seen teams build clean event-driven and Lambda-based systems, only to run into duplicate charges, duplicated emails, repeated downstream writes, or inconsistent state once retries and redrives start happening. The tricky part is that the system is often behaving as designed. AWS services are doing what they should do: retrying, buffering, redriving, and favoring delivery durability.

This is exactly why I consider idempotency architecture one of the most important and most underexplained topics in serverless engineering.

In this post, I will walk through how I design idempotency for Lambda-driven systems on AWS, including:

  • the exactly-once myth vs the at-least-once reality
  • how to choose idempotency keys and dedupe windows
  • a DynamoDB-based idempotency store design
  • using AWS Lambda Powertools idempotency utility
  • handling retries from API Gateway, SQS, and EventBridge
  • an end-to-end implementation walkthrough with code

I will keep this practical and architecture-heavy, so you can adapt it to real workloads instead of only toy examples.


The core idea in one sentence

Idempotency means I can safely process the same logical request more than once and still end up with the same intended outcome.

That does not mean the system only receives the request once. It means my system is resilient when it receives it multiple times.


Exactly-once is usually not the right mental model

A lot of production mistakes start with the assumption that a serverless flow will process each request exactly once.

In practice, in distributed systems, what I usually get is:

  • at-least-once delivery (common with queues/events)
  • retries at multiple layers (client SDKs, AWS services, Lambda, Step Functions, etc.)
  • timeouts and ambiguous outcomes (did the function finish but the caller timed out?)
  • redrives / replay (DLQ reprocessing, archive replay, manual reruns)
  • duplicate submissions from clients (double-click, refresh, mobile reconnect)

So instead of trying to force an “exactly-once” guarantee everywhere, I design for:

  1. at-least-once delivery
  2. idempotent handlers
  3. safe retries
  4. observability around duplicates

That mental shift makes the architecture much more robust.


Where duplicates come from in AWS Lambda-driven systems

Before I show the solution, I like to make the duplicate paths explicit.

API Gateway -> Lambda

Duplicates can happen when:

  • the client retries after a timeout
  • the network drops after the backend succeeded
  • the user taps “Submit” multiple times
  • an upstream reverse proxy retries

SQS -> Lambda

Duplicates can happen when:

  • Lambda fails and the message becomes visible again
  • processing exceeds visibility timeout
  • partial batch failures cause some records to be retried
  • DLQ redrive sends records back later
  • standard queues deliver the same message more than once

EventBridge -> Lambda

Duplicates can happen when:

  • target invocation is retried
  • a publisher emits semantically duplicate events
  • archive/replay is used
  • consumers reprocess historical events intentionally

That is why I architect idempotency at the business operation level, not just at one transport layer.


What a good idempotency architecture looks like

At a high level, I want:

  • a stable idempotency key for each logical operation
  • a dedupe window (TTL) appropriate for the business
  • a persistence store to track processing status and cached results
  • conditional writes to prevent concurrent duplicate execution
  • response replay for safe duplicate requests when appropriate
  • clear behavior for mismatched payloads using the same key

Architecture at a glance


End-to-end walkthrough (the scenario I will implement)

To make this concrete, I will use a common example:

“Create payment intent / order processing” style operation.

Flow

  1. A client sends POST /payments with an Idempotency-Key header.
  2. API Gateway invokes Lambda.
  3. Lambda checks DynamoDB idempotency table.
  4. If the key is new, Lambda acquires an IN_PROGRESS lock and processes the request.
  5. Lambda writes the business result (for example, a payment record) and stores the response in the idempotency table with status COMPLETED.
  6. If the same request is retried, Lambda returns the cached response instead of processing again.

Then I will extend the same pattern to:

  • SQS worker retries
  • EventBridge consumer retries/replay

Designing the idempotency key

This is where a lot of teams accidentally introduce bugs.

A good idempotency key should identify the logical operation, not just the transport envelope.

Good key examples

  • payment:{merchant_id}:{client_request_id}
  • order-create:{tenant_id}:{cart_checkout_id}
  • invoice-email:{invoice_id}:{template_version}
  • event:{event_id} (if the publisher guarantees a stable event ID)

Risky key choices

  • raw timestamp
  • Lambda aws_request_id (changes every invocation)
  • SQS receiptHandle (changes on delivery)
  • entire payload serialized without normalization (field order / formatting issues)
  • keys that are too broad (cause false dedupe)
  • keys that are too narrow (miss duplicates)

My rule of thumb

I choose a key from business identity + operation intent, and I define it explicitly in the contract.

For APIs, that often means:

  • require an Idempotency-Key header from the client, and
  • validate that it maps to a stable request intent.

For asynchronous consumers, that often means:

  • use the publisher’s stable eventId, or
  • derive a deterministic business key (for example order_id + action).
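For the async case, the derivation can be a small pure helper. This is a hypothetical sketch (the function name and field layout are my own); the point is that sorting the identity fields makes the key stable no matter how the call site passes them:

```python
def derive_idempotency_key(operation: str, **identity: str) -> str:
    # Sort identity fields so the key is deterministic regardless of
    # the order the caller supplies them in.
    parts = [f"{k}={identity[k]}" for k in sorted(identity)]
    return f"{operation}:" + ":".join(parts)
```

For example, `derive_idempotency_key("order-paid", orderId="o-1", action="capture")` always yields the same key as the reversed argument order.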

Dedupe windows (TTL): how long should I remember a key?

There is no universal value. I set the dedupe window based on business and retry patterns.

What affects the dedupe window

  • expected client retry duration
  • SQS redrive timing and replay operations
  • EventBridge replay windows / operational reruns
  • whether duplicates after long periods are still harmful
  • cost of storing idempotency records

Practical examples

  • API payment creation: 24 hours to 7 days
  • Webhook ingestion: 1 to 7 days (depends on provider retry policy)
  • Batch event processing: hours to days
  • high-volume telemetry: maybe minutes to hours (if duplicate impact is low)

Important nuance

DynamoDB applies TTL eventually; items are not deleted at the exact expiry time. I design my logic so:

  • expiry_timestamp is authoritative in code, and
  • TTL is the cleanup mechanism.

In other words, I do not depend on the item disappearing exactly at expiry time.
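In code, that check is tiny. A sketch using the conceptual attribute name from the schema below (`expiryTimestamp`, epoch seconds); the helper name is my own:

```python
import time
from typing import Optional

def is_expired(record: dict, now: Optional[int] = None) -> bool:
    # The stored expiry timestamp is authoritative; DynamoDB TTL is only
    # the eventual cleanup mechanism and may lag behind it by hours.
    now = int(time.time()) if now is None else now
    return now >= int(record["expiryTimestamp"])
```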


DynamoDB-based idempotency store design (recommended pattern)

I prefer DynamoDB for idempotency state in Lambda workloads because it gives me:

  • low-latency key lookups
  • conditional writes
  • TTL support
  • simple scaling
  • good fit for stateless Lambda functions

Table design (single-purpose table)

A dedicated table keeps the pattern easy to reason about.

Partition key

  • id (string): the idempotency key

Recommended attributes

  • status (IN_PROGRESS, COMPLETED, optionally EXPIRED)
  • expiryTimestamp (epoch seconds for dedupe window)
  • inProgressExpiryTimestamp (shorter lock expiry to recover from crashed executions)
  • payloadHash (optional but highly recommended)
  • responseData (optional, cached result or safe response envelope)
  • createdAt
  • updatedAt
  • source (api / sqs / eventbridge)
  • functionName (optional for ops visibility)

Why payloadHash matters

If a client reuses the same idempotency key with a different payload, I want to detect that and reject it. Otherwise I can accidentally return a cached response for the wrong request.

This is a subtle but critical best practice.
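A sketch of how I compute the hash over a normalized subset of the payload (hypothetical `payload_hash` helper; the field list is whatever defines request intent for the operation):

```python
import hashlib
import json

def payload_hash(payload: dict, fields: list) -> str:
    # Hash only the fields that define request intent, with sorted keys
    # and compact separators, so field order and whitespace differences
    # in the raw request body do not change the hash.
    subset = {f: payload[f] for f in sorted(fields)}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same intent produce the same hash even if the JSON arrives with different key order; a changed amount produces a different hash and should be rejected as a key-reuse conflict.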


State transitions I use

Here is the lifecycle I generally implement:

  1. No record exists
    • Try conditional write -> create IN_PROGRESS
  2. IN_PROGRESS exists
    • Another invocation is already processing (or crashed recently)
    • Return a retryable outcome or fail fast depending on source
  3. COMPLETED exists and not expired
    • Return cached result (or safe ack)
  4. Expired
    • Treat as a new request (depending on business policy)

That gives me concurrency safety and retry safety.
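The lifecycle above can be written as one small decision function. This is a hypothetical pure helper using the conceptual attribute names (`status`, `expiryTimestamp`, `inProgressExpiryTimestamp`); real implementations fold this into conditional writes:

```python
def decide(record, now: int) -> str:
    # record is None when no idempotency item exists yet.
    if record is None or now >= record["expiryTimestamp"]:
        return "PROCESS"            # new request, or dedupe window expired
    if record["status"] == "COMPLETED":
        return "RETURN_CACHED"      # duplicate of a finished request
    if record["status"] == "IN_PROGRESS":
        if now >= record.get("inProgressExpiryTimestamp", 0):
            return "RECLAIM_LOCK"   # previous execution likely crashed
        return "CONFLICT"           # concurrent duplicate in flight
    return "PROCESS"
```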


Infrastructure example (SAM / CloudFormation snippets)

Below is a minimal setup for:

  • Lambda function
  • DynamoDB idempotency table
  • IAM permissions
  • env vars for configuration
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Idempotent Lambda API example

Resources:
  PaymentsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.12
      Handler: app.lambda_handler
      CodeUri: src/
      MemorySize: 512
      Timeout: 29
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: !GetAtt IdempotencyTable.Arn
      Environment:
        Variables:
          IDEMPOTENCY_TABLE_NAME: !Ref IdempotencyTable
          IDEMPOTENCY_EXPIRES_SECONDS: "86400" # 24h
      Events:
        CreatePaymentApi:
          Type: Api
          Properties:
            Path: /payments
            Method: post

  IdempotencyTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: expiration
        Enabled: true
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true

Outputs:
  ApiUrl:
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/payments"

Notes on the table schema vs Powertools

AWS Lambda Powertools idempotency utility has its own default attribute names and record model. If I use the library, I usually let it manage the item shape and only customize when I truly need to.

That said, I still think through the conceptual schema above so the team understands what is being stored and why.


Implementation option 1 (recommended): AWS Lambda Powertools idempotency utility

If I am using Python Lambda functions, AWS Lambda Powertools is my default choice. It saves me from reimplementing concurrency locks, conditional checks, and record state transitions from scratch.

Install

pip install "aws-lambda-powertools[aws-sdk]"

API Gateway example with idempotency (Python)

This example assumes the client sends:

  • Idempotency-Key header
  • JSON body containing payment details

I use:

  • event_key_jmespath to extract the idempotency key from headers
  • payload_validation_jmespath to validate that key reuse with different payloads is detected
  • a DynamoDB persistence layer
import json
import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)
from aws_lambda_powertools.utilities.idempotency.exceptions import (
    IdempotencyValidationError,
)

logger = Logger(service="payments-api")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
EXPIRES_SECONDS = int(os.getenv("IDEMPOTENCY_EXPIRES_SECONDS", "86400"))

persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

config = IdempotencyConfig(
    # API Gateway/Lambda proxy event header lookup (normalize to lower-case in code if needed)
    event_key_jmespath='headers."idempotency-key"',
    # Payload fields that should remain consistent when reusing the same key
    payload_validation_jmespath="powertools_json(body).[customerId, amount, currency]",
    expires_after_seconds=EXPIRES_SECONDS,
    use_local_cache=True,
)

def create_payment_intent(request: Dict[str, Any]) -> Dict[str, Any]:
    # Replace this with your real business logic / external payment call.
    # The critical point is that this function is wrapped with idempotency.
    payment_id = f"pay_{request['customerId']}_{request['amount']}_{request['currency']}"
    return {
        "paymentId": payment_id,
        "status": "AUTHORIZED",
        "amount": request["amount"],
        "currency": request["currency"],
    }

@idempotent(config=config, persistence_store=persistence_layer)
def process_request(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    body = json.loads(event.get("body") or "{}")
    required = ["customerId", "amount", "currency"]
    missing = [k for k in required if k not in body]
    if missing:
        return {
            "statusCode": 400,
            "body": json.dumps({"message": f"Missing required fields: {missing}"})
        }

    result = create_payment_intent(body)

    # Powertools can persist and return this response for duplicate calls
    return {
        "statusCode": 200,
        "body": json.dumps(result),
        "headers": {"Content-Type": "application/json"},
    }

@logger.inject_lambda_context(log_event=False)
def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Normalize headers so event_key_jmespath is predictable
    headers = event.get("headers") or {}
    event["headers"] = {str(k).lower(): v for k, v in headers.items()}

    try:
        return process_request(event, context)
    except IdempotencyValidationError:
        return {
            "statusCode": 409,
            "body": json.dumps({
                "message": "Idempotency key was reused with a different request payload"
            }),
            "headers": {"Content-Type": "application/json"},
        }

Why this pattern works well

  • If the same request is retried, Powertools returns the stored response.
  • If the same key is reused with a different body, I return 409 Conflict.
  • I avoid duplicate payment authorization for simple retries.

API client contract (important and often skipped)

Idempotency works much better when the contract is explicit.

For API producers/clients, I document:

  • Idempotency-Key is required for mutating operations (POST, sometimes PATCH)
  • same key + same intent/payload -> safe retry
  • same key + different payload -> 409 Conflict
  • dedupe window (for example, 24h)
  • response replay behavior (cached result may be returned)

That avoids ambiguity across frontend, mobile, and backend teams.
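On the client side, the most common mistake is generating a fresh key per HTTP attempt. A minimal sketch of what I document for clients (hypothetical helper; the key format assumes the `order-create:{cart_checkout_id}` convention from earlier):

```python
def build_payment_request(cart_checkout_id: str, payload: dict) -> dict:
    # One key per LOGICAL operation (the checkout attempt), not per HTTP
    # attempt, so that client-side retries reuse the same key and are
    # deduped by the server instead of creating duplicate payments.
    key = f"order-create:{cart_checkout_id}"
    return {
        "method": "POST",
        "path": "/payments",
        "headers": {"Idempotency-Key": key, "Content-Type": "application/json"},
        "body": payload,
    }
```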


Handling retries from SQS (Lambda event source mapping)

SQS is one of the most common places where teams need idempotency but only discover it after duplicates happen.

What changes for SQS?

  • Lambda receives a batch of messages.
  • Some messages may succeed while others fail.
  • I should use partial batch response so only failed records are retried.
  • Each message should still be processed idempotently.

Key design for SQS consumers

I avoid using transient delivery metadata. I prefer:

  • a business key in the message body (for example orderId)
  • or an upstream event ID included in the message

Example key:

  • order-paid:{orderId}
  • inventory-reservation:{reservationId}

SQS consumer example (Powertools Batch + per-record idempotency)

Below is a simplified Python example using:

  • Powertools Batch utility for partial batch handling
  • Powertools idempotency on the per-record processing function
import json
import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.batch import BatchProcessor, EventType, process_partial_response
from aws_lambda_powertools.utilities.batch.types import PartialItemFailureResponse
from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent_function,
)

logger = Logger(service="orders-sqs-worker")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

# Here we build the key from the function argument "data"
# (Powertools hashes the configured subset to create an idempotency key)
idempotency_config = IdempotencyConfig(
    event_key_jmespath="orderId",
    payload_validation_jmespath="[orderId, action, version]",
    expires_after_seconds=3 * 24 * 60 * 60,  # 3 days
)

processor = BatchProcessor(event_type=EventType.SQS)

@idempotent_function(data_keyword_argument="data", config=idempotency_config, persistence_store=persistence_layer)
def process_order_event(*, data: Dict[str, Any]) -> Dict[str, Any]:
    # Business logic here (must be safe to retry / replay via idempotency)
    # Example: reserve inventory, update status, publish follow-up event, etc.
    logger.info("Processing order event", extra={"orderId": data["orderId"], "action": data["action"]})
    return {"ok": True, "orderId": data["orderId"], "action": data["action"]}

def record_handler(record: SQSRecord) -> Dict[str, Any]:
    payload = json.loads(record.body)
    return process_order_event(data=payload)

def lambda_handler(event, context) -> PartialItemFailureResponse:
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
    )

Best practices I apply with SQS + idempotency

  • Use partial batch response to avoid retrying already-successful records.
  • Set visibility timeout longer than worst-case processing time (or heartbeat/extend strategy).
  • Keep the idempotency key in the message payload, not delivery metadata.
  • Use a dedupe window long enough to cover retries, DLQ redrive, and operational replay.

Handling retries from EventBridge targets

EventBridge makes event-driven architecture clean, but duplicate-safe consumers are still my responsibility.

EventBridge-specific considerations

  • EventBridge may retry target delivery.
  • Archive/replay can intentionally re-send events.
  • Different publishers may emit semantically duplicate events unless the contract is strict.

Key strategy for EventBridge consumers

If the event has a stable id or domain event ID, I use it. If not, I derive one from:

  • detail-type
  • source/domain identifier
  • business entity ID
  • action/version

Example:

  • eventbridge:{source}:{detailType}:{detail.orderId}:{detail.version}

EventBridge consumer example (Python Lambda + Powertools idempotency)

import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

logger = Logger(service="eventbridge-consumer")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

config = IdempotencyConfig(
    # Prefer a producer-defined unique event ID if available in detail
    event_key_jmespath="detail.eventId || id",
    payload_validation_jmespath='[source, "detail-type", detail.orderId, detail.version]',
    expires_after_seconds=7 * 24 * 60 * 60,
)

@idempotent(config=config, persistence_store=persistence_layer)
def handle_event(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    detail = event["detail"]
    logger.info(
        "Handling EventBridge event",
        extra={"eventId": detail.get("eventId", event.get("id")), "orderId": detail.get("orderId")}
    )

    # Execute your domain logic here
    # Example: update read model, trigger notification, call downstream API, etc.
    return {"handled": True, "orderId": detail.get("orderId")}

def lambda_handler(event, context):
    return handle_event(event, context)

When I implement idempotency manually (without Powertools)

Powertools is my default, but sometimes I implement manually when:

  • I need a custom record schema shared across services/languages
  • I need custom conflict behavior beyond the library defaults
  • I am in a language/runtime where I am standardizing a platform abstraction
  • I want explicit control over lock and result update semantics

The key principle stays the same: use DynamoDB conditional writes.


Manual DynamoDB idempotency pattern (Python + boto3)

The pattern below shows the core idea:

  1. Try to write IN_PROGRESS with attribute_not_exists(id)
  2. If successful, execute business logic
  3. Update item to COMPLETED and store result
  4. If conditional check fails, inspect existing item and decide
import json
import time
import hashlib
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "IdempotencyTable"

def sha256_text(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def acquire_lock(idempotency_key: str, payload_hash: str, ttl_seconds: int = 86400, lock_seconds: int = 120):
    now = int(time.time())
    item = {
        "id": {"S": idempotency_key},
        "status": {"S": "IN_PROGRESS"},
        "payloadHash": {"S": payload_hash},
        "expiration": {"N": str(now + ttl_seconds)},
        "inProgressExpiryTimestamp": {"N": str(now + lock_seconds)},
        "createdAt": {"N": str(now)},
        "updatedAt": {"N": str(now)},
    }
    try:
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item=item,
            ConditionExpression="attribute_not_exists(id)"
        )
        return {"acquired": True}
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        return {"acquired": False}

def get_record(idempotency_key: str):
    resp = dynamodb.get_item(
        TableName=TABLE_NAME,
        Key={"id": {"S": idempotency_key}},
        ConsistentRead=True,
    )
    return resp.get("Item")

def mark_completed(idempotency_key: str, response_obj: dict):
    now = int(time.time())
    dynamodb.update_item(
        TableName=TABLE_NAME,
        Key={"id": {"S": idempotency_key}},
        UpdateExpression="SET #s = :completed, responseData = :resp, updatedAt = :now",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":completed": {"S": "COMPLETED"},
            ":resp": {"S": json.dumps(response_obj)},
            ":now": {"N": str(now)},
        },
    )

def handler(event, context):
    body = json.loads(event["body"])
    idem_key = event["headers"]["idempotency-key"]
    payload_hash = sha256_text(json.dumps({
        "customerId": body["customerId"],
        "amount": body["amount"],
        "currency": body["currency"],
    }, sort_keys=True))

    lock = acquire_lock(idem_key, payload_hash)
    if not lock["acquired"]:
        existing = get_record(idem_key)
        if not existing:
            # Rare race / eventual state issue; safe to retry
            raise Exception("Retry request")
        if existing.get("payloadHash", {}).get("S") != payload_hash:
            return {"statusCode": 409, "body": json.dumps({"message": "Idempotency key payload mismatch"})}
        status = existing["status"]["S"]
        if status == "COMPLETED":
            return {"statusCode": 200, "body": existing["responseData"]["S"]}
        if status == "IN_PROGRESS":
            # For API workflows you might return 409 or 425-style retry signal (implementation-specific)
            return {"statusCode": 409, "body": json.dumps({"message": "Request is already in progress"})}

    # Execute business logic after lock acquired
    result = {"paymentId": "pay_123", "status": "AUTHORIZED"}
    mark_completed(idem_key, result)
    return {"statusCode": 200, "body": json.dumps(result)}

Manual pattern caveats

If I go manual, I also need to think about:

  • expired IN_PROGRESS lock recovery
  • exception handling and safe cleanup
  • serialization of cached responses
  • metrics for duplicate hits vs fresh requests
  • consistent behavior across all event sources

That is exactly why Powertools is usually the better default.


End-to-end implementation discussion (how I wire this in production)

This is the part I care about most in architecture reviews: not just the code, but where idempotency sits in the overall system.

1) Idempotency belongs close to the handler boundary

I usually apply idempotency at the Lambda entry point (or record handler for batches), before business side effects occur.

Why:

  • it prevents duplicate external calls
  • it keeps the protection broad
  • it makes retries safe by default

2) I still design downstream writes carefully

Idempotency at the Lambda layer is great, but if the function can partially succeed before crashing, I also check downstream safety:

  • unique constraints in relational DBs
  • conditional writes in DynamoDB
  • provider-side idempotency support for payment APIs or webhooks

Think in layers, not in a single magic switch.
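For the DynamoDB case, the downstream layer is just another conditional write, keyed on the business identity this time. A sketch, assuming a payments table keyed on `paymentId` (the function, table shape, and injected client are my own illustration, not a library API):

```python
def put_payment_once(dynamodb, table: str, payment: dict) -> bool:
    """Write the business record only if it does not already exist.

    Returns True if this invocation created the record, False if a
    previous (duplicate) invocation already did. First writer wins.
    """
    try:
        dynamodb.put_item(
            TableName=table,
            Item={
                "paymentId": {"S": payment["paymentId"]},
                "amount": {"N": str(payment["amount"])},
                "status": {"S": "AUTHORIZED"},
            },
            ConditionExpression="attribute_not_exists(paymentId)",
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False
```

Even if the idempotency layer above it fails open for some reason, a duplicate invocation cannot create a second payment record.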

3) I define duplicate behavior per source

The response to a duplicate is not always the same.

  • API Gateway: return cached success response (best UX)
  • SQS: ack success for already-processed message, avoid poison-loop
  • EventBridge: safely no-op or return success after dedupe

4) I separate “idempotency key” and “correlation ID”

They are related but not identical.

  • Correlation ID -> tracing/observability
  • Idempotency key -> duplicate suppression for a specific operation

Sometimes they can be the same, but I do not assume that.


Handling edge cases and failure modes

Edge case 1: Same key, different payload

This should be treated as a contract violation.

Best practice: return 409 Conflict (or equivalent domain error) and log it loudly.

Why I do this:

  • protects clients from accidental misuse
  • prevents serving incorrect cached results
  • surfaces integration bugs early

Edge case 2: Function times out after making a side effect

This is the classic ambiguous outcome problem.

Idempotency helps, but only if:

  • the side effect can be detected or safely repeated, and/or
  • the result gets persisted before timeout

Best practices:

  • keep timeouts realistic
  • use downstream idempotency where available
  • break long operations into Step Functions when needed
  • persist progress checkpoints for multi-step work

Edge case 3: IN_PROGRESS records stuck after crashes

If a function crashes after acquiring the lock, duplicates may keep seeing IN_PROGRESS.

Best practices:

  • use an in-progress lock expiry
  • make retries back off
  • alert on sustained IN_PROGRESS accumulation
  • evaluate whether the operation is safe to re-attempt after lock expiry
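One way to implement that recovery is to fold the stale-lock check into the conditional write itself, so a new invocation can take over an expired lock atomically. A sketch that builds the `put_item` arguments using the manual schema's attribute names (`expiration`, `inProgressExpiryTimestamp`); the helper itself is hypothetical:

```python
def acquire_lock_kwargs(table: str, key: str, now: int,
                        ttl_seconds: int, lock_seconds: int) -> dict:
    # The condition allows the write when: no record exists, the dedupe
    # window expired, or a previous IN_PROGRESS execution's lock expired
    # (likely a crash). Otherwise ConditionalCheckFailedException fires.
    return {
        "TableName": table,
        "Item": {
            "id": {"S": key},
            "status": {"S": "IN_PROGRESS"},
            "expiration": {"N": str(now + ttl_seconds)},
            "inProgressExpiryTimestamp": {"N": str(now + lock_seconds)},
        },
        "ConditionExpression": (
            "attribute_not_exists(id)"
            " OR expiration < :now"
            " OR (#s = :in_progress AND inProgressExpiryTimestamp < :now)"
        ),
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":now": {"N": str(now)},
            ":in_progress": {"S": "IN_PROGRESS"},
        },
    }
```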

Edge case 4: Replay and backfill

Operational replay is common and healthy. I design for it intentionally.

Best practices:

  • choose a dedupe window that matches replay expectations
  • if replay should re-run effects, use a different idempotency namespace/version
  • document replay semantics for ops teams

Example:

  • normal key: invoice-email:{invoiceId}
  • forced replay key namespace: invoice-email:replay:{jobId}:{invoiceId}
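The namespacing is trivial to centralize so normal and forced-replay paths cannot drift apart (hypothetical helper following the key formats above):

```python
from typing import Optional

def replay_key(operation: str, entity_id: str,
               replay_job_id: Optional[str] = None) -> str:
    # Normal processing dedupes on the plain business key; a forced
    # replay uses a separate namespace so effects run again under a
    # distinct key instead of being suppressed as duplicates.
    if replay_job_id is None:
        return f"{operation}:{entity_id}"
    return f"{operation}:replay:{replay_job_id}:{entity_id}"
```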

Observability: what I monitor for idempotency

Idempotency is not just a code concern. I want to know how often it is being exercised and whether it is hiding a deeper issue.

Metrics I like to emit

  • IdempotencyFreshRequests
  • IdempotencyDuplicateHits
  • IdempotencyInProgressConflicts
  • IdempotencyValidationConflicts (same key, different payload)
  • IdempotencyStoreErrors
  • handler latency split by fresh vs duplicate

Logs I always include

  • idempotency key (or redacted/hash if sensitive)
  • source (api, sqs, eventbridge)
  • dedupe outcome (fresh, duplicate_completed, duplicate_in_progress, validation_mismatch)
  • business identifier (orderId, paymentId, etc.)

This makes incident triage much faster.
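A sketch of the outcome-to-metric mapping (the in-memory `Counter` is a stand-in for whatever you emit through, for example CloudWatch EMF or Powertools Metrics; the metric names follow the list above):

```python
from collections import Counter

METRIC_BY_OUTCOME = {
    "fresh": "IdempotencyFreshRequests",
    "duplicate_completed": "IdempotencyDuplicateHits",
    "duplicate_in_progress": "IdempotencyInProgressConflicts",
    "validation_mismatch": "IdempotencyValidationConflicts",
}

metrics = Counter()

def record_dedupe_outcome(outcome: str) -> None:
    # Classify every handled request so duplicate spikes are visible;
    # in production, emit the mapped name as a CloudWatch metric instead.
    metrics[METRIC_BY_OUTCOME[outcome]] += 1
```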


Practical best practices checklist (the part I use in reviews)

Key selection

  • [ ] Key represents a business operation, not transport metadata
  • [ ] Key is stable across retries/replays
  • [ ] Key cardinality is high enough to avoid false collisions
  • [ ] Payload mismatch with same key is detected

Dedupe window

  • [ ] TTL matches retry + redrive + replay realities
  • [ ] Expiry is checked in code (not only by DynamoDB TTL cleanup)
  • [ ] Window is documented in API/event contract

Store design

  • [ ] DynamoDB conditional write used for first writer wins
  • [ ] IN_PROGRESS and COMPLETED states are handled explicitly
  • [ ] Cached response/ack strategy is defined
  • [ ] Encryption, backups/PITR, and least privilege are configured

Source-specific behavior

  • [ ] API duplicates return deterministic response
  • [ ] SQS uses partial batch response
  • [ ] EventBridge consumer supports replay safely
  • [ ] Redrive/replay semantics are documented for ops

Operations

  • [ ] Metrics and alarms exist for duplicate spikes and store failures
  • [ ] Logs include dedupe outcomes and business IDs
  • [ ] Runbooks cover stale IN_PROGRESS records and replay scenarios

Common mistakes I see (and how I avoid them)

Mistake 1: Using Lambda requestId as the idempotency key

That only identifies the invocation, not the logical request.

Fix: use business operation identity or client-provided idempotency key.


Mistake 2: Assuming FIFO queue means I do not need idempotency

FIFO helps with ordering and deduplication windows, but it does not replace end-to-end idempotency for all side effects and replay paths.

Fix: still make the consumer idempotent.


Mistake 3: Dedupe only at the API layer

Then an async worker downstream duplicates the side effect anyway.

Fix: apply idempotency where side effects happen, especially in SQS/EventBridge consumers.


Mistake 4: No payload validation on key reuse

This can return the wrong cached response and create hidden data integrity issues.

Fix: validate a stable subset of the payload with the idempotency key.


Mistake 5: Too-short TTL

The key expires before retries/redrives finish, so duplicates sneak through.

Fix: pick TTL based on actual operational timelines, not guesswork.


Final thoughts

If I had to summarize production-grade idempotency architecture in one line, it would be this:

Design for duplicate delivery as normal behavior, then make your Lambda handlers safe, deterministic, and observable.

AWS gives us excellent building blocks for this:

  • Lambda
  • SQS / EventBridge / API Gateway
  • DynamoDB conditional writes
  • AWS Lambda Powertools idempotency utility

When I combine them intentionally, retries stop being scary and start being a reliability feature instead of a data integrity risk.

If you are building Lambda-driven systems that write to money, inventory, notifications, or customer state, idempotency is not optional. It is a core part of the architecture.



References

  • AWS Lambda Powertools (Python) documentation
  • AWS Lambda developer guide
  • Amazon SQS developer guide (Lambda event source mappings / retries / partial batch response)
  • Amazon EventBridge documentation (retries, targets, replay/archive)
  • Amazon DynamoDB documentation (conditional writes, TTL, PITR)
