Renaldi for AWS Community Builders

Idempotency Architecture for Lambda-Driven Systems on AWS

Duplicate processing is one of those problems that looks small in a diagram and becomes very expensive in production.

I have seen teams build clean event-driven and Lambda-based systems, only to run into duplicate charges, duplicated emails, repeated downstream writes, or inconsistent state once retries and redrives start happening. The tricky part is that the system is often behaving as designed. AWS services are doing what they should do: retrying, buffering, redriving, and favoring delivery durability.

This is exactly why I consider idempotency architecture one of the most important and most underexplained topics in serverless engineering.

In this post, I will walk through how I design idempotency for Lambda-driven systems on AWS, including:

  • the exactly-once myth vs the at-least-once reality
  • how to choose idempotency keys and dedupe windows
  • a DynamoDB-based idempotency store design
  • using AWS Lambda Powertools idempotency utility
  • handling retries from API Gateway, SQS, and EventBridge
  • an end-to-end implementation walkthrough with code

I will keep this practical and architecture-heavy, so you can adapt it to real workloads instead of only toy examples.


The core idea in one sentence

Idempotency means I can safely process the same logical request more than once and still end up with the same intended outcome.

That does not mean the system only receives the request once. It means my system is resilient when it receives it multiple times.


Exactly-once is usually not the right mental model

A lot of production mistakes start with the assumption that a serverless flow will process each request exactly once.

In practice, in distributed systems, what I usually get is:

  • at-least-once delivery (common with queues/events)
  • retries at multiple layers (client SDKs, AWS services, Lambda, Step Functions, etc.)
  • timeouts and ambiguous outcomes (did the function finish but the caller timed out?)
  • redrives / replay (DLQ reprocessing, archive replay, manual reruns)
  • duplicate submissions from clients (double-click, refresh, mobile reconnect)

So instead of trying to force an “exactly-once” guarantee everywhere, I design for:

  1. at-least-once delivery
  2. idempotent handlers
  3. safe retries
  4. observability around duplicates

That mental shift makes the architecture much more robust.


Where duplicates come from in AWS Lambda-driven systems

Before I show the solution, I like to make the duplicate paths explicit.

API Gateway -> Lambda

Duplicates can happen when:

  • the client retries after a timeout
  • the network drops after the backend succeeded
  • the user taps “Submit” multiple times
  • an upstream reverse proxy retries

SQS -> Lambda

Duplicates can happen when:

  • Lambda fails and the message becomes visible again
  • processing exceeds visibility timeout
  • partial batch failures cause some records to be retried
  • DLQ redrive sends records back later
  • standard queues deliver the same message more than once

EventBridge -> Lambda

Duplicates can happen when:

  • target invocation is retried
  • a publisher emits semantically duplicate events
  • archive/replay is used
  • consumers reprocess historical events intentionally

That is why I architect idempotency at the business operation level, not just at one transport layer.


What a good idempotency architecture looks like

At a high level, I want:

  • a stable idempotency key for each logical operation
  • a dedupe window (TTL) appropriate for the business
  • a persistence store to track processing status and cached results
  • conditional writes to prevent concurrent duplicate execution
  • response replay for safe duplicate requests when appropriate
  • clear behavior for mismatched payloads using the same key

Architecture at a glance


End-to-end walkthrough (the scenario I will implement)

To make this concrete, I will use a common example:

“Create payment intent / order processing” style operation.

Flow

  1. A client sends POST /payments with an Idempotency-Key header.
  2. API Gateway invokes Lambda.
  3. Lambda checks DynamoDB idempotency table.
  4. If the key is new, Lambda acquires an IN_PROGRESS lock and processes the request.
  5. Lambda writes the business result (for example, a payment record) and stores the response in the idempotency table with status COMPLETED.
  6. If the same request is retried, Lambda returns the cached response instead of processing again.

Then I will extend the same pattern to:

  • SQS worker retries
  • EventBridge consumer retries/replay

Designing the idempotency key

This is where a lot of teams accidentally introduce bugs.

A good idempotency key should identify the logical operation, not just the transport envelope.

Good key examples

  • payment:{merchant_id}:{client_request_id}
  • order-create:{tenant_id}:{cart_checkout_id}
  • invoice-email:{invoice_id}:{template_version}
  • event:{event_id} (if the publisher guarantees a stable event ID)

Risky key choices

  • raw timestamp
  • Lambda aws_request_id (changes every invocation)
  • SQS receiptHandle (changes on delivery)
  • entire payload serialized without normalization (field order / formatting issues)
  • keys that are too broad (cause false dedupe)
  • keys that are too narrow (miss duplicates)

My rule of thumb

I choose a key from business identity + operation intent, and I define it explicitly in the contract.

For APIs, that often means:

  • require an Idempotency-Key header from the client, and
  • validate that it maps to a stable request intent.

For asynchronous consumers, that often means:

  • use the publisher’s stable eventId, or
  • derive a deterministic business key (for example order_id + action).
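For the async case, the derivation can be a small pure helper. This is a hypothetical sketch (the function name and field layout are my own); the point is that sorting the identity fields makes the key stable no matter how the call site passes them:

```python
def derive_idempotency_key(operation: str, **identity: str) -> str:
    # Sort identity fields so the key is deterministic regardless of
    # the order the caller supplies them in.
    parts = [f"{k}={identity[k]}" for k in sorted(identity)]
    return f"{operation}:" + ":".join(parts)
```

For example, `derive_idempotency_key("order-paid", orderId="o-1", action="capture")` always yields the same key as the reversed argument order.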

Dedupe windows (TTL): how long should I remember a key?

There is no universal value. I set the dedupe window based on business and retry patterns.

What affects the dedupe window

  • expected client retry duration
  • SQS redrive timing and replay operations
  • EventBridge replay windows / operational reruns
  • whether duplicates after long periods are still harmful
  • cost of storing idempotency records

Practical examples

  • API payment creation: 24 hours to 7 days
  • Webhook ingestion: 1 to 7 days (depends on provider retry policy)
  • Batch event processing: hours to days
  • high-volume telemetry: maybe minutes to hours (if duplicate impact is low)

Important nuance

DynamoDB applies TTL eventually; items are not deleted at the exact expiry time. I design my logic so:

  • expiry_timestamp is authoritative in code, and
  • TTL is the cleanup mechanism.

In other words, I do not depend on the item disappearing exactly at expiry time.
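In code, that check is tiny. A sketch using the conceptual attribute name from the schema below (`expiryTimestamp`, epoch seconds); the helper name is my own:

```python
import time
from typing import Optional

def is_expired(record: dict, now: Optional[int] = None) -> bool:
    # The stored expiry timestamp is authoritative; DynamoDB TTL is only
    # the eventual cleanup mechanism and may lag behind it by hours.
    now = int(time.time()) if now is None else now
    return now >= int(record["expiryTimestamp"])
```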


DynamoDB-based idempotency store design (recommended pattern)

I prefer DynamoDB for idempotency state in Lambda workloads because it gives me:

  • low-latency key lookups
  • conditional writes
  • TTL support
  • simple scaling
  • good fit for stateless Lambda functions

Table design (single-purpose table)

A dedicated table keeps the pattern easy to reason about.

Partition key

  • id (string): the idempotency key

Recommended attributes

  • status (IN_PROGRESS, COMPLETED, optionally EXPIRED)
  • expiryTimestamp (epoch seconds for dedupe window)
  • inProgressExpiryTimestamp (shorter lock expiry to recover from crashed executions)
  • payloadHash (optional but highly recommended)
  • responseData (optional, cached result or safe response envelope)
  • createdAt
  • updatedAt
  • source (api / sqs / eventbridge)
  • functionName (optional for ops visibility)

Why payloadHash matters

If a client reuses the same idempotency key with a different payload, I want to detect that and reject it. Otherwise I can accidentally return a cached response for the wrong request.

This is a subtle but critical best practice.
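A sketch of how I compute the hash over a normalized subset of the payload (hypothetical `payload_hash` helper; the field list is whatever defines request intent for the operation):

```python
import hashlib
import json

def payload_hash(payload: dict, fields: list) -> str:
    # Hash only the fields that define request intent, with sorted keys
    # and compact separators, so field order and whitespace differences
    # in the raw request body do not change the hash.
    subset = {f: payload[f] for f in sorted(fields)}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same intent produce the same hash even if the JSON arrives with different key order; a changed amount produces a different hash and should be rejected as a key-reuse conflict.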


State transitions I use

Here is the lifecycle I generally implement:

  1. No record exists
    • Try conditional write -> create IN_PROGRESS
  2. IN_PROGRESS exists
    • Another invocation is already processing (or crashed recently)
    • Return a retryable outcome or fail fast depending on source
  3. COMPLETED exists and not expired
    • Return cached result (or safe ack)
  4. Expired
    • Treat as a new request (depending on business policy)

That gives me concurrency safety and retry safety.
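The lifecycle above can be written as one small decision function. This is a hypothetical pure helper using the conceptual attribute names (`status`, `expiryTimestamp`, `inProgressExpiryTimestamp`); real implementations fold this into conditional writes:

```python
def decide(record, now: int) -> str:
    # record is None when no idempotency item exists yet.
    if record is None or now >= record["expiryTimestamp"]:
        return "PROCESS"            # new request, or dedupe window expired
    if record["status"] == "COMPLETED":
        return "RETURN_CACHED"      # duplicate of a finished request
    if record["status"] == "IN_PROGRESS":
        if now >= record.get("inProgressExpiryTimestamp", 0):
            return "RECLAIM_LOCK"   # previous execution likely crashed
        return "CONFLICT"           # concurrent duplicate in flight
    return "PROCESS"
```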


Infrastructure example (SAM / CloudFormation snippets)

Below is a minimal setup for:

  • Lambda function
  • DynamoDB idempotency table
  • IAM permissions
  • env vars for configuration
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Idempotent Lambda API example

Resources:
  PaymentsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.12
      Handler: app.lambda_handler
      CodeUri: src/
      MemorySize: 512
      Timeout: 29
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: !GetAtt IdempotencyTable.Arn
      Environment:
        Variables:
          IDEMPOTENCY_TABLE_NAME: !Ref IdempotencyTable
          IDEMPOTENCY_EXPIRES_SECONDS: "86400" # 24h
      Events:
        CreatePaymentApi:
          Type: Api
          Properties:
            Path: /payments
            Method: post

  IdempotencyTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: expiration
        Enabled: true
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true

Outputs:
  ApiUrl:
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/payments"

Notes on the table schema vs Powertools

AWS Lambda Powertools idempotency utility has its own default attribute names and record model. If I use the library, I usually let it manage the item shape and only customize when I truly need to.

That said, I still think through the conceptual schema above so the team understands what is being stored and why.


Implementation option 1 (recommended): AWS Lambda Powertools idempotency utility

If I am using Python Lambda functions, AWS Lambda Powertools is my default choice. It saves me from reimplementing concurrency locks, conditional checks, and record state transitions from scratch.

Install

pip install "aws-lambda-powertools[aws-sdk]"

API Gateway example with idempotency (Python)

This example assumes the client sends:

  • Idempotency-Key header
  • JSON body containing payment details

I use:

  • event_key_jmespath to extract the idempotency key from headers
  • payload_validation_jmespath to validate that key reuse with different payloads is detected
  • a DynamoDB persistence layer
import json
import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)
from aws_lambda_powertools.utilities.idempotency.exceptions import (
    IdempotencyValidationError,
)

logger = Logger(service="payments-api")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
EXPIRES_SECONDS = int(os.getenv("IDEMPOTENCY_EXPIRES_SECONDS", "86400"))

persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

config = IdempotencyConfig(
    # API Gateway/Lambda proxy event header lookup (normalize to lower-case in code if needed)
    event_key_jmespath='headers."idempotency-key"',
    # Payload fields that should remain consistent when reusing the same key
    payload_validation_jmespath="powertools_json(body).[customerId, amount, currency]",
    expires_after_seconds=EXPIRES_SECONDS,
    use_local_cache=True,
)

def create_payment_intent(request: Dict[str, Any]) -> Dict[str, Any]:
    # Replace this with your real business logic / external payment call.
    # The critical point is that this function is wrapped with idempotency.
    payment_id = f"pay_{request['customerId']}_{request['amount']}_{request['currency']}"
    return {
        "paymentId": payment_id,
        "status": "AUTHORIZED",
        "amount": request["amount"],
        "currency": request["currency"],
    }

@idempotent(config=config, persistence_store=persistence_layer)
def process_request(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    body = json.loads(event.get("body") or "{}")
    required = ["customerId", "amount", "currency"]
    missing = [k for k in required if k not in body]
    if missing:
        return {
            "statusCode": 400,
            "body": json.dumps({"message": f"Missing required fields: {missing}"})
        }

    result = create_payment_intent(body)

    # Powertools can persist and return this response for duplicate calls
    return {
        "statusCode": 200,
        "body": json.dumps(result),
        "headers": {"Content-Type": "application/json"},
    }

@logger.inject_lambda_context(log_event=False)
def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Normalize headers so event_key_jmespath is predictable
    headers = event.get("headers") or {}
    event["headers"] = {str(k).lower(): v for k, v in headers.items()}

    try:
        return process_request(event, context)
    except IdempotencyValidationError:
        return {
            "statusCode": 409,
            "body": json.dumps({
                "message": "Idempotency key was reused with a different request payload"
            }),
            "headers": {"Content-Type": "application/json"},
        }

Why this pattern works well

  • If the same request is retried, Powertools returns the stored response.
  • If the same key is reused with a different body, I return 409 Conflict.
  • I avoid duplicate payment authorization for simple retries.

API client contract (important and often skipped)

Idempotency works much better when the contract is explicit.

For API producers/clients, I document:

  • Idempotency-Key is required for mutating operations (POST, sometimes PATCH)
  • same key + same intent/payload -> safe retry
  • same key + different payload -> 409 Conflict
  • dedupe window (for example, 24h)
  • response replay behavior (cached result may be returned)

That avoids ambiguity across frontend, mobile, and backend teams.
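On the client side, the most common mistake is generating a fresh key per HTTP attempt. A minimal sketch of what I document for clients (hypothetical helper; the key format assumes the `order-create:{cart_checkout_id}` convention from earlier):

```python
def build_payment_request(cart_checkout_id: str, payload: dict) -> dict:
    # One key per LOGICAL operation (the checkout attempt), not per HTTP
    # attempt, so that client-side retries reuse the same key and are
    # deduped by the server instead of creating duplicate payments.
    key = f"order-create:{cart_checkout_id}"
    return {
        "method": "POST",
        "path": "/payments",
        "headers": {"Idempotency-Key": key, "Content-Type": "application/json"},
        "body": payload,
    }
```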


Handling retries from SQS (Lambda event source mapping)

SQS is one of the most common places where teams need idempotency but only discover it after duplicates happen.

What changes for SQS?

  • Lambda receives a batch of messages.
  • Some messages may succeed while others fail.
  • I should use partial batch response so only failed records are retried.
  • Each message should still be processed idempotently.

Key design for SQS consumers

I avoid using transient delivery metadata. I prefer:

  • a business key in the message body (for example orderId)
  • or an upstream event ID included in the message

Example key:

  • order-paid:{orderId}
  • inventory-reservation:{reservationId}

SQS consumer example (Powertools Batch + per-record idempotency)

Below is a simplified Python example using:

  • Powertools Batch utility for partial batch handling
  • Powertools idempotency on the per-record processing function
import json
import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.batch import BatchProcessor, EventType, process_partial_response
from aws_lambda_powertools.utilities.batch.types import PartialItemFailureResponse
from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent_function,
)

logger = Logger(service="orders-sqs-worker")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

# Here we build the key from the function argument "data"
# (Powertools hashes the configured subset to create an idempotency key)
idempotency_config = IdempotencyConfig(
    event_key_jmespath="orderId",
    payload_validation_jmespath="[orderId, action, version]",
    expires_after_seconds=3 * 24 * 60 * 60,  # 3 days
)

processor = BatchProcessor(event_type=EventType.SQS)

@idempotent_function(data_keyword_argument="data", config=idempotency_config, persistence_store=persistence_layer)
def process_order_event(*, data: Dict[str, Any]) -> Dict[str, Any]:
    # Business logic here (must be safe to retry / replay via idempotency)
    # Example: reserve inventory, update status, publish follow-up event, etc.
    logger.info("Processing order event", extra={"orderId": data["orderId"], "action": data["action"]})
    return {"ok": True, "orderId": data["orderId"], "action": data["action"]}

def record_handler(record: SQSRecord) -> Dict[str, Any]:
    payload = json.loads(record.body)
    return process_order_event(data=payload)

def lambda_handler(event, context) -> PartialItemFailureResponse:
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
    )

Best practices I apply with SQS + idempotency

  • Use partial batch response to avoid retrying already-successful records.
  • Set visibility timeout longer than worst-case processing time (or heartbeat/extend strategy).
  • Keep the idempotency key in the message payload, not delivery metadata.
  • Use a dedupe window long enough to cover retries, DLQ redrive, and operational replay.

Handling retries from EventBridge targets

EventBridge makes event-driven architecture clean, but duplicate-safe consumers are still my responsibility.

EventBridge-specific considerations

  • EventBridge may retry target delivery.
  • Archive/replay can intentionally re-send events.
  • Different publishers may emit semantically duplicate events unless the contract is strict.

Key strategy for EventBridge consumers

If the event has a stable id or domain event ID, I use it. If not, I derive one from:

  • detail-type
  • source/domain identifier
  • business entity ID
  • action/version

Example:

  • eventbridge:{source}:{detailType}:{detail.orderId}:{detail.version}

EventBridge consumer example (Python Lambda + Powertools idempotency)

import os
from typing import Any, Dict

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

logger = Logger(service="eventbridge-consumer")

TABLE_NAME = os.environ["IDEMPOTENCY_TABLE_NAME"]
persistence_layer = DynamoDBPersistenceLayer(table_name=TABLE_NAME)

config = IdempotencyConfig(
    # Prefer a producer-defined unique event ID if available in detail
    event_key_jmespath="detail.eventId || id",
    payload_validation_jmespath='[source, "detail-type", detail.orderId, detail.version]',
    expires_after_seconds=7 * 24 * 60 * 60,
)

@idempotent(config=config, persistence_store=persistence_layer)
def handle_event(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    detail = event["detail"]
    logger.info(
        "Handling EventBridge event",
        extra={"eventId": detail.get("eventId", event.get("id")), "orderId": detail.get("orderId")}
    )

    # Execute your domain logic here
    # Example: update read model, trigger notification, call downstream API, etc.
    return {"handled": True, "orderId": detail.get("orderId")}

def lambda_handler(event, context):
    return handle_event(event, context)

When I implement idempotency manually (without Powertools)

Powertools is my default, but sometimes I implement manually when:

  • I need a custom record schema shared across services/languages
  • I need custom conflict behavior beyond the library defaults
  • I am in a language/runtime where I am standardizing a platform abstraction
  • I want explicit control over lock and result update semantics

The key principle stays the same: use DynamoDB conditional writes.


Manual DynamoDB idempotency pattern (Python + boto3)

The pattern below shows the core idea:

  1. Try to write IN_PROGRESS with attribute_not_exists(id)
  2. If successful, execute business logic
  3. Update item to COMPLETED and store result
  4. If conditional check fails, inspect existing item and decide
import json
import time
import hashlib
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "IdempotencyTable"

def sha256_text(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def acquire_lock(idempotency_key: str, payload_hash: str, ttl_seconds: int = 86400, lock_seconds: int = 120):
    now = int(time.time())
    item = {
        "id": {"S": idempotency_key},
        "status": {"S": "IN_PROGRESS"},
        "payloadHash": {"S": payload_hash},
        "expiration": {"N": str(now + ttl_seconds)},
        "inProgressExpiryTimestamp": {"N": str(now + lock_seconds)},
        "createdAt": {"N": str(now)},
        "updatedAt": {"N": str(now)},
    }
    try:
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item=item,
            ConditionExpression="attribute_not_exists(id)"
        )
        return {"acquired": True}
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        return {"acquired": False}

def get_record(idempotency_key: str):
    resp = dynamodb.get_item(
        TableName=TABLE_NAME,
        Key={"id": {"S": idempotency_key}},
        ConsistentRead=True,
    )
    return resp.get("Item")

def mark_completed(idempotency_key: str, response_obj: dict):
    now = int(time.time())
    dynamodb.update_item(
        TableName=TABLE_NAME,
        Key={"id": {"S": idempotency_key}},
        UpdateExpression="SET #s = :completed, responseData = :resp, updatedAt = :now",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":completed": {"S": "COMPLETED"},
            ":resp": {"S": json.dumps(response_obj)},
            ":now": {"N": str(now)},
        },
    )

def handler(event, context):
    body = json.loads(event["body"])
    idem_key = event["headers"]["idempotency-key"]
    payload_hash = sha256_text(json.dumps({
        "customerId": body["customerId"],
        "amount": body["amount"],
        "currency": body["currency"],
    }, sort_keys=True))

    lock = acquire_lock(idem_key, payload_hash)
    if not lock["acquired"]:
        existing = get_record(idem_key)
        if not existing:
            # Rare race / eventual state issue; safe to retry
            raise Exception("Retry request")
        if existing.get("payloadHash", {}).get("S") != payload_hash:
            return {"statusCode": 409, "body": json.dumps({"message": "Idempotency key payload mismatch"})}
        status = existing["status"]["S"]
        if status == "COMPLETED":
            return {"statusCode": 200, "body": existing["responseData"]["S"]}
        if status == "IN_PROGRESS":
            # For API workflows you might return 409 or 425-style retry signal (implementation-specific)
            return {"statusCode": 409, "body": json.dumps({"message": "Request is already in progress"})}

    # Execute business logic after lock acquired
    result = {"paymentId": "pay_123", "status": "AUTHORIZED"}
    mark_completed(idem_key, result)
    return {"statusCode": 200, "body": json.dumps(result)}

Manual pattern caveats

If I go manual, I also need to think about:

  • expired IN_PROGRESS lock recovery
  • exception handling and safe cleanup
  • serialization of cached responses
  • metrics for duplicate hits vs fresh requests
  • consistent behavior across all event sources

That is exactly why Powertools is usually the better default.


End-to-end implementation discussion (how I wire this in production)

This is the part I care about most in architecture reviews: not just the code, but where idempotency sits in the overall system.

1) Idempotency belongs close to the handler boundary

I usually apply idempotency at the Lambda entry point (or record handler for batches), before business side effects occur.

Why:

  • it prevents duplicate external calls
  • it keeps the protection broad
  • it makes retries safe by default

2) I still design downstream writes carefully

Idempotency at the Lambda layer is great, but if the function can partially succeed before crashing, I also check downstream safety:

  • unique constraints in relational DBs
  • conditional writes in DynamoDB
  • provider-side idempotency support for payment APIs or webhooks

Think in layers, not in a single magic switch.
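For the DynamoDB case, the downstream layer is just another conditional write, keyed on the business identity this time. A sketch, assuming a payments table keyed on `paymentId` (the function, table shape, and injected client are my own illustration, not a library API):

```python
def put_payment_once(dynamodb, table: str, payment: dict) -> bool:
    """Write the business record only if it does not already exist.

    Returns True if this invocation created the record, False if a
    previous (duplicate) invocation already did. First writer wins.
    """
    try:
        dynamodb.put_item(
            TableName=table,
            Item={
                "paymentId": {"S": payment["paymentId"]},
                "amount": {"N": str(payment["amount"])},
                "status": {"S": "AUTHORIZED"},
            },
            ConditionExpression="attribute_not_exists(paymentId)",
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False
```

Even if the idempotency layer above it fails open for some reason, a duplicate invocation cannot create a second payment record.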

3) I define duplicate behavior per source

The response to a duplicate is not always the same.

  • API Gateway: return cached success response (best UX)
  • SQS: ack success for already-processed message, avoid poison-loop
  • EventBridge: safely no-op or return success after dedupe

4) I separate “idempotency key” and “correlation ID”

They are related but not identical.

  • Correlation ID -> tracing/observability
  • Idempotency key -> duplicate suppression for a specific operation

Sometimes they can be the same, but I do not assume that.


Handling edge cases and failure modes

Edge case 1: Same key, different payload

This should be treated as a contract violation.

Best practice: return 409 Conflict (or equivalent domain error) and log it loudly.

Why I do this:

  • protects clients from accidental misuse
  • prevents serving incorrect cached results
  • surfaces integration bugs early

Edge case 2: Function times out after making a side effect

This is the classic ambiguous outcome problem.

Idempotency helps, but only if:

  • the side effect can be detected or safely repeated, and/or
  • the result gets persisted before timeout

Best practices:

  • keep timeouts realistic
  • use downstream idempotency where available
  • break long operations into Step Functions when needed
  • persist progress checkpoints for multi-step work

Edge case 3: IN_PROGRESS records stuck after crashes

If a function crashes after acquiring the lock, duplicates may keep seeing IN_PROGRESS.

Best practices:

  • use an in-progress lock expiry
  • make retries back off
  • alert on sustained IN_PROGRESS accumulation
  • evaluate whether the operation is safe to re-attempt after lock expiry
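One way to implement that recovery is to fold the stale-lock check into the conditional write itself, so a new invocation can take over an expired lock atomically. A sketch that builds the `put_item` arguments using the manual schema's attribute names (`expiration`, `inProgressExpiryTimestamp`); the helper itself is hypothetical:

```python
def acquire_lock_kwargs(table: str, key: str, now: int,
                        ttl_seconds: int, lock_seconds: int) -> dict:
    # The condition allows the write when: no record exists, the dedupe
    # window expired, or a previous IN_PROGRESS execution's lock expired
    # (likely a crash). Otherwise ConditionalCheckFailedException fires.
    return {
        "TableName": table,
        "Item": {
            "id": {"S": key},
            "status": {"S": "IN_PROGRESS"},
            "expiration": {"N": str(now + ttl_seconds)},
            "inProgressExpiryTimestamp": {"N": str(now + lock_seconds)},
        },
        "ConditionExpression": (
            "attribute_not_exists(id)"
            " OR expiration < :now"
            " OR (#s = :in_progress AND inProgressExpiryTimestamp < :now)"
        ),
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {
            ":now": {"N": str(now)},
            ":in_progress": {"S": "IN_PROGRESS"},
        },
    }
```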

Edge case 4: Replay and backfill

Operational replay is common and healthy. I design for it intentionally.

Best practices:

  • choose a dedupe window that matches replay expectations
  • if replay should re-run effects, use a different idempotency namespace/version
  • document replay semantics for ops teams

Example:

  • normal key: invoice-email:{invoiceId}
  • forced replay key namespace: invoice-email:replay:{jobId}:{invoiceId}
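The namespacing is trivial to centralize so normal and forced-replay paths cannot drift apart (hypothetical helper following the key formats above):

```python
from typing import Optional

def replay_key(operation: str, entity_id: str,
               replay_job_id: Optional[str] = None) -> str:
    # Normal processing dedupes on the plain business key; a forced
    # replay uses a separate namespace so effects run again under a
    # distinct key instead of being suppressed as duplicates.
    if replay_job_id is None:
        return f"{operation}:{entity_id}"
    return f"{operation}:replay:{replay_job_id}:{entity_id}"
```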

Observability: what I monitor for idempotency

Idempotency is not just a code concern. I want to know how often it is being exercised and whether it is hiding a deeper issue.

Metrics I like to emit

  • IdempotencyFreshRequests
  • IdempotencyDuplicateHits
  • IdempotencyInProgressConflicts
  • IdempotencyValidationConflicts (same key, different payload)
  • IdempotencyStoreErrors
  • handler latency split by fresh vs duplicate

Logs I always include

  • idempotency key (or redacted/hash if sensitive)
  • source (api, sqs, eventbridge)
  • dedupe outcome (fresh, duplicate_completed, duplicate_in_progress, validation_mismatch)
  • business identifier (orderId, paymentId, etc.)

This makes incident triage much faster.
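A sketch of the outcome-to-metric mapping (the in-memory `Counter` is a stand-in for whatever you emit through, for example CloudWatch EMF or Powertools Metrics; the metric names follow the list above):

```python
from collections import Counter

METRIC_BY_OUTCOME = {
    "fresh": "IdempotencyFreshRequests",
    "duplicate_completed": "IdempotencyDuplicateHits",
    "duplicate_in_progress": "IdempotencyInProgressConflicts",
    "validation_mismatch": "IdempotencyValidationConflicts",
}

metrics = Counter()

def record_dedupe_outcome(outcome: str) -> None:
    # Classify every handled request so duplicate spikes are visible;
    # in production, emit the mapped name as a CloudWatch metric instead.
    metrics[METRIC_BY_OUTCOME[outcome]] += 1
```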


Practical best practices checklist (the part I use in reviews)

Key selection

  • [ ] Key represents a business operation, not transport metadata
  • [ ] Key is stable across retries/replays
  • [ ] Key cardinality is high enough to avoid false collisions
  • [ ] Payload mismatch with same key is detected

Dedupe window

  • [ ] TTL matches retry + redrive + replay realities
  • [ ] Expiry is checked in code (not only by DynamoDB TTL cleanup)
  • [ ] Window is documented in API/event contract

Store design

  • [ ] DynamoDB conditional write used for first writer wins
  • [ ] IN_PROGRESS and COMPLETED states are handled explicitly
  • [ ] Cached response/ack strategy is defined
  • [ ] Encryption, backups/PITR, and least privilege are configured

Source-specific behavior

  • [ ] API duplicates return deterministic response
  • [ ] SQS uses partial batch response
  • [ ] EventBridge consumer supports replay safely
  • [ ] Redrive/replay semantics are documented for ops

Operations

  • [ ] Metrics and alarms exist for duplicate spikes and store failures
  • [ ] Logs include dedupe outcomes and business IDs
  • [ ] Runbooks cover stale IN_PROGRESS records and replay scenarios

Common mistakes I see (and how I avoid them)

Mistake 1: Using Lambda requestId as the idempotency key

That only identifies the invocation, not the logical request.

Fix: use business operation identity or client-provided idempotency key.


Mistake 2: Assuming FIFO queue means I do not need idempotency

FIFO helps with ordering and deduplication windows, but it does not replace end-to-end idempotency for all side effects and replay paths.

Fix: still make the consumer idempotent.


Mistake 3: Dedupe only at the API layer

Then an async worker downstream duplicates the side effect anyway.

Fix: apply idempotency where side effects happen, especially in SQS/EventBridge consumers.


Mistake 4: No payload validation on key reuse

This can return the wrong cached response and create hidden data integrity issues.

Fix: validate a stable subset of the payload with the idempotency key.


Mistake 5: Too-short TTL

The key expires before retries/redrives finish, so duplicates sneak through.

Fix: pick TTL based on actual operational timelines, not guesswork.


Final thoughts

If I had to summarize production-grade idempotency architecture in one line, it would be this:

Design for duplicate delivery as normal behavior, then make your Lambda handlers safe, deterministic, and observable.

AWS gives us excellent building blocks for this:

  • Lambda
  • SQS / EventBridge / API Gateway
  • DynamoDB conditional writes
  • AWS Lambda Powertools idempotency utility

When I combine them intentionally, retries stop being scary and start being a reliability feature instead of a data integrity risk.

If you are building Lambda-driven systems that write to money, inventory, notifications, or customer state, idempotency is not optional. It is a core part of the architecture.



References

  • AWS Lambda Powertools (Python) documentation
  • AWS Lambda developer guide
  • Amazon SQS developer guide (Lambda event source mappings / retries / partial batch response)
  • Amazon EventBridge documentation (retries, targets, replay/archive)
  • Amazon DynamoDB documentation (conditional writes, TTL, PITR)
