Last month, I watched my third-party API costs triple overnight. The strangest part? My DynamoDB cache was working perfectly, or so I thought.
Turns out, I'd built a textbook race condition into my serverless architecture. The kind that only shows up under load, when it's expensive to debug.
The Setup
Standard serverless caching pattern: Lambda checks DynamoDB, returns cached data if present, otherwise hits an external API and writes the result back. Clean separation of concerns, scales to zero, the usual AWS promise.
In isolation, every request behaved exactly right. Cache miss triggered one API call, wrote to DynamoDB, subsequent requests got cache hits. Perfect.
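In code, the naive read-through pattern looks roughly like this. It's a minimal sketch: a plain dict stands in for the DynamoDB table and a counter stands in for the external API, both my illustrations rather than the original Lambda code:

```python
cache = {}       # stands in for the DynamoDB table
api_calls = 0    # counts hits to the "external API"

def fetch_from_external_api(item_id):
    global api_calls
    api_calls += 1
    return f"data-for-{item_id}"

def get_item(item_id):
    # 1. Check the cache
    if item_id in cache:
        return cache[item_id]
    # 2. Cache miss: call the external API
    data = fetch_from_external_api(item_id)
    # 3. Write the result back for subsequent requests
    cache[item_id] = data
    return data

get_item("user-42")  # miss: one API call
get_item("user-42")  # hit: served from the cache, no API call
```

Sequentially this behaves exactly as described: one fetch, then cache hits forever. The gap between step 1 and step 3 is where the trouble starts.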
Where It Breaks
Hot keys destroy this pattern. When 20 concurrent requests ask for the same uncached item, you'd expect one API call and 19 cache hits. That's not what happens.
Here's the actual sequence:
- Request 1 reads DynamoDB (empty), prepares API call
- Request 2 reads DynamoDB before Request 1 writes (still empty), prepares API call
- Requests 3-20 do the same
You just paid for 20 API calls to fetch identical data. Zero cache benefit.
I tested this with controlled concurrency: 20 parallel Lambda invocations requesting the same key. Every single invocation hit the external API. The race window between read and write is small, but it's large enough.
DynamoDB Conditional Updates
The fix isn't complicated, but it requires thinking about state differently. You can't rely on read-then-write logic, because that gap is where the race lives.
DynamoDB's ConditionExpression parameter makes writes atomic. You specify conditions that must be true for the write to succeed. If they're not, you get ConditionalCheckFailedException immediately.
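The semantics are compare-and-set: the condition check and the write happen as a single atomic step, so there is no gap to race in. A toy in-memory version of that guarantee (a lock-protected dict as a stand-in for DynamoDB, entirely my illustration):

```python
import threading

class ToyTable:
    """In-memory stand-in for DynamoDB's conditional-update semantics."""
    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def update_if(self, key, new_value, condition):
        # Check and write happen under one lock: atomic, like ConditionExpression.
        with self._lock:
            if not condition(self._items.get(key)):
                raise RuntimeError("ConditionalCheckFailedException")
            self._items[key] = new_value

table = ToyTable()
# First caller succeeds: the item doesn't exist yet.
table.update_if("job", "Processing", lambda cur: cur is None or cur == "Pending")
# A second identical attempt now fails: the item is already 'Processing'.
```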
Implementation Pattern
I use a three-state flow: Pending → Processing → Completed.
When a Lambda detects a cache miss, it attempts an atomic state transition:
```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')  # your cache table

# Only one Lambda will succeed
try:
    table.update_item(
        Key={'id': item_id},
        UpdateExpression='SET #status = :processing',
        ConditionExpression='attribute_not_exists(id) OR #status = :pending',
        ExpressionAttributeNames={'#status': 'status'},
        ExpressionAttributeValues={
            ':processing': 'Processing',
            ':pending': 'Pending'
        }
    )
    # Winner: call external API
    data = fetch_from_external_api(item_id)
    # Write final result
    table.update_item(
        Key={'id': item_id},
        UpdateExpression='SET #status = :completed, #data = :data',
        ExpressionAttributeNames={
            '#status': 'status',
            '#data': 'data'
        },
        ExpressionAttributeValues={
            ':completed': 'Completed',
            ':data': data
        }
    )
    return data
except ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        # Loser: wait for winner to finish, then read the cache
        time.sleep(0.1)  # Simple backoff; see below for a more robust approach
        return get_from_cache(item_id)
    raise
```
Only one Lambda succeeds. The others catch ConditionalCheckFailedException and know someone else claimed the work. They can either return a retry message or implement exponential backoff until the data is ready.
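For the losers, "wait until the data is ready" is better expressed as a small polling loop with exponential backoff than a single fixed sleep. A sketch, where `poll_cache` is a hypothetical helper (not from the original code) that returns `None` until the winner has written the Completed item:

```python
import time

def wait_for_cache(item_id, poll_cache, base_delay=0.05, max_attempts=6):
    """Poll the cache with exponential backoff until the winner finishes."""
    delay = base_delay
    for _ in range(max_attempts):
        # e.g. a GetItem that returns the data only when status == 'Completed'
        data = poll_cache(item_id)
        if data is not None:
            return data
        time.sleep(delay)
        delay *= 2  # exponential backoff: 0.05s, 0.1s, 0.2s, ...
    raise TimeoutError(f"cache entry for {item_id} never completed")
```

Bounding the attempts matters in Lambda: an unbounded wait just burns invocation time if the winner crashes mid-fetch.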
Results
Same test, same 20 concurrent requests:
- External API calls: 1
- Cached responses: 19
The race condition is gone. DynamoDB handles the coordination because that's what it's built for: atomic operations at scale.
Key Takeaways
If you're writing to DynamoDB based on a previous read without conditional expressions, you have a race condition. It might not matter at low traffic, but it will surface under load.
The pattern is straightforward:
- Move concurrency control into the database layer
- Use ConditionExpression for atomic state transitions
- Handle ConditionalCheckFailedException in your Lambda code
A few extra lines of code eliminate an entire class of expensive bugs.
Have you run into similar race conditions in your serverless architecture? Drop a comment below. I'd love to hear how you've handled concurrent cache misses.