DEV Community

Sowmya Katherla

26,000 EBS Snapshots, a 15-Minute Wall, and the Architecture That Finally Worked

Originally published on Medium

A real-world breakdown of 5 compounding failure modes — memory exhaustion, Lambda timeouts, SNS limits, missing retry logic — and three progressively powerful architectures to fix them.

Lambda × EBS Snapshots

The Sunday Night That Changed Everything

Picture this: it's Sunday at 4 PM. A scheduled EventBridge rule quietly fires off your Lambda function. Its job? Simple. Scan all your EBS snapshots, find anything older than 90 days, delete it, and send a confirmation email.

Except it never sends that email. Because it never finishes.

Ten minutes pass. The Lambda runtime does what it always does when a function overstays its welcome — kills it. Hard stop. No cleanup. No notification. No idea how many snapshots (if any) were actually deleted. And then, because Lambda has a retry policy for async invocations, it tries again. And again. Three times total. All timeouts.

The scale problem in numbers: 26,000+ EBS snapshots in a single AWS account. A Lambda function loading ALL of them into memory at once. 512 MB of RAM. A 10-minute timeout. You do the math — it never stood a chance.

This article is the complete breakdown: what went wrong, why it went wrong at scale, and three progressively powerful architectures to fix it — from a 30-minute patch to a fully orchestrated enterprise solution.


Part 1: Anatomy of the Failure

Before jumping to solutions, let's understand all five failure modes. Most articles fix one. This one had five. Fix only one and you'll still fail.


Failure #1 — The Memory Bomb

The original function's first line of real work looked like this:

# Loads every single snapshot into RAM simultaneously
all_snaps = ec2_client.describe_snapshots(OwnerIds=['self'])['Snapshots']

# With 26,000+ snapshots, this single line can consume
# hundreds of megabytes before a single deletion attempt

What's happening here? The describe_snapshots API without pagination loads all results into a Python list simultaneously. The boto3 SDK does not auto-paginate for you. Beyond the memory issue, if you have more than 1,000 snapshots, you're silently missing results — the API pages at 1,000 items and you're never requesting the next page.

Definition: API Pagination
Most AWS list/describe APIs return results in pages of up to 1,000 items. Each response includes a NextToken field. If present, more results exist — you must loop, passing the token back, until NextToken is absent. Failing to do this means silently missing data at scale.
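To make the definition concrete, here is a minimal sketch of the manual `NextToken` loop. The stub client is hypothetical, standing in for boto3's real EC2 client so the loop can run without AWS access; in production you would pass `boto3.client('ec2')` instead.

```python
def list_all_snapshots(client, owner_ids):
    """Walk every page by hand, yielding snapshots as they arrive."""
    kwargs = {'OwnerIds': owner_ids}
    while True:
        page = client.describe_snapshots(**kwargs)
        yield from page['Snapshots']
        token = page.get('NextToken')
        if not token:               # no token means this was the last page
            return
        kwargs['NextToken'] = token

# Hypothetical stub in place of boto3's EC2 client: two pages of results.
class StubEC2:
    _pages = {
        None: {'Snapshots': [{'SnapshotId': 'snap-1'}, {'SnapshotId': 'snap-2'}],
               'NextToken': 'page-2'},
        'page-2': {'Snapshots': [{'SnapshotId': 'snap-3'}]},
    }
    def describe_snapshots(self, OwnerIds, NextToken=None):
        return self._pages[NextToken]

ids = [s['SnapshotId'] for s in list_all_snapshots(StubEC2(), ['self'])]
print(ids)  # ['snap-1', 'snap-2', 'snap-3']
```

Because the function is a generator, results stream page by page rather than accumulating in one giant list.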


Failure #2 — The 15-Minute Wall

AWS Lambda has an absolute maximum execution time of 15 minutes. No exceptions. No extensions. No negotiating. When the clock hits zero, your function is terminated — mid-loop, mid-deletion, mid-anything.

With 26,000 snapshots to delete sequentially (one API call each), you're looking at potentially 30–60 minutes of work. Lambda simply cannot complete this in a single invocation.

What You Need                                    | What Lambda Gives You
-------------------------------------------------|-----------------------------------------------
30–60+ min to process 26K snapshots sequentially | 15 minutes maximum — hard limit, no exceptions
Progress saved if interrupted                    | Complete restart from scratch on every retry

Failure #3 — The SNS 256 KB Wall

AWS SNS has a hard 256 KB per-message limit. The original code built one massive string listing every deleted snapshot — ID, name, dates, description — for potentially thousands of entries. That string will exceed 256 KB every time.

# This string grows indefinitely — will blow past 256KB
email_body = 'Snapshot cleanup results:\n\n'
for snap in deleted_list:
    email_body += f'Snapshot ID: {snap["id"]}\n'
    email_body += f'Name:        {snap["name"]}\n'
    email_body += f'Created:     {snap["created"]}\n'
    email_body += '-' * 80 + '\n'

# Throws: InvalidParameter: Message too long
sns_client.publish(TopicArn=TOPIC_ARN, Message=email_body)

Failure #4 — No State, No Memory

Each timeout causes Lambda to retry from scratch. The function re-fetches all snapshots, re-attempts deletions that may have already succeeded, and has zero awareness of previous progress. No checkpoint. No tracking. No "pick up where I left off."


Failure #5 — No Error Differentiation

Not all snapshot deletion errors are equal:

  • A snapshot "in use" by an AMI legitimately cannot be deleted — skip it permanently
  • A throttling error should trigger a retry
  • A permissions error is a bug that needs an alert

The original code caught all exceptions identically: log and continue.

Actual production log:
Duration: 603,000 ms | Status: timeout | Memory Used: 512 MB (maxed)
The function ran for its full timeout limit, consumed every byte of allocated memory, and terminated silently. This happened three consecutive times. Zero snapshots confirmed deleted. Zero notifications sent.


Part 2: Three Ways to Fix It

There’s no single “correct” answer. The right solution depends on your account size, your team’s infrastructure comfort, and how much observability you need. We’ll walk all three — from quickest to most powerful.



Fix #1 — Pagination + Safe SNS (The Quick Win)

This is the minimum viable fix. It eliminates memory exhaustion and silent data-skipping. If your account has fewer than a few thousand snapshots to delete per run, this may be all you need.

Pagination + S3 Report

The Paginator Pattern

AWS boto3 ships with built-in paginators for most describe/list operations. They handle the NextToken loop automatically and stream results one page at a time — never loading everything into RAM.

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
RETENTION_DAYS = 90
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

# ❌ OLD — loads everything at once, silently misses >1000 results
# snaps = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']

# ✅ NEW — streams page by page, handles any account size
paginator = ec2.get_paginator('describe_snapshots')

eligible_ids = []

for page in paginator.paginate(OwnerIds=['self']):
    for snap in page['Snapshots']:
        if snap['StartTime'] < cutoff:
            eligible_ids.append(snap['SnapshotId'])

print(f'Found {len(eligible_ids)} snapshots older than {RETENTION_DAYS} days')

Safe SNS — Truncate + S3 Report

Cap the SNS message at 240 KB and store the full report in S3. The email contains a summary and a direct S3 link. No more Message too long errors.

import boto3, os
from datetime import datetime, timezone

sns = boto3.client('sns')
s3  = boto3.client('s3')

TOPIC_ARN     = os.environ['SNS_TOPIC_ARN']
REPORT_BUCKET = os.environ['S3_REPORT_BUCKET']
SNS_BYTE_CAP  = 240_000  # safely under the 256KB SNS hard limit

def safe_publish(subject, content):
    encoded = content.encode('utf-8')
    if len(encoded) > SNS_BYTE_CAP:
        content = encoded[:SNS_BYTE_CAP].decode('utf-8', errors='ignore')
        content += '\n\n[Truncated — full report saved to S3]'
    sns.publish(TopicArn=TOPIC_ARN, Subject=subject[:100], Message=content)

def upload_full_report(removed_snaps, cutoff_date):
    ts  = datetime.now(timezone.utc).strftime('%Y-%m-%d_%H-%M-%S')
    key = f'ebs-cleanup-reports/{ts}_report.txt'
    report = f'EBS Snapshot Cleanup Report\nGenerated: {ts}\n'
    report += f'Cutoff: {cutoff_date.strftime("%Y-%m-%d")}\n'
    report += f'Total deleted: {len(removed_snaps)}\n'
    report += '=' * 80 + '\n\n'
    for item in removed_snaps:
        report += f'ID: {item["id"]}  |  Name: {item["name"]}  |  Created: {item["created"]}\n'
    s3.put_object(Bucket=REPORT_BUCKET, Key=key,
                  Body=report.encode('utf-8'), ContentType='text/plain')
    return f's3://{REPORT_BUCKET}/{key}'
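Wiring the two helpers together in the handler might look like the sketch below. `build_summary` is a hypothetical helper (not part of the original code); the point is that the email body stays fixed-size no matter how many snapshots were deleted, with the full listing living only in S3.

```python
from datetime import datetime, timezone

def build_summary(removed_snaps, cutoff_date, report_link):
    """Fixed-size email body: counts and a link, never the full listing."""
    return (
        'EBS Snapshot Cleanup Complete\n'
        f'Cutoff date: {cutoff_date:%Y-%m-%d}\n'
        f'Snapshots deleted: {len(removed_snaps)}\n'
        f'Full report: {report_link}\n'
    )

# In the handler, after deletions finish:
#   link = upload_full_report(removed, cutoff)
#   safe_publish('EBS Cleanup Complete', build_summary(removed, cutoff, link))
summary = build_summary(
    [{'id': 'snap-1'}],
    datetime(2024, 1, 7, tzinfo=timezone.utc),
    's3://my-report-bucket/ebs-cleanup-reports/2024-01-07_report.txt',
)
print(summary)
```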

When Fix #1 is enough: If your account typically has fewer than ~3,000 snapshots to delete per run and your Lambda timeout is set to 15 minutes, pagination alone will likely keep you under the wire. Watch CloudWatch — if memory stays under 60% of your limit, you're safe.


Fix #2 — SQS Decoupling (Production Recommended)

The key insight: stop trying to do everything in one Lambda invocation. Split the work into two independent phases — Discovery and Deletion — connected by an SQS queue.

Think of it like a factory floor. A receiving team scans all incoming parts (snapshots), groups them into boxes of 100, and places them on the conveyor belt (SQS). Assembly workers (Deletion Lambdas) each pick up one box, do their 30 seconds of work, and move on. No single worker is overloaded. No single failure brings down the line.

SQS Decoupling Production Recommended

EventBridge (Sunday 4 PM ET)
       |
       v
Discovery Lambda          ← runs once, exits cleanly under 15 min
  - Paginates ALL snapshots
  - Filters by age (> 90 days)
  - Groups into batches of 100
  - Sends each batch as SQS message
       |
       v
SQS Queue                 ← durable, retryable, scalable buffer
  (~260 messages for 26,000 snapshots)
       |
       v  (auto-triggered, up to 1000 concurrent)
Deletion Lambda           ← handles 1 batch = ~30 seconds each
  - Deletes 100 snapshots
  - Handles InUse / NotFound gracefully
  - Raises on throttle so SQS auto-retries
  - Only alerts SNS on real failures

Discovery Lambda — Scan and Enqueue

import boto3, json, os, logging
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
sqs = boto3.client('sqs')
sns = boto3.client('sns')

RETENTION_DAYS = int(os.environ.get('RETENTION_DAYS', 90))
QUEUE_URL      = os.environ['SQS_QUEUE_URL']
TOPIC_ARN      = os.environ['SNS_TOPIC_ARN']
CHUNK_SIZE     = 100

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    cutoff    = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    paginator = ec2.get_paginator('describe_snapshots')

    pending = []
    scanned = 0
    queued  = 0

    for page in paginator.paginate(OwnerIds=['self']):
        for snap in page['Snapshots']:
            scanned += 1
            if snap['StartTime'] < cutoff:
                tag_name = next(
                    (t['Value'] for t in snap.get('Tags', []) if t['Key'] == 'Name'),
                    'Untagged'
                )
                pending.append({
                    'snap_id':  snap['SnapshotId'],
                    'tag_name': tag_name,
                    'created':  snap['StartTime'].strftime('%Y-%m-%d'),
                })

            if len(pending) >= CHUNK_SIZE:
                _enqueue(pending)
                queued += len(pending)
                pending = []

            # Safety valve: flush and exit before timeout
            if context.get_remaining_time_in_millis() < 90_000:
                logger.warning('Approaching timeout — flushing remaining')
                if pending:
                    _enqueue(pending)
                    queued += len(pending)
                _notify(scanned, queued, cutoff, partial=True)
                return {'statusCode': 206, 'queued': queued}

    if pending:
        _enqueue(pending)
        queued += len(pending)

    _notify(scanned, queued, cutoff, partial=False)
    return {'statusCode': 200, 'queued': queued}


def _enqueue(chunk):
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(chunk))
    logger.info(f'Enqueued {len(chunk)} snapshots')


def _notify(scanned, queued, cutoff, partial):
    prefix = 'WARNING: Partial — ' if partial else ''
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f'{prefix}EBS Cleanup — Discovery Complete',
        Message=(
            f'Total snapshots scanned: {scanned}\n'
            f'Queued for deletion: {queued}\n'
            f'Cutoff date: {cutoff.strftime("%Y-%m-%d")}\n'
            f'Batches sent to SQS: {(queued + CHUNK_SIZE - 1) // CHUNK_SIZE}\n'
            + ('\nNOTE: Lambda approached timeout. Already-queued batches will still process.' if partial else '')
        )
    )

Deletion Lambda — Process One Batch

import boto3, json, os, logging

ec2 = boto3.client('ec2')
sns = boto3.client('sns')
TOPIC_ARN = os.environ['SNS_TOPIC_ARN']

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    removed, skipped, failed = [], [], []

    for record in event.get('Records', []):
        batch = json.loads(record['body'])

        for item in batch:
            sid = item['snap_id']
            try:
                ec2.delete_snapshot(SnapshotId=sid)
                removed.append(sid)
                logger.info(f'Deleted: {sid}')

            except ec2.exceptions.ClientError as exc:
                code = exc.response['Error']['Code']

                if code == 'InvalidSnapshot.InUse':
                    logger.warning(f'Skipped (in use by AMI): {sid}')
                    skipped.append({'id': sid, 'reason': 'In use by AMI'})

                elif code == 'InvalidSnapshot.NotFound':
                    logger.info(f'Already gone: {sid}')

                elif code == 'RequestLimitExceeded':
                    # Raise so SQS returns the message for retry
                    logger.warning(f'Throttled — returning to queue: {sid}')
                    raise exc

                else:
                    logger.error(f'Unexpected error {sid}: {exc}')
                    failed.append({'id': sid, 'reason': str(exc)})

    logger.info(f'Batch done — removed={len(removed)} skipped={len(skipped)} failed={len(failed)}')

    if failed:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject='EBS Cleanup — Batch Errors',
            Message='\n'.join(f"{f['id']}: {f['reason']}" for f in failed)
        )

    return {'removed': len(removed), 'skipped': len(skipped), 'failed': len(failed)}

Why raise on throttling? When the Deletion Lambda raises an unhandled exception, SQS treats the message as failed and makes it visible again after the visibility timeout expires. This creates automatic intelligent backoff — AWS throttle recovers, SQS retries, zero extra retry code required.
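The SQS-to-Lambda wiring that makes this retry behavior work is the event source mapping. A sketch, assuming the queue and function names used elsewhere in this article (ARNs are placeholders); `create_event_source_mapping` is the relevant boto3 Lambda API call:

```python
def connect_queue_to_lambda(lambda_client, queue_arn, function_name):
    """Wire SQS to the Deletion Lambda. With BatchSize=1, a raised
    exception returns exactly one batch message to the queue, where it
    becomes visible again for retry after the visibility timeout."""
    return lambda_client.create_event_source_mapping(  # boto3 Lambda API
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=1,
    )

# Usage (ARN and names are placeholders):
#   connect_queue_to_lambda(
#       boto3.client('lambda'),
#       'arn:aws:sqs:us-east-1:123456789012:ebs-cleanup-batches',
#       'ebs-deletion')
```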


Fix #3 — AWS Step Functions (Enterprise Grade)

Step Functions is AWS's managed workflow orchestration service. Instead of writing retry logic, parallelism, and state tracking yourself, you define a state machine — a visual flowchart of your workflow — and AWS handles execution guarantees, retries, and full audit history.

AWS Step Functions Enterprise Grade

What Step Functions adds over the SQS approach

  • Visual state machine in the AWS console — see exactly where a workflow is at any moment
  • Built-in retry with configurable exponential backoff per state
  • Map state — fan out to N parallel Lambda workers simultaneously
  • Full execution history — audit log of every state transition with timestamps
  • No Lambda timeout concern — Step Functions executions can run for up to one year

State Machine Definition

{
  "Comment": "EBS Snapshot Cleanup Orchestrator",
  "StartAt": "DiscoverSnapshots",
  "States": {

    "DiscoverSnapshots": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-discovery",
      "Next": "FanOutDeletion",
      "TimeoutSeconds": 900,
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
        "IntervalSeconds": 30,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }]
    },

    "FanOutDeletion": {
      "Type": "Map",
      "ItemsPath": "$.snapshot_batches",
      "MaxConcurrency": 20,
      "Iterator": {
        "StartAt": "DeleteOneBatch",
        "States": {
          "DeleteOneBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-deletion",
            "End": true,
            "Retry": [{
              "ErrorEquals": ["States.TaskFailed"],
              "IntervalSeconds": 60,
              "MaxAttempts": 3
            }]
          }
        }
      },
      "Next": "SendFinalReport"
    },

    "SendFinalReport": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-reporter",
      "End": true
    }
  }
}
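One wiring detail worth calling out: the Map state reads its work items from `$.snapshot_batches`, so the `ebs-discovery` function's return value must carry the batches under that key. A minimal sketch of the chunking (batch size 100 to match the SQS design; the exact output shape is an assumption, not from the original code):

```python
def chunk_batches(snapshot_ids, size=100):
    """Split eligible snapshot IDs into fixed-size work items."""
    return [snapshot_ids[i:i + size] for i in range(0, len(snapshot_ids), size)]

def discovery_output(snapshot_ids):
    # Shape consumed by the Map state via "ItemsPath": "$.snapshot_batches"
    return {'snapshot_batches': chunk_batches(snapshot_ids)}

out = discovery_output([f'snap-{n}' for n in range(250)])
print([len(b) for b in out['snapshot_batches']])  # [100, 100, 50]
```

Each inner list becomes one invocation of `ebs-deletion`, with up to 20 running in parallel per the `MaxConcurrency` setting.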

Cost note: Step Functions charges per state transition. For 260 batches, you'd generate roughly 800–1,000 state transitions per weekly run — well within the 4,000 free monthly transitions on the Standard workflow tier. For most teams, this runs at zero or near-zero cost.


Part 3: Lambda Invocation Types — Why They Matter

Understanding invocation types is critical to understanding why the original architecture failed silently, and why the SQS fix is reliable.

Lambda Invocation Types

Synchronous — RequestResponse

The caller blocks and waits. The caller gets the actual result. If Lambda fails, the caller knows immediately and can handle it. No automatic retries.

# Synchronous — caller BLOCKS until Lambda returns
response = lambda_client.invoke(
    FunctionName='my-ebs-cleaner',
    InvocationType='RequestResponse',  # default
    Payload=json.dumps({'dry_run': False})
)
result = json.loads(response['Payload'].read())

Used by: API Gateway, direct SDK calls, Lambda console test button, SQS trigger.

Asynchronous — Event (Fire & Forget)

The caller gets HTTP 202 immediately and moves on. Lambda runs in the background. On failure, Lambda retries automatically twice. The caller never sees the outcome.

# Asynchronous — caller gets 202 immediately and moves on
response = lambda_client.invoke(
    FunctionName='my-ebs-cleaner',
    InvocationType='Event',
    Payload=json.dumps({'source': 'scheduled-trigger'})
)
# response['StatusCode'] == 202
# Lambda runs in background — caller never sees success or failure

Used by: EventBridge, SNS, S3 event notifications.

Why this explains the silent failure: EventBridge uses async invocation. When the Lambda timed out three consecutive times, EventBridge had no idea — it received a 202 Accepted and moved on. The three attempts weren't code anyone wrote; they were Lambda's built-in async behavior (the initial invocation plus two automatic retries). And because the function was terminated before it ever reached the notification code, no SNS notification arrived.

Property          | Synchronous           | Asynchronous
------------------|-----------------------|--------------------------------
Caller behavior   | Blocks and waits      | Returns 202 immediately
Return value      | Delivered to caller   | Discarded
On failure        | Caller handles it     | Lambda retries ×2 automatically
Timeout awareness | Caller sees it        | Caller never knows
In our solution   | Deletion Lambda (SQS) | Discovery Lambda (EventBridge)
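If a timed-out cleanup run should not be silently re-run, the async retry count can also be turned down explicitly. A sketch using `put_function_event_invoke_config` (the function name is a placeholder):

```python
def disable_async_retries(lambda_client, function_name):
    """Stop Lambda from re-running the function after an async failure."""
    return lambda_client.put_function_event_invoke_config(  # boto3 Lambda API
        FunctionName=function_name,
        MaximumRetryAttempts=0,         # default is 2
        MaximumEventAgeInSeconds=3600,  # drop queued events older than 1 hour
    )

# Usage:
#   disable_async_retries(boto3.client('lambda'), 'my-ebs-cleaner')
```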

Part 4: What the Logs Showed After the Fix

After increasing memory to 1,000 MB and timeout to 15 minutes (with the pagination fix applied):

Run #1 — First Execution

Duration:   334,576 ms  (~5.5 minutes)
Memory:     820 MB used of 1,000 MB configured
Status:     Succeeded
Deletions:  Completed successfully
SNS:        FAILED — InvalidParameter: Message too long

# Deletions worked. Memory held. Pagination solved the OOM.
# But SNS still hit the 256KB limit on the summary message.
# Fix: S3 report + truncated SNS.

Run #2 — Immediately After

Duration:   41,697 ms  (~41 seconds)
Memory:     730 MB used
Total snapshots in account:  25,143
Snapshots eligible for deletion: 0
SNS: Delivered successfully

# Run #1 already deleted all the 90-day-old ones.
# 25K+ snapshots scanned in 41 seconds with pagination.
# Small result message = no SNS size issue.

Performance before vs. after:
Before: Three consecutive timeouts at 10 minutes each. 30 minutes of wasted compute. Zero confirmed deletions. Zero notifications.
After: Completed in 5.5 minutes. All eligible snapshots deleted. Remaining SNS issue resolved with S3 report storage.


Part 5: Implementation Checklist

Lambda Configuration

  1. Set timeout to 15 minutes (maximum) for the Discovery Lambda
  2. Set memory to at least 1,024 MB — Lambda allocates proportional CPU to memory
  3. Use environment variables for all ARNs and thresholds — never hardcode
  4. Enable X-Ray tracing to profile slow API calls
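Items 1 and 2 can be applied from a script as well as the console. A sketch using `update_function_configuration` (the function name is a placeholder):

```python
def apply_lambda_settings(lambda_client, function_name):
    """Max out the timeout and give the function enough memory (and
    therefore CPU) to paginate a large account comfortably."""
    return lambda_client.update_function_configuration(  # boto3 Lambda API
        FunctionName=function_name,
        Timeout=900,      # 15 minutes, the hard maximum
        MemorySize=1024,  # MB; CPU is allocated proportionally
    )

# Usage:
#   apply_lambda_settings(boto3.client('lambda'), 'ebs-discovery')
```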

IAM Permissions Required

  • ec2:DescribeSnapshots — list all snapshots
  • ec2:DeleteSnapshot — delete eligible snapshots
  • sns:Publish — send email notifications
  • s3:PutObject — upload full deletion report
  • sqs:SendMessage — Discovery Lambda to SQS (Fix #2 only)
  • sqs:ReceiveMessage, sqs:DeleteMessage — Deletion Lambda (Fix #2 only)
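Expressed as a policy document, the Fix #1 permissions might look like the sketch below. Resource scoping here is illustrative and should be tightened to your actual ARNs; note that EC2 `Describe*` actions do not support resource-level scoping, while `DeleteSnapshot` can be restricted to snapshot ARNs.

```python
import json

def cleanup_policy(report_bucket):
    """Least-privilege-ish policy for the cleanup Lambda (Fix #1 scope)."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {'Effect': 'Allow',
             'Action': ['ec2:DescribeSnapshots', 'ec2:DeleteSnapshot'],
             'Resource': '*'},   # scope DeleteSnapshot to snapshot ARNs in practice
            {'Effect': 'Allow',
             'Action': 'sns:Publish',
             'Resource': '*'},   # scope to the topic ARN in practice
            {'Effect': 'Allow',
             'Action': 's3:PutObject',
             'Resource': f'arn:aws:s3:::{report_bucket}/ebs-cleanup-reports/*'},
        ],
    }

print(json.dumps(cleanup_policy('my-report-bucket'), indent=2))
```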

SQS Queue Settings

  1. Visibility timeout: set to 6× your Deletion Lambda timeout
  2. Message retention: 4 days minimum
  3. Dead Letter Queue: maxReceiveCount = 3
  4. Lambda event source mapping: batch size = 1
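The same settings expressed in code (queue name and DLQ ARN are placeholders; `create_queue` is the relevant boto3 SQS call, which takes all attributes as strings):

```python
import json

def create_cleanup_queue(sqs_client, dlq_arn, lambda_timeout_s=120):
    """Create the batch queue with the checklist settings applied."""
    return sqs_client.create_queue(  # boto3 SQS API
        QueueName='ebs-cleanup-batches',
        Attributes={
            'VisibilityTimeout': str(lambda_timeout_s * 6),  # 6x Lambda timeout
            'MessageRetentionPeriod': str(4 * 24 * 3600),    # 4 days
            'RedrivePolicy': json.dumps({
                'deadLetterTargetArn': dlq_arn,
                'maxReceiveCount': 3,                        # then route to DLQ
            }),
        },
    )

# Usage (DLQ ARN is a placeholder):
#   create_cleanup_queue(boto3.client('sqs'),
#       'arn:aws:sqs:us-east-1:123456789012:ebs-cleanup-dlq')
```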

Monitoring & Alerting

  • CloudWatch alarm: Lambda errors > 0
  • CloudWatch alarm: SQS messages not visible (stuck batches)
  • CloudWatch alarm: DLQ message count > 0
  • Weekly SNS digest: deletion count + S3 report link

Closing Thoughts

This is a story that plays out on AWS teams everywhere. A function that works perfectly at 500 snapshots silently becomes a liability at 5,000. The gap between "it works in dev" and "it survives production at scale" is exactly where thoughtful cloud architecture lives.

The five failure modes — memory exhaustion, the 15-minute ceiling, SNS size limits, no retry intelligence, and poor error differentiation — are each individually understandable. The dangerous part is how they compound quietly, only revealing themselves at scale, on a Sunday, in production.

Whether you choose the quick pagination fix, the SQS decoupling pattern, or the full Step Functions orchestration, the underlying principle is the same: don't fight your platform's constraints. Design around them.

The three rules, simplified:

  1. Never load unbounded data into memory — always paginate.
  2. Never put more than 15 minutes of work into a single Lambda — split it.
  3. Never put an unbounded string into SNS — summarize in email, detail in S3.

Found this useful? Drop a comment or share it with someone debugging Lambda timeouts right now.

aws lambda ebs serverless cloud-architecture devops aws-lambda boto3
