Originally published on Medium
A real-world breakdown of 5 compounding failure modes — memory exhaustion, Lambda timeouts, SNS limits, missing retry logic, and undifferentiated errors — and three progressively powerful architectures to fix them.
The Sunday Night That Changed Everything
Picture this: it's Sunday at 4 PM. A scheduled EventBridge rule quietly fires off your Lambda function. Its job? Simple. Scan all your EBS snapshots, find anything older than 90 days, delete it, and send a confirmation email.
Except it never sends that email. Because it never finishes.
Ten minutes pass. The Lambda runtime does what it always does when a function overstays its welcome — kills it. Hard stop. No cleanup. No notification. No idea how many snapshots (if any) were actually deleted. And then, because Lambda has a retry policy for async invocations, it tries again. And again. Three times total. All timeouts.
The scale problem in numbers: 26,000+ EBS snapshots in a single AWS account. A Lambda function loading ALL of them into memory at once. 512 MB of RAM. A 10-minute timeout. You do the math — it never stood a chance.
This article is the complete breakdown: what went wrong, why it went wrong at scale, and three progressively powerful architectures to fix it — from a 30-minute patch to a fully orchestrated enterprise solution.
Part 1: Anatomy of the Failure
Before jumping to solutions, let's understand all five failure modes. Most articles fix one. This one had five. Fix only one and you'll still fail.
Failure #1 — The Memory Bomb
The original function's first line of real work looked like this:
# Loads every single snapshot into RAM simultaneously
all_snaps = ec2_client.describe_snapshots(OwnerIds=['self'])['Snapshots']
# With 26,000+ snapshots, this single line can consume
# hundreds of megabytes before a single deletion attempt
What's happening here? The describe_snapshots API without pagination loads all results into a Python list simultaneously. The boto3 SDK does not auto-paginate for you. Beyond the memory issue, if you have more than 1,000 snapshots, you're silently missing results — the API pages at 1,000 items and you're never requesting the next page.
Definition: API Pagination
Most AWS list/describe APIs return results in pages of up to 1,000 items. Each response includes a NextToken field. If present, more results exist — you must loop, passing the token back, until NextToken is absent. Failing to do this means silently missing data at scale.
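The loop itself is short. Here is a minimal sketch of the NextToken pattern, written against a generic `describe_fn` callable so the token-following logic is visible and testable; with boto3 you would pass `ec2.describe_snapshots` directly:

```python
def fetch_all_snapshots(describe_fn, **base_kwargs):
    """Follow NextToken until it's absent, yielding snapshots page by page.

    describe_fn is any callable with the describe_snapshots response shape:
    {'Snapshots': [...], 'NextToken': '...'} (e.g. ec2.describe_snapshots).
    """
    kwargs = dict(base_kwargs)
    while True:
        resp = describe_fn(**kwargs)
        yield from resp['Snapshots']
        token = resp.get('NextToken')
        if not token:               # no token means this was the last page
            return
        kwargs['NextToken'] = token
```

With a real client: `for snap in fetch_all_snapshots(ec2.describe_snapshots, OwnerIds=['self']): ...` — boto3's built-in paginators (shown later) do the same thing for you.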
Failure #2 — The 15-Minute Wall
AWS Lambda has an absolute maximum execution time of 15 minutes. No exceptions. No extensions. No negotiating. When the clock hits zero, your function is terminated — mid-loop, mid-deletion, mid-anything.
With 26,000 snapshots to delete sequentially (one API call each), you're looking at potentially 30–60 minutes of work. Lambda simply cannot complete this in a single invocation.
| What You Need | What Lambda Gives You |
|---|---|
| 30–60+ min to process 26K snapshots sequentially | 15 minutes maximum — hard limit, no exceptions |
| Progress saved if interrupted | Complete restart from scratch on every retry |
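The back-of-envelope math behind the "30–60 minutes" estimate, assuming roughly 100 ms per DeleteSnapshot call (an assumed average; real latency varies, and throttling pushes it higher):

```python
# Why 26,000 sequential deletions can't fit inside Lambda's ceiling.
SNAPSHOTS = 26_000
SECONDS_PER_DELETE = 0.1     # assumed average per-call latency
LAMBDA_CEILING_MIN = 15

total_minutes = SNAPSHOTS * SECONDS_PER_DELETE / 60
print(f'Sequential work: ~{total_minutes:.0f} minutes')   # ~43 minutes
print(f'Lambda ceiling:   {LAMBDA_CEILING_MIN} minutes')
```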
Failure #3 — The SNS 256 KB Wall
AWS SNS has a hard 256 KB per-message limit. The original code built one massive string listing every deleted snapshot — ID, name, dates, description — for potentially thousands of entries. That string will exceed 256 KB every time.
# This string grows indefinitely — will blow past 256KB
email_body = 'Snapshot cleanup results:\n\n'
for snap in deleted_list:
    email_body += f'Snapshot ID: {snap["id"]}\n'
    email_body += f'Name: {snap["name"]}\n'
    email_body += f'Created: {snap["created"]}\n'
    email_body += '-' * 80 + '\n'

# Throws: InvalidParameter: Message too long
sns_client.publish(TopicArn=TOPIC_ARN, Message=email_body)
Failure #4 — No State, No Memory
Each timeout causes Lambda to retry from scratch. The function re-fetches all snapshots, re-attempts deletions that may have already succeeded, and has zero awareness of previous progress. No checkpoint. No tracking. No "pick up where I left off."
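The checkpointing fix can be tiny. The sketch below assumes you persist the IDs each run has already handled to durable storage (a DynamoDB item or an S3 object, written after each batch; the persistence layer is assumed, not shown) and filter them out on restart:

```python
def remaining_work(all_ids, completed_ids):
    """Return only the snapshot IDs a previous (interrupted) run hasn't handled.

    completed_ids would be loaded from durable storage, e.g. a DynamoDB item
    or an S3 object written after each successful batch (assumed here).
    """
    done = set(completed_ids)
    return [sid for sid in all_ids if sid not in done]
```

With this in place, a retry resumes instead of redoing (and possibly re-failing) the same work.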
Failure #5 — No Error Differentiation
Not all snapshot deletion errors are equal:
- A snapshot "in use" by an AMI legitimately cannot be deleted — skip it permanently
- A throttling error should trigger a retry
- A permissions error is a bug that needs an alert
The original code caught all exceptions identically: log and continue.
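The triage that list implies fits in a few lines. A sketch (not the original code) mapping EC2 error codes to actions:

```python
# Map an EC2 error code to an action: skip, retry, or alert.
SKIP_CODES = {'InvalidSnapshot.InUse', 'InvalidSnapshot.NotFound'}
RETRY_CODES = {'RequestLimitExceeded', 'Throttling'}

def classify_error(code):
    if code in SKIP_CODES:
        return 'skip'      # legitimately undeletable, or already gone
    if code in RETRY_CODES:
        return 'retry'     # transient; back off and try again
    return 'alert'         # unexpected (e.g. a permissions bug): page a human
```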
Actual production log:
Duration: 603,000 ms | Status: timeout | Memory Used: 512 MB (maxed)
The function ran for its full timeout limit, consumed every byte of allocated memory, and terminated silently. This happened three consecutive times. Zero snapshots confirmed deleted. Zero notifications sent.
Part 2: Three Ways to Fix It
There’s no single “correct” answer. The right solution depends on your account size, your team’s infrastructure comfort, and how much observability you need. We’ll walk all three — from quickest to most powerful.
Fix #1 — Pagination + Safe SNS (The Quick Win)
This is the minimum viable fix. It eliminates memory exhaustion and silent data-skipping. If your account has fewer than a few thousand snapshots to delete per run, this may be all you need.
The Paginator Pattern
AWS boto3 ships with built-in paginators for most describe/list operations. They handle the NextToken loop automatically and stream results one page at a time — never loading everything into RAM.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
RETENTION_DAYS = 90
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

# ❌ OLD — loads everything at once, silently misses >1000 results
# snaps = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']

# ✅ NEW — streams page by page, handles any account size
paginator = ec2.get_paginator('describe_snapshots')
eligible_ids = []
for page in paginator.paginate(OwnerIds=['self']):
    for snap in page['Snapshots']:
        if snap['StartTime'] < cutoff:
            eligible_ids.append(snap['SnapshotId'])

print(f'Found {len(eligible_ids)} snapshots older than {RETENTION_DAYS} days')
Safe SNS — Truncate + S3 Report
Cap the SNS message at 240 KB and store the full report in S3. The email contains a summary and a direct S3 link. No more Message too long errors.
import boto3, os
from datetime import datetime, timezone

sns = boto3.client('sns')
s3 = boto3.client('s3')
TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
REPORT_BUCKET = os.environ['S3_REPORT_BUCKET']
SNS_BYTE_CAP = 240_000  # safely under the 256KB SNS hard limit

def safe_publish(subject, content):
    encoded = content.encode('utf-8')
    if len(encoded) > SNS_BYTE_CAP:
        # Truncate on a byte boundary; errors='ignore' drops any split character
        content = encoded[:SNS_BYTE_CAP].decode('utf-8', errors='ignore')
        content += '\n\n[Truncated — full report saved to S3]'
    sns.publish(TopicArn=TOPIC_ARN, Subject=subject[:100], Message=content)

def upload_full_report(removed_snaps, cutoff_date):
    ts = datetime.now(timezone.utc).strftime('%Y-%m-%d_%H-%M-%S')
    key = f'ebs-cleanup-reports/{ts}_report.txt'
    report = f'EBS Snapshot Cleanup Report\nGenerated: {ts}\n'
    report += f'Cutoff: {cutoff_date.strftime("%Y-%m-%d")}\n'
    report += f'Total deleted: {len(removed_snaps)}\n'
    report += '=' * 80 + '\n\n'
    for item in removed_snaps:
        report += f'ID: {item["id"]} | Name: {item["name"]} | Created: {item["created"]}\n'
    s3.put_object(Bucket=REPORT_BUCKET, Key=key,
                  Body=report.encode('utf-8'), ContentType='text/plain')
    return f's3://{REPORT_BUCKET}/{key}'
When Fix #1 is enough: If your account typically has fewer than ~3,000 snapshots to delete per run and your Lambda timeout is set to 15 minutes, pagination alone will likely keep you under the wire. Watch CloudWatch — if memory stays under 60% of your limit, you're safe.
Fix #2 — SQS Decoupling (Production Recommended)
The key insight: stop trying to do everything in one Lambda invocation. Split the work into two independent phases — Discovery and Deletion — connected by an SQS queue.
Think of it like a factory floor. A receiving team scans all incoming parts (snapshots), groups them into boxes of 100, and places them on the conveyor belt (SQS). Assembly workers (Deletion Lambdas) each pick up one box, do their 30 seconds of work, and move on. No single worker is overloaded. No single failure brings down the line.
EventBridge (Sunday 4 PM ET)
        |
        v
Discovery Lambda          ← runs once, exits cleanly under 15 min
  - Paginates ALL snapshots
  - Filters by age (> 90 days)
  - Groups into batches of 100
  - Sends each batch as SQS message
        |
        v
SQS Queue                 ← durable, retryable, scalable buffer
  (~260 messages for 26,000 snapshots)
        |
        v  (auto-triggered, up to 1000 concurrent)
Deletion Lambda           ← handles 1 batch = ~30 seconds each
  - Deletes 100 snapshots
  - Handles InUse / NotFound gracefully
  - Raises on throttle so SQS auto-retries
  - Only alerts SNS on real failures
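The batching step at the heart of this pipeline is a short generator, roughly:

```python
def chunks(items, size=100):
    """Yield successive fixed-size batches: one batch per SQS message."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

26,000 snapshot IDs through `chunks(..., 100)` produce the ~260 SQS messages shown in the diagram.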
Discovery Lambda — Scan and Enqueue
import boto3, json, os, logging
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
sqs = boto3.client('sqs')
sns = boto3.client('sns')
RETENTION_DAYS = int(os.environ.get('RETENTION_DAYS', 90))
QUEUE_URL = os.environ['SQS_QUEUE_URL']
TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
CHUNK_SIZE = 100

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    paginator = ec2.get_paginator('describe_snapshots')
    pending = []
    scanned = 0
    queued = 0
    for page in paginator.paginate(OwnerIds=['self']):
        for snap in page['Snapshots']:
            scanned += 1
            if snap['StartTime'] < cutoff:
                tag_name = next(
                    (t['Value'] for t in snap.get('Tags', []) if t['Key'] == 'Name'),
                    'Untagged'
                )
                pending.append({
                    'snap_id': snap['SnapshotId'],
                    'tag_name': tag_name,
                    'created': snap['StartTime'].strftime('%Y-%m-%d'),
                })
                if len(pending) >= CHUNK_SIZE:
                    _enqueue(pending)
                    queued += len(pending)
                    pending = []
        # Safety valve: flush and exit before timeout
        if context.get_remaining_time_in_millis() < 90_000:
            logger.warning('Approaching timeout — flushing remaining')
            if pending:
                _enqueue(pending)
                queued += len(pending)
            _notify(scanned, queued, cutoff, partial=True)
            return {'statusCode': 206, 'queued': queued}
    if pending:
        _enqueue(pending)
        queued += len(pending)
    _notify(scanned, queued, cutoff, partial=False)
    return {'statusCode': 200, 'queued': queued}

def _enqueue(chunk):
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(chunk))
    logger.info(f'Enqueued {len(chunk)} snapshots')

def _notify(scanned, queued, cutoff, partial):
    prefix = 'WARNING: Partial — ' if partial else ''
    batches = -(-queued // CHUNK_SIZE)  # ceiling division — exact batch count
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f'{prefix}EBS Cleanup — Discovery Complete',
        Message=(
            f'Total snapshots scanned: {scanned}\n'
            f'Queued for deletion: {queued}\n'
            f'Cutoff date: {cutoff.strftime("%Y-%m-%d")}\n'
            f'Batches sent to SQS: {batches}\n'
            + ('\nNOTE: Lambda approached timeout. Already-queued batches will still process.' if partial else '')
        )
    )
Deletion Lambda — Process One Batch
import boto3, json, os, logging

ec2 = boto3.client('ec2')
sns = boto3.client('sns')
TOPIC_ARN = os.environ['SNS_TOPIC_ARN']

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    removed, skipped, failed = [], [], []
    for record in event.get('Records', []):
        batch = json.loads(record['body'])
        for item in batch:
            sid = item['snap_id']
            try:
                ec2.delete_snapshot(SnapshotId=sid)
                removed.append(sid)
                logger.info(f'Deleted: {sid}')
            except ec2.exceptions.ClientError as exc:
                code = exc.response['Error']['Code']
                if code == 'InvalidSnapshot.InUse':
                    logger.warning(f'Skipped (in use by AMI): {sid}')
                    skipped.append({'id': sid, 'reason': 'In use by AMI'})
                elif code == 'InvalidSnapshot.NotFound':
                    logger.info(f'Already gone: {sid}')
                elif code == 'RequestLimitExceeded':
                    # Raise so SQS returns the message for retry
                    logger.warning(f'Throttled — returning to queue: {sid}')
                    raise
                else:
                    logger.error(f'Unexpected error {sid}: {exc}')
                    failed.append({'id': sid, 'reason': str(exc)})
    logger.info(f'Batch done — removed={len(removed)} skipped={len(skipped)} failed={len(failed)}')
    if failed:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject='EBS Cleanup — Batch Errors',
            Message='\n'.join(f"{f['id']}: {f['reason']}" for f in failed)
        )
    return {'removed': len(removed), 'skipped': len(skipped), 'failed': len(failed)}
Why raise on throttling? When the Deletion Lambda raises an unhandled exception, SQS treats the message as failed and makes it visible again after the visibility timeout expires. This creates automatic intelligent backoff — AWS throttle recovers, SQS retries, zero extra retry code required.
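One caveat: with batch sizes larger than 1, a plain raise re-queues every message in the batch, including ones that succeeded. SQS event source mappings support partial batch responses for this: enable ReportBatchItemFailures on the mapping and return only the failed message IDs. A sketch of the response shape Lambda expects:

```python
def batch_failure_response(failed_message_ids):
    """Build the partial-batch response used when the SQS event source mapping
    has ReportBatchItemFailures enabled: only the listed messages are retried."""
    return {
        'batchItemFailures': [
            {'itemIdentifier': mid} for mid in failed_message_ids
        ]
    }
```

With batch size = 1 (as recommended in the checklist later), a plain raise is equivalent and simpler.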
Fix #3 — AWS Step Functions (Enterprise Grade)
Step Functions is AWS's managed workflow orchestration service. Instead of writing retry logic, parallelism, and state tracking yourself, you define a state machine — a visual flowchart of your workflow — and AWS handles execution guarantees, retries, and full audit history.
What Step Functions adds over the SQS approach
- Visual state machine in the AWS console — see exactly where a workflow is at any moment
- Built-in retry with configurable exponential backoff per state
- Map state — fan out to N parallel Lambda workers simultaneously
- Full execution history — audit log of every state transition with timestamps
- No Lambda timeout concern — Step Functions executions can run for up to one year
State Machine Definition
{
  "Comment": "EBS Snapshot Cleanup Orchestrator",
  "StartAt": "DiscoverSnapshots",
  "States": {
    "DiscoverSnapshots": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-discovery",
      "Next": "FanOutDeletion",
      "TimeoutSeconds": 900,
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
        "IntervalSeconds": 30,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }]
    },
    "FanOutDeletion": {
      "Type": "Map",
      "ItemsPath": "$.snapshot_batches",
      "MaxConcurrency": 20,
      "Iterator": {
        "StartAt": "DeleteOneBatch",
        "States": {
          "DeleteOneBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-deletion",
            "End": true,
            "Retry": [{
              "ErrorEquals": ["States.TaskFailed"],
              "IntervalSeconds": 60,
              "MaxAttempts": 3
            }]
          }
        }
      },
      "Next": "SendFinalReport"
    },
    "SendFinalReport": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ebs-reporter",
      "End": true
    }
  }
}
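Kicking this state machine off (from the Discovery Lambda, or a one-off script) is a single start_execution call. The helper below shapes the input document that the Map state's ItemsPath ($.snapshot_batches) iterates over; the state machine name in the commented usage is a placeholder:

```python
import json

def build_execution_input(snapshot_batches):
    """Serialize batches into the input the Map state's ItemsPath expects."""
    return json.dumps({'snapshot_batches': snapshot_batches})

# Hypothetical usage (requires AWS credentials and a deployed state machine):
# import boto3
# sfn = boto3.client('stepfunctions')
# sfn.start_execution(
#     stateMachineArn='arn:aws:states:REGION:ACCOUNT:stateMachine:ebs-cleanup',
#     input=build_execution_input(batches),
# )
```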
Cost note: Step Functions charges per state transition. For 260 batches, you'd generate roughly 800–1,000 state transitions per weekly run — well within the 4,000 free monthly transitions on the Standard workflow tier. For most teams, this runs at zero or near-zero cost.
Part 3: Lambda Invocation Types — Why They Matter
Understanding invocation types is critical to understanding why the original architecture failed silently, and why the SQS fix is reliable.
Synchronous — RequestResponse
The caller blocks and waits. The caller gets the actual result. If Lambda fails, the caller knows immediately and can handle it. No automatic retries.
import boto3, json

lambda_client = boto3.client('lambda')

# Synchronous — caller BLOCKS until Lambda returns
response = lambda_client.invoke(
    FunctionName='my-ebs-cleaner',
    InvocationType='RequestResponse',  # the default
    Payload=json.dumps({'dry_run': False})
)
result = json.loads(response['Payload'].read())
Used by: API Gateway, direct SDK calls, Lambda console test button, SQS trigger.
Asynchronous — Event (Fire & Forget)
The caller gets HTTP 202 immediately and moves on. Lambda runs in the background. On failure, Lambda retries automatically twice. The caller never sees the outcome.
# Asynchronous — caller gets 202 immediately and moves on
response = lambda_client.invoke(
FunctionName='my-ebs-cleaner',
InvocationType='Event',
Payload=json.dumps({'source': 'scheduled-trigger'})
)
# response['StatusCode'] == 202
# Lambda runs in background — caller never sees success or failure
Used by: EventBridge, SNS, S3 event notifications.
Why this explains the silent failure: EventBridge uses async invocation. When the Lambda timed out three consecutive times, EventBridge had no idea — it got a 202 Accepted and moved on. The three attempts weren't code you wrote — they were Lambda's built-in async behavior (two automatic retries after the first failure). This is why no SNS notification ever arrived: the function was terminated before it reached the notification code.
| Property | Synchronous | Asynchronous |
|---|---|---|
| Caller behavior | Blocks and waits | Returns 202 immediately |
| Return value | Delivered to caller | Discarded |
| On failure | Caller handles it | Lambda retries ×2 automatically |
| Timeout awareness | Caller sees it | Caller never knows |
| In our solution | Deletion Lambda (SQS) | Discovery Lambda (EventBridge) |
Part 4: What the Logs Showed After the Fix
After increasing memory to 1,000 MB and timeout to 15 minutes (with the pagination fix applied):
Run #1 — First Execution
Duration: 334,576 ms (~5.5 minutes)
Memory: 820 MB used of 1,000 MB configured
Status: Succeeded
Deletions: Completed successfully
SNS: FAILED — InvalidParameter: Message too long
# Deletions worked. Memory held. Pagination solved the OOM.
# But SNS still hit the 256KB limit on the summary message.
# Fix: S3 report + truncated SNS.
Run #2 — Immediately After
Duration: 41,697 ms (~41 seconds)
Memory: 730 MB used
Total snapshots in account: 25,143
Snapshots eligible for deletion: 0
SNS: Delivered successfully
# Run #1 already deleted all the 90-day-old ones.
# 25K+ snapshots scanned in 41 seconds with pagination.
# Small result message = no SNS size issue.
Performance before vs. after:
Before: Three consecutive timeouts at 10 minutes each. 30 minutes of wasted compute. Zero confirmed deletions. Zero notifications.
After: Completed in 5.5 minutes. All eligible snapshots deleted. Remaining SNS issue resolved with S3 report storage.
Part 5: Implementation Checklist
Lambda Configuration
- Set timeout to 15 minutes (maximum) for the Discovery Lambda
- Set memory to at least 1,024 MB — Lambda allocates proportional CPU to memory
- Use environment variables for all ARNs and thresholds — never hardcode
- Enable X-Ray tracing to profile slow API calls
IAM Permissions Required
- ec2:DescribeSnapshots — list all snapshots
- ec2:DeleteSnapshot — delete eligible snapshots
- sns:Publish — send email notifications
- s3:PutObject — upload full deletion report
- sqs:SendMessage — Discovery Lambda to SQS (Fix #2 only)
- sqs:ReceiveMessage, sqs:DeleteMessage — Deletion Lambda (Fix #2 only)
SQS Queue Settings
- Visibility timeout: set to 6× your Deletion Lambda timeout
- Message retention: 4 days minimum
- Dead Letter Queue: maxReceiveCount = 3
- Lambda event source mapping: batch size = 1
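In boto3, the DLQ wiring from this checklist is a RedrivePolicy attribute on the main queue. A sketch, with illustrative queue names and a Deletion Lambda timeout of 150 seconds assumed for the 6x visibility calculation:

```python
import json

def redrive_policy(dlq_arn, max_receive_count=3):
    """RedrivePolicy JSON: after max_receive_count failed receives,
    SQS moves the message to the dead-letter queue instead of retrying."""
    return json.dumps({
        'deadLetterTargetArn': dlq_arn,
        'maxReceiveCount': max_receive_count,
    })

# Hypothetical usage (requires AWS credentials; names are illustrative):
# import boto3
# sqs = boto3.client('sqs')
# sqs.create_queue(
#     QueueName='ebs-cleanup-batches',
#     Attributes={
#         'VisibilityTimeout': '900',          # 6x a 150s Deletion Lambda timeout
#         'MessageRetentionPeriod': '345600',  # 4 days
#         'RedrivePolicy': redrive_policy('arn:aws:sqs:REGION:ACCOUNT:ebs-cleanup-dlq'),
#     },
# )
```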
Monitoring & Alerting
- CloudWatch alarm: Lambda errors > 0
- CloudWatch alarm: SQS messages not visible (stuck batches)
- CloudWatch alarm: DLQ message count > 0
- Weekly SNS digest: deletion count + S3 report link
Closing Thoughts
This is a story that plays out on AWS teams everywhere. A function that works perfectly at 500 snapshots silently becomes a liability at 5,000. The gap between "it works in dev" and "it survives production at scale" is exactly where thoughtful cloud architecture lives.
The five failure modes — memory exhaustion, the 15-minute ceiling, SNS size limits, no retry intelligence, and poor error differentiation — are each individually understandable. The dangerous part is how they compound quietly, only revealing themselves at scale, on a Sunday, in production.
Whether you choose the quick pagination fix, the SQS decoupling pattern, or the full Step Functions orchestration, the underlying principle is the same: don't fight your platform's constraints. Design around them.
The three rules, simplified:
- Never load unbounded data into memory — always paginate.
- Never put more than 15 minutes of work into a single Lambda — split it.
- Never put an unbounded string into SNS — summarize in email, detail in S3.
Found this useful? Drop a comment or share it with someone debugging Lambda timeouts right now.