Cost optimization in serverless is often discussed as a pricing exercise. In practice, I have found it is much more effective when treated as an architecture review discipline.
That is the lens I use in this post.
This is a practical framework I use to review serverless systems for cost efficiency without weakening reliability, operability, or delivery speed. It is designed for engineers, architects, and technical leads who want to go beyond “reduce Lambda memory” and make better design decisions across services.
I will focus on AWS serverless building blocks and the tradeoffs between:
- AWS Lambda
- AWS Step Functions
- Amazon EventBridge Pipes
- Direct service integrations (for example, Step Functions SDK/service integrations)
- Supporting services like SQS, EventBridge, CloudWatch Logs, DynamoDB, and S3
I will cover:
- Cost hotspots in serverless (invocations, state transitions, logs, data transfer)
- Architecture changes that reduce cost without harming reliability
- Tradeoffs: Lambda vs Pipes vs service integrations vs Step Functions
- Workload profiling and right-sizing patterns
- End-to-end walkthrough and implementation discussion
- Architecture and code examples you can adapt
Why this topic stands out
I like this topic because it is senior-level but immediately useful.
Almost every serverless system starts cost-efficient, then gradually accumulates “cost drag” as features are added:
- Extra orchestration steps
- Helper Lambdas that only transform payloads
- Verbose logging left on in production
- Duplicate event fan-out and reprocessing
- Cross-AZ / cross-region / internet egress paths that nobody revisits
- Overbuilt retries and polling loops
None of these are necessarily “wrong.” In many cases they are the fastest way to ship.
But once traffic grows, it becomes worth doing a structured review.
The goal is not to squeeze every cent out of the system. The goal is to spend intentionally, with a clear understanding of where cost comes from and what reliability/maintainability benefits we are buying with that spend.
What I mean by a cost-aware serverless architecture review
A cost-aware review is a design review that asks four questions for each path in the workload:
1. What is the unit of work?
   - One event, one request, one file, one state transition, one batch, one customer action
2. What services does that unit of work touch?
   - Lambda, Step Functions, EventBridge, SQS, DynamoDB, logs, storage, network paths
3. Which of those touches scales linearly with traffic, and which scales with payload size, duration, or retries?
   - This is where the real hotspots show up
4. Can I reduce cost by changing the architecture instead of only tuning a single service?
   - Example: replacing three "glue Lambdas" with native integrations
That last question is the most important one.
Framework Overview
This is the practical framework I use in reviews:
Phase 1: Profile the workload
Understand traffic shape and execution shape:
- request rate, concurrency, burstiness
- payload sizes
- durations
- retries
- fan-out counts
- error rates
- cold-start sensitivity
- async vs sync paths
Phase 2: Map the cost path
Trace the cost of one unit of work across services:
- invocations
- transitions
- messages
- logs ingested/stored
- storage reads/writes
- data transfer
- optional downstream analytics/monitoring copies
Phase 3: Identify hotspots
Rank cost drivers by:
- total monthly impact
- growth rate with scale
- ease of remediation
- risk of change
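The ranking in Phase 3 can be kept honest with a simple weighted score. A minimal sketch, with hypothetical drivers, weights, and numbers:

```python
# Rank cost drivers by monthly impact and growth, discounted by the effort
# and risk of changing them. All drivers and numbers are hypothetical.
def rank_hotspots(drivers):
    def score(d):
        return (d["monthly_usd"] * d["growth_factor"]) / (d["effort"] * d["risk"])
    return sorted(drivers, key=score, reverse=True)

drivers = [
    {"name": "step_functions_transitions", "monthly_usd": 900, "growth_factor": 1.4, "effort": 2, "risk": 2},
    {"name": "glue_lambda_invocations", "monthly_usd": 400, "growth_factor": 1.2, "effort": 1, "risk": 1},
    {"name": "log_ingestion", "monthly_usd": 600, "growth_factor": 1.5, "effort": 1, "risk": 1},
]

ranked = [d["name"] for d in rank_hotspots(drivers)]
```

The weights are crude on purpose; the point is to force an explicit impact-versus-effort conversation, not to be precise.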
Phase 4: Apply architecture changes
Prefer changes that:
- remove unnecessary compute hops
- reduce chatty orchestration
- batch operations safely
- right-size compute from measured data
- preserve observability and failure isolation
Phase 5: Re-measure
I treat cost optimization as an engineering loop:
- baseline
- change
- compare
- document tradeoffs
- repeat
End-to-End Walkthrough
To make this concrete, I will walk through a realistic example I often see:
Example workload: Event-driven order processing pipeline
A retail platform emits an OrderCreated event. The serverless pipeline:
- Validates the payload
- Enriches customer metadata
- Writes the order to DynamoDB
- Sends a fulfillment request to an internal API
- Publishes notifications
- Updates operational dashboards
- Triggers async follow-up workflows
Baseline architecture (common version)
A common initial design looks like this:
- An EventBridge rule receives OrderCreated
- Step Functions orchestrates the flow
- Multiple Lambda functions perform:
- input validation
- JSON transformation
- service API calls
- retries / status polling
- notification dispatch
- Every function logs full payloads
- Step Functions uses many small states to represent each transformation
- Synchronous calls are used where async could work
- Duplicate events are emitted for multiple downstream consumers
This is usually functional and maintainable, but over time I see four cost hotspots appear.
Cost hotspots in serverless (what I look for first)
1) Invocation-driven hotspots (Lambda, messaging, API calls)
This is the most obvious one, but not always where the biggest savings are.
I look for:
- High-frequency helper Lambdas that do only simple mapping/filtering
- Repeated invocations caused by retries at multiple layers (client, service, workflow)
- Tiny per-item processing when batching is safe
- Polling loops implemented in Lambda instead of orchestration or event callbacks
Common anti-pattern
A Lambda is invoked just to:
- rename fields
- add a constant
- route based on one attribute
- pass the payload to another service
In many cases this can be replaced with:
- EventBridge input transformation
- EventBridge Pipes filtering/enrichment
- Step Functions Parameters/ResultSelector
- Step Functions direct service integrations
That removes both compute cost and operational surface area.
2) State transition hotspots (Step Functions)
I see this frequently in mature systems.
Step Functions is excellent, but cost can grow when workflows become overly chatty:
- One state per trivial transformation
- Frequent polling loops with short waits
- Repetitive branching for small decisions
- Using Standard workflows for high-volume, very short paths without reviewing fit
- Nested workflows for steps that could be in one integration
What I optimize carefully
I do not optimize by blindly collapsing everything into one Lambda. That usually hurts observability and retry control.
Instead, I look for:
- eliminating trivial pass/transform states
- replacing glue Lambdas with direct integrations
- increasing polling intervals or switching to callback/event-driven completions
- splitting hot short-lived paths from long-running orchestration
3) Logging hotspots (CloudWatch Logs)
This one is often underestimated because logging is “cheap enough” until scale increases.
Cost growth comes from:
- Logging full request/response payloads for every invocation
- Debug-level logs enabled permanently
- Duplicate logs across Lambda, workflow, and application layers
- Large structured objects written multiple times
- High-cardinality logs that are rarely queried
Practical rule I use
I keep logs useful and minimal in production:
- log identifiers, not entire payloads
- sample verbose payload logs
- gate debug logs by environment/feature flag
- separate audit records from debug logs
- set explicit retention periods
In serverless, “just logging more” can become a real cost driver.
4) Data transfer hotspots (often missed in reviews)
Data transfer is easy to miss because the architecture diagram looks “serverless,” but network paths still matter.
I review:
- Cross-region calls between Lambda and downstream services
- Internet egress to SaaS APIs
- Repeated large payload movement between services
- Download/re-upload patterns (for example, pulling S3 objects through Lambda when direct processing is possible)
- VPC-attached Lambda traffic patterns if applicable (including NAT-related patterns)
- Sending oversized event payloads when a pointer (S3 key/object reference) would work
Pattern I prefer
Move references more often than full payloads:
- Put large content in S3
- Pass object keys / IDs in events
- Fetch only where necessary
- Trim event payloads to fields required for the next step
This reduces both transfer and execution overhead.
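This reference-passing (claim-check) pattern can be sketched in a few lines. The 256 KB threshold and bucket/key naming below are illustrative assumptions; the client is injected, so in practice you would pass a boto3 S3 client:

```python
import json

# Claim-check sketch: keep small payloads inline, store large ones out of
# band and pass a pointer. The 256 KB threshold is an illustrative cutoff;
# s3_client is injected (a boto3 S3 client in practice).
MAX_INLINE_BYTES = 256 * 1024

def to_event_payload(payload: dict, s3_client, bucket: str, key: str) -> dict:
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        return {"inline": payload}
    # Large payload: upload once, forward only the reference.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return {"payloadRef": {"s3Uri": f"s3://{bucket}/{key}"}}
```

Downstream steps that need the full content fetch it by the s3Uri; everything else forwards only the pointer.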
A Practical Review Framework I Use in Real Projects
Below is the review structure I use during architecture reviews. It is intentionally simple so teams can repeat it.
Step 1: Define the unit economics
I start with a unit of work, for example:
- “One OrderCreated event processed successfully”
- “One failed event retried and sent to the DLQ”
- “One batch of 100 events processed”
Then I estimate or measure the service touches for that unit.
Example (baseline, illustrative)
For one order:
- EventBridge ingress: 1 event
- Step Functions states: 22 transitions
- Lambda invocations: 8
- DynamoDB writes: 2
- CloudWatch log lines: ~120
- One external API call
- One notification event
- Occasional retries on fulfillment call
The exact numbers vary, but this gives me a concrete path to improve.
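Those unit touches are enough for a back-of-envelope estimate. The sketch below reuses the baseline counts from the example; the rates are illustrative placeholders, not current AWS pricing:

```python
# Back-of-envelope monthly cost for the baseline unit of work above.
# The rates are illustrative placeholders, NOT current AWS pricing; swap in
# your region's rates before using this in a real review.
RATES = {
    "lambda_invocation": 0.20 / 1_000_000,  # USD per request
    "sfn_transition": 0.025 / 1_000,        # USD per state transition
    "log_gb_ingested": 0.50,                # USD per GB
}

def monthly_cost(orders_per_month: int) -> float:
    per_order = (
        8 * RATES["lambda_invocation"]                  # Lambda invocations
        + 22 * RATES["sfn_transition"]                  # Step Functions transitions
        + (120 * 500 / 1e9) * RATES["log_gb_ingested"]  # ~120 log lines x ~500 B
    )
    return per_order * orders_per_month
```

Even with made-up rates, the state-transition line dwarfs the Lambda request line in this model at volume, which is exactly the kind of signal the review is after.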
Step 2: Build a cost-path table (without overcomplicating it)
I do not start with exact pricing spreadsheets. I start with a ranked table:
- Driver (Lambda, Step Functions, logs, transfer, etc.)
- Scaling factor (requests, duration, payload size, retries)
- Observed value
- Risk/benefit if changed
- Optimization candidate
This quickly identifies which optimizations are architectural vs micro-tuning.
Step 3: Profile the workload (shape matters more than averages)
Averages hide the truth in serverless systems.
I profile:
- p50/p95/p99 duration
- payload size distribution
- retry rate by dependency
- burst concurrency
- fan-out ratio
- failure modes (timeouts, throttles, validation errors)
- cold-start frequency (only when relevant to SLOs)
Why this matters
A system with the same monthly volume can have very different costs if:
- one workload is smooth and small-payload
- the other is spiky, retry-heavy, and large-payload
Architecture decisions should follow the shape, not just total count.
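For duration shape, the standard library is enough to get started. A small sketch with synthetic samples (in a real review these come from logs or traces):

```python
import statistics

# Percentile summary for duration samples in milliseconds. The sample data
# below is synthetic: a smooth bulk plus a retry-heavy tail.
def summarize(durations_ms):
    qs = statistics.quantiles(durations_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# 90% of requests are fast; the mean hides the 4-second outlier entirely.
durations = [40] * 90 + [400] * 9 + [4000]
shape = summarize(durations)
```

Here p50 stays at 40 ms while p99 lands near the outlier, which is the gap the averages-only view never shows you.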
Step 4: Apply architecture changes in this order
This is the sequence I use because it usually yields high value with lower risk:
- Remove unnecessary service hops
- Replace glue compute with native integrations
- Batch where safe
- Right-size Lambda from measured duration/memory
- Tune logs and retention
- Revisit orchestration granularity
- Optimize transfer/payload strategy
- Only then consider deeper redesigns
Architecture Changes That Reduce Cost Without Harming Reliability
Here are the changes I most commonly recommend, with the tradeoffs explained.
1) Replace glue Lambdas with service integrations
Before
A Lambda receives a payload and only:
- transforms JSON
- calls DynamoDB / SQS / EventBridge
- returns a small result
After
Use:
- Step Functions AWS SDK/service integrations
- EventBridge input transformers
- EventBridge Pipes filter/enrichment steps (where appropriate)
Why this helps
- Removes invocation and duration cost
- Reduces operational burden (fewer deployments, alarms, IAM roles)
- Improves determinism for simple operations
Reliability note
This can improve reliability because fewer components mean fewer failure points. The key is preserving observability and explicit error handling.
2) Use Pipes for simple event movement and filtering
EventBridge Pipes is a strong fit when the path is mostly:
- source -> filter -> optional transform/enrichment -> target
Good fit examples
- SQS -> target Lambda (with filtering)
- DynamoDB stream -> EventBridge bus
- Kinesis -> SQS
- SQS -> Step Functions (selective forwarding)
When I do not force Pipes
If the workflow needs:
- complex branching
- human approvals
- multi-step retries with business semantics
- long waits
- compensation logic
then Step Functions is usually the better fit, even if the raw “per event movement” cost might not be the lowest.
3) Reduce Step Functions chatty orchestration
What I change
- Collapse trivial transform-only states
- Use Parameters, ResultSelector, and JSONPath/JSONata features instead of helper Lambdas
- Shift from polling to callback/event completion where feasible
- Split short high-volume flows from long-running exception flows
What I keep
- Distinct states for failure boundaries
- States that need targeted retries/catches
- Business-visible milestones (auditability)
- Human review / approval checkpoints
The goal is not “fewer states at any cost.” The goal is intentional state design.
4) Batch small units where the business semantics allow it
Batching can reduce cost significantly across:
- Lambda invocations
- downstream API calls
- logs
- orchestration overhead
Examples
- Process 10 records per Lambda invocation instead of 1
- Aggregate notifications
- Batch writes to DynamoDB when access patterns permit
- Buffer events briefly for downstream efficiency
Caution
Batching changes failure semantics. I only use it when I can answer:
- What happens if item 7 of 10 fails?
- Do I need partial success reporting?
- Can the downstream system handle duplicates on retries?
Reliability comes first. Cost savings that break recovery are not real savings.
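For SQS-triggered Lambdas, partial batch responses (the ReportBatchItemFailures setting on the event source mapping) are how I usually answer the "item 7 of 10" question. A minimal sketch; process_record is a hypothetical stand-in for real business logic:

```python
import json

# Sketch of an SQS batch handler that reports which items failed instead of
# failing the whole batch. Requires ReportBatchItemFailures to be enabled on
# the event source mapping; process_record is hypothetical business logic.
def process_record(order: dict) -> None:
    if "orderId" not in order:
        raise ValueError("missing orderId")

def handler(event, context):
    failures = []
    for msg in event["Records"]:
        try:
            process_record(json.loads(msg["body"]))
        except Exception:
            # Only these messages return to the queue for retry;
            # successfully processed ones are deleted.
            failures.append({"itemIdentifier": msg["messageId"]})
    return {"batchItemFailures": failures}
```

This keeps the batching savings without re-driving the nine records that already succeeded, provided processing is idempotent.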
5) Right-size Lambda with measured data, not guesswork
Right-sizing is more than memory reduction. Since Lambda allocates CPU proportionally to memory, a lower-memory function can become slower and more expensive overall.
What I measure
- p50/p95 duration
- memory usage
- init duration (where relevant)
- timeout headroom
- downstream latency contribution
- retry behavior after timeouts/throttles
What I look for
- Functions with high duration and low memory usage (often downstream-bound)
- Functions with CPU-bound work that benefit from more memory
- Functions with large package sizes causing slower init
- Functions using /tmp heavily (may benefit from memory/storage review)
I optimize for cost per successful unit of work, not cost per invocation in isolation.
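That metric can be made explicit. The sketch below compares two memory settings by cost per successful unit; the rates, durations, and success rates are illustrative assumptions, not measurements or current pricing:

```python
# Compare memory configurations by cost per successful unit of work.
# The GB-second and request rates, durations, and success rates below are
# illustrative assumptions, not measurements or current AWS pricing.
GB_SECOND_USD = 0.0000166667
REQUEST_USD = 0.20 / 1_000_000

def cost_per_success(memory_mb: int, duration_ms: float, success_rate: float) -> float:
    per_attempt = (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND_USD + REQUEST_USD
    # Failed attempts are retried, so expected attempts per success ~= 1/success_rate.
    return per_attempt / success_rate

# Hypothetical CPU-bound function: doubling memory halves measured duration.
low_mem = cost_per_success(memory_mb=512, duration_ms=1200, success_rate=0.97)
high_mem = cost_per_success(memory_mb=1024, duration_ms=600, success_rate=0.99)
```

Both configurations burn the same GB-seconds per attempt in this example, but the higher-memory one comes out cheaper per success because fewer attempts time out and get retried.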
6) Tune logging intentionally (not minimally)
I do not recommend “turn off logs.” I recommend structured, useful, cost-conscious logging.
Practical changes:
- Move full payload dumps behind a sampled debug flag
- Log IDs and metrics by default
- Use explicit log retention (for example, shorter for noisy operational logs)
- Avoid duplicate payload logging at every layer
- Emit metrics for counts/latency instead of reconstructing them from logs
This usually reduces cost while improving signal-to-noise.
7) Pass references, not large payloads
Instead of pushing large documents or objects through multiple services:
- Store payloads in S3
- Pass a pointer in the event/workflow
- Fetch on demand
Benefits:
- lower transfer volume
- smaller event payloads
- faster orchestration payload handling
- cleaner replayability
This pattern is especially helpful in document, media, and analytics workflows.
Tradeoffs: Lambda vs Pipes vs Service Integrations vs Step Functions
This is the part teams usually want summarized, so here is the practical version I use in reviews.
Lambda is best when:
- I need custom code/business logic
- I need third-party SDKs or protocols
- I need non-trivial validation/transformation
- I need specialized retry behavior inside the function
- I need to encapsulate logic reused across multiple flows
Lambda is overkill when:
- I am only mapping fields
- I am just forwarding to another AWS service
- I can use native service integration safely and clearly
EventBridge Pipes is best when:
- The path is primarily source-to-target movement
- I need filtering and simple enrichment
- I want low operational overhead for plumbing
- I want to avoid writing a “routing Lambda”
Pipes is not the best fit when:
- I need complex multi-step orchestration
- I need long waits or human callbacks
- I need rich compensation logic across steps
Step Functions is best when:
- I need explicit orchestration and auditability
- I need retries/catches at step boundaries
- I need long-running workflows
- I need human-in-the-loop or callback patterns
- I need clear operational visualization for a business process
Step Functions can become costly when:
- I model every trivial transform as a state
- I use frequent polling instead of event-driven completion
- I keep hot paths in Standard without reviewing alternatives
- I use nested workflows where direct integrations would do
Direct service integrations are best when:
- The step is simple and deterministic
- I can express the call declaratively
- I want fewer moving parts
- I want to remove glue code and reduce operational burden
Direct integrations may be a poor fit when:
- I need complex custom serialization/validation
- I need cross-service custom logic not represented well declaratively
- The readability of the workflow would suffer too much
End-to-End Optimization Walkthrough (Baseline -> Improved)
Now I will apply the framework to the example order pipeline.
Baseline (illustrative)
- 8 Lambda invocations per order
- 22 Step Functions transitions
- verbose logs in every function
- large payload passed between multiple steps
- polling Lambda for fulfillment status
- duplicate notification paths
Review findings
Top cost hotspots:
- Step Functions state transitions from chatty orchestration
- Lambda invocations for glue transforms and simple service calls
- CloudWatch Logs ingestion due to payload dumps
- Data transfer/execution overhead from oversized payloads
- Retries multiplying costs during downstream API flakiness
Improved architecture (same reliability goals)
Changes I make
- Replace 3 glue Lambdas with:
  - Step Functions Parameters/ResultSelector
  - direct DynamoDB/EventBridge integrations
- Replace one routing Lambda with EventBridge Pipes
- Move large enrichment blobs to S3 and pass object references
- Replace polling Lambda loop with callback/event-driven completion
- Keep Step Functions states at clear failure boundaries only
- Reduce default log verbosity and set retention explicitly
- Batch non-critical notifications
What I do not change
- Idempotency controls
- DLQs and retry policies
- Key audit checkpoints
- Alarms and traces for critical dependencies
This is important. I want lower cost without creating an opaque, fragile pipeline.
Architecture pattern (review-ready version)
The optimized pattern I recommend for many teams looks like this:
- Ingress: EventBridge
- Plumbing/filtering: EventBridge Pipes (where simple)
- Orchestration: Step Functions for business process + callbacks
- Compute: Lambda only for genuine custom logic
- Storage: S3 for large payloads, DynamoDB for metadata/state
- Observability: metrics first, targeted logs, traces on critical paths
This is cost-aware because each service is doing the kind of work it is best suited for.
Implementation Discussion
Below are code examples that demonstrate how I apply the framework in practice.
1) Step Functions workflow using service integrations instead of glue Lambdas (ASL fragment)
This fragment shows a flow that:
- validates and prepares payload once
- writes to DynamoDB via a direct integration
- emits an event via EventBridge integration
- keeps Lambda only for actual custom logic
{
"Comment": "Cost-aware serverless orchestration example",
"StartAt": "ValidateAndEnrich",
"States": {
"ValidateAndEnrich": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${ValidateAndEnrichFnArn}",
"Payload.$": "$"
},
"OutputPath": "$.Payload",
"Next": "PutOrderMetadata",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
"IntervalSeconds": 2,
"BackoffRate": 2.0,
"MaxAttempts": 4
}
]
},
"PutOrderMetadata": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem",
"Parameters": {
"TableName": "${OrdersTable}",
"Item": {
"pk": { "S.$": "States.Format('ORDER#{}', $.orderId)" },
"sk": { "S": "META" },
"status": { "S": "RECEIVED" },
"customerId": { "S.$": "$.customerId" },
"payloadRef": { "S.$": "$.payloadRef.s3Uri" },
"createdAt": { "S.$": "$.timestamps.receivedAt" }
},
"ConditionExpression": "attribute_not_exists(pk)"
},
"ResultPath": null,
"Next": "PublishOrderReceived"
},
"PublishOrderReceived": {
"Type": "Task",
"Resource": "arn:aws:states:::events:putEvents",
"Parameters": {
"Entries": [
{
"Source": "com.example.orders",
"DetailType": "order.received",
"EventBusName": "${EventBusName}",
"Detail": {
"orderId.$": "$.orderId",
"customerId.$": "$.customerId",
"payloadRef.$": "$.payloadRef",
"correlationId.$": "$.correlationId"
}
}
]
},
"ResultPath": null,
"Next": "WaitForFulfillmentCallback"
},
"WaitForFulfillmentCallback": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Parameters": {
"FunctionName": "${CreateFulfillmentRequestFnArn}",
"Payload": {
"taskToken.$": "$$.Task.Token",
"orderId.$": "$.orderId",
"payloadRef.$": "$.payloadRef",
"correlationId.$": "$.correlationId"
}
},
"TimeoutSeconds": 900,
"ResultPath": "$.fulfillment",
"Next": "FinalizeOrder"
},
"FinalizeOrder": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:dynamodb:updateItem",
"Parameters": {
"TableName": "${OrdersTable}",
"Key": {
"pk": { "S.$": "States.Format('ORDER#{}', $.orderId)" },
"sk": { "S": "META" }
},
"UpdateExpression": "SET #s = :s, fulfillmentId = :f, fulfilledAt = :t",
"ExpressionAttributeNames": {
"#s": "status"
},
"ExpressionAttributeValues": {
":s": { "S": "FULFILLED" },
":f": { "S.$": "$.fulfillment.fulfillmentId" },
":t": { "S.$": "$.fulfillment.completedAt" }
}
},
"End": true
}
}
}
Why this is cost-aware
- I removed two common glue Lambdas (put item, publish event)
- I use a callback instead of chatty polling
- I keep orchestration only where it adds reliability/visibility
2) EventBridge Pipes (CDK TypeScript) for low-overhead routing/filtering
This example shows a common “plumbing” path I do not want to solve with a custom Lambda.
// Fragment from a CDK Stack; `ordersQueue` and `orderWorkflow` are defined elsewhere.
import * as pipes from 'aws-cdk-lib/aws-pipes';
import * as iam from 'aws-cdk-lib/aws-iam';
const pipeRole = new iam.Role(this, 'OrdersPipeRole', {
assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com'),
});
pipeRole.addToPolicy(new iam.PolicyStatement({
actions: [
'sqs:ReceiveMessage',
'sqs:DeleteMessage',
'sqs:GetQueueAttributes',
'sqs:ChangeMessageVisibility',
],
resources: [ordersQueue.queueArn],
}));
pipeRole.addToPolicy(new iam.PolicyStatement({
actions: ['states:StartExecution'],
resources: [orderWorkflow.stateMachineArn],
}));
new pipes.CfnPipe(this, 'FilteredOrdersPipe', {
roleArn: pipeRole.roleArn,
source: ordersQueue.queueArn,
target: orderWorkflow.stateMachineArn,
sourceParameters: {
sqsQueueParameters: {
batchSize: 10,
maximumBatchingWindowInSeconds: 5,
},
filterCriteria: {
filters: [
{
pattern: JSON.stringify({
body: {
eventType: ['OrderCreated'],
priority: ['standard', 'express'],
},
}),
},
],
},
},
targetParameters: {
stepFunctionStateMachineParameters: {
invocationType: 'FIRE_AND_FORGET',
},
inputTemplate: JSON.stringify({
orderId: '<$.body.orderId>',
customerId: '<$.body.customerId>',
priority: '<$.body.priority>',
payloadRef: {
s3Uri: '<$.body.payloadRef.s3Uri>',
},
correlationId: '<$.body.correlationId>',
}),
},
});
Why this is cost-aware
- No routing Lambda to maintain
- Built-in filtering avoids unnecessary workflow starts
- Batching reduces per-message overhead while preserving throughput
3) Lambda right-sizing profiling helper (Python)
I use a simple embedded metric approach to log what matters for right-sizing reviews. This is not a full observability solution, but it is enough to drive architecture decisions.
import json
import os
import time
from datetime import datetime, timezone
def emit_emf(namespace: str, metrics: dict, dimensions: dict):
emf = {
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [
{
"Namespace": namespace,
"Dimensions": [list(dimensions.keys())],
"Metrics": [{"Name": k, "Unit": "None"} for k in metrics.keys()],
}
],
},
**dimensions,
**metrics,
}
print(json.dumps(emf))
def lambda_handler(event, context):
start = time.perf_counter()
payload_size_bytes = len(json.dumps(event).encode("utf-8"))
# ... custom logic here ...
# Keep this function for genuine business logic, not simple routing
duration_ms = (time.perf_counter() - start) * 1000
configured_memory_mb = int(os.getenv("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", "128"))
emit_emf(
namespace="ServerlessReview",
dimensions={
"FunctionName": context.function_name,
"Environment": os.getenv("STAGE", "prod"),
},
metrics={
"DurationMs": round(duration_ms, 2),
"PayloadSizeBytes": payload_size_bytes,
"ConfiguredMemoryMB": configured_memory_mb,
},
)
return {
"ok": True,
"processedAt": datetime.now(timezone.utc).isoformat()
}
How I use this data in reviews
I compare:
- duration by memory setting
- duration by payload size bucket
- timeout proximity
- retry rates correlated with duration
Then I decide whether to:
- increase memory (reduce duration and total cost)
- decrease memory (if clearly overprovisioned)
- batch inputs
- move logic out of Lambda entirely
4) Production-safe logging pattern (Python)
This pattern reduces log cost without sacrificing debuggability.
import json
import os
import random
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
DEBUG_SAMPLE_RATE = float(os.getenv("DEBUG_SAMPLE_RATE", "0.01")) # 1% default
def should_log_debug_payload() -> bool:
if LOG_LEVEL == "DEBUG":
return True
return random.random() < DEBUG_SAMPLE_RATE
def log_info(message: str, **kwargs):
print(json.dumps({"level": "INFO", "message": message, **kwargs}))
def log_debug(message: str, **kwargs):
if LOG_LEVEL == "DEBUG":
print(json.dumps({"level": "DEBUG", "message": message, **kwargs}))
def handler(event, context):
correlation_id = event.get("correlationId", "unknown")
order_id = event.get("orderId", "unknown")
log_info(
"processing_order",
correlationId=correlation_id,
orderId=order_id
)
if should_log_debug_payload():
print(json.dumps({
"level": "DEBUG_SAMPLED",
"message": "event_payload_sample",
"correlationId": correlation_id,
"orderId": order_id,
"payload": event
}))
# ... business logic ...
log_info(
"order_processed",
correlationId=correlation_id,
orderId=order_id,
status="success"
)
return {"status": "ok"}
Why I like this pattern
- Production logs stay useful and cheaper
- I still have sampled payload visibility for troubleshooting
- It works well with structured log queries and metrics extraction
Workload Profiling and Right-Sizing Patterns
This is the part that makes the review credible. I do not want architecture recommendations based on intuition alone.
1) Profile by workload class, not just service
I segment workloads into classes, for example:
- Latency-sensitive sync API path
- Async event processing path
- Burst batch path
- Low-frequency admin path
Each class has different priorities and cost levers.
Example:
- Sync API path may justify higher Lambda memory for lower latency
- Async batch path may prioritize batching and reduced transitions
- Admin path may not be worth heavy optimization effort
2) Build a “cost to reliability” tradeoff table
For each proposed change, I document:
- Change
- Expected cost impact
- Reliability impact
- Operability impact
- Rollback plan
- Measurement plan
This keeps reviews grounded and avoids “optimize everything” behavior.
Example entries
- Replace routing Lambda with Pipe -> lower cost, lower ops burden, neutral reliability, easy rollback
- Collapse 5 workflow states into 2 -> moderate cost savings, possible loss of debug granularity, test carefully
- Batch writes by 25 -> potentially high savings, changes failure semantics, requires idempotency validation
3) Right-size from p95 and timeout headroom, not p50 alone
A function that looks fine at p50 may:
- time out at p99
- retry frequently during bursts
- create downstream contention that increases total cost
I tune using:
- p95/p99 duration
- timeout headroom
- retry counts
- dependency latency distribution
This avoids false savings that increase retries and operational incidents.
4) Profile payload size and serialization overhead
In event-driven systems, large payloads can create hidden costs:
- more transfer
- longer execution time
- larger logs
- larger Step Functions state payloads
- more memory pressure in Lambda
I explicitly bucket payload sizes and look for outliers.
A common improvement is:
- move large optional fields to S3
- pass references
- fetch lazily only in the steps that need them
5) Profile retries as a cost multiplier
Retries are often the hidden multiplier in “unexpected” serverless cost growth.
I measure:
- retry rate per dependency
- retry amplification across layers (SDK retry + Lambda retry + workflow retry)
- cost of failed attempts vs successful attempts
- DLQ rate and replay volume
Then I redesign retry placement:
- retry at the layer with the best context
- avoid stacking retries unnecessarily
- use backoff and circuit-breaking patterns where appropriate
This improves both cost and stability.
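The amplification itself is just multiplication, which is why it sneaks up on teams. A toy illustration with hypothetical retry limits:

```python
# Worst-case attempt amplification when retry policies stack: each layer can
# re-drive every attempt of the layer below it. Retry limits are illustrative.
def worst_case_attempts(*max_attempts_per_layer: int) -> int:
    total = 1
    for attempts in max_attempts_per_layer:
        total *= attempts
    return total

# SDK allows 3 attempts, Lambda async retry allows 2, workflow retry allows 4:
# a single event can drive up to 24 calls against the flaky dependency.
calls = worst_case_attempts(3, 2, 4)
```

Retrying at one well-chosen layer with backoff usually beats three layers each retrying "just to be safe."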
Review Checklist I Actually Use
Here is the short version of the checklist I use in architecture reviews.
Hotspots
- [ ] Which cost components scale with request count?
- [ ] Which scale with duration, payload size, retries, or fan-out?
- [ ] Where are we logging too much?
- [ ] Where are we moving data unnecessarily?
Architecture fit
- [ ] Is Lambda only used where custom logic is truly needed?
- [ ] Can any glue Lambda be replaced with Pipes or direct integration?
- [ ] Is Step Functions state granularity intentional?
- [ ] Are polling loops justified, or can callbacks/events replace them?
Profiling
- [ ] Do we have p95/p99 duration data?
- [ ] Do we know payload size distribution?
- [ ] Do we know retry rate by dependency?
- [ ] Do we know burst concurrency patterns?
Reliability guardrails
- [ ] Will the optimization change failure semantics?
- [ ] Is idempotency preserved?
- [ ] Are DLQ/replay paths intact?
- [ ] Is observability still sufficient?
Rollout
- [ ] Can we canary or phase the change?
- [ ] Do we have before/after metrics defined?
- [ ] Do we have a rollback path?
Common Mistakes in Cost Reviews
I see these often, so I call them out explicitly.
Mistake 1: Only tuning Lambda memory
Useful, but incomplete. The biggest cost issue may be:
- Step Functions transitions
- logs
- retries
- data movement
- unnecessary service hops
Mistake 2: Replacing orchestration with one “mega Lambda”
This can reduce one visible cost but create:
- worse reliability
- harder retries
- worse observability
- slower delivery velocity
Mistake 3: Optimizing by averages
p50 hides burst pain and retry amplification.
Mistake 4: Ignoring developer productivity
A design that is slightly cheaper but much harder to maintain may cost more overall.
Mistake 5: Removing logs indiscriminately
The right answer is better logging design, not blind log suppression.
Closing Thoughts
A strong cost-aware serverless review is not about chasing the cheapest architecture on paper. It is about making architecture choices that fit the workload, preserve reliability, and keep operating costs proportional to business value.
The framework I use is simple:
- Profile the workload
- Map the cost path
- Find hotspots
- Apply architecture changes first
- Right-size from measurements
- Re-measure and document tradeoffs
If I am reviewing a production serverless system, I want the outcome to be more than a list of tips. I want a clear plan that the team can implement safely, measure, and iterate on.
That is what makes the review practical.