Cost optimization in serverless is often discussed as a pricing exercise. In practice, I have found it is much more effective when treated as an architecture review discipline.
That is the lens I use in this post.
This is a practical framework I use to review serverless systems for cost efficiency without weakening reliability, operability, or delivery speed. It is designed for engineers, architects, and technical leads who want to go beyond “reduce Lambda memory” and make better design decisions across services.
I will focus on AWS serverless building blocks and the tradeoffs between:
- AWS Lambda
- AWS Step Functions
- Amazon EventBridge Pipes
- Direct service integrations (for example, Step Functions SDK/service integrations)
- Supporting services like SQS, EventBridge, CloudWatch Logs, DynamoDB, and S3
I will cover:
- Cost hotspots in serverless (invocations, state transitions, logs, data transfer)
- Architecture changes that reduce cost without harming reliability
- Tradeoffs: Lambda vs Pipes vs service integrations vs Step Functions
- Workload profiling and right-sizing patterns
- End-to-end walkthrough and implementation discussion
- Architecture and code examples you can adapt
Why this topic stands out
I like this topic because it is senior-level but immediately useful.
Almost every serverless system starts cost-efficient, then gradually accumulates “cost drag” as features are added:
- Extra orchestration steps
- Helper Lambdas that only transform payloads
- Verbose logging left on in production
- Duplicate event fan-out and reprocessing
- Cross-AZ / cross-region / internet egress paths that nobody revisits
- Overbuilt retries and polling loops
None of these are necessarily “wrong.” In many cases they are the fastest way to ship.
But once traffic grows, it becomes worth doing a structured review.
The goal is not to squeeze every cent out of the system. The goal is to spend intentionally, with a clear understanding of where cost comes from and what reliability/maintainability benefits we are buying with that spend.
What I mean by a cost-aware serverless architecture review
A cost-aware review is a design review that asks four questions for each path in the workload:
1. What is the unit of work?
   - One event, one request, one file, one state transition, one batch, one customer action
2. What services does that unit of work touch?
   - Lambda, Step Functions, EventBridge, SQS, DynamoDB, logs, storage, network paths
3. Which of those touches scales linearly with traffic, and which scales with payload size, duration, or retries?
   - This is where the real hotspots show up
4. Can I reduce cost by changing the architecture instead of only tuning a single service?
   - Example: replacing three "glue Lambdas" with native integrations
That last question is the most important one.
Framework Overview
This is the practical framework I use in reviews:
Phase 1: Profile the workload
Understand traffic shape and execution shape:
- request rate, concurrency, burstiness
- payload sizes
- durations
- retries
- fan-out counts
- error rates
- cold-start sensitivity
- async vs sync paths
Phase 2: Map the cost path
Trace the cost of one unit of work across services:
- invocations
- transitions
- messages
- logs ingested/stored
- storage reads/writes
- data transfer
- optional downstream analytics/monitoring copies
Phase 3: Identify hotspots
Rank cost drivers by:
- total monthly impact
- growth rate with scale
- ease of remediation
- risk of change
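The ranking in Phase 3 can be kept honest with a simple weighted score. A minimal sketch, with hypothetical drivers, weights, and numbers:

```python
# Rank cost drivers by monthly impact and growth, discounted by the effort
# and risk of changing them. All drivers and numbers are hypothetical.
def rank_hotspots(drivers):
    def score(d):
        return (d["monthly_usd"] * d["growth_factor"]) / (d["effort"] * d["risk"])
    return sorted(drivers, key=score, reverse=True)

drivers = [
    {"name": "step_functions_transitions", "monthly_usd": 900, "growth_factor": 1.4, "effort": 2, "risk": 2},
    {"name": "glue_lambda_invocations", "monthly_usd": 400, "growth_factor": 1.2, "effort": 1, "risk": 1},
    {"name": "log_ingestion", "monthly_usd": 600, "growth_factor": 1.5, "effort": 1, "risk": 1},
]

ranked = [d["name"] for d in rank_hotspots(drivers)]
```

The weights are crude on purpose; the point is to force an explicit impact-versus-effort conversation, not to be precise.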
Phase 4: Apply architecture changes
Prefer changes that:
- remove unnecessary compute hops
- reduce chatty orchestration
- batch operations safely
- right-size compute from measured data
- preserve observability and failure isolation
Phase 5: Re-measure
I treat cost optimization as an engineering loop:
- baseline
- change
- compare
- document tradeoffs
- repeat
End-to-End Walkthrough
To make this concrete, I will walk through a realistic example I often see:
Example workload: Event-driven order processing pipeline
A retail platform emits an OrderCreated event. The serverless pipeline:
- Validates the payload
- Enriches customer metadata
- Writes the order to DynamoDB
- Sends a fulfillment request to an internal API
- Publishes notifications
- Updates operational dashboards
- Triggers async follow-up workflows
Baseline architecture (common version)
A common initial design looks like this:
- An EventBridge rule receives OrderCreated
- Step Functions orchestrates the flow
- Multiple Lambda functions perform:
- input validation
- JSON transformation
- service API calls
- retries / status polling
- notification dispatch
- Every function logs full payloads
- Step Functions uses many small states to represent each transformation
- Synchronous calls are used where async could work
- Duplicate events are emitted for multiple downstream consumers
This is usually functional and maintainable, but over time I see four cost hotspots appear.
Cost hotspots in serverless (what I look for first)
1) Invocation-driven hotspots (Lambda, messaging, API calls)
This is the most obvious one, but not always where the biggest savings are.
I look for:
- High-frequency helper Lambdas that do only simple mapping/filtering
- Repeated invocations caused by retries at multiple layers (client, service, workflow)
- Tiny per-item processing when batching is safe
- Polling loops implemented in Lambda instead of orchestration or event callbacks
Common anti-pattern
A Lambda is invoked just to:
- rename fields
- add a constant
- route based on one attribute
- pass the payload to another service
In many cases this can be replaced with:
- EventBridge input transformation
- EventBridge Pipes filtering/enrichment
- Step Functions Parameters/ResultSelector
- Step Functions direct service integrations
That removes both compute cost and operational surface area.
2) State transition hotspots (Step Functions)
I see this frequently in mature systems.
Step Functions is excellent, but cost can grow when workflows become overly chatty:
- One state per trivial transformation
- Frequent polling loops with short waits
- Repetitive branching for small decisions
- Using Standard workflows for high-volume, very short paths without reviewing fit
- Nested workflows for steps that could be in one integration
What I optimize carefully
I do not optimize by blindly collapsing everything into one Lambda. That usually hurts observability and retry control.
Instead, I look for:
- eliminating trivial pass/transform states
- replacing glue Lambdas with direct integrations
- increasing polling intervals or switching to callback/event-driven completions
- splitting hot short-lived paths from long-running orchestration
3) Logging hotspots (CloudWatch Logs)
This one is often underestimated because logging is “cheap enough” until scale increases.
Cost growth comes from:
- Logging full request/response payloads for every invocation
- Debug-level logs enabled permanently
- Duplicate logs across Lambda, workflow, and application layers
- Large structured objects written multiple times
- High-cardinality logs that are rarely queried
Practical rule I use
I keep logs useful and minimal in production:
- log identifiers, not entire payloads
- sample verbose payload logs
- gate debug logs by environment/feature flag
- separate audit records from debug logs
- set explicit retention periods
In serverless, “just logging more” can become a real cost driver.
4) Data transfer hotspots (often missed in reviews)
Data transfer is easy to miss because the architecture diagram looks “serverless,” but network paths still matter.
I review:
- Cross-region calls between Lambda and downstream services
- Internet egress to SaaS APIs
- Repeated large payload movement between services
- Download/re-upload patterns (for example, pulling S3 objects through Lambda when direct processing is possible)
- VPC-attached Lambda traffic patterns if applicable (including NAT-related patterns)
- Sending oversized event payloads when a pointer (S3 key/object reference) would work
Pattern I prefer
Move references more often than full payloads:
- Put large content in S3
- Pass object keys / IDs in events
- Fetch only where necessary
- Trim event payloads to fields required for the next step
This reduces both transfer and execution overhead.
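This reference-passing (claim-check) pattern can be sketched in a few lines. The 256 KB threshold and bucket/key naming below are illustrative assumptions; the client is injected, so in practice you would pass a boto3 S3 client:

```python
import json

# Claim-check sketch: keep small payloads inline, store large ones out of
# band and pass a pointer. The 256 KB threshold is an illustrative cutoff;
# s3_client is injected (a boto3 S3 client in practice).
MAX_INLINE_BYTES = 256 * 1024

def to_event_payload(payload: dict, s3_client, bucket: str, key: str) -> dict:
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        return {"inline": payload}
    # Large payload: upload once, forward only the reference.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return {"payloadRef": {"s3Uri": f"s3://{bucket}/{key}"}}
```

Downstream steps that need the full content fetch it by the s3Uri; everything else forwards only the pointer.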
A Practical Review Framework I Use in Real Projects
Below is the review structure I use during architecture reviews. It is intentionally simple so teams can repeat it.
Step 1: Define the unit economics
I start with a unit of work, for example:
- “One OrderCreated event processed successfully”
- “One failed event retried and sent to the DLQ”
- “One batch of 100 events processed”
Then I estimate or measure the service touches for that unit.
Example (baseline, illustrative)
For one order:
- EventBridge ingress: 1 event
- Step Functions states: 22 transitions
- Lambda invocations: 8
- DynamoDB writes: 2
- CloudWatch log lines: ~120
- One external API call
- One notification event
- Occasional retries on fulfillment call
The exact numbers vary, but this gives me a concrete path to improve.
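Those unit touches are enough for a back-of-envelope estimate. The sketch below reuses the baseline counts from the example; the rates are illustrative placeholders, not current AWS pricing:

```python
# Back-of-envelope monthly cost for the baseline unit of work above.
# The rates are illustrative placeholders, NOT current AWS pricing; swap in
# your region's rates before using this in a real review.
RATES = {
    "lambda_invocation": 0.20 / 1_000_000,  # USD per request
    "sfn_transition": 0.025 / 1_000,        # USD per state transition
    "log_gb_ingested": 0.50,                # USD per GB
}

def monthly_cost(orders_per_month: int) -> float:
    per_order = (
        8 * RATES["lambda_invocation"]                  # Lambda invocations
        + 22 * RATES["sfn_transition"]                  # Step Functions transitions
        + (120 * 500 / 1e9) * RATES["log_gb_ingested"]  # ~120 log lines x ~500 B
    )
    return per_order * orders_per_month
```

Even with made-up rates, the state-transition line dwarfs the Lambda request line in this model at volume, which is exactly the kind of signal the review is after.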
Step 2: Build a cost-path table (without overcomplicating it)
I do not start with exact pricing spreadsheets. I start with a ranked table:
- Driver (Lambda, Step Functions, logs, transfer, etc.)
- Scaling factor (requests, duration, payload size, retries)
- Observed value
- Risk/benefit if changed
- Optimization candidate
This quickly identifies which optimizations are architectural vs micro-tuning.
Step 3: Profile the workload (shape matters more than averages)
Averages hide the truth in serverless systems.
I profile:
- p50/p95/p99 duration
- payload size distribution
- retry rate by dependency
- burst concurrency
- fan-out ratio
- failure modes (timeouts, throttles, validation errors)
- cold-start frequency (only when relevant to SLOs)
Why this matters
A system with the same monthly volume can have very different costs if:
- one workload is smooth and small-payload
- the other is spiky, retry-heavy, and large-payload
Architecture decisions should follow the shape, not just total count.
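For duration shape, the standard library is enough to get started. A small sketch with synthetic samples (in a real review these come from logs or traces):

```python
import statistics

# Percentile summary for duration samples in milliseconds. The sample data
# below is synthetic: a smooth bulk plus a retry-heavy tail.
def summarize(durations_ms):
    qs = statistics.quantiles(durations_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# 90% of requests are fast; the mean hides the 4-second outlier entirely.
durations = [40] * 90 + [400] * 9 + [4000]
shape = summarize(durations)
```

Here p50 stays at 40 ms while p99 lands near the outlier, which is the gap the averages-only view never shows you.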
Step 4: Apply architecture changes in this order
This is the sequence I use because it usually yields high value with lower risk:
- Remove unnecessary service hops
- Replace glue compute with native integrations
- Batch where safe
- Right-size Lambda from measured duration/memory
- Tune logs and retention
- Revisit orchestration granularity
- Optimize transfer/payload strategy
- Only then consider deeper redesigns
Architecture Changes That Reduce Cost Without Harming Reliability
Here are the changes I most commonly recommend, with the tradeoffs explained.
1) Replace glue Lambdas with service integrations
Before
A Lambda receives a payload and only:
- transforms JSON
- calls DynamoDB / SQS / EventBridge
- returns a small result
After
Use:
- Step Functions AWS SDK/service integrations
- EventBridge input transformers
- EventBridge Pipes filter/enrichment steps (where appropriate)
Why this helps
- Removes invocation and duration cost
- Reduces operational burden (fewer deployments, alarms, IAM roles)
- Improves determinism for simple operations
Reliability note
This can improve reliability because fewer components mean fewer failure points. The key is preserving observability and explicit error handling.
2) Use Pipes for simple event movement and filtering
EventBridge Pipes is a strong fit when the path is mostly:
- source -> filter -> optional transform/enrichment -> target
Good fit examples
- SQS -> target Lambda (with filtering)
- DynamoDB stream -> EventBridge bus
- Kinesis -> SQS
- SQS -> Step Functions (selective forwarding)
When I do not force Pipes
If the workflow needs:
- complex branching
- human approvals
- multi-step retries with business semantics
- long waits
- compensation logic
then Step Functions is usually the better fit, even if the raw “per event movement” cost might not be the lowest.
3) Reduce Step Functions chatty orchestration
What I change
- Collapse trivial transform-only states
- Use Parameters, ResultSelector, and JSONPath/JSONata features instead of helper Lambdas
- Shift from polling to callback/event completion where feasible
- Split short high-volume flows from long-running exception flows
What I keep
- Distinct states for failure boundaries
- States that need targeted retries/catches
- Business-visible milestones (auditability)
- Human review / approval checkpoints
The goal is not “fewer states at any cost.” The goal is intentional state design.
4) Batch small units where the business semantics allow it
Batching can reduce cost significantly across:
- Lambda invocations
- downstream API calls
- logs
- orchestration overhead
Examples
- Process 10 records per Lambda invocation instead of 1
- Aggregate notifications
- Batch writes to DynamoDB when access patterns permit
- Buffer events briefly for downstream efficiency
Caution
Batching changes failure semantics. I only use it when I can answer:
- What happens if item 7 of 10 fails?
- Do I need partial success reporting?
- Can the downstream system handle duplicates on retries?
Reliability comes first. Cost savings that break recovery are not real savings.
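For SQS-triggered Lambdas, partial batch responses (the ReportBatchItemFailures setting on the event source mapping) are how I usually answer the "item 7 of 10" question. A minimal sketch; process_record is a hypothetical stand-in for real business logic:

```python
import json

# Sketch of an SQS batch handler that reports which items failed instead of
# failing the whole batch. Requires ReportBatchItemFailures to be enabled on
# the event source mapping; process_record is hypothetical business logic.
def process_record(order: dict) -> None:
    if "orderId" not in order:
        raise ValueError("missing orderId")

def handler(event, context):
    failures = []
    for msg in event["Records"]:
        try:
            process_record(json.loads(msg["body"]))
        except Exception:
            # Only these messages return to the queue for retry;
            # successfully processed ones are deleted.
            failures.append({"itemIdentifier": msg["messageId"]})
    return {"batchItemFailures": failures}
```

This keeps the batching savings without re-driving the nine records that already succeeded, provided processing is idempotent.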
5) Right-size Lambda with measured data, not guesswork
Right-sizing is more than memory reduction. Since Lambda allocates CPU proportionally to memory, a lower-memory function can become slower and more expensive overall.
What I measure
- p50/p95 duration
- memory usage
- init duration (where relevant)
- timeout headroom
- downstream latency contribution
- retry behavior after timeouts/throttles
What I look for
- Functions with high duration and low memory usage (often downstream-bound)
- Functions with CPU-bound work that benefit from more memory
- Functions with large package sizes causing slower init
- Functions using /tmp heavily (may benefit from memory/storage review)
I optimize for cost per successful unit of work, not cost per invocation in isolation.
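That metric can be made explicit. The sketch below compares two memory settings by cost per successful unit; the rates, durations, and success rates are illustrative assumptions, not measurements or current pricing:

```python
# Compare memory configurations by cost per successful unit of work.
# The GB-second and request rates, durations, and success rates below are
# illustrative assumptions, not measurements or current AWS pricing.
GB_SECOND_USD = 0.0000166667
REQUEST_USD = 0.20 / 1_000_000

def cost_per_success(memory_mb: int, duration_ms: float, success_rate: float) -> float:
    per_attempt = (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND_USD + REQUEST_USD
    # Failed attempts are retried, so expected attempts per success ~= 1/success_rate.
    return per_attempt / success_rate

# Hypothetical CPU-bound function: doubling memory halves measured duration.
low_mem = cost_per_success(memory_mb=512, duration_ms=1200, success_rate=0.97)
high_mem = cost_per_success(memory_mb=1024, duration_ms=600, success_rate=0.99)
```

Both configurations burn the same GB-seconds per attempt in this example, but the higher-memory one comes out cheaper per success because fewer attempts time out and get retried.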
6) Tune logging intentionally (not minimally)
I do not recommend “turn off logs.” I recommend structured, useful, cost-conscious logging.
Practical changes:
- Move full payload dumps behind a sampled debug flag
- Log IDs and metrics by default
- Use explicit log retention (for example, shorter for noisy operational logs)
- Avoid duplicate payload logging at every layer
- Emit metrics for counts/latency instead of reconstructing them from logs
This usually reduces cost while improving signal-to-noise.
7) Pass references, not large payloads
Instead of pushing large documents or objects through multiple services:
- Store payloads in S3
- Pass a pointer in the event/workflow
- Fetch on demand
Benefits:
- lower transfer volume
- smaller event payloads
- faster orchestration payload handling
- cleaner replayability
This pattern is especially helpful in document, media, and analytics workflows.
Tradeoffs: Lambda vs Pipes vs Service Integrations vs Step Functions
This is the part teams usually want summarized, so here is the practical version I use in reviews.
Lambda is best when:
- I need custom code/business logic
- I need third-party SDKs or protocols
- I need non-trivial validation/transformation
- I need specialized retry behavior inside the function
- I need to encapsulate logic reused across multiple flows
Lambda is overkill when:
- I am only mapping fields
- I am just forwarding to another AWS service
- I can use native service integration safely and clearly
EventBridge Pipes is best when:
- The path is primarily source-to-target movement
- I need filtering and simple enrichment
- I want low operational overhead for plumbing
- I want to avoid writing a “routing Lambda”
Pipes is not the best fit when:
- I need complex multi-step orchestration
- I need long waits or human callbacks
- I need rich compensation logic across steps
Step Functions is best when:
- I need explicit orchestration and auditability
- I need retries/catches at step boundaries
- I need long-running workflows
- I need human-in-the-loop or callback patterns
- I need clear operational visualization for a business process
Step Functions can become costly when:
- I model every trivial transform as a state
- I use frequent polling instead of event-driven completion
- I keep hot paths in Standard without reviewing alternatives
- I use nested workflows where direct integrations would do
Direct service integrations are best when:
- The step is simple and deterministic
- I can express the call declaratively
- I want fewer moving parts
- I want to remove glue code and reduce operational burden
Direct integrations may be a poor fit when:
- I need complex custom serialization/validation
- I need cross-service custom logic not represented well declaratively
- The readability of the workflow would suffer too much
End-to-End Optimization Walkthrough (Baseline -> Improved)
Now I will apply the framework to the example order pipeline.
Baseline (illustrative)
- 8 Lambda invocations per order
- 22 Step Functions transitions
- verbose logs in every function
- large payload passed between multiple steps
- polling Lambda for fulfillment status
- duplicate notification paths
Review findings
Top cost hotspots:
- Step Functions state transitions from chatty orchestration
- Lambda invocations for glue transforms and simple service calls
- CloudWatch Logs ingestion due to payload dumps
- Data transfer/execution overhead from oversized payloads
- Retries multiplying costs during downstream API flakiness
Improved architecture (same reliability goals)
Changes I make
- Replace 3 glue Lambdas with:
  - Step Functions Parameters/ResultSelector
  - direct DynamoDB/EventBridge integrations
- Replace one routing Lambda with EventBridge Pipes
- Move large enrichment blobs to S3 and pass object references
- Replace polling Lambda loop with callback/event-driven completion
- Keep Step Functions states at clear failure boundaries only
- Reduce default log verbosity and set retention explicitly
- Batch non-critical notifications
What I do not change
- Idempotency controls
- DLQs and retry policies
- Key audit checkpoints
- Alarms and traces for critical dependencies
This is important. I want lower cost without creating an opaque, fragile pipeline.
Architecture pattern (review-ready version)
The optimized pattern I recommend for many teams looks like this:
- Ingress: EventBridge
- Plumbing/filtering: EventBridge Pipes (where simple)
- Orchestration: Step Functions for business process + callbacks
- Compute: Lambda only for genuine custom logic
- Storage: S3 for large payloads, DynamoDB for metadata/state
- Observability: metrics first, targeted logs, traces on critical paths
This is cost-aware because each service is doing the kind of work it is best suited for.
Implementation Discussion
Below are code examples that demonstrate how I apply the framework in practice.
1) Step Functions workflow using service integrations instead of glue Lambdas (ASL fragment)
This fragment shows a flow that:
- validates and prepares payload once
- writes to DynamoDB via a direct integration
- emits an event via EventBridge integration
- keeps Lambda only for actual custom logic
{
"Comment": "Cost-aware serverless orchestration example",
"StartAt": "ValidateAndEnrich",
"States": {
"ValidateAndEnrich": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${ValidateAndEnrichFnArn}",
"Payload.$": "$"
},
"OutputPath": "$.Payload",
"Next": "PutOrderMetadata",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
"IntervalSeconds": 2,
"BackoffRate": 2.0,
"MaxAttempts": 4
}
]
},
"PutOrderMetadata": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem",
"Parameters": {
"TableName": "${OrdersTable}",
"Item": {
"pk": { "S.$": "States.Format('ORDER#{}', $.orderId)" },
"sk": { "S": "META" },
"status": { "S": "RECEIVED" },
"customerId": { "S.$": "$.customerId" },
"payloadRef": { "S.$": "$.payloadRef.s3Uri" },
"createdAt": { "S.$": "$.timestamps.receivedAt" }
},
"ConditionExpression": "attribute_not_exists(pk)"
},
"ResultPath": null,
"Next": "PublishOrderReceived"
},
"PublishOrderReceived": {
"Type": "Task",
"Resource": "arn:aws:states:::events:putEvents",
"Parameters": {
"Entries": [
{
"Source": "com.example.orders",
"DetailType": "order.received",
"EventBusName": "${EventBusName}",
"Detail": {
"orderId.$": "$.orderId",
"customerId.$": "$.customerId",
"payloadRef.$": "$.payloadRef",
"correlationId.$": "$.correlationId"
}
}
]
},
"ResultPath": null,
"Next": "WaitForFulfillmentCallback"
},
"WaitForFulfillmentCallback": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Parameters": {
"FunctionName": "${CreateFulfillmentRequestFnArn}",
"Payload": {
"taskToken.$": "$$.Task.Token",
"orderId.$": "$.orderId",
"payloadRef.$": "$.payloadRef",
"correlationId.$": "$.correlationId"
}
},
"TimeoutSeconds": 900,
"ResultPath": "$.fulfillment",
"Next": "FinalizeOrder"
},
"FinalizeOrder": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:dynamodb:updateItem",
"Parameters": {
"TableName": "${OrdersTable}",
"Key": {
"pk": { "S.$": "States.Format('ORDER#{}', $.orderId)" },
"sk": { "S": "META" }
},
"UpdateExpression": "SET #s = :s, fulfillmentId = :f, fulfilledAt = :t",
"ExpressionAttributeNames": {
"#s": "status"
},
"ExpressionAttributeValues": {
":s": { "S": "FULFILLED" },
":f": { "S.$": "$.fulfillment.fulfillmentId" },
":t": { "S.$": "$.fulfillment.completedAt" }
}
},
"End": true
}
}
}
Why this is cost-aware
- I removed two common glue Lambdas (put item, publish event)
- I use a callback instead of chatty polling
- I keep orchestration only where it adds reliability/visibility
2) EventBridge Pipes (CDK TypeScript) for low-overhead routing/filtering
This example shows a common “plumbing” path I do not want to solve with a custom Lambda.
// Fragment from a CDK Stack; `ordersQueue` and `orderWorkflow` are defined elsewhere.
import * as pipes from 'aws-cdk-lib/aws-pipes';
import * as iam from 'aws-cdk-lib/aws-iam';
const pipeRole = new iam.Role(this, 'OrdersPipeRole', {
assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com'),
});
pipeRole.addToPolicy(new iam.PolicyStatement({
actions: [
'sqs:ReceiveMessage',
'sqs:DeleteMessage',
'sqs:GetQueueAttributes',
'sqs:ChangeMessageVisibility',
],
resources: [ordersQueue.queueArn],
}));
pipeRole.addToPolicy(new iam.PolicyStatement({
actions: ['states:StartExecution'],
resources: [orderWorkflow.stateMachineArn],
}));
new pipes.CfnPipe(this, 'FilteredOrdersPipe', {
roleArn: pipeRole.roleArn,
source: ordersQueue.queueArn,
target: orderWorkflow.stateMachineArn,
sourceParameters: {
sqsQueueParameters: {
batchSize: 10,
maximumBatchingWindowInSeconds: 5,
},
filterCriteria: {
filters: [
{
pattern: JSON.stringify({
body: {
eventType: ['OrderCreated'],
priority: ['standard', 'express'],
},
}),
},
],
},
},
targetParameters: {
stepFunctionStateMachineParameters: {
invocationType: 'FIRE_AND_FORGET',
},
inputTemplate: JSON.stringify({
orderId: '<$.body.orderId>',
customerId: '<$.body.customerId>',
priority: '<$.body.priority>',
payloadRef: {
s3Uri: '<$.body.payloadRef.s3Uri>',
},
correlationId: '<$.body.correlationId>',
}),
},
});
Why this is cost-aware
- No routing Lambda to maintain
- Built-in filtering avoids unnecessary workflow starts
- Batching reduces per-message overhead while preserving throughput
3) Lambda right-sizing profiling helper (Python)
I use a simple embedded metric approach to log what matters for right-sizing reviews. This is not a full observability solution, but it is enough to drive architecture decisions.
import json
import os
import time
from datetime import datetime, timezone
def emit_emf(namespace: str, metrics: dict, dimensions: dict):
emf = {
"_aws": {
"Timestamp": int(time.time() * 1000),
"CloudWatchMetrics": [
{
"Namespace": namespace,
"Dimensions": [list(dimensions.keys())],
"Metrics": [{"Name": k, "Unit": "None"} for k in metrics.keys()],
}
],
},
**dimensions,
**metrics,
}
print(json.dumps(emf))
def lambda_handler(event, context):
start = time.perf_counter()
payload_size_bytes = len(json.dumps(event).encode("utf-8"))
# ... custom logic here ...
# Keep this function for genuine business logic, not simple routing
duration_ms = (time.perf_counter() - start) * 1000
configured_memory_mb = int(os.getenv("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", "128"))
emit_emf(
namespace="ServerlessReview",
dimensions={
"FunctionName": context.function_name,
"Environment": os.getenv("STAGE", "prod"),
},
metrics={
"DurationMs": round(duration_ms, 2),
"PayloadSizeBytes": payload_size_bytes,
"ConfiguredMemoryMB": configured_memory_mb,
},
)
return {
"ok": True,
"processedAt": datetime.now(timezone.utc).isoformat()
}
How I use this data in reviews
I compare:
- duration by memory setting
- duration by payload size bucket
- timeout proximity
- retry rates correlated with duration
Then I decide whether to:
- increase memory (reduce duration and total cost)
- decrease memory (if clearly overprovisioned)
- batch inputs
- move logic out of Lambda entirely
4) Production-safe logging pattern (Python)
This pattern reduces log cost without sacrificing debuggability.
import json
import os
import random
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
DEBUG_SAMPLE_RATE = float(os.getenv("DEBUG_SAMPLE_RATE", "0.01")) # 1% default
def should_log_debug_payload() -> bool:
if LOG_LEVEL == "DEBUG":
return True
return random.random() < DEBUG_SAMPLE_RATE
def log_info(message: str, **kwargs):
print(json.dumps({"level": "INFO", "message": message, **kwargs}))
def log_debug(message: str, **kwargs):
if LOG_LEVEL == "DEBUG":
print(json.dumps({"level": "DEBUG", "message": message, **kwargs}))
def handler(event, context):
correlation_id = event.get("correlationId", "unknown")
order_id = event.get("orderId", "unknown")
log_info(
"processing_order",
correlationId=correlation_id,
orderId=order_id
)
if should_log_debug_payload():
print(json.dumps({
"level": "DEBUG_SAMPLED",
"message": "event_payload_sample",
"correlationId": correlation_id,
"orderId": order_id,
"payload": event
}))
# ... business logic ...
log_info(
"order_processed",
correlationId=correlation_id,
orderId=order_id,
status="success"
)
return {"status": "ok"}
Why I like this pattern
- Production logs stay useful and cheaper
- I still have sampled payload visibility for troubleshooting
- It works well with structured log queries and metrics extraction
Workload Profiling and Right-Sizing Patterns
This is the part that makes the review credible. I do not want architecture recommendations based on intuition alone.
1) Profile by workload class, not just service
I segment workloads into classes, for example:
- Latency-sensitive sync API path
- Async event processing path
- Burst batch path
- Low-frequency admin path
Each class has different priorities and cost levers.
Example:
- Sync API path may justify higher Lambda memory for lower latency
- Async batch path may prioritize batching and reduced transitions
- Admin path may not be worth heavy optimization effort
2) Build a “cost to reliability” tradeoff table
For each proposed change, I document:
- Change
- Expected cost impact
- Reliability impact
- Operability impact
- Rollback plan
- Measurement plan
This keeps reviews grounded and avoids “optimize everything” behavior.
Example entries
- Replace routing Lambda with Pipe -> lower cost, lower ops burden, neutral reliability, easy rollback
- Collapse 5 workflow states into 2 -> moderate cost savings, possible loss of debug granularity, test carefully
- Batch writes by 25 -> potentially high savings, changes failure semantics, requires idempotency validation
3) Right-size from p95 and timeout headroom, not p50 alone
A function that looks fine at p50 may:
- time out at p99
- retry frequently during bursts
- create downstream contention that increases total cost
I tune using:
- p95/p99 duration
- timeout headroom
- retry counts
- dependency latency distribution
This avoids false savings that increase retries and operational incidents.
4) Profile payload size and serialization overhead
In event-driven systems, large payloads can create hidden costs:
- more transfer
- longer execution time
- larger logs
- larger Step Functions state payloads
- more memory pressure in Lambda
I explicitly bucket payload sizes and look for outliers.
A common improvement is:
- move large optional fields to S3
- pass references
- fetch lazily only in the steps that need them
5) Profile retries as a cost multiplier
Retries are often the hidden multiplier in “unexpected” serverless cost growth.
I measure:
- retry rate per dependency
- retry amplification across layers (SDK retry + Lambda retry + workflow retry)
- cost of failed attempts vs successful attempts
- DLQ rate and replay volume
Then I redesign retry placement:
- retry at the layer with the best context
- avoid stacking retries unnecessarily
- use backoff and circuit-breaking patterns where appropriate
This improves both cost and stability.
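The amplification itself is just multiplication, which is why it sneaks up on teams. A toy illustration with hypothetical retry limits:

```python
# Worst-case attempt amplification when retry policies stack: each layer can
# re-drive every attempt of the layer below it. Retry limits are illustrative.
def worst_case_attempts(*max_attempts_per_layer: int) -> int:
    total = 1
    for attempts in max_attempts_per_layer:
        total *= attempts
    return total

# SDK allows 3 attempts, Lambda async retry allows 2, workflow retry allows 4:
# a single event can drive up to 24 calls against the flaky dependency.
calls = worst_case_attempts(3, 2, 4)
```

Retrying at one well-chosen layer with backoff usually beats three layers each retrying "just to be safe."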
Review Checklist I Actually Use
Here is the short version of the checklist I use in architecture reviews.
Hotspots
- [ ] Which cost components scale with request count?
- [ ] Which scale with duration, payload size, retries, or fan-out?
- [ ] Where are we logging too much?
- [ ] Where are we moving data unnecessarily?
Architecture fit
- [ ] Is Lambda only used where custom logic is truly needed?
- [ ] Can any glue Lambda be replaced with Pipes or direct integration?
- [ ] Is Step Functions state granularity intentional?
- [ ] Are polling loops justified, or can callbacks/events replace them?
Profiling
- [ ] Do we have p95/p99 duration data?
- [ ] Do we know payload size distribution?
- [ ] Do we know retry rate by dependency?
- [ ] Do we know burst concurrency patterns?
Reliability guardrails
- [ ] Will the optimization change failure semantics?
- [ ] Is idempotency preserved?
- [ ] Are DLQ/replay paths intact?
- [ ] Is observability still sufficient?
Rollout
- [ ] Can we canary or phase the change?
- [ ] Do we have before/after metrics defined?
- [ ] Do we have a rollback path?
Common Mistakes in Cost Reviews
I see these often, so I call them out explicitly.
Mistake 1: Only tuning Lambda memory
Useful, but incomplete. The biggest cost issue may be:
- Step Functions transitions
- logs
- retries
- data movement
- unnecessary service hops
Mistake 2: Replacing orchestration with one “mega Lambda”
This can reduce one visible cost but create:
- worse reliability
- harder retries
- worse observability
- slower delivery velocity
Mistake 3: Optimizing by averages
p50 hides burst pain and retry amplification.
Mistake 4: Ignoring developer productivity
A design that is slightly cheaper but much harder to maintain may cost more overall.
Mistake 5: Removing logs indiscriminately
The right answer is better logging design, not blind log suppression.
Closing Thoughts
A strong cost-aware serverless review is not about chasing the cheapest architecture on paper. It is about making architecture choices that fit the workload, preserve reliability, and keep operating costs proportional to business value.
The framework I use is simple:
- Profile the workload
- Map the cost path
- Find hotspots
- Apply architecture changes first
- Right-size from measurements
- Re-measure and document tradeoffs
If I am reviewing a production serverless system, I want the outcome to be more than a list of tips. I want a clear plan that the team can implement safely, measure, and iterate on.
That is what makes the review practical.