Measuring What Users Experience Across API Gateway -> Lambda -> DynamoDB -> EventBridge
TL;DR
Serverless systems rarely fail at a single component. Failures occur at the junctures between managed services. Yet most SLO implementations still measure API Gateway, Lambda and DynamoDB in isolation.
This article shows how to define and operate composite, end-to-end SLOs for a real serverless chain. You'll see how to derive availability and latency SLIs across multiple AWS services, calculate error budgets correctly, wire burn-rate alerts, and ship a working dashboard using CloudWatch metric math and infrastructure as code.
Introduction: Why "Everything Is Green" Is Still Not Good Enough
If you've operated a serverless system for long enough, you'll eventually run into this situation:
- API Gateway shows 99.9% availability
- Lambda error rate looks fine
- DynamoDB has no throttles
- EventBridge metrics are quiet
Despite this, users are retrying operations, workflows remain unfinished, and crucial events are failing to reach downstream systems.
Nothing is individually broken. The system is. The problem is not observability coverage. It's how reliability is modeled.
Serverless architectures push complexity into managed services. That's a good trade until reliability is measured per service instead of per request. At that point, SLOs stop representing user experience and start representing dashboards.
The Gap We're Trying to Close
There is plenty of content online about SLOs, but what's missing is specificity.
What I mostly find:
- Generic SLO explainers based on microservices
- Burn-rate math explained with Prometheus examples
- AWS blog posts measuring one service at a time
- "Composite SLO" is mentioned as a concept
What’s consistently absent:
- A step-by-step SLO model for serverless request chains
- Clear guidance on what counts as failure when managed services retry, buffer, or partially succeed
- Concrete examples using CloudWatch metric math
- A way to combine sync and async paths into a single availability signal
This article will close the gap.
The System We're Measuring
We'll use a very common production pattern: a synchronous API Gateway endpoint invokes a Lambda function, which writes an item to DynamoDB and then publishes an event to EventBridge.
From a user’s perspective, the request is successful only if:
- API Gateway accepts and processes the request
- Lambda executes successfully
- The DynamoDB write succeeds
- The event is published to EventBridge
This is a deliberately strict definition, and it drives everything that follows: any outcome short of complete success counts as a failure, even when the HTTP response code is 200.
Why Per-Service SLOs Break Down in Serverless
Per-service SLOs assume clean failure boundaries. Serverless doesn’t have those.
Consider this real scenario:
- API Gateway returns 200
- Lambda executes successfully
- DynamoDB write succeeds
- EventBridge PutEvents partially fails
Your API metrics look perfect. Your Lambda metrics look perfect. Your DynamoDB metrics look perfect.
Your business workflow is broken.
This is why composite SLOs are not "advanced"; they're table stakes for event-driven systems.
Defining the Composite SLO
Availability Objective
99.5% of requests must complete end-to-end successfully over a rolling 30-day window
A request is counted as successful only if all four steps succeed.
Important Scope Clarification: Note that for the asynchronous step (EventBridge), "success" means the event was successfully published to the bus. This SLO measures the promise of work, not the eventual consumption by downstream subscribers.
If you have critical downstream consumers, they need their own separate SLOs. Trying to jam async consumption into a synchronous API availability metric will only create noise.
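To make that separation concrete, here's a minimal sketch of a delivery-side signal for one consumer, using EventBridge's rule-level FailedInvocations metric. The rule name, alarm name, and thresholds are assumptions, and in practice you'd fold this into the consumer's own SLO rather than page on raw counts:

OrdersConsumerDeliveryFailures:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-consumer-delivery-failures   # assumed name
    Namespace: AWS/Events
    MetricName: FailedInvocations                  # failed deliveries to this rule's targets
    Dimensions:
      - Name: RuleName
        Value: orders-created-to-fulfillment       # assumed rule name
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 3
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching

This measures consumption, which is exactly what the API-facing SLO above deliberately leaves out.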
Latency Objective
95% of successful requests must complete within 800 ms
Latency here reflects user-visible delay, not async processing time. EventBridge publishing is included in availability, not latency.
This distinction matters more than most teams realize.
Choosing SLIs That Actually Map to Reality
We will compose existing AWS metrics.
Availability Signals
- API Gateway
  - Count
  - 5XXError
- Lambda
  - Invocations
  - Errors
- DynamoDB
  - UserErrors
  - SystemErrors
- EventBridge
  - PutEventsFailedEntriesCount
These metrics already exist. The work is in combining them correctly.
Composite Availability: Turning Fragments Into a Single Signal
The core question is simple: Out of all incoming requests, how many completed the full chain?
We model this explicitly using CloudWatch metric math.
Composite Availability Expression
CompositeAvailabilityMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-composite-availability   # assumed name
    # Required alarm fields (assumed values). This resource mainly materializes
    # the SLI; paging is driven by the burn-rate alarms later in the article.
    ComparisonOperator: LessThanThreshold
    Threshold: 0.995
    EvaluationPeriods: 5
    TreatMissingData: notBreaching
    ActionsEnabled: false
    Metrics:
      - Id: totalRequests
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum
      - Id: apiFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum
      - Id: lambdaFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 60
          Stat: Sum
      - Id: eventFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Events
            # Failed PutEvents entries match the "published to the bus" definition;
            # FailedInvocations would measure delivery to downstream rule targets instead.
            MetricName: PutEventsFailedEntriesCount
            Dimensions:
              - Name: EventBusName
                Value: orders-bus   # assumed bus name
          Period: 60
          Stat: Sum
      - Id: availability
        # IF(...) guards against division by zero when there is no traffic
        Expression: "1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests == 0, 1, totalRequests))"
        Label: CompositeAvailability
        ReturnData: true
This produces a single availability SLI that reflects user reality, not service health. One caveat: a failed DynamoDB write only shows up here if the handler lets the exception propagate into a Lambda error; if your code swallows write failures, add a DynamoDB SystemErrors term to the numerator as well.
A Note on "Strict" Math & Retries
You might notice this formula is ruthless: 1 - (failures / total). In a serverless world, retries happen at several layers: clients and API Gateway retry failed requests, the AWS SDK inside the handler retries throttled DynamoDB and EventBridge calls, and asynchronous invocations are retried by Lambda itself.
If a request only succeeds after failed attempts, those failed attempts still count against availability. This is intentional. Hidden retries burn your error budget and increase latency. By penalizing retries in your availability score, you force the team to fix the underlying flakiness rather than letting the retry policy hide it.
Composite Latency: Measuring the Critical Path
Latency is additive across synchronous hops.
We use percentile metrics to avoid averages masking tail behavior.
CompositeLatencyMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-composite-latency-p95   # assumed name
    # Assumed alarm wiring for the 800 ms objective
    ComparisonOperator: GreaterThanThreshold
    Threshold: 800
    EvaluationPeriods: 5
    TreatMissingData: notBreaching
    Metrics:
      - Id: apiLatency
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Latency
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: p95
      - Id: lambdaDuration
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Duration
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 60
          Stat: p95
      - Id: dynamoLatency
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/DynamoDB
            MetricName: SuccessfulRequestLatency
            Dimensions:
              - Name: TableName
                Value: Orders
              - Name: Operation   # this metric is emitted per operation
                Value: PutItem
          Period: 60
          Stat: p95
      - Id: totalLatency
        Expression: "apiLatency + lambdaDuration + dynamoLatency"
        Label: EndToEndLatency
        ReturnData: true
Two caveats keep this honest: summing per-hop p95 values approximates, and usually overstates, the true end-to-end p95, and API Gateway's Latency metric already includes the synchronous integration time, so treat EndToEndLatency as a conservative upper bound. Even so, it prevents the common error of declaring per-service p95 "acceptable" while users are still experiencing delays.
Error Budgets and Burn Rate (Where SLOs Become Useful)
For a 99.5% SLO over 30 days:
- Total error budget: 0.5%
- Budget in minutes: ~216 minutes/month
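For anyone checking the arithmetic, a 30-day window has 43,200 minutes:

(1 - 0.995) × 30 days × 24 h × 60 min = 0.005 × 43,200 min = 216 minutes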
We use multi-window burn-rate alerts to avoid noise.
Fast Burn (Page): If we burn the monthly budget in under 2 days, something is seriously wrong.
Slow Burn (Ticket): If we’re slowly bleeding reliability, the system needs attention, but not at 2am.
These alerts are driven by the composite availability metric, not individual services. That alignment is the entire point.
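As a hedged sketch of the fast-burn page, here's one way to express it as a CloudFormation alarm over a 1-hour window. It reuses the same failure metrics as the composite availability metric; the alarm name, the bus name, and the 14.4 threshold (the common "budget gone in roughly two days" multiple) are assumptions you should tune:

FastBurnAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-composite-slo-fast-burn   # assumed name
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 14.4        # burn-rate multiple, not an error percentage
    EvaluationPeriods: 1
    TreatMissingData: notBreaching
    Metrics:
      - Id: totalRequests
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 3600     # 1-hour window
          Stat: Sum
      - Id: apiFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 3600
          Stat: Sum
      - Id: lambdaFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 3600
          Stat: Sum
      - Id: eventFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Events
            MetricName: PutEventsFailedEntriesCount
            Dimensions:
              - Name: EventBusName
                Value: orders-bus   # assumed bus name
          Period: 3600
          Stat: Sum
      - Id: fastBurnRate
        Label: FastBurnRate
        ReturnData: true
        # burn rate = observed error rate / allowed error rate (1 - 0.995)
        Expression: "((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests == 0, 1, totalRequests)) / 0.005"

In practice you'd pair this 1-hour window with a short 5-minute window in an AWS::CloudWatch::CompositeAlarm to cut flapping, and build the slow-burn ticket the same way with a longer window and a lower threshold.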
Dashboard Design: Fewer Charts, Better Decisions
To keep investigation time short and focused, the dashboard needs only a handful of widgets (a CloudFormation sketch of the first one follows the list):
- Composite availability (rolling 30 days)
- Error budget remaining
- End-to-end latency p95
- API 5xx
- Lambda errors
- EventBridge failed entries
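Here's a hedged sketch of the first widget as an AWS::CloudWatch::Dashboard resource; the dashboard name, bus name, and layout values are assumptions:

SloDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: orders-composite-slo   # assumed name
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Composite availability (objective: 99.5%)",
              "region": "${AWS::Region}",
              "view": "timeSeries",
              "stat": "Sum",
              "period": 300,
              "metrics": [
                [ { "id": "availability", "label": "CompositeAvailability",
                    "expression": "1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests == 0, 1, totalRequests))" } ],
                [ "AWS/ApiGateway", "Count", "ApiName", "OrdersAPI", { "id": "totalRequests", "visible": false } ],
                [ "AWS/ApiGateway", "5XXError", "ApiName", "OrdersAPI", { "id": "apiFailures", "visible": false } ],
                [ "AWS/Lambda", "Errors", "FunctionName", "OrdersHandler", { "id": "lambdaFailures", "visible": false } ],
                [ "AWS/Events", "PutEventsFailedEntriesCount", "EventBusName", "orders-bus", { "id": "eventFailures", "visible": false } ]
              ]
            }
          }
        ]
      }

The remaining widgets follow the same pattern: define the hidden component metrics, then render only the expression.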
Best Practices and Anti-Patterns
Best Practices
- Define failure conservatively
- Use percentiles, not averages
- Treat async failures as first-class reliability issues
- Alert on burn rate, not raw errors
Anti-Patterns
- Ignoring retries in SLI math
- Counting HTTP 200 as success unconditionally
- Measuring latency per service in isolation
- Treating EventBridge as "eventually reliable"
Conclusion
Serverless systems fail in ways that traditional SLO models don’t capture. Composite SLOs fix that by forcing reliability to align with user experience instead of service boundaries.
If you run event-driven systems and still rely on per-service health, you're measuring the wrong thing.