Composite SLOs for Serverless Event-Driven Systems

Measuring What Users Experience Across API Gateway -> Lambda -> DynamoDB -> EventBridge

TL;DR

Serverless systems rarely fail at a single component. Failures occur at the junctures between managed services. Yet most SLO implementations still measure API Gateway, Lambda and DynamoDB in isolation.
This article shows how to define and operate composite, end-to-end SLOs for a real serverless chain. You'll see how to derive availability and latency SLIs across multiple AWS services, calculate error budgets correctly, wire burn-rate alerts, and ship a working dashboard using CloudWatch metric math and infrastructure as code.


Introduction: Why "Everything Is Green" Is Still Not Good Enough

If you have operated a serverless system for long enough, you'll eventually run into this situation:

  • API Gateway shows 99.9% availability
  • Lambda error rate looks fine
  • DynamoDB has no throttles
  • EventBridge metrics are quiet

Despite this, users are retrying operations, workflows remain unfinished, and crucial events are failing to reach downstream systems.

Nothing is individually broken. The system is. The problem is not observability coverage. It's how reliability is modeled.

Serverless architectures push complexity into managed services. That's a good trade until reliability is measured per service instead of per request. At that point, SLOs stop representing user experience and start representing dashboards.


The Gap We're Trying to Close

There is a lot of content online about SLOs, but what's missing is specificity.
What I mostly find:

  • Generic SLO explainers based on microservices
  • Burn-rate math explained with Prometheus examples
  • AWS blog posts measuring one service at a time
  • "Composite SLO" is mentioned as a concept

What’s consistently absent:

  • A step-by-step SLO model for serverless request chains
  • Clear guidance on what counts as failure when managed services retry, buffer, or partially succeed
  • Concrete examples using CloudWatch metric math
  • A way to combine sync and async paths into a single availability signal

This article closes that gap.


The System We're Measuring

We'll use a very common production pattern:

[Diagram: Common Serverless Architecture Flow: API Gateway -> Lambda -> DynamoDB -> EventBridge]

From a user’s perspective, the request is successful only if:

  • API Gateway accepts and processes the request
  • Lambda executes successfully
  • The DynamoDB write succeeds
  • The event is published to EventBridge

This is a deliberately strict definition, and it drives everything that follows: any result short of complete success counts as a failure, even if the HTTP response code is 200.


Why Per-Service SLOs Break Down in Serverless

Per-service SLOs assume clean failure boundaries. Serverless doesn’t have those.

Consider this real scenario:

  • API Gateway returns 200
  • Lambda executes successfully
  • DynamoDB write succeeds
  • EventBridge PutEvents partially fails

Your API metrics look perfect. Your Lambda metrics look perfect. Your DynamoDB metrics look perfect.
Your business workflow is broken.
This is why composite SLOs are not "advanced"; they're table stakes for event-driven systems.


Defining the Composite SLO

Availability Objective

99.5% of requests must complete end-to-end successfully over a rolling 30-day window

A request is counted as successful only if all four steps succeed.

Important Scope Clarification: Note that for the asynchronous step (EventBridge), "success" means the event was successfully published to the bus. This SLO measures the promise of work, not the eventual consumption by downstream subscribers.
If you have critical downstream consumers, they need their own separate SLOs. Trying to jam async consumption into a synchronous API availability metric will only create noise.
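
If you do want an early signal for downstream delivery, a separate alarm is a reasonable starting point. A minimal sketch, assuming a hypothetical rule name (orders-downstream-rule) that routes the published events to its consumers:

DownstreamDeliveryAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-downstream-delivery-failures
    Namespace: AWS/Events
    MetricName: FailedInvocations      # deliveries EventBridge could not complete to the rule's targets
    Dimensions:
      - Name: RuleName
        Value: orders-downstream-rule  # hypothetical rule name
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 3
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching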

Latency Objective

95% of successful requests must complete within 800 ms

Latency here reflects user-visible delay, not async processing time. EventBridge publishing is included in availability, not latency.
This distinction matters more than most teams realize.


Choosing SLIs That Actually Map to Reality

We will compose existing AWS metrics.

Availability Signals

  • API Gateway
    • Count
    • 5XXError
  • Lambda
    • Invocations
    • Errors
  • DynamoDB
    • UserErrors
  • EventBridge
    • PutEventsFailedEntries

These metrics already exist. The work is in combining them correctly.


Composite Availability: Turning Fragments Into a Single Signal

The core question is simple: Out of all incoming requests, how many completed the full chain?
We model this explicitly using CloudWatch metric math.

Composite Availability Expression

CompositeAvailabilityMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
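    # NOTE: Alarm evaluation properties (AlarmName, Threshold, ComparisonOperator,
    # EvaluationPeriods) are omitted here to keep the focus on the metric math.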
    Metrics:
      - Id: totalRequests
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum
        ReturnData: false

      - Id: apiFailures
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum
        ReturnData: false

      - Id: lambdaFailures
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 60
          Stat: Sum
        ReturnData: false

      - Id: eventFailures
        MetricStat:
          Metric:
            Namespace: AWS/Events
            MetricName: PutEventsFailedEntries   # publish failures, matching the SLI defined above
          Period: 60
          Stat: Sum
        ReturnData: false

      - Id: availability
        Expression: "1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests > 0, totalRequests, 1))"
        Label: CompositeAvailability
        ReturnData: true

This produces a single availability SLI that reflects user reality, not service health.

A Note on "Strict" Math & Retries
You might notice this formula is ruthless: 1 - (failures / total). In a serverless world, services like Lambda often retry automatically on failure.

If a Lambda fails twice and succeeds on the third try, this metric counts it as a failure. This is intentional. Hidden retries burn your error budget and increase latency. By penalizing retries in your availability score, you force the team to fix the underlying flakiness rather than letting the retry policy hide it.


Composite Latency: Measuring the Critical Path

Latency is additive across synchronous hops.

We use percentile metrics to avoid averages masking tail behavior.

CompositeLatencyMetric:
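  # Metric-math fragment: attach these Metrics to an AWS::CloudWatch::Alarm (for example with a
  # Threshold of 800 to match the latency objective) or embed them in a dashboard widget.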
  Metrics:
    - Id: apiLatency
      MetricStat:
        Metric:
          Namespace: AWS/ApiGateway
          MetricName: Latency
          Dimensions:
            - Name: ApiName
              Value: OrdersAPI
        Period: 60
        Stat: p95

    - Id: lambdaDuration
      MetricStat:
        Metric:
          Namespace: AWS/Lambda
          MetricName: Duration
          Dimensions:
            - Name: FunctionName
              Value: OrdersHandler
        Period: 60
        Stat: p95

    - Id: dynamoLatency
      MetricStat:
        Metric:
          Namespace: AWS/DynamoDB
          MetricName: SuccessfulRequestLatency
          Dimensions:
            - Name: TableName
              Value: Orders
            - Name: Operation
              Value: PutItem    # SuccessfulRequestLatency is reported per operation; assuming the write is a PutItem
        Period: 60
        Stat: p95

    - Id: totalLatency
      Expression: "apiLatency + lambdaDuration + dynamoLatency"
      Label: EndToEndLatency
      ReturnData: true


This prevents the common mistake of declaring "p95 latency is acceptable" per service while users still experience end-to-end delays. (Summing per-hop p95 values is an approximation that errs on the high side, since API Gateway's Latency metric already includes the Lambda integration time; exact end-to-end percentiles require per-request tracing.)


Error Budgets and Burn Rate (Where SLOs Become Useful)

For a 99.5% SLO over 30 days:

  • Total error budget: 0.5% of requests
  • Budget as downtime-equivalent: 30 days × 24 h × 60 min × 0.005 ≈ 216 minutes/month

We use multi-window burn-rate alerts to avoid noise.

Fast Burn (Page): If we burn the monthly budget in under 2 days, something is seriously wrong.
Slow Burn (Ticket): If we’re slowly bleeding reliability, the system needs attention, but not at 2am.

These alerts are driven by the composite availability metric, not individual services. That alignment is the entire point.
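
A minimal sketch of the fast-burn page as a CloudFormation alarm, reusing the OrdersAPI and OrdersHandler names from the availability expression. The 14.4 threshold over a one-hour window is a common default (it roughly corresponds to exhausting the 30-day budget in about two days), not a prescription from this article:

FastBurnAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-composite-slo-fast-burn
    # Burn rate = observed error rate / (1 - SLO). For a 99.5% SLO the budget rate is 0.005.
    ComparisonOperator: GreaterThanThreshold
    Threshold: 14.4
    EvaluationPeriods: 1
    TreatMissingData: notBreaching
    Metrics:
      - Id: totalRequests
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 3600
          Stat: Sum
      - Id: apiFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 3600
          Stat: Sum
      - Id: lambdaFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 3600
          Stat: Sum
      - Id: eventFailures
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/Events
            MetricName: PutEventsFailedEntries
          Period: 3600
          Stat: Sum
      - Id: burnRate
        Expression: "((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests > 0, totalRequests, 1)) / 0.005"
        Label: ErrorBudgetBurnRate
        ReturnData: true

A slow-burn companion works the same way with a longer window and a lower threshold, routed to a ticket queue instead of a pager.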


Dashboard Design: Fewer Charts, Better Decisions

To keep investigation time short and focused, a handful of widgets is enough (a minimal dashboard sketch follows the list):

  • Composite availability (rolling 30 days)
  • Error budget remaining
  • End-to-end latency p95
  • API 5xx
  • Lambda errors
  • EventBridge failed entries
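
A minimal sketch of how the first widget can be captured as code, assuming the same resource names as above and us-east-1 as a placeholder region; the remaining widgets follow the same pattern:

CompositeSLODashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: orders-composite-slo
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Composite availability",
              "view": "timeSeries",
              "region": "us-east-1",
              "stat": "Sum",
              "period": 60,
              "metrics": [
                [ "AWS/ApiGateway", "Count", "ApiName", "OrdersAPI", { "id": "totalRequests", "visible": false } ],
                [ "AWS/ApiGateway", "5XXError", "ApiName", "OrdersAPI", { "id": "apiFailures", "visible": false } ],
                [ "AWS/Lambda", "Errors", "FunctionName", "OrdersHandler", { "id": "lambdaFailures", "visible": false } ],
                [ "AWS/Events", "PutEventsFailedEntries", { "id": "eventFailures", "visible": false } ],
                [ { "expression": "1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / IF(totalRequests > 0, totalRequests, 1))", "label": "CompositeAvailability" } ]
              ]
            }
          }
        ]
      }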

Best Practices and Anti-Patterns

Best Practices

  • Define failure conservatively
  • Use percentiles, not averages
  • Treat async failures as first-class reliability issues
  • Alert on burn rate, not raw errors

Anti-Patterns

  • Ignoring retries in SLI math
  • Counting HTTP 200 as success unconditionally
  • Measuring latency per service in isolation
  • Treating EventBridge as "eventually reliable"

Conclusion

Serverless systems fail in ways that traditional SLO models don’t capture. Composite SLOs fix that by forcing reliability to align with user experience instead of service boundaries.

If you run event-driven systems and still rely on per-service health, you're measuring the wrong thing.
