DEV Community

Gunnar Grosch

Chaos Engineering for AWS Lambda: failure-lambda 1.0

I wrote the first version of failure-lambda back in 2019. The idea was simple: inject faults into AWS Lambda functions so you can test how your system behaves when things go wrong. Latency spikes, exceptions, blocked network calls. The kind of failures that happen in production whether you're ready for them or not.

That version worked. People used it. But the codebase was showing its age. JavaScript with no types. AWS SDK v2. A flat configuration format that only allowed one failure mode at a time. And it only worked with Node.js.

failure-lambda 1.0 is a ground-up rewrite. TypeScript, AWS SDK v3, a feature flag configuration model, two new failure modes (timeout and corruption), and a Lambda Layer that brings fault injection to any managed runtime with zero code changes.

Why Chaos Engineering for Lambda?

If you're building on Lambda, you're building on a managed service. You don't manage servers, but you still manage dependencies. Your function calls DynamoDB, S3, third-party APIs, other microservices. Any of those can be slow, unreliable, or unavailable.

The question isn't whether failures will happen. It's whether your system handles them gracefully when they do. Does your function retry correctly? Does your circuit breaker trip? Does your API return a useful error message instead of a 500?

Failure injection lets you answer those questions before your users do. Enable latency injection and watch your downstream timeouts. Block a dependency with the denylist and see if your fallback logic works. Return a 503 and check that your retry policy backs off properly.

The important part: you control when and how these failures happen. Start with one mode at a low percentage in a test environment. Increase gradually. Build confidence that your system does what you think it does.

What's New

The short version: everything. TypeScript with full type definitions. AWS SDK v3. Two new failure modes.

The old format was flat: one failure mode, one rate, one toggle. The new format treats each mode as an independent feature flag. You can enable latency injection at 50% and DNS denylist at 100% simultaneously:

{
  "latency": {"enabled": true, "percentage": 50, "min_latency": 200, "max_latency": 500},
  "denylist": {"enabled": true, "deny_list": ["s3.*.amazonaws.com"]}
}
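The percentage field makes each flag probabilistic. A few lines of Python sketch the per-mode, per-invocation decision; this is a hypothetical simplification, not the library's actual implementation:

```python
import random

def should_inject(flag: dict) -> bool:
    """Decide whether one feature flag fires on this invocation.

    Hypothetical helper mirroring the flag semantics described above:
    a mode fires only when enabled, and then with the configured probability.
    """
    if not flag.get("enabled", False):
        return False
    percentage = flag.get("percentage", 100)
    # random.random() is in [0, 1), so a percentage of 100 always fires
    # and a percentage of 0 never does.
    return random.random() * 100 < percentage

config = {
    "latency": {"enabled": True, "percentage": 50},
    "denylist": {"enabled": True},  # no percentage: treat as always-on
}

for mode, flag in config.items():
    print(mode, should_inject(flag))
```

Because each mode rolls its own dice, enabling latency at 50% and denylist at 100% means roughly half of invocations get both faults and the other half get only the denylist.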

Seven failure modes:

| Mode | Effect |
| --- | --- |
| latency | Adds random delay between configured bounds |
| timeout | Sleeps until the function is about to time out |
| exception | Throws an exception with a configurable message |
| statuscode | Returns a specific HTTP status code |
| diskspace | Fills /tmp to consume available disk space |
| denylist | Blocks network calls to matching hostnames |
| corruption | Mangles the response body after the handler returns |

Beyond the modes:

  • Event-based targeting. Match conditions restrict injection to specific requests. Only corrupt GET requests to the prod stage. Only add latency to requests hitting /api. Conditions support exact match, exists, startsWith, and regex operators.

    {
      "latency": {
        "enabled": true,
        "min_latency": 200,
        "max_latency": 500,
        "match": [{"path": "requestContext.http.path", "operator": "startsWith", "value": "/api"}]
      }
    }
    
  • AppConfig Feature Flags. Native support for the AWS.AppConfig.FeatureFlags profile type. AppConfig gives you deployment strategies and automatic rollback, useful when you don't want an accidental "enable all failures at 100%" to take down your environment.

  • Middy middleware. If you use Middy, import failure-lambda/middy instead of wrapping your handler.

  • CLI. npx failure-lambda gives you an interactive CLI for managing failure configuration. Check status, enable modes, disable everything. Supports both SSM and AppConfig backends and saves connection profiles so you don't retype region and parameter names every time.
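To make the event-based targeting concrete, here is a hypothetical Python evaluator for the four operators; the library's actual matching logic may differ in its details:

```python
import re

def get_path(event: dict, path: str):
    """Walk a dotted path like 'requestContext.http.path' through a nested dict."""
    cur = event
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur

def matches(event: dict, conditions: list) -> bool:
    """Hypothetical match evaluator: every condition must hold.

    Supports the four documented operators; falls back to exact match
    when no operator is given.
    """
    for cond in conditions:
        value = get_path(event, cond["path"])
        op = cond.get("operator", "equals")
        if op == "exists":
            ok = value is not None
        elif op == "startsWith":
            ok = isinstance(value, str) and value.startswith(cond["value"])
        elif op == "regex":
            ok = isinstance(value, str) and re.search(cond["value"], value) is not None
        else:  # exact match
            ok = value == cond["value"]
        if not ok:
            return False
    return True

event = {"requestContext": {"http": {"path": "/api/orders", "method": "GET"}}}
print(matches(event, [
    {"path": "requestContext.http.path", "operator": "startsWith", "value": "/api"}
]))
```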

The Lambda Layer

This is the biggest addition. The npm package requires you to import failure-lambda and wrap your handler. That's fine for Node.js, but it doesn't help if your Lambda functions are written in Python, Java, .NET, or Ruby.

The Lambda Layer solves this. Add the layer to your function, set two environment variables, and fault injection works without touching your code. No imports, no wrapper, no middleware.

AWS_LAMBDA_EXEC_WRAPPER=/opt/failure-lambda-wrapper
FAILURE_INJECTION_PARAM=failureLambdaConfig

Under the hood, it's a lightweight Rust proxy that sits between your handler and the Lambda Runtime API. Single static binary, no runtime dependencies, negligible cold start impact. On each invocation, the proxy reads your failure configuration from SSM or AppConfig and decides whether to inject faults before or after forwarding the request to your handler. Your code never knows it's there.

┌─────────────────────────────────────────────────┐
│ Lambda Execution Environment                    │
│                                                 │
│  Lambda Runtime API                             │
│       │                                         │
│       ▼                                         │
│  failure-lambda proxy (Rust)                    │
│       │                                         │
│       ├── Read config from SSM / AppConfig      │
│       ├── Inject fault? ──yes──▶ Return early   │
│       │                         (statuscode,    │
│       │                          exception,     │
│       │                          latency)       │
│       no                                        │
│       │                                         │
│       ▼                                         │
│  Your handler (any runtime)                     │
│       │                                         │
│       ├── Inject fault? ──yes──▶ Modify response│
│       │                         (corruption)    │
│       no                                        │
│       │                                         │
│       ▼                                         │
│  Response returned                              │
└─────────────────────────────────────────────────┘

The layer works with all managed Lambda runtimes that support AWS_LAMBDA_EXEC_WRAPPER (Node.js, Python, Java, .NET, and Ruby), on both x86_64 and arm64. Custom runtimes can use it too, provided the runtime bootstrap checks for the variable and invokes the specified executable before starting its own runtime loop. Download the zip from the GitHub release, publish it to your account, and you're ready to go.

Getting Started: npm Package

For Node.js, the npm package gives you the most control.

You'll need:

  • Node.js 20 or later
  • An AWS account with permissions to create SSM parameters and Lambda functions
  • ssm:GetParameter granted to the function's execution role

npm install failure-lambda

Wrap your handler:

import failureLambda from "failure-lambda";

export const handler = failureLambda(async (event, context) => {
  return { statusCode: 200, body: "OK" };
});

Create an SSM parameter with your failure configuration:

aws ssm put-parameter \
  --name failureLambdaConfig \
  --type String \
  --value '{"latency": {"enabled": false, "min_latency": 100, "max_latency": 400}}' \
  --region eu-west-1

Set the FAILURE_INJECTION_PARAM environment variable on your Lambda function to the parameter name, grant ssm:GetParameter, and deploy.
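The execution role only needs read access to that one parameter. A minimal policy statement looks like this (the account ID and region are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:eu-west-1:123456789012:parameter/failureLambdaConfig"
    }
  ]
}
```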

When you're ready to inject a fault:

npx failure-lambda enable latency --param failureLambdaConfig --region eu-west-1

Getting Started: Lambda Layer

For any runtime, the layer path requires no code changes.

You'll need:

  • An AWS account with permissions to publish Lambda layers and create Lambda functions
  • AWS CLI configured
  • ssm:GetParameter granted to the function's execution role

  1. Download failure-lambda-layer-x86_64.zip or failure-lambda-layer-aarch64.zip from the latest release.

  2. Publish the layer:

    aws lambda publish-layer-version \
      --layer-name failure-lambda \
      --zip-file fileb://failure-lambda-layer-x86_64.zip \
      --compatible-architectures x86_64 \
      --region eu-west-1
    
  3. Add the layer ARN to your function, set AWS_LAMBDA_EXEC_WRAPPER=/opt/failure-lambda-wrapper and FAILURE_INJECTION_PARAM=failureLambdaConfig, create the SSM parameter, and grant ssm:GetParameter.

That's it. Your Python handler, your Java handler, your .NET handler: they all get the same fault injection capabilities without a single line of code changed.

Try It Out: Injecting Faults Step by Step

Let's walk through a concrete example. We'll deploy a simple function with the layer, verify it works normally, inject latency, inject a status code error, and then turn everything off. The whole thing uses a standard Node.js handler with zero failure-lambda code in it.

Here's the handler. It simulates a quick database lookup and returns the response time:

export const handler = async (event) => {
  const start = Date.now();
  await new Promise((resolve) => setTimeout(resolve, 5));
  return {
    statusCode: 200,
    body: JSON.stringify({
      message: "Order processed",
      duration_ms: Date.now() - start,
    }),
  };
};

The SAM template adds the layer and sets the two required environment variables. Nothing else:

DemoFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs22.x
    CodeUri: src/
    Timeout: 10
    Layers:
      - !Ref FailureLambdaLayerArn
    Environment:
      Variables:
        AWS_LAMBDA_EXEC_WRAPPER: /opt/failure-lambda-wrapper
        FAILURE_INJECTION_PARAM: !Ref FailureConfig
    Policies:
      - SSMParameterReadPolicy:
          ParameterName: !Ref FailureConfig

The full template including Parameters definitions is in the examples directory. The snippet above shows only the function resource for clarity.

After deploying with sam build && sam deploy --guided, we hit the endpoint a few times to see steady state:

The examples below use curl -s -o - -w ' HTTP %{http_code} | %{time_total}s\n' to append status code and total request time to each response. Standard curl won't show this without the -w flag.

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.21s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.19s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":6}   HTTP 200 | 0.19s

5ms handler duration, ~190ms end-to-end on warm invocations. That's our baseline. Now let's see what happens when conditions aren't ideal.

Injecting latency

A third-party API starts responding slowly. A DynamoDB table is throttling. A downstream microservice is under load. These are real scenarios, and you want to know how your system behaves before they happen in production.

Update the SSM parameter to add 500-1000ms of random latency on every invocation:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{"latency": {"enabled": true, "percentage": 100, "min_latency": 500, "max_latency": 1000}}' \
  --region eu-west-1

Configuration is cached for 60 seconds, so if you update the SSM parameter and immediately hit the endpoint, you'll see the old behavior. Wait about a minute for the cache to refresh, then hit the endpoint again:

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":11}   HTTP 200 | 0.99s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}    HTTP 200 | 1.08s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}    HTTP 200 | 1.14s

Responses jumped from ~190ms to over a second. The handler itself still runs in 5ms: the latency is injected by the proxy before the handler executes. This simulates what happens when a dependency responds slowly. Does your API Gateway timeout kick in at the right threshold? Do callers retry or give up? Does a slow function cause a queue to back up?

Injecting a status code error

Latency is one thing. Complete failure is another. A downstream service returning 5xx errors, an expired API key, a misconfigured endpoint: all of these surface as error responses. Replace the config with a 503 Service Unavailable:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{"statuscode": {"enabled": true, "percentage": 100, "status_code": 503}}' \
  --region eu-west-1

After the cache refreshes (up to 60 seconds):

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.31s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.21s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Injected status code 503"}   HTTP 503 | 0.18s

The handler never runs. The proxy short-circuits the invocation and returns a 503 directly. This represents a function that's failing for any reason: a permissions change, a missing environment variable, an unhandled exception. Does your frontend show a useful error message or a blank page? Does your Step Functions workflow retry or fail the entire execution? Does your monitoring alert you?

What the logs show

The proxy writes structured JSON logs to CloudWatch. You can see exactly what it's doing on each invocation:

{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[]"}
{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[\"latency\"]"}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":670,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":859,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","mode":"latency","action":"inject","latency_ms":933,"min_latency":500.0,"max_latency":1000.0}
{"source":"failure-lambda","action":"config","config_source":"ssm","enabled_flags":"[\"statuscode\"]"}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}
{"source":"failure-lambda","mode":"statuscode","action":"inject","status_code":503}

Every config fetch and every injection is logged with the mode, action, and parameters. You can query these in CloudWatch Logs Insights:

fields @timestamp, mode, action
| filter source = "failure-lambda"
| sort @timestamp desc
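To see how often each mode fired during an experiment, a variation of the same query (assuming the log fields shown above) aggregates injections by mode:

```
fields mode
| filter source = "failure-lambda" and action = "inject"
| stats count(*) by mode
```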

Turning it off

The CLI can disable everything in one command:

npx failure-lambda disable --all --param <your-param-name> --region eu-west-1

Or set the parameter back to an empty config manually:

aws ssm put-parameter --name <your-param-name> --type String --overwrite \
  --value '{}' --region eu-west-1

After the cache refreshes, everything is back to normal:

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.33s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.23s

$ curl https://ij58ovr5s5.execute-api.eu-west-1.amazonaws.com/
{"message":"Order processed","duration_ms":5}   HTTP 200 | 0.19s

No redeployment. No code changes. The proxy saw the empty config, disabled injection, and passed everything through.

Cleaning up

To remove everything deployed in this walkthrough:

sam delete
aws ssm delete-parameter --name <your-param-name> --region eu-west-1
aws lambda delete-layer-version --layer-name failure-lambda --version-number <version> --region eu-west-1

What You Learn

The walkthrough above uses a single function. A real application with multiple functions, queues, and dependencies will surface more. But even one function reveals things about your system:

  • Timeout behavior (latency, timeout). Are your Lambda timeouts, API Gateway timeouts, and client timeouts configured consistently? Latency injection exposes mismatches fast. A function with a 10-second timeout behind an API Gateway with a 3-second timeout will fail in ways that look intermittent.
  • Retry and backoff (statuscode, exception). When a function returns a 503, do callers retry with exponential backoff or hammer the endpoint? Do SQS redrive policies work as configured? Injecting errors at a percentage less than 100% lets you see if partial failures are handled differently than complete outages.
  • Error propagation (statuscode, exception). Does a failure in one function produce a clear error message at the API boundary, or does it cascade into a generic 500? Injecting status codes and exceptions at different points in a call chain shows you exactly where error context gets lost.
  • Alerting and observability (any mode). Do your CloudWatch alarms fire? Do they fire quickly enough? Injecting faults and watching your dashboards is the most direct way to validate your monitoring. If you don't get paged during a controlled experiment, you won't get paged during an incident.
  • Fallback behavior (denylist). If you've built fallback logic for when a dependency is unavailable, does it actually work? The denylist mode blocks specific hostnames, so you can test what happens when S3 or DynamoDB is unreachable without affecting other dependencies.
  • Capacity and scaling (latency). What happens when latency increases and concurrent executions climb? Do you hit reserved concurrency limits? Does a slow function cause upstream queues to grow? These are the kinds of cascading effects that are hard to predict and easy to test.

The point of chaos engineering isn't to cause outages. It's to discover how your system responds to conditions that will eventually occur, in a controlled way, before your users encounter them.

Troubleshooting

Injection isn't happening

The most common cause is a missing ssm:GetParameter permission on the function's execution role. Check CloudWatch Logs for a permission denied error from the proxy. The second most common cause is the configuration cache: changes take up to 60 seconds to take effect. If you've just updated the SSM parameter, wait for the next cache refresh before concluding injection isn't working.

Architecture mismatch

If you publish the x86_64 layer and attach it to an arm64 function (or vice versa), the proxy binary won't execute. Download the correct zip for your function's architecture: failure-lambda-layer-x86_64.zip or failure-lambda-layer-aarch64.zip.

AWS_LAMBDA_EXEC_WRAPPER has no effect

The AWS_LAMBDA_EXEC_WRAPPER mechanism is built into managed Lambda runtimes. Custom runtimes need to explicitly support it by checking for the variable and invoking the wrapper before starting their own runtime loop. If you're using a custom runtime that doesn't implement this, the layer won't intercept anything.
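The required bootstrap behavior can be sketched like this. It's a hypothetical fragment: RUNTIME_CMD stands in for the runtime's real entrypoint, and a real bootstrap would exec rather than call, handing the process over to the wrapper entirely.

```shell
#!/bin/sh
# Hypothetical sketch of a custom runtime bootstrap honoring
# AWS_LAMBDA_EXEC_WRAPPER. Managed runtimes do the equivalent internally.
start_runtime() {
  RUNTIME_CMD="${RUNTIME_CMD:-echo runtime-loop-started}"
  if [ -n "${AWS_LAMBDA_EXEC_WRAPPER:-}" ]; then
    # Hand control to the wrapper, passing the original command along.
    # The wrapper (e.g. /opt/failure-lambda-wrapper) launches it in turn.
    "$AWS_LAMBDA_EXEC_WRAPPER" $RUNTIME_CMD
    return
  fi
  $RUNTIME_CMD
}

start_runtime
```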

Design Decisions

A few things I learned building this.

Feature flags over a single toggle

The original version had one failure mode active at a time. That's not how production fails. In the real world, you might have a slow dependency and a flaky DNS resolution at the same time. The feature flag model lets you compose failures. Each mode is independent with its own percentage, so you can build realistic failure scenarios.

Why Rust for the layer

The proxy sits in the critical path of every Lambda invocation. It needs to be fast with minimal memory overhead. Rust was the natural choice: predictable performance, no garbage collector pauses, and the single binary keeps the layer small.

Caching with a purpose

Every invocation used to call SSM to get the configuration. That's unnecessary latency and API costs. The library now caches SSM responses for 60 seconds by default. For AppConfig, the Lambda extension already handles caching, so the library disables its own cache entirely to avoid staleness.
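The shape of that cache is simple to sketch. This hypothetical ConfigCache uses a 60-second TTL, with the SSM call replaced by a plain callable so the sketch stays offline:

```python
import time

class ConfigCache:
    """Minimal sketch of a TTL cache like the one described above.

    `fetch` is whatever loads the config (an SSM GetParameter call
    in the real library); here it is just a callable.
    """

    def __init__(self, fetch, ttl_seconds: float = 60.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = None  # monotonic timestamp of the last fetch

    def get(self):
        now = time.monotonic()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

calls = []

def fake_fetch():
    calls.append(1)
    return {"latency": {"enabled": True}}

cache = ConfigCache(fake_fetch, ttl_seconds=60.0)
cache.get()
cache.get()  # served from cache; fake_fetch runs only once
print(len(calls))  # prints 1
```

Using time.monotonic() rather than wall-clock time keeps the TTL immune to clock adjustments, which matters in a long-lived execution environment.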

Validation that fails closed

If your configuration JSON is malformed or has invalid values, the library logs a clear error and disables injection. It doesn't crash your function and it doesn't silently inject with bad parameters. Regex patterns in denylist rules are checked for nested quantifiers to prevent ReDoS.
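Fail-closed loading fits in a few lines. This hypothetical loader returns an empty config (nothing enabled) on any error; the real library's validation is more thorough:

```python
import json

def load_config(raw: str) -> dict:
    """Fail-closed sketch: any parse or validation problem disables injection."""
    try:
        config = json.loads(raw)
        if not isinstance(config, dict):
            raise ValueError("config must be a JSON object")
        for mode, flag in config.items():
            if not isinstance(flag, dict):
                raise ValueError(f"flag for '{mode}' must be an object")
        return config
    except (json.JSONDecodeError, ValueError) as err:
        # Log and return an empty config: nothing enabled, function unaffected.
        print(f"failure-lambda: invalid config, injection disabled: {err}")
        return {}
```

The key property is that the error path and the "all off" path converge on the same value, so a bad deploy of the parameter degrades to a no-op instead of an outage.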

What's Next

failure-lambda 1.0 brings TypeScript, a feature flag configuration model, seven failure modes, and a Lambda Layer that works across all managed runtimes without touching your code. This release covers the core use cases I've seen in practice. There are things I'd like to explore next: more granular targeting with Lambda function aliases, integration with AWS Fault Injection Service, and better observability into what's being injected across a fleet of functions. If you have ideas, open an issue.


What failure scenarios have you tested in your serverless applications? I'm curious what surprises people find with timeout mismatches and retry behavior. Let me know in the comments.
