You added a circuit breaker to your Lambda function. It compiles, your tests pass, and it works correctly in local testing. But it's silently useless. The problem isn't the implementation. It's an assumption every in-memory circuit breaker makes that doesn't hold on Lambda.
What Circuit Breakers Do
The circuit breaker pattern comes from Michael Nygard's Release It! and is named after the electrical component. Think about the services your Lambda functions actually call: a payment processor, a third-party enrichment API, a database under load, another service in your own fleet. Anything external your function depends on is a downstream service, and any of them can start responding slowly or fail outright. Slow is often worse than down. A dependency that takes 10 seconds to time out costs you 10 seconds of held concurrency per call, not a fast failure you can handle gracefully.
That concurrency cost is the Lambda-specific reason to care. When a downstream call hangs, your function holds a concurrency unit. At 100 concurrent executions and a 10-second timeout, one flaky dependency can saturate your function in seconds, throttling every other request, including the ones with nothing to do with the sick service. The cascade happens fast: payment API slows down → order function saturates concurrency → order requests fail → the service calling orders backs up → users see errors across your entire checkout flow.
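A back-of-envelope version of that arithmetic, using the example numbers from above (swap in your own concurrency limit and timeout):

```javascript
// Back-of-envelope: how fast a hanging dependency saturates Lambda concurrency.
const concurrencyLimit = 100      // concurrency available to the function
const downstreamTimeoutSec = 10   // how long each hung call holds its environment

// Little's Law: at saturation, sustainable throughput = concurrency / hold time.
const maxThroughputRps = concurrencyLimit / downstreamTimeoutSec

// Any arrival rate above that throttles. At 50 req/s, saturation takes:
const arrivalRps = 50
const secondsToSaturate = concurrencyLimit / arrivalRps

console.log(`${maxThroughputRps} req/s sustainable, saturated in ${secondsToSaturate}s`)
```

Ten seconds of hold time turns 100 units of concurrency into a ceiling of 10 requests per second, and a modest 50 req/s of traffic hits that ceiling in two seconds.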
When a downstream service starts failing, a circuit breaker stops calling it entirely, returns a fallback response immediately, and probes for recovery. It also gives the downstream service breathing room: instead of a flood of timeouts hammering something that's already struggling, it gets near-silence while the circuit is open. The naming follows the electrical analogy: a closed circuit is complete and current flows; an open circuit is broken and nothing gets through.
Three states:
- CLOSED: Normal operation. Calls go through. Failures are counted.
- OPEN: Circuit tripped. Calls fail fast without reaching the downstream service. A timeout runs.
- HALF-OPEN: One trial call allowed. If it succeeds, the circuit closes. If it fails, it reopens with a longer timeout.
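The single-process version of that state machine fits in a few lines. This is an illustrative sketch of the textbook pattern (simplified: one success closes the circuit, and the backoff on reopening is omitted) — the names here are made up, not the API of the library introduced later:

```javascript
// Minimal single-process sketch of the three-state circuit breaker.
function createBreaker({ failureThreshold = 5, timeoutMs = 10000, now = Date.now } = {}) {
  let state = 'CLOSED'
  let failures = 0
  let nextAttempt = 0
  return {
    get state() { return state },
    async fire(call) {
      if (state === 'OPEN') {
        if (now() < nextAttempt) throw new Error('Circuit OPEN: failing fast')
        state = 'HALF-OPEN' // timeout elapsed: allow one trial call
      }
      try {
        const result = await call()
        state = 'CLOSED'    // success closes the circuit
        failures = 0
        return result
      } catch (err) {
        failures += 1
        if (state === 'HALF-OPEN' || failures >= failureThreshold) {
          state = 'OPEN'    // trip: fail fast until the timeout passes
          nextAttempt = now() + timeoutMs
        }
        throw err
      }
    },
  }
}
```

Notice that `failures` and `state` live in local variables. That detail is exactly where this pattern breaks on Lambda.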
The problem is how Lambda runs code.
Why In-Memory State Fails on Lambda
Lambda's concurrency model is built around isolated execution environments. From the AWS documentation: "For each concurrent request, Lambda provisions a separate instance of your execution environment." Two simultaneous invocations of the same function run in two separate environments with completely independent memory spaces. There is no shared memory between them.
Consider what this means for a circuit breaker with a failure threshold of 5. Your function is receiving 50 concurrent requests. A downstream service starts failing:
- Execution environment 1 takes a request. The call fails. Its local failure count: 1/5.
- Execution environment 2 takes a request. The call fails. Its local failure count: 1/5.
- ...
- Execution environment 50 takes a request. The call fails. Its local failure count: 1/5.
50 failures have hit the downstream service. No circuit has opened. Each environment has counted 1 failure and needs 4 more before it does anything. Meanwhile, all 50 environments continue sending requests to a service that is already failing. In the worst case, where traffic distributes evenly across environments, the fleet absorbs 200 failures before any single circuit opens, and up to 250 before every environment has tripped and the fleet goes quiet.
And that's assuming the same 50 execution environments handle all the traffic. Lambda scales by adding new execution environments as load increases. Each new environment starts with a failure count of zero. As long as traffic grows and new environments spin up, the fleet will always have environments that haven't seen enough failures to open. The circuit can never effectively protect you across the fleet.
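A quick simulation makes that arithmetic concrete: 50 isolated environments, threshold 5, traffic distributed round-robin so the local counters climb in lockstep:

```javascript
// Simulate isolated per-environment failure counters under even traffic.
const ENVIRONMENTS = 50
const THRESHOLD = 5
const counts = new Array(ENVIRONMENTS).fill(0) // one local counter per environment

let totalFailures = 0
let firstOpenAt = null
while (!counts.every(c => c >= THRESHOLD)) {
  const env = totalFailures % ENVIRONMENTS // round-robin across environments
  counts[env] += 1
  totalFailures += 1
  if (firstOpenAt === null && counts[env] >= THRESHOLD) firstOpenAt = totalFailures
}

console.log(`first circuit opened at failure ${firstOpenAt}`)   // 201
console.log(`all circuits open after ${totalFailures} failures`) // 250
```

201 failures reach the downstream before the first circuit opens, and 250 before the whole fleet stops calling. A single-process breaker would have stopped at 5.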
This isn't a hypothetical. The most widely used Node.js circuit breaker libraries (opossum, cockatiel) store state in process memory. They work correctly in a single-process server where all traffic goes through one circuit. They don't work for Lambda's distributed execution model. opossum does provide state export and import hooks (toJSON()) specifically documented for serverless environments, but these don't solve the cross-environment isolation problem: each environment still starts from whatever state you restore, not a live shared view of current circuit state.
Provisioned Concurrency reduces but doesn't eliminate this problem. PC keeps a fixed number of execution environments initialized and warm, so they accumulate local failure counts across more requests than standard on-demand environments. But they're still isolated from each other, and scaling events still add fresh environments that start at zero. In-memory state is less useless with PC, but it's still wrong at any meaningful concurrency level.
Lambda also periodically terminates execution environments for runtime maintenance and updates, even for continuously invoked functions. An environment accumulating failure counts can be replaced with a fresh one starting at zero at any time, adding another layer of unreliability to in-memory state.
Lambda Managed Instances (launched at re:Invent 2025) are an exception: they support multiple concurrent invocations per environment, so in-memory state accumulates across requests within the same environment. The argument above applies to standard Lambda functions, which remain the default.
Shared State Across Execution Environments
The fix is to store circuit state in a shared external store. When execution environment 1 records a failure, execution environment 2 sees it. When any environment opens the circuit, every environment stops calling the downstream service. Yes, this adds a network call to every invocation. The Performance and Cost sections have the numbers. For most workloads the overhead is small, and CachedProvider can reduce it further. ElastiCache (Valkey) is the fastest option (sub-millisecond reads) and is the right choice if your functions are already in a VPC. DynamoDB is the right default for most Lambda workloads: no VPC required, single-digit millisecond latency, and it supports atomic operations and conditional writes for concurrent safety. Adding a VPC solely for circuit breaker state adds deployment complexity and a modest cold start overhead, which isn't worth it unless you're already VPC-attached.
circuitbreaker-lambda is an open-source library I built that takes the DynamoDB path. It stores circuit state in DynamoDB and shares it across all execution environments running the same function.
Two paths to choose from:
| | npm package | Lambda Layer |
|---|---|---|
| Runtimes | Node.js 20+ | Any managed runtime |
| Integration | Import library | HTTP calls to local sidecar |
| Cold start overhead | ~50ms | ~350ms |
Both paths share the same DynamoDB state schema, so a Node.js function using the npm package and a Python function using the Layer can share circuit state for the same downstream service.
Getting Started: npm Package
Install
npm install circuitbreaker-lambda
Requires Node.js 20+.
Create a DynamoDB table
aws dynamodb create-table \
--table-name circuitbreaker-table \
--attribute-definitions AttributeName=id,AttributeType=S \
--key-schema AttributeName=id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
Set the environment variable
Add CIRCUITBREAKER_TABLE as a Lambda environment variable. In a SAM template:
Environment:
Variables:
CIRCUITBREAKER_TABLE: circuitbreaker-table
Grant the function access to DynamoDB
The function needs GetItem and UpdateItem on the table. In a SAM template:
Policies:
- Statement:
- Effect: Allow
Action:
- dynamodb:GetItem
- dynamodb:UpdateItem
Resource: !GetAtt CircuitBreakerTable.Arn
Use it in your handler
import { CircuitBreaker } from 'circuitbreaker-lambda'
// callDownstreamService is any async function that calls your downstream service.
// It should throw on failure; the circuit breaker catches the throw and counts it.
// Initialized outside the handler so the same instance
// is reused across warm invocations of this execution environment
const breaker = new CircuitBreaker(callDownstreamService, {
failureThreshold: 5,
successThreshold: 2, // successes (across any environment) required to close from HALF-OPEN
timeout: 10000, // ms to wait in OPEN state before allowing a trial call (HALF-OPEN)
fallback: async () => ({ data: 'cached response' }),
})
export const handler = async () => {
try {
const result = await breaker.fire()
return { statusCode: 200, body: JSON.stringify(result) }
} catch (err) {
// fire() throws if the circuit is OPEN and no fallback is configured,
// or if the downstream call fails and propagates the error.
return { statusCode: 503, body: JSON.stringify({ error: 'Service unavailable' }) }
}
}
fire() calls callDownstreamService. If the call succeeds, it records a success in DynamoDB. If it fails, it records a failure. When the failure count hits the threshold, it opens the circuit and subsequent calls return the fallback immediately (or throw if no fallback is configured). Every execution environment handling that function reads the same DynamoDB state. successThreshold counts successes across any execution environment via shared DynamoDB state, following the same last-writer-wins behavior as failure counts. Real fallbacks return something useful under degradation: cached data, a default empty state, or a simplified response. The { data: 'cached response' } placeholder in the example is where that goes.
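One common way to build that "cached data" fallback is to stash the last successful response in module scope, where it survives across warm invocations of the same execution environment. A sketch, with a made-up `fetchQuote` standing in for your downstream call:

```javascript
// Serve the last good response while the circuit is open.
// `fetchQuote` and the response shape are illustrative, not a real API.
let lastGoodResponse = null

async function fetchQuote() {
  const result = { price: 101.5, stale: false } // stand-in for the real downstream call
  lastGoodResponse = result // refresh the module-scope cache on every success
  return result
}

const breakerOptions = {
  failureThreshold: 5,
  // Served instantly while OPEN; marked so callers know the data is stale.
  fallback: async () =>
    lastGoodResponse
      ? { ...lastGoodResponse, stale: true }
      : { price: null, stale: true },
}
```

The trade-off: the cache is per-environment, so a freshly scaled environment has nothing to serve until its first success. For a shared cache, an external store (ElastiCache, a DynamoDB item) is the next step up.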
If you're using Middy middleware, there's an integration that wraps your handler directly:
import middy from '@middy/core'
import { circuitBreakerMiddleware } from 'circuitbreaker-lambda/middy'
export const handler = middy(myHandler)
.use(circuitBreakerMiddleware({
failureThreshold: 5,
fallback: async () => ({ statusCode: 503, body: 'Service unavailable' }),
}))
The middleware wraps the entire handler rather than a specific downstream function. When the circuit is OPEN, the middleware short-circuits the handler before it runs and returns the fallback response. Without a fallback configured, it throws so your error handler can respond. If your handler calls multiple downstream services, use the npm package directly with a distinct circuit ID for each service.
Circuit IDs and Shared State
The circuit ID is what links circuit state to a specific downstream service. By default it uses AWS_LAMBDA_FUNCTION_NAME. Two execution environments running the same function share one circuit because they have the same function name and read from the same DynamoDB item.
If one function calls multiple downstream services, give each a distinct circuit ID:
const paymentBreaker = new CircuitBreaker(callPaymentService, {
circuitId: 'payment-service',
})
const inventoryBreaker = new CircuitBreaker(callInventoryService, {
circuitId: 'inventory-service',
})
If multiple functions protect the same downstream service and you want them to share a circuit, give them the same ID. A circuit open in one function will be seen by all functions using that ID. For the Lambda Layer, use the same circuit ID string in the HTTP path across all functions.
Getting Started: Lambda Layer
If your functions use a runtime other than Node.js, or if you want a single circuit breaker deployment that works across runtimes, the Lambda Layer is the other path. It ships a Rust extension that runs as a local sidecar on port 4243. Your handler makes HTTP calls to it instead of importing a library.
Add the layer to your SAM template
Download the layer zip from the GitHub releases page. The Rust extension is architecture-specific: download the x86_64 build for standard Lambda functions or the arm64 build for Graviton. Reference it as an AWS::Serverless::LayerVersion resource and attach it to your function. The examples/layer/template.yaml in the repo shows the full setup with both architectures. The key function configuration:
LayerNodeFunction:
Type: AWS::Serverless::Function
Properties:
Layers: [!Ref CircuitBreakerLayer]
Environment:
Variables:
CIRCUITBREAKER_TABLE: !Ref CircuitBreakerTable
Node.js handler
const CIRCUIT_ID = process.env.AWS_LAMBDA_FUNCTION_NAME
const CB_URL = 'http://127.0.0.1:4243'
// Check circuit state before calling downstream
const check = await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}`)
const { allowed, state } = await check.json()
if (!allowed) {
return { statusCode: 503, body: JSON.stringify({ error: 'Circuit OPEN', state }) }
}
// Call downstream and report result
try {
const result = await callDownstream()
await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}/success`, { method: 'POST' })
return { statusCode: 200, body: JSON.stringify(result) }
} catch (err) {
await fetch(`${CB_URL}/circuit/${CIRCUIT_ID}/failure`, { method: 'POST' })
throw err // Lambda returns a non-200; event sources like SQS will retry
}
Python handler
import json
import os
import urllib.request
circuit_id = os.environ.get('AWS_LAMBDA_FUNCTION_NAME', 'default')
cb_url = 'http://127.0.0.1:4243'
def handler(event, context):
try:
with urllib.request.urlopen(f'{cb_url}/circuit/{circuit_id}', timeout=5) as resp:
check = json.loads(resp.read())
except Exception:
# Sidecar unavailable — fail open and allow the downstream call
check = {'allowed': True}
if not check['allowed']:
return {'statusCode': 503, 'body': json.dumps({'error': 'Circuit OPEN'})}
try:
result = call_downstream()
urllib.request.urlopen(
urllib.request.Request(f'{cb_url}/circuit/{circuit_id}/success', method='POST'),
timeout=5
)
return {'statusCode': 200, 'body': json.dumps(result)}
except Exception as e:
urllib.request.urlopen(
urllib.request.Request(f'{cb_url}/circuit/{circuit_id}/failure', method='POST'),
timeout=5
)
# For event-driven triggers like SQS, raise here instead so Lambda retries.
return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}
Both examples make two HTTP calls to the sidecar per invocation: one to check state before the downstream call, one to report the result. These are loopback calls to 127.0.0.1, not network calls, so the round-trip is sub-millisecond. The Rust sidecar also runs the CachedProvider logic internally, so it rarely reaches DynamoDB on warm invocations. The warm latency numbers in the Performance section reflect this.
The Layer approach requires more boilerplate per handler, but it works in any managed runtime and keeps the state management and DynamoDB logic out of your application code. The handler wires up local HTTP calls to the sidecar. That part does live in your code. But the actual circuit state tracking, DynamoDB reads and writes, failure counting, and backoff logic are all inside the Rust extension.
Design Decisions
Fail-open
The term comes from physical security, not circuit states: a fail-open lock releases when power fails, defaulting to permissive. Here it means the same thing. If the DynamoDB state provider is unavailable, requests pass through rather than failing. This is a deliberate trade-off. The alternative is failing closed: a transient DynamoDB error takes down your service even if the downstream service it's protecting is completely healthy. Your circuit breaker becomes a single point of failure.
Fail-open accepts that brief periods of unprotected calls are better than self-inflicted downtime. State provider errors are logged as structured JSON so you can monitor and alert on them, but they don't block requests. The counter-argument: if DynamoDB is unavailable during an active downstream incident, fail-open leaves traffic unprotected. For most workloads this is the right call: a simultaneous DynamoDB outage and downstream failure is an unlikely combination, and failing closed (blocking all traffic because the circuit breaker can't read state) makes things worse. If your downstream is fragile enough that this scenario is a real concern, a lower-level fallback or degraded mode is a better answer than fail-closed.
For the Lambda Layer path, there are two sidecar failure modes to distinguish. If the extension fails during the INIT phase, Lambda restarts the execution environment entirely. The handler never runs, and Lambda retries automatically. If the extension crashes after initialization during an invocation, fetch calls to http://127.0.0.1:4243 throw connection refused errors. For this second case, wrap the sidecar calls in a try/catch and fail open: allow the downstream call to proceed. The same principle applies as with DynamoDB unavailability.
Warm invocation caching
Every fire() call reads circuit state from DynamoDB. For a function handling high throughput, that's a DynamoDB read on every invocation. You can reduce this with the CachedProvider, which caches state in memory for warm execution environments:
const breaker = new CircuitBreaker(callDownstream, {
cacheTtlMs: 200, // use cached state for 200ms on warm invocations
})
On a warm invocation, the cache is checked first. If the state is fresh, no DynamoDB call is made. The cache is write-through: when state is saved to DynamoDB, the cache is also updated. Keep the TTL short. A long cache window can delay the CLOSED to OPEN transition: an execution environment that cached a CLOSED state won't see a newly-opened circuit until the cache expires. 200ms is a reasonable starting point: it caps the detection lag while cutting DynamoDB reads significantly for high-throughput functions. Increase the TTL to reduce costs further at the cost of slower circuit detection. Decrease it for faster propagation at higher DynamoDB cost.
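A rough model of the trade-off, assuming a per-environment throughput of 100 invocations per second (the rate is an assumption for illustration):

```javascript
// Model the CachedProvider trade-off: DynamoDB reads saved vs. worst-case
// delay before an environment notices a newly opened circuit.
const cacheTtlMs = 200
const invocationsPerSecPerEnv = 100 // assumed per-environment throughput

// With the cache, each environment reads DynamoDB at most once per TTL window.
const readsPerSecWithCache = Math.min(invocationsPerSecPerEnv, 1000 / cacheTtlMs)
const readReduction = 1 - readsPerSecWithCache / invocationsPerSecPerEnv

// An environment that cached CLOSED can keep calling for at most one TTL.
const worstCaseDetectionLagMs = cacheTtlMs

console.log(`${readsPerSecWithCache} reads/s instead of ${invocationsPerSecPerEnv}, ` +
  `${(readReduction * 100).toFixed(0)}% fewer reads, ` +
  `<=${worstCaseDetectionLagMs}ms extra detection lag`)
```

At that rate, a 200ms TTL cuts reads by 95% while bounding the extra detection lag at 200ms. The reduction scales with throughput: a low-traffic environment saves little, which is fine because it wasn't generating many reads anyway.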
What the DynamoDB item looks like
When debugging a stuck circuit, this is what you're looking for in the table:
{
"id": "my-function-name",
"circuitState": "OPEN",
"failureCount": 5,
"successCount": 0,
"nextAttempt": 1741234567890,
"lastFailureTime": 1741234557890,
"consecutiveOpens": 1
}
circuitState is CLOSED, OPEN, or HALF-OPEN. nextAttempt is a Unix timestamp in milliseconds. The circuit won't probe until after that time. consecutiveOpens tracks how many consecutive HALF-OPEN→OPEN transitions have occurred, which drives the exponential backoff on the timeout.
The library uses last-writer-wins writes rather than atomic increments. Under extreme concurrent failures (many execution environments failing at the exact same millisecond) some failure counts can be lost: if 10 environments all read failureCount: 4 and each write 5, the count advances by 1 instead of 10. In practice this means the circuit may take slightly longer to open than the threshold suggests under burst concurrency. It will still open. For the CLOSED→OPEN transition itself, multiple environments writing OPEN simultaneously all succeed, which is fine: you want the circuit open. Atomic counter increments via DynamoDB's ADD operation could prevent lost failure counts, but a state transition updates multiple fields simultaneously: state, failure count, and timestamp. Last-writer-wins on the full item keeps the write logic simple at the cost of occasional lost counts under extreme concurrency. If your function handles high burst concurrency, set failureThreshold lower than you would in a single-process application. Lost counts mean the effective threshold is higher than the configured value, so a lower setting brings the actual behavior closer to the intended one.
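The lost-update race is easy to reproduce in miniature. Here the shared DynamoDB item is simulated with a plain object, and ten environments read failureCount: 4 before any of them writes:

```javascript
// Last-writer-wins lost update: 10 environments read 4, each writes back 5.
const table = { failureCount: 4 } // simulated shared DynamoDB item

// Step 1: all 10 environments read the current count before anyone writes.
const snapshots = Array.from({ length: 10 }, () => table.failureCount)

// Step 2: each writes its locally incremented value; the last write wins.
for (const seen of snapshots) {
  table.failureCount = seen + 1
}

console.log(table.failureCount) // 5, not 14 — the count advanced by 1, not 10
```

Nine increments vanish, which is exactly why the effective threshold under burst concurrency is higher than the configured one.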
Exponential backoff on repeated failures
When a circuit transitions from HALF-OPEN back to OPEN (a recovery probe failed), the timeout before the next probe doubles. This prevents a repeatedly-failing service from being probed too aggressively. The backoff resets when the circuit closes successfully. The maxTimeout option caps how long the backoff can grow.
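A sketch of that schedule, driven by the consecutiveOpens field from the DynamoDB item shown above. The plain-doubling formula and function name are illustrative assumptions consistent with the description, not the library's internals:

```javascript
// Exponential backoff on the OPEN timeout: doubles per failed recovery
// probe, capped at maxTimeout. Illustrative, not the library's exact code.
function openTimeoutMs(consecutiveOpens, { timeout = 10000, maxTimeout = 300000 } = {}) {
  // 1st open waits `timeout`, 2nd waits 2x, 3rd 4x, ... capped at maxTimeout
  return Math.min(timeout * 2 ** (consecutiveOpens - 1), maxTimeout)
}

console.log([1, 2, 3, 4, 5, 6].map(n => openTimeoutMs(n)))
// [10000, 20000, 40000, 80000, 160000, 300000]
```

With a 10-second base timeout and a 5-minute cap, a stubbornly failing service sees probes at 10s, 20s, 40s, 80s, 160s, then every 5 minutes until it recovers.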
HALF-OPEN probe behavior
With shared DynamoDB state, when the circuit transitions to HALF-OPEN, every warm execution environment that reads the updated state may attempt a trial call. Unlike a single-process circuit breaker where exactly one probe goes out, a fleet of 50 environments can send up to 50 simultaneous probes to a recovering downstream service. CachedProvider staggers probes across the TTL window as environments pick up the state change at different times, but doesn't eliminate the burst. A single-leader approach (using a DynamoDB conditional write to claim the probe slot) would be more precise, and it's tracked as a future improvement in the repo. The current behavior favors simplicity: the probe burst is proportional to the number of warm environments, which is typically small for functions with reasonable traffic patterns, and distributed leader election adds significant complexity for a probe that's designed to be retried on failure anyway.
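The single-leader idea can be sketched with the DynamoDB item simulated as a local object: a compare-and-set on a claim flag lets exactly one of 50 environments win the probe. The probeClaimed field and tryClaimProbe helper are hypothetical, not part of the current schema:

```javascript
// Simulated conditional write (DynamoDB's ConditionExpression semantics):
// succeed only if no environment has claimed the HALF-OPEN probe slot yet.
const item = { circuitState: 'HALF-OPEN', probeClaimed: false }

function tryClaimProbe(it) {
  if (it.circuitState === 'HALF-OPEN' && it.probeClaimed === false) {
    it.probeClaimed = true // real version: UpdateItem with a ConditionExpression
    return true
  }
  return false // real version: ConditionalCheckFailedException — another env won
}

// 50 warm environments all try to probe; only the first claim succeeds.
const results = Array.from({ length: 50 }, () => tryClaimProbe(item))
console.log(results.filter(Boolean).length) // 1 — one probe instead of 50
```

The real version would also need a claim expiry so a crashed prober doesn't wedge the circuit in HALF-OPEN, which is part of why the library currently favors the simpler multi-probe behavior.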
Custom state backends
The StateProvider interface is pluggable. If you need Redis, a relational database, or anything else, implement two methods:
class RedisProvider implements StateProvider {
async getState(circuitId: string): Promise<CircuitBreakerState | undefined> { ... }
async saveState(circuitId: string, state: CircuitBreakerState): Promise<void> { ... }
}
const breaker = new CircuitBreaker(fn, { stateProvider: new RedisProvider() })
DynamoDB is the right default for most Lambda workloads. Valkey or Redis makes sense if you're already VPC-attached and running ElastiCache for caching: reusing existing infrastructure avoids the extra DynamoDB dependency. For most teams, running a cache cluster solely for circuit state isn't worth the VPC overhead and operational cost.
Performance
Here are measured results comparing the npm package and the Lambda Layer: 50 warm invocations per configuration against a shared DynamoDB table in the same region. All functions were configured at 512MB memory. The "downstream" in all cases was an HTTP call through an API Gateway endpoint backed by DynamoDB, which could be toggled healthy or unhealthy. The HTTP round-trip through API Gateway accounts for the ~590ms baseline. Raw DynamoDB read latency is single-digit milliseconds. Cold start times scale inversely with memory allocation. Lambda allocates CPU proportionally to memory, so at 128MB (where CPU is highly constrained) you would expect larger overhead, particularly for the Layer, which initializes a Rust extension sidecar alongside the function runtime.
Cold start (forced by updating a function environment variable):
| Configuration | Cold start | vs. baseline |
|---|---|---|
| Baseline (no circuit breaker) | 1300ms | — |
| npm package (Node.js) | 1353ms | +4% |
| Lambda Layer (Node.js) | 1679ms | +29% |
| Lambda Layer (Python) | 1541ms | +18% |
The Layer cold start penalty comes from initializing the Rust extension sidecar alongside the function runtime. It's a one-time cost per execution environment. Since August 2025, AWS bills for the Lambda INIT phase on managed runtimes with ZIP deployment packages, so the Layer's +29% cold start overhead (379ms) is now both a latency and a cost consideration for functions with frequent cold starts.
Warm invocations (50 calls each):
| Configuration | Median | p99 |
|---|---|---|
| Baseline (no circuit breaker) | 590ms | 620ms |
| npm package (Node.js) | 592ms | 621ms |
| Lambda Layer (Node.js) | 589ms | 797ms |
| Lambda Layer (Python) | 585ms | 639ms |
The npm package p99 (621ms) is essentially identical to baseline (620ms). The Lambda Layer Node.js p99 (797ms) is higher because the Rust extension sidecar occasionally adds latency on the first few invocations after a warm start. The median is fine but the tail is longer. The Layer configurations showing slightly below baseline median are within measurement noise, not a genuine speedup. With CachedProvider, DynamoDB reads are eliminated for subsequent invocations within the TTL window, which brings tail latency down for high-throughput functions.
Cost
Each fire() call reads circuit state from DynamoDB (one GetItem) and writes on state changes. At on-demand pricing, a GetItem on a small item costs $0.125 per million read request units. At one million Lambda invocations per day (around 11 RPS), that's roughly $0.125/day for the reads. State writes only happen on failures and state transitions, so they're a rounding error for a healthy function. During an active failure scenario writes increase. At $1.25 per million write request units, a function failing on every invocation could see $1.25/day in write costs before the circuit opens and stops the calls. In practice the circuit opens quickly (after the first threshold of failures per environment), so write volume drops sharply once OPEN. With CachedProvider at 200ms TTL and warm execution environments, reads drop by an order of magnitude on high-throughput functions.
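The arithmetic above, spelled out. Prices are the on-demand figures quoted in this section:

```javascript
// Daily circuit breaker state cost at 1M invocations/day, on-demand pricing.
const READ_PRICE_PER_MILLION = 0.125  // $ per million read request units
const WRITE_PRICE_PER_MILLION = 1.25  // $ per million write request units
const invocationsPerDay = 1_000_000

// One GetItem per fire(); budgeting 1 RRU per read keeps the estimate conservative
// (an eventually consistent read on a small item is actually 0.5 RRU).
const dailyReadCost = (invocationsPerDay / 1_000_000) * READ_PRICE_PER_MILLION

// Worst case: every invocation fails and writes updated state (1 WRU each).
const worstCaseDailyWriteCost = (invocationsPerDay / 1_000_000) * WRITE_PRICE_PER_MILLION

console.log(`reads ~$${dailyReadCost}/day, failure-storm writes ~$${worstCaseDailyWriteCost}/day`)
```

Both numbers are ceilings: CachedProvider cuts the reads, and the circuit opening cuts the writes.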
Testing
The package includes a MemoryProvider for unit testing. Pass it as the stateProvider option to skip DynamoDB entirely in tests:
import { CircuitBreaker, MemoryProvider } from 'circuitbreaker-lambda'
const breaker = new CircuitBreaker(callDownstream, {
stateProvider: new MemoryProvider(),
failureThreshold: 3,
})
MemoryProvider uses an in-memory Map and is not safe for production. It's for tests and local development only.
When NOT to Use a Circuit Breaker
Not every Lambda function needs one:
- Very low concurrency. If your function runs at a single execution environment (low traffic, no bursting), in-memory circuit breakers work: there's only one environment, so state is effectively shared. The overhead of distributed state isn't worth it for something handling a few requests per minute.
- Calls to AWS services. The AWS SDK handles retries, timeouts, and transient failures with exponential backoff. Wrapping a DynamoDB GetItem or an S3 PutObject in a circuit breaker adds complexity without much benefit. AWS manages the resilience layer for you.
- When fail-fast isn't better than retry. Circuit breakers are for cascading failure protection. If your function's caller expects a synchronous result and there's no meaningful fallback response, letting the error propagate and retry may be simpler than managing circuit state.
Try It Yourself
The repo includes two deployable SAM examples with a toggleable downstream service so you can watch the full circuit lifecycle (healthy calls, failures accumulating, circuit opening, recovery) against real AWS infrastructure:
- examples/sam/: npm package example. Single Node.js function at /. Toggle the downstream at /toggle, check circuit state at /status.
- examples/layer/: Layer example. Node.js (/node) and Python (/python) functions side by side, sharing the same Layer and DynamoDB table.
- examples/minimal-npm/ and examples/minimal-layer/: Stripped-down versions if you just want the bare minimum code without the toggle/status test infrastructure.
Additional Resources
- circuitbreaker-lambda on GitHub: Source, docs, and examples
- circuitbreaker-lambda on npm: Package page
- Using the circuit-breaker pattern with AWS Lambda extensions and Amazon DynamoDB: AWS Compute Blog post covering the same Lambda extension + DynamoDB architecture
- AWS Prescriptive Guidance: Circuit Breaker: AWS' recommended approach using DynamoDB
- AWS Lambda execution environment documentation: The concurrency and isolation model this post is based on
- Circuit Breaker pattern (Martin Fowler): The canonical pattern reference
The silent failure mode of in-memory circuit breakers on Lambda isn't obvious until you're debugging a production incident. If you're running a circuit breaker today, check whether it's sharing state across execution environments. If it's not, it's not protecting you. The fix is a DynamoDB table and three lines of configuration. The alternative is finding out during the next downstream outage. Let me know in the comments how you're handling downstream resilience on Lambda.