DEV Community

Dinesh_gowtham

Step Functions Fixed Our Lambda Cold Starts, but Introduced EventBridge Delays

We migrated 15 Lambdas to Step Functions to reduce cold starts. Cold starts dropped, but EventBridge messages started arriving late. Here's what caused the delay and how we redesigned around it.

Introduction to Step Functions and EventBridge

Step Functions and EventBridge are two AWS services that can be used to build scalable and event-driven architectures. Step Functions provides a way to orchestrate the components of distributed applications and microservices, while EventBridge provides a way to handle events from various sources.

import { SFNClient, DescribeStateMachineCommand } from '@aws-sdk/client-sfn';
import { EventBridgeClient, DescribeEventBusCommand } from '@aws-sdk/client-eventbridge';

// In AWS SDK v3 the Step Functions client class is SFNClient.
const stepFunctionsClient = new SFNClient({ region: 'us-east-1' });
const eventBridgeClient = new EventBridgeClient({ region: 'us-east-1' });

const describeStateMachineCommand = new DescribeStateMachineCommand({
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine',
});

const describeEventBusCommand = new DescribeEventBusCommand({
  Name: 'default',
});

stepFunctionsClient.send(describeStateMachineCommand).then((data) => {
  console.log(data);
});

eventBridgeClient.send(describeEventBusCommand).then((data) => {
  console.log(data);
});

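Be careful when combining Step Functions and EventBridge. Step Functions does not retry a failed state unless that state declares a Retry block, and EventBridge drops an event once its delivery retries to a target are exhausted (by default, up to 185 attempts over 24 hours) unless the target has a dead-letter queue. If neither is configured, failures can turn into silent data loss and processing delays.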
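To make the retry behavior explicit on the Step Functions side, a Task state can declare its own Retry and Catch blocks. The following is a minimal sketch that only builds the Amazon States Language fragment as a plain object. The function ARN, the `RecordFailure` state name, and the retry numbers are hypothetical placeholders, not values from our workflows.

```typescript
// Minimal types for the parts of Amazon States Language used in this sketch.
interface Retrier {
  ErrorEquals: string[];
  IntervalSeconds: number;
  MaxAttempts: number;
  BackoffRate: number;
}

interface Catcher {
  ErrorEquals: string[];
  Next: string;
  ResultPath?: string;
}

interface TaskState {
  Type: 'Task';
  Resource: string;
  Parameters: Record<string, unknown>;
  Retry: Retrier[];
  Catch: Catcher[];
  End: boolean;
}

// Builds a Lambda-invoking Task state with explicit Retry and Catch blocks.
// `functionArn` and `failureStateName` are hypothetical placeholders.
function buildRetryingTask(functionArn: string, failureStateName: string): TaskState {
  return {
    Type: 'Task',
    Resource: 'arn:aws:states:::lambda:invoke',
    Parameters: { FunctionName: functionArn, 'Payload.$': '$' },
    Retry: [
      {
        // Transient Lambda errors: retry with exponential backoff.
        ErrorEquals: ['Lambda.ServiceException', 'Lambda.TooManyRequestsException'],
        IntervalSeconds: 2,
        MaxAttempts: 3,
        BackoffRate: 2,
      },
    ],
    Catch: [
      {
        // Anything still failing goes to a handler state instead of being lost.
        ErrorEquals: ['States.ALL'],
        Next: failureStateName,
        ResultPath: '$.error',
      },
    ],
    End: true,
  };
}

const task = buildRetryingTask(
  'arn:aws:lambda:us-east-1:123456789012:function:MyFunction',
  'RecordFailure',
);
console.log(JSON.stringify(task, null, 2));
```

With these assumed values the state waits 2, 4, and then 8 seconds between attempts, so retries alone can add up to 14 seconds before the Catch block takes over. That trade-off is worth keeping in mind when latency matters.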

Our Initial Migration: From Lambda to Step Functions

When we initially migrated our Lambdas to Step Functions, we noticed a significant reduction in cold starts. However, we also noticed that EventBridge messages were being delayed, which was affecting the overall performance of our application.

import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// In AWS SDK v3 the Step Functions client class is SFNClient.
const stepFunctionsClient = new SFNClient({ region: 'us-east-1' });
const eventBridgeClient = new EventBridgeClient({ region: 'us-east-1' });

const startExecutionCommand = new StartExecutionCommand({
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine',
  input: '{}',
});

const putEventsCommand = new PutEventsCommand({
  Entries: [
    {
      EventBusName: 'default',
      Source: 'my-source',
      DetailType: 'my-detail-type',
      Detail: '{"key": "value"}',
    },
  ],
});

stepFunctionsClient.send(startExecutionCommand).then((data) => {
  console.log(data);
});

eventBridgeClient.send(putEventsCommand).then((data) => {
  console.log(data);
});

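Standard workflows have a hard quota of 25,000 events in a single execution's history, which long-running workflows or workflows with many steps can hit. Paging through the history with the GetExecutionHistory API shows how close an execution is to the quota (DescribeExecution returns only status and metadata), but it does not raise the quota. To stay under it, split the work across child executions or start a new execution before the limit is reached.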
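One way to reason about the 25,000-event quota is to estimate how many items a single execution can process before its history fills up, and then split the work across that many executions. The sketch below is only arithmetic. The per-item event count, the fixed overhead, and the 80% safety margin are assumptions you would measure or tune for your own state machine.

```typescript
// Hard quota on history events for one Standard workflow execution.
const HISTORY_EVENT_QUOTA = 25_000;

interface BatchPlan {
  itemsPerExecution: number;
  executionCount: number;
}

// Estimates how to split `totalItems` across executions so each one stays
// under the history quota. `eventsPerItem`, `fixedEvents`, and
// `safetyPercent` are assumed inputs, not values reported by Step Functions.
function planBatches(
  totalItems: number,
  eventsPerItem: number,
  fixedEvents = 10,
  safetyPercent = 80,
): BatchPlan {
  if (totalItems < 0 || eventsPerItem <= 0) {
    throw new RangeError('totalItems must be >= 0 and eventsPerItem must be > 0');
  }
  // Event budget left for items after the safety margin and fixed overhead.
  const budget = Math.floor((HISTORY_EVENT_QUOTA * safetyPercent) / 100) - fixedEvents;
  const itemsPerExecution = Math.floor(budget / eventsPerItem);
  if (itemsPerExecution < 1) {
    throw new RangeError('a single item does not fit in the event budget');
  }
  return {
    itemsPerExecution,
    executionCount: Math.ceil(totalItems / itemsPerExecution),
  };
}

// 100,000 items at an assumed 7 history events per item.
console.log(planBatches(100_000, 7));
// → { itemsPerExecution: 2855, executionCount: 36 }
```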

The Unseen Delay: EventBridge Message Queuing in Step Functions

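When we dug deeper, we found that the delay in EventBridge messages came from how our Step Functions workflow handled them. Retries were the main factor: each retry, whether declared in a state's Retry block or performed by EventBridge when redelivering to a target, adds its backoff interval to end-to-end latency, and an event whose delivery retries are exhausted is dropped unless the target has a dead-letter queue.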
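On the EventBridge side, delivery retries and dead-lettering are set per target. The sketch below builds one entry for the `Targets` array of a `PutTargetsCommand` input, with a bounded retry policy and a dead-letter queue. The ARNs and the chosen limits are hypothetical; the range checks mirror the documented bounds for `MaximumRetryAttempts` (0 to 185) and `MaximumEventAgeInSeconds` (60 to 86400).

```typescript
// Shape of one entry in the Targets array of a PutTargets request
// (only the fields used in this sketch).
interface RuleTarget {
  Id: string;
  Arn: string;
  RoleArn: string;
  RetryPolicy: { MaximumRetryAttempts: number; MaximumEventAgeInSeconds: number };
  DeadLetterConfig: { Arn: string };
}

interface TargetOptions {
  id: string;
  stateMachineArn: string;
  roleArn: string;
  deadLetterQueueArn: string;
  maxRetryAttempts: number;
  maxEventAgeSeconds: number;
}

// Builds a target that retries for a bounded time and then dead-letters
// the event instead of dropping it.
function buildTarget(opts: TargetOptions): RuleTarget {
  if (!Number.isInteger(opts.maxRetryAttempts) || opts.maxRetryAttempts < 0 || opts.maxRetryAttempts > 185) {
    throw new RangeError('maxRetryAttempts must be an integer between 0 and 185');
  }
  if (!Number.isInteger(opts.maxEventAgeSeconds) || opts.maxEventAgeSeconds < 60 || opts.maxEventAgeSeconds > 86_400) {
    throw new RangeError('maxEventAgeSeconds must be an integer between 60 and 86400');
  }
  return {
    Id: opts.id,
    Arn: opts.stateMachineArn,
    RoleArn: opts.roleArn,
    RetryPolicy: {
      MaximumRetryAttempts: opts.maxRetryAttempts,
      MaximumEventAgeInSeconds: opts.maxEventAgeSeconds,
    },
    DeadLetterConfig: { Arn: opts.deadLetterQueueArn },
  };
}

// All ARNs below are hypothetical placeholders.
const target = buildTarget({
  id: 'start-workflow',
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine',
  roleArn: 'arn:aws:iam::123456789012:role/MyEventBridgeRole',
  deadLetterQueueArn: 'arn:aws:sqs:us-east-1:123456789012:my-dlq',
  maxRetryAttempts: 8,
  maxEventAgeSeconds: 300,
});
console.log(JSON.stringify({ Rule: 'MyRule', EventBusName: 'default', Targets: [target] }, null, 2));
```

The printed object is what you would pass to `new PutTargetsCommand(...)`. Capping the event age at a few minutes trades completeness of retries for a predictable upper bound on how late an event can arrive.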

import { SFNClient, DescribeExecutionCommand } from '@aws-sdk/client-sfn';
import { EventBridgeClient, DescribeEventBusCommand } from '@aws-sdk/client-eventbridge';

// In AWS SDK v3 the Step Functions client class is SFNClient.
const stepFunctionsClient = new SFNClient({ region: 'us-east-1' });
const eventBridgeClient = new EventBridgeClient({ region: 'us-east-1' });

const describeExecutionCommand = new DescribeExecutionCommand({
  executionArn: 'arn:aws:states:us-east-1:123456789012:execution:MyStateMachine:MyExecution',
});

const describeEventBusCommand = new DescribeEventBusCommand({
  Name: 'default',
});

stepFunctionsClient.send(describeExecutionCommand).then((data) => {
  console.log(data);
});

eventBridgeClient.send(describeEventBusCommand).then((data) => {
  console.log(data);
});

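If you're using EventBridge Pipes, be aware of a reported 5-second filter evaluation limit, and verify that figure against the current Pipes quotas before designing around it. Separately, events that do not match the pipe's filter pattern are discarded without raising an error, so a mistake in the pattern looks exactly like silently dropped messages.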
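Because a wrong filter looks the same as lost messages, it can help to check sample events against a filter before deploying the pipe. The sketch below builds the `FilterCriteria` structure that a pipe's source parameters accept, plus a deliberately simplified local matcher. The matcher is an assumption-laden approximation that handles only exact-value matching on nested fields; it does not implement prefix, numeric, `anything-but`, or array-valued event fields. The SQS-style `body.status` field is a hypothetical example.

```typescript
type Literal = string | number | boolean | null;
type Pattern = { [key: string]: Pattern | Literal[] };

interface FilterCriteria {
  Filters: { Pattern: string }[];
}

// Builds the FilterCriteria shape: each pattern is serialized to a JSON string.
function buildFilterCriteria(patterns: Pattern[]): FilterCriteria {
  if (patterns.length === 0) {
    throw new Error('at least one pattern is required');
  }
  return { Filters: patterns.map((p) => ({ Pattern: JSON.stringify(p) })) };
}

// Simplified local check: every field in the pattern must be present in the
// event, and the event's value must equal one of the listed literals.
function matchesExact(pattern: Pattern, event: unknown): boolean {
  if (typeof event !== 'object' || event === null) {
    return false;
  }
  const record = event as Record<string, unknown>;
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = record[key];
    if (Array.isArray(expected)) {
      return expected.some((value) => value === actual);
    }
    return matchesExact(expected, actual);
  });
}

// Hypothetical filter: only forward messages whose body.status is READY.
const readyOnly: Pattern = { body: { status: ['READY'] } };

console.log(buildFilterCriteria([readyOnly]));
console.log(matchesExact(readyOnly, { body: { status: 'READY', id: 1 } })); // true
console.log(matchesExact(readyOnly, { body: { status: 'PENDING' } })); // false
```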

Rethinking State Machine Design for Timely Event Processing

To mitigate the delay in EventBridge messages, we rethought our state machine design. We implemented a message queueing mechanism that allowed us to handle messages in a timely manner.

import { SFNClient, UpdateStateMachineCommand } from '@aws-sdk/client-sfn';
import { EventBridgeClient, PutRuleCommand } from '@aws-sdk/client-eventbridge';

// In AWS SDK v3 the Step Functions client class is SFNClient.
const stepFunctionsClient = new SFNClient({ region: 'us-east-1' });
const eventBridgeClient = new EventBridgeClient({ region: 'us-east-1' });

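// A definition needs at least StartAt and States; this placeholder
// contains a single Pass state.
const updateStateMachineCommand = new UpdateStateMachineCommand({
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine',
  definition: JSON.stringify({
    StartAt: 'Done',
    States: { Done: { Type: 'Pass', End: true } },
  }),
});

// Match events from a specific source instead of using an empty pattern.
const putRuleCommand = new PutRuleCommand({
  Name: 'MyRule',
  EventBusName: 'default',
  EventPattern: JSON.stringify({ source: ['my-source'] }),
});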

stepFunctionsClient.send(updateStateMachineCommand).then((data) => {
  console.log(data);
});

eventBridgeClient.send(putRuleCommand).then((data) => {
  console.log(data);
});

When implementing a message queueing mechanism, keep the Step Functions concurrency quotas in mind. A Distributed Map state runs at most 10,000 parallel child workflow executions, and StartExecution fails with an ExecutionLimitExceeded error when the account's quota for running executions is reached.
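When a start request is rejected because too many executions are already running, the usual response is to wait and try again rather than drop the message. The sketch below shows two small helpers for that: one that recognizes the `ExecutionLimitExceeded` error by its `name` (which is how AWS SDK v3 exposes service exceptions), and one that computes an exponential backoff delay with full jitter. The base delay, cap, and injectable `random` parameter are assumptions made for illustration and testability.

```typescript
// True when the error is the Step Functions "too many running executions" error.
function isExecutionLimitError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { name?: unknown }).name === 'ExecutionLimitExceeded'
  );
}

// Exponential backoff with full jitter: a random delay between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 20_000,
  random: () => number = Math.random,
): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError('attempt must be a non-negative integer');
  }
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value of 0.5 the delays are deterministic.
const half = () => 0.5;
console.log(backoffDelayMs(0, 200, 20_000, half)); // 100
console.log(backoffDelayMs(3, 200, 20_000, half)); // 800
console.log(backoffDelayMs(10, 200, 20_000, half)); // 10000 (capped)
console.log(isExecutionLimitError({ name: 'ExecutionLimitExceeded' })); // true
```

In a consumer, these would typically wrap the `StartExecutionCommand` call: retry only when `isExecutionLimitError` returns true, and leave the message on the queue if the retry budget runs out.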

Best Practices for Integrating Step Functions with EventBridge

When integrating Step Functions with EventBridge, start by confirming that the state machine and the event bus you are wiring together exist and are configured the way you expect.

import { SFNClient, DescribeStateMachineCommand } from '@aws-sdk/client-sfn';
import { EventBridgeClient, DescribeEventBusCommand } from '@aws-sdk/client-eventbridge';

// In AWS SDK v3 the Step Functions client class is SFNClient.
const stepFunctionsClient = new SFNClient({ region: 'us-east-1' });
const eventBridgeClient = new EventBridgeClient({ region: 'us-east-1' });

const describeStateMachineCommand = new DescribeStateMachineCommand({
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine',
});

const describeEventBusCommand = new DescribeEventBusCommand({
  Name: 'default',
});

stepFunctionsClient.send(describeStateMachineCommand).then((data) => {
  console.log(data);
});

eventBridgeClient.send(describeEventBusCommand).then((data) => {
  console.log(data);
});

EventBridge delivery is not instantaneous: under high load, delivery delay can reach 30+ seconds. Measure the delay in your own consumers, and use a message queueing mechanism to absorb bursts so that messages are still handled promptly.
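A simple way to watch delivery delay is to compare the `time` field in the EventBridge event envelope with the moment your consumer receives the event. The sketch below assumes the consumer's clock is reasonably in sync and that `time` reflects when the event was emitted. The 30-second threshold is taken from the figure above, and the function names are hypothetical.

```typescript
// Only the envelope field this sketch needs.
interface EventEnvelope {
  time: string; // ISO 8601 timestamp set when the event was emitted
}

// Milliseconds between the event's timestamp and when the consumer saw it.
function deliveryDelayMs(event: EventEnvelope, receivedAt: Date = new Date()): number {
  const emittedAt = Date.parse(event.time);
  if (Number.isNaN(emittedAt)) {
    throw new Error(`unparseable event time: ${event.time}`);
  }
  return receivedAt.getTime() - emittedAt;
}

// Flags events that arrived later than the chosen threshold.
function isLate(delayMs: number, thresholdMs = 30_000): boolean {
  return delayMs > thresholdMs;
}

const sample: EventEnvelope = { time: '2026-06-24T14:30:00Z' };
const delay = deliveryDelayMs(sample, new Date('2026-06-24T14:30:31Z'));
console.log(delay); // 31000
console.log(isLate(delay)); // true
```

Emitting this value as a custom metric from each consumer gives you a per-rule view of delay that you can alarm on.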

The Takeaway

Here are some key takeaways when using Step Functions and EventBridge:

  • Declare Retry and Catch blocks on Step Functions states, and give EventBridge targets a retry policy and a dead-letter queue, so failed events are not silently dropped.
  • Keep each Standard workflow execution under the 25,000-event history quota by splitting work across executions.
  • Use a message queueing mechanism to absorb bursts of EventBridge messages.
  • Plan for the 10,000 parallel child execution limit of Distributed Map and for ExecutionLimitExceeded errors from StartExecution.
  • Use EventBridge Pipes with caution: events that do not match the filter are discarded without an error, and the 5-second filter evaluation limit should be verified against current quotas.
  • Monitor your EventBridge delivery delay under high load to ensure timely message processing.

Example console output (abridged) from the DescribeExecution and DescribeEventBus calls:

{
  "executionArn": "arn:aws:states:us-east-1:123456789012:execution:MyStateMachine:MyExecution",
  "stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine",
  "status": "RUNNING",
  "startDate": "2026-06-24T14:30:00.000Z"
}
{
  "Name": "default",
  "Arn": "arn:aws:events:us-east-1:123456789012:event-bus/default",
  "Policy": "{}"
}

Transparency notice

AI-crafted with Groq, powered by LLaMA 3.3 70B.
The topic was scouted from live AWS and Node.js ecosystem signals, and the content —
including all code examples — was written autonomously without human editing.

Published: 2026-06-24 · Primary focus: StepFunctions

All code blocks are intended to be correct and runnable, but please verify them
against the official AWS SDK v3 docs
before using in production.

Find an error? Drop a comment — corrections are always welcome.
