Lambda and API Gateway are the bread and butter of the AWS serverless ecosystem. Lambda offers a compelling programming model of inputs and outputs. Lambda's name is taken from the concept of simple anonymous functions and implies simplicity and ease of use. Lambda delivers on this promise beautifully when our requirements are simple: "enqueue this message" or "fetch the item with this key from the database".
In the real world, our requirements aren't always so simple. Sometimes we must do multiple things in the scope of a synchronous request. Sometimes branching logic is necessary. When we drift from the "just a function" model of programming in Lambda, we can start to see challenges with cost, performance, and observability.
Synchronous Express Workflows for AWS Step Functions were announced back in 2020. I was instantly intrigued, but it wasn't until last year that I had a chance to really try them at scale. Based on my experience over the past year, this is a great way to build Well-Architected microservices.
This article includes source code for a sample project that demonstrates how we can use Express Workflows to create performant and economical microservices.
Let's consider a workflow for receiving orders on an e-commerce website. We are going to receive a web request and then do the following:
- Transform and validate the incoming request
- Validate the order (validate products, compute totals, etc.)
- Reserve each inventory item
- Process the payment with the chosen processor (Stripe, PayPal, or Apple Pay in the example)
- Save the order to the database
- Kick off post-order processing (notification, logging, metrics)
- Send a response
We also need to handle errors, guarantee consistency, and respond promptly. Here is how that looks in the Step Functions console.
We've gained some efficiency here by running a couple of steps in parallel, and we've also handled a variety of error states.
If you are new to Step Functions, I recommend building in the console. Step Functions Workflow Studio gives you a drag-and-drop interface and the ability to export the result of your work to an IaC solution.
Project Layout
Our sample project manages infrastructure with AWS CDK. Starting with a basic CDK project, we build out some functions and our stack like this:
serverless-order-processor/
├── bin/
│ └── app.ts # CDK app entry
├── lib/
│ ├── order-processor-stack.ts # CDK stack (infra + state machine)
│ └── order-workflow.ts # Step Functions definition (CDK constructs)
├── functions/
│ ├── validate-order.ts # Check products exist, prices match
│ ├── reserve-inventory.ts # Reserve single item (used by Map state)
│ ├── release-inventory.ts # Compensation: undo all reservations
│ ├── process-payment.ts # Route to processor by config
│ ├── save-order.ts # Write final order to DynamoDB
│ ├── get-order.ts # GET /orders/{id}
│ └── list-products.ts # GET /products
├── scripts/
│ └── seed.ts # Seed products + inventory
├── cdk.json
├── package.json
├── tsconfig.json
└── README.md
Our stack creates a DynamoDB table to store our products, inventory, and orders. It bundles and provisions our Lambda functions. It describes our state machine, synthesizes an ASL (Amazon States Language) definition, binds the Lambda functions to the state machine, and binds the state machine to an API Gateway that we use to synchronously invoke the workflow.
Workflow Construct
There are different patterns for writing CDK code. I prefer to create L3 constructs to contain complex business patterns. That looks like this:
import { Construct } from "constructs";
import { IFunction } from "aws-cdk-lib/aws-lambda";
import { DefinitionBody, StateMachine, StateMachineType } from "aws-cdk-lib/aws-stepfunctions";

export interface OrderWorkflowProps {
  validateOrderFn: IFunction;
  reserveInventoryFn: IFunction;
  releaseInventoryFn: IFunction;
  processPaymentFn: IFunction;
  saveOrderFn: IFunction;
}

export class OrderWorkflow extends Construct {
  public readonly stateMachine: StateMachine;

  constructor(scope: Construct, id: string, props: OrderWorkflowProps) {
    super(scope, id);

    // steps (task, Map, and Parallel states elided)
    const definition = transformRequest.next(validateRequiredFields);

    // --- State Machine ---
    this.stateMachine = new StateMachine(this, "OrderStateMachine", {
      definitionBody: DefinitionBody.fromChainable(definition),
      stateMachineName: "order-workflow",
      stateMachineType: StateMachineType.EXPRESS,
      // more props
    });
  }
}
This keeps my main stack much cleaner than putting all that code inline.
// --- Step Functions workflow ---
const workflow = new OrderWorkflow(this, "OrderWorkflow", {
  validateOrderFn,
  reserveInventoryFn,
  releaseInventoryFn,
  processPaymentFn,
  saveOrderFn,
});
Most steps are implemented with Lambda functions, though error-handling and data transformations can be handled with Pass states and JSONata. For more on JSONata, check out Create Stateful Serverless Workflows with AWS Step Functions and JSONata.
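For instance, the initial transform might be a simple Pass state. Here's a minimal sketch using the classic JSONPath parameters flavor (the sample project uses JSONata, whose CDK surface differs slightly; the field names here are an assumed input shape):

import { Pass } from "aws-cdk-lib/aws-stepfunctions";

// Hypothetical transform step: reshape the raw API payload before validation.
const transformRequest = new Pass(this, "TransformRequest", {
  parameters: {
    "customerId.$": "$.customerId",
    "items.$": "$.items",
    "paymentMethod.$": "$.payment.method", // assumed input shape
  },
});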
Each step needs a catch block that handles specific errors that step may throw. We need to decide whether to retry the step or fail, based on the error it threw. If the step indicates we can't proceed with the order due to insufficient inventory, it doesn't make sense to retry. But if a step fails due to a service error, throttling, or a partner's technical issues, a retry may be very helpful.
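In CDK terms, that policy might look something like this sketch (state names, error strings, and retry settings are illustrative, not the sample project's exact code):

import { Duration } from "aws-cdk-lib";
import { Pass } from "aws-cdk-lib/aws-stepfunctions";
import { LambdaInvoke } from "aws-cdk-lib/aws-stepfunctions-tasks";

// Placeholder for the compensation chain that undoes inventory reservations.
const releaseInventory = new Pass(this, "ReleaseInventoryPlaceholder");

const processPayment = new LambdaInvoke(this, "ProcessPayment", {
  lambdaFunction: props.processPaymentFn,
  payloadResponseOnly: true,
});

// Retry transient failures: Lambda service errors and throttles.
processPayment.addRetry({
  errors: ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
  interval: Duration.seconds(1),
  maxAttempts: 3,
  backoffRate: 2,
});

// A business failure like insufficient inventory shouldn't be retried;
// catch it and route it to the compensation path instead.
processPayment.addCatch(releaseInventory, {
  errors: ["States.TaskFailed"],
  resultPath: "$.error",
});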
API Gateway Integration
API Gateway provides a direct integration pattern that invokes our state machine. This is simplified and abstracted using the CDK construct StepFunctionsIntegration. This construct is useful, but I prefer to modify it slightly. Let's look at the sample project code:
const sfnIntegration = StepFunctionsIntegration.startExecution(
  workflow.stateMachine,
  {
    integrationResponses,
    requestTemplates: {
      "application/json": requestTemplate,
    },
    useDefaultMethodResponses: false,
  },
);

ordersResource.addMethod("POST", sfnIntegration, {
  methodResponses: [
    { statusCode: "200" },
    { statusCode: "400" },
    { statusCode: "500" },
  ],
});
We create the integration and then bind it to the POST method of the orders resource. There are a couple of things I changed here to better suit our use case. I set useDefaultMethodResponses to false and supplied our own response templates. The reason is that the default response template returns a 500 if the state machine execution throws an error, or a 200 if it doesn't. I wanted to return a 400 for validation errors. To do this, we use a Velocity Template to detect an error key in the response and remap the response to 400 if it's present.
const successResponseTemplate = [
  `#set($sfnOutput = $input.path('$.output'))`,
  `#if($sfnOutput.toString().contains('"status":"error"'))`,
  `#set($context.responseOverride.status = 400)`,
  `#end`,
  `$sfnOutput`,
].join("\n");
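The template is then wired into the integrationResponses passed to the integration. A minimal sketch of that wiring (the sample project's real array likely maps additional error cases):

// Map the Step Functions output through the template above.
const integrationResponses = [
  {
    statusCode: "200",
    responseTemplates: { "application/json": successResponseTemplate },
  },
  // ...additional entries would map API-level errors to 400/500 responses
];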
Named Executions
I also wanted to take advantage of the ability to name an execution, so I provided a custom request template.
const requestTemplate = [
  `#set($customerId = $util.parseJson($input.body).get('customerId'))`,
  `{`,
  `  "input": "$util.escapeJavaScript($input.body).replaceAll("\\\\'", "'")",`,
  `  "name": "$util.escapeJavaScript($customerId)",`,
  `  "stateMachineArn": "${workflow.stateMachine.stateMachineArn}"`,
  `}`,
].join("\n");
This is a modified and simplified version of the request template that ships with AWS CDK. If you plan to do something like this, I suggest going back to the source to make sure you don't miss anything.
The way the named execution works is that we know any valid request will include a customerId, so we pull it from the JSON and set it as the name attribute in the request payload. Step Functions automatically appends a hash, so we don't need to worry about uniqueness. As a result, we can easily find a customer's transactions in the Step Functions console!
Observability
Enabling CloudWatch logging is essentially a must: unlike Standard Workflows, Express Workflows don't retain execution history, so logs are your only record of past executions. The named execution lets us find an exact execution that may have gone awry or be of interest. Then we can see exactly what happened, inspect logs, and improve our flow. The state machine execution will show JSONata variables at every step, as well as all inputs and outputs for each step. It's hard to imagine getting this level of fidelity in a trace without using Step Functions.
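In CDK, that might look like this sketch (the log group name and retention are illustrative choices, not the sample project's exact values):

import { LogGroup, RetentionDays } from "aws-cdk-lib/aws-logs";
import { LogLevel } from "aws-cdk-lib/aws-stepfunctions";

// Illustrative: a dedicated log group with bounded retention (logs are often
// the biggest line item; see the Cost section below).
const logGroup = new LogGroup(this, "WorkflowLogs", {
  retention: RetentionDays.ONE_MONTH,
});

// Added to the StateMachine props shown in the Workflow Construct section:
this.stateMachine = new StateMachine(this, "OrderStateMachine", {
  definitionBody: DefinitionBody.fromChainable(definition),
  stateMachineType: StateMachineType.EXPRESS,
  logs: {
    destination: logGroup,
    level: LogLevel.ALL,        // log every state transition
    includeExecutionData: true, // capture each step's inputs and outputs
  },
});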
Cost
We need to consider trade-offs. Sure, this might help manage complexity, but doesn't it cost more? What about adding latency with state transitions or cold starts?
Let's start with cost. These services are very, very cheap when used correctly. It's often observed that the most expensive part of a serverless stack is the logging, and I can attest to that. These prices are in USD for us-east-1:
- API Gateway REST APIs charge $3.50 per million requests.
- 1 million Express Workflow executions with an average duration of 3 seconds at 64 MB of memory bill slightly higher, at $4.13 (a rough breakdown follows this list).
- 5 million Lambda invocations averaging 500 ms at 128 MB come to just $5.17 (excluding the free tier).
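Here's that Express Workflows breakdown as a quick sketch, using the published first-tier us-east-1 rates at the time of writing (verify current pricing before relying on it):

// Express Workflows bill per request plus GB-seconds of duration.
const requests = 1_000_000;
const perRequest = 1.0 / 1_000_000;   // $1.00 per million requests
const perGbSecond = 0.00001667;       // first-tier duration rate
const seconds = 3;
const memoryGb = 64 / 1024;           // billed in 64 MB increments

const total = requests * perRequest + requests * seconds * memoryGb * perGbSecond;
console.log(total.toFixed(2)); // ≈ 4.13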
The total for this part of the stack is $12.80/month (plus charges for the database and other services, which are beyond our scope here). That is an incredible price for 1 million orders processed. Cost scaling is mostly linear: at 10 million requests, our bill is $127.93. We start to see better pricing tiers at 100 million requests, with a monthly bill of $1,150.05. One hundred million requests works out to an average of 38 checkout conversions every second for an entire month, quite a brisk business! I wasn't kidding that logging will be the big expense at that volume. I won't attempt the math here because it's highly dependent on your use case, but suffice it to say you'll want to keep an eye on it, log only what's necessary, and set reasonable retention on your log groups.
Performance
Now we've demonstrated that this architecture is great for managing complexity and that it's cost-effective. What about performance? Doesn't it stand to reason that passing state between multiple functions would be slower than encapsulating all the logic within a single function?
Not necessarily. First, Express Workflows are extremely fast. In my testing, the execution is through the first Pass state and into my ValidateOrder Lambda function within milliseconds. Small functions with minimal dependencies load and execute very quickly.
Parallel and Map States
The real value proposition here is the ability to execute multiple functions in parallel. Imagine I'm checking out a cart with four different items. Most implementations will reserve the inventory sequentially. We could use Node.js or Go to wait on multiple requests at once, but there are downsides to doing that in a single function. We might need extra memory in anticipation of a large order, and we have to add logic to handle the case where the order can only be partially fulfilled, which mixes concerns. Our Express Workflow can run the same simple function in a Map state, then handle the combined results, as sketched below. We can even limit downstream impact by setting a maximum concurrency on the Map state, so a very large order doesn't attempt to adjust inventory for 100+ items in parallel.
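A minimal sketch of that Map state in CDK, assuming the item list lives at $.items and using the itemProcessor API available in recent CDK versions:

import { Map } from "aws-cdk-lib/aws-stepfunctions";
import { LambdaInvoke } from "aws-cdk-lib/aws-stepfunctions-tasks";

// Fan out the single-item reserve function with bounded parallelism.
const reserveEach = new Map(this, "ReserveEachItem", {
  itemsPath: "$.items",          // assumed input shape
  resultPath: "$.reservations",
  maxConcurrency: 5,             // cap fan-out to protect downstream services
});

reserveEach.itemProcessor(
  new LambdaInvoke(this, "ReserveInventoryItem", {
    lambdaFunction: props.reserveInventoryFn,
    payloadResponseOnly: true,
  }),
);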
What about executing unlike things in parallel? Express Workflows can handle that as well. Zooming in, we can see that we persist the order to the database while we send the confirmation and update our metrics. Most order-processing systems handle those sequentially.
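That shape is a Parallel state. Here's a sketch with Pass states standing in for the real tasks (only the fan-out structure matters):

import { Parallel, Pass } from "aws-cdk-lib/aws-stepfunctions";

// Pass states stand in for the real LambdaInvoke tasks to keep the sketch small.
const saveOrder = new Pass(this, "SaveOrder");
const sendConfirmation = new Pass(this, "SendConfirmation");
const updateMetrics = new Pass(this, "UpdateMetrics");

const finalize = new Parallel(this, "FinalizeOrder", {
  resultPath: "$.finalizeResults",
});
finalize.branch(saveOrder);                            // branch 1: persist the order
finalize.branch(sendConfirmation.next(updateMetrics)); // branch 2: side effects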
Cold Starts
What about all the cold starts? Won't the extra functions cause extra startup latency? Well, first of all, read this. If that doesn't convince you: after working in serverless for the better part of a decade, I don't worry about them at all. Yes, sometimes a function will start cold, costing you 200ms. At scale this isn't much, because that function container may be invoked tens or even hundreds of thousands of times during its lifecycle, and that 200ms tax is paid only once across all those invocations.
Scaling and Service Limits
Step Functions
A casual reading of the docs suggests that Express Workflows begin to throttle at 6,000 RPS. That's only the case for asynchronous invocations; for synchronous invocations, there's no rate limit at all! This is very rare for an AWS service, and it comes with an advisory:
Synchronous Express execution API calls don't contribute to existing account capacity limits. Step Functions provides capacity on demand and automatically scales with sustained workload. Surges in workload may be throttled until capacity is available.
If you experience throttling, try again after some time. For information about Synchronous Express workflows, see Synchronous and Asynchronous Express Workflows in Step Functions.
I have never seen a synchronous execution throttle, but I have seen services backed by Express Workflows scale very quickly. If you plan to operate this service on a very large scale, it's a good idea to speak with your account team.
API Gateway
With no hard limit on Step Functions executions, we need to look at API Gateway quotas. API Gateway can handle 10,000 RPS (sustained) per region. This limit applies at the account and region level and can be increased.
Lambda
Lambda has a default limit of 1,000 concurrent executions per account, per region. This is your most likely source of throttles in this kind of architecture: since we invoke multiple functions per workflow, sometimes in parallel, we can reach that limit quickly. Fortunately, it can be raised with a quota increase. It's a very good idea to project your concurrency needs and set your quota appropriately: too high, and a bug could result in an expensive bill; too low, and you may experience throttling.
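A quick way to project that need, as a sketch with illustrative numbers (concurrency ≈ invocation rate × average duration, per Little's law):

// Illustrative projection, not measured data from the sample project.
const workflowsPerSecond = 100;     // peak API traffic
const invocationsPerWorkflow = 5;   // Lambda steps per execution
const avgDurationSeconds = 0.5;     // average function duration

const concurrencyNeeded =
  workflowsPerSecond * invocationsPerWorkflow * avgDurationSeconds;
console.log(concurrencyNeeded); // 250 concurrent executions at peak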
It's also a good idea to set reserved concurrency on every Lambda function. This is different from provisioned concurrency (prepaying to keep a function "warm"); reserved concurrency protects your function from a "noisy neighbor" eating up all of your concurrent executions.
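In CDK that's a single property. A sketch (the value of 100 is a placeholder, not a recommendation):

import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";

// Carve out capacity for the payment step so other functions in the
// account can't starve it of concurrency.
const processPaymentFn = new NodejsFunction(this, "ProcessPaymentFn", {
  entry: "functions/process-payment.ts",
  reservedConcurrentExecutions: 100, // placeholder value
});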
Finally, you can handle throttles with retries in Step Functions. Even synchronous executions should implement retries in case of throttling or service failure. At extremely high throughput, you might be throttled, wait a few milliseconds, and then succeed on retry, a much better outcome than failing with a 429 error!
To see this in action, I created a simple load test script using k6. Here's a sample run from my laptop.
execution: local
script: scripts/load-test.js
output: -
scenarios: (100.00%) 1 scenario, 200 max VUs, 1m5s max duration (incl. graceful stop):
* default: Up to 200 looping VUs for 35s over 4 stages (gracefulRampDown: 30s, gracefulStop: 30s)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Load Test Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total requests: 9442
Failure rate: 0.0%
Latency avg: 419ms
Latency med: 408ms
Latency p90: 499ms
Latency p95: 555ms
Latency max: 3673ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
running (0m35.5s), 000/200 VUs, 9442 complete and 0 interrupted iterations
default ✓ [======================================] 000/200 VUs 35s
I'm not able to generate much more load than that on a MacBook Pro, but this does illustrate how easily this architecture handles traffic spikes. "Scalability" here is a bit of a misnomer, as my service isn't scaling up to handle the load. Instead, there is available capacity to meet my needs at all times!
More on Step Functions
If you found this article helpful, check out my other writing on Step Functions!
- https://dev.to/aws-builders/create-stateful-serverless-workflows-with-aws-step-functions-and-jsonata-2oe3
- https://dev.to/aws-builders/avoiding-the-serverless-workflow-antipattern-2ba1
- https://dev.to/elthrasher/exploring-aws-cdk-step-functions-1d1e
Cover by Glynis Morgan