DEV Community

eelayoubi

Using AWS Step Functions To Implement The SAGA Pattern

Introduction

In this post I will walk you through how to leverage AWS Step Functions to implement the SAGA Pattern.

Put simply, the Saga pattern is a failure management pattern that helps us establish semantic consistency in distributed applications. When a workflow spans more than one collaborating service or function, every transaction gets a compensating transaction that can undo it.

For our use case, imagine we have a workflow that goes as the following:

  • The user books a hotel
  • If that succeeds, we want to book a flight
  • If booking a flight succeeds we want to book a rental
  • If booking a rental succeeds, we consider the flow a success.

As you may have guessed, this is the happy scenario, where everything goes right (shockingly ...).

However, if any of the steps fails, we want to undo the changes introduced by the failed step, and undo all the prior steps if any.

What if the hotel booking step fails? How do we proceed? What if the hotel booking step passes but booking a flight fails? We need to be able to revert the changes.

Example:

  1. User books a hotel successfully
  2. Booking the flight failed
  3. Cancel the flight (assuming the failure happened after we saved the flight record in the database)
  4. Cancel the hotel record
  5. Fail the machine
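The compensation order in the example above can be sketched in plain JavaScript, without any AWS service involved. This is only an illustration: `runSaga` and the step objects are hypothetical helpers, not part of the repository, and the failed step is compensated too because this post assumes a booking fails after its record was saved.

```javascript
// Minimal saga runner (hypothetical helper, not part of the article's repository).
// Each step has a name, an `action`, and a `compensate` function.
function runSaga(steps, log = []) {
  const started = [];
  for (const step of steps) {
    started.push(step); // remember the step before running it
    try {
      step.action();
      log.push(`done:${step.name}`);
    } catch (error) {
      log.push(`failed:${step.name}`);
      // Undo the failed step first, then every earlier step, in reverse order.
      for (const s of started.reverse()) {
        s.compensate();
        log.push(`undone:${s.name}`);
      }
      return { status: "FAILED", log };
    }
  }
  return { status: "SUCCEEDED", log };
}

// Usage: the flight booking fails after the hotel was booked.
const noop = () => {};
const result = runSaga([
  { name: "BookHotel", action: noop, compensate: noop },
  { name: "BookFlight", action: () => { throw new Error("BookFlightError"); }, compensate: noop },
  { name: "BookRental", action: noop, compensate: noop },
]);
console.log(result.status, result.log);
```

In the rest of the post, Step Functions plays the role of `runSaga`: the ordering lives in the state machine definition instead of in application code.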

AWS Step Functions can help us here, since we can implement each of these operations as a step (or task), and Step Functions can orchestrate the transitions between them.

Deploying The Resources

You will find the code repository here.

Please refer to this section to deploy the resources.

For the full list of the resources deployed, check out this table.

DynamoDB Tables

In our example, we are deploying 3 DynamoDB tables:

  • BookHotel
  • BookFlight
  • BookRental

The following is the code responsible for creating the BookHotel table:

module "book_hotel_ddb" {
  source         = "./modules/dynamodb"
  table_name     = var.book_hotel_ddb_name
  billing_mode   = var.billing_mode
  read_capacity  = var.read_capacity
  write_capacity = var.write_capacity
  hash_key       = var.hash_key
  hash_key_type  = var.hash_key_type

  additional_tags = var.book_hotel_ddb_additional_tags
}

Lambda Functions

We will be relying on 6 Lambda functions to implement our example:

  • BookHotel
  • BookFlight
  • BookRental
  • CancelHotel
  • CancelFlight
  • CancelRental

The functions are pretty simple and straightforward.

BookHotel Function

exports.handler = async (event) => {
  ...
  const {
    confirmation_id,
    checkin_date,
    checkout_date
  } = event

...

  try {
    await ddb.putItem(params).promise();
    console.log('Success')

  } catch (error) {
    console.log('Error: ', error)
    throw new Error("Unexpected Error")
  }

  if (confirmation_id.startsWith("11")) {
    throw new BookHotelError("Expected Error")
  }

  return {
    confirmation_id,
    checkin_date,
    checkout_date
  };
};

For the full code, please check out the index.js file.

As you can see, the function expects an input with the following fields:

  • confirmation_id
  • checkin_date
  • checkout_date

The function creates an item in the BookHotel table and returns its input as its output.

To trigger an error, pass a confirmation_id that starts with '11'. The function will then throw a custom error that the step function will catch.
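The snippet above uses BookHotelError without showing its definition. A minimal sketch of what such a class could look like follows. It is an assumption, not the repository's code: it relies on the error's `name` being the value that the state machine's "ErrorEquals" is compared against, and `failIfTestId` is a hypothetical helper that mirrors the prefix check.

```javascript
// Hypothetical definition; the article's snippet does not show this class.
class BookHotelError extends Error {
  constructor(message) {
    super(message);
    // Assumption: this name is what a Catch block's "ErrorEquals" entry must match.
    this.name = "BookHotelError";
  }
}

// Mirrors the check in the handler: ids starting with "11" fail on purpose.
function failIfTestId(confirmationId) {
  if (confirmationId.startsWith("11")) {
    throw new BookHotelError("Expected Error");
  }
  return confirmationId;
}
```

Note that the generic `new Error("Unexpected Error")` thrown from the DynamoDB catch block has the name "Error", so a Catch block listing only "BookHotelError" would not match it.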

CancelHotel Function

const AWS = require("aws-sdk")
const ddb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });

const TABLE_NAME = process.env.TABLE_NAME

exports.handler = async (event) => {

    var params = {
        TableName: TABLE_NAME,
        Key: {
            'id': { S: event.confirmation_id }
        }
    };

    try {
        await ddb.deleteItem(params).promise();
        console.log('Success')
        return {
            statusCode: 201,
            body: "Cancel Hotel success",
        };
    } catch (error) {
        console.log('Error: ', error)
        throw new Error("ServerError")
    }
};

This function simply deletes the item that was created by the BookHotel function using the confirmation_id as a key.

We could have checked whether the item exists before deleting it. To keep things simple, though, I am assuming that a Book function always fails after its record has been created in the table.
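If you did want that check, one option is a conditional delete. The sketch below only builds the parameters and is not taken from the repository: `buildCancelParams` and `isNothingToUndo` are hypothetical helpers, and the key attribute name `id` is copied from the CancelHotel code above.

```javascript
// Builds deleteItem parameters that only succeed if the record exists.
// Sketch only: the article's CancelHotel function deletes unconditionally.
function buildCancelParams(tableName, confirmationId) {
  return {
    TableName: tableName,
    Key: { id: { S: confirmationId } },
    // DynamoDB rejects the delete with ConditionalCheckFailedException
    // when no item with this key exists.
    ConditionExpression: "attribute_exists(#id)",
    ExpressionAttributeNames: { "#id": "id" },
  };
}

// A cancel handler could treat that specific error as "nothing to undo".
function isNothingToUndo(error) {
  return error.name === "ConditionalCheckFailedException" ||
    error.code === "ConditionalCheckFailedException";
}
```

A handler could pass the result of `buildCancelParams` to `ddb.deleteItem` and return normally when `isNothingToUndo(error)` is true, so that cancelling a booking that was never stored does not fail the compensation chain.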

💡 NOTE: The same logic goes for all the other Book and Cancel functions.

Reservation Step Function

# Step Function
module "step_function" {
  source = "terraform-aws-modules/step-functions/aws"

  name = "Reservation"

  definition = templatefile("${path.module}/state-machine/reservation.asl.json", {
    BOOK_HOTEL_FUNCTION_ARN    = module.book_hotel_lambda.function_arn,
    CANCEL_HOTEL_FUNCTION_ARN  = module.cancel_hotel_lambda.function_arn,
    BOOK_FLIGHT_FUNCTION_ARN   = module.book_flight_lambda.function_arn,
    CANCEL_FLIGHT_FUNCTION_ARN = module.cancel_flight_lambda.function_arn,
    BOOK_RENTAL_LAMBDA_ARN     = module.book_rental_lambda.function_arn,
    CANCEL_RENTAL_LAMBDA_ARN   = module.cancel_rental_lambda.function_arn
  })

  service_integrations = {
    lambda = {
      lambda = [
        module.book_hotel_lambda.function_arn,
        module.book_flight_lambda.function_arn,
        module.book_rental_lambda.function_arn,
        module.cancel_hotel_lambda.function_arn,
        module.cancel_flight_lambda.function_arn,
        module.cancel_rental_lambda.function_arn,
      ]
    }
  }

  type = "STANDARD"
}

This is the code that creates the step function. I am relying on a Terraform module to create it.

This piece of code creates a step function that uses the reservation.asl.json file as its definition. In service_integrations, we give the step function permission to invoke the Lambda functions, since they are all part of the step function workflow.

Below is the full diagram for the step function:

Step Function Diagram

The reservation.asl.json file is written in the Amazon States Language.

If you open the file, you will notice "StartAt": "BookHotel" on the second line. This tells Step Functions to start at the BookHotel state.

Happy Scenario

"BookHotel": {
    "Type": "Task",
    "Resource": "${BOOK_HOTEL_FUNCTION_ARN}",
    "TimeoutSeconds": 10,
    "Retry": [
        {
            "ErrorEquals": [
                "States.Timeout",
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 1.5
        }
    ],
    "Catch": [
        {
            "ErrorEquals": [
                "BookHotelError"
            ],
            "ResultPath": "$.error-info",
            "Next": "CancelHotel"
        }
    ],
    "Next": "BookFlight"
},

The BookHotel state is a Task with a "Resource" that Terraform resolves to the ARN of the BookHotel Lambda function.

As you might have noticed, I am using a Retry block. The step function will retry the BookHotel function up to 3 times (after the first attempt) if it fails with any of the following errors:

  • "States.Timeout"
  • "Lambda.ServiceException"
  • "Lambda.AWSLambdaException"
  • "Lambda.SdkClientException"
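To make the retry settings concrete, here is a small sketch that computes the wait before each retry from the values used above (IntervalSeconds 2, BackoffRate 1.5, MaxAttempts 3). `retryDelays` is a hypothetical helper, and it ignores optional Retry fields that this state machine does not set.

```javascript
// Wait before each retry: IntervalSeconds * BackoffRate^(retry index).
// Sketch that ignores optional settings such as a maximum delay or jitter.
function retryDelays({ IntervalSeconds, BackoffRate, MaxAttempts }) {
  const delays = [];
  for (let retry = 0; retry < MaxAttempts; retry++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, retry));
  }
  return delays;
}

const delays = retryDelays({ IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 1.5 });
console.log(delays); // [ 2, 3, 4.5 ]
```

With these settings the state waits 2, 3, and 4.5 seconds before the three retries, so a task that keeps failing with a retryable error spends 9.5 seconds waiting in total, on top of the time taken by the attempts themselves.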

You can ignore the "Catch" block for now; we will get back to it in the unhappy scenarios section.

After the BookHotel task is done, the step function transitions to the BookFlight state, as specified in the "Next" field.

"BookFlight": {
    "Type": "Task",
    "Resource": "${BOOK_FLIGHT_FUNCTION_ARN}",
    "TimeoutSeconds": 10,
    "Retry": [
        {
            "ErrorEquals": [
                "States.Timeout",
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 1.5
        }
    ],
    "Catch": [
        {
            "ErrorEquals": [
                "BookFlightError"
            ],
            "ResultPath": "$.error-info",
            "Next": "CancelFlight"
        }
    ],
    "Next": "BookRental"
},

The BookFlight state follows the same pattern: we retry invoking the BookFlight function if we hit any of the errors specified in the Retry block. If no error is thrown, the step function transitions to the BookRental state.

"BookRental": {
    "Type": "Task",
    "Resource": "${BOOK_RENTAL_LAMBDA_ARN}",
    "TimeoutSeconds": 10,
    "Retry": [
        {
            "ErrorEquals": [
                "States.Timeout",
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 1.5
        }
    ],
    "Catch": [
        {
            "ErrorEquals": [
                "BookRentalError"
            ],
            "ResultPath": "$.error-info",
            "Next": "CancelRental"
        }
    ],
    "Next": "ReservationSucceeded"
},

The BookRental state follows the same pattern. Again, we retry invoking the BookRental function if we hit any of the errors specified in the Retry block. If no error is thrown, the step function transitions to the ReservationSucceeded state.

"ReservationSucceeded": {
    "Type": "Succeed"
 },

ReservationSucceeded is a state of type Succeed. In this case, it terminates the state machine successfully.

Happy scenario

Unhappy Scenarios

Oh no BookHotel failed

As you recall, I included a Catch block in the BookHotel state. In the BookHotel function, if the confirmation_id starts with 11, a custom error of type BookHotelError is thrown. The Catch block catches it and transitions to the state named in its "Next" field, which is CancelHotel in this case.
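The Catch block also sets "ResultPath": "$.error-info". The following sketch models what that means for the data passed on to CancelHotel. It is a simplified illustration, not how Step Functions is implemented: `applyCatch` is a hypothetical helper, it only handles top-level paths, and the sample dates and Cause text are made up.

```javascript
// Simplified model of a Catch block (hypothetical helper, top-level paths only).
function applyCatch(stateInput, error, catchers) {
  for (const catcher of catchers) {
    const matches = catcher.ErrorEquals.includes(error.Error) ||
      catcher.ErrorEquals.includes("States.ALL");
    if (!matches) continue;
    // "$.error-info" -> "error-info"; the error is attached next to the input.
    const key = catcher.ResultPath.replace(/^\$\./, "");
    return { next: catcher.Next, output: { ...stateInput, [key]: error } };
  }
  return null; // no catcher matched, so the execution would fail
}

// Sample input (illustrative values) and the Catch block from the BookHotel state.
const input = { confirmation_id: "1123", checkin_date: "2024-01-10", checkout_date: "2024-01-12" };
const catchers = [
  { ErrorEquals: ["BookHotelError"], ResultPath: "$.error-info", Next: "CancelHotel" },
];
const handled = applyCatch(input, { Error: "BookHotelError", Cause: "Expected Error" }, catchers);
console.log(handled.next, Object.keys(handled.output));
```

Because the error is added under error-info instead of replacing the input, the original confirmation_id is still available to the CancelHotel function, which uses it as the key to delete.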

"CancelHotel": {
    "Type": "Task",
    "Resource": "${CANCEL_HOTEL_FUNCTION_ARN}",
    "ResultPath": "$.output.cancel-hotel",
    "TimeoutSeconds": 10,
    "Retry": [
        {
            "ErrorEquals": [
                "States.Timeout",
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 1.5
        }
    ],
    "Next": "ReservationFailed"
},

CancelHotel is a "Task" as well, and it has a Retry block to retry invoking the function in case of an unexpected error. The "Next" field instructs the step function to transition to the "ReservationFailed" state.

"ReservationFailed": {
    "Type": "Fail"
}

The "ReservationFailed" state is of type Fail. It terminates the machine and marks the execution as "Failed".

BookHotel failed

BookFlight is failing

We can instruct the BookFlight lambda function to throw an error by passing a confirmation_id that starts with 22.

The BookFlight task has a Catch block that catches the BookFlightError and instructs the step function to transition to the CancelFlight state.

"CancelFlight": {
    "Type": "Task",
    "Resource": "${CANCEL_FLIGHT_FUNCTION_ARN}",
    "ResultPath": "$.output.cancel-flight",
    "TimeoutSeconds": 10,
    "Retry": [
      {
        "ErrorEquals": [
          "States.Timeout",
          "Lambda.ServiceException",
          "Lambda.AWSLambdaException",
          "Lambda.SdkClientException"
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 1.5
      }
    ],
    "Next": "CancelHotel"
  },

Similar to CancelHotel, the CancelFlight state triggers the CancelFlight Lambda function to undo the changes. It then instructs the step function to go to the next step, CancelHotel. As we saw earlier, CancelHotel undoes the changes introduced by BookHotel and then transitions to ReservationFailed to terminate the machine.

BookFlight Failed

BookRental is failing

The BookRental Lambda function will throw the BookRentalError error if the confirmation_id starts with 33.

This error is caught by the Catch block in the BookRental task, which instructs the step function to go to the CancelRental state.

"CancelRental": {
    "Type": "Task",
    "Resource": "${CANCEL_RENTAL_LAMBDA_ARN}",
    "ResultPath": "$.output.cancel-rental",
    "TimeoutSeconds": 10,
    "Retry": [
        {
            "ErrorEquals": [
                "States.Timeout",
                "Lambda.ServiceException",
                "Lambda.AWSLambdaException",
                "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 1.5
        }
    ],
    "Next": "CancelFlight"
},

Similar to CancelFlight, the CancelRental state triggers the CancelRental Lambda function to undo the changes. It then instructs the step function to go to the next step, CancelFlight. After cancelling the flight, CancelFlight has a "Next" field that instructs the step function to transition to the CancelHotel state, which undoes its changes and transitions to the ReservationFailed state to terminate the machine.
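The whole compensation chain can be read off the "Next" fields. The sketch below copies only those transitions from the state definitions in this post and walks them; `tracePath` is a hypothetical helper for illustration and does not parse the real reservation.asl.json file.

```javascript
// Only the "Type" and "Next" fields of the compensation states from this post.
const states = {
  CancelRental: { Type: "Task", Next: "CancelFlight" },
  CancelFlight: { Type: "Task", Next: "CancelHotel" },
  CancelHotel: { Type: "Task", Next: "ReservationFailed" },
  ReservationFailed: { Type: "Fail" },
};

// Follows "Next" from a starting state until a state has no "Next".
// The includes() check guards against loops in a malformed definition.
function tracePath(definition, start) {
  const path = [];
  let current = start;
  while (current && definition[current] && !path.includes(current)) {
    path.push(current);
    current = definition[current].Next;
  }
  return path;
}

console.log(tracePath(states, "CancelRental").join(" -> "));
console.log(tracePath(states, "CancelFlight").join(" -> "));
```

Whichever booking fails, the execution enters this chain at the matching Cancel state and always ends in ReservationFailed, which is what makes the later bookings unwind before the earlier ones.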

BookRental failed

Conclusion

In this post, we saw how we can leverage AWS Step Functions to orchestrate and implement a failure management strategy that establishes semantic consistency in our distributed reservation application.

I hope you found this article beneficial. Thank you for reading ... 🤓
