Valerio Vaudi

Posted on Sep 16, 2020 • Edited on Sep 28, 2020

Manage Distributed Transactions with Saga

#distributedsystems #transactions #pattern #microservices

In this article, I would like to describe a common pattern in the distributed system landscape about transactions, the SAGA Pattern.

What is SAGA

SAGA is a pattern that tries to solve the problem of long-lived distributed transactions. Let me try to explain the saga with a metaphor on SAGA TV just to make it funny.

The metaphor

Try to imagine what happens in your mind when you start to see a saga tv. You start to see the first episode and at the end of the first episode, you will have partial knowledge of the whole saga. As soon as you start to see more episodes your knowledge grow, in certain case you can be a little bit confused, some details can appear incoherent, you can go back in the episode because you do not understand what you are watching and try to watch again to do not fail to understand the episode. Hopefully, at the end of the SAGA, all strange incoherent details during the path became consistent.

The metaphor made reality

Similarly, the pattern tries to manage a long-lived distributed transaction like a SAGA, in which every episode is your local transaction performed by a peer that join in the transaction and only when you have seen all the saga, you will have done the whole distributed transaction with a commit or a rollback.

Since that the transaction is distributed across different peers, that do not share databases, the pattern takes as acceptable some inconsistent transient states, like in a saga you can have moments in which not all details are consistent, this characteristic takes the name of BASE transactions, I will talk later about the BASE transactions.

To face distributed commit and rollback, since that a SAGA is a set of local transactions performed by all peers, like in a saga your knowledge of the overall state grow up and down as you go back and forward with the episode, any peers should provide an execute operation and an undo operation to perform eventually some compensations.

Therefore if all execute operations have eventually performed, the transaction is completed with a commit, let say all local transactions are committed, otherwise if all undo operations are performed the transaction is rollbacked.

BASE VS ACID

The famous acronym ACID stands for Atomic Consistent Independent and Durable those are the typical features that everyone is common to use in Relation Databases like Postgres, MySql, and so on so forth.

Transactions that follow those features are the most consistent that we can implement in the Computer Science landscape. Unfortunately depending form the service mesh topology, communication style, required time coupling, space coupling of all peers, sometimes guarantee ACID feature became impractical for performance, business use cases, resource usage, and so on.

This is a more hard condition for a distributed system like SOA and Microservices systems. This is the case that a complete change of mind became required and this is when BASE comes handy, relaxing some constraints, to untie the knot of the skein.

BASE stands for: Basically Available Soft state Eventually consistent. Those features are a typical constraint relax required to make a distributed system optimized for performance, resiliency, and to make practicable in the real world.

Another constraint that should drive you to a BASE approach is about the time coupling that peers have in the transaction. If your transaction can take 1 day, keeping not visible and isolated partial modifications of the overall state is very complex and expensive. In practice in a long-lived distributed transaction, we have to embrace the eventual consistency and make it acceptable that the state of your system can change as soon as it is stimulated from the external world, let say your state is soft.

It is why your long-lived distributed transactions should be BASE, it should be Basically Available, we do not want complex distributed locking stuff that can suffer from very bad behavior in case of network failure, Soft state your peer's state can change state as soon as an execute or undo operation is triggered, Eventually consistency, your system will be consistent but eventually on the time.

It is for that reason that ACID is not suitable for a distributed system, especially when the system is time and spatially decoupled. We know that there is some attempt to make ACID in a distributed system but 2PC tends to introduce so many control stuff via locking that makes performance and maintenance very complex. Embracing BASE instead lets us break constraints that should be impractical to respect and speed up the overall availability of the system.

Remember for a rollback in SAGA we can leverage transaction compensation like other protocols, BPEL is an example, to achieve rollback without a lock, your state is Soft on execute and undo operations

Why SAGA

In a distributed system we have many peers that interact with each other, sometimes we can have a transaction that runs in a single system but sometimes it is not possible.

When a transaction span multiple services, the ACID approach is not suitable and a change approach is required. BASE transactions are more suitable in a distributed system landscape, in particular when time and space decoupling is needed. Saga connects multiple local transactions applying retry or undoes operation in case of local failures typically via messaging.

Therefore, when you have a distributed system and a transaction need to span multiple services an AP/CP system is born and how we learned, in this case, an ACID transaction is not suitable due to the network partition of the overall system, we need to switch to a BASE transaction.

In this case, a question may be: ok I need a BASE transaction and use SAGA like a flow of local transactions, but how connect these local transactions, how apply rollback, how discover that my distributed transaction is completed?
I can answer these questions with a couple of words: Orchestration/Choreography.

Orchestration Choreography

In SAGA we have two ways to manage the flow: Orchestration and Choreography.

Orchestration

The most common and simple approach to manage a SAGA is to use a third-party system that acts as Orchestrator to the whole transaction, it is like to have a high order business function that manages all the SAGA flow, pay attention I said third party system, not an ESB or something like that. This approach is the most simple to code, maintain and troubleshoot in case of issues, since that we have another system in which all the needed code to orchestrate the overall process is written in only one place and following the flow in the code is more natural, we can control in only one place all the SAGA.

Choreography

On the other hands with Choreography, we have that all the peers in autonomy manage your slot of transaction and events management for go forward and rollback the process, it is the most fluid approach since that we do not have a single point of failure that orchestrate the process, but it is the most complex to maintain and troubleshoot in case of issues due to the process knowledge is scattered across all the systems, even worst if some change in the process requires to add or remove some listeners due to it is complex to understand what happens in the ongoing processes and what are the owner.

Typically it is better to have an orchestrator. Even if we lose some fluidity, we gain more control and governance on the process. In the Orchestration landscape there exist many tools that can help like Axon Framework, Eventuate Tram, some messaging frameworks like Spring Integration, Apache Camel, and many more, that can help to implements the pattern.

Let's do it

As use case I will try to implement a salse order use case, probably the most common one use case that you can see around SAGA :D, here is the link on my GitHub repository.

As said before, even if Axon, Eventuate are good frameworks, for this working software example, I will prefer to use a more near to plain messaging framework in my case my choice is Spring Cloud Stream as an orchestration framework.

One of the first things that make an impact if pay attention to the previous discussion about how to integrate different peers in a SAGA transaction for sure will be the asynchronous nature of a system connected with a messaging system, in my working software use case apache Kafka.

Even if it seems good due to we have the messaging system that provides us fluidity, elasticity, and many other cool features at the end in many use cases we have one synchronous rest API that interacts with our orchestrator that instead is asynchronous. Now the question is clear how deal with our asynchronous orchestrated process if our final customer calls a synchronous rest API? how we can get the answer to the transaction and report the result to the customer, how we can save resources in the orchestrator? we do not want to keep alive an HTTP connection for days! The short answer to all previous questions is Soft state in the BASE approach. Let me explain how Soft state can solve the problem with a practical example.

In my sample code, you have a sales-order-service, catalog-service, and inventory-service. Sales order service is our high order service that acts as an orchestrator providing our saga entry point, catalog service like the name can suggest, is a catalog that provides us the goods price, while inventory gives us the inventory status in our stock decrementing goods quantity during the sales order process according to the quantity of the goods in stock.

Suppose that during the process, we have some services that take too long time to answer due to network congestion, service down, or some like that. How we can deal with these situations if my process involves many services and at the n -1 stage fails?

The first simple step in implementing a SAGA is to say: ok do not worries make all service communication integrated with a message broker. The benefit of a broker is the possibility to apply retry, distribute better the load, we can have as many replicas we want all attached to a message topic and distribute the load applying pub/sub pattern, we can manage dead letter queue pattern in case of failures, apply some very intelligent business decision in case too much failure and many more. But it is only the beautiful things, in practice, we have to manage idempotency either in our message consumer or in your orchestrator and undo path to apply for local transaction compensation in case of failure.

The idempotency is the most tricky feature due to this capability depends so much form the domain of the system. One of the most simple approaches is to take the message payload, encode it in a predictable way like a sha256 as a key plus some metadata that can give us an insight about if the operation is done or not, and put it in an AP system like Apache Cassandra. Then when an event tries to run we can check if this message is already in the store to say ok it is already executed do ahead without processing again the message. Some of you can recognize it as an event store, but if you are thing about in saga we deal with domains command and events that drive your distributed transaction.

Please do not force your mind about the solution of encoding the message payload as the key of our event store, it is only one possible idea, for instance, we can put idempotency logic in the orchestrator that has the responsibility of the process to go ahead and store in this place domain events and commands of the high-level process. I mean SAGA is a pattern, not a framework how implements messaging orchestration idempotency is up to you, you should even add business metrics to provide some business performance data and many more.

Now that we have the sensibility about idempotency and we know that we can connect our peer via messaging with executing and undo message listener in every transactions peer we have to clarify how to deal all this asynchronous process if our API is a classical synchronous rest API and how the soft state property can help us.

First of all remember that how in computer science some NP problems are solved with some tricks, we have to find our tricks and how in computer science some NP-hard problems are solved via approximation even in our saga we have to embrace some relaxing and approximation of our problem, Transactions.

Consider that we have a rest API if we, for instance, do a POST /sales-order endpoint it is infeasible waiting for the whole transaction and get a strong consistent answer we have to relax our expectation and accept the eventual consistency.

In my example when you fire a post the endpoint answer as soon as possible then all the subsequent GET /sales-order/{salesOrderId} return to you the current state of the order. You could get that your order is in progress and after some time completed or aborted. I mean as soon as the peers run the transaction, all the peers will update your internal state, fire events, it is possible that our inventory decreases the stock quantity for some goods but then increase again due to some rollback of other transactions, I mean the state of our system will be consistent but eventually in the time and as soon as our system fire and process domain events the state of the whole system will change. With this little approximation, we can deal synchronous with asynchronous stuff. It seems simple but actually, we have to think about our domain. In some domains this separation is easy in other not so easy, it is our responsibility as a software engineer to find the best way together with the business on how to make an impact on our business, in a word Agile!

DEV Community