Sergiy Yevtushenko

Posted on Jun 20, 2023

The Saga is Antipattern

#microservices #saga

The Saga pattern is often positioned as a better way to handle distributed transactions. I see no point in discussing Saga's disadvantages, because the real problem is that Saga should not be used in microservices at all:

If you need distributed transactions across a few microservices, most likely you incorrectly defined and separated domains.

Below is a long explanation why.

Microservices As Distributed System

Any microservices-based system is a distributed system. To be precise, it is the simplest possible, basic version of one. Such a distributed system does not maintain any form of consensus. In other words, such a system:

Lacks any built-in means to coordinate nodes
Lacks any built-in means to get information about nodes
Requires any communication between nodes to be part of business logic

These properties make a microservices-based system unable to perform certain tasks: for example, to perform transactions, to maintain consistency (even eventual), or to find out whether all necessary nodes are up and running (i.e. whether the system is available). As a result, if one node needs a piece of information from another node, it must explicitly include a request to the remote service as one of its business steps. This looks surprisingly similar to the interaction between, for example, browsers and web servers. Note that such interaction means that each request is completely independent of the others. Transactions, if they are necessary, never cross request boundaries. Only with such a deep separation of domains does it make sense to require a microservice to maintain its own data, independently and separately, to claim independent deployability or testability, or to accept a possibly failed request to another node as an explicit step in business logic.

Since no transaction can cross the request boundary, the service must govern all data included in the transaction. This can be used as a validation criterion for the separation of domains: if any cross-service transaction is necessary, then the split-up was done incorrectly.
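
The criterion above can be turned into a mechanical check. The sketch below is only an illustration: the ownership map, the service names, and the list of business transactions are hypothetical and not taken from any real system.

```python
# Hypothetical ownership map: which service governs which data (illustrative assumption).
OWNERSHIP = {
    "orders": "order-service",
    "order_items": "order-service",
    "payments": "payment-service",
    "inventory": "inventory-service",
}

# Hypothetical business transactions and the data each one must change atomically.
TRANSACTIONS = {
    "add_item_to_order": {"orders", "order_items"},
    "place_order": {"orders", "payments", "inventory"},
}


def cross_service_transactions(ownership, transactions):
    """Return the transactions whose data is governed by more than one service."""
    violations = {}
    for name, tables in transactions.items():
        owners = {ownership[table] for table in tables}
        if len(owners) > 1:
            violations[name] = sorted(owners)
    return violations


violations = cross_service_transactions(OWNERSHIP, TRANSACTIONS)
for name, owners in violations.items():
    print(f"{name} spans {', '.join(owners)}: the split-up is suspect")
```

Any transaction reported by such a check would need either a distributed transaction or a different domain boundary; the argument of this article is that the boundary is what should change.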

Domain Size Issue

As soon as we start separating domains according to data governance, we may quickly realize that in the vast majority of cases, microservices look a lot like traditional monoliths, and the domain they should handle is big. Or we may realize that traditional monoliths are, in fact, microservices. This happens because most organizations have only a very limited number of truly independent domains. Most often, just one.

Unfortunately, the whole microservices hype ignores this fact, and we get "best practices", "design patterns", books, articles, etc. that stretch the initial idea of loosely coupled independent services to areas where it does not fit. This results in mind-blowing, devastating consequences:

We build unreliable systems (see above about what kind of distributed systems microservices are) on top of reliable ones (cloud infrastructure)
We get an ugly, inherently broken design in which layers are mixed up: communication error handling, retries, and transaction handling all happen at the business logic level
We split data into parts and then try to collect them to process the request, introducing unpredictable and barely controllable tail latency
We get systems that are unable to ensure data integrity and consistency
We have to perform end-to-end testing before deployment, because there is no guarantee that a new version of a service does not break the whole system. This completely negates any independent testability and deployability of the services
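
The tail-latency item in the list above can be illustrated with simple arithmetic. Assume, purely for illustration, that each downstream call independently has a 1% chance of being slow (that is, of landing beyond its 99th percentile). A request that has to collect data from several services is slow whenever at least one of those calls is slow:

```python
def slow_request_probability(fan_out: int, slow_call_probability: float = 0.01) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1.0 - (1.0 - slow_call_probability) ** fan_out


for fan_out in (1, 5, 10, 50):
    share = slow_request_probability(fan_out)
    print(f"{fan_out:>2} calls -> {share:.1%} of requests see tail latency")
```

Under these assumptions, a fan-out of 10 means that roughly one request in ten is affected, even though every individual service looks fine at its own 99th percentile. Real calls are rarely independent, so the exact numbers will differ.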

The list above is definitely incomplete. Incorrectly applied microservices can cause all kinds of harm. Especially when combined with "cloud native" "microservices" frameworks like Spring, which turn the whole system into a bunch of slowly moving monoliths.

All the advantages and requirements that are inherent to microservices and a natural fit for them either disappear or turn into quite painful and expensive obstacles.

Unfortunately, all these considerations might result in much bigger domains than could be considered acceptable for traditional microservices design. That’s fine and just means that we should not use microservices. What can we use then? Let’s take a look.

Handling Big Domain

Since we’re going to handle a big but single domain, we need to use something capable of maintaining consensus. There are at least three options:

Modular monolith (also known as modulith)
Event-driven architecture
Cluster-based architecture

Modular Monolith

This option addresses most monolith pain points, in particular maintainability and concurrent development. Mostly, this is achieved by improving the design via the application of DDD and other techniques. The ability to access all data and perform regular transactions, together with the absence of communication errors, makes this approach an appealing choice for many use cases. In addition, a system built this way is much easier to fix, rework, refactor, or update in an atomic fashion as it evolves. Note that inner sub-services are not required to maintain their own data (but they can, if necessary). Shared data is often inherent and natural in this approach, as we’re talking about a single domain. Since this is just a monolith, after all, there are no issues with maintaining consensus, or with deployment, monitoring, and maintenance.
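
To make the "regular transactions" point concrete, here is a minimal sketch of two modules of one monolith taking part in a single local transaction. It uses Python's built-in `sqlite3` with an in-memory database; the module functions, table names, and the payment/booking scenario are assumptions chosen for illustration.

```python
import sqlite3


def collect_payment(conn, order_id, amount):
    """'Payments' module: records a payment using the shared connection."""
    conn.execute("INSERT INTO payments (order_id, amount) VALUES (?, ?)", (order_id, amount))


def book_ticket(conn, order_id, seat):
    """'Booking' module: fails if the seat is already taken (PRIMARY KEY on seat)."""
    conn.execute("INSERT INTO tickets (seat, order_id) VALUES (?, ?)", (seat, order_id))


def place_order(conn, order_id, amount, seat):
    """Both modules take part in one ordinary local transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            collect_payment(conn, order_id, amount)
            book_ticket(conn, order_id, seat)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE tickets (seat TEXT PRIMARY KEY, order_id INTEGER)")

first = place_order(conn, 1, 50.0, "A1")   # seat is free
second = place_order(conn, 2, 50.0, "A1")  # seat is taken: the payment is rolled back too
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(first, second, payment_count)
```

No compensation logic is needed here: when the booking fails, the database discards the payment row written in the same transaction.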

The main disadvantage of moduliths is the limited scalability. At some point, there might be a need to switch to another design. Fortunately, modularity significantly simplifies the transition.

Event-Driven Architecture (EDA)

EDA is quite a popular choice, and while it is often mentioned in the context of microservices, this is incorrect. By design, EDA relies on reliable data-sharing infrastructure in the form of message brokers or pub-sub services, which explicitly conflicts with how microservices maintain their data. This architecture is described in detail in many sources, so I see no point in repeating them here.

Its main disadvantage is its similarity to microservices in terms of reliance on infrastructure, complex deployment, monitoring, etc.

Cluster-based Architecture

This design is quite a rare animal. It has several advantages; a few are worth noting separately:

Simplicity of transition to this architecture from modulith
No reliance on infrastructure. Often, the whole system is self-contained, could be deployed on premises, in the cloud or across several clouds
Simplicity of deployment — there is only one deployable artifact
Nearly linear scalability
Real fault tolerance: a failed node neither brings down the whole system nor causes an error to be returned for a request. Requests are just processed somewhat more slowly
Great resource utilization, fine-grained two-level scaling

Conclusion

Microservices were thought of as a way to solve problems, but their blind application causes more harm than good. The main issue is that there are no clear criteria for where they are applicable. I have no illusions: my article will not solve this problem, but at least it provides some meaningful criteria for assessing where microservices should not be used.

Top comments (14)

Kirill Birger • Jul 3 '23

I do not feel as though you've made your point. You made a claim, then backed it up by making more claims, and then went on to talk about different design patterns.

Sagas are not only for microservices. You can have the need to perform distributed transactions even in a monolith. Also, not all transactions terminate within your organization's code.

Imagine a ticketing system where my application collects payment by sending HTTP to one third party, and then books a ticket by sending to another third party.

If the second request fails, then I should unwind and roll back. Sure, it's trivial in this case, but I think that is the argument for a saga pattern.

Sergiy Yevtushenko • Jul 4 '23 • Edited

Thanks for your comment. Let's analyze it step by step:

You made a claim, then backed it up by making more claims, and then went on to talk about different design patterns.

I guess you're mixing up my article with regular microservices propaganda, which is based on making claims without bothering with evidence or reasoning. What you call "claims" are basic facts about distributed systems in general and microservices in particular. Since my goal is not click-baiting but educating, besides "claims" I'm also providing information about approaches suitable for cases where microservices are not applicable.

Sagas are not only for microservices. You can have the need to perform distributed transactions even in a monolith. Also, not all transactions terminate within your organization's code.

The problem is not with Saga, which definitely has its own areas of application (for example, state management in UI apps). The problem is attempting to perform distributed transactions in systems which are inherently incapable of performing them. Like microservices. And yes, you may have such a need, but the presence of need does not mean the presence of ability.

Imagine a ticketing system where my application collects payment by sending HTTP to one third party, and then books a ticket by sending to another third party. If the second request fails, then I should unwind and roll back. Sure, it's trivial in this case, but I think that is the argument for a saga pattern.

Thanks for this example. I don't know why you consider it trivial, but even in this case you can easily observe the consequences of a lack of consensus:

Imagine that your payment collection call ends with a timeout, and the same happens with the compensation operation. Your system and the external payment processing system may now have different opinions about the status of the payment
An even funnier case occurs if the payment is processed successfully, but then the ticket booking times out, and exactly the same happens with the compensation operation for the payment. Now every system may have its own understanding of what happened and of the state of each part of the user request (payment and tickets)

Of course, in both cases mentioned above, you may try to recover and restore a consistent view of the data across all systems. This will require additional steps outside the Saga pattern, and recovery still may fail, causing even more mess. Adding each additional step exponentially increases the number of possible inconsistencies.
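
The second scenario can be sketched as a toy simulation. This is not a real Saga implementation: `RemoteService`, `run_saga`, and the scripted replies are hypothetical and exist only to show how the orchestrator's belief and the actual remote state can diverge when replies are lost.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"
    TIMEOUT = "timeout"  # the caller cannot tell whether the remote side applied the call


class RemoteService:
    """Hypothetical third party: may apply a call even though the reply is lost."""

    def __init__(self, name, replies):
        self.name = name
        self.done = False              # the state the remote side actually holds
        self._replies = iter(replies)  # scripted (applied, outcome) pairs

    def call(self, new_state):
        applied, outcome = next(self._replies)
        if applied:
            self.done = new_state
        return outcome


def run_saga(payment, booking):
    """Pay, then book; compensate the payment if the booking does not succeed.

    Returns what the orchestrator believes afterwards.
    """
    if payment.call(True) is not Outcome.OK:
        return "payment state unknown or failed"
    if booking.call(True) is Outcome.OK:
        return "paid and booked"
    # Booking failed or timed out: try to refund.
    if payment.call(False) is Outcome.OK:
        return "refunded"
    return "refund state unknown"


# Payment succeeds, the booking is applied remotely but its reply times out,
# and the refund request times out without being applied.
payment = RemoteService("payment", [(True, Outcome.OK), (False, Outcome.TIMEOUT)])
booking = RemoteService("booking", [(True, Outcome.TIMEOUT)])

belief = run_saga(payment, booking)
print(belief, "| actually paid:", payment.done, "| actually booked:", booking.done)
```

In this run the customer has paid and holds a ticket, while the orchestrator assumes the booking did not happen and does not know whether the refund went through. Recovering from that requires logic outside the Saga itself.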

Kirill Birger • Jul 5 '23

I think the points you make are valid, but I also think they apply equally to any system, micro services, or not. It seems like there is a single point of failure in most, if not all improvements of Saga. In principle, any IO can fail.

In practice, you just need your Saga IO to be more highly available than your other IO.

Since the Saga controller should generally be responsible for orchestration, I don't think you should be able to encounter a scenario where multiple parts of the system have different opinions on the state. Wouldn't you simply end up with a stalled or incomplete transaction?

I'm not sure if I know a better solution to issues like this for these scenarios

Sergiy Yevtushenko • Jul 5 '23

Those distributed systems, which have consensus, don't suffer from such issues by design. For example, clusters can perform transactions (including ACID) without any problems.

Yes, in practice it is possible to achieve what I call "parametric consensus". It means that everything will work properly (even in the face of some number of failures) as long as every component of the system works according to the expectations of the other components. The main issue in this case is the lack of any confidence that all components are working properly, unless you have end-to-end testing that covers all (or most) of the possible failures. In practice, I haven't seen such setups. I guess the main reason is the expensive maintenance of such a test setup.

The better (I'd say "best") solution is to use proper design, which does not suffer from microservices issues. A large part of my article is dedicated to possible solutions. From my experience, most organizations can safely stick with a modulith. Implemented with a decent technology stack, it can handle far more load than those organizations may ever need. But if there is a real need to scale up (especially dynamically), then EDA or the clustered approach will be better. EDA is better described in available sources, but requires a somewhat specific internal design and mindset. The clustered design might provide more familiar internals, and the app can be designed to be completely self-contained (zero infrastructure dependency). Unfortunately, there is very little information about it, although I implemented the first app designed this way more than a decade ago.

Kirill Birger • Jul 6 '23

You're not describing failures of microservices, or sagas. You're discussing failures of straw man implementations of code.

The better (I'd say "best") solution is to use proper design, which does not suffer from microservices issues.

Your design has nothing to do with saga. Let me say it again: You are claiming that saga is an anti pattern, and then you wrote an incredibly verbose pitch to your other blog post. I'm not making any claims about click bait or not, but that is what is coming across here.

The proposals you make have merit. I am simply pointing out that none of this has any meaningful connection to sagas. Hence my comment about your point not being made. It's simply a non sequitur.

Moreover, what you are referring to as cluster based nanoservices is actually how kubernetes and istio work.

You can't just ignore the fact that you will never have full control over every system on the internet. Many transactions do not terminate in your code, but call to third parties. It does not MATTER if you have "nano" services, micro services, mega services, or anything else. You make arguments about microservices with poor boundaries. That's not a trait of microservices, that's a trait of bad software writing.

Yes, if none of the software behaves in reasonable ways, saga won't work. What's your point? If an asteroid hits your router, will you get double charged?

Microservices and saga have disadvantages, but not the ones you're claiming, except for the comments about dependency management, and running locally, which seem to also be an issue in your proposal

Sergiy Yevtushenko • Jul 6 '23

You're not describing failures of microservices, or sagas.

I do. Perhaps you just don't want to accept that.

You're discussing failures of straw man implementations of code.

Even worse: I'm discussing a fundamental flaw in microservices which Saga can't solve. Moreover, the "straw man implementation" is the Saga itself; any recovery logic on top of it is not Saga.

Your design has nothing to do with saga.

We were discussing an example provided by you.

You are claiming that saga is an anti pattern, and then you wrote an incredibly verbose pitch to your other blog post.

Yes, it is, when applied to distributed transactions in microservices. And that article of mine has existed for so long (the first version was published around 2015) that pitching it makes no sense. Actually, today I'd rewrite it from scratch, but I'm keeping it as is for historical reasons.

The proposals you make have merit. I am simply pointing out that none of this has any meaningful connection to sagas.

It has, as long as sagas are used for distributed transactions.

Moreover, what you are referring to as cluster based nanoservices is actually how kubernetes and istio work.

As well as any other cluster: Redis, Apache Ignite, Hazelcast, Infinispan, Cassandra, Zookeeper, etc. But you missed the key point of the proposed architecture: the application is part of the cluster. So, by putting your microservices inside Kubernetes, you don't get a system with the same properties and abilities as clustered nanoservices. There is another missing part: the first time I actually implemented a (somewhat simplified) version of the architecture was in 2012, when Kubernetes and istio didn't even exist.

Many transactions do not terminate in your code, but call to third parties.

So what?

Yes, if none of the software behaves in reasonable ways, saga won't work. What's your point?

It's worth reading what I wrote once again. It's not about "reasonable ways", it's about expectations of other parties. Some software may continue working reasonably and according to specs and docs, but no longer support some assumptions. And the whole system built with these assumptions in mind will stop working or, what is worse, start silently damaging or losing data.

Microservices and saga have disadvantages, but not the ones you're claiming,

Are you referring to my other articles? Because in this article, I'm not claiming, but pointing out, that microservices have no consensus. This is not a claim, but a fact.

except for the comments about dependency management, and running locally, which seem to also be an issue in your proposal

It largely depends on the particular implementation. For example, with Apache Ignite, I had an implementation which works starting from one node - perfectly fine for local deployment and development purposes.

chris damour • Jul 9 '23

Many transactions do not terminate in your code, but call to third parties.

So what?

so the cluster approach will not work given their service runs in their nodes and by definition cant run in your cluster.

fwiw your comments come off as arrogant. Start with assuming you are wrong and reread the comments, they'll make more sense.

Sergiy Yevtushenko • Jul 9 '23

so the cluster approach will not work given their service runs in their nodes and by definition cant run in your cluster

It's not about cluster or EDA, given that we are discussing the microservices model. I was just trying (for the second time) to get from my opponent an explanation of why or how this use case makes Saga not an antipattern. Necessity can't make a bad thing good. Using this use case as an argument sounds like declaring the burning of fossil fuels harmless to the environment just because we live too far from the nearest supermarket and drive a gasoline car to get there.

The real solution for this use case is for such external services to provide an API suitable for 2PC, but this requires changes at the far end, which we don't control. On the other hand, wide understanding of the problem may create demand, and vendors may start adjusting their APIs.
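
As a rough idea of what "an API suitable for 2PC" could look like, here is a minimal in-memory sketch. The `Participant` class and its `prepare`/`commit`/`rollback` methods are hypothetical, not any vendor's actual API, and the sketch deliberately ignores coordinator crashes, durable logs, and timeouts, which a real two-phase commit has to handle.

```python
class Participant:
    """Hypothetical external service exposing a 2PC-friendly API."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        # Reserve resources and promise that a later commit() will succeed.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Phase 1: ask everyone to prepare. Phase 2: commit only if all voted yes."""
    prepared = []
    for participant in participants:
        if not participant.prepare():
            for earlier in prepared:
                earlier.rollback()
            return False
        prepared.append(participant)
    for participant in participants:
        participant.commit()
    return True


payment = Participant("payment")
booking = Participant("booking", can_prepare=False)  # e.g. the seat is gone
committed = two_phase_commit([payment, booking])
print(committed, payment.state, booking.state)
```

The difference from a compensating call is that nothing becomes visible as committed until every participant has promised that it can commit.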

fwiw your comments come off as arrogant.

Sorry, that's my usual reaction to rude and ignorant comments from some wearers of architect hats.

Start with assuming you are wrong and reread the comments, they'll make more sense

Thanks for the suggestion. That's what I actually do every time.

chris damour • Jul 10 '23 • Edited

from my opponent

its not a battle man, chill out. your article isnt that strong, these comments critiques are offering u a chance to make it stronger.

Real solution for this use case is to provide API suitable for 2PC by such external services, but this requires changes at the far end which we don't control.

we have different definitions of "real", perhaps you mean ideal solution. real to me means real world and in the real world i have to play by "their" rules/implementation. and saga works well enough.

wide understanding the problem may create demand and vendors start adjusting their APIs

i can agree with this. i principal engineer for fairly large (15k employee) biz and have felt it my duty to change our RFPs to ask for EDA and 2PC capabilities, hoping that it moves the needle ever so slightly. if customers dont ask and bandaid every time with existing "rest" (99% of time its just json rpc and not restful at all) service offerings from 3rd parties then we'll never get out of this downward spiral.

overall based on your comments i think the problem with this article is poor title/intro. "X is antipattern" means don't do it. more it seems what you're trying to say is "stop allowing your circumstances to force you into X, think of it as an anti pattern and demand better solutions"

Khaled Hosseini • Jul 2 '23

Great article.

David Alexis • Jul 16 '23

I'm not sure you understand what the saga pattern is and what it solves. It has nothing to do with consensus among nodes. Hence the invalidity of your arguments. It has to do with coordinating the states of long-running business processes, where the transition between states can take milliseconds or days. "Transactions" in the sense you describe in your argument are irrelevant in this context.

Sergiy Yevtushenko • Jul 16 '23

That's correct, strictly speaking the Saga pattern has nothing to do with consensus. But it coordinates the involved nodes, and coordination in a distributed system requires consensus. You may also find it interesting to take a look at the other thread, where you can find an example.

Khosro Pakmanesh • Jul 4 '23

I read the whole article, but I didn't get your point. Now, what is the alternative to using Saga? Maybe, it was much nicer if you made your point progressively by making some examples. At least, you should have mentioned some references for extra reading.

Sergiy Yevtushenko • Jul 5 '23

The point is explicitly stated at the beginning:

If you need distributed transactions across a few microservices, most likely you incorrectly defined and separated domains.

So, there can't be any alternatives. Instead, the application should be designed using other approaches, which have no problems with handling transactions. Possible approaches are listed and discussed in the article.
