In the last article we looked at projectors, the backbone of any CQRS/Event Driven system. This article was originally meant to be about implementing projectors, but I realised there was an important question to answer first, one that would shape the solution: "When do we project the events, now or later?". It turns out this question has far-reaching effects, so it's important we dig into it before moving on.
Immediate vs Eventual Consistency
When it comes to projectors there are two choices, immediate or eventual consistency. With immediate, events are processed by projectors as soon as they happen. With eventual, events get processed in a different process at a later time (usually a split second later).
Immediate is an all-or-nothing operation: if anything goes wrong then the entire process is halted. No events are stored and no events are processed. Eventual is a staggered operation: once the events are stored, each of the projectors will process them at a later time (and potentially fail).
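To make the difference concrete, here's a minimal sketch of the two dispatch models. All the names (`Projector`, `ImmediateDispatcher`, and so on) are illustrative, not from any real library.

```python
import queue

class Projector:
    """A trivial projector that just records the events it has handled."""
    def __init__(self, name):
        self.name = name
        self.seen = []

    def handle(self, event):
        self.seen.append(event)

class ImmediateDispatcher:
    """Store and project in one step: if any projector raises,
    nothing is stored and nothing is processed."""
    def __init__(self, projectors):
        self.projectors = projectors
        self.store = []

    def dispatch(self, event):
        for p in self.projectors:
            p.handle(event)        # an exception here aborts everything
        self.store.append(event)

class EventualDispatcher:
    """Store first; projectors drain the queue later, independently."""
    def __init__(self, projectors):
        self.projectors = projectors
        self.store = []
        self.pending = queue.Queue()

    def dispatch(self, event):
        self.store.append(event)   # storing the event always succeeds
        self.pending.put(event)

    def run_projectors(self):      # normally a background process
        while not self.pending.empty():
            event = self.pending.get()
            for p in self.projectors:
                p.handle(event)
```

With the eventual dispatcher there's a window where the event is stored but no projector has seen it yet; that window is the whole trade-off this article is about.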
Reasons to not use Immediate Consistency
From the above you may think that immediate is the obvious choice; it seems simpler and has fewer moving parts. Well, that simplicity is an illusion. It turns out immediate is far more complex, and the following questions will illustrate why.
1. What happens if one of the projectors has an error?
Say a projector fails and throws an exception, what do you do? The ideal solution is to roll back all your changes, i.e. act like it never happened. This is easy enough if you're using a single DB to store everything (transactions FTW), but if you're using multiple technologies (e.g. Redis/MySQL/MongoDB/etc.) then this problem becomes a lot harder. Do you roll back all of them? How do you manage that? How do you test it? What happens if you make two API calls that you can't roll back? Hmmm, things just got very complicated.
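Here's a hypothetical sketch of why multi-store rollback hurts. With one database a transaction covers everything; across several stores you end up writing manual compensation logic by hand, and every `undo` is one more thing that can itself fail (`FakeStore` is a made-up stand-in for Redis/MySQL/etc.).

```python
class FakeStore:
    """Stand-in for one storage technology (Redis, MySQL, Mongo...)."""
    def __init__(self):
        self.data = []

    def write(self, item):
        self.data.append(item)

    def undo(self, item):
        self.data.remove(item)

def project_immediately(event, stores, failing_index=None):
    """Write the projection to every store; on failure, manually
    compensate each store we already touched."""
    written = []
    try:
        for i, store in enumerate(stores):
            if i == failing_index:
                raise RuntimeError("projector failed")
            store.write(event)
            written.append(store)
    except RuntimeError:
        # Manual compensation: we just hope every undo succeeds,
        # because there's no cross-store transaction to lean on.
        for store in written:
            store.undo(event)
        raise
```

And that's before you consider external API calls, which usually can't be undone at all.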
2. What happens if one of the projectors has a temporary error?
Some errors are temporary. Say you have a projector that connects to an API that's rate limited.
- A user makes a request causing an event
- The projector tries to process that event
- It makes an API request, you're over the rate limit, so it fails
- Another user makes a request causing an event
- The projector tries to process that event
- It makes an API request, you're under the limit, so it goes through
How do you handle this? Do you just accept it and allow processes to fail? Do you force the user to retry the request and hope it works this time? That's not a great user experience and will definitely annoy people.
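The failure mode above can be sketched in a few lines. `RateLimitedApi` is an invented stand-in for any throttled third-party client; the point is that with immediate consistency the rate-limit error bubbles all the way up to the user's request.

```python
class RateLimitError(Exception):
    pass

class RateLimitedApi:
    """Invented stand-in for a throttled third-party API."""
    def __init__(self, limit):
        self.limit = limit
        self.calls = 0

    def send(self, event):
        self.calls += 1
        if self.calls > self.limit:
            raise RateLimitError("over the limit, try again later")

def handle_request_immediately(event, api):
    # With immediate consistency the projector runs inside the
    # request, so a rate-limit failure fails the user's request.
    api.send(event)
    return "200 OK"
```

Whether user two succeeds is down to timing, which is exactly the unpredictability you don't want in a request/response cycle.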
3. What happens if one of the projectors is slow?
Say one of the projectors performs an expensive process, like connecting to a slow external service (eg. sending an email). With immediate consistency we have to wait for it to complete before we can let the user continue. Worse, if we're using transactions to ensure data integrity across a domain (e.g. email address is unique), then you're potentially slowing down other processes, not just this one. These become bottlenecks that affect the entire system.
4. What happens when you launch a new projector in a running system?
Even if you opt for immediately consistent projectors, you'll still need some way to play historical events into projectors, otherwise you'll be unable to launch new ones. While new projectors are spinning up they are not consistent with the live system, but they will be eventually. You can architect the process so that you only release the code once all the projectors have finished (we did this, it worked really well), but even so, to do this you've had to build the core of an eventually consistent system.
5. What if you need to process events on a different service?
Ahh, the classic problem of distributed systems. Processing events within a single service can get complicated, but it's nothing compared to immediately processing events on a different service/server. The laziest solution is to force other services to process the events immediately via a synchronous call, but now you've coupled yourself to that system: if it goes down, you go down, and what do you do then? Immediate consistency becomes a lot harder (often next to impossible) once you're communicating with another service, even if it's one you control yourself.
Reasons to use Eventual Consistency
Now that we've seen the problems caused by forcing immediate consistency, let's look at how things fare when we take an eventually consistent approach.
1. What happens if one of the projectors has an error?
That's fine: if there's an error, report it, and maybe disable the projector (depending on the error). Once an event has happened, it's happened, so if one projector fails we don't need to roll back the events or the changes to other projectors. Instead we fix the projector, roll out a new release and let it catch up. Once you embrace eventual consistency, problems like this become a lot easier to handle.
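A minimal sketch of that strategy: catch the failure, record it, disable the offending projector, and let every other projector keep going. (The runner shape and names here are illustrative, not a prescribed design.)

```python
def run_projectors(events, projectors):
    """Feed events to each projector; on failure, report and disable
    that projector rather than rolling anything back.

    projectors: dict of name -> handler function.
    Returns (disabled_names, errors) so a real system could alert on them.
    """
    disabled = set()
    errors = []
    for event in events:
        for name, handle in projectors.items():
            if name in disabled:
                continue
            try:
                handle(event)
            except Exception as exc:
                errors.append((name, event, exc))
                disabled.add(name)   # fix it, redeploy, let it catch up
    return disabled, errors
```

The key property: one broken projector never blocks the event stream or its siblings.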
2. What happens if one of the projectors has a temporary error?
Again, pretty simple. We simply swallow the error and try again. We know the request will eventually get through, we just need to keep sending it. If it's a rate limiting issue, we can throttle the projector, slowing down the speed at which it's processing events.
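Since the projector runs in the background, retrying is as simple as a loop with a backoff. This is a sketch with placeholder delays and a generic exception catch; a real implementation would only retry errors it knows are temporary.

```python
import time

def process_with_retry(event, send, max_attempts=5, base_delay=0.01):
    """Keep sending the event until it goes through, backing off
    exponentially between attempts (delays here are placeholders)."""
    for attempt in range(max_attempts):
        try:
            return send(event)
        except Exception:
            # Throttle: wait longer after each failure, then retry.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

No user is waiting on this loop, so a few seconds of backoff costs nothing.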
3. What happens if one of the projectors is slow?
This isn't an issue at all. Projectors run in their own background process, so they can take as long as they want. If we find that one projector in particular is slow and is affecting others, we can simply move it into its own process and move on. Nice and easy.
4. What happens when you launch a new projector in a running system?
Not much to say here. Running a new projector is the same as running any other projector: it will play through the events until it eventually catches up. This simply isn't a problem anymore.
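Catching up is just replaying the log from position zero. Here's an illustrative sketch (the event shape and field names are made up) of a projector that tracks its position in the log:

```python
class CatchUpProjector:
    """A new projector starts at position 0 and replays the whole
    event log until its position reaches the head."""
    def __init__(self):
        self.position = 0
        self.state = {}

    def catch_up(self, event_log):
        # event_log: append-only list of (kind, key) tuples.
        while self.position < len(event_log):
            kind, key = event_log[self.position]
            self.state[key] = kind          # last event per key wins
            self.position += 1

    def is_caught_up(self, event_log):
        return self.position == len(event_log)
```

The same loop serves both bootstrapping a brand-new projector and routine processing; the only difference is how far behind the head it starts.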
5. What if you need to process events on a different service?
Yep, no issues here either. Events are consumed by other services at their own pace. The producing service doesn't need to wait for them to handle the events, so it doesn't matter if they're running slow, or even that they're running at all.
Immediately consistent views
At this point I've hopefully convinced you that eventual is better, but there's still one problem to address, one you're probably asking right now, "What happens when you need views to be immediately consistent?".
Let's take an example: say you've processed a request to add an item to a cart, and the user is redirected to the cart page. What do you do if the cart is rendered without the latest item because the cart projector is running slow? Simple: you fake it. You render the page as if the item is actually in the cart, even if the view says it isn't.
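One way to fake it server-side is to merge the items from commands you've just accepted into the (possibly stale) projected view before rendering. A minimal sketch, with invented names:

```python
def render_cart(view_items, pending_items):
    """Merge the projector's (possibly stale) view of the cart with
    items from commands we just accepted but may not have projected yet.

    view_items:    what the cart projection currently contains
    pending_items: items the user just added in this session
    """
    merged = list(view_items)
    for item in pending_items:
        if item not in merged:   # skip items the projector already caught
            merged.append(item)
    return merged
```

Once the projector catches up, the pending item appears in the view itself and the merge becomes a no-op; the user never sees the gap.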
This isn't as crazy as it sounds; in fact most apps do this all the time and you barely notice. Have you ever posted to Facebook, seen your post appear, then refreshed the page and noticed your post isn't there? They were faking it. This fakery is mostly done on the client side, and it's made even easier by libraries like Reflux. This pattern is more commonly known as an optimistic UI; here's an article on the concept.
What I'm trying to say is that it really isn't a big deal; apps do this all the time and it's very easy to implement, so there's really no reason not to do it.
Choosing Immediate or Eventual
At this stage it should be clear that there's a trade-off between the two. Immediate is easier to reason about, as it's an all-or-nothing operation, but it opens the door to lots of potential problems, especially once you move to a distributed architecture. Eventual, on the other hand, gives you more freedom and scalability, but it makes debugging harder. When deciding which to use, be sure to ask yourself how you'll handle failures. If you're using multiple storage technologies or APIs, then you should seriously consider moving to eventual consistency.
Protip: When running acceptance tests, run all your projectors as immediately consistent, this makes it easier to spot errors during tests and makes things a lot less complicated to debug.
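That protip can be as simple as a flag on your dispatcher, so tests flip projectors into synchronous mode and failures surface inside the failing test. A sketch with made-up names:

```python
class Dispatcher:
    """Dispatches events to projectors either inline (tests) or by
    queueing them for a background worker (production)."""
    def __init__(self, projectors, synchronous=False):
        self.projectors = projectors     # list of callables
        self.synchronous = synchronous
        self.queue = []

    def dispatch(self, event):
        if self.synchronous:
            # Test mode: run every projector right now, so an
            # exception fails the test that caused the event.
            for project in self.projectors:
                project(event)
        else:
            # Production: defer to the background process.
            self.queue.append(event)
```

Production config leaves `synchronous=False`; the acceptance test suite constructs the dispatcher with `synchronous=True` and nothing else changes.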
Personally, I opt for immediate consistency when dealing with domain projections (i.e. projections required to validate domain wide business constraints) and eventual consistency for everything else.
What about you, do you opt for immediate or eventual consistency? What kind of issues have you had and how have you solved them? Let me know in the comments!
Top comments (3)
Awesome as always.
Initially, we have done immediate consistency across the board. It is easiest to reason about and create initially. And I would say if it is an internal business app, it probably never needs to scale, and therefore could stop there.
But even on a single node, full consistency has some negative implications.
One is because of the coupling between events and view, deployments affect everything. When I deploy a new view, I risk also taking down the command API and vice versa.
Another is that other services interested in listening to events can become really awkward to implement. Take an emailer (that gets throttled by SES). To keep it consistent, we had to implement a fully-consistent projector to record the fact that emails need to be sent (based on events that got generated from the domain). Then we have a timed service that polls the query API every minute to see if there are emails to send. Then it calls the command side to actually send the emails and record they were sent. It's a really awkward workflow for the developer because of full consistency.
When we move over to eventual, the service can listen to events as they come in (instead of polling every minute), update its own view of what it is doing, and save events indicating emails were sent. Far more straightforward.
Another is scalability. Our load is more read-heavy. With full consistency, we would need to engage the view database's replica feature to add read capacity. This involves going deeper with (and being restricted by capabilities of) the projection DB -- settings, replication lag issues, etc. But with eventual, we don't even need to go that far. We can literally just spin up a new copy of the projector service, a new independent database for it, and add the service to a load balancer. We can also scale it down easily by doing the inverse. And basically every listener service works the same way (only with different integrations) so it is already something we know.
The really hard part of eventual consistency is accepting limitations around set validations and memberships. I've been meaning to write a post on that.
Thanks Kasey.
I've experienced the exact same issues you mention above, and once we made the switch to eventual we realised how easy it made things.
The funny thing is, you can architect your system as eventually consistent, then just force it to run as immediately consistent. Once you run into problems, switch over to eventual. It allows you to defer the problem, and it makes it easy to change when you need to. Building it as immediate, then switching to eventual, is a much more costly change, mainly because the immediate implementation is probably not event driven.
We used immediate for domain projections (mostly for set validations and memberships, as you phrased it) because we need them to guarantee domain constraints. There are other solutions though (think I discussed some in a previous article) but I'd love to see someone (maybe yourself :)) explore it further.
Immediate is nicer to reason about but quite hard these days. If you can get away with an architecture that is basically glorified LAMP with some transactional db, great, but a lot of projects are a bit more complicated than that. In architectures where you are dealing with search engines, asynchronous processing via queues, micro services, etc. you can't always wait for everything to catch up. A lot of this stuff is just fundamentally non transactional.
In my experience when building software like this with web based, and worse, mobile uis, things get quite hairy. These types of environments favor small payloads and fast requests. You don't want uis to block while a server waits for seconds to get some update through all its backend systems. At the same time you don't want to serve stale data to your users.
A good example is a user looking at a screen with search results. They 'edit' an item in some way and then go back to the search results, which are now out of date because it takes a few seconds for updates to trickle through queues, be processed, and re-indexed in Elasticsearch. So there is a brief window where the results are inconsistent with what the user just did. That's a potential problem because it may cause the user to assume the update didn't happen.
A bad solution for this: grey out the screen, update a few seconds later, and hope for the best. This creates a really bad experience because the system will be perceived as slow, when in reality it is the frontend or some server side view doing a sleep to fake consistency. The better solution is to change the UX so it doesn't create an expectation of real time consistency when that is expensive to provide, and to have network plumbing in place that ensures the UI is updated with fresh data when it becomes available. This creates a more fluid experience, and most users understand that their actions seconds ago may cause data to be updated in the UI. Use some nice animations, maybe a notification indicating success, etc. Blocking users and forcing them to wait is much worse than allowing them to continue to work, but it requires UX to be aligned with reality.
This is the reason why most UI frameworks have switched from server side MVC to client side MVC, with all the blocking stuff happening asynchronously. This is also why things like HTTP/2, WebSockets, and GraphQL have become a thing.