rdoria1 for Booster

Posted on Oct 30, 2020

Redefining Event Sourcing

#eventsourcing #data #microservices

Event Sourcing is complex and involves multiple concepts, patterns, and architectures. At The Agile Monkeys, we have been working with it for several years and we would like to share our vision of the main concepts around Event Sourcing.

Events

Things that have happened (yes, in the past).

e.g. a customer has placed an order.

State

Something that evolves and changes over time with a finite number of possible values.

e.g. A SKU can have multiple states (in stock, out of stock, soon to be available, discontinued, etc.)

Events + State

Events = Things that have happened that changed some state.

e.g. A customer placed an order, so the order is now in the PLACED status and the inventory of the item decreased by 1.
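The relationship between an event and the state it changes can be sketched as a pure function. This is only an illustration: the `OrderPlaced` and `SystemState` shapes and the `apply` function are hypothetical names chosen for this example, not part of any library.

```typescript
// Hypothetical shapes, for illustration only.
type OrderStatus = "PLACED" | "SHIPPED";

interface OrderPlaced {
  type: "OrderPlaced";
  orderId: string;
  sku: string;
  quantity: number;
}

interface SystemState {
  orderStatus: Record<string, OrderStatus>;
  inventory: Record<string, number>;
}

// Pure function: given the current state and one event, return the next state.
// The previous state object is left untouched.
function apply(state: SystemState, event: OrderPlaced): SystemState {
  return {
    orderStatus: { ...state.orderStatus, [event.orderId]: "PLACED" },
    inventory: {
      ...state.inventory,
      [event.sku]: (state.inventory[event.sku] ?? 0) - event.quantity,
    },
  };
}

const before: SystemState = { orderStatus: {}, inventory: { "sku-1": 10 } };
const after = apply(before, {
  type: "OrderPlaced",
  orderId: "order-1",
  sku: "sku-1",
  quantity: 1,
});
// after.orderStatus["order-1"] is "PLACED" and after.inventory["sku-1"] is 9
```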

Event Storage

Permanent Storage where events are persisted.

e.g. Can be a DB (NoSQL, SQL…) or maybe object storage (S3...)

Event Sourcing Contract: 3 Main Concepts

Store the full history of events (aka state changes) in the Event Storage: of course, we mean only the events we intended to store, the ones we designed to persist.
Events are chronologically ordered: this allows us to reconstruct the state of the system at any given point in time.
Immutable events: it’s an append-only system (no updates or removals).
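The three rules above could look roughly like this in code. It is a minimal in-memory sketch, assuming a hypothetical `AppendOnlyEventStore` class; a real Event Storage would sit on top of a database and enforce the same rules there.

```typescript
interface StoredEvent<T> {
  readonly sequence: number;   // position in the log, assigned by the store
  readonly occurredAt: number; // epoch milliseconds
  readonly payload: T;
}

// Hypothetical store: the only write operation is `append`.
class AppendOnlyEventStore<T> {
  private readonly log: StoredEvent<T>[] = [];

  append(payload: T, occurredAt: number): StoredEvent<T> {
    const last = this.log[this.log.length - 1];
    if (last && occurredAt < last.occurredAt) {
      throw new Error("events must be appended in chronological order");
    }
    // Object.freeze is shallow: it protects the envelope, not nested payload fields.
    const stored = Object.freeze({ sequence: this.log.length + 1, occurredAt, payload });
    this.log.push(stored);
    return stored;
  }

  // Full history up to (and including) a point in time.
  historyUntil(pointInTime: number): readonly StoredEvent<T>[] {
    return this.log.filter((e) => e.occurredAt <= pointInTime);
  }
}

// Rebuild the stock level of a SKU at any past moment by folding over the history.
const store = new AppendOnlyEventStore<{ stockDelta: number }>();
store.append({ stockDelta: 10 }, 1000);
store.append({ stockDelta: -1 }, 2000);
store.append({ stockDelta: -3 }, 3000);

const stockAt = (t: number): number =>
  store.historyUntil(t).reduce((sum, e) => sum + e.payload.stockDelta, 0);
// stockAt(2500) is 9, stockAt(3000) is 6
```

There is deliberately no `update` or `delete` method: correcting a mistake means appending a new compensating event.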

Data Removal

Because of data privacy policies (like GDPR in Europe), there are some cases where users want to delete their data. Allowing this would break the Event Sourcing Contract we just described (immutable events).

How can we support this use case?

We can encrypt the data with a dedicated key per user. When a user asks to delete their data, we delete that user’s encryption key (a technique often called crypto-shredding).

This design solves both problems:

By discarding the key we make the data unreadable, which fulfills the purpose of deleting it from the user’s perspective.
We are not opening unwanted doors by actually deleting the data (no permissions, no repo with delete methods, etc.) so we are still respecting the contract we described.
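As a rough sketch of this idea, the snippet below encrypts personal data with a per-user key using AES-256-GCM from Node.js’s built-in `crypto` module. The `KeyVault` class and the `encryptFor`/`decryptFor` helpers are hypothetical names for this example, and the in-memory key map stands in for whatever key management service a real system would use.

```typescript
import { Buffer } from "node:buffer";
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypted value as stored inside an event (all fields base64-encoded).
interface EncryptedField { iv: string; tag: string; data: string }

// Hypothetical in-memory key vault: one key per user, stored outside the events.
class KeyVault {
  private readonly keys = new Map<string, Buffer>();

  keyFor(userId: string): Buffer {
    let key = this.keys.get(userId);
    if (!key) {
      key = randomBytes(32); // 256-bit key
      this.keys.set(userId, key);
    }
    return key;
  }

  lookup(userId: string): Buffer | undefined {
    return this.keys.get(userId);
  }

  // "Deleting the user's data" means deleting only this key.
  forget(userId: string): void {
    this.keys.delete(userId);
  }
}

function encryptFor(vault: KeyVault, userId: string, plaintext: string): EncryptedField {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", vault.keyFor(userId), iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    data: data.toString("base64"),
  };
}

// Returns undefined when the key has been forgotten: the event is still there,
// but its personal data can no longer be read.
function decryptFor(vault: KeyVault, userId: string, field: EncryptedField): string | undefined {
  const key = vault.lookup(userId);
  if (!key) return undefined;
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(field.iv, "base64"));
  decipher.setAuthTag(Buffer.from(field.tag, "base64"));
  const plain = Buffer.concat([
    decipher.update(Buffer.from(field.data, "base64")),
    decipher.final(),
  ]);
  return plain.toString("utf8");
}
```

The events themselves are never touched: after `vault.forget(userId)`, the stored ciphertext remains in the log but `decryptFor` has nothing to decrypt it with.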

Event Sourcing usually works with

Domain-Driven Design
CQRS
Event-Driven Architecture
Microservices

Event sourcing excels when used together with the 4 concepts above.

Domain-Driven Design

Uses conceptual maps of the domain (called object models) that map easily to actual business concepts and incorporate both behavior and data.

Creates common concepts/vocabulary to easily communicate with other technical teams and business people.

e.g. In the e-commerce business, a must-have object model would be an order (defined by its ID, customer ID, date, status, etc.).

1. Domain Event

Captures something that happened that changed the state of a domain model.

e.g. A person changed their address: an AddressChanged event captures the change to the Address domain model.

We can then use the knowledge gained from the changes to do something useful with it.

e.g. Since the person moved to a new place, we can send them reminders to update their address for all their bills.
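The AddressChanged example might be modeled as follows. The field names, the `applyAddressChanged` function, and the `remindersFor` reaction are assumptions made for this sketch.

```typescript
// Hypothetical domain model and domain event.
interface Address {
  personId: string;
  street: string;
  city: string;
}

interface AddressChanged {
  type: "AddressChanged";
  personId: string;
  newStreet: string;
  newCity: string;
}

// The domain event changes the state of the Address domain model.
function applyAddressChanged(current: Address, event: AddressChanged): Address {
  return { ...current, street: event.newStreet, city: event.newCity };
}

// Something useful done with the knowledge that the person moved:
// one reminder per biller that still has the old address.
function remindersFor(event: AddressChanged, billers: string[]): string[] {
  return billers.map(
    (biller) => `Reminder for ${event.personId}: update your address with ${biller}`,
  );
}

const moved: AddressChanged = {
  type: "AddressChanged",
  personId: "person-1",
  newStreet: "2 New Street",
  newCity: "Las Palmas",
};
const updated = applyAddressChanged(
  { personId: "person-1", street: "1 Old Street", city: "Madrid" },
  moved,
);
const reminders = remindersFor(moved, ["electricity", "internet"]);
```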

2. Event Sourcing + DDD

Stores the full history of domain events in the Event Storage.

Event Sourcing + Event-Driven Architecture

Misconception: They are not the same!

Event-Driven Architecture: A system whose components communicate mainly or exclusively through notification events.

When talking about “event” in event sourcing we are actually referring to the state change and not a “notification” as in the event-driven architecture (queues, bus, streams, async communication in general).

domain event != notification event

Event Sourcing + Microservices

When working with microservices architecture, we need to notify multiple services about state changes (domain events) so they can react to them.

Usually, this is done with notification events (so we are using an event-driven architecture).

In this case, the domain events double as notification events.

e.g. Domain events are persisted in the Event Storage. In an event-driven architecture, we also need to notify Service A and Service B about them. Because we decided to use async communication for this purpose, we can call them notification events.
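A simplified sketch of that flow: the domain event is first persisted, then handed to the subscribers as a notification. `EventStorage`, `NotificationBus`, and `recordAndNotify` are hypothetical names, and the synchronous in-memory bus is only a stand-in for a real queue or stream.

```typescript
interface DomainEvent {
  readonly type: string;
  readonly entityId: string;
}

type Subscriber = (event: DomainEvent) => void;

// Source of truth: domain events are persisted here first.
class EventStorage {
  private readonly events: DomainEvent[] = [];
  append(event: DomainEvent): void {
    this.events.push(event);
  }
  count(): number {
    return this.events.length;
  }
}

// Stand-in for a queue/bus/stream. Delivery here is synchronous for simplicity;
// in a real system it would be asynchronous.
class NotificationBus {
  private readonly subscribers: Subscriber[] = [];
  subscribe(subscriber: Subscriber): void {
    this.subscribers.push(subscriber);
  }
  publish(event: DomainEvent): void {
    for (const subscriber of this.subscribers) subscriber(event);
  }
}

// Persist the state change, then notify interested services about it.
function recordAndNotify(storage: EventStorage, bus: NotificationBus, event: DomainEvent): void {
  storage.append(event);
  bus.publish(event);
}

const storage = new EventStorage();
const bus = new NotificationBus();
const serviceAInbox: DomainEvent[] = [];
const serviceBInbox: DomainEvent[] = [];
bus.subscribe((event) => serviceAInbox.push(event)); // Service A
bus.subscribe((event) => serviceBInbox.push(event)); // Service B

recordAndNotify(storage, bus, { type: "OrderPlaced", entityId: "order-1" });
```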

Reading Data

Don’t use your Event Storage as a Read Model.

You might be tempted to query your Event Storage to find the latest state of a given domain model.
This can be slow for large systems. It might work for very small systems, but reconstructing the full history every time you need the latest state during regular usage is not a good idea.

1. CQRS - Read Model

CQRS (Command Query Responsibility Segregation) is a pattern that requires separate classes for reading and writing data.

This allows us to have different models to read and write data.
This also means the repositories (and therefore the DBs) can be different for reading and writing data. For instance, we can write to a relational DB (the writing part is the Event storage) and then read from a NoSQL DB.

CQRS generally has 2 main concepts: commands (write) and queries (read).

A command can return a value, but the returned value will most likely be an operation confirmation or an identifier, not the resulting state. The operation is expected to be asynchronous: a domain event is put into a queue (or stream, etc.) and processed at a later stage.
A query does not change any state; its only goal is to read.

This separation makes the following scenario possible.

Let’s say we decide our Event Storage will be a NoSQL DB. Our Read Models will live in our services as separate DBs, and they can be relational DBs.
The Event Storage remains our source of truth, but each Read Model is designed to render the data the way its service needs it. That’s its only goal.
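The write/read split could be sketched like this. Everything here is illustrative: `placeOrder` plays the role of a command, `getOrdersForCustomer` the role of a query, and plain in-memory collections stand in for the two separate databases.

```typescript
interface OrderPlaced {
  type: "OrderPlaced";
  orderId: string;
  customerId: string;
}

// Write side: the Event Storage (source of truth).
const eventStorage: OrderPlaced[] = [];
let nextOrderNumber = 1;

// Command: records a domain event and returns only an identifier.
function placeOrder(customerId: string): string {
  const orderId = `order-${nextOrderNumber++}`;
  eventStorage.push({ type: "OrderPlaced", orderId, customerId });
  return orderId;
}

// Read side: a Read Model shaped for one specific question.
const ordersByCustomer = new Map<string, string[]>();
let projectedUpTo = 0;

// Processes, at a later stage, the events that have not been projected yet.
function catchUp(): void {
  for (const event of eventStorage.slice(projectedUpTo)) {
    const orders = ordersByCustomer.get(event.customerId) ?? [];
    ordersByCustomer.set(event.customerId, [...orders, event.orderId]);
  }
  projectedUpTo = eventStorage.length;
}

// Query: reads from the Read Model and never changes state.
function getOrdersForCustomer(customerId: string): string[] {
  return ordersByCustomer.get(customerId) ?? [];
}
```

Until `catchUp()` runs, a query will not see the new order yet. That gap is the eventual consistency that comes with keeping the Read Model separate from the Event Storage.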

Snapshotting

Since we are storing everything in the Event storage, one very common question is “How do we avoid reading all the data to find the latest state for a given domain model?” The solution is to create snapshots from time to time. Snapshots are actually part of the Event Storage itself. Let’s give an example to make it clearer.

One of the classic examples of event sourcing is the bank card we use to perform withdrawal and credit operations. Each operation is persisted as a domain event. To know the current balance of the account, we need to replay all the operations from the beginning. Here, snapshotting can be used to store the computed balance every 5 operations. In the worst case, we would need to read the latest snapshot plus 4 domain events to know the current balance, which decreases the number of reads significantly.
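The bank example can be sketched as below, assuming a hypothetical `Account` class that keeps both the operations and the snapshots in memory. Positive amounts are credits and negative amounts are withdrawals.

```typescript
interface Operation { sequence: number; amount: number }
interface Snapshot { sequence: number; balance: number }

const SNAPSHOT_EVERY = 5;

class Account {
  private readonly operations: Operation[] = [];
  private readonly snapshots: Snapshot[] = [];

  // Persist the operation; every 5th operation also stores a snapshot.
  record(amount: number): void {
    const sequence = this.operations.length + 1;
    this.operations.push({ sequence, amount });
    if (sequence % SNAPSHOT_EVERY === 0) {
      this.snapshots.push({ sequence, balance: this.currentBalance().balance });
    }
  }

  // Latest snapshot plus only the operations recorded after it.
  currentBalance(): { balance: number; eventsRead: number } {
    const latest = this.snapshots[this.snapshots.length - 1];
    const tail = this.operations.slice(latest ? latest.sequence : 0);
    const balance = tail.reduce((sum, op) => sum + op.amount, latest ? latest.balance : 0);
    return { balance, eventsRead: tail.length };
  }
}

const account = new Account();
for (const amount of [100, -20, -30, 50, -10, 5, 5, 5, 5]) {
  account.record(amount);
}
const result = account.currentBalance();
// result.balance is 110; result.eventsRead is 4 (the snapshot taken at operation 5 covers the rest)
```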

1. Snapshotting vs Read Model: Replay events

If Snapshotting is used to reconstruct the latest state for a given domain model, why would we use Read Models?

In a microservice architecture, they are used for different purposes.

Imagine adding a new microservice, and therefore a new Read Model DB. We need to reconstruct the latest state from the history of domain events in the Event Storage, so we’ll use the Replay Events feature to achieve this.

Replay Events is a feature of event sourcing and is a byproduct of 2 of the main concepts we defined: full history of domain events + ordered domain events. We can replay the full history of what happened and get the latest state for a given domain model.

This can be a very time-consuming operation if we are talking about millions of domain events. The fastest way to do it is to start from the snapshots and replay only the events recorded after them.

This way we can seamlessly add a new microservice. Note that snapshots serve a different purpose here: regular reads within a given microservice still go through its read model, not through the snapshots in the Event Storage.

The same applies if we decide to discard a Read Model entirely and replace it with a new one. Our source of truth remains the Event Storage, so we just need to replay all the events to build the new read model.
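Building a brand-new read model by replaying events, optionally starting from a snapshot, might look like the sketch below. The `replay` function and the account-balance read model are assumptions made for this example.

```typescript
interface DomainEvent { sequence: number; accountId: string; amount: number }

// Snapshot of all balances as of a given position in the event history.
interface Snapshot { upToSequence: number; balances: Record<string, number> }

// Builds a new read model (accountId -> balance) from the ordered history.
// With a snapshot, only the events recorded after it are applied.
function replay(events: readonly DomainEvent[], snapshot?: Snapshot): Map<string, number> {
  const seed: Record<string, number> = snapshot ? snapshot.balances : {};
  const readModel = new Map<string, number>(Object.entries(seed));
  const from = snapshot ? snapshot.upToSequence : 0;
  for (const event of events) {
    if (event.sequence <= from) continue; // already covered by the snapshot
    readModel.set(event.accountId, (readModel.get(event.accountId) ?? 0) + event.amount);
  }
  return readModel;
}

const history: DomainEvent[] = [
  { sequence: 1, accountId: "a", amount: 100 },
  { sequence: 2, accountId: "b", amount: 50 },
  { sequence: 3, accountId: "a", amount: -30 },
];

const fullReplay = replay(history);
const fromSnapshot = replay(history, { upToSequence: 2, balances: { a: 100, b: 50 } });
// Both produce the same read model: a -> 70, b -> 50
```

Both paths end in the same read model; the snapshot only reduces how many events have to be processed to get there.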

Benefits of Event-Sourcing

Data-driven business
When starting a new business or product, we just don’t know how much our data will be worth. Even if the goal is clear at that point in time, it’s a good idea to keep all the data so we can use it in the future (storing data is cheap now). Example: let’s imagine we are creating a new e-commerce site and we are in charge of the cart features and then the checkout process. In this case, what matters most is the final state of the cart before the user places the order. The fact that the user added and then removed items doesn’t affect the checkout process or the correct functioning of the website. Nevertheless, we should keep the full event history of adding and removing items from the cart. Even if it isn’t useful now, it can help us analyze customer behavior and understand why customers added and removed those items.
Audit/Logging
By reliably persisting the full history of events we will be able to debug our applications more easily by finding the full history of events for a given domain model.
Since it’s an append-only system, we never lose data. We can’t delete any data.
High security standard by design: we can’t update or remove data. If the system gets hacked, data could still be appended, meaning the current state of a given domain model would change, but the events recorded before would still be there. In any case, not losing any data is a great benefit.
Great fit for analytics: since we have the full history of everything that happened, we can analyze the past and use data to drive the business.

Disadvantages of Event-Sourcing

Mental shift for developers: there are a bunch of concepts that need to be mastered here (event sourcing, DDD, CQRS, event-driven architecture, etc.). It can be overwhelming.
It should not be used for all scenarios. Like any other pattern, it doesn’t make sense in some cases.
Unless we use the Event Storage for the read models (which is definitely not best practice and not even possible in some cases), we are introducing data redundancy: read models are only eventually consistent with the Event Storage, which can lead to inconsistencies.

This is how we work with event sourcing in Booster.

Top comments (5)

Ivan Fateev • Oct 31 '20

Thank you for the overview. The concept is quite clear.

How do you handle versioning?

I mean if business requirements have changed, and there's a need to work with the data differently, how do you update the logic?

For example, if there were an event of creating an order, and then you added some more fields to the order model, will event change as well? Or do you create a different type (version) of the event?

When you deploy a new instance of a microservice, with the updated logic, will it produce a new event, while old instances still producing old events?

It's clear that you can rebuild a read model after any change. I'm just curious how the entire process/approach looks for the Event Sourcing.

In traditional systems this is usually handled by migrations and backward compatibility with the n-1 version of the service.

Javier Toledo Booster • Nov 5 '20 • Edited

Hi @poisonousjohn , this is a fantastic question and a crucial topic with which most teams struggle while switching to an event-driven mindset. There's certainly no one-size-fits-all solution to handle versioning because depending on your stack choices, you might have different options. In my experience, I've seen two general ways to approach this, but take them with a grain of salt because when you go down to your specific implementation, things can get blurry.

For instance, some people use Apache Kafka to distribute the events among services. In this case, they rely on the Schema Registry to keep track of message schemas, and it has a schema evolution feature that allows you to "upgrade" old message versions to newer ones on the fly in some situations (you probably want to avoid changes that can't be resolved this way). In this context, a new service can consume the newest version of the schema while all other services are still emitting the old version. This is a nice article about the Schema Registry in Kafka with a much better explanation.

I've also known about teams that treated events as "immutable" structures, so when they need to "change" an event, they don't really change it. They tag the original event as "deprecated," create a brand new event and put an event handler in the middle to "translate" the events from the deprecated version to the new one (if needed). Doing this, existing services that were working with the old version can still work with the existing event stream, and the services that use the new implementation will use the new stream without looking back to the original one. Depending on the case, it could make sense to run a process to "migrate" all the events from the original event stream to the new one or delete it once you realize no service is using the deprecated version.

In any case, when you use event sourcing, you probably want to avoid having state shared by different services. Each service should have its own database and build its own projection from the events, so once you fix versioning at the event level, it should be much easier to handle changes in each service state because when things are properly designed, they own that state and can be changed without affecting other services.

At least this is what I've experienced/learned about this so far, I'd be really interested in hearing about other solutions to this challenge because I haven't found simpler/easier answers yet!

Ivan Fateev • Nov 6 '20

Great, thanks. You've just described what I had in my mind. Good to know that.

Anyways, it sounds quite complex to support, and I'm wondering about the advantages of the architecture, let's say, over the traditional approaches, like REST API + RabbitMQ to communicate with services.

This article focuses on implementation mostly, but not on the advantages.

Usually, Event Sourcing is really needed in the systems where write operations dominate over read ones. Like banking systems, where there are a lot of transactions, and you rarely ask for a statement.

I'm wondering what benefits other people find besides the ones described in the typical examples of Event Sourcing. Like real examples where this approach helped a lot.

Javier Toledo Booster • Nov 6 '20

Yes, it certainly can become complex, but the advantages really pay off in some scenarios.

To give an example close to me: in e-commerce applications, the go-to way to scale is caching the catalog, and that helps a lot to keep the site online on sales days, but there are situations when the number of concurrent orders is such that the database becomes a bottleneck. When an order is processed, a lot of transactional things happen under the hood: you need to check that you have enough stock (and prevent others from picking it while you're processing the order), that the payment processor confirms the deposit, and trigger the warehouse processes to prepare the order, etc. In most e-commerce implementations all these things happen synchronously while the user waits, and some user actions could be blocked waiting for others to finalize, to the point that users get bored and leave or some calls time out and fail, with the consequent loss of orders (and money). With event-sourcing, you have an append-only database that can ingest way more data, it's easier to shard the writes, and the checks and everything else can be done asynchronously (you can still manage transactions with sagas). This requires a mindset change in the way the whole system is designed, and for sure it's not "easier", but this way no orders are ever dropped.

The main advantage I see over inter-services REST APIs is the inversion of dependence. When you use a REST API, the sender needs to understand the API contract and implement a client to consume it. When that API changes, you need to go back and find all the services that consume that API and change their API client implementation. Otherwise, everything fails. This can become challenging when you have more than a couple of services. In event-sourcing, as I mentioned in my previous message, one service that changes the contract could live with other services that are using a previous version of the message, reducing this risk. Dependencies happen at the data level, not at the code level, and that's generally easier to deal with.

With RabbitMQ you could deal with this dependency in a similar way, but the disadvantage I see is that RabbitMQ works with a real-time push mechanism that can easily overwhelm the consumers in some situations. I've seen a couple of times that because of errors or sudden user peaks, the consumers can't consume the messages fast enough until the hard drive is full and stops accepting messages. Event-sourced databases tend to be easier to scale and the messages are ingested and persisted until you can process them, so you don't need it to happen in real-time. The consumers can go at their own pace, they don't even need to be online when events are stored, and it's easier to recover from failures because no events are dropped.

Don't get me wrong, RabbitMQ is a fantastic piece of technology, and when the system is properly designed taking into account these edge cases, it can definitely work. Indeed, there are some implementations of Event Sourcing that use RabbitMQ under the hood for notification events, so they are not really "competing" technologies.

Event-sourcing is not a silver bullet, but a way to store data that has some benefits in some situations like the one I described, but as with every other technique, I think it shines especially when you don't overuse it and combine it with other tools and techniques in a balanced way.

Florian Bischoff • Jun 3 '22

There is one big drawback to the GDPR compliance mechanism described above. Encryption and hashing algorithms may become weak over time. This happened in the not so recent past, e.g. one big social network partly used SHA1 for password hashes. In such a case, "forgotten" data can become readable again, posing a huge and hard-to-contain compliance risk.

Another approach would be to mark PII data and implement a redacting mechanism that preserves the structure of the data, while erasing/replacing its content.

This seems to break the basic contract of event sourcing. But forgetting the keys basically comes down to same thing. One could argue that data that is not readable (not decryptable) also has changed, as events do not live for their own sake, but to serve a purpose. What is the difference between changed data and data that can't be read? Which leads me to another point. The main purpose of an event is to record that something has happened to the domain. In most cases this purpose is still served, even when data gets redacted. In fact, if you "forget" the encryption key, then the event must still serve its purpose without actually accessing the data.

If you feel uncomfortable with redacting events or if it becomes a burden to track which events store PII, another approach would be to put sensitive PII data into a dedicated key-value store and store the key in the event. Ideally the key or record should contain some customerId (which itself may not be PII). This would allow you to track and delete all PII belonging to a customer. In case of a GDPR delete request, you could configure your key-value store to return a "Deleted due to Art XYZ GDPR request" in place of the deleted data.