Last time, we introduced stream processing. To handle both batch and real-time data with a simpler, more unified infrastructure, we introduced Kafka and the streaming framework.
In this article, we will introduce a common design pattern in streaming, enrichment, and examine what benefits the streaming framework can bring.
What is enrichment? Briefly, it is the practice of extending original events with additional data to meet new feature requirements.
Feature Description
Similar features are available on many social media platforms.
- Who visits me
- Others are also viewing
- What is the identity of those who view me
The first feature is how many people have viewed my profile in a given time period, for example, how many times my profile has been viewed in a week.
The second feature is an advanced version of the first: who else did these people view? For example, 10 people viewed my profile in a week, and they also viewed Elon Musk.
The third feature is a derivative of the first: what is the identity of the people who viewed me? For example, 10 people viewed my profile in a week, and three of them were Google employees.
Who visits me & Others are also viewing
The first and second features can be easily implemented using event streaming techniques. First, let's define a "view" event.
```
{
  eventType: "PageView",
  timestamp: 1660270009,
  viewerId: 1234,
  viewedId: 5678
}
```
This event includes the time of occurrence and who viewed whom. Then we just need to create a stream that collects all view events to analyze the features we need.
In the above diagram, suppose my user id is 4567. Then, from the event stream, it is easy to find the two users who have viewed me: 2345 and 1234.
Furthermore, from the stream records we can also see that 2345 has viewed 1234, while 1234 has viewed 5678 and 2345.
Putting this information together, we can answer:
- Who visits me? 2345 and 1234.
- Others are also viewing? 1234, 5678, and 2345.
With all the event streams recorded by the streaming framework, we can perform a simple analysis to get the desired features.
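To make this concrete, here is a minimal sketch in plain Python, not tied to any particular streaming framework; the sample events are made up to match the relationships described above.

```python
from collections import defaultdict

# Hypothetical PageView events, matching the relationships above:
# 2345 and 1234 viewed me (4567); 2345 also viewed 1234; 1234 also viewed 5678 and 2345.
events = [
    {"eventType": "PageView", "timestamp": 1660270001, "viewerId": 2345, "viewedId": 4567},
    {"eventType": "PageView", "timestamp": 1660270003, "viewerId": 1234, "viewedId": 4567},
    {"eventType": "PageView", "timestamp": 1660270005, "viewerId": 2345, "viewedId": 1234},
    {"eventType": "PageView", "timestamp": 1660270007, "viewerId": 1234, "viewedId": 5678},
    {"eventType": "PageView", "timestamp": 1660270009, "viewerId": 1234, "viewedId": 2345},
]

viewers_of = defaultdict(set)  # viewedId -> set of viewerIds
viewed_by = defaultdict(set)   # viewerId -> set of viewedIds
for e in events:
    viewers_of[e["viewedId"]].add(e["viewerId"])
    viewed_by[e["viewerId"]].add(e["viewedId"])

me = 4567
who_visits_me = viewers_of[me]

# "Others are also viewing": everything my viewers viewed, excluding me.
others_also_viewing = set()
for viewer in who_visits_me:
    others_also_viewing |= viewed_by[viewer] - {me}

print("Who visits me:", who_visits_me)                  # 2345 and 1234
print("Others are also viewing:", others_also_viewing)  # 1234, 5678, and 2345
```

In a real pipeline, the same grouping logic would run continuously over a windowed stream rather than a fixed list.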
What is the identity of those who view me?
However, knowing the identity of those who view my profile is not so simple.
Our original event does not define any additional attributes, only the ids, so it is necessary to modify the original implementation to support the new feature.
Add metadata to events
Let's add the viewer's job to the original event. When a user views a profile, in addition to sending the event with the ids, the job must also be attached.
```
{
  eventType: "PageView",
  timestamp: 1660270009,
  viewerId: 1234,
  viewedId: 5678,
  viewerJob: "Google"
}
```
The problem seems to be solved, but is it really so?
There are several obvious issues with such a solution. Firstly, modifying the event format directly affects all downstream consumers. Under an event-driven architecture, the producer would not know the consumer's use cases because the two are decoupled.
Secondly, events become larger: every original event must carry this new field, whether the consumer needs it or not. In other words, the overhead grows, not just storage overhead but also transmission overhead.
Moreover, what do we do when we want to add new features? In addition to the job, a new feature may require the title, the age, and so on. Then the issues mentioned above will happen again and again.
Therefore, we know this is not a good approach.
Query from external datastores
Since modifying the original event is not a good approach, we will query the required data from elsewhere as soon as we receive the event.
Although the diagram shows a database, it could also be a microservice or any platform that can provide the information.
Basically, this is the most typical implementation of an event-driven architecture. Workers take events from the message queue and process the data, either by getting the necessary information from other data sources or by storing results in a data store, and finally send the enriched events to the next stage.
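As a rough sketch of this pattern, the following plain Python worker enriches each PageView with a lookup against an external source; the topic name and the fetch_viewer_job helper are hypothetical stand-ins for your actual queue, datastore, and schema.

```python
import json

def fetch_viewer_job(viewer_id):
    # Stand-in for a query against an external datastore or user microservice,
    # e.g. "SELECT job FROM profiles WHERE user_id = ?" on a hypothetical schema.
    return {"1234": "Google"}.get(str(viewer_id), "Unknown")

def handle(raw_event, publish):
    """Worker callback: consume a PageView, enrich it, publish the result."""
    event = json.loads(raw_event)
    # Every event pays one round trip to the external source here,
    # which is exactly where the throughput bottleneck (and coupling) comes from.
    event["viewerJob"] = fetch_viewer_job(event["viewerId"])
    publish("enriched-page-views", json.dumps(event))

# Example usage with an in-memory "queue" and a print-based publisher.
incoming = ['{"eventType": "PageView", "timestamp": 1660270009, "viewerId": 1234, "viewedId": 5678}']
for msg in incoming:
    handle(msg, lambda topic, value: print(topic, value))
```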
This implementation is fine in a message queue architecture. However, in a streaming framework, such an implementation creates a performance bottleneck. Let's say a streaming framework has an average throughput of more than ten times that of a database. If every event had to rely on external data sources, the overall throughput would drop to 1/10.
In addition, there is another issue with such an architecture.
Suppose the worker cluster or stream processing cluster crashes; this is very likely to happen, after all, errors are everywhere. At this point, events will continue to accumulate because no consumer is able to handle them. So far, this is still the normal situation.
Once the worker cluster is repaired and all workers are back online, they will start digesting the backlog at full speed, and a spike will hit the data source and most likely bring it down.
One component's failure thus affects other services in a cascading manner. This kind of tight coupling should be avoided in system design. Therefore, such a solution is not good enough.
Merge events
If we cannot add new metadata to the event and cannot query external sources, how do we handle enrichment? This is where the streaming framework delivers its advantages.
In this example, we have chosen Apache Samza as an illustration, but in fact there are many good alternatives, such as Apache Flink.
In addition to changing the handler mentioned in the previous section to a stream processing framework, there is a new event source, ProfileEdit.
This modification event can either come from the database's Change Data Capture (CDC) or be sent by the user service, but the details are not the focus of this article.
The database from the previous section becomes state in the stream processing framework. This state can be considered a memory space local to each worker, and is therefore very efficient. In addition, the state is shared across workers, so the total capacity is much larger than a single instance's memory.
When Samza receives a ProfileEdit, it can save the latest state of each user. Once a PageView is received, it can pull the required information out of the saved state and assemble a new, enriched event.
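Conceptually, the stateful join looks like the sketch below. This is plain Python rather than Samza's actual API, and the in-memory dict merely stands in for the framework's per-worker state store; the event fields follow the earlier examples.

```python
import json

class EnrichmentTask:
    """Joins the PageView stream against the ProfileEdit stream using local state."""

    def __init__(self):
        self.profiles = {}  # stand-in for the framework's per-worker state store

    def process(self, event, emit):
        if event["eventType"] == "ProfileEdit":
            # Keep only the latest profile for each user.
            self.profiles[event["userId"]] = {"job": event["job"]}
        elif event["eventType"] == "PageView":
            # Enrich entirely from local state -- no external lookup on the hot path.
            profile = self.profiles.get(event["viewerId"], {})
            emit("enriched-page-views", {**event, "viewerJob": profile.get("job", "Unknown")})

# Example usage: the profile edit arrives first, so the later page view is enriched from state.
task = EnrichmentTask()
stream = [
    {"eventType": "ProfileEdit", "timestamp": 1660260000, "userId": 1234, "job": "Google"},
    {"eventType": "PageView", "timestamp": 1660270009, "viewerId": 1234, "viewedId": 5678},
]
for e in stream:
    task.process(e, lambda topic, enriched: print(topic, json.dumps(enriched)))
```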
This approach solves the problem of tight coupling and significantly improves processing performance. More importantly, Samza provides an exactly-once guarantee, so events are not duplicated when failures occur.
Although the state is shared by multiple instances, it is still possible to hit its upper limit. How do we solve the scalability issue?
In Kafka, a message partitioning mechanism is provided, where messages with the same key are assigned to the same partition and processed by the same consumer. In the streaming framework, the partitioning mechanism still exists and has been abstracted to a higher level: different streams can share the same key space. In other words, as long as PageView and ProfileEdit events have the same key, they will be processed by the same workers.
In this way, the state itself can be scaled horizontally.
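To illustrate co-partitioning, here is a small sketch; the partition count and the choice of keying both streams by the viewer's user id are assumptions for this example, and Kafka's real partitioner differs in detail.

```python
import zlib

NUM_PARTITIONS = 8  # assumed partition count, the same for both topics

def partition_for(key: str) -> int:
    # Deterministic key -> partition mapping (Kafka's default partitioner
    # uses a different hash, but the idea is the same).
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Key both streams by the user whose profile state we need: the viewer.
profile_edit = {"eventType": "ProfileEdit", "userId": 1234, "job": "Google"}
page_view = {"eventType": "PageView", "viewerId": 1234, "viewedId": 5678}

print(partition_for(str(profile_edit["userId"])))  # same partition...
print(partition_for(str(page_view["viewerId"])))   # ...so the same worker holds this user's state
```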
Conclusion
In this article, we introduced a common design pattern, event enrichment, an issue that both traditional message queue architectures and modern stream processing architectures encounter.
When designing a system, it is important to take robustness and performance into account in addition to the feature implementation. This is why streaming frameworks are on the rise: streaming offers the possibility of handling heavy real-time traffic, while streaming frameworks additionally provide a variety of useful tools to make development easier.
As mentioned in the previous article on streaming architecture, the streaming architecture can collapse many technology stacks into two core components: Kafka and the stream processing framework. Therefore, I recommend that every developer, even if you are not a data engineer, learn about streaming frameworks. I believe having more insight will lead to more reliable system designs.