This is the first section of the multi-part series called Notes on Kafka. This series of articles is a collection of my notes and projects, shortened to cover only the core concepts.
Companies might start with a simple data distribution system - a single source exchanging data with a single destination system.
Over time, this gets more complicated as new source and destination systems are added to the mix. Additionally, integrating more and more source systems increases the load on the connections.
With the complexity of adding more and more integrations, companies faced issues like:
- Which protocol to use to transport the data
- How the data should be parsed
- How the data should be shaped
Companies have overcome these difficulties of moving data for a while through traditional approaches:
Database Replication and Log Shipping
- RDBMS to RDBMS only, works between relational databases only
- Database-specific, doesn't work across vendors
- Tight coupling, tight integration between the source and target means any change to the schema has a direct impact on replication
- Performance challenges, depending on the size of the log being shipped and the frequency of shipping
- Cumbersome to manage, difficult to maintain, and applies only to certain types of datastore
Extract, Transform, and Load (ETL)
- Useful for integrating data between multiple sources and targets
- Costly and proprietary, most of the time
- Custom applications, with lots of custom development involved
- Scalability challenges. Because of the complexity of managing custom applications, scalability becomes constrained.
Traditional Messaging
- At large scale, traditional messaging systems struggle
- Reliant on messaging broker, can be a bottleneck
- Smaller messages only, because larger messages can put severe strain on messaging brokers
- Dependent on the consumer's rate of consumption; throughput assumes messages are consumed roughly as quickly as they arrive
- Limited fault tolerance when messages are lost or processed incorrectly
Custom Middleware Magic
- This is where your code needs to have intimate knowledge of the data stores
- Increasingly complex, you'll most likely be dealing with distributed coordination logic, error handling, and multi-phase commits
- With every change in the application, you'll have to repeatedly revisit the middleware code
- Consistency concerns, challenges in keeping data consistent, especially when handling multiple sources of data arriving at different velocities
- Potentially expensive, since you'll have to write your own middleware and maintain the entire codebase
Because there is a constant need to move data around cleanly, reliably, quickly, and autonomously, LinkedIn came up with a better way in 2010 to deal with data without slowing down or limiting the system: Kafka.
Apache Kafka does this by decoupling data streams from systems. Once your data is in Kafka, you can route it to a database, an analytics system, or an audit pipeline.
Apache Kafka is all about DATA - getting large amounts of it from one place to another in a manner that is rapid, scalable, and reliable.
LinkedIn got their inspiration from the adjective Kafkaesque, which in turn derives from the German-language writer Franz Kafka, whose writings are freakishly surreal.
LinkedIn's data infrastructure at the time was so nightmarish - in a way, Kafkaesque - that they named the solution to the situation Kafka.
Being a messaging system, Kafka was conceived by LinkedIn with the following goals:
- High throughput - handle data in terabytes and beyond
- Highly scalable - scale out by adding machines that share the load
- Reliable and durable - transmit data with no loss in case of failure
- Replicable - messages can be replicated across clusters, supporting multiple subscribers
- Loosely coupled - a failure in one application should not cascade to others
- Flexible publish-subscribe semantics - the producers and consumers model
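To make the publish-subscribe model concrete, here is a toy, in-memory sketch (not real Kafka and not its API) of the core idea behind it: a topic is an append-only log, and each subscriber tracks its own read offset, so consumers stay decoupled from producers and from each other.

```python
class Topic:
    """Toy stand-in for a single Kafka topic partition: an append-only log."""

    def __init__(self):
        self._log = []

    def produce(self, message):
        # Producers only ever append; they know nothing about consumers.
        self._log.append(message)

    def consume(self, offset):
        """Return (messages, next_offset) from a consumer-owned offset."""
        return self._log[offset:], len(self._log)


activity = Topic()
activity.produce({"user": "alice", "action": "login"})
activity.produce({"user": "bob", "action": "click"})

# Two independent subscribers read the same log at their own pace:
msgs_a, offset_a = activity.consume(0)  # one consumer reads everything
msgs_b, offset_b = activity.consume(1)  # another started later, gets 1 message

print(len(msgs_a), len(msgs_b))  # 2 1
```

Because messages stay in the log rather than being handed to a single receiver, adding a new subscriber never disturbs the producer or the existing consumers - which is exactly the loose coupling described above.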
Kafka has been adopted by many companies since its inception in 2009, LinkedIn's launch of the solution in 2010, and its open-sourcing through the Apache Software Foundation, where it became a top-level project in 2012.
- Applications can produce messages without needing to be concerned about the message format
- Built-in partitioning
- Fault-tolerance and Durability
- Kafka was originally designed at LinkedIn to track user activity
- Site activities are published with one topic per activity type
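As an illustration of "one topic per activity type" (the topic names and events below are made up, and a plain dict stands in for the cluster's topics), the routing might look like:

```python
from collections import defaultdict

topics = defaultdict(list)  # stand-in for a Kafka cluster's topics


def publish(event):
    # Route each event to a topic named after its activity type,
    # e.g. every "page_view" event lands on the "page_view" topic.
    topics[event["type"]].append(event)


publish({"type": "page_view", "user": "alice", "page": "/home"})
publish({"type": "search",    "user": "alice", "query": "kafka"})
publish({"type": "page_view", "user": "bob",   "page": "/docs"})

print({name: len(msgs) for name, msgs in topics.items()})
# {'page_view': 2, 'search': 1}
```

Keeping each activity type on its own topic lets downstream consumers subscribe only to the streams they care about.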
Metrics and Logging
- Collect application and system log metrics
- Accumulate data from distributed applications
- Produce centralized feeds of operational data
- Collects physical log files from servers and puts them in a central place for processing
- Abstracts details of the files by producing log/event data as a stream of messages
- Operates on data in real time, as quickly as messages are received
- Raw input data is consumed from Kafka topics, then aggregated and enriched, and can be sent to another topic for further consumption
- State changes are logged as time-sequenced records
- De-coupling of system dependencies
- Big Data Integration
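The stream-processing pattern above - consume raw records, aggregate and enrich them, publish the results to another topic - can be sketched in miniature (in-memory lists stand in for Kafka topics, and the field names are purely illustrative):

```python
# Hypothetical raw events: per-request latencies keyed by user.
raw_topic = [
    {"user": "alice", "ms": 120},
    {"user": "alice", "ms": 80},
    {"user": "bob",   "ms": 200},
]
enriched_topic = []


def process(raw, out):
    """Consume raw messages, aggregate per user, produce to an output topic."""
    totals = {}
    for msg in raw:  # consume
        totals[msg["user"]] = totals.get(msg["user"], 0) + msg["ms"]
    for user, total in totals.items():  # produce aggregated records
        out.append({"user": user, "total_ms": total})


process(raw_topic, enriched_topic)
print(enriched_topic)
# [{'user': 'alice', 'total_ms': 200}, {'user': 'bob', 'total_ms': 200}]
```

In a real deployment this role is played by a stream processor (e.g. Kafka Streams) sitting between an input topic and an output topic; the shape of the work - consume, transform, produce - is the same.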
The succeeding notes will dive deeper into Apache Kafka's architecture. If you'd like to know more, please proceed to the next note in the series.
Similarly, you can check out the following resources:
Getting Started with Apache Kafka by Ryan Plant
Apache Kafka Series - Learn Apache Kafka for Beginners v2 by Stephane Maarek
Apache Kafka A-Z with Hands on Learning by Learnkart Technology Private Limited
The Complete Apache Kafka Practical Guide by Bogdan Stashchuk
If you've enjoyed this concise and easy-to-digest article, I'd be happy to connect with you on Twitter. You can also hit the Follow button below to stay updated when there's awesome new content! 😃
As always, happy learning!