This is the first section of the multi-part series called Notes on Kafka. This series of articles is a collection of my notes and projects, shortened to cover only the core concepts.
Companies might start with a simple data distribution system - a single source exchanging data with a single destination system.
Over time, this gets more complicated as new source and destination systems are added to the mix. Additionally, integrating more and more source systems increases the load on the connections.
With the complexity of adding more and more integrations, companies faced issues like:
- Which protocol to use to transport the data
- How the data should be parsed
- How the data should be shaped
Companies have overcome these difficulties of moving data for a while through traditional approaches:
Database Replication and Log Shipping
- RDBMS to RDBMS only, works between relational databases only
- Database-specific, doesn't work across vendors
- Tight coupling, tight integration between the source and target means any change to the schema has a direct impact on replication
- Performance challenges, depending on the size of the log being shipped and the frequency of shipping
- Cumbersome to manage, difficult to maintain, and applies only to certain types of datastore
Extract, Transform, and Load (ETL)
- Useful for integrating data between multiple sources and targets
- Costly and proprietary, most of the time
- Custom applications, with lots of custom development involved
- Scalability challenges. Because of the complexity of managing custom applications, scalability becomes constrained.
Traditional Messaging
- At large scale, traditional messaging systems struggle
- Reliant on messaging broker, can be a bottleneck
- Smaller messages only, because larger messages can put severe strain on messaging brokers
- Dependent on the consumer's rate of consumption; throughput assumes messages are consumed roughly as quickly as they arrive
- Limited fault tolerance when messages are lost or processed incorrectly
Custom Middleware Magic
- This is where your code needs to have intimate knowledge of the data stores
- Increasingly complex, you'll most likely be dealing with distributed coordination logic, error handling, and multi-phase commits
- With every change in the application, you'll have to repeatedly revisit the middleware code
- Consistency concerns, challenges in keeping data consistent, especially when handling multiple sources of data arriving at different velocities
- Potentially expensive, since you'll have to write your own middleware and maintain the entire codebase
Because there is a constant need to move data around cleanly, reliably, quickly, and autonomously, LinkedIn came up with a better way in 2010 to deal with data without slowing down or limiting the system: Kafka.
Apache Kafka does this by decoupling data streams from systems. Once your data is in Kafka, you can route it to a database, an analytics system, or an audit pipeline.
Apache Kafka is all about DATA - getting large amounts of it from one place to another in a manner that is rapid, scalable, and reliable.
LinkedIn got their inspiration from the adjective Kafkaesque, which in turn derives from the German-language writer Franz Kafka, whose writings are freakishly surreal.
LinkedIn's data infrastructure at the time was so nightmarish - in a way, Kafkaesque - that they named the solution to the situation Kafka.
Being a messaging system, Kafka was conceived by LinkedIn with the following goals:
- High throughput - handle data in terabytes and beyond
- Highly scalable - scale out by adding machines that share the load
- Reliable and durable - transmit data with no loss in case of failure
- Replicable - messages can be replicated across clusters, supporting multiple subscribers
- Loosely coupled - a failure in one application should not cascade to others
- Flexible publish-subscribe semantics - the producers and consumers model
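To make the publish-subscribe model concrete, here is a toy, in-memory sketch (not real Kafka and not its API) of the core idea behind it: a topic is an append-only log, and each subscriber tracks its own read offset, so consumers stay decoupled from producers and from each other.

```python
class Topic:
    """Toy stand-in for a single Kafka topic partition: an append-only log."""

    def __init__(self):
        self._log = []

    def produce(self, message):
        # Producers only ever append; they know nothing about consumers.
        self._log.append(message)

    def consume(self, offset):
        """Return (messages, next_offset) from a consumer-owned offset."""
        return self._log[offset:], len(self._log)


activity = Topic()
activity.produce({"user": "alice", "action": "login"})
activity.produce({"user": "bob", "action": "click"})

# Two independent subscribers read the same log at their own pace:
msgs_a, offset_a = activity.consume(0)  # one consumer reads everything
msgs_b, offset_b = activity.consume(1)  # another started later, gets 1 message

print(len(msgs_a), len(msgs_b))  # 2 1
```

Because messages stay in the log rather than being handed to a single receiver, adding a new subscriber never disturbs the producer or the existing consumers - which is exactly the loose coupling described above.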
Kafka has been adopted by many companies since its inception in 2009, LinkedIn's launch of the solution in 2010, and its open-sourcing through the Apache Software Foundation, where it became a top-level project in 2012.
- Applications can produce messages without needing to be concerned about the message format
- Built-in partitioning
- Fault-tolerance and Durability
- Kafka was originally designed at LinkedIn to track user activity
- Site activities are published with one topic per activity type
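As an illustration of "one topic per activity type" (the topic names and events below are made up, and a plain dict stands in for the cluster's topics), the routing might look like:

```python
from collections import defaultdict

topics = defaultdict(list)  # stand-in for a Kafka cluster's topics


def publish(event):
    # Route each event to a topic named after its activity type,
    # e.g. every "page_view" event lands on the "page_view" topic.
    topics[event["type"]].append(event)


publish({"type": "page_view", "user": "alice", "page": "/home"})
publish({"type": "search",    "user": "alice", "query": "kafka"})
publish({"type": "page_view", "user": "bob",   "page": "/docs"})

print({name: len(msgs) for name, msgs in topics.items()})
# {'page_view': 2, 'search': 1}
```

Keeping each activity type on its own topic lets downstream consumers subscribe only to the streams they care about.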
Metrics and Logging
- Collect application and system log metrics
- Accumulate data from distributed applications
- Produce centralized feeds of operational data
- Collects physical log files from servers and puts them in a central place for processing
- Abstracts details of the files by producing log/event data as a stream of messages
- Operates on data in real time, as quickly as messages are received
- Raw input data is consumed from Kafka topics, then aggregated and enriched, and can be sent to another topic for further consumption
- State changes are logged as time-sequenced records
- De-coupling of system dependencies
- Big Data Integration
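The stream-processing pattern above - consume raw records, aggregate and enrich them, publish the results to another topic - can be sketched in miniature (in-memory lists stand in for Kafka topics, and the field names are purely illustrative):

```python
# Hypothetical raw events: per-request latencies keyed by user.
raw_topic = [
    {"user": "alice", "ms": 120},
    {"user": "alice", "ms": 80},
    {"user": "bob",   "ms": 200},
]
enriched_topic = []


def process(raw, out):
    """Consume raw messages, aggregate per user, produce to an output topic."""
    totals = {}
    for msg in raw:  # consume
        totals[msg["user"]] = totals.get(msg["user"], 0) + msg["ms"]
    for user, total in totals.items():  # produce aggregated records
        out.append({"user": user, "total_ms": total})


process(raw_topic, enriched_topic)
print(enriched_topic)
# [{'user': 'alice', 'total_ms': 200}, {'user': 'bob', 'total_ms': 200}]
```

In a real deployment this role is played by a stream processor (e.g. Kafka Streams) sitting between an input topic and an output topic; the shape of the work - consume, transform, produce - is the same.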
The succeeding notes will dive deeper into Apache Kafka's architecture. If you'd like to know more, please proceed to the next note in the series.
Similarly, you can check out the following resources:
Getting Started with Apache Kafka by Ryan Plant
Apache Kafka Series - Learn Apache Kafka for Beginners v2 by Stephane Maarek
Apache Kafka A-Z with Hands on Learning by Learnkart Technology Private Limited
The Complete Apache Kafka Practical Guide by Bogdan Stashchuk
If you've enjoyed this concise and easy-to-digest article, I'd be happy to connect with you on Twitter. You can also hit the Follow button below to stay updated when there's awesome new content! 😃
As always, happy learning!