Elvis Mwangi

Apache Kafka Deep Dive

Data pipelines revolve around extracting, transforming and loading data, and they play an important role in determining the integrity of the information collected. Depending on the volume, the data sources (Foidl et al., 2023) and the rate at which data is ingested, most pipeline processes are automated so that activities such as data capture and extraction can take place simultaneously. Automation also ensures that multiple professionals, such as data analysts, can access historical information stored previously. Data pipeline tools split into cloud-based services, such as Google Cloud Dataflow and AWS Data Pipeline, and open-source frameworks, such as Apache Airflow and Apache Kafka. To decide which of these tools to employ, one first needs to classify the workload under one of the following three types of data pipeline:
Batch Processing Pipelines
Batch processing pipelines are built to handle large volumes of data processed at scheduled intervals. This is the ideal technique where real-time processing is not essential; tasks such as fetching historical data for reporting rely on data that has already been stored.

Stream Processing Pipelines
In fields that rely on multiple data sources, stream processing handles large volumes of data continuously, as the data arrives, rather than at scheduled intervals. Value-adding transformations under stream processing include filtering, aggregation, applying business logic and data enrichment. Real-life applications include machine learning models in the financial world that rely on real-time data and online fraud-detection programmes.

Hybrid Processing Pipelines
For projects that require real-time processing alongside batch workloads, a unified (hybrid) pipeline ensures scalability and data availability.

Apache Kafka: Core Concepts
Open-source frameworks such as Apache Kafka let individuals create and access publicly available data and tools while learning the concepts. Apache Kafka behaves like a hybrid processing pipeline, borrowing ideas from both batch and stream processing. At its core it is a distributed commit log (Das, 2021): an append-only, ordered sequence of records known as messages. The key concepts discussed below are producers, consumers, consumer groups, brokers, topics, partitions, replication, ZooKeeper, Kafka Streams, Kafka Connect and the Kafka cluster.

Producers
A producer is an application that sends messages using the Kafka Producer API. Based on its partitioning strategy, a producer decides which partition of a topic each event or message is written to. The strategies that determine where a producer publishes a message fall under the following (a minimal console-producer sketch follows the list):

  1. No key specified - The producer balances messages across the topic's partitions (round-robin or sticky partitioning), spreading the total number of events being sent.
  2. Key specified - The key is hashed, and consistent hashing ensures minimal redistribution of keys in a re-hashing scenario; messages containing the same key are always sent to the same partition.
  3. Partition specified - The application explicitly names the partition that will receive the message.
  4. Custom partitioning logic - The application supplies its own rules for choosing the partition to be used.
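As a quick illustration (assuming a local broker on localhost:9092 and the my-topic topic created later in this post), Kafka's built-in console producer can publish keyed messages from the command line; messages that share a key always land on the same partition:

# Publish keyed messages: each input line is read as <key>:<value>
bin/kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic my-topic \
  --property parse.key=true \
  --property key.separator=:

Typing user42:clicked_checkout, for example, sends the value clicked_checkout with the key user42; omitting the parse.key property sends unkeyed messages that are spread across partitions.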

Consumer
A consumer is an application that subscribes to a topic and receives its messages. If messages 10, 13, 16 and 17 are inserted into a topic in that order, a consumer application reads them in the same order. Kafka records each consumer's position in a partition as a committed offset (older releases stored these offsets in ZooKeeper, newer ones in an internal Kafka topic). This hedges the risk of data loss: resetting the offset position allows a consumer to go back and re-read older messages.
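For example (assuming the same local broker), the built-in console consumer can replay a topic from its earliest retained offset instead of only reading new messages:

# Read my-topic from the earliest available offset
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic my-topic \
  --from-beginning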

Consumer Group
Unlike a single consumer application (Apache Kafka Concepts, Fundamentals, and FAQs, 2024), a consumer group is one logical consumer executed by multiple physical consumers. Should a single consumer need to keep up with the volume of messages or events Kafka is providing, additional instances of it can be created within the same group. When members join or leave, partitions previously held by one consumer are reassigned to another, a process known as rebalancing. Kafka keeps track of the members of a consumer group and allocates partitions to them. To keep every consumer busy there must be at least as many partitions as consumers: more consumers than partitions leaves some consumers idle, while fewer means some consumers read from more than one partition. A deeper dive into fan-out exchanges and ordering guarantees offers more insight into Kafka's consumers.
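As a sketch (the group name my-group is just an example), starting several console consumers with the same --group value forms a consumer group, and the kafka-consumer-groups tool shows which partitions each member owns and how far behind it is:

# Start one member of the group (run the same command in several terminals to scale out)
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic my-topic \
  --group my-group

# Inspect partition assignments and lag for the group
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-group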
Kafka Topics
Acting as a named log, a topic organises and stores a stream of events; producers write to it and consumers read from it. Messages are append-only, allowing users to access historical data. A single topic can be written to by multiple producers and read by multiple consumers. To improve data availability and parallelism, a topic can be divided into multiple partitions.
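For instance, the topics already present on a local broker can be listed with:

# List every topic the broker knows about
bin/kafka-topics.sh --list --bootstrap-server localhost:9092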
Kafka Broker
A Kafka broker is a single Kafka server; the brokers in a cluster are coordinated through ZooKeeper. A broker's functions include receiving messages from producers, assigning offsets to them and committing them to the partition log. Brokers treat event data as opaque arrays of bytes, which keeps the data available and preserves each message's integrity. A producer connects to a set of initial (bootstrap) servers in the cluster, requests metadata about the cluster and, based on its partition strategy, determines which broker each message should go to. On the consuming side, the members of a consumer group coordinate with the brokers to share out the information received from producers.

Kafka and ZooKeeper running in the background
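For reference, assuming a standard Kafka download with its default configuration files, the two processes shown above can be started locally with:

# Start ZooKeeper (terminal 1)
bin/zookeeper-server-start.sh config/zookeeper.properties

# Start a Kafka broker (terminal 2)
bin/kafka-server-start.sh config/server.properties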

Kafka offsets
Each message within a partition is assigned a unique, sequential ID called an offset, which consumers use to track their progress.
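For example (reusing the hypothetical my-group consumer group from above), a group's offsets can be rewound so that it re-reads older messages:

# Preview rewinding the group to the earliest offsets; the group must have no active members.
# Replace --dry-run with --execute to apply the change.
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group my-group \
  --topic my-topic \
  --reset-offsets --to-earliest \
  --dry-run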
Kafka Cluster
A cluster is a group of broker nodes running together to provide scalability, availability and fault tolerance. One of the brokers takes up the role of controller: it assigns partitions to the other brokers, monitors for broker failures and handles administrative duties. Within each replicated partition, one replica acts as the leader and the remaining replicas act as followers. When a partition is replicated across 6 brokers, one replica is the leader and the other 5 are followers: data is written to the leader and copied by the followers. The benefit of this technique is that we do not incur data loss should a leader go down, as one of the followers takes over the role.
The following commands illustrate how partitions and replicas are laid out on a 6-node Kafka cluster.
The steps are split into three: create the topic, confirm its details, and spread its partition replicas across the brokers.
Create a topic

bin/kafka-topics.sh --create \
  --topic my-topic \
  --bootstrap-server localhost:9092 \
  --partitions 6 \
  --replication-factor 3


Confirm the topic details

bin/kafka-topics.sh --describe \
  --topic my-topic \
  --bootstrap-server localhost:9092


Define a replica assignment that spreads the 6 partitions across the 6 brokers (save it as replica-assignment.json)

{
  "version": 1,
  "partitions": [
    {"topic": "my-topic", "partition": 0, "replicas": [1,2,3]},
    {"topic": "my-topic", "partition": 1, "replicas": [2,3,4]},
    {"topic": "my-topic", "partition": 2, "replicas": [3,4,5]},
    {"topic": "my-topic", "partition": 3, "replicas": [4,5,6]},
    {"topic": "my-topic", "partition": 4, "replicas": [5,6,1]},
    {"topic": "my-topic", "partition": 5, "replicas": [6,1,2]}
  ]
}


Apply the assignment so that each partition's 3 replicas are spread across the 6 brokers

bin/kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file replica-assignment.json \
  --execute
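Once the reassignment has been started, the same JSON file can be used to check whether it has completed:

# Verify that every replica listed in the file has been moved
bin/kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file replica-assignment.json \
  --verify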


ZooKeeper
Kafka brokers coordinate through ZooKeeper, which manages and tracks the brokers, topics, partition assignments and leader elections. (Recent Kafka releases can also run without ZooKeeper by using KRaft mode.)

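As a quick check, assuming ZooKeeper is listening on its default port 2181, the broker IDs it is currently tracking can be listed with Kafka's bundled ZooKeeper shell:

# List the broker IDs registered in ZooKeeper
bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids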

Data Engineering Applications
Change Data Capture (CDC)
CDC keeps systems synchronised by tracking and recording changes as they happen. CDC feeds and logs contain record-level changes (inserts, updates and deletes), which preserve the integrity of downstream systems without reloading entire datasets. With Apache Kafka, the system leverages Kafka's streaming capabilities to propagate and process the changes occurring in source databases.

Kafka Connect is a framework for connecting Kafka with other systems so that data can be streamed in and out of Kafka. Source connectors such as Debezium are designed to perform CDC from databases such as MySQL, PostgreSQL, MongoDB, Oracle and SQL Server.
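As a rough sketch only (the property names follow the Debezium MySQL connector's documentation and vary between Debezium versions; the connector name, hostnames, credentials and table are placeholders), a source connector is typically registered through the Kafka Connect REST API:

# Register a hypothetical Debezium MySQL source connector with Kafka Connect
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "inventory-connector",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "184054",
      "topic.prefix": "inventory",
      "table.include.list": "inventory.customers",
      "schema.history.internal.kafka.bootstrap.servers": "localhost:9092",
      "schema.history.internal.kafka.topic": "schema-changes.inventory"
    }
  }'

Each change to the inventory.customers table then appears as a message on a Kafka topic prefixed with inventory, ready for downstream consumers.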
Streaming Analytics
Streaming analytics is the process of analysing data continuously rather than in batches. By wiring data systems together with tools such as Apache Flink and Kafka Streams, data engineers not only perform real-time analysis for immediate insights but also implement tasks such as fraud detection. Streaming analytics also enables users to transform data coming from connected devices.
Real-time ETL Pipelines

Real-time ETL involves designing and implementing scalable architectures that secure data quality through validation and cleansing, while retaining historical data for future data activities. Kafka's real-time streaming capabilities let data engineers continuously extract, transform and load data into warehouses using tools such as Kafka Streams or Apache Flink, with Kafka Connect used to simplify the data integration.
Real-World Production Practices
Activity Tracking
Multinational companies such as Netflix, LinkedIn and Uber use Kafka to track user activity. Globally distributed companies such as Uber have data engineers building systems that collect large volumes of data on customer behaviour, enabling them to present customers with better offers. Activity tracking also gives a company's machine learning engineers timely data for training models that predict the variables under investigation.
Capacity Planning
Because Kafka persists message data on its brokers' local disks, teams need to plan the storage capacity their projects will require. Replication factors may also be reconfigured once a team schedules the deployment of a project. Fields such as inventory handling and logistics planning may require additional storage once the business starts dealing with far more customers and orders.

References
Apache Kafka Concepts, Fundamentals, and FAQs. (2024). Confluent. https://developer.confluent.io/faq/apache-kafka/concepts/#:~:text=Kafka%20runs%20as%20a%20cluster,to%20process%20data%20in%20parallel.
Das, A. (2021, January 17). Kafka Basics and Core Concepts. Medium: inspiring brilliance. https://medium.com/inspiredbrilliance/kafka-basics-and-core-concepts-5fd7a68c3193
Foidl, H., Golendukhina, V., Ramler, R., & Felderer, M. (2023). Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers. Journal of Systems and Software, 207, 111855. https://doi.org/10.1016/j.jss.2023.111855
