Okibaba

Data Engineering Zoomcamp Week 6 - Streaming with Kafka

These past couple of weeks I spent some time learning about Kafka in week 6 of the Data Engineering Zoomcamp.

Apache Kafka is a distributed streaming platform that has gained immense popularity in recent years due to its ability to handle large-scale, real-time data feeds. It provides a reliable and scalable solution for building streaming data pipelines and applications.

Grokking Kafka requires getting familiar with some of its key architectural abstractions.

Kafka Architecture:
Kafka follows a publish-subscribe (pub-sub) model: producers send messages to topics, and consumers read messages from those topics.
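To make that flow concrete, here is a minimal round-trip sketch using the kafka-python client. The broker address (localhost:9092), the rides topic, and the group id are assumptions for illustration: a producer publishes a JSON message to the topic, and a consumer subscribed to that topic reads it back.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish a JSON-encoded message to the "rides" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("rides", value={"ride_id": 1, "distance_km": 3.2})
producer.flush()  # block until the broker has received the message

# Consumer side: subscribe to the same topic and read messages as they arrive.
consumer = KafkaConsumer(
    "rides",
    bootstrap_servers="localhost:9092",
    group_id="ride-readers",        # consumers sharing a group id split the work
    auto_offset_reset="earliest",   # with no committed offset, start from the beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```

With that end-to-end flow in mind, the architecture consists of the following main components: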

  1. Producers:

    • Producers are responsible for publishing messages to Kafka topics.
    • They can choose to send messages to specific partitions within a topic.
    • Producers can control partition assignment by attaching a key to each message (see the keyed-partitioning sketch after this list).
  2. Consumers:

    • Consumers are the subscribers who read messages from Kafka topics.
    • They are organized into consumer groups, identified by a unique consumer group ID.
    • Within a group, each partition is assigned to exactly one consumer, though a single consumer may read from several partitions.
  3. Topics:

    • Topics are the fundamental unit of organization in Kafka.
    • They are used to categorize and store streams of records.
    • Topics are partitioned, allowing multiple consumers to read from different partitions simultaneously.
  4. Partitions:

    • Topics are divided into partitions, which are the smallest storage units in Kafka.
    • Each partition is an ordered, immutable sequence of records.
    • Partitions enable parallel processing and horizontal scalability.
  5. Cluster:

    • Kafka runs as a cluster of one or more servers called brokers.
    • The cluster is responsible for storing and managing the topics and their partitions.
    • Kafka ensures fault tolerance and high availability through replication.
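To make the keyed-partitioning behavior concrete, here is a short sketch with the kafka-python client (the broker address and the rides topic are assumptions). Records sent with the same key are hashed to the same partition, which is what gives Kafka its per-key ordering guarantee:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

# Both records carry the same key, so both land in the same partition
# and are stored in the order they were sent.
for event in ["pickup", "dropoff"]:
    future = producer.send("rides", key="vendor-42", value=event)
    metadata = future.get(timeout=10)  # RecordMetadata for the written record
    print(f"key=vendor-42 -> partition {metadata.partition}, offset {metadata.offset}")

producer.flush()
```

Running this against a multi-partition topic prints the same partition number for both records; a different key may (but need not) map to a different partition.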

Kafka Configuration:
Kafka provides various configuration options to control its behavior and performance (a combined example follows this list):

  1. Replication Factor:

    • The replication factor determines the number of copies of each partition across the Kafka cluster.
    • It ensures fault tolerance and data durability.
    • A higher replication factor provides better reliability but increases storage overhead.
  2. Retention:

    • Retention refers to how long Kafka retains messages within a topic.
    • It can be configured based on time (e.g., retaining messages for a specific number of days) or size (e.g., retaining a certain amount of data).
    • Retention policies help manage storage space and comply with data retention requirements.
  3. Offsets:

    • Offsets represent the position of a consumer within a partition.
    • Consumers keep track of the offsets to know which messages they have already processed.
    • Kafka provides different offset management strategies, such as automatic offset commits or manual offset control.
  4. Auto Offset Reset:

    • The auto offset reset configuration determines the behavior when a consumer starts reading from a topic without a committed offset.
    • It can be set to "earliest" (start from the beginning) or "latest" (start from the most recent message).
  5. Acknowledgment (ACK):

    • Acknowledgment settings control the reliability of message delivery.
    • Producers can wait for acknowledgments from the Kafka brokers to ensure that messages are persisted.
    • The "acks" configuration allows trade-offs between latency and durability.

Conclusion:
Apache Kafka's distributed architecture, pub-sub model, and configurable options make it a powerful tool for building scalable, fault-tolerant streaming applications. Its ability to process and analyze real-time data streams efficiently explains why it is so heavily used in real-time data engineering and machine learning workflows.
