<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: susan waweru</title>
    <description>The latest articles on DEV Community by susan waweru (@susan_waweru_a4fd6b337288).</description>
    <link>https://dev.to/susan_waweru_a4fd6b337288</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3434329%2F2af3794a-4886-472a-98ac-dfc6ecef5ce8.png</url>
      <title>DEV Community: susan waweru</title>
      <link>https://dev.to/susan_waweru_a4fd6b337288</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susan_waweru_a4fd6b337288"/>
    <language>en</language>
    <item>
      <title>Core Kafka Fundamentals for Data Engineering</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:54:55 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/core-kafka-fundamentals-for-data-engineering-41f1</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/core-kafka-fundamentals-for-data-engineering-41f1</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Apache Kafka&lt;/em&gt;&lt;/strong&gt; is an open-source distributed event streaming platform.&lt;br&gt;
Was originally developed by LinkedIn but later open-sourced under the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Event&lt;/em&gt;&lt;br&gt;
A record of something that happened in the system, e.g. a button click on a website or a row inserted into a database.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Streaming&lt;/em&gt;&lt;br&gt;
This is the continuous generation, delivery and processing of data in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event streaming&lt;/strong&gt; is the continuous capture, storage and processing of events as they happen, i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;capture data in real time in the form of streams of events&lt;/li&gt;
&lt;li&gt;store these streams of events for later retrieval&lt;/li&gt;
&lt;li&gt;process, react to the event streams in real time&lt;/li&gt;
&lt;li&gt;route the event streams to destination technologies as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Kafka Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o8dfovn3r0fftgyei16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o8dfovn3r0fftgyei16.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;1. Brokers&lt;/strong&gt;&lt;br&gt;
Brokers are Kafka servers. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store topics and partitions&lt;/li&gt;
&lt;li&gt;handle incoming and outgoing messages&lt;/li&gt;
&lt;li&gt;communicate with producers and consumers (clients)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka clusters usually have multiple brokers for scalability and fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Topics&lt;/em&gt; are where producers write events to and where consumers read events from. Topics are logical, not physical; they’re split into &lt;em&gt;partitions&lt;/em&gt; for scalability.&lt;br&gt;
An event is written into exactly one partition. Events inside a partition are stored in an ordered, immutable log (append-only).&lt;/p&gt;
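&lt;p&gt;The append-only log and its offsets can be sketched in plain Python (an illustration, not the Kafka API):&lt;/p&gt;

```python
# A partition modeled as an append-only log: each appended record
# receives the next sequential offset, and reads replay from an offset.
class Partition:
    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)
        return len(self._log) - 1  # offset assigned to the new record

    def read_from(self, offset):
        return self._log[offset:]

p = Partition()
offsets = [p.append(r) for r in ["evt-a", "evt-b", "evt-c"]]
# offsets are sequential: [0, 1, 2]; reading from 1 replays the tail
tail = p.read_from(1)
```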

&lt;p&gt;&lt;strong&gt;2. ZooKeeper vs KRaft mode&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;ZooKeeper&lt;/em&gt;&lt;br&gt;
A distributed coordination service.&lt;br&gt;
It helps manage metadata, configuration and synchronization for distributed systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeps track of all Kafka brokers &lt;/li&gt;
&lt;li&gt;Elects the controller broker responsible for partition assignments&lt;/li&gt;
&lt;li&gt;Detects and manages broker failures or restarts and decides which broker leads which partition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose you have 3 Kafka brokers and one fails: ZooKeeper notices the failure, elects a new leader for the affected partitions, and updates metadata so producers and consumers know where to connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;cons&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a separate system to manage, which adds operational complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;KRaft Mode&lt;/em&gt;&lt;br&gt;
In KRaft mode, Kafka removes ZooKeeper and manages everything internally: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brokers handle data; controllers manage metadata using the Raft protocol, replacing ZooKeeper’s coordination model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;pros&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler deployment and stronger metadata consistency&lt;/li&gt;
&lt;li&gt;easier scaling to thousands of brokers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Cluster setup and scaling&lt;/strong&gt;&lt;br&gt;
A cluster is a group of Kafka brokers working together.&lt;/p&gt;

&lt;p&gt;A cluster has:&lt;br&gt;
Multiple brokers (for load balancing and fault tolerance)&lt;br&gt;
A controller broker (which manages partition leadership)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling&lt;/strong&gt; is done by adding or removing brokers, which triggers partition rebalancing handled manually with tools like kafka-reassign-partitions.sh or automated with solutions such as Confluent Auto Data Balancer.&lt;/p&gt;

&lt;p&gt;Kafka &lt;strong&gt;scales out&lt;/strong&gt; by adding brokers and rebalancing partitions, and &lt;strong&gt;scales in&lt;/strong&gt; by removing brokers after reassigning their partitions. &lt;br&gt;
Partitioning enables parallelism, replication ensures fault tolerance, and monitoring metrics helps decide when to adjust cluster size.&lt;/p&gt;

&lt;p&gt;Setup and Scaling&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step i)&lt;/strong&gt;&lt;/em&gt; deploy more than one broker instance to increase storage and throughput&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step ii)&lt;/strong&gt;&lt;/em&gt; create topics and subdivide them into partitions. Partitions spread data across brokers for parallel processing and load balancing.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step iii)&lt;/strong&gt;&lt;/em&gt; for fault tolerance, replicate each partition across multiple brokers to ensure data availability if a broker fails.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step iv)&lt;/strong&gt;&lt;/em&gt; connect client applications, i.e. producers and consumers, to the cluster.&lt;br&gt;
Producers auto-discover brokers via bootstrap servers. Consumers scale by adding more instances in a consumer group, each taking a share of the partitions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Topics, Partitions, Offsets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cxzfw2gi1irfdap0p52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cxzfw2gi1irfdap0p52.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;&lt;br&gt;
A topic is a logical category or feed name to which events (messages/records) are published.&lt;br&gt;
All events of a similar type go into the same topic.&lt;/p&gt;

&lt;p&gt;Producers write events to a topic (to one of a topic's partitions).&lt;br&gt;
Consumers read events from those partitions, usually in the order they were appended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt;&lt;br&gt;
A partition is a subdivision of a Kafka topic. It's the actual storage unit where events are stored in an ordered, immutable log. Each topic is split into partitions for scalability and parallelism. These partitions physically reside on Kafka brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offset&lt;/strong&gt;&lt;br&gt;
The position of a record in a partition: a monotonically increasing integer assigned to each record within that partition.&lt;/p&gt;

&lt;p&gt;The producer writes messages and Kafka assigns them sequential offsets per partition.&lt;br&gt;
&lt;em&gt;Example&lt;/em&gt;: In Partition 0, records may have offsets 0, 1, 2, 3…&lt;/p&gt;

&lt;p&gt;Consumers read messages using offsets as pointers: A consumer remembers “last processed offset”.&lt;br&gt;
Offsets act as bookmarks, if a consumer crashes and restarts, Kafka knows where it left off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;Producers are client applications that publish (write) events to a Kafka topic. They can choose which partition an event goes to (using a key, or round-robin otherwise).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Writing Data into Topics&lt;/strong&gt;&lt;br&gt;
A producer sends records (messages) to a topic.&lt;br&gt;
Kafka then decides which partition within that topic the record will go to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Key-based Partitioning (Controlling Message Distribution)&lt;/strong&gt;&lt;br&gt;
Each record may include a key.&lt;br&gt;
Kafka uses the key to determine which partition the record should go to:&lt;br&gt;
If a key is provided, Kafka applies a hash function on the key and the record always goes to the same partition for the provided key.&lt;br&gt;
If no key is provided, Kafka distributes records in a round-robin fashion across partitions for load balancing.&lt;/p&gt;
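&lt;p&gt;A rough sketch of this rule (Kafka’s default partitioner actually hashes the key bytes with murmur2; &lt;code&gt;md5&lt;/code&gt; stands in for it here):&lt;/p&gt;

```python
import hashlib
import itertools

_round_robin = itertools.count()  # counter used for keyless records

def choose_partition(key, num_partitions):
    if key is not None:
        # hash the key so the same key always maps to the same partition
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    # no key: spread records round-robin for load balancing
    return next(_round_robin) % num_partitions

# the same key always lands on the same partition
same = choose_partition("user-42", 6) == choose_partition("user-42", 6)
```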

&lt;p&gt;&lt;strong&gt;c) Acknowledgment Modes (acks)&lt;/strong&gt;&lt;br&gt;
Producers can choose how much confirmation they want from Kafka before considering a write successful:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=0 (Fire and Forget)&lt;/em&gt;&lt;br&gt;
Producer does not wait for any acknowledgment.&lt;br&gt;
Fastest, but messages may be lost if the broker fails.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=1 (Leader Acknowledgment)&lt;/em&gt;&lt;br&gt;
Producer gets acknowledgment once the leader partition writes the record.&lt;br&gt;
Safer, but if the leader crashes before followers replicate, data may be lost.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=all (or acks=-1) (All Replicas Acknowledge)&lt;/em&gt;&lt;br&gt;
Producer waits until the leader and all in-sync replicas acknowledge the write.&lt;br&gt;
Safest (strong durability guarantee), but slower due to waiting for replication&lt;/p&gt;
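&lt;p&gt;As a quick reference, the three modes map to a single producer property (shown here as plain Python dicts using the property name from the Java and confluent-kafka clients):&lt;/p&gt;

```python
# The acks producer property trades latency for durability.
fire_and_forget = {"acks": "0"}    # no confirmation: fastest, may lose messages
leader_ack      = {"acks": "1"}    # leader wrote it; lost if leader dies before replication
all_in_sync     = {"acks": "all"}  # leader + all in-sync replicas wrote it: most durable
```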




&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;Consumers are client applications that subscribe to (read and process) events from a Kafka topic.&lt;br&gt;
They can read from one or more partitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Read data from topics&lt;/strong&gt;&lt;br&gt;
A consumer subscribes to one or more topics. They pull data from Kafka at their own pace&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Consumer groups (scaling and parallel consumption)&lt;/strong&gt;&lt;br&gt;
Consumers belong to a consumer group, identified by a group id. Kafka ensures that each partition is consumed by exactly one consumer in a group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Scalability&lt;/em&gt;: Multiple consumers in a group can read from different partitions in parallel.&lt;br&gt;
&lt;em&gt;Fault Tolerance&lt;/em&gt;: If one consumer fails, Kafka reassigns its partitions to other consumers in the group.&lt;/p&gt;
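&lt;p&gt;The "each partition is consumed by exactly one consumer in a group" rule can be sketched like this (a simplified round-robin assignor, not Kafka’s actual rebalancing protocol):&lt;/p&gt;

```python
def assign_partitions(partitions, consumers):
    # each partition is owned by exactly one consumer in the group;
    # the consumers split the topic's partitions among themselves
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

group = assign_partitions(partitions=[0, 1, 2, 3, 4, 5],
                          consumers=["c1", "c2", "c3"])
```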

&lt;p&gt;&lt;strong&gt;c) Offset management (automatic vs manual commits)&lt;/strong&gt;&lt;br&gt;
An offset is a number that marks a consumer’s position in a partition (like a bookmark).&lt;br&gt;
When a consumer reads messages, it must commit the offset to Kafka so it can resume from the right place if restarted.&lt;/p&gt;

&lt;p&gt;To commit offsets:&lt;br&gt;
&lt;strong&gt;1. Automatic Commits&lt;/strong&gt;&lt;br&gt;
Kafka handles committing offsets on behalf of the consumer at regular intervals. The consumer reads messages and the last consumed offset is committed automatically. It's controlled by setting &lt;code&gt;enable.auto.commit=true&lt;/code&gt; and &lt;code&gt;auto.commit.interval.ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: &lt;br&gt;
Simple, low overhead.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;: &lt;br&gt;
Risk of duplication if the consumer crashes after processing but before the commit, or data loss if the commit happens before processing finishes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Manual Commits&lt;/strong&gt;&lt;br&gt;
The consumer explicitly commits offsets after processing is done: it reads messages and, after processing each batch/record, calls commit to save the offset. It's controlled by setting &lt;code&gt;enable.auto.commit=false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: &lt;br&gt;
More control; ensures at-least-once semantics (and, combined with transactions, exactly-once).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;: &lt;br&gt;
More complex to implement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Message Delivery Semantics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a766ch1qbkisgzx1ril.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a766ch1qbkisgzx1ril.png" alt=" " width="545" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message delivery semantics&lt;/strong&gt; describe the guarantees a Kafka system provides when delivering messages to consumers, especially in the case of failures,&lt;br&gt;
i.e. how many times a consumer may receive a message when failures occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At-most-once (messages may be lost, never duplicated)&lt;/strong&gt;&lt;br&gt;
The consumer commits the offset before processing the message, so Kafka may consider the message 'done' even if the consumer hasn't finished processing it.&lt;/p&gt;

&lt;p&gt;If the consumer crashes before processing, Kafka starts from the next offset on restart, so the unprocessed message is skipped.&lt;/p&gt;

&lt;p&gt;Guarantee: Messages are delivered 0 or 1 times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At-least-once (messages are never lost, may be duplicated)&lt;/strong&gt;&lt;br&gt;
The consumer processes the message first, then commits the offset. This ensures a message is not acknowledged until it is fully handled.&lt;/p&gt;

&lt;p&gt;If the consumer crashes before committing the offset, even after it has processed the message, Kafka re-delivers that message on restart and the consumer processes it again, producing a duplicate.&lt;/p&gt;

&lt;p&gt;Guarantee: Messages are delivered 1 or more times. No message is lost, but duplicates may occur.&lt;/p&gt;
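&lt;p&gt;The difference between the two commit orders can be simulated in a toy model: a "crash" interrupts the first run, then the consumer restarts from its last committed offset:&lt;/p&gt;

```python
def deliver(log, commit_before_processing, crash_at):
    """Simulate one consumer run that crashes, then a restart."""
    processed, committed = [], 0
    for offset in range(len(log)):              # first run, interrupted by a crash
        if commit_before_processing:            # at-most-once ordering
            committed = offset + 1
            if offset == crash_at:
                break                           # crashed after commit, before processing
            processed.append(log[offset])
        else:                                   # at-least-once ordering
            processed.append(log[offset])
            if offset == crash_at:
                break                           # crashed after processing, before commit
            committed = offset + 1
    for offset in range(committed, len(log)):   # restart from the last committed offset
        processed.append(log[offset])
    return processed

log = ["m0", "m1", "m2", "m3"]
at_most_once = deliver(log, commit_before_processing=True, crash_at=2)    # "m2" is lost
at_least_once = deliver(log, commit_before_processing=False, crash_at=2)  # "m2" duplicated
```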

&lt;p&gt;&lt;strong&gt;Exactly-once (each message is delivered precisely once)&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;enable.idempotence=true&lt;/code&gt;, producers avoid duplicate writes when retrying after network or broker issues.&lt;/p&gt;

&lt;p&gt;Kafka supports transactional producers where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages sent by the producer + consumer offset commits are part of the same atomic transaction.
This ensures that either both the message and offset are committed, or neither is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer writes results and commits consumer offsets in the same transaction.&lt;br&gt;
If a crash happens mid-way, Kafka ensures nothing is partially committed.&lt;br&gt;
On restart, the consumer retries safely without duplicates.&lt;/p&gt;

&lt;p&gt;Guarantee: Each message is processed once and only once, even with retries and failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Retention Policies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retention policies&lt;/strong&gt; govern how long data is kept in Kafka topics.&lt;br&gt;
They help manage disk usage while ensuring consumers have access to the data they need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time-Based Retention&lt;/strong&gt;&lt;br&gt;
Messages are kept for a specified duration, after which they can be deleted.&lt;br&gt;
Configured using parameters like &lt;code&gt;log.retention.hours&lt;/code&gt;, &lt;code&gt;log.retention.minutes&lt;/code&gt;, or &lt;code&gt;log.retention.ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When a message’s age exceeds the configured retention period, it becomes eligible for deletion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size-based retention&lt;/strong&gt;&lt;br&gt;
This policy limits how much disk space a topic or partition can use.&lt;br&gt;
Configured using the parameter &lt;code&gt;log.retention.bytes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When the configured size threshold is reached, Kafka deletes the oldest log segments to make room for new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log compaction&lt;/strong&gt;&lt;br&gt;
This policy focuses on retaining the latest value for each key within a topic.&lt;br&gt;
Configured using parameter &lt;code&gt;cleanup.policy=compact&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If multiple records exist with the same key, Kafka removes the older versions and keeps only the most recent one.&lt;br&gt;
Unlike time/size retention, compaction does not delete all old data; it only removes outdated records per key.&lt;/p&gt;
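&lt;p&gt;Compaction’s "latest value per key" behaviour in miniature (a simplification of what the log cleaner does to old segments):&lt;/p&gt;

```python
def compact(log):
    # keep only the most recent value seen for each key
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones
    return latest

log = [("user1", "a@old.com"), ("user2", "b@x.com"), ("user1", "a@new.com")]
compacted = compact(log)
# user1's older record is gone; its latest value and user2's record remain
```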




&lt;h2&gt;
  
  
  Back pressure &amp;amp; Flow Control
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4a8nfbs3bwpti4upb5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4a8nfbs3bwpti4upb5o.png" alt=" " width="761" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes producers generate messages faster than consumers can process them. This can lead to consumer lag, memory buildup, or even crashes, so Kafka uses back pressure and flow control mechanisms to handle this situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back Pressure (Handling Slow Consumers)&lt;/strong&gt;&lt;br&gt;
With back pressure, the system signals that consumers (or brokers) are overloaded and cannot keep up with the data flow.&lt;br&gt;
If a consumer is slow, messages keep accumulating in the topic partitions.&lt;br&gt;
If lag exceeds retention limits, consumers may miss data permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Control&lt;/strong&gt;&lt;br&gt;
Flow control ensures that producers and consumers operate at a balanced pace without overwhelming brokers or clients.&lt;br&gt;
If the buffer fills up because brokers or consumers are too slow, Kafka blocks the producer for up to &lt;code&gt;max.block.ms&lt;/code&gt; until space becomes available, and throws an exception if it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Lag Monitoring&lt;/strong&gt;&lt;br&gt;
Consumer lag is the key metric for identifying slow consumers.&lt;br&gt;
Kafka exposes lag metrics via tools like the consumer group CLI and monitoring systems like Prometheus + Grafana.&lt;br&gt;
Large or growing lag indicates consumers are falling behind.&lt;/p&gt;
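&lt;p&gt;Lag itself is just arithmetic per partition (the offsets below are made-up numbers for illustration):&lt;/p&gt;

```python
# lag = latest produced offset (log end) - offset committed by the group
log_end   = {"p0": 1500, "p1": 1490, "p2": 980}
committed = {"p0": 1500, "p1": 1200, "p2": 970}

lag = {p: log_end[p] - committed[p] for p in log_end}
total_lag = sum(lag.values())  # a growing total means consumers are falling behind
```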




&lt;h2&gt;
  
  
  Serialization &amp;amp; Deserialization
&lt;/h2&gt;

&lt;p&gt;Kafka stores and transmits messages as byte arrays. &lt;br&gt;
Producers must serialize data into bytes before sending it to Kafka, and consumers must deserialize those bytes back into a usable data structure.&lt;/p&gt;
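&lt;p&gt;A minimal JSON round trip showing both sides (plain Python, no Kafka client required):&lt;/p&gt;

```python
import json

def serialize(event):
    # producer side: object -> bytes before sending to Kafka
    return json.dumps(event).encode("utf-8")

def deserialize(payload):
    # consumer side: bytes -> object after reading from Kafka
    return json.loads(payload.decode("utf-8"))

event = {"user_id": 42, "action": "click"}
payload = serialize(event)
restored = deserialize(payload)  # lossless round trip
```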

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Common Serialization Formats&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;JSON&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Pros&lt;/em&gt;: Human-readable, language-agnostic and easy debugging and integration.&lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: Larger message size, no built-in schema &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Avro&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Pros&lt;/em&gt;: Compact, binary format and schema is stored separately&lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: Requires schema management (via Schema Registry).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Protobuf (Protocol Buffers)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: Very compact and fast, Widely used in microservices &lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: More complex tooling compared to JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Schema Registry&lt;/strong&gt;&lt;br&gt;
One big challenge in Kafka is: how do producers and consumers agree on message structure over time?&lt;br&gt;
&lt;em&gt;Confluent Schema Registry&lt;/em&gt; solves this by providing a centralized repository for managing and validating the schemas of Kafka messages.&lt;br&gt;
Producers register schemas when publishing messages and consumers fetch schemas to deserialize data correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replication &amp;amp; Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;In Kafka, high availability and durability are achieved through replication, which ensures data is not lost if a broker fails, and fault tolerance, which allows the system to keep working even when components go down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;br&gt;
Remember a topic is divided into partitions and each partition can have replicas distributed across brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Replicas&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Leader replica&lt;/em&gt;&lt;br&gt;
Handles all reads/writes for a partition.&lt;br&gt;
&lt;em&gt;Follower replica&lt;/em&gt;&lt;br&gt;
Continuously fetches data from the leader to keep its log synchronized. This ensures that if the leader broker fails, a follower can take over without data loss.&lt;/p&gt;

&lt;p&gt;So the producer sends messages to the leader replica and follower replicas fetch data from the leader and stay in sync.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;br&gt;
Achieved by ensuring data survives and remains available even after broker failures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In-Sync Replicas (ISR):&lt;/em&gt;&lt;br&gt;
Replicas in the ISR are considered "in-sync" with the leader: they have successfully replicated all committed messages from the leader and are not significantly lagging.&lt;br&gt;
Only replicas within the ISR are eligible to be elected as the new leader if the current leader fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;&lt;br&gt;
If a broker hosting a leader replica fails, Kafka automatically elects a new leader from the remaining followers in the ISR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failover&lt;/strong&gt;&lt;br&gt;
Switching to a new leader when the current leader becomes unavailable. This ensures continuous operation of the Kafka cluster.&lt;/p&gt;

&lt;p&gt;Consumers and producers automatically detect the new leader via metadata updates.&lt;br&gt;
If a follower replica falls behind, it is temporarily removed from ISR.&lt;br&gt;
Once it catches up, it rejoins ISR.&lt;/p&gt;
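&lt;p&gt;The failover rule (a new leader is chosen only from the ISR) in miniature; broker ids are illustrative:&lt;/p&gt;

```python
def elect_leader(leader, isr, failed_broker):
    # leader unaffected by this failure: nothing to do
    if leader != failed_broker:
        return leader
    # promote the first surviving in-sync follower
    candidates = [b for b in isr if b != failed_broker]
    if not candidates:
        raise RuntimeError("no in-sync replica available for election")
    return candidates[0]

new_leader = elect_leader(leader=1, isr=[1, 2, 3], failed_broker=1)
```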




&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka Connect&lt;/strong&gt; is a framework for scalably and reliably streaming data between Apache Kafka and other systems (databases, cloud storage, search indexes, file systems)&lt;br&gt;
It eliminates the need for writing custom integration code by providing a standardized way to move large datasets in and out of Kafka with minimal latency.&lt;/p&gt;

&lt;p&gt;Kafka Connect includes two types of connectors:&lt;br&gt;
&lt;strong&gt;Source Connector&lt;/strong&gt;&lt;br&gt;
Reads data from external systems into Kafka.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Stream database changes (CDC) from PostgreSQL, MySQL, etc.&lt;br&gt;
Collect metrics from application servers and publish them to Kafka topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sink Connector&lt;/strong&gt;&lt;br&gt;
Writes data from Kafka into external systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Index records into Elasticsearch for search.&lt;br&gt;
Load messages into HDFS, S3, or Hadoop for offline analytics.&lt;br&gt;
Insert Kafka records into relational databases or cloud warehouses.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;connector instance&lt;/strong&gt; is a job that copies data between Kafka and another system.&lt;br&gt;
A &lt;strong&gt;connector plugin&lt;/strong&gt; contains the implementation (classes, configs) that define how the integration works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Scalable&lt;/em&gt;: Runs as a distributed service across multiple worker nodes.&lt;br&gt;
&lt;em&gt;Reliable&lt;/em&gt;: Handles faults, retries, and exactly-once semantics&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Streams
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka Streams&lt;/strong&gt; is a library that lets you process and analyze data as it flows through Kafka in real-time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Stateless vs Stateful Operations&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Stateless operations&lt;/strong&gt; don't need to remember anything from before. They process each record independently without retaining any information from previous records.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Map: transforms the value of a record&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateful operations&lt;/strong&gt; on the other hand need to remember past data to work. They require maintaining and updating state based on past records to produce results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Join: to combine records from two streams&lt;br&gt;
Aggregate: computes a single aggregated value for records grouped by key&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windowing Concepts&lt;/strong&gt;&lt;br&gt;
Windowing defines how streaming records are grouped by time so you can run stateful operations like aggregations or joins.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Types of Windows&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Tumbling Windows&lt;/strong&gt;&lt;br&gt;
Fixed-size, non-overlapping intervals.&lt;br&gt;
Each record goes into exactly one window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hopping Windows&lt;/strong&gt;&lt;br&gt;
Fixed-size, but overlapping intervals that “hop” forward in smaller steps.&lt;br&gt;
A record can belong to multiple windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Windows&lt;/strong&gt;&lt;br&gt;
Based on activity periods, not fixed time.&lt;br&gt;
A session continues while events keep coming in, and ends after a defined inactivity gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding Windows&lt;/strong&gt;&lt;br&gt;
Windows move continuously with event timestamps.&lt;br&gt;
More advanced, often used with the lower-level Processor API.&lt;/p&gt;
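&lt;p&gt;For the simplest case, tumbling windows, assigning a record to its window is a single modulo step:&lt;/p&gt;

```python
def tumbling_window_start(timestamp_ms, window_size_ms):
    # fixed-size, non-overlapping: every timestamp maps to exactly one window
    return timestamp_ms - (timestamp_ms % window_size_ms)

# with 5-second windows, events at 3s and 4s share a window; 7s starts a new one
w_3s = tumbling_window_start(3_000, 5_000)
w_4s = tumbling_window_start(4_000, 5_000)
w_7s = tumbling_window_start(7_000, 5_000)
```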




&lt;h2&gt;
  
  
  ksqlDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqwgrha7c0c24v55kp5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqwgrha7c0c24v55kp5w.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ksqlDB&lt;/strong&gt; provides a SQL-like interface to Apache Kafka, enabling real-time stream processing and analytics using familiar SQL syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features and benefits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;SQL-like Interface&lt;/em&gt;: Write queries using familiar SQL syntax just like working with a relational database&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-Time Stream Processing&lt;/em&gt;: Processes events the moment they arrive in Kafka topics&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Simplified Development&lt;/em&gt;: Eliminates the need for complex Java/Scala code by providing a higher-level SQL abstraction over Kafka Streams.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built on Kafka Streams&lt;/em&gt;: Inherits scalability, performance, and fault tolerance from the Kafka Streams library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt; include Real-time analytics &amp;amp; dashboards, Data transformation and enrichment, Fraud and anomaly detection, IoT event processing, Event-driven microservices&lt;/p&gt;




&lt;h2&gt;
  
  
  Transactions &amp;amp; Idempotence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Exactly-Once Semantics&lt;/strong&gt; (EOS) guarantees that a message is delivered and processed by the consuming application exactly one time. This ensures that even if a system component fails, the message will not be lost, and its effect on the target system will not be duplicated. &lt;/p&gt;

&lt;p&gt;EOS is achieved by combining Kafka transactions and the use of idempotency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka transactions&lt;/strong&gt; enable atomic processing of multiple write operations by grouping them into a single, indivisible unit that either succeeds entirely or fails entirely, ensuring data consistency in the event of failures. Atomicity means that all operations within the transaction are committed together, or none of them are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency&lt;/strong&gt; refers to the ability to perform an operation multiple times without causing unintended side effects or changes to the system's state beyond the initial application. It guarantees that messages sent by a producer are written to the Kafka log exactly once, even if the producer retries sending due to network issues or broker failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;enable.idempotence=true&lt;/code&gt;, the broker assigns a Producer ID (PID) and sequence numbers to each message. This lets the broker detect and drop duplicates caused by retries.&lt;br&gt;
&lt;em&gt;Guarantee&lt;/em&gt;: Messages are written once, in order, within a single producer session.&lt;br&gt;
&lt;em&gt;Limitation&lt;/em&gt;: Doesn’t prevent duplicates across producer restarts or across multiple partitions.&lt;/p&gt;
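&lt;p&gt;The PID + sequence-number check can be modeled in a few lines (a toy broker, not Kafka’s actual implementation):&lt;/p&gt;

```python
class BrokerLog:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer id -> highest sequence number accepted

    def append(self, pid, seq, message):
        # a retry re-sends an already-seen (pid, seq): drop it as a duplicate
        if self.last_seq.get(pid, -1) >= seq:
            return False
        self.last_seq[pid] = seq
        self.log.append(message)
        return True

broker = BrokerLog()
first = broker.append(pid=7, seq=0, message="order-1")
retry = broker.append(pid=7, seq=0, message="order-1")  # duplicate, dropped
nxt = broker.append(pid=7, seq=1, message="order-2")
```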




&lt;h2&gt;
  
  
  Security in Kafka
&lt;/h2&gt;

&lt;p&gt;Apache Kafka’s security model is built on several key components that work together to protect data and ensure data confidentiality, integrity, and controlled access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;:&lt;br&gt;
Verifies the identity of clients (producers/consumers) and brokers attempting to connect to the Kafka cluster.&lt;br&gt;
Authentication mechanisms:&lt;br&gt;
SASL (Simple Authentication and Security Layer) is a framework for authentication used in Kafka to verify the identity of clients and brokers.&lt;/p&gt;

&lt;p&gt;Examples of SASL mechanisms in Kafka:&lt;br&gt;
&lt;em&gt;PLAIN&lt;/em&gt;: username + password authentication&lt;br&gt;
&lt;em&gt;GSSAPI (Kerberos)&lt;/em&gt;: provides strong, centralized authentication using a Kerberos Key Distribution Center (KDC) that issues tickets to clients and services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt;:&lt;br&gt;
Authorization in Kafka is controlled by Access Control Lists (ACLs). Kafka checks authorization to determine what operations the client is allowed to perform. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Operations include&lt;/em&gt;:&lt;br&gt;
Read/Write: On topics.&lt;br&gt;
Describe/Create: On topics or consumer groups.&lt;br&gt;
Alter/Delete: On topics or consumer groups.&lt;br&gt;
Cluster-level operations: Such as managing brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encryption&lt;/strong&gt;&lt;br&gt;
Encryption ensures confidentiality and integrity of data in transit. It prevents unauthorized parties from reading or tampering with messages while they move: between clients (producers/consumers) and brokers and between brokers (inter-broker communication).&lt;/p&gt;

&lt;p&gt;Kafka uses TLS (SSL) to encrypt connections. When TLS is enabled, data is encrypted before being sent over the network, and the receiver decrypts it using trusted certificates.&lt;/p&gt;
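&lt;p&gt;As an illustrative sketch, a client properties file enabling both TLS and SASL/PLAIN might look like the following (the property names are Kafka’s standard client security settings; the paths and credentials are placeholders):&lt;/p&gt;

```properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="app-user" \
  password="app-secret";
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=truststore-secret
```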




&lt;h2&gt;
  
  
  Operations &amp;amp; Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metrics to Monitor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To effectively monitor Kafka, you should track consumer lag, under-replicated partitions (URPs), and throughput/latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Lag&lt;/strong&gt;&lt;br&gt;
Shows the difference between the last message produced to a partition and the last message consumed by a consumer group.&lt;br&gt;
High lag means consumers are falling behind, which can cause delays in latency-sensitive applications.&lt;br&gt;
Large values in &lt;code&gt;records-lag-max&lt;/code&gt; indicate a significant delay in consumption.&lt;/p&gt;
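&lt;p&gt;Conceptually, lag is just the log-end offset minus the group’s committed offset, per partition. A minimal sketch (the offset numbers are made up; real values come from Kafka’s admin/consumer APIs):&lt;/p&gt;

```python
# Consumer lag per partition: log-end offset minus the group's committed offset.
def consumer_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag} for a consumer group."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

lags = consumer_lag({0: 1200, 1: 980}, {0: 1150, 1: 980})
print(lags)  # {0: 50, 1: 0} -> partition 0 is 50 records behind
```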

&lt;p&gt;&lt;strong&gt;Broker Health &amp;amp; Under-Replicated Partitions (URP)&lt;/strong&gt;&lt;br&gt;
URPs occur when a partition’s follower replicas are not fully in sync with the leader.&lt;br&gt;
URPs signal a risk of data loss if a broker fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput and Latency&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;br&gt;
Measures rate of message processing, usually in records per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;br&gt;
Measures time from when a message arrives until it is processed (end-to-end delay).&lt;/p&gt;

&lt;p&gt;High throughput combined with low latency indicates an efficient system.&lt;/p&gt;

&lt;p&gt;High latency signals potential bottlenecks (e.g., network congestion, slow consumers, broker overload).&lt;/p&gt;
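&lt;p&gt;The two metrics can be computed from simple counts and timestamps. A toy sketch (the numbers are invented for illustration):&lt;/p&gt;

```python
def throughput(records_processed, seconds):
    """Records processed per second."""
    return records_processed / seconds

def avg_latency(arrival_times, completion_times):
    """Average end-to-end delay per record, in seconds."""
    delays = [done - arrived for arrived, done in zip(arrival_times, completion_times)]
    return sum(delays) / len(delays)

throughput(50_000, 10)               # 5000.0 records per second
avg_latency([0.0, 1.0], [0.2, 1.3])  # about 0.25 s average end-to-end delay
```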




&lt;h2&gt;
  
  
  Scaling Kafka
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scaling Kafka&lt;/strong&gt; is adjusting the Kafka cluster’s capacity so it can handle more data, higher throughput, or more clients without performance degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition Count Tuning&lt;/strong&gt;&lt;br&gt;
Increasing the number of partitions in a topic boosts parallelism, allowing more consumers in a consumer group to process data concurrently.&lt;/p&gt;

&lt;p&gt;More partitions improve throughput; however, they also add overhead in terms of open file handles, memory, and controller load.&lt;/p&gt;
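&lt;p&gt;A simplified model of why parallelism is capped by partition count (this is a toy round-robin assignment, not Kafka’s actual partition assignor):&lt;/p&gt;

```python
# Round-robin-style assignment of topic partitions to consumers in a group.
# Parallelism is capped by the partition count: extra consumers sit idle.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assign([0, 1, 2, 3], ["c1", "c2"])   # {'c1': [0, 2], 'c2': [1, 3]}
assign([0, 1], ["c1", "c2", "c3"])   # 'c3' gets no partitions
```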

&lt;p&gt;&lt;strong&gt;Adding Brokers&lt;/strong&gt;&lt;br&gt;
Adding brokers to the cluster distributes data and workload across more servers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases overall storage capacity&lt;/li&gt;
&lt;li&gt;Improves throughput by balancing load&lt;/li&gt;
&lt;li&gt;Enhances fault tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rebalancing Partitions&lt;/strong&gt;&lt;br&gt;
Rebalancing redistributes partitions across brokers to ensure even load distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Triggers&lt;/em&gt;&lt;br&gt;
New brokers are added&lt;br&gt;
Brokers are removed or fail&lt;br&gt;
Uneven data distribution is detected&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt; focuses on making message production, storage, and consumption faster and more reliable by reducing bottlenecks across the producer, broker, and consumer pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batching and Compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Batching&lt;/em&gt;:&lt;br&gt;
Kafka producers send messages in batches instead of individually, which means fewer network round trips.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compression&lt;/em&gt;:&lt;br&gt;
Compressing batches reduces network bandwidth and disk usage.&lt;/p&gt;
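&lt;p&gt;A quick way to see the effect, using Python’s standard library on a hypothetical batch of JSON events (Kafka supports gzip among other codecs; the record shape here is made up):&lt;/p&gt;

```python
import gzip
import json

# Build a batch of similar JSON records; repetitive field names compress very well.
records = [{"user_id": i, "event": "page_view", "page": "/home"} for i in range(500)]
batch = "\n".join(json.dumps(r) for r in records).encode()

compressed = gzip.compress(batch)
# The compressed batch is a small fraction of the raw size, saving network
# bandwidth and disk space for every batch the producer sends.
ratio = len(compressed) / len(batch)
```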

&lt;p&gt;&lt;strong&gt;Page Cache Usage&lt;/strong&gt;&lt;br&gt;
Kafka relies on the OS page cache for fast disk reads/writes.&lt;br&gt;
Keeping enough RAM ensures frequently accessed logs stay cached, avoiding expensive disk I/O.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefit&lt;/em&gt;: Minimizes random disk I/O, improving read and write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk &amp;amp; Network Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disk&lt;/em&gt;:&lt;br&gt;
Use fast disks (SSD preferred over HDD).&lt;br&gt;
Separate Kafka log directories from OS and application logs to avoid I/O contention.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Network&lt;/em&gt;:&lt;br&gt;
Kafka is network-intensive, so provision high-bandwidth, low-latency connections (10 Gbps+ for large clusters), tuned socket buffers, and balanced partition leaders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a powerful distributed event streaming platform designed for real-time data processing at scale. Its scalability, durability, and rich ecosystem make it a backbone for event-driven architectures and data pipelines. As organizations continue to generate and consume data at massive scale, Kafka provides the foundation to ensure that information flows reliably, securely, and with low latency.&lt;/p&gt;




</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Getting Started with Docker and Docker Compose: A Beginner’s Guide</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Wed, 27 Aug 2025 11:47:06 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/getting-started-with-docker-and-docker-compose-a-beginners-guide-31e4</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/getting-started-with-docker-and-docker-compose-a-beginners-guide-31e4</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;DOCKER&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Docker is a platform that packages applications and their dependencies into lightweight containers, ensuring they run consistently across different environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOW IT WORKS&lt;/strong&gt;&lt;br&gt;
Instead of the familiar problem of “it works on my machine,” Docker guarantees “it works everywhere.” With Docker, developers don’t need to manually install dependencies or replicate complex setups. As long as Docker is installed, they can run containers that already include the code, versions, dependencies, and everything else an application requires to run.&lt;br&gt;
Docker manages these containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pzi62k5cnnx19p2xfjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pzi62k5cnnx19p2xfjs.png" alt=" " width="800" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Containers vs Virtual Machines (VM)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each VM runs its own OS along with the application and its dependencies, making VMs larger and slower to start.&lt;br&gt;
Containers share the host OS kernel while isolating processes. Each container holds only the application and its dependencies, not a full OS, so it starts much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMAGE&lt;/strong&gt;&lt;br&gt;
A Docker image is a blueprint or template for creating containers. It packages everything an application needs to run.&lt;/p&gt;

&lt;p&gt;It has the following stored inside it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime environment (e.g., a specific Python or Java version)&lt;/li&gt;
&lt;li&gt;Required libraries and dependencies&lt;/li&gt;
&lt;li&gt;The application code&lt;/li&gt;
&lt;li&gt;Base OS (e.g., Ubuntu, Alpine, Debian)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker images are read-only and immutable: once built, they do not change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONTAINER&lt;/strong&gt;&lt;br&gt;
A container is a runnable instance of an image: when you run an image, Docker creates a container from it. Each container is an isolated process that executes the application exactly as defined in the image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DOCKER FILE&lt;/strong&gt;&lt;br&gt;
A recipe/instruction file used to build a Docker image.&lt;/p&gt;

&lt;p&gt;Contains steps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What base image to use&lt;/li&gt;
&lt;li&gt;What libraries to install&lt;/li&gt;
&lt;li&gt;How to copy code into the image&lt;/li&gt;
&lt;li&gt;What command to run when the container starts&lt;/li&gt;
&lt;/ul&gt;
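&lt;p&gt;As a minimal sketch, a Dockerfile for a hypothetical Python app (the file names &lt;code&gt;app.py&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt; are placeholders) could look like:&lt;/p&gt;

```dockerfile
# What base image to use
FROM python:3.11-slim

WORKDIR /app

# What libraries to install
COPY requirements.txt .
RUN pip install -r requirements.txt

# How to copy code into the image
COPY . .

# What command to run when the container starts
CMD ["python", "app.py"]
```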

&lt;p&gt;&lt;strong&gt;DOCKER HUB&lt;/strong&gt;&lt;br&gt;
A registry/repository where Docker images are stored and shared.&lt;br&gt;
It’s like GitHub, but for images. Official images (Python, Postgres, Nginx, Ubuntu, etc.) live here. You can also push your custom images for sharing/deployment.&lt;/p&gt;

&lt;p&gt;To install Docker on WSL 2 with Ubuntu 22.04 follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Step 1: Update system&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 2: Install required dependencies&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 3: Add Docker GPG key&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 4: Add Docker Repository&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb [signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list &amp;gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 5: Install Docker Engine&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 6: Add your user to the Docker Group&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo usermod -aG docker $USER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 7: Finally restart WSL on CMD/ Powershell.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wsl --shutdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;then reopen Ubuntu and verify the installation&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;You should see something like below&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker version 28.3.3, build 980b856
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;DOCKER COMPOSE&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Docker Compose is a tool that allows you to define, configure, and run multi-container Docker applications using a single YAML file (docker-compose.yml).&lt;/p&gt;

&lt;p&gt;Instead of starting each container manually with long docker run commands, Compose lets you describe the whole application stack in one place and bring it up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HOW IT WORKS&lt;/strong&gt;&lt;br&gt;
With Docker Compose, you use a special configuration file written in YAML (called docker-compose.yml or compose.yaml) to describe your application. In this file, you list all the services your app needs like a web server, database, or cache and how they should work together.&lt;br&gt;
Once the file is ready, you can start everything with a single command using the Compose CLI&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format of the Compose file follows a standard set of rules called the Compose Specification, which ensures your multi-container applications are defined in a consistent way.&lt;/p&gt;
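&lt;p&gt;A minimal example of such a file, assuming a hypothetical web app built from a local Dockerfile plus a Postgres database:&lt;/p&gt;

```yaml
services:
  web:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:
```

&lt;p&gt;Running &lt;code&gt;docker compose up&lt;/code&gt; in the same directory starts both services together.&lt;/p&gt;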

&lt;p&gt;&lt;strong&gt;WHY USE COMPOSE&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It simplifies multi-container app setup: Defines and manages multi-container apps in one YAML file&lt;/li&gt;
&lt;li&gt;Efficient collaboration: Shareable YAML files support smooth collaboration between developers and operations&lt;/li&gt;
&lt;li&gt;Makes environments reproducible with a single command.&lt;/li&gt;
&lt;li&gt;Great for local development, CI/CD pipelines, and testing microservices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To install Docker Compose, follow the steps at the link below&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://docs.docker.com/compose/install/" rel="noopener noreferrer"&gt;https://docs.docker.com/compose/install/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;COMMANDS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To start all the services defined in your compose.yaml file run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To stop the services run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To list all the services and their current status run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Data Engineering Concepts</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Thu, 14 Aug 2025 14:43:13 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/concepts-of-data-engineering-54l6</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/concepts-of-data-engineering-54l6</guid>
      <description>&lt;p&gt;Data engineering is the discipline of designing, building, and maintaining the systems and workflows that make data accessible, reliable, and ready for analysis. It is the behind-the-scenes backbone of modern analytics, machine learning, and business intelligence.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some foundational concepts in data engineering&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Batch vs Streaming Ingestion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are methods of getting data into a system (ingestion) for processing, analytics, or storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;What it is:&lt;/em&gt; &lt;br&gt;
Involves collecting data over a period of time, then processing it all at once in a batch.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When used:&lt;/em&gt;&lt;br&gt;
Batch ingestion is used when immediate results are not needed and when large volumes of data must be processed at once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data is stored temporarily (over a period of time)&lt;/li&gt;
&lt;li&gt;At a set time, a job is scheduled to run (to read and process the data in chunks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;&lt;br&gt;
Data warehouse: loading data into a data warehouse from operational systems daily&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;What it is:&lt;/em&gt;&lt;br&gt;
Involves collecting data in real time as it arrives.&lt;br&gt;
Data is continuously ingested and processed without waiting for a batch window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When used:&lt;/em&gt;&lt;br&gt;
Used when one needs real-time insights and fast reactions, eg, in fraud detection, real-time alerts&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data flows continuously&lt;/li&gt;
&lt;li&gt;data is processed in small increments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;&lt;br&gt;
Stock trading - the trade data in financial markets is ingested in real time&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch vs Streaming Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Batch&lt;/strong&gt;&lt;br&gt;
Processing of data: Scheduled&lt;br&gt;
Data delay: Over time (min -hrs)&lt;br&gt;
Data Volume: Large chunks of data at once&lt;br&gt;
Use case: Reports, analytics, ETL jobs&lt;br&gt;
Examples: Payroll, web logs&lt;br&gt;
Tools: SSIS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;br&gt;
Processing of data: Real-time&lt;br&gt;
Data delay: Real time (seconds to milliseconds)&lt;br&gt;
Data volume: Small data events continuously&lt;br&gt;
Use case: Real-time alerts, monitoring, dashboard&lt;br&gt;
Examples: Stock prices, IoT sensor data&lt;br&gt;
Tools: Apache Kafka, Spark Streaming&lt;/p&gt;
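&lt;p&gt;The contrast can be sketched in a few lines of Python (a toy model, not a real ingestion framework):&lt;/p&gt;

```python
def batch_ingest(accumulated_events):
    """Runs on a schedule over everything collected during the window."""
    return [e.upper() for e in accumulated_events]

def stream_ingest(event_source):
    """Handles each event individually, as soon as it arrives."""
    for event in event_source:
        yield event.upper()

batch_ingest(["a", "b", "c"])           # one pass over the whole chunk
list(stream_ingest(iter(["a", "b"])))   # each element processed on arrival
```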

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ms6qm6821esovupfzhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ms6qm6821esovupfzhk.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Change Data Capture (CDC)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Just as the name suggests, CDC is a technique used to track and capture changes in a database so those changes can be propagated to another system or used for downstream processing.&lt;br&gt;
The changes can be inserts, updates, deletes, etc. in a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detect the changes&lt;br&gt;
Data sources (usually relational databases) are monitored for changes&lt;br&gt;
Rows that have been inserted, updated, or deleted are identified&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Capture the changes&lt;br&gt;
The data changes are logged in a table or a log&lt;br&gt;
Metadata (operation type, timestamp) is included with the captured changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deliver changes&lt;br&gt;
Push or pull the changes to a target system (eg data warehouse, an analytics tool, streaming pipeline)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process changes&lt;br&gt;
Downstream systems (eg dashboards, alerts, machine learning models) update/ react based on the new data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
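&lt;p&gt;The steps above can be sketched as a small change-event applier (the event shape with &lt;code&gt;op&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;, and &lt;code&gt;ts&lt;/code&gt; fields is a made-up example, not a specific CDC tool’s format):&lt;/p&gt;

```python
def apply_changes(target, events):
    """Replay CDC events (in order) against a target key-value 'table'."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            target[ev["key"]] = ev["row"]
        elif ev["op"] == "delete":
            target.pop(ev["key"], None)
    return target

events = [
    {"op": "insert", "key": 1, "row": {"name": "Ann"}, "ts": "10:00"},
    {"op": "update", "key": 1, "row": {"name": "Anne"}, "ts": "10:05"},
    {"op": "delete", "key": 1, "ts": "10:09"},
]
apply_changes({}, events)       # {} - inserted, updated, then deleted
apply_changes({}, events[:2])   # {1: {'name': 'Anne'}}
```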

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How CDC is implemented&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Database triggers&lt;br&gt;
Timestamps&lt;br&gt;
Transaction logs&lt;br&gt;
Third-party tools/ services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why use CDC&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Efficiency: avoids full table scans to find out what has changed&lt;br&gt;
Real time replication: keeps systems in sync&lt;br&gt;
Audit trails: records what changed, when and by whom&lt;br&gt;
Data pipelines: feeds changes into data lakes, warehouses&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use cases&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data warehousing: &lt;br&gt;
Keeps data warehouse in sync with operational/ transactional dbs without reloading entire tables&lt;br&gt;
Source Systems log changes via CDC -&amp;gt; ETL/ELT pipelines extract only the changed records (new, updated, or deleted) -&amp;gt; loaded into the data warehouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microservices communication&lt;br&gt;
A monolith or microservice writes data to a shared db -&amp;gt; a CDC tool captures the changes and publishes them -&amp;gt; other services consume the change events and react&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL optimization&lt;br&gt;
Reduces processing time and system load in ETL pipelines by extracting only what changed instead of doing full table scans&lt;br&gt;
ETL jobs query CDC change tables or logs instead of full source tables -&amp;gt; extract only rows with a lastModified value or CDC log entry since the last run&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Idempotency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Idempotency means an operation can be performed multiple times without changing the result beyond the initial application, i.e., doing it once or doing it multiple times gives the same outcome.&lt;br&gt;
In short, an idempotent operation can be repeated safely: repeating it has the same effect as doing it once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Say you’re calling an API or running a DB operation.&lt;br&gt;
Without idempotency, repeating the same operation multiple times could result in duplicate records or errors.&lt;br&gt;
With idempotency, repeating the same operation does not affect the result; it behaves as if done only once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What Idempotency involves&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Unique Identifiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;track each operation performed using a key&lt;/li&gt;
&lt;li&gt;server checks if the request with the same key has already been processed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;State checks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;systems check the current state before performing an action, e.g., only update if the status is still “Pending”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotent methods: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In REST APIs, GET, PUT, and DELETE are idempotent; POST is not (but can be made idempotent using a unique key)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upserts: in databases you can prevent duplicates by using logic like “update if exists, else insert”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Where Idempotency is used mostly&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs (POST/PUT): prevent duplicate submissions&lt;/li&gt;
&lt;li&gt;Databases: prevent duplicate rows or repeated updates&lt;/li&gt;
&lt;li&gt;ETL/batch jobs: rerunning jobs shouldn’t corrupt data&lt;/li&gt;
&lt;li&gt;Distributed systems: network retries must not reapply the same action&lt;/li&gt;
&lt;li&gt;Messaging systems: reprocessing messages should not repeat side effects&lt;/li&gt;
&lt;li&gt;Payments: the backend uses the same payment_id and returns the existing result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use Cases&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
1. Payments: if the user clicks “Pay” but the network fails, without idempotency the user may be charged twice; with idempotency the backend reuses the same payment_id and returns the existing result&lt;br&gt;
2. API calls: if the same request is sent again with the same key, the server ignores it or returns the first response&lt;br&gt;
3. Database upserts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Importance&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents duplicate transactions&lt;/li&gt;
&lt;li&gt;Improves system reliability&lt;/li&gt;
&lt;li&gt;Enables safe retries in unreliable networks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. OLTP vs OLAP&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Online Transaction Processing (OLTP)&lt;/strong&gt;&lt;br&gt;
Systems optimized for managing day-to-day transactional operations. They are designed for speed, consistency and concurrency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case&lt;/em&gt;&lt;br&gt;
E-commerce systems (shopping cart, orders)&lt;br&gt;
Banking systems&lt;br&gt;
Ticket booking platforms&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
A customer places an order in an online store; the OLTP system records the purchase, updates inventory, and creates an invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Analytical Processing (OLAP)&lt;/strong&gt;&lt;br&gt;
Designed for analyzing large volumes of historical data, enabling business intelligence, reporting and decision-making&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case&lt;/em&gt;&lt;br&gt;
Sales trends&lt;br&gt;
Executive dashboards&lt;br&gt;
Profitability analysis by region or product&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Analysts want to know “what were the total sales per region in Year1 vs Year2?”&lt;br&gt;
This query runs on the OLAP system, which processes and aggregates the data quickly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLTP vs OLAP differences&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;OLTP&lt;/strong&gt;&lt;br&gt;
Purpose: Handles real-time transactions&lt;br&gt;
Data type: Operational, i.e., current data&lt;br&gt;
Operations: Insert, Update, Delete, and Read&lt;br&gt;
Users: Mainly frontline employees and systems&lt;br&gt;
Speed focus: High speed, low latency&lt;br&gt;
Database: Normalized (3NF), fewer indexes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP&lt;/strong&gt;&lt;br&gt;
Purpose: Supports complex data analysis and reporting&lt;br&gt;
Data type: Historical (aggregated, summarized) data&lt;br&gt;
Operations: Read-heavy operations, i.e., SELECTs&lt;br&gt;
Users: Mainly analysts and decision-makers&lt;br&gt;
Speed focus: Fast query performance on large datasets&lt;br&gt;
Database: Denormalized (star/snowflake schema)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3yzbeafilnvznr1gej9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3yzbeafilnvznr1gej9.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Columnar vs Row-based Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are ways databases organize data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-based storage:&lt;/strong&gt;&lt;br&gt;
Stores data row by row.&lt;br&gt;
All columns for a single row are stored together.&lt;br&gt;
Efficient for OLTP, where whole records are frequently accessed or modified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Columnar storage:&lt;/strong&gt;&lt;br&gt;
Stores data column by column.&lt;br&gt;
All values for a single column are stored together.&lt;br&gt;
Columnar storage excels at analytical (OLAP) queries that focus on specific columns across many rows/over large datasets.&lt;/p&gt;

&lt;p&gt;For example, consider the query below:&lt;br&gt;
&lt;code&gt;SELECT AVG(AGE) FROM STUDENTS;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Columnar storage works better here because it only reads the AGE column. Row-based storage, on the other hand, would have to read every row and skip over the non-AGE columns, which is inefficient.&lt;/p&gt;
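&lt;p&gt;A toy illustration of the same data in both layouts (the STUDENTS rows are made up):&lt;/p&gt;

```python
# The same STUDENTS data in row vs columnar layout.
rows = [{"id": 1, "name": "A", "age": 20}, {"id": 2, "name": "B", "age": 30}]
columns = {"id": [1, 2], "name": ["A", "B"], "age": [20, 30]}

# Row-based AVG(AGE): touch every row, skipping the non-AGE fields.
avg_row = sum(r["age"] for r in rows) / len(rows)

# Columnar AVG(AGE): read just the AGE column.
avg_col = sum(columns["age"]) / len(columns["age"])
# Both give 25.0, but columnar reads only the one column it needs.
```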

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5bn2xm9vmjm2lkblct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5bn2xm9vmjm2lkblct.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Partitioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Partitioning involves dividing a large dataset into smaller subsets called partitions. The partitions are stored and managed independently.&lt;/p&gt;

&lt;p&gt;Data partitioning is done mostly for performance and scalability.&lt;br&gt;
For instance, if you have a large dataset but only need to query a certain period, say last month, then without partitioning the engine would have to scan the entire dataset, consuming a lot of time; with partitioning, the engine skips all the irrelevant partitions and scans only those that match the query.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Improved performance&lt;/em&gt;&lt;br&gt;
ie faster query execution and data retrieval&lt;br&gt;
Because data is divided into smaller partitions, when querying, the query can target only the relevant subset hence reducing the amount of data that needs to be scanned and processed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Enhanced scalability&lt;/em&gt;&lt;br&gt;
Because partitions are distributed across multiple nodes/servers, more server resources can be added to handle increasing data volumes and demand.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Increased availability&lt;/em&gt;&lt;br&gt;
Because there are several partitions, if one partition is unavailable the other partitions can still be accessed, ensuring continued data availability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Drawback&lt;/em&gt;&lt;br&gt;
One drawback of partitioning is extra management overhead when there are too many small partitions.&lt;/p&gt;
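&lt;p&gt;Partition pruning can be sketched with data partitioned by month (a toy in-memory model; real engines prune file or table partitions the same way):&lt;/p&gt;

```python
# Events stored per month; a query for one month scans only that partition.
partitions = {
    "2025-06": [{"order_id": 1}, {"order_id": 2}],
    "2025-07": [{"order_id": 3}],
    "2025-08": [{"order_id": 4}, {"order_id": 5}],
}

def query_month(month):
    """Scan only the requested partition; the rest are never touched."""
    return partitions.get(month, [])

query_month("2025-08")  # only the 2025-08 partition is read
```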

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhojbtdftnsai6jq5mk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhojbtdftnsai6jq5mk8.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. ETL vs ELT&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both are data integration methods, i.e., ways of moving and processing data.&lt;br&gt;
They differ in that ETL (Extract, Transform, Load) transforms data before loading it into the target system, whereas ELT (Extract, Load, Transform) first loads data into the target system and then transforms it within the target.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ETL Steps&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from its source&lt;/li&gt;
&lt;li&gt;It is then transformed in a staging area&lt;/li&gt;
&lt;li&gt;The transformed data is then loaded into a data warehouse or data lake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ETL is better suited to complex transformations or cases where data quality and consistency are paramount.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ELT Steps&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from its source&lt;/li&gt;
&lt;li&gt;It is loaded directly into a data warehouse or data lake without transformation&lt;/li&gt;
&lt;li&gt;Transformation is applied within the target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ELT is more suitable for cloud-based data warehouses and data lakes.&lt;br&gt;
It is efficient for handling large volumes of data and is well suited to analytics workloads.&lt;/p&gt;
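&lt;p&gt;The difference can be sketched with plain lists standing in for the staging area and the target (a toy model, not a real pipeline tool):&lt;/p&gt;

```python
def clean(s):
    """The transformation step: trim whitespace and normalize case."""
    return s.strip().lower()

raw = [" alice ", "BOB", " carol"]

# ETL: transform in staging first, then load only the transformed data.
etl_target = [clean(r) for r in raw]

# ELT: load the raw data as-is, keep it, and transform inside the target.
elt_target_raw = list(raw)
elt_target_transformed = [clean(r) for r in elt_target_raw]

etl_target       # ['alice', 'bob', 'carol']
elt_target_raw   # the raw copy is preserved in the target
```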

&lt;p&gt;&lt;em&gt;Comparison&lt;/em&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;ETL&lt;/strong&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;ELT&lt;/strong&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Transform step&lt;/td&gt;&lt;td&gt;Before loading&lt;/td&gt;&lt;td&gt;After loading&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Target type&lt;/td&gt;&lt;td&gt;Legacy/on-prem warehouses&lt;/td&gt;&lt;td&gt;Cloud warehouses/lakes&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Raw data stored?&lt;/td&gt;&lt;td&gt;No (only transformed)&lt;/td&gt;&lt;td&gt;Yes (raw + transformed)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Speed of loading&lt;/td&gt;&lt;td&gt;Slower (pre-transform)&lt;/td&gt;&lt;td&gt;Faster (load first)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Flexibility&lt;/td&gt;&lt;td&gt;Lower&lt;/td&gt;&lt;td&gt;Higher&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Compute location&lt;/td&gt;&lt;td&gt;ETL tool/staging server&lt;/td&gt;&lt;td&gt;Data warehouse/lake&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
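&lt;p&gt;The difference in ordering can be sketched in a few lines of Python. This is a toy illustration rather than a real pipeline; the record shape and the transform rule (uppercasing names) are invented for the example.&lt;/p&gt;

```python
# Toy ETL vs ELT: same steps, different order.

def extract():
    # Pretend these rows came from a source system
    return [{"name": "alice", "amount": "10"}, {"name": "bob", "amount": "5"}]

def transform(rows):
    # Clean/standardize: uppercase names, cast amounts to int
    return [{"name": r["name"].upper(), "amount": int(r["amount"])} for r in rows]

warehouse = {}  # stands in for the target data warehouse/lake

def etl():
    # Transform happens BEFORE loading: only clean data is stored
    warehouse["etl_table"] = transform(extract())

def elt():
    # Load raw data first, then transform INSIDE the target
    warehouse["raw_table"] = extract()
    warehouse["elt_table"] = transform(warehouse["raw_table"])

etl()
elt()
print(warehouse["etl_table"] == warehouse["elt_table"])  # True - same result, different path
```

&lt;p&gt;Note that only the ELT path keeps the raw rows around in the target, which is what makes later re-transformation cheap.&lt;/p&gt;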

&lt;h2&gt;
  
  
  &lt;strong&gt;8. CAP Theorem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The CAP theorem states that in a distributed data system, it's impossible to simultaneously achieve all three of Consistency, Availability and Partition Tolerance. A distributed system must choose at most two of the three properties to prioritize&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency (C)&lt;/strong&gt;&lt;br&gt;
Every read receives the most recent write or an error. After a successful write, any subsequent read should return that updated value. The clients therefore see a single, up-to-date view of the data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
In a banking system, if you deposit money at ATM A and then immediately check the balance at ATM B, the deposit should be reflected in the new balance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability (A)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system never refuses to answer&lt;/li&gt;
&lt;li&gt;Every request receives a valid (non-error) response, but it might not be the latest data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
In the banking system, if there’s a network glitch between the data centers when checking balance after depositing, it shows the previous balance instead of the current one&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition Tolerance (P)&lt;/strong&gt;&lt;br&gt;
The system continues to operate despite network failures or delays between nodes, i.e. even if messages between parts of the system are lost or delayed&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
A distributed database runs across 3 data centers and communication to one of them breaks; partition tolerance means the remaining data centers can still serve reads/writes despite being cut off from the third.&lt;/p&gt;

&lt;p&gt;If part of the network goes down the system still runs using the available nodes/ reachable nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In real-world distributed systems, Partition Tolerance is non-negotiable because networks can and will fail. The real trade-off in CAP is therefore Consistency vs Availability during a partition:&lt;br&gt;
CP systems: prefer correctness&lt;br&gt;
AP systems: prefer uptime (may return stale data)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Windowing in Streaming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Windowing in stream processing allows data to be processed in small, manageable chunks over a specified period. It's a technique used to handle large datasets, real-time data processing, and in-memory analytics.&lt;br&gt;
The aim of windowing is to let stream processing applications break down continuous data streams into manageable chunks for processing and analysis.&lt;/p&gt;

&lt;p&gt;Data streams (e.g. sensor readings) never end, so to calculate metrics like count, sum or average over time, you can't wait for the stream to finish because it won't. Windowing lets you process data in slices of time or count, producing intermediate results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Tumbling Window&lt;/strong&gt;&lt;br&gt;
Divides data streams into fixed-size, non-overlapping time intervals&lt;br&gt;
Each event belongs to exactly one window&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
Website traffic monitoring: number of visits per minute&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Sliding Window&lt;/strong&gt;&lt;br&gt;
Creates overlapping intervals, allowing an event to be included in multiple windows&lt;br&gt;
Defined by window length and slide interval.&lt;br&gt;
They capture overlapping patterns and trends in the data stream&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
Network monitoring: track packet loss or error rate over overlapping periods&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Session Window&lt;/strong&gt;&lt;br&gt;
Groups events that occur within a specific timeframe, separated by periods of inactivity. &lt;br&gt;
The window size is not fixed but determined by the gaps between events.&lt;br&gt;
Window closes after a period of inactivity (gap)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
E-commerce analytics: track user journey from first click to checkout&lt;/p&gt;
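&lt;p&gt;As a rough sketch, a tumbling window can be implemented by bucketing event timestamps into fixed intervals; the sample events and the 60-second window size below are made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds

# (timestamp_seconds, event) pairs - an invented sample stream of page visits
events = [(3, "visit"), (42, "visit"), (61, "visit"), (119, "visit"), (125, "visit")]

counts = defaultdict(int)
for ts, _ in events:
    # Integer division maps each event to exactly one non-overlapping window
    window_start = (ts // WINDOW) * WINDOW
    counts[window_start] += 1

print(dict(counts))  # {0: 2, 60: 2, 120: 1} - visits per minute
```

&lt;p&gt;A sliding window would instead assign each event to every window whose interval covers its timestamp, and a session window would start a new bucket whenever the gap since the previous event exceeds the inactivity threshold.&lt;/p&gt;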

&lt;h2&gt;
  
  
  &lt;strong&gt;10. DAGs and Workflow Orchestration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DAGs (Directed Acyclic Graphs)&lt;/strong&gt; &lt;br&gt;
A DAG is a graph structure used to represent tasks and their dependencies in a workflow.&lt;br&gt;
DAGs provide a structured way to represent and manage complex processes.&lt;br&gt;
Directed: each edge has a direction, A -&amp;gt; B meaning B depends on A&lt;br&gt;
Acyclic: no loops or cycles&lt;br&gt;
Graph: made up of nodes (tasks) and edges (dependencies)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DAGS:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ensure tasks execute in the right order&lt;/li&gt;
&lt;li&gt;make it clear which tasks can run in parallel and which must wait&lt;/li&gt;
&lt;li&gt;prevent infinite loops in workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Extract -&amp;gt; Transform -&amp;gt; Load&lt;br&gt;
Extract runs first, Transform waits for Extract, Load runs after Transform&lt;/p&gt;
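&lt;p&gt;This ordering guarantee can be illustrated with Python's standard-library topological sorter; the three-task graph below mirrors the ETL example above and is the only detail invented here.&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each entry reads: task depends on {upstream tasks}
dag = {
    "extract": set(),            # no dependencies - runs first
    "transform": {"extract"},    # waits for extract
    "load": {"transform"},       # waits for transform
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

&lt;p&gt;Orchestrators like Airflow perform essentially this sort (plus scheduling, retries and monitoring) to decide which tasks are ready to run.&lt;/p&gt;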

&lt;p&gt;&lt;strong&gt;Workflow Orchestration&lt;/strong&gt;&lt;br&gt;
Process of defining, scheduling and monitoring workflows. Ensures every task happens at the right time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A workflow orchestrator:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;defines workflows&lt;/li&gt;
&lt;li&gt;schedules workflows&lt;/li&gt;
&lt;li&gt;manages dependencies&lt;/li&gt;
&lt;li&gt;handles failures and retries&lt;/li&gt;
&lt;li&gt;monitors runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjty290z00rty5yccwfti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjty290z00rty5yccwfti.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Retry Logic &amp;amp; Dead Letter Queues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Essential components in building robust and resilient distributed systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry Logic&lt;/strong&gt;&lt;br&gt;
The ability of a system to automatically reattempt an operation when it fails, instead of instantly giving up. &lt;br&gt;
It's crucial for handling transient errors: temporary failures that are likely to resolve themselves with time.&lt;br&gt;
Failures are often temporary (e.g. a network glitch or a busy server), so retrying can save you from losing messages&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
When an operation fails, the system attempts to re-execute it after a short delay. This process can be repeated a number of times. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate Retries – try again instantly&lt;/li&gt;
&lt;li&gt;Fixed Interval – wait a fixed time before retrying&lt;/li&gt;
&lt;li&gt;Exponential Backoff – increase the wait time exponentially after each failure&lt;/li&gt;
&lt;li&gt;Jitter – add randomness to wait times to avoid "retry storms" when many clients retry at once&lt;/li&gt;
&lt;/ul&gt;
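&lt;p&gt;These strategies can be combined into one small generic helper. This is a sketch not tied to any particular library; the attempt count, base delay and the simulated flaky operation are arbitrary.&lt;/p&gt;

```python
import random
import time

def retry(operation, max_attempts=3, base_delay=0.01):
    """Retry an operation with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted - let the caller (or a DLQ) handle it
            delay = base_delay * (2 ** attempt)     # exponential backoff
            delay += random.uniform(0, base_delay)  # jitter
            time.sleep(delay)

calls = {"n": 0}

def flaky():
    # Simulated transient error: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] in (1, 2):
        raise ConnectionError("transient glitch")
    return "ok"

result = retry(flaky)
print(result, calls["n"])  # ok 3
```

&lt;p&gt;The jitter term matters more than it looks: without it, many clients that failed at the same moment would all retry at the same moment too.&lt;/p&gt;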

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;br&gt;
Improves system resilience by making applications more tolerant of temporary disruptions, reducing the need for manual intervention and preventing messages from being prematurely moved to a DLQ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queues (DLQs)&lt;/strong&gt;&lt;br&gt;
A special “quarantine” queue where failed messages are sent after they’ve been retried the allowed number of times but still fail&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
When a message fails to be processed after the configured retries, or if an unrecoverable error occurs the message is moved from the main queue to the DLQ&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevents problematic messages from blocking the main queue, ensuring the flow of other messages&lt;/li&gt;
&lt;li&gt;provides a centralized location to inspect failed messages and analyze the root cause of failures&lt;/li&gt;
&lt;/ul&gt;
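&lt;p&gt;The retry-then-quarantine flow can be sketched with two plain in-memory queues; the message format, the "poison" failure condition and the retry limit are invented for the example.&lt;/p&gt;

```python
from collections import deque

MAX_RETRIES = 2
main_queue = deque([{"id": 1, "body": "good"}, {"id": 2, "body": "poison"}])
dead_letter_queue = []

def process(msg):
    # Simulated consumer: one message always fails
    if msg["body"] == "poison":
        raise ValueError("cannot process")

while main_queue:
    msg = main_queue.popleft()
    try:
        process(msg)
    except ValueError:
        retries = msg.get("retries", 0)
        if retries == MAX_RETRIES:
            dead_letter_queue.append(msg)  # quarantine after retries exhausted
        else:
            msg["retries"] = retries + 1
            main_queue.append(msg)         # put back on the main queue to retry

print(dead_letter_queue)  # [{'id': 2, 'body': 'poison', 'retries': 2}]
```

&lt;p&gt;The healthy message flows through untouched, while the poison message ends up in the DLQ with its retry count attached for later inspection.&lt;/p&gt;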

&lt;h2&gt;
  
  
  &lt;strong&gt;12. Backfilling &amp;amp; Reprocessing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Techniques used to ensure data completeness and accuracy in data pipelines&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backfilling&lt;/strong&gt;&lt;br&gt;
Running data pipelines for past time periods to fill in missing or incomplete data&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Purpose&lt;/em&gt;&lt;br&gt;
Data consistency: Ensures that all historical data is consistent and complete, avoiding inconsistencies or missing data points&lt;br&gt;
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data&lt;br&gt;
Regulatory compliance: Helps meet regulatory requirements by ensuring that historical data is complete and accurate&lt;br&gt;
Accurate analytics: Provides a complete and accurate historical dataset for analysis and reporting, preventing skewed results&lt;/p&gt;

&lt;p&gt;Goal: Fill missing historical data&lt;br&gt;
Trigger: Data gap or late arrival&lt;br&gt;
Data State: No data exists for that time period&lt;br&gt;
Example: Adding data for Jan 1–14 that was never loaded&lt;/p&gt;
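&lt;p&gt;In practice, backfilling often amounts to re-running the pipeline once per missing partition. The date range matches the example above, but the run_pipeline stub is a placeholder for a real pipeline run.&lt;/p&gt;

```python
from datetime import date, timedelta

processed = []

def run_pipeline(day):
    # Stand-in for one real pipeline run over a single daily partition
    processed.append(day.isoformat())

def backfill(start, end):
    # Re-run the pipeline for every day in the historical gap, inclusive
    for offset in range((end - start).days + 1):
        run_pipeline(start + timedelta(days=offset))

backfill(date(2025, 1, 1), date(2025, 1, 14))
print(len(processed), processed[0], processed[-1])  # 14 2025-01-01 2025-01-14
```

&lt;p&gt;Reprocessing uses the same loop; the difference is that the target partitions already hold (incorrect) data and get overwritten.&lt;/p&gt;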

&lt;p&gt;&lt;strong&gt;Reprocessing&lt;/strong&gt;&lt;br&gt;
Running your data pipelines again for data that has already been processed, usually to correct or update results&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Purpose&lt;/em&gt;&lt;br&gt;
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data.&lt;br&gt;
Code updates: Allows for the application of new code or logic to existing data.&lt;br&gt;
Data quality improvements: Ensures that the data is in the desired state by re-running the processing logic.&lt;/p&gt;

&lt;p&gt;Goal: Correct or update existing processed data&lt;br&gt;
Trigger: Logic change, bug fix, or source update&lt;br&gt;
Data State: Data already exists but is wrong or outdated&lt;br&gt;
Example: Correcting sales data from Jan 1–14 that was miscalculated&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;13. Data Governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Refers to the policies, procedures, and practices that ensure data is managed, secured, and used effectively throughout its lifecycle. Basically the “rules” for how data is collected, stored, transformed, and accessed—so that your data remains accurate, consistent, secure, and compliant&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;br&gt;
Data Quality - Implement validation checks in ETL/ELT pipelines&lt;br&gt;
Metadata Management - Store information about data sources, transformations, and lineage&lt;br&gt;
Data Lineage - Track how data moves from source → transformations → destination. Helps in debugging and compliance audits.&lt;br&gt;
Access Control &amp;amp; Security - Use role-based access control (RBAC), encryption, masking, and tokenization to protect sensitive data&lt;/p&gt;
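&lt;p&gt;The first component, data-quality validation in a pipeline, can be sketched as a simple rule check per record; the schema and the rules themselves are invented for illustration.&lt;/p&gt;

```python
def validate(row):
    """Return a list of rule violations for one record (empty list = valid)."""
    errors = []
    if not row.get("customer_id"):
        errors.append("missing customer_id")
    amount = row.get("amount")
    # A negative number differs from its absolute value - flags bad amounts
    if not isinstance(amount, (int, float)) or amount != abs(amount):
        errors.append("amount must be a non-negative number")
    return errors

rows = [
    {"customer_id": "c1", "amount": 19.99},
    {"customer_id": "", "amount": -5},
]

valid = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]

print(len(valid), len(rejected))  # 1 1
```

&lt;p&gt;Rejected rows would typically be routed to a quarantine table together with their violation list, which doubles as an audit trail.&lt;/p&gt;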

&lt;p&gt;&lt;em&gt;Data Ingestion&lt;/em&gt;&lt;br&gt;
Governance rules decide what data can be ingested, from which sources, and at what frequency&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Transformation&lt;/em&gt;&lt;br&gt;
Apply quality checks, enrichment rules, and ensure changes are logged for lineage tracking&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Storage&lt;/em&gt;&lt;br&gt;
Follow partitioning, retention, and archival rules&lt;br&gt;
Apply encryption and role-based access at the storage layer&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Access&lt;/em&gt;&lt;br&gt;
Use a data catalog so users can discover data and understand its meaning before querying&lt;br&gt;
Enforce least-privilege access&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Monitoring &amp;amp; Auditing&lt;/em&gt;&lt;br&gt;
Automated alerts when governance rules are violated&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ap8157jtc8jgk2f29er.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ap8157jtc8jgk2f29er.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;14. Time Travel and Data Versioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Techniques that let you query, restore, or compare data from a specific point in the past — almost like a “rewind” button for your datasets&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Travel&lt;/strong&gt;&lt;br&gt;
Refers to the ability to access historical versions of data at previous points in time.&lt;br&gt;
Time travel enables low-cost comparisons between previous versions of data, helps analyze performance over time, lets organizations audit data changes (often required for compliance purposes), and helps reproduce the results of machine learning models.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
Systems like Delta Lake store changes as immutable snapshots with transaction logs&lt;br&gt;
Each write (insert/update/delete) produces a new version, but older files are retained until cleanup (compaction or vacuum)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Versioning&lt;/strong&gt;&lt;br&gt;
Data versioning is the storage of different versions of data that were created or changed at specific points in time&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;br&gt;
Safe experimentation (branch &amp;amp; merge like Git)&lt;br&gt;
Reproducibility for ML models and analytics&lt;br&gt;
Ability to restore to a stable version after a bad pipeline run&lt;/p&gt;
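&lt;p&gt;The snapshot idea can be sketched with a tiny in-memory version store. Real systems such as Delta Lake implement this with transaction logs and immutable files; the class below is a made-up illustration of the concept, not their API.&lt;/p&gt;

```python
class VersionedTable:
    """Minimal sketch: every write appends an immutable snapshot."""

    def __init__(self):
        self.versions = []  # list of snapshots; index = version number

    def write(self, rows):
        # Copy the rows so older versions can never be mutated
        self.versions.append(list(rows))

    def read(self, version=None):
        # Default reads the latest; pass a version number to "time travel"
        return self.versions[-1 if version is None else version]

t = VersionedTable()
t.write([{"balance": 100}])  # version 0
t.write([{"balance": 250}])  # version 1

print(t.read())   # [{'balance': 250}] - latest
print(t.read(0))  # [{'balance': 100}] - time travel to version 0
```

&lt;p&gt;Restoring after a bad pipeline run is then just writing an old snapshot back as the newest version.&lt;/p&gt;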

&lt;h2&gt;
  
  
  &lt;strong&gt;15. Distributed Processing Concepts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The concept of dividing a large task into smaller parts that are processed concurrently across multiple interconnected computers or nodes, rather than on a single system. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of a single powerful computer handling everything, multiple computers (nodes) work together. &lt;/li&gt;
&lt;li&gt;Tasks are broken down and executed simultaneously on different machines, significantly speeding up processing time. &lt;/li&gt;
&lt;li&gt;Nodes communicate and coordinate their actions through a network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach enhances performance, scalability, and fault tolerance, making it ideal for handling large-scale data processing and complex computations&lt;/p&gt;
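&lt;p&gt;The divide-and-combine pattern can be shown on a single machine with a worker pool; real distributed frameworks like Spark apply the same split/process/combine idea across many nodes. The chunk size and the sum task below are arbitrary.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Each "node" processes only its own slice of the data
    return sum(chunk)

data = list(range(1, 101))
# Split the job into 4 chunks of 25 numbers each
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(worker, chunks))  # process chunks concurrently

total = sum(partials)  # combine the partial results
print(total)  # 5050
```

&lt;p&gt;If one worker fails, only its chunk needs to be re-run, which is the same property that gives distributed frameworks their fault tolerance.&lt;/p&gt;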

&lt;p&gt;&lt;strong&gt;Importance&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;- Improved performance&lt;/em&gt;&lt;br&gt;
Parallel processing can dramatically reduce the time it takes to complete a task. &lt;br&gt;
&lt;em&gt;- Scalability&lt;/em&gt;&lt;br&gt;
Systems can easily be scaled by adding more nodes as needed to handle increasing workloads. &lt;br&gt;
&lt;em&gt;- Fault tolerance&lt;/em&gt;&lt;br&gt;
If one node fails, the others can often continue working, preventing complete system failure. &lt;br&gt;
&lt;em&gt;- Cost-effectiveness&lt;/em&gt;&lt;br&gt;
Using multiple commodity computers can be more affordable than relying on a single, powerful (and expensive) machine.&lt;br&gt;
&lt;em&gt;- Concurrency&lt;/em&gt;&lt;br&gt;
Nodes execute tasks at the same time.&lt;br&gt;
&lt;em&gt;- Independent failure&lt;/em&gt;&lt;br&gt;
Nodes can fail without bringing down the entire system.&lt;br&gt;
&lt;em&gt;- Resource sharing&lt;/em&gt;&lt;br&gt;
Nodes can share resources like processing power, storage, and data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Cloud computing&lt;/em&gt;&lt;br&gt;
Services like AWS, Google Cloud utilize distributed processing to offer on-demand computing resources. &lt;br&gt;
&lt;em&gt;Big data processing&lt;/em&gt;&lt;br&gt;
Frameworks like Apache Hadoop and Spark leverage distributed processing for analyzing massive datasets. &lt;br&gt;
&lt;em&gt;Database systems&lt;/em&gt;&lt;br&gt;
Many database systems are designed with distributed processing to handle large amounts of data and high traffic&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek07xg1z13gggcz189te.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek07xg1z13gggcz189te.webp" alt=" " width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
