<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sarah Muriithi</title>
    <description>The latest articles on DEV Community by Sarah Muriithi (@datawithsarah).</description>
    <link>https://dev.to/datawithsarah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3268706%2Fc48fe12c-8b3c-4277-ac72-1e7ad1f8fa47.jpeg</url>
      <title>DEV Community: Sarah Muriithi</title>
      <link>https://dev.to/datawithsarah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datawithsarah"/>
    <language>en</language>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications and Real-World Production Practices.</title>
      <dc:creator>Sarah Muriithi</dc:creator>
      <pubDate>Wed, 10 Sep 2025 05:25:56 +0000</pubDate>
      <link>https://dev.to/datawithsarah/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1g41</link>
      <guid>https://dev.to/datawithsarah/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1g41</guid>
      <description>&lt;p&gt;Apache Kafka is an open source distributed streaming platform. It is designed to handle real-time streams of data at scale, in a fault-tolerant way.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Core Concepts.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Clusters.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A collection of brokers (servers) working together to provide fault tolerance, scalability, and high throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Together they can handle millions of messages per second in distributed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Topic.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A topic is a logical channel where messages are produced and consumed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each topic is split into partitions for parallel processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating a Kafka topic:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create \
  --topic my-first-topic \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Listing all topics:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --list --zookeeper localhost:2181
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Partition.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is an ordered append-only sequence of records inside a topic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They enable parallel consumption and allow horizontal scaling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Brokers.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Kafka server that stores partitions and serves clients.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka brokers manage topic partitions, message replication, and data storage and retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Producers.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They send messages (events) to Kafka topics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They ensure that messages with the same key go to the same partition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Writing messages to a Kafka topic:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-producer.sh --broker-list localhost:9092 --topic customer_orders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
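&lt;p&gt;The "same key, same partition" guarantee comes from hashing the key to pick a partition. A minimal Python sketch of the idea (Kafka's default partitioner actually uses murmur2; md5 is used here only as a deterministic stand-in):&lt;/p&gt;

```python
import hashlib

# Simplified sketch of keyed partitioning. Kafka's default partitioner
# uses murmur2; md5 here just keeps the example self-contained.
def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records with the same key always land in the same partition,
# which is what preserves per-key ordering.
assert partition_for("customer-42", 3) == partition_for("customer-42", 3)
```

&lt;p&gt;Because the mapping depends on the partition count, adding partitions later changes where keys land, which is why partition counts are usually chosen up front.&lt;/p&gt;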



&lt;p&gt;&lt;strong&gt;Consumers.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They read messages from Kafka topics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consumers belong to consumer groups, thus allowing parallel processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reading messages from a topic:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
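&lt;p&gt;Within a consumer group, partitions are dealt out across members so no two consumers in the group read the same partition. A simplified round-robin sketch (illustrative only; Kafka ships several assignors, such as range and cooperative-sticky):&lt;/p&gt;

```python
# Sketch of dividing a topic's partitions among the consumers in a group.
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions across the members of a consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2, 3, 4, 5], ["consumer-a", "consumer-b"]))
# consumer-a -> [0, 2, 4], consumer-b -> [1, 3, 5]
```

&lt;p&gt;This is why the partition count caps parallelism: with more consumers than partitions, the extra consumers sit idle.&lt;/p&gt;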



&lt;p&gt;&lt;strong&gt;Offset.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unique identifier for each message within a partition.&lt;/li&gt;
&lt;li&gt;Consumers use offsets to track which messages they’ve already read.&lt;/li&gt;
&lt;/ul&gt;
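&lt;p&gt;Offset tracking can be pictured as a map from (topic, partition) to the next position to read. A toy sketch (real consumers commit offsets back to Kafka itself, in the internal __consumer_offsets topic):&lt;/p&gt;

```python
# Illustrative model of per-partition offset commits.
class OffsetTracker:
    def __init__(self):
        self.committed = {}  # (topic, partition) -> next offset to read

    def commit(self, topic: str, partition: int, offset: int) -> None:
        # Record that everything up to and including `offset` was processed.
        self.committed[(topic, partition)] = offset + 1

    def resume_from(self, topic: str, partition: int) -> int:
        # On restart, resume from the last committed position (or 0).
        return self.committed.get((topic, partition), 0)

tracker = OffsetTracker()
tracker.commit("customer_orders", 0, 41)  # processed offsets 0..41
assert tracker.resume_from("customer_orders", 0) == 42
```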

&lt;p&gt;&lt;strong&gt;ZooKeeper vs KRaft.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ZooKeeper is an external system used by Kafka for metadata management and cluster coordination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;KRaft is a ZooKeeper-free mode where Kafka manages its own metadata using the Raft consensus algorithm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Starting ZooKeeper:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/zookeeper-shell.sh localhost:2181
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kafka Connect.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kafka Connect is used to stream data between Kafka and external data systems like databases, file systems, and cloud storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Running Kafka Connect in standalone mode:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Replication.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka keeps multiple copies of partitions across brokers, which provides fault tolerance.&lt;/li&gt;
&lt;li&gt;One broker hosts the leader partition, others host replicas (followers).&lt;/li&gt;
&lt;li&gt;Producers and consumers talk to the leader.&lt;/li&gt;
&lt;/ul&gt;
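&lt;p&gt;A toy model of replica placement: spread each partition's replicas across brokers and treat the first replica as leader (Kafka's real assignment also balances leader counts and rack awareness):&lt;/p&gt;

```python
# Illustrative leader/follower placement across a set of brokers.
def place_replicas(num_partitions: int, brokers: list[int], rf: int) -> dict:
    placement = {}
    for p in range(num_partitions):
        # Start each partition on a different broker; wrap around for followers.
        replicas = [brokers[(p + i) % len(brokers)] for i in range(rf)]
        placement[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return placement

print(place_replicas(3, [101, 102, 103], rf=2))
# partition 0: leader 101, follower 102; partition 1: leader 102, ...
```

&lt;p&gt;If the broker hosting a leader fails, one of the in-sync followers is elected leader, which is how clients keep reading and writing through a failure.&lt;/p&gt;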

&lt;p&gt;&lt;strong&gt;Retention.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka doesn’t delete messages immediately after consumption.&lt;/li&gt;
&lt;li&gt;Messages are kept based on time (e.g., 7 days) or size (e.g., 1GB).&lt;/li&gt;
&lt;li&gt;Allows consumers to re-read messages later (useful for replaying events).&lt;/li&gt;
&lt;/ul&gt;
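&lt;p&gt;Time-based retention can be sketched as pruning records older than the retention window, independent of whether they were consumed (in practice Kafka deletes whole log segments, not individual records):&lt;/p&gt;

```python
# Illustrative time-based retention check against retention.ms.
def prune(log: list[dict], retention_ms: int, now_ms: int) -> list[dict]:
    """Keep only records still inside the retention window."""
    return [r for r in log if now_ms - r["timestamp_ms"] <= retention_ms]

DAY_MS = 86_400_000
now = 100 * DAY_MS
log = [
    {"offset": 0, "timestamp_ms": now - 8 * DAY_MS},  # 8 days old -> expired
    {"offset": 1, "timestamp_ms": now - 3_600_000},   # 1 hour old -> kept
]
kept = prune(log, retention_ms=7 * DAY_MS, now_ms=now)  # 7-day retention
assert [r["offset"] for r in kept] == [1]
```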

&lt;h2&gt;
  
  
  2. Data Engineering Applications.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Data Ingestion.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams data from multiple sources (databases, APIs, IoT devices, apps).&lt;/li&gt;
&lt;li&gt;Example: Collecting clickstream events from a website.&lt;/li&gt;
&lt;li&gt;Tools: Kafka Connect + Source Connectors (Debezium).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ETL/ELT Pipelines.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka acts as the transport layer in ETL workflows.&lt;/li&gt;
&lt;li&gt;Data can be cleaned/transformed on the fly using Kafka Streams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Lake/Warehouse Ingestion.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka feeds batch &amp;amp; streaming data into storage systems.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write data to S3 (data lake).&lt;/li&gt;
&lt;li&gt;Send cleaned data into Snowflake, BigQuery, or Redshift.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Change Data Capture.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools like Debezium integrate with Kafka to capture changes from relational databases (MySQL, Postgres, Oracle). &lt;/li&gt;
&lt;li&gt;This enables real-time ETL, replication, and synchronization across systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Pipelines.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka streams feed real-time features into ML models, powering use cases like fraud detection, dynamic pricing, or recommendation systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Real-World Production Practices.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Netflix&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Netflix uses Kafka for real-time monitoring, event sourcing and recommendations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every playback event or error is streamed to Kafka for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn uses Kafka to process over 1 trillion messages per day for activity tracking, search indexing, and fraud detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uber relies on Kafka to match riders with drivers, handle surge pricing, and provide real-time estimated arrival times (ETAs).&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DOCKER</title>
      <dc:creator>Sarah Muriithi</dc:creator>
      <pubDate>Wed, 27 Aug 2025 05:09:13 +0000</pubDate>
      <link>https://dev.to/datawithsarah/docker-3pk0</link>
      <guid>https://dev.to/datawithsarah/docker-3pk0</guid>
      <description>&lt;h2&gt;
  
  
&lt;strong&gt;GETTING STARTED WITH DOCKER AND DOCKER COMPOSE: A BEGINNER'S GUIDE&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What is Docker?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker is a tool used for building, running and managing containers.&lt;/li&gt;
&lt;li&gt;A container is a lightweight, portable package that has everything an application needs to run: code, libraries, system tools, and settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Install Docker&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Ubuntu/Linux
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; sudo apt install gnome-terminal
 sudo apt update
 sudo apt install -y docker.io
 sudo systemctl enable -now docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On Windows/Mac&lt;br&gt;
&lt;a href="https://www.docker.com/products/docker-desktop/" rel="noopener noreferrer"&gt;Install Docker Desktop&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After installation, run &lt;code&gt;docker version&lt;/code&gt; to verify that Docker is working.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Basic Docker Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Image&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is a blueprint that contains everything needed to run an app (OS, libraries, dependencies, and your code).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They are immutable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;-  Container&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is a running instance of an image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Containers are ephemeral -&amp;gt; you can start, stop, delete, and recreate them anytime.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;-  Dockerfile&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is a text file with instructions to build a custom image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An example:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.12
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;-  Docker Hub/Registry&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A registry is where images are stored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker Hub is the default public registry; you can also run private registries.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;-  Docker Engine&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The underlying runtime that builds and runs containers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Daemon&lt;/strong&gt; (dockerd) -&amp;gt; background service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CLI&lt;/strong&gt; (docker) -&amp;gt; command-line tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt; -&amp;gt; programmatic access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Basic Docker Commands&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List images
&lt;code&gt;docker images&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;List running containers
&lt;code&gt;docker ps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run a container
&lt;code&gt;docker run -d -p 8080:80 nginx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Then visit &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt; in your browser&lt;/li&gt;
&lt;li&gt;Stop a container
&lt;code&gt;docker stop &amp;lt;container id&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. What is Docker Compose&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker Compose lets you define and run multi-container applications.&lt;/li&gt;
&lt;li&gt;Instead of starting containers one by one, you define everything in a docker-compose.yml file and run:
&lt;code&gt;docker compose up&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
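&lt;p&gt;A minimal &lt;code&gt;docker-compose.yml&lt;/code&gt; sketch for a hypothetical two-service app: a web service built from the local Dockerfile plus a Postgres database (service names, ports, and images here are illustrative):&lt;/p&gt;

```yaml
services:
  web:
    build: .            # build the image from the Dockerfile in this directory
    ports:
      - "8000:8000"     # host:container
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example   # illustrative only; use secrets in production
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:
```

&lt;p&gt;With this file in place, &lt;code&gt;docker compose up -d&lt;/code&gt; starts both services and creates the named volume.&lt;/p&gt;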

&lt;p&gt;&lt;strong&gt;6. Install Docker Compose&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you installed Docker Desktop, Compose is already included.&lt;/li&gt;
&lt;li&gt;On Linux:
&lt;code&gt;sudo apt install docker-compose-plugin&lt;/code&gt;
&lt;code&gt;docker compose version&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;7. Useful Docker Compose Commands&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start containers:&lt;br&gt;
&lt;code&gt;docker compose up -d&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stop containers:&lt;br&gt;
&lt;code&gt;docker compose down&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;View logs:&lt;br&gt;
&lt;code&gt;docker compose logs -f&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restart a service:&lt;br&gt;
&lt;code&gt;docker compose restart web&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
