<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Walter Ndung'u</title>
    <description>The latest articles on DEV Community by Walter Ndung'u (@walnold).</description>
    <link>https://dev.to/walnold</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3391032%2F6ebeae6f-abb4-4103-97c5-5eec5c3ea65b.png</url>
      <title>DEV Community: Walter Ndung'u</title>
      <link>https://dev.to/walnold</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/walnold"/>
    <language>en</language>
    <item>
      <title>Introduction to Power BI</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Mon, 13 Oct 2025 19:34:48 +0000</pubDate>
      <link>https://dev.to/walnold/introduction-to-power-bi-1cpd</link>
      <guid>https://dev.to/walnold/introduction-to-power-bi-1cpd</guid>
      <description>&lt;p&gt;The demand for data analytics and visualization tools has grown exponentially as organizations embrace digital transformation. &lt;strong&gt;Business Intelligence (BI)&lt;/strong&gt;  platforms play a crucial role in aggregating data from multiple sources, performing analysis, and presenting it in meaningful formats. &lt;strong&gt;Power BI&lt;/strong&gt;, developed by Microsoft, has emerged as one of the most robust and flexible BI tools available. It combines powerful data modeling capabilities, DAX (Data Analysis Expressions), and interactive visualizations-allowing analysts and business users alike to uncover insights and share them effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Power BI
&lt;/h2&gt;

&lt;p&gt;Power BI is a business analytics platform that helps you turn data into actionable insights. It is designed for professionals at all levels of data expertise. &lt;br&gt;
Power BI dashboards support reporting through a wide range of visualization styles, including graphs, maps, charts, scatter plots, and more.&lt;/p&gt;
&lt;h2&gt;
  
  
  DAX Overview
&lt;/h2&gt;

&lt;p&gt;DAX (Data Analysis Expressions) is one of the most powerful features within Power BI. It is a formula language used to perform calculations and create custom measures within reports. DAX enhances the analytical capabilities of Power BI, allowing users to go beyond simple aggregations and perform advanced data analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  Categories of DAX Functions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mathematical Functions
&lt;/h3&gt;

&lt;p&gt;Mathematical DAX functions are used to perform numeric calculations such as summing or averaging data.&lt;br&gt;
For example, using the Kenya Crops Dataset, we can calculate the total crop yield as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Yield = SUM(Crops[Yield])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, to find the average yield per county, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Average Yield = AVERAGE(Crops[Yield])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Text Functions
&lt;/h3&gt;

&lt;p&gt;Text functions allow users to manipulate and format text fields.&lt;br&gt;
For instance, if we want to extract the first three letters of each crop’s name, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crop Code = LEFT(Crops[CropName], 3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To combine the crop name and county for better labeling, we can use the DAX &lt;code&gt;&amp;amp;&lt;/code&gt; concatenation operator (&lt;code&gt;CONCATENATE&lt;/code&gt; accepts only two arguments):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crop Label = Crops[CropName] &amp;amp; " - " &amp;amp; Crops[County]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such transformations are useful for creating clearer visual labels and summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Date &amp;amp; Time Functions
&lt;/h3&gt;

&lt;p&gt;Date and time functions are essential for time-based analysis, such as comparing yields over different seasons or years.&lt;br&gt;
For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year = YEAR(Crops[HarvestDate])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To calculate the total yield for the current year to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YTD Yield = TOTALYTD(SUM(Crops[Yield]), Crops[HarvestDate])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and to compare yields with the same period last year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Last Year Yield = CALCULATE(SUM(Crops[Yield]), SAMEPERIODLASTYEAR(Crops[HarvestDate]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These help track agricultural trends and assess performance across seasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logical Functions
&lt;/h3&gt;

&lt;p&gt;Logical functions allow conditional analysis.&lt;br&gt;
For example, to classify yields as “High” or “Low”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yield Category = IF(Crops[Yield] &amp;gt; 5000, "High", "Low")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, to assign categories based on multiple conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yield Status = SWITCH(
    TRUE(),
    Crops[Yield] &amp;gt; 8000, "Excellent",
    Crops[Yield] &amp;gt; 5000, "Good",
    "Needs Improvement"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These classifications can help farmers and policymakers quickly identify areas that need attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Insights
&lt;/h2&gt;

&lt;p&gt;Power BI, combined with DAX, provides a strong foundation for data-driven decision-making. In the context of agriculture, it allows farmers, researchers, and policymakers to visualize crop performance, identify patterns, and forecast future yields based on real data. By using DAX functions, users can build intelligent reports that not only summarize information but also uncover hidden insights.&lt;/p&gt;

&lt;p&gt;From my experience, Power BI has transformed how data is interpreted: it turns spreadsheets into stories and numbers into strategies. For Kenyan farmers and agricultural institutions, mastering Power BI and DAX means being able to make smarter, faster, evidence-based decisions that can significantly improve productivity and sustainability.&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>businessintelligence</category>
      <category>bi</category>
      <category>visualization</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Concepts, Applications, and Production</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Mon, 08 Sep 2025 03:29:57 +0000</pubDate>
      <link>https://dev.to/walnold/apache-kafka-deep-dive-concepts-applications-and-production-5f15</link>
      <guid>https://dev.to/walnold/apache-kafka-deep-dive-concepts-applications-and-production-5f15</guid>
      <description>&lt;p&gt;You've probably heard of &lt;strong&gt;Kafka&lt;/strong&gt;, right? But how did it come to existence, and what kind of problems did it solve?&lt;br&gt;&lt;br&gt;
Kafka was developed by LinkedIn (2010) to handle massive streams of user activity and logs. In a &lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future#:~:text=Use%20Cases%20at%20LinkedIn&amp;amp;text=These%20are%20then%20collected%20and,for%20our%20distributed%20database%20Espresso." rel="noopener noreferrer"&gt;publication &lt;/a&gt;by Mammad Zadeh(2015), "LinkedIn use kafka as the messaging backbone that helps the many company's applications to work together in a loosely coupled manner.". At LinkedIn, overall use cases are:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Activity Stream Tracking&lt;/em&gt;: Every click, profile view, search, or action is published to Kafka topics for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Log Aggregation&lt;/em&gt;: Instead of services writing to files, logs are centralized via Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Real-Time Analytics&lt;/em&gt;: Metrics like "how many people viewed my profile in the last 10 minutes" are powered by Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Data Pipeline Backbone&lt;/em&gt;: Kafka acts as a central bus to feed data to Hadoop, monitoring systems, and other consumers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This article will explore Apache Kafka and dive deeper to understand its core concepts.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Apache Kafka is an open-source, distributed,&lt;/em&gt; &lt;strong&gt;&lt;em&gt;event-streaming&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;system that processes real-time data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kafka has three main functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It enables applications to publish or subscribe to streams of data or events.&lt;/li&gt;
&lt;li&gt;It processes streams of data in real time.&lt;/li&gt;
&lt;li&gt;It stores streams of records durably as they occur.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  What is event-streaming?
&lt;/h3&gt;

&lt;p&gt;Event-streaming is the real-time capture of data as it is produced from event sources such as databases, APIs, IoT devices, cloud services, and other software applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does Kafka Work?
&lt;/h2&gt;

&lt;p&gt;Kafka has two messaging models, queuing and publish-subscribe. Queuing distributes data processing across multiple consumers, enabling scalability, while publish-subscribe supports multiple subscribers but sends every message to all, limiting workload distribution. Kafka resolves this by using a partitioned log model. A log is an ordered record sequence, divided into partitions that can be assigned to different subscribers. This design allows multiple consumers to process the same topic while balancing the workload efficiently. Additionally, Kafka supports replayability, enabling independent applications to read and reprocess data streams at their own pace, ensuring flexibility, scalability, and reliability in real-time data processing.&lt;/p&gt;
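&lt;p&gt;As a rough illustration of the partitioned log model (not Kafka's actual implementation, whose default partitioner uses murmur2 hashing), keyed partition assignment can be sketched in Python:&lt;/p&gt;

```python
# Simplified sketch of keyed partition assignment. Kafka's default
# partitioner really uses murmur2 hashing; the byte-sum here is a stand-in.
def assign_partition(key: str, num_partitions: int) -> int:
    digest = sum(key.encode("utf-8"))  # deterministic toy hash
    return digest % num_partitions

# Events sharing a key always land in the same partition,
# which is how Kafka preserves per-key ordering.
events = [("user-42", "click"), ("user-7", "view"), ("user-42", "search")]
placements = [(key, assign_partition(key, 3)) for key, _ in events]

assert placements[0][1] == placements[2][1]  # same key, same partition
```

&lt;p&gt;Because each partition is an ordered log, the consumers in a group can split the partitions among themselves while per-key order is still kept.&lt;/p&gt;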

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ewvipi9k56x36fnnh45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ewvipi9k56x36fnnh45.png" alt="producer-subscriber" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Kafka Concepts summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event&lt;/strong&gt;: A record of something that happened (key, value, timestamp, headers).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Producer&lt;/strong&gt;: Writes events to topics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer&lt;/strong&gt;: Reads events from topics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topic&lt;/strong&gt;: Stores events (like a folder).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition&lt;/strong&gt;: Subset of a topic; preserves order for events with the same key.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Multiple copies of partitions for fault tolerance (commonly 3).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention&lt;/strong&gt;: Events kept for a configurable time, not deleted on read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzqgkaqf16rwtwbpzu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzqgkaqf16rwtwbpzu9.png" alt="Kafka concepts" width="764" height="342"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A simple Quickstart project (via Docker)
&lt;/h2&gt;

&lt;p&gt;Here is a simple quickstart project in Python to stream BTC/USDT price data from the Binance API into Kafka, and then consume it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisite&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kafka &amp;amp; ZooKeeper (example Docker Compose snippet):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Install dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kafka-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Producer: Stream BTC price from Binance to Kafka&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# producer.py
import time
import requests
from kafka import KafkaProducer
import json

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"

def get_btc_price():
    url = "https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT"
    response = requests.get(url).json()
    return response

if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8")
    )

    while True:
        price_data = get_btc_price()
        producer.send(KAFKA_TOPIC, price_data)
        print(f"Sent: {price_data}")
        time.sleep(2)  # fetch price every 2 seconds


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Consumer: Read BTC price from Kafka&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# consumer.py
from kafka import KafkaConsumer
import json

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"

if __name__ == "__main__":
    consumer = KafkaConsumer(
        KAFKA_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
        enable_auto_commit=True
    )

    for message in consumer:
        print(f"Received: {message.value}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the Project
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start Kafka + ZooKeeper on Docker:
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Run Producer:
&lt;code&gt;python producer.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run Consumer:
&lt;code&gt;python consumer.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll see live BTC/USDT prices flowing from Binance --&amp;gt; Kafka --&amp;gt; Consumer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/walnold/kafkaTest" rel="noopener noreferrer"&gt;Github code&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Kafka bridges the gap between traditional queuing and publish-subscribe systems, offering a scalable, fault-tolerant, and high-performance solution for real-time data streaming. Its partitioned log architecture enables parallel processing while ensuring data consistency and replayability, making it an essential tool for modern data-driven applications. From powering Uber’s trip analytics to LinkedIn’s activity feeds, Kafka has proven its reliability in large-scale production environments. As organizations continue to embrace event-driven architectures, mastering Kafka will be a valuable skill for engineers seeking to build resilient, future-ready data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Kafka Documentation: &lt;a href="https://kafka.apache.org/documentation/" rel="noopener noreferrer"&gt;https://kafka.apache.org/documentation/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future#:~:text=Use%20Cases%20at%20LinkedIn&amp;amp;text=These%20are%20then%20collected%20and,for%20our%20distributed%20database%20Espresso." rel="noopener noreferrer"&gt;Kafka at LinkedIn: Current and Future&lt;/a&gt; (Mammad Zadeh, 2015)  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is Apache Kafka? &lt;a href="https://www.ibm.com/think/topics/apache-kafka" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/apache-kafka&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>zookeeper</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introduction to Docker and Docker Compose: Beginners Guide</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Tue, 26 Aug 2025 20:47:59 +0000</pubDate>
      <link>https://dev.to/walnold/introduction-to-docker-and-docker-compose-beginners-guide-3k2h</link>
      <guid>https://dev.to/walnold/introduction-to-docker-and-docker-compose-beginners-guide-3k2h</guid>
      <description>&lt;h2&gt;
  
  
  What is Docker
&lt;/h2&gt;

&lt;p&gt;Docker is an open source platform that enables developers and engineers to build, deploy, run and manage containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containers&lt;/strong&gt; are standardized, executable components that combine application source code with the operating system libraries and dependencies required to run that code in any environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Containers enable multiple application components to share the resources of a single instance of the host operating system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fbkx0cjqt8hz9jyalc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fbkx0cjqt8hz9jyalc9.png" alt="Containerization Image" width="570" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use Docker
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Docker solves the infamous &lt;em&gt;"It works on my machine"&lt;/em&gt; problem, which occurs when an application runs on the developer's laptop but breaks when deployed to a server or the cloud. Docker packages everything the app needs into a &lt;strong&gt;container image&lt;/strong&gt;, and that image runs the same way on any machine (laptop, staging server, or production in the cloud).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight&lt;/strong&gt;: Docker containers share the host OS kernel; they don't need to boot an entire OS each time, as is the case with traditional &lt;em&gt;virtual machines&lt;/em&gt;. As a result:
- Containers start quickly.
- You save cost on hardware and cloud resources.
- You can run many containers on a single machine.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable&lt;/strong&gt;: With Docker you can run multiple containers of the same app behind a load balancer, adding containers (&lt;strong&gt;scale up&lt;/strong&gt;) when demand increases or removing containers (&lt;strong&gt;scale down&lt;/strong&gt;) when demand decreases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast Deployment&lt;/strong&gt;: With Docker, you build an image once, and starting a new container from it is an automated, repeatable process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Terms and Tools within docker Architecture
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Docker Host&lt;/em&gt;: The physical or virtual machine running a &lt;strong&gt;Docker Engine&lt;/strong&gt;-compatible operating system such as Linux.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Engine&lt;/em&gt;: A client/server application that consists of the &lt;strong&gt;Docker Daemon&lt;/strong&gt;, a &lt;strong&gt;Docker API&lt;/strong&gt; that interacts with the daemon, and a &lt;strong&gt;Docker CLI&lt;/strong&gt; that talks to the daemon.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Daemon&lt;/em&gt;: A service that creates and manages Docker images using commands from the client.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Client&lt;/em&gt;: Provides the Command Line Interface (CLI) that uses the &lt;strong&gt;Docker API&lt;/strong&gt; to communicate with the &lt;strong&gt;Docker Daemon&lt;/strong&gt; over a Unix socket or a network interface.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Objects&lt;/em&gt;: Components of a Docker deployment that help package and distribute applications. They include images, containers, networks, plugins, and volumes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Container&lt;/em&gt;: The live, running instance of a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Image&lt;/em&gt;: Contains executable application source code together with all the tools, libraries, and dependencies the application needs to run as a container.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Build&lt;/em&gt;: A command with tools and features for creating a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Dockerfile&lt;/em&gt;: A simple text file containing the list of instructions the Docker Engine runs to assemble a container image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Hub&lt;/em&gt;: A public repository of Docker images.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Compose&lt;/em&gt;: A tool for managing multi-container applications where all containers run on the same Docker host.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Docker Installation
&lt;/h2&gt;

&lt;p&gt;On Ubuntu/Debian:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install docker.io -y
sudo systemctl enable docker --now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker version&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Running your first container
&lt;/h2&gt;



&lt;p&gt;&lt;code&gt;docker run hello-world&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;👆Docker pulls the image from Docker Hub and runs it inside a container&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic docker commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull an image from Docker Hub
docker pull ubuntu

# Run a container
docker run -it ubuntu bash

# List running containers
docker ps

# List all containers (including stopped)
docker ps -a

# Stop a container
docker stop &amp;lt;container_id&amp;gt;

# Remove a container
docker rm &amp;lt;container_id&amp;gt;

# Remove an image
docker rmi &amp;lt;image_id&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building an Image
&lt;/h2&gt;

&lt;p&gt;Create a file called &lt;em&gt;Dockerfile&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use Python base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy files
COPY . /app

# Install dependencies
RUN pip install flask

# Run the app (app.py must bind to 0.0.0.0 for the published port to be reachable)
CMD ["python", "app.py"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
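
&lt;p&gt;The Dockerfile above expects an &lt;em&gt;app.py&lt;/em&gt;; a minimal sketch (hypothetical, ending with an &lt;code&gt;app.run(host="0.0.0.0", port=5000)&lt;/code&gt; entry point so the published port is reachable) could look like:&lt;/p&gt;

```python
# app.py - minimal Flask app assumed by the Dockerfile above (hypothetical)
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello from Docker!"

# The script would end with:
#   app.run(host="0.0.0.0", port=5000)
# binding to 0.0.0.0 so `docker run -p 5000:5000` can reach the server.
```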



&lt;p&gt;&lt;strong&gt;Build and run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t myapp .
docker run -p 5000:5000 myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Managing Multi-Container Applications using Docker-compose
&lt;/h2&gt;

&lt;p&gt;In real projects, applications often need multiple services working together. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A web application (Flask, Django, or Node.js)&lt;/li&gt;
&lt;li&gt;A database (PostgreSQL, MongoDB)&lt;/li&gt;
&lt;li&gt;A cache (Redis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running and connecting each container manually with &lt;code&gt;docker run&lt;/code&gt; can get messy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Docker Compose
&lt;/h2&gt;

&lt;p&gt;We use Docker Compose to define and manage multi-container applications using a single YAML file (&lt;em&gt;docker-compose.yml&lt;/em&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  To run Docker Compose, just run:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;docker-compose up&lt;/code&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Docker Compose File
&lt;/h3&gt;

&lt;p&gt;Here is a simple example: a Flask app with a PostgreSQL database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.8'

services:
  web:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;web -&amp;gt; Your Flask app (built from the Dockerfile in the current directory).&lt;/li&gt;
&lt;li&gt;db -&amp;gt; A PostgreSQL database running in its own container.&lt;/li&gt;
&lt;li&gt;depends_on -&amp;gt; Ensures the database starts before the web app.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Running docker compose&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;docker-compose up&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👆This launches both containers (web + db)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To Stop them, run:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;docker-compose down&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why docker compose is useful
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Simplifies running multiple containers.&lt;/li&gt;
&lt;li&gt;Keeps your setup reproducible and shareable.&lt;/li&gt;
&lt;li&gt;Handles networking automatically (services can talk to each other by name, e.g., &lt;code&gt;db&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
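
&lt;p&gt;For example, because Compose puts services on a shared network, the web container could reach PostgreSQL with a connection string built from the service name (a hypothetical sketch using the credentials from the sample file above):&lt;/p&gt;

```python
# Hypothetical connection URL for the Compose setup above: the service
# name "db" resolves to the database container inside the Compose network.
DB_URL = "postgresql://{user}:{pw}@{host}:{port}/{db}".format(
    user="myuser", pw="mypassword", host="db", port=5432, db="mydb"
)
print(DB_URL)  # postgresql://myuser:mypassword@db:5432/mydb
```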

</description>
      <category>docker</category>
      <category>containers</category>
      <category>microservices</category>
      <category>compose</category>
    </item>
    <item>
      <title>15 Data Engineering Core Concepts Simplified</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Sun, 10 Aug 2025 20:07:05 +0000</pubDate>
      <link>https://dev.to/walnold/15-data-engineering-core-concepts-simplified-5fo3</link>
      <guid>https://dev.to/walnold/15-data-engineering-core-concepts-simplified-5fo3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s world of Big Data, the term data engineering is everywhere — often surrounded by a cloud of technical buzzwords. These terms can feel overwhelming, especially if you’re new to the data ecosystem.&lt;/p&gt;

&lt;p&gt;This article aims to break down these concepts into &lt;strong&gt;simple, relatable explanations&lt;/strong&gt; so you can understand them without needing a technical background.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Data Engineering?
&lt;/h3&gt;

&lt;p&gt;Data engineering is the discipline of &lt;strong&gt;designing, building, and maintaining data pipelines&lt;/strong&gt; that ensure data can move reliably from its source to where it’s needed. These pipelines handle the &lt;strong&gt;movement, transformation, and storage&lt;/strong&gt; of data, making it ready for analysis and decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts of Data Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Batch vs Streaming Ingestion&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Batch Ingestion&lt;/em&gt; is a process whereby data is collected and processed in large, discrete chunks at specific times, usually scheduled.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Stream Ingestion&lt;/em&gt; is the continuous collection of data as it arrives. Data is processed individually as it enters the system.  &lt;/p&gt;
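
&lt;p&gt;The difference can be sketched with the same events processed both ways (a toy illustration, not a real ingestion framework):&lt;/p&gt;

```python
# Toy contrast between batch and streaming ingestion over the same events.
events = [1, 2, 3, 4]

# Batch: collect everything first, then process once on a schedule.
batch_total = sum(events)

# Streaming: process each event as it arrives, keeping running state.
running_totals = []
total = 0
for event in events:
    total += event
    running_totals.append(total)

assert batch_total == 10
assert running_totals == [1, 3, 6, 10]
```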

&lt;p&gt;&lt;strong&gt;2. Change Data Capture (CDC)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Change Data Capture&lt;/em&gt; is a technique that identifies and tracks changes (inserts, updates, deletes) made to data in a database and then delivers those changes in real time to a downstream process or system, such as real-time data integration or a data warehouse.&lt;/p&gt;
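
&lt;p&gt;A toy sketch of the idea: replaying a captured change log against a downstream replica (hypothetical operations, not a real CDC tool):&lt;/p&gt;

```python
# Toy CDC replay: apply captured inserts/updates/deletes to a replica.
change_log = [
    ("insert", "row-1", {"name": "Walter"}),
    ("update", "row-1", {"name": "Walter N."}),
    ("insert", "row-2", {"name": "Ada"}),
    ("delete", "row-2", None),
]

replica = {}
for op, key, payload in change_log:
    if op == "delete":
        replica.pop(key, None)  # deletes remove the row downstream
    else:
        replica[key] = payload  # inserts and updates upsert the row

assert replica == {"row-1": {"name": "Walter N."}}
```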

&lt;p&gt;&lt;strong&gt;3. Idempotency&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Idempotency&lt;/em&gt; is a property of an operation whereby executing it multiple times with the same input produces the same result. For example, when creating a record, pressing the save button twice saves only one record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. OLTP vs OLAP&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Online Transaction Processing (OLTP)&lt;/em&gt; is a form of data processing that involves a large number of small, concurrent transactions. Examples include online banking, shopping, order entry, and sending text messages.&lt;br&gt;
&lt;em&gt;Online Analytical Processing (OLAP)&lt;/em&gt; is a way of storing and querying data so that you can quickly analyze it from different dimensions without having to run slow, complex queries on raw transactional data.&lt;br&gt;
Scenario: a company's sales data&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In OLTP: Every single sale is recorded (like “Sold 3 units of product X in Nairobi on Aug 10, 2025”).  &lt;/p&gt;

&lt;p&gt;In OLAP: Data is reorganized so you can quickly answer questions like:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;What were the total sales for product X by month for the past 2 years?&lt;/li&gt;
&lt;li&gt;Which region sold the most in Q2 2025?&lt;/li&gt;
&lt;li&gt;How do sales in Nairobi compare to Kisumu over time?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;
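
&lt;p&gt;The OLAP questions above amount to rollups over the raw OLTP rows; a toy Python aggregation makes the contrast concrete:&lt;/p&gt;

```python
# Toy OLAP-style rollup over raw OLTP sale rows: totals by (product, month).
from collections import defaultdict

sales = [  # (product, region, date, units) - individual transactions
    ("X", "Nairobi", "2025-08-10", 3),
    ("X", "Kisumu", "2025-08-12", 5),
    ("X", "Nairobi", "2025-07-01", 2),
]

totals = defaultdict(int)
for product, region, date, units in sales:
    totals[(product, date[:7])] += units  # bucket by month, e.g. "2025-08"

assert totals[("X", "2025-08")] == 8
assert totals[("X", "2025-07")] == 2
```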

&lt;p&gt;&lt;strong&gt;5. Columnar vs Row-based Storage&lt;/strong&gt; &lt;br&gt;
In row-based Storage, all values of a single record are stored contiguously on disk. This form of storage is efficient for transactional workloads (Inserting, updating, or deleting rows) and Write-intensive operations. Row-based storage is less efficient for queries that need to access only some columns across many rows, as the entire row must be read from the disk, leading to unnecessary I/O.&lt;/p&gt;

&lt;p&gt;In Columnar Storage, data is stored column by column, with all values for a single column stored contiguously on a disk. This form of storage is highly efficient for Analytical queries that involve aggregations, filtering, and analysis across a large dataset, as only the required columns are read from disk. However, it is less efficient for Transactional workloads as modifying a single row requires updates across multiple column blocks.&lt;/p&gt;
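
&lt;p&gt;A small sketch of the two layouts (illustrative only):&lt;/p&gt;

```python
# The same records in a row-based layout vs a columnar layout.
rows = [
    {"id": 1, "crop": "maize", "yield": 4200},
    {"id": 2, "crop": "tea", "yield": 6100},
]

# Columnar: each column's values are stored contiguously.
columns = {
    "id": [1, 2],
    "crop": ["maize", "tea"],
    "yield": [4200, 6100],
}

# An analytical query ("average yield") touches one column here,
# instead of reading every full row as in the row-based layout.
avg_yield = sum(columns["yield"]) / len(columns["yield"])
assert avg_yield == 5150.0
```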

&lt;p&gt;&lt;strong&gt;6. Partitioning&lt;/strong&gt;&lt;br&gt;
In data engineering, partitioning means splitting a large dataset into smaller, more manageable parts to speed up queries and reduce resource usage. Instead of scanning an entire table or file, queries read only the relevant partitions.&lt;/p&gt;

&lt;p&gt;Common types of partitioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Horizontal partitioning: Splitting rows based on a column’s value (e.g., date, region).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vertical partitioning: Splitting columns into separate tables or files to reduce data scanned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hash partitioning: Using a hash function on a key (e.g., customer ID) to evenly distribute data across partitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
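&lt;p&gt;Hash partitioning is the easiest of the three to sketch in code. A minimal example, with a hypothetical customer-ID key and four partitions, using a stable hash so the routing is consistent across runs:&lt;/p&gt;

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(customer_id):
    # Stable hash: unlike Python's built-in hash(), hashlib gives the
    # same digest on every run, so a key always routes to one partition.
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for customer in ["cust-001", "cust-002", "cust-003", "cust-004"]:
    print(customer, "is in partition", partition_for(customer))
```

&lt;p&gt;Horizontal partitioning works the same way, except the routing key is a natural column value (a date or region) rather than a hash bucket.&lt;/p&gt;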

&lt;p&gt;&lt;strong&gt;7. ETL vs ELT&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;ETL&lt;/strong&gt; (Extract, Transform, Load): Data is extracted from source systems, transformed (cleaned, enriched, aggregated) in a separate processing environment, and then loaded into the target storage (e.g., a data warehouse).&lt;/p&gt;

&lt;p&gt;Good when transformations must happen before data enters storage.&lt;/p&gt;

&lt;p&gt;Often used with on-premise data warehouses or systems with strict schema requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; (Extract, Load, Transform): Data is extracted from sources, loaded directly into the target storage first (often a cloud data warehouse), and transformed inside the storage using its processing power.&lt;/p&gt;

&lt;p&gt;Good when the storage is powerful enough to handle transformations at scale (e.g., Snowflake, BigQuery).&lt;/p&gt;

&lt;p&gt;Allows storing raw data for flexibility and reprocessing later.&lt;/p&gt;
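&lt;p&gt;The difference between the two patterns is only in &lt;em&gt;where&lt;/em&gt; the transform step runs. A minimal sketch with hypothetical records, using plain lists to stand in for the warehouse and the data lake:&lt;/p&gt;

```python
raw = ["  alice ", "BOB", "alice"]

def transform(records):
    # Clean: strip whitespace, normalize case, drop duplicates.
    return sorted(set(r.strip().lower() for r in records))

# ETL: transform first, so only clean data ever lands in the warehouse.
warehouse = []
warehouse.extend(transform(raw))
print(warehouse)  # ['alice', 'bob']

# ELT: load the raw data first, then transform inside the store later;
# the raw copy is kept, so the transform can be rerun with new logic.
lake = list(raw)
clean_view = transform(lake)
print(clean_view)  # ['alice', 'bob']
```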

&lt;p&gt;&lt;strong&gt;8. CAP Theorem (Brewer's Theorem)&lt;/strong&gt;&lt;br&gt;
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition Tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 3 Properties&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency (C)&lt;/strong&gt;: Every node in the system sees the same data at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability (A)&lt;/strong&gt;: Every request receives a response, though not necessarily the most recent data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Tolerance (P)&lt;/strong&gt;: The system continues to operate even when messages between nodes are lost or delayed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Trade-off&lt;/strong&gt;&lt;br&gt;
When a network partition happens, you must choose between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CA&lt;/strong&gt; → Consistency + Availability (no Partition Tolerance) → works only if network never fails (rare in real distributed systems).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CP&lt;/strong&gt; → Consistency + Partition Tolerance (may sacrifice availability during a partition).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AP&lt;/strong&gt; → Availability + Partition Tolerance (may serve stale data to keep responding).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;9. Windowing in Streaming&lt;/strong&gt;&lt;br&gt;
Windowing is a stream-processing technique that takes an infinite flow of events (such as logs, sensor readings, or transactions) and breaks it into finite chunks by time or count so you can run aggregations like sum, average, or count. Windowing provides the logical boundaries that allow an otherwise endless stream to produce results.&lt;br&gt;
&lt;strong&gt;10. DAG and Workflow Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG (Directed Acyclic Graph): A way to represent a workflow where steps run in a defined order and no path ever leads back to a previous step.&lt;/li&gt;
&lt;li&gt;Workflow Orchestration: The process of automating, coordinating, and managing multiple tasks (DAGs) and systems to execute complex business processes. Tools used for orchestration include Apache Airflow, Dagster, and Luigi.&lt;/li&gt;
&lt;/ul&gt;
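&lt;p&gt;An orchestrator's core job of ordering DAG tasks is a topological sort, which Python's standard library can do directly. A sketch with a hypothetical extract/clean/load pipeline (the task names are invented for illustration):&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on:
# extract feeds clean, and clean feeds both load steps.
dag = {
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "load_dashboard": {"clean"},
}

# An execution order that respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

&lt;p&gt;Tools like Airflow build on the same idea, adding scheduling, retries, and monitoring around the dependency graph.&lt;/p&gt;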

&lt;p&gt;&lt;strong&gt;11. Retry Logic &amp;amp; Dead Letter Queues&lt;/strong&gt;&lt;br&gt;
Data engineering and distributed systems need ways to handle failures without losing data. Among these are retry logic and dead letter queues (DLQs).&lt;br&gt;
&lt;strong&gt;Retry Logic&lt;/strong&gt;: The process of automatically reattempting a failed task or message after a delay, usually with a limit on how many retries are allowed. It is useful for transient issues such as network glitches, API timeouts, or locked resources.&lt;br&gt;
&lt;strong&gt;Dead Letter Queue&lt;/strong&gt;: A special holding queue for messages or events that still fail after all retries. It prevents endless retry loops and preserves failed data for investigation.&lt;/p&gt;
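&lt;p&gt;The two mechanisms combine naturally: retry a few times, and only then park the message in the DLQ. A minimal sketch, with a deliberately failing handler and an in-memory list standing in for a real queue (all names are hypothetical):&lt;/p&gt;

```python
import time

dead_letter_queue = []

def process_with_retry(message, handler, max_retries=3, delay=0.01):
    # Reattempt the handler up to max_retries times, then park the
    # message in the DLQ so it is preserved for investigation.
    last_error = None
    for attempt in range(max_retries):
        try:
            return handler(message)
        except Exception as exc:
            last_error = str(exc)
            time.sleep(delay)  # fixed delay; real systems often back off exponentially
    dead_letter_queue.append({"message": message, "error": last_error})
    return None

def flaky_handler(message):
    raise RuntimeError("downstream timeout")  # always fails, for demonstration

process_with_retry({"id": 1}, flaky_handler)
print(dead_letter_queue)  # the failed message, with its last error, is kept
```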

&lt;p&gt;&lt;strong&gt;12. Backfilling &amp;amp; Reprocessing&lt;/strong&gt;&lt;br&gt;
Backfilling involves reprocessing historical data to correct errors, accommodate new data structures, or integrate new data sources.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: If your sales database adds a new “discount” column, you might backfill past records so that older data also includes the correct discount information.&lt;/p&gt;

&lt;p&gt;Reprocessing is the act of running data through a processing pipeline again to correct inaccuracies, apply updated transformation logic, or ensure completeness.&lt;br&gt;
&lt;em&gt;Example&lt;/em&gt;: If you discover an error in your tax calculation logic, you might reprocess the past month’s sales data using the corrected formula.&lt;/p&gt;
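&lt;p&gt;The discount-column backfill from the example above can be sketched in a few lines; the order records are hypothetical, and a default of 0 stands in for "no discount recorded":&lt;/p&gt;

```python
# Historical records written before the "discount" column existed,
# mixed with a newer record that already has it.
orders = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250, "discount": 25},
]

# Backfill: give every older record the new column with a safe default,
# leaving records that already have a value untouched.
for order in orders:
    order.setdefault("discount", 0)

print(orders)
```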

&lt;p&gt;&lt;strong&gt;13. Data Governance&lt;/strong&gt;&lt;br&gt;
Data governance is the framework of rules, processes, and responsibilities that ensures data is accurate, secure, consistent, and used appropriately throughout its life cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose in Data Engineering&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality assurance&lt;/strong&gt; – Making sure the data pipelines deliver clean, reliable data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security &amp;amp; privacy&lt;/strong&gt; – Controlling who can access or modify data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance&lt;/strong&gt; – Meeting legal and regulatory requirements (e.g., GDPR, HIPAA).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lineage &amp;amp; documentation&lt;/strong&gt; – Tracking where data came from, how it was transformed, and where it’s used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standardization&lt;/strong&gt; – Ensuring consistent formats, naming conventions, and definitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;14. Time Travel &amp;amp; Data Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Time travel&lt;/strong&gt; is the ability to view a dataset as it existed at a specific point in the past.&lt;br&gt;&lt;br&gt;
It can be used to recover accidentally deleted or modified data, to audit historical states of data, and to compare current and past datasets.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Data Versioning&lt;/strong&gt; is the practice of storing and managing multiple versions of a dataset over time. Its purpose is to track changes to data and to enable rollback to previous versions.&lt;/p&gt;
&lt;/blockquote&gt;
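&lt;p&gt;The core idea behind both features is keeping immutable snapshots of the data per version. A toy sketch (systems like Delta Lake or Snowflake implement this far more efficiently with logs and metadata, not full copies; the class and data here are invented for illustration):&lt;/p&gt;

```python
import copy

class VersionedDataset:
    # Minimal versioning: each commit snapshots the data,
    # and any past version stays readable ("time travel").
    def __init__(self):
        self.versions = []

    def commit(self, data):
        self.versions.append(copy.deepcopy(data))
        return len(self.versions) - 1  # the new version number

    def at_version(self, n):
        # Read the dataset as it existed at version n.
        return self.versions[n]

ds = VersionedDataset()
v0 = ds.commit({"rows": 100})
v1 = ds.commit({"rows": 150})
print(ds.at_version(v0))  # {'rows': 100}
```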

&lt;p&gt;&lt;strong&gt;15. Distributed Processing&lt;/strong&gt;&lt;br&gt;
Distributed processing is the method of breaking a large computing task into smaller parts, running those parts simultaneously across multiple machines or processors, and then combining the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose in Data Engineering&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Handle datasets too large for a single machine’s memory or storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process data faster by working in parallel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve fault tolerance — if one machine fails, others can continue.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
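&lt;p&gt;The split-compute-combine pattern can be sketched on a single machine with a worker pool standing in for cluster nodes; frameworks like Spark apply the same shape across many machines. The chunk size and worker count below are arbitrary choices for illustration:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker computes its share independently.
    return sum(chunk)

numbers = list(range(1, 1001))
# Split the work into four chunks, as a cluster splits data across nodes.
chunks = [numbers[i:i + 250] for i in range(0, 1000, 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))  # process chunks in parallel

# Combine the partial results into the final answer.
print(sum(partials))  # 500500, the same answer as a single sum(numbers)
```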

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>bigdata</category>
      <category>dag</category>
    </item>
  </channel>
</rss>
