<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Walter Ndung'u</title>
    <description>The latest articles on DEV Community by Walter Ndung'u (@walnold).</description>
    <link>https://dev.to/walnold</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3391032%2F6ebeae6f-abb4-4103-97c5-5eec5c3ea65b.png</url>
      <title>DEV Community: Walter Ndung'u</title>
      <link>https://dev.to/walnold</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/walnold"/>
    <language>en</language>
    <item>
      <title>Introduction to Power BI</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Mon, 13 Oct 2025 19:34:48 +0000</pubDate>
      <link>https://dev.to/walnold/introduction-to-power-bi-1cpd</link>
      <guid>https://dev.to/walnold/introduction-to-power-bi-1cpd</guid>
      <description>&lt;p&gt;The demand for data analytics and visualization tools has grown exponentially as organizations embrace digital transformation. &lt;strong&gt;Business Intelligence (BI)&lt;/strong&gt;  platforms play a crucial role in aggregating data from multiple sources, performing analysis, and presenting it in meaningful formats. &lt;strong&gt;Power BI&lt;/strong&gt;, developed by Microsoft, has emerged as one of the most robust and flexible BI tools available. It combines powerful data modeling capabilities, DAX (Data Analysis Expressions), and interactive visualizations-allowing analysts and business users alike to uncover insights and share them effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Power BI
&lt;/h2&gt;

&lt;p&gt;Power BI is a business analytics platform that helps you turn data into actionable insights. It is designed for professionals at all levels of data expertise. &lt;br&gt;
Power BI dashboards support reporting through a wide range of visualization styles, including graphs, maps, charts, scatter plots, and more.&lt;/p&gt;
&lt;h2&gt;
  
  
  DAX Overview
&lt;/h2&gt;

&lt;p&gt;DAX (Data Analysis Expressions) is one of the most powerful features within Power BI. It is a formula language used to perform calculations and create custom measures within reports. DAX enhances the analytical capabilities of Power BI, allowing users to go beyond simple aggregations and perform advanced data analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  Categories of DAX Functions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mathematical Functions
&lt;/h3&gt;

&lt;p&gt;Mathematical DAX functions are used to perform numeric calculations such as summing or averaging data.&lt;br&gt;
For example, using the Kenya Crops Dataset, we can calculate the total crop yield as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Yield = SUM(Crops[Yield])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, to find the average yield per county, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Average Yield = AVERAGE(Crops[Yield])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Text Functions
&lt;/h3&gt;

&lt;p&gt;Text functions allow users to manipulate and format text fields.&lt;br&gt;
For instance, if we want to extract the first three letters of each crop’s name, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crop Code = LEFT(Crops[CropName], 3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To combine the crop name and county for better labeling, we can use the DAX &lt;code&gt;&amp;amp;&lt;/code&gt; concatenation operator (&lt;code&gt;CONCATENATE&lt;/code&gt; accepts only two arguments):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crop Label = Crops[CropName] &amp;amp; " - " &amp;amp; Crops[County]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such transformations are useful for creating clearer visual labels and summaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Date &amp;amp; Time Functions
&lt;/h3&gt;

&lt;p&gt;Date and time functions are essential for time-based analysis, such as comparing yields over different seasons or years.&lt;br&gt;
For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year = YEAR(Crops[HarvestDate])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To calculate the total yield for the current year to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YTD Yield = TOTALYTD(SUM(Crops[Yield]), Crops[HarvestDate])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and to compare yields with the same period last year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Last Year Yield = CALCULATE(SUM(Crops[Yield]), SAMEPERIODLASTYEAR(Crops[HarvestDate]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These help track agricultural trends and assess performance across seasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logical Functions
&lt;/h3&gt;

&lt;p&gt;Logical functions allow conditional analysis.&lt;br&gt;
For example, to classify yields as “High” or “Low”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yield Category = IF(Crops[Yield] &amp;gt; 5000, "High", "Low")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, to assign categories based on multiple conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yield Status = SWITCH(
    TRUE(),
    Crops[Yield] &amp;gt; 8000, "Excellent",
    Crops[Yield] &amp;gt; 5000, "Good",
    "Needs Improvement"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These classifications can help farmers and policymakers quickly identify areas that need attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Insights
&lt;/h2&gt;

&lt;p&gt;Power BI, combined with DAX, provides a strong foundation for data-driven decision-making. In the context of agriculture, it allows farmers, researchers, and policymakers to visualize crop performance, identify patterns, and forecast future yields based on real data. By using DAX functions, users can build intelligent reports that not only summarize information but also uncover hidden insights.&lt;/p&gt;

&lt;p&gt;From my experience, Power BI has transformed how data is interpreted: it turns spreadsheets into stories and numbers into strategies. For Kenyan farmers and agricultural institutions, mastering Power BI and DAX means being able to make smarter, faster, evidence-based decisions that can significantly improve productivity and sustainability.&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>businessintelligence</category>
      <category>bi</category>
      <category>visualization</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Concepts, Applications, and Production</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Mon, 08 Sep 2025 03:29:57 +0000</pubDate>
      <link>https://dev.to/walnold/apache-kafka-deep-dive-concepts-applications-and-production-5f15</link>
      <guid>https://dev.to/walnold/apache-kafka-deep-dive-concepts-applications-and-production-5f15</guid>
      <description>&lt;p&gt;You've probably heard of &lt;strong&gt;Kafka&lt;/strong&gt;, right? But how did it come to existence, and what kind of problems did it solve?&lt;br&gt;&lt;br&gt;
Kafka was developed by LinkedIn (2010) to handle massive streams of user activity and logs. In a &lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future#:~:text=Use%20Cases%20at%20LinkedIn&amp;amp;text=These%20are%20then%20collected%20and,for%20our%20distributed%20database%20Espresso." rel="noopener noreferrer"&gt;publication &lt;/a&gt;by Mammad Zadeh(2015), "LinkedIn use kafka as the messaging backbone that helps the many company's applications to work together in a loosely coupled manner.". At LinkedIn, overall use cases are:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Activity Stream Tracking&lt;/em&gt;: Every click, profile view, search, or action is published to Kafka topics for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Log Aggregation&lt;/em&gt;: Instead of services writing to files, logs are centralized via Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Real-Time Analytics&lt;/em&gt;: Metrics like "how many people viewed my profile in the last 10 minutes" are powered by Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Data Pipeline Backbone&lt;/em&gt;: Kafka acts as a central bus to feed data to Hadoop, monitoring systems, and other consumers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This article will explore Apache Kafka and dive deeper to understand its core concepts.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Apache Kafka is an open-source, distributed,&lt;/em&gt; &lt;strong&gt;&lt;em&gt;event-streaming&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;system that processes real-time data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kafka has three main functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It enables applications to publish or subscribe to streams of data or events.&lt;/li&gt;
&lt;li&gt;It processes streams of data in real time.&lt;/li&gt;
&lt;li&gt;It stores streams of records durably as they occur.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  What is event-streaming?
&lt;/h3&gt;

&lt;p&gt;Event-streaming is the real-time capture of data as it is produced from event sources such as databases, APIs, IoT devices, cloud services, and other software applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does Kafka Work?
&lt;/h2&gt;

&lt;p&gt;Kafka has two messaging models, queuing and publish-subscribe. Queuing distributes data processing across multiple consumers, enabling scalability, while publish-subscribe supports multiple subscribers but sends every message to all, limiting workload distribution. Kafka resolves this by using a partitioned log model. A log is an ordered record sequence, divided into partitions that can be assigned to different subscribers. This design allows multiple consumers to process the same topic while balancing the workload efficiently. Additionally, Kafka supports replayability, enabling independent applications to read and reprocess data streams at their own pace, ensuring flexibility, scalability, and reliability in real-time data processing.&lt;/p&gt;
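&lt;p&gt;As a rough illustration of the partitioned log model (not Kafka's actual implementation, whose default partitioner uses murmur2 hashing), keyed partition assignment can be sketched in Python:&lt;/p&gt;

```python
# Simplified sketch of keyed partition assignment. Kafka's default
# partitioner really uses murmur2 hashing; the byte-sum here is a stand-in.
def assign_partition(key: str, num_partitions: int) -> int:
    digest = sum(key.encode("utf-8"))  # deterministic toy hash
    return digest % num_partitions

# Events sharing a key always land in the same partition,
# which is how Kafka preserves per-key ordering.
events = [("user-42", "click"), ("user-7", "view"), ("user-42", "search")]
placements = [(key, assign_partition(key, 3)) for key, _ in events]

assert placements[0][1] == placements[2][1]  # same key, same partition
```

&lt;p&gt;Because each partition is an ordered log, the consumers in a group can split the partitions among themselves while per-key order is still kept.&lt;/p&gt;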

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ewvipi9k56x36fnnh45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ewvipi9k56x36fnnh45.png" alt="producer-subscriber" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Kafka Concepts summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event&lt;/strong&gt;: A record of something that happened (key, value, timestamp, headers).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Producer&lt;/strong&gt;: Writes events to topics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer&lt;/strong&gt;: Reads events from topics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topic&lt;/strong&gt;: Stores events (like a folder).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition&lt;/strong&gt;: Subset of a topic; preserves order for events with the same key.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Multiple copies of partitions for fault tolerance (commonly 3).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention&lt;/strong&gt;: Events kept for a configurable time, not deleted on read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzqgkaqf16rwtwbpzu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkzqgkaqf16rwtwbpzu9.png" alt="Kafka concepts" width="764" height="342"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A simple Quickstart project (via Docker)
&lt;/h2&gt;

&lt;p&gt;Here is a simple quickstart project in Python to stream BTC/USDT price data from the Binance API into Kafka, and then consume it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisite&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kafka &amp;amp; ZooKeeper (example Docker Compose snippet):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Install dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kafka-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Producer: Stream BTC price from Binance to Kafka&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# producer.py
import time
import requests
from kafka import KafkaProducer
import json

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"

def get_btc_price():
    url = "https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT"
    response = requests.get(url).json()
    return response

if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8")
    )

    while True:
        price_data = get_btc_price()
        producer.send(KAFKA_TOPIC, price_data)
        print(f"Sent: {price_data}")
        time.sleep(2)  # fetch price every 2 seconds


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Consumer: Read BTC price from Kafka&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# consumer.py
from kafka import KafkaConsumer
import json

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"

if __name__ == "__main__":
    consumer = KafkaConsumer(
        KAFKA_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
        enable_auto_commit=True
    )

    for message in consumer:
        print(f"Received: {message.value}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the Project
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start Kafka + ZooKeeper on Docker:
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Run Producer:
&lt;code&gt;python producer.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run Consumer:
&lt;code&gt;python consumer.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll see live BTC/USDT prices flowing from Binance --&amp;gt; Kafka --&amp;gt; Consumer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/walnold/kafkaTest" rel="noopener noreferrer"&gt;Github code&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Kafka bridges the gap between traditional queuing and publish-subscribe systems, offering a scalable, fault-tolerant, and high-performance solution for real-time data streaming. Its partitioned log architecture enables parallel processing while ensuring data consistency and replayability, making it an essential tool for modern data-driven applications. From powering Uber’s trip analytics to LinkedIn’s activity feeds, Kafka has proven its reliability in large-scale production environments. As organizations continue to embrace event-driven architectures, mastering Kafka will be a valuable skill for engineers seeking to build resilient, future-ready data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apache Kafka Documentation: &lt;a href="https://kafka.apache.org/documentation/" rel="noopener noreferrer"&gt;https://kafka.apache.org/documentation/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future#:~:text=Use%20Cases%20at%20LinkedIn&amp;amp;text=These%20are%20then%20collected%20and,for%20our%20distributed%20database%20Espresso." rel="noopener noreferrer"&gt;Kafka at LinkedIn: Current and Future&lt;/a&gt; (Mammad Zadeh, 2015)  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is Apache Kafka? &lt;a href="https://www.ibm.com/think/topics/apache-kafka" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/apache-kafka&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>zookeeper</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introduction to Docker and Docker Compose: Beginners Guide</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Tue, 26 Aug 2025 20:47:59 +0000</pubDate>
      <link>https://dev.to/walnold/introduction-to-docker-and-docker-compose-beginners-guide-3k2h</link>
      <guid>https://dev.to/walnold/introduction-to-docker-and-docker-compose-beginners-guide-3k2h</guid>
      <description>&lt;h2&gt;
  
  
  What is Docker
&lt;/h2&gt;

&lt;p&gt;Docker is an open source platform that enables developers and engineers to build, deploy, run and manage containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containers&lt;/strong&gt; are standardized, executable components that combine application source code with the operating system libraries and dependencies required to run that code in any environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Containers enable multiple application components to share the resources of a single instance of the host operating system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fbkx0cjqt8hz9jyalc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fbkx0cjqt8hz9jyalc9.png" alt="Containerization Image" width="570" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use Docker
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Docker solves the infamous &lt;em&gt;"It works on my machine"&lt;/em&gt; problem, which occurs when an application runs on the developer's laptop but breaks when deployed to a server or the cloud. Docker packages everything the app needs into a &lt;strong&gt;container image&lt;/strong&gt;, and that image runs the same way on any machine (laptop, staging server, or production in the cloud).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight&lt;/strong&gt;: Docker containers share the host OS kernel; they don't need to boot an entire OS each time, as is the case with traditional &lt;em&gt;virtual machines&lt;/em&gt;. As a result:
- Containers start quickly.
- You save cost on hardware and cloud resources.
- You can run many containers on a single machine.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable&lt;/strong&gt;: With Docker you can run multiple containers of the same app behind a load balancer, adding containers (&lt;strong&gt;scale up&lt;/strong&gt;) when demand increases or removing containers (&lt;strong&gt;scale down&lt;/strong&gt;) when demand decreases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast Deployment&lt;/strong&gt;: With Docker, you build an image once, and starting a new container from it is an automated, repeatable process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Terms and Tools within docker Architecture
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Docker Host&lt;/em&gt;: The physical or virtual machine running a &lt;strong&gt;Docker Engine&lt;/strong&gt;-compatible operating system such as Linux.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Engine&lt;/em&gt;: A client/server application that consists of the &lt;strong&gt;Docker Daemon&lt;/strong&gt;, a &lt;strong&gt;Docker API&lt;/strong&gt; that interacts with the daemon, and a &lt;strong&gt;Docker CLI&lt;/strong&gt; that talks to the daemon.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Daemon&lt;/em&gt;: A service that creates and manages Docker images using commands from the client.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Client&lt;/em&gt;: Provides the Command Line Interface (CLI) that uses the &lt;strong&gt;Docker API&lt;/strong&gt; to communicate with the &lt;strong&gt;Docker Daemon&lt;/strong&gt; over a Unix socket or a network interface.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Objects&lt;/em&gt;: Components of a Docker deployment that help package and distribute applications. They include images, containers, networks, plugins, and volumes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Container&lt;/em&gt;: The live, running instance of a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Image&lt;/em&gt;: Contains executable application source code together with all the tools, libraries, and dependencies the application needs to run as a container.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Build&lt;/em&gt;: A command with tools and features for creating a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Dockerfile&lt;/em&gt;: A simple text file containing the list of instructions the Docker Engine runs to assemble a container image.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Hub&lt;/em&gt;: A public repository of Docker images.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Docker Compose&lt;/em&gt;: A tool for managing multi-container applications where all containers run on the same Docker host.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Docker Installation
&lt;/h2&gt;

&lt;p&gt;On Ubuntu/Debian:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install docker.io -y
sudo systemctl enable docker --now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker version&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Running your first container
&lt;/h2&gt;



&lt;p&gt;&lt;code&gt;docker run hello-world&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;👆Docker pulls the image from Docker Hub and runs it inside a container&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic docker commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull an image from Docker Hub
docker pull ubuntu

# Run a container
docker run -it ubuntu bash

# List running containers
docker ps

# List all containers (including stopped)
docker ps -a

# Stop a container
docker stop &amp;lt;container_id&amp;gt;

# Remove a container
docker rm &amp;lt;container_id&amp;gt;

# Remove an image
docker rmi &amp;lt;image_id&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building an Image
&lt;/h2&gt;

&lt;p&gt;Create a file called &lt;em&gt;Dockerfile&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use Python base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy files
COPY . /app

# Install dependencies
RUN pip install flask

# Run the app (app.py must bind to 0.0.0.0 for the published port to be reachable)
CMD ["python", "app.py"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
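
&lt;p&gt;The Dockerfile above expects an &lt;em&gt;app.py&lt;/em&gt;; a minimal sketch (hypothetical, ending with an &lt;code&gt;app.run(host="0.0.0.0", port=5000)&lt;/code&gt; entry point so the published port is reachable) could look like:&lt;/p&gt;

```python
# app.py - minimal Flask app assumed by the Dockerfile above (hypothetical)
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello from Docker!"

# The script would end with:
#   app.run(host="0.0.0.0", port=5000)
# binding to 0.0.0.0 so `docker run -p 5000:5000` can reach the server.
```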



&lt;p&gt;&lt;strong&gt;Build and run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t myapp .
docker run -p 5000:5000 myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Managing Multi-Container Applications using Docker-compose
&lt;/h2&gt;

&lt;p&gt;In real projects, applications often need multiple services working together. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A web application (Flask, Django, or Node.js)&lt;/li&gt;
&lt;li&gt;A database (PostgreSQL, MongoDB)&lt;/li&gt;
&lt;li&gt;A cache (Redis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running and connecting each container manually with &lt;code&gt;docker run&lt;/code&gt; can get messy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Docker Compose
&lt;/h2&gt;

&lt;p&gt;We use Docker Compose to define and manage multi-container applications using a single YAML file (&lt;em&gt;docker-compose.yml&lt;/em&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  To run Docker Compose, just run:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;docker-compose up&lt;/code&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Docker Compose File
&lt;/h3&gt;

&lt;p&gt;Here is a simple example: a Flask app with a PostgreSQL database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.8'

services:
  web:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;web -&amp;gt; Your Flask app (built from the Dockerfile in the current directory).&lt;/li&gt;
&lt;li&gt;db -&amp;gt; A PostgreSQL database running in its own container.&lt;/li&gt;
&lt;li&gt;depends_on -&amp;gt; Ensures the database starts before the web app.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Running docker compose&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;docker-compose up&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👆This launches both containers (web + db)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To Stop them, run:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;docker-compose down&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why docker compose is useful
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Simplifies running multiple containers.&lt;/li&gt;
&lt;li&gt;Keeps your setup reproducible and shareable.&lt;/li&gt;
&lt;li&gt;Handles networking automatically (services can talk to each other by name, e.g., &lt;code&gt;db&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
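
&lt;p&gt;For example, because Compose puts services on a shared network, the web container could reach PostgreSQL with a connection string built from the service name (a hypothetical sketch using the credentials from the sample file above):&lt;/p&gt;

```python
# Hypothetical connection URL for the Compose setup above: the service
# name "db" resolves to the database container inside the Compose network.
DB_URL = "postgresql://{user}:{pw}@{host}:{port}/{db}".format(
    user="myuser", pw="mypassword", host="db", port=5432, db="mydb"
)
print(DB_URL)  # postgresql://myuser:mypassword@db:5432/mydb
```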

</description>
      <category>docker</category>
      <category>containers</category>
      <category>microservices</category>
      <category>compose</category>
    </item>
    <item>
      <title>15 Data Engineering Core Concepts Simplified</title>
      <dc:creator>Walter Ndung'u</dc:creator>
      <pubDate>Sun, 10 Aug 2025 20:07:05 +0000</pubDate>
      <link>https://dev.to/walnold/15-data-engineering-core-concepts-simplified-5fo3</link>
      <guid>https://dev.to/walnold/15-data-engineering-core-concepts-simplified-5fo3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s world of Big Data, the term data engineering is everywhere — often surrounded by a cloud of technical buzzwords. These terms can feel overwhelming, especially if you’re new to the data ecosystem.&lt;/p&gt;

&lt;p&gt;This article aims to break down these concepts into &lt;strong&gt;simple, relatable explanations&lt;/strong&gt; so you can understand them without needing a technical background.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Data Engineering?
&lt;/h3&gt;

&lt;p&gt;Data engineering is the discipline of &lt;strong&gt;designing, building, and maintaining data pipelines&lt;/strong&gt; that ensure data can move reliably from its source to where it’s needed. These pipelines handle the &lt;strong&gt;movement, transformation, and storage&lt;/strong&gt; of data, making it ready for analysis and decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts of Data Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Batch vs Streaming Ingestion&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Batch Ingestion&lt;/em&gt; is a process whereby data is collected and processed in large, discrete chunks at specific times, usually scheduled.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Stream Ingestion&lt;/em&gt; is the continuous collection of data as it arrives. Data is processed individually as it enters the system.  &lt;/p&gt;
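
&lt;p&gt;The difference can be sketched with the same events processed both ways (a toy illustration, not a real ingestion framework):&lt;/p&gt;

```python
# Toy contrast between batch and streaming ingestion over the same events.
events = [1, 2, 3, 4]

# Batch: collect everything first, then process once on a schedule.
batch_total = sum(events)

# Streaming: process each event as it arrives, keeping running state.
running_totals = []
total = 0
for event in events:
    total += event
    running_totals.append(total)

assert batch_total == 10
assert running_totals == [1, 3, 6, 10]
```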

&lt;p&gt;&lt;strong&gt;2. Change Data Capture (CDC)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Change Data Capture&lt;/em&gt; is a technique that identifies and tracks changes (inserts, updates, deletes) made to data in a database and then delivers those changes in real time to a downstream process or system, such as real-time data integration or a data warehouse.&lt;/p&gt;
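
&lt;p&gt;A toy sketch of the idea: replaying a captured change log against a downstream replica (hypothetical operations, not a real CDC tool):&lt;/p&gt;

```python
# Toy CDC replay: apply captured inserts/updates/deletes to a replica.
change_log = [
    ("insert", "row-1", {"name": "Walter"}),
    ("update", "row-1", {"name": "Walter N."}),
    ("insert", "row-2", {"name": "Ada"}),
    ("delete", "row-2", None),
]

replica = {}
for op, key, payload in change_log:
    if op == "delete":
        replica.pop(key, None)  # deletes remove the row downstream
    else:
        replica[key] = payload  # inserts and updates upsert the row

assert replica == {"row-1": {"name": "Walter N."}}
```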

&lt;p&gt;&lt;strong&gt;3. Idempotency&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Idempotency&lt;/em&gt; is a property of an operation whereby executing it multiple times with the same input produces the same result. For example, when creating a record, pressing the save button twice saves only one record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. OLTP vs OLAP&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Online Transaction Processing (OLTP)&lt;/em&gt; is a form of data processing that involves a large number of small, concurrent transactions. Examples include online banking, shopping, order entry, and sending text messages.&lt;br&gt;
&lt;em&gt;Online Analytical Processing (OLAP)&lt;/em&gt; is a way of storing and querying data so that you can quickly analyze it from different dimensions without having to run slow, complex queries on raw transactional data.&lt;br&gt;
Scenario: a company's sales data&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In OLTP: Every single sale is recorded (like “Sold 3 units of product X in Nairobi on Aug 10, 2025”).  &lt;/p&gt;

&lt;p&gt;In OLAP: Data is reorganized so you can quickly answer questions like:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;What were the total sales for product X by month for the past 2 years?&lt;/li&gt;
&lt;li&gt;Which region sold the most in Q2 2025?&lt;/li&gt;
&lt;li&gt;How do sales in Nairobi compare to Kisumu over time?
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;
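
&lt;p&gt;The OLAP questions above amount to rollups over the raw OLTP rows; a toy Python aggregation makes the contrast concrete:&lt;/p&gt;

```python
# Toy OLAP-style rollup over raw OLTP sale rows: totals by (product, month).
from collections import defaultdict

sales = [  # (product, region, date, units) - individual transactions
    ("X", "Nairobi", "2025-08-10", 3),
    ("X", "Kisumu", "2025-08-12", 5),
    ("X", "Nairobi", "2025-07-01", 2),
]

totals = defaultdict(int)
for product, region, date, units in sales:
    totals[(product, date[:7])] += units  # bucket by month, e.g. "2025-08"

assert totals[("X", "2025-08")] == 8
assert totals[("X", "2025-07")] == 2
```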

&lt;p&gt;&lt;strong&gt;5. Columnar vs Row-based Storage&lt;/strong&gt; &lt;br&gt;
In row-based Storage, all values of a single record are stored contiguously on disk. This form of storage is efficient for transactional workloads (Inserting, updating, or deleting rows) and Write-intensive operations. Row-based storage is less efficient for queries that need to access only some columns across many rows, as the entire row must be read from the disk, leading to unnecessary I/O.&lt;/p&gt;

&lt;p&gt;In Columnar Storage, data is stored column by column, with all values for a single column stored contiguously on a disk. This form of storage is highly efficient for Analytical queries that involve aggregations, filtering, and analysis across a large dataset, as only the required columns are read from disk. However, it is less efficient for Transactional workloads as modifying a single row requires updates across multiple column blocks.&lt;/p&gt;
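
&lt;p&gt;A small sketch of the two layouts (illustrative only):&lt;/p&gt;

```python
# The same records in a row-based layout vs a columnar layout.
rows = [
    {"id": 1, "crop": "maize", "yield": 4200},
    {"id": 2, "crop": "tea", "yield": 6100},
]

# Columnar: each column's values are stored contiguously.
columns = {
    "id": [1, 2],
    "crop": ["maize", "tea"],
    "yield": [4200, 6100],
}

# An analytical query ("average yield") touches one column here,
# instead of reading every full row as in the row-based layout.
avg_yield = sum(columns["yield"]) / len(columns["yield"])
assert avg_yield == 5150.0
```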

&lt;p&gt;&lt;strong&gt;6. Partitioning&lt;/strong&gt;&lt;br&gt;
In data engineering, partitioning means splitting a large dataset into smaller, more manageable parts to speed up queries and reduce resource usage. Instead of scanning an entire table or file, queries read only the relevant partitions.&lt;/p&gt;

&lt;p&gt;Common types of partitioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Horizontal partitioning: Splitting rows based on a column’s value (e.g., date, region).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vertical partitioning: Splitting columns into separate tables or files to reduce data scanned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hash partitioning: Using a hash function on a key (e.g., customer ID) to evenly distribute data across partitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
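&lt;p&gt;Hash partitioning is the easiest of the three to sketch in code. A minimal example, with a hypothetical customer-ID key and four partitions, using a stable hash so the routing is consistent across runs:&lt;/p&gt;

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(customer_id):
    # Stable hash: unlike Python's built-in hash(), hashlib gives the
    # same digest on every run, so a key always routes to one partition.
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for customer in ["cust-001", "cust-002", "cust-003", "cust-004"]:
    print(customer, "is in partition", partition_for(customer))
```

&lt;p&gt;Horizontal partitioning works the same way, except the routing key is a natural column value (a date or region) rather than a hash bucket.&lt;/p&gt;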

&lt;p&gt;&lt;strong&gt;7. ETL vs ELT&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;ETL&lt;/strong&gt; (Extract, Transform, Load): Data is extracted from source systems, transformed (cleaned, enriched, aggregated) in a separate processing environment, and then loaded into the target storage (e.g., a data warehouse).&lt;/p&gt;

&lt;p&gt;Good when transformations must happen before data enters storage.&lt;/p&gt;

&lt;p&gt;Often used with on-premise data warehouses or systems with strict schema requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT&lt;/strong&gt; (Extract, Load, Transform): Data is extracted from sources, loaded directly into the target storage first (often a cloud data warehouse), and transformed inside the storage using its processing power.&lt;/p&gt;

&lt;p&gt;Good when the storage is powerful enough to handle transformations at scale (e.g., Snowflake, BigQuery).&lt;/p&gt;

&lt;p&gt;Allows storing raw data for flexibility and reprocessing later.&lt;/p&gt;
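&lt;p&gt;The difference between the two patterns is only in &lt;em&gt;where&lt;/em&gt; the transform step runs. A minimal sketch with hypothetical records, using plain lists to stand in for the warehouse and the data lake:&lt;/p&gt;

```python
raw = ["  alice ", "BOB", "alice"]

def transform(records):
    # Clean: strip whitespace, normalize case, drop duplicates.
    return sorted(set(r.strip().lower() for r in records))

# ETL: transform first, so only clean data ever lands in the warehouse.
warehouse = []
warehouse.extend(transform(raw))
print(warehouse)  # ['alice', 'bob']

# ELT: load the raw data first, then transform inside the store later;
# the raw copy is kept, so the transform can be rerun with new logic.
lake = list(raw)
clean_view = transform(lake)
print(clean_view)  # ['alice', 'bob']
```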

&lt;p&gt;&lt;strong&gt;8. CAP Theorem (Brewer's Theorem)&lt;/strong&gt;&lt;br&gt;
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition Tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 3 Properties&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency (C)&lt;/strong&gt;: Every node in the system sees the same data at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability (A)&lt;/strong&gt;: Every request receives a response, though not necessarily the most recent data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Tolerance (P)&lt;/strong&gt;: The system continues to operate even when messages between nodes are lost or delayed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Trade-off&lt;/strong&gt;&lt;br&gt;
When a network partition happens, you must choose between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CA&lt;/strong&gt; → Consistency + Availability (no Partition Tolerance) → works only if network never fails (rare in real distributed systems).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CP&lt;/strong&gt; → Consistency + Partition Tolerance (may sacrifice availability during a partition).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AP&lt;/strong&gt; → Availability + Partition Tolerance (may serve stale data to keep responding).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;9. Windowing in Streaming&lt;/strong&gt;&lt;br&gt;
Windowing is a stream-processing technique that takes an infinite flow of events (such as logs, sensor readings, or transactions) and breaks it into finite chunks by time or count so you can run aggregations like sum, average, or count. Windowing provides the logical boundaries that allow an otherwise endless stream to produce results.&lt;br&gt;
&lt;strong&gt;10. DAG and Workflow Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG (Directed Acyclic Graph): A way to represent a workflow where steps run in a defined order and no path ever leads back to a previous step.&lt;/li&gt;
&lt;li&gt;Workflow Orchestration: The process of automating, coordinating, and managing multiple tasks (DAGs) and systems to execute complex business processes. Tools used for orchestration include Apache Airflow, Dagster, and Luigi.&lt;/li&gt;
&lt;/ul&gt;
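&lt;p&gt;An orchestrator's core job of ordering DAG tasks is a topological sort, which Python's standard library can do directly. A sketch with a hypothetical extract/clean/load pipeline (the task names are invented for illustration):&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on:
# extract feeds clean, and clean feeds both load steps.
dag = {
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "load_dashboard": {"clean"},
}

# An execution order that respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

&lt;p&gt;Tools like Airflow build on the same idea, adding scheduling, retries, and monitoring around the dependency graph.&lt;/p&gt;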

&lt;p&gt;&lt;strong&gt;11. Retry Logic &amp;amp; Dead Letter Queues&lt;/strong&gt;&lt;br&gt;
Data engineering and distributed systems need ways to handle failures without losing data. Among these are retry logic and dead letter queues (DLQs).&lt;br&gt;
&lt;strong&gt;Retry Logic&lt;/strong&gt;: The process of automatically reattempting a failed task or message after a delay, usually with a limit on how many retries are allowed. It is useful for transient issues such as network glitches, API timeouts, or locked resources.&lt;br&gt;
&lt;strong&gt;Dead Letter Queue&lt;/strong&gt;: A special holding queue for messages or events that still fail after all retries. It prevents endless retry loops and preserves failed data for investigation.&lt;/p&gt;
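&lt;p&gt;The two mechanisms combine naturally: retry a few times, and only then park the message in the DLQ. A minimal sketch, with a deliberately failing handler and an in-memory list standing in for a real queue (all names are hypothetical):&lt;/p&gt;

```python
import time

dead_letter_queue = []

def process_with_retry(message, handler, max_retries=3, delay=0.01):
    # Reattempt the handler up to max_retries times, then park the
    # message in the DLQ so it is preserved for investigation.
    last_error = None
    for attempt in range(max_retries):
        try:
            return handler(message)
        except Exception as exc:
            last_error = str(exc)
            time.sleep(delay)  # fixed delay; real systems often back off exponentially
    dead_letter_queue.append({"message": message, "error": last_error})
    return None

def flaky_handler(message):
    raise RuntimeError("downstream timeout")  # always fails, for demonstration

process_with_retry({"id": 1}, flaky_handler)
print(dead_letter_queue)  # the failed message, with its last error, is kept
```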

&lt;p&gt;&lt;strong&gt;12. Backfilling &amp;amp; Reprocessing&lt;/strong&gt;&lt;br&gt;
Backfilling involves reprocessing historical data to correct errors, accommodate new data structures, or integrate new data sources.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;: If your sales database adds a new “discount” column, you might backfill past records so that older data also includes the correct discount information.&lt;/p&gt;

&lt;p&gt;Reprocessing is the act of running data through a processing pipeline again to correct inaccuracies, apply updated transformation logic, or ensure completeness.&lt;br&gt;
&lt;em&gt;Example&lt;/em&gt;: If you discover an error in your tax calculation logic, you might reprocess the past month’s sales data using the corrected formula.&lt;/p&gt;
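&lt;p&gt;The discount-column backfill from the example above can be sketched in a few lines; the order records are hypothetical, and a default of 0 stands in for "no discount recorded":&lt;/p&gt;

```python
# Historical records written before the "discount" column existed,
# mixed with a newer record that already has it.
orders = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250, "discount": 25},
]

# Backfill: give every older record the new column with a safe default,
# leaving records that already have a value untouched.
for order in orders:
    order.setdefault("discount", 0)

print(orders)
```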

&lt;p&gt;&lt;strong&gt;13. Data Governance&lt;/strong&gt;&lt;br&gt;
Data governance is the framework of rules, processes, and responsibilities that ensures data is accurate, secure, consistent, and used appropriately throughout its life cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose in Data Engineering&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality assurance&lt;/strong&gt; – Making sure the data pipelines deliver clean, reliable data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security &amp;amp; privacy&lt;/strong&gt; – Controlling who can access or modify data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance&lt;/strong&gt; – Meeting legal and regulatory requirements (e.g., GDPR, HIPAA).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lineage &amp;amp; documentation&lt;/strong&gt; – Tracking where data came from, how it was transformed, and where it’s used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standardization&lt;/strong&gt; – Ensuring consistent formats, naming conventions, and definitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;14. Time Travel &amp;amp; Data Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Time travel&lt;/strong&gt; is the ability to view a dataset as it existed at a specific point in the past.&lt;br&gt;&lt;br&gt;
It can be used to recover accidentally deleted or modified data, to audit historical states of data, and to compare current and past datasets.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Data Versioning&lt;/strong&gt; is the practice of storing and managing multiple versions of a dataset over time. Its purpose is to track changes to data and to enable rollback to previous versions.&lt;/p&gt;
&lt;/blockquote&gt;
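&lt;p&gt;The core idea behind both features is keeping immutable snapshots of the data per version. A toy sketch (systems like Delta Lake or Snowflake implement this far more efficiently with logs and metadata, not full copies; the class and data here are invented for illustration):&lt;/p&gt;

```python
import copy

class VersionedDataset:
    # Minimal versioning: each commit snapshots the data,
    # and any past version stays readable ("time travel").
    def __init__(self):
        self.versions = []

    def commit(self, data):
        self.versions.append(copy.deepcopy(data))
        return len(self.versions) - 1  # the new version number

    def at_version(self, n):
        # Read the dataset as it existed at version n.
        return self.versions[n]

ds = VersionedDataset()
v0 = ds.commit({"rows": 100})
v1 = ds.commit({"rows": 150})
print(ds.at_version(v0))  # {'rows': 100}
```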

&lt;p&gt;&lt;strong&gt;15. Distributed Processing&lt;/strong&gt;&lt;br&gt;
Distributed processing is the method of breaking a large computing task into smaller parts, running those parts simultaneously across multiple machines or processors, and then combining the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose in Data Engineering&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Handle datasets too large for a single machine’s memory or storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process data faster by working in parallel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve fault tolerance — if one machine fails, others can continue.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
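&lt;p&gt;The split-compute-combine pattern can be sketched on a single machine with a worker pool standing in for cluster nodes; frameworks like Spark apply the same shape across many machines. The chunk size and worker count below are arbitrary choices for illustration:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker computes its share independently.
    return sum(chunk)

numbers = list(range(1, 1001))
# Split the work into four chunks, as a cluster splits data across nodes.
chunks = [numbers[i:i + 250] for i in range(0, 1000, 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))  # process chunks in parallel

# Combine the partial results into the final answer.
print(sum(partials))  # 500500, the same answer as a single sum(numbers)
```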

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>bigdata</category>
      <category>dag</category>
    </item>
  </channel>
</rss>
