<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LuxDevHQ</title>
    <description>The latest articles on DEV Community by LuxDevHQ (@luxdevhq).</description>
    <link>https://dev.to/luxdevhq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4798%2Fdb1d01d7-dfac-4ccd-8ad8-a8ef16600073.jpg</url>
      <title>DEV Community: LuxDevHQ</title>
      <link>https://dev.to/luxdevhq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/luxdevhq"/>
    <language>en</language>
    <item>
      <title>Supermarket Sales and Customer Insights Dashboard — A Practical Power BI Project Guide.</title>
      <dc:creator>Mwenda Harun Mbaabu</dc:creator>
      <pubDate>Wed, 04 Feb 2026 12:21:21 +0000</pubDate>
      <link>https://dev.to/luxdevhq/supermarket-sales-and-customer-insights-dashboard-a-practical-power-bi-project-guide-4o7p</link>
      <guid>https://dev.to/luxdevhq/supermarket-sales-and-customer-insights-dashboard-a-practical-power-bi-project-guide-4o7p</guid>
      <description>&lt;p&gt;This technical article walks you step by step through a &lt;strong&gt;beginner-friendly Power BI project&lt;/strong&gt; using a real-world supermarket transactions dataset. By the end of this guide, you will know &lt;strong&gt;where to download the data&lt;/strong&gt;, &lt;strong&gt;how to prepare it&lt;/strong&gt;, and &lt;strong&gt;how to build an interactive dashboard&lt;/strong&gt; that answers real business questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Project Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this project, you will analyze supermarket transaction data and transform it into an &lt;strong&gt;interactive Power BI dashboard&lt;/strong&gt;. The focus is not just on visuals, but on &lt;strong&gt;answering business questions clearly and professionally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You will act as a &lt;strong&gt;Junior Data Analyst&lt;/strong&gt;, converting raw transaction records into insights that business stakeholders can explore without using spreadsheets.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Dataset Download&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can download the dataset used in this project &lt;a href="https://github.com/LuxDevHQ/Data0" rel="noopener noreferrer"&gt;here&lt;/a&gt; or from &lt;a href="https://github.com/LuxDevHQ/Data" rel="noopener noreferrer"&gt;https://github.com/LuxDevHQ/Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three years of supermarket transaction data
&lt;/li&gt;
&lt;li&gt;Multiple store locations (Australia)
&lt;/li&gt;
&lt;li&gt;Individual transaction-level records
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Columns include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product Name
&lt;/li&gt;
&lt;li&gt;Quantity Sold
&lt;/li&gt;
&lt;li&gt;Total Sales Amount
&lt;/li&gt;
&lt;li&gt;Payment Method
&lt;/li&gt;
&lt;li&gt;Customer Type (Member / Non-Member)
&lt;/li&gt;
&lt;li&gt;Store Location
&lt;/li&gt;
&lt;li&gt;Transaction Date
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;&lt;a href="https://github.com/LuxDevHQ/Data0" rel="noopener noreferrer"&gt;Download the dataset&lt;/a&gt;&lt;/strong&gt;  &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Business Questions This Project Answers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before opening Power BI, it is important to understand &lt;strong&gt;what questions the dashboard should answer&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Sales Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What is the total sales amount across all stores?&lt;/li&gt;
&lt;li&gt;How do sales trend over time (monthly and yearly)?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Product Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Which products generate the highest revenue?&lt;/li&gt;
&lt;li&gt;How do apple sales compare across different payment methods?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Customer Behavior&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How much do members vs non-members spend?&lt;/li&gt;
&lt;li&gt;Which customer type contributes more to total revenue?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Payment Method Insights&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Which payment method is used most frequently?&lt;/li&gt;
&lt;li&gt;How does revenue differ by payment method?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Store Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Which store location generates the highest sales?&lt;/li&gt;
&lt;li&gt;How does customer behavior vary by store?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions will guide every step of the analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Load the Data into Power BI&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Power BI Desktop&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Get Data → Text/CSV&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select &lt;code&gt;supermarket_transactions.csv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Load the data into Power BI&lt;/li&gt;
&lt;li&gt;Review column names and preview the data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this stage, do not build visuals yet. First, ensure the data is correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Data Cleaning in Power Query&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Open &lt;strong&gt;Transform Data&lt;/strong&gt; to enter Power Query.&lt;/p&gt;

&lt;p&gt;Perform the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove unnecessary or duplicate columns&lt;/li&gt;
&lt;li&gt;Fix incorrect data types:

&lt;ul&gt;
&lt;li&gt;Dates → Date&lt;/li&gt;
&lt;li&gt;Sales &amp;amp; Quantity → Decimal / Whole Number&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Rename columns for clarity (e.g. &lt;code&gt;Total Sales Amount&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;Check for missing or inconsistent values&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Clean data is critical. Poor data quality leads to misleading dashboards.&lt;/p&gt;
&lt;/blockquote&gt;
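
&lt;p&gt;If you want a quick, optional way to spot these issues before clicking through Power Query, you can profile the raw CSV with a few lines of Python. This is a minimal sketch, assuming the file is named &lt;code&gt;supermarket_transactions.csv&lt;/code&gt; and pandas is installed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional pre-check of the raw CSV before cleaning in Power Query.
# Assumes the file is named supermarket_transactions.csv and pandas is installed.
import pandas as pd

df = pd.read_csv("supermarket_transactions.csv")

print(df.dtypes)              # spot columns loaded with the wrong type
print(df.isna().sum())        # count missing values per column
print(df.duplicated().sum())  # count fully duplicated rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;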




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Data Modeling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the data is clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm all columns have correct data types&lt;/li&gt;
&lt;li&gt;Ensure the table structure is logical&lt;/li&gt;
&lt;li&gt;No complex relationships are required for this project (single-table model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project focuses on &lt;strong&gt;analysis and visualization&lt;/strong&gt;, not complex modeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Create Beginner-Level DAX Measures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Create the following measures in &lt;strong&gt;Model view&lt;/strong&gt; or &lt;strong&gt;Report view&lt;/strong&gt;:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Total Sales =
SUM(supermarket_transactions[Total Sales Amount])

Total Quantity Sold =
SUM(supermarket_transactions[Quantity])

Average Transaction Value =
AVERAGE(supermarket_transactions[Total Sales Amount])

// Same expression as Total Sales; it returns per-customer-type values when
// Customer Type is placed on a visual axis or slicer.
Sales by Customer Type =
SUM(supermarket_transactions[Total Sales Amount])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ These measures will power your KPI cards and charts.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Build the Power BI Dashboard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Create a &lt;strong&gt;1–2 page interactive Power BI dashboard&lt;/strong&gt; using the visuals listed below. The dashboard should be designed for &lt;strong&gt;business users&lt;/strong&gt;, not technical users.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Required Visuals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;KPI Cards&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Sales&lt;/li&gt;
&lt;li&gt;Total Quantity Sold&lt;/li&gt;
&lt;li&gt;Average Transaction Value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Charts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bar Chart:&lt;/strong&gt; Sales by Product&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bar Chart:&lt;/strong&gt; Sales by Store Location&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pie or Column Chart:&lt;/strong&gt; Payment Method Distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line Chart:&lt;/strong&gt; Sales Trend Over Time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Slicers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store Location&lt;/li&gt;
&lt;li&gt;Product&lt;/li&gt;
&lt;li&gt;Customer Type&lt;/li&gt;
&lt;li&gt;Date&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🎯 &lt;strong&gt;Design Principle:&lt;/strong&gt; The goal is &lt;strong&gt;clarity, not decoration&lt;/strong&gt;. Every visual should answer a specific business question.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Validate Your Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before submitting your work, verify the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm all totals match your &lt;strong&gt;Excel analysis&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Test all slicers and filters for correct behavior&lt;/li&gt;
&lt;li&gt;Check visual titles, labels, and number formatting&lt;/li&gt;
&lt;li&gt;Ensure visuals respond correctly to user interactions&lt;/li&gt;
&lt;/ul&gt;
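
&lt;p&gt;If you prefer a script to Excel for the cross-check, a short pandas sketch can recompute the headline figures directly from the raw CSV. The column names below are assumptions; adjust them to match your file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional cross-check: recompute the KPI totals straight from the raw CSV.
# Column names "Total Sales Amount" and "Quantity Sold" are assumptions.
import pandas as pd

df = pd.read_csv("supermarket_transactions.csv")

print("Total Sales:", round(df["Total Sales Amount"].sum(), 2))
print("Total Quantity Sold:", int(df["Quantity Sold"].sum()))
print("Average Transaction Value:", round(df["Total Sales Amount"].mean(), 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The three printed values should match the Total Sales, Total Quantity Sold, and Average Transaction Value cards on your dashboard.&lt;/p&gt;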

&lt;h3&gt;
  
  
  &lt;strong&gt;Your Final Submission Should Include&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Record a &lt;strong&gt;4-minute walkthrough video&lt;/strong&gt; using &lt;strong&gt;Loom&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The video should demonstrate a &lt;strong&gt;fully functional Power BI dashboard&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Briefly explain:

&lt;ul&gt;
&lt;li&gt;The dataset used&lt;/li&gt;
&lt;li&gt;Key visuals and filters&lt;/li&gt;
&lt;li&gt;Main business insights and conclusions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Upload the recording to &lt;strong&gt;Loom&lt;/strong&gt; and copy the shareable link&lt;/li&gt;
&lt;li&gt;Submit the &lt;strong&gt;Loom video link via WhatsApp&lt;/strong&gt; to &lt;strong&gt;0796 448 232&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>A Step-by-Step Guide to Streaming Live Weather Data Using Apache Kafka and Apache Cassandra</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Wed, 16 Apr 2025 03:31:52 +0000</pubDate>
      <link>https://dev.to/luxdevhq/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-apache-cassandra-ep2</link>
      <guid>https://dev.to/luxdevhq/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-apache-cassandra-ep2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegg2c0w9mp35vusns6kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegg2c0w9mp35vusns6kj.png" alt="Weather Data using Kafka,Confluent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Delivering real-time weather data is increasingly important for applications across logistics, travel, emergency services, and consumer tools. In this tutorial, we will build a real-time weather data streaming pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenWeatherMap API to fetch weather data&lt;/li&gt;
&lt;li&gt;Apache Kafka (via Confluent Cloud) for streaming&lt;/li&gt;
&lt;li&gt;Apache Cassandra (installed on a Linux machine) for scalable storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll implement this pipeline using Python, demonstrate practical setups, and include screenshots to guide you through each step.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a running system where weather data is continuously fetched, streamed to Kafka, and written to Cassandra for querying and visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaes96vt2r97k41gaomb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaes96vt2r97k41gaomb.png" alt="Weather Data Architecture using Kafka,Confluent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;Linux Machine&lt;/li&gt;
&lt;li&gt;Kafka cluster on Confluent Cloud&lt;/li&gt;
&lt;li&gt;OpenWeatherMap API key &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up Kafka on Confluent Cloud
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to confluent.cloud&lt;/li&gt;
&lt;li&gt;Create an account (free tier available)&lt;/li&gt;
&lt;li&gt;Create a Kafka cluster&lt;/li&gt;
&lt;li&gt;Create a topic named weather-stream&lt;/li&gt;
&lt;li&gt;Generate an API Key and Secret&lt;/li&gt;
&lt;li&gt;Note the Bootstrap Server, API Key, and API Secret&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Install Cassandra on a Linux Machine
&lt;/h3&gt;

&lt;p&gt;Open your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Java (required by Cassandra)
sudo apt install openjdk-11-jdk -y

# Add Apache Cassandra repo
echo "deb https://downloads.apache.org/cassandra/debian 40x main" | sudo tee /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

sudo apt update
sudo apt install cassandra -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start and verify Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl enable cassandra
sudo systemctl start cassandra
nodetool status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Connect Cassandra to DBeaver (GUI Tool)
&lt;/h3&gt;

&lt;p&gt;DBeaver is a great visual interface for managing Cassandra.&lt;br&gt;
Steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install DBeaver&lt;/li&gt;
&lt;li&gt;Open DBeaver and click New Connection&lt;/li&gt;
&lt;li&gt;Select Apache Cassandra from the list&lt;/li&gt;
&lt;li&gt;Fill in the following:

&lt;ul&gt;
&lt;li&gt;Host: 127.0.0.1&lt;/li&gt;
&lt;li&gt;Port: 9042&lt;/li&gt;
&lt;li&gt;Username: leave blank (default auth)&lt;/li&gt;
&lt;li&gt;Password: leave blank&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Click Test Connection — you should see a success message&lt;/li&gt;
&lt;li&gt;Save and connect — you can now browse your keyspaces and tables, and run CQL visually&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 4: Create the Cassandra Table
&lt;/h3&gt;

&lt;p&gt;Once connected (or in cqlsh), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE IF NOT EXISTS weather
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE weather;

CREATE TABLE IF NOT EXISTS weather_data (
    city TEXT,
    timestamp TIMESTAMP,
    temperature FLOAT,
    humidity INT,
    PRIMARY KEY (city, timestamp)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This schema stores weather info per city, indexed by time.&lt;br&gt;
You can also run the above queries in DBeaver’s SQL editor.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Create Kafka Producer in Python
&lt;/h3&gt;

&lt;p&gt;Install Dependencies&lt;br&gt;
&lt;code&gt;pip install requests confluent-kafka python-dotenv&lt;/code&gt;&lt;br&gt;
Create a .env file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BOOTSTRAP_SERVERS=pkc-xyz.us-central1.gcp.confluent.cloud:9092
SASL_USERNAME=API_KEY
SASL_PASSWORD=API_SECRET
OPENWEATHER_API_KEY=YOUR_OPENWEATHER_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python Script: weather_producer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import json
from confluent_kafka import Producer
import time
from dotenv import load_dotenv
import os

load_dotenv()

conf = {
    'bootstrap.servers': os.getenv("BOOTSTRAP_SERVERS"),
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': os.getenv("SASL_USERNAME"),
    'sasl.password': os.getenv("SASL_PASSWORD")
}

producer = Producer(conf)
API_KEY = os.getenv("OPENWEATHER_API_KEY")
TOPIC = 'weather-stream'
CITIES = ["Nairobi", "Lagos", "Accra", "Cairo", "Cape Town", "Addis Ababa", "Dakar", "Kampala", "Algiers"]

def get_weather(city):
    url = f'https://api.openweathermap.org/data/2.5/weather?q={city}&amp;amp;appid={API_KEY}&amp;amp;units=metric'
    response = requests.get(url)
    return response.json()

def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

while True:
    for city in CITIES:
        weather = get_weather(city)
        weather['city'] = city  # Attach city explicitly
        producer.produce(TOPIC, json.dumps(weather).encode('utf-8'), callback=delivery_report)
        producer.flush()
        time.sleep(2)  # Brief pause between cities to avoid hitting the API rate limit
    time.sleep(60)  # Wait before the next full cycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script loads credentials from .env, loops through several African cities, and sends weather data to your Kafka topic.&lt;/p&gt;
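
&lt;p&gt;For reference, the only fields the consumer in the next step relies on are &lt;code&gt;main.temp&lt;/code&gt;, &lt;code&gt;main.humidity&lt;/code&gt;, and the &lt;code&gt;city&lt;/code&gt; key attached above. The sketch below shows an abridged message as a Python dict with illustrative values; the real OpenWeatherMap response contains many more fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Abridged example of one message as produced above.
# Values are illustrative; the full OpenWeatherMap response has many more fields.
sample_message = {
    "name": "Nairobi",
    "main": {"temp": 21.4, "humidity": 62},
    "city": "Nairobi",  # attached explicitly by the producer
}

# The consumer in the next step reads exactly these fields:
print(sample_message["city"], sample_message["main"]["temp"], sample_message["main"]["humidity"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;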

&lt;h3&gt;
  
  
  Step 6: Create Kafka Consumer in Python (Store Data in Cassandra)
&lt;/h3&gt;

&lt;p&gt;Install additional libraries:&lt;br&gt;
&lt;code&gt;pip install cassandra-driver&lt;/code&gt;&lt;br&gt;
Python Script: weather_consumer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from cassandra.cluster import Cluster
from confluent_kafka import Consumer
import os
from dotenv import load_dotenv

load_dotenv()

# Cassandra connection
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.set_keyspace('weather')

# Kafka configuration
conf = {
    'bootstrap.servers': os.getenv("BOOTSTRAP_SERVERS"),
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': os.getenv("SASL_USERNAME"),
    'sasl.password': os.getenv("SASL_PASSWORD"),
    'group.id': 'weather-group',
    'auto.offset.reset': 'earliest'
}

consumer = Consumer(conf)
consumer.subscribe(['weather-stream'])

print("Listening for weather data...")

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue

    data = json.loads(msg.value().decode('utf-8'))
    try:
        session.execute(
            """
            INSERT INTO weather_data (city, timestamp, temperature, humidity)
            VALUES (%s, toTimestamp(now()), %s, %s)
            """,
            (data['city'], data['main']['temp'], data['main']['humidity'])
        )
        print(f"Stored data for {data['city']}")
    except Exception as e:
        print(f"Failed to insert data: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This consumer listens to your Kafka topic, parses incoming messages, and stores them in the weather_data table.&lt;/p&gt;
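
&lt;p&gt;One practical refinement, not shown above: the loop as written never releases its Kafka or Cassandra connections. A minimal graceful-shutdown sketch, assuming the same &lt;code&gt;consumer&lt;/code&gt;, &lt;code&gt;session&lt;/code&gt;, and &lt;code&gt;cluster&lt;/code&gt; objects, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: wrap the poll loop so Ctrl+C closes connections cleanly.
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # ... same parsing and INSERT logic as above ...
except KeyboardInterrupt:
    print("Stopping consumer...")
finally:
    consumer.close()      # commits offsets and leaves the consumer group
    cluster.shutdown()    # closes the Cassandra connection pool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;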

&lt;h3&gt;
  
  
  Step 7: Querying Cassandra Data via DBeaver
&lt;/h3&gt;

&lt;p&gt;Once the consumer is running and data is flowing, open DBeaver and run a CQL query to verify the data:&lt;br&gt;
&lt;code&gt;SELECT * FROM weather.weather_data;&lt;/code&gt;&lt;br&gt;
You should now see rows of weather data streaming in from various African cities.&lt;/p&gt;
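
&lt;p&gt;If you prefer to check from Python instead of DBeaver, a short optional script using the same &lt;code&gt;cassandra-driver&lt;/code&gt; can read back a sample of the stored rows (adjust the LIMIT to taste):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Optional: query the stored weather data from Python instead of DBeaver.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('weather')

rows = session.execute("SELECT city, timestamp, temperature, humidity FROM weather_data LIMIT 20")
for row in rows:
    print(row.city, row.timestamp, row.temperature, row.humidity)

cluster.shutdown()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;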

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;You’ve successfully built a real-time data pipeline using Python, Kafka, and Cassandra. Here’s a summary of what you’ve done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up Kafka via Confluent Cloud&lt;/li&gt;
&lt;li&gt;Pulled real-time weather data using OpenWeatherMap&lt;/li&gt;
&lt;li&gt;Streamed data to Kafka via a Python producer&lt;/li&gt;
&lt;li&gt;Consumed Kafka events and stored them in Cassandra&lt;/li&gt;
&lt;li&gt;Queried Cassandra data in DBeaver&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Suggested Enhancements:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add Weather Alerts: Trigger notifications if temperatures exceed a threshold&lt;/li&gt;
&lt;li&gt;Streamlit Dashboard: Build a live dashboard showing city-by-city weather updates&lt;/li&gt;
&lt;li&gt;Data Retention Policy: Expire older data using Cassandra TTL (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Dockerize the Project: For easier deployment &lt;/li&gt;
&lt;/ul&gt;
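
&lt;p&gt;As a starting point for the retention idea above, the consumer's INSERT can attach a TTL so Cassandra expires each row automatically. A sketch with a 7-day TTL (604800 seconds), using the same &lt;code&gt;session&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; objects as the consumer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Same INSERT as the consumer, but each row expires after 7 days (604800 s).
session.execute(
    """
    INSERT INTO weather_data (city, timestamp, temperature, humidity)
    VALUES (%s, toTimestamp(now()), %s, %s)
    USING TTL 604800
    """,
    (data['city'], data['main']['temp'], data['main']['humidity'])
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;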

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Building an Automated Weather Data Pipeline with Apache Kafka and Cassandra</title>
      <dc:creator>Eric Katumo</dc:creator>
      <pubDate>Mon, 07 Apr 2025 10:02:16 +0000</pubDate>
      <link>https://dev.to/luxdevhq/building-an-automated-weather-data-pipeline-with-apache-kafka-and-cassandra-23m</link>
      <guid>https://dev.to/luxdevhq/building-an-automated-weather-data-pipeline-with-apache-kafka-and-cassandra-23m</guid>
<description>&lt;p&gt;This article walks you through building an end-to-end weather data pipeline that collects weather data from multiple African cities, processes it through Apache Kafka, and stores it in a Cassandra database for analysis. The infrastructure is hosted on Microsoft Azure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Weather data pipelines are essential components of modern environmental monitoring systems. They enable organizations to collect, process, and analyze meteorological information in real-time, facilitating better decision-making. This tutorial demonstrates how to build a simple yet robust weather data pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for data fetching and processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenWeatherMap API&lt;/strong&gt; as our data source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Kafka&lt;/strong&gt; for handling real-time data streams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Cassandra&lt;/strong&gt; for scalable data storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Azure&lt;/strong&gt; for cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this article, you will understand how these technologies are combined to construct a data pipeline that can be scaled to more complex use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Our weather data pipeline consists of the following workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Extraction&lt;/strong&gt;: Weather data is extracted for five African cities from the OpenWeatherMap API using the producer script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Streaming&lt;/strong&gt;: Weather data is streamed into a Kafka topic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Consumption&lt;/strong&gt;: A consumer script reads from the Kafka topic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt;: The data is processed and stored in a Cassandra database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Query&lt;/strong&gt;: We can query the stored data using CQL (Cassandra Query Language)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Azure Infrastructure Setup
&lt;/h2&gt;

&lt;p&gt;This project leverages Microsoft Azure for hosting our components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Azure Virtual Machine&lt;/strong&gt;: Ubuntu 20.04 LTS server to run our Python scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Security Group&lt;/strong&gt;: Configured to allow necessary traffic for Kafka and Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Managed Disks&lt;/strong&gt;: For persistent storage of Cassandra data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To set up an Azure VM for this project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;Azure Portal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create a new Virtual Machine with Ubuntu 20.04 LTS&lt;/li&gt;
&lt;li&gt;Select at least 2 vCPUs and 8GB RAM for optimal performance&lt;/li&gt;
&lt;li&gt;Configure networking to allow inbound SSH (port 22), Kafka (port 9092), and Cassandra (port 9042)&lt;/li&gt;
&lt;li&gt;Create and download SSH keys for secure access&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After provisioning, connect to your VM using SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; /path/to/your/key.pem azureuser@your-vm-ip-address
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up Confluent Cloud for Your Weather Data Pipeline
&lt;/h2&gt;

&lt;p&gt;Before diving into the implementation, we'll set up a managed Kafka environment on Confluent Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Confluent Cloud Account
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://www.confluent.io/" rel="noopener noreferrer"&gt;Confluent.io&lt;/a&gt; and click on "Get Started Free" &lt;/li&gt;
&lt;li&gt;Complete the registration process by providing your email and setting up a password&lt;/li&gt;
&lt;li&gt;Verify your email address to activate your account&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Setting Up a Kafka Cluster
&lt;/h3&gt;

&lt;p&gt;Once logged in to Confluent Cloud:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on "Create cluster" in the dashboard&lt;/li&gt;
&lt;li&gt;Choose Azure as your cloud provider and select the region closest to your Azure VM&lt;/li&gt;
&lt;li&gt;Select the "Basic" cluster type for development purposes (you can upgrade later)&lt;/li&gt;
&lt;li&gt;Name your cluster (e.g., "weather-pipeline-cluster")&lt;/li&gt;
&lt;li&gt;Click "Launch cluster"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Creating a Kafka Topic
&lt;/h3&gt;

&lt;p&gt;After your cluster is provisioned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select your cluster from the dashboard&lt;/li&gt;
&lt;li&gt;Navigate to the "Topics" section in the left sidebar&lt;/li&gt;
&lt;li&gt;Click "Create topic"&lt;/li&gt;
&lt;li&gt;Enter "weather_data" as the topic name&lt;/li&gt;
&lt;li&gt;Set the number of partitions (start with 6 for this example)&lt;/li&gt;
&lt;li&gt;Leave the default retention settings (7 days)&lt;/li&gt;
&lt;li&gt;Click "Create with defaults" (or customize advanced settings if needed)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Creating API Keys for Authentication
&lt;/h3&gt;

&lt;p&gt;To connect your application to the Confluent Cloud Kafka cluster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your cluster dashboard, click on "API keys" in the left sidebar&lt;/li&gt;
&lt;li&gt;Click "Create key"&lt;/li&gt;
&lt;li&gt;Select "Global access" (or "Granular access" if you prefer more control)&lt;/li&gt;
&lt;li&gt;Provide a description (e.g., "Weather Pipeline API Key")&lt;/li&gt;
&lt;li&gt;Click "Create key"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Important&lt;/strong&gt;: Save both the API key and secret in a secure location as they will only be shown once&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Testing the Connection
&lt;/h3&gt;

&lt;p&gt;To verify your connection settings, create a simple test script on your Azure VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka.admin&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AdminClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka configuration
&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security.protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.mechanisms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_SECRET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create Admin client
&lt;/span&gt;&lt;span class="n"&gt;admin_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AdminClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List topics
&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;admin_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_topics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available topics:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up Cassandra on Azure VM
&lt;/h2&gt;

&lt;p&gt;We'll install Apache Cassandra directly on our Azure VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Java&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; openjdk-8-jdk

&lt;span class="c"&gt;# Add the Apache Cassandra repository&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://downloads.apache.org/cassandra/debian 40x main"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update

&lt;span class="c"&gt;# Install Cassandra&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; cassandra

&lt;span class="c"&gt;# Start the service&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start cassandra
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;cassandra

&lt;span class="c"&gt;# Verify Cassandra is running&lt;/span&gt;
nodetool status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure Cassandra to accept remote connections by editing &lt;code&gt;/etc/cassandra/cassandra.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/cassandra/cassandra.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the following values (note that in the actual file, &lt;code&gt;seeds&lt;/code&gt; is nested under &lt;code&gt;seed_provider&lt;/code&gt; → &lt;code&gt;parameters&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;listen_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-vm-ip-address&amp;gt;&lt;/span&gt;
&lt;span class="na"&gt;rpc_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="na"&gt;seeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-vm-ip-address&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Cassandra to apply changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up the Environment
&lt;/h2&gt;

&lt;p&gt;Now on your Azure VM, set up the development environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a project directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;weather_data_pipeline
&lt;span class="nb"&gt;cd &lt;/span&gt;weather_data_pipeline

&lt;span class="c"&gt;# Create and activate a virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv myvenv
&lt;span class="nb"&gt;source &lt;/span&gt;myvenv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;confluent-kafka requests python-dotenv cassandra-driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file to store your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WEATHER_API_KEY=your_openweathermap_api_key
BOOTSTRAP_SERVERS=your_confluent_bootstrap_servers
KAFKA_API_KEY=your_confluent_api_key
KAFKA_API_SECRET=your_confluent_api_secret
CASSANDRA_HOST=localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Producer Script
&lt;/h2&gt;

&lt;p&gt;The producer script is responsible for fetching weather data from the OpenWeatherMap API and sending it to a Kafka topic. Let's break down the key components of our &lt;code&gt;weather_producer.py&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Producer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables from .env file
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# List of cities
&lt;/span&gt;&lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Nairobi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Johannesburg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Casablanca&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lagos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kinshasa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# OpenWeatherMap API setup
&lt;/span&gt;&lt;span class="n"&gt;owm_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WEATHER_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;owm_base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch weather data from OpenWeatherMap API using city name.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;owm_base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;appid=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;owm_api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;units=metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching data for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delivery_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Callback for Kafka message delivery status.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message delivery failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message delivered to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] at offset &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka configuration
&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security.protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.mechanisms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_SECRET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker.address.family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message.send.max.retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry.backoff.ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;produce_weather_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch weather data for each city and produce to Kafka.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;delivery_report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to fetch data for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;produce_weather_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data extraction and production complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Notable Components of the Producer Script:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment Setup&lt;/strong&gt;: We import &lt;code&gt;dotenv&lt;/code&gt; to load environment variables and initialize logging to track script execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;City List&lt;/strong&gt;: We define a list of African cities for which we want to collect weather data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Integration&lt;/strong&gt;: The function &lt;code&gt;fetch_weather_data()&lt;/code&gt; makes HTTP requests to the OpenWeatherMap API, requesting metric units so temperatures are returned in Celsius.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Configuration&lt;/strong&gt;: We set up a Kafka producer with security credentials and reliability properties.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Production&lt;/strong&gt;: The &lt;code&gt;produce_weather_data()&lt;/code&gt; function fetches data for each city and produces it to the "weather_data" Kafka topic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delivery Reporting&lt;/strong&gt;: The &lt;code&gt;delivery_report()&lt;/code&gt; callback logs whether each message was delivered to Kafka successfully.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
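
&lt;p&gt;Note that, unlike a continuously looping producer, this script runs once and exits, so automation comes from running it on a schedule. One simple option, shown here purely as an assumed sketch, is to wrap the call in a timed loop (alternatively, keep the script single-run and schedule it externally, e.g. with cron):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative scheduling wrapper (an assumption, not part of the original script):
# re-run the producer every 10 minutes using the functions defined above.
import time

if __name__ == "__main__":
    while True:
        produce_weather_data()
        logger.info("Data extraction and production complete")
        time.sleep(600)  # wait 10 minutes before the next cycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;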

&lt;h2&gt;
  
  
  Creating the Consumer Script
&lt;/h2&gt;

&lt;p&gt;Now, let's create the consumer script that will read data from the Kafka topic and store it in Cassandra. Here is our &lt;code&gt;weather_consumer.py&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cassandra.cluster&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka configuration
&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_consumer_group&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto.offset.reset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security.protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.mechanisms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAIN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sasl.password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KAFKA_API_SECRET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker.address.family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Cassandra configuration
&lt;/span&gt;&lt;span class="n"&gt;cassandra_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CASSANDRA_HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to Cassandra
&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;cassandra_host&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize_cassandra&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialize Cassandra connection and create keyspace/table if needed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Create keyspace if it doesn't exist
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            CREATE KEYSPACE IF NOT EXISTS weather_data
            WITH replication = {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SimpleStrategy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replication_factor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}
            AND durable_writes = true;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Use the keyspace
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USE weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create table if it doesn't exist
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            CREATE TABLE IF NOT EXISTS simple_weather (
                id uuid PRIMARY KEY,
                city_name text,
                temperature float,
                timestamp timestamp,
                weather_description text,
                weather_main text
            );
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cassandra table ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cassandra initialization error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Insert weather data into Cassandra.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract interesting fields
&lt;/span&gt;        &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;weather_desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;weather_main&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Insert data into Cassandra
&lt;/span&gt;        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            INSERT INTO simple_weather (id, city_name, temperature, timestamp, weather_description, weather_main)
            VALUES (%s, %s, %s, toTimestamp(now()), %s, %s)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weather_desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weather_main&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inserted weather for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error inserting data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume_weather_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Consume weather data from Kafka and store in Cassandra.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Initialize Cassandra
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;initialize_cassandra&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Create Kafka consumer
&lt;/span&gt;    &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafka_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subscribed to topic: weather_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_PARTITION_EOF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consumer error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="nf"&gt;insert_weather_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing message: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stopping consumer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;consume_weather_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components of the Consumer Script:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Consumer Configuration&lt;/strong&gt;: We configure a consumer that subscribes to the "weather_data" topic with the appropriate security settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cassandra Connection&lt;/strong&gt;: We connect to the Cassandra cluster directly without authentication credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Initialization&lt;/strong&gt;: The &lt;code&gt;initialize_cassandra()&lt;/code&gt; function creates the necessary keyspace and table if they haven't already been created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;: The &lt;code&gt;insert_weather_data()&lt;/code&gt; function extracts pertinent fields from the weather data and inserts them into Cassandra.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Consumption&lt;/strong&gt;: The &lt;code&gt;consume_weather_data()&lt;/code&gt; function continuously polls the Kafka topic and processes any messages it finds.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running the Pipeline on Azure
&lt;/h2&gt;

&lt;p&gt;Now that we've written both the producer and consumer scripts, let's run them on our Azure VM to see our data pipeline in action.&lt;/p&gt;

&lt;p&gt;First, make sure your virtual environment is activated and all dependencies are installed. Then, run the producer script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 weather_producer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown in the first screenshot, the producer is successfully retrieving weather data and producing it to Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-04-07 05:56:47,013 - __main__ - INFO - Message delivered to weather_data [4] at offset 8
2025-04-07 05:56:47,013 - __main__ - INFO - Message delivered to weather_data [4] at offset 9
2025-04-07 05:56:47,035 - __main__ - INFO - Message delivered to weather_data [3] at offset 8
2025-04-07 05:56:47,035 - __main__ - INFO - Message delivered to weather_data [3] at offset 9
2025-04-07 05:56:47,072 - __main__ - INFO - Message delivered to weather_data [2] at offset 4
2025-04-07 05:56:47,072 - __main__ - INFO - Data extraction and production complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, run the consumer script in a separate terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 weather_consumer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer subscribes to the Kafka topic and begins processing messages, as shown in the second screenshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subscribed to topic: weather_data
Cassandra table ready
Inserted weather for Johannesburg at 2025-04-07 05:56:45
Inserted weather for Lagos at 2025-04-07 05:55:29
Inserted weather for Nairobi at 2025-04-07 05:56:45
Inserted weather for Casablanca at 2025-04-07 05:53:05
Inserted weather for Kinshasa at 2025-04-07 05:56:46
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Querying the Data
&lt;/h2&gt;

&lt;p&gt;Now that the data is in Cassandra, we can query it using CQL. On your Azure VM, use the &lt;code&gt;cqlsh&lt;/code&gt; tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cqlsh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run a query to retrieve the weather data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;simple_weather&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown in the third screenshot, this returns all of the weather data for our cities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id                                   | city_name     | temperature | timestamp                       | weather_description | weather_main
--------------------------------------+--------------+-------------+--------------------------------+---------------------+-------------
9781d8f2-e7d1-484f-a780-4df61ed7c7da | Johannesburg |       15.02 | 2025-04-07 06:22:41.000000+0000 |    overcast clouds  |      Clouds
f64ab11f-a732-4d47-84fb-f450d1e4a9bc |     Kinshasa |       21.21 | 2025-04-07 06:22:41.000000+0000 |         few clouds  |      Clouds
01840d74-0b10-461b-98b4-8eaf5fce2e0c |      Nairobi |       18.62 | 2025-04-07 06:22:41.000000+0000 |     broken clouds   |      Clouds
546d4df6-45b4-4956-94b1-b92b92bc1e84 |        Lagos |       26.57 | 2025-04-07 06:22:41.000000+0000 |    overcast clouds  |      Clouds
95d0dc97-a940-45a9-ab5d-28b9deba7b02 |   Casablanca |       14.07 | 2025-04-07 06:20:30.000000+0000 |   scattered clouds  |      Clouds
(5 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
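
&lt;p&gt;If you prefer to query from Python instead of &lt;code&gt;cqlsh&lt;/code&gt;, the same cassandra-driver session pattern used by the consumer can run the query. A minimal sketch (assuming Cassandra is reachable on localhost):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from cassandra.cluster import Cluster

# Connect to the weather_data keyspace and read back the stored rows
cluster = Cluster(["localhost"])
session = cluster.connect("weather_data")

rows = session.execute("SELECT city_name, temperature, weather_main FROM simple_weather")
for row in rows:
    print(f"{row.city_name}: {row.temperature} ({row.weather_main})")

cluster.shutdown()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;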



&lt;h2&gt;
  
  
  Extending the Pipeline
&lt;/h2&gt;

&lt;p&gt;This basic pipeline can be extended as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add More Cities&lt;/strong&gt;: Expand the list of cities for broader geographic coverage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collect More Data Points&lt;/strong&gt;: Modify the scripts to collect additional weather parameters like humidity, wind speed, and barometric pressure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Aggregation&lt;/strong&gt;: Add functionality to calculate averages, minimums, and maximums over time intervals of varying sizes (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualizations&lt;/strong&gt;: Connect visualization software such as Grafana or Azure Data Explorer to your Cassandra database to create dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerts&lt;/strong&gt;: Configure alerts for extreme weather conditions by adding thresholds for temperature, rainfall, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale with Azure Kubernetes Service&lt;/strong&gt;: For larger deployments, consider containerizing your applications and deploying them on AKS.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
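
&lt;p&gt;As an illustration of the aggregation idea above, you could periodically read the raw rows out of Cassandra and summarize them with pandas. A minimal sketch (pandas is an extra dependency here, and the hourly window is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"])
session = cluster.connect("weather_data")

# Pull the raw readings into a DataFrame
rows = session.execute("SELECT city_name, temperature, timestamp FROM simple_weather")
df = pd.DataFrame(rows.all(), columns=["city_name", "temperature", "timestamp"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Min / mean / max temperature per city per hour
summary = (
    df.groupby(["city_name", pd.Grouper(key="timestamp", freq="1h")])["temperature"]
      .agg(["min", "mean", "max"])
)
print(summary)

cluster.shutdown()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;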

&lt;h2&gt;
  
  
  Common Issues and Troubleshooting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Rate Limiting&lt;/strong&gt;: The OpenWeatherMap API rate-limits free-tier accounts. If you're polling a large number of cities, you might hit these limits; consider a paid tier or request throttling (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Connection Issues&lt;/strong&gt;: Ensure that your Kafka credentials and bootstrap servers are correctly configured in the &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cassandra Connectivity&lt;/strong&gt;: Ensure that your Cassandra instance is accessible on your Azure VM and that the firewall rules allow connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Azure VM Connectivity&lt;/strong&gt;: Check that your Network Security Group allows the necessary inbound and outbound traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk Space&lt;/strong&gt;: Monitor your Azure VM's disk space, especially if storing large amounts of historical weather data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
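
&lt;p&gt;A simple way to stay under the free-tier limit is to pause between API calls in the producer. A minimal sketch of request throttling (the delay value is illustrative, not an OpenWeatherMap requirement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import requests

CITIES = ["Nairobi", "Lagos", "Johannesburg", "Casablanca", "Kinshasa"]
REQUEST_DELAY_SECONDS = 1.5  # illustrative pause between calls; tune to your plan's limits

def fetch_all_cities(api_key):
    """Fetch current weather for each city, pausing between requests."""
    results = []
    for city in CITIES:
        url = (
            "https://api.openweathermap.org/data/2.5/weather"
            f"?q={city}&amp;amp;appid={api_key}"
        )
        results.append(requests.get(url, timeout=10).json())
        time.sleep(REQUEST_DELAY_SECONDS)  # throttle so the free tier isn't exhausted
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;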

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kafka.apache.org/documentation/" rel="noopener noreferrer"&gt;Apache Kafka Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cassandra.apache.org/doc/latest/" rel="noopener noreferrer"&gt;Apache Cassandra Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html" rel="noopener noreferrer"&gt;Confluent Kafka Python Client&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.confluent.io/cloud/current/overview.html" rel="noopener noreferrer"&gt;Confluent Cloud Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openweathermap.org/api" rel="noopener noreferrer"&gt;OpenWeatherMap API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>apachekafka</category>
    </item>
    <item>
      <title>Building an Automated Bitcoin Price ETL Pipeline with Airflow and PostgreSQL</title>
      <dc:creator>Eric Katumo</dc:creator>
      <pubDate>Mon, 31 Mar 2025 19:07:56 +0000</pubDate>
      <link>https://dev.to/luxdevhq/building-an-automated-bitcoin-price-etl-pipeline-with-airflow-and-postgresql-1ok7</link>
      <guid>https://dev.to/luxdevhq/building-an-automated-bitcoin-price-etl-pipeline-with-airflow-and-postgresql-1ok7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This article details creating an automated ETL (Extract, Transform, Load) pipeline that retrieves daily Bitcoin price data from the Polygon.io API, performs necessary transformations, and loads the data into a PostgreSQL database. The workflow is orchestrated using Apache Airflow, ensuring reliable daily execution.&lt;/p&gt;

&lt;p&gt;This project demonstrates several key data engineering concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API data extraction&lt;/li&gt;
&lt;li&gt;Data transformation using pandas&lt;/li&gt;
&lt;li&gt;Database integration with PostgreSQL&lt;/li&gt;
&lt;li&gt;Workflow orchestration with Apache Airflow&lt;/li&gt;
&lt;li&gt;Deployment to a cloud environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline consists of the following components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Source&lt;/strong&gt;: Polygon.io API providing cryptocurrency price data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL Script&lt;/strong&gt;: Python script that handles extraction, transformation, and loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL for data storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Apache Airflow for scheduling and monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Cloud VM for hosting the pipeline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system flows in a linear fashion: Airflow triggers the ETL script daily, which extracts the latest BTC prices, transforms the data into a suitable format, and loads it into the PostgreSQL database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detailed Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Creating the ETL Script
&lt;/h3&gt;

&lt;p&gt;The first component is &lt;code&gt;btc_prices.py&lt;/code&gt;, which handles the core ETL functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="c1"&gt;# Define API endpoint
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.polygon.io/v1/open-close/crypto/BTC/USD/2025-03-31?adjusted=true&amp;amp;apiKey=YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;open_price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;close_price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to retrieve data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare data for insertion
&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;open_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;open_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;close_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;close_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dbname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dbname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create database connection
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgresql://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crypto_prices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataengineering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully loaded crypto data for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts Bitcoin price data from the Polygon.io API&lt;/li&gt;
&lt;li&gt;Transforms and structures the data using pandas&lt;/li&gt;
&lt;li&gt;Loads the data into PostgreSQL&lt;/li&gt;
&lt;li&gt;Uses environment variables for secure database connection management&lt;/li&gt;
&lt;/ul&gt;
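
&lt;p&gt;One caveat: the request URL above hardcodes both the date and the API key. For a daily schedule you would typically derive the date at runtime and keep the key in the &lt;code&gt;.env&lt;/code&gt; file alongside the database credentials. A minimal sketch (&lt;code&gt;POLYGON_API_KEY&lt;/code&gt; is an assumed variable name, not one defined earlier in this article):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from datetime import date, timedelta

from dotenv import load_dotenv

load_dotenv()

# Assumed .env entry; add POLYGON_API_KEY=... next to the database settings
api_key = os.getenv("POLYGON_API_KEY")

# Use the previous day so we never request a day that has not closed yet
target_day = (date.today() - timedelta(days=1)).isoformat()

url = (
    "https://api.polygon.io/v1/open-close/crypto/BTC/USD/"
    f"{target_day}?adjusted=true&amp;amp;apiKey={api_key}"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;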

&lt;h3&gt;
  
  
  Step 2: Creating the Airflow DAG
&lt;/h3&gt;

&lt;p&gt;Next, the &lt;code&gt;btc_dag.py&lt;/code&gt; file defines the Airflow DAG (Directed Acyclic Graph) that orchestrates the workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="c1"&gt;# DAG default arguments
&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_engineer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_on_retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;polygon_btc_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;activate_venv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;activate_virtual_env&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source /home/user/project/venv/bin/activate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;execute_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;execute_python_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python /home/user/project/btc_prices.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;activate_venv&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;execute_file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defines the execution schedule&lt;/li&gt;
&lt;li&gt;Activates the virtual environment&lt;/li&gt;
&lt;li&gt;Executes the ETL script&lt;/li&gt;
&lt;/ul&gt;
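
&lt;p&gt;One thing to watch: each &lt;code&gt;BashOperator&lt;/code&gt; runs its command in its own shell, so sourcing the virtual environment in the first task does not carry over to the second. A common workaround is to call the virtual environment's interpreter directly, which also collapses the workflow into a single task. A minimal sketch (this would replace the two tasks inside the same &lt;code&gt;with DAG(...)&lt;/code&gt; block):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    # Inside the "with DAG(...) as dag:" block from btc_dag.py
    run_etl = BashOperator(
        task_id="run_btc_etl",
        # Use the venv's interpreter directly instead of "source .../activate"
        bash_command="/home/user/project/venv/bin/python /home/user/project/btc_prices.py",
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;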

&lt;h3&gt;
  
  
  Step 3: Setting Up the Environment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Creating a Virtual Environment&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
   &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Installing Dependencies&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install &lt;/span&gt;requests pandas sqlalchemy python-dotenv psycopg2-binary apache-airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setting Up Environment Variables&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"dbname=your_database_name"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
   &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"user=your_database_user"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
   &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"password=your_database_password"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
   &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"host=your_database_host"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
   &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"port=your_database_port"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Server Deployment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SSH into the cloud VM&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ssh user@your_server_ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create necessary directories&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/crypto_price
   &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/airflow/dags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transfer scripts to the server&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   scp btc_prices.py user@your_server_ip:~/crypto_price/
   scp btc_dag.py user@your_server_ip:~/airflow/dags/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: PostgreSQL Configuration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Creating Database Schema&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataengineering&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;dataengineering&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crypto_prices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;open_price&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;close_price&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
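
&lt;p&gt;After the first DAG run, you can confirm rows are landing in the table by reading them back. A minimal sketch using pandas and the same &lt;code&gt;.env&lt;/code&gt; variables as the ETL script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()
engine = create_engine(
    f"postgresql://{os.getenv('user')}:{os.getenv('password')}"
    f"@{os.getenv('host')}:{os.getenv('port')}/{os.getenv('dbname')}"
)

# Read back the most recently loaded rows
df = pd.read_sql(
    "SELECT symbol, open_price, close_price, date "
    "FROM dataengineering.crypto_prices ORDER BY date DESC LIMIT 5",
    con=engine,
)
print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;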



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The architecture follows best practices for data engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separation of extraction, transformation, and loading concerns&lt;/li&gt;
&lt;li&gt;Secure credential management&lt;/li&gt;
&lt;li&gt;Robust error handling&lt;/li&gt;
&lt;li&gt;Automated scheduling&lt;/li&gt;
&lt;li&gt;Cloud-based deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of Python, Airflow, and PostgreSQL provides a powerful foundation for financial data analysis, enabling timely insights into cryptocurrency market trends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Katumo-Zzetu/BTC-Pipeline.git" rel="noopener noreferrer"&gt;Github &lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>Amazon Redshift Data Warehousing (Sample Project)</title>
      <dc:creator>Eric Katumo</dc:creator>
      <pubDate>Wed, 26 Mar 2025 15:32:33 +0000</pubDate>
      <link>https://dev.to/luxdevhq/amazon-redshift-data-warehousing-sample-project-3h86</link>
      <guid>https://dev.to/luxdevhq/amazon-redshift-data-warehousing-sample-project-3h86</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project focuses on analyzing two key business processes using Amazon Redshift:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identifying the most common &lt;strong&gt;frequency of purchases&lt;/strong&gt; per season.&lt;/li&gt;
&lt;li&gt;Finding &lt;strong&gt;products with high review ratings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Setting Up Amazon Redshift
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a Redshift Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to AWS Redshift Console.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Create Cluster&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose a &lt;strong&gt;single-node&lt;/strong&gt; cluster.&lt;/li&gt;
&lt;li&gt;Select an appropriate &lt;strong&gt;node type&lt;/strong&gt; (e.g., &lt;code&gt;dc2.large&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;database name, username, and password&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Create cluster&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an IAM Role&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the &lt;strong&gt;IAM Console&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create a new role with &lt;strong&gt;AmazonS3ReadOnlyAccess&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Attach the role to your Redshift cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an S3 Bucket and Upload Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create a new bucket (e.g., &lt;code&gt;zzetu&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Upload &lt;code&gt;shopping_data.csv&lt;/code&gt; to this bucket.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 2: Loading Data into Redshift
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use the Amazon Redshift Query Editor&lt;/strong&gt; to run SQL commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a table to store raw data&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Age&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Gender&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Category&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;Location&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Season&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ReviewRating&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;SubscriptionStatus&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PaymentMethod&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ShippingType&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;DiscountApplied&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PromoCodeUsed&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PreviousPurchases&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PreferredPaymentMethod&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Copy data from S3 into Redshift&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://zzetu/shopping_data.csv'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::your-account-id:role/your-redshift-role'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="n"&gt;IGNOREHEADER&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
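
&lt;p&gt;Before moving on to the schema, it helps to confirm the load actually worked. The checks below are a suggested addition (not part of the original walkthrough); they simply count and preview the rows that landed in the staging table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Row count should match the records in shopping_data.csv (excluding the header)
SELECT COUNT(*) FROM shopping;

-- Preview a few rows to confirm the columns were mapped correctly
SELECT * FROM shopping LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;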



&lt;h2&gt;
  
  
  Step 3: Schema Design
&lt;/h2&gt;

&lt;p&gt;We are using a star schema with 1 fact table and 3 dimension tables:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Dimension Tables&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dimCustomer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Age&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Gender&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;Location&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;SubscriptionStatus&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PreferredPaymentMethod&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dimProduct&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Category&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DiscountApplied&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PromoCodeUsed&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dimTransaction&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TransactionID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PaymentMethod&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ShippingType&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Season&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dimCustomer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Fact Table&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;factPurchases&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Age&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ReviewRating&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PreviousPurchases&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dimCustomer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Data Insertion
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Insert data into dimCustomer&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dimCustomer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SubscriptionStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PreferredPaymentMethod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SubscriptionStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PreferredPaymentMethod&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert data into dimProduct&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dimProduct&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DiscountApplied&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromoCodeUsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DiscountApplied&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromoCodeUsed&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert data into dimTransaction&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dimTransaction&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PaymentMethod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShippingType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Season&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PaymentMethod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShippingType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Season&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert data into factPurchases&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;factPurchases&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ReviewRating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PreviousPurchases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ReviewRating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PreviousPurchases&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Business Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Most Used Frequency of Purchase per Season&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Season&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PurchaseCount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;factPurchases&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dimTransaction&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Season&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FrequencyOfPurchases&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Season&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PurchaseCount&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
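
&lt;p&gt;The query above returns a count for every season and frequency combination. If you only want the single most common frequency per season, one possible variation (a sketch, not part of the original project) ranks the combinations with a window function and keeps the top row per season:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rank frequencies within each season and keep only the most common one
WITH season_counts AS (
    SELECT t.Season,
           t.FrequencyOfPurchases,
           COUNT(*) AS PurchaseCount,
           ROW_NUMBER() OVER (PARTITION BY t.Season ORDER BY COUNT(*) DESC) AS rn
    FROM factPurchases f
    JOIN dimTransaction t ON f.CustomerID = t.CustomerID
    GROUP BY t.Season, t.FrequencyOfPurchases
)
SELECT Season, FrequencyOfPurchases, PurchaseCount
FROM season_counts
WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;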



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Products with High Review Ratings&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReviewRating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;AvgRating&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_Purchases&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;shopping&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;Category&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;AvgRating&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
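
&lt;p&gt;Since the goal is products with &lt;em&gt;high&lt;/em&gt; review ratings, you may also want to filter rather than only sort. A possible variation is shown below; the 4.0 cut-off is an assumption, so adjust it to whatever "high" means for your business:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keep only categories whose average rating clears a chosen threshold
SELECT s.Category, AVG(f.ReviewRating) AS AvgRating
FROM factPurchases f
JOIN shopping s ON f.CustomerID = s.CustomerID
GROUP BY s.Category
HAVING AVG(f.ReviewRating) &gt;= 4.0
ORDER BY AvgRating DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;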



&lt;h2&gt;
  
  
  Setup Guide
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Redshift Cluster&lt;/strong&gt; and configure IAM roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Data from S3&lt;/strong&gt; into Redshift tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the SQL scripts&lt;/strong&gt; for creating tables and inserting data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute business queries&lt;/strong&gt; to generate insights.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Repository Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Katumo-Zzetu/Redshift-DE.git" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>data</category>
    </item>
    <item>
      <title>Understanding MCP and How AI Engineers Can Leverage It</title>
      <dc:creator>Mwenda Harun Mbaabu</dc:creator>
      <pubDate>Sun, 16 Mar 2025 09:48:50 +0000</pubDate>
      <link>https://dev.to/luxdevhq/understanding-mcp-and-how-ai-engineers-can-leverage-it-3e2i</link>
      <guid>https://dev.to/luxdevhq/understanding-mcp-and-how-ai-engineers-can-leverage-it-3e2i</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding MCP (Model Context Protocol Server) and How AI Engineers Can Leverage It&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the rapidly evolving landscape of AI development, &lt;strong&gt;Model Context Protocol (MCP) Servers&lt;/strong&gt; have emerged as a game-changer. These servers facilitate efficient communication between AI models and applications, ensuring that contextual data is preserved and utilized effectively. For AI engineers working with large models like Mixtral, MCP servers provide the backbone for deploying scalable, intelligent agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is MCP (Model Context Protocol Server)?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is a framework designed to handle &lt;strong&gt;context-aware AI interactions&lt;/strong&gt; by maintaining and efficiently managing session data. Traditional AI models often suffer from context loss, where a model loses track of previous interactions. MCP servers solve this by maintaining contextual state, enabling more coherent and intelligent responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features of MCP Servers:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session Persistence:&lt;/strong&gt; Stores past interactions for improved model continuity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Handles multiple AI model instances efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Context Management:&lt;/strong&gt; Adapts to evolving conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Coordination:&lt;/strong&gt; Supports interactions between multiple AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Access Control:&lt;/strong&gt; Ensures safe AI model communication.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Best MCP Servers for AI Engineers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI engineers can leverage various MCP servers to enhance their AI models' performance. Below are some of the top MCP implementations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;LangChain Server&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open-source framework for contextual AI interactions.&lt;/li&gt;
&lt;li&gt;Provides tools for managing long conversations and memory.&lt;/li&gt;
&lt;li&gt;Works well with OpenAI, Mixtral, and other LLMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;FastAPI with Redis (Custom MCP Implementation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Leverages &lt;strong&gt;FastAPI&lt;/strong&gt; for fast API interactions.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Redis&lt;/strong&gt; for session persistence.&lt;/li&gt;
&lt;li&gt;Ideal for real-time applications needing scalable AI context management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Haystack MCP&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Built for &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Optimized for &lt;strong&gt;knowledge-driven AI models&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Works well for applications like chatbots and enterprise search.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Ollama MCP&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight MCP server for local AI models.&lt;/li&gt;
&lt;li&gt;Focused on privacy-preserving AI deployments.&lt;/li&gt;
&lt;li&gt;Best suited for &lt;strong&gt;on-premises AI applications&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Use Cases of MCP in AI Applications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MCP servers enable a range of powerful AI-driven solutions. Here are some critical use cases:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. AI-Powered Virtual Assistants&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensures conversation continuity.&lt;/li&gt;
&lt;li&gt;Reduces redundant queries and improves personalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Customer Support Chatbots&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Maintains context across sessions.&lt;/li&gt;
&lt;li&gt;Enhances chatbot accuracy and user satisfaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. AI-Powered Recommendation Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses historical interactions for better recommendations.&lt;/li&gt;
&lt;li&gt;Deployed in e-commerce, entertainment, and healthcare.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Financial AI Agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enhances fraud detection using contextual transaction analysis.&lt;/li&gt;
&lt;li&gt;Improves financial forecasting through data-driven insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. AI in Healthcare&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provides better patient interactions by remembering medical history.&lt;/li&gt;
&lt;li&gt;Supports AI-driven diagnostics and decision-making.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Roadmap for Building an AI Agent with Python and Mixtral Large&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building an AI agent using &lt;strong&gt;Python&lt;/strong&gt; and &lt;strong&gt;Mixtral Large&lt;/strong&gt; requires a structured approach. Below is a roadmap:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Environment Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install Python and necessary dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi redis mixtral openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Set up Redis for session persistence.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Build the AI Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create an &lt;strong&gt;API backend&lt;/strong&gt; using FastAPI:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;

   &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
       &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Integrate &lt;strong&gt;Mixtral Large&lt;/strong&gt; for LLM responses:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

   &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;AI:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
       &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mixtral-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Deploy the MCP Server&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Run the FastAPI MCP server:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   uvicorn app:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Set up a &lt;strong&gt;Redis-backed session manager&lt;/strong&gt; to maintain context.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 4: Integrate and Scale&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Deploy the solution using &lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Kubernetes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;WebSockets&lt;/strong&gt; for real-time interaction.&lt;/li&gt;
&lt;li&gt;Optimize &lt;strong&gt;Mixtral model calls&lt;/strong&gt; for cost efficiency.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) Servers are &lt;strong&gt;indispensable&lt;/strong&gt; in modern AI applications. Whether you're building virtual assistants, chatbots, or intelligent recommendation systems, MCP ensures &lt;strong&gt;context retention&lt;/strong&gt;, making AI more &lt;strong&gt;responsive and intelligent&lt;/strong&gt;. By leveraging Python and Mixtral Large, AI engineers can build scalable, stateful AI agents that drive real-world impact.&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 &lt;strong&gt;Next Steps:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Explore different &lt;strong&gt;MCP frameworks&lt;/strong&gt; to find the best fit.&lt;/li&gt;
&lt;li&gt;Optimize AI model deployments using &lt;strong&gt;Redis-backed session storage&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;multi-agent systems&lt;/strong&gt; using MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to build the next-gen AI agent? Start coding today!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The role of SQL in Data Analytics: Why Every Analyst Should Learn SQL</title>
      <dc:creator>Dennis-Kirimi</dc:creator>
      <pubDate>Tue, 11 Mar 2025 14:20:10 +0000</pubDate>
      <link>https://dev.to/luxdevhq/the-role-of-sql-in-data-analytics-why-every-analyst-should-learn-sql-kmi</link>
      <guid>https://dev.to/luxdevhq/the-role-of-sql-in-data-analytics-why-every-analyst-should-learn-sql-kmi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where is this coming from?
&lt;/h3&gt;

&lt;p&gt;Data has become the main driver whenever a person or organization makes decisions. This calls for analysts who can manipulate, transform, and extract insights from that data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmlmnyyvbqpr6d5tmfs3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmlmnyyvbqpr6d5tmfs3.jpeg" alt="Image description" width="626" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any time an analyst interacts with data, there are a number of tools and languages available to complete the task. SQL is one of them!&lt;br&gt;
SQL (Structured Query Language) is the standard language for managing and manipulating relational databases. It is a must-learn language whether you're a beginner or a professional in the data field.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq58bef4ifx029w9z1nwq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq58bef4ifx029w9z1nwq.jpeg" alt="Image description" width="736" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The role of SQL in Data Analytics
&lt;/h2&gt;

&lt;p&gt;We've just mentioned that SQL is used for interacting with relational databases, right? SQL remains the unsung hero of data analysis.&lt;br&gt;
Data is stored in databases. A database is an organized collection of data stored electronically, often in the form of tables with rows and columns. But how do we interact with the data in those databases?&lt;br&gt;
SQL helps analysts retrieve, clean, and manipulate that data with ease.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdisg5ylwui4vyl015iov.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdisg5ylwui4vyl015iov.jpeg" alt="Image description" width="731" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  But Why Should Every Data Analyst Learn SQL?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SQL helps you get the data you need from a database for your analysis.
&lt;/h3&gt;

&lt;p&gt;Can you imagine a company that is over 20 years old? The majority of its business data is stored in relational databases such as MySQL, PostgreSQL, and SQL Server. SQL helps you, as an analyst, pull only the data you need, say last year's sales, and perform further analysis without having to retrieve everything else (see the example after the image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nq6cj0vp4wsnqkaaqx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nq6cj0vp4wsnqkaaqx3.png" alt="Image description" width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. SQL is built for interacting with large datasets.
&lt;/h3&gt;

&lt;p&gt;Excel might work for small reports. But what if you're dealing with millions of rows of data? That's where SQL comes in again, hurray!&lt;br&gt;
SQL makes working with large datasets easy and spares your computer from crashing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl510u7xerw9j699i2x8h.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl510u7xerw9j699i2x8h.jpeg" alt="Image description" width="736" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data Cleaning
&lt;/h3&gt;

&lt;p&gt;Most of the time, data comes in a dirty format: missing values, duplicates, and other issues. SQL gives you plenty of room to use built-in functions to clean the data and prepare it for visualization. This could mean removing duplicates or replacing missing values with the most appropriate substitutes, depending on the analysis (see the example after the image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu1ngr15kgah8r8xz10a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu1ngr15kgah8r8xz10a.jpeg" alt="Image description" width="735" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Merging different tables
&lt;/h3&gt;

&lt;p&gt;In databases, data is stored in schemas and tables, and the data in those tables is often related.&lt;br&gt;
With a simple query, you can join the tables and create a view for further analysis, as shown in the sketch below. This is hard to achieve with tools like Excel.&lt;/p&gt;
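
&lt;p&gt;A minimal sketch of such a join, using hypothetical &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Combine customer details with their orders in a single result set
SELECT c.customer_name, o.order_date, o.amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;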

&lt;h3&gt;
  
  
  5. SQL integrates with a number of popular visualization tools
&lt;/h3&gt;

&lt;p&gt;After you're done working with the data in a database, the next step is usually visualizing it or creating reports.&lt;br&gt;
SQL gives you a direct pipeline into visualization tools like Power BI and Tableau.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuq3jt9fx9g072vxoj3n.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuq3jt9fx9g072vxoj3n.jpeg" alt="Image description" width="736" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. SQL ensures collaboration across teams.
&lt;/h3&gt;

&lt;p&gt;In most organizations, SQL is a central, shared language in the data department. Data analysts, engineers, and scientists often collaborate through data pipelines and, above all, databases. Databases without SQL? That doesn't make sense!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0adakjpl7a5kuyz52f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0adakjpl7a5kuyz52f.jpeg" alt="Image description" width="736" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SQL is more than just a language. It's a critical, domain-relevant skill that every analyst who works with databases should learn.&lt;/p&gt;

&lt;p&gt;If you haven't started learning SQL yet, now is the perfect time!!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts</title>
      <dc:creator>Eric Katumo</dc:creator>
      <pubDate>Mon, 10 Mar 2025 12:16:48 +0000</pubDate>
      <link>https://dev.to/luxdevhq/the-ultimate-guide-to-apache-kafka-31ce</link>
      <guid>https://dev.to/luxdevhq/the-ultimate-guide-to-apache-kafka-31ce</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60pq9ixqqlfqpagvtc0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60pq9ixqqlfqpagvtc0j.png" alt="Image description" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apache Kafka is a widely used open-source platform for distributed event streaming, supporting high-performance data pipelines, streaming analytics, data integration, and mission-critical applications across thousands of companies (&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;https://kafka.apache.org/&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Originally developed by LinkedIn, Kafka is renowned for its high throughput, scalability and durability. It enables real-time data processing and is a key component in modern event-driven architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kafka Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Brokers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A single Kafka server is called a Kafka Broker. Each broker operates as an independent process on a distinct machine, communicating with other brokers via a reliable and high-speed network.&lt;/p&gt;

&lt;p&gt;There can be any number of brokers, but three is typically treated as a practical minimum: it allows one broker to be taken out for maintenance and another to fail at the same time.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Producers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A Producer is a client application that publishes (writes) events to a Kafka cluster. The Kafka producer is responsible for creating messages of the appropriate structure and sending them using the Kafka protocol. It has several configuration options to control message creation and delivery.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Consumers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Kafka consumer works by issuing “fetch” requests to the brokers leading the partitions it wants to consume. The consumer receives back a chunk of log that contains all of the messages in that topic beginning from the offset position.&lt;/p&gt;

&lt;p&gt;A consumer group is a set of consumers that cooperate to consume data from one or more topics.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Topic&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kafka topics are the categories used to organize messages. Topics are handled by Kafka as independent queues. This means that a consumer can subscribe to a specific topic and only receive messages marked with that topic.&lt;/p&gt;

&lt;p&gt;In Kafka, topics are partitioned and replicated across brokers throughout the implementation. Brokers are the individual nodes in a Kafka cluster.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Cluster&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kafka clusters are a group of interconnected Kafka brokers that work together to manage the data streams entering and leaving a Kafka system. As user activity increases, so does the need for additional Kafka brokers to cope with the volume and velocity of the incoming data streams.&lt;/p&gt;

&lt;p&gt;Kafka clusters enable the replication of data partitions across multiple brokers, ensuring high availability even in the case of node failures. Your data pipeline remains robust and responsive to fluctuating demand.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Partitions&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Partitions are essential components within Kafka's distributed architecture that enable Kafka to scale horizontally, allowing for efficient parallel data processing. They are the building blocks for organizing and distributing data across the Kafka cluster.&lt;/p&gt;

&lt;p&gt;Each partition can have multiple replicas spread across different brokers, guaranteeing fault tolerance and data redundancy.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;KRaft&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;KRaft is the consensus protocol introduced in KIP-500 to remove Apache Kafka’s dependency on ZooKeeper for metadata management. It leverages the Raft consensus algorithm to manage metadata and handle leader election natively within Kafka.&lt;/p&gt;

&lt;p&gt;This eliminates the dependency on an external coordination system, allowing Kafka to function as a self-contained system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Core Concepts&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Events&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Picture a digital logbook that keeps track of everything important happening in your system. This logbook is filled with "events" – a record of each significant action or change. In Kafka, these events are the core of how data is stored and shared.&lt;/p&gt;

&lt;p&gt;Think of an event as a single entry in this logbook, with a few key pieces of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key:&lt;/strong&gt; A unique identifier for the event. This helps you categorize or group related events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt; The actual details of what happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; When the event occurred.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headers (optional):&lt;/strong&gt; Extra bits of information about the event.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Replication&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Replication is the process of having multiple copies of the data for the sole purpose of availability in case one of the brokers goes down and is unavailable to serve the requests. Copies of the partition are maintained at multiple broker instances using the partition’s write-ahead log.&lt;/p&gt;

&lt;p&gt;The write-ahead log is where all the messages for that partition are stored in order. The messages are identified by the unique offset.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Offsets&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The consumer offset is a way of tracking the sequential order in which messages are received by Kafka topics.  Keeping track of the offset, or position, is important for nearly all Kafka use cases and can be an absolute necessity in certain instances, such as financial services.&lt;/p&gt;

&lt;p&gt;The Kafka consumer offset allows processing to continue from where it last left off if the stream application is turned off or if there is an unexpected failure. In other words, by having the offsets persist in a data store, data continuity is retained even when the stream application shuts down or fails.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Consumer Groups&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Consumer groups allow Kafka consumers to work together and process events from a topic in parallel. Each topic consists of one or more partitions. When a new consumer is started it will join a consumer group (this happens under the hood) and Kafka will then ensure that each partition is consumed by only one consumer from that group.&lt;/p&gt;

&lt;p&gt;So, if you have a topic with two partitions and only one consumer in a group, that consumer would consume records from both partitions. After another consumer joins the same group, each consumer would continue consuming only one partition.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Retention&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kafka retention provides the ability to control the size of the Topic logs and avoid outgrowing the existing disk size. The retention can be configured or controlled based on the size of the logs (log retention) or based on the configured duration (Time Based Retention).&lt;/p&gt;

&lt;p&gt;Also, the same retention can be set across all the Kafka topics or it can be configured per topic, depending on the nature of the topic we can set the retention accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apache Kafka is a powerful distributed streaming platform that combines messaging and storage capabilities. Its architecture, featuring brokers, topics, and partitions, delivers scalability and fault tolerance. Kafka's core concepts, such as consumer groups and offsets, enable efficient and reliable stream processing.&lt;/p&gt;

&lt;p&gt;With its ability to handle high-volume data streams and support real-time applications, Kafka is a crucial component of modern data architectures. It empowers developers to build robust, data-driven applications that address the challenges of today's data-intensive world. To get started with Apache Kafka visit &lt;a href="https://kafka.apache.org/quickstart" rel="noopener noreferrer"&gt;https://kafka.apache.org/quickstart&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>data</category>
    </item>
    <item>
      <title>Introducing LuxDevHQ Data Science, Artificial Intelligence, and Analytics Prep Program</title>
      <dc:creator>Mwenda Harun Mbaabu</dc:creator>
      <pubDate>Sun, 02 Mar 2025 19:28:44 +0000</pubDate>
      <link>https://dev.to/luxdevhq/introducing-luxdevhq-data-science-artificial-intelligence-and-analytics-prep-program-43fe</link>
      <guid>https://dev.to/luxdevhq/introducing-luxdevhq-data-science-artificial-intelligence-and-analytics-prep-program-43fe</guid>
      <description>&lt;p&gt;&lt;strong&gt;LuxDevHQ&lt;/strong&gt; is excited to introduce an intensive 6 weeks Data Science, Artificial Intelligence, and Analytics Prep Program designed to equip learners with essential technical skills and real-world project experience.&lt;/p&gt;

&lt;p&gt;This rigorous and relentless training is designed to push participants to their limits, ensuring they gain hands-on experience in key data technologies. Unlike other &lt;strong&gt;online&lt;/strong&gt; or &lt;strong&gt;hybrid programs&lt;/strong&gt;, this fully &lt;strong&gt;onsite training&lt;/strong&gt; will require learners to attend &lt;strong&gt;physical classes daily&lt;/strong&gt; from &lt;strong&gt;9:00 AM&lt;/strong&gt; to &lt;strong&gt;4:00 PM EAT&lt;/strong&gt; at LuxDevHQ’s Kilimani campus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;Registration for this program ends on 31st March 2025, and training begins on 21st April 2025.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Program Fees&lt;/strong&gt;&lt;br&gt;
The total cost for the 6-week intensive program is 10,000 KES. This fee covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full access to all training materials and resources&lt;/li&gt;
&lt;li&gt;Hands-on mentorship from industry professionals&lt;/li&gt;
&lt;li&gt;Participation in real-world projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;It is important to note that all our programs include a personalized skills assessment test, which is a 30-minute virtual or in-person call with one of our trainers. This assessment helps determine your skill level to ensure you are placed in a program where you will benefit the most as a participant or learner. The test costs 500 KES, which is refundable if the applicant does not pass the entrance assessment.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The program is divided into two key phases:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Core Training (Weeks 1-2)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first two weeks focus on developing proficiency in the most essential data tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Excel&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data organization, manipulation, and cleaning techniques&lt;/li&gt;
&lt;li&gt;Advanced formulas and functions (VLOOKUP, INDEX/MATCH, Pivot Tables)&lt;/li&gt;
&lt;li&gt;Data visualization using charts and graphs&lt;/li&gt;
&lt;li&gt;Macros and automation for efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to relational databases and SQL syntax&lt;/li&gt;
&lt;li&gt;Writing queries to extract, filter, and aggregate data&lt;/li&gt;
&lt;li&gt;Advanced SQL functions: Joins, Subqueries, CTEs, and Window functions&lt;/li&gt;
&lt;li&gt;Database optimization and indexing for performance tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python for Data Science &amp;amp; Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python programming fundamentals (variables, loops, functions, OOP)&lt;/li&gt;
&lt;li&gt;Data analysis with Pandas (data wrangling, transformations, merging datasets)&lt;/li&gt;
&lt;li&gt;Data visualization using Matplotlib and Seaborn&lt;/li&gt;
&lt;li&gt;Numerical computations using NumPy&lt;/li&gt;
&lt;li&gt;Introduction to Machine Learning using Scikit-Learn&lt;/li&gt;
&lt;li&gt;Automated data workflows with Apache Airflow&lt;/li&gt;
&lt;li&gt;Implementing supervised and unsupervised machine learning models&lt;/li&gt;
&lt;li&gt;Using ML libraries such as TensorFlow, PyTorch, and XGBoost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Power BI for Business Intelligence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data connection and transformation using Power Query&lt;/li&gt;
&lt;li&gt;Data modeling and DAX (Data Analysis Expressions)&lt;/li&gt;
&lt;li&gt;Creating interactive dashboards and reports&lt;/li&gt;
&lt;li&gt;Sharing and publishing reports for business insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Real-World Project Work (Weeks 3-6)&lt;/strong&gt;&lt;br&gt;
After gaining foundational skills, participants will spend the remaining four weeks applying their knowledge to a real-world, industry-based project. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learners will:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work with real datasets sourced from industry partners&lt;/li&gt;
&lt;li&gt;Solve practical business and AI-driven problems using machine learning and analytical methods&lt;/li&gt;
&lt;li&gt;Collaborate in teams, simulating professional work environments&lt;/li&gt;
&lt;li&gt;Develop a complete data science pipeline (data collection, cleaning, modeling, and visualization)&lt;/li&gt;
&lt;li&gt;Deploy models and dashboards using cloud-based services or on-premise solutions&lt;/li&gt;
&lt;li&gt;Present findings and insights in a structured report and deliver a professional presentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Enroll in This Prep Program?&lt;/strong&gt;&lt;br&gt;
  ✅ Hands-on, immersive learning with real-world applications&lt;br&gt;
  ✅ Intense and focused training for rapid skill development&lt;br&gt;
  ✅ Industry-aligned curriculum designed for immediate applicability&lt;br&gt;
  ✅ Collaboration with peers and expert mentorship&lt;br&gt;
  ✅ A stepping stone to advanced AI and Data Science programs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commitment and Expectations.&lt;/strong&gt;&lt;br&gt;
This program is not for the faint-hearted. It is an intensive and highly structured learning experience that demands full commitment. Participants should be prepared for a rigorous schedule that mirrors real-world industry expectations.&lt;/p&gt;

&lt;p&gt;Are you ready to challenge yourself and take the first step toward a career in AI, Data Science, and Analytics?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join us at LuxDevHQ and transform your skills in just six weeks.&lt;/strong&gt; You can register for the program here, &lt;a href="https://forms.gle/jVFV79CKtPLXTx7W9" rel="noopener noreferrer"&gt;https://forms.gle/jVFV79CKtPLXTx7W9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For inquiries and enrollment details, visit our campus on Lenana Road, Kilimani, email us at &lt;a href="mailto:info@luxdevhq.com"&gt;info@luxdevhq.com&lt;/a&gt;, or call or WhatsApp us on 0798166628.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LuxDevHQ Night of Code - A 20+ Hour Learning, Building, and Networking Event for LuxDevHQ Community.</title>
      <dc:creator>Mwenda Harun Mbaabu</dc:creator>
      <pubDate>Fri, 07 Feb 2025 14:18:37 +0000</pubDate>
      <link>https://dev.to/luxdevhq/luxdevhq-night-of-code-a-20-hour-learning-building-and-networking-event-for-luxdevhq-4lg8</link>
      <guid>https://dev.to/luxdevhq/luxdevhq-night-of-code-a-20-hour-learning-building-and-networking-event-for-luxdevhq-4lg8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgoyyo2vvtllf59psx6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgoyyo2vvtllf59psx6s.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LuxDevHQ&lt;/strong&gt; is a learning and upskilling platform specializing in data technologies and cloud solutions, dedicated to helping individuals and businesses stay ahead in the fast-evolving data and artificial intelligence landscape. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;a href="https://x.com/LuxDevHQ" rel="noopener noreferrer"&gt;LuxDevHQ&lt;/a&gt; operates in two major capacities:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;✅ &lt;strong&gt;A Cutting-Edge Training Hub,&lt;/strong&gt; offering an intensive, hands-on 6-month boot camp that prepares learners for real-world careers in data science, engineering, analytics, and cloud computing.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;A Data Talent Sourcing Platform,&lt;/strong&gt; connecting businesses with verified, top-tier data and cloud engineering professionals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎓 LuxDevHQ BootCamp&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LuxDevHQ boot camp is an intensive program designed to equip learners with industry-ready skills through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4 months of expert-led training covering SQL, Python, Big Data, and Cloud Technologies.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;2 months of hands-on internship, where participants apply their skills in real-world projects.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure bridges the gap between learning and employment by providing mentorship, live projects, and career support to ensure graduates are job-ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LuxDevHQ Night of Code.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This event is a 20+ hour experience focused on learning, building, and networking. It is designed for enrolled LuxDevHQ students (and open to the public) to practice their skills and familiarize themselves with the technologies they will be learning in the program and applying in the field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nzn6rteb4a0see07g68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nzn6rteb4a0see07g68.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are excited to host &lt;strong&gt;Stephen Kolesh,&lt;/strong&gt; the top-ranked &lt;a href="https://zindi.africa/community" rel="noopener noreferrer"&gt;Zindi competitor (currently #1 on the leaderboard)&lt;/a&gt;. He will guide attendees in getting started with competitive programming, helping them develop valuable skills while also exploring ways to earn money through competitions.&lt;/p&gt;

&lt;p&gt;Additionally, we will provide CV and résumé workshops, ensuring each attendee receives personalized guidance on how to enhance their profiles and make them stand out in the job market.&lt;/p&gt;

&lt;p&gt;Among the guests are freelancers and industry professionals, who will offer insights into what to expect in the field and share valuable career advice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can register to participate using the link below:&lt;/strong&gt; - &lt;a href="https://paydexp.com/l/31G26" rel="noopener noreferrer"&gt;https://paydexp.com/l/31G26&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding state and props in React</title>
      <dc:creator>Sheila Kabiro</dc:creator>
      <pubDate>Fri, 02 Feb 2024 09:01:27 +0000</pubDate>
      <link>https://dev.to/luxdevhq/understanding-state-and-props-in-react-4hfa</link>
      <guid>https://dev.to/luxdevhq/understanding-state-and-props-in-react-4hfa</guid>
      <description>&lt;p&gt;State and props are two essential ideas in React, and they are crucial to creating dynamic and interactive user interfaces. React's component-based architecture is built on these concepts, which allow developers to write reusable, modular, and maintainable code. In this article, we'll look at what state and props are, how they differ, and how they work together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Role of State:&lt;br&gt;
What is State?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At its core, state in React represents the local mutable data that a component manages. It is what allows a component to keep track of information that changes over time, whether from user interactions, API responses, or any other dynamic aspect of the application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class Components and State:&lt;/strong&gt;&lt;br&gt;
Traditionally, class components were the primary way to manage state in React. The &lt;code&gt;setState&lt;/code&gt;  method played a crucial role in updating and re-rendering components based on changes in state. Understanding the component lifecycle methods became essential to grasp the intricacies of managing state effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introducing Hooks:&lt;/strong&gt;&lt;br&gt;
With the advent of React Hooks, especially the &lt;code&gt;useState&lt;/code&gt; hook, functional components gained the ability to manage state. This simplified the state management process, making code cleaner and more concise. No longer bound to class components, developers could harness the power of state within functional components effortlessly.&lt;/p&gt;
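
&lt;p&gt;To make this concrete, here is a minimal sketch of a counter built with the &lt;code&gt;useState&lt;/code&gt; hook (the component and variable names are just illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { useState } from 'react';

// A functional component that manages its own state.
function Counter() {
  // useState returns the current value and a function to update it.
  const [count, setCount] = useState(0);

  // Calling the updater triggers a re-render with the new value.
  return (
    &amp;lt;button onClick={() =&amp;gt; setCount(count + 1)}&amp;gt;
      Clicked {count} times
    &amp;lt;/button&amp;gt;
  );
}

export default Counter;
&lt;/code&gt;&lt;/pre&gt;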

&lt;p&gt;&lt;strong&gt;Key characteristics of state:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal to a component.&lt;/li&gt;
&lt;li&gt;Mutable: can be changed using &lt;code&gt;setState&lt;/code&gt; (or the updater function returned by &lt;code&gt;useState&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Represents dynamic data that affects the component's behavior.&lt;/li&gt;
&lt;li&gt;Example: Clicking a button in the Counter component might call setState to increment the count.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Significance of Props:&lt;br&gt;
What are Props?&lt;/strong&gt;&lt;br&gt;
Props, short for properties, allow components to receive data from their parent components. They are immutable and serve as a way to pass information down the component tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passing and Receiving Props:&lt;/strong&gt;&lt;br&gt;
Props are passed from parent to child components, creating a flow of data within the application. This unidirectional data flow ensures that child components remain predictable and easily maintainable.&lt;/p&gt;
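
&lt;p&gt;As a small, hypothetical illustration, a parent component passes a value down and the child simply renders what it receives:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// The child reads data from props; it never modifies them.
function Greeting({ name }) {
  return &amp;lt;h1&amp;gt;Hello, {name}!&amp;lt;/h1&amp;gt;;
}

// The parent decides what data flows down the tree.
function App() {
  return &amp;lt;Greeting name="Sheila" /&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;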

&lt;p&gt;&lt;strong&gt;Key characteristics of props:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passed from parent to child components.&lt;/li&gt;
&lt;li&gt;Read-only.&lt;/li&gt;
&lt;li&gt;Used to configure a component's behavior or appearance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Yin and Yang of React Data Flow&lt;br&gt;
Interaction Between State and Props&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Props and state work together seamlessly to create responsive and dynamic user interfaces. Props provide external configuration, while state manages internal data changes. Here's a helpful analogy:&lt;/p&gt;

&lt;p&gt;Props are like the recipe: Just as a recipe provides instructions on how to bake a cake, props in React provide the necessary information for a component. They are passed down from parent components to child components and are immutable within the component receiving them. Props define what a component should render and how it should behave based on external input.&lt;/p&gt;

&lt;p&gt;State is like the ingredients: While the recipe (props) provides the blueprint, the ingredients (state) are the dynamic elements that can change the outcome of the cake. State is internal to a component and represents its current condition or situation. It can be modified by the component itself through setState(), leading to re-rendering and updates in the component's UI.&lt;/p&gt;
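
&lt;p&gt;Putting the analogy into code, here is a minimal sketch (the component names are made up) in which the parent owns the state and hands it to the child as props:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { useState } from 'react';

// The parent owns the state (the "ingredients")...
function CakeShop() {
  const [flavor, setFlavor] = useState('vanilla');

  return (
    &amp;lt;div&amp;gt;
      {/* ...and passes it down as props (the "recipe"). */}
      &amp;lt;Cake flavor={flavor} /&amp;gt;
      &amp;lt;button onClick={() =&amp;gt; setFlavor('chocolate')}&amp;gt;
        Switch flavor
      &amp;lt;/button&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

// The child only reads the prop it is given.
function Cake({ flavor }) {
  return &amp;lt;p&amp;gt;This cake is {flavor}.&amp;lt;/p&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;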

&lt;p&gt;&lt;strong&gt;Common Use Cases: When to Use Props vs. State&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use props for data that comes from a parent component and shouldn't be changed by the child.&lt;/li&gt;
&lt;li&gt;Use state for data that is internal to the component and can change over time.&lt;/li&gt;
&lt;li&gt;Props are ideal for static configuration, while state is perfect for dynamic behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone starting to build modern web applications needs to understand state and props in React. By grasping these concepts, developers can design dynamic, interactive, and maintainable user interfaces. Remember that state and props are your friends as you begin your React journey; they will help you build reliable, scalable applications with ease. Happy coding!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>react</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Maximizing Efficiency and Savings: A Guide to Optimizing Amazon Redshift</title>
      <dc:creator>Mwenda Harun Mbaabu</dc:creator>
      <pubDate>Mon, 11 Sep 2023 16:14:12 +0000</pubDate>
      <link>https://dev.to/luxdevhq/maximizing-efficiency-and-savings-a-guide-to-optimizing-amazon-redshift-3ddl</link>
      <guid>https://dev.to/luxdevhq/maximizing-efficiency-and-savings-a-guide-to-optimizing-amazon-redshift-3ddl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsua77i6345opv9ybeko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsua77i6345opv9ybeko.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt; serves as a robust data warehousing service that assumes a pivotal role in the management of large-scale data analytics for organizations.&lt;/p&gt;

&lt;p&gt;As a data engineer, you will work closely with Amazon Redshift if your company has chosen it as its data warehousing technology, or if your organization has adopted it as the central data lakehouse tool to combine the advantages of a data lake and a warehouse in a unified platform.&lt;/p&gt;

&lt;p&gt;To fully exploit the capabilities of Redshift while concurrently managing costs and ensuring the efficiency of query performance, optimization becomes imperative. &lt;/p&gt;

&lt;p&gt;In this article, we will delve into a set of strategies designed to assist you in optimizing Amazon Redshift for both cost-effectiveness and query performance. This endeavor will not only result in cost savings for your organization but also enhance query speed, benefiting you as a developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We will be discussing several strategies, including:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Modeling&lt;/li&gt;
&lt;li&gt;Data Loading&lt;/li&gt;
&lt;li&gt;Compression&lt;/li&gt;
&lt;li&gt;Query Optimization&lt;/li&gt;
&lt;li&gt;Concurrency Scaling&lt;/li&gt;
&lt;li&gt;Workload Management (WLM)&lt;/li&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;Vacuuming and Analyzing&lt;/li&gt;
&lt;li&gt;Monitoring and Alerts&lt;/li&gt;
&lt;li&gt;Redshift Spectrum&lt;/li&gt;
&lt;li&gt;Redshift Advisor and Reserved Instances &lt;/li&gt;
&lt;li&gt;Regular Review and Optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1). Data Modeling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The foundation of effective Redshift optimization begins with smart data modeling decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Distribution and Sort Keys: The choice of data distribution style (even, key, or all) and sort keys for your tables can significantly impact query performance. It's essential to select these attributes thoughtfully based on your specific needs (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Normalization vs. Denormalization: Evaluate your query patterns to decide whether to normalize or denormalize your data. Normalization conserves storage space, while denormalization can enhance query performance. Your choice should align with your unique requirements.&lt;/li&gt;
&lt;/ul&gt;
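
&lt;p&gt;As a minimal sketch of the distribution and sort key point above (the table and column names are hypothetical), a fact table that is usually joined on &lt;code&gt;customer_id&lt;/code&gt; and filtered by date might be defined like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Hypothetical fact table: co-locate rows by the common join key
-- and sort by date so range filters scan fewer blocks.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    store_id    INT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
&lt;/code&gt;&lt;/pre&gt;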

&lt;h3&gt;
  
  
  &lt;strong&gt;2). Data Loading&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Efficient data loading processes are crucial for Redshift optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;COPY Command: Utilize the COPY command for bulk data loading instead of INSERT operations. It is not only faster but also more cost-effective, particularly when dealing with substantial data volumes (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Amazon S3 Staging: Consider using Amazon S3 as a staging area for data loading. This approach simplifies the process and reduces load times, enhancing overall efficiency.&lt;/li&gt;
&lt;/ul&gt;
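
&lt;p&gt;A minimal COPY sketch, assuming the files have already been staged in S3 (the bucket, prefix, and IAM role below are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Bulk-load staged CSV files in parallel instead of row-by-row INSERTs.
COPY sales
FROM 's3://my-bucket/staging/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1';
&lt;/code&gt;&lt;/pre&gt;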

&lt;h3&gt;
  
  
  &lt;strong&gt;3). Compression&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Optimizing storage with proper compression techniques can lead to substantial savings and improved query performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compression Encodings: Employ suitable compression encodings for columns to save storage costs and boost query performance. Selecting the right encodings is key to success (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;ANALYZE Command: Run the ANALYZE command periodically to update statistics. This aids the query planner in making informed decisions regarding data distribution and compression.&lt;/li&gt;
&lt;/ul&gt;
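
&lt;p&gt;A small sketch of both points; the encodings shown are illustrative, and running &lt;code&gt;ANALYZE COMPRESSION&lt;/code&gt; on your own data can suggest better ones:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Column-level compression encodings chosen per data type.
CREATE TABLE sales_compressed (
    sale_id     BIGINT        ENCODE az64,
    product     VARCHAR(100)  ENCODE lzo,
    sale_date   DATE          ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64
);

-- Refresh planner statistics after large loads.
ANALYZE sales_compressed;
&lt;/code&gt;&lt;/pre&gt;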

&lt;h3&gt;
  
  
  &lt;strong&gt;4). Query Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Fine-tuning your queries can significantly impact performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EXPLAIN Command: Use the EXPLAIN command to analyze query plans and identify performance bottlenecks. This helps in pinpointing areas that require optimization (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Column Selection: Avoid using SELECT * in queries; instead, explicitly list the columns you need. This reduces unnecessary data transfer and computation.&lt;/li&gt;
&lt;li&gt;Minimize DISTINCT and ORDER BY: Minimize the use of DISTINCT and ORDER BY clauses, as they can be computationally expensive. Use them only when necessary.&lt;/li&gt;
&lt;/ul&gt;
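
&lt;p&gt;For example, a quick way to check a plan while keeping the column list explicit (the table and columns are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Inspect the plan (joins, distribution, scans) before tuning.
EXPLAIN
SELECT customer_id, SUM(amount) AS total_spend
FROM sales
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;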

&lt;h3&gt;
  
  
  &lt;strong&gt;5). Concurrency Scaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Efficiently managing query concurrency is vital:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic Concurrency Scaling: Enable automatic concurrency scaling to handle query load spikes without sacrificing performance.&lt;/li&gt;
&lt;li&gt;Custom Concurrency Settings: Adjust concurrency scaling settings based on your workload and requirements, striking the right balance between cost and performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6). Workload Management (WLM)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Effectively allocate resources among different query workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WLM Queues: Utilize WLM queues to distribute resources efficiently. Set appropriate memory and concurrency values for each queue to optimize both cost and performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7). Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For large tables with specific query patterns, partitioning is a game-changer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table Partitioning: Native Redshift tables do not support traditional declarative partitioning; you achieve a similar effect by sorting large tables on the date column you filter by, or by partitioning external tables queried through Redshift Spectrum. Apply this when you frequently query specific date ranges or subsets of data to enhance query performance and reduce costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8). Vacuuming and Analyzing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Maintenance tasks are essential for long-term optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VACUUM and ANALYZE: Regularly run the VACUUM and ANALYZE commands to reclaim storage space and keep statistics up-to-date, ensuring peak performance (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
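
&lt;p&gt;A minimal maintenance sketch for a frequently updated table (the table name is a placeholder; scheduling these commands off-peak is common practice):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Reclaim space left by deletes and updates, and re-sort the rows...
VACUUM FULL sales;

-- ...then refresh the statistics the query planner relies on.
ANALYZE sales;
&lt;/code&gt;&lt;/pre&gt;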

&lt;h3&gt;
  
  
  &lt;strong&gt;9). Monitoring and Alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stay proactive with monitoring and alert systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring Tools: Implement monitoring and set up alerts to track query performance and resource utilization. Services like Amazon CloudWatch can be invaluable for this purpose.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;10). Redshift Spectrum&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Leverage Redshift Spectrum for cost-effective data querying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3 Integration: Consider using Redshift Spectrum to query data stored in Amazon S3 directly, especially for historical or less-frequently accessed data. This can significantly reduce storage costs (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
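
&lt;p&gt;A brief sketch of querying S3 data in place via Spectrum; the Glue database, IAM role, and S3 path below are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Expose an external (Glue Data Catalog) database as a schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sales_history'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe archived files in S3 without loading them into the cluster.
CREATE EXTERNAL TABLE spectrum.sales_archive (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/archive/sales/';

-- Cold data stays in S3 but is queryable with ordinary SQL.
SELECT COUNT(*) FROM spectrum.sales_archive;
&lt;/code&gt;&lt;/pre&gt;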

&lt;h3&gt;
  
  
&lt;strong&gt;11). Redshift Advisor and Reserved Instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Utilize built-in tools for guidance and cost savings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redshift Advisor:&lt;/strong&gt; Take advantage of the Redshift Advisor tool, which provides recommendations for optimizing your cluster's performance and cost-efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reserved Instances (RIs):&lt;/strong&gt; If your Redshift usage is steady, consider purchasing Reserved Instances to lower your per-hour costs, providing predictability and savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;12). Regular Review and Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Continuous improvement is the key to success:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance and Cost Metrics:&lt;/strong&gt; Regularly review your cluster's performance and cost metrics to identify opportunities for optimization. Adapting to changing needs is crucial.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Optimizing Amazon Redshift for cost and query performance is not a one-time task but rather an &lt;strong&gt;ongoing journey&lt;/strong&gt; that requires a deep understanding of your data, workload, and business objectives. By implementing the strategies mentioned in this article and staying vigilant, you can continuously fine-tune your Redshift cluster to strike the right balance between cost savings and efficient data analytics. This iterative process ensures that your organization maximizes the benefits of this powerful data warehousing service, adapting to evolving needs and extracting valuable insights from your data.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
