DEV Community: Damaa-C

Building a Real-Time Weather Streaming Pipeline with Kafka, Docker & Python

Damaa-C — Sun, 17 May 2026 18:00:42 +0000

Introduction

In modern data engineering, handling high-velocity, real-time streams requires decoupled architectures that can scale seamlessly. A simple script fetching data from an API and writing it straight to a database creates a tight coupling; if the database goes down or the API experiences a spike, the entire system breaks.

This project implements a resilient Event-Driven Architecture (EDA). It extracts live global weather metrics from an API, streams them into an Apache Kafka topic managed via Docker, and processes them through an ETL (Extract, Transform, Load) consumer engine that flattens and persists the data into a PostgreSQL database.

Project System Architecture & Directory Layout

The project decouples data sourcing from data transformation and consumption using a publish-subscribe model. Docker isolates the streaming platform infrastructure, while Python applications drive the data operations.
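To make the decoupling concrete, here is a minimal in-memory sketch of the publish-subscribe idea. It is not Kafka: the `TopicLog` class and its method names are hypothetical stand-ins for a topic and a consumer offset. It only shows that a producer can keep appending while the consumer is offline, and that the consumer later resumes from where it left off.

```python
class TopicLog:
    """Toy append-only log standing in for a Kafka topic (illustrative only)."""

    def __init__(self):
        self._records = []   # everything ever published, in order
        self._offsets = {}   # consumer name -> index of the next record to read

    def publish(self, record):
        # The producer only appends; it never waits for any consumer.
        self._records.append(record)

    def poll(self, consumer):
        # Return the records this consumer has not seen yet, then advance its offset.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


topic = TopicLog()

# The producer keeps publishing while the consumer is "down".
for city in ["Nairobi", "Accra", "Riga"]:
    topic.publish({"city": city})

# The consumer comes back later and catches up from its last offset.
first_batch = topic.poll("etl")
topic.publish({"city": "Seoul"})
second_batch = topic.poll("etl")

print([r["city"] for r in first_batch])
print([r["city"] for r in second_batch])
```

Neither side calls the other directly, which is the property the rest of the pipeline relies on.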

openweather-kafka_confluent-project/
├── docker-compose.yml       # Orchestrates Zookeeper, Kafka Broker, & Control Center
├── producer.py              # Extracted RapidAPI multi-city pipeline (Ingestion)
├── consumer.py              # Advanced Pandas & SQLAlchemy Postgres pipeline (ETL)
├── test.ipynb               # Jupyter Notebook for interactive validation & debugging
└── .env                     # Local container and API credential configurations

Infrastructure Layer: Docker Compose & Commands

Instead of dealing with local, environment-specific installations of Kafka, the entire messaging backbone is containerized. The docker-compose.yml provisions a robust Confluent platform stack, exposing Kafka over port 9092 to the host machine.

version: '3.8'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  broker:
    image: confluentinc/cp-server:7.4.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_METRIC_REPORTERS: io.confluent.metrics.reporter.ConfluentMetricsReporter
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CONFLUENT_METRICS_REPORTER_BOOTSTRAP_SERVERS: broker:29092
      CONFLUENT_SUPPORT_CUSTOMER_ID: 'anonymous'

Docker CLI Commands to Spin Up Infrastructure

To start the background message broker infrastructure, navigate to your project directory containing the configuration file and run:

# Start Kafka and Zookeeper services in detached mode
docker compose up -d

# Verify that your containers are running normally
docker ps

Data Ingestion: The Producer Layer (`producer.py`)

The producer.py script handles data ingestion. It loops continuously through an array of nine target international cities (Nairobi, Accra, Cape Town, Riga, Brussels, Moscow, Seoul, London, and Sucre), requests their real-time weather information via RapidAPI, and dispatches the payload to the weather_raw Kafka topic.

A standout feature here is the automatic inline JSON serialization using a lambda function passed directly into the KafkaProducer constructor.

from kafka import KafkaProducer
import time
import json
import requests
import os
from dotenv import load_dotenv

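# LANG is usually already set by the shell on Linux/macOS (for example en_US.UTF-8),
# and load_dotenv() keeps existing variables unless override=True is passed.
load_dotenv(override=True)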

API_KEY  = os.getenv("API_KEY")
API_HOST = os.getenv("API_HOST")
LANG     = os.getenv("LANG")

topic = "weather_raw"

# Initialize Kafka Producer with integrated JSON byte-serializer
producer = KafkaProducer(
    bootstrap_servers = 'localhost:9092',
    value_serializer = lambda v : json.dumps(v).encode('utf-8')
)

Cities = ['Nairobi','Accra','Cape Town','Riga','Brussels','Moscow','Seoul','London','Sucre']

while True :
    for city in Cities :
        url = f"https://{API_HOST}/city?city={city}&lang={LANG}"
        headers = {
            'x-rapidapi-host': API_HOST,
            'x-rapidapi-key' : API_KEY
        }

        try :
            # A timeout stops one stalled request from blocking the whole loop
            response = requests.get(url, headers=headers, timeout=10)

            if response.status_code == 200 :
                weather_data = response.json()
                producer.send(topic, value=weather_data)
                print(f"Sent weather data for {city}")
            else :
                print(f"Failed for {city} : {response.status_code} ")

        except Exception as e :
            print(f"Error for {city} : {e}")

    producer.flush()
    time.sleep(1)
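The serializer passed to KafkaProducer can be checked without a broker. The sketch below applies the same lambda used in the producer, and the matching decoder used later by the consumer, to a sample payload. The payload is made up for illustration and is not real API output.

```python
import json

# The same callables that producer.py and consumer.py hand to kafka-python.
value_serializer = lambda v: json.dumps(v).encode("utf-8")
value_deserializer = lambda x: json.loads(x.decode("utf-8"))

# Hypothetical payload; the real API response has more fields.
sample = {"name": "Nairobi", "main": {"temp": 21.5, "humidity": 60}}

wire_bytes = value_serializer(sample)      # what is actually sent to the topic
restored = value_deserializer(wire_bytes)  # what the consumer sees in message.value

print(type(wire_bytes).__name__)
print(restored == sample)
```

Because both ends agree on JSON encoded as UTF-8, the consumer gets back the same dictionary the producer sent.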

To begin streaming live API payloads into your Kafka cluster, run the producer engine script in your terminal:

python3 producer.py


Storage & Transformation Tier: The ETL Consumer (`consumer.py`)

The consumer layer implements a true ETL pattern. Instead of just printing raw bytes, it targets the weather_raw topic, decodes the stream, flattens the highly nested API structures into a uniform format using Pandas, and loads the records into a PostgreSQL database instance using SQLAlchemy.

from kafka import KafkaConsumer
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os
import json
import pandas as pd

load_dotenv()

Postgres_URI = os.getenv("POSTGRES_URI")
engine = create_engine(Postgres_URI)

# Initialize Kafka Consumer with native byte-decoding
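consumer = KafkaConsumer(
    'weather_raw',
    bootstrap_servers = 'localhost:9092',
    # Offsets are only committed for consumers in a group; without group_id every
    # restart re-reads the topic from 'earliest' and appends duplicate rows.
    group_id = 'weather_etl',  # illustrative group name
    auto_offset_reset = 'earliest',
    enable_auto_commit = True,
    value_deserializer = lambda x : json.loads(x.decode('utf-8'))
)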

print("Consumer started listening ...")

for message in consumer :
    try :
        data = message.value

        # EXTRACT & TRANSFORM: Defensive parsing handles nested JSON safely
        # ("or [{}]" also covers an empty weather list, which would break [0])
        transformed_data = {
            "city"        : data.get("name"),
            "temperature" : data.get("main",{}).get("temp"),
            "humidity"    : data.get("main",{}).get("humidity"),
            "pressure"    : data.get("main",{}).get("pressure"),
            "weather"     : (data.get("weather") or [{}])[0].get("main"),
            "description" : (data.get("weather") or [{}])[0].get("description"),
            "wind_speed"  : data.get("wind",{}).get("speed")
        }

        # Structure as a Pandas DataFrame
        df = pd.DataFrame([transformed_data])
        print("\n Transformed weather data")

        # LOAD: Persist metrics into the PostgreSQL destination table
        df.to_sql("weather_kafka", con=engine, if_exists="append", index=False)
        print(f"Loaded weather data for {transformed_data['city']}")

    except Exception as e :
        print(f"Consumer error : {e}")
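The transform step can also be pulled into a small function and tested without Kafka or PostgreSQL. This is a sketch rather than the project's code: `flatten_weather` is a hypothetical helper, and the sample payloads only mirror the keys the consumer reads. It guards against a missing or empty `weather` list, since indexing `[0]` on an empty list raises IndexError.

```python
def flatten_weather(data):
    """Flatten one nested weather payload into a single-level dict."""
    main = data.get("main") or {}
    wind = data.get("wind") or {}
    # "weather" is a list; fall back to one empty dict if it is missing or empty.
    first_weather = (data.get("weather") or [{}])[0]
    return {
        "city": data.get("name"),
        "temperature": main.get("temp"),
        "humidity": main.get("humidity"),
        "pressure": main.get("pressure"),
        "weather": first_weather.get("main"),
        "description": first_weather.get("description"),
        "wind_speed": wind.get("speed"),
    }


# Made-up payloads for illustration only.
complete = {
    "name": "Riga",
    "main": {"temp": 12.3, "humidity": 71, "pressure": 1015},
    "weather": [{"main": "Clouds", "description": "broken clouds"}],
    "wind": {"speed": 4.1},
}
sparse = {"name": "Sucre", "weather": []}

row = flatten_weather(complete)
sparse_row = flatten_weather(sparse)
print(row)
print(sparse_row)
```

Missing fields come back as None, which pandas writes to PostgreSQL as NULL instead of failing the whole message.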

Open a second terminal and start the consumer to begin writing rows to PostgreSQL in real time:

python3 consumer.py


Pipeline & Data Verification

To verify the integration, watch the log output of both Python scripts and query the target table to confirm that the data is persisted.

Dual-Terminal Execution Log Comparison

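When both scripts run side by side in separate terminals, the producer prints a "Sent weather data for <city>" line for each message it publishes, and the consumer prints a matching "Loaded weather data for <city>" line as each row is written.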

Verifying Records in the PostgreSQL Database

Because the consumer script calls df.to_sql(..., if_exists="append"), every message adds a new row to the weather_kafka table. Connect to your PostgreSQL instance with a GUI client or psql and query the table to confirm that the transformed records are arriving:

select * from weather_kafka;


Conclusion

This project establishes a working blueprint for real-time streaming data pipelines. By combining Docker container isolation with Kafka's durable, decoupled message log, the system lets the ingestion loop keep running even when the loading layer is slow or offline. Python, Pandas, and SQLAlchemy turn nested API responses into structured relational records, giving you an automated data engine ready for downstream business intelligence dashboards.

Mastering Modern Data Workflows with Docker

Damaa-C — Mon, 11 May 2026 15:47:43 +0000

In the world of data engineering, the "it works on my machine" excuse is a relic of the past. Docker has revolutionized how we build and deploy applications by using containerization. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

Why Containerize?

Isolation: Keep your Python libraries for one project separate from another.
Portability: Run the same container on Ubuntu, Windows (via WSL), or macOS.
Scalability: Easily spin up multiple instances of a service.

Essential Docker Commands

To manage your containers effectively, you must master these core CLI commands:

Command	Description
`docker build -t my-image .`	Builds an image from a Dockerfile in the current directory.
`docker run -d --name my-container my-image`	Runs a container in the background (detached mode).
`docker ps -a`	Lists all containers, including those that have stopped.
`docker logs -f <container_id>`	Follows the output logs of a specific container.
`docker exec -it <container_id> /bin/bash`	Opens an interactive terminal inside a running container.
`docker rm -f $(docker ps -aq)`	Forcefully removes all containers.

Orchestration with Docker Compose

While Docker handles individual containers, Docker Compose is used to manage multi-container applications. It uses a YAML file to define how different services (like a database and a script) interact.

Common Compose Commands:

docker-compose up -d: Starts the entire stack in detached mode.
docker-compose down: Stops and removes the stack's containers and networks (images are kept unless you add --rmi).
docker-compose logs -f [service]: Follows logs for a specific service.

Practical Example: A Health-Checked ETL Pipeline

This complete example shows a Python worker connecting to a PostgreSQL database. It utilizes health-checks to ensure the database is fully initialized before the ETL logic begins.

The Application Code (etl_script.py)

This script acts as our ETL worker, using environment variables for a secure connection.

import pandas as pd
from sqlalchemy import create_engine
import os

# Database connection string provided by Docker Compose
DB_URL = os.getenv('DATABASE_URL')
engine = create_engine(DB_URL)

def run_etl():
    # 1. EXTRACT & TRANSFORM
    data = {'id': [1, 2], 'user': ['Damaris', 'TechWriter']}
    df = pd.DataFrame(data)
    df['status'] = 'verified'

    # 2. LOAD
    print("Connecting to database and pushing data...")
    df.to_sql('users', engine, if_exists='replace', index=False)
    print("ETL Job Completed Successfully!")

if __name__ == "__main__":
    run_etl()

The Dockerfile

The Dockerfile contains the instructions to build the environment for our script.

# Use a lightweight Python image
FROM python:3.9-slim

# Set working directory and install system dependencies
WORKDIR /app
RUN apt-get update && apt-get install -y libpq-dev gcc

# Install required Python libraries
RUN pip install pandas sqlalchemy psycopg2-binary

# Copy the script and run it
COPY . .
CMD ["python", "etl_script.py"]

The `docker-compose.yaml` (The Orchestrator)

This file links the database and the worker, ensuring the worker only starts when the database is "healthy".

version: '3.8'

services:
  # Service 1: The Database with Healthcheck
  postgres_db:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret_password
      POSTGRES_DB: target_warehouse
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d target_warehouse"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Service 2: The ETL Worker
  etl_worker:
    build: .
    depends_on:
      postgres_db:
        condition: service_healthy # Critical: Wait for DB to be ready
    environment:
      DATABASE_URL: postgresql://admin:secret_password@postgres_db:5432/target_warehouse
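The healthcheck settings amount to a bounded retry loop. The following sketch is a simplified model of that behaviour in plain Python. `wait_until_healthy` and `fake_pg_isready` are hypothetical names, and the probe is a stub rather than a real `pg_isready` call.

```python
import time


def wait_until_healthy(probe, retries=5, interval=0.0):
    """Call probe() until it returns True or the retry budget is spent.

    Returns the number of attempts used, or raises RuntimeError if the
    service never became healthy.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        time.sleep(interval)  # Compose waits `interval` between checks
    raise RuntimeError(f"service still unhealthy after {retries} attempts")


# Stub probe: pretends the database needs three checks before it is ready.
calls = {"count": 0}


def fake_pg_isready():
    calls["count"] += 1
    return calls["count"] >= 3


attempts_used = wait_until_healthy(fake_pg_isready, retries=5, interval=0.0)
print(attempts_used)
```

With `condition: service_healthy`, Compose does this waiting for you, so the ETL script does not need its own retry loop just to survive database start-up.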

How to Run and Verify

Launch the stack: Run docker-compose up --build.
Monitor Status: Use docker ps to see the "healthy" status of the database.
Cleanup: Use docker-compose down to stop all services and clean up networks.

Conclusion

Mastering Docker and multi-container orchestration marks a significant shift from traditional script running to professional-grade engineering. By containerizing your workflows, you eliminate environment-specific bugs and ensure that your data infrastructure is as reliable as the code itself. Whether you are building a simple ETL script or a complex orchestration layer with Apache Airflow, the principles of isolation and health-based dependency management remain the keys to a resilient data stack.

Data Warehousing & Modeling: From Foundation to AWS Cloud Implementation

Damaa-C — Mon, 04 May 2026 12:46:41 +0000

In the current landscape of data engineering, the ability to transform raw, messy data into actionable insights is what separates successful organizations from the rest. This article explores the architecture of data warehouses, the nuances of data modeling, and how to implement these concepts using Amazon Web Services (AWS).

The Foundation: Data Warehousing and Modeling

What is a Data Warehouse?

A Data Warehouse (DWH) is a specialized database optimized for analysis rather than transaction processing. It aggregates data from multiple sources such as CRM systems, mobile apps, and billing databases into a single, unified repository.

Key Characteristics

Integrated: Consolidates data from inconsistent formats into a clean, unified structure.
Time-Variant: Maintains historical records to analyze trends over months or years.
Non-Volatile: Once data enters the warehouse, it is not modified; it is only added to.

Data Modeling Concepts

Data modeling is the process of defining how data is structured within the warehouse to ensure fast query performance.

Star Schema: The most common model. It features a central Fact Table (containing quantitative metrics like price or quantity) connected to multiple Dimension Tables (descriptive data like product_name or store_location).
Snowflake Schema: A variation where dimension tables are normalized into further tables, reducing redundancy but increasing query complexity.
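A star schema is easiest to see in a query. The sketch below builds a tiny fact table and one dimension table in an in-memory SQLite database and answers a typical analytical question with a single join. SQLite is a stand-in chosen so the example runs anywhere, and the table names, column names, and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes, one row per store
    CREATE TABLE dim_store (
        store_id INTEGER PRIMARY KEY,
        store_location TEXT
    );
    -- Fact table: numeric measures plus a foreign key to the dimension
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        store_id INTEGER REFERENCES dim_store(store_id),
        quantity INTEGER,
        price REAL
    );
    INSERT INTO dim_store VALUES (1, 'Nairobi'), (2, 'Mombasa');
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 10.0),
        (2, 1, 1, 5.0),
        (3, 2, 4, 2.5);
    """
)

# Revenue per location: aggregate the fact table, label it via the dimension.
revenue_by_location = conn.execute(
    """
    SELECT d.store_location, SUM(f.quantity * f.price) AS revenue
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.store_location
    ORDER BY d.store_location
    """
).fetchall()

print(revenue_by_location)  # [('Mombasa', 10.0), ('Nairobi', 25.0)]
```

In a snowflake schema, `store_location` would itself point to a further table (for example a region table), which adds one more join to the same query.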

OLTP vs. OLAP: Knowing the Difference

Understanding the distinction between these two systems is critical for any data professional.

Feature	OLTP (Online Transactional Processing)	OLAP (Online Analytical Processing)
Primary Goal	Record daily transactions	Analyze data for decision-making
Data State	Current, real-time	Historical, aggregated
Query Type	Simple (e.g., "Update account balance")	Complex (e.g., "Total revenue per region")
Optimization	Optimized for fast Writes	Optimized for fast Reads
Example	ATM withdrawal, E-commerce checkout	Quarterly sales report, Trend analysis

Configuring a Data Warehouse in AWS

AWS provides a robust ecosystem for data warehousing, primarily centered around Amazon Redshift. Below is the logical configuration flow.

Step 1: The Modern Data Stack Architecture

A common workflow follows the ETL/ELT pattern:

Storage: Raw data lands in Amazon S3.
Transformation: AWS Glue catalogs and cleans the data.
Warehouse: Data is loaded into Amazon Redshift.

Step 2: Creating the Redshift Cluster

In the AWS Management Console, you must provision a cluster. You choose:

Node Type: For example, RA3 nodes let you scale storage and compute independently.
Number of Nodes: To determine parallel processing power.

Step 3: Network and Security Configuration

Because a data warehouse contains sensitive information, security is paramount:

VPC: Ensure your cluster resides within a private Virtual Private Cloud.
Security Groups: Configure rules to allow traffic only on Port 5439 from trusted IP addresses.
IAM Roles: Attach an IAM role to Redshift that grants it "Read-Only" access to your S3 buckets.

Step 4: Schema Implementation and Data Loading

Once the cluster is active, you use SQL to create your Star Schema. Loading data is typically done via the high-speed COPY command:

COPY sales_fact
FROM 's3://my-data-bucket/sales_data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
DELIMITER ','
IGNOREHEADER 1;

Conclusion

Data warehousing is more than just "storing data"; it is about structuring information in a way that provides clarity. By moving from OLTP systems to an OLAP environment like Amazon Redshift and applying rigorous Data Modeling, organizations can turn their data into their most significant competitive advantage.

Beyond the UI: Mastering Airflow 3 with Bare-Metal Postgres and TaskFlow

Damaa-C — Sat, 25 Apr 2026 19:37:47 +0000

In the world of Data Engineering, there is a temptation to rely entirely on "magic": the UI buttons and high-level abstractions that hide how things work. However, when a pipeline fails or a scheduler hits an AirflowTaskTimeout, the engineer who understands the "bare metal" is the one who fixes it.

In this guide, we are going back to basics: configuring Airflow 3 via the .cfg, setting up a production-grade Aiven Postgres bridge, and demystifying the mechanics of XComs through the lens of kwargs.

The Foundation: Hard-Coding Your Database

Airflow is not just a scheduler; it is a database-backed application. Before writing a single DAG, your metadata environment must be solid.

Preparing the Handshake

Whether you are using a local instance or a managed cloud provider like Aiven, your Postgres environment needs a dedicated identity. Isolation is key to security in Data Engineering.

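-- Execute in your Postgres terminal to set up the Airflow Backend
CREATE USER airflow_user WITH PASSWORD 'secure_password';
CREATE DATABASE airflow_db;
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;
-- PostgreSQL 15+ no longer lets every user create objects in the public schema,
-- so connect to airflow_db (\c airflow_db in psql) and grant that as well:
GRANT ALL ON SCHEMA public TO airflow_user;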

Installing the Translators (Drivers)

Airflow doesn't speak "Postgres" natively; it uses Python drivers.

psycopg2-binary: The standard synchronous driver for most operations.
asyncpg: Essential for Airflow 3’s asynchronous capabilities and high-performance execution loops.

pip install psycopg2-binary asyncpg

Configuration as Code: The airflow.cfg

While the Airflow UI is convenient, defining your setup in airflow.cfg and environment variables follows the Configuration as Code (CaC) principle. This makes your environment reproducible, portable, and easier to manage in a CI/CD pipeline.

Locate your airflow.cfg (usually in ~/airflow/) and find the [database] section. Note that airflow.cfg has no [connections] section: connections to external systems are read from AIRFLOW_CONN_<CONN_ID> environment variables, the metadata database, or a secrets backend.

[database]
# Pointing Airflow's own metadata to your local or remote Postgres
sql_alchemy_conn = postgresql+psycopg2://airflow_user:password@localhost:5432/airflow_db

# Shell environment (not airflow.cfg): an external Aiven Postgres connection via URI
# Note: 'sslmode=require' is critical for cloud security
export AIRFLOW_CONN_AIVEN_PROD='postgres://avnadmin:pass@pg-damaa.aivencloud.com:24848/defaultdb?sslmode=require'
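A connection URI packs several fields into one string. As a quick sanity check, the standard library can split the URI from the example into its parts. The dictionary keys below are just labels chosen for this sketch (loosely echoing Airflow's connection fields), not an Airflow API.

```python
from urllib.parse import urlsplit, parse_qs

# URI copied from the configuration example.
uri = "postgres://avnadmin:pass@pg-damaa.aivencloud.com:24848/defaultdb?sslmode=require"

parts = urlsplit(uri)
options = parse_qs(parts.query)

connection = {
    "conn_type": parts.scheme,
    "login": parts.username,
    "host": parts.hostname,
    "port": parts.port,
    "schema": parts.path.lstrip("/"),  # database name
    "sslmode": options.get("sslmode", [None])[0],
}
print(connection)
```

If the port or `sslmode` comes back as None, the URI is malformed and the connection would fail or fall back to an unencrypted default.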

The Data Bridge: XComs Decoded

In Airflow, tasks run in isolation. They cannot share Python variables in memory. To move small amounts of data (metadata, IDs, or status flags) between tasks, we use **XComs** (Cross-Communications).

The "Old School" Manual Way: kwargs['ti']

In traditional PythonOperator development, every function receives a "suitcase" called kwargs. Inside this suitcase is the Task Instance (ti), which acts as your API to the XCom table in Postgres.

def extract_ticker_data(**kwargs):
    # Pull the Task Instance from the context suitcase
    ti = kwargs['ti'] 
    scraped_data = {"ticker": "BTC", "price": 64000}

    # Manually pushing into the Postgres xcom table
    ti.xcom_push(key='raw_crypto_data', value=scraped_data)

def transform_ticker_data(**kwargs):
    ti = kwargs['ti']
    # Manually reaching into the xcom table to pull data from a specific task
    data = ti.xcom_pull(task_ids='extract_task', key='raw_crypto_data')

    processed_price = data['price'] * 130
    ti.xcom_push(key='price_kes', value=processed_price)

The "Modern" Way: TaskFlow API

Airflow 3 emphasizes the TaskFlow API (@task). Here, the complexity of ti.xcom_pull is abstracted away. Airflow treats the return value of a function as an implicit XCom push.

from airflow.sdk import task  # Airflow 3 import path for the TaskFlow decorator

@task
def extract():
    return {"ticker": "BTC", "price": 64000} # Automatic XCom Push to 'return_value'

@task
def transform(data):
    # Airflow automatically pulls the XCom and passes it as 'data'
    return data['price'] * 130

The "Double-Write" Conflict: Why return and xcom_push Clash

This is the most critical technical nuance for any Airflow developer. You cannot use ti.xcom_push and return for the same value without consequences.

The Technical Conflict

When you use TaskFlow, Airflow maps the return statement to a specific XCom key: return_value.

If you write:

@task
def my_task(**kwargs):
    ti = kwargs['ti']
    data = "Success_Flag"

    ti.xcom_push(key='return_value', value=data) # Manual Write #1
    return data                                  # Automatic Write #2

What happens in Postgres?

Redundant Writes: Airflow writes to the xcom table twice for the exact same task and key, and the second write simply replaces the first.
Wasted Overhead: Every execution of the task pays for an extra metadata write that stores nothing new.
Contention: In high-concurrency environments, these redundant writes add avoidable load on your Postgres backend, which can contribute to the very "timeouts" you want to avoid.
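The effect can be modelled with a toy store. This is not Airflow's implementation: `ToyXComStore` and `run_task` are hypothetical and only mimic the idea that a task's return value is pushed under the `return_value` key. The counters show that the double-write pattern performs two writes for a single stored value.

```python
class ToyXComStore:
    """In-memory stand-in for the xcom table, keyed by (task_id, key)."""

    def __init__(self):
        self.rows = {}
        self.write_count = 0

    def push(self, task_id, key, value):
        self.write_count += 1              # every push costs one write
        self.rows[(task_id, key)] = value  # same key: the later write wins

    def pull(self, task_id, key="return_value"):
        return self.rows.get((task_id, key))


def run_task(store, task_id, func):
    """Run func, then store its return value the way TaskFlow does implicitly."""
    result = func(lambda key, value: store.push(task_id, key, value))
    if result is not None:
        store.push(task_id, "return_value", result)


def double_write(xcom_push):
    xcom_push("return_value", "Success_Flag")  # manual write
    return "Success_Flag"                      # implicit write


def single_write(xcom_push):
    return "Success_Flag"


double_store, single_store = ToyXComStore(), ToyXComStore()
run_task(double_store, "my_task", double_write)
run_task(single_store, "my_task", single_write)

print(double_store.write_count, len(double_store.rows))  # 2 1
print(single_store.write_count, len(single_store.rows))  # 1 1
```

Both variants leave the same value for downstream tasks to pull, so the manual push buys nothing.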

The Best Practice

Use return for the primary output of your task. It is cleaner and optimized for TaskFlow.
Use ti.xcom_push ONLY if you need to push additional, separate pieces of metadata (e.g., a row count and a file path) that are not the main return object.

Conclusion: Engineering for Performance

To build resilient pipelines in Airflow 3, you must respect the metadata database. By defining your database and connection settings as code and understanding the "double-write" conflict of XComs, you help keep your Postgres backend lean and your scheduler fast.

Mastering the manual kwargs['ti'] gives you the control; mastering the return statement gives you the efficiency.

The Great Data Debate: Why Your Pipeline Choice Could Make or Break Your Insights

Damaa-C — Mon, 13 Apr 2026 22:09:23 +0000

Introduction

Imagine you are a chef in a high-end restaurant. You have two ways to run your kitchen. In the first scenario, you wash, peel, and chop every vegetable the moment it arrives from the farm, organizing everything perfectly into containers before it ever touches the fridge. In the second scenario, you shove everything into a massive walk-in freezer immediately and only pull out and prep what you need when a customer actually places an order.

In the world of Data Engineering, this is exactly the difference between ETL and ELT. One is about preparation; the other is about storage and speed.

1. ETL (Extract, Transform, Load)

The Perfectionist: ETL is the "traditional" way. It was born in an era when server space was expensive and hard to find. You couldn't afford to store "messy" data, so you cleaned it before it landed in your database.

Extract: Grab the data from the source (e.g., an Excel file or an API).

Transform: Use a "Processing Engine" (like a Python script) to clean it, fix dates, and remove errors.

Load: Save the clean, polished data into your database.

Real-World Example: The "Daily Sales" Pipeline.

Think about a retail store like Old Mutual or a local supermarket. They have thousands of transactions a day.

The Problem: The cash register records everything, including errors, canceled orders, and employee test transactions.

The ETL Solution: A Python script runs every night. It filters out the "canceled" orders, converts the currency to a standard format, and calculates the total profit per store. Only that "Total Profit" number is saved to the final database.

The Benefit: The database stays small and incredibly fast for the management team to check.

2. ELT (Extract, Load, Transform)

The Speed Demon: ELT is the "modern" way, powered by the Cloud. Since storage is now cheap and cloud processors are incredibly fast, we don't wait to clean the data. We "dump" it all in first and figure it out later.

Extract: Grab the raw data.

Load: Push that raw, messy data directly into a "Data Lake" or Cloud Warehouse.

Transform: When you actually need a report, you use SQL to clean the data inside the warehouse.

Real-World Example: The "Binance Crypto" Pipeline.
Imagine you are tracking Bitcoin prices on Binance.

The Problem: The market moves every millisecond. If you stop to "clean" the data before saving it, you might miss a price spike.

The ELT Solution: You set up a pipeline that copies every single "tick" (price change) directly into a cloud warehouse like BigQuery.

The Benefit: A year from now, if a data scientist asks, "What was the exact price at 2:03 AM on a Tuesday?", you have the raw data ready. In an ETL world, you probably would have averaged that data out and lost the detail.

Key Differences

ETL

Clean it before you store it.
Best for smaller data and high security/privacy needs.
Tools: Python (Pandas), Apache Airflow.
Maintenance: if the source data changes, the pipeline breaks.

ELT

Store it, then clean what you need.
Best for massive "Big Data" and cloud computing.
Tools: dbt (data build tool), Snowflake.
Maintenance: if the source data changes, you just update your SQL.
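The two patterns can be compared side by side in a few lines. The sketch below uses in-memory SQLite databases as stand-ins for the target database and the cloud warehouse, with made-up order records. Only the order of the steps is the point.

```python
import sqlite3

# Made-up register records; "status" marks canceled orders.
raw_orders = [
    {"order_id": 1, "amount": 100.0, "status": "completed"},
    {"order_id": 2, "amount": 40.0, "status": "canceled"},
    {"order_id": 3, "amount": 60.0, "status": "completed"},
]

# --- ETL: transform in Python first, load only the cleaned result ---
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE daily_sales (total REAL)")
clean_total = sum(o["amount"] for o in raw_orders if o["status"] != "canceled")
etl_db.execute("INSERT INTO daily_sales VALUES (?)", (clean_total,))

# --- ELT: load everything raw, transform later with SQL inside the store ---
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, status TEXT)")
elt_db.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :amount, :status)", raw_orders
)
elt_total = elt_db.execute(
    "SELECT SUM(amount) FROM raw_orders WHERE status != 'canceled'"
).fetchone()[0]

etl_rows = etl_db.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
elt_rows = elt_db.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(clean_total, elt_total)  # 160.0 160.0
print(etl_rows, elt_rows)      # 1 3
```

Both routes give the same total, but only the ELT store can still answer a new question later, such as how many orders were canceled.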

3. Which One Should You Use?

Deciding between these two isn't about which technology is "newer"; it's about your resources and goals.

Choose ETL if:

Privacy is King: You need to remove sensitive info (like customer names) before storing the data.
Limited Hardware: You are working on a local machine (like a VM with 2GB RAM) and can't afford to store terabytes of "messy" data.
Stability: Your data sources rarely change, and you want a very predictable database.

Choose ELT if:

You're in the Cloud: You have access to AWS, Google Cloud, or Azure.
You Want Agility: You aren't 100% sure what questions you'll need to answer in six months, so you want to keep all the raw details.
Scalability: You are dealing with "Big Data" that is too large for a single Python script to process efficiently.

Conclusion

Whether you are building a modular Python pipeline for Binance or a massive corporate data hub, the goal is the same: turning raw noise into clear signals.
ETL is your precision tool: it keeps things lean and secure.
ELT is your power tool: it keeps things flexible and fast. Most modern data engineers are moving toward ELT because it allows them to be more "agile," but understanding the "clean-as-you-go" logic of ETL remains the most important foundational skill you can have.

Data, Community, and the Cutting Edge: My Journey into Fedora

Damaa-C — Tue, 07 Apr 2026 19:38:15 +0000

Introduction

As I embark on the Outreachy 2026 application journey, I’ve had to look beyond my terminal and into the heart of the ecosystem I’m contributing to. While many know Fedora Linux as a high-performance operating system, I’ve discovered that the Fedora Project is something much larger: a global community of innovators, developers, and advocates dedicated to the future of free and open-source software. In this post, I’ll share what I’ve learned about this vibrant community and how you can join us.

The Foundations of Innovation

The Fedora Project isn't just about code; it is guided by a philosophy known as the Four Foundations. These principles act as the **"North Star"** for every decision made in the community:

Friends: We are a global family. Respect and collaboration are at the core of every interaction.

Freedom: We are committed to free software and content, ensuring that tools remain accessible to everyone.

Features: Fedora is a powerhouse of innovation, constantly striving to provide the best and most modern software features.

First: We are pioneers. Fedora is often the first to adopt new technologies that eventually become industry standards.

Getting Your Passport: The Fedora Account (FAS)

Before you can submit a line of code or a data cleaning script, you need your "passport" to the community: the Fedora Account (FAS). Here is how you get started:

Register: Visit accounts.fedoraproject.org and create your unique FAS ID.

Verify: Confirm your email and set up Two-Factor Authentication (2FA) to keep your contributions secure.

The FPCA: Sign the Fedora Project Contributor Agreement. This ensures all contributions remain open-source and protected.

Reflections on the Experience

What I find most interesting about Fedora is the sheer transparency. From the Fedora Council to the Special Interest Groups (SIGs), the "brain" of the project lives in the open on the Fedora Discussion forums.

Coming from a background focused on data and system performance, specifically working with tools like Python, Pandas, and Fedora XFCE in virtual environments, the technical scale was initially a bit confusing. Navigating Matrix rooms and mailing lists takes time, but the **"Friends"** foundation ensures that no one stays lost for long.

Conclusion

If you are an Outreachy 2027 applicant reading this next year, my advice is simple: dive in early. Your FAS ID is your ticket to a world of innovation. Fedora is more than just a platform for data engineering; it is a place where you can grow alongside the technology you’re building. Don't be afraid to ask questions, contribute where you can, and embrace the "First" mentality.

How to Connect Power BI to a SQL (PostgreSQL) Database and Build a Unified Dashboard

Damaa-C — Thu, 19 Mar 2026 06:46:00 +0000

Introduction

Power BI is a business intelligence (BI) and data visualization tool from Microsoft. It enables analysts and business users to transform raw data into interactive dashboards and reports. Companies use Power BI to analyze sales, customer behavior, inventory trends, and other critical business metrics.

SQL databases, like PostgreSQL, are widely used to store and manage structured data. Connecting Power BI to SQL databases allows analysts to retrieve, clean, transform, and model data efficiently, creating a single source of truth for dashboards and KPIs.

In this article, we’ll walk through:

Connecting Power BI to local and cloud PostgreSQL databases
Loading raw data and applying transformations
Combining datasets from multiple sources
Adding a Date Dimension table (DimDate) for KPIs and filtering
Preparing a dashboard-ready model for GitHub and reporting

What You Need Before Connecting

Power BI Desktop installed
PostgreSQL database connection details (server, port, database, username, password)
For cloud connections: SSL certificate (CA certificate)

Connecting to a Local PostgreSQL Database

Open Power BI Desktop.
Click Get Data → PostgreSQL database.

Enter your connection info:

Server: localhost:5432 (replace 5432 with your PostgreSQL port)

Database: Name of your database

Enter your username and password.

Click OK to connect.

Note: The localhost:port syntax ensures Power BI connects to the correct PostgreSQL server port.

In the Navigator window, select tables to load, then click Load (or Transform Data).

Connecting to a Cloud PostgreSQL Database (Aiven or Similar)

Cloud databases require secure SSL connections.

Guide to Aiven Connection

From your cloud provider, obtain:

Host
Port
Database name
Username and password
SSL certificate (CA certificate)

Install the CA certificate on your system:

Open the downloaded .crt or .pem file.
Install it in the trusted root certificate store:

Windows: Right-click → Install Certificate → Place in “Trusted Root Certification Authorities”
Linux/macOS: Follow OS instructions to add to the system or user trust store

Installing the CA certificate allows Power BI to validate the server’s identity and establish a secure connection.

In Power BI Desktop:

Go to Get Data → PostgreSQL database
Enter Server

your-cloud-host:5432

and Database

Check Use SSL certificate and select the installed certificate
Enter your username and password

If the connection succeeds, the Navigator window lists the tables in your Aiven cloud database.
Choose tables from your database

Click OK and load the tables

Load Raw Data First

It’s important to load raw data before applying transformations:

Load all tables (customers, products, sales, inventory) first.
Review the raw data for missing values, incorrect or inconsistent formats, and unnecessary columns.
This helps you understand which fields need cleaning and transformation before building dashboards.

Filtering and Transformation

Use Power BI’s Power Query Editor for data preparation:

Remove unnecessary columns – keep only fields needed for analysis.
Rename columns – make names descriptive.
Filter rows – e.g., remove canceled sales or test records.
Change data types – ensure numeric and date fields are correct.
Create calculated columns – e.g., total sales = quantity × price.
Handle missing values – fill, replace, or remove nulls.
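To make the effect of these steps concrete, here is a minimal Python sketch that applies the same ideas to a few raw rows. It is only an illustration of the logic, not what Power Query runs internally. The column names (`qty`, `unit_price`, `status`, `internal_note`), the sample values, and the choice to replace a missing price with 0.0 are all assumptions made for this example.

```python
# Hypothetical raw rows, as they might arrive from a sales table.
raw_sales = [
    {"sale_id": "1", "qty": "2", "unit_price": "799.99", "status": "completed", "internal_note": "x"},
    {"sale_id": "2", "qty": "1", "unit_price": None, "status": "completed", "internal_note": ""},
    {"sale_id": "3", "qty": "3", "unit_price": "199.99", "status": "canceled", "internal_note": ""},
]


def clean_sales(rows, default_price=0.0):
    """Apply the preparation steps in order and return the cleaned rows."""
    cleaned = []
    for row in rows:
        # Filter rows: drop canceled sales.
        if row["status"] == "canceled":
            continue
        # Handle missing values: replace a null price with a default.
        price_value = row["unit_price"] if row["unit_price"] is not None else default_price
        # Change data types: text to integer and float.
        quantity = int(row["qty"])
        price = float(price_value)
        cleaned.append({
            # Remove unnecessary columns by keeping only what is needed,
            # and rename columns to descriptive names.
            "sale_id": int(row["sale_id"]),
            "quantity_sold": quantity,
            "unit_price": price,
            # Calculated column: total sales = quantity x price.
            "total_sales": round(quantity * price, 2),
        })
    return cleaned


cleaned_sales = clean_sales(raw_sales)
for row in cleaned_sales:
    print(row)
```

The canceled sale is filtered out, the `internal_note` column is dropped, and the row with a missing price gets a total of 0.0. In a real model you would decide per column whether filling, replacing, or removing nulls is appropriate.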

Creating Relationships Between Tables

Once data is cleaned:

Switch to Model View.
Create relationships:

sales.customer_id → customers.customer_id
sales.product_id → products.product_id
inventory.product_id → products.product_id

Validate relationships for correct cardinality and cross-filtering.

Relationships ensure Power BI can accurately summarize data across multiple tables.
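One way to think about validating cardinality is as two checks: the key on the "one" side must be unique, and every key on the "many" side should exist on the "one" side. The short Python sketch below illustrates that idea with hypothetical key values; the function name `check_one_to_many` is made up for this example and is not part of Power BI.

```python
# Hypothetical key values from a customers table and a sales table.
customer_ids = [1, 2, 3, 4, 5]
sales_customer_ids = [1, 2, 3, 4, 5]


def check_one_to_many(one_side_keys, many_side_keys):
    """Return (is_unique, orphans) for a one-to-many relationship."""
    # The "one" side must not contain duplicate keys.
    is_unique = len(one_side_keys) == len(set(one_side_keys))
    # Every key on the "many" side should exist on the "one" side.
    orphans = sorted(set(many_side_keys) - set(one_side_keys))
    return is_unique, orphans


healthy = check_one_to_many(customer_ids, sales_customer_ids)
broken = check_one_to_many([1, 2, 2], [1, 2, 9])
print(healthy)  # (True, [])
print(broken)   # (False, [9])
```

A duplicated key on the "one" side usually means the relationship would be detected as many-to-many, and orphan keys are sales rows that no dimension row can describe.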

Data Modeling Basics

Star schema
Fact table (sales) connected to dimension tables (customers, products, inventory).
Cardinality
One-to-many or many-to-one relationships.
Filter directions
Define single or bidirectional filters depending on your analysis needs.

Proper modeling ensures dashboards are interactive and accurate.

Why SQL Skills Matter for Power BI Analysts

Data extraction: Retrieve exactly the data you need.
Filtering & aggregation: Pre-process data for better performance.
Data cleaning & transformation: Prepare data before loading it into Power BI.
Performance optimization: Efficient queries improve dashboard responsiveness.

By combining SQL knowledge with Power BI, analysts can build powerful, reliable dashboards that drive business decisions.

Conclusion

Connecting PostgreSQL to Power BI empowers analysts to turn structured data into actionable insights. Whether using local or cloud databases, establishing secure and reliable connections ensures that the data you analyze is accurate and up-to-date. By first loading raw data, applying necessary transformations, and creating relationships, including a Date table for time-based analysis, you build a robust data model ready for interactive dashboards.

Mastering this workflow not only simplifies KPI calculations and trend analysis but also highlights the importance of SQL skills for preparing and managing data. Ultimately, integrating PostgreSQL with Power BI provides a seamless bridge between data storage and visualization, enabling smarter, data-driven business decisions.

Mastering SQL Joins and Window Functions: A Practical Guide with Example Data

Damaa-C — Fri, 06 Mar 2026 19:53:54 +0000

Introduction

SQL is a powerful language for managing and analyzing relational databases. Two essential concepts for data analysis are Joins and Window Functions.

Joins allow you to combine rows from multiple tables based on related columns.

Window Functions perform calculations across a set of rows related to the current row, enabling ranking, cumulative sums, moving averages, and more.

In this guide, we’ll create a sample database with customers, products, sales, and inventory tables, populate them with data, and demonstrate joins and window functions with real examples.

Create Database and Schema

-- Create a new database
CREATE DATABASE business_db;

-- Connect to the database (\c is a psql meta-command, so run it from psql)
\c business_db

-- Create schema
CREATE SCHEMA assignment;

-- Set schema for this session
SET search_path TO assignment;

-- Verify schema
SHOW search_path;

Create Tables and Insert Data

Customers Table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    phone_number VARCHAR(50),
    registration_date DATE,
    membership_status VARCHAR(10)
);

INSERT INTO customers 
(customer_id, first_name, last_name, email, phone_number, registration_date, membership_status) 
VALUES
(1, 'Karen', 'Molina', 'gonzalezkimberly@glass.com', '(728)697-1206', '2020-08-27', 'Bronze'),
(2, 'Elizabeth', 'Archer', 'tramirez@gmail.com', '778.104.6553', '2023-08-28', 'Silver'),
(3, 'Roberta', 'Massey', 'davislori@gmail.com', '+1-365-606-7458x399', '2024-06-12', 'Bronze'),
(4, 'Jacob', 'Adams', 'andrew72@hotmail.com', '246-459-1425x462', '2023-02-10', 'Gold'),
(5, 'Cynthia', 'Lowery', 'suarezkiara@ramsey.com', '001-279-688-8177x4015', '2020-11-13', 'Silver');

Products Table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2),
    supplier VARCHAR(100),
    stock_quantity INT
);

INSERT INTO products
(product_id, product_name, category, price, supplier, stock_quantity)
VALUES
(1, 'Laptop', 'Electronics', 999.99, 'Dell', 50),
(2, 'Smartphone', 'Electronics', 799.99, 'Samsung', 150),
(3, 'Washing Machine', 'Appliances', 499.99, 'LG', 30),
(4, 'Headphones', 'Accessories', 199.99, 'Sony', 100),
(5, 'Refrigerator', 'Appliances', 1200.00, 'Whirlpool', 40);

Sales Table
CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    quantity_sold INT,
    sale_date DATE,
    total_amount DECIMAL(10,2),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

INSERT INTO sales
(sale_id, customer_id, product_id, quantity_sold, sale_date, total_amount)
VALUES
(1, 1, 1, 1, '2023-07-15', 999.99),
(2, 2, 2, 2, '2023-08-20', 1599.98),
(3, 3, 3, 1, '2023-09-10', 499.99),
(4, 4, 4, 3, '2023-07-25', 599.97),
(5, 5, 5, 1, '2023-06-18', 1200.00);

Inventory Table
CREATE TABLE inventory (
    product_id INT PRIMARY KEY,
    stock_quantity INT,
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

INSERT INTO inventory
(product_id, stock_quantity)
VALUES
(1, 50),
(2, 150),
(3, 30),
(4, 100),
(5, 40);

SQL Joins

INNER JOIN

Returns only rows with matching values in both tables.

SELECT c.first_name, c.last_name, s.total_amount
FROM customers c
INNER JOIN sales s ON c.customer_id = s.customer_id;

This shows customers who have made purchases.

LEFT JOIN

Returns all rows from the left table, with NULLs for unmatched right table rows.

SELECT c.first_name, c.last_name, s.total_amount
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id;

This shows all customers, even those with no purchases.

SELF JOIN

Join a table to itself.

SELECT c.first_name AS customer1, m.first_name AS customer2, c.membership_status
FROM customers c
INNER JOIN customers m ON c.membership_status = m.membership_status
WHERE c.customer_id < m.customer_id;

This finds pairs of customers with the same membership status.
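In the sample data every customer has exactly one sale, so the INNER JOIN and LEFT JOIN queries above return the same five rows. To see the difference, the Python sketch below uses the standard-library `sqlite3` module (SQLite rather than PostgreSQL) with a trimmed-down copy of the tables and adds one hypothetical customer, Noah with `customer_id` 6, who has no sales. The simplified column types are assumptions made to keep the example short.

```python
import sqlite3

# In-memory database with simplified versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'Karen'), (2, 'Elizabeth'), (6, 'Noah');
    INSERT INTO sales VALUES (1, 1, 999.99), (2, 2, 1599.98);
""")

# INNER JOIN keeps only customers that have a matching sale.
inner_rows = conn.execute("""
    SELECT c.first_name, s.total_amount
    FROM customers c
    INNER JOIN sales s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id
""").fetchall()

# LEFT JOIN keeps every customer and fills missing sales with NULL (None).
left_rows = conn.execute("""
    SELECT c.first_name, s.total_amount
    FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

Noah appears only in the LEFT JOIN result, with `None` in place of `total_amount`.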

Window Functions

RANK()

Assigns ranks to rows in a partition.

SELECT 
    c.first_name,
    c.last_name,
    SUM(s.total_amount) AS total_spent,
    RANK() OVER(ORDER BY SUM(s.total_amount) DESC) AS customer_rank
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.first_name, c.last_name;

DENSE_RANK()

Assigns ranks to rows without gaps for ties.

SELECT 
    c.first_name,
    c.last_name,
    SUM(s.total_amount) AS total_spent,
    DENSE_RANK() OVER(ORDER BY SUM(s.total_amount) DESC) AS dense_rank
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.first_name, c.last_name;

Difference from RANK(): If two customers tie for 1st place, RANK() skips 2, giving the next rank as 3, while DENSE_RANK() gives 2.
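The sample sales data contains no ties, so both functions return ranks 1 to 5 there. The following Python sketch mimics the two numbering rules on hypothetical totals with a tie for first place. It is a plain re-implementation for illustration, not how the database computes ranks.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values sorted descending."""
    ordered = sorted(values, reverse=True)
    results = []
    rank = dense = 0
    previous = None
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # RANK jumps to the row position after a tie
            dense += 1        # DENSE_RANK always advances by one
            previous = value
        results.append((value, rank, dense))
    return results


# Hypothetical customer totals with a tie for first place.
ranked = rank_and_dense_rank([1500, 1500, 1200, 900])
for value, rank, dense in ranked:
    print(value, rank, dense)
```

The two tied customers both get rank 1; the next customer is ranked 3 by the RANK rule and 2 by the DENSE_RANK rule.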

ROW_NUMBER()

Assigns a unique sequential number to rows.

SELECT 
    c.first_name,
    c.last_name,
    s.sale_date,
    ROW_NUMBER() OVER(PARTITION BY c.customer_id ORDER BY s.sale_date) AS purchase_order
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id;

Cumulative SUM()

Calculate running totals without collapsing rows.

SELECT 
    p.product_name,
    s.sale_date,
    SUM(s.quantity_sold) OVER(PARTITION BY p.product_id ORDER BY s.sale_date) AS cumulative_sales
FROM products p
JOIN sales s ON p.product_id = s.product_id;
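Each product in the sample data has a single sale, so the running total equals the quantity sold. The Python sketch below shows what the window does when a product has several sales, using hypothetical rows. It assumes no two sales of the same product share a date; with tied dates the SQL default window frame treats the tied rows as peers, which this simple loop does not reproduce.

```python
from collections import defaultdict

# Hypothetical (product_name, sale_date, quantity_sold) rows.
sales_rows = [
    ("Laptop", "2023-07-15", 1),
    ("Headphones", "2023-07-25", 3),
    ("Laptop", "2023-08-02", 2),
    ("Headphones", "2023-09-01", 1),
    ("Laptop", "2023-09-10", 4),
]


def running_totals(rows):
    """Mimic SUM(...) OVER (PARTITION BY product ORDER BY sale_date)."""
    totals = defaultdict(int)
    output = []
    # ISO-formatted dates sort correctly as strings.
    for product, sale_date, quantity in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[product] += quantity   # the total restarts for each product
        output.append((product, sale_date, totals[product]))
    return output


cumulative = running_totals(sales_rows)
for row in cumulative:
    print(row)
```

Every input row is still present in the output, which is the key difference from a GROUP BY aggregate.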

Top Customers per Membership Tier

SELECT 
    c.first_name,
    c.last_name,
    c.membership_status,
    SUM(s.total_amount) AS total_spent,
    RANK() OVER(PARTITION BY c.membership_status ORDER BY SUM(s.total_amount) DESC) AS tier_rank
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.first_name, c.last_name, c.membership_status
ORDER BY c.membership_status, tier_rank;

Sample Queries for Analysis

Total Sales Per Product

SELECT p.product_name, SUM(s.quantity_sold) AS total_sales
FROM products p
JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_name
ORDER BY total_sales DESC;

Customers with Purchases Over $1000


SELECT first_name, last_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM sales WHERE total_amount > 1000);

Products Low in Stock

SELECT p.product_name, i.stock_quantity
FROM products p
JOIN inventory i ON p.product_id = i.product_id
WHERE i.stock_quantity < 50;

Conclusion

Joins combine data across tables and are essential for querying normalized databases.

Window Functions perform calculations over a set of rows without collapsing them, enabling ranking, cumulative totals, and analytics within groups.

Together, they allow powerful data analysis, such as finding top customers, cumulative sales trends, and product performance.

How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI

Damaa-C — Tue, 10 Feb 2026 12:21:30 +0000

Introduction

In real-world analytics projects, data is rarely clean or analysis-ready. Analysts often receive data from multiple sources with missing values, inconsistent formats, duplicates, and unclear relationships. Power BI provides an end-to-end analytics platform that enables analysts to clean messy data, build strong data models, write meaningful DAX measures, and design dashboards that translate insights into action. This article explains how analysts achieve this process using Power BI, with reference to a Hospital and Pharmacy dataset.

Developing the Analytics Mindset

Effective analysis begins with the right mindset. Analysts must understand the business problem before working with the data. In a hospital and pharmacy environment, decision-makers may want answers to questions such as: How many patients are visiting the hospital? Which departments are busiest? Which drugs are most prescribed? Power BI is used not just to visualize data, but to support informed decision-making through evidence-based insights.

Cleaning and Transforming Messy Data Using Power Query

Messy data is one of the biggest challenges in analytics. Hospital datasets often contain duplicate patient records, inconsistent date formats, missing department names, and incorrect data types. Power Query is used to clean and transform this data before analysis begins.

Using Power Query, analysts:

Remove duplicate and irrelevant records
Standardize column names
Convert data types to correct formats
Handle missing or null values
Filter data to retain only what is necessary

These steps ensure that the dataset is accurate, consistent, and reliable. Importantly, Power Query transformations are repeatable, meaning the same cleaning steps can be applied when new data is added.

Data Modeling and Relationships

After data cleaning, analysts build a data model. A well-designed data model improves report performance and ensures accurate calculations. Best practices such as separating fact tables from dimension tables and using a star schema are applied.

In the Hospital and Pharmacy dataset:

Fact tables include patient visits and pharmacy transactions
Dimension tables include dates, departments, diseases, and drugs

Clear relationships between these tables allow Power BI visuals and DAX measures to behave correctly across filters and slicers.

Using DAX to Create Business Metrics

DAX (Data Analysis Expressions) enables analysts to create calculated measures that answer specific business questions. Unlike basic calculations, DAX measures are dynamic and respond to user interaction within reports.

Examples of DAX measures used in the analysis include:

Total Patient Visits

Total Patient Visits = COUNT('PatientVisits'[VisitID])

Total Pharmacy Sales

Total Pharmacy Sales = SUM('PharmacySales'[TotalAmount])

Average Daily Patient Visits


Average Daily Visits =
AVERAGEX(
    VALUES('Date'[Date]),
    [Total Patient Visits]
)

Total Prescriptions

Total Prescriptions = SUM('PharmacySales'[Quantity])

These measures help quantify hospital activity, track pharmacy performance, and identify trends over time. By using DAX, analysts move beyond raw data to meaningful metrics that support decision-making.
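To show the arithmetic behind Total Patient Visits and Average Daily Visits, here is a small Python sketch on hypothetical visit dates. It evaluates the visit count once per date and then averages those counts, which is the idea behind iterating over the dates. The sketch averages only over dates that have at least one visit; how empty days should be treated is an assumption you would need to confirm for your own model.

```python
from collections import Counter
from datetime import date

# Hypothetical visit dates, one entry per VisitID.
visit_dates = [
    date(2026, 1, 5), date(2026, 1, 5), date(2026, 1, 5),
    date(2026, 1, 6),
    date(2026, 1, 7), date(2026, 1, 7),
]

# Total Patient Visits: count of visit rows.
total_visits = len(visit_dates)

# Average Daily Visits: count visits per date, then average the counts.
visits_per_day = Counter(visit_dates)
average_daily_visits = sum(visits_per_day.values()) / len(visits_per_day)

print(total_visits, average_daily_visits)
```

With six visits spread over three days, the total is 6 and the daily average is 2.0.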

Selecting Appropriate Visuals

Choosing the right visuals is critical for effective communication. Analysts select visuals based on the type of insight they want to present. In Power BI:

Line charts show patient visit trends over time
Bar charts compare departments, diseases, or drugs
KPI cards highlight key metrics such as total visits and sales
Tables and matrices provide detailed breakdowns

The focus is on clarity and simplicity, ensuring that insights are easily understood by stakeholders.

Dashboard Design and Data Storytelling

Dashboards are not just collections of charts; they tell a story. Analysts design dashboards to guide users from high-level summaries to more detailed insights. Layout, spacing, and logical flow are carefully considered.

In the hospital dashboard, users can first view overall patient volumes and pharmacy sales, then drill down into department performance and disease patterns. This storytelling approach allows decision-makers to quickly identify issues and opportunities.

Translating Insights into Action

The ultimate goal of analytics is action. Insights generated from Power BI dashboards enable hospital management to:

Allocate staff to high-demand departments
Monitor disease trends for better planning
Optimize pharmacy stock levels
Improve operational efficiency and service delivery

By translating data into insights, Power BI supports informed, data-driven decisions.

Power BI enables analysts to transform messy data into actionable insights through a structured process of data cleaning, modeling, DAX calculations, and dashboard design. By combining technical skills with an analytics mindset, analysts bridge the gap between raw hospital and pharmacy data and real-world decisions. This approach supports better planning, efficiency, and outcomes in healthcare environments.

Practical Data Modeling in Power BI: Star and Snowflake Schemas Explained

Damaa-C — Tue, 03 Feb 2026 10:17:47 +0000

Introduction

In Power BI projects, many reporting issues such as slow performance, incorrect totals, or complex DAX formulas often stem from one root cause: poor data modeling. While visuals and measures usually get most of the attention, the data model is the true foundation of any reliable analytical solution.

Data modeling is the process of structuring data in a way that supports efficient analysis, accurate relationships, and meaningful insights. In Power BI, this typically means organizing data into fact tables and dimension tables, following proven data warehousing principles.

As Ralph Kimball explains in The Data Warehouse Toolkit:

Dimensions provide the 'who, what, when, where, why, and how' context surrounding business process events.

This article provides a practical, beginner-friendly guide to data modeling in Power BI, using Sales and Fact Budget CSV datasets as working examples. The explanations are guided by concepts from Ralph Kimball’s The Data Warehouse Toolkit and practical demonstrations inspired by the Pragmatic Works Power BI Data Modeling video.

Why Data Modeling Is Important in Power BI

Improved report performance
Simpler and more readable DAX measures
Accurate filtering and aggregations
Easier maintenance and scalability
Consistent business logic across reports

Power BI is not just a visualization tool; it is also an analytical engine. Without a proper model, even the best visuals can produce misleading results.

Business Scenario and Datasets Used

To demonstrate practical data modeling concepts, this article uses two related fact tables: the Sales table, which holds actual transactional sales data, and the FactBudget table, which holds planned or budgeted values.

These datasets allow analysis of actual performance versus planned targets, which is a common real-world business scenario. By modeling these tables correctly, we can compare revenue against budget, calculate variances, and evaluate performance trends over time.

Both fact tables share common descriptive data such as products, dates, and markets, making them ideal for demonstrating star and snowflake schemas in Power BI.

How We Arrive at a Data Model in Power BI

Before any relationships are created in the Model view, a good data model begins in Power Query. This is where raw data is shaped, cleaned, and organized into fact and dimension tables. The steps below describe a practical, repeatable approach used in real-world Power BI projects.

Step 1: Load the Source Data

The process starts by loading the source files:

Open Power BI Desktop
Select Get Data
Choose the appropriate source (e.g., Text/CSV, Excel, or a database)

In this example, we load:

Sales data (actual transactions)

FactBudget data (planned or budget figures)

Once loaded, we select Transform Data to open Power Query.

Step 2: Identify Facts and Dimensions

With the data visible in Power Query, the next step is to determine which tables represent business processes (facts) and which attributes describe those processes (dimensions).

The Sales and FactBudget tables are kept as fact tables, while descriptive fields such as Product, Date, Market, or Department are candidates for dimension tables.

This approach follows Kimball’s principle of separating measurements from descriptive context.

Step 3: Create Dimension Tables from Fact Data

Rather than importing separate dimension files, Power BI allows dimensions to be created directly from fact tables.

For each required dimension:

Duplicate the fact table (Right-click → Duplicate)
Rename the duplicated table (e.g., DimProduct, DimMarket)
Unselect all columns, then select only the columns relevant to that dimension
Use Remove Duplicates to ensure one unique row per dimension member

For example:

DimProduct keeps only product-related columns, e.g. productID, product, category, segment, unit cost, unit price.
DimCustomer keeps only customer columns, e.g. customer ID, email, customer name, zip code, city, country, state.

This results in clean, compact dimension tables that connect efficiently to fact tables.
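The duplicate, select columns, and remove duplicates steps can be sketched in a few lines of Python. This is only an illustration of the logic with hypothetical fact rows; the function name `build_dimension` and the sample values are assumptions, not part of Power Query.

```python
# Hypothetical fact rows that still carry product attributes.
sales_fact = [
    {"sale_id": 1, "product_id": 1, "product": "Laptop", "category": "Electronics", "amount": 999.99},
    {"sale_id": 2, "product_id": 2, "product": "Smartphone", "category": "Electronics", "amount": 799.99},
    {"sale_id": 3, "product_id": 1, "product": "Laptop", "category": "Electronics", "amount": 999.99},
]


def build_dimension(fact_rows, columns):
    """Keep only the given columns and drop duplicate rows, preserving order."""
    seen = set()
    dimension = []
    for row in fact_rows:
        member = tuple(row[column] for column in columns)
        if member not in seen:
            seen.add(member)
            dimension.append(dict(zip(columns, member)))
    return dimension


dim_product = build_dimension(sales_fact, ["product_id", "product", "category"])
for row in dim_product:
    print(row)
```

Three fact rows collapse into two dimension rows, one per product, while the fact table keeps `product_id` as the column that links back to the dimension.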

Step 4: Creating a Proper Date Dimension (DimDate)

Using a dedicated Date dimension is a core data warehousing best practice. While Power BI allows the use of date columns directly from fact tables, this approach is limited and not recommended for analytical models.

A true Date dimension allows analysts to answer questions such as:

Is this date a weekday or weekend?
What month or quarter does it belong to?
How do holidays affect performance?

To achieve this, we create a Date table using Power Query, based on a function by Devin Knight.

Adding the Date Dimension Using Power Query

The steps are:

Go to Transform Data → Power Query Editor
Select Home → New Source → Blank Query
Open Advanced Editor
Paste the Date dimension function code

//Create Date Dimension
(StartDate as date, EndDate as date)=>

let
    //Capture the date range from the parameters
    StartDate = #date(Date.Year(StartDate), Date.Month(StartDate), 
    Date.Day(StartDate)),
    EndDate = #date(Date.Year(EndDate), Date.Month(EndDate), 
    Date.Day(EndDate)),

    //Get the number of dates that will be required for the table
    //(+ 1 so that EndDate itself is included in the list)
    GetDateCount = Duration.Days(EndDate - StartDate) + 1,

    //Take the count of dates and turn it into a list of dates
    GetDateList = List.Dates(StartDate, GetDateCount, 
    #duration(1,0,0,0)),

    //Convert the list into a table
    DateListToTable = Table.FromList(GetDateList, 
    Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),

    //Create various date attributes from the date column
    //Add Year Column
    YearNumber = Table.AddColumn(DateListToTable, "Year", 
    each Date.Year([Date])),

    //Add Quarter Column
    QuarterNumber = Table.AddColumn(YearNumber , "Quarter", 
    each "Q" & Number.ToText(Date.QuarterOfYear([Date]))),

    //Add Week Number Column
    WeekNumber= Table.AddColumn(QuarterNumber , "Week Number", 
    each Date.WeekOfYear([Date])),

    //Add Month Number Column
    MonthNumber = Table.AddColumn(WeekNumber, "Month Number", 
    each Date.Month([Date])),

    //Add Month Name Column
    MonthName = Table.AddColumn(MonthNumber , "Month", 
    each Date.ToText([Date],"MMMM")),

    //Add Day of Week Column
    DayOfWeek = Table.AddColumn(MonthName , "Day of Week", 
    each Date.ToText([Date],"dddd"))

in
    DayOfWeek

Rename the query (e.g., fnDimDate)

After saving the function:

Right-click the function → Invoke
Provide a Start Date and End Date
Power BI generates a full Date dimension table

This approach is based on the method described by Devin Knight in Creating a Date Dimension with Power Query.
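For readers who find it easier to follow the logic in a general-purpose language, here is a rough Python sketch of the same idea: generate one row per day and derive attributes from the date. It is not a translation of the M function. It leaves out the week number, because week-numbering rules differ between tools, and it includes both the start and the end date. The function name `build_dim_date` is an assumption for this example.

```python
from datetime import date, timedelta


def build_dim_date(start_date, end_date):
    """Return one dict per calendar day from start_date through end_date."""
    rows = []
    current = start_date
    while current <= end_date:
        rows.append({
            "Date": current,
            "Year": current.year,
            "Quarter": f"Q{(current.month - 1) // 3 + 1}",
            "Month Number": current.month,
            "Month": current.strftime("%B"),        # full month name
            "Day of Week": current.strftime("%A"),  # full weekday name
        })
        current += timedelta(days=1)
    return rows


dim_date = build_dim_date(date(2026, 1, 1), date(2026, 1, 7))
print(len(dim_date))
print(dim_date[0])
```

Month and weekday names come from `strftime`, so they follow the active locale; the default locale produces English names.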

Why Use a Date Dimension Instead of a Fact Date Column?

Relying on a raw date column from a fact table limits analytical capability. A dedicated Date dimension:

Enables advanced time intelligence
Provides consistent filtering across multiple fact tables
Allows identification of weekdays, weekends, holidays, and fiscal periods
Improves model clarity and reusability

This is why both Sales and FactBudget tables connect to the same DimDate table in the model.

Understanding Fact Tables

Fact tables store quantitative, measurable data generated by business processes. In this example:

Sales Fact Table

The Sales table contains transactional metrics such as:

Sales amount
Quantity sold
Revenue
Profit

The grain of the Sales table is defined at the level of a specific transaction, typically by product, date, and market.

FactBudget Table

The FactBudget table stores planned or forecasted metrics such as:

Budgeted sales amount
Budgeted revenue

Unlike transactional data, budget data is often recorded at a higher level (for example, monthly or by department), which influences how it is modeled.

Dimension Tables

Dimension tables contain descriptive attributes that give meaning to numeric facts. According to Ralph Kimball:

“Dimensions provide the descriptive context for facts.”

Common dimensions used in this model include:

DimDate – when the transaction occurred
DimProduct – what was sold
DimCustomer / Market – who bought it and where

Dimensions answer critical business questions:

Who made the purchase?
What product was sold?
When did it occur?
Where did it happen?
How or why did it occur (channel, promotion, etc.)?

Date Dimension Theory and Implementation

Using a dedicated Date dimension is a best practice in data modeling. Instead of relying on raw date columns from fact tables, a Date table provides:

Consistent time-based analysis
Support for time intelligence functions
Clear relationships across multiple fact tables

Typical Date Attributes

A Date dimension commonly includes:

DateKey (e.g., YYYYMMDD)
Full date
Year
Quarter
Month number
Month name

In Power BI, a Date table can be generated with DAX or with Power Query. This article uses a Date table created with the custom Power Query function described in Step 4, which is then related to both the Sales and FactBudget tables.

In the star schema, DimDate is related to the Sales fact table.

Surrogate Keys and Relationship Design

In analytical models, surrogate (index) keys are preferred over textual fields. Examples include:

DateKey
ProductKey
CustomerKey
catseg id

Surrogate keys improve:

Performance
Relationship consistency
Integration across multiple fact tables

They are especially useful when combining data from different systems or when natural keys are inconsistent or complex.
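As a minimal sketch of the idea, the Python below assigns sequential integer keys to distinct natural keys and builds a DateKey in YYYYMMDD form. The product codes and the helper names `assign_surrogate_keys` and `date_key` are hypothetical and only serve the illustration.

```python
from datetime import date


def assign_surrogate_keys(natural_keys):
    """Map each distinct natural key to a sequential integer key."""
    mapping = {}
    for key in natural_keys:
        if key not in mapping:
            mapping[key] = len(mapping) + 1
    return mapping


def date_key(day):
    """Build an integer DateKey in YYYYMMDD form."""
    return day.year * 10000 + day.month * 100 + day.day


# Hypothetical product codes arriving from a source system.
product_keys = assign_surrogate_keys(["LAP-001", "PHN-002", "LAP-001", "HDP-004"])
print(product_keys)
print(date_key(date(2026, 2, 3)))
```

The repeated code `LAP-001` maps to the same integer both times, so fact rows from different extracts end up pointing at the same dimension row.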

Star Schema

The star schema is the most common and recommended modeling pattern in Power BI.

In the Sales model:

The Sales fact table sits at the center
Dimension tables surround it
Relationships are one-to-many
Filters flow from dimensions to facts

This structure simplifies reporting and ensures efficient query performance.

Snowflake Schema

A snowflake schema occurs when dimension tables are further normalized into additional related tables.

In the FactBudget model:

Budget data may link to higher-level entities such as departments or regions
These dimensions connect to other dimension tables rather than directly to the fact table

While snowflake schemas add complexity, they are sometimes necessary, particularly for planning and budgeting data.

Integrating Star and Snowflake Schemas

Power BI allows multiple fact tables to coexist within a single model when they share common dimensions.

In this example:

Sales uses a star schema
FactBudget uses a snowflake schema

Both connect through shared dimensions such as Date and Product

This integration enables:

Actual vs budget comparisons
Variance analysis
Performance tracking across time and products
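The variance calculation itself is simple once both fact tables are summarized at the same grain. The Python sketch below assumes both have already been aggregated by month and product; the figures, the dictionary layout, and the function name `variance_report` are hypothetical.

```python
# Hypothetical monthly totals keyed by (month, product).
actual_sales = {("2026-01", "Laptop"): 12000.0, ("2026-01", "Smartphone"): 8000.0}
budget = {("2026-01", "Laptop"): 10000.0, ("2026-01", "Smartphone"): 9000.0}


def variance_report(actual, planned):
    """Return {key: (variance, variance_pct)} for keys present in both inputs."""
    report = {}
    for key in sorted(actual.keys() & planned.keys()):
        variance = actual[key] - planned[key]
        # Guard against dividing by a zero budget.
        variance_pct = round(variance / planned[key] * 100, 1) if planned[key] else None
        report[key] = (variance, variance_pct)
    return report


report = variance_report(actual_sales, budget)
for key, values in report.items():
    print(key, values)
```

Matching on the shared keys plays the role of the shared Date and Product dimensions: only combinations that exist in both tables can be compared.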

In conclusion, Power BI data modeling plays a crucial role in transforming raw data into meaningful insights. By structuring data effectively through relationships, calculated columns, measures, and hierarchies, users can create dynamic and interactive reports that support informed decision-making. Proper data modeling ensures data consistency, accuracy, and performance efficiency, allowing organizations to analyze trends, identify patterns, and make data-driven decisions with confidence. Mastery of Power BI’s data modeling capabilities not only enhances analytical capabilities but also empowers users to communicate insights visually, bridging the gap between complex data and actionable knowledge.

Introduction to Linux for Data Engineers, Including Practical Use of Vi and Nano with Examples

Damaa-C — Sun, 25 Jan 2026 11:07:13 +0000

Overview

Linux is the backbone of most modern data engineering systems. From cloud servers and big data platforms to ETL pipelines and data warehouses, Linux provides the environment where data engineers build, deploy, and manage data workflows. This article introduces Linux from a beginner’s perspective, explains why it is important for data engineers, and demonstrates basic Linux usage with a strong focus on text editing using Vi and Nano.

This guide is written for beginners; no prior Linux experience is required.

Why Linux Is Important for Data Engineers

Most data engineering tools run on Linux. Tools such as Apache Hadoop, Spark, Kafka, Airflow, Docker, and Kubernetes are primarily designed for Linux environments.

Here’s why Linux matters:

Server dominance – Most servers in the cloud (AWS, Azure, GCP) run on Linux.
Performance and stability – Linux handles large-scale data processing efficiently.
Automation-friendly – Powerful command-line tools for scripting and scheduling jobs.
Open source – Free, customizable, and widely supported.

As a data engineer, you will often:

Connect to Linux servers via SSH
Edit configuration files
Run data processing scripts
Monitor logs and system resources

Understanding Linux is therefore a core skill.

Understanding the Linux Terminal

The terminal, also called the command line or shell, allows you to interact with the Linux system by typing commands.
Example of a terminal prompt:

damaris@ubuntu:~$

This shows:

Username: damaris
Machine name: ubuntu
Current directory: ~ (home directory)

Basic Linux Commands for Beginners

1. Check the current directory

To check the current directory in the terminal, use the following command:

pwd

Output: /home/damaris

2. List files and folders

To list files and folders in the terminal, use these commands:

ls      # list files and folders in the current directory
ls -l   # long format: permissions, owner, size, and modification time

3. Create a directory

Use the mkdir command to create a directory.

Example:

mkdir data_projects

4. Navigate Between Directories

Use the cd command to move between directories.
Let's move into the data_projects directory we created earlier.
For example:

cd data_projects

To move back up to the parent directory, use cd ..
Example:

cd data_projects
ls #lists files in the directory
cd ..

5. Create a File

Use the touch command to create an empty file.
Example:

touch sample.txt

6. View File Content

To view the contents of a file, use the cat command.

cat sample.txt
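Data engineers often need the same operations inside a script. As a rough sketch, the Python below mirrors mkdir, touch, ls, and cat using the standard-library `pathlib` module. It works inside a temporary directory, so it does not touch your real files; the text written to the file is made up for the example.

```python
import tempfile
from pathlib import Path

# Work inside a temporary directory so nothing on the real system changes.
with tempfile.TemporaryDirectory() as workspace:
    base = Path(workspace)

    project_dir = base / "data_projects"
    project_dir.mkdir()                           # like: mkdir data_projects

    sample_file = project_dir / "sample.txt"
    sample_file.touch()                           # like: touch sample.txt

    sample_file.write_text("hello from Python\n")
    content = sample_file.read_text()             # like: cat sample.txt

    listing = sorted(p.name for p in project_dir.iterdir())  # like: ls
    print(listing, content.strip())
```

The temporary directory and everything in it are deleted automatically when the `with` block ends.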

Why Text Editors Matter in Data Engineering

As a data engineer, you will constantly edit SQL scripts, Python files, shell scripts, and configuration files (YAML, JSON, .conf).

Linux provides powerful terminal-based text editors. The most common are Vi/Vim and Nano.

Using the Nano Editor in Ubuntu (Detailed Beginner Guide)

What Is Nano?

Nano is a simple, beginner-friendly text editor that runs inside the Ubuntu terminal. Unlike Vi/Vim, Nano does not use modes, which makes it much easier for new Linux users to learn and use.

For data engineers, Nano is commonly used to:

Edit configuration files (.conf, .yaml, .json)
Write quick notes or scripts
Modify ETL pipeline settings
Edit files on remote Linux servers

Nano is preinstalled on most Ubuntu systems, so you don’t need to install anything.

Opening the Terminal in Ubuntu

Before using Nano, you need to open the terminal.

You can do this in any of the following ways:

Press Ctrl + Alt + T

Search for Terminal in the Applications menu

You will see something like:

damaris@ubuntu:~$

This means you are in your home directory.

Creating a File Using Nano

To create a new file using Nano, type:

nano data_notes.txt

Then press Enter.

What Happens Next?

If the file does not exist, Nano creates it.

If the file already exists, Nano opens it for editing.

You will now see the Nano editor screen.

Understanding the Nano Editor Interface

When Nano opens, the screen has three main parts:

1. Main Editing Area (Center)

This is where you type your text.

Example:

This file contains notes for our data engineering project.
Source: MySQL
Destination: Data Warehouse

2. Status Bar (Bottom)

At the bottom of the screen, you’ll see something like:

^G Get Help   ^O Write Out   ^W Where Is   ^K Cut   ^X Exit

The ^ symbol means the Ctrl key.

So:

^O means Ctrl + O

^X means Ctrl + X

This shortcut list is one of Nano’s biggest advantages.

3. File Name Display (Top)

At the top, Nano shows the file name you are editing:

GNU nano 6.2         data_notes.txt

Typing Text in Nano

Nano starts in editing mode immediately.

You can begin typing right away without pressing any special keys.

Example:

ETL Pipeline Notes
------------------
Extract data from PostgreSQL
Transform data using Python
Load data into the warehouse

There is no insert mode or command mode like in Vi.

Saving a File in Nano

To save your work:

Press Ctrl + O (Write Out)

Nano will ask:

File Name to Write: data_notes.txt

Press Enter

Your file is now saved.

Exiting Nano

To exit Nano:

Ctrl + X

If You Have Unsaved Changes

Nano will ask:

Save modified buffer?

Press Y → Save changes

Press N → Exit without saving

Press Ctrl + C → Cancel exit

Opening an Existing File with Nano

To edit an existing file:

nano data_notes.txt

This opens the file so you can modify it.

Editing Text in Nano

Moving the Cursor

You can move around using:

Arrow keys ↑ ↓ ← →
Page Up / Page Down

Deleting Text

Backspace → Delete previous character
Delete key → Delete next character

Cutting and Pasting Text

Cut a Line

Ctrl + K

This cuts the entire line.

Paste a Line

Ctrl + U

This pastes the last cut text.

Searching for Text in Nano

To search within a file:

Ctrl + W

Type the word you want to find and press Enter.

Example:

warehouse

Practical Example: Editing a Configuration File

Imagine you are a data engineer editing a pipeline configuration file.

Step 1: Open the file

nano etl_config.conf

Step 2: Add configuration details

source_database=mysql
source_host=localhost
destination=warehouse
batch_size=500

Step 3: Save and exit

Ctrl + O → Enter

Ctrl + X

Viewing the File from Terminal

After exiting Nano, you can confirm the file content using:

cat etl_config.conf

Output:

source_database=mysql
source_host=localhost
destination=warehouse
batch_size=500
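
A pipeline script would typically read a file like this when it starts. The sketch below is a minimal illustration in Python, assuming the plain key=value format shown above. The `parse_config` helper is a hypothetical name, not part of any library, and the file is recreated in a temporary directory only so the example is self-contained.

```python
import tempfile
from pathlib import Path


def parse_config(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


# Recreate the example file in a temporary directory (assumption: in a real
# project the file would already exist next to the script).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "etl_config.conf"
    path.write_text(
        "source_database=mysql\n"
        "source_host=localhost\n"
        "destination=warehouse\n"
        "batch_size=500\n"
    )
    config = parse_config(path.read_text())

# Every value is read as a string, so numbers need an explicit conversion.
batch_size = int(config["batch_size"])
print(config["source_database"], batch_size)
```

Keeping settings in a small text file like this means you can change the batch size in Nano without touching the pipeline code.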

Common Nano Shortcuts (Beginner Must-Know)

Shortcut    Action
Ctrl + O    Save file
Ctrl + X    Exit Nano
Ctrl + K    Cut line
Ctrl + U    Paste
Ctrl + W    Search
Ctrl + G    Help

Using the Vi Editor in Ubuntu (Detailed Beginner Guide)

What Is Vi?

Vi is a powerful text editor available on almost every Linux system, including Ubuntu. Unlike Nano, Vi works using modes, which can feel confusing at first but make Vi extremely efficient once learned.

For data engineers, Vi is important because:

It is available on almost every server (even minimal installs)
It is fast and lightweight
It is widely used for editing configuration files and scripts
Many tools default to Vi

If you connect to a remote Linux server, Vi is almost always there.

Opening the Terminal in Ubuntu

Open the terminal using:

Ctrl + Alt + T, or

Search for Terminal in Applications

You will see something like:

damaris@ubuntu:~$

Opening or Creating a File with Vi

To open or create a file using Vi:

vi pipeline_config.txt

If the file does not exist → Vi creates it

If the file exists → Vi opens it for editing

You are now inside the Vi editor.

Understanding Vi Modes (Very Important)

Vi has three main modes. Most beginner confusion comes from not knowing which mode they are in.

1. Normal Mode (Default)

This is the mode Vi starts in

Used for navigation and commands

You cannot type text here

If you try typing, Vi treats your keystrokes as commands instead of inserting them as text, so switch to Insert mode before you type.

2. Insert Mode

Used for typing text

You must enter this mode manually

To enter Insert mode, press:

i

You will see something like:

-- INSERT --

at the bottom of the screen.

3. Command Mode

Used to save, quit, or exit without saving

Activated by typing : in Normal mode
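
The three modes and the keys that switch between them can be pictured as a small state machine. The sketch below is a toy model in Python, not how Vi is implemented: it only tracks which mode you would be in after each key press and does not edit any text. The names `TRANSITIONS`, `next_mode` and `replay` are hypothetical, and the model simplifies Command mode by assuming that Enter always returns to Normal mode (in real Vi, a command such as :q would exit the editor instead).

```python
# Toy model of Vi's mode switching; it does not edit any text.
TRANSITIONS = {
    ("normal", "i"): "insert",       # i enters Insert mode
    ("normal", ":"): "command",      # : opens Command mode
    ("insert", "Esc"): "normal",     # Esc leaves Insert mode
    ("command", "Esc"): "normal",    # Esc cancels the command
    ("command", "Enter"): "normal",  # Enter runs the command (simplified)
}


def next_mode(mode: str, key: str) -> str:
    """Return the mode after pressing `key`; other keys keep the mode."""
    return TRANSITIONS.get((mode, key), mode)


def replay(keys: list[str], mode: str = "normal") -> list[str]:
    """Return the sequence of modes visited while pressing `keys`."""
    visited = [mode]
    for key in keys:
        mode = next_mode(mode, key)
        visited.append(mode)
    return visited


modes = replay(["i", "Esc", ":", "Enter"])
print(" -> ".join(modes))
```

Notice that every path back to Normal mode goes through Esc or Enter, which is why pressing Esc is the safe first step whenever you are unsure which mode you are in.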

Typing Text in Vi (Insert Mode)

Step-by-step example:

Open the file:

vi pipeline_config.txt

Press:

i

Type the text:


source_database=postgres
host=localhost
port=5432
destination=data_warehouse

You are now editing the file.

Exiting Insert Mode

To stop typing and return to Normal mode:

Esc

Always press Esc before saving or quitting.

Saving a File in Vi

Make sure you are in Normal mode (press Esc)

Type:

:w

Press Enter

This saves the file but keeps Vi open.

Saving and Exiting Vi

To save and exit at the same time:

:wq

Then press Enter.

Exiting Vi Without Saving

If you want to quit without saving changes:

:q!

This is useful if you make a mistake.

Navigating Inside a File

Using Arrow Keys

Most Ubuntu versions support arrow keys for movement.

Using Vi Keys (Optional but Powerful)

h → left

l → right

j → down

k → up

Deleting Text in Vi

x → Delete a character
dd → Delete a line
u → Undo a change

Searching for Text in Vi

To search for a word:

/warehouse

Press Enter.

To move to the next match, press:

n

Practical Example: Editing a Configuration File on a Server

Imagine you are logged into a production server.

ssh user@data-server-ip
cd /opt/etl/config
vi etl_config.conf

Inside Vi, press i and add:

batch_size=1000
retry_count=3
log_level=INFO

Then:

Esc
:wq

This is a real-world daily task for data engineers.

Viewing the File After Exiting Vi

Back in the terminal:

cat etl_config.conf

Output:

batch_size=1000
retry_count=3
log_level=INFO

Common Vi Commands (Beginner Cheat Sheet)

Command    Action
i          Insert mode
Esc        Normal mode
:w         Save
:q         Quit
:wq        Save and quit
:q!        Quit without saving
dd         Delete line
u          Undo
/text      Search

Through this article, we explored the importance of Linux in data engineering, practiced essential Linux commands, and demonstrated practical text editing using the Nano and Vi editors on Ubuntu. Nano provides a simple and beginner-friendly way to create and edit files, while Vi offers powerful features that are widely used in professional and production environments. Learning both editors prepares beginners for real-world tasks such as editing ETL configurations, scripts, and log files on local or remote servers.

In conclusion, mastering Linux basics along with Nano and Vi is a strong first step toward a successful data engineering journey. With continued practice, these skills become second nature and form the foundation for working with advanced data tools, automation, and large-scale data pipelines.

Understanding Git: How it tracks, pushes and pulls code on Ubuntu

Damaa-C — Sun, 18 Jan 2026 09:41:54 +0000

Before we dive into how Git tracks changes and handles pushing and pulling code, let’s first understand what Git is, what it does, and how it works together with GitHub.

What is Git?

Git is a distributed version control system. This means the full project history lives on your computer, so Git works locally first, even without internet. Think of it as a time machine for your code.

Git allows you to track changes to your files, save versions of your project, revert to earlier states and collaborate safely with others.
GitHub, on the other hand, is a remote platform where Git repositories are stored online.
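
One way to picture how Git tracks versions: it stores each file's contents as an object identified by a hash of those contents, so any change to a file produces a different ID. The sketch below reproduces that ID in Python, assuming a repository that uses Git's default SHA-1 object format. `git_blob_id` is a hypothetical helper name used only for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a file's contents (SHA-1 format).

    Git hashes a short header ("blob <size>" followed by a NUL byte)
    together with the raw file bytes.
    """
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


original = b"# Beginner Git Project\n"
edited = b"# Beginner Git Project\nNow with notes.\n"

print(git_blob_id(original))
print(git_blob_id(edited))
# The two IDs differ because the content differs.
print(git_blob_id(original) == git_blob_id(edited))
```

If Git is installed, you can compare the result for a real file with the output of `git hash-object README.md`.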

Installing Git and linking it to GitHub

Before installing Git, create a GitHub account in your web browser. Once the account is ready, open your terminal.

Step 1: Install Git

Update the package list:

sudo apt update

Install Git:

sudo apt install git -y

Verify Git installation:

git --version

This prints the installed version, which confirms that Git is installed on Ubuntu.

Step 2: Configure Git Identity

Set your name:

git config --global user.name "Your Name"

Set your email:

git config --global user.email "your_email@gmail.com"

Check configuration:

git config --list

user.name=Your Name
user.email=your_email@gmail.com

Shows Git configuration including username and email.

user.name=Your Name → Git knows who is making commits. This name appears in the commit history.

user.email=your_email@gmail.com → Git associates your commits with this email. It should match an email on your GitHub account so that your commits are linked to your profile.

This confirms Git is configured correctly. Without this, Git may refuse to create a commit or guess an identity from your system, and your commits may not be attributed to your account on GitHub.

Step 3: Create SSH Key and Link to GitHub

Generate a secure SSH key:

ssh-keygen -t ed25519 -C "your_email@gmail.com"

Start SSH agent:

eval "$(ssh-agent -s)"

Add your key to the agent:

ssh-add ~/.ssh/id_ed25519

Display your public key:

cat ~/.ssh/id_ed25519.pub

Copy the output and add it to your GitHub account under Settings → SSH and GPG keys → New SSH key.

Test the connection:

ssh -T git@github.com

Example output if the key has not been added to GitHub:

ssh -T git@github.com
`Permission denied (publickey).`

Expected output:

Hi username! You've successfully authenticated, but GitHub does not provide shell access.

Shows a successful authentication with GitHub via SSH.

Step 4: Create a GitHub Repository

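Create a new repository on GitHub and do not initialize it with a README, so that the first push from your local project does not conflict with history already on the remote. This will be your remote repository.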

Step 5: Create Local Project Directory

Create a new folder and navigate into it:

mkdir beginner-git-project
cd beginner-git-project

Shows folder creation and navigation.

Step 6: Initialize Git

git init

Shows that the folder is now a Git repository.

Step 7: Create a File

Create a README file:

touch README.md
echo "# Beginner Git Project" >> README.md

Check repository status:

git status

Example of wrong input:

git statuz

Output:

`git: 'statuz' is not a git command. See 'git --help'.`

Step 8: Stage Changes

git add .

Check status:

git status

Shows staged files ready to commit.

Step 9: Commit Changes

git commit -m "Initial commit"

Example of wrong input:

git commit -m Initial commit

Output:

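`error: pathspec 'commit' did not match any file(s) known to git`

A multi-word commit message must be wrapped in quotes. Without them, Git takes only the first word (Initial) as the message and treats the remaining word (commit) as a file path.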
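
The quoting rule comes from how the shell splits a command line into separate arguments before Git ever sees it. The sketch below uses Python's standard `shlex` module, which follows POSIX-style splitting rules, as an approximation of what your terminal does with each version of the command.

```python
import shlex

quoted = shlex.split('git commit -m "Initial commit"')
unquoted = shlex.split("git commit -m Initial commit")

print(quoted)    # the message stays together as one argument
print(unquoted)  # "Initial" and "commit" become separate arguments
```

In the unquoted version, `-m` receives only `Initial`, and the leftover `commit` is passed to Git as if it were a file name.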

Step 10: Connect Local Repo to GitHub

Add the remote repository:

git remote add origin git@github.com:username/beginner-git-project.git
git remote -v

Shows the remote repository connection.

Step 11: Push Code to GitHub

Rename branch to main:

git branch -M main

Push your code:

git push -u origin main

Example of wrong push command:

git push
`fatal: No configured push destination.`

This occurs if the remote wasn’t added.

Step 12: Pull Changes from GitHub

git pull origin main

Downloads the latest changes from GitHub and merges them into your local project.

Step 13: View Project History

git log

Exit the log view:

q
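
To practice the local part of this workflow without touching a real project, the steps can be replayed in a throwaway folder. The sketch below does this from Python, assuming Git is installed and available on your PATH; if it is not, the function simply returns `None`. The helper names `run_git` and `demo_history` and the demo identity are hypothetical and exist only for this illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def run_git(args: list[str], cwd: Path) -> str:
    """Run a git command inside `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def demo_history() -> Optional[str]:
    """Replay init, add, commit and log in a temporary folder."""
    if shutil.which("git") is None:
        return None  # Git is not installed, so there is nothing to replay
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        run_git(["init"], repo)
        (repo / "README.md").write_text("# Beginner Git Project\n")
        run_git(["add", "."], repo)
        # The identity is passed inline so the sketch neither depends on
        # nor changes your global Git configuration.
        run_git(
            [
                "-c", "user.name=Demo User",
                "-c", "user.email=demo@example.com",
                "-c", "commit.gpgsign=false",
                "commit", "-m", "Initial commit",
            ],
            repo,
        )
        return run_git(["log", "--oneline"], repo)


history = demo_history()
print(history)
```

The `--oneline` option prints one short line per commit, which is easier to scan than the full `git log` output once a project has many commits.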

Why Git Matters (Data Science & Data Engineering)

Track notebooks & pipelines
Collaborate safely
Revert mistakes
Production-ready workflows

Git allows developers, data scientists, and engineers to track code changes, collaborate safely, and recover from mistakes. By understanding each command and its purpose, beginners can confidently manage projects and work professionally using Git and GitHub.

Learning Git on Ubuntu empowers you to manage projects confidently and collaborate effectively. Whether you’re working on ETL pipelines, machine learning models, or data analysis notebooks, Git ensures your work is organized, secure, and scalable.
