DEV Community: Wangeci Ndovu

DAGS with Apache Airflow

Wangeci Ndovu — Mon, 18 May 2026 14:33:46 +0000

Data Pipeline Orchestration with Apache Airflow

A dedicated repository for orchestrating enterprise data workflows, automating dependencies, and scheduling ETL pipelines using Apache Airflow. This repository serves as the centralized automation layer that handles job scheduling, failure retries, and sequential execution for data engineering tasks.

Project Overview

This project showcases an automated Apache Airflow Directed Acyclic Graph (DAG) designed to manage a modular Python ETL pipeline. Instead of relying on manual script invocation or cron jobs, this architecture leverages Airflow's core workflow management capabilities to build a resilient, observable data infrastructure.

Core Architecture Components

Orchestrator (DAG.py): The primary workflow engine definition that schedules, monitors, and structures task groups.
Upstream Modules: Connected nodes executing individual pipeline stages—integrating APIs, custom transformation wrappers, and database loader clients.

Workflow Topology (DAG)

The workflow breaks the weather processing pipeline down into three decoupled, sequential tasks. Each task runs as its own Airflow task instance, so a failure is caught at that stage and the downstream tasks do not run on bad or missing input.

[extract_weather_data] ──► [transform_weather_data] ──► [load_weather_data]

Task Definitions

extract_weather_data: Calls the upstream extractor module to fetch the raw weather payload from the source API.

transform_weather_data: Reshapes the raw payload, mapping loosely structured keys to well-defined relational columns.

load_weather_data: Writes the transformed records to the target analytical store.
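As a rough illustration of the topology above, the three tasks could be wired together as in the sketch below. The function bodies, the sample payload, and the DAG id weather_etl are placeholders invented for this example, and the real DAG.py may look different. The Airflow import is deferred into build_dag() so the helper functions can be tried without an Airflow install.

```python
from datetime import datetime, timedelta


def extract_weather_data():
    """Hypothetical extractor: returns a hard-coded payload instead of calling an API."""
    return {"city": "Nairobi", "main": {"temp": 21.5, "humidity": 60}}


def transform_weather_data(raw):
    """Flatten the nested payload into one relational-style row."""
    return {
        "city": raw["city"],
        "temp_c": raw["main"]["temp"],
        "humidity_pct": raw["main"]["humidity"],
    }


def load_weather_data(row):
    """Stand-in loader: a real task would write the row to a database."""
    print(f"loading {row}")
    return 1  # number of rows "loaded"


def build_dag():
    """Wire the callables into an hourly DAG (needs Apache Airflow 2.4 or newer)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Airflow passes `ti` (the task instance) because the callable asks for it;
    # xcom_pull fetches the value returned by the upstream task.
    def _transform(ti):
        return transform_weather_data(ti.xcom_pull(task_ids="extract_weather_data"))

    def _load(ti):
        return load_weather_data(ti.xcom_pull(task_ids="transform_weather_data"))

    with DAG(
        dag_id="weather_etl",  # assumed name
        start_date=datetime(2026, 1, 1),
        schedule="@hourly",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = PythonOperator(
            task_id="extract_weather_data", python_callable=extract_weather_data
        )
        transform = PythonOperator(
            task_id="transform_weather_data", python_callable=_transform
        )
        load = PythonOperator(task_id="load_weather_data", python_callable=_load)

        # Same chain as the diagram: extract, then transform, then load.
        extract >> transform >> load
    return dag
```

In an actual DAG file you would assign dag = build_dag() at module level so the scheduler can discover it.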

Infrastructure & Tech Stack

Core Engine: Apache Airflow

Runtime Environment: Python 3.8+ (Ubuntu / Linux Environment)

Execution Libraries: airflow.models, airflow.operators.python

Schedule Target: Hourly (@hourly)

Getting Started & Local Installation

Follow these steps to deploy and execute this DAG locally within your Apache Airflow environment on an Ubuntu system.

Clone the Tracking Directory

git clone https://github.com/wangecindovu-lab/DAGS-using-Apache-Airflow.git
cd DAGS-using-Apache-Airflow

Configure Airflow Environment Variables

Set your default home path for Airflow before initializing the database layout:

export AIRFLOW_HOME=~/airflow

Install Apache Airflow Dependencies

# The constraints file should match your Python version (3.10 in this URL).
pip install "apache-airflow==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"

Link the DAG to Your Airflow Deployment

Airflow scans the ~/airflow/dags directory by default. Symlink or copy DAG.py into that folder:

mkdir -p ~/airflow/dags
cp DAG.py ~/airflow/dags/weather_etl_dag.py

Initialize Airflow Database & Start Services

Initialize the underlying relational metadata engine, register an administrative user, and start both the scheduler and webserver:

# Initialize the metadata database
# (on Airflow 2.7+, "airflow db migrate" is the preferred command; "db init" still works but is deprecated)
airflow db init

# Create an admin user account
airflow users create \
    --username admin \
    --firstname Wangeci \
    --lastname Ndovu \
    --role Admin \
    --email Wangecindovu@gmail.com \
    --password admin

# Start the web server (accessible via http://localhost:8080)
airflow webserver --port 8080 -D

# Start the pipeline scheduler engine
airflow scheduler

Configuration & Operational Safeguards

The pipeline uses strict execution constraints defined within the DAG default properties layer:

retries: Each failed task is retried up to 2 times before it is marked as failed.

retry_delay: Waits 5 minutes between retries to allow external services or target nodes to recover from temporary downtime.

catchup=False: Prevents the scheduler from backfilling runs for past intervals that were missed before the pipeline was first activated.
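To make the retry settings concrete, here is a plain-Python simulation of the behavior described above. It is not Airflow's implementation: run_with_retries and flaky_task are hypothetical names, and the sleep function is injectable so the example finishes instantly instead of waiting five minutes.

```python
import time
from datetime import timedelta


def run_with_retries(task, retries=2, retry_delay=timedelta(minutes=5), sleep=time.sleep):
    """Run `task`; on failure wait `retry_delay` and try again, up to `retries` extra times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:  # the initial try plus all retries are used up
                raise
            sleep(retry_delay.total_seconds())


calls = {"n": 0}


def flaky_task():
    """Fails twice, then succeeds: mimics a service recovering from downtime."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


waits = []  # record the requested delays instead of really sleeping
result, attempts = run_with_retries(flaky_task, sleep=waits.append)
print(result, attempts, waits)  # ok 3 [300.0, 300.0]
```

With retries set to 2, a task gets three attempts in total: the first run plus two retries.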

Simple beginner Python calculator

Wangeci Ndovu — Fri, 15 May 2026 09:32:38 +0000

Simple Python Calculator

A basic command-line calculator built with Python. This project performs simple arithmetic operations such as addition, subtraction, multiplication, and division.

Features

Addition
Subtraction
Multiplication
Division
User input handling
Simple and readable code structure

Getting Started

git clone git@github.com:wangecindovu-lab/Simple-Beginner-python-calculator.git

Run the calculator.

python calculator.py

Usage

Run the script
Choose an operation (+, -, *, /)
Enter the first number
Enter the second number
View the result

Example

Enter operation (+, -, *, /): *
Enter first number: 10
Enter second number: 5
Result: 50
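The repository's calculator.py is not reproduced in this post, so the following is only a minimal sketch of how such a script might be structured. The calculate function, the OPERATIONS table, and the error handling are assumptions made for illustration.

```python
import operator

# Map each supported symbol to the function that implements it.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculate(op, a, b):
    """Apply the chosen operation; reject unknown symbols and division by zero."""
    if op not in OPERATIONS:
        raise ValueError(f"Unsupported operation: {op}")
    if op == "/" and b == 0:
        raise ZeroDivisionError("Cannot divide by zero")
    return OPERATIONS[op](a, b)


def main():
    """Interactive loop body: prompt for the operation and both numbers."""
    op = input("Enter operation (+, -, *, /): ").strip()
    a = float(input("Enter first number: "))
    b = float(input("Enter second number: "))
    print(f"Result: {calculate(op, a, b):g}")


print(calculate("*", 10, 5))  # 50
```

Calling main() runs the prompts shown in the example session; keeping the arithmetic in calculate makes it easy to test without typing input.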

Project Structure

.
├── calculator.py
└── README.md

SIMPLE BEGINNER CRYPTO ETL PIPELINE.

Wangeci Ndovu — Thu, 07 May 2026 04:28:32 +0000

Crypto ETL Pipeline (Python + PostgreSQL)

Overview

This project is a simple ETL (Extract, Transform, Load) pipeline that retrieves real-time cryptocurrency market data from the CoinPaprika API, processes it using Python (Pandas), and loads it into a PostgreSQL database.

The pipeline is designed for learning and demonstrating core data engineering concepts such as API ingestion, data transformation, and relational database storage.

Architecture

Extract → Transform → Load

Extract
- Data is fetched from the CoinPaprika REST API
- Cryptocurrencies include: Bitcoin, Ethereum, XRP, Solana
Transform
- JSON response is normalized using Pandas
- Unnecessary fields are removed
- Columns are renamed for clarity
- Timestamp (ingested_at) is added
Load
- Cleaned data is inserted into PostgreSQL
- Table: crypto_market_data

Tech Stack

Python 3.x
Pandas
Requests
SQLAlchemy
Psycopg2
PostgreSQL
python-dotenv

Project Structure

crypto-etl/
│
├── crypto_etl.py        Main ETL script
├── requirements.txt     Python dependencies
├── .env                 Environment variables (not pushed to GitHub)
├── .gitignore           Ignored files (venv, env, etc.)
└── README.md            Project documentation

Setup Instructions

1. Clone repository

git clone https://github.com/your-username/crypto-etl.git
cd crypto-etl

2. Create virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment variables

Create a .env file:

DB_USER=your_db_user
DB_PASSWORD=your_db_password
DB_HOST=localhost
DB_PORT=5432
DB_NAME=crypto_db
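As a sketch of how these variables might be turned into a database connection string, the helper below assembles a SQLAlchemy-style PostgreSQL URL. The function name build_db_url and the sample values are assumptions; in the project itself, python-dotenv's load_dotenv() would typically load the .env file into os.environ first.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a SQLAlchemy-style PostgreSQL URL from the DB_* variables."""
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])  # escape characters such as '@' or '/'
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Example values only; in the project these would come from the .env file.
sample_env = {
    "DB_USER": "etl_user",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "crypto_db",
}
print(build_db_url(sample_env))
# postgresql+psycopg2://etl_user:p%40ss%2Fword@localhost:5432/crypto_db
```

Escaping the password matters because characters like @ would otherwise be read as part of the URL structure.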

5. Run ETL pipeline

python crypto_etl.py

Database Schema

Table: crypto_market_data

Column	Description
coin_id	Unique coin identifier
coin_name	Name of cryptocurrency
coin_symbol	Symbol (BTC, ETH, etc.)
price_usd	Current price in USD
volume_24h_usd	24h trading volume
volume_24h_change	Volume change percentage
market_cap_usd	Market capitalization
ingested_at	Timestamp of ingestion
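To show how a raw API record could map onto the columns above, here is a plain-Python sketch of the transform step. The project itself uses Pandas; this version avoids it so the example stays dependency-free. The sample record is hand-written, its field names are an assumption about the shape of the API response, and transform_ticker is a hypothetical helper.

```python
from datetime import datetime, timezone

# Hand-written sample shaped like one ticker record; not real market data.
sample_ticker = {
    "id": "btc-bitcoin",
    "name": "Bitcoin",
    "symbol": "BTC",
    "rank": 1,
    "quotes": {
        "USD": {
            "price": 100.0,
            "volume_24h": 2500.0,
            "volume_24h_change_24h": -1.5,
            "market_cap": 1000000.0,
        }
    },
}


def transform_ticker(ticker, ingested_at=None):
    """Keep only the schema columns, rename them, and stamp the ingestion time."""
    usd = ticker["quotes"]["USD"]
    return {
        "coin_id": ticker["id"],
        "coin_name": ticker["name"],
        "coin_symbol": ticker["symbol"],
        "price_usd": usd["price"],
        "volume_24h_usd": usd["volume_24h"],
        "volume_24h_change": usd["volume_24h_change_24h"],
        "market_cap_usd": usd["market_cap"],
        "ingested_at": ingested_at or datetime.now(timezone.utc),
    }


row = transform_ticker(sample_ticker)
print(list(row))  # column names in schema order
```

Fields that are not part of the schema, such as rank, are simply left out of the returned row.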

Features

Real-time crypto data ingestion
Automated transformation using Pandas
PostgreSQL data storage
Environment variable configuration
Modular and extensible ETL structure

Future Improvements

Add scheduling (cron/Airflow)
Implement upsert logic (avoid duplicates)
Dockerize the pipeline
Add logging and error handling
Build dashboard for visualization

Author

Wangeci Ndovu
Data Engineer

ETL VS ELT: WHICH ONE SHOULD YOU USE AND WHY?

Wangeci Ndovu — Fri, 10 Apr 2026 13:19:41 +0000

If you are considering a career in data, especially in engineering or analytics, you have probably come across these two terms: ETL and ELT. They might look complicated at first glance, but I am about to explain both in simple, straightforward words.
So let's dive right in:

Just what is ETL?

Extract, Transform, Load, or simply ETL, is a fundamental way of gathering data from multiple sources, and it follows exactly the order stated in the name. It brings together the key principles of not only assembling data but also refining it, so that the final product is ready for experts and laymen alike.

Let's break it down together

EXTRACTION

As the name states, this is the process by which data gurus source raw data from multiple places, often in its most unstructured and random form. These sources can include:

Databases
Application programming interfaces (APIs)
Internet content
Economic data
Real estate data
Weather data
Surveys and interviews

Note, however, that data sources are not limited to the above.
So just what is the purpose of collecting all this raw data? In comes:

TRANSFORMATION

You guessed it, it's all in the name. Transformation simply refers to the ways in which data is cleaned, structured, standardized, and made suitable for both analysis and storage.
There are several key ways in which raw data is transformed. These are:

Data cleaning.

This simply involves removing errors, unwanted values, and null values.

Structural conversion.

This is key when adding or removing columns and rows for either presentation or relevance, when standardizing ranges, and when imposing structure in case you are working with an unstructured form of raw data.

Destructive Transformation.

Sometimes a data set has bits and pieces that are either not needed or have become irrelevant for various reasons; this can even be data already in storage. This transformation removes such unwanted information.

Attribution.

Attribution is simply the deriving of new features from raw or already existing data. For example, if you already have a date of birth you can add a new feature called "age"; likewise, if you have total price and production cost you can derive a "profit" or "loss" column for simpler analysis.
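The two derived features mentioned above, age and profit, can be sketched in a few lines of Python. The records, the field names, and the fixed as_of date are made up for the example.

```python
from datetime import date


def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Illustrative records only.
records = [
    {"name": "A", "date_of_birth": date(1990, 6, 15), "total_price": 500.0, "production_cost": 320.0},
    {"name": "B", "date_of_birth": date(2001, 12, 1), "total_price": 200.0, "production_cost": 260.0},
]

as_of = date(2026, 4, 10)  # fixed date so the output is reproducible
for r in records:
    r["age"] = age_on(r["date_of_birth"], as_of)
    r["profit"] = r["total_price"] - r["production_cost"]  # negative means a loss
    print(r["name"], r["age"], r["profit"])
# A 35 180.0
# B 24 -60.0
```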

In doing these things, data experts employ tools like:

Microsoft Excel
Microsoft Power BI
dbt (data build tool) for SQL
Pandas (Python)
Informatica

It is safe to say that data cleaning is one of the most important parts of data integration, if not the most important. Once data safely passes through this process it is ready for:

LOADING

Data loading, as the name suggests, is the process of moving data from a source into a target system, usually for storage, analysis, or processing.
Data can be loaded into any of the following:

Data lake
Data Warehouse
Staging area
Repository for analytics and reporting

That is a simple ETL pipeline at a glance.

Here is a simple diagram of an ETL data integration method.

ELT

ELT, as the name suggests and drawing from my explanation above, simply swaps the order: instead of transforming before loading as in ETL, it loads the data first and transforms it afterwards. This can be for several main reasons, which include:

Scalability.

Modern cloud data warehouses are designed to handle massive amounts of data. They can run transformations at much higher speeds while also saving on cost, unlike older setups where you needed several tools to do the job.

Agility and flexibility.

Since the raw data is already loaded, data analysts and scientists can easily create new models and transformations as business needs inevitably change with time, without having to alter or overhaul the initial ingestion pipeline.

Simplified pipeline management.

Since data is loaded immediately after extraction, the ingestion workflow is shorter and fewer tools are needed for data integration.

Reduced Ingestion Time.

By deferring the transformation step until after loading, ELT greatly reduces the time it takes for data to become available in the target system, which is sometimes critical for real-time or near-real-time reporting.

Support for Unstructured Data.

ELT is well suited to handling semi-structured and unstructured data such as JSON files, images, and videos, since it stores them in their raw form and only transforms them when analysis demands it.
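The load-first, transform-later order can be sketched with Python's built-in sqlite3 module. An in-memory SQLite database stands in for a cloud warehouse here, and the table names and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Load: land the extracted rows untouched, as text, in a raw table.
conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
raw_rows = [("  alice ", "250"), ("BOB", "150.5"), ("alice", "300")]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# Transform: clean and aggregate later, inside the warehouse, using SQL.
conn.execute(
    """
    CREATE TABLE sales_by_customer AS
    SELECT LOWER(TRIM(customer)) AS customer,
           SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_sales
    GROUP BY LOWER(TRIM(customer))
    """
)

result = conn.execute(
    "SELECT customer, total_amount FROM sales_by_customer ORDER BY customer"
).fetchall()
print(result)  # [('alice', 550.0), ('bob', 150.5)]
```

Because raw_sales is kept as loaded, a different transformation can be written later without touching the ingestion step.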

Below is a simple ELT data integration illustration.

When to use ETL VS ELT

Complex Analytics
ETL and ELT can both be used for complex analysis that involves multiple data formats from varied sources. Data engineers may set up ETL pipelines for some of these sources or target databases, then use ELT for others, depending on need and, of course, cost.

IoT Applications
Internet of Things (IoT) applications that use sensor data streams will often prefer ETL over ELT, for example to:

Receive data from different protocols and convert it into standard data formats for use in cloud workloads.
Cleanse, deduplicate, or fill in missing time-series data elements.
Calculate values from data sources that differ on the local device, and send filtered values to the cloud backend.
Filter high-frequency data, compute averages over large datasets, then load the averaged or filtered values at a low cost.
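The cleanse, deduplicate, and average steps listed above could look roughly like this before anything is loaded to the cloud. The readings and the downsample helper are hypothetical, and this sketch drops missing samples rather than filling them in.

```python
from collections import defaultdict
from statistics import mean

# (seconds since start, temperature) readings; None marks a dropped sample.
readings = [(0, 20.0), (10, 20.4), (10, 20.4), (20, None), (30, 21.0),
            (60, 22.0), (70, None), (80, 22.6)]


def downsample(readings, window_seconds=60):
    """Deduplicate, drop missing values, and average per time window."""
    buckets = defaultdict(list)
    # set() removes exact duplicates; sort by timestamp only (values may be None).
    for ts, value in sorted(set(readings), key=lambda r: r[0]):
        if value is None:  # cleanse: skip missing samples
            continue
        window_start = ts // window_seconds * window_seconds
        buckets[window_start].append(value)
    return {start: round(mean(values), 2) for start, values in buckets.items()}


averaged = downsample(readings)
print(averaged)  # {0: 20.47, 60: 22.3}
```

Eight raw readings shrink to two averaged values, which is the cost saving the last bullet describes.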

Experimentation
Data engineers will sometimes, for one reason or another or due to business demands, conduct experiments that are crucial to the growth of any business. These are aimed at:

Discovering new data sources for analytics
Trying out new ideas to answer business queries
Helping businesses move with changing times
Cutting costs

Among others. ETL is more suitable here, as it employs different tools in the data pipeline and can therefore give a more in-depth review of both the data structures and the data itself.

In conclusion

On whether to use ETL or ELT in your business, the simple answer is that there is no simple answer. Both can be used separately or in a complementary manner, depending on the needs and size of the business or on the operation in question.
It also depends on other factors such as:

Overall cost
Flexibility
Scalability
Security
Speed
Personnel experience

Among others. Both are, however, equally up to the task and perform well when used properly and efficiently.

SQL Joins and Window Functions: The Difference Between Combining Data and Analyzing It

Wangeci Ndovu — Wed, 04 Mar 2026 02:27:32 +0000

Let us talk about joins and window functions in SQL.

Joins combine tables

Window Functions analyze data without collapsing it

Many beginners confuse the two. Let’s break them down properly, step by step, with real examples, clear explanations, and practical queries.

Part 1- SQL Joins — Combining Data Across Tables

Imagine you have two tables:

customers

customer_id | first_name | second_name
1           | Alice      | Johnson
2           | Bob        | Njagi

orders

order_id | customer_id | order_amount
1        | 1           | 250
2        | 1           | 300
3        | 2           | 150

If you want to see who made which order, you need a JOIN.

What is a JOIN?

A JOIN allows you to combine rows from two or more tables based on a related column.

In simple terms:

A JOIN connects tables using a common key.

INNER JOIN

Returns only matching rows.

SELECT c.first_name, o.order_id,
o.order_amount
FROM customers AS c
INNER JOIN orders AS o
ON c.customer_id = o.customer_id;

Result:

Alice | 1 | 250
Alice | 2 | 300
Bob   | 3 | 150

If there’s no match, the row is excluded

LEFT JOIN

Returns all rows from the left table, even if there’s no match.

SELECT c.first_name,
o.order_id, o.order_amount
FROM customers AS c
LEFT JOIN orders AS o
ON c.customer_id = o.customer_id;

If a customer has no orders, they still appear with NULL values for order columns.
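To see the difference between the two joins on real rows, here is a small runnable sketch using Python's built-in sqlite3 module. The tables mirror the ones above; the third customer, Carol, is an invented row with no orders so that the LEFT JOIN has something to show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT, second_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice', 'Johnson'), (2, 'Bob', 'Njagi'), (3, 'Carol', 'Otieno');
    INSERT INTO orders VALUES (1, 1, 250), (2, 1, 300), (3, 2, 150);
    """
)

# Same query twice; only the join type changes.
query = """
    SELECT c.first_name, o.order_id, o.order_amount
    FROM customers AS c
    {join} JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
"""
inner_rows = conn.execute(query.format(join="INNER")).fetchall()
left_rows = conn.execute(query.format(join="LEFT")).fetchall()

print(inner_rows)  # Carol is missing: she has no matching order
print(left_rows)   # Carol appears with None (NULL) in the order columns
```

The INNER JOIN returns three rows, while the LEFT JOIN returns four, the extra one being ('Carol', None, None).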

RIGHT JOIN

The opposite of LEFT JOIN: returns all rows from the right table, even if there’s no match in the left table.

FULL OUTER JOIN

Returns all rows from both tables, matched where possible.

Key Insight About Joins

Joins increase columns.

They bring data from multiple tables into a single result set.

They do NOT calculate ranking, running totals, or row by row analytics.

That’s where Window Functions come in.

Part 2- Window Functions: Analyzing Without Collapsing Data

Window functions are different.

They:

Do NOT reduce rows (unlike GROUP BY)
Perform calculations across related rows
Allow row-level analytics

This is extremely important.

Example question

What if we want:

Total spending per customer, but still show each individual order?

If you use GROUP BY:

SELECT customer_id,
SUM(order_amount) AS total_spent
FROM orders
GROUP BY customer_id;

You get

1 | 550
2 | 150

But you lose individual orders.

Enter Window Functions

SELECT order_id, customer_id, order_amount,
SUM(order_amount) OVER (PARTITION BY customer_id)
AS total_spent
FROM orders;

Result:

1 | 1 | 250 | 550
2 | 1 | 300 | 550
3 | 2 | 150 | 150

Now you have

Each order
AND total per customer
Without collapsing rows

That’s a better way to do it.
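The contrast between GROUP BY and a window function can be checked with a short sqlite3 script. This is a sketch that assumes a SQLite build with window-function support (3.25 or newer, which recent Python versions ship with); the table mirrors the orders table from Part 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_amount INTEGER);
    INSERT INTO orders VALUES (1, 1, 250), (2, 1, 300), (3, 2, 150);
    """
)

# GROUP BY collapses the three orders into one row per customer.
grouped = conn.execute(
    "SELECT customer_id, SUM(order_amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# The window version keeps every order and repeats the customer total beside it.
windowed = conn.execute(
    """
    SELECT order_id, customer_id, order_amount,
           SUM(order_amount) OVER (PARTITION BY customer_id) AS total_spent
    FROM orders
    ORDER BY order_id
    """
).fetchall()

print(grouped)   # [(1, 550), (2, 150)]
print(windowed)  # [(1, 1, 250, 550), (2, 1, 300, 550), (3, 2, 150, 150)]
```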

Understanding OVER()

The magic happens inside the OVER() clause.

PARTITION BY

Groups rows logically (like GROUP BY), but does not collapse them.

ORDER BY

Defines the order of rows within each partition (basically the sequence in which the function processes them).

Example: Ranking orders by amount.

SELECT order_id, customer_id, order_amount,
RANK() OVER (PARTITION BY customer_id ORDER BY order_amount DESC)
AS customer_rank
FROM orders;

This ranks each customer’s orders separately.

Common Window Functions You Should Know

ROW_NUMBER()

Gives unique row numbers.

ROW_NUMBER() OVER (ORDER BY order_amount DESC)

RANK()

Gives the same rank for ties and skips the numbers that follow (1, 1, 3).

DENSE_RANK()

Gives the same rank for ties and does NOT skip numbers (1, 1, 2).
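The three ranking functions only differ when there are ties, so the sketch below uses a small invented table in which two orders share the same amount. It assumes a SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_amount INTEGER);
    INSERT INTO orders VALUES (1, 300), (2, 300), (3, 250), (4, 150);
    """
)

ranked = conn.execute(
    """
    SELECT order_id, order_amount,
           ROW_NUMBER() OVER (ORDER BY order_amount DESC, order_id) AS row_num,
           RANK()       OVER (ORDER BY order_amount DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY order_amount DESC) AS dense_rnk
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in ranked:
    print(row)
# (1, 300, 1, 1, 1)
# (2, 300, 2, 1, 1)
# (3, 250, 3, 3, 2)
# (4, 150, 4, 4, 3)
```

Orders 1 and 2 tie on amount: ROW_NUMBER still numbers them 1 and 2, RANK gives both 1 and jumps to 3, and DENSE_RANK gives both 1 and continues with 2.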

SUM() OVER()

Running totals

SELECT order_id, order_amount,
SUM(order_amount) OVER (ORDER BY order_id)
AS running_total
FROM orders;

LAG() and LEAD()

Compare rows to the ones before or the ones after.

SELECT order_id, order_amount,
LAG(order_amount) OVER (ORDER BY order_id)
AS previous_amount
FROM orders;

Very useful for time-series analysis
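Running totals, LAG, and LEAD can be tried together on the three sample orders. As with the other sketches, this uses Python's sqlite3 module and assumes SQLite 3.25 or newer for window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_amount INTEGER);
    INSERT INTO orders VALUES (1, 1, 250), (2, 1, 300), (3, 2, 150);
    """
)

series = conn.execute(
    """
    SELECT order_id, order_amount,
           SUM(order_amount)  OVER (ORDER BY order_id) AS running_total,
           LAG(order_amount)  OVER (ORDER BY order_id) AS previous_amount,
           LEAD(order_amount) OVER (ORDER BY order_id) AS next_amount
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in series:
    print(row)
# (1, 250, 250, None, 300)
# (2, 300, 550, 250, 150)
# (3, 150, 700, 300, None)
```

The first row has no previous order and the last row has no next order, so LAG and LEAD return NULL (None in Python) there.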

Joins vs Window Functions

The Real Difference

Here’s the clearer distinction

Joins

Combine tables
Increase columns
Used to bring related data
Based on keys

Window Functions

Analyze rows within a table
Add calculated insights
Do not collapse rows
Based on partitions and order

When Should You Use Each?

Use JOIN when-

You need data from multiple tables
You’re connecting facts and dimensions
You’re building analytical datasets

Use Window Functions when-

You need ranking
You need running totals
You need comparisons between rows

You want aggregates without GROUP BY

In everyday analytics and data engineering, you often use BOTH together.

Example

SELECT c.first_name, o.order_id, o.order_amount,
SUM(o.order_amount) OVER (PARTITION BY c.customer_id) AS total_spent
FROM customers AS c
JOIN orders AS o
ON c.customer_id = o.customer_id;

This combines tables AND applies analytics

That’s production-level SQL.

In conclusion

Joins connect tables using keys.

Window functions perform analytics without collapsing rows.

GROUP BY reduces rows, window functions do not.

PARTITION BY is like GROUP BY, but keeps detail rows.

Modern data work heavily relies on window functions.

If you’re serious about becoming strong in SQL especially as a Data Engineer mastering both concepts is non-negotiable.

you can check more of my articles on https://www.linkedin.com/in/thomas-wangeci-065469194/

How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI

Wangeci Ndovu — Mon, 16 Feb 2026 14:30:16 +0000

When I started working with large amounts of data, I quickly realized one thing: raw data isn’t valuable until it tells a story. This is where analysts step in, turning messy datasets into actionable intelligence. At the heart of this transformation is Microsoft's Power BI, a powerful analytics platform that helps analysts organize data, build logic with DAX, and deliver dashboards that drive decisions.

In this article, we'll look at how analysts approach messy data, apply DAX (Data Analysis Expressions) to add intelligence, and design dashboards that empower stakeholders, who are often the key decision makers in organizations, to act with confidence.

Understanding the Messy Data

Messy data is everywhere. Sometimes customers have inconsistent names; at other times dates are stored as text. Decimals may use commas instead of full stops. The first step in any analytics initiative is data understanding and cleaning, with the emphasis on cleaning.

Common Messy Data Issues

Missing values
Inconsistent formats
Duplicated records
Mis-typed entries
Irrelevant data
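In Power BI these issues are usually fixed in Power Query, but the same ideas can be sketched in plain Python to make them concrete. Everything below (the rows, the two accepted date formats, and the helper names) is invented for illustration and is not Power Query code.

```python
from datetime import datetime

raw_rows = [
    {"customer": " alice JOHNSON ", "order_date": "2026-02-01", "amount": "1.250,50"},
    {"customer": "Alice Johnson", "order_date": "01/02/2026", "amount": "300,00"},
    {"customer": "Alice Johnson", "order_date": "01/02/2026", "amount": "300,00"},  # duplicate
    {"customer": "bob njagi", "order_date": "2026-02-03", "amount": ""},  # missing value
]


def parse_date(text):
    """Dates stored as text: try each known format in turn (day-first for slashes)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")


def parse_amount(text):
    """Convert '1.250,50' (comma as decimal mark) to 1250.5; blanks become None."""
    if not text.strip():
        return None
    return float(text.replace(".", "").replace(",", "."))


def clean(rows):
    """Standardize names, fix types, and drop exact duplicates."""
    seen, cleaned = set(), []
    for r in rows:
        record = (
            " ".join(r["customer"].split()).title(),  # trim spaces, fix casing
            parse_date(r["order_date"]),
            parse_amount(r["amount"]),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


cleaned = clean(raw_rows)
for record in cleaned:
    print(record)
```

Four messy rows come out as three clean ones: the duplicate is dropped, and the missing amount is kept as None so it can be handled explicitly later.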

What Analysts Do First

Go through the dataset
Identify inconsistencies
Transform the data through Power Query

Before building anything in Power BI, analysts usually start in Power Query, cleaning data using its intuitive UI or the M language.

Data Transformation using Power Query

Power BI’s Power Query Editor is where the heavy lifting actually happens.

Analysts use it to

Split columns
Change data types
Replace inconsistent text
Merge and append tables
Handle missing values

The key here is: “Prepare once, reuse many”. With Power Query steps, transformation logic persists every time the dataset refreshes.

Adding Intelligence With DAX

Once the data is clean and structured, it’s time for one of Power BI’s most powerful tools: DAX (Data Analysis Expressions)

DAX is the language that fuels calculated columns, measures, time intelligence, and business logic within Power BI, and it is designed specifically for data analysis.

So how do analysts use DAX in real-life scenarios?

Creating Important Metrics

Instead of simple raw columns, analysts define business critical metrics using DAX, such as:

Total Sales = SUM(Sales[Amount])

This creates a reusable measure that aggregates sales dynamically across filters.

Time Intelligence

Common business questions involve time comparisons. With DAX, you can express these like:

Sales Last Year = CALCULATE([Total Sales],
    SAMEPERIODLASTYEAR('Calendar'[Date]))

Now you can compare year-over-year trends with ease.

Designing Dashboards That Tell Stories

Data without visualization is pointless; it fails to tell the story it is intended to tell. Dashboards are where analytics actually communicates.

Key Principles Analysts Follow

Start with questions — What business decisions do stakeholders need to make?
Choose clear visuals — Bar charts for comparisons, line charts for trends, cards for key numbers.
Use slicers thoughtfully — Let users filter context without clutter.
Avoid noise — Too many visuals complicate and dilute the point of focus.

A typical Power BI dashboard must answer:

What happened?
Why did it happen?
What might happen next?

This turns passive data into actionable insight

Refreshing and Operationalizing Insights

Power BI dashboards aren’t static reports; they’re living, refreshing assets. Analysts schedule data refreshes, connect to live data sources, and configure alerts so stakeholders don’t miss key changes.

This is where analytics becomes practically actionable, not just descriptive

To put it at a glance

Here’s an example of a simple step-by-step analytics workflow in Power BI:

Phase     | Tool                 | Output
Ingest    | Power Query          | Clean tables
Logic     | DAX                  | Dynamic measures
Visualize | Reports & Dashboards | Actionable views
Share     | Power BI Service     | Published insights

Each step builds on the last, and without any one of them insights remain incomplete or inaccurate.

Key takeaway

Power BI is powerful, but it’s the analyst’s mindset that turns raw data into actionable insights:

Curiosity — What story is the data trying to tell?
Precision — Is this metric calculated correctly?
Clarity — Can a user understand this at a glance?
Impact — Does this lead to better decisions?

Power BI is the tool — but it is virtually useless without the analyst

In conclusion

Whether you’re building your first dashboard or optimizing complex enterprise analytics, the process remains the same:

Understand the data
Clean and shape it
Add intelligence with DAX
Visualize simply and with clarity
Empower users to take action

Schemas and Data Modelling in Power BI: A Practical Guide for Accurate and High-Performance Reporting

Wangeci Ndovu — Mon, 02 Feb 2026 15:08:17 +0000

When working with Power BI, most beginners focus heavily on visuals—charts, tables, slicers, and dashboards. However, the real foundation of reliable, fast, and accurate Power BI reports is data modelling.

A poorly designed model can lead to:

Slow reports
Incorrect totals
Confusing relationships
Hard-to-maintain dashboards

In this article, we’ll explore schemas and data modelling in Power BI, focusing on:

Star schema
Snowflake schema
Fact and dimension tables
Relationships
Why good modelling is critical for performance and accurate reporting

What Is Data Modelling in Power BI?

Data modelling is the process of structuring your data into tables and defining how those tables relate to each other.

In Power BI, this happens in the Model view, where you:

Organize tables
Create relationships
Decide filter directions
Design a structure that supports efficient analysis

Think of data modelling as designing the blueprint before building the house.

Fact Tables vs Dimension Tables

Before discussing schemas, it’s important to understand the two main table types.

Fact Tables

Fact tables store measurable, numerical data.

Examples:

Sales amount
Quantity sold
Revenue
Cost
Profit

Characteristics:

Usually very large

Contain foreign keys to dimensions

Contain metrics used in calculations

Example:
Fact_Sales

DateKey | ProductKey | CustomerKey | SalesAmount | Quantity

Dimension Tables

Dimension tables store descriptive attributes used for filtering and grouping.

Examples:

Product name
Customer name
Region
Category
Date details

Characteristics:

Smaller than fact tables

Contain descriptive columns

Used in slicers and axes

Example:
Dim_Product

ProductKey | ProductName | Category | Brand

What Is a Star Schema?

The star schema is the recommended and most efficient data model for Power BI.

Structure

One central fact table

Multiple dimension tables

Dimensions connect directly to the fact table

The model visually resembles a star

Example:

Dim_Date Dim_Product Dim_Customer
\ | /
Fact_Sales

Why Star Schema Is Best for Power BI

Simple relationships
Faster performance
Easier DAX calculations
Clear filter flow
Easier to understand and maintain

Power BI’s storage engine, VertiPaq, is optimized for star schemas.

Example Star Schema in Power BI

Fact_Sales
Dim_Date
Dim_Product
Dim_Customer
Dim_Region

Each dimension connects one-to-many to the fact table.
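Power BI models are built in the Model view rather than in code, but the one-to-many shape of a star schema can be sketched with an in-memory SQLite database. Only one dimension is shown to keep the example short, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: one row per product (the "1" side of the relationship).
    CREATE TABLE Dim_Product (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);

    -- Fact: many rows per product (the "*" side), holding keys and measures.
    CREATE TABLE Fact_Sales (
        ProductKey INTEGER REFERENCES Dim_Product(ProductKey),
        SalesAmount REAL,
        Quantity INTEGER
    );

    INSERT INTO Dim_Product VALUES
        (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
    INSERT INTO Fact_Sales VALUES
        (1, 1000.0, 1), (1, 1000.0, 1), (2, 500.0, 2), (3, 200.0, 1);
    """
)

# Filter or group by a dimension attribute, aggregate a fact measure.
totals = conn.execute(
    """
    SELECT d.Category, SUM(f.SalesAmount) AS total_sales
    FROM Fact_Sales AS f
    JOIN Dim_Product AS d ON d.ProductKey = f.ProductKey
    GROUP BY d.Category
    ORDER BY d.Category
    """
).fetchall()
print(totals)  # [('Electronics', 2500.0), ('Furniture', 200.0)]
```

The query follows the same path a slicer on Category would: the dimension filters the fact table through a single key.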

What Is a Snowflake Schema?

A snowflake schema is a variation of the star schema where dimension tables are further normalized into sub-dimensions.

Structure

Fact table at the center

Dimension tables split into multiple related tables

More relationships

Example:

Dim_Product → Dim_Category
\
Fact_Sales

When Snowflake Schema Appears

Data comes directly from normalized databases

Dimensions have many hierarchical levels

Storage optimization is a priority

Drawbacks in Power BI

More complex relationships
Slower performance
Harder DAX calculations
Confusing filter paths

In Power BI, denormalizing dimensions back into a star schema is usually recommended.

Relationships in Power BI

Relationships define how tables filter each other.

Common Relationship Type

One-to-Many (1:*)
Dimension (1)
Fact (*)

Example:

Dim_Product[ProductKey] → Fact_Sales[ProductKey]

Relationship Direction

Power BI relationships usually use:

Single direction (Dimension → Fact)

Avoid:

Bi-directional filters unless absolutely necessary

Many-to-many relationships (performance risk)

Why Good Data Modelling Is Critical

Performance

Star schema reduces joins

Smaller, denormalized dimensions compress better

Faster report loading and interactions

Accurate Calculations

Bad models cause:

Double counting
Incorrect totals
Broken time intelligence

Good models ensure:

Correct aggregation
Predictable DAX behavior

Simpler DAX

With a clean star schema:

Measures are shorter
Logic is clearer
Debugging is easier

Example:

Total Sales = SUM(Fact_Sales[SalesAmount])

No complex filters needed.

Easier Maintenance

Adding new visuals is straightforward

New measures don’t break existing reports

New data sources integrate cleanly

Best Practices for Power BI Data Modelling

Use star schema whenever possible
Separate facts and dimensions
Avoid unnecessary bi-directional filters
Use surrogate keys (IDs)
Flatten snowflake dimensions when possible
Validate relationships early
Keep the model simple and readable

Conclusion

In Power BI, great visuals come from great models.

You can have the best charts in the world, but without:

Proper schemas
Clean relationships
Well-defined fact and dimension tables

your reports will be slow, inaccurate, and difficult to trust.

Mastering data modelling is what separates a Power BI user from a Power BI professional.

Introduction to Linux for Data Engineers, Including Practical Use of Vi and Nano with Examples

Wangeci Ndovu — Mon, 26 Jan 2026 16:34:43 +0000

Introduction

Linux is one of the most important technologies behind modern data systems. While many beginners focus first on programming languages like Python or SQL, most real-world data engineering work happens on Linux-based systems. Understanding Linux basics—especially how to work with files using terminal editors—is a key step in becoming a confident data engineer.

This article introduces Linux from a beginner’s perspective, explains why it matters in data engineering, and demonstrates practical text editing using Vi and Nano, supported by real terminal examples.

Why Linux Is Important for Data Engineers

Most data engineers do not work only on personal computers. Instead, they manage and maintain:

Cloud servers (AWS EC2, Google Compute Engine, Azure VMs)

Big data platforms (Hadoop, Spark, Kafka)

Workflow tools (Airflow, Luigi)

Databases and data warehouses

All these systems primarily run on Linux

Key benefits of Linux in data engineering

Server dominance: Linux is the most widely used operating system for servers

Stability: Data pipelines can run for days or weeks without interruption

Automation: Linux supports scripting and scheduling with ease

Cost-effective: Open-source and widely supported

Command-line power: Faster and more precise than graphical interfaces

For these reasons, Linux skills are often listed as a core requirement in data engineering job descriptions.

Getting Comfortable with the Linux Terminal

The Linux terminal allows users to interact with the system using text commands.

Example terminal prompt:

ndovu@NDOVU:~$

Explanation:

ndovu → username

NDOVU → computer name

~ → home directory

$ → ready to accept commands

Essential Linux Commands for Beginners

Checking Your Current Location

pwd

Output:

/home/ndovu

This command shows the current directory you are working in.

Viewing Files and Directories

ls

Sample output:

data scripts notes.txt

To see detailed information

ls -l

Creating Directories

mkdir pipelines

Creating multiple levels at once

mkdir -p data/raw data/processed

Creating Empty Files

touch readme.txt

Moving Between Directories

cd data

Go back one level

cd ..

Why Text Editors Matter in Linux

Data engineers frequently edit:

Configuration files

Shell scripts

SQL and Python files

Log files

On Linux servers, graphical editors are often unavailable. This is why terminal-based editors such as Nano and Vi are essential.

Editing Files with Nano (Beginner Friendly)

Nano is easy to learn and ideal for beginners.

Opening a File with Nano

nano readme.txt

If the file does not exist, Nano creates it automatically.

Writing Content in Nano

Type the following text

This project contains data engineering examples.
Linux is essential for managing pipelines.

Saving and Closing Nano

At the bottom of the screen, Nano shows helpful shortcuts:

^O Write Out ^X Exit

Steps:

Press CTRL + O to save

Press Enter to confirm

Press CTRL + X to exit

Confirming the File Content

cat readme.txt

Expected output

This project contains data engineering examples.
Linux is essential for managing pipelines.

Editing Files with Vi (Industry Standard)

Vi (or Vim) is more complex than Nano but extremely powerful.

Opening a File Using Vi
vi config.conf

Vi starts in command mode, not insert mode.

Switching to Insert Mode

Press i

Now type

source=mysql
format=csv
target=hdfs

Saving and Exiting Vi

Press ESC to return to command mode

Type:

:wq

Press Enter

Common Vi Commands

Command   Description
i         Enter insert mode
ESC       Return to command mode
:w        Save file
:q        Quit
:wq       Save and quit
:q!       Quit without saving

Practical Data Engineering Scenario

A common task for a data engineer is editing pipeline configurations on a remote server.

ssh user@analytics-server

cd /etc/pipelines
vi ingestion.conf

File content example

source=kafka
format=json
target=data_lake

This simple task reflects real production work done daily by data engineers.
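A pipeline script usually has to read such a file back. The following is a minimal sketch of a key=value reader in Python, assuming one setting per line. The function name parse_conf and the rule that lines starting with # are comments are assumptions for illustration; the article does not define the file format beyond the three lines shown.

```python
from typing import Dict


def parse_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines; skip blank lines and # comments (assumed)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Fail loudly instead of silently ignoring a malformed line
            raise ValueError(f"expected key=value, got: {line!r}")
        settings[key.strip()] = value.strip()
    return settings


sample = """source=kafka
format=json
target=data_lake
"""
config = parse_conf(sample)
print(config["target"])  # data_lake
```

Raising an error on a malformed line is a design choice: a typo made in Vi shows up right away rather than later in the pipeline.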

Why Terminal Editors Are Still Relevant

They work on remote servers

No graphical interface required

Lightweight and fast

Essential for troubleshooting production issues

Conclusion

Linux is a foundational skill for data engineers. By learning basic commands and mastering text editors like Nano and Vi, beginners gain the confidence to work on real servers and real data systems.

Starting with Nano and gradually learning Vi is a practical approach that prepares you for professional data engineering environments.

What to Learn Next

Linux file permissions (chmod, chown)

Shell scripting basics

Running Python and SQL scripts on Linux

Exploring Spark and Airflow on Linux

With consistent practice, Linux will become a powerful and natural tool in your data engineering journey.

Happy coding

Understanding Git: How to Track Changes, Push, and Pull Code Like a Pro

Wangeci Ndovu — Fri, 16 Jan 2026 11:27:47 +0000

code 101

When you start writing code, you quickly realize something:
files change, mistakes happen, and things break.

So how do professional developers keep track of what changed and how it changed, and how do they correct it? In other words, how do they go back in time when something breaks?

This article simply explains:

How Git tracks changes
How to push code to GitHub
How to pull changes from GitHub

It is written to be beginner friendly.

What is Version Control

Version control is a system that keeps track of:

Every change you make to your code
When the change happened
Who made the change
What exactly was modified

Think of it like Google Docs history for your code.

If your code breaks, Git lets you restore an earlier working version.

Git also helps you collaborate with others on shared projects.

What is GitHub?

GitHub is a cloud platform where Git repositories are stored online.
You use:
Git on your computer
GitHub to back it up and share it with others

How to create a Git Repository

A repository is a folder that Git tracks.

git init

This creates a hidden .git folder inside your project.
Now Git is watching this directory.

Save a Version (Commit)

A commit is a snapshot of your project. Stage your files first, then commit:

git add .
git commit -m "Add file"

Now Git has stored that version in the repository history.
You can always go back to it.

Push Code to GitHub

First connect your project to GitHub:

git remote add origin git@github.com:yourname/yourrepo.git

Push:

git push -u origin main

Note: The default branch name is often main. If yours is master, use:

git push -u origin master

Your code is now safely stored online.
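The remote address used above follows the SSH form user@host:owner/repo.git. As a small illustration of what each part means, this Python sketch splits such an address into its pieces. The names Remote and parse_ssh_remote are hypothetical, and yourname/yourrepo are the article's placeholders, not a real repository.

```python
from typing import NamedTuple


class Remote(NamedTuple):
    user: str
    host: str
    owner: str
    repo: str


def parse_ssh_remote(url: str) -> Remote:
    """Split an SSH remote such as git@host:owner/repo.git into its parts."""
    user_host, colon, path = url.partition(":")
    user, at, host = user_host.partition("@")
    owner, slash, repo = path.partition("/")
    if not (colon and at and slash):
        raise ValueError(f"not a user@host:owner/repo address: {url!r}")
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # drop the optional .git suffix
    return Remote(user, host, owner, repo)


remote = parse_ssh_remote("git@github.com:yourname/yourrepo.git")
print(remote.host, remote.owner, remote.repo)  # github.com yourname yourrepo
```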

Pull Code from GitHub

If someone else updates the repository, or you work from another computer:
Navigate to your local repository directory using the command

cd

For example, if your repository is in a folder named Mombasa:

cd Mombasa

Ensure you are on the correct branch by using the command

git status

Pull the latest changes from the remote repository using the git pull command

git pull

This downloads the latest changes into your project.

How Git Tracks Changes

When you edit a file:

git status

Git will show:

modified: Mombasa

To save the change:

git add Mombasa

then

git commit -m "Mombasa"

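Git stores each commit as a snapshot, but files that did not change are not copied again; Git reuses the version it already has and compresses what it stores.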

This makes Git fast and powerful.
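Part of what makes tracking cheap is that Git names each stored file version by a hash of its contents, so identical content gets the same ID and only needs to be stored once. The sketch below reproduces that ID with Python's hashlib, assuming the default SHA-1 object format. The function name git_blob_id and the sample file contents are made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git gives to a file's contents (a "blob")."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the content
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


v1 = git_blob_id(b"Nairobi\n")
v2 = git_blob_id(b"Nairobi\nMombasa\n")

print(v1 == git_blob_id(b"Nairobi\n"))  # True: same content, same ID
print(v1 == v2)                         # False: edited content gets a new ID
```

When git status reports a file as modified, it has detected that the file's current content no longer matches what was last recorded for it.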

Why Git is a Superpower

With Git you can

Undo mistakes
Work on features safely
Collaborate with others
Track project history
Work on multiple versions

This is why Git is the standard version control tool for most professional developers.

Conclusion

Git is not just a tool; it is how software is built.

Once you understand:

add
commit
push
pull

you have the core workflow you need to contribute to real-world engineering projects.