DEV Community: Edmund Eryuba

Understanding FastAPI, Uvicorn, and ASGI Through a Practical REST API

Edmund Eryuba — Mon, 29 Jun 2026 08:09:37 +0000

Introduction

The Python ecosystem has evolved significantly in recent years, particularly in backend web development. While frameworks such as Flask and Django remain widely used, the increasing demand for high-performance APIs, asynchronous programming, and automatic data validation has led to the rapid adoption of FastAPI.

FastAPI has become one of the most popular frameworks for developing RESTful APIs because it combines developer productivity with excellent runtime performance. It provides automatic request validation, interactive documentation, type-hint-based development, and native support for asynchronous programming; all while remaining lightweight and easy to learn.

However, FastAPI does not operate in isolation. Every FastAPI application relies on an ASGI server, most commonly Uvicorn, to receive HTTP requests and send responses. Understanding how FastAPI, Uvicorn, and ASGI work together provides a clearer picture of the modern Python web stack and explains why FastAPI applications are capable of handling many concurrent requests efficiently.

This article explores these technologies individually before demonstrating how they come together in a practical project: a Patient Appointment Tracker API. Rather than focusing on implementation details already covered in the project’s repository, the discussion emphasizes the architectural concepts and design choices that make FastAPI an effective framework for modern API development.

What is FastAPI?

FastAPI is an open-source Python framework designed specifically for building APIs. It is built on top of two powerful libraries:

Starlette, which provides routing, middleware, request handling, and ASGI support.
Pydantic, which handles data validation, parsing, and serialization.

Unlike traditional frameworks that require developers to manually validate incoming requests, FastAPI uses Python type hints to automatically determine what data is expected.

For example, when an endpoint expects an integer, FastAPI validates incoming requests before the application logic executes. Invalid requests never reach the business layer because the framework returns an informative validation error automatically.

This approach reduces boilerplate code while improving reliability and maintainability.

What is ASGI?

To understand why FastAPI performs so well, it is important to understand ASGI.

ASGI stands for Asynchronous Server Gateway Interface. It is the modern successor to WSGI (Web Server Gateway Interface), which powered many traditional Python web frameworks.

WSGI was designed around synchronous request processing. Each incoming request typically occupied a worker until the response was completed. While this model is perfectly adequate for many applications, it becomes inefficient when applications spend significant time waiting for external resources such as:

databases
APIs
file systems
cloud storage
email services

Instead of blocking while waiting, ASGI allows applications to continue processing other requests. In other words, ASGI enables concurrency.
This makes it particularly suitable for:

REST APIs
real-time dashboards
WebSockets
streaming responses
chat applications
notification systems

FastAPI was designed around ASGI from the beginning rather than adapting to it later.

Where Does Uvicorn Fit?

A common misconception is that FastAPI itself serves HTTP requests, this is entirely incorrect.

FastAPI defines the application. Uvicorn runs it.

An easy analogy is to think of FastAPI as a restaurant kitchen.
The kitchen prepares meals.

Uvicorn acts as the waiter who receives customer orders, delivers them to the kitchen, and returns the finished meals.

Without the waiter, customers cannot communicate with the kitchen.
Likewise, without an ASGI server, FastAPI cannot receive HTTP requests.

When running

uvicorn main:app –-reload

Uvicorn:

starts an HTTP server
listens on a network port
receives incoming requests
forwards them to the FastAPI application
returns the generated responses to clients

The --reload option watches project files and automatically restarts the server whenever changes are detected, making development significantly more efficient.

How FastAPI and Uvicorn Work Together

The interaction between these technologies follows a straightforward sequence.

A client sends an HTTP request.
Uvicorn receives the request.
Uvicorn passes it to the FastAPI application through the ASGI interface.
FastAPI determines the appropriate route.
Request data is validated automatically.
Business logic executes.
A response is generated.
Uvicorn sends the response back to the client.

The developer primarily writes application logic while Uvicorn handles the networking layer and FastAPI manages request processing.

Automatic Validation with Pydantic

One of FastAPI’s defining characteristics is its integration with Pydantic.

Instead of manually checking whether incoming JSON contains valid fields and data types, developers describe the expected structure using Python classes.

FastAPI then automatically:

validates requests
converts compatible data types
generates descriptive error responses
serializes outgoing objects into JSON

This approach improves both code readability and API reliability while reducing repetitive validation logic.

Interactive Documentation

Another major advantage of FastAPI is that documentation is generated automatically from the application’s routes and schemas.

Without writing any additional documentation, FastAPI exposes interactive interfaces where developers can:

inspect available endpoints
submit requests
view responses
understand expected request bodies

This feature significantly simplifies API testing and integration for frontend developers and third-party consumers.

Applying These Concepts: A Patient Appointment Tracker

To explore these concepts in practice, I developed a Patient Appointment Tracker API: GitHub Repo

The project provides a realistic example of how FastAPI’s ecosystem supports the development of a maintainable REST API for managing patient information and appointment records.

Rather than embedding database operations directly within route handlers, the application separates responsibilities into distinct layers. FastAPI routes focus on HTTP communication, validation occurs automatically through Pydantic models, and database interactions are encapsulated within dedicated CRUD functions. This separation makes the application easier to understand, test, and extend.

The project also demonstrates how validation and business rules complement one another. FastAPI ensures that incoming requests conform to the expected schema, while application logic enforces domain-specific constraints such as preventing duplicate patient registrations through unique phone numbers. Together, these mechanisms improve data integrity without cluttering endpoint implementations with repetitive validation code.

Although the project currently uses SQLite during development, the use of SQLAlchemy abstracts database interactions sufficiently to support migration to more robust relational database systems with minimal architectural changes. This illustrates one of the advantages of adopting an ORM within a layered application design.
Why This Architecture Matters

Many beginner APIs evolve into collections of increasingly large route handlers that combine request validation, database queries, business rules, and response formatting within a single function.
While functional for small projects, such an approach quickly becomes difficult to maintain.

FastAPI encourages a cleaner architecture by allowing different concerns to remain independent:

the ASGI server manages network communication
FastAPI handles routing and request processing
Pydantic validates and serializes data
SQLAlchemy manages persistence
business logic remains isolated from HTTP-specific concerns

Each component performs a clearly defined role, making the application easier to extend as requirements grow.

Looking Beyond the Project

The Patient Appointment Tracker demonstrates only a subset of what FastAPI can support.

The same architectural principles can be applied to much larger systems, including:

healthcare management platforms
financial services
e-commerce backends
inventory management systems
analytics platforms
microservices

Additional capabilities such as authentication, authorization, asynchronous database access, background tasks, caching, containerization, and cloud deployment can be incorporated without fundamentally changing the application’s structure.

This scalability is one of FastAPI’s strongest advantages: applications can begin as small prototypes and evolve into production-ready services while retaining a clean and maintainable architecture.

Conclusion

FastAPI represents a modern approach to API development by combining Python’s type hinting system with automatic validation, interactive documentation, and native ASGI support. When paired with Uvicorn, it provides an efficient request-processing pipeline capable of supporting both synchronous and asynchronous applications.

Understanding the relationship between FastAPI, ASGI, and Uvicorn helps demystify the underlying mechanics of a Python web application. FastAPI defines how requests should be handled, Uvicorn exposes the application to the network, and ASGI provides the communication standard that allows them to work together.

The Patient Appointment Tracker serves as a practical demonstration of these concepts, showing how a thoughtful combination of FastAPI, Pydantic, SQLAlchemy, and Uvicorn can produce an API that is not only functional but also maintainable, extensible, and aligned with modern backend development practices.

The complete implementation, including the source code and project documentation, is available in the accompanying GitHub repository.

Feel free to interact with my other projects: Github/EryubaEdmund

Dive Into Containerization, Docker & Docker Compose

Edmund Eryuba — Tue, 12 May 2026 07:17:33 +0000

Modern software systems are expected to run consistently across multiple environments such as development laptops, testing servers, cloud platforms and production infrastructure.

One of the major challenges developers and data engineers face is ensuring that applications behave the same way regardless of where they are deployed. This challenge led to the rise of containerization, a technology that packages applications together with their dependencies into isolated, portable environments called containers.

Among the most widely used containerization tools today are Docker and Docker Compose. These tools simplify application deployment, improve scalability and reduce environment-related issues.

What is Containerization

Containerization is the process of packaging an application together with everything it needs to run. This includes the applications source code, runtime environment, libraries, dependencies and configuration files. The application is packaged into a lightweight unit called a container.

A container can run consistently on a developer’s laptop, virtual machine (VMs), on-premise servers and also on cloud infrastructures. Containers isolate applications from the underlying system while still sharing system resources efficiently.

Unlike traditional virtual machines, containers share the host operating system kernel, making them lightweight, fast and efficient.

Why are containers useful

Portability – the isolated environment that containers provide effectively means the container is decoupled from the environment in which they run. Basically, they don't care much about the environment in which they run, which means they can be run in many different environments with different operating systems and hardware platforms.
Consistency – since the containers are decoupled from the environment in which they run, you can be sure that they operate the same, regardless of where they are deployed. The isolated environment that they provide is the same across different deployment environments.
Fast deployment - Containers start within seconds and can be deployed rapidly across environments. This supports continuous integration (CI), continuous deployment (CD), agile development.
Resource Efficiency - Containers consume fewer resources compared to virtual machines because they share the host operating system kernel. This reduces infrastructure cost and memory usage.

Docker

Docker is an open-source platform that allows you to build, deploy and manage containerized applications.

There are alternative containerization platforms, such as podman, however, Docker is the leading player in this space. There is also Docker Inc, the company that sells the commercial version of Docker. Docker comes with a command line interface (CLI), using which you can do all of the operations that the platform provides.

Docker allows developers to create container images, run containers, share applications easily and automate deployments.

Core Docker components

Docker Images: The blueprints of our application which form the basis of containers. These contain all of the configuration settings that define the isolated environment.
Docker Containers: Are running instances of a Docker image and are what run the actual application.
Dockerfile: A text file containing instructions to build a Docker image.
Docker Hub: A cloud repository for storing and sharing Docker images. A user can have their own registry, from which they can pull images.
Docker Compose: a tool used to manage multiple containers using a single YAML configuration file. Instead of starting containers one by one, Docker Compose allows developers to define services, configure networks, manage volumes and start entire applications using one command

How Similar Is Docker to a Virtual Machine?

Docker and virtual machines (VMs) may seem similar because they both provide isolated environments for applications, but they differ fundamentally in how they achieve this isolation.

Virtual Machines: A VM includes a full copy of an operating system, the application, necessary binaries, and libraries—making it heavier and requiring more resources. VMs run on a hypervisor that emulates hardware for each VM, allowing multiple VMs to run on a single physical machine. This offers strong isolation but at the cost of performance and efficient resource usage.
Docker Containers: In contrast, Docker containers share the host system's OS kernel, making them much lighter and faster to start. Containers package the application and its dependencies, but they do not include an entire OS, relying instead on features provided by the host OS. This results in faster performance and more efficient resource utilization compared to VMs.

Thus, while both technologies allow for application isolation, Docker containers provide a more lightweight and efficient solution, especially suited for cloud-native and distributed applications where quick scaling and portability across environments are critical.

Setting Up Docker

To begin working with Docker on your machine, you need to install Docker Desktop.

This comprehensive tool includes Docker Engine, Docker CLI, Docker Compose, and other essential components, providing everything you need to develop, test, and manage containers seamlessly. By installing Docker Desktop, you'll be equipped with:

The ability to use the Docker Command Line Interface (CLI) for executing commands such as managing containers, images, and networks.
Access to a user-friendly graphical interface to easily monitor and manage your Docker setup.

Interacting with Docker: Engine, Daemon, and CLI

To effectively work with Docker, it's important to understand its architecture and how its main components interact. Here’s how these pieces fit together:

Docker Engine: This is the core part of Docker—a client-server application that enables you to build and run containers. The Docker Engine consists of two main components: the Docker Daemon and the Docker CLI, which communicate via a REST API.
Docker Daemon: Also known as dockerd, the Docker Daemon is a background service that manages Docker objects such as images, containers, networks, and volumes. It listens for API requests and performs the actions needed to build, run, and manage containers. The Daemon is responsible for the actual work of creating and running containers.
Docker CLI: The Docker Command Line Interface (CLI) is the tool you use to interact with Docker from your terminal. When you type a command starting with docker, the CLI sends your request to the Docker Daemon via the REST API. The CLI acts as the user-facing entry point for managing containers, images, and other Docker resources.

The Docker Engine is the overall system that powers Docker, made up of the Docker Daemon (the server) and the Docker CLI (the client). When you run a Docker command, the CLI communicates with the Daemon, which then carries out the requested operation. This architecture allows you to efficiently manage containers and images.

Practical Use of Docker & Docker Compose

This example demonstrates a simple but realistic multi-container application using three services; python flask application, PostgreSQL database and pgAdmin (database management UI).

Project Structure

docker-demo/
│
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── app.py
└── .env

1. Create the Python Application (`app.py`)

Contents:

from flask import Flask
import psycopg2
import os

app = Flask(__name__)

DB_NAME = os.getenv("POSTGRES_DB")
DB_USER = os.getenv("POSTGRES_USER")
DB_PASSWORD = os.getenv("POSTGRES_PASSWORD")
DB_HOST = os.getenv("POSTGRES_HOST")

@app.route("/")
def home():

    try:
        conn = psycopg2.connect(
            dbname=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD,
            host=DB_HOST
        )

        return "Application connected to PostgreSQL successfully!"

    except Exception as e:
        return f"Database connection failed: {e}"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

2. Create Requirements File (`requirements.txt`)

Contents:

flask
psycopg2-binary

3. Create Dockerfile (`Dockerfile`)

Contents:

# Use lightweight Python image
FROM python:3.11-slim

# Set working directory inside container
WORKDIR /app

# Copy dependency file
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application files
COPY . .

# Expose Flask port
EXPOSE 5000

# Start application
CMD ["python", "app.py"]

4. Create Environment Variables File (`.env`)

Contents:

POSTGRES_DB=demo_db
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres123
POSTGRES_HOST=postgres_db

5. Create Docker Compose File (`docker-compose.yml`)

Contents:

version: '3.9'

services: # Defines all application containers

  # PostgreSQL database service
  postgres_db:
    image: postgres:15
    container_name: postgres_container
    restart: unless-stopped
    env_file:
      - .env
    ports:
      - "5432:5432"
    volumes:
      # Persist database data
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      # Checks whether PostgreSQL is ready
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Python flask application
  flask_app:
    build: .
    container_name: flask_container
    restart: unless-stopped
    depends_on:
      postgres_db:
        condition: service_healthy
    env_file:
      - .env
    ports:
      - "5000:5000"
    healthcheck:
      # Checks if application responds
      test: ["CMD", "curl", "-f", "http://localhost:5000"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Pgadmin database ui
  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin_container
    restart: unless-stopped
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com
      PGADMIN_DEFAULT_PASSWORD: admin123
    ports:
      - "5050:80"
    depends_on:
      postgres_db:
        condition: service_healthy

# Persistent storage
volumes:
  postgres_data:

6. Build and Start Containers

Run:

docker compose up –build

Explanation:

up → starts containers
--build→ rebuilds images

7. Verify Running Containers

Run:

docker ps

Expected containers:

postgres_container
flask_container
pgadmin_container

8. Access the Application

Flask Application

Open browser: http://localhost:5000

pgAdmin

Open: http://localhost:5050

9. Stop Containers

Run:

docker compose down

To also remove database storage:

docker compose down -v

In Summary

When the docker compose up –build command is executed, Docker Compose reads the docker-compose.yml file and automatically creates the three services defined in the project: the PostgreSQL database, the Flask application and pgAdmin.

Docker first builds the Flask application image using the instructions inside the Dockerfile, installs the required Python dependencies and then starts each container in the correct order. The PostgreSQL container initializes the database and stores its data in a persistent Docker volume so that the data is not lost when the container stops.

The Flask application container waits until the database passes its health check before attempting to connect, ensuring reliable startup. Once connected, the Flask app becomes accessible through port 5000 on the local machine, while pgAdmin becomes available on port 5050 for database management through a browser interface.

Docker automatically creates an internal network that allows the containers to communicate using service names such as postgres_db instead of IP addresses. Together, Docker and Docker Compose simplify deployment by orchestrating all services, networking, storage, startup dependencies and health monitoring using a single configuration file.

Data Management Systems: Transactional to Analytical Architectures

Edmund Eryuba — Mon, 04 May 2026 07:29:01 +0000

Data is no longer treated as a byproduct of business operations and has become one of the most valuable organizational assets. Every interaction on a banking application, e-commerce platform, hospital system, logistics network or social media service generates data continuously. As organizations increasingly adopt digital workflows, cloud platforms, machine learning systems and real-time applications, the amount of data being generated has grown exponentially.

This rapid expansion introduces significant challenges. Organizations must ensure that data remains accurate, secure, accessible and useful while simultaneously supporting millions of users and analytical operations. Businesses are not only expected to store data efficiently, but also to transform it into meaningful insights that influence strategic decisions, operational efficiency and customer experiences.
Modern data management exists to address these challenges.

This article explores the major systems and architectures used in contemporary data management, beginning with traditional databases and extending into modern analytical ecosystems such as data warehouses, data lakes, and data lakehouses. Particular attention is given to the distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) systems, since these two paradigms form the operational and analytical backbone of most modern data infrastructures.

Data management approaches overview:

Approach	Primary purpose	Data type	Query style
Database (OLTP)	Real-time transactional processing	Structured	Short, frequent reads/writes
Database (OLAP)	Analytical queries & reporting	Structured/Semi	Complex, long-running aggregations
Data Lake	Raw data storage at scale	Any (structured, semi, unstructured)	Ad hoc, exploratory
Data Warehouse	Aggregated analytics & BI	Structured (cleaned)	Pre-defined analytical queries
Data Lakehouse	Unified storage + analytics	Any	Both transactional & analytical

Understanding Data Management

Data management refers to the collection of practices, technologies and architectural strategies used to acquire, store, organize, secure, process and analyze data throughout its lifecycle.
At its core, effective data management ensures that organizations can:

Reliably process operational activities
Maintain data consistency and integrity
Support analytical and business intelligence workflows
Scale systems as data volume increases
Secure sensitive information
Enable data-driven decision making

The complexity of managing data arises because different types of workloads require different architectural approaches. A banking application processing financial transfers has very different requirements from a business intelligence dashboard analyzing five years of customer purchasing behavior.

This distinction introduces two foundational concepts in data systems:

OLTP systems, which handle operational transactions in real time.
OLAP systems, which handle analytical processing and large-scale querying.

Understanding the relationship between these systems provides the foundation for understanding modern data architectures.

OLTP Systems: Powering Real-Time Operations

Online Transaction Processing (OLTP) systems are designed to manage day-to-day operational activities. These systems process large numbers of small, fast, real-time transactions while ensuring that the data remains accurate and consistent.

Whenever a customer transfers money using a banking application, purchases an item online, books a flight ticket or updates a user profile, an OLTP system is involved behind the scenes.
The defining characteristic of OLTP systems is their focus on transactional reliability. These systems must process operations quickly while supporting thousands or even millions of simultaneous users.

A modern e-commerce platform provides a useful example. When a customer places an order, several operations happen almost instantly:

The payment is validated
Inventory levels are updated
The order record is created
Shipping information is generated
Transaction confirmations are sent

If any part of the operation fails, the system must ensure that the database does not become inconsistent. For example, inventory should not decrease if payment processing fails.

This reliability is achieved through the use of ACID properties:

Atomicity ensures transactions fully succeed or fully fail
Consistency maintains valid data states
Isolation prevents concurrent transactions from interfering
Durability guarantees committed data persists after failures

Because OLTP systems prioritize speed and consistency, they commonly rely on highly structured relational databases.

Popular OLTP databases include:

These systems typically use normalized schemas to reduce redundancy and improve transactional integrity.

The Limitations of Transactional Systems

Although OLTP systems excel at operational processing, they are not optimized for deep analytical workloads.

Consider a multinational retailer attempting to answer questions such as:

Which products generated the highest revenue over the last five years?
Which region experienced the fastest growth?
What customer segment has the highest retention rate?
Which marketing campaigns produced the best conversion rates?

These queries often require scanning millions or billions of records and performing complex aggregations across historical datasets.

Running such analytical queries directly on operational OLTP systems creates performance problems. Transactional systems are optimized for short, fast queries and not computationally intensive analytical workloads.

This challenge led to the development of OLAP systems.

OLAP Systems: Transforming Data into Insight

Online Analytical Processing (OLAP) systems are designed specifically for analytical workloads, reporting, forecasting and business intelligence.

Unlike OLTP systems, which focus on operational speed, OLAP systems focus on extracting strategic insights from large volumes of historical data.

Organizations use OLAP systems to answer complex business questions, identify patterns, predict trends, and support executive decision making.

For example, a retail organization may use OLAP systems to analyze:

Seasonal purchasing behavior
Customer segmentation trends
Revenue performance across regions
Supply chain inefficiencies
Long-term sales forecasting

OLAP systems are therefore optimized for:

Complex joins and aggregations
Large-scale reads
Historical data analysis
Multidimensional querying
High-volume analytical processing

Instead of storing only current operational data, OLAP systems typically maintain years of historical information aggregated from multiple operational systems.

Comparing OLTP and OLAP Architectures

Although OLTP and OLAP systems both manage data, they solve fundamentally different problems.

OLTP vs OLAP comparison:

Dimension	OLTP	OLAP
Primary Purpose	Record operational transactions	Analyze historical data
Query Type	Simple inserts, updates, lookups	Complex aggregations, multi-table joins
Data Volume per Query	Rows (single record)	Millions to billions of rows
Latency	Milliseconds	Seconds to minutes
Concurrency	Thousands of concurrent users	Tens to hundreds of users
Schema Design	Normalized (3NF)	Denormalized (Star/Snowflake)
Storage Model	Row-oriented	Column-oriented
Data Freshness	Real-time (seconds)	Near real-time to batch (hours/days)
Primary Users	Application users, customers	Analysts, data scientists, executives
Data History	Current/recent operational state	Months to years of history
Backup Priority	Continuous, mission-critical	Important, but less time-sensitive
Example Systems	MySQL, PostgreSQL, Oracle	Snowflake, BigQuery, Redshift

An airline reservation system is an example of an OLTP environment because it processes live ticket bookings continuously. A business intelligence dashboard analyzing global travel trends over ten years represents an OLAP workload.

This architectural separation is essential because attempting to optimize a single system for both operational and analytical workloads often leads to poor performance in both areas.

As organizations matured technologically, they began building specialized systems dedicated to analytics.

Data Warehouses: Centralized Analytical Repositories

A data warehouse is a centralized system designed to support OLAP workloads by consolidating data from multiple operational sources into a structured analytical environment.

Data warehouses allow organizations to combine information from different departments and systems into a unified repository for analysis and reporting.

Instead of querying live transactional systems directly, analysts query the warehouse.
This approach improves both:

Operational system performance
Analytical query efficiency

Data warehouses commonly support:

Executive dashboards
Business intelligence tools
Financial reporting
KPI monitoring
Predictive analytics

Data is typically moved into warehouses through ETL or ELT pipelines:

ETL (Extract, Transform, Load): Data is extracted, cleaned, transformed, and then loaded into the warehouse.
ELT (Extract, Load, Transform): Raw data is loaded first and transformed within the warehouse itself.

Modern cloud data warehouses include:

Data Lakes: Managing Raw and Large-Scale Data

As organizations began collecting increasingly diverse forms of data
such as logs, multimedia, IoT streams and machine learning datasets, traditional warehouses became insufficient for certain workloads.
This led to the emergence of data lakes.

A data lake is a large-scale storage environment capable of storing raw data in its original format without requiring immediate transformation.
Unlike warehouses, which impose predefined schemas, data lakes often use a schema-on-read approach, meaning structure is applied later during analysis.

Data lakes are particularly useful for:

Machine learning workloads
Streaming data ingestion
Scientific research
IoT ecosystems
Large-scale log analytics

Common technologies associated with data lakes include:

While highly scalable and flexible, early data lakes often suffered from governance and quality-control problems, resulting in poorly organized “data swamps.”

Data Lakehouses: Bridging Operational Flexibility and Analytics

To overcome the limitations of both warehouses and data lakes, modern architectures increasingly adopt the concept of the data lakehouse.
A data lakehouse combines:

The scalability and flexibility of data lakes
The governance and analytical performance of warehouses

Lakehouses support both business intelligence and machine learning workloads within a unified architecture.
They introduce features such as:

ACID transactions
Metadata governance
Versioned datasets
High-performance querying
Open storage formats

Popular lakehouse technologies include:

This architectural evolution reflects the growing need for unified platforms capable of supporting increasingly complex data ecosystems.

The Rise of Integrated Data Architectures

Modern organizations rarely rely on a single data system. Instead, they build interconnected ecosystems where different technologies handle different responsibilities.

A modern architecture may include:

OLTP databases for operational processing
Streaming platforms for real-time ingestion
Data lakes for raw storage
Warehouses for analytics
Lakehouses for unified workloads
Business intelligence tools for reporting
Machine learning platforms for predictive modeling

Workflow orchestration platforms such as Apache Airflow help coordinate these pipelines and automate data movement across systems.

This layered architecture enables organizations to process operational workloads efficiently while simultaneously extracting strategic insights from historical and large-scale data.

Conclusion

Modern data management is fundamentally about balancing operational efficiency with analytical capability.

OLTP systems ensure that real-time business operations remain fast, reliable, and consistent. OLAP systems transform accumulated data into strategic insight through large-scale analysis and reporting. Data warehouses centralize structured analytical workloads, data lakes enable flexible large-scale storage and lakehouses attempt to unify both worlds into a scalable modern architecture.

As organizations continue generating unprecedented amounts of data, the ability to design and manage these interconnected systems becomes increasingly critical. Businesses that successfully integrate transactional reliability with analytical intelligence gain not only operational stability, but also the strategic advantage necessary to compete in a data-driven world.

Automating Data Workflows with Apache Airflow

Edmund Eryuba — Mon, 27 Apr 2026 14:41:22 +0000

As organizations become increasingly data-driven, the scale of their pipelines has grown from modest daily batches to continuous, high-volume streams. What appears to be overwhelming complexity is, in practice, a matter of structure and discipline; imposed through the right tools. Apache Airflow embodies this principle as a batch-oriented orchestration framework, enabling the construction of scheduled, reliable data pipelines in Python while seamlessly integrating the diverse technologies that define modern data ecosystems.

What Airflow actually does

Apache Airflow is an open source tool used to write, schedule, and manage workflows as code. Whenever you have actions that depend on one another and must be performed in a specific order, you can define them as a workflow in Airflow.

Workflows in Airflow are modelled as DAGs (Directed Acyclic Graphs). A DAG is simply a collection of tasks with defined dependencies between them. The "acyclic" part just means there are no circular loops; Task A might trigger Task B, but Task B can never circle back and trigger Task A. This constraint keeps pipelines predictable and debuggable.

At its core, Airflow does three things well: it defines workflows as DAGs, schedules and executes tasks on a timeline and tracks state and dependencies so you always know what ran, when, and whether it succeeded.

Setting up Airflow on Linux

A stable Airflow deployment starts with a clean Python environment. Skipping this step is a common source of dependency conflicts down the road.

Start by creating and activating a virtual environment:

python3 -m venv airflow_env
source airflow_env/bin/activate

Then install Airflow and initialise its metadata database. This is the internal database Airflow uses to track DAG runs, task states, and logs:

pip install apache-airflow
airflow db init

Once that's done, start the webserver and scheduler as separate processes:

airflow webserver --port 8080
airflow scheduler

The web UI will be available at http://localhost:8080. If this is a fresh installation, you'll need to create an admin user before you can log in:

airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com

Getting your DAG to appear in the UI

A surprisingly common stumbling block is writing a DAG that simply doesn't show up in the Airflow interface. If that happens to you, run through this checklist before assuming something deeper is wrong.

Your DAG file must live in the ~/airflow/dags/ directory, have a .py extension, contain no syntax errors and instantiate the DAG object at the global scope of the file (not inside a function). After placing the file correctly, restart the scheduler and refresh the UI. The scheduler needs to re-parse the DAGs folder before new files become visible.

Designing the ETL pipeline

The pipeline in this guide follows a classic three-stage structure:

Extract — fetch stock data from an external API (such as Polygon.io)
Transform — clean and reshape the data using Pandas
Load — write the results to a PostgreSQL database

TaskFlow API vs. Operators and XComs

Airflow offers two styles for wiring tasks together. The TaskFlow API uses Python decorators (@task) and handles data passing automatically; it's clean and concise, but abstracts away some of what's happening under the hood. Operators with XComs, by contrast, give you explicit control over how data flows between tasks.

XCom (short for "cross-communication") is Airflow's built-in message bus for passing small pieces of data between tasks. Here's how the pattern looks across all three stages:

# Extract
def extract(**context):
    data = fetch_api_data()
    context['ti'].xcom_push(key='raw_data', value=data)

# Transform
def transform(**context):
    raw = context['ti'].xcom_pull(key='raw_data')
    df = process_data(raw)
    context['ti'].xcom_push(key='clean_data', value=df.to_json())

# Load
def load(**context):
    data = context['ti'].xcom_pull(key='clean_data')
    df = pd.read_json(data)
    df.to_sql(...)

One important caveat: XCom is designed for small payloads only. If you're passing large DataFrames between tasks, store them in external storage (an S3 bucket or a staging table) and pass only a reference through XCom.

Connecting to PostgreSQL

Airflow manages external connections through its Connections store. Add your PostgreSQL connection via the CLI:

airflow connections add postgres_default \
    --conn-uri "postgresql+psycopg2://airflow_user:password@localhost:5432/stocks_db"

In your DAG, reference this connection by ID:

PostgresHook(postgres_conn_id="postgres_default")

The connection ID in your DAG must match exactly what you registered in Airflow's Connections store. If they don't match, the task will fail with a connection error — and the error message won't always make it obvious why.

Fixing database permission errors

If your pipeline fails with a message like ERROR: permission denied for schema public, the issue is almost certainly that your database user lacks the necessary privileges. Fix it by granting the required permissions in PostgreSQL:

GRANT ALL ON SCHEMA public TO airflow_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO airflow_user;

This is easy to miss because the user might be able to connect to the database but still be blocked from creating or writing to tables in the public schema.

Handling API constraints

When fetching data from an external API, you may encounter a 403 NOT_AUTHORIZED error even when your credentials are correct. This usually means your API subscription doesn't cover the time range you're requesting. Many free or basic tiers restrict historical data access.

A simple fix is to narrow your query window to recent data only:

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

If you need broader historical access, you'll need to upgrade your API plan.

Scheduling and time management

Airflow operates entirely in UTC internally, which is worth keeping in mind when debugging timing issues. When defining a DAG, always specify start_date, schedule_interval, and catchup explicitly:

start_date=datetime(2025, 1, 1)
schedule_interval='@hourly'
catchup=False

Setting catchup=False is particularly important. Without it, Airflow will attempt to backfill all the runs it "missed" between start_date and now, which can trigger dozens or hundreds of unexpected runs the first time you enable a DAG.

The DAG execution lifecycle

Understanding what happens when a DAG runs helps enormously when something goes wrong. The sequence is: the scheduler parses your DAG file, creates a DAG Run, queues the individual tasks, hands them to the executor, and then updates task states (success, failed, retrying) as they complete.

You can configure retry behaviour in the default_args dictionary:

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=3),
}

This means any failed task will automatically retry twice, with a three-minute gap between attempts.

Observability and Debugging

Airflow's UI gives you a clear view into what's happening at every level. The Graph View shows task dependencies at a glance, the Tree View shows run history over time, and clicking into any task gives you access to its full execution logs.

You can also pull logs directly from the command line:

airflow tasks logs <dag_id> <task_id> <execution_date>

When debugging a failure, always start with the task logs. They usually tell you exactly what went wrong, whether it's a Python exception, a connection error, or an API rejection.

Common failure points at a glance

Most issues in an Airflow workflow fall into a handful of categories:

DAG not appearing in UI — wrong directory, syntax error in the file, or the scheduler hasn't been restarted since the file was added.

Connection issues — mismatched connection ID between your DAG code and Airflow's Connections store, or incorrect credentials.

Database permission errors — the database user hasn't been granted the right privileges on the target schema.

API failures — rate limiting, subscription restrictions, or requesting data outside the allowed time window.

Moving toward production

Running Airflow locally on SQLite is fine for development, but not suitable for production use. When you're ready to take things further, consider these changes:

Swap SQLite for PostgreSQL or MySQL as Airflow's metadata database — SQLite has locking issues under concurrent load. Move from the default SequentialExecutor to a CeleryExecutor or KubernetesExecutor to run tasks in parallel. Add a remote logging backend (S3, Elasticsearch) so logs persist beyond the local machine. And implement secrets management — environment variables, HashiCorp Vault, or Airflow's built-in Secrets Backend — rather than storing credentials in plain text.

Insight

At a high level, building this pipeline involved: setting up a Linux environment with Airflow, configuring the metadata database, writing a DAG using Operators and XComs, integrating an external stock data API, transforming the results with Pandas, loading into PostgreSQL with the right permissions, wiring up Airflow connections, and monitoring everything through the scheduler and web UI.

The patterns used here include modular tasks, explicit data flow, external system integration, scheduled automation and observability. These are essential procedures that are also applicable at production-level data pipelines. The choice to use Operators over the TaskFlow API gives one a deeper understanding of how Airflow manages execution and state, which pays dividends when things inevitably break in unexpected ways.

ETL vs ELT: Two Paradigms, One Goal

Edmund Eryuba — Mon, 13 Apr 2026 11:09:58 +0000

What are the similarities, differences, benefits and use cases of ELT and ETL.

The pipelines at a glance

ELT (extract, load, transform) and ETL (extract, transform, load) are both data integration processes that move raw data from a source system to a target database. These data sources can be in multiple, different repositories or in legacy systems that are then transferred using ELT or ETL to a target data location. Both approaches move data from source to destination but where transformation happens changes everything about cost, speed, flexibility, and the kind of analytics you can build.

What is ELT (extract, load, transform)?

With ELT, unstructured data is extracted from a source system and loaded onto a target system to be transformed later, as needed. This unstructured, extracted data is made available to business intelligence systems, and there is no need for data staging.

ELT leverages data warehousing to do basic data transformations, such as data validation or removal of duplicated data. These processes are updated in real-time and used for large amounts of raw data. ELT is a newer process that has not reached its full potential compared to its older sister, ETL. The ELT process was originally based on hard-coded SQL scripts. Those SQL scripts are more likely to have potential coding errors than the more advanced methods used in ETL.

What is ETL (extract, transform, load)?

With ETL, unstructured data is extracted from a source system and specific data points and potential “keys” are identified prior to loading data into the target systems.

In a traditional ETL scenario, the source data is extracted to a staging area and moved into the target system. In the staging area, the data undergoes a transformation process that organizes and cleans all data types. This transformation process allows for the now structured data to be compatible with the target data storage systems.

Where they share common ground

Despite their differences, the two patterns share a substantial foundation; both solve the same fundamental problem of moving heterogeneous data into a single analytical store.

Data integration: Both consolidate data from multiple disparate sources into a unified destination
Transformation necessity: Neither skips transformation, they only differ in when and where it happens.
Orchestration: Both require scheduling, dependency management, and error-handling tooling.
Data quality concerns: Validation, deduplication, and schema enforcement are needed in both patterns.
Observability: Lineage tracking, logging, and alerting are essential regardless of order.

Where they diverge

The key differences between ELT and ETL are the order of operations between the two processes that make them uniquely suited for different situations. Other differences are in data size and data types that each process can handle.

Performance and scalability

ETL pipelines transform data on a dedicated server whose capacity is fixed, when data volumes spike, you hit a ceiling. ELT offloads transformation to cloud warehouses like BigQuery, Snowflake, or Redshift, which scale compute horizontally on demand. For organizations processing billions of rows, ELT's elastic compute model is a significant structural advantage.

Data freshness and raw access

Because ELT loads raw data first, analysts retain access to the original source records. If a transformation rule is wrong, you can fix it and rerun without re-ingesting from the source. With ETL, if data was dropped or transformed before loading, it may be gone permanently, making reruns more expensive and re-extraction from source systems often necessary.

Compliance and sensitivity

ETL's pre-load transformation gives security teams an easier lever: strip or mask personally identifiable information before it ever enters the warehouse. ELT stores raw, potentially sensitive data in the destination system, demanding robust row-level security, column masking, and access policies inside the warehouse itself. It is manageable, but a larger governance surface area.

Benefits of ELT and ETL

ELT Strengths

The ELT approach enables faster implementation than the ETL process, though the data is messy once it is moved. The transformation occurs after the load function, preventing the migration slowdown that can occur during this process. ELT decouples the transformation and load stages, ensuring that a coding error (or other error in the transformation stage) does not halt the migration effort. Additionally, ELT avoids server scaling issues by using the processing power and size of the data warehouse to enable transformation (or scalable computing) on a large scale. ELT also works with cloud data warehouse solutions to support structured, unstructured, semi-structured and raw data types.

ETL Strengths

ETL takes longer to implement but results in cleaner data. This process is well suited for smaller target data repositories that require less frequent updating. ETL also works with cloud data warehouses by using cloud-based SaaS platforms and onsite data warehouses.
There are also many open-source and commercial ETL tools with capabilities and benefits that include the following:

Comprehensive automation and ease-of-use functions that can automate the entire data flow and make recommendations on rules for the extract, transform and load process.
A visual drag-and-drop interface used for specifying rules and data flows.
Support for complex data management to assist with complex calculations, data integrations and string manipulation.
Security and compliance that encrypt sensitive data and are certified compliant with industry or government regulations. This provides a more secure way to encrypt, remove or mask specific data fields to protect client’s privacy.

Use Cases

ELT

An ELT process is best used in high-volume data sets or real-time data use environments. Specific examples include the following:

High-volume environments: Meteorological systems like weather services collect, collate and use large amounts of data on a regular basis. Businesses with large transaction volumes also fall into this category.
Cloud-native data platforms: Using scalable warehouses such as Snowflake, Databricks, and BigQuery that leverage microservices, containerization, and distributed storage enabling modern and flexible analytics architectures..
Real-time ingestion systems: Stock exchanges generate and use large amounts of data in real-time, where delays can be harmful. Additionally, large-scale distributors of materials and components need real-time access to current data for business intelligence.

ETL

ETL is best used for synchronizing several data use environments and migrating data from legacy systems. The following are some specific examples:

Need for data synchronization from several sources: Companies that are merging their ventures may have multiple consumers, supplies and partners in common. This data can be stored in separate data repositories and formatted differently. ETL works to transform the data in a unified format before loading it onto the target data location.
Updating and migrating data from legacy systems: The legacy systems require the ETL process to transform the data into a compatible format with the new structure of the target database.

The verdict

ETL predates the cloud era and remains the right choice when the destination system cannot bear transformation workloads, or when sensitive data must be sanitized before storage. ELT has become the dominant pattern for modern data teams precisely because cloud warehouses turned transformation into a solved compute problem; cheap, fast, and version-controlled through SQL.

In practice, many mature data platforms run both: ELT for the bulk of analytical pipelines, and targeted ETL steps where compliance or system constraints demand it. Understanding the tradeoffs of each and not treating one as universally superior is what separates thoughtful data architecture from following a trend.

Connecting Power BI to a SQL Database (PostgreSQL)

Edmund Eryuba — Fri, 13 Mar 2026 17:16:20 +0000

Introduction

Power Query is a data transformation and data preparation engine. Power Query comes with a graphical interface for getting data from sources and a Power Query editor for applying transformations. Because the engine is available in many products and services, the destination where the data is stored depends on where Power Query is used. Using Power Query, you can perform the extract, transform, and load (ETL) processing of data.

Microsoft Power BI is a comprehensive Business intelligence platform that uses Power Query to ingest data, then adds modelling, visualization, and sharing capabilities. Power Query preps the data, while Power BI turns it into reports

Why use PostgreSQL?

Maintaining dynamic database systems is critical in today’s digital landscape, especially considering the rate in which newer technologies emerge. PostgreSQL is expandable and versatile so it can quickly support a variety of specialized use cases with powerful extension ecosystem, which covers things from time-series data types to geospatial analytics.

Its versatile and approachable design makes PostgreSQL a “one-size-fits-all” solution for many enterprises looking for cost-effective and efficient ways to improve their database management systems.

Built as an open-source database solution, PostgreSQL is completely free from licensing restrictions, vendor lock-in potential, or the risk of over-deployment.

Expert developers and commercial enterprises who understand the limitations of traditional database systems heavily support PostgreSQL. They work diligently to provide a battle-tested, best-of-breed relational database management system.

How Power Query helps with data acquisition

Business users spend up to 80 percent of their time on data preparation, which delays the work of analysis and decision-making. Several challenges contribute to this situation and Power Query helps address many of them.

Enables wide range connectivity of data sources, including data of all sizes and shapes.
Consistency of experience, and parity of query capabilities over all data sources.
Highly interactive and intuitive experience for rapidly and iteratively building queries over any data source, of any size.
When using Power Query to access and transform data, you define a repeatable process (query) that can be easily refreshed in the future to get up-to-date data.
Power Query offers the ability to work against a subset of the entire data set to define the required data transformations, allowing you to easily filter down and transform your data to a manageable size.

Power Query experiences

The Power Query user experience is provided through the Power Query editor user interface. The goal of this interface is to help you apply the transformations you need simply by interacting with a user-friendly set of ribbons, menus, buttons, and other interactive components.

The Power Query editor is the primary data preparation experience. In the editor, you can connect to a wide range of data sources and apply hundreds of different data transformations by previewing data and selecting transformations from the UI. These data transformation capabilities are common across all data sources, whatever the underlying data source limitations.

When you create a new transformation step by interacting with the components of the Power Query interface, Power Query automatically creates the M code required to do the transformation so you don't need to write any code.

Connecting Power BI to a Local PostgreSQL Database

Connecting Microsoft Power BI to an SQL database allows you to import data and build dashboards directly from your database tables. The exact steps depend slightly on the SQL engine (e.g., PostgreSQL, MySQL, or Microsoft SQL Server), but the workflow is mostly the same.

1. Get data in Power BI Desktop

Launch Power BI Desktop. On the Home screen you will see options for selecting a data source or start with a blank report. Click on the Blank report option to be directed to the Home tab.

In Power BI Desktop, you can directly select an Excel worksheet, a Power BI semantic model, a SQL server database, direct data entry, Dataverse data or recently used data source without using the Get data option.

From the Data ribbon, selecting Get Data provides additional methods to select the desired connector.

Select the More option which opens a Get Data window that contains a complete list of available connectors.

2. Connection settings

After clicking Get Data: Choose Database > Select PostgreSQL Database > Click Connect

Power BI includes a built-in PostgreSQL connector that allows direct communication with PostgreSQL servers.

Scroll to PostgeSQL database, select the option and click on Connect to close the window.

In the PostgreSQL database dialog that appears, provide the name of the server and database.

Select either the Import or DirectQuery data connectivity mode. Use Import for snapshots or DirectQuery for live data.

Power BI allows you to optionally include a SQL query in the Advanced Options section if you want to retrieve only specific data from the database.

3. Authentication Credentials

If you're connecting to this database for the first time, select the authentication type you want to use, and then enter your credentials. The authentication types available are:

Database (Username and password)
Microsoft account (Microsoft Entra ID)

These credentials must match the PostgreSQL database user account.

Once authenticated, Power BI establishes a connection to the PostgreSQL server and retrieves metadata about the available tables.

Data preview

The goal of the data preview stage is to provide you with a user-friendly way to preview and select your data.

The Navigator lists all tables in the database, allowing one to preview each table, select multiple tables or load them directly. Either select Load to load the data or Transform Data to continue transforming the data in Power Query editor.

The data preview pane on the right side of the window shows a preview of the data from the object you selected.

Connecting Power BI to a Cloud PostgreSQL Database (Aiven)

Organizations often host databases in the cloud rather than on local machines. One example is Aiven, which provides managed PostgreSQL services.

Connecting Power BI to a cloud PostgreSQL database is similar to connecting to a local database, but additional security settings are required.

1. Obtain Connection Details from Aiven

Get Connection Details: Log in to your Aiven Web Console and navigate to your PostgreSQL service to find the Host, Port, Database Name, and Username/Password.

2. Download the SSL Certificate

Most cloud database providers require encrypted connections for security reasons. Aiven provides an SSL certificate that ensures secure communication between Power BI and the database.

From the Aiven console:

Navigate to Connection Information
Download the CA Certificate
Save the certificate file on your computer

SSL encryption ensures:

Data transmitted over the internet cannot be intercepted
Authentication of the database server
Secure communication between the client and server

3. Connect Using Power BI

Open Power BI: Open Power BI Desktop, click Get Data on the Home ribbon, and select More....

Select connector: Choose Database > PostgreSQL database and click Connect.

Enter Credentials: Provide the server and database name, your Aiven username and password and enable SSL if required. Power BI will use the certificate to verify the database server and establish a secure connection.

Loading Data and Creating Relationships

After connecting to the database, the tables are loaded into Power BI’s data model.

Power BI automatically detects relationships based on matching column names such as customer_id or product_id.
However, relationships can also be created manually.

These relationships form a data model, which defines how tables interact with each other.

Summary

Power BI is a powerful business intelligence platform that allows organizations to transform raw data into meaningful insights. By connecting Power BI to SQL databases such as PostgreSQL, companies can access large datasets, build interactive dashboards and make data-driven decisions.

Connecting to a local PostgreSQL database involves selecting the PostgreSQL connector in Power BI, entering the server and database details, authenticating with credentials, and loading tables into the Power BI model. When connecting to cloud databases such as those hosted on Aiven, additional security measures such as SSL certificates ensure that the connection is encrypted and secure.

Once the data is loaded, Power BI allows analysts to create relationships between tables, forming a structured data model that supports accurate analysis. Strong SQL skills further enhance a Power BI analyst’s ability to retrieve, filter, aggregate, and prepare data efficiently before building reports.

Designing Efficient Queries with SQL Joins and Window Functions

Edmund Eryuba — Mon, 02 Mar 2026 11:08:12 +0000

SQL(Structured Query Language) is a powerful tool to search through large amounts of data and return specific information for analysis. Learning SQL is crucial for anyone aspiring to be a data analyst, data engineer, or data scientist, and helpful in many other fields such as web development or marketing.

SQL Joins

JOINS in SQL are commands which are used to combine rows from two or more tables, based on a related column between those tables. They are predominantly used when a user is trying to extract data from tables which have one-to-many or many-to-many relationships between them.

There are mainly four types of joins that you need to understand. They are:

(INNER) JOIN
LEFT (OUTER) JOIN
RIGHT (OUTER) JOIN
FULL (OUTER) JOIN

INNER JOIN

INNER JOIN is used to retrieve rows where matching values exist in both tables. It helps in:

Combining records based on a related column.
Returning only matching rows from both tables.
Excluding non-matching data from the result set.
Ensuring accurate data relationships between tables.

Syntax:

SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;

LEFT JOIN

LEFT JOIN is used to retrieve all rows from the left table and matching rows from the right table. It helps in:

Returning all records from the left table.
Showing matching data from the right table.
Displaying NULL values where no match exists in the right table.
Performing outer joins, also known as LEFT OUTER JOIN.

Syntax:

SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table
LEFT JOIN right_table
ON left_table.id = right_table.id;

RIGHT JOIN

RIGHT JOIN is used to retrieve all rows from the right table and the matching rows from the left table. It helps in:

Returning all records from the right-side table.
Showing matching data from the left-side table.
Displaying NULL values where no match exists in the left table.
Performing outer joins, also known as RIGHT OUTER JOIN.

Syntax:

SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table 
RIGHT JOIN right_tale
ON left_table.id = right_table.id;

FULL JOIN

FULL JOIN is used to combine the results of both LEFT JOIN and RIGHT JOIN. It helps in:

Returning all rows from both tables.
Showing matching records from each table.
Displaying NULL values where no match exists in either table.
Providing complete data from both sides of the join.

Syntax:

SELECT left_table.id, left_table.left_val, right_table.right_val
FROM left_table 
FULL JOIN right_tale
ON left_table.id = right_table.id;

Core Insights

SQL joins are fundamental for relational data modeling, enabling the combination of rows from multiple tables based on defined relationships, typically via primary and foreign keys.

Proper join selection directly affects result cardinality, null propagation, and business logic interpretation. Performance considerations include indexing join columns, minimizing unnecessary joins and understanding join order in execution plans.

Key takeaways are that joins operationalize relational integrity, drive multi-table analytics and must be designed carefully to avoid duplication, unintended filtering or performance degradation especially in high-volume transactional or analytical databases.

SQL Window Functions

A window function in SQL is a type of function that performs a calculation across a specific set of rows (the 'window' in question), defined by an OVER() clause.

Window functions use values from one or multiple rows to return a value for each row, which makes them different from traditional aggregate functions, which return a single value for multiple rows.

Similar to aggregate function GROUP BY, a window function performs calculations across multiple rows. Unlike aggregate functions, a window function does not group rows into one single row.

Key components of SQL window functions

The syntax for window functions is as follows:

SELECT column_1, column_2, column_3, function()
OVER (PARTITION BY partition_expression ORDER BY order_expression) as output_column_name
FROM table_name

In this syntax:

The SELECT clause defines the columns you want to select from the table_name table.
The function() is the window function you want to use.
The OVER clause defines the partitioning and ordering of rows in the window.
The PARTITION BY clause divides rows into partitions based on the specified partition_expression; if not specified, the result set will be treated as a single partition.
The ORDER BY clause uses the specified order_expression to define the order in which rows will be processed within each partition; if not specified, rows will be processed in an undefined order.
Finally, output_column_name is the name of your output column.

These are the key SQL window function components. One more thing worth mentioning is that window functions are applied after the processing of WHERE, GROUP BY, and HAVING clauses. This means you can use the output of your window functions in subsequent clauses of your queries.

The OVER() clause

The OVER() clause in SQL is essentially the core of window functions. It determines the partitioning and ordering of a rowset before the associated window function is applied.
The OVER() clause can be applied with functions to compute aggregated values such as moving averages, running totals, cumulative aggregates, or top N per group results.

The PARTITION BY clause

The PARTITION BY clause is used to partition the rows of a table into groups. This comes in handy when dealing with large datasets that need to be split into smaller parts, which are easier to manage.
PARTITION BY is always used inside the OVER() clause; if it is omitted, the entire table is treated as a single partition.

The ORDER BY clause

The ORDER BY determines the order of rows within a partition; if it is omitted, the order is undefined.
For instance, when it comes to ranking functions, ORDER BY specifies the order in which ranks are assigned to rows.

Frame Specification

In the same OVER() clause, you can specify the upper and lower bounds of a window frame using one of the two subclauses, ROWS or RANGE. The basic syntax for both of these subclauses is essentially the same:

ROWS BETWEEN lower_bound AND upper_bound

RANGE BETWEEN lower_bound AND upper_bound

And in some cases, they might even return the same result. However, there's an important difference.

In the ROWS subclause, the frame is defined by beginning and ending row positions. Offsets are differences in row numbers from the current row number.

As opposed to that, in the RANGE subclause, the frame is defined by a value range. Offsets are differences in row values from the current row value.

Types of SQL Window Functions

Window functions in SQL Server are divided into three main types: aggregate, ranking, and value functions. Let's have a brief overview of each.

Aggregate Window Functions

AVG(): returns the average of the values in a group, ignoring null values.
MAX(): returns the maximum value in the expression.
MIN(): returns the minimum value in the expression.
SUM(): returns the sum of all the values, or only the DISTINCT values, in the expression.
COUNT(): returns the number of items found in a group.
STDEV(): returns the statistical standard deviation of all values in the specified expression.
STDEVP(): returns the statistical standard deviation for the population for all values in the specified expression.
VAR(): returns the statistical variance of all values in the specified expression; it may be followed by the OVER clause.
VARP(): returns the statistical variance for the population for all values in the specified expression.

Sample query:

SELECT name, salary,
  SUM(salary) OVER (PARTITION BY dept) AS dept_total,
  AVG(salary) OVER (PARTITION BY dept) AS dept_avg
FROM employees;

Ranking Window Functions

Used to assign rank or position within partitions.

ROW_NUMBER(): assigns a unique sequential integer to rows within a partition of a result set.
RANK(): assigns a unique rank to each row within a partition with gaps in the ranking sequence when there are ties.
DENSE_RANK(): assigns a unique rank to each row within a partition without gaps in the ranking sequence when there are ties.
PERCENT_RANK(): calculates the relative rank of a row within a group of rows.
NTILE(): distributes rows in an ordered partition into a specified number of approximately equal groups.

Sample query:

SELECT name, salary,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
FROM employees;

Offset(Value) Window Functions

Used to access data from other rows.

LAG(): retrieves values from rows that precede the current row in the result set.
LEAD(): retrieves values from rows that follow the current row in the result set.
FIRST_VALUE(): returns the first value in an ordered set of values within a partition.
LAST_VALUE(): returns the last value in an ordered set of values within a partition.
NTH_VALUE(): returns the value of the nth row in the ordered set of values.
CUME_DIST(): returns the cumulative distribution of a value in a group of values.

Sample Query:

SELECT date, revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_month,
  revenue - LAG(revenue, 1) OVER (ORDER BY date) AS change
FROM monthly_sales;

Summary

SQL window functions provide a powerful analytical layer within standard SQL, enabling complex calculations across related rows while preserving row-level granularity. Unlike GROUP BY, they do not collapse result sets, which makes them ideal for scenarios requiring both detail and aggregate insight in the same query.

The OVER() clause is central, with PARTITION BY defining logical groups, ORDER BY controlling calculation sequence, and optional frame specifications (ROWS or RANGE) refining scope.

Key functional categories include aggregate window functions for running totals and moving averages, ranking functions such as ROW_NUMBER() and RANK() for ordered comparisons and offset functions like LAG() and LEAD() for time-series or sequential analysis.

When used correctly, window functions significantly reduce query complexity, eliminate the need for self-joins in many analytical patterns and improve expressiveness in reporting and business intelligence workloads.

Turning Data into Insight: An Analyst’s Guide to Power BI

Edmund Eryuba — Sun, 08 Feb 2026 18:44:08 +0000

Introduction: The reality of messy business data

In most organizations, data rarely arrives in a clean, analysis-ready format. Analysts typically receive information from multiple sources: spreadsheets maintained by business teams, exports from transactional systems, cloud applications, and enterprise platforms such as ERPs or CRMs. These datasets often contain inconsistent formats, missing values, duplicate records, and unclear naming conventions.

Working directly with such data leads to unreliable metrics, incorrect aggregations and ultimately poor business decisions. This is where Power BI plays a critical role. Power BI is not just a visualization tool, it is an analytical platform that allows analysts to clean, model, and interpret data before presenting it in a form that decision-makers can trust.

From raw data to business action: The analyst workflow

A typical analytical workflow in Power BI follows a logical sequence:

Load raw data from multiple sources e.g., imports from excel, databases or online services.
Clean and transform the data using Power Query.
Model the data into a meaningful structure.
Create business logic using DAX.
Design dashboards that communicate insight.
Enable decisions and actions by stakeholders.

Each step builds on the previous one. If any stage is poorly executed, the final insight becomes misleading, regardless of how attractive the dashboard looks.

Cleaning and transforming data with power query

Data cleaning is the foundation of all reliable analytics. Common data quality issues include:

Columns stored in the wrong data type.
Missing or null values.
Duplicate customer or transaction records.
Inconsistent naming and coding systems.

These issues directly affect calculations. For example, a null freight value treated as blank instead of zero will distort average shipping costs. Duplicate customer records inflate revenue totals. Incorrect data types prevent time-based analysis entirely.

Power Query provides a transformation layer where analysts can reshape data without altering the original source. This ensures reproducibility and auditability.

Key Transformation Principles

There are several key principles that should guide an analyst in their approach to data transformation:

1. Remove what is not need

Unnecessary columns increase model size, memory usage, and cognitive complexity. Every column should justify its existence in a business question.

2. Standardize naming

Column and table names should reflect business language, not system codes.
For example:

Cust_ID → Customer ID
vSalesTbl → Sales

This improves both usability and long-term maintainability.

3. Handle missing and invalid values

Nulls, errors, and placeholders must be explicitly addressed. Analysts must decide whether missing values represent:

Zero
Unknown
Not applicable Each choice has analytical consequences.

4. Remove duplicates strategically

Duplicates should be removed only when they represent the same real-world entity. Otherwise, analysts risk deleting legitimate records.

Building meaningful data models

Most analytical errors in Power BI do not come from DAX formulas or charts. They come from poor data models.

A strong model reflects how the business actually operates. This typically follows a star schema:

Fact tables: transactions (Sales, Orders, Payments)
Dimension tables: descriptive attributes (Date, Product, Customer, Region)

This structure ensures:

Correct aggregations.
Predictable filter behavior.
High performance.

Without proper modeling, even simple metrics like “Total Sales by Region” can produce incorrect results due to ambiguous relationships or double counting.

Creating business logic with DAX

DAX (Data Analysis Expressions) is a library of functions and operators that can be combined to build formulas and expressions in Power BI, Analysis Services, and Power Pivot in Excel data models. It enables dynamic, context-aware analysis that goes beyond traditional spreadsheet formulas.

Examples of business logic encoded in DAX:

What counts as “Revenue”?
How is “Customer Retention” defined?
What is the official “Profit Margin” formula?

These definitions must be centralized and reusable. Measures become the organization’s single source of analytical truth.

DAX uses a formula syntax similar to Excel but extends it with advanced functions designed specifically for tabular data models in Power BI. It allows users to create measures, calculated columns and calculated tables to perform dynamic and context-aware calculations.

Measures vs Calculated Columns

Calculated columns: A calculated column is a column that you add to an existing table (in the model designer) and then create a DAX formula that defines the column's values. They operate row by row and are stored in memory.
Measures are evaluated dynamically where results change based on report context.

Creating Measures for Advanced Calculations

Measures are a core component of DAX used for calculations on aggregated data.
They are evaluated at query time not stored in the data model
Measures respond dynamically to filters, slicers and report context
Commonly used measures include SUM, AVERAGE and COUNT
DAX supports both implicit and explicit measures
Using correct data types is essential for accurate measure calculations

For most analytical metrics, measures are preferred, because they respond to filters, slicers, and user interactions.

Understanding Context: The Core of Correct Analytics

Context is one of the most important concepts in DAX because it determines how and where a formula is evaluated. It is what makes DAX calculations dynamic: the same formula can return different results depending on the row, cell, or filters applied in a report.

Without understanding context, it becomes difficult to build accurate measures, optimize performance, or troubleshoot unexpected results.

There are three main types of context in DAX:

Row Context

Refers to the current row being evaluated. It is most commonly seen in calculated columns, where the formula is applied row by row.

Filter Context

It is the set of filters applied to the data. These filters can come from slicers and visuals in the report, or they can be explicitly defined inside a DAX formula.

Query Context

Created by the layout of the report itself.

If analysts misunderstand context, they produce:

Wrong totals.
Misleading KPIs.
Inconsistent executive reports.

In summary, context is the foundation of how DAX works. It controls what data a formula can “see” and therefore directly affects the result of every calculation. Mastering row, query, and filter context is essential for building reliable, high-performing, and truly dynamic analytical models in Power BI and other tabular environments.

Designing dashboards that communicate insight

Designing interactive dashboards helps businesses make data-driven decisions. A dashboard is not a collection of charts. It is a decision interface.

It is essential to design professional reports that focus on optimizing layouts for different audiences, and leveraging Power BI’s interactive features.

Good dashboards:

Highlight trends and deviations.
Compare performance against targets.
Expose anomalies and risks.
Support follow-up questions.

Bad dashboards:

Show too many metrics.
Focus on visuals over meaning.
Require explanation to interpret.

Turning Dashboards into Business Decisions

This is the most important step, and the most neglected.

Dashboards should answer questions like:

Which regions are underperforming?
Which products drive the most margin?
Where is customer churn increasing?
What happens if we change pricing?

Real business actions include:

Reallocating marketing budgets.
Optimizing inventory levels.
Identifying operational bottlenecks.
Redesigning sales strategies.

If no decision changes because of a dashboard, then the analysis failed in capturing key business indicators.

Common pitfalls that undermine analytical value

Even experienced analysts fall into these traps:

Treating Power BI as a visualization tool instead of a modeling tool.
Writing complex DAX on top of poor data models.
Using calculated columns instead of measures.
Ignoring filter propagation and relationship direction.
Optimizing visuals before validating metrics.

These issues lead to highly polished dashboards with fundamentally wrong numbers, an undesired outcome in analytics.

Conclusion

Power BI provides an integrated analytical environment where data preparation, semantic modeling, calculation logic, and visualization are combined into a single workflow.

The analytical value of the platform does not emerge from individual components such as Power Query, DAX, or reports in isolation, but from how these components are systematically designed and aligned with business requirements.

Effective use of Power BI requires analysts to impose structure on raw data, define consistent relationships, implement reusable calculation logic through measures and ensure that visual outputs reflect correct filter and evaluation contexts.

When these layers are properly engineered, Power BI supports reliable aggregation, scalable analytical models, and consistent interpretation of metrics across the organization, enabling stakeholders to base operational and strategic decisions on a shared and technically sound analytical foundation.

Data Modelling for High Performance and Accurate Analytics in Power BI

Edmund Eryuba — Sun, 01 Feb 2026 09:16:40 +0000

This article explores data modelling in Power BI with a focus on different schema types and explains how proper modelling enhances performance and ensures accurate reporting.

What is Data Modelling

Data modelling is one of the most critical steps in building effective business intelligence (BI) solutions. In Power BI, data modelling refers to the process of structuring data into related tables, defining relationships and creating a logical framework that supports efficient querying, accurate calculations and meaningful reporting.

A well-designed data model is not just about organizing tables; it directly impacts report performance, usability, scalability, and the correctness of insights derived from data. Poor data modelling leads to slow reports, incorrect aggregations, complex DAX expressions, and ultimately unreliable business decisions.

In Power BI, data modelling involves:

Identifying business entities (facts and dimensions)
Structuring tables logically
Defining relationships between tables
Setting cardinality and filter direction
Creating calculated columns and measures
Ensuring data granularity and consistency

What is a Schema in Power BI?

In Power BI, a schema refers to the structure and organization of data within a data model. Schemas define how data is connected and related within the model, influencing the efficiency and performance of data queries and reports. Understanding schemas requires modelers to classify their model tables as either dimension or fact.

Fact tables

Fact tables store quantitative, transactional data that can be sales orders, quantity sold, revenue, profit and more. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns. The dimension key columns determine the dimensionality of a fact table, while the dimension key values determine the granularity of a fact table.

Example of a fact table: Consider a simple sales analytics scenario in Power BI.

This table stores transactional (measurable) data.

SalesID	DateKey	ProductKey	CustomerKey	Quantity	SalesAmount
1001	20240101	501	301	2	200
1002	20240101	502	302	1	150
1003	20240102	501	303	3	300
1004	20240103	503	301	1	120

Characteristics:

Contains foreign keys to dimensions.
Contains numeric measures.
Has many rows (high volume).

Dimension tables

Dimension tables describe the business entities that are modelled. Entities can include products, people, places, and concepts including time itself. A dimension table contains a key column (or columns) that acts as a unique identifier, and other columns. Other columns support filtering and grouping your data.

This table provides descriptive attributes about products.

ProductKey	ProductName	Category	Brand
501	Laptop	Electronics	Dell
502	Headphones	Accessories	Sony
503	Mouse	Accessories	Logitech

Types of Schemas in Power BI:

1. Star Schema

The star schema consists of a central fact table connected directly to multiple dimension tables, much like the appearance of a star.
The central fact table contains quantitative data (e.g., sales), while the dimension tables hold descriptive attributes related to the facts (e.g. Employee, Date, Territory). Dimension tables are not connected to each other.

Star schemas are ideal for straightforward reporting and querying. They are efficient for read-heavy operations, making them suitable for dashboards and summary reports.

2. Snowflake Schema

The snowflake schema is a normalized version of the star schema. In this design, dimension tables are further divided into related tables, resulting in a more complex structure.
The normalization process eliminates redundancy by splitting dimension tables into multiple related tables. This results in a web-like structure, resembling a snowflake.

Snowflake schemas are used in scenarios requiring detailed data models and efficient storage. They are beneficial when dealing with large datasets where data redundancy needs to be minimized.

3. Galaxies Schema (Fact Constellation Schema)

The galaxies schema involves multiple fact tables that share dimension tables, creating a complex, interconnected data model.

This schema consists of multiple fact tables linked to shared dimension tables, enabling the analysis of different business processes within a single model.

Galaxies schemas are suitable for large-scale enterprise environments where multiple related business processes need to be analysed. They support complex queries and detailed reporting across various domains.

Implementing schemas in Power BI

a. Creating a Star Schema

Set Up Fact and Dimension Tables: Identify and create the central fact table and surrounding dimension tables.
Link Tables: Establish relationships between the fact table and dimension tables using foreign keys.
Optimize for Performance: Index key columns and use efficient data types to enhance query performance.

b. Implementing a Snowflake Schema

Normalize Dimension Tables: Split dimension tables into related sub-tables to reduce redundancy.
Create Relationships: Define relationships between sub-tables and the main dimension tables, ensuring referential integrity.
Optimize Storage: Use appropriate storage and indexing strategies to manage complex joins efficiently.

c. Setting Up a Galaxies Schema

Identify Fact Tables: Determine the various fact tables needed for different business processes.
Share Dimension Tables: Create shared dimension tables to link multiple fact tables.
Ensure Efficient Querying: Design the schema to support complex queries and optimize performance through indexing and data partitioning.

Role of data modelling in performance and reliable reporting

a. Query and Engine Performance

Query and engine performance refers to how efficiently Power BI processes data when users interact with reports. Power BI’s VertiPaq engine performs best when data is organised using a star schema, where fact tables store numerical data and dimension tables store descriptive attributes.

This structure improves compression, reduces the number of joins required during query execution, and simplifies DAX calculations. As a result, reports load faster, visuals respond more quickly, and overall system performance improves.

b. Memory and Scalability

Memory and scalability describe the ability of a Power BI model to handle large and growing datasets. Proper data modelling controls dataset size by reducing column cardinality and removing unnecessary fields. Low-cardinality columns compress efficiently, while column pruning helps minimise memory usage and refresh time.

By structuring data into lean fact and dimension tables rather than wide flat tables, Power BI models become more scalable and capable of supporting high data volumes without performance degradation.

c. Correct Aggregations and Metrics

Correct aggregations ensure that reported values accurately reflect business operations. This depends on defining clear data granularity and using proper relationship structures. Each fact table must represent a consistent level of detail, and filters should flow from dimensions to facts.

Poor modelling can result in double counting, ambiguous totals, or misleading KPIs. A well-designed model prevents these issues by enforcing one-to-many relationships and maintaining logical data structure.

d. Filter Propagation and User Trust

Filter propagation determines how slicers and filters affect report visuals. In a properly modelled system, filters behave predictably and consistently across all visuals, allowing users to explore data intuitively.

When modelling is poor, filters may behave inconsistently, leading to confusing or contradictory results. Reliable filter behavior builds user trust and ensures that insights derived from reports are credible and easy to interpret.

e. Maintainability and Governance

Maintainability refers to how easy the model is to manage and extend over time. A strong data model supports reusable measures, consistent dimensions, and standard business definitions across reports. This creates a single source of truth for the organisation, reduces duplication of logic, and simplifies governance.

As a result, the reporting environment becomes easier to maintain, more consistent, and more reliable for long-term decision-making.

Conclusion

Understanding different schemas in Power BI is crucial for designing efficient data models. Each schema has unique advantages: the star schema is ideal for straightforward reporting and querying, offering simplicity and ease of use; the snowflake schema provides detailed, normalized structures, reducing redundancy and optimizing storage; and the galaxies schema supports complex, large-scale data models with multiple fact tables sharing dimension tables.

Choosing the right schema improves query performance, data storage efficiency, and data refresh operations. By mastering these schemas, you can create robust and scalable data models, enabling your organization to make data-driven decisions effectively.

Understanding and implementing different schemas in Power BI is crucial for designing efficient and effective data models. Each schema type; star, snowflake, and galaxies, offers unique benefits and use cases. By mastering these schemas, you can create robust data models that support comprehensive analysis and insightful reporting. Experiment with different designs based on your data needs and continue refining your skills to become a Power BI expert.

Linux for Data Engineers: From Terminal to Text Editing

Edmund Eryuba — Sun, 25 Jan 2026 16:13:21 +0000

Linux is an open-source operating system that is based on the Unix operating system. It was created by Linus Torvalds in 1991.
Open-source means that the source code of the operating system is available to the public. This allows anyone to modify the original code, customize it, and distribute the new operating system to potential users.

Why should you learn about Linux?

In today's data center landscape, Linux and Microsoft Windows stand out as the primary contenders, with Linux having a major share.

Here are several compelling reasons to learn Linux:

Given the prevalence of Linux hosting, there is a high chance that your application will be hosted on Linux. So, learning Linux as a data engineer or developer becomes increasingly valuable.
With cloud computing becoming the norm, chances are high that your cloud instances will rely on Linux.
Linux serves as the foundation for many operating systems for the Internet of Things (IoT) and mobile applications.
Linux is built for automation, which is central to data engineering. Linux enables repeatability, fault tolerance and observability of the entire workflow.

What is a Linux Kernel?

The kernel is the central component of an operating system that manages the computer and its hardware operations. It handles memory operations and CPU time.

The kernel acts as a bridge between applications and the hardware-level data processing using inter-process communication and system calls.
The kernel loads into memory first when an operating system starts and remains there until the system shuts down. It is responsible for tasks like disk management, task management, and memory management.

What is a Linux distribution?

The Linux kernel is reused and configured differently across distributions. You can further combine different utilities and software to create a completely new operating system.

A Linux distribution or distro is a version of the Linux operating system that includes the Linux kernel, system utilities, and other software. Being open source, a Linux distribution is a collaborative effort involving multiple independent open-source development communities.

Today, there are thousands of Linux distributions to choose from, offering differing goals and criteria for selecting and supporting the software provided by their distribution.

Distributions vary from one to the other, but they generally have several common characteristics:

A distribution consists of a Linux kernel.
It supports user space programs.
A distribution may be small and single-purpose or include thousands of open-source programs.
Some means of installing and updating the distribution and its components should be provided.

Some popular Linux distributions are:

Ubuntu: One of the most widely used and popular Linux distributions. It is user-friendly and recommended for beginners.
Linux Mint: Based on Ubuntu, Linux Mint provides a user-friendly experience with a focus on multimedia support.
Arch Linux: Popular among experienced users, Arch is a lightweight and flexible distribution aimed at users who prefer a DIY approach.
Manjaro: Based on Arch Linux, Manjaro provides a user-friendly experience with pre-installed software and easy system management tools.
Kali Linux: Kali Linux provides a comprehensive suite of security tools and is mostly focused on cybersecurity and hacking.

How to install and access Linux

There are various methods that can be utilized in order to access Linux including on a Windows machine. This section goes into detail exploring these methods.

Install Linux as the primary OS

Installing Linux as the primary OS is the most efficient way to use Linux, as you can use the full power of your machine.
We'll focus on installing Ubuntu, which is one of the most popular Linux distributions. Linux has other numerous distributions suited for user specific applications that can be explored based on user preference.

Step 1 – Download the Ubuntu iso file. Make sure to select a stable release that is labelled "LTS". LTS stands for Long Term Support which means you can get free security and maintenance updates for a long time (usually 5 years).
Step 2 – Create a bootable pen drive: There are a number of softwares that can create a bootable pen drive.
Step 3 – Boot from the pen drive: Once your bootable pen drive is ready, insert it and boot from the pen drive. The boot menu depends on your laptop. You can google the boot menu for your laptop model.
Step 4 – Follow the prompts. Once, the boot process starts, select try or install ubuntu. The process will take some time. Once the GUI appears, you can select the language, and keyboard layout and continue. Enter your login and name. Remember the credentials as you will need them to log in to your system and access full privileges. Wait for the installation to complete.
Step 5 – Restart: Click on restart now and remove the pen drive.
Step 6 – Login: Login with the credentials you entered earlier.

And there you go! Now you can install apps and customize your desktop.

Accessing the terminal

An important part is learning about the terminal where you'll run all the commands and see the magic happen. You can search for the terminal by pressing the "windows" key and typing "terminal".
The shortcut for opening the terminal is ctrl + alt + t.

You can also open the terminal from inside a folder. Right click where you are and click on "Open in Terminal". This will open the terminal in the same path.

How to use Linux on a Windows machine

Sometimes you might need to run both Linux and Windows side by side. Luckily, there are some ways you can get the best of both worlds without getting different computers for each operating system.
This section explores a few ways to use Linux on a Windows machine.

Option 1: "Dual-boot" Linux + Windows

With dual boot, you can install Linux alongside Windows on your computer, allowing you to choose which operating system to use at startup.

This requires partitioning your hard drive and installing Linux on a separate partition. With this approach, you can only use one operating system at a time.

Option 2: Use Windows Subsystem for Linux (WSL)

Windows Subsystem for Linux provides a compatibility layer that lets you run Linux binary executables natively on Windows.

Using WSL has some advantages. The setup for WSL is simple and not time-consuming. It is lightweight compared to virtual machines (VMs) where you have to allocate resources from the host machine. You don't need to install any ISO or virtual disc image for Linux machines which tend to be heavy files. You can use Windows and Linux side by side.

How to install WSL2

First, enable the Windows Subsystem for Linux option in settings.

Go to Start. Search for "Turn Windows features on or off."
Check the option "Windows Subsystem for Linux" if it isn't already.

Next, open your command prompt and provide the installation commands.
Open Command Prompt as an administrator:
Run the command below:

wsl –install

Note: By default, Ubuntu will be installed.

Once installation is complete, you'll need to reboot your Windows machine. So, restart your Windows machine.

Once installation of Ubuntu is complete, you'll be prompted to enter your username and password.
And, that's it! You are ready to use Ubuntu.
Launch Ubuntu by searching from the start menu.

Option 3: Use a Virtual Machine (VM)

A virtual machine (VM) is a software emulation of a physical computer system. It allows you to run multiple operating systems and applications on a single physical machine simultaneously.

You can use virtualization software such as Oracle VirtualBox or VMware to create a virtual machine running Linux within your Windows environment. This allows you to run Linux as a guest operating system alongside Windows.

VM software provides options to allocate and manage hardware resources for each VM, including CPU cores, memory, disk space, and network bandwidth. You can adjust these allocations based on the requirements of the guest operating systems and applications.

Option 4: Use a Browser-based Solution

Browser-based solutions are particularly useful for quick testing, learning, or accessing Linux environments from devices that don't have Linux installed.
You can either use online code editors or web-based terminals to access Linux. Note that you usually don't have full administration privileges in these cases.

Online code editors: They offer editors with built-in Linux terminals. While their primary purpose is coding, you can also utilize the Linux terminal to execute commands and perform tasks.

Replit is an example of an online code editor, where you can write your code and access the Linux shell at the same time.

Web-based Linux terminals: Online Linux terminals allow you to access a Linux command-line interface directly from your browser. These terminals provide a web-based interface to a Linux shell, enabling you to execute commands and work with Linux utilities.
One such example is JSLinux.

Option 5: Use a Cloud-based Solution

Instead of running Linux directly on your Windows machine, you can consider using cloud-based Linux environments or virtual private servers (VPS) to access and work with Linux remotely.

Services like Amazon EC2, Microsoft Azure, or DigitalOcean provide Linux instances that you can connect to from your Windows computer. Note that some of these services offer free tiers, but they are not usually free in the long run.

Introduction to Bash Shell and System Commands

The Linux command line is provided by a program called the shell. Over the years, the shell program has evolved to cater to various options.

Different users can be configured to use different shells. But most users prefer to stick with the current default shell. The default shell for many Linux distros is the GNU Bourne-Again Shell (bash). Bash is succeeded by the Bourne shell (sh).

To find out your current shell, open your terminal and enter the following command:

echo $SHELL

Command breakdown:

The echo command is used to print on the terminal.
The $SHELL is a special variable that holds the name of the current shell.

In my setup, the output is /bin/bash. This means that I am using the bash shell.

Bash is very powerful as it can simplify certain operations that are hard to accomplish efficiently with a GUI (or Graphical User Interface). Remember that most servers do not have a GUI, and it is best to learn to use the powers of a command line interface (CLI).

Terminal vs Shell

The terms terminal and shell are often used interchangeably, but they refer to different parts of the command-line interface.
The terminal is the interface you use to interact with the shell. The shell is the command interpreter that processes and executes your commands.

What is a prompt?

When a shell is used interactively, it displays a $ when it is waiting for a command from the user. This is called the shell prompt.

[username@host ~]$

If the shell is running as root, the prompt is changed to #.

[root@host ~]#

Command Structure

A command is a program that performs a specific operation. Once you have access to the shell, you can enter any command after the $ sign and see the output on the terminal.
Generally, Linux commands follow this syntax:

command [options] [arguments]

Here is the breakdown of the above syntax:

command: This is the name of the command you want to execute. ls (list), cp (copy), and rm (remove) are common Linux commands.
[options]: Options, or flags, often preceded by a hyphen (-) or double hyphen (--), modify the behavior of the command. They can change how the command operates. For example, ls -a uses the -a option to display hidden files in the current directory.
[arguments]: Arguments are the inputs for the commands that require one. These could be filenames, user names, or other data that the command will act upon. For example, in the command cat access.log, cat is the command and access.log is the input. As a result, the cat command displays the contents of the access.log file.

Options and arguments are not required for all commands. Some commands can be run without any options or arguments, while others might require one or both to function correctly. You can always refer to the command's manual to check the options and arguments it supports. You can view a command's manual using the man command.

You can access the manual page for ls with man ls.

Manual pages are a great and quick way to access the documentation. I highly recommend going through man pages for the commands that you use the most.

Managing Files From the Command line

The Linux File-system Hierarchy

All files in Linux are stored in a file-system. It follows an inverted-tree-like structure because the root is at the topmost part.

The / is the root directory and the starting point of the file system. The root directory contains all other directories and files on the system. The / character also serves as a directory separator between path names. For example, /home/alice forms a complete path.
You can learn more about the file system using the man hier command.

Navigating the Linux File-system

The absolute path is the full path from the root directory to the file or directory. It always starts with a /. For example, /home/john/documents.
The relative path, on the other hand, is the path from the current directory to the destination file or directory. It does not start with a /. For example, documents/work/project.

Locating your current directory: You can locate your current directory in the Linux file system using the pwd command.

Changing directories: The command to change directories is cd and it stands for change directory. You can use the cd command to navigate to a different directory.

Some other commonly used cd shortcuts are:

Command	Description
`cd ..`	Go back one directory
`cd ../..`	Go back two directories
`cd or cd ~`	Go to the home directory
`cd -`	Go to the previous path

Managing Files and Directories

Creating new directories: You can create an empty directory using the mkdir command.

# creates an empty directory named "foo" in the current folder
mkdir foo

You can also create directories recursively using the -p option.

Creating new files: The touch command creates an empty file. You can use it like this:

# creates empty file "file.txt" in the current folder
touch file.txt

The file names can be chained together if you want to create multiple files in a single command.

# creates empty files "file1.txt", "file2.txt", and "file3.txt" in the current folder

touch file1.txt file2.txt file3.txt

Removing files and directories: You can use the rm command to remove both files and non-empty directories. The rmdir command removes an empty directory.

Command	Description
`rm file.txt`	Removes the file file.txt
`rm -r directory`	Removes the directory directory and its contents
`rm -f file.txt`	Removes the file file.txt without prompting for confirmation
`rmdir` directory	Removes an empty directory

Copying files using the cp command: To copy files in Linux, use the cp command.

Syntax to copy files: cp source_file destination_of_file This command copies a file named file1.txt to a new file location /home/adam/log.

cp file1.txt /home/adam/logs

The cp command also creates a copy of one file with the provided name.
This command copies a file named file1.txt to another file named file2.txt in the same folder.

cp file1.txt file2.txt

Moving and renaming files and folders: The mv command is used to rename and move files and folders from one directory to the other.

Syntax to move files: mv source_file destination_directory

# Moves a file named file1.txt to a directory named backup

mv file1.txt backup/

To move a directory and its contents:

mv dir1/ backup/

Renaming files and folders in Linux is also done with the mv command.

Syntax to rename files: mv old_name new_name

#Renames a file from file1.txt to file2.txt

mv file1.txt file2.txt

Locating Files and Folders: The find command lets you efficiently search for files, folders, and character and block devices.
Below is the basic syntax of the find command:

find /path/ -type f -name file-to-search

Where,

/path is the path where the file is expected to be found. This is the starting point for searching files. The path can also be/or . which represents the root and current directory, respectively.
-type represents the file descriptors. They can be any of the below:
f – Regular file such as text files, images, and hidden files.
d – Directory. These are the folders under consideration.
l – Symbolic link. Symbolic links point to files and are similar to shortcuts.
c – Character devices. Files that are used to access character devices are called character device files. Drivers communicate with character devices by sending and receiving single characters (bytes, octets). Examples include keyboards, sound cards, and the mouse.
b – Block devices. Files that are used to access block devices are called block device files. Drivers communicate with block devices by sending and receiving entire blocks of data. Examples include USB and CD-ROM
-name is the name of the file type that you want to search.

Basic Commands for Viewing Files

Display files and files contents: The cat command in Linux is used to display the contents of a file.

Here is the basic syntax of the cat command:

cat [options] [file]

If you want to view the contents of a file named file.txt, you can use the following command:

cat file.txt

This will display all the contents of the file on the terminal at once.

Viewing text files interactively using `less` and `more`

While cat displays the entire file at once, less and more allow you to view the contents of a file interactively. This is useful when you want to scroll through a large file or search for specific content.

The syntax of the less command is:

less [options] [file]

The more command is similar to less but has fewer features. It is used to display the contents of a file one screen at a time.
The syntax of the more command is:

more [options] [file]

The Essentials of Text Editing in Linux

Text editing skills using the command line are one of the most crucial skills in Linux. In this section, you will learn how to use two popular text editors in Linux: Vim and Nano. Vim and nano are safe choices to learn text editing as they are present on most Linux distributions.

Mastering Vim: Introductory Guide to Vim

Introduction to Vim

Vim is a popular text editing tool for the command line. Vim comes with its advantages: it is powerful, customizable, and fast. Vim has two variations: Vim (vim) and Vim tiny (vi). Vim tiny is a smaller version of Vim that lacks some features of Vim.

Here are some reasons why you should consider learning Vim:

Most servers are accessed via a CLI, so in system administration, you don't necessarily have the luxury of a GUI. But Vim will always be there.
Vim uses a keyboard-centric approach, as it is designed to be used without a mouse, which can significantly speed up editing tasks once you have learned the keyboard shortcuts. This also makes it faster than GUI tools.
Vim is suitable for all – beginners and advanced users. Vim supports complex string searches, highlighting searches, and much more. Through plugins, Vim provides extended capabilities to developers and system admins that includes code completion, syntax highlighting, file management, version control, and more.

The three Vim modes

You need to know the 3 operating modes of Vim and how to switch between them. Keystrokes behave differently in each command mode. The three modes are as follows:

Command mode.
Edit mode.
Visual mode.

Command Mode.

When you start Vim, you land in the command mode by default. This mode allows you to access other modes.
To switch to other modes, you need to be present in the command mode first

Edit Mode

This mode allows you to make changes to the file. To enter edit mode, press I while in command mode.

Visual mode

This mode allows you to work on a single character, a block of text, or lines of text. Let's break it down into simple steps. Remember, use the below combinations when in command mode.

Shift + V → Select multiple lines.
Ctrl + V → Block mode
V → Character mode The visual mode comes in handy when you need to copy and paste or edit lines in bulk.

Extended command mode.

The extended command mode allows you to perform advanced operations like searching, setting line numbers, and highlighting text. We'll cover extended mode in the next section.

Shortcuts in Vim: Making Editing Faster

Note: All these shortcuts work in the command mode only.

Basic Navigation

Command	Explanation
`h`	Move left
`j`	Move down
`k`	Move up
`l`	Move right
`0`	Move to the beginning of the line
`$`	Move to the end of the line
`gg`	Move to the beginning of the file
`G`	Move to the end of the file
`Ctrl+d`	Move half-page down
`Ctrl+u`	Move half-page up

Editing

Command	Explanation
`i`	Enter insert mode before the cursor
`I`	Enter insert mode at the beginning of the line
`a`	Enter insert mode after the cursor
`A`	Enter insert mode at the end of the line
`o`	Open a new line below the current line and enter insert mode
`O`	Open a new line above the current line and enter insert mode
`x`	Delete the character under the cursor
`dd`	Delete the current line
`yy`	Yank (copy) the current line
`p`	Paste below the cursor
`P`	Paste above the cursor

Searching and Replacing

Command	Explanation
`/`	Search for a pattern which will take you to its next occurrence
`?`	Search for a pattern that will take you to its previous occurrence
`n`	Repeat the last search in the same direction
`N`	Repeat the last search in the opposite direction
`:%s/old/new/g`	Replace all occurrences of `old` with `new` in the file

Exiting

Command	Explanation
`:w`	Save the file but don't exit
`:q`	Quit Vim (fails if there are unsaved changes)
`:wq` or `:x`	Save and quit
`:q!`	Quit without saving

Multiple Windows

Command	Explanation
`:split` or `:sp`	Split the window horizontally
`:vsplit` or `:vsp`	Split the window vertically
`Ctrl+w followed by h/j/k/l`	Navigate between split windows

Mastering Nano

Getting started with Nano: The user-friendly text editor

Nano is a user-friendly text editor that is easy to use and is perfect for beginners. It is pre-installed on most Linux distributions.
To create a new file using Nano, use the following command:

nano

To start editing an existing file with Nano, use the following command:

nano filename

List of key bindings in Nano

General

Command	Explanation
`Ctrl+X`	Exit Nano (prompting to save if changes are made)
`Ctrl+O`	Save the file
`Ctrl+R`	Read a file into the current file
`Ctrl+G`	Display the help text

Editing

Command	Explanation
`Ctrl+K`	Cut the current line and store it in the cutbuffer
`Ctrl+U`	Paste the contents of the cutbuffer into the current line
`Alt+6`	Copy the current line and store it in the cutbuffer
Ctrl+J	Justify the current paragraph

Navigation

Command	Explanation
`Ctrl+A`	Move to the beginning of the line
`Ctrl+E`	Move to the end of the line
`Ctrl+C`	Display the current line number and file information
`Ctrl+_ (Ctrl+Shift+-)`	Go to a specific line (and optionally, column) number
`Ctrl+Y`	Scroll up one page
`Ctrl+V`	Scroll down one page

Search and Replace

Command	Explanation
`Ctrl+W`	Search for a string (then Enter to search again)
`Alt+W`	Repeat the last search but in the opposite direction
`Ctrl+\`	Search and replace

Miscellaneous

Command	Explanation
`Ctrl+T`	Invoke the spell checker, if available
`Ctrl+D`	Delete the character under the cursor (does not cut it)
`Ctrl+L`	Refresh (redraw) the current screen
`Alt+U`	Undo the last operation
`Alt+E`	Redo the last undone operation

Conclusion

This article introduced Linux from both a conceptual and practical perspective, covering its core components, common distributions, and different ways to access it. We explored essential command-line skills, including file system navigation, system commands, and text editing using Vim and Nano.

For data engineers, Linux is a critical platform because most data systems and cloud infrastructures run on it. Mastery of Linux enables efficient automation, system management, troubleshooting, and deployment of data pipelines. As a result, Linux is not just a supporting skill, but a foundational requirement for working effectively in modern data engineering environments.

An Introduction to Git: Concepts, Commands, and Workflows

Edmund Eryuba — Sat, 17 Jan 2026 18:35:17 +0000

What is Git?

Git is a distributed version control system designed to track changes in source code and manage collaboration among developers. It enables individuals and teams to maintain a complete history of a project, revert to previous versions when needed, and work on the same codebase without conflicts. Because each developer has a full copy of the repository, Git offers high performance, reliability, and the ability to work offline. Git is the core technology behind popular platforms such as GitHub, GitLab, and Bitbucket.

Understanding Git Bash

Git Bash is a command-line interface that provides a Unix-like shell environment for using Git on Windows systems. It allows users to run Git commands in a terminal similar to those found on Linux and macOS, making cross-platform development easier. In addition to Git commands, Git Bash supports basic shell operations such as navigating directories, creating files, and managing folders, which are commonly used during development workflows.

Initializing and Managing a Repository

To begin tracking a project with Git, a repository is initialized using the git init command. This creates a hidden .git directory that stores all version history and configuration data. Files are then added to the staging area with git add ., which prepares changes for inclusion in a commit. A commit is created using git commit -m "message", capturing a snapshot of the staged changes along with a descriptive message. The git status command is used regularly to view the current state of files in the repository.

Working with Remote Repositories

Remote repositories enable collaboration by allowing developers to share code through platforms such as GitHub. The git clone command creates a local copy of an existing remote repository. Once changes are made locally, they can be uploaded to the remote repository using git push. To retrieve updates made by others, the git pull command is used, which fetches and merges changes into the local branch.

Branching and Collaboration Basics

Branching allows developers to work on new features or fixes independently without affecting the main codebase. The git branch command lists available branches, while git checkout or git switch is used to move between them. After completing work on a branch, changes can be merged back into the main branch. Through these features, Git and Git Bash provide a structured and efficient approach to version control and team collaboration.

Getting Started with the Git Workflow

To get started, it's important to know the basics of how Git works. You may choose to do the actual work within a terminal, an app like GitHub Desktop, or through GitHub.com. Below are the basic git terminologies aligned with GitHub usage.

Repository (Repo)

On GitHub, a repository is an online-hosted version of your project. It stores your code, commit history, issues, pull requests, and documentation. Most collaboration happens around the GitHub repository, which acts as the central reference point for all contributors. A local repository is connected to GitHub using a remote named origin.

git remote add origin https://github.com/username/repository.git

Commit

A commit is a recorded change to the repository. On GitHub, commits are visible in the Commits tab, where collaborators can review what changed and who made the change. Each commit pushed to GitHub becomes part of the shared project history.

git commit -m "Add login validation"

Branch

In GitHub, branches are heavily used to isolate work. The default branch is usually main, which represents production-ready code. Feature development and bug fixes are done in separate branches, which are later merged into main through pull requests.

git branch feature-auth
git switch feature-auth

Push

Pushing sends your local commits to GitHub, making them visible to collaborators. Until you push, your commits exist only on your local machine. This step is essential for teamwork and backup.

git push origin feature-auth

Pull

Pulling retrieves updates from GitHub and merges them into your local branch. This ensures you are working with the latest version of the project, especially when multiple contributors are involved.

git pull origin main

Pull Request (PR)

A pull request is a GitHub feature used to propose merging one branch into another, usually into main. It allows team members to review code, leave comments, request changes, and approve work before it becomes part of the main codebase. Pull requests are central to GitHub-based collaboration.

Merge

Merging on GitHub usually happens through a pull request rather than directly via the command line. Once approved, GitHub merges the feature branch into the target branch and records the merge in the project history.

Fork

A fork is a GitHub-specific feature that creates a personal copy of someone else’s repository under your account. Forks are common in open-source projects where contributors do not have direct access to the main repository. Changes made in a fork are submitted back using a pull request.

Collaboration workflow

A typical GitHub collaboration workflow involves cloning the repository, creating a branch, committing changes, pushing the branch to GitHub, opening a pull request, undergoing code review, and merging the changes. GitHub enhances Git by adding visibility, discussion, and project management features around this workflow.

When using GitHub, beginners should always pull before starting work, create one branch per feature or fix, push changes frequently, use pull requests instead of direct merges, and write clear commit messages. This approach keeps the repository clean, traceable, and easy to collaborate on.

DEV Community: Edmund Eryuba

Understanding FastAPI, Uvicorn, and ASGI Through a Practical REST API

Introduction

What is FastAPI?

What is ASGI?

Where Does Uvicorn Fit?

How FastAPI and Uvicorn Work Together

Automatic Validation with Pydantic

Interactive Documentation

Applying These Concepts: A Patient Appointment Tracker

Looking Beyond the Project

Conclusion

Dive Into Containerization, Docker & Docker Compose

What is Containerization

Why are containers useful

Docker

Core Docker components

How Similar Is Docker to a Virtual Machine?

Setting Up Docker

Interacting with Docker: Engine, Daemon, and CLI

Practical Use of Docker & Docker Compose

Project Structure

1. Create the Python Application (app.py)

2. Create Requirements File (requirements.txt)

3. Create Dockerfile (Dockerfile)

4. Create Environment Variables File (.env)

5. Create Docker Compose File (docker-compose.yml)

6. Build and Start Containers

7. Verify Running Containers

8. Access the Application

Flask Application

pgAdmin

9. Stop Containers

In Summary

Data Management Systems: Transactional to Analytical Architectures

Understanding Data Management

OLTP Systems: Powering Real-Time Operations

The Limitations of Transactional Systems

OLAP Systems: Transforming Data into Insight

Comparing OLTP and OLAP Architectures

Data Warehouses: Centralized Analytical Repositories

Data Lakes: Managing Raw and Large-Scale Data

Data Lakehouses: Bridging Operational Flexibility and Analytics

The Rise of Integrated Data Architectures

Conclusion

Automating Data Workflows with Apache Airflow

What Airflow actually does

Setting up Airflow on Linux

Getting your DAG to appear in the UI

Designing the ETL pipeline

TaskFlow API vs. Operators and XComs

Connecting to PostgreSQL

Fixing database permission errors

Handling API constraints

Scheduling and time management

The DAG execution lifecycle

Observability and Debugging

Common failure points at a glance

Moving toward production

Insight

ETL vs ELT: Two Paradigms, One Goal

The pipelines at a glance

What is ELT (extract, load, transform)?

What is ETL (extract, transform, load)?

Where they share common ground

Where they diverge

Performance and scalability

Data freshness and raw access

Compliance and sensitivity

Benefits of ELT and ETL

ELT Strengths

ETL Strengths

Use Cases

ELT

ETL

The verdict

Connecting Power BI to a SQL Database (PostgreSQL)

Introduction

Why use PostgreSQL?

How Power Query helps with data acquisition

1. Create the Python Application (`app.py`)

2. Create Requirements File (`requirements.txt`)

3. Create Dockerfile (`Dockerfile`)

4. Create Environment Variables File (`.env`)

5. Create Docker Compose File (`docker-compose.yml`)