<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lawrence Murithi</title>
    <description>The latest articles on DEV Community by Lawrence Murithi (@lawrence_murithi).</description>
    <link>https://dev.to/lawrence_murithi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3713327%2Fe1187555-8b89-4a2c-9168-be280e1c6b86.png</url>
      <title>DEV Community: Lawrence Murithi</title>
      <link>https://dev.to/lawrence_murithi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lawrence_murithi"/>
    <language>en</language>
    <item>
      <title>Folders, Apartments, and Fake Computers: A Guide to Virtual Environments, Docker, and VMs</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Thu, 07 May 2026 12:23:35 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/folders-apartments-and-fake-computers-a-guide-to-virtual-environments-docker-and-vms-503d</link>
      <guid>https://dev.to/lawrence_murithi/folders-apartments-and-fake-computers-a-guide-to-virtual-environments-docker-and-vms-503d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you have been spending a substantial amount of time writing code, you must have run into a frustrating problem: "It works on my computer, but it doesn't work on yours."&lt;br&gt;
This happens because computers are set up differently. You might have a different operating system, a different version of a programming language, or different background software running. When a website or app breaks because of this, developers can lose hours or even days trying to figure out what the problem is.&lt;br&gt;
To solve this, developers came up with ways to isolate software. Instead of installing an app directly onto your main computer, you put it inside a &lt;strong&gt;protective bubble&lt;/strong&gt;. This bubble tricks the software into thinking it has its own private space, with exactly what it needs to run, so it won't mess with the rest of your system.&lt;br&gt;
There are three main tools we use to create these bubbles: &lt;strong&gt;Virtual Environments&lt;/strong&gt;, &lt;strong&gt;Virtual Machines (VMs)&lt;/strong&gt;, and &lt;strong&gt;Docker&lt;/strong&gt;. While they all aim to solve similar problems, they do it in completely different ways, using completely different layers of your computer. &lt;br&gt;
Let's break down exactly what each one is, how they compare and when you should use them.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Virtual Environments
&lt;/h3&gt;

&lt;p&gt;A Virtual Environment is a localized directory that contains a specific version of a programming language and the specific software packages required for a project. It is the simplest and lightest way to isolate a project and is most commonly used in Python (using tools like venv or virtualenv) although similar concepts exist in other languages.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Virtual Environments work
&lt;/h4&gt;

&lt;p&gt;A Virtual Environment provides no system-level isolation. It does not virtualize hardware, nor does it isolate the operating system. It simply changes the PATH variable in your terminal so that when you install a package or run a script, it uses the isolated folder instead of the computer's global system files.&lt;br&gt;
Imagine you are building two different websites on your laptop. Website A is older and needs version 2.0 of a web framework like Django. Website B is brand new and needs version 4.0 of that exact same framework. If you install these tools directly onto your main computer system, they will conflict and one of your websites will stop working.&lt;br&gt;
A virtual environment fixes this by creating a dedicated, private folder for your project. When you turn on (activate) the virtual environment, it temporarily rewrites your computer's internal GPS, known as the system PATH. Because of this, your computer temporarily ignores its main, global list of tools. Instead, it only looks at the tools installed inside that specific project folder.&lt;/p&gt;
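&lt;p&gt;A minimal sketch of that activation flow on macOS/Linux, using Python's built-in venv module (the Django install line is shown as an example only):&lt;/p&gt;

```shell
python3 -m venv .venv              # create the private project folder (a few MB)
. .venv/bin/activate               # temporarily rewrites PATH to prefer .venv
command -v python                  # now resolves to .venv/bin/python
# python -m pip install Django     # would install into .venv only, not globally
deactivate                         # restores the original PATH
```

&lt;p&gt;Deleting the project later is just deleting the .venv folder; nothing global is touched.&lt;/p&gt;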

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Extremely fast&lt;/strong&gt; - Creating and starting a virtual environment takes less than a second because it is just moving some folders around.&lt;br&gt;
&lt;strong&gt;• Lightweight&lt;/strong&gt; - It only takes up a few megabytes of space on your hard drive. There is no heavy software running in the background.&lt;br&gt;
&lt;strong&gt;• Simple to use&lt;/strong&gt; - Usually, it just takes one or two simple commands in your terminal to get started and shut down.&lt;br&gt;
&lt;strong&gt;• No dependency conflicts&lt;/strong&gt; - It solves the problem of dependency conflicts between different projects.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Weak isolation&lt;/strong&gt; - It only isolates programming packages (like Python libraries). It does not isolate the operating system, the system clock, or your hardware settings.&lt;br&gt;
&lt;strong&gt;• "It works on my machine" can still happen&lt;/strong&gt; - Because the isolation is weak, hidden problems can sometimes slip through. If your code secretly relies on a specific font or a hidden system tool installed on your Mac, and you send your virtual environment code to a friend on a Windows PC, the code might still break.&lt;/p&gt;

&lt;p&gt;Virtual environments are used on a local computer in day-to-day coding, when developers work on multiple projects in the same programming language but want to keep each project's dependencies separate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Virtual Machines (VMs)
&lt;/h3&gt;

&lt;p&gt;A Virtual Machine is a complete software emulation of a physical computer. It runs its own full Operating System (Guest OS) entirely separate from the host computer's Operating System. It is the heaviest, most complete, and oldest form of isolation. Software like VirtualBox, VMware, or Microsoft Hyper-V allows you to do this.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Virtual Machines work
&lt;/h4&gt;

&lt;p&gt;If a virtual environment is like putting your code in a separate folder, a Virtual Machine is like buying an entirely new physical computer, shrinking it down, and putting it inside your current computer.&lt;br&gt;
It uses a piece of software called a &lt;strong&gt;Hypervisor&lt;/strong&gt; (like VMware, VirtualBox, or Hyper-V). The hypervisor carves out a specific amount of your physical computer's RAM, CPU, and storage and dedicates it to the VM. You then install a full Operating System (like Windows or Ubuntu) onto that carved-out space. This new system, called the &lt;strong&gt;Guest OS&lt;/strong&gt;, behaves like a real computer, while the main computer is called the &lt;strong&gt;Host&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Complete isolation&lt;/strong&gt; - What happens inside a VM stays inside a VM. Because the hypervisor locks the hardware, if a VM gets infected with a severe virus, your main host computer is almost always completely safe.&lt;br&gt;
&lt;strong&gt;• Run different operating systems&lt;/strong&gt; - You can run a full Windows computer inside a Mac, or a Linux computer inside Windows, allowing you to use software made for different platforms.&lt;br&gt;
&lt;strong&gt;• Highly secure&lt;/strong&gt; - Because the hardware is strictly separated at a deep level, it is trusted by banks, governments, and massive corporations for highly sensitive tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Massive resource hog&lt;/strong&gt; - Since you are running a second operating system on top of your current one, VMs eat up a lot of RAM, CPU power, and battery life. Even if the VM is just sitting idle, it is still running background updates, managing a clock, and keeping a digital desktop alive hence wasting power.&lt;br&gt;
&lt;strong&gt;• Huge files&lt;/strong&gt; - A VM can easily take up 20 to 100 gigabytes of storage space just to hold the basic operating system files.&lt;br&gt;
&lt;strong&gt;• Slow&lt;/strong&gt; - Booting up a VM takes just as long as turning on a physical computer, and moving files in and out of it can be tedious.&lt;/p&gt;

&lt;p&gt;VMs are used in large corporate cloud servers or on a local machine when strict security is needed. They are critical when you need to test software on a completely different operating system, or when a business runs older, legacy applications that require an outdated OS to survive.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Docker (Containers)
&lt;/h3&gt;

&lt;p&gt;Docker is a platform that uses &lt;strong&gt;containerization&lt;/strong&gt; to package an application and all its necessary dependencies (libraries, frameworks, etc.) into a single, standardized unit called a &lt;strong&gt;container&lt;/strong&gt;. Containers are the clever middle ground between the lightness of a Virtual Environment and the strict, heavy isolation of a Virtual Machine.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Docker works
&lt;/h4&gt;

&lt;p&gt;Every operating system is made of two main parts: the &lt;strong&gt;core engine (Kernel)&lt;/strong&gt;, which physically tells your RAM and CPU what to do, and the &lt;strong&gt;user files/tools&lt;/strong&gt; that make up the desktop experience you see on screen.&lt;br&gt;
While a Virtual Machine duplicates both parts, which is what makes it so heavy, Docker only duplicates the user files and tools. All Docker containers share the main host computer's Kernel.&lt;br&gt;
Think of it like an apartment building. A Virtual Machine is like giving everyone their own separate house with their own separate plumbing and electricity. Docker is like an apartment complex where everyone has their own locked, private room (the container) and can decorate however they want, but they all share the building's central plumbing and electrical systems hidden in the walls (the Host OS Kernel).&lt;/p&gt;

&lt;p&gt;To use Docker, you write a simple text file called a &lt;strong&gt;Dockerfile&lt;/strong&gt;. It reads like a recipe: start with a bare-bones version of Linux, set up some default database passwords, download the latest PostgreSQL, and start the database server. Docker reads this file and packages the result into a container image. This image can be handed to anyone, and it will run exactly the same way, regardless of what computer they have.&lt;/p&gt;
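&lt;p&gt;A minimal sketch of such a recipe, assuming the official postgres image from Docker Hub (the password value is a placeholder, not a recommendation):&lt;/p&gt;

```dockerfile
# Start from a bare-bones Linux image that already has PostgreSQL installed
FROM postgres:16-alpine
# Set a default database password (example value only)
ENV POSTGRES_PASSWORD=devsecret
# The base image's entrypoint starts the database server on port 5432
EXPOSE 5432
```

&lt;p&gt;Running docker build and then docker run on this file gives every teammate the exact same database server, regardless of their machine.&lt;/p&gt;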

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Consistent everywhere&lt;/strong&gt; - It solves the "it works on my machine" problem perfectly. A Docker container behaves exactly the same on a Mac, a Windows PC or a cloud server because the environment inside the container never changes.&lt;br&gt;
&lt;strong&gt;• Fast and lightweight&lt;/strong&gt;  - Because they don't boot up a full operating system kernel, containers start in seconds and usually only take up a few hundred megabytes of space.&lt;br&gt;
&lt;strong&gt;• Easy to share and scale&lt;/strong&gt; - You can run dozens or even hundreds of containers on the same computer without them fighting over resources. This allows developers to build microservices. Instead of building one massive app, you put the shopping cart in one container, the user login in another, and the payment system in a third. If the payment container crashes, the rest of the website stays up.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Steeper learning curve&lt;/strong&gt; - You have to learn Docker-specific terminal commands, how to write Dockerfiles, and how networking works to let containers talk to each other.&lt;br&gt;
&lt;strong&gt;• OS limitations&lt;/strong&gt; - Because Docker shares the host's kernel, you generally run Linux containers on Linux machines. Although you can run containers on Mac and Windows, Docker usually installs a tiny, hidden Linux Virtual Machine in the background to provide the Linux Kernel, making Docker slightly heavier on Mac and Windows than it is on native Linux.&lt;br&gt;
&lt;strong&gt;• Less secure than VMs&lt;/strong&gt; - Because containers share the host kernel, the wall between them is thinner, so a critical vulnerability in the host kernel could potentially affect all containers.&lt;/p&gt;

&lt;p&gt;Docker is used almost everywhere: on a developer's laptop, in automated testing environments, and in production running live websites on the open internet. It's used when building modern web applications, working with a team of developers who all use different computers, or breaking a large app down into smaller microservices.&lt;br&gt;
It gives developers an isolated, highly reliable environment that is identical across all machines, without wasting your computer's RAM and hard drive space.&lt;/p&gt;

&lt;h4&gt;
  
  
  Similarities between the tools
&lt;/h4&gt;

&lt;p&gt;The core similarity between all three is the concept of isolation. &lt;br&gt;
They all exist to create boundaries between projects and software. &lt;br&gt;
They also all make it easier to delete a project without leaving junk files behind; you just delete the virtual environment folder, the VM file, or the container image, and everything associated with that project is instantly gone, leaving your main computer perfectly clean.&lt;br&gt;
In the real world, they are often used together. A large company might run a giant Virtual Machine in the cloud to provide security, put Docker inside that Virtual Machine to manage different web apps easily, and a developer might use a Virtual Environment inside that Docker container to organize their Python code.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Major Differences
&lt;/h4&gt;

&lt;p&gt;The difference lies in how much they isolate and how heavy they are.&lt;br&gt;
&lt;strong&gt;• Virtual Environment (Lightest)&lt;/strong&gt; - Isolates only the language packages but relies entirely on your computer for everything else.&lt;br&gt;
&lt;strong&gt;• Docker (Middle)&lt;/strong&gt; - Isolates the application and the operating system files, but shares the core OS engine (the kernel) to save power and speed.&lt;br&gt;
&lt;strong&gt;• Virtual Machine (Heaviest)&lt;/strong&gt; - Isolates absolutely everything. It clones the physical hardware and runs a 100% separate operating system, taking up a lot of space and power to provide maximum security.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flryewh0u6iuk0nrvd682.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flryewh0u6iuk0nrvd682.png" alt="Isolation tools" width="800" height="712"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you are just writing a quick Python script to scrape a website, analyze some data, and need to install a few libraries without breaking your computer, use a Virtual Environment.&lt;br&gt;
If you are building a web app, working with a database, collaborating with other developers, and need to make sure your code runs exactly the same way on your laptop as it will on your company's live servers, use Docker.&lt;br&gt;
If you are on a Mac but absolutely need to run a piece of Windows-only enterprise software, or you are testing dangerous malware and need maximum security to protect your real computer, use a Virtual Machine.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>virtualmachine</category>
      <category>virtualenvironment</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Medallion Architecture: Turning Messy Data into Business Gold</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Wed, 06 May 2026 16:06:09 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/the-medallion-architecture-turning-messy-data-into-business-gold-2blm</link>
      <guid>https://dev.to/lawrence_murithi/the-medallion-architecture-turning-messy-data-into-business-gold-2blm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine drawing water from a muddy river. You would never scoop a glass of water from the bank and drink it straight down. You would want that water pumped into a treatment plant, filtered to remove the debris, and chemically purified until it is crystal clear and safe to consume.&lt;br&gt;
Data requires the exact same treatment.&lt;br&gt;
Ever seen raw data pulled directly from a company’s servers? It's usually a complete mess. Website logs, sales applications, customer service chatbots, and payment gateways all generate endless streams of information. If you take all that raw information, dump it into a single pile, and try to build a revenue report, the results will be a disaster. Your numbers will be wrong, your system will crawl to a halt, and nobody will trust the data.&lt;br&gt;
To process this information safely, data engineers build systems with specific &lt;strong&gt;layers&lt;/strong&gt; that clean and organize records step-by-step. Historically, this was done using traditional data warehouse layers. Today, a modern framework called the &lt;strong&gt;Medallion Architecture&lt;/strong&gt; has taken over the industry.&lt;br&gt;
Here is a deep dive into how data layers work, why the Medallion concept was invented, and how it refines digital mud into a clear, single source of truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Old Way (Traditional Data Warehouse Layers)
&lt;/h3&gt;

&lt;p&gt;Before the Medallion Architecture existed, engineers used a classic three-step method to move data from external software into a company dashboard.&lt;br&gt;
To elaborate on the traditional Data Warehouse architecture, it is essential to ground the concepts in the frameworks introduced by &lt;strong&gt;W.H. Inmon&lt;/strong&gt; (often called the &lt;strong&gt;father of the data warehouse&lt;/strong&gt;) and &lt;strong&gt;Ralph Kimball&lt;/strong&gt;.&lt;br&gt;
Historically known as the Three-Tier Enterprise Data Warehouse (EDW) Architecture, this system was designed to separate &lt;strong&gt;operational systems&lt;/strong&gt; (where data is created) from &lt;strong&gt;analytical systems&lt;/strong&gt; (where data is analyzed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Staging Layer (The Transient Extraction Zone)&lt;/strong&gt;&lt;br&gt;
This was the receiving dock. The staging area is defined as a temporary, intermediate storage zone between operational data sources (ODS) and the data warehouse.&lt;br&gt;
Data from a Shopify store or a Salesforce database was copied and temporarily dropped here. The main goal was speed: get the data out of the live application quickly so the app wouldn't slow down for regular users.&lt;/p&gt;

&lt;h4&gt;
  
  
  Attributes of staging layer
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Decoupling OLTP and OLAP&lt;/strong&gt; - The primary architectural goal of this layer is to isolate Online Transaction Processing (OLTP) systems (like Salesforce or Shopify) from Online Analytical Processing (OLAP) workloads. Analytical queries are highly &lt;strong&gt;resource-intensive&lt;/strong&gt;; running them directly on a live database can cause catastrophic latency for end-users.&lt;br&gt;
&lt;strong&gt;Extraction Mechanics&lt;/strong&gt; - Data is pulled into this layer using methodologies such as &lt;strong&gt;batch processing&lt;/strong&gt; or &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;. The data here is typically stored in its raw, native format.&lt;br&gt;
&lt;strong&gt;Volatility&lt;/strong&gt; - According to traditional DWH design principles, data in the staging layer is transient. Once the data is successfully moved to the next tier, it is generally purged or overwritten in the next batch cycle to conserve expensive storage space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Integration Layer (The Core Enterprise Data Warehouse)&lt;/strong&gt;&lt;br&gt;
This is where the heavy lifting happened. Engineers wrote scripts to clean the data and match up records.&lt;br&gt;
This layer represents what W.H. Inmon famously defined as the subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.&lt;br&gt;
 If your billing system called a customer Client_001 and your website called them User_001, the Integration layer linked them together into a central, highly structured database.&lt;/p&gt;

&lt;h4&gt;
  
  
  Attributes of integration layer
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Semantic Reconciliation&lt;/strong&gt; - The heavy lifting is known as semantic reconciliation and &lt;strong&gt;Master Data Management&lt;/strong&gt; (MDM). Engineers must resolve heterogeneous data formats (e.g., merging Client_001 from an Oracle database and User_001 from a JSON web log) into a unified entity.&lt;br&gt;
&lt;strong&gt;Data Cleansing and Normalization&lt;/strong&gt; - In this layer, data undergoes rigorous cleansing (handling null values, standardizing date formats). Structurally, Inmon advocated for storing this data in the Third Normal Form (3NF). This highly normalized structure reduces data redundancy and ensures mathematical consistency across the enterprise, creating a &lt;strong&gt;Single Version of Truth (SVOT)&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;The Bottleneck&lt;/strong&gt; - Because of the complex normalization rules, writing data into this layer requires highly complex, tightly coupled SQL scripts, making the Integration Layer notoriously slow to update or modify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Presentation Layer (Data Marts and Dimensional Modeling)&lt;/strong&gt;&lt;br&gt;
The highly normalized 3NF data in the Integration layer is too complex for business users to query efficiently; therefore, data must be reshaped for consumption. Engineers would pre-package specific tables for specific teams, e.g. creating a Marketing table or a Sales table that connected easily to dashboard software.&lt;/p&gt;

&lt;h4&gt;
  
  
  Attributes of presentation layer
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The Data Mart&lt;/strong&gt; - The Presentation layer is composed of subsets of the data warehouse focused on a specific business unit also called Data Marts (e.g., Sales, HR, Marketing).&lt;br&gt;
&lt;strong&gt;Dimensional Modeling (The Kimball Method)&lt;/strong&gt; - In this layer, engineers apply Ralph Kimball’s dimensional modeling techniques, organizing data into &lt;strong&gt;Star Schemas&lt;/strong&gt; or &lt;strong&gt;Snowflake Schemas&lt;/strong&gt;. Data is divided into &lt;strong&gt;Facts&lt;/strong&gt; (measurable, quantitative data, e.g. sales amount) and &lt;strong&gt;Dimensions&lt;/strong&gt; (descriptive attributes, e.g. time, store, or customer).&lt;br&gt;
&lt;strong&gt;Optimized for Read-Heavy Workloads&lt;/strong&gt; - By pre-joining and denormalizing the data, this layer allows Business Intelligence tools like PowerBI to execute complex analytical queries rapidly without requiring end-users to understand underlying SQL structures.&lt;/p&gt;
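&lt;p&gt;As a rough illustration of the idea (all table contents here are invented), a star schema keeps measurable facts in one table and descriptive dimensions in another, joined by a key:&lt;/p&gt;

```python
# Hypothetical star schema: a fact table of sales keyed to a store dimension.
dim_store = {
    1: {"store_id": 1, "city": "Nairobi"},
    2: {"store_id": 2, "city": "Mombasa"},
}
fact_sales = [
    {"store_id": 1, "date": "2026-05-06", "sales_amount": 10.0},
    {"store_id": 2, "date": "2026-05-06", "sales_amount": 3.0},
]
# A BI query pre-joins each fact to its descriptive dimension by key:
report = [
    {"city": dim_store[f["store_id"]]["city"],
     "date": f["date"],
     "sales_amount": f["sales_amount"]}
    for f in fact_sales
]
print(report[0]["city"])  # Nairobi
```

&lt;p&gt;Pre-joining and denormalizing like this is exactly what makes the layer fast for read-heavy dashboards.&lt;/p&gt;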

&lt;h4&gt;
  
  
  The Problem with the Old Way
&lt;/h4&gt;

&lt;p&gt;This system relied heavily on a process called ETL (Extract, Transform, Load). Engineers would extract the data, transform/clean it and then load it into the warehouse. The fatal flaw was that the raw data was often discarded after it was cleaned to save storage space. If a data engineer accidentally deleted a crucial column during the clean phase, that historical data was gone forever.&lt;br&gt;
&lt;strong&gt;1. The Schema-on-Write Constraint&lt;/strong&gt;&lt;br&gt;
Traditional DWHs operated on a &lt;strong&gt;Schema-on-Write&lt;/strong&gt; paradigm. This means that before data could be loaded into the warehouse, the warehouse's schema (tables, columns, data types) had to be rigidly defined. If a new column was added to the source software, the ETL pipeline would fail, or simply drop the unrecognized data, until an engineer manually updated the database schema.&lt;br&gt;
&lt;strong&gt;2. Destructive Transformations and Storage Costs&lt;/strong&gt;&lt;br&gt;
On-premise relational database storage such as Teradata or Oracle appliances used to be very expensive. To save disk space, raw data was deemed expendable. Data was extracted, transformed to fit the strict schema, and the raw source data was then discarded.&lt;br&gt;
This model had some downsides, which included:&lt;br&gt;
&lt;strong&gt;• Loss of Auditability and Lineage&lt;/strong&gt; - If a transformation logic error occurred (e.g., a script incorrectly rounded up financial figures), there was no historical raw data to refer back to, since the original data was permanently lost.&lt;br&gt;
&lt;strong&gt;• Lack of Flexibility for Machine Learning&lt;/strong&gt; - Modern data science requires massive amounts of raw, unstructured or semi-structured data to train machine learning models. The traditional integration layer stripped away the granular, raw anomalies that data scientists actually need, leaving only highly aggregated, structured data.&lt;br&gt;
This flaw, the loss of raw data combined with the rigidity of ETL, paved the way for Data Lakes, drove the shift from ETL to ELT (where cheap cloud storage allows raw data to be stored before transformation), and ultimately led to the modern Medallion Architecture (Bronze, Silver, Gold), which preserves raw data while still providing structured analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Modern Shift (The Medallion Architecture)
&lt;/h3&gt;

&lt;p&gt;As cloud storage became incredibly cheap, companies stopped throwing away their raw data and began dumping everything into massive, cheap storage areas (Data Lakes).&lt;br&gt;
Eventually, companies pioneered the Lakehouse, which combined the cheap, near-infinite storage of a Data Lake with the strict organization of a traditional Data Warehouse.&lt;br&gt;
The need to organize the massive amounts of data inside a Lakehouse gave birth to the Medallion Architecture.&lt;br&gt;
The Medallion Architecture separates data into three specific stages: &lt;strong&gt;Bronze&lt;/strong&gt;, &lt;strong&gt;Silver&lt;/strong&gt;, and &lt;strong&gt;Gold&lt;/strong&gt;. It mimics the logical flow of the traditional layers but fundamentally changes how data is treated, preserved, and upgraded.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do the three layers work?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. The Bronze Layer (The Raw Zone)&lt;/strong&gt;&lt;br&gt;
This is where all the raw data lands from the various sources.&lt;br&gt;
The data is saved exactly as it arrived. You do not fix typos. You do not rename columns. You just capture it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Features of bronze layer
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Safety and Troubleshooting&lt;/strong&gt; - Since the raw data is completely untouched, you never have to worry about accidentally destroying information. If an engineer writes a bad piece of code that ruins the data in the later layers, they can simply go back to the Bronze layer and restart the process.&lt;br&gt;
&lt;strong&gt;• Historical Archive&lt;/strong&gt; - The Bronze layer acts as an infinite, permanent record of everything that ever happened in the business. It is usually &lt;strong&gt;append-only&lt;/strong&gt;, meaning new records are just added to the pile without overwriting old records.&lt;br&gt;
&lt;strong&gt;• Speed&lt;/strong&gt; - Getting data into the Bronze layer is fast since the computer isn't doing any complex math, translations or cleaning. Engineers often use tools called &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; to stream this raw data in real-time.&lt;/p&gt;
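&lt;p&gt;A toy Python sketch of append-only Bronze capture (the record fields and source name are invented for illustration):&lt;/p&gt;

```python
import json
import time

def land_in_bronze(bronze_log, raw_record, source):
    """Append-only capture: keep the record exactly as it arrived,
    plus ingestion metadata. Nothing is cleaned or renamed here."""
    bronze_log.append({
        "source": source,
        "ingested_at": time.time(),
        "payload": json.dumps(raw_record),  # byte-for-byte copy of the source record
    })

bronze = []
record = {"user": "Client_001", "amt": "19.99", "date": "07-05-2026"}
land_in_bronze(bronze, record, "billing")
land_in_bronze(bronze, record, "billing")  # a duplicate is appended, never overwritten
print(len(bronze))  # 2: Bronze is an archive; deduplication happens later, in Silver
```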

&lt;p&gt;&lt;strong&gt;2. The Silver Layer (The Cleaned Zone)&lt;/strong&gt;&lt;br&gt;
Once the data is safely locked away in the Bronze layer, it is copied and moved into the Silver layer. The goal of the Silver layer is to create a &lt;strong&gt;Single Source of Truth&lt;/strong&gt; for the entire enterprise.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happens here?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Cleaning and Standardization&lt;/strong&gt; - Engineers fix the formatting. For example, if one source system writes dates as DD-MM-YYYY and another writes MM-DD-YYYY, the Silver layer standardizes them all into one format.&lt;br&gt;
&lt;strong&gt;• Filtering and Quarantining&lt;/strong&gt; - Junk data is handled here. If a user accidentally enters an age like 999, the system spots it, and instead of deleting it, engineers push that bad record into a separate quarantine table so it doesn't ruin the main data set but can still be investigated later.&lt;br&gt;
&lt;strong&gt;• Deduplication&lt;/strong&gt; - Sometimes, source systems can glitch and send the same receipt twice. The Silver layer strips out duplicates so every row is unique.&lt;br&gt;
&lt;strong&gt;• Joining&lt;/strong&gt; - Data from different tables is connected using relationships. A log of customer purchases is joined with a product inventory table so that you can see exactly what item was bought, not just a random product ID number.&lt;br&gt;
&lt;strong&gt;• Security&lt;/strong&gt; - This is where sensitive information (like passwords, social security numbers, or personal emails) is scrambled or hidden so that analysts using the data later on cannot see private customer details.&lt;/p&gt;

&lt;p&gt;Data scientists and analysts spend a lot of time in the Silver layer. It is clean and trustworthy, but it is still highly detailed. Every single individual action is visible, which makes it the perfect place to look for hidden trends or train machine learning models.&lt;/p&gt;
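&lt;p&gt;A toy Python sketch of those Silver steps, standardizing dates, quarantining junk, and deduplicating (the field names and the age threshold are illustrative):&lt;/p&gt;

```python
from datetime import datetime

def to_silver(bronze_rows):
    """Clean Bronze rows: standardize dates, quarantine junk, deduplicate."""
    silver, quarantine, seen = [], [], set()
    for row in bronze_rows:
        key = (row["user"], row["date"], row["amount"])
        if key in seen:
            continue                      # deduplication: drop exact repeats
        seen.add(key)
        if row["age"] > 120:              # junk value like 999: quarantine, don't delete
            quarantine.append(row)
            continue
        clean = dict(row)
        # standardize DD-MM-YYYY into ISO 8601 YYYY-MM-DD
        clean["date"] = datetime.strptime(row["date"], "%d-%m-%Y").date().isoformat()
        silver.append(clean)
    return silver, quarantine

rows = [
    {"user": "Client_001", "date": "06-05-2026", "amount": 19.99, "age": 34},
    {"user": "Client_001", "date": "06-05-2026", "amount": 19.99, "age": 34},  # glitch duplicate
    {"user": "User_002", "date": "06-05-2026", "amount": 5.00, "age": 999},    # junk age
]
silver, quarantine = to_silver(rows)
print(len(silver), len(quarantine))  # 1 clean row, 1 quarantined row
```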

&lt;p&gt;&lt;strong&gt;3. The Gold Layer (The Action Zone)&lt;/strong&gt;&lt;br&gt;
The Gold layer is the final destination. The data here is no longer meant for deep exploration but is designed to answer specific business questions immediately.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happens here?
&lt;/h4&gt;

&lt;p&gt;In the Silver layer, you might have a table with ten million individual rows. If a user tries to load the rows into a dashboard, the software will freeze. However, in the Gold layer, those millions of rows are turned into highly summarized, bite-sized metrics.&lt;br&gt;
&lt;strong&gt;• Aggregations&lt;/strong&gt; - Instead of listing every single sale, engineers create a Gold table that simply shows Total Sales per Store per Day.&lt;br&gt;
&lt;strong&gt;• Business Logic&lt;/strong&gt; - This is where specific company rules live. If your marketing team defines an active subscriber as someone who has opened an email in the last 30 days, that exact mathematical rule is applied to a Gold table.&lt;br&gt;
&lt;strong&gt;• Performance&lt;/strong&gt; - Data loads instantly because it is heavily summarized and simplified using a Star Schema layout. When you connect any Business Intelligence tool to the Gold layer, the charts populate immediately.&lt;/p&gt;
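&lt;p&gt;A toy Python sketch of the Gold aggregation described above, collapsing detailed Silver rows into Total Sales per Store per Day (the data is invented):&lt;/p&gt;

```python
from collections import defaultdict

def to_gold(silver_rows):
    """Aggregate detailed Silver rows into Total Sales per Store per Day."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[(row["store"], row["date"])] += row["amount"]
    return dict(totals)

silver = [
    {"store": "Nairobi", "date": "2026-05-06", "amount": 10.0},
    {"store": "Nairobi", "date": "2026-05-06", "amount": 5.5},
    {"store": "Mombasa", "date": "2026-05-06", "amount": 3.0},
]
print(to_gold(silver))  # two summary rows instead of millions of detail rows
```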

&lt;h3&gt;
  
  
  Why the Medallion Architecture Wins
&lt;/h3&gt;

&lt;p&gt;The reason nearly every modern data team is adopting this structure is because it solves the biggest headaches that have plagued developers for decades.&lt;br&gt;
&lt;strong&gt;1. Bulletproof Data Lineage&lt;/strong&gt;&lt;br&gt;
When an executive looks at a Gold dashboard and sees that monthly revenue dropped by 50%, panic sets in. The data team needs to find out if the business is actually failing, or if the system is just broken. &lt;br&gt;
With this architecture, they can trace the flow backward. They check the rules in the Gold layer. If those are correct, they look at the cleaned data in the Silver layer. If that looks fine, they check the raw files in the Bronze layer.&lt;br&gt;
&lt;strong&gt;2. Extreme Flexibility&lt;/strong&gt;&lt;br&gt;
If the finance department suddenly requests a completely new way to calculate annual growth, the data team doesn't have to panic. They do not have to go back to the original software sources, and they do not have to re-clean everything. They simply build a new Gold table on top of the already clean Silver data.&lt;br&gt;
&lt;strong&gt;3. System Reliability (ACID Transactions)&lt;/strong&gt;&lt;br&gt;
Modern Medallion architectures are built on specialized table formats like &lt;strong&gt;Delta Lake&lt;/strong&gt; or &lt;strong&gt;Apache Iceberg&lt;/strong&gt; which support ACID transactions. That means if a server crashes halfway through moving data from Silver to Gold, it won't leave you with a half-finished, corrupted table. The system will automatically roll back to the last safe state, preventing bad data from leaking into executive reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you want to remember how the Medallion Architecture functions, just remember these three phrases:&lt;br&gt;
Bronze - Here is everything we found (messy, huge, exact copies).&lt;br&gt;
Silver - Here is what actually happened (clean, standardized, truthful).&lt;br&gt;
Gold - Here is what we should do about it (summarized, fast, ready for action).&lt;br&gt;
&lt;strong&gt;The Foundation of Trust&lt;/strong&gt;&lt;br&gt;
A data platform is only as useful as the trust people put in it. If employees constantly find missing numbers, broken charts, or conflicting reports, they will abandon the dashboards and go back to guessing.&lt;br&gt;
The Medallion Architecture is much more than a way to organize servers; it is a framework for building organizational trust. By moving information systematically through the Bronze, Silver, and Gold layers, a company guarantees that every digital footprint is captured securely, cleaned relentlessly, and presented flawlessly. Just like turning muddy water into something safe to drink, the Medallion Architecture takes the chaos of raw information and refines it into the exact clarity a business needs to survive.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>luxdev</category>
      <category>dataengineering</category>
      <category>warehouse</category>
    </item>
    <item>
      <title>Transactional Power Vs Analytical Precision: The Essential Guide to OLTP and OLAP</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Fri, 01 May 2026 19:58:49 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/transactional-power-vs-analytical-precision-the-essential-guide-to-oltp-and-olap-2nfe</link>
      <guid>https://dev.to/lawrence_murithi/transactional-power-vs-analytical-precision-the-essential-guide-to-oltp-and-olap-2nfe</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Behind every digital interaction is a fundamental divide in how data is handled. The system required to process your grocery checkout with lightning speed is radically different from the system a corporation uses to analyze a decade of sales growth. This is the core distinction between &lt;strong&gt;Transactional Power&lt;/strong&gt; vs. &lt;strong&gt;Analytical Precision&lt;/strong&gt;. To understand the backbone of modern technology, you must understand &lt;strong&gt;OLTP (Online Transactional Processing)&lt;/strong&gt; and &lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;. &lt;br&gt;
Though they sound like technical jargon, they are simple concepts that define how businesses operate and grow. &lt;br&gt;
This article serves as your roadmap to understanding how these systems function, their unique strengths, and why the balance between them is the secret to data-driven success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLTP (Online Transaction Processing): Handling the Day-to-Day&lt;/strong&gt;&lt;br&gt;
OLTP is the engine that runs traditional databases. It is designed to manage everyday business operations and process thousands of short, fast interactions per second. It is the system that handles the daily, minute-by-minute work of a business. Whenever a specific action or transaction takes place, OLTP is the system taking care of it.&lt;br&gt;
In a database, a transaction is any small unit of work such as changing your password. &lt;br&gt;
Transaction systems follow important rules called &lt;strong&gt;ACID properties&lt;/strong&gt;.&lt;br&gt;
ACID Properties are a set of four fundamental principles that guarantee reliable database transactions. They ensure data integrity and accuracy, preventing corruption even during system failures or concurrent operations. &lt;br&gt;
The four principles are:&lt;br&gt;
&lt;strong&gt;Atomicity (All-or-Nothing)&lt;/strong&gt; - A transaction is treated as a single unit; it either fully completes or entirely fails and rolls back.&lt;br&gt;
&lt;strong&gt;Consistency (Data Integrity)&lt;/strong&gt; - A transaction moves the database from one valid state to another, adhering to all constraints and rules. That means data remains valid before and after the transaction.&lt;br&gt;
&lt;strong&gt;Isolation (Concurrency Control)&lt;/strong&gt; - Concurrent transactions are isolated from one another, ensuring they do not interfere with each other.&lt;br&gt;
&lt;strong&gt;Durability (Permanent Data)&lt;/strong&gt; - Once a transaction is committed, its changes are permanently saved and will survive system failures or crashes.&lt;/p&gt;
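Atomicity is easy to see in practice with Python's built-in sqlite3 module, which groups statements into a transaction: if anything fails before the commit, a rollback restores the previous valid state. The account table, balances, and the simulated crash below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('checking', 100.0), ('savings', 0.0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        # Simulate a server crash halfway through the transfer.
        if amount > 50:
            raise RuntimeError("server crashed mid-transfer")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.commit()
    except Exception:
        conn.rollback()  # atomicity: the partial debit is undone

transfer(conn, "checking", "savings", 60.0)  # fails and rolls back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {'checking': 100.0, 'savings': 0.0} -- no half-finished update survived
```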

&lt;h4&gt;
  
  
  Examples of OLTP in real life
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Adding an item to your online shopping cart.&lt;/li&gt;
&lt;li&gt;Booking an airline ticket.&lt;/li&gt;
&lt;li&gt;Sending a text message.&lt;/li&gt;
&lt;li&gt;Banking systems (Mpesa, ATM transactions).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of OLTP like the cashier at a busy grocery store. The cashier’s job is to scan items quickly, take your money, hand you a receipt, and move on to the next person.&lt;/p&gt;

&lt;h3&gt;
  
  
  How OLTP Works
&lt;/h3&gt;

&lt;p&gt;OLTP systems prioritize speed and accuracy. They use a design concept called &lt;strong&gt;normalization&lt;/strong&gt;. This means the database organizes data into many small tables to avoid saving the same piece of information twice. Because the data is spread out neatly, the system can insert a new record, update a row, or delete a piece of data almost instantly.&lt;/p&gt;
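Normalization can be sketched with Python's built-in sqlite3 module: the customer's details live in one small table, each order stores only a reference to them, and a join reassembles the pieces when needed. The table names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized layout: customer details are stored once; each order
# references the customer by id instead of repeating the name/email.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'John', 'john@example.com');
    INSERT INTO orders VALUES (10, 1, 'bread'), (11, 1, 'milk');
""")

# A join reassembles the spread-out data on demand.
rows = conn.execute("""
    SELECT customers.name, orders.item
    FROM orders JOIN customers ON orders.customer_id = customers.id
    ORDER BY orders.id
""").fetchall()
# rows == [('John', 'bread'), ('John', 'milk')]
```

Because John's email exists in exactly one row, updating it is a single instant write, which is precisely what OLTP optimizes for.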

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;Imagine you want to withdraw $50 from an ATM. The bank's OLTP system immediately checks your balance, approves the withdrawal, and updates your account to show $50 less. This has to happen in seconds, and it has to be 100% accurate so you cannot overdraw your account.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key features of OLTP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Low latency/Fast response time&lt;/strong&gt; - When you swipe your card, you expect it to be approved in seconds. OLTP databases are built to respond instantly.&lt;br&gt;
&lt;strong&gt;• High number of users&lt;/strong&gt; - The system ensures that thousands of users can access the same row in a database without failure.&lt;br&gt;
&lt;strong&gt;• Normalized Data&lt;/strong&gt; - Databases are typically highly normalized to reduce redundancy and ensure fast data entry. A single OLTP transaction does not require much data. &lt;br&gt;
&lt;strong&gt;• Real-time processing/Accuracy&lt;/strong&gt; - If you transfer $50 from your current account to your savings account, the system must subtract $50 from one and add $50 to the other. If the system crashes halfway through, the OLTP system cancels the whole thing so your data does not get corrupted. OLTP systems are built to be perfectly accurate and fail-safe.&lt;br&gt;
&lt;strong&gt;• Write-heavy operations&lt;/strong&gt; - Because thousands of users might be acting at the exact same time, the system is constantly writing, updating, or deleting information in the database.&lt;br&gt;
&lt;strong&gt;• Highly available&lt;/strong&gt; - Because OLTP systems handle the immediate, day-to-day operations of a business, they are designed to be online, working, and accessible virtually 100% of the time; downtime is not an option. &lt;br&gt;
OLTP systems are usually built with backup servers and fail-safes. If one server crashes, another one instantly takes over so the customer doesn't notice a glitch.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros of OLTP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Efficiency in Data Entry&lt;/strong&gt; - Highly optimized for adding, modifying, or deleting records.&lt;br&gt;
&lt;strong&gt;• Data Integrity&lt;/strong&gt; - High reliability due to ACID compliance.&lt;br&gt;
&lt;strong&gt;• Availability&lt;/strong&gt; - Designed for 24/7 uptime for business-critical applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons of OLTP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Inefficient for Complex Analysis&lt;/strong&gt; - If you ask an OLTP database to calculate the average sales of a product over the last five years, it will have to scan millions of everyday records. This takes a lot of computing power and can slow down the system for people trying to use it for normal tasks.&lt;br&gt;
&lt;strong&gt;• Limited History&lt;/strong&gt; - To keep things fast, OLTP systems usually only hold current or recent data. Old data is often moved somewhere else to save space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;&lt;br&gt;
OLAP is the engine behind data warehouses. If OLTP is the system for doing things, OLAP is the system for analyzing things. While OLTP looks at only a tiny slice of data at a time, OLAP is the brain behind strategic planning: it is designed for data mining, processing huge amounts of information to find patterns, trends, and summaries, as well as for complex reporting. Managers, data scientists, and business owners use OLAP to spot trends, build reports, and make big decisions. &lt;/p&gt;

&lt;h4&gt;
  
  
  Making Sense of OLAP
&lt;/h4&gt;

&lt;p&gt;Think of OLAP as the manager in the back office of the grocery store. They aren't ringing up customers. They are sitting at a desk, looking at charts and graphs of past sales to decide if they need to order more apples for next week.&lt;/p&gt;

&lt;h4&gt;
  
  
  How OLAP Works
&lt;/h4&gt;

&lt;p&gt;OLAP systems are not built to process quick, small updates. To make this faster, OLAP uses &lt;strong&gt;denormalization&lt;/strong&gt;. Instead of spreading data across many tiny tables like OLTP, OLAP groups massive amounts of related data together into large tables. This takes up more storage space, but it means the system can read through billions of records very quickly to find patterns.&lt;/p&gt;
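The trade-off described above can be shown in plain Python, with invented records: the denormalized layout copies the same dimension attributes onto every row, spending storage so that an analytical scan can read one big table straight through with no joins.

```python
# Normalized (OLTP-style): facts reference a small dimension table.
stores = {1: {"city": "Nairobi", "region": "Central"}}
sales = [{"store_id": 1, "amount": 120.0}, {"store_id": 1, "amount": 80.0}]

# Denormalized (OLAP-style): the same attributes are repeated on every
# row, trading storage space for scan speed.
sales_wide = [
    {"city": "Nairobi", "region": "Central", "amount": 120.0},
    {"city": "Nairobi", "region": "Central", "amount": 80.0},
]

# An analytical query becomes a single pass over one wide table.
central_total = sum(r["amount"] for r in sales_wide if r["region"] == "Central")
# central_total == 200.0
```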

&lt;h4&gt;
  
  
  Key features of OLAP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Read-heavy operations&lt;/strong&gt; - Unlike OLTP, which is constantly writing new data (new orders, new users), OLAP mostly just reads old data. It looks at what already happened.&lt;br&gt;
&lt;strong&gt;• Complex Queries&lt;/strong&gt; - OLAP tasks involve complex math—adding, averaging, and grouping massive lists of numbers.&lt;br&gt;
&lt;strong&gt;• Multidimensional Analysis&lt;/strong&gt; - Users can slice and dice data (e.g. viewing sales by region, then by month, then by product category) using data cubes.&lt;br&gt;
&lt;strong&gt;• Denormalized Data&lt;/strong&gt; - Databases often use Star or Snowflake schemas to reduce the number of table joins needed for queries.&lt;br&gt;
&lt;strong&gt;• Slower response time&lt;/strong&gt; - While nobody wants to wait all day, an OLAP report might take a few minutes or even a few hours to run. This usually is not a concern since the person waiting is usually a business manager, not a customer standing at a checkout counter.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros of OLAP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Handles Massive Data&lt;/strong&gt; - It can easily process millions or billions of rows of historical data.&lt;br&gt;
&lt;strong&gt;• Does Not Disrupt the Business&lt;/strong&gt; - Because OLAP lives in a data warehouse, running a massive, heavy report will not slow down the cash registers running on the OLTP database.&lt;br&gt;
&lt;strong&gt;• High Performance for Reporting&lt;/strong&gt; - Optimized for complex analytical queries.&lt;br&gt;
&lt;strong&gt;• Strategic Insights&lt;/strong&gt; - Allows businesses to identify trends, patterns, and anomalies to drive decision-making.&lt;br&gt;
&lt;strong&gt;• User-Friendly&lt;/strong&gt; - The system is often integrated with Business Intelligence tools like PowerBI for visualization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons of OLAP
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;• Data is Not Real-Time&lt;/strong&gt; - OLAP systems are usually updated in batches, often overnight. If you look at an OLAP report at 2:00 PM, it usually only includes data up until the night before.&lt;br&gt;
&lt;strong&gt;• Slow to Update&lt;/strong&gt; - Adding new data to an OLAP system takes time because the data has to be heavily organized and formatted before it is saved.&lt;br&gt;
&lt;strong&gt;• Expensive and Complex&lt;/strong&gt; - Building and maintaining a data warehouse requires specialized engineers and large amounts of server storage.&lt;br&gt;
&lt;strong&gt;• Latency&lt;/strong&gt; - Queries can take seconds, minutes, or even hours because of the massive volume of data being scanned.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;A regional manager for a coffee shop chain wants to know, "Between hot chocolate and dark roast coffee, which sold better on rainy days last year?" To answer this, the system has to look at weather data, sales data from fifty stores, and a whole year of dates. An OLAP system can pull this specific report together without breaking a sweat.&lt;/p&gt;

&lt;h4&gt;
  
  
  Examples of OLAP in real life
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Netflix figuring out what genres of movies are most popular in different countries during the summer.&lt;/li&gt;
&lt;li&gt;A hospital analyzing patient records over ten years to see if a specific treatment is working.&lt;/li&gt;
&lt;li&gt;A retail store deciding how much inventory to buy for Black Friday based on the last three years of sales.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Common OLAP Operations
&lt;/h4&gt;

&lt;p&gt;OLAP systems organize massive amounts of data into multi-dimensional structures, often referred to as &lt;strong&gt;OLAP cubes&lt;/strong&gt;. These cubes allow users to view business metrics from any angle. To explore, analyze, and make sense of this complex data, OLAP systems support several powerful analytical operations.&lt;/p&gt;

&lt;p&gt;Here is a detailed look at the five core OLAP operations:&lt;br&gt;
&lt;strong&gt;1. Roll-Up (Consolidation)&lt;/strong&gt;&lt;br&gt;
Roll-up is also known as consolidation or aggregation and involves summarizing data to a higher, more generalized level. This operation reduces the detail of the data by climbing up a concept hierarchy or by removing a dimension entirely. It is primarily used by upper management to view macro-level business trends.&lt;br&gt;
It uses mathematical functions, such as summing, averaging, or counting, to group smaller data points into larger, overarching categories.&lt;br&gt;
&lt;strong&gt;Example (Time Hierarchy)&lt;/strong&gt; &lt;br&gt;
Daily sales → Monthly sales → Yearly sales.&lt;/p&gt;

&lt;p&gt;If a company has millions of records of individual daily transactions, viewing them all at once can be overwhelming. Using a roll-up operation, an executive can consolidate these daily records to see total sales by month, and then roll up again to see the total gross revenue for the entire year.&lt;br&gt;
&lt;strong&gt;Business Value&lt;/strong&gt; - Roll-up provides a big picture view of business performance, stripping away unnecessary granular details to highlight overarching trends.&lt;/p&gt;
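The time-hierarchy roll-up can be sketched in plain Python, climbing from invented daily sales figures to monthly and then yearly totals. Drill-down is simply the reverse direction: re-expanding the yearly figure into these finer-grained tables.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by ISO date.
daily_sales = {
    "2025-01-05": 100, "2025-01-20": 150,
    "2025-02-03": 200, "2025-02-14": 50,
}

def roll_up_to_month(daily):
    """Climb the time hierarchy one level: daily -> monthly totals."""
    monthly = defaultdict(int)
    for date, amount in daily.items():
        monthly[date[:7]] += amount  # group by "YYYY-MM"
    return dict(monthly)

def roll_up_to_year(monthly):
    """Climb one more level: monthly -> yearly totals."""
    yearly = defaultdict(int)
    for month, amount in monthly.items():
        yearly[month[:4]] += amount  # group by "YYYY"
    return dict(yearly)

monthly = roll_up_to_month(daily_sales)   # {'2025-01': 250, '2025-02': 250}
yearly = roll_up_to_year(monthly)         # {'2025': 500}
```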

&lt;p&gt;&lt;strong&gt;2. Drill-Down&lt;/strong&gt;&lt;br&gt;
Drill-down is the exact opposite of roll-up. It involves navigating from highly summarized, macro-level data down to highly detailed, micro-level data. This is done by stepping down a concept hierarchy or by adding a new dimension to the dataset.&lt;br&gt;
It breaks a larger aggregated number into the smaller components that make it up, allowing analysts to uncover the root causes behind a specific metric.&lt;br&gt;
&lt;strong&gt;Example (Geography &amp;amp; Time Hierarchy)&lt;/strong&gt;&lt;br&gt;
Yearly sales → Monthly sales → Daily sales (or Country → Region → Individual Store).&lt;/p&gt;

&lt;p&gt;Imagine an annual report shows that total yearly sales are significantly lower than expected. A manager can drill down from the yearly view to the monthly view and discover in which specific month sales plummeted. They can then drill down further into that month's daily sales to find which specific week caused the drop.&lt;br&gt;
&lt;strong&gt;Business Value&lt;/strong&gt; - It is essential for root-cause analysis, troubleshooting anomalies, and investigating sudden spikes or drops in performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Slice&lt;/strong&gt;&lt;br&gt;
The slice operation performs a selection on one specific dimension of the OLAP cube, resulting in a new, smaller slice of the data. &lt;br&gt;
Think of it like slicing a single piece of bread from a whole loaf. It locks one variable in place so you can analyze the rest of the data in a two-dimensional table.&lt;br&gt;
You isolate a single value within one dimension (e.g., Time, Geography, or Product) while keeping the other dimensions open.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt; &lt;br&gt;
Show sale records for Nairobi city only.&lt;/p&gt;

&lt;p&gt;If a data cube contains sales data across Products, Time, and Cities, applying a slice on the City dimension for Nairobi isolates that market. The resulting view will show the sales of all products over all time periods, but exclusively for the Nairobi location.&lt;br&gt;
&lt;strong&gt;Business Value&lt;/strong&gt; - It allows regional managers, department heads or specific product owners to filter out irrelevant data and focus entirely on the one area of the business they are responsible for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Dice&lt;/strong&gt;&lt;br&gt;
While a slice filters data based on a single condition, a dice operation isolates a highly specific sub-cube by applying multiple filters across two or more dimensions simultaneously. &lt;br&gt;
Think of it like cutting a smaller block out of a larger block of cheese.&lt;br&gt;
It selects specific ranges or values across multiple dimensions to create a highly targeted subset of the original data.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt; &lt;br&gt;
Show laptop sales in Nairobi and Mombasa during January and February.&lt;/p&gt;

&lt;p&gt;Here, the user is applying filters across three separate dimensions: the Product dimension (laptops only), the Geography dimension (Nairobi and Mombasa only), and the Time dimension (January and February only).&lt;br&gt;
&lt;strong&gt;Business Value&lt;/strong&gt; - Dicing is used for highly specialized, multi-faceted analysis. It allows data scientists and marketers to look at exact intersections of data, such as evaluating the success of a specific winter promotion for a specific tech product in key coastal cities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Pivot (Rotate)&lt;/strong&gt;&lt;br&gt;
Pivot, sometimes called rotation, does not filter or change the underlying data; instead, it changes the visual perspective. It rotates the data axes to provide an alternative presentation, making different relationships easier to spot.&lt;br&gt;
It rearranges the layout of the data, typically by swapping rows and columns, or by moving a dimension from the background into the foreground.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
Swapping Products and Time periods.&lt;/p&gt;

&lt;p&gt;A manager might be looking at a table where Products (Laptops, Phones, Tablets) are listed in the rows and Months (January, February, March) are the columns. By pivoting the data, they can make Months the rows and Products the columns.&lt;br&gt;
&lt;strong&gt;Business Value&lt;/strong&gt; - Different layouts highlight different trends. A pivot makes it easier to compare data side-by-side depending on what the analyst is trying to prove, ensuring the final report is as readable and impactful as possible.&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; OLAP is not mainly about recording what is happening right now. It is about understanding what has happened and what it means.&lt;/p&gt;
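Slice, dice, and pivot can all be mimicked on a tiny in-memory cube in plain Python; the records, products, and city names below are invented for illustration.

```python
# Hypothetical cube: one record per (product, city, month) cell.
cube = [
    {"product": "Laptop", "city": "Nairobi", "month": "Jan", "sales": 10},
    {"product": "Laptop", "city": "Mombasa", "month": "Feb", "sales": 7},
    {"product": "Phone",  "city": "Nairobi", "month": "Jan", "sales": 20},
    {"product": "Phone",  "city": "Kisumu",  "month": "Mar", "sales": 5},
]

# Slice: lock a single value on one dimension (City = Nairobi).
nairobi = [r for r in cube if r["city"] == "Nairobi"]

# Dice: filter several dimensions at once to carve out a sub-cube.
diced = [
    r for r in cube
    if r["product"] == "Laptop"
    and r["city"] in {"Nairobi", "Mombasa"}
    and r["month"] in {"Jan", "Feb"}
]

# Pivot: same data, rotated layout -- months become rows, products columns.
pivoted = {}
for r in cube:
    pivoted.setdefault(r["month"], {})[r["product"]] = r["sales"]
# pivoted["Jan"] == {'Laptop': 10, 'Phone': 20}
```

Note that the pivot step never touches the sales figures themselves; only the layout of the result changes.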

&lt;h3&gt;
  
  
  OLTP vs. OLAP
&lt;/h3&gt;

&lt;p&gt;The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) boils down to two distinct phases of business: &lt;strong&gt;execution&lt;/strong&gt; and &lt;strong&gt;strategy&lt;/strong&gt;. Simply put, OLTP runs the business, while OLAP analyzes the business.&lt;br&gt;
These two systems are designed for fundamentally different jobs. Understanding how they differ and how they work together comes down to understanding their relationship with time, purpose, and data architecture.&lt;/p&gt;

&lt;p&gt;Here is a detailed comparison of how the two systems operate.&lt;br&gt;
&lt;strong&gt;1. Main Purpose and System Goals&lt;/strong&gt;&lt;br&gt;
OLTP - Its primary objective is to handle daily business operations and execute transactions seamlessly. Its core focus is on accuracy, transaction safety, and ensuring the day-to-day business continues without interruption.&lt;br&gt;
OLAP - Its primary objective is to extract valuable insights from data to help leadership make smart, strategic decisions. Instead of facilitating transactions, it focuses on reporting, identifying long-term trends, and planning for the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The User Profiles&lt;/strong&gt;&lt;br&gt;
OLTP - These systems are used by everyday customers, cashiers, front-line staff, and mobile applications. These are the people actively interacting with the business in real time: buying items, logging into portals, or booking appointments.&lt;br&gt;
OLAP - These systems are utilized by business analysts, managers, and corporate executives. These users interact with data using dashboards, Business Intelligence reports and complex spreadsheets to evaluate business performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data State and Architectural Design&lt;/strong&gt;&lt;br&gt;
OLTP - Data is current, real-time, and highly operational. Since the data is constantly changing, the database is highly normalized to ensure efficiency and eliminate data redundancy. It is optimized to handle a constant stream of inserting, updating, and deleting small bits of data.&lt;br&gt;
OLAP - Data is historical, static, and rarely changes. It consists of summarized data spanning months or years. Because the goal is fast analysis rather than fast updates, the database is often denormalized, allowing the system to efficiently read millions of rows of data at once without altering them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Query Dynamics and Performance Needs&lt;/strong&gt;&lt;br&gt;
OLTP - Queries are short, simple, and require incredibly fast response times per transaction. They generally touch only a few records at a time.&lt;br&gt;
&lt;strong&gt;Example Queries&lt;/strong&gt; - "Update bread's price to $10," "What is John's email address?" or "Update a specific customer's order."&lt;br&gt;
OLAP - Queries are heavy, long, and highly complex. While speed is still important, the system is built to process massive analytical workloads rather than split-second individual actions.&lt;br&gt;
&lt;strong&gt;Example Queries&lt;/strong&gt; - "What is the average age of customers who bought bread in November of 2022?" or "Show the global sales trends broken down by region over the past 5 years."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Real-World Examples&lt;/strong&gt;&lt;br&gt;
OLTP Systems - ATMs, retail checkout registers, airline booking systems, and e-commerce shopping carts.&lt;br&gt;
OLAP Systems - Corporate data dashboards, annual financial reports, and Business Intelligence (BI) platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Synergy (How OLTP and OLAP Work Together)
&lt;/h3&gt;

&lt;p&gt;A successful business relies on a symbiotic relationship between both systems. You cannot accurately analyze a business if you do not have an OLTP system reliably recording the daily sales. Conversely, you cannot grow a business if you lack an OLAP system to look back at your history and determine what strategies are actually working.&lt;/p&gt;

&lt;p&gt;So, how do the two systems connect? &lt;br&gt;
They are linked through a pipeline process known as &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt;.&lt;br&gt;
Every day, the OLTP database handles the rapid work of serving customers and processing transactions. At the end of the day, usually at night when customer traffic and system strain are at their lowest, an automated batch script runs.&lt;br&gt;
Extract - The script pulls a copy of the day's newly generated operational data from the OLTP database.&lt;br&gt;
Transform - It cleans, formats, and aggregates that raw data to ensure it is properly structured for analysis.&lt;br&gt;
Load - Finally, the script deposits that formatted data into the OLAP data warehouse.&lt;br&gt;
By the time the business analysts and executives log into their dashboards the next morning, the OLAP warehouse is fully updated with yesterday's finalized numbers. The data is now perfectly prepped to be searched, graphed, and studied.&lt;/p&gt;
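The Extract, Transform, Load steps above can be sketched with Python's built-in sqlite3 module, using one in-memory database as a stand-in OLTP store and another as the warehouse; all table names and rows are invented for illustration.

```python
import sqlite3

# Stand-in OLTP database holding the day's raw transactions.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE sales (store TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Nairobi", 120.0), ("Nairobi", 80.0), ("Mombasa", 50.0)],
)

# Stand-in OLAP warehouse with a summarized schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE daily_store_sales (store TEXT, total REAL)")

# Extract: pull a copy of the day's operational data.
rows = oltp.execute("SELECT store, amount FROM sales").fetchall()

# Transform: aggregate the raw rows into per-store totals.
totals = {}
for store, amount in rows:
    totals[store] = totals.get(store, 0.0) + amount

# Load: deposit the formatted data into the warehouse.
warehouse.executemany(
    "INSERT INTO daily_store_sales VALUES (?, ?)", sorted(totals.items())
)
warehouse.commit()

loaded = warehouse.execute(
    "SELECT store, total FROM daily_store_sales ORDER BY store"
).fetchall()
# loaded == [('Mombasa', 50.0), ('Nairobi', 200.0)]
```

In production this script would be scheduled (for example, as a nightly batch job), but the three-phase shape stays the same.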

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The difference between OLTP and OLAP simply comes down to time. While OLTP handles the exact moment a transaction occurs, OLAP handles the months or years of historical data that those transactions leave behind. Together, they allow a business to operate today while intelligently planning for tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Every time you interact with a screen, you are leaving a digital footprint. Databases are the safe spaces that hold those footprints. OLTP ensures daily transactions are fast and secure. Data warehouses collect all those footprints over time. Finally, OLAP helps businesses look at the giant trail of footprints to figure out where they should step next.&lt;br&gt;
These tools might be invisible, but they are the engine running modern business, keeping our digital lives fast, organized, and constantly improving.&lt;/p&gt;

</description>
      <category>luxdev</category>
      <category>dataengineering</category>
      <category>olap</category>
    </item>
    <item>
      <title>From Tables to Tides: Navigating Databases, Warehouses, Marts, Lakes, and the Lakehouse Revolution</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Fri, 01 May 2026 17:13:47 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/the-data-behind-your-screen-a-simple-guide-to-databases-data-warehouses-oltp-and-olap-1bj</link>
      <guid>https://dev.to/lawrence_murithi/the-data-behind-your-screen-a-simple-guide-to-databases-data-warehouses-oltp-and-olap-1bj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Every time you buy a coffee with a card, "like" a post on social media, withdraw money from an ATM or buy a shirt online, you are interacting with a database. Behind the scenes of every app and website, data is constantly being created, moved, stored and read.&lt;br&gt;
However, not all data storage is the same. The way a system stores your checkout items at a grocery store is very different from the way that same grocery chain analyzes ten years of sales trends.&lt;br&gt;
To understand how modern software handles data, we need to look at the main types of storage: &lt;strong&gt;traditional databases&lt;/strong&gt;, &lt;strong&gt;data warehouses&lt;/strong&gt;, &lt;strong&gt;data marts&lt;/strong&gt;, &lt;strong&gt;data lakes&lt;/strong&gt; and &lt;strong&gt;lakehouses&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
If you are not a computer guru, these terms might sound very technical, but once you break them down, they make perfect sense. &lt;br&gt;
This article gives a simple but detailed breakdown of what these are, how they work, and why software relies on both.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Basics of Data Storage
&lt;/h3&gt;

&lt;p&gt;In today's data-driven world, organizations generate massive amounts of information. To effectively store, manage, and analyze this data, businesses use different architectural models based on their specific needs. &lt;br&gt;
Before we look at the specific processing types, it helps to understand the physical or virtual places where data lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Database (The Daily Worker/Operational Engine)&lt;/strong&gt;&lt;br&gt;
Think of how you keep track of your personal budget. You might use a spreadsheet. A spreadsheet is great for one person looking at a few hundred rows of information. Now imagine a company like Amazon trying to use a spreadsheet to track millions of orders happening every minute. The spreadsheet would freeze and crash instantly.&lt;br&gt;
A database is like a highly advanced, incredibly secure digital filing cabinet built to store massive amounts of information without crashing. Databases are primarily designed for &lt;strong&gt;OLTP (Online Transactional Processing)&lt;/strong&gt;. They are the workhorses that power day-to-day operations, such as processing bank transactions, managing inventory, or storing user profiles. Its main job is to quickly record new information, update existing information, and allow users to quickly look up specific details. More importantly, it is organized so that users can find exactly what they are looking for in a fraction of a second.&lt;br&gt;
Information in a standard database is usually organized into tables with rows and columns. For example, an online store might have one table for Customers, one for Products, and one for Orders. The database connects these tables so the system knows exactly which customer bought which product. Think of a traditional database like a busy cash register. It needs to be fast, accurate, and handle hundreds of transactions at once without freezing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Characteristics of a Database
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ACID Compliance&lt;/strong&gt; – Traditional relational databases follow strict rules (Atomicity, Consistency, Isolation, Durability) to ensure that transactions are processed reliably and that data remains accurate even in the event of a system crash.&lt;br&gt;
&lt;strong&gt;Normalized Structure&lt;/strong&gt; – Data is organized into tables to reduce redundancy. For example, a customer’s address is stored in one place rather than being repeated for every order they place.&lt;br&gt;
&lt;strong&gt;Real-Time Interaction&lt;/strong&gt; – Databases are designed to handle thousands of concurrent users making small, rapid changes to the data simultaneously.&lt;/p&gt;

&lt;h4&gt;
  
  
  Types of Databases
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Relational (SQL)&lt;/strong&gt; - Uses tables with rows and columns (e.g., MySQL, PostgreSQL, Oracle). Ideal for structured data where relationships are clearly defined.&lt;br&gt;
&lt;strong&gt;Non-Relational (NoSQL)&lt;/strong&gt; - Uses flexible structures like documents or graphs (e.g., MongoDB, Cassandra). Ideal for rapidly changing data types and massive scaling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Data Warehouse (The Long-Term Archive/Analytical Hub)&lt;/strong&gt;&lt;br&gt;
As a business runs, over time, its database fills up with millions of past transactions. After a few years, a company manager might want to know, "Which of our stores sold the most winter coats in December over the last five years?"&lt;br&gt;
To answer that question, the database has to dig through millions of old records, which slows it down. The system can even freeze, so people trying to buy things on the website at that moment cannot check out.&lt;br&gt;
Using the grocery store analogy, a store manager walking up to a cashier who has a long line of customers and asking them to calculate the store's total profit for the last decade would cause a crisis and bring the whole store to a halt. To fix this, companies build &lt;strong&gt;Data Warehouses&lt;/strong&gt;.&lt;br&gt;
A data warehouse is a massive storage system designed to hold historical data from many different sources. It aggregates data from operational databases, CRM systems and flat files to provide a comprehensive, historical view of the entire organization. Periodically, usually overnight, the company copies all the new data from these sources into the data warehouse.&lt;br&gt;
From the previous example, if the database is the cash register, the data warehouse is the company's central filing room. A data warehouse takes the daily receipts from all the different cash registers, organizes them and stores them for years. &lt;br&gt;
The data warehouse acts as the company's long-term memory. It doesn't handle everyday customer actions. Instead, it is a quiet, organized space where business analysts can run massive queries and reports without interrupting the live website. &lt;br&gt;
Data warehouses utilize &lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt;. Instead of focusing on individual transactions, they are optimized to scan millions of rows to find trends, averages and insights.&lt;/p&gt;

&lt;h4&gt;
  
  
  The ETL Process (The Warehouse Engine)
&lt;/h4&gt;

&lt;p&gt;Before data enters a warehouse, it must undergo &lt;strong&gt;ETL&lt;/strong&gt; (Extract, Transform, Load).&lt;br&gt;
Extract - Pulling data from multiple, often messy, source systems.&lt;br&gt;
Transform - Cleaning, deduplicating, and formatting the data into a standardized structure.&lt;br&gt;
Load - Moving the clean data into the warehouse.&lt;br&gt;
This is known as &lt;strong&gt;Schema-on-Write&lt;/strong&gt;, meaning the structure of the data must be defined and validated before it can be stored.&lt;/p&gt;
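&lt;p&gt;The three steps above can be sketched in a few lines of self-contained Python (the field names and source records are invented for illustration): extract pulls rows from several messy sources, transform cleans and deduplicates them, and load writes the standardized rows into a target standing in for the warehouse:&lt;/p&gt;

```python
# A minimal ETL sketch: extract messy records, transform them into a
# standard shape, then load them into a target list (our toy "warehouse").

raw_sources = [
    [{"name": " alice ", "spend": "100"}, {"name": "BOB", "spend": "250"}],
    [{"name": "alice", "spend": "100"}],  # duplicate from a second system
]

def extract(sources):
    # Pull rows from every source system into one list.
    return [row for source in sources for row in source]

def transform(rows):
    # Clean: normalize names, cast spend to a number, drop duplicates.
    seen, clean = set(), []
    for row in rows:
        record = (row["name"].strip().lower(), float(row["spend"]))
        if record not in seen:
            seen.add(record)
            clean.append({"name": record[0], "spend": record[1]})
    return clean

warehouse = []

def load(rows, target):
    # Schema-on-Write: only validated, standardized rows are stored.
    target.extend(rows)

load(transform(extract(raw_sources)), warehouse)
print(warehouse)
```

&lt;p&gt;After the run, the warehouse holds exactly one clean row per customer, which is the whole point of transforming before loading.&lt;/p&gt;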

&lt;h4&gt;
  
  
  Key Benefits of a Data Warehouse
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Data Integration&lt;/strong&gt; – It breaks down data silos by combining information from marketing, sales, and finance into one single source of truth.&lt;br&gt;
&lt;strong&gt;Historical Context&lt;/strong&gt; – While databases often only show current data, warehouses store years of historical records, allowing for year-over-year comparisons.&lt;br&gt;
&lt;strong&gt;Optimized for Performance&lt;/strong&gt; – Warehouses often use columnar storage, which allows them to perform complex calculations such as "What was the total revenue for 2023?" significantly faster than a standard database.&lt;br&gt;
&lt;strong&gt;High Quality &amp;amp; Accuracy&lt;/strong&gt; – Because data is cleaned during the ETL process, business leaders can trust that the reports they generate are based on accurate, non-conflicting information.&lt;br&gt;
&lt;strong&gt;Why use a Data Warehouse?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; A data warehouse is the foundation for Business Intelligence. It allows executives to run complex "What if?" scenarios and generate reports that inform long-term strategy. It also ensures that the operational databases are not slowed down by heavy analytical queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Marts (The Departmental Lens)&lt;/strong&gt;&lt;br&gt;
A data mart is a highly focused, specialized subset of a data warehouse designed to serve the specific needs of a single department or business unit.&lt;br&gt;
While a traditional Data Warehouse acts as a massive, centralized repository containing all of an organization's structured data, a data mart isolates only the information relevant to a specific team.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Benefits of a Data Mart
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Performance&lt;/strong&gt; - Because the data mart is smaller and queries are highly specific, reports and dashboards load much faster.&lt;br&gt;
&lt;strong&gt;Improved Security&lt;/strong&gt; - By isolating data, companies can strictly control who has access to sensitive departmental information.&lt;br&gt;
&lt;strong&gt;Ease of Use&lt;/strong&gt; - Business users and analysts do not have to sift through irrelevant enterprise data to find what they need.&lt;br&gt;
Data marts can be &lt;strong&gt;Dependent&lt;/strong&gt; (built by drawing data from an existing enterprise data warehouse) or &lt;strong&gt;Independent&lt;/strong&gt; (built directly from operational systems).&lt;/p&gt;
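&lt;p&gt;A dependent data mart can be pictured as nothing more than a focused slice of the warehouse. This toy Python sketch (the rows are invented) draws only one department's records out of a central collection:&lt;/p&gt;

```python
# A dependent data mart sketch: the mart is a filtered subset of the
# warehouse, so departmental queries never scan irrelevant rows.

warehouse = [
    {"dept": "sales",   "metric": "revenue", "value": 1200},
    {"dept": "finance", "metric": "payroll", "value": 800},
    {"dept": "sales",   "metric": "units",   "value": 45},
]

def build_mart(rows, department):
    # Draw only the rows one business unit needs from the central warehouse.
    return [row for row in rows if row["dept"] == department]

sales_mart = build_mart(warehouse, "sales")
print(len(sales_mart))  # 2
```

&lt;p&gt;The sales team now queries two rows instead of the whole warehouse, which is why data mart dashboards load faster.&lt;/p&gt;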

&lt;p&gt;&lt;strong&gt;4. Data Lakes (The Raw Data Reservoir)&lt;/strong&gt;&lt;br&gt;
A data lake is a massive, highly scalable storage system designed to hold vast amounts of raw, unprocessed data in its native format.&lt;br&gt;
Unlike a data warehouse, which requires data to be cleaned, transformed, and structured into strict tables before it can be stored (Schema-on-Write), a data lake stores data exactly as it is generated, assigning structure only when the data is eventually read or queried (Schema-on-Read).&lt;/p&gt;

&lt;h5&gt;
  
  
What do Data Lakes store?
&lt;/h5&gt;

&lt;p&gt;&lt;em&gt;Structured Data&lt;/em&gt; - Traditional tables and relational databases.&lt;br&gt;
&lt;em&gt;Semi-Structured Data&lt;/em&gt; - JSON files, XML, CSVs, and server logs.&lt;br&gt;
&lt;em&gt;Unstructured Data&lt;/em&gt; - Emails, documents, PDFs.&lt;br&gt;
&lt;em&gt;Binary/Media Data&lt;/em&gt; - Images, audio files, and videos.&lt;br&gt;
&lt;em&gt;Streaming Data&lt;/em&gt; - Real-time IoT sensor data and website clickstreams.&lt;/p&gt;
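&lt;p&gt;The Schema-on-Read idea can be sketched in a few lines of Python (the records below are invented): the lake keeps every record exactly as it arrived, and a schema is imposed only at read time, skipping whatever does not fit it:&lt;/p&gt;

```python
import json

# Schema-on-Read sketch: the lake stores events exactly as they arrived;
# structure is applied only when someone reads the data.

lake = [
    '{"user": "a1", "page": "/home", "ms": 120}',  # clickstream JSON
    '{"user": "b2", "page": "/cart"}',             # missing a field
    "not-json sensor noise",                       # raw, unparseable
]

def read_with_schema(raw_records):
    # Impose a schema at query time, skipping records that do not fit.
    rows = []
    for raw in raw_records:
        try:
            event = json.loads(raw)
            rows.append({"user": event["user"], "page": event.get("page")})
        except (json.JSONDecodeError, KeyError):
            continue  # the raw record stays in the lake untouched
    return rows

print(read_with_schema(lake))
```

&lt;p&gt;Nothing was cleaned on the way in; the "sensor noise" record is still sitting in the lake for some future reader with a different schema.&lt;/p&gt;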

&lt;h5&gt;
  
  
  Why use a Data Lake?
&lt;/h5&gt;

&lt;p&gt;A data lake is ideal when an organization wants to capture and retain everything, even data they don't immediately need. It is highly cost-effective because it utilizes cheap cloud storage. Furthermore, having raw, unmanipulated data is essential for training artificial intelligence (AI) and complex Machine Learning (ML) models.&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; Without proper organization and governance, a data lake can become a messy, unsearchable &lt;strong&gt;Data Swamp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Data Lakehouse (The Modern Hybrid)&lt;/strong&gt;&lt;br&gt;
For years, companies had to maintain a two-tier architecture: a Data Lake for raw data and machine learning, and a separate Data Warehouse for clean data and business reporting. This resulted in expensive storage costs, data duplication, and complex maintenance.&lt;br&gt;
A Data Lakehouse is a modern architectural design that merges the best concepts of both systems. It is built directly on top of cheap data lake storage, but it applies the organizational structures, management tools, and performance speeds of a data warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features of a Lakehouse
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Flexibility &amp;amp; Scale&lt;/strong&gt; - Like a data lake, it can store massive amounts of structured, semi-structured, and unstructured data.&lt;br&gt;
&lt;strong&gt;Reliability &amp;amp; Structure&lt;/strong&gt; - Like a data warehouse, it supports ACID transactions (meaning data is reliable, updates don't break the system, and multiple people can read/write simultaneously).&lt;br&gt;
&lt;strong&gt;Single Source of Truth&lt;/strong&gt; - Teams no longer have to copy data from the lake to the warehouse. Business analysts can build BI dashboards, and data scientists can run machine learning models directly on the exact same data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of the Storage systems
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a89psyc3zhg2niiz8xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a89psyc3zhg2niiz8xm.png" alt="storage" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b995bwwvxr923luvsn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b995bwwvxr923luvsn4.png" alt="Data storage" width="800" height="989"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;In today's modern economy, data is a company’s most valuable asset. However, data only provides value if it can be accessed, analyzed, and trusted. By understanding the distinctions between these storage methods, organizations can build a robust infrastructure that avoids the Data Swamp, reduces operational costs, and ultimately turns raw information into a competitive advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right data storage architecture is no longer about finding a one-size-fits-all solution but about matching the right tool to the specific needs of the business. As organizations evolve from simple record-keeping to complex artificial intelligence and real-time analytics, their data strategy must also mature.&lt;br&gt;
For Day-to-Day Operations, the Database remains the essential engine, ensuring that transactions are processed accurately and instantly.&lt;br&gt;
For Strategic Reporting, the Data Warehouse and its specialized Data Marts provide the single source of truth needed for executive decision-making and departmental efficiency.&lt;br&gt;
For Big Data &amp;amp; Innovation, the Data Lake serves as the vital reservoir for raw information, fueling the next generation of Machine Learning and AI development.&lt;br&gt;
For the Future of Scalability, the Data Lakehouse represents the ultimate convergence: the speed of a warehouse with the massive flexibility of a lake.&lt;/p&gt;

</description>
      <category>datawarehousing</category>
      <category>luxdev</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Apache Airflow for Beginners: DAGs, Tasks, Operators, and Scheduling Explained</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Wed, 29 Apr 2026 20:24:12 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/apache-airflow-for-beginners-dags-tasks-operators-and-scheduling-explained-p2d</link>
      <guid>https://dev.to/lawrence_murithi/apache-airflow-for-beginners-dags-tasks-operators-and-scheduling-explained-p2d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Being a beginner in data engineering can seem very scary. People use technical words like ETL, pipelines, data warehouses, architecture, orchestration, and it is easy to feel like you need a computer science degree just to understand what they mean. However, most of these terms only sound technical; they are not as complicated as they seem. &lt;br&gt;
Data engineering, in simple terms, involves extracting data from sources such as websites, social media pages, Excel/CSV files, or payment systems; cleaning it; and storing it somewhere (a database, data warehouse, or data lake). If you need this done once, you can run a simple Python script. However, if the job must run every hour, every day, or every week, you need a tool that can manage it for you. That's where Apache Airflow comes in.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Apache Airflow?
&lt;/h3&gt;

&lt;p&gt;To understand Apache Airflow, think about a process like baking a cake. You do not just throw everything into the oven. You follow steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buy the ingredients&lt;/li&gt;
&lt;li&gt;Prepare the dough&lt;/li&gt;
&lt;li&gt;Put the dough in the oven&lt;/li&gt;
&lt;li&gt;Bake the cake&lt;/li&gt;
&lt;li&gt;Let it cool&lt;/li&gt;
&lt;li&gt;Add frosting&lt;/li&gt;
&lt;li&gt;Serve the cake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some steps must happen before others. You cannot frost the cake before baking it. You cannot bake the cake before preparing the dough. You also need to know how long each step should take and what to do if something goes wrong. &lt;br&gt;
This kind of process is called a &lt;strong&gt;workflow&lt;/strong&gt; or &lt;strong&gt;pipeline&lt;/strong&gt; and Airflow helps you manage that workflow.&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; Airflow does not usually do the heavy data processing itself but tells other tools when to do the work.&lt;br&gt;
A workflow may be a data pipeline, a machine learning pipeline, a reporting process, or any process made up of several steps.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract_data &amp;gt;&amp;gt; clean_data &amp;gt;&amp;gt; load_data &amp;gt;&amp;gt; send_email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache Airflow is an open-source platform used to &lt;strong&gt;schedule&lt;/strong&gt;, &lt;strong&gt;monitor&lt;/strong&gt; and &lt;strong&gt;manage&lt;/strong&gt; workflows. It was originally created by Maxime Beauchemin at Airbnb in 2014 to manage increasingly complex data workflows. It helps you decide what task should run first, what should follow, what should happen if something fails, and when the whole process should run again.&lt;/p&gt;

&lt;h4&gt;
  
  
  Airflow as an Orchestrator
&lt;/h4&gt;

&lt;p&gt;Orchestration refers to arranging many tasks so they run in the right order and at the scheduled time. It makes sure that task B does not run before task A has finished. It also records whether each task succeeded or failed. Without orchestration, you end up with many scripts run manually or through separate cron jobs, which becomes difficult to manage as your project grows.&lt;/p&gt;
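&lt;p&gt;The heart of orchestration, running each task only after its upstream tasks have finished, can be sketched in plain Python. This is a toy scheduler using the task names from the earlier example, not how Airflow is implemented internally:&lt;/p&gt;

```python
# What an orchestrator does, in miniature: given task dependencies,
# run every task only after its upstream tasks have finished.

deps = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "send_email": ["load"],
}

def run_in_order(dependencies):
    # Assumes the graph is acyclic (a DAG); a loop would never terminate.
    done, order = set(), []
    while len(done) != len(dependencies):
        for task, upstream in dependencies.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)  # task B never runs before task A
                done.add(task)
    return order

print(run_in_order(deps))  # ['extract', 'transform', 'load', 'send_email']
```

&lt;p&gt;Airflow adds everything this sketch lacks: scheduling, retries, logging, and a record of every run.&lt;/p&gt;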

&lt;h4&gt;
  
  
  Why Airflow?
&lt;/h4&gt;

&lt;p&gt;While a normal Python script could run fine with simple tasks, you need more control as the number of tasks increases. Airflow is useful because data jobs often have many moving parts.&lt;/p&gt;

&lt;p&gt;Airflow is useful because of various reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Scheduling&lt;/strong&gt;&lt;br&gt;
Since most data work is repetitive, scheduling lets workflows run automatically at the times you define. Airflow handles complex timezone logic natively, ensuring global data pipelines run exactly when they should.&lt;br&gt;
Airflow can also automatically run a pipeline for historical dates through a process called &lt;strong&gt;backfilling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Task Orchestration&lt;/strong&gt;&lt;br&gt;
Tasks are arranged depending on which task runs first, second, and last.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract &amp;gt;&amp;gt; transform &amp;gt;&amp;gt; load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This order is critical because if the load task runs before the transform task, the database may receive dirty data. If the transform task runs before extract, there will be no data to clean.&lt;br&gt;
Airflow has parallel execution capabilities to run several tasks simultaneously and wait for all of them to finish before moving to the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Monitoring&lt;/strong&gt;&lt;br&gt;
Monitoring standard scripts to know whether a job ran successfully requires SSH-ing into a server and digging through log files. However, Airflow provides a centralized web interface for the entire data ecosystem where you can monitor:&lt;br&gt;
&lt;strong&gt;Task Statuses&lt;/strong&gt; - Color-coded views showing what is running, successful, failed, or queued.&lt;br&gt;
&lt;strong&gt;Gantt Charts&lt;/strong&gt; - Visual representations of task duration, helping you identify bottlenecks in your pipeline.&lt;br&gt;
&lt;strong&gt;Historical Trends&lt;/strong&gt; - View the history of a specific pipeline over time to spot intermittent failures or degrading performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Automated Retries&lt;/strong&gt;&lt;br&gt;
In the real world, tasks can fail for temporary reasons. An API may be rate-limited, a database might briefly drop a connection, or a network hiccup might occur.&lt;br&gt;
Instead of waking up at 3:00 AM to manually restart a failed script, Airflow handles transient errors gracefully by trying the task again based on the number of retries set.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this scenario, if the task fails, Airflow will wait for 5 minutes before trying again, up to three times.&lt;/p&gt;
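&lt;p&gt;Conceptually, that retry behaviour boils down to a loop like the following plain-Python sketch. The flaky_api function and the tiny delay are invented for illustration; in Airflow you configure retries rather than writing this loop yourself:&lt;/p&gt;

```python
import time

# What a retry setting does, sketched as a plain loop: try the task,
# wait retry_delay between attempts, give up after `retries` attempts.

def run_with_retries(task, retries=3, retry_delay=0.01):
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise            # out of retries: the task is marked failed
            time.sleep(retry_delay)  # e.g. timedelta(minutes=5) in Airflow

calls = {"n": 0}

def flaky_api():
    # Fails twice (a transient outage), then succeeds on the third try.
    calls["n"] += 1
    if calls["n"] != 3:
        raise ConnectionError("rate limited")
    return "data"

result = run_with_retries(flaky_api)
print(result)  # data
```

&lt;p&gt;The transient failure never wakes anyone up: the third attempt succeeds and the pipeline carries on.&lt;/p&gt;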

&lt;p&gt;&lt;strong&gt;5. Accessible Logs&lt;/strong&gt;&lt;br&gt;
Finding out why and when a pipeline breaks is very critical. Airflow attaches isolated logs to every single task execution eliminating the need to hunt through an entire server log file.&lt;br&gt;
A user is also able to click on a failed task directly in the web UI and instantly read the error message for that specific run, reducing debugging time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Failure Handling&lt;/strong&gt;&lt;br&gt;
When a task fails, letting the rest of the script run can result in corrupt data or crashed databases. Airflow therefore stops execution of the downstream tasks, preventing bad data from moving through the pipeline.&lt;br&gt;
Airflow can also be configured to send an automated email, slack message, or an alert when a pipeline fails, ensuring the team is instantly aware of critical data outages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Clear Pipeline Structure&lt;/strong&gt;&lt;br&gt;
Airflow workflows are written entirely in Python, so the pipeline configuration is treated like any other software project. Workflows are also visible: anyone can see how tasks connect to each other, and a new person joining the team can open the Airflow UI and understand the pipeline flow.&lt;br&gt;
Workflows can be committed to Git, peer-reviewed, and rolled back if a mistake is made.&lt;/p&gt;
&lt;h3&gt;
  
  
  Important Airflow Terms
&lt;/h3&gt;

&lt;p&gt;Before writing any Airflow code, it's important to understand the main terms used in the Airflow world because they describe the parts of a workflow system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DAG&lt;/strong&gt;&lt;br&gt;
In Airflow, a full workflow is called a DAG(Directed Acyclic Graph).&lt;br&gt;
&lt;strong&gt;Directed&lt;/strong&gt; - the workflow moves in one direction. The process has a starting point and an ending point and does not move backward.&lt;br&gt;
&lt;strong&gt;Acyclic&lt;/strong&gt; - there are no loops. A workflow must have a clear start and a clear end; loops are not allowed because they create endless cycles and the pipeline might never finish running.&lt;br&gt;
&lt;strong&gt;Graph&lt;/strong&gt; - a structure made up of points and connections. The points are &lt;strong&gt;tasks&lt;/strong&gt; and the connections are &lt;strong&gt;dependencies&lt;/strong&gt;.&lt;br&gt;
A DAG is, therefore, a workflow made up of tasks arranged in a clear order and indicating how they connect with each other.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stock_etl_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Task&lt;/strong&gt;&lt;br&gt;
A task is one step inside a DAG, or one job inside a pipeline. A task should usually do one clear job. Creating one huge task that does everything makes debugging hard, so work should be split into separate tasks.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetch_stock_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fetch_stock&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;fetch is the task object, and fetch_stock_data is the task name shown in Airflow. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Operator&lt;/strong&gt;&lt;br&gt;
An operator is the tool used to create and run a task. Different operators are used for different types of jobs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2kb77ahnutf277pln9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2kb77ahnutf277pln9u.png" alt="Operators" width="466" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Dependency&lt;/strong&gt;&lt;br&gt;
A dependency defines the order of tasks by telling Airflow which task must run before another task. In simple terms, a dependency is the relationship between tasks.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract &amp;gt;&amp;gt; transform &amp;gt;&amp;gt; load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means extract runs first, transform runs after extract succeeds and load runs after transform succeeds.&lt;br&gt;
You can also define parallel dependencies to show which tasks should run simultaneously.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;download &amp;gt;&amp;gt; [clean_data, backup_data] &amp;gt;&amp;gt; send_email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means download runs first, clean_data and backup_data run after download then send_email executes after both clean_data and backup_data finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Scheduler&lt;/strong&gt;&lt;br&gt;
The scheduler is the brain of Airflow: it checks the DAGs and decides which tasks should run and when.&lt;br&gt;
If the scheduler is not running, DAGs may appear in the UI but tasks may stay queued or show no status.&lt;br&gt;
The scheduler constantly checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which DAGs exist&lt;/li&gt;
&lt;li&gt;whether a DAG is due to run&lt;/li&gt;
&lt;li&gt;whether a task’s upstream tasks have succeeded&lt;/li&gt;
&lt;li&gt;whether a task should be queued&lt;/li&gt;
&lt;li&gt;whether a failed task should retry&lt;/li&gt;
&lt;li&gt;whether a DAG run is complete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler does not usually execute the task itself but decides which task is ready and sends it to the executor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Executor&lt;/strong&gt;&lt;br&gt;
The executor is the part of Airflow that decides how tasks are actually run. Different Airflow setups use different executors.&lt;br&gt;
Common executors include:&lt;br&gt;
&lt;strong&gt;SequentialExecutor&lt;/strong&gt; - This runs one task at a time and thus cannot run many tasks in parallel. It is simple and often used for learning or testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LocalExecutor&lt;/strong&gt; - This runs tasks locally on the same machine, and it can run more than one task at the same time. It's useful when Airflow is installed on one server and you want tasks to run on that server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CeleryExecutor&lt;/strong&gt; - This is used for larger setups. The scheduler sends tasks to a queue, and workers pick them up and run them. This setup usually needs a message broker such as Redis or RabbitMQ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KubernetesExecutor&lt;/strong&gt; - This runs each task in a separate Kubernetes pod. It's more advanced and usually used in cloud or production environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; The scheduler decides that a task should run, while the executor handles the running method (how Airflow runs tasks).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Worker&lt;/strong&gt;&lt;br&gt;
A worker is the process that actually executes tasks. This term is particularly important when using CeleryExecutor.&lt;br&gt;
In a Celery setup, the flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scheduler &amp;gt;&amp;gt; Queue &amp;gt;&amp;gt; Worker &amp;gt;&amp;gt; Task runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler decides the task is ready, the executor sends the task to a queue and the worker picks it up and runs it.&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; The scheduler decides what should run while the worker does the actual execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. XCom&lt;/strong&gt;&lt;br&gt;
XCom stands for cross-communication. It allows tasks to pass small pieces of data to each other.&lt;br&gt;
XCom is meant for passing small messages between tasks, not for moving large datasets. Passing large datasets through XCom slows down Airflow and fills up the metadata database.&lt;/p&gt;

&lt;p&gt;In PythonOperator, you can push data to XCom:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then another task can pull it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For large data, save it somewhere else and pass only its location through XCom.&lt;br&gt;
Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract task saves data to /tmp/raw_stock_data.csv
XCom passes "/tmp/raw_stock_data.csv"
transform task reads the file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
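&lt;p&gt;Outside of Airflow, the same pattern can be simulated with a plain dictionary standing in for the XCom store (the file name and columns are invented; real tasks would use the ti object as shown above):&lt;/p&gt;

```python
import csv
import os
import tempfile

# The XCom pattern sketched without Airflow: the extract step writes the
# heavy data to a file and "pushes" only the small file path; the
# transform step "pulls" the path and reads the file itself.

xcom = {}  # stands in for Airflow's metadata database

def extract():
    path = os.path.join(tempfile.gettempdir(), "raw_stock_data.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([["symbol", "price"], ["AAPL", "210.5"]])
    xcom["raw_path"] = path  # push: a small string, not the dataset

def transform():
    path = xcom["raw_path"]  # pull: fetch the location, then read the file
    with open(path, newline="") as f:
        return list(csv.reader(f))[1:]  # skip the header row

extract()
print(transform())  # [['AAPL', '210.5']]
```

&lt;p&gt;Only the short path string travels through the metadata store; the dataset itself stays on disk (or in object storage, in a real pipeline).&lt;/p&gt;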



&lt;p&gt;&lt;strong&gt;9. Metadata Database&lt;/strong&gt;&lt;br&gt;
The metadata database is Airflow’s internal database; Airflow uses it to remember what happened (it records the results of every DAG run).&lt;br&gt;
It stores information such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;DAG runs&lt;/li&gt;
&lt;li&gt;Task runs&lt;/li&gt;
&lt;li&gt;Task states&lt;/li&gt;
&lt;li&gt;Schedules&lt;/li&gt;
&lt;li&gt;Retries&lt;/li&gt;
&lt;li&gt;Users&lt;/li&gt;
&lt;li&gt;Roles&lt;/li&gt;
&lt;li&gt;Variables&lt;/li&gt;
&lt;li&gt;Connections&lt;/li&gt;
&lt;li&gt;XCom values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This database is very important because Airflow needs memory.&lt;br&gt;
For example, Airflow needs to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did this task succeed?&lt;/li&gt;
&lt;li&gt;Did this task fail?&lt;/li&gt;
&lt;li&gt;How many times has it retried?&lt;/li&gt;
&lt;li&gt;When did the DAG last run?&lt;/li&gt;
&lt;li&gt;What logs belong to this task?&lt;/li&gt;
&lt;li&gt;What DAGs exist?&lt;/li&gt;
&lt;li&gt;Which users can log in?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Airflow may look difficult when you first encounter it because it comes with a lot of technical jargon, making data engineering feel more complicated than it really is. Airflow is simply a workflow manager which helps you organise work that must happen in a specific order. Apache Airflow is about control: it helps you control timing, order, failure, retries, logs, and monitoring.&lt;br&gt;
Just like baking a cake, you must follow the right sequence. A data pipeline works the same way. You extract data, transform it, load it, check it, and sometimes send a notification. Each step depends on the previous one. Airflow gives you a clean way to define these steps and make sure they run correctly.&lt;/p&gt;

</description>
      <category>luxdev</category>
      <category>dataengineering</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Sat, 11 Apr 2026 17:51:32 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/etl-vs-elt-which-one-should-you-use-and-why-4and</link>
      <guid>https://dev.to/lawrence_murithi/etl-vs-elt-which-one-should-you-use-and-why-4and</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine you are running a massive kitchen. Every day, trucks arrive carrying raw ingredients from different farms. Some boxes have dirty potatoes, some tomatoes are bruised, and the meat needs to be separated from the bone.&lt;br&gt;
Can you just throw all of this straight onto a customer’s plate? Definitely not. You have to wash, chop, season, and cook the ingredients first.&lt;/p&gt;

&lt;p&gt;In the business world, data works the same way. Every day, companies generate tons of raw data from apps, websites, payment gateways, customer service logs, etc. This raw data is usually dirty and messy: it has errors, missing fields, and mismatched formats. Before it can be used for reporting or decision-making, it needs to be moved, processed and organized. This process of moving and cleaning data is called &lt;strong&gt;data integration&lt;/strong&gt;. &lt;br&gt;
The two main approaches used in data integration are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). Although both methods aim to prepare data for analysis, they follow different steps and are suited to different situations. &lt;br&gt;
If you are just stepping into data engineering, software engineering or backend development, ETL and ELT are common terms you will encounter. &lt;br&gt;
This article explains both approaches in detail, compares them, and helps you understand when to use each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ETL?
&lt;/h2&gt;

&lt;p&gt;ETL stands for Extract, Transform, Load. It is the traditional method used to move and prepare data.&lt;br&gt;
The key idea in ETL is that data is cleaned and transformed before it is stored in the final system. This means that by the time the data reaches the data warehouse, it is already structured, organized, and ready for use.&lt;br&gt;
This approach was developed at a time when computing resources were limited, and companies had to be very careful about what data they stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps in ETL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Extract&lt;/strong&gt;&lt;br&gt;
This step involves collecting raw data from different sources such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Excel files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real-world scenarios, data rarely comes from a single source. A company may have customer data in one system, sales data in another, and marketing data in a third system. This extraction step pulls all this data together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Transform&lt;/strong&gt;&lt;br&gt;
In this stage, data is processed in a separate system before being stored. This transformation step ensures that all data is consistent, accurate, and usable.&lt;/p&gt;

&lt;p&gt;Common transformations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardizing data formats&lt;/li&gt;
&lt;li&gt;Handling missing values&lt;/li&gt;
&lt;li&gt;Removing duplicate records&lt;/li&gt;
&lt;li&gt;Fixing errors in data&lt;/li&gt;
&lt;li&gt;Masking sensitive data such as credit card numbers&lt;/li&gt;
&lt;li&gt;Combining data from different sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step is where raw data is made meaningful. Without transformation, data would remain inconsistent and difficult to analyze.&lt;/p&gt;
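&lt;p&gt;As a rough sketch (not tied to any particular ETL tool), a few of these transformations can be illustrated in plain Python; the record fields and helper name below are invented for illustration:&lt;/p&gt;

```python
# Hypothetical raw records pulled from a source system.
raw_records = [
    {"id": 1, "name": "alice", "card": "4111111111111111", "country": "KE"},
    {"id": 1, "name": "alice", "card": "4111111111111111", "country": "KE"},  # duplicate
    {"id": 2, "name": "Bob", "card": "5500005555555559", "country": None},    # missing value
]

def transform(records):
    seen, cleaned = set(), []
    for r in records:
        if r["id"] in seen:  # remove duplicate records
            continue
        seen.add(r["id"])
        cleaned.append({
            "id": r["id"],
            "name": r["name"].title(),             # standardize formats
            "card": "****" + r["card"][-4:],       # mask sensitive data
            "country": r["country"] or "UNKNOWN",  # handle missing values
        })
    return cleaned

clean = transform(raw_records)
```

&lt;p&gt;In a real ETL pipeline these rules would run on a dedicated processing server before anything is loaded into the warehouse.&lt;/p&gt;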

&lt;p&gt;&lt;strong&gt;3. Load&lt;/strong&gt;&lt;br&gt;
After transformation, the cleaned data is loaded into a data warehouse or database.&lt;br&gt;
At this stage, the data is ready for carrying out analysis, creating dashboards and reporting. &lt;br&gt;
&lt;strong&gt;Simple Diagram of ETL&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3blgiapqpgd16xihdio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3blgiapqpgd16xihdio.png" alt="ETL" width="341" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why ETL Was Popular
&lt;/h3&gt;

&lt;p&gt;In the past, data warehouses were physical servers sitting in basements. Storage space was incredibly expensive and computing power was very limited. Companies, therefore, could not afford to store raw, useless data. They had to clean it up and shrink it down before loading it into the warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ELT?
&lt;/h2&gt;

&lt;p&gt;ELT stands for Extract, Load, Transform. It is a modern approach made possible by cloud computing. Here, data is loaded first and transformed later inside the destination system, typically a cloud data warehouse or data lake.&lt;br&gt;
This approach takes advantage of modern systems that can store large amounts of data and process it quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps in ELT
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Extract&lt;/strong&gt;&lt;br&gt;
Data is collected from different sources just like in ETL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Load&lt;/strong&gt;&lt;br&gt;
This is a major shift from ETL. Instead of first cleaning the data, you load the raw data directly into your target data warehouse or data lake without any changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Transform&lt;/strong&gt;&lt;br&gt;
The transformation happens inside the data warehouse itself. This means analysts can use the warehouse's own computing power to clean, format, and organize the data.&lt;/p&gt;
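&lt;p&gt;A minimal sketch of the ELT pattern, using Python's built-in sqlite3 module as a stand-in for a cloud warehouse (the table and column names are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# sqlite3 stands in for the warehouse in this sketch.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_events (user_id INTEGER, amount TEXT)")

# Load: raw data goes in exactly as extracted, messy formats and all.
wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [(1, "10.50"), (1, "4.25"), (2, " 7.00 ")])

# Transform: done later, inside the warehouse, with plain SQL.
wh.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(TRIM(amount) AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
""")
rows = wh.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"
).fetchall()
```

&lt;p&gt;Notice that the load step inserts the raw values untouched; all cleaning (trimming, casting, aggregating) happens afterwards, inside the database, in SQL.&lt;/p&gt;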

&lt;p&gt;&lt;strong&gt;Simple Diagram of ELT&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdv7cxynv3ek5ljfic0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdv7cxynv3ek5ljfic0f.png" alt="ELT" width="795" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why ELT Became Popular
&lt;/h3&gt;

&lt;p&gt;The emergence of modern cloud data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift changed the game. Today, storing data in the cloud is incredibly cheap. Furthermore, these cloud warehouses have massive, scalable computing power.&lt;br&gt;
Because there is no longer a need for a separate, expensive server just to transform data (as in ETL), companies no longer need to clean data before storing it. They can store everything and process it later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differences Between ETL and ELT
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Order of Steps&lt;/strong&gt;&lt;br&gt;
In ETL, transformation happens before loading, while in ELT transformation happens after loading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Where the Transformation Happens&lt;/strong&gt;&lt;br&gt;
In ETL, transformation happens in a separate server outside the warehouse while in ELT, the transformation happens right inside the destination data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Speed of Loading&lt;/strong&gt;&lt;br&gt;
ELT is usually much faster at the loading stage since there is no cleaning of the data. ETL takes longer because the data has to wait in line to be processed before it can be loaded into the warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Maintenance and Flexibility&lt;/strong&gt;&lt;br&gt;
ETL is less flexible and changes require rebuilding pipelines. If a mistake is made in an ETL pipeline, or if you want to format the data differently, you have to go back to the source, re-extract the data, and run it through the whole pipeline again.&lt;br&gt;
With ELT, the raw data is already sitting in your warehouse. If a mistake is made during transformation, you simply write a new SQL query and transform the raw data afresh. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The Skills Required&lt;/strong&gt;&lt;br&gt;
ETL often requires specialized tools and programming skills, such as software engineers who know Java or Python, or drag-and-drop tools. ELT uses SQL, and since the data is transformed inside a database, it is accessible to analysts.&lt;br&gt;
&lt;strong&gt;NB:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL focuses on control, structure, and quality before storage.&lt;/li&gt;
&lt;li&gt;ELT focuses on speed, flexibility, and scalability after storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advantages and Disadvantages
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Security and Compliance&lt;/strong&gt; - If you are dealing with highly sensitive data (like medical records or credit cards), ETL allows you to strip out/mask the sensitive parts before storage in the main warehouse. &lt;br&gt;
&lt;strong&gt;Reduced storage costs&lt;/strong&gt; - Because you are only loading refined data, you take up much less storage space in your destination database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Rigid&lt;/strong&gt; - Setting up an ETL pipeline takes a lot of time. If a source system needs to make a change, the whole ETL pipeline might break and need to be rewritten.&lt;br&gt;
&lt;strong&gt;Bottlenecks&lt;/strong&gt; - If you have massive amounts of data, the processing server can easily get overwhelmed and slow down the whole operation.&lt;/p&gt;

&lt;h3&gt;
  
  
  ELT
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Agility&lt;/strong&gt; - Since raw data is loaded quickly and directly into the warehouse, analysts do not have to wait for engineers to build complex pipelines to access the raw data.&lt;br&gt;
&lt;strong&gt;Future-Proof&lt;/strong&gt; - Because you keep a copy of the exact raw data, reprocessing of raw data is always possible. You can also go back and answer new business questions that you hadn't thought of previously.&lt;br&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; - Cloud warehouses are designed to scale automatically and can therefore support very large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Security Risks&lt;/strong&gt; - Since you are loading raw, unfiltered data into your warehouse, you have to be careful about who has access to the warehouse if that data contains sensitive information such as passwords, personal addresses or financial details.&lt;br&gt;
&lt;strong&gt;Higher computing costs&lt;/strong&gt; - While cloud storage is cheap, cloud computing can get expensive. If you have bad SQL code running inefficient transformations inside your warehouse every hour, your monthly cloud bill will skyrocket.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Tools
&lt;/h3&gt;

&lt;p&gt;These tools are designed for structured, enterprise-level data pipelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Informatica&lt;/li&gt;
&lt;li&gt;IBM DataStage&lt;/li&gt;
&lt;li&gt;Talend&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ELT Tools
&lt;/h3&gt;

&lt;p&gt;Modern ELT uses a different tool for each step, and these tools allow analysts to work directly with data using SQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fivetran / Airbyte → Extract and Load&lt;/li&gt;
&lt;li&gt;dbt (Data Build Tool) → Transform&lt;/li&gt;
&lt;li&gt;Cloud Warehouses → Snowflake, BigQuery, Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Banking System (ETL)&lt;/strong&gt;&lt;br&gt;
A bank handles sensitive data from mobile banking apps, ATMs and physical branch locations. This data contains raw account numbers, account balances, passwords and PINs, personal details and financial transactions, and must therefore be secured before storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E-commerce Startup (ELT)&lt;/strong&gt;&lt;br&gt;
An online store that wants to track user behavior will generate large amounts of data daily just from people clicking around its website, viewing products, adding items to carts, and so on. The marketing team constantly changes what it wants to measure: one week it may want to track abandoned carts, while the following week it may want to track how long people look at a specific product. Because the business frequently changes what it wants to analyze, ELT's flexibility is a natural fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One Should You Use and Why?
&lt;/h2&gt;

&lt;p&gt;If you are starting a new project and trying to choose between ETL and ELT, here is a practical guide to help you decide.&lt;br&gt;
&lt;strong&gt;Choose ETL if&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- You are bound by strict privacy laws&lt;/strong&gt; - If you work with sensitive data (healthcare, banking), the ability to scrub data before it lands in a database is critical.&lt;br&gt;
&lt;strong&gt;- Your system uses on-premise databases&lt;/strong&gt; - If your company still keeps its servers in a physical server room, your database may not have the high processing power required to do transformations internally, so you will need a separate ETL server.&lt;br&gt;
&lt;strong&gt;- Your data source is unstructured&lt;/strong&gt; - If you are extracting data from highly complex, old mainframes that output weird file types, standard ELT tools might not know how to read them. You will need a custom ETL script to decode and format the data before it can be saved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose ELT if&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;- You are using a cloud data warehouse&lt;/strong&gt; - If you have Snowflake, BigQuery, or Redshift, ELT is most convenient since it takes advantage of what you are already paying for.&lt;br&gt;
&lt;strong&gt;- You work with large volumes of diverse data&lt;/strong&gt; - If you are tracking millions of tiny events (like website clicks, product views or IoT sensor readings), pushing it directly to the cloud is the only way to keep up with the volume.&lt;br&gt;
&lt;strong&gt;- You need flexibility in analysis and fast data processing&lt;/strong&gt; - ELT allows data engineers to focus purely on moving data from point A to point B, while empowering data analysts to handle the business logic and formatting using SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The debate between ETL and ELT is less about which one is better and more about matching your business needs, data size, and system architecture. Understanding both approaches helps you design better data pipelines and make smarter decisions when working with data.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Advanced SQL Techniques for Data Analytics Every Data Analyst Should Know</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:21:19 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/advanced-sql-techniques-for-data-analytics-every-data-analyst-should-know-53c8</link>
      <guid>https://dev.to/lawrence_murithi/advanced-sql-techniques-for-data-analytics-every-data-analyst-should-know-53c8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s data-driven environment, organizations rely heavily on data to make decisions. Businesses collect large amounts of information from different sources such as sales systems, customer platforms, and operational databases. However, raw data alone is not useful unless it can be analyzed and transformed into meaningful insights.&lt;/p&gt;

&lt;p&gt;SQL (Structured Query Language) plays a central role in this process. It allows analysts to retrieve, clean, and analyze data stored in relational databases. While basic SQL skills are important, advanced SQL techniques are what truly enable analysts to solve complex business problems.&lt;/p&gt;

&lt;p&gt;This article explains advanced SQL concepts in simple terms and shows how they are applied in real-world data analytics scenarios. The goal is to help you understand not just how to write SQL queries, but how to use them effectively in practical situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Role of SQL in Data Analytics
&lt;/h3&gt;

&lt;p&gt;SQL is the foundation of data analytics. Most business data is stored in databases, and SQL is the language used to interact with that data.&lt;/p&gt;

&lt;p&gt;Data analysts use SQL to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract data from databases&lt;/li&gt;
&lt;li&gt;Filter and clean datasets&lt;/li&gt;
&lt;li&gt;Combine data from multiple tables&lt;/li&gt;
&lt;li&gt;Perform calculations and aggregations&lt;/li&gt;
&lt;li&gt;Prepare data for reporting tools like Power BI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL is often the first step before using any visualization tools. If the data is not properly prepared using SQL, the final reports may be inaccurate or misleading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Working with Complex Queries
&lt;/h3&gt;

&lt;p&gt;As data becomes more complex, simple queries are not enough to handle it. Advanced SQL, therefore, introduces techniques that help break down complex problems into manageable steps. &lt;br&gt;
In real-world data analysis, datasets are often large and contain multiple tables with different relationships. Moreover, analysts are expected to answer questions that involve comparisons, calculations and multiple layers of logic. These techniques therefore allow analysts to solve the problems step by step instead of trying to do everything in one single query.&lt;br&gt;
Complex query techniques thus help analysts organize their queries in a way that is easier to understand, maintain, and scale. &lt;/p&gt;

&lt;p&gt;They are useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing values against aggregated results&lt;/li&gt;
&lt;li&gt;Reusing part of a query&lt;/li&gt;
&lt;li&gt;Working with multi-step transformations&lt;/li&gt;
&lt;li&gt;Simplifying long and confusing SQL statements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the advanced SQL techniques include:&lt;/p&gt;
&lt;h4&gt;
  
  
  Subqueries
&lt;/h4&gt;

&lt;p&gt;A subquery is a query inside another query. Subqueries are useful when you need to perform a calculation first and then use that result in another query. They allow you to embed logic directly inside your main query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The inner query calculates the average salary&lt;/li&gt;
&lt;li&gt;The outer query returns employees earning above average&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Subqueries can be used in different parts of a query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the WHERE clause&lt;/li&gt;
&lt;li&gt;In the SELECT clause&lt;/li&gt;
&lt;li&gt;In the FROM clause (called derived tables)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-World Case Scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify high-performing employees based on salary or performance metrics. &lt;/li&gt;
&lt;li&gt;Finding customers who spend more than the average customer&lt;/li&gt;
&lt;li&gt;Identifying products priced above the average price&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; While subqueries are powerful, they can become slow if used incorrectly, especially with large datasets.&lt;/p&gt;
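&lt;p&gt;The salary subquery above can be run end to end with Python's sqlite3 module; the employee data below is invented purely for demonstration:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (name TEXT, salary REAL)")
db.executemany("INSERT INTO employees VALUES (?, ?)",
               [("Amina", 90000), ("Brian", 50000), ("Cheryl", 70000)])

# The inner query computes the average salary (70000 here); the outer
# query keeps only the rows above that average.
rows = db.execute("""
    SELECT name
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()
```

&lt;p&gt;Only the one employee earning above the 70,000 average comes back, while the averaging logic stays embedded in the main query.&lt;/p&gt;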
&lt;h4&gt;
  
  
  Common Table Expressions (CTEs)
&lt;/h4&gt;

&lt;p&gt;A CTE is a temporary, named result set in an SQL query that improves readability and organization; it behaves like a temporary table that exists only while the query is running.&lt;/p&gt;

&lt;p&gt;CTEs allow you to define a query once and then use it in the main query. This makes complex queries easier to read and understand, especially when working with multiple steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;sales_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_summary&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Types of CTEs:&lt;br&gt;
&lt;strong&gt;- Recursive CTE&lt;/strong&gt;: A specialized CTE that references itself, which is essential for querying hierarchical data like organizational charts or family trees.&lt;br&gt;
&lt;strong&gt;- Non-Recursive CTE&lt;/strong&gt;: The most common type, used to simplify standard queries by creating manageable logical steps.&lt;br&gt;
Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Makes queries clean and easier to read&lt;/li&gt;
&lt;li&gt;Breaks complex logic into steps, thus easier to debug and modify&lt;/li&gt;
&lt;li&gt;Improves maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; You can also have multiple CTEs in one query, which is useful for complex data transformations.&lt;/p&gt;
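&lt;p&gt;As a sketch of the recursive variant mentioned above, here is a small organizational chart walked with WITH RECURSIVE, again via Python's sqlite3 (the staff table and names are invented):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staff (id INTEGER, name TEXT, manager_id INTEGER)")
db.executemany("INSERT INTO staff VALUES (?, ?, ?)",
               [(1, "CEO", None), (2, "CTO", 1), (3, "Engineer", 2)])

# The CTE references itself: the anchor row is the person with no
# manager, and each recursive step joins back to pull in direct reports
# one level down.
rows = db.execute("""
    WITH RECURSIVE chain(id, name, level) AS (
        SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
        UNION ALL
        SELECT s.id, s.name, c.level + 1
        FROM staff s JOIN chain c ON s.manager_id = c.id
    )
    SELECT name, level FROM chain ORDER BY level
""").fetchall()
```

&lt;p&gt;Each employee comes back with their depth in the hierarchy, which is exactly the kind of question an ordinary (non-recursive) query struggles to answer.&lt;/p&gt;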

&lt;p&gt;In business reporting, analysts often build layered queries. CTEs allow them to structure their logic clearly when working with large datasets.&lt;br&gt;
Step 1: Calculate total sales per product&lt;br&gt;
Step 2: Filter high-performing products&lt;br&gt;
Step 3: Join with other tables for reporting&lt;/p&gt;
&lt;h3&gt;
  
  
  Advanced Joins
&lt;/h3&gt;

&lt;p&gt;Joins are used to combine data from multiple tables. In advanced SQL, joins become more powerful when dealing with complex relationships.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a retail company:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customers table stores customer details&lt;/li&gt;
&lt;li&gt;Orders table stores transactions&lt;/li&gt;
&lt;li&gt;Products table stores product information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using joins, analysts can create a full view of customer purchases.&lt;/p&gt;

&lt;p&gt;Poor joins can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate data&lt;/li&gt;
&lt;li&gt;Incorrect totals&lt;/li&gt;
&lt;li&gt;Misleading reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Window Functions
&lt;/h3&gt;

&lt;p&gt;Window functions allow us to perform advanced calculations across a group of related rows while keeping the original data. They are useful for ranking, running totals, moving averages, and analytical reporting.&lt;br&gt;
Window functions often remove the need for complex self-joins and provide an analytical layer within SQL.&lt;br&gt;
Window functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep every row&lt;/li&gt;
&lt;li&gt;Add calculated values to each row
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;
           &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;output_column&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Window functions are widely used in business intelligence and reporting for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rankings within a group&lt;/li&gt;
&lt;li&gt;Calculating running totals&lt;/li&gt;
&lt;li&gt;Comparing rows (current vs previous)&lt;/li&gt;
&lt;li&gt;Analyzing trends over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies use ranking to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify top performers&lt;/li&gt;
&lt;li&gt;Allocate bonuses&lt;/li&gt;
&lt;li&gt;Compare employee performance
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;##&lt;/span&gt; &lt;span class="n"&gt;Ranking&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Businesses use running totals to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track revenue growth&lt;/li&gt;
&lt;li&gt;Monitor daily or monthly performance&lt;/li&gt;
&lt;li&gt;Forecast future trends
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;##&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;totals&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
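&lt;p&gt;Comparing the current row with the previous one, another use case listed earlier, is typically done with LAG(). A small runnable sketch with invented sales figures, using Python's sqlite3 (window functions require SQLite 3.25 or newer):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("2024-01-01", 100.0), ("2024-01-02", 150.0), ("2024-01-03", 120.0)])

# LAG(amount) fetches the previous row's amount (NULL on the first row),
# so subtracting it gives the day-over-day change.
rows = db.execute("""
    SELECT day, amount,
           amount - LAG(amount) OVER (ORDER BY day) AS change
    FROM sales
    ORDER BY day
""").fetchall()
```

&lt;p&gt;Every original row is kept, and the day-over-day change simply appears as an extra column, with no self-join required.&lt;/p&gt;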

&lt;h3&gt;
  
  
  Aggregations and Grouping
&lt;/h3&gt;

&lt;p&gt;Aggregation helps summarize large datasets. Raw data is often too detailed to understand directly. Aggregation thus helps turn large datasets into meaningful summaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregation allows analysts to answer questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total sales by region&lt;/li&gt;
&lt;li&gt;Sales by product category&lt;/li&gt;
&lt;li&gt;Monthly revenue trends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aggregation is often used together with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filtering (HAVING)&lt;/li&gt;
&lt;li&gt;Sorting (ORDER BY)&lt;/li&gt;
&lt;/ul&gt;
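&lt;p&gt;A short runnable example combining GROUP BY with HAVING and ORDER BY, once more via sqlite3 with invented figures. The key point is that HAVING filters after aggregation, so it can test the aggregated total itself:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("East", 600.0), ("East", 500.0), ("West", 300.0)])

# WHERE would filter individual rows before grouping; HAVING filters
# whole groups after SUM() has been computed.
rows = db.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 400
    ORDER BY total DESC
""").fetchall()
```

&lt;p&gt;West drops out because its total (300) fails the HAVING condition, even though no single row was filtered by a WHERE clause.&lt;/p&gt;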

&lt;h3&gt;
  
  
  Data Cleaning and Transformation
&lt;/h3&gt;

&lt;p&gt;Data cleaning is one of the most important steps in analytics. Since raw data is usually dirty and messy, SQL helps clean and prepare it before analysis. &lt;br&gt;
Raw data may contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicates&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Incorrect formats&lt;/li&gt;
&lt;li&gt;Inconsistent entries&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Removing Duplicates
&lt;/h4&gt;

&lt;p&gt;DISTINCT removes repeated values and ensures each entry appears only once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Handling Missing Values
&lt;/h4&gt;

&lt;p&gt;COALESCE replaces NULL values with a default value, preventing errors in reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Not Available'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Data Transformation
&lt;/h4&gt;

&lt;p&gt;The query below creates a new calculated column.&lt;br&gt;
Data transformation also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing data types&lt;/li&gt;
&lt;li&gt;Formatting dates&lt;/li&gt;
&lt;li&gt;Standardizing values
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
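<p>The other transformations listed above can be sketched as follows, assuming a sales table with a sale_date column (date and type functions vary between databases; the syntax below is PostgreSQL-style):</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Change a data type, extract a date part, and standardize text
SELECT CAST(price AS DECIMAL(10, 2)) AS price_decimal,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       UPPER(region) AS region_standardized
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;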

&lt;h3&gt;
  
  
  Using SQL for Real-World Business Problems
&lt;/h3&gt;

&lt;p&gt;Advanced SQL is not just about writing queries but about solving real problems.&lt;br&gt;
In organizations, SQL is used daily to answer business questions and support decisions.&lt;/p&gt;
&lt;h4&gt;
  
  
  Customer Segmentation
&lt;/h4&gt;

&lt;p&gt;Businesses use customer segmentation to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target high-value customers&lt;/li&gt;
&lt;li&gt;Design marketing strategies&lt;/li&gt;
&lt;li&gt;Improve customer retention
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;##&lt;/span&gt; &lt;span class="k"&gt;Grouping&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="n"&gt;based&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;spending&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'High Value'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Medium Value'&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Low Value'&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Sales Performance Analysis
&lt;/h4&gt;

&lt;p&gt;Total sales are calculated per product and sorted by performance to identify best-selling products.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyses like these help organizations to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand performance&lt;/li&gt;
&lt;li&gt;Identify opportunities&lt;/li&gt;
&lt;li&gt;Solve operational problems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Optimization
&lt;/h3&gt;

&lt;p&gt;SQL queries must be clean, easy to understand, and efficient.&lt;br&gt;
In large databases, poor queries can slow down systems and delay reports.&lt;/p&gt;

&lt;p&gt;Best Practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use indexes on important columns to speed up data retrieval&lt;/li&gt;
&lt;li&gt;Avoid selecting unnecessary columns&lt;/li&gt;
&lt;li&gt;Filter data early to reduce data size&lt;/li&gt;
&lt;li&gt;Use CTEs instead of repeated subqueries&lt;/li&gt;
&lt;li&gt;Avoid unnecessary joins&lt;/li&gt;
&lt;/ul&gt;
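&lt;p&gt;As an illustrative sketch of the CTE practice, the aggregation is defined once and then reused, instead of repeating the same subquery (table name and threshold are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Define the aggregation once in a CTE, then filter the result
WITH region_totals AS (
    SELECT region, SUM(sales) AS total_sales
    FROM sales
    GROUP BY region
)
SELECT region, total_sales
FROM region_totals
WHERE total_sales &amp;gt; 10000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;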

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Advanced SQL is a critical skill for data analysts. It goes beyond basic queries and allows analysts to work with complex datasets, perform advanced calculations and solve real-world business problems.&lt;/p&gt;

&lt;p&gt;In this article, we explored key advanced SQL techniques such as subqueries, CTEs, joins, window functions, aggregations, and data transformation, and how they are applied in real business scenarios.&lt;/p&gt;

&lt;p&gt;In data analytics, SQL is not just a tool but a core skill that connects raw data to meaningful insights. Mastering advanced SQL allows analysts to move from basic reporting to deeper, more impactful analysis.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>luxdev</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Connecting Power BI to SQL Databases: A Practical Guide for Data Analysts</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:03:47 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/connecting-power-bi-to-sql-databases-a-practical-guide-for-data-analysts-5745</link>
      <guid>https://dev.to/lawrence_murithi/connecting-power-bi-to-sql-databases-a-practical-guide-for-data-analysts-5745</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In most modern organizations, data is one of the most valuable assets. Companies collect large amounts of information from sales systems, websites, customer platforms, and operational databases. To make sense of this information, businesses use tools that can transform this raw data into clear insights. One of the most widely used tools for this purpose is the Microsoft Power BI platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Power BI
&lt;/h2&gt;

&lt;p&gt;Power BI is a business intelligence and data visualization tool developed by Microsoft. It allows users to connect to different data sources, analyze data, and create interactive dashboards and reports. These reports help organizations monitor performance, understand trends, and support decision-making among other uses.&lt;/p&gt;

&lt;p&gt;Power BI is commonly used by data analysts, business managers, and decision makers because it can present complex data in simple visual forms such as charts, tables, maps, and dashboards.&lt;/p&gt;

&lt;p&gt;Most organizations store their operational and analytical data in SQL databases. SQL databases are designed to store large amounts of structured data in tables. They allow users to query, filter, update, and analyze data efficiently using Structured Query Language (SQL). SQL databases are reliable, secure, and scalable, hence they are widely used in business systems such as sales platforms, customer management systems, and inventory systems.&lt;/p&gt;

&lt;p&gt;Connecting Power BI to a database allows analysts to access this stored data directly. Instead of manually exporting data into spreadsheets, Power BI can retrieve the data automatically, refresh it when the database changes, and build dashboards that always reflect the latest information.&lt;/p&gt;

&lt;p&gt;This article explains how Power BI connects to SQL databases, how to connect to a local PostgreSQL database, how to connect to a cloud database such as Aiven PostgreSQL, and how the loaded data is modeled for analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Power BI Interface
&lt;/h3&gt;

&lt;p&gt;Before connecting to a database, it is helpful to understand the Power BI Desktop interface. Power BI Desktop is the main application used for building reports and dashboards.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccc89cx73itso4ddn7vt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccc89cx73itso4ddn7vt.jpg" alt="BI Desktop" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
The Power BI Desktop interface includes several sections such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ribbon (Top Menu) – Contains commands and tabs such as Get Data, Transform Data, and Publish.&lt;/li&gt;
&lt;li&gt;Report Canvas – The workspace where charts and dashboards are created.&lt;/li&gt;
&lt;li&gt;Visualizations Pane – Used to select and customize charts.&lt;/li&gt;
&lt;li&gt;Fields Pane – Displays the tables and columns loaded into Power BI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can download the Power BI Desktop app &lt;a href="https://www.microsoft.com/en-us/power-platform/products/power-bi/desktop" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting Power BI to a Local PostgreSQL Database
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is one of the most popular open-source relational databases used in data analytics. Many organizations run databases locally on their own servers. &lt;br&gt;
The steps below explain how to connect Power BI to a local PostgreSQL database.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: Open Power BI Desktop
&lt;/h4&gt;

&lt;p&gt;Start by opening Power BI Desktop on your computer.&lt;br&gt;
When the application opens, a blank report canvas appears. This is where you will build your report after loading the data.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Get Data
&lt;/h4&gt;

&lt;p&gt;On the Home tab of the ribbon, click Get Data.&lt;br&gt;
This button opens a list of available data sources. Power BI supports many data sources including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excel&lt;/li&gt;
&lt;li&gt;SQL Server&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Web APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Get Data feature is the starting point for connecting Power BI to any data source. Other data sources are shown in the image below.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsqplwvo6mi1vf0n859m.jpg" alt="Get Data" width="800" height="450"&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 3: Select PostgreSQL Database
&lt;/h4&gt;

&lt;p&gt;From the list of available data connectors, click &lt;strong&gt;more&lt;/strong&gt; to view more options. Scroll down, select &lt;strong&gt;PostgreSQL Database&lt;/strong&gt; and click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 4: Enter the Database Connection Details
&lt;/h4&gt;

&lt;p&gt;After selecting PostgreSQL and clicking Connect, Power BI opens a window that requires the database connection details.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwa7subvjsrzicalek1n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwa7subvjsrzicalek1n.jpg" alt="Credentials" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Server&lt;/strong&gt; – The location of the database server. If the database is on your computer, use localhost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt; – The name of the database you want to connect to.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 5: Provide Login Credentials
&lt;/h4&gt;

&lt;p&gt;Next, Power BI will ask for authentication details.&lt;br&gt;
You will need to provide:&lt;br&gt;
&lt;strong&gt;Username&lt;br&gt;
Password&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These credentials were set up during the installation of PostgreSQL and allow Power BI to securely access the database.&lt;/p&gt;

&lt;p&gt;Once the credentials are entered, click Connect.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 6: Select Tables to Import
&lt;/h4&gt;

&lt;p&gt;After connecting successfully, Power BI opens the Navigator Window which displays all available tables in the database.&lt;br&gt;
You can preview the contents of each table before loading them.&lt;br&gt;
There are two options:&lt;br&gt;
&lt;strong&gt;Load&lt;/strong&gt; – Import the data directly.&lt;br&gt;
&lt;strong&gt;Transform Data&lt;/strong&gt; – Clean or modify the data before loading it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting Power BI to a Cloud Database (Aiven PostgreSQL)
&lt;/h3&gt;

&lt;p&gt;Many organizations now store their databases in the cloud. Cloud databases are accessible through the internet and provide benefits such as scalability, backups, and easier management.&lt;br&gt;
Aiven is a cloud platform that provides managed PostgreSQL databases.&lt;br&gt;
Connecting Power BI to a cloud database is not very different from connecting to a local database; the main difference is that additional security steps are required.&lt;/p&gt;

&lt;p&gt;The steps below explain how to connect Power BI to an Aiven PostgreSQL database.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: Get the Database Connection Details from Aiven
&lt;/h4&gt;

&lt;p&gt;Log in to Aiven; inside the dashboard, you will find the connection information for your PostgreSQL service. These details are used by Power BI to locate and connect to the database.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgniq9zxf2r4e78vijr8d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgniq9zxf2r4e78vijr8d.jpg" alt="Connection Details" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Download and install the SSL Certificate
&lt;/h4&gt;

&lt;p&gt;Cloud database providers often require SSL encryption to secure the connection.&lt;br&gt;
An SSL certificate ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data transferred between Power BI and the database is encrypted&lt;/li&gt;
&lt;li&gt;Unauthorized users cannot intercept the connection&lt;/li&gt;
&lt;li&gt;The database server identity is verified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Aiven, download the certificate file (CA Certificate) from the Connection Information section of the service dashboard.&lt;br&gt;
Rename the downloaded file from &lt;strong&gt;ca.pem&lt;/strong&gt; to &lt;strong&gt;ca.crt&lt;/strong&gt; and install the Certificate on your PC.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzahcljjsav3r40i5mh96.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzahcljjsav3r40i5mh96.jpg" alt="SSL Certificate" width="597" height="763"&gt;&lt;/a&gt;&lt;br&gt;
Choose Local Machine as the location of the installation and click next.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx30qrvf497dbag9gsbad.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx30qrvf497dbag9gsbad.jpg" alt="Local Machine" width="793" height="775"&gt;&lt;/a&gt;&lt;br&gt;
Choose &lt;strong&gt;Place all certificates in the following store&lt;/strong&gt; and browse the certificate store to &lt;strong&gt;Trusted Root Certification Authorities&lt;/strong&gt;. &lt;br&gt;
Click &lt;strong&gt;OK&lt;/strong&gt; and &lt;strong&gt;Finish&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg4l0upupjwxaz519po5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg4l0upupjwxaz519po5.jpg" alt="Store" width="783" height="811"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 3: Connect Power BI
&lt;/h4&gt;

&lt;p&gt;Open Power BI Desktop as before, click &lt;strong&gt;Get Data&lt;/strong&gt;, click &lt;strong&gt;more&lt;/strong&gt;, scroll down and select &lt;strong&gt;PostgreSQL Database&lt;/strong&gt;.&lt;br&gt;
Copy the &lt;strong&gt;Server Name&lt;/strong&gt; from the service URL (&lt;strong&gt;host_name:port_number&lt;/strong&gt;) on Connection Information and paste it into Power BI.&lt;br&gt;
Input the name of your database and click &lt;strong&gt;OK&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu18dtkx45ilpu9hofwse.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu18dtkx45ilpu9hofwse.jpg" alt="server &amp;amp; db" width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
Copy the username and password from Aiven, input them on the Power BI credentials window that opens and click &lt;strong&gt;connect&lt;/strong&gt;.&lt;br&gt;
Once the connection is successful, a navigator window opens and displays all tables in the database. &lt;br&gt;
Select the tables you want to work with and click &lt;strong&gt;Load&lt;/strong&gt; or &lt;strong&gt;Transform Data&lt;/strong&gt; depending on what you wish to do with the data.&lt;br&gt;
The Transform Data option is used to clean raw data, e.g. deleting duplicates and addressing null values using the most appropriate method.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj384cz97geuk12zvbe1s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj384cz97geuk12zvbe1s.jpg" alt="Load data" width="800" height="425"&gt;&lt;/a&gt;&lt;br&gt;
Successfully loaded data displays in the Data pane as shown in the figure below. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoqoi3v0etj70udifbs3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoqoi3v0etj70udifbs3.jpg" alt="Tables" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Creating Relationships Between Tables
&lt;/h4&gt;

&lt;p&gt;Once loaded, Power BI automatically detects relationships between tables based on matching columns using primary and foreign keys. Relationships that are not detected automatically can be created manually by dragging a column from one table onto the matching column in another table.&lt;br&gt;
These relationships allow Power BI to combine information across multiple tables. &lt;br&gt;
For example:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17mn1jzfmmcng1bk54zg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17mn1jzfmmcng1bk54zg.jpg" alt="Connection" width="543" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Model View, the tables appear as connected boxes. The relationships show how data flows between tables.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ebu7s8y3tr1j5io4sqr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ebu7s8y3tr1j5io4sqr.jpg" alt="Relatioships" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Modeling and Why it's Important
&lt;/h2&gt;

&lt;p&gt;Data modeling is the process of defining how data is stored, structured, and related within a database. It ensures that Power BI understands how different tables are related.&lt;/p&gt;

&lt;p&gt;Good data modeling allows Power BI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter data correctly&lt;/li&gt;
&lt;li&gt;Calculate totals accurately&lt;/li&gt;
&lt;li&gt;Create meaningful visualizations&lt;/li&gt;
&lt;li&gt;Avoid duplicated values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, when analyzing sales:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sales table stores transaction records.&lt;/li&gt;
&lt;li&gt;The customers table provides customer information.&lt;/li&gt;
&lt;li&gt;The products table describes the items sold.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why SQL Skills are Important for Power BI Analysts
&lt;/h3&gt;

&lt;p&gt;Power BI is a powerful tool for building reports and dashboards, but it does not replace the need for strong data handling skills. Most business data is stored in SQL databases, and before that data can be visualized in Power BI, it must first be retrieved, cleaned, and structured properly. &lt;br&gt;
SQL skills give Power BI analysts a real edge: they can retrieve exactly the data they need instead of pulling everything into Power BI. &lt;br&gt;
Without SQL, analysts may rely too much on raw data, which can lead to slow reports, incorrect results, and inefficient workflows.&lt;br&gt;
SQL allows analysts to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retrieve Data&lt;/strong&gt;&lt;br&gt;
Analysts can write queries to select specific rows and columns relevant to their analysis from a database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- selecting only products name and price columns from products table&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters in Power BI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces the amount of data imported&lt;/li&gt;
&lt;li&gt;Improves performance&lt;/li&gt;
&lt;li&gt;Makes the model easier to manage&lt;/li&gt;
&lt;li&gt;Avoids unnecessary columns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Filter Data&lt;/strong&gt;&lt;br&gt;
In real-world scenarios, not all data is useful for analysis. Analysts often need to focus on specific time periods, regions, or business conditions. SQL thus makes it easy to filter datasets based on specific criteria before loading them into Power BI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Retrieving only sales from 2024 onwards.&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces dataset size&lt;/li&gt;
&lt;li&gt;Speeds up report loading&lt;/li&gt;
&lt;li&gt;Focuses analysis on relevant data&lt;/li&gt;
&lt;li&gt;Avoids unnecessary processing inside Power BI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Perform Aggregations&lt;/strong&gt;&lt;br&gt;
Aggregation is the process of summarizing data. In business analysis, analysts often need totals, averages, counts, and other summary metrics. SQL can summarize large datasets quickly using GROUP BY together with functions such as SUM, COUNT, and AVG.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Calculating total sales per product&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why aggregation in SQL is important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces data volume before loading&lt;/li&gt;
&lt;li&gt;Improves Power BI performance&lt;/li&gt;
&lt;li&gt;Simplifies data models&lt;/li&gt;
&lt;li&gt;Avoids heavy calculations in DAX&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Preparing data for Analysis
&lt;/h3&gt;

&lt;p&gt;Raw data must be cleaned or transformed before it is ready for visualization. &lt;br&gt;
SQL supports this in several ways:&lt;/p&gt;
&lt;h4&gt;
  
  
  Joining Tables and Combining Data
&lt;/h4&gt;

&lt;p&gt;Business data is usually stored in multiple tables.&lt;br&gt;
SQL allows analysts to combine these tables using joins. Joined datasets in SQL can simplify the data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this is important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines related data into one dataset&lt;/li&gt;
&lt;li&gt;Reduces the need for complex relationships in Power BI&lt;/li&gt;
&lt;li&gt;Makes analysis easier&lt;/li&gt;
&lt;li&gt;Prevents duplication errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Cleaning and Preparation
&lt;/h4&gt;

&lt;p&gt;Raw data is often messy. It may contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Duplicate records&lt;/li&gt;
&lt;li&gt;Incorrect formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL helps clean and prepare the data before it is loaded into Power BI, leading to better insights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Eliminating duplicates&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Handling missing values&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Not Provided'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why data cleaning matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures data accuracy&lt;/li&gt;
&lt;li&gt;Improves report reliability&lt;/li&gt;
&lt;li&gt;Reduces cleaning work in Power BI&lt;/li&gt;
&lt;li&gt;Prevents errors in calculations&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Creating Calculated Fields
&lt;/h4&gt;

&lt;p&gt;SQL allows analysts to create new columns based on existing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Calculate total sales&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why calculated fields are useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prepares key metrics before loading&lt;/li&gt;
&lt;li&gt;Reduces need for DAX calculations&lt;/li&gt;
&lt;li&gt;Keeps logic centralized in the database&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Supporting Advanced Analysis
&lt;/h4&gt;

&lt;p&gt;SQL also supports more advanced operations such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions (running totals, ranking)&lt;/li&gt;
&lt;li&gt;Subqueries&lt;/li&gt;
&lt;li&gt;Common Table Expressions (CTEs)&lt;/li&gt;
&lt;li&gt;Data transformations
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
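&lt;p&gt;As a brief sketch of a CTE (the daily_sales name here is illustrative, reusing the hypothetical sales table):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A CTE names an intermediate result so the final query stays readable
WITH daily_sales AS (
  SELECT sale_date, SUM(sales_amount) AS daily_total
  FROM sales
  GROUP BY sale_date
)
SELECT sale_date, daily_total
FROM daily_sales
ORDER BY daily_total DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;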



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Power BI is a powerful tool that helps organizations transform raw data into meaningful insights. By connecting directly to SQL databases, Power BI allows analysts to access structured data stored in business systems and convert it into interactive dashboards and reports.&lt;br&gt;
SQL prepares the foundation, and Power BI builds the story on top of it. Strong SQL skills allow analysts to work more efficiently, produce accurate reports, and deliver better insights for decision-making.&lt;br&gt;
When SQL and Power BI are used together, they provide a powerful combination for modern data analysis and business intelligence.&lt;/p&gt;

</description>
      <category>luxdev</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>Mastering SQL Joins and Window Functions</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Tue, 03 Mar 2026 10:36:56 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/mastering-sql-joins-and-window-functions-1f30</link>
      <guid>https://dev.to/lawrence_murithi/mastering-sql-joins-and-window-functions-1f30</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;SQL (Structured Query Language) is a powerful tool used to search, manage, and analyze large amounts of data. It is widely used by data enthusiasts, software developers and even marketing professionals.&lt;br&gt;
In real-world databases, data is not stored in one large table. It is divided into multiple related tables. This makes storage efficient and avoids duplication. To work effectively with such data, you must understand SQL joins and window functions. These two features allow you to combine data correctly and perform advanced analysis without losing important details.&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL Joins
&lt;/h2&gt;

&lt;p&gt;A JOIN in SQL is used to combine rows from two or more tables based on a related column. This relationship is usually created using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A primary key (unique identifier in one table)&lt;/li&gt;
&lt;li&gt;A foreign key (reference to that key in another table)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Joins are essential when working with relational databases because data is often split across multiple tables.&lt;/p&gt;
&lt;h3&gt;
  
  
  Importance of Joins
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Combining related data from multiple tables&lt;/li&gt;
&lt;li&gt;Maintaining relational integrity&lt;/li&gt;
&lt;li&gt;Supporting one-to-many and many-to-many relationships&lt;/li&gt;
&lt;li&gt;Building meaningful reports and analytics&lt;/li&gt;
&lt;li&gt;Preventing unnecessary duplication of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The type of join you use directly affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of rows returned&lt;/li&gt;
&lt;li&gt;Whether NULL values appear&lt;/li&gt;
&lt;li&gt;How business logic is interpreted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NB: Choosing the wrong join can lead to missing data, duplicated records, or incorrect analysis.&lt;/p&gt;
&lt;h3&gt;
  
  
  Types of SQL Joins
&lt;/h3&gt;
&lt;h4&gt;
  
  
  INNER JOIN
&lt;/h4&gt;

&lt;p&gt;The INNER JOIN returns only the rows that have matching values in both tables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines records based on a related column&lt;/li&gt;
&lt;li&gt;Returns only matching rows&lt;/li&gt;
&lt;li&gt;Excludes non-matching rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijwwf41l1fwvnmt9kowo.jpg" alt="Inner Join" width="800" height="106"&gt;&lt;/p&gt;

&lt;p&gt;INNER JOIN is used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You only need matched data&lt;/li&gt;
&lt;li&gt;You want to exclude incomplete relationships&lt;/li&gt;
&lt;/ul&gt;
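&lt;p&gt;For example (the customers and orders tables below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns only customers who have at least one order
SELECT c.customer_name, o.order_id
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;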
&lt;h4&gt;
  
  
  LEFT (OUTER) JOIN
&lt;/h4&gt;

&lt;p&gt;The LEFT (OUTER) JOIN returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All rows from the left table&lt;/li&gt;
&lt;li&gt;Matching rows from the right table&lt;/li&gt;
&lt;li&gt;NULL values if no match exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftfmhtuhwawp91s4otip.jpg" alt="Left Join" width="800" height="86"&gt;&lt;/p&gt;

&lt;p&gt;LEFT JOIN is used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want all records from the main table&lt;/li&gt;
&lt;li&gt;You want to identify missing matches&lt;/li&gt;
&lt;li&gt;You need complete reporting from one side&lt;/li&gt;
&lt;/ul&gt;
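&lt;p&gt;A short example (again using illustrative customers and orders tables):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns all customers; order_id is NULL for customers with no orders
SELECT c.customer_name, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;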
&lt;h4&gt;
  
  
  RIGHT (OUTER) JOIN
&lt;/h4&gt;

&lt;p&gt;The RIGHT (OUTER) JOIN returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All rows from the right table&lt;/li&gt;
&lt;li&gt;Matching rows from the left table&lt;/li&gt;
&lt;li&gt;NULL where no match exists on the left&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn4qq9vkqqibun7ys4i9.jpg" alt="Right Join" width="800" height="84"&gt;&lt;/p&gt;

&lt;p&gt;NB: RIGHT JOIN works like LEFT JOIN but from the opposite direction.&lt;/p&gt;
&lt;h4&gt;
  
  
  FULL (OUTER) JOIN
&lt;/h4&gt;

&lt;p&gt;The FULL JOIN returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All rows from both tables&lt;/li&gt;
&lt;li&gt;Matching records where possible&lt;/li&gt;
&lt;li&gt;NULL values where no match exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7lesr3ojfu51sn1gj8g.jpg" alt="Full Join" width="800" height="84"&gt;&lt;/p&gt;

&lt;p&gt;The FULL JOIN is used when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing two datasets&lt;/li&gt;
&lt;li&gt;Identifying differences between systems&lt;/li&gt;
&lt;li&gt;Performing reconciliation tasks&lt;/li&gt;
&lt;/ul&gt;
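&lt;p&gt;A short example (illustrative tables; note that some databases, such as MySQL, do not support FULL OUTER JOIN directly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keeps all customers and all orders; NULLs mark the missing side
SELECT c.customer_name, o.order_id
FROM customers c
FULL OUTER JOIN orders o
  ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;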
&lt;h4&gt;
  
  
  CROSS JOIN
&lt;/h4&gt;

&lt;p&gt;A CROSS JOIN returns every possible combination of rows from two tables, so it can create very large results.&lt;br&gt;
If Table A has 5 rows and Table B has 10 rows:&lt;br&gt;
Result = 50 rows.&lt;br&gt;
It does not use a matching condition.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbk5gfm1fjjiemhjcp68.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbk5gfm1fjjiemhjcp68.jpg" alt="Cross Join" width="800" height="112"&gt;&lt;/a&gt;&lt;br&gt;
A CROSS JOIN is used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate combinations&lt;/li&gt;
&lt;li&gt;Create calendar expansions&lt;/li&gt;
&lt;li&gt;Test scenarios&lt;/li&gt;
&lt;/ul&gt;
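&lt;p&gt;For instance, to generate a calendar expansion (the products and dates tables here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pairs every product with every calendar date
SELECT p.product_name, d.calendar_date
FROM products p
CROSS JOIN dates d;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;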
&lt;h4&gt;
  
  
  SELF JOIN
&lt;/h4&gt;

&lt;p&gt;A self join joins a table to itself. Aliases are used to refer to the same table.&lt;br&gt;
Example:&lt;br&gt;
Employee table:&lt;br&gt;
| EmployeeID | ManagerID |&lt;br&gt;
To show each employee and their manager name, the table is joined to itself.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkfb28zynixzjr1xj2wm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkfb28zynixzjr1xj2wm.jpg" alt="Self Join" width="800" height="103"&gt;&lt;/a&gt;&lt;br&gt;
Self joins are useful for hierarchical data.&lt;/p&gt;
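&lt;p&gt;The employee example above can be sketched as follows (the table and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Aliases e and m refer to the same employees table
SELECT e.employee_name AS employee,
       m.employee_name AS manager
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;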
&lt;h4&gt;
  
  
  NATURAL JOIN
&lt;/h4&gt;

&lt;p&gt;A natural join automatically joins tables using columns that have the same name.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7932nmokz5hfoco4y1k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7932nmokz5hfoco4y1k.jpg" alt="Natural Join" width="800" height="47"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Performance Considerations for Joins
&lt;/h3&gt;

&lt;p&gt;Joins can affect performance, especially in large databases.&lt;br&gt;
Best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index join columns (primary and foreign keys)&lt;/li&gt;
&lt;li&gt;Avoid unnecessary joins&lt;/li&gt;
&lt;li&gt;Filter data early using WHERE&lt;/li&gt;
&lt;li&gt;Understand execution plans&lt;/li&gt;
&lt;li&gt;Be careful with joins that multiply rows unintentionally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Improper joins can cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate results&lt;/li&gt;
&lt;li&gt;Data inflation&lt;/li&gt;
&lt;li&gt;Slow query execution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Window Functions
&lt;/h2&gt;

&lt;p&gt;Window functions allow us to perform advanced calculations across a group of related rows while keeping the original data. They are useful for ranking, running totals, moving averages, and analytical reporting.&lt;br&gt;
Window functions often remove the need for complex self-joins and provide an analytical layer within SQL.&lt;br&gt;
Window functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Keep every row&lt;/li&gt;
&lt;li&gt;Add calculated values to each row&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Structure of a Window Function
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_1,
       function() OVER (
           PARTITION BY column
           ORDER BY column
       ) AS output_column
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  1.  OVER()
&lt;/h4&gt;

&lt;p&gt;The OVER() clause defines how the window function operates and controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;Ordering&lt;/li&gt;
&lt;li&gt;Optional frame boundaries&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  2.  PARTITION BY
&lt;/h4&gt;

&lt;p&gt;The PARTITION BY clause divides rows into logical groups. If it is omitted, the entire dataset is treated as one group.&lt;/p&gt;
&lt;h4&gt;
  
  
  3.  ORDER BY
&lt;/h4&gt;

&lt;p&gt;ORDER BY defines the sequence of rows inside each partition.&lt;/p&gt;

&lt;p&gt;It is essential for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking&lt;/li&gt;
&lt;li&gt;Running totals&lt;/li&gt;
&lt;li&gt;Time-based comparisons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If ORDER BY is omitted, row processing order is undefined.&lt;/p&gt;
&lt;h4&gt;
  
  
  4. Frame Clause (ROWS vs RANGE)
&lt;/h4&gt;

&lt;p&gt;The frame clause defines the range of rows (the boundary) that a window function operates over, and is commonly used for moving averages and cumulative calculations.&lt;br&gt;
With the ROWS subclause, the frame is defined by beginning and ending row positions, while with the RANGE subclause, the frame is defined by a range of values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROWS BETWEEN lower_bound AND upper_bound
RANGE BETWEEN lower_bound AND upper_bound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
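&lt;p&gt;For example, a three-row moving average over the hypothetical sales table might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Average of the current row and the two preceding rows
SELECT sale_date,
       AVG(sales_amount) OVER (
         ORDER BY sale_date
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;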



&lt;h3&gt;
  
  
  Types of SQL Window Functions
&lt;/h3&gt;

&lt;p&gt;Window functions fall into three main categories.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Aggregate Window Functions
&lt;/h4&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AVG() - calculates moving averages.&lt;/li&gt;
&lt;li&gt;SUM() - creates running totals.&lt;/li&gt;
&lt;li&gt;COUNT() - counts the number of items in a group.&lt;/li&gt;
&lt;li&gt;MIN() - returns the minimum value.&lt;/li&gt;
&lt;li&gt;MAX() - returns the maximum value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some use cases of Aggregate window functions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Department totals&lt;/li&gt;
&lt;li&gt;Running totals&lt;/li&gt;
&lt;li&gt;Moving averages&lt;/li&gt;
&lt;li&gt;Cumulative metrics&lt;/li&gt;
&lt;/ul&gt;
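&lt;p&gt;For example, a department average can be shown next to every row (the employees table and its columns are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Each employee's salary alongside their department's average
SELECT employee_name,
       department,
       salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;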

&lt;h4&gt;
  
  
  2. Ranking Window Functions
&lt;/h4&gt;

&lt;p&gt;They are used to assign position or rank.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ROW_NUMBER() - assigns a unique sequential number to each row.&lt;/li&gt;
&lt;li&gt;RANK() - assigns ranks with gaps when ties exist.&lt;/li&gt;
&lt;li&gt;DENSE_RANK() - similar to RANK() but does not skip numbers; better for ranking reports where gaps are not desired.&lt;/li&gt;
&lt;li&gt;PERCENT_RANK() - calculates the relative rank of a row within a group of rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some use cases of Ranking window functions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top N per group&lt;/li&gt;
&lt;li&gt;Performance ranking&lt;/li&gt;
&lt;li&gt;Leaderboards&lt;/li&gt;
&lt;li&gt;Percentile analysis&lt;/li&gt;
&lt;/ul&gt;
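&lt;p&gt;A sketch comparing the three main ranking functions on an illustrative employees table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rank employees by salary within each department
SELECT employee_name,
       department,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
       DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_salary_rank
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;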

&lt;h4&gt;
  
  
  3. Offset (Value) Window Functions
&lt;/h4&gt;

&lt;p&gt;They are used to access data from other rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LAG() - returns the previous row's value; used in time-based analysis.&lt;/li&gt;
&lt;li&gt;LEAD() - returns the next row's value; used in time-based analysis.&lt;/li&gt;
&lt;li&gt;FIRST_VALUE() - returns the first value in an ordered set of values within a partition.&lt;/li&gt;
&lt;li&gt;LAST_VALUE() - returns the last value in an ordered set of values within a partition.&lt;/li&gt;
&lt;li&gt;NTH_VALUE() - returns the value of the nth row in an ordered partition. (NTILE(), by contrast, divides rows into equal groups and is useful in performance analysis and segmentation.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some use cases of Offset window functions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Month-over-month growth&lt;/li&gt;
&lt;li&gt;Time-series comparison&lt;/li&gt;
&lt;li&gt;Trend detection&lt;/li&gt;
&lt;li&gt;Sequential analysis&lt;/li&gt;
&lt;/ul&gt;
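&lt;p&gt;For example, month-over-month change can be computed with LAG() (the monthly_sales table is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare each month's sales to the previous month
SELECT sale_month,
       total_sales,
       LAG(total_sales) OVER (ORDER BY sale_month) AS prev_month_sales,
       total_sales - LAG(total_sales) OVER (ORDER BY sale_month) AS monthly_change
FROM monthly_sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;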

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SQL joins and window functions are core tools for designing efficient and powerful queries.&lt;br&gt;
Joins allow you to combine data from multiple tables using defined relationships while Window functions provide an advanced analytical layer in SQL.&lt;/p&gt;

</description>
      <category>luxdev</category>
      <category>sql</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Power of BI; Translating Messy Data, DAX, and Dashboards into Action</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Mon, 09 Feb 2026 17:59:06 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/the-power-of-bi-translating-messy-data-dax-and-dashboards-into-action-3kmj</link>
      <guid>https://dev.to/lawrence_murithi/the-power-of-bi-translating-messy-data-dax-and-dashboards-into-action-3kmj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the real world, data rarely comes in a clean and perfect format. Most of the time, it comes from multiple systems, created by different people, and maintained with different rules. It may have missing values, inconsistent naming, or outdated records. Databases may store the same information in different ways. This is where analysts come in. Their work is not just about building reports, but about turning raw, messy data into clear insights that drive real business actions. Power BI is one of the main tools that helps them do this effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Business Before Touching the Data
&lt;/h2&gt;

&lt;p&gt;Good analysts first try to understand the problem they need to solve before embarking on the data analysis journey. They ask questions like: What decision needs to be made? Who will use the report? What actions should the dashboard support? Without this context, even the best-looking dashboard can fail.&lt;br&gt;
Understanding who the audience is and what they need helps analysts decide what data to use, what calculations matter, and what level of detail is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Sense of Messy Data
&lt;/h2&gt;

&lt;p&gt;Most data comes in a rough and messy state, therefore, a majority of an analyst’s time is spent cleaning and preparing the data to make sure it is reliable. Power BI’s Power Query tool is designed for this task. Analysts use it to load data from many sources such as Excel files, SQL databases, APIs, and cloud platforms. Power Query allows analysts to apply repeatable step-by-step transformations to clean the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common data problems analysts handle
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Duplicate records that inflate totals&lt;/li&gt;
&lt;li&gt;Different spellings or codes for the same category&lt;/li&gt;
&lt;li&gt;Incomplete dates or incorrect data types&lt;/li&gt;
&lt;li&gt;Columns that mix multiple values in one field&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building a Strong Data Model
&lt;/h2&gt;

&lt;p&gt;After cleaning the data, analysts focus on building a proper data model. This includes defining relationships between tables, choosing the correct granularity, and organizing data in a way that supports accurate analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Characteristics of a well-designed data model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Improves performance&lt;/li&gt;
&lt;li&gt;Makes DAX calculations easier&lt;/li&gt;
&lt;li&gt;Reduces confusion for report users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analysts often use star schemas, separating fact tables from dimension tables, to keep the model simple and efficient. This step is invisible to most users, but is critical for reliable results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using DAX to Add Meaning to the Data
&lt;/h2&gt;

&lt;p&gt;Raw data alone does not answer business questions. DAX (Data Analysis Expressions) helps analysts turn raw numbers into useful metrics and insights. Analysts use DAX to create measures that reflect real performance. For example, instead of showing only total sales, DAX can show how sales compare across different months or years, leading to better decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples of Insights obtained using DAX
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Year-over-year growth&lt;/li&gt;
&lt;li&gt;Employee turnover rate&lt;/li&gt;
&lt;li&gt;Running totals and averages&lt;/li&gt;
&lt;li&gt;Comparison of current performance to past periods&lt;/li&gt;
&lt;li&gt;Percentages, ratios, and growth rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building Dashboards That Tell a Story
&lt;/h2&gt;

&lt;p&gt;Once the data and calculations are ready, analysts design dashboards with the end user in mind. Dashboards are not just charts but are tools for communication. They should guide users toward key insights without overwhelming them. The goal is clarity, not complexity. A good dashboard should have the right visuals, avoid clutter and highlight the most important numbers hence telling a clear story. Filters and slicers allow users to explore the data without needing technical skills.&lt;br&gt;
A good dashboard should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The right chart type for each metric&lt;/li&gt;
&lt;li&gt;Logical layout and flow of information&lt;/li&gt;
&lt;li&gt;Clear labels, titles, and tooltips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good Dashboards help users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quickly see what is going well&lt;/li&gt;
&lt;li&gt;Spot problems early&lt;/li&gt;
&lt;li&gt;Ask better questions&lt;/li&gt;
&lt;li&gt;Drill down to find the cause&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Turning Insights into Real Business Actions
&lt;/h2&gt;

&lt;p&gt;The real success of a dashboard is determined by the actions it supports. A good dashboard helps teams respond faster and make better decisions. This could include setting targets, tracking performance, or identifying risks early.&lt;br&gt;
Since dashboards can refresh automatically, decisions are based on up-to-date information rather than static reports. Alerts and shared reports also ensure that insights reach the right people at the right time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring Impact and Improving Over Time
&lt;/h2&gt;

&lt;p&gt;Analyst work does not end after publishing a dashboard. They gather feedback, monitor usage, and refine reports over time. As business needs change, dashboards must evolve with them.&lt;/p&gt;

&lt;p&gt;Over time, Power BI reports can show measurable impact, such as reduced costs, improved performance, faster reporting cycles, or better accountability. These outcomes demonstrate how technical skills translate into real business value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Analysts act as a bridge between raw data and business decisions. They clean messy data, use DAX to add meaning, and design dashboards that help people understand what is happening and what to do next. While Power BI provides the tools, it is the analyst’s understanding of data and business that turns information into action.&lt;/p&gt;

&lt;p&gt;The value of Power BI, therefore, is not in the visuals or formulas alone but lies in how analysts use it to support smarter decisions and create measurable impact across the organization.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>luxdev</category>
    </item>
    <item>
      <title>Schemas and Data Modelling in Power BI</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Tue, 03 Feb 2026 15:19:33 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/schemas-and-data-modelling-in-power-bi-2ja6</link>
      <guid>https://dev.to/lawrence_murithi/schemas-and-data-modelling-in-power-bi-2ja6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Power BI is a business intelligence tool used to turn raw data into meaningful reports and dashboards. It allows organizations to analyze data, track performance, and make informed decisions. However, the quality of insights produced by Power BI depends heavily on how the data is structured behind the scenes. Good visuals and advanced calculations cannot compensate for poorly designed data.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Schemas and Data Modelling
&lt;/h3&gt;

&lt;p&gt;Data modelling is the process of organizing data into tables and defining how those tables relate to each other. A schema is the structure or design of this data model. &lt;br&gt;
Schemas and data modelling define how data is organized, connected, and interpreted in Power BI. They determine how tables relate to each other, how filters flow across the model, and how calculations are performed. &lt;/p&gt;
&lt;h4&gt;
  
  
  Characteristics of a Data Model
&lt;/h4&gt;

&lt;p&gt;A good data model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Makes reports faster and more responsive&lt;/li&gt;
&lt;li&gt;Produces correct totals and calculations&lt;/li&gt;
&lt;li&gt;Is easy to understand and maintain&lt;/li&gt;
&lt;li&gt;Works naturally with DAX measures and visuals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A bad data model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slows down reports&lt;/li&gt;
&lt;li&gt;Produces wrong or inconsistent numbers&lt;/li&gt;
&lt;li&gt;Forces complex and hard-to-read DAX formulas&lt;/li&gt;
&lt;li&gt;Confuses report users and developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article explains schemas and data modelling in Power BI, focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Star schema&lt;/li&gt;
&lt;li&gt;Snowflake schema&lt;/li&gt;
&lt;li&gt;Fact and dimension tables&lt;/li&gt;
&lt;li&gt;Relationships&lt;/li&gt;
&lt;li&gt;Importance of good modelling for performance and accurate reporting&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Understanding Data Modelling in Power BI
&lt;/h2&gt;

&lt;p&gt;Data modelling happens after data is loaded from sources such as Excel, databases, or cloud systems. The model is built in the Model view, where tables and relationships are defined.&lt;br&gt;
A Power BI data model usually includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fact tables (business events or measurements)&lt;/li&gt;
&lt;li&gt;Dimension tables (descriptive information)&lt;/li&gt;
&lt;li&gt;Relationships between tables&lt;/li&gt;
&lt;li&gt;A schema design (such as star or snowflake)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Fact Tables and Dimension Tables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable business data, i.e., the numerical data that you want to analyze and measure.&lt;br&gt;
Characteristics of fact tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very large, with many rows&lt;/li&gt;
&lt;li&gt;Contain numeric values used in calculations&lt;/li&gt;
&lt;li&gt;Contain keys that link to dimension tables&lt;/li&gt;
&lt;li&gt;Grow over time as new transactions are added.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of fact data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales amount&lt;/li&gt;
&lt;li&gt;Quantity sold&lt;/li&gt;
&lt;li&gt;Profit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; store descriptive information that helps explain the facts. Dimension tables are used for filtering, grouping, and slicing data in reports.&lt;/p&gt;

&lt;p&gt;Characteristics of dimension tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller than fact tables&lt;/li&gt;
&lt;li&gt;Mostly text and categorical data&lt;/li&gt;
&lt;li&gt;Used for filtering, grouping, and slicing&lt;/li&gt;
&lt;li&gt;Change less frequently than facts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of dimension data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product name&lt;/li&gt;
&lt;li&gt;Customer name&lt;/li&gt;
&lt;li&gt;Region&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Relationships in Power BI
&lt;/h2&gt;

&lt;p&gt;Relationships in Power BI define how tables are connected and how data flows between them. A relationship is usually created between a key column in one table and a matching column in another table. These keys allow Power BI to link descriptive data from dimension tables to numerical data in fact tables. Relationships are mainly defined by cardinality, direction, and status.&lt;/p&gt;
&lt;h3&gt;
  
  
  Types of Relationships
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One-to-Many&lt;/strong&gt; - This is the most common and recommended relationship where one record in a dimension table matches many records in a fact table.&lt;br&gt;
&lt;strong&gt;One-to-One&lt;/strong&gt; - One row in one table matches one row in another table.&lt;br&gt;
&lt;strong&gt;Many-to-One&lt;/strong&gt; - Many rows in the fact table match one row in the dimension table.&lt;br&gt;
&lt;strong&gt;Many-to-Many&lt;/strong&gt; - Multiple rows in one table match multiple rows in another table; often used when there is no unique key.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Relationships Matter
&lt;/h3&gt;

&lt;p&gt;Good relationships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure correct totals and aggregations&lt;/li&gt;
&lt;li&gt;Control how slicers and filters behave&lt;/li&gt;
&lt;li&gt;Improve report performance&lt;/li&gt;
&lt;li&gt;Make DAX measures simpler and easier to maintain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the contrary, poorly defined relationships often result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong numbers&lt;/li&gt;
&lt;li&gt;Missing data in visuals&lt;/li&gt;
&lt;li&gt;Confusing filter behavior&lt;/li&gt;
&lt;li&gt;Slow reports&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Best Practices for Relationships in Power BI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use one-to-many relationships wherever possible&lt;/li&gt;
&lt;li&gt;Connect dimension tables to fact tables, not dimension to dimension&lt;/li&gt;
&lt;li&gt;Use numeric surrogate keys instead of text&lt;/li&gt;
&lt;li&gt;Avoid unnecessary many-to-many relationships&lt;/li&gt;
&lt;li&gt;Use single-direction filtering by default&lt;/li&gt;
&lt;li&gt;Keep the model simple and clear&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Star Schema
&lt;/h2&gt;

&lt;p&gt;A star schema is the recommended data model in Power BI and consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One central fact table&lt;/li&gt;
&lt;li&gt;Multiple dimension tables connected directly to the fact table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A star schema structure looks like a star, with the fact table in the center and dimension tables branching out around it.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Date
         |
Product — Sales — Customer
         |
       Region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits of Star Schema
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple and easy to understand&lt;/li&gt;
&lt;li&gt;Faster query performance&lt;/li&gt;
&lt;li&gt;Fewer relationships&lt;/li&gt;
&lt;li&gt;Easier DAX calculations&lt;/li&gt;
&lt;li&gt;Better compatibility with Power BI’s engine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Power BI can process queries more efficiently because dimension tables are not connected to each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;A snowflake schema is a more complex version of the star schema. In this structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dimension tables are normalized&lt;/li&gt;
&lt;li&gt;Dimension tables are connected to other dimension tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sales → Product → Category → Department
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits of Snowflake Schema
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reduces data redundancy&lt;/li&gt;
&lt;li&gt;Useful for very large or complex dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges of Snowflake Schema
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;More complex relationships&lt;/li&gt;
&lt;li&gt;Slower performance due to extra joins&lt;/li&gt;
&lt;li&gt;More complex DAX measures&lt;/li&gt;
&lt;li&gt;Harder for users to understand&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Good Data Modelling is Critical
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt; - Power BI uses an in-memory engine. A clean star schema reduces joins and improves query speed. Poor models can cause reports to load slowly or even fail.&lt;br&gt;
&lt;strong&gt;Accurate Reporting&lt;/strong&gt;  - Correct relationships and proper table design ensure that filters and totals behave correctly. Bad modelling often leads to duplicated values or missing data.&lt;br&gt;
&lt;strong&gt;Simpler DAX&lt;/strong&gt; - DAX formulas are easier to write and maintain when the model is simple. Complex schemas often require complicated formulas, which are harder to debug.&lt;br&gt;
&lt;strong&gt;Better User Experience&lt;/strong&gt; - Business users prefer models that are easy to understand. Clear table names, logical relationships, and simple structures help users create reports without confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Modelling Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using many-to-many relationships unnecessarily&lt;/li&gt;
&lt;li&gt;Mixing transactional and lookup data in one table&lt;/li&gt;
&lt;li&gt;Using bi-directional relationships everywhere&lt;/li&gt;
&lt;li&gt;Not creating a proper date dimension&lt;/li&gt;
&lt;li&gt;Loading unnecessary columns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; Avoiding these mistakes improves both performance and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Good data modelling is the foundation of effective Power BI reporting. Visuals and calculations only work well when the underlying model is designed correctly. A clean star schema with clear fact and dimension tables improves performance, ensures accurate results, simplifies DAX, and makes reports easier to build and maintain. Without proper modelling, even the best visuals cannot deliver correct insights.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>schema</category>
      <category>datamodelling</category>
    </item>
    <item>
      <title>Introduction to Linux for Data Engineers</title>
      <dc:creator>Lawrence Murithi</dc:creator>
      <pubDate>Mon, 26 Jan 2026 16:28:35 +0000</pubDate>
      <link>https://dev.to/lawrence_murithi/introduction-to-linux-for-data-engineers-29jn</link>
      <guid>https://dev.to/lawrence_murithi/introduction-to-linux-for-data-engineers-29jn</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Linux is one of the most important tools for data engineers. Most data systems today run on Linux servers, including cloud platforms, databases, and big data tools like Hadoop and Spark. Understanding Linux basics is, therefore, a key skill for anyone starting a career in data engineering.&lt;/p&gt;

&lt;p&gt;This article introduces Linux in a simple way. It explains why Linux is important for data engineers, shows basic Linux commands, and demonstrates how to create and edit files using Vi and Nano, which are common Linux text editors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Linux for Data Engineers
&lt;/h3&gt;

&lt;p&gt;Linux is important for data engineers for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most data pipelines run on Linux servers&lt;/li&gt;
&lt;li&gt;Cloud platforms like AWS, Azure, and Google Cloud use Linux&lt;/li&gt;
&lt;li&gt;Tools such as Hadoop, Spark, Airflow, and Kafka are built for Linux&lt;/li&gt;
&lt;li&gt;Linux is stable, secure, and efficient for large data processing&lt;/li&gt;
&lt;li&gt;Data engineers often work with log files, configuration files, and scripts written in Python, SQL, or Bash. Linux makes it easy to manage these files directly from the terminal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Basic Linux Commands For Beginners
&lt;/h3&gt;

&lt;p&gt;Linux commands are instructions typed in the terminal to tell the operating system what to do, such as creating files, moving between folders, or running programs. They allow users to interact directly with the system in a fast and efficient way. Linux commands help manage files, automate tasks and work effectively on servers, which is essential in data engineering and software development.&lt;br&gt;
Below are some beginner-friendly Linux commands.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ssh root@IP&lt;/strong&gt; - Connects to a remote server (replace IP with the server's address)
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgnls3mkhgft5boas1i2.jpg" alt="connect to server" width="800" height="203"&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;pwd&lt;/strong&gt; - Shows current directory&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F587xtbaj2j82213ilg0k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F587xtbaj2j82213ilg0k.jpg" alt="current directory" width="800" height="67"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ls&lt;/strong&gt; - Shows all files and folders in the current directory&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mpamitb90f2ygfrxgn2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mpamitb90f2ygfrxgn2.jpg" alt="list of files" width="800" height="80"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cd&lt;/strong&gt; - Changes directory and &lt;strong&gt;cd ..&lt;/strong&gt; moves back one level&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev19c7uurtluu6dnt6kr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev19c7uurtluu6dnt6kr.jpg" alt="Changes directory" width="800" height="54"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;mkdir&lt;/strong&gt; - Creates a new directory&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6et04w276yxlpbo7uqus.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6et04w276yxlpbo7uqus.jpg" alt="Creates a new folder" width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;touch&lt;/strong&gt; - Creates an empty file&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzail6y4d6jb2n1dxk3fd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzail6y4d6jb2n1dxk3fd.jpg" alt="Creates a new file" width="800" height="58"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cp&lt;/strong&gt; - Copies a file and &lt;strong&gt;cp -r&lt;/strong&gt; copies a folder&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16obwrr1ufd3g8ybl85c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16obwrr1ufd3g8ybl85c.jpg" alt="copy files" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;mv&lt;/strong&gt; - Moves or renames files&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywehezjr2voj3kb9omxn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywehezjr2voj3kb9omxn.jpg" alt="rename file/folder" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;rm&lt;/strong&gt; - Deletes a file and &lt;strong&gt;rm -r&lt;/strong&gt; deletes a folder&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70xipdqhtewyj4jpe7dg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70xipdqhtewyj4jpe7dg.jpg" alt="delete file/folder" width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cat&lt;/strong&gt; - Displays file content&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8kkeo6w9ctdxvc2z6g4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8kkeo6w9ctdxvc2z6g4.jpg" alt="display file content" width="800" height="60"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
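&lt;p&gt;The commands above can be chained into a short practice session. The directory and file names below are made up for illustration; run this in a scratch location:&lt;/p&gt;

```shell
set -e                        # stop on the first error

mkdir -p demo_project         # mkdir: create a new directory
cd demo_project
pwd                           # pwd: print the current directory
touch notes.txt               # touch: create an empty file
echo "hello pipeline" > notes.txt
cat notes.txt                 # cat: display file content
cp notes.txt backup.txt       # cp: copy a file
mv backup.txt archive.txt     # mv: rename (or move) a file
ls                            # ls: list files in the directory
rm archive.txt                # rm: delete a file
cd ..
```

&lt;p&gt;Each line maps directly to one of the commands introduced above; &lt;strong&gt;set -e&lt;/strong&gt; simply stops the script if any step fails.&lt;/p&gt;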

&lt;p&gt;&lt;strong&gt;Linux Vi and Nano Text Editors&lt;/strong&gt;&lt;br&gt;
Linux editors are programs used to create, open, and edit text files directly from the terminal. They are important because many configuration files, scripts, and logs in Linux are text-based. Data engineers and developers often use Linux editors when working on servers where graphical tools are not available.&lt;br&gt;
Some of the common Linux text editors are Vi and Nano. &lt;br&gt;
&lt;strong&gt;1. Nano Editor&lt;/strong&gt;&lt;br&gt;
Nano is a simple and beginner-friendly editor.&lt;br&gt;
To open or create a file with Nano:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22el82mc6eivsybispy4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22el82mc6eivsybispy4.jpg" alt="opening nano" width="800" height="162"&gt;&lt;/a&gt;&lt;br&gt;
The command opens the window below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xzfjrwjr1ts9jcapa8f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xzfjrwjr1ts9jcapa8f.jpg" alt="nano editor" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Other nano commands
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Ctrl + O&lt;/strong&gt; - Saves the file&lt;br&gt;
&lt;strong&gt;Ctrl + X&lt;/strong&gt; - Exits Nano&lt;br&gt;
&lt;strong&gt;Ctrl + G&lt;/strong&gt; - Shows help&lt;br&gt;
&lt;strong&gt;Ctrl + W&lt;/strong&gt; - Searches for text&lt;br&gt;
&lt;strong&gt;Ctrl + K&lt;/strong&gt; - Cuts (removes) a line&lt;br&gt;
&lt;strong&gt;Ctrl + U&lt;/strong&gt; - Pastes a cut line&lt;br&gt;
&lt;strong&gt;Ctrl + A&lt;/strong&gt; - Moves the cursor to the start of the line&lt;br&gt;
&lt;strong&gt;Ctrl + E&lt;/strong&gt; - Moves the cursor to the end of the line&lt;br&gt;
&lt;strong&gt;Ctrl + C&lt;/strong&gt; - Shows the current line and column&lt;br&gt;
&lt;strong&gt;Ctrl + _&lt;/strong&gt; - Goes to a specific line number&lt;/p&gt;
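&lt;p&gt;Because Nano is interactive, it cannot run inside an automated script. When a script (rather than a person) needs to write a file, a shell heredoc is a common non-interactive stand-in. The file name and contents below are made up for illustration:&lt;/p&gt;

```shell
# Write a small config file non-interactively, as you might otherwise do in Nano.
# Quoting 'EOF' makes the shell write the contents literally, with no variable expansion.
cat > pipeline.conf << 'EOF'
source=/data/raw
target=/data/clean
EOF

cat pipeline.conf   # verify the result, much as Ctrl + O then Ctrl + X would leave it
```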

&lt;p&gt;&lt;strong&gt;2. Vi/Vim Editor&lt;/strong&gt;&lt;br&gt;
Vi is a powerful editor and widely used in professional environments.&lt;br&gt;
Vi has 3 main modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal mode - navigation and commands&lt;/li&gt;
&lt;li&gt;Insert mode - typing text&lt;/li&gt;
&lt;li&gt;Visual mode - selecting text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To open a file in the Vi editor:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xlo6qvida5587lliorm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xlo6qvida5587lliorm.jpg" alt="open vi editor" width="800" height="199"&gt;&lt;/a&gt;&lt;br&gt;
Running the command opens the window below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefoqg9hnzcivrrm3ce4t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefoqg9hnzcivrrm3ce4t.jpg" alt="vi editor" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Other vi commands
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Entering Insert Mode
&lt;/h5&gt;

&lt;p&gt;i - insert before cursor&lt;br&gt;
a - append after cursor&lt;br&gt;
o - open new line below&lt;br&gt;
I - insert at beginning of line&lt;br&gt;
A - append at end of line&lt;/p&gt;

&lt;h5&gt;
  
  
  Saving and Quitting
&lt;/h5&gt;

&lt;p&gt;:w - save (write)&lt;br&gt;
:q - quit&lt;br&gt;
:wq or ZZ - save and quit&lt;br&gt;
:q! - quit without saving&lt;br&gt;
:w filename - save as new file&lt;/p&gt;

&lt;h5&gt;
  
  
  Navigation Commands
&lt;/h5&gt;

&lt;p&gt;h - Move left&lt;br&gt;
l - Move right&lt;br&gt;
j - Move down&lt;br&gt;
k - Move up&lt;br&gt;
gg - Go to start of file&lt;br&gt;
G - Go to end of file&lt;br&gt;
0 - Start of line&lt;br&gt;
$ - End of line&lt;/p&gt;

&lt;h5&gt;
  
  
  Editing Commands
&lt;/h5&gt;

&lt;p&gt;x - delete character&lt;br&gt;
dd - delete line&lt;br&gt;
yy - copy line&lt;br&gt;
p - paste below&lt;br&gt;
P - paste above&lt;br&gt;
u - undo&lt;br&gt;
Ctrl+r - redo&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Linux is a core skill for data engineers because it is used in servers, cloud platforms, and data tools. Basic Linux commands help you move around the system and manage files. &lt;br&gt;
Learning Linux early makes it easier to work with data pipelines, scripts, and production systems.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>vim</category>
      <category>luxdev</category>
    </item>
  </channel>
</rss>
