DEV Community: Lawrence Murithi

Building the Pipes: Core Data Engineering Concepts Explained

Lawrence Murithi — Wed, 27 May 2026 20:53:29 +0000

Introduction

Data engineering is the practice of designing and building systems for collecting, storing, transforming, and managing data so it can be safely used for reporting, analytics, machine learning, and making business decisions. Think of it as the behind-the-scenes work that makes apps, websites, and businesses function.
Every modern company depends on data. If data is like water, data engineers are the plumbers. They build the pipes, water filters, and reservoirs so that clean, safe water comes out of the tap exactly when you need it.
This article explains some of the most important foundational concepts in data engineering using practical examples and simple language.

1. Batch vs Streaming Ingestion

Ingestion is simply a fancy word for bringing data from an outside source into your own system. There are two main ways to do this.
Batch Ingestion - The process of bringing data from an outside source into your system in large, pre-scheduled chunks. It is like filling up a bucket at a well. You wait until a specific time, gather a large chunk of data, and move it all at once. Think of it as a slow but steady cargo ship.

Characteristics

Processes high volumes of data at once
Runs on scheduled intervals (e.g., hourly or daily)
Higher latency (delay)
Highly cost-effective.

Tools: AWS Glue, Apache Airflow, Airbyte.

Examples: Processing all of the daily store sales every night at midnight, or running weekly customer reports every Sunday.

Pros: It is much cheaper, easier to manage, and great for moving large historical files.
Cons: You do not get information in real-time; you have to wait for the next batch to finish.

Streaming Ingestion - The process of bringing data into a system continuously the exact moment it is created. It is like turning on a faucet. Data flows continuously into the system. Think of it as a fast but expensive sports car.

Characteristics

Operates in real-time or near real-time
Handles continuous data feeds
Very low latency
Requires always-on compute resources.

Tools: Apache Kafka, Google Cloud Pub/Sub.

Examples: Live GPS tracking for delivery trucks, live stock market updates, or instant credit card fraud detection.

Pros: You get immediate insights and can react to problems instantly.
Cons: Systems are much more complex to build and cost a lot more money to keep running 24 hours a day, 7 days a week.

2. Change Data Capture (CDC)

A technique used to identify, capture, and deliver only the data changes (inserts, updates, deletes) in a database to a downstream system in real-time.
Imagine you have a giant printed phone book with one million numbers, and ten people change their phone numbers today. If you want to share the updated phone book with a friend, printing a whole new book is a massive waste of paper and time. Instead, you just hand your friend a small sticky note saying, Update these 10 numbers. Change Data Capture (CDC) is that sticky note.
How it works: Instead of copying an entire database table over and over again to keep a secondary system updated (which would slow down the computers), CDC tools constantly monitor the original database. They capture only the new records, the updated records, or the deleted records. If a database has 1 million rows, and only 20 change today, CDC only moves those 20 changes.
This saves storage space, lowers computing costs, and keeps data pipelines moving very quickly. Engineers usually use log-based CDC, which silently reads the database's hidden background activity log so it doesn't slow down the main system that customers are using.

Characteristics

Highly efficient
Minimizes network traffic
Reduces computing load on source systems
Reads transaction logs rather than querying tables directly.

Tools: Debezium, AWS Database Migration Service (DMS), Oracle GoldenGate.

Pros: Highly efficient (saves bandwidth/compute), provides near real-time updates, captures exact deleted records, and has a very low impact on the source database's performance.
Cons: Highly complex to configure, fragile to sudden database schema changes, and executing the "initial load" snapshot before streaming begins can be difficult.

3. Idempotency

A property of operations in computer science where performing a task multiple times yields the exact same result as performing it just once.
It guarantees safety in a world where computers glitch all the time. Think of an elevator button. If you press the button for the 5th floor once, you go to the 5th floor. If you get impatient and press it ten times, you still just go to the 5th floor. Nothing extra happens.

Characteristics

Highly fault-tolerant
Relies on unique ID tags or primary keys
Prevents data corruption
Ensures absolute safety during automatic system retries.

Example: In data pipelines, tasks often fail and automatically try again. Imagine a customer buys a shirt, but the internet connection glitches and the payment system automatically retries the transaction. Without idempotency, the customer accidentally gets charged twice. With idempotency, the system recognizes the duplicate request and ignores it. Engineers achieve this by giving every piece of data a unique ID tag or using special rules so data is never corrupted or duplicated by automatic retries.

Pros: Guarantees data accuracy, prevents duplicate records, and makes pipelines highly fault-tolerant.
Cons: Requires extra engineering effort to design, necessitates state tracking or unique IDs, and can introduce a slight performance overhead.

4. OLTP vs OLAP

These are two different types of databases built for very different jobs.
OLTP (Online Transaction Processing) - Systems designed to handle day-to-day business operations by instantly executing fast, small database actions.
How it works: Focus on the word Transaction. Think of using an ATM or buying something at an e-commerce checkout. These systems need to look up a price, add a single order, or update a bank balance instantly. They are built for incredibly fast, very small actions.

Characteristics

Uses highly normalized data
Features incredibly fast write speeds
Supports high volumes of simultaneous users
Uses row-based storage.

Tools: OLTP databases include MySQL, PostgreSQL, and SQL Server.

Pros: Incredibly fast for single-record inserts/updates and supports thousands of live users simultaneously.
Cons: Terrible for analyzing massive historical datasets or running complex aggregations.

OLAP (Online Analytical Processing) - Systems built for complex reporting, data mining, and researching massive datasets.
How it works: Focus on the word Analytical. Think of a manager looking at sales trends over the last five years. They do not care about one specific receipt but want to add up millions of receipts to see the big picture. OLAP databases handle massive, complex math queries and historical analysis.

Characteristics

Uses denormalized data
Incredibly fast read speeds
Handles heavy mathematical aggregations
Uses columnar storage.

Tools: Popular examples (often called data warehouses) include Snowflake, Google BigQuery, and Amazon Redshift.

Pros: Excellent for complex math, reporting, and scanning millions of rows instantly.
Cons: Very slow for single-row updates and not built to handle live application transactions.

5. Columnar vs Row-based Storage

The physical method by which data is saved onto a computer's hard drive, organized either by rows or by columns.

Row-based Storage

How it works: Saves data row by row, exactly like reading a book from left to right. If you have a table with ID, Name, and Salary, it saves 1, John, 5000, together in one sentence. This is perfect for the fast OLTP systems. If John logs into an app, the system can grab his entire profile in one quick read.

Characteristics
Optimizes write operations
Retrieves entire single records instantly
Ideal for OLTP transaction systems.

Tools/Formats: CSV and standard relational database tables.

Pros: Perfect for retrieving entire individual records quickly, highly optimized for fast write operations.
Cons: Slow and expensive for analytical queries that only need to read one or two specific columns across millions of rows.

Columnar Storage

How it works: Saves data column by column. All the IDs are stored together in one block, all the Names together in another block, and all the Salaries together in a final block. This is perfect for OLAP analytics. If you ask the database, What is the average salary of all employees?, the system only scans the specific file block containing salaries. It ignores the names and IDs entirely, making the math incredibly fast.

Characteristics

Optimizes read operations
Retrieves specific fields quickly while ignoring others
Highly compressible
Ideal for OLAP analytics.

Tools/Formats: Parquet, ORC, and Apache Arrow.

Pros: Extremely fast for analytical reads and highly compressible (saving storage space).
Cons: Very slow for writing single new rows or updating specific records.

6. Partitioning

The process of dividing a massive messy dataset into smaller, perfectly organized folders based on a specific rule (like organizing by Date, Region, or Product Category).
Imagine a huge filing cabinet with ten years of paper invoices stuffed randomly into one drawer. To find an invoice from January 2024, you have to look through every single piece of paper. If you partitioned the cabinet by year, and then divided it again by month, you could immediately open the 2024 drawer, grab the January folder, and find your paper.
How it works: By organizing data into partitions, a computer's search query only looks inside the relevant folders. This stops the computer from searching like a needle in a haystack, which speeds up the search and drastically reduces the cost of running the system.

Characteristics

Drastically reduces query scan time
Improves overall system performance
Lowers cloud computing costs
Organizes data into physical directory paths.

Tools: Apache Spark, Hive, AWS Athena.

Pros: Drastically speeds up search queries and significantly lowers cloud computing costs.
Cons: Over-partitioning can create the small file problem (too many tiny files that slow down the system), and uneven data distribution can cause performance-killing data skew.

7. ETL vs ELT

These are the three steps taken to move and prepare data; Extract (pulling it from the source), Transform (cleaning, filtering, and shaping it), and Load (saving it in its final destination).

ETL (Extract, Transform, Load)

Cleans data before loading it to the final destination, protects the data warehouse from bad data, relies on a separate processing engine, and represents the traditional method.
How it works: You pull data from the source, clean it on a temporary processing server, and then load the cleaned data into the final warehouse. Think of this like washing your muddy vegetables outside in the yard before bringing them into your house. It keeps bad data out of your warehouse, but the temporary server can be slow and rigid.

Tools: Informatica, Talend, IBM DataStage.

Pros: Keeps bad/dirty data out of your final warehouse, eases the compute burden on the target database.
Cons: The temporary transformation server can be a slow, rigid bottleneck.

ELT (Extract, Load, Transform)

Cleans data after it reaches the destination, utilizes the massive scalable compute power of modern cloud warehouses, is highly flexible, and stores raw data natively.
How it works: You pull the raw, messy data and dump it straight into the final warehouse exactly as it is. Then, you use the massive power of the modern cloud warehouse to clean and shape the data inside it. Think of this like bringing all your muddy vegetables straight into a giant, modern kitchen and washing them there because you have a super-fast, industrial sink. This is the modern approach. It is much faster and more flexible, though you have to pay to store all that raw, uncleaned data.

Tools: dbt (data build tool), Fivetran, Matillion.

Pros: Much faster ingestion, highly flexible, leverages the massive scale of modern cloud warehouses.
Cons: Increases storage costs (by saving all raw data) and can clutter the warehouse with messy data if not governed properly.

8. CAP Theorem

A fundamental principle of system design stating that a distributed data store can only simultaneously guarantee two out of three features at the exact same time - Consistency, Availability, and Partition Tolerance.
How it works: The CAP theorem is a strict rule for distributed systems (a network of multiple computers working together).
• Consistency (C) - Every user sees the exact same data at the exact same time, no matter which computer they connect to.
• Availability (A) - The system is always turned on and responds to every single request without failing.
• Partition Tolerance (P) - The system keeps working even if the internet or network connection between the computers suddenly breaks.

Examples: Because network breaks will always happen eventually in the real world, systems must have Partition Tolerance. Therefore, engineers must make a difficult choice during a failure; do you want Consistency or Availability?
• A banking system will choose Consistency. It will refuse to show your account balance (losing Availability) if it cannot guarantee the number is 100% accurate.
• A social media feed will choose Availability. It will keep loading posts for you to scroll through, even if it accidentally misses a comment your friend just left (losing Consistency).

Characteristics

Represents necessary trade-offs in architecture
Forces engineers to prioritize either accuracy or uptime during failures
Dictates database behavior.

Tools (Databases typed by CAP): MongoDB and HBase prioritize CP (Consistency/Partition Tolerance). Cassandra and CouchDB prioritize AP (Availability/Partition Tolerance).

Pros: Provides a clear architectural framework for understanding distributed system limitations and guides safe database selection.
Cons: It forces difficult trade-offs; you cannot have a perfect system, meaning businesses must sacrifice either absolute accuracy or constant uptime during network failures.

9. Windowing in Streaming

The technique of dividing a continuous, never-ending stream of data into finite, manageable time blocks to perform math and calculations.
How it works: When you process data in batches, you know exactly when the data starts and ends. But streaming data is continuous, it never stops. To do math on a continuous stream, engineers have to chop time into smaller, manageable chunks called windows.

Examples:
• Tumbling Window - Fixed blocks of time that do not overlap. For example, counting website clicks from 1:00 to 1:05, and then starting a totally new count from 1:05 to 1:10.
• Sliding Window - Overlapping blocks of time. You might ask for a total count of clicks in the last 10 minutes, but you want that total number updated on your screen every 1 minute.
• Session Window - Based on a specific user's activity. The window opens when the user logs in and stays open as long as the user is clicking around. The window only closes after they put their phone down and are inactive for 30 minutes.

Characteristics

Time-bound
Relies on event-time or processing-time
Handles unbounded data effortlessly
Maintains a system state across intervals.

Tools: Apache Flink, Apache Beam, Spark Streaming.

Pros: Makes calculating infinite continuous data possible and allows for stateful real-time aggregations (like running totals). Cons: Complex to configure (especially handling late-arriving data) and requires high memory usage to maintain the active window states.

10. DAGs and Workflow Orchestration

A DAG is a visual map of dependent tasks, and Workflow Orchestration is the automated system that coordinates and triggers those tasks.
How it works:
DAG (Directed Acyclic Graph) - a one-way flowchart where tasks point in one direction, and they never loop backward. In a data pipeline, a DAG maps out the exact order of steps.

Example: Step 1: Extract Data -> Step 2: Clean Data -> Step 3: Generate Report. The cleaning step cannot start until the extraction step finishes.
Workflow Orchestration - These tools are the traffic cops that manage these DAG flowcharts. They wake up on a strict schedule, trigger the steps in the exact right order, monitor the system for failures, and manage automatic retries if something breaks.

Characteristics

Executes tasks sequentially or in parallel
Visually tracks dependencies
Never loops backwards
Automates scheduling, monitoring, and error handling.

Tools: Apache Airflow, Prefect, Dagster, Mage.

Pros: Provides clear visual monitoring, automates error retries, and completely removes the need for manual scheduling. Cons: Introduces a steep learning curve, and the orchestration tools themselves require dedicated servers and maintenance.

11. Retry Logic and Dead Letter Queues

Built-in error handling mechanisms. Retry logic defines how a system automatically re-attempts failed operations, while a Dead Letter Queue (DLQ) isolates persistently failing data for manual review.
How it works: Computer systems fail all the time. Passwords expire, internet networks drop, and software programs crash. Good data systems plan for these failures in advance.
• Retry Logic - tells the system what to do when it hits an error. If you keep asking a broken system for data every single second, you might break it worse. Instead, engineers use an exponential backoff strategy; if it fails, wait 5 seconds and try again. If it fails again, wait 15 seconds. If it fails again, wait 60 seconds. This gives the broken system time to recover.
• Dead Letter Queues (DLQ) - If a piece of data fails repeatedly and simply will not work, you do not want it to block the rest of the pipeline like a broken car on a highway. Instead, the system moves the failing data to a Dead Letter Queue. This is a special needs fixing folder just for bad messages. The main pipeline keeps running smoothly, and an engineer can manually look at the DLQ folder later to see what went wrong.

Characteristics

Employs exponential backoff strategies
Prevents bad data from bottlenecking pipelines
Isolates errors cleanly
Ensures overall fault tolerance.

Tools: AWS SQS (DLQ feature), RabbitMQ, Apache Kafka, and orchestration tools like Airflow (for retries).

Pros: Prevents transient glitches from crashing pipelines and isolates bad data without stopping the whole system.
Cons: DLQs require manual human review to fix the errors, and poorly configured retries can overload a broken system.

12. Backfilling and Reprocessing

The procedures used to load missing historical data into a system or overwrite existing historical data with corrected calculations.
How it works: Pipelines usually run today's data. But sometimes you need to process data from the past.
• Backfilling - filling in missing history. It is like filling in a blank page in a diary. If a pipeline breaks and is offline for two days, you have an empty gap in your database. Once you fix the system, you must backfill those missing two days to make your records complete again.
• Reprocessing - fixing mistakes. It is like using an eraser to fix a misspelled word in your diary. If your code had a hidden bug and calculated last month's numbers incorrectly, you have to update the code, delete the bad data, and run last month's raw data through the pipeline all over again to get the right answers.

Characteristics

Computationally heavy
Manages historical time-ranges
Fixes bugs or system outages
Requires strict idempotent pipelines to run safely.

Tools: Apache Airflow, dbt, Apache Spark.

Pros: Fixes broken data, restores missing historical records, and ensures analytics are 100% accurate.
Cons: Very expensive in cloud compute costs, time-consuming, and can accidentally duplicate data if pipelines aren't perfectly idempotent.

13. Data Governance

A formal framework of rules, policies, and processes that dictates how a company keeps its data secure, accurate, and organized.
How it works: Without these rules, data becomes a messy, unusable pile. Think of it like running a strict library.
• Data Quality - Are the books in the right section? (Checking the data for missing values, duplicate entries, and invalid text formats).
• Security - Who gets the keys to the rare book room? (Ensuring only authorized people can see sensitive information like passwords or credit card numbers).
• Compliance - Are we following the law? (Meeting strict government regulations like GDPR or HIPAA, which legally dictate how long you can keep data and how you must protect it).
• Metadata Management - Is there an index card catalog? (Tracking exactly where data came from, who owns it, and what the columns actually mean, so business analysts know exactly what they are looking at).

Characteristics

Enforces data quality
Ensures legal regulatory compliance
Manages access controls
Actively catalogs metadata.

Tools: Collibra, Alation, Apache Atlas, Monte Carlo (for Data Quality).

Pros: Builds organizational trust in data, ensures legal compliance, and secures sensitive user information.
Cons: Can slow down agile development, requires heavy organizational buy-in, and enterprise governance tools are often expensive.

14. Time Travel and Data Versioning

The ability to instantly query historical states of a database table (Time Travel) or track explicit, named checkpoints of datasets over time (Data Versioning), preventing accidental data loss and ensuring reproducibility.
How it works: Normally, if you update a row in a standard database, the old information is overwritten and gone forever. If someone accidentally deletes a highly important table, it is a disaster. Modern storage tools solve this with an amazing feature called Time Travel. Every single time a change is made, the system keeps a hidden log of previous states. Think of Time Travel like a save state in a video game, or a massive undo button. Data Versioning is a closely related concept, but instead of automatically logging every second, it acts like Git for data. It allows engineers to manually save and label specific, permanent checkpoints of a dataset (like Dataset Version 1.0).
Example: Using Time Travel, an engineer can write a query that says, Show me what this table looked like yesterday at 4:00 PM. This is heavily used to recover quickly from human mistakes, debug broken code, and officially prove to auditors what the data looked like on a specific date.
Using Data Versioning, a data scientist can tell their system, Run this new algorithm on Dataset Version 2.4, ensuring their machine learning experiments are perfectly repeatable.

Example: An engineer can write a query that says, Show me what this table looked like yesterday at 4:00 PM. This is heavily used to recover quickly from human mistakes, debug broken code, and officially prove to auditors what the data looked like on a specific date in the past.

Characteristics

Utilizes immutable transaction logs
Enables point-in-time recovery
Allows for instant rollbacks
Crucial for auditing and debugging.

Tools: Delta Lake, Apache Iceberg, Snowflake, and DVC (Data Version Control).

Pros: Enables instant rollbacks of human errors, provides perfect auditability, and simplifies debugging.
Cons: Drastically increases storage costs because the system retains hidden copies of all deleted or changed data (which must be managed and periodically purged).

15. Distributed Processing Concepts

The method of splitting massive computational tasks across multiple interconnected computers ( acluster) to solve problems faster than a single machine could.
How it works: There is a limit to how fast one single computer can work. When a dataset is simply too big (billions of rows), a single computer cannot process it without freezing. You have to split the work across multiple computers. To do this effectively, systems rely on four core mechanisms;
Cluster - A group of machines working together as if they were one giant, unified computer. Tools like Apache Spark act as the managers for these clusters, directing the smaller worker machines.
Parallel Processing - Multiple tasks run simultaneously. The manager chops massive files into smaller blocks and sends them to different worker machines to be processed at the exact same time, rather than one by one.
Fault Tolerance - The system continues operating even when machines fail. If one worker machine catches fire and breaks, the manager system just assigns its unfinished block of data to another machine, and the entire job still finishes successfully.
Data Locality - Processing data close to where it is stored to reduce network movement. Sending massive, heavy files back and forth over a network is very slow. Instead of moving the data to the compute power, the manager sends the lightweight processing instructions directly to the specific machine where that chunk of data already sits.
Think of painting a one-mile-long fence. One person painting alone might take 10 hours. But if you divide the fence into ten equal sections and assign ten workers to paint simultaneously, the exact same job takes only 1 hour. This is called parallel processing.
Manager/Worker dynamics - Tools act as the managers for these workers. They chop massive files into smaller blocks and send them to different worker machines. This setup also provides "fault tolerance"—if one worker machine catches fire and breaks, the manager system just assigns its unfinished block of data to another machine, and the job still finishes successfully.

Characteristics

Relies on parallel execution
Uses horizontal scaling (adding more machines)
Splits files into logical blocks
Highly fault-tolerant.

Tools: Apache Spark, Hadoop, Apache Flink.

Pros: Can process virtually infinite amounts of data by scaling horizontally, and is highly fault-tolerant to hardware failures.
Cons: Introduces network overhead, is difficult to debug, and is absolute overkill for small datasets.

Conclusion

Data engineering forms the physical backbone of modern software and business. It might seem intimidating at first, but by understanding these simple foundational concepts, from how data is ingested and stored, to how it is distributed and governed, you can see exactly how companies turn messy, raw information into reliable, valuable insights. Every time you open an app or view a chart, you are seeing the careful work of data engineering in action.

Where Does the Data Go? A Comprehensive Guide to Databases

Lawrence Murithi — Tue, 26 May 2026 13:11:15 +0000

Introduction

Whether you are building a website, a mobile app, or a complex business system, you will inevitably need a secure and efficient place to house your data. This data serves as the digital backbone of your application, and how you manage it directly impacts your system's reliability and performance. To handle this responsibility, developers rely on a database. A database is a specialized, organized system designed to store, manage, and retrieve information with precision and speed.
However, in today’s digital landscape, not all data is created equal. Storing a highly sensitive list of user passwords requires a fundamentally different architectural approach than processing thousands of temperature readings collected per second from a weather sensor. Because modern data comes in various shapes, sizes, and velocities, software engineers have developed different types of databases specifically tailored to handle these diverse requirements.
This article explores the primary types of databases, how they work, when to best utilize them, foundational database concepts, and the pros and cons of each.

1. Relational Databases (SQL)

Relational databases are the traditional filing cabinets of the software world. They have been around since the 1970s and store data in strict, organized tables containing rows and columns—very much like an Excel spreadsheet. The relational part means you can link these tables together. For example, a Customers table can be easily linked to an Orders table using a customer ID number. To talk to these databases, developers use a standard language called SQL (Structured Query Language).
Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.
Uses: Banking systems, inventory management, e-commerce stores, and anywhere you need absolute accuracy.

Pros

Reliability - They use a strict rule system (ACID), guaranteeing that data transactions are processed reliably. If you transfer $10 to a friend, the database ensures your account loses $10 and their account gains $10. If the system crashes halfway through, it safely cancels the whole transaction rather than losing the money.
Standardization - SQL is widely known, highly documented, and supported by almost all data tools.
Complex Querying - You can easily ask complicated questions that pull connected data from dozens of different tables at once.
Data Integrity - Built-in rules prevent orphaned data (like having an order in the system for a customer that has been deleted).

Cons

Rigid Structure - You must define exactly what your tables look like ahead of time. Adding a new column later can be a slow, system-locking process.
Hard to Scale Out - To handle more traffic, you usually have to buy a bigger, more expensive server (scaling up). It is very difficult to spread a relational database across dozens of cheap servers (scaling out).
Performance Bottlenecks - They can become slow when searching through massive, row datasets if not perfectly optimized.

2. NoSQL Databases

NoSQL stands for Not Only SQL. These databases were created in the 2000s when companies realized traditional relational databases could not handle the massive amounts of fast, messy, unstructured data they were collecting. Instead of strict tables and rows, NoSQL databases come in a few different types.

A. Document Databases
These store data as documents (usually JSON format), which look like nested lists. Instead of splitting a user's details across five different tables, you store all their information together in one single document.
Examples: MongoDB, CouchDB.
Uses: User profiles, content management systems (like blogs), and product catalogs.

Pros

Highly flexible - You can add new, unique pieces of information to one document without affecting the others.
Developer-friendly - the documents look exactly like the code objects developers write.

Cons

Not great for complex queries - Cross-referencing data across hundreds of documents is slow.
Data duplication - stores the same piece of info in multiple documents to keep reading fast.

B. Key-Value Stores
This is the simplest type of database. It works just like a dictionary. You have a Key (a unique word or ID) and a Value (the data attached to it). To retrieve the data, you must know the exact key.
Examples: Amazon DynamoDB, Redis.
Uses: Shopping carts, user session data (keeping you logged in), and user preferences.

Pros

Fast and simple to implement
Easy to scale across hundreds of servers.

Cons

You can only search by the exact key. You cannot easily ask the database complex questions like, Show me all users who live in New York.
They also offer no built-in way to link data together.

C. Wide-Column Stores
These look a bit like relational tables, but the names and formats of the columns can change from row to row. They are designed to hold massive amounts of data spread out over many servers.
Examples: Apache Cassandra, HBase.
Uses: Logging user activity, smart home (IoT) device data, massive music or video streaming platforms.

Pros

Can handle millions of data writes per second.
They are highly fault-tolerant and almost never go down, even if several servers in the cluster break. They also compress data very efficiently.

Cons

Complex to set up and manage.
They offer very limited querying options and are slower for reading individual, specific records compared to a Key-Value store.

D. Graph Databases
These databases care more about the relationships between data than the data itself. They store data in nodes (entities/people) and edges (the connections between them).
Examples: Neo4j, Amazon Neptune.
Uses: Social networks (finding friends of friends), fraud detection rings, and recommendation engines (users who bought X also bought Y).

Pros

Makes querying deeply connected, complex relationships incredibly fast.
They also provide highly intuitive, visual ways to look at your data.

Cons

Highly specialized. Its built to do one specific job and cant be used as a normal database.
They are terrible for keeping standard business records, doing simple counting tasks,
Require learning steep, specialized query languages.

3. NewSQL Databases

NewSQL combines the strict, reliable rules of a Relational (SQL) database, but scales and grows across hundreds of servers as effortlessly as a NoSQL database.
Examples: CockroachDB, Google Cloud Spanner.
Uses: Global financial platforms, multi-region software apps, and high-availability gaming backends.

Pros

You get the absolute safety of SQL, but the database automatically balances itself across many servers without manual work.
It survives data center crashes easily and prevents downtime.

Cons

They are highly complex to operate compared to standard SQL databases
Require more expensive hardware or cloud setups to run smoothly
Smaller community of developers if you need troubleshooting help.

4. Vector Databases

Vector databases are the storage brains behind modern Artificial Intelligence. Instead of searching for exact word matches, they turn data (like text, images, or audio) into complex arrays of numbers called embeddings. They find things based on underlying meaning and similarity.
Examples: Pinecone, Milvus, Weaviate.
Uses: AI chatbots (feeding relevant info to models like ChatGPT), facial recognition, and semantic search (searching by context, not just keywords).

Pros

Extremely fast at similarity searches, even when sifting through billions of complex records.
Effortlessly handle unstructured data like audio and video files.

Cons

Very specialized hence cannot be used to store user passwords, process credit card payments, or keep track of inventory.
Yield approximate results (trading 100% accuracy for speed)
Useless for standard business bookkeeping.
Generating the embeddings to put into the database can be computationally expensive.

5. Time-Series Databases

Sometimes, time is the most important piece of data you have. A time-series database is built specifically to handle data that is recorded over time, usually arriving in a constant, heavy stream.
Examples: InfluxDB, Prometheus, TimescaleDB.
Uses: Stock market tracking, server performance monitoring, weather station data.

Pros

Heavily optimized to write data quickly and compress it so it takes up minimal space.
They feature built-in functions to easily answer questions like, What was the average temperature between 2:00 PM and 4:00 PM?
Can automatically delete old data after a set period.

Cons

They are notoriously bad at handling random data updates or deleting specific old records.
Once a timestamped event happens, it is generally treated as permanent history.

6. Search Databases

If you search for red running shoes on a website, a standard database might struggle to quickly find all variations, misspellings, and relevant items. Search databases (search engines) are built specifically to read through massive blocks of text and return highly relevant results in milliseconds.
Examples: Elasticsearch, Meilisearch, Apache Solr.
Uses: E-commerce search bars, scanning millions of computer logs for errors, document/wiki search.

Pros

Lightning-fast full-text search.
They handle typos gracefully (fuzzy matching)
Feature built-in ranking systems so the most relevant results show up at the top.

Cons

They consume a massive amount of memory (RAM) and storage space because they index every single word.
They lack the strict safety rules of a relational database.

7. In-Memory Databases

Standard databases permanently save data onto a server's hard drive. In-memory databases save data directly into the computer's RAM (Main Memory).
Examples: Redis, Memcached.
Uses: Caching (saving a temporary copy of frequently requested data so the main database doesn't have to work as hard), real-time gaming leaderboards, instant messaging.

Pros

Very fast. They retrieve data in fractions of a millisecond, drastically reducing the load on your primary databases.

Cons

RAM is very expensive, meaning storage capacity is limited.
Because data is stored in memory, a sudden power loss or server crash will cause the data to disappear instantly (though modern versions offer background backup tricks to mitigate this).

8. Object-Oriented Databases

These store data exactly how software developers write it in modern programming languages (as objects). Instead of breaking a complex digital item apart to fit into rows and columns, it saves the whole object exactly as it is.
Examples: db4o, ObjectDB.
Uses: Complex engineering systems (CAD designs), 3D modeling, scientific simulations.

Pros

Excellent for highly complex, nested data.
Removes the tedious extra work developers usually have to do to translate their application code into database tables.

Cons

They have a very small developer community
No universally standard query language
Poor interoperability with other data tools
Have mostly been replaced by Document NoSQL databases.

9. Cloud-Native and Serverless Databases

These databases do not run on a physical server that you own. They live entirely in the cloud. The cloud provider handles all the heavy lifting—such as backups, security patching, and adding more compute power when traffic spikes.
Examples: Amazon Aurora, Firebase, Supabase, DynamoDB.
Uses: Fast-moving startups, mobile apps, and systems where user traffic spikes unpredictably.

Pros

Zero operational headaches.
Built-in disaster recovery.
If the database is truly serverless, it scales down to zero when not in use, meaning you only pay for exactly what you consume.

Cons

Vendor lock-in; once you build your app heavily around a specific cloud database, it is difficult to move to a competitor.
Troubleshooting is harder because you do not control the underlying server
Costs can skyrocket unexpectedly if your app suddenly goes viral.

10. Multi-Model Databases

Instead of setting up a document database for user profiles, a graph database for friend connections, and a key-value store for session data, a multi-model database allows you to do all of it in a single system.
Examples: ArangoDB, PostgreSQL (when utilizing extra add-ons).
Uses: Applications with diverse data needs, or small teams that want to keep their software architecture simple.

Pros

You only have one database to learn, monitor, secure, and back up.
All your data stays unified in one place, making agile development easier.

Cons

The jack of all trades, master of none dilemma. They might not be as incredibly fast or efficient at one specific task as a dedicated, single-purpose database would be.
They can also be complex to tune for performance.

A Few Important Database Rules to Know

When comparing these options, developers often use specific acronyms to describe how the database handles data.

ACID vs. BASE
ACID guarantees your data is always 100% correct, verified, and safely recorded before moving on (used by Relational databases).
BASE prioritizes keeping the database online and running fast, even if it means some users briefly see slightly outdated data for a few milliseconds (used by NoSQL databases).

The CAP Theorem
In a system spread across many servers, you can only guarantee two out of these three things;
Consistency - everyone sees the exact same data at the exact same time.
Availability - the system always responds and never goes down .
Partition Tolerance -the system survives even if the servers lose connection to each other.
NB: You must pick a database that makes the right trade-off for your specific business needs.

Sharding vs. Replication
Sharding means chopping your data into pieces and putting different pieces on different servers to make the database hold more total data.
Replication means making exact duplicate copies of your data and putting them on different servers so the data remains safe and readable if one server dies.

Tools Often Mistaken for Databases

Because modern software architecture is complicated, you will often hear names of technologies thrown around that sound like databases, but aren't. They work alongside databases, but serve totally different purposes. Let's clarify a few common ones.

1. Debezium
People often confuse Debezium for a database because it deals heavily with data. In reality, Debezium is a messenger.
It is a Change Data Capture (CDC) tool. Imagine you have a main database running your store, a separate search engine, and a data warehouse for accounting. When a user updates their address in your main database, how do the other systems know? Debezium sits next to your database and constantly reads its internal diary (transaction log). When a change happens, Debezium grabs it and broadcasts a message saying, Hey everyone, User 123 changed their address!
Debezium doesn't store your data permanently. It simply watches for changes and passes the news along.

2. Apache Kafka
Kafka is often mentioned alongside Debezium and is frequently mistaken for a database because it can hold data. However, Kafka is actually an Event Streaming Platform, or a Message Broker.
Think of Kafka as a giant, high-speed conveyor belt. Different parts of your software drop messages onto the conveyor belt, and other parts of your software pick them up.
While Kafka temporarily holds onto messages so they don't get lost, it is not designed for you to look up specific information. You cannot easily query Kafka. Its job is to move data reliably from Point A to Point B, not to act as a filing cabinet.

3. Elasticsearch
While we listed Elasticsearch as a Search Database in the categories above, developers often dump all their primary data into it and treat it like a traditional Relational database.
Elasticsearch is brilliant at reading blocks of text and returning search results in milliseconds. However, it lacks strict ACID compliance. If you use it to permanently store critical things, a server crash could cause you to lose data or get out of sync. It should always be used as a secondary copy of your data strictly for search purposes, functioning alongside a more reliable primary database.

Conclusion

Choosing the right database ultimately comes down to understanding the shape of your data and what you want to achieve with it.
No single database is perfect for everything. The best, most scalable software systems in the world use multiple databases together; a relational database for financial transactions, a search engine for product discovery, an in-memory cache for speed, and a wide-column data warehouse for analytics. This approach is known as polyglot persistence.

A Beginner’s Guide to Apache Kafka: The Engine of Real-Time Data

Lawrence Murithi — Thu, 14 May 2026 07:34:43 +0000

Introduction

Imagine you are running a massive online store. Every second, hundreds of users are clicking items, adding them to carts and making purchases. Your inventory system needs to know about the purchases, your recommendation engine needs to know about the clicks and your security system needs to monitor for fraud.
If you connect every single system directly to each other, you get a tangled, unmanageable mess.
This is the exact problem Apache Kafka was built to solve. Instead of systems talking directly to each other, they all send their data to a central hub (Kafka) and any system that needs that data simply reads it from the hub. This creates a completely decoupled architecture; the system sending the data doesn't need to know anything about the systems receiving it.
This article delves into everything you need to know to understand Apache Kafka, from its history to running your first commands.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform.
Let's break that down.
• Event - This is a record of something that happened (e.g. User A clicked button B at 12:00 PM). These events, also known as messages or records, are the fundamental immutable data structures consisting of a key, value, timestamp and headers that are continuously transmitted.
• Streaming - The data flows continuously in real-time, rather than waiting to be processed in daily batches. Kafka allows you to publish (write) and subscribe to (read) streams of events, store them indefinitely, and process them as they occur.
• Distributed - It doesn't just run on one computer. It runs across many computers working together, making it incredibly fast and virtually impossible to crash.
Think of Kafka as a massive, high-speed, highly organized post office. Senders drop off packages (data) and the post office holds onto them until the receivers come to pick them up. This complete journey of data, from generation and publishing to storage, consumption, and eventual deletion, represents the Kafka lifecycle.

Kafka was originally created at LinkedIn in 2011 by software engineers Jay Kreps, Neha Narkhede and Jun Rao.
LinkedIn was generating billions of data points daily (profile views, messages, connections), and their existing databases and message queues couldn't keep up. They needed a system that could handle these massive amounts of data in real-time without slowing down.
It was named Kafka, after the author Franz Kafka, because he was a writer and the software was an optimized system for writing data. Eventually, LinkedIn gave it out to the Apache Software Foundation, making it free and open-source.

Core Characteristics of Kafka

Thousands of companies such as Netflix, Uber and Airbnb use Kafka for various reasons.
1. High Throughput - Kafka can handle millions of messages per second.
2. Scalable - Kafka expands seamlessly without any downtime. If you need more power, you just add another computer (node) to the Kafka system.
3. Permanent (Durable) - Unlike traditional message queues that delete a message once it is read, Kafka writes data to a hard drive and keeps it for a set amount of time (days, weeks, or forever).
4. Fault-Tolerant - Kafka keeps copies (replicas) of your data on different computers. If one computer crashes or catches fire, the data is still safe on another, and the system automatically switches to the backup without missing a beat.

When to Use Kafka

• Real-time tracking - Tracking website activity (page views, clicks) as it happens.
• Log aggregation - Collecting logs from hundreds of different servers into one central place for monitoring, debugging, or auditing.
• Location tracking - Apps like Uber use Kafka to process the real-time GPS locations of drivers and riders.
• Stream processing - Transforming data on the fly such as using the Kafka Streams API to convert currencies in real-time as transactions happen.
• Event sourcing - Storing state-changing events. Instead of overwriting existing data to save the current state, you permanently record every individual action that led to that state as an append-only log(a database doesn't update a shopping cart's final inventory to show 1 Hat, it records the exact history, Added Shirt, Added Hat and Removed Shirt).
• Data Integration - Using Kafka Connect to continuously pull data from an old database and push it into a new cloud warehouse. In this context, kafka connect is utilized as a specialized tool and framework for scalably and reliably streaming data between Kafka and these external systems without custom code.

When NOT to use Kafka

• You just need a standard database to search for specific records (use SQL or NoSQL). Kafka is designed for sequential reading, not searching for a specific item.
• You only have a small amount of data (Kafka is complex; setting it up for low traffic is overkill).
• You need simple task routing.

Key Kafka Concepts and Rules

To understand Kafka, you need to know its vocabulary and the strict rules that govern how data is managed.
• Event (Message/record) - This is the actual piece of data( immutable record of something that happened). An event consists of a Key, a Value, a Timestamp, and optional metadata headers. The Message Key acts as an optional identifier used for routing the event to a specific partition, while the Message Value is the core payload containing the business data. The Value is the actual data. The Key is optional but important for organizing data. Before an event is sent over the network, it is translated into a binary format (bytes) in a process called serialization.
Serialization is the crucial step of converting these readable data objects into binary bytes for efficient network transmission and storage. Conversely, Deserialization is the reverse process used by consumers to convert those binary bytes back into readable data.
• Producer - The application that sends data into Kafka (e.g. the website frontend sending click data). Producers send/write (publish) messages to topics and decide which partition the data should be sent to.
• Consumer - The application that reads data from Kafka (e.g. the analytics dashboard). Consumers receive/read (subscribe) messages from topics.
• Topic - Its a named stream of events or logical category or channel where related events are continuously published and stored.
If you send user clicks to Kafka, you would send them to a topic named user_clicks. Consumers read from specific topics. Unlike traditional queues, topics have a retention policy. You configure a topic to keep data for 24 hours, 7 days, or until the disk reaches a certain size. Once the limit is hit, the oldest data is automatically deleted.
• Partition - This is the secret to Kafka's speed. A single topic is split into multiple parts called Partitions. They split data across multiple brokers/servers, enabling multiple consumers to read data in parallel. This is the mechanism that allows Kafka to scale and process massive amounts of data concurrently.
Imagine a grocery store with only one checkout lane (one topic), a line forms. If you open 10 checkout lanes (partitions), 10 times as many people can check out at once.
- Rule 1 - Order is only guaranteed within a single partition. If you send messages to Partition 0 and Partition 1, you cannot guarantee which one gets read first. But messages inside each individual partition are read in the exact order they arrive.
- Rule 2 - Keys determine the partition. If a Producer sends an event without a key, Kafka assigns it to a random partition (Round-robin routing). If an event has a key (like customer_id_123), Kafka uses a math formula (hashing) to ensure every event with that same key always goes to the exact same partition. This guarantees all purchases by customer_123 are processed in the correct order.
• Offset - Inside a partition, every single message is assigned a unique, sequential ID number called an Offset (e.g., 0, 1, 2, 3...). These offsets act as unique, ever-increasing integers used to accurately maintain reading positions. Offsets only go up and are never reused. Consumers use offsets to keep a bookmark of their specific reading position so they can resume reading safely after a crash or restart.
• Consumer Groups - A team of consumers working together to read a topic.
- The Golden Rule - A single partition can only be read by one consumer within the same group. If a topic has 4 partitions, and your group has 4 consumers, each gets exactly one partition. If you have 5 consumers in the group, the 5th one sits idle. This is how Kafka scales reading perfectly without processing the same message twice.
• Broker/ Server - A single Kafka node. This individual node is responsible for receiving messages from producers, storing them on disk, and serving them to consumers upon request.
• Cluster - A group of Brokers(servers) working together. Brokers are linked together to operate seamlessly as a single distributed network, providing fault tolerance, high availability, and massive scale. Within a cluster, data is duplicated across brokers using a Replication Factor. Replication factor is a configuration defining the exact number of copies of a partition that must be maintained across different brokers to ensure fault tolerance. If your Replication Factor is 3, three different brokers have a copy of the data.
For each partition, one broker is assigned the Leader (primary broker) and exclusively handles all read and write requests for that specific partition to ensure strict data consistency.
The other brokers become Followers (backup brokers) and they passively replicate the data from the Leader (acting as In-Sync Replicas or ISR) so they can seamlessly and instantly take over if the Leader fails.
• KRaft (Kafka Raft) - The internal manager of Kafka. Kraft is the modern built-in consensus protocol that functions as the internal cluster manager, meaning it is the overarching system responsible for managing broker states, leader elections, and metadata within Kafka. It keeps track of which brokers are online, which broker holds which partition, and handles the recovery if a broker crashes.
Historically, Zookeeper served as the legacy external coordination service that acted as the cluster manager before being phased out. Kafka recently removed ZooKeeper and replaced it with KRaft, which is built directly into Kafka to make it faster and easier to manage the state of the cluster.

How Kafka Works

Here is a detailed flow of how data moves through Kafka, showing Partitions, Offsets, and Consumer Groups:

NB: Two different Consumer Groups can read the exact same messages without interfering with each other. Because Kafka stores the data on disk, Consumer Group 2 (Receipt System) can read the message hours after Consumer Group 1 (Inventory System) read it, simply by starting at an older Offset.

Kafka architecture

To understand Apache Kafka’s architecture, we need to examine its internal design and core components.
Kafka architecture explains how Kafka does what it does. Kafka’s architecture is designed to do three things flawlessly; never lose data, handle millions of messages a second, and scale up without turning the system off.
To understand how it works, we can break Kafka’s architecture into three main areas.
1. The Network Architecture (The Physical Components)
Kafka is a distributed system, meaning it is not just one big computer. It is a network of smaller computers working together as a single unit and comprises of The Kafka Cluster, Brokers, The KRaft Controller (The Manager), Producers and Consumers.

2. The Data Architecture/Storage (Logical Components)
Kafka does not store data in tables like a standard database. It stores data using a concept called an Append-Only Commit Log.
Imagine a physical logbook. When a new message arrives, Kafka writes it at the very bottom of the page. You cannot erase, edit, or insert a message in the middle. You can only append (add) to the end.
This architectural choice is the secret to Kafka's speed. Because it never wastes time searching for a record to update or delete, writing to Kafka is incredibly fast.

How Topics and Partitions fit into the Log.

A Topic is just a logical name for a group of these logbooks.
A Partition is the actual physical logbook file sitting on a Broker's hard drive.
An Offset is the line number in that logbook.

3. High Availability Architecture (Fault Tolerance)
Because hardware fails, Kafka’s architecture assumes that brokers will eventually crash. It protects your data using Replication.
When you create a topic, you set a Replication Factor. If you set a replication factor of 3, Kafka guarantees that three different Brokers will have an exact copy of the data.
For every single partition, Kafka elects one Broker to be the Leader. which Producers and Consumers talk. The other Brokers are become Followers. They do not talk to producers or consumers but copy everything the Leader does in real-time.
If the Leader broker crashes, the KRaft Controller notices immediately. It instantly promotes one of the Followers to become the new Leader. The Producers and Consumers automatically connect to the new Leader, and the system continues without dropping a single message.

4. The Philosophy(Core Design Rules)
Kafka’s architecture relies on a few specific design choices that make it different from almost every other messaging system.

A. Smart Consumers, Dumb Brokers/Servers
In traditional message queues, the server (the queue) is smart. It remembers which consumer read which message and deletes the message after it is read. This puts a heavy workload on the server.
Kafka flipped this architecture. The Kafka Broker is dumb while the Consumer is smart. The server just stores the data and deletes it after a certain time while the consumer tracks its own Offset (its place in the logbook). This takes a massive load off the brokers, allowing them to handle millions of messages per second.

B. The Pull Model
Many systems push data to the consumers. If a sudden spike in traffic happens, the system pushes so much data that it crashes the consumer application.
Kafka uses a Pull architecture. Producers push data into Kafka, but Kafka never pushes data to Consumers. The Consumers pull data from Kafka only when they are ready for it. If the consumer gets overloaded, it just slows down its pulling. The data waits safely on Kafka’s hard drive until the consumer catches up.

C. Using Disk instead of RAM (The OS Page Cache)
While most messaging systems try to keep data in RAM (memory) because it is faster, Kafka writes straight to the hard drive.
because it relies on the Operating System's Page Cache. The OS automatically uses free RAM to temporarily hold the most recently written data. When a consumer asks for the latest data, Kafka actually serves it straight from the OS memory without ever spinning the hard disk, giving you memory-like speeds with hard-drive-like storage capacity.

Visualizing the Architecture

Here is how all these pieces connect in a real-world scenario.

How to Install Apache Kafka (Locally)

To run Kafka on your computer, you need to have Java installed.
Step 1: Download Kafka
Go to the official Apache Kafka Downloads page here and download the latest .tgz file (binaries).

Step 2: Extract the file
Open your terminal and extract the folder

tar -xzf kafka_2.13-4.2.0.tgz

# you can rename the folder 
mv kafka_2.13-4.2.0/ kafka

# Navigate into the folder
cd kafka

Step 3: Generate a Cluster UUID
Since we are using modern Kafka (KRaft mode), first generate a unique ID for the cluster

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

Step 4: Format the Storage

bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties

Step 5: Start the Kafka Server

bin/kafka-server-start.sh config/kraft/server.properties

Leave this terminal window open. Kafka is now running!

Key Kafka Commands for Beginners

Open a new terminal window (keep the server running in the first one) to run these commands.

1. Create a Topic
Before you can send data, you need to create a topic. Let's create one called first_topic.

bin/kafka-topics.sh --create --topic first_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

NB: localhost:9092 is the default address where your local Kafka broker is running. We set --partitions 3 to split the topic into three lanes for speed, and --replication-factor 1 because we only have one local broker running right now.

2. Start a Producer (Send Data)
This command opens a prompt where you can type messages.

bin/kafka-console-producer.sh --topic first_topic --bootstrap-server localhost:9092

Once it starts, type a few lines and hit Enter after each.

Hello Kafka!
This is my first message.

3. Start a Consumer (Read Data)
Open a third terminal window and run the below command to read the data.

bin/kafka-console-consumer.sh --topic first_topic --from-beginning --bootstrap-server localhost:9092

The --from-beginning flag tells the consumer to start reading from Offset 0. You will instantly see the messages you typed in the Producer terminal appear here. If you go back to the Producer terminal and type a new message, it will pop up in the Consumer terminal in real-time.

Conclusion

Apache Kafka is the nervous system of modern data engineering. By sitting in the middle of your architecture, it decouples the applications that create data from the applications that need to use data.
While it can be complex to manage at a massive scale, the core concept remains remarkably simple; Producers write events to Topics, those Topics are split into Partitions for speed, and Consumers use Offsets to read those events whenever they are ready. By leveraging Consumer Groups, Kafka ensures that data is processed efficiently, securely, and at an incredibly massive scale.

The Great Data Debate: Should You Build Your Warehouse Top-Down or Bottom-Up?

Lawrence Murithi — Mon, 11 May 2026 16:06:03 +0000

Introduction

Imagine you have a massive, disorganized garage. You need to clean it up so you can actually find things. You have two ways to tackle this.
The first way is to take every single item out of the garage, build a perfect, custom-sized shelving unit for the entire space, categorize every loose screw and tool into a master list, and then put everything in its exact, permanent place.
The second way is to just clean out the corner where you keep your gardening tools because it’s spring and that’s what you need right now. Later, when winter comes, you can clean out the corner for your snow shovels.
This is exactly how the data engineering world looks at building a Data Warehouse. The clean the whole garage first method is the Inmon approach. The clean corner by corner method is the Kimball approach.
If your company wants to store data to make smart business decisions, you will inevitably bump into these two names; Bill Inmon and Ralph Kimball.
This article looks at how the architectures work, the good, the bad, and which one you should actually use.

The Inmon Architecture(The Top-Down Master Plan)

Bill Inmon is often called the father of the data warehouse. His philosophy is that a data warehouse should be the single, ultimate source of truth for the entire business.

How It works

Inmon uses a top-down approach. You start by looking at the entire company, pull data from all the different software systems (sales, HR, finance) and clean it up. Then, you store all of it in one massive, highly organized central database.
Because of this design, the Inmon approach requires that business requirements are defined first. You must have a complete understanding of the enterprise's overarching data needs before building the model. Furthermore, it relies on strong governance, meaning there are strict, centralized rules controlling data quality, security, and standardization across the board.
Inmon uses a normalized structure. This means data is stored without any duplication. If a customer's name changes, you only have to update it in one single place.
Building a centralized warehouse first is the core of this method. Once this giant central warehouse is built, you carve out smaller pieces of it, Data Marts, for specific departments to use. Each department gets their own data mart, but that data mart is fed strictly by the central warehouse.

Below is a flowchart showing multiple source systems feeding into a single Staging Area, flows into a large central Enterprise Data Warehouse, which then splits into smaller Data Marts pointing to the end users.

Source → ETL → Data Warehouse → Data Marts → Reports

Pros

- Single source of truth - Because everything flows from one central hub, the different teams will never have conflicting numbers.
- High consistency - Due to strong governance and a centralized structure, definitions and metrics mean the exact same thing across the entire enterprise.
- Good for large organizations - The robust, highly structured foundation is capable of handling vast amounts of complex, enterprise-wide data efficiently over the long term.
- Easy to update - Since data isn't duplicated, updating records or fixing errors is very clean and simple.
- Built for the future - If the company grows or adds new departments, the foundation is already solid.

Cons

- Slow to implement - Designing a perfect system for an entire enterprise takes months, sometimes years, before anyone sees real value.
- It’s expensive - You need highly specialized database experts and a massive upfront budget to build and maintain the central hub.
- Hard for business users to read - The normalized database is great for computers, but very confusing for a regular business person trying to run a report.
- Hard to change - Because the entire enterprise is highly integrated and normalized, pivoting the architecture to accommodate new, unforeseen business models is difficult and time-consuming.

The Kimball Architecture(The Bottom-Up Quick Win)

Ralph Kimball felt Inmon method was slow and expensive and decided to craft a better method. His philosophy is that a data warehouse focus on business processes and answer specific business questions as quickly as possible.

How it works

Kimball uses a bottom-up approach prioritized around fast delivery. Instead of building a giant central warehouse first, you start by building individual Data Marts.
For example, if the sales team needs a report urgently, you pull data just for the sales team, run it through ETL and build a Sales Data Mart. Then later, you build an HR Data Mart.
Kimball uses a denormalized structure, known as the Star Schema. This means he doesn't care if data is duplicated. He organizes data into Facts (numbers such as sales amount) and Dimensions (context such as time, location, or customer name).
Rather than being isolated silos, these individual Data Marts are eventually linked together to form an Integrated Warehouse. To keep things from getting chaotic, Kimball uses conformed dimensions (an enterprise bus). This is a strict rule that says if both the Sales mart and the HR mart use a Date or a Customer, they must use the exact same definition, allowing the data marts to connect logically for company-wide reporting.

Below is a flowchart showing source systems feeding into an ETL process, which builds independent Data Marts(Star Schemas) first. These marts are linked together by shared conformed dimensions to form a logical Integrated Warehouse, which is then used for End-User Reports.

Source → ETL → Data Marts → Integrated Warehouse → Reports

The Pros

- Faster implementation - You can get a single department up and running with data in a matter of weeks, delivering immediate ROI.
- Cheaper to start - You don't need a massive upfront budget.
- Business-friendly - The Star Schema is incredibly easy for regular business users to understand. They can drag and drop fields in software like Tableau or PowerBI easily.
- Flexible - It is much easier to add new data marts or modify existing ones as business needs change without breaking a massive central database.

The Cons:
- Data duplication - Because data is stored in multiple different marts, you use up more storage space.
- Harder to update - Because Kimball favors speed and query performance over strict organization, the same piece of data is intentionally stored in multiple places. For example, if a customer's address changes, you might have to update it in five different data marts.
- Risk of inconsistency - If you aren't strictly enforcing conformed dimensions, your data marts will drift apart. Because data is duplicated across different marts, sales and finance might end up reporting different total revenue numbers.
- Integration challenges - Because the system is built piece-by-piece rather than centrally planned from the start, tying all the disparate data marts together into a unified, integrated warehouse later on can become technically complex and messy.
For example, if Sales mart is built in January and the HR mart in July, the teams might design their databases differently. A user trying to generate a combined report showing Sales Revenue vs. Employee Training Costs might realize that Sales measures time in Weeks, while HR measures time in Months. Trying to join the two data marts together to answer enterprise-wide questions thus becomes technologically complex.

Which Architecture is better?

If you ask a room full of data engineers this question, you will probably start an argument. But realistically, neither is better. It entirely depends on what your company needs.

You should use Inmon if

You work in a highly regulated industry (like banking, insurance, or healthcare) where data accuracy and audit trails are more important than speed.
You have a large budget, a big team of data engineers, and plenty of time.
Your company's data is incredibly complex and changes constantly.

You should use Kimball if

You are a startup, a retail business, or a fast-moving company that needs data right now.
You want your non-technical business teams to build their own reports without asking IT for help every time.
You are on a tight budget and need to prove the value of the data warehouse to your boss quickly.

The Modern Reality

It is worth mentioning that technology has changed a lot since Inmon and Kimball wrote their books in the 1990s.
Back then, computer storage was incredibly expensive and Inmon’s method of not duplicating data saved money.
Today, cloud storage is incredibly cheap. Because storage is cheap, many companies lean heavily toward Kimball's Star Schema because the cost of duplicating data just doesn't matter much anymore.
Furthermore, new hybrid approaches have popped up. The Data Vault architecture (by Dan Linstedt) is becoming very popular. It essentially takes the best of Inmon’s strict central storage and pairs it with Kimball’s easy-to-read data marts.

The Bottom Line

When it comes to building a data warehouse, don't get caught up in treating Inmon or Kimball like a religion. You aren't building a monument but a tool to help your company make money.
If your company has the patience to build a bulletproof foundation, go top-down with Inmon. If your company needs answers tomorrow to keep the lights on, go bottom-up with Kimball.
Pick the approach that fits your business reality, not the one that looks prettiest on a whiteboard.

Docker for Data Professionals: From Zero to Containerizing Your First Project

Lawrence Murithi — Mon, 11 May 2026 13:39:03 +0000

Introduction

If you work with data, you probably have spent hours writing a Python script, training a machine learning model or building a data pipeline. It runs perfectly on your laptop but when you send the same code to a teammate or try to run it on a company server, it instantly crashes.
Usually, the error has nothing to do with your code. It crashes because of issues like; the other computer has a different version of Python, is missing a library like pandas, or uses a different operating system.
Docker was created to solve this exact problem.
This article delves into what Docker is, why data scientists and analysts should care about it, and how to use it step-by-step.

What is Docker?

Before the 1950s, global shipping was a mess. Loading and unloading a ship was a nightmare(slow and unstandardized) because contents such as barrels, sacks, cars and boxes were different shapes weights and size.
Then, the shipping industry invented the steel shipping container. It didn't matter if you were shipping cars, coffee, or electronics, you just put your contents in a standard box. Ships, trains, and cranes were now built to handle that box.
Docker does the exact same thing for software.
Instead of just moving your code from one computer to another, Docker allows you to package your code, the programming language, the exact libraries you used, and the system settings into one standard box.
Because everything your code needs is inside that box, it will run exactly the same way on your laptop, your coworker's laptop, or a cloud server.

Docker vs. Virtual Machines

You might be thinking, Isn't that just a Virtual Machine(VM)?
It’s a fair assumption, as both provide isolated environments for your applications, but Docker is fundamentally lighter and more efficient.
A traditional VM relies on software called a hypervisor to bundle your code and libraries with a complete, dedicated guest operating system. Booting up a whole new copy of Windows or Linux makes VMs massive, resource-heavy, and slow to start.
However, Docker only virtualizes the application layer. It uses a background service called the Docker Engine to share your host computer's underlying operating system. By stripping away the bulky guest OS, Docker packages only the absolute essentials; the code, runtime, and settings, into a highly portable container. This isolation guarantees your app will run reliably across any infrastructure. Docker containers also take up a fraction of the disk space and launch in mere seconds.

Core Docker Terminologies

Before writing any code, it's critical to understand the basic vocabulary of the Docker ecosystem.
1. Docker Engine - This is the underlying background program running on your machine. It does the actual heavy lifting required to build, run and manage your containers.
2. Dockerfile - Think of this as a recipe. It is a plain text document that contains a step-by-step list of commands. Docker reads this file to know exactly which software to install and which files to copy to build your environment.
3. Images - An image is a frozen, unchangeable blueprint created by running a Dockerfile. It holds your code, tools, and system libraries in one package. Images are used to spawn active containers. Think of it as the static mold used to make identical products.
4. Containers - This is the live, running version of an image. A container isolates your application and its requirements from the rest of your computer, guaranteeing it behaves the exact same way no matter what machine it runs on.
5. Docker Hub - This is a massive online library for Docker images. Just like GitHub is used for sharing code, Docker Hub is a public platform where people can upload their own custom images or download pre-made environments to save time.
6. Volumes - Because containers are temporary, any data saved inside them is lost when they shut down. Volumes fix this by linking a folder inside the container to a folder securely saved on your actual hard drive, preventing data loss.
7. Networks - This is the system that allows multiple standalone containers to talk to each other safely. For example, a network lets a container holding your Python code securely send data to a separate container running a database.

Why Data Professionals Need Docker

While Software developers have used Docker for years to run websites, it has now become a required skill for data teams because of various reasons.
• Reproducibility - In data science, if someone cannot reproduce your results, your results are not valid. Docker guarantees that anyone who runs your container will get the exact same output.
• Easy Handoffs - A Predictive model is usually handed over to a data engineer or a software team to put it into production. A Docker container would easen their work since they don't have to guess how to set up the environment. They just run it.
• Working with Old Code - Sometimes you need to run a script written three years ago using Python 3.6. Instead of messing up your current computer by downgrading your software, you just spin up a Docker container with the old versions, run the job, and delete it.

Pros of Using Docker

1. Portability
Docker packages your application along with all its dependencies, libraries, and configuration files into a single image. Because the environment is locked inside this image, the application will run exactly the same way on a developer's laptop, a testing server, or in the production cloud. It eliminates the problem of "It Works on My Machine".
2. Resource Efficiency
Traditional Virtual Machines (VMs) require a full, heavy guest Operating System for every application. Docker containers, however, share the host machine's OS kernel thus are incredibly lightweight, take up significantly less hard drive space, and can start up in milliseconds. You can also run many more containers on a single server than you could VMs.
3. Isolation of Environments
Every container runs in its own isolated environment. This means you can have one container running an application that requires Python 2.7, and another container running an application that requires Python 3.10 on the exact same server.
4. Ideal for Microservices and Scalability
Docker is the foundation for modern microservices architectures. Instead of building one massive, monolithic application, you can build small, independent services (e.g., a database container, a web server container, an authentication container). If your web traffic spikes, you can quickly spin up 10 extra web server containers without having to duplicate the database.
5. Faster Deployment and CI/CD Integration
Docker images are pre-configured thus deploying them is as simple as downloading the image and pressing run. This makes Docker incredibly popular for Continuous Integration/Continuous Deployment (CI/CD) pipelines. If a new version of an app has a bug, rolling back is as easy as running the previous Docker image tag.

Cons of Using Docker

1. Steep Learning Curve
For beginners, Docker introduces a lot of new concepts hence takes time to become proficient. Developers have to learn how to write Dockerfiles, manage docker-compose configurations, understand container networking (how containers talk to each other), and grasp how images are built.
2. Data Persistence Complexity
By design, containers are ephemeral (temporary). If a container is deleted or crashes, all the data inside it is permanently lost. To save data permanently (like a database), you have to learn how to manage Docker Volumes or Bind Mounts to connect container storage to the host machine's hard drive.
3. Cross-Platform Performance and Quirks
Docker is natively a Linux technology. While Docker Desktop allows you to run it on macOS and Windows, it actually does this by running a lightweight, hidden Linux Virtual Machine in the background. This can lead to heavy RAM/CPU usage on Mac and Windows machines, and file-syncing between the host and the container can sometimes be slow.
4. Security Concerns (Shared Kernel)
Because containers share the host's Operating System kernel, they are less isolated than full Virtual Machines. If a hacker finds a vulnerability in the host OS kernel, they might be able to break out of the container and access the host machine or other containers. Additionally, poorly configured containers running as the root user pose a significant security risk.
5. Not Ideal for Desktop/GUI Applications
Docker is heavily optimized for backend services, APIs, databases, and command-line tools. While it is technically possible to run graphical desktop applications (GUI) inside Docker, it is highly complex, clunky, and generally not recommended.

First Docker Project

Let’s build a simple data project and put it inside a Docker container.

Step 1: Install Docker
Download Docker Desktop for Windows, Mac, or Linux here. Docker Desktop is a helpful graphical interface that includes the underlying Docker Engine. Install it and open the application. It runs quietly in the background.

Step 2: Set Up Your Project Files
Create a new folder on your computer e.g. myproject. Inside this folder, create three files.
File 1: main.py
This is the Python script. Write a simple program that uses the pandas library to create a small dataset and print it out.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Role': ['Data Analyst', 'Data Engineer', 'Data Scientist']
}

df = pd.DataFrame(data)

print("--- Team Data ---")
print(df)

File 2: requirements.txt
This file tells Python which libraries are needed. Since pandas was used, list it here.

pandas==2.1.0

File 3: Dockerfile
This is the magic file. Create a file named Dockerfile. Open it in a text editor and paste the following code.

# 1. Start with a base image pulled from Docker Hub that already has Python
FROM python:3.10-slim

# 2. Create a working directory inside the container
WORKDIR /app

# 3. Copy our requirements file into the container
COPY requirements.txt .

# 4. Install the libraries listed in requirements.txt
RUN pip install -r requirements.txt

# 5. Copy the rest of our code into the container
COPY main.py .

# 6. Tell the container what to do when it starts
CMD ["python", "main.py"]

Step 3: Build the Docker Image
Now, turn those three files into a Docker Image.
Open your computer's terminal (Command Prompt on Windows, Terminal on Mac/Linux) and navigate to the myproject folder and run the below command.

docker build -t my-first-data-app .

NB: The period . at the end tells Docker to look for the Dockerfile in the current folder.
• docker build tells Docker to read the recipe.
• -t my-first-data-app gives the image a name (tag) so it can be easier to find it later.

Step 4: Run the Container
Once the build is finished, the image is ready and can be run using the command below.

docker run my-first-data-app

This displays the output of the Python script on the screen.

--- Team Data ---
      Name  Age            Role
0    Alice   25    Data Analyst
1      Bob   30   Data Engineer
2  Charlie   35  Data Scientist

The Image can now be sent to anyone in the world, and it would print the exact same table, even if they don't have Python installed on their computer.

The Multi-Container Problem

Running a single container is great. However, modern applications are rarely just one piece of software.
A standard web application usually consists of:

A frontend application
A backend API
A database
A caching system

Using basic Docker commands means you have to build and run each of these containers manually, figure out how to connect them to the same network so they can talk to each other and manage their startup order. Doing this by typing long commands into the terminal every single day is frustrating and prone to human error.

Docker Compose

Docker Compose is a tool designed specifically to solve the multi-container problem.
Instead of typing a bunch of manual terminal commands, Docker Compose allows you to define your entire multi-container application in a single text file called docker-compose.yml.
YAML (Yet Another Markup Language) is just a way to write configuration data in a clean, readable format using indentation.
With Compose, you define services. Each service represents one container in your application setup. You can define what image the service should use, what ports it should open, and how it connects to the other services.

A Practical Docker Compose Example

# The structure of docker compose

services:
  app:
    build: .
    ports:
      - "8080:8080"

  postgres:
    image: postgres
    ports:
      - "5432:5432"

services:
  postgres:
    image: postgres:latest
    container_name: postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: 12345
      POSTGRES_DB: postgres
    ports:
      - "5433:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 15s
      timeout: 10s
      retries: 5

  etl:
    build: .
    container_name: etl
    environment:
      DB_USER: postgres
      DB_PASSWORD: 12345
      DB_HOST: postgres
      DB_PORT: 5432
      DB_NAME: postgres
    depends_on:
      postgres:
        condition: service_healthy

Let's break it down.
services - We have two services defined; postgres (the database) and etl (the process interacting with the database).
image - For the postgres service, we are not building our own image. We are downloading the official postgres:latest image directly from Docker Hub.
container_name - Explicitly sets the names of the running containers to postgres and etl instead of letting Docker auto-generate random names.
environment - This passes variables (like passwords and database names) into the containers. The etl service's DB_HOST is simply postgres. Docker Compose automatically creates a network so the etl container can talk to the database using its service name.
ports - Maps port 5433 on your host machine to port 5432 inside the postgres container. This allows you to connect to the database from outside of Docker using port 5433.
healthcheck - Tells Docker how to test if the postgres database is actually ready to accept connections. It runs the command pg_isready every 15 seconds, waiting up to 10 seconds for a response, and will try 5 times before failing.
build - For the etl service, we tell Compose to look in the current folder (.) for a Dockerfile and build the image from scratch.
depends_on - This tells Docker not to start the etl service until the postgres container is fully up and has successfully passed its healthcheck (condition: service_healthy).

Once you have written the above file, you can start your entire application; the database, the custom network, and the backend with just one command as below.

docker-compose up

When you are done working and want to shut everything down and clean up the network, you type

docker-compose down

Using Docker Volumes for Data

Volumes persist data outside of the container's lifecycle. Why is this important for data professionals?
In the example above, we packed our Python script directly into the container. But what if you are processing a 10-gigabyte CSV file? You do not want to pack a massive data file inside your Docker image. Images are supposed to be lightweight. Furthermore, if your code generates a cleaned CSV, and the container stops running, that new file will be lost forever.
A Volume fixes this by acting as a bridge between your actual computer and the container.
Imagine you have a folder called data on your laptop, and you want your Docker container to read a file inside it. You would run your container like this:

docker run -v /path/to/your/local/data:/app/data my-first-data-app

The -v command maps a folder on your computer to a folder inside the container. Now, your Python script can read heavy datasets and save output files directly to your laptop, without making the Docker image bloated.

Summary

Docker is an incredibly powerful tool that has revolutionized software engineering by making apps fast, portable, and scalable. However, for a very simple, static website or a solo developer building a basic script, adding Docker might introduce unnecessary complexity and overhead.
If you want to start using Docker in your daily data work, ensure to follow these rules.
1. Use official base images - When writing a Dockerfile, always start with an official image from Docker Hub like python:3.10-slim or jupyter notebook. They are secure and well-maintained.
2. Keep it small - Use versions of Linux that have slim or alpine in the name. They take up less space on your hard drive.
3. Pin your versions - Always use a requirements.txt file and specify the exact version of the library you used (e.g., scikit-learn==1.3.0). If you just write scikit-learn, Docker will download the newest version, which might break your code.
4. Don't put passwords in Dockerfiles - If your script connects to a database, never hardcode your password into the script or the Dockerfile. Use environment variables instead.
5. Level up with Docker Compose - Once you are comfortable running a single container, look into Docker Compose. While Docker commands handle individual containers, Docker Compose allows you to define and manage multi-container applications. By writing a single docker-compose.yml file, you can seamlessly utilize Networks to connect multiple containers e.g. running Python script in one container and a PostgreSQL database in another and spin them all up with just one simple command (docker-compose up).
Mastering Docker could save you hundreds of hours of debugging. Once you learn how to containerize your data projects, "it works on my machine" will be a phrase you never have to say again.

Folders, Apartments, and Fake Computers: A Guide to Virtual Environments, Docker, and VMs

Lawrence Murithi — Thu, 07 May 2026 12:23:35 +0000

Introduction

If you have been spending a substantial amount of time writing code, you must have run into a frustrating problem: "It works on my computer, but it doesn't work on yours."
This happens because computers are set up differently. You might have a different operating system, a different version of a programming language, or different background software running. When a website or app breaks because of this, developers can lose hours or even days trying to figure out what the problem is.
To solve this, developers came up with ways to isolate software. Instead of installing an app directly onto your main computer, you put it inside a protective bubble. This bubble tricks the software into thinking it has its own private space, with exactly what it needs to run, so it won't mess with the rest of your system.
There are three main tools we use to create these bubbles; Virtual Environments, Virtual Machines (VMs) and Docker. While they all aim to solve similar problems, they do it in completely different ways, using completely different layers of your computer.
Let's break down exactly what each one is, how they compare and when you should use them.

1. Virtual Environments

A Virtual Environment is a localized directory that contains a specific version of a programming language and the specific software packages required for a project. It is the simplest and lightest way to isolate a project and is most commonly used in Python (using tools like venv or virtualenv) although similar concepts exist in other languages.

How Virtual Environments work

A Virtual Environment provides no system-level isolation. It does not share hardware, nor does it isolate the OS. It simply changes the PATH variables in your terminal so that when you install a package or run a script, it uses the isolated folder instead of the computer's global system files.
Imagine you are building two different websites on your laptop. Website A is older and needs version 2.0 of a web framework like Django. Website B is brand new and needs version 4.0 of that exact same framework. If you install these tools directly onto your main computer system, they will conflict and one of your websites will stop working.
A virtual environment fixes this by creating a dedicated, private folder for your project. When you turn on(activate) the virtual, it temporarily rewrites your computer's internal GPS, known as the system PATH. Because of this, your computer temporarily ignores its main, global list of tools. Instead, it only looks at the tools installed inside that specific project folder.

Pros

• Extremely fast - Creating and starting a virtual environment takes less than a second because it is just moving some folders around.
• Lightweight - It only takes up a few megabytes of space on your hard drive. There is no heavy software running in the background.
• Simple to use - Usually, it just takes one or two simple commands in your terminal to get started and shut down.
• No dependency conflicts - it solves the problem of dependency conflicts between different projects

Cons

• Weak isolation - It only isolates programming packages (like Python libraries). It does not isolate the operating system, the system clock, or your hardware settings.
• "It works on my machine" can still happen - Because the isolation is weak, hidden problems can sometimes slip through. If your code secretly relies on a specific font or a hidden system tool installed on your Mac, and you send your virtual environment code to a friend on a Windows PC, the code might still break.

Virtual environments are used on local computer on day to day coding when working on multiple projects using the same programming language but want to keep their dependencies separate from one another.

2. Virtual Machines (VMs)

A Virtual Machine is a complete software emulation of a physical computer. It runs its own full Operating System (Guest OS) entirely separate from the host computer's Operating System. It is the heaviest, most complete, and oldest form of isolation. Software like VirtualBox, VMware, or Microsoft Hyper-V allows you to do this.

How Virtual Machines work

If a virtual environment is like putting your code in a separate folder, a Virtual Machine is like buying an entirely new physical computer, shrinking it down, and putting it inside your current computer.
It uses a piece of software called a Hypervisor(like VMware, VirtualBox, or Hyper-V). The hypervisor carves out a specific amount of your physical computer's RAM, CPU, and storage and dedicates it to the VM. You then install a full Operating System (like Windows or Ubuntu) onto that carved-out space. This new system is called the Guest OS which operates/behaves like a real computer while the main computer is called the Host.

Pros

• Complete isolation - What happens inside a VM stays inside a VM. Because the hypervisor locks the hardware, if a VM gets infected with a severe virus, your main host computer is almost always completely safe.
• Run different operating systems - You can run a full Windows computer inside a Mac, or a Linux computer inside Windows, allowing you to use software made for different platforms.
• Highly secure - Because the hardware is strictly separated at a deep level, it is trusted by banks, governments, and massive corporations for highly sensitive tasks.

Cons

• Massive resource hog - Since you are running a second operating system on top of your current one, VMs eat up a lot of RAM, CPU power, and battery life. Even if the VM is just sitting idle, it is still running background updates, managing a clock, and keeping a digital desktop alive hence wasting power.
• Huge files - A VM can easily take up 20 to 100 gigabytes of storage space just to hold the basic operating system files.
• Slow - Booting up a VM takes just as long as turning on a physical computer, and moving files in and out of it can be tedious.

VMs are used in large corporate cloud servers or on a local machine when strict security is needed. Its critical when you need to test software on a completely different operating system, or when a business is running older, legacy applications that require an outdated OS to survive.

3. Docker (Containers)

Docker is a platform that uses containerization to package an application and all its necessary dependencies (libraries, frameworks, etc.) into a single, standardized unit called a container. Containers are the clever middle ground between the lightness of a Virtual Environment and the strict, heavy isolation of a Virtual Machine.

How docker work

Every operating system is made of two main parts; the core engine (Kernel), which physically tells your RAM and CPU what to do, and the user files/tools that make up the desktop experience you see on screen.
While a Virtual Machine duplicates both parts making it so heavy, Docker only duplicates the user files and tools. All Docker containers share the main host computer's Kernel.
Think of it like an apartment building. A Virtual Machine is like giving everyone their own separate house with their own separate plumbing and electricity. Docker is like an apartment complex where everyone has their own locked, private room(container) and can decorate however they want, but they all share the building's central plumbing and electrical systems hidden in the walls(Host OS Kernel).

To use Docker, you write a simple text file called a Dockerfile. It reads like a recipe; Start with a bare-bones version of Linux, set up some default database passwords, download the latest PostgreSQL and start the database server. Docker reads this file and packages it into a container. This container can be handed to anyone, and it will run exactly the same way, regardless of what computer they have.

Pros

• Consistent everywhere - It solves the "it works on my machine" problem perfectly. A Docker container behaves exactly the same on a Mac, a Windows PC or a cloud server because the environment inside the container never changes.
• Fast and lightweight - Because they don't boot up a full operating system kernel, containers start in seconds and usually only take up a few hundred megabytes of space.
• Easy to share and scale - You can run dozens or even hundreds of containers on the same computer without them fighting over resources. This allows developers to build microservices. Instead of building one massive app, you put the shopping cart in one container, the user login in another, and the payment system in a third. If the payment container crashes, the rest of the website stays up.

Cons

• Steeper learning curve - You have to learn Docker-specific terminal commands, how to write Dockerfiles and how networking works to let containers talk to each other.
• OS limitations - Because Docker shares the host's kernel, you generally run Linux containers on Linux machines. Although Linux can run containers on Mac and Windows, Docker usually installs a tiny, hidden Linux Virtual Machine in the background to provide the Linux Kernel making Docker slightly heavier on Mac and Windows than it is on native Linux.
• Less secure than VMs - Because containers share the host kernel, the wall between them is thinner hence a critical vulnerability in the host OS could potentially affect all containers.

Docker is used almost everywhere. On a developer's laptop, in automated testing environments, and in production running live websites on the open internet. Its used when building modern web applications, working with a team of developers who all use different computers, or breaking a large app down into smaller microservices.
It gives developer's an isolated, highly reliable environment that is identical across all machines, without wasting your computer's RAM and hard drive space.

Similarities between the tools

The core similarity between all three is the concept of isolation.
They all exist to create boundaries between projects and software.
They also all make it easier to delete a project without leaving junk files behind; you just delete the virtual environment folder, the VM file, or the container image, and everything associated with that project is instantly gone, leaving your main computer perfectly clean.
Most times, they are often used together in the real world. A large company might run a giant Virtual Machine in the cloud to provide security, put Docker inside that Virtual Machine to manage different web apps easily, and a developer might use a Virtual Environment inside that Docker container to organize their Python code.

The Major Differences

The difference lies in how much they isolate and how heavy they are.
• Virtual Environment (Lightest) - Isolates only the language packages but relies entirely on your computer for everything else.
• Docker (Middle) - Isolates the application and the operating system files, but shares the core OS engine (the kernel) to save power and speed.
• Virtual Machine (Heaviest) - Isolates absolutely everything. It clones the physical hardware and runs a 100% separate operating system, taking up a lot of space and power to provide maximum security.

Conclusion

If you are just writing a quick Python script to scrape a website, analyze some data, and need to install a few libraries without breaking your computer, use a Virtual Environment.
If you are building a web app, working with a database, collaborating with other developers, and need to make sure your code runs exactly the same way on your laptop as it will on your company's live servers, use Docker.
If you are on a Mac but absolutely need to run a piece of Windows-only enterprise software, or you are testing dangerous malware and need maximum security to protect your real computer, use a Virtual Machine.

The Medallion Architecture: Turning Messy Data into Business Gold

Lawrence Murithi — Wed, 06 May 2026 16:06:09 +0000

Introduction

Imagine drawing water from a muddy river. You would never scoop a glass of water from the bank and drink it straight down. You would want that water pumped into a treatment plant, filtered to remove the debris, and chemically purified until it is crystal clear and safe to consume.
Data requires the exact same treatment.
Ever seen raw data pulled directly from a company’s servers? It's usually a complete mess. Website logs, sales applications, customer service chatbots, and payment gateways all generate endless streams of information. If you take all that raw information, dump it into a single pile, and try to build a revenue report, the results will be a disaster. Your numbers will be wrong, your system will crawl to a halt, and nobody will trust the data.
To process this information safely, data engineers build systems with specific layers that clean and organize records step-by-step. Historically, this was done using traditional data warehouse layers. Today, a modern framework called the Medallion Architecture has taken over the industry.
Here is a deep dive into how data layers work, why the Medallion concept was invented, and how it refines digital mud into a clear, single source of truth.

The Old Way(Traditional Data Warehouse Layers)

Before the Medallion Architecture existed, engineers used a classic three-step method to move data from external software into a company dashboard.
To elaborate on the traditional Data Warehouse architecture, it is essential to ground the concepts in the frameworks introduced by W.H. Inmon (often called the father of the data warehouse) and Ralph Kimball.
Historically known as the Three-Tier Enterprise Data Warehouse (EDW) Architecture, this system was designed to separate operational systems (where data is created) from analytical systems (where data is analyzed).

1. The Staging Layer(The Transient Extraction Zone)
This was the receiving dock. The staging area is defined as a temporary, intermediate storage zone between operational data sources (ODS) and the data warehouse.
Data from a shopify store or a salesforce database was copied and temporarily dropped here. The main goal was speed; get the data out of the live application quickly so the app wouldn't slow down for regular users.

Attributes of staging layer

Decoupling OLTP and OLAP - The primary architectural goal of this layer is to isolate Online Transaction Processing (OLTP) systems (like Salesforce or Shopify) from Online Analytical Processing (OLAP) workloads. Analytical queries are highly resource-intensive; running them directly on a live database can cause catastrophic latency for end-users.
Extraction Mechanics - Data is pulled into this layer using methodologies such as batch processing or Change Data Capture (CDC). The data here is typically stored in its raw, native format.
Volatility - According to traditional DWH design principles, data in the staging layer is transient. Once the data is successfully moved to the next tier, it is generally purged or overwritten in the next batch cycle to conserve expensive storage space.

2. The Integration Layer (The Core Enterprise Data Warehouse)
This is where the heavy lifting happened. Engineers wrote scripts to clean the data and match up records.
This layer represents what W.H. Inmon famously defined as the subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
If your billing system called a customer Client_001 and your website called them User_001, the Integration layer linked them together into a central, highly structured database.

Attributes of integration layer

Semantic Reconciliation - The heavy lifting is known as semantic reconciliation and Master Data Management (MDM). Engineers must resolve heterogeneous data formats (e.g., merging Client_001 from an Oracle database and User_001 from a JSON web log) into a unified entity.
Data Cleansing and Normalization - In this layer, data undergoes rigorous cleansing (handling null values, standardizing date formats). Structurally, Inmon advocated for storing this data in the Third Normal Form (3NF). This highly normalized structure reduces data redundancy and ensures mathematical consistency across the enterprise, creating a Single Version of Truth (SVOT).
The Bottleneck - Because of the complex normalization rules, writing data into this layer requires highly complex, tightly coupled SQL scripts, making the Integration Layer notoriously slow to update or modify.

3. The Presentation Layer(Data Marts and Dimensional Modeling)
The highly normalized 3NF data in the Integration layer is too complex for business users to query efficiently, therefore, data must be reshaped for consumption. Engineers would pre-package specific tables for specific teams e.g. creating a Marketing Table or a Sales Table that connected easily to dashboard software.

Attributes of presentation layer

The Data Mart - The Presentation layer is composed of subsets of the data warehouse focused on a specific business unit also called Data Marts (e.g., Sales, HR, Marketing).
Dimensional Modeling (The Kimball Method) - In this layer, engineers apply Ralph Kimball’s dimensional modeling techniques, organizing data into Star Schemas or Snowflake Schemas. Data is divided into Facts (measurable, quantitative data e.g sales amount) and Dimensions (descriptive attributes e.g time, store, or customer).
Optimized for Read-Heavy Workloads - By pre-joining and denormalizing the data, this layer allows Business Intelligence tools like PowerBI to execute complex analytical queries rapidly without requiring end-users to understand underlying SQL structures.

The Problem with the Old Way

This system relied heavily on a process called ETL (Extract, Transform, Load). Engineers would extract the data, transform/clean it and then load it into the warehouse. The fatal flaw was that the raw data was often discarded after it was cleaned to save storage space. If a data engineer accidentally deleted a crucial column during the clean phase, that historical data was gone forever.
1. The Schema-on-Write Constraint
Traditional DWHs operated on a Schema-on-Write paradigm. This means that before data could be loaded into the warehouse, the warehouse's schema (tables, columns, data types) had to be rigidly defined. If a new column was added to the source software, the ETL pipeline would fail, or simply drop the unrecognized data, until an engineer manually updated the database schema.
2. Destructive Transformations and Storage Costs
On-premise relational database storage such as Teradata or Oracle appliances used to be very expensive. To save disk space, raw data was deemed expendable. Data was extracted, transformed to fit the strict schema, and the raw source data was then discarded.
This model had some downsides which included:
• Loss of Auditability and Lineage - If a transformation logic error occurred (e.g., a script incorrectly rounded up financial figures), there was no historical raw data to refer back to since the original data was permanently lost.
• Lack of Flexibility for Machine Learning - Modern Data Science requires massive amounts of raw, unstructured or semi-structured data to train machine learning models. The traditional integration layer stripped away the granular, raw anomalies that data scientists actually need, leaving only highly aggregated, structured data.
As a result of the flaw which resulted to loss of raw data and rigidity of ETL paved the way for Data Lakes, there was a shift from ETL to ELT(where cheap cloud storage allows raw data to be stored before transformation), and ultimately the modern Medallion Architecture (Bronze, Silver, Gold), which preserves raw data while still providing structured analytics.

The Modern Shift(The Medallion Architecture)

As cloud storage became incredibly cheap, companies stopped throwing away their raw data and began dumping everything into massive, cheap storage areas(Data Lakes).
Eventually, companies pioneered the Lakehouses which combined the cheap, infinite storage of a Data Lake with the strict organization of a traditional Data Warehouse.
The need to help companies organize the massive amounts of data inside a Lakehouse therefore gave birth to the Medallion Architecture.
The Medallion Architecture separates data into three specific stages; Bronze, Silver, and Gold. It mimics the logical flow of the traditional layers but fundamentally changes how data is treated, preserved, and upgraded.

How do the three layers work?

1. The Bronze Layer(The Raw Zone)
This is where all the raw data lands from the various sources.
The data is saved exactly as it arrived. You do not fix typos. You do not rename columns. You just capture it.

Features of bronze layer

• Safety and Troubleshooting - since the raw data is completely untouched, you never have to worry about accidentally destroying information. If an engineer writes a bad piece of code that ruins the data in the later layers, they can simply go back to the Bronze layer and restart the process.
• Historical Archive - The Bronze layer acts as an infinite, permanent record of everything that ever happened in the business. It is usually append-only, meaning new records are just added to the pile without overwriting old records.
• Speed - Getting data into the Bronze layer is fast since the computer isn't doing any complex math, translations or cleaning. Engineers often use tools called Change Data Capture (CDC) to stream this raw data in real-time.

2. The Silver Layer(The Cleaned Zone)
Once the data is safely locked away in the Bronze layer, it is copied and moved into the Silver layer. The goal of the Silver layer is to create a Single Source of Truth for the entire enterprise.

What happens here?

• Cleaning and Standardization - Engineers fix the formatting. For example, if one source system writes dates as DD-MM-YYYY and another writes MM-DD-YYYY, the Silver layer standardizes all into one standard format.
• Filtering and Quarantining - Junk data is handled here. If a user accidentally enters an age like 999, the system spots it and instead of deleting it, engineers push that bad record into a separate quarantine table so it doesn't ruin the main data set, but can still be investigated later.
• Deduplication - Sometimes, source systems can glitch and send the same receipt twice. The Silver layer strips out duplicates so every row is unique.
• Joining - Data from different tables is connected using relationships. A log of customer purchases is joined with a product inventory table so that you can see exactly what item was bought, not just a random product ID number.
• Security - This is where sensitive information (like passwords, social security numbers, or personal emails) is scrambled or hidden so that analysts using the data later on cannot see private customer details.

Data scientists and analysts spend a lot of time in the Silver layer. It is clean and trustworthy, but it is still highly detailed. Every single individual action is visible, which makes it the perfect place to look for hidden trends or train machine learning models.

3. The Gold Layer(The Action Zone)
The Gold layer is the final destination. The data here is no longer meant for deep exploration but is designed to answer specific business questions immediately.

What happens here?

In the Silver layer, you might have a table with ten million individual rows. If a user tries to load the rows into a dashboard, the software will freeze. However, in the Gold layer, those millions of rows are turned into highly summarized, bite-sized metrics.
• Aggregations - Instead of listing every single sale, engineers create a Gold table that simply shows Total Sales per Store per Day.
• Business Logic - This is where specific company rules live. If your marketing team defines an active subscriber as someone who has opened an email in the last 30 days, that exact mathematical rule is applied to a Gold table.
• Performance - Data loads instantly since it's heavily summarized and simplified using Star Schema layout. When you connect any Business Intelligence tools to the Gold layer, the charts populate immediately.

Why the Medallion Architecture Wins

The reason nearly every modern data team is adopting this structure is because it solves the biggest headaches that have plagued developers for decades.
1. Bulletproof Data Lineage
When an executive looks at a Gold dashboard and sees that monthly revenue dropped by 50%, panic sets in. The data team needs to find out if the business is actually failing, or if the system is just broken.
With this architecture, they can trace the flow backward. They check the rules in the Gold layer. If those are correct, they look at the cleaned data in the Silver layer. If that looks fine, they check the raw files in the Bronze layer.
2. Extreme Flexibility
If the finance department suddenly requests a completely new way to calculate annual growth, the data team doesn't have to panic. They do not have to go back to the original software sources, and they do not have to re-clean everything. They simply build a new Gold table on top of the already clean Silver data.
3. System Reliability (ACID Transactions)
Modern Medallion architectures are built on specialized table formats like Delta Lake or Apache Iceberg which support ACID transactions. That means if a server crashes halfway through moving data from Silver to Gold, it won't leave you with a half-finished, corrupted table. The system will automatically roll back to the last safe state, preventing bad data from leaking into executive reports.

Conclusion

If you want to remember how the Medallion Architecture functions, just remember these three phrases:
Bronze -Here is everything we found(Messy, huge, exact copies).
Silver -Here is what actually happened (Clean, standardized, truthful).
Gold - Here is what we should do about it (Summarized, fast, ready for action).
The Foundation of Trust
A data platform is only as useful as the trust people put into it. If employees constantly find missing numbers, broken charts, or conflicting reports, they will abandon the dashboards and go back to guessing.
The Medallion Architecture is much more than a way to organize servers; it is a framework for building organizational trust. By moving information systematically through the Bronze, Silver, and Gold layers, a company guarantees that every digital footprint is captured securely, cleaned relentlessly, and presented flawlessly. Just like turning muddy water into something safe to drink, the Medallion Architecture takes the chaos of raw information and refines it into the exact clarity a business needs to survive.

Transactional Power Vs Analytical Precision: The Essential Guide to OLTP and OLAP

Lawrence Murithi — Fri, 01 May 2026 19:58:49 +0000

Introduction

Behind every digital interaction is a fundamental divide in how data is handled. The system required to process your grocery checkout with lightning speed is radically different from the system a corporation uses to analyze a decade of sales growth. This is the core distinction between Transactional Power vs. Analytical Precision. To understand the backbone of modern technology, you must understand OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing).
Though they sound like technical jargon, they are simple concepts that define how businesses operate and grow.
This article serves as your roadmap to understanding how these systems function, their unique strengths, and why the balance between them is the secret to data-driven success.

OLTP(Online Transaction Processing): Handling the Day-to-Day
OLTP is the engine that runs traditional databases. It is designed to manage everyday business operations and process thousands of short, fast interactions per second. It is the system that handles the daily, minute-by-minute work of a business. Whenever a specific action or transaction takes place, OLTP is the system taking care of it.
In a database, a transaction is any small unit of work such as changing your password.
Transaction systems follow important rules called ACID properties.
ACID Properties are a set of four fundamental principles that guarantee reliable database transactions. They ensure data integrity and accuracy, preventing corruption even during system failures or concurrent operations.
The four principles are:
Atomicity(All-or-Nothing) - A transaction is treated as a single unit, it either fully completes or entirely fails and rolls back.
Consistency(Data Integrity) - A transaction ensures the database moves from one valid state to another, adhering to all constraints and rules. That means data remains valid before and after transaction
Isolation(Concurrent Control) - Concurrent transactions are isolated from each other, ensuring they don’t interfere with each other.
Durability(Permanent Data) - Once a transaction is committed, its changes are permanently saved and will survive system failures or crashes.

Examples of OLTP in real life

Adding an item to your online shopping cart.
Booking an airline ticket.
Sending a text message.
Banking systems (Mpesa, ATM transactions)

Think of OLTP like the cashier at a busy grocery store. The cashier’s job is to scan items quickly, take your money, hand you a receipt, and move on to the next person.

How OLTP Works

OLTP systems prioritize speed and accuracy. They use a design concept called normalization. This means the database organizes data into many small tables to avoid saving the same piece of information twice. Because the data is spread out neatly, the system can insert a new record, update a row, or delete a piece of data almost instantly.

Example

Imagine you want to withdraw $50 from an ATM. The bank's OLTP system immediately checks your balance, approves the withdrawal, and updates your account to show $50 less. This has to happen in seconds, and it has to be 100% accurate so you cannot overdraw your account.

Key features of OLTP

• Low latency/Fast response time - When you swipe your card, you expect it to be approved in seconds. OLTP databases are built to respond instantly.
• High number of users - The system ensures that thousands of users can access the same row in a database without failure.
• Normalized Data - Databases are typically highly normalized to reduce redundancy and ensure fast data entry. A single OLTP transaction does not require much data.
• Real-time processing/Accuracy - If you transfer $50 from your current account to your savings account, the system must subtract $50 from one and add $50 to the other. If the system crashes halfway through, the OLTP system cancels the whole thing so your data does not get corrupted. OLTP systems are built to be perfectly accurate and fail-safe.
• Write-heavy operations - Thousands of users might be doing things at the exact same time, the system is therefore constantly writing, updating or deleting information to the database.
• Highly available - Because OLTP systems handle the immediate, day-to-day operations of a business, the system is designed to be online, working, and accessible virtually 100% of the time thus downtime is not an option.
OLTP systems are usually built with backup servers and fail-safes. If one server crashes, another one instantly takes over so the customer doesn't notice a glitch.

Pros of OLTP

• Efficiency in Data Entry - Highly optimized for adding, modifying, or deleting records.
• Data Integrity - High reliability due to ACID compliance.
• Availability - Designed for 24/7 uptime for business-critical applications.

Cons of OLTP

• Inefficient for complex Analysis - If you ask an OLTP database to calculate the average sales of a product over the last five years, it will have to scan millions of everyday records. This takes a lot of computing power and can slow down the system for people trying to use it for normal tasks.
• Limited History - To keep things fast, OLTP systems usually only hold current or recent data. Old data is often moved somewhere else to save space.

OLAP (Online Analytical Processing)
OLAP is the engine behind data warehouses. If OLTP is the system for doing things, OLAP is the system for analyzing things. While OLTP only looks at a tiny slice of data at a time, OLAP is the brains used for strategic planning since its designed for data mining, processing huge amounts of information to find patterns, trends and summaries as well as complex reporting. Managers, data scientists, and business owners use OLAP to spot trends, build reports and make big decisions.

Making Sense of OLAP

Think of OLAP as the manager in the back office of the grocery store. They aren't ringing up customers. They are sitting at a desk, looking at charts and graphs of past sales to decide if they need to order more apples for next week.

How OLAP Works

OLAP systems are not built to process quick, small updates. To make this faster, OLAP uses denormalization. Instead of spreading data across many tiny tables like OLTP, OLAP groups massive amounts of related data together into large tables. This takes up more storage space, but it means the system can read through billions of records very quickly to find patterns.

Key features of OLAP

• Read-heavy operations - Unlike OLTP, which is constantly writing new data (new orders, new users), OLAP mostly just reads old data. It looks at what already happened.
• Complex Queries - OLAP tasks involve complex math—adding, averaging, and grouping massive lists of numbers.
• Multidimensional Analysis - Users can slice and dice data (e.g. viewing sales by region, then by month, then by product category) using data cubes.
• Denormalized Data - Databases often use Star or Snowflake schemas to reduce the number of table joins needed for queries.
• Slower response time - While nobody wants to wait all day, an OLAP report might take a few minutes or even a few hours to run. This usually is not a concern since the person waiting is usually a business manager, not a customer standing at a checkout counter.

Pros of OLAP

• Handles Massive Data - It can easily process millions or billions of rows of historical data.
• Does Not Disrupt the Business - Because OLAP lives in a data warehouse, running a massive, heavy report will not slow down the cash registers running on the OLTP database.
• High Performance for Reporting - Optimized for complex analytical queries.
• Strategic Insights - Allows businesses to identify trends, patterns, and anomalies to drive decision-making.
• User-Friendly: The system is often integrated with Business Intelligence tools like PowerBI for visualization.

Cons of OLAP

• Data is Not Real-Time - OLAP systems are usually updated in batches, often overnight. If you look at an OLAP report at 2:00 PM, it usually only includes data up until the night before.
• Slow to Update - Adding new data to an OLAP system takes time because the data has to be heavily organized and formatted before it is saved.
• Expensive and Complex - Building and maintaining a data warehouse requires specialized engineers and large amounts of server storage.
• Latency - Queries can take seconds, minutes, or even hours because of the massive volume of data being scanned.

Example

A regional manager for a coffee shop chain wants to know, "Between hot chocolate or dark roast coffee, which sold better on rainy days last year?" To answer this, the system has to look at weather data, sales data from fifty stores and a whole year of dates. An OLAP system can pull this specific report together without breaking a sweat.

Examples of OLAP in real life

Netflix figuring out what genres of movies are most popular in different countries during the summer.
A hospital analyzing patient records over ten years to see if a specific treatment is working.
- A retail store deciding how much inventory to buy for Black Friday based on the last three years of sales.

Common OLAP Operations

OLAP systems organize massive amounts of data into multi-dimensional structures, often referred to as OLAP cubes. These cubes allow users to view business metrics from any angle. To explore, analyze, and make sense of this complex data, OLAP systems support several powerful analytical operations.

Here is a detailed look at the five core OLAP operations:
1. Roll-Up (Consolidation)
Roll-up is also known as consolidation or aggregation and involves summarizing data to a higher, more generalized level. This operation reduces the detail of the data by climbing up a concept hierarchy or by removing a dimension entirely. It is primarily used by upper management to view macro-level business trends.
It uses mathematical functions—such as summing, averaging or counting to group smaller data points into larger, overarching categories.
Example (Time Hierarchy)
Daily sales → Monthly sales → Yearly sales.

If a company has millions of records of individual daily transactions, viewing them all at once can be overwhelming. Using a roll-up operation, an executive can consolidate these daily records to see total sales by month, and then roll up again to see the total gross revenue for the entire year.
Business Value - Roll-up provides a big picture view of business performance, stripping away unnecessary granular details to highlight overarching trends.

2. Drill-Down
Drill-down is the exact opposite of roll-up. It involves navigating from highly summarized, macro-level data down to highly detailed, micro-level data. This is done by stepping down a concept hierarchy or by adding a new dimension to the dataset.
It breaks a larger aggregated number into the smaller components that make it up, allowing analysts to uncover the root causes behind a specific metric.
Example (Geography & Time Hierarchy)
Yearly sales → Monthly sales → Daily sales (or Country → Region → Individual Store).

Imagine an annual report shows that total yearly sales are significantly lower than expected. A manager can drill down from the yearly view to the monthly view and discover in what specific month sales plummeted. They can then drill down further into the month's daily sales to find which specific week caused the drop.
Business Value - It is essential for root-cause analysis, troubleshooting anomalies, and investigating sudden spikes or drops in performance.

3. Slice
The slice operation performs a selection on one specific dimension of the OLAP cube, resulting in a new, smaller slice of the data.
Think of it like slicing a single piece of bread from a whole loaf. It locks one variable in place so you can analyze the rest of the data in a two-dimensional table.
You isolate a single value within one dimension (e.g., Time, Geography, or Product) while keeping the other dimensions open.
Example
Show sale records for Nairobi city only.

If a data cube contains sales data across Products, Time, and Cities, applying a slice on the City dimension for Nairobi isolates that market. The resulting view will show the sales of all products over all time periods, but exclusively for Nairobi location.
Business Value - It allows regional managers, department heads or specific product owners to filter out irrelevant data and focus entirely on the one area of the business they are responsible for.

4. Dice
While a slice filters data based on a single condition, a dice operation isolates a highly specific sub-cube by applying multiple filters across two or more dimensions simultaneously.
Think of it like cutting a smaller block out of a larger block of cheese.
It selects specific ranges or values across multiple dimensions to create a highly targeted subset of the original data.
Example
Show laptop sales in Nairobi and Mombasa during January and February.

Here, the user is applying filters across three separate dimensions, Product Dimension(Laptops only), Geography Dimension(Nairobi and Mombasa only) and Time Dimension(January and February only).
Business Value - Dicing is used for highly specialized, multi-faceted analysis. It allows data scientists and marketers to look at exact intersections of data, such as evaluating the success of a specific winter promotion for a specific tech product in key coastal cities.

5. Pivot (Rotate)
Pivot, sometimes called rotation, does not filter or change the underlying data, instead, it changes the visual perspective. It rotates the data axes to provide an alternative presentation, making different relationships easier to spot.
It rearranges the layout of the data, typically by swapping rows and columns, or by moving a dimension from the background into the foreground.
Example
Swapping Products and Time periods.

A manager might be looking at a table where Products (Laptops, Phones, Tablets) are listed in the rows and Months (January, February, March) are the columns. By pivoting the data, they can make Months the rows and Products the columns.
Business Value - Different layouts highlight different trends. A pivot makes it easier to compare data side-by-side depending on what the analyst is trying to prove, ensuring the final report is as readable and impactful as possible.
NB: OLAP is not mainly about recording what is happening right now. It is about understanding what has happened and what it means.

OLTP vs. OLAP

The distinction between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) boils down to two distinct phases of business; execution and strategy. Simply put, OLTP runs the business, while OLAP analyzes the business.
These two systems are designed for fundamentally different jobs. Understanding how they differ and how they work together comes down to understanding their relationship with time, purpose, and data architecture.

Here is a detailed comparison of how the two systems operate.
1. Main Purpose and System Goals
OLTP - Its primary objective is to handle daily business operations and execute transactions seamlessly. Its core focus is on accuracy, transaction safety, and ensuring the day-to-day business continues without interruption.
OLAP - Its primary objective is to extract valuable insights from data to help leadership make smart, strategic decisions. Instead of facilitating transactions, it focuses on reporting, identifying long-term trends, and planning for the future.

2. The User Profiles
OLTP - These systems are used by everyday customers, cashiers, front-line staff, and mobile applications. These are the people actively interacting with the business in real-time buying items, logging into portals or booking appointments.
OLAP - These systems are utilized by business analysts, managers, and corporate executives. These users interact with data using dashboards, Business Intelligence reports and complex spreadsheets to evaluate business performance.

3. Data State and Architectural Design
OLTP - Data is current, real-time, and highly operational. Since the data is constantly changing, the database is highly normalized to ensure efficiency and eliminate data redundancy. It is optimized to handle a constant stream of inserting, updating, and deleting small bits of data.
OLAP - Data is historical, static, and rarely changes. It consists of summarized data spanning months or years. Because the goal is fast analysis rather than fast updates, the database is often denormalized allowing the system to efficiently read millions of rows of data at once without altering them.

4. Query Dynamics and Performance Needs
OLTP - Queries are short, simple, and require incredibly fast response times per transaction. They generally touch only a few records at a time.
Example Query - Update bread's price to $10, What is John's email address? or Update a specific customer's order.
OLAP - Queries are heavy, long, and highly complex. While speed is still important, the system is built to process massive analytical workloads rather than split-second individual actions.
Example Query - What is the average age of customers who bought bread in November of 2022? or Show the global sales trends broken down by region over the past 5 years.

5. Real-World Examples
OLTP Systems - ATMs, retail checkout registers, airline booking systems, and e-commerce shopping carts.
OLAP Systems - Corporate data dashboards, annual financial reports, and Business Intelligence (BI) platforms.

The Synergy(How OLTP and OLAP Work Together)

A successful business relies on a symbiotic relationship between both systems. You cannot accurately analyze a business if you do not have an OLTP system reliably recording the daily sales. Conversely, you cannot grow a business if you lack an OLAP system to look back at your history and determine what strategies are actually working.

So, how does the two systems connect?
They are linked through a pipeline process known as ETL (Extract, Transform, Load).
Every day, the OLTP database handles the rapid work of serving customers and processing transactions. At the end of the day, usually in the night when customer traffic and system strain are at their lowest, an automated batch script runs.
Extract - The script pulls a copy of the day's newly generated operational data from the OLTP database.
Transform - It cleans, formats, and aggregates that raw data to ensure it is properly structured for analysis.
Load - Finally, the script deposits that formatted data into the OLAP data warehouse.
By the time the business analysts and executives log into their dashboards the next morning, the OLAP warehouse is fully updated with yesterday's finalized numbers. The data is now perfectly prepped to be searched, graphed, and studied.

The Bottom Line

The difference between OLTP and OLAP simply comes down to time. While OLTP handles the exact moment a transaction occurs, OLAP handles months or years of historical data that the transactions leaves behind. Together, they allow a business to operate today while intelligently planning for tomorrow.

Conclusion

Every time you interact with a screen, you are leaving a digital footprint. Databases are the safe spaces that hold those footprints. OLTP ensures daily transactions are fast and secure. Data warehouses collect all those footprints over time. Finally, OLAP helps businesses look at the giant trail of footprints to figure out where they should step next.
These tools might be invisible, but they are the engine running modern business, keeping our digital lives fast, organized, and constantly improving.

From Tables to Tides: Navigating Databases, Warehouses, Marts, Lakes, and the Lakehouse Revolution

Lawrence Murithi — Fri, 01 May 2026 17:13:47 +0000

Introduction

Every time you buy a coffee with a card, "like" a post on social media, withdraw money from an ATM or buy a shirt online, you are interacting with a database. Behind the scenes of every app and website, data is constantly being created, moved, stored and read.
However, not all data storage is the same. The way a system stores your checkout items at a grocery store is very different from the way that same grocery chain analyzes ten years of sales trends.
To understand how modern software handles data, we need to look at the main types of storage; traditional databases, data warehouses, data marts, data lakes and lake houses.

If you are not a computer guru, these terms might sound very technical but they are not as complex as they sound. But once you break them down, they make perfect sense.
This article gives a simple but detailed breakdown of what these are, how they work, and why software relies on both.

The Basics of Data Storage

In today's data-driven world, organizations generate massive amounts of information. To effectively store, manage, and analyze this data, businesses use different architectural models based on their specific needs.
Before we look at the specific processing types, it helps to understand the physical or virtual places where data lives.

1. The Database (The Daily Worker/Operational Engine)
Think of how you keep track of your personal budget. You might use a spreadsheet. A spreadsheet is great for one person looking at a few hundred rows of information. Now imagine a company like Amazon trying to use a spreadsheet to track millions of orders happening every minute. The spreadsheet would freeze and crash instantly.
A database is like a highly advanced, incredibly secure digital filing cabinet built to store massive amounts of information without crashing. Databases are primarily designed for OLTP (Online Transactional Processing). They are the workhorses that power day-to-day operations, such as processing bank transactions, managing inventory, or storing user profiles. Its main job is to quickly record new information, update existing information, and allow users to quickly look up specific details. More importantly, it is organized so that users can find exactly what they are looking for in a fraction of a second.
Information in a standard database is usually organized into tables with rows and columns. For example, an online store might have one table for Customers, one for Products, and one for Orders. The database connects these tables so the system knows exactly which customer bought which product. Think of a traditional database like a busy cash register. It needs to be fast, accurate, and handle hundreds of transactions at once without freezing.

Key Characteristics of a Database

ACID Compliance – Traditional relational databases follow strict rules (Atomicity, Consistency, Isolation, Durability) to ensure that transactions are processed reliably and that data remains accurate even in the event of a system crash.
Normalized Structure – Data is organized into tables to reduce redundancy. For example, a customer’s address is stored in one place rather than being repeated for every order they place.
Real-Time Interaction – Databases are designed to handle thousands of concurrent users making small, rapid changes to the data simultaneously.

Types of Databases

Relational (SQL) - Uses tables with rows and columns (e.g., MySQL, PostgreSQL, Oracle). Ideal for structured data where relationships are clearly defined.
Non-Relational (NoSQL) - Uses flexible structures like documents or graphs (e.g., MongoDB, Cassandra). Ideal for rapidly changing data types and massive scaling.

2. The Data Warehouse (The Long-Term Archive/Analytical Hub)
As a business runs, over time, its database fills up with millions of past transactions. After a few years, a company manager might want to know, "Which of our stores sold the most winter coats in December over the last five years?"
For the database to answer that question, it has to dig through millions of old records thus it slows down. This causes the system to freeze hence people trying to buy things on the website at that moment cannot check out.
Using the grocery store analogy, a store manager walking up to a cashier who has a long line of customers and asking them to calculate the store's total profit for the last decade would cause a crisis and bring the whole store to a halt. To fix this, companies build Data Warehouses.
A data warehouse is a massive storage system designed to hold historical data from many different sources. It aggregates data from various sources such as different operational databases, CRM systems and flat files to provide a comprehensive, historical view of the entire organization. Periodically, usually in the night, the company copies all the new data from these sources and dumps it into the data warehouse.
From the previous example, if the database is the cash register, the data warehouse is the company's central filing room. A data warehouse takes the daily receipts from all the different cash registers, organizes them and stores them for years.
The data warehouse acts as the company's long-term memory. It doesn't handle everyday customer actions. Instead, it is a quiet, organized space where business analysts can run massive queries and reports without interrupting the live website.
Data warehouses utilize OLAP (Online Analytical Processing). Instead of focusing on individual transactions, they are optimized to scan millions of rows to find trends, averages and insights.

The ETL Process (The Warehouse Engine)

Before data enters a warehouse, it must undergo ETL (Extract, Transform, Load).
Extract - Pulling data from multiple, often messy, source systems.
Transform - Cleaning, deduplicating, and formatting the data into a standardized structure.
Load - Moving the clean data into the warehouse.
This is known as Schema-on-Write, meaning the structure of the data must be defined and validated before it can be stored.

Key Benefits of a Data Warehouse

Data Integration – It breaks down data silos by combining information from marketing, sales, and finance into one single source of truth.
Historical Context – While databases often only show current data, warehouses store years of historical records, allowing for year-over-year comparisons.
Optimized for Performance – Warehouses often use columnar storage, which allows them to perform complex calculations such as, What was the total revenue for 2023?, significantly faster than a standard database.
High Quality & Accuracy – Because data is cleaned during the ETL process, business leaders can trust that the reports they generate are based on accurate, non-conflicting information.
Why use a Data Warehouse?
NB: A data warehouse is the foundation for Business Intelligence. It allows executives to run complex What if? scenarios and generate reports that inform long-term strategy. It also ensures that the operational databases are not slowed down by heavy analytical queries.

3. Data Marts(The Departmental Lens)
A data mart is a highly focused, specialized subset of a data warehouse designed to serve the specific needs of a single department or business unit.
While a traditional Data Warehouse acts as a massive, centralized repository containing all of an organization's structured data, a data mart isolates only the information relevant to a specific team.

Key Benefits of a Data Mart

Enhanced Performance - Because the data mart is smaller and queries are highly specific, reports and dashboards load much faster.
Improved Security - By isolating data, companies can strictly control who has access to sensitive departmental information
Ease of Use - Business users and analysts do not have to sift through irrelevant enterprise data to find what they need.
Data marts can be Dependent (built by drawing data from an existing enterprise data warehouse) or Independent (built directly from operational systems).

4. Data Lakes(The Raw Data Reservoir)
A data lake is a massive, highly scalable storage system designed to hold vast amounts of raw, unprocessed data in its native format.
Unlike a data warehouse, which requires data to be cleaned, transformed, and structured into strict tables before it can be stored(Schema-on-Write), a data lake stores data exactly as it is generated, assigning structure only when the data is eventually read or queried (Schema-on-Read).

Data Lakes store?

Structured Data - Traditional tables and relational databases.
Semi-Structured Data - JSON files, XML, CSVs, and server logs.
Unstructured Data - Emails, documents, PDFs.
Binary/Media Data - Images, audio files, and videos.
Streaming Data - Real-time IoT sensor data and website clickstreams.

Why use a Data Lake?

A data lake is ideal when an organization wants to capture and retain everything, even data they don't immediately need. It is highly cost-effective because it utilizes cheap cloud storage. Furthermore, having raw, unmanipulated data is essential for training artificial intelligence (AI) and complex Machine Learning (ML) models.
NB: Without proper organization and governance, a data lake can become a messy, unsearchable Data Swamp.

5. Data Lakehouse(The Modern Hybrid)
For years, companies had to maintain a two-tier architecture; a Data Lake for raw data and machine learning, and a separate Data Warehouse for clean data and business reporting. This resulted in expensive storage costs, data duplication, and complex maintenance.
A Data Lakehouse is a modern architectural design that merges the best concepts of both systems. It is built directly on top of cheap data lake storage, but it applies the organizational structures, management tools, and performance speeds of a data warehouse.

Key Features of a Lakehouse

Flexibility & Scale - Like a data lake, it can store massive amounts of structured, semi-structured, and unstructured data.
Reliability & Structure - Like a data warehouse, it supports ACID transactions (meaning data is reliable, updates don't break the system, and multiple people can read/write simultaneously).
Single Source of Truth - Teams no longer have to copy data from the lake to the warehouse. Business analysts can build BI dashboards, and data scientists can run machine learning models directly on the exact same data platform.

Summary of the Storage systems

The Bottom Line

In today's modern economy, data is a company’s most valuable asset. However, data only provides value if it can be accessed, analyzed, and trusted. By understanding the distinctions between these storage methods, organizations can build a robust infrastructure that avoids the Data Swamp, reduces operational costs, and ultimately turns raw information into a competitive advantage.

Conclusion

Choosing the right data storage architecture is no longer about finding a one-size-fits-all solution but matching the right tool to the specific needs of the business. As organizations evolve from simple record-keeping to complex artificial intelligence and real-time analytics, their data strategy must also mature.
For Day-to-Day Operations, Database remains the essential engine, ensuring that transactions are processed accurately and instantly.
For Strategic Reporting, Data Warehouse and its specialized Data Marts provide the single source of truth needed for executive decision-making and departmental efficiency.
For Big Data & Innovation, Data Lake serves as the vital reservoir for raw information, fueling the next generation of Machine Learning and AI development.
For the Future of Scalability, Data Lakehouse represents the ultimate convergence, offering the best of all worlds; the speed of a warehouse with the massive flexibility of a lake.

Apache Airflow for Beginners: DAGs, Tasks, Operators, and Scheduling Explained

Lawrence Murithi — Wed, 29 Apr 2026 20:24:12 +0000

Introduction

Being a beginner in data engineering can seem very scary. People use technical words like ETL, pipelines, data warehouses, architecture, orchestration etc. At that point, it is very easy to feel like you need a computer science degree just to understand what they mean. However, most of these terms are just technical but not as complicated as they sound.
Data engineering, in simple terms, involves extracting data from a place such as websites, social media pages, excel/csv files or payment systems etc, cleaning it, and storing it somewhere (database, data warehouse or data lake). If you need this done once, you can run a simple Python script. However, if the job must run every hour, every day, or every week, you need a tool that can manage it for you. That's where Apache Airflow comes in.

What is Apache Airflow?

To understand Apache Airflow, think about a process like baking a cake. You do not just throw everything into the oven. You follow steps:

Buy the ingredients
Prepare the dough
Put the dough in the oven
Bake the cake
Let it cool
Add frosting
Serve the cake

Some steps must happen before others. You cannot frost the cake before baking it. You cannot bake the cake before preparing the dough. You also need to know how long each step should take and what to do if something goes wrong.
This kind of process is called a workflow or pipeline and Airflow helps you manage that workflow.
NB: Airflow does not usually do the heavy data processing itself but tells other tools when to do the work.
A workflow may be a data pipeline, a machine learning pipeline, a reporting process, or any process made up of several steps.
Example

extract_data >> clean_data >> load_data >> send_email

Apache Airflow is an open-source platform used to schedule, monitor and manage workflows. It was originally created by Maxime Beauchemin at Airbnb in 2014 to manage increasingly complex data workflows. It helps you decide what task should run first, what should follow, what should happen if something fails, and when the whole process should run again.

Airflow as an Orchestrator

Orchestration refers to arranging many tasks so they run in the right order and at the scheduled time. It makes sure that task B does not run before task A has finished. It also records whether each task succeeded or failed. Without orchestration, you will have many scripts running manually or through separate cron jobs hence becoming difficult to manage as your project grows.

Why Airflow?

While a normal Python script could run fine with simple tasks, you need more control as the number of tasks increases. Airflow is useful because data jobs often have many moving parts.

Airflow is useful because of various reasons:

1. Scheduling
Since most data work is repetitive, scheduling enables workflows to run automatically based on the scheduled time. Airflow handles complex timezone logic natively, ensuring global data pipelines run exactly when they should.

2. Catchup and Backfilling
If your pipeline breaks over the weekend and you don't fix it until Monday, Airflow knows it missed Saturday and Sunday. It will automatically go back in time and run the missed jobs in order.

3. Task Orchestration
Tasks are arranged depending on which task runs first, second, and last.
Example

extract >> transform >> load

This order is critical because if the load task runs before the transform task, the database may receive dirty data. If the transform task runs before extract, there will be no data to clean.
Airflow has parallel execution capabilities to run several tasks simultaneously and wait for all of them to finish before moving to the next step.

4. Monitoring
Monitoring standard scripts to know if a job ran successfully requires SSH-ing into a server and digging through terminal files. However, Airflow provides a centralized web interface for the entire data ecosystem to monitor.
The Web Dashboard/Task Statuses - Airflow comes with a beautiful, easy-to-read user interface (UI) with Color-coded views. You can log in and see exactly which tasks succeeded(green), which are currently running(light green), queued(gray) and which failed(red).
Gantt Charts - Visual representations of task duration, helping you identify bottlenecks in your pipeline.
Historical Trends - view the history of a specific pipeline over a duration of time to spot intermittent failures or slowing performance.

5. Automated Retries
In the real world, tasks can fail for temporary reasons. An API may be rate-limited, a database might briefly drop a connection, or a network hiccup might occur.
Instead of waking up at 3:00 AM to manually restart a failed script, Airflow handles transient errors gracefully by trying the task again based on the number of retries set.
Example

"retries": 3,
"retry_delay": timedelta(minutes=5)

In this scenario, if the task fails, Airflow will wait for 5 minutes before trying again, up to three times.

6. Accessible Logs
Finding out why and when a pipeline breaks is very critical. Airflow attaches isolated logs to every single task execution eliminating the need to hunt through an entire server log file.
A user is also able to click on a failed task directly in the web UI and instantly read the error message for that specific run, reducing debugging time.

7. Failure Handling
When a task fails, letting the rest of the script run can result in corrupt data or crashed databases. Airflow thus stops execution of the downstream tasks preventing bad data from moving through the pipeline.
Airflow can also be configured to send an automated email, slack message, or an alert when a pipeline fails, ensuring the team is instantly aware of critical data outages.

8. Clear Pipeline Structure
Airflow workflows are written entirely in Python hence the pipeline configuration is treated like any other software project. Workflows are visible and anyone can see how tasks connect to each other hence a new person joining the team can open the Airflow UI and understand the pipeline flow.
Workflows can be committed to Git, peer-reviewed, and rolled back if a mistake is made.

The Core Assets(Airflow Terminologies)

Before writing any Airflow code, its important to understand its building blocks and the main terms used in the Airflow world because they describe parts of a workflow system.

1. DAG
In Airflow, a full workflow is called a DAG(Directed Acyclic Graph).
Directed - the workflow moves in one direction. The process has a starting point and an ending point and does not move backward.
Acyclic - there are no loops. Since workflow must have a clear start and a clear end loops are not allowed since they create endless cycles and the pipeline might never finish running.
Graph - a structure made up of points and connections. The points are tasks and the connections are dependencies
A DAG is, therefore, a workflow made up of tasks arranged in a clear order and indicating how they connect with each other.
Example

with DAG(
    dag_id="stock_etl_dag",
    start_date=datetime(2026, 4, 20),
    schedule=timedelta(hours=1),
    catchup=False
) as dag:

2. Task
A task is one step inside a DAG or one job inside a pipeline. A task should usually do one clear job. Creating one huge task that does everything makes debugging hard thus work should be split into separate tasks.
Example

fetch = PythonOperator(
    task_id="fetch_stock_data",
    python_callable=fetch_stock
)

fetch is the task object, and fetch_stock_data is the task name shown in Airflow.

3. Operator
An operator is the tool used to create and run a task. Different operators are used for different types of jobs.

4. Dependency
A dependency defines the order of tasks by telling Airflow which task must run before another task. In simple terms, a dependency is the relationship between tasks.
Example

extract >> transform >> load

This means extract runs first, transform runs after extract succeeds and load runs after transform succeeds.
You can also define parallel dependencies to show which tasks should run simultaneously.
Example

download >> [clean_data, backup_data] >> send_email

This means download runs first, clean_data and backup_data run after download then send_email executes after both clean_data and backup_data finish.

5. Scheduler
The scheduler is the brain of Airflow which checks the DAGs and decides which tasks should run and when.
If the scheduler is not running, DAGs may appear in the UI but tasks may stay queued or show no status.
The scheduler constantly checks:

which DAGs exist
whether a DAG is due to run
whether a task’s upstream tasks have succeeded
whether a task should be queued
whether a failed task should retry
whether a DAG run is complete The scheduler does not usually execute the task itself but decides which task is ready and sends it to the executor.

6. Executor
The executor is the part of Airflow that decides how tasks are actually run. Different Airflow setups use different executors.
Common executors include:
SequentialExecutor - This runs one task at a time thus cannot run many tasks in parallel. It is simple and often used for learning or testing.

LocalExecutor - This runs tasks locally on the same machine, and it can run more than one task at the same time. It's useful when Airflow is installed on one server and you want tasks to run on that server.

CeleryExecutor - This is used for larger setups. The scheduler sends tasks to a queue, and workers pick them up and run them. This setup usually needs a message broker such as Redis or RabbitMQ.

KubernetesExecutor - This runs each task in a separate Kubernetes pod. It's more advanced and usually used in cloud or production environments.

NB: The scheduler decides that a task should run while the executor handles the running method(how Airflow runs tasks).

7. Worker
A worker is the process that actually executes tasks. This term is particularly important when using CeleryExecutor.
In a Celery setup, the flow looks like:

Scheduler >> Queue >> Worker >> Task runs

The scheduler decides the task is ready, the executor sends the task to a queue and the worker picks it up and runs it.
NB: The scheduler decides what should run while the worker does the actual execution.

8. XCom
XCom means cross-communication. It allows tasks to pass small pieces of data to each other. XCom can help pass data from one task to another.
XCom is for passing small messages between tasks, not for moving large datasets. Passing large datasets through XCom slows down Airflow and fills the metadata database.

In PythonOperator, you can push data to XCom:

kwargs["ti"].xcom_push(key="raw_data", value=data)

Then another task can pull it:

data = kwargs["ti"].xcom_pull(task_ids="extract", key="raw_data")

However, you can save large data somewhere else, then pass the location through XCom.
Example

extract task saves data to /tmp/raw_stock_data.csv
XCom passes "/tmp/raw_stock_data.csv"
transform task reads the file

9. Sensors
A Sensor is a special type of Operator that just waits.
Imagine you are expecting an important package. You stand at the window waiting for the mail truck. A Sensor does this in software. It can wait for a file to drop into a folder, or wait for another database to finish an update before letting the DAG move on to the next step.

10. Metadata Database
The metadata database is Airflow’s internal database and uses it to remember what happened(records results of a DAG).
It stores information such as:

DAGs
DAG runs
Task runs
Task states
Schedules
Retries
Users
Roles
Variables
Connections
XCom values

This database is very important because Airflow needs memory.
For example, Airflow needs to know:

Did this task succeed?
Did this task fail?
How many times has it retried?
When did the DAG last run?
What logs belong to this task?
What DAGs exist?
Which users can log in?

What Airflow is NOT

To fully understand Airflow, you also need to know its limits.
It's not a data streaming tool - Airflow is built for batch processing (running jobs every hour, every day, or every week). It is not designed to process live data happening by the second, like tracking live mouse clicks on a website.
It's not a data processing engine - Airflow is the manager, not the worker. You should not use Airflow to process a 100-gigabyte CSV file in its own memory. Instead, Airflow should send a command to a tool like Apache Spark or Snowflake to do the heavy lifting.

Conclusion

Apache Airflow may look difficult when you first encounter it because it comes with many technical jargons making data engineering feel more complicated than it really is. Airflow is simply a workflow manager which helps you organise work that must happen in a specific order. Apache Airflow is about control. It helps you control timing, order, failure, retries, logs, and monitoring.
Just like baking a cake, you must follow the right sequence. A data pipeline works the same way. You extract data, transform it, load it, check it, and sometimes send a notification. Each step depends on the previous one. Airflow gives you a clean way to define these steps and make sure they run correctly.

ETL vs ELT: Which One Should You Use and Why?

Lawrence Murithi — Sat, 11 Apr 2026 17:51:32 +0000

Introduction

Imagine you are running a massive kitchen. Every day, trucks arrive carrying raw ingredients from different farms. Some boxes have dirty potatoes, some tomatoes are bruised, and the meat needs to be separated from the bone.
Can you just throw all of this straight onto a customer’s plate? Definitely not. You have to wash, chop, season, and cook the ingredients first.

In the business world, data works the same way. Every day, companies generate tons of raw data from apps, websites, payment gateways, customer service logs etc. This raw data is usually dirty and messy. It has errors, missing fields and mismatched formats. Before it can be used for reporting or decision-making, it needs to be moved, processed and organized. This process of moving and cleaning data is called data integration.
The two main approaches are used in data integration are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). Although both methods aim to prepare data for analysis, they follow different steps and are suited for different situations.
If you are just stepping into data engineering, software engineering or backend development, ETL and ELT are common terms you will encounter.
This article explains both approaches in detail, compares them, and helps you understand when to use each one.

What is ETL?

ETL stands for Extract, Transform, Load. It is the traditional method used to move and prepare data.
The key idea in ETL is that data is cleaned and transformed before it is stored in the final system. This means that by the time the data reaches the data warehouse, it is already structured, organized, and ready for use.
This approach was developed at a time when computing resources were limited, and companies had to be very careful about what data they stored.

Steps in ETL

1. Extract
This step involves collecting raw data from different sources such as:

Databases
APIs
Excel files

In real-world scenarios, data rarely comes from a single source. A company may have customer data in one system, sales data in another, and marketing data in a third system. This extraction step pulls all this data together.

2. Transform
In this stage, data is processed in a separate system before being stored. This transformation step ensures that all data is consistent, accurate, and usable.

Common transformations include:

Standardizing data formats
Handling missing values
Removing duplicate records
Fixing errors in data
Masking sensitive data such as credit card numbers
Combining data from different sources

This step is where raw data is made meaningful. Without transformation, data would remain inconsistent and difficult to analyze.

3. Load
After transformation, the cleaned data is loaded into a data warehouse or database.
At this stage, the data is ready for carrying out analysis, creating dashboards and reporting.
Simple Diagram of ETL

Why ETL Was Popular

In the past, data warehouses were physical servers sitting in basements. Storage space was incredibly expensive and computing power was very limited. Companies, therefore, could not afford to store raw, useless data. They had to clean it up and shrink it down before loading it into the warehouse.

What is ELT?

ELT stands for Extract, Load, Transform. It is a modern approach made possible by cloud computing. Here data is loaded first and transformed later inside the data lake.
This approach takes advantage of modern systems that can store large amounts of data and process it quickly.

Steps in ELT

1. Extract
Data is collected from different sources just like in ETL.

2. Load
This is a major shift from ETL. Instead of first cleaning the data, you load the raw data directly into your target data lake without any changes.

3. Transform
The transformation happens inside the data lake. This means analysts can use the warehouse's own computing power to clean, format, and organize the data..

Simple Diagram of ELT

Why ELT Became Popular

The emergence of modern cloud data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift changed the game. Today, storing data in the cloud is incredibly cheap. Furthermore, these cloud warehouses have massive, scalable computing power.
Instead of buying a separate, expensive server just to transform data (like in ETL), companies no longer need to clean data before storing it. They can store everything and process it later.

Differences Between ETL and ELT

1. Order of Steps
in ETL, transformation happens before loading while in ELT transformation happens after loading.

2. Where the Transformation Happens
In ETL, transformation happens in a separate server outside the warehouse while in ELT, the transformation happens right inside the destination data warehouse.

3. Speed of Loading
ELT is usually much faster at the loading stage since there is no cleaning of the data. ETL takes longer because the data has to wait in line to be processed before it can be loaded into the warehouse.

4. Maintenance and Flexibility
ETL is less flexible and changes require rebuilding pipelines. If a mistake is made in an ETL pipeline, or if you want to format the data differently, you have to go back to the source, re-extract the data, and run it through the whole pipeline again.
With ELT, the raw data is already sitting in your warehouse. Any mistake during transformation, you simply write a new SQL query and transform the raw data afresh.

5. The Skills Required
ETL often requires specialized tools and programming such as software engineers who know Java, Python or drag-and-drop tools. ELT uses SQL and since the data is transformed inside a database, it is accessible to analysts.
NB:

ETL focuses on control, structure, and quality before storage
ELT focuses on speed, flexibility, and scalability after storage.

Advantages and Disadvantages

ETL

Advantages
Security and Compliance - If you are dealing with highly sensitive data (like medical records or credit cards), ETL allows you to strip out/mask the sensitive parts before storage in the main warehouse.
Reduced and cheaper Storage - Because you are only loading refined data, you take up much less storage space in your destination database.

Disadvantages
Rigid - Setting up an ETL pipeline takes a lot of time. If a source system needs to make a change, the whole ETL pipeline might break and need to be rewritten.
Bottlenecks - If you have massive amounts of data, the processing server can easily get overwhelmed and slow down the whole operation.

ELT

Advantages
Agility - Since raw data is loaded quickly and directly into the warehouse, analysts do not have to wait for engineers to build complex pipelines to access the raw data.
Future-Proof - Because you keep a copy of the exact raw data, reprocessing of raw data is always possible. You can also go back and answer new business questions that you hadn't thought of previously.
Scalability - Cloud warehouses are designed to scale automatically thus are able to support large datasets.

Disadvantages
Security Risks - Since you are loading raw, unfiltered data into your warehouse, you have to be careful about who has access to the warehouse if that data contains sensitive information such as passwords, personal addresses or financial details.
Higher computing costs - While cloud storage is cheap, cloud computing can get expensive. If you have bad SQL code running inefficient transformations inside your warehouse every hour, your monthly cloud bill will skyrocket.

ETL Tools

These tools are designed for structured, enterprise-level data pipelines.

Informatica
IBM DataStage
Talend

ELT Tools

Modern ELT uses different tools for each step:
These tools allow analysts to work directly with data using SQL.

Fivetran / Airbyte → Extract and Load
dbt (Data Build Tool) → Transform
Cloud Warehouses → Snowflake, BigQuery, Redshift

Real-World Use Cases

Banking System (ETL)
A bank handles sensitive data from mobile app banking, ATMs and physical branch locations. This data contains raw account numbers, account balances, passwords and PIN, personal details and financial transactions thus must be secured before storage.

E-commerce Startup (ELT)
An online store that wants to track user behavior will generate large amounts of data daily just from people clicking around their website, viewing products, adding items to carts etc. The marketing team thus has to constantly change what they want to measure. One week they may want to track abandoned carts while the following week they may want to track how long people look at a specific product. The business has to frequently change what it wants to analyze.

Which One Should You Use and Why?

If you are starting a new project and trying to choose between ETL and ELT, here is a practical guide to help you decide.
Choose ETL if
- You are bound by strict privacy laws - If you work with sensitive data (healthcare, banking), the ability to scrub data before it lands in a database should be key.
- Your system uses on-premise databases - If your company still keeps its servers in a physical server room, your database may not have high processing power required to do transformations internally hence you will need a separate ETL server.
- Your data source is unstructured - If you are extracting data from highly complex, old mainframes that output weird file types, standard ELT tools might not know how to read them. You will need a custom ETL script to decode and format the data before it can be saved.

Choose ELT if
- You are using a cloud data warehouse - If you have Snowflake, BigQuery, or Redshift, ELT is most convinient since it takes advantage of what you are already paying for.
- You work with large volumes of diverse data - If you are tracking millions of tiny events (like website clicks, product views or IoT sensor readings), pushing it directly to the cloud is the only way to keep up with the volume.
- You need flexibility in analysis and fast data processing - ELT allows data engineers to focus purely on moving data from point A to point B, while empowering data analysts to handle the business logic and formatting using SQL.

Conclusion

The debate between ETL and ELT is less about which one is better and more about matching your business needs, data size, and system architecture. Understanding both approaches helps you design better data pipelines and make smarter decisions when working with data.

Advanced SQL Techniques for Data Analytics Every Data Analyst Should Know

Lawrence Murithi — Thu, 09 Apr 2026 13:21:19 +0000

Introduction

In today’s data-driven environment, organizations rely heavily on data to make decisions. Businesses collect large amounts of information from different sources such as sales systems, customer platforms, and operational databases. However, raw data alone is not useful unless it can be analyzed and transformed into meaningful insights.

SQL (Structured Query Language) plays a central role in this process. It allows analysts to retrieve, clean, and analyze data stored in relational databases. While basic SQL skills are important, advanced SQL techniques are what truly enable analysts to solve complex business problems.

This article explains advanced SQL concepts in simple terms and shows how they are applied in real-world data analytics scenarios. The goal is to help you understand not just how to write SQL queries, but how to use them effectively in practical situations.

The Role of SQL in Data Analytics

SQL is the foundation of data analytics. Most business data is stored in databases, and SQL is the language used to interact with that data.

Data analysts use SQL to:

Extract data from databases
Filter and clean datasets
Combine data from multiple tables
Perform calculations and aggregations
Prepare data for reporting tools like Power BI

SQL is often the first step before using any visualization tools. If the data is not properly prepared using SQL, the final reports may be inaccurate or misleading.

Working with Complex Queries

As data becomes more complex, simple queries are not enough to handle it. Advanced SQL, therefore, introduces techniques that help break down complex problems into manageable steps.
In real-world data analysis, datasets are often large and contain multiple tables with different relationships. Moreover, analysts are expected to answer questions that involve comparisons, calculations and multiple layers of logic. These techniques therefore allow analysts to solve the problems step by step instead of trying to do everything in one single query.
Complex query techniques thus help analysts organize their queries in a way that is easier to understand, maintain, and scale.

They are useful when:

Comparing values against aggregated results
Reusing part of a query
Working with multi-step transformations
Simplifying long and confusing SQL statements

Some of the advanced SQL techniques include:

Subqueries

A subquery is a query inside another query. Subqueries are useful when you need to perform a calculation first and then use that result in another query. They allow you to embed logic directly inside your main query.

SELECT name
FROM employees
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
);

Explanation:
_- The inner query calculates the average salary

The outer query returns employees earning above average_

Subqueries can be used in different parts of a query:

In the WHERE clause
In the SELECT clause
In the FROM clause (called derived tables)

Real-World Case Scenarios:

Identify high-performing employees based on salary or performance metrics.
Finding customers who spend more than the average customer
Identifying products priced above the average price

NB: While subqueries are powerful, they can become slow if used incorrectly, especially with large datasets.

Common Table Expressions (CTEs)

A CTE is a temporary result in an SQL query that helps improves readability and organization(temporary table that exists only while the query is running).

CTEs allow you to define a query once and then use it in the main query. This makes complex queries easier to read and understand, especially when working with multiple steps.

WITH sales_summary AS (
    SELECT product_id, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product_id
)
SELECT *
FROM sales_summary
WHERE total_sales > 1000;

Types of CTEs:
- Recursive CTE: A specialized CTE that references itself, which is essential for querying hierarchical data like organizational charts or family trees.
- Non-Recursive CTE: The most common type, used to simplify standard queries by creating manageable logical steps.
Benefits:

Makes queries clean and easier to read
Breaks complex logic into steps, thus easier to debug and modify
Improves maintainability

NB: You can also have multiple CTEs in one query, which is useful for complex data transformations.

In business reporting, analysts often build layered queries. CTEs allow them to structure their logic clearly when working with large datasets.
Step 1: Calculate total sales per product
Step 2: Filter high-performing products
Step 3: Join with other tables for reporting

Advanced Joins

Joins are used to combine data from multiple tables. In advanced SQL, joins become more powerful when dealing with complex relationships.

SELECT c.customer_name, o.order_date, p.product_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN products p ON o.product_id = p.product_id;

In a retail company:

Customers table stores customer details
Orders table stores transactions
Products table stores product information

Using joins, analysts can create a full view of customer purchases.

Poor joins can lead to:

Duplicate data
Incorrect totals
Misleading reports

Window Functions

Window functions allow us to perform advanced calculations across a group of related rows while keeping the original data. They are useful for ranking, running totals, moving averages, and analytical reporting.
Window functions often remove the need for complex self-joins and provide an analytical layer within SQL.
Window functions:

Keep every row
Add calculated values to each row

SELECT column_1,
       function() OVER (
           PARTITION BY column
           ORDER BY column
       ) AS output_column
FROM table_name;

Window functions are widely used in business intelligence and reporting for:

Rankings within a group
Calculating running totals
Compare rows (current vs previous)
Analyzing trends over time

Companies use ranking to:

Identify top performers
Allocate bonuses
Compare employee performance

## Ranking employees

SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank
FROM employees;

Businesses use running totals to:

Track revenue growth
Monitor daily or monthly performance
Forecast future trends

## Running totals

SELECT date, sales,
SUM(sales) OVER (ORDER BY date) AS running_total
FROM sales;

Aggregations and Grouping

Aggregation helps summarize large datasets. Raw data is often too detailed to understand directly. Aggregation thus helps turn large datasets into meaningful summaries.

SELECT region, product_id, SUM(sales) as total_sales
FROM sales
GROUP BY region, product_id;

Aggregation allows analysts to answer questions such as:

Total sales by region
Sales by product category
Monthly revenue trends

Aggregation is often used together with:

Filtering (HAVING)
Sorting (ORDER BY)

Data Cleaning and Transformation

Data cleaning is one of the most important steps in analytics. Since raw data is usually dirty and messy, SQL helps clean and prepare it before analysis.
Raw data may contain:

Duplicates
Missing values
Incorrect formats
Inconsistent entries

Removing Duplicates

Removes repeated values and ensures each entry appears only once.

SELECT DISTINCT customer_id
FROM customers;

Handling Missing Values

Replaces NULL values with a default value thus preventing errors in reports

SELECT COALESCE(phone, 'Not Available')
FROM customers;

Data Transformation

Creates a new calculated column
Data transformation also includes:

Changing data types
Formatting dates
Standardizing values

SELECT price, quantity, price * quantity AS total_sales
FROM sales;

Using SQL for Real-World Business Problems

Advanced SQL is not just about writing queries but solving real problems.
In organizations, SQL is used daily to answer business questions and support decisions.

Customer Segmentation

Businesses use customer segmentation to:

Target high-value customers
Design marketing strategies
Improve customer retention

## Grouping customers based on spending

SELECT customer_id,
CASE 
    WHEN total_spent > 1000 THEN 'High Value'
    WHEN total_spent > 500 THEN 'Medium Value'
    ELSE 'Low Value'
END AS segment
FROM customer_sales;

Sales Performance Analysis

Total sales are calculated per product and sorted products by performance to identify best-selling products.

SELECT product_id, SUM(sales) AS total_sales
FROM sales
GROUP BY product_id
ORDER BY total_sales DESC;

Segmentation helps organizations to:

Understand performance
Identify opportunities
Solve operational problems

Performance Optimization

SQL queries must be clean, easy to understand and efficient.
In large databases, poor queries can slow down systems and delay reports.

Best Practices:

Use indexes on important columns to speed up data retrieval
Avoid selecting unnecessary columns
Filter data early to reduces data size
Use CTEs instead of repeated subqueries
Avoid unnecessary joins

Conclusion

Advanced SQL is a critical skill for data analysts. It goes beyond basic queries and allows analysts to work with complex datasets, perform advanced calculations and solve real-world business problems.

In this article, we explored key advanced SQL techniques such as subqueries, CTEs, joins, window functions, aggregations, and data transformation and how they are applied in real business scenarios

In data analytics, SQL is not just a tool but is a core skill that connects raw data to meaningful insights. Mastering advanced SQL allows analysts to move from basic reporting to deeper, more impactful analysis