<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cliffe Okoth</title>
    <description>The latest articles on DEV Community by Cliffe Okoth (@cliffe_okoth).</description>
    <link>https://dev.to/cliffe_okoth</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3190513%2Fbf57608e-16ba-4b44-9550-71288e9e6f37.jpg</url>
      <title>DEV Community: Cliffe Okoth</title>
      <link>https://dev.to/cliffe_okoth</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cliffe_okoth"/>
    <language>en</language>
    <item>
      <title>How Apache Kafka Powers Real-Time Data Pipelines</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Mon, 18 May 2026 12:29:41 +0000</pubDate>
      <link>https://dev.to/cliffe_okoth/how-apache-kafka-powers-real-time-data-pipelines-3ef9</link>
      <guid>https://dev.to/cliffe_okoth/how-apache-kafka-powers-real-time-data-pipelines-3ef9</guid>
      <description>&lt;p&gt;Most standard data pipelines run on a schedule. You use tools like Airflow and dbt to extract and transform large batches of API data once a day. However, what would happen if the data wasn't collected but rather is being collected in the moment. &lt;/p&gt;

&lt;p&gt;This is where streaming comes in. You would require a system designed for continuous event streaming like Apache Kafka, an open-source distributed event streaming platform.&lt;br&gt;
It acts as a massive central nervous system, allowing data to flow continuously from source to destination.&lt;/p&gt;

&lt;p&gt;To understand how Kafka works, I'll break down its core concepts using a live streaming project: &lt;br&gt;
&lt;a href="https://github.com/Cliffe16/Lux_Assignments/tree/main/Kafka/open_weather_streaming" rel="noopener noreferrer"&gt;a pipeline that extracts real-time weather data from the OpenWeatherMap API and streams it directly into a Cassandra database.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pmkkz5dubqy0lzu83jx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pmkkz5dubqy0lzu83jx.png" alt="Batch vs Streaming" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at Kafka's architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  Broker
&lt;/h2&gt;

&lt;p&gt;Kafka does not run on a single machine, it is a &lt;strong&gt;distributed system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Broker&lt;/strong&gt; is a single Kafka server responsible for receiving, storing and serving messages.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Cluster&lt;/strong&gt; is a group of brokers working together. If one broker fails, the cluster ensures the data is replicated and safe elsewhere.&lt;/p&gt;

&lt;p&gt;In the project's code, you can see the connection to the broker defined via the &lt;code&gt;bootstrap_servers&lt;/code&gt; parameter pointing to &lt;code&gt;localhost:9092&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Events
&lt;/h2&gt;

&lt;p&gt;In Kafka, an event (also record or message) records the fact that 'something happened.' They consist of a key, value, timestamp and headers and cannot be updated or changed. In the streaming pipeline, this is the json response extracted from the weather api.&lt;/p&gt;
&lt;h2&gt;
  
  
  Topic
&lt;/h2&gt;

&lt;p&gt;Whereas in a database you insert data into a &lt;strong&gt;table&lt;/strong&gt;, in Kafka, you push data to a &lt;strong&gt;topic.&lt;/strong&gt; In the weather pipeline, the topic is simply defined as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_info&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every API response pulled will be published to this specific topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To ensure the system can scale horizontally and process millions of messages simultaneously, topics are split into &lt;strong&gt;Partitions&lt;/strong&gt;. They allow multiple consumers to read from the same topic in parallel.&lt;/p&gt;

&lt;p&gt;Within each partition, messages are assigned a unique sequential ID known as an &lt;strong&gt;Offset&lt;/strong&gt;. This allows consumers to track exactly where they left off in reading the stream, ensuring no data is skipped or read twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producer
&lt;/h2&gt;

&lt;p&gt;A producer is any application that publishes data to a Kafka topic. Their only job is to gather data and push it to the broker. For this project, &lt;code&gt;producer.py&lt;/code&gt; acts as the producer. It requests data from the weather api, receives a json payload, and sends it to the topic every 5 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;Kafka is designed for maximum throughput, meaning it doesn't process the internal structure of your data. To Kafka, a complex JSON payload or DataFrame is just an array of &lt;strong&gt;raw bytes&lt;/strong&gt; hence the &lt;code&gt;value_serializer&lt;/code&gt; argument in the code block above which transforms the data to Kafka-readable bytes. &lt;/p&gt;

&lt;p&gt;Conversely, when the data reaches its destination, it must be deserialized back into a readable format. This is why the consumer script includes a matching deserializer &lt;code&gt;value_deserializer&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumer
&lt;/h2&gt;

&lt;p&gt;A consumer subscribes to one or more topics, reads the stream of incoming records and processes them. In &lt;code&gt;consumer.py&lt;/code&gt;, the script acts as a continuous listener on the &lt;code&gt;weather-info&lt;/code&gt; topic. As soon as a new weather event arrives, the consumer receives the event, flattens the nested JSON, converts Unix timestamps into standard datetime formats, and executes an &lt;code&gt;INSERT&lt;/code&gt; statement to load the clean data into a Cassandra database table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

    &lt;span class="c1"&gt;# ... data flattening and timestamp conversion ...
&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;insert_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;unix_to_dt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunrise&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... other fields ...
&lt;/span&gt;        &lt;span class="n"&gt;sys_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike a standard Python &lt;code&gt;for loop&lt;/code&gt; that ends when it reaches the bottom of a list, a Kafka &lt;code&gt;for message in consumer&lt;/code&gt; loop is infinite. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5io27evjxhsiyg5sbvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5io27evjxhsiyg5sbvp.png" alt="Architeture" width="500" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why use Kafka?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Because, if the OpenWeatherMap API surges, sending exponentially more data, and you have a Python script writing directly to Cassandra, the database might become overwhelmed and crash, taking your entire pipeline down with it. &lt;/p&gt;

&lt;p&gt;Kafka on the other hand, acts as an indestructible shock absorber. The Producer can dump millions of records into the Kafka topic at lightning speed and Kafka will just take them. The Consumer will then read from the topic at its own pace, processing and inserting records into Cassandra as fast as it can without overwhelming the database. &lt;br&gt;
And even if the consumer crashes, Kafka remembers exactly where it left off, ensuring zero data loss when it restarts.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>kafka</category>
      <category>ai</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>From Local Scripts to Cloud Servers: Demystifying Docker for DataOps</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Tue, 12 May 2026 14:04:01 +0000</pubDate>
      <link>https://dev.to/cliffe_okoth/from-local-scripts-to-cloud-servers-demystifying-docker-for-dataops-38h</link>
      <guid>https://dev.to/cliffe_okoth/from-local-scripts-to-cloud-servers-demystifying-docker-for-dataops-38h</guid>
      <description>&lt;p&gt;"...But it works on my machine."&lt;/p&gt;

&lt;p&gt;If you spend enough time in data engineering or software development, you will inevitably hear this phrase. You might write a brilliant ETL script that works flawlessly on your laptop, but the moment you move that code to a cloud server, everything breaks. The server has the wrong version of Python, missing libraries or conflicting dependencies.&lt;/p&gt;

&lt;p&gt;This exact problem is why Docker exists.&lt;/p&gt;

&lt;p&gt;To understand how Docker works in the real world, we are going to break down its role in a live DataOps project: &lt;br&gt;
an automated NBA Analytics pipeline that extracts game statistics and transforms them using Apache Airflow and dbt.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is Docker?&lt;/em&gt;&lt;br&gt;
Docker is an open source platform for developing, shipping and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. Look at it this way:&lt;/p&gt;

&lt;p&gt;Instead of installing your code, libraries and tools directly onto a computer, you package them all into a template known as a &lt;strong&gt;Docker Image&lt;/strong&gt;. When you run this image, it forms a &lt;strong&gt;Container&lt;/strong&gt; which is an isolated environment.&lt;br&gt;
Because the container holds everything your application needs to run, you can drop it onto a laptop or a server of your choice, and it will run exactly the same way every single time.&lt;/p&gt;

&lt;p&gt;In this project, the orchestrator, Apache Airflow, is hosted on an Azure Virtual Machine. It is supposed to trigger a local worker to extract data, and then execute transformations using dbt SQL models inside Snowflake.&lt;/p&gt;

&lt;p&gt;This creates a massive dependency headache.&lt;/p&gt;

&lt;p&gt;Instead of manually installing Airflow on the Azure server and hoping for the best, Docker is initialized to create a container where Airflow is strictly pinned to version 2.10.0.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deconstructing the Dockerfile
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Dockerfile&lt;/strong&gt; contains a set of instructions on how to build a an image. Think of it as a recipe. &lt;/p&gt;

&lt;p&gt;Here is the exact Dockerfile used to build the Airflow orchestrator for this NBA project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; apache/airflow:2.10.0-python3.10&lt;/span&gt;

&lt;span class="c"&gt;# Step 1: Install system-level tools&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; build-essential

&lt;span class="c"&gt;# Step 2: Switch back to standard user for security&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; airflow&lt;/span&gt;

&lt;span class="c"&gt;# Step 3: Install Python packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=airflow:root requirements.txt /requirements.txt&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /requirements.txt

&lt;span class="c"&gt;# Step 4: Copy the dbt models into the container&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=airflow:root nba_analytics /opt/airflow/nba_analytics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break it down line by line:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;FROM apache/airflow:2.10.0...&lt;/code&gt;&lt;br&gt;
Every Dockerfile starts with a &lt;code&gt;FROM&lt;/code&gt; command. This tells Docker what "base image" to start with. Instead of building an operating system from scratch, we are telling Docker to go grab the official Apache Airflow 2.10.0 blueprint from its registry &lt;strong&gt;Docker Hub&lt;/strong&gt;. This instantly guarantees we bypass the version conflict issues mentioned earlier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;USER root&lt;/code&gt; &amp;amp; &lt;code&gt;RUN apt-get...&lt;/code&gt;: We temporarily switch to the administrative &lt;code&gt;root&lt;/code&gt; user to install &lt;code&gt;system tools&lt;/code&gt;, then safely switch back to &lt;code&gt;USER airflow&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;COPY&lt;/code&gt; &amp;amp; &lt;code&gt;RUN pip install&lt;/code&gt;: We copy the &lt;code&gt;requirements.txt&lt;/code&gt; file from our local computer into the container. The &lt;code&gt;RUN&lt;/code&gt; command then executes a terminal command to install all our necessary libraries. The &lt;code&gt;--no-cache-dir&lt;/code&gt; flag tells Docker not to save the leftover installation files, keeping the final container lightweight.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;COPY ... nba_analytics&lt;/code&gt;: By copying the &lt;code&gt;nba_analytics&lt;/code&gt; folder directly into the container, we ensure our orchestrator has immediate access to the SQL models it needs to run.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs2pn241qpwirnd8fqbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs2pn241qpwirnd8fqbd.png" alt="Docker Architecture" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Compose
&lt;/h2&gt;

&lt;p&gt;A Dockerfile is just the blueprint for a single service.&lt;/p&gt;

&lt;p&gt;However, enterprise tools like Apache Airflow are rarely just one service. Airflow, for instance, requires three separate services to function: a Scheduler, Webserver and Database. (&lt;a href="https://dev.to/cliffe_okoth/a-beginners-guide-to-apache-airflow-3-4mp9"&gt;More on Airflow here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;To spin up all of these services on our Azure VM, the project utilizes &lt;strong&gt;Docker Compose&lt;/strong&gt;. This requires a &lt;code&gt;docker-compose.yml&lt;/code&gt; file, which acts as a master blueprint. &lt;br&gt;
Here is a simplified look at how it defines our architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:13&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow&lt;/span&gt;

  &lt;span class="na"&gt;airflow-webserver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;airflow-scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of running long, complex terminal commands to start each piece manually, Docker Compose reads this YAML file and handles the networking automatically. &lt;/p&gt;

&lt;p&gt;To build the container, you only need to run one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker then downloads the database, builds your custom Airflow image using your Dockerfile, links them all together and boots up an isolated orchestration server. &lt;br&gt;
&lt;em&gt;The &lt;code&gt;-d&lt;/code&gt; flag simply tells it to run in "detached" mode, meaning it runs quietly in the background so you can continue using your terminal.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By containerizing the orchestrator, this data pipeline achieves perfect environment consistency. It doesn't matter if you deploy this project on an Azure VM, a Google Cloud instance or your laptop, Docker ensures that Airflow 2.10.0 and every other Python library are locked and ready to orchestrate your data.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Where Does Your Data Live? Decoding the Modern Data Ecosystem</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Sun, 03 May 2026 01:37:22 +0000</pubDate>
      <link>https://dev.to/cliffe_okoth/where-does-your-data-live-decoding-the-modern-data-ecosystem-2hlg</link>
      <guid>https://dev.to/cliffe_okoth/where-does-your-data-live-decoding-the-modern-data-ecosystem-2hlg</guid>
      <description>&lt;p&gt;If you are stepping into the world of data engineering or analytics, you have likely been hit with a wave of storage buzzwords like &lt;em&gt;data lake&lt;/em&gt; and &lt;em&gt;data warehouse&lt;/em&gt;. In this article, we will demystify these terms so you can understand exactly where your data belongs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Database
&lt;/h2&gt;

&lt;p&gt;Imagine you just launched a business. You need a system to record daily operations every time a customer buys a product, updates their password or submits a support ticket. This is the job of a standard &lt;strong&gt;Database&lt;/strong&gt;.&lt;br&gt;
A database is a collection of structured or unstructured data stored in a computer system, managed by a Database Management System (DBMS). &lt;br&gt;
Databases are most useful for small, atomic transactions and typically contain only the most up-to-date information. Common types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relational (SQL) Databases&lt;/strong&gt; for structured data as in tables with fixed rows and columns. Examples include Postgresql, MySQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-relational (NoSQL) Databases&lt;/strong&gt; for unstructured data like JSON (JavaScript Object Notation), documents. Examples include MongoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databases have the following core features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ACID Properties:&lt;/strong&gt; To guarantee absolute data integrity during transactions, databases adhere strictly to the ACID framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomicity:&lt;/strong&gt; Database transactions are treated as a single, "all-or-nothing" unit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; Data must seamlessly transition from one valid state to another without breaking the user defined rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation:&lt;/strong&gt; Multiple transactions can happen concurrently without interfering with one another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability:&lt;/strong&gt; Once a transaction is complete, the changes are permanent and irreversible, even if the system crashes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query Language:&lt;/strong&gt; Databases allow users to interact directly with the system using specific languages, most commonly SQL (Structured Query Language). This enables developers and analysts to easily retrieve, filter, aggregate or update information.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; Think of this like the index at the back of a textbook. Instead of forcing the system to scan an entire table, indexes act as structural shortcuts that allow the database to locate specific data instantly.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normalization:&lt;/strong&gt; This is the design practice of breaking down large datasets into smaller, interconnected tables. It eliminates duplicate information, reduces redundancy and keeps the database organized and efficient.  &lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Backup and Recovery:&lt;/strong&gt; To safeguard against hardware failures, software bugs or unexpected downtime, databases come equipped with robust mechanisms to safely back up and restore data.  &lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Data Modelling:&lt;/strong&gt; Designing a database requires a clear structural blueprint. This process moves through three phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual modelling&lt;/strong&gt; maps out the high-level data relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical modelling&lt;/strong&gt; adds the technical details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical modelling&lt;/strong&gt; translates that design into the actual working database schema. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use cases for databases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databases excel in scenarios that require real-time data handling and high transaction volumes. &lt;br&gt;
Key use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Transaction Processing:&lt;/strong&gt; Databases are built to execute immediate operations, such as processing payments at a retail point-of-sale (POS) system or handling financial transfers in banking.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customer Relationship Management (CRM):&lt;/strong&gt; They allow CRM platforms to manage real-time customer orders, interactions and support tickets.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Resource Planning (ERP):&lt;/strong&gt; Databases power the day-to-day operational software of businesses, managing records for everything from employee payroll to live inventory management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databases are perfect for storing records in real-time, but what happens when you want to compare current sales to those from five years ago? &lt;br&gt;
Running a massive historical query could cripple your business' active, database-dependent operations. &lt;br&gt;
To remedy this, a separate storage system dedicated to historical data should suffice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzwbqu92p3nlc3ni3jyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzwbqu92p3nlc3ni3jyq.png" alt="db_api" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Warehouse
&lt;/h2&gt;

&lt;p&gt;To solve the historical reporting problem, a data warehouse is used. Instead of handling real-time transactions, it stores massive amounts of structured, historical data from multiple sources to help organizations spot long-term trends and make data-driven decisions. &lt;br&gt;
It is usually denormalized to prioritize read operations ahead of write operations. These are the key features of a data warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Centralized Data:&lt;/strong&gt; Data warehouses consolidate information from multiple systems to give analysts a comprehensive, high-level view of the organization's data.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-Variant Data:&lt;/strong&gt; Data warehouses retain historical records, allowing businesses to analyze past performance, compare specific time periods, and identify long-term trends.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Denormalized Architecture:&lt;/strong&gt; Data is deliberately structured with fewer tables to minimize complex relationships, which drastically speeds up read performance and simplifies heavy analytical queries.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggregated Data:&lt;/strong&gt; Information is frequently summarized at various levels of detail, enabling analysts to quickly pull high-level overviews or drill down into granular metrics when necessary.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query Optimization:&lt;/strong&gt; To process massive analytical workloads efficiently, warehouses utilize advanced performance techniques such as indexing, data segmentation and materialized views.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BI Integration:&lt;/strong&gt; Data warehouses natively support and connect with Business Intelligence (BI) platforms to power interactive dashboards, robust reporting and data visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use cases for data warehouses
&lt;/h3&gt;

&lt;p&gt;Data warehouses are better suited for use cases that involve the analysis and reporting of large datasets. These use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Business Intelligence (BI):&lt;/strong&gt; Data warehouses consolidate large volumes of historical data, which is ideal for analytics, reporting and forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trend analysis and reporting:&lt;/strong&gt; Data warehouses are ideal for generating business reports, dashboards and exploring patterns over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictive analytics and data mining:&lt;/strong&gt; Data warehouses support advanced analytics that help businesses make data-driven decisions, such as predicting customer behavior or market trends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of data warehouses include: Amazon Redshift, Google BigQuery, Snowflake.&lt;/p&gt;

&lt;p&gt;Data warehouses are incredibly organized, but this rigid structure is a double-edged sword. While it guarantees clean, structured data, it leaves you with a problem, where do you put millions of messy, unstructured website click logs or raw JSON files?&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lake
&lt;/h2&gt;

&lt;p&gt;When data is too large or unstructured for a data warehouse, it gets dumped into a data lake. Here, data from disparate sources is stored in its original, raw format. &lt;br&gt;
Due to its storage flexibility, it acts as a playground for data scientists who train machine learning models on the data before it is fully structured. Like data warehouses, data lakes are not intended to satisfy the transaction and concurrency needs of an application. &lt;br&gt;
Key features of a data lake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support for diverse formats:&lt;/strong&gt; Handles data in formats like JSON and Parquet, accommodating a wide range of use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time analytics readiness:&lt;/strong&gt; Ideal for machine learning and advanced data science workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Horizontal scalability:&lt;/strong&gt; Uses cost-efficient storage solutions such as Amazon S3 or Azure Blob Storage, allowing seamless growth with increasing data volumes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of data lakes include: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage.&lt;/p&gt;

&lt;p&gt;As your hypothetical company grows, your Data Warehouse becomes massive. Now the Marketing team is complaining that it takes them too long to find the specific campaign metrics they need among all the finance, HR and engineering data.&lt;/p&gt;

&lt;p&gt;Enter the &lt;strong&gt;Data Mart&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdhiod411ll60u1xhig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fdhiod411ll60u1xhig.png" alt="Warehouse to mart" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Mart
&lt;/h2&gt;

&lt;p&gt;A data mart is a specialized, smaller-scale database designed to serve the specific needs of a single business unit such as marketing or finance. Its primary goal is to filter an organization's massive data pool into a highly focused, manageable repository for quick access.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Data Marts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are three main types of data marts, categorized by how they source their information and their relationship to a central data warehouse:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependent Data Marts:&lt;/strong&gt; These are directly partitioned from an enterprise's central data warehouse. Using this top-down approach, the data mart extracts a specific, predefined subset of the primary data whenever a department needs to run an analysis.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independent Data Marts:&lt;/strong&gt; These operate as fully standalone repositories without relying on a central data warehouse. Teams extract, process and store data directly from various internal or external sources.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Data Marts:&lt;/strong&gt; As the name implies, these blend the two approaches by pulling information from both an existing data warehouse and external operational systems. This provides the speed and structured interface of a top-down approach while maintaining the flexible integration of an independent setup.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Historically, companies had to maintain both a Data Lake (for raw, cheap machine learning storage) and a Data Warehouse (for fast, structured BI reporting). Moving data between the two was challenging and expensive. Recently, a new architecture emerged to bridge this gap: the &lt;strong&gt;Data Lakehouse&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lakehouse
&lt;/h2&gt;

&lt;p&gt;A data lakehouse is a modern hybrid architecture that combines the massive, cost-effective storage of a data lake with the robust data management capabilities of a warehouse. By bridging the gap between raw data storage and high-speed analytics, a lakehouse can simultaneously support unstructured machine learning workloads and structured Business Intelligence workflows.  &lt;/p&gt;

&lt;p&gt;Key Features of a Data Lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ACID Compliance:&lt;/strong&gt; Unlike traditional data lakes, lakehouses guarantee reliable transactions to maintain strict data consistency and integrity.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Schemas:&lt;/strong&gt; They support both "schema-on-write" and "schema-on-read". This gives engineers flexibility when ingesting raw data, while still providing a rigid, reliable structure when analysts need to query it.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native BI Integration:&lt;/strong&gt; Lakehouses connect seamlessly with popular Business Intelligence platforms like Tableau, Power BI, and Looker, making it easy for decision-makers to visualize their data directly from the source.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
There is no single "best" data storage solution, only the right tool for the job. In fact, a robust modern data ecosystem usually relies on these systems working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Database&lt;/strong&gt; captures the live sale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Lake&lt;/strong&gt; stores the messy, raw website logs of how the customer found you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Warehouse&lt;/strong&gt; analyzes five years of those sales trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your &lt;strong&gt;Data Mart&lt;/strong&gt; gives the marketing team instant access to only the metrics they care about.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>The Blueprint for Modern Data Orchestration</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Sat, 02 May 2026 00:40:22 +0000</pubDate>
      <link>https://dev.to/cliffe_okoth/a-beginners-guide-to-apache-airflow-3-4mp9</link>
      <guid>https://dev.to/cliffe_okoth/a-beginners-guide-to-apache-airflow-3-4mp9</guid>
      <description>&lt;p&gt;If the terms &lt;em&gt;orchestration&lt;/em&gt; or &lt;em&gt;Apache Airflow&lt;/em&gt; sound like intimidating data jargon, this article will help you cut through the noise and understand the basics.&lt;br&gt;
So, &lt;em&gt;what exactly is &lt;strong&gt;data orchestration&lt;/strong&gt;?&lt;/em&gt;&lt;br&gt;
In DataOps (Data Operations), it is the underlying system that manages data workflows (such as ETL pipelines) to ensure tasks run at the right time and in the correct sequence.&lt;br&gt;
For example, if data transformation depends on extraction, orchestration makes sure the extraction process runs to completion first.&lt;br&gt;
&lt;em&gt;What is a &lt;strong&gt;DAG&lt;/strong&gt;?&lt;/em&gt; A DAG is a model that contains all the &lt;em&gt;tasks&lt;/em&gt; to be run. DAG stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directed&lt;/strong&gt; meaning tasks have a specific direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic&lt;/strong&gt; meaning it has no circular dependencies — extraction cannot depend on transformation if transformation depends on extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph&lt;/strong&gt; meaning a collection of tasks (nodes) connected by dependencies (edges).
&lt;em&gt;What is a Task?&lt;/em&gt; This is a step in a DAG that describes a single unit of work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvpaevabfpliavzzopo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvpaevabfpliavzzopo.png" alt="DAG Diagram" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of the DAG as an orchestra conductor and the tasks as the instruments. &lt;br&gt;
To bring this orchestration to life, tools like Apache Airflow are used to define, schedule and monitor batch-oriented pipelines. &lt;br&gt;
An Airflow instance contains the following main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Scheduler&lt;/strong&gt; submits tasks to the executor and triggers scheduled workflows.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;DAG processor&lt;/strong&gt; reads DAG files and organizes them in the metadata database.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Webserver&lt;/strong&gt; is the Airflow User Interface for inspecting, triggering and debugging the behaviour of DAGs and tasks.&lt;/li&gt;
&lt;li&gt;A dedicated folder of &lt;strong&gt;DAG files&lt;/strong&gt; which contains the DAG, is read by the scheduler to figure out which tasks to run and when to run them.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Metadata Database&lt;/strong&gt; stores the state of tasks, DAGs and variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point you might be asking yourself, &lt;em&gt;Why not just use cron jobs?&lt;/em&gt; Well, think of cron jobs as an alarm clock and Airflow as a project manager. Cron just runs your script at a certain time with no regard for the task's dependencies. &lt;br&gt;
&lt;em&gt;Say you schedule &lt;code&gt;extract.py&lt;/code&gt; for 12:00 AM and &lt;code&gt;transform.py&lt;/code&gt; for 1:30 AM. If extraction takes 40 minutes, Cron will blindly trigger the transformation at 1:30 AM, leading to corrupted data or a crash.&lt;/em&gt; &lt;br&gt;
Airflow, acting as a project manager, understands this dependency; it waits for extraction to finish and will automatically retry the task if it times out or fails.&lt;br&gt;
To make sense of this jargon, below is an example of a simple DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.standard.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.standard.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt; 

&lt;span class="c1"&gt;# Step 1: Define your Python functions 
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Your logic here
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Set default arguments
&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# don't wait for previous DAG runs
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;# retry once if it fails
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Create DAG object
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;template_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# unique DAG identifier
&lt;/span&gt;    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# default args defined above
&lt;/span&gt;    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Template for new DAGs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="c1"&gt;# DAG description
&lt;/span&gt;    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# frequency of execution (you could use cron expressions for granularity)
&lt;/span&gt;    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# don't run for previous dates
&lt;/span&gt;    &lt;span class="n"&gt;max_active_runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;                   &lt;span class="c1"&gt;# run one instance at a time
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Define tasks
&lt;/span&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# unique task identifier
&lt;/span&gt;    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Python function to be executed
&lt;/span&gt;    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bash_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Set dependencies
&lt;/span&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the example above, we use Python to declare tasks and their dependencies. These instructions are then interpreted by the orchestration engine and run sequentially. This is what data engineers refer to as &lt;em&gt;Workflow As Code&lt;/em&gt;. &lt;br&gt;
The DAG above is defined using &lt;strong&gt;traditional operators&lt;/strong&gt; as in &lt;code&gt;PythonOperator&lt;/code&gt; and &lt;code&gt;BashOperator&lt;/code&gt;.&lt;br&gt;
However, this is not the only method used; Airflow has a built-in &lt;strong&gt;TaskFlow API&lt;/strong&gt; that defines DAGs using Python decorators, which  makes it easier to pass data between DAGs. &lt;br&gt;
Here is an example of a simple ETL pipeline using TaskFlow API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pendulum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the DAG using the @dag decorator
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;taskflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;taskflow_etl_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Extract: Task returns a dictionary 
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;data_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30.5, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 28.2, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 31.1}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Transform: Receives data directly from the upstream task
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;total_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Load: Final task to "load" or print the data
&lt;/span&gt;    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading data: Total value is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Define dependencies by calling the functions
&lt;/span&gt;    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the DAG
&lt;/span&gt;&lt;span class="nf"&gt;taskflow_etl_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;How can you tell if your DAG runs?&lt;/em&gt; Use the &lt;code&gt;airflow dags list&lt;/code&gt; command to check if it's been parsed by the scheduler. &lt;br&gt;
If not, use &lt;code&gt;airflow dags list-import-errors&lt;/code&gt; to check for syntax errors. Alternatively, you could check the user interface at &lt;code&gt;localhost:8080&lt;/code&gt;. &lt;br&gt;
To ensure configuration errors are avoided, use the following link for a step-by-step guide on installation and setup:&lt;br&gt;
&lt;a href="https://www.notion.so/Setting-up-Airflow-3-0-353cf979b71d80ffb7c6cd6f9d8fdc64?source=copy_link" rel="noopener noreferrer"&gt;Step by step guide on how to Install and Setup Apache Airflow&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;As your workflows grow in complexity, adhering to a few core principles will save you from scheduling nightmares and data corruption. Let's look at some of them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Idempotency:&lt;/strong&gt; A task should return the exact same outcome whether it is run once, twice or a hundred times for the same execution date.&lt;br&gt;
&lt;strong&gt;2. Atomicity:&lt;/strong&gt; Each task should perform one defined operation. This ensures modularity. If the transformation phase fails, you only need to retry that specific task instead of re-fetching all your raw data from the source. &lt;em&gt;See diagram below&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lxwgng66ln9ec2l9on2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lxwgng66ln9ec2l9on2.png" alt="Atomicity diagram" width="800" height="549"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;Left - monolith | Right - modular&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Encapsulation:&lt;/strong&gt; Only define the DAG structure at the top level. If you put heavy data processing, API calls or database queries in the global scope of your file, the scheduler will execute that code every single time it parses the file. This will crash your Airflow instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To sum everything up, Apache Airflow might seem intimidating at first, but at its core, it is simply a tool designed to bring order to chaos. By embracing orchestration, you transform isolated, manually run scripts into reliable, automated data pipelines. To recap the key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Orchestration&lt;/strong&gt; is essential to data pipelines, it ensures your data tasks run in the right sequence and at the right time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAGs are the blueprint&lt;/strong&gt;, they provide a map of your tasks and dependencies, ensuring no task runs out of order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airflow does the heavy lifting&lt;/strong&gt; by handling the logistics of executing and monitoring your tasks so you can focus on the logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow as Code:&lt;/strong&gt; Whether you use traditional operators or the modern, Pythonic TaskFlow API, you have the flexibility to define complex pipelines.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>dataops</category>
      <category>ai</category>
    </item>
    <item>
      <title>What is the difference between ETL and ELT?</title>
      <dc:creator>Cliffe Okoth</dc:creator>
      <pubDate>Fri, 10 Apr 2026 23:21:50 +0000</pubDate>
      <link>https://dev.to/cliffe_okoth/what-is-the-difference-between-etl-and-etl-3ok4</link>
      <guid>https://dev.to/cliffe_okoth/what-is-the-difference-between-etl-and-etl-3ok4</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Say you have data in a dozen different places, and you need it all in one spot, fully cleaned and ready for analysis. That is the core goal of &lt;em&gt;data integration&lt;/em&gt;. To get the job done, data engineers rely on two primary data pipeline architectures: &lt;strong&gt;ETL&lt;/strong&gt; (Extract, Transform, Load) and its modern alternative, &lt;strong&gt;ELT&lt;/strong&gt; (Extract, Load, Transform). While both move data from source to storage, the timing of how they process that data changes everything. Let's break down how they work.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL
&lt;/h2&gt;

&lt;p&gt;ETL(Extract, Transform, Load) is a data integration process that &lt;strong&gt;extracts&lt;/strong&gt; raw data from a single or multiple sources, &lt;strong&gt;transforms&lt;/strong&gt; this data into a usable format, then &lt;strong&gt;loads&lt;/strong&gt; the resultant data into a database where end-users can access it. &lt;br&gt;
What do these three processes entail?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; This is the first step of the process. It includes extracting data from target sources that can range from structured sources like databases (SQL, NoSQL), to semi-structured data (JSON, XML) to unstructured data (emails, flat files). 
It is crucial in this step, to gather data without altering its original format as it is processed in the next stage.&lt;/li&gt;
&lt;li&gt;** Transform:** In this step, data gets cleansed and restructured to meet operational needs. Data is usually not loaded directly into the data destination, it is first loaded into a &lt;em&gt;staging&lt;/em&gt; database (layer between the raw data and the clean data). This ensures a quick roll back in case something goes wrong in the pipeline. Common transformations include:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Filtering:&lt;/strong&gt; Removing irrelevant data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Sorting:&lt;/strong&gt; Organizing data into a required order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Aggregating:&lt;/strong&gt; Summarizing data to provide meaningful insights (e.g. average sales, total sales).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Load:&lt;/strong&gt; This is the final process where transformed data is uploaded to a target database where end-users can access it. Depending on the use case, there are two types of loading methods:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full Load:&lt;/strong&gt; All data is loaded into the target system, often used during the initial population of the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Load:&lt;/strong&gt; Only new or updated data is loaded, making this method more efficient for ongoing data updates.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;So, how does ETL work?&lt;/em&gt; Think of a modern ETL pipeline as a factory assembly line. The system doesn't wait to gather all the raw materials before starting production. Instead, it multitasks—extracting new data while simultaneously cleaning the previous batch and loading the finished product. How fast this assembly line moves depends entirely on the business's needs, generally falling into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing pipelines:&lt;/strong&gt; This is the most popular method where data is extracted, transformed and loaded periodically. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time processing pipelines:&lt;/strong&gt; This method depends on streaming sources for data, with transformations performed using a real-time processing engine like &lt;strong&gt;Spark&lt;/strong&gt;. Unlike batch processing which is scheduled, this method occurs in real time e.g fraud detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft85r7j5efl9mvbvkamqm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft85r7j5efl9mvbvkamqm.jpeg" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Real world use cases
&lt;/h3&gt;

&lt;p&gt;These are some of the ways ETL is used in the real world:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensor Data Integration:&lt;/strong&gt; Gathering raw, continuous data from multiple IoT sensors, filtering out anomalies, and moving the clean data to a single point where it can be analyzed for equipment maintenance.&lt;br&gt;
&lt;strong&gt;Cloud Migration:&lt;/strong&gt; Moving legacy data from an on-premise (client-managed) warehouse, transforming its structure to match modern schemas, and loading it into the new cloud platform.&lt;br&gt;
&lt;strong&gt;Marketing Data Integration:&lt;/strong&gt; Collecting campaign data from various distinct sources (like Facebook Ads, Google Ads, and email platforms), standardizing currency and date formats and preparing it for analysis before loading it into a final reporting destination.&lt;br&gt;
&lt;strong&gt;Database Replication:&lt;/strong&gt; Continuously extracting data from multiple operational databases, transforming it to unified schema and replicating it into a central data warehouse for reporting.&lt;/p&gt;

&lt;p&gt;These are some of the tools you could use for ETL:&lt;br&gt;
&lt;em&gt;Open-source tools&lt;/em&gt;: &lt;strong&gt;Apache Nifi&lt;/strong&gt;.&lt;br&gt;
&lt;em&gt;Commercial ETL tools&lt;/em&gt;: &lt;strong&gt;Informatica&lt;/strong&gt; and &lt;strong&gt;Microsoft SSIS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, for the longest time ETL was applauded for its &lt;strong&gt;data quality&lt;/strong&gt; and &lt;strong&gt;governance&lt;/strong&gt; capabilities ensuring data stored followed the outlined business requirements.&lt;br&gt;
However, as companies grew, this 'clean first, store later' approach led to &lt;strong&gt;scaling inefficiencies&lt;/strong&gt;. The pipeline became a bottleneck that frustrated data engineers with silent failures. &lt;br&gt;
This is where ELT came in.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELT
&lt;/h2&gt;

&lt;p&gt;ELT stands for "Extract, Load, Transform." In this process, the transformation of data occurs after it is loaded into storage. That means there's no need for data staging. &lt;/p&gt;

&lt;p&gt;The ELT process does not differ much from ETL, &lt;strong&gt;transformation&lt;/strong&gt; just comes &lt;strong&gt;after&lt;/strong&gt; data &lt;strong&gt;loading&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Real world use cases
&lt;/h2&gt;

&lt;p&gt;This is how ELT can be used in the real world:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile Lending Applications:&lt;/strong&gt; Ingesting massive volumes of raw, unstructured user and transaction data from a mobile lending app directly into a data lake then using the warehouse's computing power to transform specific segments of that data to train machine learning algorithms for credit scoring.&lt;br&gt;
&lt;strong&gt;Event Analytics:&lt;/strong&gt; Dumping massive volumes of raw website clickstream data or server logs directly into a cloud data warehouse as soon as they are generated. Transformations are only applied later when data analysts need to query specific user behaviors or run a security audit.&lt;br&gt;
&lt;strong&gt;Rapid Storing of Unstructured Data:&lt;/strong&gt; Loading new, completely unstructured data (like raw text, audio files, or social media feeds) directly into storage, providing immediate access to all raw information whenever it is needed for future  analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELT Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open-source tools&lt;/strong&gt;&lt;br&gt;
    * &lt;strong&gt;ELT Platforms:&lt;/strong&gt; Airbyte&lt;br&gt;
    * &lt;strong&gt;Orchestrators:&lt;/strong&gt; Apache Airflow&lt;br&gt;
    * &lt;strong&gt;Transformation Framework:&lt;/strong&gt; data build tool (dbt)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commercial tools&lt;/strong&gt;&lt;br&gt;
    * &lt;strong&gt;ELT Platforms:&lt;/strong&gt; Matillion, Hevo Data, Weld&lt;br&gt;
    * &lt;strong&gt;Connectors:&lt;/strong&gt; Fivetran&lt;br&gt;
    * &lt;strong&gt;Data Replication:&lt;/strong&gt; Stitch&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL vs. ELT
&lt;/h2&gt;

&lt;p&gt;The choice between ETL and ELT depends on several factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data complexity:&lt;/strong&gt; ETL is often used for complex transformations that require specialized tools and expertise.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skills and resources:&lt;/strong&gt; ETL requires specialized skills and resources for building and maintaining transformation pipelines. ELT may be easier to implement because it leverages the resources of cloud data warehouses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data volume:&lt;/strong&gt; ELT is generally better suited for large volumes of data because it leverages the processing power of cloud data warehouses for transformations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Target system:&lt;/strong&gt; ELT is best suited for cloud-based data warehouses and data lakes that have the processing power to handle transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To cap this off, in modern data engineering, transforming raw data into actionable insights requires robust data integration pipelines. The two dominant approaches for moving and preparing this data are ETL and ELT.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL (Extract, Transform, Load):&lt;/strong&gt; This traditional approach extracts raw data, cleans and structures it within an intermediate staging area, and finally loads it into a target database or data warehouse.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Enforcing strict data quality, ensuring regulatory compliance/governance and executing highly complex transformations—often used with legacy systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs:&lt;/strong&gt; Can suffer from scaling inefficiencies, rigid maintenance requirements and processing bottlenecks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;ELT (Extract, Load, Transform):&lt;/strong&gt; This modern approach extracts raw data and loads it directly into a data lake or cloud data warehouse without prior staging. Transformations are performed post-load, leveraging the massive computational power of the destination system.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Handling massive data volumes, quickly ingesting unstructured data and minimizing latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs:&lt;/strong&gt; Requires robust security measures to protect sensitive raw data and strict cataloging to prevent the data lake from degrading into an unmanageable mess.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In conclusion, the choice between the two processes depends heavily on one's specific needs. ETL remains the standard for complex transformations where data quality must be guaranteed prior to storage. Conversely, ELT has emerged as the preferred choice for modern, cloud-based environments dealing with massive, diverse datasets where speed and flexibility are the top priorities.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
