<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gathuru_M</title>
    <description>The latest articles on DEV Community by Gathuru_M (@gathurum).</description>
    <link>https://dev.to/gathurum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F912835%2Fe5214cca-5a55-402c-8806-eabe9d7fc72c.png</url>
      <title>DEV Community: Gathuru_M</title>
      <link>https://dev.to/gathurum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gathurum"/>
    <language>en</language>
    <item>
      <title>Building a Real-Time Weather Streaming Pipeline with Apache Kafka 4 and Cassandra</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Mon, 29 Jun 2026 00:28:17 +0000</pubDate>
      <link>https://dev.to/gathurum/building-a-real-time-weather-streaming-pipeline-with-apache-kafka-4-and-cassandra-3343</link>
      <guid>https://dev.to/gathurum/building-a-real-time-weather-streaming-pipeline-with-apache-kafka-4-and-cassandra-3343</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/gathurum/introduction-to-apache-kafka-shifting-from-batch-processing-to-real-time-streaming-2e96"&gt;last article&lt;/a&gt;, we broke down the core concepts of Apache Kafka — moving away from scheduled batch jobs and shifting toward real-time event streaming.&lt;/p&gt;

&lt;p&gt;This article focuses on a practical Kafka project to help you understand the architecture better. The goal is simple: fetch live weather metrics from the &lt;strong&gt;OpenWeather API&lt;/strong&gt;, stream those events instantly through an &lt;strong&gt;Apache Kafka topic&lt;/strong&gt;, and consume them downstream to be stored in &lt;strong&gt;Apache Cassandra&lt;/strong&gt; for time-based analysis.&lt;/p&gt;

&lt;p&gt;To make this happen, we will be writing &lt;strong&gt;two standalone Python scripts&lt;/strong&gt;: a custom &lt;code&gt;weather_producer.py&lt;/code&gt; script to fetch and stream the data, and a separate &lt;code&gt;weather_consumer.py&lt;/code&gt; script to read the stream and save it to our database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Upgrading from Kafka 3.9.2 to Kafka 4.0 (KRaft Native)
&lt;/h2&gt;

&lt;p&gt;Before writing our scripts, we have to address our Kafka environment. In my earlier practice sessions, I was running &lt;strong&gt;Apache Kafka version 3.9.2&lt;/strong&gt;, which still relied heavily on an external service called Apache ZooKeeper to manage cluster metadata.&lt;/p&gt;

&lt;p&gt;However, with the release of &lt;strong&gt;Apache Kafka 4.x&lt;/strong&gt;, ZooKeeper support has been completely removed. Kafka now runs entirely in &lt;strong&gt;KRaft (Kafka Raft) mode&lt;/strong&gt;, managing its own internal metadata log natively. This simplifies things dramatically, meaning we only have to run a single Kafka process instead of managing two entirely different software setups on our machine.&lt;/p&gt;

&lt;p&gt;To get ready for this project, I had to completely clean out my old local environment. If you are following along on WSL or Linux, here is exactly how to uninstall version 3.9.2 and upgraded to Kafka 4.0:&lt;/p&gt;

&lt;h3&gt;
  
  
  i). Removing Kafka 3.9.2
&lt;/h3&gt;

&lt;p&gt;First, stop any running instances of ZooKeeper and Kafka, then wipe out the old binaries and the local data directories to prevent any legacy configuration conflicts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the old installation directory&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ~/kafka_2.13-3.9.2

&lt;span class="c"&gt;# Clear out local temporary data/log directories used by the old version&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/zookeeper
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/kafka-logs

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ii). Installing and Formatting Kafka 4.0
&lt;/h3&gt;

&lt;p&gt;KRaft requires you to explicitly generate a cluster ID and format your storage directory before launching the broker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download and extract Kafka 4.0&lt;/span&gt;
wget https://archive.apache.org/dist/kafka/4.0.0/kafka_2.13-4.0.0.tgz
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; kafka_2.13-4.0.0.tgz
&lt;span class="nb"&gt;cd &lt;/span&gt;kafka_2.13-4.0.0

&lt;span class="c"&gt;# Generate a unique cluster ID for KRaft&lt;/span&gt;
&lt;span class="nv"&gt;KRAFT_CLUSTER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;bin/kafka-storage.sh random-uuid&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Format your storage log directory using that ID&lt;/span&gt;
bin/kafka-storage.sh format &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nv"&gt;$KRAFT_CLUSTER_ID&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; config/kraft/server.properties

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, starting Kafka is as simple as running one single command targeting our KRaft configuration file:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bin/kafka-server-start.sh config/kraft/server.properties&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fujc1lf61aq35lr0kv520.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fujc1lf61aq35lr0kv520.png" alt="Kafka 4.0 starting up in KRaft mode" width="799" height="314"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt; Kafka 4.0 starting up in KRaft mode &lt;/em&gt;&lt;/center&gt;



&lt;h3&gt;
  
  
  Step 1: Creating the Weather Topic
&lt;/h3&gt;

&lt;p&gt;Before our scripts can send or read data, our new 4.0 broker needs a destination ready. Using the Kafka CLI tools, create a dedicated topic named &lt;code&gt;weather_updates&lt;/code&gt; with 3 partitions to allow for parallel consumption down the line:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bin/kafka-topics.sh --create --topic weather_updates --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxudwq8eekfwu5hohcb9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxudwq8eekfwu5hohcb9a.png" alt="Describing the  raw `weather_updates` endraw  topic" width="800" height="274"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt; Describing the &lt;code&gt;weather_updates&lt;/code&gt; topic &lt;/em&gt;&lt;/center&gt;



&lt;h3&gt;
  
  
  Step 2: Writing the Python Producer (&lt;code&gt;weather_producer.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Our first script is a standalone Python script called &lt;code&gt;weather_producer.py&lt;/code&gt;. Its sole job is to talk to the OpenWeather API, fetch the raw metrics, and hand the payload over to Kafka. We use the &lt;code&gt;kafka-python&lt;/code&gt; library to handle the client connection.&lt;/p&gt;

&lt;p&gt;Notice the continuous &lt;code&gt;while True&lt;/code&gt; loop. Unlike Airflow DAGs that run and stop, streaming application scripts run indefinitely in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# weather_producer.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_OPENWEATHER_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Nairobi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CITY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;appid=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Kafka Producer targeting our KRaft Broker port
&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Starting Weather Producer Loop ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c1"&gt;# Send payload to our Kafka topic
&lt;/span&gt;            &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_updates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sent event to Kafka: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching/sending data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Poll the API every 10 seconds to create a continuous stream
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjgjwxi76hihmn3jmnpfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fjgjwxi76hihmn3jmnpfa.png" alt="Sending event to Kafka topic" width="800" height="233"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt; Sending event to Kafka topic &lt;/em&gt;&lt;/center&gt;



&lt;h3&gt;
  
  
  Step 3: Setting Up Apache Cassandra For the First Time
&lt;/h3&gt;

&lt;p&gt;If you have never used &lt;strong&gt;Apache Cassandra&lt;/strong&gt; before, it is a distributed NoSQL database built specifically for lightning-fast write speeds and handling massive time-series data.&lt;/p&gt;

&lt;p&gt;Because it doesn't come pre-installed on standard Linux distributions, we will need to install it first - binary version. Here is the step-by-step process to set it up on my WSL environment for the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# 1. Navigate to your home directory&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-11-jdk &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to note that Cassandra 3.x and 4.x strictly require Java 8 or Java 11.&lt;/p&gt;

&lt;p&gt;Since we updated to Kafka 4.0 which requires Java 17 or Java 21, Cassandra will instantly crash on startup. &lt;br&gt;
Therefore, run the above command to download and install Java 11 side-by-side with your existing Java 17.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 2. Download the official Apache Cassandra binary tarball&lt;/span&gt;
wget https://archive.apache.org/dist/cassandra/4.1.4/apache-cassandra-4.1.4-bin.tar.gz

&lt;span class="c"&gt;# 3. Extract the tarball&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; apache-cassandra-4.1.4-bin.tar.gz

&lt;span class="c"&gt;# 4. Rename the folder to just 'cassandra' for cleaner navigation&lt;/span&gt;
&lt;span class="nb"&gt;mv &lt;/span&gt;apache-cassandra-4.1.4 cassandra

&lt;span class="c"&gt;# 5. Clean up the downloaded tar.gz file to save space&lt;/span&gt;
&lt;span class="nb"&gt;rm &lt;/span&gt;apache-cassandra-4.1.4-bin.tar.gz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, everything for Cassandra lives explicitly inside &lt;code&gt;~/cassandra&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Start Cassandra&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Ubuntu allows you to have multiple versions of Java installed at the same time. You can pick which one is active by running:&lt;br&gt;
&lt;code&gt;sudo update-alternatives --config java&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Look for the row that mentions java-11-openjdk. Type that selection number and hit Enter.&lt;/p&gt;

&lt;p&gt;To make sure your terminal is officially using the older version, run &lt;code&gt;java -version&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It should now say openjdk version "11.0.x"&lt;/p&gt;

&lt;p&gt;Because Cassandra here is a binary version, to start it up, we will execute the startup script directly from the folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/cassandra
bin/cassandra &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A large stream of startup logs will begin scrolling down your screen. The specific line &lt;code&gt;INFO  [main] ... Startup completed&lt;/code&gt; near the bottom confirms Cassandra started successfully.&lt;/p&gt;

&lt;p&gt;Once the logs stop moving and Cassandra stays open safely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Leave that terminal window completely alone (let it keep running).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open a brand-new terminal tab/window in WSL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate to your folder and run your cluster status tool:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/cassandra
bin/nodetool status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your clean grid return with &lt;code&gt;UN 127.0.0.1&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F14rf1mfhjs65mqgiuuah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F14rf1mfhjs65mqgiuuah.png" alt="Nodetool status" width="799" height="190"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt;Terminal output of nodetool status confirming Cassandra is awake&lt;/em&gt;&lt;/center&gt;



&lt;h3&gt;
  
  
  Step 4: Creating the Storage Schema
&lt;/h3&gt;

&lt;p&gt;Now that Cassandra is running, we can log into the Cassandra Query Language shell (&lt;code&gt;cqlsh&lt;/code&gt;) right from our command line to create our schema.&lt;/p&gt;

&lt;p&gt;We will create a &lt;strong&gt;Keyspace&lt;/strong&gt; (Cassandra’s version of a database schema) and a table structured specifically for historical analysis, sorting rows chronologically using a clustered timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cqlsh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the interactive shell, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE weather_analytics 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE weather_analytics;

CREATE TABLE city_weather (
    city_name text,
    timestamp timestamp,
    temperature float,
    humidity int,
    wind_speed float,
    PRIMARY KEY (city_name, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5: Writing the Python Consumer (&lt;code&gt;weather_consumer.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Our second standalone script is &lt;code&gt;weather_consumer.py&lt;/code&gt;. It runs completely independently from the producer, continuously listening to the &lt;code&gt;weather_updates&lt;/code&gt; topic, parsing the incoming JSON data using the &lt;code&gt;cassandra-driver&lt;/code&gt; client, and appending the records directly into Cassandra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# weather_consumer.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cassandra.cluster&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to local Cassandra instance
&lt;/span&gt;&lt;span class="n"&gt;cassandra_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cassandra_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_analytics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Subscribe to Kafka Topic
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_updates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;value_deserializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Weather Consumer Listening for Events ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract specific nested fields from OpenWeather JSON payload
&lt;/span&gt;    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;humid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;wind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wind&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;speed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Insert statement into Cassandra NoSQL table
&lt;/span&gt;    &lt;span class="n"&gt;insert_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    INSERT INTO city_weather (city_name, timestamp, temperature, humidity, wind_speed)
    VALUES (%s, toTimestamp(now()), %s, %s, %s);
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;insert_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;humid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wind&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully streamed and stored record for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Verifying the Streaming Pipeline Data Flow
&lt;/h2&gt;

&lt;p&gt;When both independent python scripts are active at the same time, the pipeline functions as a living unit. To confirm that our real-time records are landing correctly inside our NoSQL storage layer, we can log into the Cassandra shell (&lt;code&gt;cqlsh&lt;/code&gt;) and run a quick verification query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather_analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city_weather&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqmvphoelrr2bmnc89ox7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqmvphoelrr2bmnc89ox7.png" alt="Producer and Consumer scripts running concurrently" width="800" height="163"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt; Producer and Consumer scripts running concurrently &lt;/em&gt;&lt;/center&gt;





&lt;h2&gt;
  
  
  Major Lessons from Entering the Streaming Space
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upgrading simplifies infrastructure:&lt;/strong&gt; Dropping ZooKeeper and upgrading from 3.9.2 to 4.x made managing the local environment much smoother.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling creates stability:&lt;/strong&gt; If the Cassandra database goes down for maintenance, the Producer script doesn't care. It keeps pulling weather data and sending it to Kafka. Kafka will safely hold onto those messages until Cassandra recovers and the Consumer script turns back on. This completely prevents data loss!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What real-time data sources are you planning to stream in your next project? Let's brainstorm ideas in the comments below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Apache Kafka: Shifting from Batch Processing to Real-Time Streaming</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Fri, 19 Jun 2026 01:13:25 +0000</pubDate>
      <link>https://dev.to/gathurum/introduction-to-apache-kafka-shifting-from-batch-processing-to-real-time-streaming-2e96</link>
      <guid>https://dev.to/gathurum/introduction-to-apache-kafka-shifting-from-batch-processing-to-real-time-streaming-2e96</guid>
      <description>&lt;p&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt; in data, is the approach where a large amount of information is collected over a period of time and processed as a single unit while &lt;strong&gt;Streaming&lt;/strong&gt; is the approach where this information is processed individually in real-time.&lt;/p&gt;

&lt;p&gt;In Batch Processing, whether we are running python scripts manually or scheduling them with Apache Airflow, the core concept is the same: wait for a period of time, collect a chunk of data (like an hour's worth of news or crypto prices), process it, and load it into a database.&lt;/p&gt;

&lt;p&gt;Batch processing is recommended for reports, daily dashboards, and historical analysis. But what happens when a business needs answers in &lt;em&gt;milliseconds&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Consider these scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ride-Sharing Apps:&lt;/strong&gt; Uber or Bolt tracking a driver’s GPS coordinates second-by-second to update your ETA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial Security:&lt;/strong&gt; A bank analyzing a credit card swipe for fraud &lt;em&gt;before&lt;/em&gt; the transaction is approved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce:&lt;/strong&gt; Tracking clickstream data (every single scroll, click, and hover a user makes) to instantly update product recommendations on a homepage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't wait an hour for an Airflow DAG to trigger for these use cases. You need &lt;strong&gt;Event Streaming&lt;/strong&gt;, and that is exactly what &lt;strong&gt;Apache Kafka&lt;/strong&gt; is designed to handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a distributed event streaming platform. &lt;br&gt;
Instead of acting like a traditional database where data sits in tables waiting to be queried, Kafka handles data as a continuous, high-speed flow of messages (called &lt;strong&gt;events&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;It acts as a highly decoupled, fault-tolerant middleman between systems that produce data and systems that need to consume that data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Components of Kafka Architecture
&lt;/h2&gt;

&lt;p&gt;To manage real-time streams at scale, Kafka relies on a few key structural components:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Producers and Consumers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producers:&lt;/strong&gt; Applications that generate and send data into Kafka. For example, a mobile app sending user location updates, or a microservice publishing transaction details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumers:&lt;/strong&gt; Applications that subscribe to Kafka to read and process those incoming streams. For example, an analytics system calculating live traffic patterns, or a notification service sending an SMS receipt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key point here is that Producers and Consumers are completely independent. &lt;br&gt;
A producer doesn't know or care who is reading its data, which prevents your entire system architecture from becoming a tangled web of direct API connections.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Topics and Partitions
&lt;/h3&gt;

&lt;p&gt;Data within Kafka is organized into categories called &lt;strong&gt;Topics&lt;/strong&gt; (similar to a table in a traditional database). If you are tracking a logistics fleet, you might have a topic named &lt;code&gt;truck_gps_coordinates&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To handle massive scale, a single Topic is split into multiple pieces called &lt;strong&gt;Partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partitions are spread across different servers (called &lt;strong&gt;Brokers&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;This allows Kafka to achieve parallel processing — multiple consumers can read from different partitions of the same topic simultaneously, maximizing throughput.&lt;/li&gt;
&lt;li&gt;Inside a partition, every message is appended in a strict chronological order and given a unique sequential ID called an &lt;strong&gt;Offset&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Offsets: How Kafka Tracks Progress
&lt;/h3&gt;

&lt;p&gt;Unlike a traditional message queue, Kafka doesn't delete messages the moment a consumer reads them. Messages stay in Kafka for a configured amount of time (e.g., 7 days).&lt;/p&gt;

&lt;p&gt;Because the data stays put, a consumer uses its &lt;strong&gt;Offset&lt;/strong&gt; pointer like a bookmark to remember exactly which message it read last. If the consumer crashes, once it reboots, it checks its last committed offset and resumes reading exactly where it left off without losing a single event.&lt;/p&gt;




&lt;h2&gt;
  
  
  Local Setup: Zookeeper and Brokers
&lt;/h2&gt;

&lt;p&gt;When you start practicing with Kafka locally, especially inside a Linux environment like WSL — you quickly realize it requires a bit of infrastructure orchestration to spin up.&lt;/p&gt;

&lt;p&gt;Historically, Kafka relies on &lt;strong&gt;Apache Zookeeper&lt;/strong&gt; to act as the coordinator, managing the cluster, tracking which brokers are alive, and electing leaders for partitions.&lt;/p&gt;

&lt;p&gt;When launching Kafka via the CLI, you have to spin up Zookeeper first, and then launch your Kafka Broker service(Kafka Server).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F35y94a7iy65mef9x8ay6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F35y94a7iy65mef9x8ay6.png" alt="Running Zookeeper and Kafka Broker services in terminal" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt; Running Zookeeper and Kafka Broker services in terminal &lt;/em&gt;&lt;/center&gt;





&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In Batch (Airflow/Postgres):&lt;/strong&gt; Data is at rest. We run queries &lt;em&gt;over&lt;/em&gt; the static data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Streaming (Kafka):&lt;/strong&gt; Data is in motion. Our application logic stays active, and the data constantly flows &lt;em&gt;through&lt;/em&gt; our code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Understanding the architecture of topics, partitions, and offsets is step one. But as data engineers, we need to programmatically interact with this streaming cluster.&lt;/p&gt;

&lt;p&gt;In the next article, we are going to write Python scripts using a Kafka client library. We will build a custom &lt;strong&gt;Python Producer&lt;/strong&gt; to generate stream events and a &lt;strong&gt;Python Consumer&lt;/strong&gt; to read and display those events in real-time inside our setup.&lt;/p&gt;

&lt;p&gt;Are you moving into real-time streaming workflows yet, or sticking to batch pipelines? Let's discuss in the comments below!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Containerizing an ETL Pipeline with Docker Compose and Clean Git Hygiene</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Sun, 14 Jun 2026 15:52:16 +0000</pubDate>
      <link>https://dev.to/gathurum/containerizing-an-etl-pipeline-with-docker-compose-and-clean-git-hygiene-493n</link>
      <guid>https://dev.to/gathurum/containerizing-an-etl-pipeline-with-docker-compose-and-clean-git-hygiene-493n</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/gathurum/understanding-docker-for-data-engineering-18fg"&gt;last post&lt;/a&gt;, we broke down the core concepts of Docker and why packaging your code into standardized containers is ideal for avoiding "dependency hell."&lt;/p&gt;

&lt;p&gt;In this article, we look into developing an ETL project and running it on Docker using &lt;strong&gt;&lt;em&gt;docker-compose&lt;/em&gt;.&lt;/strong&gt; The project uses the &lt;strong&gt;Coinpaprika API&lt;/strong&gt; to fetch and normalize ticker data. &lt;br&gt;
We will also look at how to push the entire project to GitHub using &lt;strong&gt;atomic commits&lt;/strong&gt; to keep our version control clean. Let’s dive into how it works!&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Docker Compose?
&lt;/h2&gt;

&lt;p&gt;Instead of managing containers individually, Docker Compose allows us to define a multi-container application in a single file → &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our project requires two distinct services/containers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;db&lt;/code&gt; (PostgreSQL):&lt;/strong&gt; The data warehouse destination where our normalized coin data will be loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;etl_script&lt;/code&gt; (Python):&lt;/strong&gt; Our custom application container that sends requests to the Coinpaprika API, transforms the JSON response using &lt;code&gt;pandas&lt;/code&gt;, and pushes it to our database.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Step 1: Writing the Configuration Files
&lt;/h2&gt;

&lt;p&gt;To build this, we create a clean directory structure. First, write a &lt;code&gt;Dockerfile&lt;/code&gt; for the Python ETL script to ensure it has all the necessary packages installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; etl_script.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "etl_script.py"]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Keeping Secrets Secret with a &lt;code&gt;.env&lt;/code&gt; File
&lt;/h3&gt;

&lt;p&gt;Introduce a &lt;code&gt;.env&lt;/code&gt; file - a simple text file that holds sensitive environment variables.&lt;br&gt;
In this case, the file will hold our database configurations. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note that this file is not pushed to GitHub.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is what the &lt;code&gt;.env&lt;/code&gt; file looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env
DB_USER=crypto_admin
DB_PASSWORD=SuperSecretPassword123
DB_NAME=crypto_warehouse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Composing the Multi-Container Network
&lt;/h3&gt;

&lt;p&gt;Next, write the &lt;code&gt;docker-compose.yml&lt;/code&gt; file to define the database and the script together. &lt;br&gt;
Notice how it references the variables dynamically from our &lt;code&gt;.env&lt;/code&gt; file using the &lt;code&gt;${VARIABLE_NAME}&lt;/code&gt; syntax. Docker Compose automatically detects this file in the same directory and injects the values safely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crypto_postgres&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DB_USER}&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DB_PASSWORD}&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DB_NAME}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;

  &lt;span class="na"&gt;etl_script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crypto_etl_runner&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_HOST=db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_NAME=${DB_NAME}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_USER=${DB_USER}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD=${DB_PASSWORD}&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Networking in Docker:
&lt;/h4&gt;

&lt;p&gt;Under &lt;code&gt;etl_script&lt;/code&gt;, the &lt;code&gt;DB_HOST&lt;/code&gt; environment variable is set to &lt;code&gt;db&lt;/code&gt; instead of &lt;code&gt;localhost&lt;/code&gt;. Because Docker Compose spins up both containers on a shared default network, they can find each other using their service names as hostnames!&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Spinning it Up
&lt;/h2&gt;

&lt;p&gt;With our files defined, launching the entire multi-container architecture requires just one command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker automatically pulls the Postgres image, builds our custom Python image, injects our hidden &lt;code&gt;.env&lt;/code&gt; configurations, sets up the network isolation, and spins up both containers simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp31mivuwl2sqftrsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp31mivuwl2sqftrsb.png" alt="Terminal logs from running docker compose up" width="800" height="517"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;/em&gt;&lt;/p&gt;
&lt;center&gt;&lt;em&gt;Terminal logs from running docker compose up&lt;/em&gt;&lt;/center&gt;





&lt;h2&gt;
  
  
  Step 5: Practicing Clean Version Control (Atomic Commits)
&lt;/h2&gt;

&lt;p&gt;Instead of working for three hours and writing a giant, generic commit message like &lt;code&gt;"fixed code and added files"&lt;/code&gt;, let's practice how to write &lt;strong&gt;atomic commits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An atomic commit means each commit does exactly &lt;em&gt;one&lt;/em&gt; logical thing. It makes your GitHub commit history readable, and if something breaks, it’s very easy to roll back to the exact step where things went wrong.&lt;/p&gt;

&lt;p&gt;Here is an example of our commit timeline for this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;idx: Starting point, Initializing project scaffolding&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;infra: Added Docker Infrastructure: Dockerfile, docker-compose.yml and env vars&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feature: Add ETL script with CoinPaprika API integration and PostgreSQL connection handler&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volumes prevent data loss:&lt;/strong&gt; Adding the &lt;code&gt;volumes&lt;/code&gt; tag to the Postgres container ensures that even if we stop and destroy our containers using &lt;code&gt;docker-compose down&lt;/code&gt;, the actual crypto data stays saved safely on the hard drive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environments are modular:&lt;/strong&gt; Now, our configurations are completely decoupled from the code. If someone clones the repository from GitHub, the project won't leak any secrets, and they can easily plug in their own database credentials by creating their own local &lt;code&gt;.env&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the best as you continue learning Docker!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>docker</category>
      <category>git</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Docker for Data Engineering</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:52:02 +0000</pubDate>
      <link>https://dev.to/gathurum/understanding-docker-for-data-engineering-18fg</link>
      <guid>https://dev.to/gathurum/understanding-docker-for-data-engineering-18fg</guid>
      <description>&lt;p&gt;In your data engineering journey, you may have pipelines running locally inside your development environment, and it works beautifully, on &lt;em&gt;your&lt;/em&gt; machine.&lt;/p&gt;

&lt;p&gt;But what happens if you want to hand over a project to a colleague, deploy it to a shared server managed by your team leader, or push it to a cloud provider?&lt;/p&gt;

&lt;p&gt;Suddenly, a storm of errors appears. Like...:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;You don't have PostgreSQL 15 installed locally?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Your machine is running an older version of Python that doesn't support that syntax?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Your operating system is missing the specific database drivers needed for &lt;code&gt;psycopg2&lt;/code&gt;"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called &lt;strong&gt;Dependency Hell&lt;/strong&gt;. In data engineering, ensuring your pipelines run exactly the same way everywhere is just as important as writing the code itself. That is why we use &lt;strong&gt;Docker&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Container Analogy:
&lt;/h3&gt;

&lt;p&gt;Before the 1950s, shipping goods across the world was incredibly messy. Workers had to manually load barrels of oil, sacks, and crates of electronics onto ships. Every item was a different shape and size, making loading slow, inefficient, and prone to accidents.&lt;/p&gt;

&lt;p&gt;Then came the &lt;strong&gt;standardized shipping container&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It didn’t matter what was inside the box, whether it was cars, clothes, or frozen food, the shipping container was always the exact same size, had the same hooks, and fit perfectly on every ship, train, and crane in the world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker does exactly this for software.&lt;/strong&gt; Instead of shipping raw Python scripts and text files, Docker lets you package your application code, dependencies, runtime, and configurations into a single, standardized box called a &lt;strong&gt;Container&lt;/strong&gt;. If a machine can run Docker, it can run your container seamlessly, whether it’s a Windows laptop, a Mac, or a Linux server.&lt;/p&gt;

&lt;p&gt;A common question beginners ask is: &lt;em&gt;"Why not just use a Virtual Machine (VM) to isolate our code?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While VMs provide isolation, they are quite heavy. A VM copies an entire guest operating system (like a whole installation of Windows or Ubuntu), which consumes gigabytes of RAM and CPU before your code even starts running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Docker containers are lightweight.&lt;/em&gt;&lt;/strong&gt; They don't include a whole operating system; instead, they share the host machine’s operating system kernel and only pack the bare essentials (your app code and libraries). This means a container can spin up in seconds rather than minutes, using a fraction of the system resources.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Three Core Concepts in Docker
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The Dockerfile
&lt;/h4&gt;

&lt;p&gt;A text document containing step by step instructions on how to build a Docker environment. &lt;br&gt;
You specify the base image, install your python packages, and copy your scripts into it.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. The Docker Image
&lt;/h4&gt;

&lt;p&gt;When you run a build command on your Dockerfile, it compiles into a &lt;strong&gt;Docker Image&lt;/strong&gt;. This is a read-only blueprint of your environment. &lt;br&gt;
It contains all the snapshots of your libraries and setup files. You can share this image on platforms like Docker Hub.&lt;/p&gt;
&lt;h4&gt;
  
  
  3. The Docker Container
&lt;/h4&gt;

&lt;p&gt;When you run an image, it becomes a &lt;strong&gt;Container&lt;/strong&gt;. This is the active instance of Docker containing everything an application needs to run. &lt;br&gt;
You can start it, stop it, write data inside it, and delete it when you're done.&lt;/p&gt;



&lt;p&gt;Bleow is a guide to help you install Docker on Linux or on Windows if you have Windows Subsytem for Linux, (WSL)&lt;br&gt;
&lt;strong&gt;&lt;em&gt;-- Using Docker Engine&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Update the system and install prerequisites    &lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;    
    &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; ca-certificates curl    

    &lt;span class="c"&gt;# Step 2: Add Docker's official GPG key and repository    &lt;/span&gt;
    &lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 &lt;span class="nt"&gt;-d&lt;/span&gt; /etc/apt/keyrings  &lt;span class="c"&gt;# create the keyring directory&lt;/span&gt;
    &lt;span class="c"&gt;# Download Docker's GPG key&lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://download.docker.com/linux/ubuntu/gpg &lt;span class="se"&gt;\ &lt;/span&gt;   
      &lt;span class="nt"&gt;-o&lt;/span&gt; /etc/apt/keyrings/docker.asc  
    &lt;span class="c"&gt;# Set correct permissions on the key file&lt;/span&gt;
    &lt;span class="nb"&gt;sudo chmod &lt;/span&gt;a+r /etc/apt/keyrings/docker.asc    
     &lt;span class="c"&gt;# Add Docker's stable apt repository  &lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb [arch=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;dpkg &lt;span class="nt"&gt;--print-architecture&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="s2"&gt;   
      signed-by=/etc/apt/keyrings/docker.asc] &lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="s2"&gt;   
      https://download.docker.com/linux/ubuntu &lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="s2"&gt;   
      &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; /etc/os-release &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VERSION_CODENAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; stable"&lt;/span&gt; | &lt;span class="se"&gt;\ &lt;/span&gt;   
      &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/docker.list &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null    

    &lt;span class="c"&gt;# Step 3: Install Docker Engine (latest version)&lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update    
    &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker-ce docker-ce-cli containerd.io &lt;span class="se"&gt;\ &lt;/span&gt;   
      docker-buildx-plugin docker-compose-plugin    

    &lt;span class="c"&gt;# Step 4: Verify installation    &lt;/span&gt;
    docker &lt;span class="nt"&gt;--version&lt;/span&gt;    
    docker compose version    

    &lt;span class="c"&gt;# Step 5: Check if Docker is running &lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker   &lt;span class="c"&gt;# start the Docker daemon   &lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status docker  &lt;span class="c"&gt;# verify Docker is running&lt;/span&gt;
    &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;docker    &lt;span class="c"&gt;# enable Docker to start on boot    &lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you install Docker Desktop or run Docker commands via your terminal, you get an organized view of your environment. &lt;/p&gt;

&lt;p&gt;To confirm Docker is working, pull and run the hello-world image&lt;br&gt;
&lt;code&gt;docker run hello-world&lt;/code&gt;   &lt;/p&gt;

&lt;p&gt;What happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Docker checks locally for the 'hello-world' image&lt;/li&gt;
&lt;li&gt;Not found locally → pulls it from Docker Hub&lt;/li&gt;
&lt;li&gt;Creates a container from the image&lt;/li&gt;
&lt;li&gt;Runs it → prints the success message&lt;/li&gt;
&lt;li&gt;Container exits
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Core Docker Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="c"&gt;# ── Containers ──────────────────────────────────────────────────────────────    &lt;/span&gt;
    docker ps                    &lt;span class="c"&gt;# list RUNNING containers only    &lt;/span&gt;
    docker ps &lt;span class="nt"&gt;-a&lt;/span&gt;                 &lt;span class="c"&gt;# list ALL containers (running + stopped + exited)    &lt;/span&gt;
    docker run &amp;lt;image&amp;gt;           &lt;span class="c"&gt;# create and run a container from an image    &lt;/span&gt;
    docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu bash   &lt;span class="c"&gt;# run interactively (-i) with a terminal (-t)    &lt;/span&gt;
    docker stop &amp;lt;container_id&amp;gt;   &lt;span class="c"&gt;# gracefully stop a running container    &lt;/span&gt;
    docker &lt;span class="nb"&gt;rm&lt;/span&gt; &amp;lt;container_id&amp;gt;     &lt;span class="c"&gt;# remove a stopped container    &lt;/span&gt;
    docker &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;-aq&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;# remove ALL stopped containers    &lt;/span&gt;

    &lt;span class="c"&gt;# ── Images ──────────────────────────────────────────────────────────────────    &lt;/span&gt;
    docker images                &lt;span class="c"&gt;# list all locally stored images    &lt;/span&gt;
    docker pull python:3.10      &lt;span class="c"&gt;# download an image from Docker Hub without running it    &lt;/span&gt;
    docker rmi &amp;lt;image_id&amp;gt;        &lt;span class="c"&gt;# remove a local image    &lt;/span&gt;
    docker rmi &lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;-q&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# remove ALL local images    &lt;/span&gt;

    &lt;span class="c"&gt;# ── Building ─────────────────────────────────────────────────────────────────    &lt;/span&gt;
    docker build &lt;span class="nt"&gt;-t&lt;/span&gt; myapp &lt;span class="nb"&gt;.&lt;/span&gt;      &lt;span class="c"&gt;# build an image from Dockerfile in current folder    &lt;/span&gt;
                                 &lt;span class="c"&gt;# -t = tag (name) for the image    &lt;/span&gt;
                                 &lt;span class="c"&gt;# .  = path to the Dockerfile (current directory)    &lt;/span&gt;

    &lt;span class="c"&gt;# ── Running Interactively ────────────────────────────────────────────────────    &lt;/span&gt;
    docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu bash               &lt;span class="c"&gt;# run Ubuntu container with bash shell    &lt;/span&gt;
    docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;container_id&amp;gt; bash      &lt;span class="c"&gt;# open a shell inside a RUNNING container    &lt;/span&gt;

    &lt;span class="c"&gt;# ── System Info ──────────────────────────────────────────────────────────────    &lt;/span&gt;
    docker version               &lt;span class="c"&gt;# show Docker client and server (daemon) version    &lt;/span&gt;
    docker info                  &lt;span class="c"&gt;# detailed Docker system information&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What's Next?
&lt;/h4&gt;

&lt;p&gt;Understanding individual containers is the first step. However, in data engineering, projects rarely rely on just one thing. A project could need a &lt;strong&gt;Python environment&lt;/strong&gt; to run a script and a &lt;strong&gt;PostgreSQL database&lt;/strong&gt; to store the data.&lt;br&gt;
This will require two containers to run. Running those as separate containers manually and trying to link their networks together can get complicated.&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/gathurum/containerizing-an-etl-pipeline-with-docker-compose-and-clean-git-hygiene-493n"&gt;next article&lt;/a&gt;, we are going to look at &lt;strong&gt;Docker Compose&lt;/strong&gt; — a tool that lets us define and run multi-container applications using a single YAML file.&lt;br&gt;
We will package an entire ETL pipeline so that anyone can spin up a database and execution script with just one command: &lt;code&gt;docker-compose up&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>TaskFlow API vs. Traditional Operators: Practical Airflow ETL Pipeline</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Sun, 07 Jun 2026 19:56:12 +0000</pubDate>
      <link>https://dev.to/gathurum/taskflow-api-vs-traditional-operators-practical-airflow-etl-pipeline-962</link>
      <guid>https://dev.to/gathurum/taskflow-api-vs-traditional-operators-practical-airflow-etl-pipeline-962</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/gathurum/demystifying-apache-airflow-why-data-engineers-dont-just-use-cron-jobs-2pmk"&gt;last article&lt;/a&gt;, we went over the foundational pillars of Apache Airflow—DAGs, Tasks, and why orchestration beats manual scripts.&lt;/p&gt;

&lt;p&gt;For this practical Airflow project, we will build&lt;br&gt;
an ETL pipeline to aggregate market data from Massive API. But instead of just writing it one way, we look into writing &lt;strong&gt;two different versions of the same DAG&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Traditional Approach:&lt;/strong&gt; Using classic standard operators (&lt;code&gt;PythonOperator&lt;/code&gt;) and manual XCom pulling/pushing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Modern Approach:&lt;/strong&gt; Using the &lt;strong&gt;TaskFlow API&lt;/strong&gt; (&lt;code&gt;@dag&lt;/code&gt; and &lt;code&gt;@task&lt;/code&gt; decorators) to make the code clean and Pythonic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re confused about the difference or wondering which one you should use in your projects, let’s break down both.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Goal: Aggregating Data from Massive API
&lt;/h2&gt;

&lt;p&gt;The pipeline does three basic things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Pulls raw daily open/close asset data from our market API client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Normalizes the heavy, nested JSON payload into a clean structure using &lt;code&gt;pandas&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Inserts the structured records into a cloud PostgreSQL database.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Approach 1: The Traditional Way (Standard Operators)
&lt;/h2&gt;

&lt;p&gt;When Airflow was first built, you had to explicitly define every single task using an Operator class and manually stitch them together using the bitshift operators (&lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The trickiest part here is data sharing. Because tasks run in isolation, we have to use explicit &lt;strong&gt;XComs&lt;/strong&gt; (cross-communications) to pass our API payload from the extract task to the transform task.&lt;/p&gt;

&lt;p&gt;Here is how the traditional DAG looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="c1"&gt;# (Assume our custom API extraction and DB loading logic are imported here)
&lt;/span&gt;
&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Fetching data from our massive asset API
&lt;/span&gt;    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;65000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-06-07T00:00:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
    &lt;span class="c1"&gt;# We MUST explicitly push to XCom so the next task can see it
&lt;/span&gt;    &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_market_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# We MUST explicitly pull from XCom
&lt;/span&gt;    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_market_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transforming data for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Transformation logic goes here...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;  &lt;span class="c1"&gt;# Returning automatically pushes to default XCom
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;market_data_traditional_v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@hourly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_extract&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;transform_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transform_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_transform&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Stitching the pipeline dependency together
&lt;/span&gt;    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform_task&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Verdict on Traditional:
&lt;/h3&gt;

&lt;p&gt;It works perfectly, but it feels heavy. You have to write a lot of boilerplate code just to set up the tasks, and passing data requires explicitly passing the Task Instance (&lt;code&gt;ti&lt;/code&gt;) and remembering exact &lt;code&gt;task_ids&lt;/code&gt; string names. If you typo a string, your whole pipeline crashes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 2: The Modern Way (TaskFlow API)
&lt;/h2&gt;

&lt;p&gt;Introduced in Airflow 2.0, the &lt;strong&gt;TaskFlow API&lt;/strong&gt; uses Python decorators (&lt;code&gt;@dag&lt;/code&gt; and &lt;code&gt;@task&lt;/code&gt;). It fundamentally changes how we write pipelines by making Airflow handle XComs silently behind the scenes.&lt;/p&gt;

&lt;p&gt;Look at how clean this version is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;market_data_taskflow_v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@hourly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;market_data_etl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;65000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-06-07T00:00:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;  &lt;span class="c1"&gt;# No manual xcom_push! Airflow handles it.
&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# No manual xcom_pull! We just treat it like a regular Python variable.
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transforming data for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;

    &lt;span class="c1"&gt;# This single clean line sets up dependencies AND passes data!
&lt;/span&gt;    &lt;span class="n"&gt;market_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;market_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the DAG
&lt;/span&gt;&lt;span class="n"&gt;market_data_etl_dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;market_data_etl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Verdict on TaskFlow:
&lt;/h3&gt;

&lt;p&gt;This feels like writing native Python! You don’t have to instantiate &lt;code&gt;PythonOperator&lt;/code&gt; manually, and dependencies are implicitly built simply by passing outputs into inputs (&lt;code&gt;transform(extract())&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Seeing Both in the Airflow UI
&lt;/h2&gt;

&lt;p&gt;Once I deployed both DAG files to my local environment running on WSL, they popped up immediately in my Airflow Web UI dashboard.&lt;/p&gt;

&lt;p&gt;Even though the code looks completely different, &lt;strong&gt;Airflow creates the exact same visual graph architecture&lt;/strong&gt; for both under the hood.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Traditional DAG&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu79qtemb4ibphjhk3gv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu79qtemb4ibphjhk3gv.png" alt="Traditional DAG" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I ran them, checking the logs for the TaskFlow DAG showed how cleanly it handled the data context without any missing parameters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which One Should You Use?
&lt;/h2&gt;

&lt;p&gt;After building this massive API project using both paradigms, here is my takeaway for fellow beginners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use TaskFlow API&lt;/strong&gt; whenever you are working strictly with Python functions (&lt;code&gt;PythonOperator&lt;/code&gt;). It reduces boilerplate code significantly, makes your code highly readable, and saves you from the headache of managing raw XCom keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Standard Operators&lt;/strong&gt; when you need to interact with external tools directly via specialized operators (like &lt;code&gt;PostgresOperator&lt;/code&gt;, &lt;code&gt;S3CreateObjectOperator&lt;/code&gt;, or &lt;code&gt;BashOperator&lt;/code&gt;). TaskFlow is amazing for Python-native workflows, but standard classes are still essential for interacting with cloud infra and heavy enterprise databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that our pipeline can be scheduled and orchestrated automatically, another major question popped up: &lt;strong&gt;Where do we deploy this safely so it runs the exact same way on a remote server as it does on my local machine?&lt;/strong&gt; Right now, everything is relying on my specific local python setup and WSL configuration. If I send this code to a teammate, they might hit a wall of environment errors.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/gathurum/understanding-docker-for-data-engineering-18fg"&gt;next&lt;/a&gt; part of this series, we are diving into &lt;strong&gt;Docker&lt;/strong&gt; to containerize our database and ETL processes so they can run seamlessly anywhere!&lt;/p&gt;

&lt;p&gt;Which style do you prefer writing in Airflow? Let me know your thoughts in the comments!&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why use Apache Airflow? Quick Guide for Data Engineers</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Sun, 07 Jun 2026 18:02:19 +0000</pubDate>
      <link>https://dev.to/gathurum/demystifying-apache-airflow-why-data-engineers-dont-just-use-cron-jobs-2pmk</link>
      <guid>https://dev.to/gathurum/demystifying-apache-airflow-why-data-engineers-dont-just-use-cron-jobs-2pmk</guid>
      <description>&lt;p&gt;If you followed my &lt;a href="https://dev.to/gathurum/etl-pipeline-fetching-real-time-news-data-with-python-and-postgres-2c88"&gt;last post&lt;/a&gt;, we successfully built an ETL pipeline that fetched data from the News API, cleaned it with &lt;code&gt;pandas&lt;/code&gt;, and loaded it into a PostgreSQL database. It felt amazing to watch it run successfully in the terminal.&lt;/p&gt;

&lt;p&gt;But what if the News API goes down for 10 minutes? or what if my laptop is asleep when the script is supposed to run? &lt;/p&gt;

&lt;p&gt;In the real world, you can't just sit at your laptop and manually click "Run" on a Python script every day. You need automation, monitoring, and a way to handle failures. That is exactly where &lt;strong&gt;Apache Airflow&lt;/strong&gt; comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem:
&lt;/h2&gt;

&lt;p&gt;Before tools like Airflow, developers relied heavily on &lt;strong&gt;Cron jobs&lt;/strong&gt; (a built-in Linux tool used to schedule scripts at specific times). Cron is great for simple things, but it has huge blind spots for data engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Dependency Management:&lt;/strong&gt; If your "Transform" script takes longer than usual, your "Load" script might start running before the data is even ready, causing a massive crash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Visibility:&lt;/strong&gt; If a script fails at 3 AM, you won't know until you check the logs manually or notice empty tables the next morning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Easy Retries:&lt;/strong&gt; If a network glitch causes an API call to fail, Cron won't automatically try again 5 minutes later. You have to handle that messy logic yourself in Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow solves all of this by acting as the &lt;strong&gt;workflow orchestrator&lt;/strong&gt;. It doesn't actually store or process your data; instead, it acts as the manager telling your scripts exactly &lt;em&gt;when&lt;/em&gt; to run, &lt;em&gt;in what order&lt;/em&gt;, and &lt;em&gt;what to do&lt;/em&gt; if something breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Concepts Explained
&lt;/h2&gt;

&lt;p&gt;Let’s break down the four most important concepts using a simple analogy: &lt;strong&gt;Baking a Cake&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The DAG (Directed Acyclic Graph)
&lt;/h3&gt;

&lt;p&gt;Think of a &lt;strong&gt;DAG&lt;/strong&gt; as the entire recipe for your cake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directed:&lt;/strong&gt; It has a clear starting point and moves in a specific direction (you can't frost the cake before you bake it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic:&lt;/strong&gt; It cannot go in circles. Step C cannot loop back and trigger Step A, otherwise your pipeline would run forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph:&lt;/strong&gt; It's just a structural map of how your steps link together.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In data engineering, your DAG is the blueprint of your entire ETL pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Operators
&lt;/h3&gt;

&lt;p&gt;If the DAG is the recipe, &lt;strong&gt;Operators&lt;/strong&gt; are the kitchen appliances. They are the pre-built templates that define &lt;em&gt;what&lt;/em&gt; actually gets done. Airflow provides different types of operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PythonOperator&lt;/code&gt;: Used to execute a piece of Python code (like our &lt;code&gt;transform_data&lt;/code&gt; function).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PostgresOperator&lt;/code&gt;: Used to run SQL queries directly inside a Postgres database.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BashOperator&lt;/code&gt;: Used to run command-line terminal scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Tasks
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Task&lt;/strong&gt; is an operator that has been given specific instructions. It’s a single node inside your DAG. For example, using a &lt;code&gt;PythonOperator&lt;/code&gt; to run &lt;code&gt;extract_data()&lt;/code&gt; becomes the &lt;strong&gt;"Extract Task"&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. XComs (Cross-Communications)
&lt;/h3&gt;

&lt;p&gt;In our standalone Python script, passing data was easy: we just returned a value from one function and passed it into the next (&lt;code&gt;cleaned_df = transform_data(raw_data)&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In Airflow, tasks run completely independently. They can't easily talk to each other. &lt;strong&gt;XComs&lt;/strong&gt; are like little sticky notes that tasks use to pass small amounts of data or metadata down the line. One task "pushes" a note, and the next task "pulls" it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Peek at the Airflow UI
&lt;/h2&gt;

&lt;p&gt;The absolute best part of Apache Airflow is its user interface. Instead of staring at text scrolling through a dark terminal window, Airflow gives you a beautiful visual dashboard where you can see your pipelines running in real-time.&lt;/p&gt;

&lt;p&gt;When a pipeline runs successfully, the tasks turn a satisfying dark green. If a task fails, it turns red, making it incredibly easy to spot exactly where your pipeline broke.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Airflow Web UI Dags page &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3fzovxjgyvpcb07ukxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3fzovxjgyvpcb07ukxy.png" alt="Airflow Web UI Dags page Example" width="799" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What We’re Doing Next
&lt;/h2&gt;

&lt;p&gt;Now that we understand the foundational pillars of Airflow—DAGs, Operators, Tasks, and why we use them over simple cron jobs—it’s time to get our hands dirty.&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/gathurum/taskflow-api-vs-traditional-operators-practical-airflow-etl-pipeline-962"&gt;next article&lt;/a&gt;, we are going to break down an ETL project into Airflow Tasks and watch it run automatically inside the Airflow UI.&lt;/p&gt;

&lt;p&gt;Are you using Airflow or which other tools do you prefer for orchestration?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>ETL Pipeline: Fetching Real-Time News Data with Python and Postgres</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Sun, 07 Jun 2026 17:25:39 +0000</pubDate>
      <link>https://dev.to/gathurum/etl-pipeline-fetching-real-time-news-data-with-python-and-postgres-2c88</link>
      <guid>https://dev.to/gathurum/etl-pipeline-fetching-real-time-news-data-with-python-and-postgres-2c88</guid>
      <description>&lt;p&gt;The best way to actually understand data engineering is to build something that breaks, fix it, and watch it successfully run.&lt;/p&gt;

&lt;p&gt;In this article, we build an ETL pipeline that pulls data from the &lt;strong&gt;News API&lt;/strong&gt;, cleans it up using &lt;code&gt;pandas&lt;/code&gt;, and loads it into a local &lt;strong&gt;PostgreSQL&lt;/strong&gt; database.&lt;/p&gt;

&lt;p&gt;If you are a beginner Python developer or just getting into data engineering, this one is for you!&lt;/p&gt;




&lt;h2&gt;
  
  
  The Goal &amp;amp; The Architecture
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of code, let’s look at what we are actually trying to achieve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Connect to the News API using Python, fetch the top headlines about technology, and pull down the raw data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; The raw data comes back as a messy, nested JSON object. We'll use &lt;code&gt;pandas&lt;/code&gt; to flatten it, pick the columns we actually care about, handle missing values, and format the dates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Connect to a local PostgreSQL database and append our clean data into a structured table.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step 1: Setting Up the Database
&lt;/h2&gt;

&lt;p&gt;First, we need a place for our data to live. I used a PostgreSQL instance running on the cloud with Aiven.&lt;/p&gt;

&lt;p&gt;We need a clean target table. Here is the SQL script I used to create a simple &lt;code&gt;news_articles&lt;/code&gt; table. Notice how we have to be careful with our data types (like using &lt;code&gt;TIMESTAMP&lt;/code&gt; for dates and &lt;code&gt;TEXT&lt;/code&gt; for long URLs).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;news_articles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extracted_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is a diagram of the news_articles table created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp3bvuf9haczrig9pd43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp3bvuf9haczrig9pd43.png" alt="A diagram of the news_articles table created" width="748" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: The Python ETL Script
&lt;/h2&gt;

&lt;p&gt;To keep things clean and modular, I broke the code down into three distinct functions representing &lt;strong&gt;E&lt;/strong&gt;, &lt;strong&gt;T&lt;/strong&gt;, and &lt;strong&gt;L&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Make sure you have your dependencies installed before running this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests pandas psycopg2-binary sqlalchemy 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the full, documented script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_NEWS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://newsapi.org/v2/everything?q=technology&amp;amp;apiKey=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;DB_PARAMS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_db_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Starting Extraction ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; articles.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API Request Failed with status code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transforming Data
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="c1"&gt;# Create an empty list to store clean articles after looping
&lt;/span&gt;    &lt;span class="n"&gt;cleaned_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse through each dictionary to extract what we need
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;clean_article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publishedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publishedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;cleaned_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename column to match Postgres fields
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publishedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle missing values  
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Format dates
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transformation complete!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;db_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URI&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news_articles&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;con&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;if_exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data loaded successfully to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news_articles&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to load data to the database:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the pipeline
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cleaned_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL Pipeline Finished Successfully! 🎉&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pipeline failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Running the Pipeline and Verifying the Results
&lt;/h2&gt;

&lt;p&gt;When I first ran this script, I ran into a classic beginner issue: the date format coming from the API included a &lt;code&gt;Z&lt;/code&gt; at the end (e.g., &lt;code&gt;2026-06-07T06:00:00Z&lt;/code&gt;), which caused my local database to complain until I used &lt;code&gt;pd.to_datetime()&lt;/code&gt; to safely parse it.&lt;/p&gt;

&lt;p&gt;But once those quirks were ironed out, running the script in the terminal yielded beautiful logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;![Successful execution](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhq6zbzsct5asbdocdcf.png)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify it actually worked, I hopped back into my database client and ran a simple query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;news_articles&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there it was—real live web data, organized neatly into my own database tables!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zq9qdz2if544qtmq7t2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zq9qdz2if544qtmq7t2.png" alt="First 5 rows of extracted data from Postgres" width="799" height="319"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways from My First Project
&lt;/h2&gt;

&lt;p&gt;Building this taught me a few massive lessons that you don't get just by reading textbooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Data is messy:&lt;/strong&gt; You can almost never load API responses directly into a database. Nested dictionaries (like the &lt;code&gt;source&lt;/code&gt; field in this project) require explicit flattening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency matters:&lt;/strong&gt; If I run this script twice right now, it will duplicate all the articles. As I move forward, I need to look into how to handle duplicates (like checking for existing URLs before inserting).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Running this manually via a Python script is great for practice, but what if I want this data updated every single morning at 6 AM while I'm asleep? I can't just sit here and click "Run" manually.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;Data Orchestration&lt;/strong&gt; comes in. In my next article, we are going to look at &lt;strong&gt;Apache Airflow&lt;/strong&gt; and how we can take this exact code and turn it into an automated, scheduled workflow!&lt;/p&gt;

&lt;p&gt;Have you built an ETL pipeline before? What was the trickiest data cleaning issue you faced? Let’s chat in the comments!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>api</category>
      <category>etl</category>
      <category>beginners</category>
    </item>
    <item>
      <title>ETL vs. ELT: Which Approach Should You Use and Why?</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Thu, 14 May 2026 20:20:27 +0000</pubDate>
      <link>https://dev.to/gathurum/etl-vs-elt-which-approach-should-you-use-and-why-386d</link>
      <guid>https://dev.to/gathurum/etl-vs-elt-which-approach-should-you-use-and-why-386d</guid>
      <description>&lt;h3&gt;
  
  
  1. Introduction
&lt;/h3&gt;

&lt;p&gt;Understanding a company's data architecture can feel overwhelming, but once you cut out the noise, you will notice one of two operations taking place: &lt;strong&gt;ETL&lt;/strong&gt; or &lt;strong&gt;ELT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These two operations are the backbone of how data moves from a source (like an app or a database) to a destination (like a data warehouse). While they sound almost identical, the order of the letters changes everything about how a company manages its data. &lt;br&gt;
In this article, we will break down both approaches for better understanding and I will give my take on which is better.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What is ETL? (Extract, Transform, Load)
&lt;/h3&gt;

&lt;p&gt;ETL is the traditional way of handling data that follows the process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Pull data from various sources (Excel, SQL databases, APIs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Before the data reaches its final home, it is cleaned and formatted in a &lt;strong&gt;Staging Layer&lt;/strong&gt; (a temporary storage area). Business logic is applied here to make the data "useful."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; The cleaned, "ready-to-use" data is finally saved in the destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Characteristic:&lt;/strong&gt; The data is transformed &lt;em&gt;before&lt;/em&gt; it is stored.&lt;br&gt;
&lt;strong&gt;Common Tools:&lt;/strong&gt; Microsoft SSIS, Talend, Informatica.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. What is ELT? (Extract, Load, Transform)
&lt;/h3&gt;

&lt;p&gt;ELT is the more modern approach, with the same process as you would guess but in a different order.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Pull the raw data from the sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Instead of cleaning it first, you move the &lt;strong&gt;raw data&lt;/strong&gt; directly into a high-capacity storage system, like a Data Lake or a Data Warehouse (BigQuery, Snowflake).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; You perform the cleaning and modelling &lt;em&gt;after&lt;/em&gt; the data is already in its destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Characteristic:&lt;/strong&gt; Data grows into a historical archive. Since the raw data is always there, you can go back and re-transform it differently next year if your business needs change.&lt;br&gt;
&lt;strong&gt;Common Tools:&lt;/strong&gt; Fivetran or Airbyte (for loading), and &lt;strong&gt;dbt&lt;/strong&gt; (for the transformation part).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Comparison Table&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Order&lt;/td&gt;
&lt;td&gt;Transform before Loading&lt;/td&gt;
&lt;td&gt;Load before Transforming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Uses temporary Staging&lt;/td&gt;
&lt;td&gt;Uses permanent Data Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Rigid / Fixed&lt;/td&gt;
&lt;td&gt;Highly Flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;On-premise / Small data&lt;/td&gt;
&lt;td&gt;Cloud / Big Data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Which is Better?
&lt;/h3&gt;

&lt;p&gt;With cloud storage becoming cheaper and databases becoming more powerful, &lt;strong&gt;ELT would be the preferred option.&lt;/strong&gt; Here’s why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; ELT can handle massive "Big Data" sets that would crash a traditional ETL staging server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Because you store the &lt;em&gt;raw&lt;/em&gt; data first, you never lose information. In ETL, if you don't "transform" a column, it's gone. In ELT, you can decide to use that column later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; You can load data as often as you want without waiting for complex cleaning scripts to finish.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While ETL is still used for highly sensitive data or older systems, &lt;strong&gt;ELT is the best approach for modern, cloud-based data engineering.&lt;/strong&gt; It is more scalable, flexible, and allows for much deeper historical analysis.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to connect Power BI to a PostgreSQL Database</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Thu, 14 May 2026 17:20:31 +0000</pubDate>
      <link>https://dev.to/gathurum/connecting-power-bi-to-a-postgresql-database-a-practical-guide-4p6i</link>
      <guid>https://dev.to/gathurum/connecting-power-bi-to-a-postgresql-database-a-practical-guide-4p6i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Power BI&lt;/strong&gt; is a business analytics platform developed by Microsoft. It is mostly used by organizations to transform raw data from &lt;strong&gt;multiple sources&lt;/strong&gt; into interactive visualizations that support &lt;strong&gt;data-driven decision-making.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data sitting in various places in an organization—such as databases, Excel files, and cloud storage—can be collectively imported into Power BI to create a wholesome report. This allows data analysts and developers to extract insights that would otherwise be hidden in rows of text.&lt;/p&gt;

&lt;p&gt;Most companies connect Power BI directly to databases for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Data Access:&lt;/strong&gt; It gives visibility of the most current data available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; Reports can be developed and scheduled to refresh instantly, removing the need for manual updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency:&lt;/strong&gt; It saves analysts time by eliminating the need to generate new reports every time the underlying data changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; is a primary example of a SQL database. These are commonly used in companies as they offer a structured and secure environment to store, manage, and retrieve large volumes of data effectively.&lt;/p&gt;

&lt;p&gt;In this article, we will walk through connecting a Postgres Database to Power BI Desktop.&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 1: Connecting to a Local PostgreSQL Database
&lt;/h3&gt;

&lt;p&gt;To connect to a Postgres database sitting on your own machine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open Power BI Desktop.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the Home ribbon, click &lt;strong&gt;Get Data&lt;/strong&gt;, select &lt;strong&gt;PostgreSQL database&lt;/strong&gt; from the list, and click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Details:&lt;/strong&gt; For the Server, type &lt;code&gt;localhost&lt;/code&gt;. This tells Power BI to look at your own machine.&lt;br&gt;
&lt;strong&gt;Database Name:&lt;/strong&gt; Enter the specific name of your database (e.g., &lt;code&gt;sales_db&lt;/code&gt;).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ltj8lca5jqt6n9727p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ltj8lca5jqt6n9727p2.png" alt="Enter Connection Details" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Credentials:&lt;/strong&gt; When prompted, enter the username and password you created during PostgreSQL installation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After successful authentication, a popup appears showing the &lt;strong&gt;existing&lt;/strong&gt; tables in your database. Select the tables you need and click &lt;strong&gt;Load&lt;/strong&gt; to bring them into the Power BI environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 2: Moving to the Cloud (Aiven PostgreSQL)
&lt;/h3&gt;

&lt;p&gt;In most real work environments, data lives in the cloud so everyone can access it from anywhere. For this guide, we will be using &lt;strong&gt;Aiven&lt;/strong&gt;. Connecting from a cloud database requires a few extra security steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open Power BI Desktop.&lt;/strong&gt; Click &lt;strong&gt;Get Data&lt;/strong&gt;, select &lt;strong&gt;PostgreSQL database&lt;/strong&gt;, and click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obtain Details:&lt;/strong&gt; From your Aiven console, copy your &lt;strong&gt;Host, Port, Database Name, Username,&lt;/strong&gt; and &lt;strong&gt;Password&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download the Certificate:&lt;/strong&gt; In the list of connection details, download the &lt;code&gt;ca.pem&lt;/code&gt; file by clicking the download icon.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why SSL?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 When connecting to a cloud database, we use &lt;strong&gt;SSL certificates&lt;/strong&gt;. This acts as a secure tunnel for your data as it travels over the internet. SSL encrypts the connection so that malicious actors cannot "intercept" your credentials or your company's private data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ti3p8m8z3u577iu7ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ti3p8m8z3u577iu7ez.png" alt="Database Details" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enter database details:&lt;/strong&gt; Fill in the details obtained from Aiven.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19zh0gp0ujf0kqjtn1h3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19zh0gp0ujf0kqjtn1h3.png" alt="Power BI Connection Details" width="800" height="395"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Under server name, enter the host name and the port in the format: `host:port&lt;/em&gt;`&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSL Configuration:&lt;/strong&gt; You may receive an error because Power BI cannot automatically verify the cloud's CA certificate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1da5e99ye0q27z16ffhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1da5e99ye0q27z16ffhj.png" alt="Error Msssage" width="703" height="349"&gt;&lt;/a&gt;&lt;br&gt;
To resolve this, manually import the certificate to your Windows machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Press &lt;strong&gt;Windows + R&lt;/strong&gt;, type &lt;code&gt;certmgr.msc&lt;/code&gt;, and press Enter.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Trusted Root Certification Authorities&lt;/strong&gt;, then click &lt;strong&gt;Certificates&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right-click the folder, choose &lt;strong&gt;All Tasks&lt;/strong&gt; &amp;gt; &lt;strong&gt;Import&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browse to your &lt;code&gt;ca.pem&lt;/code&gt; file (change file type to "All Files" to see it). Select it and click &lt;strong&gt;Next&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finish the wizard. You will get a popup saying "Import Successful."&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj37oxbkqvhatqfm0tk1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj37oxbkqvhatqfm0tk1w.png" alt="Windows cerificates page" width="800" height="572"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Restart Power BI&lt;/strong&gt; to apply the changes. Your connection will now be successful, and your tables will be ready for loading.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Part 3: Building the Data Model
&lt;/h3&gt;

&lt;p&gt;Once your tables (&lt;strong&gt;Customers, Products, Sales, and Inventory&lt;/strong&gt;) are loaded, they appear in the "Data" pane. Now you must connect them.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftum56wleb27cbljqakze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftum56wleb27cbljqakze.png" alt="Loaded Data on Power BI" width="800" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Data Modeling&lt;/strong&gt; is the process of telling Power BI how these tables relate to one another. For example, the &lt;code&gt;Customer_ID&lt;/code&gt; in your &lt;strong&gt;Sales&lt;/strong&gt; table should link to the &lt;code&gt;ID&lt;/code&gt; in your &lt;strong&gt;Customers&lt;/strong&gt; table.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Benefit:&lt;/strong&gt; By creating these relationships, you can filter a chart by "Customer Name" and see exactly what "Products" they bought across all "Sales."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Power BI often creates these automatically, which you can verify in the &lt;strong&gt;Model View&lt;/strong&gt; pane.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvhm1agrcyovn6835qtd.png" alt="Relationships on Power BI" width="800" height="372"&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Conclusion: Why SQL is a Superpower for Power BI Analysts
&lt;/h3&gt;

&lt;p&gt;You might wonder, "If I have Power BI, do I still need SQL?" The answer is &lt;strong&gt;yes&lt;/strong&gt;. SQL skills are important because they allow you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Filter at the source:&lt;/strong&gt; Instead of bringing 1 million rows into Power BI and slowing it down, you can write a SQL query to only bring in the specific data you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preparation:&lt;/strong&gt; You can perform complex aggregations and clean up messy data before it even reaches your dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering the connection between SQL and Power BI is what turns a basic report into a powerful, automated business tool.&lt;/p&gt;

</description>
      <category>postgressql</category>
      <category>powerapps</category>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>SQL Joins Explained: Case Example</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Mon, 02 Mar 2026 18:43:59 +0000</pubDate>
      <link>https://dev.to/gathurum/sql-joins-explained-case-example-2l3</link>
      <guid>https://dev.to/gathurum/sql-joins-explained-case-example-2l3</guid>
      <description>&lt;p&gt;Structured Query Language(SQL) is a computer language for storing, manipulating, and retrieving data stored in a relational database.&lt;/p&gt;

&lt;p&gt;SQL Joins are like clauses used to combine records from two or more tables in a database, just as the name(join) means.&lt;/p&gt;

&lt;p&gt;In this article, you will learn, using a case example, the different types of joins, when, and how they are used. Be sure to check for “&lt;strong&gt;bonus joins”&lt;/strong&gt; included at the end of the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example Data
&lt;/h2&gt;

&lt;p&gt;We will use data from these 2 tables to show various ways to display employees from John Smith's company.&lt;/p&gt;

&lt;h3&gt;
  
  
  Departments Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Department_Id&lt;/th&gt;
&lt;th&gt;Department_Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Research&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Employees Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Employee_Id&lt;/th&gt;
&lt;th&gt;Full_Name&lt;/th&gt;
&lt;th&gt;Department_Id&lt;/th&gt;
&lt;th&gt;Job_Role&lt;/th&gt;
&lt;th&gt;Manager_Id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CEO&lt;/td&gt;
&lt;td&gt;Null&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CFO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;CIO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR Manager&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Sales Manager&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development Manager&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Support Manager&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR Representative&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Front-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Anthony Hird&lt;/td&gt;
&lt;td&gt;Null&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The code below shows the syntax for writing a JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Types of Joins&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are four major types of joins in SQL, as listed below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inner Join&lt;/li&gt;
&lt;li&gt;Left Join&lt;/li&gt;
&lt;li&gt;Right Join&lt;/li&gt;
&lt;li&gt;Full Join&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Inner Join&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It creates a result table that displays information &lt;strong&gt;common&lt;/strong&gt; between two tables based on a shared piece of information. &lt;/p&gt;

&lt;p&gt;It is the most important and frequently used of the joins.&lt;/p&gt;

&lt;p&gt;In this case, we could use it to display employees from the departments listed in the first table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Job_Role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;
 &lt;span class="c1"&gt;-- You can replace the keyword INNER JOIN with JOIN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Full_Name&lt;/th&gt;
&lt;th&gt;Job_Role&lt;/th&gt;
&lt;th&gt;Department_Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;CEO&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;CFO&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;CIO&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;HR Manager&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;Sales Manager&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;Development Manager&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;Support Manager&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;HR Representative&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;Front-End Developer&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Left Join&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;LEFT JOIN&lt;/em&gt; returns &lt;strong&gt;all&lt;/strong&gt; rows from the left table, plus matched values from the right table or NULL in case of no match.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The left table refers to the table that appears before the "LEFT JOIN" keywords in your SQL query. Same case applies when using RIGHT JOIN&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We could use left join to retrieve a list of all employees along with their department names. If an employee doesn't belong to a department, display NULL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Employee_Id&lt;/th&gt;
&lt;th&gt;Full_Name&lt;/th&gt;
&lt;th&gt;Department_Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;SALES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;SALES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Anthony Hird&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can use the COALESCE() function to replace NULL values with more relatable user-defined values.&lt;/p&gt;

&lt;p&gt;i.e. If an employee doesn't belong to a department, display "No Department" instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"No Department"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Department_Name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Employee_Id&lt;/th&gt;
&lt;th&gt;Full_Name&lt;/th&gt;
&lt;th&gt;Department_Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;SALES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;SALES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Anthony Hird&lt;/td&gt;
&lt;td&gt;No Department&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Right Join&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;SQL RIGHT JOIN&lt;/em&gt; returns all rows from the right table, even if there are no matches&lt;br&gt;
in the left table. Not so different from the LEFT JOIN.&lt;/p&gt;

&lt;p&gt;In this case, we could view employees and their various departments&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Department_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;COALESCE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"No Employee"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Full_Name&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Department_Name&lt;/th&gt;
&lt;th&gt;Full_Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research&lt;/td&gt;
&lt;td&gt;No Employee&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Full Join&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;SQL FULL JOIN&lt;/em&gt; combines the results of both left and right outer joins.&lt;/p&gt;

&lt;p&gt;The joined table will contain all records from both tables and fill in NULLs for missing&lt;br&gt;
matches on either side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Job_Role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Department_Id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EMPLOYEE_ID&lt;/th&gt;
&lt;th&gt;FULL_NAME&lt;/th&gt;
&lt;th&gt;JOB_ROLE&lt;/th&gt;
&lt;th&gt;DEPARTMENT_ID&lt;/th&gt;
&lt;th&gt;DEPARTMENT_NAME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;CEO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;CFO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;CIO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Executive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;HR Manager&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;Sales Manager&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;Development Manager&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;Support Manager&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;HR Representative&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;Front-End Developer&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Anthony Hird&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Research&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bonus Joins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Self Join&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;SELF JOIN&lt;/em&gt; is typically not a join type but a special way of joining a table to itself. You may want to combine rows in a table based on a related column present in the table.&lt;/p&gt;

&lt;p&gt;It is commonly used when you need to traverse a hierarchical structure where each row references another row in the same table &lt;/p&gt;

&lt;p&gt;OR&lt;/p&gt;

&lt;p&gt;When comparing data within rows in the same table.&lt;/p&gt;

&lt;p&gt;In this case, we will use self-join in the employees’ table to display the employee-manager relationship. Each employee record contains a reference to the manager's ID, allowing us to retrieve more information about the managers by adding another column, “Manager_Name”. In Data Science, this is an example of &lt;strong&gt;&lt;em&gt;feature engineering.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Job_Role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Manager_Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Full_Name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Manager_Name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Manager_Id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Employee_Id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get the results below&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Result&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;EMPLOYEE_ID&lt;/th&gt;
&lt;th&gt;FULL_NAME&lt;/th&gt;
&lt;th&gt;JOB_ROLE&lt;/th&gt;
&lt;th&gt;DEPARTMENT_ID&lt;/th&gt;
&lt;th&gt;MANAGER_NAME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;td&gt;CFO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;td&gt;CIO&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;td&gt;HR Manager&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;td&gt;Sales Manager&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sarah Goodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;td&gt;Development Manager&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;td&gt;Support Manager&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayne Ablett&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Matthew Swan&lt;/td&gt;
&lt;td&gt;HR Representative&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Michelle Carey&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Stephanie Richardson&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tony Grant&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chris Matthews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Jenna Lockett&lt;/td&gt;
&lt;td&gt;Front-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Michael Dunstall&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Jane Voss&lt;/td&gt;
&lt;td&gt;Back-End Developer&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Andrew Judd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Anthony Hird&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Natalie Rocca&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Danielle McLeod&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's most of what you need to know about SQL Joins. Hopefully, you found this information helpful. Feel free to share it with anyone who's having a hard time with Joins, and keep learning! 😊&lt;/p&gt;

&lt;p&gt;If you have any questions, please leave them in the comments section below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Turning Messy Data into Business Action</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Mon, 09 Feb 2026 02:07:28 +0000</pubDate>
      <link>https://dev.to/gathurum/turning-messy-data-into-business-action-3jj8</link>
      <guid>https://dev.to/gathurum/turning-messy-data-into-business-action-3jj8</guid>
      <description>&lt;h2&gt;
  
  
  How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI
&lt;/h2&gt;

&lt;p&gt;As a data analyst, whether in the corporate world or working on a personal project, you will find that data analysis is often driven by the &lt;strong&gt;need to solve a problem&lt;/strong&gt; or wanting an accurate, insightful view of data to make a better decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Define the Objective
&lt;/h3&gt;

&lt;p&gt;First, we need to come up with the objective of our analysis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ask Questions:&lt;/strong&gt; Write down the specific questions you want answers to from the analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Theme Your Dashboard:&lt;/strong&gt; If you need a dashboard, design one that maintains a clear theme. Don’t just show every kind of data to the user—focus on what matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Gathering the Data
&lt;/h3&gt;

&lt;p&gt;The data we need is often scattered. You might need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrape it online.&lt;/li&gt;
&lt;li&gt;Use data from survey results.&lt;/li&gt;
&lt;li&gt;Combine several Excel sheets.&lt;/li&gt;
&lt;li&gt;Fetch data from a database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, how do analysts translate this &lt;strong&gt;Messy Data&lt;/strong&gt; into action? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power BI&lt;/strong&gt; allows you to access data from various sources, load it, transform it, and analyze it—all in the same workspace.&lt;/p&gt;

&lt;p&gt;In Power BI, we follow the &lt;strong&gt;ELT approach&lt;/strong&gt; (Extract, Load, Transform).&lt;/p&gt;




&lt;h2&gt;
  
  
  The ELT Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extract and Load
&lt;/h3&gt;

&lt;p&gt;First, we start by &lt;strong&gt;Extracting&lt;/strong&gt; the required data. &lt;br&gt;
When you launch Power BI, you get a prompt to select a data source. Simply select where your data lies and import it into Power BI by &lt;strong&gt;Loading&lt;/strong&gt; it there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvo322w5lt36jliquox1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvo322w5lt36jliquox1.png" alt="Power BI Data Sources" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Transform
&lt;/h3&gt;

&lt;p&gt;Once all your data is loaded, the next step is &lt;strong&gt;Cleaning&lt;/strong&gt;. You want to remove all inconsistencies in your data by ensuring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Correct Data Types:&lt;/strong&gt; All fields (Dates, Decimals, Text) must be set correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle Missing Values:&lt;/strong&gt; Deal with nulls or blanks by either removing rows or filling them where appropriate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splitting/Merging:&lt;/strong&gt; Split or merge columns where necessary (e.g., splitting an "Address" column to get separate "Location" and "Street" fields).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardization:&lt;/strong&gt; Ensure consistency (e.g., making sure all currency data is in either Dollars or Shillings across the entire dataset).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Data Modelling: The Integration Step
&lt;/h2&gt;

&lt;p&gt;After the data is clean and transformed, the best way to integrate it is through &lt;strong&gt;Data Modelling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As discussed in my previous article &lt;a href="https://dev.to/gathurum/how-to-implement-data-modelling-in-power-bi-4p2h"&gt;Link&lt;/a&gt; , data modelling helps you create a clear structure and establish necessary relationships before you begin any plotting or calculations. This involves creating a proper &lt;strong&gt;Star Schema&lt;/strong&gt;, consisting of a Fact table and Dimension tables.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Always make sure to click &lt;strong&gt;"Close &amp;amp; Apply"&lt;/strong&gt; to save all changes made in the Power Query Editor before you start building your reports.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  DAX
&lt;/h2&gt;

&lt;p&gt;Once your data model is set up, you might find that the raw data doesn't provide all the answers immediately. This is where &lt;strong&gt;DAX (Data Analysis Expressions)&lt;/strong&gt; comes in. &lt;br&gt;
DAX is the formula language of Power BI, allowing us to calculate and generate new information in two primary ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Measures:&lt;/em&gt;&lt;/strong&gt; Use DAX to create summary aggregations. These calculate values on the fly, such as Total Sales, Year-over-Year Growth, Profit Margin, or Average Selling Price.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Calculated Columns:&lt;/em&gt;&lt;/strong&gt; Generate new columns within your tables to provide more granular information. For example, you could create a "Profit Status" column that labels each row as "Profitable" or "Loss" based on a calculation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DAX respects your model’s relationships. When you create a measure and plot it in a visual, it automatically reacts to the "Filter Context." This means the numbers will automatically filter and update based on the Dimensions (such as Date, Category, or Location) you use in your report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing for Answers
&lt;/h2&gt;

&lt;p&gt;A well-modeled dataset allows us to answer any question we might have. The best way to do this is to plot &lt;strong&gt;Visuals&lt;/strong&gt;. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spot Patterns Fast:&lt;/strong&gt; Visuals help you and your audience see trends or patterns immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell a Story:&lt;/strong&gt; After answering your initial questions, creating a dashboard tells a story about the problem you want to solve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A great dashboard should:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe the problem clearly.&lt;/li&gt;
&lt;li&gt;Illustrate insights found in the data.&lt;/li&gt;
&lt;li&gt;Suggest a potential solution to the problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following this flow, your analysis remains relevant and provides genuine value, rather than just showing the dashboard user what they already know.&lt;/p&gt;




</description>
      <category>analytics</category>
      <category>beginners</category>
      <category>ai</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Implement Data Modelling in Power BI</title>
      <dc:creator>Gathuru_M</dc:creator>
      <pubDate>Mon, 02 Feb 2026 21:25:14 +0000</pubDate>
      <link>https://dev.to/gathurum/how-to-implement-data-modelling-in-power-bi-4p2h</link>
      <guid>https://dev.to/gathurum/how-to-implement-data-modelling-in-power-bi-4p2h</guid>
      <description>&lt;p&gt;In this article we will explore the fundamental concepts of data modelling and specifically how to implement data modelling within Power BI for effective data analysis. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Modelling&lt;/strong&gt; is simply structuring or setting up your data in tables, relationships, and access to data for analysis scenarios.&lt;/p&gt;

&lt;p&gt;Often you'll find data distributed in several sheets or systems since, that is how it's maintained. If lucky your data could be in a neat setup &lt;em&gt;(Data Warehouse)&lt;/em&gt; where most of the structuring has already been done.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Case Scenario&lt;/em&gt;:
&lt;/h3&gt;

&lt;p&gt;Here is some sample data, to help make sense of this concept:&lt;br&gt;
&lt;a href="https://www.youtube.com/redirect?event=video_description&amp;amp;redir_token=QUFFLUhqbllqX1p4bnBVbUhVTVozSTJvMEFxZWV1azRWZ3xBQ3Jtc0ttdG53eFJac2w2am9Lb3lDVWlrMm10UFVkRkgyZFRHX25OaU9RcmlmUmpIRHlpZ05WcjR0WjlnUWN6S1BjcWZzQXRoX0ItbjJ3U3FKWFY5NndzQmQ0VkZNcjJrNGY4SzFOek5ROUFxQTRFdS1iUWROaw&amp;amp;q=https%3A%2F%2Fchandoo.org%2Fwp%2Fwp-content%2Fuploads%2F2024%2F10%2Fsample-chocolate-sales-data-all.xlsx&amp;amp;v=4ePNrdxWtY0" rel="noopener noreferrer"&gt;Sample Chocolate Sales Data&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Understanding the Tables
&lt;/h3&gt;

&lt;p&gt;In our case, we have 5 tables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table:&lt;/strong&gt; One table that contains pointers to other tables, but has no specific information. In our case, the &lt;em&gt;shipments table&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimension Tables:&lt;/strong&gt; Each explains one dimension of data. What is happening from the perspective of each dimension/entity. These will be the rest of our tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calendar Table:&lt;/strong&gt; A special Dimension table found in most data that holds the time component of analysis, i.e., [Timeseries, Forecasting or Trend analysis].&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Doing this in Power BI
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load Data:&lt;/strong&gt; Load the data in Power BI by selecting Excel Workbook as the data source. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidjk4u8k5m4pr41v418v.png" alt="Loading Data to Power BI" width="800" height="609"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Modelling:&lt;/strong&gt; Power BI automatically tries to match the tables and create a model; you can see this in the &lt;strong&gt;Model View&lt;/strong&gt;. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrrodf06j67g5i12v5y4.png" alt="Data Modelling in Power BI" width="800" height="487"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manual Relationship Setup:&lt;/strong&gt; 🤔 However, not all the data was modelled automatically. To finish up, drag the &lt;strong&gt;'Shipdate'&lt;/strong&gt; field from the Shipment table onto &lt;strong&gt;'cal_date'&lt;/strong&gt; in the Calendar table to create a new relationship and complete the look.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs48l55io96b3356hagye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs48l55io96b3356hagye.png" alt="Relationships in Power BI" width="765" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Notice that the schema now looks &lt;em&gt;Star-shaped&lt;/em&gt;. This kind of model with one fact table and multiple dimension tables is called a &lt;strong&gt;Star Schema&lt;/strong&gt;. The process of setting up this structure is &lt;strong&gt;Data Modelling&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manage Relationships:&lt;/strong&gt; You can view and edit all connections by clicking on &lt;strong&gt;Manage Relationships&lt;/strong&gt;. This menu shows the relationship status, specifically whether they are &lt;strong&gt;Active&lt;/strong&gt; or &lt;strong&gt;Inactive&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Table Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship Design:&lt;/strong&gt; The Products table shows each product in a row with more information from the Shipments table indicating a Many to One Relationship.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fact Table Design:&lt;/strong&gt; The shipments table has more records as compared to the dimension tables in our data. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, Fact tables are engineered to host many records and may also have fewer columns as compared to dimension tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Effective data modelling is the backbone of any powerful analysis. By transforming messy, multi-sheet data into a structured Star Schema, everything will feel fast and intuitive; when it's wrong, you spend more time fixing "broken" numbers than actually analyzing data.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>bigdata</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
