<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: peterson oaikhenah</title>
    <description>The latest articles on DEV Community by peterson oaikhenah (@nextwebb).</description>
    <link>https://dev.to/nextwebb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F148697%2Fbbacd135-b2ac-470a-9e88-39a83ffa500b.jpg</url>
      <title>DEV Community: peterson oaikhenah</title>
      <link>https://dev.to/nextwebb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nextwebb"/>
    <language>en</language>
    <item>
      <title>Data Engineering Foundations: A Hands-On Guide</title>
      <dc:creator>peterson oaikhenah</dc:creator>
      <pubDate>Mon, 13 Jan 2025 12:40:48 +0000</pubDate>
      <link>https://dev.to/nextwebb/data-engineering-foundations-a-hands-on-guide-49gd</link>
      <guid>https://dev.to/nextwebb/data-engineering-foundations-a-hands-on-guide-49gd</guid>
<description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;A practical guide to building a data engineering ETL pipeline&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hey there! If you’ve been curious about data engineering, this guide will help you understand the basics and walk you through practical examples. Whether it’s setting up storage, processing data, automating workflows, or monitoring systems, I’ll keep it simple, relatable, and fun! 😊&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Data Engineering?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data engineering involves organizing, processing, and automating data workflows to make raw data useful for analysis and decision-making. Here’s what we’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Where and how data lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt;: Cleaning and transforming raw data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Running workflows seamlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Ensuring everything runs smoothly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s dive into each step! 🚀&lt;/p&gt;
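
&lt;p&gt;Before we set anything up, here’s a tiny, self-contained sketch of the first three ideas in plain Python (monitoring comes later). The file names and the filter rule are placeholders for illustration, not part of the pipeline we’ll build:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A toy ETL run: extract rows from a file, transform them in memory,
# and load the result back to storage. Paths are illustrative.
import csv

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Keep only high-value sales, mirroring what we'll later do with Spark.
    return [r for r in rows if float(r["amount"]) &amp;gt; 1000]

def load(rows, path):
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("sales.csv")), "high_value_sales.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;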




&lt;h3&gt;
  
  
  &lt;strong&gt;Setting Up Your Environment and Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we start, let’s set the stage for what you’ll need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Unix-based system (macOS) or Windows Subsystem for Linux (WSL).&lt;/li&gt;
&lt;li&gt;Python installed (preferably version 3.11).&lt;/li&gt;
&lt;li&gt;PostgreSQL installed and running locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basic Command Line Knowledge&lt;/strong&gt;: We’ll use terminal commands for installations and setups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Basics&lt;/strong&gt;: Familiarity with Python syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Administrative Access&lt;/strong&gt;: To install and configure software.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv271jowdvjdogw28x7kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv271jowdvjdogw28x7kt.png" alt="Flow of the ETL process" width="716" height="1102"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The diagram above provides a high-level view of how the workflow’s components interact.&lt;br&gt;
This modular approach lets each tool do what it does best: Airflow for orchestration, Spark for distributed data processing, and PostgreSQL for structured data storage.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install Necessary Tools&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; brew update
 brew &lt;span class="nb"&gt;install &lt;/span&gt;postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;PySpark:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;apache-spark
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt; python &lt;span class="nt"&gt;-m&lt;/span&gt; venv airflow_env
 &lt;span class="nb"&gt;source &lt;/span&gt;airflow_env/bin/activate  &lt;span class="c"&gt;# macOS/Linux&lt;/span&gt;
&lt;span class="c"&gt;#https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt&lt;/span&gt;
 pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"apache-airflow[postgres]=="&lt;/span&gt; &lt;span class="nt"&gt;--constraint&lt;/span&gt; 
&lt;span class="s2"&gt;"https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.11.txt"&lt;/span&gt;

 airflow db migrate &lt;span class="c"&gt;#Initialize the Database&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiouzfqtrevgnhye8x3d5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiouzfqtrevgnhye8x3d5.png" alt="image" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we’re ready, let’s dive into each component! 💻&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1. Storage: Databases and File Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data storage forms the foundation of data engineering. Storage can be broadly categorized into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficiently organized.&lt;/li&gt;
&lt;li&gt;Features include search, replication, and indexing.&lt;/li&gt;
&lt;li&gt;Examples:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Databases&lt;/strong&gt;: Structured data (e.g., PostgreSQL, MySQL).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NoSQL Databases&lt;/strong&gt;: Schema-less (e.g., MongoDB, Redis).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;File Systems&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suitable for unstructured data.&lt;/li&gt;
&lt;li&gt;Limited features compared to databases.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Setting Up PostgreSQL&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start the Service&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   brew services start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaz275w6s6b2ni83rj70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaz275w6s6b2ni83rj70.png" alt="Start the Postgresql Service" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Create a Database, Connect to It, and Create a Table&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;
   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;item_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Insert Sample Data&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Laptop'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-10'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Phone'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-12'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe43ppim3gmwfn3ssn8wq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe43ppim3gmwfn3ssn8wq.png" alt="Create a Database &amp;amp; Insert Sample Data" width="800" height="382"&gt;&lt;/a&gt;x&lt;/p&gt;

&lt;p&gt;Now your data is safely stored in PostgreSQL! 🏠&lt;/p&gt;
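
&lt;p&gt;As a quick sanity check, you can read the rows back from Python. This is a minimal sketch assuming the &lt;code&gt;psycopg2-binary&lt;/code&gt; package is installed and your local PostgreSQL accepts your OS user without a password; adjust the connection details to your setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Verify the inserted rows from Python (pip install psycopg2-binary).
# Connection details are assumptions; add user/password if you need them.
import psycopg2

conn = psycopg2.connect(dbname="sales_data", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT item_name, amount, sale_date FROM sales;")
    for item_name, amount, sale_date in cur.fetchall():
        print(item_name, amount, sale_date)
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;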




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Processing: PySpark and Distributed Computing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Processing frameworks enable you to transform raw data into actionable insights. Apache Spark, with its distributed computing model, is one such framework that’s widely used.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch and Stream Processing&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Batch: Processes data in chunks.&lt;/li&gt;
&lt;li&gt;Stream: Processes data in real time (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Tools&lt;/strong&gt;: Apache Spark, Flink, Kafka, Hive.&lt;/li&gt;

&lt;/ul&gt;
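
&lt;p&gt;To make the batch/stream distinction concrete, here’s a hedged sketch of reading the same CSV data both ways with PySpark. The &lt;code&gt;incoming_sales/&lt;/code&gt; directory is a hypothetical landing folder; Structured Streaming also requires an explicit schema, so we borrow it from the batch read:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same source, two processing modes. Paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchVsStream").getOrCreate()

# Batch: read everything that exists right now, in one chunk.
batch_df = spark.read.csv("sales.csv", header=True, inferSchema=True)
batch_df.show()

# Stream: watch a directory and process new files as they arrive.
stream_df = (
    spark.readStream
    .schema(batch_df.schema)          # streams need an explicit schema
    .option("header", True)
    .csv("incoming_sales/")           # hypothetical landing directory
)
query = stream_df.writeStream.format("console").start()
query.awaitTermination(timeout=30)    # stop after 30s for this demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;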

&lt;h4&gt;
  
  
  &lt;strong&gt;Processing Data with PySpark&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;1). &lt;strong&gt;Install Java and PySpark&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   brew &lt;span class="nb"&gt;install &lt;/span&gt;openjdk@11 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;apache-spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2). &lt;strong&gt;Load Data from a CSV&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;sales.csv&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   id,item_name,amount,sale_date
   1,Laptop,1200,2024-01-10
   2,Phone,800,2024-01-12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python script to load and process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

   &lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DataProcessing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

   &lt;span class="c1"&gt;# Load CSV File
&lt;/span&gt;   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv9karvsbmshqg8iqzyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv9karvsbmshqg8iqzyt.png" alt="Python script to load and process" width="800" height="194"&gt;&lt;/a&gt;&lt;br&gt;
   &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlz7sb6bki2y1f22us8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlz7sb6bki2y1f22us8.png" alt="Spark UI" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3). &lt;strong&gt;Filter High-Value Sales&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# Filter sales above $1000
&lt;/span&gt;   &lt;span class="n"&gt;high_value_sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;high_value_sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6735bsoixg9jja5ew2ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6735bsoixg9jja5ew2ej.png" alt="Filter High-Value Sales" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6735bsoixg9jja5ew2ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6735bsoixg9jja5ew2ej.png" alt="Spark UI - High-Value Sales" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4). &lt;strong&gt;Set up the PostgreSQL JDBC driver&lt;/strong&gt;:&lt;br&gt;
If you don’t already have the driver, download it directly from the PostgreSQL website:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -O https://jdbc.postgresql.org/download/postgresql-42.6.0.jar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This will download the driver to your current working directory.&lt;/p&gt;

&lt;p&gt;Before running the script, verify the .jar file path:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l /path/to/postgresql-42.6.0.jar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Once you locate the driver file, use its full path in your script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     #Add the PostgreSQL JDBC driver path
     spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.jars", "/path/to/postgresql-42.6.0.jar") \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5). &lt;strong&gt;Save Processed Data Back to PostgreSQL&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;     &lt;span class="c1"&gt;#Add the PostgreSQL JDBC driver path
&lt;/span&gt;     &lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DataProcessing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.jars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/postgresql-42.6.0.jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

   &lt;span class="n"&gt;high_value_sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:postgresql://localhost:5432/sales_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.postgresql.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;dbtable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_value_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuh03rw45s0epln2p698.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuh03rw45s0epln2p698.png" alt="Save Processed Data Back to PostgreSQL" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtkd5zcnqv71gg1w2nwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtkd5zcnqv71gg1w2nwy.png" alt="high_value_sales table" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5k1d30glba5zffq5rew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5k1d30glba5zffq5rew.png" alt="Spark UI - Save Processed Data Back to PostgreSQL" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
Great work! You’ve processed data using Spark. 🚀&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3. Automation: Airflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Automation helps in managing workflows by setting up dependencies and schedules. Tools like Airflow, Oozie, and Luigi simplify this process.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Automating ETL with Airflow&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;1). &lt;strong&gt;Initialize Apache Airflow&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   airflow db migrate

   airflow &lt;span class="nb"&gt;users &lt;/span&gt;create &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--firstname&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--lastname&lt;/span&gt; User &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--role&lt;/span&gt; Admin &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--email&lt;/span&gt; admin@example.com
   airflow webserver &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxapec4u1kmmi3gk60dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxapec4u1kmmi3gk60dq.png" alt="Initialize apache Airflow" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxapec4u1kmmi3gk60dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxapec4u1kmmi3gk60dq.png" alt="Create airflow admin user" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2). &lt;strong&gt;Create a Workflow (DAG)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

  &lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alerts@example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
     &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spark_etl_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;#Update with the absolute path to your script
&lt;/span&gt;    &lt;span class="n"&gt;spark_etl_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spark_etl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3 /absolute/path/to/sales.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with the actual path to sales.py
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#Example additional tasks for data verification
&lt;/span&gt;    &lt;span class="n"&gt;verify_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;verify_results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verification complete!&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#Define the workflow
&lt;/span&gt;    &lt;span class="n"&gt;spark_etl_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;verify_task&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this DAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;schedule_interval='@daily'&lt;/code&gt; runs the ETL pipeline once a day.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;bash_command&lt;/code&gt; executes the PySpark ETL script (&lt;code&gt;sales.py&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Email alerts notify the owner in case of failures.&lt;/li&gt;
&lt;/ul&gt;
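
&lt;p&gt;Before deploying, you can smoke-test the DAG without a running scheduler. This is a sketch assuming Airflow 2.5 or newer, where &lt;code&gt;dag.test()&lt;/code&gt; executes all tasks in a single process:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Append to the bottom of spark_etl_pipeline.py (Airflow 2.5+).
# Running `python spark_etl_pipeline.py` then runs the whole DAG
# in-process, which catches path and import mistakes early.
if __name__ == "__main__":
    dag.test()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;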

&lt;p&gt;3). &lt;strong&gt;Monitor Your Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Move the DAG file to Airflow’s dags/ directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mv spark_etl_pipeline.py ~/airflow/dags/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Airflow services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;airflow scheduler
airflow webserver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the Airflow UI at &lt;code&gt;http://localhost:8080&lt;/code&gt;, find the &lt;code&gt;spark_etl_pipeline&lt;/code&gt; DAG, and ensure it is enabled. Trigger it manually to validate, then check task statuses and logs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m4mrk2qdgsnrwz3ll9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m4mrk2qdgsnrwz3ll9t.png" alt="Apache Airflow dashboard UI" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Monitoring ensures pipelines run reliably. You can use tools like Airflow’s alerting system or integrate with Grafana and Prometheus.&lt;/p&gt;
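
&lt;p&gt;Beyond email alerts, Airflow can run a callback whenever a task fails. Here’s a minimal sketch; the &lt;code&gt;notify_failure&lt;/code&gt; function is a placeholder you’d swap for a real Slack or PagerDuty call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Airflow invokes on_failure_callback with the task's context dict
# whenever a task instance fails. The print is a stand-in for a real alert.
def notify_failure(context):
    task_id = context["task_instance"].task_id
    dag_id = context["dag"].dag_id
    print(f"Task {task_id} in DAG {dag_id} failed on {context['ds']}")

default_args = {
    "owner": "airflow",
    "on_failure_callback": notify_failure,  # fires on any task failure
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;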

&lt;h4&gt;
  
  
  &lt;strong&gt;From the Airflow UI&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Status&lt;/strong&gt;: Check the &lt;strong&gt;Graph View&lt;/strong&gt; at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt; to monitor task states:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: Success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt;: Failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow&lt;/strong&gt;: Running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grey&lt;/strong&gt;: Queued/Skipped&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Logs&lt;/strong&gt;: Click on a task instance to view logs for debugging. &lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8cwop45xajepxcqk41n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8cwop45xajepxcqk41n.png" alt="Airflow UI" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Wrapping Up&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Congrats! 🎉 You’ve learned how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up storage with PostgreSQL.&lt;/li&gt;
&lt;li&gt;Process data using PySpark.&lt;/li&gt;
&lt;li&gt;Automate workflows with Airflow.&lt;/li&gt;
&lt;li&gt;Monitor systems for reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering as a Career Focus&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data engineering is the backbone of modern data workflows. It focuses on building systems that collect, store, and process data efficiently, enabling analysts, data scientists, and business stakeholders to extract value from it.&lt;/p&gt;

&lt;p&gt;While it primarily involves skills like coding, system design, and infrastructure management, data engineering often intersects with adjacent fields: backend development (e.g., APIs, database schema design), DevOps (e.g., CI/CD pipelines, infrastructure as code), and data operations (e.g., managing ETL workflows and ensuring data reliability).&lt;/p&gt;

&lt;p&gt;Whether you're orchestrating workflows with tools like Apache Airflow, processing massive datasets with Apache Spark, or managing databases like PostgreSQL, your role is pivotal in enabling data-driven decisions.&lt;/p&gt;

&lt;p&gt;Start small, explore, and grow your expertise! 💪&lt;/p&gt;

&lt;p&gt;If you have any inquiries or wish to gain additional knowledge, please get in touch with me on &lt;a href="https://github.com/nextwebb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://x.com/iam_nextwebb" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, or &lt;a href="https://www.linkedin.com/in/peterson-oaikhenah-102645144/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. Kindly show your support by leaving a thumbs up 👍, a comment 💬, and sharing this article with your network 😊.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References to Explore Further:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs" rel="noopener noreferrer"&gt;Apache Spark Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org" rel="noopener noreferrer"&gt;Airflow Official Site&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs" rel="noopener noreferrer"&gt;PostgreSQL Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.11.txt" rel="noopener noreferrer"&gt;Constraints file for each version of Python that Airflow supports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>apachespark</category>
      <category>apacheairflow</category>
      <category>python</category>
    </item>
    <item>
      <title>Avoiding Pitfalls in Amazon S3: Handling Case Sensitivity in Python Workflows</title>
      <dc:creator>peterson oaikhenah</dc:creator>
      <pubDate>Sun, 01 Dec 2024 18:06:01 +0000</pubDate>
      <link>https://dev.to/aws-builders/avoiding-pitfalls-in-amazon-s3-handling-case-sensitivity-in-python-workflows-22if</link>
      <guid>https://dev.to/aws-builders/avoiding-pitfalls-in-amazon-s3-handling-case-sensitivity-in-python-workflows-22if</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Managing S3 Case Sensitivity in Python Workflows&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with &lt;strong&gt;Amazon S3&lt;/strong&gt;, it’s easy to overlook an important nuance: &lt;strong&gt;case sensitivity&lt;/strong&gt;. While bucket names are &lt;strong&gt;case-insensitive&lt;/strong&gt;, object keys (file paths) are &lt;strong&gt;case-sensitive&lt;/strong&gt;. This distinction can lead to unexpected bugs in your workflows. For instance, &lt;code&gt;my-bucket/data/file.txt&lt;/code&gt; and &lt;code&gt;my-bucket/Data/File.txt&lt;/code&gt; are treated as completely different objects.&lt;/p&gt;

&lt;p&gt;If you’ve ever had a Python script fail to locate files in S3, chances are case sensitivity was the issue.&lt;/p&gt;
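
&lt;p&gt;A quick way to see this bite: ask S3 for the same key with two different casings. This sketch assumes a bucket and key you own; &lt;code&gt;head_object&lt;/code&gt; returns a 404 unless the key matches exactly, case included:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Object keys are case-sensitive: a HEAD request with the wrong
# casing fails even though the object exists under another casing.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
for key in ("data/file.txt", "Data/File.txt"):  # only one of these exists
    try:
        s3.head_object(Bucket="my-bucket", Key=key)
        print(f"Found: {key}")
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            print(f"Missing (check case): {key}")
        else:
            raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;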




&lt;h3&gt;
  
  
  &lt;strong&gt;Why Does Case Sensitivity Matter?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s say your data processing pipeline dynamically generates S3 paths based on inputs from multiple teams. One team might upload to &lt;code&gt;my-bucket/data/&lt;/code&gt;, while another uses &lt;code&gt;my-bucket/Data/&lt;/code&gt;. Without a strategy to handle case mismatches, your pipeline could skip files or fail altogether, causing inefficiencies and delays.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;How to Handle Case Sensitivity in Python&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s how you can address this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normalize Paths&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Standardize paths to lowercase (or a consistent format) during both upload and access; see the sketch after this list.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify Object Keys&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Use AWS SDK methods like &lt;code&gt;list_objects_v2&lt;/code&gt; to confirm the existence of object keys.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implement Error Handling&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Design scripts to handle exceptions such as &lt;code&gt;KeyError&lt;/code&gt; and log issues for debugging.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
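
&lt;p&gt;For the first strategy, the simplest place to enforce a convention is at upload time. A minimal sketch, assuming lowercase-only keys are acceptable for your data (bucket and paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Enforce a lowercase-key convention at upload time so readers
# never have to guess the casing later. Names are illustrative.
import boto3

def upload_normalized(local_path, bucket, key):
    s3 = boto3.client("s3")
    canonical_key = key.lower()
    s3.upload_file(local_path, bucket, canonical_key)
    return canonical_key

stored_as = upload_normalized("report.csv", "my-bucket", "Data/Report.csv")
print("Stored as:", stored_as)  # data/report.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;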




&lt;h3&gt;
  
  
  &lt;strong&gt;Code Example: Listing Objects Safely&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Below is a Python script to list objects in an S3 bucket while addressing case sensitivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Normalize and validate S3 paths to handle case sensitivity.

    Args:
        bucket (str): Name of the S3 bucket.
        prefix (str): Prefix (folder path) in the bucket.

    Returns:
        list: Canonical paths matching the prefix.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_objects_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Contents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. Check case sensitivity.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Contents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data/File.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Canonical paths found:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script ensures that your workflow identifies object keys regardless of mismatched casing in input paths.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Real-World Scenario&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In many data processing workflows, case mismatches in file paths can lead to missing or duplicated records. For instance, a team processing customer records stored in S3 noticed recurring errors due to inconsistent casing in object keys. By implementing strategies like normalizing paths and validating keys, they were able to significantly reduce these issues and improve the reliability of their data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standardize&lt;/strong&gt;: Use consistent casing for all S3 paths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt;: Leverage AWS SDKs to confirm the existence of the object key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handle Errors Gracefully&lt;/strong&gt;: Design scripts to log and report mismatched paths.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By addressing case sensitivity early in your workflow, you can prevent costly errors and build more resilient systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What About You?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Have you faced challenges with case sensitivity in S3? Share your experiences in the comments or connect with me to discuss more strategies for optimizing cloud workflows!&lt;/p&gt;

&lt;p&gt;If you have any inquiries or wish to gain additional knowledge, please get in touch with me on &lt;a href="https://github.com/nextwebb" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://x.com/iam_nextwebb" rel="noopener noreferrer"&gt;&lt;strong&gt;Twitter&lt;/strong&gt;&lt;/a&gt;, or &lt;a href="https://www.linkedin.com/in/peterson-oaikhenah-102645144/" rel="noopener noreferrer"&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/a&gt;. Kindly show your support by leaving a thumbs up 👍, a comment 💬, and sharing this article with your network 😊.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naming Amazon S3 Objects&lt;/strong&gt;: &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>datapipeline</category>
      <category>errors</category>
      <category>awss3</category>
    </item>
  </channel>
</rss>
