<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed Amin</title>
    <description>The latest articles on DEV Community by Mohamed Amin (@amin12905).</description>
    <link>https://dev.to/amin12905</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2833280%2F7c1305cc-68b3-4edd-9810-270352b6ae19.png</url>
      <title>DEV Community: Mohamed Amin</title>
      <link>https://dev.to/amin12905</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amin12905"/>
    <language>en</language>
    <item>
      <title>Building a YouTube Analytics Dashboard</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Fri, 02 May 2025 19:52:40 +0000</pubDate>
      <link>https://dev.to/amin12905/building-a-youtube-analytics-dashboard-3i77</link>
      <guid>https://dev.to/amin12905/building-a-youtube-analytics-dashboard-3i77</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the growing creator economy, YouTube channels pump out tons of videos every week, and creators need a simple dashboard that answers the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How has my channel grown over time?&lt;/li&gt;
&lt;li&gt;What is the best day and time to post a video?&lt;/li&gt;
&lt;li&gt;What are the most engaging videos in my channel?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To tackle this, I built an end-to-end data pipeline that automatically pulls, processes, and visualizes YouTube data.&lt;/p&gt;

&lt;p&gt;This article goes over the steps I took to achieve this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part 1: Installing the Tools Natively on an Azure Ubuntu VM
&lt;/h1&gt;

&lt;p&gt;Instead of using Docker, I installed everything natively because it gave me more control, better performance tuning, and more flexibility than a containerized environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Install Apache Airflow
&lt;/h2&gt;

&lt;p&gt;Apache Airflow schedules and manages all the ETL jobs automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1.1: Create a dedicated airflow user
&lt;/h3&gt;

&lt;p&gt;First, it’s good practice to create a separate user to run Airflow (avoiding permission issues later).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo adduser airflow&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Follow the prompts (you can skip extra info fields). Then add the user to sudoers:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo usermod -aG sudo airflow&lt;/code&gt;&lt;br&gt;
Login as the new user:&lt;br&gt;
&lt;code&gt;sudo su - airflow&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.2: Install system dependencies
&lt;/h3&gt;

&lt;p&gt;Update the system first:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y&lt;/code&gt;&lt;br&gt;
Install required libraries:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt install -y python3-pip python3-venv libpq-dev&lt;/code&gt;&lt;br&gt;
python3-venv → To create isolated Python environments&lt;/p&gt;

&lt;p&gt;libpq-dev → For Postgres client libraries (needed later)&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.3: Set up Python Virtual Environment for Airflow
&lt;/h3&gt;

&lt;p&gt;It’s highly recommended to run Airflow inside a virtual environment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 -m venv airflow_venv&lt;/code&gt;&lt;br&gt;
&lt;code&gt;source airflow_venv/bin/activate&lt;/code&gt;&lt;br&gt;
Now inside the virtualenv, install Airflow:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install apache-airflow&lt;/code&gt;&lt;br&gt;
✅ Installing Airflow takes a while (~5 mins) because it pulls many dependencies.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.4: Initialize Airflow Database
&lt;/h3&gt;

&lt;p&gt;Airflow uses a metadata database to track DAG runs and tasks. We'll point it at PostgreSQL later, but for now initialize it with the defaults:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export AIRFLOW_HOME=~/airflow&lt;/code&gt;&lt;br&gt;
&lt;code&gt;airflow db init&lt;/code&gt;&lt;br&gt;
This creates:&lt;/p&gt;

&lt;p&gt;~/airflow/airflow.cfg&lt;/p&gt;

&lt;p&gt;SQLite DB initially (we'll later connect to PostgreSQL).&lt;/p&gt;
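
&lt;p&gt;Later, when switching the metadata database from SQLite to PostgreSQL, the relevant line in ~/airflow/airflow.cfg would look roughly like this (hypothetical credentials and database name; you also need the Postgres extras, e.g. &lt;code&gt;pip install apache-airflow[postgres]&lt;/code&gt;, inside the virtualenv):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# airflow.cfg, [database] section (older Airflow versions use [core])
# hypothetical user/password/database; replace with your own
sql_alchemy_conn = postgresql+psycopg2://airflow:yourpassword@localhost:5432/airflow_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
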
&lt;h3&gt;
  
  
  Step 1.5: Create an Admin User
&lt;/h3&gt;

&lt;p&gt;Airflow needs an admin user to log in to the web UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1.6: Run Airflow services
&lt;/h3&gt;

&lt;p&gt;You need two services running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Webserver (UI)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheduler (Job runner)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start them:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;airflow webserver --port 8080&lt;/code&gt;&lt;br&gt;
In another SSH session/tab:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;airflow scheduler&lt;/code&gt;&lt;br&gt;
You can now access Airflow at http://your-server-ip:8080 (replace your-server-ip with your VM's public IP).&lt;/p&gt;

&lt;p&gt;At this point, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Airflow UI is accessible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Admin login works.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Install Apache Spark (Data Processing Engine)
&lt;/h2&gt;

&lt;p&gt;Spark handles the heavy lifting for data transformation. Make sure you have Java and Scala installed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2.1: Download and extract Spark
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xvzf spark-3.4.0-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Move it to /opt (standard for system-wide apps):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo mv spark-3.4.0-bin-hadoop3 /opt/spark&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2.2: Set Spark Environment Variables
&lt;/h3&gt;

&lt;p&gt;Edit your bash profile:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nano ~/.bashrc&lt;/code&gt;&lt;br&gt;
Add at the bottom:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply changes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source ~/.bashrc&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2.3: Test Spark installation
&lt;/h3&gt;

&lt;p&gt;Check Spark version:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spark-submit --version&lt;/code&gt;&lt;br&gt;
✅ If you see version info without errors, Spark is installed properly.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Install PostgreSQL (Data Warehouse)
&lt;/h2&gt;

&lt;p&gt;PostgreSQL will store all cleaned YouTube data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3.1: Install PostgreSQL Server
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y postgresql postgresql-contrib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3.2: Create Database and User
&lt;/h3&gt;

&lt;p&gt;Switch to Postgres superuser:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo -u postgres psql&lt;/code&gt;&lt;br&gt;
Inside psql shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE youtube_analytics;
CREATE USER airflow WITH ENCRYPTED PASSWORD 'yourpassword';
GRANT ALL PRIVILEGES ON DATABASE youtube_analytics TO airflow;
\q
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Airflow and your scripts will connect using the airflow user.&lt;/p&gt;

&lt;p&gt;At this point, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;psql connects successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;youtube_analytics database exists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Install Grafana (Dashboarding)
&lt;/h2&gt;

&lt;p&gt;Grafana visualizes the results beautifully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4.1: Add Grafana Repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4.2: Install Grafana
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sudo apt-get install grafana&lt;/code&gt;&lt;br&gt;
Enable and start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl start grafana-server
sudo systemctl enable grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4.3: Access Grafana
&lt;/h3&gt;

&lt;p&gt;Grafana will be available at:&lt;br&gt;
http://your-server-ip:3000&lt;/p&gt;

&lt;p&gt;Default credentials:&lt;/p&gt;

&lt;p&gt;Username: admin&lt;/p&gt;

&lt;p&gt;Password: admin&lt;br&gt;
(you'll be prompted to change password at first login)&lt;/p&gt;

&lt;p&gt;At this point, your server is fully ready:&lt;br&gt;
Airflow + Spark + Postgres + Grafana all installed, running natively.&lt;/p&gt;
&lt;h1&gt;
  
  
  Part 2: Code Walkthrough
&lt;/h1&gt;
&lt;h3&gt;
  
  
  1. The Core ETL Pipeline (main.py)
&lt;/h3&gt;

&lt;p&gt;This file is the heart of the whole system. It does three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulls data from the YouTube API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processes and enriches it using PySpark&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns clean DataFrames ready to be stored&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Load API Key and Initialize Spark + YouTube Client&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dotenv import load_dotenv
import os
from googleapiclient.discovery import build
from pyspark.sql import SparkSession

load_dotenv()
google_api_key = os.getenv('API_KEY')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This section loads your .env file and gets the API key for authenticating with the YouTube Data API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_spark():
    return SparkSession.builder \
        .appName("YoutubeAnalytics") \
        .config("spark.jars", "/path/to/postgresql-42.6.0.jar") \
        .master("local[*]") \
        .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates a local Spark session and includes the JDBC driver to connect to PostgreSQL later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_youtube():
    return build('youtube', 'v3', developerKey=google_api_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Builds the YouTube API client using the API key.&lt;/p&gt;

&lt;p&gt;Get Subscriber Count (getSubscribers())&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
response = request.execute()
subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets the current subscriber count for the specified channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subscriber_data = [(date.today(), subscriber_count)]
df = spark.createDataFrame(subscriber_data, ["date", "subscribers"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stores the date + subscriber count as a Spark DataFrame. This helps visualize growth over time later.&lt;/p&gt;
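
&lt;p&gt;Putting these pieces together, a minimal sketch of getSubscribers() might look like this (error handling omitted; the channel ID and column names come from the snippets above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import date

def getSubscribers():
    spark = get_spark()
    youtube = get_youtube()

    # Fetch the channel statistics from the YouTube Data API
    request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
    response = request.execute()
    subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])

    # One row per run: today's date + the current subscriber count
    subscriber_data = [(date.today(), subscriber_count)]
    return spark.createDataFrame(subscriber_data, ["date", "subscribers"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;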

&lt;p&gt;Get Top Videos by Engagement (get_videos)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Loop through playlist videos
request = youtube.playlistItems().list(part='snippet,contentDetails', playlistId=playlist_id, maxResults=50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pulls all video IDs from the "Uploads" playlist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull stats for each video
data_request = youtube.videos().list(part='snippet,statistics', id=','.join(video_ids))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fetches views, likes, and comments for each video.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;engagement_rate = round((like_count + comment_count) / view_count, 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculates engagement rate to identify top-performing videos.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;best_df = df.orderBy(col("engagement_rate").desc()).limit(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sorts and selects the top 5 most engaging videos.&lt;/p&gt;
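
&lt;p&gt;For reference, the per-video rows could be assembled roughly like this before building the Spark DataFrame (the response variable and field names are assumptions based on the snippets above; videos with zero views are skipped to avoid division by zero):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;videos = []
for item in data_response['items']:  # data_response = data_request.execute()
    stats = item['statistics']
    view_count = int(stats.get('viewCount', 0))
    like_count = int(stats.get('likeCount', 0))
    comment_count = int(stats.get('commentCount', 0))
    if view_count == 0:
        continue  # skip brand-new videos to avoid division by zero
    engagement_rate = round((like_count + comment_count) / view_count, 2)
    videos.append((item['snippet']['title'], view_count, like_count, comment_count, engagement_rate))

df = spark.createDataFrame(videos, ["title", "view_count", "likes", "comments", "engagement_rate"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;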

&lt;p&gt;Best Time to Post (best_post_time)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.withColumn("hour", hour(col('published_date')))
df = df.withColumn("day", date_format(col('published_date'), "E"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracts the day of week and hour from the video publish time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.withColumn("engagement", col('likes') + col('view_count') + col('comments'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates an engagement score to use for averaging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grouped = df.groupBy("day", "hour").avg("engagement")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregates average engagement by day/hour pair.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = grouped.select("day_hour", "avg_engagement").orderBy(col("avg_engagement").desc()).limit(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns the best 3 day-hour combinations for posting.&lt;/p&gt;
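
&lt;p&gt;The day_hour column isn't shown explicitly in the snippets above; one way it could be derived is by concatenating the two grouping columns and renaming the averaged column, roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import col, concat_ws

grouped = df.groupBy("day", "hour").avg("engagement") \
    .withColumnRenamed("avg(engagement)", "avg_engagement") \
    .withColumn("day_hour", concat_ws(" ", col("day"), col("hour")))

result = grouped.select("day_hour", "avg_engagement").orderBy(col("avg_engagement").desc()).limit(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;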

&lt;h3&gt;
  
  
  2. The Controller File (controller.py)
&lt;/h3&gt;

&lt;p&gt;This script is responsible for executing the ETL logic and pushing the results to PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from main import getSubscribers, get_videos, best_post_time

subscriber_df = getSubscribers()
videos_list, best_videos_df = get_videos()
best_post_time_df = best_post_time(videos_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It calls the three main ETL functions and stores their outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subscriber_df.write.jdbc(url=jdbc_url, table="subscriber_data", mode='append', properties=properties)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writes each DataFrame into its corresponding PostgreSQL table.&lt;/p&gt;
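
&lt;p&gt;For context, the JDBC URL and connection properties used here would be defined along these lines (hypothetical host and credentials; the driver jar must match the one passed to spark.jars):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical values: point these at the youtube_analytics database created earlier
jdbc_url = "jdbc:postgresql://localhost:5432/youtube_analytics"
properties = {
    "user": "airflow",
    "password": "yourpassword",
    "driver": "org.postgresql.Driver"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;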

&lt;p&gt;Tables used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;subscriber_data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;best_post_time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;best_performing_videos&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clean separation between data logic and storage logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Airflow DAG (youtube_dag.py)
&lt;/h3&gt;

&lt;p&gt;This DAG automates running the pipeline daily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_args = {
    "owner": "Batru",
    "start_date": datetime(2025, 4, 23),
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "email_on_failure": True,
    "email": ["batrudin10@gmail.com"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Defines retry behavior and email alerts on failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task_youtube = BashOperator(
    task_id="task_youtube",
    bash_command="""
        source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &amp;amp;&amp;amp;
        python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
    """
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task activates the Python virtual environment and runs controller.py, kicking off the full ETL.&lt;/p&gt;
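
&lt;p&gt;For completeness, a minimal sketch of the surrounding DAG definition might look like this (the DAG id and the daily schedule are assumptions based on the description above; default_args is the dict defined earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="youtube_analytics_dag",   # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@daily",       # run the full ETL once a day
    catchup=False,
) as dag:
    task_youtube = BashOperator(
        task_id="task_youtube",
        bash_command="""
            source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &amp;amp;&amp;amp;
            python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
        """
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;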

&lt;h1&gt;
  
  
  Part 3: Visualizing YouTube Data with Grafana
&lt;/h1&gt;

&lt;p&gt;With the data successfully loaded into PostgreSQL from our Spark job (via Airflow), it’s time to bring it to life visually using Grafana.&lt;/p&gt;

&lt;p&gt;Grafana is already installed on our Ubuntu server. Here's how we set it up and created dashboards that answer our key analytics questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect Grafana to PostgreSQL
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Login to Grafana&lt;/li&gt;
&lt;li&gt;Navigate to &lt;a href="http://128.85.32.87:3000" rel="noopener noreferrer"&gt;http://128.85.32.87:3000&lt;/a&gt; in your browser.&lt;/li&gt;
&lt;li&gt;Default login is usually:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Username: admin&lt;/p&gt;

&lt;p&gt;Password: admin (you’ll be asked to reset on first login)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add PostgreSQL as a Data Source and Fill in:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Host: localhost:5432 or your DB host&lt;/p&gt;

&lt;p&gt;Database: youtube_analytics&lt;/p&gt;

&lt;p&gt;User and Password: your PostgreSQL credentials&lt;/p&gt;

&lt;p&gt;Click Save &amp;amp; Test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Dashboards
&lt;/h3&gt;

&lt;p&gt;Now that the DB is connected, let’s create panels to visualize the insights.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Channel Growth Over Time
&lt;/h4&gt;

&lt;p&gt;Query from subscriber_data table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT date, subscribers FROM subscriber_data ORDER BY date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Use a Time Series panel&lt;/p&gt;

&lt;p&gt;Y-axis: Subscriber count&lt;/p&gt;

&lt;p&gt;X-axis: Date&lt;/p&gt;

&lt;p&gt;This shows subscriber growth trends over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Best Performing Videos by Engagement
&lt;/h4&gt;

&lt;p&gt;Query from best_performing_videos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title, engagement_rate FROM best_performing_videos ORDER BY engagement_rate DESC LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Bar Chart&lt;/p&gt;

&lt;p&gt;X-axis: Video titles&lt;/p&gt;

&lt;p&gt;Y-axis: Engagement rate&lt;/p&gt;

&lt;p&gt;This reveals which videos truly resonate with the audience.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Best Time to Post
&lt;/h4&gt;

&lt;p&gt;Query from best_post_time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT day_hour, avg_engagement FROM best_post_time ORDER BY avg_engagement DESC LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Table or Bar Chart&lt;/p&gt;

&lt;p&gt;The final dashboard looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvctzq9ffcx9us8nbcia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvctzq9ffcx9us8nbcia.png" alt="Image description" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Apache Spark for Data Engineering</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 14 Apr 2025 09:56:56 +0000</pubDate>
      <link>https://dev.to/amin12905/introduction-to-apache-spark-for-data-engineering-3fmh</link>
      <guid>https://dev.to/amin12905/introduction-to-apache-spark-for-data-engineering-3fmh</guid>
      <description>&lt;h1&gt;
  
  
  🔥 Introduction
&lt;/h1&gt;

&lt;p&gt;With the volume and velocity of data being generated today, Apache Spark has emerged as a go-to distributed computing framework. Spark is designed for fast processing and scalability, making it ideal for modern data engineering workflows.&lt;/p&gt;

&lt;p&gt;In this article, we will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What Apache Spark is&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Definitions of common Spark terms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Core components of Spark&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why use Spark as a Data Engineer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  ⚙️ What is Apache Spark?
&lt;/h1&gt;

&lt;p&gt;Apache Spark is an open-source data processing engine built for large-scale data workloads. It can be up to 100 times faster than traditional MapReduce frameworks thanks to its in-memory processing capabilities.&lt;/p&gt;

&lt;h1&gt;
  
  
  📘 Common Spark Terms
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. RDD (Resilient Distributed Dataset)
&lt;/h3&gt;

&lt;p&gt;A distributed collection of objects that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Immutable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support in-memory processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offer fault tolerance through lineage information&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. DataFrame
&lt;/h3&gt;

&lt;p&gt;A distributed collection of data organized into named columns, similar to a Pandas DataFrame, but optimized for big data.&lt;/p&gt;
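
&lt;p&gt;To make the distinction concrete, here is a small PySpark sketch that creates a DataFrame and runs a couple of operations on it (illustrative data only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# A tiny in-memory dataset; in practice this would come from files, databases, or streams
data = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

df.filter(df.age &gt; 30).show()    # column-based filtering
df.groupBy().avg("age").show()   # simple aggregation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;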

&lt;h1&gt;
  
  
  🧹 Components of Spark
&lt;/h1&gt;

&lt;p&gt;Spark consists of a core engine and several powerful libraries:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Spark Core
&lt;/h3&gt;

&lt;p&gt;The foundation of the Spark ecosystem, responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Task scheduling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault recovery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Basic I/O operations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Spark SQL
&lt;/h3&gt;

&lt;p&gt;Enables querying of structured data using SQL-like syntax.&lt;/p&gt;
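
&lt;p&gt;For example, a DataFrame can be registered as a temporary view and queried with plain SQL (continuing the illustrative DataFrame above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age &gt; 30").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;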

&lt;h3&gt;
  
  
  3. Spark Streaming
&lt;/h3&gt;

&lt;p&gt;Processes real-time data streams from sources like Kafka, Flume, and sockets, using a micro-batch architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Spark MLlib
&lt;/h3&gt;

&lt;p&gt;A scalable machine learning library built on top of Spark for classification, regression, clustering and recommendation.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. GraphX
&lt;/h3&gt;

&lt;p&gt;A library used for graph processing and computation, useful for tasks such as social network analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  🚀 Why Spark?
&lt;/h1&gt;

&lt;p&gt;Here’s why Spark is widely adopted in big data engineering:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Speed
&lt;/h3&gt;

&lt;p&gt;Spark outperforms Hadoop MapReduce by being up to 100x faster, thanks to its in-memory computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scalability
&lt;/h3&gt;

&lt;p&gt;Spark is built to scale across hundreds or thousands of nodes, handling petabyte-scale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Unified Engine
&lt;/h3&gt;

&lt;p&gt;Spark provides a single engine for batch processing, real-time streaming, machine learning, and graph computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;Spark automatically recovers from node failures using RDD lineage, which tracks how data is derived.&lt;/p&gt;

&lt;h1&gt;
  
  
  🔄 A Typical Spark Workflow for Data Engineering
&lt;/h1&gt;

&lt;p&gt;Here's how Spark fits into a standard data engineering pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt; - Read data from various sources like local files, relational databases, data lakes, or APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt; - Apply transformations such as filtering, joins, aggregations, and custom business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Validation and Cleansing&lt;/strong&gt; - Clean the data, handle nulls, validate schema, and ensure quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Loading&lt;/strong&gt; - Write the processed data to destinations like data warehouses, file systems, or dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  🧠 Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Apache Spark continues to be a game-changer in the fields of big data and data engineering. Its unified architecture, ability to handle large datasets with ease, and support for both batch and real-time processing make it an essential tool for modern data teams.&lt;/p&gt;

&lt;p&gt;As a data engineer, mastering Spark enables you to build fast, scalable, and reliable data pipelines that can drive analytics, power machine learning models, and support real-time applications.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Step-by-Step guide on Live Streaming weather data from Openweather api to MongoDB using Kafka</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 07 Apr 2025 10:46:28 +0000</pubDate>
      <link>https://dev.to/amin12905/step-by-step-guide-on-live-streaming-weather-data-from-openweather-api-to-mongodb-using-kafka-km4</link>
      <guid>https://dev.to/amin12905/step-by-step-guide-on-live-streaming-weather-data-from-openweather-api-to-mongodb-using-kafka-km4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this guide, we’ll stream data from the OpenWeather API and store it in MongoDB, using Kafka for fault tolerance. Before we start, make sure you’ve already set up a cloud server—I'm using an EC2 instance on AWS. Also, sign up on &lt;a href="https://openweathermap.org/" rel="noopener noreferrer"&gt;OpenWeatherMap&lt;/a&gt; and get your API key—we’ll use this to fetch the weather data.&lt;/p&gt;

&lt;p&gt;This guide is broken down into 5 parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up our cloud environment&lt;/li&gt;
&lt;li&gt;Setting up our Kafka environment&lt;/li&gt;
&lt;li&gt;Setting up our MongoDB environment&lt;/li&gt;
&lt;li&gt;Writing our producer and consumer Python files&lt;/li&gt;
&lt;li&gt;Tying everything together to complete the project&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Setting Up Our Cloud Environment
&lt;/h2&gt;

&lt;p&gt;We'll SSH into our EC2 server and check if the Kafka port (default is 9092) is free.&lt;/p&gt;

&lt;h3&gt;
  
  
  a) SSH into EC2
&lt;/h3&gt;

&lt;p&gt;Open your terminal (I’m using Git Bash) and run the SSH command from the EC2 &lt;strong&gt;Connect&lt;/strong&gt; section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; your-key.pem ec2-user@your-ec2-ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b) Check if Kafka Port is Free
&lt;/h3&gt;

&lt;p&gt;Once logged into the server, run the following:&lt;br&gt;
&lt;code&gt;$ sudo lsof -i :9092&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If it returns nothing, great—no service is using that port, so Kafka is good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Setting Up Our Kafka Environment
&lt;/h2&gt;

&lt;p&gt;Head over to Kafka Downloads and copy the link for version 4.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  a) Download Kafka
&lt;/h3&gt;

&lt;p&gt;Run this command to download kafka binary files:&lt;br&gt;
&lt;code&gt;$ wget https://downloads.apache.org/kafka/4.0.0/kafka_2.13-4.0.0.tgz&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  b) Extract Kafka
&lt;/h3&gt;

&lt;p&gt;Run this command to extract the files:&lt;br&gt;
&lt;code&gt;$ tar -xzvf kafka_2.13-4.0.0.tgz&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can rename the folder for convenience if you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Setting Up Our MongoDB Environment
&lt;/h2&gt;

&lt;p&gt;We’ll use MongoDB Atlas to host our database. Sign up or log in, create a cluster, and grab your connection string (we’ll use this in our Python script).&lt;/p&gt;
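
&lt;p&gt;Before writing the consumer, it's worth confirming the connection string works. A quick check could look like this (assumes the string is stored in a .env file as DB_STRING, matching the consumer script later in this guide):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from dotenv import load_dotenv
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

load_dotenv()
client = MongoClient(os.getenv('DB_STRING'), server_api=ServerApi('1'))
client.admin.command('ping')  # raises an exception if the cluster is unreachable
print("Connected to MongoDB Atlas")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;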

&lt;h2&gt;
  
  
  4. Writing Our Producer and Consumer Python Files
&lt;/h2&gt;

&lt;h2&gt;
  
  
  a) producer.py — Pulling from OpenWeather API and Writing to Kafka:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Producer
import requests, json, os, time
from dotenv import load_dotenv
import pandas as pd

# Load environment variables from .env file
load_dotenv()
weather_api_key = os.getenv('WEATHER_API_KEY')  # Your OpenWeather API key
city_name = 'Nairobi'

# Build API URL
weather_url = f'https://api.openweathermap.org/data/2.5/weather?q={city_name}&amp;amp;appid={weather_api_key}'

# Kafka Producer Configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'python-producer'
}
producer = Producer(config)
topic = 'weather_topic'

# Function to extract data from API
def extract_data():
    response = requests.get(weather_url)
    data = response.json()

    weather_df = pd.DataFrame(data['weather'], index=[0])
    temp_df = pd.DataFrame(data['main'], index=[0])
    location_df = pd.DataFrame({'country': data['sys']['country'], 'city': city_name}, index=[0])

    merged_df = pd.merge(pd.merge(location_df, weather_df, left_index=True, right_index=True), temp_df, left_index=True, right_index=True)
    return merged_df

# Function to transform the data
def transform_data(df):
    df = df.drop(columns=['id', 'icon'])  # Drop unnecessary columns
    cols = ['temp', 'feels_like', 'temp_min', 'temp_max']
    df[cols] = df[cols] - 273  # Convert from Kelvin to Celsius
    return df.to_dict(orient='records')

# Optional callback for delivery status
def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()}[{msg.partition()}]')

# Continuous streaming loop
while True:
    data = extract_data()
    transformed = transform_data(data)
    for record in transformed:
        producer.produce(topic, value=json.dumps(record), callback=delivery_report)
        producer.poll(0)
    time.sleep(600)  # Wait for 10 minutes before fetching again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;br&gt;
We load the API key from .env for security, extract the weather data, clean it, convert the temperatures, and send each data point to Kafka every 10 minutes, since the API refreshes the data every 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  b) consumer.py — Consuming from Kafka and Pushing to MongoDB
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Consumer, KafkaError, KafkaException
from dotenv import load_dotenv
import os, json, time
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Load MongoDB connection string from .env
load_dotenv()
uri = os.getenv('DB_STRING')

# Connect to MongoDB
client = MongoClient(uri, server_api=ServerApi('1'))
db = client.weather_data  # You can name this whatever you want
collection = db.reports

# Kafka Consumer configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'weather-consumer-group',
    'auto.offset.reset': 'earliest'
}
consumer = Consumer(config)
topic = 'weather_topic'
consumer.subscribe([topic])

# Function to insert data into MongoDB
def load_data(records):
    collection.insert_many(records)

# Consume messages loop
while True:
    try:
        msg = consumer.poll(1.0)
        if msg is None:
            print("No message received.")
        elif msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                print("End of partition reached.")
            else:
                raise KafkaException(msg.error())
        else:
            message_data = json.loads(msg.value().decode('utf-8'))
            load_data([message_data])
            print(f"Data stored: {message_data}")
            time.sleep(600)  # Optional: throttle for real-time feel
    except Exception as e:
        print(f"Error: {str(e)}")
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;br&gt;
Consumes data from the Kafka topic, parses the JSON messages, and pushes them to MongoDB. load_data() is where the insertion happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tying Everything Together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a) Start Kafka Server
&lt;/h3&gt;

&lt;p&gt;Navigate to your Kafka directory and run:&lt;br&gt;
&lt;code&gt;$ nohup kafka/bin/kafka-server-start.sh kafka/config/server.properties &amp;amp;&lt;/code&gt;&lt;br&gt;
b) Run Python Files&lt;br&gt;
In the same server session or another, run:&lt;br&gt;
&lt;code&gt;$ nohup python3 consumer.py &amp;amp;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ nohup python3 producer.py &amp;amp;&lt;/code&gt;&lt;br&gt;
The nohup command (together with the trailing &amp;amp;, which runs each process in the background) allows the scripts to keep running even after you disconnect from SSH.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Step: Check MongoDB
&lt;/h2&gt;

&lt;p&gt;Head over to MongoDB Atlas &amp;gt; Clusters &amp;gt; Browse Collections, and you should start seeing weather data coming in every 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv4nzwrvglsa5y3trznx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv4nzwrvglsa5y3trznx.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve now built a working real-time data pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams weather data from OpenWeather API&lt;/li&gt;
&lt;li&gt;Publishes it to Kafka&lt;/li&gt;
&lt;li&gt;Consumes it and stores it in MongoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup is highly scalable and gives you fault tolerance via Kafka. You can build on this with more processing or dashboard visualizations later.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Data Warehouses: An Overview of what a Data Warehouse is</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 24 Mar 2025 14:13:10 +0000</pubDate>
      <link>https://dev.to/amin12905/understanding-data-warehouses-an-overview-of-what-a-data-warehouse-is-1n1m</link>
      <guid>https://dev.to/amin12905/understanding-data-warehouses-an-overview-of-what-a-data-warehouse-is-1n1m</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; refers to a centralized system used to store large amounts of data from different sources. Most of the time, data warehouses store &lt;strong&gt;structured&lt;/strong&gt; data for analytical purposes.  &lt;/p&gt;

&lt;p&gt;Data warehouses help businesses make &lt;strong&gt;data-driven decisions&lt;/strong&gt; by ensuring that data is readily available and easily accessible.  &lt;/p&gt;




&lt;h2&gt;
  
  
  2. Components of a Data Warehouse
&lt;/h2&gt;

&lt;p&gt;There are &lt;strong&gt;four&lt;/strong&gt; main components of a data warehouse:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source&lt;/strong&gt; – This is where the data originates, such as transactional databases, APIs, logs, or external data sources.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging&lt;/strong&gt; – This is the area where data is processed before being loaded into the warehouse.

&lt;ul&gt;
&lt;li&gt;Data is usually moved through either &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; or &lt;strong&gt;ELT (Extract, Load, Transform)&lt;/strong&gt; pipelines, depending on the use case.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage&lt;/strong&gt; – This is where the processed data is stored, typically in a structured format optimized for analytical queries.
&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Presentation&lt;/strong&gt; – This is where the data reaches the end user, such as a data analyst using &lt;strong&gt;BI (Business Intelligence) tools&lt;/strong&gt; to analyze and visualize the data.
&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Data Warehouse vs. Database
&lt;/h2&gt;

&lt;p&gt;The main difference between a &lt;strong&gt;data warehouse&lt;/strong&gt; and a &lt;strong&gt;database&lt;/strong&gt; is &lt;strong&gt;the amount and nature of the data they store&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;database&lt;/strong&gt; is optimized for transactional processing (&lt;strong&gt;OLTP - Online Transaction Processing&lt;/strong&gt;) and handles &lt;strong&gt;real-time&lt;/strong&gt; operations, such as inserting, updating, and deleting records.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; is optimized for analytical processing (&lt;strong&gt;OLAP - Online Analytical Processing&lt;/strong&gt;) and stores &lt;strong&gt;large volumes of historical data&lt;/strong&gt; for reporting and decision-making.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When should you use a data warehouse?
&lt;/h3&gt;

&lt;p&gt;You should use a data warehouse instead of a database when you need to store &lt;strong&gt;historical data&lt;/strong&gt; and perform &lt;strong&gt;complex queries&lt;/strong&gt; on large datasets that grow exponentially.  &lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Warehouse Architecture
&lt;/h2&gt;

&lt;p&gt;There are &lt;strong&gt;three&lt;/strong&gt; main architecture models:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Top-down approach (Inmon)&lt;/strong&gt; – In this approach, the data warehouse is designed to meet &lt;strong&gt;business requirements first&lt;/strong&gt;, ensuring a well-structured, integrated system.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom-up approach (Kimball)&lt;/strong&gt; – This approach prioritizes &lt;strong&gt;quick reporting&lt;/strong&gt; by building &lt;strong&gt;data marts&lt;/strong&gt; first, which can later be integrated into a larger data warehouse. This is the most commonly used approach.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Vault&lt;/strong&gt; – A more &lt;strong&gt;flexible&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt; approach designed for &lt;strong&gt;handling changes&lt;/strong&gt; in data structures over time.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Which model to use depends on your use case.&lt;/strong&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data Modeling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data modeling&lt;/strong&gt; refers to the &lt;strong&gt;visual representation&lt;/strong&gt; of how data is organized within a system. There are &lt;strong&gt;three&lt;/strong&gt; main categories:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual Data Modeling&lt;/strong&gt; – A high-level overview that focuses on business concepts without technical details.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Data Modeling&lt;/strong&gt; – Adds &lt;strong&gt;structure, attributes, and relationships&lt;/strong&gt; to the data.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt; is used for &lt;strong&gt;OLTP&lt;/strong&gt; systems.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimensional Data Model&lt;/strong&gt; is used for &lt;strong&gt;OLAP&lt;/strong&gt; systems.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Physical Data Modeling&lt;/strong&gt; – Specifies how data is actually stored in the database, defining table structures, indexes, and relationships.
&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Star Schema vs. Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;Both of these are types of &lt;strong&gt;Dimensional Data Models&lt;/strong&gt; used in data warehouses.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star Schema&lt;/strong&gt; – A data model where a &lt;strong&gt;central fact table&lt;/strong&gt; is directly connected to &lt;strong&gt;dimension tables&lt;/strong&gt;, forming a star-like structure.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pros:&lt;/em&gt; Simplifies queries and improves performance.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cons:&lt;/em&gt; Can lead to data redundancy.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Snowflake Schema&lt;/strong&gt; – A data model where &lt;strong&gt;dimension tables are normalized&lt;/strong&gt;, breaking them into smaller related tables.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pros:&lt;/em&gt; Reduces data redundancy.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cons:&lt;/em&gt; Increases query complexity.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. OLAP vs. OLTP
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; – Used for &lt;strong&gt;historical data analysis&lt;/strong&gt;, enabling businesses to derive insights from large datasets.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; – Used for &lt;strong&gt;real-time transactions&lt;/strong&gt;, such as banking systems, e-commerce platforms, and booking systems.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLAP&lt;/strong&gt;: Analyzing customer purchasing patterns over the past 5 years.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP&lt;/strong&gt;: Processing an online purchase in a retail store.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Types of Data Warehouses
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;On-Premise Data Warehouse&lt;/strong&gt; – A company &lt;strong&gt;develops and maintains&lt;/strong&gt; its own data warehouse infrastructure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Data Warehouse&lt;/strong&gt; – A company &lt;strong&gt;outsources&lt;/strong&gt; its data warehouse to &lt;strong&gt;cloud providers&lt;/strong&gt; like:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift (AWS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Analytics (Microsoft)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Data Warehouse&lt;/strong&gt; – A combination of &lt;strong&gt;on-premise and cloud&lt;/strong&gt; storage, leveraging the advantages of both.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  9. Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we briefly went over what a data warehouse is and what it entails.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 10 Mar 2025 14:44:13 +0000</pubDate>
      <link>https://dev.to/amin12905/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concept-544o</link>
      <guid>https://dev.to/amin12905/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concept-544o</guid>
      <description>&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;Apache Kafka is an open-source distributed publish-subscribe messaging system. Let’s break this down further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Distributed: Kafka is designed to be fault-tolerant and scalable. It achieves this by allowing multiple Kafka servers (brokers) to work together in a cluster, ensuring system reliability and high availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publish-Subscribe: Kafka has a producer-consumer-like model, in that:&lt;br&gt;
Producers publish messages to Kafka.&lt;br&gt;
Consumers subscribe to Kafka topics and consume messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To better understand this, let's take the example of an e-commerce store. When the store is small, the owner can handle deliveries directly to the customers. However, as the store grows, doing deliveries directly becomes inefficient, causing delays.&lt;/p&gt;

&lt;p&gt;Now, imagine using a postal office to handle deliveries. Instead of personally delivering each order, the owner drops off packages at the postal office, and the postal office ensures delivery to customers efficiently.&lt;/p&gt;

&lt;p&gt;In this example the e-commerce store represents the producer (sending messages/orders). The postal office represents Kafka (managing and delivering messages). The customers represent consumers (receiving the messages/orders). This approach removes bottlenecks, making the system more scalable—just like Kafka does for data processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Core Concepts of Apache Kafka
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Cluster
&lt;/h2&gt;

&lt;p&gt;A Kafka cluster refers to multiple brokers (Kafka servers) working together to ensure scalability, fault tolerance, and high availability of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broker
&lt;/h2&gt;

&lt;p&gt;A broker is an instance of a Kafka server that stores and manages messages. Multiple brokers form a cluster, ensuring data replication and fault tolerance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topic
&lt;/h2&gt;

&lt;p&gt;Kafka organizes data into topics, which are similar to tables in a relational database. Producers write data to topics, and consumers read from them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;Producers are applications that publish messages to Kafka topics. They determine which topic a message should go to and can also decide how messages are partitioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;Consumers are applications that subscribe to topics and consume messages. Kafka ensures that messages are delivered in an ordered and scalable manner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitions
&lt;/h2&gt;

&lt;p&gt;A Kafka topic is divided into multiple partitions to allow parallel processing and increase scalability. Each partition is stored on multiple brokers for fault tolerance. If a broker storing a partition fails, Kafka can still serve data from its replicas on other brokers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;Kafka Connect is a framework that enables integration between Kafka and external systems such as databases, cloud storage, and message queues. It also manages the connector tasks that move data in and out of Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have gone over the basics of Kafka, its use cases, and its core concepts.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building Scalable Data Pipelines with Python – A Complete Guide.</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Sat, 08 Feb 2025 16:43:26 +0000</pubDate>
      <link>https://dev.to/amin12905/building-scalable-data-pipelines-with-python-a-complete-guide-ch5</link>
      <guid>https://dev.to/amin12905/building-scalable-data-pipelines-with-python-a-complete-guide-ch5</guid>
      <description>&lt;h2&gt;
  
  
  What are Data Pipelines
&lt;/h2&gt;

&lt;p&gt;A data pipeline refers to a series of steps used to automate the migration of data from a source to its destination. Sometimes, transformation is performed alongside migration to ensure the data is structured and clean for analysis.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Components of a Pipeline
&lt;/h2&gt;

&lt;p&gt;The components of a pipeline refer to the elements that come together to form a data pipeline. These include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Sources&lt;/strong&gt; - These can include databases, CSV files, APIs, and other file formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion Methods&lt;/strong&gt; - These refer to how data is loaded into the pipeline. There are two main methods:

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt; - This refers to the techniques and tools used to transform data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt; - This refers to where the data is stored, including data warehouses, data lakes, etc. This is usually the final destination of the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Functions of a Pipeline
&lt;/h2&gt;

&lt;p&gt;The key functions of a pipeline include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Considerations When Designing a Pipeline
&lt;/h2&gt;

&lt;p&gt;When designing a data pipeline, the following factors should be considered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintainability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Python ETL Implementation
&lt;/h2&gt;

&lt;p&gt;In this section, we will see how to implement a simple ETL pipeline to read data from a CSV file and an API, then write the data to a PostgreSQL database.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reading from a CSV File
&lt;/h3&gt;

&lt;p&gt;Before building an ETL pipeline to read from a CSV file, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CSV file (generated using &lt;a href="https://www.mockaroo.com/" rel="noopener noreferrer"&gt;Mockaroo&lt;/a&gt; for dummy data).&lt;/li&gt;
&lt;li&gt;A PostgreSQL database (created using &lt;a href="https://aiven.io/" rel="noopener noreferrer"&gt;Aiven&lt;/a&gt; and connected using DBeaver).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following Python script demonstrates how to read from a CSV file and store the data in a PostgreSQL database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# Create a connection to PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Data Cleaning and Transformation
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Load data into PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code successfully migrates data from a CSV file to the database, demonstrating how an ETL pipeline works.&lt;/p&gt;
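
&lt;p&gt;For larger files, the same pipeline can be made more memory-friendly by reading the CSV in chunks instead of all at once; here is a sketch of that variation (same hypothetical connection string as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:password@localhost:5432/etl_db")

# Stream the CSV in chunks instead of loading it all into memory at once
for chunk in pd.read_csv("sales.csv", chunksize=10000):
    chunk = chunk.dropna().rename(columns={'id': 'sales_id'})
    chunk.to_sql("sales", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;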

&lt;h3&gt;
  
  
  2. Reading from an API
&lt;/h3&gt;

&lt;p&gt;We will fetch data from this API: &lt;a href="https://raw.githubusercontent.com/LuxDevHQ/LuxDevHQDataEngineeringGuide/refs/heads/main/samplejson.json" rel="noopener noreferrer"&gt;Sample JSON Data&lt;/a&gt;. This dummy data represents staff members from a fictional company.&lt;/p&gt;

&lt;p&gt;The ETL process is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# API URL
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/LuxDevHQ/LuxDevHQDataEngineeringGuide/refs/heads/main/samplejson.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch data from API
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Transform data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Load data into PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staff_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have learned about data pipelines, their components, key functions, design considerations, and how to implement a simple ETL pipeline to read data from a CSV file and an API into a PostgreSQL database.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
