<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed Amin</title>
    <description>The latest articles on DEV Community by Mohamed Amin (@amin12905).</description>
    <link>https://dev.to/amin12905</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2833280%2F7c1305cc-68b3-4edd-9810-270352b6ae19.png</url>
      <title>DEV Community: Mohamed Amin</title>
      <link>https://dev.to/amin12905</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amin12905"/>
    <language>en</language>
    <item>
      <title>Building a YouTube Analytics Dashboard</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Fri, 02 May 2025 19:52:40 +0000</pubDate>
      <link>https://dev.to/amin12905/building-a-youtube-analytics-dashboard-3i77</link>
      <guid>https://dev.to/amin12905/building-a-youtube-analytics-dashboard-3i77</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the growing creator economy, YouTube channels pump out tons of videos every week, and creators need a simple dashboard that answers the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How has my channel grown over time?&lt;/li&gt;
&lt;li&gt;What is the best day and time to post a video?&lt;/li&gt;
&lt;li&gt;What are the most engaging videos in my channel?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To tackle this, I built an end-to-end data pipeline that automatically pulls, processes, and visualizes YouTube data.&lt;/p&gt;

&lt;p&gt;This article goes over the steps I took to achieve this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part 1: Installing the Tools Natively on an Azure Ubuntu VM
&lt;/h1&gt;

&lt;p&gt;Instead of using Docker, I installed everything natively because it gave me more control, better performance tuning, and more flexibility than a containerized environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Install Apache Airflow
&lt;/h2&gt;

&lt;p&gt;Apache Airflow schedules and manages all the ETL jobs automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1.1: Create a dedicated airflow user
&lt;/h3&gt;

&lt;p&gt;First, it’s good practice to create a separate user to run Airflow (avoiding permission issues later).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo adduser airflow&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Follow the prompts (you can skip extra info fields). Then add the user to sudoers:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo usermod -aG sudo airflow&lt;/code&gt;&lt;br&gt;
Login as the new user:&lt;br&gt;
&lt;code&gt;sudo su - airflow&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.2: Install system dependencies
&lt;/h3&gt;

&lt;p&gt;Update the system first:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y&lt;/code&gt;&lt;br&gt;
Install required libraries:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt install -y python3-pip python3-venv libpq-dev&lt;/code&gt;&lt;br&gt;
python3-venv → To create isolated Python environments&lt;/p&gt;

&lt;p&gt;libpq-dev → For Postgres client libraries (needed later)&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.3: Set up Python Virtual Environment for Airflow
&lt;/h3&gt;

&lt;p&gt;It’s highly recommended to run Airflow inside a virtual environment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 -m venv airflow_venv&lt;/code&gt;&lt;br&gt;
&lt;code&gt;source airflow_venv/bin/activate&lt;/code&gt;&lt;br&gt;
Now inside the virtualenv, install Airflow:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install apache-airflow&lt;/code&gt;&lt;br&gt;
✅ Installing Airflow takes a while (~5 mins) because it pulls many dependencies.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1.4: Initialize Airflow Database
&lt;/h3&gt;

&lt;p&gt;Airflow uses a metadata database to track DAG runs and tasks. We'll point it at PostgreSQL later, but for now initialize it with the defaults:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export AIRFLOW_HOME=~/airflow&lt;/code&gt;&lt;br&gt;
&lt;code&gt;airflow db init&lt;/code&gt;&lt;br&gt;
This creates:&lt;/p&gt;

&lt;p&gt;~/airflow/airflow.cfg&lt;/p&gt;

&lt;p&gt;SQLite DB initially (we'll later connect to PostgreSQL).&lt;/p&gt;
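
&lt;p&gt;Later, when switching the metadata database from SQLite to PostgreSQL, the relevant line in ~/airflow/airflow.cfg would look roughly like this (hypothetical credentials and database name; you also need the Postgres extras, e.g. &lt;code&gt;pip install apache-airflow[postgres]&lt;/code&gt;, inside the virtualenv):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# airflow.cfg, [database] section (older Airflow versions use [core])
# hypothetical user/password/database; replace with your own
sql_alchemy_conn = postgresql+psycopg2://airflow:yourpassword@localhost:5432/airflow_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
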
&lt;h3&gt;
  
  
  Step 1.5: Create an Admin User
&lt;/h3&gt;

&lt;p&gt;Airflow needs an admin user to log in to the web UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1.6: Run Airflow services
&lt;/h3&gt;

&lt;p&gt;You need two services running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Webserver (UI)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheduler (Job runner)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start them:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;airflow webserver --port 8080&lt;/code&gt;&lt;br&gt;
In another SSH session/tab:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;airflow scheduler&lt;/code&gt;&lt;br&gt;
You can now access Airflow at http://your-server-ip:8080 (replace your-server-ip with your VM's public IP).&lt;/p&gt;

&lt;p&gt;At this point, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Airflow UI is accessible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Admin login works.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Install Apache Spark (Data Processing Engine)
&lt;/h2&gt;

&lt;p&gt;Spark handles the heavy lifting for data transformation. Make sure you have Java and Scala installed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2.1: Download and extract Spark
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xvzf spark-3.4.0-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Move it to /opt (standard for system-wide apps):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo mv spark-3.4.0-bin-hadoop3 /opt/spark&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2.2: Set Spark Environment Variables
&lt;/h3&gt;

&lt;p&gt;Edit your bash profile:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nano ~/.bashrc&lt;/code&gt;&lt;br&gt;
Add at the bottom:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply changes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source ~/.bashrc&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2.3: Test Spark installation
&lt;/h3&gt;

&lt;p&gt;Check Spark version:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spark-submit --version&lt;/code&gt;&lt;br&gt;
✅ If you see version info without errors, Spark is installed properly.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Install PostgreSQL (Data Warehouse)
&lt;/h2&gt;

&lt;p&gt;PostgreSQL will store all cleaned YouTube data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3.1: Install PostgreSQL Server
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y postgresql postgresql-contrib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3.2: Create Database and User
&lt;/h3&gt;

&lt;p&gt;Switch to Postgres superuser:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo -u postgres psql&lt;/code&gt;&lt;br&gt;
Inside psql shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE youtube_analytics;
CREATE USER airflow WITH ENCRYPTED PASSWORD 'yourpassword';
GRANT ALL PRIVILEGES ON DATABASE youtube_analytics TO airflow;
\q
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Airflow and your scripts will connect using the airflow user.&lt;/p&gt;

&lt;p&gt;At this point, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;psql connects successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;youtube_analytics database exists.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Install Grafana (Dashboarding)
&lt;/h2&gt;

&lt;p&gt;Grafana visualizes the results beautifully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4.1: Add Grafana Repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4.2: Install Grafana
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sudo apt-get install grafana&lt;/code&gt;&lt;br&gt;
Enable and start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl start grafana-server
sudo systemctl enable grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4.3: Access Grafana
&lt;/h3&gt;

&lt;p&gt;Grafana will be available at:&lt;br&gt;
http://your-server-ip:3000&lt;/p&gt;

&lt;p&gt;Default credentials:&lt;/p&gt;

&lt;p&gt;Username: admin&lt;/p&gt;

&lt;p&gt;Password: admin&lt;br&gt;
(you'll be prompted to change password at first login)&lt;/p&gt;

&lt;p&gt;At this point, your server is fully ready:&lt;br&gt;
Airflow + Spark + Postgres + Grafana all installed, running natively.&lt;/p&gt;
&lt;h1&gt;
  
  
  Part 2: Code Walkthrough
&lt;/h1&gt;
&lt;h3&gt;
  
  
  1. The Core ETL Pipeline (main.py)
&lt;/h3&gt;

&lt;p&gt;This file is the heart of the whole system. It does three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pulls data from the YouTube API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processes and enriches it using PySpark&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns clean DataFrames ready to be stored&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Load API Key and Initialize Spark + YouTube Client&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dotenv import load_dotenv
import os
from googleapiclient.discovery import build
from pyspark.sql import SparkSession

load_dotenv()
google_api_key = os.getenv('API_KEY')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This section loads your .env file and gets the API key for authenticating with the YouTube Data API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_spark():
    return SparkSession.builder \
        .appName("YoutubeAnalytics") \
        .config("spark.jars", "/path/to/postgresql-42.6.0.jar") \
        .master("local[*]") \
        .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates a local Spark session and includes the JDBC driver to connect to PostgreSQL later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_youtube():
    return build('youtube', 'v3', developerKey=google_api_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Builds the YouTube API client using the API key.&lt;/p&gt;

&lt;p&gt;Get Subscriber Count (getSubscribers())&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
response = request.execute()
subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets the current subscriber count for the specified channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subscriber_data = [(date.today(), subscriber_count)]
df = spark.createDataFrame(subscriber_data, ["date", "subscribers"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stores the date + subscriber count as a Spark DataFrame. This helps visualize growth over time later.&lt;/p&gt;
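
&lt;p&gt;Putting these pieces together, a minimal sketch of getSubscribers() might look like this (error handling omitted; the channel ID and column names come from the snippets above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import date

def getSubscribers():
    spark = get_spark()
    youtube = get_youtube()

    # Fetch the channel statistics from the YouTube Data API
    request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
    response = request.execute()
    subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])

    # One row per run: today's date + the current subscriber count
    subscriber_data = [(date.today(), subscriber_count)]
    return spark.createDataFrame(subscriber_data, ["date", "subscribers"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;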

&lt;p&gt;Get Top Videos by Engagement (get_videos)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Loop through playlist videos
request = youtube.playlistItems().list(part='snippet,contentDetails', playlistId=playlist_id, maxResults=50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pulls all video IDs from the "Uploads" playlist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pull stats for each video
data_request = youtube.videos().list(part='snippet,statistics', id=','.join(video_ids))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fetches views, likes, and comments for each video.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;engagement_rate = round((like_count + comment_count) / view_count, 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculates engagement rate to identify top-performing videos.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;best_df = df.orderBy(col("engagement_rate").desc()).limit(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sorts and selects the top 5 most engaging videos.&lt;/p&gt;
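
&lt;p&gt;For reference, the per-video rows could be assembled roughly like this before building the Spark DataFrame (the response variable and field names are assumptions based on the snippets above; videos with zero views are skipped to avoid division by zero):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;videos = []
for item in data_response['items']:  # data_response = data_request.execute()
    stats = item['statistics']
    view_count = int(stats.get('viewCount', 0))
    like_count = int(stats.get('likeCount', 0))
    comment_count = int(stats.get('commentCount', 0))
    if view_count == 0:
        continue  # skip brand-new videos to avoid division by zero
    engagement_rate = round((like_count + comment_count) / view_count, 2)
    videos.append((item['snippet']['title'], view_count, like_count, comment_count, engagement_rate))

df = spark.createDataFrame(videos, ["title", "view_count", "likes", "comments", "engagement_rate"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;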

&lt;p&gt;Best Time to Post (best_post_time)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.withColumn("hour", hour(col('published_date')))
df = df.withColumn("day", date_format(col('published_date'), "E"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracts the day of week and hour from the video publish time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.withColumn("engagement", col('likes') + col('view_count') + col('comments'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates an engagement score to use for averaging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grouped = df.groupBy("day", "hour").avg("engagement")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregates average engagement by day/hour pair.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = grouped.select("day_hour", "avg_engagement").orderBy(col("avg_engagement").desc()).limit(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns the best 3 day-hour combinations for posting.&lt;/p&gt;
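
&lt;p&gt;The day_hour column isn't shown explicitly in the snippets above; one way it could be derived is by concatenating the two grouping columns and renaming the averaged column, roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import col, concat_ws

grouped = df.groupBy("day", "hour").avg("engagement") \
    .withColumnRenamed("avg(engagement)", "avg_engagement") \
    .withColumn("day_hour", concat_ws(" ", col("day"), col("hour")))

result = grouped.select("day_hour", "avg_engagement").orderBy(col("avg_engagement").desc()).limit(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;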

&lt;h3&gt;
  
  
  2. The Controller File (controller.py)
&lt;/h3&gt;

&lt;p&gt;This script is responsible for executing the ETL logic and pushing the results to PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from main import getSubscribers, get_videos, best_post_time

subscriber_df = getSubscribers()
videos_list, best_videos_df = get_videos()
best_post_time_df = best_post_time(videos_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It calls the three main ETL functions and stores their outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subscriber_df.write.jdbc(url=jdbc_url, table="subscriber_data", mode='append', properties=properties)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writes each DataFrame into its corresponding PostgreSQL table.&lt;/p&gt;
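
&lt;p&gt;For context, the JDBC URL and connection properties used here would be defined along these lines (hypothetical host and credentials; the driver jar must match the one passed to spark.jars):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical values: point these at the youtube_analytics database created earlier
jdbc_url = "jdbc:postgresql://localhost:5432/youtube_analytics"
properties = {
    "user": "airflow",
    "password": "yourpassword",
    "driver": "org.postgresql.Driver"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;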

&lt;p&gt;Tables used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;subscriber_data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;best_post_time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;best_performing_videos&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clean separation between data logic and storage logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Airflow DAG (youtube_dag.py)
&lt;/h3&gt;

&lt;p&gt;This DAG automates running the pipeline daily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_args = {
    "owner": "Batru",
    "start_date": datetime(2025, 4, 23),
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "email_on_failure": True,
    "email": ["batrudin10@gmail.com"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Defines retry behavior and email alerts on failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task_youtube = BashOperator(
    task_id="task_youtube",
    bash_command="""
        source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &amp;amp;&amp;amp;
        python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
    """
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task activates the Python virtual environment and runs controller.py, kicking off the full ETL.&lt;/p&gt;
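
&lt;p&gt;For completeness, a minimal sketch of the surrounding DAG definition might look like this (the DAG id and the daily schedule are assumptions based on the description above; default_args is the dict defined earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="youtube_analytics_dag",   # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@daily",       # run the full ETL once a day
    catchup=False,
) as dag:
    task_youtube = BashOperator(
        task_id="task_youtube",
        bash_command="""
            source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &amp;amp;&amp;amp;
            python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
        """
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;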

&lt;h1&gt;
  
  
  Part 3: Visualizing YouTube Data with Grafana
&lt;/h1&gt;

&lt;p&gt;With the data successfully loaded into PostgreSQL from our Spark job (via Airflow), it’s time to bring it to life visually using Grafana.&lt;/p&gt;

&lt;p&gt;Grafana is already installed on our Ubuntu server. Here's how we set it up and created dashboards that answer our key analytics questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect Grafana to PostgreSQL
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Login to Grafana&lt;/li&gt;
&lt;li&gt;Navigate to &lt;a href="http://128.85.32.87:3000" rel="noopener noreferrer"&gt;http://128.85.32.87:3000&lt;/a&gt; in your browser.&lt;/li&gt;
&lt;li&gt;Default login is usually:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Username: admin&lt;/p&gt;

&lt;p&gt;Password: admin (you’ll be asked to reset on first login)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add PostgreSQL as a Data Source and Fill in:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Host: localhost:5432 or your DB host&lt;/p&gt;

&lt;p&gt;Database: youtube_analytics&lt;/p&gt;

&lt;p&gt;User and Password: your PostgreSQL credentials&lt;/p&gt;

&lt;p&gt;Click Save &amp;amp; Test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Dashboards
&lt;/h3&gt;

&lt;p&gt;Now that the DB is connected, let’s create panels to visualize the insights.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Channel Growth Over Time
&lt;/h4&gt;

&lt;p&gt;Query from subscriber_data table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT date, subscribers FROM subscriber_data ORDER BY date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Use a Time Series panel&lt;/p&gt;

&lt;p&gt;Y-axis: Subscriber count&lt;/p&gt;

&lt;p&gt;X-axis: Date&lt;/p&gt;

&lt;p&gt;This shows subscriber growth trends over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Best Performing Videos by Engagement
&lt;/h4&gt;

&lt;p&gt;Query from best_performing_videos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title, engagement_rate FROM best_performing_videos ORDER BY engagement_rate DESC LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Bar Chart&lt;/p&gt;

&lt;p&gt;X-axis: Video titles&lt;/p&gt;

&lt;p&gt;Y-axis: Engagement rate&lt;/p&gt;

&lt;p&gt;This reveals which videos truly resonate with the audience.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Best Time to Post
&lt;/h4&gt;

&lt;p&gt;Query from best_post_time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT day_hour, avg_engagement FROM best_post_time ORDER BY avg_engagement DESC LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visualization: Table or Bar Chart&lt;/p&gt;

&lt;p&gt;The final dashboard looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvctzq9ffcx9us8nbcia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvctzq9ffcx9us8nbcia.png" alt="Image description" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Apache Spark for Data Engineering</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 14 Apr 2025 09:56:56 +0000</pubDate>
      <link>https://dev.to/amin12905/introduction-to-apache-spark-for-data-engineering-3fmh</link>
      <guid>https://dev.to/amin12905/introduction-to-apache-spark-for-data-engineering-3fmh</guid>
      <description>&lt;h1&gt;
  
  
  🔥 Introduction
&lt;/h1&gt;

&lt;p&gt;With the volume and velocity of data being generated today, Apache Spark has emerged as a go-to distributed computing framework. Spark is designed for fast processing and scalability, making it ideal for modern data engineering workflows.&lt;/p&gt;

&lt;p&gt;In this article, we will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What Apache Spark is&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Definitions of common Spark terms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Core components of Spark&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why use Spark as a Data Engineer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  ⚙️ What is Apache Spark?
&lt;/h1&gt;

&lt;p&gt;Apache Spark is an open-source data processing engine built for large-scale data workloads. It can be up to 100 times faster than traditional MapReduce frameworks thanks to its in-memory processing capabilities.&lt;/p&gt;

&lt;h1&gt;
  
  
  📘 Common Spark Terms
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. RDD (Resilient Distributed Dataset)
&lt;/h3&gt;

&lt;p&gt;A distributed collection of objects that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Immutable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support in-memory processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offer fault tolerance through lineage information&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. DataFrame
&lt;/h3&gt;

&lt;p&gt;A distributed collection of data organized into named columns, similar to a Pandas DataFrame, but optimized for big data.&lt;/p&gt;
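
&lt;p&gt;To make the distinction concrete, here is a small PySpark sketch that creates a DataFrame and runs a couple of operations on it (illustrative data only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# A tiny in-memory dataset; in practice this would come from files, databases, or streams
data = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

df.filter(df.age &gt; 30).show()    # column-based filtering
df.groupBy().avg("age").show()   # simple aggregation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;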

&lt;h1&gt;
  
  
  🧹 Components of Spark
&lt;/h1&gt;

&lt;p&gt;Spark consists of a core engine and several powerful libraries:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Spark Core
&lt;/h3&gt;

&lt;p&gt;The foundation of the Spark ecosystem, responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Task scheduling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault recovery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Basic I/O operations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Spark SQL
&lt;/h3&gt;

&lt;p&gt;Enables querying of structured data using SQL-like syntax.&lt;/p&gt;
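
&lt;p&gt;For example, a DataFrame can be registered as a temporary view and queried with plain SQL (continuing the illustrative DataFrame above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age &gt; 30").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;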

&lt;h3&gt;
  
  
  3. Spark Streaming
&lt;/h3&gt;

&lt;p&gt;Processes real-time data streams from sources like Kafka, Flume, and sockets, using a micro-batch architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Spark MLlib
&lt;/h3&gt;

&lt;p&gt;A scalable machine learning library built on top of Spark for classification, regression, clustering and recommendation.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. GraphX
&lt;/h3&gt;

&lt;p&gt;A library used for graph processing and computation, useful for tasks such as social network analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  🚀 Why Spark?
&lt;/h1&gt;

&lt;p&gt;Here’s why Spark is widely adopted in big data engineering:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Speed
&lt;/h3&gt;

&lt;p&gt;Spark outperforms Hadoop MapReduce by being up to 100x faster, thanks to its in-memory computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scalability
&lt;/h3&gt;

&lt;p&gt;Spark is built to scale across hundreds or thousands of nodes, handling petabyte-scale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Unified Engine
&lt;/h3&gt;

&lt;p&gt;Spark provides a single engine for batch processing, real-time streaming, machine learning, and graph computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;Spark automatically recovers from node failures using RDD lineage, which tracks how data is derived.&lt;/p&gt;

&lt;h1&gt;
  
  
  🔄 A Typical Spark Workflow for Data Engineering
&lt;/h1&gt;

&lt;p&gt;Here's how Spark fits into a standard data engineering pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt; - Read data from various sources like local files, relational databases, data lakes, or APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt; - Apply transformations such as filtering, joins, aggregations, and custom business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Validation and Cleansing&lt;/strong&gt; - Clean the data, handle nulls, validate schema, and ensure quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Loading&lt;/strong&gt; - Write the processed data to destinations like data warehouses, file systems, or dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  🧠 Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Apache Spark continues to be a game-changer in the fields of big data and data engineering. Its unified architecture, ability to handle large datasets with ease, and support for both batch and real-time processing make it an essential tool for modern data teams.&lt;/p&gt;

&lt;p&gt;As a data engineer, mastering Spark enables you to build fast, scalable, and reliable data pipelines that can drive analytics, power machine learning models, and support real-time applications.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Step-by-Step guide on Live Streaming weather data from Openweather api to MongoDB using Kafka</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 07 Apr 2025 10:46:28 +0000</pubDate>
      <link>https://dev.to/amin12905/step-by-step-guide-on-live-streaming-weather-data-from-openweather-api-to-mongodb-using-kafka-km4</link>
      <guid>https://dev.to/amin12905/step-by-step-guide-on-live-streaming-weather-data-from-openweather-api-to-mongodb-using-kafka-km4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this guide, we’ll stream data from the OpenWeather API and store it in MongoDB, using Kafka for fault tolerance. Before we start, make sure you’ve already set up a cloud server—I'm using an EC2 instance on AWS. Also, sign up on &lt;a href="https://openweathermap.org/" rel="noopener noreferrer"&gt;OpenWeatherMap&lt;/a&gt; and get your API key—we’ll use this to fetch the weather data.&lt;/p&gt;

&lt;p&gt;This guide is broken down into 5 parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up our cloud environment&lt;/li&gt;
&lt;li&gt;Setting up our Kafka environment&lt;/li&gt;
&lt;li&gt;Setting up our MongoDB environment&lt;/li&gt;
&lt;li&gt;Writing our producer and consumer Python files&lt;/li&gt;
&lt;li&gt;Tying everything together to complete the project&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Setting Up Our Cloud Environment
&lt;/h2&gt;

&lt;p&gt;We'll SSH into our EC2 server and check if the Kafka port (default is 9092) is free.&lt;/p&gt;

&lt;h3&gt;
  
  
  a) SSH into EC2
&lt;/h3&gt;

&lt;p&gt;Open your terminal (I’m using Git Bash) and run the SSH command from the EC2 &lt;strong&gt;Connect&lt;/strong&gt; section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; your-key.pem ec2-user@your-ec2-ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  b) Check if Kafka Port is Free
&lt;/h3&gt;

&lt;p&gt;Once logged into the server, run the following:&lt;br&gt;
&lt;code&gt;$ sudo lsof -i :9092&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If it returns nothing, great—no service is using that port, so Kafka is good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Setting Up Our Kafka Environment
&lt;/h2&gt;

&lt;p&gt;Head over to Kafka Downloads and copy the link for version 4.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  a) Download Kafka
&lt;/h3&gt;

&lt;p&gt;Run this command to download kafka binary files:&lt;br&gt;
&lt;code&gt;$ wget https://downloads.apache.org/kafka/4.0.0/kafka_2.13-4.0.0.tgz&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  b) Extract Kafka
&lt;/h3&gt;

&lt;p&gt;Run this command to extract the files:&lt;br&gt;
&lt;code&gt;$ tar -xzvf kafka_2.13-4.0.0.tgz&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can rename the folder for convenience if you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Setting Up Our MongoDB Environment
&lt;/h2&gt;

&lt;p&gt;We’ll use MongoDB Atlas to host our database. Sign up or log in, create a cluster, and grab your connection string (we’ll use this in our Python script).&lt;/p&gt;
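
&lt;p&gt;Before writing the consumer, it's worth confirming the connection string works. A quick check could look like this (assumes the string is stored in a .env file as DB_STRING, matching the consumer script later in this guide):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from dotenv import load_dotenv
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

load_dotenv()
client = MongoClient(os.getenv('DB_STRING'), server_api=ServerApi('1'))
client.admin.command('ping')  # raises an exception if the cluster is unreachable
print("Connected to MongoDB Atlas")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;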

&lt;h2&gt;
  
  
  4. Writing Our Producer and Consumer Python Files
&lt;/h2&gt;

&lt;h2&gt;
  
  
  a) producer.py — Pulling from OpenWeather API and Writing to Kafka:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Producer
import requests, json, os, time
from dotenv import load_dotenv
import pandas as pd

# Load environment variables from .env file
load_dotenv()
weather_api_key = os.getenv('WEATHER_API_KEY')  # Your OpenWeather API key
city_name = 'Nairobi'

# Build API URL
weather_url = f'https://api.openweathermap.org/data/2.5/weather?q={city_name}&amp;amp;appid={weather_api_key}'

# Kafka Producer Configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'python-producer'
}
producer = Producer(config)
topic = 'weather_topic'

# Function to extract data from API
def extract_data():
    response = requests.get(weather_url)
    data = response.json()

    weather_df = pd.DataFrame(data['weather'], index=[0])
    temp_df = pd.DataFrame(data['main'], index=[0])
    location_df = pd.DataFrame({'country': data['sys']['country'], 'city': city_name}, index=[0])

    merged_df = pd.merge(pd.merge(location_df, weather_df, left_index=True, right_index=True), temp_df, left_index=True, right_index=True)
    return merged_df

# Function to transform the data
def transform_data(df):
    df = df.drop(columns=['id', 'icon'])  # Drop unnecessary columns
    cols = ['temp', 'feels_like', 'temp_min', 'temp_max']
    df[cols] = df[cols] - 273  # Convert from Kelvin to Celsius
    return df.to_dict(orient='records')

# Optional callback for delivery status
def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()}[{msg.partition()}]')

# Continuous streaming loop
while True:
    data = extract_data()
    transformed = transform_data(data)
    for record in transformed:
        producer.produce(topic, value=json.dumps(record), callback=delivery_report)
        producer.poll(0)
    time.sleep(600)  # Wait for 10 minutes before fetching again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;br&gt;
We load the API key from .env for security, extract the weather data, clean it, convert the temperatures, and send each data point to Kafka every 10 minutes, since the API refreshes the data every 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  b) consumer.py — Consuming from Kafka and Pushing to MongoDB
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Consumer, KafkaError, KafkaException
from dotenv import load_dotenv
import os, json, time
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Load MongoDB connection string from .env
load_dotenv()
uri = os.getenv('DB_STRING')

# Connect to MongoDB
client = MongoClient(uri, server_api=ServerApi('1'))
db = client.weather_data  # You can name this whatever you want
collection = db.reports

# Kafka Consumer configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'weather-consumer-group',
    'auto.offset.reset': 'earliest'
}
consumer = Consumer(config)
topic = 'weather_topic'
consumer.subscribe([topic])

# Function to insert data into MongoDB
def load_data(records):
    collection.insert_many(records)

# Consume messages loop
while True:
    try:
        msg = consumer.poll(1.0)
        if msg is None:
            print("No message received.")
        elif msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                print("End of partition reached.")
            else:
                raise KafkaException(msg.error())
        else:
            message_data = json.loads(msg.value().decode('utf-8'))
            load_data([message_data])
            print(f"Data stored: {message_data}")
            time.sleep(600)  # Optional: throttle for real-time feel
    except Exception as e:
        print(f"Error: {str(e)}")
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;br&gt;
Consumes data from the Kafka topic, parses the JSON messages, and pushes them to MongoDB. load_data() is where the insertion happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tying Everything Together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a) Start Kafka Server
&lt;/h3&gt;

&lt;p&gt;Navigate to your Kafka directory and run:&lt;br&gt;
&lt;code&gt;$ nohup kafka/bin/kafka-server-start.sh kafka/config/server.properties &amp;amp;&lt;/code&gt;&lt;br&gt;
b) Run Python Files&lt;br&gt;
In the same server session or another, run:&lt;br&gt;
&lt;code&gt;$ nohup python3 consumer.py &amp;amp;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ nohup python3 producer.py &amp;amp;&lt;/code&gt;&lt;br&gt;
The nohup command (together with the trailing &amp;amp;, which runs each process in the background) allows the scripts to keep running even after you disconnect from SSH.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Step: Check MongoDB
&lt;/h2&gt;

&lt;p&gt;Head over to MongoDB Atlas &amp;gt; Clusters &amp;gt; Browse Collections, and you should start seeing weather data coming in every 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv4nzwrvglsa5y3trznx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv4nzwrvglsa5y3trznx.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve now built a working real-time data pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams weather data from OpenWeather API&lt;/li&gt;
&lt;li&gt;Publishes it to Kafka&lt;/li&gt;
&lt;li&gt;Consumes it and stores it in MongoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup is highly scalable and gives you fault tolerance via Kafka. You can build on this with more processing or dashboard visualizations later.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Data Warehouses: An Overview of what a Data Warehouse is</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 24 Mar 2025 14:13:10 +0000</pubDate>
      <link>https://dev.to/amin12905/understanding-data-warehouses-an-overview-of-what-a-data-warehouse-is-1n1m</link>
      <guid>https://dev.to/amin12905/understanding-data-warehouses-an-overview-of-what-a-data-warehouse-is-1n1m</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; refers to a centralized system used to store large amounts of data from different sources. Most of the time, data warehouses store &lt;strong&gt;structured&lt;/strong&gt; data for analytical purposes.  &lt;/p&gt;

&lt;p&gt;Data warehouses help businesses make &lt;strong&gt;data-driven decisions&lt;/strong&gt; by ensuring that data is readily available and easily accessible.  &lt;/p&gt;




&lt;h2&gt;
  
  
  2. Components of a Data Warehouse
&lt;/h2&gt;

&lt;p&gt;There are &lt;strong&gt;four&lt;/strong&gt; main components of a data warehouse:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source&lt;/strong&gt; – This is where the data originates, such as transactional databases, APIs, logs, or external data sources.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging&lt;/strong&gt; – This is the area where data is processed before being loaded into the warehouse.

&lt;ul&gt;
&lt;li&gt;Data is usually moved through either &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; or &lt;strong&gt;ELT (Extract, Load, Transform)&lt;/strong&gt; pipelines, depending on the use case.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage&lt;/strong&gt; – This is where the processed data is stored, typically in a structured format optimized for analytical queries.
&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Presentation&lt;/strong&gt; – This is where the data reaches the end user, such as a data analyst using &lt;strong&gt;BI (Business Intelligence) tools&lt;/strong&gt; to analyze and visualize the data.
&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Data Warehouse vs. Database
&lt;/h2&gt;

&lt;p&gt;The main difference between a &lt;strong&gt;data warehouse&lt;/strong&gt; and a &lt;strong&gt;database&lt;/strong&gt; is &lt;strong&gt;the amount and nature of the data they store&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;database&lt;/strong&gt; is optimized for transactional processing (&lt;strong&gt;OLTP - Online Transaction Processing&lt;/strong&gt;) and handles &lt;strong&gt;real-time&lt;/strong&gt; operations, such as inserting, updating, and deleting records.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; is optimized for analytical processing (&lt;strong&gt;OLAP - Online Analytical Processing&lt;/strong&gt;) and stores &lt;strong&gt;large volumes of historical data&lt;/strong&gt; for reporting and decision-making.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When should you use a data warehouse?
&lt;/h3&gt;

&lt;p&gt;You should use a data warehouse instead of a database when you need to store &lt;strong&gt;historical data&lt;/strong&gt; and perform &lt;strong&gt;complex queries&lt;/strong&gt; on large datasets that grow exponentially.  &lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Warehouse Architecture
&lt;/h2&gt;

&lt;p&gt;There are &lt;strong&gt;three&lt;/strong&gt; main architecture models:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Top-down approach (Inmon)&lt;/strong&gt; – In this approach, the data warehouse is designed to meet &lt;strong&gt;business requirements first&lt;/strong&gt;, ensuring a well-structured, integrated system.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom-up approach (Kimball)&lt;/strong&gt; – This approach prioritizes &lt;strong&gt;quick reporting&lt;/strong&gt; by building &lt;strong&gt;data marts&lt;/strong&gt; first, which can later be integrated into a larger data warehouse. This is the most commonly used approach.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Vault&lt;/strong&gt; – A more &lt;strong&gt;flexible&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt; approach designed for &lt;strong&gt;handling changes&lt;/strong&gt; in data structures over time.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Which model to use depends on your use case.&lt;/strong&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data Modeling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data modeling&lt;/strong&gt; refers to the &lt;strong&gt;visual representation&lt;/strong&gt; of how data is organized within a system. There are &lt;strong&gt;three&lt;/strong&gt; main categories:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual Data Modeling&lt;/strong&gt; – A high-level overview that focuses on business concepts without technical details.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Data Modeling&lt;/strong&gt; – Adds &lt;strong&gt;structure, attributes, and relationships&lt;/strong&gt; to the data.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt; is used for &lt;strong&gt;OLTP&lt;/strong&gt; systems.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimensional Data Model&lt;/strong&gt; is used for &lt;strong&gt;OLAP&lt;/strong&gt; systems.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Physical Data Modeling&lt;/strong&gt; – Specifies how data is actually stored in the database, defining table structures, indexes, and relationships.
&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Star Schema vs. Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;Both of these are types of &lt;strong&gt;Dimensional Data Models&lt;/strong&gt; used in data warehouses.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star Schema&lt;/strong&gt; – A data model where a &lt;strong&gt;central fact table&lt;/strong&gt; is directly connected to &lt;strong&gt;dimension tables&lt;/strong&gt;, forming a star-like structure.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pros:&lt;/em&gt; Simplifies queries and improves performance.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cons:&lt;/em&gt; Can lead to data redundancy.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Snowflake Schema&lt;/strong&gt; – A data model where &lt;strong&gt;dimension tables are normalized&lt;/strong&gt;, breaking them into smaller related tables.

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pros:&lt;/em&gt; Reduces data redundancy.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Cons:&lt;/em&gt; Increases query complexity.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. OLAP vs. OLTP
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; – Used for &lt;strong&gt;historical data analysis&lt;/strong&gt;, enabling businesses to derive insights from large datasets.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; – Used for &lt;strong&gt;real-time transactions&lt;/strong&gt;, such as banking systems, e-commerce platforms, and booking systems.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLAP&lt;/strong&gt;: Analyzing customer purchasing patterns over the past 5 years.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLTP&lt;/strong&gt;: Processing an online purchase in a retail store.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Types of Data Warehouses
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;On-Premise Data Warehouse&lt;/strong&gt; – A company &lt;strong&gt;develops and maintains&lt;/strong&gt; its own data warehouse infrastructure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Data Warehouse&lt;/strong&gt; – A company &lt;strong&gt;outsources&lt;/strong&gt; its data warehouse to &lt;strong&gt;cloud providers&lt;/strong&gt; like:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift (AWS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Analytics (Microsoft)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Data Warehouse&lt;/strong&gt; – A combination of &lt;strong&gt;on-premise and cloud&lt;/strong&gt; storage, leveraging the advantages of both.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  9. Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we briefly went over what a data warehouse is and what it entails.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Mon, 10 Mar 2025 14:44:13 +0000</pubDate>
      <link>https://dev.to/amin12905/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concept-544o</link>
      <guid>https://dev.to/amin12905/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concept-544o</guid>
      <description>&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;Apache Kafka is an open-source distributed publish-subscribe messaging system. Let’s break this down further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Distributed: Kafka is designed to be fault-tolerant and scalable. It achieves this by allowing multiple Kafka servers (brokers) to work together in a cluster, ensuring system reliability and high availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publish-Subscribe: Kafka has a producer-consumer-like model, in that:&lt;br&gt;
Producers publish messages to Kafka.&lt;br&gt;
Consumers subscribe to Kafka topics and consume messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To better understand this, let's take the example of an e-commerce store. When the store is small, the owner can handle deliveries directly to the customers. However, as the store grows, doing deliveries directly becomes inefficient, causing delays.&lt;/p&gt;

&lt;p&gt;Now, imagine using a postal office to handle deliveries. Instead of personally delivering each order, the owner drops off packages at the postal office, and the postal office ensures delivery to customers efficiently.&lt;/p&gt;

&lt;p&gt;In this example the e-commerce store represents the producer (sending messages/orders). The postal office represents Kafka (managing and delivering messages). The customers represent consumers (receiving the messages/orders). This approach removes bottlenecks, making the system more scalable—just like Kafka does for data processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Core Concepts of Apache Kafka
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Cluster
&lt;/h2&gt;

&lt;p&gt;A Kafka cluster refers to multiple brokers (Kafka servers) working together to ensure scalability, fault tolerance, and high availability of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broker
&lt;/h2&gt;

&lt;p&gt;A broker is an instance of a Kafka server that stores and manages messages. Multiple brokers form a cluster, ensuring data replication and fault tolerance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topic
&lt;/h2&gt;

&lt;p&gt;Kafka organizes data into topics, which are similar to tables in a relational database. Producers write data to topics, and consumers read from them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;Producers are applications that publish messages to Kafka topics. They determine which topic a message should go to and can also decide how messages are partitioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;Consumers are applications that subscribe to topics and consume messages. Kafka ensures that messages are delivered in an ordered and scalable manner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitions
&lt;/h2&gt;

&lt;p&gt;A Kafka topic is divided into multiple partitions to allow parallel processing and increase scalability. Each partition is stored on multiple brokers for fault tolerance. If a broker storing a partition fails, Kafka can still serve data from its replicas on other brokers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;Kafka Connect is a framework that enables integration between Kafka and external systems such as databases, cloud storage, and message queues. It also manages the connector tasks that move data in and out of Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have gone over the basics of Kafka, its use cases, and its core concepts.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building Scalable Data Pipelines with Python – A Complete Guide.</title>
      <dc:creator>Mohamed Amin</dc:creator>
      <pubDate>Sat, 08 Feb 2025 16:43:26 +0000</pubDate>
      <link>https://dev.to/amin12905/building-scalable-data-pipelines-with-python-a-complete-guide-ch5</link>
      <guid>https://dev.to/amin12905/building-scalable-data-pipelines-with-python-a-complete-guide-ch5</guid>
      <description>&lt;h2&gt;
  
  
  What are Data Pipelines
&lt;/h2&gt;

&lt;p&gt;A data pipeline refers to a series of steps used to automate the migration of data from a source to its destination. Sometimes, transformation is performed alongside migration to ensure the data is structured and clean for analysis.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Components of a Pipeline
&lt;/h2&gt;

&lt;p&gt;The components of a pipeline refer to the elements that come together to form a data pipeline. These include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Sources&lt;/strong&gt; - These can include databases, CSV files, APIs, and other file formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion Methods&lt;/strong&gt; - These refer to how data is loaded into the pipeline. There are two main methods:

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt; - This refers to the techniques and tools used to transform data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt; - This refers to where the data is stored, including data warehouses, data lakes, etc. This is usually the final destination of the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Functions of a Pipeline
&lt;/h2&gt;

&lt;p&gt;The key functions of a pipeline include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Considerations When Designing a Pipeline
&lt;/h2&gt;

&lt;p&gt;When designing a data pipeline, the following factors should be considered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintainability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Python ETL Implementation
&lt;/h2&gt;

&lt;p&gt;In this section, we will see how to implement a simple ETL pipeline to read data from a CSV file and an API, then write the data to a PostgreSQL database.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reading from a CSV File
&lt;/h3&gt;

&lt;p&gt;Before building an ETL pipeline to read from a CSV file, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CSV file (generated using &lt;a href="https://www.mockaroo.com/" rel="noopener noreferrer"&gt;Mockaroo&lt;/a&gt; for dummy data).&lt;/li&gt;
&lt;li&gt;A PostgreSQL database (created using &lt;a href="https://aiven.io/" rel="noopener noreferrer"&gt;Aiven&lt;/a&gt; and connected using DBeaver).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following Python script demonstrates how to read from a CSV file and store the data in a PostgreSQL database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# Create a connection to PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Data Cleaning and Transformation
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Load data into PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code successfully migrates data from a CSV file to the database, demonstrating how an ETL pipeline works.&lt;/p&gt;
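
&lt;p&gt;For larger files, the same pipeline can be made more memory-friendly by reading the CSV in chunks instead of all at once; here is a sketch of that variation (same hypothetical connection string as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:password@localhost:5432/etl_db")

# Stream the CSV in chunks instead of loading it all into memory at once
for chunk in pd.read_csv("sales.csv", chunksize=10000):
    chunk = chunk.dropna().rename(columns={'id': 'sales_id'})
    chunk.to_sql("sales", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;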

&lt;h3&gt;
  
  
  2. Reading from an API
&lt;/h3&gt;

&lt;p&gt;We will fetch data from this API: &lt;a href="https://raw.githubusercontent.com/LuxDevHQ/LuxDevHQDataEngineeringGuide/refs/heads/main/samplejson.json" rel="noopener noreferrer"&gt;Sample JSON Data&lt;/a&gt;. This dummy data represents staff members from a fictional company.&lt;/p&gt;

&lt;p&gt;The ETL process is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# API URL
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/LuxDevHQ/LuxDevHQDataEngineeringGuide/refs/heads/main/samplejson.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch data from API
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Transform data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Load data into PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staff_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have learned about data pipelines, their components, key functions, design considerations, and how to implement a simple ETL pipeline to read data from a CSV file and an API into a PostgreSQL database.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
