Understanding ETL: A Chaotic Introduction

#etl #python #api

Build your first Python Data pipeline using the News API, and make some sense of the acronyms

Introduction
ETL vs ELT: What is the difference and why does it matter?
Set up your environment
Extract Data
Transform Data
Load Data
What next?

- Introduction

If you are just getting into the data space or tech in general, you have experienced what sounds like jargon hell..ETL, ELT, pipelines, data warehouse...These are just a few. Let it not deter you because it can all be broken down into far simpler concepts than they sound.
In this article, I will attempt to bring to light what they are and how they apply in your data learning ambitions.

- ETL vs ELT: What is the difference and why does it matter?

ETL: Extract, Transform, Load
ELT: Extract, Load, Transform

The difference is the order in which data is handled in ETL. We transform(clean and process data) before loading ( store in a database or preferred destination)

A simple analogy is:
ETL is washing your vegetables, chopping them, and then putting them in the fridge.

ELT is dumping all the groceries straight into the fridge and dealing with them when you're ready to cook.

Why does the order matter?
Because it dictates where the transformation happens, how fast data lands, and how flexible you are later.

So why did ELT become a thing?
Modern cloud data warehouses (Snowflake, BigQuery, Redshift) are insanely powerful — they can handle transformations on massive amounts of raw data without breaking a sweat. That meant companies could just dump data in immediately and figure out what to do with it later. No waiting, no losing raw data, maximum flexibility.

Which one are we doing here?
Our NewsAPI project is classic ETL. We'll pull the data, clean it up with Python, and only then store the polished result in SQLite. No warehouse, no raw dump — just a straightforward pipeline from chaos to clarity.

- Set up your environment

Download and install Python here if you have not done so.
Avoid working on the global environment; create a virtual environment for your work.

mkdir news_etl
cd news_etl
python -m venv news_env
source news_env/bin/activate

Install libraries required

pip install requests pandas sqlalchemy python-dotenv

Pro Tip: A good way to test things out is using a jupyter notebooks or google colab to get a working version before bringing everything into one file. Feel free to try both ways

- Extract Data

Go to News API site to create an account and get yourself an API key.
On a python file, have the code as follows:

import requests
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv('NEWS_API_KEY')


def extract_data():
    url = f'https://newsapi.org/v2/everything?q=apple&from=2026-06-09&to=2026-06-09&sortBy=popularity&apiKey={api_key}'

    res = requests.get(url)
    data = res.json()
    print(res.status_code)

    return data

This allows you to request for data from the link used and is authenticated by your api key. Something to note you need to also create a .env file and store your api key there. It is a good security practice.

You should get a status code 200 which is code for success.

- Transform the data

Here we will perform a simple cleanup of our data.

Get only the articles
Convert to a data frame
drop two columns that we won't need

import pandas as pd - add this at the top of your file

def transform_data(data):
    articles = data.get('articles')
    articles_df = pd.DataFrame(articles)
    articles_df.head()
    try:
        articles_df.drop(columns=['source', 'urlToImage'], inplace=True)
    except KeyError as e:
        print(f'Raised Key Error: {e}')

    return articles_df

- Load Data

The data is now ready to be stored in a database. To connect to your database and load data using code is an easy process.
Prepare your database credentials(This might be differenc=t depending on your database setup). In our case, we are using a local PostgreSQL instance.

In your .env file, add the following data:

DATABASE_NAME=postgres
USER=postgres
PORT=5432
DATABASE_PASSWORD=your_db_password
HOST=localhost

To load this on your Python script, add the following lines:

DATABASE_NAME = os.getenv('DATABASE_NAME')
USER = os.getenv('USER')
PORT = os.getenv('PORT')
DATABASE_PASSWORD =os.getenv('DATABASE_PASSWORD')
HOST = os.getenv('HOST')

Add the necessary imports at the top of your file:
from sqlalchemy import create_engine

def load_data(articles_df):

    engine = create_engine(f'postgresql+psycopg2://{USER}:{DATABASE_PASSWORD}@{HOST}:{PORT}/{DATABASE_NAME}')
    articles_df.to_sql('articles', engine, if_exists='append', index=False)

    message= "pipeline run successful"
    return message

Let's break this line down

create_engine - is a method used to connect to a database using the credentials passed in. In our case, the PostgreSQL database instance.
articles_df - is our articles dataframe created in the previous step.
.to_sql function is used to write records stored in a DataFrame to a SQL database. Arguments passed are the table name, the variable used with the create_engine function, and the method used to add data

That's it! Your pipeline is complete.

Wait, one more thing that adds to the chaos is actually calling your pipeline to run. In Python, functions have to be called for them to work

on your script:

def run_pipeline():
    raw_data = extract_data()
    clean_data_as_df = transform_data(raw_data)
    load_clean_data = load_data(clean_data_as_df)

    print(load_clean_data)

run_pipeline()