DEV Community

Gacheri-Mutua
Gacheri-Mutua

Posted on

Building a simple ETL pipeline with python and PostgreSQL.

Data is the backbone of modern applications, but it rarely arrives in a perfectly structured format ready for analysis. Data engineering is the development, execution, and support of systems and processes that take in raw data and produce high-quality, consistent information that supports analysis and decision making.

An ETL (Extract, Transform, Load) pipeline will extract data (from one or multiple sources), transform it (by cleaning or changing the format) and save it in a database. The data pipeline illustrated in this article follows the three phases:
Prerequisite Note: This project assumes a basic familiarity with Linux environments, specifically using the terminal to manage directories and configure environment variables.

Extract: Fetch live news articles about "Apple" via an HTTP GET request and parse the JSON response.

# importing relevant packages
import requests
import os
from dotenv import load_dotenv
import pandas as pd
from sqlalchemy import create_engine



load_dotenv()

url = ('https://newsapi.org/v2/everything?'
       'q=Apple&'
       'from=2026-05-11&'
       'sortBy=popularity&'
       'apiKey=' + os.getenv('NEWS_API_KEY'))

response = requests.get(url)
print (response.json())

#converting the json file into  dictionary
data = response.json()
type(data)
data

#getting the list
articles =data['articles']
type(articles)
articles
Enter fullscreen mode Exit fullscreen mode

Transform: Filter out unnecessary fields and convert the raw JSON into a structured pandas dataframe.

#converting to dataframe
articles_df = pd.DataFrame(articles)
articles_df.head()

#cleaning
articles_df.drop(columns={'source', 'urlToImage'}, inplace=True)
articles_df = articles_df.drop(columns={'source','urlToImage'})
Enter fullscreen mode Exit fullscreen mode

Load: Connect to a PostgreSQL database using SQLAlchemy and save the cleaned data.

DB_HOST = os.getenv('DB_HOST')
DB_USER = os.getenv('DB_USER')
DB_NAME = os.getenv('DB_NAME')
DB_PASSWORD = os.getenv('DB_PASSWORD')
DB_PORT = os.getenv('DB_PORT')

engine = create_engine(f'postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}')

articles_df.to_sql('articles', con=engine,if_exists='replace', index=False)
Enter fullscreen mode Exit fullscreen mode

Code mechanics
The pipeline into distinct extract(), transform(), and load() functions orchestrated by a main(). If the data source changes the extract() function needs adjustment without losing the connection to the database.

The transformed data is successfully loaded into a PostgreSQL and the pipeline uses a .env file in your root directory to securely store your API keys and database credentials and requires to have the necessary libraries installed before writing the script. Having done this, we have successfully built a modular ETL pipeline.

Conclusion
This project gave a hands-on view of how a data pipeline is built and covers the full lifecycle of Extract → Transform → Load.

Top comments (0)