Data is the backbone of modern applications, but it rarely arrives in a perfectly structured format ready for analysis. Data engineering is the development, execution, and support of systems and processes that take in raw data and produce high-quality, consistent information that supports analysis and decision making.
An ETL (Extract, Transform, Load) pipeline will extract data (from one or multiple sources), transform it (by cleaning or changing the format) and save it in a database. The data pipeline illustrated in this article follows the three phases:
Prerequisite Note: This project assumes a basic familiarity with Linux environments, specifically using the terminal to manage directories and configure environment variables.
Extract: Fetch live news articles about "Apple" via an HTTP GET request and parse the JSON response.
# importing relevant packages
import requests
import os
from dotenv import load_dotenv
import pandas as pd
from sqlalchemy import create_engine
load_dotenv()
url = ('https://newsapi.org/v2/everything?'
'q=Apple&'
'from=2026-05-11&'
'sortBy=popularity&'
'apiKey=' + os.getenv('NEWS_API_KEY'))
response = requests.get(url)
print (response.json())
#converting the json file into dictionary
data = response.json()
type(data)
data
#getting the list
articles =data['articles']
type(articles)
articles
Transform: Filter out unnecessary fields and convert the raw JSON into a structured pandas dataframe.
#converting to dataframe
articles_df = pd.DataFrame(articles)
articles_df.head()
#cleaning
articles_df.drop(columns={'source', 'urlToImage'}, inplace=True)
articles_df = articles_df.drop(columns={'source','urlToImage'})
Load: Connect to a PostgreSQL database using SQLAlchemy and save the cleaned data.
DB_HOST = os.getenv('DB_HOST')
DB_USER = os.getenv('DB_USER')
DB_NAME = os.getenv('DB_NAME')
DB_PASSWORD = os.getenv('DB_PASSWORD')
DB_PORT = os.getenv('DB_PORT')
engine = create_engine(f'postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}')
articles_df.to_sql('articles', con=engine,if_exists='replace', index=False)
Code mechanics
The pipeline into distinct extract(), transform(), and load() functions orchestrated by a main(). If the data source changes the extract() function needs adjustment without losing the connection to the database.
The transformed data is successfully loaded into a PostgreSQL and the pipeline uses a .env file in your root directory to securely store your API keys and database credentials and requires to have the necessary libraries installed before writing the script. Having done this, we have successfully built a modular ETL pipeline.
Conclusion
This project gave a hands-on view of how a data pipeline is built and covers the full lifecycle of Extract → Transform → Load.
Top comments (0)