PETER AMORO

How I Built My First ETL Pipeline with Apache Airflow

ABSTRACT
In this project, I built a simple ETL (Extract, Transform, Load) pipeline to understand how real-world data systems work. The pipeline fetches stock market data from an external API, cleans and structures it, and stores it in a PostgreSQL database. Apache Airflow is then used to automate the workflow. The goal of this project was not just to make the pipeline work, but to understand each stage clearly before integrating everything together.

INTRODUCTION
Most data from APIs is not ready to be used directly. When I first fetched stock data, I realized that although the data was available, it was not readable or structured in a way that could be analyzed easily. This is where ETL pipelines become important.

An ETL pipeline extracts data from a source, transforms it into a usable format, and loads it into a storage system. Apache Airflow helps automate this process by scheduling tasks and managing workflows. Instead of running scripts manually, Airflow allows the pipeline to run automatically at set intervals.

EXTRACT PHASE

The first step was extracting stock data from the Massive (Polygon.io) API. I used the aggregates endpoint to fetch hourly data for a stock ticker. At first, I tried using a different endpoint and only got daily data, which made me realize how important it is to understand the API structure.

The API returned data in JSON format, but not all of it was useful. The records I actually needed were inside the “results” field. One mistake I made early on was trying to work with the entire response instead of focusing on that field. Once I fixed that, I was able to get clean hourly records.
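Here is a minimal sketch of what my extract step looks like. The ticker symbol, date range, and API key are placeholders, and the URL is the standard Polygon.io aggregates endpoint for hourly bars:

```python
import requests

# Placeholder values -- swap in your own ticker, date range, and API key
API_KEY = "YOUR_POLYGON_API_KEY"
TICKER = "AAPL"
URL = (
    f"https://api.polygon.io/v2/aggs/ticker/{TICKER}/range/1/hour/"
    f"2024-01-01/2024-01-31?apiKey={API_KEY}"
)

def extract_stock_data():
    """Fetch hourly aggregate bars and return only the 'results' list."""
    response = requests.get(URL, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    payload = response.json()
    # The useful records live under the "results" key, not in the full response
    return payload.get("results", [])
```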

TRANSFORM PHASE

The transform step was where the data started to make sense. The raw data had short column names like “o”, “c”, and “t”, which were not intuitive. I converted the data into a Pandas DataFrame and renamed the columns to more meaningful names like “open”, “close”, and “timestamp”.

Another important step was converting the timestamp from Unix milliseconds into a readable datetime format. Without this, it would have been difficult to understand when each record was captured.
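A rough sketch of those two steps, assuming the raw “results” list from the extract step is passed in (the renamed column names are simply the ones I settled on):

```python
import pandas as pd

def transform_records(records):
    """Turn the raw 'results' list into a readable DataFrame."""
    df = pd.DataFrame(records)

    # Rename the short Polygon column names to meaningful ones
    df = df.rename(columns={
        "o": "open",
        "h": "high",
        "l": "low",
        "c": "close",
        "v": "volume",
        "t": "timestamp",
    })

    # Convert the Unix-millisecond timestamps into readable datetimes
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df
```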

I also added new columns such as price change, calculated as the difference between the closing and opening price, and average price, calculated from the high and low values. This made the dataset more useful for analysis instead of just being raw data.
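Those derived columns come down to two one-line calculations on the DataFrame. I am treating the average price here as the midpoint of the high and low values:

```python
def add_derived_columns(df):
    """Add the derived metrics described above."""
    df["price_change"] = df["close"] - df["open"]        # difference between close and open
    df["average_price"] = (df["high"] + df["low"]) / 2   # midpoint of high and low
    return df
```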

LOAD PHASE

After transforming the data, I loaded it into a PostgreSQL database using SQLAlchemy. This step allowed me to store the data permanently instead of keeping it only in memory.

(Image: Aiven PostgreSQL connection details)

At this stage, I used a simple approach where the table is replaced each time the pipeline runs. I understand that in a real production system, this would need to be improved to avoid losing historical data and to handle duplicates properly.
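A sketch of that load step with SQLAlchemy, using the same replace-on-every-run approach (the connection string and table name are placeholders; in my setup the connection string points at the Aiven-hosted PostgreSQL service):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- replace with your own PostgreSQL credentials
engine = create_engine("postgresql+psycopg2://user:password@host:5432/defaultdb")

def load_to_postgres(df: pd.DataFrame) -> None:
    """Write the transformed DataFrame to PostgreSQL, replacing the table on every run."""
    df.to_sql("stock_prices", engine, if_exists="replace", index=False)
```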

USING APACHE AIRFLOW
Once the ETL process was working on its own, I integrated it with Apache Airflow. One important lesson I learned was that the ETL logic should be tested independently before it is moved into Airflow. Initially, I tried running everything inside Airflow from the start, which made debugging difficult.

Airflow works by defining tasks inside a DAG (Directed Acyclic Graph). Each step of the pipeline becomes a task, and Airflow manages the execution order. This makes it easy to automate the pipeline and run it at intervals such as every hour.
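As a sketch of what that looks like, here is a trimmed-down DAG using Airflow's TaskFlow API. It assumes the functions from the earlier sections live in a hypothetical etl.py module next to the DAG file, and it collapses the transform and load steps into a single task so the intermediate DataFrame never has to pass between tasks:

```python
from datetime import datetime

from airflow.decorators import dag, task

# Hypothetical module holding the extract/transform/load functions sketched earlier
from etl import extract_stock_data, transform_records, add_derived_columns, load_to_postgres

@dag(
    dag_id="stock_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # run the pipeline once every hour
    catchup=False,
)
def stock_etl_pipeline():

    @task
    def extract() -> list:
        # The raw "results" list is plain JSON, so it passes safely between tasks
        return extract_stock_data()

    @task
    def transform_and_load(records: list) -> None:
        df = add_derived_columns(transform_records(records))
        load_to_postgres(df)

    transform_and_load(extract())

stock_etl_pipeline()
```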

CONCLUSION
This project helped me understand how ETL pipelines work in practice. By extracting data from an API, transforming it into a clean format, and loading it into a database, I was able to simulate a real-world data engineering workflow.

Apache Airflow adds value by automating and managing the pipeline, making it more reliable and scalable. Moving forward, I plan to improve the pipeline by adding support for multiple stock tickers and implementing better data loading strategies.
