Data pipelines are the backbone of any modern data platform — but building them is only half the battle.
Keeping them efficient, observable, and trustworthy is where real engineering comes in.
In this post, we’ll build a complete, observable data pipeline using:
- 🌀 Apache Airflow — for orchestration
- 🐘 PostgreSQL — as our database
- 🧊 Polar — for continuous profiling and observability
- 🐳 Docker — to tie it all together
By the end, you’ll have a running system that not only moves data but also monitors itself in real time.
🧩 Prerequisites
Make sure you have the following installed before starting:

- 🐳 Docker
- Docker Compose
🗂️ Project Structure
Let’s start with a clean, scalable structure for our Airflow project:
```
.
├── dags/
│   └── simple_etl_dag.py
├── docker-compose.yml
├── polar-agent-config.yaml
└── .env
```

- `dags/` → Your Airflow DAGs live here.
- `docker-compose.yml` → Defines and connects your services.
- `polar-agent-config.yaml` → Configuration for the Polar agent (conceptual; see the note below).
- `.env` → Keeps environment variables separate from code.
⚙️ Orchestrating with Docker Compose
Let’s define our infrastructure.
We’ll spin up PostgreSQL, Airflow, and Polar in one command.
Step 1: Environment file
Create a `.env` file:

```
AIRFLOW_UID=50000
```

This sets the UID the Airflow containers run as (via the `user:` entry in the compose file), so files written to the mounted `./dags` folder keep sensible ownership. On Linux, you can match your own user with `id -u`.
Step 2: Docker Compose setup
Here’s a minimal working setup (simplified for this guide):
```yaml
version: '3'

services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5

  # One-time setup before the first `docker-compose up`:
  #   docker-compose run --rm airflow-webserver airflow db migrate
  #   docker-compose run --rm airflow-webserver airflow users create \
  #     --username admin --password admin --firstname Admin --lastname User \
  #     --role Admin --email admin@example.com
  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    # Run as the UID from .env so files in ./dags keep sensible ownership
    user: "${AIRFLOW_UID:-50000}:0"
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
      # Used by the DAG below: points PostgresHook's default connection at this database
      - AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 3

  # With LocalExecutor, a scheduler is required to parse and actually run DAGs
  airflow-scheduler:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    user: "${AIRFLOW_UID:-50000}:0"
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
      - AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    command: scheduler

  polar-agent:
    image: polar-agent:latest
    command:
      - "agent"
      - "--config-file=/etc/polar/agent.yaml"
    volumes:
      - ./polar-agent-config.yaml:/etc/polar/agent.yaml
    depends_on:
      - airflow-webserver
```
💡 Polar setup here is conceptual — always refer to the official Polar docs for the latest integration method (usually via a sidecar or host-level agent).
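Before moving on, it can be handy to know when the webserver is actually ready. Here's a small, optional Python sketch (assumes Python 3 on your host, standard library only) that polls the same `/health` endpoint the container healthcheck above uses:

```python
import json
import time
import urllib.request


def wait_for_airflow(url: str = "http://localhost:8080/health", timeout: int = 300) -> None:
    """Poll the Airflow webserver /health endpoint until it responds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                # The endpoint reports component health (metadatabase, scheduler, ...)
                print(json.load(resp))
                return
        except Exception:
            time.sleep(5)  # not up yet, retry
    raise TimeoutError(f"Airflow did not become healthy within {timeout}s")


if __name__ == "__main__":
    wait_for_airflow()
```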
🧠 Creating a Simple Airflow DAG
Time to build our first ETL pipeline.
This DAG will:
- Create a `customers` table in Postgres.
- Insert a sample record.

Create `dags/simple_etl_dag.py`:
```python
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(
    dag_id="simple_postgres_etl",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # trigger manually from the UI
    catchup=False,
    tags=["etl", "postgres"],
)
def simple_postgres_etl():
    @task
    def create_customers_table():
        # postgres_default is supplied via AIRFLOW_CONN_POSTGRES_DEFAULT in docker-compose.yml
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            CREATE TABLE IF NOT EXISTS customers (
                customer_id SERIAL PRIMARY KEY,
                name VARCHAR NOT NULL,
                signup_date DATE
            );
        """)

    @task
    def insert_new_customer():
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            INSERT INTO customers (name, signup_date)
            VALUES ('John Doe', '2025-09-26');
        """)

    # Run the DDL before the insert
    create_customers_table() >> insert_new_customer()


simple_postgres_etl()
```
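The hard-coded INSERT keeps the example readable, but in a real pipeline you would usually pass data between tasks and bind values as query parameters instead of interpolating SQL. Here's a sketch of how that could look with the TaskFlow API; the `parameterized_postgres_etl` DAG, the `extract_customers` helper, and its sample rows are made up for illustration:

```python
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(
    dag_id="parameterized_postgres_etl",  # hypothetical companion DAG
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["etl", "postgres"],
)
def parameterized_postgres_etl():
    @task
    def extract_customers():
        # Stand-in for a real source (API, file, upstream table)
        return [
            {"name": "Ada Lovelace", "signup_date": "2025-09-26"},
            {"name": "Grace Hopper", "signup_date": "2025-09-27"},
        ]

    @task
    def load_customers(rows):
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        for row in rows:
            # parameters= lets the driver handle quoting and escaping
            pg_hook.run(
                "INSERT INTO customers (name, signup_date) VALUES (%s, %s);",
                parameters=(row["name"], row["signup_date"]),
            )

    # Passing the return value wires up the dependency and moves data via XCom
    load_customers(extract_customers())


parameterized_postgres_etl()
```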
Now run the one-time initialization noted in the compose comments (database migration and admin user), then start your stack:

```bash
docker-compose up
```

Head to http://localhost:8080, log in as the admin user you created, and you'll find your DAG there, ready to trigger.
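After the run goes green, you can confirm the row actually landed. One quick way from your host, assuming you have `psycopg2-binary` installed locally (the port mapping in docker-compose.yml exposes Postgres on localhost:5432):

```python
import psycopg2

# Connection details match the docker-compose.yml above
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="airflow",
    user="airflow",
    password="airflow",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT customer_id, name, signup_date FROM customers;")
    for row in cur.fetchall():
        print(row)

conn.close()
```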
🔍 Observability with Polar
Once the pipeline runs, Polar starts profiling automatically.
Here’s what you can do in the Polar UI:
- Filter by Service – Focus on `airflow-webserver` or `airflow-scheduler`.
- Analyze CPU & Memory – Spot heavy tasks and resource spikes.
- Identify Bottlenecks – Catch inefficiencies before they cause downtime.
🎯 This is where orchestration meets observability — you’re not just scheduling jobs, you’re understanding their runtime behavior.
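To give the profiler something worth looking at, you can temporarily add a deliberately CPU-heavy task to the DAG and watch it surface in the CPU view. The task below is purely illustrative:

```python
from airflow.decorators import task


@task
def cpu_heavy_transform():
    """Illustrative hot spot: burns CPU so it stands out in a profile."""
    total = 0
    for i in range(20_000_000):
        total += i * i
    return total

# Inside simple_postgres_etl(), wire it into the chain, e.g.:
# create_customers_table() >> cpu_heavy_transform() >> insert_new_customer()
```

With LocalExecutor, task processes are forked from the scheduler, so the extra CPU time should show up under the scheduler service's profile.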
🏁 Wrapping Up
You’ve built a small but powerful foundation for observable data engineering:
✅ Airflow orchestrates
✅ Postgres stores
✅ Polar profiles
✅ Docker glues it all together
This setup takes you from reactive debugging to proactive optimization.
When your data pipelines tell you what’s happening under the hood — you’re no longer guessing; you’re engineering.
💬 If you enjoyed this, consider following for more hands-on data engineering guides like this one. Got questions? Drop them below or ping me on LinkedIn.