Data pipelines are the backbone of any modern data platform — but building them is only half the battle.
Keeping them efficient, observable, and trustworthy is where real engineering comes in.
In this post, we’ll build a complete, observable data pipeline using:
- 🌀 Apache Airflow — for orchestration
- 🐘 PostgreSQL — as our database
- 🧊 Polar — for continuous profiling and observability
- 🐳 Docker — to tie it all together
By the end, you’ll have a running system that not only moves data but also monitors itself in real time.
🧩 Prerequisites
Make sure you have the following installed before starting:

- 🐳 Docker
- Docker Compose
🗂️ Project Structure
Let’s start with a clean, scalable structure for our Airflow project:
```
.
├── dags/
│   └── simple_etl_dag.py
├── docker-compose.yml
├── polar-agent-config.yaml
└── .env
```

- `dags/` → Your Airflow DAGs live here.
- `docker-compose.yml` → Defines and connects your services.
- `polar-agent-config.yaml` → Configuration for the Polar agent (conceptual; see the note below).
- `.env` → Keeps environment variables separate from code.
⚙️ Orchestrating with Docker Compose
Let’s define our infrastructure.
We’ll spin up PostgreSQL, Airflow, and Polar in one command.
Step 1: Environment file
Create a `.env` file:

```
AIRFLOW_UID=50000
```

This sets the UID the Airflow containers run as (via the `user:` entry in the compose file), so files written to the mounted `./dags` folder keep sensible ownership. On Linux, you can match your own user with `id -u`.
Step 2: Docker Compose setup
Here’s a minimal working setup (simplified for this guide):
```yaml
version: '3'

services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5

  # One-time setup before the first `docker-compose up`:
  #   docker-compose run --rm airflow-webserver airflow db migrate
  #   docker-compose run --rm airflow-webserver airflow users create \
  #     --username admin --password admin --firstname Admin --lastname User \
  #     --role Admin --email admin@example.com
  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    # Run as the UID from .env so files in ./dags keep sensible ownership
    user: "${AIRFLOW_UID:-50000}:0"
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
      # Used by the DAG below: points PostgresHook's default connection at this database
      - AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 3

  # With LocalExecutor, a scheduler is required to parse and actually run DAGs
  airflow-scheduler:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    user: "${AIRFLOW_UID:-50000}:0"
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
      - AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    command: scheduler

  polar-agent:
    image: polar-agent:latest
    command:
      - "agent"
      - "--config-file=/etc/polar/agent.yaml"
    volumes:
      - ./polar-agent-config.yaml:/etc/polar/agent.yaml
    depends_on:
      - airflow-webserver
```
💡 Polar setup here is conceptual — always refer to the official Polar docs for the latest integration method (usually via a sidecar or host-level agent).
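Before moving on, it can be handy to know when the webserver is actually ready. Here's a small, optional Python sketch (assumes Python 3 on your host, standard library only) that polls the same `/health` endpoint the container healthcheck above uses:

```python
import json
import time
import urllib.request


def wait_for_airflow(url: str = "http://localhost:8080/health", timeout: int = 300) -> None:
    """Poll the Airflow webserver /health endpoint until it responds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                # The endpoint reports component health (metadatabase, scheduler, ...)
                print(json.load(resp))
                return
        except Exception:
            time.sleep(5)  # not up yet, retry
    raise TimeoutError(f"Airflow did not become healthy within {timeout}s")


if __name__ == "__main__":
    wait_for_airflow()
```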
🧠 Creating a Simple Airflow DAG
Time to build our first ETL pipeline.
This DAG will:
- Create a `customers` table in Postgres.
- Insert a sample record.

Create `dags/simple_etl_dag.py`:
```python
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(
    dag_id="simple_postgres_etl",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # trigger manually from the UI
    catchup=False,
    tags=["etl", "postgres"],
)
def simple_postgres_etl():
    @task
    def create_customers_table():
        # postgres_default is supplied via AIRFLOW_CONN_POSTGRES_DEFAULT in docker-compose.yml
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            CREATE TABLE IF NOT EXISTS customers (
                customer_id SERIAL PRIMARY KEY,
                name VARCHAR NOT NULL,
                signup_date DATE
            );
        """)

    @task
    def insert_new_customer():
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            INSERT INTO customers (name, signup_date)
            VALUES ('John Doe', '2025-09-26');
        """)

    # Run the DDL before the insert
    create_customers_table() >> insert_new_customer()


simple_postgres_etl()
```
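The hard-coded INSERT keeps the example readable, but in a real pipeline you would usually pass data between tasks and bind values as query parameters instead of interpolating SQL. Here's a sketch of how that could look with the TaskFlow API; the `parameterized_postgres_etl` DAG, the `extract_customers` helper, and its sample rows are made up for illustration:

```python
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(
    dag_id="parameterized_postgres_etl",  # hypothetical companion DAG
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["etl", "postgres"],
)
def parameterized_postgres_etl():
    @task
    def extract_customers():
        # Stand-in for a real source (API, file, upstream table)
        return [
            {"name": "Ada Lovelace", "signup_date": "2025-09-26"},
            {"name": "Grace Hopper", "signup_date": "2025-09-27"},
        ]

    @task
    def load_customers(rows):
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        for row in rows:
            # parameters= lets the driver handle quoting and escaping
            pg_hook.run(
                "INSERT INTO customers (name, signup_date) VALUES (%s, %s);",
                parameters=(row["name"], row["signup_date"]),
            )

    # Passing the return value wires up the dependency and moves data via XCom
    load_customers(extract_customers())


parameterized_postgres_etl()
```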
Now run the one-time initialization noted in the compose comments (database migration and admin user), then start your stack:

```bash
docker-compose up
```

Head to http://localhost:8080, log in as the admin user you created, and you'll find your DAG there, ready to trigger.
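After the run goes green, you can confirm the row actually landed. One quick way from your host, assuming you have `psycopg2-binary` installed locally (the port mapping in docker-compose.yml exposes Postgres on localhost:5432):

```python
import psycopg2

# Connection details match the docker-compose.yml above
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="airflow",
    user="airflow",
    password="airflow",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT customer_id, name, signup_date FROM customers;")
    for row in cur.fetchall():
        print(row)

conn.close()
```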
🔍 Observability with Polar
Once the pipeline runs, Polar starts profiling automatically.
Here’s what you can do in the Polar UI:
- Filter by Service – Focus on `airflow-webserver` or `airflow-scheduler`.
- Analyze CPU & Memory – Spot heavy tasks and resource spikes.
- Identify Bottlenecks – Catch inefficiencies before they cause downtime.
🎯 This is where orchestration meets observability — you’re not just scheduling jobs, you’re understanding their runtime behavior.
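To give the profiler something worth looking at, you can temporarily add a deliberately CPU-heavy task to the DAG and watch it surface in the CPU view. The task below is purely illustrative:

```python
from airflow.decorators import task


@task
def cpu_heavy_transform():
    """Illustrative hot spot: burns CPU so it stands out in a profile."""
    total = 0
    for i in range(20_000_000):
        total += i * i
    return total

# Inside simple_postgres_etl(), wire it into the chain, e.g.:
# create_customers_table() >> cpu_heavy_transform() >> insert_new_customer()
```

With LocalExecutor, task processes are forked from the scheduler, so the extra CPU time should show up under the scheduler service's profile.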
🏁 Wrapping Up
You’ve built a small but powerful foundation for observable data engineering:
✅ Airflow orchestrates
✅ Postgres stores
✅ Polar profiles
✅ Docker glues it all together
This setup takes you from reactive debugging to proactive optimization.
When your data pipelines tell you what’s happening under the hood — you’re no longer guessing; you’re engineering.
💬 If you enjoyed this, consider following for more hands-on data engineering guides like this one. Got questions? Drop them below or ping me on LinkedIn.