Introduction
The NASA Astronomy Picture of the Day (APOD) is a public API serving the APOD archive, which has featured one astronomy-related image (or video) per day since 1995.
Each record contains:
- `date` → the publication date
- `title` → title of the image
- `explanation` → scientific description
- `url` → image or video link
This project demonstrates how to build a modern ETL pipeline for APOD data using Apache Airflow, MinIO (S3-compatible storage), and PostgreSQL.
It's lightweight, visual, and expandable — perfect for analysts.
Architecture
NASA API → Extract → MinIO (Raw) → Transform → MinIO (Staged) → Load → PostgreSQL
- Airflow: Orchestrates daily pipeline + historical backfill
- MinIO: Acts as a local S3 data lake (raw + staged data)
- PostgreSQL: Final warehouse for analytics
Why This Architecture?
1. Separation of Raw and Staged Data (Medallion Architecture Light)
- Raw Layer (Bronze): Preserves original API responses unchanged
  - Allows reprocessing if transformation logic changes
  - Serves as audit trail and disaster recovery point
  - No data loss if bugs exist in transformation code
- Staged Layer (Silver): Contains cleaned, validated data
  - Only necessary fields kept (`date`, `title`, `explanation`, `url`)
  - Ready for loading without re-transformation
  - Acts as quality gate before warehouse
2. Why MinIO as Data Lake?
- S3-Compatible: Skills transfer directly to AWS S3, Azure Blob, GCS
- Cost-Effective: Free local storage vs. cloud costs during development
- Decouples Storage from Compute: Can scale independently
- Object Storage Benefits:
  - Immutable data (write-once, read-many)
  - Easy versioning and partitioning by date
  - No need for database connections during extract/transform
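Because MinIO speaks the S3 protocol, the same boto3 code works locally and in the cloud. A minimal sketch, assuming the default local endpoint and the credentials from the `.env` file shown later:

```python
import boto3

# Point boto3 at the local MinIO endpoint instead of AWS
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO API port
    aws_access_key_id="minioadmin",          # matches the .env defaults below
    aws_secret_access_key="minioadmin",
)
s3.upload_file("apod.json", "nasa-apod-dl", "raw/2024-01-15.json")
```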
3. Why PostgreSQL as Warehouse?
- Analytics-Ready: SQL interface for BI tools, analysts, and dashboards
- Structured Schema: Enforces data quality with constraints
- Query Performance: Indexed queries faster than scanning JSON files
- ACID Compliance: Guarantees data integrity for downstream consumers
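Here is one possible schema for that table. It is a sketch, not the project's exact DDL, but the column names match the fields kept in the staged layer, and the constraints show how the warehouse can enforce quality:

```python
import psycopg2

# Illustrative schema: one row per day, basic constraints for data quality
DDL = """
CREATE TABLE IF NOT EXISTS apod_records (
    date        DATE PRIMARY KEY,
    title       TEXT NOT NULL,
    explanation TEXT,
    url         TEXT NOT NULL
);
"""

# Connection string matches the .env defaults used later in this post
with psycopg2.connect("postgresql://airflow:airflow@localhost:5432/airflow") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```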
4. Why Apache Airflow?
- Orchestration: Manages task dependencies (Extract → Transform → Load)
- Scheduling: Handles daily incremental loads automatically
- Retry Logic: Auto-retries failed tasks with exponential backoff
- Monitoring: Visual DAG interface shows pipeline health
- Backfill Capability: Can replay historical data without code changes
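As a small illustration of the retry point, this is roughly what the retry settings look like at the task level (values here are illustrative, not the project's actual configuration):

```python
from datetime import timedelta

# Illustrative task-level retry settings (passed as default_args to the DAG)
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,   # delay doubles on each retry
    "max_retry_delay": timedelta(minutes=60),
}
```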
Project Structure
airflow-nasa-apod/
│── dags/
│ ├── nasa_apod_pipeline.py # Daily scheduled DAG
│ ├── nasa_apod_backfill.py # One-off backfill DAG
│ └── pipelines/
│ └── nasa_apod/
│ ├── extract.py # Extract from NASA API → MinIO
│ ├── transform.py # Transform JSON → staged data
│ ├── load.py # Load staged → Postgres
│ └── README.md
│── .env
│── .gitignore
│── README.md
Structure Rationale
- Modular Scripts: Each ETL step is isolated for testing and reusability
- Pipelines Directory: Organizes multi-DAG projects (scalable for future APIs)
- Environment Variables: Secrets (API keys, DB passwords) never committed to Git
ETL Scripts
1. Extract (extract.py)
What it does:
- Calls NASA API with your API key
- Saves raw JSON into the MinIO bucket (`nasa-apod-dl/raw/`)
Why this approach:
- API Key Security: Uses environment variables, not hardcoded values
- Raw Data Preservation: Keeps entire API response (including fields we don't use yet)
- Idempotency: Can re-run safely; overwrites same file if date already exists
- Partitioning: Files named by date (`raw/2024-01-15.json`) for easy lookups
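A minimal sketch of what extract.py boils down to (function and variable names, including `MINIO_ENDPOINT`, are illustrative, not the project's actual code):

```python
import json
import os

import boto3
import requests

def extract_apod(date: str) -> None:
    # Call the NASA API with the key from the environment
    resp = requests.get(
        "https://api.nasa.gov/planetary/apod",
        params={"api_key": os.environ["NASA_API_KEY"], "date": date},
        timeout=30,
    )
    resp.raise_for_status()

    # Write the full, untouched response to the raw layer, partitioned by date
    s3 = boto3.client(
        "s3",
        endpoint_url=os.getenv("MINIO_ENDPOINT", "http://localhost:9000"),
        aws_access_key_id=os.environ["MINIO_ROOT_USER"],
        aws_secret_access_key=os.environ["MINIO_ROOT_PASSWORD"],
    )
    s3.put_object(
        Bucket="nasa-apod-dl",
        Key=f"raw/{date}.json",
        Body=json.dumps(resp.json()).encode("utf-8"),
        ContentType="application/json",
    )  # re-running the same date just overwrites the object (idempotent)
```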
2. Transform (transform.py)
What it does:
- Reads raw JSON from MinIO
- Keeps only: `date`, `title`, `explanation`, `url`
- Writes to the MinIO staged path (`nasa-apod-dl/staged/`)
Why this approach:
- Field Selection: Removes unnecessary metadata (`service_version`, `copyright`, etc.)
- Data Validation: Can add checks (e.g., URL format, non-null title)
- Schema Evolution: If we need more fields later, raw data still exists
- Separation of Concerns: Transform logic is independent of extract/load
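A matching sketch for the transform step (again, names are illustrative, with a tiny validation check included):

```python
import json
import os

import boto3

KEEP = ("date", "title", "explanation", "url")

def transform_apod(date: str) -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url=os.getenv("MINIO_ENDPOINT", "http://localhost:9000"),
        aws_access_key_id=os.environ["MINIO_ROOT_USER"],
        aws_secret_access_key=os.environ["MINIO_ROOT_PASSWORD"],
    )
    # Read the raw record, keep only the fields the warehouse needs
    raw = json.loads(
        s3.get_object(Bucket="nasa-apod-dl", Key=f"raw/{date}.json")["Body"].read()
    )
    staged = {field: raw.get(field) for field in KEEP}

    # Light validation before promoting to the staged layer
    if not staged["title"] or not staged["url"]:
        raise ValueError(f"Incomplete APOD record for {date}")

    s3.put_object(
        Bucket="nasa-apod-dl",
        Key=f"staged/{date}.json",
        Body=json.dumps(staged).encode("utf-8"),
        ContentType="application/json",
    )
```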
3. Load (load.py)
What it does:
- Reads staged JSON from MinIO
- Inserts into PostgreSQL (table: `apod_records`)
Why this approach:
- Bulk Inserts: Can batch multiple records for performance
- Upsert Logic: Use `ON CONFLICT DO UPDATE` to handle re-runs (idempotency)
- Error Handling: Failed inserts don't corrupt MinIO data (can retry)
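A sketch of the upsert idea (the `POSTGRES_DSN` variable and function name are illustrative; the real script reads its settings from `.env`):

```python
import json
import os

import boto3
import psycopg2

UPSERT = """
INSERT INTO apod_records (date, title, explanation, url)
VALUES (%(date)s, %(title)s, %(explanation)s, %(url)s)
ON CONFLICT (date) DO UPDATE
SET title = EXCLUDED.title,
    explanation = EXCLUDED.explanation,
    url = EXCLUDED.url;
"""

def load_apod(date: str) -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url=os.getenv("MINIO_ENDPOINT", "http://localhost:9000"),
        aws_access_key_id=os.environ["MINIO_ROOT_USER"],
        aws_secret_access_key=os.environ["MINIO_ROOT_PASSWORD"],
    )
    record = json.loads(
        s3.get_object(Bucket="nasa-apod-dl", Key=f"staged/{date}.json")["Body"].read()
    )
    # Upsert keyed on date, so re-runs update instead of duplicating
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(UPSERT, record)
```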
Airflow DAGs
Daily DAG (nasa_apod_pipeline.py)
- Runs once per day (`@daily` schedule)
- Workflow: Extract → Transform → Load
Why `@daily`?
- NASA publishes one APOD per day at midnight EST
- Airflow's `@daily` runs at midnight UTC (adjust `schedule_interval` if needed)
- Catches up automatically if Airflow was down (controlled by the `catchup` parameter)
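Putting it together, the daily DAG could look roughly like this (task IDs and the imported function names are illustrative, not the project's exact code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative imports of the project's ETL callables
from pipelines.nasa_apod.extract import extract_apod
from pipelines.nasa_apod.transform import transform_apod
from pipelines.nasa_apod.load import load_apod

with DAG(
    dag_id="nasa_apod_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # replay any days missed while Airflow was down
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_apod,
        op_kwargs={"date": "{{ ds }}"},  # logical date as YYYY-MM-DD
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_apod,
        op_kwargs={"date": "{{ ds }}"},
    )
    load = PythonOperator(
        task_id="load",
        python_callable=load_apod,
        op_kwargs={"date": "{{ ds }}"},
    )

    extract >> transform >> load
```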
Backfill DAG (nasa_apod_backfill.py)
- Triggered manually from Airflow UI
- Accepts JSON config for the date range:
  {
    "start_date": "2022-01-01",
    "end_date": "2022-12-31"
  }
- Loops over the range, loads all data
Why a Separate Backfill DAG?
- Manual Control: Large backfills shouldn't run automatically
- Rate Limiting: NASA API limits requests; backfill can add delays between calls
- Config Flexibility: Different parameters (batch size, concurrency) than daily runs
- Resource Management: Prevents overwhelming MinIO/Postgres during bulk loads
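A sketch of how the backfill DAG can read the date range from the trigger's JSON config and throttle itself (names and the one-second delay are illustrative):

```python
import time
from datetime import date, datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_backfill(**context):
    # Date range comes from the JSON config passed when triggering the DAG
    conf = context["dag_run"].conf or {}
    start = date.fromisoformat(conf["start_date"])
    end = date.fromisoformat(conf["end_date"])

    current = start
    while current <= end:
        day = current.isoformat()
        # Each day's ETL goes here: extract_apod(day) → transform_apod(day) → load_apod(day)
        print(f"backfilling {day}")
        time.sleep(1)  # crude rate limiting against the NASA API
        current += timedelta(days=1)

with DAG(
    dag_id="nasa_apod_backfill",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    PythonOperator(task_id="backfill_range", python_callable=run_backfill)
```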
Analytics-Ready Data
The final table in Postgres (`apod_records`) enables:
- SQL Queries: SELECT * FROM apod_records WHERE date BETWEEN '2023-01-01' AND '2023-12-31'
- BI Tool Integration: Connect Tableau, Power BI, Metabase directly
- Data Science: Export to Pandas for NLP on `explanation` text
- API Layer: Build Flask/FastAPI to serve data to frontend
Sample Queries:
-- Most frequently repeated titles
SELECT title, COUNT(*) FROM apod_records GROUP BY title ORDER BY COUNT(*) DESC LIMIT 10;
-- Videos vs. images (requires the media_type column planned in Future Work)
SELECT media_type, COUNT(*) FROM apod_records GROUP BY media_type;
-- Longest explanations
SELECT date, title, LENGTH(explanation) AS length
FROM apod_records ORDER BY length DESC LIMIT 5;
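For the data science angle, pulling the explanation text into Pandas is a few lines (connection string is illustrative and matches the `.env` defaults from the setup section):

```python
import pandas as pd
from sqlalchemy import create_engine

# Read the warehouse table into a DataFrame for NLP or exploratory analysis
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost:5432/airflow")
df = pd.read_sql("SELECT date, title, explanation FROM apod_records", engine)
print(df["explanation"].str.split().str.len().describe())  # word-count stats
```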
Lessons Learned & Design Decisions
What Went Right
- Modular Code: Each ETL step can be tested independently
- Data Lake Pattern: Raw data preservation enables reprocessing
- Idempotency: Re-running pipelines doesn't duplicate data
- Separation of Concerns: Extract doesn't know about Postgres; Load doesn't know about NASA API
What Could Be Improved
- Data Quality Checks: Validate URLs, check for null values before loading
- Monitoring: Add Airflow SLAs to alert if DAG takes too long
- Testing: Unit tests for transform logic, integration tests for full pipeline
Future Work
Short-Term (1-2 weeks)
- Full backfill (1995-06-16 → today, ~11k records)
- Add a `media_type` column to distinguish images from videos
- Implement proper logging (Python `logging` module)
Medium-Term (1 month)
- Streamlit dashboard to visualize APOD images interactively
- Filter by date range, search by keyword in title/explanation
- Display image thumbnails in grid view
- Airflow alerts (email/Slack on failure)
- Add dbt for transformation testing and documentation
Long-Term (3+ months)
- Migrate to AWS (S3 + Redshift) or GCP (GCS + BigQuery)
- NLP Analysis: Topic modeling on 30 years of astronomy explanations
- CDC (Change Data Capture) if NASA updates old records
- Dockerize the entire stack (currently everything is containerized except MinIO, which runs on the host)
Conclusion
This project demonstrates a production-grade ETL pipeline:
- Orchestrated: Airflow manages dependencies, scheduling, retries
- Scalable: Data lake → Warehouse pattern supports millions of records
- Maintainable: Modular scripts, version-controlled, documented
- Reproducible: Anyone can clone and run it with a `.env` file
Not just a toy project — analysts can query 30 years of astronomy pictures with SQL today. Skills learned here transfer directly to enterprise data engineering roles (Airflow, S3, data warehouses).
Key Takeaway: Good data architecture isn't about complexity; it's about:
- Preserving raw data (reprocessing flexibility)
- Separating concerns (modularity)
- Making data accessible (SQL warehouse)
- Automating reliably (Airflow orchestration)
Setup Instructions
The project runs on a fully containerized stack using Docker:
- Airflow Scheduler / Webserver / Worker (Docker containers)
- Postgres (for Airflow metadata + APOD records)
- Redis (Celery message broker)
- MinIO (S3-compatible data lake, runs locally or can be dockerized)
Prerequisites
- Docker & Docker Compose installed
- Python 3.9+ (3.12 used here) for local testing scripts
- NASA API Key (free at https://api.nasa.gov)
Quick Start
Clone repository
git clone <your-repo-url>
cd airflow-nasa-apod
Create .env file
cat > .env << EOF
NASA_API_KEY=your_key_here
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
EOF
Start services
docker-compose up -d
Access UIs
- Airflow: http://localhost:8080 (admin/admin)
- MinIO: http://localhost:9001 (minioadmin/minioadmin)
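One assumption worth calling out: the `nasa-apod-dl` bucket has to exist before the first DAG run. You can create it in the MinIO console, or with a couple of lines of boto3:

```python
import boto3

# Create the data lake bucket against the local MinIO endpoint
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.create_bucket(Bucket="nasa-apod-dl")
```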
Manual Testing
# Test extract
python dags/pipelines/nasa_apod/extract.py
# Test transform
python dags/pipelines/nasa_apod/transform.py
# Test load
python dags/pipelines/nasa_apod/load.py