
Oliver Samuel


Automating NASA’s Astronomy Picture of the Day with Airflow

Introduction

Have you ever looked up at the night sky and wondered what’s out there?
Since 1995, NASA has published the Astronomy Picture of the Day (APOD) — a daily snapshot of the universe, each with a short scientific explanation.

I wanted to do more than just look at these images. In this project, I built a modern ETL (Extract, Transform, Load) pipeline to automatically collect, clean, and store APOD data. This makes it easy for analysts, data scientists, and enthusiasts to explore decades of astronomy content with SQL or dashboards.

The pipeline is powered by:

  • Apache Airflow → workflow orchestrator that schedules and manages tasks
  • MinIO (S3-compatible object storage) → local data lake for raw and cleaned files
  • PostgreSQL → relational database warehouse for structured analytics

By the end, we’ll have a production-style pipeline that runs daily and can also fetch historical data on demand.


Architecture

NASA API → Extract → MinIO (Raw) → Transform → MinIO (Staged) → Load → PostgreSQL
  • Airflow: Orchestrates the daily pipeline and backfills
  • MinIO: Stores raw API data and cleaned staged data (data lake)
  • PostgreSQL: Stores the final analytics-ready table

Architecture Diagram

Why This Architecture?

Raw vs Cleaned Data (Bronze → Silver Layers)

  • Raw Layer (Bronze): Keeps the original API responses unchanged. Useful if transformation logic changes later or for debugging.
  • Staged Layer (Silver): Stores cleaned, validated fields only (date, title, explanation, url).

Why MinIO as Data Lake?

  • S3-Compatible → Skills transfer to AWS S3, Google Cloud Storage (GCS), Azure Blob
  • Local & Cost-Free → Perfect for development
  • Object Storage Benefits → Immutable, easy to partition by date, scalable
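Because MinIO speaks the S3 API, the standard boto3 client works against it unchanged; you only point endpoint_url at the local server, and dropping that argument targets AWS S3 instead. A minimal sketch (the endpoint and credentials below are local-development placeholders, not this project's actual config):

import boto3

# Point the regular S3 client at a local MinIO server; remove endpoint_url to hit AWS S3.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
print(s3.list_buckets()["Buckets"])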

Why PostgreSQL as Warehouse?

  • SQL Access → Analysts and BI (Business Intelligence) tools can query directly
  • Structured Schema → Enforces quality and constraints
  • Performance → Indexed queries run faster than scanning JSON files

Why Apache Airflow?

  • Orchestration → Manages dependencies between Extract → Transform → Load
  • Scheduling → Automates daily runs
  • Retries & Monitoring → Ensures resilience
  • Backfill → Fetches historical data without rewriting code

Project Structure

The project tree structure

Structure Rationale

  • Modular Scripts: Each ETL step is isolated for testing and reusability
  • Pipelines Directory: Organizes multi-DAG projects (scalable for future APIs)
  • Environment Variables: Secrets (API keys, DB passwords) never committed to Git

ETL Scripts

1. Extract (extract.py)

  • Calls NASA API with an API key
  • Saves raw JSON in MinIO bucket (nasa-apod-dl/raw/)
  • Partitioned by date (raw/2024-01-15.json)

Why? Keeps original data intact and allows reprocessing later.
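A minimal sketch of what this step can look like, assuming the environment variable names and MinIO endpoint shown below; the bucket and key layout follow the description above:

# extract.py (sketch): fetch one APOD record and land the raw JSON in MinIO
import os
from datetime import date

import boto3
import requests

NASA_APOD_URL = "https://api.nasa.gov/planetary/apod"

def extract_apod(target_date=None):
    """Call the NASA API and save the untouched response as raw/<date>.json."""
    target_date = target_date or date.today().isoformat()
    response = requests.get(
        NASA_APOD_URL,
        params={"api_key": os.environ["NASA_API_KEY"], "date": target_date},
        timeout=30,
    )
    response.raise_for_status()  # let Airflow retry if today's APOD isn't published yet

    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://minio:9000"),
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    s3.put_object(
        Bucket="nasa-apod-dl",
        Key=f"raw/{target_date}.json",
        Body=response.text.encode("utf-8"),
    )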

Raw data in MinIO


2. Transform (transform.py)

  • Reads raw JSON from MinIO
  • Keeps only relevant fields: date, title, explanation, url
  • Saves staged JSON to (nasa-apod-dl/staged/)

Why? Removes unnecessary fields and ensures data quality.
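As before, a minimal sketch under the same assumptions (client setup and env vars); the field list is exactly the one described above:

# transform.py (sketch): keep only the fields the warehouse needs
import json
import os

import boto3

KEEP_FIELDS = ("date", "title", "explanation", "url")

def transform_apod(target_date):
    """Read raw/<date>.json from MinIO, drop extra fields, write staged/<date>.json."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://minio:9000"),
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    raw = json.loads(
        s3.get_object(Bucket="nasa-apod-dl", Key=f"raw/{target_date}.json")["Body"].read()
    )
    staged = {field: raw.get(field) for field in KEEP_FIELDS}
    s3.put_object(
        Bucket="nasa-apod-dl",
        Key=f"staged/{target_date}.json",
        Body=json.dumps(staged).encode("utf-8"),
    )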

Staged JSON file in MinIO


3. Load (load.py)

  • Reads staged JSON from MinIO
  • Inserts into PostgreSQL table apod_records

Why? Makes data analytics-ready for SQL queries and dashboards.
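A minimal sketch, assuming a unique constraint on the date column so the upsert keeps re-runs idempotent; the connection DSN env var and the exact table definition are assumptions:

# load.py (sketch): upsert one staged record into PostgreSQL
import json
import os

import boto3
import psycopg2

UPSERT_SQL = """
    INSERT INTO apod_records (date, title, explanation, url)
    VALUES (%(date)s, %(title)s, %(explanation)s, %(url)s)
    ON CONFLICT (date) DO UPDATE
    SET title = EXCLUDED.title,
        explanation = EXCLUDED.explanation,
        url = EXCLUDED.url;
"""

def load_apod(target_date):
    """Read staged/<date>.json from MinIO and upsert it into apod_records."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://minio:9000"),
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    record = json.loads(
        s3.get_object(Bucket="nasa-apod-dl", Key=f"staged/{target_date}.json")["Body"].read()
    )
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(UPSERT_SQL, record)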

DBeaver showing rows in apod_records


Airflow DAGs

Daily DAG (nasa_apod_pipeline.py)

  • Runs once per day (@daily schedule)
  • Workflow: Extract → Transform → Load

Why @daily?

  • NASA publishes one APOD per day at midnight EST
  • Airflow's @daily runs at midnight UTC (adjust schedule_interval if needed)
  • Catches up automatically if Airflow was down (controlled by catchup parameter)
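Tying it together, here is a minimal sketch of the daily DAG. The task wiring mirrors Extract → Transform → Load; the module paths, retry settings, and start date are assumptions:

# nasa_apod_pipeline.py (sketch): one task per ETL step, chained in order
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module paths; they refer to the sketches shown earlier.
from scripts.extract import extract_apod
from scripts.transform import transform_apod
from scripts.load import load_apod

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=30),  # APOD may not be published yet; try again later
}

with DAG(
    dag_id="nasa_apod_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # re-run any missed days automatically if the scheduler was down
    default_args=default_args,
) as dag:
    # "{{ ds }}" is the run's logical date, i.e. the day whose APOD we want.
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_apod,
        op_kwargs={"target_date": "{{ ds }}"},
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_apod,
        op_kwargs={"target_date": "{{ ds }}"},
    )
    load = PythonOperator(
        task_id="load",
        python_callable=load_apod,
        op_kwargs={"target_date": "{{ ds }}"},
    )

    extract >> transform >> load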

DAG graph view in Airflow UI

DAG file successful runs


Backfill DAG (nasa_apod_backfill.py)

  • Triggered manually from Airflow UI
  • Accepts JSON config for date range:
{
  "start_date": "2022-01-01",
  "end_date": "2022-12-31"
}
  • Loops over the date range, running Extract → Transform → Load for each day

Why a Separate Backfill DAG?

  • Manual Control: Large backfills shouldn't run automatically
  • Rate Limiting: NASA API limits requests; backfill can add delays between calls
  • Config Flexibility: Different parameters (batch size, concurrency) than daily runs
  • Resource Management: Prevents overwhelming MinIO/Postgres during bulk loads
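A minimal sketch of the backfill DAG along these lines; the conf keys match the JSON example above, while the module paths and the sleep-based rate limiting are assumptions:

# nasa_apod_backfill.py (sketch): manually triggered, driven by dag_run.conf
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module paths; they refer to the sketches shown earlier.
from scripts.extract import extract_apod
from scripts.transform import transform_apod
from scripts.load import load_apod

def run_backfill(**context):
    """Run Extract → Transform → Load for every date in the configured range."""
    conf = context["dag_run"].conf or {}
    current = datetime.fromisoformat(conf["start_date"]).date()
    end = datetime.fromisoformat(conf["end_date"]).date()
    while current <= end:
        ds = current.isoformat()
        extract_apod(ds)
        transform_apod(ds)
        load_apod(ds)
        time.sleep(2)  # crude rate limiting against the NASA API
        current += timedelta(days=1)

with DAG(
    dag_id="nasa_apod_backfill",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    PythonOperator(task_id="backfill_range", python_callable=run_backfill)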

Backfill DAG run with logs


Analytics-Ready Data

The final table in Postgres (apod_records) enables:

  • SQL Queries: SELECT * FROM apod_records WHERE date BETWEEN '2023-01-01' AND '2023-12-31'
  • BI Tool Integration: Connect Tableau, Power BI, Metabase directly
  • Data Science: Export to Pandas for NLP on explanation text (see the sketch after the sample queries)
  • API Layer: Build Flask/FastAPI to serve data to frontend

Sample Queries:

-- Most frequently repeated titles
SELECT title, COUNT(*) FROM apod_records GROUP BY title ORDER BY COUNT(*) DESC LIMIT 10;

-- Videos vs Images
SELECT media_type, COUNT(*) FROM apod_records GROUP BY media_type;

-- Longest explanations
SELECT date, title, LENGTH(explanation) as length 
FROM apod_records ORDER BY length DESC LIMIT 5;

SQL query results in Postgres
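Picking up the "export to Pandas" idea from the list above, a minimal sketch (the connection string is a placeholder):

# Load the warehouse table into a DataFrame for exploratory/NLP work.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/nasa")  # placeholder DSN
df = pd.read_sql("SELECT date, title, explanation FROM apod_records", engine)

# e.g. a quick look at explanation length before any topic modeling
df["explanation_length"] = df["explanation"].str.len()
print(df.sort_values("explanation_length", ascending=False).head())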


Lessons Learnt

  • What Went Right: The pipeline is modular, so each ETL step can be tested independently; it preserves raw data for reprocessing; it is idempotent, so re-runs don't create duplicate records; and it keeps a clean separation of concerns between Extract, Transform, and Load.

  • Areas for Improvement: Enhance data quality checks (e.g., URL validation, null checks), add monitoring with Airflow SLAs for runtime alerts, and introduce unit/integration tests for stability.

  • Future Work:

    • Short-Term: Full backfill of historical data (11k records), add media_type field to differentiate images/videos, and implement better logging.
    • Medium-Term: Develop an interactive Streamlit dashboard for filtering and visualizing APOD images, add Airflow alerts via email/Slack, and integrate dbt for transformation testing and documentation.
    • Long-Term: Migrate the stack to cloud platforms like AWS (S3 + Redshift) or GCP (GCS + BigQuery), perform NLP-based topic modeling on astronomy explanations, implement change data capture for updates, and fully Dockerize the entire stack.

Challenges and Solutions

  • Handled days when the APOD has not been published yet, using retries and error handling
  • Built a separate backfill Airflow DAG for flexible historical data loading
  • Adopted a bronze (raw) / silver (cleaned) layered storage pattern inspired by the Medallion Architecture

Results & Validation

  • DBeaver SQL Query Screenshot: The SQL I ran in DBeaver this morning (October 2, 2025) to check the previous day's APOD, and its output:
SELECT date, title, explanation, url
FROM apod_records
WHERE date::date = '2025-10-01';

Querying PostgreSQL with DBeaver confirms that the ETL pipeline successfully ingested yesterday’s Astronomy Picture of the Day (APOD).

  • The Actual Picture Screenshot

The pipeline not only ingested metadata but also preserved the image URL, allowing analysts to fetch the actual picture.

  • Airflow Pipeline Screenshot

Airflow DAG view shows successful daily ETL runs, from our earlier test runs through to yesterday’s APOD.

Since deploying, the pipeline has been automatically fetching and processing NASA’s APOD data daily.
The screenshots above show a query from PostgreSQL in DBeaver, retrieving the record for October 1, 2025 — yesterday’s Astronomy Picture of the Day.
The image URL stored in the database links directly to the NASA APOD archive, confirming the pipeline works end-to-end.
In the Airflow UI, we can also see a history of successful DAG runs that processed the APOD records continuously since setup.

Conclusion

The project is a production-grade ETL pipeline orchestrated by Airflow, scalable through a data lake-to-warehouse pattern, maintainable with modular, version-controlled scripts, and reproducible via a simple .env setup. It is not just a toy; it enables analysts to query decades of NASA's astronomy pictures with SQL. The key takeaway is that effective data architecture emphasizes raw data preservation, modularity, accessibility (SQL warehouse), and reliable automation through Airflow orchestration.

Check out the source code on GitHub and feel free to contribute: GitHub Public Repo
