Introduction
Have you ever looked up at the night sky and wondered what’s out there?
Since 1995, NASA has published the Astronomy Picture of the Day (APOD) — a daily snapshot of the universe, each with a short scientific explanation.
I wanted to do more than just look at these images. In this project, I built a modern ETL (Extract, Transform, Load) pipeline to automatically collect, clean, and store APOD data. This makes it easy for analysts, data scientists, and enthusiasts to explore decades of astronomy content with SQL or dashboards.
The pipeline is powered by:
- Apache Airflow → workflow orchestrator that schedules and manages tasks
- MinIO (S3-compatible object storage) → local data lake for raw and cleaned files
- PostgreSQL → relational database warehouse for structured analytics
By the end, we’ll have a production-style pipeline that runs daily and can also fetch historical data on demand.
Architecture
NASA API → Extract → MinIO (Raw) → Transform → MinIO (Staged) → Load → PostgreSQL
- Airflow: Orchestrates the daily pipeline and backfills
- MinIO: Stores raw API data and cleaned staged data (data lake)
- PostgreSQL: Stores the final analytics-ready table
Why This Architecture?
Raw vs Cleaned Data (Bronze → Silver Layers)
- Raw Layer (Bronze): Keeps the original API responses unchanged. Useful if transformation logic changes later or for debugging.
- Staged Layer (Silver): Stores cleaned, validated fields only (date, title, explanation, url).
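In MinIO this shows up as a simple date-partitioned layout, roughly like the following (object keys taken from the ETL steps described later):

```
nasa-apod-dl/
├── raw/
│   └── 2024-01-15.json      # untouched API response (bronze)
└── staged/
    └── 2024-01-15.json      # cleaned fields only (silver)
```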
Why MinIO as Data Lake?
- S3-Compatible → Skills transfer to AWS S3, Google Cloud Storage (GCS), Azure Blob
- Local & Cost-Free → Perfect for development
- Object Storage Benefits → Immutable, easy to partition by date, scalable
Why PostgreSQL as Warehouse?
- SQL Access → Analysts and BI (Business Intelligence) tools can query directly
- Structured Schema → Enforces quality and constraints
- Performance → Indexed queries run faster than scanning JSON files
Why Apache Airflow?
- Orchestration → Manages dependencies between Extract → Transform → Load
- Scheduling → Automates daily runs
- Retries & Monitoring → Ensures resilience
- Backfill → Fetches historical data without rewriting code
Project Structure
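The layout below is an illustrative sketch of how the repo might be organized, inferred from the rationale that follows; exact file and folder names may differ:

```
nasa-apod-pipeline/
├── pipelines/
│   └── dags/
│       ├── nasa_apod_pipeline.py    # daily ETL DAG
│       └── nasa_apod_backfill.py    # manual backfill DAG
├── scripts/
│   ├── extract.py
│   ├── transform.py
│   └── load.py
├── .env                             # NASA API key, MinIO and Postgres credentials (not committed)
└── requirements.txt
```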
Structure Rationale
- Modular Scripts: Each ETL step is isolated for testing and reusability
- Pipelines Directory: Organizes multi-DAG projects (scalable for future APIs)
- Environment Variables: Secrets (API keys, DB passwords) never committed to Git
ETL Scripts
1. Extract (extract.py)
- Calls NASA API with an API key
- Saves raw JSON in MinIO bucket (nasa-apod-dl/raw/)
- Partitioned by date (raw/2024-01-15.json)
Why? Keeps original data intact and allows reprocessing later.
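A minimal sketch of what extract.py could look like, assuming boto3 is used against MinIO's S3 API and that credentials come from environment variables (the env var names and the function signature here are illustrative, not the repo's exact code):

```python
# extract.py -- illustrative sketch only
import os
from datetime import date

import boto3
import requests

APOD_URL = "https://api.nasa.gov/planetary/apod"
BUCKET = "nasa-apod-dl"


def extract(run_date=None):
    """Fetch one APOD record and store the raw JSON in MinIO (bronze layer)."""
    run_date = run_date or date.today().isoformat()

    # Call the NASA API for a single day.
    resp = requests.get(
        APOD_URL,
        params={"api_key": os.environ["NASA_API_KEY"], "date": run_date},
        timeout=30,
    )
    resp.raise_for_status()

    # Write the untouched response to the raw prefix, partitioned by date.
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["MINIO_ENDPOINT"],  # e.g. http://localhost:9000
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    key = f"raw/{run_date}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.text.encode("utf-8"))
    return key
```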
2. Transform (transform.py)
- Reads raw JSON from MinIO
- Keeps only relevant fields: date, title, explanation, url
- Saves staged JSON to MinIO (nasa-apod-dl/staged/)
Why? Removes unnecessary fields and ensures data quality.
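A corresponding sketch for transform.py, under the same assumptions about the MinIO client and environment variables:

```python
# transform.py -- illustrative sketch only
import json
import os

import boto3

BUCKET = "nasa-apod-dl"
KEEP_FIELDS = ("date", "title", "explanation", "url")


def transform(run_date):
    """Read the raw JSON, keep only the relevant fields, write it to the staged layer."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["MINIO_ENDPOINT"],
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )

    # Pull the bronze object and keep only the validated fields.
    raw = json.loads(
        s3.get_object(Bucket=BUCKET, Key=f"raw/{run_date}.json")["Body"].read()
    )
    staged = {field: raw.get(field) for field in KEEP_FIELDS}

    # Write the cleaned record to the silver (staged) prefix.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"staged/{run_date}.json",
        Body=json.dumps(staged).encode("utf-8"),
    )
    return staged
```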
3. Load (load.py)
- Reads staged JSON from MinIO
- Inserts into the PostgreSQL table apod_records
Why? Makes data analytics-ready for SQL queries and dashboards.
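And a sketch of load.py with psycopg2. The table and column names follow the post; the ON CONFLICT upsert assumes a unique constraint on date (my assumption), which is one way to keep re-runs idempotent:

```python
# load.py -- illustrative sketch only
import json
import os

import boto3
import psycopg2


def load(run_date):
    """Read the staged JSON from MinIO and upsert it into apod_records."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["MINIO_ENDPOINT"],
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    record = json.loads(
        s3.get_object(Bucket="nasa-apod-dl", Key=f"staged/{run_date}.json")["Body"].read()
    )

    # POSTGRES_DSN is an assumed env var, e.g. "dbname=apod user=... password=... host=localhost"
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn:
        with conn.cursor() as cur:
            # ON CONFLICT keeps the load idempotent: re-running a day updates instead of duplicating.
            cur.execute(
                """
                INSERT INTO apod_records (date, title, explanation, url)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (date) DO UPDATE
                SET title = EXCLUDED.title,
                    explanation = EXCLUDED.explanation,
                    url = EXCLUDED.url;
                """,
                (record["date"], record["title"], record["explanation"], record["url"]),
            )
```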
Airflow DAGs
Daily DAG (nasa_apod_pipeline.py)
- Runs once per day (@daily schedule)
- Workflow: Extract → Transform → Load
Why @daily?
- NASA publishes one APOD per day at midnight EST
- Airflow's @daily runs at midnight UTC (adjust schedule_interval if needed)
- Catches up automatically if Airflow was down (controlled by the catchup parameter)
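A minimal sketch of the daily DAG wiring, assuming the three functions above are importable from a scripts package (module paths, start_date, and the op_kwargs templating are illustrative):

```python
# nasa_apod_pipeline.py -- illustrative sketch only
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract
from scripts.load import load
from scripts.transform import transform

with DAG(
    dag_id="nasa_apod_pipeline",
    schedule_interval="@daily",        # NASA publishes one APOD per day
    start_date=datetime(2025, 1, 1),
    catchup=False,                     # set True to let Airflow catch up on missed days
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
        op_kwargs={"run_date": "{{ ds }}"},  # execution date as YYYY-MM-DD
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
        op_kwargs={"run_date": "{{ ds }}"},
    )
    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
        op_kwargs={"run_date": "{{ ds }}"},
    )

    extract_task >> transform_task >> load_task
```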
Backfill DAG (nasa_apod_backfill.py)
- Triggered manually from the Airflow UI
- Accepts a JSON config for the date range:
```json
{
  "start_date": "2022-01-01",
  "end_date": "2022-12-31"
}
```
- Loops over the range and loads all of the data (see the sketch below)
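A sketch of how the backfill DAG could read that config from dag_run.conf and loop over the range, with a small delay between days to stay under NASA's API rate limits (the delay value and module paths are illustrative):

```python
# nasa_apod_backfill.py -- illustrative sketch only
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract
from scripts.load import load
from scripts.transform import transform


def run_backfill(**context):
    """Run Extract -> Transform -> Load for every day in the requested range."""
    conf = context["dag_run"].conf or {}
    start = datetime.fromisoformat(conf["start_date"]).date()
    end = datetime.fromisoformat(conf["end_date"]).date()

    day = start
    while day <= end:
        run_date = day.isoformat()
        extract(run_date)
        transform(run_date)
        load(run_date)
        time.sleep(2)              # crude rate limiting between API calls
        day += timedelta(days=1)


with DAG(
    dag_id="nasa_apod_backfill",
    schedule_interval=None,        # manual trigger only, from the Airflow UI
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="backfill_range", python_callable=run_backfill)
```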
Why a Separate Backfill DAG?
- Manual Control: Large backfills shouldn't run automatically
- Rate Limiting: NASA API limits requests; backfill can add delays between calls
- Config Flexibility: Different parameters (batch size, concurrency) than daily runs
- Resource Management: Prevents overwhelming MinIO/Postgres during bulk loads
Analytics-Ready Data
The final table in Postgres (apod_records) enables:
- SQL Queries: SELECT * FROM apod_records WHERE date BETWEEN '2023-01-01' AND '2023-12-31'
- BI Tool Integration: Connect Tableau, Power BI, Metabase directly
- Data Science: Export to Pandas for NLP on the explanation text (see the sketch after the sample queries)
- API Layer: Build a Flask/FastAPI service to serve data to a frontend
Sample Queries:
```sql
-- Most frequently repeated titles
SELECT title, COUNT(*) FROM apod_records GROUP BY title ORDER BY COUNT(*) DESC LIMIT 10;

-- Videos vs images (requires the media_type field planned in Future Work)
SELECT media_type, COUNT(*) FROM apod_records GROUP BY media_type;

-- Longest explanations
SELECT date, title, LENGTH(explanation) AS length
FROM apod_records ORDER BY length DESC LIMIT 5;
```
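For the Data Science path mentioned above, here is a quick sketch of pulling the explanation text into Pandas for a first pass at word frequencies (the connection string and regex are illustrative):

```python
# Illustrative sketch: load explanations into Pandas and count common words.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/apod")
df = pd.read_sql("SELECT date, title, explanation FROM apod_records", engine)

# Crude word-frequency count over the explanation text -- a starting point for NLP.
words = (
    df["explanation"]
    .str.lower()
    .str.findall(r"[a-z]{4,}")   # words of 4+ letters, skipping short stop-words
    .explode()
    .value_counts()
    .head(20)
)
print(words)
```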
Lessons Learnt
What Went Right: The pipeline is modular, allowing independent testing of ETL steps, preserves raw data for reprocessing, supports idempotency to prevent duplicate runs, and maintains separation of concerns between components like Extract and Load.
Areas for Improvement: Enhance data quality checks (e.g., URL validation, null checks), add monitoring with Airflow SLAs for runtime alerts, and introduce unit/integration tests for stability.
Future Work:
- Short-Term: Full backfill of historical data (11k records), add media_type field to differentiate images/videos, and implement better logging.
- Medium-Term: Develop an interactive Streamlit dashboard for filtering and visualizing APOD images, add Airflow alerts via email/Slack, and integrate dbt for transformation testing and documentation.
- Long-Term: Migrate the stack to cloud platforms like AWS (S3 + Redshift) or GCP (GCS + BigQuery), perform NLP-based topic modeling on astronomy explanations, implement change data capture for updates, and fully Dockerize the entire stack.
Challenges and Solutions
Handled cases where the daily APOD might not be published yet via retries and error handling, built a backfill Airflow DAG for flexible historical data loading, and used a bronze (raw) and silver (cleaned) layered storage pattern inspired by Medallion Architecture.
Results & Validation
- DBeaver SQL Query Screenshot: The SQL I ran in DBeaver this morning (October 2, 2025) to check the previous day's APOD, along with its output:
```sql
SELECT date, title, explanation, url
FROM apod_records
WHERE date::date = '2025-10-01';
```
Querying PostgreSQL with DBeaver confirms that the ETL pipeline successfully ingested yesterday’s Astronomy Picture of the Day (APOD).
- The Actual Picture Screenshot
The pipeline not only ingested metadata but also preserved the image URL, allowing analysts to fetch the actual picture.
- Airflow Pipeline Screenshot
Airflow DAG view shows successful daily ETL runs, from our earlier test runs through to yesterday’s APOD.
Since deploying, the pipeline has been automatically fetching and processing NASA’s APOD data daily.
The screenshots above show a query from PostgreSQL in DBeaver, retrieving the record for October 1, 2025 — yesterday’s Astronomy Picture of the Day.
The image URL stored in the database links directly to the NASA APOD archive, confirming the pipeline works end-to-end.
In the Airflow UI, we can also see a history of successful DAG runs that processed the APOD records continuously since setup.
Conclusion
The project is a production-grade ETL pipeline orchestrated by Airflow, scalable through a data lake-to-warehouse pattern, maintainable with modular, version-controlled scripts, and reproducible via a simple .env setup. It is not just a toy; it enables analysts to query decades of NASA's astronomy pictures with SQL. The key takeaway is that effective data architecture emphasizes raw data preservation, modularity, accessibility (SQL warehouse), and reliable automation through Airflow orchestration.
Check the source code on GitHub and feel free to contribute. GitHub Public Repo