Devil Scrapes

Posted on Jun 2

NASA APOD Scraper: bulk-export the Astronomy Picture of the Day as JSON

#webscraping #python #apify #datascience

Quick answer: NASA's Astronomy Picture of the Day API (api.nasa.gov/planetary/apod) is real and free, but it returns one record per request by default, enforces rate limits, and varies its response shape across the 30-year archive, so reliable bulk exports need batching, throttling, backoff, and normalization. The NASA APOD Scraper wraps that API in a managed Apify Actor with single-date, date-range, and random-pick modes, and returns every APOD as a clean, typed JSON row for $0.0015 per result (~$1.50 per 1,000).

NASA has been posting one space image (or video) per day since June 16, 1995 — roughly 11,000 entries, each with a title, expert explanation, image URL, optional HD version, and copyright credit. Developers use it for daily-photo apps, multimodal RAG corpora, educational platforms, and generative-art pipelines.

Grabbing a single day takes a few lines of Python. Grabbing a decade is where it breaks: the API enforces per-key rate limits, response shape silently differs between image and video days, and hdurl is sometimes absent. Do it naively and you end up with a corpus full of gaps and mistyped nulls, and a script that failed somewhere in 2007.

What is NASA APOD? 🔎

The Astronomy Picture of the Day is a public education program operated by NASA and Michigan Technological University since 1995. Each day, a professional astronomer selects an image or video of the universe (a galaxy collision, a comet flyby, a nebula from the James Webb Space Telescope) and writes a paragraph-length explanation aimed at a general audience.

It is one of the most visited pages on the NASA domain and, as of 2026, has run without a single missed day for over 30 years. The API (api.nasa.gov/planetary/apod) is part of NASA's public API program, covered by NASA's open-data policy, and has been stable since 2015.

Each entry includes title, explanation, image/video URL, HD URL, media type, date, and copyright credit. What it does not give you out of the box: reliable bulk exports, stable null handling across the 30-year archive, or a permalink back to the APOD page.

Does NASA APOD have a bulk-export API? 🛰️

Partially. The API accepts a start_date/end_date range, but DEMO_KEY is limited to 30 requests/hour and 50/day, and the API returns 429 once you exceed either limit. You still have to batch multi-year pulls, handle malformed early-archive records, and normalize null vs absent fields across three decades of hand-entered data. The managed Actor handles all of that: you supply a date range, you get a flat Pydantic-validated dataset back.
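
Batching a multi-year pull mostly comes down to splitting the range into smaller windows and requesting each one separately. Here is a minimal sketch of that split. The helper name chunk_date_range and the 100-day window are assumptions for illustration, not the Actor's actual batch size.

```python
from datetime import date, timedelta
from typing import Iterator


def chunk_date_range(start: date, end: date, max_days: int = 100) -> Iterator[tuple[date, date]]:
    """Yield inclusive (chunk_start, chunk_end) pairs that cover start..end."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if start > end:
        raise ValueError("start must not be after end")
    cursor = start
    while cursor <= end:
        # Each window holds at most max_days days, and the last one stops at `end`.
        chunk_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, chunk_end
        cursor = chunk_end + timedelta(days=1)


# A 10-year pull becomes a list of windows, each usable as start_date/end_date.
chunks = list(chunk_date_range(date(2016, 1, 1), date(2025, 12, 31)))
```

Each pair maps to one ranged request, so the number of windows is also the number of requests you have to fit under the rate limit.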

What the data looks like

Every row is one APOD entry, with all 11 fields present and typed:

{
  "date": "2026-05-15",
  "title": "Spiral Galaxy NGC 1232",
  "explanation": "One of the largest galaxies visible from Earth, NGC 1232 ...",
  "url": "https://apod.nasa.gov/apod/image/2605/NGC1232.jpg",
  "hdurl": "https://apod.nasa.gov/apod/image/2605/NGC1232_large.jpg",
  "thumbnail_url": null,
  "media_type": "image",
  "copyright": "ESO; J. Spyromilio",
  "service_version": "v1",
  "apod_url": "https://apod.nasa.gov/apod/ap260515.html",
  "scraped_at": "2026-05-15T19:17:59+00:00"
}

Eleven fields, same shape every row. thumbnail_url populates on video days when thumbsForVideos is enabled. hdurl is null when NASA published no high-resolution version. apod_url is the canonical permalink for citation and deduplication.
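
To make "same shape every row" concrete, here is a rough sketch of normalizing a raw API record so that absent and blank optional keys both become null. It uses plain dicts instead of Pydantic to stay dependency-free, and it covers only the nine fields that come from the API (apod_url and scraped_at are derived separately). The name normalize_record and the required/optional split are assumptions, not the Actor's actual schema.

```python
from typing import Any, Optional

# Assumed split: the article does not list which fields the Actor treats as required.
REQUIRED_FIELDS = ("date", "title", "explanation", "url", "media_type")
OPTIONAL_FIELDS = ("hdurl", "thumbnail_url", "copyright", "service_version")


def normalize_record(raw: dict[str, Any]) -> dict[str, Optional[str]]:
    """Return a row with every known key present; absent or blank optionals become None."""
    missing = [key for key in REQUIRED_FIELDS if not raw.get(key)]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    row: dict[str, Optional[str]] = {key: str(raw[key]).strip() for key in REQUIRED_FIELDS}
    for key in OPTIONAL_FIELDS:
        value = raw.get(key)
        cleaned = value.strip() if isinstance(value, str) else ""
        row[key] = cleaned or None
    return row
```

Failing loudly on a missing required field, rather than writing a half-empty row, is a design choice of this sketch; a bulk exporter could also log the record and move on.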

The naive approach (and why it falls apart) 🔥

The first implementation most people write:

import datetime

import httpx

# Fetch today's APOD with the shared demo key: no retries, no error handling.
today = datetime.date.today().isoformat()
r = httpx.get(f"https://api.nasa.gov/planetary/apod?date={today}&api_key=DEMO_KEY")
print(r.json())

That works for one day. Scale it to a year of daily data and you'll hit three real problems:

1. Rate limits and batch sizing. DEMO_KEY allows 30 requests/hour and 50/day. A personal key lifts that to 1,000/hour, but a 10-year pull (~3,650 records) still requires batching with throttling. Without a retry loop you get a 429 mid-run and a corrupted partial dataset. We pace requests and retry on 429 and 503 with exponential backoff (base 2 s, doubling up to a 30 s cap, up to 5 attempts per batch), and we retry network errors as well. Partial success is surfaced clearly; a run never comes back silently empty.

2. Response shape variation across the archive. The early archive (1995–2000) was hand-entered into several backend systems. hdurl is frequently absent before 2001. copyright is null on NASA-owned imagery, present on photographer-contributed entries. A naive parser expecting all keys present throws KeyError mid-bulk-pull. We normalize every row through Pydantic — required fields guaranteed, optional fields typed as T | None, schema stable across 30 years.

3. Video days. Roughly 5–10% of APOD entries are videos. On those days url points to YouTube or Vimeo, hdurl is null, and thumbnail_url only appears when you pass thumbs=true to the API. We expose thumbsForVideos as a configurable flag and surface media_type on every row so downstream code branches cleanly.

None of these failure modes appears on the first request. All of them surface mid-way through a historical export. We do the dirty work so your dataset stays clean.
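
The retry policy from point 1 can be sketched roughly as follows. This is one reading of "base 2 s, doubling up to a 30 s cap, up to 5 attempts", not the Actor's code. The fetch callable is hypothetical: it stands in for whatever performs one HTTP request and returns a (status_code, body) pair, and sleep is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Any, Callable

RETRYABLE_STATUSES = {429, 503}


def backoff_delays(base: float = 2.0, cap: float = 30.0, attempts: int = 5) -> list[float]:
    """Delays slept between attempts; one fewer than the number of attempts."""
    return [min(base * 2 ** i, cap) for i in range(attempts - 1)]


def fetch_with_backoff(
    fetch: Callable[[], tuple[int, Any]],
    sleep: Callable[[float], None] = time.sleep,
    base: float = 2.0,
    cap: float = 30.0,
    attempts: int = 5,
) -> tuple[int, Any]:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    delays = backoff_delays(base, cap, attempts)
    for attempt in range(attempts):
        status, body = fetch()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < len(delays):
            sleep(delays[attempt])  # no sleep after the final attempt
    raise RuntimeError(f"still rate-limited or unavailable after {attempts} attempts")
```

With these defaults the waits are 2, 4, 8, and 16 seconds, so the 30 s cap only comes into play if you allow more than five attempts.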

The Actor ⚙️

The Actor is on the Apify Store: apify.com/DevilScrapes/nasa-apod-scraper.

Apify Console (no code)

Pick a mode from the dropdown — Today, Single date, Date range, or Random N — fill in the dates if needed, click Start, and the results stream into the run's dataset. Export as JSON, CSV, or Excel directly from the Console.

Python SDK

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/nasa-apod-scraper").call(
    run_input={
        "mode": "range",
        "startDate": "2026-01-01",
        "endDate": "2026-01-31",
        "thumbsForVideos": True,
        "proxyConfiguration": {"useApifyProxy": False},
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["date"], item["title"], item["media_type"])

Run that and you get 31 APOD entries for January 2026, all fields present, video thumbnails included, for under $0.06 ($0.005 start + 31 × $0.0015).

Key inputs: mode (today / single_date / range / random), startDate + endDate (ISO dates), count (random pick size, 1–100), apiKey (optional NASA key — defaults to shared DEMO_KEY), thumbsForVideos (boolean, default true). Full input reference is in the Actor's Store listing.
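
If you build run_input programmatically, a small validator catches bad values before you pay for a run. The sketch below uses only the input names listed above; the helper build_run_input is hypothetical, and single_date is left out because the text here does not name the field that carries its date.

```python
from datetime import date
from typing import Any, Optional


def build_run_input(
    mode: str,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
    count: Optional[int] = None,
    thumbs_for_videos: bool = True,
) -> dict[str, Any]:
    """Build a run_input dict for the today, range, and random modes."""
    if mode == "today":
        run_input: dict[str, Any] = {"mode": "today"}
    elif mode == "range":
        if start_date is None or end_date is None:
            raise ValueError("range mode needs both start_date and end_date")
        if start_date > end_date:
            raise ValueError("start_date must not be after end_date")
        run_input = {
            "mode": "range",
            "startDate": start_date.isoformat(),
            "endDate": end_date.isoformat(),
        }
    elif mode == "random":
        if count is None or not 1 <= count <= 100:
            raise ValueError("random mode needs a count between 1 and 100")
        run_input = {"mode": "random", "count": count}
    else:
        raise ValueError(f"mode not covered by this sketch: {mode!r}")
    run_input["thumbsForVideos"] = thumbs_for_videos
    return run_input
```

The returned dict can be passed as run_input to the client call shown in the Python SDK example.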

Use cases 💡

Multimodal RAG corpus. The explanation field is dense, expert-authored prose; hdurl is a direct image file. Embed both into a vector store and you get a semantic-search astronomy corpus for "show me a galaxy that looks like a wheel." Filter to copyright=null for the unambiguously NASA-public-domain subset.

Daily-photo widget or app. Run mode=today on a daily schedule. Store title + url + explanation, display it on iOS, Android, Telegram, Discord, or a browser extension. One Actor run, one result, $0.0065 total.

Educational pipelines. Backfill the archive for a classroom CMS or AI tutor. The explanation text is written by professional astronomers for a general audience — quality-controlled domain prose that is hard to replicate at scale.

Newsletter automation. Pull yesterday's APOD, inject title + explanation + image URL into your template, push to your newsletter API. The copyright field surfaces attribution requirements automatically.

Generative art and print-on-demand. Filter to media_type=image and copyright=null, download hdurl versions — NASA public-domain imagery is free to use commercially. The apod_url gives you the canonical citation page.
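
The media_type=image and copyright=null filter mentioned in these use cases is a few lines once the rows are in memory. A minimal sketch, assuming each row is a dict shaped like the sample above; the function name public_domain_images is made up for illustration.

```python
from typing import Any, Iterable, Iterator


def public_domain_images(rows: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield only image rows that carry no copyright credit."""
    for row in rows:
        if row.get("media_type") == "image" and row.get("copyright") is None:
            yield row
```

It works on any iterable of rows, including the items streamed by iterate_items() in the SDK example.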

Pricing — exact numbers 💰

Pay-per-event. You pay for results delivered to your dataset, nothing for runs that return nothing.

Event	USD
Actor start (one-off per run)	$0.005
Result written to dataset	$0.0015

Pull	Cost
Today only (1 result)	$0.0065
One month (31 results)	$0.052
One year (365 results)	$0.55
Five years (~1,826 results)	$2.74
Full archive (~11,000 results)	$16.51

Apify's $5 free trial credit — no credit card required — covers the first ~3,300 results. A full 30-year archive pull costs just over $16.
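
The table is just one start fee plus a per-result fee, so you can estimate any pull yourself. A small sketch using the two prices above; the function name estimate_cost is illustrative, and Decimal is used to keep the arithmetic exact. The table rounds these values to the nearest cent or tenth of a cent.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
PER_RESULT_USD = Decimal("0.0015")  # charged per row written to the dataset


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated USD cost of writing `results` rows across `runs` Actor runs."""
    if results < 1 or runs < 1:
        raise ValueError("results and runs must both be at least 1")
    return ACTOR_START_USD * runs + PER_RESULT_USD * results
```

For example, estimate_cost(31) gives 0.0515, which the table shows as $0.052.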

The technically interesting part 🔎

The NASA APOD API returns the image and metadata but no link back to the human-readable APOD page, so we reconstruct it. Every APOD page lives at apod.nasa.gov/apod/apYYMMDD.html, so we parse the ISO date, reformat it as a YYMMDD slug with a two-digit year, and emit a guaranteed apod_url on every row. That derived field is what makes the dataset safe to deduplicate and cite: the permalink is stable, the image URLs are not. Pair it with the thumbsForVideos flag (which passes NASA's thumbs=true query param) and video days return a populated thumbnail_url, where NASA exposes one, instead of a bare YouTube embed.
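
The permalink reconstruction is short enough to show in full. This is a sketch of the idea rather than the Actor's source; the function name apod_permalink is an assumption.

```python
from datetime import date


def apod_permalink(iso_date: str) -> str:
    """Build the APOD page URL from an ISO date such as '2026-05-15'."""
    day = date.fromisoformat(iso_date)  # raises ValueError on malformed input
    # %y%m%d gives the zero-padded two-digit year, month, and day.
    return f"https://apod.nasa.gov/apod/ap{day:%y%m%d}.html"
```

Parsing through date.fromisoformat rather than slicing the string means an invalid date fails early instead of producing a plausible-looking dead link.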

Limitations 🚧

~11,000 entries total. APOD has been daily since 1995-06-16. The corpus is fixed-size and grows by one per day. It is well-suited for RAG eval and demos; it is not a million-image training set.
Images are not proxied. The url and hdurl fields point to apod.nasa.gov and external hosts (e.g., hubblesite.org, jwst.nasa.gov). Downloading the images themselves requires a separate step — this Actor returns URLs, not binary files.
thumbnail_url availability varies. NASA only provides thumbnails for some video entries; some older video APODs (YouTube embeds from 2009) have no thumbnail exposed via the API regardless of the flag.
Rate limits apply. The shared DEMO_KEY allows 30 requests/hour and 50/day. For bulk pulls, supply your own key (free at api.nasa.gov).
Copyright varies. Some APOD entries are contributed by external photographers who hold the copyright. The copyright field surfaces this; filtering to copyright=null gives you the subset that is unambiguously NASA-owned public domain.

FAQ ❓

Is scraping NASA APOD legal?
The APOD API is a public, official NASA endpoint designed for programmatic access. NASA's open-data policy explicitly encourages downstream use of its public APIs and datasets. This Actor calls only the documented public endpoint, paces its requests, and honors NASA's rate limits. Standard disclaimer: verify your jurisdiction and downstream use case.

Does NASA APOD have an official API?
Yes — api.nasa.gov/planetary/apod is NASA's own endpoint. It works for single-date and short-range queries out of the box. This Actor wraps it with batching, throttling, retry, and field normalization to make it practical for bulk and historical exports.

Can I export to Google Sheets or a data warehouse?
Yes. Export CSV, Excel, or JSON directly from the Apify Console. Webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n. Or pull via the Apify API into BigQuery, Snowflake, or any destination that accepts JSON.

Why do some rows have a null hdurl?
NASA does not always publish a high-resolution version. For video days and for some older image entries, no HD file exists on NASA's servers — the field is null by design, not a scraper error. Use url as the fallback.
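
In downstream code that fallback is a one-liner, and video rows can fall back to their thumbnail in the same place. A minimal sketch, assuming rows shaped like the sample above; display_image_url is a hypothetical helper name.

```python
from typing import Any, Optional


def display_image_url(row: dict[str, Any]) -> Optional[str]:
    """Pick the best still image for a row, or None if there is none."""
    if row.get("media_type") == "image":
        # Prefer the high-resolution file, fall back to the standard one.
        return row.get("hdurl") or row.get("url")
    # Non-image rows only have a still image when a thumbnail was provided.
    return row.get("thumbnail_url")
```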

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/nasa-apod-scraper.

Free $5 trial credit, no credit card. Run it in mode=today and you'll have this afternoon's Astronomy Picture of the Day — title, explanation, image URL, and all — in your dataset in seconds. If you're building something with APOD data (a widget, a RAG corpus, an educational tool), drop a comment below. I read every one and ship based on what people actually need.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
