Devil Scrapes

Posted on Jun 2

Open Library Scraper: the Goodreads API alternative you've been missing

#webscraping #python #apify #books

Quick answer: Open Library is the Internet Archive's catalogue of 30M+ works — title, authors, ISBNs, subjects, publish year, edition count, cover images, and e-book access flags, all free and publicly searchable. It has a REST API, but bulk pagination is painful and the response shape shifts across editions. The Open Library Books Scraper turns any title / author / subject / ISBN query into clean, typed JSON rows at $0.0015 per result (~$1.50 per 1,000), with pagination, retries, and rate-limit pacing handled for you.

In December 2020, Goodreads quietly shut down their public API. No migration guide, no replacement. Developers with book-recommendation apps, reading-list dashboards, or enrichment pipelines got a 403 and a politely-worded blog post.

Five years later, people are still building the same apps and still need programmatic access to book metadata — title, authors, ISBNs, subjects, edition counts, cover images — in bulk. The answer has been sitting at the Internet Archive's Open Library the whole time. It just takes some engineering to use it at scale.

What is Open Library? 📖

Open Library is the Internet Archive's universal book catalogue — the nonprofit's attempt to build one web page for every book ever published. As of 2026 it tracks over 30 million distinct works, curated by a global volunteer community.

Per work it exposes: title, subtitle, author names, first publication year, edition count, ISBNs across all editions, subject tags, publisher names, a cover image, a ratings average and count, and an e-book access flag telling you whether the full text is readable or borrowable in the Internet Archive's reader.

Crucially, it is genuinely open. The Open Library API is free, requires no API key, and the underlying data is CC0 / public domain — the only credible, legal, large-scale bibliographic data source remaining after Goodreads closed and Google Books restricted bulk export.

Does Open Library have an API? 🔍

Yes — but bulk use is harder than it looks. Open Library exposes a public search endpoint at https://openlibrary.org/search.json that accepts free-text queries and returns paginated results. For a single lookup it works fine. At scale, several things conspire against you.

The response payload mixes absent keys, inconsistently shaped arrays, and fields that appear only on some editions. Pagination uses an offset/limit scheme that caps at 100 results per page, so if you miss a page your dataset has silent gaps. Author normalization is inconsistent: the same person may appear as "Isaac Asimov", "Asimov, Isaac", or as an author_key reference. Cover images are not returned as URLs; you get a numeric ID and have to build the CDN URL yourself. And the API rate-limits bulk callers without much ceremony.

None of that is insurmountable — it's just work that belongs in an Actor, not in your application code.
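To make the pagination problem concrete, here is a minimal sketch of an offset/limit walk that refuses to return a gappy dataset. It is an illustration, not the Actor's code: fetch_page is a hypothetical callable you would supply (it returns the parsed JSON of one search.json page), and the only response fields assumed are numFound and docs.

```python
from typing import Callable

PAGE_SIZE = 100  # per-page cap described above


def fetch_all(fetch_page: Callable[[int, int], dict], max_results: int) -> list[dict]:
    """Walk offset/limit pages and fail loudly if rows go missing.

    fetch_page(offset, limit) is a hypothetical helper returning one parsed
    search.json response: a dict with "numFound" and "docs".
    """
    rows: list[dict] = []
    expected = max_results
    offset = 0
    while offset < expected:
        limit = min(PAGE_SIZE, expected - offset)
        page = fetch_page(offset, limit)
        # Never expect more rows than the upstream says exist.
        expected = min(max_results, page["numFound"])
        docs = page["docs"][:limit]
        if not docs:
            break
        rows.extend(docs)
        # Advance by the requested window, so a short page shows up
        # as a count mismatch below instead of being papered over.
        offset += limit
    if len(rows) != expected:
        raise RuntimeError(f"expected {expected} rows, got {len(rows)}")
    return rows
```

The design choice worth noting is the final comparison: a short or dropped page becomes an exception rather than a quietly smaller list.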

What the data looks like 📦

One row per work. Here is a complete real output row, with every field from ResultRow:

{
  "openlibrary_key": "/works/OL471576W",
  "title": "Foundation",
  "subtitle": null,
  "authors": ["Isaac Asimov"],
  "first_publish_year": 1951,
  "edition_count": 142,
  "languages": ["eng", "fre", "spa", "deu", "ita"],
  "subjects": [
    "Science fiction",
    "Galactic Empire (Imaginary place)",
    "Psychohistory (Fictitious science)",
    "Fiction"
  ],
  "isbns": ["9780553803716", "0553293354", "9780553293357"],
  "publishers": ["Bantam Books", "Doubleday", "Gnome Press"],
  "cover_id": 8430428,
  "cover_url_l": "https://covers.openlibrary.org/b/id/8430428-L.jpg",
  "ratings_average": 4.14,
  "ratings_count": 2103,
  "ebook_access": "no_ebook",
  "work_url": "https://openlibrary.org/works/OL471576W",
  "scraped_at": "2026-05-31T10:22:00+00:00"
}

Seventeen fields, Pydantic-validated before they hit the dataset. The ebook_access field uses Open Library's own vocabulary: public (free full text), borrowable (IA borrow), no_ebook, or printdisabled.
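If you want the same guarantees on the consuming side, a typed row is easy to sketch. The version below uses a stdlib dataclass as a stand-in for the Pydantic model, so it checks only the ebook_access vocabulary, not every type. The class name ResultRowSketch and the choice of which fields are Optional are assumptions made for illustration; only the field names come from the sample row above.

```python
from dataclasses import dataclass, fields
from typing import Optional

# The four values listed above for ebook_access.
EBOOK_ACCESS_VALUES = {"public", "borrowable", "no_ebook", "printdisabled"}


@dataclass(frozen=True)
class ResultRowSketch:
    """Stand-in for the Actor's row model; optionality is a guess."""

    openlibrary_key: str
    title: str
    subtitle: Optional[str]
    authors: list[str]
    first_publish_year: Optional[int]
    edition_count: int
    languages: list[str]
    subjects: list[str]
    isbns: list[str]
    publishers: list[str]
    cover_id: Optional[int]
    cover_url_l: Optional[str]
    ratings_average: Optional[float]
    ratings_count: Optional[int]
    ebook_access: str
    work_url: str
    scraped_at: str

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabulary.
        if self.ebook_access not in EBOOK_ACCESS_VALUES:
            raise ValueError(f"unexpected ebook_access: {self.ebook_access!r}")


FIELD_NAMES = tuple(f.name for f in fields(ResultRowSketch))
```

Constructing it with ResultRowSketch(**item) on a dataset item raises a TypeError if a field is missing or unexpected, which is a cheap way to notice schema drift.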

The naive approach (and why it falls apart)

The first instinct is usually three lines of requests.get:

import requests
resp = requests.get("https://openlibrary.org/search.json", params={"q": "foundation asimov", "limit": 100})
books = resp.json()["docs"]

That works for thirty books. For three thousand, here is where it unravels.

1. Pagination walks a cliff. Open Library caps each page at 100 results, so 3,000 books means 30 sequential offset requests. Miss one and you get a silent gap — fewer rows than you asked for, no error. We track the expected count from numFound, compare against rows emitted, and raise a loud set_status_message on mismatch.

2. Rate limiting is real. Open Library is a nonprofit on donated infrastructure, and bulk callers without pacing get throttled. We pace requests, honour Retry-After headers, and retry with exponential backoff on 408, 429, and 5xx, up to five attempts per page.

3. Author normalization is a mess. The endpoint returns authors as either plain strings or author_key references depending on the edition. We normalize to a flat list of name strings, so your code never branches on isinstance(author, dict).

4. Cover URLs require construction. Covers aren't a direct field — you get a numeric cover_id and must build https://covers.openlibrary.org/b/id/{cover_id}-L.jpg. We do that at parse time and hand you cover_url_l ready to use.
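The cover URL step from point 4 is small enough to show in full. This is a sketch of the construction, not the Actor's parser; the function name is made up, and the S and M sizes come from Open Library's Covers API rather than from this post.

```python
from typing import Optional

COVER_BASE = "https://covers.openlibrary.org/b/id"
COVER_SIZES = {"S", "M", "L"}


def cover_url(cover_id: Optional[int], size: str = "L") -> Optional[str]:
    """Build a cover image URL from a numeric cover_id, or None if absent."""
    if cover_id is None:
        return None
    if size not in COVER_SIZES:
        raise ValueError(f"size must be one of {sorted(COVER_SIZES)}")
    return f"{COVER_BASE}/{cover_id}-{size}.jpg"
```

For the sample row above, cover_url(8430428) gives the same string as its cover_url_l field.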

We rotate TLS fingerprints via curl-cffi so Open Library sees real-browser traffic, not Python's default SSL stack. We rotate Apify residential proxies on blocks — fresh session_id, fresh exit IP. We retry with backoff on 408 / 429 / 5xx, and we return Pydantic-validated typed rows. No data, no charge.
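For readers rolling their own client, the retry policy can be sketched independently of any HTTP library. Everything here is illustrative: send is a hypothetical zero-argument callable that performs one request and returns (status, headers, body), the one-second base delay and 60-second cap are assumptions, and only the delta-seconds form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Tuple

MAX_ATTEMPTS = 5  # "up to five attempts per page"
Response = Tuple[int, Mapping[str, str], bytes]


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx are worth another attempt."""
    return status in (408, 429) or 500 <= status < 600


def backoff_delay(attempt: int, headers: Mapping[str, str],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        # The server told us how long to wait; honour it, up to the cap.
        return min(float(retry_after), cap)
    # Otherwise double the wait each time: 1s, 2s, 4s, 8s, ...
    return min(base * 2 ** (attempt - 1), cap)


def get_with_retry(send: Callable[[], Response],
                   sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = send()
        if not is_retryable(status):
            return status, headers, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, headers))
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts (last status {status})")
```

Passing sleep as a parameter keeps the policy testable without real waiting. A production version would likely add jitter and case-insensitive header lookup.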

The Actor ⚙️

The Actor is live on the Apify Store: apify.com/DevilScrapes/openlibrary-books-scraper. Run it from the Apify Console, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/openlibrary-books-scraper").call(
    run_input={
        "searchQuery": "foundation asimov",
        "searchField": "all",
        "maxResults": 500,
        "language": "eng",
        "proxyConfiguration": {"useApifyProxy": True}
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["first_publish_year"], item["isbns"])

searchField accepts all, title, author, subject, or isbn — narrow a query to "every work filed under subject 'psychohistory'" or "every edition with this ISBN-13". maxResults accepts 1–1000; we paginate internally in 100-result pages. language takes a 3-letter ISO-639-2 code (eng, spa, fre).
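If you build the input from user-supplied values, it can help to check it against those constraints before starting a paid run. The helper below is a sketch: build_run_input is a hypothetical name, treating language as optional is an assumption, and the Actor may well validate input differently on its side.

```python
from typing import Optional

SEARCH_FIELDS = {"all", "title", "author", "subject", "isbn"}
MAX_RESULTS_CEILING = 1000


def build_run_input(search_query: str, search_field: str = "all",
                    max_results: int = 100,
                    language: Optional[str] = None) -> dict:
    """Validate arguments locally and return a run_input dict."""
    if not search_query.strip():
        raise ValueError("search_query must not be empty")
    if search_field not in SEARCH_FIELDS:
        raise ValueError(f"search_field must be one of {sorted(SEARCH_FIELDS)}")
    if not 1 <= max_results <= MAX_RESULTS_CEILING:
        raise ValueError(f"max_results must be between 1 and {MAX_RESULTS_CEILING}")
    run_input = {
        "searchQuery": search_query,
        "searchField": search_field,
        "maxResults": max_results,
        "proxyConfiguration": {"useApifyProxy": True},
    }
    if language is not None:
        # Three-letter code such as eng, spa, fre.
        if len(language) != 3 or not language.isalpha():
            raise ValueError("language must be a 3-letter code")
        run_input["language"] = language.lower()
    return run_input
```

The returned dict can be passed as run_input to the call shown above.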

What you'd actually use this for 💡

Book-recommendation app metadata layer. Every "Goodreads alternative" indie app needs bibliographic data to bootstrap its catalogue. Open Library is the only free, legal source that ships ISBNs + subjects + cover images + edition counts in one call. An initial seed of 50k books costs $75 — the same dataset from a commercial bibliographic vendor runs thousands per year.

RAG corpus for fiction-AI. Pull all works tagged with a subject (e.g. "climate fiction"), get their ISBNs, and use ebook_access: public to filter to works whose full text is legally readable through the Internet Archive reader.

E-book access audit. Filter ebook_access to public or borrowable to build a curated list of freely-readable works in a subject area — useful for AI-tutor platforms, digital-library teams, and course-material curators.
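Both the RAG and audit cases come down to one filter over the dataset rows. A minimal sketch, assuming rows are plain dicts with the ebook_access field shown earlier (the function name is illustrative):

```python
from typing import Iterable

FREELY_READABLE = frozenset({"public", "borrowable"})


def filter_by_access(rows: Iterable[dict],
                     allowed: frozenset = FREELY_READABLE) -> list[dict]:
    """Keep rows whose ebook_access value is in the allowed set."""
    return [row for row in rows if row.get("ebook_access") in allowed]
```

For a full-text corpus you would narrow it to filter_by_access(rows, frozenset({"public"})).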

Bibliography automation. Academic teams maintaining reading lists or literature reviews need canonical ISBNs and author names. Feed a list of titles or ISBNs through the Actor and get a deduplicated, structured bibliography in one run.

Pricing — exact numbers 💰

Pay-per-event. You pay only when results land in your dataset.

Event	USD per event
Actor start (one-off per run)	$0.005
Result written to dataset	$0.0015

Pull size	Runs (1,000 results max each)	Cost
100 books	1	$0.16
1,000 books	1	$1.51
10,000 books	10	$15.05
50,000 books	50	$75.25

Apify's $5 free trial credit covers roughly your first 3,300 books, no credit card required. For comparison, commercial bibliographic APIs (ISBNdb, Google Books per-call) charge roughly $0.01–0.05 per lookup, which makes $0.0015 per result about 7 to 33 times cheaper.
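The arithmetic is simple enough to check yourself. This sketch assumes the start fee is charged once per run and that a pull larger than the 1,000-result ceiling is split into the minimum number of runs; the function name is illustrative.

```python
import math
from decimal import Decimal
from typing import Optional

START_FEE = Decimal("0.005")    # per run
RESULT_FEE = Decimal("0.0015")  # per result written
MAX_RESULTS_PER_RUN = 1000


def estimate_cost(n_results: int, runs: Optional[int] = None) -> Decimal:
    """Estimated USD cost for n_results, before any rounding."""
    if n_results < 0:
        raise ValueError("n_results must be non-negative")
    if runs is None:
        # Fewest runs that fit under the per-run ceiling (at least one).
        runs = max(1, math.ceil(n_results / MAX_RESULTS_PER_RUN))
    return START_FEE * runs + RESULT_FEE * n_results
```

For example, estimate_cost(1000) is 1.505 and estimate_cost(50_000) is 75.25, of which only 0.25 is start fees. Passing runs=1 gives the figure with a single start fee.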

The technically interesting bit

Open Library's search endpoint is documented as a public API, but the author normalization contract is nowhere formally specified. We found empirically that works before roughly 1990 have a higher rate of author_key-only entries (no inline name string), because those records were migrated from earlier MARC dumps that stored authors only by identifier. Our parser handles both cases in a single pass — so your authors array is always a list of plain strings, never a mix of strings and dicts. That distinction matters the moment you JOIN authors across a 10k-row dataset.
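A single-pass normalizer along those lines might look like the sketch below. It is not the Actor's parser: the dict shape with name and author_key keys, the names_by_key lookup table, and the decision to drop entries that cannot be resolved are all assumptions made for illustration.

```python
from typing import Iterable, Mapping, Optional, Union

AuthorEntry = Union[str, Mapping[str, str]]


def normalize_authors(entries: Iterable[AuthorEntry],
                      names_by_key: Optional[Mapping[str, str]] = None) -> list[str]:
    """Flatten mixed author entries into a de-duplicated list of name strings.

    names_by_key is a hypothetical lookup from author key to display name,
    used for entries that carry only an identifier.
    """
    lookup = names_by_key or {}
    names: list[str] = []
    for entry in entries:
        if isinstance(entry, str):
            name = entry
        else:
            # Prefer an inline name; fall back to resolving the key.
            name = entry.get("name") or lookup.get(entry.get("author_key", ""), "")
        name = " ".join(name.split())  # collapse stray whitespace
        if name and name not in names:
            names.append(name)
    return names
```

Inverted forms such as "Asimov, Isaac" are deliberately left alone here; flipping on a comma mangles names like "King, Jr.", so that step needs more care than a sketch can give it.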

Limitations 🚧

No long descriptions. The search endpoint does not return the work's blurb. For those, you'd follow up with a separate /works/{key}.json call — out of scope for v1.
Subject tags are community-contributed, not curated taxonomies. Good for filtering and discovery, but not a substitute for Library of Congress Subject Headings or Dewey Decimal.
Some older ISBNs are absent. Works published before ISBN adoption (pre-1970, roughly) often have no ISBN record. We return what Open Library has.
Ratings are thinner than Goodreads. Treat ratings_average as a coarse signal for popular works, not a reliable metric for niche titles.
maxResults ceiling is 1,000 per run. For the full catalogue you'd run multiple queries by subject or date range; the search has no "give me everything" cursor.
Open Library infrastructure events. The Actor surfaces upstream-unavailable errors loudly via set_status_message rather than returning a silent empty dataset.
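One practical consequence of the 1,000-result ceiling: when you split a large pull into several queries, the same work can come back more than once, for instance when it is filed under two of your subjects. A minimal sketch of merging runs, assuming openlibrary_key identifies a work uniquely (the function name is illustrative):

```python
from typing import Iterable


def merge_runs(*runs: Iterable[dict]) -> list[dict]:
    """Combine rows from several runs, keeping the first row seen per work."""
    by_key: dict[str, dict] = {}
    for rows in runs:
        for row in rows:
            by_key.setdefault(row["openlibrary_key"], row)
    # Dicts preserve insertion order, so first-seen order is kept.
    return list(by_key.values())
```

You would call it with the item lists from each run's dataset, for example merge_runs(scifi_rows, dystopia_rows).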

FAQ ❓

Is scraping Open Library legal?
Open Library is operated by the Internet Archive as a public, open catalogue, and its bibliographic metadata is released under CC0 (public domain). The Actor reads only the public search endpoint, at a paced rate that respects the upstream, and collects no personal data. As always, check your specific use case against Open Library's terms and the laws of your jurisdiction.

Is there an official Open Library bulk-export API?
Open Library publishes monthly data dumps of the full catalogue as compressed TSV files — useful for snapshots but inconvenient for filtered, on-demand queries by title/author/subject. The Actor sits in the middle: query-driven, real-time, and structured.

Can I export to CSV or feed a warehouse?
Yes. The Apify Console exports any dataset as CSV, Excel, or JSON. You can also webhook the ACTOR.RUN.SUCCEEDED event into Make, Zapier, or n8n, or pull data via the Apify API into Snowflake, BigQuery, or any REST-source pipeline.

How is this different from the Google Books API?
Different data, different access model. Google Books has restrictive bulk-export terms and requires an API key with rate limits. Open Library is CC0 metadata, no key required, and openly encourages reuse — the right source for cover images and ISBNs at scale without a licensing conversation.

Try it

The Actor is live: apify.com/DevilScrapes/openlibrary-books-scraper.

Apify gives every new account $5 of free credit, no credit card required. Run it on "subject:climate fiction" and you'll have a structured dataset of up to 1,000 catalogued climate-fiction works in under a minute. Need a field that isn't here? Drop it in the comments: the devil's in the data, and we read every report.

Further reading:

Open Library Developers docs — official API reference
Open Library monthly data dumps — full-catalogue bulk export (TSV, requires parsing)
Apify Python client docs — full SDK reference for programmatic use

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community