Devil Scrapes

Posted on Jun 4

Wikiquote Scraper: pull properly-attributed quotes to JSON for $1.00/1K

#webscraping #python #apify #data

Quick answer: Wikiquote is Wikipedia's sister project, a community-edited, CC BY-SA-licensed library of attributed quotes with source citations. It has no quote-level API and no structured bulk export: the low-level MediaWiki endpoint, like the raw Wikimedia database dumps, returns wiki markup you still have to parse. A Wikiquote scraper fetches any Wikiquote page (by author, by film, by topic) and returns one typed JSON row per quote, with attribution, source work, year, and section. The Apify Actor below does it for $0.001 per quote (~$1.00 per 1,000 results), with fingerprint rotation, proxy handling, and rate-limit pacing built in.

If you've ever tried to source a quote ("Did Einstein actually say that?"), you know the problem. The internet is full of quote sites that confidently repeat misattributions. Wikiquote exists to fix this: every quote is either sourced to a primary publication or explicitly labelled "unsourced", "disputed", or "misattributed". It's the only public quotes corpus that treats citation quality as a first-class property. The catch: extracting it programmatically means parsing wiki markup, navigating per-language subdomains, and wrestling with inconsistent templates. Here's how I wrapped that into a single Actor call.

What is Wikiquote? 📖

Wikiquote is a Wikimedia Foundation project — the same organization behind Wikipedia, Wiktionary, and Commons. Launched in 2003, it now hosts pages in more than 60 languages, each containing curated quotes attributed to a person, work, or theme.

What makes it genuinely different from "famous quotes" sites:

Community-enforced citation standards. Quotes without a traceable source go into an explicitly labelled "Unsourced" or "Attributed" section, not silently into the main list.
Per-language editions. en.wikiquote.org, de.wikiquote.org, fr.wikiquote.org are independent wikis — same MediaWiki software, distinct article sets.
CC BY-SA 4.0 license. Free to reuse with attribution; suitable for commercial products as long as you credit the source.
Structured metadata. Pages group quotes under section headings, and individual quotes are often annotated with a specific publication and year.

What it does not give you: a download button, a REST API for quotes, or a structured bulk export (the Wikimedia database dumps contain raw wikitext, not quote rows). The main programmatic surface is the MediaWiki action=parse endpoint, which returns rendered HTML or raw WikiText. Either way, you still have to parse it.

Does Wikiquote have an API? 🤔

Not for quotes. The MediaWiki API lets you fetch page content — as rendered HTML or raw WikiText — but it hands you a markup blob, not a structured list of quotes. You then have to walk the HTML, identify quote <li> nodes, separate the quote text from the nested citation <ul>, strip navigation links masquerading as list items, and extract year strings from free-form citation prose. Each language edition renders these structures slightly differently.

The Wikimedia Foundation also provides a REST API for article summaries and metadata, but it returns introductions, not quote content. There is no /quotes endpoint anywhere in the official surface.

So if you want Wikiquote quotes as structured data, you parse the page. That is the job.
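
A minimal sketch of that first step, assuming the default JSON output of action=parse, where the rendered HTML sits under parse, then text, then the "*" key. The function names are illustrative and not taken from the Actor; the network call itself is left out so the snippet stays self-contained.

```python
from urllib.parse import urlencode


def build_parse_url(title: str, language: str = "en") -> str:
    """Build an action=parse URL for one Wikiquote page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
    }
    return f"https://{language}.wikiquote.org/w/api.php?{urlencode(params)}"


def extract_html(payload: dict) -> str:
    """Pull the rendered HTML blob out of a decoded action=parse response."""
    return payload["parse"]["text"]["*"]


url = build_parse_url("Albert Einstein")
```

What comes back from extract_html is still one undifferentiated HTML string; nothing in it marks where a quote starts or ends.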

What the data looks like

Every quote comes back as one flat, typed row. Here is a real one from Albert Einstein's page:

{
  "quote": "Everything should be made as simple as possible, but not simpler.",
  "attribution": "Albert Einstein",
  "source": "Reader's Digest, October 1977",
  "year": "1977",
  "page_url": "https://en.wikiquote.org/wiki/Albert_Einstein",
  "language": "en",
  "section": "Attributed",
  "scraped_at": "2026-05-31T10:14:22+00:00"
}

Eight fields, Pydantic-validated before they hit the dataset. source and year are nullable — when Wikiquote doesn't supply a citation, you get null, not an invented one. section carries the heading the quote appeared under, so you can filter by "Quotes", "Attributed", "Disputed", or "Misattributed" downstream.
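
To make the row shape concrete, here is a rough standard-library stand-in for the schema. The Actor itself uses a Pydantic model; the class name QuoteRow and the two validation rules below are assumptions for illustration, not the Actor's actual checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuoteRow:
    """Hypothetical stand-in for the Actor's Pydantic row model."""

    quote: str
    attribution: str
    source: Optional[str]  # None when Wikiquote supplies no citation
    year: Optional[str]    # four-digit string, or None
    page_url: str
    language: str
    section: str
    scraped_at: str

    def __post_init__(self) -> None:
        # Assumed rules: a row needs quote text, and year is 4 digits or None.
        if not self.quote.strip():
            raise ValueError("quote must not be empty")
        if self.year is not None and not (len(self.year) == 4 and self.year.isdigit()):
            raise ValueError(f"year must be four digits or None, got {self.year!r}")


row = QuoteRow(
    quote="Everything should be made as simple as possible, but not simpler.",
    attribution="Albert Einstein",
    source="Reader's Digest, October 1977",
    year="1977",
    page_url="https://en.wikiquote.org/wiki/Albert_Einstein",
    language="en",
    section="Attributed",
    scraped_at="2026-05-31T10:14:22+00:00",
)
```

Constructing a row with a malformed year raises immediately, which mirrors the fail-at-write-time behaviour rather than letting a bad value reach the dataset.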

The naive approach (and why it falls apart)

The first attempt looks like this: call https://en.wikiquote.org/w/api.php?action=parse&page=Albert_Einstein&prop=text&format=json, take the HTML from parse.text["*"], grab every <li>, done.

Three things break it immediately:

1. Navigation links live in the same <ul> as quotes. Pages are full of short <li> nodes that say "Retrieved from Wikipedia" or "See also: Bertrand Russell" — they parse as list items and contaminate your quote list. The real signal is top-level <li> nodes under div.mw-parser-output that contain a sentence-length string and optionally a nested <ul> with source annotations. Rejecting the nav links takes a heuristic battery: minimum text length, prefix matching against known navlink patterns, structural checks for nested citation blocks.

2. Citation extraction is freeform prose. The source isn't a structured field — it's a nested <li> that might say "Letter to Max Born, December 4, 1926" or just "1960". Extracting the year requires a regex over the citation text, not a field lookup.

3. Per-language page structure varies. The English page for Oscar Wilde organizes quotes by work title; the German page groups by decade; the French page uses a flat unsectioned list. An English-only parser breaks silently on non-English pages, returning empty rows where there are real quotes.
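
The first two failure modes can be sketched in a few lines. This is a simplified illustration, not the Actor's rule set: the prefix list, the 20-character minimum, and the 1000 to 2099 year range are all assumed values.

```python
import re
from typing import Optional

# Assumed values for illustration; the real pattern list is not published here.
NAV_PREFIXES = ("retrieved from", "see also", "external links")
MIN_QUOTE_CHARS = 20
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def looks_like_quote(text: str, has_nested_citation: bool) -> bool:
    """Reject navigation-style list items; keep sentence-length or cited ones."""
    cleaned = text.strip()
    if cleaned.lower().startswith(NAV_PREFIXES):
        return False
    # A nested citation block is strong structural evidence of a real quote.
    if has_nested_citation:
        return True
    return len(cleaned) >= MIN_QUOTE_CHARS


def extract_year(citation: Optional[str]) -> Optional[str]:
    """Return the first four-digit year found in free-form citation text."""
    if not citation:
        return None
    match = YEAR_RE.search(citation)
    return match.group(1) if match else None
```

Because the regex only looks for four-digit years, citations for ancient authors or undated attributions come back with no year, which is the conservative outcome.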

We handle all three. We rotate the browser fingerprint across chrome131, chrome124, firefox147, and safari180 via curl-cffi so each request presents a real TLS handshake, not a Python default. We thread Apify residential proxies with a fresh session whenever we hit a block. We pace requests with a 200 ms inter-page delay and retry on 408/429/5xx with exponential backoff — up to 5 attempts. We hand back Pydantic-validated typed rows; any row that doesn't conform to the ResultRow schema raises at write time rather than silently landing garbage in your dataset. No data written means no charge.
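
The retry policy described above can be sketched as follows. The retryable statuses and the 5-attempt cap come from the paragraph; the 0.2 s base delay and the doubling factor are assumptions, and fetch is a placeholder callable rather than the Actor's HTTP client.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx are worth retrying; everything else is final."""
    return status in (408, 429) or 500 <= status <= 599


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.2,  # assumed starting delay
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns 200, fails hard, or attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if not is_retryable(status):
            raise RuntimeError(f"non-retryable status {status}")
        if attempt < max_attempts - 1:
            # Double the wait after each failed attempt.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Passing sleep in as a parameter keeps the function testable without real waiting; a production version would typically also add jitter and swap the proxy session between attempts.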

The Actor ⚙️

The result is packaged as an Apify Actor: Wikiquote Quotes Scraper.

Paste a page title in the Apify Console and click Start, or call it from Python:

```python
from apify_client import ApifyClient

# Authenticate with your personal Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("DevilScrapes/wikiquote-quotes-scraper").call(
    run_input={
        "pages": ["Albert Einstein", "Oscar Wilde", "Stoicism"],
        "language": "en",
        "maxQuotesPerPage": 100,
    }
)

# Stream the quote rows from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["quote"], "—", item["attribution"])
```

The three input parameters you'll use most:

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| `pages` | list | `["Albert Einstein", "Oscar Wilde"]` | Exact Wikiquote article titles or full URLs |
| `language` | string | `"en"` | Any Wikiquote subdomain code: `en`, `de`, `fr`, `es`, `ja`, etc. |
| `maxQuotesPerPage` | integer | `50` | Cap per page, max 500 |

You can pass full URLs or bare titles — the Actor normalizes both. Pages run sequentially in one invocation, so a list of 20 philosopher pages costs one actor-start fee.
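
One way such normalization might look, as a sketch under assumptions: the function names are hypothetical, and taking the language from the URL's subdomain is a guess about how the two inputs are reconciled, not documented Actor behaviour.

```python
from typing import Tuple
from urllib.parse import quote, unquote, urlparse


def normalize_page(page: str, default_language: str = "en") -> Tuple[str, str]:
    """Return (language, title) for a bare title or a full Wikiquote URL."""
    page = page.strip()
    if page.startswith(("http://", "https://")):
        parsed = urlparse(page)
        host = parsed.netloc.lower()
        if not host.endswith(".wikiquote.org") or not parsed.path.startswith("/wiki/"):
            raise ValueError(f"not a Wikiquote article URL: {page}")
        language = host.split(".")[0]  # assumed: subdomain wins over the default
        title = unquote(parsed.path[len("/wiki/"):])
    else:
        language, title = default_language, page
    return language, title.replace("_", " ")


def page_url(language: str, title: str) -> str:
    """Rebuild the canonical article URL that ends up in the page_url field."""
    return f"https://{language}.wikiquote.org/wiki/{quote(title.replace(' ', '_'))}"
```

Round-tripping through both functions gives every input form the same (language, title) key, which also makes de-duplicating a mixed list of titles and URLs straightforward.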

Use cases

Daily-quote app or widget. Schedule a run against a curated list of 10–20 author pages (Epictetus, Marcus Aurelius, Seneca) and store the output in a dataset. A Cloudflare Worker hits that dataset at GET /random to serve one quote per request. The free $5 trial credit covers roughly your first 4,990 quotes.

Citation enrichment for LLM outputs. When a model produces "as Einstein said, ...", verify it before displaying. Pull the Einstein page, fuzzy-match against the quote field, and surface source and year if found. The section field tells you whether the quote is "Attributed" or "Disputed" — a ready-made confidence signal.

Multilingual corpus for RAG experiments. Pull the same topic in 5 languages (language: de, fr, es, pt, it) for a compact, attribution-real parallel corpus. Unlike pretraining dumps, every row carries a human-curated citation.

Movie and book reference assembly. Wikiquote has pages for films, novels, and TV shows. Pull "The Big Lebowski" or "Hamlet" and you get every notable line, organized by act or scene — useful for trivia apps and study tools.

Misattribution auditing. Filter by section == "Misattributed" across a set of famous figures to build a dataset of common "fake quotes" — useful for fact-checking tools and media literacy content.
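
For the citation-enrichment and misattribution cases, a rough sketch of the matching step using difflib from the standard library. The 0.85 threshold and the helper names are assumptions to tune against your own data; rows are plain dicts shaped like the Actor's output.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def best_match(claim: str, rows: List[dict], threshold: float = 0.85) -> Optional[dict]:
    """Return the row whose quote is most similar to claim, if similar enough."""
    best_row, best_score = None, 0.0
    for row in rows:
        score = SequenceMatcher(None, _norm(claim), _norm(row["quote"])).ratio()
        if score > best_score:
            best_row, best_score = row, score
    return best_row if best_score >= threshold else None


def is_suspect(row: dict) -> bool:
    """Treat the section heading as a coarse confidence signal."""
    return row["section"] in {"Disputed", "Misattributed"}
```

A claim that finds no match above the threshold is not necessarily fake; it may simply be absent from the pages you pulled, so it is safer to report it as unverified.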

Pricing — exact numbers 💰

Pay-per-event. You pay for quotes that land in your dataset, nothing for pages that fail or come back empty.

| Event | USD | What it is |
| --- | --- | --- |
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.001 | Per quote written to the dataset |

| Pull | Cost |
| --- | --- |
| 100 quotes | $0.11 |
| 1,000 quotes | $1.01 |
| 5,000 quotes | $5.01 |
| 20,000 quotes (multilingual sweep) | $20.01 |

Apify's $5 free trial credit covers your first ~4,990 quotes with no credit card. For comparison, typical free quote-API tiers are rate-limited to a handful of requests per hour and return a single quote per call with no source, year, or section metadata.
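
The arithmetic behind those figures, as a small sketch that assumes the two events listed above are the only charges. Decimal is used to avoid float rounding on sub-cent prices; the function names are illustrative.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.001")   # charged per quote written


def run_cost(quotes: int, runs: int = 1) -> Decimal:
    """Total USD for a number of quotes spread over a number of runs."""
    return ACTOR_START * runs + PER_RESULT * quotes


def quotes_within_budget(budget: Decimal, runs: int = 1) -> int:
    """How many quotes fit in a budget after paying the start fees."""
    remaining = budget - ACTOR_START * runs
    return max(0, int(remaining / PER_RESULT))
```

For a single run, 1,000 quotes come to exactly $1.005, which the pricing table rounds up to $1.01. A $5 budget covers 4,995 quotes in one run or 4,990 across two, consistent with the approximate figure quoted above.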

The technically-interesting bit

The parser handles the citation-nesting pattern that distinguishes Wikiquote's HTML from ordinary list markup. Each quote is a top-level <li> node; its source annotation lives in a nested <ul> inside the same <li>. The scraper decomposes the nested <ul> first — extracting citation text — then reads the remaining <li> text as the quote. Without the decompose step, the citation bleeds into the quote string, producing malformed rows like "Imagination is more important than knowledge. — Letter to Karl Seelig, 1952" instead of clean fields. The year regex then runs over the citation text only, so dates embedded in quote bodies don't produce spurious year values.
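
The same separation can be illustrated with html.parser from the standard library. This is a stand-in for the decompose step, not the Actor's parser: it tracks list depth so that text directly inside a top-level <li> becomes the quote and anything inside a nested <ul> becomes the citation. The class name is hypothetical.

```python
from html.parser import HTMLParser


class QuoteListParser(HTMLParser):
    """Collect (quote, citation) pairs from nested <ul>/<li> markup."""

    def __init__(self) -> None:
        super().__init__()
        self.ul_depth = 0
        self.rows = []    # finished (quote, citation-or-None) pairs
        self._quote = []  # text chunks directly inside a top-level <li>
        self._cite = []   # text chunks inside any nested <ul>

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth == 1:
            self._quote, self._cite = [], []  # a new top-level quote begins
        elif tag == "li" and self.ul_depth >= 2:
            self._cite.append(" ")  # keep separate citation items apart

    def handle_endtag(self, tag):
        if tag == "ul":
            self.ul_depth -= 1
        elif tag == "li" and self.ul_depth == 1:
            quote = " ".join("".join(self._quote).split())
            cite = " ".join("".join(self._cite).split())
            if quote:
                self.rows.append((quote, cite or None))

    def handle_data(self, data):
        if self.ul_depth == 1:
            self._quote.append(data)
        elif self.ul_depth >= 2:
            self._cite.append(data)


html = (
    "<ul><li>Everything should be made as simple as possible, but not simpler."
    "<ul><li>Reader's Digest, October 1977</li></ul></li></ul>"
)
parser = QuoteListParser()
parser.feed(html)
```

After feeding the sample markup, parser.rows holds one pair with the quote and the citation in separate fields, so a year regex can then be run on the citation alone.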

Limitations 🚧

Unusual page templates. Some pages use non-standard markup or transcluded templates the parser doesn't recognize. The result is a quote without source/year, or a skipped row.
Themed list pages. Pages like List of quotes about freedom work, but the attribution field is the page title, not the individual speaker — list pages don't tag each quote with its author.
No Category: page enumeration. You pass individual article titles. Bulk-pulling an entire category (e.g. Category:Philosophers) is not yet supported.
Unsourced quotes are surfaced, not suppressed. If a page has an "Unsourced" section, those quotes appear with source: null. Filter by section for verified citations only.
Per-language availability varies. English Wikiquote has the most complete coverage; other editions have sparser page sets.

FAQ

Is scraping Wikiquote legal?
Wikiquote content is published under the CC BY-SA 4.0 license, and programmatic access to public pages is supported by the MediaWiki API. This Actor reads only what the public site exposes and collects no personal data. Use the data with attribution as the license requires, and consult your own legal counsel for your specific case.

Does Wikiquote have an official API I can use instead?
The MediaWiki API lets you fetch page HTML or WikiText, but it returns markup blobs, not structured quote rows. This Actor is the parsing layer that converts those blobs into typed JSON — it just does that work for you.

Can I export results to Google Sheets or a CSV?
Yes — export JSON, CSV, or Excel directly from the Apify Console's Storage tab. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify API with your API token.

Why is year null on some quotes?
Wikiquote contributors don't always supply a year in the citation. The Actor extracts the year by running a regex over the source text — if no four-digit year appears in the citation, year is null. This is the source data's limitation, not ours; we surface what's published rather than guessing.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/wikiquote-quotes-scraper.

Free $5 trial credit, no credit card. Run it on ["Albert Einstein", "Oscar Wilde", "Stoicism"] and you'll have a few hundred properly-attributed quotes in under a minute. Need a field I missed — confidence score, canonical Wikidata ID? Drop it in the comments; the roadmap is driven by what people actually build.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community