Devil Scrapes

Posted on Jun 4

Wikipedia Article Scraper: extract structured text in any language for $2/1K

#webscraping #python #apify #datascience

Quick answer: Wikipedia's REST API is real and well-documented — but it's a per-article lookup, not a bulk export. A Wikipedia article scraper takes a list of titles or article URLs, normalises them, follows redirects, and returns one typed row per article containing the summary, plain-text body, Wikidata description, lead image, categories, and references. The Apify Actor at apify.com/DevilScrapes/wikipedia-article-scraper does this for $0.002 per article (~$2.00 per 1,000), with title normalisation, redirect-following, retries, and rate-limit pacing handled for you.

If you've ever assembled a list of 500 entity names and wanted the Wikipedia summary for each one without clicking through a browser 500 times, you know the problem. The Wikipedia REST API makes a single lookup easy. Making a hundred lookups reliably — across redirects, non-English language editions, articles with non-ASCII titles, and the occasional 429 — is where the afternoon disappears.

This post covers what the Wikipedia REST API actually provides, what breaks when you try to batch-scrape it yourself, and how the Actor handles the boring plumbing so you can focus on what you actually want to build.

What is Wikipedia? 🔎

Wikipedia is a free, collaboratively edited encyclopedia maintained by the Wikimedia Foundation. It operates in more than 300 language editions, with the English edition alone hosting over 6.7 million articles as of 2024. Content is published under the Creative Commons Attribution-ShareAlike license, which means you can republish and build on it freely, provided you give attribution and share derivative work under the same license.

For data practitioners, Wikipedia is something specific: a structured knowledge graph in article form, where each page has a canonical title, a Wikidata-sourced one-line description, a lead-section summary, a full plain-text body, inbound and outbound links, a category taxonomy, and a list of references. That structure makes it a natural seed corpus for knowledge bases, entity disambiguation, and retrieval-augmented generation (RAG) pipelines.

Does Wikipedia have a bulk-export API? 🔌

No, not in a practical sense. The Wikimedia REST API offers per-article endpoints — give it a title, get back a JSON summary. Wikimedia also publishes full database dumps, but those are multi-gigabyte XML files updated monthly, which is the wrong tool if you want a clean dataset for a specific list of 200 entities today.

The practical path for most teams is: iterate your title list, call the summary endpoint for each, handle errors and redirects, normalise the output. That's the whole job — and it's also the code you never want to write yourself for the fourth time.

What the data looks like

Each article comes back as one flat, typed row. Here is a real output record, with long string values truncated:

{
  "title": "Web scraping",
  "pageid": 2696925,
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites using the HyperText Transfer Protocol or a web browser. ...",
  "description": "Data extraction from websites",
  "extract_html": "<p><b>Web scraping</b>, web harvesting, or web data extraction is data scraping ...",
  "fulltext": "Web scraping, web harvesting, or web data extraction is data scraping ...\n\n== Background ==\n...",
  "thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/...",
  "original_image_url": "https://upload.wikimedia.org/wikipedia/commons/c/c9/...",
  "categories": ["Distributed computing", "Internet privacy", "Web scraping"],
  "references": null,
  "last_modified": "2024-11-15T09:12:43Z",
  "scraped_at": "2026-05-31T11:05:00+00:00"
}

Fourteen fields per row, Pydantic-validated before they land in your dataset. fulltext is populated only when includeFullText is true, which is the default (one extra API call per article). references requires includeReferences=true. Everything else arrives by default.
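
If you want to type-check rows on your side too, here is a minimal sketch using only the standard library. The Actor itself uses Pydantic; the ArticleRow class, the row_from_json helper, and the choice of which fields are optional are all assumptions inferred from the sample record above, not the Actor's real schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArticleRow:
    # Field names come from the sample record; the types are assumptions.
    title: str
    pageid: int
    language: str
    url: str
    summary: Optional[str]
    description: Optional[str]
    extract_html: Optional[str]
    fulltext: Optional[str]        # None when includeFullText is false
    thumbnail_url: Optional[str]
    original_image_url: Optional[str]
    categories: list
    references: Optional[list]     # None when includeReferences is false
    last_modified: Optional[str]
    scraped_at: str


def row_from_json(record: dict) -> ArticleRow:
    """Build a row, rejecting records with unexpected or missing keys."""
    expected = {f.name for f in fields(ArticleRow)}
    if set(record) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(record) ^ expected)}")
    return ArticleRow(**record)
```

A dataclass does not validate value types at runtime the way Pydantic does, so treat this as a shape check only.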

The naive approach (and why it falls apart) ⚠️

The first scraper most engineers write against Wikipedia looks roughly like this:

import requests

def get_summary(title: str) -> dict:
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    return requests.get(url).json()

That works for ten articles on a laptop. At scale, on a production schedule, it breaks in a few specific ways:

Title normalisation. Wikipedia titles are case-sensitive, space-sensitive, and redirect-heavy. "web scraping" 404s; "Web_scraping" works; "Web scraping" also works because the REST API handles it — but only if your URL encoding is correct. Full article URLs in mixed formats (old en.m.wikipedia.org mobile links, direct wiki/Title paths) need to be parsed down to just the title segment before you can call the API.
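
As a rough illustration of that parsing step, here is a sketch using only urllib.parse. The function names parse_article_ref and encode_title are made up for this post, and the Actor's own normalisation may well differ in the details.

```python
from urllib.parse import quote, unquote, urlparse


def parse_article_ref(ref: str, default_language: str = "en") -> tuple[str, str]:
    """Return (language, title) for a bare title or a Wikipedia article URL."""
    ref = ref.strip()
    if ref.startswith(("http://", "https://")):
        parsed = urlparse(ref)
        # Host looks like "en.wikipedia.org" or "en.m.wikipedia.org".
        language = parsed.netloc.split(".")[0]
        # Path looks like "/wiki/Web_scraping"; keep only the title segment.
        title = unquote(parsed.path.removeprefix("/wiki/"))
    else:
        language, title = default_language, ref
    # Collapse runs of whitespace into single underscores.
    return language, "_".join(title.split())


def encode_title(title: str) -> str:
    """Percent-encode a title for use as one URL path segment."""
    # safe="" means "/" inside a title (e.g. "AC/DC") is encoded too.
    return quote(title, safe="")
```

For example, parse_article_ref("https://en.m.wikipedia.org/wiki/Web_scraping") gives ("en", "Web_scraping"), and encode_title("AC/DC") gives "AC%2FDC".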

Rate-limit pacing. Wikipedia's API etiquette guidelines ask clients to make requests in series rather than in parallel and to identify themselves with a descriptive User-Agent header. Batch at full parallelism without a semaphore and you will collect 429s. We pace requests behind a configurable concurrency semaphore (default 4, max 16) so the upstream stays happy.
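
The semaphore pattern itself is small. Below is a minimal asyncio sketch, not the Actor's actual code: fetch_all is a name invented here, and fetch_one stands in for whatever coroutine performs a single HTTP lookup.

```python
import asyncio


async def fetch_all(titles, fetch_one, concurrency: int = 4):
    """Run fetch_one(title) for every title, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(title):
        # Only `concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return await fetch_one(title)

    # gather() returns results in the same order as the input titles.
    return await asyncio.gather(*(guarded(t) for t in titles))
```

Setting concurrency=1 turns this into a strictly serial loop, which is the most polite setting for long lists.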

Redirect resolution. Many titles redirect: "USA" redirects to "United States", "ML" might redirect to "Machine learning" or not, depending on the edition. The summary endpoint follows redirects internally, but the title field in the response tells you the canonical destination. If you don't capture that canonical title, your dataset has mismatched keys.
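
One way to keep keys aligned is to store the title you asked for next to the canonical one that came back. A small sketch under assumptions: keyed_row and the queried_title and title_changed fields are hypothetical names for illustration, not fields the Actor emits.

```python
def _comparable(title: str) -> str:
    # Underscores and spaces are interchangeable in Wikipedia titles.
    return title.replace("_", " ").strip()


def keyed_row(queried_title: str, summary: dict) -> dict:
    """Pair the queried title with the canonical title from the response."""
    canonical = summary["title"]
    return {
        "queried_title": queried_title,  # hypothetical extra field
        "title": canonical,
        # True when the API answered with a different title than we asked for.
        "title_changed": _comparable(queried_title) != _comparable(canonical),
    }
```

With this shape you can always join results back to your input list on queried_title, even when the row's title is the redirect target.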

Language routing. The Wikipedia API lives at {language}.wikipedia.org, not a single international endpoint. Japanese articles need ja.wikipedia.org/api/rest_v1/page/summary/.... If you want the same article in five languages, you need five separate hosts.
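
In code, the routing is just a host swap. A short sketch, where summary_url is a helper name invented for this post:

```python
from urllib.parse import quote


def summary_url(language: str, title: str) -> str:
    """Build the REST summary URL for one title in one language edition."""
    # Spaces become underscores; everything else is percent-encoded.
    segment = quote(title.replace(" ", "_"), safe="")
    return f"https://{language}.wikipedia.org/api/rest_v1/page/summary/{segment}"
```

Non-ASCII titles come out percent-encoded, so summary_url("ja", "東京") points at ja.wikipedia.org with the path segment %E6%9D%B1%E4%BA%AC.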

Transient failures. Network hiccups, upstream 503s during high-load periods, and occasional malformed responses are real. We retry on transient errors (408, 429, 5xx) with backoff instead of dropping the title, and we rotate browser fingerprints via curl-cffi (Chrome 131 / Chrome 124 / Firefox 147 / Safari 18.0 TLS impersonation) so the request looks like a real browser even when proxied through Apify residential infrastructure.
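
Here is roughly what retry-with-backoff looks like, as a standard-library sketch rather than the Actor's implementation. The fetch callable, the attempt count, and the delay values are all assumptions chosen for illustration.

```python
import random
import time

# Status codes treated as transient: 408, 429, and any 5xx.
RETRYABLE = {408, 429} | set(range(500, 600))


def fetch_with_retry(fetch, title, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(title) -> (status, payload), retrying transient failures."""
    for attempt in range(1, max_attempts + 1):
        status, payload = fetch(title)
        if status not in RETRYABLE:
            return status, payload
        if attempt < max_attempts:
            # Exponential backoff (0.5s, 1s, 2s, ...) plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    # Out of attempts: hand back the last failure for the caller to log.
    return status, payload
```

Passing sleep in as a parameter keeps the function easy to test without actually waiting.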

None of this is particularly hard to write once. It is, however, code you will debug repeatedly if you don't get it right the first time.

The Actor 🛠️

The Actor is available on the Apify Store: apify.com/DevilScrapes/wikipedia-article-scraper.

Paste a list of titles in the Apify Console and click Start, or call it programmatically with the Python SDK:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/wikipedia-article-scraper").call(
    run_input={
        "titles": ["Apify", "Web scraping", "Python (programming language)"],
        "language": "en",
        "includeFullText": True,
        "includeReferences": False,
        "concurrency": 4,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "—", item["description"])

Key input parameters:

Parameter	Type	Default	Notes
`titles`	array	—	Titles or full article URLs. Spaces are fine; they get encoded automatically.
`language`	string	`"en"`	Wikipedia language code, usually the ISO 639-1 code (for example `ja`). Maps to `{lang}.wikipedia.org`.
`includeFullText`	boolean	`true`	Fetches plain-text body via the MediaWiki action API. One extra request per article.
`includeReferences`	boolean	`false`	Fetches references via the REST references endpoint.
`concurrency`	integer	4	Parallel requests (1–16).

Use cases 💡

Knowledge-base seeding. You have a list of 300 product names, company names, or technical terms. You want the canonical Wikipedia summary and description for each to populate a structured knowledge graph or seed a vector store for RAG. Feed the list, get 300 typed rows.

Multilingual entity comparison. Fetch the same article in five language editions to compare framing, lead-section length, and category taxonomy. Run the Actor once per language value (five calls, in parallel if you like), keeping in mind that article titles usually differ between editions, so each run may need that edition's local titles. Each row carries its language field so you can join them.

Change monitoring. Schedule a daily run against a fixed watchlist of articles. Diff the last_modified timestamps between yesterday's and today's datasets — any article that changed gets a notification. Useful for monitoring rapidly evolving topics like regulatory changes, public company pages, or crisis events.
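
The diff step can be a few lines once both datasets are loaded as lists of rows. A sketch, assuming rows shaped like the sample record above; changed_articles is a name made up here.

```python
def changed_articles(yesterday: list, today: list) -> list:
    """Return titles whose last_modified differs between two runs."""
    # Key on (language, pageid) so the same page ID in two editions never collides.
    previous = {(r["language"], r["pageid"]): r["last_modified"] for r in yesterday}
    changed = []
    for row in today:
        key = (row["language"], row["pageid"])
        if key in previous and previous[key] != row["last_modified"]:
            changed.append(row["title"])
    return changed
```

Articles that appear only in today's dataset are ignored here; whether a new entry should trigger a notification depends on your watchlist.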

Definition harvesting. You're building a glossary tool or a tooltip system and you need the one-line Wikidata description (description field) for each term. The Actor returns it as a first-class field — no secondary Wikidata API call required for that single sentence.

Preprocessing for AI models. The fulltext field returns clean plain text with footnote markers stripped — ready to tokenise or embed. No HTML cleaning, no Wikitext parsing, no template expansion noise.

Pricing — exact numbers 💰

Pay-per-event. You pay only for articles that land in your dataset.

Event	Price
Actor start (one-off warm-up)	$0.005
Result emitted (per article)	$0.002

Volume	Estimated cost
100 articles	$0.21
500 articles	$1.01
1,000 articles	$2.01
10,000 articles	$20.01
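
The estimates follow from one start fee plus the per-article price. Here is the arithmetic as a small sketch; estimated_cost is a name invented here, and rounding half-up to the nearest cent is an assumption that happens to reproduce the figures in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

START_FEE = Decimal("0.005")    # Actor start, charged once per run
PER_ARTICLE = Decimal("0.002")  # charged per result emitted


def estimated_cost(articles: int) -> Decimal:
    """Cost in USD of a single run that emits `articles` rows."""
    total = START_FEE + PER_ARTICLE * articles
    # Decimal avoids float surprises such as 0.205 rounding down to 0.20.
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Splitting a list across several runs adds one start fee per run, so one large run is marginally cheaper than many small ones.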

Apify's $5 free trial credit covers roughly your first 2,500 articles at these prices ($0.002 each plus the $0.005 start fee per run), with no credit card required. There is no subscription, no minimum commitment, and no charge for a run that returns zero results.

The technically interesting part

Wikipedia's REST API is documented and stable, but the plain-text article body is not part of it. That body lives on the MediaWiki action API at w/api.php?action=query&prop=extracts&explaintext=1, which means a different base URL, a different parameter style, and a different response envelope from the summary endpoint. The response buries the extract inside query.pages.{pageid}.extract, where the page ID key is dynamic (not a fixed field name). The Actor navigates this by iterating the pages dict regardless of key, so it never needs to know the numeric page ID in advance.
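
To make the envelope concrete, here is a sketch of pulling the extract out without knowing the page ID. It assumes the action API's default JSON layout, where pages is a dict keyed by page ID (with formatversion=2 it becomes a list instead); extract_plain_text is a name invented for this post.

```python
from typing import Optional


def extract_plain_text(response: dict) -> Optional[str]:
    """Return the plain-text extract from an action API response, if any."""
    pages = response.get("query", {}).get("pages", {})
    # The key is the page ID as a string, so iterate values instead of indexing.
    for page in pages.values():
        text = page.get("extract")
        if text:
            return text
    # Missing pages come back under a negative key with no "extract" field.
    return None
```

A response like {"query": {"pages": {"2696925": {"title": "Web scraping", "extract": "..."}}}} yields the extract string, while a missing page yields None.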

The redirect-following behaviour is similarly subtle: the summary endpoint returns the canonical post-redirect title in the title field, not the queried title. We capture that canonical title and use it for subsequent full-text and reference fetches, so if you queried "ML" and got redirected to "Machine learning", the full-text fetch goes to Machine learning, not ML.

Limitations 🚧

Current version only. We pull the live article — no version history, no diff between revisions. The MediaWiki history API is a separate surface.
No infobox structured data. The REST API exposes article summaries, not parsed infobox templates. For structured entity facts (population, coordinates, founding date), use the Wikidata API directly.
References are flat text only. The references endpoint returns rendered reference text, not parsed citation fields (author, year, DOI). For structured bibliography extraction, additional parsing is required.
Rate limits on very large lists. Wikipedia's API etiquette guidelines recommend no more than one concurrent request from a single IP for heavy workloads. The Actor respects this via the concurrency setting; extremely large lists (10,000+) should be run with concurrency=2 or less to stay polite.

FAQ ❓

Is scraping Wikipedia legal?
Wikipedia content is published under the Creative Commons Attribution-ShareAlike (CC BY-SA) license, and the Wikimedia Foundation explicitly encourages programmatic access via their APIs. This Actor reads only what the public API exposes and identifies itself with a descriptive User-Agent. Attribution is required when republishing content. Check your specific jurisdiction and use case.

Can I export the results to Google Sheets or a CSV?
Yes. Export directly from the Apify Console run page as JSON, CSV, or Excel. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it programmatically via the Apify API.

Does Wikipedia have an official API?
Yes — the Wikimedia REST API and the MediaWiki Action API both exist and are documented. They offer per-article lookups. This Actor wraps them with batch handling, error recovery, and multilingual routing so you don't have to.

What happens if a title doesn't exist?
The Actor logs a warning and skips that title — it does not fail the whole run. You will see a log entry noting which titles returned no summary. No charge for skipped titles.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/wikipedia-article-scraper.

Free $5 trial credit, no credit card. Drop in a list of entity names you've been meaning to enrich and have the dataset in under a minute. A use case I missed, or a field that would make this more useful for your pipeline? Leave a comment — this Actor ships updates based on what people actually need.

Built by Devil Scrapes — pay-per-event Apify Actors with honest pricing and no fine print. 😈

DEV Community