DEV Community

agenthustler

Scraping Wikipedia: Bulk Data Extraction and API Usage

Wikipedia is one of the largest knowledge bases on the internet, making it a goldmine for data extraction projects. In this guide, we'll explore how to scrape Wikipedia efficiently using Python — both through its official API and direct HTML parsing.

Why Scrape Wikipedia?

Whether you're building a knowledge graph, training an NLP model, or collecting structured data for research, Wikipedia offers:

  • Millions of articles across every topic imaginable
  • Structured data through infoboxes, tables, and categories
  • A free API with generous rate limits
  • Regular updates with community-maintained accuracy

Method 1: Using the Wikipedia API

The MediaWiki API is the cleanest way to extract data. No HTML parsing needed.

import requests

def get_wikipedia_article(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts|pageimages|categories",
        "exintro": True,        # lead section only
        "explaintext": True,    # plain text instead of HTML
        "format": "json"
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))
    return {
        "title": page.get("title"),
        "extract": page.get("extract", ""),
        "categories": [c["title"] for c in page.get("categories", [])]
    }

article = get_wikipedia_article("Python_(programming_language)")
print(article["title"])
print(article["extract"][:200])
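The function above fetches a single page, but the API caps how much it returns per request. For larger result sets, MediaWiki includes a `continue` object in the response, which you merge into the next request's parameters until it stops appearing. A minimal sketch of that loop — the `fetch` callable is an assumption here, so you can swap in any thin wrapper around `requests.get`:

```python
import requests

def query_all(params, fetch):
    """Follow MediaWiki 'continue' tokens until the result set is exhausted.

    `fetch` is any callable taking a params dict and returning parsed JSON --
    in production, a thin wrapper around requests.get (see live_fetch below).
    """
    params = dict(params, format="json")
    pages = []
    while True:
        data = fetch(params)
        pages.extend(data.get("query", {}).get("pages", {}).values())
        if "continue" not in data:
            return pages
        # Merge the continuation tokens into the next request's params
        params = {**params, **data["continue"]}

def live_fetch(params):
    response = requests.get("https://en.wikipedia.org/w/api.php",
                            params=params, timeout=10)
    response.raise_for_status()
    return response.json()
```

Passing `fetch` as an argument also makes the loop easy to test offline with a stub that returns canned responses.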

Method 2: Scraping HTML Tables

Some Wikipedia data lives in HTML tables that the API doesn't return cleanly. For these, we parse directly.

import io

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_wikipedia_table(url, table_index=0):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table", class_="wikitable")

    if table_index >= len(tables):
        return None

    # Wrap in StringIO: passing raw HTML strings to read_html is deprecated
    df = pd.read_html(io.StringIO(str(tables[table_index])))[0]
    return df

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
df = scrape_wikipedia_table(url)
print(df.head(10))

Bulk Extraction at Scale

When scraping hundreds of pages, you need to throttle your requests, and for large jobs a proxy service can help you avoid IP blocks.

import time

import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"
topics = ["Machine_learning", "Data_science", "Web_scraping", "Natural_language_processing"]

def bulk_scrape(topics):
    results = []
    for topic in topics:
        api_url = f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}&url=https://en.wikipedia.org/wiki/{topic}"
        response = requests.get(api_url, timeout=60)  # proxied requests can be slow
        soup = BeautifulSoup(response.text, "html.parser")

        content = soup.find("div", {"id": "mw-content-text"})
        paragraphs = content.find_all("p") if content else []
        text = "\n".join(p.get_text() for p in paragraphs[:5])

        results.append({"topic": topic, "text": text})
        time.sleep(2)  # Be respectful
    return results

data = bulk_scrape(topics)
for item in data:
    print(f"{item['topic']}: {len(item['text'])} chars")

A proxy service like ScraperAPI can help keep your requests from being blocked during bulk operations by handling CAPTCHAs and IP rotation for you.
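The fixed `time.sleep(2)` above is fine for small runs, but at scale transient failures (timeouts, HTTP 429s) are common, and a blunt crash wastes everything scraped so far. A hedged sketch of retrying with exponential backoff — `fetch` is any zero-argument callable you supply, e.g. `lambda: requests.get(api_url, timeout=60)`:

```python
import time
import random

def fetch_with_retries(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky fetch with exponential backoff plus jitter.

    `fetch` is any zero-argument callable that raises on failure.
    `sleep` is injectable so tests can skip the real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow 1s, 2s, 4s, ... with jitter so parallel
            # workers don't all retry at the same instant
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Wrapping each request in `fetch_with_retries` keeps one flaky page from killing a run over hundreds of topics.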

Extracting Infobox Data

Infoboxes contain the most structured data on Wikipedia. Here's how to parse them:

import requests
from bs4 import BeautifulSoup

def extract_infobox(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    infobox = soup.find("table", class_="infobox")

    if not infobox:
        return {}

    data = {}
    rows = infobox.find_all("tr")
    for row in rows:
        header = row.find("th")
        value = row.find("td")
        if header and value:
            data[header.get_text(strip=True)] = value.get_text(strip=True)
    return data

info = extract_infobox("https://en.wikipedia.org/wiki/Python_(programming_language)")
for key, val in info.items():
    print(f"{key}: {val}")

Best Practices

  1. Use the API first — it's faster, cleaner, and officially supported
  2. Respect rate limits — add delays between requests (1-2 seconds minimum)
  3. Cache results — Wikipedia doesn't change every minute; store what you fetch
  4. Use proxies for scale — services like ScraperAPI or ThorData handle rotation for you
  5. Check robots.txt — always verify scraping policies
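Point 3 is cheap to implement: a tiny on-disk cache keyed by article title means repeat runs never re-fetch a page. A minimal sketch — the `fetch` argument is whatever function actually hits Wikipedia, e.g. `get_wikipedia_article` from Method 1, and the cache directory name is just an example:

```python
import json
from pathlib import Path

def cached_fetch(title, fetch, cache_dir=Path("wiki_cache")):
    """Return a cached article if present, otherwise fetch and store it.

    `fetch` is any callable taking a title and returning a
    JSON-serializable dict. Note: titles containing "/" would need
    escaping before being used as filenames.
    """
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / f"{title}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no network call
    result = fetch(title)
    path.write_text(json.dumps(result))      # cache miss: store for next run
    return result
```

With this in place, rerunning a bulk job after a crash only fetches the titles that are missing from the cache.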

Storing Your Data

import json
import csv

# Save as JSON
with open("wikipedia_data.json", "w") as f:
    json.dump(data, f, indent=2)

# Save as CSV
if isinstance(data, list) and data:
    keys = data[0].keys()
    with open("wikipedia_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

Conclusion

Wikipedia scraping is one of the best entry points into web scraping. The official API handles most use cases, but combining it with BeautifulSoup for tables and infoboxes gives you comprehensive coverage. For production workloads, pair your scraper with a proxy service like ScrapeOps to ensure reliability at scale.

Happy scraping!
