agenthustler

Wikipedia Data Extraction with Python: Complete Guide for 2026

Wikipedia is the largest free knowledge base on the internet. With structured infoboxes, categories, and interlinked articles, it's a goldmine for NLP datasets, knowledge graphs, and research. Here's how to extract Wikipedia data efficiently using both the API and direct scraping.

Wikipedia API vs Scraping

Wikipedia provides a comprehensive API (MediaWiki API) that should be your first choice. Scraping is only needed for data the API doesn't expose well.

Using the Wikipedia API

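A minimal sketch of fetching an article as plain text through the Action API. The `MyBot` User-Agent is a placeholder (substitute your own contact details), and `get_article_content` is named to match the dataset-building code later in the post:

```python
# Minimal sketch: fetch one article as plain text via the MediaWiki Action API.
# The User-Agent is a placeholder -- Wikipedia asks for real contact info.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}

def extract_params(title):
    """Query parameters for a plain-text extract of a single article."""
    return {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the canonical page
    }

def get_article_content(title):
    """Return {'title': ..., 'text': ...}, or None if the page doesn't exist."""
    resp = requests.get(API_URL, params=extract_params(title),
                        headers=HEADERS, timeout=10)
    page = next(iter(resp.json()["query"]["pages"].values()))
    if "missing" in page:
        return None
    return {"title": page["title"], "text": page.get("extract", "")}
```

Calling `get_article_content("Web scraping")` returns the article's full plain text in one request, with no HTML to clean up.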

Searching Wikipedia

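Full-text search goes through the same endpoint with `list=search`. A sketch, reusing the placeholder User-Agent from above:

```python
# Sketch: full-text search over article titles and bodies via list=search.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}  # placeholder UA

def search_params(query, limit=10):
    """Query parameters for a full-text search."""
    return {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
    }

def search_wikipedia(query, limit=10):
    """Return matching article titles, best matches first."""
    data = requests.get(API_URL, params=search_params(query, limit),
                        headers=HEADERS, timeout=10).json()
    return [hit["title"] for hit in data["query"]["search"]]
```

Each search hit also carries a `snippet` and `wordcount` field if you need more than the title.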

Extracting Infobox Data

Infoboxes contain structured data (population, area, founding date, etc.). The API returns this as wikitext, which needs parsing:

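A sketch under the assumption that a naive brace-counting parser is enough for simple infoboxes; for anything production-grade, a dedicated wikitext parser such as mwparserfromhell is the safer choice. `get_infobox` matches the helper name the dataset code below expects:

```python
# Sketch: fetch the page wikitext, then pull key=value fields out of the
# first {{Infobox ...}} template. The hand-rolled parser is deliberately
# naive; mwparserfromhell handles the edge cases this one will miss.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}  # placeholder UA

def _split_top_level(body):
    """Split a template body on '|' chars not nested in {{...}} or [[...]]."""
    parts, buf, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            buf.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            buf.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(body[i])
            i += 1
    parts.append("".join(buf))
    return parts

def parse_infobox(wikitext):
    """Return {field: raw wikitext value} from the first infobox template."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    depth, i = 0, start
    while i < len(wikitext):  # scan forward to the matching closing braces
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    body = wikitext[start + 2:i - 2]
    fields = {}
    for part in _split_top_level(body)[1:]:  # part [0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def get_infobox(title):
    """Fetch an article's wikitext and parse its infobox (empty dict if none)."""
    params = {
        "action": "query", "format": "json", "titles": title,
        "prop": "revisions", "rvprop": "content", "rvslots": "main",
    }
    data = requests.get(API_URL, params=params, headers=HEADERS,
                        timeout=10).json()
    page = next(iter(data["query"]["pages"].values()))
    try:
        wikitext = page["revisions"][0]["slots"]["main"]["*"]
    except (KeyError, IndexError):
        return {}
    return parse_infobox(wikitext)
```

The values come back as raw wikitext (e.g. `[[Paris]]` rather than `Paris`), so plan a second cleanup pass for links and nested templates.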

Bulk Data Extraction with Categories

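The `list=categorymembers` endpoint pages through large categories with `continue` tokens. A sketch (again with a placeholder User-Agent), returning the member dicts the dataset code below expects:

```python
# Sketch: list every page in a category, following the API's 'continue'
# tokens so categories larger than one response are fully collected.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}  # placeholder UA

def category_params(category, batch=500):
    """Query parameters for one page of category members."""
    return {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": min(batch, 500),  # 500 per request is the cap for normal users
        "cmtype": "page",            # skip subcategories and files
    }

def get_category_members(category, limit=500):
    """Return [{'pageid': ..., 'title': ...}, ...] for a category."""
    members, params = [], category_params(category)
    while len(members) < limit:
        data = requests.get(API_URL, params=params, headers=HEADERS,
                            timeout=10).json()
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries cmcontinue for the next page
    return members[:limit]
```

Set `cmtype` to `subcat` instead if you want to walk a category tree recursively.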

Building a Knowledge Dataset

import json
import time

def build_dataset(category, output_file, max_articles=50):
    """Build a structured dataset from a Wikipedia category."""
    members = get_category_members(category, limit=max_articles)

    articles = []
    for i, member in enumerate(members[:max_articles]):
        print(f"Processing {i+1}/{min(len(members), max_articles)}: {member['title']}")

        article = get_article_content(member["title"])
        if article:
            infobox = get_infobox(member["title"])
            article["infobox"] = infobox
            articles.append(article)

        time.sleep(0.5)  # Respect rate limits

    # Export
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(articles, f, indent=2, ensure_ascii=False)

    print(f"\nSaved {len(articles)} articles to {output_file}")
    return articles

# Build a dataset of programming languages
dataset = build_dataset(
    "Programming languages",
    "programming_languages.json",
    max_articles=30
)

Using the wikipedia-api Python Package

For simpler use cases, the wikipedia-api package (imported as wikipediaapi) wraps the API:

# pip install wikipedia-api
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="MyBot/1.0 (myemail@example.com)",
    language="en"
)

page = wiki.page("Web scraping")

if page.exists():
    print(f"Title: {page.title}")
    print(f"Summary: {page.summary[:200]}...")
    print(f"Full text: {len(page.text)} chars")
    print(f"Links: {len(page.links)} outgoing links")
    print(f"Categories: {len(page.categories)}")

Handling Proxies for Large-Scale Extraction

While Wikipedia is generally scraping-friendly, extracting thousands of articles quickly may trigger rate limits. ScraperAPI can help distribute requests across multiple IPs for large-scale Wikipedia data projects.
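As a hedged sketch, rotation can be as simple as cycling a `requests` proxies mapping. The proxy URLs below are placeholders for whatever endpoints your provider gives you (ScraperAPI, for instance, exposes a single gateway URL that rotates IPs server-side):

```python
# Sketch: round-robin requests across a pool of proxies.
# The proxy URLs are placeholders, not real endpoints.
import itertools
import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch_via_proxy(url, **kwargs):
    """GET a URL through the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=20, **kwargs)
```

For Wikipedia specifically, though, reach for proxies only after caching and the official dumps have been ruled out.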

Best Practices

  1. Always use the API first — It's faster, more reliable, and explicitly allowed
  2. Set a proper User-Agent — Wikipedia requires it for API access
  3. Respect rate limits — the Wikimedia REST API documents a cap of 200 requests/second; the Action API publishes no hard number, so make requests serially and back off on errors
  4. Cache aggressively — Wikipedia content doesn't change every minute
  5. Use dumps for bulk data — For millions of articles, download Wikipedia dumps instead
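Practice 4 can be sketched as a tiny on-disk cache; the directory name and the one-day TTL are arbitrary choices:

```python
# Sketch: a minimal on-disk JSON cache so repeated runs don't re-download
# unchanged articles. Directory name and TTL are arbitrary.
import json
import os
import time

CACHE_DIR = ".wiki_cache"
CACHE_TTL = 24 * 3600  # treat entries older than a day as stale

def cached(key, fetch, cache_dir=CACHE_DIR, ttl=CACHE_TTL):
    """Return cached JSON for `key`, calling fetch() only on a miss or expiry."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key.replace("/", "_") + ".json")
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < ttl:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    value = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(value, f, ensure_ascii=False)
    return value
```

Wrap any fetcher with it, e.g. `cached("Web_scraping", lambda: get_article_content("Web scraping"))`, assuming a `get_article_content` helper like the one earlier in the post.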

Conclusion

Wikipedia data extraction is one of the most accessible scraping projects you can undertake: the content is freely licensed and the API is built for programmatic access. Start with the MediaWiki API for structured access, use the wikipedia-api package for quick scripts, and resort to HTML scraping only when you need data the API doesn't provide.
