Wikipedia is the largest free knowledge base on the internet. With structured infoboxes, categories, and interlinked articles, it's a goldmine for NLP datasets, knowledge graphs, and research. Here's how to extract Wikipedia data efficiently using both the API and direct scraping.
Wikipedia API vs Scraping
Wikipedia provides a comprehensive API (MediaWiki API) that should be your first choice. Scraping is only needed for data the API doesn't expose well.
Using the Wikipedia API
import requests
import json
WIKI_API = "https://en.wikipedia.org/w/api.php"
def get_article_content(title):
    """Fetch an article's plain-text extract plus metadata via the MediaWiki API.

    Args:
        title: Exact article title, e.g. "Python (programming language)".

    Returns:
        dict with title, page_id, content, thumbnail, categories, and links,
        or None when the API response contains no pages.

    Raises:
        requests.HTTPError: on a non-2xx response.
    """
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts|pageimages|categories|links",
        "exintro": False,      # full article text, not just the lead section
        "explaintext": True,   # plain text instead of HTML
        "pithumbsize": 500,
        "cllimit": 50,
        "pllimit": 50,
        "format": "json",
    }
    # Timeout prevents the call from hanging forever on a network stall.
    response = requests.get(WIKI_API, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()
    pages = data.get("query", {}).get("pages", {})
    # "pages" is keyed by page id; a single-title query yields at most one entry.
    for page_id, page in pages.items():
        return {
            "title": page.get("title"),
            "page_id": page_id,
            "content": page.get("extract", ""),
            "thumbnail": page.get("thumbnail", {}).get("source"),
            "categories": [c["title"] for c in page.get("categories", [])],
            "links": [l["title"] for l in page.get("links", [])],
        }
    return None  # no pages in the response (malformed or empty reply)
# Demo: fetch one article and report a few basic stats.
article = get_article_content("Python (programming language)")
for message in (
    f"Title: {article['title']}",
    f"Content length: {len(article['content'])} chars",
    f"Categories: {len(article['categories'])}",
):
    print(message)
Searching Wikipedia
def search_wikipedia(query, limit=10):
    """Search Wikipedia and return matching articles.

    Args:
        query: Free-text search string.
        limit: Maximum number of results to return (default 10).

    Returns:
        List of dicts with title, page_id, snippet (HTML-highlighted),
        and word_count for each hit.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    # Timeout prevents the call from hanging forever on a network stall.
    response = requests.get(WIKI_API, params=params, timeout=10)
    data = response.json()
    results = []
    for item in data.get("query", {}).get("search", []):
        results.append({
            "title": item["title"],
            "page_id": item["pageid"],
            "snippet": item["snippet"],
            "word_count": item["wordcount"],
        })
    return results
# Demo: search and print each hit with its word count.
for hit in search_wikipedia("machine learning algorithms"):
    print(f" {hit['title']} ({hit['word_count']} words)")
Extracting Infobox Data
Infoboxes contain structured data (population, area, founding date, etc.). The API returns this as wikitext, which needs parsing:
import re
def _parse_infobox(wikitext):
    """Extract key/value pairs from the first {{Infobox ...}} template in wikitext.

    Returns a dict of field name -> cleaned value; empty when no infobox is found.
    Note: the regex approach does not handle deeply nested templates.
    """
    infobox = {}
    infobox_match = re.search(r'\{\{Infobox(.+?)\n\}\}', wikitext, re.DOTALL)
    if infobox_match:
        infobox_text = infobox_match.group(1)
        # Each field looks like "| key = value", terminated by the next "|" or "}}".
        pairs = re.findall(r'\|\s*(.+?)\s*=\s*(.+?)(?=\n\||\n\})', infobox_text)
        for key, value in pairs:
            # [[target|label]] -> label, [[target]] -> target
            clean_value = re.sub(r'\[\[([^|\]]*\|)?([^\]]*)\]\]', r'\2', value)
            # Drop remaining {{...}} templates (citations, formatting helpers).
            clean_value = re.sub(r'\{\{[^}]*\}\}', '', clean_value).strip()
            infobox[key.strip()] = clean_value
    return infobox


def get_infobox(title):
    """Extract infobox data from a Wikipedia article.

    Fetches the article's lead-section wikitext and parses the infobox
    template into a flat dict.

    Args:
        title: Exact article title.

    Returns:
        dict of infobox field -> cleaned value; empty dict when the article
        is missing or has no infobox (never None, so callers can iterate).
    """
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,  # section 0 = lead, where the infobox lives
        "format": "json",
    }
    # Timeout prevents the call from hanging forever on a network stall.
    response = requests.get(WIKI_API, params=params, timeout=10)
    data = response.json()
    pages = data.get("query", {}).get("pages", {})
    for page_id, page in pages.items():
        content = page.get("revisions", [{}])[0].get("*", "")
        return _parse_infobox(content)
    # Previously this path returned None, crashing callers that iterate .items().
    return {}
# Demo: show the first ten infobox fields for San Francisco.
city_info = get_infobox("San Francisco")
first_ten = list(city_info.items())[:10]
for key, value in first_ten:
    print(f" {key}: {value}")
Bulk Data Extraction with Categories
def get_category_members(category, limit=100):
    """Return up to `limit` articles in a Wikipedia category.

    Follows the API's continuation tokens across pages, but stops once
    `limit` members have been collected (the original version paginated
    until the category was exhausted, ignoring `limit` entirely).

    Args:
        category: Category name without the "Category:" prefix.
        limit: Maximum number of members to return (default 100).

    Returns:
        List of dicts with title and page_id for each member article.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": min(limit, 500),  # API caps one request at 500 for anonymous clients
        "cmtype": "page",            # articles only; skip subcategories and files
        "format": "json",
    }
    members = []
    while len(members) < limit:
        # Timeout prevents the call from hanging forever on a network stall.
        response = requests.get(WIKI_API, params=params, timeout=10)
        data = response.json()
        for member in data.get("query", {}).get("categorymembers", []):
            members.append({
                "title": member["title"],
                "page_id": member["pageid"],
            })
        # Handle pagination: follow the continuation token until exhausted.
        if "continue" in data:
            params["cmcontinue"] = data["continue"]["cmcontinue"]
        else:
            break
    return members[:limit]
# Demo: list article pages in the "Programming languages" category.
programming_languages = get_category_members("Programming languages")
count = len(programming_languages)
print(f"Found {count} programming languages")
Building a Knowledge Dataset
import csv
import time
def build_dataset(category, output_file, max_articles=50):
    """Build a structured dataset from a Wikipedia category.

    Fetches each member article's content and infobox, then writes the
    collection to `output_file` as pretty-printed JSON.

    Args:
        category: Category name without the "Category:" prefix.
        output_file: Path of the JSON file to write.
        max_articles: Upper bound on articles fetched (default 50).

    Returns:
        The list of article dicts that was written out.
    """
    members = get_category_members(category, limit=max_articles)[:max_articles]
    total = len(members)
    collected = []
    for position, member in enumerate(members, start=1):
        print(f"Processing {position}/{total}: {member['title']}")
        record = get_article_content(member["title"])
        if record:
            record["infobox"] = get_infobox(member["title"])
            collected.append(record)
        time.sleep(0.5)  # be polite: pause between requests to respect rate limits
    # Persist the dataset as UTF-8 JSON.
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(collected, f, indent=2, ensure_ascii=False)
    print(f"\nSaved {len(collected)} articles to {output_file}")
    return collected
# Build a dataset of programming languages
dataset = build_dataset(
    category="Programming languages",
    output_file="programming_languages.json",
    max_articles=30,
)
Using the wikipedia-api Python Package
For simpler use cases, the wikipedia-api package (imported as wikipediaapi — not to be confused with the older wikipedia package) wraps the MediaWiki API:
# pip install wikipedia-api
import wikipediaapi

# Identify the client, as required by Wikimedia's User-Agent policy.
wiki = wikipediaapi.Wikipedia(
    user_agent="MyBot/1.0 (myemail@example.com)",
    language="en",
)

page = wiki.page("Web scraping")
if page.exists():
    for message in (
        f"Title: {page.title}",
        f"Summary: {page.summary[:200]}...",
        f"Full text: {len(page.text)} chars",
        f"Links: {len(page.links)} outgoing links",
        f"Categories: {len(page.categories)}",
    ):
        print(message)
Handling Proxies for Large-Scale Extraction
While Wikipedia is generally scraping-friendly, extracting thousands of articles quickly may trigger rate limits. ScraperAPI can help distribute requests across multiple IPs for large-scale Wikipedia data projects.
Best Practices
- Always use the API first — It's faster, more reliable, and explicitly allowed
- Set a proper User-Agent — Wikipedia requires it for API access
- Respect rate limits — Wikimedia asks clients to stay under 200 requests/second; for most scripts, sequential requests with a short pause are more than enough
- Cache aggressively — Wikipedia content doesn't change every minute
- Use dumps for bulk data — For millions of articles, download Wikipedia dumps instead
Conclusion
Wikipedia data extraction is one of the most accessible and legal scraping projects you can undertake. Start with the MediaWiki API for structured access, use the wikipedia-api package for quick scripts, and resort to HTML scraping only when you need data the API doesn't provide.
Top comments (0)