Wikipedia is the largest free knowledge base on the internet. With structured infoboxes, categories, and interlinked articles, it's a goldmine for NLP datasets, knowledge graphs, and research. Here's how to extract Wikipedia data efficiently using both the API and direct scraping.
Wikipedia API vs Scraping
Wikipedia provides a comprehensive API (MediaWiki API) that should be your first choice. Scraping is only needed for data the API doesn't expose well.
Using the Wikipedia API
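The MediaWiki action API is plain HTTP plus JSON, so fetching an article needs nothing beyond the standard library. Here is a minimal sketch of pulling an article's plain-text extract; the `get_article_content` name and the returned `{"title", "text"}` fields are assumptions, chosen to match the dataset-building code later in this post:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}

def build_params(title):
    # prop=extracts with explaintext returns the article body as plain text
    return {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": "1",
        "titles": title,
    }

def parse_article(data):
    # The API keys results by page ID; grab the first (only) page
    page = next(iter(data["query"]["pages"].values()))
    if "missing" in page:
        return None
    return {"title": page["title"], "text": page.get("extract", "")}

def get_article_content(title):
    url = API_URL + "?" + urllib.parse.urlencode(build_params(title))
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return parse_article(json.load(resp))
```

Keeping `parse_article` separate from the network call makes the response handling easy to test offline against a canned JSON payload.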
Searching Wikipedia
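Full-text search goes through the same endpoint with `list=search`. A hedged sketch (the `search_wikipedia` helper and the stripped-snippet output format are illustrative choices, not an official client):

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}

def parse_search_results(data):
    # Snippets come back with <span> highlight markup; strip the HTML tags
    return [
        {"title": hit["title"], "snippet": re.sub(r"<[^>]+>", "", hit["snippet"])}
        for hit in data["query"]["search"]
    ]

def search_wikipedia(query, limit=10):
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
    }
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return parse_search_results(json.load(resp))
```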
Extracting Infobox Data
Infoboxes contain structured data (population, area, founding date, etc.). The API returns this as wikitext, which needs parsing:
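One way to do that parsing, sketched with only the standard library: fetch the page's wikitext via `prop=revisions`, brace-match the first `{{Infobox ...}}` template, and pull out simple `| key = value` fields. This is a rough parser that leaves nested templates and links as raw wikitext; for production work a dedicated parser such as mwparserfromhell is the safer choice. The `get_infobox` name is an assumption matching the dataset code below:

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}

def parse_infobox(wikitext):
    # Find the first {{Infobox ...}} template and brace-match to its end
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    depth, i = 0, start
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    body = wikitext[start:i]
    # Grab simple "| key = value" lines; nested markup stays raw
    fields = {}
    for m in re.finditer(r"^\s*\|\s*([\w ]+?)\s*=\s*(.*)$", body, re.MULTILINE):
        fields[m.group(1).strip()] = m.group(2).strip()
    return fields

def get_infobox(title):
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
    }
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    page = data["query"]["pages"][0]
    if "revisions" not in page:
        return {}
    wikitext = page["revisions"][0]["slots"]["main"]["content"]
    return parse_infobox(wikitext)
```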
Bulk Data Extraction with Categories
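Categories are the natural unit for bulk extraction: `list=categorymembers` pages through every article in a category, with `cmcontinue` tokens for pagination. A minimal sketch (the `get_category_members` name matches the dataset code below; the 500-per-request cap is the API's standard limit for anonymous clients):

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyBot/1.0 (myemail@example.com)"}

def category_title(name):
    # Accept "Programming languages" or "Category:Programming languages"
    return name if name.startswith("Category:") else "Category:" + name

def get_category_members(category, limit=500):
    members, cont = [], {}
    while len(members) < limit:
        params = {
            "action": "query",
            "format": "json",
            "list": "categorymembers",
            "cmtitle": category_title(category),
            "cmtype": "page",  # skip subcategories and files
            "cmlimit": str(min(500, limit - len(members))),
            **cont,            # continuation tokens from the previous page
        }
        url = API_URL + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        members.extend(data["query"]["categorymembers"])
        cont = data.get("continue", {})
        if not cont:
            break
    return members[:limit]
```

Passing the whole `continue` object back into the next request, as done here, is the pagination pattern the MediaWiki documentation recommends.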
Building a Knowledge Dataset
import json
import time

def build_dataset(category, output_file, max_articles=50):
    """Build a structured dataset from a Wikipedia category."""
    members = get_category_members(category, limit=max_articles)
    articles = []
    for i, member in enumerate(members[:max_articles]):
        print(f"Processing {i+1}/{min(len(members), max_articles)}: {member['title']}")
        article = get_article_content(member["title"])
        if article:
            infobox = get_infobox(member["title"])
            article["infobox"] = infobox
            articles.append(article)
        time.sleep(0.5)  # Respect rate limits
    # Export
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(articles, f, indent=2, ensure_ascii=False)
    print(f"\nSaved {len(articles)} articles to {output_file}")
    return articles

# Build a dataset of programming languages
dataset = build_dataset(
    "Programming languages",
    "programming_languages.json",
    max_articles=30,
)
Using the wikipedia-api Python Package
For simpler use cases, the wikipedia-api package wraps the API:
# pip install wikipedia-api
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="MyBot/1.0 (myemail@example.com)",
    language="en",
)

page = wiki.page("Web scraping")
if page.exists():
    print(f"Title: {page.title}")
    print(f"Summary: {page.summary[:200]}...")
    print(f"Full text: {len(page.text)} chars")
    print(f"Links: {len(page.links)} outgoing links")
    print(f"Categories: {len(page.categories)}")
Handling Proxies for Large-Scale Extraction
While Wikipedia is generally scraping-friendly, extracting thousands of articles quickly may trigger rate limits. ScraperAPI can help distribute requests across multiple IPs for large-scale Wikipedia data projects.
Best Practices
- Always use the API first — It's faster, more reliable, and explicitly allowed
- Set a proper User-Agent — Wikipedia requires it for API access
- Respect rate limits — Wikimedia asks API clients to stay below 200 requests/second; for most projects, serial requests with a short delay are plenty
- Cache aggressively — Wikipedia content doesn't change every minute
- Use dumps for bulk data — For millions of articles, download Wikipedia dumps instead
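The caching advice above can be as simple as a JSON file keyed by page title, so re-runs never refetch pages they already have. A minimal sketch (the `wiki_cache.json` filename and the `cached_fetch` helper are illustrative assumptions):

```python
import json
import os

CACHE_FILE = "wiki_cache.json"  # hypothetical location

def load_cache(path=CACHE_FILE):
    # Start from the saved cache if one exists, else an empty dict
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_FILE):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False)

def cached_fetch(title, fetch, cache):
    # Only hit the network for titles we haven't stored yet
    if title not in cache:
        cache[title] = fetch(title)
    return cache[title]
```

Wrap any of the fetch helpers in `cached_fetch` and call `save_cache` once at the end of a run.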
Conclusion
Wikipedia data extraction is one of the most accessible and legal scraping projects you can undertake. Start with the MediaWiki API for structured access, use the wikipedia-api package for quick scripts, and resort to HTML scraping only when you need data the API doesn't provide.