Wikipedia is one of the largest knowledge bases on the internet, making it a goldmine for data extraction projects. In this guide, we'll explore how to scrape Wikipedia efficiently using Python — both through its official API and direct HTML parsing.
## Why Scrape Wikipedia?
Whether you're building a knowledge graph, training an NLP model, or collecting structured data for research, Wikipedia offers:
- Millions of articles across every topic imaginable
- Structured data through infoboxes, tables, and categories
- A free API with generous rate limits
- Regular updates with community-maintained accuracy
## Method 1: Using the Wikipedia API
The MediaWiki API is the cleanest way to extract data. No HTML parsing needed.
```python
import requests

def get_wikipedia_article(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts|pageimages|categories",
        "exintro": True,      # only the intro section, not the full article
        "explaintext": True,  # plain text instead of HTML
        "format": "json",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()

    # The API keys pages by page ID, so grab the first (and only) entry
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))
    return {
        "title": page.get("title"),
        "extract": page.get("extract"),
        "categories": [c["title"] for c in page.get("categories", [])],
    }

article = get_wikipedia_article("Python_(programming_language)")
print(article["title"])
print(article["extract"][:200])
```
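One gotcha worth guarding against: when a title doesn't exist, the API still returns a page entry, flagged with a `"missing"` key and a negative page ID, rather than an error. A small helper (the function name here is my own, not part of the API) normalizes that case:

```python
def parse_query_page(data):
    """Return the first page from a MediaWiki query response,
    or None if the title was not found."""
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))
    if "missing" in page:  # the API marks nonexistent titles this way
        return None
    return page

# A found page keeps its fields...
found = {"query": {"pages": {"123": {"title": "Python", "extract": "..."}}}}
print(parse_query_page(found)["title"])  # Python

# ...while a missing title comes back as None
missing = {"query": {"pages": {"-1": {"title": "No_such_page", "missing": ""}}}}
print(parse_query_page(missing))  # None
```

Checking for this up front saves you from a confusing `KeyError` three functions downstream.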
## Method 2: Scraping HTML Tables
Some Wikipedia data lives in HTML tables that the API doesn't return cleanly. For these, we parse directly.
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

def scrape_wikipedia_table(url, table_index=0):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table", class_="wikitable")
    if table_index >= len(tables):
        return None
    # Wrap the HTML in StringIO: passing literal strings
    # to read_html is deprecated in recent pandas versions
    df = pd.read_html(StringIO(str(tables[table_index])))[0]
    return df

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
df = scrape_wikipedia_table(url)
print(df.head(10))
```
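Table cells rarely come back as clean numbers: GDP figures arrive as strings with thousands separators, footnote markers like `[n 1]`, and em-dashes for missing values. A small cleanup helper (my own sketch, pure string handling) makes such columns usable:

```python
import re

def parse_wiki_number(cell):
    """Convert a Wikipedia table cell like '21,427,700[n 1]' to an int,
    returning None for placeholder cells such as '—' or 'N/A'."""
    text = re.sub(r"\[[^\]]*\]", "", str(cell)).strip()  # drop footnote markers
    text = text.replace(",", "")                          # drop thousands separators
    if not text or text in {"—", "-", "N/A"}:
        return None
    try:
        return int(text)
    except ValueError:
        return None

print(parse_wiki_number("21,427,700[n 1]"))  # 21427700
print(parse_wiki_number("—"))                # None
```

Applied with `df["GDP"].map(parse_wiki_number)`, this turns a column of strings into something you can actually sort and sum.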
## Bulk Extraction at Scale
When scraping hundreds of pages, you need to handle rate limiting and use proxies to avoid blocks.
```python
import time
import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"
topics = ["Machine_learning", "Data_science", "Web_scraping", "Natural_language_processing"]

def bulk_scrape(topics):
    results = []
    for topic in topics:
        # Route each request through ScraperAPI for IP rotation
        api_url = (
            f"http://api.scraperapi.com?api_key={SCRAPER_API_KEY}"
            f"&url=https://en.wikipedia.org/wiki/{topic}"
        )
        response = requests.get(api_url)
        soup = BeautifulSoup(response.text, "html.parser")
        content = soup.find("div", {"id": "mw-content-text"})
        paragraphs = content.find_all("p") if content else []
        text = "\n".join(p.get_text() for p in paragraphs[:5])
        results.append({"topic": topic, "text": text})
        time.sleep(2)  # Be respectful
    return results

data = bulk_scrape(topics)
for item in data:
    print(f"{item['topic']}: {len(item['text'])} chars")
```
Using a proxy service like ScraperAPI helps keep your requests from being blocked during bulk operations, and handles CAPTCHAs and IP rotation automatically.
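Even with delays and a proxy in place, transient failures (timeouts, HTTP 429s) are inevitable at scale. A simple exponential-backoff retry loop, sketched here with hypothetical names of my own, keeps one flaky request from sinking the whole batch:

```python
import time
import requests

def fetch_with_retry(url, retries=3, base_delay=1.0):
    """GET a URL, retrying on failure with exponential backoff:
    wait base_delay, then 2x, then 4x... between attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # fall through to the backoff sleep
        time.sleep(base_delay * (2 ** attempt))
    return None  # all attempts failed

# The delay schedule doubles on each attempt: 1s, 2s, 4s for three retries
delays = [1.0 * (2 ** attempt) for attempt in range(3)]
print(delays)  # [1.0, 2.0, 4.0]
```

Drop this in place of the bare `requests.get` call in `bulk_scrape` and a single timeout no longer costs you the run.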
## Extracting Infobox Data
Infoboxes contain the most structured data on Wikipedia. Here's how to parse them:
```python
import requests
from bs4 import BeautifulSoup

def extract_infobox(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if not infobox:
        return {}
    data = {}
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            data[header.get_text(strip=True)] = value.get_text(strip=True)
    return data

info = extract_infobox("https://en.wikipedia.org/wiki/Python_(programming_language)")
for key, val in info.items():
    print(f"{key}: {val}")
```
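Note that `get_text(strip=True)` squashes multi-line values together and keeps citation markers like `[1]`. A post-processing step (my own helper, not part of the scraper above) tidies the values:

```python
import re

def clean_infobox_value(raw):
    """Strip citation markers like [1] or [note 2] and
    collapse runs of whitespace into single spaces."""
    text = re.sub(r"\[[^\]]*\]", "", raw)
    return re.sub(r"\s+", " ", text).strip()

print(clean_infobox_value("Guido van Rossum[1]"))        # Guido van Rossum
print(clean_infobox_value("3.12.0  /\n2 October 2023"))  # 3.12.0 / 2 October 2023
```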
## Best Practices
- Use the API first — it's faster, cleaner, and officially supported
- Respect rate limits — add delays between requests (1-2 seconds minimum)
- Cache results — Wikipedia doesn't change every minute; store what you fetch
- Use proxies for scale — services like ScraperAPI or ThorData handle rotation for you
- Check robots.txt — always verify scraping policies
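The caching advice above can be as simple as a JSON file on disk. Here's a minimal sketch (file name and function names are mine) that only hits the network for titles it hasn't seen:

```python
import json
import os

CACHE_FILE = "wiki_cache.json"

def cached_fetch(title, fetch_fn):
    """Return a cached result for `title` if one exists on disk;
    otherwise call fetch_fn(title) and persist the result."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if title not in cache:
        cache[title] = fetch_fn(title)
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[title]

# Stand-in fetcher that records how often it is actually called
calls = []
def fake_fetch(title):
    calls.append(title)
    return {"title": title}

cached_fetch("Python", fake_fetch)  # fetches and stores
cached_fetch("Python", fake_fetch)  # served from disk, no second fetch
```

Swap `fake_fetch` for `get_wikipedia_article` and repeated runs of your script stop hammering the API.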
## Storing Your Data
```python
import json
import csv

# `data` is the list of dicts returned by bulk_scrape above

# Save as JSON
with open("wikipedia_data.json", "w") as f:
    json.dump(data, f, indent=2)

# Save as CSV
if isinstance(data, list) and data:
    keys = data[0].keys()
    with open("wikipedia_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)
```
## Conclusion
Wikipedia scraping is one of the best entry points into web scraping. The official API handles most use cases, and combining it with BeautifulSoup for tables and infoboxes gives you comprehensive coverage. For production workloads, pair your scraper with a proxy service like ScraperAPI to ensure reliability at scale.
Happy scraping!