How to Scrape Museum Collections: Metropolitan and Smithsonian APIs
The world's greatest museums have digitized millions of artworks and artifacts, many accessible through free public APIs. Let's build a Python pipeline to collect, analyze, and visualize museum collection data.
Available Museum APIs
- Metropolitan Museum of Art — 470K+ objects, fully open, no key needed
- Smithsonian Institution — 15M+ records across 19 museums
- Rijksmuseum — Dutch masters and more
- Harvard Art Museums — 250K+ objects
Setting Up
```shell
pip install requests pandas matplotlib tqdm
```
The Met Collection API
The Met's API is completely open — no key required:
```python
import requests
import time
from tqdm import tqdm

MET_BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

def get_met_departments():
    """Return the list of Met departments (each has a name and an ID)."""
    resp = requests.get(f"{MET_BASE}/departments")
    return resp.json()["departments"]

def search_met(query, department_id=None):
    """Search the collection; returns a list of matching object IDs."""
    params = {"q": query}
    if department_id:
        params["departmentId"] = department_id
    resp = requests.get(f"{MET_BASE}/search", params=params)
    # The API returns "objectIDs": null when there are no hits,
    # so guard against None as well as a missing key.
    return resp.json().get("objectIDs") or []

def get_met_object(object_id):
    """Fetch full metadata for a single object, or None on error."""
    resp = requests.get(f"{MET_BASE}/objects/{object_id}")
    return resp.json() if resp.status_code == 200 else None

ids = search_met("impressionist", department_id=11)
print(f"Found {len(ids)} objects")
```
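The department ID passed above comes from the `/departments` endpoint. A small helper can turn that response into a name-to-ID lookup so you don't hardcode magic numbers; this is a sketch, and `department_index` is a name introduced here, not part of the Met API:

```python
def department_index(departments):
    """Build a {displayName: departmentId} lookup from the Met's
    /departments response (a list of dicts)."""
    return {d["displayName"]: d["departmentId"] for d in departments}

# Sample data mirroring the shape of the Met's /departments response.
sample = [
    {"departmentId": 11, "displayName": "European Paintings"},
    {"departmentId": 6, "displayName": "Asian Art"},
]
index = department_index(sample)
print(index["European Paintings"])  # → 11
```

In a real run you would call `department_index(get_met_departments())` once and reuse the dict for every search.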
Batch Collection with Rate Limiting
```python
def collect_objects(object_ids, max_count=500):
    """Fetch object metadata in bulk, pausing between requests."""
    objects = []
    for oid in tqdm(object_ids[:max_count]):
        obj = get_met_object(oid)
        if obj:
            objects.append({
                "id": obj["objectID"],
                "title": obj.get("title", ""),
                "artist": obj.get("artistDisplayName", ""),
                "date": obj.get("objectDate", ""),
                "medium": obj.get("medium", ""),
                "department": obj.get("department", ""),
                "culture": obj.get("culture", ""),
                "image_url": obj.get("primaryImageSmall", ""),
                "is_public_domain": obj.get("isPublicDomain", False),
            })
        time.sleep(0.1)  # be polite: pause between requests
    return objects

objects = collect_objects(ids)
```
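As written, the collector silently drops any object whose request fails, even transiently. A retry wrapper with exponential backoff makes batch runs more resilient; this is a sketch, and the helper name and defaults are choices made here, not part of the Met API:

```python
import time

def with_retries(fetch, retries=3, backoff=0.5):
    """Call fetch(); on exception, sleep and retry with doubling delays.
    Returns None if every attempt fails."""
    delay = backoff
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                return None
            time.sleep(delay)
            delay *= 2  # exponential backoff: 0.5s, 1s, 2s, ...
    return None

# Usage with the Met helper defined earlier:
# obj = with_retries(lambda: get_met_object(oid))
```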
Smithsonian API
The Smithsonian requires a free API key from api.data.gov:
```python
SI_KEY = "YOUR_SMITHSONIAN_KEY"  # free key from api.data.gov
SI_BASE = "https://api.si.edu/openaccess/api/v1.0"

def search_smithsonian(query, rows=100):
    """Search Smithsonian Open Access and extract a few basic fields."""
    params = {"api_key": SI_KEY, "q": query, "rows": rows}
    resp = requests.get(f"{SI_BASE}/search", params=params)
    data = resp.json()
    results = []
    for row in data.get("response", {}).get("rows", []):
        content = row.get("content", {})
        desc = content.get("descriptiveNonRepeating", {})
        results.append({
            "title": desc.get("title", {}).get("content", ""),
            "unit_code": desc.get("unit_code", ""),
            "record_link": desc.get("record_link", ""),
        })
    return results

artifacts = search_smithsonian("space exploration")
print(f"Found {len(artifacts)} Smithsonian records")
```
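`search_smithsonian` only fetches a single page. The search endpoint also accepts a start offset alongside rows (to the best of my reading of the docs), so pagination can be sketched as a generator. The page-fetching callable is injected so the loop logic stands on its own; `paginate` and `fetch_page` are names introduced here:

```python
def paginate(fetch_page, rows=100, max_records=1000):
    """Yield records page by page. fetch_page(start, rows) should return
    the list of rows for that offset (an empty list when exhausted)."""
    start = 0
    while start < max_records:
        page = fetch_page(start, rows)
        if not page:
            break
        for record in page:
            yield record
        if len(page) < rows:
            break  # a short page means we've reached the end
        start += rows

# Wiring it to the API (assumes a 'start' query parameter):
# def fetch_page(start, rows):
#     params = {"api_key": SI_KEY, "q": "space exploration",
#               "start": start, "rows": rows}
#     resp = requests.get(f"{SI_BASE}/search", params=params)
#     return resp.json().get("response", {}).get("rows", [])
# all_rows = list(paginate(fetch_page, rows=100, max_records=500))
```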
Analysis and Visualization
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(objects)

# Map the boolean index to readable labels rather than hardcoding a label
# list, whose order could silently mislabel the slices.
pd_counts = (df["is_public_domain"]
             .value_counts()
             .rename(index={True: "Public Domain", False: "Restricted"}))
pd_counts.plot(kind="pie", autopct="%1.1f%%", figsize=(6, 6))
plt.title("Met Collection: Public Domain Status")
plt.savefig("public_domain.png")
plt.close()

top_artists = df[df["artist"] != ""]["artist"].value_counts().head(15)
top_artists.plot(kind="barh", figsize=(10, 6))
plt.title("Most Represented Artists")
plt.tight_layout()
plt.savefig("top_artists.png")
plt.close()
```
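The `objectDate` field is free text ("ca. 1875", "1890–1900", "19th century"), so any plot by decade or century first needs a year parser. A minimal sketch; the function name and regex are choices made here, and it only recovers explicit four-digit years:

```python
import re

def extract_year(date_str):
    """Pull the first four-digit year (1000-2099) out of a free-text
    date string, or return None if there isn't one."""
    match = re.search(r"\b(1\d{3}|20\d{2})\b", date_str or "")
    return int(match.group(1)) if match else None

print(extract_year("ca. 1875"))      # → 1875
print(extract_year("1890–1900"))     # → 1890
print(extract_year("19th century"))  # → None
```

With that in place, `df["year"] = df["date"].map(extract_year)` gives a numeric column you can bin into decades for a histogram.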
For museums without public APIs, a rendering-capable scraping service such as ScraperAPI can handle JavaScript-heavy collection pages; at scale, rotating proxies (e.g. ThorData) and request monitoring (e.g. ScrapeOps) help keep large crawls reliable.
Key Takeaways
- Major museums offer free, open APIs with millions of records
- The Met API requires no authentication at all
- Rate limiting and batching are essential for large collections
- Public domain artworks can be freely used in projects
Museum APIs are designed for public access. Respect rate limits and attribution requirements.