How to Scrape Museum Collections: Metropolitan and Smithsonian APIs
The world's greatest museums have digitized millions of artworks and artifacts, many accessible through free public APIs. Let's build a Python pipeline to collect, analyze, and visualize museum collection data.
Available Museum APIs
- Metropolitan Museum of Art — 470K+ objects, fully open, no key needed
- Smithsonian Institution — 15M+ records across 19 museums
- Rijksmuseum — Dutch masters and more
- Harvard Art Museums — 250K+ objects
Setting Up
```shell
pip install requests pandas matplotlib tqdm
```
The Met Collection API
The Met's API is completely open — no key required:
```python
import requests
import time
from tqdm import tqdm

MET_BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

def get_met_departments():
    """Return the list of Met departments (each has a name and an ID)."""
    resp = requests.get(f"{MET_BASE}/departments")
    return resp.json()["departments"]

def search_met(query, department_id=None):
    """Search the collection; returns a list of matching object IDs."""
    params = {"q": query}
    if department_id:
        params["departmentId"] = department_id
    resp = requests.get(f"{MET_BASE}/search", params=params)
    # The API returns "objectIDs": null when there are no hits,
    # so guard against None as well as a missing key.
    return resp.json().get("objectIDs") or []

def get_met_object(object_id):
    """Fetch full metadata for a single object, or None on error."""
    resp = requests.get(f"{MET_BASE}/objects/{object_id}")
    return resp.json() if resp.status_code == 200 else None

ids = search_met("impressionist", department_id=11)
print(f"Found {len(ids)} objects")
```
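The department ID passed above comes from the `/departments` endpoint. A small helper can turn that response into a name-to-ID lookup so you don't hardcode magic numbers; this is a sketch, and `department_index` is a name introduced here, not part of the Met API:

```python
def department_index(departments):
    """Build a {displayName: departmentId} lookup from the Met's
    /departments response (a list of dicts)."""
    return {d["displayName"]: d["departmentId"] for d in departments}

# Sample data mirroring the shape of the Met's /departments response.
sample = [
    {"departmentId": 11, "displayName": "European Paintings"},
    {"departmentId": 6, "displayName": "Asian Art"},
]
index = department_index(sample)
print(index["European Paintings"])  # → 11
```

In a real run you would call `department_index(get_met_departments())` once and reuse the dict for every search.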
Batch Collection with Rate Limiting
```python
def collect_objects(object_ids, max_count=500):
    """Fetch object metadata in bulk, pausing between requests."""
    objects = []
    for oid in tqdm(object_ids[:max_count]):
        obj = get_met_object(oid)
        if obj:
            objects.append({
                "id": obj["objectID"],
                "title": obj.get("title", ""),
                "artist": obj.get("artistDisplayName", ""),
                "date": obj.get("objectDate", ""),
                "medium": obj.get("medium", ""),
                "department": obj.get("department", ""),
                "culture": obj.get("culture", ""),
                "image_url": obj.get("primaryImageSmall", ""),
                "is_public_domain": obj.get("isPublicDomain", False),
            })
        time.sleep(0.1)  # be polite: pause between requests
    return objects

objects = collect_objects(ids)
```
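As written, the collector silently drops any object whose request fails, even transiently. A retry wrapper with exponential backoff makes batch runs more resilient; this is a sketch, and the helper name and defaults are choices made here, not part of the Met API:

```python
import time

def with_retries(fetch, retries=3, backoff=0.5):
    """Call fetch(); on exception, sleep and retry with doubling delays.
    Returns None if every attempt fails."""
    delay = backoff
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                return None
            time.sleep(delay)
            delay *= 2  # exponential backoff: 0.5s, 1s, 2s, ...
    return None

# Usage with the Met helper defined earlier:
# obj = with_retries(lambda: get_met_object(oid))
```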
Smithsonian API
The Smithsonian requires a free API key from api.data.gov:
```python
SI_KEY = "YOUR_SMITHSONIAN_KEY"  # free key from api.data.gov
SI_BASE = "https://api.si.edu/openaccess/api/v1.0"

def search_smithsonian(query, rows=100):
    """Search Smithsonian Open Access and extract a few basic fields."""
    params = {"api_key": SI_KEY, "q": query, "rows": rows}
    resp = requests.get(f"{SI_BASE}/search", params=params)
    data = resp.json()
    results = []
    for row in data.get("response", {}).get("rows", []):
        content = row.get("content", {})
        desc = content.get("descriptiveNonRepeating", {})
        results.append({
            "title": desc.get("title", {}).get("content", ""),
            "unit_code": desc.get("unit_code", ""),
            "record_link": desc.get("record_link", ""),
        })
    return results

artifacts = search_smithsonian("space exploration")
print(f"Found {len(artifacts)} Smithsonian records")
```
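`search_smithsonian` only fetches a single page. The search endpoint also accepts a start offset alongside rows (to the best of my reading of the docs), so pagination can be sketched as a generator. The page-fetching callable is injected so the loop logic stands on its own; `paginate` and `fetch_page` are names introduced here:

```python
def paginate(fetch_page, rows=100, max_records=1000):
    """Yield records page by page. fetch_page(start, rows) should return
    the list of rows for that offset (an empty list when exhausted)."""
    start = 0
    while start < max_records:
        page = fetch_page(start, rows)
        if not page:
            break
        for record in page:
            yield record
        if len(page) < rows:
            break  # a short page means we've reached the end
        start += rows

# Wiring it to the API (assumes a 'start' query parameter):
# def fetch_page(start, rows):
#     params = {"api_key": SI_KEY, "q": "space exploration",
#               "start": start, "rows": rows}
#     resp = requests.get(f"{SI_BASE}/search", params=params)
#     return resp.json().get("response", {}).get("rows", [])
# all_rows = list(paginate(fetch_page, rows=100, max_records=500))
```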
Analysis and Visualization
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(objects)

# Map the boolean index to readable labels rather than hardcoding a label
# list, whose order could silently mislabel the slices.
pd_counts = (df["is_public_domain"]
             .value_counts()
             .rename(index={True: "Public Domain", False: "Restricted"}))
pd_counts.plot(kind="pie", autopct="%1.1f%%", figsize=(6, 6))
plt.title("Met Collection: Public Domain Status")
plt.savefig("public_domain.png")
plt.close()

top_artists = df[df["artist"] != ""]["artist"].value_counts().head(15)
top_artists.plot(kind="barh", figsize=(10, 6))
plt.title("Most Represented Artists")
plt.tight_layout()
plt.savefig("top_artists.png")
plt.close()
```
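The `objectDate` field is free text ("ca. 1875", "1890–1900", "19th century"), so any plot by decade or century first needs a year parser. A minimal sketch; the function name and regex are choices made here, and it only recovers explicit four-digit years:

```python
import re

def extract_year(date_str):
    """Pull the first four-digit year (1000-2099) out of a free-text
    date string, or return None if there isn't one."""
    match = re.search(r"\b(1\d{3}|20\d{2})\b", date_str or "")
    return int(match.group(1)) if match else None

print(extract_year("ca. 1875"))      # → 1875
print(extract_year("1890–1900"))     # → 1890
print(extract_year("19th century"))  # → None
```

With that in place, `df["year"] = df["date"].map(extract_year)` gives a numeric column you can bin into decades for a histogram.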
For museums without public APIs, a rendering-capable scraping service such as ScraperAPI can handle JavaScript-heavy collection pages; at scale, rotating proxies (e.g. ThorData) and request monitoring (e.g. ScrapeOps) help keep large crawls reliable.
Key Takeaways
- Major museums offer free, open APIs with millions of records
- The Met API requires no authentication at all
- Rate limiting and batching are essential for large collections
- Public domain artworks can be freely used in projects
Museum APIs are designed for public access. Respect rate limits and attribution requirements.