DEV Community

agenthustler
How to Scrape Museum Collections: Metropolitan and Smithsonian APIs

The world's greatest museums have digitized millions of artworks and artifacts, many accessible through free public APIs. Let's build a Python pipeline to collect, analyze, and visualize museum collection data.

Available Museum APIs

  • Metropolitan Museum of Art — 470K+ objects, fully open, no key needed
  • Smithsonian Institution — 15M+ records across 19 museums
  • Rijksmuseum — Dutch masters and more
  • Harvard Art Museums — 250K+ objects

Setting Up

pip install requests pandas matplotlib tqdm

The Met Collection API

The Met's API is completely open: no key or registration is required. The Met asks clients to stay under 80 requests per second.

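import requests
import time
from tqdm import tqdm

MET_BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

def get_met_departments():
    """Return the departments as dicts with departmentId and displayName."""
    resp = requests.get(f"{MET_BASE}/departments", timeout=30)
    resp.raise_for_status()
    return resp.json()["departments"]

def search_met(query, department_id=None):
    """Return the object IDs that match a keyword search."""
    params = {"q": query}
    if department_id is not None:
        params["departmentId"] = department_id
    resp = requests.get(f"{MET_BASE}/search", params=params, timeout=30)
    resp.raise_for_status()
    # The API returns "objectIDs": null when nothing matches, so fall back to []
    return resp.json().get("objectIDs") or []

def get_met_object(object_id):
    """Return one object record, or None if the request does not succeed."""
    resp = requests.get(f"{MET_BASE}/objects/{object_id}", timeout=30)
    return resp.json() if resp.status_code == 200 else None

# Department 11 is European Paintings
ids = search_met("impressionist", department_id=11)
print(f"Found {len(ids)} objects")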
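
The search endpoint accepts more filters than departmentId, including hasImages, medium, dateBegin, and dateEnd. The sketch below shows one way to assemble them. The helper name build_met_search_params and its argument names are hypothetical, and sending booleans as lowercase "true"/"false" strings is an assumption about what the endpoint expects.

```python
def build_met_search_params(query, department_id=None, has_images=None,
                            medium=None, date_range=None):
    """Build a query-string dict for the Met /search endpoint (hypothetical helper)."""
    params = {"q": query}
    if department_id is not None:
        params["departmentId"] = department_id
    if has_images is not None:
        # Assumption: booleans are passed as lowercase "true"/"false"
        params["hasImages"] = "true" if has_images else "false"
    if medium:
        # Multiple media are joined with "|", e.g. "Paintings|Drawings"
        params["medium"] = "|".join(medium)
    if date_range is not None:
        # dateBegin and dateEnd are supplied together as a pair
        begin, end = date_range
        if begin > end:
            raise ValueError("date_range must be (begin, end) with begin <= end")
        params["dateBegin"] = begin
        params["dateEnd"] = end
    return params


params = build_met_search_params(
    "sunflowers", department_id=11, has_images=True,
    medium=["Paintings"], date_range=(1850, 1900),
)
print(params)
```

The resulting dict can be passed as params= to the same requests.get call that search_met uses.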

Batch Collection with Rate Limiting

def collect_objects(object_ids, max_count=500):
    """Fetch up to max_count objects and keep a flat subset of their fields."""
    objects = []
    for oid in tqdm(object_ids[:max_count]):
        obj = get_met_object(oid)
        if obj:
            objects.append({
                "id": obj["objectID"],
                "title": obj.get("title", ""),
                "artist": obj.get("artistDisplayName", ""),
                "date": obj.get("objectDate", ""),
                "medium": obj.get("medium", ""),
                "department": obj.get("department", ""),
                "culture": obj.get("culture", ""),
                "image_url": obj.get("primaryImageSmall", ""),
                "is_public_domain": obj.get("isPublicDomain", False)
            })
        # Pause between requests: about 10 per second at most
        time.sleep(0.1)
    return objects

objects = collect_objects(ids)
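
A fixed sleep works, but for longer runs it can help to separate pacing from fetching and to retry transient failures. The following is a minimal sketch of that idea, not part of any museum's client library. Throttle and fetch_with_retry are hypothetical names, and the clock and sleep arguments exist only so the logic can be exercised without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_retry(fetch, arg, throttle, retries=3, backoff=1.0,
                     sleep=time.sleep):
    """Call fetch(arg), retrying on exceptions with exponential backoff."""
    for attempt in range(retries):
        throttle.wait()
        try:
            return fetch(arg)
        except Exception:
            # Give up after the final attempt; otherwise wait 1x, 2x, 4x...
            if attempt == retries - 1:
                raise
            sleep(backoff * 2 ** attempt)
```

Inside the collection loop this could be used as fetch_with_retry(get_met_object, oid, Throttle(10)), with the Throttle created once outside the loop. Because get_met_object returns None for non-200 responses rather than raising, only network-level exceptions would trigger a retry here.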

Smithsonian API

The Smithsonian requires a free API key from api.data.gov:

SI_KEY = "YOUR_SMITHSONIAN_KEY"  # free key from api.data.gov
SI_BASE = "https://api.si.edu/openaccess/api/v1.0"

def search_smithsonian(query, rows=100):
    """Return title, unit code, and record link for up to `rows` matches."""
    params = {"api_key": SI_KEY, "q": query, "rows": rows}
    resp = requests.get(f"{SI_BASE}/search", params=params, timeout=30)
    # Surface a bad key or server error here instead of failing on .json()
    resp.raise_for_status()
    data = resp.json()
    results = []
    for row in data.get("response", {}).get("rows", []):
        content = row.get("content", {})
        desc = content.get("descriptiveNonRepeating", {})
        results.append({
            "title": desc.get("title", {}).get("content", ""),
            "unit_code": desc.get("unit_code", ""),
            "record_link": desc.get("record_link", "")
        })
    return results

# The count is capped by `rows`, so this is the number fetched, not the total
artifacts = search_smithsonian("space exploration")
print(f"Fetched {len(artifacts)} Smithsonian records")
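
A single request returns only one page of results. To walk a larger result set, the search endpoint's start parameter can be advanced alongside rows. The sketch below assumes the response carries a rowCount total next to rows. iter_smithsonian_rows and make_fetch_page are hypothetical helpers, and the fetcher is passed in so the paging logic can be tried without a network call.

```python
def iter_smithsonian_rows(fetch_page, query, page_size=100, max_rows=1000):
    """Yield raw rows for a query, one page at a time.

    fetch_page(query, start, rows) is a hypothetical callable returning
    the decoded JSON of one /search response.
    """
    start = 0
    while start < max_rows:
        rows_wanted = min(page_size, max_rows - start)
        data = fetch_page(query, start, rows_wanted)
        response = data.get("response", {})
        rows = response.get("rows", [])
        if not rows:
            break
        yield from rows
        start += len(rows)
        # Assumption: rowCount reports the total number of matches
        total = response.get("rowCount")
        if total is not None and start >= total:
            break


def make_fetch_page(api_key, base="https://api.si.edu/openaccess/api/v1.0"):
    """Build a fetch_page callable backed by requests."""
    import requests  # imported lazily so the paging logic has no hard dependency

    def fetch_page(query, start, rows):
        params = {"api_key": api_key, "q": query, "start": start, "rows": rows}
        resp = requests.get(f"{base}/search", params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()

    return fetch_page
```

For example, list(iter_smithsonian_rows(make_fetch_page(SI_KEY), "space exploration", max_rows=500)) would collect up to 500 raw rows, which can then be flattened the same way search_smithsonian does.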

Analysis and Visualization

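import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(objects)

# Derive labels from the values: value_counts() sorts by frequency, so a
# hard-coded label list can end up attached to the wrong slice
pd_counts = df["is_public_domain"].map(
    {True: "Public Domain", False: "Restricted"}).value_counts()
fig, ax = plt.subplots(figsize=(6, 6))
pd_counts.plot(kind="pie", autopct="%1.1f%%", ax=ax)
ax.set_ylabel("")
ax.set_title("Met Collection: Public Domain Status")
fig.savefig("public_domain.png")
plt.close(fig)

# Use a fresh figure so the bar chart is not drawn over the pie chart
top_artists = df[df["artist"] != ""]["artist"].value_counts().head(15)
fig, ax = plt.subplots(figsize=(10, 6))
top_artists.plot(kind="barh", ax=ax)
ax.set_title("Most Represented Artists")
fig.tight_layout()
fig.savefig("top_artists.png")
plt.close(fig)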
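
The date column holds free text such as "ca. 1885", so it needs parsing before it can be charted over time. Here is a rough sketch that pulls the first four-digit year out of each value and counts records per decade. first_year and decade_counts are hypothetical helpers, the sample records are made up for illustration, and the regex deliberately ignores values like "19th century" or BC dates.

```python
import re
from collections import Counter

# Matches a standalone four-digit year from 1000 to 2099
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def first_year(date_text):
    """Return the first four-digit year in a free-text date, or None."""
    match = YEAR_RE.search(date_text or "")
    return int(match.group(1)) if match else None


def decade_counts(records):
    """Count records per decade, skipping dates with no recognizable year."""
    counts = Counter()
    for rec in records:
        year = first_year(rec.get("date", ""))
        if year is not None:
            counts[year // 10 * 10] += 1
    return counts


# Made-up sample records in the shape produced by collect_objects
sample = [
    {"date": "ca. 1885"},
    {"date": "1887–89"},
    {"date": "1890"},
    {"date": "19th century"},
]
print(sorted(decade_counts(sample).items()))  # [(1880, 2), (1890, 1)]
```

Passing the real records, for instance pd.Series(decade_counts(objects)).sort_index().plot(kind="bar"), gives a quick view of how the collection is distributed over time. Met object records also include numeric objectBeginDate and objectEndDate fields, which are likely a more reliable source than text parsing if they are added to the collected fields.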

For museums without APIs, use ScraperAPI with JS rendering. Scale with ThorData proxies and monitor with ScrapeOps.

Key Takeaways

  • Major museums offer free, open APIs with millions of records
  • The Met API requires no authentication at all
  • Rate limiting and batching are essential for large collections
  • Works the Met flags as public domain (isPublicDomain) are released under CC0 and can be freely used in projects

Museum APIs are designed for public access. Respect rate limits and attribution requirements.
