Alex Spinov
The Internet Archive Has a Free API — Search 800B+ Web Pages Programmatically

The Wayback Machine Has an API

Most developers know the Wayback Machine for browsing old websites. Few know it has a free API that lets you search 800+ billion archived web pages programmatically.

No API key. No signup. No hard rate limit (just be polite).

Check If a URL Was Archived

import requests

def check_archive(url):
    api = f"https://archive.org/wayback/available?url={url}"
    r = requests.get(api)
    snapshot = r.json().get("archived_snapshots", {}).get("closest", {})
    if snapshot:
        return {
            "available": True,
            "url": snapshot["url"],
            "timestamp": snapshot["timestamp"],
            "status": snapshot["status"]
        }
    return {"available": False}

result = check_archive("google.com")
print(result)
# {'available': True, 'url': 'https://web.archive.org/web/20260101.../google.com', ...}
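The `timestamp` field in the response is a 14-digit `YYYYMMDDhhmmss` string. A small helper (my own, not part of the API) converts it into a `datetime` for comparisons and formatting:

```python
from datetime import datetime

def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a Wayback Machine timestamp (YYYYMMDDhhmmss) into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

# parse_wayback_timestamp("20240115123000") -> datetime(2024, 1, 15, 12, 30)
```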

Get All Snapshots of a URL

def get_snapshots(url, year="2025"):
    cdx = f"https://web.archive.org/cdx/search/cdx?url={url}&output=json&from={year}&limit=10"
    r = requests.get(cdx)
    rows = r.json()
    if len(rows) > 1:
        headers = rows[0]
        return [dict(zip(headers, row)) for row in rows[1:]]
    return []

snapshots = get_snapshots("example.com", "2024")
for s in snapshots[:3]:
    print(f"{s['timestamp']} {s['statuscode']}")

Real Use Cases

  1. SEO monitoring — track how competitor pages change over time
  2. Legal evidence — prove what a website said on a specific date
  3. Content recovery — find deleted pages, articles, or resources
  4. Research — analyze how websites evolved (design trends, pricing changes)
  5. Link rot detection — check if referenced URLs still exist or have archives
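As a sketch of the last use case, a link-rot checker can combine a live HEAD request with the availability API shown earlier. The function names here are my own, not part of any library:

```python
import requests

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"

def classify(alive: bool, archived: bool) -> str:
    """Label a reference: still up, recoverable from the archive, or gone for good."""
    if alive:
        return "ok"
    return "recoverable" if archived else "lost"

def link_rot_report(urls):
    """Check each URL's live status and whether the Wayback Machine has a copy."""
    report = []
    for url in urls:
        try:
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
            alive = status < 400
        except requests.RequestException:
            alive = False
        snap = requests.get(WAYBACK_AVAILABLE, params={"url": url}, timeout=10)
        archived = bool(snap.json().get("archived_snapshots", {}).get("closest"))
        report.append({"url": url, "status": classify(alive, archived)})
    return report
```

Running `link_rot_report` over a bibliography or a list of outbound links gives you a quick triage: dead links with an archive copy can be swapped for their `web.archive.org` snapshot URL.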

Save a Page to the Archive

def save_page(url):
    save_url = f"https://web.archive.org/save/{url}"
    r = requests.get(save_url)
    if r.status_code == 200:
        return f"Saved: {r.url}"
    return f"Error: {r.status_code}"

# Save your own site for posterity
print(save_page("https://your-site.com"))

Search the Full Text Archive

The Internet Archive also hosts 40M+ books, papers, and media:

def search_archive(query, media_type="texts"):
    url = "https://archive.org/advancedsearch.php"
    params = {
        "q": query,
        "fl[]": ["identifier", "title", "creator", "date"],
        "rows": 5,
        "output": "json",
        "mediatype": media_type
    }
    r = requests.get(url, params=params)
    return r.json()["response"]["docs"]

books = search_archive("machine learning")
for b in books:
    print(f"{b.get('date', 'N/A')[:4]} | {b['title'][:60]}")

Rate Limits & Best Practices

  • No official rate limit, but be respectful (~1 req/sec)
  • CDX API can return millions of rows — always use limit parameter
  • For bulk operations, use &matchType=prefix to get all pages under a domain
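Putting those last two tips together, here is a sketch of a domain-wide CDX query. The helper that builds the query parameters is my own; `matchType`, `limit`, `collapse`, and `fl` are documented CDX API parameters:

```python
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_params(domain: str, limit: int = 20) -> dict:
    """Build CDX query params for listing archived pages under a domain."""
    return {
        "url": f"{domain}/",
        "matchType": "prefix",   # everything under this path, not just the exact URL
        "output": "json",
        "limit": limit,          # always cap results -- CDX can return millions of rows
        "collapse": "urlkey",    # one row per unique URL instead of per capture
        "fl": "original,timestamp,statuscode",
    }

def list_domain_pages(domain: str, limit: int = 20):
    r = requests.get(CDX_ENDPOINT, params=build_cdx_params(domain, limit), timeout=30)
    rows = r.json()
    # First row is the header; zip it against each data row
    return [dict(zip(rows[0], row)) for row in rows[1:]] if len(rows) > 1 else []
```

`collapse=urlkey` is optional but keeps the result set to distinct pages rather than every capture of the same page.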

I build data tools and scrapers. My GitHub has 300+ repos with tutorials and automation scripts.


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
