The Wayback Machine Has an API
Most developers know the Wayback Machine for browsing old websites. Few know it has a free API that lets you query its 800+ billion archived web pages programmatically.
No API key. No signup. No official rate limit (just be polite).
Check If a URL Was Archived
```python
import requests

def check_archive(url):
    """Ask the availability API for the closest snapshot of a URL."""
    api = f"https://archive.org/wayback/available?url={url}"
    r = requests.get(api)
    snapshot = r.json().get("archived_snapshots", {}).get("closest", {})
    if snapshot:
        return {
            "available": True,
            "url": snapshot["url"],
            "timestamp": snapshot["timestamp"],
            "status": snapshot["status"],
        }
    return {"available": False}

result = check_archive("google.com")
print(result)
# {"available": True, "url": "https://web.archive.org/web/20260101.../google.com", ...}
```
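The `timestamp` field is a 14-digit `YYYYMMDDhhmmss` string. A small helper (my own, not part of the API) converts it to a `datetime` for sorting or age calculations:

```python
from datetime import datetime

def parse_wayback_timestamp(ts):
    # Wayback timestamps are YYYYMMDDhhmmss, e.g. "20240115103000"
    return datetime.strptime(ts, "%Y%m%d%H%M%S")
```

Handy for figuring out how stale an archived copy is before trusting it.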
Get All Snapshots of a URL
```python
def get_snapshots(url, year="2025"):
    """List snapshots via the CDX API; the first row holds the column headers."""
    cdx = f"https://web.archive.org/cdx/search/cdx?url={url}&output=json&from={year}&limit=10"
    r = requests.get(cdx)
    rows = r.json()
    if len(rows) > 1:
        headers = rows[0]
        return [dict(zip(headers, row)) for row in rows[1:]]
    return []

snapshots = get_snapshots("example.com", "2024")
for s in snapshots[:3]:
    print(f"{s['timestamp']} — {s['statuscode']}")
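Each CDX row gives you a timestamp plus the original URL, and direct snapshot links follow a fixed pattern, so you can build them without another API call. A tiny helper (my own naming) to do that:

```python
def snapshot_link(timestamp, original):
    # Direct Wayback URL: /web/<14-digit timestamp>/<original URL>
    return f"https://web.archive.org/web/{timestamp}/{original}"
```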
Real Use Cases
- SEO monitoring — track how competitor pages change over time
- Legal evidence — prove what a website said on a specific date
- Content recovery — find deleted pages, articles, or resources
- Research — analyze how websites evolved (design trends, pricing changes)
- Link rot detection — check if referenced URLs still exist or have archives
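The link rot case above is easy to automate with the availability API. A sketch, assuming a dead link means a connection failure, 404/410, or a 5xx (the helper names and the `DEAD` set are my own choices, not anything the API defines):

```python
import requests

DEAD = {404, 410}  # statuses treated as link rot (an assumption for this sketch)

def is_rotten(status_code):
    """Pure check: does this HTTP status indicate a dead link?"""
    return status_code in DEAD or status_code >= 500

def archive_or_original(url, timeout=10):
    """Return url if it still resolves, otherwise its closest Wayback snapshot."""
    try:
        status = requests.head(url, timeout=timeout, allow_redirects=True).status_code
    except requests.RequestException:
        status = 599  # treat connection failures as dead
    if not is_rotten(status):
        return url
    r = requests.get("https://archive.org/wayback/available",
                     params={"url": url}, timeout=timeout)
    closest = r.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url", url)
```

Run this over the reference list of a blog post and you get a drop-in replacement URL for every dead link that has an archive.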
Save a Page to the Archive
```python
def save_page(url):
    """Request an on-demand capture via the Save Page Now endpoint."""
    save_url = f"https://web.archive.org/save/{url}"
    r = requests.get(save_url)
    if r.status_code == 200:
        return f"Saved: {r.url}"
    return f"Error: {r.status_code}"

# Save your own site for posterity
print(save_page("https://your-site.com"))
```
Search the Full Text Archive
The Internet Archive also hosts 40M+ books, papers, and media files:
```python
def search_archive(query, media_type="texts"):
    """Full-text search; the mediatype filter goes inside the query string."""
    url = "https://archive.org/advancedsearch.php"
    params = {
        "q": f"{query} AND mediatype:{media_type}",
        "fl[]": ["identifier", "title", "creator", "date"],
        "rows": 5,
        "output": "json",
    }
    r = requests.get(url, params=params)
    return r.json()["response"]["docs"]

books = search_archive("machine learning")
for b in books:
    print(f"{b.get('date', 'N/A')[:4]} | {b['title'][:60]}")
```
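For more than a handful of results, the endpoint also supports paging via a `page` parameter alongside `rows`. A small builder (my own helper, based on that assumption) keeps the query assembly in one place:

```python
def search_params(query, media_type="texts", rows=5, page=1):
    """Build the advancedsearch.php parameter dict, with paging."""
    return {
        "q": f"{query} AND mediatype:{media_type}",
        "fl[]": ["identifier", "title", "creator", "date"],
        "rows": rows,
        "page": page,
        "output": "json",
    }
```

Pass the result straight to `requests.get("https://archive.org/advancedsearch.php", params=...)` and bump `page` to walk through the full result set.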
Rate Limits & Best Practices
- No official rate limit, but be respectful (~1 req/sec)
- CDX API can return millions of rows — always use the `limit` parameter
- For bulk operations, use `&matchType=prefix` to get all pages under a domain
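The ~1 req/sec guideline is easy to enforce with a thin wrapper around `requests` (a sketch; `PoliteSession` and its interval are my own invention, not an archive.org feature):

```python
import time
import requests

class PoliteSession:
    """GET wrapper that enforces a minimum delay between requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def _throttle(self):
        # Sleep just long enough to keep at least min_interval between calls
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()

    def get(self, url, **kwargs):
        self._throttle()
        return requests.get(url, **kwargs)
```

Route every archive.org call through one shared `PoliteSession` and you never have to think about pacing again.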
I build data tools and scrapers; my GitHub has 300+ repos with tutorials and automation scripts.
More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs