How to Scrape Wayback Machine: Historical Web Data with Python

The Wayback Machine stores over 800 billion web pages dating back to 1996. This data is invaluable for research, competitive analysis, content recovery, and tracking website evolution.

CDX API: The Power Tool

The Wayback Machine provides a CDX API that returns structured data about archived URLs, so you never need to scrape the index pages themselves.

import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup
import difflib

class WaybackScraper:
    CDX_API = "https://web.archive.org/cdx/search/cdx"
    WEB_BASE = "https://web.archive.org/web"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': 'WaybackResearch/1.0'})

    def get_snapshots(self, url, from_date=None, to_date=None, limit=1000):
        """Query the CDX index for archived snapshots of a URL."""
        params = {'url': url, 'output': 'json', 'limit': limit,
                  'fl': 'timestamp,original,statuscode,mimetype,length'}
        if from_date: params['from'] = from_date
        if to_date: params['to'] = to_date
        resp = self.session.get(self.CDX_API, params=params)
        resp.raise_for_status()
        data = resp.json()
        if len(data) <= 1:  # first row is the header; nothing beyond it means no results
            return []
        headers = data[0]
        return [dict(zip(headers, row)) for row in data[1:]]

    def get_page(self, url, timestamp):
        # The id_ modifier returns the original page without the Wayback toolbar.
        return self.session.get(f"{self.WEB_BASE}/{timestamp}id_/{url}").text
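A minimal usage sketch, listing a year of snapshots and fetching the earliest one returned (the example.com URL is a placeholder):

# List 2020 snapshots of a page, then fetch the first one returned.
wb = WaybackScraper()
snaps = wb.get_snapshots('https://example.com', from_date='20200101', to_date='20201231')
for s in snaps[:5]:
    print(s['timestamp'], s['statuscode'], s['original'])
if snaps:
    html = wb.get_page('https://example.com', snaps[0]['timestamp'])
    print(f"Fetched {len(html)} characters of archived HTML")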

Tracking Changes Over Time

class EvolutionTracker:
    def __init__(self, scraper):
        self.scraper = scraper

    def track_changes(self, url, selector, samples=10):
        """Sample snapshots evenly across the archive and extract one element."""
        snaps = self.scraper.get_snapshots(url)
        if not snaps:
            return []
        step = max(1, len(snaps) // samples)  # even spacing across the history
        versions = []
        for snap in snaps[::step]:
            try:
                html = self.scraper.get_page(url, snap['timestamp'])
                soup = BeautifulSoup(html, 'html.parser')
                el = soup.select_one(selector)
                if el:
                    versions.append({
                        'timestamp': snap['timestamp'],
                        'date': datetime.strptime(snap['timestamp'][:8], '%Y%m%d').isoformat(),
                        'content': el.get_text(strip=True)[:500]
                    })
                time.sleep(1)  # be polite to the Internet Archive
            except Exception as e:
                print(f"Error at {snap['timestamp']}: {e}")
        return versions

    def compare(self, versions):
        """Diff consecutive versions and report where content changed."""
        changes = []
        for i in range(1, len(versions)):
            old = versions[i-1]['content'].splitlines()
            new = versions[i]['content'].splitlines()
            diff = list(difflib.unified_diff(old, new, lineterm=''))
            if diff:
                changes.append({'from': versions[i-1]['date'], 'to': versions[i]['date'],
                                'diff_lines': len(diff)})
        return changes

Bulk Collection

import os

class BulkCollector:
    def __init__(self, scraper):
        self.scraper = scraper

    def domain_history(self, domain):
        """List every unique archived URL under a domain (HTTP 200 only)."""
        params = {'url': domain, 'output': 'json', 'matchType': 'domain',
                  'fl': 'timestamp,original,statuscode', 'collapse': 'urlkey',
                  'filter': 'statuscode:200'}
        data = self.scraper.session.get(self.scraper.CDX_API, params=params).json()
        if len(data) <= 1:
            return []
        return [dict(zip(data[0], row)) for row in data[1:]]

    def download_snapshot(self, domain, timestamp, output_dir='wayback_data'):
        """Download up to 100 archived pages closest to the given timestamp."""
        os.makedirs(output_dir, exist_ok=True)
        urls = self.domain_history(domain)
        count = 0
        for entry in urls[:100]:
            try:
                content = self.scraper.get_page(entry['original'], timestamp)
                # Turn the archived URL into a relative file path.
                path = entry['original'].replace(f'https://{domain}', '').replace(f'http://{domain}', '')
                path = path.split('?')[0].strip('/') or 'index.html'
                fp = os.path.join(output_dir, path)
                os.makedirs(os.path.dirname(fp), exist_ok=True)
                with open(fp, 'w', encoding='utf-8') as f:
                    f.write(content)
                count += 1
                time.sleep(1)
            except Exception as e:
                print(f"Skipping {entry['original']}: {e}")
        return count
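A quick usage sketch of the bulk collector; the domain and timestamp here are placeholders:

# Mirror up to 100 unique pages of example.com, as archived near mid-2021.
wb = WaybackScraper()
collector = BulkCollector(wb)
saved = collector.download_snapshot('example.com', '20210601000000')
print(f"Saved {saved} pages to wayback_data/")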

Competitor Pricing History

def track_pricing(scraper, url, selector):
    tracker = EvolutionTracker(scraper)
    versions = tracker.track_changes(url, selector, samples=20)
    print(f"Found {len(versions)} snapshots")
    for v in versions:
        print(f"  {v['date']}: {v['content'][:100]}")
    changes = tracker.compare(versions)
    print(f"Detected {len(changes)} changes")
    return versions, changes

wb = WaybackScraper()
track_pricing(wb, 'https://example.com/pricing', '.pricing-table')

Tips

  1. Query the CDX API first: it returns the full index quickly, with no page loads.
  2. Add 1-2 second delays between requests; the Internet Archive is a nonprofit.
  3. Use the collapse parameter to deduplicate snapshots (see the sketch below).
  4. Use the id_ modifier to fetch original pages without the Wayback toolbar.
  5. Cache results locally so repeated runs do not re-hit the archive.
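A minimal sketch combining tips 3 and 5, assuming a simple JSON file cache; CACHE_DIR and cached_snapshots are illustrative names, not part of the Wayback API:

import json, os

CACHE_DIR = 'cdx_cache'  # hypothetical local cache directory

def cached_snapshots(scraper, url):
    """Fetch CDX results once, then serve repeat calls from a local JSON file."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = url.replace('/', '_').replace(':', '')
    fp = os.path.join(CACHE_DIR, f'{key}.json')
    if os.path.exists(fp):
        with open(fp) as f:
            return json.load(f)
    # collapse=digest keeps only snapshots whose content actually changed
    params = {'url': url, 'output': 'json', 'collapse': 'digest',
              'fl': 'timestamp,original,statuscode'}
    data = scraper.session.get(scraper.CDX_API, params=params).json()
    snaps = [dict(zip(data[0], row)) for row in data[1:]] if len(data) > 1 else []
    with open(fp, 'w') as f:
        json.dump(snaps, f)
    return snaps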

For large-scale collection, services like ScraperAPI can handle proxy rotation, ThorData can provide IP diversity, and ScrapeOps can monitor your jobs.


Follow for more Python data collection tutorials.
