DEV Community

Matt Miller


Archiving Web Pages with wget and Wayback Machine: A Handy Guide

Introduction:
The Wayback Machine (web.archive.org) is a valuable resource for accessing archived versions of web pages. In this guide, we'll explore how to use the wget command to download content from the Wayback Machine, allowing you to preserve and explore historical snapshots of websites. Follow the example command and explanation below to get started.

Example Command:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-parent https://web.archive.org/web/20231225142555/https://example.com/index.php

Explanation of Options:

  • --recursive: Download recursively, ensuring that all linked resources are captured.
  • --no-clobber: Skip downloading files that already exist, preventing redundancy.
  • --page-requisites: Download necessary files for complete page rendering (images, stylesheets, etc.).
  • --html-extension: Save HTML files with a .html extension for easy identification (newer versions of wget call this option --adjust-extension).
  • --convert-links: Convert the links in downloaded documents to point to local files, enabling offline viewing.
  • --restrict-file-names=windows: Modify filenames to be compatible with Windows file naming conventions.
  • --no-parent: Prevent ascending to the parent directory, keeping the downloaded content organized.
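
The options above can be collected into a small wrapper script, which makes the timestamp and target site easy to swap out. This is a minimal sketch, not part of the original article: the TIMESTAMP and TARGET values are placeholders, and the script echoes the command for review rather than running it.

```shell
#!/bin/sh
# Hypothetical wrapper around the wget one-liner from this article.
# Replace TIMESTAMP (YYYYMMDDhhmmss) and TARGET with your own values.
TIMESTAMP="20231225142555"
TARGET="https://example.com/index.php"
SNAPSHOT_URL="https://web.archive.org/web/${TIMESTAMP}/${TARGET}"

# Echo the full command first so it can be reviewed before running.
echo wget --recursive --no-clobber --page-requisites --html-extension \
     --convert-links --restrict-file-names=windows --no-parent \
     "$SNAPSHOT_URL"
```

Dropping the echo runs the download for real; keeping it lets you sanity-check the composed URL first.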

Usage Notes:

  • Replace URL: Substitute the example URL in the command with the specific Wayback Machine URL you want to download.
  • Content Limitations: Keep in mind that not all websites may be fully archived, and dynamic content might not be accurately captured.
  • Review Terms: Adhere to the terms of service and usage policies of the Wayback Machine and the archived website.
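
Wayback Machine URLs follow a predictable pattern: the archive host, a 14-digit timestamp (YYYYMMDDhhmmss), then the original URL. A small helper can build them for you — wayback_url is a hypothetical function name, not part of any official tooling:

```shell
# Hypothetical helper: build a Wayback Machine snapshot URL from a
# 14-digit timestamp (YYYYMMDDhhmmss) and the original page URL.
wayback_url() {
  ts="$1"
  url="$2"
  echo "https://web.archive.org/web/${ts}/${url}"
}

wayback_url 20231225142555 https://example.com/index.php
```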

Conclusion:
Using wget in conjunction with the Wayback Machine provides a practical way to archive and explore historical versions of web pages. This process ensures that you can access and analyze web content as it appeared at specific timestamps, offering insights into the evolution of websites over time.


Enjoying the content? If you'd like to support my work and keep the ideas flowing, consider buying me a coffee! Your support means the world to me!


Top comments (2)

Paul Trafford • Edited

Thanks for sharing this nifty Wget one-liner to retrieve web pages from the Internet Archive’s Wayback Machine, which has become an essential site for preserving web memory. So, I tried it on this article from a recent snapshot.

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-parent https://web.archive.org/web/20260105193912/https://dev.to/matemiller/archiving-web-pages-with-wget-and-wayback-machine-a-handy-guide-1hek

It successfully generates a snapshot; from the directory where I run that script, I can find the web page at:

web.archive.org/web/20260105193912...

It displays fine in the browser, even when offline.

The --no-parent option effectively confines the crawl to one snapshot, which makes for a compact download. However, links to other snapshots won’t be followed. Since the Wayback Machine typically spreads a site across multiple snapshots, a method is needed to piece them together.

The Internet Archive is a very popular service, so bandwidth is at a premium. For larger amounts of content, it is worth adding some throttling with the --limit-rate option, e.g. --limit-rate=500k.
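
A politer variant of the article's command might look like the sketch below. --limit-rate and --wait are genuine wget options, but the specific values are only illustrative, and the command is echoed rather than executed:

```shell
#!/bin/sh
# Illustrative throttling: --limit-rate caps bandwidth per download,
# --wait pauses (in seconds) between requests. Values are examples only.
THROTTLE="--limit-rate=500k --wait=1"
SNAPSHOT="https://web.archive.org/web/20231225142555/https://example.com/index.php"

CMD="wget --recursive --no-clobber --page-requisites --html-extension \
--convert-links --restrict-file-names=windows --no-parent $THROTTLE $SNAPSHOT"

# Print the composed command for review; remove the echo to run it.
echo "$CMD"
```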

There are other sites running the Wayback Machine software, some of which are listed at:
en.wikipedia.org/wiki/Wikipedia:Li....

Paul Trafford

I’m going to take a liberty and respond to my own comment. How to piece together web pages and assets from multiple snapshots?

The History of Science Museum in Oxford has a snapshot from 1996:
web.archive.org/web/19961219005900...

When I joined the museum over 10 years later, the then director, Jim Bennett, was keen for certain portions of the website to always be available to the public. However, the site has continued to evolve, with a risk of older materials being deleted. So, I wanted a way to crawl the Wayback Machine to generate an offline static archive. Eventually I extended MakeStaticSite, a set of Bash shell scripts I had already written for live sites, to leverage Wget to crawl sites on the Wayback Machine.
makestaticsite.sh/
github.com/paultraf/makestaticsite

Once installed, you can run it like this, passing the Memento URL of the Wayback snapshot:

./setup.sh -u https://web.archive.org/web/20260105193912/https://dev.to/matemiller/archiving-web-pages-with-wget-and-wayback-machine-a-handy-guide-1hek

A simple example of pages/assets that span multiple snapshots:

./setup.sh -u https://web.archive.org/web/19981201044802/http://www.gnu.org/philosophy/philosophy.html

The first run of Wget uses only one snapshot/timestamp; MakeStaticSite then reviews the output and generates a list of further Wayback URLs for Wget to process.
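
The second-pass idea can be sketched roughly as follows — this is my own illustration of the concept, not MakeStaticSite's actual code: scan the downloaded HTML for references to other Wayback snapshot URLs, deduplicate them, and feed the list to a follow-up wget -i run. The sample.html content here is fabricated for demonstration.

```shell
#!/bin/sh
# Sketch only (not MakeStaticSite's implementation): harvest further
# Wayback snapshot URLs from already-downloaded HTML for a second pass.
cat > sample.html <<'EOF'
<a href="https://web.archive.org/web/19990117032727/http://www.gnu.org/gnu.html">GNU</a>
<img src="https://web.archive.org/web/19990117032727im_/http://www.gnu.org/logo.gif">
EOF

# Match snapshot URLs: timestamp digits plus an optional modifier (im_, js_, ...).
grep -ohE 'https://web\.archive\.org/web/[0-9]+[a-z_]*/[^"]+' sample.html \
  | sort -u > further-urls.txt

cat further-urls.txt
# A second pass could then run: wget -i further-urls.txt ...
```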

I really hope the Internet Archive service can be maintained for a long time.