
Talles L

Crawling a website with wget

Here's an example that I've used to get all the pages from Paul Graham's website:

```
$ wget --recursive --level=inf --no-remove-listing --wait=6 --random-wait \
       --adjust-extension --no-clobber --continue -e robots=off \
       --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36" \
       --domains=paulgraham.com https://paulgraham.com
```

| Parameter | Description |
| --- | --- |
| `--recursive` | Enables recursive downloading (following links) |
| `--level=inf` | Sets the recursion depth to infinite |
| `--no-remove-listing` | Keeps the ".listing" files created to keep track of directory listings |
| `--wait=6` | Waits the given number of seconds between requests |
| `--random-wait` | Multiplies `--wait` by a random factor between 0.5 and 1.5 for each request |
| `--adjust-extension` | Makes sure ".html" is appended to downloaded HTML files |
| `--no-clobber` | Does not redownload a file if it already exists locally |
| `--continue` | Allows resuming a partially downloaded file |
| `-e robots=off` | Ignores robots.txt instructions |
| `--user-agent` | Sends the given "User-Agent" header to the server |
| `--domains` | Comma-separated list of domains to be followed |
| `--span-hosts` | Allows navigating to subdomains when combined with `--domains` (see the sketch below) |
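
The command above stays on paulgraham.com itself. If a site also serves pages from subdomains, my understanding of wget's behavior is that adding `--span-hosts` together with `--domains` lets the crawler leave the starting host while staying within the listed domain (subdomains match the domain suffix). A minimal sketch, using example.com as a placeholder host rather than anything from the original command:

```
# Sketch: example.com is a placeholder host.
# --span-hosts allows leaving the starting host; --domains keeps the crawl
# restricted to example.com and its subdomains (suffix match).
$ wget --recursive --level=inf --wait=6 --random-wait \
       --adjust-extension --no-clobber --continue -e robots=off \
       --span-hosts --domains=example.com \
       https://example.com/
```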

Other useful parameters:

| Parameter | Description |
| --- | --- |
| `--page-requisites` | Downloads page requisites such as inlined images, sounds, and referenced stylesheets |
| `--span-hosts` | Allows downloading files from links that point to different hosts |
| `--convert-links` | Converts links to local links (allowing offline viewing) |
| `--no-check-certificate` | Bypasses SSL certificate verification |
| `--directory-prefix=/my/directory` | Sets the destination directory |
| `--include-directories=posts` | Comma-separated list of directories allowed to be followed when crawling |
| `--reject "*?*"` | Rejects URLs that contain query strings |
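
As a rough illustration of how these extra options combine, here is a sketch of a full offline mirror: page requisites are fetched, links are rewritten for local browsing, and everything lands in a chosen folder. The host and destination path (example.com, /tmp/mirror) are placeholders, not values from the original example:

```
# Sketch only: example.com and /tmp/mirror are placeholder values.
$ wget --recursive --level=inf --wait=6 --random-wait \
       --page-requisites --convert-links --adjust-extension \
       --directory-prefix=/tmp/mirror \
       https://example.com/
```

If only a section of the site is wanted, `--include-directories` and `--reject "*?*"` from the table above can be added to narrow the crawl.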
