Introduction
Lately, I've been interested in compilers, and thankfully, Stanford University has done a great job of providing materials (handouts, assignments, and so on) on the subject here.
The problem is, there are quite a lot of files (which is a good thing), but I didn't want to click, click, click through them all, much less repeat that for other similar pages. Another issue was laziness. Well, you get the idea. Finally, I didn't want to create a virtual environment and install Beautiful Soup and the requests library for something relatively trivial. Too much work!
So, to save myself the stress (and probably you too, since you're here), here's what I did.
Example 1: Downloading Text Files
Here's the first page. It contains a list of C source files accompanying the book Foundations of Computer Science by Al Aho and Jeff Ullman.
Here's the entire code. Don't worry, the explanations are in the comments, so please read the code along with them.
# Import the HTMLParser to extract links
from html.parser import HTMLParser
# import this for making HTTP requests
import urllib.request as request
# The link to the page. Very obvious huh?
url = 'http://i.stanford.edu/~ullman/fcsc-figures.html'
# Make the get request to the web page
# Read the response as a text string (utf-8 encoded)
with request.urlopen(url) as response:
    content = response.read().decode('utf-8')
# You want to subclass the HTMLParser and override the
# handle_starttag method. We're interested in HTML links <a>
# Each link contains a tuple ('href', URL) so we're taking the second element.
# We need only URLs ending with .txt
class ExtractLinks(HTMLParser):
    links = []
    base_url = 'http://i.stanford.edu/~ullman'

    def handle_starttag(self, tag, attrs) -> None:
        if tag == 'a':
            for attr in attrs:
                link = attr[1]
                if link.endswith('txt'):
                    self.links.append(f'{self.base_url}/{link}')
# Create an instance of your class and feed it the HTML text
# feed huh? great naming I must say!
parser = ExtractLinks()
parser.feed(content)
# This isn't so necessary.
# I just wanted a sorted copy of the links attribute.
# I don't dislike mutation. The X-men were super cool.
links: list[str] = parser.links[:]
links.sort()
# Lastly, iterate through the links and make a get request to each link
# Read the response as text.
# Split the link and extract the filename. The filename is the last element of your split.
# Rename if you want. I wanted a .c extension instead of a .txt
# Open a file for writing (as text) and we are good to go.
for link in links:
    with request.urlopen(link) as response:
        file = response.read().decode('utf-8')
    filename = link.split('/')[-1]
    filename = filename.replace('.txt', '.c')
    with open(filename, 'wt', encoding='utf-8') as f:
        f.write(file)
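A quick aside before the results: handle_starttag receives attrs as a list of (name, value) tuples, and the loop above simply grabs attr[1] for every attribute. That works here because these anchor tags only carry an href. If you want to be explicit about which attribute you're reading, a slightly more defensive variant could look like this (just a sketch, not part of the script above; the class name is mine):

from html.parser import HTMLParser

class ExtractTxtLinks(HTMLParser):
    # Sketch: only follow the href attribute of <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
        self.base_url = 'http://i.stanford.edu/~ullman'

    def handle_starttag(self, tag, attrs) -> None:
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value.endswith('.txt'):
                self.links.append(f'{self.base_url}/{value}')

Feeding it the same content would populate links just like before; keeping links as an instance attribute in __init__ also avoids sharing one list across instances, which is what the class-level links = [] above does.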
Here's what it will look like:
Example 2: Downloading Multiple Binary Files (pdf, zip)
Same scenario as above with slight modifications.
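The main change is that PDFs and ZIP archives are binary, so the response bytes are written out as-is in 'wb' mode, with no decoding step. As a side note (this isn't what the script below does, and the URL here is just a placeholder), urllib can also save a single binary file straight to disk with urlretrieve:

import urllib.request as request

# Placeholder URL; substitute a real file you want to fetch.
request.urlretrieve('https://example.com/some_handout.pdf', 'some_handout.pdf')

The full script below sticks with urlopen instead, so it can skip files that already exist and sort the downloads into directories. Here it is: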
# import the HTMLParser to extract links
from html.parser import HTMLParser
# Import this for making HTTP requests
import urllib.request as request
# For something fancy
import os
# Like before, make the get request to the web page
# Read the response as a text string (utf-8 encoded)
url = 'https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/'
with request.urlopen(url) as response:
    content = response.read().decode('utf-8')
# This time I only want pdf and zip files.
# Well, one pdf link started with http but I didn't need it.
class ExtractLinks(HTMLParser):
    links: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == 'a':
            for attr in attrs:
                _, link = attr  # another way to extract tuple elements
                if link.startswith('http'):
                    continue
                if link.endswith('.zip') or link.endswith('.pdf'):
                    self.links.append(link)
# We've been here before! Pompeii by Bastille.
parser = ExtractLinks()
parser.feed(content)
links: list[str] = parser.links[:]
links.sort()
for link in links:
    # A link might be handouts/something.pdf or assignments/10.pdf
    parts = link.split('/')
    # Collect the first and last parts to aid file organization
    directory, filename = parts[0], parts[-1]
    # The file path is the combined directory/folder and the filename
    filepath = f"{directory}/{filename}"
    # If a file has already been downloaded, why repeat the task?
    # So skip it.
    if os.path.exists(filepath):
        print('Skipping', filepath, "....")
        continue
    # Create the full URL
    link_url = url + link
    # urllib didn't like the spaces in some of the links,
    # so replace them with the encoded form.
    link_url = link_url.replace(' ', '%20')
    with request.urlopen(link_url) as response:
        # read yields bytes. This is what we want here.
        # No decoding is necessary.
        file = response.read()
    # Okay. So, the directory may not have been created yet.
    # But EAFP: Easier to Ask for Forgiveness than Permission.
    # Try creating the file. It fails if the directory does not exist.
    # In the except block, create the directory and try again.
    # Finally, the write mode should be 'wb' - write binary.
    try:
        with open(filepath, 'wb') as f:
            # Keep track of what's happening, for sanity's sake
            print(f"writing {filepath}")
            f.write(file)
    except OSError as e:
        print(e)  # Directory does not exist error.
        print('======== creating it ===========')
        # Create the directory
        os.mkdir(directory)
        # This may potentially fail too, but not because of a missing
        # directory. That's okay since I'm not saving the planet.
        with open(filepath, 'wb') as f:
            f.write(file)
Here we go again :)
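One last aside. If you'd rather look before you leap, both workarounds in that loop have standard-library shortcuts: os.makedirs can create the directory up front without complaining if it already exists, and urllib.parse.quote handles the percent-encoding instead of a hand-rolled replace. Here's a sketch of the loop body rewritten that way (same url and link values as above; the function wrapper is just for illustration):

import os
import urllib.parse
import urllib.request as request

def download(url: str, link: str) -> None:
    # Sketch: look-before-you-leap version of the loop body above.
    parts = link.split('/')
    directory, filename = parts[0], parts[-1]
    filepath = f"{directory}/{filename}"
    if os.path.exists(filepath):
        return
    # Create the directory up front; exist_ok means no error if it's already there.
    os.makedirs(directory, exist_ok=True)
    # Percent-encode spaces (and anything else unsafe) in the path.
    link_url = url + urllib.parse.quote(link)
    with request.urlopen(link_url) as response:
        with open(filepath, 'wb') as f:
            f.write(response.read())

Both routes end up in the same place; the EAFP version just matches how the failure actually shows up at runtime.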
I had fun!