Web Scraping HN with Python


You may know Hacker News (HN), a news aggregator for tech articles.

Let's scrape it with Python. The PyQuery module lets you query HTML pages with CSS selectors, so you can collect all the story links with PyQuery:

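from pyquery import PyQuery as pq

doc = pq(url="")  # URL of the page to scrape

# Each match is an lxml element; its href attribute holds the story URL.
for link in doc('a.storylink'):
    print(link.attrib['href'])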

That returns the links for the day "2019-07-14". So you have a list of links printed to the screen. You want that in a file.

Okey dokey.

You can save the output into a csv file. A csv (comma-separated values) file stores its values with a delimiter in between, usually a comma, but we'll use a semicolon.

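from pyquery import PyQuery as pq

date = "2019-07-14"
doc = pq(url="" + date)  # page URL followed by the date

# Collect the href of every story link.
links = []
for link in doc('a.storylink'):
    links.append(link.attrib['href'])

# Write one "date;link" row per line.
with open('output.csv', 'w') as csvfile:
    for link in links:
        csvfile.write(date + ";" + link + "\n")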

Simple right? :) Run it and you'll have all the links in a nicely formatted csv file.
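Joining strings by hand works as long as no link contains a semicolon. As an alternative sketch (not what the script above does), Python's built-in `csv` module can write the same rows and quote awkward values for you. The links below are made up, and an in-memory buffer stands in for `output.csv` so the snippet runs on its own.

```python
import csv
import io

date = "2019-07-14"
# Made-up links for illustration; the second one contains a semicolon on purpose.
links = ["https://example.com/a", "https://example.com/b;c"]

# In a real script this would be open('output.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")

for link in links:
    writer.writerow([date, link])  # one "date;link" row per link

print(buffer.getvalue())
```

The writer puts quotes around the second link, so a reader that uses the same delimiter still sees two columns per row.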

A csv file can be read with an office program (any spreadsheet), or you can read it using Python pandas.
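For the pandas route, a minimal sketch might look like this. It assumes the file has one `date;link` row per line and no header row; the column names and the sample rows are made up, and an in-memory string stands in for `output.csv`.

```python
import io

import pandas as pd

# Made-up sample in the assumed "date;link" layout (no header row).
csv_text = (
    "2019-07-14;https://example.com/a\n"
    "2019-07-14;https://example.com/b\n"
)

# With a real file, pass 'output.csv' instead of the StringIO object.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                 # the file uses semicolons, not commas
    header=None,             # there is no header row to read
    names=["date", "link"],  # hypothetical column names
)

print(df)
```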
