Web Scraping HN with Python

petercour

You may know Hacker News (HN), a news aggregator for tech articles.

Let's scrape it with Python. The PyQuery module lets you query HTML pages with CSS selectors, so you can collect all the story links from a page.

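#!/usr/bin/python3
from pyquery import PyQuery as pq

# Fetch and parse the HN front page for one specific day
doc = pq(url="https://news.ycombinator.com/front?day=2019-07-14")

# Print the target URL of every element matched by the a.storylink selector
for link in doc('a.storylink'):
    print(link.attrib['href'])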

That prints the story links for the day "2019-07-14" to the screen. Next, you want that list in a file.

Okey dokey.

You can save the output to a CSV file. A CSV file stores values separated by a delimiter, usually a comma, but we'll use a semicolon.

#!/usr/bin/python3
from pyquery import PyQuery as pq

date = "2019-07-14"
doc = pq(url="https://news.ycombinator.com/front?day=" + date)

# Collect the target URL of every story link on the page
links = []
for link in doc('a.storylink'):
    links.append(link.attrib['href'])

# Write one "date;link" row per story
with open('output.csv', 'w') as csvfile:
    for link in links:
        csvfile.write(date + ";" + link + "\n")

Simple, right? :) Run it and you'll have all the links in a nicely formatted CSV file.
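One catch with joining strings by hand: a link that itself contains a semicolon would split into extra columns. Here is a sketch of an alternative that uses the standard library's `csv` module, which quotes such values for you. The function name `save_links`, the file name `links_demo.csv`, and the sample links are assumptions for illustration.

```python
import csv

def save_links(path, date, links, delimiter=";"):
    """Write one (date, link) row per link to a delimited text file."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, delimiter=delimiter)
        for link in links:
            writer.writerow([date, link])

# Made-up sample data; the second link contains the delimiter on purpose
sample_links = ["https://example.com/a", "https://example.com/b;c"]
save_links("links_demo.csv", "2019-07-14", sample_links)

# Read the file back as text to see what was written
with open("links_demo.csv", newline="", encoding="utf-8") as f:
    written = f.read()
print(written)
```

The second row comes out with the link wrapped in double quotes, so a spreadsheet or CSV parser still sees exactly two columns.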

A CSV file can be opened in any spreadsheet program, or you can read it in Python with pandas.
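To read the file back in Python without installing anything extra, here is a sketch that uses the standard library's `csv` module rather than pandas. The sample text and the helper name `read_links` are made up for illustration; the sample assumes the same date;link layout as the script.

```python
import csv
import io

# Made-up sample in the layout the script writes: date;link
SAMPLE = (
    "2019-07-14;https://example.com/a\n"
    "2019-07-14;https://example.com/b\n"
)

def read_links(fileobj, delimiter=";"):
    """Return a list of (date, link) tuples from a delimited file object."""
    rows = []
    for row in csv.reader(fileobj, delimiter=delimiter):
        if len(row) >= 2:  # skip blank or malformed lines
            rows.append((row[0], row[1]))
    return rows

rows = read_links(io.StringIO(SAMPLE))
for date, link in rows:
    print(date, link)
```

For a real file, pass `open('output.csv', newline='')` instead of the `StringIO` object. If you prefer pandas, `pandas.read_csv` accepts `sep=';'` and `header=None` for a file laid out like this.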
