Web Scraping HN with Python

#python #web

You may know Hacker News (HN), a news aggregator for tech articles.

Let's scrape it with Python. The PyQuery module lets you query HTML pages with jQuery-like selectors. You can collect all the links with PyQuery:

#!/usr/bin/python3
from pyquery import PyQuery as pq

# Fetch and parse the HN front page for one day.
doc = pq(url="https://news.ycombinator.com/front?day=2019-07-14")

# Print the target URL of every story link.
for link in doc('a.storylink'):
    print(link.attrib['href'])

That prints the links for the day "2019-07-14" to the screen. Now you want that list in a file.

Okey dokey.

You can save the output to a CSV file. A CSV (comma-separated values) file stores its values with a delimiter in between, usually a comma, but we'll use a semicolon.

#!/usr/bin/python3
from pyquery import PyQuery as pq

date = "2019-07-14"
doc = pq(url="https://news.ycombinator.com/front?day=" + date)

# Collect the target URL of every story link.
links = []
for link in doc('a.storylink'):
    links.append(link.attrib['href'])

# Write one "date;link" row per story, without a trailing delimiter.
with open('output.csv', 'w') as csvfile:
    for link in links:
        csvfile.write(date + ";" + link + "\n")

Simple, right? :) Run it and you'll have all the links in a nicely formatted CSV file.
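One caveat with writing the rows by hand: a link that itself contains a semicolon would split into extra columns. A minimal sketch of an alternative, assuming you are fine with swapping in Python's standard csv module, which quotes such fields for you. The helper name write_rows and the sample links are hypothetical.

```python
import csv
import io


def write_rows(csvfile, date, links):
    """Write one date;link row per link to an open text file object."""
    writer = csv.writer(csvfile, delimiter=";", lineterminator="\n")
    for link in links:
        writer.writerow([date, link])


if __name__ == "__main__":
    # Hypothetical links; the second contains a semicolon on purpose.
    sample_links = ["https://example.com/a", "https://example.com/b;c=1"]
    buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
    write_rows(buffer, "2019-07-14", sample_links)
    print(buffer.getvalue())
```

The second row comes out with its link wrapped in double quotes, so a spreadsheet still sees two columns.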

A CSV file can be read with an office program (any spreadsheet), or you can read it in Python with pandas.
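For the pandas route, a rough sketch might look like this. It assumes pandas is installed; the column names date and link and the helper name load_links are my own choices, since the file has no header row.

```python
import io

import pandas as pd


def load_links(source):
    """Read semicolon-separated date/link rows into a DataFrame."""
    return pd.read_csv(
        source,
        sep=";",
        header=None,             # the file has no header row
        usecols=[0, 1],          # keep only the first two fields
        names=["date", "link"],  # hypothetical column names
    )


if __name__ == "__main__":
    # Hypothetical sample; pass "output.csv" instead to read the real file.
    sample = io.StringIO(
        "2019-07-14;https://example.com/a\n"
        "2019-07-14;https://example.com/b\n"
    )
    df = load_links(sample)
    print(df["link"].tolist())
```

From there the usual DataFrame tools apply, for example filtering rows by date or counting links per day.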
