DEV Community

Wulfi

Extracting data from a website using BeautifulSoup

There are two main ways to extract data from a website:

  • Use an API (if one is available) to retrieve the data.

  • Access the HTML of the webpage and extract the useful information from it.

In this article, we will take the second approach and extract Billboard magazine's Top Hot 100 songs of 1970 from the Wikipedia page "Billboard Year-End Hot 100 singles of 1970".


Task:

  • Scrape the page and extract all 100 songs with their artists.
  • Create a Python dictionary whose keys are the titles of the singles and whose values are lists of artists.
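
As a rough illustration of the target shape, the dictionary could look like the sketch below. It holds a single entry, the year-end number one for 1970; the variable name `expected_shape` is only illustrative and does not appear elsewhere in this article.

```python
# Illustrative only: the shape of the dictionary the task asks for.
# Keys are single titles; values are lists of artist names.
expected_shape = {
    "Bridge over Troubled Water": ["Simon & Garfunkel"],
}

# A list is used as the value so that a single with several credited
# artists can hold more than one name.
for title, artists in expected_shape.items():
    print(f"{title}: {', '.join(artists)}")
```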

Installation
We need to install requests and bs4. The requests module lets you send HTTP requests from Python. Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files.

pip install requests
pip install bs4

Import the libraries

import requests
from bs4 import BeautifulSoup

Sending request

url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1970"
response = requests.get(url)
print(response.url)          # the URL that was fetched
print(response.status_code)  # HTTP status code; 200 means success
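
A slightly more defensive version of the request is sketched below. The helper name `fetch_html`, the `User-Agent` string, and the 10-second timeout are assumptions made for illustration and are not part of the original code. Some sites reject requests that carry the default client identifier, so sending a descriptive header may help.

```python
import requests

# Hypothetical identifier; replace it with something that describes your own script.
HEADERS = {"User-Agent": "billboard-1970-tutorial/0.1 (learning project)"}


def fetch_html(url, timeout=10):
    """Return the page HTML, raising an exception on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the Wikipedia URL used in this article would return the same text as `response.text`, but it fails loudly instead of handing an error page to the parser.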

Parsing the HTML and building the dictionary

# Name the parser explicitly to avoid a warning and keep results consistent across machines
songSoup = BeautifulSoup(response.text, "html.parser")

data_dictionary = {}

# Row 0 is the table header, so rows 1 to 100 hold the songs.
# This assumes the chart is the first table on the page.
for song in songSoup.find_all('tr')[1:101]:
  links = song.find_all('a')
  if len(links) < 2:
    continue  # skip rows without both a linked title and a linked artist

  title = links[0].get_text()   # first link in the row: the song title
  artist = links[1].get_text()  # second link: the first credited artist
  print(title, ',', artist)

  # Store the artist in a list so the value matches the task description
  data_dictionary[title] = [artist]

print(data_dictionary)
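
Picking links by position can mislabel rows where a title or artist is not linked, and it keeps only one artist when several are credited. One alternative, sketched here on a small hand-written HTML sample rather than the live page, reads the table cells instead. The sample markup, the placeholder song and artist names, the `wikitable` class lookup, and the helper name `parse_chart` are all assumptions; the real page's markup may differ, so check it in your browser's inspector first.

```python
from bs4 import BeautifulSoup

# Hand-written sample with placeholder names, not data from the real chart.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
  <tr><td>1</td><td>"<a href="#">Song A</a>"</td><td><a href="#">Artist X</a></td></tr>
  <tr><td>2</td><td>"Song B"</td><td><a href="#">Artist Y</a> and <a href="#">Artist Z</a></td></tr>
</table>
"""


def parse_chart(html):
    """Map each title to a list of artists, reading cells rather than links."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    chart = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:  # the header row has only <th> cells
            continue
        # get_text() works whether or not the title is wrapped in a link
        title = cells[1].get_text().strip().strip('"')
        # Prefer linked artist names; fall back to the cell's plain text
        artists = [a.get_text().strip() for a in cells[2].find_all("a")]
        if not artists:
            artists = [cells[2].get_text().strip()]
        chart[title] = artists
    return chart


print(parse_chart(SAMPLE_HTML))
```

Because every value is built as a list, a row with two linked artists keeps both names, which is closer to the "lists of artists" the task describes.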