
Python3 Challenge: Unraveling Web Scraping

Jonathan Cardoso ・3 min read

How to add dynamic data from different pages on a website efficiently!

This challenge was given in a real job interview. The purpose of this article is to show the technologies involved and what I learned while solving it. Prior knowledge of the Django framework is required.

Challenge

The task is to develop a Django page that presents the requested information automatically and always shows up-to-date numbers:

  1. The estimated population of the cities of Natal/RN and Mossoró/RN.
  2. The sum of the populations of both cities.
  3. The municipal human development index (MHDI) of the cities of Natal/RN and Mossoró/RN.
  4. The average of the Natal and Mossoró MHDIs.

All of this must follow the data published on the IBGE city pages.
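Items 2 and 4 boil down to a sum and a mean once the scraped strings are turned into numbers. The sketch below assumes the values arrive as Brazilian-formatted text (a period as thousands separator, a comma as decimal separator). The helper name parse_br_number and the sample strings are made up for illustration and are not real IBGE figures.

```python
from statistics import mean


def parse_br_number(text):
    """Convert Brazilian-formatted text such as '1.234.567' or '0,75' to a float.

    Assumes '.' separates thousands and ',' marks the decimal part.
    """
    # Keep only the first whitespace-separated token, dropping units like 'pessoas'.
    token = text.split()[0]
    return float(token.replace('.', '').replace(',', '.'))


# Placeholder values for illustration only (not real IBGE data).
populations = ['1.000.000 pessoas', '250.000 pessoas']
mhdis = ['0,80', '0,70']

population_total = sum(parse_br_number(p) for p in populations)
mhdi_average = mean(parse_br_number(m) for m in mhdis)

print(int(population_total))   # 1250000
print(round(mhdi_average, 2))  # 0.75
```

Doing the replacement by hand avoids depending on a pt_BR locale being installed on the server, which is one possible design choice among others.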

Scraping

Python is a popular choice for collecting data from the web. We will use the **Beautiful Soup** module, which parses HTML and XML documents into a tree that can be navigated and searched.
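To make the idea of a parse tree concrete, here is a small sketch that parses an inline HTML string instead of a live page. The markup and variable names are invented for the example, and it assumes Beautiful Soup is already installed.

```python
from bs4 import BeautifulSoup

# Invented HTML used only to illustrate the parse tree.
html = """
<html>
  <body>
    <h1>Cities</h1>
    <ul class="cities">
      <li>Natal</li>
      <li>Mossoró</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Tags are reachable as attributes of the tree...
heading = soup.h1.text

# ...or by searching for them by name and class.
names = [li.text for li in soup.find('ul', class_='cities').find_all('li')]

print(heading)  # Cities
print(names)    # ['Natal', 'Mossoró']
```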

Starting project

I assume you have already created a Django project and activated your virtualenv (I advise you to use one).

We will use Beautiful Soup (imported as bs4) together with requests, so install both libraries in the project through pip:

(venv) ❯ pip install beautifulsoup4 requests

HTML

Assuming you have already created a templates folder in the app, your structure will look more or less like this:

Figure: Structure of Django files and folders

Now create an HTML page, in my case index.html, to display the content. The full file can be viewed on GitHub.

Web Scraping

Each website has its own structure, so inspect the area you want to capture and analyze it, going through each li inside the ul with the class “resultados-padrao”:

Source Code
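As a rough sketch of that inspection step, the snippet below walks each li inside the ul and pairs its label with its value. The HTML is a simplified stand-in that I assume from the class names used in this article (resultados-padrao, ind-label, ind-value); the real page has more markup, and the values here are placeholders.

```python
from bs4 import BeautifulSoup

# Simplified, assumed markup; the values are placeholders, not real data.
html = """
<ul class="resultados-padrao">
  <li><p class="ind-label">População estimada</p><p class="ind-value">100 pessoas</p></li>
  <li><p class="ind-label">IDHM</p><p class="ind-value">0,5</p></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
ul = soup.find('ul', class_='resultados-padrao')

rows = []
for li in ul.find_all('li'):
    label = li.find('p', class_='ind-label')
    value = li.find('p', class_='ind-value')
    # Skip list items that lack either piece.
    if label is None or value is None:
        continue
    rows.append((label.get_text(strip=True), value.get_text(strip=True)))

print(rows)
# [('População estimada', '100 pessoas'), ('IDHM', '0,5')]
```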

In views.py, we start by importing the tools we will use and defining the variables that hold the cities and the fields we want:

from django.shortcuts import render
import re

import requests, json, locale
from bs4 import BeautifulSoup

def index(request):
  # cities to look up, identified by state and name as used in the page URL
  city = [
    {'state': 'rn', 'name': 'natal'},
    {'state': 'rn', 'name': 'mossoro'}
  ]
  # labels we want to keep from each page
  fields = [
    'População', 'IDHM'
  ]
  # raw values collected while scraping
  city_values = []

  # data passed to the template
  context = {}

  return render(request, 'ibge/index.html', context)

The cityData function receives the city list as a parameter. For each city, it builds the page URL from the state and name, requests the page, and then captures the labels and values shown on it.

If a label matches one of the requested fields, the label and its value are saved in the lists. At the end, the function returns city_data.

def cityData(city):
  # fields and city_values come from index(), so define this function
  # inside index() or pass them in as arguments
  city_data = []
  for item in city:
    # convert and read in json
    item = json.dumps(item)
    item = json.loads(item)

    name = item['name']
    state = item['state']
    url_request = requests.get("https://www.ibge.gov.br/cidades-e-estados/"+ state +"/"+ name +".html")
    div_select = BeautifulSoup(url_request.content, 'html.parser')

    # labels and values live inside the "resultados-padrao" list
    ul = div_select.find(class_='resultados-padrao')
    city_label = ul.find_all("p", {"class": "ind-label"})
    city_value = ul.find_all("p", {"class": "ind-value"})

    data = []
    for label, value in zip(city_label, city_value):
      text_label = label.text
      text_value = value.text
      # keep only the requested fields, matched on the label's first word
      if text_label.split()[0] in fields:
        data.append([text_label, text_value])
        city_values.append(text_value.split()[0])

    data.append([name, state])
    city_data.append(data)
  return city_data
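Network requests can fail or hang, so it may be worth separating the download from the parsing. The following is a minimal sketch, not the code from the repository: build_city_url and fetch_city_html are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests

BASE_URL = "https://www.ibge.gov.br/cidades-e-estados"


def build_city_url(state, name):
    """Build the city page URL in the same shape the article uses."""
    return f"{BASE_URL}/{state}/{name}.html"


def fetch_city_html(state, name, get=requests.get, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    `get` is injectable so the function can be exercised without network access.
    """
    try:
        response = get(build_city_url(state, name), timeout=timeout)
        # Raise for 4xx/5xx responses instead of parsing an error page.
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.content
```

A caller would pass the returned bytes to BeautifulSoup and skip the city when the result is None, so one unavailable page does not break the whole view.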

The complete views.py, combining the pieces above, can be viewed on GitHub.

Conclusion

After that, we’re done. Use this command to run the web application:

(venv) ❯ python manage.py runserver

Figure: Page rendering IBGE data.

In this tutorial, we used Python, Django, and Beautiful Soup to collect data from the web and display it on a page. From here, you can adapt the same approach to other pages and fields.

Project repository on Github

I hope you enjoyed the content! If you have suggestions or criticisms, comment below. Bye! 🖐🏽

