DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’» is a community of 963,673 amazing developers

We're a place where coders share, stay up-to-date and grow their careers.

Create account Log in
Cover image for Web Scraper & Data Extraction with Python | Upwork Series #1
The Pylot
The Pylot

Posted on • Updated on

Web Scraper & Data Extraction with Python | Upwork Series #1

Hi Devs!

Welcome to the first post of Upwork Series. In this series, we are going to work on real-world applications.

Before we gon on, I want to introduce you my brand newYouTube Channel - Reverse Python

What is the main difference between this channel from others? I am trying to show you real-world applications instead of just simple tutorials. Complex projects will increase your algorithmic thinking and problem-solving skills. In these videos I didn't record my voice, the reason is, I already explained the project in my blog post so I am just showing the procedure of the project step by step. But if you want explanation in videos, let me know in comments.

Alright! Let's start!

Web Scraper & Data Extraction

Our task is to create web crawler which will scrape data daily from the report pages of transportation company.

Click here to see description of project

Reverse Python

Step 1: Understanding the task

First, it is important to understand the task clearly. They want from us to scrape data then save it in CSV file where each attribute listed above is its own separate column.

We are going to crawl the following information:

  • Date ("The information below reflects the content of the FMCSA management information systems as of {Date}")
  • Operating Status
  • Legal Name
  • DBA Name
  • Physical Address
  • Mailing Address
  • USDOT Number
  • Power Units
  • Drivers
We can provide an initial set of 100 DOT numbers to make sure the information above can be scraped easily, from there we can provide all DOT numbers we would like to scrape on a daily basis.

Report pages can be accessed by dots or with other name IDs. Each page has its own id (dot). So, these dots locate in Excel file. We have to read this file and extract dots from there then pass it into URL to access the report page.

Step 2: Creating our environment and installing dependencies

Now, we know what client wants from us, so let's create our virtual environment then inspect elements that we are going to crawl.

To create virtualenv run the following command in your terminal:

virtualenv env
Enter fullscreen mode Exit fullscreen mode

then install BeautifulSoup which is a Python package for parsing HTML and XML documents and xlrd which is a library for reading data and formatting information from Excel files:

pip install beautifulsoup4 xlrd
Enter fullscreen mode Exit fullscreen mode

Step 3: Crawling Data

Alright Devs! Let's start with opening the example url from project description so can see the fields.

Click to see Example URL

The page should look like this:

Reverse Python

Our first target is to find - Date ("The information below reflects the content of the FMCSA management information systems as of {Date}")

The information below reflects the content of the FMCSA management information systems as of 01/01/2020.

The truth is we can't crawl this element by specific class name or id. Unfortunately, this report pages messed up.

But we can see that this element is in bold format. There are also many elements locate with the bold format. However, we can crawl all of them as text and use RegEx to extract the data we need.

A regular expression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

The date locates between The information below reflects the content of the FMCSA management information systems as of and .(dot). So , to find the date regex will look between these strings.

import re # regex
import urllib.request
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def crawl_data(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    # Find all bold texts
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            # look between these strings
            date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator='  ')).group(1).strip()   
            # If regex finds multiple dots, extract string before first dot                    
            if len(date) > 11:
                date = date.split(".",1)[0]
            print(date)
        except AttributeError:
            pass
Enter fullscreen mode Exit fullscreen mode

Well if you run the program you will see it's printing the date.

Let me quickly show you how regex works, because I feel some of you trying to understand.

Consider the following code:

import re

# We need to extarct "coderasha" from the string
data = "Hello my name is coderasha."
name = re.search('Hello my name is (.*).', data)
print(name)

# Output: <_sre.SRE_Match object; span=(0, 27), match='Hello my name is coderasha.'>
Enter fullscreen mode Exit fullscreen mode

As you see match is found but its printed like object. group(1) capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference

import re

# We need to extarct "coderasha" from the string
data = "Hello my name is coderasha."
name = re.search('Hello my name is (.*).', data).group(1)
print(name)

# Output: coderasha
Enter fullscreen mode Exit fullscreen mode

So, I am applying the same logic to find the date inside strings that crawled.

The next step is, to find a table and continue to crawl other fields. Luckily, the table locates between center tags. However, we have to find data again using RegEx because table elements have no any special attribute.

# Get all texts inside table
 information = bs.find('center').get_text(strip=True, separator='  ')
# Find fields using RegEx
    operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
    legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
    physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
    mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
    usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
    power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
    drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()
Enter fullscreen mode Exit fullscreen mode

Step 4: Write data in CSV

Once data crawled, it is time to create new csv file and write data into it. I prefer to create another function which will handle this action.

import csv

def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers):
    with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file:
            fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers })               
Enter fullscreen mode Exit fullscreen mode

CSV name must be unique, so I named it with usdot_address or with other name ID of report page from crawled data

Step 5: Read excel file to crawl data for each dot

The final step is to read excel file and pass these dots end of the URL to access the pages. We can use xlrd to read excel file

import xlrd

dots = []

def read_excel_file():
    loc = ("dots.xls") 
    wb = xlrd.open_workbook(loc) 
    sheet = wb.sheet_by_index(0) 
    sheet.cell_value(0, 0)
    # First five dot in excel 
    for i in range(1,5): 
        # Convert floats to string and clean from .0
        dot = str(sheet.cell_value(i, 0)).replace('.0', '')
        dots.append(dot)
Enter fullscreen mode Exit fullscreen mode

xlrd reads numbers as float so the best solution is to covert them to strings and use replace() method to remove .0 end of the string.

and pass these dots into url:

for dot in dots:
    crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot)     
    # Sleep 5 seconds to avoid any errors     
    time.sleep(5)
Enter fullscreen mode Exit fullscreen mode

Here is the Full Code:

import re
import csv
import urllib.request
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import xlrd 
import time

dots = []

def read_excel_file():
    loc = ("dots.xls") 
    wb = xlrd.open_workbook(loc) 
    sheet = wb.sheet_by_index(0) 
    sheet.cell_value(0, 0) 
    for i in range(1,5): 
        dot = str(sheet.cell_value(i, 0)).replace('.0', '')
        dots.append(dot)


def crawl_data(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator='  ')).group(1).strip()
            if len(date) > 11:
                date = date.split(".",1)[0]
            print(date)
        except AttributeError:
            pass

    information = bs.find('center').get_text(strip=True, separator='  ')

    operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
    legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
    physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
    mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
    usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
    power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
    drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()

    write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers)

def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers):
    with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file:
            fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers })                


read_excel_file()
print(dots)
for dot in dots:
    crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot)
    time.sleep(5)
Enter fullscreen mode Exit fullscreen mode

Mission Accomplished

I hope you liked the first post of new Upwork series. The job offer still open so if want you can send proposal to client using this code. Please take look Reverse Python for more articles like this and subscribe to YouTube Channel - Reverse Python for amazing content.

See you soon Devs! Stay Connected!

Top comments (12)

Collapse
 
narenandu profile image
Narendra Kumar Vadapalli

Just a couple of minor points, which might make the code look clean and neat.

If you define a method like the following

def return_match_from_info(input_str):
    re.search(input_str, information).group(1).strip()

can become

operating = get_match_from_info('Operating Status:(.*)Out')

Of course you need to declare information as global variable


Also if you want to use pandas pandas.pydata.org/,

Collapse
 
ben profile image
Ben Halpern

This is a really interesting concept for a series!

Collapse
 
thepylot profile image
The Pylot Author

ThanksπŸ™ŒπŸš€

Collapse
 
rpopovwex profile image
rpopovwex • Edited on

For some reason, this code gave me AttributeError when a Dot number was not found. I figured out that this was due to bs.find('center') not finding the correct field (since it doesn't exist on the page for non-existent or outdated DoT number). I solved the problem by changing this:

except AttributeError:
      pass

to

except AttributeError:
    continue

so that instead of doing nothing (pass) I'd switch to the next DoT number. I also had to move the whole block of code starting with "information" one tab to the right so that it's only executed when try statement executes without errors. This way only valid DoT numbers are crawled and saved. Hope this helps!

Here's how the code looks in the final form:

def crawl_data(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator='  ')).group(1).strip()
            if len(date) > 11:
                date = date.split(".",1)[0]
            print(date)
        except AttributeError:
            continue

        information = bs.find('center').get_text(strip=True, separator='  ')

        operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
        legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
        physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
        mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
        usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
        power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
        drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()

        write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers)
Collapse
 
rpopovwex profile image
rpopovwex

Also, it'd be convenient to add some sort of progress bar that would state which DoT is crawled at the moment and how many are left, as well as a short statement in the case when DoT number is not found.

Collapse
 
juancarlospaco profile image
Juan Carlos

Good post, with examples and explanations, this can be interesting.

I invite you to try the new lib dev.to/juancarlospaco/faster-than-...
πŸ˜ƒ

Collapse
 
thanimeblog profile image
Theanimeblog

I'm looking to hire a dev that can create a web scraper to .csv format that will run daily at 9am est. It can save to a google sheet once run.

Each day a new pdf is published. ie: li-public.fmcsa.dot.gov//lihtml/rp...

only the digits change which correspond to the daily date change.

need to pull the following per row:

MC Number
Company Name
Name
Address
Phone

email to discuss budget, can pay through paypal G&S quickly. Looking for help immediately.

info@dnxint.com is my email

Collapse
 
dpashutskii profile image
Dmitrii Pashutskii

Hi! Great article, thanks for posting.
Just one question, how does this related to Upwork? Just curious as a former freelancer on a platform.

Collapse
 
ilhamday profile image
Ilham Ferry

Nice post. Thanks for make this one. The explanation is so clear. Happy to read it .. :)

Collapse
 
cetive profile image
Cetive

2 Issues,

1) The Code is creating seprate CSV file for each line
2) My dots.xls file have 100 Dot number but its only search 2 and end the process

Some comments may only be visible to logged-in visitors. Sign in to view all comments.

Update Your DEV Experience Level:

Settings

Go to your customization settings to nudge your home feed to show content more relevant to your developer experience level. πŸ›