Victory Akaniru

Posted on Feb 23, 2020 • Edited on Feb 24, 2020

The one where we build a web scraper and a slackbot - Part 1

#selenium #beautifulsoup4 #automation #python

The problem

As software engineers, part of what we do revolves around making seemingly easy things a little bit easier. Who would imagine that doing these three things could be a chore?

Visit Brainyquote
Find and copy a random quote about excellence from the site.
Post the quote to a slack channel.

Each step is simple enough, but doing it every day for a year becomes boring and tedious.

Python is a scripting language well suited to things like this! With Python, we can automate the whole process and not have to do the same thing every day.

The Product

Well, that's exactly what we will be doing 🎉. In this two-part series, we will build a slackbot that periodically sends a random quote about excellence to a specified slack channel. Our MVP features include:

Scraping tool: This is responsible for collecting a whole lot of quotes and saving them to a JSON file for future use.
A Slack bot: This is responsible for periodically (maybe every morning?) sending one random quote to a Slack channel. This part of the project requires us to write some simple code for posting the message to a Slack channel at intervals.
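
To make the end goal concrete, here is a minimal sketch of the step the bot will eventually perform: load the saved quotes and pick one at random. It assumes the file is a JSON list of objects with quote and author keys, which is the shape the scraper produces further down. The pick_random_quote name and the sample data are made up for illustration.

```python
import json
import random
import tempfile
from pathlib import Path


def pick_random_quote(path):
    """Load a JSON list of quote dictionaries and return one at random."""
    with open(path) as json_file:
        quotes = json.load(json_file)
    return random.choice(quotes)


# Hypothetical sample data standing in for the real scraped file
sample = [
    {"id": "id-0", "quote": "First sample quote.", "author": "Author A"},
    {"id": "id-1", "quote": "Second sample quote.", "author": "Author B"},
]
sample_path = Path(tempfile.gettempdir()) / "sample_quotes.json"
sample_path.write_text(json.dumps(sample))

picked = pick_random_quote(sample_path)
print(f'{picked["quote"]} ({picked["author"]})')
```

Keeping the scraping and the picking separate means the bot never has to touch the website; it only reads the file.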

Prerequisites

A Python environment and some basic knowledge of Python. That's it.

Part 1: The scraping tool

First off, we need to get some groundwork done: creating a basic project setup, creating a virtual environment, and installing some packages.

- cd newly_created_folder
- mkdir scrapping-tool
- cd scrapping-tool
- touch __init__.py main.py scroll.py selenium_driver.py

At this point, we're good to go but I strongly recommend you create a virtual environment for this project. If you have virtualenv installed on your PC all you have to do is run the following commands

- virtualenv --python=python3 venv
- source venv/bin/activate

If you don't, or if you have questions about what a virtualenv is, you may want to read this
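
If you are unsure whether the environment is actually active, Python can tell you. The check below is a small sketch that should work for the standard venv module and for recent virtualenv releases; the in_virtualenv name is ours, not part of any library.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtualenv active:", in_virtualenv())
```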

Next, install the following third-party packages:

BeautifulSoup to help us scrape any website for data
selenium to automate browser interactions while doing so
lxml to serve as the parser BeautifulSoup uses to turn the page's HTML into a searchable tree

Run the following command in your terminal (the BeautifulSoup 4 package is published on PyPI as beautifulsoup4)

pip3 install beautifulsoup4 selenium lxml
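
A quick way to confirm the install worked is to ask Python whether it can find each package. Note that import names can differ from pip names: BeautifulSoup is imported as bs4, which is how main.py uses it later. The missing_modules helper is just an illustrative name.

```python
from importlib.util import find_spec

# Import names, not pip names: BeautifulSoup 4 is imported as bs4
REQUIRED_MODULES = ["bs4", "selenium", "lxml"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("missing:", missing or "nothing, good to go")
```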

Finally, download ChromeDriver by following the basic instructions here. This lets selenium drive Chrome, including the headless version we use for automation. If you're on a Mac you can simply run (recent Homebrew versions use the --cask flag in place of the older brew cask command)

brew install --cask chromedriver
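
The driver setup below needs the path to the chromedriver executable, and that path varies from machine to machine. Here is a small sketch for looking it up from Python, assuming chromedriver was installed somewhere on your PATH. The find_chromedriver wrapper is a made-up name; shutil.which is the standard library call doing the work.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path to the chromedriver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver is not on your PATH yet")
else:
    print("chromedriver found at", path)
```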

Setup sometimes may endure for a night but code comes in the morning... UNKNOWN

Let's write some code!

In the scrapping-tool folder you created, locate the selenium_driver.py file and paste the following code in

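from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
# Selenium 4 takes the chromedriver path through a Service object;
# adjust the path to wherever chromedriver lives on your machine
service = Service("/usr/local/bin/chromedriver")
driver = webdriver.Chrome(service=service, options=options)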

This piece of code imports webdriver from selenium and adds some configuration options such as incognito and headless mode. Finally, we make use of the chromedriver we installed earlier by pointing to the path it was downloaded to, and we save the result in a driver variable for future use.
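
If you expect to switch between headless and visible runs often, one option is to build the flags in a small function. This is only a sketch of that idea, not part of the project files: chrome_flags and build_options are hypothetical helper names, and selenium is imported inside build_options so the flag logic can be tried on its own.

```python
def chrome_flags(headless=True):
    """Return the Chrome command-line flags used in this article."""
    flags = ["--ignore-certificate-errors", "--incognito"]
    if headless:
        flags.append("--headless")
    return flags


def build_options(headless=True):
    """Build a ChromeOptions object from the flags above."""
    # Imported lazily so chrome_flags can be used without selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in chrome_flags(headless):
        options.add_argument(flag)
    return options


print(chrome_flags(headless=False))  # ['--ignore-certificate-errors', '--incognito']
```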

The __init__.py file tells Python to treat the folder as a package, with each file in it a module. Because we run main.py directly, Python also puts that folder on its import path, which is why main.py can import from selenium_driver.py and scroll.py by name 😎.

Part of the hassle of automating a web browser comes up when human interaction is needed. For example, the website we are trying to scrape has a couple of behaviours you will notice as soon as you open it.

On the first visit to the website, you have to click to accept the privacy policy
After that, we see the page with all the quotes we would like to get, but this page implements an infinite scroll.

It wouldn't be much of an automation if we had to click that button ourselves or scroll for the browser whenever it reached the bottom of the page. These problems bring us to our next step: scroll.py.

The key to scraping a website properly lies in your ability to hit inspect and find the class or id with which you can access an element

In the file scroll.py, copy and paste the code below.

import time

from selenium.webdriver.common.by import By


def scroll(driver, timeout):
    scroll_pause_time = timeout

    # wait for the terms modal to pop up, then click it if it is there
    driver.implicitly_wait(timeout)
    privacy_button = driver.find_elements(By.CSS_SELECTOR, ".qc-cmp-buttons > button:nth-child(2)")
    if privacy_button:
        privacy_button[0].click()
    time.sleep(2)

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height

A few things to note

We create a scroll function which takes two parameters: driver (the browser instance we configured) and timeout (wait time in seconds).
We use the driver's element-finding methods with a CSS selector to locate elements, in our case the privacy button we need to click before the infinite scrolling starts.
We also use the execute_script method, which runs a snippet of JavaScript in the page. That is how we scroll the window and read the page height.
Notice the while loop? On each pass it scrolls to the bottom, waits, then compares the new scroll height with the last one. If both heights are the same, we are at the end of the page and we break out of the loop.
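
To see the stopping condition without opening a browser, here is a sketch of the same loop run against a fake driver whose page grows twice and then stops. FakeDriver, scroll_to_end and the max_rounds safety cap are all our own additions for illustration; the real scroll function has no cap and pauses between scrolls.

```python
class FakeDriver:
    """Stand-in for a selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.reads = 0

    def execute_script(self, script):
        if script.startswith("return"):
            height = self.heights[min(self.reads, len(self.heights) - 1)]
            self.reads += 1
            return height
        return None  # the scrollTo call itself returns nothing


def scroll_to_end(driver, max_rounds=50):
    """Scroll until the page height stops changing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # a real run would pause here (time.sleep) to let new quotes load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return rounds


print(scroll_to_end(FakeDriver()))  # 3
```

The third scroll is the one that finds no new content, which is what ends the loop. A cap like max_rounds is worth considering on pages that never stop loading.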

Bringing it all together, we build the scraper itself.

In main.py, still within the scrapping-tool folder, add the following code

import re
import json
from bs4 import BeautifulSoup
from selenium_driver import driver # here we import the driver we configured earlier
from scroll import scroll # the scroll method


def get_quotes(url):
    try:
        # implicitly_wait tells the driver to wait before throwing an exception
        driver.implicitly_wait(30)
        # driver.get(url) opens the page
        driver.get(url)
        # This starts the scrolling by passing the driver and a timeout
        scroll(driver, 5)
        # Once scroll returns, bs4 parses the page_source
        soup = BeautifulSoup(driver.page_source, "lxml")
        # Then we close the driver, as soup now holds the page source
        driver.close()

        regex_quotes = re.compile('^b-qt')
        regex_authors = re.compile('^bq-aut')

        quotes_list = soup.find_all('a', attrs={'class': regex_quotes})
        authors_list = soup.find_all('a', attrs={'class': regex_authors})

        quotes = []
        # Pair each quote with its author and unpack the pair directly
        for i, (quote, author) in enumerate(zip(quotes_list, authors_list)):
            quotes.append({
                "id":  f"id-{i}",
                "quote": quote.get_text(),
                "author": author.get_text(),
                "author-link": author.get('href')
            })

        with open("quotes.json", 'w') as json_file:
            json.dump(quotes, json_file)
    except Exception as e:
        print(e, '>>>>>>>>>>>>>>>Exception>>>>>>>>>>>>>>')


get_quotes('https://www.brainyquote.com/topics/excellence-quotes')

What do we have here?

We import the BeautifulSoup4 library and some built-in Python modules: re (regular expressions) and json.
We also import the pieces we created earlier: the scroll function and the driver.
We create a get_quotes function that takes a URL as a parameter.
With implicitly_wait, we tell our browser to wait a little before throwing an error (sometimes network issues may slow things down).
We call the scroll function to do its thing.
Once that is done, we pass driver.page_source to BeautifulSoup4. Printing driver.page_source at this point would show a bunch of HTML tags.
We call close to stop browser interactions; we have all we need now.
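
One detail the list above skips is the pair of regular expressions. Both are anchored with ^, so they only match class names that start with the given prefix. Our understanding from the BeautifulSoup documentation is that a compiled pattern used as a class filter is checked against a tag's class names, so the sketch below applies the same patterns to a plain list. The class names in it are made up, not copied from the site.

```python
import re

regex_quotes = re.compile('^b-qt')
regex_authors = re.compile('^bq-aut')

# Made-up class names, only loosely modelled on what inspect might show
class_names = ["b-qt", "b-qt-extra", "bq-aut", "nav-b-qt", "footer"]

quote_classes = [name for name in class_names if regex_quotes.search(name)]
author_classes = [name for name in class_names if regex_authors.search(name)]

print(quote_classes)   # ['b-qt', 'b-qt-extra']
print(author_classes)  # ['bq-aut']
```

Note that nav-b-qt is left out: it contains b-qt but does not start with it.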

The goal is to scrape a quote, its author, and a link to all of that author's quotes. At this point we have all of that data, albeit in a format we cannot work with yet (HTML tags). Notice from the code that we extract the authors and the quotes separately. How do we link each quote to its author? We also need to create a Python dictionary containing all those pieces of information, give each one a unique id, and capture the author's link.

Python's zip function to the rescue. To put it simply, this function takes two lists and generates a series of tuples, each containing one element from each list. We also use the enumerate function, which gives us an index alongside each tuple returned by zip. With that, we loop over the tuples, unpack each one, create a Python dictionary containing the data we want, and append it to the quotes list.

We call the BeautifulSoup4 method get_text() on the author and the quote to get the actual text out of the HTML tags. We also call get('href'), which returns the value of whichever attribute we name, in our case href; this is how we get the link to the author's quotes. Finally, we save the contents of our quotes list by opening a quotes.json file and dumping our data into it with json.dump.
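
Here is the zip and enumerate pairing on its own, using plain strings in place of HTML tags so it can be run without a browser. The sample data is invented. The urljoin step is an extra that main.py does not do; it assumes the href values are site-relative paths, which you should confirm against your own quotes.json.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.brainyquote.com"

# Stand-ins for the text and href values pulled out of the HTML tags
quote_texts = ["Sample quote one.", "Sample quote two.", "Sample quote three."]
authors = [
    ("Author A", "/authors/author-a-quotes"),
    ("Author B", "/authors/author-b-quotes"),
]

quotes = []
for i, (text, (name, href)) in enumerate(zip(quote_texts, authors)):
    quotes.append({
        "id": f"id-{i}",
        "quote": text,
        "author": name,
        # Assumption: hrefs are site-relative, so we join them onto the base URL
        "author-link": urljoin(BASE_URL, href),
    })

print(len(quotes))               # 2
print(quotes[0]["author-link"])  # https://www.brainyquote.com/authors/author-a-quotes
```

One thing to watch: zip stops at the shorter list, so the third quote above is silently dropped. If the scraper ever finds more quotes than authors, the extras will be skipped in the same way.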

To run the scraper

python scrapping-tool/main.py

To see all this in action, you can comment out the line options.add_argument('--headless') in selenium_driver.py, and Chrome will open in a visible window.
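
After a run, it is worth checking that the output file has the shape we expect before building the bot on top of it. The sketch below flags any record missing one of the four keys main.py writes. The invalid_records name and the sample file are hypothetical; point the function at your own quotes.json instead.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_KEYS = {"id", "quote", "author", "author-link"}


def invalid_records(path):
    """Return the records in a quotes file that are missing any expected key."""
    with open(path) as json_file:
        records = json.load(json_file)
    return [record for record in records if not EXPECTED_KEYS.issubset(record)]


# Hypothetical sample file with one complete and one incomplete record
sample = [
    {"id": "id-0", "quote": "Sample quote.", "author": "Author A",
     "author-link": "/authors/author-a-quotes"},
    {"id": "id-1", "quote": "Quote with no author."},
]
sample_path = Path(tempfile.gettempdir()) / "sample_scraped_quotes.json"
sample_path.write_text(json.dumps(sample))

bad = invalid_records(sample_path)
print(len(bad))  # 1
```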

Yo! That’s it for now. Feel free to leave feedback or opinions in the comments. In part two of this article, we will go through creating a slackbot that posts these scraped quotes to a slack channel. That also means configuring a flask project that lets us run a server and implement a scheduler!

To view the full code for this article click here
