Geof

Posted on Feb 5, 2022

Web Scraping With Python

#webscraping #python #selenium #coding

WHAT IS WEB SCRAPING?

In a nutshell, web scraping is the automated extraction of publicly displayed data from the web, typically data that cannot be reached or extracted through an API.

Most of the time, web scraping is used for price or news monitoring, information and data gathering, automated research, automated web or platform engagement, and similar tasks.

Web scraping has become very popular in recent times because of its relevance to the current business world. This tutorial is about web scraping with Python, so without further ado we’ll dive into what that looks like and which libraries you need to code a simple web scraper.

SCRAPING WITH SELENIUM

Python is widely known to be useful across many areas of tech, and web scraping is one of the major domains where Python thrives.

WHAT IS SELENIUM?

Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.

It provides extensions to emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification that lets you write interchangeable code for all major web browsers.

Selenium is not the only module used for web scraping with Python; other modules are just as popular. Each has its pros and cons, so you need to know which one fits each occasion. We’ll briefly compare these modules below.

Three of the most popular Python modules used for web scraping are as follows:

SCRAPY:

Scrapy is efficient and portable. However, its major con is that it is not user friendly, especially for beginners.

BEAUTIFUL SOUP:

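Beautiful Soup is easy to learn and understand. However, it has some cons too: it only parses pages that something else has fetched, so it needs other libraries alongside it (for example, an HTTP client such as requests), and it is less efficient than Scrapy.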

SELENIUM:

Selenium is versatile and, because it drives a real browser, works well with JavaScript-heavy pages. However, Selenium is also not as efficient as Scrapy.

In this post we’ll use Selenium as our module for web scraping with Python. Perhaps in my next web scraping post we’ll adopt one of the other modules mentioned above.

TALK IS CHEAP, LET THE CODING BEGIN…

We are about to code a web scraper that goes to the Wikipedia website, enters a query in the search box, and gets the results and possibly the links too. Make no mistake: there is a module specifically for Wikipedia search, called wikipedia.

However, the aim here is to show how one can access a public website, fill in a form, submit it, explore the site’s contents and more. We are just going to keep things simple in this particular post.

MY ASSUMPTIONS:

You have basic experience with HTML and CSS.
You have at least beginner-level Python coding experience; for instance, you are familiar with loops, functions, importing modules and similar concepts.
If you have not used Selenium before, please do yourself a favor and check out the basic documentation of this module before you continue with this tutorial.

First, we’ll import the necessary modules. But before importing them, you need to download the web driver for your browser. I personally prefer the Google Chrome driver.

Be sure to download the version that matches your browser’s version. To check your browser version, click the three dots in Chrome, click “Help”, then click “About Google Chrome”; the version you are using is shown there.

Once the download is done, extract the file, keep it somewhere close to your code folder, and note the path to the driver.
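
Before handing the driver to Selenium, it can help to confirm that the path you noted really points to an executable file. The sketch below uses only the standard library. The helper name driver_is_usable is hypothetical, and the example path is simply the placeholder used later in this post, so yours will differ.

```python
import os
from pathlib import Path


def driver_is_usable(driver_path):
    """Return True if driver_path points to an existing, executable file."""
    path = Path(driver_path).expanduser()
    # A directory or a missing file cannot be launched as a driver.
    if not path.is_file():
        return False
    # os.access checks the execute permission for the current user.
    return os.access(path, os.X_OK)


if __name__ == '__main__':
    # Placeholder path: replace it with the location of your own chromedriver.
    print(driver_is_usable('/home/you/Desktop/my_scraper/web_driver/chromedriver'))
```

If this prints False, check the path and the file permissions before looking for problems in the scraper itself.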

Now let’s import the necessary modules.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
from pprint import pprint

Now it’s time to create our main function for this code:

def get_wiki():

    #get the preferred keyword
    keyword = input('Enter a keyword to search:\n')
    link_dec = input('Do you need links? Kindly enter yes or no:\n').lower()

    #create the driver service from the chromedriver path, then start Chrome
    d_path = Service('/home/you/Desktop/my_scraper/web_driver/chromedriver')
    driver = webdriver.Chrome(service=d_path)

Let’s explain the code above:

We created a function called get_wiki. The variable keyword gets the search keyword from the user, and the variable link_dec records whether the user wants links or not.
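
The yes/no answer is only lower-cased, so input such as ' Yes ' or 'y' would not equal 'yes'. A slightly more forgiving check could look like the sketch below. The helper name wants_links and the accepted spellings are assumptions for illustration, not part of the original code.

```python
def wants_links(answer):
    """Interpret a free-form yes/no answer; anything unrecognised counts as no."""
    # strip() removes stray spaces and newlines, lower() ignores capitalisation.
    return answer.strip().lower() in {'yes', 'y'}


if __name__ == '__main__':
    for reply in ['yes', ' Yes ', 'Y', 'no', '']:
        print(repr(reply), '->', wants_links(reply))
```

In get_wiki, the result of wants_links(input(...)) could be stored in link_dec and tested directly in the final if statement.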

Then we created the driver service from the driver path, followed by the driver instance. Now we’ll continue with more code inside the main function:

    #get the page and enter a keyword to search
    driver.get('https://en.wikipedia.org/wiki/Main_Page')
    search_box = driver.find_element(By.NAME, 'search')

    search_box.send_keys(keyword)
    search_box.send_keys(Keys.ENTER)

    time.sleep(3)

    #get the main content
    main_data = driver.find_element(By.ID, 'content')
    pprint(main_data.text)

I presume you have checked out the Selenium documentation as I advised earlier. With your prior knowledge of HTML and CSS, you already know how to find the needed selectors and elements on the Wikipedia page.

You can open the website in a new window and explore the elements and selectors with Chrome’s developer tools while simulating the search. This lets you check what is working and what is not, in case you run into bugs.

So with the code above we get the page, enter a keyword in the search box and press Enter to submit it. We wait for 3 seconds, get the results, and print them out using pprint.
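
A fixed time.sleep(3) waits the full three seconds even when the page is ready sooner, and it may still be too short on a slow connection. A common alternative is to poll until a condition holds or a timeout expires. Selenium ships its own WebDriverWait class for this; the standard-library sketch below only illustrates the idea, and the helper name wait_until and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(interval)


if __name__ == '__main__':
    # Stand-in condition: becomes true roughly one second after start.
    start = time.monotonic()
    print(wait_until(lambda: time.monotonic() - start > 1, timeout=5))
```

Inside get_wiki, the condition could be something like lambda: driver.find_elements(By.ID, 'content'), which returns an empty (falsy) list until the element exists.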

Now let’s create an inner function inside the main function that will get the available links if required by the user:

    def show_links():
        """Get the links available in the contents"""
        links = driver.find_elements(By.TAG_NAME, 'a')
        for link in links:
            print(link.get_attribute('href'))

The function above is self-explanatory. Next, we call it if link_dec is “yes”, then quit the driver. Finally, we call the main function:

    if link_dec == 'yes':
        show_links()

    driver.quit()

get_wiki()
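
show_links prints every href exactly as the browser reports it, so anchors without a usable href can show up as empty values and repeated links are printed more than once. One way to tidy the list before printing is sketched below with plain strings, so it runs without a browser. The helper name clean_links is hypothetical.

```python
def clean_links(hrefs):
    """Drop missing hrefs and duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # Skip None or empty strings (anchors with no usable href) and repeats.
        if not href or href in seen:
            continue
        seen.add(href)
        cleaned.append(href)
    return cleaned


if __name__ == '__main__':
    sample = [
        'https://en.wikipedia.org/wiki/Python',
        None,
        'https://en.wikipedia.org/wiki/Python',
        'https://en.wikipedia.org/wiki/Selenium',
    ]
    for url in clean_links(sample):
        print(url)
```

In the scraper, the input would be [link.get_attribute('href') for link in links].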

Now let’s see all the code in one place:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
from pprint import pprint


def get_wiki():

    #get the preferred keyword
    keyword = input('Enter a keyword to search:\n')
    link_dec = input('Do you need links? Kindly enter yes or no:\n').lower()

    #create the driver service from the chromedriver path, then start Chrome
    d_path = Service('/home/you/Desktop/my_scraper/web_driver/chromedriver')
    driver = webdriver.Chrome(service=d_path)


    #get the page and enter a keyword to search
    driver.get('https://en.wikipedia.org/wiki/Main_Page')
    search_box = driver.find_element(By.NAME, 'search')

    search_box.send_keys(keyword)
    search_box.send_keys(Keys.ENTER)

    time.sleep(3)

    #get the main content
    main_data = driver.find_element(By.ID, 'content')
    pprint(main_data.text)


    def show_links():
        """Get the links available in the contents"""
        links = driver.find_elements(By.TAG_NAME, 'a')
        for link in links:
            print(link.get_attribute('href'))

    if link_dec == 'yes':
        show_links()

    driver.quit()

get_wiki()
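
If anything in the listing above raises an exception (for example, an element is not found), driver.quit() is never reached and the browser window stays open. A try/finally wrapper is one way to guard against that. The sketch below uses a stand-in FakeDriver class so it runs without a browser; both FakeDriver and run_with_driver are hypothetical names.

```python
class FakeDriver:
    """Stand-in for a Selenium driver; it only records whether quit() ran."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run_with_driver(driver, task):
    """Run task(driver) and always call driver.quit(), even on errors."""
    try:
        return task(driver)
    finally:
        driver.quit()


def failing_task(driver):
    # Simulates a scraping step that goes wrong midway.
    raise RuntimeError('element not found')


if __name__ == '__main__':
    driver = FakeDriver()
    try:
        run_with_driver(driver, failing_task)
    except RuntimeError as error:
        print('task failed:', error)
    print('driver closed:', driver.closed)
```

With a real driver, the body of get_wiki after the driver is created would play the role of task.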

CONCLUSION

From here you can do other things with your search results, like sending them to an email address, converting them to a PDF file and more.
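
As one illustration of the email idea, the scraped text can be wrapped in a message with Python’s standard email package. The sketch below only builds the message. The addresses, the subject line and the helper name build_results_email are placeholders, and actually sending it would additionally need smtplib and the details of your own mail server.

```python
from email.message import EmailMessage


def build_results_email(keyword, results_text, sender, recipient):
    """Wrap scraped text in an email message without sending it."""
    message = EmailMessage()
    message['Subject'] = 'Wikipedia results for: ' + keyword
    message['From'] = sender
    message['To'] = recipient
    message.set_content(results_text)
    return message


if __name__ == '__main__':
    # Placeholder addresses; example.com is reserved for documentation.
    msg = build_results_email(
        'Python',
        'Python is a high-level programming language.',
        'me@example.com',
        'you@example.com',
    )
    print(msg['Subject'])
```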

In my next web scraping with Python post, we’ll focus on other cool stuff like getting prices and updates on news, trading and more. We’ll also learn about Beautiful Soup, regex, sending emails with Python and more.

You can edit this code and use it on different sites or on a search engine like Google. Now that you have the basic knowledge, you can explore Selenium even more and create better scrapers than what I did here.

Becoming better at anything requires curiosity, so get curious and explore the knowledge available on the internet about web scraping. You might also want to check the popular programming communities for extra knowledge on the topic.

To get notified automatically when my next post on web scraping with Python and subsequent ones are published, hit the follow button.

Get affordable and seamless one-on-one Python training today from anywhere in the world. Location is never a barrier, and we have friendly learning tools to make your Python training a worthwhile experience.

Top comments (1)

Josh • Oct 31 '24

Hey really good post here, what are your thoughts on playwright compared to the tools you mentioned? I've only used playwright for scraping paired with agentql querying
