Paige Niedringhaus

Posted on Oct 23, 2023 • Originally published at paigeniedringhaus.com on Oct 22, 2023

Scrape Data from a Lazy Loading Website with Selenium Python

#python #selenium #bigdata #webdriver

Introduction

A few months ago, my friend wanted me to write a program to collect the data of one of the NFT collections on the NFTrade site, compute the current price of each NFT in US dollars based on the current market price of the BNB cryptocurrency it was listed for sale in, and compile all of the NFTs for sale into a CSV file that he could sort and manipulate.

Unfortunately, the NFTrade website does not have a public API so writing a Node.js script to fetch the data from the API and format it as required was not an option. Instead, I needed to make a site scraper to actually go to the website page and "scrape" the data off of it.

Having not written a web scraper before (and also wanting to make the script easier for my friend to update and run on his own machine), I decided to write the program in Python (it seems to be a very popular programming language choice for a task such as this). Along the way, my little web scraper's requirements evolved and got more complex, and I learned a bunch of useful new techniques about using Python for my project, which I intend to share in a series of posts over the coming months.

My first attempt to scrape the data from NFTrade was unsuccessful beyond locating the first 75 NFTs on the page. I figured out this was because NFTrade (as many other websites do) lazy loads NFTs onto the page 75 at a time: once the user's scrolled down far enough to reach the end of the currently visible items, the site loads the next batch of elements onto the page (essentially a fancier version of pagination). So I needed a way to have my web scraper program collect whatever data was available on the page then scroll down far enough to trigger more data to load and collect that, and rinse and repeat.

After some trial and error, I finally found a working solution with the help of a Python package named Selenium Python, and I'll share with you today how to write your own Python script to scrape data from a lazy loading website with Selenium WebDriver.

NOTE: I am not normally a Python developer so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.

Selenium with Python package

There are a few different popular Python packages available for web scraping which I tried before reaching for Selenium, but I had an issue with them in that they only worked for static websites that were generated at build time, not for sites that are generated on the client-side via JavaScript, like NFTrade is.

To that end, I had to do a little digging to find a package that could work with scraping sites with dynamically loaded data, and I ran across the Selenium Python package during my investigation.

Selenium Python is a Python-based API that allows users to write scripts or automated tests using Selenium WebDriver in an intuitive, Python-flavored way. Selenium WebDriver, in turn, is software that can drive a browser natively, as a user would, either locally or on a remote machine. Originally created back in 2004, Selenium is one of the earliest automated testing tools to emulate user actions on a web page (an approach commonly known today as end-to-end testing).

The cool thing about WebDriver though, is that its uses span beyond automation testing, as scripts can actually be written to scrape data off of live web pages, and that's just what I ended up doing in my Python script, so let's get started.

Install Selenium Python in the Python project

As with most projects, the first thing to do is add the Selenium Python package to the Python project. The easiest way is to use pip to install the Selenium package.

Assuming you have pip on your machine, at the root of your Python project folder, run the following command from a terminal.

pip install selenium

Then, add the selenium package to your requirements.txt file so anyone downloading the repo in the future can install all the necessary project dependencies.

requirements.txt

selenium

And that's all it takes to be ready to use WebDriver in your Python script. Simple enough.
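To double-check that the install worked, and to find the exact version to pin in requirements.txt, a quick check along these lines can help. It's only a minimal sketch: installed_version is a helper name made up for this example, built on the standard library's importlib.metadata module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


selenium_version = installed_version("selenium")
if selenium_version is None:
    print("selenium is not installed; run: pip install selenium")
else:
    # A pinned line like this in requirements.txt keeps future installs reproducible.
    print(f"selenium=={selenium_version}")
```

Pinning the version is optional, but it means anyone installing from requirements.txt later gets the same Selenium release the script was written against.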

Import Selenium WebDriver into Python script

After adding the Selenium Python bindings to the project, it's time to import Selenium's WebDriver and some of its helpful configuration options to the actual Python script that does the website scraping. I named my file for_sale_scraper.py since I was specifically looking for NFTs that are for sale (not all of the NFTs listed on NFTrade are - some are just visible but not actually available to purchase), but you can choose any sort of file name that makes sense for you.

Below are the imports I added to my file. I'll break down what each one is doing below.

for_sale_scraper.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
# standard library helpers used later in the script
import time
from pprint import pprint

The very first import line brings in the selenium.webdriver module and provides all the WebDriver implementations.

from selenium import webdriver

Next, as I chose to use Chrome as the browser I wanted WebDriver to interact with (Selenium supports Firefox, Chrome, Edge, and Safari browsers), I imported the Options class from the selenium.webdriver.chrome.options module. This allowed me to add specific config details about how I want the Chrome browser to be set up when the Python script runs against it: things like headless mode or disable extensions, etc.

from selenium.webdriver.chrome.options import Options

I'll cover the arguments I passed here in detail in the next section.

WebDriverWait, added in the third line of imports, is part of the special sauce that makes WebDriver a good solution for sites like NFTrade that dynamically fetch data on the client side: it lets the script wait, up to a timeout, for a condition to become true (such as an element appearing on the page) before continuing, which gives the browser time for data to come back from the server and populate in the DOM.

from selenium.webdriver.support.wait import WebDriverWait

This type of wait is an "explicit wait", meaning I manually set a period during which the code will wait before continuing to try and execute.
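To make the idea more concrete, here's a rough sketch of an explicit wait in use. WebDriverWait's until() method keeps calling the function it's given (passing in the driver) until that function returns something truthy, and raises a TimeoutException if the timeout runs out first. The helper name wait_for_elements is an assumption for illustration, and the string "xpath" is simply the value behind By.XPATH, used so the sketch needs no imports of its own.

```python
def wait_for_elements(wait, xpath):
    """Block until at least one element matches xpath, then return the matches.

    `wait` is expected to be a WebDriverWait instance, for example
    WebDriverWait(driver, 5) for a timeout of up to 5 seconds.
    """
    # find_elements() returns an empty (falsy) list while nothing matches,
    # so until() keeps polling until the list is non-empty.
    return wait.until(lambda driver: driver.find_elements("xpath", xpath))
```

With a real driver this could be called as wait_for_elements(WebDriverWait(driver, 5), '//div[contains(@class, "some-class")]'), where the XPath is a placeholder for whatever identifies the elements being waited on.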

And finally, there is the import for By. By is what allows me to locate elements on the page - it is immeasurably useful and powerful.

The By class provides locator strategies for element IDs, names, XPaths, link text, partial link text, tag names, class names, and CSS selectors, and once again, it is a key player when it comes to scraping data off of the web page, as I'll demonstrate soon.

from selenium.webdriver.common.by import By

Right, all the Selenium WebDriver imports are now present in the Python file, time to initialize them and get to work.

Add methods to scrape data and lazy load more data

Before WebDriver can begin scraping the data from NFTrade, an instance of the browser that WebDriver will interact with must be instantiated and the proper options supplied to it.

1. Initialize the Selenium WebDriver instance

In my attempt to follow good Python coding practices (again, disclaimer: I don't write Python as my primary coding language), I created a class for the file named ForSaleNFTScraper, and created an __init__() method immediately inside of it where I created the Chrome WebDriver instance that the whole script will be able to reference in the remainder of its methods.

class ForSaleNFTScraper:
    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--start-maximized')
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 5)

# more code here

The first thing I did inside of the __init__() method was to add a couple of Chrome browser configs via the Options import from the last section by declaring a new options variable.

    options = Options()

Since I wanted this script to run without actually opening a browser window on the user's local machine, I added the config argument of --headless and the argument of --start-maximized, so the (unseen) window would take up as much screen size as was available (and hopefully load as many NFTs as quickly as possible by doing so).

    options.add_argument('--headless')
    options.add_argument('--start-maximized')
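One caveat worth hedging on: in headless mode there's no visible screen to maximize against, so depending on the Chrome version --start-maximized may not actually change the window size. Passing an explicit --window-size switch is a common alternative. The sketch below just builds the list of switches; build_chrome_arguments and the 1920x1080 default are assumptions for illustration, not part of the original script.

```python
def build_chrome_arguments(headless=True, width=1920, height=1080):
    """Return a list of Chrome command-line switches for the scraper."""
    arguments = []
    if headless:
        arguments.append("--headless")
    # An explicit size does not depend on any physical screen being present.
    arguments.append(f"--window-size={width},{height}")
    return arguments


for argument in build_chrome_arguments():
    print(argument)
```

Each string in the returned list can be handed to options.add_argument() in the same way as the two arguments above.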

Then I passed the new options object to webdriver.Chrome and stored the resulting driver on self.driver (self refers to the current ForSaleNFTScraper instance, so anything attached to it is accessible from the rest of the class's methods). I also created a WebDriverWait tied to that driver with a timeout of 5 seconds and stored it on self.wait. Creating the wait doesn't pause the script by itself; the timeout only comes into play when the wait's until() method is called, which then polls for up to 5 seconds for a condition to be met.

    self.driver = webdriver.Chrome(options=options)
    self.wait = WebDriverWait(self.driver, 5)

There's plenty happening in that first method, but it's all pretty straightforward once you go through the code line by line and understand what the arguments mean to the Chrome WebDriver instance, and why it's doing what it's doing. Now that the WebDriver instance was configured and ready to go, I could write the code fetching the NFT card data, and lazy loading more data once the end of the currently visible info was reached.

2. Write the get_cards() and get_current_card_count() methods

This is where the code really starts to get interesting in my opinion, because it's where I learned to collect whatever data was currently visible in a (headless) browser and then load more data to add to the list. Pay close attention, because this is where the lazy loading code resides that gets more and more data onto the page.

def get_current_card_count(self):
    """Get the count of cards currently loaded on the page."""
    return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

def get_cards(self, max_card_count=500):
    """Load at least max_card_count cards, then return the card elements."""
    URL="https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
    self.driver.get(URL)
    last_card_count = 0
    # loops through lazy loading cards on site until max_card_count number reached
    while last_card_count < max_card_count:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # needs `import time` at the top of the file
        last_card_count = self.get_current_card_count()
    # grab all the cards now loaded into the browser by XPATH
    cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
    return cards

# more code here

Ok, here we go.

For starters, there are two methods that I'm displaying in the code snippet here. The first method, get_current_card_count() is how I keep track of how many NFT cards in a collection are currently visible on the screen.

As I've said, NFTrade lazy loads its NFT collections onto the page to make the initial page load quicker, and when a user scrolls down to the end of the currently loaded batch of elements, the NFTrade page loads more cards into the DOM at that point in time.

The second method is get_cards(), which handles going to the NFTrade collection URL and scraping all the available card data. It relies on get_current_card_count() to help it know to load more NFT cards until the desired number of cards has been loaded in the browser to scrape data from.

get_cards() method

I'll talk about get_cards() first as it's the more complicated of the two methods. The first thing the method does is declare a new variable named URL - this variable is set to the URL of the NFTrade collection page I want WebDriver to navigate to and scrape the data from. I used the Selenium WebDriver driver.get() method to navigate to the page given by the URL.

   URL="https://nftrade.com/collection/[NFT_COLLECTION_NAME]"
   self.driver.get(URL)

After navigating to the proper URL, I created a variable called last_card_count and set it equal to 0: this variable will be used to track how many NFTs are currently visible on the page and compare it to the max_card_count variable passed to the get_cards() method (if a number isn't passed for max_card_count it defaults to 500).

Below is the key code to lazy loading more and more data in the browser

Inside of get_cards(), there's a while loop set up to compare the last_card_count and max_card_count variables. As long as last_card_count is less than max_card_count, the loop will run, and each time it executes WebDriver uses the driver.execute_script() method to scroll down the page, waits for 3 seconds (allowing more cards to load onscreen), and then updates the last_card_count variable to the new number of cards on the page using the get_current_card_count() method.

NOTE: The window.scrollTo() method is critical

driver.execute_script() allows for the synchronous execution of JavaScript in the current window, so when you see the code self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), what's happening is that WebDriver is using the JavaScript window.scrollTo() method to scroll the browser all the way to the bottom of the page (that's why document.body.scrollHeight is present - it's a measurement of the height of the whole document.body page element), which triggers the page to load more NFT cards into view.

    last_card_count = 0
    # loops through lazy loading cards on site until max_card_count number reached 
    while last_card_count < max_card_count:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
        last_card_count = self.get_current_card_count()
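One thing to watch out for with a loop like this: if the collection has fewer cards than max_card_count (or the site stops loading new batches), the count never reaches the target and the loop never ends. Below is a hedged sketch of one way to guard against that by giving up once the card count stops growing. The function name scroll_until_loaded and the pause_seconds and max_stalls parameters are assumptions for illustration, and the string "xpath" is the value behind By.XPATH.

```python
import time

CARD_XPATH = '//div[contains(@class, "Item_itemContent__1XIcH")]'


def scroll_until_loaded(driver, max_card_count=500, pause_seconds=3, max_stalls=3):
    """Scroll until max_card_count cards are loaded or the count stops growing."""
    last_card_count = 0
    stalls = 0
    while last_card_count < max_card_count and stalls < max_stalls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)
        current_card_count = len(driver.find_elements("xpath", CARD_XPATH))
        if current_card_count > last_card_count:
            stalls = 0  # new cards arrived, so reset the stall counter
        else:
            stalls += 1  # nothing new loaded after this scroll
        last_card_count = current_card_count
    return last_card_count
```

The function returns the number of cards that ended up on the page, which may be lower than max_card_count when the collection is small.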

And this is a perfect time to segue into discussing the get_current_card_count() method, which is short and sweet.

get_current_card_count() method

This method exists simply to find the count of the current elements loaded in the browser, and it does so by combining the WebDriver find_elements() method with the By.XPATH element locator method.

Due to how the NFTrade site is built, there are no easily identifiable classes, IDs, or other consistent ways to identify all the cards on the page, so I had to resort to XPath expressions to identify each element and include it in my count to update the last_card_count variable. I cobbled together the XPath below by using my Chrome DevTools to inspect the elements on the page and construct the XPath from there through trial and error.

NOTE: What is XPath?

If you're unfamiliar like I was, XPath is a syntax that can be used to navigate through elements and attributes in an XML document (or webpage). The W3Schools XPath tutorial has some good examples of what typical XPath expressions look like and how to interpret them.

So the code inside of the get_current_card_count() method is just the one line of code:

return len(self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]'))

In the code snippet, I'm getting the count (using the built-in Python function len()) of all the elements on the page that match the XPath of a <div> containing the class of "Item_itemContent__1XIcH", because each NFT on the page is wrapped by that <div> with that class. It's not the prettiest thing to read and understand, but it gets the job done.
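Since the same XPath string shows up in more than one method, it could be built once and reused. Here's a small sketch of that idea; class_contains_xpath and CARD_XPATH are names made up for this example.

```python
def class_contains_xpath(tag, class_fragment):
    """Build an XPath matching <tag> elements whose class attribute contains class_fragment."""
    return f'//{tag}[contains(@class, "{class_fragment}")]'


CARD_XPATH = class_contains_xpath("div", "Item_itemContent__1XIcH")
print(CARD_XPATH)  # //div[contains(@class, "Item_itemContent__1XIcH")]
```

The trailing "1XIcH" looks like it could be a build-generated suffix (that's a guess, not something confirmed about NFTrade). If it is, matching on a shorter fragment such as "Item_itemContent__" might survive site redeploys better, because contains() only needs part of the class name to match.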

And finally, jumping back to the get_cards() method again, once the last_card_count variable has been updated and has reached or surpassed the max_card_count variable (i.e. enough NFT cards are loaded into the browser), the while loop ends, and all the cards on the screen are targeted (using the very same XPath used in the get_current_card_count() method, I might add) and assigned to a new cards variable. That variable then gets returned to the __main__ block running the whole script, which I'll cover next.

    # grab all the cards now loaded into the browser by XPATH
    cards = self.driver.find_elements(By.XPATH, '//div[contains(@class, "Item_itemContent__1XIcH")]')
    return cards

There's quite a bit going on here, but hopefully it makes more sense now what these methods are doing. Time to test out this lazy loading script functionality and see how WebDriver does.

3. Run the Python script

All right, now that all the code and logic to load multiple sets of NFTrade cards into the browser and collect the data has been written, it's time to run the code.

To do that, I added an if __name__ == '__main__': block at the bottom of the file, which runs when the script is started from the terminal with the following command.

python for_sale_scraper.py

Here is what the __main__ block includes.

if __name__ == '__main__':
    scraper = ForSaleNFTScraper()

    # get all the cards from nftrade site
    cards = scraper.get_cards(max_card_count=200)

    # pprint the card data to ensure we're getting data
    # (needs `from pprint import pprint` at the top of the file)
    pprint(cards)

    print("Total cards collected:", len(cards))
    # more code here

The first thing the block does is create an instance of the ForSaleNFTScraper class and assign it to a variable named scraper. It then fetches all the cards and assigns them to a variable named cards by calling scraper.get_cards(max_card_count=200), passing 200 as the max_card_count argument.
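The snippets here never shut the browser down, so a headless Chrome process can be left running after the script ends. WebDriver's quit() method closes the browser and ends the session, and one way to make sure it always runs, even when scraping raises an error, is a small context manager. This is a sketch under that assumption; closing_driver is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def closing_driver(driver):
    """Yield the driver, then always call driver.quit(), even after an error."""
    try:
        yield driver
    finally:
        driver.quit()
```

In the script it might be used as "with closing_driver(scraper.driver):" wrapped around the get_cards() call and any code that reads from the cards, since elements can no longer be read once the session has ended.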

After this step, as a sanity check, I used the Python pprint() and print() functions to print out all the card data and a count of the total cards fetched by the get_cards() method, and to ensure all the info I needed to include in the CSV (NFT price, NFT ID, etc.) was available to me. Here's a screenshot of some of the data printed out in my console helping me know my code is doing what I expect.
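If printing the list shows WebElement objects rather than readable values, the visible text of each card can be pulled out through the element's text property. The helper below is a minimal sketch; summarize_cards and its limit parameter are assumptions for illustration.

```python
def summarize_cards(cards, limit=5):
    """Return the visible text of the first `limit` cards, one string per card."""
    return [card.text for card in cards[:limit]]
```

Calling pprint(summarize_cards(cards)) would then print a handful of card texts, which is usually enough to confirm that the price and ID are present. The text has to be read while the browser session is still open.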

[Screenshot: example of the raw NFT card data gathered from the get_cards() method, printed in the terminal]

[Screenshot: count of the NFTs collected from the get_cards() method]

Since I set my max_card_count to 200, but NFTrade loads NFTs in batches of 75 at a time, it makes sense that the total count of NFTs scraped off the page equals 225.
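The arithmetic behind that number can be sketched in a couple of lines: the loop only stops after a whole batch has loaded, so the final count is the smallest multiple of the batch size that is at least max_card_count (assuming the collection has that many cards). The function name cards_loaded is made up for this example.

```python
import math


def cards_loaded(max_card_count, batch_size=75):
    """Smallest multiple of batch_size that is at least max_card_count."""
    return math.ceil(max_card_count / batch_size) * batch_size


print(cards_loaded(200))  # 225: three batches of 75
```

By the same logic, the default max_card_count of 500 would end up loading 525 cards, or seven batches.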

And after verifying the right data's there (and the right amount of data as well), I continued on extracting the data, calculating the current price in USD for each NFT, and assembling a CSV of all the data. But I'll save those steps for future blog posts.

Conclusion

Building a Python-based website scraper to create a CSV of NFTs available for sale on NFTrade was a unique challenge I learned a lot of new things from.

After my first attempt failed due to NFTrade dynamically lazy loading NFTs in batches of 75 onto the page as a user scrolled further down, I had to come up with a more creative solution that would allow me to trigger the site to load more cards on the page first, then grab the data on the cards for sale.

I found the solution I was looking for with the help of a Python package called Selenium Python. Selenium Python is a powerful Python-based API that allows users to write scripts or automated tests leveraging Selenium WebDriver. And it was up to the task at hand: with just a few methods I was able to specify as many NFTs as I wanted loaded on the page before scraping and collecting all their data all at once.

Check back in a few weeks. I’ll be writing more posts about the problems I had to solve while building this Python website scraper, in addition to other topics on JavaScript or something else related to web development.

If you’d like to make sure you never miss an article I write, sign up for my newsletter here: https://paigeniedringhaus.substack.com

Thanks for reading. I hope seeing how to make a Python Selenium WebDriver load data onto a dynamic webpage before scraping it comes in handy for you in the future.
