Python Selenium Infinite Scrolling

#python #selenium #scraping

Scraping web pages with infinite scrolling using Python, bs4 and Selenium.

Scroll function
This function takes two arguments: the driver in use and a timeout. The driver performs the scrolling, and the timeout is how long, in seconds, to wait for new content to load after each scroll.

import time


def scroll(driver, timeout):
    scroll_pause_time = timeout

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height
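
The loop above only exits when the page height stops changing, so on a feed that never runs out of content it would keep scrolling indefinitely. A minimal sketch of one way to bound it, assuming a hypothetical `scroll_limited` helper with a `max_scrolls` parameter (neither is part of the original function):

```python
import time


def scroll_limited(driver, pause, max_scrolls=50):
    """Scroll to the bottom repeatedly.

    Stops when the page height stops growing or after max_scrolls rounds
    (max_scrolls is an illustrative parameter, not part of the original).
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0

    while scrolls < max_scrolls:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1

        # Give the page time to load more content
        time.sleep(pause)

        # Stop as soon as the height no longer changes
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    return scrolls
```

Calling `scroll_limited(driver, 5, max_scrolls=20)` would scroll at most 20 times. The returned count can hint at whether the loop stopped because the page ended or because the cap was reached.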

Here is an example that uses the function:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from bs4 import BeautifulSoup

# Your options may be different
options = Options()
options.set_preference('permissions.default.image', 2)
options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)


def all_links(url):
    # Set up the driver. This one uses Firefox with some options and a path to geckodriver.
    # Selenium 4 takes the path through a Service object instead of executable_path.
    driver = webdriver.Firefox(options=options, service=Service(executable_path='./geckodriver'))
    # implicitly_wait sets how long (in seconds) element lookups wait before throwing an exception
    driver.implicitly_wait(30)
    # driver.get(url) opens the page
    driver.get(url)
    # Start scrolling by passing the driver and a pause of 5 seconds
    scroll(driver, 5)
    # Once scroll returns, bs4 parses the page_source
    soup_a = BeautifulSoup(driver.page_source, 'lxml')
    # Then we quit the driver, as soup_a already holds the parsed page source
    driver.quit()

    # Empty list to store the links
    links = []

    # Looping through all the a elements in the page source
    for link in soup_a.find_all('a'):
        # link.get('href') gets the href/url out of the a element
        links.append(link.get('href'))

    return links
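
The list returned above holds raw `href` values, so it can contain `None` (for `a` elements without an `href`), relative paths, duplicates, and non-web links such as `mailto:`. A minimal sketch of a clean-up step, assuming a hypothetical `clean_links` helper that is not part of the original code and uses only the standard library:

```python
from urllib.parse import urljoin, urlparse


def clean_links(links, base_url):
    """Resolve hrefs against base_url, keep http(s) links, drop duplicates.

    clean_links is an illustrative helper, not part of the original post.
    """
    seen = set()
    cleaned = []
    for href in links:
        # Skip None and empty strings
        if not href:
            continue
        # Turn relative paths into absolute URLs
        absolute = urljoin(base_url, href)
        # Drop schemes such as mailto: or javascript:
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Keep the first occurrence only, preserving order
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned
```

It could be used as `clean_links(all_links(url), url)`, so that relative links are resolved against the page they were scraped from.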

And that's how you scrape a page with infinite scrolling.

Top comments (8)

Milos-Blagojevic • Mar 26 '20

Hi, thanks so much for the post, it really helped me a lot.
Do you by any chance know why, when scrolling through a page that has a lot of content, I get different results? The page doesn't always end with the same content, even though it has clearly reached the end of the page.
For instance, I have been trying to scrape posts from an Instagram page that has more than 50000 posts, and almost every time I get different results and never get anywhere near 50000. The closest I got was around 20000, but most of the time it is between 5 and 10 thousand.
Do you think this is Instagram related, or does it have to do with my code?
Any thoughts will be appreciated.
Thanks in advance :)