DEV Community

Artur Chukhrai

Scrape Google Hotels with Python

What will be scraped


Full Code

from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selectolax.lexbor import LexborHTMLParser
import time, json

def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(service=service, options=options)

    # make the automated browser harder to detect
    stealth(
        driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36',
        languages=['en-US', 'en'],
        vendor='Google Inc.',
        webgl_vendor='Intel Inc.',
        renderer='Intel Iris OpenGL Engine',
    )

    driver.get(url)
    time.sleep(2)  # wait for the page to load (pause length is an assumption)

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.t8aeve').scrollHeight;
        }
        return getHeight();
    """)

    body = driver.find_element(By.TAG_NAME, 'body')

    while True:
        # scroll to the bottom of the results container
        driver.execute_script("window.scrollTo(0, document.querySelector('.t8aeve').scrollHeight)")
        time.sleep(1)  # pause length is an assumption

        # scroll up a little so that the next batch of results loads
        body.send_keys(Keys.PAGE_UP)
        time.sleep(2)  # pause length is an assumption

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.t8aeve').scrollHeight;
            }
            return getHeight();
        """)

        # stop when the height no longer changes
        if new_height == old_height:
            break

        old_height = new_height

    parser = LexborHTMLParser(driver.page_source)
    driver.quit()

    return parser

def scrape_google_hotels(parser):
    data = []

    for result in parser.root.css('.uaTTDe'):
        result_dict = {}

        if result.css_first('.hVE5 .ogfYpf'):
            result_dict['ad'] = result.css_first('.hVE5 .ogfYpf').text().replace('  ', ' ')

        result_dict['title'] = result.css_first('.QT7m7 h2').text()
        result_dict['link'] = '' + result.css_first('.PVOOXe').attributes.get('href')
        price = result.css_first('.OxGZuc .kixHKb span')
        result_dict['price'] = price.text() if price else None
        # assumes the number comes first in the text, e.g. '107 USD'
        result_dict['extracted_price'] = float(price.text().split()[0]) if price else None
        rating = result.css_first('.FW82K .KFi5wf')
        result_dict['rating'] = float(rating.text()) if rating else None
        reviews = result.css_first('.FW82K .jdzyld')
        result_dict['reviews'] = int(reviews.text()[2:-1].replace(',', '')) if reviews else None
        result_dict['extensions'] = [extension.css_first('.sSHqwe').text() for extension in result.css('.RJM8Kc .HlxIlc div, li')]
        result_dict['thumbnails'] = [
            thumbnail.attributes.get('src') if thumbnail.attributes.get('src') else thumbnail.attributes.get('data-src')
            for thumbnail in result.css('.NBZP0e .q5P4L')
        ]

        data.append(result_dict)

    return data

def main():
    URL = ''

    parser = scroll_page(URL)
    google_hotels = scrape_google_hotels(parser)

    print(json.dumps(google_hotels, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()



Install libraries:

pip install selenium selenium-stealth webdriver_manager selectolax


Code Explanation

Import libraries:

from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selectolax.lexbor import LexborHTMLParser
import time, json

| Library | Purpose |
|---|---|
| webdriver | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
| selenium-stealth | to make selenium harder for websites to detect (helps to avoid CAPTCHA and similar checks). |
| Service | to manage the starting and stopping of the ChromeDriver. |
| By | a set of supported locator strategies (By.ID, By.TAG_NAME, By.XPATH etc). |
| Keys | to simulate key presses (Keys.PAGE_UP, Keys.PAGE_DOWN, Keys.END etc). |
| LexborHTMLParser | a fast HTML5 parser with CSS selectors using the Lexbor engine. |
| time | to work with time in Python. |
| json | to convert extracted data to a JSON object. |

The next part of the code is divided into functions. Each function is described in the corresponding heading below.

Top-level code environment

The URL variable contains a link to the Google Hotels page:

URL = ''


Next, the URL is passed to the scroll_page(URL) function, which starts selenium in stealth mode and scrolls the page. The parser returned by this function is passed to the scrape_google_hotels(parser) function to extract all the data. Both functions are explained under the corresponding headings below.

parser = scroll_page(URL)
google_hotels = scrape_google_hotels(parser)


After all the data is retrieved, it is output in JSON format:

print(json.dumps(google_hotels, indent=2, ensure_ascii=False))

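If you would rather keep the results than print them, a minimal sketch along these lines could write the same list to a file. The save_results name and the file path are my own placeholders, not part of the original script:

```python
import json
from pathlib import Path


def save_results(results, path):
    """Write the scraped list to a UTF-8 JSON file and return the path."""
    path = Path(path)
    # ensure_ascii=False keeps non-ASCII characters (e.g. in hotel names) readable
    text = json.dumps(results, indent=2, ensure_ascii=False)
    path.write_text(text, encoding='utf-8')
    return path
```

Calling save_results(google_hotels, 'google_hotels.json') after scraping would then store the data next to the script.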

This code uses the boilerplate __name__ == "__main__" construct, which protects users from accidentally invoking the script when they didn't intend to. It indicates that the code is a runnable script:

def main():
    URL = ''

    parser = scroll_page(URL)
    google_hotels = scrape_google_hotels(parser)

    print(json.dumps(google_hotels, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()


The condition is true only when the file is run directly. If the file is imported into another module, __name__ is set to the module's name instead, so main() is not called.

You can watch the video Python Tutorial: if __name__ == '__main__' for more details.

Scroll page

The function takes the URL and returns a parser.

First, let's understand how pagination works on the Google Hotels page. Data does not load all at once. If the user needs more data, they simply scroll the page and the site loads another small batch of results.

In this case, the selenium library is used, which allows you to simulate user actions in the browser. For selenium to work, you need ChromeDriver, which can be downloaded manually or from code. In our case, the second method is used: ChromeDriverManager().install() downloads a matching driver binary and returns its path, and Service uses that path to control the start and stop of ChromeDriver:

service = Service(ChromeDriverManager().install())


You should also add options that switch off Chrome's automation switch and automation extension, so the browser looks less like an automated one:

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)


Now we can start webdriver:

driver = webdriver.Chrome(service=service, options=options)


After starting the webdriver, you need to pass it and other attributes to the stealth() function. This makes selenium harder to detect, which helps to avoid CAPTCHA challenges (including Cloudflare's):

stealth(
    driver,
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36',
    languages=['en-US', 'en'],
    vendor='Google Inc.',
    webgl_vendor='Intel Inc.',
    renderer='Intel Iris OpenGL Engine',
)


Pass the URL to the get() method and wait a few seconds for the site to load:

driver.get(url)
time.sleep(2)  # pause length is an assumption

In order to load all the data, you need to implement page scrolling. Simply scrolling to the bottom of the page does not load new data. For new data to load, you need to scroll up the page a little and wait. Yes, I'm also surprised that it works that way 🙂

The GIF below shows what it looks like:


The page scrolling algorithm looks like this:

  1. Find out the initial page height and write the result to the old_height variable.
  2. Scroll to the bottom of the page using the script.
  3. Scroll up a little by simulating a "Page Up" key press so that new data loads.
  4. Find out the new page height and write the result to the new_height variable.
  5. If new_height and old_height are equal, the algorithm is complete. Otherwise, write the value of new_height to old_height and return to step 2.

Getting the page height and scrolling are done by passing JavaScript code to the execute_script() method:

# 1 step
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('.t8aeve').scrollHeight;
    }
    return getHeight();
""")

body = driver.find_element(By.TAG_NAME, 'body')

while True:
    # 2 step
    driver.execute_script("window.scrollTo(0, document.querySelector('.t8aeve').scrollHeight)")
    time.sleep(1)  # pause length is an assumption

    # 3 step
    body.send_keys(Keys.PAGE_UP)
    time.sleep(2)  # pause length is an assumption

    # 4 step
    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.t8aeve').scrollHeight;
        }
        return getHeight();
    """)

    # 5 step
    if new_height == old_height:
        break

    old_height = new_height

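One thing the loop does not guard against is a page that never stops growing. As a rough sketch, the same five steps can be written as a helper with a cap on the number of rounds. The scroll_until_stable name and its get_height and scroll_once parameters are hypothetical; with selenium they would wrap the execute_script() calls and the "Page Up" key press, but here they are plain callables so the sketch runs without a browser:

```python
import time


def scroll_until_stable(get_height, scroll_once, pause=0.0, max_rounds=50):
    """Repeat scroll_once() until get_height() stops changing.

    Returns the number of scroll rounds performed.
    """
    old_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        time.sleep(pause)  # give the page time to load new results
        new_height = get_height()
        if new_height == old_height:
            return rounds
        old_height = new_height
    return max_rounds  # safety cap reached; the page may still be growing


# Simulated page that grows twice and then stops.
heights = iter([1000, 2000, 3000, 3000])
print(scroll_until_stable(lambda: next(heights), lambda: None))  # prints 3
```

The cap trades completeness for a guaranteed exit, so pick max_rounds generously if you expect a long results list.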

After all the data has been loaded, you need to process the HTML using LexborHTMLParser from selectolax, because its Lexbor parser is incredibly fast: about 186% faster than bs4 with the lxml backend when parsing data with 3000 iterations 5 times. Please note that selectolax does not currently support XPath:

parser = LexborHTMLParser(driver.page_source)


After all the operations are done, the driver is stopped and the parser is returned:

driver.quit()

return parser

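If something fails while scrolling, the browser would stay open because the driver is only stopped at the very end. One possible way around that, sketched here with a hypothetical quitting() helper and a FakeDriver stand-in (neither is part of selenium), is to wrap the driver in a context manager that always calls quit():

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a selenium driver so the sketch runs without a browser."""
    page_source = '<html></html>'
    closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with quitting(fake) as driver:
        raise RuntimeError('scrolling failed')  # simulate an error mid-scrape
except RuntimeError:
    pass
print(fake.closed)  # True: quit() ran even though an exception was raised
```

With a real driver, the body of the with block would hold the get(), scrolling and page_source steps.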

The function looks like this:

def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(service=service, options=options)

    stealth(
        driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36',
        languages=['en-US', 'en'],
        vendor='Google Inc.',
        webgl_vendor='Intel Inc.',
        renderer='Intel Iris OpenGL Engine',
    )

    driver.get(url)
    time.sleep(2)  # pause length is an assumption

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.t8aeve').scrollHeight;
        }
        return getHeight();
    """)

    body = driver.find_element(By.TAG_NAME, 'body')

    while True:
        driver.execute_script("window.scrollTo(0, document.querySelector('.t8aeve').scrollHeight)")
        time.sleep(1)  # pause length is an assumption

        body.send_keys(Keys.PAGE_UP)
        time.sleep(2)  # pause length is an assumption

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.t8aeve').scrollHeight;
            }
            return getHeight();
        """)

        if new_height == old_height:
            break

        old_height = new_height

    parser = LexborHTMLParser(driver.page_source)
    driver.quit()

    return parser


Scrape Google Hotels Listings

The function takes a Lexbor parser and returns the extracted data.

The data list is declared; the extracted data will be added to it:

data = []


In order to find all the hotels, you need to use the css() method and pass the .uaTTDe selector there. For each hotel, the result_dict dictionary is created, which will be supplemented with all the data about this hotel:

for result in parser.root.css('.uaTTDe'):
    result_dict = {}


First of all, you need to check if this is an advertisement. If so, then the key with the corresponding value is added to the result_dict dictionary:

if result.css_first('.hVE5 .ogfYpf'):
    result_dict['ad'] = result.css_first('.hVE5 .ogfYpf').text().replace('  ', ' ')


Data such as title and link are easily retrieved:

result_dict['title'] = result.css_first('.QT7m7 h2').text()
result_dict['link'] = '' + result.css_first('.PVOOXe').attributes.get('href')

| Code | Explanation |
|---|---|
| css_first() | to find the desired element. |
| text() | to retrieve the text content. |
| attributes | to get all attributes of an element. |

Some hotels have no price or rating. A conditional (ternary) expression handles such cases by storing None:

price = result.css_first('.OxGZuc .kixHKb span')
result_dict['price'] = price.text() if price else None
rating = result.css_first('.FW82K .KFi5wf')
result_dict['rating'] = float(rating.text()) if rating else None

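The sample output in this post shows prices both as "107 USD" and as "$174", and the split()-based conversion used for extracted_price in the full code assumes the number comes first. A small regex-based helper might be more forgiving. The extract_price name is hypothetical, and the sketch assumes US-style formatting (comma as thousands separator, dot as decimal point):

```python
import re


def extract_price(text):
    """Return the first number found in a price string as a float, or None."""
    if not text:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))


print(extract_price('107 USD'))  # 107.0
print(extract_price('$174'))     # 174.0
print(extract_price(None))       # None
```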

When retrieving reviews the user gets an element like this:

<span class="jdzyld XLC8M" aria-hidden="true"> (1,602)</span>


To extract the numeric value, slice the string to drop the leading space and the parentheses, then remove the comma:

reviews = result.css_first('.FW82K .jdzyld')
result_dict['reviews'] = int(reviews.text()[2:-1].replace(',', '')) if reviews else None

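The slice depends on the text always having exactly one leading space and one pair of parentheses. As an alternative sketch, the digits can be pulled out regardless of the surrounding characters. The parse_review_count name is hypothetical, and the helper assumes counts are written out in full (a shortened form such as "1.6K" would not be handled correctly):

```python
import re


def parse_review_count(text):
    """Return the digits in a string like ' (1,602)' as an int, or None."""
    digits = re.sub(r'\D', '', text or '')
    return int(digits) if digits else None


print(parse_review_count(' (1,602)'))  # 1602
print(parse_review_count(''))          # None
```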

Hotels may have a different number of extensions. Sometimes they may not exist at all. Therefore, when extracting them, list comprehension is used:

result_dict['extensions'] = [extension.css_first('.sSHqwe').text() for extension in result.css('.RJM8Kc .HlxIlc div, li')]


When extracting thumbnails, I noticed that the required value is stored in different attributes. Therefore, when extracting a thumbnail, we first check whether the src attribute exists. If it does, we retrieve its value. Otherwise, we extract the value from the data-src attribute:

result_dict['thumbnails'] = [
    thumbnail.attributes.get('src') if thumbnail.attributes.get('src') else thumbnail.attributes.get('data-src')
    for thumbnail in result.css('.NBZP0e .q5P4L')
]

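The same fallback can be written a little more compactly with the or operator, which returns its second operand when the first is missing or empty. This is only a sketch: pick_image_url is a hypothetical helper, and plain dictionaries stand in for the attributes of a parsed node:

```python
def pick_image_url(attributes):
    """Return src if it is set, otherwise data-src, otherwise None."""
    return attributes.get('src') or attributes.get('data-src')


print(pick_image_url({'src': 'a.jpg', 'data-src': 'b.jpg'}))  # a.jpg
print(pick_image_url({'src': None, 'data-src': 'b.jpg'}))     # b.jpg
print(pick_image_url({'class': 'q5P4L'}))                     # None
```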

Add the result_dict dictionary with all the extracted data to the data list:

data.append(result_dict)

At the end of the function, the data list is returned:

return data


The function looks like this:

def scrape_google_hotels(parser):
    data = []

    for result in parser.root.css('.uaTTDe'):
        result_dict = {}

        if result.css_first('.hVE5 .ogfYpf'):
            result_dict['ad'] = result.css_first('.hVE5 .ogfYpf').text().replace('  ', ' ')

        result_dict['title'] = result.css_first('.QT7m7 h2').text()
        result_dict['link'] = '' + result.css_first('.PVOOXe').attributes.get('href')
        price = result.css_first('.OxGZuc .kixHKb span')
        result_dict['price'] = price.text() if price else None
        rating = result.css_first('.FW82K .KFi5wf')
        result_dict['rating'] = float(rating.text()) if rating else None
        reviews = result.css_first('.FW82K .jdzyld')
        result_dict['reviews'] = int(reviews.text()[2:-1].replace(',', '')) if reviews else None
        result_dict['extensions'] = [extension.css_first('.sSHqwe').text() for extension in result.css('.RJM8Kc .HlxIlc div, li')]
        result_dict['thumbnails'] = [
            thumbnail.attributes.get('src') if thumbnail.attributes.get('src') else thumbnail.attributes.get('data-src')
            for thumbnail in result.css('.NBZP0e .q5P4L')
        ]

        data.append(result_dict)

    return data

Output

[
  {
    "ad": "From SpringHill Suites by Marriott Tulsa",
    "title": "SpringHill Suites by Marriott Tulsa",
    "link": "",
    "price": "107 USD",
    "rating": 4.0,
    "reviews": 428,
    "extensions": [
      "3-star hotel",
      "Free breakfast",
      "Free Wi-Fi",
      "Free parking",
      "Indoor pool",
      "Air conditioning",
      "Fitness center",
      "Business center"
    ],
    "thumbnails": [
      ...
    ]
  },
  {
    "ad": "From Candlewood Suites Tulsa Hills - Jenks, an IHG Hotel",
    "title": "Candlewood Suites Tulsa Hills - Jenks, an IHG Hotel",
    "link": "",
    "price": "107 USD",
    "rating": 4.5,
    "reviews": 32,
    "extensions": [
      "2-star hotel",
      "Free Wi-Fi",
      "Free parking",
      "Outdoor pool",
      "Air conditioning",
      "Fitness center",
      ...
    ],
    "thumbnails": [
      ...
    ]
  },
  {
    "title": "Hyatt Regency Tulsa Downtown",
    "link": "",
    "price": "$174",
    "rating": 4.3,
    "reviews": 2755,
    "extensions": [
      "4-star hotel",
      "Breakfast ($)",
      "Free Wi-Fi",
      "Parking ($)",
      "Air conditioning",
      "Fitness center",
      ...
    ],
    "thumbnails": [
      ...
    ]
  },
  ... other results
]



Top comments (2)

Chris Greening

Heyo Artur - super comprehensive tutorial, thanks for sharing! :D

I've never used selenium-stealth before, def going to check that out next time I'm working on a scraping project 😎

Artur Chukhrai

Thanks for the feedback 🙂

I recently discovered selenium-stealth for myself and was pleased with its capabilities. Now I also use it for my projects.
