Artur Chukhrai for SerpApi

Posted on Dec 28, 2022 • Edited on Feb 6, 2023 • Originally published at serpapi.com

Scrape Google Product Specs Results with Python

#discuss #security

What will be scraped
Full Code
Preparation
Code Explanation
Using Google Specs Results API from SerpApi
Links

What will be scraped

📌Note: The DIY solution scrapes only the specs. If you also want to extract the product title, extensions, description, and other data from this page, check out the Scrape Google Product Page with Python blog post, where I describe how to extract this data.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import requests, json
from parsel import Selector


def get_specs_results(url, headers):
    html = requests.get(url, headers=headers)
    selector = Selector(html.text)

    data = {}

    for result in selector.css('tr'):
        if result.css('.c18cu'):
            category = result.css('.c18cu::text').get() 
            data[category] = {}
        else:
            key = result.css('.ipBhab::text').get()
            value = result.css('.AnDf0c::text').get()
            data[category][key] = value

    return data


def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/specs?hl=en&gl=us'

    specs_results = get_specs_results(URL, headers)

    print(json.dumps(specs_results, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

Preparation

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what's your user-agent.

The how to reduce the chance of being blocked while web scraping blog post covers basic and more advanced approaches.
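To see the difference between the default and a custom user-agent without sending anything over the network, here is a minimal sketch. It assumes only the requests library; the request is prepared but never sent, and the product URL is the one used in this post.

```python
import requests

# User-Agent that requests sends when no custom header is supplied.
default_ua = requests.utils.default_user_agent()

# Custom headers, the same idea as in the main() function of this post.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Prepare the request without sending it, so the final headers can be inspected offline.
session = requests.Session()
request = requests.Request(
    'GET',
    'https://www.google.com/shopping/product/14019378181107046593/specs',
    headers=headers,
)
prepared = session.prepare_request(request)

print(default_ua)                       # starts with "python-requests/"
print(prepared.headers['User-Agent'])   # the custom browser-like value
```

Headers passed with the request take precedence over the session defaults, which is why the prepared request carries the custom value.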

Code Explanation

Import libraries:

import requests, json
from parsel import Selector

Library	Purpose
`requests`	to make a request to the website.
`json`	to convert extracted data to a JSON-formatted string.
`Selector`	XML/HTML parser that has full XPath and CSS selectors support.

At the beginning of the main() function, the headers and URL are defined. This data is then passed to the get_specs_results(URL, headers) function to form a request and extract information.

The specs_results dictionary holds the data returned by this function. At the end of main(), the data is printed in JSON format:

def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/specs?hl=en&gl=us'

    specs_results = get_specs_results(URL, headers)

    print(json.dumps(specs_results, indent=2, ensure_ascii=False))
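The ensure_ascii=False argument matters when spec values contain non-ASCII characters. The short sketch below illustrates the effect; the sample dictionary is made up for illustration and is not real scraped data.

```python
import json

# Hypothetical sample data with non-ASCII characters.
sample = {'General': {'Color': 'Onyx black matte', 'Note': 'café ☕'}}

# By default, non-ASCII characters are written as \uXXXX escape sequences.
escaped = json.dumps(sample, indent=2)

# With ensure_ascii=False, the characters are kept as-is, which is easier to read.
readable = json.dumps(sample, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary, so the choice only affects how the output looks.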

This code follows the common convention of using the __name__ == "__main__" construct:

if __name__ == "__main__":
    main()

The main() function is called only if the file is run directly. If the file is imported into another module, the condition is false and main() is not called. You can watch the video Python Tutorial: if __name__ == '__main__' for more details.
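A minimal sketch of the same idea, with a simplified main() that is used here only for illustration:

```python
def main():
    return 'main() was called'


# __name__ equals '__main__' only when this file is the script being run.
# When the file is imported, __name__ holds the module's own name instead,
# so the body of the if statement is skipped.
if __name__ == '__main__':
    print(main())
```

Importing this file from another module makes main() available without running it, which is handy for reusing functions such as get_specs_results() elsewhere.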

Let's take a look at the get_specs_results(url, headers) function mentioned earlier. This function takes the url and headers parameters to make a request. The HTML text of the response is then passed to Selector from the Parsel package for parsing:

def get_specs_results(url, headers):
    html = requests.get(url, headers=headers)
    selector = Selector(html.text)

The data dictionary, which will store the extracted data, is defined:

data = {}

Each product has different categories and parameters. To retrieve data, you first need to find all the table rows and iterate over them:

for result in selector.css('tr'):
    # data extraction will be here

There is a check inside the loop: if the current table row is a category, we create a key in the dictionary with an empty dictionary as its value; otherwise, we add the extracted data to the current category:

if result.css('.c18cu'):
    category = result.css('.c18cu::text').get() 
    data[category] = {}
else:
    key = result.css('.ipBhab::text').get()
    value = result.css('.AnDf0c::text').get()
    data[category][key] = value

Code	Explanation
`css()`	to access elements by the passed selector.
`::text` or `::attr(<attribute>)`	to extract textual or attribute data from the node.
`get()`	to return the first matched result as a string (or None if nothing matched).

I'm attaching a picture showing what I'm talking about:

At the end of the function, the data dictionary is returned.

return data

Output:

{
  "General": {
    "Device Type": "Mouse",
    "Ergonomic Design": "Yes",
    "Orientation": "Right-handed",
    "Width": "2.6 in",
    "Depth": "4.7 in",
    "Height": "1.5 in",
    "Weight": "2.08 oz",
    "Product Material": "ABS plastic",
    "Color": "Onyx black matte"
  },
  "Input Device": {
    "Connectivity Technology": "Wired",
    "Interface": "USB",
    "Movement Detection Technology": "Optical",
    "Buttons Qty": "6",
    "Movement Resolution": "8500 cpi",
    "Performance": "1 ms response time, 300 inches per second, polling rate: 1000 Hz, 35 G maximum acceleration, realtime sensitivity switching 200 - 8500 cpi",
    "Cable Length": "6 ft",
    "Features": "Scrolling wheel, polytetrafluoroethylene (PTFE) feet, waterproof, dustproof, Zero Hardware Acceleration, palm grip, detachable cable, 3 RGB zones, claw grip, RGB backlight, low profile design, SteelSeries TrueMove Core sensor, Golden Micro IP54 mechanical switches, AquaBarrier Protection, perforated housing, 80 million clicks, fingertip grip, SteelSeries mechanical switches, battle-tested shape, Holey Shell"
  },
  "Expansion / Connectivity": {
    "Interfaces": "1 x USB - 4 pin USB Type A"
  },
  "Miscellaneous": {
    "Included Accessories": "USB cable (Type-C to Type-A) SuperMesh",
    "Compliant Standards": "ICES-003 Class B, WEEE, EAC, EMC 2014/30/EU, FCC Part 15 Class B, ACMA, LVD 2014/35/EU, RED 2014/53/EU, RoHS 2015/863/EU, UKCA, NMB-003 Class B",
    "Package Details": "Color box",
    "Works With Chromebook": "Yes"
  },
  "Software / System Requirements": {
    "Software Included": "SteelSeries Engine",
    "OS Required": "Linux, Microsoft Windows 7 or later, Apple MacOS X 10.13 or later"
  }
}

Using Google Specs Results API from SerpApi

This section compares the DIY solution with our solution.

The main difference is that it's a quicker approach. Google Specs Results API bypasses blocks from search engines, and you don't have to create the parser from scratch and maintain it.

First, we need to install google-search-results:

pip install google-search-results

Import the necessary libraries for work:

from serpapi import GoogleSearch
import os, json

Next, we write the necessary parameters for making a request:

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'specs': True,                          # more specs; can also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}
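Since os.getenv() returns None when the variable is not set, it can help to fail early with a clear message. The sketch below is one possible way to do that; build_params is a hypothetical helper and the demo key is a placeholder, while the parameter names are the ones shown above.

```python
import os


def build_params(product_id, api_key=None):
    # Fall back to the API_KEY environment variable when no key is passed in.
    api_key = api_key or os.getenv('API_KEY')
    if not api_key:
        raise RuntimeError('Set the API_KEY environment variable before running the script.')

    return {
        'api_key': api_key,
        'engine': 'google_product',
        'product_id': product_id,
        'specs': True,
        'hl': 'en',
        'gl': 'us',
    }


# Placeholder key used only to show the resulting dictionary.
example_params = build_params('14019378181107046593', api_key='demo-key-for-illustration')
print(example_params['engine'], example_params['product_id'])
```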

We then create a search object where the data is retrieved from the SerpApi backend. In the results dictionary we get data from JSON:

search = GoogleSearch(params)   # where data extraction happens on the SerpApi backend
results = search.get_dict()     # JSON -> Python dict

In addition to all specs, the API also allows you to retrieve data about the product itself (title, extensions, description, etc.). All you need to do is access the corresponding key:

specs_results = {
    'product_results': results['product_results'],
    'specs_results': results['specs_results']
}

After reviewing the playground, you will be able to see which keys are available.
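Accessing a key with square brackets raises KeyError if that key is missing from the response. A more tolerant sketch is shown below; extract_specs is a hypothetical helper, and the check for an 'error' key is an assumption about the response format that should be verified against the SerpApi documentation.

```python
def extract_specs(results):
    # Assumption: a failed search carries an 'error' key with a message.
    if 'error' in results:
        raise RuntimeError(f"SerpApi returned an error: {results['error']}")

    # .get() returns the default instead of raising KeyError for a missing key.
    return {
        'product_results': results.get('product_results', {}),
        'specs_results': results.get('specs_results', {}),
    }


# Plain dictionaries stand in for search.get_dict() here.
complete = extract_specs({
    'product_results': {'title': 'SteelSeries Aerox 3 2022 Edition Wired Gaming Mouse, Onyx'},
    'specs_results': {'general': {'device_type': 'Mouse'}},
})
partial = extract_specs({'product_results': {'title': 'Only the product block'}})

print(complete['specs_results'])
print(partial['specs_results'])
```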

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'specs': True,                          # more specs; can also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}

search = GoogleSearch(params)               # where data extraction happens on the backend
results = search.get_dict()                 # JSON -> Python dict

specs_results = {
    'product_results': results['product_results'],
    'specs_results': results['specs_results']
}

print(json.dumps(specs_results, indent=2, ensure_ascii=False))

Output:

{
  "product_results": {
    "product_id": 14019378181107046593,
    "title": "SteelSeries Aerox 3 2022 Edition Wired Gaming Mouse, Onyx",
    "reviews": 68,
    "rating": 4.5,
    "extensions": [
      "Ergonomic",
      "Black",
      "Bluetooth",
      "Wireless",
      "Mouse"
    ],
    "description": "Aerox 3 is optimized to get the fastest swipes in gaming, so you beat the competition to the trigger. When milliseconds count, speed matters. Uniquely perforated outer casing and slimmed down electronics are precision engineered to be stunningly lightweight with new levels of durability. Engineered to safeguard the interior circuitry from virtually all types of environmental damage, Aerox is the first-ever gaming mouse to receive an IP54 rating, providing water resistance and protection from dust, dirt, oil, fur, and more. Enjoy true 1-to-1 tracking in a lightweight sensor design. With a range from 200 to 8,500 CPI (counts per inch), you can play with stunning accuracy, speed, and consistency. Rated for 80M clicks (3X the durability found in other gaming mice), upgraded dust and water-resistant switches provide flawless clicks even under the most demanding conditions, prevent double clicks, and allow to you make rock-solid next-level shots."
  },
  "specs_results": {
    "general": {
      "device_type": "Mouse",
      "ergonomic_design": "Yes",
      "orientation": "Right-handed",
      "width": "2.6 in",
      "depth": "4.7 in",
      "height": "1.5 in",
      "weight": "2.08 oz",
      "product_material": "ABS plastic",
      "color": "Onyx black matte"
    },
    "input_device": {
      "connectivity_technology": "Wired",
      "interface": "USB",
      "movement_detection_technology": "Optical",
      "buttons_qty": "6",
      "movement_resolution": "8500 cpi",
      "performance": "1 ms response time, 300 inches per second, polling rate: 1000 Hz, 35 G maximum acceleration, realtime sensitivity switching 200 - 8500 cpi",
      "cable_length": "6 ft",
      "features": "Scrolling wheel, polytetrafluoroethylene (PTFE) feet, waterproof, dustproof, Zero Hardware Acceleration, palm grip, detachable cable, 3 RGB zones, claw grip, RGB backlight, low profile design, SteelSeries TrueMove Core sensor, Golden Micro IP54 mechanical switches, AquaBarrier Protection, perforated housing, 80 million clicks, fingertip grip, SteelSeries mechanical switches, battle-tested shape, Holey Shell"
    },
    "expansion_connectivity": {
      "interfaces": "1 x USB - 4 pin USB Type A"
    },
    "miscellaneous": {
      "included_accessories": "USB cable (Type-C to Type-A) SuperMesh",
      "compliant_standards": "ICES-003 Class B, WEEE, EAC, EMC 2014/30/EU, FCC Part 15 Class B, ACMA, LVD 2014/35/EU, RED 2014/53/EU, RoHS 2015/863/EU, UKCA, NMB-003 Class B",
      "package_details": "Color box",
      "works_with_chromebook": "Yes"
    },
    "software_system_requirements": {
      "software_included": "SteelSeries Engine",
      "os_required": "Linux, Microsoft Windows 7 or later, Apple MacOS X 10.13 or later"
    }
  }
}
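The two outputs differ mainly in key naming: the DIY solution keeps the labels as they appear on the page, while the API output uses snake_case keys. To compare them side by side, the DIY keys can be converted with a small helper. This is only a sketch: to_snake_case and normalize_keys are hypothetical names, and the rule is inferred from the sample outputs in this post, not taken from SerpApi.

```python
import re


def to_snake_case(label):
    # Lowercase the label, turn every run of other characters into "_", trim the ends.
    return re.sub(r'[^0-9a-z]+', '_', label.lower()).strip('_')


def normalize_keys(specs):
    # Apply the conversion to category names and to the keys inside each category.
    return {
        to_snake_case(category): {to_snake_case(key): value for key, value in fields.items()}
        for category, fields in specs.items()
    }


diy_sample = {
    'Expansion / Connectivity': {'Interfaces': '1 x USB - 4 pin USB Type A'},
    'Software / System Requirements': {'OS Required': 'Linux, Microsoft Windows 7 or later, Apple MacOS X 10.13 or later'},
}

print(normalize_keys(diy_sample))
```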