
Amazon Price Scraping with AI.

This article was originally posted on Crawlbase Blog.
If you're overwhelmed with manual price data extraction and want to learn how to scrape prices from Amazon using AI, you're in the right place. As you read through this blog, we'll focus on automated scraping techniques, especially those involving automated XPath retrieval. We'll walk you through setting up your scraping environment, using AI to get precisely the data you need, and mastering the art of automated data retrieval with XPath. Whether you're a small online store or a big e-commerce giant, these techniques will be your superpowers in the digital world.

Importance of Automated Scraping

To scrape a page, you need to know the CSS selector or the XPath of the elements you want. If you are scraping thousands of websites, you would have to figure out the selectors for each of them manually, and whenever a page changes, you would have to update them as well. This is where automated web scraping comes into play, offering a pivotal advantage to those who harness its capabilities effectively.

Why Automated Scraping Matters in E-commerce

Automated scraping is like a superpower in the world of online businesses, especially in e-commerce. It helps businesses collect data quickly and accurately, which is crucial for success. Here's why it's so important:


  • Speedy Data Gathering: Automated scraping allows businesses to grab important data such as product prices, stock availability, and what competitors are up to. This speed is like having a secret weapon, letting businesses make quick, smart decisions that keep them ahead of the competition.
  • Always Keeping an Eye on Competitors: In e-commerce, things change fast. It is essential to keep a close watch on what your competitors are doing with their prices and products. Automated scraping is like having a robot assistant that watches your competition 24/7, so you're always aware of the situation.
  • Data-Powered Product Insights: Want to know what products are trending, what customers like, and what the market wants? Automated scraping can help you dive deep into this information, giving you superpowers to develop products and target your marketing.
  • Adaptability to Layout Changes: Websites sometimes update their look and structure. Automated scraping can handle these changes using CSS selectors, ensuring you can continue collecting data without disruptions.
  • Great Shopping Experiences: Shoppers love up-to-date and accurate information when they visit an online store. Automated scraping ensures your product data is always current and trustworthy, making your customers happy.

The Advantages of AI-Driven Price Scraping on Amazon

Now, let's talk about using AI-driven scraping on Amazon, especially when combined with automated XPath retrieval. It's like leveling up your superpowers:

  • Handling Lots of Data: AI-driven scraping and automated XPath retrieval are great at dealing with vast amounts of data. Whether you have many products to track, millions of customer reviews, or many competitors to keep an eye on, this technology can handle the load.
  • Precision and Trustworthiness: AI models, like the ones powered by OpenAI GPT, are like expert data detectives. They find exactly what you need with incredible accuracy, so you can always trust the information you get.
  • Saving Time and Resources: Automated scraping means you don't have to do everything manually. It's like having a helper that works around the clock, saving you time and resources. You can use that extra time for important decisions.
  • Adapting to Changes: Websites like Amazon can change their layout or structure. AI models can adapt, so you don't lose your superpower even when websites update.

Let's explore the practical tools and methods to give your business the upper hand in online retail.

Exploring Necessary APIs

Before delving into the technical intricacies of automated scraping, you should acquaint yourself with the fundamental APIs underpinning your scraping journey. This section covers the two APIs central to automated web scraping: the Crawlbase Crawling API and the OpenAI GPT API.

Crawlbase Crawling API

The Crawlbase Crawling API is a critical foundation for web data extraction endeavors. It offers the ability to retrieve HTML content from web pages, which is an indispensable tool for automated scraping. Here's a technical overview of the Crawlbase Crawling API:

  • Web Data Extraction: Crawlbase is designed to facilitate extracting HTML content from web pages. It accommodates the intricate structures of web pages, ensuring you can access the precise data required for your scraping tasks, such as price extraction and content analysis.
  • IP Rotation: Crawlbase incorporates an essential feature of IP rotation. This feature provides enhanced anonymity, scalability, and reliability by cycling through multiple IP addresses during scraping operations. It helps evade IP-based restrictions and ensures uninterrupted data extraction.
  • Scalability: The Crawlbase Crawling API is engineered to manage scraping tasks of varying scales. Whether you aim to scrape a single web page or thousands of pages, Crawlbase can efficiently handle requests, making it ideal for large-scale data extraction projects.
  • Ease of Integration: Leveraging Crawlbase's capabilities is straightforward, thanks to its Python library. This integration enables the effortless execution of requests, retrieval of content, and seamless inclusion within your data analysis pipelines.

OpenAI GPT API

The OpenAI GPT API represents a cornerstone for natural language understanding and generation. It opens up various possibilities for tasks related to interpreting and generating text-based data. Here's a technical perspective on the OpenAI GPT API:

  • Natural Language Understanding: OpenAI's GPT models are meticulously trained for comprehensive language understanding. They excel in interpreting queries, generating text, and assisting in tasks that demand linguistic comprehension, making them a powerful tool for generating XPath expressions.
  • Language Generation: The GPT API exhibits exceptional proficiency in generating human-like text. This capability is invaluable for tasks such as chatbot responses, content generation, and crafting data extraction instructions, enhancing automation and flexibility in scraping projects.
  • Versatility: OpenAI's GPT models are exceedingly versatile and adaptable to diverse text-related tasks, making them an invaluable addition to your automated scraping toolkit. Their adaptability paves the way for a wide array of applications within the domain of web scraping.

In the subsequent sections, we will harness the power of these APIs, merging them seamlessly to create an efficient and streamlined process for the extraction of product prices from Amazon's search pages.

Understanding Amazon's Search Page Structure

To become proficient in automated scraping, it's crucial to understand the structure of the web pages you intend to scrape. In this section, we'll take a closer look at Amazon's search page structure, breaking it down into its essential components and helping you identify the specific data you need.

Breaking Down an Amazon Search Page

Amazon's search pages are meticulously designed to give users a user-friendly, efficient, and visually pleasing shopping experience. Understanding the structure of these pages is the first step toward successful automated scraping:


  • Search Bar: At the top of the page, you'll find the search bar, where users enter their queries. This is where the search journey begins, with users seeking specific products or categories.
  • Filters and Sort Options: On the left side, you'll see various filter and sorting options. Users can refine their search results by selecting categories, brands, price ranges, and more. Recognizing these elements is important as they influence the search results.
  • Search Results Grid: The central part of the page is occupied by the search results grid. This grid displays a list of products matching the user's query. Each product listing typically includes an image, title, price, ratings, and additional information.
  • Pagination: At the bottom of the search results, you'll often find pagination controls, allowing users to navigate through multiple pages of results. Understanding how Amazon handles pagination is crucial to gathering data from all pages for scraping purposes.
  • Product Details Page Links: Each product listing has a link directing users to the product's details page. When scraping Amazon's search pages, these links can be valuable for collecting deeper information about specific products.
  • Footer: The footer contains links to various Amazon policies, customer service, and additional resources. It's the final section of the page.

Identifying the Data You Need

Amazon's search pages are rich in data, but not all may be relevant to your specific scraping goals. Identifying the precise data elements you require is essential for efficient and focused scraping:

  • Product Information: Determine which product details are vital for your objectives. This may include product titles, prices, customer ratings, and descriptions. Identifying these elements helps you extract the right information.
  • Product URLs: If you intend to delve deeper into specific products, capturing the URLs to individual product pages is crucial. This allows you to access more detailed information for each item.
  • Pagination Control: Understanding how pagination is structured on Amazon's search pages is vital to collecting data from multiple result pages. You'll need to locate and use the appropriate elements to navigate the pages efficiently (a minimal sketch of building paginated search URLs follows this list).
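
As a minimal sketch, and assuming Amazon exposes the result page number through a page query parameter (verify this against the pagination links on a live search page, since URL structures can change), paginated search URLs can be generated like this:

# Hypothetical sketch: build paginated Amazon search URLs.
# Assumes the results use a "page" query parameter; verify this against
# the pagination links on a live search page before relying on it.
base_search_url = 'https://www.amazon.com/s?k=macbook'

def build_page_urls(base_url, num_pages):
    return [f"{base_url}&page={page}" for page in range(1, num_pages + 1)]

for url in build_page_urls(base_search_url, 3):
    print(url)  # e.g. https://www.amazon.com/s?k=macbook&page=2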

As we progress through this blog, we'll apply this knowledge to our automated scraping techniques. You'll learn how to locate and extract the data you need from Amazon's search pages, enabling you to gather valuable insights and make data-driven decisions in the world of e-commerce.

How to Scrape Prices from Amazon: Getting Prepared

Before embarking on your automated scraping journey, you must ensure you have the right tools and setup. This section will cover the initial preparation steps, including installing Python, creating a virtual environment, and acquiring the necessary tokens for Crawlbase and OpenAI.

Installing Python and Essential Libraries

Python is the cornerstone of web scraping projects, and several libraries will play a pivotal role in your journey. Let's start by ensuring you have Python and the following libraries installed:

Python Installation: If you don't have Python installed, download the latest version from the official Python website and follow the installation instructions for your operating system.

Required Libraries: The following libraries are required to follow this blog successfully.

  1. Crawlbase Python Library: To interact with the Crawlbase Crawling API, you'll need the Crawlbase Python library. This library simplifies the process of making requests to Crawlbase for web scraping. Install it with:
pip install crawlbase
  2. OpenAI Python Library: As you'll be utilizing OpenAI's GPT to get XPath, you need to install the OpenAI Python library. This library allows you to interact with OpenAI's APIs effectively. Install it using:
pip install openai
  3. lxml: The Python lxml library is a robust and efficient tool for parsing and working with XML and HTML documents. It provides a powerful and user-friendly interface for navigating and manipulating structured data. Install it with:
pip install lxml

Creating a Virtual Environment

Creating a virtual environment is a best practice in Python development. It ensures that your project has its isolated environment with the required packages. Here's how to set up a virtual environment:

  1. Install Virtualenv: If you don't have virtualenv installed, you can do so using pip:
pip install virtualenv
  2. Create a Virtual Environment: Navigate to your project directory and run the following command to create a virtual environment:
virtualenv venv
  3. Activate the Virtual Environment: Depending on your operating system, the activation command may differ:
  • On Windows:
venv\Scripts\activate
  • On macOS and Linux:
source venv/bin/activate

Your virtual environment is now set up and activated. You can install project-specific packages without interfering with your system-wide Python installation.

Acquiring Tokens for Crawlbase and OpenAI

To use the Crawlbase Crawling API and OpenAI GPT API, you'll need to obtain the necessary tokens or API keys. Here's how to acquire them:

Crawlbase Token: Visit the Crawlbase website and sign up for an account. Once registered, you'll find your API token or key in the documentation. Crawlbase provides two types of tokens: the Normal Token (TCP) for static websites and the JavaScript Token (JS) for dynamic or JavaScript-driven websites. For Amazon, we need a JS token. Keep this token safe, as it will be essential for accessing the Crawlbase API. For an easy start, Crawlbase gives 1000 free requests for its Crawling API.

OpenAI GPT Token: Visit the OpenAI website and create an account if you haven't already. Access your API token from your OpenAI account settings. This token is required for making requests to the OpenAI GPT API.
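
Rather than hard-coding these tokens in your scripts, a common practice is to keep them in environment variables and read them at runtime. Here is a minimal sketch; the variable names CRAWLBASE_JS_TOKEN and OPENAI_API_KEY are illustrative choices, not names required by either service:

import os

# Read the tokens from environment variables (the names used here are
# arbitrary examples, not required by Crawlbase or OpenAI).
crawlbase_token = os.environ.get('CRAWLBASE_JS_TOKEN')
openai_api_key = os.environ.get('OPENAI_API_KEY')

if not crawlbase_token or not openai_api_key:
    raise RuntimeError("Set CRAWLBASE_JS_TOKEN and OPENAI_API_KEY before running the scraper.")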

In the following sections of this blog, we will guide you through the practical steps of scraping product prices from Amazon's search pages efficiently and effectively. Stay with us as we explore the tools and techniques that will give you a competitive edge in e-commerce.

Automating Amazon Price Scraping

Now that you're well-prepared and equipped with the necessary tools and tokens, it's time to dive into the heart of automated scraping. This section will guide you through the detailed steps of scraping product prices from Amazon's search pages using the Crawlbase Crawling API and OpenAI.

Retrieving Amazon Search Page HTML

The first step in automating price scraping is to obtain the HTML content of Amazon's search pages. This HTML is where the product information, including prices, is embedded. Like many modern websites, Amazon's search pages rely on JavaScript and AJAX to load their content, which can make them tricky to scrape. With the Crawlbase Crawling API, you have the tools to handle these challenges effectively. Below is a Python script that retrieves the HTML of an Amazon search page for the query "macbook".

from crawlbase import CrawlingAPI

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=macbook'

# options for Crawling API
options = {
 'page_wait': 2000,
 'ajax_wait': 'true'
}

# Make a request to scrape the Amazon search page with options
response = api.get(amazon_search_url, options)

# Check if the request was successful
if response['status_code'] == 200:
    # Extracted HTML content after decoding byte data
    html_content = response['body'].decode('latin1')

    # Save the HTML content to a file
    with open('output.html', 'w', encoding='utf-8') as file:
      file.write(html_content)
else:
    print("Failed to retrieve the page. Status code:", response['status_code'])

When using the JavaScript token with the Crawlbase API, you can specify some special parameters to ensure that you capture the dynamically rendered content accurately. You can read about them here.

  • page_wait: This optional parameter allows you to specify the number of milliseconds to wait before the browser captures the resulting HTML code. Use this parameter in situations where a page takes time to render or when AJAX requests need to be loaded before capturing the HTML.
  • ajax_wait: Another optional parameter for the JavaScript token. It lets you specify whether to wait for AJAX requests to finish before receiving the HTML response. This is important when the content relies on AJAX requests.

output.html Preview:

Using OpenAI to Extract Price XPath

In our quest to automate the extraction of product prices from Amazon's search pages, we turn to the remarkable capabilities of OpenAI, specifically the GPT (Generative Pre-trained Transformer) models. Let's update the previous example and add code that uses OpenAI to generate precise XPath expressions for extracting product prices from the HTML content:

import openai
import asyncio
from crawlbase import CrawlingAPI

# Replace 'your_openai_api_key' with your OpenAI API key
openai.api_key = 'your_openai_api_key'

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=macbook'

# Options for Crawling API
options = {
    'page_wait': 2000
}

async def get_xpath(html):
    # Note: a full Amazon search page can exceed the model's context window,
    # so you may need to pass only the relevant HTML fragment here.
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "As an assisting entity, your role is to adeptly identify the comprehensive XPath expression, encompassing the path from the HTML source to the product price element within the prompt. Your response should consist solely of the complete XPath, devoid of supplementary explanations, notes, or any additional text. Multiple repetitions of the same answer are permissible."},
            {"role": "user", "content": html}
        ]
    )
    return response.choices[0].message["content"]

async def main():
    # Make a request to scrape the Amazon search page with options
    response = api.get(amazon_search_url, options)

    # Check if the request was successful
    if response['status_code'] == 200:
        # Extracted HTML content after decoding byte data
        html_content = response['body'].decode('latin1')
        xpath = await get_xpath(html_content)
        print(xpath)
    else:
        print("Failed to retrieve the page. Status code:", response['status_code'])

if __name__ == "__main__":
    asyncio.run(main())

This code is the bridge between your HTML content and the precise XPath expressions needed to locate and extract product prices. It sends the HTML to OpenAI's GPT-3.5 Turbo model along with instructions and receives a generated XPath expression tailored to your scraping needs. The generated XPath is then readily available for your web scraping tasks, streamlining the process and enhancing precision.

Scraping the Amazon Product Prices

To take your scraping journey to the next level, we'll enhance the previous script by adding a function called find_max_price. This function uses the Python lxml library to parse the HTML content and select all product prices based on the generated XPath expression. It then cleans the selected price strings, converts them to numerical values, and identifies the highest price using the max() function. Finally, the script prints the highest MacBook price found on the Amazon search page, providing you with a valuable data point.

import openai
import asyncio
from lxml import html
from crawlbase import CrawlingAPI

# Replace 'your_openai_api_key' with your OpenAI API key
openai.api_key = 'your_openai_api_key'

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=macbook'

# Options for Crawling API
options = {
    'page_wait': 2000
}

async def get_xpath(page_html):
    # Note: a full Amazon search page can exceed the model's context window,
    # so you may need to pass only the relevant HTML fragment here.
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Your role as an assisting entity is to proficiently pinpoint the all-encompassing XPath expression that traces the path from the HTML source to the product price elements within the prompt. Your response should solely include the complete XPath, without any extra explanations, notes, or additional text. Repeating the same answer multiple times is acceptable."},
            {"role": "user", "content": page_html}
        ]
    )
    return response.choices[0].message["content"]

def find_max_price(html_content, xpath):
    parsed_html = html.fromstring(html_content)
    # Use the generated XPath expression to select and extract product prices
    price_strings = parsed_html.xpath(xpath)

    # Strip currency symbols and thousands separators, then convert to float
    prices = [float(price.replace('$', '').replace(',', '').strip()) for price in price_strings if price.strip()]

    # Find the highest price
    highest_price = max(prices)

    # Print the highest price
    print("The highest macbook price is:", highest_price)

async def main():
    # Make a request to scrape the Amazon search page with options
    response = api.get(amazon_search_url, options)

    # Check if the request was successful
    if response['status_code'] == 200:
        # Extracted HTML content after decoding byte data
        html_content = response['body'].decode('latin1')
        xpath = await get_xpath(html_content)

        find_max_price(html_content, xpath)

    else:
        print("Failed to retrieve the page. Status code:", response['status_code'])

if __name__ == "__main__":
    asyncio.run(main())

Example Output:

The highest macbook price is: 5,299

With this addition, your scraping script not only retrieves data but also processes it to provide you with valuable insights, such as the highest MacBook price found on the Amazon search page. You may also want to handle pagination while scraping and save the results in a proper format. For a full walkthrough, you can refer to this blog; a rough sketch of the idea is shown below. Enjoy your enhanced scraping capabilities!
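
As a rough sketch of what that could look like, building on the script above: the loop below assumes a page query parameter for pagination and uses an illustrative price XPath (neither is guaranteed to match Amazon's current markup), then writes the extracted prices to a CSV file.

import csv
from lxml import html
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })
options = { 'page_wait': 2000, 'ajax_wait': 'true' }

# Hypothetical values: the "page" parameter and this XPath are illustrative,
# not guaranteed to match Amazon's current markup.
base_url = 'https://www.amazon.com/s?k=macbook'
price_xpath = '//span[@class="a-price-whole"]/text()'

with open('prices.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['page', 'price'])

    for page in range(1, 4):
        response = api.get(f"{base_url}&page={page}", options)
        if response['status_code'] != 200:
            continue
        parsed = html.fromstring(response['body'].decode('latin1'))
        for price in parsed.xpath(price_xpath):
            writer.writerow([page, price.replace(',', '').strip()])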

Final Words

I hope this blog helps you automate your scraping efforts and saves you a lot of time. If you're interested in scraping more Amazon product data or its search pages, consider exploring the following guides:

📜 How to Scrape Amazon Reviews
📜 How to Scrape Amazon Search Pages
📜 How to Scrape Amazon Product Data

You can find additional guides like scraping Amazon ASINs, Amazon reviews in Node, Amazon images, and Amazon data in Ruby. Additionally, for e-commerce scraping guides beyond Amazon, check out our tutorials on scraping product data from Walmart, eBay, and AliExpress.

Feel free to reach out to us here if you need further assistance or have additional questions.

Frequently Asked Questions

Q: What should I do with the scraped price data from Amazon?

What you do with the scraped price data from Amazon largely depends on your intentions and compliance with relevant legal regulations. If you plan to use the data for personal use or analysis, you may typically do so as long as it aligns with Amazon's terms and conditions and the applicable web scraping laws in your region. However, sharing, selling, or publishing scraped data, especially for commercial purposes, often requires explicit permission from Amazon.

Q: How can automated scraping benefit my e-commerce business?

Automated scraping offers several advantages for e-commerce businesses. It allows you to monitor competitors' prices and product offerings continuously. It provides in-depth insights into product trends, customer preferences, and market demands, which are invaluable for product development and targeted marketing. Additionally, accurate and up-to-date product information on your e-commerce website ensures a seamless shopping experience for customers.

Q: Can I adapt automated scraping to handle changes in website layouts?

Yes, automated scraping can adapt to changes in website layouts. When websites update their design or structure, automated scraping can use techniques such as CSS selectors and flexible XPath expressions to ensure that data collection remains uninterrupted. This adaptability is valuable, allowing you to maintain accurate and up-to-date data even when websites change their appearance.

Q: What are the legal and ethical considerations when using web scraping for data collection?

The legal and ethical aspects of web scraping are essential to consider. The legality of web scraping varies by jurisdiction, and it's crucial to respect website terms of service. Ethical scraping practices involve not overloading a website with requests, avoiding scraping private or sensitive information, and providing proper attribution when using scraped data. Seeking legal advice and being aware of privacy regulations in your region can help ensure compliance with relevant laws.
