
How to Effectively Scrape Noon Data

This blog was initially posted to Crawlbase Blog

Noon is one of the biggest e-commerce platforms in the Middle East, with millions of customers across the UAE, Saudi Arabia, and Egypt. It has a huge product catalog and thousands of daily transactions. Scraping Noon data helps businesses track prices, monitor competitors, and gain market insights.

But scraping Noon is tough. The website relies on dynamic content, JavaScript-rendered elements, and anti-bot measures that can block traditional scraping methods. We will use the Crawlbase Crawling API to extract search results and product details while handling these challenges.

This tutorial will show you how to scrape Noon data using Python with step-by-step examples for structured data extraction.

Let’s start!

Setting Up Your Python Environment

Before you start scraping Noon data, you need to set up your environment. This includes installing Python, installing the required libraries, and choosing the right IDE for coding.

Installing Python and Required Libraries

If you don’t have Python installed, download the latest version from python.org and follow the installation instructions for your OS.

Next, install the required libraries by running:

pip install crawlbase beautifulsoup4 pandas
  • Crawlbase – Bypasses anti-bot protections and scrapes JavaScript-heavy pages.
  • BeautifulSoup – Extracts structured data from HTML.
  • Pandas – Handles and stores data in CSV format.
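To confirm the installation worked, you can run a quick import check (optional; the printed versions will vary with your setup):

# Optional check that the three libraries import correctly.
import bs4
import crawlbase
import pandas

print("bs4", bs4.__version__)
print("pandas", pandas.__version__)
print("crawlbase imported:", crawlbase.__name__)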

Choosing an IDE for Scraping

Choosing the right Integrated Development Environment (IDE) makes scraping easier. Here are some good options:

  • VS Code – Lightweight and feature-rich with great Python support.
  • PyCharm – Powerful debugging and automation features.
  • Jupyter Notebook – Ideal for interactive scraping and quick data analysis.

With Python installed, the libraries set up, and your IDE ready, you can start scraping Noon data.

Scraping Noon Search Results

Scraping search results from Noon will give you the product names, prices, ratings, and URLs. This data is useful for competitive analysis, price monitoring, and market research. In this section, we will guide you through the process of scraping search results from Noon, handling pagination, and storing the data in a CSV file.

Inspecting the HTML for CSS Selectors

Before we start writing the scraper, we need to inspect the HTML structure of Noon’s search results page. This lets us find the CSS selectors needed to extract the product details.

  1. Go to Noon.com and search for a product (e.g., "smartphones").
  2. Right-click on any product and choose Inspect or Inspect Element in Chrome Developer Tools.

Screenshot displaying HTML structure for Noon search results

  3. Identify the following key HTML elements:
  • Product Title: Found in the <div data-qa="product-name"> tag.
  • Price: Found in the <strong class="amount"> tag.
  • Currency: Found in the <span class="currency"> tag.
  • Ratings: Found in the <div class="dGLdNc"> tag.
  • Product URL: Found in the href attribute of the <a> tag.

Once you identify the relevant elements and their CSS classes or IDs, you can proceed to write the scraper.
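Before wiring up the full scraper, you can sanity-check these selectors on a copy of the search results page saved from your browser. This is a minimal sketch; noon_search.html is a hypothetical filename for the page you save locally:

from bs4 import BeautifulSoup

# Load a search results page saved from the browser (hypothetical filename).
with open('noon_search.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Each product card sits in a span.productContainer inside the results grid.
first_card = soup.select_one('div.grid > span.productContainer')
if first_card:
    name = first_card.select_one('div[data-qa="product-name"]')
    price = first_card.select_one('strong.amount')
    print(name.text.strip() if name else 'name selector not matched')
    print(price.text.strip() if price else 'price selector not matched')
else:
    print('no product cards matched the selector')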

Writing the Noon Search Listings Scraper

Now that we've inspected the HTML structure, we can write a Python script to scrape product data from Noon. We’ll use the Crawlbase Crawling API to bypass anti-bot measures and BeautifulSoup to parse the HTML.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_noon_search(query, page):
    """Scrape search results from Noon."""
    url = f"https://www.noon.com/uae-en/search/?q={query}&page={page}"
    options = {'ajax_wait': 'true', 'page_wait': '5000'}

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch page {page}.")
        return None

def extract_product_data(html):
    """Extract product details from Noon search results."""
    soup = BeautifulSoup(html, 'html.parser')
    products = []

    for item in soup.select('div.grid > span.productContainer'):
        title = item.select_one('div[data-qa="product-name"]').text.strip() if item.select_one('div[data-qa="product-name"]') else ''
        price = item.select_one('strong.amount').text.strip() if item.select_one('strong.amount') else ''
        currency = item.select_one('span.currency').text.strip() if item.select_one('span.currency') else ''
        rating = item.select_one('div.dGLdNc').text.strip() if item.select_one('div.dGLdNc') else ''
        link = f"https://www.noon.com{item.select_one('a')['href']}" if item.select_one('a') else ''

        if title and price:
            products.append({
                'Title': title,
                'Price': price,
                'Currency': currency,
                'Rating': rating,
                'URL': link
            })

    return products

We first initialize the CrawlingAPI class with a token for authentication. The scrape_noon_search function fetches the HTML of a search results page from Noon based on a query and page number, handling AJAX content loading. The extract_product_data function parses the HTML using BeautifulSoup, extracting details such as product titles, prices, ratings, and URLs. It then returns this data in a structured list of dictionaries.
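For example, you could fetch the first page of results for "smartphones" and inspect what comes back (assuming your Crawlbase token is set above):

# Quick test run for a single page of results.
html = scrape_noon_search("smartphones", 1)
if html:
    products = extract_product_data(html)
    print(f"Found {len(products)} products on page 1")
    if products:
        print(products[0])  # first product as a dictionary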

Handling Pagination

Noon’s search results span across multiple pages. To scrape all the data, we need to handle pagination and loop through each page. Here's how we can do it:

def scrape_all_pages(query, max_pages):
    """Scrape multiple pages of search results."""
    all_products = []

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = scrape_noon_search(query, page)

        if html:
            products = extract_product_data(html)
            if not products:
                print("No more results found. Stopping.")
                break
            all_products.extend(products)
        else:
            break

    return all_products

This function loops through the specified number of pages, fetching and extracting product data until all pages are processed.
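If you scrape many pages, it is also good practice to pause briefly between requests. A minimal variation of the loop above with a delay (the 2-second default is an arbitrary choice, not a Noon or Crawlbase requirement) could look like this:

import time

def scrape_all_pages_politely(query, max_pages, delay=2):
    """Same loop as scrape_all_pages, with a pause between pages."""
    all_products = []

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = scrape_noon_search(query, page)
        if not html:
            break
        products = extract_product_data(html)
        if not products:
            print("No more results found. Stopping.")
            break
        all_products.extend(products)
        time.sleep(delay)  # brief pause before the next request

    return all_products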

Storing Data in a CSV File

Once we’ve extracted the product details, we need to store the data in a structured format. The most common and easy-to-handle format is CSV. Below is the code to save the scraped data:

import csv

def save_to_csv(data, filename):
    """Save scraped data to a CSV file."""
    keys = data[0].keys() if data else ['Title', 'Price', 'Currency', 'Rating', 'URL']

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

    print(f"Data saved to {filename}")

This function takes the list of products and saves it as a CSV file, making it easy to analyze or import into other tools.
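Since pandas is already installed as part of the setup, an equivalent alternative is to build a DataFrame and write it out directly. This is a small optional sketch, not part of the main script:

import pandas as pd

def save_to_csv_pandas(data, filename):
    """Alternative to save_to_csv using pandas."""
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Data saved to {filename} ({len(df)} rows)")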

Complete Code Example

Here is the complete Python script to scrape Noon search results, handle pagination, and store the data in a CSV file:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import csv

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_noon_search(query, page):
    """Scrape product listings from Noon search results."""
    url = f"https://www.noon.com/uae-en/search/?q={query}&page={page}"
    options = {'ajax_wait': 'true', 'page_wait': '5000'}

    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch page {page}.")
        return None

def extract_product_data(html):
    """Extract product details from Noon search results."""
    soup = BeautifulSoup(html, 'html.parser')
    products = []

    for item in soup.select('div.grid > span.productContainer'):
        title = item.select_one('div[data-qa="product-name"]').text.strip() if item.select_one('div[data-qa="product-name"]') else ''
        price = item.select_one('strong.amount').text.strip() if item.select_one('strong.amount') else ''
        currency = item.select_one('span.currency').text.strip() if item.select_one('span.currency') else ''
        rating = item.select_one('div.dGLdNc').text.strip() if item.select_one('div.dGLdNc') else ''
        link = f"https://www.noon.com{item.select_one('a')['href']}" if item.select_one('a') else ''

        if title and price:
            products.append({
                'Title': title,
                'Price': price,
                'Currency': currency,
                'Rating': rating,
                'URL': link
            })

    return products

def scrape_all_pages(query, max_pages):
    """Scrape multiple pages of search results."""
    all_products = []

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        html = scrape_noon_search(query, page)

        if html:
            products = extract_product_data(html)
            if not products:
                print("No more results found. Stopping.")
                break
            all_products.extend(products)
        else:
            break

    return all_products

def save_to_csv(data, filename):
    """Save scraped data to a CSV file."""
    keys = data[0].keys() if data else ['Title', 'Price', 'Currency', 'Rating', 'URL']

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

    print(f"Data saved to {filename}")

def main():
    """Main function to run the scraper."""
    query = "smartphones"  # Change the search term as needed
    max_pages = 5  # Set the number of pages to scrape
    all_products = scrape_all_pages(query, max_pages)
    save_to_csv(all_products, 'noon_smartphones.csv')

if __name__ == "__main__":
    main()

noon_smartphones.csv Snapshot:

noon_smartphones.csv output file snapshot

Scraping Noon Product Pages

Scraping product pages on Noon gives you the full product details, including descriptions, specifications, and customer reviews. This data helps businesses optimize their product listings and understand customer behavior. In this section, we will go through the process of inspecting the HTML structure of a product page, writing the scraper, and saving the data to a CSV file.

Inspecting the HTML for CSS Selectors

Before we write the scraper, we need to inspect the HTML structure of the product page to identify the correct CSS selectors for the elements we want to scrape. Here’s how to do it:

  1. Open a product page on Noon (e.g., a smartphone page).
  2. Right-click on a product detail (e.g., product name, price, description) and click on Inspect in Chrome Developer Tools.

Screenshot displaying HTML structure for Noon product pages

  3. Look for key elements, such as:
  • Product Name: Found in an <h1> tag whose data-qa attribute starts with "pdp-name-".
  • Price: Found in the <div data-qa="div-price-now"> tag.
  • Product Highlights: Found in the <div class="oPZpQ"> tag, specifically within an unordered list (<ul>).
  • Product Specifications: Found in the <div class="dROUvm"> tag, within a table's <tr> rows containing <td> elements.

Once you identify the relevant elements and their CSS classes or IDs, you can proceed to write the scraper.
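Selectors like h1[data-qa^="pdp-name-"] use the attribute-prefix operator ^=, which matches any element whose data-qa value starts with "pdp-name-" (the suffix varies per product). You can verify this on a locally saved product page; noon_product.html below is a hypothetical filename:

from bs4 import BeautifulSoup

# Load a product page saved from the browser (hypothetical filename).
with open('noon_product.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# ^= matches elements whose data-qa attribute starts with "pdp-name-".
name = soup.select_one('h1[data-qa^="pdp-name-"]')
price = soup.select_one('div[data-qa="div-price-now"]')
print(name.text.strip() if name else 'name selector not matched')
print(price.text.strip() if price else 'price selector not matched')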

Writing the Noon Product Page Scraper

Now, let's write a Python script to scrape the product details from Noon product pages using Crawlbase Crawling API and BeautifulSoup.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import re

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_product_page(product_url):
    """Scrape product details from a Noon product page."""
    options = {'ajax_wait': 'true', 'page_wait': '3000'}

    response = crawling_api.get(product_url, options)

    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch product page: {product_url}.")
        return None

def extract_product_details(html):
    """Extract details like name, price, description, and reviews."""
    soup = BeautifulSoup(html, 'html.parser')

    product = {}
    product['Name'] = soup.select_one('h1[data-qa^="pdp-name-"]').text.strip() if soup.select_one('h1[data-qa^="pdp-name-"]') else ''
    product['Price'] = soup.select_one('div[data-qa="div-price-now"]').text.strip() if soup.select_one('div[data-qa="div-price-now"]') else ''
    product['highlights'] = soup.select_one('div.oPZpQ ul').text.strip() if soup.select_one('div.oPZpQ ul') else ''
    product['specifications'] = {
        re.sub(r'\s+', ' ', row.find_all('td')[0].text.strip()): re.sub(r'\s+', ' ', row.find_all('td')[1].text.strip())
        for row in soup.select('div.dROUvm table tr')
        if len(row.find_all('td')) == 2
    }

    return product
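To try these functions together, you can fetch one of the product URLs used in the complete example below and print the extracted fields:

# Example run using a product URL from the complete example below.
url = 'https://www.noon.com/uae-en/galaxy-s25-ai-dual-sim-silver-shadow-12gb-ram-256gb-5g-middle-east-version/N70140511V/p/?o=e12201b055fa94ee'
html = scrape_product_page(url)
if html:
    details = extract_product_details(html)
    print(details['Name'], '-', details['Price'])
    print(details['specifications'])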

Storing Data in a CSV File

Once we’ve extracted the product details, we need to store this information in a structured format like CSV for easy analysis. Here’s a simple function to save the scraped data:

import csv

def save_product_data_to_csv(products, filename):
    """Save product details to a CSV file."""
    keys = products[0].keys() if products else ['Name', 'Price', 'highlights', 'specifications']

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(products)

    print(f"Data saved to {filename}")

Complete Code Example

Now, let's combine everything into a complete script. The main() function will scrape data for multiple product pages and store the results in a CSV file.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import csv
import re

# Initialize Crawlbase API
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def scrape_product_page(product_url):
    """Scrape product details from a Noon product page."""
    options = {'ajax_wait': 'true', 'page_wait': '3000'}

    response = crawling_api.get(product_url, options)

    if response['headers']['pc_status'] == '200':
        return response['body'].decode('utf-8')
    else:
        print(f"Failed to fetch product page: {product_url}.")
        return None

def extract_product_details(html):
    """Extract details like name, price, description, and reviews."""
    soup = BeautifulSoup(html, 'html.parser')

    product = {}
    product['Name'] = soup.select_one('h1[data-qa^="pdp-name-"]').text.strip() if soup.select_one('h1[data-qa^="pdp-name-"]') else ''
    product['Price'] = soup.select_one('div[data-qa="div-price-now"]').text.strip() if soup.select_one('div[data-qa="div-price-now"]') else ''
    product['highlights'] = soup.select_one('div.oPZpQ ul').text.strip() if soup.select_one('div.oPZpQ ul') else ''
    product['specifications'] = {
        re.sub(r'\s+', ' ', row.find_all('td')[0].text.strip()): re.sub(r'\s+', ' ', row.find_all('td')[1].text.strip())
        for row in soup.select('div.dROUvm table tr')
        if len(row.find_all('td')) == 2
    }

    return product

def save_product_data_to_csv(products, filename):
    """Save product details to a CSV file."""
    keys = products[0].keys() if products else ['Name', 'Price', 'highlights', 'specifications']

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(products)

    print(f"Data saved to {filename}")

def main():
    """Main function to scrape product pages."""
    product_urls = [
        'https://www.noon.com/uae-en/galaxy-s25-ai-dual-sim-silver-shadow-12gb-ram-256gb-5g-middle-east-version/N70140511V/p/?o=e12201b055fa94ee',
        'https://www.noon.com/uae-en/a78-5g-dual-sim-glowing-black-8gb-ram-256gb/N70115717V/p/?o=c99e13ae460efc6b'
    ]  # List of product URLs to scrape

    product_data = []

    for url in product_urls:
        print(f"Scraping {url}...")
        html = scrape_product_page(url)
        if html:
            product = extract_product_details(html)
            product_data.append(product)

    save_product_data_to_csv(product_data, 'noon_product_details.csv')

if __name__ == "__main__":
    main()

noon_product_details.csv Snapshot:

noon_product_details.csv output file snapshot

Final Thoughts

Scraping Noon data helps businesses track prices, analyze competitors, and improve product listings. The Crawlbase Crawling API makes this process easier by handling JavaScript rendering and CAPTCHA protections, so you get complete and accurate data without barriers.

With Python and BeautifulSoup, scraping data from Noon search results and product pages is easy. Follow ethical practices and set up the right environment, and you’ll have the insights to stay ahead in the competitive e-commerce game.

If you want to scrape data from other e-commerce platforms, check out these other guides.
