Building a Production-Ready G2.com Scraper with Python and Scrapy
Learn how to build a robust web scraper for G2.com that handles anti-bot measures, exports clean data, and scales from development to production.
TL;DR
Built a G2.com scraper using Scrapy that extracts category listings and product reviews. Features include anti-bot detection, proxy rotation via ScrapeOps, duplicate handling, and clean CSV/JSON exports. Perfect for market research and competitive analysis.
I recently needed to gather competitive intelligence from G2.com for my project. What started as a simple script quickly evolved into a production-ready scraper that handles G2's anti-bot measures, exports clean data, and scales from development to production. Here's how I built it and what I learned along the way.
Why This Matters
G2.com is a goldmine for B2B market research, but it's also protected by sophisticated anti-bot measures. Most scraping attempts fail due to rate limiting, IP blocking, or JavaScript challenges. A robust scraper needs to handle these obstacles while maintaining data quality and respecting the site's policies.
The Architecture
The scraper uses two main spiders:
- Category Spider: Extracts product listings from category pages
- Product Reviews Spider: Collects detailed reviews from individual product pages
Both spiders share a common pipeline for data validation, duplicate removal, and export formatting.
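In Scrapy terms, that shared pipeline is just an ITEM_PIPELINES ordering in settings.py. Here's a minimal sketch of the idea; the class names are illustrative, not the exact ones from the repository:

```python
# settings.py -- shared pipeline ordering (class names are illustrative)
ITEM_PIPELINES = {
    'g2_scraper.pipelines.ValidationPipeline': 100,   # clean and validate fields
    'g2_scraper.pipelines.DuplicatesPipeline': 200,   # drop duplicate items
    'g2_scraper.pipelines.ExportPipeline': 300,       # write CSV/JSON output
}
```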
Core Implementation
1. Universal Selector Discovery
G2's layout varies across pages, so I implemented a fallback system that tries multiple selectors:
```python
def discover_review_containers(self, response):
    container_selectors = [
        'article.elv-bg-neutral-0',
        'article[data-testid*="review"]',
        'div[itemprop="review"]',
        # ... more fallbacks
    ]
    for selector in container_selectors:
        containers = response.css(selector)
        if containers and self._verify_review_content(containers):
            return containers
    return []
```
This approach ensures the scraper adapts to different page layouts without manual intervention.
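The `_verify_review_content` helper referenced above isn't shown here; this is a minimal sketch of the kind of check it implies, where the sample size and minimum text length are my assumptions:

```python
def _verify_review_content(self, containers):
    """Heuristic check that matched containers actually hold review text."""
    sample = containers[:3]  # a few containers are enough to judge the match
    for container in sample:
        text = ' '.join(container.css('::text').getall()).strip()
        if len(text) < 50:   # too little text to plausibly be a review
            return False
    return bool(sample)
```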
2. Anti-Bot Middleware Stack
The middleware pipeline handles various anti-bot measures:
```python
DOWNLOADER_MIDDLEWARES = {
    'g2_scraper.middlewares.RandomUserAgentMiddleware': 400,
    'g2_scraper.middlewares.RandomDelayMiddleware': 401,
    'g2_scraper.middlewares.AntiBotDetectionMiddleware': 402,
    'scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```
The RandomUserAgentMiddleware rotates through realistic browser signatures, while the RandomDelayMiddleware adds natural timing variations.
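Neither middleware's code appears above, so here's a rough sketch of how each might be written. The USER_AGENT_LIST setting and the delay range are assumptions rather than the repository's actual values:

```python
import random
import time


class RandomUserAgentMiddleware:
    """Rotate through a pool of realistic browser User-Agent strings."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting holding the UA pool
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)


class RandomDelayMiddleware:
    """Add small random pauses so request timing looks less mechanical."""

    def process_request(self, request, spider):
        # A blocking sleep keeps the sketch simple; Scrapy's built-in
        # RANDOMIZE_DOWNLOAD_DELAY is a lighter-weight alternative
        time.sleep(random.uniform(0.5, 2.0))  # delay range is an assumption
```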
3. Smart Duplicate Detection
Instead of simple field matching, the duplicate pipeline creates unique identifiers:
```python
# at the top of pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


def process_item(self, item, spider):
    adapter = ItemAdapter(item)

    if 'reviewer_name' in adapter and 'review_date' in adapter:
        # Reviews: reviewer + date + review snippet
        review_text = adapter.get('review_text', '')[:50]
        item_id = f"review_{adapter.get('reviewer_name')}_{adapter.get('review_date')}_{review_text}"
    elif 'product_name' in adapter and 'product_url' in adapter:
        # Products: name + URL
        item_id = f"product_{adapter.get('product_name')}_{adapter.get('product_url')}"
    else:
        return item  # items without identifying fields pass through untouched

    if item_id in self.ids_seen:
        raise DropItem(f"Duplicate item found: {item_id}")
    self.ids_seen.add(item_id)
    return item
```
This prevents duplicate reviews while allowing multiple products with the same name.
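For completeness, the ids_seen set only needs to be created once per crawl; something like this (the class name is assumed):

```python
class DuplicatesPipeline:
    """Drops items whose composite identifier has already been seen."""

    def __init__(self):
        # One in-memory set per crawl is plenty at this scale
        self.ids_seen = set()
```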
4. Dynamic Export Pipeline
The export pipeline creates files only for item types that actually have data:
```python
def process_item(self, item, spider):
    adapter = ItemAdapter(item)

    # Determine the item type from the fields present
    if 'reviewer_name' in adapter:
        item_type = 'review'
    elif 'product_name' in adapter:
        item_type = 'product'
    elif 'category_name' in adapter:
        item_type = 'category'
    else:
        return item  # unknown item types pass through untouched

    # Create the output file only when the first item of this type appears
    if item_type not in self.active_types:
        filename = f'data/g2_{item_type}s_{self.timestamp}.csv'
        # ... open the file, create the csv writer, and write the header row
        #     (see the pipeline sketch below)
        self.active_types.add(item_type)

    # Write the data row through the writer created for this type
    self.writers[item_type].writerow(adapter.asdict())
    return item
```
This prevents empty files and keeps the output directory clean.
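The file-creation step is elided above, and the class name plus the self.files and self.writers attributes below are my shorthand rather than the repository's exact code. A minimal sketch of the lazy bookkeeping around it:

```python
import csv
import os
from datetime import datetime


class DynamicExportPipeline:
    """Lazily opens one CSV file per item type and closes everything at the end."""

    def open_spider(self, spider):
        self.timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        self.active_types = set()
        self.files = {}     # item_type -> open file handle
        self.writers = {}   # item_type -> csv.DictWriter
        os.makedirs('data', exist_ok=True)

    def _open_writer(self, item_type, fieldnames):
        # Called the first time an item of a given type shows up
        filename = f'data/g2_{item_type}s_{self.timestamp}.csv'
        handle = open(filename, 'w', newline='', encoding='utf-8')
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        self.files[item_type] = handle
        self.writers[item_type] = writer

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()
```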
Usage Examples
Category Scraping
```bash
scrapy crawl g2_category -a category=system-security -a limit=5
```
This extracts the top 5 products from the system-security category, including ratings, review counts, and vendor information.
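For reference, the -a flags arrive as spider constructor arguments. Here's a rough sketch of how the category spider might consume them; the exact argument handling and URL pattern are assumptions, not a copy of the repository's spider:

```python
import scrapy


class G2CategorySpider(scrapy.Spider):
    name = 'g2_category'

    def __init__(self, category='system-security', limit=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.category = category
        self.limit = int(limit)  # -a values always arrive as strings

    def start_requests(self):
        # Assumed G2 category URL pattern
        url = f'https://www.g2.com/categories/{self.category}'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # ... extract up to self.limit product cards here
        pass
```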
Product Reviews Scraping
```bash
scrapy crawl g2_product_reviews -a product_url="https://www.g2.com/products/rollworks-account-based-platform/reviews"
```
This collects detailed reviews with pros/cons, reviewer information, and ratings.
Key Features That Made This Production-Ready
- JavaScript Rendering: Uses the `render_js=true` parameter for fully rendered pages
- Proxy Rotation: Integrated ScrapeOps proxy for IP rotation and geolocation
- Data Validation: Automatic cleaning and validation of all scraped fields
- Error Recovery: Exponential backoff and retry mechanisms
- Rate Limiting: Respectful delays and auto-throttling (see the settings sketch below)
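Most of the rate-limiting and retry behaviour maps onto standard Scrapy settings. Here's a sketch with illustrative values (the numbers are mine, not the repo's; note that Scrapy's stock retry middleware retries without exponential backoff, so that part lives in custom middleware):

```python
# settings.py -- pacing and retry knobs (values are illustrative)

# Respectful delays and auto-throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Retry blocked or failed responses
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```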
What I Learned
The biggest challenge was handling G2's dynamic content loading. Initially, I tried static selectors, but they failed when the site updated its layout. The universal selector discovery system solved this by trying multiple patterns and verifying content before proceeding.
Another key insight was the importance of proper duplicate detection. Simple field matching caused issues when the same reviewer posted multiple reviews for the same product. The current approach uses a combination of reviewer name, date, and review snippet to create truly unique identifiers.
Getting Started
- Clone the repository: g2-scrapy-scraper
- Install dependencies: `pip install -r requirements.txt`
- Grab a free ScrapeOps API key for proxy rotation
- Run the spiders: use the commands above
Next Steps and Resources
For deeper insights into G2's scraping challenges, check out the G2 Website Analyzer which covers anti-bot measures, legal considerations, and technical challenges.
If you need step-by-step guidance, the How-to Scrape G2 Guide provides detailed walkthroughs for various scraping scenarios.
Why ScrapeOps Made a Difference
I initially tried building this with free proxies, but the success rate was abysmal. After grabbing a free ScrapeOps API key, the success rate jumped to 95%+. The proxy rotation and geolocation features eliminated most blocking issues, while the monitoring dashboard helped me optimize the scraping strategy.
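Enabling the proxy is mostly a settings change. The sketch below follows the ScrapeOps Scrapy proxy SDK's documented pattern for the API key, while the extra proxy options (JS rendering, geotargeting) are assumptions about how I'd pass them:

```python
# settings.py -- ScrapeOps proxy configuration
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'   # free key from the ScrapeOps dashboard
SCRAPEOPS_PROXY_ENABLED = True

# Optional proxy features (treat these option names as assumptions)
SCRAPEOPS_PROXY_SETTINGS = {
    'render_js': True,   # ask the proxy to return fully rendered pages
    'country': 'us',     # example geotargeting option
}
```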
Conclusion
This scraper demonstrates how to build production-ready web scraping solutions that handle real-world challenges. The modular architecture makes it easy to extend for other sites, while the robust error handling ensures reliable operation.
The complete code is available on GitHub - feel free to star it if you find it useful, and let me know if you have questions or suggestions for improvements.