Building a Production-Ready G2.com Scraper with Python and Scrapy
Learn how to build a robust web scraper for G2.com that handles anti-bot measures, exports clean data, and scales from development to production.
TL;DR
Built a G2.com scraper using Scrapy that extracts category listings and product reviews. Features include anti-bot detection, proxy rotation via ScrapeOps, duplicate handling, and clean CSV/JSON exports. Perfect for market research and competitive analysis.
I recently needed to gather competitive intelligence from G2.com for my project. What started as a simple script quickly evolved into a production-ready scraper that handles G2's anti-bot measures, exports clean data, and scales from development to production. Here's how I built it and what I learned along the way.
Why This Matters
G2.com is a goldmine for B2B market research, but it's also protected by sophisticated anti-bot measures. Most scraping attempts fail due to rate limiting, IP blocking, or JavaScript challenges. A robust scraper needs to handle these obstacles while maintaining data quality and respecting the site's policies.
The Architecture
The scraper uses two main spiders:
- Category Spider: Extracts product listings from category pages
- Product Reviews Spider: Collects detailed reviews from individual product pages
Both spiders share a common pipeline for data validation, duplicate removal, and export formatting.
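In Scrapy terms, that shared pipeline is just an ITEM_PIPELINES ordering in settings.py. Here's a minimal sketch of the idea; the class names are illustrative, not the exact ones from the repository:

```python
# settings.py -- shared pipeline ordering (class names are illustrative)
ITEM_PIPELINES = {
    'g2_scraper.pipelines.ValidationPipeline': 100,   # clean and validate fields
    'g2_scraper.pipelines.DuplicatesPipeline': 200,   # drop duplicate items
    'g2_scraper.pipelines.ExportPipeline': 300,       # write CSV/JSON output
}
```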
Core Implementation
1. Universal Selector Discovery
G2's layout varies across pages, so I implemented a fallback system that tries multiple selectors:
```python
def discover_review_containers(self, response):
    container_selectors = [
        'article.elv-bg-neutral-0',
        'article[data-testid*="review"]',
        'div[itemprop="review"]',
        # ... more fallbacks
    ]
    for selector in container_selectors:
        containers = response.css(selector)
        if containers and self._verify_review_content(containers):
            return containers
    return []
```
This approach ensures the scraper adapts to different page layouts without manual intervention.
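The `_verify_review_content` helper referenced above isn't shown here; this is a minimal sketch of the kind of check it implies, where the sample size and minimum text length are my assumptions:

```python
def _verify_review_content(self, containers):
    """Heuristic check that matched containers actually hold review text."""
    sample = containers[:3]  # a few containers are enough to judge the match
    for container in sample:
        text = ' '.join(container.css('::text').getall()).strip()
        if len(text) < 50:   # too little text to plausibly be a review
            return False
    return bool(sample)
```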
2. Anti-Bot Middleware Stack
The middleware pipeline handles various anti-bot measures:
```python
DOWNLOADER_MIDDLEWARES = {
    'g2_scraper.middlewares.RandomUserAgentMiddleware': 400,
    'g2_scraper.middlewares.RandomDelayMiddleware': 401,
    'g2_scraper.middlewares.AntiBotDetectionMiddleware': 402,
    'scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```
The RandomUserAgentMiddleware rotates through realistic browser signatures, while the RandomDelayMiddleware adds natural timing variations.
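Neither middleware's code appears above, so here's a rough sketch of how each might be written. The USER_AGENT_LIST setting and the delay range are assumptions rather than the repository's actual values:

```python
import random
import time


class RandomUserAgentMiddleware:
    """Rotate through a pool of realistic browser User-Agent strings."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting holding the UA pool
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)


class RandomDelayMiddleware:
    """Add small random pauses so request timing looks less mechanical."""

    def process_request(self, request, spider):
        # A blocking sleep keeps the sketch simple; Scrapy's built-in
        # RANDOMIZE_DOWNLOAD_DELAY is a lighter-weight alternative
        time.sleep(random.uniform(0.5, 2.0))  # delay range is an assumption
```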
3. Smart Duplicate Detection
Instead of simple field matching, the duplicate pipeline creates unique identifiers:
```python
# at the top of pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


def process_item(self, item, spider):
    adapter = ItemAdapter(item)

    if 'reviewer_name' in adapter and 'review_date' in adapter:
        # Reviews: reviewer + date + review snippet
        review_text = adapter.get('review_text', '')[:50]
        item_id = f"review_{adapter.get('reviewer_name')}_{adapter.get('review_date')}_{review_text}"
    elif 'product_name' in adapter and 'product_url' in adapter:
        # Products: name + URL
        item_id = f"product_{adapter.get('product_name')}_{adapter.get('product_url')}"
    else:
        return item  # items without identifying fields pass through untouched

    if item_id in self.ids_seen:
        raise DropItem(f"Duplicate item found: {item_id}")
    self.ids_seen.add(item_id)
    return item
```
This prevents duplicate reviews while allowing multiple products with the same name.
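For completeness, the ids_seen set only needs to be created once per crawl; something like this (the class name is assumed):

```python
class DuplicatesPipeline:
    """Drops items whose composite identifier has already been seen."""

    def __init__(self):
        # One in-memory set per crawl is plenty at this scale
        self.ids_seen = set()
```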
4. Dynamic Export Pipeline
The export pipeline creates files only for item types that actually have data:
```python
def process_item(self, item, spider):
    adapter = ItemAdapter(item)

    # Determine the item type from the fields present
    if 'reviewer_name' in adapter:
        item_type = 'review'
    elif 'product_name' in adapter:
        item_type = 'product'
    elif 'category_name' in adapter:
        item_type = 'category'
    else:
        return item  # unknown item types pass through untouched

    # Create the output file only when the first item of this type appears
    if item_type not in self.active_types:
        filename = f'data/g2_{item_type}s_{self.timestamp}.csv'
        # ... open the file, create the csv writer, and write the header row
        #     (see the pipeline sketch below)
        self.active_types.add(item_type)

    # Write the data row through the writer created for this type
    self.writers[item_type].writerow(adapter.asdict())
    return item
```
This prevents empty files and keeps the output directory clean.
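The file-creation step is elided above, and the class name plus the self.files and self.writers attributes below are my shorthand rather than the repository's exact code. A minimal sketch of the lazy bookkeeping around it:

```python
import csv
import os
from datetime import datetime


class DynamicExportPipeline:
    """Lazily opens one CSV file per item type and closes everything at the end."""

    def open_spider(self, spider):
        self.timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        self.active_types = set()
        self.files = {}     # item_type -> open file handle
        self.writers = {}   # item_type -> csv.DictWriter
        os.makedirs('data', exist_ok=True)

    def _open_writer(self, item_type, fieldnames):
        # Called the first time an item of a given type shows up
        filename = f'data/g2_{item_type}s_{self.timestamp}.csv'
        handle = open(filename, 'w', newline='', encoding='utf-8')
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        self.files[item_type] = handle
        self.writers[item_type] = writer

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()
```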
Usage Examples
Category Scraping
```bash
scrapy crawl g2_category -a category=system-security -a limit=5
```
This extracts the top 5 products from the system-security category, including ratings, review counts, and vendor information.
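For reference, the -a flags arrive as spider constructor arguments. Here's a rough sketch of how the category spider might consume them; the exact argument handling and URL pattern are assumptions, not a copy of the repository's spider:

```python
import scrapy


class G2CategorySpider(scrapy.Spider):
    name = 'g2_category'

    def __init__(self, category='system-security', limit=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.category = category
        self.limit = int(limit)  # -a values always arrive as strings

    def start_requests(self):
        # Assumed G2 category URL pattern
        url = f'https://www.g2.com/categories/{self.category}'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # ... extract up to self.limit product cards here
        pass
```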
Product Reviews Scraping
```bash
scrapy crawl g2_product_reviews -a product_url="https://www.g2.com/products/rollworks-account-based-platform/reviews"
```
This collects detailed reviews with pros/cons, reviewer information, and ratings.
Key Features That Made This Production-Ready
- JavaScript Rendering: Uses the `render_js=true` parameter for fully rendered pages
- Proxy Rotation: Integrated ScrapeOps proxy for IP rotation and geolocation
- Data Validation: Automatic cleaning and validation of all scraped fields
- Error Recovery: Exponential backoff and retry mechanisms
- Rate Limiting: Respectful delays and auto-throttling (see the settings sketch below)
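Most of the rate-limiting and retry behaviour maps onto standard Scrapy settings. Here's a sketch with illustrative values (the numbers are mine, not the repo's; note that Scrapy's stock retry middleware retries without exponential backoff, so that part lives in custom middleware):

```python
# settings.py -- pacing and retry knobs (values are illustrative)

# Respectful delays and auto-throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Retry blocked or failed responses
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```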
What I Learned
The biggest challenge was handling G2's dynamic content loading. Initially, I tried static selectors, but they failed when the site updated its layout. The universal selector discovery system solved this by trying multiple patterns and verifying content before proceeding.
Another key insight was the importance of proper duplicate detection. Simple field matching caused issues when the same reviewer posted multiple reviews for the same product. The current approach uses a combination of reviewer name, date, and review snippet to create truly unique identifiers.
Getting Started
- Clone the repository: g2-scrapy-scraper
- Install dependencies: `pip install -r requirements.txt`
- Grab a free ScrapeOps API key for proxy rotation
- Run the spiders: use the commands above
Next Steps and Resources
For deeper insights into G2's scraping challenges, check out the G2 Website Analyzer which covers anti-bot measures, legal considerations, and technical challenges.
If you need step-by-step guidance, the How-to Scrape G2 Guide provides detailed walkthroughs for various scraping scenarios.
Why ScrapeOps Made a Difference
I initially tried building this with free proxies, but the success rate was abysmal. After grabbing a free ScrapeOps API key, the success rate jumped to 95%+. The proxy rotation and geolocation features eliminated most blocking issues, while the monitoring dashboard helped me optimize the scraping strategy.
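Enabling the proxy is mostly a settings change. The sketch below follows the ScrapeOps Scrapy proxy SDK's documented pattern for the API key, while the extra proxy options (JS rendering, geotargeting) are assumptions about how I'd pass them:

```python
# settings.py -- ScrapeOps proxy configuration
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'   # free key from the ScrapeOps dashboard
SCRAPEOPS_PROXY_ENABLED = True

# Optional proxy features (treat these option names as assumptions)
SCRAPEOPS_PROXY_SETTINGS = {
    'render_js': True,   # ask the proxy to return fully rendered pages
    'country': 'us',     # example geotargeting option
}
```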
Conclusion
This scraper demonstrates how to build production-ready web scraping solutions that handle real-world challenges. The modular architecture makes it easy to extend for other sites, while the robust error handling ensures reliable operation.
The complete code is available on GitHub - feel free to star it if you find it useful, and let me know if you have questions or suggestions for improvements.