agenthustler

G2 Reviews Data Collection: Complete Developer Guide

Why Collect G2 Reviews Data?

G2 is the world's largest B2B software review marketplace with over 2 million reviews across 100K+ products. For developers building competitive intelligence tools, market research platforms, or sales enablement software, G2 data is gold.

Common use cases:

  • Competitive analysis — Track how competitors are rated over time
  • Lead generation — Identify companies using specific tools
  • Product research — Understand feature gaps from reviewer feedback
  • Market sizing — Estimate category adoption trends

API vs Scraping: What's Available?

G2 offers a limited partner API, but it's restricted to G2 customers with enterprise plans. For most developers, scraping is the practical option.

G2's Structure

G2 product pages follow a predictable URL pattern:

https://www.g2.com/products/{product-slug}/reviews
https://www.g2.com/products/{product-slug}/reviews?page={n}
https://www.g2.com/categories/{category-slug}
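Because the patterns are so regular, URL construction can live in a small helper instead of being inlined everywhere. A minimal sketch (the slug values used below are illustrative examples, not verified G2 slugs):

```python
def g2_review_url(product_slug: str, page: int = 1) -> str:
    """Build a G2 reviews URL for a product slug and page number."""
    url = f'https://www.g2.com/products/{product_slug}/reviews'
    return url if page == 1 else f'{url}?page={page}'

def g2_category_url(category_slug: str) -> str:
    """Build a G2 category listing URL."""
    return f'https://www.g2.com/categories/{category_slug}'
```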

Building a G2 Reviews Scraper

G2 uses server-side rendering with moderate anti-bot protection. Here's a working approach using Python with ScraperAPI to handle JavaScript rendering and proxy rotation:

import requests
from bs4 import BeautifulSoup
import json
import time
import random

SCRAPER_API_KEY = 'your_api_key'  # Get one at scraperapi.com

def scrape_g2_reviews(product_slug, max_pages=5):
    all_reviews = []
    base_url = f'https://www.g2.com/products/{product_slug}/reviews'

    for page in range(1, max_pages + 1):
        url = f'{base_url}?page={page}'

        # Use ScraperAPI for rendering and proxy rotation;
        # passing the target as a param lets requests URL-encode it safely
        response = requests.get(
            'http://api.scraperapi.com/',
            params={'api_key': SCRAPER_API_KEY, 'url': url, 'render': 'true'},
            timeout=60,
        )

        if response.status_code != 200:
            print(f'Failed on page {page}: {response.status_code}')
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        review_cards = soup.select('[itemprop="review"]')

        if not review_cards:
            print(f'No reviews found on page {page}, stopping')
            break

        for card in review_cards:
            review = extract_review(card)
            if review:
                all_reviews.append(review)

        print(f'Page {page}: found {len(review_cards)} review cards')
        time.sleep(random.uniform(3, 7))

    return all_reviews

def extract_review(card):
    try:
        # Rating
        rating_el = card.select_one('[class*="star-rating"]')
        rating = None
        if rating_el:
            stars = rating_el.get('class', [])
            for cls in stars:
                if cls.startswith('stars-'):
                    rating = float(cls.split('stars-')[1]) / 2
                    break

        # Title
        title_el = card.select_one('[itemprop="name"]')
        title = title_el.get_text(strip=True) if title_el else None

        # Review body
        likes = card.select_one('[data-test-id="review-likes"]')
        dislikes = card.select_one('[data-test-id="review-dislikes"]')

        # Reviewer info
        reviewer = card.select_one('[itemprop="author"]')
        reviewer_name = reviewer.get_text(strip=True) if reviewer else None

        # Date
        date_el = card.select_one('time')
        review_date = date_el.get('datetime') if date_el else None

        return {
            'rating': rating,
            'title': title,
            'likes': likes.get_text(strip=True) if likes else None,
            'dislikes': dislikes.get_text(strip=True) if dislikes else None,
            'reviewer': reviewer_name,
            'date': review_date,
        }
    except Exception as e:
        print(f'Error extracting review: {e}')
        return None
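The class-name-to-rating conversion above is worth factoring out so it can be tested without a live page. Note that the `stars-N` class convention (half-star units, so `stars-9` means 4.5) is an observed pattern, not a documented API, and G2 may change it at any time:

```python
def stars_to_rating(class_list):
    """Convert G2's 'stars-N' CSS class (half-star units) to a 0-5 rating.

    Returns None if no matching class is found.
    """
    for cls in class_list:
        if cls.startswith('stars-'):
            try:
                return float(cls.split('stars-')[1]) / 2
            except ValueError:
                continue
    return None
```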

Extracting Category Data

G2 categories are useful for market research:

def scrape_g2_category(category_slug, max_pages=3):
    products = []

    for page in range(1, max_pages + 1):
        url = f'https://www.g2.com/categories/{category_slug}?page={page}'
        response = requests.get(
            'http://api.scraperapi.com/',
            params={'api_key': SCRAPER_API_KEY, 'url': url, 'render': 'true'},
            timeout=60,
        )

        if response.status_code != 200:
            print(f'Failed on page {page}: {response.status_code}')
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        product_cards = soup.select('[data-test-id="product-card"]')

        for card in product_cards:
            name = card.select_one('[itemprop="name"]')
            rating = card.select_one('[class*="star-rating"]')
            review_count = card.select_one('[data-test-id="review-count"]')
            description = card.select_one('[itemprop="description"]')

            products.append({
                'name': name.get_text(strip=True) if name else None,
                'rating': rating.get_text(strip=True) if rating else None,
                'review_count': review_count.get_text(strip=True) if review_count else None,
                'description': description.get_text(strip=True) if description else None,
            })

        time.sleep(random.uniform(2, 5))

    return products

Handling Common Challenges

1. Anti-Bot Detection

G2 uses Cloudflare and behavioral analysis. Using ScraperAPI handles most of this automatically with browser rendering and IP rotation.
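If you make requests yourself instead, a retry wrapper with exponential backoff is the minimum defense against transient blocks. A generic sketch, where the attempt count and delays are arbitrary starting points to tune:

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=2.0):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 2s, 4s, 8s ... plus jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```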

2. Pagination

G2 pagination is straightforward with ?page=N parameters. Pages typically contain 10 reviews each.

3. Data Cleaning

import pandas as pd

def clean_reviews(reviews):
    df = pd.DataFrame(reviews)
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df = df.dropna(subset=['title'])
    return df
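A quick sanity check of this cleaning logic on inline sample data (the operations match `clean_reviews`; the sample records are made up):

```python
import pandas as pd

raw = [
    {'rating': 4.5, 'title': 'Great tool', 'date': '2024-01-15'},
    {'rating': None, 'title': None, 'date': 'not-a-date'},  # dropped: no title
    {'rating': 3.0, 'title': 'Decent', 'date': '2024-03-02'},
]

df = pd.DataFrame(raw)
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['title'])
# Two rows survive; the title-less row is removed
```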

4. Structured Output

def export_reviews(reviews, product_name):
    df = clean_reviews(reviews)

    # CSV for spreadsheets
    df.to_csv(f'{product_name}_reviews.csv', index=False)

    # JSON for APIs
    df.to_json(f'{product_name}_reviews.json', orient='records', indent=2)

    # Summary stats
    print(f'Total reviews: {len(df)}')
    print(f'Average rating: {df["rating"].mean():.2f}')
    print(f'Date range: {df["date"].min()} to {df["date"].max()}')

The Quick Way: Use a Managed Scraper

If you need G2 data without building and maintaining a scraper, try the G2 Reviews Scraper on Apify. It handles all the anti-bot challenges, provides structured JSON output, and supports scheduling for ongoing data collection.

Use Cases in Practice

Competitive Dashboard

competitors = ['slack', 'microsoft-teams', 'discord']
all_data = {}

for product in competitors:
    reviews = scrape_g2_reviews(product, max_pages=10)
    all_data[product] = clean_reviews(reviews)
    print(f'{product}: {len(reviews)} reviews, avg {all_data[product]["rating"].mean():.1f}')

Sentiment Tracking

Combine G2 review text with NLP libraries like TextBlob or spaCy to track sentiment trends over time.
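Before reaching for TextBlob or spaCy, even a naive keyword score can surface directional trends. This toy scorer is illustrative only, with a made-up word list, and is no substitute for a real sentiment model:

```python
POSITIVE = {'love', 'great', 'easy', 'excellent', 'fast'}
NEGATIVE = {'slow', 'buggy', 'expensive', 'confusing', 'crash'}

def naive_sentiment(text):
    """Return a score in [-1, 1] from keyword hits; 0.0 if none found."""
    words = [w.strip('.,!?') for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0
```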

Conclusion

G2 review data is essential for B2B competitive intelligence. While G2's API is limited to enterprise partners, Python scraping with proper proxy support (ScraperAPI recommended) gets the job done reliably. For production-grade data collection, the G2 Reviews Scraper on Apify provides a managed alternative.

Remember to scrape responsibly — add delays, respect rate limits, and only collect data you have a legitimate use for.
