DEV Community

agenthustler

G2 Reviews Scraping: Extract Software Reviews, Ratings and Competitor Data

G2.com is the world's largest software review marketplace, hosting millions of verified user reviews across thousands of software categories. For product managers, marketers, sales teams, and competitive analysts, the data locked inside G2 is incredibly valuable — but manually browsing through hundreds of product pages is impractical.

In this guide, we'll explore how to scrape G2 reviews programmatically, covering the platform's structure, extraction techniques using Python and Node.js, and how to scale your data collection using Apify.

Understanding G2.com's Structure

Before writing any scraping code, you need to understand how G2 organizes its data. G2 has several distinct page types, each containing different data points.

Product Pages

Every software product on G2 has a dedicated page (e.g., g2.com/products/slack/reviews). A product page contains:

  • Overall star rating — aggregate score from all reviews (out of 5)
  • Total review count — how many users have reviewed the product
  • Rating breakdown — distribution across 5-star to 1-star ratings
  • Satisfaction scores — Ease of Use, Quality of Support, Ease of Setup, etc.
  • Product description — vendor-provided overview
  • Pricing information — when available
  • Feature list — categorized feature descriptions
  • Comparison links — "vs" pages for competitor comparisons

Individual Reviews

Each review on G2 is structured with rich metadata:

  • Star rating — the reviewer's overall score
  • Review title and body text
  • Pros and Cons — clearly separated sections
  • User demographics — company size, industry, role, region
  • Verified status — whether the review was authenticated
  • Review date — when it was submitted
  • Helpful votes — community engagement signals
  • Product usage duration — how long the reviewer has used the software
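The fields above map naturally onto a typed record. Here is a minimal sketch of one way to hold a scraped review; the field names are illustrative choices, not G2's own markup or schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class G2Review:
    """One scraped G2 review; field names are illustrative, not G2's schema."""
    rating: Optional[int] = None   # 1-5 stars
    title: str = ""
    body: str = ""
    pros: str = ""
    cons: str = ""
    company_size: str = ""
    industry: str = ""
    role: str = ""
    region: str = ""
    verified: bool = False
    date: str = ""                 # ISO 8601 date string
    helpful_votes: int = 0
    usage_duration: str = ""       # e.g. "More than 2 years"

review = G2Review(rating=5, title="Great for team chat", verified=True)
print(review.rating, review.verified)
```

A dataclass like this catches typos in field names early and gives every record a consistent shape before it lands in a DataFrame or database.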

Category Pages

G2 organizes software into categories (e.g., g2.com/categories/crm). Each category page lists:

  • Category leader grid — G2's quadrant ranking
  • All products in the category — with summary ratings
  • Subcategories — more granular groupings
  • Market statistics — average ratings, review counts

Comparison Pages

G2 generates comparison pages (e.g., g2.com/compare/slack-vs-microsoft-teams) with:

  • Side-by-side ratings
  • Feature comparison tables
  • Reviewer sentiment comparison
  • Pricing comparison — when available

Setting Up Your Environment

Python Dependencies

pip install requests beautifulsoup4 apify-client pandas

Node.js Dependencies

npm install axios cheerio apify-client

Method 1: Scraping G2 Product Pages

G2 pages are server-rendered, making them accessible with simple HTTP requests. However, G2 implements rate limiting and bot detection, so you'll need realistic request headers and, for anything beyond light use, rotating proxies.
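Because blocked or rate-limited requests are common, it helps to wrap each fetch in a retry with exponential backoff. A minimal sketch, using only the standard library; the helper name and parameters are my own, and `fetch` is any zero-argument callable (e.g. a wrapped `session.get`):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=2.0):
    """Call fetch(); on failure, wait base_delay * 2**attempt plus jitter and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Example with a flaky callable that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return "ok"

result = fetch_with_backoff(flaky, base_delay=0.01)
print(result)  # -> ok, on the third attempt
```

The jitter spreads retries out so parallel workers don't all hammer G2 at the same instant.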

Python Product Scraper

import requests
from bs4 import BeautifulSoup
import json
import time

class G2ProductScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        })

    def scrape_product(self, product_slug):
        """Scrape a G2 product page for ratings and metadata."""
        url = f"https://www.g2.com/products/{product_slug}/reviews"
        response = self.session.get(url)

        if response.status_code != 200:
            print(f"Failed to fetch {url}: {response.status_code}")
            return None

        soup = BeautifulSoup(response.text, 'html.parser')

        product_data = {
            'slug': product_slug,
            'url': url,
            'name': self._extract_name(soup),
            'overall_rating': self._extract_rating(soup),
            'review_count': self._extract_review_count(soup),
            'rating_breakdown': self._extract_rating_breakdown(soup),
            'satisfaction_scores': self._extract_satisfaction(soup),
        }

        return product_data

    def _extract_name(self, soup):
        title = soup.find('h1')
        return title.text.strip() if title else 'Unknown'

    def _extract_rating(self, soup):
        rating_el = soup.find('span', class_='fw-semibold')
        if rating_el:
            try:
                return float(rating_el.text.strip())
            except ValueError:
                pass
        return None

    def _extract_review_count(self, soup):
        count_el = soup.find('span', attrs={'itemprop': 'reviewCount'})
        if count_el:
            try:
                return int(count_el.text.strip().replace(',', ''))
            except ValueError:
                pass
        return 0

    def _extract_rating_breakdown(self, soup):
        breakdown = {}
        stars_section = soup.find_all('div', class_='rating-bar')
        for i, bar in enumerate(stars_section[:5], 1):
            count_el = bar.find('span', class_='count')
            if count_el:
                breakdown[f'{6-i}_star'] = int(count_el.text.strip().replace(',', ''))
        return breakdown

    def _extract_satisfaction(self, soup):
        scores = {}
        satisfaction_items = soup.find_all('div', class_='satisfaction-score')
        for item in satisfaction_items:
            label = item.find('span', class_='label')
            value = item.find('span', class_='value')
            if label and value:
                scores[label.text.strip()] = value.text.strip()
        return scores

# Usage
scraper = G2ProductScraper()
product = scraper.scrape_product('slack')
if product:
    print(json.dumps(product, indent=2))

Node.js Product Scraper

const axios = require('axios');
const cheerio = require('cheerio');

class G2ProductScraper {
    constructor() {
        this.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        };
    }

    async scrapeProduct(productSlug) {
        const url = `https://www.g2.com/products/${productSlug}/reviews`;

        try {
            const response = await axios.get(url, { headers: this.headers });
            const $ = cheerio.load(response.data);

            return {
                slug: productSlug,
                url: url,
                name: $('h1').first().text().trim(),
                overallRating: this.extractRating($),
                reviewCount: this.extractReviewCount($),
                ratingBreakdown: this.extractBreakdown($),
            };
        } catch (error) {
            console.error(`Failed to scrape ${productSlug}: ${error.message}`);
            return null;
        }
    }

    extractRating($) {
        const ratingText = $('span.fw-semibold').first().text().trim();
        return parseFloat(ratingText) || null;
    }

    extractReviewCount($) {
        const countText = $('[itemprop="reviewCount"]').text().trim();
        return parseInt(countText.replace(/,/g, ''), 10) || 0;
    }

    extractBreakdown($) {
        const breakdown = {};
        $('.rating-bar').each((i, el) => {
            const count = $(el).find('.count').text().trim();
            if (count && i < 5) {
                breakdown[`${5 - i}_star`] = parseInt(count.replace(/,/g, ''), 10);
            }
        });
        return breakdown;
    }
}

// Usage
(async () => {
    const scraper = new G2ProductScraper();
    const product = await scraper.scrapeProduct('slack');
    console.log(JSON.stringify(product, null, 2));
})();

Method 2: Extracting Individual Reviews

The real value lies in extracting individual review text, sentiment, and demographics. Here's how to paginate through all reviews for a product:

import requests
from bs4 import BeautifulSoup
import time

class G2ReviewExtractor:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        })

    def extract_reviews(self, product_slug, max_pages=10):
        """Extract all reviews for a product, paginating through results."""
        all_reviews = []

        for page in range(1, max_pages + 1):
            url = f"https://www.g2.com/products/{product_slug}/reviews?page={page}"
            response = self.session.get(url)

            if response.status_code != 200:
                break

            soup = BeautifulSoup(response.text, 'html.parser')
            reviews = self._parse_reviews(soup)

            if not reviews:
                break

            all_reviews.extend(reviews)
            print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")

            time.sleep(2)  # Respectful rate limiting

        return all_reviews

    def _parse_reviews(self, soup):
        reviews = []
        review_cards = soup.find_all('div', attrs={'itemprop': 'review'})

        for card in review_cards:
            review = {}

            # Star rating
            rating_el = card.find('span', class_='star-rating')
            if rating_el:
                stars = rating_el.find_all('svg', class_='filled')
                review['rating'] = len(stars) if stars else None

            # Review title
            title_el = card.find('h3', class_='review-title')
            review['title'] = title_el.text.strip() if title_el else ''

            # Pros
            pros_section = card.find('div', attrs={'data-testid': 'pros'})
            review['pros'] = pros_section.text.strip() if pros_section else ''

            # Cons
            cons_section = card.find('div', attrs={'data-testid': 'cons'})
            review['cons'] = cons_section.text.strip() if cons_section else ''

            # Reviewer info
            reviewer_el = card.find('span', class_='reviewer-name')
            review['reviewer'] = reviewer_el.text.strip() if reviewer_el else 'Anonymous'

            # Company size
            company_el = card.find('span', class_='company-size')
            review['company_size'] = company_el.text.strip() if company_el else ''

            # Industry
            industry_el = card.find('span', class_='industry')
            review['industry'] = industry_el.text.strip() if industry_el else ''

            # Date
            date_el = card.find('time')
            review['date'] = date_el.get('datetime', '') if date_el else ''

            # Verified
            verified_el = card.find('span', class_='verified')
            review['verified'] = verified_el is not None

            reviews.append(review)

        return reviews

# Usage
extractor = G2ReviewExtractor()
reviews = extractor.extract_reviews('slack', max_pages=5)
print(f"\nExtracted {len(reviews)} total reviews")

# Analyze sentiment distribution
from collections import Counter
ratings = Counter(r.get('rating') for r in reviews if r.get('rating'))
for stars in sorted(ratings.keys(), reverse=True):
    bar = '#' * min(ratings[stars], 50)  # cap the bar width for readability
    print(f"  {stars} stars: {bar} ({ratings[stars]})")

Method 3: Scraping Category Rankings

Category pages reveal market positioning and competitive landscapes:

import requests
from bs4 import BeautifulSoup

def scrape_category(category_slug):
    """Scrape a G2 category page for product rankings."""
    url = f"https://www.g2.com/categories/{category_slug}"

    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    })

    response = session.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    products = []
    product_cards = soup.find_all('div', class_='product-card')

    for card in product_cards:
        product = {}

        name_el = card.find('a', class_='product-name')
        product['name'] = name_el.text.strip() if name_el else ''
        product['url'] = name_el.get('href', '') if name_el else ''

        rating_el = card.find('span', class_='rating')
        product['rating'] = float(rating_el.text.strip()) if rating_el else None

        count_el = card.find('span', class_='review-count')
        if count_el:
            count_text = count_el.text.strip().replace('(', '').replace(')', '').replace(',', '')
            product['review_count'] = int(count_text) if count_text.isdigit() else 0

        products.append(product)

    return products

# Example: Scrape CRM category
crm_products = scrape_category('crm')
print(f"Found {len(crm_products)} CRM products on G2")

for i, product in enumerate(crm_products[:10], 1):
    print(f"{i}. {product['name']} - {product['rating']} ({product['review_count']} reviews)")

Method 4: Comparison Data Extraction

G2's comparison pages are a goldmine for competitive analysis:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeComparison(product1, product2) {
    const url = `https://www.g2.com/compare/${product1}-vs-${product2}`;

    const response = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        },
    });

    const $ = cheerio.load(response.data);

    const comparison = {
        products: [product1, product2],
        url: url,
        ratings: {},
        features: [],
        reviewerPreference: {},
    };

    // Extract side-by-side ratings
    $('.comparison-rating').each((i, el) => {
        const label = $(el).find('.label').text().trim();
        const scores = [];
        $(el).find('.score').each((j, scoreEl) => {
            scores.push(parseFloat($(scoreEl).text().trim()));
        });
        comparison.ratings[label] = scores;
    });

    // Extract feature comparison
    $('.feature-row').each((i, el) => {
        const feature = $(el).find('.feature-name').text().trim();
        const product1Has = $(el).find('.product-1 .check').length > 0;
        const product2Has = $(el).find('.product-2 .check').length > 0;
        comparison.features.push({ feature, [product1]: product1Has, [product2]: product2Has });
    });

    return comparison;
}

// Usage
(async () => {
    const comp = await scrapeComparison('slack', 'microsoft-teams');
    console.log(JSON.stringify(comp, null, 2));
})();

Scaling G2 Scraping with Apify

For large-scale G2 data extraction, Apify provides the infrastructure to handle anti-bot measures, proxy rotation, and parallel execution:

from apify_client import ApifyClient

def scrape_g2_at_scale(product_slugs, max_reviews_per_product=100):
    """
    Use Apify to scrape multiple G2 products at scale.
    """
    client = ApifyClient("YOUR_APIFY_TOKEN")

    all_results = {}

    for slug in product_slugs:
        run_input = {
            "productUrl": f"https://www.g2.com/products/{slug}/reviews",
            "maxReviews": max_reviews_per_product,
            "includeReviewerDetails": True,
            "includeRatingBreakdown": True,
            "proxyConfiguration": {
                "useApifyProxy": True,
                "apifyProxyGroups": ["RESIDENTIAL"],
            },
        }

        run = client.actor("apify/g2-reviews-scraper").call(run_input=run_input)

        items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
        all_results[slug] = items
        print(f"Scraped {len(items)} reviews for {slug}")

    return all_results

# Scrape multiple competitors
products = ['slack', 'microsoft-teams', 'discord', 'zoom']
results = scrape_g2_at_scale(products, max_reviews_per_product=50)

Analyzing Extracted Review Data

Once you have the raw data, here's how to extract actionable insights:

import pandas as pd
from collections import Counter

def analyze_g2_reviews(reviews, product_name):
    """Comprehensive analysis of G2 review data."""
    df = pd.DataFrame(reviews)

    print(f"\n{'='*60}")
    print(f"ANALYSIS: {product_name}")
    print(f"{'='*60}")

    # Rating distribution
    if 'rating' in df.columns:
        print(f"\nRating Distribution:")
        print(f"  Average: {df['rating'].mean():.2f}")
        print(f"  Median: {df['rating'].median():.1f}")
        for rating in range(5, 0, -1):
            count = len(df[df['rating'] == rating])
            pct = count / len(df) * 100
            bar = '#' * int(pct / 2)
            print(f"  {rating} star: {bar} {count} ({pct:.1f}%)")

    # Company size distribution
    if 'company_size' in df.columns:
        print(f"\nReviewer Company Size:")
        sizes = df['company_size'].value_counts()
        for size, count in sizes.items():
            if size:
                print(f"  {size}: {count}")

    # Industry breakdown
    if 'industry' in df.columns:
        print(f"\nTop Industries:")
        industries = df['industry'].value_counts().head(10)
        for industry, count in industries.items():
            if industry:
                print(f"  {industry}: {count}")

    # Common themes in pros/cons
    if 'pros' in df.columns:
        print(f"\nMost mentioned in PROS:")
        pros_words = extract_key_phrases(df['pros'].dropna().tolist())
        for phrase, count in pros_words[:10]:
            print(f"  '{phrase}': mentioned {count} times")

    if 'cons' in df.columns:
        print(f"\nMost mentioned in CONS:")
        cons_words = extract_key_phrases(df['cons'].dropna().tolist())
        for phrase, count in cons_words[:10]:
            print(f"  '{phrase}': mentioned {count} times")

def extract_key_phrases(texts):
    """Simple keyword frequency analysis."""
    word_freq = Counter()
    stop_words = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'for', 'of', 'and', 
                  'is', 'are', 'it', 'that', 'this', 'with', 'was', 'be', 'has',
                  'have', 'not', 'but', 'can', 'very', 'from', 'they', 'you', 'we'}

    for text in texts:
        words = text.lower().split()
        for word in words:
            word = word.strip('.,!?()[]{}":;')
            if word not in stop_words and len(word) > 3:
                word_freq[word] += 1

    return word_freq.most_common(20)

# Example usage
analyze_g2_reviews(reviews, "Slack")

Building a Competitive Intelligence Dashboard

Combine all the scraped data into a competitive analysis framework:

def competitive_analysis(products_data):
    """Generate a competitive comparison report from G2 data."""

    print("\n" + "=" * 80)
    print("COMPETITIVE INTELLIGENCE REPORT")
    print("=" * 80)

    # Comparison table
    print(f"\n{'Product':<20} {'Rating':<10} {'Reviews':<12} {'Satisfaction':<15}")
    print("-" * 60)

    for product in products_data:
        name = product.get('name', 'Unknown')[:19]
        rating = product.get('overall_rating', 'N/A')
        reviews = product.get('review_count', 0)
        satisfaction = product.get('satisfaction_scores', {})
        ease = satisfaction.get('Ease of Use', 'N/A')

        print(f"{name:<20} {str(rating):<10} {str(reviews):<12} {ease:<15}")

    # Strengths and weaknesses
    for product in products_data:
        print(f"\n--- {product.get('name', 'Unknown')} ---")

        breakdown = product.get('rating_breakdown', {})
        if breakdown:
            total = sum(breakdown.values())
            five_star_pct = breakdown.get('5_star', 0) / total * 100 if total > 0 else 0
            one_star_pct = breakdown.get('1_star', 0) / total * 100 if total > 0 else 0

            print(f"  5-star rate: {five_star_pct:.1f}%")
            print(f"  1-star rate: {one_star_pct:.1f}%")

            if five_star_pct > 60:
                print(f"  STRENGTH: High satisfaction ({five_star_pct:.0f}% give 5 stars)")
            if one_star_pct > 10:
                print(f"  WARNING: Notable dissatisfaction ({one_star_pct:.0f}% give 1 star)")

User Demographic Analysis

Understanding who reviews a product reveals its actual user base:

def demographic_analysis(reviews):
    """Analyze the demographic makeup of a product's reviewers."""

    # Company size segments
    size_segments = {
        'Enterprise (1000+)': 0,
        'Mid-Market (51-1000)': 0,
        'Small Business (1-50)': 0,
    }

    for review in reviews:
        size = review.get('company_size', '').lower()
        # Check the most specific labels first: '1000' also appears in '51-1000',
        # so a naive substring test would misclassify mid-market reviewers
        if 'enterprise' in size or '1000+' in size:
            size_segments['Enterprise (1000+)'] += 1
        elif 'mid' in size or '51-1000' in size:
            size_segments['Mid-Market (51-1000)'] += 1
        else:
            size_segments['Small Business (1-50)'] += 1

    total = sum(size_segments.values())
    print("\nCompany Size Distribution:")
    for segment, count in size_segments.items():
        pct = count / total * 100 if total > 0 else 0
        bar = '#' * int(pct / 2)
        print(f"  {segment}: {bar} {pct:.1f}%")

    # Satisfaction by segment
    print("\nRating by Company Size:")
    for segment_name in size_segments:
        segment_reviews = [r for r in reviews if segment_name.split('(')[0].strip().lower() in r.get('company_size', '').lower()]
        rated = [r['rating'] for r in segment_reviews if r.get('rating')]
        if rated:
            avg_rating = sum(rated) / len(rated)
            print(f"  {segment_name}: {avg_rating:.2f} avg rating")

Best Practices for G2 Scraping

  1. Implement rate limiting — G2 actively detects automated access. Space your requests 2-5 seconds apart for direct scraping, or use Apify's managed proxies.

  2. Use residential proxies — G2's anti-bot measures are sophisticated. Datacenter IPs get blocked quickly.

  3. Respect the Terms of Service — Review G2's ToS regarding automated data collection. Use the data responsibly and within legal boundaries.

  4. Cache aggressively — G2 reviews don't change frequently. Cache product pages for 24 hours and individual reviews for a week.

  5. Handle pagination carefully — G2 loads reviews in pages of 10-25. Don't skip pages or you'll miss data.

  6. Validate data quality — Check for empty fields, malformed ratings, and duplicate reviews. G2 occasionally changes its HTML structure.

  7. Monitor for structural changes — Set up alerts when your scraper returns empty results or unusual data patterns.

  8. Store data efficiently — Use structured formats (JSON, CSV, or a database) with consistent schemas for easier analysis later.
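Point 4 can be sketched as a thin file-based cache in front of whatever fetch function you use. The helper name, cache layout, and key scheme below are my own illustrative choices:

```python
import json
import time
from pathlib import Path

def cached_fetch(url, fetch, cache_dir="g2_cache", ttl_hours=24):
    """Return a cached body for url if younger than ttl_hours, else fetch and store it."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    # Derive a filesystem-safe cache key from the URL
    key = url.replace("://", "_").replace("/", "_").replace("?", "_")
    path = cache / f"{key}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if time.time() - entry["fetched_at"] < ttl_hours * 3600:
            return entry["body"]
    body = fetch(url)  # fetch is any callable that returns the page text
    path.write_text(json.dumps({"fetched_at": time.time(), "body": body}))
    return body
```

With a 24-hour TTL for product pages and a longer one for individual reviews, repeated analysis runs stop costing you requests against G2's rate limits.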

Conclusion

G2 review data is one of the most valuable sources of competitive intelligence in the B2B software market. By scraping product pages, individual reviews, category rankings, and comparison tables, you can build a comprehensive understanding of your market landscape.

Start with single product scraping to validate your approach, then scale to category-wide extraction using Apify's infrastructure. The combination of structured review data with demographic analysis gives you insights that no amount of manual browsing could match.

Whether you're tracking competitor sentiment, identifying market gaps, or understanding your own product's perception, automated G2 scraping transforms scattered review data into actionable competitive intelligence.

Remember to scrape responsibly, respect rate limits, and use the data ethically. The goal is insight, not disruption — and the best insights come from clean, well-structured data collected at a sustainable pace.
