Keeping a finger on the pulse of brand visibility is a constant battle. While enterprise SEO tools like Ahrefs or SEMrush provide massive amounts of data, they often come with high price tags, limited keyword refresh rates, and "black box" metrics. If you need to track specific high-value keywords every hour or monitor localized results that major tools miss, a custom solution is the answer.
This guide demonstrates how to build a custom Share of Voice (SOV) tracker using Node.js and Playwright. We’ll automate the process of scraping Google Search Engine Results Pages (SERPs), extracting organic and paid rankings, and calculating exactly how much "digital shelf space" your brand occupies compared to your competitors.
Why Scrape SERPs Instead of Buying Data?
For many growth teams, commercial SEO tools are the industry standard. However, building an internal scraper offers several strategic advantages:
- Granularity and Localized Data: You can simulate requests from specific zip codes or device types that generic tools might aggregate, giving you a clearer picture of local SEO performance.
- Real-Time Frequency: Major tools often update rankings weekly or daily. With your own scraper, you can run checks every hour during a product launch or a high-stakes marketing campaign.
- Data Ownership: By piping raw HTML data into your own data warehouse (like BigQuery or Postgres), you can perform long-term historical analysis and build custom dashboards in Looker or Tableau.
- Ad vs. Organic Interplay: You can see exactly when a competitor is bidding on your brand terms and adjust your own ad spend in real-time to defend your position.
Project Setup
Ensure you have Node.js (v16 or higher) installed. We will use Playwright because it handles modern web standards and dynamic rendering more reliably than older libraries.
First, create a new directory and initialize your project:
```bash
mkdir sov-tracker
cd sov-tracker
npm init -y
npm install playwright
```
Playwright also requires browser binaries to run. Install them with:
```bash
npx playwright install chromium
```
Your project structure will be simple: an `index.js` file for the scraper logic and a `keywords.json` file to store the terms you want to track.
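For reference, `keywords.json` can be as simple as a flat array of search terms. The keywords below are placeholders:

```json
[
  "best running shoes",
  "trail running shoes",
  "running shoes for flat feet"
]
```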
Building the Scraper: The Core Logic
The first step in Share of Voice analysis is reaching the search results. Google is difficult to scrape because of its aggressive bot detection and varying consent modals.
We'll start by initializing a browser instance and navigating to Google. We need to set a realistic User-Agent and handle the "Before you continue" cookie consent form that appears in many regions.
```javascript
const { chromium } = require('playwright');

async function getSerpResults(keyword) {
  const browser = await chromium.launch({ headless: true });

  // Use a realistic user agent to avoid immediate flagging
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
  });
  const page = await context.newPage();

  try {
    await page.goto(`https://www.google.com/search?q=${encodeURIComponent(keyword)}&hl=en`);

    // Handle the cookie consent modal if it appears
    const consentButton = await page.$('button:has-text("Accept all"), button:has-text("I agree")');
    if (consentButton) {
      await consentButton.click();
    }

    // Wait for the results container to load
    await page.waitForSelector('#search');
    return await extractData(page);
  } catch (error) {
    console.error(`Error scraping ${keyword}:`, error);
    return []; // Return an empty array so downstream code always gets a list
  } finally {
    await browser.close();
  }
}
```
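Once `extractData` (defined in the next section) is in place, you can smoke-test the scraper with a single call. The keyword here is arbitrary:

```javascript
getSerpResults('best running shoes').then((results) => {
  console.log(results);
});
```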
Extracting Ranking Data
Now that we are on the page, we need to parse the HTML. Google’s CSS classes change frequently, but the structural hierarchy of a search result usually remains consistent. Organic results are typically wrapped in `div.g` containers.
We want to capture the Position (Rank), Title, and Domain. The domain is the most important piece for calculating Share of Voice.
```javascript
async function extractData(page) {
  return await page.$$eval('div.g', (results) => {
    return results.map((el, index) => {
      const title = el.querySelector('h3')?.innerText;
      const url = el.querySelector('a')?.href;

      let domain = '';
      if (url) {
        try {
          // Strip a leading "www." so domains compare consistently
          domain = new URL(url).hostname.replace(/^www\./, '');
        } catch (e) {
          domain = 'unknown';
        }
      }

      return {
        rank: index + 1,
        title: title || 'No Title',
        domain: domain
      };
    }).filter(item => item.domain !== 'unknown');
  });
}
```
In this snippet, we use `page.$$eval`, which runs the selection logic directly in the browser context. This is highly efficient for extracting lists of data. We also filter out any results where the URL couldn't be parsed.
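To make the output shape concrete, a scrape might return something along these lines (titles and domains are invented for illustration):

```json
[
  { "rank": 1, "title": "The 10 Best Running Shoes", "domain": "example-reviews.com" },
  { "rank": 2, "title": "Running Shoes | MyBrand", "domain": "mybrand.com" },
  { "rank": 3, "title": "Shop Running Shoes", "domain": "competitor.com" }
]
```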
Calculating Share of Voice
Raw JSON data is useful, but stakeholders need metrics. Share of Voice (SOV) is the percentage of the "search real estate" owned by a specific brand.
A simple SOV formula is:
(Brand Appearances / Total Results) * 100
However, a result at Rank #1 is significantly more valuable than Rank #10. We can apply a Weighted SOV calculation to account for this.
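In the same notation as above, the weighted version is:
(Sum of 1/rank for brand results / Sum of 1/rank for all results) * 100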
```javascript
function calculateSOV(scrapedData, targetBrand) {
  const totalWeight = scrapedData.reduce((acc, curr) => acc + (1 / curr.rank), 0);

  const brandWeight = scrapedData
    .filter(item => item.domain.includes(targetBrand))
    .reduce((acc, curr) => acc + (1 / curr.rank), 0);

  const sovPercentage = (brandWeight / totalWeight) * 100;

  return {
    brand: targetBrand,
    sov: sovPercentage.toFixed(2) + '%',
    totalResults: scrapedData.length
  };
}

// Example Usage
const results = [
  { rank: 1, domain: 'competitor.com' },
  { rank: 2, domain: 'mybrand.com' },
  { rank: 3, domain: 'another-site.com' }
];

console.log(calculateSOV(results, 'mybrand.com'));
// Output: { brand: 'mybrand.com', sov: '27.27%', totalResults: 3 }
```
By using `1 / rank`, we give the first result a weight of 1, the second 0.5, and so on. In the example above, mybrand.com contributes a weight of 0.5 against a total of 1 + 0.5 + 0.33 ≈ 1.83, which works out to 27.27%. This weighting roughly mirrors the diminishing click-through rate (CTR) as users scroll down the page.
Using Data to Optimize Ad Spend
Once you have automated this tracking, you can use the data to make high-impact decisions regarding your marketing budget.
| Scenario | SERP Observation | Action |
|---|---|---|
| Organic Dominance | You rank #1 organically and also hold the #1 Ad spot. | Reduce Ad Spend: You are likely paying for clicks you would have earned for free. |
| Competitor Conquesting | A competitor is bidding on your brand name and appears above you. | Increase Ad Spend: Launch a defensive campaign to protect your brand traffic. |
| The Gap | You are not on Page 1 organically for a high-intent keyword. | Aggressive Bidding: Use PPC to maintain visibility while your SEO team works on organic rankings. |
Challenges & Recommended Approaches
Scraping Google is an ongoing game of cat and mouse. To keep your custom tracker running reliably, keep these tips in mind:
- Bot Detection: Google uses sophisticated signals to identify bots. If you run hundreds of requests from a single IP, you will face CAPTCHAs. Try using the `playwright-extra` plugin with the `stealth` package to hide common bot signatures.
- Selector Fragility: Google does not provide stable IDs for results. Instead of relying on specific class names like `.yuRUbf`, use robust locators based on the page structure, such as the `h3` inside the first `div` that contains an `a` tag.
- Respectful Scraping: Avoid hammering Google's servers. Add a random delay between 3 and 7 seconds between keyword searches to mimic human behavior (see the sketch after this list).
- Scaling with Proxies: For enterprise-level tracking, you will need a proxy provider. This allows you to rotate IPs and simulate requests from different geographic locations.
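Here is a minimal sketch of that pacing advice: a runner that reads `keywords.json`, scrapes each term sequentially, and sleeps a random 3 to 7 seconds between searches. The brand name is a placeholder:

```javascript
const fs = require('fs');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runTracker() {
  const keywords = JSON.parse(fs.readFileSync('keywords.json', 'utf8'));

  for (const keyword of keywords) {
    const results = await getSerpResults(keyword);
    if (results.length > 0) {
      console.log(keyword, calculateSOV(results, 'mybrand.com'));
    }

    // Random 3-7 second pause between searches to mimic human behavior
    await sleep(3000 + Math.random() * 4000);
  }
}

runTracker();
```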
To Wrap Up
Building your own SERP tracker provides a level of control and transparency that off-the-shelf tools cannot match. By combining Playwright's browser automation with simple weighting logic, you can turn raw search results into actionable growth metrics.
Your next step is to automate this script to run on a schedule, using a cron job or GitHub Actions, and store the results in a database.
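As a starting point for that scheduling step, here is a minimal hourly job using the node-cron package (`npm install node-cron`); this is one option among many, and system cron or a GitHub Actions workflow would work just as well:

```javascript
const cron = require('node-cron');

// Run the tracker at the top of every hour
cron.schedule('0 * * * *', () => {
  runTracker().catch(console.error);
});
```

For more advanced scraping techniques, check out our guide on using proxies with Playwright to avoid blocks at scale.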