How I Built a SERP Topic Gap Analyzer on Apify (And What the Prefill Bug Taught Me)

#architecture #showdev #tooling #webscraping

Most content gap tools work backwards. They start with keyword data and try to reverse-engineer what your competitors rank for. I wanted to start from the other direction: given a set of real SERP results, which topics are your competitors covering that you're not?

That question led to the SERP Topic Gap Monitor — a deterministic, composable Apify actor that takes pre-fetched SERP data as input, runs a topic extraction and gap-scoring pipeline, and returns a ranked list of coverage gaps your site is missing.

Here's how it's built and what I learned along the way.

The Core Design Decision: Accept Data, Don't Fetch It

The first architectural choice was also the most important: the actor doesn't scrape Google. It accepts pre-fetched SERP results as structured input.

This sounds like a limitation. It's not.

Fetching Google reliably is a whole problem on its own — proxies, rate limits, bot detection, constantly shifting HTML structure. Bundling that complexity into a content analysis tool creates a fragile system where the analysis breaks every time Google changes its layout.

By accepting SERP data as input, the actor is decoupled from any specific data source. You can feed it results from Google Search API, SerpAPI, Apify's own SERP scrapers, or a fixture file you built by hand. The analysis stays stable regardless of where the data comes from.

The input schema is straightforward:

{
  "targetDomain": "yourblog.com",
  "serpResults": [
    {
      "keyword": "best nootropics",
      "results": [
        {
          "url": "https://healthline.com/...",
          "title": "...",
          "snippet": "...",
          "pageContent": "..."
        }
      ]
    }
  ]
}

pageContent is optional — when it's present, the topic extraction is richer. When it's not, the actor falls back to title and snippet.
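For readers who prefer types to JSON, here is a minimal sketch of that input shape and the fallback. The interface and function names are my own illustration, not the actor's source, and combining title, snippet, and page content into one string is an assumption about how the fallback works.

```typescript
// Hypothetical type names; the fields mirror the JSON input shown above.
interface SerpResult {
  url: string;
  title: string;
  snippet: string;
  pageContent?: string; // optional: extraction is richer when present
}

interface KeywordResults {
  keyword: string;
  results: SerpResult[];
}

interface ActorInput {
  targetDomain: string;
  serpResults: KeywordResults[];
}

// Assumed behavior: always use title and snippet, and append pageContent
// when it is supplied. Without pageContent this reduces to title + snippet.
function textForAnalysis(result: SerpResult): string {
  const parts = [result.title, result.snippet, result.pageContent ?? ""];
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" ");
}
```

Because the function only reads fields off a single result, it works the same whether the data came from an API, a scraper, or a hand-built fixture.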

The Pipeline: Four Pure Functions

The analysis runs as a linear pipeline of pure functions:

tokenize → topics → gaps → score

tokenize — takes raw text (title, snippet, page content), strips HTML, lowercases, removes punctuation, filters a bundled English stopword list (~170 words, NLTK/Snowball-based), and returns a token array.
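A rough sketch of what a tokenizer like this can look like. The stopword set below is a tiny stand-in for the bundled list, and the ASCII-only punctuation rule is a simplification I'm assuming for the example.

```typescript
// Stand-in stopword list for illustration; the real bundled list has ~170 words.
const STOPWORDS = new Set<string>([
  "a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "your",
]);

function tokenize(rawText: string): string[] {
  return rawText
    .replace(/<[^>]*>/g, " ")       // strip HTML tags
    .replace(/&[a-z0-9#]+;/gi, " ") // drop HTML entities such as &amp;
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")   // remove punctuation (ASCII-only simplification)
    .split(/\s+/)
    .filter((token) => token.length > 0 && !STOPWORDS.has(token));
}
```

With this stand-in list, tokenize("<p>The BEST Nootropics for Focus &amp; Memory!</p>") returns ["best", "nootropics", "focus", "memory"].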

topics — aggregates tokens across a result set into a frequency map. Higher frequency = more likely to be a genuine topic signal vs. noise.
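The aggregation step is a plain frequency count. A minimal sketch, with function names that are mine rather than the actor's:

```typescript
// Count how often each token appears across all token lists in a result set.
function topicFrequencies(tokenLists: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tokens of tokenLists) {
    for (const token of tokens) {
      counts.set(token, (counts.get(token) ?? 0) + 1);
    }
  }
  return counts;
}

// Return the most frequent topics first; ties are broken alphabetically
// so the output stays deterministic.
function topTopics(counts: Map<string, number>, limit: number): Array<[string, number]> {
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}
```

The alphabetical tie-break matters more than it looks: without it, two runs over the same input could order equal-frequency topics differently.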

gaps — compares your domain's topic set against the competitor topic set. Any topic present in competitor results but absent from your pages is a gap.
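Gap detection boils down to a set difference. In the sketch below, belongsToDomain is a hypothetical helper showing one way to separate your pages from competitor pages; the rule that subdomains count as your own site is my assumption, not something the actor documents.

```typescript
// Hypothetical helper: does this URL belong to the target domain?
// Assumes subdomains (www.yourblog.com) count as the same site.
function belongsToDomain(url: string, targetDomain: string): boolean {
  const withoutScheme = url.replace(/^[a-z][a-z0-9+.-]*:\/\//i, "");
  const host = (withoutScheme.split("/")[0] ?? "").toLowerCase();
  const domain = targetDomain.toLowerCase();
  return host === domain || host.endsWith("." + domain);
}

// A gap is any topic competitors cover that the target site does not.
function findGaps(ownTopics: Set<string>, competitorTopics: Set<string>): string[] {
  return Array.from(competitorTopics)
    .filter((topic) => !ownTopics.has(topic))
    .sort();
}
```

If your domain doesn't appear in the SERP results at all, ownTopics is empty and every competitor topic comes back as a gap.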

score — assigns a gapScore to each gap:

gapScore = uniqueCompetitorPages / totalUniqueCompetitorPages

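Here uniqueCompetitorPages is the number of distinct competitor pages that mention the topic, and totalUniqueCompetitorPages is the number of distinct competitor pages in the result set. A score of 1.0 means every competitor page in the result set covers the topic and your site doesn't. That's where you start writing.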

The formula is simple on purpose. No black box, no magic weighting. If you understand the ratio, you understand the output.
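To make the ratio concrete, here is a sketch of the scoring step. It assumes the numerator counts distinct competitor pages that mention the topic and that pages are deduplicated by URL; the type and function names are illustrative only.

```typescript
interface CompetitorPage {
  url: string;
  topics: string[];
}

interface ScoredGap {
  topic: string;
  pages: number;    // distinct competitor pages mentioning the topic
  gapScore: number; // pages / total distinct competitor pages
}

function scoreGaps(gaps: string[], pages: CompetitorPage[]): ScoredGap[] {
  // Merge by URL so a page that ranks for several keywords counts once.
  const topicsByUrl = new Map<string, Set<string>>();
  for (const page of pages) {
    const seen = topicsByUrl.get(page.url) ?? new Set<string>();
    page.topics.forEach((topic) => seen.add(topic));
    topicsByUrl.set(page.url, seen);
  }

  const total = topicsByUrl.size;
  const topicSets = Array.from(topicsByUrl.values());

  return gaps
    .map((topic) => {
      const covering = topicSets.filter((topics) => topics.has(topic)).length;
      // Guard against division by zero when there are no competitor pages.
      return { topic, pages: covering, gapScore: total === 0 ? 0 : covering / total };
    })
    .sort((a, b) => b.gapScore - a.gapScore || a.topic.localeCompare(b.topic));
}
```

A topic found on 8 of 10 distinct competitor pages scores 8 / 10 = 0.8 under this definition.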

Test-First, All the Way Down

Because all the logic is pure functions, the entire pipeline is fully testable without network access. I wrote 54 Vitest specs across 5 files before writing a line of implementation.

The test suite covers:

Tokenization edge cases (empty strings, HTML entities, mixed case, punctuation boundaries)
Stopword filtering (ensuring common words don't inflate topic scores)
Gap detection (correct identification of covered vs. uncovered topics)
Scoring math (boundary conditions at 0 and 1.0)
End-to-end pipeline integration with fixture SERP data

Running npm test catches tokenization, gap-detection, and scoring regressions before anything is pushed to the platform. This matters more than you'd think: debugging a scoring error in a live Apify run is significantly slower than catching it in a local test.

The Bug That Took Me Three QA Failures to Find

After publishing, I got a notice from Apify: the actor had been flagged "Under Maintenance" after three consecutive automated QA failures.

The error wasn't in the analysis code. It was in input_schema.json.

Apify's automated QA system runs each actor using the prefill values from the input schema as test input. If required fields don't have prefill values, the QA run receives undefined for those fields — which causes the actor to fail immediately.

My targetDomain and serpResults fields had no prefill. So every QA run hit this:

Actor.fail("targetDomain is required")

The fix took five minutes. Adding a prefill to both fields:

"targetDomain": {
  "type": "string",
  "prefill": "myblog.com"
},
"serpResults": {
  "type": "array",
  "prefill": [ /* 2-keyword fixture array */ ]
}

Build 0.1.4 deployed, QA passed, maintenance flag removed.

The lesson: Always set prefill for every required field in your Apify input_schema.json. The QA system can't test what it can't input. I've carried this forward to every actor since.
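One way to keep this from regressing is a small check that runs alongside the test suite. The sketch below assumes the usual input schema layout with a top-level properties object and required array; the function name and the trimmed-down types are hypothetical.

```typescript
// Only the parts of input_schema.json that this check needs.
interface SchemaProperty {
  type: string;
  prefill?: unknown;
}

interface InputSchema {
  properties: Record<string, SchemaProperty>;
  required?: string[];
}

// Returns the names of required fields that have no prefill value
// (or that are listed as required but missing from properties).
function requiredFieldsWithoutPrefill(schema: InputSchema): string[] {
  return (schema.required ?? []).filter((name) => {
    const property = schema.properties[name];
    return property === undefined || property.prefill === undefined;
  });
}
```

Loading the schema file, passing it through this function, and asserting that the result is an empty array would have flagged both of my fields before the first QA run.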

A Real Run

I ran it against one of my own sites — peakhealthprovisions.com, a wellness/nootropics site — using two keywords and five competitors (Healthline, Examine, WebMD, MedicalNewsToday, NootropicsExpert).

Results: 20 gaps, 0 topics covered.

nootropic   → gapScore: 1.0  (5/5 competitors)
nootropics  → gapScore: 1.0  (5/5 competitors)
cognitive   → gapScore: 0.8  (8 unique pages)
memory      → gapScore: 0.7  (7 unique pages)
focus       → gapScore: 0.6  (6 unique pages)

Competitors are covering these topics on most or all of their pages. The site isn't covering any of them. That's not a content calendar suggestion; that's the content calendar.

What's Next

The actor's design creates an explicit upstream dependency: you need SERP data to feed it. One logical next step is a dedicated SERP actor with a clean JSON output schema designed to pipe directly into the Gap Monitor — one actor fetches, one actor analyzes, composable by design.
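Until that actor exists, the glue can be a small adapter. This sketch assumes a hypothetical upstream format of one flat row per search result (FlatSerpRow is invented for illustration) and groups the rows into the Gap Monitor input shape.

```typescript
// Hypothetical upstream output: one flat row per search result.
interface FlatSerpRow {
  keyword: string;
  url: string;
  title: string;
  snippet: string;
}

interface GapMonitorResult {
  url: string;
  title: string;
  snippet: string;
}

interface GapMonitorInput {
  targetDomain: string;
  serpResults: Array<{ keyword: string; results: GapMonitorResult[] }>;
}

// Group flat rows by keyword, keeping first-seen keyword order.
function toGapMonitorInput(targetDomain: string, rows: FlatSerpRow[]): GapMonitorInput {
  const byKeyword = new Map<string, GapMonitorResult[]>();
  for (const row of rows) {
    const bucket = byKeyword.get(row.keyword) ?? [];
    bucket.push({ url: row.url, title: row.title, snippet: row.snippet });
    byKeyword.set(row.keyword, bucket);
  }
  return {
    targetDomain,
    serpResults: Array.from(byKeyword.entries()).map(([keyword, results]) => ({ keyword, results })),
  };
}
```

The adapter stays a pure function, so it can be covered by the same kind of fixture-based tests as the rest of the pipeline.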

There's also a "Works well with" section in the README that cross-references the Changelog Triage Agent — another actor in the portfolio for teams who want to monitor API changelogs for breaking changes. Different use case, same philosophy: focused, testable, composable.

Try It

The actor is free to use on the Apify Store:
👉 apify.com/joeslade/serp-topic-gap-monitor

The README has a full walkthrough of the input format, output schema, and how to source SERP data to feed it. If you're building it into a content pipeline or have questions about the architecture, drop a comment — happy to dig into it.

Tags: apify seo typescript webdev devtools