
How we recovered from a 30,000 to 5 Google deindex on a programmatic SEO site

In April, HistorySaid.com had ~30,000 URLs in Google's index. By May 13, the count was five. Five total URLs.

This is the story of what happened — what we think we know about why, and the surgery we performed to recover. If you run a programmatic SEO site on WordPress, especially one with templated cross-product pages (country × indicator, asset × asset, region × year), the same trap is sitting under your traffic.

The site, briefly

HistorySaid.com is a programmatic reference site built on WordPress, but the content lives in custom tables, not wp_posts. There's a virtual router that intercepts URLs and renders pages out of a SQLite database of countries, indicators, figures, signals, comparisons, decades, superlatives — the long-tail combinatorial surface area you'd expect from a "every country × every indicator × every decade" play.

At peak, the sitemap held about 30,120 URLs spread across these page types:

Page type                   Count      URL pattern
Country × Indicator         ~22,000    /{country}/{indicator}
Compare (A vs B)            ~7,300     /compare/{a}-vs-{b}
Country hub                 216        /{country}/
Ranking (per indicator)     102        /ranking/{indicator}
Figure                      151        /figure/{slug}
Pillar (insight hub)        67         /insights/{slug}
Signal (analytical post)    102        /signals/{slug}
Inquiry (long-form Q&A)     ~16        /why/{slug}, /how/{slug}

The first two categories were the volume play. They were also the trap.

Diagnosis

We didn't see a manual action. There was no penalty notice, no message in Google Search Console. The index count just slid. April: 5,000+. Early May: ~1,200. May 13: 5.

The signature of this is well-known among programmatic SEO operators: Google's scaled content abuse classifier. It's a system, not a manual reviewer. It samples a property, decides "this site is producing thin templated content at scale," and broadly deindexes it. We've seen the same playbook hit two other sites in our network this year before we recognized the shape.

The reason is geometric. If a classifier samples 200 URLs from a 30,000-URL site and 195 of them are /{country}/{indicator} pages with <2,000 characters of unique content surrounding identical data tables, the verdict is set. The remaining high-quality pages never get evaluated individually; they go down with the sample.

We confirmed the diagnosis by category:

  • 22,000 country × indicator pages: identical template, ~1,200 chars unique text per page (mostly programmatic sentences like "Turkey's inflation rate was X% in 2023").
  • 7,300 compare pages: similar pattern, paired data, even less unique copy.
  • 216 country hubs: had real data density, but Google never got to evaluate them individually.

Surgery 1: shed the templated mass

The recovery move was counterintuitive: don't try to fatten 30,000 thin pages. Trim them to 700. Give the classifier a small, dense, unique-data site to re-evaluate.

We did this in two passes.

The May 13 pass dropped the worst offenders. In the SEO plugin's robots filter:

public function rankmath_robots($robots) {
    $thin_types = [
        'indicator_decade', 'indicator_superlative',
        'indicator', 'compare',
    ];
    if (in_array($this->route['type'] ?? '', $thin_types, true)) {
        $robots['index']  = 'noindex';
        $robots['follow'] = 'follow';
    }
    return $robots;
}

The matching HTML meta and X-Robots-Tag header went out at template render time, not on a delay. The sitemap router returned 410 Gone for the matching shards so Google would stop trying:

private function serve_sitemap($type, $page) {
    if (in_array($type, ['decades','superlatives','indicators','compare'], true)) {
        status_header(410);
        header('X-Robots-Tag: noindex');
        echo "<?xml version=\"1.0\"?><urlset xmlns=\"...\"></urlset>";
        exit;
    }
    // …
}

After this pass, 738 URLs remained in the sitemap (~2.5% of the original).

The May 15 pass went further. The first sweep wasn't enough; the classifier seemed to be re-evaluating and lingering on rankings, assets, groups, and patterns, all of which still had templated bodies under unique titles. We added them to the $thin_types list and the sitemap guard. The indexable spine became 583 URLs: the 216 country hubs, 151 figures, 67 pillars, 102 signals, 16 long-form inquiries, 26 snippets, and 5 evergreen pages.

The non-spine pages aren't gone — they still resolve, they still take internal links, they just carry noindex,follow. Google can crawl through them to discover the spine. They just don't compete for index slots.

Surgery 2: thicken the spine with real data + analysis

Trimming alone isn't enough. The 583 survivor pages also had to clearly not look templated. Country hubs were the worst offenders here: the same hero, the same data table, the same auto-generated paragraph.

We built a small enrichment pipeline that does two things per page:

  1. Pulls a small unique data block from a different source than the existing tables.
  2. Generates a 130–180 word analytical paragraph that cites the numbers from that block.

The fresh data source was DBnomics, which aggregates IMF, OECD, World Bank, ECB, BIS series under a free, no-key API. For each of the 216 country hubs we pulled three series from IMF's World Economic Outlook:

GET https://api.db.nomics.world/v22/series/IMF/WEO:latest/{ISO3}.NGDP_RPCH
GET https://api.db.nomics.world/v22/series/IMF/WEO:latest/{ISO3}.PCPIPCH
GET https://api.db.nomics.world/v22/series/IMF/WEO:latest/{ISO3}.LUR

(Real GDP growth, CPI inflation, unemployment rate.)
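The fetch step is worth sketching. This isn't the production script: fetch_country_block is a made-up name, and the parsing assumes DBnomics' observations=1 response shape with parallel period/value arrays.

import json
import requests

BASE = "https://api.db.nomics.world/v22/series/IMF/WEO:latest"
SERIES = ["NGDP_RPCH", "PCPIPCH", "LUR"]  # real GDP growth, CPI inflation, unemployment

def fetch_country_block(iso3, keep=15):
    # Pull the three WEO series for one country, keep the most recent observations
    block = {}
    for code in SERIES:
        r = requests.get(f"{BASE}/{iso3}.{code}", params={"observations": 1}, timeout=30)
        r.raise_for_status()
        doc = r.json()["series"]["docs"][0]
        block[code] = list(zip(doc["period"], doc["value"]))[-keep:]
    return json.dumps(block)  # cached in the page's enrichment row as data_json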

The 15 most recent observations from each series got stored as JSON in the page's enrichment row, then rendered as three small two-column tables under a "Data & analysis" heading. A DeepSeek prompt was then given the same JSON plus the page's existing global rankings and asked to write a single 130–180 word paragraph that cited at least one specific year and value.

The validator did the work the LLM wouldn't:

import re

# Illustrative pattern only; \u2014 is the em-dash, the rest are the tells listed below
BANNED = re.compile(r"\u2014|\bin conclusion\b|\boverall\b|\bwe\b|\bour\b", re.I)

def validate(text, must_digits):
    n = len(text.split())
    if n < 110 or n > 230:    return False, f"len{n}"
    if BANNED.search(text):   return False, "banned"
    if not any(d in text for d in must_digits): return False, "no-digits"
    return True, "ok"

must_digits was the formatted version of the headline values from the data block — if the model didn't reference at least one of them, the paragraph was rejected and retried. BANNED killed em-dashes, "in conclusion", "overall", first-person plurals, and other LLM tells.
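Stitched together, generation plus validation is a small retry loop. A minimal sketch; generate_paragraph and call_model are names invented here, with the model call injected as a callable:

def generate_paragraph(call_model, prompt, must_digits, max_tries=4):
    # call_model: any function that takes a prompt string and returns the model's text
    reason = "no-attempt"
    for _ in range(max_tries):
        text = call_model(prompt)
        ok, reason = validate(text, must_digits)  # validate() from above
        if ok:
            return text
    raise RuntimeError(f"paragraph rejected after {max_tries} tries: {reason}")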

The enrichment table was a single row per spine page:

CREATE TABLE wp_hs_enrichment (
  id BIGINT PRIMARY KEY,
  subject_type VARCHAR(30),     -- country_hub, figure, ranking, pillar
  subject_slug VARCHAR(150),    -- the page slug
  paragraph TEXT,               -- the 130-180w analysis
  paragraph_long LONGTEXT,      -- the 1500-2500w long-form (top 100 only)
  data_json LONGTEXT,           -- the cached data block
  word_count SMALLINT,
  model VARCHAR(50),
  prompt_version VARCHAR(60),
  KEY (subject_type, subject_slug)
);

A single template partial (enrichment-block.php) reads this row, renders the data tables specific to the page type, then renders the paragraph (or the long-form HTML, if it exists for that page). The four spine templates (country.php, figure-single.php, ranking.php, pillar.php) each include the partial with just two lines of context:

$hs_enrich_type = 'country_hub';
$hs_enrich_slug = $country['slug'];
include __DIR__ . '/enrichment-block.php';

For 520 of the 583 spine pages this short-paragraph pass was enough. For the top 100 by data depth and importance — top 40 countries by global top-20 rank coverage, top 40 figures by length of their key-contributions JSON, top 20 pillars chosen to span categories — we ran a second deeper pass.

The deep pass

The same pipeline shape, but the prompt asked for 1500–2500 words in semantic HTML, with required <h2> sections (Macro snapshot, Historical arc, Where the country leads and lags, Linked thinkers and historical figures, Comparable peers, Forward look). Each section had a 250–400 word target. Internal link slots were enumerated in the prompt so the model would weave 4–6 anchor tags to peer countries and linked figures using exact slugs.

Validation here was structural: word count between 1200 and 3000, at least 4 <h2> tags, no code fences, no banned phrases. Output was passed through wp_kses(), allowing only h2/h3/p/ul/li/strong/em/a[href].
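A sketch of those structural checks, reusing the BANNED pattern from the short-paragraph validator (the function name is illustrative; the thresholds are the ones above):

import re

def validate_longform(html, banned=BANNED, min_h2=4, word_range=(1200, 3000)):
    words = len(re.sub(r"<[^>]+>", " ", html).split())  # count words, not markup
    if not word_range[0] <= words <= word_range[1]:
        return False, f"len{words}"
    if len(re.findall(r"<h2\b", html, re.I)) < min_h2:
        return False, "h2"
    if "```" in html:                                    # reject code fences
        return False, "fence"
    if banned.search(html):
        return False, "banned"
    return True, "ok"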

Average output: 1,750 words per top-100 page. Switzerland came in at 1,773 words; Galbraith's page at 1,611. The deep pages render the long-form HTML in place of the short paragraph automatically.

Cross-linking the survivors

A 583-page editorial spine only works if the pages reinforce each other. We added a "Connected on HistorySaid" pill row below the analysis paragraph on every enriched page, generated server-side from three SQLite joins:

  • On a country hub: linked historical figures (via figure_countries) + inquiries citing this country (via a REST endpoint that queries wp_hs_inquiries).
  • On a figure page: linked countries (reverse of the above).
  • On a ranking: top-4 country profiles for that indicator.

The pill row is rendered by the same enrichment-block.php partial, server-side, no JS — Google sees it on first crawl. Internal anchor count on a typical country hub went from ~12 to ~22 outbound links to other spine pages.
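As a rough sketch of the country-hub join: only the figure_countries table name appears above, so the other table and column names here are assumptions.

import sqlite3

def connected_figures(db_path, country_slug, limit=6):
    # Figures linked to a country hub via figure_countries; schema details are guesses
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT f.slug, f.name
            FROM figure_countries AS fc
            JOIN figures AS f ON f.id = fc.figure_id
            WHERE fc.country_slug = ?
            ORDER BY f.name
            LIMIT ?
            """,
            (country_slug, limit),
        ).fetchall()
    finally:
        con.close()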

Submission

IndexNow handled Bing and Yandex immediately: a single POST per endpoint with 738 URLs in the body, and both returned 200/202. Google's Indexing API is a different animal. It requires a service account added as an Owner of the GSC property, has a 200-URL/day quota per property, and only acknowledges "URL_UPDATED" pings, not Search-Console-level submission.
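The IndexNow half is a few lines. A sketch under the standard protocol, with the key file assumed to sit at the site root:

import requests

ENDPOINTS = [
    "https://www.bing.com/indexnow",
    "https://yandex.com/indexnow",
]

def submit_indexnow(host, key, urls):
    body = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",  # assumed key file location
        "urlList": urls,
    }
    for endpoint in ENDPOINTS:
        r = requests.post(endpoint, json=body, timeout=30)
        print(endpoint, r.status_code)               # 200 and 202 both mean accepted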

We queued the 583 spine survivors with the top 100 (deep-enriched) at the front and the rest behind. A daily cron at 00:00 UTC submits the next 150 URLs in the queue:

0 0 * * * /usr/bin/python3 /root/pipelines/historysaid-indexing/submit_daily.py

Day 1 hits the top 100 plus 50 of the rest. Day 4 finishes the queue. Top-priority pages get the explicit Google "URL_UPDATED" signal first, and on the day they're freshest.
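For reference, a daily submitter along these lines can be as small as the sketch below. It's not the actual submit_daily.py: the queue file layout and names are assumptions, but the endpoint, scope, and URL_UPDATED notification type are the Indexing API's own.

import json
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
DAILY_LIMIT = 150  # stay under the 200-URL/day per-property quota

def submit_daily(queue_path, creds_path):
    creds = service_account.Credentials.from_service_account_file(creds_path, scopes=SCOPES)
    session = AuthorizedSession(creds)
    with open(queue_path) as f:
        queue = json.load(f)            # ordered list, top 100 at the front
    today, remaining = queue[:DAILY_LIMIT], queue[DAILY_LIMIT:]
    for url in today:
        resp = session.post(ENDPOINT, json={"url": url, "type": "URL_UPDATED"})
        resp.raise_for_status()
    with open(queue_path, "w") as f:
        json.dump(remaining, f)         # what's left for tomorrow's run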

What we expect and what we don't

The honest answer about recovering from a scaled-content classifier verdict is that the turnaround takes 2 to 4 weeks; indexed page count doesn't snap back. What we should see, in order:

  1. Within 7 days: GSC Crawl stats shows the indexable-page crawl rate rising.
  2. Within 14 days: in Indexing > Pages, the "Discovered – currently not indexed" bucket starts shifting to "Crawled – currently not indexed."
  3. Between 21 and 28 days: re-evaluation completes, indexed count starts climbing back.

If the index is still flat at single digits 28 days from May 15, the remaining lever is trimming further: rankings, snippets, even some of the lower-data country hubs. Smaller is fine. A 200-URL curated site that gets cited by AI overviews and Wikipedia is worth more than 30,000 indexed pages no one reads.

The pages we're betting on most are linked below. They're the ones with the deepest data, the longest analysis, and the most outbound internal links to the rest of the spine.

What we'd do differently if we started over

Two things.

First: build the spine before the long tail. We launched with the 30,000 templated pages live and the 583 spine pages buried inside them. The classifier's sample skewed thin from day one. Had the spine launched first and matured before the combinatorial pages were added, the verdict would likely have gone the other way.

Second: instrument the data block earlier. The DBnomics integration took a week after the deindex; it would have taken the same week before it. Pages with real, unique, cited time-series at launch are demonstrably not templated, even if everything around them is. The classifier sees this; we just gave it nothing to see for too long.

Five URLs in the index is a number you don't forget. We'd rather not see it again.
