I lost months of Google indexing to a single missing UA pattern

#webdev #javascript #security #devops

tl;dr: If your site has any kind of geo-gate, age verification, or country-specific wall, and you wrote a "let Googlebot through" rule in your middleware, your rule is probably incomplete. Search Console's URL Inspection tool does not send a Googlebot UA. It sends Google-InspectionTool. Match that explicitly, or every live test will see your gate instead of your page, and you can lose months of indexing without ever seeing the cause in Search Console.

I'm shipping noctias.tv, a multi-language portal, alongside our older domain noctias.com. While I was verifying the new domain in Search Console, every single URL inspection failed with the same message:

"ページをインデックスに登録できません: noindex タグによって除外されました"
("Page cannot be indexed: Excluded by 'noindex' tag")

But the page wasn't noindex. Curl with a Googlebot UA returned <meta name="robots" content="index, follow"/>. The page rendered correctly to humans. The sitemap was healthy and showed 252 URLs discovered. Bing indexed it. Search Console said: nope.

This took me hours to track down. Sharing it because I'm sure other sites are silently bleeding the same way.

The setup

noctias.tv runs Next.js 15 in standalone mode behind Cloudflare Tunnel on an OVHcloud VPS. Compliance is jurisdictional, not language-level:

// src/lib/geo.ts (simplified)
export type Policy = 'ALLOW' | 'AGE_GATE' | 'AGE_VERIFICATION' | 'BLOCK';

export function getPolicy(country?: string, region?: string): Policy {
  if (!country) return 'AGE_VERIFICATION';            // fail closed
  if (BLOCKED_COUNTRIES.has(country)) return 'BLOCK';
  if (country === 'US' && STRICT_US_STATES.has(region ?? '')) return 'AGE_VERIFICATION';
  if (country === 'UK' || country === 'GB') return 'AGE_VERIFICATION';
  if (country === 'JP') return 'AGE_GATE';
  return 'ALLOW';
}

AGE_VERIFICATION means: rewrite to /age-verification, which is a noindex,nofollow page. After the visitor passes the wall, a signed cookie unlocks the real content.

To avoid destroying SEO, I'd added what I thought was a standard bot bypass:

// src/middleware.ts (the WRONG version)
if (policy === 'AGE_VERIFICATION') {
  const verified = await verifyAvToken(req.cookies.get(AV_COOKIE)?.value);
  const isBot = /Googlebot/i.test(req.headers.get('user-agent') ?? '');
  if (!verified && !isBot && !isAvPath(req.nextUrl.pathname)) {
    return NextResponse.rewrite(new URL(`/${locale}/age-verification`, req.url));
  }
}

Looks right. Plenty of "how to allow Googlebot through" Stack Overflow answers use the same pattern.

But the live test from Search Console still failed.

Finding the actual UA

The "Test live URL" panel in Search Console shows you the rendered HTML it fetched. I dug into the rendered output and found:

<title>Noctias.tv</title>
<meta name="robots" content="noindex, nofollow" />

This is the /age-verification page. Google was being walled. So the bypass wasn't matching.

I went looking for the real UA. Google's crawler documentation covers the familiar names: Googlebot, AdsBot-Google, and so on. The URL Inspection live test does not identify itself as Googlebot, though. It fetches with its own token, Google-InspectionTool, and the desktop UA looks like this:

Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)

/Googlebot/i does NOT match Google-InspectionTool. The substring isn't there.
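The mismatch is easy to reproduce in isolation. The sketch below runs the naive pattern against the inspection UA quoted above and against a classic desktop Googlebot UA; the constant names are mine, not from the real middleware.

```typescript
// The classic desktop Googlebot UA and the inspection UA quoted in this post.
const GOOGLEBOT_UA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const INSPECTION_UA = "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)";

// The pattern from the WRONG middleware version.
const naivePattern = /Googlebot/i;

const naiveMatchesGooglebot: boolean = naivePattern.test(GOOGLEBOT_UA);
const naiveMatchesInspector: boolean = naivePattern.test(INSPECTION_UA);

// "Google-InspectionTool" starts with "Google" but never contains "Googlebot".
console.log({ naiveMatchesGooglebot, naiveMatchesInspector });
```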

This is the bug. My bypass let real Googlebot crawl, but blocked the URL Inspector. So when I (or anyone) tried to manually request indexing in Search Console, the live test saw the noindex wall and refused.

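I suspect fresh URLs suffered beyond the manual tests too. As far as I can tell, "Request indexing" runs a live test first and only queues the URL if that test passes, so every manual request for a new URL died at the wall. I can't confirm from the outside how much regular Googlebot crawling was affected.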

The actual fix

function isSearchEngineBot(userAgent: string | null): boolean {
  if (!userAgent) return false;
  return /Googlebot|Google-InspectionTool|Google-Read-Aloud|AdsBot-Google|Google-Site-Verification|Bingbot|DuckDuckBot|YandexBot|Baiduspider|Applebot|GPTBot|ClaudeBot|PerplexityBot|facebookexternalhit|Twitterbot|LinkedInBot/i.test(
    userAgent,
  );
}
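A single long regex is hard to review in a diff. One possible refactor, sketched here with hypothetical names (BOT_TOKENS, escapeRegExp, matchesBotToken), keeps the same tokens in a list and builds the pattern from it. It is meant to behave the same as the function above, not to replace it.

```typescript
// Same tokens as the regex above, one per entry so additions show up clearly in review.
const BOT_TOKENS: readonly string[] = [
  "Googlebot",
  "Google-InspectionTool",
  "Google-Read-Aloud",
  "AdsBot-Google",
  "Google-Site-Verification",
  "Bingbot",
  "DuckDuckBot",
  "YandexBot",
  "Baiduspider",
  "Applebot",
  "GPTBot",
  "ClaudeBot",
  "PerplexityBot",
  "facebookexternalhit",
  "Twitterbot",
  "LinkedInBot",
];

// Escape regex metacharacters so a token is always matched literally.
function escapeRegExp(token: string): string {
  return token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Case-insensitive alternation; no "g" flag, so test() carries no state between calls.
const BOT_PATTERN = new RegExp(BOT_TOKENS.map(escapeRegExp).join("|"), "i");

function matchesBotToken(userAgent: string | null): boolean {
  if (!userAgent) return false;
  return BOT_PATTERN.test(userAgent);
}
```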

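This is not cloaking spam: the bot receives the same content a verified human would, not a different page built for crawlers. Google's guidance on interstitials and age-gated content covers the case where a legally required gate would otherwise block crawling. One caveat: a User-Agent header is trivial to spoof, so a UA-only bypass also admits anyone who claims to be a crawler. Google documents how to verify its crawlers by reverse DNS lookup or against its published IP ranges, and that check belongs in front of a bypass like this one.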

Some others worth matching while you're at it:

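facebookexternalhit, Twitterbot, LinkedInBot: Open Graph card scrapers. Without them, every Facebook / Twitter / LinkedIn unfurl of your URL shows the age-verification placeholder, which kills click-through. Discord fetches previews as Discordbot, which the pattern above does not include, so add it if Discord previews matter to you.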
GPTBot, ClaudeBot, PerplexityBot — AI search crawlers. Your call whether to let these through; we do.

How to check your own site

curl -s -H "User-Agent: Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)" \
  https://your-site.com/some-article \
  | grep -oE 'name="robots"[^>]*'

If that returns noindex, your indexing pipeline is broken for any URL behind whatever gate you have — age verification, geo-blocking, paywall preview, "press Enter to continue", anything.
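The same check can live in a script or a CI job. The sketch below is a rough equivalent of the grep above: it pulls the robots meta directives out of an HTML string and reports whether they forbid indexing. The function names are hypothetical, the parsing is regex-based rather than a real HTML parser, and it ignores the X-Robots-Tag response header, so treat it as a smoke test only.

```typescript
// Collect directives from every <meta name="robots" content="..."> tag in the HTML.
function robotsDirectives(html: string): string[] {
  const directives: string[] = [];
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    // Attribute order varies, so read name and content independently.
    const name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1];
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1];
    if (name === undefined || content === undefined) continue;
    if (name.toLowerCase() !== "robots") continue;
    for (const part of content.split(",")) {
      const directive = part.trim().toLowerCase();
      if (directive.length > 0) directives.push(directive);
    }
  }
  return directives;
}

// "none" is shorthand for "noindex, nofollow", so it counts as noindex too.
function isNoindex(html: string): boolean {
  const directives = robotsDirectives(html);
  return directives.indexOf("noindex") !== -1 || directives.indexOf("none") !== -1;
}
```

Feed it the body of a request made with the Google-InspectionTool UA; if isNoindex returns true for a page that should be indexable, the gate is catching the inspector.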

You'll find the same issue on:

Any site that gates content for US TX / UT / AR / LA / MS / NC / OK / etc. for age verification (state law)
Any site that gates for UK Online Safety Act (since Jan 2025)
Any site with a country-blocked list (BLOCK policy in our case)
Some EU cookie wall implementations that fully rewrite the response

What I'd do differently

Test the inspector path before deploy. I had unit tests for the geo router but never tested that a fresh URL would survive a live URL Inspector run. That's the actual integration test.
Watch the rendered HTML in Search Console's live test panel — not just the verdict. The verdict tells you something is wrong; the rendered HTML tells you what. I'd assumed the verdict was Search Console misreading my pages.
Match crawlers broadly. Half the Stack Overflow answers about "let Googlebot through" only match Googlebot. Several Google crawler UAs don't contain that string. Mine now matches a long allowlist.
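Part of the first point can be covered before deploy without touching Search Console. A minimal sketch, assuming the gate decision is factored out of the middleware into a pure function: the names (GateInput, shouldRewriteToGate, MUST_PASS_UAS) are hypothetical, and the bot pattern is shortened for the example rather than copied from the real allowlist.

```typescript
type GateInput = {
  verified: boolean;        // visitor holds a valid age-verification cookie
  userAgent: string | null; // raw User-Agent header, if any
  onGatePath: boolean;      // request already targets the age-verification page
};

// Shortened allowlist for the sketch; the real one is longer.
const ALLOWED_BOTS = /Googlebot|Google-InspectionTool|Bingbot/i;

// Mirrors the condition in the middleware: rewrite only unverified, non-bot, non-gate requests.
function shouldRewriteToGate(input: GateInput): boolean {
  const isBot = input.userAgent !== null && ALLOWED_BOTS.test(input.userAgent);
  return !input.verified && !isBot && !input.onGatePath;
}

// Every UA here must reach the real page without a cookie.
const MUST_PASS_UAS: string[] = [
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)",
  "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
];

// Any UA left in this list would have been walled.
const walledBots = MUST_PASS_UAS.filter((ua) =>
  shouldRewriteToGate({ verified: false, userAgent: ua, onGatePath: false }),
);

console.log(walledBots.length === 0 ? "all crawlers pass" : walledBots);
```

A check like this would have caught the original bug, since the inspection UA sits in the must-pass list. It does not replace an actual live test after deploy.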

What this means for adult-content sites specifically

Adult sites get hit by this more than other categories because almost all of them have an age-verification wall. The wall is non-optional under multiple jurisdictions (US TX, UT, UK, parts of EU). If your wall ate your indexing, that's potentially months of search traffic lost — and you wouldn't see it as a single error in Search Console, because each missed URL is silently never crawled.

After the fix, I re-submitted my five most recent articles via URL Inspector. All five went through immediately and entered the priority crawl queue. The earlier rejections were 100% the UA-match bug.

Repo

The full architectural notes (multi-domain Next.js, Cloudflare Tunnel setup, geo-policy table, sanitized middleware snippets) are open on github.com/noctias/noctias-stack. It's not deployable code — the live source is private — but the patterns are documented enough to reproduce.

Live sites: