Performance Dev

Posted on Jun 2 • Originally published at outboundautonomy.com

322K Pages Deindexed: A Crawl Budget Recovery Guide for Solo Developers

#seo #webdev #saas #startup

This started as a community reply on Indie Hackers — because that's where the real conversations happen.

David Amoroso (building VibeWatch) posted a thread 21 days ago that stopped me cold: he deliberately deindexed 322,000 pages from Google after realizing his TMDB-powered movie database was a thin-content liability. Three weeks later, Google went from 322,000 indexed URLs down to one — his homepage. Daily impressions dropped from 250 to 3. And now: a trust slump.

His question on Indie Hackers was direct: How do you recover from a post-purge crawl drought?

When our outreach partner [BRIDGE] found that thread — 21 days old, zero replies — he posted the only comment. This article is what that comment became. You're reading the expanded version because the crawl budget problem David described isn't unique to movie databases. It's happening to thousands of solo-built sites.

Let's fix it.

The Problem: You Chose the Right Thing, But Google Didn't Notice

David did the right technical move: he identified 322,000 thin pages — template-wrapped API data with no unique editorial content — and flagged them with <meta name="robots" content="noindex, follow">. He even removed his Googlebot rate limit (previously capped at 7 req/min) to accelerate recrawl.

Three weeks later: all 322K pages purged.
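A quick back-of-the-envelope check shows why lifting the rate limit mattered. This sketch assumes Googlebot used the full 7 req/min cap around the clock and fetched each URL exactly once, which is a simplification rather than anything measured on David's site.

```python
# Rough estimate: how long one full recrawl takes under the old cap.
# Assumes the cap is fully used 24/7 and every URL is fetched exactly once.
PAGES = 322_000
REQUESTS_PER_MINUTE = 7

requests_per_day = REQUESTS_PER_MINUTE * 60 * 24
days_for_full_recrawl = PAGES / requests_per_day

print(requests_per_day)                 # 10080
print(round(days_for_full_recrawl, 1))  # 31.9
```

Under those assumptions a single pass over every thin page would take about a month at the capped rate, so a three-week purge is only plausible with the cap removed.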

But the pages he actually wants indexed — blog posts, comparison pages, editorial content — sit in Google Search Console with this status:

"Crawled, currently not indexed."

Search Console shows "Referring page: none detected" on most remaining URLs.

That's the trust slump. Google watched 99.99% of the domain disappear and concluded: this site no longer looks trustworthy enough to allocate crawl budget to. It's not a spam penalty, not a manual action. It's a statistical withdrawal of crawling priority.

Core insight (from BRIDGE's original reply, verified against GSC documentation): this is a crawl topology problem, not a content quality problem. You can't write your way out of a crawl drought — you have to route Google through the pages you want indexed.

Step 1: IndexNow — Your Fastest Path Back to Search Engines' Attention

IndexNow is one of the most underused tools in crawl recovery. It's an open protocol supported by Bing, Yandex, Naver, and Seznam. Google has not announced support for it, so treat IndexNow as the fast path into the participating engines, and for Google itself pair it with a sitemap resubmission and URL Inspection's "Request indexing" in Search Console. You send a notification to the IndexNow API, and participating search engines learn within minutes that you have new or updated content.

How to Implement It

# Submit a single URL
curl -X POST "https://api.indexnow.org/indexnow" \
  -H "Content-Type: application/json" \
  -d '{
    "host": "yoursite.com",
    "key": "your-verification-key",
    "keyLocation": "https://yoursite.com/your-verification-key.txt",
    "urlList": [
      "https://yoursite.com/blog/your-new-article"
    ]
  }'

# Python: batch submit URLs you want indexed
import requests

urls = [
    "https://yoursite.com/blog/article-1",
    "https://yoursite.com/blog/article-2",
    "https://yoursite.com/comparison-page",
]

payload = {
    "host": "yoursite.com",
    "key": "your-key",
    "keyLocation": "https://yoursite.com/your-key.txt",
    "urlList": urls
}

r = requests.post("https://api.indexnow.org/indexnow", json=payload, timeout=10)
print(r.status_code)  # 200 = accepted; 202 = received, key validation pending

Setup requirements:

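Generate a verification key (8 to 128 characters: letters, digits, and dashes)
Host it at your domain root as your-key.txt, with the key itself as the file's only content
Send the key and keyLocation fields with every API request, as in the examples above; they do not go in your sitemap XML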
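The key setup can be scripted. Here is a minimal sketch, assuming your static files are served from a local directory; `create_indexnow_key_file` and the temporary directory standing in for the web root are hypothetical names for illustration, not part of any IndexNow tooling.

```python
import secrets
import tempfile
from pathlib import Path

def create_indexnow_key_file(web_root: Path) -> str:
    """Generate a key and write it to <web_root>/<key>.txt."""
    key = secrets.token_hex(16)  # 32 lowercase hex characters
    key_file = web_root / f"{key}.txt"
    key_file.write_text(key, encoding="utf-8")  # file body is the key itself
    return key

# Demo: a temporary directory stands in for the real web root (assumption).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    key = create_indexnow_key_file(root)
    stored = (root / f"{key}.txt").read_text(encoding="utf-8")
    print(len(key), stored == key)
```

In a real deployment you would point `web_root` at whatever directory your server exposes at the domain root, then reuse the returned key in the request payload.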

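Why it matters for David's situation: in three weeks with no rate limit, Google recrawled 322K URLs and purged them, but the new pages he wants indexed are getting almost no crawler attention. IndexNow tells participating crawlers "these URLs exist and are fresh" within minutes instead of leaving them in the passive crawl queue. It won't move Google directly, but it gets the pages in front of Bing and the other participating engines while the Google-side steps below take effect.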
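Before submitting, it helps to clean the URL list. The sketch below builds the request payloads only (it makes no network call): it drops URLs on other hosts, removes duplicates, and splits the rest into batches. `build_payloads` is a hypothetical helper, and the 10,000-URL ceiling is the per-request limit as I understand it from the IndexNow documentation, so check it before relying on it.

```python
from urllib.parse import urlsplit

MAX_URLS_PER_REQUEST = 10_000  # assumed per-request limit from the IndexNow docs

def build_payloads(host, key, urls, batch_size=MAX_URLS_PER_REQUEST):
    """Return a list of IndexNow JSON payloads, one per batch of URLs."""
    on_host = [u for u in urls if urlsplit(u).hostname == host]
    unique = list(dict.fromkeys(on_host))  # de-duplicate, keep original order
    return [
        {
            "host": host,
            "key": key,
            "keyLocation": f"https://{host}/{key}.txt",
            "urlList": unique[i:i + batch_size],
        }
        for i in range(0, len(unique), batch_size)
    ]

urls = [
    "https://yoursite.com/blog/article-1",
    "https://yoursite.com/blog/article-2",
    "https://yoursite.com/blog/article-1",   # duplicate, dropped
    "https://othersite.com/not-ours",        # wrong host, dropped
    "https://yoursite.com/comparison-page",
]
payloads = build_payloads("yoursite.com", "your-key", urls, batch_size=2)
print([len(p["urlList"]) for p in payloads])  # [2, 1]
```

Each payload can then be posted the same way as in the batch example above.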

Step 2: Sitemap Hierarchy — Don't Put Everything in One File

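A single flat sitemap isn't a strategy. Google allocates crawl budget per host, not per sitemap, but splitting sitemaps by content type still pays off: Search Console reports indexing per submitted sitemap, so you can see exactly how your high-value pages are doing, and thin or noindexed URLs stop diluting the list you're asking Google to crawl.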

What to Do Instead

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml: the master index -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-editorial.xml</loc>
    <lastmod>2026-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-utility.xml</loc>
    <lastmod>2026-06-01</lastmod>
  </sitemap>
</sitemapindex>

Sitemap	Contents	Refresh	Crawl Priority
`sitemap-blog.xml`	Blog posts, editorial articles	Every publish	Highest
`sitemap-editorial.xml`	Comparison pages, guides, case studies	Weekly	High
`sitemap-utility.xml`	About, contact, privacy, legal	Rarely	Low

Putting high-value pages in their own sitemap gives Google a short, clean list: these 15 URLs are the ones that matter, separate from the other 47. It's a hint, not a guarantee, but in a crawl drought every signal matters.

For VibeWatch specifically: David's blog posts belong in sitemap-blog.xml and his comparison pages in sitemap-editorial.xml, and nothing else goes in either file. The thin template pages (which are now noindex) go nowhere. They don't belong in any sitemap.
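Generating each child sitemap from a page list makes it harder for a noindexed URL to slip in. This is a minimal sketch using the standard library; the `pages` structure, the `noindex` flag, and the `/movie/12345` path are assumptions made up for illustration, not VibeWatch's actual data model.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML text, skipping any page flagged as noindex."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        if page.get("noindex"):
            continue  # noindexed pages do not belong in any sitemap
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        ET.SubElement(url, "lastmod").text = page["lastmod"]
    xml_body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_body

pages = [
    {"url": "https://yoursite.com/blog/article-1", "lastmod": "2026-06-01"},
    {"url": "https://yoursite.com/blog/article-2", "lastmod": "2026-06-01"},
    # Hypothetical thin template page: excluded from the output.
    {"url": "https://yoursite.com/movie/12345", "lastmod": "2026-05-01", "noindex": True},
]
sitemap_blog = build_sitemap(pages)
print(sitemap_blog)
```

Run it once per content group and write each result to the file name listed in the sitemap index.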

Step 3: Homepage Internal Linking — Your Crawl Gateway

Googlebot enters most sites through the homepage. From there, it follows links. If your new editorial pages aren't linked from the homepage (or a page one click deep from it), the crawler might never find them.

This is a common pattern behind "Crawled, currently not indexed" on solo-built sites: the URL exists in a sitemap, but no chain of internal links leads to it from a trusted entry point, so Google has little evidence that the page matters.
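You can check reachability yourself with a breadth-first walk over your internal links. This is a sketch under a big assumption: the `links` dictionary below is a hand-written stand-in for a real crawl of your site, and `click_depths` is a hypothetical helper.

```python
from collections import deque

def click_depths(links, start="/"):
    """Breadth-first search: minimum number of clicks from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/article-1"],
    "/blog/article-1": [],
    "/about": [],
}
sitemap_urls = ["/blog/article-1", "/comparison-page"]

depths = click_depths(links)
unreachable = [u for u in sitemap_urls if u not in depths]
print(depths["/blog/article-1"])  # 2
print(unreachable)                # ['/comparison-page']
```

Any sitemap URL that lands in `unreachable`, or sits many clicks deep, is a candidate for a homepage link.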

The Fix

For every new article or editorial page you publish, add a link from your homepage within 24 hours. Not a hidden footer link. An actual, visible content link — "Latest from the blog," "New Guide: [topic]," or an inline cross-reference.

<!-- On the homepage: -->
<section class="latest-content">
  <h2>Recently Published</h2>
  <ul>
    <li><a href="/blog/how-to-recover-from-google-deindex">How to Recover from...</a></li>
    <li><a href="/comparison/vibewatch-vs-letterboxd">VibeWatch vs Letterboxd</a></li>
  </ul>
</section>

Google does not refetch your homepage the moment you publish; in a crawl drought, days can pass between visits. If the link has already rotated off the homepage by the time Googlebot returns, the editorial page never gets discovered through it. Keep each link on the homepage for at least 14 days. Not 24 hours, not 48 hours: two weeks minimum.
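A small check can confirm the homepage still carries the links you expect. The sketch below parses an HTML string with the standard library; in practice you would feed it your fetched homepage, and `missing_links` is a hypothetical helper that only does exact `href` matching.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)

def missing_links(homepage_html, expected_paths):
    """Return the expected paths that the homepage does not link to."""
    collector = LinkCollector()
    collector.feed(homepage_html)
    return [p for p in expected_paths if p not in collector.hrefs]

homepage_html = """
<section class="latest-content">
  <ul>
    <li><a href="/blog/how-to-recover-from-google-deindex">How to Recover from...</a></li>
  </ul>
</section>
"""
expected = [
    "/blog/how-to-recover-from-google-deindex",
    "/comparison/vibewatch-vs-letterboxd",
]
print(missing_links(homepage_html, expected))
# ['/comparison/vibewatch-vs-letterboxd']
```

Running this daily during the two-week window catches a link that dropped off the homepage too early.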

Step 4: Staggered Publishing — Don't Flood the Queue

After a mass deindex, there's a strong temptation to publish everything at once. "I need Google to see I'm legitimate, so I'll push 10 articles today."

Don't do this. Bulk publishes from a domain in a trust slump tend to read as noise rather than signal: the whole batch lands in the queue at once and is easy to deprioritize together.

The Pattern That Works

Day	Action
Day 1	Publish 1 article. Submit via IndexNow. Link from homepage.
Day 3	Publish 2nd article. Same routine.
Day 5	Publish 3rd.
Day 7	Check GSC. If any of the 3 show "Submitted and indexed" — proceed to 1/day.
Day 14+	Normal publishing cadence (whatever your sustainable rate is).

The goal is to show Google a pattern of consistent, trustworthy publishing — not a burst that looks like spam recovery. In practice, 3-5 well-linked articles over a 7-10 day window is enough to signal "we're back."
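The cadence in the table is easy to turn into concrete dates. A minimal sketch, assuming a two-day gap between articles and a review two days after the last one; `staggered_schedule`, the start date, and the article titles are placeholders.

```python
from datetime import date, timedelta

def staggered_schedule(first_day, titles, gap_days=2):
    """Return (publish_dates, review_day) for a staggered rollout."""
    if not titles:
        raise ValueError("need at least one article")
    publish = [
        (first_day + timedelta(days=i * gap_days), title)
        for i, title in enumerate(titles)
    ]
    review_day = publish[-1][0] + timedelta(days=gap_days)  # check GSC here
    return publish, review_day

publish, review_day = staggered_schedule(
    date(2026, 6, 1),
    ["Article 1", "Article 2", "Article 3"],
)
for day, title in publish:
    print(day.isoformat(), title)
print("Review GSC on", review_day.isoformat())
```

With three articles starting on June 1, the publish dates fall on June 1, 3, and 5, and the review lands on June 7, matching Day 1, 3, 5, and 7 in the table.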

Step 5: One Quality Backlink to a Specific Article, Not the Homepage

Conventional SEO wisdom says "get backlinks to the homepage." In a crawl recovery scenario, this is wrong.

A backlink to the homepage boosts your domain authority — which you need — but it won't help Google discover a specific blog post buried in a "Crawled, currently not indexed" queue. The homepage is already indexed. The article is not.

What to Do

Get one quality backlink (DR 30+, relevant niche) that points directly to a specific editorial page: a comparison article, a guide, a case study. Google discovers URLs by following links, so a backlink to yoursite.com/blog/specific-article gives it an external reason to crawl that exact URL. If that article is also linked internally from the homepage (see Step 3), the crawler now has a path to the rest of your editorial content.

Sources that work for indie founders:

AlternativeTo — David already mentioned this. It's a strong native link for VibeWatch.
Reddit — Genuine mentions in relevant subreddits (r/movies, r/selfhosted for Jellyfin).
Indie Hackers — Profile bio with a link to a specific post.
GitHub README — If your project is open-source adjacent.
Small directory listings — Not the spammy ones. Niche-specific directories with editorial curation.

The key: one targeted link to an article > five generic links to the homepage.

Step 6: Verify Everything in Google Search Console

You can't fix what you can't measure. Before starting any of the above steps, set up or verify your GSC property and check three metrics:

1. Crawl Stats Report

Settings > Crawl Stats — Check total crawl requests per day. If you're under 50/day on a domain that should be getting 500+, you have a crawl budget problem. Cross-reference against your server logs.
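For the server-log side of that cross-reference, a few lines of Python will count Googlebot requests per day. This sketch assumes the common "combined" access log format and matches on the user-agent string only, which can be spoofed; a stricter check would also verify the requesting IP. The sample lines are fabricated for illustration.

```python
import re
from collections import Counter

# Matches the timestamp and request path in a combined-format log line.
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)'
)

def googlebot_hits(lines):
    """Return (requests per day, set of paths) for Googlebot user agents."""
    per_day = Counter()
    paths = set()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = LOG_LINE.search(line)
        if match:
            per_day[match.group("day")] += 1
            paths.add(match.group("path"))
    return per_day, paths

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [01/Jun/2026:10:15:32 +0000] "GET /blog/article-1 HTTP/1.1" 200 5123 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/Jun/2026:10:16:02 +0000] "GET /comparison-page HTTP/1.1" 200 4312 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [01/Jun/2026:11:00:00 +0000] "GET /blog/article-1 HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    f'66.249.66.1 - - [02/Jun/2026:09:00:00 +0000] "GET /blog/article-1 HTTP/1.1" 200 5123 "-" "{GOOGLEBOT_UA}"',
]
per_day, paths = googlebot_hits(sample)
print(dict(per_day))   # {'01/Jun/2026': 2, '02/Jun/2026': 1}
print(sorted(paths))   # ['/blog/article-1', '/comparison-page']
```

The set of paths also tells you whether Googlebot is reaching your new editorial URLs at all, not just how often it visits.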

2. Page Indexing Report (formerly Index Coverage)

Indexing > Pages — Filter by Crawled, currently not indexed. These are your target URLs. Track this number over the 14-day recovery window.

3. URL Inspection Tool

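Enter a specific blog post URL. The report shows the page's current indexing status, which points to why Google isn't indexing it:

"Discovered, currently not indexed" = Google knows it exists but hasn't crawled it yet
"Crawled, currently not indexed" = Google crawled it and decided not to index
"Page with redirect" = The URL redirects elsewhere; only the redirect target can be indexed
"Excluded by 'noindex' tag" = Your noindex tag is still on content pages
"Alternate page with proper canonical tag" = The page points to a different canonical URL; confirm that's the one you intend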

Quick diagnostic: if all your editorial pages show "Crawled, currently not indexed" and Search Console says "Referring page: none detected," you have a crawl topology issue. Google can reach the URL but has no signal that it matters. The fix starts with Step 3 (homepage linking), backed by Step 2 (a resubmitted sitemap) and Step 1 (an IndexNow notification for the engines that support it).
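If you're checking many URLs by hand, the quick diagnostic can be written down as a tiny lookup. This is only a sketch of the reasoning in this section: `suggest_next_step` is a hypothetical helper, and you would copy the status text and referring-page information from the URL Inspection report yourself.

```python
def suggest_next_step(status, has_referring_page):
    """Map a URL Inspection status to the step this guide recommends."""
    if status == "Discovered, currently not indexed":
        return "Not crawled yet: add internal links (Step 3) and notify (Step 1)."
    if status == "Crawled, currently not indexed":
        if not has_referring_page:
            return "Crawl topology issue: link it from the homepage (Step 3)."
        return "Crawled and linked: keep monitoring over the 14-day window."
    return "Status not covered by the quick diagnostic: inspect manually."

print(suggest_next_step("Crawled, currently not indexed", has_referring_page=False))
# Crawl topology issue: link it from the homepage (Step 3).
```

The status strings must match what Search Console shows exactly, so anything unexpected falls through to manual inspection.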

The Recovery Checklist (Print This)

A printable reference for any solo dev rebuilding after mass deindexation:

[ ] IndexNow — Set up verification key, submit all editorial URLs
[ ] Sitemap hierarchy — Split into blog, editorial, utility sitemaps
[ ] Homepage links — Every new article linked from homepage for 14 days
[ ] Staggered publishing — 1 article → wait → 2nd → wait → 1/day
[ ] Quality backlink — One DR 30+ link to a specific article, not homepage
[ ] GSC crawl stats — Verify daily crawl requests are increasing
[ ] GSC index coverage — Monitor "Crawled, not indexed" count trend
[ ] Verify noindex tags — Confirm template/thin pages still blocked
[ ] Server log check — Confirm Googlebot is actually hitting your new URLs
[ ] Internal link audit — Are your editorial pages connected in a crawlable chain?
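The "Verify noindex tags" item can be spot-checked with the standard library. This sketch looks only at the robots meta tag in an HTML string; it does not check an X-Robots-Tag response header, and `is_noindex` is a hypothetical helper name.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

def is_noindex(html):
    """True if the page's robots meta tag includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

thin_page = '<head><meta name="robots" content="noindex, follow"></head>'
blog_page = '<head><meta name="description" content="A real article"></head>'
print(is_noindex(thin_page), is_noindex(blog_page))  # True False
```

Run it over a sample of template pages (expect True) and every editorial page (expect False) to catch a noindex tag that leaked onto content you want indexed.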

Why This Matters Beyond SEO

The trust slump after mass deindexation is demoralizing. You did the hard, right thing — cleaned up your thin content — and the reward is fewer impressions, less traffic, and pages that won't index.

But there's a subtle point David's post accidentally makes: 322,000 indexed pages generating 250 impressions/day is not a win. It's noise in the index. The deindex was the right call. The recovery just needs a different playbook than what SEO blogs recommend. Not more content. Not more keywords. Better crawl topology.

That's the gap this guide addresses — and it's exactly the kind of conversation happening in the indie hacker community right now. We found David's thread on Indie Hackers because we post there. We replied because that's where real technical SEO problems surface before they hit the blog posts.

Need an Automated Crawl Diagnostic?

If you're dealing with a crawl budget collapse — or you're not sure whether your site has one — we built a free tool that maps your index coverage, crawl stats, and internal link topology in 90 seconds.

Run a free crawl diagnostic →

No signup required. Enter your URL, get a crawl-budget breakdown + actionable recovery steps tailored to your site's size and structure.

This article started as a community reply on Indie Hackers — the best place to find real technical SEO problems before they become blog posts. Thanks to David Amoroso (VibeWatch) for posting the original thread, and to our outreach team for finding it.

Have a recovery story of your own? Start a free audit and let's see what Google sees.
