🔥 TL;DR: Want the full technical playbook? This article covers the core concepts. The complete guide includes GSC audit templates, robots.txt patterns for 12 CMS setups, and the implementation calendar that took us from 15% to 94% indexation.
→ Fix Your Crawl Budget: The Complete Playbook · €12, instant PDF · 30-day refund
We had 800 pages. Google was indexing 120.
That's 85% of our content invisible to search, despite being technically sound, well-written, and internally linked.
The problem wasn't content quality. It was crawl budget waste.
Here's what we found, what we fixed, and what actually moved the needle.
What Is Crawl Budget (And Why It Matters at Scale)
Google doesn't crawl your entire site on every pass. It has a budget, shaped by your domain's authority and popularity (crawl demand) and your server's response time (crawl capacity).
For small sites (<1,000 pages): rarely a problem.
For large sites (10,000+ pages): it becomes the bottleneck between publishing and ranking.
The part most guides miss: crawl budget is wasted on pages that will never rank.
Every paginated URL, filter combination, or session-parameterized URL Googlebot visits is a crawl NOT spent on your actual content.
Our Crawl Budget Audit
Pull your server access logs, filter them to Googlebot requests, and cross-reference with the GSC "Not indexed" report.
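If you'd rather script the bucketing than do it in a spreadsheet, here's a minimal sketch, assuming a combined-format access log at access.log; the path and the URL patterns are placeholders you'd adapt to your own site structure:

```python
import re
from collections import Counter

# Placeholder path and patterns: adjust to your own log location and URL structure.
LOG_PATH = "access.log"
BUCKETS = {
    "pagination": re.compile(r"/page/\d+"),
    "facets": re.compile(r"\?(?:[^ ]*&)?(?:sort|color|size)="),
    "session/tracking": re.compile(r"\?(?:[^ ]*&)?(?:session_id|ref|utm_source)="),
}

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        # Crude user-agent filter; verify Googlebot IPs separately if you need rigor.
        if "Googlebot" not in line:
            continue
        match = re.search(r'"(?:GET|HEAD) (\S+)', line)
        if not match:
            continue
        url = match.group(1)
        for name, pattern in BUCKETS.items():
            if pattern.search(url):
                counts[name] += 1
                break
        else:
            counts["real content"] += 1

for bucket, hits in counts.most_common():
    print(f"{bucket}: {hits} Googlebot requests")
```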
What was eating our budget:
| Issue | URLs Wasted |
|---|---|
| Pagination (/page/2, /page/3...) | 340 |
| Faceted navigation (?sort=price&color=red) | 180 |
| Session parameters in URLs | 90 |
| Thin tag/category pages | 70 |
That's 680 URLs consuming crawl budget daily, with zero ranking potential.
Fix those, and Google reallocates crawl capacity to your real content.
The 5 Fixes That Worked
1. Noindex Pagination (Except Page 1)
<!-- On /blog/page/2 and all subsequent pages -->
<meta name="robots" content="noindex, follow">
Keep follow so link equity still passes. Just stop indexation of content-free pages.
2. Canonical Tags on Filter/Facet Pages
<!-- On /shoes?color=red&size=42 -->
<link rel="canonical" href="https://example.com/shoes/">
Consolidates crawl budget and ranking signals on the base page, and prevents duplicate-content issues.
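To spot-check that facet URLs actually carry the canonical, here's a small fetch-and-parse sketch; the URLs are illustrative, so swap in real filter combinations pulled from your logs:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative facet URLs: replace with real filter combinations from your crawl log.
FACET_URLS = [
    "https://example.com/shoes?color=red&size=42",
    "https://example.com/shoes?sort=price",
]

for url in FACET_URLS:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    canonicals = [
        link.get("href")
        for link in soup.find_all("link")
        if "canonical" in (link.get("rel") or [])
    ]
    print(f"{url}\n  -> canonical: {canonicals[0] if canonicals else 'MISSING'}")
```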
3. Block Session Parameters in robots.txt
# robots.txt
Disallow: /*?session_id=
Disallow: /*?ref=email
Disallow: /*?utm_source=
Better fix: sessions in cookies, not URLs. But robots.txt works in the short term.
4. Internal Links to Underlinked Pages
Pages with zero internal links barely get crawled, even if they're technically indexed.
Use Screaming Frog to find pages with <2 internal links. Add links from your high-traffic content.
Rule of thumb: No content page should have fewer than 3 relevant internal links pointing to it.
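A minimal sketch of that inlink check, assuming you've exported Screaming Frog's "All Inlinks" report to CSV; the filename and the "Type" / "Source" / "Destination" column names are assumptions, so verify them against your export:

```python
import csv
from collections import Counter

# Assumed export: Screaming Frog "All Inlinks" as CSV; adjust filename/columns to match yours.
inlink_counts = Counter()
with open("all_inlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("Type", "Hyperlink") != "Hyperlink":
            continue  # ignore images, canonicals, hreflang, etc.
        if row["Source"] != row["Destination"]:
            inlink_counts[row["Destination"]] += 1

# Pages appearing as a destination fewer than 3 times need more internal links.
# Note: true orphans (0 inlinks) won't appear here at all; diff this against your
# full crawl URL list to catch them.
for url, count in sorted(inlink_counts.items(), key=lambda kv: kv[1]):
    if count < 3:
        print(f"{count:>2} inlinks  {url}")
```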
5. Server Response Time Under 200ms
Googlebot throttles itself to what your server can handle. Slow responses = fewer crawls per day.
Target <200ms TTFB. CDN for static assets. Aggressive page-level caching.
Going from 800ms to 180ms TTFB alone increased our daily crawl rate by ~40%.
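If you want a rough TTFB number without a full monitoring stack, here's a minimal sketch using requests; the URLs are placeholders, sample several page templates, and ideally run it from a server rather than your laptop so latency is closer to what Googlebot sees:

```python
import requests

# Placeholder URLs: sample a few templates (home, category, article), not just the homepage.
URLS = [
    "https://example.com/",
    "https://example.com/blog/some-article/",
]

for url in URLS:
    # stream=True returns as soon as headers arrive, so r.elapsed approximates TTFB.
    r = requests.get(url, stream=True, timeout=10)
    print(f"{int(r.elapsed.total_seconds() * 1000):>5} ms  {r.status_code}  {url}")
    r.close()
```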
The Results (6 Weeks Later)
| Metric | Before | After |
|---|---|---|
| Indexed pages | 120 | 748 |
| GSC daily crawl rate | 340/day | 1,200/day |
| Organic impressions | 12k/mo | 47k/mo |
Not all indexed pages ranked. But you can't rank what Google hasn't seen.
Ongoing Monitoring
Weekly GSC audit habit:
- "Not indexed" count โ should decrease weekly
- "Crawled but not indexed" โ content quality issue
- "Discovered but not crawled" โ internal linking issue
Set up a Google Sheet pulling GSC data weekly. Flag anything that changes by >10%.
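If you'd rather script the flag than build it in Sheets, here's a minimal sketch comparing two weekly snapshots; the CSV filenames and the "Status" / "Pages" columns are assumptions about how you export the counts, not a GSC API call:

```python
import csv

# Assumed format: one row per GSC indexing status, with "Status" and "Pages" columns,
# saved weekly as gsc_last_week.csv / gsc_this_week.csv.
def load(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {row["Status"]: int(row["Pages"]) for row in csv.DictReader(f)}

last_week = load("gsc_last_week.csv")
this_week = load("gsc_this_week.csv")

for status, current in this_week.items():
    previous = last_week.get(status)
    if not previous:
        continue
    change = (current - previous) / previous
    flag = "  <-- FLAG" if abs(change) > 0.10 else ""
    print(f"{status}: {previous} -> {current} ({change:+.0%}){flag}")
```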
🔥 The Complete Crawl Budget Playbook
This article is the theory. The full playbook includes:
- GSC audit template (CSV with formulas to identify high-priority gaps)
- robots.txt patterns for 12 common CMS setups (WordPress, Webflow, Next.js, Shopify...)
- Internal linking matrix for content cluster optimization
- Server configuration checklist for Nginx, Apache, and Vercel
- 30-day implementation calendar with weekly checkpoints
→ 800 Pages, 120 Indexed: Fix Your Crawl Budget · €12, instant PDF download
30-day money-back guarantee. No questions asked.