I Published 2,849 Articles — Google Only Indexed 1,815. Here's What I'm Doing About It
Last month I hit a milestone I'm not sure I should celebrate: 2,849 published articles on my technical knowledge base platform.
Then I checked Google Search Console and found something uncomfortable: only 1,815 pages indexed. That's 36% of my content sitting in the void — published, technically accessible, but invisible.
Here's what I learned about the gap between "publishing" and "being found," and the specific steps I'm taking to close it.
The Discovery Was Painful
I'd been optimizing for output, not discovery. The thinking was simple: more articles = more surface area for long-tail keywords = more organic traffic.
The reality: Google doesn't care about your content volume if it can't or won't index it.
The 1,034 unindexed pages weren't duplicates, weren't blocked by robots.txt, weren't low-quality. They were just... orphaned in Google's queue. Some had been sitting there for weeks.
The Root Causes
1. Sitemap Bloat
My sitemap.xml was a single file with 2,849 URLs. Google's sitemap limit is 50,000 URLs per file, so I was well under it, but in my experience a massive single sitemap gets deprioritized. Googlebot seemed to read the first chunk thoroughly and skim the rest.
Fix: Split into paginated sitemaps by category and date. Each sitemap now has 500 URLs max, with a sitemap index pointing to them all.
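A minimal sketch of the split, assuming a flat list of article URLs. It chunks purely by count (500 per file) rather than by category and date, and the file names `sitemap-N.xml` and `sitemap_index.xml` are placeholders, not the names used on the real site.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=500):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_urlset(urls):
    """Render one child sitemap containing the given URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_index(sitemap_locs):
    """Render the sitemap index that points at every child sitemap."""
    entries = "".join(
        f"  <sitemap><loc>{escape(loc)}</loc></sitemap>\n" for loc in sitemap_locs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def split_sitemap(urls, base_url, size=500):
    """Return {filename: xml} for the child sitemaps plus the index file."""
    files = {}
    child_locs = []
    for i, part in enumerate(chunk(urls, size), start=1):
        name = f"sitemap-{i}.xml"  # placeholder naming scheme
        files[name] = build_urlset(part)
        child_locs.append(f"{base_url}/{name}")
    files["sitemap_index.xml"] = build_index(child_locs)
    return files
```

With 2,849 URLs and a 500-URL cap this produces six child sitemaps (the last one holding 349 URLs) and one index, which is the only file that needs to be submitted in Search Console.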
2. Crawl Budget Exhaustion
With 2,849 pages and a modest domain authority, my daily crawl budget was being eaten by:
- Pagination pages (page/2, page/3, etc.)
- Tag archives with thin content
- API endpoints accidentally left unblocked
Fix: noindex on paginated and archive pages, explicit Disallow on API routes in robots.txt, and — crucially — I added a crawl-delay equivalent via server-side rate limiting.
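One detail worth keeping in mind: a noindex signal only works if the page stays crawlable, so the paginated and archive pages should not also be Disallowed in robots.txt. The sketch below checks a robots.txt rule set with Python's standard `urllib.robotparser` and picks an `X-Robots-Tag` header value per path. The `/api/` prefix and the `/tag/...` URL pattern are assumptions for illustration; substitute the site's real routes.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed rule set: block API routes for every crawler, leave everything else open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""

# Assumed URL shapes for paginated listings and tag archives.
PAGINATED_OR_ARCHIVE = re.compile(r"^/(?:page/\d+|tag/[^/]+)/?$")


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def x_robots_tag(path):
    """Return the X-Robots-Tag header value for a path, or None to send no header."""
    if PAGINATED_OR_ARCHIVE.match(path):
        # Keep the page crawlable so links are followed, but keep it out of the index.
        return "noindex, follow"
    return None
```

Running `is_allowed` against a handful of real URLs before deploying a robots.txt change is a cheap way to catch a rule that blocks more than intended.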
3. Internal Link Orphans
The 1,034 unindexed pages were disproportionately:
- Recently published articles (under 2 weeks old)
- Articles in niche categories with few cross-links
- Articles that weren't linked from the homepage or category landing pages
Google finds pages through links. No links = no discovery = no indexing.
Fix: I built an automated internal linking system. Every new article now gets:
- 3-5 contextual links from existing related articles
- Placement in a "Recently Published" sidebar widget
- A link from its category landing page within 24 hours
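Before automating new links, it helps to know which pages are under-linked in the first place. Here is a rough sketch that counts inbound internal links from a link graph. The input format (a dict mapping each page to the pages it links to) is an assumption; in practice it would come from a crawl of the site or from the CMS database.

```python
from collections import Counter


def inbound_counts(links):
    """Count inbound internal links per page.

    `links` maps each page URL to an iterable of URLs it links to.
    Self-links and repeated links from the same source are counted once or not at all.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def under_linked(links, minimum=3):
    """Return known pages with fewer than `minimum` inbound links, sorted by URL."""
    counts = inbound_counts(links)
    return sorted(page for page in links if counts[page] < minimum)


# Tiny illustrative graph, not real site data.
example_links = {
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": ["/a"],
    "/d": [],
}
```

Pages returned by `under_linked` are the candidates for contextual links from related articles; a page with zero inbound links, like `/d` in the example graph, is a true orphan.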
4. AI Crawler Interference (Unexpected)
This one surprised me. My robots.txt was allowing GPTBot, PerplexityBot, and ClaudeBot — which is fine, I want AI search engines to index my content.
But these crawlers were competing with Googlebot for the same server capacity, and how fast Googlebot crawls depends partly on how quickly the server responds. In a typical hour, I'd see:
- Googlebot: ~40 requests
- GPTBot: ~25 requests
- PerplexityBot: ~15 requests
- ClaudeBot: ~10 requests
That's 50 extra requests per hour of server load that does nothing for Google indexing.
Fix: I added Crawl-delay: 10 specifically for AI crawlers and rate-limited them at the server level, since Crawl-delay is a non-standard directive that not every bot honors. Googlebot gets priority; AI crawlers get a polite queue.
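The server-level part can be sketched as a per-crawler minimum-interval limiter. This is an illustration under assumptions, not the actual production setup: it matches crawlers by user-agent substring (which can be spoofed), keeps state in memory, and the `MinIntervalLimiter` class name is made up for the example.

```python
AI_CRAWLER_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot")
KNOWN_TOKENS = ("Googlebot",) + AI_CRAWLER_TOKENS


def crawler_name(user_agent):
    """Return the first known crawler token found in the user-agent, else None."""
    lowered = user_agent.lower()
    for token in KNOWN_TOKENS:
        if token.lower() in lowered:
            return token
    return None


class MinIntervalLimiter:
    """Allow at most one request per `interval` seconds for each AI crawler."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self._last_allowed = {}  # crawler token -> timestamp of last allowed request

    def allow(self, user_agent, now):
        """Return True if the request should be served at time `now` (in seconds)."""
        name = crawler_name(user_agent)
        if name not in AI_CRAWLER_TOKENS:
            return True  # Googlebot and regular visitors are never throttled here
        last = self._last_allowed.get(name)
        if last is not None and now - last < self.interval:
            return False  # caller would typically answer 429 with a Retry-After header
        self._last_allowed[name] = now
        return True
```

Note that a 10-second interval still permits up to 360 requests per hour per crawler, so at the volumes listed above it mostly flattens bursts rather than cutting total traffic. A longer interval would be needed to actually reduce the hourly count.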
The Numbers Game
Here's where things stand:
| Metric | Before | After (in progress) |
|---|---|---|
| Published articles | 2,849 | 2,849 |
| Google-indexed pages | 1,815 (64%) | ~1,900 (67%) and climbing |
| Average index time (new article) | 7-14 days | 2-4 days |
| Sitemap submissions | 1 monolithic | 6 category-based |
| Internal links per new article | 0-1 | 3-5 (automated) |
What I'm Still Wrestling With
Long-tail SEO optimization at scale. I've run ~2,033 article titles through a long-tail keyword rewrite; 485 of those were skipped because the original titles were already well-optimized. The remaining 816 newer articles still need this treatment.
The challenge: doing this programmatically without producing robotic, keyword-stuffed titles. I'm using a hybrid approach — AI generates candidate titles, then a scoring function filters for readability, keyword density, and uniqueness.
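A scoring filter along these lines might look like the sketch below. Every threshold here (65-character length cap, 0.5 maximum keyword density) and the use of Jaccard word overlap for uniqueness are assumptions chosen for illustration, not the values or method used on the real platform.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(title, keyword):
    """Fraction of title words covered by occurrences of the keyword phrase."""
    words, key = tokens(title), tokens(keyword)
    if not words or not key:
        return 0.0
    hits = sum(
        words[i:i + len(key)] == key for i in range(len(words) - len(key) + 1)
    )
    return hits * len(key) / len(words)


def jaccard(a, b):
    """Word-set overlap between two titles, from 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def score_title(title, keyword, existing_titles, max_len=65, max_density=0.5):
    """Score a candidate title in [0, 1]; 0.0 means reject."""
    density = keyword_density(title, keyword)
    if density == 0 or density > max_density:
        return 0.0  # keyword missing, or title looks keyword-stuffed
    # Crude readability proxy: penalize titles longer than max_len characters.
    readability = 1.0 if len(title) <= max_len else max_len / len(title)
    # Uniqueness: distance from the most similar existing title.
    closest = max((jaccard(title, t) for t in existing_titles), default=0.0)
    return readability * (1.0 - closest)
```

Candidates that score 0.0 are dropped outright, and the rest can be ranked so a human only reviews the top one or two per article.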
The Takeaway
Publishing content is only half the job. The other half is making sure the right crawlers find it, index it, and rank it — without wasting your crawl budget on pages that don't matter.
If you're running a content-heavy site and seeing a gap between published and indexed pages, check these four things first:
- Sitemap structure — split it up
- Crawl budget — block what shouldn't be crawled
- Internal links — every new page needs at least 3 inbound links
- AI crawler management — they're helpful but expensive
This is part of an ongoing series about building and scaling a technical knowledge base platform. You can explore the full knowledge base at Codcompass.