DEV Community

kol kol
I Published 2,849 Articles — Google Only Indexed 1,815. Here's What I'm Doing About It


Last month I hit a milestone I'm not sure I should celebrate: 2,849 published articles on my technical knowledge base platform.

Then I checked Google Search Console and found something uncomfortable: only 1,815 pages indexed. That's 36% of my content sitting in the void — published, technically accessible, but invisible.

Here's what I learned about the gap between "publishing" and "being found," and the specific steps I'm taking to close it.

The Discovery Was Painful

I'd been optimizing for output, not discovery. The thinking was simple: more articles = more surface area for long-tail keywords = more organic traffic.

The reality: Google doesn't care about your content volume if it can't or won't index it.

The 1,034 unindexed pages weren't duplicates, weren't blocked by robots.txt, weren't low-quality. They were just... orphaned in Google's queue. Some had been sitting there for weeks.

The Root Causes

1. Sitemap Bloat

My sitemap.xml was a single file with 2,849 URLs. Google's sitemap limit is 50,000 URLs per file, but in practice, massive sitemaps get deprioritized. Googlebot reads the first chunk thoroughly and skims the rest.

Fix: Split into paginated sitemaps by category and date. Each sitemap now has 500 URLs max, with a sitemap index pointing to them all.
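Here's a minimal sketch of that split, assuming you already have your URLs grouped by category. The file names (`sitemap-<category>-<n>.xml`, `sitemap-index.xml`) and the `split_sitemaps` helper are made up for illustration and are not the exact code running on my platform; date-based grouping and `<lastmod>` are left out to keep it short.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=500):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Render one <urlset> document for a list of page URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_index(sitemap_urls):
    """Render the <sitemapindex> document that points at every child sitemap."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def split_sitemaps(urls_by_category, base_url, size=500):
    """Return {filename: xml} for every child sitemap plus the index file."""
    files = {}
    for category, urls in sorted(urls_by_category.items()):
        for n, part in enumerate(chunk(sorted(urls), size), start=1):
            files[f"sitemap-{category}-{n}.xml"] = build_sitemap(part)
    # Build the index from the child file names before adding the index itself.
    index_urls = [f"{base_url}/{name}" for name in files]
    files["sitemap-index.xml"] = build_index(index_urls)
    return files
```

You would write each entry of the returned dict to disk and submit only the index file in Search Console.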

2. Crawl Budget Exhaustion

With 2,849 pages and a modest domain authority, my daily crawl budget was being eaten by:

  • Pagination pages (page/2, page/3, etc.)
  • Tag archives with thin content
  • API endpoints accidentally left unblocked

Fix: noindex on paginated and archive pages, explicit Disallow on API routes in robots.txt, and — crucially — I added a crawl-delay equivalent via server-side rate limiting.
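A quick way to check rules like these before deploying is Python's standard `urllib.robotparser`. The sketch below is illustrative only: the `/api/` prefix, the `/page/N` and `/tag/` URL patterns, and both helper names are assumptions about the site layout, not my real routes. One thing to keep in mind is that a `noindex` directive only works on pages crawlers are still allowed to fetch, so the paginated and archive pages get a header rather than a `Disallow`.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block API routes for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""

# Hypothetical URL patterns for thin pages that should stay crawlable
# but be kept out of the index.
NOINDEX_PATTERNS = [re.compile(r"/page/\d+/?$"), re.compile(r"^/tag/")]


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def x_robots_tag(path):
    """Return the X-Robots-Tag header value for a path, or None for normal pages."""
    if any(pattern.search(path) for pattern in NOINDEX_PATTERNS):
        return "noindex, follow"
    return None
```

Running `is_crawlable` against a handful of real URLs in CI is a cheap guard against accidentally leaving an API route open again.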

3. Internal Link Orphans

The 1,034 unindexed pages were disproportionately:

  • Recently published articles (under 2 weeks old)
  • Articles in niche categories with few cross-links
  • Articles that weren't linked from the homepage or category landing pages

Google finds pages through links. No links = no discovery = no indexing.

Fix: I built an automated internal linking system. Every new article now gets:

  • 3-5 contextual links from existing related articles
  • Placement in a "Recently Published" sidebar widget
  • A link from its category landing page within 24 hours
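The core of a system like this can be sketched in a few lines. The version below is a simplified illustration, not my production code: it assumes the link graph and article tags are already available as plain dicts, and it uses Jaccard overlap of tags as a stand-in for "related", which is only one of several reasonable similarity measures. All function names are hypothetical.

```python
def inbound_counts(links):
    """links maps each source slug to the set of slugs it links to.

    Returns the number of inbound links per slug, ignoring self-links.
    """
    counts = {slug: 0 for slug in links}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def find_orphans(links, minimum=3):
    """Return slugs with fewer than `minimum` inbound links, sorted by name."""
    counts = inbound_counts(links)
    return sorted(slug for slug, n in counts.items() if n < minimum)


def suggest_sources(new_slug, tags, limit=5):
    """Rank existing articles by tag overlap (Jaccard) with the new article.

    tags maps each slug to a set of tag strings. Articles sharing no tags
    with the new one are never suggested.
    """
    target = tags[new_slug]
    scored = []
    for slug, other in tags.items():
        if slug == new_slug:
            continue
        union = target | other
        score = len(target & other) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, slug))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [slug for _, slug in scored[:limit]]
```

`find_orphans` tells you which pages need attention, and `suggest_sources` proposes which existing articles should link to a new one. Actually inserting the contextual link into the article body is the harder part and is not shown here.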

4. AI Crawler Interference (Unexpected)

This one surprised me. My robots.txt was allowing GPTBot, PerplexityBot, and ClaudeBot — which is fine, I want AI search engines to index my content.

But these crawlers were consuming crawl budget alongside Googlebot. In a typical hour, I'd see:

  • Googlebot: ~40 requests
  • GPTBot: ~25 requests
  • PerplexityBot: ~15 requests
  • ClaudeBot: ~10 requests

That's 50 extra requests per hour that aren't Google indexing my pages.

Fix: I added Crawl-delay: 10 specifically for AI crawlers and rate-limited them at the server level. Googlebot gets priority; AI crawlers get a polite queue.
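Since `Crawl-delay` is a non-standard directive (Googlebot ignores it, and other crawlers may or may not honor it), the server-side limit is the part that actually enforces the queue. Here's a rough sketch of a per-crawler token bucket, assuming requests can be matched by a substring of the User-Agent header. The class names and the default of one request per 10 seconds (mirroring `Crawl-delay: 10`) are illustrative choices, not my exact server config.

```python
import time

# Crawlers to throttle, matched case-insensitively against the User-Agent.
AI_CRAWLERS = ("GPTBot", "PerplexityBot", "ClaudeBot")


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that passed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CrawlerLimiter:
    """Throttle known AI crawlers; every other client passes straight through."""

    def __init__(self, rate=0.1, capacity=1):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, user_agent, now=None):
        agent = user_agent.lower()
        name = next((bot for bot in AI_CRAWLERS if bot.lower() in agent), None)
        if name is None:
            return True  # Googlebot and regular visitors are not limited here
        bucket = self.buckets.get(name)
        if bucket is None:
            bucket = self.buckets[name] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

When `allow` returns False, the web layer would typically respond with HTTP 429 and a `Retry-After` header. The optional `now` argument exists only to make the limiter easy to test with fake timestamps.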

The Numbers Game

Here's where things stand:

| Metric | Before | After (in progress) |
| --- | --- | --- |
| Published articles | 2,849 | 2,849 |
| Google-indexed pages | 1,815 (64%) | ~1,900 (67%) and climbing |
| Average index time (new article) | 7-14 days | 2-4 days |
| Sitemap submissions | 1 monolithic | 6 category-based |
| Internal links per new article | 0-1 | 3-5 (automated) |

What I'm Still Wrestling With

Long-tail SEO optimization at scale. I've run ~2,033 article titles through a long-tail keyword rewrite, and 485 of those were skipped because the original titles were already well-optimized. The remaining 816 newer articles still need this treatment.

The challenge: doing this programmatically without producing robotic, keyword-stuffed titles. I'm using a hybrid approach — AI generates candidate titles, then a scoring function filters for readability, keyword density, and uniqueness.
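To make the filtering step concrete, here's a simplified sketch of what such a scoring function could look like. It is not my actual scorer: the thresholds (`max_chars=60`, `max_density=0.5`) are arbitrary placeholders, title length is used as a crude proxy for readability, and uniqueness is approximated with Jaccard token overlap against existing titles.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(title, keyword):
    """Share of title tokens that belong to occurrences of the keyword phrase."""
    words, key = tokenize(title), tokenize(keyword)
    if not words or not key:
        return 0.0
    hits = sum(
        1
        for i in range(len(words) - len(key) + 1)
        if words[i:i + len(key)] == key
    )
    return hits * len(key) / len(words)


def uniqueness(title, existing_titles):
    """1 minus the highest Jaccard token overlap with any existing title."""
    words = set(tokenize(title))
    best = 0.0
    for other in existing_titles:
        other_words = set(tokenize(other))
        union = words | other_words
        if union:
            best = max(best, len(words & other_words) / len(union))
    return 1.0 - best


def score_title(title, keyword, existing_titles, max_chars=60, max_density=0.5):
    """Score a candidate in [0, 1]; 0 means reject."""
    density = keyword_density(title, keyword)
    if density == 0.0 or density > max_density:
        return 0.0  # keyword missing, or title looks keyword-stuffed
    length_factor = 1.0 if len(title) <= max_chars else 0.5
    return length_factor * uniqueness(title, existing_titles)


def pick_title(candidates, keyword, existing_titles):
    """Return the best-scoring candidate, or None if every candidate is rejected."""
    if not candidates:
        return None
    scored = [(score_title(c, keyword, existing_titles), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > 0 else None
```

Returning None when nothing passes matters in practice: it lets the pipeline keep the original title instead of forcing a bad rewrite.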

The Takeaway

Publishing content is only half the job. The other half is making sure the right crawlers find it, index it, and rank it — without wasting your crawl budget on pages that don't matter.

If you're running a content-heavy site and seeing a gap between published and indexed pages, check these four things first:

  1. Sitemap structure — split it up
  2. Crawl budget — block what shouldn't be crawled
  3. Internal links — every new page needs at least 3 inbound links
  4. AI crawler management — they're helpful but expensive

This is part of an ongoing series about building and scaling a technical knowledge base platform. You can explore the full knowledge base at Codcompass.
