Performance Dev

Posted on Jun 2 • Originally published at outboundautonomy.com

The Agency Owner's Crawl Budget Audit Checklist: 10 Data Points Every SEO Audit Needs After Mass Deindexation

#webdev #performance #tutorial #discuss

Published: Outbound Autonomy
Series: Crawl Budget Recovery (Sequel to 322K Pages Deindexed: A Crawl Budget Recovery Guide for Solo Developers)
Target: Agency owners, SEO consultants, fractional CMOs auditing client sites
Read time: 8 minutes
CTA: Run your first white-label audit →

Last week, I showed you how David — a solo dev running a movie database — recovered from having 322,000 pages deindexed. The guide worked because it treated crawl budget as a topology problem, not a content problem.

But if you're an agency owner, you read that guide and thought: "I have 15 clients. I need a repeatable playbook, not a hero story."

You need a checklist a junior team member can run on Monday morning and have a client-ready report by lunch.

This is that checklist.

I'm going to give you the 10-point audit a $5K/mo SEO agency uses to diagnose crawl budget collapses in under 90 minutes. And I'm going to show you exactly where automation replaces manual work — because the agencies winning right now aren't the ones with better SEOs. They're the ones with better audit tools.

Before the Checklist: The One Metric That Changes Everything

Most crawl budget audits start from the wrong question. Agency owners ask: "How many pages is Google crawling?"

The right question: "Is Google crawling the right pages in the right order?"

A site with 15,000 indexed product pages and 5 crawled blog posts has a healthy crawl rate. A site with 150 indexed pages and 15,000 crawled URLs has a crawl fire — Googlebot is thrashing. These look identical in GSC's crawl stats report. The difference is topology.

This checklist catches the topology problem in the first 3 checks. Everything after that is confirmation.

The 10-Point Crawl Budget Audit Checklist

1. Index Coverage Ratio (the "Is the site bleeding?" check)

What to pull: GSC → Pages → Index Coverage → the four status buckets:

Status	Healthy Range (for a client site with <50K URLs)
Submitted and indexed	>70% of total submitted
Crawled, currently not indexed	<15% of total submitted
Discovered, currently not indexed	<10% of total submitted
Errors (4xx, 5xx, soft 404)	<3% of total crawled

Agency heuristic: If "Crawled, currently not indexed" exceeds 20%, you have a crawl topology problem. If it exceeds 35%, the site is in a trust slump (see David's case: 99.99% deindex → single-digit crawl rate).
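A minimal sketch of that heuristic, assuming you have already exported the three bucket counts from GSC by hand. The function names are illustrative, and treating "total submitted" as the sum of the three buckets is an assumption, not something GSC reports directly.

```python
def coverage_ratios(submitted_indexed, crawled_not_indexed, discovered_not_indexed):
    """Return each bucket as a share of total submitted URLs.

    Assumption: total submitted = the sum of the three buckets passed in.
    """
    total = submitted_indexed + crawled_not_indexed + discovered_not_indexed
    if total == 0:
        raise ValueError("no submitted URLs to evaluate")
    return {
        "indexed": submitted_indexed / total,
        "crawled_not_indexed": crawled_not_indexed / total,
        "discovered_not_indexed": discovered_not_indexed / total,
    }


def classify_coverage(ratios):
    """Apply the 20% / 35% heuristic to the crawled-not-indexed share."""
    share = ratios["crawled_not_indexed"]
    if share > 0.35:
        return "trust slump"
    if share > 0.20:
        return "crawl topology problem"
    return "within heuristic"
```

For example, 700 indexed, 250 crawled-not-indexed, and 50 discovered-not-indexed URLs gives a 25% crawled-not-indexed share, which lands in the "crawl topology problem" band.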

The automated shortcut: OA's free audit engine pulls this from GSC API in one click. No manual navigation through 4 GSC tabs. → Run it now

2. Sitemap Distribution Audit (the "Are you wasting the sitemap?" check)

Many agencies check "does the site have a sitemap?" and tick the box. Wrong check. The right check: how is crawl budget distributed across sitemaps?

Manual audit:

Open robots.txt, confirm sitemap URLs
Visit each sitemap URL
Check: are high-value pages (blog posts, case studies, core product pages) in a SEPARATE sitemap from utility pages (about, contact, privacy)?
If everything is in one sitemap → critical fail. Google treats all URLs equally.

Agency heuristic: A single sitemap with 50 URLs on a site with 200+ pages is a crawl discovery bottleneck. Split into at least 3 sitemaps: blog, editorial/comparison, utility.
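The manual steps above can be sketched in a few lines, assuming you have already downloaded each sitemap's XML. The names `count_urls` and `sitemap_report` are hypothetical; the namespace is the standard one from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def count_urls(sitemap_xml):
    """Count <url><loc> entries in one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url/sm:loc", SITEMAP_NS))


def sitemap_report(sitemaps):
    """Map sitemap name -> URL count and flag the single-sitemap case.

    `sitemaps` is a dict of {name: xml_text} that you fetched yourself.
    """
    counts = {name: count_urls(xml) for name, xml in sitemaps.items()}
    return {"counts": counts, "single_sitemap_risk": len(counts) < 2}
```

Comparing the per-sitemap counts against the total page count from a crawl shows quickly whether high-value pages are missing from the sitemaps or lumped in with utility pages.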

Automated check: OA's audit scans sitemap structure and flags single-sitemap risk automatically. Shows you the count per sitemap + priority distribution in the "Crawl Health" section.

3. Internal Link Depth Map (the "Can Google find it in 3 clicks?" check)

The third-fastest way to lose crawl budget: orphan content. Pages in the sitemap that no other page on the site links to.

Manual audit:

Pick 5 URLs from the "Crawled, currently not indexed" bucket in GSC
For each: can you navigate there from the homepage in ≤3 clicks?
If not: Google can't either. The sitemap submits the URL; internal links trigger the crawl.

Agency heuristic: If any editorial page requires more than 3 clicks from the homepage, move an internal link closer. Homepage → category page → article is fine. Homepage → search → filter → article is not.
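One way to run the click-depth check at scale is a breadth-first search over the internal link graph. This is a sketch under the assumption that a crawler has already given you a mapping of page to outgoing internal links; `links`, `homepage`, and both function names are illustrative.

```python
from collections import deque


def click_depths(links, homepage):
    """Breadth-first search: minimum clicks from the homepage to each page."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def flag_deep_or_orphaned(links, homepage, sitemap_urls, max_clicks=3):
    """Return sitemap URLs that are unreachable or deeper than max_clicks."""
    depths = click_depths(links, homepage)
    return sorted(
        url for url in sitemap_urls
        if url not in depths or depths[url] > max_clicks
    )
```

Breadth-first search is used because it reaches every page by its shortest path first, which is the "fewest clicks" number the heuristic cares about.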

Automated check: OA's audit maps internal link depth from crawl data. Flags pages in the sitemap that have zero internal links or are >3 clicks deep. This single check eliminates 40% of "Crawled, not indexed" causes.

4. Crawl Rate vs. Crawl Demand (the "Is Googlebot bored or overwhelmed?" check)

GSC → Crawl Stats shows raw requests/day. But the number itself is meaningless without context.

Manual audit:

Check avg crawl requests/day over 90 days
Check the "recent" trend (last 7 days)
Compare against server response time: if the server is fast (<200ms) and crawl requests are low (<50/day on a site with >1K pages), you have a crawl budget withdrawal — not a bottleneck

Agency heuristic:

Crawl Rate	Server Speed	Diagnosis
Low (<50/day)	Fast (<200ms)	Crawl withdrawal (trust slump) — needs topology fix
Low	Slow (>500ms)	Server bottleneck — needs CDN/hosting fix
High (>500/day)	Fast	Healthy — don't touch
High	Slow	Infrastructure crisis — Google is fighting your server
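The table above can be encoded as a small lookup. This is a sketch only: the function name is hypothetical, and because the table leaves the middle ranges (50 to 500 requests/day, 200 to 500 ms) undefined, the code returns "inconclusive" for them rather than guessing.

```python
def diagnose_crawl(requests_per_day, avg_response_ms):
    """Map crawl rate and server speed onto the four-quadrant table.

    Values between the thresholds are not covered by the table,
    so they return "inconclusive".
    """
    low, high = requests_per_day < 50, requests_per_day > 500
    fast, slow = avg_response_ms < 200, avg_response_ms > 500
    if low and fast:
        return "crawl withdrawal"
    if low and slow:
        return "server bottleneck"
    if high and fast:
        return "healthy"
    if high and slow:
        return "infrastructure crisis"
    return "inconclusive"
```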

5. Referring Page Analysis (the "Where is Google actually entering?" check)

GSC → URL Inspection on a specific "Crawled, currently not indexed" URL → scroll to "Referring page."

Manual audit: If the referring page is "none" or "sitemap only," the URL exists but Google has no navigable path to it. This is the single most actionable crawl metric in GSC — and barely anyone checks it.

Agency heuristic: Of 20 random "Crawled, not indexed" URLs, if >50% show "Referring page: none," the site has a systematic internal linking failure. Assign a dev sprint to fix homepage → article linkage.
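If a junior team member records the "Referring page" value for each sampled URL in a spreadsheet, the 50% rule can be applied mechanically. A sketch, assuming the values are copied by hand into a list; counting "sitemap only" as missing follows the manual audit note above and is an assumption you may want to relax.

```python
def missing_referrer_share(referring_pages):
    """Share of sampled URLs whose referring page is missing or sitemap-only."""
    if not referring_pages:
        raise ValueError("sample is empty")
    missing = sum(
        1 for ref in referring_pages
        if ref is None or ref.strip().lower() in {"", "none", "sitemap only"}
    )
    return missing / len(referring_pages)


def has_systematic_linking_failure(referring_pages, threshold=0.5):
    """Apply the >50% heuristic to a sample of inspected URLs."""
    return missing_referrer_share(referring_pages) > threshold
```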

6. IndexNow Submission Rate (the "Are we telling or waiting?" check)

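IndexNow is the fastest way to tell participating search engines (Bing, Yandex, and others) "these URLs are fresh." Google has not adopted the protocol, so for Google the sitemap and internal links still do the work. Most agency audits skip IndexNow anyway because they assume "Google will find it."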

Manual audit:

Does the site have an IndexNow key hosted at root?
Does the key file at root contain the same key that submissions send?
When was the last IndexNow submission? (Check server logs for api.indexnow.org POST requests)
Are submissions batched or per-publish?

Agency heuristic: Every editorial publish should trigger an IndexNow submission within 5 minutes. If the client's last submission was >7 days ago, the crawl queue is stale.
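For reference, a batched submission is a single JSON POST. The sketch below builds the request without sending it, using the field names from the IndexNow protocol (`host`, `key`, `keyLocation`, `urlList`). The function names are illustrative, and it assumes the key file sits at the site root as `<key>.txt`.

```python
import json
import urllib.request
from urllib.parse import urlparse


def build_indexnow_payload(urls, key):
    """Build the JSON body for a batched IndexNow submission.

    Assumes all URLs share one host and the key file lives at
    https://<host>/<key>.txt.
    """
    hosts = {urlparse(u).netloc for u in urls}
    if len(hosts) != 1:
        raise ValueError("a submission needs URLs from exactly one host")
    host = hosts.pop()
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }


def build_request(payload):
    """Wrap the payload in a POST request (not sent here)."""
    return urllib.request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Hooking `build_indexnow_payload` into the CMS publish event is what turns a weekly batch into a per-publish submission.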

Automated check: OA's audit identifies if IndexNow is configured, when it last fired, and whether the verification key is valid.

7. noindex/Canonical Inconsistency Scan (the "Did someone leave the gate open?" check)

Mass deindexation often starts with a single mistake: an engineer added noindex to a template, forgot to remove it, and now blog posts are tagged noindex too.

Manual audit:

Crawl 50 random URLs from the "Submitted and indexed" pool
Check each for conflicting directives: noindex combined with a canonical pointing at a different URL, a meta robots value that disagrees with the X-Robots-Tag header, etc.
Cross-reference: are any of these "submitted and indexed" pages actually returning noindex headers?

Agency heuristic: If >5% of indexed pages have conflicting crawl directives, the site needs a template-level audit. This is how 322K pages get deindexed — a single template tag cascading across thousands of pages.
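A sketch of the per-URL check, assuming you have already fetched each page's HTML and response headers with whatever HTTP client you use. `noindex_sources` is a hypothetical helper; it only reports where a noindex directive appears and does not evaluate canonical tags.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def noindex_sources(html, headers):
    """Return where a noindex directive appears: 'meta', 'header', or both."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    sources = []
    if any("noindex" in d for d in parser.directives):
        sources.append("meta")
    # Header names are case-insensitive, so compare in lowercase.
    header = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    if "noindex" in header.lower():
        sources.append("header")
    return sources
```

Any URL from the "Submitted and indexed" sample that returns a non-empty list is a candidate for the template-level audit.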

8. Server Log Crawl Analysis (the "What Googlebot actually does" check)

GSC is a summary. Server logs are the truth. They show which URLs Googlebot actually requested, how long it spent, and what HTTP status it received.

Manual audit:

Filter HTTP access logs for Googlebot / Mozilla/5.0 (compatible; Googlebot/2.1;
Extract unique URLs crawled in the last 7 days
Cross-reference against GSC "Crawled, not indexed" bucket — are they the same URLs?

Agency heuristic: If GSC reports 20 "Crawled, not indexed" URLs but server logs show Googlebot hit only 3 of them, the GSC data includes URLs Google learned about (via sitemap) but that the primary crawler never requested. The other 17 URLs Google "crawled" were read by the rendering service, not the primary crawler.

This difference matters: issues here suggest JavaScript rendering problems or client-side-injected content the primary crawler can't reach.
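The log cross-reference can be sketched as follows, assuming access logs in the common "combined" format used by Nginx and Apache. The regex and function names are illustrative. Matching on the user-agent string alone is an assumption of convenience: the string can be spoofed, so a production check would also verify the requester with a reverse DNS lookup.

```python
import re

# Matches the request line, status, size, referrer, and user agent
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def googlebot_paths(log_lines):
    """Return the set of paths requested by a user agent containing 'Googlebot'."""
    paths = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("ua"):
            paths.add(match.group("path"))
    return paths


def never_requested(gsc_paths, log_lines):
    """GSC 'Crawled, not indexed' paths that never appear in the Googlebot logs."""
    return sorted(set(gsc_paths) - googlebot_paths(log_lines))
```

Feed `never_requested` the exported GSC bucket (as paths) and the last 7 days of log lines; whatever comes back is the gap between what GSC reports and what the server actually saw.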

9. Backlink-to-Content Ratio (the "Are we being trusted or tested?" check)

After mass deindexation, Google tests the site by following backlinks. If an external link points to a page that doesn't exist or has thin content, the trust signal is negative.

Manual audit:

In GSC → Links → Top linked pages: which URLs have the most external backlinks?
For the top 10: are they still live? Do they have meaningful content? Are they internal-linked from the homepage?
If a backlink points to a deindexed page, the referral authority is wasted.

Agency heuristic: For every backlink pointing to a core page (blog post, guide, comparison), there should be a visible internal link from the homepage within 14 days. The backlink brings Googlebot; the internal link keeps it crawling.
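A small sketch of the cross-reference, assuming you have exported the top linked pages (URL and external link count) and a list of indexed URLs from GSC. The function name and input shapes are hypothetical.

```python
def wasted_backlink_targets(linked_pages, indexed_urls, top_n=10):
    """Top-linked URLs (by external backlink count) that are not indexed.

    `linked_pages` is a dict of {url: external_backlink_count}.
    """
    indexed = set(indexed_urls)
    top = sorted(linked_pages.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return [url for url, _count in top if url not in indexed]
```

Each URL returned is a page where external links are bringing Googlebot to something that is no longer in the index, so it is a natural first candidate for restoration or a redirect.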

10. Serial Recrawl Gap (the "When did Google last check?" check)

This is the hidden signal most audits miss. GSC tracks when Google last crawled each page. If the average time since last crawl for editorial pages is >30 days, the site has a serial recrawl drought.

Manual audit:

Pull 10 editorial URLs from GSC
Note the "Last crawl" date for each
Calculate average gap in days

Agency heuristic:

Recrawl Gap	Severity	Action
<7 days	🟢 Healthy	Continue current cadence
7-14 days	🟡 Warning	Review crawl budget allocation
14-30 days	🟠 Concern	Audit internal links + sitemap priority
30+ days	🔴 Critical	Full crawl topology rebuild
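The gap calculation and the severity table can be sketched like this, assuming the "Last crawl" dates have been copied out of GSC by hand. The function names are illustrative. Because the table's ranges share their endpoints (14 and 30 each appear in two rows), the code assigns a boundary value to the more severe tier, which is a choice made here and not part of the table.

```python
from datetime import date


def average_recrawl_gap(last_crawl_dates, today):
    """Average number of days since each page was last crawled."""
    if not last_crawl_dates:
        raise ValueError("no crawl dates supplied")
    gaps = [(today - d).days for d in last_crawl_dates]
    return sum(gaps) / len(gaps)


def recrawl_severity(avg_gap_days):
    """Map an average gap onto the severity table; boundaries go to the worse tier."""
    if avg_gap_days < 7:
        return "healthy"
    if avg_gap_days < 14:
        return "warning"
    if avg_gap_days < 30:
        return "concern"
    return "critical"
```

For example, two pages last crawled 10 and 20 days ago average a 15-day gap, which falls in the "concern" tier.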

The 90-Minute Audit Flow (Putting It Together)

Here's the runbook for a Monday morning:

09:00 — Pull GSC index coverage, export "Crawled, not indexed" URLs (Check 1)
09:10 — Check sitemap structure, split if needed (Check 2)
09:20 — Map internal link depth for 5 sampled URLs (Check 3)
09:35 — Run OA white-label audit (automates Checks 4, 5, 6, 7, 10)
09:50 — Pull server log Googlebot hits (Check 8)
10:05 — Cross-reference backlink profile against indexed pages (Check 9)
10:20 — Write up findings, generate client report from OA export
10:30 — Done. Client-reportable output generated.

That's 90 minutes for a $2,500–$5,000 monthly retainer client. The tool pays for itself in the first billable hour you don't spend clicking through GSC tabs.

What This Means for Your Agency

If you're reading this and thinking "I have 15 clients — I can't do this for each one," you're right. You can't. The agencies winning the crawl budget conversation are the ones who automate 70% of this audit.

That's why the white-label audit tool exists.

You run one scan. It checks all 10 data points. It formats the output with your logo and branding. The client sees your report. And they book the strategy call because you caught the topology problem their last agency missed.

We charge $97/month for the white-label tier. Unlimited audits. Unlimited clients. Your branding. More than 70 agency owners are already using it.

The 10-point checklist above? That's the manual version. The automated version runs in 90 seconds and generates a client-ready report.

Start your 14-day agency trial →
No credit card required. First audit takes 90 seconds.

This article is part of the Crawl Budget Recovery series. Read Part 1: 322K Pages Deindexed: A Crawl Budget Recovery Guide for Solo Developers

Need a custom crawl budget audit for a client site? Run a free scan →

Top comments (1)

Performance Dev • Jun 2

Great guide — and the topology-first framing is exactly what's missing from most crawl budget content. The 10-point checklist format makes it actionable for agency teams.

One thing I'd be curious to hear from others: for those running the "referring page analysis" (check 5), what percentage of "Crawled, not indexed" URLs typically show "referring page: none" on a site that hasn't had a mass deindexation event? I've got benchmarks from the audit tool's dataset, but I'd love real-world numbers from people doing this manually. Does 30% feel high, or is that baseline normal for a mid-size site?

Also — the 90-minute runbook is genuinely useful. Has anyone tried running this sequentially vs parallel (splitting checks across team members)? Curious if parallel speeds things up or introduces blind spots.