DEV Community

Joseph Anady

Posted on • Originally published at thatdevpro.com

Technical SEO: crawling, indexing, JS rendering

Originally published at thatdevpro.com. Part of ThatDevPro's open SEO + AI framework library. ThatDevPro is an SDVOSB-certified veteran-owned web + AI engineering studio. Open-source AI citation toolkit: github.com/Janady13/aio-surfaces.


Crawlability, Indexing, Canonicalization, Redirects, URL Structure, and the Bot-Facing Foundation

A comprehensive installation and audit reference for technical SEO — the bedrock layer that determines whether search engines and AI crawlers can discover, render, and index a site at all. Every other framework in this library assumes the technical foundation works. This document specifies what "works" means and how to verify it. Dual-purpose: installation manual and audit document.

Cross-stack implementation note: the code samples in this framework are written in plain HTML for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of every pattern below, see framework-cross-stack-implementation.md. For pure client-rendered SPAs (no SSR/SSG) see framework-react.md. For Tailwind-specific concerns (purge, dynamic classes, dark-mode CLS, focus accessibility) see framework-tailwind.md.


1. Document Purpose

This is the canonical reference for technical SEO. Content quality, authority signals, schema markup, and AI optimization are all wasted if a crawler cannot reach a page, cannot render it, or cannot decide which version to index. Technical SEO is the prerequisite. It is not glamorous. It is not optional.

Technical SEO in 2026 differs from 2020 in three ways. First, JavaScript rendering is no longer the bleeding-edge concern: Google renders JS reliably, and other major crawlers (Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot) have caught up to varying degrees. Second, the bot landscape has exploded: a 2026 site receives traffic from a dozen named AI crawlers in addition to the four major search engines, and robots.txt policy is now an editorial decision, not a technical default. Third, indexing has become more selective: Google indexes a smaller percentage of crawled URLs than it did a decade ago, so wasting crawl budget on duplicate, parameterized, or thin URLs has a direct cost.

1.1 Required Tools

  • Google Search Console (search.google.com/search-console): coverage, sitemaps, URL inspection
  • Bing Webmaster Tools (www.bing.com/webmasters): Bing-specific coverage and IndexNow submission
  • Screaming Frog SEO Spider: desktop crawler, free up to 500 URLs, paid for unlimited
  • Sitebulb: desktop crawler, alternative to Screaming Frog with stronger reporting
  • Ahrefs Site Audit / Semrush Site Audit: cloud-based crawlers with historical tracking
  • Google Rich Results Test (search.google.com/test/rich-results): render + schema validation
  • Google Lighthouse: Chrome DevTools performance/SEO audit
  • GTmetrix / WebPageTest: performance and waterfall analysis
  • curl / httpie: manual header inspection
  • Cloudflare / nginx access logs: server-level crawl observation
  • IndexNow (www.indexnow.org): push-based indexing for Bing, Yandex, Naver

1.2 Document Scope

Covers: crawl access, robots.txt, XML sitemaps, canonicalization, redirects, URL structure, status codes, JS rendering, mobile-first indexing, HTTPS posture, hreflang, and crawler observability. Touches but does not exhaust: page experience (own framework: framework-pageexperience.md), schema (framework-schema.md), internal linking (framework-internallinking.md), security (framework-security.md).


2. Client Variables Intake

domain_apex: ""
www_or_non_www_canonical: ""           # decide which is canonical
http_or_https: "https"                 # always https in 2026
trailing_slash_policy: ""              # with-slash | without-slash
url_case_policy: "lowercase"
cms_or_framework: ""                   # WordPress | Next.js | Astro | Hugo | Shopify | Webflow | static
hosting_environment: ""
cdn: ""                                # Cloudflare | Fastly | none
search_console_verified: false
bing_webmaster_verified: false
indexnow_key_deployed: false
known_indexing_issues: []
recent_migrations: []
international_targets: []              # if any hreflang need

3. Crawl Access Layer

3.1 robots.txt

The robots.txt file at the domain root tells crawlers which paths they may request. It is advisory, not a security mechanism — anything genuinely sensitive belongs behind authentication, not behind a Disallow rule.

Minimum viable robots.txt for a production site:

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /*?*sessionid=
Disallow: /*?*utm_*

Sitemap: https://example.com/sitemap.xml

Required validations:

  • The file is served at /robots.txt with Content-Type: text/plain and HTTP 200.
  • It does NOT block CSS, JS, or image directories. Googlebot needs those to render.
  • It does NOT block fonts (/fonts/, /assets/fonts/) or web manifest assets.
  • The Sitemap directive uses an absolute URL.
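  • Wildcards (*) are used sparingly and tested before deploy. GSC's legacy robots.txt Tester was retired in 2023; use the robots.txt report in GSC together with a parser that follows Google's matching rules.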
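
The validations above can be partly automated. The following is a minimal sketch, assuming Python's standard-library `urllib.robotparser`; the `blocked_urls` helper, the sample rules, and the asset URLs are hypothetical. One caveat: the standard-library parser matches plain path prefixes in file order and does not implement Google's `*` wildcard or longest-match precedence, so it only covers simple rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: prefix-only, because the stdlib parser ignores wildcards.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /staging/
"""

# Hypothetical assets a crawler needs in order to render the page.
RENDER_ASSETS = [
    "https://example.com/assets/app.css",
    "https://example.com/assets/app.js",
    "https://example.com/fonts/inter.woff2",
]


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that this robots.txt forbids for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

An empty result from `blocked_urls(ROBOTS_TXT, RENDER_ASSETS)` means none of the listed rendering assets is disallowed. Rules that rely on wildcards still need to be checked with a parser that follows Google's matching behavior.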

AI crawler policy (2026 baseline):

In 2026, the question is no longer "do we block AI crawlers" but "which AI crawlers do we want citing us, and which do we want to block." A typical client-facing posture:

# AI search crawlers — usually allow (citation traffic)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

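# Google-Extended is a control token, not a separate crawler; it governs use of content for Google's AI models
User-agent: Google-Extended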
Allow: /

# Aggressive scrapers — block by default unless client requests otherwise
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12Bot
Disallow: /

Cross-reference: framework-aicitations.md for the full AI-crawler matrix.

3.2 X-Robots-Tag

For non-HTML resources (PDFs, images, JSON files) that should not be indexed, X-Robots-Tag is the only signal. Set via server config:

# nginx
location ~* \.(pdf|json)$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

For HTML pages, <meta name="robots"> is preferred because it is more visible to humans editing pages.

3.3 Crawl Budget

Crawl budget is the number of URLs Googlebot crawls in a given period. For sites under ~50,000 URLs this rarely matters. Above that, crawl-budget waste shows as:

  • Many URLs in GSC's "Crawled — currently not indexed"
  • Many URLs in "Discovered — currently not indexed"
  • Slow propagation of new content to the index
  • Frequent crawls of low-value parameterized URLs

Crawl budget conservation:

  • Block parameterized/sessioned URLs in robots.txt
  • Return 410 for URLs that are permanently gone (typically dropped from the crawl queue somewhat faster than a 404)
  • Reduce internal links to low-value pages
  • Use noindex on pages that should not be indexed (404 templates, search result pages, tag archives with thin content)
  • Use a clean XML sitemap so the crawler has a prioritized list

4. Indexing Layer

4.1 The Two-Step Process

Indexing is two steps, not one:

  1. Discovery and crawl — the bot finds the URL (sitemap, link, IndexNow ping) and requests it.
  2. Indexing — the bot parses the response, decides whether to add it to the index, and what to associate with it.

A page can be crawled but not indexed. A page can be in the index but not ranked. These are distinct states with distinct fixes.

4.2 Index Status in Google Search Console

Use the Pages report (formerly Coverage). The buckets:

| Status | Meaning | Action |
| --- | --- | --- |
| Indexed | In Google's index | Monitor for unexpected changes |
| Indexed, not submitted in sitemap | Found via links, not in sitemap | Add to sitemap if it should be indexed |
| Crawled - currently not indexed | Crawled but Google chose not to index | Improve content quality, add internal links, check duplication |
| Discovered - currently not indexed | Found but not yet crawled | Often crawl-budget signal; check site authority + reduce low-value URLs |
| Excluded by 'noindex' tag | Intentionally excluded | Verify intent |
| Page with redirect | Redirected; not indexed itself | Verify the redirect target is indexed |
| Duplicate, Google chose different canonical | Google ignored your canonical | Check internal linking + canonical signals consistency |
| Soft 404 | Returns 200 but content looks like a 404 page | Either return real 404 or fix the page |
| Server error (5xx) | Crawl failed | Fix the server error |

4.3 IndexNow

IndexNow is a push-based indexing protocol supported by Bing, Yandex, Seznam, and Naver (not Google). When a URL changes, you POST it to IndexNow, and participating engines can crawl it within minutes instead of days.

Implementation:

  1. Generate an API key (random 32-character string)
  2. Place the key as /{key}.txt at the domain root with the key as its content
  3. POST URL changes to https://api.indexnow.org/indexnow:
POST /indexnow HTTP/1.1
Host: api.indexnow.org
Content-Type: application/json

{
  "host": "example.com",
  "key": "abc123...",
  "keyLocation": "https://example.com/abc123.txt",
  "urlList": [
    "https://example.com/new-page/",
    "https://example.com/updated-page/"
  ]
}

For WordPress, plugin support exists. For Next.js / static sites, integrate as a build-time hook.
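
For a build-time hook, the request above can be assembled in a few lines. This is a sketch only, assuming Python's standard library; `build_indexnow_request` is a hypothetical helper name, and the key location follows the /{key}.txt convention from step 2. The function builds the request without sending it, so it can be inspected or logged first.

```python
import json
import urllib.request
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls, endpoint=INDEXNOW_ENDPOINT):
    """Build (but do not send) an IndexNow POST for URLs on a single host."""
    foreign = [u for u in urls if urlsplit(u).hostname != host]
    if foreign:
        # Every submitted URL is expected to belong to the declared host.
        raise ValueError(f"URLs outside {host}: {foreign}")
    payload = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Sending is then a single call, for example `urllib.request.urlopen(request, timeout=10)`, wrapped in whatever retry and logging policy the build pipeline already uses.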

4.4 Mobile-First Indexing

Since 2023, Google indexes the mobile version of every site. The desktop version is largely ignored for ranking. Verify:

  • All content visible on mobile (not hidden behind "click to expand" with mobile-only display:none)
  • Structured data identical on mobile and desktop
  • Meta tags identical on mobile and desktop
  • Internal linking identical (no mobile-only menu omitting key links)
  • Images load on mobile (no desktop-only assets)

5. Canonicalization

5.1 Why Canonicalization Matters

Modern sites generate many URLs that resolve to the same content:

  • https://example.com/page and https://example.com/page/
  • https://example.com/page and https://www.example.com/page
  • https://example.com/page?utm_source=email
  • https://example.com/PAGE (some servers serve the same content for any case)
  • https://example.com/page?session=abc123

Without explicit canonical signals, Google picks one and may not pick the one you want. Ranking signals split across variants. Indexing decisions become inconsistent.
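
To make the duplication concrete, the variants listed above can all be collapsed to one URL by a normalization function. The sketch below assumes a policy of HTTPS, non-www, lowercase paths, and trailing slashes; the `canonicalize` helper and the tracking-parameter lists are hypothetical, and the function assumes every input URL belongs to the same site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical policy: parameters that never change page content.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"sessionid", "session", "fbclid"}


def canonicalize(url, host="example.com", trailing_slash=True):
    """Map a URL variant to the canonical form under the assumed policy."""
    parts = urlsplit(url)
    path = parts.path.lower() or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if trailing_slash and not path.endswith("/") and "." not in last_segment:
        path += "/"  # leave file-like paths such as /report.pdf alone
    elif not trailing_slash and path != "/":
        path = path.rstrip("/")
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    # Scheme and host are forced to the canonical choice; fragments are dropped.
    return urlunsplit(("https", host, path, urlencode(kept), ""))
```

Running every crawled URL through a function like this and grouping by the result shows how many variants compete for each canonical page.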

5.2 Canonical Signal Stack

Canonical signals reinforce each other. Use all of them:

  1. <link rel="canonical" href="..."> — the explicit declaration. Self-referential on the canonical URL itself.
  2. 301 redirects — for true duplicates, redirect non-canonical to canonical (preferred over rel=canonical when content is genuinely identical).
  3. Internal linking — every internal link points to the canonical URL, never to a redirected variant.
  4. XML sitemap — only canonical URLs appear in the sitemap.
  5. hreflang annotations — when present, must reference canonical URLs only.
  6. HTTP Link header — equivalent to rel=canonical, used for non-HTML resources.

If these signals disagree, Google picks one and ignores the rest. Consistency is the rule.

5.3 Common Canonicalization Mistakes

  • Canonical pointing at a URL that redirects. Always canonical to the final URL that returns 200, never to one that 301s.
  • Cross-domain canonical without authority. You can canonical mirror.example.com to example.com, but canonicals across unrelated domains are usually ignored.
  • Self-canonical with parameters present. A URL like /page?utm_source=x should canonical to /page (no parameters), not self-canonical.
  • Conflicting canonicals between hreflang clusters. Each hreflang cluster member must canonical to itself, not to an English default.
  • HTTP version canonicalizing to HTTPS but not redirecting. Use a 301 plus self-canonical on HTTPS, not just rel=canonical.

5.4 Trailing Slash and Case

Pick one and enforce it sitewide via 301 redirect. Do not rely on canonicals alone — redirects collapse the duplication, canonicals only signal it.

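# nginx: enforce lowercase + trailing slash via 301 (sketch)
# Core nginx cannot lowercase strings; this assumes ngx_http_perl_module (Lua or njs also work).
perl_set $uri_lower 'sub { my $r = shift; return lc($r->uri); }';  # http {} context
# server {} context from here on
location ~ [A-Z] { return 301 $scheme://$host$uri_lower$is_args$args; }
# Extensionless path without a trailing slash: redirect to the slashed form
location ~ ^([^.]*[^./])$ { return 301 $scheme://$host$1/$is_args$args; }
location / {
  try_files $uri $uri/ =404;
}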

For Next.js, set trailingSlash: true (or false) in next.config.js and stick with it. Mixing breaks canonicalization.


6. Redirects

6.1 Status Codes

| Code | Use case |
| --- | --- |
| 301 Moved Permanently | The URL has permanently moved. Passes ranking signals. Default for migrations. |
| 302 Found | Temporary redirect. Use only for actual temporary moves (A/B tests, seasonal pages). Misused 302s leak link equity. |
| 307 Temporary Redirect | Like 302 but preserves request method. Rare in SEO context. |
| 308 Permanent Redirect | Like 301 but preserves request method. Functionally equivalent for SEO. |
| 410 Gone | Page is permanently removed and not coming back. Faster removal from index than 404. |
| 451 Unavailable for Legal Reasons | Use when content removed for legal reasons (DMCA, jurisdictional). |

6.2 Redirect Chains

A chain is A → B → C. Eliminate them. Every redirect should be a single hop to the final URL. Chains:

  • Waste crawl budget
  • Add latency for users
  • Risk dropping signals at each hop
  • Break when one link in the chain dies

Maintain a redirect map spreadsheet for any migration. After deploying redirects, crawl the site and verify zero chains.
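
Chain detection on a redirect map is easy to script before the map ever reaches server config. A minimal sketch, assuming the map is exported from the spreadsheet as a plain dictionary of source path to destination path; `find_chains` and `flatten_redirects` are hypothetical helper names.

```python
def find_chains(redirects: dict[str, str]) -> list[str]:
    """Return sources whose destination is itself redirected (a chain)."""
    return sorted(src for src, dst in redirects.items() if dst in redirects)


def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Rewrite the map so every source points straight at its final URL."""
    flat = {}
    for source, target in redirects.items():
        seen = {source}
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat
```

For a map such as `{"/a/": "/b/", "/b/": "/c/"}`, `find_chains` reports `/a/`, and `flatten_redirects` rewrites both entries to point at `/c/`. A post-deploy crawl is still needed to confirm the server behaves as the map says.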

6.3 Redirect Implementation Layers

In order of preference:

  1. Server config (nginx, Apache, Cloudflare Rules) — fastest, most reliable, executed before page load.
  2. CMS-level redirect plugin — fine for low-volume changes, performance penalty at scale.
  3. JavaScript redirects — last resort. Slow, fragile, sometimes ignored by crawlers.
  4. Meta refresh redirects — never use. Treated as low-quality signal.

6.4 The Migration Redirect Pattern

When migrating URL structure:

  1. Generate a 1:1 map of old URL → new URL for every indexed page.
  2. Implement 301s in server config before deploy.
  3. Update internal links to point to new URLs (do not rely on the redirect).
  4. Update XML sitemap to list only new URLs.
  5. Submit new sitemap in GSC.
  6. Monitor GSC's Coverage report for 30-90 days.
  7. Keep redirects in place permanently — old URLs have inbound links from sites you don't control.

Cross-reference: framework-migration.md for full migration methodology.


7. URL Structure

7.1 The Eight URL Rules

  1. Lowercase. Always.
  2. Hyphens between words. Not underscores. Not camelCase.
  3. Under 60 characters when possible. Long URLs index fine but truncate in SERPs.
  4. No stop words unless meaningful. /the-best-web-hosting/ reads better as /best-web-hosting/.
  5. Descriptive, not numeric. /blog/post-1234/ is opaque; /blog/local-seo-checklist/ is meaningful.
  6. No file extensions where avoidable. /about/ over /about.html or /about.php.
  7. One canonical separator policy. Don't mix /category/post/ and /category-post/.
  8. Stable. Once published, do not change a URL without redirecting.

7.2 URL Hierarchy and Crawl Depth

A URL's path depth (/a/b/c/d/page/) is independent of crawl depth (clicks from homepage). Crawl depth matters for SEO; path depth is only loosely related.

Target: every important page reachable in 3 clicks or fewer from the homepage. Verify with Sitebulb's Crawl Depth report.

Cross-reference: framework-internallinking.md for hub-and-spoke architecture.

7.3 Parameter Handling

URL parameters create duplication. Strategies:

  • Block in robots.txt — for tracking parameters that should never be indexed (?utm_*, ?sessionid=, ?fbclid=).
  • Canonical to parameterless version — for sort/filter parameters where the parameterless version is canonical.
  • Self-canonical with noindex — for parameter combinations that are unique pages but should not be indexed.
  • GSC URL Parameter tool — deprecated as of 2022. Use canonical signals instead.

8. XML Sitemaps

8.1 What Belongs in a Sitemap

Only canonical, indexable, 200-status URLs that you want indexed. Everything else stays out:

  • Excluded: noindex pages, redirected URLs, 4xx URLs, duplicate URLs, parameterized variants
  • Excluded: pagination URLs (page/2/, page/3/) unless you have a strategic reason
  • Excluded: search result pages, login pages, account pages
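
The inclusion rule above can be sketched as a filter in front of a sitemap generator. This is an illustration under stated assumptions, not a definitive implementation: the page records (`url`, `canonical`, `status`, `noindex`, `lastmod`) are a hypothetical shape that a crawler export or CMS query might produce, and the helpers use Python's standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def is_eligible(page: dict) -> bool:
    """Canonical, indexable, 200-status pages only."""
    return (
        page["status"] == 200
        and not page["noindex"]
        and page["canonical"] == page["url"]
    )


def build_sitemap(pages: list[dict]) -> str:
    """Serialize the eligible pages as a sitemap <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in filter(is_eligible, pages):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        # lastmod should be the real modification date, never "today".
        ET.SubElement(url, "lastmod").text = page["lastmod"]
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")
```

Keeping the filter separate from the serializer makes it straightforward to log which URLs were excluded and why.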

8.2 Sitemap Structure

For sites under 50,000 URLs, a single sitemap is fine. Above that, use a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-images.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>

Per-sitemap entry:

<url>
  <loc>https://example.com/page/</loc>
  <lastmod>2026-05-05</lastmod>
</url>

<changefreq> and <priority> are ignored by Google. <lastmod> is honored when accurate; if you fake it (always = today), Google starts ignoring it.

8.3 Specialized Sitemaps

  • Image sitemap — for image-heavy sites; accelerates Image Pack inclusion.
  • Video sitemap: recommended for self-hosted video content; complements VideoObject schema or substitutes for it.
  • News sitemap — for sites approved for Google News; only includes articles from the last 48 hours.
  • hreflang sitemap — programmatic alternative to inline hreflang tags.

8.4 Submission

  • Submit sitemap URL in GSC and Bing Webmaster Tools.
  • Reference sitemap URL in robots.txt (Sitemap: https://...).
  • For large sites, keep <lastmod> accurate and use IndexNow when content changes. Google retired its sitemap ping endpoint in 2023, so pinging no longer works there.

9. Status Codes — The Operator's Reference

Beyond redirects, status codes communicate site health to crawlers. The full inventory worth knowing:

9.1 2xx Success

  • 200 OK — page delivered. Default for healthy URLs.
  • 201 Created — resource created (POST endpoints).
  • 204 No Content — success, no body returned.
  • 206 Partial Content — range request (video, large download resume).

9.2 3xx Redirection

  • 301 / 308 — permanent. Default for migrations.
  • 302 / 307 — temporary. Use sparingly.
  • 304 Not Modified — conditional GET succeeded; client uses its cached copy. Healthy at scale.

9.3 4xx Client Errors

  • 400 Bad Request — malformed request. Usually bot or attack traffic.
  • 401 Unauthorized — auth required.
  • 403 Forbidden — access denied. If on a public page, check nginx/htaccess rules.
  • 404 Not Found — URL does not exist. Track in GSC and either redirect or 410.
  • 405 Method Not Allowed — right URL, wrong verb. Common form-handler misconfiguration.
  • 410 Gone — permanently removed. Faster removal from index than 404.
  • 429 Too Many Requests — rate limited. Usually bot abuse.
  • 451 Unavailable for Legal Reasons — content removed for legal cause.

9.4 5xx Server Errors

  • 500 Internal Server Error — application crash.
  • 502 Bad Gateway — nginx couldn't reach upstream.
  • 503 Service Unavailable — maintenance or overload. Use Retry-After header.
  • 504 Gateway Timeout — upstream too slow.

5xx codes are urgent. Sustained 5xx during a Googlebot crawl drops pages from the index.


10. JavaScript Rendering

10.1 The Two-Wave Indexing Model (Mostly Obsolete)

For years, Google's two-wave model meant JS sites were indexed late: the first wave indexed HTML, the second wave indexed rendered content days later. As of 2025, Google renders the vast majority of pages within hours of the first crawl. The two-wave problem is no longer a structural blocker for most sites.

It is still real for AI crawlers. GPTBot, ClaudeBot, and PerplexityBot do not all render JS reliably. For AI search visibility, server-side rendered content matters more than for traditional Google SEO.

10.2 Rendering Strategy by Content Type

| Content type | Recommended rendering |
| --- | --- |
| Marketing pages, landing pages | SSG or SSR (no client-side rendering for primary content) |
| Blog posts, articles | SSG |
| Ecommerce product pages | SSR with hydration |
| Logged-in user pages | CSR is fine (these aren't indexed anyway) |
| Real-time data displays | SSR shell + client hydration |

10.3 Validation

Use the URL Inspection tool's "Test Live URL" (or the Rich Results Test) to see what Googlebot actually renders; Google retired the standalone Mobile-Friendly Test in December 2023. If primary content is missing, the page is not effectively indexed even if it returns 200.

For AI crawler visibility, run curl -A "GPTBot" https://example.com/page and inspect the HTML response body. If the primary content is absent from that raw HTML and only arrives via fetch/XHR after load, crawlers that do not execute JavaScript miss it.
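
The raw-HTML check can be scripted once the response body has been saved. A minimal sketch, assuming Python's standard-library `html.parser`; the `VisibleText` class and `phrase_in_raw_html` helper are hypothetical names. It looks for a key phrase in the text of the unrendered HTML while ignoring script and style bodies, which approximates what a crawler that does not execute JavaScript can read.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase is present in the unrendered HTML's text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()
```

A server-rendered page returns True for its headline; a client-rendered shell whose headline only appears inside a script returns False, which is the signal to move that content into the initial HTML.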

Cross-reference: framework-headless.md and framework-nextjs.md for framework-specific rendering patterns.


11. HTTPS

In 2026, HTTPS is non-negotiable. HTTP-only sites suffer ranking penalties, browser warnings, and lost trust signals.

11.1 Required Configuration

  • TLS 1.2 minimum, TLS 1.3 preferred.
  • Valid certificate from a recognized CA (Let's Encrypt, Cloudflare, paid).
  • HSTS header: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
  • HSTS preload list submission for high-value domains: hstspreload.org
  • HTTP-to-HTTPS 301 redirect, single hop.
  • Canonical and internal links use HTTPS exclusively.
  • Mixed-content audit: no HTTP resources loaded on HTTPS pages.
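
The HSTS requirement is simple to verify from a captured response header. A sketch, assuming the header value has already been fetched (for example with curl -I); `parse_hsts` and `hsts_meets_policy` are hypothetical helper names, and the one-year minimum mirrors the max-age shown above.

```python
ONE_YEAR_SECONDS = 31536000


def parse_hsts(header: str) -> dict[str, str]:
    """Split a Strict-Transport-Security value into lowercase directives."""
    directives = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def hsts_meets_policy(header: str, min_age: int = ONE_YEAR_SECONDS) -> bool:
    """True if max-age is present, numeric, and at least min_age seconds."""
    directives = parse_hsts(header)
    try:
        max_age = int(directives.get("max-age", ""))
    except ValueError:
        return False
    return max_age >= min_age
```

`parse_hsts` also exposes the `includesubdomains` and `preload` flags, so the same helper can feed a stricter check for domains headed to the preload list.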

11.2 Certificate Maintenance

  • Auto-renewal via Let's Encrypt or hosting provider.
  • Monitor expiration via UptimeRobot or similar.
  • Verify certificate transparency log entries via crt.sh.

Cross-reference: framework-security.md for broader security posture.


12. International / hreflang

For sites targeting multiple languages or regions, hreflang annotations tell Google which version to show which user.

12.1 hreflang Implementation

Three valid placement methods, listed in order of preference:

  1. HTTP Link header — non-HTML resources, full programmatic control.
  2. XML sitemap hreflang — preferred for large sites; centralizes all annotations in one place.
  3. <link rel="alternate"> tags in <head> — most common; works but harder to maintain at scale.
<link rel="alternate" hreflang="en-US" href="https://example.com/en-us/page/">
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/page/">
<link rel="alternate" hreflang="x-default" href="https://example.com/page/">

12.2 hreflang Rules

  • Every page in a cluster must reference every other page in the cluster (return-link requirement).
  • Every page must self-reference.
  • Use x-default for the fallback when no language matches.
  • Use ISO 639-1 language + ISO 3166-1 region (en-US, not en_US).
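
The self-reference and return-link rules lend themselves to an automated check. The sketch below assumes a crawler has already extracted each page's hreflang annotations into a dictionary of page URL to {hreflang: href}; that data shape and the `hreflang_errors` helper are hypothetical.

```python
def hreflang_errors(cluster: dict[str, dict[str, str]]) -> list[str]:
    """Report missing self-references and missing return links."""
    errors = []
    for page, alternates in cluster.items():
        if page not in alternates.values():
            errors.append(f"{page}: missing self-reference")
        for lang, href in alternates.items():
            if href == page:
                continue
            if href not in cluster:
                # The target was not crawled, so the return link is unverified.
                errors.append(f"{page}: {lang} target {href} not in crawl data")
            elif page not in cluster[href].values():
                errors.append(f"{page}: no return link from {href}")
    return errors
```

An empty list means every page in the crawled cluster references itself and is referenced back by each alternate it points to.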

Cross-reference: framework-international.md for full hreflang depth.


13. Crawler Observability

Crawler observability means knowing what crawlers actually do on a site, not what you assume they do.

13.1 Server Log Analysis

The single best technical SEO data source. Server logs (nginx access logs, Apache access logs, Cloudflare logs) record every bot request with status, response time, and user agent.

Tools:

  • Screaming Frog Log File Analyser — desktop, parses nginx/Apache logs
  • GoAccess — terminal log viewer with web reports
  • Custom log pipeline — for clients with infrastructure depth, ship logs to BigQuery / DuckDB and query directly

What to look for:

  • Crawl frequency per URL (which pages does Googlebot revisit most often?)
  • Crawl-budget waste (low-value URLs with high crawl rate)
  • 4xx/5xx clusters (URLs that crawlers request repeatedly and that keep returning errors)
  • Slow URLs (high response time correlates with reduced crawl rate)
  • Bot fingerprint verification (real Googlebot? or a UA-spoofing scraper?)
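
A first pass over these questions needs nothing more than a log parser and two counters. The sketch below assumes nginx's default "combined" log format; the regular expression and the `crawl_summary` helper are hypothetical, and the user agent is taken at face value here, so spoofed bots are counted until they are filtered out by IP verification.

```python
import re
from collections import Counter

# Assumes nginx's "combined" format: ip - user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawl_summary(lines, bot="Googlebot"):
    """Count hits per path and error responses per (path, status) for one bot."""
    hits, errors = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or bot.lower() not in match["agent"].lower():
            continue
        hits[match["path"]] += 1
        if match["status"][0] in "45":
            errors[(match["path"], match["status"])] += 1
    return hits, errors
```

`hits.most_common(20)` then shows where crawl activity concentrates, and `errors` surfaces the URLs that keep failing for that bot.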

13.2 Bot Verification

Anyone can claim to be Googlebot. Verify:

  • Reverse DNS lookup of the bot's IP must resolve to googlebot.com or google.com
  • Forward DNS of that hostname must resolve back to the same IP
  • Google publishes its IP ranges at developers.google.com/search/apis/ipranges/googlebot.json

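For Bing: reverse DNS resolves to search.msn.com. For Apple: applebot.apple.com. AI crawlers vary: OpenAI and Perplexity publish IP ranges for GPTBot and PerplexityBot; confirm the current verification method for ClaudeBot and any other AI crawler in the vendor's documentation.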
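
The reverse-then-forward DNS check for Googlebot can be sketched with Python's standard-library `socket` module. `is_verified_bot` is a hypothetical helper; the resolver functions are injectable parameters (an assumption made for testability), and the forward lookup shown covers IPv4 only.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_bot(
    ip,
    suffixes=GOOGLE_SUFFIXES,
    reverse=socket.gethostbyaddr,
    forward=socket.gethostbyname_ex,
):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # Forward lookup must include the original IP (IPv4 only here).
        return ip in forward(hostname)[2]
    except OSError:
        # Covers socket.herror and socket.gaierror: unresolvable means unverified.
        return False
```

Because DNS lookups are slow, results are usually cached per IP, or replaced by a match against the published IP range files where a vendor provides them.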

13.3 GSC URL Inspection

For specific URLs, GSC's URL Inspection tool shows:

  • Last crawl date
  • Last response code
  • Indexed status
  • Mobile usability
  • Rendered HTML (live test)
  • Discovered referring URLs

Use this to debug specific indexing problems.


14. Audit Mode

| # | Criterion | Pass/Fail |
| --- | --- | --- |
| TS1 | robots.txt returns 200 plain text, allows critical resources | |
| TS2 | XML sitemap returns 200, validates, lists only canonical indexable URLs | |
| TS3 | Sitemap submitted to Google Search Console and Bing Webmaster | |
| TS4 | All canonical URLs return 200 | |
| TS5 | Self-referential rel=canonical on every indexable page | |
| TS6 | HTTP-to-HTTPS 301 redirect, single hop | |
| TS7 | www / non-www unified via 301, single hop | |
| TS8 | Trailing slash policy enforced sitewide | |
| TS9 | URLs lowercase, no mixed-case duplicates | |
| TS10 | Zero redirect chains (every redirect single-hop) | |
| TS11 | No 4xx URLs in sitemap or internal links | |
| TS12 | No 5xx URLs detected in last 30 days | |
| TS13 | HSTS header present with min 1-year max-age | |
| TS14 | TLS 1.2+ enforced, valid certificate, no mixed content | |
| TS15 | Mobile rendering verified (URL Inspection live test or Lighthouse; the Mobile-Friendly Test was retired in 2023) | |
| TS16 | JS-rendered content visible to Googlebot via URL Inspection | |
| TS17 | IndexNow key deployed (for Bing/Yandex/Naver indexing) | |
| TS18 | hreflang correctly implemented if multi-region | |
| TS19 | Bot verification logic in place (no UA-spoofing scrapers treated as bots) | |
| TS20 | Server logs analyzed at least quarterly for crawl-budget waste | |
| TS21 | Zero soft 404s in GSC | |
| TS22 | "Crawled - currently not indexed" bucket under 10% of total | |
| TS23 | "Discovered - currently not indexed" bucket under 5% of total | |
| TS24 | URL parameter strategy documented (block / canonical / index per parameter) | |
| TS25 | Pagination strategy documented (rel=next/prev replaced or supplemented) | |
| TS26 | AI crawler policy in robots.txt explicit and documented | |
| TS27 | Duplicate-content audit completed in last 90 days | |
| TS28 | All 301 redirects retained from migrations (don't expire redirects) | |
| TS29 | Crawl depth report shows zero pages over depth 3 (small sites) or 5 (large) | |
| TS30 | URLs under 60 characters where possible | |
| TS31 | Server response time under 600ms for HTML responses | |
| TS32 | GSC Coverage report shows zero "server error" URLs | |
| TS33 | GSC URL Inspection on 5 random pages confirms canonical, indexed status, no rendering issues | |
| TS34 | Lighthouse SEO score 100 on representative sample of pages | |
| TS35 | No JavaScript-only navigation (every link reachable via crawl without rendering) | |
Maximum score: 35. World-class: 33+/35.


15. Common Mistakes

  1. Blocking CSS/JS in robots.txt — Google needs them to render. Frequently breaks indexing.
  2. Canonical pointing to a URL that redirects: the canonical target must return 200, otherwise the signal is invalidated.
  3. Multiple canonical signals disagreeing — internal links say A, sitemap says B, rel=canonical says C; Google picks one and ignores the others.
  4. Trailing slash inconsistency — half the site with /, half without; treated as duplicate URLs.
  5. Redirect chains — A→B→C→D wastes crawl, leaks signals.
  6. 302s where 301 belongs — temporary redirect on a permanent move; ranking signals leak.
  7. Soft 404s — page returns 200 but says "not found"; Google detects and demotes.
  8. Indexable thin pages: tag archives, paginated category pages, and internal search result pages left indexable despite offering no search value.
  9. JavaScript navigation only — links rendered via JS; crawl depth report shows orphaned pages.
  10. Stale <lastmod> in sitemap — every URL claims today's date; Google starts ignoring lastmod entirely.
  11. HTTPS deployed but HTTP not redirected: both versions serve and both get indexed, splitting ranking signals across duplicates.
  12. Mixed content — HTTPS page loads HTTP resources; browser blocks, layout breaks, ranking suffers.
  13. AI crawler blocking by accident: a wildcard Disallow: / rule also blocks legitimate AI crawlers, which costs citation traffic.
  14. No IndexNow — Bing, Yandex, Naver indexing days late when push-based could do it in minutes.
  15. Forgotten redirects after migration — old redirects removed prematurely; old inbound links 404.

16. Maintenance

Weekly:

  • Review GSC Coverage report for new errors
  • Check Bing Webmaster for crawl issues
  • Spot-check new URLs are indexable and in sitemap

Monthly:

  • Full GSC report review across all categories
  • Sitemap regeneration verification
  • Redirect map audit
  • Server log spot-check for 4xx/5xx clusters

Quarterly:

  • Comprehensive crawl audit (Screaming Frog or Sitebulb)
  • Server log deep analysis
  • Crawl-budget evaluation
  • Canonical audit
  • HTTPS / HSTS verification
  • AI crawler policy review (new bots emerge frequently)

Annually:

  • Full URL inventory and structural review
  • Migration redirect retention verification
  • Hreflang validation if international
  • IndexNow key rotation if compromised

17. Companion Documents

  • framework-pageexperience.md — Core Web Vitals, mobile, intrusive interstitials
  • framework-schema.md — Structured data implementation
  • framework-internallinking.md — Hub-and-spoke architecture
  • framework-migration.md — Site moves and URL restructures
  • framework-security.md — Broader security posture
  • framework-international.md — Full hreflang depth
  • framework-aicitations.md — AI crawler policy and visibility
  • framework-headless.md — Headless CMS rendering patterns

Document version: 1.0
Last updated: 2026-05-05
Owner: Joseph W. Anady — ThatDeveloperGuy — SDVOSB


About this framework library

This article is the Dev.to republish of a framework reference document from ThatDevPro's SEO + AI engineering library. Canonical source: https://www.thatdevpro.com/insights/framework-technicalseo/

ThatDevPro is an SDVOSB-certified veteran-owned web + AI engineering studio operating from Cassville, Missouri. The studio runs the full 14-tier Engine Optimization stack and ships open-source tooling for AI citation engineering.

Companion 14-tier Engine Optimization stack (each tier is its own article):

  1. Tier 1 — Foundation
  2. Tier 2 — Search Visibility
  3. Tier 3 — AI Domination
  4. Tier 4 — Entity and Authority
  5. Tier 5 — Local Domination
  6. Tier 6 — Content and Multimedia
  7. Tier 7 — Social and Community
  8. Tier 8 — Data, Analytics, Conversion
  9. Tier 9 — Monitoring and Intelligence
  10. Tier 10 — Workflow and Operations
  11. Tier 11 — Marketplace and Retail
  12. Tier 12 — International
  13. Tier 14 — Advanced and Immersive

Need this framework implemented on your site? See the Engine Optimization service or hire through ThatDevPro contact.
