Originally published at thatdevpro.com. Part of ThatDevPro's open SEO + AI framework library. ThatDevPro is an SDVOSB-certified veteran-owned web + AI engineering studio. Open-source AI citation toolkit: github.com/Janady13/aio-surfaces.
Crawlability, Indexing, Canonicalization, Redirects, URL Structure, and the Bot-Facing Foundation
A comprehensive installation and audit reference for technical SEO — the bedrock layer that determines whether search engines and AI crawlers can discover, render, and index a site at all. Every other framework in this library assumes the technical foundation works. This document specifies what "works" means and how to verify it. Dual-purpose: installation manual and audit document.
Cross-stack implementation note: the code samples in this framework are written in plain HTML for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of every pattern below, see framework-cross-stack-implementation.md. For pure client-rendered SPAs (no SSR/SSG) see framework-react.md. For Tailwind-specific concerns (purge, dynamic classes, dark-mode CLS, focus accessibility) see framework-tailwind.md.
1. Document Purpose
This is the canonical reference for technical SEO. Content quality, authority signals, schema markup, and AI optimization are all wasted if a crawler cannot reach a page, cannot render it, or cannot decide which version to index. Technical SEO is the prerequisite. It is not glamorous. It is not optional.
Technical SEO in 2026 differs from 2020 in three ways. First, JavaScript rendering is no longer the bleeding-edge concern — Google renders JS reliably and other major crawlers (Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot) have caught up to varying degrees. Second, the bot landscape exploded: a 2026 site receives traffic from a dozen named AI crawlers in addition to the four major search engines, and robots.txt policy is now an editorial decision, not a technical default. Third, indexing has become more selective — Google indexes a smaller percentage of crawled URLs than it did a decade ago, so wasting crawl budget on duplicate, parameterized, or thin URLs has a direct cost.
1.1 Required Tools
- Google Search Console — search.google.com/search-console — coverage, sitemaps, URL inspection
- Bing Webmaster Tools — www.bing.com/webmasters — Bing-specific coverage and IndexNow submission
- Screaming Frog SEO Spider — desktop crawler, free up to 500 URLs, paid for unlimited
- Sitebulb — desktop crawler, alternative to Screaming Frog with stronger reporting
- Ahrefs Site Audit / Semrush Site Audit — cloud-based crawlers with historical tracking
- Google Rich Results Test — search.google.com/test/rich-results — render + schema validation
- Google Lighthouse — Chrome DevTools performance/SEO audit
- GTmetrix / WebPageTest — performance and waterfall analysis
- curl / httpie — manual header inspection
- Cloudflare / nginx access logs — server-level crawl observation
- IndexNow — www.indexnow.org — push-based indexing for Bing, Yandex, Naver
1.2 Document Scope
Covers: crawl access, robots.txt, XML sitemaps, canonicalization, redirects, URL structure, status codes, JS rendering, mobile-first indexing, HTTPS posture, hreflang, and crawler observability. Touches but does not exhaust: page experience (own framework: framework-pageexperience.md), schema (framework-schema.md), internal linking (framework-internallinking.md), security (framework-security.md).
2. Client Variables Intake
domain_apex: ""
www_or_non_www_canonical: "" # decide which is canonical
http_or_https: "https" # always https in 2026
trailing_slash_policy: "" # with-slash | without-slash
url_case_policy: "lowercase"
cms_or_framework: "" # WordPress | Next.js | Astro | Hugo | Shopify | Webflow | static
hosting_environment: ""
cdn: "" # Cloudflare | Fastly | none
search_console_verified: false
bing_webmaster_verified: false
indexnow_key_deployed: false
known_indexing_issues: []
recent_migrations: []
international_targets: [] # if any hreflang need
3. Crawl Access Layer
3.1 robots.txt
The robots.txt file at the domain root tells crawlers which paths they may request. It is advisory, not a security mechanism — anything genuinely sensitive belongs behind authentication, not behind a Disallow rule.
Minimum viable robots.txt for a production site:
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /*?*sessionid=
Disallow: /*?*utm_*
Sitemap: https://example.com/sitemap.xml
Required validations:
- The file is served at /robots.txt with Content-Type: text/plain and HTTP 200.
- It does NOT block CSS, JS, or image directories. Googlebot needs those to render.
- It does NOT block fonts (/fonts/, /assets/fonts/) or web manifest assets.
- The Sitemap directive uses an absolute URL.
- Wildcards (*) are used sparingly and tested before deploy. Google retired the standalone robots.txt tester in Search Console; the robots.txt report there only shows what Google fetched, so test new rules with an independent parser first.
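Wildcard rules are easy to get wrong, so it can help to test them offline. The sketch below is a minimal matcher, assuming Google-style semantics (the longest matching pattern wins, Allow wins a tie, `*` matches any run of characters, a trailing `$` anchors the end). The helper names `pattern_to_regex` and `is_allowed` are hypothetical, and the function handles a single user-agent group only. Python's built-in `urllib.robotparser` is not used here because it applies rules in file order and does not expand wildcards.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, pattern) pairs from one user-agent group."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern or not pattern_to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        # Longest pattern wins; on equal length, Allow wins.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


# The minimum viable rule set from this section.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/wp-admin/"),
    ("Disallow", "/staging/"),
    ("Disallow", "/*?*sessionid="),
    ("Disallow", "/*?*utm_*"),
]
```

Running `is_allowed(RULES, path)` over a list of critical asset paths (CSS, JS, fonts) before each deploy is one way to catch an accidental block.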
AI crawler policy (2026 baseline):
In 2026, the question is no longer "do we block AI crawlers" but "which AI crawlers do we want citing us, and which do we want to block." A typical client-facing posture:
# AI search crawlers — usually allow (citation traffic)
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
# Aggressive scrapers — block by default unless client requests otherwise
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12Bot
Disallow: /
Cross-reference: framework-aicitations.md for the full AI-crawler matrix.
3.2 X-Robots-Tag
For non-HTML resources (PDFs, images, JSON files) that should not be indexed, X-Robots-Tag is the only signal. Set via server config:
# nginx
location ~* \.(pdf|json)$ {
    add_header X-Robots-Tag "noindex, nofollow";
}
For HTML pages, <meta name="robots"> is preferred because it is more visible to humans editing pages.
3.3 Crawl Budget
Crawl budget is the number of URLs Googlebot crawls in a given period. For sites under ~50,000 URLs this rarely matters. Above that, crawl-budget waste shows as:
- Many URLs in GSC's "Crawled — currently not indexed"
- Many URLs in "Discovered — currently not indexed"
- Slow propagation of new content to the index
- Frequent crawls of low-value parameterized URLs
Crawl budget conservation:
- Block parameterized/sessioned URLs in robots.txt
- 410 truly dead URLs (faster than 404 to be removed from crawl)
- Reduce internal links to low-value pages
- Use noindex on pages that should not be indexed (404 templates, search result pages, tag archives with thin content)
- Use a clean XML sitemap so the crawler has a prioritized list
4. Indexing Layer
4.1 The Two-Step Process
Indexing is two steps, not one:
- Discovery and crawl — the bot finds the URL (sitemap, link, IndexNow ping) and requests it.
- Indexing — the bot parses the response, decides whether to add it to the index, and what to associate with it.
A page can be crawled but not indexed. A page can be in the index but not ranked. These are distinct states with distinct fixes.
4.2 Index Status in Google Search Console
Use the Page indexing report (formerly Coverage). The buckets:
| Status | Meaning | Action |
|---|---|---|
| Indexed | In Google's index | Monitor for unexpected changes |
| Indexed, not submitted in sitemap | Found via links, not in sitemap | Add to sitemap if it should be indexed |
| Crawled — currently not indexed | Crawled but Google chose not to index | Improve content quality, add internal links, check duplication |
| Discovered — currently not indexed | Found but not yet crawled | Often crawl-budget signal; check site authority + reduce low-value URLs |
| Excluded by 'noindex' tag | Intentionally excluded | Verify intent |
| Page with redirect | Redirected; not indexed itself | Verify the redirect target is indexed |
| Duplicate, Google chose different canonical | Google ignored your canonical | Check internal linking + canonical signals consistency |
| Soft 404 | Returns 200 but content looks like a 404 page | Either return real 404 or fix the page |
| Server error (5xx) | Crawl failed | Fix the server error |
4.3 IndexNow
IndexNow is a push-based indexing protocol supported by Bing, Yandex, Seznam, and Naver (not Google). When a URL changes, you POST it to IndexNow and supported engines crawl within minutes instead of days.
Implementation:
- Generate an API key (random 32-character string)
- Place the key as /{key}.txt at the domain root with the key as its content
- POST URL changes to https://api.indexnow.org/indexnow:
POST /indexnow HTTP/1.1
Host: api.indexnow.org
Content-Type: application/json

{
  "host": "example.com",
  "key": "abc123...",
  "keyLocation": "https://example.com/abc123.txt",
  "urlList": [
    "https://example.com/new-page/",
    "https://example.com/updated-page/"
  ]
}
For WordPress, plugin support exists. For Next.js / static sites, integrate as a build-time hook.
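For a build-time hook, the request above can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_indexnow_payload` and `submit` are hypothetical, the key file is assumed to sit at the domain root, and the site is assumed to be served over HTTPS.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_payload(host: str, key: str, urls: list[str]) -> dict:
    """Build the JSON body; assumes the key file lives at https://{host}/{key}.txt."""
    off_host = [u for u in urls if not u.startswith(f"https://{host}/")]
    if off_host:
        raise ValueError(f"URLs not on {host}: {off_host}")
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }


def submit(payload: dict) -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the hook easy to dry-run in CI: build the payload for every changed URL, log it, and only call `submit` on production deploys.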
4.4 Mobile-First Indexing
Since 2023, Google indexes the mobile version of every site. The desktop version is largely ignored for ranking. Verify:
- All content visible on mobile (not hidden behind "click to expand" with mobile-only display:none)
- Structured data identical on mobile and desktop
- Meta tags identical on mobile and desktop
- Internal linking identical (no mobile-only menu omitting key links)
- Images load on mobile (no desktop-only assets)
5. Canonicalization
5.1 Why Canonicalization Matters
Modern sites generate many URLs that resolve to the same content:
- https://example.com/page and https://example.com/page/
- https://example.com/page and https://www.example.com/page
- https://example.com/page?utm_source=email
- https://example.com/PAGE (some servers serve the same content for any case)
- https://example.com/page?session=abc123
Without explicit canonical signals, Google picks one and may not pick the one you want. Ranking signals split across variants. Indexing decisions become inconsistent.
5.2 Canonical Signal Stack
Canonical signals reinforce each other. Use all of them:
- <link rel="canonical" href="..."> — the explicit declaration. Self-referential on the canonical URL itself.
- 301 redirects — for true duplicates, redirect non-canonical to canonical (preferred over rel=canonical when content is genuinely identical).
- Internal linking — every internal link points to the canonical URL, never to a redirected variant.
- XML sitemap — only canonical URLs appear in the sitemap.
- hreflang annotations — when present, must reference canonical URLs only.
- HTTP Link header — equivalent to rel=canonical, used for non-HTML resources.
If these signals disagree, Google picks one and ignores the rest. Consistency is the rule.
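One way to audit the first signal is to extract rel=canonical from a page's HTML and compare it with the URL you expect. The sketch below uses Python's standard `html.parser`; the class `CanonicalFinder` and function `check_canonical` are hypothetical names, and the comparison is an exact string match, so normalize both sides first if your policy allows variants.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def check_canonical(html: str, expected_url: str) -> str:
    """Return 'ok', 'missing', 'conflicting', or 'mismatch'."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(set(finder.canonicals)) > 1:
        return "conflicting"
    return "ok" if finder.canonicals[0] == expected_url else "mismatch"
```

The same expected URL can then be checked against the sitemap and internal links, which is where disagreements between signals usually surface.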
5.3 Common Canonicalization Mistakes
- Canonical to a redirect target. Always canonical to the URL that returns 200, never to one that 301s.
- Cross-domain canonical without authority. You can canonical mirror.example.com to example.com, but canonicals across unrelated domains are usually ignored.
- Self-canonical with parameters present. A URL like /page?utm_source=x should canonical to /page (no parameters), not self-canonical.
- Conflicting canonicals between hreflang clusters. Each hreflang cluster member must canonical to itself, not to an English default.
- HTTP version canonicalizing to HTTPS but not redirecting. Use a 301 plus self-canonical on HTTPS, not just rel=canonical.
5.4 Trailing Slash and Case
Pick one and enforce it sitewide via 301 redirect. Do not rely on canonicals alone — redirects collapse the duplication, canonicals only signal it.
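# nginx: enforce lowercase + trailing slash
# Stock nginx cannot lowercase a URI; the first block assumes lua-nginx-module (OpenResty).
# It drops the query string, and a mixed-case URL without a slash takes two hops.
location ~ [A-Z] {
    rewrite_by_lua_block { return ngx.redirect(string.lower(ngx.var.uri), 301) }
}
location / {
    rewrite ^([^.]*[^/])$ $1/ permanent;  # add the slash to extensionless paths
    try_files $uri $uri/ =404;
}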
For Next.js, set trailingSlash: true (or false) in next.config.js and stick with it. Mixing breaks canonicalization.
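Whatever enforces the policy at the server, it helps to have the same rule available in code for auditing crawl exports and redirect maps. The sketch below is one possible normalizer, assuming the policy is HTTPS, lowercase host and path, and a trailing slash on extensionless paths. The function name `normalize_url` is hypothetical, and it drops fragments.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, trailing_slash: bool = True) -> str:
    """Return the canonical form of a URL under the assumed policy."""
    parts = urlsplit(url)
    path = parts.path.lower() or "/"
    # Treat a dot in the last segment as a file (e.g. /file.pdf): no slash added.
    looks_like_file = "." in path.rsplit("/", 1)[-1]
    if trailing_slash and not path.endswith("/") and not looks_like_file:
        path += "/"
    elif not trailing_slash and path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    # Scheme is forced to https; the query string is kept, the fragment dropped.
    return urlunsplit(("https", parts.netloc.lower(), path, parts.query, ""))
```

Any crawled URL where `normalize_url(url) != url` is a candidate for a 301 or an internal-link fix.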
6. Redirects
6.1 Status Codes
| Code | Use case |
|---|---|
| 301 Moved Permanently | The URL has permanently moved. Passes ranking signals. Default for migrations. |
| 302 Found | Temporary redirect. Use only for actual temporary moves (A/B tests, seasonal pages). Misused 302s leak link equity. |
| 307 Temporary Redirect | Like 302 but preserves request method. Rare in SEO context. |
| 308 Permanent Redirect | Like 301 but preserves request method. Functionally equivalent for SEO. |
| 410 Gone | Page is permanently removed and not coming back. Faster removal from index than 404. |
| 451 Unavailable for Legal Reasons | Use when content removed for legal reasons (DMCA, jurisdictional). |
6.2 Redirect Chains
A chain is A → B → C. Eliminate them. Every redirect should be a single hop to the final URL. Chains:
- Waste crawl budget
- Add latency for users
- Risk dropping signals at each hop
- Break when one link in the chain dies
Maintain a redirect map spreadsheet for any migration. After deploying redirects, crawl the site and verify zero chains.
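If the redirect map is exported as source-to-target pairs, chains can be detected and collapsed before deploy. This is a minimal sketch with hypothetical function names (`find_chains`, `flatten_redirects`); it assumes sources and targets are written in the same normalized form so that string comparison is meaningful.

```python
def find_chains(redirects: dict[str, str]) -> list[str]:
    """List sources whose first hop lands on another redirect."""
    return [src for src, dst in redirects.items() if dst in redirects]


def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Return a map where every source points directly at its final target."""
    flattened = {}
    for source, target in redirects.items():
        seen = {source}
        # Follow hops until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened
```

For example, `{"/a": "/b", "/b": "/c"}` flattens to `{"/a": "/c", "/b": "/c"}`, so both old URLs reach the final page in one hop.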
6.3 Redirect Implementation Layers
In order of preference:
- Server config (nginx, Apache, Cloudflare Rules) — fastest, most reliable, executed before page load.
- CMS-level redirect plugin — fine for low-volume changes, performance penalty at scale.
- JavaScript redirects — last resort. Slow, fragile, sometimes ignored by crawlers.
- Meta refresh redirects — never use. Treated as low-quality signal.
6.4 The Migration Redirect Pattern
When migrating URL structure:
- Generate a 1:1 map of old URL → new URL for every indexed page.
- Implement 301s in server config before deploy.
- Update internal links to point to new URLs (do not rely on the redirect).
- Update XML sitemap to list only new URLs.
- Submit new sitemap in GSC.
- Monitor GSC's Coverage report for 30-90 days.
- Keep redirects in place permanently — old URLs have inbound links from sites you don't control.
Cross-reference: framework-migration.md for full migration methodology.
7. URL Structure
7.1 The Eight URL Rules
- Lowercase. Always.
- Hyphens between words. Not underscores. Not camelCase.
- Under 60 characters when possible. Long URLs index fine but truncate in SERPs.
- No stop words unless meaningful. /the-best-web-hosting/ reads better as /best-web-hosting/.
- Descriptive, not numeric. /blog/post-1234/ is opaque; /blog/local-seo-checklist/ is meaningful.
- No file extensions where avoidable. /about/ over /about.html or /about.php.
- One canonical separator policy. Don't mix /category/post/ and /category-post/.
- Stable. Once published, do not change a URL without redirecting.
7.2 URL Hierarchy and Crawl Depth
A URL's path depth (/a/b/c/d/page/) is independent of crawl depth (clicks from homepage). Crawl depth matters for SEO; path depth is only loosely related.
Target: every important page reachable in 3 clicks or fewer from the homepage. Verify with Sitebulb's Crawl Depth report.
Cross-reference: framework-internallinking.md for hub-and-spoke architecture.
7.3 Parameter Handling
URL parameters create duplication. Strategies:
- Block in robots.txt — for tracking parameters that should never be indexed (?utm_*, ?sessionid=, ?fbclid=).
- Canonical to parameterless version — for sort/filter parameters where the parameterless version is canonical.
- Self-canonical with noindex — for parameter combinations that are unique pages but should not be indexed.
- GSC URL Parameter tool — deprecated as of 2022. Use canonical signals instead.
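To compute the canonical target for a URL carrying tracking parameters, the parameters can be stripped programmatically. The sketch below assumes the tracking set named above (`utm_*`, `sessionid`, `fbclid`); the constants and the function name `strip_tracking_params` are hypothetical, and a real deployment would extend the lists per the documented parameter strategy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"sessionid", "fbclid"}


def strip_tracking_params(url: str) -> str:
    """Remove tracking parameters; keep everything else in its original order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRACKING_NAMES
        and not name.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Sort and filter parameters are deliberately left alone here, since whether they canonicalize to the parameterless page is a per-parameter decision.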
8. XML Sitemaps
8.1 What Belongs in a Sitemap
Only canonical, indexable, 200-status URLs that you want indexed. Everything else stays out:
- Excluded: noindex pages, redirected URLs, 4xx URLs, duplicate URLs, parameterized variants
- Excluded: pagination URLs (page/2/, page/3/) unless you have a strategic reason
- Excluded: search result pages, login pages, account pages
8.2 Sitemap Structure
For sites under 50,000 URLs, a single sitemap is fine. Above that, use a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-images.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>
Per-sitemap entry:
<url>
  <loc>https://example.com/page/</loc>
  <lastmod>2026-05-05</lastmod>
</url>
<changefreq> and <priority> are ignored by Google. <lastmod> is honored when accurate; if you fake it (always = today), Google starts ignoring it.
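Generating the sitemap from the same source of truth as the pages is the simplest way to keep lastmod honest. The sketch below builds a urlset with Python's standard `xml.etree.ElementTree`; the function name `build_sitemap` is hypothetical, and the caller is assumed to pass only canonical, indexable URLs with real modification dates.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """entries: (canonical URL, last-modified date as YYYY-MM-DD) pairs."""
    if len(entries) > 50_000:
        raise ValueError("split into multiple sitemaps and use a sitemap index")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Omitting `<changefreq>` and `<priority>` entirely keeps the file smaller and avoids implying signals that Google does not read.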
8.3 Specialized Sitemaps
- Image sitemap — for image-heavy sites; accelerates Image Pack inclusion.
- Video sitemap — required for native video content; alternative to VideoObject schema.
- News sitemap — for sites approved for Google News; only includes articles from the last 48 hours.
- hreflang sitemap — programmatic alternative to inline hreflang tags.
8.4 Submission
- Submit sitemap URL in GSC and Bing Webmaster Tools.
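- Reference sitemap URL in robots.txt (Sitemap: https://...).
- For large sites, keep lastmod accurate so crawlers pick up changes on their next sitemap fetch (Google retired its sitemap ping endpoint in 2023), or use IndexNow for the engines that support it.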
9. Status Codes — The Operator's Reference
Beyond redirects, status codes communicate site health to crawlers. The full inventory worth knowing:
9.1 2xx Success
- 200 OK — page delivered. Default for healthy URLs.
- 201 Created — resource created (POST endpoints).
- 204 No Content — success, no body returned.
- 206 Partial Content — range request (video, large download resume).
9.2 3xx Redirection
- 301 / 308 — permanent. Default for migrations.
- 302 / 307 — temporary. Use sparingly.
- 304 Not Modified — conditional GET succeeded; client uses its cached copy. Healthy at scale.
9.3 4xx Client Errors
- 400 Bad Request — malformed request. Usually bot or attack traffic.
- 401 Unauthorized — auth required.
- 403 Forbidden — access denied. If on a public page, check nginx/htaccess rules.
- 404 Not Found — URL does not exist. Track in GSC and either redirect or 410.
- 405 Method Not Allowed — right URL, wrong verb. Common form-handler misconfiguration.
- 410 Gone — permanently removed. Faster removal from index than 404.
- 429 Too Many Requests — rate limited. Usually bot abuse.
- 451 Unavailable for Legal Reasons — content removed for legal cause.
9.4 5xx Server Errors
- 500 Internal Server Error — application crash.
- 502 Bad Gateway — nginx couldn't reach upstream.
- 503 Service Unavailable — maintenance or overload. Use Retry-After header.
- 504 Gateway Timeout — upstream too slow.
5xx codes are urgent. Sustained 5xx during a Googlebot crawl drops pages from the index.
10. JavaScript Rendering
10.1 The Two-Wave Indexing Model (Mostly Obsolete)
For years, Google's two-wave model meant JS sites were indexed late: the first wave indexed HTML, the second wave indexed rendered content days later. As of 2025, Google renders the vast majority of pages within hours of the first crawl. The two-wave problem is no longer a structural blocker for most sites.
It is still real for AI crawlers. GPTBot, ClaudeBot, and PerplexityBot do not all render JS reliably. For AI search visibility, server-side rendered content matters more than for traditional Google SEO.
10.2 Rendering Strategy by Content Type
| Content type | Recommended rendering |
|---|---|
| Marketing pages, landing pages | SSG or SSR (no client-side rendering for primary content) |
| Blog posts, articles | SSG |
| Ecommerce product pages | SSR with hydration |
| Logged-in user pages | CSR is fine (these aren't indexed anyway) |
| Real-time data displays | SSR shell + client hydration |
10.3 Validation
Use the URL Inspection tool's "Test Live URL" (or the Rich Results Test) to see what Googlebot actually renders; Google retired the standalone Mobile-Friendly Test in December 2023. If primary content is missing, the page is not effectively indexed even if it returns 200.
For AI crawler visibility, curl -A "GPTBot" https://example.com/page and inspect the HTML response body. If the content is in <noscript> only or arrives via fetch/XHR, AI crawlers miss it.
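The manual curl check can be automated once the raw response body is saved. The sketch below, using Python's standard `html.parser`, tests whether a key phrase appears in the initial HTML outside of script, style, noscript, and template elements. The names `VisibleText` and `phrase_in_initial_html` are hypothetical, and the check is a rough proxy for what a non-rendering crawler can read, not a model of any specific bot.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is present in the HTML without executing JavaScript."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_initial_html(html: str, phrase: str) -> bool:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not split the phrase.
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()
```

Picking one distinctive sentence per template (product description, article lede) and asserting it is present in the unrendered HTML makes a cheap regression test for SSR.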
Cross-reference: framework-headless.md and framework-nextjs.md for framework-specific rendering patterns.
11. HTTPS
In 2026, HTTPS is non-negotiable. HTTP-only sites suffer ranking penalties, browser warnings, and lost trust signals.
11.1 Required Configuration
- TLS 1.2 minimum, TLS 1.3 preferred.
- Valid certificate from a recognized CA (Let's Encrypt, Cloudflare, paid).
- HSTS header: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
- HSTS preload list submission for high-value domains: hstspreload.org
- HTTP-to-HTTPS 301 redirect, single hop.
- Canonical and internal links use HTTPS exclusively.
- Mixed-content audit: no HTTP resources loaded on HTTPS pages.
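The HSTS requirement can be verified from a captured header value. The sketch below is a small checker with a hypothetical name, `check_hsts`; it assumes the one-year minimum used in this document and treats preload as optional unless asked for, since preload submission is recommended here only for high-value domains.

```python
def check_hsts(header_value: str, min_age: int = 31_536_000,
               require_preload: bool = False) -> list[str]:
    """Return a list of problems found in a Strict-Transport-Security value."""
    directives = [d.strip().lower() for d in header_value.split(";") if d.strip()]
    problems = []
    max_age = None
    for directive in directives:
        if directive.startswith("max-age="):
            value = directive.split("=", 1)[1].strip('"')
            max_age = int(value) if value.isdigit() else None
    if max_age is None:
        problems.append("max-age missing or invalid")
    elif max_age < min_age:
        problems.append(f"max-age {max_age} is below {min_age}")
    if "includesubdomains" not in directives:
        problems.append("includeSubDomains missing")
    if require_preload and "preload" not in directives:
        problems.append("preload missing")
    return problems
```

An empty list means the header meets the configuration listed above; anything else is a line item for the audit.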
11.2 Certificate Maintenance
- Auto-renewal via Let's Encrypt or hosting provider.
- Monitor expiration via UptimeRobot or similar.
- Verify certificate transparency log entries via crt.sh.
Cross-reference: framework-security.md for broader security posture.
12. International / hreflang
For sites targeting multiple languages or regions, hreflang annotations tell Google which version to show which user.
12.1 hreflang Implementation
Three valid placement methods, listed in order of preference:
- HTTP Link header — non-HTML resources, full programmatic control.
- XML sitemap hreflang — preferred for large sites; centralizes all annotations in one place.
- <link rel="alternate"> tags in <head> — most common; works but harder to maintain at scale.
<link rel="alternate" hreflang="en-US" href="https://example.com/en-us/page/">
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/page/">
<link rel="alternate" hreflang="x-default" href="https://example.com/page/">
12.2 hreflang Rules
- Every page in a cluster must reference every other page in the cluster (return-link requirement).
- Every page must self-reference.
- Use x-default for the fallback when no language matches.
- Use ISO 639-1 language + ISO 3166-1 region (en-US, not en_US).
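The return-link and self-reference rules are mechanical, so they can be checked in code once annotations are extracted from each page. The sketch below assumes the cluster is represented as a dict from page URL to its hreflang annotations; that data shape and the function name `validate_hreflang` are illustrative choices, not a standard format.

```python
def validate_hreflang(cluster: dict[str, dict[str, str]]) -> list[str]:
    """cluster maps each page URL to its {hreflang code: target URL} annotations."""
    problems = []
    for page, annotations in cluster.items():
        targets = set(annotations.values())
        if page not in targets:
            problems.append(f"{page}: missing self-reference")
        for code in annotations:
            if "_" in code:
                problems.append(f"{page}: use a hyphen in '{code}'")
        # Every page we point at must point back at us.
        for target in sorted(targets - {page}):
            if page not in cluster.get(target, {}).values():
                problems.append(f"{page}: no return link from {target}")
    return problems
```

A target that is missing from the cluster entirely is reported as having no return link, which is usually the right diagnosis: the annotated page was never crawled or carries no hreflang at all.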
Cross-reference: framework-international.md for full hreflang depth.
13. Crawler Observability
Knowing what crawlers actually do on a site, not what you think they do.
13.1 Server Log Analysis
The single best technical SEO data source. Server logs (nginx access logs, Apache access logs, Cloudflare logs) record every bot request with status, response time, and user agent.
Tools:
- Screaming Frog Log File Analyser — desktop, parses nginx/Apache logs
- GoAccess — terminal log viewer with web reports
- Custom log pipeline — for clients with infrastructure depth, ship logs to BigQuery / DuckDB and query directly
What to look for:
- Crawl frequency per URL (which pages does Googlebot revisit most often?)
- Crawl-budget waste (low-value URLs with high crawl rate)
- 4xx/5xx clusters (pages crawlers are repeatedly hitting that error)
- Slow URLs (high response time correlates with reduced crawl rate)
- Bot fingerprint verification (real Googlebot? or a UA-spoofing scraper?)
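A first pass over access logs needs very little tooling. The sketch below parses lines in the common nginx/Apache "combined" format and counts status codes and paths per crawler. The regex, the `BOT_TOKENS` list, and the function name `summarize_bot_hits` are assumptions for illustration: custom log formats need a different pattern, and matching on user agent alone does not prove a request came from the real bot.

```python
import re
from collections import Counter

# Combined log format: ip - user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_TOKENS = ("Googlebot", "bingbot", "GPTBot", "ClaudeBot", "PerplexityBot")


def summarize_bot_hits(lines):
    """Return Counters keyed by (bot, status) and (bot, path)."""
    status_counts, path_counts = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        agent = match["agent"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in agent), None)
        if bot is None:
            continue
        status_counts[(bot, match["status"])] += 1
        path_counts[(bot, match["path"])] += 1
    return status_counts, path_counts
```

Calling `path_counts.most_common(20)` on a week of logs is often enough to spot parameterized URLs absorbing crawl activity.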
13.2 Bot Verification
Anyone can claim to be Googlebot. Verify:
- Reverse DNS lookup of the bot's IP must resolve to googlebot.com or google.com
- Forward DNS of that hostname must resolve back to the same IP
- Google publishes its IP ranges at developers.google.com/search/apis/ipranges/googlebot.json
For Bing: reverse DNS resolves to a hostname under search.msn.com. For Apple: applebot.apple.com. AI crawlers vary; ClaudeBot, GPTBot, and PerplexityBot each publish IP ranges.
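The reverse-then-forward DNS check can be sketched with Python's standard `socket` module. The function name `verify_crawler_ip` is hypothetical, the default suffixes are the Googlebot ones listed above, and the resolvers are injectable so the logic can be tested without network access. `socket.gethostbyname_ex` resolves IPv4 only, so IPv6 crawler addresses would need `socket.getaddrinfo` instead.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip, suffixes=GOOGLE_SUFFIXES,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # The forward lookup must include the original IP, or the PTR is spoofed.
        return ip in forward(hostname)[2]
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

DNS lookups are slow, so in practice the result is usually cached per IP rather than computed on every request.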
13.3 GSC URL Inspection
For specific URLs, GSC's URL Inspection tool shows:
- Last crawl date
- Last response code
- Indexed status
- Mobile usability
- Rendered HTML (live test)
- Discovered referring URLs
Use this to debug specific indexing problems.
14. Audit Mode
| # | Criterion | Pass/Fail |
|---|---|---|
| TS1 | robots.txt returns 200 plain text, allows critical resources | |
| TS2 | XML sitemap returns 200, validates, lists only canonical indexable URLs | |
| TS3 | Sitemap submitted to Google Search Console and Bing Webmaster | |
| TS4 | All canonical URLs return 200 | |
| TS5 | Self-referential rel=canonical on every indexable page | |
| TS6 | HTTP-to-HTTPS 301 redirect, single hop | |
| TS7 | www / non-www unified via 301, single hop | |
| TS8 | Trailing slash policy enforced sitewide | |
| TS9 | URLs lowercase, no mixed-case duplicates | |
| TS10 | Zero redirect chains (every redirect single-hop) | |
| TS11 | No 4xx URLs in sitemap or internal links | |
| TS12 | No 5xx URLs detected in last 30 days | |
| TS13 | HSTS header present with min 1-year max-age | |
| TS14 | TLS 1.2+ enforced, valid certificate, no mixed content | |
| TS15 | Mobile rendering verified (URL Inspection live test or Lighthouse mobile emulation) | |
| TS16 | JS-rendered content visible to Googlebot via URL Inspection | |
| TS17 | IndexNow key deployed (for Bing/Yandex/Naver indexing) | |
| TS18 | hreflang correctly implemented if multi-region | |
| TS19 | Bot verification logic in place (no UA-spoofing scrapers treated as bots) | |
| TS20 | Server logs analyzed at least quarterly for crawl-budget waste | |
| TS21 | Zero soft 404s in GSC | |
| TS22 | Crawled — currently not indexed bucket under 10% of total | |
| TS23 | Discovered — currently not indexed bucket under 5% of total | |
| TS24 | URL parameter strategy documented (block / canonical / index per parameter) | |
| TS25 | Pagination strategy documented (rel=next/prev replaced or supplemented) | |
| TS26 | AI crawler policy in robots.txt explicit and documented | |
| TS27 | Duplicate-content audit completed in last 90 days | |
| TS28 | All 301 redirects retained from migrations (don't expire redirects) | |
| TS29 | Crawl depth report shows zero pages over depth 3 (small sites) or 5 (large) | |
| TS30 | URLs under 60 characters where possible | |
| TS31 | Server response time under 600ms for HTML responses | |
| TS32 | GSC Coverage report shows zero "server error" URLs | |
| TS33 | GSC URL Inspection on 5 random pages confirms canonical, indexed status, no rendering issues | |
| TS34 | Lighthouse SEO score 100 on representative sample of pages | |
| TS35 | No JavaScript-only navigation (every link reachable via crawl without rendering) | |
Score: one point per passing criterion, 35 total. World-class: 33+/35.
15. Common Mistakes
- Blocking CSS/JS in robots.txt — Google needs them to render. Frequently breaks indexing.
- Canonical pointing to a redirect target — invalidates the canonical signal.
- Multiple canonical signals disagreeing — internal links say A, sitemap says B, rel=canonical says C; Google picks one and ignores the others.
- Trailing slash inconsistency — half the site with /, half without; treated as duplicate URLs.
- Redirect chains — A→B→C→D wastes crawl, leaks signals.
- 302s where 301 belongs — temporary redirect on a permanent move; ranking signals leak.
- Soft 404s — page returns 200 but says "not found"; Google detects and demotes.
- Indexable thin pages — tag archives, paginated category pages, search result pages with no value indexed without filtering.
- JavaScript navigation only — links rendered via JS; crawl depth report shows orphaned pages.
- Stale <lastmod> in sitemap — every URL claims today's date; Google starts ignoring lastmod entirely.
- HTTPS deployed but HTTP not redirected — both versions serve, both indexed, duplicate-content penalty.
- Mixed content — HTTPS page loads HTTP resources; browser blocks, layout breaks, ranking suffers.
- AI crawler blocking by accident — wildcard Disallow: / block applied to legitimate AI crawlers losing citation traffic.
- No IndexNow — Bing, Yandex, Naver indexing days late when push-based could do it in minutes.
- Forgotten redirects after migration — old redirects removed prematurely; old inbound links 404.
16. Maintenance
Weekly:
- Review GSC Coverage report for new errors
- Check Bing Webmaster for crawl issues
- Spot-check new URLs are indexable and in sitemap
Monthly:
- Full GSC report review across all categories
- Sitemap regeneration verification
- Redirect map audit
- Server log spot-check for 4xx/5xx clusters
Quarterly:
- Comprehensive crawl audit (Screaming Frog or Sitebulb)
- Server log deep analysis
- Crawl-budget evaluation
- Canonical audit
- HTTPS / HSTS verification
- AI crawler policy review (new bots emerge frequently)
Annually:
- Full URL inventory and structural review
- Migration redirect retention verification
- Hreflang validation if international
- IndexNow key rotation if compromised
17. Companion Documents
- framework-pageexperience.md — Core Web Vitals, mobile, intrusive interstitials
- framework-schema.md — Structured data implementation
- framework-internallinking.md — Hub-and-spoke architecture
- framework-migration.md — Site moves and URL restructures
- framework-security.md — Broader security posture
- framework-international.md — Full hreflang depth
- framework-aicitations.md — AI crawler policy and visibility
- framework-headless.md — Headless CMS rendering patterns
Document version: 1.0
Last updated: 2026-05-05
Owner: Joseph W. Anady — ThatDeveloperGuy — SDVOSB
About this framework library
This article is the Dev.to republish of a framework reference document from ThatDevPro's SEO + AI engineering library. Canonical source: https://www.thatdevpro.com/insights/framework-technicalseo/
ThatDevPro is an SDVOSB-certified veteran-owned web + AI engineering studio operating from Cassville, Missouri. The studio runs the full 14-tier Engine Optimization stack and ships open-source tooling for AI citation engineering.
Companion 14-tier Engine Optimization stack (each tier is its own article):
- Tier 1 — Foundation
- Tier 2 — Search Visibility
- Tier 3 — AI Domination
- Tier 4 — Entity and Authority
- Tier 5 — Local Domination
- Tier 6 — Content and Multimedia
- Tier 7 — Social and Community
- Tier 8 — Data, Analytics, Conversion
- Tier 9 — Monitoring and Intelligence
- Tier 10 — Workflow and Operations
- Tier 11 — Marketplace and Retail
- Tier 12 — International
- Tier 14 — Advanced and Immersive
Need this framework implemented on your site? See the Engine Optimization service or hire through ThatDevPro contact.