DEV Community

GrimLabs

I Crawled 1,000 Sites and Here's What's Wrong With Their SEO

I pointed a crawler at 1,000 websites (a mix of SaaS marketing sites, developer docs, e-commerce stores, and content publishers) and ran a comprehensive technical SEO audit on each. The results were worse than I expected.

This isn't a sales pitch. The raw data and methodology are described below. If you run a website, there's a good chance it has at least 3 of these issues.

Methodology

Sample: 1,000 sites sourced from [DESCRIBE SOURCE, e.g., "the top 1,000 sites on a Hacker News 'Show HN' aggregator," or "1,000 SaaS sites from a public directory"]. The list skews toward developer-focused and B2B sites.

Crawl depth: Up to 500 pages per site, or the full site if smaller.

Checks performed: [X] total technical checks across these categories:

  • Meta tags (title, description, OG tags, canonical URLs)
  • Link health (broken internal links, redirect chains)
  • Performance (Core Web Vitals via Lighthouse)
  • Structured data (JSON-LD validity, schema.org compliance)
  • Internationalization (hreflang configuration, locale detection)
  • Security (HTTPS, mixed content, security headers)
  • Accessibility (alt text, heading hierarchy, ARIA)

Finding 1: [X]% of Sites Have Broken Canonical URLs

[INSERT EXACT PERCENTAGE] of the 1,000 sites had at least one canonical URL issue.

The most common problems:

| Issue | Prevalence |
| --- | --- |
| Canonical pointing to a non-existent page | [X]% |
| Canonical pointing to a redirect | [X]% |
| Missing canonical tag entirely | [X]% |
| Multiple conflicting canonical tags | [X]% |
| HTTP canonical on an HTTPS page | [X]% |

Why it matters: A broken canonical tag tells search engines to index the wrong URL, or no URL at all. This is one of the highest-impact, lowest-effort fixes in SEO.

The fix: Ensure every page's canonical tag points to itself (for unique pages) or to the preferred version (for duplicates). Verify that the canonical URL returns a 200 status code.

```html
<!-- Correct -->
<link rel="canonical" href="https://example.com/blog/my-post" />

<!-- Wrong: canonical points to a redirect -->
<link rel="canonical" href="http://example.com/blog/my-post" />
<!-- This 301s to https://, so the canonical is technically a redirect -->
```
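This check is easy to automate once you have crawl data. Here's a minimal sketch in Python (function and data shapes are my own, not from the study): given a page URL, its canonical href, and a lookup for each URL's status code, classify the problem:

```python
from urllib.parse import urlparse

def check_canonical(page_url, canonical_href, get_status):
    """Classify common canonical problems.

    get_status(url) -> int is assumed to return the HTTP status
    code for a URL, e.g. from a prior crawl pass.
    """
    if canonical_href is None:
        return "missing canonical tag"
    if urlparse(canonical_href).scheme == "http" and urlparse(page_url).scheme == "https":
        return "http canonical on https page"
    status = get_status(canonical_href)
    if status in (301, 302, 307, 308):
        return "canonical points to a redirect"
    if status == 404:
        return "canonical points to a non-existent page"
    return "ok"
```

Running this over every crawled page gives you the rows of the table above directly.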

Finding 2: [X]% Have Redirect Chains Longer Than 2 Hops

[INSERT PERCENTAGE] of sites had at least one redirect chain of 3 or more hops.

The worst offender had a [X]-hop chain from an internal link to the final destination.

Distribution of chain lengths:

| Chain Length | % of Sites Affected |
| --- | --- |
| 3 hops | [X]% |
| 4 hops | [X]% |
| 5+ hops | [X]% |

Each redirect adds latency and can dilute link equity. Google documents that Googlebot follows at most about 10 redirect hops, so the final destination of a long chain may never get crawled at all.

The fix: Audit your redirects and update each one to point directly to the final destination. In particular, update internal links to point to the current URL. Never link to a URL you know redirects.
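Collapsing chains is mechanical once you have a redirect map. A sketch (Python; the dict-of-redirects shape is my assumption): resolve every source URL to its final destination so internal links can be rewritten in one pass, and detect loops while you're at it:

```python
def resolve_chain(url, redirects, max_hops=10):
    """Follow a redirect map to its final destination.

    redirects: dict mapping each redirecting URL to its target.
    Returns (final_url, hop_count); raises on loops or chains
    longer than max_hops.
    """
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or chain longer than {max_hops} hops")
        seen.add(url)
    return url, hops
```

Any entry with a hop count above 1 is a chain worth flattening; any internal link whose target appears as a key in the map should be updated.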

Finding 3: [X]% Fail Core Web Vitals

[INSERT PERCENTAGE] of sites failed at least one Core Web Vital threshold on mobile.

| Metric | Failing Threshold | % Failing |
| --- | --- | --- |
| LCP (Largest Contentful Paint) | > 2.5s | [X]% |
| INP (Interaction to Next Paint) | > 200ms | [X]% |
| CLS (Cumulative Layout Shift) | > 0.1 | [X]% |

The most common LCP killer was unoptimized hero images. [X]% of sites served hero images larger than 500KB without modern formats (WebP/AVIF) or responsive sizing.
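A crawler can flag this hero-image pattern directly. A sketch (Python; the 500KB threshold mirrors the stat above, but the field names are hypothetical, not the study's schema):

```python
def flag_hero_image(img):
    """Flag LCP risks for an image described by a dict with keys:
    bytes (int), format (str), has_srcset (bool)."""
    problems = []
    if img["bytes"] > 500_000:
        problems.append("over 500KB")
    if img["format"].lower() not in ("webp", "avif"):
        problems.append("no modern format")
    if not img["has_srcset"]:
        problems.append("no responsive sizing")
    return problems
```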

Finding 4: [X]% Have Broken Structured Data

[INSERT PERCENTAGE] of sites with structured data (JSON-LD) had validation errors.

Most common issues:

  1. Missing required fields - e.g., Article schema without datePublished
  2. Invalid URL references - image field pointing to a 404
  3. Type mismatches - a string where an array is expected
  4. Deprecated schema types - Using schema.org types that Google no longer supports for rich results
```json
// Common mistake: missing required fields
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "My Blog Post"
  // Missing: author, datePublished, image
}
```
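Catching issue #1 needs nothing more than a per-type field list. A minimal sketch (Python; the field list here follows Google's Article rich-result guidance as I understand it, so verify against the current docs before relying on it):

```python
import json

# Fields Google's rich-result docs ask for per schema type (assumed list)
REQUIRED = {"Article": ["headline", "author", "datePublished", "image"]}

def missing_fields(jsonld_text):
    """Return the required fields absent from a JSON-LD block."""
    data = json.loads(jsonld_text)
    required = REQUIRED.get(data.get("@type"), [])
    return [f for f in required if f not in data]
```

Note that real JSON-LD can't contain `//` comments like the illustrative snippet above; `json.loads` will reject them.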

Finding 5: Only [X]% Have Proper Hreflang Configuration

Of the [X] sites in the sample that serve content in multiple languages, [INSERT PERCENTAGE] had hreflang issues.

The most devastating: [X]% had hreflang tags pointing to pages that returned 404s, effectively telling Google "this page exists in French" when it doesn't.
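That 404 variant is trivial to catch at crawl time. A sketch (Python; the data shapes are my own): given a page's hreflang entries and a status-code lookup, report every alternate that doesn't resolve:

```python
def broken_hreflang(entries, get_status):
    """Return hreflang alternates that don't return 200.

    entries: list of (lang, url) pairs from a page's hreflang tags.
    get_status(url) -> int is assumed to come from the crawl.
    """
    return [(lang, url) for lang, url in entries if get_status(url) != 200]
```

A fuller checker would also verify return links (each alternate must link back), but even this one-liner catches the "page exists in French" lie.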

Finding 6: [X]% Have Missing or Duplicate Meta Descriptions

This is the oldest SEO issue in the book, and it's still rampant.

| Issue | Prevalence |
| --- | --- |
| Missing meta description | [X]% |
| Duplicate meta description (shared across pages) | [X]% |
| Meta description over 160 chars (truncated in SERPs) | [X]% |
| Meta description under 50 chars (too short to be useful) | [X]% |
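All four rows reduce to two per-page length checks plus a cross-page duplicate count. A sketch (Python; the 50/160 thresholds mirror the table above):

```python
from collections import Counter

def classify_description(desc):
    """Bucket a single page's meta description."""
    if not desc:
        return "missing"
    if len(desc) > 160:
        return "too long (may truncate in SERPs)"
    if len(desc) < 50:
        return "too short"
    return "ok"

def duplicate_descriptions(pages):
    """pages: dict of url -> description.
    Returns descriptions shared by more than one page."""
    counts = Counter(d for d in pages.values() if d)
    return {d for d, n in counts.items() if n > 1}
```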

The Bigger Picture

The median site in this study had [X] technical SEO issues. The distribution was heavily right-skewed. A small number of sites were in excellent shape, while most had a long tail of problems.

Issues per site distribution:

  • 0-5 issues: [X]% of sites
  • 6-15 issues: [X]% of sites
  • 16-50 issues: [X]% of sites
  • 50+ issues: [X]% of sites

The good news: most of these are fixable in a weekend. The bad news: most teams don't know the issues exist because they're not running regular audits.

How to Audit Your Own Site

You can run these same checks yourself:

  1. Free route: Use Google Search Console + Lighthouse + Schema.org validator manually. Time cost: 2-4 hours per site.
  2. Script it: Write a crawler with Puppeteer or Playwright that hits each page and checks meta tags, links, and structured data. Time cost: a weekend to build, then automated.
  3. Tool route: SiteCrawlIQ runs all of these checks automatically and uses AI to prioritize the issues by estimated traffic impact. Full disclosure: I built it, and this study was run on it.
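For route 2, the per-page meta checks don't actually need a headless browser on static HTML; a stdlib parser covers them. A sketch in Python (the checks are illustrative, not the study's exact list):

```python
from html.parser import HTMLParser

class MetaAudit(HTMLParser):
    """Collect title, meta description, and canonical from one page."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") == "description":
            self.description = a.get("content")
        elif tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def audit_page(html):
    """Return a list of meta-tag issues for one page's HTML."""
    p = MetaAudit()
    p.feed(html)
    issues = []
    if not p.title:
        issues.append("missing <title>")
    if not p.description:
        issues.append("missing meta description")
    if not p.canonical:
        issues.append("missing canonical")
    return issues
```

You'd still want Puppeteer or Playwright for JavaScript-rendered pages and for the Lighthouse-based performance checks; this only handles the static-HTML portion.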

Whichever route you pick, the point is the same. Run the audit. The issues in this study are silent traffic killers that don't show up in analytics dashboards.
