Most SEO audit tools are fundamentally broken for modern web apps.
They analyze HTML that users never actually see.
If you're auditing a React or Next.js app by parsing HTML, you're not auditing the page. You're auditing a shell.
If your audit tool doesn't execute JavaScript, it's not auditing the page your users actually see.
_____________________________________________________________
The Core Problem With HTML Parsers
Modern sites render content in the browser.
Headings, metadata, structured data, and even core content often don't exist until JavaScript runs. An HTML parser never sees any of it.
This doesn't just affect SEO. It affects debugging, testing, and any tooling that depends on DOM accuracy.
We ran into this problem while building an internal audit tool, and it forced us to rethink the entire approach.
Instead of parsing HTML, we decided to render every page in a real browser using Puppeteer and headless Chromium.
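To make the gap concrete, here is an illustrative shell of the kind a parser-based tool receives from a client-rendered app (the markup and the check are hypothetical, not taken from our tool):

```javascript
// Illustrative only: the HTML shell a parser-based auditor receives
// from a typical client-rendered React app.
const shellHtml = `
<!DOCTYPE html>
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>`;

// A parser searches this string for headings and finds none,
// because the real content only exists after bundle.js executes.
const hasH1 = /<h1[\s>]/i.test(shellHtml);
console.log(hasH1); // false
```

A browser-based audit of the same page would see the fully hydrated DOM instead.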
_____________________________________________________________
Why Puppeteer
We evaluated a few options:
Playwright: excellent, but more than we needed for a single-browser target
Selenium: too much overhead, built for cross-browser testing rather than controlled auditing
Cheerio + axios: fast, but HTML-only, exactly what we were trying to avoid
Once we defined the requirement as “a real browser with a real DOM,” most options quickly dropped out.
What we needed was simple:
A predictable, scriptable Chrome environment that behaves like a real user (and close to Googlebot).
Puppeteer gave us:
Direct control over Chromium
A real DOM after rendering
A straightforward API for navigation and interaction
_____________________________________________________________
The Rendering Pipeline
Here’s a simplified version of our audit flow:
```javascript
const puppeteer = require('puppeteer');

async function auditPage(url) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();
  await page.setUserAgent(
    'Mozilla/5.0 (compatible; DeepAuditBot/1.0; https://axiondeepdigital.com)'
  );

  // Record every network request the page makes while rendering
  const resources = [];
  page.on('request', (req) => resources.push(req));

  await page.goto(url, {
    waitUntil: 'networkidle2',
    timeout: 30000,
  });

  // Trigger lazy-loaded content (autoScroll is defined below)
  await autoScroll(page);

  // Capture the DOM *after* JavaScript has run
  const dom = await page.evaluate(() => document.documentElement.outerHTML);

  await browser.close();
  return { dom, resources };
}
```
The key detail here is `waitUntil: 'networkidle2'`.
This tells Puppeteer to wait until there are no more than 2 in-flight network requests for at least 500ms.
In practice, that gives JavaScript time to execute and dynamic content time to load.
But we learned quickly:
networkidle2 is not a guarantee that a page is “done.”
Some apps keep background requests alive indefinitely. Others hydrate content after initial load.
We had to layer additional safeguards:
Hard timeouts
Scroll-based triggers
Fallback logic when pages never fully settle
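One way to combine those safeguards, sketched here with a hypothetical `withHardTimeout` helper (not our exact implementation): race the navigation against a hard deadline, and fall back to whatever has rendered so far.

```javascript
// Hypothetical helper: resolve with a fallback if `promise` doesn't
// settle within `ms`, so a page that never goes network-idle still
// yields a partial result instead of an error.
function withHardTimeout(promise, ms, fallbackValue) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ settled: false, value: fallbackValue }),
      ms
    );
  });
  return Promise.race([
    promise.then((value) => ({ settled: true, value })),
    deadline,
  ]).finally(() => clearTimeout(timer));
}
```

In the pipeline above, `page.goto(...)` would be the promise, and the fallback path would snapshot `document.documentElement.outerHTML` as-is.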
_____________________________________________________________
Handling Lazy-Loaded Content
Many sites rely on lazy loading.
To simulate real user behavior, we trigger it manually:
```javascript
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 200; // pixels per scroll step

      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;

        // Stop once we've scrolled the full height of the page
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
```
This triggers:
Intersection observers
Lazy-load listeners
Dynamic content loading
_____________________________________________________________
The Checks Architecture
Once rendering is solved, the problem shifts to analysis.
We structured our checks as independent modules:
Meta tags
Headings
Images
Performance
Structured data
Links
Each check returns a standardized result:
```javascript
{
  check: 'h1-presence',
  status: 'pass' | 'fail' | 'warning',
  message: 'H1 tag found: "Your Page Title"',
  impact: 'high' | 'medium' | 'low',
}
```
This made it easy to extend and maintain consistent results.
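As a sketch, a check module for `h1-presence` might look like this (hypothetical code following the result shape above; `page` only needs an `evaluate()` method, which makes the check easy to unit-test with a stub):

```javascript
// Hypothetical check module. In production `page` is a Puppeteer page;
// in tests, any object whose evaluate() returns an array of H1 texts works.
async function h1PresenceCheck(page) {
  const h1Texts = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h1'), (el) => el.textContent.trim())
  );

  if (h1Texts.length === 1) {
    return {
      check: 'h1-presence',
      status: 'pass',
      message: `H1 tag found: "${h1Texts[0]}"`,
      impact: 'high',
    };
  }

  return {
    check: 'h1-presence',
    status: h1Texts.length === 0 ? 'fail' : 'warning',
    message:
      h1Texts.length === 0
        ? 'No H1 tag found'
        : `Multiple H1 tags found (${h1Texts.length})`,
    impact: 'high',
  };
}
```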
_____________________________________________________________
The Challenges We Didn’t Anticipate
Timeout handling
Some pages are genuinely slow. We built graceful degradation so that a slow page returns partial results instead of failing entirely.
Bot detection
Some sites serve different content to headless browsers. We mitigated this by using realistic user agents and reducing headless fingerprints.
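One fingerprint worth knowing: headless Chrome exposes `navigator.webdriver === true`, which many detectors check. A hedged sketch of the kind of patch involved (the `hideWebdriver` helper is hypothetical; in Puppeteer the same override would run in the page context via `page.evaluateOnNewDocument` before navigation):

```javascript
// Hypothetical fingerprint-reduction helper: replace the webdriver
// property with a getter that reports undefined, as a real browser does.
function hideWebdriver(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined });
  return nav;
}

// In a Puppeteer pipeline, the equivalent would be (before page.goto):
// await page.evaluateOnNewDocument(() => {
//   Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
// });
```

This is one small piece of fingerprint reduction, not a complete evasion strategy.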
Single-page app routing
SPAs can behave unpredictably. We chose to audit only the exact URL provided rather than attempting to crawl.
Memory management
Chromium is heavy. We explicitly close pages, manage browser lifecycle, and run audits through a queue.
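A minimal version of such a queue (a sketch, not our production code) caps how many audits, and therefore how many Chromium instances, run at once:

```javascript
// Hypothetical FIFO queue that limits concurrent audit tasks.
class AuditQueue {
  constructor(concurrency = 2) {
    this.concurrency = concurrency;
    this.running = 0;
    this.waiting = [];
  }

  // Enqueue an async task; resolves with the task's result.
  run(task) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ task, resolve, reject });
      this._next();
    });
  }

  _next() {
    if (this.running >= this.concurrency || this.waiting.length === 0) return;
    this.running += 1;
    const { task, resolve, reject } = this.waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        this.running -= 1;
        this._next(); // pull the next waiting audit, if any
      });
  }
}
```

Each queued task would wrap one `auditPage(url)` call, so memory use stays bounded no matter how many audits arrive.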
_____________________________________________________________
What We’d Do Differently
If we were starting over:
We’d implement a browser pool from day one
We’d cache rendered DOM snapshots for repeat audits
Rendering is the most expensive part of the pipeline.
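A browser pool can be sketched in a few lines (hypothetical; the `launch` function is injectable, so it would be `puppeteer.launch` in production and a stub in tests):

```javascript
// Hypothetical minimal pool: reuse launched browsers instead of paying
// the Chromium startup cost on every audit.
class BrowserPool {
  constructor(launch) {
    this.launch = launch; // e.g. () => puppeteer.launch({ headless: true })
    this.idle = [];       // browsers waiting to be reused
    this.launched = 0;    // total instances ever created
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    this.launched += 1;
    return this.launch();
  }

  release(browser) {
    this.idle.push(browser); // keep it warm for the next audit
  }
}
```

A real pool would also cap its size, health-check idle instances, and recycle browsers after N audits to avoid memory creep.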
_____________________________________________________________
Final Thoughts
If you're building anything that depends on DOM accuracy:
Don’t trust raw HTML. Render the page.
Everything else is guesswork.
Rendering solved the core problem. Everything after that was trade-offs.
_____________________________________________________________
If you’re experimenting with headless browser pipelines, I’d be interested in how you're handling rendering and timing.
If you’re curious how this behaves on real sites, you can try it here: Axion Deep Digital Free SEO Scan
_____________________________________________________________
Crystal A. Gutierrez
Chairperson & Infrastructure Lead
Axion Deep Digital