Most SEO audit tools are fundamentally broken for modern web apps.
They analyze HTML that users never actually see.
If you're auditing a React or Next.js app by parsing HTML, you're not auditing the page. You're auditing a shell.
If your audit tool doesn't execute JavaScript, it's not auditing the page your users actually see.
_____________________________________________________________
The Core Problem With HTML Parsers
Modern sites render content in the browser.
Headings, metadata, structured data, and even core content often don't exist until JavaScript runs. An HTML parser never sees any of it.
This doesn't just affect SEO. It affects debugging, testing, and any tooling that depends on DOM accuracy.
We ran into this problem while building an internal audit tool, and it forced us to rethink the entire approach.
Instead of parsing HTML, we decided to render every page in a real browser using Puppeteer and headless Chromium.
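To make the gap concrete, here is an illustrative shell of the kind a parser-based tool receives from a client-rendered app (the markup and the check are hypothetical, not taken from our tool):

```javascript
// Illustrative only: the HTML shell a parser-based auditor receives
// from a typical client-rendered React app.
const shellHtml = `
<!DOCTYPE html>
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>`;

// A parser searches this string for headings and finds none,
// because the real content only exists after bundle.js executes.
const hasH1 = /<h1[\s>]/i.test(shellHtml);
console.log(hasH1); // false
```

A browser-based audit of the same page would see the fully hydrated DOM instead.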
_____________________________________________________________
Why Puppeteer
We evaluated a few options:
Playwright: excellent, but more than we needed for a single-browser target
Selenium: too much overhead, built for cross-browser testing rather than controlled auditing
Cheerio + axios: fast, but HTML-only, exactly what we were trying to avoid
Once we defined the requirement as “a real browser with a real DOM,” most options quickly dropped out.
What we needed was simple:
A predictable, scriptable Chrome environment that behaves like a real user (and close to Googlebot).
Puppeteer gave us:
Direct control over Chromium
A real DOM after rendering
A straightforward API for navigation and interaction
_____________________________________________________________
The Rendering Pipeline
Here’s a simplified version of our audit flow:
```javascript
const puppeteer = require('puppeteer');

async function auditPage(url) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();
  await page.setUserAgent(
    'Mozilla/5.0 (compatible; DeepAuditBot/1.0; https://axiondeepdigital.com)'
  );

  // Record every network request the page makes while rendering
  const resources = [];
  page.on('request', (req) => resources.push(req));

  await page.goto(url, {
    waitUntil: 'networkidle2',
    timeout: 30000,
  });

  // Trigger lazy-loaded content (autoScroll is defined below)
  await autoScroll(page);

  // Capture the DOM *after* JavaScript has run
  const dom = await page.evaluate(() => document.documentElement.outerHTML);

  await browser.close();
  return { dom, resources };
}
```
The key detail here is `waitUntil: 'networkidle2'`.
This tells Puppeteer to wait until there are no more than 2 in-flight network requests for at least 500ms.
In practice, that gives JavaScript time to execute and dynamic content time to load.
But we learned quickly:
networkidle2 is not a guarantee that a page is “done.”
Some apps keep background requests alive indefinitely. Others hydrate content after initial load.
We had to layer additional safeguards:
Hard timeouts
Scroll-based triggers
Fallback logic when pages never fully settle
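One way to combine those safeguards, sketched here with a hypothetical `withHardTimeout` helper (not our exact implementation): race the navigation against a hard deadline, and fall back to whatever has rendered so far.

```javascript
// Hypothetical helper: resolve with a fallback if `promise` doesn't
// settle within `ms`, so a page that never goes network-idle still
// yields a partial result instead of an error.
function withHardTimeout(promise, ms, fallbackValue) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ settled: false, value: fallbackValue }),
      ms
    );
  });
  return Promise.race([
    promise.then((value) => ({ settled: true, value })),
    deadline,
  ]).finally(() => clearTimeout(timer));
}
```

In the pipeline above, `page.goto(...)` would be the promise, and the fallback path would snapshot `document.documentElement.outerHTML` as-is.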
_____________________________________________________________
Handling Lazy-Loaded Content
Many sites rely on lazy loading.
To simulate real user behavior, we trigger it manually:
```javascript
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 200; // pixels per scroll step

      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;

        // Stop once we've scrolled the full height of the page
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
```
This triggers:
Intersection observers
Lazy-load listeners
Dynamic content loading
_____________________________________________________________
The Checks Architecture
Once rendering is solved, the problem shifts to analysis.
We structured our checks as independent modules:
Meta tags
Headings
Images
Performance
Structured data
Links
Each check returns a standardized result:
```javascript
{
  check: 'h1-presence',
  status: 'pass' | 'fail' | 'warning',
  message: 'H1 tag found: "Your Page Title"',
  impact: 'high' | 'medium' | 'low',
}
```
This made it easy to extend and maintain consistent results.
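As a sketch, a check module for `h1-presence` might look like this (hypothetical code following the result shape above; `page` only needs an `evaluate()` method, which makes the check easy to unit-test with a stub):

```javascript
// Hypothetical check module. In production `page` is a Puppeteer page;
// in tests, any object whose evaluate() returns an array of H1 texts works.
async function h1PresenceCheck(page) {
  const h1Texts = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h1'), (el) => el.textContent.trim())
  );

  if (h1Texts.length === 1) {
    return {
      check: 'h1-presence',
      status: 'pass',
      message: `H1 tag found: "${h1Texts[0]}"`,
      impact: 'high',
    };
  }

  return {
    check: 'h1-presence',
    status: h1Texts.length === 0 ? 'fail' : 'warning',
    message:
      h1Texts.length === 0
        ? 'No H1 tag found'
        : `Multiple H1 tags found (${h1Texts.length})`,
    impact: 'high',
  };
}
```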
_____________________________________________________________
The Challenges We Didn’t Anticipate
Timeout handling
Some pages are genuinely slow. We built graceful degradation so that a slow page returns partial results instead of failing entirely.
Bot detection
Some sites serve different content to headless browsers. We mitigated this by using realistic user agents and reducing headless fingerprints.
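One fingerprint worth knowing: headless Chrome exposes `navigator.webdriver === true`, which many detectors check. A hedged sketch of the kind of patch involved (the `hideWebdriver` helper is hypothetical; in Puppeteer the same override would run in the page context via `page.evaluateOnNewDocument` before navigation):

```javascript
// Hypothetical fingerprint-reduction helper: replace the webdriver
// property with a getter that reports undefined, as a real browser does.
function hideWebdriver(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined });
  return nav;
}

// In a Puppeteer pipeline, the equivalent would be (before page.goto):
// await page.evaluateOnNewDocument(() => {
//   Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
// });
```

This is one small piece of fingerprint reduction, not a complete evasion strategy.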
Single-page app routing
SPAs can behave unpredictably. We chose to audit only the exact URL provided rather than attempting to crawl.
Memory management
Chromium is heavy. We explicitly close pages, manage browser lifecycle, and run audits through a queue.
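A minimal version of such a queue (a sketch, not our production code) caps how many audits, and therefore how many Chromium instances, run at once:

```javascript
// Hypothetical FIFO queue that limits concurrent audit tasks.
class AuditQueue {
  constructor(concurrency = 2) {
    this.concurrency = concurrency;
    this.running = 0;
    this.waiting = [];
  }

  // Enqueue an async task; resolves with the task's result.
  run(task) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ task, resolve, reject });
      this._next();
    });
  }

  _next() {
    if (this.running >= this.concurrency || this.waiting.length === 0) return;
    this.running += 1;
    const { task, resolve, reject } = this.waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        this.running -= 1;
        this._next(); // pull the next waiting audit, if any
      });
  }
}
```

Each queued task would wrap one `auditPage(url)` call, so memory use stays bounded no matter how many audits arrive.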
_____________________________________________________________
What We’d Do Differently
If we were starting over:
We’d implement a browser pool from day one
We’d cache rendered DOM snapshots for repeat audits
Rendering is the most expensive part of the pipeline.
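A browser pool can be sketched in a few lines (hypothetical; the `launch` function is injectable, so it would be `puppeteer.launch` in production and a stub in tests):

```javascript
// Hypothetical minimal pool: reuse launched browsers instead of paying
// the Chromium startup cost on every audit.
class BrowserPool {
  constructor(launch) {
    this.launch = launch; // e.g. () => puppeteer.launch({ headless: true })
    this.idle = [];       // browsers waiting to be reused
    this.launched = 0;    // total instances ever created
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    this.launched += 1;
    return this.launch();
  }

  release(browser) {
    this.idle.push(browser); // keep it warm for the next audit
  }
}
```

A real pool would also cap its size, health-check idle instances, and recycle browsers after N audits to avoid memory creep.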
_____________________________________________________________
Final Thoughts
If you're building anything that depends on DOM accuracy:
Don’t trust raw HTML. Render the page.
Everything else is guesswork.
Rendering solved the core problem. Everything after that was trade-offs.
_____________________________________________________________
If you’re experimenting with headless browser pipelines, I’d be interested in how you're handling rendering and timing.
If you’re curious how this behaves on real sites, you can try it here: Axion Deep Digital Free SEO Scan
_____________________________________________________________
Crystal A. Gutierrez
Chairperson & Infrastructure Lead
Axion Deep Digital