I built a 12-module website audit engine that cross-references visibility with security

#webdev #security #react #devex

I have been developing and testing this for months. The engine is Node.js and TypeScript on Railway. The frontend is React on Vercel. The scan runs 12 modules in parallel via Promise.all and completes in 5 to 15 seconds.
I am going to walk through the architecture, the scoring methodology, and the one design decision that changed the way I think about website audits.
The problem
I audit websites for clients. Every audit required at least four tools: one for SEO basics, one for structured data validation, one for security headers, one for SSL checks. And the newest layer, how AI models discover, chunk, and cite your content, had no tooling at all.
None of these tools cross-referenced their findings. A site could pass every individual check and still have a critical gap that only surfaces when you map the data together.
The architecture
Twelve modules, each returning a standardized JSON block:

1. DNS Resolution (Google Public DNS API)
2. TLS and Certificate Validation
3. Security Header Scan (6 headers)
4. HTML Structure Parse (H1, meta, canonical, title)
5. JSON-LD Schema Extraction and Validation
6. Q&A Content Density Analysis
7. GEO Chunking and Citation Measurement
8. robots.txt AI Crawl Policy Classification
9. Exposed Endpoint Detection (12 paths, false positive filtering)
10. Internal Link Depth Sampling
11. Vulnerability Indicator Scan
12. Content Provenance Check

All twelve run concurrently via Promise.all. The results are assembled into a unified schema with two branches: visibility_canopy (SEO, plus AEO and GEO, i.e. answer engine and generative engine optimization) and security_roots (TLS, headers, endpoints, AI crawl risk).
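Here is a minimal sketch of that orchestration, not the engine's actual code. The branch names come from the schema above; everything else (the ModuleResult fields, the status values, the AuditModule interface, the function names) is a hypothetical shape chosen for illustration. One thing worth showing: Promise.all rejects as soon as any promise rejects, so in this sketch each module turns its own failure into an "error" result rather than sinking the whole scan.

```typescript
// Hypothetical shapes: the post names the two branches, not the fields inside them.
type Branch = "visibility_canopy" | "security_roots";
type Status = "pass" | "warn" | "fail" | "error";

interface ModuleResult {
  module: string;
  branch: Branch;
  status: Status;
  findings: string[];
}

interface AuditModule {
  name: string;
  branch: Branch;
  run: (url: string) => Promise<{ status: Status; findings: string[] }>;
}

type Report = Record<Branch, ModuleResult[]>;

// Group the flat list of module results under the two top-level branches.
function assembleReport(results: ModuleResult[]): Report {
  const report: Report = { visibility_canopy: [], security_roots: [] };
  for (const result of results) {
    report[result.branch].push(result);
  }
  return report;
}

// Run every module concurrently and wait for all of them to settle.
async function runAudit(url: string, modules: AuditModule[]): Promise<Report> {
  const results = await Promise.all(
    modules.map(async (m): Promise<ModuleResult> => {
      try {
        const { status, findings } = await m.run(url);
        return { module: m.name, branch: m.branch, status, findings };
      } catch (err) {
        // A thrown error becomes a result, so one bad module cannot reject the batch.
        return { module: m.name, branch: m.branch, status: "error", findings: [String(err)] };
      }
    })
  );
  return assembleReport(results);
}
```

Because every module runs at once, total scan time is roughly the time of the slowest module, not the sum of all twelve.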
The false positive problem
Module 9 (exposed endpoints) was generating false positives on SPA sites. A React app on Vercel returns 200 for every path because the catch-all serves index.html for client-side routing. So /.env, /.git/config, and /wp-config.php.bak all came back as "exposed."
The fix uses three-layer detection. First, the engine fetches a guaranteed-nonsense path (e.g., /canopyguard-probe-{timestamp}) to detect catch-all behavior. Then every subsequent path check compares the response body length against both the homepage and the nonsense page. If the body is within 10% of either, it is the same catch-all page and gets filtered out. There is also a content-type check: if /.env returns text/html, it is clearly the SPA serving its shell, not an actual exposed environment file.
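The decision logic can be sketched as a pure function over already-fetched responses. This is an illustration under assumptions, not the production code: the ResponseSnapshot type, the verdict names, and the function names are all hypothetical, and "within 10%" is read here as a relative difference against the reference body length.

```typescript
// Hypothetical snapshot of a fetched response; only the fields the checks need.
interface ResponseSnapshot {
  status: number;
  contentType: string; // raw Content-Type header, e.g. "text/html; charset=utf-8"
  bodyLength: number;  // body size in bytes
}

type Verdict = "exposed" | "filtered" | "not_served";

const LENGTH_TOLERANCE = 0.1; // the 10% threshold described above

// Layer 1: build the nonsense path whose response reveals catch-all routing.
function buildProbePath(timestamp: number = Date.now()): string {
  return `/canopyguard-probe-${timestamp}`;
}

// True when `length` is within the tolerance of `reference` (relative difference).
function isSimilarLength(length: number, reference: number): boolean {
  if (reference === 0) return length === 0;
  return Math.abs(length - reference) / reference <= LENGTH_TOLERANCE;
}

function classifyPath(
  candidate: ResponseSnapshot,
  homepage: ResponseSnapshot,
  probe: ResponseSnapshot
): Verdict {
  if (candidate.status !== 200) return "not_served";

  // Layer 2: a body about the same size as the homepage or the probe page
  // is treated as the same catch-all shell.
  if (
    isSimilarLength(candidate.bodyLength, homepage.bodyLength) ||
    isSimilarLength(candidate.bodyLength, probe.bodyLength)
  ) {
    return "filtered";
  }

  // Layer 3: a file like /.env served as text/html is the SPA shell, not the file.
  if (candidate.contentType.toLowerCase().startsWith("text/html")) {
    return "filtered";
  }

  return "exposed";
}
```

Keeping the classification separate from the network calls makes it easy to unit test with canned responses, which matters for a check where both false positives and false negatives are costly.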
Cross-Reference Intelligence
This is the design decision that changed the tool. Instead of just scoring each layer independently, the engine maps visibility data against security data to surface compound gaps.
Example: robots.txt policy is PERMISSIVE (allows all crawlers) and llms.txt status is MISSING (no citation guidance). An SEO tool says the robots.txt is valid. A security scanner says there is no vulnerability. But the cross-reference reveals the actual problem: AI models have full access to scrape your content with zero instructions on how to attribute it.
This layer is qualitative, not scored numerically. It only fires when two conditions from different layers combine to create a gap.
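One way to picture this layer is a small rule table where each rule looks at signals from both branches and either fires or stays silent. The sketch below is an assumption about shape, not the engine's real rule set: PERMISSIVE and MISSING come from the example above, while the other enum values, the Signals type, and the rule id are hypothetical.

```typescript
// PERMISSIVE and MISSING come from the example above; the other values are assumed.
type RobotsPolicy = "PERMISSIVE" | "SELECTIVE" | "RESTRICTIVE";
type LlmsTxtStatus = "PRESENT" | "MISSING";

// Hypothetical input: one signal from each side of the report.
interface Signals {
  robotsPolicy: RobotsPolicy;
  llmsTxt: LlmsTxtStatus;
}

interface CrossRule {
  id: string;
  matches: (signals: Signals) => boolean;
  message: string;
}

const RULES: CrossRule[] = [
  {
    id: "open-crawl-no-citation-guidance",
    // Fires only when both conditions hold; either one alone is not a finding.
    matches: (s) => s.robotsPolicy === "PERMISSIVE" && s.llmsTxt === "MISSING",
    message:
      "AI crawlers have full access to the content, with no llms.txt guidance on how to attribute it.",
  },
];

// Returns qualitative findings only; nothing here contributes to a numeric score.
function crossReference(signals: Signals): string[] {
  return RULES.filter((rule) => rule.matches(signals)).map((rule) => rule.message);
}
```

For example, crossReference({ robotsPolicy: "PERMISSIVE", llmsTxt: "MISSING" }) returns the single message above, while the same policy with llms.txt present returns an empty list.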
Copy-pasteable fix snippets
Every failing check in the report has a FIX button that drops the exact code to resolve it. Security headers show tabbed snippets for Nginx, Apache, Vercel, and Cloudflare. Schema markup shows complete JSON-LD templates. The llms.txt snippet generates a complete starter file.
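As a rough illustration of the tabbed header snippets, here is a sketch that renders one header fix for three of the four platforms. The function and type names are hypothetical, X-Content-Type-Options is only an example header (this post does not list which six are checked), and Cloudflare is left out because the right syntax depends on which Cloudflare product serves the site. The output uses the standard Nginx add_header directive, the Apache mod_headers Header directive, and the headers array in vercel.json.

```typescript
type Platform = "nginx" | "apache" | "vercel";

// Hypothetical description of a single missing header and its recommended value.
interface HeaderFix {
  name: string;
  value: string;
}

const RENDERERS: Record<Platform, (fix: HeaderFix) => string> = {
  // Nginx: "always" also sends the header on error responses.
  nginx: (fix) => `add_header ${fix.name} "${fix.value}" always;`,
  // Apache: requires mod_headers to be enabled.
  apache: (fix) => `Header always set ${fix.name} "${fix.value}"`,
  // Vercel: a vercel.json fragment applying the header to every route.
  vercel: (fix) =>
    JSON.stringify(
      { headers: [{ source: "/(.*)", headers: [{ key: fix.name, value: fix.value }] }] },
      null,
      2
    ),
};

// Note: values containing double quotes would need escaping for Nginx and Apache.
function renderHeaderFix(platform: Platform, fix: HeaderFix): string {
  return RENDERERS[platform](fix);
}
```

Calling renderHeaderFix("nginx", { name: "X-Content-Type-Options", value: "nosniff" }) yields a line ready to paste into a server block.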
I built this because the most common response I got to audit reports was "great, but how do I fix it?" Now the answer is right next to the finding.
The scoring methodology
The full methodology is published openly on the methodology page: every weight, every signal, every module. I published it because if you are going to define a standard for AEO and GEO scoring, it needs to be verifiable and challengeable.
What I would do differently
If I were starting over, I would add a headless browser module (Playwright) for JavaScript-rendered sites. The current HTML parser uses server-side fetch, which misses content rendered client-side. That is the biggest gap in the current scan accuracy.
I would also add competitor comparison: scan two domains side by side and diff the results.
Try it
Free, no signup: thecanopyguard.com
The code is not open source yet, but I am considering it. Would love feedback on the scoring methodology, especially the GEO layer.
Adam McClarin, CISSP
Meraki is Love Digital | Soulful Tech
