Julian Ruenes

I Built a Free AI Readiness Scanner — Here's the Architecture Behind It

AI search is eating traditional search. If you work in web development, you've already felt it. ChatGPT, Perplexity, Google AI Overviews — they're pulling answers directly from websites and serving them to users who never click through.

For businesses, this is a massive shift. For developers, it's a new set of problems to solve.

I run a web design agency called Studio Web in Arlington, VA. We kept getting the same question from clients: "Is my website showing up in AI search results?" There was no simple tool to answer that. So I built one.

The AI Readiness Scanner is a free, client-side tool that audits any website for AI search visibility. No signup, no backend, no cost. Here's how it works under the hood.


Why Single-File HTML

This was a deliberate constraint from day one. No React. No Next.js. No build step. No server.

The reasons were practical:

  • Hosting simplicity. The tool lives as a single .html file on our static site. No deployment pipeline, no CI/CD, no server costs.
  • Zero dependencies to maintain. No node_modules, no version conflicts, no supply chain risk.
  • Speed. The file loads in under a second. There's nothing to hydrate, nothing to fetch before the tool is interactive.
  • Portability. Anyone can fork it, download it, or embed it.

The tradeoff is obvious — you lose component reusability, state management, and the ergonomics of a modern framework. For a single-purpose tool, that tradeoff was worth it.


Data Sources

The scanner pulls data from multiple sources to build its score.

Google PageSpeed Insights API

This is the backbone. The PageSpeed Insights API (which wraps Lighthouse) is free, allows 25,000 queries per day with an API key, and returns a treasure trove of data beyond just speed metrics.

From a single API call, I extract:

  • Performance, accessibility, best practices, and SEO scores
  • Whether the site has structured data (and what types)
  • Mobile-friendliness signals
  • HTTPS status
  • Meta tag analysis
  • Crawlability indicators

The API response is massive — often 200KB+ of JSON. I only use about 15% of it, but that 15% covers the foundation of the audit.
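To make that concrete, here's roughly what the extraction looks like. The field paths follow the PSI v5 / Lighthouse response shape; the function name and the specific audits shown are illustrative, since the real tool reads considerably more of the payload.

```javascript
// Pull the handful of fields the scanner needs out of a PageSpeed
// Insights v5 response. Category scores arrive as 0-1 floats, so we
// convert them to 0-100 integers.
function extractAuditData(psiResponse) {
    const lh = psiResponse.lighthouseResult;
    const cat = lh.categories;
    const pct = c => c ? Math.round(c.score * 100) : null;

    return {
        performance: pct(cat.performance),
        accessibility: pct(cat.accessibility),
        bestPractices: pct(cat['best-practices']),
        seo: pct(cat.seo),
        https: lh.audits['is-on-https']?.score === 1,
        viewport: lh.audits['viewport']?.score === 1  // mobile-friendliness signal
    };
}
```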

Client-Side Fetches

For AI-specific checks, I need data the PageSpeed API doesn't provide:

  • robots.txt — Does the site block AI crawlers like GPTBot, ClaudeBot, or Google-Extended?
  • llms.txt — Does the site have the emerging llms.txt standard file?
  • Sitemap — Is there an XML sitemap, and does it actually contain URLs?

These require fetching files directly from the target domain — which brings us to the hardest part of the entire build.
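The sitemap check rides on that same proxied fetch. Once the XML is in hand, a light regex pass is enough to tell an empty sitemap from a populated one; a full XML parse would be overkill for a yes/no signal. The helper name here is hypothetical:

```javascript
// Count the URLs a sitemap actually lists. We only need "does it exist
// and is it non-empty", so a <loc> regex is sufficient; the raw XML
// would come from fetching /sitemap.xml through the proxy chain.
function countSitemapUrls(xml) {
    if (!xml) return 0;
    const locs = xml.match(/<loc>\s*([^<]+?)\s*<\/loc>/gi);
    return locs ? locs.length : 0;
}
```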


The CORS Problem (and the Proxy Fallback Chain)

If you've built anything client-side that fetches from external domains, you already know where this is going.

Browsers block cross-origin requests unless the target server sends the right CORS headers. Most business websites don't. So fetching example.com/robots.txt from my client-side tool gets blocked immediately.

My solution: a fallback chain of four CORS proxies.

```javascript
const CORS_PROXIES = [
    'https://api.allorigins.win/raw?url=',
    'https://corsproxy.io/?',
    'https://api.codetabs.com/v1/proxy?quest=',
    'https://cors-anywhere.herokuapp.com/'
];

async function fetchWithProxy(url) {
    // Try direct fetch first (works if target has CORS headers)
    try {
        const response = await fetch(url, {
            signal: AbortSignal.timeout(5000)
        });
        if (response.ok) return await response.text();
    } catch (e) {
        // Expected — most sites block cross-origin
    }

    // Try each proxy in order
    for (const proxy of CORS_PROXIES) {
        try {
            const response = await fetch(
                proxy + encodeURIComponent(url),
                { signal: AbortSignal.timeout(8000) }
            );
            if (response.ok) return await response.text();
        } catch (e) {
            continue; // Try next proxy
        }
    }

    return null; // All proxies failed
}
```

The chain tries a direct fetch first (surprisingly, some sites do have permissive CORS). If that fails, it cycles through four public proxies. Each attempt has a timeout — 5 seconds for direct, 8 seconds for proxied.

In practice, allorigins.win handles about 70% of successful proxied requests. The others are fallbacks for when it's down or rate-limited.

The obvious limitation: public CORS proxies are unreliable. They go down, they rate-limit, they occasionally return garbage. For v2, I'd run my own lightweight proxy — but for a free tool with zero revenue, the public proxies work well enough.


Parsing robots.txt for AI Crawlers

This is where it gets interesting. Traditional robots.txt parsing just looks for Googlebot or *. Now we need to check for a growing list of AI-specific user agents.

```javascript
const AI_CRAWLERS = [
    'GPTBot',
    'ChatGPT-User',
    'Google-Extended',
    'ClaudeBot',
    'Anthropic',
    'PerplexityBot',
    'Cohere-ai',
    'Bytespider',
    'CCBot'
];

function parseRobotsTxt(content) {
    const lines = content.split('\n');
    const results = {};
    let currentAgent = null;

    for (const line of lines) {
        const trimmed = line.trim();

        // Skip comments and empty lines
        if (trimmed.startsWith('#') || trimmed === '') continue;

        const agentMatch = trimmed.match(/^User-agent:\s*(.+)/i);
        if (agentMatch) {
            currentAgent = agentMatch[1].trim();
            continue;
        }

        const disallowMatch = trimmed.match(/^Disallow:\s*(.*)/i);
        if (disallowMatch && currentAgent) {
            const path = disallowMatch[1].trim();

            // An empty Disallow means "allow everything", so only
            // "Disallow: /" counts as a full-site block
            if (path === '/') {
                const isAiCrawler = AI_CRAWLERS.some(
                    crawler => currentAgent.toLowerCase() === crawler.toLowerCase()
                );

                if (isAiCrawler) {
                    results[currentAgent] = 'blocked';
                } else if (currentAgent === '*') {
                    // Wildcard block — affects all AI crawlers
                    // unless they have specific allow rules
                    results['*'] = 'blocked';
                }
            }
        }
    }

    return results;
}
```

The parser handles the key scenarios: explicit AI bot blocks, wildcard blocks that catch everything, and the absence of any mention (which defaults to "allowed"). It doesn't handle every edge case in the robots.txt spec — directive priority, pattern matching with wildcards — but for an audit tool, "is this bot blocked or not" covers 95% of what matters.
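To turn the parser's output into a per-bot verdict, a small helper (a hypothetical name, not code from the shipped tool) can apply the precedence just described: an explicit block wins, a wildcard block applies to any bot without its own entry, and no mention defaults to allowed.

```javascript
// Resolve what parseRobotsTxt's results mean for each AI crawler.
// Explicit per-bot block > wildcard block > default allow.
function resolveCrawlerStatus(results, crawlers) {
    const statuses = {};
    for (const bot of crawlers) {
        if (results[bot] === 'blocked') {
            statuses[bot] = 'blocked';
        } else if (results['*'] === 'blocked') {
            statuses[bot] = 'blocked-by-wildcard';
        } else {
            statuses[bot] = 'allowed';
        }
    }
    return statuses;
}
```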


The llms.txt Checker

The llms.txt standard is still emerging, but it's gaining traction. The check itself is simple — does the file exist at the domain root? But I also evaluate the quality of the content.

```javascript
async function checkLlmsTxt(domain) {
    const url = `https://${domain}/llms.txt`;
    const content = await fetchWithProxy(url);

    if (!content) {
        return { exists: false, score: 0 };
    }

    let qualityScore = 40; // Base score for having the file at all

    // Check content length (too short = not useful)
    if (content.length > 200) qualityScore += 15;
    if (content.length > 500) qualityScore += 15;

    // Check for key business information
    const hasLocation = /address|location|based in|located/i.test(content);
    const hasServices = /services|offer|provide|specialize/i.test(content);
    const hasContact = /contact|email|phone|call/i.test(content);

    if (hasLocation) qualityScore += 10;
    if (hasServices) qualityScore += 10;
    if (hasContact) qualityScore += 10;

    return {
        exists: true,
        score: Math.min(qualityScore, 100),
        length: content.length
    };
}
```

A file that just says "Welcome to our website" gets a low quality score. A file with detailed business information, services, and location data gets a high one.
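For reference, a file that would score well under these checks looks something like this. The business is made up, and the structure (an H1, a blockquote summary, then sections) follows the shape the llms.txt proposal suggests:

```markdown
# Acme Plumbing

> Acme Plumbing is a residential plumbing company located in Arlington, VA.
> We provide emergency repairs, water heater installation, and drain services.

## Contact

- Email: hello@acmeplumbing.example
- Phone: (555) 010-0199
```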


The Scoring Algorithm

The final score is a weighted composite across six categories:

  • Structured Data (25%): schema markup presence and completeness
  • AI Crawler Access (20%): robots.txt rules for AI-specific bots
  • Content Structure (20%): heading hierarchy, FAQ sections, meta tags
  • Technical Foundation (15%): speed, HTTPS, mobile-friendliness
  • AI-Specific Files (10%): llms.txt, sitemap.xml
  • Citability (10%): FAQ schema, clear answers, data-rich content

Each category scores 0-100 individually. The weighted average produces the final score.
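The composite itself is only a few lines. The key names here are illustrative, but the weights match the breakdown above:

```javascript
// Category weights; they sum to 1.0, so a site scoring 100 in every
// category gets a final score of 100.
const WEIGHTS = {
    structuredData: 0.25,
    aiCrawlerAccess: 0.20,
    contentStructure: 0.20,
    technicalFoundation: 0.15,
    aiFiles: 0.10,
    citability: 0.10
};

// Weighted average of per-category 0-100 scores; missing categories
// count as zero.
function computeFinalScore(categoryScores) {
    let total = 0;
    for (const [category, weight] of Object.entries(WEIGHTS)) {
        total += (categoryScores[category] ?? 0) * weight;
    }
    return Math.round(total);
}
```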

The weights reflect what actually correlates with AI search visibility based on what we've observed working with clients at Studio Web. Structured data gets the highest weight because it's the single biggest differentiator — sites with proper schema markup score 3.2x higher on average (we validated this across 127 sites using both this tool and our Local SEO Score Checker).


Demo Mode Fallback

Public CORS proxies fail. APIs have rate limits. Sometimes the target site is just unreachable. Rather than showing users an error screen, the scanner falls back to a demo mode.

```javascript
async function runScan(domain) {
    try {
        const results = await performLiveAnalysis(domain);
        displayResults(results, { isDemo: false });
    } catch (error) {
        console.warn('Live scan failed, using demo mode:', error);
        const demoResults = generateDemoResults(domain);
        displayResults(demoResults, { isDemo: true });
    }
}
```

Demo mode generates plausible (but clearly labeled) results based on industry averages. A banner at the top tells the user they're seeing estimated data, not live results, with a "try again" button.
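One way a generator like that can stay honest is to be deterministic: seed the pseudo-random scores from the domain name so a repeat scan shows the same numbers. This is a sketch with made-up ranges, not the tool's actual values:

```javascript
// Deterministic "plausible" demo results seeded from the domain, so the
// same domain always produces the same demo numbers. Ranges are rough
// illustrative bands, not real industry data.
function generateDemoResults(domain) {
    // Cheap string hash for a stable seed
    let seed = 0;
    for (const ch of domain) seed = (seed * 31 + ch.charCodeAt(0)) >>> 0;

    // Linear congruential step; returns a value in [lo, hi)
    const between = (lo, hi) =>
        lo + (seed = (seed * 1103515245 + 12345) >>> 0) % (hi - lo);

    return {
        isDemo: true,
        structuredData: between(20, 60),
        aiCrawlerAccess: between(50, 90),
        contentStructure: between(30, 70)
    };
}
```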

This was a UX decision more than a technical one. A broken tool with an error message gets closed. A tool that shows useful (if estimated) information keeps people engaged and lets them understand what the scanner checks.


What I'd Do Differently in v2

Run my own CORS proxy. A simple Cloudflare Worker that proxies requests would eliminate the biggest reliability issue. Cost: basically zero on the free tier.
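A sketch of what that Worker could look like: validate the ?url= parameter, fetch the target server-side, and re-serve the body with a permissive CORS header. This is untested scaffolding, not production code; in a real Worker you'd wire handleRequest up via module syntax (export default { fetch: handleRequest }).

```javascript
// Parse and validate the ?url= query parameter from the incoming
// request URL. Returns a normalized http(s) URL, or null if the
// parameter is missing or uses a disallowed scheme.
function parseTarget(requestUrl) {
    const target = new URL(requestUrl).searchParams.get('url');
    if (!target) return null;
    try {
        const parsed = new URL(target);
        // Only proxy http(s) targets; reject file:, ftp:, etc.
        return ['http:', 'https:'].includes(parsed.protocol) ? parsed.href : null;
    } catch {
        return null;
    }
}

// The Worker handler: fetch the validated target and re-serve it
// with an open CORS header so the client-side tool can read it.
async function handleRequest(request) {
    const target = parseTarget(request.url);
    if (!target) return new Response('Missing or invalid ?url=', { status: 400 });

    const upstream = await fetch(target);
    return new Response(upstream.body, {
        status: upstream.status,
        headers: { 'Access-Control-Allow-Origin': '*' }
    });
}
```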

Add historical tracking. Right now each scan is a snapshot. I'd like users to scan weekly and see a trend line. This means some kind of storage — probably a lightweight backend with a database, which breaks the single-file constraint but adds real value.

Expand the AI crawler list dynamically. New AI bots appear every month. Hard-coding the list means constant updates. I'd pull from a maintained registry instead.

Add competitive benchmarking. "Your score is 34" is useful. "Your score is 34, and the average in your industry is 52" is actionable.


Try It Yourself

The tool is live and free:

AI Readiness Scanner — Enter any domain and get a full AI readiness audit in under 30 seconds.

If you're more interested in local search specifically, we also built a Local SEO Score Checker that evaluates traditional local SEO signals.

Both are free, no signup, no data collection.

If you build something similar — or fork this approach — I'd love to hear about it. I'm Julian, and I run Studio Web, a web design agency in Arlington, VA. You can find me in the comments or through the site.
