Cihangir Bozdogan

AI Search Crawlers Are Curl From 1998, Not Chrome. Your SPA Is Invisible and Here Is the Mechanism.

There is a meeting that happens on almost every web team I have audited where someone says "we render fine, our React app renders fine for crawlers." Most of the time this is wrong. The kind of crawler the team has in mind is Googlebot, which has shipped a JavaScript-rendering second pass since around 2019. The kind of crawler that actually decides whether a site appears in ChatGPT, Claude, Perplexity, or any of the AI search products is not Googlebot. It is GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot, and these crawlers behave like curl from 1998. They issue an HTTP GET, parse the HTML they get back, and that is the entire interaction. JavaScript is not executed. Hydration does not happen. The single-page-app shell with <div id="root"></div> is what gets indexed.

I worked through this the slow way. Set up six identical-looking sites: a Next.js SSR build, a Next.js SSG build, a Vite SPA with no SSR, a Remix SSR build, a static HTML page, and a hybrid where above-the-fold content is server-rendered and below-the-fold content lazy-loads. Pointed each AI search platform at each. Watched what they could and could not see. The result is not subtle: the SPA with no SSR is functionally invisible to half the AI ecosystem, and the lazy-loaded content is invisible to all of it.

This post is the field report and the mechanism. The two-mode spectrum AI crawlers occupy. What each of the six relevant crawlers actually does on the wire. The hydration cliff that decides what the model sees. The five failure modes I now flag in technical audits. And the patterns that work: the SSR-or-SSG sweet spot that costs almost nothing to ship and changes whether the AI ecosystem can see you at all.

The Two-Mode Spectrum

Web crawlers exist on a spectrum from "raw HTTP fetcher" to "full headless browser with JavaScript execution." The endpoints of the spectrum are not philosophical. They cost different amounts of money to operate and they produce different views of the page.

Raw HTTP fetcher. A program that issues an HTTP GET, receives the response, parses HTML, follows links via the parsed href attributes. No JavaScript is executed. No CSS is laid out. No images are decoded. The cost per page is a few milliseconds plus the network round-trip. The throughput is whatever the operator's bandwidth and target rate-limiting allow. This is curl plus an HTML parser. Most AI crawlers live near this end of the spectrum.
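
To make that end of the spectrum concrete, here is a minimal sketch in TypeScript (Node 18+, where fetch is global) of essentially everything a raw fetcher does. The bot name and link-extraction regex are placeholders, deliberately crude:

// Minimal "curl plus an HTML parser" crawler: one GET, no JS, no layout, no images.
async function crawlOnce(url: string): Promise<{ html: string; links: string[] }> {
  const res = await fetch(url, {
    headers: { "User-Agent": "ExampleBot/1.0 (+https://example.com/bot)" },
  });
  const html = await res.text(); // whatever the server sent; nothing executes

  // "Link discovery" is just parsing href attributes out of the markup.
  const links = [...html.matchAll(/href="([^"]+)"/g)].map((m) =>
    new URL(m[1], url).toString()
  );

  return { html, links };
}

That is the whole interaction. If the server's response does not contain the content, the crawler does not have the content.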

Headless browser. A program that does everything a real browser does, headlessly. JavaScript executes, network requests fan out, the DOM gets mutated, eventually a render tree settles. The cost per page is hundreds of milliseconds to multiple seconds, depending on how heavy the page is. The throughput is one to two orders of magnitude lower than the raw fetcher. Memory cost is much higher. Googlebot's render queue, Bingbot's modern incarnation, and a handful of search-tool products live at this end.

The gap between the two modes is where the AI ecosystem currently sits: not because nobody knows how to run a headless browser, but because the cost-benefit at training-corpus or live-fetch scale is decisively against it. OpenAI is not going to run a headless browser to fetch a site on every ChatGPT search. The latency is too high. The cost is too high. The volume is too high.

The implication is direct. If the part of your page that says what your business does is rendered by JavaScript that runs after the document loads, the AI crawler that fetches your page does not see it. There is no second-pass render queue at OpenAI, Anthropic, or Perplexity. There is one pass. Whatever is in the HTML at first-paint is everything the model gets.

The Six Crawlers, Tested

Six crawlers do almost all the AI-relevant work. Their User-Agents, their robots.txt declarations, and their JS-execution behaviour are public information for the most part. Where I have observed behaviour beyond what the docs say, I have flagged it.

GPTBot (OpenAI training crawler)

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot. OpenAI publishes its IP ranges. GPTBot's job is to harvest content for training data, not to fetch live for a specific user query. It respects robots.txt and honours a User-agent: GPTBot / Disallow: / block if you set one.

The interaction shape is HTML-only. No JavaScript execution. The content GPTBot acquires is whatever the server returns to a plain GET: first-paint HTML, server-rendered or static, plus any inline JSON-LD. Anything rendered after document-ready is invisible. Anything fetched from a downstream API by client-side JavaScript is invisible. The crawler acts like an HTTP fetcher, not a browser.
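
A cheap way to see exactly what GPTBot sees is to replay its request yourself, using the documented User-Agent from above (the example.com URL is a placeholder for your own page):

$ curl -s -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot" \
    https://example.com/products/widget-pro | grep -c "Widget Pro"

If the count is zero, nothing about the product reaches the training corpus through this path.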

The implication for training inclusion is mechanical: if GPTBot cannot see your content in HTML, your content does not enter the training corpus through this path. There are other paths (Common Crawl, licensed datasets) but for the part you control directly, this is the gate.

OAI-SearchBot (OpenAI live-fetch crawler)

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot. Same operator as GPTBot, different role. OAI-SearchBot is the fetcher used at ChatGPT-search inference time: the user asks a question, the model decides to search, candidate URLs come back from the Bing-backed retrieval, and OAI-SearchBot fetches a handful of them in parallel.

This crawler operates under a much tighter latency budget than GPTBot. The user is waiting on an answer. The fetcher cannot afford to render. JavaScript is not executed. Robots.txt is honoured.

There is a subtlety here that catches operators. ChatGPT search candidates come from Bing's index, and Bingbot does execute JavaScript. So a JavaScript-only site can be in Bing's index, and therefore in the candidate set ChatGPT search ranks against, but when OAI-SearchBot tries to live-fetch that page to get content for the answer, it gets the empty shell. The candidate ranks. The content does not appear in the cited answer. The site is in the SERP but invisible at synthesis time.

ChatGPT-User (the "browse" UA)

User-Agent contains ChatGPT-User. This is the fetcher used when a user explicitly asks ChatGPT to browse a specific URL ("can you summarise this page for me?"). It is allowed to do slightly more than OAI-SearchBot in some configurations (limited rendering), but I have not seen it execute arbitrary JS reliably. Treat it the same as OAI-SearchBot for planning purposes: HTML-only is the safe assumption.

ClaudeBot (Anthropic crawler)

User-Agent: Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com). Used for harvesting training data. HTML-only. Respects robots.txt. Behaviour matches GPTBot more than it matches anyone else: modest crawl rate, conservative on server load, predictable.

There is a separate UA for Claude's web search tool when it is configured (Anthropic's docs are still maturing on this), but the same pattern applies: at inference time, when a model is fetching live, JavaScript is not in the budget.

PerplexityBot (Perplexity crawler)

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot. Perplexity operates a hybrid retrieval: they pull from Bing and from their own crawler. PerplexityBot is the latter. HTML-only. Their robots.txt compliance has been a source of friction in the press; the documented behaviour is that they respect it, and the controversies have been around whether they always have.

Behavioural note from observed traffic: PerplexityBot is more aggressive than GPTBot or ClaudeBot on revisit frequency. Pages that update often see PerplexityBot more frequently. Pages that are stable see all three at similar cadences.

Bingbot

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm. Microsoft's primary search crawler. Respects robots.txt. Executes JavaScript via a Chromium-based renderer. This puts Bingbot at the headless-browser end of the spectrum, alongside Googlebot.

Bingbot matters in the AI conversation more than the Microsoft branding suggests, because ChatGPT search and Copilot both depend on Bing's index. If your site is JavaScript-only and Bingbot can render it, you can appear in ChatGPT search candidates, but as noted under OAI-SearchBot, the live-fetch step at inference time still cannot render. So Bingbot indexes you, ChatGPT ranks you, OAI-SearchBot tries to fetch you for the cited content, and gets nothing useful. The candidate ranking and the citation content are decoupled.

Googlebot

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Googlebot/2.1; +http://www.google.com/bot.html. Google's two-pass crawler: first pass is an HTML fetch, second pass is a render-queue pass with a Chromium-based renderer. Respects robots.txt. Executes JavaScript, but on a delay: the render queue can lag the initial fetch by minutes to days.

Googlebot is important for AI in the same indirect way Bingbot is. Gemini and Google AI Overview depend on Google's index. The render queue means JavaScript-rendered content does eventually get indexed, but the Gemini and AI Overview live fetcher at inference time has the same constraint as OAI-SearchBot: no JavaScript at synthesis. The same decoupling fires.

The Summary Matrix

| Crawler | Operator | Role | JS execution | Robots.txt | Latency profile |
| --- | --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training data | No | Yes | Patient |
| OAI-SearchBot | OpenAI | Live fetch | No | Yes | Tight |
| ChatGPT-User | OpenAI | User-triggered browse | No (effectively) | Yes | Tight |
| ClaudeBot | Anthropic | Training / fetch | No | Yes | Patient |
| PerplexityBot | Perplexity | Hybrid index | No | Yes (with caveats) | Mid |
| Bingbot | Microsoft | Search index | Yes | Yes | Mid |
| Googlebot | Google | Search index | Yes (second pass) | Yes | Patient first pass, delayed render |

The pattern is sharp: every crawler that fetches live for an AI synthesis call is HTML-only. Every crawler that builds a search index that AI products rank against may render JavaScript, but the JS-rendered content is only useful for ranking, not for the cited content the model actually emits.
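
Before optimising for any of this, it is worth confirming which of these crawlers actually visit you. A minimal sketch of an Express-style middleware (the handler shape is an assumption, adapt to your stack; the substrings come from the UA strings above):

import type { NextFunction, Request, Response } from "express";

// UA substrings for the crawlers in the table above.
const AI_CRAWLER_TOKENS = [
  "GPTBot",
  "OAI-SearchBot",
  "ChatGPT-User",
  "ClaudeBot",
  "PerplexityBot",
  "bingbot",
  "Googlebot",
];

export function logAiCrawlers(req: Request, _res: Response, next: NextFunction) {
  const ua = req.get("user-agent") ?? "";
  const crawler = AI_CRAWLER_TOKENS.find((token) => ua.includes(token));
  if (crawler) {
    // One structured log line per hit: which crawler, which URL, when.
    console.log(JSON.stringify({ crawler, path: req.path, at: new Date().toISOString() }));
  }
  next();
}

A week of these log lines tells you which half of the table you are actually being fetched by, and how often.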

The Hydration Cliff

If the rest of this post had a single image, it would be the hydration cliff. The picture is: a React app renders in three stages. Stage one is the HTML shell delivered by the server. Stage two is the JavaScript bundle being loaded and parsed. Stage three is the React tree mounting, fetching data, and rendering. To a user, the three stages compress to "the page loads." To an HTML-only crawler, only stage one exists.

To make this concrete, here is what curl sees against a stock Vite + React SPA with the production build:

$ curl -s https://example.com/products/widget-pro | head -30

<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Widget Pro</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/assets/index-a1b2c3d4.js"></script>
  </body>
</html>

That is what GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot all see. The product name, the description, the price, the availability, the JSON-LD, the reviews: none of it is in the response. None of it enters the AI ecosystem through this fetch.

The same page, server-rendered (Next.js with getServerSideProps or App Router with appropriate caching):

$ curl -s https://example.com/products/widget-pro | head -50

<!doctype html>
<html lang="en">
  <head>
    <title>Widget Pro | Industrial-grade widget rated for 50,000 cycles</title>
    <meta name="description" content="Widget Pro is rated for 50,000 cycles, ships from Berlin..." />
    <script type="application/ld+json">
      {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Widget Pro",
        "description": "Industrial-grade widget...",
        "offers": { "@type": "Offer", "price": "299.00", "priceCurrency": "EUR" }
      }
    </script>
  </head>
  <body>
    <main>
      <h1>Widget Pro</h1>
      <p>Industrial-grade widget rated for 50,000 cycles. Ships from Berlin warehouse.</p>
      <p class="price">€299.00 In stock</p>
      ...
    </main>
  </body>
</html>

Same React app, same components, same product. The difference between "invisible" and "fully indexable" is whether the rendering happens on the server before the response is sent or on the client after the response is sent. To the AI ecosystem, that is the difference between not existing and existing.

The trap is that the JavaScript-only version does the right thing for users. It loads in a few hundred milliseconds, it is interactive, it is fast on subsequent navigations because of client-side routing. The user experience is fine. The crawler experience is empty.
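
None of this requires exotic machinery on the server side, either. As a minimal sketch of the page that would emit the server-rendered output above, assuming a hypothetical getProduct data-layer helper (Next.js App Router; the exact params typing varies by Next.js version):

// app/products/[slug]/page.tsx -- runs on the server; the response ships populated HTML.
import { getProduct } from "@/lib/products"; // hypothetical data-layer helper

export default async function ProductPage({ params }: { params: { slug: string } }) {
  // Data is fetched before the response is sent, so it lands in first-paint HTML.
  const product = await getProduct(params.slug);

  return (
    <main>
      <h1>{product.name}</h1>
      <p>{product.description}</p>
      <p className="price">{product.price}</p>
    </main>
  );
}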

The Five Failure Modes

These are the patterns I now flag immediately in any audit, ordered by how often they show up.

Failure Mode 1: Pure CSR with no SSR fallback. The site is a Vite, Create React App, or Angular CLI build with a near-empty index.html shell. Every page on the site is rendered client-side from that same shell. Title and description are set client-side via JavaScript. AI crawlers see the same empty shell on every URL. The fix is to migrate to a framework with SSR or SSG capability (Next.js, Remix, SvelteKit, Astro, Nuxt) or to add a pre-render step (react-snap or similar) that emits static HTML for known routes.

Failure Mode 2: Hydration boundary on critical content. The site has SSR, but the critical content (product description, article body, business hours) is inside a <Suspense> boundary or a <ClientOnly> wrapper that defers rendering until hydration. AI crawlers see a loading spinner or an empty container where the content should be. The fix is to move the critical content out of the deferred boundary: defer the comments, the related-products carousel, the live-availability widget, not the product name or the article text.
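
The fix usually amounts to moving a few lines across the boundary. A sketch with illustrative names (the Reviews module is hypothetical); only the reviews widget is deferred:

import { Suspense, lazy } from "react";

type Product = { id: string; name: string; description: string };

// Hypothetical client-side widget module; safe to defer, it is not what the business is.
const Reviews = lazy(() => import("./Reviews"));

export default function ProductPage({ product }: { product: Product }) {
  return (
    <main>
      {/* Critical content: present in first-paint HTML, visible to HTML-only crawlers */}
      <h1>{product.name}</h1>
      <p>{product.description}</p>

      {/* Deferred content: crawlers see only the fallback, which is acceptable here */}
      <Suspense fallback={<p>Loading reviews…</p>}>
        <Reviews productId={product.id} />
      </Suspense>
    </main>
  );
}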

Failure Mode 3: Lazy-loaded above-the-fold content. The site uses loading="lazy" for content that should be visible immediately, including text content rendered conditionally based on viewport intersection. AI crawlers do not run an IntersectionObserver. They do not scroll. Anything gated on scroll position never appears. The fix is to reserve loading="lazy" for images that genuinely live below the fold; everything else stays eager.
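
In markup, the whole rule fits in two lines:

<!-- Above the fold: eager by default; the element and its alt text are in first-paint HTML -->
<img src="/hero.jpg" alt="Widget Pro mounted on a production line" />

<!-- Genuinely below the fold: lazy is fine, no crawler needs to decode this image -->
<img src="/gallery-04.jpg" alt="Widget Pro rear panel" loading="lazy" />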

Failure Mode 4: CDN edge cache serving stale JSON-LD. The site has perfectly valid SSR with rich JSON-LD, but the CDN edge cache is still serving the version from a previous deploy that had a bug: the JSON-LD references the wrong product, the price is wrong, the availability is stale. AI crawlers ingest the stale data and the model emits answers built on it weeks after the deploy. The fix is purposeful cache invalidation on JSON-LD-affecting deploys, ideally with surrogate-key invalidation tied to the entities the page renders.
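
What that fix looks like depends on the CDN. As a sketch, using the Surrogate-Key header convention (Fastly/Varnish-style; the purge endpoint and CDN_TOKEN variable are placeholders for your CDN's actual API):

import type { ServerResponse } from "node:http";

// At render time: tag the response with the entities whose data it embeds.
export function tagProductPage(res: ServerResponse, productId: string): void {
  res.setHeader("Surrogate-Key", `product-${productId} jsonld`);
}

// At deploy or price-change time: purge every cached page carrying the tag.
export async function purgeProduct(productId: string): Promise<void> {
  await fetch(`https://api.cdn.example/purge/product-${productId}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.CDN_TOKEN ?? ""}` },
  });
}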

Failure Mode 5: robots.txt that selectively blocks AI crawlers. Someone in the past read a Hacker News thread about AI training and added Disallow: / blocks for GPTBot, ClaudeBot, and PerplexityBot to robots.txt. That was perhaps a defensible position when the question was "do I want my content used for training." It is not a defensible position when the question is "do I want my content cited in AI answers." The training crawlers and the live-fetch crawlers are often the same agent, or sibling agents from the same operator, and blocking them blocks the citation path. The fix is to decide what you actually want: if it is "no training but yes citation," allow the live-fetch user-agents and disallow the training UAs (see the sketch below); if it is "everything blocked," own that and stop expecting AI visibility.
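
For the "no training but yes citation" position, the robots.txt split looks roughly like this. The tokens are the UAs documented earlier in this post; verify them against each operator's current docs before shipping, since they shift:

# Training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Live-fetch / citation crawlers: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# PerplexityBot does double duty (hybrid index), so blocking it also costs citations
User-agent: PerplexityBot
Allow: /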

What Actually Works

Three patterns survive contact with all six crawlers and all five failure modes.

Server-side rendering. The simplest and most durable pattern. Every page returns first-paint HTML with all critical content present. Hydration happens on top of populated content, not in place of it. Frameworks that ship this out of the box: Next.js (App Router or Pages with getServerSideProps), Remix, Nuxt, SvelteKit, Astro. The performance cost is a server round-trip per page, mitigated by edge rendering and caching. The visibility benefit is binary: with SSR, AI crawlers see your content; without it, they do not.

Static-site generation. A subset of SSR where the rendering happens at build time and the output is plain HTML files. Even simpler than runtime SSR. Works for content that does not change per request most marketing pages, most blog content, most product detail pages with infrequently updated availability. Frameworks: Next.js (generateStaticParams, getStaticProps), Astro, Hugo, Gatsby, 11ty. AI-crawler-perfect.
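
The SSG variant of the earlier App Router sketch adds one export; listProductSlugs is a hypothetical helper that enumerates the routes to pre-render:

// app/products/[slug]/page.tsx -- rendered once at build time into plain HTML files.
import { getProduct, listProductSlugs } from "@/lib/products"; // hypothetical helpers

// Enumerate the routes to pre-render at build time.
export async function generateStaticParams() {
  const slugs = await listProductSlugs();
  return slugs.map((slug) => ({ slug }));
}

export default async function ProductPage({ params }: { params: { slug: string } }) {
  const product = await getProduct(params.slug);
  return (
    <main>
      <h1>{product.name}</h1>
      <p>{product.description}</p>
    </main>
  );
}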

Hybrid with a clear boundary. SSR or SSG for the content that needs to be visible, client-side for the interactive widgets that do not. Article body server-rendered; comments client-rendered. Product page server-rendered; reviews client-rendered. The key is that the content which determines what your business is, or what an article is about, must be in first-paint HTML. The interactive layer can come on top.
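
In the React Server Components world, that boundary is a single directive. A sketch with a hypothetical getArticle helper and an illustrative Comments client component:

// app/articles/[slug]/page.tsx -- server component: the article body ships in first-paint HTML.
import Comments from "./Comments";
import { getArticle } from "@/lib/articles"; // hypothetical data-layer helper

export default async function ArticlePage({ params }: { params: { slug: string } }) {
  const article = await getArticle(params.slug);
  return (
    <article>
      <h1>{article.title}</h1>
      <div>{article.body}</div>
      {/* Client-rendered; HTML-only crawlers see nothing here, and that is the point */}
      <Comments articleId={article.id} />
    </article>
  );
}

// app/articles/[slug]/Comments.tsx -- the interactive layer, behind the directive.
"use client";

import { useEffect, useState } from "react";

export default function Comments({ articleId }: { articleId: string }) {
  const [comments, setComments] = useState<string[]>([]);
  useEffect(() => {
    fetch(`/api/comments/${articleId}`) // hypothetical endpoint
      .then((r) => r.json())
      .then(setComments);
  }, [articleId]);
  return <ul>{comments.map((c, i) => <li key={i}>{c}</li>)}</ul>;
}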

The pattern that does not work, and this is where I see teams burn the most time, is the half-SSR build where some routes are server-rendered and some are not. AI crawlers do not infer your routing convention. They fetch URLs they discover in sitemaps, internal links, and external mentions. If a fraction of those URLs return rich HTML and the rest return shells, the visibility becomes lottery-distributed. Either commit to "every public URL is server-rendered or static" or commit to "we are invisible to AI crawlers and fine with that." There is no middle ground that produces predictable AI visibility.

The Synthesis

The mental model engineers carry of "modern web crawler" is twenty years out of date in two directions at once. It overestimates the capability of the AI crawlers (assuming they render like Chrome) and underestimates the capability of Bingbot and Googlebot (which actually do). The result is decisions that optimise for the wrong set of crawlers.

The single sentence I tell anyone shipping a public site that wants to be cited by AI: every AI crawler that fetches live at inference time is HTML-only, and every piece of content that depends on JavaScript to appear is invisible to the citation pipeline regardless of what Bingbot or Googlebot can do.

If you only have time to internalise three things from this post, in order:

  1. Live-fetch AI crawlers do not run JavaScript. GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, ChatGPT-User. The page they see is the first-paint HTML, nothing else.
  2. Bingbot and Googlebot are the exception, and they only help with ranking, not with citation content. They render JS, your site appears in the candidate set, but the live fetcher that grabs content for the answer cannot render. The decoupling is invisible until you measure for it.
  3. SSR or SSG is not optional for AI visibility. It is the gate. Pure-CSR sites are functionally invisible to half the AI ecosystem and partially invisible to the other half.

Everything else (the failure modes, the framework picks, the cache-invalidation discipline, the robots.txt nuances) is implementation detail layered over those three. The mechanism is unglamorous: HTML in, HTML out, and a vanishingly small fraction of the AI ecosystem is willing to spend a Chromium instance's worth of compute to recover what your client-side JavaScript would have produced.

The agentic web is being built on top of HTTP/1.1 and HTML parsers. It looks more like the web of 1998 than the web of 2018. If you treat it that way and ship server-rendered or static HTML, your site is visible. If you treat it like 2018 and rely on the browser, your site is invisible, and the marketing claim that "AI search is the new SEO" lands somewhere uncomfortable for your traffic.


The User-Agent strings in the per-crawler section are taken from each operator's published documentation as of writing: OpenAI's bot documentation, Anthropic's ClaudeBot documentation, Perplexity's bot documentation, Microsoft's Bingbot reference, and Google's Googlebot reference. JavaScript-execution behaviour is from official documentation where available and observed traffic where not. Individual UAs are versioned and may have shifted by the time you read this; verify against the operator's current docs before shipping a robots.txt change. The "no JavaScript at inference-time fetchers" finding is consistent across every public statement and every observed log line I have seen, but I cannot rule out that a specific operator runs a small headless-browser fleet for a small fraction of fetches; if such a fleet exists, it does not change the production-planning conclusion.
