<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jan-Willem Bobbink</title>
    <description>The latest articles on DEV Community by Jan-Willem Bobbink (@jbobbink).</description>
    <link>https://dev.to/jbobbink</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F30113%2F572dbcfb-86d0-4511-b1bc-0f91ad4021bf.png</url>
      <title>DEV Community: Jan-Willem Bobbink</title>
      <link>https://dev.to/jbobbink</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jbobbink"/>
    <language>en</language>
    <item>
      <title>LLM trackers are quietly breaking their users' own analytics</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:40:32 +0000</pubDate>
      <link>https://dev.to/jbobbink/llm-trackers-are-quietly-breaking-their-users-own-analytics-20o3</link>
      <guid>https://dev.to/jbobbink/llm-trackers-are-quietly-breaking-their-users-own-analytics-20o3</guid>
<description>&lt;h2&gt;And nobody in SEO is talking about it yet&lt;/h2&gt;

&lt;p&gt;There is a measurement problem sitting at the centre of the AI visibility industry, and the tools sold to measure AI search visibility are the ones causing it. The pollution is silent, it is structural, and it is showing up in exactly the dataset that GEO practitioners now rely on most.&lt;/p&gt;

&lt;p&gt;This post lays out the mechanism, why it matters more than the rank tracker problem that came before it, and the simple product fix that would clear it up.&lt;/p&gt;

&lt;h2&gt;The mechanism&lt;/h2&gt;

&lt;p&gt;When you run daily prompt tracking across ChatGPT, Perplexity, Google AI Mode, Claude and the other LLM surfaces, the model sometimes decides it needs fresh information to answer the prompt. It fires off a retrieval-augmented generation request, often through a search index, and your pages get fetched. RAG systems work by injecting retrieved content into the LLM's context window before generation, anchoring the answer to external sources rather than relying purely on training data [1]. Not every tracked prompt triggers retrieval. Models cache, they reuse prior context, and they only ground when they judge it necessary. But when you are tracking hundreds or thousands of prompts per day across multiple engines, the cumulative volume of triggered fetches is significant.&lt;/p&gt;

&lt;p&gt;Those fetches land in your server logs as bot hits. They inflate your crawl data. They dirty your content performance analysis. And they are caused by the tool you bought to measure your AI visibility in the first place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa57lpx6pjjion1qb04d6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa57lpx6pjjion1qb04d6.jpg" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scale of legitimate AI crawler activity already makes this hard to disentangle. Cloudflare's 2025 Year in Review reports that AI "user action" crawling, the category that includes pages fetched in response to user prompts, grew more than 15 times across 2025 [2]. Botify's analysis of more than 7 billion log files found that OpenAI's combined crawl of the web tripled between August 2025 and March 2026, with OAI-SearchBot and GPTBot both at all-time highs [3]. Single Grain reported GPTBot traffic growing 305% between May 2024 and May 2025 [4]. Tracker-induced fetches are sitting on top of an already noisy baseline.&lt;/p&gt;

&lt;h2&gt;Why this matters more than it did two years ago&lt;/h2&gt;

&lt;p&gt;GA4 automatically excludes traffic from known bots and spiders, and according to Google's own documentation you cannot disable this filter or see how much was excluded [5]. If your AI visibility analysis lives in GA4, the pollution does not show up there in any obvious way, which is part of why the issue has stayed invisible.&lt;/p&gt;

&lt;p&gt;But GA4 is not where the real SEO (or any of the new abbreviations!) work is happening anymore. Log file analysis is. As Search Engine Land notes, while crawl tools like SEMrush or Screaming Frog simulate bot behaviour, log files capture what crawlers actually do in real time, including for bots that GSC and GA4 will never report on [6]. That is the only honest record of what AI systems are doing on your site.&lt;/p&gt;

&lt;p&gt;Tools like &lt;a href="https://www.botsanalyser.com" rel="noopener noreferrer"&gt;botsanalyser.com&lt;/a&gt; have made server log parsing accessible to any SEO without needing to set up a data pipeline from scratch. Practitioners are increasingly using logs to answer questions that GA4 cannot: which AI crawlers visit, how often, which pages they fetch, how deep they go, and how that behaviour correlates with citations and visibility in AI answers. Search Engine Land's recent coverage of log file analysis for AI crawlers explicitly frames logs as the closest substitute for the missing feedback loop in AI search, where impressions, clicks, and indexing data simply do not exist the way they do in traditional SEO [7].&lt;/p&gt;

&lt;p&gt;This is the dataset where the signal lives for AI search optimisation. And this is exactly the dataset that LLM tracker traffic is contaminating.&lt;/p&gt;

&lt;h2&gt;A familiar pattern, with a worse outcome&lt;/h2&gt;

&lt;p&gt;This is not the first time the SEO industry has bought a measurement tool that quietly polluted its own data. Rank trackers did the same thing to Google Search Console for years. Every time a rank tracker checked position 37 for a keyword, Google counted an impression. The more keywords you tracked, the noisier your GSC impression data became.&lt;/p&gt;

&lt;p&gt;The proof showed up clearly when Google stopped supporting the &amp;amp;num=100 parameter on 12 September 2025. Within days, GSC impressions dropped sharply across the industry, with some sites reporting declines of 20 to 50 percent. The "alligator effect" graphs that many SEOs had attributed to AI Overviews snapped shut almost overnight. Search Engine Land's analysis concluded that automated crawlers had been inflating impression counts, and that the post-change baseline reflected real user activity rather than scraper noise [8]. Smith Digital framed those vanished impressions as "ghost impressions" generated by machine activity that never represented a real human seeing a result [9].&lt;/p&gt;

&lt;p&gt;Google itself later confirmed a separate logging error that had been over-reporting GSC impressions from 13 May 2025 onwards. The fix was rolled out in April 2026, almost a full year after the bug began [10]. Between rank tracker pollution and Google's own logging bug, GSC impression data was structurally unreliable for most of 2025.&lt;/p&gt;

&lt;p&gt;The log file version of this same problem is worse for three reasons.&lt;/p&gt;

&lt;p&gt;First, the noise is harder to identify. Rank tracker traffic in GSC was at least bundled into a single metric you could mentally discount. LLM tracker traffic in your logs arrives with rotating user agents, sometimes through Bing's infrastructure, sometimes through Google, sometimes direct from OpenAI or Anthropic. Seer Interactive has documented how stealth AI crawling, where bots reappear under generic browser headers and unrelated IPs, makes traditional bot detection unreliable [11]. There is no clean way to label this traffic after the fact.&lt;/p&gt;

&lt;p&gt;Second, the noise is harder to filter. You cannot simply exclude a known IP range or user agent string. The same fetches that come from real LLM grounding for real user prompts arrive through the same infrastructure as the fetches caused by your tracker. They are mechanically identical from the server's perspective. Passion Digital flagged this exact problem when it noted that misidentifying bot traffic is one of the most common errors in LLM bot tracking, particularly because not all bots clearly identify themselves and user agent strings can be spoofed [12].&lt;/p&gt;

&lt;p&gt;Third, the dataset is being used to drive decisions, not just reporting. Log data is feeding content prioritisation, internal linking strategy, technical SEO fixes for AI crawlers, and conversations with leadership about which AI surfaces are sending qualified bot traffic. Every one of those decisions is being made on top of polluted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ozcyjtn1rrjai3mhru6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ozcyjtn1rrjai3mhru6.PNG" alt=" " width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What "polluted" actually looks like&lt;/h2&gt;

&lt;p&gt;Imagine a mid-sized site running daily prompt tracking on 500 prompts across five LLM engines. Even if only a fraction of those prompt executions trigger retrieval, you are looking at potentially hundreds of additional fetches per day attributable to the tracker, on top of organic AI crawler activity.&lt;/p&gt;
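&lt;p&gt;A back-of-the-envelope sketch of that estimate (the 20 percent retrieval rate below is an assumed figure for illustration only; real grounding rates vary by engine and prompt type):&lt;/p&gt;

```javascript
// Rough estimate of tracker-induced fetches per day.
// The retrievalRate of 0.2 is an assumed illustrative figure,
// not a measured value; real grounding rates vary by engine.
function trackerFetchesPerDay(prompts, engines, retrievalRate) {
  return Math.round(prompts * engines * retrievalRate);
}

// 500 prompts across 5 engines, assuming 1 in 5 executions grounds:
trackerFetchesPerDay(500, 5, 0.2); // 500 fetches/day from the tracker alone
```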

&lt;p&gt;Those fetches will tend to cluster around the pages your tracker considers most relevant to the prompts you set up, which are the same pages you are trying to evaluate. So the pollution is not evenly distributed. It is concentrated on exactly the URLs you most want clean data for.&lt;/p&gt;

&lt;p&gt;The result is that pages with strong tracked-prompt coverage look healthier in your log analysis than they actually are, and pages outside your tracked prompt set look quieter than they actually are. The measurement is structurally biased toward the prompts you chose to monitor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft903tzpz962vyoqum532.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft903tzpz962vyoqum532.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bias matters more given how skewed the underlying crawl-to-referral economics already are. Cloudflare's crawl-to-refer ratio metric, which compares how much a platform crawls versus how much referral traffic it sends back, showed Anthropic peaking at roughly 500,000 to 1 and OpenAI peaking at around 3,700 to 1 during 2025 [13]. Practitioners are already trying to read meaningful signal out of fetch volumes that dwarf any human traffic those platforms send back. Adding tracker noise on top of that makes the signal even harder to extract.&lt;/p&gt;

&lt;h2&gt;The fix is a product decision, but it requires more than one party&lt;/h2&gt;

&lt;p&gt;There is a path out of this for vendors who care about giving practitioners clean data, and it is worth being precise about who needs to do what. The architectural reality is that the page fetches that pollute server logs are not made by the tracker itself in the most common case. They are made by the LLM provider's own crawler in response to a prompt the tracker submitted to the API. The tracker can attach any header it likes to its API call, but that header does not propagate down into the RAG fetch the LLM subsequently fires off. So a header-only fix solves the wrong half of the problem.&lt;/p&gt;

&lt;p&gt;A clean solution stacks three mechanisms.&lt;/p&gt;

&lt;p&gt;The first is scheduling. Trackers should run prompts in a declared time window, ideally outside peak hours for their target audience, and publish that schedule. Practitioners can then filter logs by removing crawler hits during the declared window. This works without any LLM cooperation at all and is the easiest mitigation to deploy. It is not perfect because real users prompt at all hours, but it produces a meaningful baseline correction at very low cost.&lt;/p&gt;
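&lt;p&gt;The log-side filter is then a few lines. A minimal sketch, assuming a vendor-declared 02:00 to 04:00 UTC window and a simplified log record shape (both hypothetical):&lt;/p&gt;

```javascript
// Drop crawler hits that fall inside a vendor-declared tracking window.
// The 02:00-04:00 UTC window and the log record shape are hypothetical;
// a real vendor would publish its own schedule.
function outsideTrackerWindow(entry, startHour, endHour) {
  const hour = new Date(entry.timestamp).getUTCHours();
  if (hour >= startHour) {
    if (endHour > hour) return false; // inside the declared window
  }
  return true;
}

const logEntries = [
  { timestamp: "2026-04-29T02:15:00Z", userAgent: "OAI-SearchBot/1.0", path: "/pricing" },
  { timestamp: "2026-04-29T14:02:00Z", userAgent: "OAI-SearchBot/1.0", path: "/pricing" },
];

// Keep only hits outside the declared 02:00-04:00 UTC tracker window.
const filtered = logEntries.filter(function (e) {
  return outsideTrackerWindow(e, 2, 4);
});
// filtered keeps only the 14:02 hit
```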

&lt;p&gt;The second is the tracker activity feed. LLM tracking platforms know exactly when each prompt was executed, against which engine, and in many cases they can infer or directly observe whether a retrieval call was triggered. A timestamped export of that activity, ideally as an API endpoint, with at minimum the timestamp, engine, prompt identifier, and where possible the URLs that were fetched as part of grounding, lets practitioners reconcile log entries against tracker activity with more precision than the time window alone allows.&lt;/p&gt;
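&lt;p&gt;Reconciliation against such a feed could look like the sketch below. The feed field names and the 30-second matching tolerance are assumptions, since no vendor publishes this endpoint today:&lt;/p&gt;

```javascript
// Match a log hit to tracker activity-feed events by URL and proximity
// in time. The feed shape (timestamp, engine, url) and the 30-second
// tolerance are assumptions; no tracker vendor ships this feed yet.
function isTrackerInduced(logEntry, feedEvents, toleranceMs) {
  const hitTime = Date.parse(logEntry.timestamp);
  return feedEvents.some(function (event) {
    if (event.url !== logEntry.path) return false;
    const delta = Math.abs(Date.parse(event.timestamp) - hitTime);
    return toleranceMs >= delta;
  });
}

const feed = [
  { timestamp: "2026-04-29T02:15:10Z", engine: "chatgpt", url: "/pricing" },
];
const hit = { timestamp: "2026-04-29T02:15:04Z", path: "/pricing" };

isTrackerInduced(hit, feed, 30000); // true: within 30s of a feed event
```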

&lt;p&gt;The third is LLM cooperation, and this is the one that closes the loop. When a tracker calls the API, the LLM provider should mark the resulting RAG crawler fetches in a way that downstream log analysis can identify. This could be a custom User-Agent suffix on the OAI-SearchBot or ClaudeBot or PerplexityBot request, an extra HTTP header passed through from the originating API call, or a published list of IP ranges used specifically for API-originated retrieval. Without this, no amount of tracker discipline cleans up the actual problematic fetches, because the tracker is not the party making them.&lt;/p&gt;
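&lt;p&gt;If a provider shipped such a marker, the log-side classification would be trivial. The &lt;code&gt;+api-origin&lt;/code&gt; token below is invented for illustration; no provider emits anything like it today:&lt;/p&gt;

```javascript
// Split crawler hits into organic vs API-originated, assuming a provider
// appended a hypothetical "+api-origin" token to its User-Agent string.
// No LLM provider actually emits such a marker today.
function classifyFetch(userAgent) {
  if (userAgent.includes("+api-origin")) return "api-originated";
  return "organic";
}

classifyFetch("OAI-SearchBot/1.0; +https://openai.com/searchbot +api-origin");
// "api-originated"
classifyFetch("OAI-SearchBot/1.0; +https://openai.com/searchbot");
// "organic"
```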

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkhh3gcfrx77vpc34829.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkhh3gcfrx77vpc34829.png" alt="LLM tracker solution diagram" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The combination is what gets you clean data. The time window is the easy first cut, the activity feed is the cross-reference, and LLM cooperation is what makes the filtering precise. None of the three is technically difficult. All three are product decisions about whether to give practitioners visibility into a measurement system that currently obscures itself.&lt;/p&gt;

&lt;p&gt;The harder question is whether LLM providers will cooperate. They have less commercial incentive than tracker vendors do, and they are the bottleneck on the cleanest part of the fix.&lt;/p&gt;

&lt;h2&gt;Why no vendor has shipped this yet&lt;/h2&gt;

&lt;p&gt;Probably because the easy parts make the product look smaller and the hard part requires someone else's cooperation. A tracker that shows you "your site was fetched 40,000 times by AI crawlers last month" reads differently than a tracker that shows you "your site was fetched 40,000 times, of which 12,000 were caused by us, leaving 28,000 organic AI crawler hits." The honest version is more useful. It is also less impressive on a dashboard.&lt;/p&gt;

&lt;p&gt;There is also a competitive logic. The first tracker vendor to publish a schedule and an activity feed effectively concedes that their measurement creates noise. No vendor wants to be the first to admit that, even though every vendor in the category has the same problem. And LLM providers, who hold the cleanest part of the fix, have even less incentive: they get the value of training data and answer grounding from those crawls, and the cost of the noise lands on publishers and SEO practitioners, not on them.&lt;/p&gt;

&lt;h2&gt;What practitioners can do in the meantime&lt;/h2&gt;

&lt;p&gt;Until vendors ship a clean activity feed, there are partial mitigations worth considering. Time-of-day patterns can sometimes isolate tracker traffic if your tracker runs on a fixed schedule. Cross-referencing fetch volumes against your tracked prompt list can flag suspiciously consistent crawl patterns on covered URLs. Comparing log data from before and after enabling a tracker, on the same site, gives you a rough estimate of the baseline shift.&lt;/p&gt;
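&lt;p&gt;The before-and-after comparison is the simplest of these to sketch. The daily hit counts here are invented purely for illustration:&lt;/p&gt;

```javascript
// Estimate the tracker's contribution by comparing mean daily AI-crawler
// hits before and after the tracker was enabled on the same site.
// The daily counts below are invented for illustration.
function meanDaily(counts) {
  let total = 0;
  for (const c of counts) total += c;
  return total / counts.length;
}

const before = [820, 790, 845, 805];   // daily AI-crawler hits, pre-tracker
const after = [1310, 1275, 1340, 1295]; // same site, tracker enabled

const shift = meanDaily(after) - meanDaily(before);
// shift is a rough per-day fetch volume attributable to the tracker
```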

&lt;p&gt;None of these are substitutes for vendor-provided data. They are workarounds for a problem that should not be the customer's to solve.&lt;/p&gt;

&lt;h2&gt;The bigger point&lt;/h2&gt;

&lt;p&gt;Every AI visibility report being published right now, including the ones used to set strategy at large brands, is sitting on top of log data that has been contaminated by the measurement tools themselves. The industry is making decisions on a dataset it has not properly cleaned, because the only people who can clean it have a commercial reason not to.&lt;/p&gt;

&lt;p&gt;That is the real story. Not that LLM trackers are bad, they are useful, but that the standard practice of evaluating AI search performance from log data is currently broken in a way that vendors could fix tomorrow and have chosen not to. The easy parts sit with tracker vendors. The hardest and most important part sits with the LLM providers themselves.&lt;/p&gt;

&lt;p&gt;The first tracker vendor to publish a schedule and an activity feed will, briefly, look like the one with the noisier product. They will also be the only one giving practitioners data they can actually trust at the tracker layer. The first LLM provider to mark API-originated RAG fetches will give the entire industry the missing piece. None of this is hard. All of it is overdue.&lt;/p&gt;

&lt;p&gt;Not going to start the debate about involving LLMs themselves :)&lt;/p&gt;

&lt;h2&gt;Sources&lt;/h2&gt;

&lt;p&gt;[1] Firecrawl, "What is RAG grounding?" (accessed 29-04-2026), Firecrawl Glossary.&lt;br&gt;
&lt;a href="https://www.firecrawl.dev/glossary/web-search-apis/rag-grounding" rel="noopener noreferrer"&gt;https://www.firecrawl.dev/glossary/web-search-apis/rag-grounding&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] David Belson, "The 2025 Cloudflare Radar Year in Review: The rise of AI, post-quantum, and record-breaking DDoS attacks" (29-01-2026), Cloudflare Blog.&lt;br&gt;
&lt;a href="https://blog.cloudflare.com/radar-2025-year-in-review/" rel="noopener noreferrer"&gt;https://blog.cloudflare.com/radar-2025-year-in-review/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Chris Long, "OpenAI Has Tripled Their Crawl of the Web: An Analysis of 7B+ Log Files" (23-04-2026), Botify Blog.&lt;br&gt;
&lt;a href="https://www.botify.com/blog/openai-tripled-web-crawl" rel="noopener noreferrer"&gt;https://www.botify.com/blog/openai-tripled-web-crawl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Single Grain, "Log File Analysis for Understanding AI Crawling Behavior" (28-12-2025), Single Grain Blog.&lt;br&gt;
&lt;a href="https://www.singlegrain.com/blog-posts/analytics/log-file-analysis-for-understanding-ai-crawling-behavior/" rel="noopener noreferrer"&gt;https://www.singlegrain.com/blog-posts/analytics/log-file-analysis-for-understanding-ai-crawling-behavior/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] Google, "[GA4] Filter incoming data: Known bot-traffic exclusion" (accessed 29-04-2026), Google Analytics Help.&lt;br&gt;
&lt;a href="https://support.google.com/analytics/answer/9888366" rel="noopener noreferrer"&gt;https://support.google.com/analytics/answer/9888366&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] Search Engine Land, "Log file analysis for SEO: Find crawl issues &amp;amp; fix them fast" (27-11-2025), Search Engine Land.&lt;br&gt;
&lt;a href="https://searchengineland.com/guide/log-file-analysis" rel="noopener noreferrer"&gt;https://searchengineland.com/guide/log-file-analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[7] Lauren Busby, "Why log file analysis matters for AI crawlers and search visibility" (16-04-2026), Search Engine Land.&lt;br&gt;
&lt;a href="https://searchengineland.com/log-file-analysis-ai-crawlers-search-visibility-474428" rel="noopener noreferrer"&gt;https://searchengineland.com/log-file-analysis-ai-crawlers-search-visibility-474428&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[8] Search Engine Land, "Why Google Search Console impressions fell (and why that's good)" (23-10-2025), Search Engine Land.&lt;br&gt;
&lt;a href="https://searchengineland.com/why-google-search-console-impressions-dropped-interpret-data-463677" rel="noopener noreferrer"&gt;https://searchengineland.com/why-google-search-console-impressions-dropped-interpret-data-463677&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9] Smith Digital, "Why Google Search Console Impressions Dropped in Sept 2025" (16-12-2025), Smith Digital Blog.&lt;br&gt;
&lt;a href="https://smithdigital.io/blog/google-search-console-impression-drop-sept-2025" rel="noopener noreferrer"&gt;https://smithdigital.io/blog/google-search-console-impression-drop-sept-2025&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[10] Danny Goodwin, "Google is fixing a Search Console bug that inflated impression counts" (03-04-2026), Search Engine Land.&lt;br&gt;
&lt;a href="https://searchengineland.com/google-search-console-bug-inflated-impression-counts-473530" rel="noopener noreferrer"&gt;https://searchengineland.com/google-search-console-bug-inflated-impression-counts-473530&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[11] Seer Interactive, "Perplexity, Stealth AI Crawling, and the Impacts on GEO and Log File Analysis" (30-10-2025), Seer Interactive Insights.&lt;br&gt;
&lt;a href="https://www.seerinteractive.com/insights/perplexity-stealth-ai-crawling-and-the-impacts-on-geo-and-log-file-analysis" rel="noopener noreferrer"&gt;https://www.seerinteractive.com/insights/perplexity-stealth-ai-crawling-and-the-impacts-on-geo-and-log-file-analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[12] Passion Digital, "Tracking LLMs Bots on Your Site using Log File Analysis" (15-07-2025), Passion Digital Blog.&lt;br&gt;
&lt;a href="https://passion.digital/blog/tracking-llms-bots-on-your-site-using-log-file-analysis/" rel="noopener noreferrer"&gt;https://passion.digital/blog/tracking-llms-bots-on-your-site-using-log-file-analysis/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[13] Cloudflare, "The crawl before the fall... of referrals: understanding AI's impact on content providers" (01-07-2025), Cloudflare Blog.&lt;br&gt;
&lt;a href="https://blog.cloudflare.com/ai-search-crawl-refer-ratio-on-radar/" rel="noopener noreferrer"&gt;https://blog.cloudflare.com/ai-search-crawl-refer-ratio-on-radar/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>seo</category>
      <category>geo</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building Glippy MCP: giving Claude the power to audit a site's AI readiness</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:30:55 +0000</pubDate>
      <link>https://dev.to/jbobbink/building-glippy-mcp-giving-claude-the-power-to-audit-a-sites-ai-readiness-2db</link>
      <guid>https://dev.to/jbobbink/building-glippy-mcp-giving-claude-the-power-to-audit-a-sites-ai-readiness-2db</guid>
      <description>&lt;p&gt;I spent the last few weekends turning &lt;a href="https://www.glippy.dev/mcp" rel="noopener noreferrer"&gt;Glippy&lt;/a&gt;, a desktop app and browser extension that scores a site's readiness for AI crawlers, into a Model Context Protocol (MCP) server. The result is &lt;code&gt;glippy-mcp&lt;/code&gt;: a Node.js binary that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, and anything else that speaks MCP, then exposes nine tools for analysing, comparing, and exporting GEO reports on any domain.&lt;/p&gt;

&lt;p&gt;This post walks through why I built it, what it does, and the handful of design decisions that actually mattered.&lt;/p&gt;

&lt;h2&gt;Why an MCP server at all?&lt;/h2&gt;

&lt;p&gt;Glippy already had a perfectly good desktop app. The engine &lt;code&gt;geo-checker.js&lt;/code&gt; fetches &lt;code&gt;robots.txt&lt;/code&gt;, &lt;code&gt;llms.txt&lt;/code&gt;, the homepage HTML, &lt;code&gt;sitemap.xml&lt;/code&gt;, and a few security headers, then runs 10 weighted scoring categories (Structured Data, Semantic HTML, Machine Readability, Citability &amp;amp; Answer-Readiness, and so on). You paste a domain, you get a report.&lt;/p&gt;

&lt;p&gt;The problem is that I kept finding myself copy-pasting that report back into Claude to ask follow-up questions like &lt;em&gt;"which of these issues should I fix first for a Shopify site?"&lt;/em&gt; or &lt;em&gt;"compare this to the three competitors in the report I ran yesterday."&lt;/em&gt; The conversation loop was slow and lossy — Claude was operating on stale text instead of live crawls.&lt;/p&gt;

&lt;p&gt;MCP fixes that. Instead of me being a ferry between two tools, Claude can call &lt;code&gt;analyze_domain&lt;/code&gt;, &lt;code&gt;compare_domains&lt;/code&gt;, or &lt;code&gt;analyze_sitemap&lt;/code&gt; directly during the conversation. The model decides &lt;em&gt;when&lt;/em&gt; a fresh crawl is needed, and my job shrinks to asking good questions.&lt;/p&gt;

&lt;h2&gt;What the server actually exposes&lt;/h2&gt;

&lt;p&gt;Nine tools, all stdio-transport JSON-RPC 2.0 under the hood:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_domain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full 10-category GEO analysis of one domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_robots_txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which AI crawlers (GPTBot, ClaudeBot, …) are blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;check_llms_txt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is there an &lt;code&gt;llms.txt&lt;/code&gt;? Show the contents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_geo_summary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quick score + top 3 strengths and weaknesses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_domains&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run 2–10 domains in parallel, rank them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_sitemap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch a sitemap, score every page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_urls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, but for an arbitrary URL list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;export_report&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Styled Markdown or HTML report for one domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;export_bulk_report&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, for comparisons / sitemaps / URL sets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything interesting lives in &lt;code&gt;src/geo-checker.js&lt;/code&gt; (the scoring engine reused from the desktop app) and &lt;code&gt;src/index.js&lt;/code&gt; (the MCP wrapper).&lt;/p&gt;

&lt;h2&gt;The skeleton: less code than you'd think&lt;/h2&gt;

&lt;p&gt;The MCP SDK does most of the heavy lifting. A minimal version of the server is about twenty lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioServerTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;checkGEO&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./geo-checker.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;glippy-mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;analyze_domain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Full GEO analysis of a domain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;e.g. example.com — no https:// prefix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;max_pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;max_pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkGEO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxPages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max_pages&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;formatReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioServerTransport&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zod schemas double as both runtime validation and the JSON schema Claude sees when deciding &lt;em&gt;how&lt;/em&gt; to call the tool, so clear &lt;code&gt;.describe()&lt;/code&gt; text matters more than the parameter name. "Do not include &lt;code&gt;https://&lt;/code&gt;" in the description saves a lot of round-trips where the model would otherwise guess wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decisions that actually mattered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reuse the engine, don't rewrite it
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;geo-checker.js&lt;/code&gt; is ~4,800 lines of &lt;code&gt;cheerio&lt;/code&gt;-based HTML inspection that has already been battle-tested against thousands of real-world sites. The MCP wrapper imports its public functions (&lt;code&gt;checkGEO&lt;/code&gt;, &lt;code&gt;analyseHTML&lt;/code&gt;, &lt;code&gt;analyseRobotsTxt&lt;/code&gt;, &lt;code&gt;parseSitemapUrls&lt;/code&gt;, &lt;code&gt;throttledFetchUrl&lt;/code&gt;, &lt;code&gt;aggregatePageScores&lt;/code&gt;) and does zero scoring itself. Every bug fix in the desktop app flows through to the MCP server for free.&lt;/p&gt;

&lt;p&gt;If you're MCP-ifying an existing tool, resist the urge to "do it properly this time." Wrap what you have.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Keep everything local except the license check
&lt;/h3&gt;

&lt;p&gt;A Glippy MCP license key (&lt;code&gt;GLMCP-XXXX-XXXX-XXXX&lt;/code&gt;) hits a Cloudflare Worker (&lt;code&gt;mcp-worker/&lt;/code&gt;) on first use and caches the result for 24 hours. Actual crawling and scoring run on the user's machine: no domains, results, or HTML ever leave their box.&lt;/p&gt;

&lt;p&gt;That choice kept the server very cheap to run (my Worker handles only verify/deactivate/Stripe-webhook traffic) and kept privacy-sensitive users happy. The validation logic falls back gracefully: if the license server is unreachable but a cached valid license exists, the tool keeps working.&lt;/p&gt;
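
&lt;p&gt;The fallback logic is roughly this shape. This is a sketch, not the shipped code: names like &lt;code&gt;validateLicense&lt;/code&gt; and &lt;code&gt;verifyRemote&lt;/code&gt;, and the cache layout, are illustrative.&lt;br&gt;
&lt;/p&gt;

```javascript
// Sketch of the license fallback described above. Names (validateLicense,
// verifyRemote, the cache shape) are illustrative, not the actual implementation.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // a successful check is cached for 24 hours

async function validateLicense(key, cache, verifyRemote, now = Date.now()) {
  const cached = cache.get(key);
  if (cached) {
    // Fresh cached result: skip the network entirely.
    if (CACHE_TTL_MS > now - cached.checkedAt) return cached.valid;
  }
  try {
    const valid = await verifyRemote(key); // one round-trip to the Cloudflare Worker
    cache.set(key, { valid, checkedAt: now });
    return valid;
  } catch {
    // License server unreachable: honour any previously cached valid result.
    return cached ? cached.valid : false;
  }
}
```

&lt;p&gt;Passing &lt;code&gt;verifyRemote&lt;/code&gt; in as a parameter keeps the network edge swappable, which is also what makes the offline fallback trivial to test.&lt;/p&gt;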

&lt;h3&gt;
  
  
  3. Two tricks to avoid re-crawling
&lt;/h3&gt;

&lt;p&gt;Crawling a sitemap of 500 pages is expensive and rude. I added two layers of deduplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory cache, 5-minute TTL.&lt;/strong&gt; Keyed on &lt;code&gt;domain + maxPages&lt;/code&gt;. The clever bit: if you ask for &lt;code&gt;max_pages=3&lt;/code&gt; and there's already a cached run at &lt;code&gt;max_pages=5&lt;/code&gt;, the cache hits. Subsequent tools in the same conversation (&lt;code&gt;get_geo_summary&lt;/code&gt;, &lt;code&gt;export_report&lt;/code&gt;) reuse the crawl automatically.&lt;/p&gt;
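
&lt;p&gt;The superset rule is a few lines of lookup. A sketch with made-up names (&lt;code&gt;findCached&lt;/code&gt;, the entry shape), assuming the cache is a small array of recent runs:&lt;br&gt;
&lt;/p&gt;

```javascript
// Sketch of the cache-hit rule described above; findCached and the entry
// shape are made up for illustration.
const TTL_MS = 5 * 60 * 1000; // 5-minute TTL

function findCached(cache, domain, maxPages, now = Date.now()) {
  for (const entry of cache) {
    if (entry.domain !== domain) continue;
    if (now - entry.storedAt > TTL_MS) continue; // expired entry
    // A run that crawled at least as many pages can answer a smaller request.
    if (entry.maxPages >= maxPages) return entry.result;
  }
  return null; // miss: caller crawls and stores a fresh entry
}
```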

&lt;p&gt;&lt;strong&gt;Explicit JSON output mode.&lt;/strong&gt; For workflows where the model needs to generate &lt;em&gt;multiple&lt;/em&gt; report formats, every analysis tool accepts &lt;code&gt;output_format="json"&lt;/code&gt;. The raw result object can then be handed to &lt;code&gt;export_report&lt;/code&gt; or &lt;code&gt;export_bulk_report&lt;/code&gt; via an &lt;code&gt;analysis_result&lt;/code&gt; parameter, bypassing the cache entirely. This shows up in practice as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;analyze_domain  domain="example.com" max_pages=5 output_format="json"
→ export_report format="html" analysis_result=&amp;lt;from above&amp;gt;
→ export_report format="markdown_full" analysis_result=&amp;lt;from above&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One crawl, three artifacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Per-domain rate limiting, not global
&lt;/h3&gt;

&lt;p&gt;The batch tools (&lt;code&gt;analyze_sitemap&lt;/code&gt;, &lt;code&gt;analyze_urls&lt;/code&gt;, &lt;code&gt;compare_domains&lt;/code&gt;) fan out concurrent fetches. Naively doing this against a single origin will get you rate-limited — or worse, blocked. The &lt;code&gt;throttledFetchUrl&lt;/code&gt; helper in &lt;code&gt;geo-checker.js&lt;/code&gt; keeps a per-host queue with a configurable rate (default 5 requests per second, tunable via the &lt;code&gt;GLIPPY_RATE_LIMIT&lt;/code&gt; env var or a &lt;code&gt;rate_limit&lt;/code&gt; parameter), while a global semaphore caps total in-flight requests at 10.&lt;/p&gt;

&lt;p&gt;Result: comparing &lt;code&gt;example.com&lt;/code&gt; and &lt;code&gt;competitor.com&lt;/code&gt; runs effectively in parallel because they're different origins, but hammering a single sitemap stays polite.&lt;/p&gt;
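
&lt;p&gt;The shape of that throttle can be sketched as a per-host promise chain plus a counting semaphore. Everything below is illustrative; &lt;code&gt;makeThrottler&lt;/code&gt; is not the real &lt;code&gt;throttledFetchUrl&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

```javascript
// Illustrative sketch: same-host requests queue behind each other with a delay,
// while a global semaphore caps total concurrency across all hosts.
function makeThrottler({ delayMs = 200, maxInFlight = 10 } = {}) {
  const hostQueues = new Map(); // host -> tail of that host's promise chain
  let inFlight = 0;
  const waiters = [];

  const acquire = () =>
    maxInFlight > inFlight
      ? (inFlight++, Promise.resolve())
      : new Promise((resolve) => waiters.push(resolve));
  const release = () => {
    const next = waiters.shift();
    if (next) next(); // hand the slot straight to a waiter
    else inFlight--;
  };

  return async function throttled(url, doFetch) {
    const host = new URL(url).host;
    const prev = hostQueues.get(host) ?? Promise.resolve();
    // Chain onto this host's queue so same-origin requests stay spaced out.
    const run = prev.then(async () => {
      await acquire();
      try {
        return await doFetch(url);
      } finally {
        release();
        await new Promise((r) => setTimeout(r, delayMs)); // per-host spacing
      }
    });
    hostQueues.set(host, run.catch(() => {})); // keep the chain alive on errors
    return run;
  };
}
```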

&lt;h3&gt;
  
  
  5. Stderr for logs, stdout is sacred
&lt;/h3&gt;

&lt;p&gt;MCP over stdio means stdout is a JSON-RPC channel. A stray &lt;code&gt;console.log&lt;/code&gt; anywhere in the engine will corrupt the frame and the client will disconnect with a cryptic parse error. Route every log through &lt;code&gt;console.error&lt;/code&gt;, and audit third-party dependencies for chatty output. I caught one cheerio helper printing a deprecation warning to stdout in an older version; pinning fixed it.&lt;/p&gt;
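
&lt;p&gt;A tiny logger wrapper makes the rule hard to break by accident. Illustrative sketch:&lt;br&gt;
&lt;/p&gt;

```javascript
// Pattern sketch: every diagnostic line goes to stderr so stdout stays a
// clean JSON-RPC channel. makeLogger is a made-up name for illustration.
function makeLogger(stream = process.stderr) {
  return (level, ...parts) => {
    const line = `[${level}] ${parts.join(" ")}`;
    stream.write(line + "\n"); // never console.log: that would corrupt a stdout frame
    return line;
  };
}

const log = makeLogger();
log("info", "crawl started");
```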

&lt;h2&gt;
  
  
  Client config: the part users actually interact with
&lt;/h2&gt;

&lt;p&gt;Getting an MCP server installed is still the single biggest adoption barrier. I wrote one config block and then copy-pasted it across every guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"glippy-geo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glippy-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GLIPPY_LICENSE_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLMCP-XXXX-XXXX-XXXX"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same JSON works for Claude Desktop, Claude Code (&lt;code&gt;.mcp.json&lt;/code&gt;), Cursor (&lt;code&gt;.cursor/mcp.json&lt;/code&gt;), Windsurf (&lt;code&gt;.windsurf/mcp.json&lt;/code&gt;), and Continue.dev. Using &lt;code&gt;npx -y&lt;/code&gt; means users don't manage a global install; they always get the latest published version.&lt;/p&gt;

&lt;p&gt;For ChatGPT / OpenAI, which doesn't speak MCP natively yet, a small bridge does the job, but that's a post for another day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ship JSON-mode from day one.&lt;/strong&gt; I added it in v0.1 after realising chained exports were the most common workflow. Cache-hit logic is fine, but explicit result passing is faster and more predictable for agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer tools, sharper descriptions.&lt;/strong&gt; Nine is on the edge of "too many to reason about." In hindsight, &lt;code&gt;analyze_domain&lt;/code&gt; with rich options subsumes half of the others. Next major version might consolidate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses.&lt;/strong&gt; A full sitemap crawl can take a minute. Right now it's a single tool call; a streaming update ("scored 42/500 pages…") would be a nicer UX once MCP clients support progress notifications more widely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; glippy-mcp   &lt;span class="c"&gt;# needs a license key — grab one at glippy.dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop the config block into your MCP client of choice and ask Claude something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give me a GEO readiness summary for stripe.com and explain the top three issues in plain English.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The whole project ended up being a small reminder that MCP is mostly just "expose your existing tool well." The SDK is thin, the protocol is boring in a good way, and the hard problems (caching, rate limiting, clean log separation) are the same ones you already know from building any CLI.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Lovable.dev sites struggle with search engine and LLM indexing</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Sun, 01 Feb 2026 12:42:16 +0000</pubDate>
      <link>https://dev.to/jbobbink/why-lovabledev-sites-struggle-with-search-engine-and-llm-indexing-36kp</link>
      <guid>https://dev.to/jbobbink/why-lovabledev-sites-struggle-with-search-engine-and-llm-indexing-36kp</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://lovable.dev/" rel="noopener noreferrer"&gt;Lovable.dev&lt;/a&gt;'s pure client-side rendering architecture creates significant SEO challenges&lt;/strong&gt; because search engines receive only an empty HTML shell when crawling these React applications. Google takes approximately &lt;strong&gt;9x longer&lt;/strong&gt; to index JavaScript-heavy pages compared to static HTML, and other search engines—including AI crawlers—often cannot render the content at all. The platform itself &lt;a href="https://docs.lovable.dev/tips-tricks/seo" rel="noopener noreferrer"&gt;acknowledges these limitations&lt;/a&gt;, noting that indexing can take "days instead of hours" and that social media previews are broken by default.&lt;/p&gt;

&lt;p&gt;This problem isn't unique to Lovable.dev—it affects most single-page applications (SPAs) built with React, Vue, or Angular that rely on client-side JavaScript to render content. The solutions range from implementing server-side rendering to using prerendering services, with SEO experts like &lt;a href="https://notprovided.eu/" rel="noopener noreferrer"&gt;Jan-Willem Bobbink&lt;/a&gt; consistently recommending SSR as the safest approach for SEO-critical sites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lovable.dev's technical architecture creates an empty-shell problem
&lt;/h2&gt;

&lt;p&gt;Lovable.dev generates React applications using a modern but SEO-problematic stack: &lt;strong&gt;React with TypeScript, &lt;a href="https://vitejs.dev/" rel="noopener noreferrer"&gt;Vite&lt;/a&gt; for builds, &lt;a href="https://tailwindcss.com/" rel="noopener noreferrer"&gt;Tailwind CSS&lt;/a&gt; with &lt;a href="https://ui.shadcn.com/" rel="noopener noreferrer"&gt;shadcn/ui&lt;/a&gt; components, and &lt;a href="https://reactrouter.com/" rel="noopener noreferrer"&gt;React Router&lt;/a&gt; for client-side navigation&lt;/strong&gt;. The platform exclusively produces &lt;a href="https://docs.lovable.dev/features/tech-stack" rel="noopener noreferrer"&gt;client-side rendered (CSR) single-page applications&lt;/a&gt; with no built-in server-side rendering options.&lt;/p&gt;

&lt;p&gt;When a search engine crawler visits a Lovable.dev site, it receives HTML that looks essentially like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;&lt;/span&gt;Loading...&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/bundle.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All meaningful content—text, images, navigation, metadata—exists only after JavaScript executes in the browser. &lt;a href="https://docs.lovable.dev/tips-tricks/seo" rel="noopener noreferrer"&gt;Lovable's own documentation&lt;/a&gt; acknowledges this limitation: "Platforms like Facebook, X/Twitter, and LinkedIn do not wait for content to render, so they only see the initial HTML page structure."&lt;/p&gt;
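
&lt;p&gt;You can spot the symptom from the raw HTML alone. Here is a rough heuristic (illustrative only; a real audit would fetch the page with JavaScript disabled and diff it against the rendered version):&lt;br&gt;
&lt;/p&gt;

```javascript
// Rough heuristic for the empty-shell symptom: does the raw HTML contain any
// visible body text, or only a mount point plus script tags? Illustrative only.
// (\u003c is the escaped less-than sign, used so the snippet stays feed-safe;
// the regexes match literal HTML tags.)
function looksLikeEmptyShell(html) {
  const bodyMatch = html.match(/\u003cbody[^>]*>([\s\S]*)\u003c\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : "";
  const withoutScripts = body.replace(/\u003cscript[\s\S]*?\u003c\/script>/gi, "");
  const visibleText = withoutScripts.replace(/\u003c[^>]+>/g, "").trim();
  return visibleText.length === 0;
}
```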

&lt;p&gt;The platform offers workarounds but no native fix. Users can &lt;a href="https://docs.lovable.dev/features/deploy" rel="noopener noreferrer"&gt;export their code to GitHub&lt;/a&gt; and deploy elsewhere, use prerendering services like &lt;a href="https://prerender.io/" rel="noopener noreferrer"&gt;Prerender.io&lt;/a&gt; or &lt;a href="https://lovablehtml.com/" rel="noopener noreferrer"&gt;LovableHTML&lt;/a&gt; ($9+/month), or migrate entirely to &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;—though this breaks Lovable's visual editor functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google's two-wave indexing creates multi-day delays
&lt;/h2&gt;

&lt;p&gt;Google processes JavaScript websites through a &lt;strong&gt;three-phase pipeline: crawling, rendering, and indexing&lt;/strong&gt;. When Googlebot first visits a CSR page, it captures the raw HTML immediately but places the page in a rendering queue for JavaScript execution. Google's &lt;a href="https://www.youtube.com/watch?v=PFEakcD3CSs" rel="noopener noreferrer"&gt;Tom Greenaway confirmed at Google I/O&lt;/a&gt; that "the final render can actually arrive several days later."&lt;/p&gt;

&lt;p&gt;This creates what researchers call "flaky indexing." The same page might appear differently on different crawl attempts. Some pages get fully indexed while others remain partially indexed or show errors like "Crawled – currently not indexed" in Search Console. A &lt;a href="https://www.onely.com/blog/javascript-seo-experiment/" rel="noopener noreferrer"&gt;study by Onely&lt;/a&gt; demonstrated that Google takes &lt;strong&gt;up to 9 times longer&lt;/strong&gt; to properly render JavaScript pages than static HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  The crawl and render budget problem
&lt;/h3&gt;

&lt;p&gt;Search engines allocate finite computational resources for JavaScript execution. Sites that exceed their rendering budget experience &lt;strong&gt;up to 40% lower indexation rates&lt;/strong&gt; and 23% decreased organic traffic. Heavy JavaScript bundles—particularly common in React applications—can cause Google to abandon rendering entirely before completion.&lt;/p&gt;

&lt;p&gt;Each JavaScript file competes for crawl budget allocation. Framework files (React, Redux), third-party scripts (analytics, ads), and component libraries all require HTTP requests and processing time. Failed rendering attempts waste budget without producing successful indexing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beyond Google: other crawlers struggle more
&lt;/h3&gt;

&lt;p&gt;While Googlebot uses an evergreen Chromium browser for rendering, other crawlers have more limited capabilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crawler&lt;/th&gt;
&lt;th&gt;JavaScript Support&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Googlebot&lt;/td&gt;
&lt;td&gt;Full (with delays)&lt;/td&gt;
&lt;td&gt;Content eventually indexed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bingbot&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Microsoft recommends dynamic rendering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDuckBot&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Requires static content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Crawlers (GPTBot, ClaudeBot)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Completely miss CSR content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Social Media Bots&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Broken link previews&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://www.bing.com/webmasters/help/webmasters-guidelines-30fba23a" rel="noopener noreferrer"&gt;Bing's official documentation&lt;/a&gt; explicitly states: "In order to increase the predictability of crawling and indexing by Bing, we recommend dynamic rendering as a great alternative for websites relying heavily on JavaScript." Tests by &lt;a href="https://www.screamingfrog.co.uk/javascript-seo/" rel="noopener noreferrer"&gt;Screaming Frog&lt;/a&gt; found that Angular.io—a JavaScript-heavy site—shows "problematic indexing issues" in Bing with missing canonical tags, meta descriptions, and H1 elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six specific indexing challenges affect Lovable.dev sites
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Metadata and title tags aren't visible to crawlers
&lt;/h3&gt;

&lt;p&gt;Meta tags generated client-side may not be processed during the first crawl. The &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag must exist before JavaScript execution for proper indexing. Social media crawlers don't execute JavaScript at all, which is why Lovable sites often display generic or incorrect information when shared on Facebook, Twitter, or LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nfl/react-helmet" rel="noopener noreferrer"&gt;React Helmet&lt;/a&gt; can manage meta tags dynamically, but &lt;strong&gt;must be combined with SSR&lt;/strong&gt; for full effectiveness with search engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Structured data often goes unseen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.searchenginejournal.com/structured-data-javascript-ai-crawlers/" rel="noopener noreferrer"&gt;Search Engine Journal&lt;/a&gt; reports that "structured data added only through client-side JavaScript is invisible to most AI crawlers." While Googlebot can eventually process JavaScript-generated JSON-LD, the rendering delays and potential failures create inconsistency. Rich results may not appear if schema markup isn't in the initial HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Internal links may not be crawlable
&lt;/h3&gt;

&lt;p&gt;Links created via &lt;code&gt;onclick&lt;/code&gt; events or &lt;code&gt;addEventListener&lt;/code&gt; are &lt;a href="https://developers.google.com/search/docs/crawling-indexing/links-crawlable" rel="noopener noreferrer"&gt;not crawlable&lt;/a&gt;. Google ignores URL fragments (&lt;code&gt;#&lt;/code&gt;), meaning SPAs using hash-based routing appear as a single URL. A &lt;a href="https://www.momentic.ai/blog/react-seo-case-study" rel="noopener noreferrer"&gt;case study documented by Momentic&lt;/a&gt; found that a React website lost &lt;strong&gt;51% of traffic&lt;/strong&gt; partly because "link types that were not crawlable" were implemented as click events rather than proper &lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Core Web Vitals suffer under client-side rendering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://web.dev/lcp/" rel="noopener noreferrer"&gt;Largest Contentful Paint (LCP)&lt;/a&gt;&lt;/strong&gt; typically performs poorly with CSR because content loads only after JavaScript execution. With pure client-side rendering, the LCP element doesn't exist in initial HTML—JavaScript must build the DOM first, creating significant render delays. The target is 2.5 seconds or less; CSR sites often exceed this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://web.dev/cls/" rel="noopener noreferrer"&gt;Cumulative Layout Shift (CLS)&lt;/a&gt;&lt;/strong&gt; increases as JavaScript-rendered content causes elements to shift during load. Brands optimizing their rendering approach report &lt;strong&gt;67% reduction in layout shifts&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Mobile-first indexing amplifies the problem
&lt;/h3&gt;

&lt;p&gt;Google primarily uses mobile Googlebot for indexing. Mobile devices have slower processors and limited bandwidth, making JavaScript execution significantly slower. Industry guidelines recommend keeping JavaScript bundles under &lt;strong&gt;100-170KB minified and gzipped&lt;/strong&gt; for initial load—a threshold many React applications exceed.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. AI search visibility is nearly zero
&lt;/h3&gt;

&lt;p&gt;Modern AI assistants like ChatGPT, Claude, and Perplexity rely on crawlers that don't execute JavaScript. &lt;a href="https://vercel.com/blog/how-search-engines-and-llms-see-your-site" rel="noopener noreferrer"&gt;Vercel research&lt;/a&gt; found that most AI crawlers "only fetch static HTML and bypass JavaScript." &lt;a href="https://docs.lovable.dev/tips-tricks/seo" rel="noopener noreferrer"&gt;Lovable's documentation&lt;/a&gt; acknowledges: "Many AI systems don't reliably see dynamically rendered content, so they may miss your pages or only see partial content."&lt;/p&gt;

&lt;h2&gt;
  
  
  Jan-Willem Bobbink's framework for JavaScript SEO
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://notprovided.eu/" rel="noopener noreferrer"&gt;Jan-Willem Bobbink&lt;/a&gt;, founder of notprovided.eu and an SEO consultant with &lt;strong&gt;30 years of web development experience&lt;/strong&gt;, has become a leading voice on JavaScript SEO. At &lt;a href="https://www.brightonseo.com/" rel="noopener noreferrer"&gt;BrightonSEO&lt;/a&gt; 2019, he presented findings from building 10 websites using the 10 most popular JavaScript frameworks—conducting hands-on testing rather than relying on client data alone.&lt;/p&gt;

&lt;p&gt;His observation that JavaScript framework adoption among clients jumped from &lt;strong&gt;28% in 2016 to 65% in 2019&lt;/strong&gt; underscores why this expertise matters. His &lt;a href="https://notprovided.eu/javascript-seo-recommendations/" rel="noopener noreferrer"&gt;ten core recommendations&lt;/a&gt; provide a practical framework for addressing JavaScript SEO challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bobbink's primary recommendation: server-side rendering
&lt;/h3&gt;

&lt;p&gt;"Server Side Rendering (SSR) is just the safest way to go," Bobbink states. "For SEO you just don't want to take a risk Google sees anything else than a fully SEO optimized page in the initial crawl." He specifically recommends &lt;strong&gt;&lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;&lt;/strong&gt; as an SEO-friendly framework for React development.&lt;/p&gt;

&lt;p&gt;His preferred approach is a &lt;strong&gt;hybrid model&lt;/strong&gt;: "Content and important elements for SEO are delivered as Server Side Rendered and then you sprinkle all the UX/CX improvements for the visitors as a Client Side Rendered 'layer.'"&lt;/p&gt;
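
&lt;p&gt;A toy rendering of that hybrid split: the server emits complete, crawlable HTML, and a small deferred script layers on client-side interactivity afterwards. Hand-rolled here for clarity; a real stack would use Next.js or similar.&lt;br&gt;
&lt;/p&gt;

```javascript
// Toy illustration of the hybrid model: SEO-critical markup is fully formed
// before any JavaScript runs; the CSR "layer" only enhances UX.
// (\u003c is the escaped less-than sign, used so the snippet stays feed-safe.)
function renderProductPage({ title, description }) {
  return [
    `\u003c!doctype html>\u003chtml>\u003chead>`,
    `\u003ctitle>${title}\u003c/title>`,
    `\u003cmeta name="description" content="${description}">`,
    `\u003c/head>\u003cbody>`,
    `\u003cmain>\u003ch1>${title}\u003c/h1>\u003cp>${description}\u003c/p>\u003c/main>`,
    `\u003cscript src="/enhance.js" defer>\u003c/script>`, // client-side layer, UX only
    `\u003c/body>\u003c/html>`,
  ].join("");
}
```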

&lt;h3&gt;
  
  
  Critical technical warnings from Bobbink
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data persistence creates ranking risks.&lt;/strong&gt; "Googlebot is crawling with a headless browser, not passing anything to the next successive URL request." Sites using cookies, local storage, or session data to populate SEO elements—like personalized product links—have lost rankings because crawlers don't carry this data between requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit test your SSR implementation.&lt;/strong&gt; Bobbink shared a case where broken SSR caused &lt;strong&gt;two weeks of visibility loss&lt;/strong&gt;. He recommends &lt;a href="https://jestjs.io/" rel="noopener noreferrer"&gt;Jest&lt;/a&gt; for Angular and React testing, and &lt;a href="https://test-utils.vuejs.org/" rel="noopener noreferrer"&gt;vue-test-utils&lt;/a&gt; for Vue applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor prerendering services for failures.&lt;/strong&gt; Services like &lt;a href="https://prerender.io/" rel="noopener noreferrer"&gt;Prerender.io&lt;/a&gt; can fail silently. He advocates monitoring tools like &lt;a href="https://www.contentkingapp.com/" rel="noopener noreferrer"&gt;ContentKing&lt;/a&gt;, &lt;a href="https://littlewarden.com/" rel="noopener noreferrer"&gt;Little Warden&lt;/a&gt;, &lt;a href="https://pagemodified.com/" rel="noopener noreferrer"&gt;PageModified&lt;/a&gt;, and &lt;a href="https://seoradar.com/" rel="noopener noreferrer"&gt;SEORadar&lt;/a&gt; to detect when rendered pages differ from expectations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Bobbink advises against dynamic rendering
&lt;/h3&gt;

&lt;p&gt;Despite Google historically promoting &lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering" rel="noopener noreferrer"&gt;dynamic rendering&lt;/a&gt;, Bobbink advises against it due to &lt;strong&gt;outdated content issues&lt;/strong&gt;. Cached rendered pages can serve stale prices, ratings, or stock information in rich snippets—creating poor user experiences and potential policy violations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions for improving Lovable.dev site indexability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Migrate to Next.js for proper SSR
&lt;/h3&gt;

&lt;p&gt;The most comprehensive solution involves exporting Lovable code to GitHub and converting to &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;. Tools like "&lt;a href="https://vitetonext.ai/" rel="noopener noreferrer"&gt;ViteToNext.AI&lt;/a&gt;" and "&lt;a href="https://github.com/engrafstudio/next-lovable" rel="noopener noreferrer"&gt;next-lovable&lt;/a&gt;" facilitate this migration. Next.js provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server-side rendering&lt;/strong&gt; via &lt;code&gt;getServerSideProps&lt;/code&gt; for dynamic content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static site generation&lt;/strong&gt; via &lt;code&gt;getStaticProps&lt;/code&gt; for content that doesn't change frequently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Static Regeneration (ISR)&lt;/strong&gt; for automatic page updates without full rebuilds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in metadata API&lt;/strong&gt; for proper SEO tags in initial HTML&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Native sitemap and robots.txt generation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off: Lovable's visual editor no longer functions after migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Implement prerendering services
&lt;/h3&gt;

&lt;p&gt;Prerendering services intercept crawler requests and serve pre-rendered HTML while users receive the normal JavaScript application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://prerender.io/" rel="noopener noreferrer"&gt;Prerender.io&lt;/a&gt;&lt;/strong&gt; (industry leader): Starts at $9/month for 3,000 renders, with average delivery time of 0.03 seconds. Supports Google, Bing, and AI crawlers. Requires Cloudflare Workers or similar proxy configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://lovablehtml.com/" rel="noopener noreferrer"&gt;LovableHTML&lt;/a&gt;&lt;/strong&gt;: Built specifically for Lovable.dev sites at $9+/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/GoogleChrome/rendertron" rel="noopener noreferrer"&gt;Rendertron&lt;/a&gt;&lt;/strong&gt;: Google's open-source solution. Free but requires self-hosting and DevOps expertise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Add SSR via Vike
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://vike.dev/" rel="noopener noreferrer"&gt;Vike&lt;/a&gt; (formerly vite-plugin-ssr) can add server-side rendering to existing Vite projects. This preserves the React Router structure but requires VPS deployment rather than Lovable's built-in hosting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 4: Islands architecture with Astro
&lt;/h3&gt;

&lt;p&gt;For content-heavy sites, &lt;a href="https://astro.build/" rel="noopener noreferrer"&gt;Astro&lt;/a&gt; provides an alternative approach: render pages as static HTML with isolated "islands" of interactivity that hydrate independently. This ships zero JavaScript by default, adding client-side code only where interactivity is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google's official recommendations for JavaScript sites
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics" rel="noopener noreferrer"&gt;Google Search Central documentation&lt;/a&gt;, updated in December 2025, provides clear guidance for JavaScript-heavy websites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic rendering is now deprecated as a long-term strategy.&lt;/strong&gt; Google &lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering" rel="noopener noreferrer"&gt;explicitly states&lt;/a&gt;: "Dynamic rendering was a workaround and not a long-term solution for problems with JavaScript-generated content in search engines. Instead, we recommend that you use server-side rendering, static rendering, or hydration as a solution."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't block JavaScript resources.&lt;/strong&gt; Ensure robots.txt allows all JavaScript files, CSS files, and API endpoints needed for rendering. Blocking these prevents Google from understanding pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use proper HTML links.&lt;/strong&gt; Links must be implemented as &lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; elements, not &lt;code&gt;&amp;lt;span onclick&amp;gt;&lt;/code&gt; or JavaScript event handlers. Google &lt;a href="https://developers.google.com/search/docs/crawling-indexing/links-crawlable" rel="noopener noreferrer"&gt;may not follow&lt;/a&gt; programmatically triggered navigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Place metadata in initial HTML.&lt;/strong&gt; Canonical URLs and robots directives should exist in server-rendered HTML. Google &lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/fix-search-javascript" rel="noopener noreferrer"&gt;advises&lt;/a&gt;: "You shouldn't use JavaScript to change the canonical URL to something else than the URL you specified as the canonical URL in the original HTML."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP status codes matter.&lt;/strong&gt; Google may skip JavaScript rendering entirely on pages that return non-200 status codes. Return a proper 404 for missing pages rather than serving a soft 404.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical implementation priorities for Lovable.dev users
&lt;/h2&gt;

&lt;p&gt;For sites where SEO is a primary growth channel, the recommended approach depends on project scale and resources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New SEO-critical project&lt;/td&gt;
&lt;td&gt;Build with &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; instead of Lovable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing Lovable site, limited budget&lt;/td&gt;
&lt;td&gt;Implement &lt;a href="https://prerender.io/" rel="noopener noreferrer"&gt;Prerender.io&lt;/a&gt; or &lt;a href="https://lovablehtml.com/" rel="noopener noreferrer"&gt;LovableHTML&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large site with development resources&lt;/td&gt;
&lt;td&gt;Migrate to Next.js with SSR/SSG hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content marketing focus&lt;/td&gt;
&lt;td&gt;Consider &lt;a href="https://astro.build/" rel="noopener noreferrer"&gt;Astro&lt;/a&gt; for static generation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://docs.lovable.dev/tips-tricks/seo" rel="noopener noreferrer"&gt;Lovable acknowledges&lt;/a&gt; that SSR "may help" for very large sites, projects where organic search is the primary growth channel, highly competitive verticals, and sites prioritizing AI/LLM visibility. For applications where SEO matters less than rapid development—internal tools, authenticated dashboards, or apps primarily shared via direct links—Lovable's CSR architecture presents fewer concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The core tension with &lt;a href="https://lovable.dev/" rel="noopener noreferrer"&gt;Lovable.dev&lt;/a&gt; is architectural: the platform optimizes for rapid full-stack application development using client-side rendering, while search engines and AI crawlers work best with server-rendered content. This isn't a bug but a fundamental trade-off inherent to the platform's design.&lt;/p&gt;

&lt;p&gt;The practical path forward depends on priorities. Teams needing strong SEO should either avoid Lovable.dev for those projects, implement prerendering services immediately, or plan for eventual migration to SSR frameworks like &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;. &lt;a href="https://notprovided.eu/" rel="noopener noreferrer"&gt;Jan-Willem Bobbink&lt;/a&gt;'s hybrid approach—server-rendered SEO elements with client-side UX enhancements—represents the industry consensus on balancing searchability with interactivity.&lt;/p&gt;

&lt;p&gt;As AI-powered search grows in importance, the inability of AI crawlers to execute JavaScript makes this problem increasingly urgent. Sites invisible to ChatGPT, Claude, and Perplexity miss a growing discovery channel. Google's &lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering" rel="noopener noreferrer"&gt;December 2025 deprecation of dynamic rendering&lt;/a&gt; as a long-term strategy signals that the search giant expects sites to solve JavaScript SEO at the source through proper SSR implementation rather than workarounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lovable.dev Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.lovable.dev/tips-tricks/seo" rel="noopener noreferrer"&gt;Lovable Docs – SEO&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.lovable.dev/features/tech-stack" rel="noopener noreferrer"&gt;Lovable Docs – Tech Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.lovable.dev/features/deploy" rel="noopener noreferrer"&gt;Lovable Docs – Deploying&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Google Official Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics" rel="noopener noreferrer"&gt;Google Search Central – JavaScript SEO Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/fix-search-javascript" rel="noopener noreferrer"&gt;Google Search Central – Fix JavaScript Issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering" rel="noopener noreferrer"&gt;Google Search Central – Dynamic Rendering (Deprecated)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/crawling-indexing/links-crawlable" rel="noopener noreferrer"&gt;Google Search Central – Links Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=PFEakcD3CSs" rel="noopener noreferrer"&gt;Tom Greenaway – Google I/O Talk on JavaScript Rendering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Research &amp;amp; Studies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.onely.com/blog/javascript-seo-experiment/" rel="noopener noreferrer"&gt;Onely – JavaScript SEO Experiment (9x Rendering Delay)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.momentic.ai/blog/react-seo-case-study" rel="noopener noreferrer"&gt;Momentic – React Website Lost 51% of Traffic (Case Study)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.screamingfrog.co.uk/javascript-seo/" rel="noopener noreferrer"&gt;Screaming Frog – JavaScript &amp;amp; Bing Indexing Issues&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bing &amp;amp; Other Search Engines
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.bing.com/webmasters/help/webmasters-guidelines-30fba23a" rel="noopener noreferrer"&gt;Bing Webmaster Guidelines – Dynamic Rendering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.bing.com/webmaster/2024/javascript-seo-recommendations" rel="noopener noreferrer"&gt;Bing Webmaster Blog – JavaScript SEO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Search &amp;amp; Crawlers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://vercel.com/blog/how-search-engines-and-llms-see-your-site" rel="noopener noreferrer"&gt;Vercel – AI Crawlers and Static HTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.searchenginejournal.com/structured-data-javascript-ai-crawlers/" rel="noopener noreferrer"&gt;Search Engine Journal – Structured Data &amp;amp; AI Crawlers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools &amp;amp; Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prerender.io/" rel="noopener noreferrer"&gt;Prerender.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lovablehtml.com/" rel="noopener noreferrer"&gt;LovableHTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/GoogleChrome/rendertron" rel="noopener noreferrer"&gt;Rendertron (Google Open Source)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vike.dev/" rel="noopener noreferrer"&gt;Vike (formerly vite-plugin-ssr)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://astro.build/" rel="noopener noreferrer"&gt;Astro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vitetonext.ai/" rel="noopener noreferrer"&gt;ViteToNext.AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/engrafstudio/next-lovable" rel="noopener noreferrer"&gt;next-lovable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SEO Monitoring Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.contentkingapp.com/" rel="noopener noreferrer"&gt;ContentKing (now Conductor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://littlewarden.com/" rel="noopener noreferrer"&gt;Little Warden&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pagemodified.com/" rel="noopener noreferrer"&gt;PageModified&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://seoradar.com/" rel="noopener noreferrer"&gt;SEORadar&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Jan-Willem Bobbink
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://notprovided.eu/" rel="noopener noreferrer"&gt;notprovided.eu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brightonseo.com/" rel="noopener noreferrer"&gt;BrightonSEO 2019 – JavaScript SEO Presentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://notprovided.eu/javascript-seo-recommendations/" rel="noopener noreferrer"&gt;10 Recommendations for JavaScript SEO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Web Vitals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://web.dev/vitals/" rel="noopener noreferrer"&gt;Google Web Vitals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/lcp/" rel="noopener noreferrer"&gt;Largest Contentful Paint (LCP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/cls/" rel="noopener noreferrer"&gt;Cumulative Layout Shift (CLS)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>javascript</category>
      <category>llm</category>
      <category>react</category>
    </item>
    <item>
      <title>17 common SEO mistakes LLMs and vibecoders make</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Tue, 13 Jan 2026 17:19:54 +0000</pubDate>
      <link>https://dev.to/jbobbink/17-common-seo-mistakes-llms-and-vibecoders-make-2h9j</link>
      <guid>https://dev.to/jbobbink/17-common-seo-mistakes-llms-and-vibecoders-make-2h9j</guid>
      <description>&lt;p&gt;The rise of AI-assisted development has democratized coding like never before. Anyone can spin up a SaaS, build a landing page, or create a web app by prompting their way to a working product. But here's the uncomfortable truth: most of these projects are SEO disasters waiting to happen.&lt;/p&gt;

&lt;p&gt;LLMs don't inherently understand SEO. They generate code that &lt;em&gt;works&lt;/em&gt;, not code that &lt;em&gt;ranks&lt;/em&gt;. And vibecoders (developers who ship by feel rather than fundamentals) often lack the technical SEO knowledge to catch these issues before they tank their organic traffic.&lt;/p&gt;

&lt;p&gt;After analyzing countless AI-generated codebases and vibe-coded projects, here are the 17 most common SEO mistakes I see repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Client-side rendering without SSR or SSG
&lt;/h2&gt;

&lt;p&gt;This is the big one. LLMs default to whatever framework is most popular in their training data, which often means React SPAs with client-side rendering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What the LLM generates&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;BlogPost&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;slug&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setPost&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/posts/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setPost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;article&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;article&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Googlebot will see an empty &lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt; tag. Your content doesn't exist until JavaScript executes, and while Google &lt;em&gt;can&lt;/em&gt; render JavaScript, it's slow, unreliable, and puts you at a significant disadvantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Use Next.js with &lt;code&gt;getStaticProps&lt;/code&gt; or &lt;code&gt;getServerSideProps&lt;/code&gt;, Nuxt with SSR, or Astro for content-heavy sites. If you must use a SPA, implement pre-rendering or dynamic rendering for crawlers.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Hash-based or query parameter routing
&lt;/h2&gt;

&lt;p&gt;LLMs often generate routing patterns that are technically functional but SEO-hostile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Terrible for SEO&lt;/span&gt;
&lt;span class="nx"&gt;yoursite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;
&lt;span class="nx"&gt;yoursite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;

&lt;span class="c1"&gt;// What you actually need&lt;/span&gt;
&lt;span class="nx"&gt;yoursite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hash fragments (&lt;code&gt;#&lt;/code&gt;) are completely ignored by search engines. Query parameters create duplicate content issues and look spammy to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Always use clean, semantic URL paths. Configure your framework's router for history-based navigation, not hash-based.&lt;/p&gt;
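&lt;p&gt;A minimal sketch of the history-based setup, assuming React Router v6.4+ (the route paths and module names are illustrative):&lt;/p&gt;

```javascript
// createBrowserRouter uses the History API and clean URL paths;
// createHashRouter is what produces the crawler-hostile /#/ URLs above.
import { createBrowserRouter } from 'react-router-dom';

const router = createBrowserRouter([
  { path: '/', lazy: () => import('./routes/home') },
  { path: '/blog/:slug', lazy: () => import('./routes/blog-post') },
]);
```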

&lt;h2&gt;
  
  
  3. Auto-generated slugs without human review
&lt;/h2&gt;

&lt;p&gt;When LLMs generate content management systems, they typically create slugs from titles automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// "10 Best Ways to Optimize Your Website!!!" becomes&lt;/span&gt;
&lt;span class="c1"&gt;// "10-best-ways-to-optimize-your-website!!!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces slugs with special characters, excessive length, and no keyword optimization. Worse, if you change a title, many systems regenerate the slug, breaking existing links without redirects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Generate slugs as suggestions, but require human approval. Strip special characters, limit length to 60 characters, and implement automatic 301 redirects when slugs change.&lt;/p&gt;
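&lt;p&gt;A minimal sketch of such a helper (the character whitelist and 60-character cap mirror the advice above; names are illustrative):&lt;/p&gt;

```javascript
// Hypothetical slug helper: lowercases, strips special characters,
// collapses whitespace to hyphens, and caps length at 60 characters.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')  // drop punctuation and symbols
    .trim()
    .replace(/[\s-]+/g, '-')       // collapse runs of spaces/hyphens
    .slice(0, 60)
    .replace(/-$/, '');            // no trailing hyphen after truncation
}

console.log(slugify('10 Best Ways to Optimize Your Website!!!'));
// logs: 10-best-ways-to-optimize-your-website
```

&lt;p&gt;Treat the result as a suggestion for an editor to approve, and keep the old slug in a redirect table whenever it changes.&lt;/p&gt;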

&lt;h2&gt;
  
  
  4. Missing or duplicate meta tags
&lt;/h2&gt;

&lt;p&gt;Ask an LLM to build you a blog and you'll often get pages with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No meta description at all&lt;/li&gt;
&lt;li&gt;The same title tag on every page&lt;/li&gt;
&lt;li&gt;Titles that exceed 60 characters and get truncated&lt;/li&gt;
&lt;li&gt;Meta descriptions that are either missing or auto-truncated content
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- What you get --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;My Blog&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- What you need --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;How to fix Core Web Vitals issues in 2024 | Your Brand&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"Learn the 7 most effective techniques to improve LCP, CLS, and INP scores. Includes code examples and before/after case studies."&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Build meta tag management into your content model from day one. Every page needs a unique, optimized title (50-60 chars) and description (150-160 chars).&lt;/p&gt;
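&lt;p&gt;A build-time check along these lines catches the most common failures (missing or over-long tags); the function and thresholds are a sketch based on the ranges above:&lt;/p&gt;

```javascript
// Hypothetical lint step: flag pages whose title or description
// is missing or exceeds the commonly cited length limits.
function checkMeta({ title, description }) {
  const issues = [];
  if (title.length === 0) issues.push('missing title');
  if (title.length > 60) issues.push('title over 60 chars');
  if (description.length === 0) issues.push('missing description');
  if (description.length > 160) issues.push('description over 160 chars');
  return issues;
}

console.log(checkMeta({ title: 'My Blog', description: '' }));
// logs: [ 'missing description' ]
```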

&lt;h2&gt;
  
  
  5. No canonical URLs
&lt;/h2&gt;

&lt;p&gt;Duplicate content is the silent killer of SEO. LLMs rarely implement canonical tags, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;yoursite.com/blog/post&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;yoursite.com/blog/post/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;yoursite.com/blog/post?utm_source=twitter&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;www.yoursite.com/blog/post&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All compete against each other, diluting your ranking signals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"canonical"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://yoursite.com/blog/post"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Implement canonical tags on every page. Pick one URL format (with or without trailing slash) and stick to it. Configure your server to redirect all variations to the canonical version.&lt;/p&gt;
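&lt;p&gt;A sketch of a normalizer that enforces one canonical form per URL (the tracking-parameter list is illustrative; this example picks the no-trailing-slash convention):&lt;/p&gt;

```javascript
// Hypothetical normalizer: https, no trailing slash, no fragment,
// tracking parameters stripped. Use its output in the canonical tag
// and as the target of your server-side 301 redirects.
function canonicalUrl(input) {
  const url = new URL(input);
  url.protocol = 'https:';
  url.hash = '';
  // strip common tracking parameters (illustrative list)
  for (const key of ['utm_source', 'utm_medium', 'utm_campaign']) {
    url.searchParams.delete(key);
  }
  if (url.pathname.length > 1) {
    url.pathname = url.pathname.replace(/\/+$/, '');
  }
  return url.toString();
}

console.log(canonicalUrl('http://yoursite.com/blog/post/?utm_source=twitter'));
// logs: https://yoursite.com/blog/post
```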

&lt;h2&gt;
  
  
  6. Completely ignoring internal linking
&lt;/h2&gt;

&lt;p&gt;LLMs generate isolated pages. They don't understand your content architecture or how pages should relate to each other. You end up with blog posts that link nowhere, category pages that don't link to their children, and pillar content that doesn't establish topical authority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Design your internal linking architecture deliberately. Every piece of content should link to 3-5 related pieces. Important pages should receive more internal links. Use descriptive anchor text, not "click here."&lt;/p&gt;
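&lt;p&gt;One simple way to make the 3-5 related links automatic is tag overlap; a sketch (field names are illustrative):&lt;/p&gt;

```javascript
// Hypothetical helper: rank other posts by how many tags they share
// with the current one, so every article links to its closest siblings.
function relatedPosts(current, allPosts, max = 5) {
  return allPosts
    .filter(p => p.slug !== current.slug)
    .map(p => ({
      post: p,
      shared: p.tags.filter(t => current.tags.includes(t)).length,
    }))
    .filter(entry => entry.shared > 0)
    .sort((a, b) => b.shared - a.shared)
    .slice(0, max)
    .map(entry => entry.post);
}

const posts = [
  { slug: 'a', tags: ['seo', 'react'] },
  { slug: 'b', tags: ['seo'] },
  { slug: 'c', tags: ['cooking'] },
];
console.log(relatedPosts(posts[0], posts).map(p => p.slug));
// logs: [ 'b' ]
```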

&lt;h2&gt;
  
  
  7. Invalid or incorrect schema markup
&lt;/h2&gt;

&lt;p&gt;When LLMs attempt structured data, they often produce schema that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syntactically invalid JSON-LD&lt;/li&gt;
&lt;li&gt;Using deprecated schema types&lt;/li&gt;
&lt;li&gt;Missing required properties&lt;/li&gt;
&lt;li&gt;Semantically incorrect (marking a blog post as a Product)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// LLM-generated mess&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Article&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;// Wrong: should be a Person object&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;datePublished&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;January 5, 2024&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;// Wrong: needs ISO 8601 format&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Validate all schema markup with Google's Rich Results Test. Use the correct types: &lt;code&gt;BlogPosting&lt;/code&gt; for blog posts, &lt;code&gt;Article&lt;/code&gt; for news, &lt;code&gt;Product&lt;/code&gt; for products. Include all required and recommended properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Hallucinated facts and statistics
&lt;/h2&gt;

&lt;p&gt;This is an LLM-specific problem that creates both credibility and potential legal issues. LLMs confidently generate statistics, quotes, and "studies" that don't exist:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"According to a 2023 Stanford study, 73% of websites with proper schema markup see a 45% increase in click-through rates."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That study doesn't exist. That statistic was invented. And when your content is full of hallucinated facts, it destroys E-E-A-T signals and can get you penalized for misinformation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Fact-check every statistic, quote, and claim in AI-generated content. Link to primary sources. Remove anything you can't verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. No robots.txt or sitemap.xml
&lt;/h2&gt;

&lt;p&gt;LLMs build features, not infrastructure. They won't remind you that search engines need a roadmap to your site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// robots.txt you need
User-agent: *
Disallow: /admin/
Disallow: /api/
Sitemap: https://yoursite.com/sitemap.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a sitemap, Google has to discover pages through crawling alone, which may never find your deeper pages. Without robots.txt, you might be letting bots crawl your API endpoints and admin panels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Generate a dynamic sitemap.xml that updates when content changes. Include lastmod dates. Create a robots.txt that guides crawlers to what matters.&lt;/p&gt;
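&lt;p&gt;The sitemap itself is easiest to keep fresh if entries are built as data from your content store and serialized at request time; a sketch (the serializer is omitted, and field names mirror the sitemap protocol's &lt;code&gt;loc&lt;/code&gt; and &lt;code&gt;lastmod&lt;/code&gt;):&lt;/p&gt;

```javascript
// Sketch: build sitemap entries from the content store; a serializer
// (omitted here) renders them to XML on each request, so lastmod
// always reflects the latest edit.
function sitemapEntries(posts, baseUrl) {
  return posts.map(post => ({
    loc: baseUrl + '/blog/' + post.slug,
    lastmod: post.updatedAt.slice(0, 10), // ISO date portion, e.g. 2024-01-05
  }));
}

console.log(sitemapEntries(
  [{ slug: 'my-post', updatedAt: '2024-01-05T10:00:00Z' }],
  'https://yoursite.com'
));
// logs: [ { loc: 'https://yoursite.com/blog/my-post', lastmod: '2024-01-05' } ]
```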

&lt;h2&gt;
  
  
  10. Images without alt text or optimization
&lt;/h2&gt;

&lt;p&gt;AI-generated code typically handles images like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;img&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No alt text. No width/height (causing CLS). No lazy loading. Massive unoptimized files. No next-gen formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Every image needs descriptive alt text for accessibility and image search. Specify dimensions to prevent layout shift. Use WebP/AVIF formats. Implement lazy loading for below-the-fold images.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Broken heading hierarchy
&lt;/h2&gt;

&lt;p&gt;LLMs choose heading levels based on visual size, not document structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h3&amp;gt;&lt;/span&gt;Main Page Title&lt;span class="nt"&gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;  &lt;span class="c"&gt;&amp;lt;!-- Should be h1 --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;A Section Header&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt; &lt;span class="c"&gt;&amp;lt;!-- Should be h2 --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;h4&amp;gt;&lt;/span&gt;Subsection&lt;span class="nt"&gt;&amp;lt;/h4&amp;gt;&lt;/span&gt;       &lt;span class="c"&gt;&amp;lt;!-- Should be h3 --&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or worse, multiple H1 tags on a single page because the developer wanted multiple "big text" elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Every page gets exactly one H1. Headings follow logical order: H1 → H2 → H3. Never skip levels. Use CSS for styling, not heading tags.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Ignoring Core Web Vitals
&lt;/h2&gt;

&lt;p&gt;Vibecoders ship features. Core Web Vitals are an afterthought, if they're thought of at all. Common issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LCP (Largest Contentful Paint):&lt;/strong&gt; Hero images that take 8 seconds to load because nobody optimized them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLS (Cumulative Layout Shift):&lt;/strong&gt; Ads, images, and fonts that shift content as they load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INP (Interaction to Next Paint):&lt;/strong&gt; JavaScript bundles so large that clicks take 500ms to register&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Test with PageSpeed Insights before shipping. Lazy load below-the-fold content. Optimize your critical rendering path. Reserve space for dynamic content.&lt;/p&gt;
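&lt;p&gt;To build intuition for the CLS number, here is a deliberately simplified sketch of how layout-shift entries combine into a score. Real CLS groups shifts into session windows and reports the worst window; this version just sums shifts that were not caused by recent user input:&lt;/p&gt;

```javascript
// Simplified CLS-style score: sum layout-shift values, excluding
// shifts that follow recent user input. (The real metric uses
// session windows and takes the worst one.)
function clsScore(entries) {
  return entries
    .filter(e => !e.hadRecentInput)
    .reduce((sum, e) => sum + e.value, 0);
}

console.log(clsScore([
  { value: 0.25, hadRecentInput: false },
  { value: 0.5, hadRecentInput: true },  // user-initiated, excluded
  { value: 0.25, hadRecentInput: false },
]));
// logs: 0.5
```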

&lt;h2&gt;
  
  
  13. JavaScript-dependent critical content
&lt;/h2&gt;

&lt;p&gt;Beyond the CSR problem, LLMs often put critical content behind JavaScript interactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Your important content is hidden until user clicks&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Accordion&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Product Features"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;All your keyword-rich content here&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Accordion&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Content inside collapsed accordions, tabs, or "read more" sections may be deprioritized or ignored by search engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Important content should be visible in the initial HTML. If you must use interactive elements, ensure the content is in the DOM on page load, just visually hidden.&lt;/p&gt;

&lt;h2&gt;
  
  
  14. No mobile optimization
&lt;/h2&gt;

&lt;p&gt;LLMs are trained on desktop-centric code. Mobile is an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixed widths instead of responsive layouts&lt;/li&gt;
&lt;li&gt;Tiny tap targets&lt;/li&gt;
&lt;li&gt;Horizontal scrolling on mobile&lt;/li&gt;
&lt;li&gt;Text too small to read without zooming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google uses mobile-first indexing. If your mobile experience is broken, your rankings suffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Design mobile-first. Test on real devices. Use responsive images. Ensure tap targets are at least 48x48px.&lt;/p&gt;

&lt;h2&gt;
  
  
  15. Missing or wrong hreflang tags
&lt;/h2&gt;

&lt;p&gt;When LLMs build multilingual sites, they either ignore hreflang entirely or implement it incorrectly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- Common mistakes --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"alternate"&lt;/span&gt; &lt;span class="na"&gt;hreflang=&lt;/span&gt;&lt;span class="s"&gt;"english"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;  &lt;span class="c"&gt;&amp;lt;!-- Wrong: should be "en" --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"alternate"&lt;/span&gt; &lt;span class="na"&gt;hreflang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;       &lt;span class="c"&gt;&amp;lt;!-- Missing: x-default --&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- Also missing: the return links on the other language versions --&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Use proper language codes (en, en-US, de-DE). Always include x-default. Ensure every page in the hreflang set references all other pages, including itself.&lt;/p&gt;
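&lt;p&gt;To make those rules concrete, here is a small sketch (not from any library; the entry shape is my own) that sanity-checks a single page's hreflang set for invalid language codes and a missing x-default. Reciprocity across the whole set still has to be verified per page.&lt;/p&gt;

```javascript
// Sketch: sanity-check one page's hreflang entries.
// Entry shape is illustrative: { code: 'en-US', href: 'https://example.com/en-us/' }
const VALID_CODE = /^([a-z]{2}(-[A-Z]{2})?|x-default)$/;

function validateHreflangSet(entries) {
  const problems = [];
  for (const e of entries) {
    // 'english' fails here; 'en', 'en-US', 'de-DE' and 'x-default' pass
    if (!VALID_CODE.test(e.code)) problems.push('invalid language code: ' + e.code);
  }
  if (!entries.some(e => e.code === 'x-default')) {
    problems.push('missing x-default entry');
  }
  return problems;
}
```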

&lt;h2&gt;
  
  
  16. Pagination done wrong
&lt;/h2&gt;

&lt;p&gt;LLMs generate infinite scroll because it's trendy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Infinite scroll that search engines can't follow&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;InfiniteScroll&lt;/span&gt; &lt;span class="na"&gt;loadMore&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fetchNextPage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;PostCard&lt;/span&gt; &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;InfiniteScroll&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or they create paginated content without proper linking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- What's missing --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"next"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/blog?page=2"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"prev"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/blog"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Provide crawlable pagination with static links. Consider a "load more" button that appends to existing content rather than replacing it. Ensure all pages are accessible via links, not just JavaScript.&lt;/p&gt;
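&lt;p&gt;As a minimal sketch of the fix (URL scheme assumed, not prescribed): generate plain page URLs server-side so every page is reachable through static links, and let any infinite scroll enhance rather than replace them.&lt;/p&gt;

```javascript
// Sketch: crawlable pagination URLs for a listing with a known item count.
function pageUrls(basePath, totalItems, perPage) {
  const pages = Math.max(1, Math.ceil(totalItems / perPage));
  const urls = [];
  for (let i = 1; i !== pages + 1; i++) {
    // First page lives at the base path itself; others get ?page=N
    urls.push(i === 1 ? basePath : basePath + '?page=' + i);
  }
  return urls;
}
```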

&lt;h2&gt;
  
  
  17. Zero consideration for page speed
&lt;/h2&gt;

&lt;p&gt;The default LLM stack is bloated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import the entire Lodash library for one function&lt;/li&gt;
&lt;li&gt;Include three animation libraries&lt;/li&gt;
&lt;li&gt;Bundle fonts you're not using&lt;/li&gt;
&lt;li&gt;No code splitting&lt;/li&gt;
&lt;li&gt;No tree shaking&lt;/li&gt;
&lt;li&gt;Synchronous third-party scripts blocking render
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What LLMs generate&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;_&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lodash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sortBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// What you need&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;sortBy&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lodash/sortBy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// Or just: items.sort((a, b) =&amp;gt; a.date - b.date);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Audit your bundle size regularly. Use dynamic imports for heavy components. Lazy load third-party scripts. Question every dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause
&lt;/h2&gt;

&lt;p&gt;These mistakes share a common origin: LLMs optimize for "does it work?" not "will it rank?"&lt;/p&gt;

&lt;p&gt;SEO isn't a feature you bolt on later. It's architectural. By the time you realize your React SPA isn't indexing properly, you're looking at a significant rewrite, not a quick fix.&lt;/p&gt;

&lt;p&gt;The vibecoders shipping MVPs without SEO fundamentals are building on sand. They'll get traffic from Product Hunt and Hacker News, wonder why organic never materializes, and blame "SEO takes time" rather than examining their technical foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;If you're using AI to build web projects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Specify SEO requirements upfront.&lt;/strong&gt; Tell the LLM you need SSR, semantic URLs, and proper meta tags before it generates code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use SEO-first frameworks.&lt;/strong&gt; Next.js, Nuxt, Astro, and SvelteKit have good defaults. Vanilla React SPAs don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit before launch.&lt;/strong&gt; Run Lighthouse, check your rendered HTML, validate your schema, test your mobile experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor continuously.&lt;/strong&gt; Set up Google Search Console. Track your Core Web Vitals. Watch for indexing issues.&lt;/li&gt;
&lt;/ol&gt;
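&lt;p&gt;The audit step can be as small as a script in CI. The sketch below is illustrative (the field names are mine, not from any tool); feed it whatever your crawler or build step extracts from the rendered page.&lt;/p&gt;

```javascript
// Sketch: fail the build when basic SEO elements are missing from a rendered page.
function auditPage(page) {
  const problems = [];
  if (!page.title) problems.push('missing title');
  if (!page.metaDescription) problems.push('missing meta description');
  if (!page.h1) problems.push('missing h1');
  if (!page.canonical) problems.push('missing canonical');
  return problems;
}
```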

&lt;p&gt;The bar for "working websites" is low. The bar for "websites that rank" or "websites that show up in LLMs" is much higher. Know the difference.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>WordPress plugin: Track ChatGPT Hits</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Mon, 11 Dec 2023 17:44:13 +0000</pubDate>
      <link>https://dev.to/jbobbink/wordpress-plugin-track-chatgpt-hits-1h4</link>
      <guid>https://dev.to/jbobbink/wordpress-plugin-track-chatgpt-hits-1h4</guid>
      <description>&lt;p&gt;Due to high demand I decided to make a user friendly version of tracking known OpenAI bot hits. This WordPress plugin tracks URL requests by the ChatGPT / OpenAI bots and direct user actions by tracking request made by specific user agents.&lt;/p&gt;

&lt;p&gt;You can download the plugin from &lt;a href="https://www.notprovided.eu/track-chatgpt/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Simply upload the track-chatgpt folder to your plugin directory via sFTP, or use the WordPress plugin interface to upload the ZIP.&lt;/p&gt;

&lt;p&gt;There are currently two known user agents and a small set of IP addresses that can be used to check whether requests genuinely come from OpenAI. The plugin shows whether each request came from a verified source (a valid request) or not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ro47kt2jev1y54vw6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ro47kt2jev1y54vw6h.png" alt="WordPress plugin to track ChatGPT behaviour" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plugin shows a graph of hits over the past 28 days and includes a download function to export the full dataset at once.&lt;/p&gt;

&lt;p&gt;A REST API is also enabled, so you can connect via &lt;em&gt;/wp-json/chatgpt-tracker/v1/download-data/&lt;/em&gt; and set up automated exports to an external database to feed the data into your monitoring dashboards. It's a simple dump of the full dataset; I may extend this feature in the future if there is enough interest.&lt;/p&gt;
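&lt;p&gt;A small export script could start like this; only the endpoint path comes from the plugin, the rest is illustrative.&lt;/p&gt;

```javascript
// Sketch: build the plugin's REST endpoint URL for a given site.
function trackerEndpoint(siteUrl) {
  return siteUrl.replace(/\/$/, '') + '/wp-json/chatgpt-tracker/v1/download-data/';
}

// Usage, in an environment where fetch is available:
// const data = await fetch(trackerEndpoint('https://example.com')).then(r => r.json());
```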

&lt;p&gt;Any feature requests? PM me on &lt;a href="https://www.linkedin.com/in/jbobbink/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/jbobbink" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or leave a comment at the bottom.&lt;/p&gt;

&lt;p&gt;For updates (for example, if user agents or IPs change), follow me on LinkedIn or Twitter. I'm trying to get the plugin into the official WordPress repository as soon as possible, which will enable auto-updates too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How can I test if the tracker works?&lt;/strong&gt;&lt;br&gt;
The easiest way to test the plugin is to go to the GPT-4 interface in ChatGPT and ask it to summarize one of the latest URLs on your website. Make sure it actually requests the URL: Bing may already have crawled and stored the contents, in which case that copy will be used instead of visiting the live URL. Confirm that it actually shows it is browsing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3x54aln34po8v90zyz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3x54aln34po8v90zyz1.png" alt="ChatGPT browsing functionality" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also set your own browser's user agent to a string containing GPTBot or ChatGPT. You will notice those hits are documented as invalid, since the IP address will not match OpenAI's.&lt;/p&gt;
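&lt;p&gt;Conceptually, the verification boils down to something like the sketch below. The IP prefixes shown are documentation placeholders, not OpenAI's real ranges; always check against the currently published list.&lt;/p&gt;

```javascript
// Sketch: a hit is "verified" only if both the user agent and the IP match.
const BOT_UA = /(GPTBot|ChatGPT-User)/;
const KNOWN_PREFIXES = ['192.0.2.', '198.51.100.']; // placeholders, not real ranges

function isVerifiedOpenAIHit(userAgent, ip) {
  if (!BOT_UA.test(userAgent)) return false;
  return KNOWN_PREFIXES.some(p => ip.startsWith(p));
}
```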

&lt;p&gt;&lt;strong&gt;Does it have any privacy related impact?&lt;/strong&gt;&lt;br&gt;
No, it doesn't impact anything privacy-related, since the plugin only tracks and documents user agents and IP addresses from validated sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between crawling and browsing behaviour?&lt;/strong&gt;&lt;br&gt;
More information about the behaviour of the different bots can be found in the documentation for &lt;a href="https://platform.openai.com/docs/plugins/bot" rel="noopener noreferrer"&gt;ChatGPT-User&lt;/a&gt; (browsing) and &lt;a href="https://platform.openai.com/docs/gptbot" rel="noopener noreferrer"&gt;GPTBot&lt;/a&gt; (crawling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use this data in external reporting?&lt;/strong&gt;&lt;br&gt;
Yes, you definitely can: use the REST API. The plugin has a dedicated endpoint enabled for exactly this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Known issues / Feature requests
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Feature: referral traffic: do people click through to your website from the URLs mentioned on chat.openai.com?&lt;/li&gt;
&lt;li&gt;Issue: when your site uses a CDN like Cloudflare, the plugin reports the CDN's IP addresses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Changelog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;= 1.0 =&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added a download functionality&lt;/li&gt;
&lt;li&gt;Added a simple graph plotting the last 28 days of hits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;= 0.5 =&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic functionality&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>wordpress</category>
      <category>seo</category>
      <category>chatgpt</category>
      <category>bing</category>
    </item>
    <item>
      <title>What I learned about SEO from using the 10 most used JS frameworks</title>
      <dc:creator>Jan-Willem Bobbink</dc:creator>
      <pubDate>Thu, 06 Feb 2020 15:32:20 +0000</pubDate>
      <link>https://dev.to/jbobbink/what-i-learned-about-seo-from-using-the-10-most-used-js-frameworks-4alk</link>
      <guid>https://dev.to/jbobbink/what-i-learned-about-seo-from-using-the-10-most-used-js-frameworks-4alk</guid>
      <description>&lt;p&gt;JavaScript will define and impact the future of most SEO consultants. A big chunk of websites has, is or will move over to a JS framework driven platform. Stack Overflow published an extensive study about the data gathered from an enquiry amongst more than 100.000 professional programmers’ most used Programming, Scripting and Markup languages: read more at &lt;a href="https://insights.stackoverflow.com/survey/2018#most-popular-technologies" rel="noopener noreferrer"&gt;Most Popular Technologies&lt;/a&gt; The outcome is quite clear, it’s all about JavaScript today:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3i1fv0x233kadhzr5b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3i1fv0x233kadhzr5b.jpg" alt="Programming, Scripting and Markup languages" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But JavaScript and search engines are a tricky combination. It turns out there is a fine line between successful and disastrous implementations. Below I share 10 tips to prevent SEO disasters on your own or your clients' sites.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Always go for Server Side Rendering (SSR)
&lt;/h2&gt;

&lt;p&gt;As Google shared earlier this year at Google I/O, the pipeline for crawling, indexing and rendering is somewhat different from the original pipeline. Check out &lt;a href="https://web.dev/javascript-and-google-search-io-2019" rel="noopener noreferrer"&gt;https://web.dev/javascript-and-google-search-io-2019&lt;/a&gt; for more context, but the diagram below is clear enough to start with: there is a separate track, also known as the second wave, where the rendering of JavaScript takes place. To make sure Google has URLs to process and return to the crawl queue, the initial HTML response needs to include all HTML elements relevant for SEO: at minimum, the basic page elements that show up in SERPs, plus links. It's always about links, right? 🙂&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fan9n1ahlwswoe1l3oguh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fan9n1ahlwswoe1l3oguh.png" alt="JavaScript and Google" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google showed numerous setups in their article about rendering on the web, but forgot to include the SEO perspective. That made me publish an alternative table: read more at &lt;a href="https://www.notprovided.eu/rendering-on-the-web-the-seo-version/" rel="noopener noreferrer"&gt;https://www.notprovided.eu/rendering-on-the-web-the-seo-version/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yhjnm6aafj9ky1bpdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yhjnm6aafj9ky1bpdh.png" alt="Real version: JavaScript and Google" width="660" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Server Side Rendering (SSR) is simply the safest way to go. There are cons, but for SEO you don't want to risk Google seeing anything other than a fully optimized page in the initial crawl. And don't forget that even the most advanced search engine, Google, can't handle client-side JavaScript flawlessly; how about all the other search engines like Baidu, Naver and Bing?&lt;/p&gt;

&lt;p&gt;Since Google openly admits there are some challenges ahead, they have been sharing dynamic rendering setups: pick the most suitable rendering scenario for a specific group of users (users on low-CPU mobile phones, for example) or for bots. An example setup: serve the client-side rendered version to most users (but not to old browsers, non-JS users, slow mobile phones, etcetera) and send search engine bots and social media crawlers the fully static, rendered HTML version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63tgd04f7nhfszfysq6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63tgd04f7nhfszfysq6f.png" alt="Dynamic Renderer" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whatever Google tells us, read &lt;a href="https://ja.dev/entry/blog/nagayama/render-budget-en" rel="noopener noreferrer"&gt;Render Budget, or: How I Stopped Worrying and Learned to Render Server-Side&lt;/a&gt; by a former Google engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tools for checking what search engines do and don’t see
&lt;/h2&gt;

&lt;p&gt;Since most dynamic rendering setups match on user agents, changing the user agent directly in Chrome is the first thing I always do. Is this 100% proof? No; some setups also match on IPs. But I would target the SSR as broadly as possible; also think about social media crawlers wanting to capture OpenGraph tags, for example. Matching only a narrow combination of IPs and user agents will not cover enough. Better to cover too many requests, and spend some more money on servers pushing out rendered HTML, than to miss out on what specific platforms could do with your pages.&lt;/p&gt;

&lt;p&gt;The next thing to check is whether users, bots and other requests get back the same content elements and directives. I've seen examples where Googlebot got different titles, H1 headings and content blocks than what users got to see. A nice Chrome plugin is &lt;a href="https://chrome.google.com/webstore/detail/view-rendered-source/ejgngohbdedoabanmclafpkoogegdpob/" rel="noopener noreferrer"&gt;View Rendered Source&lt;/a&gt;, which compares the fetched and rendered versions directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepv74w809d0y9il1s26g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepv74w809d0y9il1s26g.png" alt="View Rendered Source" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;
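&lt;p&gt;If you script this check yourself, the comparison itself is trivial: extract the same elements from both responses and diff them (field names here are illustrative).&lt;/p&gt;

```javascript
// Sketch: list the SEO elements that differ between the user-facing and
// bot-facing versions of a page.
function diffSeoElements(userVersion, botVersion) {
  const keys = new Set([...Object.keys(userVersion), ...Object.keys(botVersion)]);
  return [...keys].filter(k => userVersion[k] !== botVersion[k]);
}
```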

&lt;p&gt;If you have access to a domain in Google Search Console, of course use the inspection tool. It now also uses an evergreen Googlebot version (like all other Google Search tools) so it represents what Google will actually see during crawling. Check the HTML and screenshots to be sure every important element is covered and is filled with the correct information.&lt;/p&gt;

&lt;p&gt;Want to check URLs you don't own? Use the Rich Results Test at &lt;a href="https://search.google.com/test/rich-results" rel="noopener noreferrer"&gt;https://search.google.com/test/rich-results&lt;/a&gt;, which also shows the rendered HTML. You can check the mobile and desktop versions separately to double-check that there are no differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The minimal requirement for initial HTML response
&lt;/h2&gt;

&lt;p&gt;It is a simple list of search engine optimization basics, but important for SEO results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title and meta tags&lt;/li&gt;
&lt;li&gt;Indexing and crawling directives, canonical references and hreflang annotations&lt;/li&gt;
&lt;li&gt;All textual content, including a semantically structured set of Hx headings&lt;/li&gt;
&lt;li&gt;Structured data markup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lazy loading: surely a best practice in modern performance optimization, but it turns out that for things like mobile SERP thumbnails and the Google Discover feed, Googlebot likes to have a noscript version. Make sure Google can find a clean &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag without the need for any JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Data persistence risks
&lt;/h2&gt;

&lt;p&gt;Googlebot crawls with a headless browser and passes nothing on to the next, successive URL request. So don't use cookies, local storage or session data to fill in any important SEO elements. I've seen examples where products were personalized within category pages and product links were only loaded based on a specific cookie. Don't do that, or accept a ranking loss.&lt;/p&gt;
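&lt;p&gt;If you do personalize, reorder rather than gate. A sketch (names are mine) of the safe pattern:&lt;/p&gt;

```javascript
// Sketch: personalization may reorder links, but must never drop any.
// A cookieless crawler simply gets the default order with the complete set.
function orderedLinks(allLinks, preferredFirst) {
  const preferred = preferredFirst.filter(l => allLinks.includes(l));
  const rest = allLinks.filter(l => !preferred.includes(l));
  return preferred.concat(rest);
}
```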

&lt;h2&gt;
  
  
  5. Unit test SSR
&lt;/h2&gt;

&lt;p&gt;Whatever developers tell you, things can break. Things can go offline due to network failures. It could be a new release, or just some unknown bug introduced while working on completely different things. Below is an example of a site where the SSR broke (just after last year's #BrightonSEO), causing two weeks of trouble internally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd20badmnon5go18qxrqe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd20badmnon5go18qxrqe.png" alt="Visibility in Google is down!" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make sure you setup unit testing for server side rendering. Testing setups for the most used JavaScript frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Angular &amp;amp; React testing: &lt;a href="https://jestjs.io/" rel="noopener noreferrer"&gt;https://jestjs.io/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vue testing &lt;a href="https://github.com/vuejs/vue-test-utils" rel="noopener noreferrer"&gt;https://github.com/vuejs/vue-test-utils&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Third-party rendering: set up monitoring
&lt;/h2&gt;

&lt;p&gt;Third-party rendering services like prerender.io are not flawless either; they can break too. If Amazon's infrastructure crashes, most of the third parties you use will be offline. Use third-party (haha!) monitoring tools like ContentKing, Little Warden or PageModified, and do consider where they host their services 🙂&lt;/p&gt;

&lt;p&gt;Another tactic to make sure Google doesn't index empty pages is to serve a 503 status, load the page, signal the server once the content has loaded, and then update the status. This is quite tricky, and you need to tune it carefully so you don't ruin your rankings completely; it is more of a band-aid for unfinished setups.&lt;/p&gt;
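&lt;p&gt;In rough, Express-style terms the band-aid looks like this (the shapes are illustrative; treat it as a stopgap, not a design):&lt;/p&gt;

```javascript
// Sketch: serve 503 with Retry-After until the renderer has produced the page,
// so crawlers back off instead of indexing an empty shell.
function responseFor(path, renderedPages) {
  if (renderedPages.has(path)) {
    return { status: 200, body: renderedPages.get(path) };
  }
  return { status: 503, headers: { 'Retry-After': '120' } };
}
```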

&lt;h2&gt;
  
  
  7. Performance: reduce JS
&lt;/h2&gt;

&lt;p&gt;Even if every element relevant for SEO is available in the initial HTML response, I have had clients lose traffic because performance got worse for both users and search engine bots. First of all, think of real users' experiences. The Chrome UX Report is a great way of monitoring actual performance. And Google can freely feed that data to its monstrous algorithms, haha!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yqeb1iba90no4yxzb6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yqeb1iba90no4yxzb6l.png" alt="JavaScript impacts user experience heavily" width="606" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most effective tip is tree shaking: simply reduce the number of JavaScript bytes that need to be loaded. Cleaning up your scripts can also speed up processing, which helps a lot on older, slower CPUs; for older mobile phones in particular, this can noticeably improve the user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Can Google load all JS scripts?
&lt;/h2&gt;

&lt;p&gt;Make sure you monitor and analyze log files to see whether any static JS files are generating errors. &lt;a href="https://www.botify.com/" rel="noopener noreferrer"&gt;Botify is perfect&lt;/a&gt; for this, with a separate section monitoring static file responses. The brown 404 trend clearly shows an issue with files not being accessible at the moment Google requested them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxzea4uto1wdpjs3o19e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxzea4uto1wdpjs3o19e.png" alt="Use Botify to monitor server logs for static files" width="606" height="343"&gt;&lt;/a&gt;&lt;/p&gt;
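&lt;p&gt;You don't need a paid tool to get a first signal, though: a few lines over your raw access logs will surface failing static assets (the log format below is an assumption; adapt the pattern to your own logs).&lt;/p&gt;

```javascript
// Sketch: count 404/5xx responses for .js files in access log lines.
function countStaticErrors(lines) {
  const counts = {};
  for (const line of lines) {
    const m = line.match(/"GET (\S+\.js) HTTP[^"]*" (404|5\d\d)/);
    if (m) counts[m[1]] = (counts[m[1]] || 0) + 1;
  }
  return counts;
}
```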

&lt;h2&gt;
  
  
  9. Prevent analytics pageviews triggered during pre-rendering
&lt;/h2&gt;

&lt;p&gt;Make sure you don't send pageviews into your analytics while pre-rendering. The easy way is to block all requests to the tracking pixel domain; as simple as it gets. Noticed an uplift in traffic? Check your SSR before reporting massive traffic gains.&lt;/p&gt;
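&lt;p&gt;In a request-interception setup (Puppeteer-style prerenderers support this), the blocking decision is a simple hostname check. The domain list below is an example, not exhaustive:&lt;/p&gt;

```javascript
// Sketch: decide whether a request fired during prerendering should be dropped
// so it never reaches the analytics endpoint.
const TRACKING_HOSTS = ['www.google-analytics.com', 'stats.g.doubleclick.net'];

function shouldBlockDuringPrerender(requestUrl) {
  const host = new URL(requestUrl).hostname;
  return TRACKING_HOSTS.some(t => host === t || host.endsWith('.' + t));
}
```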

&lt;h2&gt;
  
  
  10. Some broader SSR risks
&lt;/h2&gt;

&lt;p&gt;Cloaking: search engines still don't like it, so make sure you don't accidentally cloak. In the case of server-side rendering, that would mean showing users different content than search engines.&lt;/p&gt;

&lt;p&gt;Caching rendered pages can be cost-effective, but think about the effect on the data points sent to Google: you don't want structured data, like product markup, to be outdated.&lt;/p&gt;

&lt;p&gt;Check the differences between the mobile and desktop Googlebots; a tool like SEO Radar can help you quickly identify differences between the two user agents.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.slideshare.net/slideshow/embed_code/key/5fxpn2fRz8344t" alt="5fxpn2fRz8344t on slideshare.net" width="100%" height="487"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Questions? Just let me know!
&lt;/h3&gt;

</description>
      <category>seo</category>
      <category>javascript</category>
      <category>react</category>
    </item>
  </channel>
</rss>
