Run the same brand-query through ChatGPT, Gemini, Perplexity, Claude, and Grok. Read the citations. The cited URLs will not be the same, the brands featured will not be the same, and in roughly a third of cases one tool will cite your brand confidently while another does not mention it at all. The temptation is to reach for an algorithmic explanation: different rerankers, different summarisation styles, different prompt scaffolds. The actual explanation is upstream of all of that. Different tools sit on top of different search backends, and the backends do not see the same web.
I worked this out by running the same fifty brand-queries across nine AI tools for six months and logging every citation URL, every search-tool invocation, every backend signature I could pull out of the response trace. The divergence was not noise. It was structural. A page indexed by Bing but missing from Google's index simply does not show up in Gemini, no matter how well it is written. A site that Brave's crawler reaches but Tavily's reranker buries cannot win on Tavily-backed agents. The "AI search" abstraction collapses into a backend-coverage problem the moment you try to optimise systematically.
This post is the field report. The publicly known map of search backends behind the major AI tools. What "different index" actually means at the crawl-and-rank layer. The fusion-layer wildcards: Tavily, Exa, and similar APIs that sit between agents and the open web. The provider-internal indexes nobody talks about. Why citation patterns drift over months. The practical monitoring strategy for an operator who actually wants to see the gap rather than guess at it.
The Backend Map: What Powers What
The first useful exercise is drawing the map honestly, including the parts that are partly private. Some relationships are documented, some inferred from observable signatures, and some have shifted over time. I will mark each accordingly.
Bing powers Microsoft's own grounding stack: Copilot in all its forms, and the Grounding with Bing Search tool exposed through Azure AI Foundry. Microsoft documents the Grounding with Bing Search service as the canonical way for an Azure-hosted agent to ground responses on real-time public web data. The legacy Bing Search API was deprecated in mid-2025 in favour of the grounding-specific service. DuckDuckGo has long sourced traditional links and images "largely from Bing" while layering its own crawler (DuckDuckBot) and specialised sources on top. The DuckDuckGo help page on results sources says exactly that.
ChatGPT search is the trickiest cell on the map and the one where I want to be most careful. The Microsoft–OpenAI partnership originally placed Bing behind ChatGPT's web-search behaviour, and many secondary sources still describe the relationship that way. Then OpenAI launched ChatGPT search in October 2024 and explicitly positioned it as a competitor to Bing. OpenAI's own announcement describes the feature as "powered by real-time web search and partnerships with news and data providers", a deliberately broad framing. The Microsoft–OpenAI partnership was renegotiated through 2025 to give OpenAI more flexibility to use multiple cloud and search providers. The honest answer for "what backend powers ChatGPT search today" is a stack that includes Bing, includes direct news-publisher integrations, and increasingly includes OpenAI's own crawl. Treating it as pure Bing is wrong now. Treating it as pure first-party is also wrong.
Gemini and Google AI Overview sit on Google Search. This one is documented unambiguously. Google's grounding documentation says "Grounding with Google Search connects the Gemini model to real-time web content" and exposes the result trace through groundingChunks, groundingSupports, webSearchQueries, and searchEntryPoint. Google's web index is the same index that powers conventional Google Search, with the same crawl and the same ranking signals. AI Overview is grounded on a subset of the top-ranked search results for the query, with the LLM synthesis on top.
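For operators who want to see that trace directly, the grounding metadata is exposed in the API response. A minimal sketch, assuming the google-genai Python SDK and a grounding-capable model; field names follow Google's grounding docs, but the SDK surface shifts, so verify against the current documentation before relying on it:

```python
# Sketch only: pull the grounding trace out of a Gemini grounded response.
# Assumes the google-genai SDK and GEMINI_API_KEY in the environment.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",  # any grounding-capable model
    contents="best observability platforms for Kubernetes",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

meta = response.candidates[0].grounding_metadata
print("web search queries issued:", meta.web_search_queries)
for chunk in meta.grounding_chunks or []:
    # Each chunk is a candidate URL that entered the synthesis context.
    print(chunk.web.uri, "-", chunk.web.title)
```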
Claude's web search tool uses Brave Search as the third-party search provider. This is documented in Google's own Vertex AI documentation for Anthropic partner models, which lists Brave Search as the "third-party search service that Anthropic Web Search feature can call." Brave's API page lists Mistral AI, Cohere, Together.ai, and Snowflake among its users and frames itself as "the leading search tool for applications that use Claude MCP." Claude's grounded answers bottom out in Brave's index, which is independent of Bing and Google.
Brave Search runs an independent index. Brave's API page is direct about this: "The only search API with its own Web index at scale. Truly independent, lightning-fast, and built to power AI apps." They reinforce the point: "the Brave Search API is not a scraper that simply uses bots to query Google or Bing and repackage their results. Instead, it's our own independent index of the Web packaged with our own ranking models." The published index size is "over 30 billion pages." This matters because a Claude grounding session has a different starting set of candidate URLs than a Gemini grounding session, before any ranking or reranking even happens.
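Checking your own starting set is a one-call exercise. A minimal sketch against Brave's documented Search API endpoint; the query and domain are placeholders, and the `site:` operator behaviour is worth verifying against current Brave docs:

```python
# Sketch only: confirm whether Brave's independent index can see a page at all,
# regardless of what Google or Bing have crawled.
import requests

BRAVE_API_KEY = "YOUR_KEY"  # placeholder; issued at api.search.brave.com

resp = requests.get(
    "https://api.search.brave.com/res/v1/web/search",
    params={"q": "site:yourbrand.example observability platform"},
    headers={
        "Accept": "application/json",
        "X-Subscription-Token": BRAVE_API_KEY,
    },
    timeout=10,
)
resp.raise_for_status()
for result in resp.json().get("web", {}).get("results", []):
    print(result["url"], "-", result["title"])
```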
Perplexity is the hybrid everyone notices and few describe accurately. Perplexity uses an internal crawler, an internal index (the answer engine running on top of it is branded "Sonar"), plus third-party search APIs. Public reporting and Perplexity's own help-centre material have at various times mentioned both Google-backed and Bing-backed paths. The exact mix has shifted over the product's life. Operators tracking Perplexity citations should not assume any single backend is the source of truth for what Perplexity sees.
Grok searches X plus the open web through xAI's Web Search and X Search tools. The xAI documentation describes a web search tool that "enables Grok to search the web in real-time and browse web pages" and an X-platform search tool with keyword, semantic, user, and thread retrieval. The web component's underlying provider has not been publicly disclosed in the same way Anthropic's choice of Brave is disclosed. What is clear is that Grok's index is biased toward X-platform content in a way no other tool's is: a brand with strong X presence shows up disproportionately on Grok, and disproportionately not on tools that lack an X integration.
Kagi runs a metasearch architecture: two in-house indexes, Teclis (web) and TinyGem (news), combined with "anonymised API calls to all major search result providers worldwide" plus specialised vertical sources. Kagi is small-scale relative to Bing or Google but maintains a distinctive in-house crawl focused on non-commercial, "small web" content. Kagi is not a backend for any major LLM, but its index character is genuinely different from the dominant ones. You.com runs its own real-time index plus vertical indexes for news, healthcare, legal, and similar: independent of Bing and Google, but smaller in scale.
Tavily and Exa are different in kind. They are not "search engines" in the Bing/Google sense. They are search APIs designed to be the retrieval layer for AI agents. Tavily describes itself as offering "real-time search, extraction, research, and web crawling through a single, secure API," with a "production-grade retrieval stack." Exa describes its product as an "industry-leading web index built for agents." Both decline to publicly name their upstream sources in the way Brave does, and both are widely understood by builders to combine custom crawling with dense retrieval and their own reranking on top. They are themselves a backend choice for any agent that wires them in.
The honest summary of the map: nine major surfaces, four or five distinct primary backends, and at least three more API-layer products that act as their own backends when an agent is built on top of them.
What "Different Index" Actually Means
It is tempting to treat "different backends, different indexes" as a distinction without a difference, as if the indexes were near-equivalent. They are not. Indexes differ on more axes than most operators count, and each axis is a place where two backends diverge for the same query in ways that show up directly in citation behaviour.
Crawl frequency and freshness. Common Crawl, which seeds many training corpora, adds 3–5 billion new pages per month according to its published methodology. Commercial backends crawl much more aggressively than that for high-value sites and less aggressively for the long tail. A new product page on a high-authority domain will hit Bing's index in hours and Google's in similar time. The same page on a small-authority domain might sit uncrawled by either for weeks. Brave's crawl prioritises differently again, and Tavily's and Exa's targeted crawls are shaped by which queries their customers run. A "freshness gap" between two backends is rarely a bug; it is a budget allocation made deliberately.
Geo and language coverage. Google's index has significantly broader non-English coverage than Bing's. Brave is English-and-Latin-script biased. Tavily and Exa are English-dominant by default. A query in German or Japanese retrieves different breadth across these backends before any ranking layer touches the results. A citation gap in Gemini might disappear when you query in the brand's primary market language, while the same gap in Claude (Brave-backed) might persist because Brave's coverage thins out in non-English.
Deep-link coverage and JavaScript-only content. Backends differ sharply on how deep they crawl and how they handle JavaScript-rendered content. Bing and Google have invested heavily in headless rendering. Brave's public statements about its crawler are more conservative. Tavily's and Exa's behaviour around JS-rendered content depends on their per-customer crawl budget. A brand that ships a JavaScript-only site will see different coverage curves across backends, a fact that compounds with the well-known issue of inference-time fetchers being even less rendering-capable than crawl-time fetchers.
Robots.txt and bot identifiers. All major backends respect robots.txt, but the exact directives they honour differ at the edges. Backends respect specific bot identifiers (GoogleBot, BingBot, BraveBot, OAI-SearchBot, ClaudeBot, etc.) and a robots policy that allows one and blocks another produces a hard coverage gap. Operators who tighten their robots.txt against AI training bots without thinking carefully about the search-time fetchers occasionally cut themselves off from grounding entirely.
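An illustrative robots.txt that keeps the two concerns separate. The user-agent strings below are placeholders to verify against each provider's current documentation, since most providers run distinct bots for training crawls and search-time fetching:

```txt
# Illustrative only: user-agent strings drift, and most providers document
# several bots. Verify current names before shipping.

# Opt out of a training crawl...
User-agent: GPTBot
Disallow: /

# ...without cutting off the search-time fetchers that feed grounding.
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Default: allow everything else, including Googlebot and Bingbot.
User-agent: *
Allow: /
```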
Content-type coverage. PDFs, video transcripts, podcast transcripts, and code repositories are covered very unevenly. Google handles PDFs more thoroughly than most. Code-heavy queries land differently on Brave because of how Brave indexes code-host content. Video-transcript surfacing depends on whether the backend has direct ingestion or transcript-extraction at crawl time. A brand whose primary content is a podcast or a video series will have a wildly different visibility profile across backends than one whose content is HTML articles.
Snippet length and chunk granularity. Once a backend indexes a page, the chunk it stores is what determines whether a passage can become a citation. Brave publishes that its API returns "up to five snippets" per result. Google's grounded responses surface a different chunk shape. Tavily and Exa, being embedding-based, serve dense vectors over whatever chunk size their pipeline uses. If your page's information is densely packed in a single section that exceeds the backend's chunk granularity, that information may never enter a citation context window even when the page itself is in the index.
The compound effect is that two backends covering "the open web" can diverge on something close to half of mid-frequency queries. The divergence is structural, not random.
The Fusion-Layer Wildcards
Tavily and Exa deserve their own section because they are increasingly the retrieval layer that AI agents actually depend on, and they break the simple "what crawl do they use" frame.
A traditional search engine like Bing or Google crawls, builds an inverted index, ranks with hundreds of signals, and returns ranked URLs. Tavily and Exa are different. They crawl too, but they layer on dense retrieval, custom ranking, and explicit reranking optimised for LLM consumption. A page that ranks well on Bing can rank poorly on Tavily for the same query, because Tavily's ranker disagrees with Bing's. Not "is wrong": disagrees. The two systems optimise different objectives.
This matters operationally because more and more AI agents, particularly in the developer-tools space and in custom MCP-server stacks, wire in Tavily or Exa rather than going through a public-web search backend. An operator who only monitors citation behaviour on ChatGPT, Gemini, and Claude is missing an entire layer of agent stacks that route their grounding through these API products. For most consumer-facing brands the agent-stack layer is small today. For B2B brands selling to developers, it is not small.
The asymmetry in observability makes this harder to track. Tavily and Exa do not publish their crawl coverage or their reranking objectives in the way Google and Bing do. The way to see whether your brand is reachable through their backends is to query the API directly. There is no shortcut. A practical rule I have settled on: if a brand sells primarily to developers, Tavily and Exa coverage matters as much as Bing or Google coverage, even though no consumer ever uses Tavily directly. The retrieval layer for the agents the buyers are building is what determines whether the brand shows up in agent-driven workflows.
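Here is what the direct check looks like, a minimal sketch using Tavily's Python client (tavily-python); the Exa client (exa_py) works along the same lines. Package and method names follow current docs but can change, and the query and domain are placeholders:

```python
# Sketch only: check whether a brand domain is reachable through Tavily for a
# given buyer-side query.
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")  # placeholder key

BRAND_DOMAIN = "yourbrand.example"  # placeholder
response = client.search("best managed postgres for startups", max_results=10)

for result in response["results"]:
    marker = "HIT " if BRAND_DOMAIN in result["url"] else "    "
    print(marker, result["url"])
```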
The First-Party Backends Nobody Talks About
There is a layer below the named backends that is increasingly load-bearing and partly private. I want to be careful here because the public documentation is thin. What I can say with confidence is what the observable signatures suggest, framed as observation rather than asserted fact.
OpenAI's ChatGPT search is grounded on sources that do not look like pure Bing results. The October 2024 launch announcement mentioned "partnerships with news and data providers" alongside search backends. Looking at citation patterns from ChatGPT search across the last six months, I see consistent appearance of certain news domains in patterns that suggest direct ingestion rather than open-web ranking. That is consistent with OpenAI building its own crawl on top of licensed data feeds. I cannot prove it from public documentation alone but the observable pattern is real, and it is the pattern an operator should expect from a company that has explicitly positioned itself as a Bing competitor.
Perplexity's own index has grown over time. The product launched as a layer on top of third-party search and has progressively built its own crawl, embedding pipeline, and ranker. The hybrid mix Perplexity ships today is genuinely different from what it shipped two years ago. Tracking Perplexity citations over time, not as a fixed snapshot, is the only reliable approach. Google's grounding pipeline is, by contrast, the most stable and most documented, anchored to Google Search, which is itself the most stable index in the market.
The general rule: if you are tracking citation behaviour and your data is more than three months old, treat it as suspect. Backend mixes drift, internal crawls expand, third-party partnerships open and close.
Why Citation Patterns Drift
Once the backend map is in place, drift becomes legible. There are four causes for a brand to be cited on one tool and missing on another, each with a different fix.
Index gap. The page is not in the relevant backend's index at all. This is the most common cause and the easiest to verify. Pull the URL into the backend's site search (site:yourbrand.example on Google, on Bing, on Brave) and see whether it returns. The fix is at the crawl layer: sitemaps, internal linking, robots.txt audit, JS-rendering audit.
Ranking gap. The page is indexed but does not rank in the top-N for the relevant queries. Different backends have different top-N cutoffs for what enters the LLM's context window; a typical grounding session pulls between three and ten URLs into the synthesis. A page ranked at twenty is invisible. The fix is the standard SEO playbook for the specific backend, with the caveat that different backends weight the signals differently.
Language and geo gap. The page is indexed and ranks in one geo or language but not another. Most common when a brand publishes primarily in English but operates in multiple markets. The fix is genuinely localised content, with hreflang and locale-specific URLs, not translated boilerplate.
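The mechanical half of that fix is small; the content half is not. For reference, a minimal hreflang block (URLs are placeholders):

```html
<!-- Placeholder URLs; one alternate per locale, plus an x-default fallback. -->
<link rel="alternate" hreflang="en" href="https://yourbrand.example/en/pricing/" />
<link rel="alternate" hreflang="de" href="https://yourbrand.example/de/preise/" />
<link rel="alternate" hreflang="x-default" href="https://yourbrand.example/pricing/" />
```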
Freshness gap. The page changed significantly since the last crawl, and the backend's snapshot is stale. AI grounding sessions read the snapshot, not the live page. The fix is a sane sitemap with lastmod and a hosting setup that does not time out crawler requests.
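The lastmod piece is trivial to get right and frequently missing. A minimal sitemap entry, with a placeholder URL and date:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourbrand.example/pricing/</loc>
    <!-- keep lastmod honest: update it only when the page materially changes -->
    <lastmod>2025-06-01</lastmod>
  </url>
</urlset>
```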
A specific and frustrating drift case: a brand's pages get crawled and indexed by every backend except one. The exception backend often has a specific technical incompatibility: a robots.txt line that named the wrong bot, a CDN rule that returns 403 for that bot's user-agent, an SSR setup that fails for one rendering pipeline. The way to find these is bot-by-bot log analysis. I have lost count of the number of "we are invisible on Claude / on Perplexity / on Gemini" cases that turned out to be a single line in a CDN config.
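A minimal sketch of that bot-by-bot pass, assuming combined-format access logs; the log path, format, and bot substrings are assumptions to adapt to your own CDN or server logs:

```python
# Sketch only: count response status codes per crawler user-agent, to spot the
# one backend your CDN is quietly turning away.
import re
from collections import Counter

BOTS = ["Googlebot", "bingbot", "Bravebot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]
# matches: ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
LINE = re.compile(r'" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"\s*$')

counts: Counter = Counter()
with open("access.log") as fh:
    for line in fh:
        m = LINE.search(line)
        if not m:
            continue
        for bot in BOTS:
            if bot.lower() in m["ua"].lower():
                counts[(bot, m["status"])] += 1

for (bot, status), n in sorted(counts.items()):
    print(f"{bot:15s} {status}  {n}")
```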
Practical Monitoring: Which Two Backends First
Nine surfaces is too many to monitor day-to-day for most operators. The question becomes which two or three to start with.
The framework I have settled on uses traffic profile. For a B2C brand whose customers come primarily through Google search today, the first two surfaces to monitor are Google AI Overview and Gemini, because the Google index is the source of truth for both. The second tier is ChatGPT search and Perplexity, which capture an increasing share of grounded buyer-side queries. The third tier is Claude (Brave-backed) and Grok (X-and-web).
For a B2B brand selling to developers, the priority shifts. Claude and ChatGPT search rank highest because developers disproportionately use them. Perplexity matters because of its researcher persona. Tavily- and Exa-backed agent stacks matter because the buyer is building those agents. Google AI Overview drops in priority.
For a brand whose content is primarily video or audio, the priority shifts again. Google has the strongest video content integration and that bias propagates to Gemini and AI Overview. Other backends are uneven. Monitor whichever backend is documented to handle your content type best.
The mistake operators make most often is testing in only one tool. A single ChatGPT query that cites the brand is taken as "we are visible to AI." A single Gemini query that does not cite the brand is taken as "Gemini is broken for us." Both interpretations are wrong because they do not separate the backend layer from the synthesis layer. The right test is the same query across at least four tools, with the citation set logged for each, on a rotation that catches drift.
What to log when running this systematically:
- The exact prompt, with timestamp.
- The tool used and the model version.
- Whether grounding fired (the response includes a tool-call indicator).
- The full citation set returned, with URLs and the cited domains.
- Whether the brand appears in the citation set, and if so, in which position.
- The synthesis-layer mention of the brand (separate from citation), since some tools cite without naming and some name without citing.
- The locale/language the query was run in.
A spreadsheet of those columns over fifty queries across nine tools, repeated monthly, gives you the picture. Anything less and you are guessing.
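If a spreadsheet feels too manual, here is a minimal sketch of the same schema as a CSV appender. Field names mirror the list above; how each value is populated depends on the tool's API or on manual capture:

```python
# Sketch only: one audit row per (prompt, tool, run), appended monthly.
import csv
import os
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone

@dataclass
class CitationAuditRow:
    timestamp: str
    prompt: str
    tool: str
    model_version: str
    grounding_fired: bool
    citation_urls: str            # pipe-separated, in cited order
    brand_cited: bool
    brand_citation_position: int  # 0 when not cited
    brand_named_in_answer: bool   # synthesis-layer mention, separate from citation
    locale: str

def append_row(path: str, row: CitationAuditRow) -> None:
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(CitationAuditRow)])
        if new_file:
            writer.writeheader()
        writer.writerow(asdict(row))

append_row("citation_audit.csv", CitationAuditRow(
    timestamp=datetime.now(timezone.utc).isoformat(),
    prompt="best managed postgres for startups",
    tool="gemini",
    model_version="gemini-2.5-flash",
    grounding_fired=True,
    citation_urls="https://example.com/a|https://example.com/b",
    brand_cited=False,
    brand_citation_position=0,
    brand_named_in_answer=False,
    locale="en-US",
))
```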
Backend Variation Matrix
The matrix below is the practical artefact I use in audits. It is deliberately not a ranking; there is no "best" backend, just different coverage profiles. Where a cell is uncertain or has shifted recently, I have marked it as observation rather than fact.
| Backend / Layer | Primary AI tools it powers | Coverage character | Independence | Public docs on internals |
|---|---|---|---|---|
| Google Search | Gemini grounding, AI Overview | Largest open-web index, strong multilingual, strong PDF, strong video integration, slow drift | Independent | Strong (groundingChunks, webSearchQueries, groundingSupports) |
| Bing (via Grounding with Bing Search) | Microsoft Copilot, DuckDuckGo (traditional links), historically ChatGPT search | Large open-web index, strong English, weaker non-English vs Google, fast for high-authority freshness | Independent | Documented Azure AI Foundry tool surface |
| Brave Search API | Claude web search tool, Mistral / Cohere / Together / Snowflake apps per Brave's own claim | ~30bn pages, English-leaning, independent crawl, AI-friendly snippet sizing | Independent (explicitly not a Bing/Google reseller) | Strong (API page declares scale and method) |
| Perplexity (Sonar + hybrid) | Perplexity.ai answers and Search API | Hundreds of billions of pages claimed; mix of own crawl plus external APIs that has shifted over time | Hybrid; mix is not stable | Partial; mix not fully disclosed |
| OpenAI's emerging stack | ChatGPT search | Hybrid of partner data feeds plus open web; positioned competitively against Bing | Hybrid; trending toward independence | Limited; framed as "real-time web search and partnerships" |
| xAI Search (web + X) | Grok | X-platform-biased, plus open web; X integration unique among major tools | Hybrid; web provider not publicly disclosed | Partial; web sources not enumerated |
| You.com index | You.com search and API | Own real-time index plus vertical indexes; smaller scale than Bing/Google | Independent | Documented per vertical |
| Kagi (Teclis + TinyGem + metasearch) | Kagi search (consumer; not a major LLM backend) | "Small web" focus; metasearch architecture; small relative to dominant indexes | Hybrid; explicitly metasearch | Strong (named indexes and API call structure) |
| Tavily | AI agents using Tavily as their retrieval layer | Custom crawl plus dense retrieval plus reranker, optimised for LLM consumption | Independent at the retrieval layer | Limited; mechanism not publicly enumerated |
| Exa | AI agents using Exa as their retrieval layer | Embedding-based retrieval with custom ranking; positioned as "industry-leading web index built for agents" | Independent at the retrieval layer | Limited; mechanism not publicly enumerated |
| Common Crawl (training-time baseline) | Indirectly underlies many model training corpora | 300bn+ pages, 3–5bn new pages monthly; quarterly published snapshots | Independent | Strong; methodology published |
The matrix is the artefact you carry into a monitoring strategy. It tells you which backends an observed gap implicates, which backends to instrument when you need to verify the gap, and which surface to test against when you ship a fix. The "AI search" abstraction collapses into this matrix the moment you try to do real work on it.
The Synthesis
The single sentence: AI search is a thin synthesis layer over a small set of search backends, and a brand's visibility across AI tools is determined first by which backends index it and rank it, not by anything the model itself does at synthesis time.
If you only have time to internalise three things from this post, in order:
- There is no "AI search" channel; there are four or five distinct backend channels with different coverage curves. Google's index, Bing's index, Brave's index, Perplexity's hybrid index, and the agent-layer retrieval products like Tavily and Exa each see a different web. Treating "AI" as one channel is debugging at the wrong layer.
- Pick two backends to monitor first based on traffic profile, then expand. B2C through Google traffic: Gemini and AI Overview first. B2B to developers: Claude and ChatGPT search first. Test the same fifty brand-queries across all monitored surfaces monthly, and log the citation set for each. The divergence shows up immediately.
- When a citation gap appears on one backend and not another, the cause is almost always upstream of the model. Index gap, ranking gap, language/geo gap, or freshness gap each has a different fix and a different team. Engineering work on the crawl layer pays out across every AI tool that shares that backend. Synthesis-layer "AI optimisation" without backend coverage is sand-castle work.
The operator who internalises this stops asking "are we visible on AI" and starts asking "which backends index us, which rank us, and which is the cheapest to fix first." That framing change is the win. The engineering follows.
The backend market is going to keep shifting. OpenAI is building its own crawl. Anthropic's stack is expanding beyond Brave. Perplexity's mix changes annually. Tavily and Exa are growing into the retrieval layer for the agentic web. The brands cited consistently across AI tools in three years will be the ones who treated this as a coverage problem on a moving target.
The backend taxonomy in this post is my synthesis of public mechanism documentation and observable behaviour. The documentation side: Google's grounding-with-search documentation, Brave's Search API documentation, Microsoft's Grounding with Bing Search service docs on Azure AI Foundry, Google's Vertex AI documentation for Anthropic partner models (which is where Brave is named as the third-party search provider behind Claude's web search tool), DuckDuckGo's results-sources help page, Kagi's search-sources documentation, xAI's web search tool documentation, OpenAI's announcement of ChatGPT search, and Common Crawl's published methodology. The behaviour side: my own monthly cross-tool citation audits over the past six months. Where I have written "in my testing" or "the pattern I observe", that is exactly what I mean. The provider-internal index claims for OpenAI and Perplexity are framed as observation because public documentation does not enumerate the backend mix. Provider behaviour is moving; backend mixes drift on a quarterly cadence; verify against current docs before shipping a strategy.