DEV Community

Cihangir Bozdogan

I Reverse-Engineered ChatGPT's Retrieval Stack. The Bottleneck Isn't What You Think.

ChatGPT cites its sources. You see the neat little [1], [2] markers, and the implicit message is: the model went out, looked at the web, brought back evidence, and is showing you receipts.

That story is half right. The other half is what every team building a RAG system gets wrong.

There is no single retrieval system inside ChatGPT. There are at least two: a parametric one frozen in the weights, and a live one that fires only sometimes. On top of them sits a tool layer deciding which to invoke, plus a generation step that has to reconcile them when they disagree. Almost none of it is published in detail. Some is confirmed by OpenAI and Microsoft. Some is inferred from leaked system-prompt fragments and citation studies. A lot is just observable behavior if you poke it with enough queries.

I spent a week tracing the pipeline. What follows is an engineer's reading of how it actually works: the two channels, the eight-step pipeline, the tool layer, and the one finding that should change how you build your own retrieval system.

Spoiler for the impatient: the bottleneck is not the LLM, and it is not the embedding model. It is the rerank step. I'll get there.

Two Channels, One Voice


Every ChatGPT response is the output of a model with access to two completely different sources of information. The model does not always tell you which one produced the sentence you're reading.

The training corpus is frozen at the knowledge cutoff. It's parametric — what the model "knows" lives in weights, not as a list of URLs it can point at. That corpus is enormous and heterogeneous: a slice of Common Crawl, licensed publisher content, public code, and — since 2024 — Reddit, via the formal OpenAI/Reddit data partnership. Anything that comes from this channel has no source URL attached. The model can recite a fact; it cannot tell you where in training it saw the fact.

The live retrieval channel is different. When the browser tool fires, the model issues real search queries, fetches real pages, and the URLs travel with the content into the context window. This is the channel that produces the bracketed citations.

Here's the part that should bother you more than it does: the model does not consistently disclose which channel produced any given answer. Ask "what's the latest version of X?" and you might get a freshly retrieved answer with citations — or you might get a confident, plausible answer pulled from training-time memory of an older release, no citations, no signal that retrieval was skipped. Same formatting. Same tone. Only one is right.

We'll come back to this. It's the most engineering-relevant idea in the whole stack, and the one ChatGPT itself handles worst.

The Pipeline, End to End


Reverse-engineered from observed behavior, OpenAI/Microsoft attestations, and citation studies, the live-retrieval pipeline runs roughly eight steps. Some implementations probably collapse steps. Some parallelize. The logical sequence is consistent:

  1. Query rewriting and decomposition. Your prompt is rarely a good search query. The model rewrites it — sometimes into parallel queries that decompose a multi-part question. "Compare X and Y on Z" becomes two or three independent retrieval calls, fused later. This happens inside the model itself and is cheap.

  2. Search API call. Confirmed: the primary backend is the Bing Web Search API, a consequence of the OpenAI/Microsoft commercial relationship. Anything missing from Bing's index simply cannot be cited via this channel.

  3. Result fetching. From the ranked URL list, the system fetches a small number of pages. Small is the operative word — a handful, not dozens. The fetch is parallelized, so wall-clock cost is set by the slowest tail.

  4. Page parsing. Each fetched page is converted from HTML to clean text. This is where the render gap bites. JS-heavy SPAs, late-binding hydration, content rendered after DOMContentLoaded: none of it is reliably visible to a server-side fetcher. Paywalled and robots-blocked pages disappear here too. OpenAI's crawler OAI-SearchBot is the publicly confirmed user agent; sites that block it block themselves out of citation.

  5. Chunking. Long pages are split into smaller passages. Standard RAG concerns apply: chunk size, overlap, semantic boundaries. Bad chunking destroys grounding even when the right page got fetched. The relevant passage gets cut down the middle, and neither half scores well alone.

  6. Re-ranking and selection. From the chunks, a smaller set is selected for the final context. This is the stage that decides what becomes citation-worthy, and it is almost certainly handled by a model — either the main LLM in a separate scoring pass or a smaller dedicated ranker. The exact architecture is undisclosed.

  7. Context assembly. Selected chunks are injected into the prompt alongside their source URLs. The [1], [2] markers are downstream of this — chunks are paired with URLs so the generation step can attribute correctly.

  8. Generation with citation tagging. The model produces the final answer in a single forward pass, emitting citation markers tied to the assembled chunks. Mapping a generated span back to the chunk that justified it is non-trivial: the model has to do an implicit alignment between what it's saying and what it was given. When that alignment is wrong, you get the well-known failure mode: a citation that doesn't actually support its claim.

A compact way to see the whole sequence: query → rewrite → search → fetch → parse → chunk → rerank → assemble → generate. Every step has a budget and a failure mode. Every step throws away information the next step could have used. That it works at all is a quiet engineering accomplishment.
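The sequence above can be sketched as a single pipeline function. This is an illustration under stated assumptions, not OpenAI's implementation: every function name, the fetch budget of 5, and the chunk parameters are mine. The search, fetch, and rerank bodies are stubs standing in for real API and model calls; only the overlap chunking is implemented for real.

```python
def rewrite_query(prompt: str) -> list[str]:
    # Step 1 (stub): in production this is an LLM pass that may
    # decompose "compare X and Y on Z" into parallel queries.
    return [prompt]

def search(query: str, top_k: int = 20) -> list[str]:
    # Step 2 (stub): a web search API returning ranked URLs.
    return [f"https://example.com/{query}/{i}" for i in range(top_k)]

def fetch(url: str) -> str:
    # Steps 3-4 (stub): fetch a page and strip the HTML to clean text.
    return f"clean text of {url}"

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Step 5: fixed-size word windows with overlap, so a passage cut at
    # one boundary still appears whole in the adjacent chunk.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def rerank(query: str, chunks: list[str], keep: int = 8) -> list[str]:
    # Step 6 (stub): a cross-encoder would score (query, chunk) pairs;
    # here we just keep the first `keep` as a placeholder.
    return chunks[:keep]

def answer(prompt: str, fetch_budget: int = 5) -> str:
    # Steps 7-8: assemble numbered, URL-tagged context and generate.
    queries = rewrite_query(prompt)
    urls = [u for q in queries for u in search(q)][:fetch_budget]
    chunks = [c for u in urls for c in chunk(fetch(u))]
    context = "\n".join(f"[{i + 1}] {c}"
                        for i, c in enumerate(rerank(prompt, chunks)))
    return f"(LLM call with context:\n{context})"
```

The structural point the sketch makes is the one from the text: each stage narrows what the next stage can see, so a mistake at any step is unrecoverable downstream.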

The Tool Layer (and Why You Should Read Leaked Prompts Sideways)

Above the pipeline sits the tool surface the model actually calls. The model itself doesn't make HTTP requests. It emits structured tool calls; a runtime executes them and returns results.

Two surfaces dominate: a browser tool with sub-actions like opening a URL, fetching a page, following a link; and a web.run family that issues searches and returns ranked candidates. The model decides when to call each, with what arguments, how many times. From outside, it looks like a small set of structured function calls (open, search, fetch, read) with the LLM as the decision-maker.
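The loop is easy to picture as data: the model emits a structured call, a runtime executes it, and the results come back as ordinary context tokens. The field names below are my assumptions; only "web.run" and the browser sub-actions are attested in leaked prompt material.

```python
# Illustrative shape of one turn of the tool loop. The schema is a
# guess at the pattern, not a transcript of OpenAI's actual format.
tool_call = {
    "tool": "web.run",
    "action": "search",
    "args": {"query": "latest version of X", "recency_days": 30},
}

def runtime_execute(call: dict) -> dict:
    # Stub runtime: dispatch on (tool, action) and return ranked
    # candidates that the model will read as plain context.
    if call["tool"] == "web.run" and call["action"] == "search":
        return {"results": [{"url": "https://example.com",
                             "snippet": "..."}]}
    raise ValueError(f"unknown tool call: {call}")
```

The important design property is the separation: the model never touches the network, so the runtime can enforce budgets, timeouts, and robots rules without the model's cooperation.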

Leaked system-prompt material shows consistent themes. Open multiple results in parallel. Cite all sources used. Prefer recent sources for time-sensitive queries. Handle disagreements between sources explicitly. I'm paraphrasing deliberately: the leak provenance is messy, and any specific snapshot's wording may not reflect current production.

The best public archive is jujumilk3/leaked-system-prompts, which collects historical snapshots from many vendors. Treat it as primary-source material — useful for the shape of instructions in production prompts, not as a reliable transcript of any current system. OpenAI does not publish the full browser-tool prompt. Any individual snippet may be inaccurate, partial, or out of date.

The hygiene rule when reasoning from these leaks: infer patterns, not wording. Categories, ordering, hierarchy are stable across snapshots. Exact phrasing isn't.

Why Bing — and the Google-Shaped Mystery

The choice of Bing as primary backend is a confirmed mechanism, and the reason is not technical excellence. It's commercial. OpenAI and Microsoft have a deep, well-publicized relationship, and Bing's Web Search API is the natural surface to plug into.

The trade-off is index coverage. Bing is competitive on mainstream content. On long-tail, niche, or freshly published content, it still trails Google in many domains. A page that's hours old may not be in the index ChatGPT can query. Inventing a specific lag-hour figure would be irresponsible; the directional claim — Bing-only retrieval has a freshness ceiling — is what matters.

This is where the most interesting public test comes in. SEO consultant Aleyda Solís published a brand-new page, submitted it to both engines, and queried ChatGPT before Bing had indexed it; ChatGPT returned a snippet matching Google's cached version. The page was findable through ChatGPT before Bing knew it existed. Search Engine Journal's coverage is the canonical write-up.

I want to be honest about what this proves and what it doesn't: there is no public confirmation of a direct Google-fallback inside the OpenAI pipeline. Some Google-shaped results may have alternate explanations — third-party aggregators that themselves query Google, plug-ins or browsing modes that bypass the default Bing path, transient behaviors that have since changed. Observed behavior suggests fallback retrieval exists. The precise mechanism is not on the public record.

The largest quantitative study is Seer Interactive's analysis of 500+ SearchGPT citations: roughly 87% of cited URLs in Bing's top-20, around 56% also in Google's results at a median rank of 17, and approximately 92% of agent retrieval through the Bing API directly. Observational, not mechanistic — but consistent with a Bing-primary system that has some non-Bing surface area for the long tail.

The Latency Cliff


Watching the network panel during a retrieval-on response, you'll see total time from prompt submission to first streamed token typically land in the 4–10-second range. Where do those seconds go?

Without inventing precise milliseconds: query rewriting takes hundreds of ms (small generation step inside the main model). The search API call adds a few hundred (round-trip plus Bing's own ranking). Page fetches happen in parallel but wall-clock is gated by the slowest tail — one slow origin server drags the whole budget. Parsing and chunking are CPU-bound and fast. Re-rank is another model call. Generation begins streaming once context is assembled.

The structural implication is the part that matters: fetch budget is small. ChatGPT cannot fetch fifty pages. It fetches a handful. The Seer numbers are consistent with this — most cited URLs come from a tight slice of Bing's top results, not from deep crawling.
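The "slowest tail" constraint has a standard mitigation worth making concrete: fetch in parallel under a hard wall-clock deadline, so one slow origin degrades to a missing page instead of dragging the whole response. A minimal sketch, with simulated latencies; the deadline value and names are mine.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FetchTimeout

# Simulated origin latencies in seconds; "slow.com" stands in for the
# tail that would otherwise gate the whole fetch budget.
PAGES = {"a.com": 0.05, "b.com": 0.05, "slow.com": 1.0}

def fetch(url: str) -> str:
    time.sleep(PAGES[url])  # stand-in for the network round-trip
    return f"content of {url}"

def fetch_all(urls: list[str], deadline: float = 0.3) -> dict:
    # Fetch every URL in parallel; any page not back by the shared
    # deadline is dropped (None) rather than waited on.
    results = {u: None for u in urls}
    pool = ThreadPoolExecutor(max_workers=len(urls))
    futures = {pool.submit(fetch, u): u for u in urls}
    start = time.monotonic()
    for fut, url in futures.items():
        remaining = deadline - (time.monotonic() - start)
        try:
            results[url] = fut.result(timeout=max(remaining, 0))
        except FetchTimeout:
            pass  # slow tail: sacrifice the page, keep the budget
    pool.shutdown(wait=False)
    return results
```

Under this scheme wall-clock cost is capped at the deadline no matter how many pages you request, which is exactly why the marginal page is cheap to attempt but never allowed to be expensive.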

If you're optimizing for citation, increasing the count of pages an AI agent could theoretically see yields at best linear returns. The model is rate-limited by latency, not by index breadth. Your leverage point is not "make more pages indexable." It's "be in the small set of pages that survive the rerank step."

That's the first hint at the contrarian thesis. Hold on to it.

Citation Behavior: Dedup, Diversity, Disagreement


The set of citations a ChatGPT answer surfaces is not a top-N list from the search ranker. It's the output of a selection process that visibly cares about more than relevance.

DejanMarketing's GPT-search analysis found, across a wide sample, that ChatGPT typically selects 3–10 sources per response. Not 1, not 50. That range is consistent across query types and visible in the rendered answer. The bound is almost certainly latency-driven on the upper end and grounding-quality-driven on the lower end.

Within that set, same-domain dedup is visible. A single domain rarely appears five times in one answer's citations even when the search ranker would happily return five pages from the same site. Observed behavior suggests an explicit diversity pressure — possibly prompt-level, possibly ranker-level — pushing the system toward distinct sources rather than concentrating on one well-ranked publisher.

Conflict handling is the more interesting case. When sources disagree, the answer language hedges — "some sources report... while others suggest..." and the model usually surfaces both citations. This is consistent with a system that prefers honest conflict-surfacing over arbitrary tie-breaking. The hedge isn't a marketing feature. It's what cross-encoder rerankers naturally produce when several chunks score similarly with contradictory content.

The pattern that falls out: a small number of high-confidence citations beats a large number of shaky ones. Cross-encoder rerankers concentrate on agreement among independently-retrieved chunks — a stronger signal than the absolute relevance score of any single chunk.
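Both behaviors — same-domain dedup and a small final set — are cheap to reproduce as a greedy selection pass over reranked chunks. A sketch under stated assumptions: the caps, the scores, and the tuple layout are illustrative, since ChatGPT's actual limits are not published.

```python
from urllib.parse import urlparse

def select_citations(scored_chunks, max_total=8, per_domain_cap=2):
    """Greedy pick from (score, url, text) tuples, highest score first,
    with a hard cap per domain. Caps here are illustrative defaults."""
    picked, per_domain = [], {}
    for score, url, text in sorted(scored_chunks, reverse=True):
        domain = urlparse(url).netloc
        if per_domain.get(domain, 0) >= per_domain_cap:
            continue  # diversity pressure: skip over-represented domains
        per_domain[domain] = per_domain.get(domain, 0) + 1
        picked.append((url, text))
        if len(picked) == max_total:
            break
    return picked
```

Note that the cap deliberately discards some high-scoring chunks: the third-best page on one site loses to a weaker page on a fresh domain, which is the trade the observed behavior implies.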

The Confidence-Calibration Problem

This is the engineering-relevant center of the whole system, and the part most retrieval discussions miss.

The two-channel distinction from the opening is not a clean separation at inference time. Both channels feed into a single generation pass, and the model has to decide — implicitly, with no externally visible toggle — which to trust for any given assertion. When channels agree, this is invisible. When they disagree, it is the source of nearly every quietly-wrong answer the system produces.

The freshness disclosure problem is the simplest version. Ask "what's the latest version of X?" right after a release. Browser tool fires, search index has the new release: correct answer, release page cited. Browser tool doesn't fire — model judged retrieval unnecessary, hit a rate limit, or user is on a path that doesn't invoke browsing — and the model answers from training-time memory of the older release. Identical formatting. Only one is right. The user has no signal to tell them apart.

The deeper version is more subtle, and worth being explicit about. Training corpora include the model's own historical outputs. Sufficiently popular AI-generated text on the web at scrape time ends up in the next training set. So a model can be confidently wrong because a previous model was confidently wrong and the wrong answer survived into training. Re-ranking has to override parametric belief in those cases. Sometimes it does. Sometimes it doesn't — particularly when the wrong belief is well-attested across many low-quality sources and the correct passage shows up in only one reranked chunk.

For an engineer building a retrieval system from scratch, the implication is concrete: make the override explicit. ChatGPT does this implicitly, and not always well. In your own RAG pipeline, decide deliberately when retrieved evidence overrides parametric belief, and surface that decision rather than letting the model arbitrate silently. A simple rule — if retrieved evidence contradicts parametric memory, retrieved wins, and the system says so — enforced at the prompt or rerank layer is more honest than the alternative. Even when the contradicting evidence is itself wrong, the failure mode becomes inspectable rather than invisible. That is a much better place to be.
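Here's what "make the override explicit" can look like in the smallest possible form. This is a hypothetical reconciliation rule of my own, not anything ChatGPT is confirmed to do: retrieved evidence always wins, and the decision is returned to the caller instead of being arbitrated silently inside the model.

```python
def reconcile(parametric_answer: str, retrieved_answer, sources: list) -> dict:
    """Hypothetical rule: retrieved evidence overrides parametric
    memory, and the channel that produced the answer is surfaced."""
    if retrieved_answer is None:
        # Retrieval never fired: say so, rather than passing off
        # training-time memory as a grounded answer.
        return {"answer": parametric_answer, "channel": "parametric",
                "note": "no retrieval performed; answer may be stale"}
    if retrieved_answer != parametric_answer:
        return {"answer": retrieved_answer, "channel": "retrieved",
                "note": "retrieved evidence overrode parametric memory",
                "sources": sources}
    return {"answer": retrieved_answer, "channel": "both-agree",
            "sources": sources}
```

Even this toy version has the property the text argues for: when the override misfires because the evidence itself was wrong, the `channel` and `note` fields make the failure inspectable rather than invisible.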

Four Things to Take to Your Own Pipeline

Grounded in the mechanism, not the marketing.

1. The bottleneck is not the LLM. It is the rerank step. This is the contrarian thesis the post opened with, and it's the conclusion that survives the rest. If your RAG system produces bad citations, the bottleneck is almost always downstream of the embedding model. A bi-encoder retriever and a cosine-similarity index will surface plausible-but-wrong chunks faster than you can debug them. Cross-encoder reranking is the single highest-leverage stage. Spend your engineering budget on rerank quality and on chunking that respects semantic boundaries — not on swapping in a slightly larger embedding model and hoping.

2. There's a latency cliff on fetch count. ChatGPT fetches a handful of pages, not dozens, and the same constraint applies to anything you build with comparable user-facing latency targets. Past roughly five-to-ten fetched pages, latency dominates the marginal grounding gain. Each extra page mostly slows the system without meaningfully improving the answer. Decide your fetch ceiling deliberately. Design for parallelism so a single slow tail doesn't blow the budget. Accept that you can't scale your way around the rerank quality problem by fetching more.

3. Citation tagging is harder than it looks. Mapping a generated span back to the chunk that justified it is a separate concern from retrieval, with its own failure modes. You can have perfect retrieval and still emit citations that don't support their attached claims. In practice this is either a separately-trained alignment component, an extra reasoning pass over the generated answer, or a constrained-decoding setup that forces citation tags to track the active context chunk. Pick one. Don't assume the LLM will do it for free — the visible failure mode of "wrong citation on a true claim" is exactly what happens when you assume it will.

4. Source diversity is a feature, not a nice-to-have. If your pipeline doesn't explicitly enforce same-domain dedup or topic-cluster diversity at the rerank stage, hard-code it. Allowing one domain to dominate the cited set is the fastest way to make a RAG system look like a thinly-wrapped paraphrase of one publisher. Diversity pressure is cheap to implement (a small penalty in the rerank score, a per-domain cap on selected chunks), and it's the difference between a citation list that reads like research and one that reads like a single-source rewrite.
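On point 3, one cheap post-hoc mitigation is a lexical sanity check: for each generated claim, verify that its cited chunk actually shares enough content words to plausibly support it, and flag the ones that don't. This is a crude sketch of my own (the stop-word list and threshold are invented), far weaker than a trained alignment model, but it catches the grossest "wrong citation on a true claim" cases.

```python
def citation_overlap(claim: str, chunk: str) -> float:
    """Fraction of the claim's content words present in the cited
    chunk. Stop-word list and normalization are deliberately minimal."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}
    claim_words = {w.lower().strip(".,") for w in claim.split()} - stop
    chunk_words = {w.lower().strip(".,") for w in chunk.split()} - stop
    if not claim_words:
        return 0.0
    return len(claim_words & chunk_words) / len(claim_words)

def flag_weak_citations(pairs, threshold=0.5):
    # pairs: list of (claim, cited_chunk). Returns the claims whose
    # cited chunk shares too few content words to plausibly support them.
    return [claim for claim, chunk in pairs
            if citation_overlap(claim, chunk) < threshold]
```

In a real pipeline you'd route flagged pairs to a second model pass or drop the citation entirely; the point is that the check is a separate, inspectable stage rather than a hope about the generator.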

Closing

My read after a week of poking at this: ChatGPT's retrieval stack is not magic. It's a query rewrite, a search call, a small fetch budget, a re-rank, a context assembly, and a prompt with citation instructions, all wrapped in a tool layer the model decides when to invoke.

The interesting part isn't the architecture. It's the choices the system makes. What gets fetched. What gets selected. What gets attributed. When retrieval fires and when it doesn't. How the two channels of knowledge get reconciled when they disagree.

Every retrieval system built from now on makes the same set of choices. Most make them worse. The work isn't in copying the architecture. It's in making each of those choices deliberately — and being honest with the user about which channel produced the answer.

That last part, especially. ChatGPT doesn't do it. Yours can.
