
William C.

How I Reverse-Engineered ChatGPT's Hidden Search Behavior with a Chrome Extension

When you ask ChatGPT a question with Browse mode enabled, something invisible happens. Before giving you an answer, it silently generates 3 to 12 hidden search queries, consults over a dozen web sources, and makes a decision about which ones deserve to be cited. You never see any of this.

I built a Chrome extension that intercepts this invisible process in real time. Here's how, and what I discovered.

The Hidden Architecture of AI Search

ChatGPT doesn't use a simple fetch() to search the web. It uses a complex streaming architecture based on Server-Sent Events (SSE) combined with JSON Patch operations (RFC 6902).

Here's what actually happens when you ask ChatGPT "What are the best SEO tools in 2026?":

  1. Your browser sends a POST request to the /conversation endpoint
  2. The response comes back as an EventStream: chunks of data streamed over time
  3. Each chunk contains a JSON Patch operation that modifies a running response object
  4. Hidden inside these patches are the search queries, source URLs, and citation decisions

The key insight: the search queries are embedded in the stream before the text response appears. ChatGPT decides what to search for first, reads the results, then starts writing.
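To make that ordering concrete, here's a sketch of what the raw stream lines might look like. The field names ("o", "p", "v") and the search-query path are illustrative assumptions, not ChatGPT's actual schema, which changes between releases:

```javascript
// Hypothetical examples of the "data:" lines in the SSE stream.
// Field names and paths are assumptions for illustration only.
const rawLines = [
  'data: {"o":"add","p":"/message/metadata/search_queries/0","v":"best SEO tools 2026 comparison"}',
  'data: {"o":"add","p":"/message/metadata/search_queries/1","v":"SEO software reviews 2026"}',
  'data: {"o":"append","p":"/message/content/parts/0","v":"Based on current reviews, "}',
];

// Strip the "data: " prefix and parse each JSON Patch operation.
const events = rawLines.map((line) => JSON.parse(line.slice('data: '.length)));

// The query patches arrive before any answer-text patches.
const queries = events
  .filter((e) => e.p.startsWith('/message/metadata/search_queries'))
  .map((e) => e.v);

console.log(queries);
```

Notice the ordering: both search-query operations precede the first append to the answer text, which is exactly the "search first, write second" behavior described above.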

Intercepting the Stream

The approach uses Chrome Manifest V3 with MAIN world script injection. In the MAIN world, you can override window.fetch before ChatGPT's JavaScript loads:

const originalFetch = window.fetch;
window.fetch = async function (...args) {
  const response = await originalFetch.apply(this, args);

  // args[0] may be a string, a URL object, or a Request object
  const input = args[0];
  const url = typeof input === 'string' ? input : input?.url ?? String(input);

  if (url.includes('/conversation')) {
    // Clone the response to read the stream without consuming it
    const clonedResponse = response.clone();
    parseSSEStream(clonedResponse.body);
  }

  return response;
};

The tricky part is parsing the SSE stream. ChatGPT uses JSON Patch, which means each message modifies a path in a JSON document. The search queries appear as add operations on paths like /message/content/parts/0:

async function parseSSEStream(readableStream) {
  const reader = readableStream.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // Keep the trailing incomplete line for the next chunk

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;

      const payload = line.slice(6).trim();
      if (payload === '[DONE]') continue; // end-of-stream sentinel, not JSON

      try {
        extractSearchQueries(JSON.parse(payload));
      } catch {
        // Skip malformed or non-JSON frames instead of killing the reader
      }
    }
  }
}
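The extractSearchQueries function called above isn't shown; a minimal sketch of it might look like this, keeping the same assumed "o"/"p"/"v" field names (the real stream's field names and paths differ and need to be discovered empirically):

```javascript
// Sketch of extractSearchQueries. The operation/path/value field names
// ("o", "p", "v") and the "search_queries" path segment are assumptions;
// adjust them to whatever the live stream actually sends.
const seenQueries = [];

function extractSearchQueries(patch) {
  // A frame may carry a single operation or a batch of operations.
  const ops = Array.isArray(patch) ? patch : [patch];
  for (const op of ops) {
    if (op.o === 'add' &&
        typeof op.p === 'string' &&
        op.p.includes('search_queries') &&
        typeof op.v === 'string') {
      seenQueries.push(op.v);
    }
  }
}
```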

What I Found Inside the Stream

After analyzing 500+ ChatGPT sessions, here are the patterns I identified:

1. Query Multiplication

For a single user question, ChatGPT generates an average of 8.2 distinct search queries. These aren't simple rephrases; they're strategic reformulations targeting different facets of the answer.

For example, asking "Is my website optimized for AI search?" generates queries like:

  • AI search optimization techniques 2026
  • GEO generative engine optimization checklist
  • Schema.org markup AI citations
  • how AI crawlers index websites
  • AI visibility score metrics

2. The Reformulation Gap

47% of what ChatGPT actually searches is semantically different from what you originally asked. I call this the Reformulation Gap. The AI rewrites your intent into queries it believes will yield better results.

This has huge implications for SEO. If you're optimizing your content for what users type, you're potentially missing 47% of the queries that AI platforms actually use to find and evaluate your content.
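One crude way to put a number on this gap is lexical overlap: treat a generated query as "reformulated" when its token overlap with the original prompt falls below a threshold. The sketch below uses Jaccard similarity with an arbitrary threshold; it is an illustration of the idea, not the methodology behind the 47% figure:

```javascript
// Proxy for the Reformulation Gap: the share of generated queries whose
// token overlap with the user's prompt is below a threshold. The
// tokenizer and threshold are arbitrary illustrative choices.
function tokens(s) {
  return new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function jaccard(a, b) {
  const setA = tokens(a);
  const setB = tokens(b);
  const inter = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : inter / union;
}

function reformulationGap(prompt, queries, threshold = 0.3) {
  const reformulated = queries.filter((q) => jaccard(prompt, q) < threshold);
  return reformulated.length / queries.length;
}
```

A real measurement would want embedding-based similarity rather than token overlap, since "semantically different" is exactly what lexical matching misses.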

3. The Consult-vs-Cite Ratio

ChatGPT reads 3.2x more sources than it actually cites. In a typical session with 4 cited sources in the response, it silently consulted 12-15 pages. The rest were read, weighed, and left uncredited: consulted but not cited.

Why does this matter? Because getting consulted (even without a citation) is a signal. It means the AI found your content, read it, and used it to form its understanding; it just chose to cite someone else.
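Once the stream is parsed into consult and cite events, computing the ratio is straightforward. The event shape below is illustrative, not the extension's internal format:

```javascript
// Tally consult vs cite events captured from a session to get the
// cite/consult ratio. The { type, url } event shape is illustrative.
function citeConsultRatio(events) {
  const consulted = new Set();
  const cited = new Set();
  for (const e of events) {
    if (e.type === 'consult') consulted.add(e.url);
    if (e.type === 'cite') cited.add(e.url);
  }
  return {
    consulted: consulted.size,
    cited: cited.size,
    ratio: consulted.size === 0 ? 0 : cited.size / consulted.size,
    // The "consulted but not cited" bucket discussed above
    consultedNotCited: [...consulted].filter((u) => !cited.has(u)),
  };
}
```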

4. Platform Differences

Each AI platform searches differently:

| Metric             | ChatGPT | Claude | Gemini |
|--------------------|---------|--------|--------|
| Queries/prompt     | 8.2     | 5.4    | 6.8    |
| Sources consulted  | 14      | 8      | 11     |
| Sources cited      | 4       | 3      | 5      |
| Cite/consult ratio | 28%     | 37%    | 45%    |

Claude is the most "selective" searcher: fewer queries, but a higher citation rate than ChatGPT. Gemini has the best cite/consult ratio, meaning it's more likely to credit the sources it reads.

Technical Challenges

Challenge 1: JSON Patch Parsing

ChatGPT's stream doesn't send complete messages. It sends incremental patches. A single search query might arrive across 5-10 separate patches, character by character. You need to maintain state and reconstruct the full query from fragments.
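The state-keeping boils down to a buffer keyed by patch path, so fragments of different fields don't collide. A minimal sketch, again using assumed "o"/"p"/"v" field names:

```javascript
// Reassemble a query that arrives as many small "append" patches.
// Keying the buffer by patch path keeps parallel fields separate.
// The patch field names are assumptions.
const partial = new Map();

function applyFragment(op) {
  if (op.o === 'append') {
    // Concatenate onto whatever has accumulated at this path so far
    partial.set(op.p, (partial.get(op.p) ?? '') + op.v);
  } else if (op.o === 'add') {
    // An "add" replaces the value at the path outright
    partial.set(op.p, op.v);
  }
  return partial.get(op.p);
}
```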

Challenge 2: Timing and Deduplication

The same URL can appear multiple times in a stream: once when the AI requests it, again when it receives the content, and once more when it decides to cite it. Fingerprint-based deduplication with timing windows solved this.
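The idea can be sketched as follows; the normalization rules and 5-second window are arbitrary choices for illustration, not the extension's exact parameters:

```javascript
// Fingerprint-based deduplication: treat a URL seen again within a
// short window as the same event, so request/receive/cite occurrences
// of one page collapse into a single record. The window length and
// normalization rules here are illustrative.
const lastSeen = new Map();

function isDuplicate(url, now = Date.now(), windowMs = 5000) {
  // Normalize so trivial variants (query strings, fragments,
  // trailing slashes) share one fingerprint
  const fingerprint = url.replace(/[#?].*$/, '').replace(/\/$/, '');
  const prev = lastSeen.get(fingerprint);
  lastSeen.set(fingerprint, now);
  return prev !== undefined && now - prev < windowMs;
}
```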

Challenge 3: Platform Differences

Claude uses a completely different format (input_json_delta chunks in standard SSE), while Gemini uses Service Workers that bypass window.fetch entirely. Each platform requires its own parser.
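For Claude, Anthropic's documented streaming format sends tool inputs (including search queries) as content_block_delta events whose delta is an input_json_delta carrying a partial_json string; concatenating the fragments and parsing at content_block_stop recovers the full input. A sketch of that accumulator:

```javascript
// Accumulate Claude's input_json_delta fragments into a complete tool
// input. Event shapes follow Anthropic's documented streaming format;
// the { query: ... } payload shown is an illustrative example.
let partialJson = '';

function handleClaudeEvent(event) {
  if (event.type === 'content_block_delta' &&
      event.delta?.type === 'input_json_delta') {
    partialJson += event.delta.partial_json;
  }
  if (event.type === 'content_block_stop') {
    // The concatenated fragments form one complete JSON document
    const toolInput = JSON.parse(partialJson);
    partialJson = '';
    return toolInput;
  }
  return null;
}
```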

What This Means for Your Content

If you're creating web content and want AI platforms to find, consult, AND cite your pages, here's what the data suggests:

  1. Optimize for reformulated queries, not just user queries. Think about what an AI would search for when answering questions about your topic.
  2. Schema.org markup matters. Sites with structured data receive 30-40% more citations in our data.
  3. The cite/consult gap is an opportunity. If you can measure which AI platforms read your content without citing it, you can optimize to close that gap.

Try It Yourself

I packaged this into a Chrome extension called AI Query Revealer. It shows you the hidden queries, sources, and citation decisions in real time as you use ChatGPT, Claude, or Gemini.

The code intercepts fetch requests client-side only, no data is sent to any external server. Your conversations stay private.


What's the most surprising thing you've noticed about how AI platforms search? I'd love to hear from anyone else who's been reverse-engineering these systems.
