A few months ago I had a question that didn't let me sleep. When people ask ChatGPT "what's the best tool for X", which brands actually get mentioned? And more importantly: does my brand get mentioned?
Turns out this is a surprisingly hard thing to measure. So we built a product for it. It's called Be Recommended, and under the hood it's a system that parallel-queries ChatGPT, Claude, Perplexity, and Gemini, then scores how visible a brand is inside their answers.
This post is about the architecture. And about the data we found, which frankly shocked us.
The problem in one line
Traditional SEO tells you how you rank on Google. But when someone opens ChatGPT and asks "what's the best CRM for a solo consultant", Google isn't in the room. The LLM answers directly, from its training data plus whatever live retrieval it does. If your brand isn't in that answer, you lose the customer before you even know they existed.
The Gartner forecast everyone quotes says roughly a quarter of search volume moves to AI assistants by 2026. Our own data, which I'll get to, suggests the shift is faster than that in some verticals.
Why this is harder than it looks
The naive version is easy: open ChatGPT, type a prompt, see if your brand is mentioned. Done.
The real version is not.
Problem one is prompt variance. "Best CRM for solo consultant" and "what CRM should a one person agency use" return different answers. Any score built on a single prompt is basically noise. You need a prompt set per vertical, seeded with realistic intent, and you need to rotate through it.
Problem two is model variance. ChatGPT and Claude don't even agree on what decade it is, let alone which SaaS product is the market leader. Averaging them hides the signal. We report per-engine scores separately and then a blended composite.
Problem three is response variance. The same prompt to the same model on two different days will surface different brands. You need to sample, not snapshot.
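To make the sampling idea concrete, here's a minimal sketch: issue the same prompt several times and treat the brand's mention rate across runs as the statistic, rather than trusting any single response. The responses below are made-up stand-ins for real engine output.

```python
def mention_rate(brand: str, responses: list[str]) -> float:
    """Fraction of sampled responses that mention the brand at all."""
    hits = sum(brand.lower() in r.lower() for r in responses)
    return hits / len(responses)

# Three samples of the same prompt on different days (illustrative):
responses = [
    "Try HubSpot or Pipedrive.",
    "Pipedrive works well for solo consultants.",
    "HubSpot, Zoho, or Close are common picks.",
]
rate = mention_rate("HubSpot", responses)  # mentioned in 2 of 3 samples
```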
Problem four, and this is the one that took us the longest, is parsing. LLMs don't return JSON when you ask "what CRM should I use". They return prose with brand names sprinkled inside sentences, sometimes with typos, sometimes with "CRM Hub" instead of "HubSpot CRM", sometimes with a brand mentioned in a dismissive footnote. You can't just grep.
The architecture, top to bottom
Here is how a single "analyze my brand" request flows through the system.
Step one: intent expansion. The user gives us a brand name and a URL. We pull the landing page, classify the vertical, and generate a set of 20 to 40 representative buyer intents. "Best X for Y", "alternatives to Z", "is brand A worth it", "what's the cheapest way to do W". The prompts are not random. They are templated against real search intents we've scraped from Quora, Reddit, and Google's People Also Ask.
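A stripped-down sketch of the intent-expansion step: a few templates filled in with the classified vertical and brand. The template strings, personas, and jobs here are illustrative placeholders; the real system seeds them from scraped Quora, Reddit, and People Also Ask questions.

```python
# Illustrative templates only; production templates come from scraped intents.
INTENT_TEMPLATES = [
    "best {vertical} for {persona}",
    "alternatives to {brand}",
    "is {brand} worth it",
    "what's the cheapest way to {job}",
]

def expand_intents(brand: str, vertical: str,
                   personas: list[str], jobs: list[str]) -> list[str]:
    """Return a deduplicated, sorted list of buyer-intent prompts."""
    prompts = set()
    for persona in personas:
        prompts.add(INTENT_TEMPLATES[0].format(vertical=vertical, persona=persona))
    prompts.add(INTENT_TEMPLATES[1].format(brand=brand))
    prompts.add(INTENT_TEMPLATES[2].format(brand=brand))
    for job in jobs:
        prompts.add(INTENT_TEMPLATES[3].format(job=job))
    return sorted(prompts)

prompts = expand_intents(
    "Acme CRM", "CRM",
    personas=["a solo consultant", "a one-person agency"],
    jobs=["track leads"],
)
```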
Step two: parallel fanout. Each prompt is dispatched to every engine in parallel. Four engines, up to forty prompts, that's one hundred and sixty concurrent inference calls per scan. We don't wait serially. If you do, a single scan takes twenty minutes and the user bounces.
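The fanout can be sketched with asyncio: one task per (engine, prompt) pair, all in flight at once. `query_engine` is a stub standing in for the real per-engine API clients.

```python
import asyncio

ENGINES = ["chatgpt", "claude", "perplexity", "gemini"]

async def query_engine(engine: str, prompt: str) -> dict:
    # Stub for the real per-engine API call.
    await asyncio.sleep(0)  # simulates network I/O
    return {"engine": engine, "prompt": prompt, "text": "..."}

async def fanout(prompts: list[str]) -> list[dict]:
    # One task per (engine, prompt) pair; gather runs them concurrently
    # instead of serially, which is what keeps a scan under a minute.
    tasks = [query_engine(e, p) for e in ENGINES for p in prompts]
    return await asyncio.gather(*tasks)

results = asyncio.run(fanout([f"prompt-{i}" for i in range(40)]))
# 4 engines x 40 prompts = 160 calls in flight
```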
Step three: rate limiting. This is where it got real. OpenAI's TPM limits on the standard tier make a naive fanout impossible. We implemented a token-bucket limiter per engine, per model, with exponential backoff and jittered retries. When one engine is saturated we spill to another to keep the user's scan moving. Responses get deduplicated against a short-lived cache so two prompts with near-identical semantics don't double-bill.
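A minimal sketch of the two pieces named above, a per-engine token bucket and jittered exponential backoff. The capacities and constants are placeholders, and the production system adds the spillover and dedup cache on top of this.

```python
import random
import time

class TokenBucket:
    """Minimal token-bucket limiter, sized in tokens per minute (TPM)."""

    def __init__(self, tpm: int):
        self.capacity = tpm
        self.tokens = float(tpm)
        self.refill_per_sec = tpm / 60.0
        self.last = time.monotonic()

    def try_acquire(self, cost: int) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base*2^n)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

On a 429, the caller sleeps for `backoff_delay(attempt)` and retries; the jitter keeps 160 concurrent calls from retrying in lockstep.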
Step four: response normalization. This is the parsing layer. Each response gets run through a second, cheaper LLM call whose only job is to extract mentioned entities and classify the sentiment of each mention. Positive mention, neutral mention, negative mention, dismissive footnote. We also track position: was your brand first, third, or buried at the end of a list of ten? Position matters because users don't read past the top three.
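A hedged sketch of what the normalization layer consumes: the cheap extractor model is asked to emit JSON, and a validation pass drops malformed entries. The prompt wording, field names, and sentiment labels here are assumptions, not the production schema.

```python
import json

# Illustrative extractor instruction; the real prompt is more elaborate.
EXTRACTION_PROMPT = (
    "Extract every product or brand mentioned in the text. Return JSON: "
    '[{"brand": str, "sentiment": "positive|neutral|negative|dismissive", '
    '"position": int}]'
)

def normalize(raw_extractor_output: str) -> list[dict]:
    """Parse the extractor's JSON and keep only well-formed mentions."""
    mentions = json.loads(raw_extractor_output)
    valid = {"positive", "neutral", "negative", "dismissive"}
    # position is the 1-based rank of the mention within the answer.
    return [m for m in mentions
            if m.get("brand")
            and m.get("sentiment") in valid
            and m.get("position", 0) >= 1]

sample = ('[{"brand": "HubSpot CRM", "sentiment": "positive", "position": 1},'
          ' {"brand": "???", "sentiment": "meh", "position": 2}]')
mentions = normalize(sample)  # second entry dropped: unknown sentiment label
```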
Step five: scoring. The scoring algorithm is where the opinion lives. We weight by engine reach (ChatGPT dominates), by mention sentiment, by position, and by the share of prompts that mentioned the brand at all. The output is a 0 to 100 score, the AI Visibility Score. The math is tuned so that a brand mentioned positively in the top three of most prompts on most engines lands around 80. That's the "top performer" band.
Step six: reporting. We don't dump raw data on users. The final report shows the composite score, a per-engine breakdown, the exact prompts where you were mentioned and where you weren't, the top competitors that showed up in your place, and a prioritized list of fixes. The fixes come from a rules engine that pattern-matches on the gaps. Missing G2 listing, thin landing page, no schema markup, no authoritative third-party mention, and so on.
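The rules engine can be sketched as an ordered list of (gap signal, suggested fix) pairs checked against the scan's findings. The signal names here are hypothetical, just to show the shape.

```python
# Hypothetical gap signals and fixes, listed in priority order.
RULES = [
    ("no_g2_listing", "Create and populate a G2 listing."),
    ("thin_landing_page", "Expand the landing page with concrete use cases."),
    ("no_schema_markup", "Add Organization/Product schema markup."),
    ("no_third_party_mentions", "Pitch reviews to authoritative publications."),
]

def prioritized_fixes(signals: set[str]) -> list[str]:
    """Return fixes for the detected gaps, keeping the rules' priority order."""
    return [fix for signal, fix in RULES if signal in signals]

fixes = prioritized_fixes({"thin_landing_page", "no_schema_markup"})
```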
What the data showed us
We ran the system against a few hundred brands across SaaS, e-commerce, agencies, and consumer apps. A few findings that genuinely surprised me.
The average AI Visibility Score is 31. Out of 100. That means for most companies, most of the time, the major LLMs either don't mention them or mention them in a throwaway sentence. Think about that for a second. A whole industry is optimizing for Google while the actual decision-making conversation has already moved somewhere they're not even present.
Brand recognition and AI visibility are barely correlated. We scanned some household SaaS names and found scores in the 40s. We scanned smaller tools with great documentation and active third-party review coverage and found scores in the 70s. Training data eats marketing budgets for breakfast. What the internet wrote about you five years ago matters more than what your ad agency shipped last quarter.
The engines disagree a lot. It's common to see a 30-point spread between ChatGPT and Perplexity for the same brand. Perplexity leans heavily on fresh retrieval, so it rewards recent coverage. ChatGPT leans on training snapshots, so it rewards longevity and repeat mentions in tech press. Claude sits in the middle. Gemini is the wildcard.
Position bias is brutal. Being mentioned fifth in a list of ten is almost the same as not being mentioned at all. Users read the first two or three answers and stop. If the LLM puts you in sixth place, you're invisible.
What we learned building it
Two things I'd tell anyone trying to build something similar.
First: the parsing layer is where your product lives or dies. Getting prompts right is easy. Calling the APIs is easy. Extracting clean, structured mentions from messy LLM prose is the whole game. We rewrote ours three times before it was usable.
Second: this space is moving fast enough that any snapshot is a guess at the current weather. We rescan on a schedule because the same brand can swing 20 points in a month when a big review lands or a model version ships. A one-shot audit is entertainment. A tracked score over time is actually useful.
Try it
If you're curious what the four big LLM engines say about your own brand, Be Recommended runs a scan for $4. You get a full report with the prompts, the engines, the competitors, and the fixes. We built it because we were running this for ourselves every week and decided it was silly not to let other people use it.
Would love to hear what score you get. Drop it in the comments.