Originally published at surmado.hashnode.dev
Why I Built an AI Visibility Tool That Doubts Its Own Outputs


Building an AI visibility tool (AEO) sounds straightforward. Run prompts across LLMs. Parse outputs. Count mentions. Call it a metric. Ship the UI.

Except that stack has a fundamental flaw: LLMs are stochastic… but most scoring stacks treat them like stable sensors.

The job isn't to pretend the model is stable. Or to delude yourself into thinking a lot of noise = statistical significance. The job is to quantify the instability. To go smaller rather than bigger. More niche rather than more "Big Data." So we built our system to measure stability and meaning, not just mentions.

I'm Luke, founder of Surmado. The team and I build AI marketing intelligence for small businesses. Here's what I learned building semantic matching for brands that aren't famous yet.

The Naive Approach (And Why It Breaks Fast)

The typical AI visibility stack looks like this:

  1. Build prompt library
  2. Call LLM APIs
  3. Parse for brand mentions
  4. Score mentions
  5. Expose metrics
  6. Ship scoring without uncertainty bounds
  7. Assume repeatability you didn't test

This works fine if you're tracking Nike vs Adidas. Run the same query 100 times, and you'll get roughly consistent mentions because these brands are deeply embedded in training data.

But try tracking "El Tianguis Rolled Taquitos" vs "Chipotle."

Noise. Or nothing. It all depends on WHO is asking and HOW you're parsing.

The problem: Most tools stop at exact string matching. They look for literal brand names in outputs. They ignore or handwave "personas." They skip the part that breaks on small brands.
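
For contrast, the exact-string step most tools stop at boils down to something like this. A deliberately naive sketch, not any particular vendor's code:

```python
# Naive approach: count literal brand mentions in LLM outputs.
# Illustrative only; this is the step that breaks on small brands.
def count_exact_mentions(outputs: list[str], brand: str) -> int:
    brand_lower = brand.lower()
    return sum(brand_lower in output.lower() for output in outputs)

outputs = [
    "Try Tacos El Gordo for al pastor.",
    "That rolled taco spot on Adams Ave is a local favorite.",  # real signal, missed
]
print(count_exact_mentions(outputs, "El Tianguis Rolled Taquitos"))  # 0
```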

The Small Business Problem

Of course, LLMs are next-token prediction machines trained on internet-scale data. That creates a coverage cliff: a sharp drop in reliability when you move from well-documented entities to local businesses. It also creates a persona-variance bomb: a huge swing based on who the LLM thinks it's talking to. On the consumer apps, the model shifts its probability window with its audience. ChatGPT "remembers."

So when you ask an LLM "what are the best taco places in San Diego," it might say:

  • "Tacos El Gordo"
  • "That rolled taco spot" (intent match, no exact string)
  • "A local favorite on Adams Ave" (location proxy)
  • Nothing at all (below the coverage cliff)

If you're a tourist, it might recommend the safe bets. A local might get the deep cuts.

Why small business use cases are harder:

  • Smaller web footprint
  • Inconsistent citations across sources
  • Local nicknames and variants
  • Multi-language or regional naming

Exact string matching catches maybe 30% of that. The rest? Lost signal. If you're Nike, that's good enough. If you're a small brand, it's easy to get tricked or disappointed.

If an output says "that rolled taco spot on Adams," we treat it as a candidate match only when category + location cues align with the brand profile across repeated runs. Exact-string tools miss that entirely. It's ML, not LLMs, doing that semantic match.
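
Roughly, that kind of candidate-match gate can be sketched like this. The model choice, cue lists, and thresholds below are illustrative assumptions, not our exact pipeline:

```python
# Hedged sketch: promote a semantic hit to a candidate match only when
# category and location cues also line up with the brand profile.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

brand_profile = {
    "name": "El Tianguis Rolled Taquitos",
    "category_cues": ["rolled taco", "taquito", "taco shop"],
    "location_cues": ["adams ave", "san diego", "normal heights"],
}

def is_candidate_match(snippet: str, profile: dict, sim_threshold: float = 0.45) -> bool:
    text = snippet.lower()
    category_hit = any(cue in text for cue in profile["category_cues"])
    location_hit = any(cue in text for cue in profile["location_cues"])
    # Semantic similarity between the snippet and a brand-profile description.
    emb = model.encode([snippet, profile["name"] + " " + " ".join(profile["category_cues"])])
    semantic_hit = util.cos_sim(emb[0], emb[1]).item() >= sim_threshold
    # Only a candidate when category + location cues align, with the semantic score as support.
    return category_hit and location_hit and semantic_hit

print(is_candidate_match("That rolled taco spot on Adams Ave is great.", brand_profile))
```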

Show Your Work (Or It's Just Vibes With a UI)

Here's where most tools lose me: they give you a score with no receipts.

"Your AI visibility score is 67/100."

Cool. Based on what? How did you get there? Can I audit it? What's the sample?

We output everything:

  • PDF reports with full methodology
  • JSON exports with raw data, prompts used, and scoring logic
  • PPTX decks for stakeholder presentations
  • Conversational interface (our chatbot Scout) that can explain any metric in plain English

If you want to replicate the metric, the raw outputs and scoring inputs are exportable. You see how we built your persona. You see our prompts. You get the signal, not the noise.

The same artifacts and signals are available via API (and webhooks for automation), so visibility checks can be part of your workflows, not a tab you babysit. And it's easy to run variants like tourist vs local. (Or country vs country if you're an international brand. Very helpful to find competitors in local markets you didn't know existed.)
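
If you'd rather script that than click through anything, a persona-variant run looks roughly like the sketch below. The endpoint path, payload fields, and persona names are hypothetical placeholders, not our documented schema; the OpenAPI spec is the source of truth:

```python
# Hypothetical sketch of running persona-variant visibility checks via API.
# Endpoint, field names, and personas are placeholders, not the documented schema.
import requests

API_BASE = "https://api.surmado.com"  # placeholder base URL

def run_visibility_check(brand: str, persona: str, api_key: str) -> dict:
    resp = requests.post(
        f"{API_BASE}/v1/visibility-checks",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {api_key}"},
        json={"brand": brand, "persona": persona},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Compare a tourist persona against a local one for the same brand.
for persona in ("san_diego_tourist", "san_diego_local"):
    report = run_visibility_check("El Tianguis Rolled Taquitos", persona, api_key="...")
```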

Transparency builds trust. Especially when you're asking small businesses to make budget decisions based on your numbers. A mention count is a number. A reliable decision-support signal needs variance, context, and receipts.

Building for Variance (Instead of Hiding From It)

We needed semantic matching that could handle:

  • Brand variants and misspellings
  • Category confusion ("is this a taco place or a burrito place?")
  • Competitive positioning ("similar to X but cheaper")
  • Local context and geo-specific results

We run repeated, persona-shaped scenarios across models and seeds, then only promote signals that stay stable across runs.

We track things like mention-rate variance, competitor displacement, and persona agreement, then normalize across runs. Across all of our products, the LLMs handle narrative extraction; Python and ML handle the math.

We're not treating LLM outputs as gospel. We're treating them as they are… noisy signals that need interpretation, aggregation, and statistical grounding.
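
Concretely, the aggregation step looks something like the sketch below, where each repeated run is reduced to a boolean "was the brand (or a candidate match) mentioned?" flag. The run count and the standard-error cutoff are illustrative assumptions, not our production settings:

```python
# Minimal sketch: aggregate repeated runs into a mention rate with a
# simple stability check before anything becomes a user-facing metric.
from math import sqrt
from statistics import mean

def summarize_runs(mention_flags: list[bool], min_runs: int = 20, max_se: float = 0.10) -> dict:
    n = len(mention_flags)
    rate = mean(mention_flags)
    std_err = sqrt(rate * (1 - rate) / n) if n else 1.0  # uncertainty of the rate
    stable = n >= min_runs and std_err <= max_se
    return {
        "mention_rate": rate,
        "std_error": std_err,
        "stable": stable,                # only stable signals get promoted to a metric
        "directional_only": not stable,  # surfaced with a caveat instead
    }

# e.g. 20 repeated, persona-shaped runs across models and seeds
runs = [True, True, False, True, True, True, False, True, True, True,
        True, False, True, True, True, True, False, True, True, True]
print(summarize_runs(runs))
```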

We only promote a signal to a user-facing metric when it crosses a minimum stability threshold across repeated runs. We are candid when a result is directional rather than statistically sound. And we have a chatbot (Surmado Scout) that reads the reports via RAG and injected prompt context, with tool calling for historical comparison and graph drawing (plus some other fun research that runs silently in the background). The narrow but rich context window of this chatbot has already been very helpful for our own business. Yummy dogfood.

What I'd Do Differently

If I started over? I'd ship the MVP faster and iterate on semantic matching in public.

We spent ~6 months building the "right" architecture before launching. I rebuilt the project from the ground up several times. That perfectionism delayed feedback loops. The matching layer improved 3x faster once real businesses started using it and showing us edge cases we'd never considered.

Only after getting alpha testers in did the semantic sort/match really click into place. But I suppose every builder has to learn this same lesson.

People still ask where the dashboard is. Well, it's in the chatbot. It is the chatbot. It's really nifty and easier to parse, but old habits die hard. I think it's more intuitive and actually useful though. I hope future users agree!

Why This Approach Matters

SEO isn't dead, but AIO/AEO/GEO (can we please agree on one acronym?) is becoming more and more of a focus. I think it will only get more intense after ChatGPT launches ads.

And yeah, the market is flooded with tools that collect LLM outputs and call it analysis. Most of their backends are like a weekend project. Call an API. Note the result. Make it pretty. That's fine if you're enterprise and have a team to parse it. Or if you just want a story to tell your boss.

But small businesses don't have that bandwidth. Or that luxury.

They need:

  • Actionable insights, not raw data
  • Auditability, not black-box scores
  • Context-aware tools that understand their business, not just their brand name

That's what we built.


If you're building in this space, I'm happy to compare notes on evaluation, variance, and small-brand edge cases. The API is public and documented, and you can check out the OpenAPI spec or full docs if you want to see how we expose these signals.

surmado.com | GitHub | hi@surmado.com


Tags: ai llm seo machine-learning startup saas python small-business
