Nikola Sava

We Audited 20+ Sites for AI Visibility. Here Are the Most Common Mistakes

Most sites we reviewed had decent technical foundations. Good Core Web Vitals, clean sitemaps, well-structured URLs. They ranked fine in Google. But when you queried their niche in ChatGPT, Perplexity, or Gemini - nothing. Completely absent.

That's the AI visibility gap. After 20+ audits, the pattern is consistent. The same mistakes appear across industries, site sizes, and tech stacks. Here's what we actually found, ordered by how often we see it.


Mistake #1: Blocking AI Crawlers in robots.txt

This is the fastest way to disappear from AI search. You can't be cited if the crawler can't read your content.

Check your robots.txt file right now:

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

If any of those blocks exist, you're opted out. On many of the sites we audited, these rules had been added automatically by security plugins or firewall rules. Nobody noticed.

An important distinction: training crawlers (GPTBot, ClaudeBot) collect data for model training, while search crawlers (ChatGPT-User, Claude-User) fetch content in real time to answer user queries. You can block training while keeping search visibility - that's a legitimate middle ground:

```text
# Block training, allow real-time search
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /
```
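If you'd rather check this programmatically, Python's standard-library `urllib.robotparser` can parse the file and answer per-bot questions. A minimal sketch (the bot list mirrors the user agents above):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "ChatGPT-User", "Claude-User",
           "Google-Extended", "PerplexityBot"]

def blocked_ai_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI user agents that are not allowed to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]

sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_bots(sample))  # ['GPTBot']
```

Run it against the live file (fetch `https://yourdomain.com/robots.txt` first) as part of a deploy check, so a plugin update can't silently opt you out.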

Mistake #2: Missing or Outdated Schema Markup
Broken or absent structured data is the second most common issue we find. AI systems use schema to understand what your business is, what it offers, and how it connects to known entities in the world. Without it, they're guessing - and usually wrong.

Minimum schema stack for AI visibility:

Organization - with name, url, description, sameAs links to LinkedIn, Crunchbase, Wikidata

Article or BlogPosting - on every content page

FAQPage - on any Q&A-style content

Product or Service - on commercial pages

Person + author - on bylined articles and thought leadership content

Half the sites we audited had no schema at all. The other half had bare-bones Organization markup from 2020, with no sameAs links and a generic one-line description.

Here's a minimal Organization block that actually works:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Brand Name",
  "url": "https://yourdomain.com",
  "description": "One clear sentence about what you do and for whom.",
  "sameAs": [
    "https://www.linkedin.com/company/your-brand",
    "https://www.crunchbase.com/organization/your-brand",
    "https://www.wikidata.org/wiki/QXXXXXXX"
  ]
}
```
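A small validator helps keep a block like that honest after CMS changes. This Python sketch checks the fields listed above; the two-link `sameAs` minimum is our own rule of thumb, not a schema.org requirement:

```python
import json

REQUIRED = ["name", "url", "description", "sameAs"]

def check_organization(jsonld: str) -> list[str]:
    """Return a list of problems with an Organization JSON-LD block."""
    data = json.loads(jsonld)
    problems = []
    if data.get("@type") != "Organization":
        problems.append("@type is not Organization")
    for field in REQUIRED:
        if not data.get(field):
            problems.append(f"missing or empty: {field}")
    # Assumption: at least two sameAs links for usable entity signals
    if len(data.get("sameAs", [])) < 2:
        problems.append("fewer than two sameAs links")
    return problems
```

An empty return list means the block passes; anything else is a concrete fix to make.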

Mistake #3: Weak Entity Signals
AI engines don't match keywords - they recognize entities. If your content says "we help companies grow" instead of naming your brand and what it specifically does, you're invisible at the entity level.

Content written for humans often overuses pronouns and generic phrases. Content that works for AI needs to repeat the brand name, category, and specific terms throughout the page - not in a spammy way, but clearly and consistently.

Weak:

"Our team has years of experience helping clients achieve their goals."

Stronger:

"The Web Audits team has run AI visibility and technical SEO audits since 2022, working with B2B SaaS companies across Europe and the US."

The difference isn't keyword stuffing. It's entity clarity.

Mistake #4: Content That's Not Extractable
There's a real difference between content that ranks and content that gets cited. AI systems pull specific, clean, well-structured answers. If your answer is buried in 1,200 words of narrative prose, it won't get extracted - even if it's the most accurate answer on the page.

Patterns that block extractability:

No clear H2/H3 heading structure

Long paragraphs without a direct answer near the top

No data points, statistics, or concrete claims

Marketing copy that sounds important but carries no actual information

Fix: write in answer-first format. The direct answer goes in the first two sentences. Context and explanation follow. Think of it like writing for a featured snippet, but stricter. Every section should be self-contained enough to be quoted on its own.
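To make the pattern concrete, here's a hypothetical section written answer-first (the topic and timeline are invented for illustration):

```text
## How long does a technical SEO audit take?

A technical SEO audit of a mid-sized site typically takes two to three
weeks: one week of crawling and data collection, one to two weeks of
analysis and reporting.

The timeline depends mostly on crawl size and how many page templates
the site uses. [...context and explanation continue here...]
```

The heading is a question, the first sentence is the answer, and the section can be quoted on its own without losing meaning.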

Mistake #5: No Presence Beyond Your Own Domain
AI systems - especially those using retrieval-augmented generation (RAG) - pull from sources they already consider credible. If your brand only exists on your own website, you're working with a single citation source. That's not enough.

Sites that appear consistently in AI-generated answers have:

A Wikipedia or Wikidata entry (even a stub counts)

Mentions in 2-3 trade publications, niche directories, or industry blogs

Quotes or data points cited by other content creators

Active profiles on LinkedIn, GitHub, and Crunchbase - with names that match exactly what's in your schema sameAs values

This is the part that takes the longest. There's no shortcut. But it's why some technically solid sites still don't appear in AI answers - they simply don't exist outside their own domain.

Mistake #6: No llms.txt File
The llms.txt standard is still emerging, but it's worth adding now. Unlike robots.txt - which controls crawl access - llms.txt guides AI systems at query time, when a model is assembling an answer and needs to understand your site's structure and content priorities.

The file lives at /llms.txt in your root directory. Keep it under 10KB. Use markdown headers and direct summaries - no marketing language.

```text
# llms.txt

> Web Audits is a technical SEO and AI visibility audit agency
> based in Europe. We run audits for B2B SaaS companies and
> digital agencies.

## Key pages

- [Services](https://webaudits.dev/services): Full audit packages
- [Blog](https://webaudits.dev/blog): Technical SEO and GEO articles
- [About](https://webaudits.dev/about): Team and methodology
```

Not every LLM reads it today, but adoption is growing. The cost of adding it is about 30 minutes. When a model does read it, it gets a curated map of your most important content instead of guessing.
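Keeping the file within those guidelines is easy to script. A Python sketch (the 10KB limit comes from the advice above; the title and blockquote-summary checks follow the example file and are our own assumptions):

```python
def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks sane."""
    problems = []
    # Advice above: keep the file under 10KB
    if len(text.encode("utf-8")) > 10 * 1024:
        problems.append("file is over 10KB")
    lines = text.splitlines()
    # Assumption: a '# ' title line and a '> ' summary near the top,
    # matching the example file format
    if not lines or not lines[0].startswith("# "):
        problems.append("should start with a '# ' title line")
    if not any(line.startswith("> ") for line in lines[:5]):
        problems.append("no '> ' summary blockquote near the top")
    return problems
```

Wire it into CI against the deployed `/llms.txt` and the file can't silently rot.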

Mistake #7: One-Time Fix Mentality
The sites that improved fastest after audits didn't just fix the issues - they set up a monitoring routine. AI search results shift. Models update. New crawlers appear. A site that was visible last quarter can disappear after a CMS update resets your robots.txt.

Monthly checks worth adding to your workflow:

Query your brand name and core topics in ChatGPT, Perplexity, and Gemini

Open yourdomain.com/robots.txt - verify AI crawlers are still allowed

Check that schema is intact after any CMS or plugin updates

Review external mentions and new citation sources quarterly

Confirm llms.txt is accessible and up to date

This doesn't take long once it's routine. The issue is that most teams never set it up in the first place.
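The schema check in particular is easy to automate. This sketch uses only Python's standard library to pull every JSON-LD block out of a page, so you can diff the `@type` list before and after a CMS update (`schema_types` is our own helper name, not a library API):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            block = json.loads(data)
            if isinstance(block, dict):  # skip JSON-LD arrays in this sketch
                self.blocks.append(block)

def schema_types(html: str) -> list[str]:
    """List the @type values of every JSON-LD block on a page."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    return [block.get("@type", "?") for block in extractor.blocks]
```

If `Organization` or `Article` disappears from the list after a plugin update, you know before the crawlers do.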

Mistake #8: Ignoring AI Overviews in Google
Google's AI Overviews pull from the same content signals that other AI systems use. If you're not showing up in AI Overviews, the fixes are the same as everything above - better schema, cleaner structure, answer-first writing, and entity clarity.

One thing specific to Google: page speed and Core Web Vitals still matter here. Slow pages get de-prioritized in AI Overview sourcing even when the content is good. Don't let technical debt block otherwise solid content.

Practical Takeaway
Start here: open your robots.txt and check for blocked AI crawlers. Then look at your most important landing page - does it have Organization schema with sameAs links? Does the first paragraph answer the core question directly, by name?

Those three things cover the majority of what we fix in a standard AI visibility audit at webaudits.dev. They're not the whole picture, but they're where most of the gap lives.

Fix the basics first. Then build outward.
