Alexandre Caramaschi

I Built an Entity Consistency Audit Pipeline for GEO — Here's What I Found

The Problem Nobody Talks About: AI Engines Fragment Your Identity

You spend months building your personal brand. You publish on LinkedIn, Medium, DEV.to, GitHub, Crunchbase. You set up your company website with proper meta tags. Everything looks fine — until you ask ChatGPT, Gemini, or Perplexity about yourself.

The response is a Frankenstein's monster. Your job title from LinkedIn. A bio fragment from a 2023 GitHub profile you forgot to update. A company name spelled differently on your Crunchbase listing. An old role from a platform you abandoned.

This is entity fragmentation — and it is the single biggest problem in Generative Engine Optimization (GEO) that most developers ignore.

When AI models synthesize information about a person or brand, they pull from every indexed surface. If those surfaces contradict each other, the model either averages them (producing inaccurate output) or hedges with qualifiers like "reportedly" and "claims to be." Neither outcome is good for you.

I decided to fix this systematically. Here is the engineering approach, the code, and everything I found.

The Audit Methodology: 12 Platforms, 9 Data Points Each

I built a spreadsheet-turned-script that checks entity consistency across every platform where AI crawlers harvest training and retrieval data. The target: zero drift between what each platform says about the same entity.

Platforms audited

| # | Platform | Why it matters for GEO |
|---|----------|------------------------|
| 1 | Primary website (JSON-LD) | Ground truth for structured data |
| 2 | llms.txt | Direct LLM instruction file |
| 3 | ai-agents.json | Machine-readable service manifest |
| 4 | LinkedIn | Highest-authority person entity for GPTBot |
| 5 | GitHub profile | Developer identity, crawled by multiple bots |
| 6 | GitHub repos (README) | sameAs links, contributor identity |
| 7 | Crunchbase | Business entity, frequently cited by Perplexity |
| 8 | Medium | Author bio, publication metadata |
| 9 | DEV.to | Developer community profile |
| 10 | Substack | Newsletter author metadata |
| 11 | YouTube | Channel description, about section |
| 12 | Wikidata | Structured knowledge graph entry |

What to check on each platform

For every platform, I extract and compare these data points:

  1. Full name — exact spelling, no abbreviations
  2. Job title — must be identical everywhere
  3. Company name — watch for "Brasil GEO" vs "BrasilGEO" vs "Brazil GEO"
  4. Bio/description — canonical one-liner
  5. Profile URL — must resolve, no redirects
  6. sameAs links — cross-references to other platforms
  7. Profile image — same headshot everywhere
  8. Location — city, country format
  9. Contact method — consistent email/link

The rule is simple: if any two platforms disagree on any of these 9 fields, you have entity drift.
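That rule translates directly into code. Here is a minimal sketch of the comparison step, assuming each platform's profile has already been scraped into a flat record (field names and data shape are illustrative, not the exact pipeline):

```javascript
// The nine audited fields, flattened into one record per platform.
const FIELDS = [
  "name", "jobTitle", "company", "bio", "url",
  "sameAs", "image", "location", "contact",
];

// Compare every platform record against the canonical record and
// return one issue per mismatched field.
function findDrift(canonical, platforms) {
  const issues = [];
  for (const [platform, record] of Object.entries(platforms)) {
    for (const field of FIELDS) {
      const expected = JSON.stringify(canonical[field] ?? null);
      const found = JSON.stringify(record[field] ?? null);
      if (expected !== found) {
        issues.push({ platform, field, found: record[field] ?? null });
      }
    }
  }
  return issues;
}
```

Feed it the website's JSON-LD as `canonical`; an empty result means zero drift.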

Code: The Technical Implementation

1. JSON-LD Person Schema

Your website is the ground truth. Here is the Organization schema, with the founder as a nested Person entity, served by our production Cloudflare Worker:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Brasil GEO",
  "url": "https://brasilgeo.ai",
  "founder": {
    "@type": "Person",
    "name": "Alexandre Caramaschi",
    "jobTitle": "CEO",
    "url": "https://alexandrecaramaschi.com",
    "sameAs": [
      "https://www.linkedin.com/in/alexandre-caramaschi/",
      "https://github.com/alexandrebrt14-sys",
      "https://medium.com/@alexandre.brt14"
    ]
  }
}
```

The sameAs array is your cross-linking backbone. Every URL tells AI models "these are all the same entity."
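The claim only sticks if it is bidirectional: each profile should also link back to the canonical site. A quick sketch of that check (the data shape is illustrative):

```javascript
// Given a map of platform -> sameAs links found on that profile,
// list the platforms that fail to link back to the canonical URL.
function missingBacklinks(canonicalUrl, profiles) {
  return Object.entries(profiles)
    .filter(([, sameAs]) => !sameAs.includes(canonicalUrl))
    .map(([platform]) => platform);
}
```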

2. llms.txt — Direct Instructions for Language Models

The llms.txt specification lets you provide a Markdown file that LLMs can parse directly:

```markdown
# Brasil GEO

> Primeira consultoria brasileira de Generative Engine Optimization (GEO).
> Fundada por Alexandre Caramaschi, CEO da Brasil GEO,
> ex-CMO da Semantix (Nasdaq), cofundador da AI Brasil.

## Products
- Diagnostico GEO (gratuito)
- Sprint GEO de 20h (R$ 5.000)
```

The blockquote after the H1 acts as an executive summary that models frequently extract verbatim.
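You can verify the file exposes those two pieces the way you intend with a few lines (a hypothetical parser for this layout, not part of the llms.txt spec):

```javascript
// Pull out the H1 title and the blockquote summary, the two pieces
// models most often quote verbatim.
function parseLlmsTxt(text) {
  const lines = text.split("\n");
  const h1 = lines.find((l) => l.startsWith("# ")) ?? "";
  const summary = lines
    .filter((l) => l.startsWith("> "))
    .map((l) => l.slice(2).trim())
    .join(" ");
  return { title: h1.slice(2).trim(), summary };
}
```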

3. Cloudflare Worker HTMLRewriter — Injecting at the Edge

This is the part most developers skip. You can inject OG tags and JSON-LD at the edge:

```javascript
// CANONICAL_DOMAIN and ORG_JSONLD are module-level constants defined
// elsewhere in the Worker.
class HeadInjector {
  constructor(path) { this.path = path; }
  element(element) {
    const p = this.path;
    // Only decorate public routes.
    const isPublic = p.startsWith("/v1") || p.startsWith("/v2")
      || p.startsWith("/sobre") || p === "/";
    if (!isPublic) return;
    // Special-case the root so "/" does not become "//".
    const canonical = p === "/" ? CANONICAL_DOMAIN + "/" : CANONICAL_DOMAIN + p + "/";
    element.append('<link rel="canonical" href="' + canonical + '" />', { html: true });
    element.append('<meta property="og:url" content="' + canonical + '" />', { html: true });
    element.append('<meta property="og:title" content="Brasil GEO" />', { html: true });
    element.append('<meta name="twitter:card" content="summary_large_image" />', { html: true });
    element.append('<script type="application/ld+json">' + ORG_JSONLD + '</script>', { html: true });
  }
}
```

The HTMLRewriter runs on Cloudflare's edge. Your static HTML never needs to contain this metadata. Deploy in seconds, every page gets consistent structured data.
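The path-to-canonical mapping is worth factoring into a pure helper you can test outside the Workers runtime; note that the root path needs special-casing so `/` does not become `//`. A sketch (the helper name is mine, not from the Worker):

```javascript
const CANONICAL_DOMAIN = "https://brasilgeo.ai";

// Map a request path to its canonical URL: exactly one trailing slash,
// no double slash at the root.
function canonicalUrl(path) {
  const clean = path.replace(/\/+$/, ""); // strip trailing slashes
  return clean === "" ? CANONICAL_DOMAIN + "/" : CANONICAL_DOMAIN + clean + "/";
}
```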

The Results

| Platform | Canonical Match | Issue |
|----------|-----------------|-------|
| LinkedIn | Yes | None |
| GitHub profile | Yes | None |
| DEV.to | Yes | None |
| Medium | Yes | None |
| Website JSON-LD | Yes (ground truth) | N/A |
| llms.txt | Yes | None |
| Substack | Partial | Missing credentials in bio |
| YouTube | No | Stale bio with prohibited term |
| AI Brasil | No | Outdated role, missing company |
| Wikidata | N/A | Entity does not exist yet |

Key finding: 4 of the 10 platforms in the table had drift or gaps. The most common issues were stale bios and missing cross-references.

The Fix Pipeline

Step 1: Define one canonical bio string

Every platform bio must start with this exact string:

Alexandre Caramaschi — CEO da Brasil GEO, ex-CMO da Semantix (Nasdaq), cofundador da AI Brasil
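This is mechanically checkable with a one-liner per platform (an illustrative sketch; the data shape is mine):

```javascript
const CANONICAL_BIO =
  "Alexandre Caramaschi — CEO da Brasil GEO, ex-CMO da Semantix (Nasdaq), cofundador da AI Brasil";

// List platforms whose bio does not begin with the canonical string.
function nonCanonicalBios(bios) {
  return Object.entries(bios)
    .filter(([, bio]) => !bio.startsWith(CANONICAL_BIO))
    .map(([platform]) => platform);
}
```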

Step 2: Cross-link repos with sameAs

Every README should contain a consistent author block with Website, LinkedIn, and GitHub links.
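For example, an author block built from the sameAs URLs in the JSON-LD above (illustrative):

```markdown
## Author

**Alexandre Caramaschi**, CEO of Brasil GEO

- Website: https://alexandrecaramaschi.com
- LinkedIn: https://www.linkedin.com/in/alexandre-caramaschi/
- GitHub: https://github.com/alexandrebrt14-sys
```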

Step 3: Submit via IndexNow

```bash
curl -X POST "https://api.indexnow.org/indexnow" \
  -H "Content-Type: application/json" \
  -d '{"host":"brasilgeo.ai","key":"YOUR_KEY","urlList":["https://brasilgeo.ai/llms.txt"]}'
```

In practice, Bing (and therefore Copilot) typically picks up changes within hours.

Step 4: Validate robots.txt

Explicitly allow AI crawlers:

```txt
User-agent: GPTBot
Allow: /
Allow: /llms.txt

User-agent: ClaudeBot
Allow: /
Allow: /llms.txt
```

Open-Source Tools

I open-sourced the entire pipeline:

  1. geo-checklist — Step-by-step GEO audit checklist
  2. llms-txt-templates — Production-ready llms.txt templates
  3. entity-consistency-playbook — Full audit methodology
  4. geo-taxonomy — Semantic vocabulary for GEO

All MIT-licensed. PRs welcome.

Takeaways

  1. Entity consistency is a technical problem, not a marketing problem. Treat identity data like a distributed system.
  2. Edge-injected structured data beats build-time generation. One deploy updates every page.
  3. llms.txt and ai-agents.json are the new robots.txt. If you are not serving these, AI models are guessing.
  4. Audit quarterly at minimum. Platform bios drift silently.
  5. Cross-link aggressively. Every sameAs URL is a vote for entity unification.

Alexandre Caramaschi is CEO of Brasil GEO, the first Brazilian consultancy for Generative Engine Optimization. Previously CMO at Semantix (Nasdaq) and co-founder of AI Brasil.

