The Data Nerd

Originally published at signals.gitdealflow.com

Made my site AI-citable in one day — the .well-known + JSON-LD + llms.txt playbook

Yesterday I ran a 5-pass AEO/SEO/GEO/AIO audit on my site, fixed 64 surfaces in one sitting, and watched the composite probe score climb from 70 to 94. This is the dev-tactical playbook of what actually moved the needle, with the exact files and probes.

The premise: traditional SEO (links, meta tags, sitemaps) is necessary but no longer sufficient. Google's AI Overviews, ChatGPT, Perplexity, and Claude pull from a different surface area — /.well-known/, llms.txt, agent-card.json, openapi.json, and structured schema.org JSON-LD with Speakable + QAPage + Service types.

If your tool isn't shipping these, you're invisible to half the LLMs that ought to be citing you.

The 5-pass audit loop

I ran a single-day chain of:

  1. Probe — a checklist of "if I were an LLM scraping for an answer to X, what file would I open?" — across 7 categories: discovery, schema, content, well-known, structured-Q&A, citations, and identity.
  2. Score each category 0–100.
  3. Diff the lowest-scoring against the spec.
  4. Ship the fixes (mostly small JSON files + JSON-LD blocks + 308→200 redirect cleanups).
  5. Re-probe.

Each pass took ~90 minutes. The composite went 70 → 81 → 89 → 92 → 94.
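
The probe step needs no tooling; it was essentially a shell loop like the sketch below. The endpoint list and output format here are illustrative, not my exact script:

```bash
#!/usr/bin/env bash
# Probe sketch: fetch each machine-readable surface as an AI crawler
# and report HTTP status + content type. A healthy surface is a direct
# 200 with a non-HTML content type (no 308 hops, no 404s).
DOMAIN="https://yourdomain.com"
UA="GPTBot/1.0"

endpoints=(
  /robots.txt
  /sitemap.xml
  /.well-known/llms.txt
  /.well-known/agent-card.json
  /.well-known/openapi.json
)

for path in "${endpoints[@]}"; do
  read -r code ctype < <(curl -s -A "$UA" -o /dev/null \
    -w '%{http_code} %{content_type}\n' "$DOMAIN$path")
  printf '%-35s %s %s\n' "$path" "$code" "$ctype"
done
```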

What actually moved the score

Pass 1 (70 → 81): the obvious gaps.

  • /sitemap.xml listed 1,060 URLs, but 8% of them 404'd. Fix: regenerate from the build manifest; ban orphans.
  • /robots.txt allowed everything; LLMs got noise. Fix: explicit User-agent: GPTBot / ClaudeBot / PerplexityBot allow blocks for the high-signal paths only (snippet after this list).
  • Speakable JSON-LD was missing from every Q&A page. Fix: add cssSelector: ['h1','.tldr'] to every answer page (example after this list).
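
For reference, a minimal sketch of both fixes. First the robots.txt pattern (the paths are illustrative): allow the high-signal routes per bot, disallow the rest.

```txt
User-agent: GPTBot
Allow: /answers/
Allow: /.well-known/
Disallow: /

User-agent: ClaudeBot
Allow: /answers/
Allow: /.well-known/
Disallow: /

User-agent: PerplexityBot
Allow: /answers/
Allow: /.well-known/
Disallow: /
```

And the Speakable block as JSON-LD, using the exact cssSelector pair from the fix:

```json
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": ["h1", ".tldr"]
  }
}
```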

Pass 2 (81 → 89): structured Q&A.

  • Built 3 new /answers/{slug} pages with QAPage + Question + acceptedAnswer JSON-LD, evidence-anchored to a public dataset (sketch after this list).
  • Added agent-card.json to /.well-known/ describing every machine-readable endpoint.
  • Expanded openapi.json from 4 documented paths to 21. LLMs read this and start citing your API examples in answers.
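
A minimal QAPage block for one of those /answers/{slug} pages might look like this. The question text, answer text, and URL are placeholders; only the type structure comes from the fix above:

```json
{
  "@context": "https://schema.org",
  "@type": "QAPage",
  "mainEntity": {
    "@type": "Question",
    "name": "What is a deal-flow signal?",
    "answerCount": 1,
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Placeholder: one-paragraph answer, anchored to the public dataset.",
      "url": "https://yourdomain.com/answers/example-slug"
    }
  }
}
```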

Pass 3 (89 → 92): the well-known explosion.

  • Shipped: /.well-known/openapi.json, /.well-known/agent-card.json, /.well-known/agents.json, /.well-known/llms.txt, /.well-known/ai-policy.json, /.well-known/ai.txt, /.well-known/ai.json, /.well-known/sitemap.xml, /.well-known/security-policy.json, /.well-known/did-configuration.json, /.well-known/humans.txt, /.well-known/freshness.json (a DataFeed schema for "what changed this week").
  • Pattern: every .well-known file should also have a root alias (/agent-card.json → 200, not 308). LLM crawlers don't follow redirects on machine-readable endpoints. One way to serve the alias is sketched below.
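
The post isn't prescriptive about how the aliases get served. One way, assuming an nginx front end (the file paths are hypothetical), is to serve the same file at both URLs so neither ever redirects:

```nginx
# Serve the same agent card at the root and under /.well-known/,
# both as direct 200s — no redirect, so no 308 for crawlers to drop.
location = /agent-card.json {
    alias /srv/site/.well-known/agent-card.json;
}
location = /.well-known/agent-card.json {
    alias /srv/site/.well-known/agent-card.json;
}
```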

Pass 4 (92 → 94): glossary + FAQ + methodology as APIs.

  • /api/v1/glossary (18 terms), /api/v1/faq (101 entries), /api/v1/methodology (HowTo schema, 6 steps). LLMs cite glossary endpoints when asked "what is X" — they treat your API as canonical for terms you coined. A sketch of the response shape follows.
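
The post doesn't show the payloads; a hypothetical /api/v1/glossary response shaped like this (all field names are mine) is the kind of JSON an LLM can lift verbatim into an answer:

```json
{
  "count": 18,
  "terms": [
    {
      "term": "deal-flow signal",
      "definition": "Placeholder: your one-sentence canonical definition.",
      "url": "https://yourdomain.com/glossary/deal-flow-signal"
    }
  ]
}
```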

The smoking-gun probe

The single highest-signal probe is:

```bash
curl -A "GPTBot/1.0" https://yourdomain.com/.well-known/llms.txt
```

If this returns a 200 with directive-rich content (not a 308 redirect, not HTML, not a 404), and your llms.txt lists every QAPage + every API + every dataset, you are now in a tiny minority of sites. Most still don't have one.
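
If you've never written one, llms.txt is plain markdown-flavored text: an H1, a one-line summary, then sections of links. A minimal sketch in the shape described above; the entries are illustrative, except the dataset URL, which is from the receipts below:

```txt
# GitDealFlow Signals
> Open GitHub-signal layer for early-stage VC.

## Answers
- [Example answer](https://yourdomain.com/answers/example-slug): one-line summary

## APIs
- [OpenAPI spec](https://yourdomain.com/.well-known/openapi.json): full machine-readable API surface
- [Glossary](https://yourdomain.com/api/v1/glossary): canonical term definitions

## Datasets
- [VC deal-flow signal dataset](https://huggingface.co/datasets/gitdealflow/vc-deal-flow-signal): the public dataset behind the answers
```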

Bonus probe — site:yourdomain.com in Google. If it returns 0 results despite all the schema, a noindex or robots rule is wrong somewhere. We caught this in pass 4 — /predicted/{week}/ was blocked by a stale robots.txt rule.

The cost

I'm a one-person side project. Total Claude Code time across all 5 passes: ~7.5 hours. Total new files: 22. Total edits: 64. Zero external dependencies, zero paid tools, zero outbound links.

For comparison: the equivalent agency engagement runs $15k–$30k for "AI search optimization" and ships maybe a third of this surface area.

The receipts

Everything is open. The site is signals.gitdealflow.com, the dataset is huggingface.co/datasets/gitdealflow/vc-deal-flow-signal, the methodology is signals.gitdealflow.com/research, the SSRN paper is at ssrn.com/abstract=6606558, and the MCP server that lets any LLM (Claude, Cursor, Cline, Goose) query the dataset live is at signals.gitdealflow.com/mcp — six tools, no auth, never paywalled.

If you run a SaaS with public data and want to audit your own surface, the probe checklist is in our /llms-full.txt. Steal it.


Building GitDealFlow — open-source GitHub-signal layer for early-stage VC. SSRN paper, free MCP server, dataset on Hugging Face.
