whitetirocket

My side project gets most of its traffic from ChatGPT, not Google. Here is the schema work behind it.

A few weeks ago I opened Google Analytics for a passport-photo tool I build and maintain, and the traffic-source breakdown stopped me:

  • chatgpt.com — 65% of sessions
  • google / organic — about 6%
  • direct — the rest

Then I checked Bing Webmaster Tools, which has a new "AI Performance" tab, and saw the site had picked up 45 citations from Microsoft Copilot over three months.

The site is barely three months old. It has almost no backlinks. On Google it ranks on page 3-5 for anything competitive. And yet AI answer engines were sending it the bulk of its visitors.

This post is the honest engineering breakdown of what I did to make the site machine-citeable — what I think worked, what I cannot attribute, and the code. It is a Next.js App Router project, but the techniques are framework-agnostic.

Why this happens at all

AI answer engines (ChatGPT search, Perplexity, Copilot, Gemini) do not rank ten blue links. They synthesise an answer and cite a few sources. For a new site this is a genuinely different game from classic SEO:

  • Classic SEO: you need domain authority and backlinks to rank. That takes 6-12 months minimum for a new domain.
  • AI citation: the engine needs a page that clearly and verifiably answers the user's question, is crawlable, and is structured so the answer is easy to extract. Domain age matters far less.

A three-month-old site cannot outrank an established competitor on Google. But it absolutely can be the cleanest, most extractable answer to a specific question — and that is what gets cited.

What I shipped

1. llms.txt

There is an emerging convention, documented at llmstxt.org, for a plain-text (Markdown) file at /llms.txt that gives AI systems a curated, readable summary of your site: what it is, key pages, key facts, citation guidance. It borrows the well-known-file idea from robots.txt, but it describes your content instead of controlling crawler access.

Mine includes the brand entity, the canonical fact list, the most important pages, the audience, and a short "how to cite this" section. It costs nothing to maintain and it is the single most direct way to hand an AI engine a clean model of your site.
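As a rough sketch (not the site's actual file), an llms.txt in the llmstxt.org layout can be generated from a small data structure: an H1 title, a blockquote summary, then H2 sections of Markdown links. The `LlmsSite` shape, the site name and every URL below are hypothetical placeholders.

```typescript
// Sketch of an llms.txt builder following the llmstxt.org layout.
// All names and URLs here are hypothetical placeholders.

interface LlmsLink {
  title: string
  url: string
  note?: string
}

interface LlmsSection {
  heading: string
  links: LlmsLink[]
}

interface LlmsSite {
  name: string
  summary: string
  sections: LlmsSection[]
}

function buildLlmsTxt(site: LlmsSite): string {
  // H1 title, blank line, blockquote summary
  const lines: string[] = [`# ${site.name}`, '', `> ${site.summary}`]
  for (const section of site.sections) {
    lines.push('', `## ${section.heading}`, '')
    for (const link of section.links) {
      const note = link.note ? `: ${link.note}` : ''
      lines.push(`- [${link.title}](${link.url})${note}`)
    }
  }
  return lines.join('\n') + '\n'
}

const llmsTxt = buildLlmsTxt({
  name: 'Example Photo Tool',
  summary: 'A free browser-based tool for making ID photos.',
  sections: [
    {
      heading: 'Key pages',
      links: [
        { title: 'Facts', url: 'https://example.com/facts', note: 'Sourced Q&A pairs' },
        { title: 'Data API', url: 'https://example.com/api/specs' },
      ],
    },
  ],
})
```

Serving the resulting string at /llms.txt with a text content type is then a one-line route in most frameworks.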

2. Schema.org structured data — the parts that matter

Not all schema is equal for AI extraction. The types that earned their place:

  • FAQPage with Question / acceptedAnswer pairs. Every Q&A is a discrete, extractable answer. I put one on the homepage and a big one on a dedicated facts page.
  • Question as mainEntity on individual pages — the page declares the one question it primarily answers, with the answer attached. This is the cleanest possible signal: "this page answers exactly this."
  • Dataset (Schema.org) on the page documenting my public data API. Google Dataset Search indexes Dataset JSON-LD, and my working assumption is that AI engines also treat it as a citable source.
  • SpeakableSpecification — marks which parts of the page are suitable for text-to-speech, used by voice assistants.

// Per-page primary question: the cleanest "this page answers X" signal
function buildPageSchema(pageUrl: string, country: string, doc: string, answerText: string) {
  return {
    '@context': 'https://schema.org',
    '@type': 'WebPage',
    url: pageUrl,
    mainEntity: {
      '@type': 'Question',
      name: `What is the photo specification for ${country} ${doc}?`,
      acceptedAnswer: {
        '@type': 'Answer',
        // Factual answer assembled from the structured spec data
        text: answerText,
        url: pageUrl,
      },
    },
  }
}
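The FAQPage type from the list above can be built the same way. This is a minimal sketch, assuming the Q&A pairs already exist as plain strings; the `QaPair` shape, both function names and the sample pair are hypothetical, not the site's code.

```typescript
// Sketch: build FAQPage JSON-LD from plain question/answer pairs.
// QaPair and the sample data are hypothetical.

interface QaPair {
  question: string
  answer: string
}

function buildFaqPage(pairs: QaPair[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: pairs.map((pair) => ({
      '@type': 'Question',
      name: pair.question,
      acceptedAnswer: { '@type': 'Answer', text: pair.answer },
    })),
  }
}

// Serialise for a <script type="application/ld+json"> tag. Escaping "<"
// keeps answer text from closing the script element early.
function toJsonLdScript(schema: object): string {
  const json = JSON.stringify(schema).replace(/</g, '\\u003c')
  return `<script type="application/ld+json">${json}</script>`
}

const faqSchema = buildFaqPage([
  { question: 'Placeholder question?', answer: 'Placeholder answer with </script> inside.' },
])
const faqScript = toJsonLdScript(faqSchema)
```

The escaping step matters once answers come from user-editable or scraped text: a literal `</script>` inside an answer would otherwise end the JSON-LD block in the middle of the payload.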

3. A canonical facts page

I built a /facts page: ~45 question/answer pairs, each grounded in an official government source, each rendered as real HTML with FAQPage JSON-LD and inline microdata (itemScope / itemProp). The first sentence of every answer is the direct answer, with no preamble. My working theory is that AI engines tend to quote the first authoritative sentence they find, so give it to them cleanly.
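For the microdata half, here is a sketch of how one Q&A pair might be rendered to HTML. It assumes a surrounding container that declares `itemscope itemtype="https://schema.org/FAQPage"`; the function names and sample text are hypothetical. In JSX the attributes are spelled itemScope / itemProp / itemType, while the HTML that reaches the crawler uses the lowercase forms shown here.

```typescript
// Sketch: render one Q&A pair as HTML carrying Schema.org microdata.
// Function names and sample text are hypothetical.

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

function renderQaMicrodata(question: string, answer: string): string {
  return [
    '<div itemscope itemprop="mainEntity" itemtype="https://schema.org/Question">',
    `  <h3 itemprop="name">${escapeHtml(question)}</h3>`,
    '  <div itemscope itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">',
    `    <p itemprop="text">${escapeHtml(answer)}</p>`,
    '  </div>',
    '</div>',
  ].join('\n')
}

const qaHtml = renderQaMicrodata(
  'Placeholder question?',
  'Direct answer first. Detail & source after.',
)
```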

4. Open data with a real endpoint

The dataset behind the tool is published two ways: an MIT-licensed GitHub repo, and a live JSON-LD API endpoint that returns a Schema.org Dataset. When an AI engine wants to verify a claim about coverage, there is a machine-readable source to point at. My bet is that being verifiable is itself a citation signal.
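A Dataset response from such an endpoint could look roughly like this. It is a sketch under assumptions: the `DatasetInfo` shape, the name, the description and the example.com URLs are placeholders, and only the MIT license comes from the post.

```typescript
// Sketch: a Schema.org Dataset object an API route could return as JSON-LD.
// DatasetInfo and all example.com values are hypothetical.

interface DatasetInfo {
  name: string
  description: string
  url: string
  license: string
  downloadUrl: string
}

function buildDataset(info: DatasetInfo) {
  return {
    '@context': 'https://schema.org',
    '@type': 'Dataset',
    name: info.name,
    description: info.description,
    url: info.url,
    license: info.license,
    // DataDownload tells consumers where the raw file lives and its format
    distribution: [
      {
        '@type': 'DataDownload',
        encodingFormat: 'application/json',
        contentUrl: info.downloadUrl,
      },
    ],
  }
}

const datasetSchema = buildDataset({
  name: 'Example photo specification dataset',
  description: 'Placeholder description of what the dataset covers.',
  url: 'https://example.com/data',
  license: 'https://opensource.org/licenses/MIT',
  downloadUrl: 'https://example.com/api/specs.json',
})
```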

5. Server-side rendering — non-negotiable

If your content only exists after client-side JavaScript runs, many crawlers will not see it. Everything that matters — the schema blocks, the FAQ text, the tables — is in the server-rendered HTML. Quick check:

curl -s https://yoursite.com/ | grep -c "the answer text you expect"

If that prints 0, the text is missing from the server-rendered HTML, and any crawler that does not execute JavaScript will not see it.
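The same check can run in CI against several phrases at once. A minimal sketch, assuming you supply your own `fetchText` function (for example a thin wrapper around `fetch`); both function names are hypothetical.

```typescript
// Sketch: a scriptable version of the curl | grep -c check.

// Count lines containing the needle, like `grep -c` with a fixed string.
function countMatchingLines(html: string, needle: string): number {
  return html.split('\n').filter((line) => line.includes(needle)).length
}

// Caller-supplied fetcher, so the check does not depend on a specific runtime.
type FetchText = (url: string) => Promise<string>

// Returns the phrases that are absent from the raw server response.
async function findMissingText(
  url: string,
  needles: string[],
  fetchText: FetchText,
): Promise<string[]> {
  const html = await fetchText(url)
  return needles.filter((needle) => countMatchingLines(html, needle) === 0)
}
```

An empty array from `findMissingText` means every phrase was present before any client-side JavaScript ran.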

6. Tables, not prose, for structured facts

This one is underrated, especially for Perplexity. If you have comparison data — sizes, prices, specs — render it as a real HTML <table>. AI engines lift tables into answers almost verbatim. A paragraph describing the same data is much harder for them to use.
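As an illustration, a small helper that turns rows of facts into a real table element. The helper names are hypothetical and the cell values are deliberately placeholders, not real specifications.

```typescript
// Sketch: render structured facts as a plain HTML table.
// Cell values below are placeholders, not real specs.

function escapeCell(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

function renderTable(headers: string[], rows: string[][]): string {
  const head = headers.map((h) => `<th scope="col">${escapeCell(h)}</th>`).join('')
  const body = rows
    .map((row) => `<tr>${row.map((cell) => `<td>${escapeCell(cell)}</td>`).join('')}</tr>`)
    .join('\n')
  return `<table>\n<thead><tr>${head}</tr></thead>\n<tbody>\n${body}\n</tbody>\n</table>`
}

const specTable = renderTable(
  ['Document', 'Size', 'Background'],
  [
    ['Example document A', 'W x H (placeholder)', 'Placeholder'],
    ['Example document B', 'W x H (placeholder)', 'Placeholder'],
  ],
)
```

Header cells with `scope="col"` keep the column meaning attached to each value, which is the part a paragraph of prose loses.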

7. robots.txt — explicitly allow the AI crawlers

Allow GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended, CCBot, Bingbot and any others you care about. If you want AI citations, do not block the bots that produce them. I also added a tiny middleware that tags each AI-bot request, so I can see in logs which engines actually crawl and how often.

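// middleware.ts: detect and log AI crawler hits
import { NextResponse, type NextRequest } from 'next/server'
// Google-Extended is a robots.txt token only; it never appears as a request user agent
const AI_BOTS = [/GPTBot/i, /PerplexityBot/i, /ClaudeBot/i, /OAI-SearchBot/i]
export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') ?? ''
  if (AI_BOTS.some((re) => re.test(ua))) {
    console.log(JSON.stringify({ event: 'ai_bot_crawl', ua, path: request.nextUrl.pathname }))
  }
  return NextResponse.next()
}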
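The robots.txt half of this step could be generated like this. It is a sketch, not the site's file: the helper name and the sitemap URL are hypothetical, and the token list simply mirrors the crawlers named above.

```typescript
// Sketch: generate robots.txt text with one explicit allow group per crawler.
// The sitemap URL is a hypothetical placeholder.

const AI_CRAWLER_TOKENS = [
  'GPTBot',
  'OAI-SearchBot',
  'PerplexityBot',
  'ClaudeBot',
  'Google-Extended',
  'CCBot',
  'Bingbot',
]

function buildRobotsTxt(tokens: string[], sitemapUrl: string): string {
  // One group per token, then a catch-all group for every other crawler
  const groups = tokens.map((token) => `User-agent: ${token}\nAllow: /`)
  groups.push('User-agent: *\nAllow: /')
  return groups.join('\n\n') + `\n\nSitemap: ${sitemapUrl}\n`
}

const robotsTxt = buildRobotsTxt(AI_CRAWLER_TOKENS, 'https://example.com/sitemap.xml')
```

Listing each crawler explicitly is redundant when the catch-all group already allows everything, but it makes the intent obvious to anyone who later tightens the default rule.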

8. IndexNow — push, do not wait

ChatGPT search and Copilot lean on the Bing index. IndexNow lets you push new and changed URLs to Bing the moment you deploy, instead of waiting for a crawl. One POST per deploy.
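A deploy hook for this might look roughly like the sketch below. It assumes the key file is hosted at the site root as `<key>.txt`, and it takes a caller-supplied `postJson` function so the sketch does not depend on a particular HTTP client; the host, key and paths are hypothetical placeholders.

```typescript
// Sketch: build and submit an IndexNow payload after a deploy.
// Host, key and paths are hypothetical placeholders.

interface IndexNowPayload {
  host: string
  key: string
  keyLocation: string
  urlList: string[]
}

function buildIndexNowPayload(host: string, key: string, paths: string[]): IndexNowPayload {
  const origin = `https://${host}`
  return {
    host,
    key,
    // Assumes the key file sits at the site root
    keyLocation: `${origin}/${key}.txt`,
    urlList: paths.map((path) => `${origin}${path}`),
  }
}

// Caller-supplied HTTP POST that resolves to the response status code.
type PostJson = (url: string, body: string, headers: Record<string, string>) => Promise<number>

async function submitToIndexNow(payload: IndexNowPayload, postJson: PostJson): Promise<boolean> {
  const status = await postJson('https://api.indexnow.org/indexnow', JSON.stringify(payload), {
    'Content-Type': 'application/json; charset=utf-8',
  })
  // 200 means accepted; 202 means accepted with key validation still pending
  return status === 200 || status === 202
}

const indexNowPayload = buildIndexNowPayload('example.com', 'hypothetical-key', ['/', '/facts'])
```

Only changed URLs need to go in `urlList`, so a deploy script can derive the paths from the build diff instead of resubmitting the whole sitemap.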

What I cannot honestly claim

I cannot prove causation for any single technique. I shipped these together over a few weeks; traffic rose over the same period. It is correlation. What I can say:

  • The AI-citation traffic is real and measurable (GA4 source = chatgpt.com, Bing AI Performance citation count).
  • The pages AI actually cites are the most structured ones: the homepage, a tool-feature page, and timely explainer posts about recent regulation changes.
  • Domain authority did not drive this — the domain has almost none.

Takeaways

If you have a small or new site and want AI engines to send you traffic:

  1. Add llms.txt. Lowest effort, highest signal density.
  2. Use FAQPage and per-page mainEntity Question schema. Make every page declare what it answers.
  3. Server-render everything that matters. Verify with curl.
  4. Render structured facts as HTML tables.
  5. Lead every answer with the direct answer in sentence one.
  6. Allow the AI crawlers in robots.txt; push updates with IndexNow.
  7. Publish verifiable primary data if you have any.

None of this requires domain authority or a backlink budget. It requires being the clearest, most machine-readable answer to a real question. For a new site, that is a far more winnable game than classic SEO.

I build IDPhotoSnap, a free browser-only passport photo tool. The country-spec dataset is open at github.com/whitetirocket/passport-photo-specs. Happy to answer questions about the schema setup in the comments.
