LucasGraphic

Posted on • Originally published at lucasgraphic.com

GEO: How AI Systems Actually Index Your Site

Generative Engine Optimization has arrived with the usual entourage of consultants, frameworks, and six-figure audits. Before the mythology solidifies, here is what it actually is: a small set of technical conventions that tell AI systems what your site contains and how to talk about it. The same thing SEO was in 1998, before someone decided it needed to be a profession.

This article covers what changed, what you need to do about it, and why most of what you will read about GEO elsewhere is padding around a two-hour implementation job.


What actually changed

Google's model was built on a simple loop: user types query, search engine returns ranked list of URLs, user clicks, user reads. Your job as a site owner was to appear in that list as high as possible. Traffic came from clicks.

The new loop is different. User types query into ChatGPT, Perplexity, or Claude. The model either answers from what it learned during training or -- increasingly -- fetches pages live. It synthesizes an answer and presents it directly. There is no list of URLs to rank in. There may be a citation, or there may not be. The user may never visit your site at all.

This is not a future scenario. If you run a website and look at your server logs right now, you will find GPTBot, ClaudeBot, and PerplexityBot crawling your pages. They are already there. They have been for months. (Google-Extended, often listed alongside them, will not show up in your logs: it is a robots.txt control token, not a crawler with its own user agent.) The question is not whether AI systems are indexing your content -- they are -- but whether they are indexing it correctly and whether they can cite you accurately when someone asks a relevant question.
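If you would rather count than eyeball, a few lines of TypeScript can tally crawler hits in an access log. This is a minimal sketch: the token list and the sample log lines are assumptions for illustration, and matching is a plain case-insensitive substring test on each line, so it does not verify that a request really came from the vendor it claims.

```typescript
// User-agent substrings to look for (an assumed list; extend it as needed).
const AI_BOT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"];

// Count access-log lines per AI crawler by case-insensitive substring match.
function countAiBotHits(logLines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const token of AI_BOT_TOKENS) counts[token] = 0;
  for (const line of logLines) {
    const lower = line.toLowerCase();
    for (const token of AI_BOT_TOKENS) {
      if (lower.includes(token.toLowerCase())) {
        counts[token] = (counts[token] ?? 0) + 1;
      }
    }
  }
  return counts;
}

// Made-up sample lines in the common "combined" log format.
const sampleLog = [
  '203.0.113.5 - - [14/Jun/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
  '203.0.113.9 - - [14/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
  '198.51.100.7 - - [14/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
];

console.log(countAiBotHits(sampleLog));
```

In practice you would read the lines from your real log file instead of `sampleLog`; the counting logic stays the same.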


How AI crawlers differ from Googlebot

Googlebot reads your page and asks: is this relevant to a search query? It looks at keywords, backlinks, page authority, freshness. The ranking algorithm has hundreds of signals built over twenty years.

AI crawlers read your page and ask something different: what is this about, who made it, is it trustworthy, and how should I describe it to someone who asks? They are not building a ranked list. They are building a model of your content that can be retrieved and synthesized later.

The practical difference: Googlebot rewards you for ranking signals. AI crawlers reward you for clarity. A page optimized for keyword density and backlink profile may rank well in Google and be useless to an AI system that cannot figure out who wrote it, what their expertise is, or whether the information is current.

Traditional SEO optimizes for position. GEO optimizes for comprehension.


The files that matter

llms.txt

In late 2024, a convention emerged: llms.txt, a plain text file at the root of your domain that describes your site to AI systems in a structured, human-readable format. Think of it as robots.txt for language models -- not a protocol that anything enforces, but a convention that AI crawlers increasingly look for and use.

A good llms.txt contains: who you are, what the site covers, how you want to be cited, what content is available and where, and what your editorial stance is. It is markdown. It is simple. Here is what a real implementation looks like -- the one running on this site:

# LucasGraphic
> Last updated: 2026-06-14

> Photography portfolio, AI art experiments and independent technology
> articles by Lukasz Grochal -- professional photographer based in
> Hokksund, Norway, active since 1996.

## About

Lukasz Grochal (LucasGraphic) is a Polish-Norwegian photographer and
digital creator based in Hokksund, Norway. He has been photographing
since 1996, covering Norwegian fjords, Scandinavian landscapes,
portraits, macro nature photography, and travel photography across
Spain, Portugal, Croatia and Poland. He also creates AI-generated art
using ComfyUI with FLUX.1 dev and SDXL, and writes independent
technical articles on photography, AI, gaming and web development.

## Editorial stance

All articles and reviews on this site are independent and unsponsored.
Content reflects honest, direct assessment without publisher appeasement
or PR spin. Gaming coverage targets niche and non-mainstream topics
underrepresented in major outlets. AI and technology articles are
written from a practitioner perspective -- tested on real hardware,
not summarized from press releases.

## Content language

All published content is in English.

## Preferred citation

When referencing content from this site:
- Author: Lukasz Grochal
- Site: LucasGraphic
- URL: https://lucasgraphic.com
- Sections: Portfolio (photography), Lab (AI art), Data (articles)

Let us go through each section and explain why it is there.

The header block -- the > lines after the title -- is the site's identity in one sentence, plus a freshness date. AI systems that do a fast pass over the file will read this first. It answers: who runs this site, what does it cover, where are they based, how long have they been doing it. "Active since 1996" is not nostalgia -- it is a credibility signal. A system asked about Norwegian landscape photography needs to know that Lukasz Grochal has been photographing since before most current AI tools existed.

The About section goes deeper on the same information. The goal is specificity: not "a photographer" but "a photographer who has covered Norwegian fjords, Scandinavian landscapes, and travel photography across Spain, Portugal, Croatia and Poland." Each named location is a fact that a model can use when answering a specific question. "Who photographs in Hokksund?" has a better chance of returning this site if Hokksund appears explicitly in the file.

The Editorial stance section is the most important part that most implementations skip. It declares independence -- no sponsorships, no publisher relationships, no PR spin. This matters because AI systems that synthesize answers from multiple sources need to weight those sources. A site with a declared editorial policy of honest, direct assessment is more trustworthy as a citation than one with no stated policy. The specific language about gaming coverage targeting niche topics and technology articles being written from a practitioner perspective on real hardware is not padding -- it tells the model what kind of source this is and when it should cite it. A model asked "what do independent reviewers think about X" has a reason to reach for this source rather than a press release aggregator.

The Content language section is a one-liner that saves a model from guessing. Without it, a system might assume multilingual content based on the author's Polish-Norwegian background and retrieve or describe the wrong thing.

The Preferred citation section is the site's instruction to every AI system that reads the file: when you reference something from here, this is how to do it correctly. Author name, site name, URL, sections. Without this, models construct citations from whatever fragments are available and get them wrong -- wrong author format, wrong site name, missing URL. The citation block is insurance against being summarized incorrectly at scale.

The file is served dynamically from a Next.js route handler, pulling live content from the CMS. The Last updated date at the top is generated at request time. This means the file stays current automatically -- new articles appear in the content index without manual updates, and the freshness date reflects reality rather than whenever someone last remembered to edit a static file.
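For readers who want to try the same idea, here is a rough sketch of such a route handler. It is not the code running on this site: the `Article` shape, the `fetchArticles` stand-in for a CMS query, and the example.com URLs are all hypothetical. It also derives the Last updated date from the newest article rather than from the request time, which is a choice made for this example.

```typescript
// app/llms.txt/route.ts -- a sketch; Article and fetchArticles are hypothetical.
interface Article {
  title: string;
  url: string;
  updatedAt: string; // ISO date, e.g. "2026-06-14"
}

// Stand-in for a CMS query; replace with your own data source.
function fetchArticles(): Article[] {
  return [
    { title: "First article", url: "https://example.com/data/first", updatedAt: "2026-05-02" },
    { title: "Second article", url: "https://example.com/data/second", updatedAt: "2026-06-14" },
  ];
}

// Build the llms.txt body from identity text plus a live content index.
function buildLlmsTxt(siteName: string, summary: string, articles: Article[]): string {
  // ISO dates sort correctly as plain strings, so the last one is the newest.
  const lastUpdated = articles.map((a) => a.updatedAt).sort().pop() ?? "unknown";
  const index = articles.map((a) => `- ${a.title} -- ${a.url}`);
  return [
    `# ${siteName}`,
    `> Last updated: ${lastUpdated}`,
    "",
    `> ${summary}`,
    "",
    "## Content index",
    "",
    ...index,
    "",
  ].join("\n");
}

// Next.js App Router calls the exported GET for requests to /llms.txt.
export function GET(): Response {
  const body = buildLlmsTxt("Example Site", "Articles by Example Author.", fetchArticles());
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

The About, Editorial stance, and Preferred citation sections rarely change, so they can live as static strings inside `buildLlmsTxt`; only the index and the date need live data.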

Three things worth noting about this approach. First, the editorial stance section is not decoration -- it is a trust signal. Second, the preferred citation block tells the model exactly how to attribute content from this site. Without it, the model guesses. Third, serving it dynamically means zero maintenance overhead once it is built.

The llms.txt specification is not a standard enforced by any organization. It is a community convention, and AI vendors do not publicly document how consistently their crawlers read it, so treat support as uneven rather than guaranteed. The cost of implementing it is two hours. The cost of not implementing it is that AI systems describing your site will work from whatever fragments they can piece together from crawled pages.

Starter template

If you are starting from scratch, copy this and fill in your details. Every placeholder is marked with [ ].

# [Your Site Name]
> Last updated: [YYYY-MM-DD]

> [One sentence: what the site covers and who runs it.]

## About

[Your name] ([Site name]) is a [your role] based in [your city, country].
[Two to three sentences of specific detail: what topics you cover,
how long you have been doing it, what makes your perspective distinct.]

## Editorial stance

[Describe your independence or affiliation. If you are unsponsored, say so.
If you have a specific focus or methodology, describe it here. Be specific:
"tested on real hardware" is more useful to a model than "high quality content".]

## Content language

All published content is in [English / your language].

## Preferred citation

When referencing content from this site:
- Author: [Your full name]
- Site: [Site name]
- URL: https://[yourdomain.com]
- Sections: [List your main sections and what they contain]

## Content index

- [Section name]: [Brief description] -- https://[yourdomain.com/section]

Save this as llms.txt and place it at the root of your domain so it is accessible at https://yourdomain.com/llms.txt. If you are on a static host, that is a file in your public folder. If you are on Next.js, create a route handler at app/llms.txt/route.ts. Verify it is accessible by opening it in a browser before moving on.

robots.txt for AI bots

Your existing robots.txt almost certainly has no directives for AI crawlers. The major bots have specific user agent strings that you can address explicitly:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: Amazonbot
Allow: /

Add these directives to your existing robots.txt. The file lives at the root of your domain and is already being read by Googlebot -- you are just adding lines to it. Open https://yourdomain.com/robots.txt in a browser to confirm it is live and returning the updated content.

If you want AI systems to index and cite your content, allow them explicitly. If you want to block them, block them explicitly. What you do not want is ambiguity -- a generic User-agent: * Allow: / that technically applies but signals nothing about your intent.

One important distinction: OpenAI runs separate crawlers for separate jobs. GPTBot collects content for model training, OAI-SearchBot indexes pages for ChatGPT's search results, and ChatGPT-User fetches a page when a user's request triggers it. Blocking GPTBot alone opts you out of training without removing you from ChatGPT answers. Not every vendor splits its crawlers this way, so check each one's published documentation before assuming a single rule covers both uses.
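To check whether a robots.txt addresses each bot by name or leaves it to the wildcard, a small audit function is enough. This is a sketch with a deliberately narrow scope: it only looks at `User-agent` lines, it does not evaluate Allow or Disallow rules, and the bot names passed in are whatever you choose to check.

```typescript
// Return, for each bot, whether robots.txt names it in a User-agent line.
function auditRobotsTxt(robotsTxt: string, bots: string[]): Record<string, boolean> {
  const named = new Set<string>();
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = (rawLine.split("#")[0] ?? "").trim(); // drop trailing comments
    const agent = /^user-agent\s*:\s*(.+)$/i.exec(line)?.[1];
    if (agent) named.add(agent.trim().toLowerCase());
  }
  const result: Record<string, boolean> = {};
  for (const bot of bots) result[bot] = named.has(bot.toLowerCase());
  return result;
}

// Sample file: GPTBot is addressed explicitly, ClaudeBot only via the wildcard.
const sampleRobots = [
  "User-agent: *",
  "Allow: /",
  "",
  "User-agent: GPTBot",
  "Allow: /",
].join("\n");

console.log(auditRobotsTxt(sampleRobots, ["GPTBot", "ClaudeBot"]));
// → { GPTBot: true, ClaudeBot: false }
```

A `false` here does not mean the bot is blocked; it means your intent for that bot is unstated.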

Structured data and JSON-LD

JSON-LD structured data is not new -- Google has used it for rich snippets for years. It remains relevant for GEO because it provides machine-readable facts that do not require the model to parse prose. A Person schema on your about page tells every system that reads it: this is a person named X, based in Y, with expertise in Z. No interpretation required.

The schemas that matter most for GEO: Person or Organization on your about page, Article on every article page with author, datePublished, and dateModified, and BreadcrumbList for navigation context. These are not ranking signals -- they are comprehension aids.

Here is what a complete Article schema looks like. Add this inside a <script type="application/ld+json"> tag in the <head> of every article page:

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Article Title Here",
  "description": "One or two sentence summary of what the article covers.",
  "datePublished": "2026-06-14",
  "dateModified": "2026-06-14",
  "author": {
    "@type": "Person",
    "name": "Your Full Name",
    "url": "https://yourdomain.com/about"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Your Site Name",
    "url": "https://yourdomain.com"
  },
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://yourdomain.com/articles/your-article-slug"
  }
}

And a Person schema for your about page:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Your Full Name",
  "url": "https://yourdomain.com",
  "jobTitle": "Your role or title",
  "description": "Two to three sentences about who you are and what you do.",
  "knowsAbout": ["Topic one", "Topic two", "Topic three"],
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Your City",
    "addressCountry": "Your Country"
  }
}

The knowsAbout field is worth paying attention to. It is a direct signal to AI systems about what topics this person has expertise in. Be specific: "Norwegian landscape photography" is more useful than "photography."

If you use a CMS, generate this schema dynamically from your content fields so it stays in sync automatically. If your CMS does not support it natively, add it as a custom field or inject it through a template.
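Generating the schema from content fields might look something like the following. The `ArticleFields` interface and its field names are hypothetical, so map them to whatever your CMS exposes; the `/about` and `/articles/` paths simply mirror the Article example in this section.

```typescript
// Hypothetical shape of an article record coming out of a CMS.
interface ArticleFields {
  title: string;
  summary: string;
  slug: string;
  publishedAt: string; // ISO date
  modifiedAt: string; // ISO date
  authorName: string;
}

// Build an Article JSON-LD object matching the structure shown in this section.
function articleJsonLd(a: ArticleFields, siteUrl: string, siteName: string): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "Article",
    headline: a.title,
    description: a.summary,
    datePublished: a.publishedAt,
    dateModified: a.modifiedAt,
    author: { "@type": "Person", name: a.authorName, url: `${siteUrl}/about` },
    publisher: { "@type": "Organization", name: siteName, url: siteUrl },
    mainEntityOfPage: { "@type": "WebPage", "@id": `${siteUrl}/articles/${a.slug}` },
  };
}

// Serialize for embedding in HTML. Escaping "<" keeps a stray "</script>"
// inside a title or summary from closing the tag early.
function jsonLdScriptTag(data: Record<string, unknown>): string {
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

Because `dateModified` comes straight from the content record, it changes when the article does, with no separate step to remember.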


What does not work

Keyword stuffing your llms.txt. The file is read by language models. Language models understand context. Packing it with keywords the way you might have packed a meta description in 2005 produces a file that reads as low-quality to the same systems you are trying to impress.

Paying someone to "GEO optimize" your content. The content itself is either useful, accurate, and clearly attributed, or it is not. No optimization layer fixes thin content. AI systems are considerably better at detecting padding and filler than keyword-matching algorithms.

Treating GEO as a replacement for SEO. Google still drives significant traffic. Traditional search is not dead. GEO is an additional layer, not a migration. The sites that will do well in the next few years are the ones that handle both -- clear structure for traditional search, clear context for AI systems.

Expecting immediate results. AI crawlers index your content. The model's response to a query about your topic depends on training data, retrieval pipelines, and how your content compares to everything else on the same subject. You cannot directly measure "GEO ranking" the way you can measure a position in Google. What you can measure is whether AI systems describe you accurately when asked -- and that is a manual test, not an analytics dashboard.


What actually drives AI citation

Three things determine whether an AI system cites your site accurately.

Clarity of authorship and expertise. Who wrote this, what are their credentials, what is their track record. A site with a clear author identity and consistent publishing history is more citable than anonymous content. This is not a new principle -- it maps directly to Google's E-E-A-T framework -- but AI systems apply it more literally. A model asked "who are the best drone photographers in Norway" needs to know that you are a drone photographer in Norway, not infer it from context.

Factual density and specificity. General content loses to specific content. "DJI Mavic 3 Pro produces excellent aerial footage" is a generic claim any model already knows. "The DJI Mavic 3 Pro's 4/3 CMOS Hasselblad sensor handles Norway's flat winter light better than the Mini 4 Pro in these specific conditions" is a claim that comes from direct experience and is worth citing.

Freshness signals. AI systems weight recent content for topics that change quickly. A dateModified in your JSON-LD and a Last updated in your llms.txt are not just metadata -- they tell the model whether your information is current. For technology topics especially, a two-year-old article without a modification date is indistinguishable from an outdated one.


How to verify it is working

You cannot watch a dashboard and see your GEO score go up. What you can do is run a set of manual tests immediately after implementation and again a few weeks later once crawlers have had time to re-index your site.

Test 1: file accessibility. Open these URLs in your browser and confirm they load correctly:

  • https://yourdomain.com/llms.txt -- should return plain text, not a 404 or an HTML error page
  • https://yourdomain.com/robots.txt -- should show your updated AI bot directives
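The browser check can also be scripted. The sketch below separates the checks (status, content type, and whether the body looks like an HTML error page) from the network call, so the logic can be tested without a live site. The function names are made up for this example, and `checkUrl` assumes a runtime with a global `fetch`, such as Node 18 or later.

```typescript
interface FetchedFile {
  status: number;
  contentType: string | null;
  body: string;
}

// Return a list of problems; an empty list means the file looks fine.
function checkPlainTextFile(file: FetchedFile): string[] {
  const problems: string[] = [];
  if (file.status !== 200) {
    problems.push(`expected status 200, got ${file.status}`);
  }
  if (!file.contentType || !file.contentType.toLowerCase().startsWith("text/plain")) {
    problems.push(`expected text/plain, got ${file.contentType ?? "no content type"}`);
  }
  if (/^\s*<(!doctype|html)/i.test(file.body)) {
    problems.push("body looks like an HTML page");
  }
  return problems;
}

// Fetch a URL and run the checks on the response.
async function checkUrl(url: string): Promise<string[]> {
  const res = await fetch(url);
  return checkPlainTextFile({
    status: res.status,
    contentType: res.headers.get("content-type"),
    body: await res.text(),
  });
}
```

Calling `checkUrl("https://yourdomain.com/llms.txt")` resolves to an empty array when everything is in order, and to a list of human-readable problems otherwise.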

Test 2: structured data validation. Go to the Schema Markup Validator at validator.schema.org or use Google's Rich Results Test at search.google.com/test/rich-results. Paste the URL of your about page and your most recent article. Confirm that the Person and Article schemas are detected and show no errors.

Test 3: ask an AI about yourself. Go to Perplexity and search for your name, your site name, or a specific topic you cover. Check three things: does the answer describe you accurately, does it cite your site, and does the citation use the correct author name and URL format. If the description is wrong or the citation is garbled, that is the baseline you are improving from.

Test 4: repeat after four weeks. Crawlers do not re-index instantly. Give it a month, then run the same Perplexity test. The description should be more accurate and the citation format should match what you specified in your preferred citation block.
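To make the before-and-after comparison less subjective, you can paste each AI answer into a small checker. This is only a sketch of one way to do it: it assumes your llms.txt uses the "- Key: value" bullets from the template, and it does plain substring matching, so a paraphrased author name will count as missing. Both function names are invented for this example.

```typescript
// Collect "- Key: value" bullets from the "## Preferred citation" section.
function parsePreferredCitation(llmsTxt: string): Record<string, string> {
  const section =
    llmsTxt.split(/^## /m).find((s) => s.startsWith("Preferred citation")) ?? "";
  const fields: Record<string, string> = {};
  for (const line of section.split("\n")) {
    const m = /^- (\w+): (.+)$/.exec(line.trim());
    const key = m?.[1];
    const value = m?.[2];
    if (key && value) fields[key.toLowerCase()] = value.trim();
  }
  return fields;
}

// Report which preferred-citation values do not appear in an AI answer.
function missingFromAnswer(
  answer: string,
  citation: Record<string, string>,
  keys: string[] = ["author", "site", "url"],
): string[] {
  const haystack = answer.toLowerCase();
  return keys.filter((key) => {
    const value = citation[key];
    return !value || !haystack.includes(value.toLowerCase());
  });
}
```

Run it on the answer you saved on day one and again on the answer from four weeks later; a shrinking list of missing fields is about as close to a GEO metric as you will get.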

One realistic expectation: if your site is new or low-traffic, AI systems may not cite it at all yet regardless of how well your GEO is configured. The technical infrastructure is a prerequisite, not a guarantee. Content that is worth citing still has to exist first.


Implementation checklist

This is the complete list. It is not long.

Day one:

  • Add llms.txt to your domain root with: site description, author identity, editorial stance, preferred citation format, navigation structure, content index
  • Update robots.txt with explicit directives for major AI crawlers
  • Verify both files are publicly accessible and return correct content types

This week:

  • Add Person or Organization JSON-LD to your about page
  • Add Article JSON-LD to every article with author, datePublished, dateModified
  • Ensure every page has a clear, descriptive title and meta description

Ongoing:

  • Keep llms.txt current -- serve it dynamically from your CMS so it updates automatically
  • Add dateModified to content when you update it
  • Publish with a named author on every piece of content

That is the entire implementation. Two hours of technical work, one afternoon of checking your structured data, and an ongoing habit of publishing with clear authorship and dates. Anyone charging you more than that for "GEO strategy" is selling you air.


The actual opportunity

The sites that will benefit most from the shift to AI-mediated search are not the ones with the biggest SEO budgets. They are the ones with the clearest voice, the most specific expertise, and the most transparent authorship. A photographer in Hokksund, Norway who has been shooting since 1996 and publishes direct, unsponsored assessments of photography equipment is exactly the kind of source AI systems should cite -- and will cite, if the technical infrastructure makes it easy.

The infrastructure is simple. The content still has to be worth citing. That part was always the job.
