Marc Newstead

Posted on May 18

Your Website's Content Now Has Two Jobs: Feeding Crawlers and Training LLMs

#ai #seo #webdev #content

If you've built a documentation site, marketing platform, or content-heavy application recently, you've probably optimised for Google's crawlers—structured data, semantic HTML, decent page speed, sensible meta tags. Job done, right?

Not anymore.

Google's AI Overviews (and similar LLM-powered search features from Bing, Perplexity, and others) now extract, synthesise, and present your content, often without sending users to your site. Your carefully crafted landing page might be cited in an AI-generated summary, but the click? Often gone.

This isn't theoretical. Many site owners report falling organic click-through rates on queries that trigger AI-generated summaries. The real battle for UK search visibility isn't just about ranking #1 anymore; it's about being the source LLMs choose to cite.

What Changed: From PageRank to Prompt Context

Traditional SEO was a game of signals: backlinks, domain authority, keyword density, Core Web Vitals. You knew the rules. You played by them.

Generative Engine Optimisation (GEO) is different. LLMs don't "crawl" in the same way. The model itself is trained on massive corpora, and an LLM-powered search feature then retrieves documents at query time and passes them to the model as context for a synthesised answer. The question isn't "Does this page rank?" but "Does this content get selected as context for the LLM's response?"

Think of it like this:

# Traditional SEO (illustrative pseudocode: `index` stands in for a search index)
def get_search_results(query):
    pages = index.find(query)  # crawled, indexed, ranked
    # Highest-scoring pages first
    return sorted(pages, key=lambda p: p.pagerank, reverse=True)

# GEO / LLM-powered search (`retrieval_model` and `llm` are placeholders too)
def get_ai_overview(query):
    context = retrieval_model.fetch_relevant_docs(query)
    response = llm.generate(query, context=context)
    return response  # user may never click through

Your content needs to be selected during retrieval, not just indexed. That's a different optimisation problem.
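
To make the retrieval step concrete, here is a minimal, runnable sketch of passage selection. It assumes a toy bag-of-words cosine similarity as a stand-in for whatever a real generative engine uses (production systems typically rely on far more sophisticated ranking), and the function names and sample passages are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_context(query, passages, k=2):
    """Return up to k passages most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Passages with no term overlap are never selected as context.
    return [p for score, p in scored[:k] if score > 0]

# Hypothetical passages: one marketing intro, two direct answers.
passages = [
    "Welcome to our platform, the leading solution for modern teams.",
    "A rate limit caps how many API requests a client can send per minute.",
    "To handle a rate limit error, retry the request with exponential backoff.",
]
top = select_context("what is an API rate limit", passages, k=1)
print(top)
```

In this toy setup the marketing intro shares no terms with the query, so it is never selected, while the passage that states the definition directly scores highest.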

What LLMs Actually Reward

Based on research into how generative engines select sources, here's what seems to matter:

1. Authoritative, Structured Answers

LLMs favour content that directly answers questions in a clear, hierarchical format. Think:

- Concise definitions at the top
- Bullet lists for steps or options
- Tables for comparisons
- Headings that map to user intent ("How to...", "What is...", "Best practices for...")

2. Semantic Richness

Keyword stuffing is dead. LLMs look for semantic coverage—related concepts, synonyms, contextual depth. If you're documenting an API, don't just list endpoints. Explain use cases, common errors, and edge cases.

3. Freshness and Specificity

Generic evergreen content is losing ground to timely, specific answers. If you're writing a guide, reference current versions, real-world examples, and actual data.

4. Citation-Friendly Formatting

LLMs are more likely to cite content that's easy to extract and attribute. This means:

- Clear authorship and publication dates
- Structured data (JSON-LD, Open Graph)
- Blockquotes, code snippets, and other semantically marked-up elements
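
The attribution signals in this list can be checked automatically. The sketch below uses Python's standard `html.parser` to look for a JSON-LD script, an author meta tag, and a publication date. Which tags count as a "signal" is an assumption made for this example, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class CitationSignals(HTMLParser):
    """Collects a few attribution signals from an HTML page."""

    def __init__(self):
        super().__init__()
        self.has_jsonld = False
        self.has_author = False
        self.has_date = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self.has_jsonld = True
        if tag == "meta" and attrs.get("name") == "author":
            self.has_author = True
        # Open Graph article date, or an HTML <time datetime="..."> element.
        if tag == "meta" and attrs.get("property") == "article:published_time":
            self.has_date = True
        if tag == "time" and "datetime" in attrs:
            self.has_date = True

def citation_signals(html):
    """Return a dict saying which attribution signals the page exposes."""
    parser = CitationSignals()
    parser.feed(html)
    parser.close()
    return {
        "jsonld": parser.has_jsonld,
        "author": parser.has_author,
        "date": parser.has_date,
    }

sample_html = (
    '<head><meta name="author" content="Jane Dev">'
    '<script type="application/ld+json">{}</script></head>'
    "<body><p>Hello</p></body>"
)
signals = citation_signals(sample_html)
print(signals)
```

Here the sample page has an author and a JSON-LD block but no publication date, which is the kind of gap worth flagging before publishing.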

Practical Steps: Optimise for Both

The good news? You don't have to choose. Most GEO best practices also improve traditional SEO. Here's a starter playbook:

Audit Your Content for "Extract-ability"

Can someone (or an LLM) quickly pull a useful answer from your page? If your intro is 300 words of marketing fluff before the actual information, you're in trouble.
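
One rough way to put a number on this is to count how many words a reader has to get through before the page first mentions the thing it is supposed to explain. The following is a minimal sketch under that assumption. The function name, the naive sentence splitting, and the sample page are all illustrative, not an established metric.

```python
import re

def words_before_answer(page_text, key_term):
    """Count words that precede the first sentence mentioning key_term.

    Returns None if the term never appears.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    preceding = 0
    for sentence in sentences:
        if key_term.lower() in sentence.lower():
            return preceding
        preceding += len(sentence.split())
    return None

page = (
    "We are passionate about developer experience. "
    "Our team has shipped tools for a decade. "
    "A webhook is an HTTP callback sent when an event occurs."
)
buried = words_before_answer(page, "webhook")
print(buried)  # 14 words of preamble before the definition
```

A large number suggests the answer is buried. Moving the definition to the first sentence would bring it down to zero.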

Use Structured Data Everywhere

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Building Resilient APIs",
  "author": {"@type": "Person", "name": "Jane Dev"},
  "datePublished": "2024-01-15"
}
</script>

This helps both crawlers and LLMs understand your content's context.
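
Markup like this is easy to generate from metadata you already store. Here is a small sketch that builds the same `TechArticle` block from three fields. The function name and its parameters are assumptions for illustration, and a real CMS integration would likely carry more properties.

```python
import json
from datetime import date

def build_article_jsonld(headline, author_name, date_published):
    """Return a JSON-LD <script> tag describing a TechArticle."""
    # Fail early: raises ValueError if the string is not a valid ISO 8601 date.
    date.fromisoformat(date_published)
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,
    }
    # Escape "</" so the payload cannot close the <script> element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'

tag = build_article_jsonld("Building Resilient APIs", "Jane Dev", "2024-01-15")
print(tag)
```

Serialising with `json.dumps` rather than string templates means quotes and special characters in headlines are escaped correctly.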

Write for Humans, Optimise for Machines

Your content should read naturally, but also be scannable. Use:

- Bold text for key terms
- Short paragraphs (2-3 sentences max)
- Code blocks for examples (not screenshots)
- Tables and lists for structured info
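
The short-paragraph guideline is simple enough to lint for. This sketch flags paragraphs that run past a sentence limit. It assumes paragraphs are separated by blank lines and uses a naive punctuation-based sentence split, and the function name is made up for the example.

```python
import re

def long_paragraphs(text, max_sentences=3):
    """Return (paragraph_index, sentence_count) for paragraphs over the limit."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        if len(sentences) > max_sentences:
            flagged.append((index, len(sentences)))
    return flagged

draft = "One. Two.\n\nA. B. C. D.\n\nShort."
print(long_paragraphs(draft))  # [(1, 4)]
```

Abbreviations such as "e.g." will confuse the sentence split, so treat the result as a prompt for a human editor rather than a hard rule.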

Track Non-Click Visibility

If your content is cited in an AI Overview but doesn't generate clicks, that's still brand exposure. Tools are emerging to track "impression share" in AI-generated results, but even anecdotal monitoring (searching your own key topics and noting citations) is useful.
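
Even anecdotal monitoring becomes more useful once it is logged consistently. As a sketch, suppose each manual check is recorded as a `(topic, was_cited)` pair. That record format and the function name are assumptions for this example, and the sample log is invented.

```python
from collections import defaultdict

def citation_rate(observations):
    """Map each topic to the share of checks in which the site was cited."""
    totals = defaultdict(int)
    cited = defaultdict(int)
    for topic, was_cited in observations:
        totals[topic] += 1
        if was_cited:
            cited[topic] += 1
    return {topic: cited[topic] / totals[topic] for topic in totals}

# Hypothetical log of manual spot checks.
log = [
    ("api rate limits", True),
    ("api rate limits", False),
    ("webhooks", True),
    ("api rate limits", True),
]
rates = citation_rate(log)
print(rates)
```

Tracking the rate per topic over time shows where content changes coincide with more or fewer citations, even without click data.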

The Workflow Shift

If you're building content pipelines or CMS tooling, consider:

- AI-assisted content audits: Use LLMs to identify gaps in semantic coverage or question-answer alignment
- Automated structured data generation: Parse your markdown/HTML and inject schema.org markup programmatically
- Prompt testing: Literally query LLMs (ChatGPT, Claude, Perplexity) with your target keywords and see if your content surfaces
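
For prompt testing, the checking half is straightforward to script once you have an answer in hand. The sketch below takes answer text you have already collected (how you obtain it depends on the tool and is deliberately left out) and extracts the domains it links to. The function name, the regular expression, and the `example.com` answer are illustrative assumptions.

```python
import re
from urllib.parse import urlparse

# Match http(s) URLs up to whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")

def cited_domains(answer_text):
    """Return the set of hostnames linked in an AI-generated answer."""
    domains = set()
    for url in URL_PATTERN.findall(answer_text):
        # Drop trailing sentence punctuation picked up by the pattern.
        host = urlparse(url.rstrip(".,;")).hostname
        if host:
            domains.add(host[4:] if host.startswith("www.") else host)
    return domains

# Hypothetical answer text pasted from an AI search tool.
answer = (
    "Rate limits cap request volume "
    "(https://www.example.com/docs/rate-limits). "
    "See also https://other.example.org/guide."
)
found = cited_domains(answer)
print("example.com" in found)
```

Running this over saved answers for each target query gives a repeatable yes/no signal for whether your domain was cited.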

Agencies focused on AI automation and software development are building these workflows into content ops—treating GEO as a CI/CD problem, not just an editorial one.

The Takeaway

If you're maintaining documentation, a developer blog, or any content-heavy platform, your success metrics are shifting. Rankings matter less than retrieval. Clicks matter less than citation.

Optimise for both crawlers and LLMs. Structure your content like an API response: clear, hierarchical, and easy to parse. And remember: your content isn't just being read anymore. It's being used as training data and as retrieval context.

Make sure it's worth learning from.
