Marc Newstead
Your Website's Content Now Has Two Jobs: Feeding Crawlers and Training LLMs

If you've built a documentation site, marketing platform, or content-heavy application recently, you've probably optimised for Google's crawlers—structured data, semantic HTML, decent page speed, sensible meta tags. Job done, right?

Not anymore.

Google's AI Overviews (and similar LLM-powered search features from Bing, Perplexity, and others) are now extracting, synthesising, and presenting your content without sending users to your site. Your carefully crafted landing pages might be cited in an AI-generated summary, but the click? Gone.

This isn't theoretical. Organic click-through rates are dropping. The real battle for UK search visibility isn't just about ranking #1 anymore—it's about being the source LLMs choose to cite.

What Changed: From PageRank to Prompt Context

Traditional SEO was a game of signals: backlinks, domain authority, keyword density, Core Web Vitals. You knew the rules. You played by them.

Generative Engine Optimisation (GEO) is different. LLMs don't "crawl" in the same way. The models are trained on massive corpora, and the search features built on them then retrieve documents and synthesise an answer from those documents at inference time. The question isn't "Does this page rank?" but "Does this content get selected as context for the LLM's response?"

Think of it like this:

# Traditional SEO (illustrative pseudocode; `index` is a placeholder)
def get_search_results(query):
    pages = index.find(query)  # crawled, indexed, ranked
    # Highest-scoring pages come first
    return sorted(pages, key=lambda p: p.pagerank, reverse=True)

# GEO / LLM-powered search (`retrieval_model` and `llm` are placeholders)
def get_ai_overview(query):
    context = retrieval_model.fetch_relevant_docs(query)
    response = llm.generate(query, context=context)
    return response  # user never clicks through

Your content needs to be selected during retrieval, not just indexed. That's a different optimisation problem.
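To make the retrieval step a bit more concrete, here is a toy sketch of how a passage might get selected as context. It assumes a simple bag-of-words cosine similarity; production systems typically use learned embeddings and other signals, so treat `select_context` and its scoring as hypothetical stand-ins rather than how any particular engine works.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_context(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return up to k passages most similar to the query (hypothetical selection step)."""
    q = Counter(tokenize(query))
    scored = [(cosine_similarity(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Passages with no overlap at all are never selected.
    return [p for score, p in scored[:k] if score > 0]


passages = [
    "Welcome to our award-winning platform, trusted by teams everywhere.",
    "A rate limit caps how many API requests a client can make per minute.",
    "To handle a rate limit error, retry the request with exponential backoff.",
]
selected = select_context("what is an API rate limit", passages, k=2)
print(selected)
```

In this toy setup the marketing sentence shares no terms with the query, so it never reaches the model, while the passage that directly defines the term scores highest.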

What LLMs Actually Reward

Based on research into how generative engines select sources, here's what seems to matter:

1. Authoritative, Structured Answers

LLMs favour content that directly answers questions in a clear, hierarchical format. Think:

  • Concise definitions at the top
  • Bullet lists for steps or options
  • Tables for comparisons
  • Headings that map to user intent ("How to...", "What is...", "Best practices for...")

2. Semantic Richness

Keyword stuffing is dead. LLMs look for semantic coverage—related concepts, synonyms, contextual depth. If you're documenting an API, don't just list endpoints. Explain use cases, common errors, and edge cases.

3. Freshness and Specificity

Generic evergreen content is losing ground to timely, specific answers. If you're writing a guide, reference current versions, real-world examples, and actual data.

4. Citation-Friendly Formatting

LLMs are more likely to cite content that's easy to extract and attribute. This means:

  • Clear authorship and publication dates
  • Structured data (JSON-LD, Open Graph)
  • Blockquotes, code snippets, and other semantically marked-up elements

Practical Steps: Optimise for Both

The good news? You don't have to choose. Most GEO best practices also improve traditional SEO. Here's a starter playbook:

Audit Your Content for "Extract-ability"

Can someone (or an LLM) quickly pull a useful answer from your page? If your intro is 300 words of marketing fluff before the actual information, you're in trouble.
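One rough way to check this is to count how many words a reader has to get through before the first structural element. The sketch below assumes Markdown input, treats a leading heading as the page title, and borrows the 300-word figure above as a default budget. The function names and the regex are illustrative assumptions, not an established metric.

```python
import re

# Lines that start a heading, bullet, numbered item, or table row in Markdown.
STRUCTURAL_LINE = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+\.\s|\|)")


def intro_word_count(markdown: str) -> int:
    """Count the words that appear before the first structural element."""
    count = 0
    seen_first_line = False
    for line in markdown.splitlines():
        if not line.strip():
            continue  # blank lines carry no words
        if not seen_first_line:
            seen_first_line = True
            if line.lstrip().startswith("#"):
                continue  # treat a leading heading as the page title
        if STRUCTURAL_LINE.match(line):
            break
        count += len(line.split())
    return count


def needs_trimming(markdown: str, max_intro_words: int = 300) -> bool:
    """Flag pages whose intro exceeds the assumed word budget."""
    return intro_word_count(markdown) > max_intro_words


page = """# Webhooks guide

Our platform is loved by developers.

## What is a webhook?

- A webhook is an HTTP callback.
"""
print(intro_word_count(page), needs_trimming(page))
```

A check like this could run over a whole docs folder to produce a shortlist of pages worth a manual look.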

Use Structured Data Everywhere

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "Building Resilient APIs",
  "author": {"@type": "Person", "name": "Jane Dev"},
  "datePublished": "2024-01-15"
}
</script>

This helps both crawlers and LLMs understand your content's context.
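If you want to check that a rendered page actually ships this markup, a small script along these lines could do it. It uses only the Python standard library, assumes each JSON-LD block is a single JSON object, and checks for the three fields from the example above. That field list is my assumption for illustration, not a schema.org requirement.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: list[str] = []
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def missing_fields(html: str, required=("headline", "author", "datePublished")) -> list[str]:
    """Return the assumed-required fields absent from the first JSON-LD block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return list(required)
    return [field for field in required if field not in parser.blocks[0]]


html_page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "TechArticle", '
    '"headline": "Building Resilient APIs"}'
    "</script></head></html>"
)
print(missing_fields(html_page))
```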

Write for Humans, Optimise for Machines

Your content should read naturally, but also be scannable. Use:

  • Bold text for key terms
  • Short paragraphs (2-3 sentences max)
  • Code blocks for examples (not screenshots)
  • Tables and lists for structured info
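The paragraph-length guideline is easy to lint for. This is a minimal sketch that assumes paragraphs are separated by blank lines and that sentences end in `.`, `!` or `?`. That naive split will miscount abbreviations and code, so use it as a prompt for review rather than a hard rule.

```python
import re


def long_paragraphs(text: str, max_sentences: int = 3) -> list[str]:
    """Return paragraphs containing more than max_sentences sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for paragraph in paragraphs:
        # Naive split: break after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        if len(sentences) > max_sentences:
            flagged.append(paragraph)
    return flagged


draft = "One. Two. Three. Four.\n\nShort one. Done."
print(long_paragraphs(draft))
```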

Track Non-Click Visibility

If your content is cited in an AI Overview but doesn't generate clicks, that's still brand exposure. Tools are emerging to track "impression share" in AI-generated results, but even anecdotal monitoring (searching your own key topics and noting citations) is useful.
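Even the anecdotal approach benefits from a little structure. The sketch below assumes you note, by hand, which URLs an AI answer cited for each query you check, then computes the share of queries where your domain appeared. `Observation` and `citation_share` are hypothetical names, and `example.com` is a placeholder domain.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Observation:
    """One manual check: the query you ran and the URLs the answer cited."""
    query: str
    cited_urls: list[str]


def cites_domain(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def citation_share(observations: list[Observation], domain: str) -> float:
    """Fraction of observed queries whose answer cited the domain at least once."""
    if not observations:
        return 0.0
    hits = sum(
        1 for obs in observations
        if any(cites_domain(url, domain) for url in obs.cited_urls)
    )
    return hits / len(observations)


observations = [
    Observation("what is a webhook", ["https://docs.example.com/webhooks", "https://other.test/a"]),
    Observation("webhook retry policy", ["https://other.test/b"]),
    Observation("verify webhook signature", ["https://example.com/blog/signatures"]),
    Observation("webhook vs polling", []),
]
share = citation_share(observations, "example.com")
print(share)
```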

The Workflow Shift

If you're building content pipelines or CMS tooling, consider:

  • AI-assisted content audits: Use LLMs to identify gaps in semantic coverage or question-answer alignment
  • Automated structured data generation: Parse your markdown/HTML and inject schema.org markup programmatically
  • Prompt testing: Literally query LLMs (ChatGPT, Claude, Perplexity) with your target keywords and see if your content surfaces
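As a sketch of the automated structured data idea, the snippet below reads a simple front-matter block from a Markdown post and emits a JSON-LD script tag. It assumes flat `key: value` front matter with `title`, `author` and `date` keys (a real pipeline would more likely use a YAML parser), and the function names are made up for this example.

```python
import json


def parse_front_matter(markdown: str) -> dict[str, str]:
    """Read simple 'key: value' pairs from a leading '---' delimited block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def json_ld_script(meta: dict[str, str]) -> str:
    """Build a TechArticle JSON-LD script tag from the assumed metadata keys."""
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": meta["title"],
        "author": {"@type": "Person", "name": meta["author"]},
        "datePublished": meta["date"],
    }
    # Escape "</" so the payload cannot close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


post = """---
title: Building Resilient APIs
author: Jane Dev
date: 2024-01-15
---
Body text here.
"""
snippet = json_ld_script(parse_front_matter(post))
print(snippet)
```

Running this as a build step keeps the markup in sync with the post metadata instead of relying on someone to hand-edit each page.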

Agencies focused on AI automation and software development are building these workflows into content ops—treating GEO as a CI/CD problem, not just an editorial one.

The Takeaway

If you're maintaining documentation, a developer blog, or any content-heavy platform, your success metrics are shifting. Rankings matter less than retrieval. Clicks matter less than citation.

Optimise for both crawlers and LLMs. Structure your content like an API response: clear, hierarchical, and easy to parse. And remember—your content isn't just being read anymore. It's being used as training data.

Make sure it's worth learning from.
