
Watson Foglift

5 AI Crawlers Launched in 2024–2025 That Most robots.txt Guides Still Miss

Most "AI crawler robots.txt guides" you can find today were written for the 2023 lineup: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot. They are all still correct. They are also all incomplete.

Between June 2024 and late 2025, five more user-agents quietly entered circulation that most of those guides do not mention. If you maintain a site's robots.txt (especially for a content site or a docs domain), three of the five will surprise you, and for two of them, Disallow does not do what you think it does.

This post is the reference table I wish I'd had six months ago.

The five that belong in your robots.txt

| Crawler | Company | Purpose | User-Agent | Launched |
| --- | --- | --- | --- | --- |
| Applebot-Extended | Apple | Apple Intelligence training opt-out signal | Applebot-Extended | Jun 2024 |
| Meta-ExternalAgent | Meta | Llama / Meta AI training | meta-externalagent | Jul 2024 |
| Meta-ExternalFetcher | Meta | Meta AI user-requested fetches | meta-externalfetcher | 2024 |
| DuckAssistBot | DuckDuckGo | DuckAssist cited answers (on-demand, non-training) | DuckAssistBot/1.2 | 2025 |
| CCBot | Common Crawl | Open dataset that feeds The Pile, RedPajama, C4 | CCBot | Ongoing (since 2008) |

CCBot is not new — it has crawled the web since 2008 — but it is back on this list because Common Crawl now runs it on dedicated IP ranges with reverse-DNS verification, and because the downstream story (The Pile, RedPajama, C4) is the part most guides skip.
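
That dedicated-range detail is also your verification hook. If a log entry claims to be CCBot, the standard forward-confirmed reverse DNS check tells you whether it really came from Common Crawl. A minimal shell sketch, assuming (per the above) that genuine CCBot IPs reverse-resolve to a commoncrawl.org hostname; the IP is a placeholder you would pull from your own logs:

ip="203.0.113.7"              # placeholder: substitute a suspect IP from your logs
rdns=$(dig +short -x "$ip")   # reverse DNS lookup; should end in commoncrawl.org.
echo "reverse DNS: $rdns"
# Forward-confirm: the hostname must resolve back to the same IP,
# otherwise a spoofer could publish a fake reverse record.
dig +short "$rdns" | grep -qx "$ip" && echo "verified" || echo "spoofed or not CCBot"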

Three of them behave in ways that break the "just add Disallow" habit

1. Applebot-Extended is an opt-out signal, not a crawler

Applebot-Extended does not fetch pages. It has no independent crawl footprint. The actual crawling is done by regular Applebot, the same bot that has indexed content for Siri and Spotlight for years.

What Applebot-Extended does is tell Apple whether it is allowed to use the content Applebot already fetched to train Apple Intelligence foundation models. It is a training-use opt-out, not a fetch-blocker.

This means:

  • Blocking Applebot-Extended leaves you fully indexable for Siri, Spotlight, and Apple search.
  • Blocking Applebot itself cuts you off from Apple search too.
  • If you want Siri discovery but not Apple Intelligence training, you want the extended block specifically, as in the snippet below.
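
That selective opt-out is two lines of robots.txt (it also appears in the full block list later in this post):

# Opt out of Apple Intelligence training only.
# No Applebot rule, so Siri/Spotlight indexing continues unchanged.
User-agent: Applebot-Extended
Disallow: /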

Roughly 6–7% of high-traffic sites block it today. The list skews heavily toward news: The New York Times, the Financial Times, The Atlantic, Vox Media, and Condé Nast are all on the record as blocking it.

Source: Apple Support articles 119829 and 120320.

2. Meta-ExternalFetcher can ignore robots.txt on user-supplied URLs

This is the sharp edge of the post.

Meta's docs are explicit: facebookexternalhit and meta-externalfetcher may ignore robots.txt when a user explicitly hands Meta AI a URL as context. It is the same carve-out that ChatGPT-User and Perplexity-User operate under. The intent is "the user asked the assistant to look at this specific page, so the assistant fetches it."

The implication:

  • If your threat model is "no Meta AI surface ever fetches this page," robots.txt alone is not enough.
  • You need a firewall rule or user-agent block at the edge; a sketch follows this list.
  • Disallow: / under User-agent: meta-externalfetcher stops batch crawls, but a user pasting the URL into Meta AI can still trigger a fetch.
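
Here is what that edge block can look like in nginx. This is a minimal sketch, not a drop-in config: the server_name and root are placeholders, and you should confirm the exact user-agent strings against your own access logs before trusting the pattern.

server {
    listen 443 ssl;
    server_name yoursite.com;          # placeholder

    # Refuse Meta's user-triggered fetcher before robots.txt is consulted.
    # Caution: facebookexternalhit also powers ordinary Facebook link
    # previews, so only add it to this pattern if you accept losing those.
    if ($http_user_agent ~* "meta-externalfetcher") {
        return 403;
    }

    location / {
        root /var/www/html;            # placeholder
    }
}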

Source: Meta for Developers, crawler documentation (updated 2024–2026).

3. CCBot blocks propagate slowly — and partially

This is the subtle one.

Common Crawl publishes snapshots on a roughly quarterly cadence. Blocking CCBot today removes you from future snapshots. It does nothing about the snapshots you are already in.

The reason that matters: Common Crawl is the upstream source for derivative training datasets like The Pile, RedPajama, and C4. Those datasets are distributed, mirrored, and baked into models that already shipped. A block is a forward-looking decision; the historical footprint lives on for years.

If you are trying to scrub a specific page out of training data, blocking CCBot is a start, not a solution.
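
You can measure that historical footprint directly: Common Crawl exposes a public per-snapshot index API. A quick sketch (the snapshot ID is an example; the current list lives at index.commoncrawl.org, and the query format is assumed unchanged):

curl -s "https://index.commoncrawl.org/CC-MAIN-2024-33-index?url=yoursite.com/*&output=json" | head -n 5

Every line that comes back is a capture of one of your URLs that already shipped in that snapshot, and no robots.txt change will pull it back.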

Source: commoncrawl.org/ccbot.

What actually goes in robots.txt

For most content sites, the sane 2026 default is "allow all AI crawlers, no training opt-out." Foglift's own telemetry across customer sites says blocking correlates with fewer AI citations, not more.

But if you specifically want to be quotable in AI answers while opting out of training, the minimal block list looks like this:

# Training opt-outs
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow search/retrieval bots (needed to appear in AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /

Three notes on that block:

  • anthropic-ai and Claude-Web are deprecated. If your existing robots.txt references them, nothing breaks, but they are no-ops. Anthropic's current lineup is ClaudeBot (training), Claude-SearchBot (search), Claude-User (user-triggered browsing).
  • Meta-ExternalFetcher is deliberately not on the block list above. See the carve-out above — a user-supplied URL can override it anyway, so blocking mostly adds noise without adding protection. If you want true denial, do it at the firewall.
  • DuckAssistBot is on-demand and non-training. Leaving it allowed costs nothing and makes your content eligible for DuckAssist citations.

How to verify blocks actually worked

Two sanity checks that take under five minutes:

  1. curl -A "meta-externalagent" https://yoursite.com/robots.txt and read the response that crawler would actually receive, including any edge rule keyed on the user-agent. Same trick for each user-agent you care about.
  2. Look at server logs for the exact user-agent strings above over the last 30 days; a scan like the sketch after this list works. If you never saw DuckAssistBot in logs, a "Disallow DuckAssistBot" line is theoretical — you can't verify it's doing anything without sample traffic.
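
A minimal sketch of that second check, assuming an nginx-style access log at the usual path (adjust the path and time window for your stack):

for ua in GPTBot ClaudeBot meta-externalagent meta-externalfetcher CCBot DuckAssistBot; do
  # crude substring count; good enough for a first look
  printf "%-22s %s\n" "$ua" "$(grep -c "$ua" /var/log/nginx/access.log)"
done
# Applebot-Extended is omitted on purpose: it never fetches pages
# (see above), so it will never show up in logs.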

The second check is the one teams skip. A robots.txt rule you cannot observe being respected is not a security control; it is an honor-system pledge.

Why the guides are stale

The reason most robots.txt guides stop at GPTBot is that the 2023 cohort was easy: a handful of crawlers, one per company, one clean mental model ("block the bot named after the LLM"). The 2024–2025 cohort broke that mental model on three different axes at once:

  • Apple: the opt-out signal is a pseudo-user-agent that does not crawl.
  • Meta: there are two user-agents and robots.txt only applies to one of them reliably.
  • DuckDuckGo / Common Crawl: on-demand-only and "already baked in" respectively — neither fits "training bot" or "search bot" cleanly.

If your robots.txt was last touched in 2023, these are the five rows to add.


For a longer write-up with Anthropic's three-crawler split, the complete table of 17 AI crawlers, and a recommended config for publishers vs. SaaS docs vs. e-commerce, we keep the live reference at foglift.io/blog/robots-txt-ai-crawlers. It's updated whenever a new user-agent lands.
