For three decades, robots.txt has been the main mechanism websites use to signal how automated crawlers should behave. It was created in 1994 for a very different web made of lightweight HTML pages, predictable automation tools, and straightforward indexing needs.
Scraping trends in 2026 are changing rapidly. AI systems don't just fetch pages: they extract text, summarize content, crop images, and feed data into training pipelines. What's more, they do it automatically, without human intervention, as part of the emerging agentic AI trend. The need for new standards for scraping data with AI is clear.
Why robots.txt is no longer enough
As of now, a robots.txt scraping policy has a few structural limitations that become more obvious in the context of modern AI:
- It only controls access, not usage: The file can tell a bot whether it may fetch a URL, but it cannot distinguish between different purposes. A site owner cannot express something like “index this for search, but don’t use it for model training” (see the minimal robots.txt example after this list).
- It has no semantic layer: Robots.txt treats an entire URL the same way, even though a page may contain text, code snippets, images, or user-generated content that the owner would prefer to handle differently.
- It relies on voluntary compliance: Traditional search engines generally respect robots.txt. Many newer AI scrapers do not. A large-scale Duke University study found that several categories of AI-related crawlers never request robots.txt at all.
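For context, here is a minimal robots.txt file. Every directive speaks only to which paths a given crawler may fetch; nothing in the format can describe how fetched content may be used downstream (ExampleBot is a placeholder crawler name):

```
# robots.txt can only say which paths a user agent may fetch.
User-agent: *
Disallow: /private/
Allow: /

# Block one specific crawler entirely; there is no way to say
# "fetch this for search indexing, but do not train on it".
User-agent: ExampleBot
Disallow: /
```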
Because of these issues, site owners want clearer ways to express what is allowed and what is not. This shift has led researchers and developers to explore alternatives or supplements to robots.txt, including ai.txt and llms.txt.
ai.txt: a potential new standard for scraping data with AI
The ai.txt proposal (“A Domain-Specific Language for Guiding AI Interactions with the Internet”, 2025) introduced a domain-specific language for declaring what kinds of AI interactions are allowed on a site. The goal is not just to block or allow URLs, but to describe permitted actions in a more detailed way.
With ai.txt, a website could specify rules at different levels: for example, allowing summarization of an article while disallowing image extraction, or permitting use of one section for training but restricting another. The format also supports natural-language instructions aimed at compliant AI agents, which adds flexibility that robots.txt could never offer. In short, an ai.txt file would let a site (a hypothetical example follows the list below):
- Specify what types of content can or cannot be used (e.g., text allowed, images forbidden).
- Define what actions AI systems may perform, such as summarizing but not training.
- Set rules for specific sections or elements of a page.
- Provide usage terms or licensing notes directly in the file.
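The exact grammar is defined in the proposal itself; the snippet below is only an illustrative sketch of the kinds of statements such a file could carry. The directive names are invented for this example and are not taken from the specification:

```
# Hypothetical ai.txt-style policy (directive names are illustrative only)
content-type: text
  allow: summarize, index
  deny: train

content-type: images
  deny: extract, train

section: /blog/
  allow: summarize
  note: "Attribution required; see /license for terms."
```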
There are two proposed paths for enforcement, sketched in code after the list below:
- Programmatic parsing, where AI agents read a structured representation of ai.txt (such as XML) and enforce rules automatically.
- Prompt-based enforcement, where the ai.txt file is read as plain text and incorporated into the agent’s instructions.
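As a rough sketch of how an agent might implement both paths, the Python below assumes a hypothetical XML layout for ai.txt; the element names, attributes, and helper functions are invented for illustration and do not come from the proposal.

```python
# Sketch of the two enforcement paths for ai.txt.
# The XML schema assumed here (<policy><rule .../></policy>) is hypothetical.
import urllib.request
import xml.etree.ElementTree as ET


def fetch_ai_txt(site: str) -> str:
    """Download the site's ai.txt, or return an empty string if it is absent."""
    try:
        with urllib.request.urlopen(f"https://{site}/ai.txt", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def is_action_allowed(ai_txt_xml: str, action: str, path: str) -> bool:
    """Programmatic parsing: read structured rules and enforce them in code."""
    if not ai_txt_xml:
        return True  # no file published: fall back to robots.txt behavior
    root = ET.fromstring(ai_txt_xml)
    for rule in root.iter("rule"):
        if path.startswith(rule.get("path", "/")) and rule.get("action") == action:
            return rule.get("allow", "true").lower() == "true"
    return True  # no matching rule means the action is not restricted


def build_agent_prompt(ai_txt_text: str, task: str) -> str:
    """Prompt-based enforcement: fold the file verbatim into the agent's instructions."""
    return f"Site usage policy (ai.txt):\n{ai_txt_text}\n\nTask: {task}"
```

Either way, the weakness is the same one robots.txt already has: the agent must choose to fetch the file and honor it.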
llms.txt: a content guide rather than a permission file
While ai.txt focuses on restrictions, llms.txt focuses on clarity.
Originally proposed by Jeremy Howard, llms.txt is a simple Markdown file placed at a site’s root. Instead of forcing AI systems to scrape entire pages, llms.txt gives them a concise overview of the site’s key content.
A typical file might include a short summary of the project, followed by links to documentation, examples, reference pages, or other important sections. Because it uses Markdown, it is easy for humans to read and easy for models to parse. In short, it can (an example file follows this list):
- Give AI a clean summary of what the site is and what it contains.
- Point to the most important pages (docs, guides, reference material).
- Mark which URLs are the “canonical” sources to rely on.
- Help models avoid scraping noisy or irrelevant pages by offering a curated list.
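To make the format concrete, here is a shortened example in the spirit of the proposal: an H1 title, a one-paragraph summary in a blockquote, and sections of annotated links. The project name and URLs are placeholders:

```markdown
# Example Project

> Example Project is an open-source toolkit for parsing sitemaps. This file lists the pages an LLM should rely on when answering questions about it.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): installation and a first working example
- [API reference](https://example.com/docs/api.md): every public function and its parameters

## Optional

- [Changelog](https://example.com/changelog.md): release history; safe to skip for most questions
```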
It is not a replacement for a robots.txt scraping policy, and it does not regulate access. It plays a different role: offering a clean entry point so that AI systems can rely on a curated outline rather than full scraping. This can reduce hallucinations and improve the accuracy of responses when an LLM references a site’s content.
Current status and outlook: why do we still use robots.txt instead of ai.txt and llms.txt?
So far, these alternatives and complements to robots.txt are voluntary conventions, adopted mostly in tech and SEO communities. For example, the Yoast SEO plugin added a feature that auto-generates an llms.txt file at a site's root.
However, AI providers themselves have not universally honored these files so far. SearchEngineLand reported in 2025 that “there’s no clear evidence that AI companies follow llms.txt” rules, and noted that Google has explicitly said it does not support llms.txt. Rather than adopting ai.txt or llms.txt, large platforms have published developer docs and crawler rules that implement similar controls through adapted robots.txt scraping policies.
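For example, OpenAI and Google both document crawler tokens (GPTBot and Google-Extended, respectively) that a robots.txt file can target directly, so a site that wants to remain in search results while opting out of AI training can publish something like the following:

```
# Opt out of AI training crawlers while leaving ordinary search crawlers untouched.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```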
Governments are only beginning to look at this issue. In the UK, publishers such as Guardian News & Media have asked policymakers to support an ai.txt-style standard. Their submission to a UK Parliament committee described ai.txt as a broad, industry-friendly solution and noted that current proposals from major tech companies don’t fully meet publishers’ needs.
For now, though, no country requires ai.txt or llms.txt. Despite the evolution of AI scraping, 2026 trends in robots.txt remain uncertain. In the U.S., regulators focus on voluntary disclosure and transparency guidelines. As of late 2025, neither format is ready to become a new AI standard for scraping data. It's unlikely that things will change as early as 2026, but it's possible that a unified system will eventually emerge.
Citations
[1] “Scrapers selectively respect robots.txt directives: Evidence from a large-scale empirical study”, Taein Kim, Duke University, 2024
[2] “Exclusive – Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says”, Reuters, 2024
[3] “LLMs.txt Directory”, llmstxt.cloud, 2024
[4] “What Is llms.txt? How the New AI Standard Works”, Bluehost, 2025
[5] “Google AI and LLMs.txt Not Yet Implemented”, Search Engine Roundtable, 2024
[6] “Written evidence on AI and web standards (on the importance of robots.txt and llms.txt)”, UK Parliament Committees, 2024


