david duymelinck

The AI scraper problem and a possible fix

When I saw Laravel Cloud offering markdown for agents, it felt like a simple solution to a difficult problem.

The problem

Most websites serve HTML because it has been the web's base language since the invention of the browser.

The problem today is that there are not only search engine bots, but also AI bots that scrape websites.
While search engine bots use HTML metadata to make the website's pages appear in search results, AI bots have a lot less use for HTML because it is a markup language aimed mainly at humans. ARIA attributes are information clutter for an AI bot. Navigation links are of no use for an LLM.
And there are also design assets, like images and videos, that are only there for humans.

Why the markdown for agents solution is only a partial fix

The solution is genius in its simplicity. It checks the Accept header, and when it sees markdown as an option, it serves the markdown version of the page.
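I haven't seen the actual implementation, but a minimal sketch of that kind of content negotiation could look like this, assuming an Express-style server and pre-generated markdown and HTML files per page (the route and file layout are made up for illustration):

```typescript
// Minimal sketch of Accept-header negotiation, assuming pre-generated
// .md and .html files per page (file layout is hypothetical).
import express from "express";
import { promises as fs } from "fs";
import path from "path";

const app = express();
const CONTENT_DIR = "./content"; // hypothetical: about.md and about.html, etc.

app.get("/:slug", async (req, res) => {
  // Pick the representation the client prefers among the ones we can serve.
  const preferred = req.accepts(["text/html", "text/markdown"]);
  if (!preferred) {
    res.sendStatus(406); // client accepts neither representation
    return;
  }

  const ext = preferred === "text/markdown" ? "md" : "html";
  try {
    // Note: no slug validation, this is only a sketch.
    const body = await fs.readFile(
      path.join(CONTENT_DIR, `${req.params.slug}.${ext}`),
      "utf8"
    );
    res.type(preferred).send(body);
  } catch {
    res.sendStatus(404);
  }
});

app.listen(3000);
```

An AI bot would then request the page with `Accept: text/markdown` and get the markdown file, while a browser keeps getting the HTML.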

While I haven't used the solution, I don't think it will be able to identify images and videos that contain information useful to the AI. So that information is not going to be transcribed in the markdown file.
The alt attribute could be an indication, but it is mostly written to get better search results, so that information is not reliable.

The other problem I see is when a page is deeply linked within the website or appears as part of multiple contexts. This can be a source of confusion and missing information for an LLM.

A better solution

I think this shouldn't be a hosting solution but an application solution.
From now on, look at a webpage as if it were an API endpoint. There we have had, for as long as I can remember, the Accept header to negotiate the output format.

Add AI augmentation fields to the page editor to get the right information in front of an AI.
And add a hide-from-AI option for parts that are only meant for humans.
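A minimal sketch of what that could look like inside the application, with made-up field names (aiAugmentation, hideFromAi) that an editor would fill in next to the normal content:

```typescript
// Sketch of the application-level idea: content blocks carry an optional
// "hide from AI" flag and pages carry optional AI augmentation text.
// The field names are invented for this example.

interface ContentBlock {
  heading: string;
  html: string;         // what humans see
  markdown: string;      // plain markdown equivalent
  hideFromAi?: boolean;  // e.g. a promo banner or image carousel
}

interface Page {
  title: string;
  blocks: ContentBlock[];
  aiAugmentation?: string; // extra context an editor writes only for AI bots
}

function renderForHumans(page: Page): string {
  return [
    `<h1>${page.title}</h1>`,
    ...page.blocks.map((b) => `<section><h2>${b.heading}</h2>${b.html}</section>`),
  ].join("\n");
}

function renderForAgents(page: Page): string {
  const visible = page.blocks.filter((b) => !b.hideFromAi);
  return [
    `# ${page.title}`,
    page.aiAugmentation ?? "",
    ...visible.map((b) => `## ${b.heading}\n\n${b.markdown}`),
  ].filter(Boolean).join("\n\n");
}
```

The same Accept header check as before would then decide which of the two render functions is used for a request.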

GEO

I don't know how many people know about generative engine optimization (GEO) and what the current practices are, so I want to go over them briefly and compare them with the markdown version of a page.

The semantic structure of HTML falls away because the semantic structure of a markdown file is in its headings.
HTML has elements such as section, aside and nav that can contain data for humans but have no use for an LLM. Markdown avoids adding that information.

Structured data like JSON-LD is not needed because the markdown file only contains relevant data.

Organizing pages in a context an AI understands can be avoided by adding that context to the markdown version of the page.

Longer texts and FAQs can be left off the HTML pages and added to the markdown pages instead.

llms.txt is a proposal to provide markdown content for a site. I think using a link element in HTML, for example `<link rel="alternate" type="text/markdown" href="/page.md">`, to guide bots to the markdown page is a more standardized way to let AI bots discover the content.

Conclusion

Markdown pages are not only a solution to fix LLM information, they can also be used as a rough human-bot divider.
A lot of websites see a surge in traffic because of AI bots, and because it is becoming more and more difficult to distinguish bot traffic from human traffic, people are taking measures against bots that also affect humans.

Of course, this is a solution for content that doesn't change often. For content that changes more frequently, other solutions are needed.

Top comments (1)

Lars Moelleken

You can add videos and images into markdown and multimodal LLMs can analyse / transcribe them, but my problem with this whole scraper thing is that they get the information without any references or credits for the (content) creator. Ultimately, the big tech companies get things "for free", while a few years ago, "normal" people would have gone to prison for copying videos 😑. I know that the economy in the USA has no choice, they invested sooo much money into that technology that nothing else matters, and it seems the tech companies can do what they want. In the EU the AI Act and DSGVO, etc. try to protect people, but as long as we all use US software those protection ideas have nearly no value.

One current example of good/bad technology to explain my point of view: in Russia, you can pay for your underground journey using facial recognition, but the same system was then also used to identify people who were chosen for the frontline in the Ukraine war.