DEV Community

david duymelinck


Markdown pages, are they a good solution?

I already wrote a post about an HTML to markdown converter solution. In that post I suggested that implementing the functionality at the application level would be a better solution. And in the last week a Laravel markdown response package and a Symfony markdown response bundle popped up. I guess other languages with web frameworks will get similar solutions.

I consider those solutions to be partial fixes, because they lack the tools to trim or augment the page content so an LLM gets data it can act on.
If you want to provide content for LLMs, I think the best solution is a backend one instead of a frontend one. I consider fully rendered HTML to be part of the frontend.
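To make the backend idea concrete, here is a minimal sketch of content negotiation at the application level. The page store, route, and handler are hypothetical; the point is only that one backend route can serve a trimmed markdown version to LLM clients and the full HTML to browsers.

```python
# Hypothetical page store: each path has a lean markdown twin next to
# the full human-facing HTML. Real content would come from a CMS.
PAGES = {
    "/pricing": {
        "markdown": "# Pricing\n\nBasic plan: 10 EUR/month.",
        "html": "<html><body><nav>...</nav><h1>Pricing</h1>"
                "<p>Basic plan: 10 EUR/month.</p></body></html>",
    },
}

def respond(path: str, accept_header: str) -> tuple[str, str]:
    """Return (content_type, body) based on the request's Accept header."""
    page = PAGES[path]
    # Clients that explicitly ask for markdown get the lean version;
    # everything else falls back to the human-facing HTML.
    if "text/markdown" in accept_header:
        return "text/markdown", page["markdown"]
    return "text/html", page["html"]
```

The same negotiation could also key off the user agent instead of the Accept header; the Accept header is just the mechanism HTTP already provides for this.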

The elephant in the room is that webpages are also a human construct.
An AI scraper doesn't need to follow human navigation. Page links are useless if it can grep the content.

How did we get here?

The goal of search engine bots was to find all the pages of a website, put them in a search index, and rank them.

The purpose of AI bots is to scrape content from websites to use as additional knowledge for an LLM.

While scraping content was part of what search engine bots did, the content itself was not the main objective.

Search engine bots are also a minor part of the traffic, and they count as a marketing cost, because they expose the website to a bigger audience.

AI bots are becoming a substantial part of the traffic, and they haven't proven their marketing worth or any other benefit.

It seems logical to me that people's first reaction was to block AI traffic. When people discovered food packaging contained less food, they were not happy. When food companies started to use lower-quality ingredients because their sales had reached a ceiling, people were again unhappy.
The sad fact is that people keep buying the product. And I think we are at the same point with AI: websites are allowing AI scrapers because it could be beneficial.

What is the solution?

If you want to provide data for an LLM, I think separate LLM and human websites are a better way to go.
The LLM website can be nothing more than a collection of linked markdown files.

The second part of the solution is to provide a search that returns data an LLM or an agent can use. The main goal of the search is to provide specific information, or information not found on the LLM website.
I don't think REST(ful) or GraphQL endpoints are good enough, because their output is not LLM specific.
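As an illustration of what "LLM specific" output could mean, here is a hypothetical search function that returns a compact markdown snippet instead of a full REST resource. The corpus and the matching logic are invented for the sketch; the idea is that the response is something an agent can drop straight into its context window.

```python
# Hypothetical mini-corpus standing in for a site's content store.
DOCS = [
    {"title": "Shipping", "body": "Orders ship within 2 business days."},
    {"title": "Returns", "body": "Returns are accepted within 30 days."},
]

def llm_search(query: str, limit: int = 1) -> str:
    """Return the most relevant snippets as markdown, not as a JSON resource."""
    q = query.lower()
    hits = [d for d in DOCS
            if q in d["title"].lower() or q in d["body"].lower()]
    # Emit only what the agent asked about, formatted as lean markdown.
    lines = [f"## {d['title']}\n\n{d['body']}" for d in hits[:limit]]
    return "\n\n".join(lines) if lines else "No matching content."
```

A REST endpoint would return the whole resource with ids, timestamps, and links; here the server does the trimming so the agent spends tokens only on the answer.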

The benefit of the LLM website is that it consists of static pages, so you can host them on edge servers when you see traffic ramping up in a certain region.
The benefit of the search is that you can put a paywall in front of AI scrapers that want to access the searchable content more frequently, or want extra information.
The benefit of this solution is that HTML page traffic will become more human again, once the people who run AI scrapers are aware of these options.

Top comments (15)

klement Gunndu

The backend-over-frontend argument makes sense — serving markdown responses also lets you version the LLM-facing content separately from the UI. Interesting that the traffic cost argument is what's finally forcing this conversation.

bblackwind

nice work

Ingo Steinke, web developer

I don't understand the issue here. HTML offers semantic structure options, search engines have long encouraged content creators to use them, and accessibility profits as well. Markdown has very few semantic options in comparison. Web scraping has been optimized for HTML pages for decades now, and Common Crawl and Google's derivatives are being used for AI training.

What's the benefit of offering an alternative markdown representation of site content? Letting website owners control the simplification process and its priorities, to prevent crawlers from doing it the wrong way and misunderstanding our content?

david duymelinck

The problem, as I see it, is twofold. On the one hand, there are certain content structures that provide more context for an LLM, but people would rather read less structured content. And the semantics of HTML target software with stricter requirements than an LLM, so it requires more tokens than needed.
Markdown is good enough for most cases where an LLM needs extra context.

And on the other hand, HTML requires a processor to extract the content.
Having markdown pages removes the need for a processor.

The bigger issue is the traffic cost, and the reports you can get from a website. It is now easier than ever to run up the visitor numbers for a website. So how do you know, as a site owner, that the website is a benefit for your company?
Having markdown pages splits the traffic in a way that is beneficial for everyone, so it can provide an organic way to separate human and AI traffic.

In AI cycles people are talking about token budgets, so that is one of the most important things that is going to drive how content is going to be distributed.

Ingo Steinke, web developer

Markdown does not require a processor, but LLM engines seem to use one anyway. I recall a recent vulnerability report where an AI (Copilot?) executed code embedded in .mdx files.

Markdown is much simpler than HTML, but its popular implementations, even without scripting and front matter, like those on GitHub or here on DEV, include complex formatting options, like using an image with alternative text as a link text, thus nesting [![...](...)](...), and the Liquid-like curly braces: if you remove all of that complexity, you're back at plain text plus headlines.

I still don't get the initial issue at all. Is there a use case of either

  1. AI traffic getting so frequent that website owners pay the bill for extra traffic,
  2. cloud servers blocking AI crawlers as alleged DoS attackers,
  3. AI agents stopping crawling a site because its HTML is too costly to parse, or
  4. ... (you tell me)?
EmberNoGlow • Edited

I agree with you. I didn't understand the point of this article. I don't even understand what the author is talking about: how difficult is it for AI to read markdown, or how difficult is it to write it? Markdown is simply an "accessory" that can be useful to humans. AI essentially doesn't care what text it reads, whether it contains additional characters or not; this does not change the meaning of the text.

david duymelinck

I think the initial issue is that there is no standard for AI scrapers. And that is why they are using search engine bot technology, while that is built for a whole other purpose.
If you can eliminate the need for an agent to scrape your whole website for a paragraph of information, wouldn't that benefit both the agent and your website?

I think markdown pages can be used as a pragmatic solution, because there are a lot of people running agents who don't understand the costs they create for other people.
So there will be scrapers that try to use HTML pages. If you can redirect them to markdown pages, it puts less strain on the server, because markdown is a much simpler format.
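The redirect idea could be sketched like this, assuming a hypothetical URL layout where each HTML page has a markdown twin under /llm/. The user-agent list is illustrative, not exhaustive, and real deployments would do this in the web server or framework routing layer.

```python
# Illustrative list of AI crawler user-agent tokens; not exhaustive.
AI_USER_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

def route(path: str, user_agent: str) -> str:
    """Map a request path to its markdown twin for AI crawlers."""
    if any(bot in user_agent for bot in AI_USER_AGENTS):
        # Hypothetical layout: /about.html -> /llm/about.md
        name = path.rsplit("/", 1)[-1].removesuffix(".html")
        return f"/llm/{name}.md"
    # Humans (and unknown clients) keep getting the HTML page.
    return path
```

The same split could also be advertised cooperatively, e.g. via a link in the HTML head or an llms.txt file, so well-behaved scrapers find the markdown pages without being forced there.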

I think we are all looking for solutions that can accommodate both humans and agents without the need to blend the two.
Human websites have too many edge cases to be efficient for agents.

david duymelinck • Edited

I think we are getting our wires crossed somewhere. I never mentioned AI has problems with markdown. It is HTML for humans that has too much noise for what an LLM needs.

If you want to write HTML pages for agents, that could be a solution too. But more text equals more tokens.

Ingo Steinke, web developer • Edited

But more text equals more tokens.

I hope I'm wrong, but I see AI burning more tokens in generating and refining erroneous answers than they ever could in a simple scraping process. But maybe I'm just too much of an AI-skeptic to care for their tech "solution" that I never asked for. Still, it feels like a micro-optimization at the wrong place while AI has much more fundamental unsolved issues.

david duymelinck • Edited

I don't think it is their solution. I think it is more about finding a way to deal with the mass of vibe coded software that is running 24/7.
The requirement of knowing how to code is removed because of AI, so people who have no experience just point and shoot their agents.

The term that was used not long ago was script kiddies. Now instead of needing to manually search for code, they let an LLM generate it.

The problem is that if you block all suspected AI traffic, you could hurt your business if you have a shop or want to get your product known.
It is the same with any social media/forum platform. There are always going to be people that try to take advantage, but at the same time there are people using the same technology to make something good.

We are living in a weird world now. And we are looking for answers.

Ravavyr

AIs read markdown better, is all. llms.txt is generally recommended to be written as markdown.

Ingo Steinke, web developer

AI is assumably more stupid than Googlebot was 10 years ago?

Matthew Hou

Nice in theory, but maintaining markdown docs on a fast-moving team is harder than it sounds. We tried this and the file was outdated within two sprints. Nobody wanted to own it. I think the real answer is executable standards — lint rules and CI checks that enforce patterns automatically, not docs that go stale.

david duymelinck

They are not really docs. It is just alternate content for LLMs. So the responsibility can be moved from the dev team to the content team. Of course, the editor should be someone who understands what content is best for an LLM.

Matthew Hou

You're right — I've been thinking about this as "LLM-facing websites." Just like we started building responsive pages for mobile, sites should serve different responses for human browsers vs. LLM/API access. I've already caught myself defaulting to asking Claude instead of Googling — and I don't think I'm the only one. The shift is already happening.