The wacky 2026 world of bot traffic

Three strategies to deal with bot traffic:

  1. Give up. Have at it. Take what you want. As fast as you want. On my dime. Steal it. Re-use it. Life is too short.
  2. Hide behind Cloudflare. Let them try to deal with it. Have CAPTCHAs annoy your users. Pay Cloudflare and be up-sold.
  3. Understand more about it and do what you can to make it harder. Rage against the dying of the light. Etc.

I completely understand options 1 and 2. Life is too short, with other priorities, to be in an unwinnable battle with bots that don't even know what robots.txt was supposed to be, let alone read it, let alone obey it.

However, in the spirit of stupidly writing an entire WebGL/WebGPU website without touching Cesium or Three.js, I'd rather understand more about it.

So the other week I see in the logs that several entire netblocks from Alibaba hosting data centers are essentially scraping the whole website: all the leaf nodes, in all the languages. Since there are 35,000 satellites and each page exists in multiple languages, that's a lot of pages. They rotate user agents and use a vast number of IPs, but they are easy to see and block: just blanket-ban the entire netblocks involved. I've already got my web server set up so that from the CLI I can just type 'ban x.y.z/n' (or 'unban'), plus monitoring scripts to do this continually.
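For the curious, the shape of that ban/unban helper is roughly the sketch below. It assumes an nftables set that already exists with `flags interval` so it can hold CIDR blocks; the table and set names are placeholders, not my exact setup.

```python
#!/usr/bin/env python3
"""Sketch of a 'ban x.y.z.0/n' style helper: drop whole netblocks at the
firewall by adding/removing CIDRs in an nftables set.
Table/set names are placeholders, not a real production config."""
import subprocess
import sys

TABLE = "inet filter"   # assumed nftables table
SET = "banned_nets"     # assumed set: type ipv4_addr; flags interval

def run_nft(verb: str, cidr: str) -> None:
    # e.g. nft add element inet filter banned_nets { 47.74.0.0/16 }
    subprocess.run(
        ["nft", verb, "element", *TABLE.split(), SET, f"{{ {cidr} }}"],
        check=True,
    )

if __name__ == "__main__":
    action, cidr = sys.argv[1], sys.argv[2]   # "ban 47.74.0.0/16" or "unban 47.74.0.0/16"
    run_nft("add" if action == "ban" else "delete", cidr)
```

The actual blocking is a single drop rule that references the set (`ip saddr @banned_nets drop`); the monitoring scripts just call the helper with whatever netblocks stick out in the logs.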

This quieted things for a few days. But then they came back, this time from 35 THOUSAND IP addresses, each IP making exactly one fetch, using a relatively small set of rotating but legit user agents.
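Even though each request looks clean on its own, the pattern is easy to see in aggregate: count requests per IP over a window and the distribution collapses to almost exactly one hit each. A rough sketch of that check, assuming common-log-format access logs (the file path is a placeholder):

```python
from collections import Counter

# Count requests per client IP from a common-log-format access log.
# A burst where tens of thousands of IPs each appear exactly once is the
# distributed-proxy fingerprint described above.
hits = Counter()
with open("access.log") as f:          # placeholder path
    for line in f:
        ip = line.split(" ", 1)[0]     # first field is the client IP
        hits[ip] += 1

single = sum(1 for n in hits.values() if n == 1)
print(f"{len(hits)} distinct IPs, {single} of them made exactly one request")
```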

Investigating this pattern, which seemed on the face of it impossible to block, I can see that all these residential IPs are using HTTP/1.1. That provides an option for detection: if you claim to be a modern browser AND you arrive over HTTP/1.1, you're blocked, because all modern browsers will negotiate HTTP/2 or newer against a server that offers it. For bots the picture is mixed, but legit bots have legit user agents and can be exempted. (Note: I would not recommend you follow this life hack. Gemini can fill you in on all the reasons why it is a bad idea.)
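In rough pseudo-rule form, the heuristic looks like the sketch below. The user-agent patterns are illustrative only, not an exhaustive or recommended list, and the bot whitelist is an assumption about which crawlers you'd want to exempt.

```python
import re

# Heuristic sketch: a request that advertises a current Chrome/Firefox/Safari
# build but arrives over HTTP/1.1 is suspect, because those browsers will
# negotiate HTTP/2+ against a server that offers it. Crawlers that identify
# themselves honestly are exempted. Patterns are illustrative only.
MODERN_BROWSER = re.compile(r"Chrome/1\d\d\.|Firefox/1\d\d\.|Version/1[6-9].*Safari")
KNOWN_BOT = re.compile(r"Googlebot|bingbot|DuckDuckBot|Baiduspider", re.I)

def looks_fake(protocol: str, user_agent: str) -> bool:
    if KNOWN_BOT.search(user_agent):
        return False
    return protocol == "HTTP/1.1" and bool(MODERN_BROWSER.search(user_agent))

# e.g. looks_fake("HTTP/1.1", "Mozilla/5.0 ... Chrome/120.0.0.0 Safari/537.36") -> True
```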

But the question is: why? Why spend money crawling an entire tree of URLs via residential proxy IPs when the information is ALL ephemeral? It would be like crawling an entire worldwide weather website. Absolutely no use at all.

The answer is possibly: AI training. Somewhere, someone has decided they want an entire mirror of tens of thousands of domains and does not care whether the data is a maze of pointless twisty passages all alike, or the weather in Manitoba, or the stock price of Google at this very instant in time. They just want it, and will disobey all the flimsy "laws" humans set up when they invented files like robots.txt.

So that's the landscape today. 75% or more of your traffic (in simple server logs) is fake. Even if it LOOKS real, it's probably fake. You're paying outbound bandwidth costs to feed AI training, SEO intelligence, competitive keyword data sets, who knows what.

Of course Google Analytics would filter most of this out, because I doubt scrapers care to run the Google snippets that record a "real user" (though Baiduspider and others increasingly do execute all the JS now). But if you're paying for a CDN, or paying for bandwidth by the terabyte, you're paying to feed bots anyway, and they are far more voracious than they used to be.
