
Until recently, websites were pushing for web crawlers to index their content properly.
Now, a new type of crawler, AI crawlers, is changing the g...
My company's SaaS product is hosted on unique sub-domains, one assigned to each customer. We often see "waves" of crawler traffic, even from legitimate bots such as Bing. Throttling legitimate bots via robots.txt is ineffective because of the many different sub-domains. Recently we have had several "DDoS"-type situations, with crawlers firing up on thousands of our sub-domains within minutes. Our product relies heavily on dynamically generated content, so a wave like this spins up all our servers, eats through all available memory, and kills the databases by pushing thousands of queries per second.
In the last month, Bingbot has been one of the worst culprits. Even though we have configured crawl rates and preferred crawl times in the Bing console, this wave of bots does not seem to respect those settings. It occurred to us that these could be other scrapers faking Bingbot, but the handful of IPs I have checked do check out as Bing's.
To protect our servers and customers from this nuisance, I have configured our HAProxy-based load balancers to return 429 responses if a particular IP address accesses more than a certain number of sub-domains within a limited period of time. We use the same system to block bots scanning for exploitable scripts (e.g., vulnerable WordPress PHP endpoints, exposed .env files, etc.).
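Roughly, the rule has this shape — a minimal HAProxy sketch with illustrative names and thresholds. Our real rule counts distinct sub-domains per IP, which needs a more involved stick-table setup, so this simplified version just rate-limits requests per client IP on the frontend shared by all sub-domains:

```
frontend fe_customer_subdomains
    bind *:443 ssl crt /etc/haproxy/certs/

    # Track each client IP and its HTTP request rate over a 1-minute window
    stick-table type ip size 1m expire 10m store http_req_rate(1m)
    http-request track-sc0 src

    # Illustrative threshold: answer 429 above 600 requests/minute from one IP
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 600 }

    # Alternative (HAProxy 2.2+): also send a Retry-After hint with the 429
    # http-request return status 429 content-type text/plain string "Too Many Requests" hdr Retry-After 120 if { sc_http_req_rate(0) gt 600 }

    default_backend be_app
```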
Cool to hear your prod experience!
Have you had any thoughts about how this affects your website's visibility on the Bing chatbot?
As an example, searching with Perplexity seems to trigger a search across different websites (maybe in real time?).
If I ran one of those websites, I would want Perplexity's bot to be able to crawl me, because an ongoing search is probably about my product or something adjacent, right?
We are concerned about how this might affect our SEO. From what we have seen, Bingbot still scans most of our domains daily, but in a low-key fashion that doesn't trigger the circuit breakers. Our understanding is that by returning a 429 we are not saying we don't exist, only that we were overloaded and the crawler should try again later. As for Perplexity/Bing chatbots, we rely on SEO from our customers' sites linking to our marketing site, and we keep the marketing site on separate infrastructure without these restrictions.
Don't fight AI bots and crawlers; make them your friend — prerendering.com
Mmm, I have a few questions. Based on the article and on what people have experienced, how do you think prerendering will reduce load?
How do you choose when to block or allow a page crawl?
Good points!
I think 1 is very viable when the content is static (some post/article), but not possible if we need to show dynamic content, right? E.g., techpays.com has filters to figure out a salary range given your role/location, etc.
It works for both static and dynamic websites; the cache TTL can be set anywhere between 1 hour and 31 days.
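For context, here is a rough sketch of how the routing can look at the load balancer: known crawler user agents go to a prerender cache that serves TTL-cached HTML snapshots, while regular visitors still hit the dynamic app. Backend names, addresses, and the user-agent list below are purely illustrative, not a specific product integration:

```
frontend fe_main
    bind *:443 ssl crt /etc/haproxy/certs/

    # Illustrative list of crawler user-agent substrings
    acl is_crawler hdr_sub(User-Agent) -i bingbot googlebot gptbot perplexitybot

    use_backend be_prerender_cache if is_crawler
    default_backend be_app

backend be_prerender_cache
    # Hypothetical prerender service holding cached, pre-rendered HTML snapshots
    server prerender1 10.0.0.50:3000 check

backend be_app
    server app1 10.0.0.10:8080 check
```

Because the snapshots are refreshed only when the TTL expires, crawler hits stop touching the application servers and databases directly.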
Thanks so much
Do you have a website that has experienced something similar?
great info.. thanks
No problem! Have you experienced something similar?
Same experience here — at first we were excited to see traffic going up, then came the shock of realizing it was just aggressive AI crawlers burning through our plan quotas.
Neat, I liked how it showed real ways to stop sneaky bots. How can we protect online content without hurting the good parts of the internet?