Hello, I'm Maneshwar. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star us to help devs discover the project. Do give it a try and share your feedback to help improve the product.
Bots are a fact of life on the internet.
Some are helpful—like search engine crawlers.
Others scrape your data, spam your forms, or brute-force your login pages.
If you’re self-hosting with Nginx, you don’t need a pricey SaaS WAF to stop them.
Here's how to detect and destroy malicious bots using good ol’ Nginx, a few scripts, and some zip-bomb flavor.
1. Start with Logs — Always
Nginx logs tell the full story. Make sure you're capturing User-Agent, IP, and paths.
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent"';

access_log /var/log/nginx/access.log main;
Now dig through logs for patterns:
# Top IPs by request volume
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head
# Suspicious User-Agents
grep -iE 'curl|wget|python|scrapy|bot|crawler|headless' /var/log/nginx/access.log | less
Want real-time views? Try GoAccess for a terminal dashboard.
2. Identify Suspicious Behavior
Things that scream “bot”:
- Blank or obviously fake User-Agent headers
- High request volume from a single IP
- Frequent hits to /wp-login.php, /xmlrpc.php, /admin, or random paths
- Unusual Referer headers, or none at all
- Crawlers hitting endpoints that no normal user would
Bonus: check your logs against public bot signature lists like MitchellKrogza’s bad bot list.
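As a quick sketch of that cross-check (the sample log lines and list entries below are made up for illustration; on a real server you'd point this at /var/log/nginx/access.log and seed badbots.txt from a public list):

```shell
# Sample access-log lines in the combined format configured above
cat > sample.log <<'EOF'
203.0.113.7 - - [10/Jan/2025:00:00:01 +0000] "GET /wp-login.php HTTP/1.1" 404 0 "-" "MJ12bot/v1.4"
198.51.100.2 - - [10/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.7 - - [10/Jan/2025:00:00:03 +0000] "GET /xmlrpc.php HTTP/1.1" 404 0 "-" "MJ12bot/v1.4"
EOF

# badbots.txt: one known-bad User-Agent substring per line
printf 'MJ12bot\nAhrefsBot\nSemrushBot\n' > badbots.txt

# Rank IPs whose requests matched a bad-bot signature
grep -i -f badbots.txt sample.log | awk '{print $1}' | sort | uniq -c | sort -nr
```

Any IP that dominates this output is a strong candidate for the blocks in the next sections.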
3. Block the Obvious Stuff with Nginx
Create a quick and dirty User-Agent filter:
# map blocks go in the http {} context
map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|bot|Go-http-client) 1;
}

server {
    if ($bad_bot) {
        return 403;
    }
}
And rate limit abusive IPs:
limit_req_zone $binary_remote_addr zone=abusers:10m rate=5r/s;

server {
    location / {
        limit_req zone=abusers burst=10 nodelay;
        ...
    }
}
Also check out Nginx rate limiting docs.
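One related detail: by default Nginx rejects rate-limited requests with a 503, which can confuse well-behaved clients into thinking your server is down. The limit_req_status directive lets you return the more accurate 429 instead (a sketch, placed in the same server or location block as limit_req):

```nginx
limit_req_status 429;        # "Too Many Requests" instead of the default 503
limit_req_log_level warn;    # log rejections at warn instead of error
```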
4. Use Fail2Ban to Auto-Ban IPs
Install Fail2Ban and wire it to your Nginx logs:
Jail config (/etc/fail2ban/jail.local):
[nginx-badbots]
enabled = true
filter = nginx-badbots
logpath = /var/log/nginx/access.log
maxretry = 5
findtime = 600
bantime = 3600
Filter (/etc/fail2ban/filter.d/nginx-badbots.conf):
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*HTTP.*"(curl|wget|python|scrapy|bot|Go-http-client)
ignoreregex =
Once this is running, bots get banned automatically after a few hits.
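Before trusting the jail, dry-run the filter against your live log with fail2ban-regex (it ships with Fail2Ban): fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-badbots.conf. You can also exercise the pattern with plain grep by swapping Fail2Ban's <HOST> tag for an IPv4 pattern; a quick sketch with a made-up log line:

```shell
# A request a bad bot would generate
line='203.0.113.7 - - [10/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 0 "-" "curl/8.5.0"'

# The jail's failregex, with <HOST> replaced by an IPv4 pattern
pattern='^[0-9.]+ -.*"(GET|POST).*HTTP.*"(curl|wget|python|scrapy|bot|Go-http-client)'

echo "$line" | grep -Eq "$pattern" && echo "line would count toward a ban"
```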
5. Use Better Tools for Smarter Bots
If you're seeing more sophisticated attacks, try:
- CrowdSec: Open-source tool that shares a dynamic IP reputation list and applies bans
- ModSecurity: Full WAF, works with Nginx
- OpenResty: Extend Nginx with Lua scripting (e.g., custom captcha, behavior analysis)
If you’re open to a proxy layer:
- Cloudflare free tier: Blocks a lot of trash automatically
- Fastly Bot Protection: Advanced but paid
Bonus: Serve Zip Bombs to Dumb Bots (⚠️ Handle with care)
This blog post by Idiallo shows how he turned bot detection into punishment.
The method? Serve them a compressed zip bomb.
To generate one:
dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz
This creates a ~10MB file that decompresses to 10GB of zeros.
If a bot tries to read it without knowing, it chokes.
Then detect and serve it (PHP, from his post):
if (ipIsBlackListed() || isMalicious()) {
    header("Content-Encoding: gzip");
    header("Content-Length: " . filesize(ZIP_BOMB_FILE_10G));
    readfile(ZIP_BOMB_FILE_10G);
    exit;
}
He explains that when traffic spikes, he swaps in a 1MB variant.
It’s a great deterrent for low-effort bots.
Heuristics like repeated scanning and double-visits from spam IPs helped him fine-tune this method.
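If you'd rather keep the whole thing in Nginx and skip PHP, the same trick works by serving the pre-compressed file with a forged Content-Encoding header. A sketch, where the /bot-trap path and the file location are assumptions; point your detection logic (the $bad_bot map above, for instance) at this location:

```nginx
location = /bot-trap {
    # Pretend the pre-compressed bomb is ordinary gzip-encoded HTML
    default_type text/html;
    add_header Content-Encoding gzip;
    alias /var/www/traps/10GB.gz;
}
```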
📎 Also check out this Hacker News discussion for community input on his approach.
Final Thoughts
You don’t need an enterprise WAF to defend your site.
With a bit of log inspection, some config hacks, and creative trolling like zip bombs, you can knock out the majority of disruptive bots.
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
⭐ Star it on GitHub: HexmosTech/git-lrc