Web scraping and aggressive crawling can quietly inflate your infrastructure costs—especially when your product is still in development and has few real users. Static assets (images, JavaScript bundles, CSS) are often the biggest cost drivers, not your API or database.
This article explains how to block and throttle scrapers using Nginx alone, without Cloudflare, AWS WAF, or paid services.
The setup applies to a common architecture:
- Frontend: Next.js (or any SPA)
- Backend: Django (or similar) via Gunicorn
- Reverse proxy: Nginx
- Hosting: Self-managed VM / EC2
We will cover:
- Where Nginx configuration files live
- How Nginx processes requests
- How to block known bots by User-Agent
- How to rate-limit static assets and APIs
- How to verify everything safely
1. Understanding Nginx Configuration Layout
On Ubuntu/Debian systems, Nginx uses a layered configuration structure.
Main config
/etc/nginx/nginx.conf
This file:
- Defines global settings
- Loads other configuration files
- Should rarely contain site-specific logic
Site configurations
/etc/nginx/sites-available/
/etc/nginx/sites-enabled/
- sites-available/: all defined virtual hosts
- sites-enabled/: symlinks to the active hosts (created as shown below)
- Nginx only loads what is in sites-enabled
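For reference, enabling a site is nothing more than creating that symlink and reloading. The file name example.com below is a placeholder for whatever your site file is actually called:

```bash
# Enable a site by symlinking it from sites-available into sites-enabled,
# then validate the config and reload Nginx.
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```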
Shared configuration snippets
/etc/nginx/conf.d/
Used for:
- Rate limits
- Maps
- Security rules
- Reusable logic
This article will add security rules here, so they apply consistently.
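Both directories are loaded automatically by the stock Ubuntu/Debian nginx.conf from inside its http {} block. If your main config has been customized, confirm that these include lines (or equivalents) are still present:

```nginx
http {
    ...
    # Load shared snippets and enabled sites
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
```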
2. Typical Architecture (Example)
We will use placeholders throughout:
| Component | Example |
|---|---|
| Frontend | https://example.com |
| API | https://api.example.com |
| Frontend port | 3000 |
| Backend | Gunicorn via Unix socket |
3. Step 1 – Disable Version Leakage (Optional but Recommended)
Edit:
/etc/nginx/nginx.conf
Inside the http {} block:
```nginx
http {
    server_tokens off;
    ...
}
```
This prevents Nginx from advertising its version in error pages and in the Server response header.
Test:
```bash
sudo nginx -t
sudo systemctl reload nginx
```
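As a quick check, you can inspect the Server header afterwards; with server_tokens off it should read just "nginx" with no version number. The browser-like -A value is there so the same command keeps working once the User-Agent blocking from Step 3 is in place:

```bash
# Print only the Server header of the response.
curl -sI -A "Mozilla/5.0" https://example.com | grep -i '^server:'
```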
4. Step 2 – Define Rate Limits (Shared Security File)
Create a shared file:
/etc/nginx/conf.d/security.conf
Add:
```nginx
# General request limit per IP
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/s;

# Static asset scraping control
limit_req_zone $binary_remote_addr zone=static:10m rate=10r/s;

# API abuse protection
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;
```
These limits:
- Are tracked per client IP
- Sit well above what normal browsing generates, so real users are unaffected
- Significantly slow down bulk scrapers
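By default, Nginx answers throttled requests with 503. If you would rather return the more descriptive 429 Too Many Requests, you can optionally add one more line to the same file:

```nginx
# Optional: respond with 429 instead of the default 503 when a limit is hit
limit_req_status 429;
```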
5. Step 3 – Block Known Bad Bots by User-Agent
Still in:
/etc/nginx/conf.d/security.conf
Add:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|httpclient|chatgpt-user) 1;
}
```
This defines a variable $bad_bot that is 1 for any matching User-Agent; the server-level if blocks in the next steps use it to reject those requests before any location is processed.
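The pattern list is deliberately short. You can extend it with whatever crawlers actually show up in your access logs; the extra names below are just common examples, not a definitive list:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|httpclient|chatgpt-user) 1;
    # Additional crawlers frequently seen in access logs (adjust to taste)
    ~*(ahrefsbot|semrushbot|mj12bot|dotbot|petalbot) 1;
}
```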
6. Step 4 – Apply Blocking to the Frontend (HTTPS)
Edit your frontend HTTPS server:
/etc/nginx/sites-enabled/default
Example:
```nginx
server {
    listen 443 ssl;
    server_name example.com www.example.com;

    # Block known bots early
    if ($bad_bot) {
        return 403;
    }

    ssl_certificate     /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;

    # Protect static assets
    location ~* \.(js|css|png|jpg|jpeg|gif|svg|webp|ico)$ {
        limit_req zone=static burst=20 nodelay;
        expires 7d;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }

    # Default frontend routing
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }
}
```
What this achieves
- Blocks scripted crawlers immediately
- Throttles image & JS scraping: each IP gets 10 requests/second with a burst allowance of 20, and anything beyond that is rejected (503 by default)
- Keeps real users unaffected
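You may have noticed that the general zone from Step 2 has not been referenced yet. If you also want a blanket per-IP cap on everything the frontend serves, one option is to apply it in the catch-all location; the burst value below is just a starting point:

```nginx
# Optional: blanket per-IP limit for all frontend routes
location / {
    limit_req zone=general burst=60 nodelay;

    proxy_pass http://localhost:3000;
    proxy_set_header Host $host;
}
```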
7. Step 5 – Apply Protection to the API Server
Edit the API HTTPS server block:
/etc/nginx/sites-enabled/default
Example:
```nginx
server {
    listen 443 ssl;
    server_name api.example.com;

    # ssl_certificate / ssl_certificate_key go here, as in the frontend server

    # Block known bots
    if ($bad_bot) {
        return 403;
    }

    location / {
        limit_req zone=api burst=10 nodelay;

        proxy_pass http://unix:/run/gunicorn.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
This ensures:
- Direct API scraping is throttled
- Token brute-force attempts are slowed
- Authentication still works normally
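If your API exposes a login or token endpoint, you can give it an even tighter limit by adding a more specific location block alongside the catch-all one; it reuses the same api zone with a smaller burst. The path /api/token/ is a placeholder for whatever your auth route actually is:

```nginx
# Hypothetical auth route: tighter throttling than the general API limit
location /api/token/ {
    limit_req zone=api burst=3 nodelay;

    proxy_pass http://unix:/run/gunicorn.sock;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```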
8. Step 6 – Validate Configuration
Always test before reloading:

```bash
sudo nginx -t
```

Expected:

```
syntax is ok
test is successful
```

Reload safely:

```bash
sudo systemctl reload nginx
```
9. Step 7 – Test Blocking (Demo)
Test User-Agent blocking

```bash
curl -A "ChatGPT-User" https://example.com
```

Expected:

```
403 Forbidden
```

Note that plain curl with its default User-Agent is rejected too, since curl itself is on the blocklist; to simulate a real browser from the command line, pass a browser-like value with -A.
Test normal browser behavior
- Open site in Chrome / Firefox
- Pages load
- Images load
- No errors
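You can also spot-check the static rate limit from the command line. A browser-like User-Agent is required here so the requests are not caught by the bot block first; the asset path is a placeholder, so point it at a file your site actually serves:

```bash
# Fire 40 rapid requests at a static asset and print only the status codes.
# The first responses should be 200; once the burst allowance is exhausted,
# the rest should be rejected (503 by default, 429 if limit_req_status is set).
for i in $(seq 1 40); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -A "Mozilla/5.0" https://example.com/logo.png
done
```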
10. Why This Works (And Why It’s Cheap)
- Nginx blocks traffic before backend execution
- Static assets are rate-limited (largest bandwidth cost)
- Known bots are rejected immediately
- No external services required
- No recurring cost
This approach alone can reduce outbound traffic 30–60% on crawler-heavy sites.
11. What This Does NOT Replace
This setup does not replace:
- Advanced bot detection
- JavaScript challenges
- Global CDN caching
But for:
- Early-stage products
- Cost-sensitive deployments
- Self-managed infrastructure
…it is an excellent first line of defense.
12. Next Steps
Once this is in place, the natural next hardening step is:
- Fail2Ban to permanently ban repeat offenders based on logs
That will be covered in a follow-up post.
Conclusion
You don’t need Cloudflare or AWS WAF to significantly reduce scraping.
With a small, well-placed set of Nginx rules, you can:
- Block obvious bots
- Throttle aggressive crawlers
- Reduce bandwidth costs
- Protect your backend
All with tools you already have.