Ajit Kumar

Blocking Web Scrapers and Crawlers Using Nginx (Without Cloudflare or AWS WAF)

Web scraping and aggressive crawling can quietly inflate your infrastructure costs—especially when your product is still in development and has few real users. Static assets (images, JavaScript bundles, CSS) are often the biggest cost drivers, not your API or database.

This article explains how to block and throttle scrapers using Nginx alone, without Cloudflare, AWS WAF, or paid services.

The setup applies to a common architecture:

  • Frontend: Next.js (or any SPA)
  • Backend: Django (or similar) via Gunicorn
  • Reverse proxy: Nginx
  • Hosting: Self-managed VM / EC2

We will cover:

  1. Where Nginx configuration files live
  2. How Nginx processes requests
  3. How to block known bots by User-Agent
  4. How to rate-limit static assets and APIs
  5. How to verify everything safely

1. Understanding Nginx Configuration Layout

On Ubuntu/Debian systems, Nginx uses a layered configuration structure.

Main config

/etc/nginx/nginx.conf

This file:

  • Defines global settings
  • Loads other configuration files
  • Should rarely contain site-specific logic

Site configurations

/etc/nginx/sites-available/
/etc/nginx/sites-enabled/
  • sites-available/: all defined virtual hosts
  • sites-enabled/: symlinks to the active hosts
  • Nginx loads only what is in sites-enabled/ (see the example below)
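
For reference, a site is enabled simply by symlinking it from sites-available/ into sites-enabled/ and reloading; the file name example.com below is a placeholder:

# Enable a site defined in sites-available (file name is a placeholder)
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx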

Shared configuration snippets

/etc/nginx/conf.d/

Used for:

  • Rate limits
  • Maps
  • Security rules
  • Reusable logic

This article will add security rules here, so they apply consistently.
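
On a stock Ubuntu/Debian install, nginx.conf already includes both directories from inside the http {} block, which is why a file dropped into conf.d/ is picked up automatically. The relevant lines look roughly like this:

http {
    ...
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}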


2. Typical Architecture (Example)

We will use placeholders throughout:

  • Frontend: https://example.com
  • API: https://api.example.com
  • Frontend port: 3000
  • Backend: Gunicorn via a Unix socket

3. Step 1 – Disable Version Leakage (Optional but Recommended)

Edit:

/etc/nginx/nginx.conf

Inside the http {} block:

http {
    server_tokens off;
    ...
}

This prevents Nginx from advertising its version in the Server response header and on error pages.

Test:

sudo nginx -t
sudo systemctl reload nginx
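
After the reload, you can confirm the version is hidden by inspecting the response headers (example.com stands in for your own domain):

curl -I https://example.com | grep -i '^server'
# Expected: "Server: nginx" with no version number appended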

4. Step 2 – Define Rate Limits (Shared Security File)

Create a shared file:

/etc/nginx/conf.d/security.conf

Add:

# General request limit per IP
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/s;

# Static asset scraping control
limit_req_zone $binary_remote_addr zone=static:10m rate=10r/s;

# API abuse protection
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;

These limits:

  • Apply per client IP
  • Are generous enough that normal users should not notice them
  • Significantly slow down scrapers
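
Optionally, you can also choose which status code Nginx returns when a limit is exceeded; the default is 503, and 429 ("Too Many Requests") is a common alternative. If you want this, it goes in the same security.conf file:

# Optional: return 429 instead of the default 503 for rate-limited requests
limit_req_status 429;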

5. Step 3 – Block Known Bad Bots by User-Agent

Still in:

/etc/nginx/conf.d/security.conf

Add:

map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|httpclient|chatgpt-user) 1;
}

This defines a variable $bad_bot, computed from the request's User-Agent header, which can be checked in any server block before the request is routed further.
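
The pattern list is easy to extend as new abusive agents show up in your access logs. For example, you could add a second regex line inside the same map block; the names below are illustrative, not a vetted list:

    # Example additions - adjust to what you actually see in your logs
    ~*(go-http-client|libwww-perl|ahrefsbot|semrushbot) 1;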


6. Step 4 – Apply Blocking to the Frontend (HTTPS)

Edit your frontend HTTPS server:

/etc/nginx/sites-enabled/default

Example:

server {
    listen 443 ssl;
    server_name example.com www.example.com;

    # Block known bots early
    if ($bad_bot) {
        return 403;
    }

    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;

    # Protect static assets
    location ~* \.(js|css|png|jpg|jpeg|gif|svg|webp|ico)$ {
        limit_req zone=static burst=20 nodelay;
        expires 7d;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }

    # Default frontend routing
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }
}

What this achieves

  • Blocks scripted crawlers immediately
  • Throttles image & JS scraping
  • Keeps real users unaffected
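
Note that the general zone from Step 2 is not applied in this example. If you also want a blanket per-IP ceiling on everything the frontend serves, one option (a sketch, not required for the setup above) is to add it to the catch-all location:

    location / {
        limit_req zone=general burst=60 nodelay;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }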

7. Step 5 – Apply Protection to the API Server

Edit the API HTTPS server block:

/etc/nginx/sites-enabled/default

Example:

server {
    listen 443 ssl;
    server_name api.example.com;

    # Block known bots
    if ($bad_bot) {
        return 403;
    }

    location / {
        limit_req zone=api burst=10 nodelay;

        proxy_pass http://unix:/run/gunicorn.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

This ensures:

  • Direct API scraping is throttled
  • Token brute-force attempts are slowed
  • Authentication still works normally
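
One refinement worth knowing about: if you have trusted clients (your own monitoring, an office IP) that should never be throttled, the usual Nginx pattern is to key the zone on a variable that is empty for exempt addresses, because requests with an empty key are not rate-limited. The sketch below would replace the api zone definition from Step 2; the 203.0.113.10 address is a placeholder:

# Exempt trusted IPs from the API limit (addresses are placeholders)
geo $api_limit {
    default 1;
    203.0.113.10 0;
}

map $api_limit $api_limit_key {
    0 "";
    1 $binary_remote_addr;
}

# Replaces the earlier "api" zone definition in security.conf
limit_req_zone $api_limit_key zone=api:10m rate=5r/s;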

8. Step 6 – Validate Configuration

Always test before reloading:

sudo nginx -t

Expected:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Reload safely:

sudo systemctl reload nginx
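
If a reload ever misbehaves despite a clean test, the service status and error log (Debian/Ubuntu default path) are the first places to look:

sudo systemctl status nginx
sudo tail -n 50 /var/log/nginx/error.log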

9. Step 7 – Test Blocking (Demo)

Test User-Agent blocking

curl -A "ChatGPT-User" https://example.com

Expected:

403 Forbidden
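
Test rate limiting (optional)

A quick way to see the static-asset limit kick in is to request one asset in a tight loop. The path /logo.png is a placeholder for any real file on your site, and the browser-like User-Agent is needed so the requests are not rejected by the bad-bot map first:

for i in $(seq 1 40); do
  curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" https://example.com/logo.png
done

Assuming the loop runs faster than 10 requests per second (it usually does against a nearby server), roughly the first 20 responses return 200 and the rest return 503, Nginx's default status for rate-limited requests.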

Test normal browser behavior

  • Open site in Chrome / Firefox
  • Pages load
  • Images load
  • No errors

10. Why This Works (And Why It’s Cheap)

  • Nginx blocks traffic before backend execution
  • Static assets are rate-limited (largest bandwidth cost)
  • Known bots are rejected immediately
  • No external services required
  • No recurring cost

This approach alone can reduce outbound traffic by 30–60% on crawler-heavy sites.


11. What This Does NOT Replace

This setup does not replace:

  • Advanced bot detection
  • JavaScript challenges
  • Global CDN caching

But for:

  • Early-stage products
  • Cost-sensitive deployments
  • Self-managed infrastructure

…it is an excellent first line of defense.


12. Next Steps

Once this is in place, the natural next hardening step is:

  • Fail2Ban to permanently ban repeat offenders based on logs

That will be covered in a follow-up post.


Conclusion

You don’t need Cloudflare or AWS WAF to significantly reduce scraping.
With a small, well-placed set of Nginx rules, you can:

  • Block obvious bots
  • Throttle aggressive crawlers
  • Reduce bandwidth costs
  • Protect your backend

All with tools you already have.
