Web scraping and aggressive crawling can quietly inflate your infrastructure costs—especially when your product is still in development and has few real users. Static assets (images, JavaScript bundles, CSS) are often the biggest cost drivers, not your API or database.
This article explains how to block and throttle scrapers using Nginx alone, without Cloudflare, AWS WAF, or paid services.
The setup applies to a common architecture:
- Frontend: Next.js (or any SPA)
- Backend: Django (or similar) via Gunicorn
- Reverse proxy: Nginx
- Hosting: Self-managed VM / EC2
We will cover:
- Where Nginx configuration files live
- How Nginx processes requests
- How to block known bots by User-Agent
- How to rate-limit static assets and APIs
- How to verify everything safely
1. Understanding Nginx Configuration Layout
On Ubuntu/Debian systems, Nginx uses a layered configuration structure.
Main config
/etc/nginx/nginx.conf
This file:
- Defines global settings
- Loads other configuration files
- Should rarely contain site-specific logic
Site configurations
/etc/nginx/sites-available/
/etc/nginx/sites-enabled/
- sites-available/: all defined virtual hosts
- sites-enabled/: symlinks to the active hosts (created as shown below)
- Nginx only loads what is in sites-enabled
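For reference, enabling a site is nothing more than creating that symlink and reloading. The file name example.com below is a placeholder for whatever your site file is actually called:

```bash
# Enable a site by symlinking it from sites-available into sites-enabled,
# then validate the config and reload Nginx.
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```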
Shared configuration snippets
/etc/nginx/conf.d/
Used for:
- Rate limits
- Maps
- Security rules
- Reusable logic
This article will add security rules here, so they apply consistently.
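Both directories are loaded automatically by the stock Ubuntu/Debian nginx.conf from inside its http {} block. If your main config has been customized, confirm that these include lines (or equivalents) are still present:

```nginx
http {
    ...
    # Load shared snippets and enabled sites
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
```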
2. Typical Architecture (Example)
We will use placeholders throughout:
| Component | Example |
|---|---|
| Frontend | https://example.com |
| API | https://api.example.com |
| Frontend port | 3000 |
| Backend | Gunicorn via Unix socket |
3. Step 1 – Disable Version Leakage (Optional but Recommended)
Edit:
/etc/nginx/nginx.conf
Inside the http {} block:
```nginx
http {
    server_tokens off;
    ...
}
```
This prevents Nginx from advertising its version in error pages and in the Server response header.
Test:
```bash
sudo nginx -t
sudo systemctl reload nginx
```
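As a quick check, you can inspect the Server header afterwards; with server_tokens off it should read just "nginx" with no version number. The browser-like -A value is there so the same command keeps working once the User-Agent blocking from Step 3 is in place:

```bash
# Print only the Server header of the response.
curl -sI -A "Mozilla/5.0" https://example.com | grep -i '^server:'
```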
4. Step 2 – Define Rate Limits (Shared Security File)
Create a shared file:
/etc/nginx/conf.d/security.conf
Add:
```nginx
# General request limit per IP
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/s;

# Static asset scraping control
limit_req_zone $binary_remote_addr zone=static:10m rate=10r/s;

# API abuse protection
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;
```
These limits:
- Are tracked per client IP
- Sit well above what normal browsing generates, so real users are unaffected
- Significantly slow down bulk scrapers
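By default, Nginx answers throttled requests with 503. If you would rather return the more descriptive 429 Too Many Requests, you can optionally add one more line to the same file:

```nginx
# Optional: respond with 429 instead of the default 503 when a limit is hit
limit_req_status 429;
```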
5. Step 3 – Block Known Bad Bots by User-Agent
Still in:
/etc/nginx/conf.d/security.conf
Add:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|httpclient|chatgpt-user) 1;
}
```
This defines a variable $bad_bot that is 1 for any matching User-Agent; the server-level if blocks in the next steps use it to reject those requests before any location is processed.
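The pattern list is deliberately short. You can extend it with whatever crawlers actually show up in your access logs; the extra names below are just common examples, not a definitive list:

```nginx
map $http_user_agent $bad_bot {
    default 0;
    ~*(curl|wget|python|scrapy|httpclient|chatgpt-user) 1;
    # Additional crawlers frequently seen in access logs (adjust to taste)
    ~*(ahrefsbot|semrushbot|mj12bot|dotbot|petalbot) 1;
}
```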
6. Step 4 – Apply Blocking to the Frontend (HTTPS)
Edit your frontend HTTPS server:
/etc/nginx/sites-enabled/default
Example:
```nginx
server {
    listen 443 ssl;
    server_name example.com www.example.com;

    # Block known bots early
    if ($bad_bot) {
        return 403;
    }

    ssl_certificate     /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;

    # Protect static assets
    location ~* \.(js|css|png|jpg|jpeg|gif|svg|webp|ico)$ {
        limit_req zone=static burst=20 nodelay;
        expires 7d;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }

    # Default frontend routing
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }
}
```
What this achieves
- Blocks scripted crawlers immediately
- Throttles image & JS scraping: each IP gets 10 requests/second with a burst allowance of 20, and anything beyond that is rejected (503 by default)
- Keeps real users unaffected
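You may have noticed that the general zone from Step 2 has not been referenced yet. If you also want a blanket per-IP cap on everything the frontend serves, one option is to apply it in the catch-all location; the burst value below is just a starting point:

```nginx
# Optional: blanket per-IP limit for all frontend routes
location / {
    limit_req zone=general burst=60 nodelay;

    proxy_pass http://localhost:3000;
    proxy_set_header Host $host;
}
```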
7. Step 5 – Apply Protection to the API Server
Edit the API HTTPS server block:
/etc/nginx/sites-enabled/default
Example:
```nginx
server {
    listen 443 ssl;
    server_name api.example.com;

    # ssl_certificate / ssl_certificate_key go here, as in the frontend server

    # Block known bots
    if ($bad_bot) {
        return 403;
    }

    location / {
        limit_req zone=api burst=10 nodelay;

        proxy_pass http://unix:/run/gunicorn.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
This ensures:
- Direct API scraping is throttled
- Token brute-force attempts are slowed
- Authentication still works normally
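If your API exposes a login or token endpoint, you can give it an even tighter limit by adding a more specific location block alongside the catch-all one; it reuses the same api zone with a smaller burst. The path /api/token/ is a placeholder for whatever your auth route actually is:

```nginx
# Hypothetical auth route: tighter throttling than the general API limit
location /api/token/ {
    limit_req zone=api burst=3 nodelay;

    proxy_pass http://unix:/run/gunicorn.sock;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```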
8. Step 6 – Validate Configuration
Always test before reloading:

```bash
sudo nginx -t
```

Expected:

```
syntax is ok
test is successful
```

Reload safely:

```bash
sudo systemctl reload nginx
```
9. Step 7 – Test Blocking (Demo)
Test User-Agent blocking

```bash
curl -A "ChatGPT-User" https://example.com
```

Expected:

```
403 Forbidden
```

Note that plain curl with its default User-Agent is rejected too, since curl itself is on the blocklist; to simulate a real browser from the command line, pass a browser-like value with -A.
Test normal browser behavior
- Open site in Chrome / Firefox
- Pages load
- Images load
- No errors
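You can also spot-check the static rate limit from the command line. A browser-like User-Agent is required here so the requests are not caught by the bot block first; the asset path is a placeholder, so point it at a file your site actually serves:

```bash
# Fire 40 rapid requests at a static asset and print only the status codes.
# The first responses should be 200; once the burst allowance is exhausted,
# the rest should be rejected (503 by default, 429 if limit_req_status is set).
for i in $(seq 1 40); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -A "Mozilla/5.0" https://example.com/logo.png
done
```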
10. Why This Works (And Why It’s Cheap)
- Nginx blocks traffic before backend execution
- Static assets are rate-limited (largest bandwidth cost)
- Known bots are rejected immediately
- No external services required
- No recurring cost
This approach alone can reduce outbound traffic 30–60% on crawler-heavy sites.
11. What This Does NOT Replace
This setup does not replace:
- Advanced bot detection
- JavaScript challenges
- Global CDN caching
But for:
- Early-stage products
- Cost-sensitive deployments
- Self-managed infrastructure
…it is an excellent first line of defense.
12. Next Steps
Once this is in place, the natural next hardening step is:
- Fail2Ban to permanently ban repeat offenders based on logs
That will be covered in a follow-up post.
Conclusion
You don’t need Cloudflare or AWS WAF to significantly reduce scraping.
With a small, well-placed set of Nginx rules, you can:
- Block obvious bots
- Throttle aggressive crawlers
- Reduce bandwidth costs
- Protect your backend
All with tools you already have.