A few months ago I got tired of operating on blind confidence. My server had Cloudflare in front of it, rate limiting configured, fail2ban running — all the standard stuff. I'd set it up following guides, felt like I knew what I was doing, and proceeded to never actually verify any of it worked.
Then a colleague asked me a simple question: when did you last test this under real load? Not synthetic one-machine benchmarks — actual load, from multiple sources, the kind of traffic that looks like a real flood?
I had no good answer. So I went and found out.
The testing part took an afternoon. The aftermath took two weeks and left me with more questions than answers.
The Setup I Was "Confident" About
Let me describe what I was running, because I think it's a pretty typical production setup:
- VPS on DigitalOcean, nginx as the web server
- Cloudflare proxy in front, orange cloud enabled
- Rate limiting rules in nginx
- fail2ban watching logs and banning repeat offenders
- Standard WAF ruleset
- Redis caching on the busy pages
On paper: reasonable. In practice: never stress tested under anything resembling a real attack. I had configuration. I did not have evidence.
Why a Service, Not a DIY Tool
My first thought was to just spin up a cloud VM and hammer my own server with ab or wrk.
The problem: that's one IP. All the load coming from a single source is not what real floods look like, and it immediately trips your per-IP rate limiting. You end up testing whether you can block one very angry user, which is a much easier problem than the actual one.
I ended up using floodlab.cx because they do distributed testing — traffic from multiple source IPs, configurable patterns, and crucially, the ability to simulate traffic that looks like real users rather than obvious bot cruft. That last part was specifically what I wanted to see.
Setup was simple enough. Pointed it at my domain, set the parameters, started the test.
Watching My Server Die in Real Time
I had a coffee, opened my monitoring dashboard, and watched the test run.
The first few minutes were anticlimactic. Traffic ramping up, response times normal, server not breaking a sweat. I started to feel like maybe I'd been worrying about nothing.
Then things got interesting.
Around the 5-minute mark, response times started climbing. Not dramatically — just enough to notice. CPU ticking up. Database connections getting busier.
By 8,000 req/s, the site was noticeably sluggish. By 12,000, the database connection pool was exhausted and the site was returning errors to everyone. Health checks passing, nginx running, server not crashed — but every request failing because there was nothing left to serve it with.
Four minutes from "feels a bit slow" to "functionally offline."
Here's what made it worse: my monitoring dashboard was green the entire time. CPU elevated but not critical. Memory fine. No firewall alerts. Cloudflare reporting normal operations. From the perspective of every metric I was watching, nothing was wrong. Just... a lot of traffic, all of it looking completely legitimate.
The only thing I wasn't watching was whether users were actually getting responses. They weren't.
Two Weeks of Trying to Fix It
This is the part I'm a bit embarrassed about.
I went into this thinking: find the weak spots, tune the config, done. What I found out is that certain types of floods are genuinely hard to defend against without degrading the experience for real users — because what makes them hard to block is exactly what makes them hard to distinguish from real traffic.
I won't go into the mechanics of how these attacks work, because that's not really the point. The point is: from my server's perspective, it was just getting a lot of visitors. Lots of different people, browsing normally. Nothing obviously malicious to catch.
So every defensive approach I tried ran into the same wall:
Tightening rate limits — either the limits were high enough that the flood stayed under them, or they were low enough that real users also started hitting them during busy periods.
WAF rules — there was nothing anomalous to write a rule against. The requests looked fine.
fail2ban tuning — same problem. No individual source was misbehaving badly enough to trigger a ban. The issue was aggregate volume, not any single bad actor.
Cloudflare configuration — I went through everything. Bot fight mode, various security level settings, custom rules. Each test run came back with roughly the same result.
I re-ran the floodlab test after each attempt. The numbers barely moved.
The One Thing That Actually Stopped It
After two weeks, the only configuration change that reliably stopped the flood was enabling Cloudflare's "Under Attack Mode" — a challenge page that every visitor has to pass before reaching the site.
It worked. Traffic stopped reaching my server.
It also:
- Added a noticeable delay to every first page load for real users
- Broke my mobile app's API calls entirely
- Broke all my monitoring and uptime checks
- Broke third-party webhooks that POST to my backend
- Is explicitly designed for use during an active attack, not as a permanent setting
So my "solution" was to make my site worse for everyone, permanently, to defend against an attack that wasn't currently happening. The moment I turn it off, I'm back to square one.
That felt less like a fix and more like a different kind of self-inflicted problem.
What Actually Helped (Without Breaking Everything)
Not all was lost. Some changes genuinely moved the needle:
Aggressive caching on read-heavy pages. If the response comes from cache, the database never gets involved. This pushed my breaking point from around 12,000 req/s to closer to 18,000. Not a solution, but meaningful headroom.
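The shape of that change, as an nginx sketch — the cache path, zone name, TTLs, and upstream port here are placeholders for whatever your app actually uses, not my exact config:

```nginx
# Assumes nginx proxies to an app server on 127.0.0.1:8000 (placeholder).
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:10m
                 max_size=1g inactive=10m use_temp_path=off;

server {
    location / {
        proxy_cache pages;
        proxy_cache_valid 200 60s;            # even a short TTL absorbs a flood
        proxy_cache_use_stale error timeout updating;
        proxy_cache_lock on;                  # collapse concurrent cache misses
        proxy_pass http://127.0.0.1:8000;
    }
}
```

The `proxy_cache_lock` line matters under flood conditions: without it, a thousand simultaneous misses for the same page all hit the backend at once.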
Per-endpoint rate limiting instead of global. I had one rate limit for everything. Some endpoints are cheap; some are expensive. Treating them the same means your expensive endpoints are significantly under-protected.
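In nginx terms, that means separate `limit_req` zones per class of endpoint instead of one global zone. A sketch — the rates, zone names, and paths are illustrative:

```nginx
# One zone per endpoint class, both keyed on client IP.
limit_req_zone $binary_remote_addr zone=cheap:10m  rate=30r/s;
limit_req_zone $binary_remote_addr zone=costly:10m rate=2r/s;

server {
    # Cheap, cacheable pages: generous limit.
    location / {
        limit_req zone=cheap burst=60 nodelay;
    }
    # Expensive endpoints (search, reports): much tighter limit.
    location /search {
        limit_req zone=costly burst=5 nodelay;
    }
}
```

The point is that a limit safe for your homepage is far too generous for an endpoint that runs a five-table join.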
Database connection pool sized for peak, not average. Mine was tuned for normal traffic. Under load it ran out almost immediately. Increasing it gave better survivability during spikes, even if it didn't stop the degradation entirely.
Actually monitoring user-facing error rates. The big embarrassing discovery: I was watching server metrics, not user outcomes. I added synthetic checks that make real HTTP requests and verify they get successful responses. Now if the site is returning errors, something actually alerts.
What I Took Away From All This
"Config exists" and "config works" are different things. I had protection configured. I'd never verified it actually protected against anything realistic. These are not the same state.
Cloudflare is not a guarantee. It's genuinely useful, but if your origin server falls over on its own before Cloudflare steps in, that's still downtime. The proxy doesn't replace a resilient backend.
Monitoring the wrong things means you get no warning. I had decent monitoring that told me nothing useful during the test. Adding user-facing synthetic checks was the highest-value change I made.
Some problems don't have cheap clean solutions. Two weeks of work and the honest answer is: defending against sophisticated floods at a serious scale, without breaking the site for real users, requires either significant engineering investment or enterprise-level tooling. There's no nginx config that makes this go away.
Then Floodlab Actually Helped Me Fix It
After a week of going in circles on my own, I reached out to the floodlab team. I wasn't really expecting much; I figured they were a testing service, not a consultancy.
Turned out to be the most useful conversation I had throughout this whole process.
They looked at the test results with me and walked me through what was actually happening at each stage of the degradation. Not generic advice — specific, based on the actual numbers from my tests. Where the bottleneck was, why certain mitigations weren't working, what the realistic options were given my stack.
The short version of what they helped me put together:
First, properly structured rate limiting across multiple dimensions — not just per-IP, but combining source, endpoint, and request frequency in a way that raises the collective cost of a flood without tripping on normal user behavior. The implementation details took some tuning, but the logic they explained made sense of why my previous attempts kept failing.
Second, a tiered response configuration in Cloudflare that starts mild and escalates automatically based on error rates — so I'm not choosing between "do nothing" and "Under Attack Mode breaks everything." There are intermediate steps that add friction progressively.
Third, some caching and connection pooling changes specific to my app's traffic patterns that gave the backend meaningful extra headroom before hitting the database wall.
I re-ran the test after implementing all of it. Not magically bulletproof — I don't think that exists at my scale without serious spend — but the site stayed functional at load levels that had previously killed it within minutes. The degradation was gradual rather than a cliff. And when things did get slow, my monitoring actually fired alerts this time.
For what it's worth: I didn't expect a testing service to be the ones who helped me get the protection working. But they clearly know what the attacks look like from the inside, which turns out to be pretty useful context when you're trying to defend against them.
Was the Test Worth Doing?
Absolutely — and I'd recommend it to anyone running production infrastructure who hasn't done this.
Floodlab.cx gave me a clear, realistic picture of where my server actually breaks, under conditions that resemble real attacks far more than any single-machine load test does. The dashboard is clear, the reporting is detailed, and watching everything fall over in real time while your monitoring stays green is exactly the kind of uncomfortable lesson that actually changes how you think about your setup.
I went in with misplaced confidence and came out with a more accurate picture of where I stood. That's worth something, even when what you learn isn't flattering.
My server is more resilient than it was. Just as importantly, I now actually know how resilient it is — verified, not assumed. Both things feel good.
If you're at the "my protection is probably fine but I've never actually tested it" stage — that's where I was. Running the test is the uncomfortable but necessary first step. And if you end up where I was after the test, scratching your head at the results — reaching out to the floodlab.cx team directly was genuinely worth it.