Carrie

Posted on Oct 16

The 6 Best Web Application Firewalls Compared (2024)

#opensource #waf #cybersecurity

I recently had the chance to recommend security products to several customers, and a web application firewall (WAF) was one of them.

Attack protection is the core capability of a WAF, and this article shows how to test the effectiveness of that protection in a systematic, repeatable way.

To ensure the fairness and impartiality of the test results, all target machines, testing tools, and test samples mentioned in this article are open-source projects.

Testing Metrics

The test results are mainly based on four indicators:

Detection Rate: Reflects how comprehensive the WAF's detection is. An attack that goes undetected is a "false negative."
False Positive Rate: Reflects how much the WAF interferes with normal traffic. A normal request that gets intercepted is a "false positive."
Accuracy Rate: A combined indicator covering both detection and false positives, so a WAF cannot score well by trading missed attacks against false alarms.
Detection Latency: Reflects the performance of the WAF; longer latency means poorer performance.

Detection latency can be measured directly by the testing tool. The other three indicators are computed from the four outcome counts used in binary classification statistics:

TP: The number of attack samples intercepted.
TN: The number of normal samples correctly allowed through.
FN: The number of attack samples allowed through, i.e., the number of "false negatives."
FP: The number of normal requests intercepted, i.e., the number of "false positives."

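These counts give the three indicators above:

Detection Rate = TP / (TP + FN)
False Positive Rate = FP / (TP + FP)
Accuracy Rate = (TP + TN) / (TP + TN + FP + FN)

Note that the False Positive Rate as defined here is the share of intercepted requests that were actually normal (what statistics usually calls the false discovery rate), not the more common FP / (FP + TN).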
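As a minimal sketch, the three formulas can be written as small Python functions. The `Counts` class and the function names are illustrative and not part of the original tooling; the sample counts are the SafeLine WAF figures reported later in this article.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # attack samples intercepted
    tn: int  # normal samples allowed through
    fp: int  # normal samples intercepted
    fn: int  # attack samples allowed through


def detection_rate(c: Counts) -> float:
    # Share of attack samples that the WAF intercepted.
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Counts) -> float:
    # As defined in this article: FP / (TP + FP),
    # the share of intercepted requests that were normal.
    return c.fp / (c.tp + c.fp)


def accuracy_rate(c: Counts) -> float:
    # Share of all samples that the WAF handled correctly.
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


if __name__ == "__main__":
    safeline = Counts(tp=426, tn=33056, fp=38, fn=149)
    print(f"Detection Rate: {detection_rate(safeline):.2%}")            # 74.09%
    print(f"False Positive Rate: {false_positive_rate(safeline):.2%}")  # 8.19%
    print(f"Accuracy Rate: {accuracy_rate(safeline):.2%}")              # 99.44%
```

The functions do not guard against empty sample sets; a run with no attack samples or no intercepted requests would divide by zero.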

To reduce the influence of random outliers, detection latency is reported as two indicators: "90% average latency" and "99% average latency."
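The article does not spell out how these two indicators are computed. One plausible reading, shown here purely as an assumption, is a trimmed mean: sort the per-request latencies, drop the slowest 10% (or 1%), and average the rest. The function name `trimmed_average` is hypothetical.

```python
def trimmed_average(latencies_ms: list[float], keep_fraction: float) -> float:
    """Average the fastest `keep_fraction` of the measurements.

    ASSUMPTION: this is one possible interpretation of "90% average latency"
    (keep_fraction=0.90) and "99% average latency" (keep_fraction=0.99).
    """
    if not latencies_ms:
        raise ValueError("no latency measurements")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(latencies_ms)
    # Keep at least one measurement so the average is always defined.
    keep = max(1, int(round(len(ordered) * keep_fraction)))
    kept = ordered[:keep]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # 99 fast requests and one slow outlier.
    samples = [1.0] * 99 + [1000.0]
    print(trimmed_average(samples, 0.90))  # outlier excluded
    print(trimmed_average(samples, 0.99))  # outlier excluded
    print(trimmed_average(samples, 1.00))  # outlier included
```

Under this reading, a single very slow request moves the plain average a lot but leaves both trimmed figures untouched, which matches the stated goal of reducing the effect of randomness.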

Test Samples

Data Source: All test data comes from my own browser.

Capture method: Run Burp as a proxy, route all browser traffic through it, export the captured traffic as an XML file, and then use a Python script to split it into individual requests.
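The splitting step might look like the sketch below. It assumes Burp's XML export layout, in which each `<item>` element holds a `<request>` element whose content is base64-encoded when its `base64` attribute is `true`. The function name `extract_requests` is illustrative, and this is not the author's actual script.

```python
import base64
import xml.etree.ElementTree as ET


def extract_requests(xml_text: str) -> list[bytes]:
    """Return the raw HTTP request bytes of every <item> in a Burp XML export."""
    requests = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        node = item.find("request")
        if node is None or node.text is None:
            continue  # skip items that carry no request body
        if node.get("base64") == "true":
            raw = base64.b64decode(node.text)
        else:
            raw = node.text.encode("utf-8")
        requests.append(raw)
    return requests


if __name__ == "__main__":
    # Tiny hand-made export with a single request, for illustration only.
    sample = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
    xml_text = (
        '<items><item><request base64="true">'
        + base64.b64encode(sample).decode("ascii")
        + "</request></item></items>"
    )
    for raw in extract_requests(xml_text):
        print(raw.decode("latin-1"))
```

Exporting with base64 encoding enabled is the safer choice, because raw requests can contain bytes that are not valid XML text.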

Based on past experience, the ratio of normal traffic to attack traffic for services exposed on the Internet is usually around 100:1, so the samples are collected at roughly this ratio.

White Samples (normal traffic): I browsed various forums and accumulated a total of 60,707 HTTP requests, 2.7 GB in all (this process took me 5 hours).

Black Samples (attack traffic): To make the test results more comprehensive, I collected black samples using four different methods, totaling 600 HTTP requests (this process took me 5 hours).

Simple common attack traffic: Deploy the DVWA target machine and attack each of its common vulnerabilities one by one.
Common attack traffic: Use all the attack payloads provided on the official PortSwigger website.
Targeted vulnerability traffic: Deploy VulHub target machines and attack each classic vulnerability with its default PoC.
Adversarial attack traffic: Raise DVWA's security level and attack it again at the medium and high levels.

Testing Method

With the indicators and samples defined, we need three things: the WAF, a target machine to receive traffic, and a testing tool.

Every WAF runs with its initial configuration, without any adjustments.

The target machine runs Nginx and returns 200 for every request it receives, with the following configuration:

location / {
    return 200 'hello WAF!';
    default_type text/plain;
}

The testing tool must do the following:

Parse Burp's export file.
Reassemble each request according to the HTTP protocol.
Delete the Cookie header, since the data will be open-sourced later.
Rewrite the Host header so that the target machine can receive the traffic normally.
Judge whether the WAF intercepted a request by whether the request returns 200.
Mix the black and white samples, then send the requests evenly.
Automatically calculate the testing indicators above.
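A few of these requirements can be sketched in Python as follows. All function names are hypothetical, the shuffle in `mix_samples` is only one assumed way of spreading attack samples through the normal traffic, and the actual sending of requests is left out; this is not the author's tool.

```python
import random


def prepare_request(raw: bytes, target_host: str) -> bytes:
    """Drop the Cookie header and point the Host header at the target machine."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    request_line, headers = lines[0], lines[1:]
    kept = []
    for line in headers:
        name = line.split(b":", 1)[0].strip().lower()
        if name in (b"cookie", b"host"):
            continue  # Cookie is removed, Host is replaced below
        kept.append(line)
    kept.insert(0, b"Host: " + target_host.encode("ascii"))
    return b"\r\n".join([request_line] + kept) + b"\r\n\r\n" + body


def classify(is_attack: bool, status: int) -> str:
    """Map one response to TP/FN/FP/TN; any status other than 200 counts as intercepted."""
    blocked = status != 200
    if is_attack:
        return "TP" if blocked else "FN"
    return "FP" if blocked else "TN"


def mix_samples(white: list[bytes], black: list[bytes], seed: int = 0):
    """Label and shuffle the samples (ASSUMPTION: a seeded shuffle is used for mixing)."""
    labelled = [(r, False) for r in white] + [(r, True) for r in black]
    random.Random(seed).shuffle(labelled)
    return labelled


if __name__ == "__main__":
    raw = b"GET /a HTTP/1.1\r\nHost: forum.example\r\nCookie: sid=1\r\n\r\n"
    print(prepare_request(raw, "127.0.0.1"))
    print(classify(is_attack=True, status=403))   # TP
    print(classify(is_attack=False, status=200))  # TN
```

Because the target machine answers every request with 200, any other status code can only come from the WAF, which is what makes the status check a usable interception signal.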

Start Testing

SafeLine WAF


TP: 426    TN: 33056    FP: 38    FN: 149
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 74.09%
False Positive Rate: 8.19%
Accuracy Rate: 99.44%

90% Average Latency: 0.73 milliseconds
99% Average Latency: 0.89 milliseconds

Coraza


TP: 404    TN: 27912    FP: 5182    FN: 171
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 70.26%
False Positive Rate: 92.77%
Accuracy Rate: 84.10%

90% Average Latency: 3.09 milliseconds
99% Average Latency: 5.10 milliseconds

ModSecurity


TP: 400    TN: 25713    FP: 7381    FN: 175
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 69.57%
False Positive Rate: 94.86%
Accuracy Rate: 77.56%

90% Average Latency: 1.36 milliseconds
99% Average Latency: 1.71 milliseconds

Baota WAF


TP: 224    TN: 32998    FP: 96    FN: 351
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 38.96%
False Positive Rate: 30.00%
Accuracy Rate: 98.67%

90% Average Latency: 0.53 milliseconds
99% Average Latency: 0.66 milliseconds

nginx-lua-waf


TP: 213    TN: 32619    FP: 475    FN: 362
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 37.04%
False Positive Rate: 69.04%
Accuracy Rate: 97.51%

90% Average Latency: 0.41 milliseconds
99% Average Latency: 0.49 milliseconds

SuperWAF


TP: 138    TN: 33048    FP: 46    FN: 437
Total sample size: 33669    Success: 33669    Errors: 0
Detection Rate: 24.00%
False Positive Rate: 25.00%
Accuracy Rate: 98.57%

90% Average Latency: 0.34 milliseconds
99% Average Latency: 0.41 milliseconds

Comparison Table

WAF	False Negatives	False Positives	Accuracy Rate	90% Average Latency
SafeLine WAF	149	38	99.44%	0.73 ms
Coraza	171	5182	84.10%	3.09 ms
ModSecurity	175	7381	77.56%	1.36 ms
Baota WAF	351	96	98.67%	0.53 ms
nginx-lua-waf	362	475	97.51%	0.41 ms
SuperWAF	437	46	98.57%	0.34 ms

Conclusion

The SafeLine WAF has the best overall performance, with the fewest false positives and false negatives.

Coraza and ModSecurity have a high detection rate, but they are not adapted to real-world scenarios, resulting in too many false positives.

Different test samples and testing methods can lead to significantly different results, so choose samples and methods that match your actual conditions when running your own tests.

The results of this test are for reference only and should not be used as the sole criterion for evaluating products, tools, algorithms, or models.
