Recently, I had the opportunity to recommend some useful security products to several customers, with WAF being one of them.
Attack protection is the core capability of a WAF, and this article will introduce how to scientifically test the effectiveness of WAF protection.
To ensure the fairness and impartiality of the test results, all target machines, testing tools, and test samples mentioned in this article are open-source projects.
Testing Metrics
The test results are mainly based on four indicators:
- Detection Rate: Reflects the comprehensiveness of the WAF's detection capabilities. Failing to detect is referred to as a "false negative."
- False Positive Rate: Reflects the interference with normal traffic. Intercepting a legitimate request is referred to as a "false positive."
- Accuracy Rate: A combined indicator that counts both missed attacks and false alarms, so a WAF cannot score well by trading one for the other.
- Detection Latency: Reflects the performance of the WAF, with longer latency indicating poorer performance.
Detection latency can be measured directly by the testing tool. The other three indicators are calculated from the four outcome counts used for binary classification in statistics:
- TP: The number of attack samples intercepted.
- TN: The number of normal samples correctly allowed through.
- FN: The number of attack samples allowed through, i.e., the number of "false negatives."
- FP: The number of normal requests intercepted, i.e., the number of "false positives."
These formulas can be used to calculate the above three indicators:
- Detection Rate = TP / (TP + FN)
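- False Positive Rate = FP / (TP + FP), i.e., the share of intercepted requests that were actually normal. This differs from the textbook false positive rate, FP / (FP + TN).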
- Accuracy Rate = (TP + TN) / (TP + TN + FP + FN)
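The three formulas translate directly into code. The following is a minimal sketch rather than the tool used in this test; the `Counts` class and the function names are illustrative assumptions. The example figures are taken from the first result block under "Start Testing."

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for one WAF run (hypothetical helper type)."""
    tp: int  # attack samples intercepted
    tn: int  # normal samples correctly allowed through
    fp: int  # normal samples intercepted
    fn: int  # attack samples allowed through


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def detection_rate(c: Counts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def false_positive_rate(c: Counts) -> float:
    # Uses the article's definition: FP / (TP + FP).
    return _ratio(c.fp, c.tp + c.fp)


def accuracy_rate(c: Counts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


if __name__ == "__main__":
    c = Counts(tp=426, tn=33056, fp=38, fn=149)
    print(f"Detection Rate: {detection_rate(c):.2%}")            # 74.09%
    print(f"False Positive Rate: {false_positive_rate(c):.2%}")  # 8.19%
    print(f"Accuracy Rate: {accuracy_rate(c):.2%}")              # 99.44%
```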
To reduce the impact of random outliers and minimize measurement error, "detection latency" is reported as two indicators: "90% average latency" and "99% average latency."
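The article does not spell out how these two latency figures are computed. One plausible reading, and it is only an assumption, is a trimmed mean: sort the per-request latencies, keep the fastest 90% (or 99%), and average them. The function name `trimmed_mean_latency` below is hypothetical.

```python
def trimmed_mean_latency(latencies_ms: list[float], percent: int) -> float:
    """Average the fastest `percent`% of samples (assumed definition)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(latencies_ms)
    # Integer arithmetic avoids floating-point surprises when sizing the slice.
    keep = max(1, len(ordered) * percent // 100)
    kept = ordered[:keep]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Nine fast requests and one slow outlier.
    samples = [1.0] * 9 + [100.0]
    print(trimmed_mean_latency(samples, 90))   # outlier discarded
    print(trimmed_mean_latency(samples, 100))  # outlier included
```

Under this reading, a single slow request caused by network jitter does not dominate the reported figure.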
Test Samples
Data Source: All test data comes from my own browser.
Packet capturing method: Use Burp as a proxy, route all browser traffic through Burp, export the captured items as an XML file, and then use a Python script to split the file into individual requests.
Based on past experience, the ratio of normal traffic to attack traffic for services exposed on the Internet is usually around 100:1, so the samples are mixed at roughly this ratio.
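The processing script itself is not shown in this article. A minimal sketch of the splitting step might look like the following, assuming the export uses `<item>` elements that each contain a `<request>` element, optionally flagged with `base64="true"`, as Burp's "Save items" export typically does. The function name `extract_requests` is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def extract_requests(xml_text: str) -> list[bytes]:
    """Return the raw HTTP requests found in a Burp XML export."""
    root = ET.fromstring(xml_text)
    requests = []
    for item in root.iter("item"):
        node = item.find("request")
        if node is None or node.text is None:
            continue  # skip items without a captured request
        if node.get("base64") == "true":
            raw = base64.b64decode(node.text)
        else:
            raw = node.text.encode("utf-8")
        requests.append(raw)
    return requests


if __name__ == "__main__":
    encoded = base64.b64encode(b"GET / HTTP/1.1\r\nHost: example.test\r\n\r\n").decode()
    sample = f'<items><item><request base64="true">{encoded}</request></item></items>'
    for raw in extract_requests(sample):
        print(raw.decode())
```

For a real export, the file contents can be read from disk and passed to the same function.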
White Samples: Browsing various forums, a total of 60,707 HTTP requests were accumulated, with a total size of 2.7 GB (this process took me 5 hours).
Black Samples: To make the test results more comprehensive, I collected black samples using four different methods, totaling 600 HTTP requests (this process took me 5 hours).
- Simple common attack traffic: Deploy the DVWA target machine and attack each of its common vulnerabilities in turn.
- Common attack traffic: Use all attack payloads provided on the official PortSwigger website.
- Targeted vulnerability traffic: Deploy the VulHub target machine and attack each classic vulnerability in turn with its default PoC.
- Adversarial attack traffic: Raise the security level of DVWA and attack it again under medium and high protection.
Testing Method
After clarifying the testing indicators and samples, we now need three things: a WAF, a target machine to receive traffic, and a testing tool.
All WAFs use the initial configuration without any adjustments.
The target machine uses Nginx, which directly returns 200 for any request received, with the following configuration:
```nginx
location / {
    # Answer every request with 200, so any other status indicates WAF interception
    return 200 'hello WAF!';
    default_type text/plain;
}
```
The testing tool requirements are as follows:
- Parse the export results of Burp.
- Repackage according to the HTTP protocol.
- Considering that the data will be open-sourced later, delete the Cookie Header.
- Modify the Host Header field to allow the target machine to receive traffic normally.
- Determine whether it was intercepted by the WAF based on whether the request returns 200.
- Send packets evenly after mixing black and white samples.
- Automatically calculate the above "testing indicators."
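Several of these requirements can be sketched without any network code. The example below is an illustration under stated assumptions, not the actual testing tool: the function names `prepare_request`, `mix_evenly`, and `classify` are hypothetical, requests are assumed to be HTTP/1.x with CRLF line endings, and sending the packets is left out.

```python
def prepare_request(raw: bytes, target_host: str) -> bytes:
    """Drop the Cookie header and point the Host header at the target."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    request_line, headers = lines[0], lines[1:]
    kept = []
    for line in headers:
        name = line.split(b":", 1)[0].strip().lower()
        if name == b"cookie":
            continue  # remove private data before the samples are published
        if name == b"host":
            line = b"Host: " + target_host.encode("ascii")
        kept.append(line)
    return b"\r\n".join([request_line, *kept]) + sep + body


def mix_evenly(white: list, black: list) -> list:
    """Interleave samples so attacks are spread evenly through normal traffic.

    Returns (is_attack, sample) pairs.
    """
    tagged = [((i + 0.5) / len(white), False, s) for i, s in enumerate(white)]
    tagged += [((i + 0.5) / len(black), True, s) for i, s in enumerate(black)]
    tagged.sort(key=lambda t: t[0])
    return [(is_attack, sample) for _, is_attack, sample in tagged]


def classify(is_attack: bool, status: int) -> str:
    """Map one response to TP/FN/FP/TN; any status other than 200 counts as intercepted."""
    intercepted = status != 200
    if is_attack:
        return "TP" if intercepted else "FN"
    return "FP" if intercepted else "TN"


if __name__ == "__main__":
    raw = b"GET /a HTTP/1.1\r\nHost: forum.example\r\nCookie: sid=1\r\nAccept: */*\r\n\r\n"
    print(prepare_request(raw, "127.0.0.1:8080"))
    print(mix_evenly(["w1", "w2", "w3", "w4"], ["b1"]))
    print(classify(True, 403), classify(False, 200))
```

Counting the labels returned by `classify` over the whole run yields the TP, TN, FP, and FN totals from which the indicators are calculated.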
Start Testing
SafeLine WAF
TP: 426 TN: 33056 FP: 38 FN: 149
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 74.09%
False Positive Rate: 8.19%
Accuracy Rate: 99.44%
90% Average Latency: 0.73 milliseconds
99% Average Latency: 0.89 milliseconds
Coraza
TP: 404 TN: 27912 FP: 5182 FN: 171
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 70.26%
False Positive Rate: 92.77%
Accuracy Rate: 84.10%
90% Average Latency: 3.09 milliseconds
99% Average Latency: 5.10 milliseconds
ModSecurity
TP: 400 TN: 25713 FP: 7381 FN: 175
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 69.57%
False Positive Rate: 94.86%
Accuracy Rate: 77.56%
90% Average Latency: 1.36 milliseconds
99% Average Latency: 1.71 milliseconds
Baota WAF
TP: 224 TN: 32998 FP: 96 FN: 351
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 38.96%
False Positive Rate: 30.00%
Accuracy Rate: 98.67%
90% Average Latency: 0.53 milliseconds
99% Average Latency: 0.66 milliseconds
ngx-lua-waf
TP: 213 TN: 32619 FP: 475 FN: 362
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 37.04%
False Positive Rate: 69.04%
Accuracy Rate: 97.51%
90% Average Latency: 0.41 milliseconds
99% Average Latency: 0.49 milliseconds
SuperWAF
TP: 138 TN: 33048 FP: 46 FN: 437
Total sample size: 33669 Success: 33669 Errors: 0
Detection Rate: 24.00%
False Positive Rate: 25.00%
Accuracy Rate: 98.57%
90% Average Latency: 0.34 milliseconds
99% Average Latency: 0.41 milliseconds
Comparison Table
WAF | False Negatives | False Positives | Accuracy Rate | 90% Average Latency |
---|---|---|---|---|
SafeLine WAF | 149 items | 38 items | 99.44% | 0.73 ms |
Coraza | 171 items | 5182 items | 84.10% | 3.09 ms |
ModSecurity | 175 items | 7381 items | 77.56% | 1.36 ms |
Baota WAF | 351 items | 96 items | 98.67% | 0.53 ms |
ngx-lua-waf | 362 items | 475 items | 97.51% | 0.41 ms |
SuperWAF | 437 items | 46 items | 98.57% | 0.34 ms |
Conclusion
The SafeLine WAF has the best overall performance, with the fewest false positives and false negatives.
Coraza and ModSecurity have high detection rates, but their default rules are not tuned for real-world traffic, resulting in too many false positives.
Different test samples and testing methods may lead to significant variations in the results, so samples and methods should be chosen to match the actual conditions of the environment being tested.
The results of this test are for reference only and should not be used as the sole criterion for evaluating products, tools, algorithms, or models.