Comparison of Six Self-Hosted WAFs

#cybersecurity #opensource #testing

Recently, I had the opportunity to recommend some useful security products to several clients, and a WAF (Web Application Firewall) was one of them. Attack protection is the core capability of a WAF, so in this article I will explain how to test the effectiveness of WAF protection in a systematic, repeatable way.

To ensure the fairness and impartiality of the test results, all target machines, testing tools, and test samples mentioned in this article are open-source projects.

Testing Metrics

The test results are primarily based on four indicators:

Detection Rate: Reflects the comprehensiveness of the WAF's detection capabilities. Failing to detect is referred to as a "false negative."
False Positive Rate: Reflects how much the WAF interferes with normal traffic. Blocking a legitimate request is referred to as a "false positive."
Accuracy Rate: A combined indicator that accounts for both false negatives and false positives, so a WAF cannot score well simply by trading one for the other.
Detection Latency: Reflects the performance of the WAF, with longer latency indicating poorer performance.

Detection latency can be measured directly by the testing tool, while the other indicators map onto the four outcomes of binary classification in statistics:

TP: The number of attack samples intercepted.
TN: The number of normal samples correctly allowed through.
FN: The number of attack samples allowed through, i.e., the number of "false negatives."
FP: The number of normal requests intercepted, i.e., the number of "false positives."

The following formulas are used to calculate the first three indicators:

Detection Rate = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
Accuracy Rate = (TP + TN) / (TP + TN + FP + FN)
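
As a minimal sketch, the three rates can be computed from the four counts as shown below. The class and function names are my own illustration, not part of any testing tool, and the counts in the example are made up rather than taken from the test results. Note that this sketch takes the false positive rate over normal samples, FP / (FP + TN), which is the usual statistical definition and an assumption on my part.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """The four outcome counts described above (hypothetical container)."""
    tp: int  # attack samples intercepted
    tn: int  # normal samples correctly allowed through
    fn: int  # attack samples allowed through (false negatives)
    fp: int  # normal samples intercepted (false positives)


def detection_rate(c: Confusion) -> float:
    # Share of attack samples that the WAF intercepted.
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    # Assumption: share of normal samples that the WAF intercepted.
    return c.fp / (c.fp + c.tn)


def accuracy_rate(c: Confusion) -> float:
    # Share of all samples, attack and normal, that were handled correctly.
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


# Made-up counts for illustration only.
example = Confusion(tp=90, tn=995, fn=10, fp=5)
print(f"Detection rate:      {detection_rate(example):.2%}")
print(f"False positive rate: {false_positive_rate(example):.2%}")
print(f"Accuracy rate:       {accuracy_rate(example):.2%}")
```

Because normal traffic heavily outnumbers attack traffic, the accuracy rate is dominated by how normal samples are handled, which is why it is worth reading it alongside the detection rate rather than on its own.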

To reduce the impact of random fluctuations, "detection latency" is split into two indicators: "90% average latency" and "99% average latency."
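
One possible reading of these two indicators, and it is only my assumption here, is the mean latency of the fastest 90% and the fastest 99% of requests, so that a handful of slow outliers does not distort the figure. The function name and the sample values below are hypothetical.

```python
def trimmed_mean_latency(latencies_ms: list[float], keep_percent: int) -> float:
    """Mean of the fastest `keep_percent` percent of samples (assumed definition)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be between 1 and 100")
    ordered = sorted(latencies_ms)
    # Integer arithmetic avoids floating-point surprises in the cut-off index.
    count = max(1, len(ordered) * keep_percent // 100)
    kept = ordered[:count]
    return sum(kept) / len(kept)


# Made-up latencies in milliseconds: 98 fast requests and two slow outliers.
samples = [1.0] * 98 + [2.0, 100.0]
plain_mean = sum(samples) / len(samples)
avg_90 = trimmed_mean_latency(samples, 90)
avg_99 = trimmed_mean_latency(samples, 99)
print(f"plain={plain_mean:.3f} ms, 90%={avg_90:.3f} ms, 99%={avg_99:.3f} ms")
```

In this made-up data set a single 100 ms outlier doubles the plain mean, while both trimmed figures stay close to the typical 1 ms request.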

Test Samples

Data Source: All test data comes from my own browser.
Packet Capturing Method: Use Burp as a proxy, route all browser traffic through it, export the captured items as an XML file, and then use a Python script to split the file into individual requests.
Traffic Ratio: Based on past experience, the ratio of normal traffic to attack traffic for services exposed on the Internet is usually around 100:1, so we will match the samples in this way.
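
The splitting step in the packet capturing method could look roughly like the sketch below. It assumes the export uses Burp's usual "save items" layout, where each `<item>` element holds a `<request>` element that is Base64-encoded when its `base64` attribute is "true". The function name is hypothetical and this is not the script I actually used.

```python
import base64
import xml.etree.ElementTree as ET


def requests_from_burp_xml(xml_text: str) -> list[bytes]:
    """Return the raw HTTP request bytes of every <item> in a Burp export."""
    root = ET.fromstring(xml_text)
    raw_requests = []
    for item in root.iter("item"):
        request = item.find("request")
        if request is None or request.text is None:
            continue  # skip items that carry no request body
        if request.get("base64") == "true":
            raw_requests.append(base64.b64decode(request.text))
        else:
            raw_requests.append(request.text.encode("utf-8"))
    return raw_requests


# Tiny hand-built export with a single item, for illustration only.
sample_request = b"GET /index.html HTTP/1.1\r\nHost: forum.example\r\n\r\n"
sample_xml = (
    "<items><item><url>http://forum.example/index.html</url>"
    '<request base64="true">'
    + base64.b64encode(sample_request).decode("ascii")
    + "</request></item></items>"
)
print(requests_from_burp_xml(sample_xml))
```

For a real export file, `ET.parse(path).getroot()` can replace `ET.fromstring`, and each returned request can then be written to its own file.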

White Samples: Collected by browsing various forums, for a total of 60,707 HTTP requests and 2.7 GB of data (this process took me 5 hours).

Black Samples: To make the test results more comprehensive, I collected black samples using four different methods, totaling 600 HTTP requests (this process took me 5 hours).

Attack Traffic Samples:

Simple Common Attack Traffic: Deploy the DVWA target machine and attack each of its common vulnerabilities one by one.
Common Attack Traffic: Use all attack payloads provided on the PortSwigger official website.
Targeted Vulnerability Traffic: Deploy the VulHub target machines and attack each classic vulnerability with its default PoC.
Evasion Attack Traffic: Raise the DVWA security level and attack DVWA again under medium and high protection.

Testing Method

With the testing indicators and samples defined, three things are still needed: the WAF, a target machine to receive traffic, and a testing tool.

WAF Configuration: All WAFs use the initial configuration without any adjustments.
Target Machine: Uses Nginx, which directly returns 200 for any request received, with the following configuration:

nginx
location / {
    return 200 'hello WAF!';
    default_type text/plain;
}

Testing Tool Requirements:

Parse the export results of Burp.
Repackage according to the HTTP protocol.
Delete the Cookie header so that the samples can be published as open source.
Modify the Host Header field to allow the target machine to receive traffic normally.
Determine whether the WAF intercepted a request based on whether the response status is 200.
Mix the black and white samples, then send the requests evenly.
Automatically calculate the above "testing indicators."
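
Several of these requirements can be sketched in a few small functions. This is an illustration under my own assumptions, not the actual testing tool: the function names are hypothetical, requests are assumed to be raw bytes with CRLF line endings, and "evenly" is interpreted as spreading the black samples at regular intervals among the white ones.

```python
def rewrite_headers(raw: bytes, target_host: str) -> bytes:
    """Drop the Cookie header and point the Host header at the target machine."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    request_line, headers = lines[0], lines[1:]
    kept = [b"Host: " + target_host.encode("ascii")]
    for line in headers:
        name = line.split(b":", 1)[0].strip().lower()
        if name in (b"cookie", b"host"):
            continue  # Cookie is removed, Host is replaced above
        kept.append(line)
    return b"\r\n".join([request_line, *kept]) + b"\r\n\r\n" + body


def is_blocked(status_code: int) -> bool:
    """The target always answers 200, so any other status counts as a block."""
    return status_code != 200


def mix_evenly(white: list, black: list) -> list:
    """Spread the black samples at regular intervals among the white samples."""
    total = len(white) + len(black)
    mixed, w, b = [], 0, 0
    for position in range(1, total + 1):
        # Number of black samples that should have been emitted by now.
        due_black = position * len(black) // total
        if b < due_black:
            mixed.append(black[b])
            b += 1
        else:
            mixed.append(white[w])
            w += 1
    return mixed


raw = b"GET /a HTTP/1.1\r\nHost: forum.example\r\nCookie: sid=1\r\nAccept: */*\r\n\r\n"
print(rewrite_headers(raw, "waf.test"))
print(mix_evenly(["w1", "w2", "w3", "w4", "w5", "w6"], ["b1", "b2"]))
```

Spreading the black samples out this way keeps the attack-to-normal ratio roughly constant over the whole run, instead of sending all attacks in one burst.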

Test Results

SafeLine WAF

Coraza

ModSecurity

Baota WAF

nginx-lua-waf

SuperWAF

Comparison Table

WAF	False Negatives	False Positives	Accuracy Rate	Average Latency
SafeLine WAF	149	38	99.44%	0.73 ms
Coraza	171	5,182	84.10%	3.09 ms
ModSecurity	175	7,381	77.56%	1.36 ms
Baota WAF	351	96	98.67%	0.53 ms
nginx-lua-waf	362	475	97.51%	0.41 ms
SuperWAF	437	46	98.57%	0.34 ms

Conclusion

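SafeLine WAF delivers the best overall detection results, with the fewest false negatives (149), the fewest false positives (38), and the highest accuracy rate (99.44%).
Coraza and ModSecurity have high detection rates, but their thousands of false positives (5,182 and 7,381) make them less suitable for real-world use in their initial configuration.
Baota WAF, nginx-lua-waf, and SuperWAF have the lowest latency (0.34 ms to 0.53 ms), but each lets through more than twice as many attack samples as SafeLine WAF, and nginx-lua-waf also produces noticeably more false positives (475).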

Different test samples and testing methods may lead to significant variations in the results, so it is necessary to select appropriate test samples and methods based on actual conditions for testing.

The results of this test are for reference only and should not be used as the sole criterion for evaluating products, tools, algorithms, or models.