
Test Report of 5 Free WAFs

To ensure fairness and impartiality in the test results, all target machines, testing tools, and test samples mentioned in this article are open-source projects.

Test Metrics

The test results are primarily evaluated based on four key metrics:

  • Detection Rate: Reflects the comprehensiveness of the WAF's detection capability. Failure to detect an attack is considered a "false negative" (FN).

  • False Positive Rate: Measures interference with normal traffic. Normal requests that are wrongly intercepted are counted as "false positives" (FP).

  • Accuracy: A composite indicator of both detection and false positive rates, ensuring a balance between avoiding false negatives and false positives.

  • Detection Time: Reflects the WAF's performance, with longer durations indicating poorer performance.

While detection time can be measured directly by the testing tool, the other three metrics map onto the standard outcomes of binary classification in statistics:

  • TP (True Positives): The number of attack samples successfully intercepted.
  • TN (True Negatives): The number of normal samples correctly allowed to pass.
  • FN (False Negatives): The number of attack samples that were mistakenly allowed to pass, i.e., "misses".
  • FP (False Positives): The number of normal requests that were erroneously intercepted, i.e., "false alarms".

Based on these definitions, the formulas for calculating the above three metrics are as follows:

  • Detection Rate = TP / (TP + FN)

  • False Positive Rate = FP / (TP + FP), i.e., the share of all interceptions that were actually normal requests

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

To reduce the impact of random outliers, "Detection Time" is further broken down into two metrics: "Average Time for 90% of Requests" and "Average Time for 99% of Requests."
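
To make the bookkeeping concrete, here is a minimal Python sketch of how these metrics could be computed from per-request results. The record fields and the reading of the two timing metrics (taken here as the mean latency of the fastest 90% and 99% of requests) are illustrative assumptions, not the exact output of the tools used later.

from statistics import mean

def evaluate(results):
    # results: a list of dicts such as {"is_attack": bool, "blocked": bool, "latency_ms": float}
    # (illustrative field names, not the format of any particular tool)
    tp = sum(r["is_attack"] and r["blocked"] for r in results)          # attacks intercepted
    fn = sum(r["is_attack"] and not r["blocked"] for r in results)      # attacks missed
    fp = sum(not r["is_attack"] and r["blocked"] for r in results)      # normal requests intercepted
    tn = sum(not r["is_attack"] and not r["blocked"] for r in results)  # normal requests passed

    detection_rate = tp / (tp + fn)            # TP / (TP + FN)
    false_positive_rate = fp / (tp + fp)       # FP / (TP + FP), as defined above
    accuracy = (tp + tn) / len(results)        # (TP + TN) / (TP + TN + FP + FN)

    # Assumed reading of the timing metrics: the mean latency of the fastest
    # 90% and 99% of requests, which damps the effect of a few slow outliers.
    latencies = sorted(r["latency_ms"] for r in results)
    avg_90 = mean(latencies[: max(1, int(len(latencies) * 0.90))])
    avg_99 = mean(latencies[: max(1, int(len(latencies) * 0.99))])

    return detection_rate, false_positive_rate, accuracy, avg_90, avg_99

Plugging in SafeLine's figures from the results below as a check: 426 / (426 + 149) ≈ 74.09% detection rate and 38 / (426 + 38) ≈ 8.19% false positive rate.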

Test Samples

Data Source: All test data originate from my personal browser.

Packet Capture Method: Burp Suite is used as a proxy; the browser is configured to route all of its traffic through Burp, and the exported XML files are then split into individual requests with a Python script.
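
As a rough sketch of that processing step, assuming Burp's standard "Save items" XML export in which each item carries a base64-encoded request (the file name below is a placeholder):

import base64
import xml.etree.ElementTree as ET

def load_burp_export(path):
    # Parse a Burp Suite XML export into a list of raw HTTP requests (bytes).
    # Assumes the usual layout: <items><item><request base64="true">...</request></item>...
    requests = []
    for item in ET.parse(path).getroot().iter("item"):
        req = item.find("request")
        if req is None or req.text is None:
            continue
        raw = req.text.encode()
        if req.get("base64") == "true":
            raw = base64.b64decode(raw)
        requests.append(raw)
    return requests

samples = load_burp_export("white_samples.xml")   # placeholder file name
print(len(samples), "requests loaded")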

Based on past experience, the ratio of normal to attack traffic for services exposed on the internet is approximately 100:1. We have proportioned our samples accordingly.

White Samples: By browsing Weibo, Zhihu, Bilibili, and various forums, a total of 60,707 HTTP requests were accumulated, with a total size of 2.7 GB (a process that consumed 5 hours of my time).

Black Samples: To ensure a comprehensive test, I collected black samples using four different methods, resulting in a total of 600 HTTP requests (a process that also consumed 5 hours of my time).

  • Simple Generic Attack Traffic: Deploying the Damn Vulnerable Web Application (DVWA) target and attacking each of its generic vulnerability categories in turn.
  • Common Attack Traffic: Executing all attack payloads provided by the PortSwigger website.
  • Targeted Vulnerability Traffic: Deploying VulHub targets and sequentially attacking all classic vulnerabilities using their default Proof of Concept (PoC) scripts.
  • Attack Resilience Traffic: Increasing the adversarial level of DVWA and re-attacking it under medium and high protection settings.

Test Method

With the test metrics and test samples clearly defined, we now require three essential components: WAFs, target machines to receive traffic, and testing tools.

All WAFs will be used in their initial configurations without any adjustments.

For the target machines, we will use Nginx, which will directly return a 200 status code for any request received. The configuration is as follows:

location / {
    # Answer every request with HTTP 200 and a small plain-text body, so any
    # non-200 response seen by the test tool means the WAF intercepted the request.
    default_type text/plain;
    return 200 'hello WAF!';
}

This configuration makes Nginx respond with a 200 status code and the short "hello WAF!" body to every request, regardless of its content or type. This setup lets us focus on the WAF's behavior and performance without any interference from the target machine's processing logic.

The requirements for the testing tool are as follows (a rough sketch of a few of these steps appears after the list):

  • Parse Burp's export results
  • Re-assemble packets according to the HTTP protocol
  • Strip the Cookie header, since the captured data may be open-sourced later
  • Modify the Host Header field to enable the target machine to receive traffic normally
  • Determine whether a request is blocked by the WAF based on whether it returns a 200 status code
  • Evenly send mixed black and white samples
  • Automatically calculate the aforementioned "test metrics"
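
The actual testing is handled by the two tools introduced below; purely to illustrate a few of these steps (stripping the Cookie header, rewriting Host, and judging interception by the status code), a minimal Python sketch under those assumptions might look like this:

import socket

def rewrite_request(raw: bytes, new_host: str) -> bytes:
    # Drop the Cookie header (so the data can be open-sourced) and point Host at the target machine.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    out = [lines[0]]                                  # request line, e.g. b"GET /path HTTP/1.1"
    for line in lines[1:]:
        name = line.split(b":", 1)[0].strip().lower()
        if name == b"cookie":
            continue
        if name == b"host":
            line = b"Host: " + new_host.encode()
        out.append(line)
    return b"\r\n".join(out) + b"\r\n\r\n" + body

def is_blocked(raw: bytes, host: str, port: int) -> bool:
    # Replay one request through the WAF over plain HTTP; since the backend always
    # returns 200, any other status is counted as an interception.
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(rewrite_request(raw, f"{host}:{port}"))
        status_line = s.recv(4096).split(b"\r\n", 1)[0]
    parts = status_line.split()
    return len(parts) < 2 or parts[1] != b"200"

Interleaving the black and white samples evenly and aggregating the counts would then follow the evaluate() sketch from the Test Metrics section.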

I have found two open-source WAF testing tools that appear to be well built and cover most of these requirements. The links are as follows:

  • gotestwaf: An open-source WAF testing tool from Wallarm
  • blazehttp: An open-source WAF testing tool from Chaitin

By combining these two tools and making a few minor adjustments, we can cover all of the requirements above.

Test Results

SafeLine

TP: 426 TN: 33056 FP: 38 FN: 149
Total Samples: 33669 Success: 33669 Errors: 0
Detection Rate: 74.09%
False Positive Rate: 8.19%
Accuracy: 99.44%

Average Time for 90% of Requests: 0.73 milliseconds
Average Time for 99% of Requests: 0.89 milliseconds

Coraza

TP: 404 TN: 27912 FP: 5182 FN: 171
Total Samples: 33669 Success: 33669 Errors: 0
Detection Rate: 70.26%
False Positive Rate: 92.77%
Accuracy: 84.10%

Average Time for 90% of Requests: 3.09 milliseconds
Average Time for 99% of Requests: 5.10 milliseconds

ModSecurity

TP: 400 TN: 25713 FP: 7381 FN: 175
Total Samples: 33669 Success: 33669 Errors: 0
Detection Rate: 69.57%
False Positive Rate: 94.86%
Accuracy: 77.56%

Average Time for 90% of Requests: 1.36 milliseconds
Average Time for 99% of Requests: 1.71 milliseconds

nginx-lua-waf

TP: 213 TN: 32619 FP: 475 FN: 362
Total Samples: 33669 Success: 33669 Errors: 0
Detection Rate: 37.04%
False Positive Rate: 69.04%
Accuracy: 97.51%

Average Time for 90% of Requests: 0.41 milliseconds
Average Time for 99% of Requests: 0.49 milliseconds

SuperWAF

TP: 138 TN: 33048 FP: 46 FN: 437
Total Samples: 33669 Success: 33669 Errors: 0
Detection Rate: 24.00%
False Positive Rate: 25.00%
Accuracy: 98.57%

Average Time for 90% of Requests: 0.34 milliseconds
Average Time for 99% of Requests: 0.41 milliseconds

Comparison

[Comparison chart: detection rate, false positive rate, accuracy, and detection time for the five WAFs]

SafeLine has the best overall performance with the least false positives and false negatives.

Both Coraza and ModSecurity, as outstanding WAF engine projects, have high detection rates but also suffer from higher false positives.

To ensure fairness and impartiality, the testing tools and data used in this article have been made open-source and can be accessed at the following address:

https://gitee.com/kxlxbb/testwaf

Additionally, different test samples and methods can produce significantly different results, so it is important to choose samples and methods that match your own environment.
