"Free WAF" SafeLine Performance Testing

Background

Chaitin's WAF SafeLine has been available for some time now. In terms of security protection capabilities, Chaitin is arguably one of the best WAF vendors with projects hosted on GitHub. Their core semantic analysis technology boasts exceptional detection accuracy while maintaining a low false positive rate.

But what about its actual detection performance, and how many resources does it need so that the WAF does not become a bottleneck for a given site's traffic? To answer this, a performance test is necessary. This article provides a firsthand account of SafeLine's real-world performance through stress testing, offering data for reference. We will also explore potential performance tuning strategies to maximize SafeLine's detection capability within a given resource budget.
Project links: website | GitHub

Test Environment

  • CPU: Intel i7-12700
  • Memory: 64GB DDR5 4800MHz
  • Kernel: 5.17.15-051715-generic
  • Docker: 20.10.21
  • SafeLine Version: 1.3.0

Test Deployment

WAF configuration: add a new site to the WAF and configure it to forward traffic to a business server running on the same machine.
Image description

Business Server Configuration: Set up an Nginx server to return a simple 200 response for all requests.
Image description
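
For reference, a minimal setup of this kind might look like the sketch below (the config path and listen port are assumptions for illustration, not the exact values used in the screenshots):

```bash
# Minimal "business server": answer every request with a plain 200.
cat > /etc/nginx/conf.d/upstream-200.conf <<'EOF'
server {
    listen 8080;
    location / {
        return 200 'ok';
    }
}
EOF
nginx -s reload
```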

Testing Tools

  • wrk: A straightforward HTTP benchmarking utility for putting web servers through their paces.
  • wrk2: An enhanced version of wrk that enables testing at a predefined queries per second (QPS) rate.
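
Typical invocations of the two tools look roughly like this (the host, port, and thread/connection counts are placeholders):

```bash
# wrk: push as hard as possible and report the achieved QPS
wrk -t4 -c100 -d30s http://127.0.0.1:8080/

# wrk2: same flags, plus -R to hold a fixed request rate
wrk -t4 -c100 -d60s -R1000 http://127.0.0.1:8080/
```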

Testing Strategy

The primary objective of this testing is to benchmark the performance of various services related to traffic inspection within the WAF. Specifically, we aim to determine the maximum QPS that each service can sustain when allocated and fully utilizing a single CPU core.

We will use two types of requests for the test:

  1. Simple GET requests without a request body.
  2. GET requests with a request body, specifically a 1KB JSON payload.

Given that the core metric for WAF performance is the number of HTTP requests it can inspect per second, QPS is a more relevant parameter than network layer throughput for our evaluation.

Testing Process

1. Evaluating Service Functionality

First, send an arbitrary 1,000 QPS of the simplest GET requests and observe which services' workloads change with the traffic.
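
A fixed 1,000 QPS can be held with wrk2 along these lines (the target URL is a placeholder for the WAF-protected site):

```bash
# wrk2: -R pins the offered load at 1000 requests/second
wrk -t2 -c50 -d120s -R1000 http://127.0.0.1:8080/
```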

Image description

Load Status of Each Container:

Image description
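
The per-container figures above can be watched live with docker stats, for example:

```bash
# Live CPU / memory per container while the load is running
docker stats safeline-tengine safeline-detector safeline-mario
```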

Based on the information presented, it is clear that the load of three services is correlated with traffic:

  • safeline-tengine: As the name suggests, and given that Tengine is an enhanced version of Nginx maintained by Alibaba, this container functions as a reverse proxy, receiving incoming requests and forwarding them to the appropriate backend servers. It handles the initial traffic and acts as the gateway for the application.

  • safeline-detector: As inferred from the name and the context provided, this service is responsible for detection. It is likely that after Tengine (the reverse proxy) receives the requests, it forwards them to the detector for further inspection or analysis. This aligns with the official documentation about quickly integrating a free WAF (Web Application Firewall) with Nginx, where detection of potentially malicious traffic is a crucial step.

  • safeline-mario: From the naming convention and a glimpse into the configurations of both detector and mario, it can be deduced that mario is involved in analyzing and possibly persisting the detection logs generated by the detector. This service likely processes the information from the detector to provide insights, generate reports, or simply ensure that important data is stored for future reference.

  • Detector Configuration File
    Image description

  • Mario Configuration File
    Image description

2. Testing the Baseline Performance of Three Services with Simple Requests

First, we need to limit the CPU usage of all services to a single core by adding resource limits in the compose.yml file for each service.
Image description

Next, execute docker compose up -d to apply the changes (this command should be run from the installation directory of SafeLine, which is /data/safeline by default).
Image description
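
The relevant fragment and the re-create step look roughly like this (service names and the exact compose keys follow the screenshot and may differ in your compose.yml):

```bash
# Sketch: per-service CPU cap in /data/safeline/compose.yml, e.g.
#   services:
#     detector:
#       cpus: 1
#     mario:
#       cpus: 1
#     tengine:
#       cpus: 1
cd /data/safeline
docker compose up -d    # re-create the containers with the new limits
```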

After that, we use wrk to determine the maximum QPS (Queries Per Second) that can be achieved:
Image description

The QPS reached 4175.
Image description

From the above figure, it can be observed that the CPU usage of the detector service reached 100%, becoming the first bottleneck.

Let's examine the actual CPU usage of the detector service:
Image description
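
One way to get this view (container and process names are taken from the screenshots; the host-side commands are just one option):

```bash
# Processes inside the detector container
docker top safeline-detector

# Per-thread CPU usage of snserver, viewed from the host
top -H -p "$(pgrep -f snserver | head -n1)"
```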

We can see that the process name of the detector is snserver, which is a multithreaded program. The number of threads is approximately equal to the number of CPU cores in the entire machine (seemingly a few more). When limited to a single core, each thread has some CPU usage, but none of them are high.

This situation effectively results in each thread running for only a short period each time it is awakened, with multiple threads switching back and forth, leading to higher context-switching overhead. We notice a configuration line in the detector's configuration file that likely controls the number of threads. By uncommenting and setting this to 1, we aim to reduce context-switching and see if the QPS can be further improved. This file is located at resources/detector/snserver.yml in the installation directory. Make the change and restart the detector.
Image description
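
After editing resources/detector/snserver.yml, the detector can be restarted like this (the compose service name is assumed to be detector):

```bash
cd /data/safeline
docker compose restart detector     # or: docker restart safeline-detector
```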

The reduction in the number of threads indicates that the configuration has taken effect.
Image description

Similarly, nginx also exhibits this issue, so we reduce the number of worker processes.
Image description
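
A quick way to do this inside the tengine container might be the following (the nginx.conf path inside the container is an assumption):

```bash
# Drop tengine to a single worker process and reload
docker exec safeline-tengine sh -c \
  "sed -i 's/^worker_processes.*/worker_processes 1;/' /etc/nginx/nginx.conf && nginx -s reload"
```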

Running wrk again:
Image description

The QPS increased to over 17,000, which is impressive and fairly realistic performance.

However, there are some interesting observations regarding the load changes:

A. After starting the benchmark, the loads of both detector and mario reach 100% simultaneously. During this process, it can be noticed that the memory usage of mario continues to rise.
Image description

B. After stopping wrk, the loads of tengine and detector immediately drop back to zero, but mario remains at 100%. Notably, the memory usage of mario exceeds 2GB at this point.
Image description

C. After some time, the CPU usage of mario also drops, and the memory usage returns to the pre-benchmark level (around 300+ MB).
Image description

Based on these observations, we can infer:

  • For the simplest GET requests, the detector can support over 17,000 QPS on a single core.

  • However, at 17,000+ QPS, a single core is not enough for mario. The continuous rise in memory usage suggests that log processing cannot keep up and a backlog builds up quickly, which is further evidenced by mario still running at 100% CPU after wrk is stopped.

  • For the simplest GET requests, the performance of mario is the bottleneck among the three services.

Since the GET requests are too simple, we will not precisely measure the QPS each service can support when running at full capacity on a single core in this scenario. Instead, we will proceed with testing using complex requests, as the data obtained from such tests will be more meaningful.

3. Testing the Baseline Performance of Three Services with Complex Requests

We utilize the lua script of wrk to generate complex requests:
Image description
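
A sketch of such a script (not the exact one used here) and its invocation:

```bash
# body.lua: attach a 1 KB JSON body to every GET request
cat > body.lua <<'EOF'
wrk.method = "GET"
wrk.headers["Content-Type"] = "application/json"
-- pad the JSON payload out to 1024 bytes
wrk.body = '{"data":"' .. string.rep("A", 1013) .. '"}'
EOF

wrk -t4 -c100 -d30s -s body.lua http://127.0.0.1:8080/
```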

Using tcpdump to capture packets reveals that the outgoing requests carry a 1024-byte body.
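
The capture can be reproduced with something like this (interface and port are assumptions, since everything runs on one machine):

```bash
# Show request payloads on the loopback interface
tcpdump -i lo -A -nn 'tcp port 8080 and greater 1000'
```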

Image description

The measured QPS is slightly over 10,000:

Image description

Load conditions of each service:

Image description

Unsurprisingly, the detector remains the bottleneck due to the larger and more complex requests, as the detection engine naturally consumes more CPU for inspection.

Meanwhile, the CPU usage of both nginx and mario decreases somewhat. The drop for nginx (65% → 52%) comes from the lower QPS reducing its load, partially offset by the larger request size. mario's larger drop (100% → 47%) indicates that its overhead is more sensitive to QPS than to request size.

From this case, we can conclude that the maximum QPS for a single core of the detector is roughly 10,000 for requests of this size.

Next, let's examine tengine's performance. To maximize its capacity, we need to allocate more resources to detector and mario, setting them both to 4 cores.

Image description

nginx can support up to 28,000 QPS on a single core.

However, the load on the detector rises to 326%, more CPU than its measured single-core capacity of 10,000 QPS would predict for this traffic (28,000 QPS would imply roughly 280%). This suggests additional overhead from multithreading synchronization.

Image description

Next is mario's single-core limit. Because overloading mario causes data to pile up in its queue, we cannot rely on CPU limits alone. A brute-force approach is a binary search over the request rate using wrk2's fixed-QPS mode, looking for the rate at which mario's CPU approaches 100% without continuous memory growth.
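
A single probe at a candidate rate then looks something like this (rate, port, and duration are placeholders):

```bash
# Offer a fixed 11000 QPS with the 1 KB body while watching mario
wrk -t4 -c200 -d300s -R11000 -s body.lua http://127.0.0.1:8080/ &
docker stats safeline-mario    # memory should stay flat if mario keeps up
```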

The test results here are intriguing. When limiting mario's CPU to 1 core, even at 10,000 QPS, its memory struggles to maintain stability. Therefore, we first relax this limit and allocate 2 cores. The most reliable figure obtained is that at 11,000 QPS, mario can maintain approximately 100% CPU usage without continuous memory growth.

Image description

Yet, this raises a confusing point: while the detector was previously measured to support up to 10,000 QPS on a single core, its CPU usage at 11,000 QPS reaches 247%, which is unexpected.

Test Summary

The performance of the three key services is summarized in the following table (with 1K Body in the request):

Image description

Based on this table, we can estimate an overall single-core QPS figure, i.e., the approximate QPS that SafeLine could sustain if deployed on a machine with only one CPU core. The calculation can be done as follows:

Image description
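
As a rough sanity check (this is an assumption about how such a figure could be combined, not the exact calculation in the screenshot): if the three stages share that single core, their per-request CPU costs add up, so taking the per-core figures measured above (tengine ≈ 28,000, detector ≈ 10,000, mario ≈ 11,000) at face value gives

$$\text{QPS}_{\text{combined}} \approx \frac{1}{\frac{1}{28000} + \frac{1}{10000} + \frac{1}{11000}} \approx 4400$$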
