Background
Chaitin's WAF SafeLine has been available for some time now. In terms of security protection capability, Chaitin is arguably one of the strongest WAF vendors with a project hosted on GitHub: its core semantic-analysis technology delivers exceptional detection accuracy while maintaining a low false-positive rate.
But what is its actual detection performance, and how many resources does it need so that the WAF does not impede a given website's traffic? Answering that requires a performance test. This article provides a firsthand account of SafeLine's real-world performance through stress testing, offering data for reference. We will also explore performance tuning strategies to maximize SafeLine's detection capacity within a given resource budget.
Test Environment
- CPU: Intel i7-12700
- Memory: 64GB DDR5 4800MHz
- Kernel: 5.17.15-051715-generic
- Docker: 20.10.21
- SafeLine Version: 1.3.0
Test Deployment
WAF Configuration: Add a new site to the WAF, configuring it to forward traffic to a business server running on the same machine.
Business Server Configuration: Set up an Nginx server that returns a simple 200 response for all requests.
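A minimal sketch of such a backend (the listen port and response body are assumptions, not the article's actual config):

```nginx
# Minimal upstream used only for benchmarking: return 200 for every request.
server {
    listen 8080;            # assumed port; any free port works
    location / {
        return 200 'ok';
    }
}
```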
Testing Tools
- wrk: A straightforward HTTP benchmarking utility for putting web servers through their paces.
- wrk2: An enhanced version of wrk that enables testing at a predefined queries per second (QPS) rate.
Testing Strategy
The primary objective of this testing is to benchmark the performance of various services related to traffic inspection within the WAF. Specifically, we aim to determine the maximum QPS that each service can sustain when allocated and fully utilizing a single CPU core.
We will use two types of requests for the test:
- Simple GET requests without a request body.
- GET requests with a request body, specifically a 1KB JSON payload.
Given that the core metric for WAF performance is the number of HTTP requests it can inspect per second, QPS is a more relevant parameter than network layer throughput for our evaluation.
Testing Process
1. Evaluate Service Functionality
To begin, send traffic at roughly 1000 QPS (Queries Per Second) using the simplest GET requests, and observe which services' workloads vary in response to the change in QPS.
Load Status of Each Container:
Based on the information presented, it is clear that the load of three services is correlated with traffic:
safeline-tengine: As the name suggests, and given that Tengine is an enhanced version of Nginx maintained by Alibaba, this container functions as a reverse proxy, receiving incoming requests and forwarding them to the appropriate backend servers. It handles the initial traffic and acts as the gateway for the application.
safeline-detector: As inferred from the name and the context provided, this service is responsible for detection. It is likely that after Tengine (the reverse proxy) receives the requests, it forwards them to the detector for further inspection or analysis. This aligns with the official documentation about quickly integrating a free WAF (Web Application Firewall) with Nginx, where detection of potentially malicious traffic is a crucial step.
safeline-mario: From the naming convention and a glimpse into the configurations of both detector and mario, it can be deduced that mario is involved in analyzing and possibly persisting the detection logs generated by the detector. This service likely processes the information from the detector to provide insights, generate reports, or simply ensure that important data is stored for future reference.
2. Testing the Baseline Performance of Three Services with Simple Requests
First, we need to limit the CPU usage of each service to a single core by adding resource limits in the `compose.yml` file.
Next, execute `docker compose up -d` to apply the changes (run this command from the installation directory of SafeLine, which is `/data/safeline` by default).
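For illustration, a fragment of the kind of limit involved (the service key and exact syntax are assumptions and depend on the Compose version; SafeLine's real `compose.yml` contains many more entries):

```yaml
# compose.yml fragment: cap one service at a single core; repeat per service.
services:
  detector:
    cpus: 1
```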
After that, we use `wrk` to determine the maximum QPS that can be achieved:
From the above figure, it can be observed that the CPU usage of the `detector` service reached 100%, making it the first bottleneck.
Let's examine the actual CPU usage of the `detector` service:
We can see that the `detector` process is named `snserver` and is a multithreaded program. The number of threads is roughly equal to the number of CPU cores on the machine (seemingly a few more). When limited to a single core, every thread shows some CPU usage, but none of them is high.
This effectively means each thread runs only briefly each time it wakes up, with the threads switching back and forth and incurring extra context-switching overhead. The `detector`'s configuration file contains a line that appears to control the number of threads; by uncommenting it and setting it to 1, we hope to reduce context switching and see whether QPS improves. The file is located at `resources/detector/snserver.yml` under the installation directory. Make the change and restart the `detector`.
The reduction in the number of threads indicates that the configuration has taken effect.
Similarly, `nginx` exhibits the same issue, so we reduce its number of worker processes.
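In Nginx and Tengine this is the standard `worker_processes` directive in `nginx.conf`:

```nginx
# Run a single worker to match the one-core limit.
worker_processes 1;
```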
The QPS increased to over 17,000, which is a realistic and impressive figure.
However, there are some interesting observations regarding the load changes:
A. After starting the benchmark, the loads of both `detector` and `mario` reach 100% simultaneously, and during this period the memory usage of `mario` keeps rising.
B. After stopping `wrk`, the loads of `tengine` and `detector` immediately drop back to zero, but `mario` remains at 100%. Notably, `mario`'s memory usage exceeds 2 GB at this point.
C. After some time, `mario`'s CPU usage also drops, and its memory returns to the pre-benchmark level (roughly 300 MB).
Based on these observations, we can infer:
- For the simplest GET requests, the `detector` can support over 17,000 QPS on a single core.
- At 17,000+ QPS, however, a single core is insufficient for `mario`. The continuous rise in memory usage suggests that log processing cannot keep up, causing queues to accumulate rapidly; this is further evidenced by `mario` continuing to run at 100% CPU even after `wrk` is stopped.
- For the simplest GET requests, `mario` is the performance bottleneck among the three services.
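The lingering 100% CPU after the load stops is consistent with a queue draining. A back-of-envelope sketch (the service rate and benchmark duration here are illustrative assumptions, not measurements from this article):

```shell
# Estimate mario's log backlog and drain time after wrk stops.
awk 'BEGIN {
  arrive = 17000                      # measured QPS while wrk is running
  serve  = 11000                      # assumed single-core processing rate
  dur    = 60                         # assumed benchmark duration (seconds)
  backlog = (arrive - serve) * dur
  printf "backlog=%d entries, drain=%.0fs\n", backlog, backlog / serve
}'
```

With these assumed rates, a one-minute run leaves a backlog of several hundred thousand log entries, which takes tens of seconds to drain, matching observation C above.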
Since the GET requests are too simple, we will not precisely measure the QPS each service can support when running at full capacity on a single core in this scenario. Instead, we will proceed with testing using complex requests, as the data obtained from such tests will be more meaningful.
3. Testing the Baseline Performance of Three Services with Complex Requests
We use `wrk`'s Lua scripting to generate complex requests:
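A sketch of such a script (the filename and payload contents are assumptions; `wrk.method`, `wrk.body`, and `wrk.headers` are wrk's documented scripting fields):

```shell
# Write a wrk Lua script that attaches a 1 KiB JSON body to every request.
cat > body_1k.lua <<'EOF'
wrk.method = "GET"
-- pad a JSON object out to exactly 1024 bytes
local prefix = '{"data":"'
local suffix = '"}'
wrk.body = prefix .. string.rep("a", 1024 - #prefix - #suffix) .. suffix
wrk.headers["Content-Type"] = "application/json"
EOF
# then drive the benchmark with it, e.g.:
#   wrk -t4 -c100 -d30s -s body_1k.lua http://<waf-address>/
```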
Capturing packets with `tcpdump` confirms that the outgoing requests carry a 1024-byte body.
The measured QPS is slightly over 10,000:
Load conditions of each service:
Unsurprisingly, the `detector` remains the bottleneck: the larger and more complex requests naturally cost the detection engine more CPU to inspect.
Meanwhile, the CPU usage of both `nginx` and `mario` decreases somewhat. The decrease for `nginx` (65% -> 52%) is attributed to the lower QPS reducing its load, partially offset by the larger request size, resulting in a net decrease. `mario`'s more significant drop (100% -> 47%) indicates that its overhead is more sensitive to QPS than to request size.
From this case, we can conclude that the maximum single-core QPS of the `detector` is about 10,000.
Next, let's examine `tengine`'s performance. To push it to its capacity, we allocate more resources to `detector` and `mario`, setting each of them to 4 cores.
`nginx` can support up to 28,000 QPS on a single core.
However, the load on the `detector` rises to 326%, consuming more CPU than its single-core figure of 10,000 QPS would predict. This suggests additional overhead from multithreading synchronization.
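The gap can be quantified with a quick calculation, using only the figures measured above:

```shell
# At 28,000 QPS the detector should need 2.8 cores if it scaled linearly
# from its 10,000 QPS single-core figure; compare with the observed 326%.
awk 'BEGIN {
  qps = 28000; per_core = 10000
  expected = qps / per_core * 100       # percent of one core
  observed = 326                        # measured above
  printf "expected=%.0f%% observed=%d%% extra=%.0f%%\n", expected, observed, (observed - expected) / expected * 100
}'
```

So the detector burns roughly 16% more CPU than linear scaling would predict, plausibly the cost of cross-thread synchronization.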
Moving on to `mario`'s single-core limit: since overloading it causes data to accumulate, we cannot rely on CPU limits alone. A brute-force approach is to binary-search for a QPS value, testing each candidate with `wrk2`'s fixed-rate mode, looking for the point where `mario`'s CPU usage approaches 100% without continuous memory growth.
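The search procedure can be sketched as follows. The `sustainable` function is a stub: in practice it would run `wrk2 -R "$rate"` for a while and then check `mario`'s CPU and memory; the 11,000 threshold here only makes the sketch self-contained.

```shell
# Binary-search the highest sustainable request rate (sketch).
sustainable() {
  [ "$1" -le 11000 ]    # stub for: run wrk2 at this rate, verify mario is stable
}
lo=1000; hi=20000
while [ $((hi - lo)) -gt 500 ]; do
  mid=$(( (lo + hi) / 2 ))
  if sustainable "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "max sustainable QPS is roughly $lo"
```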
The test results here are intriguing. With `mario` limited to 1 core, its memory struggles to stay stable even at 10,000 QPS, so we relax the limit and allocate 2 cores. The most reliable figure obtained: at 11,000 QPS, `mario` maintains approximately 100% CPU usage without continuous memory growth.
Yet this raises a puzzling point: the `detector` was previously measured to support up to 10,000 QPS per core, but at 11,000 QPS its CPU usage reaches 247%, which is unexpected.
Test Summary
The performance of the three key services (with a 1 KB body in each request) is summarized in the following table:

| Service | Max single-core QPS |
| ------- | ------------------- |
| tengine | ~28,000 |
| detector | ~10,000 |
| mario | ~11,000 |
Based on this table, we can estimate a comprehensive single-core QPS figure, i.e., the approximate QPS that SafeLine could sustain if deployed on a machine with only one CPU core. The calculation can be done as follows:
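A plausible reconstruction of that arithmetic, assuming each request consumes 1/QPS of a core in every service so the per-request costs simply add across the three services:

```shell
# A request costs 1/28000 of a core in tengine, 1/10000 in the detector,
# and 1/11000 in mario; on one shared core the sustainable rate is the
# reciprocal of the summed per-request costs.
awk 'BEGIN {
  t = 28000; d = 10000; m = 11000       # measured single-core QPS figures
  printf "%.0f\n", 1 / (1/t + 1/d + 1/m)
}'
```

Under this model, a single-core deployment tops out at roughly 4,400 QPS for requests with a 1 KB body, with the detector dominating the per-request cost.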