Background
Chaitin's WAF SafeLine has been available for some time now. In terms of security protection capability, Chaitin is arguably one of the strongest WAF vendors with a project hosted on GitHub: its core semantic-analysis technology delivers exceptional detection accuracy while maintaining a low false-positive rate.
But what is its actual detection performance, and how many resources does it need so that the WAF does not impede a given website's traffic? Answering that requires a performance test. This article provides a firsthand account of SafeLine's real-world performance through stress testing, offering data for reference. We will also explore performance tuning strategies to maximize SafeLine's detection capacity within a given resource budget.
Test Environment
- CPU: Intel i7-12700
- Memory: 64GB DDR5 4800MHz
- Kernel: 5.17.15-051715-generic
- Docker: 20.10.21
- SafeLine Version: 1.3.0
Test Deployment
WAF Configuration: Add a new site to the WAF, configuring it to forward traffic to a business server running on the same machine.
Business Server Configuration: Set up an Nginx server that returns a simple 200 response for all requests.
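A minimal sketch of such a backend (the listen port and response body are assumptions, not the article's actual config):

```nginx
# Minimal upstream used only for benchmarking: return 200 for every request.
server {
    listen 8080;            # assumed port; any free port works
    location / {
        return 200 'ok';
    }
}
```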
Testing Tools
- wrk: A straightforward HTTP benchmarking utility for putting web servers through their paces.
- wrk2: An enhanced version of wrk that enables testing at a predefined queries per second (QPS) rate.
Testing Strategy
The primary objective of this testing is to benchmark the performance of various services related to traffic inspection within the WAF. Specifically, we aim to determine the maximum QPS that each service can sustain when allocated and fully utilizing a single CPU core.
We will use two types of requests for the test:
- Simple GET requests without a request body.
- GET requests with a request body, specifically a 1KB JSON payload.
Given that the core metric for WAF performance is the number of HTTP requests it can inspect per second, QPS is a more relevant parameter than network layer throughput for our evaluation.
Testing Process
1. Evaluate Service Functionality
To begin, send traffic at roughly 1000 QPS (Queries Per Second) using the simplest GET requests, and observe which services' workloads vary in response to the change in QPS.
Load Status of Each Container:
Based on the information presented, it is clear that the load of three services is correlated with traffic:
safeline-tengine: As the name suggests, and given that Tengine is an enhanced version of Nginx maintained by Alibaba, this container functions as a reverse proxy, receiving incoming requests and forwarding them to the appropriate backend servers. It handles the initial traffic and acts as the gateway for the application.
safeline-detector: As inferred from the name and the context provided, this service is responsible for detection. It is likely that after Tengine (the reverse proxy) receives the requests, it forwards them to the detector for further inspection or analysis. This aligns with the official documentation about quickly integrating a free WAF (Web Application Firewall) with Nginx, where detection of potentially malicious traffic is a crucial step.
safeline-mario: From the naming convention and a glimpse into the configurations of both detector and mario, it can be deduced that mario is involved in analyzing and possibly persisting the detection logs generated by the detector. This service likely processes the information from the detector to provide insights, generate reports, or simply ensure that important data is stored for future reference.
2. Testing the Baseline Performance of Three Services with Simple Requests
First, we need to limit the CPU usage of each service to a single core by adding resource limits in the `compose.yml` file.
Next, execute `docker compose up -d` to apply the changes (run this command from the installation directory of SafeLine, which is `/data/safeline` by default).
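For illustration, a fragment of the kind of limit involved (the service key and exact syntax are assumptions and depend on the Compose version; SafeLine's real `compose.yml` contains many more entries):

```yaml
# compose.yml fragment: cap one service at a single core; repeat per service.
services:
  detector:
    cpus: 1
```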
After that, we use `wrk` to determine the maximum QPS that can be achieved:
From the above figure, it can be observed that the CPU usage of the `detector` service reached 100%, making it the first bottleneck.
Let's examine the actual CPU usage of the `detector` service:
We can see that the `detector` process is named `snserver` and is a multithreaded program. The number of threads is roughly equal to the number of CPU cores on the machine (seemingly a few more). When limited to a single core, every thread shows some CPU usage, but none of them is high.
This effectively means each thread runs only briefly each time it wakes up, with the threads switching back and forth and incurring extra context-switching overhead. The `detector`'s configuration file contains a line that appears to control the number of threads; by uncommenting it and setting it to 1, we hope to reduce context switching and see whether QPS improves. The file is located at `resources/detector/snserver.yml` under the installation directory. Make the change and restart the `detector`.
The reduction in the number of threads indicates that the configuration has taken effect.
Similarly, `nginx` exhibits the same issue, so we reduce its number of worker processes.
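In Nginx and Tengine this is the standard `worker_processes` directive in `nginx.conf`:

```nginx
# Run a single worker to match the one-core limit.
worker_processes 1;
```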
The QPS increased to over 17,000, which is a realistic and impressive figure.
However, there are some interesting observations regarding the load changes:
A. After starting the benchmark, the loads of both `detector` and `mario` reach 100% simultaneously, and during this period the memory usage of `mario` keeps rising.
B. After stopping `wrk`, the loads of `tengine` and `detector` immediately drop back to zero, but `mario` remains at 100%. Notably, `mario`'s memory usage exceeds 2 GB at this point.
C. After some time, `mario`'s CPU usage also drops, and its memory returns to the pre-benchmark level (roughly 300 MB).
Based on these observations, we can infer:
- For the simplest GET requests, the `detector` can support over 17,000 QPS on a single core.
- At 17,000+ QPS, however, a single core is insufficient for `mario`. The continuous rise in memory usage suggests that log processing cannot keep up, causing queues to accumulate rapidly; this is further evidenced by `mario` continuing to run at 100% CPU even after `wrk` is stopped.
- For the simplest GET requests, `mario` is the performance bottleneck among the three services.
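The lingering 100% CPU after the load stops is consistent with a queue draining. A back-of-envelope sketch (the service rate and benchmark duration here are illustrative assumptions, not measurements from this article):

```shell
# Estimate mario's log backlog and drain time after wrk stops.
awk 'BEGIN {
  arrive = 17000                      # measured QPS while wrk is running
  serve  = 11000                      # assumed single-core processing rate
  dur    = 60                         # assumed benchmark duration (seconds)
  backlog = (arrive - serve) * dur
  printf "backlog=%d entries, drain=%.0fs\n", backlog, backlog / serve
}'
```

With these assumed rates, a one-minute run leaves a backlog of several hundred thousand log entries, which takes tens of seconds to drain, matching observation C above.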
Since the GET requests are too simple, we will not precisely measure the QPS each service can support when running at full capacity on a single core in this scenario. Instead, we will proceed with testing using complex requests, as the data obtained from such tests will be more meaningful.
3. Testing the Baseline Performance of Three Services with Complex Requests
We use `wrk`'s Lua scripting to generate complex requests:
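A sketch of such a script (the filename and payload contents are assumptions; `wrk.method`, `wrk.body`, and `wrk.headers` are wrk's documented scripting fields):

```shell
# Write a wrk Lua script that attaches a 1 KiB JSON body to every request.
cat > body_1k.lua <<'EOF'
wrk.method = "GET"
-- pad a JSON object out to exactly 1024 bytes
local prefix = '{"data":"'
local suffix = '"}'
wrk.body = prefix .. string.rep("a", 1024 - #prefix - #suffix) .. suffix
wrk.headers["Content-Type"] = "application/json"
EOF
# then drive the benchmark with it, e.g.:
#   wrk -t4 -c100 -d30s -s body_1k.lua http://<waf-address>/
```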
Capturing packets with `tcpdump` confirms that the outgoing requests carry a 1024-byte body.
The measured QPS is slightly over 10,000:
Load conditions of each service:
Unsurprisingly, the `detector` remains the bottleneck: the larger and more complex requests naturally cost the detection engine more CPU to inspect.
Meanwhile, the CPU usage of both `nginx` and `mario` decreases somewhat. The decrease for `nginx` (65% -> 52%) is attributed to the lower QPS reducing its load, partially offset by the larger request size, resulting in a net decrease. `mario`'s more significant drop (100% -> 47%) indicates that its overhead is more sensitive to QPS than to request size.
From this case, we can conclude that the maximum single-core QPS of the `detector` is about 10,000.
Next, let's examine `tengine`'s performance. To push it to its capacity, we allocate more resources to `detector` and `mario`, setting each of them to 4 cores.
`nginx` can support up to 28,000 QPS on a single core.
However, the load on the `detector` rises to 326%, consuming more CPU than its single-core figure of 10,000 QPS would predict. This suggests additional overhead from multithreading synchronization.
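The gap can be quantified with a quick calculation, using only the figures measured above:

```shell
# At 28,000 QPS the detector should need 2.8 cores if it scaled linearly
# from its 10,000 QPS single-core figure; compare with the observed 326%.
awk 'BEGIN {
  qps = 28000; per_core = 10000
  expected = qps / per_core * 100       # percent of one core
  observed = 326                        # measured above
  printf "expected=%.0f%% observed=%d%% extra=%.0f%%\n", expected, observed, (observed - expected) / expected * 100
}'
```

So the detector burns roughly 16% more CPU than linear scaling would predict, plausibly the cost of cross-thread synchronization.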
Moving on to `mario`'s single-core limit: since overloading it causes data to accumulate, we cannot rely on CPU limits alone. A brute-force approach is to binary-search for a QPS value, testing each candidate with `wrk2`'s fixed-rate mode, looking for the point where `mario`'s CPU usage approaches 100% without continuous memory growth.
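The search procedure can be sketched as follows. The `sustainable` function is a stub: in practice it would run `wrk2 -R "$rate"` for a while and then check `mario`'s CPU and memory; the 11,000 threshold here only makes the sketch self-contained.

```shell
# Binary-search the highest sustainable request rate (sketch).
sustainable() {
  [ "$1" -le 11000 ]    # stub for: run wrk2 at this rate, verify mario is stable
}
lo=1000; hi=20000
while [ $((hi - lo)) -gt 500 ]; do
  mid=$(( (lo + hi) / 2 ))
  if sustainable "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "max sustainable QPS is roughly $lo"
```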
The test results here are intriguing. With `mario` limited to 1 core, its memory struggles to stay stable even at 10,000 QPS, so we relax the limit and allocate 2 cores. The most reliable figure obtained: at 11,000 QPS, `mario` maintains approximately 100% CPU usage without continuous memory growth.
Yet this raises a puzzling point: the `detector` was previously measured to support up to 10,000 QPS per core, but at 11,000 QPS its CPU usage reaches 247%, which is unexpected.
Test Summary
The performance of the three key services (with a 1 KB body in each request) is summarized in the following table:

| Service | Max single-core QPS |
| ------- | ------------------- |
| tengine | ~28,000 |
| detector | ~10,000 |
| mario | ~11,000 |
Based on this table, we can estimate a comprehensive single-core QPS figure, i.e., the approximate QPS that SafeLine could sustain if deployed on a machine with only one CPU core. The calculation can be done as follows:
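A plausible reconstruction of that arithmetic, assuming each request consumes 1/QPS of a core in every service so the per-request costs simply add across the three services:

```shell
# A request costs 1/28000 of a core in tengine, 1/10000 in the detector,
# and 1/11000 in mario; on one shared core the sustainable rate is the
# reciprocal of the summed per-request costs.
awk 'BEGIN {
  t = 28000; d = 10000; m = 11000       # measured single-core QPS figures
  printf "%.0f\n", 1 / (1/t + 1/d + 1/m)
}'
```

Under this model, a single-core deployment tops out at roughly 4,400 QPS for requests with a 1 KB body, with the detector dominating the per-request cost.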