SafeLine's Community Edition WAF has been on the market for a while now, and its semantic analysis technology is impressive in both detection accuracy and low false positive rates. The Community Edition is said to use the same detection engine as the Enterprise version, although it might not be the latest release.
## Performance Questions and Testing Approach
The key question remains: does the Community Edition compromise on detection performance? Additionally, how much resource allocation is required to ensure that the WAF doesn’t become a bottleneck under specific traffic conditions on my website? To address these concerns, we decided to conduct a stress test to gauge SafeLine's actual performance and share some data for reference. We also explored potential optimization methods to extract even better performance from the Community Edition within the available resources.
## Test Setup
WAF Configuration: We set up a single site that forwards traffic to a business server running on the same machine.
Business Server: A basic Nginx server returning a simple 200 OK page.
## Testing Tools
- wrk: A basic HTTP performance testing tool.
- wrk2: A modified version of wrk that allows testing with a fixed QPS (queries per second).
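Typical invocations of the two tools look like the following (thread and connection counts here are illustrative, not the exact values used in the test):

```shell
# wrk: find the maximum sustainable QPS (4 threads, 100 connections, 30 s)
wrk -t4 -c100 -d30s http://<waf-address>/

# wrk2: replay at a fixed rate (-R, requests/sec) to observe behavior at a known QPS
wrk2 -t4 -c100 -d30s -R1000 http://<waf-address>/
```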
## Testing Strategy
Our primary focus in this test was the performance of various services related to traffic detection, defined as the maximum QPS that can be supported by a single service occupying a single CPU core.
We used two types of requests: a simple GET request without a request body, and a GET request with a 1K JSON body. The core metric for WAF performance is the number of HTTP requests that can be inspected per second, making QPS a more relevant parameter than network layer throughput.
## Test Process
- Assessing Service Functionality: We started by sending a randomized load of around 1000 QPS of simple GET requests to observe how the load on different services fluctuates with QPS.
The results revealed three services whose load is directly tied to traffic:
- safeline-tengine: A reverse proxy based on Tengine, Alibaba's fork of Nginx.
- safeline-detector: Likely the detection service, receiving requests from Nginx for inspection.
- safeline-mario: Presumably responsible for analyzing and persisting detection logs.
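Per-container load like this can be observed with the standard `docker stats` command (container names shown are the ones from this test; they may differ by install):

```shell
# One-shot snapshot of CPU/memory usage for SafeLine's traffic-path services
docker stats --no-stream safeline-tengine safeline-detector safeline-mario
```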
- Baseline Performance Testing with Simple Requests: We set the CPU usage limit for all services to 1 core and executed `docker compose up -d` in SafeLine's default installation directory (`/data/safeline`) to apply the changes, then used wrk to measure the maximum QPS.
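Limiting a service to one core can be expressed with Compose's `cpus` option; a minimal sketch of the kind of fragment involved (the service names and exact keys in SafeLine's compose file may differ by version):

```yaml
services:
  detector:
    cpus: 1    # cap the container at a single CPU core
```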
The results showed a maximum QPS of 4175, with the detector service hitting 100% CPU usage, indicating it as the first bottleneck.
After analyzing the detector's CPU usage, we noticed that its process, named `snserver`, was multi-threaded, with the number of threads roughly equal to the CPU core count. Because each thread received only a small slice of CPU time, context-switching overhead was high.
To reduce context switching, we modified the detector's configuration file (`resources/detector/snserver.yml`) to reduce the thread count to 1.
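The exact key name inside `snserver.yml` isn't documented here, so the following is only an illustration of the kind of one-line change made (the real key may be named differently):

```yaml
# resources/detector/snserver.yml (illustrative; actual key name may differ)
thread_num: 1
```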
After restarting the detector, we observed a significant performance increase, with QPS rising to over 17,000.
- Performance Testing with Complex Requests: Next, we generated more complex requests using wrk's Lua scripting:
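The original script isn't shown in the post; a minimal wrk Lua script that attaches a roughly 1 KB JSON body would look like this (using wrk's standard `wrk.method`/`wrk.headers`/`wrk.body` globals):

```lua
-- body.lua: send GET requests carrying a ~1 KB JSON payload
wrk.method = "GET"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"data":"' .. string.rep("A", 1000) .. '"}'
```

It would be run as `wrk -t4 -c100 -d30s -s body.lua http://<waf-address>/`.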
After sending requests with a 1K body, we recorded a QPS of just over 10,000.
The detector remained the bottleneck, but Nginx and Mario both showed reduced CPU usage. This suggests that the detector's detection engine requires more CPU resources for larger, more complex requests.
We also tested Mario's single-core performance under load, noting that Mario’s memory usage would continue to increase under high load, posing an OOM (Out of Memory) risk. After fine-tuning, we found that Mario could handle around 11,000 QPS on a single core without significant memory buildup.
## Testing Summary
The performance of the three critical services under load is summarized in the table below (requests include a 1K body):
| Service | Role | Single-Core Max QPS |
|---|---|---|
| safeline-tengine | Reverse proxy | 28,000 |
| safeline-detector | Detection service | 10,000 |
| safeline-mario | Log analysis and persistence | 11,000 |
Based on these results, we can estimate the overall single-core QPS capacity, giving an idea of the load SafeLine can handle when deployed on a machine with limited CPU resources.
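As a rough illustration of that estimate (a sketch using the single-core numbers from the table above; real capacity also depends on the request mix), overall QPS is capped by whichever service runs out of CPU first:

```python
import math

# Single-core max QPS per service, from the table above (requests with a 1 KB body)
SINGLE_CORE_QPS = {
    "safeline-tengine": 28_000,
    "safeline-detector": 10_000,
    "safeline-mario": 11_000,
}

def cores_needed(target_qps: int) -> dict:
    """Cores each service needs so that none becomes the bottleneck."""
    return {svc: math.ceil(target_qps / qps) for svc, qps in SINGLE_CORE_QPS.items()}

def max_qps(cores: dict) -> int:
    """Overall QPS is capped by the slowest stage of the pipeline."""
    return min(cores[svc] * qps for svc, qps in SINGLE_CORE_QPS.items())

alloc = cores_needed(20_000)   # e.g. a target of 20,000 QPS
print(alloc, "->", max_qps(alloc))
# With these numbers: 1 core for tengine, 2 for the detector, 2 for mario
```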
## Potential Optimization Points
- Thread Contention in Detector: The multi-threaded nature of the detector service might introduce synchronization overhead, leading to higher CPU usage.
- Memory Usage in Mario: Mario's memory usage increases under high load, risking an OOM scenario. Addressing this issue could enable further performance tuning.
We hope Chaitin Technology addresses these optimization points in future updates. With these improvements, it would be interesting to see how much further we can push SafeLine's performance under full load, especially on machines with limited memory.