There are two things that look fishy to me in those results:
There are almost no differences between the server implementations. Not saying it's impossible, but differences that small between server implementations are very unlikely.
The absolute number of handled requests per second is very low - I was able to easily get 250k HTTP requests per second handled by an Actix-web server, loaded with ab, on a laptop, so more than two orders of magnitude faster. I also got 100k+ req/s from some server implementations in Go (though those were quite different from the Rust one).
This suggests there was a common bottleneck outside of your server implementations, and you've measured the performance of that bottleneck, not the servers. Which also means the results are probably inconclusive and you can't interpret them as "Rust has won".
I looked quickly at your code and it seems you're opening a new connection for each request. This typically adds a large amount of latency and system load to each request and might become a problem, particularly at low concurrency levels like 100.
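For reference, ab can reuse connections via the -k flag — a sketch, with the URL/port as placeholders rather than your actual endpoint:

```sh
# Same 10k requests at concurrency 100, but with HTTP keep-alive (-k),
# so each of the 100 client connections is reused instead of re-opened per request.
ab -k -n 10000 -c 100 http://127.0.0.1:8080/
```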
A few suggestions for better benchmarking:
For a throughput comparison you need to verify that the servers are really working at their full speed, so you should capture CPU load. It is also good to capture other system metrics like system CPU time, cache misses, context switches and syscalls, which are often a good indicator of how efficiently the server app interacts with the system (a sketch of this follows after these suggestions).
Cache connections and leverage HTTP keep-alive. That makes a tremendous difference in throughput.
Play with different concurrency levels. If concurrency is too low and latency is too high, you won't get the max throughput. The server would simply wait for requests, handle them quickly and go idle waiting for more. Also switching between idle and active is costly (context switch).
In latency tests, the latency median is not as interesting as the full histogram. I'd expect large differences in P99 between GCed and non-GCed servers, so even if the medians are very close, it doesn't mean the servers would work equally well in production. Obviously you should do latency tests at lower throughput than max, so those should be separate experiments.
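A minimal sketch of what that could look like on Linux — the process name, port and numbers are placeholders, not your actual setup, and perf needs to be installed with suitable permissions:

```sh
# Find the server process (hypothetical name).
SERVER_PID=$(pgrep -f myserver)

# Sweep concurrency levels with keep-alive enabled, and capture CPU time,
# context switches and cache misses of the server process while ab runs.
for C in 100 500 1000 2000; do
  echo "== concurrency $C =="
  perf stat -e task-clock,context-switches,cache-misses -p "$SERVER_PID" -- \
    ab -k -n 100000 -c "$C" http://127.0.0.1:8080/ \
    | grep 'Requests per second'
done
```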
Anyway I'd love to see updated results, because you seem to have put a lot of work into multiple implementations and it would be a pity if you stopped now ;)
When I started the series, I did want to capture more metrics, but that kept getting pushed back, and it's been months, so I decided to at least do something simple. The source is on GitHub, so if you are interested, feel free to use it and publish a follow-up; I might not have time anytime in the near future due to other commitments, and the metrics you are suggesting will take a lot of effort and time to do properly. The bottleneck is the sleep introduced in the handler: at concurrency 100, the 10k requests are served in 100 sequential rounds, each waiting out the sleep, so theoretically 25 seconds is the best possible for this code. If I remove the sleep, this is the result for the same 10k requests with 100 concurrent connections:
Concurrency Level: 100
Time taken for tests: 0.309 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 2830000 bytes
HTML transferred: 1760000 bytes
Requests per second: 32344.98 [#/sec] (mean)
Time per request: 3.092 [ms] (mean)
Time per request: 0.031 [ms] (mean, across all concurrent requests)
Transfer rate: 8939.09 [Kbytes/sec] received
Concurrency level 100 is way too small. Try something in the range 500-5000. Beware that ab is not a good tool for testing high concurrency.
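Something like wrk tends to behave better at high connection counts — a sketch, with the URL and numbers as placeholders (wrk2 additionally lets you fix the request rate with -R for latency runs):

```sh
# 2000 kept-alive connections across 8 threads for 60 s;
# --latency prints the latency distribution (P50/P75/P90/P99), not just a mean.
wrk -t8 -c2000 -d60s --latency http://127.0.0.1:8080/
```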
Ya, I wasn't expecting people to take this simple experiment of mine so seriously. I'll try to update the tests to something better.
I have updated the benchmarks with more data. WDYT now?
Never run client/server benchmarks on the same computer.
The process generating the load will inevitably impact the process serving the requests.
The best infra for benchmarking is two independent physical machines. Not even VMs, as they also compete for CPU resources.
Honestly I didn't expect people to take this so seriously or even for the post to do well. I was just wrapping up a series that was taking a lot of effort without getting much interest in terms of views. But man, this blew up. Now I think I have to rework this into something better 😂
That's what happens when you publish benchmarks! 😂
lesson learned :P
Depends on how efficient the load generation tool is versus how much work is required on the server side to handle the request. You can also pin those two processes to different CPU core sets; this way one computer is enough to get meaningful results. Obviously, if you don't know what you're doing, it is better to use two separate machines.
Ya, in this case the server is quite simple and doesn't need many resources, which might explain why I got similar results from both. I would be interested in learning more about pinning processes to cores. Do you have any resources you can recommend?
man taskset
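For example, a minimal split on a single box (a sketch, assuming an 8-core machine, a placeholder server binary, and ab as the load generator):

```sh
# Pin the server to cores 0-3 and the load generator to cores 4-7,
# so the two processes don't compete for the same CPU time.
taskset -c 0-3 ./myserver &       # hypothetical binary name
taskset -c 4-7 ab -k -n 100000 -c 500 http://127.0.0.1:8080/
```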