The Day Our Server Growth Hit a Wall and Why I Blame the Runtime

#webdev #programming #rust #performance

The Problem We Were Actually Solving

I still remember the day our server growth hit a wall, and it was not for the reasons we first suspected. We were running a high-traffic search service, and our operators kept hitting performance problems at the same stage of growth. After months of tuning, we finally realized that the problem was not our configuration decisions but the underlying runtime. Our service was built in a language that, while easy to learn and develop with, was not designed with performance and memory safety in mind. I had noticed that our allocation counts were climbing steeply as the service grew, and our latency was suffering as a result. Over the course of a few months, our average latency had increased from 10ms to 50ms, and our allocation rate had risen from 100 allocations per second to over 1000.

What We Tried First (And Why It Failed)

We initially tried to address the issue by tweaking our configuration, adjusting everything from cache sizes to connection pooling. We spent countless hours poring over the Veltrix documentation, looking for the combination of settings that would unlock the performance we needed. No matter how much we tweaked, we could not get there. Our profiler output showed that the majority of our time was being spent in garbage collection, and our allocation counts were still increasing. Garbage collection pauses were lasting up to 100ms and happening every 10 seconds. It was not until we looked at the underlying runtime that we began to understand the root cause: we were using a language designed for ease of use rather than performance, and it was costing us dearly.

The Architecture Decision

At this point we decided to switch to a new runtime, one designed with performance and memory safety in mind. We chose Rust, a language I had been interested in for some time but had hesitated to adopt because of its steep learning curve. After some research and experimentation, I became convinced that its focus on memory safety and performance made it the right choice for our high-traffic search service. We spent several months rewriting the service in Rust, and the results were nothing short of astounding. Our average latency decreased from 50ms to 5ms, and our allocation rate dropped from 1000 allocations per second to fewer than 100.
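To make the allocation drop more concrete, here is a minimal sketch of one common way Rust code keeps per-request allocations low: reusing caller-owned buffers instead of building fresh ones on every request. This is an illustration, not code from our service. The names `tokenize_allocating` and `Tokenizer`, and the idea that tokenizing a query is the hot path, are assumptions made for the example.

```rust
/// Hypothetical baseline: allocates a new `Vec` plus one `String`
/// per token on every call.
pub fn tokenize_allocating(query: &str) -> Vec<String> {
    query.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Hypothetical reusing variant: the buffers live across requests, so
/// once they have grown to a typical query size, further calls usually
/// need no new heap allocations.
#[derive(Default)]
pub struct Tokenizer {
    scratch: String,            // lowercased tokens, stored back to back
    spans: Vec<(usize, usize)>, // byte range of each token in `scratch`
}

impl Tokenizer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Lowercases and splits `query`, returning slices that borrow the
    /// internal buffer. `clear()` keeps the existing capacity.
    pub fn tokenize(&mut self, query: &str) -> impl Iterator<Item = &str> + '_ {
        self.scratch.clear();
        self.spans.clear();
        for token in query.split_whitespace() {
            let start = self.scratch.len();
            for ch in token.chars() {
                // Per-character lowercasing, appended in place.
                self.scratch.extend(ch.to_lowercase());
            }
            self.spans.push((start, self.scratch.len()));
        }
        let scratch = self.scratch.as_str();
        self.spans.iter().map(move |&(start, end)| &scratch[start..end])
    }
}
```

A request handler would keep one `Tokenizer` per worker thread and call `tokenize` for each incoming query. The borrow checker ensures the returned slices are dropped before the buffer is overwritten by the next call, which is the kind of reuse a garbage-collected runtime cannot verify for you.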

What The Numbers Said After

After switching to Rust, our numbers told a very different story. Our allocation counts were down by a factor of 10, and our average latency had also dropped by a factor of 10, from 50ms to 5ms. Our profiler output showed that the majority of our time was now being spent in actual processing rather than garbage collection. We were also seeing far fewer errors, thanks to Rust's focus on memory safety: a 90% decrease in null pointer exceptions and a 95% decrease in out of memory errors. The results were so dramatic that we were able to scale back our hardware, reducing our costs. We also saw an increase in throughput, with our service able to handle 50% more requests per second than before.

What I Would Do Differently

Looking back, I would do several things differently. First, I would have made the switch to Rust sooner, rather than tweaking configuration settings for so long. It was a difficult decision, but it was the right one, and it has paid off in a big way. I would also have invested more time in learning Rust up front, rather than learning it on the fly. The learning curve was steep, but it was worth it, and I am now a big advocate for the language. Finally, I would have been more proactive in monitoring our performance and allocation counts, rather than waiting for our service to hit a wall. That would have spared us a lot of pain and let us make the switch much sooner. For instance, I would have used tools like Prometheus and Grafana to track our performance metrics, with alerts to notify us when our allocation counts or latency exceeded certain thresholds.
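As a rough sketch of what tracking allocation counts against a threshold could look like in Rust, the example below wraps the system allocator in a counter using the standard `GlobalAlloc` trait. The names `CountingAlloc`, `ALLOCATIONS`, `allocs_per_second`, and `should_alert` are hypothetical, and exporting the value to Prometheus is left out because it depends on the metrics crate you choose.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

/// Total number of heap allocations made by the process so far.
pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that counts calls.
/// The default `realloc` and `alloc_zeroed` implementations go through
/// `alloc`, so they are counted as well.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Allocation rate between two samples of `ALLOCATIONS`.
/// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
pub fn allocs_per_second(previous: usize, current: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    current.saturating_sub(previous) as f64 / secs
}

/// True when the measured rate is above the alert threshold.
pub fn should_alert(rate: f64, threshold: f64) -> bool {
    rate > threshold
}
```

A background task could sample `ALLOCATIONS` every few seconds, feed two consecutive samples into `allocs_per_second`, and publish the result as a gauge. With a threshold of 1000 allocations per second, the rate we had reached before the rewrite, `should_alert` would have fired well before the service hit its wall.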