An intro to load testing & some distributed systems learnings

#todayilearned #distributedsystems #testing

TL;DR notes from articles I read today.

An introduction to load testing

Load testing is done by running the software on one machine (or cluster of machines) to generate a large number of requests to the webserver on a second machine (or cluster).
Common parameters to test should include server resources (CPU, memory, etc) for handling anticipated loads; quickness of response for the user; efficiency of application; the need for scaling up hardware or scaling out to multiple servers; particularly resource-intensive pages or API calls; and maximum requests per second.
In general, a higher number of requests implies higher latency. But it is a good practice in real life to test multiple times at different request rates. Though a website can load in 2-5 seconds, web server latency should typically be around 50-200 milliseconds. Remember that even ‘imperceptible’ improvements add up in the aggregate for a better UX.
As a first step, monitor resources - mostly CPU load and free memory.
Next, find the maximum response rate of your web server by setting desired concurrency (100 is a safe default but check settings like MaxClients, MaxThreads, etc for your server) and test duration in any load testing tool. If your software only handles one URL at a time, run the test with a few different URLs with varying resource requirements. This should push the CPU idle time to 0% and raise response times beyond real-world expectations.
Dial back the load and test again for how your server performs when not pushed to its absolute limit: specify exact requests per second, and cut your maximum requests from the earlier step in half. Step up or down requests by another halfway each time till you reach your maximum for acceptable latency (which should be in the 99th or even 99.999th percentile).
Some options among load-testing software you can explore - ab (ApacheBench), JMeter, Siege, Locust, and wrk2.

Full post here, 13 mins read

Distributed systems learnings

Building a new distributed system is easier than migrating the old system over to it. Migrating an old system is more time-consuming and just as challenging as writing one from scratch. You tend to underestimate the amount of custom monitoring needed to ensure they both work the same way and a new system is more elegant, but you need to decide whether to accommodate or drop edge cases from the legacy system.
To improve reliability, start simple, measure, report and repeat: establish simple service-level objectives (SLOs) and a low bar for reliability (say 99.9%), measure it weekly, fix systemic issues at the root of the failure to hit it, and once confident, move to stricter definitions and targets.
Treat idempotency, consistency and durability changes as breaking changes, even if technically not, in terms of communication, rollouts, and API versioning.
Give importance to financial and end-user impacts of outages over the systems. Talk to the relevant teams and use appropriate metrics, and use these to put a price tag on preventive measures.
To determine who owns a service, check who owns the oncall(the operating of the system). The rest - code ownership, understanding of the system - follow from there. This means that shared oncall between multiple teams is not a healthy practice but a bandage solution.

Full post here, 6 mins read

Get these notes directly in your inbox every weekday by signing up for my newsletter, in.snippets().