Everyone can code these days. In the age of the internet, it's very easy to find all the material you need to learn how to write code. Invest a few months in it and you can start looking for an entry-level developer position. No need for formal education, no need to spend years in universities or colleges - anyone can get a golden collar in this golden age of developers!
While anyone can learn how to write code, not everyone can write quality code. Furthermore, even fewer can write efficient code or make efficient design decisions. And that is all natural, all expected. As time goes by, the IT field (and any other, for that matter) becomes more and more diverse. We know more; there are more topics to cover, to know in detail. The breed of true full-stack developers is getting closer to the brink of extinction every day. They are being replaced by specialized developers: devops, noops, servops, secops, perfops, *ops. In this post, I'll try to explain the need for performance engineers (perfops): who we are, what we do, how we do it and why you need us.
TL;DR: A performance engineer is there to balance all the scales in the closed system so that it manages to serve more requests with less spent on resources.
Let's look at the basics first. What is it we are doing with our applications? Why do we write them? Usually, it's because we want to automate some part of the domain. This automation requires computing resources (it used to require human resources before!): CPU time, memory, disk storage, network throughput. Our application translates those resources into the results of some work done. These results are what application users need. So application users (indirectly) demand computing resources to get some work done.
And then there are funds. We sell our resources to the users because we get paid for the work our application does for them. However, we have to spend some funds to purchase the resources for our application to run on. Naturally, we want our expenses to be as low as possible and our income to be as high as possible - we want high profits.
Problem is, resources are always limited. The more/better resources we purchase, the more they cost us. However, there's a good chance we'll be able to serve more requests with more expensive resources. It's a common misconception that we can always buy more resources as we need them. Yes, we can, but there is a point at which each new portion of resources we purchase and add to the system will have a questionable benefit, or even worse - will bring us harm.
Another problem is that demand for resources is always UNlimited. It means that there is a potential for unlimited profits. However, since resources are limited, our profits will also be limited.
To squeeze the most out of this situation, we must find a way to use our resources as efficiently as possible: spend as little on resources and satisfy as many requests as possible.
This efficiency is what performance is all about: using limited resources to satisfy unlimited demands as efficiently as possible. And efficiency is (directly or indirectly) measured by... funds. After all, money is what makes this world go round :)
Now, because our applications are usually very complex, it is not that easy to tune them for optimal efficiency. Even worse - they are always changing: new chunks of code are introduced every sprint, infrastructure is always changing (networks, internet, servers, disks, etc.). And demand patterns are never the same: is it efficient to have 20 servers running at noon when load peaks - probably yes; is it efficient to keep them running overnight when no one's using them - probably not; is it efficient to have servers in the AMER region when business is expanding to APAC and EMEA - it depends. And there are many, many more variables to it. We want to juggle them right to have the biggest profit possible - that's what we need performance engineers for. We, performance engineers, juggle all those variables, tilt all the scales we have at our disposal one way or another to utilize our resources as efficiently as possible and satisfy as many requests as possible. As my colleague likes to say, "we want the work done good, fast and cheap" :)
As I've already laid out in the previous chapter, money is what makes the world go round. Consequently, the goal of PE is to change/maintain the closed system of the application so that, over a period of time:

1. the application generates more income than it incurs expenses
2. either of the two happens:
   - the application generates more income
   - the application requires fewer expenses
3. the application generates more income AND requires fewer expenses

Once we have 1 accomplished, we want to make it better (2) by either pushing up the income side or pulling down the expenses side of the scale. Once we achieve that, we lock ourselves in loop 3 and work on growing the profits by both pushing and pulling the scales.
Even though our system is closed, it's always changing:
- we accumulate more data,
- we deploy code updates,
- we change our configurations,
- we (or someone else) make infrastructure changes,
- we want to migrate to another infrastructure,
- we get more/fewer requests to serve,
- we get requests from other parts of the world,
- wind changes direction (and breaks a tree, which cuts power to our main datacentre, so we have to failover to our DR site ASAP).
There are many forces at play. All these forces change the efficiency of our application - usually for the worse. On the other hand, new technology solutions are released quite often, and they tend to be cheaper than their predecessors, which aligns with our goals. However, applying those new technologies itself introduces changes in our closed system.
Changes can be either good or bad for our application's efficiency. Performance-wise, changes can introduce:
- slowdowns
- speedups
- new/more errors under pressure
Interestingly enough, both slowdowns and speedups can be either gain or loss when it comes to performance. Read on.
I like to divide the "WHEN" into 3 parts of the application lifecycle:
- Before GO-LIVE - the application is under active development and it has not been released to the public yet
- GO-LIVE - we have finished developing the first decent version of the application and we are about to release it to the public
- After GO-LIVE - the application is already released to the public and it's being actively used
At this phase, the application code is constantly mutating. It's a green-field project that's in the process of taking shape. It hasn't been clearly structured yet - even though we know what it has to do, it's still a shapeless mass that somehow does what it's expected to. This shapeless mass is constantly mutating and slowly gaining notions of what it's supposed to look like.
Even though the application is still in chaos mode, we do still care about its performance. The Go-Live deadline is coming and we cannot leave perf testing to the end of this phase. What if performance is poor? What if we can only serve 10 requests at a time, while we need 10000? Will we have to rewrite almost everything or just several parts of the code? Have we made a poor architecture choice?
At this phase, we want to test application performance periodically. We don't need frequent tests, as there's too much going on already. There's a good chance that some performance problems will be resolved without being ever noticed. However, we want the core of our system to be stable and we want to know that we are going in the right direction with our changes. By testing performance before Go-Live we want to answer questions:
- are the application architecture/libraries/code/infrastructure capable of delivering reasonable performance?
- will we have to rewrite 80% of the app to fix performance?
- are the feature design/libraries/code capable of delivering reasonable performance?
Naturally, we don't want to rewrite 80% of our code, and that's why we should care about performance this early in the cycle. We want to spot our mistakes as early as possible so that we don't make inefficient code a building block of other parts of the code. Rewriting a slow function is easy. Redesigning a slow solution, on the other hand, is not - and it's expensive. And it's VERY stressful when we have to do it days before the due date.
Our application now has a firm shape and we know what it does. We know it performs well and meets or exceeds our SLAs. Great! Now what? We buy new servers, deploy the application and make it public, right? Not yet...
At this phase, we want to know how many resources we require to maintain application efficiency in PROD. Prod is a different infrastructure, could be a different location in the world, could be a different IaaS vendor, could be different ISP, different load patterns, etc. The production environment is going to be different from what we had in our sandbox. So we need to know how/if these differences impact our performance and how to adapt to them so the application is as efficient in prod as it was in our dev environment.
Testing performance when going Live should answer these questions:
- how large a PROD infrastructure do you need? (horizontal measure)
- how powerful a PROD infrastructure do you need? (vertical measure)
- what middleware (JVM, Nginx, etc.) settings to choose?
Only when we have those answers can we go ahead and configure our Prod environment, deploy our application on it and proceed with the go-live.
Champagne corks are popping, glasses are clinking, everyone is celebrating - we are LIVE! But there's no time to relax. The application was passing our real-life simulation tests, but were they accurate enough? Have we covered everything? Before Go-Live we had 6 testers. Now we have what - thousands? millions? billions of testers? Oh, mark my words - they will find ways to give you a hard time by using your application! Features used wrong, use-cases not covered by tests, random freezes, steady performance degradation, security vulnerabilities,... You were playing in a sandbox. Now suddenly you find yourself in a hostile world, that plays by similar rules to those you are used to, but not quite the same.
While you're busy patching bugs here and there, don't forget to include performance testing. We need performance testing after go-live to tell us:
- do new code/version/component changes slow us down?
- why is the application randomly freezing?
- why is the application slower than it was before?
- can we reduce our infra to save €€€ and still deliver SLA?
- do we need more/larger/different infra to deliver SLA?
And the application will keep on slowing down every day. That's mostly because of the data you will be accumulating. While databases are quite good at coping with growing datasets, before go-live you were focusing on a short TTM, not on the most efficient data processing algorithms. That is normal and expected. What must also be normal is having means in place to detect such inefficiencies: perf testing with a prod dataset, after go-live.
People from other projects often come to us: "Hey, look, we have a performance SLA with our client and we are not meeting it. We need performance testing. Where do we begin? What do we need?" It's all okay, a developer does not necessarily have to know how to introduce perf testing. FWIW, a dev may not even need to know how to test application performance in the first place. There are performance engineers for that! (however, I'd like developers to at least have notions of how to write efficient code...)
Performance testing can be introduced before or after go-live. It can be introduced even years after the go-live.
Now you have to sit down with the client and agree on the performance testing plan. Hear out the client's expectations for the system, communicate all the requirements. Prepare for this meeting in advance, because after it you will have to know what, how and when you are going to test, and what you expect to see/put in the results/reports.
SLI, SLO, SLA. Decide with your client what test metrics you want to monitor (SLI), what observed values are acceptable (SLO) and what fraction of all the requests must be in the acceptable range (SLA). For SLIs you can consider transaction response times (RT), transactions per second (TPS), error rate (ERR), number of completed orders, number of completed e2e iterations, etc. SLOs should define the expected max RT of each transaction, the minimum TPS to be achieved during a test, and the number of errors tolerable per test or per transaction. SLAs could be: "All RTs' 90th percentile does not exceed 2 seconds"; "We make at least 200 TPS during the Steady phase"; "We make 1500 orders during ramp-up and steady test phases combined", etc.
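To make the SLA check concrete, here is a minimal sketch (class and method names are mine, not from any real tool) of validating a response-time SLA like "the 90th percentile does not exceed 2 seconds":

```java
import java.util.Arrays;

public class SloCheck {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are <= it.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Illustrative response times collected during a test run, in seconds
        double[] rt = {0.4, 0.7, 1.1, 1.9, 0.9, 2.4, 0.8, 1.2, 1.0, 0.6};
        double p90 = percentile(rt, 90);
        // SLA: "All RTs' 90th percentile does not exceed 2 seconds"
        System.out.println("p90 = " + p90 + "s, SLA met: " + (p90 <= 2.0));
    }
}
```

Real load-testing tools compute these percentiles for you; the point is that an SLA is just a mechanical check over SLI samples, agreed with the client in advance.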
Load patterns. This part is a lot easier after go-live, because you have historical data to base load patterns on. Either way, the goal here is to simulate real-world load. It's normal that load varies throughout the day/week/month/seasons, and you might want to design different load patterns to simulate each load variation - or at least a few of them. When designing a load pattern, consider:
- how many users will be using the system at the same time?
- ramp-up -> steady -> ramp-down patterns: how long will you be warming your system up for? At what rate will you simulate new users logging in? How long will the test last? How long will your simulated "load dropping" phase be?
- stress, endurance tests - do we want to know how much the system can handle? Do we want to know whether its performance is stable under pressure for long periods of time?
- testing schedule - how often will you be testing? At what time (wall clock; consider real users using the system in parallel while you're testing)?
- anything else that comes to mind?
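The ramp-up -> steady -> ramp-down shape can be sketched as a simple function from elapsed time to the number of active VUsers (all the numbers and names below are illustrative; your load tool will have its own scheduler):

```java
public class LoadProfile {
    final int maxVUsers;
    final long rampUpSec, steadySec, rampDownSec;

    LoadProfile(int maxVUsers, long rampUpSec, long steadySec, long rampDownSec) {
        this.maxVUsers = maxVUsers;
        this.rampUpSec = rampUpSec;
        this.steadySec = steadySec;
        this.rampDownSec = rampDownSec;
    }

    // How many virtual users should be active tSec seconds into the test?
    int activeVUsers(long tSec) {
        if (tSec < 0) return 0;
        if (tSec < rampUpSec)                     // linear ramp-up (warm-up)
            return (int) (maxVUsers * tSec / rampUpSec);
        if (tSec < rampUpSec + steadySec)         // steady plateau
            return maxVUsers;
        long tDown = tSec - rampUpSec - steadySec;
        if (tDown < rampDownSec)                  // linear "load dropping" phase
            return (int) (maxVUsers * (rampDownSec - tDown) / rampDownSec);
        return 0;                                 // test finished
    }
}
```

For example, a profile of 1000 VUsers with a 10-minute ramp-up, 1-hour steady phase and 5-minute ramp-down would have 500 users active 5 minutes in, and all 1000 for the whole steady hour.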
Performance tests should simulate real-life users on a real-life application, running in a simulated (or not) real-life environment with simulated real-life data. In real life, users don't usually keep on pressing the same button tens of thousands of times in a row. Users tend to log into the system first, then they might browse some catalogues, maybe get lost in the menus, open up some pictures until they find what they like (or leave). Then they add that item to their cart, maybe add something else too. Oh, did the user change their mind? Sure, why not! Let's clear out the cart and start anew, with some other product in the shop! Then <...>.
It is very important to choose real-life-like user flows for your tests. Even better - make several flows and run them in parallel - after all, not all users are going to do the exact same thing. How do you design your flows? Well, application access logs will help you a lot here. Analyze user behaviour per session, find the most common patterns and embed them in your test flows. You might find some Analytics data very useful too.
When it comes to e2e flows in performance tests, we usually need more than one. I like to have:
- main flows - the most obvious, most followed flows in the production environment
- background flows (noise) - additional actions in the system, could be chosen randomly, that generate background noise. You might want to embed some rarely used application calls here. This way you add additional stress on the system AND test how less used features behave under pressure.
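Splitting a fixed VUser budget between main and background (noise) flows can be sketched like this (the names and the 70/30 split are illustrative assumptions, not a recommendation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlowMix {
    // Split totalVUsers across flows proportionally to their weights;
    // any leftover users from rounding down go to the first flow.
    static Map<String, Integer> assign(Map<String, Integer> weights, int totalVUsers) {
        int weightSum = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> result = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int n = totalVUsers * e.getValue() / weightSum;
            result.put(e.getKey(), n);
            assigned += n;
        }
        String first = weights.keySet().iterator().next();
        result.put(first, result.get(first) + (totalVUsers - assigned));
        return result;
    }
}
```

With 500 VUsers and weights of 70 (main browse-and-order flow) to 30 (background noise), you'd run 350 main-flow users alongside 150 noise users.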
Your e2e flows will require some test data. What accounts are you going to use? You will need lots of them in your test. What entities will all these virtual users work on? What entities will they have to share? If you are testing an e-shop with products, you might want to not run out of products in your stock during (or after) the test. And you might not want to actually pay for or ship them... So securing test data is a very important, tedious and difficult task. Work closely with developers to choose the right items for the test.
During your test, you might provoke the application to make calls to external systems, like shipping, tracking, stock, vendors. You might not want some of those calls, because:
- you are not perf-testing external systems - you are testing your system;
- external services (vendors) might not like being perf-tested and you might be throttled or banned
- you might make state changes you cannot undo (e.g. create a shipping request at FedEx)
For this reason, you might want to stub out some external calls - either completely or selectively, so that only your test requests are routed to mocks. This means the application (or proxies, like Nginx) might need some changes to capture your test requests and treat them slightly differently.
A word of caution - don't push too far with stubbing. Testing stubs is not what you want, so don't be too liberal with them. Only stub out what has to be short-circuited, nothing more.
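One common way to do this selective routing is at the proxy layer. The sketch below uses Nginx; the `X-Perf-Test` header name and the upstream names are my assumptions - pick whatever convention fits your setup:

```nginx
# Route only requests carrying a test-marker header to the stub;
# real traffic keeps flowing to the live integration.
# "shipping_live" / "shipping_stub" are hypothetical upstream names.
map $http_x_perf_test $shipping_upstream {
    default "shipping_live";   # no header -> real shipping service
    "~."    "shipping_stub";   # any non-empty value -> local mock
}

server {
    listen 80;

    location /api/shipping/ {
        # NB: using a variable in proxy_pass requires the names to be
        # defined as upstream{} blocks (or a resolver to be configured).
        proxy_pass http://$shipping_upstream;
    }
}
```

The advantage of doing this at the proxy is that the application itself stays untouched: only the load scripts need to add the marker header.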
So now you have your testing plan, you know what flows to test and you have data and mocks in place. Now what?
- Write test scripts - translate e2e flows and test data into test scripts. One script is supposed to reflect a single e2e flow.
- Assemble scripts into tests - combine multiple scripts into tests. In a single test, you might want to run different e2e flows with a different number of users. Set all those parameters to make your tests reflect real-life load on the system. E.g. how many users are browsing the catalogue? How many users are running the complete login-order flow? How many users are getting lost? How many users are generating background noise? etc. A piece of friendly advice: create Sanity tests as well (run them for several minutes before the actual tests). Don't assign many VUsers to them. They are needed to warm up an environment and make sure everything is running smoothly and is ready for the load test.
- Provision environments - this requirement often takes clients by surprise. For perf testing you require 2 isolated environments (unless you are testing in prod - then you only need 1 additional environment):
- testable prod-like environment (PLAB) with prod-like data (quality and quantity) - this is where your application is deployed and running; all your virtual users (VUsers) will be sending requests to this environment
- testing environment - load generators (1 server per 1,000 VUs) - this is "the internet": your VUsers' requests will be sent from this environment. This must be a separate environment, because generating load (sending requests) also puts a load on the infrastructure (networks, hypervisor, servers, etc.). Generating load from the same infrastructure that you are testing will always yield false test results.
- Set up telemetry, persistent logging - I cannot stress this part enough. Set up monitoring everywhere you can. Monitor everything you can. NO, this is not excessive and I am not exaggerating. You will thank me later. It is always better to have data you did not need than to need data you didn't think you would. Monitor everything you can: networks, servers, application, memory, CPU, request sizes/counts, durations, depths,... Everything you can. Don't be cheap on monitoring, because it is your eyes. Load tests will only be able to tell you that you have performance degradation. Monitoring will help you identify where and why. I have experienced cases where we spent months pin-pointing an issue because the client didn't think it was worth spending an additional $7/month for monitoring one more metric...
- Run tests - now you have your tests in place, and you know the schedule when to run them. When the time comes - execute your tests. You don't have to look at the results all the time - just glance at the screen every now and then to see if the test is still running and the application hasn't crashed. Don't chase every tiny rise and fall. They will even out in the end.
- Collect, analyze and compare run results - after a successful test, collect test results (SLIs) and compare them against SLAs. If applicable, compare to previous tests' results. Comparative analysis works very well here.
- Draw conclusions (better/worse/status quo) - do your SLIs stay within SLAs? Do you have performance degradation compared to previous test runs? Was this run, perhaps, a better one? Are there any suspiciously good response times? Maybe you didn't validate the response body/headers/code in your scripts and the application was actually returning errors? Don't trust unusually good results right away.
- Use telemetry, logs, dumps, live data from the OS and other data to identify causes of degradations - if you have degradations, correlate your metrics on the same timeline along with test results: see what happened when your response times peaked or errors occurred. If required and possible, try a re-run and capture thread/heap dumps of the application to carry out more extensive analysis. Are your thread pools too small? Does synchronization slow your threads down? Are responses from an external component/system too slow? Are you hitting network limits? You can find answers in thread dumps. Memory-related answers lie in memory telemetry (server, JMX) and heap dumps (core dumps).
- Be creative - know how to tickle the application to make it cry. To provoke some performance problems you may have to alter your scripts or tests, or perhaps intervene in the environment manually during the test. Before running a test, prepare a testing plan: what you want to achieve, what tests you need to run to achieve that, and what each outcome of each test means in your troubleshooting plan. Don't run useless tests or tests of little benefit. Don't waste resources, time and money. Decide on what tests you need in advance and stick to your plan.
- Come up with an approach to fix it - the fix can be anything that cuts it: additional caching, reducing the number of calls (code), server/service relocation to a different provider/infra/geographical location, version update, code fix, architecture modification, configuration change, cluster rebalancing, SQL plan change, DB index creation/removal, etc. Literally, it can be anything that solves the problem.
- fix it yourself - if you can or know how (in some projects/teams you might not have the access, approval or competency to apply fixes yourself)
- recommend to domain teams/the client how to fix it - THAT you can always do: prepare a testing report and include recommendations to dev/infra/dba/other teams on how to fix the performance problem. You can even include a PoC proving that your proposals do in fact alleviate the problem or eradicate it completely. Describing the origin of the issue might help the client understand it better and perhaps choose a different fix - one that suits them better.
- And retest - once you have the fix in place, run the test again to confirm the issue is no more. Fixing one problem is likely to surface some other problems. Always retest after applying a fix to be sure everything is in order.
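The compare-and-conclude steps above boil down to checking a run against a baseline. A tiny sketch (the 5% tolerance is an arbitrary example; agree on your own threshold):

```java
public class RunCompare {
    // Positive result = slower than baseline by that fraction (e.g. 0.2 = 20% slower).
    static double relativeChange(double baseline, double current) {
        return (current - baseline) / baseline;
    }

    // Verdict for a single SLI, e.g. the 90th-percentile response time.
    static String verdict(double baselineP90, double currentP90, double tolerance) {
        double change = relativeChange(baselineP90, currentP90);
        if (change > tolerance)  return "worse";       // degradation: investigate
        if (change < -tolerance) return "better";      // improvement: verify it's real
        return "status quo";                           // within noise tolerance
    }
}
```

Remember the caveat about suspiciously good numbers: a "better" verdict deserves the same scrutiny as a "worse" one, because unvalidated error responses are often very fast.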
The work of a performance engineer can be summarized into 3 Ps. PPP sums up the explanation of what we do and why clients need us.
- Predict (requirements, pitfalls)
- Protect (from degradations)
- Progress (maintain performance with fewer expenses; improve performance with reasonable/unchanged/fewer expenses)
Performance engineering is a challenging role, requiring extensive knowledge and skills in programming, infrastructure, middleware, databases, algorithm theory and architecture design. Over the years in the industry I have learned many things, but working as a PE has taught me some things that might sound controversial and unintuitive. This is why I like to call them "the secret sauce".
All systems have bottlenecks. No matter the scale, size, location or resources available - there always, ALWAYS are bottlenecks. Some of the most common bottlenecks are:
- network - network latency, limits, connection handshakes - all these are slow
- database - databases are extraordinary "creatures", as they manage to maintain more or less stable response times regardless of the amount of data they contain. However, this comes at a price - memory, computing power and concurrency. I have seen the most powerful database in the industry (Oracle) brought to its knees, with nothing left for us to tune. We'd hit the physical limits.
- disk IO - be it logging, caching, database or anything else - disk speeds have always been, are and probably will be one of the infamous bottlenecks
- number of CPU cores - becomes a problem when you need to synchronize work between them. The more cores you have, the more synchronization will become a bottleneck for you. The best way to maintain a critical section is not synchronization - it's avoiding the critical section in the first place. There are many methods around it, both in local (code - threads) and distributed (microservices, distributed monoliths) systems. However, each of them has its drawbacks - there is no holy grail.
- CPU speed - this doesn't need an awful lot of explanation.
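To illustrate the cores/synchronization point: the sketch below contrasts a single lock-guarded counter, where every thread queues on one monitor, with `java.util.concurrent.atomic.LongAdder`, which keeps per-thread cells and only combines them on read. Both counters produce the same totals; they differ under contention:

```java
import java.util.concurrent.atomic.LongAdder;

public class Counters {
    private long total;                               // shared state, guarded by one lock
    private final LongAdder adder = new LongAdder();  // striped, mostly contention-free

    // All threads serialize on the same monitor - the critical section itself.
    synchronized void incrementLocked() { total++; }

    // Each thread tends to update its own cell - the critical section is avoided.
    void incrementStriped() { adder.increment(); }

    synchronized long lockedValue() { return total; }

    long stripedValue() { return adder.sum(); }       // combine the cells on demand
}
```

The trade-off is typical of avoiding critical sections: writes get cheaper, but reads (`sum()`) become a little more expensive and are only weakly consistent while writers are active.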
We have now established that there always are bottlenecks, in any given system. We might as well say, that any given system is a system of bottlenecks. There is no way to solve all of them. The best we can do is to change how severely they impact us. Usually, we only want to address bottlenecks that are giving us a hard time and leave the others be.
- reduce or remove - this is quite intuitive. Reducing or removing a bottleneck is expected to improve application performance.
- increase or create - this is counter-intuitive, isn't it? However, that doesn't make it any less true. Some bottlenecks are more severe than others, some cause more degradation than others. Removing one bottleneck may shift more load onto another - one that was not worth your attention before now becomes the worst performance hog in the system. Applying the same principle, we can relieve some severe bottlenecks by introducing new ones (yes, `Thread.sleep(20)` is a perfectly viable solution to some performance problems), or by making existing ones more aggressive (e.g. reduce the thread pool size to prevent contention - the price you pay for heavy synchronization).
Performance engineers, once they have test results, telemetry metrics and a good understanding of the architecture and domain, can make an educated call to alter one or more of the bottlenecks in the system to improve performance. Knowing which ones to shift, and how, is why we are getting paid :)
This sounds like a playfully worded phrase, but it is very true in PE. When people encounter a performance problem, they immediately think "let's add more resources", "let's buy a bigger server", "let's increase the pool size", "let's add more CPUs", "let's add more memory", ... They are not wrong... usually. But the "bigger and stronger" approach only gets you so far. There are many types of resources and each type (even more - each combination) has its own threshold, after which adding more resources gives you less performance benefit, or even introduces a performance loss. In such cases, we want to have fewer resources in some particular areas of the system. Or, perhaps, we've been inflating the wrong resource all this time - maybe reducing CPU and adding more memory would give us more benefit!
Sometimes less is more.
This is a sum-up of all the previously listed secrets. Performance engineering is a challenging job, requiring a good understanding of various aspects of the system: how it works, how it's made. There are oh-so-many variables to juggle.
Balancing funds and resources. As I stated at the beginning of this post, resources are always limited, but demand for them is always unlimited. Physics and funds are the main factors limiting resources: physics sets hard limits for vertical and horizontal scaling of a given resource, while funds define soft limits - how close we can approach the hard limits. The closer to the limits, the more expensive the resources are - but, if used well, they also have the potential for more income.
Balancing bottlenecks. Each system has plenty of bottlenecks in it. Some are more severe, others are less. Shifting the configuration of those bottlenecks in order to improve performance is also like balancing the scales: bottlenecks on the left, bottlenecks on the right and performance is the level of the scales.