Shreya Dahal

When do you know the performance bottleneck is not your code, but the infrastructure?

We've had cases where we noticed a performance issue with our REST API and started working on optimizing it.

I think the first thing anyone does is blame the code. Maybe we haven't written the SQL queries right? Could we change the algorithm to process the data better? Are we indexing the columns properly?
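
One cheap check for the indexing question, before touching any hardware, is to ask the database for its query plan. Here's a rough sketch using Python's built-in sqlite3 (the table and column names are made up; most databases have an equivalent EXPLAIN):

```python
# Rough sketch: ask the database whether a query actually uses an index.
# Uses Python's built-in sqlite3; table/column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

query = "SELECT * FROM orders WHERE customer_id = ?"

# Before indexing: the plan reports a full table scan.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(row)  # e.g. (..., 'SCAN orders')

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After indexing: the plan switches to an index search.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(row)  # e.g. (..., 'SEARCH orders USING INDEX idx_orders_customer ...')
```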

But at what point do you consider that perhaps it's the infrastructure: should we upgrade the EC2/RDS instance? Add more load-balanced servers?

Upgrading the infrastructure will definitely improve performance, but at a financial cost. We'd always want to squeeze the last bit of performance out of the current hardware by improving the software as much as possible, right?

Latest comments (3)

David J Eddy

As @rhymes pointed out: hardware or software, either way you pay for performance improvements. Also, the more capable your hardware, the more the software will expand to fill it; and the more efficient the software, the less hardware it takes to handle the processing. It is a never-ending tug-of-war.

As for your question: how do you find the bottleneck in a system? The first thing you will need is monitoring and metrics. As @rhymes says, "...There's no magical formula, it takes visibility into your system, monitoring, measuring and expertise...."

How would I find the bottleneck? Load testing with system monitoring. For example: does the CPU max out under X load? OK, if I increase the hardware resources by 50%, can I handle 50% more load? If not, look at the metrics. What resources are being used by which process, and for how long? Which function call in the application takes the longest? (Flame graphs are great for this.)
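
A bare-bones version of that load test might look like this in Python (a sketch assuming psutil is installed; the endpoint and counts are placeholders). If latency climbs while the CPU stays flat, the bottleneck is likely I/O, locks, or the database rather than compute:

```python
# Bare-bones load test with system monitoring. Assumes `psutil` is
# installed; the URL and request counts are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

import psutil

URL = "http://localhost:8000/api/items"  # hypothetical endpoint
REQUESTS = 200

def hit(_):
    start = time.perf_counter()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    return time.perf_counter() - start

psutil.cpu_percent()  # prime the CPU counter
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = sorted(pool.map(hit, range(REQUESTS)))

print(f"p50 latency: {latencies[len(latencies) // 2]:.3f}s")
print(f"p99 latency: {latencies[int(len(latencies) * 0.99)]:.3f}s")
print(f"CPU during run: {psutil.cpu_percent():.0f}%")  # avg since priming call
```

For the flame-graph step, a sampling profiler (py-spy for Python, for instance) can record one straight from a running process.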

In the end, finding the balance between acceptable hardware costs and software performance is a never-ending back-and-forth.

rhymes

> Upgrading the infrastructure will definitely improve performance, but at a financial cost

Well, spending N developer days measuring and optimizing will probably cost you money anyway.

It depends on the tradeoff and the issue. If you have a slow query and the solution is an index, you're obviously not going to gain much by upgrading the amount of RAM on your database server. If the issue is "two nodes are too slow to process these images", you either rewrite the function to make it faster or you add nodes to finish earlier. How do you decide? It depends on the expertise available, time to market, whether it's a one-off, and so on.
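
To make the "rewrite it vs. add nodes" choice concrete, here's a toy Python sketch: the exact same unoptimized CPU-bound function, just given more workers, as a stand-in for adding machines (the workload numbers are arbitrary):

```python
# Toy version of "add nodes" vs. "rewrite the function": the same
# CPU-bound job run with 1 worker, then with 4. The workload is a
# stand-in for image processing.
import time
from concurrent.futures import ProcessPoolExecutor

def process_item(n: int) -> int:
    # Placeholder for real image processing: just burns CPU.
    return sum(i * i for i in range(n))

def run(workers: int, jobs: int = 8) -> float:
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_item, [2_000_000] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"1 worker:  {run(1):.2f}s")
    print(f"4 workers: {run(4):.2f}s")  # same code, roughly 4x the throughput
```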

If it's going to take your devs a month of shipping no features because they have to learn a faster language to process images, when for a few dollars you can spin up a bunch of functions or machines to do the work in your current language... I would focus on the second. Obviously the suggestion would be the opposite if your entire business is selling image processing :)

If you have a business running on a Rails app that occupies 600 MB (arbitrary example) and your tier is 512 MB, you're probably better off upgrading to 1 GB of RAM than halting development of bug fixes and new features for an indefinite amount of time to find a way to shave off 100 MB of RAM. You should probably still do it, but maybe not while you're focusing on a more mission-critical part of your app.
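
And verifying the numbers in that example is cheap, something along these lines (a sketch assuming psutil is installed):

```python
# Quick check of actual memory use vs. the tier limit (assumes psutil).
import psutil

proc = psutil.Process()  # or psutil.Process(pid) for your app server
rss_mb = proc.memory_info().rss / (1024 * 1024)
print(f"RSS: {rss_mb:.0f} MB")  # compare against the 512 MB / 1 GB tier
```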

There's no magical formula, it takes visibility into your system, monitoring, measuring and expertise.

Priyansh Jain

A log of the processes running on your servers should help you out. For example, if the available RAM is too low to handle the load, it happens to be the infrastructure. If you run a service, it's kind of your duty to optimise the code as much as possible; I'm pretty sure that if you have good developers, they'll be able to judge whether the code can be optimised further or not.
For example with Node.js: blog.caustik.com/2012/08/19/node-j....
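
A minimal version of that process log, sketched in Python (assuming psutil is installed):

```python
# Minimal process/RAM snapshot (assumes psutil): how much headroom is
# left, and which processes are eating it.
import psutil

vm = psutil.virtual_memory()
print(f"RAM: {vm.available / 2**20:.0f} MB free of {vm.total / 2**20:.0f} MB")

procs = [
    p for p in psutil.process_iter(["pid", "name", "memory_info"])
    if p.info["memory_info"] is not None
]
procs.sort(key=lambda p: p.info["memory_info"].rss, reverse=True)
for p in procs[:5]:  # top 5 memory consumers
    print(p.info["pid"], p.info["name"], f"{p.info['memory_info'].rss / 2**20:.0f} MB")
```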