When we develop a product, we know that bugs happen and that unpredictable events can affect the product while customers are using it. No one wants a bad review or a call from an angry customer saying that the platform they are using isn't working correctly. The worst scenario for any team is that one of its critical features stops working.
This is an approach for small teams of developers. Very often, to keep the product up and running, we need to move super fast, and sometimes that means leaving things for later and focusing on what is critical.
Scenario: the server is overloaded with requests from the application, so it is throwing errors.
Possible solutions:
Infra-side
Scale horizontally
If your servers are in the cloud, add another server and configure your load balancer to distribute traffic across all of them. The customer probably won't notice the delay; however, it means paying $$ for this unplanned scaling.
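To make the load-balancer step concrete, here is a minimal round-robin sketch. It only illustrates how traffic gets spread once a new server is added; the class name and server addresses are hypothetical, and real load balancers (cloud-managed ones or Nginx) also handle health checks, weighting and sticky sessions.

```typescript
// Round-robin selection over a list of backend servers (addresses are hypothetical).
class RoundRobinBalancer {
  private readonly servers: string[];
  private next = 0;

  constructor(servers: string[]) {
    if (servers.length === 0) {
      throw new Error("At least one server is required");
    }
    this.servers = [...servers];
  }

  // Horizontal scaling: a newly added server starts receiving traffic right away.
  addServer(server: string): void {
    this.servers.push(server);
  }

  // Returns the server for the next request, cycling through the list in order.
  pick(): string {
    const index = this.next % this.servers.length;
    this.next = index + 1;
    return this.servers[index];
  }
}
```

Starting with one server and calling addServer("app-2:8080") makes subsequent calls to pick() alternate between the two backends, which is the effect the extra (paid) server buys you.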
Scale vertically
Again, assuming our servers are in the cloud, we can upgrade the machine's specs (more RAM, CPU, etc.). Again, the customer probably won't notice the delay; however, it means paying $$ for this unplanned scaling.
Server-side
Check whether your server has a configuration that limits the number of requests. For example, if you use Nginx with request rate limiting (limit_req) and the server is flooded with requests, Nginx will start rejecting the excess requests with the same error status code (503 by default). Also make sure to use a CDN for static content and a caching strategy for the files you serve, such as jpg, js and mp3 files.
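One way to express a caching strategy is to choose a Cache-Control header per file type, so that browsers and a CDN can answer repeat requests without reaching your server. The sketch below is a hypothetical policy: the function name and the max-age values are assumptions, and long-lived caching of .js/.css files is only safe when their filenames change whenever their content changes (fingerprinting).

```typescript
// Hypothetical policy: long-lived caching for fingerprinted static assets,
// shorter caching for media, and revalidation for everything else (HTML, API).
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;
const ONE_DAY_SECONDS = 24 * 60 * 60;

function cacheControlFor(path: string): string {
  // Look only at the last path segment so "/v1.2/users" is not read as an extension.
  const file = path.split("/").pop() ?? "";
  const dot = file.lastIndexOf(".");
  const ext = dot === -1 ? "" : file.slice(dot + 1).toLowerCase();

  switch (ext) {
    case "js":
    case "css":
      // Assumes names like app.3f9a1c.js, so a new build gets a new URL.
      return `public, max-age=${ONE_YEAR_SECONDS}, immutable`;
    case "jpg":
    case "jpeg":
    case "png":
    case "mp3":
      return `public, max-age=${ONE_DAY_SECONDS}`;
    default:
      // "no-cache" still allows caching, but forces revalidation with the origin.
      return "no-cache";
  }
}
```

Your web server or framework would set the returned value as the Cache-Control response header for each static file.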
Code-level
Add interceptors at the frontend level
This means that when the frontend receives a response indicating that the server is struggling (for example, running out of memory), the frontend decides which requests are still sent to the server. We need to decide which critical features must keep being served and which ones can wait until the server has fully recovered.
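Here is a minimal sketch of such an interceptor, written against a plain "send" function instead of a specific HTTP client so it stays self-contained. The set of "overload" status codes, the isCritical predicate and the class name are all assumptions; with a library like Axios you would place the same logic in its request/response interceptors.

```typescript
// Minimal response shape so the sketch doesn't depend on a specific HTTP client.
interface HttpResponse {
  status: number;
  body: string;
}

type Sender = (url: string) => Promise<HttpResponse>;

// Statuses treated as "the server is overloaded" (assumption; adjust to your backend).
const OVERLOAD_STATUSES = new Set([429, 502, 503, 504]);

class OverloadInterceptor {
  private degraded = false;
  private readonly deferred: string[] = [];
  private readonly send: Sender;
  private readonly isCritical: (url: string) => boolean;

  constructor(send: Sender, isCritical: (url: string) => boolean) {
    this.send = send;
    this.isCritical = isCritical;
  }

  // Returns the response, or null when a non-critical request was deferred.
  async request(url: string): Promise<HttpResponse | null> {
    if (this.degraded && !this.isCritical(url)) {
      this.deferred.push(url);
      return null;
    }
    const res = await this.send(url);
    if (OVERLOAD_STATUSES.has(res.status)) {
      // From now on, only critical requests reach the server.
      this.degraded = true;
    }
    return res;
  }

  // Call once the server is healthy again; replays deferred requests in order.
  async recover(): Promise<HttpResponse[]> {
    this.degraded = false;
    const pending = this.deferred.splice(0);
    const results: HttpResponse[] = [];
    for (const url of pending) {
      results.push(await this.send(url));
    }
    return results;
  }
}
```

For example, with isCritical = (url) => url.startsWith("/checkout"), a 503 response switches the interceptor into degraded mode: checkout calls still go through, while report requests are queued until recover() is called.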
Verify your ORM connections
Verify that you are not (mis)using transactions in your ORM at the backend level. Remember that a transaction holds a database connection for its whole duration, so you can end up blocking access to the DB. For example, say you have a GET endpoint some_domain/users/125 that, for some reason, opens a transaction, stores something in the DB, then closes the transaction and returns the response. This works great... for the first 20 or so requests. Once traffic passes that point, let's say 30 RPS, new requests start being queued, so a request that normally takes 2 seconds ends up taking 1 minute, and so on.
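To see why the latency explodes, here is a small deterministic simulation of a fixed-size connection pool in which every request holds a connection for the whole time it runs (as it does when the GET handler opens a transaction). The pool size of 20 is an assumption chosen to mirror the "20 requests" above; the function name is hypothetical. With 20 connections held for 2 seconds each, the pool can serve at most 10 requests per second, so at 30 RPS the queue grows by about 20 requests every second.

```typescript
// Simulates a fixed-size DB connection pool where every request holds a
// connection for `holdSeconds`. Requests arrive at a constant rate and are
// served in arrival order (FIFO) by whichever connection frees up first.
// Returns the total latency (waiting + holding) of each request, in seconds.
function simulatePool(
  poolSize: number,
  requestsPerSecond: number,
  holdSeconds: number,
  totalRequests: number,
): number[] {
  // freeAt[c] = time at which connection c becomes available again.
  const freeAt: number[] = Array.from({ length: poolSize }, () => 0);
  const latencies: number[] = [];

  for (let i = 0; i < totalRequests; i++) {
    const arrival = i / requestsPerSecond;

    // Pick the connection that becomes free the earliest.
    let best = 0;
    for (let c = 1; c < poolSize; c++) {
      if (freeAt[c] < freeAt[best]) best = c;
    }

    // The request waits if that connection is still busy when it arrives.
    const start = Math.max(arrival, freeAt[best]);
    freeAt[best] = start + holdSeconds;
    latencies.push(start + holdSeconds - arrival);
  }
  return latencies;
}
```

Running simulatePool(20, 10, 2, 300) keeps every request at about 2 seconds, while simulatePool(20, 30, 2, 900), i.e. 30 seconds of traffic at 30 RPS, pushes the last request to roughly 60.7 seconds. Shortening how long each request holds a connection (no transaction on a read-only GET) raises the capacity instead of just moving the queue around.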
The most important advice: use proper logs, and test the path through your catch blocks (the catch in your try { ... } catch (...) { ... }). Because of time pressure we usually don't check this, and during the merge request review it isn't easy to notice whether the logging is done properly.
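A sketch of a catch block that logs useful context and can be unit tested is shown below. The Logger interface, the UserLoader type and the getUser function are hypothetical; injecting the logger is what makes the catch path easy to exercise in a test.

```typescript
// Minimal logger interface so the catch path can be tested without a real logging library.
interface Logger {
  error(message: string, context: Record<string, unknown>): void;
}

// Hypothetical data-access function; in a real app this would hit your DB or another service.
type UserLoader = (id: number) => Promise<{ id: number; name: string }>;

async function getUser(id: number, load: UserLoader, logger: Logger) {
  try {
    return await load(id);
  } catch (err) {
    // Log enough context to diagnose the failure later, then rethrow so the
    // caller (e.g. the HTTP handler) can still return a proper error response.
    logger.error("Failed to load user", {
      userId: id,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```

In a unit test, pass a loader that always rejects and a logger that records its calls, then assert that the error was rethrown and that exactly one entry containing the expected userId was logged. This way the catch path is verified by a test instead of by eyeballing the merge request.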
That's everything for now. Let me know what you think, and share any other tips that could be useful for people who land on this post.