I am tagging this post as #productivity because I believe that is the best way to describe something when you are in a state of chaos.
Nothing fills you up with adrenaline like being in a meeting when someone Slacks you:
"DUUUUUDEEEEEEE we got problems in PROD. Stop what you are doing and come over here!"
For veterans in IT, hearing something like that is just another incident. For someone who recently started a career in IT, however, it can be overwhelming and quite depressing.
I was in a meeting a few years ago, and that very same day I had read something about Chaos Monkey by Netflix. One topic at that meeting was the stability of an environment, so I said...
"What do you (devs and a few DevOps guys/gals) think: if we randomly deleted some apps, what would our post-mortem look like?"
Everyone looked at me like I was from some planet Thanos 50/50, saying things like "Why would we want to do that?"
Welllll... the problem is that when you have a mix of new apps, legacy apps and legacy systems, you don't necessarily know what would happen if some disk gets destroyed, or some random process gets killed, or the Route 53 record for the xyz.zyz.zy domain vanishes... or the xyz EC2 instance is terminated.
I've played with chaoskube for Kubernetes for a while now, and I must say it sometimes scares you to see what could have gone wrong, or what you had not anticipated, when building a solid infrastructure.
Go ahead and put something to work that randomly destroys things. Your resilience and stability will be greater in the long run, and you will spend less time tinkering with production problems.
Overall you are trying to find dependencies in your infrastructure and applications.
If app B goes down, app A will become unstable, so your task is to find a way to self-heal both app B and app A.
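To make the dependency hunt a bit more concrete, here is a minimal Python sketch of the "if B goes down, what else wobbles?" question. The app names and the `DEPENDS_ON` map are made up for illustration; in real life you would build that map from whatever you actually know about your systems (and chaos testing is how you find the edges you did not know about).

```python
from collections import deque

# Hypothetical dependency map: each app lists the apps it calls directly.
DEPENDS_ON = {
    "app-a": ["app-b"],
    "app-b": ["database"],
    "app-c": ["app-a"],
    "database": [],
}


def impacted_by(failed, depends_on):
    """Return every app that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {app: [] for app in depends_on}
    for app, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(app)

    # Breadth-first walk over everything downstream of the failure.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for app in dependents.get(current, []):
            if app not in impacted:
                impacted.add(app)
                queue.append(app)
    impacted.discard(failed)  # only relevant if the map contains a cycle
    return sorted(impacted)


print(impacted_by("app-b", DEPENDS_ON))  # ['app-a', 'app-c']
```

Every name this prints is a candidate for a health check, a retry, or a fallback.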
And if you do not want to use tools like Chaos Monkey, try something like hitting your site with 15k hits per minute and observe your server load. Find out if your boss's head turns red or purple 😡👿.
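For the record, 15k hits per minute works out to 250 requests per second, or one request every 4 ms. Below is a rough, standard-library-only Python sketch of pacing requests at that rate. The function names, the worker count and the timeout are my own assumptions, not a recommendation over a proper load-testing tool, and you should only ever point it at an environment you own.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_offsets(hits_per_minute, duration_s):
    """Evenly spaced send times (seconds from start) for the target rate."""
    interval = 60.0 / hits_per_minute  # 15,000/min -> one hit every 4 ms
    total = round(hits_per_minute * duration_s / 60)
    return [i * interval for i in range(total)]


def http_get(url):
    """Fetch `url` and return the HTTP status (raises on errors and timeouts)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def run_load(send, hits_per_minute=15_000, duration_s=60, workers=50):
    """Call `send()` at the target rate; return (successes, failures)."""
    start = time.monotonic()
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for offset in send_offsets(hits_per_minute, duration_s):
            delay = start + offset - time.monotonic()
            if delay > 0:
                time.sleep(delay)  # wait until this hit is due
            futures.append(pool.submit(send))
    # Leaving the `with` block waits for every request to finish.
    failures = sum(1 for f in futures if f.exception() is not None)
    return len(futures) - failures, failures


print(len(send_offsets(15_000, 60)))  # 15000 hits scheduled for one minute
```

A call such as `run_load(lambda: http_get("http://localhost:8080/"), duration_s=10)` (the URL is a placeholder) returns how many requests succeeded and how many failed, which you can line up against your server load graphs.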
By the way, don't go full-on with Chaos Monkey or, if you are on K8s, with chaoskube. Test it out first in dry-run mode.
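If you end up rolling your own little chaos script, the same dry-run idea is easy to copy. This is a toy Python sketch of the pattern, not how Chaos Monkey or chaoskube are implemented: the pod names are invented and `terminate` is a stand-in for whatever real delete call you would plug in.

```python
import random


def unleash_chaos(targets, terminate, dry_run=True, rng=random):
    """Pick one random target; only terminate it when dry_run is False."""
    if not targets:
        return None
    victim = rng.choice(targets)
    if dry_run:
        # Safe default: report what would happen, touch nothing.
        print(f"[dry run] would terminate {victim}")
    else:
        print(f"terminating {victim}")
        terminate(victim)
    return victim


# Hypothetical pod names; `terminated.append` stands in for a real delete call.
pods = ["web-1", "web-2", "worker-1"]
terminated = []
unleash_chaos(pods, terminated.append)                 # dry run: nothing is killed
unleash_chaos(pods, terminated.append, dry_run=False)  # one pod is "killed"
print(terminated)
```

Making dry run the default means the destructive path has to be asked for explicitly, which is exactly what you want the first few times you run it.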
The main point I am trying to make here is that you need to embrace things that are outside of your comfort zone and be prepared for what's coming. There are two ways you can go about it: ignore the problem and keep restarting your application from time to time because of a memory leak (or make a script to restart it for you), or fix the darn thing so you don't depend on being called at 2am.