I am tagging this post as #productivity because I believe that is the best way to describe something when you are in a state of chaos.
Nothing fills you up with adrenaline like being in a meeting when someone slacks you:
"DUUUUUDEEEEEEE we got problems in PROD. Stop what you are doing and come over here!"
For veterans in IT, hearing something like that is just another incident. For someone who recently started a career in IT, however, it can be overwhelming and quite depressing.
Quick story:
I was in a meeting a few years ago, and that very same day I had read something about Chaos Monkey by Netflix. One topic of that meeting was the stability of an environment, so I said:
"What do you (Devs and a few DevOps guys/gals) think: if we randomly deleted some apps, what would our post mortem look like?"
Everyone looked at me like I was from some planet Thanos 50/50, saying something like "Why would we want to do that?".
Welllll... the problem is that when you have a mix of new apps, legacy apps and legacy systems, you don't necessarily know what would happen if some disk gets destroyed, some random process gets killed, Route 53 for the xyz.zyz.zy domain vanishes, or the xyz EC2 instance is terminated.
I've played with chaoskube for Kubernetes for a while now, and I must say it sometimes just scares you to see what could have gone wrong, or what you did not anticipate, when building a solid infrastructure.
Go ahead and put something to work that randomly destroys things. Your resilience and stability will be greater in the long run, and you will spend less time tinkering around with production problems.
Overall you are trying to find dependencies in your infrastructure and applications.
If app B goes down and app A becomes unstable because of it, your task is to find a way to self-heal both B and A.
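To make the "if B goes down, A gets unstable" idea concrete, here is a minimal sketch that walks a dependency map and lists everything that would be affected by one failure. The app names and the `DEPENDS_ON` map are made up for illustration; in real life you would build that map from whatever inventory or service discovery you have.

```python
from collections import deque

# Hypothetical dependency map: each app lists the things it depends on.
DEPENDS_ON = {
    "app-a": ["app-b"],
    "app-b": ["database"],
    "app-c": [],
    "database": [],
}


def impacted_by(failed, depends_on):
    """Return every app that directly or indirectly depends on `failed`."""
    # Invert the map so we can go from a dependency to its dependents.
    dependents = {}
    for app, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(app)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for app in dependents.get(current, []):
            if app not in impacted:
                impacted.add(app)
                queue.append(app)
    return sorted(impacted)


if __name__ == "__main__":
    print(impacted_by("app-b", DEPENDS_ON))
    print(impacted_by("database", DEPENDS_ON))
```

A map like this is only as good as what you know about your apps, which is exactly why killing things on purpose helps: every surprise is a dependency that was missing from the map.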
And if you do not want to use tools like Chaos Monkey, try something like hitting your site with 15k hits per minute and observe your server load. Find out if your boss's head color gets red or purple 😡👿.
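For a feel of what 15k hits per minute means, that is 250 requests per second, or one request every 4 ms. Below is a rough, single-threaded sketch of a paced load loop. `send_request` is a hypothetical callable you supply (for example a small wrapper around an HTTP call to a test environment you own), and the `sleep` and `clock` parameters exist only so the loop can be tried without real waiting. A proper load testing tool with concurrency will do this far better; this only shows the pacing idea.

```python
import time


def interval_for(hits_per_minute):
    """Seconds between requests for a given hits-per-minute target."""
    return 60.0 / hits_per_minute


def run_load(send_request, hits_per_minute, total_hits,
             sleep=time.sleep, clock=time.monotonic):
    """Call `send_request` at a steady pace and return the failure count."""
    interval = interval_for(hits_per_minute)
    start = clock()
    failures = 0
    for i in range(total_hits):
        # Schedule against the start time so slow requests don't add drift.
        delay = (start + i * interval) - clock()
        if delay > 0:
            sleep(delay)
        try:
            send_request()
        except Exception:
            # Any exception from the callable counts as a failed hit.
            failures += 1
    return failures


if __name__ == "__main__":
    # Stand-in request that always succeeds; swap in a real call to test.
    print(run_load(lambda: None, hits_per_minute=15000, total_hits=250))
```

Because it sends one request at a time, this loop can only keep up with the target rate if each request returns in under 4 ms, so treat the number of failures and the server load you observe as a starting point, not a benchmark.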
By the way, don't go full-on with Chaos Monkey or, if you are on K8s, chaoskube. Test it out in dry run mode first.
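If you end up writing your own little destroyer script, the same rule applies: make dry run the default. This is a minimal sketch of that idea, not how Chaos Monkey or chaoskube are implemented. The `terminate` callable and the target names are hypothetical; you would plug in whatever actually deletes a pod or stops an instance in your setup.

```python
import random


def pick_victim(targets, rng=random):
    """Pick one target at random, or None if there is nothing to pick."""
    if not targets:
        return None
    # Sort first so a seeded rng gives the same pick for the same targets.
    return rng.choice(sorted(targets))


def run_chaos(targets, terminate, dry_run=True, rng=random):
    """Pick a victim and terminate it, unless dry_run is on (the default)."""
    victim = pick_victim(targets, rng)
    if victim is None:
        return None
    if dry_run:
        # Only report what would happen; nothing is touched.
        print(f"[dry run] would terminate {victim}")
    else:
        print(f"terminating {victim}")
        terminate(victim)
    return victim


if __name__ == "__main__":
    # Hypothetical pod names; the terminate callable here does nothing.
    run_chaos(["pod-a", "pod-b", "pod-c"], terminate=lambda name: None)
```

Having to pass `dry_run=False` explicitly means nobody destroys anything by forgetting a flag, and the dry run output gives you something to show the team before the real thing.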
The main point I am trying to make here is that you need to embrace things that are outside of your comfort level and be prepared for what's coming. There are two ways you can go about it: ignore the problem and keep restarting your application from time to time because of a memory leak (by hand or with a script), or fix the darn thing so you don't depend on being called at 2am.
Top comments (5)
This is so true. We often don't take care of scenarios like the database not being available or a certain validation failing, and we need to make sure the user sees a reasonable message to let them know what's going on.
I think productivity fits this topic :)
A good reminder that we need to be aware of and ready for possible chaos. I read about that recently in The DevOps Handbook. It really makes sense if you have many microservices running.
Thank you.
See, recently I got Kafka/ZooKeeper running on K8s and things went smoothly. Then I played with that chaos and went ahead and deleted random ZooKeeper and Kafka pods. Things were good until I found out about wait times, rebalancing, etc. Now imagine introducing a lag of 1-2 min in prod with, say, 500 pods. Disaster. :)
Hey there,
Some good points in this article, no doubt. I like the more abstract concept of intentionally putting your system in less-than-ideal circumstances, observing the outcomes and acting accordingly. It could help to deliver a really robust system.
I do think, though, that this post is very heavily aimed towards people who work in CI/CD or DevOps. As a web developer, there are many chaotic situations that I could intentionally trigger, but would be unable to resolve. Sometimes, the solution lies with a third party or is restricted by technology.
I think we need to take these principles, but keep them flexible and configure them to match a given discipline.
All in all - great article 😊
True, this was more from a DevOps perspective. However, you can work alongside DevOps (if you don't have your own full-access dev env) to observe what happens if you shut off the network, stop access to the database, provide a wrong username, break dependencies on other apps, and such.
When I do things like the above, I work with Devs because they give me better insight into the app and/or its logs. That way the NOC at least gets a notification that an app restarted and healed itself by running script xyz, which also restarted its dependencies.
Thank you for commenting. Thoughts from a different perspective are much appreciated.