Here are some lessons and analogies I extracted from reading “An Astronaut’s Guide to Life on Earth: What Going to Space Taught Me About Ingenuity, Determination, and Being Prepared for Anything”. Almost everything Chris Hadfield says applies not just to life in general but to software engineering in particular.
In chapter 3 he drives home the point about training and preparation. To become an astronaut, and to remain one, you must be constantly learning and practicing. The vacuum of space is hostile to life, and a wrong decision can have all sorts of cascading effects that lead to disaster. So instead of sitting around and moping about the hostility of space, astronauts tackle the problem head on by running simulations of disaster scenarios. They practice and drill so much that cool-headed thinking, even in the face of certain doom, becomes almost instinctive. Instead of freezing up they “work the problem”. At one point he mentions they even have simulations for what happens when somebody aboard the ISS (International Space Station) dies. It’s not just the people aboard the ISS who run this simulation. The people on Earth also go through what they would have to do if they found out a close friend or a loved one had died in space. Suffice it to say that astronauts don’t mess around when it comes to being prepared for almost anything.
How does all this apply to software engineering? If I asked you what happens when your main database dies, would you have an answer? If you don’t, then you need to start running a simulation of that exact scenario. You need to start “working the problem”. How about if one of your application servers gets hacked? Are you running the application inside a sandboxed environment? Is it isolated enough from the rest of the system that a compromised application server would not have long-lasting consequences? How quickly would you find out if this happened? How easy is it to deploy previous versions of your code? Is your infrastructure “generative” in the sense that anyone on the team can re-create a working environment from scratch in a reasonable amount of time? What happens when one of your team members gets hit by a bus?
You need to be asking and answering such questions every week. You need to start practicing your responses to such disaster scenarios to the point that they become almost instinctive. For example, what do you do if the main database goes down? The first thing you need to do is figure out whether it went down because of high load. If it did, then you need to figure out what is causing the high load and mitigate it, because if you simply fail over to a standby database, the same load will take the standby down as well and you’ll be back where you started. Once the source of the high load has been pinpointed and cordoned off, you can decide whether to fail over and gradually increase the load or, in the case of a concerted attack, just ride it out. If your server failed because of a hardware issue, then you need to call the hardware support folks and get a replacement ASAP while you distribute the load to the standby database. If you don’t have a standby database, then you need to promote one of the followers to leader while you rebuild a replacement from a backup or through some other means. This is where having “generative” infrastructure helps, because anyone on your team, not just the DBA (database administrator), should be able to do this.
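One way to make a drill like this repeatable is to write the decision path down as code that the whole team can read and argue about. What follows is only a minimal sketch of the database walkthrough above, not a real runbook: the names `Cause`, `Outage`, and `next_steps` are hypothetical, the steps are plain strings rather than real automation, and I’m assuming the “no standby” case belongs under the hardware failure branch.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Cause(Enum):
    HIGH_LOAD = auto()
    HARDWARE = auto()


@dataclass
class Outage:
    """Hypothetical description of a main-database outage."""
    cause: Cause
    load_source_isolated: bool = False  # has the source of the load been cordoned off?
    concerted_attack: bool = False      # is the load a deliberate attack?
    has_standby: bool = True            # is there a standby database to fail over to?


def next_steps(outage: Outage) -> List[str]:
    """Return the next actions for the given outage, in order."""
    if outage.cause is Cause.HIGH_LOAD:
        if not outage.load_source_isolated:
            # Failing over now would just take the standby down too.
            return ["find the source of the high load", "cordon it off"]
        if outage.concerted_attack:
            return ["ride it out"]
        return ["fail over to the standby", "gradually increase the load"]

    # Hardware failure (assumed to be the only other cause in this sketch).
    steps = ["call hardware support for a replacement"]
    if outage.has_standby:
        steps.append("distribute the load to the standby")
    else:
        steps.append("promote a follower to leader")
        steps.append("rebuild a replacement from a backup")
    return steps


if __name__ == "__main__":
    for step in next_steps(Outage(cause=Cause.HARDWARE, has_standby=False)):
        print("-", step)
```

Even a toy like this tends to expose gaps. Writing the hardware branch forces you to ask whether a standby actually exists, and whether anyone besides the DBA knows how to promote a follower.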
That’s just for the database. You should be running such simulations for every component of your stack. Chances are you’ll stumble upon a few knowledge and process gaps along the way which you’ll then hopefully rectify through better processes and tooling.