Michael

Posted on Jan 18, 2022

Assuming Risk

#programming #devops #management

Doing just about anything, not just software, has an assumption of risk associated with it. What matters is how that risk is managed. Sometimes the risk is low and can be accepted, such as putting a plate on the edge of the counter. Sometimes it is high, such as getting too close to that barking dog pulling at its chain and finding out whether it will bite you.

Dealing with it requires not just an Assumption of Risk, where you accept the consequences that occur, but Risk Mitigation where you can adjust the level of Risk before accepting it. You can carry this over to pretty much anything in life, as my previous examples note, but let's look at this solely in the realm of software to keep examples on point.

Assess the Level

Before assuming ANYTHING, always review the consequences first. Will updating that one rarely used library cause much in the way of build failures? Probably not. Will updating one of the main libraries, without updating its dependencies, cause build errors? More than likely.

Take a look at not just what the consequences of the actions are, but what it will take to fix them. For the worst case, also have a rollback plan. When I was doing more application releases to web sites, a rollback that took longer than the deploy was a key sign to me that something could easily go wrong and stay wrong for a while. This meant we needed to reassess the steps or the deployment. If the deployment is simple and just requires running a script to install, or reinstall the last version, then it's going to be easier to roll back in case there are issues.
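The rollback-versus-deploy comparison can be written down as a quick check. This is only a minimal sketch: the `ReleasePlan` class, the plan names, and the minute estimates are all made up for illustration, and your own numbers would come from practice runs.

```python
from dataclasses import dataclass


@dataclass
class ReleasePlan:
    """Hypothetical record of how long a release takes in each direction."""
    name: str
    deploy_minutes: float
    rollback_minutes: float


def rollback_is_riskier(plan: ReleasePlan) -> bool:
    """True when rolling back is expected to take longer than deploying."""
    return plan.rollback_minutes > plan.deploy_minutes


# Illustrative estimates only.
plans = [
    ReleasePlan("script-install", deploy_minutes=5, rollback_minutes=5),
    ReleasePlan("site-move", deploy_minutes=30, rollback_minutes=90),
]

# Plans whose rollback outlasts the deploy deserve a second look.
needs_review = [p.name for p in plans if rollback_is_riskier(p)]
print(needs_review)  # ['site-move']
```

Anything that lands in `needs_review` is a candidate for reworking the steps before release day.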

Think too about what dependencies there are. Are there dependent services? What is the upstream and downstream flow for what is being updated? The more services and the longer the chain in either direction, the more risk you will have, compared with something that is touched rarely or only by one minor service that is barely used. Knowing what the cascade of changes is helps in determining the level of risk when a failure happens. The more changes, the more risk there may be, so take a look at the changes and then at what other changes those cause. Over time this gets easier.
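One way to see the cascade is to walk a dependency map and list everything downstream of the thing you are changing. The sketch below assumes you can write that map down by hand; the service names in `DEPENDENTS` are hypothetical, and the walk is a plain breadth-first search.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "auth-lib": ["api", "admin"],
    "api": ["web", "mobile-gateway"],
    "admin": [],
    "web": [],
    "mobile-gateway": [],
    "report-cron": [],
}


def cascade(changed, dependents):
    """Return every service downstream of `changed` (breadth-first walk)."""
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            # Skip anything already visited so cycles cannot loop forever.
            if child not in seen and child != changed:
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(cascade("auth-lib", DEPENDENTS)))
# ['admin', 'api', 'mobile-gateway', 'web']
print(sorted(cascade("report-cron", DEPENDENTS)))
# []
```

A change to `auth-lib` touches four other services, while `report-cron` touches none, which is a rough but useful first read on where the risk sits.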

Manage the Level - Mitigation

I like to think of Risk as a malleable factor; it is rarely a fixed and unchangeable item. If you come across a risk like that, back away very slowly. You may have encountered a Grue, and it may be pretty dark where you are going.

What you want is the smallest risk imaginable, and you get there by resolving the issues that can be generated by the most exposed service or largest dependency. Resolve each in some manner, either by getting a fix made or by minimizing it so that if there is an issue a service won't go down and an application won't return a blank page or suddenly crash. I know I am mixing a few different things here, but remember risk is everywhere (getting in your car and driving, flying, or in 2020-2021 just going outside with Covid) and you WANT to minimize your exposure to those risks. Sometimes it requires being creative, or asking "How can we fix this?"
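As one example of minimizing rather than fixing, a failing dependency can be wrapped so the page still renders something. This is a sketch of a simple fallback wrapper, not a recommendation for any particular library; `recommendations` and `default_recommendations` are hypothetical functions invented for the example.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any exception is answered by `fallback`."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Deliberately broad for the sketch; real code would log the
            # error and catch only the failures it expects.
            return fallback(*args, **kwargs)
    return wrapped


def recommendations(user_id):
    """Hypothetical call to a dependency that is currently down."""
    raise ConnectionError("recommendation service is down")


def default_recommendations(user_id):
    """Hypothetical safe default shown instead of a blank page."""
    return ["bestsellers"]


safe_recommendations = with_fallback(recommendations, default_recommendations)
print(safe_recommendations(42))  # ['bestsellers']
```

The risk is still there, but the consequence has shrunk from a crash to a less personalized page.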

Something like "Why is it like this?" can be useful, but you have to be careful asking Why questions about risk, as they can come across as accusatory and you don't want that. Accusatory questions can make people mad. "Why would you do that?!?" is different from "Why is it implemented in this manner?", and the former can make you enemies; you want friends, people who will help you with solutions. Asking implementation questions can often reveal why something is the way it is and why the risk was accepted. That doesn't mean a risk that was acceptable a year ago is still acceptable today. Things change. So does risk.

When you have done enough to minimize the risk you face it's time to put that in place, and we get to the next step...

Resolve the Risk

Deployments are a fact of life. They have to be; code needs to get out in the wild for the company to grow, and for Customers to see that amazing new widget or that wonderful feature that will make their lives easier. To get it out there, though, requires a deployment.

For a deployment, and they are all different, you need to know what you are working with. If you've gotten this far then you have already followed along with the complexity, looked at it, and understand it. Now you need to fix it to get it to deploy. What helps with a lot of automated pipelines these days is that the deployment has been happening at each step along the way, from when the PR checked the code with its linters and code scans to the Dev and QA Environments where the code was installed.

So we already know what the risks are, and have mitigated the worst of them. At times hot fixes come up and they tend to skip around a bit, but in the end there is a way to either manually install something, or there is a pipeline to do this. By the time you have come to Production you know what the code is, or in a Deployment Review know each step and what is happening within each. Is there a dependency on something new in Production? Has that been done elsewhere? Do we understand the rollback and have a way to verify each step worked?
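Verifying each step and knowing the way back can be sketched as a small runner that checks every step after applying it and undoes the completed steps in reverse order on the first failure. Everything here is an assumption for illustration: `DeployStep`, `run_deployment`, and the in-memory `state` dictionary stand in for whatever your real pipeline and environment are.

```python
from typing import Callable, NamedTuple


class DeployStep(NamedTuple):
    """Hypothetical step: how to apply it, check it, and undo it."""
    name: str
    apply: Callable[[], None]
    verify: Callable[[], bool]
    undo: Callable[[], None]


def run_deployment(steps):
    """Apply steps in order; on a failed check, undo in reverse order."""
    done = []
    for step in steps:
        step.apply()
        done.append(step)
        if not step.verify():
            for finished in reversed(done):
                finished.undo()
            return False
    return True


# Toy environment standing in for Production.
state = {"schema": 1, "app": "v1"}


def migrate():
    state["schema"] = 2


def unmigrate():
    state["schema"] = 1


def install():
    state["app"] = "v2-broken"  # simulate a bad build


def uninstall():
    state["app"] = "v1"


steps = [
    DeployStep("migrate schema", migrate, lambda: state["schema"] == 2, unmigrate),
    DeployStep("install app", install, lambda: state["app"] == "v2", uninstall),
]

ok = run_deployment(steps)
print(ok, state)  # False {'schema': 1, 'app': 'v1'}
```

The second step fails its check, so both steps are undone and the environment ends up back in its original state.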

When I worked on big software releases, or site moves, I made a long checklist of steps, and to be extra detailed (yeah I can go that way) I would at times add in how long a step would take. That way we knew how long the entire process would take, and if a step took longer we knew to watch that one later, or at the time be aware something may be up. Not every release needs that kind of detail, but you DO need to have that much of a handle on every step.
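A timed checklist like that fits in a few lines. This sketch assumes made-up steps and durations, and the 1.5x `tolerance` for flagging a slow step is an arbitrary choice, not a rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One checklist line with its expected and (once run) actual time."""
    description: str
    expected_minutes: float
    actual_minutes: Optional[float] = None  # None until the step has run


def total_expected(steps):
    """How long the whole process should take."""
    return sum(s.expected_minutes for s in steps)


def overruns(steps, tolerance=1.5):
    """Steps that ran longer than `tolerance` times their estimate."""
    return [
        s.description
        for s in steps
        if s.actual_minutes is not None
        and s.actual_minutes > s.expected_minutes * tolerance
    ]


# Illustrative checklist only.
checklist = [
    Step("Back up database", expected_minutes=10, actual_minutes=9),
    Step("Copy release files", expected_minutes=15, actual_minutes=40),
    Step("Switch DNS", expected_minutes=5),
]

print(total_expected(checklist))  # 30
print(overruns(checklist))        # ['Copy release files']
```

The total tells everyone how long the window needs to be, and the overrun list tells you which step to watch next time.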

The last thing you want in Production during a deployment is a surprise. A code mismatch, a version just out of date for someone's pet service, or the password for the one server that is updated once a year by that guy who works in another department. You need to know your environment, the dependencies, and a way around them. Otherwise it's rollback time, and if you don't have a plan for that and know how to get back to the original state, it's going to be a long night of trying things out under pressure to bring everything back up.

No one wants that kind of stress.

KNOW your environment. Understand its dependencies, and ensure your QA Environments, or Staging if you are lucky enough to have one, are up to date. Also practice the same steps in those environments to ensure everything is covered.
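Checking that a lower environment is up to date can start as a comparison of version lists. The sketch below assumes you can export each environment's component versions into a dictionary; the component names and version strings are invented for the example.

```python
def version_drift(reference, candidate):
    """Map each component whose version differs to (reference, candidate)."""
    drift = {}
    for name in sorted(reference.keys() | candidate.keys()):
        ref, cand = reference.get(name), candidate.get(name)
        if ref != cand:
            # None means the component is missing from that environment.
            drift[name] = (ref, cand)
    return drift


# Hypothetical inventories of each environment.
production = {"postgres": "14.1", "redis": "6.2", "nginx": "1.20"}
qa = {"postgres": "13.5", "redis": "6.2"}

print(version_drift(production, qa))
# {'nginx': ('1.20', None), 'postgres': ('14.1', '13.5')}
```

An empty result means a practice run in QA is exercising the same versions you will meet in Production, at least for the components you listed.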

Risk is not for everyone

Some people don't like risk and they avoid it. Some don't like ice climbing, parachuting, driving on icy roads, or skiing. Whatever it is, it introduces stress and a fear of failure in people. At times the idea that YOU TOUCH PRODUCTION can give a slight twinge of anxiety. Plan for it, plan for failure and recovery, and plan for planning it all.

Over the past two years we've ALL lived with a lot of risk. Go outside? Shop? Touch things on the shelf to see if it's what you want? There are a lot of things that carry risk, and things you can do to minimize it. Risk is everywhere, and it can be managed if you just take a careful look at it and determine the best path forward.
