"Hey, we need to start doing this thing I read about." Some of you have probably heard that at some point. Depending on who said it, you probably had one of two reactions: "This is just great! We can finally make some real improvements!" or "This is just great. We have another stupid thing to add to our graveyard of half-thoughts."
One of the hardest sets of lessons to learn is around selecting tooling. Knowing how to make the best use of your toolbox, when it's time to invest in a new tool, and how to push for the right (or better) tooling can be difficult.
Most problems can be solved in more than one way. Selecting the right tool is important - with the wrong tool, you're going to be slower and less effective, and you're going to end up with a worse product. How to select the right tool for a problem is well studied, and the methodology varies from team to team.
What a lot of people miss is the core question: what's the right problem to solve? This is the most important question to ask, through every step of a process as complex as software design. If you solve the wrong problem perfectly, you've still solved the wrong problem, and you haven't made any real progress; solve the right problem in a minimal way, and you can bootstrap that into something bigger and better.
When you're looking at a problem, start by stepping back and understanding why it's a problem; from there, keep working your way back, and eventually you'll find the real problem you're trying to solve.
For example, some time ago I was asked to figure out how to make a certain process deployable without incurring any downtime. This is a large process, and on the surface I knew what problem I was trying to solve: making this process keep working through deployments. By digging deeper, I was able to see the real problem: requiring a complete restart in order to make changes to this process meant that changes happened at a glacial pace, because it could only be restarted during off hours.
Not only did this question help me make better decisions on this project, but it has informed my decision-making ever since; when I have to make a choice in the midst of a project, I can weigh it against the real problem instead of the surface one.
When it comes to DevOps, our tooling matters less than understanding what our goals are.
I tend to break the reasons for doing DevOps into three large goals:
- Reduce risk
- Increase velocity
- Reduce toil
All of these intertwine heavily: a lot of things that reduce toil also reduce risk, a lot of things that increase velocity also reduce toil, and so on.
It's 4:58 on Friday afternoon, and the developers have a change to make before the weekend. What do we do here? If we don't push, the devs get mad because they're being held back; if we do push and it fails, we may have just thrown away a weekend for everyone who has to fix it.
Part of what we should be doing is reducing our risk; if a deployment needs to happen at 4:58 on a Friday afternoon, we want to know it will work. There are multiple ways of doing this, but the important thing is this: when we look at the problem of how to handle a deployment at 4:58 on a Friday afternoon, what we're looking at is how to make something less risky. Whether we make the software itself less risky, or the deployment process, or our environment, our real end goal is the same.
A developer has spent a week implementing a feature; now, it needs to be integrated into the main branch and deployed. Sometimes, this is easy; sometimes, it can take as long as, or even longer than, the actual development.
What can we do to make this process faster? What can we do to completely eliminate the need to spend extra time mushing codebases back together after a divergence? What can be done to make it so that our developers can work faster and more effectively?
Every day at 9 AM, you have to restart some service or it will fail, and if it fails, your entire infrastructure may go with it. If you get sick, or forget to do it, or even just get busy at that exact moment, congrats - your service is down. Even though it's only 2 minutes a day, it adds up, and the context switching can be brutal. It's busywork. It's toil.
When we reduce the amount of work it takes to get something done, we multiply our effort: now we don't have to handle it, we don't have to watch the clock, we don't have to touch it at all. This does a few things for us - with fewer things to worry about, we get that time back, and we don't feel as exhausted.
We can do this in two ways - removing things that don't need to be done (for example, is the service that needs to be restarted every morning broken? Let's fix it, removing the need to do it at all) and automating what's left.
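As a rough illustration of the "automating what's left" half, here's a minimal Python sketch. Everything in it is hypothetical: `ensure_healthy` is a name I made up, and the `is_healthy` and `restart` callables stand in for whatever your service actually needs (a health endpoint, an init-system command, and so on). It restarts only when the check fails, and it gives up loudly instead of looping forever.

```python
from typing import Callable


def ensure_healthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> int:
    """Restart a service until it reports healthy.

    Hypothetical helper: `is_healthy` and `restart` are placeholders for
    your own health check and restart command.

    Returns the number of restarts performed. Raises RuntimeError if the
    service is still unhealthy after `max_restarts` restarts.
    """
    restarts = 0
    while not is_healthy():
        if restarts >= max_restarts:
            # Automation should fail loudly so a human hears about it.
            raise RuntimeError(f"still unhealthy after {restarts} restarts")
        restart()
        restarts += 1
    return restarts


if __name__ == "__main__":
    # A fake service, for illustration only: it starts out unhealthy
    # and becomes healthy after one restart.
    state = {"healthy": False}

    def check() -> bool:
        return state["healthy"]

    def fake_restart() -> None:
        state["healthy"] = True

    print(ensure_healthy(check, fake_restart))  # prints 1
```

Run from a scheduler, something like this takes the 9 AM restart off your plate. It's still a band-aid, though: the better fix is the first one, repairing the service so the restart isn't needed at all.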
So what does real DevOps look like? It's not about using certain tools; it's about asking the questions that lead to the best decisions for reducing risk, increasing velocity, and reducing toil.
When we are able to ask these questions and really focus on what matters to making things better, not only do we get to start making products better, faster, but we're less frustrated, we're more reliable, and we can start making the right tool choices even in new circumstances. Tooling isn't what matters when it comes down to it -- knowing where you want to be is.