Orel Bello for AWS Community Builders

Posted on May 29 • Originally published at Medium

The 10 Commandments of Working in Production

#aws #devops #sre #tutorial

Intro

What scares you the most?

Some say spiders, some say clowns, but what scares Engineers (both DevOps and Developers) the most is a P0 incident, where production is down.
Want to make it even scarier? Imagine that you’re the one who’s responsible for it.

When this kind of incident happens, it’s never pleasant, but is it really inevitable?
Production incidents unfortunately happen, and at some companies they happen more often than at others.

They say that you can’t be a true Senior Engineer if you don’t have a few Production incidents with your name on them, but that doesn’t mean we want to break production intentionally. We’d like to avoid it as much as possible, and even though we can’t completely eliminate it, with the right methodologies, we can definitely reduce it.

So, let’s learn how to do it, but first, let me introduce myself.

About Me

I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer with over 4 years of experience, including the past 3 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps.

Now an AWS Certified Professional in both DevOps and Solutions Architecture, I specialize in building scalable, efficient, and cost-effective cloud solutions.

So you can imagine that I have some experience as a production breaker (and also as a breakdancer, but this is for another blogpost).

WARNING! THESE RULES WERE WRITTEN IN BLOOD!

1. Always have a rollback plan

The first thing that you need to do when you’re touching a Production environment is have a rollback plan.

Let’s say that you need to modify some resource. What if this modification causes an outage? You need to be prepared. As they say: hope for the best, but prepare for the worst. Better safe than sorry.

Playbooks and documentation can save lives. So even if you’re making a small change, it’s important to prepare a rollback plan ahead of time.

Are you touching the DB? Make sure to take a Snapshot before you do.
Changing a secret, SSM Parameter, or even an IAM Policy? Make sure to save the original value in a safe place.
The examples are endless, but the concept stays the same. Always be sure to have a rollback plan in case things get messy.
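
To make the "save the original value" idea concrete, here is a minimal sketch for an SSM Parameter. It assumes `ssm` behaves like a boto3 SSM client (only `get_parameter` and `put_parameter` are used); the function names and the `rollback` directory are made up for illustration. Note that it writes the decrypted value to disk, so the backup location really does need to be a safe place.

```python
import json
import time
from pathlib import Path


def backup_then_update(ssm, name, new_value, backup_dir="rollback"):
    """Save the current parameter to a file, then overwrite it.

    `ssm` is assumed to behave like a boto3 SSM client.
    Returns the path of the rollback file.
    """
    current = ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]

    # Write the rollback file BEFORE changing anything.
    # Assumption: backup_dir is a protected location, since the value is decrypted.
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    safe_name = name.strip("/").replace("/", "_")
    backup_path = Path(backup_dir) / f"{safe_name}-{int(time.time())}.json"
    backup_path.write_text(
        json.dumps({"Name": name, "Type": current["Type"], "Value": current["Value"]})
    )

    ssm.put_parameter(Name=name, Value=new_value, Type=current["Type"], Overwrite=True)
    return backup_path


def rollback(ssm, backup_path):
    """Restore the value that backup_then_update saved."""
    saved = json.loads(Path(backup_path).read_text())
    ssm.put_parameter(
        Name=saved["Name"], Value=saved["Value"], Type=saved["Type"], Overwrite=True
    )
```

With boto3 installed, `ssm` would be `boto3.client("ssm")`. The same pattern (read, store, then change) applies to secrets and IAM policy documents.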

2. Timing is everything

Do you have production-related work to do? If you can, always schedule it for the very beginning of the workweek, especially if you’re collaborating across time zones. That’s usually when traffic is lighter, so if your change requires downtime or carries some kind of risk, it’s safer to do it when fewer clients are actively using your system.

We also have the opposite rule that completes the circle — never perform a production change right before the weekend. In just a few hours, the entire company will be offline, and trust me, you don’t want to be the one who forces people back online to fix an issue.

For that same reason, the very end of a working day isn’t the best time for sensitive tasks, either.
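
If you want your tooling to enforce this, a small guard can do it. The sketch below is only an illustration: the Monday-to-Friday workweek and the 15:00 cutoff are assumptions, so adjust them to your own company's schedule.

```python
from datetime import datetime

# Assumptions for illustration: a Monday-to-Friday workweek and a 15:00 cutoff.
LAST_WORKDAY = 4        # datetime.weekday(): Monday is 0, Friday is 4
WEEKEND_DAYS = {5, 6}   # Saturday and Sunday
CUTOFF_HOUR = 15        # no risky changes from 15:00 onward


def is_safe_change_window(now: datetime) -> bool:
    """Return False on weekends, on the last workday, and late in the day."""
    if now.weekday() in WEEKEND_DAYS:
        return False
    if now.weekday() == LAST_WORKDAY:
        return False
    return now.hour < CUTOFF_HOUR
```

A deploy script could call `is_safe_change_window(datetime.now())` and refuse to continue (or ask for an explicit override) when it returns `False`.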

3. Work on Dev before Prod — gradually

If you have no idea what a development or QA environment is, stop everything you’re doing right now and go build one.

As a best practice, we always want to avoid testing features in a live Production environment.

Even if you’re working at a small company, you should never have a single environment shared by your development and production workloads. It’s a recipe for outages and downtime.

It’s best to have a Development/QA environment, a Staging/Pre-Prod environment and a Production environment. And once you have those environments, you can deploy your changes gradually:

First on Development
Then on Staging
Only after that on Production

This way you can catch errors and bugs before they make it to Production.
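
The gradual order can be sketched in a few lines. Here `deploy` is a hypothetical callback (not a real library function) that deploys to one environment, runs its checks, and returns `True` only if everything passed; the environment names are assumptions too.

```python
from typing import Callable, List, Sequence


def promote(
    deploy: Callable[[str], bool],
    environments: Sequence[str] = ("development", "staging", "production"),
) -> List[str]:
    """Deploy to each environment in order and stop at the first failure.

    Returns the environments that were deployed successfully.
    """
    succeeded = []
    for env in environments:
        if not deploy(env):
            break  # never continue toward production after a failure
        succeeded.append(env)
    return succeeded
```

If the staging deployment fails, `promote` returns `["development"]` and production is never touched.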

4. A wolf in sheep’s clothing — not everything is as innocent as it seems

It’s important to have a Production mindset and always think: is what I’m doing somehow affecting production? The answer isn’t always straightforward.

There are the obvious resources that you know you should be aware of, like the Database, DNS records or your compute service that runs your core production logic (EC2, K8s, Lambda functions, you name it). But you shouldn’t let your guard down so easily when you’re working on other resources.

Example (based on a true story):
The security team gives you a list of IAM Roles (created by CloudFormation) that have been unused for more than 180 days, and tells you to handle it. You may think that you can delete them and no harm will be done. But once you delete them, dozens of Production CloudFormation stacks suddenly can’t be deployed anymore, because you deleted resources that those stacks created, and now they’ve drifted.

So always think twice:

Is my action touching production?
Am I absolutely sure about it?
If I’m not sure what the resource I’m dealing with is, it’s better to be cautious and to tread lightly.
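
For the IAM Roles story, one cheap safety check is to compare the list you were given against what your stacks own before deleting anything. In this sketch, `stack_owned` is assumed to be a set of role names you collected from your CloudFormation stacks beforehand (for example from the `PhysicalResourceId` values that boto3's `list_stack_resources` returns); the function name is made up for illustration.

```python
from typing import Iterable, List, Set, Tuple


def split_roles(
    unused_roles: Iterable[str], stack_owned: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split unused roles into deletion candidates and stack-managed roles."""
    candidates, managed = [], []
    for role in unused_roles:
        if role in stack_owned:
            managed.append(role)      # owned by a stack: do not delete by hand
        else:
            candidates.append(role)   # still review before deleting
    return candidates, managed
```

Roles that land in `managed` should be removed through the stack's template, not deleted directly, so the stack doesn't drift.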

5. Overcome the shame — ask for help

Oops. You did your best but you still broke prod.

Take a deep breath and relax. Don’t panic. It’s unpleasant, but it will pass.

You probably want to fix it ASAP, and the fewer people who know the better. But it’s important to overcome the shame and ask for help. It’s better that your manager hears it from you than from someone else.

Consult with your teammates and fix it together. If you try to handle it yourself without anyone else knowing, there is a chance you can actually make it worse.

Everyone makes mistakes; it’s human nature. I can guarantee you that even your CTO has broken production a few times in their career. So don’t take it to heart, just focus on fixing it the best way you can.

6. Version control best practices — don’t take shortcuts

Don’t take shortcuts.

Do you have a small and completely safe change? Don’t be lazy. Open a PR and send it to a teammate to review before you deploy it.

NEVER. WORK. ON. MASTER/MAIN.

Developers may be limited by repository rules, so even if they want to, they can’t work directly on the Master/Main branch. But DevOps Engineers usually have Admin privileges on GitHub, so if they push directly to master, no one can stop them.

Working with PRs is crucial because:

CI/CD workflows may only trigger on merge, not direct pushes
Without a PR, you lose review and can miss mistakes
Rollbacks are harder if you work directly on master
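
If you have Admin privileges and nothing on the server stops you, you can at least stop yourself locally. Below is a minimal sketch of a Git `pre-push` hook written in Python. Git passes one line per pushed ref on stdin in the form `<local ref> <local sha> <remote ref> <remote sha>`; the protected branch names are an assumption.

```python
import sys

# Assumption: these are the branches you never want to push to directly.
PROTECTED = {"refs/heads/main", "refs/heads/master"}


def blocked_refs(lines):
    """Return the protected remote refs that this push would update."""
    blocked = []
    for line in lines:
        parts = line.split()
        # parts[2] is the remote ref being updated
        if len(parts) == 4 and parts[2] in PROTECTED:
            blocked.append(parts[2])
    return blocked


def main(lines) -> int:
    """Return the hook's exit code: 1 aborts the push, 0 allows it."""
    hits = blocked_refs(lines)
    if hits:
        print(
            f"Refusing direct push to {', '.join(hits)}. Open a PR instead.",
            file=sys.stderr,
        )
        return 1
    return 0
```

Saved as an executable `.git/hooks/pre-push`, the script would end with `sys.exit(main(sys.stdin))`. Keep in mind that a local hook can be skipped with `git push --no-verify`, so it is a seatbelt for yourself, not a replacement for branch protection rules.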

7. Root account — even scarier than a production account

You probably know that when you’re dealing with your production environment, you should pay attention and be careful.

But with the root account, you should be even more careful. If you’re using AWS Organizations, the root account (AWS calls it the management account) is the account that manages all the other accounts, including production.

The most common encounter DevOps Engineers have with the root account is managing SCPs (Service Control Policies). If, for example, you apply an SCP to the wrong account or detach the FullAWSAccess policy, you can affect all the services in all the accounts at once.

So if you’re not paying attention, you can cause an outage to your entire Organization without even noticing.
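
One way to force yourself to pay attention is a pre-flight check before any SCP change. The sketch below is only an illustration: the function names and the `confirmed_wide` flag are hypothetical, and the ID patterns follow the formats AWS Organizations uses for roots (`r-...`), OUs (`ou-...`) and 12-digit account IDs.

```python
import re
from typing import List

ROOT_ID = re.compile(r"^r-[0-9a-z]{4,32}$")
OU_ID = re.compile(r"^ou-[0-9a-z]{4,32}-[0-9a-z]{8,32}$")
ACCOUNT_ID = re.compile(r"^\d{12}$")


def target_scope(target_id: str) -> str:
    """Classify an SCP target as 'root', 'ou' or 'account'."""
    if ROOT_ID.match(target_id):
        return "root"
    if OU_ID.match(target_id):
        return "ou"
    if ACCOUNT_ID.match(target_id):
        return "account"
    raise ValueError(f"Unrecognized target ID: {target_id}")


def preflight(
    action: str, policy_name: str, target_id: str, confirmed_wide: bool = False
) -> List[str]:
    """Return reasons to stop; an empty list means the change may proceed."""
    problems = []
    scope = target_scope(target_id)
    if scope in ("root", "ou") and not confirmed_wide:
        problems.append(f"Target is a {scope}: this can affect many accounts at once.")
    if action == "detach" and policy_name == "FullAWSAccess":
        problems.append("Detaching FullAWSAccess can block every service on the target.")
    return problems
```

For example, `preflight("detach", "FullAWSAccess", "r-ab12")` returns two warnings, while attaching a policy to a single account ID returns none.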

8. IaC — don’t do anything manually

Remember we talked about how it’s important not to be lazy? Don’t do anything manually on the AWS console.

IaC (Infrastructure as Code) can help you deploy changes with ease, but sometimes it takes more time to write Terraform code for a new resource than to deploy it manually. Don’t get tempted.

Why is it so important?

Easier rollbacks (since the code is in a repo)
More scalable
Consistent across environments
You can preview your changes with a plan before deploying
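
The plan preview is also something you can check automatically. Assuming you use Terraform and export the plan as JSON (`terraform plan -out=tfplan`, then `terraform show -json tfplan`), this sketch lists every resource the plan would delete or replace. The function name is made up for illustration.

```python
import json
from typing import List, Tuple


def destructive_changes(plan_json: str) -> List[Tuple[str, str]]:
    """List (address, kind) for resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replacement shows up as both "delete" and "create".
            kind = "replace" if "create" in actions else "delete"
            flagged.append((rc["address"], kind))
    return flagged
```

A pipeline could fail, or require an extra approval, whenever this list is not empty. That is especially useful when the IaC code was written in a hurry or generated by a tool.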

9. AI — powerful, but dangerous

Today, AI is everywhere, and we couldn’t run from it even if we tried. And while it can be a productivity boost, it can also cause an outage if you’re not careful.

Whether it’s malfunctioning code that breaks your application logic, or IaC code that unintentionally deletes core resources, you need to make sure you use AI in a responsible way.

Don’t deploy untested AI-generated code to production
Don’t rely solely on AI without checking documentation
Don’t test AI code on production; that’s what Dev and Staging are for

10. Learn from your mistakes

As much as we don’t like them and want to avoid them, production incidents are a natural part of life.

If you already broke prod, try to learn from the mistake. That’s why we do retro meetings after every incident. And trust me, you won’t forget what you did that caused an outage, and that’s how you will get better.

At the end of the day, production incidents are the best teachers.

Conclusion

The harsh truth is that production incidents are here to stay, and we need to learn to live with them.

But if you follow the best practices, keep a “production mindset”, always ask yourself “Is what I’m about to do affecting production functionality?”, and plan your steps accordingly, you can definitely avoid many incidents and improve your overall system uptime.

Got your own rule? Or your own production war story? Please share it in the comments below!