An Engineer’s Rite of Passage

Molly Struve (she/her) ・ 3 min read

It is a rite of passage for every engineer to take down production. Whether it be a full-blown 500 page being served to all users or breaking background processing, at some point in your career, you will take down production. When it happens, especially for the first time, it can be ROUGH! At least, I know it was for me.

My First Production Outage

Shortly after being hired by Kenna, I was working on a ticket that required me to add a column to a table in the database. We were using simple Rails at the time, so all I did was write a migration for the new column and issue a PR. The PR was approved and I immediately merged it. Every migration I had ever run in my hefty 2+ years of experience took a few seconds. Why should this be any different? I’m sure some of you can see where this is headed 😉

Unfortunately, the migration did not take a few seconds. You see, the table I added the column to was the BIGGEST table in our database! Hundreds of millions of rows big. As soon as that migration started, the entire table locked itself and remained locked for 3 hours. I should also mention, this table gets A LOT of writes. Jobs meant to update the table started blowing up left and right. It was a disaster. Luckily, we were pretty small at the time, so we waited it out. Once the migration finished, we retried the jobs that had failed and came out the other side just fine.

Now the business might have been fine, but I was devastated. A couple days later my boss wanted to talk to me. I thought for sure this was it, I was going to get fired. But instead, my boss asked how I was doing after the outage. I responded I was hanging in there. He went on to assure me that it wasn’t just my fault and to remember that someone else had approved the PR. He also said he could have done better to prevent it. He explained that this is just what happens sometimes and that I am still learning so I shouldn’t beat myself up about it.

Hearing that was exactly what I needed. It has been 3 years since my first outage and I have taken down production, in varying degrees, more than once in that time. Does that make me a bad engineer? NO WAY! Every time it happens, I learn from the experience and I take steps to prevent it from ever happening again. I also remind myself, it’s not the end of the world when it does happen. Luckily, I work with an amazing group of people, and usually after a few months, we look back on our mistakes and laugh about them.

What is your production outage story?

How did you deal with it at the time? What did you learn from it? Let's share some war stories, so the next time an engineer takes down production, they can read this thread and be reminded that they are not alone!!! 🤗

Posted on Jan 12 '19 by:

Molly Struve (she/her)

@molly_struve

International Speaker 🗣 Runner 🏃‍♀️ Always Ambitious. Never Satisfied. I ride 🦄's IRL

Discussion


My worst production outage was accidentally adding code which redeployed the application upon boot. On this very website. 😄

I added some code in a Rails initializer file which pinged the Heroku API to change a config variable on boot. I didn't really think through the whole thing, because every time you change a config variable, the app redeploys and restarts. The code was written in such a way that it only executed in production, so we had not caught it earlier.

Enter the infinite loop.

Nothing we could do would stop the loop. The app just kept redeploying over and over again and nothing would work to stop it. We couldn't push new code, we couldn't figure anything out.

status.heroku.com showed yellow indicating something was going on with the system. That was because of me.

Eventually we figured out we could stop the problem by revoking my account's privileges within the app on Heroku. But shortly after that, Heroku suspended our whole organization account. dev.to was no longer being served.

We got some people on the phone and got the account restored and back online soon enough after that.

That was a day of learning.
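
The boot loop described above can be sketched in a few lines. This is only an illustration, not the code that actually ran on dev.to: `FakePlatform`, `on_boot_unguarded`, `on_boot_guarded`, `simulate` and the `RELEASE_FLAG` key are all hypothetical names, and the stand-in platform simply assumes that every config write triggers a restart. One possible guard is to write the value only when it differs from what is already stored.

```python
class FakePlatform:
    """Hypothetical stand-in for a host where any config write restarts the app."""

    def __init__(self):
        self.config = {}
        self.restarts = 0

    def set_config(self, key, value):
        self.config[key] = value
        self.restarts += 1  # assumption: every write triggers a redeploy


def on_boot_unguarded(platform, key, value):
    # Writes on every boot, so every boot schedules another restart.
    platform.set_config(key, value)


def on_boot_guarded(platform, key, value):
    # Writes only when the stored value differs, so the second boot is quiet.
    if platform.config.get(key) != value:
        platform.set_config(key, value)


def simulate(boot_hook, max_boots=10):
    """Count boots until the app stops restarting (capped at max_boots)."""
    platform = FakePlatform()
    boots = 0
    restart_pending = True  # the initial deploy
    while restart_pending and boots < max_boots:
        boots += 1
        restarts_before = platform.restarts
        boot_hook(platform, "RELEASE_FLAG", "on")
        restart_pending = platform.restarts > restarts_before
    return boots


unguarded_boots = simulate(on_boot_unguarded)  # hits the cap
guarded_boots = simulate(on_boot_guarded)      # settles after the second boot
```

In the simulation the unguarded hook never settles and only stops at the `max_boots` cap, while the guarded hook writes once and then leaves the config alone.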

 

Great story! Thanks for sharing @ben ! That was some innovative problem solving to revoke your account privileges to fix the issue. I always marvel at how innovative our team gets with solutions when our backs are against the wall. Feels like the pressure tends to really make us think outside the box to get things done.

 

I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:

we had a ~3 minute period at 9:30 EST when some users might not have been able to access the app or storefronts. It was caused by a bad deploy and has been rectified

IIRC, it was caused by either a missing application key in the production environment or a badly-formatted YAML. I know I've done both.

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

 

OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes. Took down our background workers for a few minutes. I immediately put a test in to validate that YAML file and it hasn't happened since. Plus, that test has actually caught a few errors.
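
For anyone curious what such a check might look like, here is a rough standard-library sketch. It is not the test described above, which most likely just loads the file with a real YAML parser. It assumes the failure came from a cron value starting with a bare `*`, which YAML reads as the start of an alias rather than as plain text. The `find_unquoted_star_values` helper and the sample schedule are made up for illustration.

```python
import re

# Matches "key: value" lines whose value starts with an unquoted "*".
UNQUOTED_STAR = re.compile(r"^\s*[\w-]+:\s+\*")


def find_unquoted_star_values(yaml_text):
    """Return 1-based line numbers whose value begins with a bare '*'."""
    return [
        number
        for number, line in enumerate(yaml_text.splitlines(), start=1)
        if UNQUOTED_STAR.match(line)
    ]


# Hypothetical schedule file: the second job forgot the quotes.
sample = """\
nightly_report:
  cron: "0 2 * * *"
cache_sweep:
  cron: * * * * *
"""

bad_lines = find_unquoted_star_values(sample)
```

A test suite could fail whenever `bad_lines` is non-empty, and report the offending line numbers before the file ever reaches the workers.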

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Could not agree more!

 

Now that is a hell of a first day!

 

Great post. More folks should be sharing these kinds of stories. We are all human and we make mistakes. It is going to happen, but we can always learn from them. I am glad to see your boss reacted so well and helped you through it. It's also great to hear they put blame on themselves as well. That is not common for a lot of people who go through situations like this.
Also, great use of GIFs. Especially that last one. It fits perfectly. :)

 

Thank you! I am very lucky to work in such a supportive environment. Some people don't have that and I am hoping others will share some of their stories so that everyone can realize we are all in this together and downtime is just an occupational hazard.

 

I take down production occasionally. Even last week. It doesn't help that we're so cash-strapped that we can't afford the usual test/production environment. I often debug in production. This is profoundly sub-optimal. Maybe it helps that we're Australian and so really good at doing a lot with a little. Maybe it also helps that we're ever so slightly crazy.

 

I was very lucky that in my 10 years, I only once turned off a production server via SSH thinking I was on my computer's terminal. The server had IPMI so it was down for about five minutes. Now I appreciate how useful the shell prompt is.

It really scares me a lot to not have had more major problems in my career. It makes me feel like I am probably over-confident and that when I do screw up, I will screw up big! In my defence, I read a lot of articles about good practices and ALWAYS ensure I have backups.

 

I think the best thing you can do is not be afraid of when/if you make a mistake and it leads to an outage. Rather than fear it, know in the back of your mind that it is part of the job, and when it happens, don't let it define you. Let it shape you and learn from it.

Also +1 for best practices!!!!

 

I once wrote innocent-looking code to invalidate the cache on module change, but who knew that this module was being changed in a loop on API calls, and each cache invalidation was making a separate, non-piped call to Redis.

Long story short it took down the system. It wasn't fun...

 

Oooof, Redis is always tricky! We once had an engineer do flushdb on one of our Redis databases to try and fix a bug. The missing caches in the middle of the day made our site unusable for some of our bigger clients, and it was a scramble to get it fixed. We have since put in place some safety features like read-only consoles and alerts for missing caches. As long as you are learning from these experiences then they are not a waste 😊

 

yep, so the awesome thing with Redis is the ability to pipe commands, so the solution to my problem was collecting the cache keys to invalidate and then, in a separate call, making a pipe command to Redis to invalidate them in bulk.

pipe ftw :D
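
A minimal sketch of that collect-then-pipe idea, for readers who have not used it. The method names (`pipeline()`, `delete()`, `execute()`) mirror the redis-py client, but `FakeRedis` and `FakePipeline` are hypothetical in-memory stand-ins so the example runs without a server, and `invalidate_in_bulk` and the `module:*` keys are invented for illustration.

```python
class FakePipeline:
    """Hypothetical stand-in for a Redis pipeline: buffers, then sends once."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def delete(self, key):
        self._queued.append(key)  # buffered locally, nothing sent yet
        return self

    def execute(self):
        self._client.round_trips += 1  # the whole batch costs one round trip
        results = [self._client._remove(key) for key in self._queued]
        self._queued = []
        return results


class FakeRedis:
    """Hypothetical in-memory stand-in that counts round trips."""

    def __init__(self, data):
        self.store = dict(data)
        self.round_trips = 0

    def _remove(self, key):
        if key in self.store:
            del self.store[key]
            return 1
        return 0

    def delete(self, key):
        self.round_trips += 1  # an unpiped delete costs one trip per key
        return self._remove(key)

    def pipeline(self):
        return FakePipeline(self)


def invalidate_in_bulk(client, keys):
    """Collect the keys first, then delete them all in a single pipeline."""
    unique_keys = list(dict.fromkeys(keys))  # drop duplicates, keep order
    pipe = client.pipeline()
    for key in unique_keys:
        pipe.delete(key)
    return sum(pipe.execute())  # number of keys actually removed


client = FakeRedis({"module:1": "a", "module:2": "b", "module:3": "c"})
removed = invalidate_in_bulk(client, ["module:1", "module:2", "module:1"])
```

The point is the round-trip count: however many keys are collected, the pipeline sends them in one batch instead of one call per change.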

 

I don't have an interesting story to share, but here are some of my general tips for not breaking production. Hope some of these are helpful. I'm sure there are plenty more, feel free to share yours.

  • It starts with writing good code. This can mean many things depending on the person and language, but my general rules are:
    • Have consistent styling or follow your team's style guide. I find this makes it easier to see when something is out of place during development.
    • Keep things simple and clear in your code. When problems arise, you may not be thinking straight. If your code is too confusing and unclear, this may only compound the problem. Other developers may have a difficult time helping get things online if they cannot decipher the code.
  • Test locally, on a dev server, then on production.
    • Do not just run automated tests or test locally, but test on a development server if available.
    • Once your change is deployed, test it on production.
    • Test small changes too.
  • Have someone review your code before going live with it.
    • Have them test it.
    • Make sure they actually review it and don't just give the go ahead.
    • Create some guidelines around this with your team if none exist.
  • Don't write/run queries directly on production.
    • Write and test them locally or on a dev server. After running them in a testing environment, make sure the updated data looks correct in the final product.
    • If it is an update or delete statement, write a select version of the same query first. This will help ensure you are pulling in the correct data. This will also help in the next step.
    • BACK UP THE DATA. If you are unsure how, this can be a simple select statement, copied to a spreadsheet, and uploaded somewhere (as opposed to leaving _temp tables cluttering the DB).
    • Again, have queries reviewed by someone before running them.
    • If you are new, you should not have production database access on your first day. If you are in this situation seek out senior members of the team to verify and help run queries with you.
  • Do not push changes towards the end of the day or before the weekend.
    • Save yourself the trouble of having to scramble to fix something during your personal time or letting the problem continue while you are out of office.
    • Push things live in the morning while everyone is in the office.
  • Don't beat yourself up over it.
    • Development is hard, every project has a lot of different things to worry about, it happens.
    • Learn from your mistakes and help future developers avoid them as well.
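
The query tips above (select first, back up, then change) can be sketched roughly as follows. This is only one way to do it, using Python's built-in `sqlite3` and `csv` modules. The `guarded_delete` helper, the `orders` table and its columns are hypothetical, and the table name and WHERE clause are assumed to come from a reviewed script, never from user input.

```python
import csv
import io
import sqlite3


def guarded_delete(conn, table, where, params=()):
    """Preview matching rows, back them up as CSV text, then delete them."""
    # Step 1: the SELECT version of the statement shows what will be hit.
    cursor = conn.execute(f"SELECT * FROM {table} WHERE {where}", params)
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()

    # Step 2: keep a copy of those rows (here as CSV text to store elsewhere).
    backup = io.StringIO()
    writer = csv.writer(backup)
    writer.writerow(header)
    writer.writerows(rows)

    # Step 3: run the DELETE and refuse to commit if the counts disagree.
    deleted = conn.execute(f"DELETE FROM {table} WHERE {where}", params).rowcount
    if deleted != len(rows):
        conn.rollback()
        raise RuntimeError(f"previewed {len(rows)} rows but deleted {deleted}")
    conn.commit()
    return backup.getvalue(), deleted


# Hypothetical data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, ?)",
    [(1, "paid"), (2, "cancelled"), (3, "cancelled")],
)
conn.commit()

backup_csv, deleted_count = guarded_delete(conn, "orders", "status = ?", ("cancelled",))
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Running the select and the delete with the same WHERE clause and parameters is what makes the preview trustworthy. The CSV text is the spreadsheet-style backup the list mentions.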
 

Do not push changes towards the end of the day or before the weekend.

My cut-off for the day is 3:30pm! Unless it's an emergency, I won't merge a PR after that.

 

On my first dev job, I was working on a sales site, and I made a change to the Thanks E-Mail, which gets automatically sent to the customer once he/she makes a purchase. I broke it, and I didn't notice.

So... for, like, 24 hours the mail didn't get sent and customers got confused. They started buying things again and again, believing the transaction hadn't worked because the mail wasn't delivered.

My team had to code a job to resend the failed E-Mails after the template was fixed, and correct the duplicated purchases customers had made.

One of my teammates got really mad, but I didn't hear anything from my manager at the time. Some reassuring words would have been nice. Now I look back and laugh, but at the time it was really awful.

 

Ooof breaking background workers is always rough bc you usually don't notice it right away and when you finally do, you have a mess to clean up 😝 Been there, done that!!!

 
 

Your boss sounds like a nice guy! Lucky to have people like that in management! Thanks for sharing :)

 
 

Some great responses this got on Twitter!

 

Another great one from Twitter!

 

ProTip: If you break production and feel bad about causing extra work for others, beer makes everything better 😃