It is a rite of passage for every engineer to take down production. Whether it be a full-blown 500 page being served to all users or breaking background processing, at some point in your career, you will take down production. When it happens, especially for the first time, it can be ROUGH! At least, I know it was for me.
My First Production Outage
Shortly after being hired by Kenna, I was working on a ticket that required me to add a column to a table in the database. We were using simple Rails at the time, so all I did was write a migration for the new column and issue a PR. The PR was approved and I immediately merged it. Every migration I had ever run in my hefty 2+ years of experience took a few seconds. Why should this be any different? I’m sure some of you can see where this is headed 😉
Unfortunately, the migration did not take a few seconds. You see, the table I added the column to was the BIGGEST table in our database! Hundreds of millions of rows big. As soon as that migration started, the entire table locked itself and remained locked for 3 hours. I should also mention, this table gets A LOT of writes. Jobs meant to update the table started blowing up left and right. It was a disaster. Luckily, we were pretty small at the time, so we waited it out. Once the migration finished, we retried the jobs that had failed and came out the other side just fine.
Now the business might have been fine, but I was devastated. A couple days later my boss wanted to talk to me. I thought for sure this was it, I was going to get fired. But instead, my boss asked how I was doing after the outage. I responded I was hanging in there. He went on to assure me that it wasn’t just my fault and to remember that someone else had approved the PR. He also said he could have done better to prevent it. He explained that this is just what happens sometimes and that I am still learning so I shouldn’t beat myself up about it.
Hearing that was exactly what I needed. It has been 3 years since my first outage and I have taken down production, in varying degrees, more than once in that time. Does that make me a bad engineer? NO WAY! Every time it happens, I learn from the experience and I take steps to prevent it from ever happening again. I also remind myself, it’s not the end of the world when it does happen. Luckily, I work with an amazing group of people, and usually after a few months, we look back on our mistakes and laugh about them.
What is your production outage story?
How did you deal with it at the time? What did you learn from it? Let's share some war stories, so the next time an engineer takes down production, they can read this thread and be reminded that they are not alone!!! 🤗
Top comments (24)
My worst production outage was accidentally adding code which redeployed the application upon boot. On this very website. 😄
I added some code in a Rails initializer file which pinged the Heroku API to change a config variable on boot. I didn't really think through the whole thing because every time you change a config variable, the app redeploys and restarts. The code was written in such a way that it only executed in this way in production, so we had not caught it earlier.
Enter the infinite loop.
Nothing we could do would stop the loop. The app just kept redeploying over and over again and nothing would work to stop it. We couldn't push new code, we couldn't figure anything out.
status.heroku.com showed yellow indicating something was going on with the system. That was because of me.
Eventually we figured out we could stop the problem by revoking my account's privileges within the app on Heroku. But shortly after that, Heroku suspended our whole organization account. dev.to was no longer being served.
We got some people on the phone and got the account restored and back online soon enough after that.
That was a day of learning.
Great story! Thanks for sharing @ben ! That was some innovative problem solving to revoke your account privileges to fix the issue. I always marvel at how innovative our team gets with solutions when our backs are against the wall. Feels like the pressure tends to really make us think outside the box to get things done.
Wow that's a good shout, changing the permissions - kinda the closest you have to ripping out ethernet cables as 'the hackers get closer' :D
I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:
IIRC, it was caused by either a missing application key in the production environment or a badly-formatted YAML. I know I've done both.
I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.
OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes. Took down our background workers for a few minutes. I immediately put a test in to validate that YAML file and it hasn't happened since. Plus, that test has actually caught a few errors.
Could not agree more!
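For anyone curious what a test like that could look like, here is a minimal sketch in plain Ruby. It is not the actual test from our codebase: the job name, the keys, and the yaml_error helper are all made up for illustration. It only relies on YAML.safe_load from the standard library, which raises when an unquoted cron string starts with an asterisk.

```ruby
require "yaml"

# Returns nil when the text parses as YAML, or the parser's error
# message when it does not. (Hypothetical helper, not from the post.)
def yaml_error(text)
  YAML.safe_load(text)
  nil
rescue Psych::Exception => e
  e.message
end

# Hypothetical schedule entry with the cron string left unquoted.
# A leading "*" is read as the start of a YAML alias, so parsing fails.
UNQUOTED_CRON = <<~CONFIG
  nightly_cleanup:
    cron: * * * * *
CONFIG

# The same entry with the cron string quoted parses as a plain string.
QUOTED_CRON = <<~CONFIG
  nightly_cleanup:
    cron: "* * * * *"
CONFIG

puts "unquoted: #{yaml_error(UNQUOTED_CRON) ? 'invalid' : 'ok'}"
puts "quoted:   #{yaml_error(QUOTED_CRON) ? 'invalid' : 'ok'}"
```

In a real suite you would probably read the schedule file from disk and assert that the helper returns nil, so a bad edit fails CI instead of the workers.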
Great post. More folks should be sharing these kinds of stories. We are all human and we make mistakes. It is going to happen, but we can always learn from them. I am glad to see your boss reacted so well and helped you through it. It's also great to hear they put blame on themselves as well. That is not common for a lot of people who go through situations like this.
Also, great use of GIFs. Especially that last one. It fits perfectly. :)
Thank you! I am very lucky to work in such a supportive environment. Some people don't have that and I am hoping others will share some of their stories so that everyone can realize we are all in this together and downtime is just an occupational hazard.
Now that is a hell of a first day!
I once wrote innocent-looking code to invalidate the cache on module change,
but who knew that this module was being changed in a loop on API calls, and the cache invalidation was making a non-piped UDP call to Redis.
Long story short, it took down the system. It wasn't fun...
Oooof, Redis is always tricky! We once had an engineer run flushdb on one of our Redis databases to try and fix a bug. The missing caches in the middle of the day made our site unusable to some of our bigger clients, and it was a scramble to get it fixed. We have since put in place some safety features like read-only consoles and alerts for missing caches. As long as you are learning from these experiences then they are not a waste 😊
Yep, so the awesome thing with Redis is the ability to pipe commands,
so the solution to my problem was collecting the cache keys to invalidate and then, in a separate call,
making a pipe command to Redis to invalidate them in bulk.
pipe ftw :D
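Roughly, the collect-then-pipe idea could be sketched like this in Ruby. This is not the code from that system: BulkInvalidator and the key names are hypothetical, and FakeRedis is an in-memory stand-in so the sketch runs without a server. Its pipelined method is modeled on the block form offered by the redis gem, which is an assumption about the client you would plug in.

```ruby
require "set"

# Collects cache keys as modules change, then deletes them in one batch.
# (Hypothetical class, for illustration only.)
class BulkInvalidator
  def initialize
    @keys = Set.new
  end

  # Record the key instead of calling Redis right away.
  # The Set drops duplicates when a module changes many times in a loop.
  def mark(key)
    @keys << key
  end

  # Send every pending delete through a single pipeline, then reset.
  # `client` is assumed to respond to `pipelined` and to yield an
  # object that responds to `del`.
  def flush(client)
    return 0 if @keys.empty?

    count = @keys.size
    client.pipelined do |pipe|
      @keys.each { |key| pipe.del(key) }
    end
    @keys.clear
    count
  end
end

# In-memory stand-in for a Redis client; it only counts round trips
# and remembers which keys were deleted.
class FakeRedis
  attr_reader :round_trips, :deleted

  def initialize
    @round_trips = 0
    @deleted = []
  end

  def pipelined
    @round_trips += 1
    yield self
  end

  def del(key)
    @deleted << key
  end
end

invalidator = BulkInvalidator.new
100.times { invalidator.mark("module:42") } # changed in a loop
invalidator.mark("module:43")

client = FakeRedis.new
invalidator.flush(client)
puts "round trips: #{client.round_trips}, keys deleted: #{client.deleted.size}"
```

The point is that the loop only touches an in-process set, and Redis sees one batched call at the end instead of one call per change.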
I take down production occasionally. Even last week. It doesn't help that we're so cash-strapped that we can't afford the usual test/production environment. I often debug in production. This is profoundly sub-optimal. Maybe it helps that we're Australian and so really good at doing a lot with a little. Maybe it also helps that we're ever so slightly crazy.
I don't have an interesting story to share, but here are some of my general tips for not breaking production. Hope some of these are helpful. I'm sure there are plenty more, feel free to share yours.
My cut off for the day is 3:30pm! Unless it's an emergency, I won't merge a PR after that.
I was very lucky that in my 10 years, I only once turned off a production server via SSH thinking I was on my computer's terminal. The server had IPMI so it was down for about five minutes. Now I appreciate the usefulness of the prompt.
It really scares me a lot to not have had more major problems in my career. It makes me feel like I am probably over-confident and that when I do screw up, I will screw up big! In my defence, I read a lot of articles about good practices and ALWAYS ensure I have backups.
I think the best thing you can do is not be afraid of when/if you make a mistake and it leads to an outage. Rather than fear it, know in the back of your mind that it is part of the job, and when it happens don't let it define you. Let it shape you and learn from it.
Also +1 for best practices!!!!
On my first dev job, I was working on this sales site, and I made a change to the Thanks E-Mail, which gets automatically sent to the customer once he/she makes a purchase, and broke it, and I didn't notice.
So... for, like, 24 hours the mail didn't get sent and customers got confused and started buying things again and again, believing the transaction didn't work because the mail wasn't delivered.
My team had to code a job to resend the failed E-Mails after the template was fixed, and correct the duplicated purchases customers had made.
One of my teammates got really mad, but I didn't hear anything from my manager at the time. Some reassuring words would have been nice. Now I look back and laugh, but at the time it was really awful.
Ooof breaking background workers is always rough bc you usually don't notice it right away and when you finally do, you have a mess to clean up 😝 Been there, done that!!!
Your boss sounds like a nice guy! Lucky to have people like that in management! Thanks for sharing :)