It is a rite of passage for every engineer to take down production. Whether it be a full blown 500 page being served to all users or breaking back...
My worst production outage was accidentally adding code which redeployed the application upon boot. On this very website. 😄
I added some code in a Rails initializer file which pinged the Heroku API to change a config variable on boot. I didn't really think the whole thing through: every time you change a config variable, the app redeploys and restarts. The code was written in such a way that it only executed in production, so we had not caught it earlier.
Enter the infinite loop.
Nothing we could do would stop the loop. The app just kept redeploying over and over again and nothing would work to stop it. We couldn't push new code, we couldn't figure anything out.
status.heroku.com showed yellow indicating something was going on with the system. That was because of me.
Eventually we figured out we could stop the problem by revoking my account's privileges within the app on Heroku. But shortly after that, Heroku suspended our whole organization account. dev.to was no longer being served.
We got some people on the phone and got the account restored and back online soon enough after that.
That was a day of learning.
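For anyone wondering how a boot-time config write can avoid retriggering itself: one common guard is to make the write idempotent, so the boot code only touches the variable when its value actually differs. Below is a minimal Python sketch of that idea, not the code from the story above. `read_value` and `write_value` are hypothetical callables standing in for whatever API client you use; nothing here is a real Heroku API call, and the guard only helps if the desired value is stable from one boot to the next.

```python
def sync_config_var(read_value, write_value, name, desired):
    """Write a config variable only when its current value differs.

    read_value(name) and write_value(name, value) are hypothetical
    callables (assumptions for this sketch), e.g. thin wrappers around
    a platform API client. Returns True if a write was issued.
    """
    if read_value(name) == desired:
        # Already up to date: skip the write, so no redeploy is triggered.
        return False
    write_value(name, desired)
    return True
```

On the first boot the write happens and triggers one restart; on the next boot the values match, the write is skipped, and the loop ends there.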
Great story! Thanks for sharing @ben ! That was some innovative problem solving to revoke your account privileges to fix the issue. I always marvel at how innovative our team gets with solutions when our backs are against the wall. Feels like the pressure tends to really make us think outside the box to get things done.
Wow that's a good shout, changing the permissions - kinda the closest you have to ripping out ethernet cables as 'the hackers get closer' :D
I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:
IIRC, it was caused by either a missing application key in the production environment or a badly-formatted YAML. I know I've done both.
I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.
OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes. Took down our background workers for a few minutes. I immediately put a test in to validate that YAML file and it hasn't happened since. Plus, that test has actually caught a few errors.
Could not agree more!
Great post. More folks should be sharing these kinds of stories. We are all human and we make mistakes. It is going to happen, but we can always learn from them. I am glad to see your boss reacted so well and helped you through it. It's also great to hear they put blame on themselves as well. That is not common for a lot of people who go through situations like this.
Also, great use of GIFs. Especially that last one. It fits perfectly. :)
Thank you! I am very lucky to work in such a supportive environment. Some people don't have that and I am hoping others will share some of their stories so that everyone can realize we are all in this together and downtime is just an occupational hazard.
Now that is a hell of a first day!
I once wrote innocent-looking code to invalidate the cache on module change, but who knew that this module was being changed in a loop on API calls, and that each cache invalidation was making a separate, non-piped UDP call to Redis.
Long story short it took down the system. It wasn't fun...
Oooof, Redis is always tricky! We once had an engineer run flushdb on one of our Redis databases to try and fix a bug. The missing caches in the middle of the day made our site unusable for some of our bigger clients, and it was a scramble to get it fixed. We have since put in place some safety features like read-only consoles and alerts for missing caches. As long as you are learning from these experiences then they are not a waste 😊
Yep, so the awesome thing with Redis is the ability to pipe commands, so the solution to my problem was collecting the cache keys to invalidate and then, in a separate call, making a pipe command to Redis to invalidate them in bulk.
pipe ftw :D
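The "collect keys, then invalidate in bulk" fix described above could look roughly like this in Python. This is a sketch under assumptions, not the commenter's actual code: `CacheInvalidator` is a made-up name, and `client` is assumed to be a redis-py style client (something like `redis.Redis`) whose `pipeline()` returns an object with `delete()` and `execute()`.

```python
class CacheInvalidator:
    """Collects cache keys and deletes them in one pipelined round trip."""

    def __init__(self):
        # A set, so a key marked many times in a loop is deleted only once.
        self._pending = set()

    def mark(self, key):
        """Remember a key for later invalidation; no Redis call happens here."""
        self._pending.add(key)

    def flush(self, client):
        """Delete all pending keys through a single pipeline.

        `client` is assumed to expose redis-py's pipeline() interface.
        Returns the number of keys queued for deletion.
        """
        if not self._pending:
            return 0
        keys = sorted(self._pending)
        pipe = client.pipeline()
        for key in keys:
            pipe.delete(key)  # queued locally, not sent yet
        pipe.execute()        # all queued commands are sent together
        self._pending.clear()
        return len(keys)
```

Inside the hot loop you would only call `mark(key)`, and then call `flush(client)` once afterwards, so the number of round trips no longer grows with the number of module changes.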
I take down production occasionally. Even last week. It doesn't help that we're so cash-strapped that we can't afford the usual test/production environment. I often debug in production. This is profoundly sub-optimal. Maybe it helps that we're Australian and so really good at doing a lot with a little. Maybe it also helps that we're ever so slightly crazy.
I don't have an interesting story to share, but here are some of my general tips for not breaking production. Hope some of these are helpful. I'm sure there are plenty more, feel free to share yours.
My cutoff for the day is 3:30pm! Unless it's an emergency, I won't merge a PR after that.
I was very lucky that in my 10 years, I only once turned off a production server via SSH thinking I was on my computer's terminal. The server had IPMI so it was down for about five minutes. Now I can tell the usefulness of the prompt.
It really scares me a lot to not have had more major problems in my career. It makes me feel like I am probably over-confident and that once I do screw up, I will screw up big! In my defence, I read a lot of articles about good practices and ALWAYS ensure I have backups.
I think the best thing you can do is not be afraid of when/if you make a mistake and it leads to an outage. Rather than fear it, know in the back of your mind that it is part of the job, and when it happens, don't let it define you; let it shape you and learn from it.
Also +1 for best practices!!!!
On my first dev job, I was working on this sales site, and I made a change to the Thanks E-Mail, which gets automatically sent to the customer once he/she makes a purchase, and broke it, and I didn't notice.
So... for, like, 24 hours the mail didn't get sent and customers got confused and started buying things again and again, believing the transaction hadn't worked because the mail wasn't delivered.
My team had to code a job to resend the failed e-mails after the template was fixed, and correct the duplicate purchases customers had made.
One of my teammates got really mad, but I didn't hear anything from my manager at the time. Some reassuring words would have been nice. Now I look back and laugh, but at the time it was really awful.
Ooof breaking background workers is always rough bc you usually don't notice it right away and when you finally do, you have a mess to clean up 😝 Been there, done that!!!
Your boss sounds like a nice guy! Lucky to have people like that in management! Thanks for sharing :)
Another great one from Twitter!
Some great responses this got on Twitter!
ProTip: If you break production and feel bad about causing extra work for others, beer makes everything better 😃