DEV Community

An Engineer’s Rite of Passage

Molly Struve (she/her) on January 12, 2019

It is a rite of passage for every engineer to take down production. Whether it be a full blown 500 page being served to all users or breaking back...


Ben Halpern • Jan 12 '19

My worst production outage was accidentally adding code which redeployed the application upon boot. On this very website. 😄

I added some code in a Rails initializer file which pinged the Heroku API to change a config variable on boot. I didn't really think through the whole thing because every time you change a config variable, the app redeploys and restarts. The code was written in such a way that it only executed in this way in production, so we had not caught it earlier.

Enter the infinite loop.

Nothing we could do would stop the loop. The app just kept redeploying over and over again and nothing would work to stop it. We couldn't push new code, we couldn't figure anything out.
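A minimal sketch of one way to avoid this kind of restart loop, assuming the value written on boot is constant: read the current config first and write only when it differs. The client class, its `get_config`/`set_config` methods, and the `RELEASE_TAG` key are hypothetical stand-ins for illustration, not the Heroku API or the code that ran on dev.to.

```python
def sync_config_var(client, key, desired):
    """Write `desired` only when it differs from the stored value.

    Returns True if a write (and therefore a restart) was triggered.
    """
    current = client.get_config().get(key)
    if current == desired:
        return False  # nothing to change, so no restart is triggered
    client.set_config({key: desired})
    return True


class FakePlatformClient:
    """Hypothetical hosting API stand-in: every config write counts as a restart."""

    def __init__(self, config=None):
        self.config = dict(config or {})
        self.restarts = 0

    def get_config(self):
        return dict(self.config)

    def set_config(self, updates):
        self.config.update(updates)
        self.restarts += 1


def simulate_boots(client, key, desired, max_boots=5):
    """Boot again after every restart; return how many boots happened."""
    boots = 0
    while boots < max_boots:
        boots += 1
        if not sync_config_var(client, key, desired):
            break  # this boot did not write, so the loop ends
    return boots
```

With the guard in place, the first boot writes the value and the second boot finds nothing to change. An unconditional write would restart on every boot until something outside the app stopped it.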

status.heroku.com showed yellow indicating something was going on with the system. That was because of me.

Eventually we figured out we could stop the problem by revoking my account's privileges within the app on Heroku. But shortly after that, Heroku suspended our whole organization account. dev.to was no longer being served.

We got some people on the phone and got the account restored and back online soon enough after that.

That was a day of learning.

Molly Struve (she/her) • Jan 12 '19

Great story! Thanks for sharing @ben ! That was some innovative problem solving to revoke your account privileges to fix the issue. I always marvel at how innovative our team gets with solutions when our backs are against the wall. Feels like the pressure tends to really make us think outside the box to get things done.

Dan Silcox • Jun 10 '21

Wow that's a good shout, changing the permissions - kinda the closest you have to ripping out ethernet cables as 'the hackers get closer' :D

Jamie Lawrence • Jan 14 '19

I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:

we had a ~3 minute period at 9:30 EST when some users might not have been able to access the app or storefronts. It was caused by a bad deploy and has been rectified

IIRC, it was caused by either a missing application key in the production environment or a badly-formatted YAML. I know I've done both.

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Molly Struve (she/her) • Jan 14 '19

OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes. Took down our background workers for a few minutes. I immediately put a test in to validate that YAML file and it hasn't happened since. Plus, that test has actually caught a few errors.
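A rough sketch of the kind of check such a test could run, assuming a simple `key: value` layout. It is a standard-library heuristic, not a YAML parser, and it is not the test described above. It flags unquoted values that start with a character YAML treats specially, which is what an unquoted cron string like `*/5 * * * *` does. The function name and the sample keys are made up for illustration, and a legitimate alias such as `*defaults` would also be flagged.

```python
# Characters that give a plain (unquoted) YAML value a special meaning when
# they come first: "*" starts an alias, "&" an anchor, "!" a tag, and
# "%", "@" and "`" are reserved indicators.
RISKY_FIRST_CHARS = set("*&!%@`")


def find_risky_unquoted_values(yaml_text):
    """Return (line_number, key, value) for values that probably need quotes."""
    findings = []
    for line_number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = stripped.partition(":")
        value = value.strip()
        if separator and value and value[0] in RISKY_FIRST_CHARS:
            findings.append((line_number, key.strip(), value))
    return findings
```

A test suite could call this on the schedule file and fail when the returned list is not empty. Loading the file with a real YAML parser in the test would catch a wider range of mistakes.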

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Could not agree more!

dAVE Inden • Jan 12 '19

Great post. More folks should be sharing these kinds of stories. We are all human and we make mistakes. It is going to happen, but we can always learn from them. I am glad to see your boss reacted so well and helped you through it. It's also great to hear they put blame on themselves as well. That is not common for a lot of people who go through situations like this.
Also, great use of GIFs. Especially that last one. It fits perfectly. :)

Molly Struve (she/her) • Jan 12 '19

Thank you! I am very lucky to work in such a supportive environment. Some people don't have that and I am hoping others will share some of their stories so that everyone can realize we are all in this together and downtime is just an occupational hazard.

Molly Struve (she/her) • Jan 14 '19

Now that is a hell of a first day!

Noah Gibbs

@codefolio

@molly_struve I worked at a place where you made your dev account by just logging into the prod DB (!) and setting up your account and password by changing the table with raw typed SQL (!). I, uh, hit return slightly too early and blanked everybody's password in the table.

17:59 PM - 14 Jan 2019


MikhailShel • Mar 26 '19

I once wrote innocent-looking code to invalidate the cache on module change,
but who knew that this module was being changed in a loop on API calls, and that each cache invalidation was making a separate, non-piped UDP call to Redis.

Long story short it took down the system. It wasn't fun...

Molly Struve (she/her) • Mar 27 '19

Oooof, Redis is always tricky! We once had an engineer do flushdb on one of our Redis databases to try and fix a bug. The missing caches in the middle of the day made our site unusable for some of our bigger clients, and it was a scramble to get it fixed. We have since put in place some safety features like read-only consoles and alerts for missing caches. As long as you are learning from these experiences then they are not a waste 😊

MikhailShel • Mar 29 '19

yep, so the awesome thing with Redis is the ability to pipe commands,
so the solution to my problem was collecting the cache keys to invalidate and then, in a separate call,
making a pipe command to Redis to invalidate them in bulk.

pipe ftw :D
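The collect-then-pipe approach can be sketched as below. This is an illustration under assumptions, not the code from the story: `invalidate_in_bulk` and the batch size are hypothetical, and the in-memory `FakeRedis` exists only so the sketch runs without a server. It mirrors the `pipeline()`, `delete()` and `execute()` calls of the redis-py client, which is assumed here as the real client.

```python
def invalidate_in_bulk(client, keys, batch_size=500):
    """Delete the collected cache keys with one round trip per batch.

    Returns the number of keys that actually existed and were removed.
    """
    unique_keys = list(dict.fromkeys(keys))  # drop duplicates, keep order
    deleted = 0
    for start in range(0, len(unique_keys), batch_size):
        pipe = client.pipeline()
        for key in unique_keys[start:start + batch_size]:
            pipe.delete(key)  # queued locally, nothing is sent yet
        deleted += sum(pipe.execute())  # one round trip for the whole batch
    return deleted


class FakePipeline:
    """In-memory stand-in for a pipeline; counts round trips."""

    def __init__(self, store, stats):
        self._store = store
        self._stats = stats
        self._queued = []

    def delete(self, key):
        self._queued.append(key)
        return self

    def execute(self):
        self._stats["round_trips"] += 1
        results = [1 if self._store.pop(key, None) is not None else 0
                   for key in self._queued]
        self._queued = []
        return results


class FakeRedis:
    """In-memory stand-in for a Redis client, for demonstration only."""

    def __init__(self, data):
        self.store = dict(data)
        self.stats = {"round_trips": 0}

    def pipeline(self):
        return FakePipeline(self.store, self.stats)
```

Collecting keys during the request and calling this once at the end turns many small calls into a few batched ones, which is the fix described above.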

Bruce Axtens • Oct 7 '19

I take down production occasionally. Even last week. It doesn't help that we're so cash-strapped that we can't afford the usual test/production environment. I often debug in production. This is profoundly sub-optimal. Maybe it helps that we're Australian and so really good at doing a lot with a little. Maybe it also helps that we're ever so slightly crazy.

Ryan Smith • Jan 16 '19 • Edited

I don't have an interesting story to share, but here are some of my general tips for not breaking production. Hope some of these are helpful. I'm sure there are plenty more, feel free to share yours.

It starts with writing good code. This can mean many things depending on the person and language, but my general rules are:
- Have consistent styling or follow your team's style guide. I find this makes it easier to see when something is out of place during development.
- Keep things simple and clear in your code. When problems arise, you may not be thinking straight. If your code is too confusing and unclear, this may only compound the problem. Other developers may have a difficult time helping get things online if they cannot decipher the code.

Test locally, on a dev server, then on production.
- Do not just run automated tests or test locally, but test on a development server if available.
- Once your change is deployed, test it on production.
- Test small changes too.

Have someone review your code before going live with it.
- Have them test it.
- Make sure they actually review it and don't just give the go ahead.
- Create some guidelines around this with your team if none exist.

Don't write/run queries directly on production.
- Write and test them locally or on a dev server. After running them in a testing environment, make sure the updated data looks correct in the final product.
- If it is an update or delete statement, write a select version of the same query first. This will help ensure you are pulling in the correct data. This will also help in the next step.
- BACK-UP THE DATA. If you are unsure how, this can be a simple select statement, copied to a spreadsheet, and uploaded somewhere (as opposed to leaving _temp tables cluttering the DB).
- Again, have queries reviewed by someone before running them.
- If you are new, you should not have production database access on your first day. If you are in this situation seek out senior members of the team to verify and help run queries with you.

Do not push changes towards the end of the day or before the weekend.
- Save yourself the trouble of having to scramble to fix something during your personal time or letting the problem continue while you are out of office.
- Push things live in the morning while everyone is in the office.

Don't beat yourself up over it.
- Development is hard, every project has a lot of different things to worry about, it happens.
- Learn from your mistakes and help future developers avoid them as well.
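The "select version first" and "back up the data" tips can be sketched together. This is a minimal example under assumptions: the `users` table, its columns, and the `update_password` helper are hypothetical, and SQLite stands in for whatever database is actually in use. The SELECT uses the same WHERE clause as the UPDATE, its rows double as the backup, and the UPDATE is rolled back if it touches an unexpected number of rows.

```python
import sqlite3


def update_password(conn, username, new_hash):
    """Preview, back up, then update exactly one row, or change nothing.

    Returns the rows as they were before the update (the backup).
    """
    # 1. SELECT version of the same WHERE clause; the result is the backup.
    backup = conn.execute(
        "SELECT id, username, password_hash FROM users WHERE username = ?",
        (username,),
    ).fetchall()
    if len(backup) != 1:
        raise ValueError(f"expected 1 matching row, found {len(backup)}")

    # 2. Run the UPDATE in a transaction: commit on success, roll back on error.
    with conn:
        cursor = conn.execute(
            "UPDATE users SET password_hash = ? WHERE username = ?",
            (new_hash, username),
        )
        if cursor.rowcount != 1:
            raise RuntimeError(f"update touched {cursor.rowcount} rows")
    return backup
```

The row-count check is what catches an UPDATE whose WHERE clause is missing or wrong. In practice the returned backup rows would be saved somewhere before moving on.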

Molly Struve (she/her) • Jan 16 '19

Do not push changes towards the end of the day or before the weekend.

My cutoff for the day is 3:30pm! Unless it's an emergency, I won't merge a PR after that.

André Jacques • Aug 12 '19

I was very lucky that in my 10 years, I only once turned off a production server via SSH thinking I was on my computer's terminal. The server had IPMI so it was down for about five minutes. Now I can tell the usefulness of the prompt.

It really scares me not to have had more major problems in my career. It makes me feel like I am probably over-confident, and that when I do screw up, I will screw up big! In my defence, I read a lot of articles about good practices and ALWAYS ensure I have backups.

Molly Struve (she/her) • Aug 12 '19 • Edited

I think the best thing you can do is not be afraid of when/if you make a mistake and it leads to an outage. Rather than fear it, know in the back of your mind that it is part of the job, and when it happens, don't let it define you. Let it shape you and learn from it.

Also +1 for best practices!!!!

Sebastian G. Vinci • Jan 16 '19

On my first dev job, I was working on this sales site, and I made a change to the Thanks E-Mail, which gets automatically sent to the customer once he/she makes a purchase, and broke it, and I didn't notice.

So... for, like, 24 hours the mail didn't get sent and customers got confused, and started buying things again and again, believing the transaction hadn't worked because the mail wasn't delivered.

My team had to code a job to resend the failed E-Mails after the template was fixed, and correct the duplicated purchases customers had made.

One of my teammates got really mad, but I didn't hear anything from my manager at the time. Some reassuring words would have been nice. Now I look back and laugh, but at the time it was really awful.

Molly Struve (she/her) • Jan 16 '19

Ooof breaking background workers is always rough bc you usually don't notice it right away and when you finally do, you have a mess to clean up 😝 Been there, done that!!!

Victor Velasquez • Jan 16 '19

Your boss sounds like a nice guy! Lucky to have people like that in management! Thanks for sharing :)

Molly Struve (she/her) • Jan 16 '19

Another great one from Twitter!

Grant Horwood ⬋⬋⬋

@gbhorwood

@molly_struve 2003, editing a firewall in the uk over ssh. for safety i wrote an iptable clearing script called cleaniptables. in the cron i called it cleariptables.

locked out myself and the entire world and had to call the hosting coloc. took them 24hrs to reboot.

16:23 PM - 16 Jan 2019

Molly Struve (she/her) • Jan 15 '19

Brian Loomis

@djdarkbeat

@molly_struve Pushed staging data to production and overwrote it. Also "helped" my father set up his new Mac and carbon copy cloned the wrong direction and cloned a blank drive over his old machine and erased his memoirs, pics and quicken. No backups.

13:36 PM - 15 Jan 2019


Molly Struve (she/her) • Jan 15 '19

laen

@laen

@molly_struve I pushed a cfengine rule that copied an sshd_config to /etc on every machine I managed.

Note I didn't say *into* /etc. cfengine also noticed that. #myfirstproductionoutage

07:40 AM - 15 Jan 2019


Molly Struve (she/her) • Jan 14 '19

Some great responses this got on Twitter!

The dark air moves through the trees at night

@forestpines

@ThePracticalDev @molly_struve Needed to find all files on a webserver that linked to a specific domain. Used grep, and output the matching lines to a file. Grep found the file it was outputting into, and kept looping until the disk filled up and the server fell over...

21:10 PM - 12 Jan 2019


Erik Booij

@erikbooij

Two weeks into my first development job, it’s 5:28pm, CI turns green, I hit merge > deploy and head home. 10 min. into my tramride, messages pop up in #general on Slack asking why “My Account” on the website was completely down.

Don’t deploy and leave.

#myfirstproductionoutage twitter.com/ThePracticalDe…

21:12 PM - 12 Jan 2019

DEV Community 👩‍💻👨‍💻 @ThePracticalDev
"My First Production Outage" 😄 Everybody has one eventually { author: @molly_struve } https://t.co/uAsUcN0ygP


Molly Struve (she/her) • Jan 16 '19

ProTip: If you break production and feel bad about causing extra work for others, beer makes everything better 😃

Attila Domokos

@adomokos

If you worked for @KennaSecurity, you'd have coworkers like mine. We are looking for ppl, check out our job listing: kennasecurity.com/jobs/software-…

14:12 PM - 11 Oct 2017
