DEV Community

Tell me about a time you messed up

Ben Halpern on April 01, 2019

So I brought down the site for a little while this morning. Now I'm interested in hearing about when you messed everything up!

Jason R Tibbetts • Apr 1 '19

I forked a GitHub repo my first week at work, then had to delete it when I realized I'd forked the wrong one. Well, despite GitHub's repeated warnings, I managed to delete the source repo, not the fork. There were at least 6 open PRs in active development against it, and nobody had a full local clone that we could use to restore it. Thankfully, GitHub support was incredibly helpful and restored it.

The worst part is that some sympathetic coworkers humorously explained that this happens to all GitHub n00bs at some point. The problem was that I'd been using GH for at least 5 years at that point, and I should have known better.

Judith • Apr 2 '19

You are not alone, Jason! I have ten years' experience and LAST WEEK I merged a big commit, not realizing that another feature had been finished and merged first and conflicted with my code. Thank God for the merge tool or it would have been a sh*t show! Instead I walked away with my tail between my legs and a chance to fix my code. Moral of the story: it’s why we have versioning tools - we are all human and will make mistakes 👍😀

Aivan Monceller • Apr 2 '19

It's good to know that GitHub has backups

Jason R Tibbetts • Apr 2 '19

I wouldn’t rely on that option, though. We’re a big, visible company. Your results may vary.

savan kaneriya • Apr 27 '19

Git should give a pep talk before you try anything in CLI mode.

JeFFBlanco • Apr 1 '19 • Edited

Don't sweat it! Software wouldn't be software without bugs and some outages. It happens 😃

My most heinous incident involved multiple threads, hitting one shared API connection and, in turn, criss-crossing customer data 😳

Max Antonucci • Apr 1 '19

When starting my current job, I was doing my first git rebase. But I didn't understand the command and wound up rebasing onto the wrong branch. So my branch had dozens of extra commits that wound up getting pushed to master.

Thankfully the changes were reverted fast, but it also means I didn't see how spectacularly I screwed up the main site.

Pabi Forbes • Apr 2 '19

Almost experienced that while learning to rebase. Fortunately it got resolved quickly

Judith • Apr 2 '19

Rebase should have a consumer warning label😂😂😂

Micah Riggan • Apr 1 '19 • Edited

Once I used a PassThrough stream instead of an EventEmitter. Apparently PassThrough streams retain some state as the data goes through them, and so eventually it caused a memory leak.

This memory leak was in a multi-machine process, which led to two processes thinking they were responsible for updating the database.

That caused mongo queries from the affected nodes to randomly execute after about an hour of being slammed from lack of memory.

Eventually the affected node would restart and start to function normally. Which was worse, because it allowed this problem to go unnoticed, until some users started reporting duplicate or corrupted data.

Stuff of nightmares.

Todd Stark II • Apr 1 '19

Last night I forgot to stop all my docker containers before running yum update. Now all of my containers are corrupted. Yay!

Ian Knighton • Apr 1 '19

I was working on writing a shell script to delete some files that were installed in the Applications directory on a Mac. In my mind, I had run it through command line so many times that I had just forgotten to fully write out the path.

So I finish up, make it fully executable, and bam! it erases my entire application folder.

Thank goodness for Time Machine.
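
A guard of the kind that would catch a half-written path can be sketched as follows. This is only an illustration, assuming the script builds its target from a base directory plus a relative path; the function name `resolve_delete_target` and the example paths are hypothetical, and the check is purely lexical (it does not touch the filesystem or follow symlinks).

```python
from pathlib import PurePosixPath


def resolve_delete_target(base_dir, relative_path):
    """Return base_dir/relative_path, refusing anything not strictly inside base_dir."""
    base = PurePosixPath(base_dir)
    target = base / relative_path
    # Reject parent-directory hops, which could climb out of base_dir.
    if ".." in target.parts:
        raise ValueError(f"refusing path with '..': {target}")
    # Reject the base directory itself (empty relative path) and
    # anything that does not live underneath it (e.g. an absolute path).
    if target == base or base not in target.parents:
        raise ValueError(f"refusing to delete outside or at {base}: {target}")
    return target


# Hypothetical usage: a fully written path passes, a forgotten one does not.
ok_target = resolve_delete_target("/Applications", "Foo.app/Contents")
```

An empty or missing relative path raises instead of silently resolving to the whole base directory.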

Sam Myers • Apr 2 '19

Fat-fingered a command and failed to read the diff carefully. Wiped out the authentication system for the Kubernetes cluster my team had spent weeks building.

Fortunately, everything was in Terraform, so we only lost a few hours of work and gained a lot of confidence in the reliability of what we were working on.

The experience made me really sit down and think about the nature of professional integrity. The incident occurred near the end of my work-day and I wanted to just pretend nothing had happened. Maybe my login just so happened to stop working at the exact same time I ran something stupid... It would have been so easy.

But I was the only one who had the logs to figure out how I'd screwed up. I made a full write-up on what I had done, what recovery steps I had attempted, and what our options were. I felt terrible about it and apologized to my coworkers and the client. No one was upset; shit happens.

Judith • Apr 2 '19

Wow! Great work! If I were your boss I’d promote you for your work ethic alone; not to mention your honesty and courage. 👍

Kevin McKenna • Apr 2 '19 • Edited

The one that springs to mind for me was a quick and dirty update against a single entry in the database.. check the SQL, run the code, 1000000+ entries updated.

I had the WHERE clause on its own line and somehow had only the rest highlighted when I hit F5 to run it.

The panic and 'Oh nooooooooo!' was horrifying.

Luckily I had my transaction log backups in place and was able to get all but about the most recent 5 minutes worth of data back in quickly, and very thankfully before most of the end users had logged in for the day. Make a habit of checking your backups!
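
One common defence against a missing WHERE clause is to run the update inside a transaction and commit only if the affected row count matches what you expected. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely for illustration; the table name `entries` and the helper `guarded_update` are assumptions, not anything from the incident described above.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run an UPDATE and commit only if it touched exactly expected_rows rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            # Too many (or too few) rows affected: undo the change.
            conn.rollback()
            return False
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise


# Hypothetical demo data: five rows, all with status 'old'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO entries (status) VALUES (?)", [("old",)] * 5)
conn.commit()

# The WHERE clause was "not highlighted": every row would change, so it is rolled back.
ok_without_where = guarded_update(
    conn, "UPDATE entries SET status = ?", ("new",), expected_rows=1
)
# The intended statement touches one row and is committed.
ok_with_where = guarded_update(
    conn, "UPDATE entries SET status = ? WHERE id = ?", ("new", 3), expected_rows=1
)
```

The same idea applies in an interactive SQL client: start an explicit transaction, look at the reported row count, and only then commit.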

Alan Hylands • Apr 2 '19

Sounds VERY familiar. Lost count of the number of "heart dropping through the floor" moments I've had over the years doing something similar :-D

Karl N. Redman • Apr 10 '19 • Edited

I had just started as a Perl developer for Wolfram Research when Wolfram|Alpha (a sort of curated search engine) was launching. Because I had developed and automated their weather updates I was included in their launch team to be broadcast on Twitch. However, once we were nearly ready to launch I discovered that none of the developers had wifi connectivity in the building we were launching from (it's a development launch and none of the developers could develop), so I pressed the admin team to get the wifi working.

My insistence caught the notice of Stephen Wolfram and the Director of Development. The Director, Peter O., asked me to oversee any and all potential hacks (DDoS, etc.) that might be trying to interfere with our launch. I reluctantly took on the position for the launch (it was a 27 hour day in total).

Smack dab in the middle of the launch, Stephen Wolfram was streaming our network input numbers when I realized that we were getting hit with an extreme number of search queries from a (seemingly) foreign range of IP addresses that could start bringing down our in-house cluster. I hit the panic button and shut down all network traffic until I could block those IPs at our router.

As it turns out, the web group had been running tests from an outside cluster to drive up our numbers during the presentation. And no one had told me.... So, Stephen Wolfram is standing there, showing the numbers, and there's a huge drop in traffic..... derp, that was me.

I turned everything back on in about 30 seconds but it didn't go unnoticed.

Karl N. Redman • Apr 11 '19

Stephen Wolfram video where the crash happens: the launch video is no longer available. The crash acknowledgement happens at 10:15.

derp.

I think, at the time, 3 million people saw this happen.....

Sam Leibowitz • May 25 '19

Beautiful. :D

In a former life when I was doing network administration for a university's school of business, I got a phone call from another network admin demanding that I give him the contact information connected to a specific IP address that was "attacking" their system. He figured that a university kid was trying to brute force a password to their FTP service.

Turns out that it was a company being hosted by our business incubator. The angry admin's company had hired them to redo their website, so they were trying to FTP into the web server. Only the password was wrong, and their stupid software kept trying to reconnect.

That guy threatened to sic his lawyers on me. I told my boss, and to his credit, he shrugged and said, "that's fine. Here's the contact info for our lawyers. You can tell him to give it to his lawyers."

Björn Grunde • Dec 6 '19

I was working on a fairly easy, basic feature of an economy system. The only advanced part was a really old legacy module that had two functions with really similar, bad names, something like cr_EDC_use_ETM20 and cr_EDC_use_ET20. My IDE autocompleted the wrong function and I managed to send several invoices to over 60k clients. Invoices with an infinite amount to pay. Luckily we had rollback systems, so clients never noticed. Later our old-school legacy programmer refactored the module with names that made sense to the younger generation :P

Markus Siering • Apr 1 '19

Three weeks into my new job, I deleted the marketing website sidebar including various signup widgets. Was not aware of it being used on every page, did not need it on the one I was editing and went „Nah, let‘s throw this out!“ There was no undo function for this. Luckily, a colleague noticed quite fast and she was able to insert the content again quickly 😅

Kim Davis • Apr 1 '19

I once took a good portion of a large site down for over 24 hours by clearing the whole site's cache. So, it could've been worse ;).

Mark O' Donnell • Sep 10 '19 • Edited

In my first job I managed a firewall configuration app for a large telco. It allowed Fortune 500 companies to make firewall changes on their network. Can’t remember the exact problem, but long story short, I changed something which stopped anyone logging in at all. No one could make any changes for 2 days until we found the issue. 😳

This was particularly bad because it affected bonuses that year, whooops! 😂😂

JReca • Apr 2 '19

I manually updated the store database with a SQL file. I messed up the field order and basically had stock and price reversed. For a few minutes we appeared to have lots of very cheap products when we really had a few expensive ones. At least no one bought anything.

Kevin McKenna • Apr 2 '19

The worst part about that kind of mistake is the sudden 'oh no!' that sweeps over you as you realize the implications of the mistake

kd2718 • Apr 1 '19

The site I was working on for a real estate company was having issues when users requested all photos for a house. The server would fetch all the photos and add them to a zip for download. This was a bad idea from the start as there would often be over 100 high quality images.

The worst part was this was a synchronous task. The users would stare at a blank browser until the request finished or timed out. They had me make this async with celery (python). Even though the whole process was bad, we settled on this solution as a "quick fix". Celery was already used in other parts of the site.

I made the changes and deployed them near the end of the day. It worked fine. The next morning I was woken up by an emergency phone call. Most of the site was no longer working.

I had forgotten to disable the download button when a zip job was in the queue. Apparently people were mashing the button expecting the old behavior and there were a massive amount of jobs backed up in the celery queue. Anything else using celery was basically broken on the site.

I had to use git to roll back to the previous version of the site. I felt horrible. I was chewed out by the owners and told that they lost "millions". I guess it wasn't that bad because they kept using us for a while after that...
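
The missing safeguard here, refusing to queue a second zip job while one is already pending for the same listing, can be sketched in a few lines. This is a plain in-memory stand-in, not Celery's API; the class name `ZipJobQueue` and the listing id are hypothetical, and a real multi-process deployment would need the pending set in a shared store rather than in a Python object.

```python
from collections import deque


class ZipJobQueue:
    """Toy job queue that ignores duplicate requests for a listing already queued."""

    def __init__(self):
        self._jobs = deque()
        self._pending = set()

    def enqueue(self, listing_id):
        # A repeated click while the job is pending is ignored.
        if listing_id in self._pending:
            return False
        self._pending.add(listing_id)
        self._jobs.append(listing_id)
        return True

    def finish_next(self):
        # Simulates a worker completing the oldest job and freeing its slot.
        listing_id = self._jobs.popleft()
        self._pending.discard(listing_id)
        return listing_id

    def __len__(self):
        return len(self._jobs)


# Hypothetical usage: a user mashes the download button ten times.
queue = ZipJobQueue()
results = [queue.enqueue("house-42") for _ in range(10)]
```

Only the first request is accepted, so the queue holds one job instead of ten. Disabling the button in the UI addresses the same problem from the other side.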

Alan Hylands • Apr 2 '19

Damn right. It's not about never making mistakes, it's all about how you respond and get the problem fixed with as little collateral damage as possible. Then you can laugh about it later (and repeat ad nauseam for the rest of your tech career!)

Dian Fay • Apr 1 '19

Last week I set up a load balancer to automatically forward http to https for a particular application. That part went fine. What didn't is that we use Slack's slash commands extensively to interact with the application, and I forgot I'd set those up before I'd even gotten an SSL cert. Slack does not like getting a 301 response. They were all broken for hours for our people in Europe before I woke up and figured out what had happened.

Cameron Wilby • Jan 6 '20

We recently made a switch from environment foo to environment bar. After the migration to bar was completed, I assumed we could take foo down. For reasons, foo should have stayed up and I had to spin up foo within 15 minutes. I managed to do it, but that was not a fun time.

Takeaway was the situation could have been avoided if I had just asked "Hey, do we still need foo?".

Deepu K Sasidharan • Jul 18 '19

Well, I stopped keeping count. But the best would be me dropping a critical production database (yes, I didn't take a backup) while running a migration script and then reconstructing the missing data from logs. Before you judge me, please note that I had been working 36 hours straight to meet a production deadline, and this is why no one should work like that.

Zane Milakovic • Jan 13 '20

I used to be a game developer.

One time I corrupted the external hard drive by mistake. It ruined months of work that was unrecoverable.

This was a four person team project, we all failed the final.

The mistake was not backing it up. I had also dropped the hard drive, which caused the corruption. Obviously this was an accident, and it was my personal drive. But I was also very bad about sharing the work in progress, thinking I could wow people. As a result, my teammates didn't have a copy that was anywhere near as current.

Fred Richards • Jul 12 '19

Years ago I moved from being a tech to a sys admin ... and one of my first responsibilities was swapping out the backup tapes in the pop. There's a bunch of story in the middle but the punchline ends with accidentally dropping a few important tapes down several flights of stairs.

Bruce Axtens • Aug 9 '19

I was learning JavaScript. I broke a website. The company I was consulting for (a Google Ads reseller) lost that client.

That was back in 2011 and I'm still learning JavaScript (and still consulting for the same company.)

Forgiveness is hard to find, but boy, when you find it!!!

OmkarShirodkar • Sep 30 '19 • Edited

New job. First time monitoring a Unix shell scripting job. Script errors out with memory issues. Go in the respective folder, find a couple of files ~ 2GB big, delete them. Re-run the script, it fails again. This time looking for the files yours truly deleted. Had to wake up the SME to fix the issue.

Giovana Morais • Apr 2 '19

Once I changed about 20,000 sample labels in the production database at my internship because I forgot a parenthesis. I had to stay until 9 PM to fix everything, and thankfully my coworker had a backup from two or three weeks before that.

Steve Boyd • Apr 1 '19

I once took down a stock ordering service of the company I was working for at the time, and couldn't roll back the changes because backups were manual and I'd taken an old copy from the wrong server. Took half a day for me and 2 other developers to figure out what had gone wrong and how to fix it. It didn't help we were required to do them at 3 am.

Needless to say, I've learned to triple check the deployment plans before I'm sleep deprived.

Desi • Apr 1 '19

I don’t have an example (yet) but I was thinking how this goof could help others - especially newbies/beginners - realize they are definitely not alone when their first outage happens!

Vico • Apr 2 '19

I deleted an unused domain in cPanel that didn't have a website stored on it, but I failed to realize that the domain had 10+ registered client email accounts. Fortunately, when I put the domain back, the email and data were restored.

Phil Nash • Apr 2 '19

I once deleted the code for the most important page of an application a week before it was supposed to be launched. The application was built in a framework that handled its own version control (yeah). Except this part of the application had been built with a new feature that didn't have version control yet (yeah...).

It had taken weeks to get this page to its current state and I ended up rewriting it entirely within an afternoon out of sheer desperation. Sadly, it was probably better the second time. The launch went ahead on time at least.

Robin Kretzschmar • Apr 2 '19

During my training I had the task of writing an update job to correct data. It was to be done in SQL because that was more performant than going via the client application (which can also run some sort of coded jobs). So I happily started writing my script and hit F5 to test it on the customer's TEST environment. After 10 minutes of execution I noticed that the dropdown in SQL Management Studio had the PROD database selected and my job was messing up production data...
Was no fun at all. Even though I cancelled it immediately, the damage was huge 😅

Joe Hobot • Apr 2 '19

Copied the wrong version of a DLL to prod about 12 years ago, then troubleshot a production issue for 8 hours to find out it was just that f*** DLL that was bad. The devs could not find out what was wrong either. Since then I hate anything that is .exe, .dll, etc.

Btw I heard about the site being down for a few hours; hope you have post-mortems? :) Like how fast you can recover and stuff. Sometimes user errors are good because you learn so much from them.

Jeremy Grifski • Apr 2 '19

This week I tried to load Linux on a thumb drive, so I could play around in a different environment. For whatever reason, I thought it would be simple to just launch Linux and use the same hard drives I use for Windows. Unfortunately, one of my drives was actually three drives, and Linux didn't understand how to open it. I made the mistake of Googling some solutions, naively ran a couple commands, and ultimately corrupted the drives.

After fighting with the drives for a couple hours, I gave up, formatted them, and restored them using one of my backups. On the bright side, my backups work! hahaha

Matthew Daly • May 2 '19

My main role is maintaining a legacy intranet system for a high street bank in the UK. As it's built with Zend 1, modern concepts like environment variables for config weren't integrated into its design.

Last year I was testing a scheduled task that sends users email notifications when the resources they posted on the site are coming up to their expiry dates. I had changed the config to point at Mailtrap, then I reverted some changes, and re-ran the script - now using the live settings. Fifty-odd users got an email about their resources expiring.

Needless to say, the next thing I did was to move the mail config for development to DotEnv so it wouldn't happen again...
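
The idea of keeping mail settings out of the code and defaulting to something harmless can be sketched like this. It is a minimal illustration using plain environment variables, not the DotEnv setup mentioned above; the variable names `APP_ENV` and `MAIL_HOST` and the sandbox host are assumptions.

```python
import os

# Hypothetical non-routable placeholder used whenever nothing else is configured.
SANDBOX_HOST = "mail-sandbox.invalid"


def mail_host(environ=None):
    """Pick the SMTP host from the environment, defaulting to a sandbox."""
    env = os.environ if environ is None else environ
    if env.get("APP_ENV") == "production":
        # In production the host must be set explicitly; a missing value
        # raises KeyError instead of silently falling back.
        return env["MAIL_HOST"]
    # Anywhere else, an unset host means the sandbox, never the live server.
    return env.get("MAIL_HOST", SANDBOX_HOST)


# Hypothetical usage with explicit dictionaries instead of the real environment.
dev_host = mail_host({})
prod_host = mail_host({"APP_ENV": "production", "MAIL_HOST": "smtp.example.com"})
```

With this shape, reverting code changes cannot switch a development run back to live mail settings, because the live host never lives in the code.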

Randie Miller • May 16 '19

Working at a large insurance company (largest in the US). Needed to restore 1 database to the previous night's image. Decided to restore the entire set of databases from the last backup to avoid bad references. Databases are scheduled to be backed up daily, and we didn't check the last backup date. That night many jobs failed because drivers in Missouri were missing. The last backup of the driver data was almost 30 days old. Oops! Lost almost 30 days of driver input for Missouri. Took a few weeks to cobble together most of the missing data. Wasn't my fault that it wasn't backed up, but we should have checked the backup date before the restore.
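
The check that was skipped, comparing each backup's timestamp against the expected schedule before restoring, is small enough to sketch. The database names and dates below are made up for illustration, and the one-day threshold is an assumption based on the daily schedule mentioned above.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup, now, max_age=timedelta(days=1)):
    """Return the names of databases whose newest backup is older than max_age."""
    return sorted(
        name for name, taken_at in last_backup.items() if now - taken_at > max_age
    )


# Hypothetical inventory: one backup from last night, one almost 30 days old.
now = datetime(2019, 5, 16, 8, 0)
last_backup = {
    "policies": datetime(2019, 5, 16, 1, 0),
    "drivers_mo": datetime(2019, 4, 17, 1, 0),
}
stale = stale_backups(last_backup, now)
```

A restore script could refuse to proceed, or at least ask for confirmation, whenever this list is not empty.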

Aaron Kulbe • May 24 '19

I was using PowerCLI to revert snapshots for a block of VMs that I was developing automation for. I use TextExpander to do that snippet, with fill-in fields for the hostname. Wellllll... I got in a rush and executed the snippet with the hostname field blank. That had the effect of running the action on EVERY. SINGLE. VM. that contained a snapshot. I caught it quickly... but not before I had reverted the snapshot for the VM that contained our production instance of Confluence. 😳

Thankfully, the new Atlassian architect had things configured well. We lost about an hour's worth of data, but that's it. We were back online in about 15 minutes.
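
A blank filter that quietly matches everything is easy to guard against before any destructive action runs. The sketch below is generic Python, not PowerCLI; the function name `select_vms` and the VM names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_vms(vm_names, pattern):
    """Return the VM names matching pattern, rejecting blank or wildcard-only patterns."""
    # An empty field, whitespace, or a bare "*" would select every VM.
    if not pattern or not pattern.strip("*? "):
        raise ValueError("refusing to act on a blank or match-all pattern")
    return [name for name in vm_names if fnmatchcase(name, pattern)]


# Hypothetical inventory.
vm_names = ["dev-01", "dev-02", "confluence-prod"]
selected = select_vms(vm_names, "dev-*")
```

Calling `select_vms(vm_names, "")` raises instead of returning the whole inventory, so a snippet run with an empty fill-in field fails loudly.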

Sam Leibowitz • May 25 '19

Many years ago, I was a systems admin who was tasked with maintaining (among other things) an email server. The server was running Debian Linux for its OS, and Exim 3 for mail. Like an idiot, I figured that upgrading to Exim 4 would be no biggie, despite the many warnings that accompanied the release, and blithely went and apt-got it.

A couple hours of panicked hackery later and I'd finally managed to restore service with Exim 3. Eventually, I got a very nice setup going on the cheap with virtual users in a MySQL database, on Exim 4. But I never skipped past a warning like that again without doing my due diligence.

Nucu Labs • Apr 1 '19

Today, I forgot to turn off a service which was indexing data in Elasticsearch and ran a command on my machine that was also indexing data, causing the system to go into a corrupt state. I had to turn off the service, re-import the data into Elasticsearch from Mongo, and then re-index all of it. Luckily it didn't take that long.

Ali AslRousta • Apr 3 '19

Pretty recently, I was working on a microservices-based project which required an API gateway. Despite the wide range of options out there, I insisted on writing our own gateway from scratch. It went disastrously, wasting about a month before I went back to a solution based on OpenResty.

Sebastian Vargr • Oct 21 '19

I've brought down live sites numerous times, messed up the master branch, ended up in merge conflict hell, grossly underestimated things, sent out e-mails to thousands of customers because I was using the production DB for development, and all kinds of other sheit.

I've only messed up once or twice so badly that a backup had to be used, at the start of my career, tho.

Developers should never test their own stuff or have access to production directly.

It always ends in chaos at some point in my experience. :)
Usually fixable chaos, but still.

Saurabh Daware 🌻 • Sep 30 '19

In my internship, while copy-pasting code from another component, I also copy-pasted Google Analytics code in which one of the variables was undefined.
It went unnoticed (since it was just part of analytics so it didn't exactly break anything 🤷) and then they had a bunch of undefined hits in their production analytics!

Kody James Ague • Nov 8 '19

I currently work in road design. Most of my career has been electrical building system design. So when I was new to road design I labeled a bunch of intersections with 20' radii. By a bunch I mean the whole project of 20+ intersections so over 80 corners, roughly. The problem is that they weren't actually drawn at 20', they were all drawn differently because I didn't know any better. Some were only 7', 13' etc. So when it was time to construct the curbs on each corner, nothing lined up. So now instead of a nice new consistent result these intersections all have random and different radii. It was embarrassing to say the least and I never was invited to visit the project during construction which I was thankful for!

Mike Simons • Apr 7 '19

My first big project was to hash / encrypt customer passwords (this was 13 years ago). Worked for an ISP. Half way through the rollout customer support started reporting an increase in connection issue tickets. Clients failing to authenticate.

Turns out I'd string trimmed binary strings thus truncating any with NUL bytes. 50k users affected including their RADIUS records (which were used to authenticate their ADSL connection).

I will always be thankful to the senior engineer I was paired with for the rollout who was cool as a cucumber and helped me spot the issue... I was bricking it!
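
The failure mode described, a text-oriented trim eating bytes from binary data, can be reproduced in a few lines. This is an illustrative sketch only: the byte values are made up, and the trim set (whitespace plus NUL) is an assumption about what the original string function removed.

```python
# Assumed trim set: ASCII whitespace plus the NUL byte.
TRIM_CHARS = b" \t\n\r\x00\x0b"

# Made-up 8-byte binary value that happens to start and end with NUL.
raw_value = bytes.fromhex("00a1b2c3d4e5f600")

# Trimming it as if it were text silently drops the leading and trailing NUL.
trimmed = raw_value.strip(TRIM_CHARS)

# Encoding to hex first gives a plain-text form that survives trimming.
encoded = raw_value.hex()
restored = bytes.fromhex(encoded.strip())
```

Here `trimmed` is two bytes shorter than `raw_value`, so it no longer matches what was originally stored, while the hex round trip returns the original bytes unchanged.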

Haider Ali • Apr 2 '19

I remember once we were going to have a meeting with a client at 6 pm to show him the website. I was making some changes to the CSS files directly in cPanel and, without knowing exactly what I had done, deleted the main CSS file, which contained 80% of the website's styling. That wasn't even the worst of it: when my manager went to the meeting he didn't know about it, and neither did I, until he cleared his cache and found that the whole website was messed up. It was a panic moment. He ended up canceling the meeting and I almost lost my job.