So I brought down the site for a little while this morning. Now I'm interested in hearing about when you messed everything up!
I forked a GitHub repo my first week at work, then had to delete it when I realized I'd forked the wrong one. Well, despite GitHub's repeated warnings, I managed to delete the source repo, not the fork. There were at least 6 open PRs in active development against it, and nobody had a full local clone that we could use to restore it. Thankfully, GitHub support was incredibly helpful and restored it.
The worst part is that some sympathetic coworkers humorously explained that this happens to all GitHub n00bs at some point. The problem was that I'd been using GH for at least 5 years at that point, and I should have known better.
You are not alone Jason! I have ten years' experience and LAST WEEK I merged a big commit, not realizing another feature had been finished first and already merged, and that it conflicted with my code. Thank God for the merge tool or it would have been a sh*t show! Instead I walked away with my tail between my legs and a chance to fix my code. Moral of the story: it’s why we have versioning tools. We are all human and will make mistakes 👍😀
Good to know that GitHub has backups.
I wouldn’t rely on that option, though. We’re a big, visible company. Your results may vary.
Git should give you a pep talk before you try anything in CLI mode.
Don't sweat it! Software wouldn't be software without bugs and some outages. It happens 😃
My most heinous incident involved multiple threads, hitting one shared API connection and, in turn, criss-crossing customer data 😳
When starting my current job, I was doing my first git rebase. But I didn't understand the command and wound up rebasing off the wrong branch. So my branch had dozens of extra commits that wound up getting pushed to master.
Thankfully the changes were reverted fast, but it also means I didn't see how spectacularly I screwed up the main site.
Rebase should have a consumer warning label 😂😂😂
Almost experienced that while learning to rebase. Fortunately it got resolved quickly
Last night I forgot to stop all my docker containers before running yum update. Now all of my containers are corrupted. Yay!
I was working on writing a shell script to delete some files that were installed in the Applications directory on a Mac. In my mind, I had run it through command line so many times that I had just forgotten to fully write out the path.
So I finish up, make it fully executable, and bam! it erases my entire application folder.
Thank goodness for Time Machine.
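For anyone writing a similar cleanup script, here is a minimal sketch of a guard that might have caught this. It is written in Python rather than shell, and `safe_remove` and the directory names are made up for illustration; the idea is just to refuse anything that does not resolve to a path strictly inside the folder you meant to clean.

```python
import shutil
import tempfile
from pathlib import Path


def safe_remove(target, allowed_root):
    """Delete target only if it sits strictly inside allowed_root."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    # A half-written target that resolves to the root itself, or to somewhere
    # outside the root, is rejected instead of deleted.
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not strictly inside {root}")
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()


# Demo in a throwaway directory standing in for the real Applications folder.
with tempfile.TemporaryDirectory() as tmp:
    apps = Path(tmp) / "Applications"
    (apps / "Old.app").mkdir(parents=True)
    safe_remove(apps / "Old.app", apps)  # removes just the one bundle
    try:
        safe_remove(apps, apps)  # the "forgot the rest of the path" case
    except ValueError as err:
        print(err)
```

It does not replace backups, but it turns the worst case into an error message.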
Once I used a PassThrough stream instead of an EventEmitter. Apparently PassThrough streams retain some state as the data goes through them, and so eventually it caused a memory leak.
This memory leak was in a multi-machine process, which led to two processes thinking they were responsible for updating the database.
That caused mongo queries from the affected nodes to randomly execute after about an hour of being slammed from lack of memory.
Eventually the affected node would restart and start to function normally. Which was worse, because it allowed this problem to go unnoticed, until some users started reporting duplicate or corrupted data.
Stuff of nightmares.
Fat-fingered a command and failed to read the diff carefully. Wiped out the authentication system for the Kubernetes cluster my team had spent weeks building.
Fortunately, everything was in Terraform, so we only lost a few hours of work and gained a lot of confidence in the reliability of what we were working on.
The experience made me really sit down and think about the nature of professional integrity. The incident occurred near the end of my work-day and I wanted to just pretend nothing had happened. Maybe my login just so happened to stop working at the exact same time I ran something stupid... It would have been so easy.
But I was the only one who had the logs to figure out how I'd screwed up. I made a full write-up on what I had done, what recovery steps I had attempted, and what our options were. I felt terrible about it and apologized to my coworkers and the client. No one was upset; shit happens.
Wow! Great work! If I were your boss I’d promote you for your work ethic alone; not to mention your honesty and courage. 👍
The one that springs to mind for me was a quick and dirty update against a single entry in the database... check the SQL, run the code, 1,000,000+ entries updated.
I had the WHERE clause on its own line and somehow had only the rest highlighted when I hit F5 to run it.
The panic and 'Oh nooooooooo!' was horrifying.
Luckily I had my transaction log backups in place and was able to get all but about the most recent 5 minutes worth of data back in quickly, and very thankfully before most of the end users had logged in for the day. Make a habit of checking your backups!
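One habit that helps with this kind of thing is running one-off updates inside a transaction and checking the affected row count before committing. A rough sketch, using Python's built-in `sqlite3` as a stand-in for whatever database you actually run; `guarded_update` and the `entries` table are hypothetical names for illustration.

```python
import sqlite3


def guarded_update(conn, sql, params=(), expected_rows=1):
    """Run an UPDATE and roll it back unless it touched exactly expected_rows."""
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(
            f"expected {expected_rows} row(s), statement touched {cur.rowcount}; rolled back"
        )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO entries (status) VALUES (?)", [("old",)] * 5)
conn.commit()

# With the WHERE clause the update goes through.
guarded_update(conn, "UPDATE entries SET status = 'new' WHERE id = ?", (1,))

# Without it, every row matches, so the change is rolled back.
try:
    guarded_update(conn, "UPDATE entries SET status = 'new'")
except RuntimeError as err:
    print(err)
```

The same idea works by hand in most SQL clients: begin a transaction, run the statement, read the row count, and only then commit.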
Sounds VERY familiar. Lost count of the number of "heart dropping through the floor" moments I've had over the years doing something similar :-D
Three weeks into my new job, I deleted the marketing website sidebar including various signup widgets. Was not aware of it being used on every page, did not need it on the one I was editing and went "Nah, let's throw this out!" There was no undo function for this. Luckily, a colleague noticed quite fast and she was able to insert the content again quickly 😅
I had just started as a Perl developer for Wolfram Research when Wolfram|Alpha (a sort of curated search engine) was launching. Because I had developed and automated their weather updates, I was included in the launch team that was going to be broadcast on twitch. However, once we were nearly ready to launch, I discovered that none of the developers had wifi connectivity in the building we were launching from (it's a development launch and none of the developers could develop), so I pressed the admin team to get the wifi working.
My insistence caught the notice of Stephen Wolfram and the Director of Development. The Director, Peter O., asked me to watch for any and all potential attacks (DDoS, etc.) that might interfere with our launch. I reluctantly took on the position for the launch (it was a 27-hour day in total).
Smack dab in the middle of the launch, Stephen Wolfram was streaming our incoming network traffic numbers when I realized that we were getting hit with an extreme number of search queries from a (seemingly) foreign range of IP addresses that could start bringing down our in-house cluster. I hit the panic button and shut down all network traffic until I could block those IPs at our router.
As it turns out, the web group had been running tests from an outside cluster to drive up our numbers during the presentation. And no one had told me... So Stephen Wolfram is standing there, showing the numbers, and there's a huge drop in traffic... derp, that was me.
I turned everything back on in about 30 seconds but it didn't go unnoticed.
Stephen Wolfram video where the crash happens: the launch video is no longer available. The crash acknowledgement happens at 10:15.
I think, at the time, 3 million people saw this happen.....
In a former life when I was doing network administration for a university's school of business, I got a phone call from another network admin demanding that I give him the contact information connected to a specific IP address that was "attacking" their system. He figured that a university kid was trying to brute force a password to their FTP service.
Turns out that it was a company being hosted by our business incubator. The angry admin's company had hired them to redo their website, so they were trying to FTP into the web server. Only the password was wrong, and their stupid software kept trying to reconnect.
That guy threatened to sic his lawyers on me. I told my boss, and to his credit, he shrugged and said, "that's fine. Here's the contact info for our lawyers. You can tell him to give it to his lawyers."
I once took a good portion of a large site down for over 24 hours by clearing the whole site's cache. So, it could've been worse ;).
In my first job I managed a firewall configuration app for a large telco. It allowed Fortune 500 companies to make firewall changes on their network. Can’t remember the exact problem, but long story short, I changed something which stopped anyone logging in at all. No one could make any changes for 2 days until we found the issue. 😳
This was particularly bad because it affected bonuses that year, whooops! 😂😂
The site I was working on for a real estate company was having issues when users requested all photos for a house. The server would fetch all the photos and add them to a zip for download. This was a bad idea from the start as there would often be over 100 high quality images.
The worst part was this was a synchronous task. The users would stare at a blank browser until the request finished or timed out. They had me make this async with celery (python). Even though the whole process was bad, we settled on this solution as a "quick fix". Celery was already used in other parts of the site.
I made the changes and deployed them near the end of the day. It worked fine. The next morning I was woken up by an emergency phone call. Most of the site was no longer working.
I had forgotten to disable the download button when a zip job was in the queue. Apparently people were mashing the button expecting the old behavior, and there was a massive number of jobs backed up in the Celery queue. Anything else using Celery was basically broken on the site.
I had to use git to roll back to the previous version of the site. I felt horrible. I was chewed out by the owners and told that they lost "millions". I guess it wasn't that bad because they kept using us for a while after that...
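Disabling the button helps, but a server-side guard is what actually protects the queue. A minimal sketch of the idea, with no Celery-specific calls: `PendingJobs`, the `listing-42` key, and the in-memory set are all assumptions for illustration, and a real multi-process deployment would presumably keep the pending keys in a shared store instead.

```python
import threading


class PendingJobs:
    """Remember which jobs are already queued so repeat clicks become no-ops."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def try_enqueue(self, key, enqueue):
        """Call enqueue(key) unless a job with the same key is still pending."""
        with self._lock:
            if key in self._pending:
                return False
            self._pending.add(key)
        try:
            enqueue(key)
        except Exception:
            self.mark_done(key)  # do not leave the key stuck if queuing failed
            raise
        return True

    def mark_done(self, key):
        """Call when the job finishes so the next request can queue again."""
        with self._lock:
            self._pending.discard(key)


queue = []  # stand-in for the real task queue
jobs = PendingJobs()
for _ in range(10):  # an impatient user mashing the download button
    jobs.try_enqueue("listing-42", queue.append)
print(len(queue))  # 1
```

The worker would call `mark_done` when the zip is ready, so the same listing can be requested again later.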
I manually updated the store database with a SQL file. I messed up the field order, and basically had stock and price reversed. For a few minutes we listed a lot of very cheap products when we really had a few expensive ones. At least no one bought anything.
The worst part about that kind of mistake is the sudden 'oh no!' that sweeps over you as you realize the implications of the mistake
Last week I set up a load balancer to automatically forward http to https for a particular application. That part went fine. What didn't is that we use Slack's slash commands extensively to interact with the application, and I forgot I'd set those up before I'd even gotten an SSL cert. Slack does not like getting a 301 response. They were all broken for hours for our people in Europe before I woke up and figured out what had happened.
New job. First time monitoring a Unix shell scripting job. Script errors out with memory issues. Go in the respective folder, find a couple of files ~ 2GB big, delete them. Re-run the script, it fails again. This time looking for the files yours truly deleted. Had to wake up the SME to fix the issue.
Forgiveness is hard to find, but boy, when you find it!!!
Well, I stopped keeping count. But the best would be me dropping a critical production database (yes, I didn't take a backup) while running a migration script and then reconstructing the missing data from logs. Before you judge me, please note that I had been working 36 hours straight to meet a production deadline, and this is why no one should work like that.
Years ago I moved from being a tech to a sys admin ... and one of my first responsibilities was swapping out the backup tapes in the pop. There's a bunch of story in the middle but the punchline ends with accidentally dropping a few important tapes down several flights of stairs.
Many years ago, I was a systems admin who was tasked with maintaining (among other things) an email server. The server was running Debian Linux for its OS, and Exim 3 for mail. Like an idiot, I figured that upgrading to Exim 4 would be no biggie, despite the many warnings that accompanied the release, and blithely went and apt-got it.
A couple hours of panicked hackery later and I'd finally managed to restore service with Exim 3. Eventually, I got a very nice setup going on the cheap with virtual users in a MySQL database, on Exim 4. But I never skipped past a warning like that again without doing my due diligence.
I was using PowerCLI to revert snapshots for a block of VMs that I was developing automation for. I use TextExpander to do that snippet, with fill-in fields for the hostname. Wellllll... I got in a rush and executed the snippet with the hostname field blank. That had the effect of running the action on EVERY. SINGLE. VM. that contained a snapshot. I caught it quickly... but not before I had reverted the snapshot for the VM that contained our production instance of Confluence. 😳
Thankfully, the new Atlassian architect had things configured well. We lost about an hour's worth of data, but that's it. We were back online in about 15 minutes.
Working at a large insurance company (largest in the US). Needed to restore one database to the previous night's image. Decided to restore the entire set of databases from the last backup to avoid bad references. Databases are scheduled to be backed up daily, and we didn't check the last backup date. That night many jobs failed because drivers in Missouri were missing. The last backup of driver data was almost 30 days old. Oops! Lost almost 30 days of driver input for Missouri. Took a few weeks to cobble together most of the missing data. It wasn't my fault that it wasn't backed up, but we should have checked the backup date before the restore.
My main role is maintaining a legacy intranet system for a high street bank in the UK. As it's built with Zend 1, modern concepts like environment variables for config weren't integrated into its design.
Last year I was testing a scheduled task that sends users email notifications when the resources they posted on the site are coming up to their expiry dates. I had changed the config to point at Mailtrap, then I reverted some changes, and re-ran the script - now using the live settings. Fifty-odd users got an email about their resources expiring.
Needless to say, the next thing I did was to move the mail config for development to DotEnv so it wouldn't happen again...
Today, I forgot to turn off a service which was indexing data in Elasticsearch, and I ran a command on my machine that was also indexing data, causing the system to go into a corrupt state. I had to turn off the service, re-import the data into Elasticsearch from Mongo, and then re-index all of it. Luckily it didn't take that long.
I once took down a stock ordering service of the company I was working for at the time, and couldn't roll back the changes because backups were manual and I'd taken an old copy from the wrong server. Took half a day for me and 2 other developers to figure out what had gone wrong and how to fix it. It didn't help we were required to do them at 3 am.
Needless to say, I've learned to triple check the deployment plans before I'm sleep deprived.
I don’t have an example (yet) but I was thinking how this goof could help others - especially newbies/beginners - realize they are definitely not alone when their first outage happens!
Yep! Don't get discouraged about issues and goofs. These things happen day one of programming and then 14 years later (I speak from experience).
You're going to commit bad stuff to the repo, and push up wonky stuff to the server, just smile and carry on!
Damn right. It's not about never making mistakes, it's all about how you respond and get the problem fixed with as little collateral damage as possible. Then you can laugh about it later (and repeat ad nauseam for the rest of your tech career!)
I deleted an unused domain in cPanel that didn't have a website stored on it, but I forgot that the domain had 10+ registered client email accounts. Fortunately, when I put the domain back, the email and data were restored.
I once deleted the code for the most important page of an application a week before it was supposed to be launched. The application was built in a framework that handled its own version control (yeah). Except this part of the application had been built with a new feature that didn't have version control yet (yeah...).
It had taken weeks to get this page to its current state and I ended up rewriting it entirely within an afternoon out of sheer desperation. Sadly, it was probably better the second time. The launch went ahead on time at least.
During my training I had the task of writing an update job to correct data. It had to be SQL because that performed better than going through the client application (which can also run some sort of coded jobs). So I happily started writing my script and hit F5 to test it on the customer's TEST environment. After 10 minutes of execution I noticed that the dropdown in SQL Management Studio had the PROD database selected, and my job was messing up production data...
Was no fun at all, even though I cancelled it immediately, the damage was huge 😅
Copied the wrong version of a DLL to prod about 12 years ago, then troubleshot a production issue for 8 hours only to find out it was just that f*** DLL that was bad. The devs could not find out what was wrong either. Since then I hate anything that is .exe, .dll, etc.
Btw I heard about the site being down for a few hours, hope you have post mortems? :) Like how fast you can recover and stuff. Sometimes user errors are good because you learn so much from them.
Once I changed about 20,000 sample labels in my internship's production database because I forgot a parenthesis. I had to stay until 9 PM to fix everything, and thankfully my coworker had a backup from two or three weeks before that.
This week I tried to load Linux on a thumb drive, so I could play around in a different environment. For whatever reason, I thought it would be simple to just launch Linux and use the same hard drives I use for Windows. Unfortunately, one of my drives was actually three drives, and Linux didn't understand how to open it. I made the mistake of Googling some solutions, naively ran a couple commands, and ultimately corrupted the drives.
After fighting with the drives for a couple hours, I gave up, formatted them, and restored them using one of my backups. On the bright side, my backups work! hahaha
Accidentally set the value of Expires (a timestamp) to max-age (a number of seconds) and permanently cached an entire website. That was behind a very efficient CDN. Oh nooooooo.
Now I obsessively check cache headers in all my integration tests. As a rookie it never occurred to me that I could screw up so badly that not even backups could help.
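In case it helps anyone else, here is roughly what such a check can look like. This is only a sketch: `cache_headers`, `check_cache_headers`, and the one-day limit are hypothetical choices for illustration. The point is that `max-age` is a duration in seconds while `Expires` is an HTTP date, so deriving both from one number makes them harder to swap.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

ONE_DAY = 24 * 60 * 60  # assumed upper bound, pick whatever suits the site


def cache_headers(max_age_seconds, now=None):
    """Build both headers from a single duration so they cannot disagree."""
    now = now or datetime.now(timezone.utc)
    expires_at = now + timedelta(seconds=max_age_seconds)
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "Expires": format_datetime(expires_at, usegmt=True),
    }


def max_age_of(headers):
    """Pull the max-age value out of Cache-Control, or None if it is absent."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    return int(match.group(1)) if match else None


def check_cache_headers(headers, limit=ONE_DAY):
    """Fail loudly if max-age is missing or looks like a timestamp."""
    age = max_age_of(headers)
    if age is None or age > limit:
        raise AssertionError(f"suspicious max-age: {age!r}")


check_cache_headers(cache_headers(300))  # five minutes passes

try:
    # A Unix timestamp pasted in as max-age works out to decades of caching.
    check_cache_headers({"Cache-Control": "public, max-age=1577836800"})
except AssertionError as err:
    print(err)
```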
Pretty recently, I was working on a microservices-based project which required an API gateway. Despite the wide range of options out there, I insisted on writing our own gateway from scratch. It went disastrously, wasting about a month before I went back to a solution based on OpenResty.
My first big project was to hash / encrypt customer passwords (this was 13 years ago). Worked for an ISP. Halfway through the rollout, customer support started reporting an increase in connection issue tickets. Clients were failing to authenticate.
Turns out I'd string trimmed binary strings thus truncating any with NUL bytes. 50k users affected including their RADIUS records (which were used to authenticate their ADSL connection).
I will always be thankful to the senior engineer I was paired with for the rollout who was cool as a cucumber and helped me spot the issue... I was bricking it!
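For the curious, the failure mode is easy to reproduce. The sketch below is in Python and is not the original code; `encode_digest` and `decode_digest` are hypothetical helpers, and the sample bytes are invented. Raw digests can begin or end with NUL or whitespace bytes, so any trim corrupts them, whereas a hex-encoded copy survives text handling.

```python
def encode_digest(raw: bytes) -> str:
    """Turn raw digest bytes into plain hex text before it touches string code."""
    return raw.hex()


def decode_digest(text: str) -> bytes:
    """Reverse encode_digest; trimming the hex text is harmless."""
    return bytes.fromhex(text.strip())


# Invented sample: binary data that happens to start with NUL and end with a space.
raw = b"\x00\x01secret\x00 "

# Trimming NULs and spaces from the raw bytes silently changes the value.
trimmed = raw.strip(b"\x00 ")
print(trimmed == raw)  # False

# The hex form round-trips even if something pads or trims it on the way.
stored = "  " + encode_digest(raw) + "\n"
print(decode_digest(stored) == raw)  # True
```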
During my internship, while copy-pasting code from another component, I also copy-pasted the Google Analytics code, and one of the variables was undefined.
It went unnoticed (since it was just part of analytics, it didn't exactly break anything 🤷) and then they had a bunch of "undefined" hits in their production analytics!
I remember once we were going to have a meeting with a client to show him the website. It was at 6 pm, and I was making some changes to the CSS files directly in cPanel. I don't know exactly what I did, but I deleted the main CSS file, which contained 80% of the website's styling. That wasn't even the real problem: when my manager went to the meeting, he didn't know about it, and neither did I, until he cleared his cache and found that the whole website was messed up. It was a panic moment. He ended up canceling the meeting and I nearly lost my job.
Relevant tweet thread for those who do: twitter.com/ScribblingOn/status/11...
This is interesting