Share stories about the most remarkable bug you have ever had to fix.
Could be a really silly one, a very tough one, one that almost cost you your job, one that had incredible ripple effects, one that was never noticed by the user(s), a very funny one, or a downright embarrassing one.
Let's hear some good bug stories 🍿🍿🍿
Top comments (54)
Back in the agency days we had this client that was a real piece of work. He was very accomplished in his field and his ego was the only thing that rivalled the amount of data he was willing to cram into his personal website.
We had to make lots of unplanned changes due to him becoming suddenly unhappy with features that had been approved earlier, and management wouldn't do much because his account was rather big - and a retainer.
It got to a point where we stopped arguing and would numbly implement his requests, throwing any initial website strategy away and simply acting as voice-powered website implementation machines.
To add insult to injury, one day he asked for an autoplaying music playlist to be added on his website, in every single page, and that's where we get to the bug that I had to fix.
A few weeks after the implementation, he complained that the first song of said playlist wasn't repeating after the end. Now, I can't remember why adding a loop feature was so hard at the time, but instead of doing it I simply uploaded the same song 10 times. The first one would finish, and the next would start again, and so on.
It wasn't perfect, it wasn't particularly good for the users, but halfway through the project we realised that this website would only have one user, the client himself.
Awesome workaround 😁 Eat your own dogfood from the client perspective. Nice story, thank you so much!
Funny, thanks for sharing. I hate those kinds of clients.
Yay, I LOVE GOOD BUG STORIES.
I worked at Linden Lab (which runs the virtual world Second Life) for over five years. There were a ton of amazing bugs while I was there, because bugs involving virtual worlds of any kind are almost always hilarious. (Read the patch notes for The Sims if you want other examples.)
This is my favourite Second Life bug story. It happened while I was there, but I wasn't involved in fixing it, I just found out about it the next day. Years later, I tweeted a thread about it, in response to this:
Here's the text of the thread:
This is why virtual world bugs fascinate me. Some of them are just AMAZING. (There was another one that raised the water level 200 meters.)
My other favourite SL bug happened before I joined the company. It had very little to do with the virtual world, but was amazing for completely different reasons:
Hilarious! That's a beautiful one.
My takeaway: Don't invest in digital pets and don't ever buy digital goods for them.
Reminds me when CryptoKitties became the viral app of Ethereum.
Another true story. I was working on a bug with a colleague, we were able to recreate it at will, but could not find the root cause. Both of us looked over the code a zillion times, until one day we realized that the O should have been a 0.
I see this problem a lot from a UX point of view. Password and license code boxes (and the like) are especially error-prone here. If systems use a sans-serif font in which 0 (zero) vs. O (capital letter o), and 1 (number) vs. I (capital i) vs. l (lowercase L) are not distinguishable, this can lead to problems (especially if you cannot use copy/paste, which in itself is an issue where security is concerned). I especially hate my KeePass version, which makes this very mistake.
Today, the zero on Windows has a diagonal line through it.
You mean in the default fonts? Luckily that's true. They have figured that out even on Windows 🤪
Ya I hear you I'm not the Windows fanboy I once was.
The funniest one
I joined a startup in their "stabilization" stage recently. Suffice to say the code quality is subpar.
I was tasked with consolidating the code. First thing I do is start to build a test suite. The framework for tests was already there (RSpec for a Ruby on Rails app) but there were no tests at all. So I write an example test and attempt to run it.
Turns out, every time I ran the test suite, A WHOLE FOLDER OF CODE WAS BEING DELETED.
This was the first and only case I ever met where the app was literally deleting parts of itself.
The reason in the end was a weird callback that was trying to reset the cache of the application by deleting the cache folder. Worse, it did so using a relative path instead of an absolute one, so calling this method from the wrong place had those unintended consequences. Suffice to say that is the opposite of a best practice and I rewrote this shit immediately.
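For anyone who wants to see the shape of the fix: here is a minimal sketch in Python (not the original Rails code, which isn't shown here). The function name `clear_cache` and the `cache` folder name are made up for illustration. The idea is to anchor the cache path to a known application root instead of the current working directory, and to refuse to delete anything outside that root.

```python
import shutil
from pathlib import Path


def clear_cache(app_root: Path, cache_name: str = "cache") -> Path:
    """Delete and recreate app_root/cache_name, whatever the working directory is.

    Both names are hypothetical; the point is that the path is resolved
    against an explicit root, never against os.getcwd().
    """
    root = app_root.resolve()
    cache_dir = (root / cache_name).resolve()
    # Refuse to touch anything that is not strictly inside the application root.
    if root not in cache_dir.parents:
        raise ValueError(f"refusing to delete {cache_dir}: not inside {root}")
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True)
    return cache_dir
```

With this shape, running the test suite from a subfolder empties only the cache folder, and a bad name such as `".."` raises instead of wiping a parent directory.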
I still work there though, mostly because my colleagues are fantastic people.
That’s hilarious and horrifying at the same time! Reminds me of this “test” I discovered for an auth provider component (so, you know, nothing critical or anything 🙄). The file had one spec, expect(true).toBe(true), and the whole team was under the impression that the component is tested and all is well.
When I started programming I spent a good part of a day bashing out a bug I had in Python. As I just started out, my debugging skills weren't very fleshed out to the point I restarted my computer a few times in a rash attempt at fixing the problem.
The actual problem was stupidly simple, but due to me lacking the insight, experience or plain knowledge of understanding what the error was saying, I was second guessing all of my limited knowledge of programming and questioning reality.
I eventually fixed the bug by re-writing all of my code from scratch and it "disappeared". I then spent more of my day trying to figure out what the underlying issue was instead of moving on, as I didn't want to just "move on" if it seemed like the code was magically working.
The bug was that I had mixed tabs with spaces for indentation, which wasn't very noticeable in the editor I was using, which was Notepad.
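If your editor won't show the difference, a few lines of Python can. This is a rough sketch, with `indentation_report` and `is_inconsistent` being names invented here: it sorts indented lines by whether they start with tabs, spaces, or both. Python 3 itself rejects ambiguous mixes with a `TabError`, and the standard library ships `tabnanny` for the same purpose, so treat this as an illustration rather than a replacement.

```python
def indentation_report(source: str) -> dict[str, list[int]]:
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    report: dict[str, list[int]] = {"tabs": [], "spaces": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # skip unindented and blank lines
        if " " in indent and "\t" in indent:
            report["mixed"].append(number)
        elif "\t" in indent:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report


def is_inconsistent(report: dict[str, list[int]]) -> bool:
    """True when a file mixes styles within a line or across lines."""
    return bool(report["mixed"]) or bool(report["tabs"] and report["spaces"])


sample = "def f():\n\tx = 1\n        return x\n"
report = indentation_report(sample)
```

Here line 2 is indented with a tab and line 3 with spaces, which looks fine in Notepad but is flagged as inconsistent.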
Morals of the story:
I recently spent two hours trying to figure out why Spyder couldn't import a module I'd just installed, when it ran fine from the command line python3 interface.
Turns out I'd mistyped it in my test script and didn't notice until I made a new one. (These two scripts say EXACTLY THE SAME THING, why does one work and the other one.....oh.)
How did Python react when confronted with a misspelled module name? I'd expect some kind of (helpful?) error message to stderr or, at least, stdout? Or did you just run it interactively?
I was initially trying it in Spyder's integrated IPython console, and eventually tried that file from the command line also when my second try worked (which also failed and made me realize it was the file, not an environment error.)
It did helpfully raise the ModuleNotFoundError, but since at that point, the whole purpose of the file I had written and my running it with nothing in it but an import statement was to verify if I'd successfully a) installed the module to my venv and b) successfully ran Spyder out of the venv to which it was installed.... I thought it was an error in either a or b and was trying to track it down :D
😂 We have all been there.👍👍
In the same class of errors, when I moved to Norway 3 years ago naturally I started using the Norwegian keyboard layout. That's the second time I found out about the non-breaking space, the first being the old days when it was common to add a &nbsp; in an otherwise empty DOM element to retain the layout.
With some sequence that I never figured out, the normal space would be replaced with non-breaking space. Unlike the human eye, most programming languages make a pretty deep distinction, and the program would just crash upon encountering one outside of a string. Since that would happen rather frequently, I ended up making my own keyboard layout that supports 3 languages I use, and 2 additional for writing names
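A custom keyboard layout is one way out; another is to scan files for the offending character before running them. A small sketch in Python, where `find_nbsp` is a made-up helper name. Note that it reports every U+00A0, including ones that sit legitimately inside string literals.

```python
NBSP = "\u00a0"  # non-breaking space, visually identical to a normal space


def find_nbsp(source: str) -> list[tuple[int, int]]:
    """Return (line, column) pairs, both 1-based, for every U+00A0 in the text."""
    return [
        (line_no, col_no)
        for line_no, line in enumerate(source.splitlines(), start=1)
        for col_no, char in enumerate(line, start=1)
        if char == NBSP
    ]


sample = "x =\u00a01\ny = 2\n"
hits = find_nbsp(sample)
```

For the sample above the only hit is on line 1, column 4, right where a normal space was expected.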
That's a good one. Multi language/culture use and keyboard layout still remains an issue to this day ☹
Meaningful white space can be a PITA if you are not used to it and/or your IDE doesn't support you with some visual means. On a personal note, I find the mere concept of something so meaningful in something so totally non-iconic as white space quite non-appealing, to say the least.
Many many moons ago, a panicky customer on the other coast calls to report their system is completely erased. It doesn't happen just once. I got a trip to Baltimore to try to calm them down (and some really nice meals with some really nice people). To skip to the end: (1) they have the system plugged into an outlet in a cubicle wall (2) the modular cubicle walls plug into each other, there's not a single uninterrupted wire (3) someone bumps the wall and there's a brief power glitch (4) the system doesn't crash, but the power blip appears on the system bus, and is interpreted by the disk controller as something worthy of generating an interrupt over (5) we have a recently rewritten disk driver, where the programmer had recoded the interrupt routine to "decrement the pending count, then look for more work to do". But in the idle system, (0 - 1) made it look like a non-zero number, so there's work to do! The driver blithely counted down disk block numbers from a (16-bit) -1, writing garbage. Result: wiped disk. (A little more complex than this, but close enough.) Oops.
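The (0 - 1) step is easy to reproduce. Below is a sketch in Python that simulates a 16-bit counter with a mask; treating the pending count as unsigned 16-bit is an assumption made here for illustration, and the function names are invented. Even read as a signed -1 the value is non-zero, so the "is there more work?" check passes either way.

```python
MASK_16 = 0xFFFF  # keep only the low 16 bits, as a 16-bit register would


def decrement_u16(value: int) -> int:
    """Decrement with 16-bit wraparound: 0 - 1 becomes 65535."""
    return (value - 1) & MASK_16


def decrement_guarded(value: int) -> int:
    """Decrement that never goes below zero, so a spurious interrupt is harmless."""
    return value - 1 if value > 0 else 0


pending = 0  # idle system: nothing queued
wrapped = decrement_u16(pending)
has_work = wrapped != 0  # the driver's "look for more work" check
```

After a spurious interrupt on an idle system, `wrapped` is 65535 and `has_work` is true, whereas the guarded version stays at 0.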
One of the most annoying bugs I've worked on involved the following:
A customer on the other side of the world (barely any timezone overlap) reported that the memory usage of one of our components would continue to climb linearly in one of their environments until the process essentially got OOM Killed. Sounds like a memory leak, right? It wasn't, because forcing the Ruby garbage collector to run would manage to free up most of the memory!
Got to spend weeks going through heap dump after heap dump trying to find the culprit. Turns out it was due to a bad ORM call that was loading a collection into server memory to filter instead of doing it in the database. It didn't cause problems for most users because this collection was typically small -- on the order of 10s of elements. In this environment, however, it was on the order of thousands. Very simple fix, but very difficult to hunt down!
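To make the difference concrete, here is a sketch using Python's built-in `sqlite3` as a stand-in. It is not the original Ruby ORM code, and the `items` table and `status` column are hypothetical. The first function pulls every row into the process and filters there; the second lets the database do the filtering.

```python
import sqlite3


def make_db(rows: int) -> sqlite3.Connection:
    """Build an in-memory table where 1 row in 100 is 'active' (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO items (status) VALUES (?)",
        [("active" if i % 100 == 0 else "archived",) for i in range(rows)],
    )
    return conn


def active_ids_in_memory(conn: sqlite3.Connection) -> list[int]:
    # Anti-pattern: every row is materialised in the process before filtering.
    all_rows = conn.execute("SELECT id, status FROM items ORDER BY id").fetchall()
    return [row_id for row_id, status in all_rows if status == "active"]


def active_ids_in_database(conn: sqlite3.Connection) -> list[int]:
    # The WHERE clause means only matching rows ever reach the process.
    rows = conn.execute(
        "SELECT id FROM items WHERE status = ? ORDER BY id", ("active",)
    )
    return [row_id for (row_id,) in rows]
```

Both return the same ids, but the in-memory version allocates memory proportional to the whole table, which is harmless at tens of rows and painful at thousands and beyond.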
Got to learn a lot about how memory is managed in Ruby, though. 😊
My team blogged about it in more detail here: engineering.pivotal.io/post/debugg...
Once, the team pushed the reports but hadn't done a good check on the security level. Providers could see each other's salaries! On the bright side, the providers were either too careless to check immediately, or we were just lucky, but we didn't get any complaints. It was embarrassing and scary, and in fact it could have led to someone being fired in the end.
This would be a HUGE GDPR problem in the EU nowadays. 🙈🙈
Many, many years ago when I was a grad student I was writing a model based on spherical harmonics and I couldn't get the results to come out even close to the expected answer. Eventually, after stripping more and more code out of my model, I came to the conclusion that the pow function in the compiler was wrong, and fortunately, because in those days you could read the compiler's C code, I could prove it was. Someone had decided that the result of raising an integer to the power 0 was 0, not 1 as the rest of the world did. In the end I had to take my big maths book down to system support to get them to believe me.
😂 x⁰ = x¹/x¹ = 1 q.e.d.
☝️ for all x <> 0 that is
Here's a recent one.
I'm working on a Python application with data being passed around different threads. There was this one particular piece of data that made me almost throw my laptop over the balcony.
So in one thread, a database table is selected and is serialized in to a comma-separated string. Then, this string will be added to a queue which will then be read by another thread. The other thread will split the string by commas and will access each value by array indexing.
The problem is, the thread that reads the data seemed to be accessing indices beyond the number of columns in the original database table! It turns out that some of the serialized values had inherent commas in them, so the number of fields naturally increased.
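A sketch of the failure and one way around it, using Python's standard `csv` module so that values containing commas get quoted on the way in and unquoted on the way out. The helper names and the sample row are made up for illustration; the original application's code isn't shown here.

```python
import csv
import io


def serialize_row(values: list[str]) -> str:
    """Join values into one CSV line, quoting any value that contains a comma."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


def parse_row(line: str) -> list[str]:
    """Split one CSV line back into its original values."""
    return next(csv.reader([line]))


row = ["42", "Doe, Jane", "Oslo"]  # hypothetical 3-column record

naive = ",".join(row).split(",")      # what the original design did
safe = parse_row(serialize_row(row))  # round trip through the csv module
```

The naive round trip turns three columns into four, while the `csv` round trip gives back the original three. Putting the row itself (a tuple or list) on the queue instead of a string would sidestep the problem entirely.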
There's a special place for the people who designed our original application, and it's not pretty.
That's a classic within CSV handling in particular and the "data vs. metadata" department in general.
We had a CD build running that utilized the 1.2.3-b456 naming convention (major.minor.patch-buildNum). One of the devs decided it would be cool to include the name of the artifact in a properties/config file. How did we achieve this?
I caught it early (i.e., after 20 builds) only because their build was dependent on one of my builds and it kept kicking off new builds on my end. Good times.
Damn, and your bosses must have been so happy you were pushing your CD pipe to the limit with all your increased productivity (as measured in builds accomplished) 🤣🤣
Once I was called to do an SEO audit for an online shop. They complained that they were performing really badly in search results and wanted me to find the reason. Turned out, for whatever reason a developer had intentionally (!) de-indexed all their category pages. All that was left for search engines to index were general "About us" pages. I never found out why they had done it.
That's quite an accomplishment for a web dev😂 How did he manage to do that without noticing? Disallow via robots.txt?
Yes indeed :D Every category page had a "noindex,nofollow" tag. My best guess is that they were trying to prevent some sort of duplicate content problem, but executed it really badly and/or didn't understand the basics of what they were doing.
I had just taken a position on the Communications team of a large midrange company. This team was comprised of folks that implemented operating system layer code to allow communications between mainframes, midrange systems and PCs. A heady job for sure, but as the newbie, I was delegated to work the edges instead of the action item work.
As I dug in I found a bug that had been around for at least five years with no resolution. The midrange had at least three layers of code, just call them the top layer, the mid layer and the I/O layer. When this bug showed up (about two or three times a year) it was particularly bad because it effected a complete network reboot. This meant thousands of "sessions" were lost. Getting things back to normal took long periods of time, and of course managers were everywhere by then.
After everything calmed down, I studied the code in the area of concern and decided to put in a massive amount of trace points so we could get a flight recording of all the events between the top and middle layer interface.
My manager did not want me to put out the trace, but I convinced him. He found a customer willing to try it and we were on for a recreation and data to review. About a week later he popped into my office to tell me the patch crashed their system. They had so much network activity the trace logs couldn't keep up. They pulled the patch and life went on.
About two years later another person unknown to me attempted a similar patch but with much less tracing. He found the now 7 year old root cause. All glory went to him, and rightfully so.
So close but so far at the same time.
It seemed to be the right call. Empirical evidence in debugging always works better than mere conjecturing, at least in my experience.
This is the best... A new tape deck unit for saving data was to be supported at the operating system layer. I changed all the code to support the unit based on specifications that were given. We did a ton of testing and shipped the code. About a year later we kept getting reports of tape deck fires that companies were seeing using that new model. In addition, we had heard the data restores were losing data. I had a college call me directly to see if I could get data off of the tapes they sent in because their students' theses were on them.
I spent countless hours going over the code to see if something was wrong and came to the conclusion that there was no way the code could have caused any of these symptoms. The problems continued, until we were flown to a customer's site. We being myself and two hardware engineers with lots of equipment. They told me to create a large batch of data to save while they had probes attached to the hardware. After about two hours they found out that when the hardware buffer ran out of data, it wasn't signaling properly that it was empty! This caused the tape drive to stop, rewind a bit and attempt to lay down a new track of data. After doing this for hours, these tape decks would overheat and catch on fire!
The engineers went back to the office and I never heard another word on the issue. Other than they stopped the fires somehow.
One of those rare cases (at least in my line of work) where software problems could potentially cause life-threatening problems.
Not original but still got to be the best (worst?) bug ever - the 500 mile email!
That was a funny read. The FAQ is nice too.