In my first job as a developer, the prevailing attitude from management was "If developers would just be more careful, we wouldn't have any bugs." The other prevailing attitude was, naturally, "Also, work as fast as possible." Unsurprisingly, in this environment, bugs were commonplace. This attitude came to a head when, despite months of warnings about how flawed our deployment process was, a junior developer took down production for half a day. Rather than seeing the error as caused by an underlying problem, management chose to harass a capable young developer out of his job. Not only was this a bad outcome for everyone involved, it meant that they also missed the opportunity to prevent the issue from happening again.
Most companies aren't so punitive and petty. We don't assign blame or punish people for innocent and inevitable errors. But still, when all we do is fix a bug, close the ticket and stop thinking about it, we're treating the bug as a mistake. We're implicitly labelling the bug as a developer error.
By leaving the underlying cause untreated, we end up fixing the same issues again and again. The bug count never decreases for long. Both the underlying issues and the symptoms start to pile up, making it harder to get your work released. Bugs consume your time and your reputation.
The alternative is to view bugs as symptoms of a deeper cause: faulty process, lack of automation or tooling, insufficient knowledge sharing, time spent in the wrong places. Each bug happened and made its way into the release candidate or release only because something let it through.
It's challenging to get enough of a picture of all of the problems to be able to spot patterns. Sometimes you have a slow-leaking fault, causing significant but occasional problems. It takes a shift in mindset from multiple people, switching from a reactive stance to understanding the bigger picture of teams, process, tools and interactions. You need time, coordination between teams and room to experiment. Some problems need to be escalated, and gathering the data to make your case can be difficult.
By first grouping bugs, you can start to understand the underlying problems. For example, if you find that missed requirements are a problem, you can then work out at what stage they were lost: did they never get discovered, did they not get communicated to the team, or were they written down but not checked by dev and QA?
Misconfiguration - Some required configuration or setup never got deployed to an environment.
Missing/incorrect requirements - The feature was developed appropriately according to specification, but initial requirements were incorrect. Perhaps the end-user is using the feature differently than anticipated.
Edge cases - The new work has not accounted for uncommon situations. Your application might throw exceptions, or undefined behaviour might occur. You haven't considered customers with unusual setups.
Flaky/problem components - A part of your application has a high level of complexity and technical debt. It might be poorly understood. All changes to this component require a high level of care.
Third-party issues - Your team works in conjunction with a product actively maintained by an external team, and bugs in that team's work are causing problems in your release.
Feature requests as bugs - Stakeholders have changed their minds, or there is disagreement about the intended functionality. Someone is treating bug tickets as the fast lane for getting new work done, or the business has reversed its position on a previously accepted edge case.
Merge issues - The feature was working when it made it into master, but by the time you cut the release, a subsequent changeset had broken it.
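As a minimal sketch of what this grouping can look like in practice, here is a small script that tallies triaged tickets by category so the dominant causes stand out. The ticket data and category labels are invented for illustration; in reality the categories would be assigned by hand during triage or bug review.

```python
from collections import Counter

# Hypothetical tickets exported from a bug tracker. The "category" field
# is assigned manually during triage using the groupings described above.
tickets = [
    {"id": 101, "category": "misconfiguration"},
    {"id": 102, "category": "missing requirements"},
    {"id": 103, "category": "edge case"},
    {"id": 104, "category": "missing requirements"},
    {"id": 105, "category": "merge issue"},
    {"id": 106, "category": "missing requirements"},
]

# Tally categories, most frequent first, to reveal where bugs cluster.
counts = Counter(ticket["category"] for ticket in tickets)
for category, count in counts.most_common():
    print(f"{category}: {count}")
```

Even a rough tally like this gives you data to bring to the conversation: if "missing requirements" dominates, that points at discovery and communication, not at more testing.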
Sometimes, the level of bugs gets so high that everyone recognizes something needs to be done. You gather everyone together and come up with a dozen different possible solutions. However, not all solutions are valid or helpful. Limited viewpoints mean everyone sees a different part of the picture. Leaders are too far removed from the day-to-day realities to have a grasp of the problems. People come in with their own biases based on past negative experiences or future agendas. Some haven't been experiencing the problem themselves, so they don't understand how it could be an issue.
With such a variety of differing opinions, how do you proceed? Without understanding the cause, or having any data, you end up going with the best guess. Choosing the wrong solution does little to help your problems. Worse, you could take resources away from the behaviours that are doing some good. The wrong solution costs you extra time and paralyzes teams with additional steps and processes.
For example, automated UI tests don't help if you're not getting your requirements correct upfront. You could even end up coding the UI automation to look for the wrong result. UI automation is not a practical approach for enumerating all of your edge cases, and the time spent on UI automation is time not spent on unit tests that could be covering these situations. Increasing time spent testing each feature before you create each release won't identify missing environment configuration or settings. And it won't prevent other teams from accidentally rolling back your work. Doing away with branching isn't going to catch regressions in a product developed by an external team. Classifying a bug as an accepted edge case and setting up infrastructure, process and testing to ensure it doesn't get any worse could be more time consuming than just fixing the bug.
Treating bugs as issues that could happen at any time for any reason requires a high level of caution. In response to past quality issues, the development team ends up with a lot of added checks, processes, additional testing and signoffs.
Rarely do we go back and assess whether the additional effort has been adding value. Nor do we acknowledge the impact it has on teams in terms of momentum, throughput, time lost, context shifting, and having work stuck in long-lived branches waiting for signoff.
The work we do to better understand where our issues are coming from also shows us where we are not having issues. On top of working to prevent the causes of the bugs we are having, we can strip away the efforts to prevent bugs that are giving us no benefit.
Six months before the faulty deployment that took down our application for half a day, I had built a lightweight tool for deploying changes to production according to the company's business rules: a couple of Powershell scripts wrapped in a user-friendly UI. The tool deployed in a tenth of the time and could have helped other teams avoid the underlying issues in our complicated deployment process.
Once you understand the types of bugs that are slipping through, you can see which of your teams just aren't having those same problems everyone else is. They might be doing something that helps pin down requirements upfront, or added a step to catch configuration issues early. They might have adopted a tool, technique or set of patterns to help them. Identifying and spreading something already in use can be a lot easier and safer than trying to adopt a new solution.