I like the term “Technical Debt” because it is an easy metaphor for the average business owner to understand and put into real terms, which makes it very handy for communicating with non-technical people. Just like financial debt, Technical Debt incurs interest payments, which come in the form of the extra effort we have to spend on future development because of the choices we make now.
Technical debt isn’t always bad. Just as a business may borrow and incur debt to take advantage of a market opportunity, developers may incur technical debt to hit an important deadline or get a particular feature to market faster than if they had “done it right the first time.” There may also be prudent debt within a system that team members recognize is not worth paying down because the interest payments are sufficiently small, such as in portions of the system that are rarely updated or touched by development. We may not need to care about comment density, complexity, or refactoring if that sub-system is never going to receive feature updates. The tricky thing about technical debt is that, unlike money, it is sometimes difficult to measure how it will impact your future velocity, and in some cases it may never need to be paid off at all. Each type of technical debt must be weighed against the specific system and its lifecycle.
Technical Debt comes from various sources, some good and some bad, but the idea behind the metaphor is that there is a cost associated with taking shortcuts, making mistakes, or making deliberate trade-offs, and that the cost of not dealing with these issues will increase over time.
It’s no secret that I’m a big fan of SonarQube ( https://www.sonarqube.org ), an open source dashboard for managing code quality. It tries to estimate the technical debt of a code base from the maintainability issues it finds (which it calls ‘code smells’), using static code analysis findings like code coverage of automated tests, code complexity, duplication, violations of coding practices, comment density, and adherence to basic coding standards.
And while it does a good job of reporting the areas of technical debt that it can detect through code analysis, what do the numbers really mean to the business? This is where we get into the fuzziness of technical debt and how it impacts the long-term vitality of a project. When I think about the biggest cost of technical debt, it usually revolves around how the designs or code implemented today may slow down our ability to deliver future features, thus creating an opportunity cost in lost revenue.
Keeping this in mind, when measuring technical debt it’s important to identify the impact that each kind of debt has on your specific project. It is by evaluating each type of technical debt that has the potential to hurt, and figuring out when there is too much of it, that we can start to intelligently manage it.
Evaluating the different kinds of technical debt in a project, and how much they might cost you, involves a fuzzier approach than just reviewing the SonarQube dashboard. Here are some of the categories I like to use when discussing different types of technical debt and the interest we may end up paying on them:
If you’re building out a system where a key component or platform is fundamentally flawed so that it’s not scalable or reliable, this can be a huge problem that you may not even realize until real customers are running on your product. If you can’t scale out your architecture the way you need to because of core dependency problems or incorrect assumptions about how your customers will be using your system, you will have no choice but to rewrite or retool huge chunks of the system.
A good example of this would be the game Star Citizen and the choice to build the game on the CryEngine platform ( https://www.extremetech.com/gaming/237434-star-citizen-single-player-delayed-indefinitely ).
In every large system, there’s always a couple of modules that seem to give developers the most problems. These are the sub-systems or components with code that is hard to understand and expensive and dangerous to change because it was poorly written to begin with or uses extremely outdated technology. Because these subsystems are so fragile, no developer wants to touch them and when they do it’s usually to go in and apply a very specific fix for their situation and then move on. Because these short-sighted fixes accumulate over time, the problem only gets worse. These fragile components need to be identified and evaluated for a complete rewrite to ‘bullet-proof’ them or they will continue to be an expensive debt on the project’s ledger.
Writing Unit Tests takes time. It also requires that developers write their code so that it can be unit tested. A developer who is writing their code so that it can be unit tested tends to break their functionality up into small atomic components that make unit testing easy. If your system has monolithic functions that don’t automate well and you choose not to take the time to refactor them, you end up with tests that are brittle and slow and keep falling apart whenever you change the code. This causes your testing expenses to increase over time as additional options and features are added to the code base. Even worse is when brittle automated tests are ignored on failure because “it always fails anyways”. This can lead to an increase in manual and exploratory testing costs as well as additional costs in unplanned work when code is returned with a slew of bug reports that could have been avoided with proper automated testing in place.
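To make the “small atomic components” point concrete, here is a minimal sketch in Python. The order-handling functions and their names are hypothetical, invented purely for illustration. The tests are plain assert-style functions of the kind a runner such as pytest would collect, which is an assumption about tooling rather than a recommendation.

```python
def parse_quantity(raw: str) -> int:
    """Convert raw user input into a positive integer quantity."""
    quantity = int(raw.strip())
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity


def line_total(unit_price_cents: int, quantity: int) -> int:
    """Price of one order line, in cents."""
    return unit_price_cents * quantity


# Each small function can be tested on its own, with no database,
# network, or UI needed to set up the test.
def test_parse_quantity_strips_whitespace():
    assert parse_quantity(" 3 ") == 3


def test_parse_quantity_rejects_zero():
    try:
        parse_quantity("0")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero quantity")


def test_line_total():
    assert line_total(250, 3) == 750
```

Had the parsing and pricing been buried inside one monolithic “process order” function, each of these checks would need the whole order pipeline set up first, which is where brittle and slow tests tend to come from.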
I’m including this under technical debt because we pay “interest” on it with every release, in person-hours and inherent risk. This is one of those hidden costs that nobody seems to think about until you actually sit down and review how it’s impacting not only your releases, but also your development cadence. Manual release processes are inherently error prone. As such, each release ends up being an all-hands-on-deck scenario “just in case”. These costs keep adding up over time, not only in late nights but also in time taken out of the development team’s normal cycle to prepare for a release, which costs productivity in the current cycle. Is automating a deployment more expensive than scheduling one manual release? Probably. But automation pays huge dividends on each subsequent release of the product and probably has one of the best long-term ROIs.
This is the code that just works, written by some long-lost Jedi Code master who has since left the company or retired. We all know it works and we see it working within our systems, but nobody can explain why it works the way that it does. This is also a really tricky area because the business may decide that it’s OK to carry this technical debt on the project ledger because there are no plans to change any of the functionality that this Black Box is responsible for. And that’s fine. Until something changes. And it doesn’t work. This type of technical debt is like those mortgages with the huge balloon payment at the end. You can get away with not paying it down for years, and then you either bite the bullet and pay it off or, if your product reaches end-of-life, retire the system without ever having to pay it.
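The trade-off between a one-time automation cost and a recurring manual cost can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not a measurement. The function name and all of the hour figures are hypothetical, so plug in your own.

```python
import math


def break_even_releases(automation_hours: float,
                        manual_hours_per_release: float,
                        automated_hours_per_release: float) -> int:
    """Number of releases after which the automation effort has paid for itself."""
    saved_per_release = manual_hours_per_release - automated_hours_per_release
    if saved_per_release <= 0:
        raise ValueError("automation must save time on each release")
    return math.ceil(automation_hours / saved_per_release)


# Hypothetical figures: 80 hours to build the pipeline, 12 person-hours per
# manual release, 1 person-hour to supervise an automated release.
print(break_even_releases(80, 12, 1))  # prints 8
```

With those made-up figures the automation costs more than any single release, but it is paid back by the eighth release, and this does not even count the reduced risk of manual mistakes.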
This can be a small amount of technical debt or it can be a huge amount. That’s because it has to be evaluated on the risk that it presents to the business. This is especially important when outdated libraries contain newly discovered security flaws, like the outdated Apache Struts library that was exploited during the Equifax hack ( https://www.cyberscoop.com/equifax-breach-apache-struts-fbi-investigation ) or the Heartbleed vulnerability ( http://heartbleed.com/ ). An outdated library may be considered a small amount of technical debt until it isn’t, and then it becomes an all-hands-on-deck remediation exercise to get your systems patched before they are exploited.
If you don’t have proper error handling in your code, it’s hard to troubleshoot when something goes wrong. Even worse, you may not notice that certain sub-systems are erroring out unless you have a way of instrumenting those errors and performance issues. When the system isn’t working the way it should, it’s difficult to pinpoint the root cause unless you have built various windows into the processes of your systems.
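As one possible “window” into a process, here is a minimal sketch that uses Python’s standard logging module to record how long an operation took and whether it failed. The `instrumented` helper, the `orders` logger name, and the `load_order` lookup are all hypothetical, and a real system would more likely ship these events to whatever monitoring stack it already uses.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("orders")  # hypothetical sub-system name


@contextmanager
def instrumented(operation: str):
    """Log the duration of an operation, and the traceback if it fails."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("%s failed after %.3fs", operation,
                         time.perf_counter() - start)
        raise  # the caller still sees the error; it is only made visible here
    else:
        logger.info("%s succeeded in %.3fs", operation,
                    time.perf_counter() - start)


def load_order(order_id: int) -> dict:
    # Hypothetical in-memory lookup standing in for a real data store.
    orders = {1: {"id": 1, "total_cents": 750}}
    with instrumented(f"load_order({order_id})"):
        return orders[order_id]
```

The point of the design is that the error is never swallowed: it is logged with its context and re-raised, so a failing sub-system shows up in the logs instead of erroring out silently.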
When code works, it works. So, we may end up with many slightly different variations of code structures that developers have cut and pasted and then slightly modified over the iterations in order to get code into production. (SonarQube does a pretty decent job of finding duplicate code blocks through its static code analysis.) We always tell ourselves, “At some point I can go back and parameterize the code to consolidate and refactor the functions.” But that time is rarely budgeted during the project’s iterations and the debt continues to pile up. Any change to how the code works now requires the developer to remember where the multiple code locations are and make the same update over and over again. If it’s only a few spots it may not be a big deal, but ignoring the problem makes every additional update more expensive over the life of the project.
Sometimes it’s easy to tell who wrote what portion of the system by reviewing the code. One developer may always seem to use one particular pattern over another, create wrappers around certain modules in a very specific way that differs from how another developer instantiates that code, or name variables a particular way. This practice may go unnoticed in small teams, but the more developers who are involved in updating a system, the more complex this problem becomes and the harder the code is to hand off to other developers. Code should be as developer-neutral as possible, and one should only be able to tell who wrote a particular line of code by reviewing the check-in logs.
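Here is a small, made-up illustration of what “parameterize the code to consolidate” can look like. The function names are hypothetical. The first two functions are the kind of pasted-and-tweaked pair that accumulates, and the third replaces both.

```python
# Before: two copies that differ only in a trailing label.
def format_customer_name(first: str, last: str) -> str:
    return f"{last.strip().upper()}, {first.strip().title()}"


def format_employee_name(first: str, last: str) -> str:
    return f"{last.strip().upper()}, {first.strip().title()} (staff)"


# After: one function, with the difference pulled out as a parameter.
def format_name(first: str, last: str, suffix: str = "") -> str:
    base = f"{last.strip().upper()}, {first.strip().title()}"
    return f"{base} {suffix}" if suffix else base
```

A later change to the formatting rule (say, how last names are capitalized) then happens in one place instead of in every copy a developer has to remember.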
Usually this is a necessary debt that should be carried on a short-term basis. You’re going to want to maintain some sort of compatibility with the previous version. But what about the version before that one, or the one before that? The further you go to maintain backwards (or forward) compatibility of your systems the greater the cost to maintain and test all the compatibility scenarios that your system can handle.
In today’s day and age, hardware is cheap. Sometimes you can get away with wasteful practices by throwing some hardware at the problem, and it will go away for a while. This can lead to some lazy programming practices where inefficient memory usage or processing will not surface during initial rollouts. As you scale out, your compute needs will grow and these problems will start to surface.
In general, magic numbers are unique values with unexplained meaning or multiple occurrences that should preferably be replaced by named constants. Replacing them is generally low-hanging fruit, but while they remain they make it difficult for coders less familiar with a system to get up to speed on the how and why of a particular number’s use. Replacing these values with named constants gives the code more descriptive identifiers and makes the code blocks as a whole easier to understand.
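A quick sketch of the difference, using hypothetical session and login rules invented for this example:

```python
# With magic numbers, the reader has to guess what 86400 and 3 mean:
#     if age_seconds > 86400: ...
#     if failed_attempts >= 3: ...

SECONDS_PER_DAY = 86_400   # 24 * 60 * 60
MAX_LOGIN_ATTEMPTS = 3     # hypothetical lockout policy


def is_session_expired(age_seconds: int) -> bool:
    """A session older than one day is considered expired."""
    return age_seconds > SECONDS_PER_DAY


def is_locked_out(failed_attempts: int) -> bool:
    """Lock the account once the allowed number of attempts is used up."""
    return failed_attempts >= MAX_LOGIN_ATTEMPTS
```

The behavior is identical, but the named constants document the intent, and a policy change becomes a one-line edit.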
Every programmer is going to have a different level of experience with a particular framework. As they build out the functions for a system, they may end up creating particular functions that are already handled within the framework. Once that function is built (assuming it doesn’t have major bugs associated with it) it becomes a sunk cost. Sure, it’s inefficient, but as long as it’s working, then there’s not really any technical debt associated with the duplicate functionality.
Nobody reads documentation. Any documentation that is written is usually out of date by the time it’s published. So, is this really technical debt? Maybe not; it will depend on the complexity of the program and why the documentation is being produced. For small projects it may be easier for the developer to just read through the code and consider it to be ‘self-documenting’ (assuming we have good commenting practices). But for larger projects, or systems that require regulatory scrutiny or are subject to an audit, the documentation may be considered a necessary evil that must be produced.
The technical debt metaphor is useful because it gives us a model that non-technical team members can use to evaluate the choices made throughout the lifecycle of a project. There is also a useful distinction between debt that must be paid down and debt that can be carried over time.
Prudent debt can be considered acceptable if the team recognizes that it is taking on that debt and understands the trade-off between an earlier release and the cost of paying the debt off. The important part of this evaluation is that the team knowingly takes on these risks, weighs them against the effort needed to remediate the issues further down the product lifecycle, and plans for that eventual paying of the piper.
Even the best teams will have debt to deal with as a project progresses through its lifecycle, so it’s important that the team members recognize this and make a conscious choice of when to accept technical debt and when to take the time to remediate it.