Measuring high-performance engineering teams from a value perspective

#okrs #sli #slo

Is your current team a high-performing team? I believe this is a fundamental question every engineering lead should be asking. Thankfully, over the years we have discovered many metrics that hint at whether our teams are high-performing or not. I have some bad news, though: a high-performing team is not necessarily an indispensable one. It is extremely important for any leader to understand what value their teams provide and to be able to quantify that value in terms of business impact. Otherwise, without you noticing, your high-performing team might become the perfect candidate for cuts in the next reorganization within your company.

Maybe you are already doing OKRs, or perhaps reporting SLOs and SLIs. But do those key results and indicators really show that your team is high-performing and valuable, or just that it is high-performing?

Common engineering metrics

Let's look at some of the usual indicators for engineering teams.

Code metrics: Here we find things like the number of tests, code coverage, code smells, and duplications, bugs or vulnerabilities over time. Static analysis tools provide this sort of information. In my view, these are excellent metrics for understanding how good your engineering practices are and how committed to quality your team is. But there is very little business value we can extract from them.
CI metrics: These metrics can be obtained from your build server and version control system: the number of successful and failed PRs, successful and failed builds, build time, commit-to-deploy time, deployments to production per day, etc. They are good indicators of how efficient your engineering pipeline is, and they also tell you how fast you can deliver changes into production. But again, from a business point of view, there is not much we can get out of them.
Operability metrics: Metrics like the number of successful deployments, failed deployments, rollbacks, etc. Again, they are very useful for operability insights. But as with the previous ones, it is doubtful whether they provide much direct business value.
Service metrics: Typically the metrics used for reporting SLOs: the number of server errors, average CPU, average or percentile latencies, and so on. These are invaluable for running a service; that's unquestionable. But on their own, business leads can do very little with them. Yes, surely your CFO understands that a faster service is better, but how do you translate that into money? That's a different matter.
Project management metrics: Here we get a bit closer to what our directors like to see: the time it takes to complete epics or initiatives, velocity statistics, burndown completion rate, bugs over time, etc. All these metrics help to identify how well our teams complete the work assigned to them. They try to measure team efficiency, or at least project management efficiency.
Support metrics: How many production incidents did we have? How many of those were critical? What are your MTTR (mean time to recovery), MTTA (mean time to acknowledge) and MTBF (mean time between failures)? How many production bugs have customers reported? These are just a few examples of metrics related to supporting a product in production. Most business owners will agree that these have a direct impact on the value of their business.
Engagement metrics: Some engineering teams are built to provide a service to other teams. Whether these teams build frameworks, APIs or documentation, they share a common factor: their performance largely depends on how others feel about their work. Here we would use metrics like a Net Promoter Score, the number of promoters and detractors, or simply how others rate your service in terms of support, documentation, availability, etc. These metrics help us understand how others value our work. They are extremely important, but I also believe they provide no quantifiable business value that we can use.
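The support metrics above are easy to compute once incidents are recorded with timestamps. What follows is a minimal sketch, assuming a hypothetical `Incident` record with start, acknowledgement and resolution times. Definitions vary between teams (for example, whether MTTR is measured from the failure or from its detection, or whether MTBF is measured between failure starts), so treat these as one reasonable choice rather than a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    started: datetime       # when the failure began
    acknowledged: datetime  # when someone on call acknowledged it
    resolved: datetime      # when the service was restored


def _mean(deltas):
    # Average a non-empty list of timedeltas.
    return sum(deltas, timedelta()) / len(deltas)


def support_metrics(incidents):
    """Return (MTTA, MTTR, MTBF); MTBF is None when there is only one incident."""
    if not incidents:
        raise ValueError("need at least one incident")
    incidents = sorted(incidents, key=lambda i: i.started)
    mtta = _mean([i.acknowledged - i.started for i in incidents])
    mttr = _mean([i.resolved - i.started for i in incidents])
    # Uptime between the end of one incident and the start of the next one.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(incidents, incidents[1:])]
    mtbf = _mean(gaps) if gaps else None
    return mtta, mttr, mtbf


# Two made-up incidents, purely for illustration.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(hours=1)),
    Incident(t0 + timedelta(days=2),
             t0 + timedelta(days=2, minutes=20),
             t0 + timedelta(days=2, hours=3)),
]
mtta, mttr, mtbf = support_metrics(incidents)
print(f"MTTA={mtta} MTTR={mttr} MTBF={mtbf}")
```

Note that all three results are durations: useful for running a service, but still not something a CFO can put in a budget.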

A fictitious example

So, let's put ourselves in an imaginary scenario. Your director has decided to own a project to optimize an internal flow that everyone complains about. It promises a great improvement in the way code is handled and generated internally. Your team will drive this project. Everyone, not just within your team but outside it, agrees that it will be huge if you deliver the project on time. Your team does great and scores high on all the metrics described above. After six months the project is completed as promised. Everyone is super happy.

Then, as happens every 12 months, the directors go to an off-site to discuss the state of things. Your director comes back from the off-site with bad news. The business is not doing well. Managers have gone over the projects executed by the different units over the last year. Other, product-related projects have been very successful. They were not as well executed, but they were customer-facing, and product management sees them as essential.

Your director feels very sorry about it, but there are going to be job cuts across all units, and your team will be impacted more than others. Your director could not win against all the other teams, which were able to communicate clearer business value gains from their projects. In fact, none of the product managers could understand what you and your team had been working on during the last six months. You feel very frustrated. How can these folks be so ignorant as not to see the obvious?

Ok, I know. The tale above is likely too extreme; I certainly hope it is. But it illustrates something that is generally true of most engineering teams: many teams don't know how to measure the real value they provide. We engineers understand perfectly well that high code coverage is a good thing because it shows that people are writing tests and worrying about corner cases. We know that a low rate of build failures is good because it means people are dedicated to quality, that a low commit-to-deploy time tells us we deploy frequently and can react very fast, and that a low rollback ratio means customers are not experiencing issues. We know all that is great. But we engineers are not that good at translating those great things into dollars, euros, people effort, or business metrics like MRR (monthly recurring revenue), churn rate or ARPA (average revenue per account), to name a few. That is where the disconnect lies.
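To make the business metrics mentioned above a little less abstract, here is a minimal sketch of their common textbook definitions over one month. The `Account` record and the sample accounts are hypothetical, and companies often refine these formulas (for example, revenue churn instead of customer churn), so check how your finance team defines them before reporting against them.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical subscription record, used only for illustration."""
    name: str
    monthly_fee: float     # recurring revenue from this account per month
    active_at_start: bool  # was it a customer at the start of the month?
    active_at_end: bool    # is it still a customer at the end of the month?


def mrr(accounts):
    """Monthly recurring revenue: sum of fees from accounts active at month end."""
    return sum(a.monthly_fee for a in accounts if a.active_at_end)


def customer_churn_rate(accounts):
    """Share of the accounts active at the start of the month that were lost by the end."""
    at_start = [a for a in accounts if a.active_at_start]
    lost = [a for a in at_start if not a.active_at_end]
    return len(lost) / len(at_start)


def arpa(accounts):
    """Average revenue per account: MRR divided by the number of active accounts."""
    active = [a for a in accounts if a.active_at_end]
    return mrr(accounts) / len(active)


accounts = [
    Account("acme", 500.0, True, True),
    Account("globex", 300.0, True, False),    # churned this month
    Account("initech", 200.0, True, True),
    Account("umbrella", 400.0, False, True),  # new this month
]
print(mrr(accounts), customer_churn_rate(accounts), arpa(accounts))
```

An engineering improvement shows up in these terms only if it moves one of the inputs: more active accounts, fewer lost ones, or higher fees.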

Translating effort into actual value

Teams that know how to express the value they provide are going to be in a better position on the corporate ladder than those that are unaware of that value or can't communicate it. But how can we change this? How can we translate all the cool things our engineering team is doing into actual value that business folks can understand?

I'm afraid there are no magic recipes, at least none that I know of. It all depends on what your team is doing and which measurable metrics are available to you.

Let's look back at our fictitious example. The work our team did was essentially relieving different teams of part of their workload. How much? After some digging, we might have found that every time the process we optimized had to be executed, it took three different teams 1 person-month (PM) of work each. After some further analysis, we might also have found that this same process had been executed more than 30 times every year during the past 5 years. With our changes, the process now requires only 5 minutes of work. That's actually a lot of value! Let's do some basic math:

30 executions x 3 PM = 90 PM spent last year due to inefficiencies
Our team spent 5 x 6 PM = 30 PM developing the solution
With the new process, the cost is 30 executions x 5 minutes = ~2.5 hours of work every year
We could also assume that one extra person is needed full-time to support the solution in the coming years, i.e. 12 PM per year

These simple numbers tell us that the new solution will save about 78 person-months every year (the previous cost of 90 PM minus the new cost of roughly 12 PM). That can easily be translated into money. Not only that: since each execution now saves about 3 PM, the 30 PM we invested would already be amortized as soon as the new process has been executed 10 times.
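The same arithmetic can be written down as a small script, which makes it easy to rerun when an assumption changes. This is a sketch of the fictitious example only. It assumes, as the 78 PM figure implies, that supporting the solution takes one person full-time (12 PM per year), and it also assumes a person-month of roughly 160 working hours, which the example does not state.

```python
# Worked example from the fictitious scenario, in person-months (PM).
# Assumption (not in the example): 1 PM is roughly 160 working hours.
HOURS_PER_PM = 160

executions_per_year = 30
old_cost_per_execution_pm = 3.0                      # three teams x 1 PM each
new_cost_per_execution_pm = 5 / 60 / HOURS_PER_PM    # 5 minutes of work
development_cost_pm = 5 * 6                          # 30 PM to build the solution
support_cost_per_year_pm = 12.0                      # assumption: one person full-time

old_annual_cost = executions_per_year * old_cost_per_execution_pm
new_annual_cost = (executions_per_year * new_cost_per_execution_pm
                   + support_cost_per_year_pm)
annual_saving = old_annual_cost - new_annual_cost

# Executions needed before the development cost is recovered
# (support cost ignored here, as in the text).
saving_per_execution = old_cost_per_execution_pm - new_cost_per_execution_pm
break_even_executions = development_cost_pm / saving_per_execution

print(f"Annual saving: {annual_saving:.1f} PM")
print(f"Break-even after ~{break_even_executions:.1f} executions")
```

Multiplying `annual_saving` by a fully loaded cost per person-month (a figure your finance team can provide) turns the result into money.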

That information should be enough to defend the project and the value of our unit. And the exercise is useful not only retroactively: it is also a great way to defend future engineering investments.

Value measurements

So, what measurements could we use to reflect value? Here are some that come to mind:

Time saved: Person-month sucks as a predictive metric, but it serves the purpose here. If you run a project that saves people's time, calculate how much time you are going to save them compared with how things are right now. The difference is the value you are providing. Compare it with the approximate cost you are going to incur, and that will tell you how long it will take for the project to pay for itself.
Direct income: Sometimes our efforts might win large contracts. Don't take the bait when your product manager tells you that big corp Acme will sign up once your project is delivered; most product managers will tell you that. You should really find out for yourself. After releasing the project, ask your management how many customers really bought your product because of the shiny new feature. Surveys and other tools can help with that. Was your feature the primary deciding factor? Then it's fair to count that customer's income as yours. Was your feature a secondary factor? Then you should weigh how much your feature might have contributed to winning the deal, if it contributed at all. It's important to go through this exercise to understand whether your team is an important player within your organization or whether, on the contrary, you might have a problem with your current situation.
Indirect income: Indirect income might come in non-obvious ways. For example, maybe the latest UX remake caused an increase in sign-ups and a reduction in the churn rate. This is very difficult to measure. As in the previous case, customer feedback is essential to understand how the different features might have impacted the overall solution. Evaluating what has had an impact on the business and what hasn't is crucial. Usage data can be used to extrapolate the value of some features: tools like Amplitude, among many others, can tell you whether people are using a feature. Popular features are usually adoption drivers. If you can agree on a quantitative measure for those actions, you will get an indication of their value. What will always be clear is that something that does not get used has very little value.
Cost saving: Some initiatives are meant to reduce costs. Whether we are migrating from an expensive commercial database to an open-source one or moving from one SaaS telecommunications vendor to a different one, those initiatives will provide cost savings but will also carry some cost. The difference is the value we provide. If you are saving millions by switching database vendors and you are not making your managers aware of it, then you need to act immediately.
Shared service: This is a flavor of time saved. When your team implements a shared service that everyone else is meant to use, you incur a cost that saves others effort. How much would it cost all the other teams to do that work themselves? Subtract your own cost, and that's roughly what you are saving the company. For example, is your team implementing a feature toggle service? How much would it cost others to implement feature toggles by themselves? How many services do you serve? The more services, the larger the value you provide.
Error reduction: Errors usually come with a cost. Whether it comes from customer loss or from our operations team helping to fail over some service to a secondary datacenter does not matter; there is a cost. The key here is to calculate the cost of the particular pain we are fixing. How much did it cost in the past? Are we reducing the number of incidents? Then we should agree on an average incident cost and estimate how many incidents our improvements will prevent.
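Most of the measurements above boil down to a handful of simple formulas. Here is a minimal sketch of them; the function names are hypothetical, the inputs (costs, incident counts, contribution weights) are assumptions you would need to agree on with finance and product, and every number in the example calls is made up purely for illustration.

```python
import math


def payback_months(project_cost, monthly_saving):
    """Time saved: months until the saving covers the project cost."""
    if monthly_saving <= 0:
        return math.inf  # the project never pays for itself
    return project_cost / monthly_saving


def net_cost_saving(old_annual_cost, new_annual_cost, migration_cost, years):
    """Cost saving over a horizon: recurring difference minus the one-off migration cost."""
    return (old_annual_cost - new_annual_cost) * years - migration_cost


def shared_service_value(cost_per_team_to_build, teams_served, own_cost):
    """Shared service: what every consuming team would have spent, minus our own cost."""
    return cost_per_team_to_build * teams_served - own_cost


def error_reduction_value(incidents_before, incidents_after, avg_incident_cost):
    """Error reduction: incidents avoided times an agreed average incident cost."""
    return (incidents_before - incidents_after) * avg_incident_cost


def attributed_income(deals):
    """Direct income: deal values weighted by the feature's agreed contribution (0 to 1)."""
    return sum(value * weight for value, weight in deals)


# Illustrative inputs only.
print(payback_months(30, 78 / 12))                         # PM invested, PM saved per month
print(net_cost_saving(1_000_000, 400_000, 300_000, 3))     # currency over three years
print(shared_service_value(2.0, 15, 10.0))                 # PM
print(error_reduction_value(24, 9, 5_000))                 # currency per year
print(attributed_income([(120_000, 1.0), (80_000, 0.25), (50_000, 0.0)]))
```

The hard part is not the arithmetic but agreeing on the inputs; writing them down as explicit parameters makes those assumptions visible and easy to challenge.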

Those are just some ideas. If you have others, I would love to see them in the comments section.

To conclude: a high-performing team is great. It makes us engineers proud, and I love sound engineering practices. But it is important to make sure that everything we do in our team is sustainable from a business perspective. It does not matter how exciting the new thing we are doing is if it is not going to provide any value. Similarly, it does not matter that you are doing everything your product manager tells you to do if nobody ends up using it. Value metrics are crucial, and they are also a healthy exercise for engineering leads: they push us to reflect on whether we are providing value, to consider our current role within the organization, and to act if that role needs to change.

Picture by quinoal @ unsplash