Originally published at devopsdiary.blog.
Most engineering orgs measure platform teams like project teams, and it’s quietly killing them. Here are the three metrics that actually tell you whether a platform function is working.
The first time I inherited a platform team I asked the obvious question. How is the platform doing? Uptime green, deploys up, tickets closing faster than they were opening. Two months later I knew none of those numbers had told me anything about whether the team was actually doing its job.
Once you see that gap, you can’t run a platform org any other way.
A function is not a project team
Most engineering organizations staff platform teams like project teams and then measure them like project teams. Both halves are wrong, and the second one is what kills them.
A project team exists to ship a thing. You measure it by whether the thing shipped, when, and how well it works. The metrics are about the team because the output is the team’s output.
A function is different. Platform engineering, DevOps, security, developer productivity: these are functions. A function exists to change the slope of everyone else’s work. Its output is not its own output. The thing you measure is what becomes possible across the rest of the org because the function exists.
If you measure a function the way you measure a project team you’ll get a team that ships beautiful internal artifacts nobody uses. Green dashboards and rising attrition. A platform org that looks healthy from the inside and is quietly failing from the outside, and you won’t see the failure until the consumer teams stop pretending.
Metric 1: adoption velocity
Adoption velocity is the percentage of consumer teams that move to the platform’s current standard within ninety days of release. Not whether they all get there eventually. The shape of the curve in the first quarter.
This is the metric that tells you whether the gap between built and adopted is closing or widening. A platform team can ship excellent technical work and still fail if the curve is flat. Worse, a flat curve means the platform team is generating debt at the same rate as the rest of the org, because every standard they release that nobody adopts becomes another version the team has to support forever.
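Putting this on a dashboard is a small calculation once you can pull, for each consumer team, the date it moved to the current standard. Here’s a minimal sketch in Python; the team names, dates and the 30/60/90-day checkpoints are illustrative, not from any real rollout.

```python
from datetime import date

# Hypothetical data: for each consumer team, the date it moved to the
# platform's current standard (None = has not adopted yet).
release_date = date(2024, 1, 15)
adoption_dates = {
    "payments": date(2024, 1, 29),
    "search": date(2024, 2, 20),
    "checkout": None,                    # still on the previous standard
    "notifications": date(2024, 5, 2),   # adopted, but well past the window
}

WINDOW_DAYS = 90

def adoption_velocity(release, adoptions):
    """Share of consumer teams on the current standard within the 90-day window."""
    adopted_in_window = sum(
        1 for adopted in adoptions.values()
        if adopted is not None and (adopted - release).days <= WINDOW_DAYS
    )
    return adopted_in_window / len(adoptions)

def adoption_curve(release, adoptions):
    """Cumulative adoption share at 30, 60 and 90 days -- the shape of the curve."""
    return [
        sum(1 for d in adoptions.values()
            if d is not None and (d - release).days <= cutoff) / len(adoptions)
        for cutoff in (30, 60, 90)
    ]

print(f"adoption velocity: {adoption_velocity(release_date, adoption_dates):.0%}")
print("curve at 30/60/90 days:", adoption_curve(release_date, adoption_dates))
```

The point of the curve, not just the single percentage, is that a number that looks fine at day 90 can hide a curve that went flat at day 30.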
When I led GitOps adoption, the first quarter looked great. Teams onboarded. We had momentum, we had a story to tell, the architecture review board was happy. The second quarter, same platform, same docs, same support model, but the curve had stalled and nobody on the team noticed because the dashboards were full of green.
I went and talked to the teams that hadn’t adopted. Almost none of their reasons were technical. The blockers were political. Once I knew that, the fix was a half-day of negotiation with the product owners. The curve unstalled the next sprint.
Without an adoption curve I would have kept measuring uptime and deploy counts and concluded the team was crushing it. The team was crushing it. The platform was failing.
Metric 2: time to first success for a new consumer
This one is the cleanest signal in the set. How long does it take a brand-new team (one that has never touched the platform) to get from “we’re adopting this” to “we shipped something to production using it”?
Time to first success is the only proxy I trust for whether the documentation, the onboarding model and the support story actually work. It’s also the metric most platform teams are catastrophically wrong about, because they’ve never measured it. They ask each other whether the platform is intuitive and they all agree it is, because they built it.
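Measuring it is mostly a data-collection problem: record when each new team starts adopting and when it first ships to production on the platform, then look at the median and the stragglers. A rough sketch, with made-up teams and dates:

```python
from datetime import date
from statistics import median

# Hypothetical onboarding records: when a new team started adopting the
# platform and when it first shipped to production using it.
onboardings = [
    {"team": "ads",      "started": date(2024, 3, 4),  "first_prod_ship": date(2024, 3, 8)},
    {"team": "ml-infra", "started": date(2024, 4, 1),  "first_prod_ship": date(2024, 5, 13)},
    {"team": "billing",  "started": date(2024, 4, 22), "first_prod_ship": None},  # still stuck
]

def time_to_first_success(records):
    """Days from 'we're adopting this' to 'we shipped to production with it'."""
    durations = [
        (r["first_prod_ship"] - r["started"]).days
        for r in records
        if r["first_prod_ship"] is not None
    ]
    return {
        "median_days": median(durations) if durations else None,
        "worst_days": max(durations) if durations else None,
        "still_stuck": [r["team"] for r in records if r["first_prod_ship"] is None],
    }

print(time_to_first_success(onboardings))
```

The “still stuck” list matters as much as the median; those are the teams whose blockers you go find in person.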
Earlier in my career I inherited operational workflows where new teams were taking six weeks to onboard. Six weeks is a structural problem dressed up as an onboarding problem. The platform team had been adding documentation and the number hadn’t moved. Their theory was that the new teams weren’t reading carefully enough.
We didn’t write more docs. We restructured the handoffs. Of the four points where new teams were stalling, we collapsed two, automated one and put a single owner on the fourth. New teams started shipping in four days. Defect rates dropped, and throughput improved.
None of that came from better tooling. All of it came from going to look at a number the team wasn’t measuring and refusing to accept that the existing onboarding was working just because the team said it was.
Metric 3: support ratio
The third metric is the percentage of platform-team engineering hours going to consumer support, hand-holding and break-fix versus platform development. Healthy platform teams trend toward more development over time as the platform matures. Unhealthy teams trend the other way and don’t notice until the burnout hits and the senior engineers start interviewing.
Support ratio is the leading indicator for every organizational failure mode in platform engineering. Burnout. Attrition. Scope creep. Feature stagnation. The eventual quiet rebellion of the consumer teams who have been getting worse responses every month and have stopped expecting better. If you only get to watch one number on a platform org, watch this one.
It’s also the only metric that tells you whether the team’s design (interfaces, automation, self-service) is actually reducing toil or just relocating it. A team that ships a self-service portal and watches the support ratio climb has built a portal consumers can’t use.
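You can track this with nothing fancier than monthly hour totals from the team’s ticket and time tracking, split into support-ish work versus platform development. A small sketch with invented numbers:

```python
# Hypothetical monthly hour totals pulled from the team's time/ticket tracking.
# "support" covers consumer support, hand-holding and break-fix;
# "development" covers work on the platform itself.
monthly_hours = {
    "2024-01": {"support": 210, "development": 430},
    "2024-02": {"support": 260, "development": 390},
    "2024-03": {"support": 340, "development": 310},
}

def support_ratio(hours):
    """Support share of total engineering hours, month by month."""
    return {
        month: h["support"] / (h["support"] + h["development"])
        for month, h in hours.items()
    }

ratios = support_ratio(monthly_hours)
for month, ratio in ratios.items():
    print(f"{month}: {ratio:.0%} of hours on support")

# A healthy platform trends down over time; several rising months in a row
# is the early warning this post is talking about.
print("worsening trend:", list(ratios.values()) == sorted(ratios.values()))
```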
This is the metric that convinced me the next generation of platform engineering needs structural governance. Better tools won’t save it. When AI generation accelerates the rate at which consumer teams produce work, the support ratio explodes unless the platform itself produces frozen, validated artifacts that the consumers can trust without a human in the loop. That conviction is why I’ve spent the last few months building AIEOS.
What these metrics force you to do
Once these three numbers are on your dashboard, the leadership job changes. You stop measuring your team by what they shipped and start measuring them by what the rest of the org shipped because of them. That sounds small. It isn’t.
The roadmap shifts, because you become willing to deprecate your own team’s work when adoption stalls instead of doubling down on a thing nobody is using. The way you spend political capital shifts, because you start defending the platform team’s time against the constant pressure to absorb every adjacent problem in the org.
It also changes the conversations you have with your own leadership. You stop reporting up on what your team built and start reporting up on what your team made possible. Those are different sentences. The second one is the one Directors and VPs are paid to say.
The hard part
The hardest thing about running a function is that the work is invisible until it isn’t. A team that’s quietly doing it right looks identical to a team that’s quietly burning down. Velocity charts won’t tell you which is which. Neither will uptime or deploy counts. These three metrics are how I tell the difference, and I can usually tell within the first month of taking over.
If you’re running a platform org and these aren’t on your dashboard, they should be. And if you’re hiring someone to run one, they should already be talking about them.
Todd Linnertz is a Senior Technology Leader with deep experience in enterprise architecture and DevOps. He is the creator of AIEOS, an open-source AI governance system for software delivery teams. Find him at devopsdiary.blog and github.com/wtlinnertz.