<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Justin Reock</title>
    <description>The latest articles on DEV Community by Justin Reock (@jreock).</description>
    <link>https://dev.to/jreock</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1221035%2F4390710a-1738-405b-bbca-61e299d61220.jpeg</url>
      <title>DEV Community: Justin Reock</title>
      <link>https://dev.to/jreock</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jreock"/>
    <language>en</language>
    <item>
      <title>🤯 CrowdStrike Outage: Software Testing is Still Too Painful</title>
      <dc:creator>Justin Reock</dc:creator>
      <pubDate>Wed, 24 Jul 2024 21:29:48 +0000</pubDate>
      <link>https://dev.to/jreock/crowdstrike-outage-software-testing-is-still-too-painful-kch</link>
      <guid>https://dev.to/jreock/crowdstrike-outage-software-testing-is-still-too-painful-kch</guid>
      <description>&lt;p&gt;Right now, there’s an engineer at CrowdStrike feeling the weight of the world on their shoulders. On July 19, 2024, at 04:09 UTC, CrowdStrike released a configuration update for its Falcon threat detection product to Windows systems, as part of normal operations. This update inadvertently triggered a logic error, causing system crashes and blue screens of death (BSOD) on millions of impacted systems. The resulting computer outages caused chaos for airlines, banks, emergency operations and other systems we rely on. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And it could have been so much worse.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Further research has shown that the same underlying driver and C++ issues that allowed the bug to take down Windows machines also exist on Linux (and macOS) servers. In other words, it is only through sheer luck that the update was limited to Windows systems. The damage is estimated in the billions, and it would have approached apocalyptic scale had the update impacted Linux servers, which account for a vastly larger share of critical infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened, exactly?
&lt;/h2&gt;

&lt;p&gt;Based on CrowdStrike’s &lt;a href="https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/" rel="noopener noreferrer"&gt;post-incident report&lt;/a&gt;, published July 24th, 2024, we now know that this outage was caused by a bug in one of their bespoke test suites, which they refer to as Content Validator. This app is responsible for, unsurprisingly, validating content updates such as the ones that triggered the outage before pushing them out for release. The root cause was threefold:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug in Content Validator&lt;/strong&gt;: A bug in the Content Validator allowed a problematic content update, specifically a template used to define rapid response data for specific, potentially exploitable system behavior, to pass validation checks despite containing content data that led to an out-of-bounds memory read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment of Problematic Template Instance&lt;/strong&gt;: The problematic configuration content was deployed on July 19, 2024, as part of a “Rapid Response Content” update. This instance contained content data that, when interpreted by the sensor, triggered an out-of-bounds memory read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure in Error Handling&lt;/strong&gt;: The unexpected out-of-bounds memory read caused an exception that the CrowdStrike sensor could not handle gracefully, resulting in a Windows system crash (Blue Screen of Death or BSOD).&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s not the ’90s anymore, so how on earth did this happen?
&lt;/h2&gt;

&lt;p&gt;With an outage this widely publicized and impacting so many people, it’s only natural to seek someone to blame, and at a glance, it should be easy enough to do so. Obviously a quality check in Content Validator was missed; a test wasn’t run somewhere. One of those lazy, hapless software engineers forgot to run a test, or decided to skip one, right?&lt;/p&gt;

&lt;p&gt;And who would commit such an egregious dereliction of duty? What engineer in their right mind would dare let something like this slip? &lt;/p&gt;

&lt;p&gt;I know firsthand that the answer is “almost all of us.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Because testing still sucks.
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://www.lambdatest.com/future-of-quality-assurance-survey#TestExecutionTime" rel="noopener noreferrer"&gt;2023 LambdaTest survey&lt;/a&gt; found that 28% of large organizations have test cycles that last longer than an hour. That means for large applications, developers might wait hours or even days to get the feedback they need to do their work, and so they rely on various means of optimization to reduce the number of tests that need to be run. Or they just skip tests outright — especially tests that are known to be non-deterministic, or “flaky.” &lt;/p&gt;

&lt;p&gt;Skipping tests has become its own science, complete with its own subgenre of tools. Techniques like Pareto testing, test impact analysis, and predictive test selection have all presented solutions that are truly symptomatic of a deeper problem: &lt;strong&gt;that the state of software testing is maddeningly burdensome for engineers&lt;/strong&gt;. &lt;/p&gt;
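
&lt;p&gt;To see why these techniques are symptomatic rather than curative, consider a minimal sketch of test impact analysis; the test names and coverage map below are hypothetical, and in practice the map would come from a coverage tool rather than be written by hand:&lt;/p&gt;

```python
# Hypothetical sketch of test impact analysis: run only the tests
# whose covered files overlap the files changed in a commit.

# Coverage map from test name to the source files it exercises.
# In practice this would be produced by a coverage tool, not by hand.
COVERAGE_MAP = {
    "test_parser": {"parser.py", "tokens.py"},
    "test_renderer": {"renderer.py"},
    "test_end_to_end": {"parser.py", "renderer.py", "main.py"},
}

def select_tests(changed_files, coverage_map):
    """Return the tests impacted by a set of changed files."""
    changed = set(changed_files)
    return sorted(
        test for test, covered in coverage_map.items()
        if covered.intersection(changed)  # any overlap means the test may break
    )

# A change to renderer.py selects two tests and skips the parser test.
selected = select_tests(["renderer.py"], COVERAGE_MAP)
```

&lt;p&gt;The savings are real, but so is the risk: any test whose coverage data is stale or incomplete gets silently skipped.&lt;/p&gt;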

&lt;p&gt;Large engineering organizations have trouble enforcing quality standards cross-functionally, limiting the usefulness of accepted code coverage solutions like SonarQube and Codecov, and opening doors for incidents like the CrowdStrike outage. Simply having the scanners and related data is not enough; there must be accountability for setting the right standards and driving adherence to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve your practices to offset increased developer burden
&lt;/h2&gt;

&lt;p&gt;This incident proves that it’s not always ok to skip the tests we think are “safe” to skip, and that we can’t make a priori judgments about how changes will impact systems. The calculated cutting of corners, which we all do to preserve our productivity, will be officially unacceptable going forward. Judging from history, this outage will be used as an example of the need for wider code coverage and a higher priority placed on tests that cover traditionally low-risk changes. All of that sounds great, unless you’re the engineer who’s already dealing with unbearably large test sets.&lt;/p&gt;

&lt;p&gt;So if we’re going to ask more of our developers, again, we need to reduce their cognitive load in other ways. We’ll focus on three areas of process improvement which are adjacent to the delivery of software to production: &lt;strong&gt;production readiness assessments, service maturity feedback loops, and continuous monitoring of quality metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fully automated post-CI production readiness assessments
&lt;/h2&gt;

&lt;p&gt;We know from the &lt;a href="https://www.cortex.io/report/the-2024-state-of-software-production-readiness" rel="noopener noreferrer"&gt;2024 State of Production Readiness report&lt;/a&gt; that a staggering 98% of organizations have experienced negative consequences as a result of failing to meet production readiness standards, which is in essence what happened with CrowdStrike.&lt;/p&gt;

&lt;p&gt;Software testing provides some, but not all, of the feedback necessary to determine software’s fitness for production. Content Validator’s code owners and stakeholders would have undergone various readiness assessments each time a new release was ready, including the release that contained the bug that allowed for this outage. Services would be assessed on areas such as test code coverage, the number of open critical issues, the state of certain infrastructure tags, and so on.&lt;/p&gt;

&lt;p&gt;These assessments tend to be lengthy and brittle, taking the form of endless Slack channels or Zoom calls, where each stakeholder will effectively be asked to provide a yes/no response on whether the parts of the release they are responsible for are ready. The “checklist” used for this assessment is often kept in an inefficient system of record, like a wiki or spreadsheet, making it difficult to align on ever-changing standards.&lt;/p&gt;

&lt;p&gt;The solution is to continuously monitor the same endpoints that are typically checked manually. This automates reporting on those metrics, provides “at-a-glance” readiness for any stakeholder, and, where possible, surfaces that status to other systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7svsdjp8dhh2uybbld90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7svsdjp8dhh2uybbld90.png" alt="Example of an automated production readiness report" width="468" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, readiness metrics are collected and visually represented with red/green status, where red indicates metrics that are below operational readiness standards. In this case, any services with metrics in a “red” status are not ready for deployment. When the standards are met, the report will automatically update. This makes it significantly easier to integrate readiness checks with deployment workflows, obviating the need for manual assessments, and freeing engineers to work on higher value tasks.&lt;/p&gt;
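
&lt;p&gt;A readiness gate like this can be sketched in a few lines; the metric names and thresholds below are hypothetical stand-ins for whatever standards an organization actually sets:&lt;/p&gt;

```python
# Hypothetical sketch of an automated production readiness gate.
# Metric names and thresholds are invented; real values would come
# from quality scanners and issue trackers, not hard-coded dicts.

def assess_readiness(metrics):
    """Return a red/green status per metric and an overall verdict."""
    statuses = {}
    # Standard: at least 80% test coverage.
    statuses["test_coverage_pct"] = (
        "green" if metrics["test_coverage_pct"] >= 80 else "red"
    )
    # Standard: zero open critical issues.
    statuses["open_critical_issues"] = (
        "green" if 0 >= metrics["open_critical_issues"] else "red"
    )
    # Any red metric blocks deployment until the standard is met.
    ready = all(status == "green" for status in statuses.values())
    return statuses, ready

# One red metric, so this service is held back from deployment.
statuses, ready = assess_readiness(
    {"test_coverage_pct": 91, "open_critical_issues": 2}
)
```

&lt;p&gt;Because the check is just a function over continuously collected metrics, it can run on every change and feed its verdict directly into a deployment workflow.&lt;/p&gt;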

&lt;h2&gt;
  
  
  Collaborative service maturity metric scorecards
&lt;/h2&gt;

&lt;p&gt;Keeping a software service like Content Validator in a state of continuous improvement is harder than it sounds. Not only do developers need to make iterative improvements to the service, but they must also ensure that existing features stay fresh and functional. Engineers tend to automate much of this through various IDE and CI tools, but keeping track of metrics and data across all those tools introduces significant cognitive load.&lt;/p&gt;

&lt;p&gt;An excellent and proven technique for driving all kinds of compliance standards, including maturity standards, across teams is the metric scorecard. Metric scorecards observe and accumulate data from various parts of the platform, and automatically evaluate a service’s level based on domain-specific rules.&lt;/p&gt;

&lt;p&gt;In the example below, a “Service Maturity” scorecard has been created which will assign Bronze, Silver, and Gold levels to services based on their compliance with various thresholds and metrics. In this case, two rules have been set for a service to achieve “Bronze” status. Services must have at least two service owners associated with them, and the service must have a README file in its repository. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsg2e32a5sjpc45s7xt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsg2e32a5sjpc45s7xt1.png" alt="An example of a continuously monitoring service maturity metric scorecard" width="353" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rules continue upwards through Silver and then Gold status, ultimately requiring metrics like an MTTR of less than an hour and no critical vulnerabilities associated with the service.&lt;/p&gt;
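
&lt;p&gt;A scorecard evaluation of this kind can be sketched as a simple rule cascade; the field names are hypothetical, and the Silver rule is invented since the original example does not specify one:&lt;/p&gt;

```python
# Hypothetical sketch of a tiered service maturity scorecard.
# The Bronze and Gold rules follow the examples in the text; the
# Silver rule is an invented placeholder.

def maturity_level(service):
    """Evaluate a service's metrics into a maturity level."""
    # Bronze: at least two owners and a README in the repository.
    if not (service["owner_count"] >= 2 and service["has_readme"]):
        return "None"
    # Silver (placeholder rule): an on-call rotation is configured.
    if not service["has_oncall_rotation"]:
        return "Bronze"
    # Gold: MTTR under an hour and no critical vulnerabilities.
    if 60 > service["mttr_minutes"] and service["critical_vulns"] == 0:
        return "Gold"
    return "Silver"

# Meets every rule, so the service evaluates to Gold.
level = maturity_level({
    "owner_count": 2,
    "has_readme": True,
    "has_oncall_rotation": True,
    "mttr_minutes": 45,
    "critical_vulns": 0,
})
```

&lt;p&gt;Running this evaluation continuously against live platform data is what turns a static checklist into a feedback loop.&lt;/p&gt;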

&lt;p&gt;Ideally, service owners will see these scores as part of their daily workflow, giving them a clear path to service improvement, and total clarity on what work needs to be performed to move services into a mature state. &lt;/p&gt;

&lt;p&gt;Tools and systems that have scorecarding capabilities, such as internal developer portals (IDPs), make this workflow integration much easier. Ideally, the portal will already be integrated with the relevant parts of the platform, such as incident management applications and quality scanners, so evaluation of scorecard data is efficient and continuous. Further, the developer homepage component of an internal developer portal is a natural place to provide service maturity feedback, obviating manual approval gates and other sources of friction.&lt;/p&gt;

&lt;p&gt;If Content Validator’s service maturity standards were continuously monitored, including areas such as test code coverage and validation accuracy, it’s possible that the introduction of the bug could have been detected and flagged before being released and triggering the outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous quality monitoring of test sets
&lt;/h2&gt;

&lt;p&gt;We rely on automated testing to validate the quality of the software we create, but what happens when our test frameworks are inaccurate, as was the case with Content Validator? Additional layers of trust must be built into the system, to ensure that the testing itself is efficacious.&lt;/p&gt;

&lt;p&gt;In this case, workflows could be built which would allow for the developers of Content Validator to more easily assess the service’s behavior when it is presented with new and incrementally changing fields and data types. Further, these workflows could be executed in multiple environments, such as Windows environments, to trap unexpected behavior and provide feedback to developers. &lt;/p&gt;

&lt;p&gt;It’s ok to increase the complexity of the release pipeline on the back end if we make up for it by simplifying interaction on the front end. So, additional quality tools such as software fuzzers could be introduced, and the data from those systems could be easily evaluated by the portal, since it would already be integrated into the same CI/CD pipelines. That data could be scorecarded in a manner similar to the service maturity scorecards above, making it much easier to maintain continuous and sustainable improvements to Content Validator’s accuracy.&lt;/p&gt;
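
&lt;p&gt;As a rough illustration of the idea (the validator and template shapes here are invented, not CrowdStrike’s actual code), a fuzzing harness generates randomized inputs and records the ones the system under test cannot handle:&lt;/p&gt;

```python
import random
import string

# Hypothetical sketch of fuzzing a content validator in CI: feed it
# randomized templates and trap unhandled input before release, rather
# than discovering it in production. validate_template is a toy
# stand-in for the real validator under test.

def validate_template(template):
    """Toy validator: every declared field must carry a value."""
    for field in template["fields"]:
        if field.get("value") is None:
            raise ValueError("field missing a value: " + field["name"])
    return True

def random_template(rng):
    """Generate a template whose fields randomly omit their values."""
    fields = []
    for _ in range(rng.randint(1, 5)):
        name = "".join(rng.choices(string.ascii_lowercase, k=6))
        value = rng.choice([None, "x"])  # sometimes omit the value
        fields.append({"name": name, "value": value})
    return {"fields": fields}

def fuzz(runs=200, seed=0):
    """Return the generated templates that crashed the validator."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        template = random_template(rng)
        try:
            validate_template(template)
        except ValueError:
            failures.append(template)  # caught in CI, not in production
    return failures
```

&lt;p&gt;Any failures surfaced this way become feedback for developers long before a release, and the failure counts themselves are the kind of data a scorecard can track over time.&lt;/p&gt;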

&lt;h2&gt;
  
  
  Bottom line, lower cognitive load leads to better quality overall
&lt;/h2&gt;

&lt;p&gt;Developers are continuously expected by leadership to balance velocity with quality, with little regard for the opaque or even unknown constraints presented by the developer platform. The industry response to the CrowdStrike outage places software testers directly in the crosshairs, but it’s the state of software testing that should be indicted.&lt;/p&gt;

&lt;p&gt;Instead of blaming developers for cutting corners on quality, let’s take a look at the underlying systems that force developers to take shortcuts in the first place. Let’s give them tools like IDPs to make it easier for them to stay compliant. &lt;/p&gt;

&lt;p&gt;By implementing better collaborative tools and processes, we can lower the cognitive load necessary for developers to adhere to ever-deeper quality standards, and reduce the odds of another incident like the one caused by Content Validator.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>productivity</category>
      <category>devops</category>
      <category>crowdstrike</category>
    </item>
    <item>
      <title>🪩 It's time for IDPCON!!</title>
      <dc:creator>Justin Reock</dc:creator>
      <pubDate>Tue, 23 Jul 2024 19:30:01 +0000</pubDate>
      <link>https://dev.to/jreock/its-time-for-idpcon-1g41</link>
      <guid>https://dev.to/jreock/its-time-for-idpcon-1g41</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;We are here to “learn, have fun, and make a difference.” - Dr. W. Edwards Deming, productivity visionary&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On October 24th in New York City, Cortex will host IDPCon, the first-ever in-person event dedicated to Internal Developer Portals (IDPs) and broader themes of developer experience and productivity. Aimed at engineering leaders, DevOps and SREs, and other practitioners responsible for developer experience, this gathering will unite top minds to tackle complex socio-technical challenges and share insights from industry leaders like Docker, Xero, LinkedIn, Clear, and Blackstone. Registration is open here: &lt;a href="https://idpcon.com" rel="noopener noreferrer"&gt;https://idpcon.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Attendees will engage in sessions and discussions designed for knowledge transfer, brainstorming, and networking, with the goal of improving software delivery methods. Inspired by successes in the field, IDPCon aims to make transformative practices like platform engineering and IDPs accessible to all software engineering organizations. But first, a little background on how this event came to be...&lt;/p&gt;

&lt;h2&gt;
  
  
  What are we all doing here, anyways?
&lt;/h2&gt;

&lt;p&gt;More than two-thirds of global GDP is now digitally transformed (IDC FutureScape, October 2021). This shared economic landscape is supported by a group of craftspeople known collectively as software engineers, who create and operate the apps that we use to connect with our friends, navigate our world, and, increasingly, apply the skills we possess while working remotely.&lt;/p&gt;

&lt;p&gt;It would be rational to assume that the best tooling and most frictionless environments have been curated to support this ever-important workforce. Prioritizing the preservation of flow state should be a no-brainer; it should be paramount for any business hoping to profit from software productivity. We should measure leaders on those kinds of outcomes.&lt;/p&gt;

&lt;p&gt;But, we don't.&lt;/p&gt;

&lt;p&gt;Engineers still deal with painful bottlenecks and impediments to their productivity and creative flow. Inefficient processes and frenetic schedules lead to excessive context switching and cognitive fatigue, lowering overall innovative capacity. Worse, the industry has not settled on the right metrics and frameworks to properly measure and improve productivity, and even if it had, tool fragmentation and sprawl obfuscates much of the raw data and makes observation difficult.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why IDPCon? Why now? Why us?
&lt;/h2&gt;

&lt;p&gt;Internal Developer Portals are uniquely suited to solve some of these specific challenges. A well-implemented IDP will integrate with each sprawling endpoint of the platform, continuously monitoring those endpoints, orchestrating their behavior, collecting data that can be evaluated into useful metrics, and presenting that data cross-functionally. But the IDP pattern is still relatively new, with the now widely recognized Backstage solution launching as an open source project just a little over four years ago. As an industry, we still have a lot to learn about what can be done with this pattern, and which use cases are still untapped.&lt;/p&gt;

&lt;p&gt;That’s why we decided to host IDPCon. With this event, we are bringing together experts from organizations like Docker, Xero, LinkedIn, Clear, and Blackstone, to share what they’ve learned about this pattern, and what techniques really work for driving developer joy, continuous improvement, and increased productivity. The one-day program will consist of sessions and open spaces discussions, creating space for knowledge transfer, brainstorming, and networking. We hope that participants will come away with new clarity and understanding of how to apply these ways of thinking to improve their own methods of software delivery.&lt;/p&gt;

&lt;p&gt;IDPCon is inspired by the same successes that have motivated the team at Cortex to keep investing and improving the product. We have witnessed the transformations that can occur when teams are no longer artificially burdened by practices that ignore the underlying systems and the root cause of productivity loss. The application of practices like platform engineering, and the implementation of tools like IDPs are patterns that can be shared and adopted by any software engineering organization, and we want these capabilities to be accessible to everyone.      &lt;/p&gt;

&lt;p&gt;The seemingly endless studies and decades of research all suggest one thing: investments in better developer culture lead to improved productivity outcomes. This in turn leads to better business throughput and an accelerated pace of global innovation. We hope that you’ll take the day to learn with us, and be part of an event that will help shape the future of developer experience.&lt;/p&gt;

&lt;p&gt;Register here: &lt;a href="https://idpcon.com" rel="noopener noreferrer"&gt;https://idpcon.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>developers</category>
      <category>sitereliabilityengineering</category>
      <category>devops</category>
    </item>
    <item>
      <title>🎙️💥 The Backstage Community Wars Have Officially Begun</title>
      <dc:creator>Justin Reock</dc:creator>
      <pubDate>Wed, 01 May 2024 22:29:14 +0000</pubDate>
      <link>https://dev.to/jreock/the-backstage-community-wars-have-officially-begun-210i</link>
      <guid>https://dev.to/jreock/the-backstage-community-wars-have-officially-begun-210i</guid>
      <description>&lt;p&gt;In mid-March, the &lt;a href="//backstage.io"&gt;Backstage&lt;/a&gt; engineering team at Spotify announced that, along with several other updates in their planned April 30th roadmap webinar, they would unveil a new internal developer portal (IDP) solution called Spotify Portal. Details were sparse, and even nosing around the community a bit, it was difficult to determine exactly what this product would be, who it would be for, and perhaps most importantly, how it might impact users and the IDP community at large.&lt;/p&gt;

&lt;p&gt;Concerned members of the platform community have speculated at events and in hallway discussions about what would be in the announcement, but it wasn’t until yesterday, April 30th, that we would get the official word from Spotify. Specifically, at the very end of their webinar. ;) &lt;/p&gt;

&lt;p&gt;We now know that &lt;a href="https://backstage.spotify.com/products/portal/" rel="noopener noreferrer"&gt;Spotify Portal&lt;/a&gt; will be a curated, no-code, and turnkey solution for creating Backstage portals, beginning with a hand-selected set of waitlisted limited beta participants. Now the question becomes: how will a commercial play from Spotify Backstage impact or alter the communities they helped to create?&lt;/p&gt;

&lt;h2&gt;
  
  
  No Surprises Here
&lt;/h2&gt;

&lt;p&gt;The only thing that should surprise anyone about the new direction from Spotify is that they waited this long to plant their flag. &lt;a href="https://www.reddit.com/r/devops/comments/1171it7/backstage_is_not_userfriendly_i_want_something" rel="noopener noreferrer"&gt;Negative feedback&lt;/a&gt; about the practicality of the framework from the Backstage community has been near universal, with a &lt;a href="https://www.reddit.com/r/devops/comments/15bke0w/anyone_considered_backstageio_but_decided/" rel="noopener noreferrer"&gt;host of issues&lt;/a&gt; plaguing Backstage deployments at scale. &lt;/p&gt;

&lt;p&gt;Expensive rollouts and maintenance with insignificant adoption have resulted in negative ROI for a majority of teams. Much of this friction has come from the fact that Backstage is essentially a repository of TypeScript that teams must build, deploy, and maintain on their own.&lt;/p&gt;

&lt;p&gt;It’s practically begging for a managed and automated model, which is what Spotify Portal alleges it will provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Players
&lt;/h2&gt;

&lt;p&gt;While Spotify Backstage has been relatively motionless, not monetizing much beyond an anemic set of curated plugins, other companies have already approached the market with a very similar value proposition to Spotify Portal. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.redhat.com/rhdh/overview" rel="noopener noreferrer"&gt;Red Hat Developer Hub&lt;/a&gt;, an opinionated and supported Backstage service supported by Red Hat’s Enterprise Support, will now find itself directly in competition with the creators of Backstage. This has historically been an uphill battle for Red Hat, especially when the project in question is technically governed by a different foundation, the Cloud Native Computing Foundation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://roadie.io/" rel="noopener noreferrer"&gt;Roadie.io&lt;/a&gt;, which offers a curated and managed service for Backstage, is likely to be the  first major player impacted by this decision. It will be a hard sell to convince potential opportunities, as well as existing customers, that they will be better off in the hands of a third party rather than the original maintainers of the open source Backstage project. &lt;/p&gt;

&lt;p&gt;Even &lt;a href="https://www.getport.io/" rel="noopener noreferrer"&gt;Port&lt;/a&gt;, whose plucky marketing and developer-friendly branding have promised an “open” experience when building an IDP, will now have to reckon with the fact that they are very much a closed source solution, certainly compared to Backstage’s highly permissive Apache 2.0 open source license. Can that message of openness stand up to an actual open source license?&lt;/p&gt;

&lt;p&gt;All of this comes before considering the cottage industry of independent and employed consultants who have been building and maintaining bespoke Backstage instances for years. And what will become of the many contributions from Backstage’s large network of core and plugin committers? It would be hard to imagine Spotify being in favor of the open development of plugins that compete with their new commercial functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How will this impact the market?
&lt;/h2&gt;

&lt;p&gt;Though it’s probably time for some of these players to start circling the wagons, it's important to note that there are still major limitations to the Spotify Portal solution. By their own admission, at launch the product will only support GitHub as an identity provider. Limiting the identity provider is a well-known IDP antipattern, and one that Spotify will need to move quickly to address if it wants to be taken seriously in the enterprise. &lt;/p&gt;

&lt;p&gt;The beta rollout is also very limited. Teams can apply on a waitlist, and the Spotify team will hand-pick the early beta testers and demoers. We know that &lt;a href="https://www.forbes.com/sites/jackkelly/2024/04/26/spotifys-recent-layoffs-impacted-the-company-more-than-anticipated/?sh=571ab1604139" rel="noopener noreferrer"&gt;Spotify laid off 17% of its workforce&lt;/a&gt; back in early December of 2023, not the best company signal to send right before a major pivot and product launch. So, it's entirely possible that Spotify is simply trying to conserve its engineering teams, though I don’t see any reason why Spotify wouldn’t prioritize the “big logos” first.&lt;/p&gt;

&lt;p&gt;There’s a lot involved in these rollouts: even with improved deployment automation, teams still have to organize their data and model their software catalogs. To complicate matters further, Spotify doesn’t have enterprise software support in its DNA, so it will have to learn these functions as it goes. These beta rollouts will not be quick, and it will take a long time to clear the waitlist. That means mere mortals will be waiting a long time to sample the product, at a time when interest in IDPs has never been higher. Many teams will not wait, and will look at other existing and trusted solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does Cortex stand?
&lt;/h2&gt;

&lt;p&gt;Historically, &lt;a href="//cortex.io"&gt;Cortex&lt;/a&gt; has applauded the awareness Backstage has brought to IDPs while also keeping it at arm’s length and under a watchful eye. Increasingly poignant philosophical differences in architecture and language have begun driving a deeper wedge into a once complementary go-to-market motion. The volume of “Backstage-burned” developers we’ve already spoken to this year even led to the recent release of a “migration helper” to expedite moving work from Backstage to Cortex. We’re bracing for even greater volume in the coming months.&lt;/p&gt;

&lt;p&gt;Cortex supports the ability to easily ingest services from Backstage and pull them into the software catalog, greatly easing teams’ ability to offboard Backstage. This has been the case for many teams who have not derived the value from Backstage that they predicted, finding friction in areas such as cultural adoption and poor data modeling from the beginning.&lt;/p&gt;

&lt;p&gt;Given that Backstage-powered solutions are likely to see significant disruption over the coming months, there’s never been a better time for a truly enterprise-class, supported alternative to Backstage. Cortex can organize your full service catalog, but goes on to provide features such as templated scorecards and initiatives to help drive adoption and cultural change. In a &lt;a href="https://www.cortex.io/case-studies/bigcommerce" rel="noopener noreferrer"&gt;recent case study, BigCommerce reported 96% onboarding of engineers within the first three months&lt;/a&gt;, a feat that is typically a significant challenge for IDP adopters.&lt;/p&gt;

&lt;p&gt;As we adjust to a swiftly changing landscape, with the Spotify Portal announcement arguably being one of the larger milestones in the relatively short history of IDPs, Cortex will stay the course on its roadmap, mindful of new competitive signals but unfazed in its commitment to being the most enterprise-ready IDP for mission-critical developer workflows.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>internaldeveloperportals</category>
      <category>platformengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Developer Experience is Dead: Long Live Developer Experience! 🫠</title>
      <dc:creator>Justin Reock</dc:creator>
      <pubDate>Wed, 14 Feb 2024 17:08:04 +0000</pubDate>
      <link>https://dev.to/jreock/developer-experience-is-dead-long-live-developer-experience-1col</link>
      <guid>https://dev.to/jreock/developer-experience-is-dead-long-live-developer-experience-1col</guid>
      <description>&lt;p&gt;Right now, here in early 2024, we seem to be experiencing YAPP (Yet Another Productivity Philosophy), and that philosophy is converging on developer experience or ‘DevEx.’ The “revelation” that better developer experience leads to better productivity outcomes is not a revelation at all, but a truth that’s been, perhaps bafflingly, deprioritized in favor of post-CI software delivery improvement initiatives like DORA that have all but ignored the role of the human in the loop.&lt;/p&gt;

&lt;p&gt;It seems that with every new method we invent for the delivery of products, whether physical or virtual, we reinvent productivity philosophies to go alongside them. From lean principles and “just-in-time” manufacturing in the 1970s to DevOps and platform engineering today, process stakeholders update their respective buzzwords and slang to match what they observe in their surrounding culture, but the cross-functional behaviors don’t shift much. Ultimately, physics is physics, and the way we work is subject to those universal constraints.&lt;/p&gt;

&lt;p&gt;GitHub’s recent &lt;a href="https://github.blog/2024-01-23-good-devex-increases-productivity/" rel="noopener noreferrer"&gt;highlight&lt;/a&gt; of the &lt;a href="https://queue.acm.org/detail.cfm?id=3639443" rel="noopener noreferrer"&gt;Microsoft/DX study of the productivity outcomes of improved Developer Experience&lt;/a&gt; is being lauded as “finally” presenting the much-needed hard data behind the productivity outcomes of improved DevEx. Their article begins with: “The wait is over: we finally have data to back up the benefits of developer experience (DevEx).” &lt;/p&gt;

&lt;p&gt;A highlight of my own work and study has been the frequent and fantastic exchanges I've been lucky to have with Peggy Storey and Abi Noda, so they know this is coming: there is &lt;strong&gt;excellent&lt;/strong&gt; data in this study, and everyone absolutely should read it, but…&lt;/p&gt;

&lt;p&gt;Finally? … really? ;)&lt;/p&gt;

&lt;p&gt;Veterans of the developer productivity space may immediately be reminded, for instance, of “The Coding War Games,” an ongoing study from the mid-1980s to the mid-aughts which provided the same conclusions and &lt;a href="https://www.goodreads.com/en/book/show/67825" rel="noopener noreferrer"&gt;led to the publication of the book ‘Peopleware.’&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;With little centralized knowledge of this prior work available, almost everyone else will agree with GitHub, unperturbed, that the data is the first of its kind. But we already have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://gwern.net/doc/cs/algorithm/2001-demarco-peopleware-whymeasureperformance.pdf" rel="noopener noreferrer"&gt;The Coding War Games&lt;/a&gt; - Taking place over several years across almost a hundred different software organizations, this study provides ample data over years that developer experience metrics are a great predictor of organizational productivity&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2203.04374" rel="noopener noreferrer"&gt;The Code Red Study&lt;/a&gt; (CodeScene) - A study completed in March of 2022 demonstrating conclusively that time wasted by low-quality code led to 9x delays in cycle time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cortex.io/post/the-only-way-to-measure-developer-productivity-without-causing-a-revolt" rel="noopener noreferrer"&gt;The Only Way to Measure Developer Productivity Without Causing a Revolt&lt;/a&gt; - A comprehensive response to Dan North’s “The Worst Programmer I Know” blog, illustrating the inefficacy of many supposed productivity metrics &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.pluralsight.com/resources/blog/leadership/developer-productivity-thought-wrong" rel="noopener noreferrer"&gt;Cat Hicks on Developer Thriving&lt;/a&gt; - A perspective that draws conclusions that subjective developer experience impacts overall productivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why do these cycles repeat? Why do we spend our energy discovering and rediscovering that the experience provided for the worker will almost universally predict better productivity outcomes when instead we could invest that energy in the cultural and environmental shifts necessary to enshrine those outcomes universally? More on what those investments would look like later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What failed before will probably fail again
&lt;/h2&gt;

&lt;p&gt;Chomsky said “If you are teaching today what you were teaching five years ago, either the field is dead or you are,” and this wisdom has reached anthem-scale for cultures of continuous improvement. This guidance shouldn’t direct us to forget the lessons of the past, but should remind us that we stand on the shoulders of giants. I think the majority will forget, though, and I think that’s why we see clear indications that, up to this point, DevEx efforts have largely failed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Leaders invest in software delivery, not software creativity&lt;/strong&gt; - Everyone loves to talk about Developer Experience improvement, but forty years after the Coding War Games, decision makers are still wary of the ROI. Perhaps we just haven’t seen comprehensive enough solutions; it’s possible that the ROI hasn’t materialized because the right solutions aren’t being invested in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps has descended into Dev ← Ops&lt;/strong&gt;, where Developers have taken on additional responsibility for Ops, while the opposite has not been true. Developers have been expected to learn Docker and Kubernetes, but SREs have not faced similar expectations. Developer self-service solutions have not kept pace with this change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release Engineering and SRE has become Production Support&lt;/strong&gt; - When software quality is negatively impacted because of increased pressure and a lack of self-service options for developers, that leads to a culture where supposedly proactive resources become almost fully reactive. SREs spend time answering tickets instead of improving systems. This can be considered a cascading failure, where poor developer experience leads to failed adoption of proactive SRE/DevOps measures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everyone is waiting for Generative AI to save the day&lt;/strong&gt; - While LLMs show great promise, they remain unproven; nevertheless, we are already abandoning the current state of the art in favor of the next shiny thing, and we’re not even sure the cavalry is coming. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevEx Improvement Adoption Rates Are Abysmally Low&lt;/strong&gt; - The critical cultural shift that follows adoption of better DevEx practices, such as implementing Backstage, often fails for lack of organizational acceptance, with most &lt;a href="https://thenewstack.io/how-spotify-achieved-a-voluntary-99-internal-platform-adoption-rate/" rel="noopener noreferrer"&gt;teams never realizing more than a 10% adoption rate&lt;/a&gt;. Typically, the most elite developers in the organization are responsible for curating the portal, and it may not align with what will actually be consumed by the more mainstream engineers in the organization. These failures may even further drive divides between these classes of engineers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At some point, the tension will snap, and the data suggests that we are inching towards this milestone. A &lt;a href="https://www.computerweekly.com/news/366563212/Developer-burnout-caused-by-flawed-productivity-metrics" rel="noopener noreferrer"&gt;2021 study by JL Partners&lt;/a&gt; observed that a significant percentage of developer burnout can be blamed squarely on flawed productivity metrics, and that these flawed metrics are responsible for perpetuating a vicious cycle. When goals such as release dates are calculated using incomplete metrics, the accuracy of those calculations is negatively affected. This leads to missed deadlines, which turn into efforts by the business to “fix productivity” by trying to find new ways for developers to ship faster. Unsurprisingly, the end result is generally longer hours for developers, complemented with eroded trust because of continually perceived “failure.”&lt;/p&gt;

&lt;p&gt;To complicate matters further, the data shows us that we still haven’t aligned on the proper way to communicate about productivity and developer experience, despite the work that has been done over the decades. A &lt;a href="https://arxiv.org/abs/2111.04302" rel="noopener noreferrer"&gt;criminally overlooked April 2022 study&lt;/a&gt; (seriously, please go read it) by many of the same researchers who worked on SPACE concluded that developers and their leadership tend not to prioritize the same metrics when thinking about productivity, often creating tension between developer experience and company velocity.&lt;/p&gt;

&lt;p&gt;This is dizzyingly ironic in a world where developer salaries increasingly exceed $500k annually and digital transformation is a universal driver for nearly every business. In almost every other industry, the environment is curated for workers and bound by open and enforceable standards such as the ones provided by OSHA.&lt;/p&gt;

&lt;p&gt;A perfect storm may be brewing where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As &lt;a href="https://www.idc.com/research/viewtoc.jsp?containerId=US46880818" rel="noopener noreferrer"&gt;predicted by IDC&lt;/a&gt;, 65% of the Global GDP is software driven, a huge responsibility for a single workforce&lt;/li&gt;
&lt;li&gt;Developer salary continues to drive higher business cost while developer outputs are not improving&lt;/li&gt;
&lt;li&gt;Workforce fatigue leads to unprecedented economic waste and lack of innovation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these indicators show any significant signs of improvement, despite billions of dollars in DevEx investment in 2023. And can we even define “improvement,” when we still have not done a thorough job of aligning on what exactly those indicators even are? Digital transformation is only accelerating with an industry most recently invigorated with the promise of breakthroughs in AI and augmented or extended reality. Developer salaries will continue to surge with demand. Without real breakthroughs in developer productivity, cognitive fatigue and burnout will only become more ingrained.&lt;/p&gt;

&lt;h2&gt;
  
  
  We need productivity engineering, not productivity management
&lt;/h2&gt;

&lt;p&gt;There is good news emerging out of the platform engineering and developer productivity engineering (DPE) circles: many who have been spending time thinking about better ways to measure and improve productivity have discovered long-hidden sources of bottlenecks and friction in the developer experience, and are starting to take action to deal with them. When we move past speculation and use solutions that let us gather quantitative data about developer workflows, the parts of the workflow deleterious to developer experience show up in high resolution. With this, a new set of metrics is emerging, ones that more closely reflect the efficiency of a developer’s workflow. With more accurate metrics, we can drive real improvements. &lt;/p&gt;

&lt;p&gt;These metrics deal with pains felt acutely by developers, and more chronically by organizations. They come from dozens of sources and represent the galaxy of tools developers rely on in their daily workflows: build tools like Maven and NPM, testing frameworks like Cucumber, security and quality scanners like Snyk and Checkmarx, deployment substrates like Kubernetes, requirements management tools like JIRA, and even on-call management frameworks like PagerDuty. All of these can create new frustrations and sources of context switching for developers.&lt;/p&gt;

&lt;p&gt;These revelations are proving what many of us have known for decades, and what countless studies have proven time and time again. There is no one metric, or even one set of metrics, that can even hope to paint an accurate picture of developer productivity, and so no one metric or set of metrics can fix it. As the researchers on the SPACE framework put it, we need to capture a “constellation of metrics in tension.”  &lt;/p&gt;

&lt;h2&gt;
  
  
  The right solutions prioritize integration… and they already exist
&lt;/h2&gt;

&lt;p&gt;We need to invest in tools that can act as predictive engineering systems of record and analyze the full spectrum of metrics, not just ones that are evident in CI/CD. A fair criticism of DORA is that it only captures metrics post-CI, and Conway’s Law would then lead to tools which only look at data generated in CI. Those data are important to understand parts of delivery, but to see the full picture, we need solutions that allow us to look at data generated by every part of the developer workflow, from the laptop all the way to the deployment substrate, and they need to adapt quickly to changes in the infrastructure.&lt;/p&gt;

&lt;p&gt;The right solution for this problem will prioritize gathering data easily and up-to-the-microsecond from the full bloom of developer tools used by engineers in the organization. The solution must also use these data to drive new effects and behaviors in the organization, through gamification and other proven techniques. These tools should not overburden teams and should require little administration past the point of integration and data processing. &lt;/p&gt;

&lt;p&gt;Lucky for us, as William Gibson put it, “the future is already here, it’s just not evenly distributed.” We have no need to reinvent another productivity approach when effective solutions such as internal developer portals already exist, and already prioritize the correct mechanisms for facilitating successful productivity engineering initiatives. As business leaders, we just need to invest in and create urgency behind these solutions today, so that we don’t find ourselves having the same discussion again in a decade.&lt;/p&gt;

</description>
      <category>developers</category>
      <category>productivity</category>
      <category>devex</category>
      <category>devops</category>
    </item>
    <item>
      <title>DORA Metrics: What are they, and what's new in 2024?</title>
      <dc:creator>Justin Reock</dc:creator>
      <pubDate>Tue, 23 Jan 2024 21:15:57 +0000</pubDate>
      <link>https://dev.to/jreock/dora-metrics-what-are-they-and-whats-new-in-2023-4l50</link>
      <guid>https://dev.to/jreock/dora-metrics-what-are-they-and-whats-new-in-2023-4l50</guid>
      <description>&lt;p&gt;&lt;strong&gt;Despite some recent criticism, DORA metrics remain the most asked about framework for measuring developer productivity. But how can it's younger sibling, the SPACE framework change the dialogue around engineering measurement, and what role do IDPs play in bridging the gap?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is nothing more valuable to an organization than data—about customers, products, opportunities, gaps... the list goes on. We know that to maximize value streams for the business we need to turn a critical eye to data related to how each group operates, including software development teams. In 2019 a group known as the DevOps Research and Assessment (DORA) team set out to find a universally applicable framework for doing just that. After analyzing survey data from 31,000 software professionals worldwide collected over a period of six years, the DORA team identified four key metrics to help DevOps and engineering leaders better measure software delivery efficiency:&lt;/p&gt;

&lt;h2&gt;
  
  
  Velocity Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Deployment frequency&lt;/strong&gt;: Frequency of code deployed&lt;br&gt;
&lt;strong&gt;Lead time for changes&lt;/strong&gt;: Time from code commit to production&lt;/p&gt;

&lt;h2&gt;
  
  
  Stability Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean time to recovery&lt;/strong&gt;: Time to recover after an incident (now Failed Deployment Recovery Time)&lt;br&gt;
&lt;strong&gt;Change failure rate&lt;/strong&gt;: Percentage of changes that lead to failure&lt;/p&gt;

&lt;p&gt;In 2021, DORA added a 5th metric to close a noted gap in measuring performance — reliability. The addition of this metric opened the door for increased collaboration between SREs and DevOps groups. Together, these five metrics, now referred to simply as “DORA metrics,” have become the standard for gauging the efficacy of software development teams in organizations looking to modernize, as well as those looking to gain an edge against competitors.&lt;/p&gt;

&lt;p&gt;In this post we’ll discuss what each metric can reveal about your team, how the benchmarks available for “Elite” (back again in 2023 after being dropped in 2022), “High-Performing,” “Medium,” and “Low-Performing” teams have changed in the last year, and what all of this means in relation to the recently released SPACE framework—which puts a higher emphasis on process maturity rather than output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DORA?
&lt;/h2&gt;

&lt;p&gt;First, let’s revisit what DORA (the institution behind the metrics) actually is. The DevOps Research and Assessment (DORA) team was founded in 2015 by Dr. Nicole Forsgren, Gene Kim, and Jez Humble with the charter of improving how organizations develop and deploy software. This group was also behind the inaugural State of DevOps report, and maintained ownership of the report until 2017. Their research resulted in what Humble has referred to as “a valid and reliable way to measure software delivery performance,” while also demonstrating that these metrics can “drive both commercial and non-commercial business outcomes.” In 2019 they joined Google, and in 2020 the first four of the familiar five DORA metrics were released, with the fifth and final following in 2021. An overview of each metric is below:&lt;/p&gt;

&lt;h2&gt;
  
  
  Lead Time for Changes
&lt;/h2&gt;

&lt;p&gt;Lead Time for Changes (LTC) is the amount of time between a commit and production. LTC indicates how agile your team is—it not only tells you how long it takes to implement changes, but also how responsive your team is to the ever-evolving needs of end users, which is why this is a critical metric for organizations hoping to stay ahead in an increasingly competitive landscape.&lt;/p&gt;
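&lt;p&gt;As a minimal sketch (the timestamps and record shape below are made up for illustration, not from any specific tool), LTC can be computed directly from commit and production-deploy times, then summarized with a median so one outlier change doesn’t dominate:&lt;/p&gt;

```python
from datetime import datetime
from statistics import median

# Hypothetical change records: (commit time, production deploy time).
changes = [
    ("2024-01-02T09:00:00", "2024-01-02T15:30:00"),
    ("2024-01-03T11:00:00", "2024-01-05T10:00:00"),
    ("2024-01-04T08:00:00", "2024-01-04T09:45:00"),
]

FMT = "%Y-%m-%dT%H:%M:%S"

def lead_time_hours(commit_ts, deploy_ts):
    """Lead Time for Changes for a single change, in hours."""
    delta = datetime.strptime(deploy_ts, FMT) - datetime.strptime(commit_ts, FMT)
    return delta.total_seconds() / 3600

hours = [lead_time_hours(c, d) for c, d in changes]
# Summarize the distribution; the median resists long-tail outliers.
print(f"median LTC: {median(hours):.1f} hours")
```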

&lt;p&gt;The DORA team first identified these benchmarks for performance in their Accelerate State of DevOps 2021 report, but have since updated them to the following (with original benchmarks noted in parentheses):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elite Performers&lt;/strong&gt;: &amp;lt;1 day (Original: &amp;lt;1 hour)&lt;br&gt;
&lt;strong&gt;High Performers&lt;/strong&gt;: 1 day to 1 week (Original: Same)&lt;br&gt;
&lt;strong&gt;Medium Performers&lt;/strong&gt;: Between 1 week and 1 month (Original: Between 1 month and 6 months)&lt;br&gt;
&lt;strong&gt;Low Performers&lt;/strong&gt;: Between 1 month and 6 months (Original: 6+ months)&lt;/p&gt;

&lt;p&gt;LTC can reveal symptoms of poor DevOps practices: if it’s taking weeks or months to release code into production, you should assume inefficiencies in your processes. However, there are several steps engineering teams can take to minimize LTC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement continuous integration and continuous delivery (CI/CD). Ensure testers and developers work closely together, so everyone has a comprehensive understanding of the software.&lt;/li&gt;
&lt;li&gt;Consider building automated tests. Save even more time and improve your CI/CD pipeline.&lt;/li&gt;
&lt;li&gt;Define each step of your development process. Because there are a number of phases between the initiation and deployment of a change, it’s smart to define each step of your process and track how long each takes.&lt;/li&gt;
&lt;li&gt;Examine your pull request cycle time. Gain a thorough picture of how your team is functioning and further insight into exactly where they can save time.&lt;/li&gt;
&lt;li&gt;Be careful not to let the quality of your software delivery suffer in a quest for quicker changes. While a low LTC may indicate that your team is efficient, if they can’t support the changes they’re implementing, or if they’re moving at an unsustainable pace, you risk sacrificing the user experience. Rather than compare your team’s Lead Time for Changes to other teams’ or organizations’ LTC, you should evaluate this metric over time, and consider it as an indication of growth (or stagnancy).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deployment Frequency
&lt;/h2&gt;

&lt;p&gt;Deployment Frequency (DF) measures how often you ship changes and how consistent your software delivery is. This enables your organization to better forecast delivery timelines for new features or enhancements to end user favorites. According to the DORA team, these are the latest benchmarks for Deployment Frequency:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elite Performers&lt;/strong&gt;: On-demand or multiple deploys per day (Original: multiple per day)&lt;br&gt;
&lt;strong&gt;High Performers&lt;/strong&gt;: Once per day to once per week (Original: Once a week to once a month)&lt;br&gt;
&lt;strong&gt;Medium Performers&lt;/strong&gt;: Once per week to once per month (Original: Once a month to once every 6 months)&lt;br&gt;
&lt;strong&gt;Low Performers&lt;/strong&gt;: Once per month to once every 6 months (Original: Less than once every 6 months)&lt;/p&gt;

&lt;p&gt;While DORA has raised the bar on acceptable deployment frequency, it should be noted that numbers that are starkly different within and across teams could have a deeper meaning. Here are some common scenarios to watch for when investigating particularly high deploy counts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottlenecks in development process&lt;/strong&gt;: Inconsistencies in coding and deployment processes could lead some teams to have starkly different practices for breaking up their code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project complexity&lt;/strong&gt;: If projects are too complex, deploy frequency may be high but may not say much about the quality of code shipped in each push.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gamification&lt;/strong&gt;: This particular metric may be easier to “game” than others since it’s largely in the control of an individual developer who may push code at higher intervals than normal if they believe their impact is measured by this metric alone.&lt;/p&gt;

&lt;p&gt;That said, shipping many small changes usually isn’t a bad thing in and of itself. Shipping often can mean you are constantly perfecting your service, and if there is a problem with your code, it’s easier to find and remedy the issue. However, if your team is large, this may not be a feasible option. Instead, you may consider building release trains and shipping at regular intervals. This approach will allow you to deploy more often without overwhelming your team members.&lt;/p&gt;
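&lt;p&gt;As a sketch of how DF is typically summarized, the snippet below buckets a hypothetical deploy log by ISO week and maps weekly counts onto rough performance bands (the dates and thresholds are illustrative, not official DORA cutoffs):&lt;/p&gt;

```python
from collections import Counter
from datetime import date

# Hypothetical deploy log (dates only); values are illustrative.
deploys = [date(2024, 1, d) for d in (2, 2, 3, 5, 8, 9, 9, 10, 11, 12)]

# Deployment Frequency is usually summarized per ISO calendar week.
per_week = Counter(d.isocalendar()[1] for d in deploys)

def band(weekly_count):
    """Rough mapping to DORA-style bands; thresholds are illustrative."""
    if weekly_count >= 5:   # roughly daily or better
        return "elite/high"
    if weekly_count >= 1:   # at least weekly
        return "medium"
    return "low"

for week, count in sorted(per_week.items()):
    print(f"week {week}: {count} deploys -> {band(count)}")
```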

&lt;h2&gt;
  
  
  Failed Deployment Recovery Time (Formerly Mean Time to Recovery)
&lt;/h2&gt;

&lt;p&gt;DORA recently updated the Mean Time to Recovery (MTTR) metric to the more specific Failed Deployment Recovery Time (FDRT)—which is more explicitly focused on failed software deployments rather than incidents or breaches at large. FDRT is the amount of time it takes your team to restore service when there’s a service disruption as a result of a failed deployment, like an outage. This metric offers a look into the stability of your software, as well as the agility of your team in the face of a challenge. These are the benchmarks identified in the State of DevOps report:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elite Performers&lt;/strong&gt;: &amp;lt;1 hour (Original: Same)&lt;br&gt;
&lt;strong&gt;High Performers&lt;/strong&gt;: &amp;lt;1 day (Original: Same)&lt;br&gt;
&lt;strong&gt;Medium Performers&lt;/strong&gt;: 1 day to 1 week  (Original: Same)&lt;br&gt;
&lt;strong&gt;Low Performers&lt;/strong&gt;: Between 1 month and 6 months (Original: Over 6 months)&lt;/p&gt;

&lt;p&gt;To minimize the impact of degraded service on your value stream, there should be as little downtime as possible. If it’s taking your team more than a day to restore services, you should consider utilizing feature flags so you can quickly disable a change without causing too much disruption. If you ship in small batches, it should also be easier to discover and resolve problems.&lt;/p&gt;
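&lt;p&gt;The feature-flag idea can be sketched in a few lines. This is a deliberately minimal in-process flag (a dict and a lookup function, all names hypothetical); a real team would typically use a flag service or SDK rather than hand-rolling this, but the recovery mechanic is the same: flip the flag and traffic returns to the known-good path without a redeploy:&lt;/p&gt;

```python
# Minimal in-process feature-flag sketch; names and logic are illustrative.
flags = {"new-checkout-flow": True}

def is_enabled(name):
    return flags.get(name, False)

def new_checkout(total):
    # The risky change that shipped in the failed deployment.
    raise RuntimeError("bug shipped in the new path")

def legacy_checkout(total):
    # The known-good path we can fall back to.
    return f"charged {total}"

def checkout(order_total):
    if is_enabled("new-checkout-flow"):
        return new_checkout(order_total)
    return legacy_checkout(order_total)

# When the failed deployment is detected, flip the flag instead of rolling back:
flags["new-checkout-flow"] = False
print(checkout(42))  # traffic instantly returns to the legacy path
```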

&lt;p&gt;Although Mean Time to Discover (MTTD) is different from Mean Time to Recovery, the amount of time it takes your team to detect an issue will impact your MTTR—the faster your team can spot an issue, the more quickly service can be restored.&lt;/p&gt;

&lt;p&gt;Just like with Lead Time for Changes, you don’t want to implement hasty changes at the expense of a quality solution. Rather than deploy a quick fix, make sure that the change you’re shipping is durable and comprehensive. You should track MTTR over time to see how your team is improving, and aim for steady, stable growth in successful deployments.&lt;/p&gt;
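&lt;p&gt;Tracking recovery time over time only requires detection and restoration timestamps per incident. A small sketch, with made-up incident records, of computing per-incident recovery time and the mean across a window:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical incidents caused by failed deployments; data is illustrative.
incidents = [
    {"detected": "2024-02-01T10:00", "restored": "2024-02-01T10:40"},
    {"detected": "2024-02-09T14:00", "restored": "2024-02-09T16:30"},
]

FMT = "%Y-%m-%dT%H:%M"

def recovery_minutes(incident):
    """Minutes from detection to service restoration for one incident."""
    delta = (datetime.strptime(incident["restored"], FMT)
             - datetime.strptime(incident["detected"], FMT))
    return delta.total_seconds() / 60

times = [recovery_minutes(i) for i in incidents]
mean_minutes = sum(times) / len(times)
print(f"mean recovery time: {mean_minutes:.0f} minutes")
```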

&lt;h2&gt;
  
  
  Change Failure Rate
&lt;/h2&gt;

&lt;p&gt;Change Failure Rate (CFR) is the percentage of releases that result in downtime, degraded service, or rollbacks, which can tell you how effective your team is at implementing changes. This metric is also critical for business planning as repeated failure and fix cycles will delay launch of new product initiatives. Originally, there was not much distinction between performance benchmarks for this metric, with Elite performers pegged at 0-15% CFR and High, Medium, and Low performers all grouped into 16-30%. But the latest State of DevOps Report has made a few changes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elite Performers&lt;/strong&gt;: 5% (Original: 0-15%)&lt;br&gt;
&lt;strong&gt;High Performers&lt;/strong&gt;: 10% (Original: 16-30%)&lt;br&gt;
&lt;strong&gt;Medium Performers&lt;/strong&gt;: 15% (Original: 16-30%)&lt;br&gt;
&lt;strong&gt;Low Performers&lt;/strong&gt;: 64% (Original: 16-30%)&lt;/p&gt;

&lt;p&gt;Change Failure Rate is a particularly valuable metric because it can prevent your team from being misled by the total number of failures you encounter. Teams that aren’t implementing many changes will see fewer failures, but that doesn’t necessarily mean they’re more successful with the changes they do deploy. Those following CI/CD practices may see a higher number of failures, but if CFR is low, then these teams will have an edge because of the speed of their deployments and their overall rate of success.&lt;/p&gt;

&lt;p&gt;This rate can also have significant implications for your value stream: it can indicate how much time is spent remedying problems instead of developing new projects. Improve change failure rate by implementing testing, code reviews and continuous improvement workflows.&lt;/p&gt;

&lt;p&gt;Examples of things to monitor to maintain a low change failure rate include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of rollbacks in the last 30 days&lt;/li&gt;
&lt;li&gt;Ratio of incidents to deploys in the last 7 days&lt;/li&gt;
&lt;li&gt;Ratio of rollbacks to deploys in the last 30 days&lt;/li&gt;
&lt;/ul&gt;
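&lt;p&gt;The three signals above can all be derived from one event log. A sketch, using a hypothetical log of dated deploy/rollback/incident events (the data and field names are invented for illustration):&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical event log: (date, kind) where kind is deploy/rollback/incident.
today = date(2024, 3, 31)
events = [
    (date(2024, 3, 1), "deploy"), (date(2024, 3, 5), "deploy"),
    (date(2024, 3, 8), "rollback"), (date(2024, 3, 12), "deploy"),
    (date(2024, 3, 27), "deploy"), (date(2024, 3, 29), "incident"),
    (date(2024, 3, 30), "deploy"),
]

def count(kind, days):
    """Count events of the given kind inside a trailing window."""
    cutoff = today - timedelta(days=days)
    return sum(1 for d, k in events if k == kind and d >= cutoff)

deploys_30, rollbacks_30 = count("deploy", 30), count("rollback", 30)
deploys_7, incidents_7 = count("deploy", 7), count("incident", 7)

print(f"rollbacks (last 30 days): {rollbacks_30}")
print(f"incidents:deploys (last 7 days): {incidents_7}:{deploys_7}")
print(f"rollback rate (last 30 days): {rollbacks_30 / deploys_30:.0%}")
```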

&lt;h2&gt;
  
  
  Reliability
&lt;/h2&gt;

&lt;p&gt;The reliability metric—or more accurately the reliability “dimension”—is the only factor that does not have a standard quantifiable target for performance levels. This is because the dimension comprises several metrics used to assess operational performance, including availability, latency, performance, and scalability. Reliability can be measured with individual software SLAs, performance targets, and error budgets.&lt;/p&gt;
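&lt;p&gt;Error-budget arithmetic is simple enough to show inline. For an availability target, the budget is the fraction of the window you are allowed to be down; the numbers below are illustrative, not a recommended SLO:&lt;/p&gt;

```python
# Error-budget arithmetic for an availability SLO (numbers are illustrative).
slo = 0.999                      # 99.9% availability target
period_minutes = 30 * 24 * 60    # a 30-day window, in minutes

# The budget is the allowed downtime implied by the SLO.
budget_minutes = (1 - slo) * period_minutes
downtime_so_far = 18             # minutes of downtime recorded this window

remaining = budget_minutes - downtime_so_far
print(f"error budget: {budget_minutes:.1f} min; remaining: {remaining:.1f} min")
```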

&lt;p&gt;These metrics have a significant impact on customer retention and success—even if the “customers” are developers themselves. To improve reliability, organizations can set checks and targets for all of the software they create. Some examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attach appropriate documentation&lt;/li&gt;
&lt;li&gt;Attach relevant incident runbooks&lt;/li&gt;
&lt;li&gt;Ensure integration with existing incident management tools&lt;/li&gt;
&lt;li&gt;Ensure software, including search tools, is up to date&lt;/li&gt;
&lt;li&gt;Perform standard health checks&lt;/li&gt;
&lt;li&gt;Add unit tests in CI&lt;/li&gt;
&lt;li&gt;Ensure database failover handling code patterns are implemented&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Are DORA metrics still the best way to build high-performing teams?
&lt;/h2&gt;

&lt;p&gt;Because DORA metrics provide a high-level view of your team’s performance, they can be particularly useful for organizations trying to modernize—DORA metrics can help you identify exactly where and how to improve. Over time, you can see how your teams have grown, and which areas have been more stubborn.&lt;/p&gt;

&lt;p&gt;Those who fall into the elite categories can leverage DORA metrics to continue improving services and to gain an edge over competitors. As the State of DevOps report reveals, the group of elite performers is rapidly growing (from 7% in 2018 to 26% in 2021), so DORA metrics can provide valuable insights for this group.&lt;/p&gt;

&lt;p&gt;So why have these metrics come under fire recently? Criticism of this framework is rooted more in how the metrics are applied than in how they’re defined. Teams may over-rotate on the numbers themselves rather than the context surrounding them. This has led to gamification, and a separation of output from real business outcomes. Though, it’s not difficult to see how we got here when we consider that previous means for tracking DORA metrics included static exports, cobbled-together spreadsheets, or stand-alone tools that failed to consider individual and team dynamics.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does DORA relate to the new SPACE Framework?
&lt;/h2&gt;

&lt;p&gt;The same group at DORA behind the original 5 metrics recently released a new framework that refocuses measurement more towards the human process behind each technical metric outlined in DORA. The SPACE Framework is a compilation of factors that comprises a more holistic view of developer productivity.&lt;/p&gt;

&lt;p&gt;The full list includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Satisfaction&lt;/strong&gt;: How do developers feel about the work they’re doing? How fulfilling is it?&lt;br&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Are we meeting deadlines? Addressing security issues quickly enough?&lt;br&gt;
&lt;strong&gt;Activity&lt;/strong&gt;: How is productivity? PRs, lines of code, etc.&lt;br&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Is the team working together and taking advantage of their strengths?&lt;br&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Are developers able to stay within their creative flow state?&lt;/p&gt;

&lt;p&gt;While DORA primarily focused on output, SPACE focuses on the process to get to the output (optimizing workflows). This duality is why many teams don’t find the two to be mutually exclusive—and instead consider SPACE to be an extension of DORA. Dr. Nicole Forsgren herself has reportedly noted, “DORA is an implementation of SPACE.”&lt;/p&gt;

&lt;p&gt;This framing is bolstered by SPACE’s open guidelines—which are intentionally non-prescriptive when it comes to the data needed to assess each pillar. This makes the SPACE framework highly portable and universally applicable, regardless of your organization’s maturity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are we done with DORA?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While DORA has been met with increased criticism in recent years, the addition of SPACE has greatly balanced the equation. Rather than feel the need to choose between the two models, or throw away DORA entirely, engineering and DevOps teams should consider using them in parallel to give equal weight to developer productivity and happiness. Using both together enables organizations to ask bi-directional questions like: Is that recent performance issue impacting satisfaction and efficiency? Or has a recent increase in satisfaction led to a temporary dip in performance?&lt;/p&gt;

&lt;p&gt;Context is king. DORA metrics can still be used to improve overall team performance, but they must be considered lagging indicators of success in relation to context about team talent, tenure, and complexity of ongoing projects. Engineering leaders should first consider the health of software produced in the context of this information, and then use DORA metrics to trace correlations between metrics associated with velocity and reliability. For example, software produced by a new team that fails basic checks of maturity and security is far more likely to see comparatively poor MTTR and change failure rate metrics. But that doesn’t mean the team lacks drive or capability.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>metrics</category>
      <category>devex</category>
    </item>
  </channel>
</rss>
