Reframing Quality Beyond Code: "May the bridges I burn light my way"

#mvt #programming #ai

Quality Engineering and software leaders are traditionally tasked with answering a deceptively simple question: Is this software good enough to ship? For years, the industry answered this through regression testing, automation suites, and staging environments designed to prevent bugs. However, recent R&D data supports an important paradigm shift, one that challenges us to expand our definition of "quality" from merely ensuring the product is built right to ensuring we are building the right thing.

When analysing engineering throughput, an organisation's experimentation data can look alarming to some on the leadership team. We have all seen those confused faces opening THAT email. If a significant percentage of new features or ideas don't end up fully launched to production, it can look like an engineering bottleneck or a waste of time. But from a modern quality perspective, that gap represents the ultimate safety net.

The Three-Tiered Quality Ledger

Velocity is a painful word to some engineers, but as leaders we need that data to really understand the value of engineering work. We need to move away from binary "pass/fail" or "deployed/abandoned" metrics. Instead, we should anchor our measurement by classifying each completed experiment into one of three outcomes: shipped, rolled back, or informed a decision. If we only measure the "Shipped" bucket, we are severely miscalculating the ROI of our engineering efforts:

Shipped: The clear wins that drive measurable value and user satisfaction.

Rolled Back: The features that hit "guardrail metrics" or triggered negative user feedback, allowing us to safely retreat.

Informed a Decision: The ambiguous, flat, or null results that nevertheless provided critical data to redirect the team's next strategic move.
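As a rough illustration of the ledger, here is a minimal Python sketch that tags each completed experiment with one of the three outcomes and reports the share in each bucket. The `Experiment` class, the `outcome_shares` helper, and the experiment names are all made up for the example; they are not taken from any particular experimentation platform.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SHIPPED = "shipped"
    ROLLED_BACK = "rolled back"
    INFORMED_DECISION = "informed a decision"


@dataclass
class Experiment:
    name: str  # hypothetical experiment name
    outcome: Outcome


def outcome_shares(experiments):
    """Return the fraction of completed experiments in each bucket."""
    counts = Counter(e.outcome for e in experiments)
    total = len(experiments)
    if total == 0:
        return {o: 0.0 for o in Outcome}
    # Counter returns 0 for buckets that never occurred.
    return {o: counts[o] / total for o in Outcome}


# Illustrative ledger with invented entries.
ledger = [
    Experiment("checkout_v2", Outcome.ROLLED_BACK),
    Experiment("loyalty_points_tier", Outcome.SHIPPED),
    Experiment("new_onboarding_copy", Outcome.INFORMED_DECISION),
    Experiment("search_ranking_tweak", Outcome.INFORMED_DECISION),
]
shares = outcome_shares(ledger)
for outcome, share in shares.items():
    print(f"{outcome.value}: {share:.0%}")
```

Reporting all three shares side by side, rather than only the shipped one, is the whole point: in this invented ledger, three quarters of the value would be invisible to a "shipped only" metric.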

Shifting from "Bug Hunting" to "Harm Prevention"

The most hopeful part of this framework is the strategic value of the rollback.

In traditional QA/Testing models, letting a feature reach production only to roll it back because it caused user friction or metric regression would be considered a post-mortem-worthy failure. But in a mature experimentation culture, this is intentional, controlled harm prevention.

When an A/B test or progressive roll-out reveals that a feature degrades user engagement or causes a silent regression on a core KPI, the delivery platform has done its job. It has protected the broader user base from an unmitigated product degradation. Organisations without this structured experimentation capability still ship changes at the same rate—but their regressions accumulate silently, resulting in a gradual, unexplained erosion of overall product quality. I see this in old large enterprise companies all the time.

As Quality leaders, we can advocate for these three focus areas:

Establish guardrail metrics: Quality cannot just be about system uptime, crash rates, and API load times; it must include user behaviour. If a feature deployment drops a core business metric, that is a quality defect.
Celebrate the rollback rate: A healthy rollback rate means your experimentation safety net is successfully catching product and usability defects before they impact your entire audience.
Measure organisational learning velocity: Work with product management to track how many null or negative tests actually changed a team’s roadmap. If we ran a test, it failed, and we chose not to spend the next two quarters building a dead-end feature, that is a massive engineering efficiency win.
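To make the guardrail idea concrete, here is a small Python sketch that compares a treatment group against control and lists the guardrail metrics that dropped further than an agreed tolerance. The metric names, tolerances, and numbers are invented for illustration, and a real check would also need to account for statistical significance, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str               # hypothetical metric name
    max_relative_drop: float  # e.g. 0.02 means a 2% drop is tolerated


def breached_guardrails(control, treatment, guardrails):
    """Return the guardrail metrics where treatment fell too far below control."""
    breached = []
    for g in guardrails:
        baseline = control[g.metric]
        if baseline == 0:
            continue  # no meaningful relative change to compute
        relative_change = (treatment[g.metric] - baseline) / baseline
        if relative_change < -g.max_relative_drop:
            breached.append(g.metric)
    return breached


# Invented numbers purely for illustration.
guardrails = [
    Guardrail("checkout_conversion", max_relative_drop=0.02),
    Guardrail("sessions_per_user", max_relative_drop=0.05),
]
control = {"checkout_conversion": 0.040, "sessions_per_user": 3.1}
treatment = {"checkout_conversion": 0.036, "sessions_per_user": 3.0}

breached = breached_guardrails(control, treatment, guardrails)
print(breached)  # ['checkout_conversion']
```

Here the conversion rate fell by roughly 10% against a 2% tolerance, so it is reported as a quality defect, while the smaller dip in sessions per user stays inside its guardrail.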

Real-Time Action with AI

When you’re looking at this delicious data, AI could be the key to shifting it from a retrospective reporting tool into a real-time safety net. In the "Rolled Back" bucket, it takes the pressure off the team by automating anomaly detection and instant root cause analysis. We shouldn’t have engineers manually watching dashboards for metric regressions during a roll-out EVER. Instead, use AI to establish a deep baseline of normal user behaviour, so the moment a progressive deployment triggers a micro-deviation, it can act as an automated kill-switch—safely reverting the change and immediately tracing the blast radius right back to the exact feature flag responsible.
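The "baseline plus kill-switch" loop can be sketched without any AI at all. The Python below uses a plain z-score against a pre-roll-out baseline as a deliberately simple stand-in for whatever model you would actually use; the threshold, the error-rate numbers, and the `disable_flag` callback are assumptions made for the example, not a real feature flag API.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, z_threshold=3.0):
    """Flag `observed` if it is more than `z_threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


def guard_rollout(flag_id, baseline, observed, disable_flag, z_threshold=3.0):
    """Call `disable_flag(flag_id)` if the metric deviates from baseline; return True if it fired."""
    if is_anomalous(baseline, observed, z_threshold):
        disable_flag(flag_id)  # hypothetical hook into your flag platform
        return True
    return False


# Error rate (%) per minute before the roll-out started (made-up numbers).
baseline_error_rate = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
disabled = []  # stands in for the real flag service
fired = guard_rollout("enable_checkout_v2", baseline_error_rate, 4.5, disabled.append)
print(fired, disabled)
```

A learned model of normal behaviour would replace `is_anomalous`; the surrounding shape (baseline in, observation in, flag off when they disagree) stays the same.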

Where it really could change the game, though, is the "Informed a Decision" bucket. Flat, null, or ambiguous results are notoriously hard to parse, and teams can spend weeks debating what a neutral test actually means. AI could ease this bottleneck by synthesising massive, disparate data streams, cross-referencing test results with qualitative data like user session replays, support tickets, and sentiment. By automatically surfacing hidden user patterns or possible causal factors that a person would easily miss, AI could translate inconclusive data into a clear, actionable direction for the next sprint. It stops us from guessing and helps us actually learn.

Operational Example:
Imagine an LLM core integrated directly into your pipeline to automate that "Rolled Back" safety net.

[Context Ingested by your AI]

Raw application error logs (last 5 minutes)
Live feature flag states: {"enable_checkout_v2": "treatment_alpha", "loyalty_points_tier": "control"}
Recent Git deployment diffs for the microservice

System Prompt:

Analyse the provided logs and deployment context. If a metric regression is correlated with an active feature flag, pinpoint the root cause line, determine the blast radius, and output a valid JSON command to disable the faulty flag.

The Response: Because an LLM can ingest thousands of lines of log data and code in a single prompt, it can bypass hours of manual triage and return a clean, actionable payload:

{
  "incident_status": "CRITICAL_REGRESSION",
  "correlated_flag": "enable_checkout_v2",
  "blast_radius": "Mobile Web / UK Users",
  "root_cause": "TypeError in payment_gateway.py at line 142 caused by unhandled currency symbol formatting in the new UI layout.",
  "action": "TRIGGER_ROLLBACK",
  "api_payload": {
    "flag_id": "enable_checkout_v2",
    "state": "OFF"
  }
}

The pipeline parses that JSON, instantly hits the feature flag API to kill the variant, and posts the exact root cause directly into the team's Slack channel. Huzzah!! Production is saved in seconds without a single engineer having to open a dashboard. _If I could be bothered I would make a smug meme for this bit._
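A minimal Python sketch of that last step might look like the following. It assumes the pipeline hands you the model's raw text plus two callables, `set_flag_state` and `notify`, which are hypothetical stand-ins for your feature flag API and Slack integration. The one design choice worth calling out is that the model's output is validated against the set of flags that are actually live before anything gets switched off.

```python
import json


def handle_model_response(raw, active_flags, set_flag_state, notify):
    """Validate the model's JSON and, if it is safe to act on, disable the flag.

    Returns True if a rollback was triggered.
    """
    try:
        response = json.loads(raw)
    except json.JSONDecodeError:
        notify("Model response was not valid JSON; escalating to a human.")
        return False
    if not isinstance(response, dict):
        notify("Model response had an unexpected shape; escalating to a human.")
        return False

    if response.get("action") != "TRIGGER_ROLLBACK":
        return False

    payload = response.get("api_payload")
    if not isinstance(payload, dict):
        payload = {}
    flag_id = payload.get("flag_id")

    # Never act on a flag the model may have invented, or on an unexpected state.
    if flag_id not in active_flags or payload.get("state") != "OFF":
        notify(f"Model proposed an invalid rollback for {flag_id!r}; escalating to a human.")
        return False

    set_flag_state(flag_id, "OFF")
    notify(f"Rolled back {flag_id}: {response.get('root_cause', 'no root cause given')}")
    return True


# Stand-ins for the real flag API and Slack client (assumptions for this example).
calls, messages = [], []
raw = json.dumps({
    "action": "TRIGGER_ROLLBACK",
    "root_cause": "example root cause",
    "api_payload": {"flag_id": "enable_checkout_v2", "state": "OFF"},
})
triggered = handle_model_response(
    raw,
    {"enable_checkout_v2", "loyalty_points_tier"},
    lambda flag, state: calls.append((flag, state)),
    messages.append,
)
print(triggered, calls)
```

If the JSON is malformed or names a flag that is not live, nothing is switched off and the message goes to a human instead, which keeps the "automated kill-switch" from becoming its own source of incidents.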

At the end of the day, true quality assurance isn't about setting up speed bumps; it’s about protecting the user experience while keeping the engineering engine running as fast as possible, and it’s a lot more fun to build this way. By ditching the rigid, bottleneck-heavy QA models of the past and just grouping our outcomes into "shipped, rolled back, or informed a decision," we move away from defensive, block-by-default testing. Instead, we get to champion a progressive delivery model where experimentation actually becomes our ultimate safety net.

For the data geeks:
LaunchDarkly (2025) The State of Feature Management and Progressive Delivery.

Microsoft Research (2023) The Science of Experimentation at Scale: Lessons from Experimentation Platforms.

Split Software (2024) The State of Impact-Driven Development and Quality Engineering.

Capgemini, Sogeti and OpenText (2025) World Quality Report 2025-26.

Continuous Delivery Foundation (2024) State of Continuous Delivery Report.

2024 State of DevOps Report: User-Centric Design and Technical Stability.