Most teams are adding AI to broken pipelines. Here's the sequencing that actually works.
There's a pattern I've watched repeat itself across engineering organizations over the last two years: a team gets pressure to "add AI" to their release process, someone bolts a generative testing tool onto an existing Jenkins pipeline, and within a quarter they're dealing with more false positives, more noise in their dashboards, and more time triaging failures than before.
They got faster. They didn't get better.
This isn't a failure of AI. It's a failure of sequencing.
I know because we made a version of that mistake ourselves.
I lead both testing and release management at NoMachine — which means I sit at the intersection where a lot of this pain lives. Our pipelines run on Jenkins, Ansible, Selenium, Python, and Bash. When we started integrating smarter automation into that stack, we assumed the hard work was behind us. We had coverage. We had tooling. We had process documentation.
What we didn't have, it turned out, was alignment.
The Gap Nobody Talks About
Testing teams and release teams often operate on different clocks, different definitions of "ready," and different tolerances for risk. When they're separate functions, that gap stays manageable because it's visible. People negotiate it. They argue in stand-ups.
When you integrate AI-assisted tooling across both functions, that gap doesn't disappear. It gets automated. And a hidden disagreement baked into your pipeline logic is significantly harder to find than one that surfaces in a conversation.
For us, the issue showed up during a push to introduce AI-assisted failure analysis into our release gates. The testing side was generating clean structured data — tagged failures, environment metadata, run history. The release side was working off a different data model entirely — looser, more manually maintained, accumulated over years of one-off fixes. When the AI tooling tried to correlate failure patterns against release history, the signal was incoherent. Not wrong, exactly. Just noisy in a way that made the output untrustworthy.
We spent two weeks trying to tune the model before we realized the model wasn't the problem. The seam between our two data structures was.
The Illusion of Progress
When organizations talk about AI in their delivery pipeline, they usually mean one of three things: AI-assisted test generation, predictive failure analysis, or automated code review. All three are genuinely useful — but only when the pipeline underneath them is stable enough to benefit.
The DORA 2024 State of DevOps Report makes this uncomfortably concrete. Across thousands of respondents, every 25% increase in AI adoption correlated with a 1.5% decrease in delivery throughput and a 7.2% reduction in delivery stability. That isn't because AI is bad. It's because teams are adopting AI before their pipelines are ready to absorb it.
What DORA found consistently is that the fundamentals — fast feedback loops, reliable deployments, and a culture of trust — remain the primary drivers of elite performance. AI doesn't replace those. It depends on them.
That's the part most vendor conversations leave out.
If your test suite is fragile, AI-generated tests will be fragile at scale. If your deployment data is noisy, predictive failure models will produce confident-sounding noise. AI amplifies whatever is already happening in your system — which means if your system is a mess, you now have a faster mess.
What Pipeline Debt Actually Looks Like
Before any team introduces AI tooling, they need to do an honest audit. In my experience, most enterprise pipelines carry at least three of the following liabilities — often without fully acknowledging them.
Flaky tests with no remediation policy. Tests that sometimes pass and sometimes fail get treated as a known annoyance rather than a critical signal. Research shows flaky tests can account for up to 16% of CI failures, causing reruns that slow merge cycles and erode trust in the entire pipeline. When you hand an AI-assisted system inconsistent input, you get inconsistent output — now with the credibility of automation attached.
No real ownership of pipeline failures. When a build breaks at midnight, who owns it? If the answer is "whoever's on call figures it out," you don't have a tooling problem. You have a culture problem your CI/CD setup is quietly masking.
Release gates that exist on paper only. I've worked with organizations that have beautifully documented quality gates and override them 40% of the time under deadline pressure. Adding AI-assisted analysis to a pipeline humans routinely bypass isn't augmentation — it's theater.
Disconnected data models across testing and release. This one is easy to miss precisely because each side often looks fine on its own. Test metadata is clean. Release records are maintained. But if the schemas, environments, and ownership models don't align, any tool that tries to reason across both will hit a wall. We hit that wall. It's a real one.
The Fix Isn't a Tool Purchase
Here's the framework I'd apply before any AI tooling conversation happens.
Step 1 — Stabilize Your Signal
Run a flakiness audit across your full test suite. Any test with a failure rate above 5% in the last 90 days without a confirmed root cause gets quarantined — not deleted, quarantined — until it's properly fixed or replaced. This single step can cut pipeline noise by 20–30% before you've changed anything else.
Here's a minimal Python script to get your baseline:
```python
import json
from collections import defaultdict

# Load your CI run history (exported from Jenkins, GitHub Actions, etc.)
with open("test_run_history.json") as f:
    runs = json.load(f)

test_stats = defaultdict(lambda: {"pass": 0, "fail": 0})

for run in runs:
    for test in run["tests"]:
        name = test["name"]
        if test["status"] == "passed":
            test_stats[name]["pass"] += 1
        else:
            test_stats[name]["fail"] += 1

# Collect quarantine candidates, then rank by failure rate, worst first
flaky = []
for test, stats in test_stats.items():
    total = stats["pass"] + stats["fail"]
    if total == 0:
        continue
    failure_rate = stats["fail"] / total
    if failure_rate > 0.05:
        flaky.append((failure_rate, test, stats["fail"], total))

print("⚠️ Tests with failure rate > 5% (quarantine candidates):\n")
for failure_rate, test, fails, total in sorted(flaky, reverse=True):
    print(f"  {test}: {failure_rate:.1%} failure rate ({fails}/{total} runs)")
```
This gives you a ranked list of your highest-risk tests in under 60 seconds. Run it before any other conversation about AI tooling.
Step 2 — Align Your Data Models
Your test infrastructure and release infrastructure need to speak the same language — same environment identifiers, same failure taxonomy, same timestamps. If you're planning to introduce any AI that reasons across both, this alignment has to come first. No amount of model tuning compensates for mismatched schemas.
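What alignment looks like in practice is a canonical record format that both sides emit. Here's a minimal sketch of that idea — the field names, the `ENV_ALIASES` map, and the `PipelineEvent` schema are illustrative assumptions, not our actual production schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical alias map: the same environment often accumulates several
# names across test and release records. Pick one canonical form.
ENV_ALIASES = {"prod": "production", "prd": "production", "stg": "staging", "qa": "staging"}

@dataclass
class PipelineEvent:
    environment: str       # canonical environment identifier
    failure_category: str  # shared taxonomy, e.g. "infra", "regression", "flaky"
    timestamp: str         # always UTC, always ISO 8601

def normalize_test_record(raw: dict) -> PipelineEvent:
    """Convert a raw test-side record into the shared schema."""
    env = raw["env"].strip().lower()
    return PipelineEvent(
        environment=ENV_ALIASES.get(env, env),
        failure_category=raw.get("failure_tag", "uncategorized"),
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc).isoformat(),
    )
```

A matching `normalize_release_record` on the release side means any downstream tool — AI-assisted or not — joins on one vocabulary instead of guessing at two.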
Step 3 — Instrument Everything
You need structured, queryable data out of your pipeline: build duration by stage, test failure rates by category, deployment frequency, mean time to recovery. Jenkins has solid plugin support for this. GitHub Actions and GitLab CI have it built in. The question you need to answer before you touch anything with AI: where does our pipeline spend the most time failing? If you can't answer that clearly today, you're not ready to optimize it tomorrow.
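Answering "where does our pipeline spend the most time failing?" can be a few lines once the data is exported. A sketch, assuming a hypothetical export format of one JSON record per stage run:

```python
from collections import defaultdict

# Hypothetical export format, one record per pipeline stage run:
# {"stage": "integration-tests", "duration_s": 312, "result": "FAILURE"}

def failure_time_by_stage(records):
    """Total seconds spent in failing stage runs, worst stage first."""
    totals = defaultdict(float)
    for r in records:
        if r["result"] != "SUCCESS":
            totals[r["stage"]] += r["duration_s"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

runs = [
    {"stage": "unit-tests", "duration_s": 40, "result": "SUCCESS"},
    {"stage": "integration-tests", "duration_s": 312, "result": "FAILURE"},
    {"stage": "integration-tests", "duration_s": 290, "result": "FAILURE"},
    {"stage": "deploy", "duration_s": 55, "result": "UNSTABLE"},
]
print(failure_time_by_stage(runs))
# → [('integration-tests', 602.0), ('deploy', 55.0)]
```

The top of that list — not a vendor demo — is where your first optimization effort belongs.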
Step 4 — Define and Enforce Your Gates
Every release gate needs an owner, a clear pass/fail criterion, and a formal exception process for overrides. If overrides are happening more than 10% of the time, the gate criteria need to be revisited — not routinely ignored. AI-assisted release decisions are only as trustworthy as the decision framework beneath them.
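That 10% threshold is only enforceable if you log every gate decision, including the overrides. A minimal sketch, assuming a hypothetical audit log where each entry records one gate outcome:

```python
from collections import Counter

# Hypothetical gate audit log: one entry per gate decision.
decisions = [
    {"gate": "perf-regression", "outcome": "pass"},
    {"gate": "perf-regression", "outcome": "override"},
    {"gate": "perf-regression", "outcome": "pass"},
    {"gate": "security-scan", "outcome": "pass"},
]

def gates_needing_review(decisions, threshold=0.10):
    """Return gates whose override rate exceeds the threshold."""
    total = Counter(d["gate"] for d in decisions)
    overridden = Counter(d["gate"] for d in decisions if d["outcome"] == "override")
    return {
        gate: overridden[gate] / total[gate]
        for gate in total
        if overridden[gate] / total[gate] > threshold
    }
```

A gate that shows up here repeatedly is telling you its criteria are wrong, its owner is absent, or both — fix that before wiring any model into the decision.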
Step 5 — Introduce AI at the Edges First
The safest entry point is test generation for new features — not optimizing your existing suite, and definitely not automating deployment decisions. Tools like GitHub Copilot for unit tests, Applitools for visual regression, or CodiumAI for behavioral test suggestions give you real lift without putting your release-critical logic at risk. Build institutional trust there before going deeper.
Step 6 — Baseline Before You Change Anything
Capture your current MTTR, change failure rate, and test execution time before any tooling is introduced. Then you have an actual story — with numbers — to tell internally, and a real way to know if things got better or just looked like they did.
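Two of those baseline numbers fall straight out of a deployment log. A sketch, assuming a hypothetical record format with deploy and restore timestamps:

```python
from datetime import datetime

# Hypothetical deployment log: one entry per production deploy.
deploys = [
    {"deployed_at": "2024-05-01T10:00:00", "failed": False},
    {"deployed_at": "2024-05-02T10:00:00", "failed": True,
     "restored_at": "2024-05-02T11:30:00"},
    {"deployed_at": "2024-05-03T10:00:00", "failed": True,
     "restored_at": "2024-05-03T10:20:00"},
    {"deployed_at": "2024-05-04T10:00:00", "failed": False},
]

def baseline_metrics(deploys):
    """Change failure rate and mean time to recovery (minutes)."""
    failures = [d for d in deploys if d["failed"]]
    cfr = len(failures) / len(deploys)
    recovery_minutes = [
        (datetime.fromisoformat(d["restored_at"])
         - datetime.fromisoformat(d["deployed_at"])).total_seconds() / 60
        for d in failures
    ]
    mttr = sum(recovery_minutes) / len(recovery_minutes) if recovery_minutes else 0.0
    return {"change_failure_rate": cfr, "mttr_minutes": mttr}

print(baseline_metrics(deploys))
# → {'change_failure_rate': 0.5, 'mttr_minutes': 55.0}
```

Run it once before the tooling change and once a quarter after — the delta is your story.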
What Good Actually Looks Like
After we fixed the data alignment issue — standardizing environment identifiers across our test and release records and rebuilding the bridge between them — the AI-assisted failure analysis we'd been trying to roll out actually started working. Not dramatically, not overnight, but in a way that was genuinely useful: patterns in Selenium failure logs that used to require manual investigation started surfacing in minutes. Release decisions that previously required pulling three people into a call became something a single engineer could make with confidence.
The lesson wasn't that the tool was better than we thought. It was that the tool was exactly as good as the data we gave it. Once the seam was clean, the signal was trustworthy. And once the signal was trustworthy, the team stopped second-guessing it.
That trust is what most teams are actually missing. No AI tool installs it for you.
Key Takeaways
- AI in CI/CD is a multiplier, not a foundation — a broken pipeline plus AI equals a faster, harder-to-debug broken pipeline.
- The DORA 2024 report confirms that AI adoption currently correlates with worse delivery stability for most teams — not because of the tools, but because of unresolved pipeline debt.
- The highest-value action before any AI purchase is a flakiness audit — quarantine tests with >5% failure rate in the last 90 days.
- Data model alignment between testing and release is the hidden prerequisite most teams skip, and the most common reason AI-assisted failure analysis produces noise instead of signal.
- Start AI integration at the edges (new feature test generation, visual regression) before touching deployment logic.
- Baseline your DORA metrics now, before any tooling change — you can't demonstrate improvement without a before.
Further Reading
- DORA 2024 Accelerate State of DevOps Report — https://dora.dev/research/2024/dora-report/
- DORA's Software Delivery Performance Metrics — https://dora.dev/guides/dora-metrics/
- Flaky Tests in CI/CD Pipelines: Impact and Mitigation — https://edgedelta.com/company/knowledge-center/flaky-tests-ci-cd-pipelines
Alen Prasevic is Test and Release Manager at NoMachine, where he leads an international testing team and serves as primary technical contact for enterprise clients. He holds a Bachelor's in Applied Information Technologies from the University of Luxembourg and a Master's in Computer Science (AI, Data Science, Cloud, and Security) from UTBM, France.