Borys Generalov

Originally published at blog.bgener.nl

GitHub Agentic Workflows: Building Self-Healing CI for .NET

Agentic Platform Engineering: Self-Healing CI/CD Pipelines

Demo Repository: Check the complete project on GitHub to see the full setup.

My CI failures are usually not dramatic. But they are still annoying.

A test breaks with a NullReferenceException. A Helm chart release fails. I open the logs, trace the problem, fix a tiny mistake, push, and wait for CI again. That is a lot of delay for bugs that are often small.

So I built a workflow for that exact loop. When CI fails, a GitHub Agentic Workflow reads the logs and uploaded artifacts, traces the root cause, and asks for a draft pull request with the fix. I still review it. I still merge it. The agent does the investigation work that normally takes the first 15 minutes.

In this article, I will show you how I built that setup in a standard .NET project, how I fed the agent the evidence it needed, and what happened when I tested it with two deliberate bugs.


What Are GitHub Agentic Workflows?

Getting started: Install the CLI and set up your first workflow by following the official quick start guide.

GitHub Agentic Workflows let you define automation in Markdown with YAML frontmatter. That sounds really simple. The YAML part tells GitHub when the workflow runs, which permissions it gets, and which safe write actions it may request. The Markdown body tells the agent what job to do.

Here is a tiny example:

---
on:
  issues:
    types: [opened]
permissions: read-all
safe-outputs:
  add-comment:
---

# Issue Clarifier

Analyze the current issue and ask for additional details if the issue is unclear.

You compile that file with:

gh aw compile

This creates a .lock.yml file that GitHub Actions can execute. The Markdown file is the source you maintain. The compiled workflow is what runs in CI.

Here, I am not using agentic workflows for summaries or changelogs. Those are straightforward to automate. I care about the practical cases where the fix is easy, yet someone still has to dig through the logs every time the CI build fails.


Setting Up the Project

I used a plain .NET 10 Web API and an xUnit test project. No custom starter kit. Just the templates you already know.

Scaffold the Project

dotnet new sln -n DemoAiPipelines
dotnet new webapi -n OrdersApi -o src/OrdersApi
dotnet new xunit -n OrdersApi.Tests -o tests/OrdersApi.Tests

dotnet sln add src/OrdersApi/OrdersApi.csproj
dotnet sln add tests/OrdersApi.Tests/OrdersApi.Tests.csproj

cd tests/OrdersApi.Tests
dotnet add reference ../../src/OrdersApi/OrdersApi.csproj

Then I added two deliberate bugs.

Bug One: Guest Checkout Crash

This service throws when Customer is null:

public class OrderService
{
    public decimal CalculateDiscount(Order order)
    {
        // BUG: throws NullReferenceException when Customer is null
        var rate = order.Customer.LoyaltyTier switch
        {
            "gold" => 0.15m,
            "silver" => 0.10m,
            _ => 0.05m
        };
        return order.Total * rate;
    }
}

And the test that exposes it:

[Fact]
public void CalculateDiscount_GuestCheckout_ReturnsZero()
{
    var order = new Order(200m, 1, Customer: null);
    var result = _sut.CalculateDiscount(order); // crash here
    Assert.Equal(0m, result);
}

Bug Two: Wrong Port in Helm

The app listens on 8080, but the chart still points at 80:

containers:
  - name: orders-api
    ports:
      - containerPort: 80
    readinessProbe:
      httpGet:
        port: 80

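That is enough to keep the pod from ever becoming Ready: the readiness probe keeps hitting a port nobody listens on, so the release never turns healthy. A liveness probe on the same wrong port would also make Kubernetes restart the container in a loop.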
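The mismatch is easy to confirm by hand. Here is a minimal sketch that lists every port in a rendered manifest that differs from the port the app is expected to listen on. The function name `find_port_mismatches` and the file name `rendered.yaml` are hypothetical, and the pattern only covers the simple `containerPort:` and `port:` lines shown above.

```shell
#!/usr/bin/env bash
# Sketch only: find_port_mismatches is a hypothetical helper, not part of Helm.
# It prints every declared port that differs from the expected app port.
# Usage: find_port_mismatches <manifest-file> <expected-port>
find_port_mismatches() {
  local manifest="$1" expected="$2"
  # Pull out lines such as "containerPort: 80" or "port: 80",
  # drop the spaces, then compare the number with the expected port.
  grep -Eo '(containerPort|port): *[0-9]+' "$manifest" \
    | tr -d ' ' \
    | awk -F: -v want="$expected" '$2 != want { print $1 "=" $2 }'
}
```

One way to use it, assuming the chart lives in `./deploy/helm` as in the CI workflow: run `helm template orders-api ./deploy/helm > rendered.yaml`, then `find_port_mismatches rendered.yaml 8080`. Any output means the chart and the app disagree about the port.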

Capture the Evidence

If you want an agent to investigate failures, you need to upload the same logs you would need as a human. For test failures, that means the test output. For deploy failures, that means enough cluster state to explain why the pod never became healthy.

Here is the CI workflow:

# .github/workflows/ci.yml
name: CI # the self-heal workflow listens for this name
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test
        id: test
        # Explicit bash runs with pipefail, so tee cannot mask a failing dotnet test
        shell: bash
        run: dotnet test --logger "trx;LogFileName=results.trx" 2>&1 | tee test-output.txt
        continue-on-error: true
      - name: Upload test evidence
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: "**/test-output.txt"
      # continue-on-error keeps the job green, so fail it once the evidence is uploaded
      - name: Fail the job when tests failed
        if: steps.test.outcome == 'failure'
        run: exit 1

  deploy-to-kind:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create KinD cluster
        uses: helm/kind-action@v1.12.0
      - name: Deploy with Helm
        id: helm
        shell: bash # pipefail again, so a failed helm upgrade is not hidden by tee
        run: |
          helm upgrade --install orders-api ./deploy/helm \
            --wait --timeout 2m 2>&1 | tee helm-output.txt
        continue-on-error: true
      - name: Capture Kubernetes state
        if: steps.helm.outcome == 'failure'
        run: |
          kubectl get pods -o wide > k8s-debug.txt
          kubectl describe pods >> k8s-debug.txt
          kubectl logs -l app=orders-api --tail=50 >> k8s-debug.txt
      - name: Upload deploy evidence
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: deploy-results
          path: |
            helm-output.txt
            k8s-debug.txt
      - name: Fail the job when the deploy failed
        if: steps.helm.outcome == 'failure'
        run: exit 1

Writing the Self-Healing Workflow

Now we can define the workflow that reacts to a failed CI run.

Create .github/workflows/self-heal.md:

---
engine:
  id: copilot
  version: latest # defaults to latest
  model: gpt-5
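on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
# completed also fires for green runs, so only continue when CI actually failed
if: ${{ github.event.workflow_run.conclusion == 'failure' }}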
permissions:
  contents: read
  actions: read
safe-outputs:
  create-pull-request:
    title-prefix: "fix: "
    labels: [ai-fix, self-healing]
    draft: true
    expires: 7
---

# Self-Healing: Fix Failed CI

You are a .NET and DevOps engineer. A CI run has just failed.

## Your Mission

Analyze the failure, find the root cause, and submit a fix as a
pull request.

## Step-by-Step Instructions

1. Check which job failed: `build-and-test` or `deploy-to-kind`.
2. Download the relevant artifact (`test-results` or `deploy-results`).
3. Read the logs to identify the root cause.
4. For test failures: find the exception and fix the source code.
5. For deploy failures: read `k8s-debug.txt` to trace the issue.
   - Cross-reference with the Dockerfile and Helm chart.
6. Open a PR explaining what went wrong and why the fix is correct.

You do not need to overdo the prompt. You do need to tell the agent where the failure lives, which artifact to read, and what kind of output you want back.


Watching It Fix Real Failures

The normal CI workflow ran first. That workflow built the code, ran tests, tried the Helm deployment, and uploaded evidence whether it passed or failed. After that finished, GitHub triggered the compiled agentic workflow through workflow_run.

So the order looked like this:

  1. the regular CI workflow ran
  2. one job failed: test or deploy
  3. CI uploaded the relevant artifact
  4. the compiled self-heal workflow started
  5. the agent downloaded the artifact and investigated
  6. the agent proposed a draft PR

That means the self-healing workflow only woke up after the normal pipeline had already failed and produced evidence. It was not polling. It was not scanning on a schedule. It reacted to a failed run automatically.

For the test failure, the agent downloaded the test-results artifact, found the NullReferenceException, followed the stack trace into OrderService.cs, and proposed the missing null check.

For the deploy failure, it downloaded the deployment evidence, read k8s-debug.txt, saw that the app was listening on 8080 while the probe was still hitting 80, and changed the Helm config to match.

In both cases the result was a draft PR, not a silent commit to the branch.

That was important for me because I wanted to see the exact diff, the explanation, and the reasoning path. I was not trying to hide the process. I wanted the same review surface I would expect from a teammate.

This also made testing the idea straightforward. Break the pipeline on purpose, let the normal CI fail, and watch whether the follow-up workflow can read the evidence and get back to the right fix.


The Real Cost of Self-Healing

The extra cost starts only after CI fails and the self-heal workflow wakes up. You pay for the agent run, the model tokens used to read logs and repo files, and whatever artifact storage and transfer that investigation needs.

So the real way to think about it is cost per failed run. If failures are rare and the artifacts are small, the cost stays low. If builds fail often and every failure uploads huge logs, the bill grows.
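One way to keep artifacts small is to trim the log before uploading it. The sketch below keeps the lines that look like failures plus the tail of the log. The function name `trim_log`, the match pattern, and the default of 50 lines are all assumptions to tune for your own output. Trimming also removes context the agent might have used, so treat it as a tradeoff rather than a free win.

```shell
#!/usr/bin/env bash
# Sketch only: trim_log is a hypothetical helper for shrinking CI evidence.
# It prints matching lines (with their line numbers) and the last N lines.
# Usage: trim_log <log-file> [tail-lines]
trim_log() {
  local log="$1" tail_lines="${2:-50}"
  # Lines that mention an exception, a failure, or an error.
  # "|| true" because finding no match is not a problem here.
  grep -En 'Exception|[Ff]ailed|[Ee]rror' "$log" || true
  printf '%s\n' "--- last ${tail_lines} lines ---"
  tail -n "$tail_lines" "$log"
}
```

In the CI job this could run right after the test step, for example `trim_log test-output.txt 50 > test-summary.txt`, with the summary uploaded next to or instead of the full log.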

A Practical Rollout Plan

If I were rolling this out for a real team, I would keep the scope narrow:

  1. run only after failed CI workflows
  2. restrict it to one or two common failure types
  3. require uploaded evidence before the workflow can act
  4. allow only draft pull requests as output
  5. review every proposed diff manually

That keeps the workflow predictable. It also gives you a clean way to measure the return on cost. I would track three basic numbers from day one:

  1. how many failed runs triggered the workflow
  2. how many proposed PRs were actually correct
  3. how much engineer time those investigations would normally have taken

If the workflow costs a few dollars but saves hours of senior engineering time on repeatable failures, the tradeoff is obvious. If it produces noisy PRs that nobody merges, the token bill is a waste of money.
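The arithmetic behind that tradeoff fits in a few lines. This is a rough sketch built on the three tracked numbers above. The function name `self_heal_report`, the cost per run, and the hourly rate are placeholders you supply yourself, not measured values.

```shell
#!/usr/bin/env bash
# Sketch only: self_heal_report is a hypothetical helper; every input is
# a number you track or estimate yourself.
# Usage: self_heal_report <triggered-runs> <correct-prs> <cost-per-run-usd> \
#                         <minutes-saved-per-correct-pr> <engineer-usd-per-hour>
self_heal_report() {
  awk -v runs="$1" -v ok="$2" -v cost="$3" -v mins="$4" -v rate="$5" 'BEGIN {
    spend = runs * cost                       # every triggered run costs money
    saved = ok * (mins / 60) * rate           # only correct PRs save time
    hit   = (runs > 0) ? 100 * ok / runs : 0  # share of runs with a correct PR
    printf "hit_rate=%.0f%% spend=$%.2f saved=$%.2f net=$%.2f\n", hit, spend, saved, saved - spend
  }'
}
```

With made-up inputs, `self_heal_report 20 14 0.50 15 100` describes 20 triggered runs, 14 correct PRs, 50 cents per run, 15 minutes saved per correct PR, and a rate of 100 dollars per hour. A low hit rate or a negative net figure is the signal that the workflow is producing noise.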

Where Self-Healing Gets More Interesting

I would not jump straight to "AI fixes everything." I would expand the triggers one by one.

For example:

  • after deployment, scan pod logs for restart loops or obvious startup exceptions
  • after a health check job, inspect the logs if the app never became ready
  • after a scheduled smoke test, investigate if an endpoint starts failing

That is where self-healing gets interesting. Not a magic system that pushes to production on its own, but a continuous investigator that notices a broken deployment, reads the evidence, and hands you a draft PR.

I still would not let it merge for me. But I would absolutely let it do the boring first pass on failures.
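As a taste of the first trigger idea, here is a minimal sketch that spots restart loops from `kubectl get pods` output. The function name `pods_in_restart_loop` and the threshold of 3 are assumptions, and it only reads the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# Sketch only: pods_in_restart_loop is a hypothetical helper.
# It reads `kubectl get pods --no-headers` output on stdin and prints the
# names of pods whose RESTARTS count is at or above the threshold.
# Usage: kubectl get pods --no-headers | pods_in_restart_loop 3
pods_in_restart_loop() {
  local threshold="${1:-3}"
  # Column 4 is RESTARTS. Newer kubectl prints it as "5 (2m ago)";
  # whitespace splitting still leaves the bare count in $4.
  awk -v min="$threshold" '$4 + 0 >= min + 0 { print $1 }'
}
```

A scheduled job could run this after a deployment and only wake the agent when the output is non-empty, which keeps the cost tied to real problems.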

If you want to try the full setup yourself, the demo repo and workflow files are here:

Demo Repository: github.com/bgener/demo-ai-github-pipelines

This article is part of The Modern DevEx Stack series. The next post looks at using MegaLinter in a polyglot repo without turning every pull request into a waiting game.
