Sergey Byvshev
AI Is for DevOps: How a Neural Network Debugs Failed Pipelines

How often does someone rush to you wide-eyed, begging for help with a broken pipeline? Or you find yourself staring at a red status in Slack on a Friday evening, knowing the next 15–20 minutes will be spent on routine work: open the log, find the error line, compare with the last commit, check dependencies…

The work is straightforward. And that's exactly why it's boring — a perfect candidate for automation.

Fortunately, neural networks can now handle this for us and provide solid advice (not all of them, but some definitely can).

What to Think Through Beforehand

Before writing any code, it's worth answering four questions. They'll define the architecture of the entire solution:

  • What events trigger the analysis, and in what format should they reach the agent?
  • What sources and data might be needed?
  • How should this data be analyzed to identify the root cause?
  • What should the diagnostic output look like in form and content?

What events trigger the analysis? In our case — a job that finished unsuccessfully in CI/CD. To start diagnostics, it's enough to pass the agent the last 50 lines of the build log and the pipeline file contents.

What data sources will be needed? The main ones are the version control system (repository access), CI/CD (full log, related jobs), an endpoint availability checker, and CI agent resource consumption metrics.
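For orientation, the Pipeline Hook payload that GitLab sends looks roughly like this (heavily trimmed; see GitLab's webhook events documentation for the full schema):

```json
{
  "object_kind": "pipeline",
  "object_attributes": {
    "id": 31,
    "ref": "main",
    "status": "failed"
  },
  "project": { "id": 1, "path_with_namespace": "group/app" },
  "builds": [
    { "id": 380, "stage": "build", "name": "compile", "status": "failed" }
  ]
}
```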

How to analyze the data? This is the most interesting part, because there are many scenarios depending on the job type:

  • Build jobs — dependency issues (missing, incorrectly specified, unavailable), code errors, insufficient build resources.
  • Test jobs — code errors, incorrectly written tests.
  • Deploy jobs — manifest errors, issues on the target platform side.
  • Common problems — errors in the pipeline/workflow itself, missing utilities on the agent, agent initialization issues.

What should the report format be? Most often, such a report is read by human eyes in a chat, so it should be written in plain language. Concise, facts only: what was found, most probable causes, specific steps to fix. A convenient place for such a report is a thread under the corresponding error message or a dedicated channel.

Solution Architecture

At a high level, the flow works like this: an event arrives saying a job completed unsuccessfully. We then request data for analysis: build logs and the pipeline description. Based on this data and its system prompt, the AI agent performs the failure analysis. Along the way, the assistant can independently check the repository to see what changed, verify external endpoints when the error suggests one is unreachable, and, on a failed deploy, inspect application logs and metrics. The result is a list of the most probable failure causes with remediation steps, which is then sent to the team chat.
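Stripped of n8n specifics, the flow can be sketched in a few lines of Python. Everything here is illustrative: the helpers passed in stand in for real n8n nodes, and none of the names come from an actual API.

```python
# Illustrative end-to-end flow of the assistant; every helper passed in
# here stands in for a real n8n node and is not an actual API.

def find_failed_job(event: dict) -> dict:
    """Pick the first job in the pipeline event that went red."""
    return next(b for b in event["builds"] if b["status"] == "failed")

def handle_pipeline_event(event, fetch_trace, run_agent, post_to_slack):
    if event["object_attributes"]["status"] != "failed":
        return None  # only failed pipelines get analyzed

    failed_job = find_failed_job(event)
    context = [
        {"job_log": fetch_trace(failed_job["id"])},  # last 50 lines of the log
        {"data": "..."},                             # .gitlab-ci.yml contents
        {"pipeline": event["object_attributes"]},
        {"failed_job": failed_job},
    ]
    report = run_agent(context)  # LLM + MCP tools do the actual analysis
    post_to_slack(report)
    return report
```

The early return mirrors the If node: everything that isn't a failed pipeline is dropped before any data collection happens.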

Architecture

Time to Implement

Some implementation aspects are covered in more detail in a separate article.

Setting Up Incoming Events

Input chain

The first step is to create a webhook in n8n. GitLab uses the X-Gitlab-Token header for authentication, so in n8n we select Header Auth and specify the corresponding credential.

Webhook

Webhook Auth
In GitLab, we configure webhook delivery. This can be done for an individual repository or for an entire group. We specify the webhook address and the secret token, and from the event types we select Pipeline events.

Gitlab webhook
Then, using the If node, we filter out all non-failed events — we don't need them.

Data Collection

As soon as we receive a job failure event, we request details from GitLab. For this, you'll need to create a GitLab access token (I recommend read-only) and the corresponding credential in n8n.


Then we merge all the collected data using the Merge node.
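Fetching the job log is a plain GitLab API call (`GET /projects/:id/jobs/:job_id/trace` returns the raw trace), and trimming it to the last 50 lines keeps the prompt small. A sketch; the endpoint is GitLab's real jobs API, while the instance URL and helper names are placeholders:

```python
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance

def tail_lines(text: str, n: int = 50) -> str:
    """Keep only the last n lines of a job log."""
    return "\n".join(text.splitlines()[-n:])

def fetch_job_log(project_id: int, job_id: int, token: str, n: int = 50) -> str:
    # GET /projects/:id/jobs/:job_id/trace returns the raw job log
    req = urllib.request.Request(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/jobs/{job_id}/trace",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(req) as resp:
        return tail_lines(resp.read().decode(), n)
```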

Error Analysis

Data arrives to the agent in the following format:

```json
[
  { "job_log": "Last 50 lines of the failed job" },
  { "data": "Contents of .gitlab-ci.yml" },
  { "pipeline": {} },
  { "failed_job": {} }
]
```

This format should be communicated to the agent in advance via the system prompt. There we also describe the available tools, the investigation strategy (based on the considerations outlined above), and the desired output report format.
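A condensed system prompt along these lines works well (the wording here is illustrative; adapt it to your stack):

```
You are a CI/CD failure analyst. Input is a JSON array with:
job_log (last 50 lines of the failed job), data (.gitlab-ci.yml),
pipeline and failed_job objects from GitLab.

Tools: GitLab MCP (commits, diffs, job details), Grafana MCP (agent
metrics, deploy logs), HTTP Request (endpoint availability checks).

Strategy: classify the job (build / test / deploy), find the first
real error in the log, correlate it with recent commits, and verify
external endpoints only when the log suggests one is unreachable.

Output for Slack: what was found, the 2-3 most probable causes, and
concrete remediation steps. Facts only; do not speculate beyond the
evidence.
```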

ai-assistant
It's best to use the latest model versions, as they handle MCP tool use significantly better. We don't connect memory here, since each build failure is an independent event for the agent.

MCP Tools

The agent has access to three tools:

  • Gitlab MCP — for retrieving additional information about the failed job, code changes, etc.
  • Grafana MCP — for retrieving CI agent metrics, as well as failed deploy logs.
  • HTTP Request — n8n's built-in tool for checking endpoint availability.
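What the endpoint check boils down to can be sketched as a small helper; the result shape here is an assumption, not what n8n's HTTP Request tool actually returns:

```python
import urllib.request
import urllib.error

def check_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Classify an endpoint as reachable or not (result shape is illustrative)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "reachable": True, "status": resp.status}
    except urllib.error.HTTPError as e:
        # The server answered, just not with 2xx - still reachable
        return {"url": url, "reachable": True, "status": e.code}
    except (urllib.error.URLError, TimeoutError) as e:
        return {"url": url, "reachable": False, "error": str(e)}
```

Note that an HTTP error response still counts as "reachable": for diagnosing a failed job, a 503 from a package registry and a DNS failure point to very different causes.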

Important note: make sure your MCP servers are running in remote mode. If an MCP server doesn't support remote operation out of the box, you can solve this with mcpgateway, which exposes a stdio-based MCP server over HTTP. For the transport method, streamable HTTP is the best choice.

Posting to Chat

The final step is sending the generated report to Slack. The report goes to the selected channel or thread.

Output

Testing and Real-World Examples

The final workflow looks like this.

full-workflow

Example 1: Failed Build

Gradle can't resolve a dependency. The agent determines that this is a dependency resolution issue, not a compilation error. It provides specific causes: the artifact isn't published in the repository, or credentials are unavailable inside the Docker build context. For each cause — concrete steps to fix.
Gitlab-logs
Slack-message

Example 2: Infrastructure Change Errors

Terraform plan fails with Unsupported argument errors. The agent recognizes that the HCL configuration contains attributes not supported by the current DigitalOcean provider schema. It provides three probable causes — from the wrong resource type to provider version mismatch — with specific remediation steps for each.

gitlab-example

report-example

Conclusion

We've built an assistant that performs full error analysis in approximately 30 seconds. This allows the team to respond to failed jobs significantly faster and spend their time on real engineering tasks rather than routine log analysis.

Token consumption stays on the order of a few thousand per analysis.


The base workflow version is here.
A full tutorial with all the scripts can be found here.
