Sergey Byvshev
AI Is for DevOps: How a Neural Network Debugs Failed Pipelines

How often does someone rush to you wide-eyed, begging for help with a broken pipeline? Or you find yourself staring at a red status in Slack on a Friday evening, knowing the next 15–20 minutes will be spent on routine work: open the log, find the error line, compare with the last commit, check dependencies…

The work is straightforward. And that's exactly why it's boring — a perfect candidate for automation.

Fortunately, neural networks can now handle this for us and provide solid advice (not all of them, but some definitely can).

What to Think Through Beforehand

Before writing any code, it's worth answering four questions. They'll define the architecture of the entire solution:

  • What events trigger the analysis, and in what format should they reach the agent?
  • What sources and data might be needed?
  • How should this data be analyzed to identify the root cause?
  • What should the diagnostic output look like in form and content?

What events trigger the analysis? In our case — a job that finished unsuccessfully in CI/CD. To start diagnostics, it's enough to pass the agent the last 50 lines of the build log and the pipeline file contents.

What data sources will be needed? The main ones are the version control system (repository access), CI/CD (full log, related jobs), an endpoint availability checker, and CI agent resource consumption metrics.
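For orientation, the Pipeline Hook payload that GitLab sends looks roughly like this (heavily trimmed; see GitLab's webhook events documentation for the full schema):

```json
{
  "object_kind": "pipeline",
  "object_attributes": {
    "id": 31,
    "ref": "main",
    "status": "failed"
  },
  "project": { "id": 1, "path_with_namespace": "group/app" },
  "builds": [
    { "id": 380, "stage": "build", "name": "compile", "status": "failed" }
  ]
}
```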

How to analyze the data? This is the most interesting part, because there are many scenarios depending on the job type:

  • Build jobs — dependency issues (missing, incorrectly specified, unavailable), code errors, insufficient build resources.
  • Test jobs — code errors, incorrectly written tests.
  • Deploy jobs — manifest errors, issues on the target platform side.
  • Common problems — errors in the pipeline/workflow itself, missing utilities on the agent, agent initialization issues.

What should the report format be? Most often, such a report is read by human eyes in a chat, so it should be written in plain language. Concise, facts only: what was found, most probable causes, specific steps to fix. A convenient place for such a report is a thread under the corresponding error message or a dedicated channel.

Solution Architecture

At a high level, the flow works like this: an event arrives saying a job completed unsuccessfully. We then request data for analysis: build logs and the pipeline description. Based on this data and its system prompt, the AI agent performs the failure analysis. Along the way, the assistant can independently check the repository to see what changed, verify external endpoints when the error suggests one is unreachable, and, on a failed deploy, inspect application logs and metrics. The result is a list of the most probable failure causes with remediation steps, which is then sent to the team chat.
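Stripped of n8n specifics, the flow can be sketched in a few lines of Python. Everything here is illustrative: the helpers passed in stand in for real n8n nodes, and none of the names come from an actual API.

```python
# Illustrative end-to-end flow of the assistant; every helper passed in
# here stands in for a real n8n node and is not an actual API.

def find_failed_job(event: dict) -> dict:
    """Pick the first job in the pipeline event that went red."""
    return next(b for b in event["builds"] if b["status"] == "failed")

def handle_pipeline_event(event, fetch_trace, run_agent, post_to_slack):
    if event["object_attributes"]["status"] != "failed":
        return None  # only failed pipelines get analyzed

    failed_job = find_failed_job(event)
    context = [
        {"job_log": fetch_trace(failed_job["id"])},  # last 50 lines of the log
        {"data": "..."},                             # .gitlab-ci.yml contents
        {"pipeline": event["object_attributes"]},
        {"failed_job": failed_job},
    ]
    report = run_agent(context)  # LLM + MCP tools do the actual analysis
    post_to_slack(report)
    return report
```

The early return mirrors the If node: everything that isn't a failed pipeline is dropped before any data collection happens.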

Architecture

Time to Implement

Some implementation aspects are covered in more detail in a separate article.

Setting Up Incoming Events

Input chain

The first step is to create a webhook in n8n. GitLab uses the X-Gitlab-Token header for authentication, so in n8n we select Header Auth and specify the corresponding credential.

Webhook

Webhook Auth
In GitLab, we configure webhook delivery. This can be done for an individual repository or for an entire group. We specify the webhook address and the secret token, and from the event types we select Pipeline events.

Gitlab webhook
Then, using the If node, we filter out all non-failed events — we don't need them.

Data Collection

As soon as we receive a job failure event, we request details from GitLab. For this, you'll need to create a GitLab access token (I recommend read-only) and the corresponding credential in n8n.


Then we merge all the collected data using the Merge node.
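Fetching the job log is a plain GitLab API call (`GET /projects/:id/jobs/:job_id/trace` returns the raw trace), and trimming it to the last 50 lines keeps the prompt small. A sketch; the endpoint is GitLab's real jobs API, while the instance URL and helper names are placeholders:

```python
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance

def tail_lines(text: str, n: int = 50) -> str:
    """Keep only the last n lines of a job log."""
    return "\n".join(text.splitlines()[-n:])

def fetch_job_log(project_id: int, job_id: int, token: str, n: int = 50) -> str:
    # GET /projects/:id/jobs/:job_id/trace returns the raw job log
    req = urllib.request.Request(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/jobs/{job_id}/trace",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(req) as resp:
        return tail_lines(resp.read().decode(), n)
```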

Error Analysis

Data arrives to the agent in the following format:

```json
[
  { "job_log": "Last 50 lines of the failed job" },
  { "data": "Contents of .gitlab-ci.yml" },
  { "pipeline": {} },
  { "failed_job": {} }
]
```

This format should be communicated to the agent in advance via the system prompt. There we also describe the available tools, the investigation strategy (based on the considerations outlined above), and the desired output report format.
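A condensed system prompt along these lines works well (the wording here is illustrative; adapt it to your stack):

```
You are a CI/CD failure analyst. Input is a JSON array with:
job_log (last 50 lines of the failed job), data (.gitlab-ci.yml),
pipeline and failed_job objects from GitLab.

Tools: GitLab MCP (commits, diffs, job details), Grafana MCP (agent
metrics, deploy logs), HTTP Request (endpoint availability checks).

Strategy: classify the job (build / test / deploy), find the first
real error in the log, correlate it with recent commits, and verify
external endpoints only when the log suggests one is unreachable.

Output for Slack: what was found, the 2-3 most probable causes, and
concrete remediation steps. Facts only; do not speculate beyond the
evidence.
```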

ai-assistant
It's best to use the latest model versions, as they handle MCP tool use significantly better. We don't connect memory here, since each build failure is an independent event for the agent.

MCP Tools

The agent has access to three tools:

  • Gitlab MCP — for retrieving additional information about the failed job, code changes, etc.
  • Grafana MCP — for retrieving CI agent metrics, as well as failed deploy logs.
  • HTTP Request — n8n's built-in tool for checking endpoint availability.
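What the endpoint check boils down to can be sketched as a small helper; the result shape here is an assumption, not what n8n's HTTP Request tool actually returns:

```python
import urllib.request
import urllib.error

def check_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Classify an endpoint as reachable or not (result shape is illustrative)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "reachable": True, "status": resp.status}
    except urllib.error.HTTPError as e:
        # The server answered, just not with 2xx - still reachable
        return {"url": url, "reachable": True, "status": e.code}
    except (urllib.error.URLError, TimeoutError) as e:
        return {"url": url, "reachable": False, "error": str(e)}
```

Note that an HTTP error response still counts as "reachable": for diagnosing a failed job, a 503 from a package registry and a DNS failure point to very different causes.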

Important note: make sure your MCP servers are running in remote mode. If an MCP server doesn't support remote operation out of the box, you can solve this with mcpgateway, which exposes a stdio-based MCP server over HTTP. For the transport method, streamable HTTP is the best choice.

Posting to Chat

The final step is sending the generated report to Slack. The report goes to the selected channel or thread.

Output

Testing and Real-World Examples

The final workflow looks like this.

full-workflow

Example 1: Failed Build

Gradle can't resolve a dependency. The agent determines that this is a dependency resolution issue, not a compilation error. It provides specific causes: the artifact isn't published in the repository, or credentials are unavailable inside the Docker build context. For each cause — concrete steps to fix.
Gitlab-logs
Slack-message

Example 2: Infrastructure Change Errors

Terraform plan fails with Unsupported argument errors. The agent recognizes that the HCL configuration contains attributes not supported by the current DigitalOcean provider schema. It provides three probable causes — from the wrong resource type to provider version mismatch — with specific remediation steps for each.

gitlab-example

report-example

Conclusion

We've built an assistant that performs full error analysis in approximately 30 seconds. This allows the team to respond to failed jobs significantly faster and spend their time on real engineering tasks rather than routine log analysis.

Token consumption stays on the order of a few thousand per analysis.


The base workflow version is here.
A full tutorial with all the scripts can be found here.
