Your CI/CD Pipeline Should Have Its Own AI — Here's How I Built One That Runs Locally
Tags: ai, devops, cicd, python
If you've shipped software for any meaningful length of time, you've felt the same three pains:
- A test failed in CI, but it failed two days ago too, and yesterday it passed, and today it failed again. Is it broken or is it flaky?
- The PR diff is 1,800 lines, the description says "small refactor," and the reviewer has 12 minutes before their next meeting.
- The deploy went out at 4:55pm on Friday. By 5:10pm something is off in the logs, but nobody's sure if it's the deploy, the cache rollout, or the third-party API hiccupping again.
Every team I've worked on has built some homegrown answer to these. Spreadsheets of flaky tests. CODEOWNERS rules. Slack channels that page on log spikes. They all work — kind of — until the project grows past the point where humans can keep the rules current.
So I built a small set of CI/CD assistants that run on local LLMs, do exactly one job each, and plug into existing pipelines via a single CLI call. No cloud, no API keys, no per-call billing. The pipeline calls a binary, gets back JSON, and decides what to do with it.
This post walks through the four assistants I built, what each one is actually good at, and the small handful of design decisions that made them useful instead of ignorable.
Why Local LLMs Are Actually a Good Fit Here
CI/CD is the one place in your stack where "send my data to a third-party AI" is a worse idea than usual. Test logs leak schema names and table contents. Stack traces leak file paths and internal hostnames. PR diffs are, by definition, your unreleased proprietary code. A junior engineer pasting a stack trace into ChatGPT to debug it is — depending on your industry — somewhere between "uncomfortable" and "regulatory incident."
Running the model locally also gives you three boring-but-real advantages:
- Reproducibility. Same model, same prompt, same temperature and seed → same output. Nice property when you're using AI to make merge decisions.
- No quota. You can run the analyzer on every PR, every build, every deploy, instead of rationing it to the "important" ones.
- Latency that fits CI. Gemma 4 on a developer laptop or a self-hosted runner returns a typical analysis in 3–8 seconds. That fits inside an existing CI step. A round-trip to a hosted API plus rate-limit backoff often doesn't.
The four assistants I'm going to describe all share the same backbone: Ollama running Gemma 4, called via HTTP from a small Python CLI, with structured-output prompting that returns JSON the pipeline can act on.
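For concreteness, here's a minimal sketch of that backbone in Python, using the requests library against Ollama's default local endpoint. The model tag is a placeholder for whichever Gemma build you've pulled, and the wrapper in the actual repo differs in details:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint
MODEL = "gemma"                                       # placeholder: use the tag you've pulled

def ask_llm(prompt: str, temperature: float = 0.0) -> dict:
    """Send one prompt to the local model and parse its JSON reply."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "format": "json",                         # ask Ollama to emit valid JSON only
            "options": {"temperature": temperature},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])        # "response" holds the generated text
```

Every assistant below is a thin wrapper around a call like this: deterministic code builds the prompt, the model fills in the JSON, and the pipeline decides what to do with it.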
Assistant #1: The Flaky Test Triager
The most boring of the four, and by a wide margin the most useful.
When a test fails in CI, my pipeline runs:
```bash
flaky-triage --test "test_user_creation_idempotent" \
  --history .ci/test-history.jsonl \
  --log build.log \
  --output triage.json
```
The CLI does three things:
- Loads the last 200 runs of that test from a rolling JSONL file (cheap, just `grep | tail`).
- Computes the actual stats — pass rate, average duration, recent failure pattern.
- Hands the LLM the recent stack trace plus the stats, and asks for a one-of-five classification: `flaky_intermittent`, `flaky_environmental`, `recently_broken`, `chronically_broken`, or `legitimate_failure`.
The classification is the part the LLM does well. The stats are the part the LLM does badly, so I don't ask it to compute them. The prompt is always the same shape: "Here is the failure rate (0.18 over 200 runs), here are the last 10 outcomes, here is the new stack trace. Classify."
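Roughly, the deterministic half looks like this. The JSONL field names (`test`, `outcome`) are my assumption for illustration, not the repo's actual history format:

```python
import json
from pathlib import Path

def load_history(path: str, test_name: str, limit: int = 200) -> list[dict]:
    """Last `limit` recorded runs of one test from the rolling JSONL history."""
    runs = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    return [r for r in runs if r["test"] == test_name][-limit:]

def build_triage_prompt(runs: list[dict], stack_trace: str) -> str:
    """Deterministic stats go into the prompt; the LLM only classifies."""
    failures = sum(1 for r in runs if r["outcome"] == "fail")
    failure_rate = failures / len(runs) if runs else 0.0
    last_ten = [r["outcome"] for r in runs[-10:]]
    return (
        f"Here is the failure rate ({failure_rate:.2f} over {len(runs)} runs), "
        f"here are the last 10 outcomes: {last_ten}, "
        f"here is the new stack trace:\n{stack_trace}\n"
        "Classify as one of: flaky_intermittent, flaky_environmental, "
        "recently_broken, chronically_broken, legitimate_failure. "
        'Respond ONLY with JSON: {"classification": "...", "reason": "..."}'
    )
```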
The output goes into a GitHub Actions check (sketched after this list) that either:
- Auto-retries the test once if it's `flaky_intermittent` (and posts a comment so we can see the rate climbing).
- Blocks the merge if it's `recently_broken` (i.e., the test was passing reliably until the last 5 commits).
- Files an issue with the existing pattern if it's `chronically_broken` and there isn't already an open issue.
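The pipeline side is just a dispatch on the returned label. Here's a hypothetical sketch of what the check step could do with triage.json; the exit-code convention and the split of retry/issue-filing into later workflow steps are my illustration, not the exact workflow I run:

```python
import json
import sys

def act_on_triage(path: str = "triage.json") -> int:
    """Map the classification in triage.json to an exit code the workflow can branch on."""
    with open(path) as f:
        triage = json.load(f)
    label = triage.get("classification", "legitimate_failure")

    if label == "flaky_intermittent":
        print("flaky: retry once and post the tracking comment")
        return 0                   # check passes; a retry step reruns the test
    if label == "chronically_broken":
        print("chronic flake: file an issue if one is not already open")
        return 0                   # issue filing handled by a later workflow step
    # recently_broken, legitimate_failure, or anything unexpected: fail the check
    print(f"blocking merge: {triage.get('reason', 'no reason given')}", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(act_on_triage())
```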
The win isn't in any one decision. It's that the team stopped having the daily "is this flaky or is this real?" Slack thread.
Assistant #2: The PR Risk Reviewer
This one has a higher false-positive rate than #1, but the false positives are still useful.
On every PR, a CI step runs:
```bash
pr-risk --diff $(git diff origin/main...HEAD) \
  --files-changed-history .ci/touch-history.jsonl \
  --output risk.json
```
The risk reviewer scores three things on a 0–10 scale:
- Blast radius — how much of the system this PR can break if it's wrong.
- Surprise factor — how unusual the changed files are vs. the PR title and description.
- Test density — whether the test coverage in this PR is consistent with the code coverage in the rest of the repo.
The first two are the LLM's job. The third I compute deterministically (lines of test changed / lines of non-test changed, normalized against the repo's historical ratio) and then have the LLM interpret.
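As a sketch, the deterministic part could look like the following; the test-file path heuristic and the historical ratio are placeholders you'd tune per repo:

```python
import subprocess

def test_density(base: str = "origin/main", repo_historical_ratio: float = 0.35) -> float:
    """Test lines changed / non-test lines changed, normalized against the repo's
    historical ratio. Values near 1.0 mean "about as tested as this repo usually is"."""
    numstat = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    test_lines = code_lines = 0
    for line in numstat.splitlines():
        added, deleted, path = line.split("\t")
        if added == "-":                      # binary files report "-" for line counts
            continue
        changed = int(added) + int(deleted)
        # crude path heuristic for "is this a test file" -- tune per repo
        if "/tests/" in f"/{path}" or path.split("/")[-1].startswith("test_"):
            test_lines += changed
        else:
            code_lines += changed
    return (test_lines / max(code_lines, 1)) / repo_historical_ratio
```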
The reason this works is the same reason most LLM-assisted code review fails: I never ask the LLM "is this PR good." I ask it "does the PR description match the diff?" and "what files outside the stated scope did this PR touch?" Those are reading-comprehension questions, which is what these models are actually good at. They're not architecture review questions, which is what they're bad at.
The output gets posted as a single review comment with three paragraphs and a final risk score. Reviewers tell me the most-used part is the "files outside stated scope" sentence — the model catches the `# also bumped this lib version while I was here` lines that hide in long diffs.
Assistant #3: The Deploy Log Watcher
This one runs after the deploy, on a 5-minute and a 30-minute interval, and looks at structured logs from the freshly-deployed service.
```bash
deploy-watch --service api \
  --window 5m \
  --baseline-window 24h \
  --logs $(kubectl logs --since=5m ...)
```
The trick that made this assistant actually useful — instead of crying-wolf useful — is that I do the statistics in code and let the LLM do the explanation. The CLI computes:
- Error-rate delta vs. the 24h baseline.
- The top 10 new error signatures (by message clustering — also done locally, with a small embedding model).
- Latency p50/p95/p99 deltas.
If nothing crosses a threshold, the assistant doesn't even call the LLM. It posts a "deploy looks healthy" check and exits.
If thresholds are crossed, it hands those numbers + 200 lines of representative log samples to the LLM and asks one question: "Write a one-paragraph status-page-style update describing what's happening, suitable for a Slack #incidents post. Include the affected endpoint, the symptom, the magnitude, and whether to roll back."
That last clause — whether to roll back — is where I made the LLM's job easy. I give it the rule explicitly in the prompt: "Recommend rollback if the error-rate delta is over 5x baseline AND the new error signatures correlate with files changed in this deploy. Otherwise recommend monitor."
The model never has to invent a rule. It just has to apply the rule I gave it to the data I gave it. That's a thing local 7B models are great at.
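Putting the gate and the rule together, a simplified sketch — the 2x short-circuit threshold and the correlation boolean are assumptions for illustration, and `ask_llm()` is the Ollama wrapper from the architecture sketch earlier:

```python
def watch_deploy(error_rate_delta: float, new_signatures: list[str],
                 signatures_match_deploy: bool, log_sample: str) -> dict:
    """Gate on deterministic thresholds first; only call the LLM when one is crossed."""
    if error_rate_delta < 2.0 and not new_signatures:    # assumed 2x short-circuit threshold
        return {"status": "healthy", "summary": "deploy looks healthy"}

    prompt = (
        f"Error-rate delta vs 24h baseline: {error_rate_delta:.1f}x.\n"
        f"New error signatures: {new_signatures}\n"
        f"New signatures correlate with files changed in this deploy: {signatures_match_deploy}\n"
        f"Representative log lines:\n{log_sample}\n\n"
        "Write a one-paragraph status-page-style update suitable for a Slack #incidents post. "
        "Include the affected endpoint, the symptom, the magnitude, and whether to roll back. "
        "Recommend rollback if the error-rate delta is over 5x baseline AND the new error "
        "signatures correlate with files changed in this deploy. Otherwise recommend monitor. "
        'Respond ONLY with a JSON object: {"summary": "...", "recommendation": "rollback or monitor"}'
    )
    return ask_llm(prompt)   # the rule is applied by the model, but stated by us
```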
Assistant #4: The Release Notes Writer
This one I built last, and it's the one that surprised me with how much joy it generated.
Every Friday afternoon a job runs:
```bash
release-notes --since "last release tag" \
  --commits $(git log ...) \
  --merged-prs $(gh pr list --state merged ...) \
  --output release-notes.md
```
It produces a markdown document with three sections: "What's new for users," "What changed for ops," and "Internal cleanup." It groups commits by intent (not by author), it strips the boring conventional-commit prefixes, and it links each line back to the merged PR.
The reason engineers love this one is the same reason they previously hated writing release notes by hand: it does the grouping well. Five PRs that all touched the auth subsystem get summarized as one paragraph in "What changed for ops." A docs-typo PR doesn't show up in "What's new for users." A migration script gets called out at the top.
The whole output is a draft, not the final release note. An engineer reads it, edits maybe 20% of it, and ships it. Pre-AI, the same engineer was spending 45 minutes staring at a git log and then writing four bullet points that nobody read.
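The grouping rules live entirely in the prompt. Here's a sketch of its shape, assuming the CLI has already gathered the commit log and merged-PR titles via the flags shown above; the exact wording in the repo differs:

```python
def build_release_notes_prompt(commits: str, merged_prs: str) -> str:
    """Ask for grouped, reader-facing markdown rather than a commit-by-commit list."""
    return (
        "Write release notes from the commits and merged PRs below.\n\n"
        f"Commits:\n{commits}\n\nMerged PRs:\n{merged_prs}\n\n"
        "Produce a markdown document with exactly three sections: "
        "'What's new for users', 'What changed for ops', and 'Internal cleanup'. "
        "Group related changes by intent, not by author. "
        "Strip conventional-commit prefixes like feat:, fix:, chore:. "
        "Link each line to its merged PR. "
        "Call out any migration script at the top. "
        "Do not mention changes that are not present in the input."
    )
```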
The Architecture That Ties Them Together
All four assistants are the same shape:
```
┌───────────────────┐     ┌─────────────────────┐     ┌──────────────┐
│ Pipeline step     │  →  │ Local Python CLI    │  →  │ Ollama HTTP  │
│ (GH Actions /     │     │ (deterministic      │     │  (Gemma 4)   │
│  Jenkins / etc)   │     │  stats + prompt)    │     │              │
└───────────────────┘     └─────────────────────┘     └──────────────┘
                                     ↓
                          ┌─────────────────────┐
                          │ structured JSON     │
                          │ back to pipeline    │
                          └─────────────────────┘
```
Three rules I followed in every CLI:
- Compute before you generate. Anything you can count (failure rates, line deltas, latency percentiles) you should count yourself, and pass the numbers to the LLM as facts. Don't ask the LLM to do arithmetic on log lines.
- Constrain output to JSON. Every prompt ends with "Respond ONLY with a JSON object matching this schema." Validate before returning. If validation fails, retry once at temperature 0 (see the sketch after this list).
- Make the rule explicit, not implicit. Anywhere the assistant has to make a recommendation (rollback, block merge, file issue), put the rule in the prompt verbatim. The LLM is applying the rule. It is not inventing the rule.
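The second rule reduces to a few lines around the model call. A sketch, reusing the `ask_llm()` wrapper from the architecture section; the 0.2 first-pass temperature and the required-keys check stand in for a fuller schema validation:

```python
import json

def ask_llm_validated(prompt: str, required_keys: set[str]) -> dict:
    """Constrain to JSON, validate, and retry once at temperature 0 if the model didn't comply."""
    for temperature in (0.2, 0.0):            # second pass is the single temp-0 retry
        try:
            result = ask_llm(prompt, temperature=temperature)   # wrapper from the earlier sketch
        except json.JSONDecodeError:
            continue
        if required_keys <= result.keys():    # cheap schema check: all required keys present
            return result
    raise ValueError("model did not return valid JSON after one retry")
```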
These three rules are 90% of the difference between an LLM-augmented pipeline that engineers actually trust and one that gets disabled by the second on-call rotation.
What I Got Wrong the First Time
Three honest mistakes worth sharing:
Mistake 1: Letting the LLM write the rules. First version of the deploy watcher just got "Decide whether to roll back" with no threshold guidance. It rolled back twice on noise and missed an actual incident the third time. Trust collapsed in one week.
Mistake 2: One giant prompt instead of four small ones. I tried building a single "DevOps copilot" assistant that handled all four tasks. Latency went up, error rate went up, and the JSON output schema got too complicated for the model to consistently produce. Splitting into four small CLIs cut combined latency in half.
Mistake 3: Not versioning the prompts. First two months, prompts lived in Python string literals. When output drifted, I couldn't tell whether it was the model, the prompt, or the inputs. Now every prompt is a separate file with a hash in its name, logged alongside the output.
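The fix is mechanical. A sketch of what prompt loading and run logging might look like; the log path and record fields here are illustrative, not the repo's actual format:

```python
import hashlib
import json
from pathlib import Path

def load_prompt(path: str) -> tuple[str, str]:
    """Return the prompt text plus a short content hash to log alongside every output."""
    text = Path(path).read_text()
    return text, hashlib.sha256(text.encode()).hexdigest()[:12]

def log_run(output: dict, prompt_hash: str, model: str,
            log_path: str = ".ci/llm-runs.jsonl") -> None:
    """One record per LLM call, so output drift can be traced to prompt, model, or input."""
    record = {"prompt_hash": prompt_hash, "model": model, "output": output}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```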
What's Next
I'm working on assistant #5: a "PR description rewriter" that takes the diff plus the author's two-line PR description and proposes an improved version, with the same three-section structure as the release notes. Early signal is good but not yet good enough to ship — it's too eager to add bullet points that aren't supported by the diff. Same lesson as before: the model wants to invent things; the prompt has to forbid it.
If you want to see the actual code, all four assistants are open source and run on a single laptop with Ollama and Gemma 4. The repository has the prompts, the CLIs, the GitHub Actions workflows, and the test-history JSONL format I use. Local-first AI for CI/CD turns out to be a much better fit than I expected, and I think the pattern generalizes to a lot more places than just pipelines.
The PR is small. The LLM is local. The decision is auditable. That's the bar.
I'm a Senior Software Engineer at Microsoft and the author of 90+ open-source local-AI projects. If this helped, ⭐ the repos and follow @kennedyraju55 here on Dev.to.