Dependabot is great at one thing: opening PRs when a dependency is outdated.
It's not great at telling you whether merging that PR will break your build. For that, you still have to read changelogs manually, run tests locally, and hope you catch everything before it hits staging.
I spent the last few months building Migratowl to close that gap. This is a writeup of the most interesting engineering decisions.
What it does (briefly)
Migratowl receives a webhook with a repo URL, runs a four-phase AI agent workflow inside an ephemeral Kubernetes pod, and delivers a structured JSON report per dependency:
{
"dependency_name": "requests",
"is_breaking": true,
"error_summary": "ImportError: cannot import name 'PreparedRequest'",
"changelog_citation": "## 3.0.0 — Removed PreparedRequest from the public API.",
"suggested_human_fix": "Replace `from requests import PreparedRequest` with `requests.models.PreparedRequest`.",
"confidence": 0.95
}
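Internally that report is just a handful of typed fields. Roughly, as a Pydantic model (illustrative; only the field names are pinned down by the example above, the required/optional split is a sketch):

# Hypothetical Pydantic rendering of the report; field names mirror the
# JSON above, everything else is illustrative.
from pydantic import BaseModel, Field

class AnalysisReport(BaseModel):
    dependency_name: str
    is_breaking: bool
    error_summary: str | None = None
    changelog_citation: str | None = None
    suggested_human_fix: str | None = None
    confidence: float = Field(ge=0.0, le=1.0)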
It supports Python, Node.js, Go, Rust, and Java. It integrates with Dependabot via a GitHub Actions workflow — every Dependabot PR gets an analysis comment before anyone reviews it.
But the interesting parts are in the implementation. Let's get into them.
Problem 1: Running untrusted code safely
Dependency analysis requires actually running the code. You clone a repo you don't fully control, bump a bunch of packages, and execute a test suite. That's a meaningful attack surface.
My threat model: a malicious conftest.py or setup.py that tries to exfiltrate data, write to the host filesystem, or make outbound network calls.
The solution is a Kubernetes pod per analysis, with every hardening option enabled by default:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault
automountServiceAccountToken: false
Plus a deny-all NetworkPolicy on both ingress and egress. Source cloning happens before the lockdown; once the analysis starts, the sandbox has no network access at all.
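The deny-all policy itself is tiny. For reference, here's what one looks like built with the official Kubernetes Python client (the namespace and pod labels below are placeholders, not Migratowl's actual values):

# Sketch: a deny-all NetworkPolicy applied to sandbox pods.
# Namespace and labels are placeholders; adapt to your cluster.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

deny_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="sandbox-deny-all"),
    spec=client.V1NetworkPolicySpec(
        # Select the sandbox pods; this label is a placeholder.
        pod_selector=client.V1LabelSelector(
            match_labels={"app": "migratowl-sandbox"}
        ),
        # Listing both types with no ingress/egress rules denies everything.
        policy_types=["Ingress", "Egress"],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="migratowl", body=deny_all
)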
For clusters that support it, we use kubernetes-sigs/agent-sandbox, which adds gVisor/Kata isolation and warm pod pools for sub-second startup. On any standard cluster, there's a raw-pod fallback.
This is all wrapped behind langchain-kubernetes, which exposes a simple sandbox.execute(command) interface to the agent.
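From the agent's point of view, every tool bottoms out in that one call. Schematically (the class name and constructor here are illustrative assumptions; only the execute(command) call is the actual interface):

# Illustrative only: the import path, class name, and constructor are
# assumptions -- the wrapper's guarantee is just sandbox.execute(command).
from langchain_kubernetes import KubernetesSandbox  # hypothetical import

sandbox = KubernetesSandbox()  # provisions the hardened pod described above

def execute_project(command: str) -> str:
    """Agent tool: run a command (e.g. the test suite) inside the sandbox."""
    return sandbox.execute(command)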
Problem 2: Attribution — which package actually broke the build?
When you upgrade 20 packages at once and the test suite fails, you have an attribution problem. The naive options:
- Test each package individually — accurate, but O(n) sandbox runs. With 20 packages, that's potentially 20 × (clone + install + test). Too slow.
- Trust the bulk output — fast, but a single noisy test failure can look like it's caused by the wrong package.
I landed on a hybrid approach driven by confidence scoring.
Phase 1: Bulk run
Bump everything at once, run tests. If everything passes, every package is non-breaking with confidence 1.0. Done — this is the common case and it's fast.
Phase 2: Confidence scoring (only on failure)
The AI agent reads the test output and assigns a score per package:
| Signal | Confidence effect |
|---|---|
| Error message directly names the package | ≥ 0.8 |
| ImportError or AttributeError for a known API | ≥ 0.8 |
| Major version jump (e.g. 2.x → 3.x) | +0.1–0.2 boost |
| Generic failure, no clear link | < 0.5 |
The default threshold is 0.7 (configurable via MIGRATOWL_CONFIDENCE_THRESHOLD).
Phase 3: Selective isolation
- Above threshold: fetch the changelog, generate the report immediately
- Below threshold: delegate to a package-analyzer subagent — an isolated LangGraph agent that re-runs the test suite with only that package bumped
This keeps the fast path fast and the slow path accurate. A codebase with 30 outdated packages where only 2 have ambiguous failures will only spin up 2 subagents, not 30.
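The routing itself is a few lines. A sketch (only the 0.7 default and the env var name are load-bearing; the function and variable names are illustrative):

import os

# Default threshold of 0.7, overridable via the documented env var.
THRESHOLD = float(os.environ.get("MIGRATOWL_CONFIDENCE_THRESHOLD", "0.7"))

def route(scores: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split scored packages into report-now vs. isolate-in-a-subagent."""
    report_now = [pkg for pkg, c in scores.items() if c >= THRESHOLD]
    isolate = [pkg for pkg, c in scores.items() if c < THRESHOLD]
    return report_now, isolate

# Only the ambiguous bucket spins up subagents.
report_now, isolate = route({"requests": 0.95, "numpy": 0.45, "flask": 0.9})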
Problem 3: Structuring a multi-agent LangGraph workflow
The agent is built on deepagents and LangGraph. The main agent has 10 tools across four phases:
Phase 1: clone_repo → detect_languages → scan_dependencies → check_outdated_deps
Phase 2: copy_source → update_dependencies → execute_project
Phase 3: [confidence scoring] → fetch_changelog → [subagent delegation]
Phase 4: compile results → POST to callback_url
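Phase 4 is deliberately boring: merge the per-package reports and POST them to the caller. Schematically (httpx and the payload shape are illustrative, not the actual wire format):

# Sketch of Phase 4: deliver the merged reports to the caller's webhook.
import httpx

def deliver(callback_url: str, reports: list[dict]) -> None:
    httpx.post(callback_url, json={"reports": reports}, timeout=30)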
The subagent delegation was the cleanest part of the design. A package-analyzer subagent is just another create_agent() call with a scoped tool set — it gets its own workspace directory in the same sandbox pod (/home/user/workspace/<package-name>/), runs independently, and returns a structured AnalysisReport. The main agent merges them at the end.
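In sketch form (create_agent(), the tool names, and the workspace path come from above; the import path, model id, and keyword names are assumptions):

# Sketch of the delegation pattern; stubs stand in for the real tools.
from langchain.agents import create_agent  # assumed import location

def update_dependencies(package: str) -> str:
    """Stub: bump a single package in the subagent's workspace copy."""
    ...

def execute_project(command: str) -> str:
    """Stub: run the project's test suite inside the sandbox pod."""
    ...

def fetch_changelog(package: str) -> str:
    """Stub: retrieve the changelog for the bumped version."""
    ...

def spawn_package_analyzer(package: str):
    return create_agent(
        model="claude-sonnet-4-5",  # illustrative model id
        tools=[update_dependencies, execute_project, fetch_changelog],
        system_prompt=(
            f"Bump only `{package}` in /home/user/workspace/{package}/, "
            "re-run the test suite, and return a structured AnalysisReport."
        ),
    )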
The workspace layout is intentional:
/home/user/workspace/
├── source/ # Immutable clone — never executed directly
├── main/ # All deps bumped, used in Phase 2
└── <package-name>/ # Per-package copy, used by subagents
The immutable source/ directory means we never touch the original clone — each phase and each subagent operates on a fresh copy.
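Fresh copies are nothing fancy, just stdlib directory copies of the immutable clone:

import shutil
from pathlib import Path

WORKSPACE = Path("/home/user/workspace")

def fresh_copy(name: str) -> Path:
    """Copy the immutable source/ clone into a new working directory.

    Used for main/ in Phase 2 and for each per-package subagent copy,
    so no phase ever mutates the original clone.
    """
    dest = WORKSPACE / name
    shutil.copytree(WORKSPACE / "source", dest)
    return dest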
Observability
Agent-based systems are hard to debug when something goes wrong. We use LangFuse for trace-level observability — every scan produces a session with spans for each tool call and subagent run. When a migration fails unexpectedly, you can open the LangFuse trace and see exactly which tool call returned what output.
Tracing is off by default; enabling it is two env vars and no code changes.
What I'd do differently
- The sandbox setup is too complex for a first-time user. minikube + CRD installation is a real barrier. I want to add a Docker Compose mode that wraps the K8s complexity for local dev.
- Confidence scoring heuristics were mostly hand-tuned. I'd like to replace this with a small classifier trained on real migration failure data.
- The raw-pod fallback is slower than agent-sandbox mode (no warm pool). The startup latency difference is noticeable at scale.
Getting started
git clone https://github.com/bitkaio/migratowl
cd migratowl
uv sync
cp .env.example .env
# Set ANTHROPIC_API_KEY
minikube start --driver=docker --memory=8192 --cpus=4
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.1.0/manifest.yaml
kubectl apply -f k8s/
uv run uvicorn migratowl.api.main:app --reload
Trigger a scan:
curl -X POST http://localhost:8000/webhook \
-H 'Content-Type: application/json' \
-d '{"repo_url": "https://github.com/org/repo", "callback_url": "https://yourservice.example.com/results"}'
The full code, K8s manifests, and Dependabot integration workflow are on GitHub: https://github.com/bitkaio/migratowl
Happy to answer questions about the sandbox model, the LangGraph structure, or the confidence scoring design in the comments.