<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Barnabas Kun</title>
    <description>The latest articles on DEV Community by Barnabas Kun (@barnakun).</description>
    <link>https://dev.to/barnakun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878289%2F68e04393-588b-4ccf-84ac-58fb45ccfd1b.jpeg</url>
      <title>DEV Community: Barnabas Kun</title>
      <link>https://dev.to/barnakun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/barnakun"/>
    <language>en</language>
    <item>
      <title>How I built an AI agent that runs your dependency upgrades in a K8s sandbox and scores confidence per package</title>
      <dc:creator>Barnabas Kun</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:43:50 +0000</pubDate>
      <link>https://dev.to/barnakun/how-i-built-an-ai-agent-that-runs-your-dependency-upgrades-in-a-k8s-sandbox-and-scores-confidence-43cm</link>
      <guid>https://dev.to/barnakun/how-i-built-an-ai-agent-that-runs-your-dependency-upgrades-in-a-k8s-sandbox-and-scores-confidence-43cm</guid>
      <description>&lt;p&gt;Dependabot is great at one thing: opening PRs when a dependency is outdated.&lt;/p&gt;

&lt;p&gt;It's not great at telling you whether merging that PR will break your build. For that, you still have to read changelogs manually, run tests locally, and hope you catch everything before it hits staging.&lt;/p&gt;

&lt;p&gt;I spent the last few months building &lt;a href="https://github.com/bitkaio/migratowl" rel="noopener noreferrer"&gt;Migratowl&lt;/a&gt; to close that gap. This is a writeup of the most interesting engineering decisions.&lt;/p&gt;




&lt;h2&gt;What it does (briefly)&lt;/h2&gt;

&lt;p&gt;Migratowl receives a webhook with a repo URL, runs a four-phase AI agent workflow inside an ephemeral Kubernetes pod, and delivers a structured JSON report per dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependency_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_breaking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error_summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ImportError: cannot import name 'PreparedRequest'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"changelog_citation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## 3.0.0 — Removed PreparedRequest from the public API."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"suggested_human_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Replace `from requests import PreparedRequest` with `requests.models.PreparedRequest`."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
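
&lt;p&gt;A typed container can make the report shape above easier to consume. This is a minimal sketch only: the field names come from the sample JSON, while the &lt;code&gt;AnalysisReport&lt;/code&gt; layout, the &lt;code&gt;parse_report&lt;/code&gt; helper, and the range check on &lt;code&gt;confidence&lt;/code&gt; are assumptions, not Migratowl's actual code.&lt;/p&gt;

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisReport:
    """Hypothetical mirror of the per-dependency JSON report shown above."""

    dependency_name: str
    is_breaking: bool
    error_summary: str
    changelog_citation: str
    suggested_human_fix: str
    confidence: float

    def __post_init__(self) -> None:
        # Assumption: confidence is a score between 0.0 and 1.0 inclusive.
        if self.confidence > 1.0 or 0.0 > self.confidence:
            raise ValueError("confidence must be between 0.0 and 1.0")


def parse_report(raw: str) -> AnalysisReport:
    """Parse one JSON report; missing or unexpected keys raise TypeError."""
    return AnalysisReport(**json.loads(raw))


# Illustrative payload with the same keys as the sample report.
sample = json.dumps(
    {
        "dependency_name": "requests",
        "is_breaking": True,
        "error_summary": "ImportError: cannot import name 'PreparedRequest'",
        "changelog_citation": "## 3.0.0: Removed PreparedRequest from the public API.",
        "suggested_human_fix": "Import PreparedRequest from requests.models.",
        "confidence": 0.95,
    }
)
report = parse_report(sample)
```

&lt;p&gt;Freezing the dataclass keeps a report from being mutated after the agent has produced it, which is one reasonable choice for data that is merged later.&lt;/p&gt;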



&lt;p&gt;It supports Python, Node.js, Go, Rust, and Java. It integrates with Dependabot via a GitHub Actions workflow — every Dependabot PR gets an analysis comment before anyone reviews it.&lt;/p&gt;

&lt;p&gt;But the interesting parts are in the implementation. Let's get into them.&lt;/p&gt;




&lt;h2&gt;Problem 1: Running untrusted code safely&lt;/h2&gt;

&lt;p&gt;Dependency analysis requires actually running the code. You clone a repo you don't fully control, bump a bunch of packages, and execute a test suite. That's a meaningful attack surface.&lt;/p&gt;

&lt;p&gt;My threat model: a malicious &lt;code&gt;conftest.py&lt;/code&gt; or &lt;code&gt;setup.py&lt;/code&gt; that tries to exfiltrate data, write to the host filesystem, or make outbound network calls.&lt;/p&gt;

&lt;p&gt;The solution is one Kubernetes pod per analysis, with these hardening settings enabled by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ALL&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;seccompProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeDefault&lt;/span&gt;
&lt;span class="na"&gt;automountServiceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a deny-all &lt;code&gt;NetworkPolicy&lt;/code&gt; on both ingress and egress — the sandbox has no network access at all once the analysis starts. Source cloning happens before the pod is fully isolated.&lt;/p&gt;

&lt;p&gt;For clusters that support it, we use &lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;&lt;code&gt;kubernetes-sigs/agent-sandbox&lt;/code&gt;&lt;/a&gt;, which adds gVisor/Kata isolation and warm pod pools for sub-second startup. On any standard cluster, there's a raw-pod fallback.&lt;/p&gt;

&lt;p&gt;This is all wrapped behind &lt;a href="https://github.com/bitkaio/langchain-kubernetes" rel="noopener noreferrer"&gt;&lt;code&gt;langchain-kubernetes&lt;/code&gt;&lt;/a&gt;, which exposes a simple &lt;code&gt;sandbox.execute(command)&lt;/code&gt; interface to the agent.&lt;/p&gt;




&lt;h2&gt;Problem 2: Attribution — which package actually broke the build?&lt;/h2&gt;

&lt;p&gt;When you upgrade 20 packages at once and the test suite fails, you have an attribution problem. The naive options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test each package individually&lt;/strong&gt; — accurate, but O(n) sandbox runs. With 20 packages, that's potentially 20 × (clone + install + test). Too slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust the bulk output&lt;/strong&gt; — fast, but a single noisy test failure can look like it's caused by the wrong package.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I landed on a hybrid approach driven by &lt;strong&gt;confidence scoring&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Phase 1: Bulk run&lt;/h3&gt;

&lt;p&gt;Bump everything at once, run tests. If everything passes, every package is non-breaking with confidence &lt;code&gt;1.0&lt;/code&gt;. Done — this is the common case and it's fast.&lt;/p&gt;

&lt;h3&gt;Phase 2: Confidence scoring (only on failure)&lt;/h3&gt;

&lt;p&gt;The AI agent reads the test output and assigns a score per package:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Confidence effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Error message directly names the package&lt;/td&gt;
&lt;td&gt;&lt;code&gt;≥ 0.8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ImportError or AttributeError for a known API&lt;/td&gt;
&lt;td&gt;&lt;code&gt;≥ 0.8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Major version jump (e.g. &lt;code&gt;2.x → 3.x&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;+0.1–0.2&lt;/code&gt; boost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic failure, no clear link&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt; 0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default threshold is &lt;code&gt;0.7&lt;/code&gt; (configurable via &lt;code&gt;MIGRATOWL_CONFIDENCE_THRESHOLD&lt;/code&gt;).&lt;/p&gt;
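
&lt;p&gt;To make the table concrete, here is a rule-based approximation of it. In Migratowl the score is assigned by the AI agent, so this is a stand-in for illustration and not the actual implementation. The &lt;code&gt;Signals&lt;/code&gt; type, the &lt;code&gt;score&lt;/code&gt; function, and the base values &lt;code&gt;0.8&lt;/code&gt; and &lt;code&gt;0.3&lt;/code&gt; are assumptions picked to fall inside the ranges in the table.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical per-package evidence extracted from the test output."""

    names_package: bool       # error message directly names the package
    known_api_error: bool     # ImportError or AttributeError for a known API
    major_version_jump: bool  # for example 2.x to 3.x


def score(signals: Signals) -> float:
    """Map the signals from the table to a confidence value."""
    # Assumed base values: 0.8 for a direct link, 0.3 for a generic failure.
    if signals.names_package or signals.known_api_error:
        base = 0.8
    else:
        base = 0.3
    if signals.major_version_jump:
        base += 0.1  # low end of the 0.1 to 0.2 boost in the table
    # Cap at 1.0 and round away floating-point noise.
    return round(min(base, 1.0), 2)


direct_hit = score(Signals(True, False, True))
generic = score(Signals(False, False, False))
```

&lt;p&gt;With these assumed values, a generic failure stays below the &lt;code&gt;0.7&lt;/code&gt; default even with the major-version boost applied.&lt;/p&gt;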

&lt;h3&gt;Phase 3: Selective isolation&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Above threshold:&lt;/strong&gt; fetch the changelog, generate the report immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Below threshold:&lt;/strong&gt; delegate to a &lt;code&gt;package-analyzer&lt;/code&gt; subagent — an isolated LangGraph agent that re-runs the test suite with only that package bumped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the fast path fast and the slow path accurate. A codebase with 30 outdated packages where only 2 have ambiguous failures will only spin up 2 subagents, not 30.&lt;/p&gt;
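
&lt;p&gt;The routing step can be sketched in a few lines. The environment variable name and the &lt;code&gt;0.7&lt;/code&gt; default come from the text above. The &lt;code&gt;load_threshold&lt;/code&gt; and &lt;code&gt;route&lt;/code&gt; helpers, the sample scores, and the choice to treat a score exactly at the threshold as confident are assumptions made for illustration.&lt;/p&gt;

```python
import os


def load_threshold(env=os.environ) -> float:
    """Read the threshold, falling back to the documented default of 0.7."""
    return float(env.get("MIGRATOWL_CONFIDENCE_THRESHOLD", "0.7"))


def route(scores: dict[str, float], threshold: float) -> tuple[list[str], list[str]]:
    """Split packages into (report_now, needs_isolation)."""
    report_now: list[str] = []
    needs_isolation: list[str] = []
    for package, confidence in sorted(scores.items()):
        # Assumption: a score equal to the threshold counts as confident.
        if confidence >= threshold:
            report_now.append(package)
        else:
            needs_isolation.append(package)
    return report_now, needs_isolation


# Illustrative scores, not real Migratowl output.
scores = {"requests": 0.95, "urllib3": 0.4, "attrs": 0.9, "pyyaml": 0.35}
report_now, needs_isolation = route(scores, load_threshold({}))
```

&lt;p&gt;Only the packages in &lt;code&gt;needs_isolation&lt;/code&gt; would be handed to a subagent, so the number of extra sandbox runs scales with the ambiguous failures rather than with the total package count.&lt;/p&gt;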




&lt;h2&gt;Problem 3: Structuring a multi-agent LangGraph workflow&lt;/h2&gt;

&lt;p&gt;The agent is built on &lt;a href="https://github.com/langchain-ai/deepagents" rel="noopener noreferrer"&gt;deepagents&lt;/a&gt; and LangGraph. The main agent has 10 tools across four phases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: clone_repo → detect_languages → scan_dependencies → check_outdated_deps
Phase 2: copy_source → update_dependencies → execute_project
Phase 3: [confidence scoring] → fetch_changelog → [subagent delegation]
Phase 4: compile results → POST to callback_url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The subagent delegation was the cleanest part of the design. A &lt;code&gt;package-analyzer&lt;/code&gt; subagent is just another &lt;code&gt;create_agent()&lt;/code&gt; call with a scoped tool set — it gets its own workspace directory in the same sandbox pod (&lt;code&gt;/home/user/workspace/&amp;lt;package-name&amp;gt;/&lt;/code&gt;), runs independently, and returns a structured &lt;code&gt;AnalysisReport&lt;/code&gt;. The main agent merges them at the end.&lt;/p&gt;

&lt;p&gt;The workspace layout is intentional:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;/home/user/workspace/
├── source/          # Immutable clone — never executed directly
├── main/            # All deps bumped, used in Phase 2
└── &lt;span class="nt"&gt;&amp;lt;package-name&amp;gt;&lt;/span&gt;/  # Per-package copy, used by subagents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The immutable &lt;code&gt;source/&lt;/code&gt; directory means we never touch the original clone — each phase and each subagent operates on a fresh copy.&lt;/p&gt;
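
&lt;p&gt;The copy-before-modify pattern behind this layout might look like the sketch below. The &lt;code&gt;fresh_copy&lt;/code&gt; helper is hypothetical, the demo runs in a temporary directory rather than &lt;code&gt;/home/user/workspace/&lt;/code&gt;, and the version pins are illustrative only.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path


def fresh_copy(workspace: Path, name: str) -> Path:
    """Copy workspace/source into workspace/name and return the new path."""
    target = workspace / name
    if target.exists():
        shutil.rmtree(target)  # always start from a clean tree
    shutil.copytree(workspace / "source", target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    # Stand-in for the cloned repository.
    (workspace / "source").mkdir()
    (workspace / "source" / "requirements.txt").write_text("requests==2.31.0\n")

    # Bump the dependency in the copy only; source/ is never written to.
    main = fresh_copy(workspace, "main")
    (main / "requirements.txt").write_text("requests==3.0.0\n")

    original = (workspace / "source" / "requirements.txt").read_text()
    bumped = (main / "requirements.txt").read_text()
```

&lt;p&gt;Calling the same helper with a package name instead of &lt;code&gt;"main"&lt;/code&gt; would give a subagent its own directory that starts from the untouched clone.&lt;/p&gt;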




&lt;h2&gt;Observability&lt;/h2&gt;

&lt;p&gt;Agent-based systems are hard to debug when something goes wrong. We use &lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt; for trace-level observability: every scan produces a session with spans for each tool call and subagent run. When a migration fails unexpectedly, you can open the Langfuse trace and see exactly which tool call returned what output.&lt;/p&gt;

&lt;p&gt;Tracing is off by default and can be enabled with two environment variables. No code changes are needed.&lt;/p&gt;




&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The sandbox setup is too complex for a first-time user.&lt;/strong&gt; Installing minikube and the CRDs is a real barrier. I want to add a Docker Compose mode that hides the K8s complexity for local dev.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring heuristics were mostly hand-tuned.&lt;/strong&gt; I'd like to replace them with a small classifier trained on real migration failure data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The raw-pod fallback is slower&lt;/strong&gt; than agent-sandbox mode (no warm pool). The startup latency difference is noticeable at scale.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Getting started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bitkaio/migratowl
&lt;span class="nb"&gt;cd &lt;/span&gt;migratowl
uv &lt;span class="nb"&gt;sync
cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Set ANTHROPIC_API_KEY&lt;/span&gt;
minikube start &lt;span class="nt"&gt;--driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8192 &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.1.0/manifest.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; k8s/
uv run uvicorn migratowl.api.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger a scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/webhook &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"repo_url": "https://github.com/org/repo", "callback_url": "https://yourservice.example.com/results"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
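
&lt;p&gt;The same request can be built from Python with only the standard library. The endpoint and payload keys are taken from the &lt;code&gt;curl&lt;/code&gt; call above. The &lt;code&gt;build_scan_request&lt;/code&gt; helper is hypothetical, and the send step is left commented out because it needs the local server to be running.&lt;/p&gt;

```python
import json
import urllib.request


def build_scan_request(base_url: str, repo_url: str, callback_url: str) -> urllib.request.Request:
    """Build the POST request for the /webhook endpoint without sending it."""
    payload = json.dumps({"repo_url": repo_url, "callback_url": callback_url})
    return urllib.request.Request(
        url=f"{base_url}/webhook",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scan_request(
    "http://localhost:8000",
    "https://github.com/org/repo",
    "https://yourservice.example.com/results",
)

# To send it once the server is up:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```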






&lt;p&gt;The full code, K8s manifests, and Dependabot integration workflow are on GitHub: &lt;a href="https://github.com/bitkaio/migratowl" rel="noopener noreferrer"&gt;https://github.com/bitkaio/migratowl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions about the sandbox model, the LangGraph structure, or the confidence scoring design in the comments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
