<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kalyan Ram Jaladi</title>
    <description>The latest articles on DEV Community by Kalyan Ram Jaladi (@kalyanjaladi).</description>
    <link>https://dev.to/kalyanjaladi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3884810%2F585fa7d4-8198-42b7-ac17-890786ab4119.jpeg</url>
      <title>DEV Community: Kalyan Ram Jaladi</title>
      <link>https://dev.to/kalyanjaladi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kalyanjaladi"/>
    <language>en</language>
    <item>
      <title>AI Ops Agents Are a New Class of Attack Surface</title>
      <dc:creator>Kalyan Ram Jaladi</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:36:34 +0000</pubDate>
      <link>https://dev.to/kalyanjaladi/ai-ops-agents-are-a-new-class-of-attack-surface-3g8f</link>
      <guid>https://dev.to/kalyanjaladi/ai-ops-agents-are-a-new-class-of-attack-surface-3g8f</guid>
      <description>&lt;p&gt;&lt;strong&gt;Decades of operational tribal knowledge are now concentrated in one system. That concentration is the feature, and the vulnerability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My first thought after reading about the Azure SRE Agent CVE was not about Microsoft's bug. It was about a new attack surface. The agent's security model has not caught up to what the agent can reach.&lt;/p&gt;

&lt;p&gt;Earlier this month, security researcher Yanir Tsarimi at Enclave AI disclosed CVE-2026-32173 against Azure SRE Agent, rated CVSS 8.6. Azure SRE Agent and its AWS counterpart, AWS DevOps Agent, are a new generation of cloud-native agents built to do what senior site reliability engineers do: triage incidents, query logs across services, correlate metrics with traces, and propose or execute fixes. Both vendors position these agents as the future of operations, available 24x7, queryable in natural language, and capable of acting on the cloud accounts they are wired into. They run as managed multi-tenant SaaS in the vendor's infrastructure.&lt;/p&gt;

&lt;p&gt;The flaw allowed an attacker with any valid Entra ID token, from any Microsoft tenant, to silently eavesdrop on another customer's agent activity. The agent's transport layer accepted the token without checking whether it had any business looking at the victim tenant's data. Live agent conversations, log queries, command outputs, the agent's reasoning as it worked through an incident, all of it readable by an attacker who only needed to spin up a free Microsoft tenant of their own.&lt;/p&gt;

&lt;p&gt;Microsoft acknowledged the report and patched the issue server-side. No customer action was required, and there is no public proof-of-concept exploit. But the architectural premise that made this attack possible is still standing in every enterprise that adopts an SRE or DevOps agent over the next year.&lt;/p&gt;

&lt;p&gt;What follows is the Pandora's box this bug opened for me. First, some context: the recent CVEs that pried it open.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Pandora's box
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxe40azqz5i0purq59wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxe40azqz5i0purq59wr.png" alt="Recent vulnerabilities in agentic systems" width="800" height="840"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different attack classes. Different root causes. What they share is the architectural premise: an agent with broad operational reach, autonomy to act, and trust from humans to do useful work. When any one of those is exploited, the consequence scales with what the agent can reach, not with the depth of the bug.&lt;/p&gt;

&lt;p&gt;LangChain and the open source frameworks (LangGraph, CrewAI, and others) take their share of vulnerabilities, partly because they are the most popular targets, partly because their code is open to scrutiny. The good news is that fixes ship in days. Cloud-native agents are a different beast. Azure SRE Agent, AWS DevOps Agent, and the equivalent offerings from other vendors operate with elevated privileges inside proprietary ecosystems. The complexity is massive, the attack surface is opaque to customers, and the patching depends entirely on the vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  New gates for agents (and why enterprises are right to wait)
&lt;/h2&gt;

&lt;p&gt;Traditional enterprise security has well-understood gates. We use SonarQube, Nexus IQ, Aqua Security, and similar tools to cover code quality, dependency vulnerabilities, container image scanning, and base image hygiene. None of them are trained on what an agent can actually do once it is deployed. The category does not exist in their product mental model yet.&lt;/p&gt;

&lt;p&gt;From my own experience working in regulated environments, this is exactly why agent adoption has been slower than vendors expected. And the delay is justified, not stubborn. Every new architecture pattern in a regulated enterprise has to pass through the architecture review group. Getting a new API pattern from on-premises to cloud approved involves serious questions: does the traffic sit behind the approved gateway? Is it routed through the existing F5 and Akamai layers? Are the backend images fully scanned? Does it follow the approved load balancer pattern? RBAC is enforced at every layer, every action is consent-gated, and no new design pattern goes into production without sign-off.&lt;/p&gt;

&lt;p&gt;These gates exist for good reasons, and they work for traditional services. They do not yet have categories for agents. The new gates that have to come, based on what the recent CVEs have exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant isolation enforced at the application layer&lt;/strong&gt;, not just the auth layer. The Azure CVE is what happens when you assume the auth layer covers it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server provenance.&lt;/strong&gt; How do you verify that the MCP server you are loading is the one the vendor signed, and not a typosquatted version that an attacker registered last week? Custom MCP tools will need their own security review process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop on destructive operations.&lt;/strong&gt; The old four-eyes check, where two engineers approve a destructive operation, will become standard for agent-initiated changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-agent zero-trust in multi-agent systems.&lt;/strong&gt; An agent should not trust another agent's request just because they are inside the same orchestration. Signed intent on tokens, re-authorization at each privilege boundary, audit trails that survive even if one agent is compromised.&lt;/li&gt;
&lt;/ul&gt;
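&lt;p&gt;The first gate can be made concrete. Below is a minimal sketch of an application-layer tenant check, the step the Azure CVE skipped. The claim name, session registry, and exception are illustrative assumptions, not any vendor's API:&lt;/p&gt;

```python
# Sketch: application-layer tenant isolation, assuming a token already
# validated by the auth layer and decoded into a claims dict.
class TenantMismatch(Exception):
    pass

# Hypothetical registry mapping agent session IDs to the tenant that owns them.
SESSION_OWNER = {"session-42": "tenant-a"}

def authorize_stream(decoded_token: dict, session_id: str) -> None:
    """Reject a subscription unless the token's tenant owns the session.

    A valid signature only proves the caller is some tenant's user;
    ownership must be re-checked here, at the application layer.
    """
    caller_tenant = decoded_token.get("tid")
    owner_tenant = SESSION_OWNER.get(session_id)
    if owner_tenant is None or caller_tenant != owner_tenant:
        raise TenantMismatch(f"{caller_tenant!r} may not read {session_id!r}")

# A token from another tenant passes signature checks but fails here.
authorize_stream({"tid": "tenant-a"}, "session-42")   # allowed
try:
    authorize_stream({"tid": "tenant-b"}, "session-42")
except TenantMismatch:
    pass  # blocked: valid token, wrong tenant
```

&lt;p&gt;The point of the sketch: a validly signed token only proves the caller is &lt;em&gt;some&lt;/em&gt; tenant's user. Ownership of the specific resource has to be re-checked where the data is served.&lt;/p&gt;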

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd67ellcawewosj6ows33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd67ellcawewosj6ows33.png" alt="Enterprise review gates" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine a banking client deploying an SRE agent and getting hit with the Azure CVE before the patch. The blast radius would not be one service. It would be operational intelligence across the bank's entire incident history.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tribal knowledge is now the agent's
&lt;/h2&gt;

&lt;p&gt;The analogy I am thinking of is this. A senior support or ops engineer at a large company carries a body of context in their head that lets them solve complex problems fast. The technical term for this is &lt;em&gt;tribal knowledge&lt;/em&gt;: the unwritten, accumulated understanding of how the system actually behaves, which alerts matter, which workarounds work, what was tried last time, why this particular service has that strange retry logic. It is the knowledge that does not exist in any document because it was learned through years of incidents.&lt;/p&gt;

&lt;p&gt;A typical platform team has maybe ten such engineers. Their cumulative tribal knowledge is the actual reason incidents get resolved in minutes instead of hours.&lt;/p&gt;

&lt;p&gt;An AI Ops agent compresses that knowledge into a single system, and it does so by design. Several mechanisms compound to make this happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool proliferation.&lt;/strong&gt; The agent connects to dozens of MCP servers including observability platforms, code repositories, CI systems, ticketing systems, and runbooks. Each one adds reach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills.&lt;/strong&gt; Skills are the instruction manuals that tell the agent how to use the tools. AWS DevOps Agent and Azure SRE Agent both ship with pre-loaded Skills that encode the topology, the conventions, the patterns of the environment they operate in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG databases.&lt;/strong&gt; Past incidents, postmortems, runbooks, and architectural documents are indexed into retrieval-augmented generation stores. This is how the agent learns the equivalent of "what happened last time."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation memory.&lt;/strong&gt; Across sessions, the agent retains operator intent, recent decisions, and reasoning traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is one system with the cumulative tribal knowledge of ten ops engineers, available 24x7, queryable in natural language, reachable over the network. That compression is the value proposition. It is also a new class of attack surface, because compromising one agent yields more than compromising any single underlying service ever could.&lt;/p&gt;

&lt;p&gt;This is what I mean by &lt;em&gt;the concentration is the feature, and the vulnerability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A note on Skills, because they do not get enough attention. Skills are simple in form, often short markdown files, but they are what gives an agent the appearance of expertise. They are how you give an agent the tribal knowledge that it does not have by default. A short instruction document that captures the conventions of the team, the names of services, the meanings of error codes, the workarounds that have accumulated over years, that is what makes the difference between an agent that gets stuck in a reasoning loop and an agent that finds the answer in seconds. Which is also why Skills are a security concern. If the agent loads instructions from an untrusted source, those instructions are now part of the agent's behavior. EchoLeak from June 2025 is the classic example of how Skills and instructions become an attack vector. Skills make the agent more powerful, and they expand the surface area of what an attacker can manipulate.&lt;/p&gt;
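&lt;p&gt;To make Skills concrete, here is a hypothetical example of one, a short markdown file of the kind described above. Every service name, error code, and workaround in it is invented for illustration:&lt;/p&gt;

```markdown
# Skill: payments-service triage (hypothetical example)

## When to use
The payments-dev or payments-prod namespace is alerting, or build
failures mention the payments Helm release.

## Conventions
- Namespaces follow {service}-{env}: payments-dev, payments-prod.
- Error code PX-1042 means the retry queue is saturated; it is almost
  never a database problem, despite what the stack trace suggests.

## Known workarounds
- If pods stick in Init after a deploy, restart the config-sync pod
  first. This has resolved the last three incidents of this shape.
```

&lt;p&gt;Benign in the right hands. Loaded from an untrusted source, the same mechanism rewrites the agent's behavior.&lt;/p&gt;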

&lt;h2&gt;
  
  
  The new threat model agents deserve
&lt;/h2&gt;

&lt;p&gt;Once you start looking at agents through this lens, several attack categories become more interesting. The OWASP GenAI Security Project published the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;OWASP Top 10 for Agentic Applications 2026&lt;/a&gt; in December 2025, peer-reviewed by over 100 industry contributors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1n4ft9di6g1ch9945dno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1n4ft9di6g1ch9945dno.png" alt="OWASP Top 10 for Agentic Applications 2026" width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few of the categories are worth calling out because they map directly to what the recent CVEs have shown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool poisoning&lt;/strong&gt; (ASI02). An attacker compromises the descriptor or metadata of an MCP tool, so that when the agent loads the tool at runtime, it loads malicious capability descriptions. The agent then invokes the tool based on falsified metadata. This is not theoretical. The malicious MCP server impersonating Postmark on npm, reported in September 2025, was the first documented in-the-wild case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversary-in-the-middle&lt;/strong&gt; (covered under ASI07 Insecure Inter-Agent Communication). Multi-agent systems pass messages between agents, often over weakly authenticated channels. An attacker positioned in the middle can intercept and manipulate those messages, hijacking the goals or actions of downstream agents.&lt;/p&gt;
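&lt;p&gt;The mitigation is old cryptography applied to a new channel: sign the message so a middleman cannot rewrite the goal in transit. A minimal sketch, using a single shared HMAC key for brevity where a real system would use per-agent keys and rotation:&lt;/p&gt;

```python
# Sketch: signed inter-agent messages. SHARED_KEY is illustrative only;
# real deployments would use per-agent keys, rotation, and replay protection.
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key"

def sign(message: dict) -> dict:
    """Wrap a message in an envelope carrying an HMAC tag over its payload."""
    payload = json.dumps(message, sort_keys=True).encode()
    tag = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return {"payload": message, "tag": tag}

def verify(envelope: dict) -> bool:
    """Recompute the tag; any in-transit edit to the payload invalidates it."""
    payload = json.dumps(envelope["payload"], sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["tag"])

env = sign({"goal": "restart payments-dev pods"})
assert verify(env)
env["payload"]["goal"] = "delete prod database"   # tampering in transit
assert not verify(env)                            # downstream agent refuses it
```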

&lt;p&gt;&lt;strong&gt;Goal hijacking and prompt injection&lt;/strong&gt; (ASI01). The most discussed category, and rightly so. EchoLeak demonstrated that a single crafted email could redirect an agent's goals without any user interaction. The pattern works whenever the agent ingests untrusted natural language as part of its input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity and privilege abuse&lt;/strong&gt; (ASI03). This is what the Azure SRE Agent CVE was. An agent operates without a strong identity of its own, inherits permissions from the user it acts on behalf of, and the boundary between agent identity and user identity blurs in dangerous ways.&lt;/p&gt;

&lt;p&gt;This is not the full list. Memory poisoning, supply chain compromise, cascading failures across multi-agent systems, rogue agents, and human-agent trust exploitation are all in the OWASP doc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;This is a new game and an exciting one. AI is being called one of the great inventions of the century, and I broadly agree. But that versatility deserves a dedicated threat-modeling discipline, which does not yet exist in most enterprises. The new architectural category disrupts how we think about review, governance, and security ownership. It will create new jobs and new roles.&lt;/p&gt;

&lt;p&gt;I am still working through what this means for my own practice. The post is more of a thinking-out-loud than a recommendation. If you are seeing this differently from where you sit, I would be interested to hear about it.&lt;/p&gt;

</description>
      <category>security</category>
      <category>agents</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Your DevOps automation is invisible to AI. That's AI-Debt. And it's compounding.</title>
      <dc:creator>Kalyan Ram Jaladi</dc:creator>
      <pubDate>Fri, 17 Apr 2026 15:56:01 +0000</pubDate>
      <link>https://dev.to/kalyanjaladi/your-devops-automation-is-invisible-to-ai-thats-ai-debt-and-its-compounding-1i1h</link>
      <guid>https://dev.to/kalyanjaladi/your-devops-automation-is-invisible-to-ai-thats-ai-debt-and-its-compounding-1i1h</guid>
      <description>&lt;p&gt;&lt;em&gt;A new concept for platform and DevOps engineers, and why the window to act is narrower than you think.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A few months ago I set out to build an internal DevOps agent. The goal was straightforward: diagnose pipeline failures and surface root causes faster than any engineer could manually. I was writing Python functions, connecting to the ADO REST API, the Kubernetes client, the Azure SDK. Building the integration layer from scratch.&lt;/p&gt;

&lt;p&gt;A senior colleague asked one question that changed everything: &lt;em&gt;"Have you looked at the Azure MCP Server?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I hadn't. That question opened a window into an entire vendor ecosystem being assembled at speed, and into a far more important question about what it had not yet built. That gap has a name. This article is about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The future is agent-orchestration. MCP is its language.
&lt;/h2&gt;

&lt;p&gt;We are moving from a world where automation meant writing explicit instructions for machines, to one where autonomous agents receive a goal and reason their way to achieving it. For every DevOps and platform team, the question is whether their existing automation will be visible to those agents, or invisible.&lt;/p&gt;

&lt;p&gt;The interface that makes automation agent-visible is the &lt;em&gt;tool&lt;/em&gt;: a callable function an agent can discover, invoke, and reason over. The open standard governing how tools are described, discovered, and called is Model Context Protocol (MCP).&lt;/p&gt;

&lt;p&gt;MCP has three components. The &lt;strong&gt;MCP Host&lt;/strong&gt; is the environment where the agent runs (an IDE like Cursor or Kiro, a platform like GitHub Copilot, or a custom agent you build). It contains the LLM doing the reasoning. The &lt;strong&gt;MCP Client&lt;/strong&gt; lives inside the host and handles protocol communication. The &lt;strong&gt;MCP Server&lt;/strong&gt; is where tools live, exposing callable functions and responding to invocations.&lt;/p&gt;

&lt;p&gt;In Python, a tool on an MCP Server looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_build_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;build_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve the full log output for a specific ADO pipeline build.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator is the registration contract. The docstring (the text in triple quotes) is what the agent reads when deciding whether and how to call this function. Not optional documentation. It is the agent's primary reasoning interface into your tool. More on this shortly.&lt;/p&gt;

&lt;p&gt;SDK downloads grew from 100,000 at launch to over 8 million by April 2025. In December 2025, Anthropic donated MCP to the Linux Foundation, co-governed by OpenAI, Google, Microsoft, AWS, and Salesforce. The same governance move Kubernetes made in 2016. When something enters Linux Foundation governance, it stops being one vendor's experiment and becomes shared infrastructure.&lt;/p&gt;

&lt;p&gt;Every major cloud vendor has now built production-grade MCP servers for their own ecosystems. What matters for this article is what they have built, what they haven't, and what that gap means for every DevOps platform already running.&lt;/p&gt;

&lt;h2&gt;
  
  
  What vendors have built — and why AWS is leading this race
&lt;/h2&gt;

&lt;p&gt;Azure MCP Server (GA) exposes 40+ Azure services as agent-callable tools. The ADO MCP Server covers pipelines, builds, pull requests, and repositories. AWS embedded MCP into Bedrock AgentCore with IAM permissions and CloudTrail audit logging per tool call. Google released fully managed MCP servers for GKE, BigQuery, AlloyDB, and Spanner. Microsoft shipped a dedicated SQL MCP Server covering SQL Server, PostgreSQL, and Cosmos DB. Zero code, open source, free.&lt;/p&gt;

&lt;p&gt;Beyond their own services, the vendors have gone further, building tools to make &lt;em&gt;your existing automation&lt;/em&gt; agent-ready without custom code. This is where the comparison gets interesting.&lt;/p&gt;

&lt;p&gt;AWS is the clear pioneer. At AWS Summit New York in July 2025, they announced a $100 million investment in their Generative AI Innovation Center, with agentic AI as the centrepiece. At re:Invent 2025, they shipped three domain-specific autonomous agents: a DevOps Agent, a Security Agent, and Kiro, an agentic IDE. Their open-source Strands Agents SDK introduced a model-first design philosophy. Instead of developers hardcoding every workflow path, the LLM reasons over available tools and decides the path itself. AWS has made agent development a first-class engineering discipline, with tooling, documentation, and production infrastructure to match.&lt;/p&gt;

&lt;p&gt;The centrepiece for existing automation is &lt;strong&gt;AgentCore Gateway&lt;/strong&gt;: a fully managed service that converts your existing APIs and Lambda functions into MCP-compatible tools automatically. You provide an OpenAPI specification or a Lambda ARN, and the Gateway handles protocol translation, authentication, semantic tool discovery, and observability. No custom code required.&lt;/p&gt;

&lt;p&gt;Azure has equivalent capability but spread across three services. Azure APIM can expose any REST API as an MCP server: import your OpenAPI spec, click "Create MCP server," done. Azure Functions has a native &lt;code&gt;mcpToolTrigger&lt;/code&gt; binding. Microsoft Foundry provides governance across 1,400+ connectors alongside custom tools, authenticated through Entra. The capability is there, but it requires more coordination across services compared to AWS's single-surface approach.&lt;/p&gt;

&lt;p&gt;Google's Apigee converts any managed API to an MCP server without changing the underlying service. Powerful for GCP-native APIs, but Apigee has historically been a complex enterprise product and lacks the seamless function-wrapping simplicity of AgentCore Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdmqcw9274i1h6atpmjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdmqcw9274i1h6atpmjr.png" alt="Vendor MCP comparison: AWS AgentCore Gateway vs Azure APIM vs GCP Apigee showing what each can auto-wrap" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The green rows are real, meaningful progress. AWS leads on developer experience and unification. The red rows tell the more important story. And they have a name.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-Debt
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI-Debt is the human automation your team built that AI agents cannot reach.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Human&lt;/em&gt; is the key word. This is automation written by engineers, for engineers: scripts run manually, pipelines triggered by people, YAML files committed to repos and executed on build agents with specific tooling installed. It works perfectly for the people who use it today. The debt only becomes visible the day an agent arrives and has nothing to call.&lt;/p&gt;

&lt;p&gt;AI-Debt has two distinct components: &lt;strong&gt;Interface-Debt&lt;/strong&gt; and &lt;strong&gt;Context-Debt&lt;/strong&gt;. Both matter, and neither is solved by any vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's locked: Interface-Debt
&lt;/h2&gt;

&lt;p&gt;Interface-Debt is automation that exists but cannot be called by agents. It has no discoverable interface, no function signature, no API endpoint, no callable handle that an agent can find and invoke.&lt;/p&gt;

&lt;p&gt;Your ADO pipeline YAML that runs a Helmfile deployment: not callable by agents. Your PowerShell script that creates Azure resources: not callable by agents. Your Bash script that validates secrets before a deployment: not callable by agents. Your kubectl wrapper that diagnoses stuck pods: not callable by agents. Bicep templates, ARM parameters, Makefile targets, cron scripts: none of them are discoverable or invocable.&lt;/p&gt;

&lt;p&gt;Vendors are reducing Interface-Debt for &lt;strong&gt;API-surface automation&lt;/strong&gt; (code that already has a callable interface: Lambda functions, REST APIs, Azure Functions, managed cloud endpoints). Valuable progress. But it only covers automation that already exposes a typed, invocable surface. A Bash script running on an ADO pipeline agent has no Lambda ARN. A Helmfile task has no OpenAPI spec. The auto-wrap tools have nothing to target.&lt;/p&gt;

&lt;p&gt;For Azure/ADO environments specifically, the gap is significant. Years of YAML. Hundreds of pipeline tasks. Thousands of shell functions. All invisible to agents.&lt;/p&gt;
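&lt;p&gt;What paying down Interface-Debt looks like, in miniature: give the script a discoverable, typed, documented handle. The tiny registry below stands in for an MCP server's &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator; it is a sketch of the shape, not the real SDK, and the script invocation is replaced with a stand-in command:&lt;/p&gt;

```python
# Sketch: a toy tool registry mimicking what an MCP server does at
# registration time. Illustrative only, not the MCP SDK.
import subprocess

TOOLS: dict = {}

def tool(fn):
    """Register a function as agent-discoverable, keyed by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def validate_secrets(env: str) -> str:
    """Run the team's pre-deployment secret validation for one environment."""
    # In the real tool this would invoke the existing Bash script, e.g.
    # subprocess.run(["./scripts/validate-secrets.sh", env], ...).
    result = subprocess.run(
        ["echo", f"secrets ok for {env}"],  # stand-in for the script
        capture_output=True, text=True,
    )
    return result.stdout.strip()

# The script itself is unchanged; it simply gained a discoverable,
# typed, documented surface an agent can enumerate and invoke.
print(sorted(TOOLS))                              # ['validate_secrets']
print(TOOLS["validate_secrets"]("payments-dev"))  # secrets ok for payments-dev
```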

&lt;h2&gt;
  
  
  What's dark: Context-Debt
&lt;/h2&gt;

&lt;p&gt;Context-Debt is callable automation that agents cannot use intelligently, because the tools carry no description of when to use them, what they do, or how they behave on your specific platform.&lt;/p&gt;

&lt;p&gt;When an AI agent is given a set of tools, it decides which tool to call, and how, entirely based on the Python docstring attached to each tool. Not the code. The description.&lt;/p&gt;
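&lt;p&gt;A small sketch of what that means mechanically: the schema an agent is shown can be approximated from the signature and the docstring alone, and the function body never enters the picture. The helper below is illustrative, not the MCP SDK's actual implementation:&lt;/p&gt;

```python
# Sketch: the tool metadata an agent reasons over is built from the
# signature plus the docstring, not from the code.
import inspect

def get_pod_logs(namespace: str, pod_name: str, tail_lines: int = 100) -> str:
    """Retrieve recent logs from a Kubernetes pod in the AKS cluster."""
    return ""  # the body is irrelevant to tool selection

def tool_schema(fn) -> dict:
    """Approximate the metadata an MCP client would advertise for fn."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {
            name: p.annotation.__name__
            for name, p in sig.parameters.items()
        },
    }

schema = tool_schema(get_pod_logs)
print(schema["description"])  # this text, alone, drives tool choice
```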

&lt;p&gt;Research published in 2026 quantified this directly: editing tool docstrings can yield up to 10 times more usage of the same underlying function in production agents. A 2026 benchmark called OpaqueToolsBench studied what happens when tools have incomplete documentation and found that LLMs consistently struggle with tools that lack clear best practices or documented failure modes.&lt;/p&gt;

&lt;p&gt;Anthropic's own engineering team documented this from building Claude Code: when they launched the web search tool, they discovered Claude was needlessly appending "2025" to every query, biasing results. The fix was not a model change. It was improving the tool's docstring.&lt;/p&gt;

&lt;p&gt;AgentCore Gateway can wrap your Lambda, but it cannot write the docstring that tells the agent when this tool is relevant, what your platform's naming conventions are, or why a particular failure pattern should trigger it first. That knowledge exists only in your engineers' heads, your incident history, and the habits your team has built over years.&lt;/p&gt;

&lt;p&gt;A Lambda with an empty docstring is callable. It is not agent-ready. That gap is Context-Debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI-Debt audit: two questions
&lt;/h2&gt;

&lt;p&gt;Before paying down AI-Debt, you need to know how much you have. Most teams have never measured it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaobxc2u6rs6zpmytte4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaobxc2u6rs6zpmytte4.png" alt="AI-Debt audit framework showing Interface-Debt (what's locked) and Context-Debt (what's dark)" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two questions: how much of your automation has an interface an agent can discover and call, and how much of what is callable carries the context an agent needs to use it well? Count the answers and you have your baseline.&lt;/p&gt;

&lt;p&gt;Most enterprise DevOps platforms, when audited this way, find 70-90% of their automation sitting in Interface-Debt, with significant Context-Debt on whatever is callable. That number is your starting point.&lt;/p&gt;
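&lt;p&gt;A first-pass audit can be a one-screen script. The sketch below counts automation files with no callable surface against files that might expose one; the extension lists are assumptions to tune for your own platform, and the split is deliberately crude:&lt;/p&gt;

```python
# Sketch: crude Interface-Debt baseline over a repo tree.
from pathlib import Path

LOCKED = {".sh", ".ps1", ".yml", ".yaml", ".bicep"}   # no callable handle
CALLABLE = {".py"}                                     # may expose tools

def audit(root: str) -> dict:
    """Count interface-debt candidates vs. potentially callable files."""
    counts = {"locked": 0, "callable": 0}
    for path in Path(root).rglob("*"):
        if path.suffix in LOCKED:
            counts["locked"] += 1
        elif path.suffix in CALLABLE:
            counts["callable"] += 1
    return counts

# Example against a scratch directory:
import tempfile
with tempfile.TemporaryDirectory() as d:
    for name in ("deploy.sh", "build.yml", "infra.bicep", "tools.py"):
        Path(d, name).write_text("# placeholder")
    print(audit(d))  # {'locked': 3, 'callable': 1}
```

&lt;p&gt;Run something like this against your automation repos and the ratio stops being abstract.&lt;/p&gt;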

&lt;h2&gt;
  
  
  How custom MCP tools pay down both debts: three examples
&lt;/h2&gt;

&lt;p&gt;Custom MCP tools are new Python functions decorated with &lt;code&gt;@mcp.tool()&lt;/code&gt;. They do two things at once: give your existing automation a callable interface (addressing Interface-Debt), and encode your platform knowledge in their docstrings (addressing Context-Debt). One new function, two debts addressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: Context-Debt (same function, different quality)
&lt;/h3&gt;

&lt;p&gt;This tool retrieves Kubernetes pod logs, a core diagnostic step in any deployment failure. The engineer already has a working &lt;code&gt;kubectl logs&lt;/code&gt; call. The question is whether an agent can use it intelligently.&lt;/p&gt;

&lt;p&gt;The key is the Python docstring. This is what the agent reads when deciding whether and how to call this function. Not documentation for your colleagues, but the agent's only reasoning interface into your tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Context-Debt: callable, but the agent has nothing to reason with
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get logs for a pod.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Context-Debt resolved: the docstring tells the agent
# when to call this, how to call it, and what to do next
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tail_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve recent logs from a Kubernetes pod in the AKS cluster.

    Use when diagnosing:
    - CrashLoopBackOff pods — set previous=True to see the crash reason
    - Init container failures — include init container name in pod_name
    - Startup failures during helmfile atomic deployments

    Namespace naming on this platform: {service}-{env}
    e.g. payments-dev, payments-staging, auth-prod

    If pod_name is unknown, call get_pods_in_namespace() first.
    Returns last {tail_lines} log lines. Increase for deeper history.
    Returns empty string if pod has not started emitting logs yet.
    In that case, call describe_pod() to check events instead.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tail=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail_lines&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--previous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is nearly identical. The agent's ability to use it correctly is not. The second version tells the agent when to call it, what namespace naming convention to use, which companion tool to call when it doesn't have a pod name, and what an empty response means. Without that docstring, the agent either skips the tool, calls it with wrong parameters, or hallucinates a response.&lt;/p&gt;
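&lt;p&gt;What the agent actually receives is worth seeing. Below is a framework-free sketch of how a FastMCP-style server derives a tool listing from the function signature and docstring; the schema shape and helper name here are illustrative, not the MCP SDK's exact output.&lt;/p&gt;

```python
# A framework-free sketch of what an MCP client receives when it lists
# tools. FastMCP-style servers derive the description from the docstring;
# the exact schema shape here is illustrative, not the SDK's real output.
import inspect

def get_pod_logs(namespace: str, pod_name: str, tail_lines: int = 100) -> str:
    """Retrieve recent logs from a Kubernetes pod in the AKS cluster.

    If pod_name is unknown, call get_pods_in_namespace() first.
    """
    ...

def tool_schema(fn):
    """Build the listing an agent sees: name, description, parameter types."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {name: p.annotation.__name__ for name, p in sig.parameters.items()},
    }

schema = tool_schema(get_pod_logs)
print(schema["parameters"])  # {'namespace': 'str', 'pod_name': 'str', 'tail_lines': 'int'}
```

&lt;p&gt;An agent choosing between tools sees only this listing. The docstring is the whole interface, which is why its quality decides whether the tool gets used at all.&lt;/p&gt;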

&lt;h3&gt;
  
  
  Example 2: Interface-Debt (wrapping an existing script)
&lt;/h3&gt;

&lt;p&gt;Your team has a Bash script, &lt;code&gt;/scripts/get-failed-builds.sh&lt;/code&gt;, that queries the ADO REST API for recent pipeline failures. It has been running for two years. Developers trigger it manually or reference it in ADO pipeline tasks running on a private agent pool. No AI agent can call it. It lives on a file system, not behind an API, with no discoverable interface.&lt;/p&gt;

&lt;p&gt;Here is how you pay down Interface-Debt: write a new MCP tool that calls the script, giving it a callable interface, while encoding platform knowledge in the docstring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /scripts/get-failed-builds.sh&lt;/span&gt;
&lt;span class="c"&gt;# Runs on ADO private agent pool, triggered manually or via pipeline task&lt;/span&gt;
&lt;span class="c"&gt;# Usage: ./get-failed-builds.sh &amp;lt;project&amp;gt; &amp;lt;pipeline-name&amp;gt; &amp;lt;days&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;# Returns: JSON array of failed runs with build_id, stage, duration&lt;/span&gt;
&lt;span class="c"&gt;# No agent can reach this — it has no callable interface&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# New MCP tool: Interface-Debt + Context-Debt resolved together
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_recent_pipeline_failures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;days_back&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Get recent failed pipeline runs for a given ADO project and pipeline.

    Wraps the internal ADO query script and returns structured failure data.
    Call this as the first step in any pipeline diagnosis workflow.
    It gives you the build IDs needed for deeper analysis.

    Pipeline naming on this platform: {service}-{env}-deploy
    e.g. payments-dev-deploy, auth-staging-deploy, gateway-prod-deploy

    Returns list of failures with fields:
    build_id, start_time, failed_stage, duration_seconds,
    triggered_by, branch, retry_count.

    Most common failed stages on this platform:
    - helmfile-apply     -&amp;gt; missing secrets (79%) or image pull (15%)
    - integration-tests  -&amp;gt; environment config or dependency issues
    - security-scan      -&amp;gt; new CVE in base image (check monthly patch cycle)

    After calling this, pass build_id to diagnose_build_failure()
    for root cause analysis.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/scripts/get-failed-builds.sh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days_back&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Bash script has not changed. It still runs where it always ran. The new MCP tool is a thin wrapper that converts it from invisible to callable, and the docstring converts it from callable to agent-ready. That is what paying down Interface-Debt looks like in practice.&lt;/p&gt;
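&lt;p&gt;The thin-wrapper pattern generalises beyond this one script. A hedged sketch (the helper name is hypothetical, and the echoed script path comes from the example above): any legacy script with a stable CLI contract that prints JSON can be lifted into a plain callable the same way.&lt;/p&gt;

```python
# A generic sketch of the thin-wrapper pattern (helper name is hypothetical):
# lift any legacy script with a stable CLI contract that prints JSON
# into a plain Python callable.
import json
import subprocess

def wrap_json_script(script_path: str):
    """Return a callable that runs script_path with string args and parses JSON stdout."""
    def call(*args):
        result = subprocess.run(
            [script_path, *map(str, args)],
            capture_output=True, text=True,
        )
        result.check_returncode()  # surface script failures instead of parsing empty output
        return json.loads(result.stdout)
    return call

# The two-year-old script gains a callable interface without being touched:
get_failed_builds = wrap_json_script("/scripts/get-failed-builds.sh")
# get_failed_builds("payments", "payments-dev-deploy", 7)
```

&lt;p&gt;The `check_returncode()` call matters: `subprocess.run` does not raise on a non-zero exit by default, and silently parsing an empty stdout is exactly the kind of failure an agent will hallucinate around.&lt;/p&gt;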

&lt;h3&gt;
  
  
  Example 3: The purely contextual tool (no vendor equivalent)
&lt;/h3&gt;

&lt;p&gt;This tool has no script to wrap and no API to call. It queries an internal incident database built from 18 months of real platform failures. But the real value is not the database call. It is the docstring that encodes the diagnostic patterns a senior engineer applies instinctively. Think of it as team knowledge, made callable and permanent.&lt;/p&gt;

&lt;p&gt;No vendor MCP server can build this. AgentCore Gateway has no OpenAPI spec to import. This tool exists only because someone encoded real incident history into a docstring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_platform_failure_pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;error_signature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Look up known failure patterns on this platform from real incident history.

    CALL THIS FIRST in any diagnosis before running other tools.
    It encodes 18 months of incident data and directs you to the
    highest-probability root cause, skipping diagnostic dead ends.

    Known patterns (error_signature -&amp;gt; likely cause):
    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timed out waiting for condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; + helmfile-apply
      -&amp;gt; missing secret in namespace (79% of cases)
      -&amp;gt; next: call check_keyvault_secret_exists()

    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ImagePullBackOff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
      -&amp;gt; ACR authentication failure or incorrect image tag (92%)
      -&amp;gt; next: call check_acr_image_exists()

    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CrashLoopBackOff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; shortly after deployment
      -&amp;gt; application ConfigMap missing or malformed (71%)
      -&amp;gt; next: call get_pod_logs(previous=True) then check_configmap()

    - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;503 Service Unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; post-deployment with healthy pods
      -&amp;gt; stale Istio VirtualService conflict in namespace (58%)
      -&amp;gt; next: call get_all_virtualservices_for_host()

    Returns: likely_cause, confidence_percent, recommended_tools,
    similar_past_incidents, avg_resolution_minutes.

    If confidence &amp;lt; 50%, this is a new pattern not yet seen.
    Document it via create_incident_record() and use the generic path.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;query_incident_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;error_signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_stage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Context-Debt resolved. It is also the only category of tool that truly differentiates your platform's agent capability from that of every other organisation using the same vendor tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on diminishing AI-Debt: why vendors won't close this gap
&lt;/h2&gt;

&lt;p&gt;Could vendors extend their auto-wrap to cover scripts and pipeline tasks? Theoretically, yes. AWS could build mechanisms to execute arbitrary scripts via Lambda wrappers, and Azure could auto-instrument ADO pipeline tasks. But there is a structural reason why they are unlikely to prioritise this.&lt;/p&gt;

&lt;p&gt;Vendors have no incentive to solve your platform's knowledge problem. Their investment goes into making their own services and managed resources accessible to agents: Lambda, REST APIs, their cloud-native tooling. The script-and-pipeline estate your team has accumulated over five years is yours, not theirs. Even if a vendor shipped a zero-code script wrapper tomorrow, it would only address Interface-Debt. Context-Debt (the knowledge of when to use each tool, how your platform behaves, what your failure patterns are) remains yours to encode. No vendor will ship that for you.&lt;/p&gt;

&lt;p&gt;That is what makes the custom MCP layer valuable and defensible. It is automation that only you can build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The window is closing
&lt;/h2&gt;

&lt;p&gt;Gartner predicts 40% of enterprise applications will integrate task-specific AI agents by end of 2026, up from less than 5% in 2025. At the same time, they predict over 40% of those agentic AI projects will be cancelled by end of 2027 due to inadequate technical foundations.&lt;/p&gt;

&lt;p&gt;Read those two together. Agents are arriving. Nearly half the projects will fail — not because the models are poor, but because the platforms were not ready.&lt;/p&gt;

&lt;p&gt;The ones that succeed will have paid down AI-Debt before the agents arrived.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run the audit.&lt;/strong&gt; Count your automation across two dimensions: what's locked and what's dark. That number is your baseline.&lt;/p&gt;
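&lt;p&gt;A starting point for that audit, sketched in Python. The directory layout and the idea of a wrapped-tool registry are assumptions for illustration, not prescriptions: the point is simply to count the scripts no agent can call today.&lt;/p&gt;

```python
# A hypothetical audit sketch: count shell scripts in a directory and
# flag which ones have no MCP wrapper registered. The paths and the
# registry format are assumptions, not part of any vendor tooling.
from pathlib import Path

def audit_dark_automation(script_dir: str, wrapped_tools: set) -> dict:
    """Baseline AI-Debt audit: scripts with no MCP wrapper are 'dark'."""
    scripts = sorted(p.name for p in Path(script_dir).glob("*.sh"))
    dark = [s for s in scripts if s not in wrapped_tools]
    return {
        "total_scripts": len(scripts),
        "dark_scripts": len(dark),
        "dark_list": dark,
    }
```

&lt;p&gt;Run it against your scripts directory and your tool registry. The `dark_scripts` count is your baseline; watch it fall as you wrap the highest-value workflows first.&lt;/p&gt;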

&lt;p&gt;&lt;strong&gt;Start with the highest-value workflows.&lt;/strong&gt; Incident diagnosis, deployment validation, environment setup. Build custom MCP tools for those five or ten scenarios first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat docstrings as engineering work.&lt;/strong&gt; The quality of your agent's decisions is directly proportional to the quality of your docstrings. They are not documentation overhead; they are the core of what makes a platform agent-ready.&lt;/p&gt;

&lt;p&gt;AI-Debt is silent. Your scripts still run. Your pipelines still deploy. Everything works perfectly for humans. The debt only becomes visible the day an agent arrives and has nothing to call.&lt;/p&gt;

&lt;p&gt;That day is closer than most platform teams realise.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Platform and DevOps engineering at a large UK financial institution. Views are my own.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;I write about AI agents, cloud architecture, and occasionally things that have nothing to do with technology.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Building in this space or thinking about AI-Debt on your platform? I would be glad to hear from you.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>mcp</category>
      <category>aiops</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
