DEV Community: Kalyan Ram Jaladi

Aditi: a private medical paralegal that runs entirely on your Mac

Kalyan Ram Jaladi — Mon, 25 May 2026 00:32:45 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

When the pharmacy gave my mother a different medicine than the one her doctor had prescribed, I didn't know if the substitute was right. I didn't have hours to research. I didn't want to send her medical details to a cloud AI in another country. So I built Aditi, a paralegal that runs entirely on my Mac. It reads what the doctor wrote and what the pharmacy actually gave, listens to how she describes her symptoms, and organizes it into clear questions I can take to her doctor. Nothing leaves the device.

Aditi runs Google's Gemma 4 models locally on Apple Silicon (macOS, via MLX), and it uses all three on-device variants for what each is best at: the 26B MoE for the heavy text and image work (reading messy handwriting and synthesizing many documents at once), E4B for voice, including Indian languages, with strong all-round quality at the edge, and E2B for the fastest, lightest runs. No cloud, no account, and, once the model weights have been downloaded, no network call.

The case that started it

My mother is 72. She has diabetes. Last week she saw a neurologist for nerve pain in her legs and feet. The doctor wrote a prescription for Pentanerv. The pharmacy didn't have it, so they gave her Gabapin (Gabapentin) instead. Both are common treatments for nerve pain and sit in the same drug family (gabapentinoids), but they have different active ingredients and the dose does not convert one to one, so a substitution without the doctor's approval is worth questioning. After two days on the substitute, the pain hadn't reduced and she started worrying. She was not convinced Gabapin was the right substitute, and that doubt added to her stress. The pharmacist had said the substitute was equivalent, but I had questions. Was it really the same medicine? Were the dosages comparable? If the pain didn't reduce, should we switch back?

There's a second problem I kept running into. When I see a new doctor, nobody has the full picture. My history is scattered across prescriptions from a cardiologist, a pulmonologist, an orthopedist, and a gastroenterologist, plus CPAP therapy reports. I can't summarize all of that accurately from memory. I wanted Aditi to read that whole stack and hand the next doctor a clean, one-page summary.

What Aditi does

You give Aditi three things: a short note in your own words, the prescription images (with the pharmacy bill and a photo of the dispensed tablets), and, optionally, a voice memo of the patient describing how they feel. Aditi makes a single, on-device, multimodal call to a Gemma 4 model and returns a result card: a plain-language finding at the top, the questions to ask the doctor, then the detailed extraction (what the doctor prescribed versus what the pharmacy dispensed), a bill cross-check, and a voice-interpretation check.

Here's what it produced for my mother's case, on the E4B model with a clean synthetic English voice memo as the spoken input (made with macOS's Ava Premium voice; I also recorded the same concern in Telugu, shown later):

Aditi did not hand me a verdict, and it should not. What it surfaced were the right questions. Both drugs treat the same nerve pain, but they are not interchangeable, so the things to confirm with the doctor are whether Gabapin is the correct substitute and whether its dose matches what was intended.

The paralegal posture

Aditi is not a doctor. That is not a disclaimer. It is the entire architectural decision.

In every model call, the prompt opens with:

"I am the patient. Act as my paralegal medical assistant. You are NOT a doctor. You will NOT diagnose, prescribe, suggest medication changes, or make clinical judgments. Your job is to ORGANIZE and SUMMARIZE what is already in these inputs so I can have an informed conversation with my doctor and pharmacist."

This is the defensibility argument for medical AI. When someone asks whether deploying AI for medical use is dangerous, the answer is that Aditi doesn't diagnose. It organizes. The doctor diagnoses.

This posture shapes every output Aditi produces:

It surfaces substitutions but does not endorse them.
It transcribes voice memos but does not interpret them clinically.
It generates questions for the doctor but does not answer them.
It marks unclear handwriting as uncertain rather than guessing.
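
A minimal sketch of how a posture like this can be pinned to every call. The constant and the helper name (PARALEGAL_PREAMBLE, build_prompt) are hypothetical and the repository's actual prompt assembly may differ; the preamble text itself is the one quoted above.

```python
# Hypothetical sketch: names are illustrative, not taken from the Aditi repo.
PARALEGAL_PREAMBLE = (
    "I am the patient. Act as my paralegal medical assistant. "
    "You are NOT a doctor. You will NOT diagnose, prescribe, suggest "
    "medication changes, or make clinical judgments. Your job is to "
    "ORGANIZE and SUMMARIZE what is already in these inputs so I can have "
    "an informed conversation with my doctor and pharmacist."
)


def build_prompt(patient_note: str) -> str:
    """Prepend the fixed posture so no prompt can be built without it."""
    note = patient_note.strip()
    if not note:
        raise ValueError("patient_note must not be empty")
    return f"{PARALEGAL_PREAMBLE}\n\nPatient note:\n{note}"
```

Routing every call through one builder means the posture cannot be dropped by accident when a new feature adds its own prompt.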

Demo

The video has captions and a full transcript (turn on CC on YouTube).

Code

https://github.com/kalyanrj16/aditi-gemma-4

Quick start (full setup is in the README):

uv venv --python 3.11
uv pip install -r requirements.txt
uv run streamlit run src/aditi_app.py

You'll need a Mac with Apple Silicon (M1, M2, M3, or M4) and at least 24 GB of memory for the 26B model. The smaller E2B and E4B variants run with much less. On first run the selected model downloads from Hugging Face; after that, Aditi runs fully offline, which you can confirm by turning Wi-Fi off.

Every run in this writeup is reproducible. The JSON outputs and full result-card screenshots for all 12 cells of the test matrix (three model variants across the medicine-check conditions, plus three models for the health summary) are committed under outputs/.

How I Used Gemma 4

Everything below ran on a MacBook with an M-series chip and 24 GB of unified memory, fully offline.

The model selection is tiered, but not in the obvious way. I ran a small matrix on these real cases: the three variants (E2B, E4B, 26B MoE) across the medicine-check conditions (no audio, plus English, Indian-English, and Telugu voice memos) and the multi-document health summary, twelve runs in all. Every run is saved in the repo, a JSON record and a full result-card screenshot per cell, under outputs/, so every claim below is checkable. The data showed something more interesting than "bigger model is better."

26B reads handwriting cleanly enough to catch the substitution from the prescription image alone. It correctly extracted PENTANERV-NT, PENTANERV, and the full five-drug list without confusing what was prescribed with what was dispensed (outputs/UC1_noaudio_26b.json).

E2B and E4B can't do this from images alone. E2B's handwriting OCR drifts; the same Pentanerv reads as "PENTANER" one run and "PENTENERUV" the next (outputs/UC1_english_e2b.json, outputs/UC1_indianenglish_e2b.json). E4B duplicates the dispensed drug into the prescribed list, which confuses the substitution finding (outputs/UC1_noaudio_e4b.json).

But E2B and E4B can hear the voice memo, which 26B can't. Per Google's official Gemma 4 documentation, audio is native to E2B and E4B only; the 26B and 31B variants are vision and text. A voice memo adds about 460 to 580 tokens to the prompt, but only the audio-native models actually attend to them.

So the tiers rescue each other:

E4B uses the voice memo to recover the substitution that its image extraction alone would have missed.
26B uses its handwriting strength to catch the same substitution without needing audio.
E2B catches it only when given audio. Without it, it collapses.

Tier	Model	What it does best	Memory
Synthesis	Gemma 4 26B MoE 4-bit	Reads handwriting cleanly, multi-document health summaries	~17.5 GB
Multilingual audio	Gemma 4 E4B 4-bit	Transcribes voice memos in English (US and Indian-accented) and Telugu; cleanest patient-facing prose	~7 GB
Edge / capture	Gemma 4 E2B 4-bit	Fast (~114 tok/s), lowest memory	~5 GB

And the cost of each tier, measured on this case (single multimodal call, MLX, 24 GB Mac):

Model	Peak memory	Throughput	Time for one medicine-check finding
Gemma 4 26B MoE 4-bit	~17.7 GB	~35 to 40 tok/s	~24 s
Gemma 4 E4B 4-bit	~7.1 GB	~66 tok/s	8 to 15 s
Gemma 4 E2B 4-bit	~5.4 GB	~114 tok/s	4 to 6 s

The finding behind this is that audio support is a Google design choice, not a model capability ceiling. E2B and E4B are built for on-device deployment where voice input is natural. 26B and 31B are built for server contexts where text and images dominate. Prompt engineering can't change that; it is baked in. Aditi handles it honestly: pick a vision-only model and the voice-check panel says "audio not used by this model" instead of pretending it listened.
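
The honest voice-check behavior can be sketched as a small capability table. This is an illustration under assumptions: the model keys, class name, and function name are hypothetical, and only the modalities and the "audio not used by this model" message come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    vision: bool
    audio: bool


# Modalities as described in this post; the keys are illustrative labels.
CAPABILITIES = {
    "gemma-4-e2b": ModelCapabilities(vision=True, audio=True),
    "gemma-4-e4b": ModelCapabilities(vision=True, audio=True),
    "gemma-4-26b-moe": ModelCapabilities(vision=True, audio=False),
}


def voice_check_status(model: str, has_voice_memo: bool) -> str:
    """Report honestly whether the selected model will hear the memo."""
    caps = CAPABILITIES[model]
    if not has_voice_memo:
        return "no voice memo provided"
    if not caps.audio:
        return "audio not used by this model"
    return "voice memo included"
```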

Voice memos in Indian languages

One thing surprised me in testing: Gemma 4 E4B handles Telugu voice memos cleanly. My mother and I both speak Telugu, so I recorded the same concern as a Telugu voice memo, and E4B extracted it accurately (full record in outputs/UC1_telugu_e4b.json):

E4B's output on a Telugu voice memo: "concerned about the substitution of Pantenerv with Gabapin… She has been taking the Gabapin for two days but feels no improvement."

This matters for Indian patients who explain their symptoms most naturally in their regional language, not in English. The architecture takes this seriously: E4B handles the multilingual audio tier, and 26B handles the precise document extraction tier. The user doesn't choose; the app picks the right model for the input modality.

This is one data point with one Telugu memo, and wider testing across more regional languages would be useful future work. But for my mother's case, Telugu support means Aditi works for her, not just for me.

Beyond a single visit: the health story

Aditi also handles a second use case: reading prescriptions from multiple specialists across visits and producing a synthesized health summary.

I tested this on 8 documents (six specialist prescriptions covering orthopedics, pulmonology, cardiology, and gastroenterology, plus two CPAP reports) and asked Aditi to produce a Markdown summary covering conditions and diagnoses, medications across visits, a visit timeline, tests and investigations, items that couldn't be read clearly, and questions for the doctor. The full output for each model is in outputs/UC2_summary_26b.json, outputs/UC2_summary_e4b.json, and outputs/UC2_summary_e2b.json.

A slice of what 26B produced across those 8 documents:

Health Summary. Patient name and age.

Conditions: Obstructive Sleep Apnea (noted across the cardiology and general examination visits), chest pain (cardiology), abdominal swelling, knee issues.

Medications across visits: Aztor (bedtime), CPAP therapy, Montek-BL, Tysulet, Foracort.

Visits: Cardiology (02/06/2026), chest pain and vitals; general examination, abdominal swelling and bowel habits.

For multi-document synthesis, 26B is clearly the right model. It consolidates recurring conditions across visits (obstructive sleep apnea showing up in multiple documents, for instance) and integrates CPAP data into the broader medical picture. The smaller models struggle here: E4B falls into a repetition loop and drops the medication list (outputs/UC2_summary_e4b.json), which is the same edge-model limitation the substitution case showed.

What can go wrong

Aditi has a bill cross-check that matches extracted prescribed drugs against names printed on the pharmacy bill, then flags discrepancies. It works when extraction is clean. When extraction is messy, as it sometimes is on the E2B model, the cross-check can produce false positives.

In one test run, E2B flagged "VENTIN to GABAPENTIN" as a possible misread of the same drug. They are actually different drugs; the fuzzy-match threshold (0.6) was too loose for that pair. The cross-check correctly surfaces uncertainty, but the similarity check itself is blunt, and it mismatched here.

This is the right kind of failure to surface. The cross-check never overrides the model; it only flags. If the model gets confused, the flag shows it. If the cross-check itself is confused, that is visible too. Both signals reach the doctor.
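
To see why a 0.6 threshold lets the VENTIN and GABAPENTIN pair through, here is a small sketch using Python's standard-library difflib. This is an assumption made for illustration: the post does not say which similarity measure Aditi uses, and the function names are hypothetical. Under difflib's ratio, the shared run "ENTIN" is enough to push the score just over the threshold.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.6  # the threshold quoted in this post


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def possible_misread(extracted: str, on_bill: str) -> bool:
    """Flag a pair as a possible misread; this only flags, it never corrects."""
    return similarity(extracted, on_bill) >= FUZZY_THRESHOLD


score = similarity("VENTIN", "GABAPENTIN")
print(f"{score:.3f}", possible_misread("VENTIN", "GABAPENTIN"))  # 0.625 True
```

With this measure, five matching characters out of sixteen total give 2 × 5 / 16 = 0.625, so any threshold at or below that value treats the two names as the same drug misread.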

What Aditi doesn't do

Honesty about scope matters in medical AI. Aditi:

Doesn't diagnose anything. Aditi never says "this is safe" or "this is the right medicine." Those are doctor questions.
Doesn't recommend dose changes. Even when medications are clearly in the same drug family, the dose conversion (for example, Pentanerv versus Gabapin) depends on patient-specific factors.
Runs on Mac only. An iPhone prototype tested during research exposed five distinct failure modes, which is what led to the Mac-only design.
Has been tested on two real cases in depth, not at clinical scale.
Has not been tested with the 31B Dense Gemma 4 variant, due to memory constraints.

These are real limitations. I list them because confidence in medical AI comes from honesty about what it can and can't do.

Privacy by architecture

Every model call, every voice memo, and every prescription image stays on the Mac. There is no API call to a vendor server. There is no telemetry; the app even disables Streamlit's usage stats. The model weights are downloaded once and cached locally.

I built Aditi this way because the alternative, sending my mother's prescription, her medical history, and her voice describing her pain to a server in another country, felt wrong. It probably is wrong, under India's Digital Personal Data Protection Act 2023 and other privacy norms.

And beyond the law, this is personal. A family's medical history, including the conditions people are ashamed to talk about, is some of the most sensitive data a person owns. It should not have to leave the house to be useful to the people it belongs to.

I treated privacy as an architectural choice from the start, not a feature to add at the end.

Consent and the case data

The images in this submission come from two real cases: my mother's medication-substitution case, used with her informed verbal consent, and my own multi-visit history (the health-summary case). All identifying information (patient names, doctor names, hospital names, IDs, addresses, phone numbers, signatures) has been substituted with fictional alternatives or redacted. Drug names, dosages, complaints, and medical content are preserved exactly so the model's extraction capability can be demonstrated honestly.

Aditi is one honest paralegal on a person's own device. It reads what the doctor wrote, hears what the patient said, and helps the patient ask the right questions. That was the point.

Cover photo: Arthur's Seat, Edinburgh, 2020, the vast and beautiful sweep of green hills above the city. The name Aditi means "the boundless one," which felt right for a tool meant to give care without limits, as open as that landscape.

Kalyan Ram Jaladi, writing from Hyderabad, India.

AI Ops Agents Are a New Class of Attack Surface

Kalyan Ram Jaladi — Sun, 26 Apr 2026 17:36:34 +0000

Decades of operational tribal knowledge are now concentrated in one system. That concentration is the feature, and the vulnerability.

My first thought after reading about the Azure SRE Agent CVE was not about Microsoft's bug. It was about a new attack surface. The agent's security model has not caught up to what the agent can reach.

Earlier this month, security researcher Yanir Tsarimi at Enclave AI disclosed CVE-2026-32173 against Azure SRE Agent, rated CVSS 8.6. Azure SRE Agent and its AWS counterpart, AWS DevOps Agent, are a new generation of cloud-native agents built to do what senior site reliability engineers do: triage incidents, query logs across services, correlate metrics with traces, and propose or execute fixes. Both vendors position these agents as the future of operations, available 24x7, queryable in natural language, and capable of acting on the cloud accounts they are wired into. They run as managed multi-tenant SaaS in the vendor's infrastructure.

The flaw allowed an attacker with any valid Entra ID token, from any Microsoft tenant, to silently eavesdrop on another customer's agent activity. The agent's transport layer accepted the token without checking whether it had any business looking at the victim tenant's data. Live agent conversations, log queries, command outputs, the agent's reasoning as it worked through an incident, all of it readable by an attacker who only needed to spin up a free Microsoft tenant of their own.

Microsoft has acknowledged it and patched the issue server-side. No customer action was required, and there is no public proof-of-concept exploit. But the architectural premise that made this attack possible is still standing in every enterprise that adopts an SRE or DevOps agent over the next year.

What follows is the Pandora's box. Some context first, with the recent CVEs that opened it up for me.

Inside the Pandora's box

Different attack classes. Different root causes. What they share is the architectural premise: an agent with broad operational reach, autonomy to act, and trust from humans to do useful work. When any one of those is exploited, the consequence scales with what the agent can reach, not with the depth of the bug.

LangChain and the open source frameworks (LangGraph, CrewAI, and others) take their share of vulnerabilities, partly because they are the most popular targets, partly because their code is open to scrutiny. The good news is that fixes ship in days. Cloud-native agents are a different beast. Azure SRE Agent, AWS DevOps Agent, and the equivalent offerings from other vendors operate with elevated privileges inside proprietary ecosystems. The complexity is massive, the attack surface is opaque to customers, and the patching depends entirely on the vendor.

New gates for agents (and why enterprises are right to wait)

Traditional enterprise security has well-understood gates. We use SonarQube, Nexus IQ, Aqua Security, and similar tools to cover code quality, dependency vulnerabilities, container image scanning, and base image hygiene. None of them was designed to assess what an agent can actually do once it is deployed. The category does not yet exist in their product model.

From my own experience working in regulated environments, this is exactly why agent adoption has been slower than vendors expected. And the delay is justified, not stubbornness. Every new architecture pattern in a regulated enterprise has to pass through the architecture review group. Getting a new API pattern from on-premises to cloud approved involves serious questions. Does the traffic sit behind the approved gateway? Is it routed through the existing F5 and Akamai layers? Are the backend images fully scanned? Does it follow the approved load balancer pattern? RBAC is enforced at every layer, every action is consent-gated, and no new design pattern goes into production without sign-off.

These gates exist for good reasons, and they work for traditional services. They do not yet have categories for agents. The new gates that have to come, based on what the recent CVEs have exposed:

Tenant isolation enforced at the application layer, not just the auth layer. The Azure CVE is what happens when you assume the auth layer covers it.
MCP server provenance. How do you verify that the MCP server you are loading is the one the vendor signed, and not a typosquatted version that an attacker registered last week? Custom MCP tools will need their own security review process.
Human-in-the-loop on destructive operations. The old four-eyes check, where two engineers approve a destructive operation, will become standard for agent-initiated changes.
Cross-agent zero-trust in multi-agent systems. An agent should not trust another agent's request just because they are inside the same orchestration. Signed intent on tokens, re-authorization at each privilege boundary, audit trails that survive even if one agent is compromised.
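
Two of these gates are easy to sketch. What follows is a minimal illustration, not how any vendor implements them: it assumes the token has already been cryptographically validated and its claims decoded into a dict, it relies on the tid (tenant ID) claim that Entra ID tokens carry, and every function and exception name is hypothetical.

```python
class AuthorizationError(Exception):
    """Raised when a request fails an application-layer gate."""


def authorize_tenant(claims: dict, resource_tenant_id: str) -> None:
    """A valid token is not enough: its tenant must own the resource."""
    token_tenant = claims.get("tid")
    if not token_tenant or token_tenant != resource_tenant_id:
        raise AuthorizationError("token tenant does not match resource tenant")


def authorize_destructive(requested_by: str, approvers: set[str]) -> None:
    """Four-eyes check: two distinct approvers, neither of them the requester."""
    independent = approvers - {requested_by}
    if len(independent) < 2:
        raise AuthorizationError("destructive operation needs two approvers")
```

The first check is the one the auth layer alone does not give you: authentication proves who the caller is, while this comparison decides whether that caller may see this tenant's data.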

Imagine a banking client deploying an SRE agent and getting hit with the Azure CVE before the patch. The blast radius would not be one service. It would be operational intelligence across the bank's entire incident history.

The tribal knowledge is now the agent's

The analogy I am thinking of is this. A senior support or ops engineer at a large company carries a body of context in their head that lets them solve complex problems fast. The technical term for this is tribal knowledge: the unwritten, accumulated understanding of how the system actually behaves, which alerts matter, which workarounds work, what was tried last time, why this particular service has that strange retry logic. It is the knowledge that does not exist in any document because it was learned through years of incidents.

A typical platform team has maybe ten such engineers. Their cumulative tribal knowledge is the actual reason incidents get resolved in minutes instead of hours.

An AI Ops agent compresses that knowledge into a single system, and it does so by design. Several mechanisms compound to make this happen:

MCP tool proliferation. The agent connects to dozens of MCP servers including observability platforms, code repositories, CI systems, ticketing systems, and runbooks. Each one adds reach.
Skills. Skills are the instruction manuals that tell the agent how to use the tools. AWS DevOps Agent and Azure SRE Agent both ship with pre-loaded Skills that encode the topology, the conventions, the patterns of the environment they operate in.
RAG databases. Past incidents, postmortems, runbooks, and architectural documents are indexed into retrieval-augmented generation stores. This is how the agent learns the equivalent of "what happened last time."
Conversation memory. Across sessions, the agent retains operator intent, recent decisions, and reasoning traces.

The result is one system with the cumulative tribal knowledge of ten ops engineers, available 24x7, queryable in natural language, reachable over the network. That compression is the value proposition. It is also a new class of attack surface, because compromising one agent yields more than compromising any single underlying service ever could.

This is what I mean when I say the concentration is the feature, and the vulnerability.

A note on Skills, because they do not get enough attention. Skills are simple in form, often short markdown files, but they are what gives an agent the appearance of expertise. They are how you give an agent the tribal knowledge that it does not have by default. A short instruction document that captures the conventions of the team, the names of services, the meanings of error codes, the workarounds that have accumulated over years, that is what makes the difference between an agent that gets stuck in a reasoning loop and an agent that finds the answer in seconds. Which is also why Skills are a security concern. If the agent loads instructions from an untrusted source, those instructions are now part of the agent's behavior. EchoLeak from June 2025 is the classic example of how Skills and instructions become an attack vector. Skills make the agent more powerful, and they expand the surface area of what an attacker can manipulate.

A deserving new threat model

Once you start looking at agents through this lens, several attack categories become more interesting. The OWASP GenAI Security Project published the OWASP Top 10 for Agentic Applications 2026 in December 2025, peer-reviewed by over 100 industry contributors.

A few of the categories are worth calling out because they map directly to what the recent CVEs have shown.

Tool poisoning (ASI02). An attacker compromises the descriptor or metadata of an MCP tool, so that when the agent loads the tool at runtime, it loads malicious capability descriptions. The agent then invokes the tool based on falsified metadata. This is not theoretical. The malicious MCP server impersonating Postmark on npm, reported in September 2025, was the first documented in-the-wild case.

Adversary-in-the-middle (covered under ASI07 Insecure Inter-Agent Communication). Multi-agent systems pass messages between agents, often over weakly authenticated channels. An attacker positioned in the middle can intercept and manipulate those messages, hijacking the goals or actions of downstream agents.

Goal hijacking and prompt injection (ASI01). The most discussed category, and rightly so. EchoLeak demonstrated that a single crafted email could redirect an agent's goals without any user interaction. The pattern works whenever the agent ingests untrusted natural language as part of its input.

Identity and privilege abuse (ASI03). What the Azure SRE Agent CVE was. An agent operates without a strong identity of its own, inherits permissions from the user it acts on behalf of, and the boundary between agent identity and user identity blurs in dangerous ways.

This is not the full list. Memory poisoning, supply chain compromise, cascading failures across multi-agent systems, rogue agents, and human-agent trust exploitation are all in the OWASP doc.
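
For the tool poisoning category above, one commonly discussed mitigation is to pin a hash of each tool's descriptor at review time and refuse to load a tool whose descriptor has changed since. The sketch below illustrates that idea only; the descriptor fields and function names are assumptions, not part of MCP or of any vendor's agent.

```python
import hashlib
import json


def descriptor_fingerprint(descriptor: dict) -> str:
    """Stable SHA-256 over a tool descriptor (name, description, schema)."""
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(descriptor: dict, pinned_fingerprint: str) -> bool:
    """True only if the descriptor matches what was reviewed and pinned."""
    return descriptor_fingerprint(descriptor) == pinned_fingerprint
```

Sorting keys before hashing means a harmless reordering of fields does not trip the check, while any edit to the description text does.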

Closing thoughts

This is a new game and an exciting one. AI is being called one of the great inventions of the last century, and I broadly agree. But a technology this versatile deserves a dedicated threat-modeling discipline, which does not yet exist in most enterprises. Agents are a new architectural category, and they disrupt how we think about review, governance, and security ownership. They will create new jobs and new roles.

I am still working through what this means for my own practice. The post is more of a thinking-out-loud than a recommendation. If you are seeing this differently from where you sit, I would be interested to hear about it.

Your DevOps automation is invisible to AI. That's AI-Debt. And it's compounding.

Kalyan Ram Jaladi — Fri, 17 Apr 2026 15:56:01 +0000

A new concept for platform and DevOps engineers, and why the window to act is narrower than you think.

A few months ago I set out to build an internal DevOps agent. The goal was straightforward: diagnose pipeline failures and surface root causes faster than any engineer could manually. I was writing Python functions, connecting to the ADO REST API, the Kubernetes client, the Azure SDK. Building the integration layer from scratch.

A senior colleague asked one question that changed everything: "Have you looked at the Azure MCP Server?"

I hadn't. That question opened a window into an entire vendor ecosystem being assembled at speed, and into a far more important question about what it had not yet built. That gap has a name. This article is about it.

The future is agent-orchestration. MCP is its language.

We are moving from a world where automation meant writing explicit instructions for machines, to one where autonomous agents receive a goal and reason their way to achieving it. For every DevOps and platform team, the question is whether their existing automation will be visible to those agents, or invisible.

The interface that makes automation agent-visible is the tool: a callable function an agent can discover, invoke, and reason over. The open standard governing how tools are described, discovered, and called is Model Context Protocol (MCP).

MCP has three components. The MCP Host is the environment where the agent runs (an IDE like Cursor or Kiro, a platform like GitHub Copilot, or a custom agent you build). It contains the LLM doing the reasoning. The MCP Client lives inside the host and handles protocol communication. The MCP Server is where tools live, exposing callable functions and responding to invocations.

In Python, a tool on an MCP Server looks like this:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ado-tools")  # server name is illustrative

@mcp.tool()
async def get_build_logs(organization: str, build_id: int) -> str:
    """Retrieve the full log output for a specific ADO pipeline build."""
    ...

The @mcp.tool() decorator is the registration contract. The docstring (the text in triple quotes) is what the agent reads when deciding whether and how to call this function. Not optional documentation. It is the agent's primary reasoning interface into your tool. More on this shortly.
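
For context on what travels between client and server: MCP messages are JSON-RPC 2.0, and a tool invocation uses the tools/call method with the tool name and its arguments. A minimal sketch of building that request follows. The helper name and the argument values are illustrative, and a real client library would handle this for you.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Example values are made up for illustration.
print(make_tool_call(1, "get_build_logs", {"organization": "contoso", "build_id": 4211}))
```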

SDK downloads grew from 100,000 at launch to over 8 million by April 2025. In December 2025, Anthropic donated MCP to the Linux Foundation, co-governed by OpenAI, Google, Microsoft, AWS, and Salesforce. The same governance move Kubernetes made in 2016. When something enters Linux Foundation governance, it stops being one vendor's experiment and becomes shared infrastructure.

Every major cloud vendor has now built production-grade MCP servers for their own ecosystems. What matters for this article is what they have built, what they haven't, and what that gap means for every DevOps platform already running.

What vendors have built — and why AWS is leading this race

Azure MCP Server (GA) exposes 40+ Azure services as agent-callable tools. The ADO MCP Server covers pipelines, builds, pull requests, and repositories. AWS embedded MCP into Bedrock AgentCore with IAM permissions and CloudTrail audit logging per tool call. Google released fully managed MCP servers for GKE, BigQuery, AlloyDB, and Spanner. Microsoft shipped a dedicated SQL MCP Server covering SQL Server, PostgreSQL, and Cosmos DB. Zero code, open source, free.

Beyond their own services, the vendors have gone further, building tools to make your existing automation agent-ready without custom code. This is where the comparison gets interesting.

AWS is the clear pioneer. At AWS Summit New York in July 2025, they announced a $100 million investment in their Generative AI Innovation Center, with agentic AI as the centrepiece. At re:Invent 2025, they shipped three domain-specific autonomous agents: a DevOps Agent, a Security Agent, and Kiro, an agentic IDE. Their open-source Strands Agents SDK introduced a model-first design philosophy. Instead of developers hardcoding every workflow path, the LLM reasons over available tools and decides the path itself. AWS has made agent development a first-class engineering discipline, with tooling, documentation, and production infrastructure to match.

The centrepiece for existing automation is AgentCore Gateway: a fully managed service that converts your existing APIs and Lambda functions into MCP-compatible tools automatically. You provide an OpenAPI specification or a Lambda ARN, and the Gateway handles protocol translation, authentication, semantic tool discovery, and observability. No custom code required.

Azure has equivalent capability but spread across three services. Azure APIM can expose any REST API as an MCP server: import your OpenAPI spec, click "Create MCP server," done. Azure Functions has a native mcpToolTrigger binding. Microsoft Foundry provides governance across 1,400+ connectors alongside custom tools, authenticated through Entra. The capability is there, but it requires more coordination across services compared to AWS's single-surface approach.

Google's Apigee converts any managed API to an MCP server without changing the underlying service. Powerful for GCP-native APIs, but Apigee has historically been a complex enterprise product and lacks the seamless function-wrapping simplicity of AgentCore Gateway.

The green rows are real, meaningful progress. AWS leads on developer experience and unification. The red rows tell the more important story. And they have a name.

AI-Debt

AI-Debt is the human automation your team built that AI agents cannot reach.

Human is the key word. This is automation written by engineers, for engineers: scripts run manually, pipelines triggered by people, YAML files committed to repos and executed on build agents with specific tooling installed. It works perfectly for the people who use it today. The debt only becomes visible the day an agent arrives and has nothing to call.

AI-Debt has two distinct components: Interface-Debt and Context-Debt. Both matter, and neither is solved by any vendor.

What's locked: Interface-Debt

Interface-Debt is automation that exists but cannot be called by agents. It has no discoverable interface, no function signature, no API endpoint, no callable handle that an agent can find and invoke.

Your ADO pipeline YAML that runs a Helmfile deployment: not callable by agents. Your PowerShell script that creates Azure resources: not callable by agents. Your Bash script that validates secrets before a deployment: not callable by agents. Your kubectl wrapper that diagnoses stuck pods: not callable by agents. Bicep templates, ARM parameters, Makefile targets, cron scripts: none of them are discoverable or invocable.

Vendors are reducing Interface-Debt for API-surface automation (code that already has a callable interface: Lambda functions, REST APIs, Azure Functions, managed cloud endpoints). Valuable progress. But it only covers automation that already exposes a typed, invocable surface. A Bash script running on an ADO pipeline agent has no Lambda ARN. A Helmfile task has no OpenAPI spec. The auto-wrap tools have nothing to target.

For Azure/ADO environments specifically, the gap is significant. Years of YAML. Hundreds of pipeline tasks. Thousands of shell functions. All invisible to agents.
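A rough way to see the size of this estate is to count the files that hold automation but expose no callable interface. The sketch below walks a repository and tallies such files by kind. The suffix-to-kind mapping and the function name `inventory_locked_automation` are assumptions for illustration, not a complete classifier.

```python
from collections import Counter
from pathlib import Path

# Assumed mapping from file suffix to automation kind; extend it for your estate.
LOCKED_SUFFIXES = {
    ".sh": "bash",
    ".ps1": "powershell",
    ".bicep": "bicep",
    ".yml": "yaml",
    ".yaml": "yaml",
}
# Exact file names take priority over suffixes.
LOCKED_NAMES = {"Makefile": "make", "helmfile.yaml": "helmfile"}


def inventory_locked_automation(root: str) -> Counter:
    """Count automation files under root that have no agent-callable interface."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        kind = LOCKED_NAMES.get(path.name) or LOCKED_SUFFIXES.get(path.suffix.lower())
        if kind:
            counts[kind] += 1
    return counts
```

Calling `inventory_locked_automation(".")` at the root of a pipeline repository gives a first, approximate count of what is locked. It will over-count YAML that is plain configuration, so treat the result as a starting list to review rather than a final number.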

What's dark: Context-Debt

Context-Debt is callable automation that agents cannot use intelligently, because the tools carry no description of when to use them, what they do, or how they behave on your specific platform.

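When an AI agent is given a set of tools, it decides which tool to call, and how, from the name, description, and parameter schema attached to each tool. In the Python MCP SDK, that description is the function's docstring. The agent never sees the code. It sees the description.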
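To make that concrete, the sketch below builds a simplified version of the listing an agent receives for a tool: a name, a description taken from the docstring, and the parameter names from the signature. It uses only the standard library. `describe_tool` is a hypothetical helper, and a real MCP tool listing carries a fuller JSON Schema for the inputs.

```python
import inspect


def describe_tool(func) -> dict:
    """Build a simplified tool listing from a function's signature and docstring."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": list(signature.parameters),
    }


def get_pod_logs(namespace: str, pod_name: str) -> str:
    """Get logs for a pod."""
    return ""


listing = describe_tool(get_pod_logs)
print(listing)
```

Nothing in the function body appears in the listing. Whatever the agent needs to know about when and how to call the tool has to be in the description.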

Research published in 2026 quantified this directly: editing tool docstrings can yield up to 10 times more usage of the same underlying function in production agents. A 2026 benchmark called OpaqueToolsBench studied what happens when tools have incomplete documentation and found that LLMs consistently struggle with tools that lack clear best practices or documented failure modes.

Anthropic's own engineering team documented the same effect: when they launched Claude's web search tool, they discovered Claude was needlessly appending "2025" to every query, biasing results. The fix was not a model change. It was improving the tool's description.

AgentCore Gateway can wrap your Lambda, but it cannot write the docstring that tells the agent when this tool is relevant, what your platform's naming conventions are, or why a particular failure pattern should trigger it first. That knowledge exists only in your engineers' heads, your incident history, and the habits your team has built over years.

A Lambda with an empty docstring is callable. It is not agent-ready. That gap is Context-Debt.

The AI-Debt audit: two questions

Before paying down AI-Debt, you need to know how much you have. Most teams have never measured it.

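Two questions. First, how much of your automation can an agent discover and call today (what's locked)? Second, of the part that is callable, how much carries a description that tells the agent when to use it and what to expect (what's dark)? Count the numbers and you have your baseline.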

Most enterprise DevOps platforms, when audited this way, find 70-90% of their automation sitting in Interface-Debt, with significant Context-Debt on whatever is callable. That number is your starting point.
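The arithmetic behind that baseline is simple enough to sketch. The example below assumes you already have an inventory in which each item is marked as agent-callable or not, and as usefully described or not. The `AutomationItem` fields, the `audit` function, and the sample inventory are all hypothetical. Measuring Context-Debt as a share of the callable items only is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class AutomationItem:
    name: str
    agent_callable: bool          # can an agent discover and invoke it?
    has_useful_description: bool  # does it say when to use it and what to expect?


def audit(items: list[AutomationItem]) -> dict:
    """Return Interface-Debt and Context-Debt as percentages."""
    total = len(items)
    callable_items = [i for i in items if i.agent_callable]
    locked = total - len(callable_items)
    dark = sum(1 for i in callable_items if not i.has_useful_description)
    return {
        "interface_debt_pct": round(100 * locked / total, 1) if total else 0.0,
        "context_debt_pct": (
            round(100 * dark / len(callable_items), 1) if callable_items else 0.0
        ),
    }


# Hypothetical inventory: eight scripts with no interface, two callable tools.
inventory = [AutomationItem(f"script-{n}", False, False) for n in range(8)] + [
    AutomationItem("get_pod_logs", True, True),
    AutomationItem("restart_service", True, False),
]
baseline = audit(inventory)
print(baseline)  # {'interface_debt_pct': 80.0, 'context_debt_pct': 50.0}
```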

How custom MCP tools pay down both debts: three examples

Custom MCP tools are new Python functions decorated with @mcp.tool(). They do two things at once: give your existing automation a callable interface (addressing Interface-Debt), and encode your platform knowledge in their docstrings (addressing Context-Debt). One new function, two debts addressed.

Example 1: Context-Debt (same function, different quality)

This tool retrieves Kubernetes pod logs, a core diagnostic step in any deployment failure. The engineer already has a working kubectl logs call. The question is whether an agent can use it intelligently.

The key is the Python docstring. This is what the agent reads when deciding whether and how to call this function. Not documentation for your colleagues, but the agent's only reasoning interface into your tool.

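# Shared setup for the examples in this article
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("platform-tools")  # server name is illustrative

# Context-Debt: callable, but the agent has nothing to reason with
@mcp.tool()
async def get_pod_logs(namespace: str, pod_name: str) -> str:
    """Get logs for a pod."""
    result = subprocess.run(
        ["kubectl", "logs", pod_name, "-n", namespace],
        capture_output=True, text=True
    )
    return result.stdout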

# Context-Debt resolved: the docstring tells the agent
# when to call this, how to call it, and what to do next
@mcp.tool()
async def get_pod_logs(
    namespace: str,
    pod_name: str,
    tail_lines: int = 100,
    previous: bool = False,
    container: str | None = None
) -> str:
    """
    Retrieve recent logs from a Kubernetes pod in the AKS cluster.

    Use when diagnosing:
    - CrashLoopBackOff pods: set previous=True to see the crash reason
    - Init container failures: pass the init container name as container
    - Startup failures during helmfile atomic deployments

    Namespace naming on this platform: {service}-{env}
    e.g. payments-dev, payments-staging, auth-prod

    If pod_name is unknown, call get_pods_in_namespace() first.
    Returns the last tail_lines log lines. Increase for deeper history.
    Returns empty string if pod has not started emitting logs yet.
    In that case, call describe_pod() to check events instead.
    """
    cmd = ["kubectl", "logs", pod_name, "-n", namespace,
           f"--tail={tail_lines}"]
    if previous:
        cmd.append("--previous")
    if container:
        # kubectl selects a specific (init) container with -c
        cmd += ["-c", container]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout or result.stderr

The code is nearly identical. The agent's ability to use it correctly is not. The second version tells the agent when to call it, what namespace naming convention to use, which companion tool to call when it doesn't have a pod name, and what an empty response means. Without that docstring, the agent either skips the tool, calls it with wrong parameters, or hallucinates a response.
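One way to make that difference measurable is a small lint check that flags docstrings missing the parts an agent relies on. The sketch below is a heuristic under stated assumptions: the marker phrases in `CHECKS` are invented for illustration and mirror the wording used in the example above, and `missing_sections` is a hypothetical helper, not part of any MCP SDK.

```python
import inspect

# Assumed marker phrases for each section; tune these to your own conventions.
CHECKS = {
    "when_to_use": ("use when", "call this"),
    "returns": ("returns",),
    "follow_up": ("first.", "instead", "next:"),
}


def missing_sections(func) -> list[str]:
    """List the docstring sections that none of the marker phrases could find."""
    doc = (inspect.getdoc(func) or "").lower()
    return [
        section
        for section, markers in CHECKS.items()
        if not any(marker in doc for marker in markers)
    ]


def thin_tool(namespace: str, pod_name: str) -> str:
    """Get logs for a pod."""
    return ""


def rich_tool(namespace: str, pod_name: str) -> str:
    """
    Retrieve recent logs from a Kubernetes pod.

    Use when diagnosing CrashLoopBackOff pods.
    If pod_name is unknown, call get_pods_in_namespace() first.
    Returns the most recent log lines.
    """
    return ""


print(missing_sections(thin_tool))  # ['when_to_use', 'returns', 'follow_up']
print(missing_sections(rich_tool))  # []
```

A check like this cannot judge whether a docstring is correct, only whether the expected parts are present. It is still enough to stop one-line descriptions from reaching production unnoticed.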

Example 2: Interface-Debt (wrapping an existing script)

Your team has a Bash script, /scripts/get-failed-builds.sh, that queries the ADO REST API for recent pipeline failures. It has been running for two years. Developers trigger it manually or reference it in ADO pipeline tasks running on a private agent pool. No AI agent can call it. It lives on a file system, not behind an API, with no discoverable interface.

Here is how you pay down Interface-Debt: write a new MCP tool that calls the script, giving it a callable interface, while encoding platform knowledge in the docstring.

# /scripts/get-failed-builds.sh
# Runs on ADO private agent pool, triggered manually or via pipeline task
# Usage: ./get-failed-builds.sh <project> <pipeline-name> <days>
# Returns: JSON array of failed runs with build_id, stage, duration
# No agent can reach this — it has no callable interface

# New MCP tool: Interface-Debt + Context-Debt resolved together
import json
import subprocess

@mcp.tool()
async def get_recent_pipeline_failures(
    project: str,
    pipeline_name: str,
    days_back: int = 7
) -> list[dict]:
    """
    Get recent failed pipeline runs for a given ADO project and pipeline.

    Wraps the internal ADO query script and returns structured failure data.
    Call this as the first step in any pipeline diagnosis workflow.
    It gives you the build IDs needed for deeper analysis.

    Pipeline naming on this platform: {service}-{env}-deploy
    e.g. payments-dev-deploy, auth-staging-deploy, gateway-prod-deploy

    Returns list of failures with fields:
    build_id, start_time, failed_stage, duration_seconds,
    triggered_by, branch, retry_count.

    Most common failed stages on this platform:
    - helmfile-apply     -> missing secrets (79%) or image pull (15%)
    - integration-tests  -> environment config or dependency issues
    - security-scan      -> new CVE in base image (check monthly patch cycle)

    After calling this, pass build_id to diagnose_build_failure()
    for root cause analysis.
    """
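    result = subprocess.run(
        ["/scripts/get-failed-builds.sh", project,
         pipeline_name, str(days_back)],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        # Report the script's own error rather than failing on empty JSON
        raise RuntimeError(f"get-failed-builds.sh failed: {result.stderr.strip()}")
    return json.loads(result.stdout)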

The Bash script has not changed. It still runs where it always ran. The new MCP tool is a thin wrapper that converts it from invisible to callable, and the docstring converts it from callable to agent-ready. That is what paying down Interface-Debt looks like in practice.
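Because many wrappers of this kind share the same failure modes, it can help to put the subprocess handling in one place. The sketch below is one possible shape for that helper, not the author's implementation. `run_json_script` is a hypothetical name, the 60-second timeout is an arbitrary default, and a small inline Python command stands in for the real script so the example runs anywhere.

```python
import json
import subprocess
import sys
from typing import Any


def run_json_script(cmd: list[str], timeout_seconds: int = 60) -> Any:
    """Run a script that prints JSON on stdout and return the parsed value."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"{cmd[0]} timed out after {timeout_seconds}s") from exc
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{cmd[0]} did not return valid JSON") from exc


# Stand-in for the real script: prints one failed run as JSON.
fake_script = [
    sys.executable,
    "-c",
    "import json; print(json.dumps([{'build_id': 101, 'failed_stage': 'helmfile-apply'}]))",
]
failures = run_json_script(fake_script)
print(failures)
```

Each failure path raises an error with a message the agent can read, so a timeout or a non-zero exit is reported as such instead of surfacing as a confusing JSON parse failure.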

Example 3: The purely contextual tool (no vendor equivalent)

This tool has no script to wrap and no API to call. It queries an internal incident database built from 18 months of real platform failures. But the real value is not the database call. It is the docstring that encodes the diagnostic patterns a senior engineer applies instinctively. Think of it as team knowledge, made callable and permanent.

No vendor MCP server can build this. AgentCore Gateway has no OpenAPI spec to import. This tool exists only because someone encoded real incident history into a docstring.

@mcp.tool()
async def get_platform_failure_pattern(
    error_signature: str,
    pipeline_stage: str,
    service_name: str | None = None
) -> dict:
    """
    Look up known failure patterns on this platform from real incident history.

    CALL THIS FIRST in any diagnosis before running other tools.
    It encodes 18 months of incident data and directs you to the
    highest-probability root cause, skipping diagnostic dead ends.

    Known patterns (error_signature -> likely cause):
    - "timed out waiting for condition" + helmfile-apply
      -> missing secret in namespace (79% of cases)
      -> next: call check_keyvault_secret_exists()

    - "ImagePullBackOff"
      -> ACR authentication failure or incorrect image tag (92%)
      -> next: call check_acr_image_exists()

    - "CrashLoopBackOff" shortly after deployment
      -> application ConfigMap missing or malformed (71%)
      -> next: call get_pod_logs(previous=True) then check_configmap()

    - "503 Service Unavailable" post-deployment with healthy pods
      -> stale Istio VirtualService conflict in namespace (58%)
      -> next: call get_all_virtualservices_for_host()

    Returns: likely_cause, confidence_percent, recommended_tools,
    similar_past_incidents, avg_resolution_minutes.

    If confidence < 50%, this is a new pattern not yet seen.
    Document it via create_incident_record() and use the generic path.
    """
    return await query_incident_database(
        error_signature, pipeline_stage, service_name
    )

This is Context-Debt resolved. It is also the only category of tool that truly differentiates your platform's agent capability from every other organisation using the same vendor tooling.
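The listing leaves `query_incident_database` undefined. As a minimal sketch of what could sit behind it, the example below matches an error message against an in-memory table of known patterns. Everything here is a stand-in: `FailurePattern`, `PATTERNS`, and `lookup_failure_pattern` are hypothetical names, the percentages are copied from the docstring above, and a real implementation would query your incident records instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailurePattern:
    signature: str            # substring expected in the error text
    stage: Optional[str]      # pipeline stage, or None to match any stage
    likely_cause: str
    confidence_percent: int
    recommended_tool: str


# Illustrative table; values mirror the docstring in the example above.
PATTERNS = [
    FailurePattern("timed out waiting for condition", "helmfile-apply",
                   "missing secret in namespace", 79,
                   "check_keyvault_secret_exists"),
    FailurePattern("ImagePullBackOff", None,
                   "ACR authentication failure or incorrect image tag", 92,
                   "check_acr_image_exists"),
    FailurePattern("CrashLoopBackOff", None,
                   "application ConfigMap missing or malformed", 71,
                   "get_pod_logs"),
]


def lookup_failure_pattern(error_text: str, pipeline_stage: str) -> dict:
    """Return the first known pattern whose signature and stage both match."""
    lowered = error_text.lower()
    for pattern in PATTERNS:
        stage_matches = pattern.stage is None or pattern.stage == pipeline_stage
        if stage_matches and pattern.signature.lower() in lowered:
            return {
                "likely_cause": pattern.likely_cause,
                "confidence_percent": pattern.confidence_percent,
                "recommended_tools": [pattern.recommended_tool],
            }
    # No match: report zero confidence so the agent takes the generic path.
    return {"likely_cause": None, "confidence_percent": 0, "recommended_tools": []}


match = lookup_failure_pattern(
    "Error: timed out waiting for condition", "helmfile-apply"
)
print(match["likely_cause"], match["confidence_percent"])
```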

A note on diminishing AI-Debt: why vendors won't close this gap

Could vendors extend their auto-wrap to cover scripts and pipeline tasks? Theoretically yes. AWS could build mechanisms to execute arbitrary scripts via Lambda wrappers, Azure could auto-instrument ADO pipeline tasks. But there is a structural reason why they are unlikely to prioritise this.

Vendors have no incentive to solve your platform's knowledge problem. Their investment goes into making their own services and managed resources accessible to agents: Lambda, REST APIs, their cloud-native tooling. The script-and-pipeline estate your team has accumulated over five years is yours, not theirs. Even if a vendor shipped a zero-code script wrapper tomorrow, it would only address Interface-Debt. Context-Debt (the knowledge of when to use each tool, how your platform behaves, what your failure patterns are) remains yours to encode. No vendor will ship that for you.

That is what makes the custom MCP layer valuable and defensible. It is automation that only you can build.

The window is closing

Gartner predicts 40% of enterprise applications will integrate task-specific AI agents by the end of 2026, up from less than 5% in 2025. Gartner also predicts that over 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.

Read those two together. Agents are arriving, and a large share of the projects built around them are expected to fail. The argument of this article is that the cause is often not a poor model but a platform that was not ready.

The ones that succeed will have paid down AI-Debt before the agents arrived.

Where to start

Run the audit. Count your automation across two dimensions: what's locked and what's dark. That number is your baseline.

Start with the highest-value workflows. Incident diagnosis, deployment validation, environment setup. Build custom MCP tools for those five or ten scenarios first.

Treat docstrings as engineering work. The quality of your agent's decisions depends directly on the quality of your docstrings. This is not documentation overhead. It is the core of what makes a platform agent-ready.

AI-Debt is silent. Your scripts still run. Your pipelines still deploy. Everything works perfectly for humans. The debt only becomes visible the day an agent arrives and has nothing to call.

That day is closer than most platform teams realise.

Platform and DevOps engineering at a large UK financial institution. Views are my own.
I write about AI agents, cloud architecture, and occasionally things that have nothing to do with technology.
Building in this space or thinking about AI-Debt on your platform? I would be glad to hear from you.