The read-only MCP server work shipped clean. AI agents could read tags, search logs, query costs, walk topology. Operators saved hours per incident. The next question was obvious: which writes can the agent do too? We turned write capability on for tagging, log retention adjustments, idle non-prod resource stops, and a handful of other low-blast operations. Most of that stayed on. Ninety days in, three specific write classes got pulled back to read-and-propose.
The pieces that stayed write-enabled were the ones the closed-loop trust scoring framework would call "tier 2": low blast radius, high reversibility, single-axis decision. Tag corrections. Log retention nudges. Stopping a non-prod EC2 instance flagged by a 14-day idle alarm. The AI got faster at all of these, and operators stopped intervening on the routine ones.
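The idle-stop gate can be sketched as a single predicate over data the MCP tool already returns. This is an illustrative sketch, not the team's actual code: the field names, the `Instance` shape, and the 0.55 threshold are assumptions.

```python
from dataclasses import dataclass

IDLE_DAYS_REQUIRED = 14   # matches the 14-day idle alarm in the text
TRUST_THRESHOLD = 0.55    # assumed team trust-score threshold

@dataclass
class Instance:
    env_tag: str              # e.g. "prod", "staging", "dev"
    idle_days: int            # days since last observed use
    avg_daily_requests: float # traffic-confirmation signal

def may_auto_stop(inst: Instance, trust_score: float) -> bool:
    """Low-blast write gate: every input is local to the instance itself."""
    return (
        inst.env_tag != "prod"
        and inst.idle_days >= IDLE_DAYS_REQUIRED
        and inst.avg_daily_requests == 0
        and trust_score >= TRUST_THRESHOLD
    )
```

The point of the sketch: nothing in the predicate reaches outside the instance's own metadata, which is exactly why this class could stay write-enabled.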
The three pulled back are the operations where the AI consistently picked the wrong action despite picking the right metric. Each failure mode has the same shape: the agent had every piece of local data the action required and was still missing a piece of global context that turned the action into an outage. This piece is about those three, why adding more local context doesn't fix them, and how the policy-aware MCP work needs to be extended with explicit capability tiers.
What we shipped, what we pulled back
| Tool class | Status after 90 days | Why |
|---|---|---|
| Tag add / correct | Write-enabled | Single-axis, fully reversible, no dependency cascades |
| Log retention adjust | Write-enabled | Reversible, blast radius bounded by retention window |
| Stop idle non-prod EC2 | Write-enabled | Trust-score gated, traffic-confirmed, 90-second reverse |
| Cost query / report | Read-only | Output is the value; no write needed |
| Bulk security-group edit | Pulled back | Dependency cascades invisible to the LLM |
| Cross-region replication toggle | Pulled back | Recovery cost not in the local cost metric |
| Cost-anomaly auto-resolution | Pulled back | Multi-axis root cause; agent picks plausible-wrong action |
| Right-size production resource | Read-only since launch | Never went write-enabled; too high-blast for auto-execution |
The pattern that lets some tools stay write-enabled: the action's success or failure is fully determined by data the MCP tool already returns. Tagging a resource is fine because everything that matters about the tag is in the tag. Stopping an idle non-prod VM is fine because the traffic data, the env tag, and the time-since-last-use are all in the agent's context window. The three failed classes share a different shape.
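The staying-power criterion can be phrased as a check: an action class is safe to auto-execute only if every field its decision reads comes from the tool's own return payload. A minimal sketch, with all names illustrative:

```python
def decision_is_local(required_fields: set[str], tool_payload: dict) -> bool:
    """True when the action's outcome is fully determined by local data,
    i.e. every field the decision depends on is in the tool's payload."""
    return required_fields <= tool_payload.keys()
```

Tagging passes this check (the tag's payload contains everything the tag decision needs); the three failed classes do not, because fields like cross-account dependencies never appear in any local payload.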
Failure 1: bulk security-group edits and the invisible dependency cascade
The trigger event: an agent ran a routine cleanup pass on cleanup-stale-sg-rules. Eleven rules tagged "unused for 90 days" got removed across a non-prod account. One of those rules was a CIDR allow on port 5432 sourced from 10.99.0.0/16. The CIDR didn't map to anything in the source account's resources. The MCP tool's enrichment said "no resources in this account match this source range." Agent removed the rule.
Three minutes later a partner-integration job started failing in a different account. 10.99.0.0/16 was the partner account's VPC CIDR, peered into ours. The peering connection lived in a different VPC than the security group. The dependency wasn't visible from either the SG or its enrichment context.
The MCP tool returned correct local data. There were no resources in account-A using port 5432 from 10.99.0.0/16. That was true. The fact that account-B's ETL job traffic landed on those rules through a VPC peering connection was not in the SG's metadata, not in the tool's enrichment, and not in the agent's context window.
Adding more local data doesn't fix this. We tried "include all VPC peering connections in the enrichment." The next failure was a Direct Connect virtual interface that the agent didn't know to look at. Then a Transit Gateway attachment. The dependency cascade is unbounded; every cloud account has a different topology. The right move is to stop letting the agent execute these and have it propose instead.
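The "propose instead of execute" routing can be sketched as a tool that never deletes, only emits a proposal carrying the local evidence plus an explicit note that cross-account reachability is unverified. The `Proposal` shape and queue are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    action: str
    target: str
    evidence: list[str] = field(default_factory=list)

proposal_queue: list[Proposal] = []

def cleanup_stale_sg_rule(rule_id: str, enrichment: dict) -> Proposal:
    """Emits a proposal rather than deleting the rule. The evidence chain
    names what was checked AND what could not be checked."""
    no_local_matches = enrichment.get("local_matches", 0) == 0
    p = Proposal(
        action="remove_sg_rule",
        target=rule_id,
        evidence=[
            f"no matching resources in this account: {no_local_matches}",
            "cross-account reachability (peering, DX, TGW): NOT VERIFIED",
        ],
    )
    proposal_queue.append(p)
    return p
```

The "NOT VERIFIED" line is the key design choice: instead of pretending the dependency graph was enumerated, the proposal says out loud which part of the graph is unbounded and left to the human.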
Failure 2: cross-region replication toggles and the recovery cost the LLM doesn't see
S3 cross-region replication costs $0.02 per GB transferred plus the destination storage. For cold-tier buckets that are write-once and rarely read, the replication cost can dominate the bucket's monthly bill. The MCP tool exposed this signal cleanly: per-bucket, per-month, here's how much you're spending on replication.
The agent did the obvious thing. It found buckets where replication cost was high relative to read activity, proposed disabling replication, and (when in write mode) executed. Two weeks later one of those buckets was the only durable copy of a quarterly compliance archive. The primary region had a regional storage event. Restore from the replica was the disaster recovery plan. The replica had been turned off six days earlier.
The cost metric the agent saw:
| Signal | Value | What the agent concluded |
|---|---|---|
| Monthly replication cost | $1,200 | Significant spend |
| Read activity (last 90 days) | 0 | Bucket isn't accessed |
| Storage class | STANDARD_IA | Cold tier already |
| Bucket size | 60 TB | Replication cost will keep growing |
The cost metric the agent didn't see: this bucket is the only off-region copy of a regulator-required dataset. The recovery cost if the primary region fails and the replica is gone is not "$1,200 saved per month." It's "potential six-figure compliance penalty plus reconstruction work plus customer trust damage."
This isn't fixable by giving the agent more local data. The recovery posture lives in a separate system (the disaster recovery plan, the compliance map, sometimes just one person's head). Even if the bucket had a dr-tier=critical tag, an agent optimizing for cost-per-byte would override the tag because the dollars look so clear. Cross-region replication toggles need a human in the decision because the human knows what the bucket is for, not just what it costs.
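One way to encode "the human decides regardless of the dollars" is to make the routing a property of the action class, not of the cost math, so no saving estimate can promote a replication toggle to auto-execute. A sketch under assumed class names:

```python
# Action classes routed to a human no matter what the cost signal says.
# The class names are illustrative, not a real API.
HIGH_BLAST_CLASSES = {
    "s3:toggle_replication",
    "ec2:bulk_sg_edit",
    "cost:auto_resolve_anomaly",
}

def requires_human(action_class: str, monthly_saving_usd: float) -> bool:
    # The saving argument is deliberately ignored for high-blast classes:
    # $1,200/month never outweighs an unknown recovery posture.
    return action_class in HIGH_BLAST_CLASSES
```

Note the asymmetry with a tag-based guard: a `dr-tier=critical` tag is data the agent can reason past, while class-level routing is structure the agent cannot reason past.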
Failure 3: cost-anomaly auto-resolutions and root-cause attribution
The cost-anomaly detector flagged a 40% EC2 spend spike in us-east-1 on a Wednesday. The MCP tool returned the anomaly with the standard enrichment: which accounts contributed, which instance types, time-of-day breakdown. The agent's proposed action: right-size the largest contributing instances; estimated saving $14,000/month.
The actual root cause: a feature deployment on Tuesday night had moved a recommendation pipeline from batch to streaming. The new architecture used larger instances continuously instead of bursting briefly through a queue. The "spike" was the new steady state. Right-sizing the instances would have broken the recommendation latency SLO and triggered a rollback by morning.
The agent's reasoning chain was technically correct. EC2 spend was up. The largest instances were the ones that grew. Right-sizing is a known-safe action class. The miss was that the cost change had a non-cost cause. Cost-anomaly attribution is multi-axis: deployment changes, traffic changes, instance-type changes, capacity-reservation expirations, regional pricing changes, and tagging changes can all surface as the same cost signal.
The agent had the change set deployed Tuesday in its context window — we had wired GitHub deployments into the MCP server precisely for this. The agent didn't connect them because the deployment description said "switch to streaming pipeline" and the agent was reasoning about EC2 cost, not architectural changes. The correlation existed; the LLM didn't bridge to it.
We considered training a smaller model on internal incident postmortems to do better attribution. The time cost was prohibitive and the failure mode would just shift from "wrong action" to "wrong action with more confidence." The structural fix was to stop letting the agent close the loop on cost-anomaly resolutions. The agent now produces a ranked list of candidate actions with the evidence chain attached. A human picks one or marks the anomaly as expected.
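The ranked-candidate output can be sketched as a small data structure plus a renderer; the shape is an assumption based on the runbook described below, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    rank: int
    action: str
    est_saving_monthly_usd: int
    risk_note: str

def render_runbook(anomaly: str, candidates: list[Candidate]) -> str:
    """Formats the agent's analysis as a pick-one list for the reviewer."""
    lines = [
        f"Anomaly: {anomaly}",
        "Candidate actions (pick one, or mark the anomaly as expected):",
    ]
    for c in sorted(candidates, key=lambda c: c.rank):
        lines.append(
            f"  {c.rank}. {c.action} "
            f"(~${c.est_saving_monthly_usd}/mo) [{c.risk_note}]"
        )
    return "\n".join(lines)
```

The evidence chain rides along in `risk_note`; the agent does the ranking, the human does the choosing.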
The pattern: local data, global decision
All three failures share the same shape. The LLM has the local data the action requires. The action requires global context the LLM doesn't have.
| Failure class | Local data agent had | Global context agent missed |
|---|---|---|
| Bulk SG edit | SG metadata, enrichment of resources in same account | Cross-account peering dependencies, partner traffic patterns |
| Cross-region replication toggle | Replication cost, read activity, bucket size | Disaster recovery posture, compliance requirements |
| Cost-anomaly auto-resolution | Cost data, instance attribution, time-of-day breakdown | Concurrent deployment context, traffic-shape changes, intentional architectural shifts |
Adding more local context doesn't fix this. Each time we tried (more enrichment fields, more cross-references, more ML-extracted context), we found a new edge case in the same failure class. The missing information is by definition not in the resource being touched. It's in another system, another team's heads, or the timeline of unrelated events.
Two patterns that DO work in this shape: a human in the decision loop, or a structural rule that pre-filters the action class. We picked human-in-the-loop because the structural rule is hard to maintain (every edge case becomes a new rule, the rule list grows unbounded), and the human cost is low when the action is pre-formatted.
Read+propose is not read-only
The pull-back is "the AI drafts, the human approves." The MCP server still enriches the signal, runs the dependency analysis, computes the cost-benefit, formats the change as a runbook with an evidence chain. The human's job is to look at the change description and click "approve" or "reject."
For the cost-anomaly case, the change description renders as a structured runbook. The header section names the anomaly (EC2 spend +40% in us-east-1, started Wed 02:00 UTC), the contributing accounts (prod-us-east, ml-pipeline, data-platform), and a ranked list of candidate actions:
| Rank | Action | Estimated saving | Risk note |
|---|---|---|---|
| 1 | Right-size instances i-abc, i-def, i-ghi | $14k/mo | Triggered by Tue 23:00 deployment "switch recs to streaming"; verify with platform team before action |
| 2 | Roll back the Tue deployment | $14k/mo | Feature rollback; coordinate with rec-pipeline team |
| 3 | Mark anomaly as expected | $0 | Use if streaming rec pipeline is intentional steady-state |
Most reviewers approve in seconds. The friction cost is one click per change instead of zero. The incident cost saved is hours of recovery work plus the trust damage of a bad auto-action. The collapse from a 30-minute investigation to a 30-second review on the human side is what makes read+propose worth it: the AI did the analysis, the human decided.
This is more valuable than read-only because the AI is still doing 95% of the work. The reasoning chain, the cost math, the evidence collection, the runbook formatting — all of that is the AI's job. The human's job collapses to "is this the right action?" which is the part the AI can't do reliably for these three classes.
MCP capability tiers + trust-score gating
The structural fix is to make this a property of the MCP server, not a per-tool policy debate every time someone wants to add a new write operation. Each MCP tool exposes a capability_tier field in its metadata.
Tier 1 (read-only) is always allowed. Tier 2 (mutate-low-blast) requires the trust score to pass the team's threshold (typically 0.55). Tier 3 (mutate-high-blast) requires a trust score so high it effectively forces a human into the loop. The three pulled-back classes go to tier 3.
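The tier check at the MCP server's dispatch point can be sketched as follows. The `Tier` enum and thresholds are illustrative; in particular, the text describes tier 3 as a trust bar high enough to force a human into the loop, which this sketch encodes directly as an approval flag:

```python
from enum import IntEnum

class Tier(IntEnum):
    READ_ONLY = 1
    MUTATE_LOW_BLAST = 2
    MUTATE_HIGH_BLAST = 3

TIER2_THRESHOLD = 0.55  # typical team threshold from the text

def authorize(tool_tier: Tier, trust_score: float, human_approved: bool) -> bool:
    """Runs at the MCP server on every tool call, so it can't be bypassed
    by an agent that knows the underlying cloud API."""
    if tool_tier == Tier.READ_ONLY:
        return True
    if tool_tier == Tier.MUTATE_LOW_BLAST:
        return trust_score >= TIER2_THRESHOLD
    # Tier 3: no trust score substitutes for a human decision.
    return human_approved
```

Because the check keys off the tool's `capability_tier` metadata rather than the agent's IAM permissions, re-tiering a tool (as happened to the three pulled-back classes) is a one-field change with an audit trail.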
The capability-tier field is more important than the agent's IAM permissions. An agent with broad cloud permissions but only access to tier-1 and tier-2 tools is safer than an agent with narrow permissions and a tier-3 tool. The MCP server is the right enforcement point because that's where every tool call already passes through. Adding the tier check there means it can't be bypassed by an agent that knows the underlying API.
The tier assignment is the policy debate that does belong in the team. Each new tool gets a tier when it ships, and the tier can move (usually upward, as failure modes show up). The three classes in this piece moved from tier 2 to tier 3 ninety days in. The capability-tier field is the audit trail: what was the tier when the action ran, who set the tier, when.
What changes in the next quarter
Two things on the roadmap. First, we're instrumenting every agent invocation with a resource-touch graph: which resources did the agent read, which did it propose to mutate, which other resources reference any of those. The graph is the input to a "did the agent see all the dependencies" check. Early prototype catches the cross-account peering case from the first failure cleanly, but doesn't yet handle Direct Connect or Transit Gateway. The structural problem is that the dependency graph is unbounded; the practical fix is to score "graph completeness confidence" and feed it back into the trust score.
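The "graph completeness confidence" idea can be sketched as a score over which dependency edge types the touch graph actually resolved. The edge-type names are assumptions drawn from the failures above:

```python
# Dependency edge types the resource-touch graph knows how to resolve.
KNOWN_EDGE_TYPES = {
    "same_account",
    "vpc_peering",
    "direct_connect",
    "transit_gateway",
}

def completeness_confidence(resolved_edge_types: set[str]) -> float:
    """Fraction of known edge types the agent's graph covered; fed back
    into the trust score so an incomplete graph lowers write authority."""
    return len(resolved_edge_types & KNOWN_EDGE_TYPES) / len(KNOWN_EDGE_TYPES)
```

This doesn't solve the unbounded-graph problem; it just makes the incompleteness a number the trust score can react to instead of a silent blind spot.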
Second, we're auditing whether multi-step agent loops (where the agent reads, proposes, gets feedback, iterates) hit the same failure shape as single-shot writes. Early sample size is small but suggests yes — the multi-step agent gathers more local context than the single-shot but still misses the global piece. If that holds, the capability-tier model needs to apply to the loop's terminal action, not the intermediate reasoning steps.
Read+write MCP works for most cloud operations. The three classes pulled back are the ones where the trust-score math couldn't compensate for the missing global context. Read+propose preserves the AI's value at the cost of one click per change. The capability-tier field on MCP tools is the structural enforcement point that makes the policy debate happen once per tool, not once per action.