DEV Community: Ajay Devineni

The AI Agent Cost Ceiling Problem: Why Your AWS Bill Is Your Reliability Alert

Ajay Devineni — Mon, 11 May 2026 21:16:09 +0000

Production AI agents fail on tool calls 3–15% of the time. That's not a failure rate you fix — it's a reality you design around.

The teams that have designed around it have circuit breakers: token budgets, retry limits, cost anomaly alerts wired to incident response.

The teams that haven't find out from their AWS bill.

This article is about the reliability infrastructure between those two outcomes.

The Retry Loop Failure Mode

When an AI agent calls a tool and gets an ambiguous response — not an error, not a success, just something unexpected — most agents do what they're designed to do: they try again. And again. And again.

Without a hard retry limit, this becomes a loop. Without a token budget cap, the loop has no ceiling. Without observability instrumentation specific to retry signatures, your standard dashboards show nothing unusual until the cost spike appears.

In documented production deployments, the cost spike is the first operational signal that something has gone wrong. By that point, if the agent has write permissions and has queued remediation actions, the incident may have worsened before anyone noticed the loop.

This is the reliability problem behind the cost problem. The bill is the symptom. The missing circuit breaker is the cause.

Why Standard SLIs Don't Catch It

Request latency: normal. The agent is responding within SLO. Error rate: zero. Every call returns something — just not what the agent expected. Availability: 100%. The agent is up and running.

The retry loop produces none of the infrastructure-layer signals your existing alerts are watching.

What it does produce is a Tool Invocation Efficiency (TIE) anomaly — your agent is making 4, 6, 8 tool calls per task when its baseline is 2. That ratio climbing is your early warning. It fires before the billing cycle closes. It fires before the incident escalates.

This is why TIE is a first-class SLI in the agentsre library. It catches what latency and error rate miss.

The Three Circuit Breakers

Every production AI agent needs three reliability controls specifically for the retry loop failure mode:

1. Hard Token Budget Per Session

Set a maximum token count per agent session. Not a soft recommendation in the system prompt — a hard limit enforced at the infrastructure layer. When the agent hits the limit, it stops executing and routes to your escalation path.

The budget should be sized at 3x your P95 task token usage. A task class whose P95 is 2,000 tokens gets a 6,000-token ceiling. Anything above that is a signal, not normal operation.

from agentsre import AgentSLICollector, TaskRecord

# Assumes `collector` is an AgentSLICollector instance you have already set up,
# and `trigger_circuit_breaker` is your own escalation hook; neither is defined in this snippet.

# Record each task; tool_calls is what feeds the TIE calculation
collector.record(TaskRecord(
    task_id="t-001",
    task_class="incident-analysis",
    tool_calls=8,               # elevated: baseline is 2.3
    decision_confidence=0.71,
    completed=True,
))

# TIE will catch the retry signature before the bill does
results = collector.collect("incident-analysis")
for r in results:
    if r.breached:
        trigger_circuit_breaker(r)
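
The record above tracks tool calls, not tokens, so here is a minimal sketch of what a hard per-session ceiling could look like when enforced outside the prompt. `TokenBudget`, `BudgetExceeded`, and `charge` are hypothetical names used for illustration and are not part of agentsre; the 3x multiplier is the sizing rule from this section.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a session would exceed its hard token ceiling."""


@dataclass
class TokenBudget:
    # Hypothetical helper for illustration; not an agentsre API.
    p95_task_tokens: int
    multiplier: float = 3.0  # sizing rule from the text: 3x P95
    used: int = 0

    @property
    def ceiling(self) -> int:
        return int(self.p95_task_tokens * self.multiplier)

    def charge(self, tokens: int) -> int:
        """Record usage for one model call and return the tokens remaining.

        Refuses the call (and records nothing) if it would break the ceiling.
        """
        if self.used + tokens > self.ceiling:
            raise BudgetExceeded(
                f"session would use {self.used + tokens} tokens, "
                f"ceiling is {self.ceiling}"
            )
        self.used += tokens
        return self.ceiling - self.used


budget = TokenBudget(p95_task_tokens=2000)
remaining = budget.charge(1500)  # 4500 of the 6000-token ceiling remain
```

Calling `charge` before every model invocation, and routing `BudgetExceeded` to your escalation path, keeps the limit in code the agent cannot talk its way around.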

2. Retry Loop Signature in Observability

A retry loop has a distinctive signature: tool call count per task climbing above baseline, task completion time extending beyond P99, and decision confidence declining across sequential attempts.

Configure a CloudWatch alarm on TIE drift: when tool calls per task exceed 2x baseline for 10 consecutive minutes, fire an alert. This is your early warning before the cost spike and before the incident escalates.

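# CloudWatch alarm for retry loop detection.
# The 2.0 threshold assumes the metric is published as a multiple of baseline
# (tool calls per task divided by baseline); 2 periods x 300 s gives the 10-minute window.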
aws cloudwatch put-metric-alarm \
  --alarm-name "AgentRetryLoopDetected" \
  --metric-name "ToolInvocationEfficiency" \
  --namespace "AgentReliability" \
  --statistic Average \
  --period 300 \
  --threshold 2.0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:REGION:ACCOUNT:AgentAlerts
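
For teams that want to reason about the alarm logic before wiring it to CloudWatch, the same rule can be sketched in plain Python. This is an illustration under stated assumptions: the function names are hypothetical, and the metric is assumed to be expressed as a multiple of the task-class baseline.

```python
from typing import Sequence


def tie_ratio(tool_calls: int, tasks_completed: int, baseline: float) -> float:
    """Tool calls per completed task, expressed as a multiple of the baseline."""
    if tasks_completed == 0 or baseline <= 0:
        raise ValueError("need at least one completed task and a positive baseline")
    return (tool_calls / tasks_completed) / baseline


def sustained_breach(ratios: Sequence[float], threshold: float = 2.0,
                     periods: int = 2) -> bool:
    """True when the most recent `periods` windows all exceed the threshold."""
    if len(ratios) < periods:
        return False
    return all(r > threshold for r in ratios[-periods:])


# Three 5-minute windows for one task class with a baseline of 2 calls per task.
windows = [
    tie_ratio(tool_calls=22, tasks_completed=10, baseline=2.0),  # about 1.1x
    tie_ratio(tool_calls=50, tasks_completed=10, baseline=2.0),  # 2.5x
    tie_ratio(tool_calls=80, tasks_completed=10, baseline=2.0),  # 4.0x
]
alert = sustained_breach(windows)
```

Requiring two consecutive windows mirrors `--evaluation-periods 2`: a single noisy window does not page anyone, but a loop that persists does.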

3. Cost Anomaly as Incident Trigger

Wire your AWS Cost Anomaly Detection to your incident management system. An AI agent whose cost per hour doubles is experiencing a reliability event — treat it as one.

Set a cost anomaly threshold at 150% of your rolling 7-day average for the relevant Lambda functions and Bedrock invocations. When it fires, it routes to the same on-call channel as your availability alerts — because it is an availability signal.
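
The threshold arithmetic is simple enough to sketch directly. The following is a plain-Python illustration of the 150%-of-rolling-average rule, not a configuration for AWS Cost Anomaly Detection itself; the function name and the sample figures are hypothetical.

```python
from statistics import mean
from typing import Sequence


def cost_anomaly(daily_costs: Sequence[float], today: float,
                 window_days: int = 7, ratio: float = 1.5) -> bool:
    """True when today's spend exceeds `ratio` times the rolling average."""
    if len(daily_costs) < window_days:
        raise ValueError("not enough history for the rolling window")
    rolling_avg = mean(daily_costs[-window_days:])
    return today > rolling_avg * ratio


# Illustrative daily spend in USD for one agent's Lambda and Bedrock usage.
history = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 40.0]
should_page = cost_anomaly(history, today=61.0)
```

With a 7-day average of 40, the trigger point is 60, so a day at 61 pages the on-call channel while a day at 59 does not.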

The Numbers Behind This

Gartner predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027. Cost overruns and inadequate risk controls rank in the top three reasons. These are not independent failure modes; they're the same failure mode at different stages of the same incident.

The retry loop causes the cost overrun. The missing circuit breaker causes the retry loop. The missing circuit breaker exists because teams treat AI agent reliability as an application problem rather than an infrastructure problem requiring SRE governance.

What To Do Before Your Next Agent Goes Live

Three checks before any AI agent touches production:

Check 1: Does this agent have a hard token budget enforced at the infrastructure layer? Not a prompt instruction — a hard limit.

Check 2: Is TIE instrumented per task class with a 2x-baseline breach alert configured?

Check 3: Is cost anomaly detection wired to your incident management system for this agent's associated AWS resources?

If any answer is no — the agent is not production-ready. It is demo-ready.

The circuit breaker for the retry loop costs an afternoon to build. The absence of it costs the project.

Open-source implementation: github.com/Ajay150313/agentsre — the agentsre library instruments TIE, DQR, HER, and AQDD out of the box with AWS CloudWatch integration.

LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7459711021738307584-x6cv?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What ceiling do you have today when an agent starts looping?

The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together

Ajay Devineni — Fri, 08 May 2026 02:02:45 +0000

Amazon's March 2026 AI outages — two separate incidents within three days, totaling more than 6 million lost orders — have done something unusual for the SRE community: they've made a failure mode visible that most teams have been quietly carrying in their production systems without acknowledging.

The incidents were traced to AI-generated code changes deployed without adequate approval gates. Amazon's response was a 90-day code safety reset across 335 critical systems, with a new requirement that AI-assisted code changes be reviewed by senior engineers before deployment.

That response is SRE discipline. Applied reactively. This article is about applying it proactively — and about a compounding failure mode most teams haven't modeled yet.

The Double-Exposure Problem

The SRE concept of blast radius asks: when a component fails, what is the maximum scope of impact? Most blast radius models assume that the failing component is one thing — a service, a database, a network partition.

In 2026 production environments, a new blast radius scenario is emerging that most models don't account for:

What happens when your AI agent and the AI-generated code it runs on fail simultaneously?

This is the double-exposure problem. It has three components:

Exposure 1: AI runtime behavior. Your AI agent operates non-deterministically. Its decisions, tool selections, and reasoning paths vary across invocations. Standard observability (latency, error rate, availability) does not instrument this layer. The semantic failure modes (wrong decisions, context drift, tool compensation) are invisible to your dashboards.

Exposure 2: AI-generated code changes. Your CI/CD pipeline uses AI assistance to generate infrastructure changes, configuration updates, or application code. According to Lightrun's 2026 survey of 200 senior SRE and DevOps leaders, 43% of these changes require manual debugging in production even after passing QA. Not a single survey respondent reported being "very confident" that AI-generated code would behave correctly in production.

Exposure 3: The interaction. When an AI-generated code change deploys to the same environment your agent is operating in, you have two non-deterministic systems interacting. The code change may alter the agent's tool environment, context window, or available action space in ways that manifest as behavioral drift: drift that your current instrumentation will miss because it's measuring infrastructure, not agent behavior.

The result: a production incident that looks like agent degradation. The root cause is a code change. The RCA takes hours because the investigation starts at the wrong layer.

Why Standard Observability Misses This

IEEE Spectrum described this failure class in their recent article on quiet AI failures: every monitoring dashboard reads healthy while users report that system decisions are becoming wrong.

This is structurally identical to what happens in the double-exposure scenario. A code change that subtly alters an agent's tool environment produces no infrastructure-layer signal. The agent's HTTP responses stay at 200. Latency stays within SLO. Error budget stays unburned.

What changes is the agent's Decision Quality Rate — the percentage of decisions falling within expected behavioral bounds. And Tool Invocation Efficiency — the ratio of tool calls per task completion. And eventually Human Escalation Rate — the percentage of tasks requiring intervention.

None of these are instrumented in a standard observability stack. All of them detect the double-exposure failure mode before it reaches user impact.
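
To make the three definitions concrete, here is a minimal sketch that computes them from a window of task records. The `Task` record shape and `behavioral_slis` function are hypothetical and simplified for illustration; they are not the agentsre `TaskRecord` or its collector.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Task:
    # Hypothetical record shape for illustration; not the agentsre TaskRecord.
    tool_calls: int
    completed: bool
    decision_in_bounds: bool
    escalated_to_human: bool


def behavioral_slis(tasks: Iterable[Task]) -> dict:
    tasks = list(tasks)
    if not tasks:
        raise ValueError("no tasks in window")
    completed = [t for t in tasks if t.completed]
    total_calls = sum(t.tool_calls for t in tasks)
    return {
        # Decision Quality Rate: % of decisions inside expected bounds
        "dqr": 100.0 * sum(t.decision_in_bounds for t in tasks) / len(tasks),
        # Tool Invocation Efficiency: tool calls per completed task
        "tie": total_calls / len(completed) if completed else float("inf"),
        # Human Escalation Rate: % of tasks needing intervention
        "her": 100.0 * sum(t.escalated_to_human for t in tasks) / len(tasks),
    }


sample = [
    Task(tool_calls=2, completed=True, decision_in_bounds=True, escalated_to_human=False),
    Task(tool_calls=3, completed=True, decision_in_bounds=True, escalated_to_human=False),
    Task(tool_calls=7, completed=True, decision_in_bounds=False, escalated_to_human=True),
    Task(tool_calls=4, completed=True, decision_in_bounds=True, escalated_to_human=False),
]
slis = behavioral_slis(sample)
```

None of the inputs come from HTTP status codes or latency histograms, which is why a dashboard built only on those signals stays green while these numbers move.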

The Governance Framework

Amazon's 90-day reset is a retroactive version of what proactive SRE governance looks like. Here are the four components that matter, drawn from first principles rather than post-incident response:

1. The AI Code Change Approval Gate

Every code change touching an AI agent's runtime environment — its tools, configuration, action space, or infrastructure — should require explicit approval before deployment. Not because AI code generation is untrustworthy, but because non-deterministic code changes interacting with non-deterministic runtime systems have a compounding failure surface that standard CI/CD testing cannot fully cover.

This is not a new concept. Amazon has now required it. The cost of implementing it proactively is hours. The cost of discovering it's missing is incidents.

Implementation: A dedicated approval stage in your deployment pipeline for changes flagged as AI-generated or agent-environment-adjacent. This is distinct from your standard peer review — it specifically evaluates: does this change touch any agent's tool environment, context configuration, or action space?

2. Behavioral Baseline Snapshots Around Code Deployments

Apply the same framework version governance pattern to AI code changes: snapshot your agent's behavioral baselines before the change deploys, and compare post-deployment behavior against them.

Specifically, capture per-task-class TIE and DQR baselines immediately before any deployment that touches your agent's environment. Run the deployment in a shadow environment for a minimum review period. If TIE drifts more than 15% or DQR drops more than 15%, flag for human review before promoting to production.

This is the instrumentation that would have surfaced Amazon's failure earlier in the pipeline — not at the infrastructure layer, but at the behavioral layer where the actual impact manifested.

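# UpgradeDecision is assumed to be exported from agentsre.sprawl alongside
# FrameworkVersionGovernance; adjust the import if your version differs.
from agentsre.sprawl import FrameworkVersionGovernance, UpgradeDecision

# current_tie_samples / current_dqr_samples are the recent per-task values you collect;
# block_deployment is your own pipeline hook.
# Capture baseline before deployment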
gov = FrameworkVersionGovernance(
    tie_drift_threshold=1.15,
    dqr_drift_threshold=0.85,
    min_shadow_samples=30,
)

gov.snapshot_baseline(
    agent_id="your-agent",
    task_class="your-task-class",
    framework_version="pre-ai-code-change",
    tie_values=current_tie_samples,
    dqr_values=current_dqr_samples,
)

# After shadow deployment — evaluate before promoting
result = gov.evaluate_upgrade(
    agent_id="your-agent",
    task_class="your-task-class",
    production_version="pre-ai-code-change",
    shadow_version="post-ai-code-change",
)

if result.decision == UpgradeDecision.BLOCK:
    block_deployment(result.block_reason)

3. A Blast Radius Model for Double-Exposure

Most blast radius models assume one failing component. Run the double-exposure calculation explicitly:

Which of your production services depend on AI agents?
Which code paths in those services are AI-generated?
If both the agent's semantic behavior and the underlying code fail simultaneously, what is the maximum scope of user impact?
What is the safe degradation sequence — which agent capabilities can you reduce autonomously, and in what order?

This calculation should exist as a named document, owned by a named person, reviewed quarterly. It is the blast radius equivalent of a fire drill — done in advance so the answer is known before the incident.
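
A first pass at the calculation can be as small as an inventory and a filter. The sketch below is hypothetical throughout: the service names, flags, and user counts are invented for illustration, and summing users gives an upper bound that assumes the user bases do not overlap.

```python
# Hypothetical inventory; names, flags, and user counts are illustrative only.
services = {
    "checkout":  {"uses_agent": True,  "ai_generated_paths": True,  "users": 120_000},
    "search":    {"uses_agent": True,  "ai_generated_paths": False, "users": 300_000},
    "invoicing": {"uses_agent": False, "ai_generated_paths": True,  "users": 15_000},
    "support":   {"uses_agent": True,  "ai_generated_paths": True,  "users": 40_000},
}


def double_exposed(inventory: dict) -> list:
    """Services that both depend on an agent and contain AI-generated code paths."""
    return sorted(
        name for name, s in inventory.items()
        if s["uses_agent"] and s["ai_generated_paths"]
    )


def max_user_impact(inventory: dict, names: list) -> int:
    """Upper bound on affected users, assuming no overlap between services."""
    return sum(inventory[n]["users"] for n in names)


exposed = double_exposed(services)
impact = max_user_impact(services, exposed)
```

The output is the list that goes into the named document: which services are double-exposed and how large the worst case is.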

4. A Proactive Runbook — Not Amazon's Retroactive Reset

Amazon's 90-day reset is a retroactive runbook. Write yours proactively. A minimum viable AI code reliability runbook covers:

Detection: Which metrics signal that an AI code change has degraded agent behavior? (Answer: TIE drift, DQR drop, HER increase — not latency or error rate)
Attribution: How do you determine whether the degradation is agent behavior, code change, or model drift? (Answer: compare against behavioral baseline snapshots captured pre-deployment)
Containment: What is the fastest path to reverting the code change while maintaining partial agent operation? (Answer: the progressive autonomy constraint ladder — not a binary kill switch)
Recovery criteria: When is it safe to redeploy? (Answer: shadow behavioral baselines within ±15% of production baseline for 30 consecutive minutes)
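
The recovery criterion in the last item can be expressed as a small check. This is a sketch under assumptions: `safe_to_redeploy` is a hypothetical helper, and it assumes one shadow sample per minute for a single metric such as TIE.

```python
from typing import Sequence


def safe_to_redeploy(shadow_per_minute: Sequence[float],
                     production_baseline: float,
                     tolerance: float = 0.15,
                     required_minutes: int = 30) -> bool:
    """True when the last `required_minutes` shadow samples all sit within
    +/- tolerance of the production baseline."""
    if production_baseline <= 0:
        raise ValueError("baseline must be positive")
    if len(shadow_per_minute) < required_minutes:
        return False
    low = production_baseline * (1 - tolerance)
    high = production_baseline * (1 + tolerance)
    recent = shadow_per_minute[-required_minutes:]
    return all(low <= value <= high for value in recent)


steady = [2.4] * 30            # 30 minutes at the baseline
one_spike = [2.4] * 29 + [3.0]  # last minute is 25% above baseline
```

A single out-of-band minute resets the clock, which is the point of requiring consecutive minutes.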

The SRE Perspective on AI Code Generation

The Lightrun finding that 88% of SRE leaders need two to three redeploy cycles to verify an AI-generated fix suggests something straightforward: the testing and verification frameworks for AI-generated code have not kept pace with the adoption of AI code generation.

This is the same lag that produced Amazon's incidents. And it's the same lag that the SRE community has closed before — with microservices, with Kubernetes, with cloud-native architectures. Each time, capability arrived before governance. The SRE discipline developed the governance.

The governance for AI-generated code in agent environments exists. Error budgets, blast radius models, approval gates, behavioral baseline comparison — these are standard SRE tools. They need to be applied to a new layer of the stack.

The open-source implementation is at github.com/Ajay150313/agentsre. The FrameworkVersionGovernance module handles behavioral baseline capture and comparison. The progressive constraint ladder handles safe degradation. Both work for AI code change governance as directly as they do for framework upgrades.

Amazon spent 6.3 million lost orders learning this lesson. Most teams can learn it for the cost of an afternoon.

What To Do This Week

If you're running AI agents in production and using AI-assisted code generation in the same environment:

Today: Identify which code changes in your last 30 days touched your agent's tool environment, configuration, or action space. Determine whether any were AI-generated. If yes — were they reviewed specifically for agent-environment impact?

This week: Add an AI code change flag to your deployment pipeline. Start capturing TIE and DQR baselines around any deployment flagged as agent-environment-adjacent.

This month: Run the double-exposure blast radius calculation. Document the result. Assign an owner. Review it with your team.

The Amazon incidents happened in March. The Lightrun survey data was collected in January. IEEE Spectrum is calling quiet failure one of the defining challenges of the year.

The signal is clear. The governance frameworks exist.

Open-source implementation: github.com/Ajay150313/agentsre
LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-activity-7458330530212835328-36__?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What does your current approval gate for AI-generated code look like? Or is this the first time you've run the double-exposure calculation?

The Agent Control Plane is an SRE Problem: Governing the Orchestration Layer Nobody is Watching

Ajay Devineni — Mon, 04 May 2026 23:57:35 +0000

IBM's Distinguished Engineer Chris Hay declared this week that "agent control planes and multi-agent dashboards become real in 2026." Gartner projects that 40% of enterprise applications will use task-specific AI agents by 2026. The orchestration infrastructure to manage all of those agents — the control plane — is becoming the most critical and least governed layer in production AI.

This article applies SRE discipline to the agent control plane: what it is, what failure modes it introduces, and what instrumentation it requires before it goes to production.

What Is an Agent Control Plane?

In 2026, an agent control plane is the orchestration layer that:

Receives tasks from humans or upstream systems
Decomposes them into subtasks
Routes subtasks to specialist agents
Manages retry, rescheduling, and priority queues across the agent fleet
Makes autonomous decisions about resource allocation when demand spikes

The control plane is distinct from the agents it manages. It is infrastructure — the same way a Kubernetes control plane is distinct from the pods it schedules.

This distinction matters for reliability: when the control plane degrades, it does not degrade one agent. It degrades the entire fleet simultaneously.

The Control Plane Failure Taxonomy

Control plane failures are uniquely difficult to detect because they do not look like single-agent failures. They look like correlated degradation across multiple agents — which standard observability interprets as coincidence or noise.

Failure Class 1: Routing Drift

The control plane misassigns tasks to suboptimal agents — sending high-complexity reasoning tasks to agents specialized for retrieval, or routing compliance-sensitive tasks through agents without the required tool access. Each individual agent appears healthy. The control plane's routing logic is the failure.

Observable signal: fleet-wide DQR drops across unrelated task classes simultaneously.

Failure Class 2: Retry Storms

When multiple downstream agents fail simultaneously, the control plane retries across its full routing table. Each retry generates additional tool calls. If the control plane does not implement backoff and circuit breaking at the routing layer, a partial agent outage generates a retry storm that saturates the entire MCP tool layer.

Observable signal: fleet-wide TIE spike not attributable to any single agent or task class.

Failure Class 3: Priority Queue Starvation

Under load, control planes must prioritize. If the priority algorithm fails — or if it was never set — low-priority tasks consume resources that high-priority tasks need. Users of business-critical workflows experience silent slowdown while batch jobs consume capacity.

Observable signal: AQDD breaches across multiple task classes with no corresponding error rate increase.

Failure Class 4: Decomposition Accuracy Degradation

As task complexity increases, the control plane's decomposition logic produces subtask sets that are incomplete, redundant, or contradictory. Individual agents execute their subtasks correctly. The composed result is wrong because the decomposition was wrong.

Observable signal: HER climbs fleet-wide — humans are intervening not because agents failed, but because the task decomposition produced nonsensical results.

The Three SLIs Your Control Plane Needs

I extend the agentsre SLI framework with three control plane-specific measurements:

1. Routing Accuracy Rate (RAR)

The percentage of task assignments that match the optimal agent for the task class, measured against a labeled evaluation set.

RAR(t, w) = (correct_assignments / total_assignments) × 100

Baseline during a 30-day calibration window. Alert when RAR drops >15% from baseline — this is the signal that routing logic has drifted, usually because a new task class was added without updating routing rules.
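
As a rough sketch, RAR and its drift check might be computed as follows. The function names and the sample agent IDs are hypothetical, and the 15% drop is interpreted here as relative to baseline, which is an assumption the text does not settle.

```python
from typing import Iterable, Mapping, Tuple


def routing_accuracy_rate(assignments: Iterable[Tuple[str, str]],
                          optimal_agent: Mapping[str, str]) -> float:
    """Percentage of (task_class, agent_id) assignments matching the labeled optimum."""
    assignments = list(assignments)
    if not assignments:
        raise ValueError("no assignments in window")
    correct = sum(
        1 for task_class, agent in assignments
        if optimal_agent.get(task_class) == agent
    )
    return 100.0 * correct / len(assignments)


def rar_drifted(current: float, baseline: float,
                max_relative_drop: float = 0.15) -> bool:
    """True when RAR has fallen more than 15% relative to its baseline."""
    return current < baseline * (1 - max_relative_drop)


# Hypothetical labeled evaluation set and observed routing decisions.
labeled = {"retrieval": "retriever-1", "reasoning": "planner-1"}
observed = [
    ("retrieval", "retriever-1"),
    ("reasoning", "planner-1"),
    ("reasoning", "retriever-1"),  # misrouted
    ("retrieval", "retriever-1"),
]
rar = routing_accuracy_rate(observed, labeled)
```

A task class missing from the labeled set counts as a misroute here, which surfaces the common cause named above: a new task class added without updated routing rules.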

2. Retry Storm Index (RSI)

The ratio of retry-generated tool calls to primary-invocation tool calls across the fleet in a rolling window.

RSI(t, w) = retry_tool_calls / primary_tool_calls

Normal RSI baseline is typically 0.05–0.15 (retry calls amount to 5–15% of primary calls). RSI > 0.50 indicates retry storm conditions. RSI > 1.0 means more retry traffic than primary traffic: the control plane is in a positive feedback loop.
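
The ratio and its bands translate directly into code. In this sketch the function names are hypothetical, and the "elevated" band between the normal range and the storm threshold is a label added for illustration rather than one defined in the text.

```python
def retry_storm_index(retry_tool_calls: int, primary_tool_calls: int) -> float:
    """Retry-generated tool calls divided by primary-invocation tool calls."""
    if primary_tool_calls <= 0:
        raise ValueError("need at least one primary tool call in the window")
    return retry_tool_calls / primary_tool_calls


def classify_rsi(rsi: float) -> str:
    if rsi > 1.0:
        return "feedback-loop"  # more retry traffic than primary traffic
    if rsi > 0.50:
        return "retry-storm"
    if rsi > 0.15:
        return "elevated"       # assumed band, not named in the text
    return "normal"


quiet = classify_rsi(retry_storm_index(30, 300))
storm = classify_rsi(retry_storm_index(180, 300))
runaway = classify_rsi(retry_storm_index(400, 300))
```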

3. Decomposition Completeness Score (DCS)

The percentage of decomposed subtask sets that, when executed, produce outputs covering all requirements of the original task.

DCS requires a completeness validator per task class.

This is the hardest to instrument — it requires semantic understanding of task requirements. Start with a rule-based validator for your highest-volume task classes before attempting ML-based validation.
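
A rule-based validator can start as a set-coverage check. The sketch below assumes each subtask output is tagged with the requirements it satisfies; the `REQUIRED` rule set, the tags, and the function names are all hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical rule set: requirement tags each task class must cover.
REQUIRED = {
    "incident-analysis": {"timeline", "root-cause", "remediation"},
}


def is_complete(task_class: str, subtask_tags: Iterable[Set[str]]) -> bool:
    """True when the union of subtask tags covers every required tag."""
    covered = set().union(*subtask_tags)
    return REQUIRED[task_class] <= covered


def decomposition_completeness_score(results: List[bool]) -> float:
    """Percentage of decompositions whose outputs covered all requirements."""
    if not results:
        raise ValueError("no decompositions in window")
    return 100.0 * sum(results) / len(results)


runs = [
    [{"timeline"}, {"root-cause"}, {"remediation"}],        # complete
    [{"timeline"}, {"root-cause"}],                         # missing remediation
    [{"timeline", "root-cause"}, {"remediation", "comms"}],  # complete, with extra
    [{"remediation"}],                                      # incomplete
]
dcs = decomposition_completeness_score(
    [is_complete("incident-analysis", r) for r in runs]
)
```

This only detects missing coverage; redundant or contradictory subtasks need additional rules.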

The Control Plane Governance Model

Separate SLO Ownership

The control plane is not owned by the same person who owns the agents. It is a separate system with a separate error budget. The control plane SLO owner:

Is paged when RAR drops >15% from baseline
Is paged when RSI exceeds 0.50 for 10+ minutes
Owns the retry storm runbook
Reviews control plane decomposition logic on every new task class addition

The Retry Storm Runbook (minimum viable version)

Every production control plane needs this runbook before launch:

Detection: RSI > 0.50 sustained 10 minutes → page control plane owner
Immediate action: Reduce control plane retry limit from default (3) to 1
Circuit breaking: Identify failing agents via fleet-wide TIE spike attribution. Apply a circuit breaker to each one (open when its semantic validation rate falls below 85%)
Recovery: Restore retry limit only after RSI returns to < 0.20 for 15 consecutive minutes
Postmortem trigger: Any RSI > 1.0 event requires a postmortem within 48 hours
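
The detection, immediate-action, and recovery steps above can be sketched as a small decision function. This assumes one RSI sample per minute; the function names are hypothetical, and the circuit-breaking step is left out because it depends on per-agent attribution.

```python
from typing import Sequence

DEFAULT_RETRY_LIMIT = 3


def next_retry_limit(rsi_per_minute: Sequence[float], current_limit: int) -> int:
    """Apply the runbook's detection and recovery thresholds to RSI samples."""
    last10 = rsi_per_minute[-10:]
    last15 = rsi_per_minute[-15:]
    if len(last10) == 10 and all(r > 0.50 for r in last10):
        return 1  # storm sustained for 10 minutes: clamp retries
    if (current_limit < DEFAULT_RETRY_LIMIT
            and len(last15) == 15
            and all(r < 0.20 for r in last15)):
        return DEFAULT_RETRY_LIMIT  # calm for 15 minutes: restore
    return current_limit


def needs_postmortem(rsi_per_minute: Sequence[float]) -> bool:
    """Any sample above 1.0 triggers the 48-hour postmortem requirement."""
    return any(r > 1.0 for r in rsi_per_minute)


storm = [0.6] * 10
calm = [0.1] * 15
recovering = [0.6] * 10 + [0.1] * 5
```

The asymmetry is deliberate: ten bad minutes clamp the limit, but fifteen good minutes are needed to restore it, so a flapping fleet stays clamped.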

Control Plane Version Governance

Apply the same framework upgrade governance to control plane versions as to agent framework versions: snapshot RAR, RSI, and DCS baselines before any control plane update. Run shadow traffic. Block promotion if any metric drifts beyond threshold.

Implementation on AWS

The three control plane SLIs instrument naturally on Bedrock's orchestration layer:

RAR: Evaluate routing decisions by comparing agentId in Bedrock orchestration traces against a task-class-to-optimal-agent mapping in DynamoDB
RSI: Count RETRY events vs INVOKE events in Bedrock CloudWatch logs, published as a ratio metric per 5-minute window
DCS: Lambda validator comparing subtask outputs against original task requirements, triggered by task completion events via EventBridge

Full implementation is in the agentsre library: https://github.com/Ajay150313/agentsre

Connecting the Arc

This is the fourth layer of the AI-SRE reliability framework:

Single-agent SLIs (DQR, TIE, HER, AQDD)
A2A semantic boundary validation + circuit breaker
Agent Sprawl governance (fleet inventory, framework canary, deprecation alerting)
Agent Control Plane SLIs (RAR, RSI, DCS) — this article

Each layer adds governance to the next abstraction level of the same infrastructure problem: autonomous AI operating in production without adequate reliability discipline.

LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-controlplane-share-7457213748500475904-yi9g?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What's the biggest control plane reliability gap in your environment?

Agent Sprawl is Your Next Production Incident: An SRE Response to Datadog's State of AI Engineering 2026

Ajay Devineni — Fri, 01 May 2026 01:20:45 +0000

Datadog published the State of AI Engineering 2026 report this week — real telemetry from over a thousand production environments. Read it. It is the most comprehensive look at AI in production available right now.

I want to respond from the reliability engineering perspective, because the data reveals a problem the report names but doesn't fully resolve: agent sprawl is now a production reliability crisis, and the SRE discipline does not yet have governance frameworks for it.

What the Data Shows

Three findings stand out from an SRE perspective:

Framework adoption doubled year over year. LangChain, LangGraph, Pydantic AI, Vercel AI SDK — up from 9% of organizations in early 2025 to nearly 18% by 2026. Services using agentic frameworks: more than doubled.

70%+ of organizations run three or more models. The share running more than six models nearly doubled. Teams are building model portfolios rather than committing to a single provider.

Teams add models faster than they retire them. Datadog calls this "LLM tech debt." Each overlapping model introduces its own quality, latency, and cost profile. The report is explicit: this becomes a governance problem.

These three findings combine to describe an environment growing faster than it can be governed. I call this Agent Sprawl.

Defining Agent Sprawl

Agent Sprawl — the condition where AI agent infrastructure complexity (frameworks, models, tool layers, orchestration patterns) grows faster than your ability to measure and govern its reliability.

It is structurally identical to the microservices sprawl problem SRE teams faced between 2015 and 2020. Teams added services faster than they added SLOs. The result: production incidents nobody could attribute because the dependency graph was too complex to observe.

Agent Sprawl has three specific manifestations:

1. Framework-Invisible Call Complexity

When you add LangChain, LangGraph, or any orchestration framework, it adds steps and paths you did not write — retry logic, fallback handlers, context window management, tool routing. All of this happens between your application code and your observability layer.

Your SLIs measure at the application boundary. Framework-added calls are invisible.

This means your Tool Invocation Efficiency (TIE) baseline — tool calls per task completion — is measuring a mix of your agent's behavior and your framework's behavior. When you upgrade the framework, both change simultaneously. You cannot separate them.

In practice, across regulated production environments I've studied: TIE baselines can drift 30–40% after a framework major version upgrade with no corresponding change in the agent's task logic. The baseline shift looks like agent degradation. It's actually framework overhead. Teams spend hours on a false RCA.

The fix: Instrument at the framework output layer, not the application layer. Capture tool invocations after framework processing. Then freeze your TIE baseline before any upgrade and compare shadow traffic before promoting.

2. Multi-Model SLO Orphaning

If 70% of organizations run three or more models and SLOs are set only for the first, then 70% have at least two SLO ownership gaps they haven't acknowledged.

SLOs are set once — typically when the first model is deployed. As models 2, 3, 4, 5, 6 are added for specific task classes, latency profiles, or cost tiers, nobody revisits the SLO ownership model. Models run in production with no named owner, no baseline, no error budget.

When model 3 degrades, there is no owner to page, no baseline to compare against, no runbook to execute. The degradation surfaces as a customer complaint, not an alert.

The fix: Treat every model in your fleet like a microservice. Each model gets: a named owner (not a team — a person), a task-class-specific SLO, and a 30-day observation baseline before the SLO is enforced.

3. LLM Tech Debt as a Reliability Liability

Deprecated models running in agent chains create silent compatibility risks. When a provider announces deprecation, teams with models buried inside multi-step chains often miss the migration window. The model ages. Safety training falls behind. Decision Quality Rate declines slowly — too slowly to trigger a threshold alert — until accumulated drift surfaces as a production incident.

The fix: Treat model deprecation notices the same way you treat dependency CVEs. Automate alerts at 60, 30, and 7 days before end-of-life. Build the migration ticket at announcement time, not at expiry.
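
The 60/30/7-day schedule is easy to automate with a daily job. Here is a minimal sketch; `due_alerts` is a hypothetical helper and the end-of-life date is illustrative.

```python
from datetime import date
from typing import FrozenSet, List

ALERT_DAYS = (60, 30, 7)


def due_alerts(end_of_life: date, today: date,
               already_sent: FrozenSet[int] = frozenset()) -> List[int]:
    """Alert thresholds (in days) that have been crossed but not yet sent."""
    days_left = (end_of_life - today).days
    return [d for d in ALERT_DAYS if days_left <= d and d not in already_sent]


eol = date(2027, 6, 1)  # illustrative end-of-life date
first = due_alerts(eol, today=date(2027, 4, 2))   # exactly 60 days out
later = due_alerts(eol, today=date(2027, 5, 10),  # 22 days out
                   already_sent=frozenset({60}))
```

Tracking `already_sent` means a job that misses a day still fires the alert on its next run instead of skipping the threshold.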

The Governance Framework Agent Sprawl Needs

The Agent Fleet Inventory

Before you can govern sprawl, you need to know what you're governing. Maintain a living inventory with, for each component: framework and version, model(s) used, task classes handled, named SLO owner, current TIE/DQR baselines, and deprecation dates.

from agentsre.sprawl import AgentFleetInventory, FleetComponent, ComponentType

inventory = AgentFleetInventory()
inventory.register(FleetComponent(
    component_id="anthropic.claude-sonnet-4-6",
    component_type=ComponentType.MODEL,
    agent_id="payment-processor",
    task_classes=["payment-routing", "fraud-detection"],
    slo_owner="owner@team.com",                    # named human — not a team
    baseline_established_at="2026-04-01",
    deprecation_date="2027-06-01",
    last_slo_review="2026-04-01",
    current_tie_baseline=2.4,
    current_dqr_baseline=91.2,
))

report = inventory.quarterly_review_report()
print(f"Fleet governance score: {report['fleet_governance_score']}/100")

Framework Version Governance — Canary Before Promotion

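# UpgradeDecision is assumed to be exported from agentsre.sprawl; adjust the import if your version differs.
from agentsre.sprawl import FrameworkVersionGovernance, UpgradeDecision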

gov = FrameworkVersionGovernance(
    tie_drift_threshold=1.15,   # block if TIE drifts >15%
    dqr_drift_threshold=0.85,   # block if DQR drops >15%
    min_shadow_samples=50,
)

# Before upgrade: snapshot production baseline
gov.snapshot_baseline(
    agent_id="payment-processor",
    task_class="payment-routing",
    framework_version="langchain-0.2.x",
    tie_values=production_tie_samples,
    dqr_values=production_dqr_samples,
)

# After 48hrs shadow traffic:
result = gov.evaluate_upgrade(
    agent_id="payment-processor",
    task_class="payment-routing",
    production_version="langchain-0.2.x",
    shadow_version="langchain-0.3.x",
)

if result.decision == UpgradeDecision.BLOCK:
    rollback()   # framework added hidden overhead — don't promote

The Quarterly Multi-Model SLO Review

The review should take 30–60 minutes per quarter. For every model in fleet:

Verify named owner exists
Verify baseline is current (< 90 days old)
Check deprecation schedule against provider announcements
Review TIE per-model — models with rising TIE relative to task class baseline are drifting

Models scoring below 70 on the governance health score are flagged as governance debt requiring a 30-day remediation window.

The Datadog Report's Implicit Challenge

The State of AI Engineering 2026 describes an industry in rapid expansion. What it does not fully resolve is the SRE question: who governs all of this, and what does that look like in practice?

The SRE community has solved exactly this class of problem before — in distributed systems, in microservices, in cloud infrastructure. The discipline already exists. It needs to be applied to the AI agent layer now, before agent sprawl becomes agent chaos.

The Datadog data tells us the window is closing. Framework adoption doubles in a year. Multi-model fleets become the norm. Model debt accumulates.

Build the governance layer before the production incidents start.

Open-source implementation: https://github.com/Ajay150313/agentsre
LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-reliability-ugcPost-7455786901673902080-BCRM?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What's your biggest agent sprawl challenge right now?

RAG vs MCP is the wrong debate — here's the right framing for production AI systems

Ajay Devineni — Tue, 28 Apr 2026 21:13:56 +0000

The question I keep seeing in every AI engineering forum right now:

"Should we use RAG or MCP?"

It's the wrong question. And the fact that it's being asked at all tells me the field hasn't yet settled on a shared mental model for agentic AI architecture.

Here's the framing I use — and why getting this wrong has real production consequences.

RAG and MCP operate at different layers

RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol) are not alternatives. They are not competitors. They solve fundamentally different problems in an agentic system.

Think of it this way:

RAG answers: what does the agent know?
MCP answers: what can the agent do?

One is a knowledge pattern. The other is an execution protocol. Comparing them is a category error — like asking whether you should use a database or an API. The answer is almost always: both, at the right layer.

What RAG actually is (and isn't)

RAG is a memory pattern. Before the model reasons, you fill its context window with relevant information retrieved from an external store — documents, knowledge bases, runbooks, historical data.

RAG is appropriate when:

The agent needs domain knowledge that isn't in its training data
The information is relatively stable (changes on the order of days or weeks, not seconds)
The query is about "what do we know" not "what is happening right now"

RAG is not appropriate when:

The agent needs to know the current state of a live system
The information changes faster than your retrieval pipeline can refresh
The agent needs to take an action, not just retrieve information

This last point is where teams get into trouble. Embedding stale infrastructure docs into a RAG pipeline and treating them as a substitute for live system data is one of the most common architecture mistakes I see in agentic AI deployments.

What MCP actually is (and isn't)

MCP is an execution protocol. It gives agents the ability to invoke tools, call external APIs, read live system state, and take actions in the world — all in a standardized, auditable way.

MCP is appropriate when:

The agent needs to act, not just reason
The information required is live — current system state, real-time data, dynamic context
You need auditability of what the agent did and why (decision lineage)

MCP is not appropriate as a substitute for knowledge retrieval. Routing every context-building query through a live MCP tool call introduces unnecessary latency, increases blast radius surface area, and creates tool dependency chains that are hard to reason about under failure.

The production architecture that actually works

RAG and MCP compose. They don't compete. Here is the pattern I recommend for agentic systems that need both knowledge and action:

User goal / trigger
       |
       v
RAG retrieval layer
  - Fetch relevant runbook sections
  - Fetch historical incident context
  - Fetch policy and compliance docs
       |
       v
Agent reasoning
  - Synthesize retrieved context
  - Classify decision (blast radius tier)
  - Determine required action
       |
       v
MCP execution layer
  - Invoke appropriate tool
  - Apply validation gates (LOW / HIGH / CRITICAL)
  - Emit decision lineage trace
  - Execute or route for human review

The boundary between RAG and MCP is the boundary between knowing and doing. Design it intentionally.
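
To make the knowing/doing boundary concrete, here is a minimal sketch of the three stages. Everything in it is hypothetical: `retrieve` is a keyword lookup standing in for a real vector store, `execute` is a stub rather than an MCP client, and routing both HIGH and CRITICAL tiers to human review is an assumption about how the validation gates are configured.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Decision:
    action: str
    tier: str                      # "LOW", "HIGH", or "CRITICAL"
    context_docs: List[str] = field(default_factory=list)


def retrieve(goal: str, store: Dict[str, str]) -> List[str]:
    """RAG layer stand-in: return doc ids whose text mentions a word in the goal."""
    words = goal.lower().split()
    return [doc_id for doc_id, text in store.items()
            if any(w in text.lower() for w in words)]


def execute(decision: Decision, tools: Dict[str, Callable[[], str]],
            lineage: List[dict]) -> str:
    """MCP layer stand-in: gate on tier, record lineage, then act or escalate."""
    gated = decision.tier in ("HIGH", "CRITICAL")
    lineage.append({"action": decision.action, "tier": decision.tier,
                    "retrieved": decision.context_docs, "routed_to_human": gated})
    if gated:
        return "pending_human_review"
    return tools[decision.action]()
```

The agent reasoning step is whatever code builds the `Decision` between the two calls. The point of the sketch is that retrieved document ids travel with the decision into the lineage record, so both layers show up in one trace.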

The SRE reliability implications

From a reliability engineering perspective, conflating RAG and MCP creates two distinct failure modes:

Failure mode 1: using RAG where MCP belongs
The agent makes decisions based on stale retrieved data about a live system. The information looked correct at retrieval time. By execution time, the system state has changed. The agent acts on a false picture of reality.

This is particularly dangerous in infrastructure automation, where a runbook that was accurate six months ago may describe a system that no longer exists in that form.

Failure mode 2: using MCP where RAG belongs
Every knowledge query goes through a live tool call. Latency climbs. Tool dependencies multiply. Each MCP call is a potential blast radius event. The agent becomes slow, brittle, and expensive to operate — not because it's doing more, but because it's routing the wrong workload through the wrong layer.

The SLO implications

If you've read my previous posts on agentic SLO design, this connects directly. Your SLOs need to be aware of which layer a failure occurred in:

A RAG retrieval failure (stale data, embedding drift, retrieval miss) has different blast radius than an MCP execution failure (wrong tool invoked, action taken on bad context).
Human Escalation Rate (HER) needs to be segmented by failure layer. Rising HER from RAG staleness looks different from rising HER from MCP tool errors — and the runbook responses are completely different.
Decision lineage traces should capture which documents were retrieved via RAG and which tool calls were made via MCP, so post-mortems can identify which layer caused a bad decision.
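
Segmenting HER by failure layer can be sketched as follows. This assumes each human override has already been attributed to a layer during review; the function name and the `"rag"` / `"mcp"` labels are illustrative, not part of any existing library.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def her_by_layer(decisions: Iterable[Tuple[bool, Optional[str]]]) -> Dict[str, float]:
    """decisions: (human_override, failure_layer) pairs; layer is "rag", "mcp", or None.

    Returns HER as a percentage of all decisions, split by the layer blamed
    for each override.
    """
    total = 0
    overrides = Counter()
    for overridden, layer in decisions:
        total += 1
        if overridden:
            overrides[layer or "unattributed"] += 1
    if total == 0:
        return {}
    return {layer: 100.0 * n / total for layer, n in overrides.items()}
```

Because every segment shares the same denominator, the per-layer values add up to the overall HER, which keeps the segmented view consistent with the headline metric.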

The decision framework

Before your team debates RAG vs MCP, answer these questions:

Is the agent retrieving knowledge or taking action? Knowledge → RAG. Action → MCP.
How fast does the information change? Stable → RAG. Live → MCP.
What is the blast radius if this goes wrong? High blast radius operations belong behind MCP validation gates regardless of how the context was retrieved.
Do you need an audit trail? MCP gives you decision lineage natively. RAG retrieval should be logged separately and linked to the agent's reasoning trace.

Closing thought

The RAG vs MCP debate is a sign that the field is still building its shared vocabulary for agentic AI architecture. That's fine — this is early. But the teams shipping production agents today can't wait for consensus.

Design the boundary between knowing and doing intentionally. SLO it separately. Trace both layers in your observability stack.

The question isn't which one to use. It's whether you've thought carefully about where each one belongs.

This post is part of an ongoing series on AI-SRE: applying production reliability engineering principles to agentic AI systems.

SLO design for agentic AI systems — beyond uptime metrics
MCP decision-lineage observability in production
Human Escalation Rate (HER) as a reliability signal for agentic systems

https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-rag-share-7454971617409150976--nbK?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet

Ajay Devineni — Thu, 23 Apr 2026 18:51:21 +0000

Google A2A Protocol turned one year old on April 9, 2026. Over 150 organizations are running it in production. It is live inside Amazon Bedrock AgentCore and Azure AI Foundry. IBM's competing Agent Communication Protocol merged into A2A rather than fight it. The Linux Foundation now governs the spec.

The protocol is production-grade. The reliability engineering discipline for it is not.

I have spent the past year building SRE frameworks for single-agent + MCP deployments in regulated financial services environments. When A2A entered the picture, I realized the failure surface I had been managing had changed completely. This article documents the new failure modes A2A introduces and the SRE patterns I believe are required to manage them.

The Two-Layer Stack and Why It Changes Everything

MCP and A2A solve different problems at different layers of the agent stack. This is well understood by now. What is not yet well understood is what the two-layer combination means for reliability engineering.

MCP (Model Context Protocol): the vertical layer. An agent connects to tools and data sources. The failure modes are familiar to any distributed systems engineer: tool unavailability, degraded response quality, latency spikes, authentication failures. The blast radius is bounded. One agent, one tool layer, one error budget.

A2A (Agent-to-Agent Protocol): the horizontal layer. Agents communicate with other agents across organizational and platform boundaries. An orchestrator agent delegates subtasks to specialist agents via JSON-RPC over HTTP. Those specialist agents may be built by different teams, running on different vendors, governed by different SLOs.

The reliability engineering challenge A2A creates is not technical — the protocol itself is well-designed. It is organizational and observational. When an orchestrator agent delegates to a sub-agent via A2A, and that sub-agent fails silently, who carries the error budget? How do you instrument the boundary? What does safe degradation look like when an entire reasoning capability disappears because a downstream agent is unavailable?

These questions have no consensus answers yet. This article is my attempt to start building them.

The A2A Failure Mode Taxonomy

After studying multi-agent failure patterns across production deployments, I categorize A2A-specific failures into four classes. The first two are detectable with existing tooling. The last two are not.

Class 1: Sub-Agent Unavailability

The downstream agent returns a 503 or connection timeout. This is the easiest failure to handle — it looks like a standard HTTP failure and can be caught by existing circuit breaker patterns. Your orchestrator agent should treat sub-agent unavailability exactly as it treats MCP tool unavailability: fall back to a degraded capability or route to a human escalation path.

Instrumentation: standard HTTP error rate monitoring at the A2A client layer.

Class 2: Sub-Agent Latency Degradation

The downstream agent responds, but slowly. In a multi-agent chain (Agent A → Agent B → Agent C), latency accumulates across hops. A 2-second degradation at Agent C adds at least 2 seconds to Agent A's response time, and if each hop degrades by a similar amount or retries on timeout, the orchestrator can end up 6 seconds slower or worse. Users experience this as the orchestrator being slow, but the problem is buried at the far end of the chain.

Instrumentation: distributed tracing across A2A boundaries. Each A2A task invocation should carry a trace ID propagated from the orchestrator. Without this, your latency SLI for the orchestrator tells you nothing useful about where the latency is originating.

Class 3: Silent Task Result Corruption — ⚠️ Not detectable with standard tooling

The downstream agent returns HTTP 200 with a syntactically valid A2A task result, but the result is semantically wrong — incomplete reasoning, missing context fields, hallucinated data treated as factual output. The orchestrator agent receives this as a successful response and incorporates it into its own output.

Your error rate SLI stays at zero. Your latency SLI stays normal. Your user receives incorrect output from a system that reported 100% success.

This is the failure mode that existing observability stacks cannot detect. It requires what I call an A2A Semantic Boundary Validator — a lightweight evaluation function that runs at the A2A client layer on every incoming task result, checking the result against expected behavioral bounds for that sub-agent's task class.

The implementation pattern mirrors my Decision Quality Rate (DQR) SLI for single-agent systems: maintain a behavioral baseline per sub-agent per task class, and flag results that fall outside expected bounds as potential corruptions before they propagate upstream.

Class 4: Cascading Autonomy Amplification — ⚠️ The most dangerous failure mode

Agent A delegates to Agent B. Agent B, uncertain about the task, makes additional autonomous decisions to resolve the ambiguity — invoking more MCP tools than its baseline, delegating further to Agent C, modifying its task interpretation. Agent C does the same.

By the time a result returns to Agent A, the original task intent has been substantially transformed by a chain of autonomous interpretations — none of which were visible to the orchestrator, none of which crossed any error threshold, and none of which can be reconstructed without end-to-end decision lineage capture.

This failure mode is unique to multi-agent systems. Single-agent + MCP deployments cannot produce it. It requires agents talking to agents, each adding their own layer of autonomous interpretation to a task that was never explicitly respecified.

The SRE Framework for A2A: Five Additions to Your Existing Stack

If you have followed my previous work on SLOs for agentic AI, you already have Decision Quality Rate, Tool Invocation Efficiency, and Human Escalation Rate instrumented for your single-agent deployments. A2A requires five additional capabilities on top of that foundation.

1. A2A Boundary Tracing

Every A2A task delegation must carry a distributed trace ID originating from the orchestrator. This is not optional — without it, you cannot attribute latency, errors, or behavioral drift to the correct agent in a multi-agent chain.

Implementation: Propagate an x-trace-id header on every A2A HTTP request. Store the full delegation tree (which agent delegated to which, with what task parameters, at what timestamp) in your centralized trace store. On AWS, I use X-Ray for the distributed trace and a DynamoDB table for the delegation tree: X-Ray captures the HTTP-level trace, and DynamoDB captures the semantic-level task delegation structure.
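
A minimal sketch of the propagation and delegation-tree pieces, using only the standard library. The in-memory list stands in for the trace store, and the function names and edge fields are assumptions for illustration; nothing here calls X-Ray or DynamoDB.

```python
import time
import uuid
from typing import Dict, List, Optional


def delegation_headers(incoming: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    trace_id = (incoming or {}).get("x-trace-id") or uuid.uuid4().hex
    return {"x-trace-id": trace_id}


def record_delegation(tree: List[dict], trace_id: str, parent: str,
                      child: str, task: dict) -> dict:
    """Append one edge of the delegation tree (stand-in for a trace-store write)."""
    edge = {"trace_id": trace_id, "from": parent, "to": child,
            "task": task, "ts": time.time()}
    tree.append(edge)
    return edge
```

Each sub-agent calls `delegation_headers` with the headers it received, so every hop in the chain writes edges under the same trace ID and the full tree can be reassembled with a single lookup.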

2. Per-Sub-Agent SLO Ownership

Every A2A sub-agent your orchestrator calls must have a designated SLO owner — a named human or team who is paged when that sub-agent's reliability degrades. In practice, this means:

For internal sub-agents: assign SLO ownership the same way you assign ownership to microservices
For external/third-party sub-agents: define a sub-agent reliability budget. If a third-party A2A agent degrades, your orchestrator should treat it as a dependency failure and activate your degraded-mode runbook — not wait for the vendor to page you

The org chart question — who owns the SLO when agents from different vendors collaborate via A2A — is the most important unresolved governance question in multi-agent reliability today.

3. A2A Semantic Boundary Validation

For each sub-agent your orchestrator calls, define the expected output schema and behavioral bounds. Implement a validator function that runs on every incoming A2A task result before the orchestrator acts on it.

Minimum validation layer:

Schema validation: does the result match the expected A2A task result structure?
Completeness check: are required fields populated?
Behavioral bound check: does the result fall within the baseline distribution for this sub-agent's task class?

Results that fail validation should not be silently dropped — they should trigger your escalation path and log the full task context for postmortem analysis.
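
The three checks can be sketched as a single function. This is an illustration under simplifying assumptions: the result is treated as a plain dict rather than the actual A2A task result structure, and the behavioral baseline is reduced to a (low, high) range per numeric field instead of a full distribution.

```python
from typing import Any, Dict, List, Tuple


def validate_task_result(result: Dict[str, Any],
                         required_fields: List[str],
                         bounds: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return validation failures for one sub-agent result (empty list = pass)."""
    failures = []
    # Schema check: the result must be a dict with every expected key present.
    if not isinstance(result, dict):
        return ["schema: result is not an object"]
    for name in required_fields:
        if name not in result:
            failures.append(f"schema: missing field '{name}'")
        elif result[name] in (None, "", [], {}):
            # Completeness check: the key exists but carries no value.
            failures.append(f"completeness: empty field '{name}'")
    # Behavioral bound check: numeric signals must sit inside the baseline range.
    for name, (low, high) in bounds.items():
        value = result.get(name)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            failures.append(f"bounds: '{name}'={value} outside [{low}, {high}]")
    return failures
```

A non-empty return value is what should feed the escalation path and the postmortem log, along with the full task context.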

4. The Agent Chain Circuit Breaker

In traditional microservices, a circuit breaker opens when downstream failure rate exceeds a threshold, preventing cascade failures. Multi-agent systems need an equivalent pattern, adapted for the non-deterministic nature of agent communication.

My implementation: an agent chain circuit breaker that tracks the running success rate of each A2A sub-agent invocation over a 15-minute rolling window. When the validated success rate drops below 85% (accounting for semantic validation failures, not just HTTP errors), the circuit opens and the orchestrator routes that task class to a degraded-mode handler — typically a simplified version of the task that can be completed with MCP tools the orchestrator controls directly, or an immediate human escalation.
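
A minimal sketch of that breaker, assuming each invocation has already been classified as a validated success or not. The class name and the `min_samples` guard are my own additions for illustration, and the sketch omits the half-open probing state a production breaker would normally have.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class AgentChainCircuitBreaker:
    """Opens when the validated success rate over a rolling window drops below a floor."""

    def __init__(self, window_seconds: float = 900.0, min_success_rate: float = 0.85,
                 min_samples: int = 20):
        self.window_seconds = window_seconds
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples  # assumption: avoid tripping on tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, validated_ok: bool, now: Optional[float] = None) -> None:
        """Record one invocation; validated_ok means HTTP success AND semantic validation passed."""
        self._events.append((time.time() if now is None else now, validated_ok))

    def is_open(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Drop events that have aged out of the rolling window.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        if len(self._events) < self.min_samples:
            return False
        successes = sum(1 for _, ok in self._events if ok)
        return successes / len(self._events) < self.min_success_rate
```

The orchestrator would keep one breaker per sub-agent and task class, and check `is_open()` before delegating; when it returns `True`, the task goes to the degraded-mode handler instead.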

5. End-to-End Decision Lineage for Multi-Agent Chains

In single-agent systems, decision lineage is the record of what tools an agent invoked and what reasoning it applied. In A2A multi-agent systems, decision lineage must span the entire delegation chain — capturing not just what the orchestrator decided, but what each sub-agent decided on its behalf.

This is the audit trail that supports SOC 2 Type II evidence requirements for autonomous decision-making in regulated environments. Without it, you cannot demonstrate to auditors that you have oversight of decisions made by agents you deployed but didn't directly control.

Implementation: each A2A task result must include a decision_lineage field containing the sub-agent's tool invocations, reasoning path, and confidence metadata. The orchestrator appends this to its own decision lineage before logging the full chain to the immutable audit store.
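
The append step can be sketched like this. The `decision_lineage` field name comes from the text above; the entry shape, the function name, and the choice to record a missing lineage as its own entry are assumptions for illustration.

```python
from typing import Any, Dict, List


def merge_lineage(orchestrator_lineage: List[Dict[str, Any]],
                  agent_id: str,
                  task_result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a new lineage list with the sub-agent's entries appended in order."""
    sub_lineage = task_result.get("decision_lineage")
    if sub_lineage is None:
        # A missing lineage field is itself recorded, so the gap is auditable.
        sub_entries = [{"agent": agent_id, "lineage_missing": True}]
    else:
        # Entries that already name a deeper agent keep that attribution.
        sub_entries = [{"agent": agent_id, **entry} for entry in sub_lineage]
    return orchestrator_lineage + sub_entries
```

Returning a new list instead of mutating the input keeps the orchestrator's own record unchanged until the merged chain is written to the audit store.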

The Organizational Question A2A Forces

Every SRE framework I've described above requires answers to an organizational question the industry hasn't resolved:

When an orchestrator agent delegates to a third-party sub-agent via A2A, and the sub-agent produces output that causes downstream harm — who is operationally responsible?

This is not a legal question (yet). It is an operational ownership question that every multi-agent team will face in 2026.

My position: the orchestrator owner carries responsibility for validating and acting on sub-agent output. The A2A protocol handles communication. It does not handle accountability. An orchestrator that blindly trusts A2A task results without semantic validation is the operational equivalent of a service that never validates the responses it gets from its dependencies: it should not exist in any production-grade form.

Build the semantic boundary validation. Own the chain.

Where to Start

If you are moving from single-agent + MCP to multi-agent + A2A, I recommend this progression:

Week 1: Implement A2A boundary tracing with distributed trace ID propagation. You cannot debug what you cannot trace.

Week 2: Assign explicit SLO ownership to every A2A sub-agent your orchestrator calls. Even a spreadsheet with named owners is better than none.

Week 3-4: Build the semantic boundary validator for your highest-volume A2A task class. Start with schema and completeness validation before attempting behavioral bound checks.

Month 2: Instrument the agent chain circuit breaker. Set your initial threshold conservatively (85% validated success rate) and adjust based on 30 days of baseline data.

Month 3+: Build end-to-end decision lineage capture. This is the hardest piece and the most important for regulated environments.

Connecting the Arc

This article is part of a series on applying SRE discipline to agentic AI in production:

Why SRE Principles Are the Missing Layer in MCP Security
SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing (published this week)
A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet (this article)

I shared the core argument on LinkedIn: https://www.linkedin.com/posts/ajay-devineni_agenticai-a2a-mcp-share-7453145380822605824-pMta?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

The SRE community spent a decade learning to run distributed microservices reliably. We're at day one for multi-agent systems with A2A. The failure modes are different. The organizational questions are harder. The instrumentation doesn't exist yet.

Build it now — before your agent chains are running at a scale where these gaps become production incidents.

SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

Ajay Devineni — Tue, 21 Apr 2026 18:05:55 +0000

The problem with applying traditional SLOs to AI agents

SLOs work beautifully when "good" is observable.

An API either returns 200 or it doesn't. Latency is measurable. Availability is binary. You instrument, you baseline, you commit to a number, and you burn down an error budget when reality diverges.

AI agents break every one of these assumptions.

After a quarter of running agentic systems against production infrastructure, here are the three failure modes I keep running into when teams apply traditional SLO thinking to agents.

Failure mode 1: Correctness is not observable at the response layer

A REST service fails loudly. A 500, a timeout, a malformed payload — your existing observability catches it.

An agent can produce a response that:

Parses correctly
Passes schema validation
Triggers no alerts

...and still be wrong in a way that compounds silently for hours.

Traditional error rate SLOs have zero visibility into this. Your dashboards stay green. The blast radius is growing.

What to do instead: Add a behavioral correctness signal. For every agent decision class, define a human-reviewable sample rate and track the delta between agent judgment and human override. That delta is data. It belongs in your SLO.

Failure mode 2: Latency SLOs punish safe agent behavior

A p99 latency SLO makes perfect sense for a stateless service.

It is actively dangerous for an agent.

Agents that pause to verify context, escalate ambiguous decisions to a human, or refuse to act on a poisoned tool output are doing exactly what you want them to do. A latency SLO penalizes them for it.

If you optimize against a latency target, you are implicitly optimizing for speed over safety. In agentic systems, that's how you get silent degradation and runbook violations at 2am.

What to do instead: Track decision latency distribution separately from response latency. Escalation paths should be excluded from latency SLO calculations or governed by a separate, explicitly higher target.
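
One way to keep escalations out of the latency SLI is to compute percentiles per path. A minimal sketch, assuming each sample is tagged with whether the decision was escalated; the function names are hypothetical and the percentile uses the simple nearest-rank method.

```python
import math
from typing import List, Optional, Tuple


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile; p is in (0, 100] and values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


def latency_slis(samples: List[Tuple[float, bool]]) -> dict:
    """samples: (latency_seconds, escalated_to_human) pairs."""
    autonomous = [lat for lat, esc in samples if not esc]
    escalated = [lat for lat, esc in samples if esc]
    p99_auto: Optional[float] = percentile(autonomous, 99) if autonomous else None
    p99_esc: Optional[float] = percentile(escalated, 99) if escalated else None
    return {"autonomous_p99": p99_auto, "escalated_p99": p99_esc}
```

With the two values reported separately, a slow human-review path cannot burn the error budget of the autonomous path, and each can carry its own target.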

Failure mode 3: You cannot commit to a number you haven't earned

This one keeps coming up in conversations with other SRE leads.

Teams instrument an agent, run it for a week, and immediately try to commit to a 99.5% reliability target. Then they burn their error budget in the first real incident because the baseline was built on demo traffic.

Rule I enforce on my team: Minimum 30-day behavioral baseline before any agentic SLO is ratified. No exceptions. The baseline must cover:

Tool failure scenarios
Context window edge cases
At least one simulated prompt drift event
Real production traffic, not synthetic load

You cannot reliability-engineer what you have not yet measured.

What an agentic SLO actually looks like

After iterating on this for a quarter, I'm building agentic SLOs around three signal types that traditional SLOs don't capture:

Signal 1: Human Escalation Rate (HER)

HER = (decisions requiring human override) / (total agent decisions) × 100

This is your canary metric. Rising HER is often the first observable signal of:

Model drift
Context degradation
Prompt decay
Tool output poisoning

Set a threshold. Wire it to your on-call rotation. Page on it.

My current target: HER ≤ 8% over any 24-hour rolling window
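
A rolling-window version of the HER formula can be sketched as below. It assumes decisions are available as timestamped records with an override flag; the function names are illustrative, and the 8% default mirrors the target stated above.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def rolling_her(decisions: List[Tuple[datetime, bool]], now: datetime,
                window: timedelta = timedelta(hours=24)) -> float:
    """HER (%) over decisions timestamped within (now - window, now]."""
    recent = [overridden for ts, overridden in decisions if now - window < ts <= now]
    if not recent:
        return 0.0
    return 100.0 * sum(recent) / len(recent)


def breaches_target(her: float, target_pct: float = 8.0) -> bool:
    """True when HER exceeds the target and the on-call owner should be paged."""
    return her > target_pct
```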

Signal 2: Decision confidence distribution

Don't track a single average confidence score. Track the distribution.

When an agent is operating normally, confidence tends to be bimodal — high confidence on routine decisions, lower on edge cases. When the distribution collapses from bimodal to flat, something has shifted in the agent's environment.

That shift may not produce errors yet. But it will.

My current target: Decision confidence p10 ≥ 0.65

Signal 3: Blast radius exposure rate

BRER = (HIGH + CRITICAL tier changes per hour)

You can have a green error rate and a dangerous blast radius exposure rate at the same time.

This metric captures risk velocity — how fast your agent is accumulating unreversed high-impact changes. It belongs in your SLO alongside uptime.

My current target: CRITICAL tier changes ≤ 2/hour without explicit approval gate
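
Counting BRER per clock hour can be sketched as follows, assuming each change is logged with a timestamp and a tier label. The function name is hypothetical; the tier names follow the LOW / HIGH / CRITICAL scheme used in this series.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def brer_per_hour(changes: Iterable[Tuple[datetime, str]]) -> Dict[datetime, int]:
    """Count HIGH and CRITICAL tier changes in each clock hour."""
    counts: Counter = Counter()
    for ts, tier in changes:
        if tier in ("HIGH", "CRITICAL"):
            counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(counts)
```

Bucketing by clock hour is the simplest option; a sliding 60-minute window would catch bursts that straddle an hour boundary, at the cost of a little more bookkeeping.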

The SLO I'm piloting

agent_slo:
  baseline_period: 30d
  signals:
    human_escalation_rate:
      threshold: "≤ 8%"
      window: "24h rolling"
      alert: page_on_call
    decision_confidence_p10:
      threshold: "≥ 0.65"
      window: "1h rolling"
      alert: warn
    critical_blast_radius_rate:
      threshold: "≤ 2/hour"
      gate: explicit_approval_required
  error_budget:
    calculated_from: [HER, confidence_p10, blast_radius_rate]
    not_from: [uptime, latency]
  review_cadence: weekly_baseline_review
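
One way to read this config is as three independent threshold checks. A minimal sketch, assuming the three signal values have already been computed upstream; the function name is hypothetical, and the returned strings simply echo the `alert` and `gate` values from the YAML.

```python
from typing import List


def evaluate_agent_slo(her_pct: float, confidence_p10: float,
                       critical_changes_per_hour: int) -> List[str]:
    """Return the actions triggered by the piloted thresholds (empty list = healthy)."""
    actions = []
    if her_pct > 8.0:                   # human_escalation_rate: "≤ 8%"
        actions.append("page_on_call")
    if confidence_p10 < 0.65:           # decision_confidence_p10: "≥ 0.65"
        actions.append("warn")
    if critical_changes_per_hour > 2:   # critical_blast_radius_rate: "≤ 2/hour"
        actions.append("explicit_approval_required")
    return actions
```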

The mindset shift

Traditional SLO: Is the system up?

Agentic SLO: Is the system trustworthy?

These are not the same question. Uptime is necessary but not sufficient. An agent can be 100% available and producing wrong decisions at scale.

The SRE community has the tooling, the culture, and the postmortem discipline to solve this. But we have to resist the temptation to copy-paste our existing SLO playbook onto a fundamentally different kind of system.

What's next

In the next post in this series, I'll walk through how I'm wiring these signals into OpenTelemetry alongside the decision-lineage layer from my previous MCP observability write-up — so a single trace can answer both "what happened" and "why the agent decided to do it."

If you're running agentic AI against production infrastructure and have built your own reliability signals, I'd genuinely like to hear what you're measuring. Drop it in the comments.

This post is part of an ongoing series on AI-SRE: applying production reliability engineering principles to agentic AI systems in regulated cloud-native environments.
LinkedIn: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7452416001553567744-BPgq?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

Your AI Agent Doesn't Have a Feature Problem. It Has an On-Call Rotation Problem.

Ajay Devineni — Thu, 16 Apr 2026 16:06:25 +0000

Applying SRE principles to AI agents in production — ownership, observability, SLOs, runbooks, and the kill switch pattern.
I've spent a year closely studying how AI agents fail in the wild — across incidents, postmortems, and real operational patterns — and what I keep noticing is a gap nobody talks about. Teams celebrate capability. Nobody builds operational readiness.
Here's what that gap costs, and how to close it.

The Gap: AI Agents Are Treated Like Features, Not Services
In traditional SRE, every production service has:

✅ A named owner who carries the pager
✅ A defined SLO
✅ An on-call rotation
✅ A runbook
✅ A postmortem process

Most AI agents have a demo video and a Slack channel.
This is a category error. An agent is not a feature. It is an autonomous decision-making service operating at the speed of your automation. When it fails, it doesn't fail quietly like a broken button. It fails at the rate of your automation — and often with external side effects: emails sent, APIs called, records written.

The Failure Nobody Talks About
The failure everyone prepares for is the hard failure: an exception thrown, a timeout, a 500 error. These are easy to catch. CloudWatch alarm, SNS notification, done.
The failure nobody prepares for is the silent degradation.

The agent completes tasks. Dashboards stay green. But for the last 6 hours, its reasoning has been subtly wrong — selecting the wrong tools, misinterpreting scope, producing outputs that look correct and aren't.

This is the worst case. Not failure. Plausible, undetected, incorrect action at scale.
Traditional observability doesn't catch this. You need a new layer.

Introducing HER: Human Escalation Rate
The most useful signal I've seen for agent health is one most teams don't track:
HER = (decisions requiring human override / total decisions) × 100
HER is to AI agents what error rate is to APIs. It tells you whether the agent's judgment is holding up.
Here's a simple implementation:
def publish_her_metric(agent_id: str, human_overrides: int, total_decisions: int) -> float:
    # `metrics`, `THRESHOLD`, and `alert_oncall_owner` are placeholders for your
    # metrics client, alert threshold, and paging hook.
    her = (human_overrides / total_decisions) * 100 if total_decisions > 0 else 0.0

    # Push to your metrics store
    metrics.gauge(
        "agent.human_escalation_rate",
        her,
        tags=[f"agent_id:{agent_id}"],
    )

    # Alert if above threshold
    if her > THRESHOLD:
        alert_oncall_owner(agent_id, her)

    return her

When HER exceeds your threshold, a named human gets paged. Not a team. Not a Slack channel. A person.

Three Requirements Before Any Agent Goes to Production
Based on everything I've observed and learned, here's what I consider non-negotiable.

1. A Named Human Owner Who Gets Paged

The ownership model matters more than the tooling. Every agent must have a named individual who is accountable when HER exceeds threshold. Shared ownership is no ownership. "The AI team owns it" means nobody owns it. Write it down:

agent:
  name: document-processor-v2
  owner: ajay.devineni@company.com
  pager: +1-xxx-xxx-xxxx
  slack_handle: "@ajay"
  escalation_policy: p1-sre-rotation

2. A Runbook That Covers At Least Four Failure Modes

Before any agent ships, a runbook must exist. Minimum coverage:

| Failure mode | What to look for | Immediate action |
| --- | --- | --- |
| Tool failure | Tool error rate spikes | Check dependency health, assess in-flight tasks |
| Context degradation | Output length increases, HER spikes | Inspect conversation history, rollback prompt |
| Prompt drift | Behavioral baseline deviation | Freeze deploys, compare prompt versions |
| Blast radius event | Agent operating outside defined scope | Invoke kill switch, audit side effects |

A runbook doesn't need to be 20 pages. It needs to be right and reachable at 2am.

3. A 30-Day Behavioral Baseline Before Any SLO Is Set

This is the one most teams skip because it feels slow. You cannot commit to reliability you have not measured. Run your agent in shadow mode for 30 days: processing real inputs, generating real outputs, but reviewed before action. During that window, measure everything:

Task completion rate
Human escalation rate (baseline HER)
Tool call accuracy
Decision latency (p50/p95/p99)
Context window utilization
Output quality score variance across identical inputs

Only after 30 days do you write an SLO. The baseline IS the SLO foundation.
# Example SLO written after baseline
agent_slo:
  valid_from: "after-30d-baseline"
  objectives:
    - metric: task_completion_rate
      target: 99.2%
      baseline_observed: 99.6%   # headroom built in intentionally

    - metric: human_escalation_rate
      target: "< 3%"
      baseline_observed: 1.8%
      alert_threshold: 5%

The Kill Switch Pattern
Every production agent needs a kill switch — a mechanism to halt execution immediately, without a code deployment.
class AgentHaltException(Exception):
    """Raised when the kill switch halts an agent mid-run."""


def check_kill_switch(agent_id: str) -> bool:
    """
    Checks a config store for kill switch status.
    Works with SSM Parameter Store, LaunchDarkly,
    or any feature flag system.
    """
    # `config_store` is a placeholder for your config or feature-flag client.
    status = config_store.get(f"agents/{agent_id}/kill-switch")
    return status == "ACTIVE"


def agent_task_loop(agent_id: str, tasks: list):
    for task in tasks:
        # Check before EVERY decision, not just at startup
        if check_kill_switch(agent_id):
            log_halt(agent_id, task)
            raise AgentHaltException("Kill switch active")

        execute(task)

The kill switch should be:

Flippable without a deployment (config store, not code)
Checked before every decision, not just at startup
Audited — log every check and every activation

What the Observability Stack Actually Looks Like
Agent Runtime
│
├──▶ Structured logs (JSON, one entry per decision)
│      └── Fields: session_id, tool_calls, human_override, confidence, latency
│
├──▶ Custom metrics
│      └── HER, tool error rate, context utilization, decision latency
│
├──▶ Distributed traces
│      └── End-to-end: input → LLM → tool calls → output
│
├──▶ Event stream (one event per agent decision)
│      └── Powers alerting rules and downstream audit
│
└──▶ Decision audit log (immutable)
       └── S3 / blob store, retained for postmortem analysis

Every agent decision should emit a structured log entry:
{
  "timestamp": "2025-01-15T14:23:01Z",
  "agent_id": "doc-processor-v2",
  "session_id": "sess_abc123",
  "tools_called": ["search", "summarize"],
  "tool_success": [true, true],
  "human_override": false,
  "context_utilization_pct": 47.1,
  "latency_ms": 3420,
  "task_completed": true
}
This is your audit trail. This is what you bring to a postmortem.

The Postmortem Question Nobody Asks
After an incident with a traditional service, postmortems ask:

What failed?
Why did it fail?
How do we prevent recurrence?

For AI agents, there's a fourth question that almost nobody asks:
Was there a window where the agent was wrong, and we didn't know?
Silent degradation periods are invisible in traditional postmortems because the dashboards were green. Adding a behavioral baseline comparison to every postmortem template forces this question into the open.
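One way to make that comparison concrete, sketched under the assumption that you have a per-interval behavioral metric (for example, tool calls per task) alongside a record of whether any alert was firing. The function name, the sample shape, and the tolerance are illustrative choices.

```python
def silent_degradation_windows(samples, baseline, tolerance):
    """samples: list of (timestamp, value, alert_fired) in time order.

    Returns (start, end) timestamp pairs for runs where the metric was
    outside baseline +/- tolerance while no alert was firing.
    """
    windows = []
    start = end = None
    for ts, value, alert_fired in samples:
        silent_deviation = abs(value - baseline) > tolerance and not alert_fired
        if silent_deviation:
            if start is None:
                start = ts
            end = ts
        elif start is not None:
            windows.append((start, end))
            start = end = None
    if start is not None:
        windows.append((start, end))
    return windows
```

Any non-empty result is a period where the agent was behaving abnormally and the dashboards stayed green, which is the window the fourth question asks about.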

Is Your Agent Production-Ready or Demo-Ready?
The SRE community spent 20 years learning how to operate distributed systems reliably. Those lessons — ownership, observability, SLOs, runbooks, postmortems — weren't invented in conference rooms. They were earned through outages.
AI agents are distributed systems with an additional dimension of unpredictability: they make decisions.
Before your next agent ships, run this checklist:

Named human owner with pager assigned
Runbook covering tool failure, context degradation, prompt drift, blast radius
HER metric instrumented and alerting
Kill switch implemented and tested
30-day shadow mode baseline completed
SLO written and derived from baseline data
Postmortem template updated to include behavioral baseline comparison

If any box is unchecked, your agent is demo-ready. Not production-ready.
Author: Ajay Devineni | Connect on LinkedIn

MCP Security in Action: Decision-Lineage Observability

Ajay Devineni — Mon, 13 Apr 2026 19:37:53 +0000

Traditional observability tells you what broke.
Agentic observability must tell you why the agent decided to break it — before the decision cascades into production.
After sharing the risk-classification framework (Part 1) and the Cloud Security Alliance's Six Pillars of MCP Security (Part 2), the obvious next question was: how do we actually observe and audit why an agent made a particular change?
This post covers the decision-lineage architecture I shipped in a regulated cloud-native environment over the past two weeks, and the results.

The Gap in Current Agentic AI Security
When an AI agent proposes a Terraform change, an Auto Scaling adjustment, or a firewall rule modification — do you know:

Why it made that specific decision?
Which context it was operating from?
Whether that context was clean (i.e., not poisoned or injected)?

If your answer is "we have prompt logs" — you're one prompt-injection incident away from a very difficult post-mortem.
Prompt logs capture what was said. Decision lineage captures why the agent chose to act, at every step of the reasoning chain.

What Decision-Lineage Observability Actually Looks Like
The reasoning chain I instrument:
Goal → Context ingestion → Tool selection → Proposed action → Policy check → Execute / Quarantine
For each step, we capture:

The deterministic trace ID tying the step to its session and goal
A hash of the context at that moment (tamper-evidence)
The tool selected and the reasoning for selecting it
The proposed action and its blast-radius classification
The policy check result

Implementation: A Thin Layer on Top of OpenTelemetry
No new infrastructure. This wraps your existing observability stack.
Step 1: Wrap Every MCP Tool Call with a Deterministic Trace ID
```python
import hashlib
import time
from dataclasses import dataclass


@dataclass
class LineageTraceId:
    session_id: str
    goal_hash: str
    sequence: int
    timestamp_ns: int  # e.g. time.time_ns() captured when the step starts

    def __str__(self):
        payload = f"{self.session_id}:{self.goal_hash}:{self.sequence}:{self.timestamp_ns}"
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

This ID is deterministic — you can reconstruct it from known inputs during incident investigation, even if the log store is unreachable.
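A short sketch of what reconstruction looks like in practice. The class is repeated so the example runs on its own; the `goal_hash` helper and the sample values are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageTraceId:
    session_id: str
    goal_hash: str
    sequence: int
    timestamp_ns: int

    def __str__(self):
        payload = f"{self.session_id}:{self.goal_hash}:{self.sequence}:{self.timestamp_ns}"
        return hashlib.sha256(payload.encode()).hexdigest()[:16]


def goal_hash(goal: str) -> str:
    """Hypothetical helper: hash the goal text so the raw goal never enters the ID."""
    return hashlib.sha256(goal.encode()).hexdigest()[:12]


# At runtime the agent records the four inputs alongside the step
original = LineageTraceId(
    "sess_abc123", goal_hash("rotate stale IAM keys"), 3, 1_736_950_981_000_000_000
)

# During an investigation, the same four inputs yield the same ID
rebuilt = LineageTraceId(
    "sess_abc123", goal_hash("rotate stale IAM keys"), 3, 1_736_950_981_000_000_000
)
```

Because `timestamp_ns` is one of the inputs, reconstruction depends on that value being recoverable from somewhere other than the lineage store, such as the trace span or the change record.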
Step 2: Write Reasoning Steps to an Append-Only Store
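```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
LINEAGE_BUCKET = "your-lineage-bucket"  # placeholder: bucket with Object Lock enabled


def write_lineage_record(trace_id: str, record: dict):
    now = datetime.now(timezone.utc)
    date_prefix = now.strftime("%Y/%m/%d")  # assumed key layout: one prefix per day
    s3.put_object(
        Bucket=LINEAGE_BUCKET,
        Key=f"decision-lineage/{date_prefix}/{trace_id}.json",
        Body=json.dumps({
            "trace_id": trace_id,
            "timestamp": now.isoformat(),
            "reasoning_chain": record["reasoning_chain"],
            "tool_selected": record["tool_selected"],
            "proposed_action": record["proposed_action"],
            "context_hash": record["context_hash"],
            "blast_radius_tier": record["blast_radius_tier"],
            "policy_result": record["policy_result"],
        }),
    )
```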
S3 + Glacier with Object Lock (WORM) for 90-day retention. The immutability is the point — a lineage store you can modify after the fact is a liability, not an asset.
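A hedged sketch of how the 90-day retention could be attached per object. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameters of boto3's S3 `put_object`; the helper name `lineage_put_kwargs` and the choice of COMPLIANCE mode are assumptions, and the bucket must have been created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def lineage_put_kwargs(bucket, key, body, now=None, retention_days=90):
    """Build keyword arguments for S3 put_object with a WORM retention date."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # COMPLIANCE mode is an assumption; GOVERNANCE allows privileged overrides
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }

# Usage, given a boto3 S3 client named s3:
#   s3.put_object(**lineage_put_kwargs(LINEAGE_BUCKET, key, body))
```

S3 requires a Content-MD5 or checksum header on uploads that set a retention period; recent SDK versions typically add one automatically, but it is worth confirming for the version you run.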
Step 3: Run Three Parallel Policy Checks Before Execution
```python
import asyncio


async def run_policy_checks(proposed_action, context, tool_output):
    # Each check returns a tuple whose first element is the pass/fail flag
    results = await asyncio.gather(
        check_blast_radius(proposed_action, context["approved_tier"]),
        check_behavioral_consistency(context["tool_name"], tool_output, context["hash"]),
        check_context_integrity(context, tool_output),
    )
    return {
        "passed": all(r[0] for r in results),
        "checks": {
            "blast_radius": results[0],
            "behavioral_consistency": results[1],
            "context_integrity": results[2],
        }
    }
```
Blast radius check: Does the proposed action match the approved tier for this agent session?
Behavioral consistency check: Is the tool output consistent with historical baselines for this context? Significant deviations are flagged — they can indicate tool compromise or context drift.
Context integrity check: Pattern matching against known prompt injection signatures across the full context + tool output payload.
All three run in parallel (async). Overhead is under 50ms for most checks.
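To make the check contract concrete, here is a minimal sketch of two of the three checks, each returning a `(passed, detail)` tuple. The tier names and the injection patterns are illustrative assumptions, not a real signature library, and the behavioral consistency check is omitted because it depends on a historical baseline.

```python
import asyncio
import re

TIER_ORDER = ["read_only", "low", "medium", "high"]  # hypothetical tier names

INJECTION_PATTERNS = [  # illustrative only, not a complete signature set
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]


async def check_blast_radius(proposed_action, approved_tier):
    ok = TIER_ORDER.index(proposed_action["tier"]) <= TIER_ORDER.index(approved_tier)
    return ok, f"action tier={proposed_action['tier']} approved={approved_tier}"


async def check_context_integrity(context, tool_output):
    payload = f"{context.get('text', '')}\n{tool_output}"
    hits = [p.pattern for p in INJECTION_PATTERNS if p.search(payload)]
    return not hits, hits


async def run_checks(proposed_action, context, tool_output):
    results = await asyncio.gather(
        check_blast_radius(proposed_action, context["approved_tier"]),
        check_context_integrity(context, tool_output),
    )
    return {"passed": all(r[0] for r in results), "checks": results}
```

Pattern matching of this kind only catches known phrasings, which is one reason to pair it with the baseline-driven consistency check.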
Step 4: Safe Degradation on Any Failure
```python
def handle_policy_result(policy_result, proposed_action, trace_id):
    if policy_result["passed"]:
        attach_lineage_to_pr(trace_id, proposed_action)  # Attach "why" to the change record
        execute_action(proposed_action)
    else:
        quarantine_action(proposed_action, trace_id)
        create_human_review_ticket(action=proposed_action, trace_id=trace_id)
        return safe_degradation_response(trace_id)
```
Quarantined changes are never silently dropped — they create a human review ticket with the full lineage record attached. The agent receives a safe fallback response explaining why the action was held.

Results After a 2-Week Pilot
| Metric | Result |
| --- | --- |
| AI-proposed changes with full "why" traceability | 100% |
| Poisoned-tool incidents caught pre-execution | 3 |
| SRE on-call pages | –40% |
| Compliance audit query time | ~3 days → ~2 hours (self-serve) |
The SRE page reduction was unexpected. Because every change now carries its reasoning chain, on-call engineers spend far less time reconstructing why something changed during incident response. The agent essentially writes its own incident context in advance.
The compliance improvement was the immediate business win — the audit team can query the lineage store directly via a simple CLI instead of opening a ticket with engineering.

The Three Lessons That Surprised Me

Immutability is your integrity primitive, not a compliance checkbox. A lineage store that can be modified is a liability. The moment you apply WORM constraints, the audit value multiplies because any tampering becomes detectable.
Context hashing > content logging. Logging the full context at each step is expensive and creates its own data privacy surface. Hashing the context gives you tamper-evidence without logging sensitive payloads. You only need to store the full context for flagged events.
The lineage layer becomes your incident response system. Build the query interface for operators first, compliance second. If it's hard for SREs to use during an incident, it won't be used — and the value disappears.
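The context-hashing lesson can be sketched as follows, assuming the context is JSON-serializable. The function names are hypothetical, and canonical JSON with sorted keys is one reasonable encoding choice among several.

```python
import hashlib
import json


def context_hash(context: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lineage_entry(context: dict, flagged: bool) -> dict:
    entry = {"context_hash": context_hash(context)}
    if flagged:
        # Keep the full payload only for flagged events
        entry["context"] = context
    return entry
```

Unflagged steps carry only the 64-character digest, so the sensitive payload never reaches the lineage store.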

What's Coming: Open-Source Reference Implementation
Next week I'll publish the reference implementation. It will include:

Drop-in OpenTelemetry instrumentation for common MCP-compatible agent frameworks
Pre-built policy checks (blast radius classification, behavioral baseline builder, injection pattern library)
CDK + Terraform modules for the storage/eventing infrastructure
A query CLI designed for operators (not just compliance teams)

It's designed to be framework-agnostic — if your agent emits OpenTelemetry spans, you can instrument it.

Where Are You on This?
If you're running agentic AI against production infrastructure — even in shadow mode — what's your current approach to decision auditability?
Specifically curious about:

Are you correlating agent decisions to change records (PRs, CRs, tickets)?
How are you handling prompt injection detection at the tool boundary?
What does "audit-ready" look like in your compliance context?

Drop your approach in the comments. This is an area where the community is still building the playbook, and I'd rather share notes than solve it in isolation.

Part 1: Risk Classification Framework for MCP Tool Calls
Part 2: The Cloud Security Alliance's Six Pillars of MCP Security
Part 3: Decision-Lineage Observability (this post)

Why SRE Principles Are the Missing Layer in MCP Security

Ajay Devineni — Tue, 07 Apr 2026 19:45:39 +0000

Traditional observability tells you what broke. Securing MCP-enabled agentic AI requires understanding why the agent decided to act — and that requires a fundamentally different engineering approach.
Views and opinions are my own.
The reliability engineering community has spent decades building frameworks for understanding why systems fail. Error budgets. Blast radius analysis. Reversibility constraints. Safe degradation patterns.
None of these were designed with AI agents in mind.
And that gap is becoming one of the most important unsolved problems in production infrastructure.
What MCP Actually Is — and Why It Changes Everything
The Model Context Protocol (MCP) is the emerging standard that gives AI agents the ability to invoke tools, access data, and execute operations at machine speed. It is not simply an API integration layer.
MCP is a capability delegation framework. When your AI agent connects to an MCP server, it gains the authority to act on behalf of your systems — reading data, writing records, triggering workflows — with minimal human intervention between decisions.
That fundamental shift in what software can do autonomously is what makes MCP security categorically different from traditional application security.
The Failure Modes Traditional SRE Doesn't See
SRE practice is built around observable failure. A service goes down. Latency spikes. Error rates climb. Dashboards turn red. Alerts fire.
MCP introduces a class of failures that produce none of these signals:
Poisoned tool outputs — A malicious or compromised MCP server returns data designed to manipulate the agent's reasoning rather than serve its stated purpose. The agent doesn't throw an error. It simply makes different decisions — quietly, at machine speed, across every subsequent action in the workflow.
Rug pull attacks — An MCP tool's behavior, schema, or permissions change after your security review approved it. The tool still responds. Requests still succeed. But what the tool actually does has changed in ways your authorization model never accounted for.
Context contamination — In multi-server MCP deployments, data from an untrusted server can influence the agent's reasoning about a completely separate trusted system. There is no network boundary violation. No access control failure. The contamination happens at the semantic layer — inside the agent's context window.
These are not failures that observability platforms are built to detect. They don't produce stack traces. They don't increment error counters. They manifest as the agent making decisions that appear locally reasonable but are globally wrong.
What SRE Principles Actually Map To in MCP Security
The Cloud Security Alliance AI Safety Working Group is currently developing "The Six Pillars of MCP Security" — a framework I'm contributing to through research and writing focused specifically on the SRE and operational resilience angle.
Here's how the core SRE concepts translate directly into MCP security primitives:
Decision lineage instead of just logs
Traditional logging captures what happened — which service was called, what response was returned, what error was thrown. MCP security requires capturing why the agent decided to act — which tool was selected, which context influenced that selection, which prior tool output shaped the current reasoning step.
This is decision lineage: a tamper-evident record of the agent's reasoning pathway that makes it possible to reconstruct exactly how a sequence of actions came to occur. Without it, forensic investigation of an MCP security incident is essentially impossible.
Error budgets applied to unsafe autonomy
SRE error budgets define the acceptable threshold for unreliable behavior — the point at which reliability risk outweighs the cost of moving slower. The same concept applies directly to agent autonomy.
An agent operating within normal behavioral bounds earns the right to act autonomously. An agent whose tool invocation patterns, context window composition, or decision sequences drift outside established baselines should have its autonomy progressively constrained — moving toward human-in-the-loop confirmation for high-impact actions until normal patterns are restored.
This is error budgets applied not to uptime, but to trustworthiness.
Safe degradation for agentic systems
When a microservice degrades, it fails gracefully — returning cached responses, shedding load, activating circuit breakers. When an MCP-enabled agent degrades, the equivalent is reducing its capability surface: restricting which tools it can invoke, requiring explicit approval for write operations, limiting the scope of context it can access.
Safe degradation for agentic systems means defining the progressive capability reduction path — from full autonomy to supervised operation to read-only mode to complete suspension — and automating the transitions based on observable behavioral signals.
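A minimal sketch of that reduction path as a small state machine. The `drift_score` input and both thresholds are hypothetical placeholders for whatever behavioral signal and baseline your team derives; only the four levels come from the text above.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    SUSPENDED = 0
    READ_ONLY = 1
    SUPERVISED = 2
    FULL = 3


def next_level(current: AutonomyLevel, drift_score: float,
               degrade_above: float = 0.5, restore_below: float = 0.2) -> AutonomyLevel:
    """Step down one level on high drift, step up one level once drift has settled."""
    if drift_score > degrade_above and current > AutonomyLevel.SUSPENDED:
        return AutonomyLevel(current - 1)
    if drift_score < restore_below and current < AutonomyLevel.FULL:
        return AutonomyLevel(current + 1)
    return current


def allows_write(level: AutonomyLevel, human_approved: bool) -> bool:
    if level == AutonomyLevel.FULL:
        return True
    if level == AutonomyLevel.SUPERVISED:
        return human_approved
    return False
```

The gap between the two thresholds is deliberate: it keeps an agent hovering near the boundary from flapping between levels on every evaluation.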
The Observability Gap
The hardest part of this problem is not the controls. It's the detection.
Traditional observability tells you what broke. A request failed. A threshold was crossed. A dependency went down.
MCP security requires understanding why the agent made a particular decision — and that requires a fundamentally different instrumentation approach. You need to capture not just the inputs and outputs of each tool call, but the semantic context that surrounded it. What was in the agent's context window? What prior tool outputs influenced this decision? What was the agent's stated reasoning before it chose this action?
This is not a solved problem in the current observability tooling landscape. It is the gap that makes MCP security genuinely difficult — and genuinely important to get right before agentic AI is operating at scale in regulated production environments.
What This Means for Your Team Right Now
If your team is deploying AI agents that touch production infrastructure, the question isn't whether you need an MCP security strategy.
It's whether you're already operating with one without realizing it needs a formal name.
Start with three questions:
Can you reconstruct why your agent took a specific action? If not, you don't have decision lineage — and you can't do forensics on an MCP security incident.
Do you have behavioral baselines for your agents? If not, you can't detect drift — and context contamination and tool poisoning both manifest as behavioral drift before they manifest as anything else.
Do you have a defined capability reduction path? If your agent starts behaving outside expected parameters, what happens? If the answer is "we'd have to manually intervene," you don't have safe degradation — you have a manual kill switch, which is not the same thing.
These are solvable engineering problems. They require applying reliability engineering discipline to a new domain — which is exactly what SRE has always done.

I shared a shorter version of these ideas on LinkedIn here (https://www.linkedin.com/posts/ajay-devineni_agenticai-mcp-aisecurity-activity-7446992069618913281-dnPv?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU). This research is part of my contribution to the Cloud Security Alliance AI Safety Working Group's Six Pillars of MCP Security framework.
What challenges are you seeing when bringing agentic AI safely into production? Are observability gaps or control gaps the bigger problem for your team?

Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring & Lessons from Production

Ajay Devineni — Sun, 05 Apr 2026 22:03:10 +0000

Migrating a live financial database with billions of rows, zero tolerance for data loss, and a strict cutover window is not a data transfer problem.
It is a resource isolation problem, a risk prediction problem, and a compliance documentation problem — all running simultaneously.
This article documents the architecture and lessons from a production SQL Server → AWS Aurora RDS migration I executed across multiple credit union banking environments. The core contribution is a framework I built called DMS-PredictLagNet — combining parallel DMS instance isolation with Holt-Winters predictive CDC lag forecasting for autonomous scaling.
The Challenge
The source environment was on-premises SQL Server across two separate data centers. Hundreds of tables. Two tables with billions of rows each. Continuous live transaction traffic — no maintenance window available. SOC 2 Type II and PCI DSS compliance required throughout.
The hardest constraint: cutover had to happen within a documented change window measured in hours. If CDC replication lag was not at zero when that window opened, the entire migration had to defer to the next available window.
Network Architecture: Dual VPN → Transit Gateway
I established Site-to-Site VPN tunnels (IPSec/IKEv2) from both on-premises data centers into AWS, terminating at AWS Transit Gateway with dedicated route tables per client VPC. This guaranteed complete traffic isolation between the two migration streams — data from one client's pipeline could not traverse the other's route domain under any circumstances.
Critical lesson learned the hard way: The source network team provided their internal LAN CIDR (192.x.x.x) for VPN configuration. What AWS actually sees is the post-NAT translated address — a completely different range. Every AWS-side configuration (route tables, security groups, network ACLs, VPN Phase 2 proxy ID selectors) must be built around the post-NAT address, not the internal LAN address. This mistake caused millions of connection timeouts before I identified the root cause. The fastest way to avoid it: ask "what IP address does AWS actually see when traffic leaves your environment?" before touching any configuration.
Before starting any DMS task, I ran AWS Reachability Analyzer to validate end-to-end connectivity from each DMS replication instance to its source endpoint. This caught a missing route table entry that would have caused a task failure mid-window. I now treat this as a mandatory pre-migration gate.
Schema Conversion with AWS SCT
I ran AWS Schema Conversion Tool on a Windows EC2 instance inside the VPC — giving it direct connectivity to Aurora through the VPC network and to SQL Server through the VPN tunnel. Running SCT on a local laptop introduces latency variability that causes timeouts on large schema assessments.
Credentials were stored in AWS Secrets Manager and accessed via IAM role — never stored in configuration files. This is a SOC 2 control requirement, not just a best practice.
Two transformation rules were configured before assessment:

Database remapping rule for naming convention differences
Drop-schema rule to remove the SQL Server dbo prefix from all migrated objects

Every incompatibility was resolved before a single row of data moved. Starting the full load before schema validation is complete is a common mistake with expensive consequences.
The Core Architectural Decision: Parallel DMS Instance Isolation
This was the most important design decision in the migration.
A single DMS replication instance handling both the billion-row table and everything else creates resource contention. The billion-row table's CDC competes with hundreds of other tables for memory, CPU, and network bandwidth. Under peak transaction volume, that contention manifests as lag accumulation across the entire pipeline — and lag on a billion-row table takes the longest to clear.
My solution: complete workload isolation.

Instance 1 — dedicated exclusively to CDC replication for the single billion-row table. Nothing else ran on this instance.
Instance 2 — handled full load and then CDC for all remaining tables.

Both instances ran on the latest available DMS instance type with high-memory configuration. Standard sizing guidance does not account for sustained 14-day CDC workloads in live financial environments. The newer instance generation provided lower baseline CPU utilization under CDC load, more memory for the transaction log decoder, and better network throughput — all of which directly improved the predictive monitor's accuracy by providing more headroom before threshold triggers.
LOB settings required per-table tuning. Tables with large text columns used Full LOB mode. Tables without LOB columns used Limited LOB mode with appropriate size limits. Mixing these without table-level configuration would have degraded throughput across the entire non-LOB majority of the table estate.
The Foreign Key Pre-Assessment Fix
The DMS pre-assessment failed on the first run — foreign key constraint violations because DMS loads tables in parallel and does not guarantee parent tables are loaded before child table inserts begin.
Fix: add initstmt=set foreign_key_checks=0 to the Aurora target endpoint extra connection attributes. This disables foreign key enforcement for the DMS session only — it does not affect any other connections to Aurora. Post-load referential integrity validation then confirms consistency was achieved through the migration process rather than enforced during loading.
In a SOC 2 environment: document this in the change control request and retain validation script output as audit evidence.
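For teams that script endpoint changes, a hedged sketch of applying the setting through boto3, assuming a MySQL-compatible Aurora target. `modify_endpoint` and its `ExtraConnectionAttributes` parameter are part of the boto3 DMS client; the helper names are hypothetical.

```python
FK_CHECKS_OFF = "initstmt=SET FOREIGN_KEY_CHECKS=0"


def with_fk_checks_disabled(existing_attributes: str) -> str:
    """Append the initstmt setting to a semicolon-separated attribute string, once."""
    attrs = [a.strip() for a in existing_attributes.split(";") if a.strip()]
    if not any(a.lower() == FK_CHECKS_OFF.lower() for a in attrs):
        attrs.append(FK_CHECKS_OFF)
    return ";".join(attrs)


def apply_to_endpoint(dms_client, endpoint_arn: str, existing_attributes: str = ""):
    # dms_client is expected to be boto3.client("dms")
    return dms_client.modify_endpoint(
        EndpointArn=endpoint_arn,
        ExtraConnectionAttributes=with_fk_checks_disabled(existing_attributes),
    )
```

Passing the endpoint's current attributes in keeps the change additive, and the resulting string is a convenient artifact to paste into the change control request.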
DMS-PredictLagNet: Predictive CDC Lag Monitoring
The standard reactive approach — CloudWatch alarm fires when lag exceeds a threshold — is insufficient in a live financial environment for two reasons. By the time an alarm fires, the backlog may already require hours to clear. And financial transaction volume is non-linear: payroll processing, end-of-day settlement, and batch jobs create predictable but sharp spikes that static thresholds do not adapt to.
I built a predictive monitoring system using Holt-Winters triple exponential smoothing trained on 90 days of source transaction volume patterns.
The model captures three components:

Level — baseline transaction rate
Trend — directional change over time
Seasonality — recurring patterns (daily and weekly cycles)

The seasonal period was set to m=168 (hourly observations over a 7-day weekly cycle) — the dominant periodicity in credit union banking, driven by business-day versus weekend patterns and weekly payroll cycles.
Rather than forecasting lag directly, I predicted transaction volume 30 minutes ahead and translated the forecast into predicted lag via an empirically calibrated throughput model for the specific DMS instance sizes in use. This two-stage approach produced more reliable results because CDC lag is affected by DMS internal buffer state that is not observable from CloudWatch metrics alone.
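For readers unfamiliar with the method, here is a minimal pure-Python sketch of additive Holt-Winters, covering the first stage only (forecasting volume). It is not the production model: the smoothing parameters are passed in by hand instead of being fitted, and the initialization is the simplest common one.

```python
def holt_winters_additive(series, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters; returns `horizon` forecasts past the end of `series`.

    Needs at least two full seasons (2 * m points) to initialise the trend.
    """
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first = sum(series[:m]) / m
    second = sum(series[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    seasonal = [series[i] - first for i in range(m)]

    for t, y in enumerate(series):
        s = seasonal[t % m]
        previous_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % m] for h in range(1, horizon + 1)]
```

With hourly data and a weekly cycle you would call it with `m=168`. A library such as statsmodels (`ExponentialSmoothing`) can fit the smoothing parameters from the data, which is likely preferable outside of a sketch.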
The autonomous scaling response operated on two tiers:
When the forecast indicated that predicted lag would reach 60% of the critical threshold within 30 minutes → AWS Lambda triggered a DMS instance scale-up automatically.
When the forecast indicated 85% of the critical threshold → AWS Systems Manager automation executed an emergency scale-up to the maximum pre-approved instance size and paged the on-call engineer via PagerDuty.
All automated actions wrote to the S3 audit log before execution — satisfying SOC 2 requirements for immutable evidence of automated control actions.
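The two-tier decision and the audit-before-execute ordering can be sketched as follows. The function names and the in-memory `audit_log` list are stand-ins (the real system wrote to S3 and invoked Lambda or Systems Manager); only the 60% and 85% thresholds come from the description above.

```python
def scaling_action(predicted_lag_seconds: float, critical_lag_seconds: float) -> str:
    """Map a 30-minute-ahead lag forecast to one of the two response tiers."""
    ratio = predicted_lag_seconds / critical_lag_seconds
    if ratio >= 0.85:
        return "emergency_scale_up_and_page"
    if ratio >= 0.60:
        return "scale_up"
    return "none"


def handle_forecast(predicted_lag_seconds, critical_lag_seconds, audit_log, execute):
    action = scaling_action(predicted_lag_seconds, critical_lag_seconds)
    if action != "none":
        # Record the decision first, then act, so the evidence exists
        # even if the action itself fails partway through
        audit_log.append({"action": action, "predicted_lag_seconds": predicted_lag_seconds})
        execute(action)
    return action
```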
Results
Across the 14-day CDC replication window:

7 high-risk lag events identified by the predictive monitor
5 resolved autonomously by Lambda-triggered scale-up — no human intervention
2 required engineer engagement (one unscheduled batch job outside training distribution, one DMS task restart requiring SOC 2 change authorization)
Zero engineer pages for predictable, pattern-driven lag events

Post-migration outcomes:

Zero data loss across all tables
Cutover window met
41% query performance improvement on Aurora within 48 hours post-cutover

Post-CDC Validation Before Cutover
Three-level validation executed across all tables before cutover authorization:

Row count parity — exact match between source and Aurora at validation timestamp
Checksum validation — hash comparison over critical column sets to detect corruption that row counts alone would not reveal
Referential integrity validation — all foreign key relationships confirmed satisfied in Aurora

Two tables had minor row count discrepancies on first run — both traced to in-flight transactions committed in the milliseconds between source and target count queries. Rerunning during a low-transaction period confirmed equivalence. Run validation during known low-traffic windows, not during peak processing.
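A simplified sketch of the first two validation levels, assuming rows have already been fetched as dictionaries. At billion-row scale the hashing would run inside each database in chunks; `table_fingerprint` is a hypothetical helper that only illustrates the comparison.

```python
import hashlib


def table_fingerprint(rows, key_columns, checked_columns):
    """Return (row_count, checksum) for an iterable of dict rows.

    Rows are sorted by key so the checksum does not depend on fetch order.
    """
    digest = hashlib.sha256()
    count = 0
    for row in sorted(rows, key=lambda r: tuple(r[k] for k in key_columns)):
        count += 1
        for column in key_columns + checked_columns:
            encoded = repr(row[column]).encode("utf-8")
            # Length-prefix each value so adjacent values cannot blur together
            digest.update(len(encoded).to_bytes(8, "big"))
            digest.update(encoded)
    return count, digest.hexdigest()
```

Matching counts with differing checksums is the signature of content-level damage such as truncation. Note that `repr` is type-sensitive, so source and target values need to be normalized to the same Python types before comparison.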
The 14-Day CDC Window
The 14-day validation period served three purposes simultaneously:

Application teams ran full regression testing against Aurora using real production data
The CDC pipeline's behavior was observed across a complete two-week transaction cycle including payroll, weekends, and month-end batch
Validation scripts were executed and verified before the cutover decision was made

Key Takeaways for Engineers Planning Similar Migrations
Ask the right network question first. What IP address does AWS actually see when traffic leaves your environment? Build everything around the post-NAT address.
Run Reachability Analyzer before any DMS task starts. The cost is negligible. The cost of discovering a routing gap after migration tasks have started is not.
Isolate your highest-volume table CDC on a dedicated instance. Do not let it compete for resources with your bulk load.
Validate content, not just row counts. Checksum validation caught LOB truncation that row count checks would have missed entirely.
Pre-assessment is not optional in regulated environments. Discovering the foreign_key_checks issue after a full load has started on a billion-row table is not recoverable within an eight-hour window.
Predictive monitoring is not about preventing every lag event. It is about converting unpredictable events into manageable ones — autonomous handling of known patterns, human escalation for genuinely novel ones.
The full framework — including the Holt-Winters forecasting methodology, parallel DMS partition design, and SOC 2 audit trail architecture — is written up as peer-reviewed research for the SRE and cloud engineering community. Migration patterns like this should be documented, not just passed around as tribal knowledge.
What's the hardest part of large database migrations for your team — data volume, CDC lag management, cutover coordination, or post-migration validation?

I also shared a high-level architecture overview of this migration on LinkedIn — you can find it here https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU