<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajay Devineni</title>
    <description>The latest articles on DEV Community by Ajay Devineni (@ajaydevineni).</description>
    <link>https://dev.to/ajaydevineni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862822%2Fddbc52cd-519d-4344-bea2-effb2a513786.png</url>
      <title>DEV Community: Ajay Devineni</title>
      <link>https://dev.to/ajaydevineni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajaydevineni"/>
    <language>en</language>
    <item>
      <title>A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Thu, 23 Apr 2026 18:51:21 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/a2a-mcp-in-production-the-sre-reliability-framework-nobody-has-written-yet-2hf2</link>
      <guid>https://dev.to/ajaydevineni/a2a-mcp-in-production-the-sre-reliability-framework-nobody-has-written-yet-2hf2</guid>
      <description>&lt;p&gt;Google A2A Protocol turned one year old on April 9, 2026. Over 150 organizations are running it in production. It is live inside Amazon Bedrock AgentCore and Azure AI Foundry. IBM's competing Agent Communication Protocol merged into A2A rather than fight it. The Linux Foundation now governs the spec.&lt;/p&gt;

&lt;p&gt;The protocol is production-grade. The reliability engineering discipline for it is not.&lt;/p&gt;

&lt;p&gt;I have spent the past year building SRE frameworks for single-agent + MCP deployments in regulated financial services environments. When A2A entered the picture, I realized the failure surface I had been managing had changed completely. This article documents the new failure modes A2A introduces and the SRE patterns I believe are required to manage them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two-Layer Stack and Why It Changes Everything
&lt;/h2&gt;

&lt;p&gt;MCP and A2A solve different problems at different layers of the agent stack. This is well understood by now. What is not yet well understood is what the two-layer combination means for reliability engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; — the vertical layer. An agent connects to tools and data sources. The failure modes are familiar to any distributed systems engineer: tool unavailability, degraded response quality, latency spikes, authentication failures. The blast radius is bounded. One agent, one tool layer, one error budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A (Agent-to-Agent Protocol)&lt;/strong&gt; — the horizontal layer. Agents communicate with other agents across organizational and platform boundaries. An orchestrator agent delegates subtasks to specialist agents via JSON-RPC over HTTP. Those specialist agents may be built by different teams, run by different vendors, and governed by different SLOs.&lt;/p&gt;

&lt;p&gt;The reliability engineering challenge A2A creates is not technical — the protocol itself is well-designed. It is organizational and observational. When an orchestrator agent delegates to a sub-agent via A2A, and that sub-agent fails silently, who carries the error budget? How do you instrument the boundary? What does safe degradation look like when an entire reasoning capability disappears because a downstream agent is unavailable?&lt;/p&gt;

&lt;p&gt;These questions have no consensus answers yet. This article is my attempt to start building them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The A2A Failure Mode Taxonomy
&lt;/h2&gt;

&lt;p&gt;After studying multi-agent failure patterns across production deployments, I categorize A2A-specific failures into four classes. The first two are detectable with existing tooling. The last two are not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class 1: Sub-Agent Unavailability
&lt;/h3&gt;

&lt;p&gt;The downstream agent returns a 503 or connection timeout. This is the easiest failure to handle — it looks like a standard HTTP failure and can be caught by existing circuit breaker patterns. Your orchestrator agent should treat sub-agent unavailability exactly as it treats MCP tool unavailability: fall back to a degraded capability or route to a human escalation path.&lt;/p&gt;

&lt;p&gt;Instrumentation: standard HTTP error rate monitoring at the A2A client layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class 2: Sub-Agent Latency Degradation
&lt;/h3&gt;

&lt;p&gt;The downstream agent responds, but slowly. In a multi-agent chain (Agent A → Agent B → Agent C), latency compounds. A 2-second degradation at Agent C can easily become a 6-second degradation at Agent A once retries and additional reasoning passes pile up at each hop. Users experience this as the orchestrator being slow — but the problem is buried three hops down the chain.&lt;/p&gt;

&lt;p&gt;Instrumentation: distributed tracing across A2A boundaries. Each A2A task invocation should carry a trace ID propagated from the orchestrator. Without this, your latency SLI for the orchestrator tells you nothing useful about where the latency is originating.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class 3: Silent Task Result Corruption — ⚠️ Not detectable with standard tooling
&lt;/h3&gt;

&lt;p&gt;The downstream agent returns HTTP 200 with a syntactically valid A2A task result, but the result is semantically wrong — incomplete reasoning, missing context fields, hallucinated data treated as factual output. The orchestrator agent receives this as a successful response and incorporates it into its own output.&lt;/p&gt;

&lt;p&gt;Your error rate SLI stays at zero. Your latency SLI stays normal. Your user receives incorrect output from a system that reported 100% success.&lt;/p&gt;

&lt;p&gt;This is the failure mode that existing observability stacks cannot detect. It requires what I call an A2A Semantic Boundary Validator — a lightweight evaluation function that runs at the A2A client layer on every incoming task result, checking the result against expected behavioral bounds for that sub-agent's task class.&lt;/p&gt;

&lt;p&gt;The implementation pattern mirrors my Decision Quality Rate (DQR) SLI for single-agent systems: maintain a behavioral baseline per sub-agent per task class, and flag results that fall outside expected bounds as potential corruptions before they propagate upstream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class 4: Cascading Autonomy Amplification — ⚠️ The most dangerous failure mode
&lt;/h3&gt;

&lt;p&gt;Agent A delegates to Agent B. Agent B, uncertain about the task, makes additional autonomous decisions to resolve the ambiguity — invoking more MCP tools than its baseline, delegating further to Agent C, modifying its task interpretation. Agent C does the same.&lt;/p&gt;

&lt;p&gt;By the time a result returns to Agent A, the original task intent has been substantially transformed by a chain of autonomous interpretations — none of which were visible to the orchestrator, none of which crossed any error threshold, and none of which can be reconstructed without end-to-end decision lineage capture.&lt;/p&gt;

&lt;p&gt;This failure mode is unique to multi-agent systems. Single-agent + MCP deployments cannot produce it. It requires agents talking to agents, each adding their own layer of autonomous interpretation to a task that was never explicitly respecified.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SRE Framework for A2A: Five Additions to Your Existing Stack
&lt;/h2&gt;

&lt;p&gt;If you have followed my previous work on SLOs for agentic AI, you already have Decision Quality Rate, Tool Invocation Efficiency, and Human Escalation Rate instrumented for your single-agent deployments. A2A requires five additional capabilities on top of that foundation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A2A Boundary Tracing
&lt;/h3&gt;

&lt;p&gt;Every A2A task delegation must carry a distributed trace ID originating from the orchestrator. This is not optional — without it, you cannot attribute latency, errors, or behavioral drift to the correct agent in a multi-agent chain.&lt;/p&gt;

&lt;p&gt;Implementation: Propagate a &lt;code&gt;x-trace-id&lt;/code&gt; header on every A2A HTTP request. Store the full delegation tree (which agent delegated to which, with what task parameters, at what timestamp) in your centralized trace store. On AWS, I use X-Ray for the distributed trace and a DynamoDB table for the delegation tree — X-Ray captures the HTTP-level trace, DynamoDB captures the semantic-level task delegation structure.&lt;/p&gt;
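
&lt;p&gt;A minimal sketch of that client-side propagation, assuming a &lt;code&gt;requests&lt;/code&gt;-based JSON-RPC client: the in-memory &lt;code&gt;DELEGATION_LOG&lt;/code&gt; stands in for the DynamoDB delegation-tree table, and the &lt;code&gt;x-parent-span&lt;/code&gt; header and helper names are illustrative additions, not part of the A2A spec.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import uuid
import requests

DELEGATION_LOG = []  # stand-in for a DynamoDB table or other trace store

def delegate_task(agent_url: str, task: dict, trace_id=None, parent_span=None) -&amp;gt; dict:
    """Send an A2A task and record the delegation edge under one trace ID."""
    trace_id = trace_id or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]

    # Record the semantic delegation edge before the call is made
    DELEGATION_LOG.append({
        "trace_id": trace_id,
        "parent_span": parent_span,
        "span_id": span_id,
        "target_agent": agent_url,
        "task_summary": {k: task[k] for k in ("task_class", "id") if k in task},
        "ts": time.time(),
    })

    # Propagate the trace ID so every downstream hop attaches to the same tree
    resp = requests.post(
        agent_url,
        json=task,
        headers={"x-trace-id": trace_id, "x-parent-span": span_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;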

&lt;h3&gt;
  
  
  2. Per-Sub-Agent SLO Ownership
&lt;/h3&gt;

&lt;p&gt;Every A2A sub-agent your orchestrator calls must have a designated SLO owner — a named human or team who is paged when that sub-agent's reliability degrades. In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For internal sub-agents: assign SLO ownership the same way you assign ownership to microservices&lt;/li&gt;
&lt;li&gt;For external/third-party sub-agents: define a sub-agent reliability budget. If a third-party A2A agent degrades, your orchestrator should treat it as a dependency failure and activate your degraded-mode runbook — not wait for the vendor to page you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The org chart question — who owns the SLO when agents from different vendors collaborate via A2A — is the most important unresolved governance question in multi-agent reliability today.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A2A Semantic Boundary Validation
&lt;/h3&gt;

&lt;p&gt;For each sub-agent your orchestrator calls, define the expected output schema and behavioral bounds. Implement a validator function that runs on every incoming A2A task result before the orchestrator acts on it.&lt;/p&gt;

&lt;p&gt;Minimum validation layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema validation: does the result match the expected A2A task result structure?&lt;/li&gt;
&lt;li&gt;Completeness check: are required fields populated?&lt;/li&gt;
&lt;li&gt;Behavioral bound check: does the result fall within the baseline distribution for this sub-agent's task class?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results that fail validation should not be silently dropped — they should trigger your escalation path and log the full task context for postmortem analysis.&lt;/p&gt;
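
&lt;p&gt;A minimal sketch of that validation layer, assuming the orchestrator has already unwrapped the A2A task result into a plain dict; the field names and the length-based bound are placeholders for whatever schema and baseline statistics you actually track per task class.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class ValidationResult:
    passed: bool
    reasons: list

def validate_task_result(result: dict, expected_fields: list, baseline: dict) -&amp;gt; ValidationResult:
    """Minimum semantic boundary layer: schema, completeness, behavioral bounds."""
    reasons = []

    # 1. Schema: the unwrapped result must carry these top-level keys (placeholder names)
    for key in ("status", "artifacts"):
        if key not in result:
            reasons.append(f"missing field: {key}")

    # 2. Completeness: task-class-specific required fields must be populated
    payload = result.get("artifacts") or {}
    for field in expected_fields:
        if not payload.get(field):
            reasons.append(f"empty required field: {field}")

    # 3. Behavioral bound: a crude stand-in using output length against the task-class baseline
    length = len(str(payload))
    low, high = baseline.get("output_len_bounds", (0, float("inf")))
    if not (low &amp;lt;= length &amp;lt;= high):
        reasons.append(f"output length {length} outside baseline [{low}, {high}]")

    return ValidationResult(passed=not reasons, reasons=reasons)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;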

&lt;h3&gt;
  
  
  4. The Agent Chain Circuit Breaker
&lt;/h3&gt;

&lt;p&gt;In traditional microservices, a circuit breaker opens when downstream failure rate exceeds a threshold, preventing cascade failures. Multi-agent systems need an equivalent pattern, adapted for the non-deterministic nature of agent communication.&lt;/p&gt;

&lt;p&gt;My implementation: an agent chain circuit breaker that tracks the running success rate of each A2A sub-agent invocation over a 15-minute rolling window. When the validated success rate drops below 85% (accounting for semantic validation failures, not just HTTP errors), the circuit opens and the orchestrator routes that task class to a degraded-mode handler — typically a simplified version of the task that can be completed with MCP tools the orchestrator controls directly, or an immediate human escalation.&lt;/p&gt;
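
&lt;p&gt;A sketch of that breaker, using the 15-minute window and 85% threshold described above; the &lt;code&gt;min_samples&lt;/code&gt; guard is an assumption added here so the circuit does not flap on thin traffic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from collections import deque

class AgentChainCircuitBreaker:
    """Per sub-agent, per task class: validated success rate over a rolling window."""

    def __init__(self, window_seconds: int = 900, threshold: float = 0.85, min_samples: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self.outcomes = deque()  # (timestamp, passed_semantic_validation)

    def record(self, passed_validation: bool) -&amp;gt; None:
        self.outcomes.append((time.time(), passed_validation))
        self._evict()

    def is_open(self) -&amp;gt; bool:
        """Open means: route this task class to the degraded-mode handler."""
        self._evict()
        if len(self.outcomes) &amp;lt; self.min_samples:
            return False  # not enough traffic to judge
        success_rate = sum(ok for _, ok in self.outcomes) / len(self.outcomes)
        return success_rate &amp;lt; self.threshold

    def _evict(self) -&amp;gt; None:
        cutoff = time.time() - self.window_seconds
        while self.outcomes and self.outcomes[0][0] &amp;lt; cutoff:
            self.outcomes.popleft()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The orchestrator records an outcome after each semantic validation and checks &lt;code&gt;is_open()&lt;/code&gt; before the next delegation for that task class.&lt;/p&gt;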

&lt;h3&gt;
  
  
  5. End-to-End Decision Lineage for Multi-Agent Chains
&lt;/h3&gt;

&lt;p&gt;In single-agent systems, decision lineage is the record of what tools an agent invoked and what reasoning it applied. In A2A multi-agent systems, decision lineage must span the entire delegation chain — capturing not just what the orchestrator decided, but what each sub-agent decided on its behalf.&lt;/p&gt;

&lt;p&gt;This is the audit trail that satisfies SOC 2 Type II requirements for autonomous decision-making in regulated environments. Without it, you cannot demonstrate to auditors that you have oversight of decisions made by agents you deployed but didn't directly control.&lt;/p&gt;

&lt;p&gt;Implementation: each A2A task result must include a &lt;code&gt;decision_lineage&lt;/code&gt; field containing the sub-agent's tool invocations, reasoning path, and confidence metadata. The orchestrator appends this to its own decision lineage before logging the full chain to the immutable audit store.&lt;/p&gt;
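
&lt;p&gt;A sketch of the orchestrator-side merge, assuming task results arrive as dicts; &lt;code&gt;log_to_audit_store&lt;/code&gt; is a stand-in for whatever immutable store you actually use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from datetime import datetime, timezone

def merge_decision_lineage(orchestrator_lineage: dict, sub_agent_result: dict, sub_agent_id: str) -&amp;gt; dict:
    """Append a sub-agent's lineage to the orchestrator's own record before audit logging."""
    entry = {
        "sub_agent_id": sub_agent_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        # A missing lineage field is itself an auditable fact worth recording
        "lineage": sub_agent_result.get("decision_lineage", {"missing": True}),
    }
    orchestrator_lineage.setdefault("delegations", []).append(entry)
    return orchestrator_lineage

def log_to_audit_store(lineage: dict) -&amp;gt; None:
    """Stand-in for a write to an immutable (WORM) audit store."""
    print(json.dumps(lineage, default=str))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;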




&lt;h2&gt;
  
  
  The Organizational Question A2A Forces
&lt;/h2&gt;

&lt;p&gt;Every SRE framework I've described above requires answers to an organizational question the industry hasn't resolved:&lt;/p&gt;

&lt;p&gt;When an orchestrator agent delegates to a third-party sub-agent via A2A, and the sub-agent produces output that causes downstream harm — who is operationally responsible?&lt;/p&gt;

&lt;p&gt;This is not a legal question (yet). It is an operational ownership question that every multi-agent team will face in 2026.&lt;/p&gt;

&lt;p&gt;My position: the orchestrator owner carries responsibility for validating and acting on sub-agent output. The A2A protocol handles communication. It does not handle accountability. An orchestrator that blindly trusts A2A task results without semantic validation is the operational equivalent of a service that never validates the responses of its dependencies: nothing production-grade operates that way.&lt;/p&gt;

&lt;p&gt;Build the semantic boundary validation. Own the chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;If you are moving from single-agent + MCP to multi-agent + A2A, I recommend this progression:&lt;/p&gt;

&lt;p&gt;Week 1: Implement A2A boundary tracing with distributed trace ID propagation. You cannot debug what you cannot trace.&lt;/p&gt;

&lt;p&gt;Week 2: Assign explicit SLO ownership to every A2A sub-agent your orchestrator calls. Even a spreadsheet with named owners is better than none.&lt;/p&gt;

&lt;p&gt;Week 3-4: Build the semantic boundary validator for your highest-volume A2A task class. Start with schema and completeness validation before attempting behavioral bound checks.&lt;/p&gt;

&lt;p&gt;Month 2: Instrument the agent chain circuit breaker. Set your initial threshold conservatively (85% validated success rate) and adjust based on 30 days of baseline data.&lt;/p&gt;

&lt;p&gt;Month 3+: Build end-to-end decision lineage capture. This is the hardest piece and the most important for regulated environments.&lt;/p&gt;




&lt;h2&gt;Connecting the Arc&lt;/h2&gt;

&lt;p&gt;This article is part of a series on applying SRE discipline to agentic AI in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/ajaydevineni/why-sre-principles-are-the-missing-layer-in-mcp-security-2fo8"&gt;Why SRE Principles Are the Missing Layer in MCP Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing (published this week)&lt;/li&gt;
&lt;li&gt;A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet (this article)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I shared the core argument on LinkedIn: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_agenticai-a2a-mcp-share-7453145380822605824-pMta?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_agenticai-a2a-mcp-share-7453145380822605824-pMta?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SRE community spent a decade learning to run distributed microservices reliably. We're at day one for multi-agent systems with A2A. The failure modes are different. The organizational questions are harder. The instrumentation doesn't exist yet.&lt;/p&gt;

&lt;p&gt;Build it now — before your agent chains are running at a scale where these gaps become production incidents.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>agentaichallenge</category>
      <category>cloudnative</category>
      <category>a2a</category>
    </item>
    <item>
      <title>SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 21 Apr 2026 18:05:55 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/slo-design-for-agentic-ai-systems-why-traditional-reliability-metrics-break-and-what-to-use-3581</link>
      <guid>https://dev.to/ajaydevineni/slo-design-for-agentic-ai-systems-why-traditional-reliability-metrics-break-and-what-to-use-3581</guid>
      <description>&lt;h2&gt;
  
  
  The problem with applying traditional SLOs to AI agents
&lt;/h2&gt;

&lt;p&gt;SLOs work beautifully when "good" is observable.&lt;/p&gt;

&lt;p&gt;An API either returns &lt;code&gt;200&lt;/code&gt; or it doesn't. Latency is measurable. Availability is binary. You instrument, you baseline, you commit to a number, and you burn down an error budget when reality diverges.&lt;/p&gt;

&lt;p&gt;AI agents break every one of these assumptions.&lt;/p&gt;

&lt;p&gt;After a quarter of running agentic systems against production infrastructure, here are the three failure modes I keep running into when teams apply traditional SLO thinking to agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 1: Correctness is not observable at the response layer
&lt;/h2&gt;

&lt;p&gt;A REST service fails loudly. A &lt;code&gt;500&lt;/code&gt;, a timeout, a malformed payload — your existing observability catches it.&lt;/p&gt;

&lt;p&gt;An agent can produce a response that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parses correctly&lt;/li&gt;
&lt;li&gt;Passes schema validation&lt;/li&gt;
&lt;li&gt;Triggers no alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and still be wrong in a way that compounds silently for hours.&lt;/p&gt;

&lt;p&gt;Traditional error rate SLOs have zero visibility into this. Your dashboards stay green. The blast radius is growing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do instead:&lt;/strong&gt; Add a &lt;em&gt;behavioral correctness&lt;/em&gt; signal. For every agent decision class, define a human-reviewable sample rate and track the delta between agent judgment and human override. That delta is data. It belongs in your SLO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 2: Latency SLOs punish safe agent behavior
&lt;/h2&gt;

&lt;p&gt;A p99 latency SLO makes perfect sense for a stateless service.&lt;/p&gt;

&lt;p&gt;It is actively dangerous for an agent.&lt;/p&gt;

&lt;p&gt;Agents that pause to verify context, escalate ambiguous decisions to a human, or refuse to act on a poisoned tool output are doing exactly what you want them to do. A latency SLO penalizes them for it.&lt;/p&gt;

&lt;p&gt;If you optimize against a latency target, you are implicitly optimizing for speed over safety. In agentic systems, that's how you get silent degradation and runbook violations at 2am.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do instead:&lt;/strong&gt; Track &lt;em&gt;decision latency distribution&lt;/em&gt; separately from &lt;em&gt;response latency&lt;/em&gt;. Escalation paths should be excluded from latency SLO calculations or governed by a separate, explicitly higher target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode 3: You cannot commit to a number you haven't earned
&lt;/h2&gt;

&lt;p&gt;This one keeps coming up in conversations with other SRE leads.&lt;/p&gt;

&lt;p&gt;Teams instrument an agent, run it for a week, and immediately try to commit to a 99.5% reliability target. Then they burn their error budget in the first real incident because the baseline was built on demo traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule I enforce on my team:&lt;/strong&gt; Minimum 30-day behavioral baseline before any agentic SLO is ratified. No exceptions. The baseline must cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool failure scenarios&lt;/li&gt;
&lt;li&gt;Context window edge cases&lt;/li&gt;
&lt;li&gt;At least one simulated prompt drift event&lt;/li&gt;
&lt;li&gt;Real production traffic, not synthetic load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot reliability-engineer what you have not yet measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an agentic SLO actually looks like
&lt;/h2&gt;

&lt;p&gt;After iterating on this for a quarter, I'm building agentic SLOs around three signal types that traditional SLOs don't capture:&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal 1: Human Escalation Rate (HER)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HER = (decisions requiring human override) / (total agent decisions) × 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your canary metric. Rising HER is often the first observable signal of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model drift&lt;/li&gt;
&lt;li&gt;Context degradation&lt;/li&gt;
&lt;li&gt;Prompt decay&lt;/li&gt;
&lt;li&gt;Tool output poisoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set a threshold. Wire it to your on-call rotation. Page on it.&lt;/p&gt;

&lt;p&gt;My current target: &lt;strong&gt;HER ≤ 8% over any 24-hour rolling window&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal 2: Decision confidence distribution
&lt;/h3&gt;

&lt;p&gt;Don't track a single average confidence score. Track the &lt;em&gt;distribution&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When an agent is operating normally, confidence tends to be bimodal — high confidence on routine decisions, lower on edge cases. When the distribution collapses from bimodal to flat, something has shifted in the agent's environment.&lt;/p&gt;

&lt;p&gt;That shift may not produce errors yet. But it will.&lt;/p&gt;

&lt;p&gt;My current target: &lt;strong&gt;Decision confidence p10 ≥ 0.65&lt;/strong&gt;&lt;/p&gt;
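
&lt;p&gt;A sketch of how that signal could be computed over a rolling window, assuming each decision carries a confidence score; the two example windows are synthetic and only illustrate the bimodal-versus-collapsed contrast.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def confidence_p10(confidences) -&amp;gt; float:
    """p10 of the decision confidence distribution over the current window."""
    return float(np.percentile(confidences, 10))

# Synthetic example: a roughly bimodal (healthy) window vs. a collapsed one
healthy = [0.95] * 80 + [0.70] * 20
collapsed = np.random.default_rng(0).uniform(0.4, 0.8, 100)

print(confidence_p10(healthy))    # ~0.70, above the 0.65 target
print(confidence_p10(collapsed))  # well below 0.65, so warn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;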

&lt;h3&gt;
  
  
  Signal 3: Blast radius exposure rate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRER = (HIGH + CRITICAL tier changes per hour)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can have a green error rate and a dangerous blast radius exposure rate at the same time.&lt;/p&gt;

&lt;p&gt;This metric captures &lt;em&gt;risk velocity&lt;/em&gt; — how fast your agent is accumulating unreversed high-impact changes. It belongs in your SLO alongside uptime.&lt;/p&gt;

&lt;p&gt;My current target: &lt;strong&gt;CRITICAL tier changes ≤ 2/hour without explicit approval gate&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLO I'm piloting
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;agent_slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;baseline_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
  &lt;span class="na"&gt;signals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;human_escalation_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;≤&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8%"&lt;/span&gt;
      &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24h&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rolling"&lt;/span&gt;
      &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page_on_call&lt;/span&gt;
    &lt;span class="na"&gt;decision_confidence_p10&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;≥&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.65"&lt;/span&gt;
      &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rolling"&lt;/span&gt;
      &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warn&lt;/span&gt;
    &lt;span class="na"&gt;critical_blast_radius_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;≤&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2/hour"&lt;/span&gt;
      &lt;span class="na"&gt;gate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;explicit_approval_required&lt;/span&gt;
  &lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;calculated_from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;HER&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;confidence_p10&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;blast_radius_rate&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;not_from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;uptime&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;review_cadence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weekly_baseline_review&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The mindset shift
&lt;/h2&gt;

&lt;p&gt;Traditional SLO: &lt;em&gt;Is the system up?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agentic SLO: &lt;em&gt;Is the system trustworthy?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are not the same question. Uptime is necessary but not sufficient. An agent can be 100% available and producing wrong decisions at scale.&lt;/p&gt;

&lt;p&gt;The SRE community has the tooling, the culture, and the postmortem discipline to solve this. But we have to resist the temptation to copy-paste our existing SLO playbook onto a fundamentally different kind of system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In the next post in this series, I'll walk through how I'm wiring these signals into OpenTelemetry alongside the decision-lineage layer from my previous MCP observability write-up — so a single trace can answer both "what happened" and "why the agent decided to do it."&lt;/p&gt;

&lt;p&gt;If you're running agentic AI against production infrastructure and have built your own reliability signals, I'd genuinely like to hear what you're measuring. Drop it in the comments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is part of an ongoing series on AI-SRE: applying production reliability engineering principles to agentic AI systems in regulated cloud-native environments.&lt;/em&gt;&lt;br&gt;
Linkedin url &lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7452416001553567744-BPgq?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7452416001553567744-BPgq?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt; &lt;/p&gt;

</description>
      <category>sre</category>
      <category>devplusplus</category>
      <category>devops</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>Your AI Agent Doesn’t Have a Feature Problem. It Has an On-Call Rotation Problem.</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:06:25 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-ai-agent-doesnt-have-a-feature-problem-it-has-an-on-call-rotation-problempublished-true-22kg</link>
      <guid>https://dev.to/ajaydevineni/your-ai-agent-doesnt-have-a-feature-problem-it-has-an-on-call-rotation-problempublished-true-22kg</guid>
      <description>&lt;p&gt;Applying SRE principles to AI agents in production — ownership, observability, SLOs, runbooks, and the kill switch pattern.&lt;br&gt;
I've spent a year closely studying how AI agents fail in the wild — across incidents, postmortems, and real operational patterns — and what I keep noticing is a gap nobody talks about. Teams celebrate capability. Nobody builds operational readiness.&lt;br&gt;
Here's what that gap costs, and how to close it.&lt;/p&gt;

&lt;p&gt;The Gap: AI Agents Are Treated Like Features, Not Services&lt;br&gt;
In traditional SRE, every production service has:&lt;/p&gt;

&lt;p&gt;✅ A named owner who carries the pager&lt;br&gt;
✅ A defined SLO&lt;br&gt;
✅ An on-call rotation&lt;br&gt;
✅ A runbook&lt;br&gt;
✅ A postmortem process&lt;/p&gt;

&lt;p&gt;Most AI agents have a demo video and a Slack channel.&lt;br&gt;
This is a category error. An agent is not a feature. It is an autonomous decision-making service operating at the speed of your automation. When it fails, it doesn't fail quietly like a broken button. It fails at the rate of your automation — and often with external side effects: emails sent, APIs called, records written.&lt;/p&gt;

&lt;p&gt;The Failure Nobody Talks About&lt;br&gt;
The failure everyone prepares for is the hard failure: an exception thrown, a timeout, a 500 error. These are easy to catch. CloudWatch alarm, SNS notification, done.&lt;br&gt;
The failure nobody prepares for is the silent degradation.&lt;/p&gt;

&lt;p&gt;The agent completes tasks. Dashboards stay green. But for the last 6 hours, its reasoning has been subtly wrong — selecting the wrong tools, misinterpreting scope, producing outputs that look correct and aren't.&lt;/p&gt;

&lt;p&gt;This is the worst case. Not failure. Plausible, undetected, incorrect action at scale.&lt;br&gt;
Traditional observability doesn't catch this. You need a new layer.&lt;/p&gt;

&lt;p&gt;Introducing HER: Human Escalation Rate&lt;br&gt;
The most useful signal I've seen for agent health is one most teams don't track:&lt;br&gt;
HER = (decisions requiring human override / total decisions) × 100&lt;br&gt;
HER is to AI agents what error rate is to APIs. It tells you whether the agent's judgment is holding up.&lt;br&gt;
Here's a simple implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def publish_her_metric(agent_id: str, human_overrides: int, total_decisions: int):
    her = (human_overrides / total_decisions) * 100 if total_decisions &amp;gt; 0 else 0

    # Push to your metrics store
    metrics.gauge(
        "agent.human_escalation_rate",
        her,
        tags=[f"agent_id:{agent_id}"]
    )

    # Alert if above threshold
    if her &amp;gt; THRESHOLD:
        alert_oncall_owner(agent_id, her)

    return her
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When HER exceeds your threshold, a named human gets paged. Not a team. Not a Slack channel. A person.&lt;/p&gt;

&lt;p&gt;Three Requirements Before Any Agent Goes to Production&lt;br&gt;
Based on everything I've observed and learned, here's what I consider non-negotiable.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Named Human Owner Who Gets Paged
The ownership model matters more than the tooling.
Every agent must have a named individual who is accountable when HER exceeds threshold. Shared ownership is no ownership. "The AI team owns it" means nobody owns it.
Write it down:
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent:
  name: document-processor-v2
  owner: ajay.devineni@company.com
  pager: +1-xxx-xxx-xxxx
  slack_handle: "@ajay"
  escalation_policy: p1-sre-rotation
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;A Runbook That Covers At Least Four Failure Modes
Before any agent ships, a runbook must exist. Minimum coverage:
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Failure mode&lt;/th&gt;&lt;th&gt;What to look for&lt;/th&gt;&lt;th&gt;Immediate action&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Tool failure&lt;/td&gt;&lt;td&gt;Tool error rate spikes&lt;/td&gt;&lt;td&gt;Check dependency health, assess in-flight tasks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context degradation&lt;/td&gt;&lt;td&gt;Output length increases, HER spikes&lt;/td&gt;&lt;td&gt;Inspect conversation history, rollback prompt&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Prompt drift&lt;/td&gt;&lt;td&gt;Behavioral baseline deviation&lt;/td&gt;&lt;td&gt;Freeze deploys, compare prompt versions&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Blast radius event&lt;/td&gt;&lt;td&gt;Agent operating outside defined scope&lt;/td&gt;&lt;td&gt;Invoke kill switch, audit side effects&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
A runbook doesn't need to be 20 pages. It needs to be right and reachable at 2am.&lt;/li&gt;
&lt;li&gt;A 30-Day Behavioral Baseline Before Any SLO Is Set
This is the one most teams skip because it feels slow.
You cannot commit to reliability you have not measured.
Run your agent in shadow mode for 30 days — processing real inputs, generating real outputs, but reviewed before action. During that window, measure everything:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Task completion rate&lt;br&gt;
Human escalation rate (baseline HER)&lt;br&gt;
Tool call accuracy&lt;br&gt;
Decision latency (p50/p95/p99)&lt;br&gt;
Context window utilization&lt;br&gt;
Output quality score variance across identical inputs&lt;/p&gt;

&lt;p&gt;Only after 30 days do you write an SLO. The baseline IS the SLO foundation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example SLO written after baseline
agent_slo:
  valid_from: "after-30d-baseline"
  objectives:
    - metric: task_completion_rate
      target: 99.2%
      baseline_observed: 99.6%   # headroom built in intentionally

    - metric: human_escalation_rate
      target: "&amp;lt; 3%"
      baseline_observed: 1.8%
      alert_threshold: 5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Kill Switch Pattern&lt;br&gt;
Every production agent needs a kill switch — a mechanism to halt execution immediately, without a code deployment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def check_kill_switch(agent_id: str) -&amp;gt; bool:
    """
    Checks a config store for kill switch status.
    Works with SSM Parameter Store, LaunchDarkly,
    or any feature flag system.
    """
    status = config_store.get(f"agents/{agent_id}/kill-switch")
    return status == "ACTIVE"


def agent_task_loop(agent_id: str, tasks: list):
    for task in tasks:
        # Check before EVERY decision, not just at startup
        if check_kill_switch(agent_id):
            log_halt(agent_id, task)
            raise AgentHaltException("Kill switch active")

        execute(task)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The kill switch should be:&lt;/p&gt;

&lt;p&gt;Flipable without a deployment (config store, not code)&lt;br&gt;
Checked before every decision, not just at startup&lt;br&gt;
Audited — log every check and every activation&lt;/p&gt;

&lt;p&gt;What the Observability Stack Actually Looks Like&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Runtime
    │
    ├──▶ Structured logs (JSON, one entry per decision)
    │       └── Fields: session_id, tool_calls, human_override, confidence, latency
    │
    ├──▶ Custom metrics
    │       └── HER, tool error rate, context utilization, decision latency
    │
    ├──▶ Distributed traces
    │       └── End-to-end: input → LLM → tool calls → output
    │
    ├──▶ Event stream (one event per agent decision)
    │       └── Powers alerting rules and downstream audit
    │
    └──▶ Decision audit log (immutable)
            └── S3 / blob store, retained for postmortem analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every agent decision should emit a structured log entry:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "timestamp": "2025-01-15T14:23:01Z",
  "agent_id": "doc-processor-v2",
  "session_id": "sess_abc123",
  "tools_called": ["search", "summarize"],
  "tool_success": [true, true],
  "human_override": false,
  "context_utilization_pct": 47.1,
  "latency_ms": 3420,
  "task_completed": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is your audit trail. This is what you bring to a postmortem.&lt;/p&gt;

&lt;p&gt;The Postmortem Question Nobody Asks&lt;br&gt;
After an incident with a traditional service, postmortems ask:&lt;/p&gt;

&lt;p&gt;What failed?&lt;br&gt;
Why did it fail?&lt;br&gt;
How do we prevent recurrence?&lt;/p&gt;

&lt;p&gt;For AI agents, there's a fourth question that almost nobody asks:&lt;br&gt;
Was there a window where the agent was wrong, and we didn't know?&lt;br&gt;
Silent degradation periods are invisible in traditional postmortems because the dashboards were green. Adding a behavioral baseline comparison to every postmortem template forces this question into the open.&lt;/p&gt;

&lt;p&gt;Is Your Agent Production-Ready or Demo-Ready?&lt;br&gt;
The SRE community spent 20 years learning how to operate distributed systems reliably. Those lessons — ownership, observability, SLOs, runbooks, postmortems — weren't invented in conference rooms. They were earned through outages.&lt;br&gt;
AI agents are distributed systems with an additional dimension of unpredictability: they make decisions.&lt;br&gt;
Before your next agent ships, run this checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Named human owner with pager assigned&lt;/li&gt;
&lt;li&gt;Runbook covering tool failure, context degradation, prompt drift, blast radius&lt;/li&gt;
&lt;li&gt;HER metric instrumented and alerting&lt;/li&gt;
&lt;li&gt;Kill switch implemented and tested&lt;/li&gt;
&lt;li&gt;30-day shadow mode baseline completed&lt;/li&gt;
&lt;li&gt;SLO written and derived from baseline data&lt;/li&gt;
&lt;li&gt;Postmortem template updated to include behavioral baseline comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any box is unchecked, your agent is demo-ready. Not production-ready.&lt;br&gt;
Author: Ajay Devineni | Connect on &lt;a href="https://www.linkedin.com/in/ajay-devineni/" rel="noopener noreferrer"&gt;LinkedIn &lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MCP Security in Action: Decision-Lineage Observability</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Mon, 13 Apr 2026 19:37:53 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/mcp-security-in-action-decision-lineage-observability-1k6c</link>
      <guid>https://dev.to/ajaydevineni/mcp-security-in-action-decision-lineage-observability-1k6c</guid>
      <description>&lt;p&gt;Traditional observability tells you what broke.&lt;br&gt;
Agentic observability must tell you why the agent decided to break it — before the decision cascades into production.&lt;br&gt;
After sharing the risk-classification framework (Part 1) and the Cloud Security Alliance's Six Pillars of MCP Security (Part 2), the obvious next question was: how do we actually observe and audit why an agent made a particular change?&lt;br&gt;
This post covers the decision-lineage architecture I shipped in a regulated cloud-native environment over the past two weeks, and the results.&lt;/p&gt;

&lt;p&gt;The Gap in Current Agentic AI Security&lt;br&gt;
When an AI agent proposes a Terraform change, an Auto Scaling adjustment, or a firewall rule modification — do you know:&lt;/p&gt;

&lt;p&gt;Why it made that specific decision?&lt;br&gt;
Which context it was operating from?&lt;br&gt;
Whether that context was clean (i.e., not poisoned or injected)?&lt;/p&gt;

&lt;p&gt;If your answer is "we have prompt logs" — you're one prompt-injection incident away from a very difficult post-mortem.&lt;br&gt;
Prompt logs capture what was said. Decision lineage captures why the agent chose to act, at every step of the reasoning chain.&lt;/p&gt;

&lt;p&gt;What Decision-Lineage Observability Actually Looks Like&lt;br&gt;
The reasoning chain I instrument:&lt;br&gt;
Goal → Context ingestion → Tool selection → Proposed action → Policy check → Execute / Quarantine&lt;br&gt;
For each step, we capture:&lt;/p&gt;

&lt;p&gt;The deterministic trace ID tying the step to its session and goal&lt;br&gt;
A hash of the context at that moment (tamper-evidence)&lt;br&gt;
The tool selected and the reasoning for selecting it&lt;br&gt;
The proposed action and its blast-radius classification&lt;br&gt;
The policy check result&lt;br&gt;
Implementation: A Thin Layer on Top of OpenTelemetry&lt;br&gt;
No new infrastructure. This wraps your existing observability stack.&lt;br&gt;
Step 1: Wrap Every MCP Tool Call with a Deterministic Trace ID&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdybh6waeb6tvluoz3xp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdybh6waeb6tvluoz3xp.jpg" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib
import time
from dataclasses import dataclass

@dataclass
class LineageTraceId:
    session_id: str
    goal_hash: str
    sequence: int
    timestamp_ns: int

    def __str__(self):
        payload = f"{self.session_id}:{self.goal_hash}:{self.sequence}:{self.timestamp_ns}"
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This ID is deterministic — you can reconstruct it from known inputs during incident investigation, even if the log store is unreachable.&lt;br&gt;
Step 2: Write Reasoning Steps to an Append-Only Store&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def write_lineage_record(trace_id: str, record: dict):
    s3.put_object(
        Bucket=LINEAGE_BUCKET,
        Key=f"decision-lineage/{date_prefix}/{trace_id}.json",
        Body=json.dumps({
            "trace_id": trace_id,
            "timestamp": datetime.utcnow().isoformat(),
            "reasoning_chain": record["reasoning_chain"],
            "tool_selected": record["tool_selected"],
            "proposed_action": record["proposed_action"],
            "context_hash": record["context_hash"],
            "blast_radius_tier": record["blast_radius_tier"],
            "policy_result": record["policy_result"],
        }),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;S3 + Glacier with Object Lock (WORM) for 90-day retention. The immutability is the point — a lineage store you can modify after the fact is a liability, not an asset.&lt;br&gt;
Step 3: Run Three Parallel Policy Checks Before Execution&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def run_policy_checks(proposed_action, context, tool_output):
    results = await asyncio.gather(
        check_blast_radius(proposed_action, context["approved_tier"]),
        check_behavioral_consistency(context["tool_name"], tool_output, context["hash"]),
        check_context_integrity(context, tool_output),
    )
    return {
        "passed": all(r[0] for r in results),
        "checks": {
            "blast_radius": results[0],
            "behavioral_consistency": results[1],
            "context_integrity": results[2],
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Blast radius check: Does the proposed action match the approved tier for this agent session?&lt;br&gt;
Behavioral consistency check: Is the tool output consistent with historical baselines for this context? Significant deviations are flagged — they can indicate tool compromise or context drift.&lt;br&gt;
Context integrity check: Pattern matching against known prompt injection signatures across the full context + tool output payload.&lt;br&gt;
All three run in parallel (async). Overhead is under 50ms for most checks.&lt;br&gt;
Step 4: Safe Degradation on Any Failure&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def handle_policy_result(policy_result, proposed_action, trace_id):
    if policy_result["passed"]:
        attach_lineage_to_pr(trace_id, proposed_action)  # Attach "why" to the change record
        execute_action(proposed_action)
    else:
        quarantine_action(proposed_action, trace_id)
        create_human_review_ticket(action=proposed_action, trace_id=trace_id)
        return safe_degradation_response(trace_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Quarantined changes are never silently dropped — they create a human review ticket with the full lineage record attached. The agent receives a safe fallback response explaining why the action was held.&lt;/p&gt;

&lt;p&gt;Results After a 2-Week Pilot&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Result&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;AI-proposed changes with full "why" traceability&lt;/td&gt;&lt;td&gt;100%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Poisoned-tool incidents caught pre-execution&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SRE on-call pages&lt;/td&gt;&lt;td&gt;–40%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Compliance audit query time&lt;/td&gt;&lt;td&gt;~3 days → ~2 hours (self-serve)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The SRE page reduction was unexpected. Because every change now carries its reasoning chain, on-call engineers spend far less time reconstructing why something changed during incident response. The agent essentially writes its own incident context in advance.&lt;br&gt;
The compliance improvement was the immediate business win — the audit team can query the lineage store directly via a simple CLI instead of opening a ticket with engineering.&lt;/p&gt;

&lt;p&gt;The Three Lessons That Surprised Me&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Immutability is your integrity primitive, not a compliance checkbox.
A lineage store that can be modified is a liability. The moment you apply WORM constraints, the audit value multiplies because any tampering becomes detectable.&lt;/li&gt;
&lt;li&gt;Context hashing &amp;gt; content logging.
Logging the full context at each step is expensive and creates its own data privacy surface. Hashing the context gives you tamper-evidence without logging sensitive payloads. You only need to store the full context for flagged events. A minimal sketch follows this list.&lt;/li&gt;
&lt;li&gt;The lineage layer becomes your incident response system.
Build the query interface for operators first, compliance second. If it's hard for SREs to use during an incident, it won't be used — and the value disappears.&lt;/li&gt;
&lt;/ol&gt;
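
&lt;p&gt;A minimal sketch of the context-hashing idea from lesson 2, assuming the context is JSON-serializable; canonical serialization is what makes the hash reproducible during an investigation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib
import json

def context_hash(context: dict) -&amp;gt; str:
    """Tamper-evident fingerprint of the agent's context without storing the payload."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;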

&lt;p&gt;What's Coming: Open-Source Reference Implementation&lt;br&gt;
Next week I'll publish the reference implementation. It will include:&lt;/p&gt;

&lt;p&gt;Drop-in OpenTelemetry instrumentation for common MCP-compatible agent frameworks&lt;br&gt;
Pre-built policy checks (blast radius classification, behavioral baseline builder, injection pattern library)&lt;br&gt;
CDK + Terraform modules for the storage/eventing infrastructure&lt;br&gt;
A query CLI designed for operators (not just compliance teams)&lt;/p&gt;

&lt;p&gt;It's designed to be framework-agnostic — if your agent emits OpenTelemetry spans, you can instrument it.&lt;/p&gt;

&lt;p&gt;Where Are You on This?&lt;br&gt;
If you're running agentic AI against production infrastructure — even in shadow mode — what's your current approach to decision auditability?&lt;br&gt;
Specifically curious about:&lt;/p&gt;

&lt;p&gt;Are you correlating agent decisions to change records (PRs, CRs, tickets)?&lt;br&gt;
How are you handling prompt injection detection at the tool boundary?&lt;br&gt;
What does "audit-ready" look like in your compliance context?&lt;/p&gt;

&lt;p&gt;Drop your approach in the comments. This is an area where the community is still building the playbook, and I'd rather share notes than solve it in isolation.&lt;/p&gt;

&lt;p&gt;Part 1: Risk Classification Framework for MCP Tool Calls&lt;br&gt;
Part 2: The Cloud Security Alliance's Six Pillars of MCP Security&lt;br&gt;
Part 3: Decision-Lineage Observability (this post)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Why SRE Principles Are the Missing Layer in MCP Security</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:45:39 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/why-sre-principles-are-the-missing-layer-in-mcp-security-2fo8</link>
      <guid>https://dev.to/ajaydevineni/why-sre-principles-are-the-missing-layer-in-mcp-security-2fo8</guid>
      <description>&lt;p&gt;Traditional observability tells you what broke. Securing MCP-enabled agentic AI requires understanding why the agent decided to act — and that requires a fundamentally different engineering approach.&lt;br&gt;
Views and opinions are my own.&lt;br&gt;
The reliability engineering community has spent decades building frameworks for understanding why systems fail. Error budgets. Blast radius analysis. Reversibility constraints. Safe degradation patterns.&lt;br&gt;
None of these were designed with AI agents in mind.&lt;br&gt;
And that gap is becoming one of the most important unsolved problems in production infrastructure.&lt;br&gt;
What MCP Actually Is — and Why It Changes Everything&lt;br&gt;
The Model Context Protocol (MCP) is the emerging standard that gives AI agents the ability to invoke tools, access data, and execute operations at machine speed. It is not simply an API integration layer.&lt;br&gt;
MCP is a capability delegation framework. When your AI agent connects to an MCP server, it gains the authority to act on behalf of your systems — reading data, writing records, triggering workflows — with minimal human intervention between decisions.&lt;br&gt;
That fundamental shift in what software can do autonomously is what makes MCP security categorically different from traditional application security.&lt;br&gt;
The Failure Modes Traditional SRE Doesn't See&lt;br&gt;
SRE practice is built around observable failure. A service goes down. Latency spikes. Error rates climb. Dashboards turn red. Alerts fire.&lt;br&gt;
MCP introduces a class of failures that produce none of these signals:&lt;br&gt;
Poisoned tool outputs — A malicious or compromised MCP server returns data designed to manipulate the agent's reasoning rather than serve its stated purpose. The agent doesn't throw an error. It simply makes different decisions — quietly, at machine speed, across every subsequent action in the workflow.&lt;br&gt;
Rug pull attacks — An MCP tool's behavior, schema, or permissions change after your security review approved it. The tool still responds. Requests still succeed. But what the tool actually does has changed in ways your authorization model never accounted for.&lt;br&gt;
Context contamination — In multi-server MCP deployments, data from an untrusted server can influence the agent's reasoning about a completely separate trusted system. There is no network boundary violation. No access control failure. The contamination happens at the semantic layer — inside the agent's context window.&lt;br&gt;
These are not failures that observability platforms are built to detect. They don't produce stack traces. They don't increment error counters. They manifest as the agent making decisions that appear locally reasonable but are globally wrong.&lt;br&gt;
What SRE Principles Actually Map To in MCP Security&lt;br&gt;
The Cloud Security Alliance AI Safety Working Group is currently developing "The Six Pillars of MCP Security" — a framework I'm contributing to through research and writing focused specifically on the SRE and operational resilience angle.&lt;br&gt;
Here's how the core SRE concepts translate directly into MCP security primitives:&lt;br&gt;
Decision lineage instead of just logs&lt;br&gt;
Traditional logging captures what happened — which service was called, what response was returned, what error was thrown. MCP security requires capturing why the agent decided to act — which tool was selected, which context influenced that selection, which prior tool output shaped the current reasoning step.&lt;br&gt;
This is decision lineage: a tamper-evident record of the agent's reasoning pathway that makes it possible to reconstruct exactly how a sequence of actions came to occur. Without it, forensic investigation of an MCP security incident is essentially impossible.&lt;br&gt;
Error budgets applied to unsafe autonomy&lt;br&gt;
SRE error budgets define the acceptable threshold for unreliable behavior — the point at which reliability risk outweighs the cost of moving slower. The same concept applies directly to agent autonomy.&lt;br&gt;
An agent operating within normal behavioral bounds earns the right to act autonomously. An agent whose tool invocation patterns, context window composition, or decision sequences drift outside established baselines should have its autonomy progressively constrained — moving toward human-in-the-loop confirmation for high-impact actions until normal patterns are restored.&lt;br&gt;
This is error budgets applied not to uptime, but to trustworthiness.&lt;br&gt;
Safe degradation for agentic systems&lt;br&gt;
When a microservice degrades, it fails gracefully — returning cached responses, shedding load, activating circuit breakers. When an MCP-enabled agent degrades, the equivalent is reducing its capability surface: restricting which tools it can invoke, requiring explicit approval for write operations, limiting the scope of context it can access.&lt;br&gt;
Safe degradation for agentic systems means defining the progressive capability reduction path — from full autonomy to supervised operation to read-only mode to complete suspension — and automating the transitions based on observable behavioral signals.&lt;br&gt;
The Observability Gap&lt;br&gt;
The hardest part of this problem is not the controls. It's the detection.&lt;br&gt;
Traditional observability tells you what broke. A request failed. A threshold was crossed. A dependency went down.&lt;br&gt;
MCP security requires understanding why the agent made a particular decision — and that requires a fundamentally different instrumentation approach. You need to capture not just the inputs and outputs of each tool call, but the semantic context that surrounded it. What was in the agent's context window? What prior tool outputs influenced this decision? What was the agent's stated reasoning before it chose this action?&lt;br&gt;
This is not a solved problem in the current observability tooling landscape. It is the gap that makes MCP security genuinely difficult — and genuinely important to get right before agentic AI is operating at scale in regulated production environments.&lt;br&gt;
What This Means for Your Team Right Now&lt;br&gt;
If your team is deploying AI agents that touch production infrastructure, the question isn't whether you need an MCP security strategy.&lt;br&gt;
It's whether you're already operating with one without realizing it needs a formal name.&lt;br&gt;
Start with three questions:&lt;br&gt;
Can you reconstruct why your agent took a specific action? If not, you don't have decision lineage — and you can't do forensics on an MCP security incident.&lt;br&gt;
Do you have behavioral baselines for your agents? If not, you can't detect drift — and context contamination and tool poisoning both manifest as behavioral drift before they manifest as anything else.&lt;br&gt;
Do you have a defined capability reduction path? If your agent starts behaving outside expected parameters, what happens? If the answer is "we'd have to manually intervene," you don't have safe degradation — you have a manual kill switch, which is not the same thing.&lt;br&gt;
These are solvable engineering problems. They require applying reliability engineering discipline to a new domain — which is exactly what SRE has always done.&lt;/p&gt;
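
&lt;p&gt;A minimal sketch of the capability reduction path described above; the tier names mirror the progression in the text, while the HER and drift thresholds are illustrative assumptions, not recommendations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from enum import Enum

class AutonomyTier(Enum):
    SUSPENDED = 0   # agent halted
    READ_ONLY = 1   # read tools only
    SUPERVISED = 2  # write operations require human approval
    FULL = 3        # full autonomy

def next_tier(current: AutonomyTier, her_pct: float, drift_score: float) -&amp;gt; AutonomyTier:
    """Step autonomy down as behavioral signals degrade, and back up as they recover."""
    if her_pct &amp;gt; 15 or drift_score &amp;gt; 0.8:
        return AutonomyTier.SUSPENDED                   # hard stop
    if her_pct &amp;gt; 8 or drift_score &amp;gt; 0.5:
        return AutonomyTier(max(current.value - 1, 0))  # degrade one tier
    if her_pct &amp;lt; 3 and drift_score &amp;lt; 0.2:
        return AutonomyTier(min(current.value + 1, 3))  # earn autonomy back
    return current                                      # hold steady
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;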

&lt;p&gt;I shared a shorter version of these ideas on LinkedIn here(&lt;a href="https://www.linkedin.com/posts/ajay-devineni_agenticai-mcp-aisecurity-activity-7446992069618913281-dnPv?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_agenticai-mcp-aisecurity-activity-7446992069618913281-dnPv?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhe27638f4nu9a4zsw5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhe27638f4nu9a4zsw5a.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffskx9sctibyqmreuy0aw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffskx9sctibyqmreuy0aw.jpg" alt=" " width="784" height="1168"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnte0t7ksff9d3iaar0kh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnte0t7ksff9d3iaar0kh.jpg" alt=" " width="784" height="1168"&gt;&lt;/a&gt;). This research is part of my contribution to the Cloud Security Alliance AI Safety Working Group's Six Pillars of MCP Security framework.&lt;br&gt;
What challenges are you seeing when bringing agentic AI safely into production? Are observability gaps or control gaps the bigger problem for your team?&lt;/p&gt;

</description>
      <category>agentaichallenge</category>
      <category>sre</category>
      <category>security</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring &amp; Lessons from Production</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:03:10 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/zero-data-loss-migration-moving-billions-of-rows-from-sql-server-to-aurora-rds-architecture-4g56</link>
      <guid>https://dev.to/ajaydevineni/zero-data-loss-migration-moving-billions-of-rows-from-sql-server-to-aurora-rds-architecture-4g56</guid>
      <description>&lt;p&gt;Migrating a live financial database with billions of rows, zero tolerance for data loss, and a strict cutover window is not a data transfer problem.&lt;br&gt;
It is a resource isolation problem, a risk prediction problem, and a compliance documentation problem — all running simultaneously.&lt;br&gt;
This article documents the architecture and lessons from a production SQL Server → AWS Aurora RDS migration I executed across multiple credit union banking environments. The core contribution is a framework I built called DMS-PredictLagNet — combining parallel DMS instance isolation with Holt-Winters predictive CDC lag forecasting for autonomous scaling.&lt;br&gt;
The Challenge&lt;br&gt;
The source environment was on-premises SQL Server across two separate data centers. Hundreds of tables. Two tables with billions of rows each. Continuous live transaction traffic — no maintenance window available. SOC 2 Type II and PCI DSS compliance required throughout.&lt;br&gt;
The hardest constraint: cutover had to happen within a documented change window measured in hours. If CDC replication lag was not at zero when that window opened, the entire migration had to defer to the next available window.&lt;br&gt;
Network Architecture: Dual VPN → Transit Gateway&lt;br&gt;
I established Site-to-Site VPN tunnels (IPSec/IKEv2) from both on-premises data centers into AWS, terminating at AWS Transit Gateway with dedicated route tables per client VPC. This guaranteed complete traffic isolation between the two migration streams — data from one client's pipeline could not traverse the other's route domain under any circumstances.&lt;br&gt;
Critical lesson learned the hard way: The source network team provided their internal LAN CIDR (192.x.x.x) for VPN configuration. What AWS actually sees is the post-NAT translated address — a completely different range. Every AWS-side configuration (route tables, security groups, network ACLs, VPN Phase 2 proxy ID selectors) must be built around the post-NAT address, not the internal LAN address. This mistake caused millions of connection timeouts before I identified the root cause. The fastest way to avoid it: ask "what IP address does AWS actually see when traffic leaves your environment?" before touching any configuration.&lt;br&gt;
Before starting any DMS task, I ran AWS Reachability Analyzer to validate end-to-end connectivity from each DMS replication instance to its source endpoint. This caught a missing route table entry that would have caused a task failure mid-window. I now treat this as a mandatory pre-migration gate.&lt;br&gt;
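&lt;/p&gt;

&lt;p&gt;A minimal sketch of that gate with boto3; the ENI and Transit Gateway attachment IDs, port, and region are placeholders, and the exact source and destination resources depend on how the path toward the on-premises endpoint is modeled in your account:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pre-migration connectivity gate with VPC Reachability Analyzer.
# ENI / Transit Gateway attachment IDs, region, and port are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

path = ec2.create_network_insights_path(
    Source="eni-0aaaaaaaaaaaaaaaa",              # DMS replication instance ENI
    Destination="tgw-attach-0bbbbbbbbbbbbbbbb",  # VPN side of the Transit Gateway
    Protocol="tcp",
    DestinationPort=1433,                        # SQL Server listener
)["NetworkInsightsPath"]

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPathId"]
)["NetworkInsightsAnalysis"]

# The analysis is asynchronous; poll until Status leaves "running".
result = ec2.describe_network_insights_analyses(
    NetworkInsightsAnalysisIds=[analysis["NetworkInsightsAnalysisId"]]
)["NetworkInsightsAnalyses"][0]

# NetworkPathFound is False when a route table entry, security group rule,
# or NACL is missing along the way.
print(result["Status"], result.get("NetworkPathFound"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;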
Schema Conversion with AWS SCT&lt;br&gt;
I ran AWS Schema Conversion Tool on a Windows EC2 instance inside the VPC — giving it direct connectivity to Aurora through the VPC network and to SQL Server through the VPN tunnel. Running SCT on a local laptop introduces latency variability that causes timeouts on large schema assessments.&lt;br&gt;
Credentials were stored in AWS Secrets Manager and accessed via IAM role — never stored in configuration files. This is a SOC 2 control requirement, not just a best practice.&lt;br&gt;
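&lt;/p&gt;

&lt;p&gt;In practice that looks roughly like the following; the secret name and region are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Fetch endpoint credentials from Secrets Manager via the instance IAM role,
# so nothing is persisted in SCT or DMS configuration files.
# The secret name and region are placeholders.
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

def endpoint_credentials(secret_id):
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])  # e.g. {"username": ..., "password": ...}

aurora_creds = endpoint_credentials("prod/aurora/migration-user")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;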
Two transformation rules were configured before assessment:&lt;/p&gt;

&lt;p&gt;Database remapping rule for naming convention differences&lt;br&gt;
Drop-schema rule to remove the SQL Server dbo prefix from all migrated objects&lt;/p&gt;

&lt;p&gt;Every incompatibility was resolved before a single row of data moved. Starting the full load before schema validation is complete is a common mistake with expensive consequences.&lt;br&gt;
The Core Architectural Decision: Parallel DMS Instance Isolation&lt;br&gt;
This was the most important design decision in the migration.&lt;br&gt;
A single DMS replication instance handling both the billion-row table and everything else creates resource contention. The billion-row table's CDC competes with hundreds of other tables for memory, CPU, and network bandwidth. Under peak transaction volume, that contention manifests as lag accumulation across the entire pipeline — and lag on a billion-row table takes the longest to clear.&lt;br&gt;
My solution: complete workload isolation.&lt;/p&gt;

&lt;p&gt;Instance 1 — dedicated exclusively to CDC replication for the single billion-row table. Nothing else ran on this instance.&lt;br&gt;
Instance 2 — handled full load and then CDC for all remaining tables.&lt;/p&gt;
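&lt;p&gt;One way to enforce this isolation in DMS is disjoint table mappings assigned to separate replication instances. The sketch below is illustrative (schema, table name, and ARNs are placeholders), not the production task definitions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Disjoint table mappings keep the billion-row table on its own instance.
# Schema, table name, and ARNs are placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

big_table_only = {"rules": [{
    "rule-type": "selection", "rule-id": "1", "rule-name": "big-table-only",
    "object-locator": {"schema-name": "core", "table-name": "transactions"},
    "rule-action": "include",
}]}

everything_else = {"rules": [
    {"rule-type": "selection", "rule-id": "1", "rule-name": "include-all",
     "object-locator": {"schema-name": "core", "table-name": "%"},
     "rule-action": "include"},
    {"rule-type": "selection", "rule-id": "2", "rule-name": "exclude-big-table",
     "object-locator": {"schema-name": "core", "table-name": "transactions"},
     "rule-action": "exclude"},
]}

# Instance 1: CDC only, nothing but the billion-row table.
dms.create_replication_task(
    ReplicationTaskIdentifier="cdc-transactions-only",
    SourceEndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111111111111:rep:INSTANCE1",
    MigrationType="cdc",
    TableMappings=json.dumps(big_table_only),
)
# Instance 2 runs a second task with MigrationType="full-load-and-cdc"
# and TableMappings=json.dumps(everything_else).
&lt;/code&gt;&lt;/pre&gt;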

&lt;p&gt;Both instances ran on the latest available DMS instance type with high-memory configuration. Standard sizing guidance does not account for sustained 14-day CDC workloads in live financial environments. The newer instance generation provided lower baseline CPU utilization under CDC load, more memory for the transaction log decoder, and better network throughput — all of which directly improved the predictive monitor's effectiveness by providing more headroom before threshold triggers.&lt;br&gt;
LOB settings required per-table tuning. Tables with large text columns used Full LOB mode. Tables without LOB columns used Limited LOB mode with appropriate size limits. Mixing these without table-level configuration would have degraded throughput across the entire non-LOB majority of the table estate.&lt;br&gt;
The Foreign Key Pre-Assessment Fix&lt;br&gt;
The DMS pre-assessment failed on the first run — foreign key constraint violations because DMS loads tables in parallel and does not guarantee parent tables are loaded before child table inserts begin.&lt;br&gt;
Fix: add &lt;code&gt;initstmt=set foreign_key_checks=0&lt;/code&gt; to the Aurora target endpoint extra connection attributes. This disables foreign key enforcement for the DMS session only — it does not affect any other connections to Aurora. Post-load referential integrity validation then confirms consistency was achieved through the migration process rather than enforced during loading.&lt;br&gt;
In a SOC 2 environment: document this in the change control request and retain validation script output as audit evidence.&lt;br&gt;
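&lt;/p&gt;

&lt;p&gt;Applied through the endpoint definition, that looks roughly like this (the endpoint ARN is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Apply the extra connection attribute on the Aurora target endpoint only;
# it affects DMS sessions, not other connections. The ARN is a placeholder.
import boto3

dms = boto3.client("dms", region_name="us-east-1")

dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:AURORA-TARGET",
    ExtraConnectionAttributes="initstmt=SET FOREIGN_KEY_CHECKS=0",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;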
DMS-PredictLagNet: Predictive CDC Lag Monitoring&lt;br&gt;
The standard reactive approach — CloudWatch alarm fires when lag exceeds a threshold — is insufficient in a live financial environment for two reasons. By the time an alarm fires, the backlog may already require hours to clear. And financial transaction volume is non-linear: payroll processing, end-of-day settlement, and batch jobs create predictable but sharp spikes that static thresholds do not adapt to.&lt;br&gt;
I built a predictive monitoring system using Holt-Winters triple exponential smoothing trained on 90 days of source transaction volume patterns.&lt;br&gt;
The model captures three components:&lt;/p&gt;

&lt;p&gt;Level — baseline transaction rate&lt;br&gt;
Trend — directional change over time&lt;br&gt;
Seasonality — recurring patterns (daily and weekly cycles)&lt;/p&gt;

&lt;p&gt;The seasonal period was set to m=168 (hourly observations over a 7-day weekly cycle) — the dominant periodicity in credit union banking, driven by business-day versus weekend patterns and weekly payroll cycles.&lt;br&gt;
Rather than forecasting lag directly, I predicted transaction volume 30 minutes ahead and translated the forecast into predicted lag via an empirically calibrated throughput model for the specific DMS instance sizes in use. This two-stage approach produced more reliable results because CDC lag is affected by DMS internal buffer state that is not observable from CloudWatch metrics alone.&lt;br&gt;
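&lt;/p&gt;

&lt;p&gt;A simplified sketch of the forecasting step using statsmodels; the file and column names, the one-step (hourly) horizon, and the apply-rate constant standing in for the calibrated throughput model are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Forecasting sketch: Holt-Winters on hourly transaction volume with a
# weekly season (m=168), then a calibrated throughput model translating
# predicted volume into predicted CDC lag. Inputs and constants are placeholders.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

history = pd.read_csv("txn_volume_hourly.csv", index_col="hour", parse_dates=True)

model = ExponentialSmoothing(
    history["txn_count"],
    trend="add",
    seasonal="add",
    seasonal_periods=168,   # 24 hours x 7 days
).fit()

predicted_volume = float(model.forecast(1).iloc[0])   # rows per hour, next step

# Sustained DMS apply rate (rows/hour) measured for the current instance size;
# volume above this rate accrues lag.
DMS_APPLY_RATE = 4_500_000
excess = max(0.0, predicted_volume - DMS_APPLY_RATE)
predicted_lag_seconds = excess / DMS_APPLY_RATE * 3600
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;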
The autonomous scaling response operated on two tiers:&lt;br&gt;
When forecast indicated predicted lag would reach 60% of critical threshold within 30 minutes → AWS Lambda triggered DMS instance scale-up automatically.&lt;br&gt;
When forecast indicated 85% of critical threshold → AWS Systems Manager automation executed emergency scale-up to maximum pre-approved instance size and paged the on-call engineer via PagerDuty.&lt;br&gt;
All automated actions wrote to the S3 audit log before execution — satisfying SOC 2 requirements for immutable evidence of automated control actions.&lt;br&gt;
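&lt;/p&gt;

&lt;p&gt;Condensed into a single Lambda handler, the two tiers look roughly like this; the thresholds mirror the 60% and 85% triggers above, while the ARNs, bucket, SSM document, and instance class are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Two-tier autonomous response as a Lambda handler. Thresholds mirror the
# 60% / 85% triggers; ARNs, bucket, SSM document, and instance class are
# placeholders, and PagerDuty paging is wired into the SSM document.
import json
from datetime import datetime, timezone
import boto3

dms = boto3.client("dms")
ssm = boto3.client("ssm")
s3 = boto3.client("s3")

CRITICAL_LAG_SECONDS = 900
INSTANCE_ARN = "arn:aws:dms:us-east-1:111111111111:rep:INSTANCE1"

def audit(action, predicted_lag):
    # Write the audit record to S3 BEFORE acting (immutable SOC 2 evidence).
    record = {
        "action": action,
        "predicted_lag_seconds": predicted_lag,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    s3.put_object(
        Bucket="migration-audit-log",
        Key="dms-scaling/" + record["timestamp"] + ".json",
        Body=json.dumps(record),
    )

def handler(event, context):
    predicted_lag = event["predicted_lag_seconds"]
    ratio = predicted_lag / CRITICAL_LAG_SECONDS
    if ratio &amp;gt;= 0.85:
        audit("emergency-scale-up", predicted_lag)
        ssm.start_automation_execution(DocumentName="EmergencyDmsScaleUp")
    elif ratio &amp;gt;= 0.60:
        audit("scale-up", predicted_lag)
        dms.modify_replication_instance(
            ReplicationInstanceArn=INSTANCE_ARN,
            ReplicationInstanceClass="dms.r6i.8xlarge",
            ApplyImmediately=True,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;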
Results&lt;br&gt;
Across the 14-day CDC replication window:&lt;/p&gt;

&lt;p&gt;7 high-risk lag events identified by the predictive monitor&lt;br&gt;
5 resolved autonomously by Lambda-triggered scale-up — no human intervention&lt;br&gt;
2 required engineer engagement (one unscheduled batch job outside training distribution, one DMS task restart requiring SOC 2 change authorization)&lt;br&gt;
Zero engineer pages for predictable, pattern-driven lag events&lt;/p&gt;

&lt;p&gt;Post-migration outcomes:&lt;/p&gt;

&lt;p&gt;Zero data loss across all tables&lt;br&gt;
Cutover window met&lt;br&gt;
41% query performance improvement on Aurora within 48 hours post-cutover&lt;/p&gt;

&lt;p&gt;Post-CDC Validation Before Cutover&lt;br&gt;
Three-level validation executed across all tables before cutover authorization:&lt;/p&gt;

&lt;p&gt;Row count parity — exact match between source and Aurora at validation timestamp&lt;br&gt;
Checksum validation — hash comparison over critical column sets to detect corruption that row counts alone would not reveal&lt;br&gt;
Referential integrity validation — all foreign key relationships confirmed satisfied in Aurora&lt;/p&gt;
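&lt;p&gt;For the checksum level, one lightweight pattern is an order-independent aggregate hash over the critical column set on both sides. The sketch below builds the Aurora-side query; table and column names are placeholders, and the SQL Server side would use its own hash functions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Checksum-level validation sketch: an order-independent aggregate hash over
# the critical column set. Table and column names are placeholders; this is
# the Aurora MySQL side only.
CRITICAL_COLUMNS = {"core.transactions": ["txn_id", "amount", "posted_at"]}

def aurora_checksum_sql(table, columns):
    col_list = ", ".join(columns)
    return (
        "SELECT COUNT(*) AS row_count, "
        "SUM(CRC32(CONCAT_WS('|', " + col_list + "))) AS checksum_value "
        "FROM " + table
    )

def compare(table, source_result, target_result):
    """source_result / target_result are (row_count, checksum_value) tuples."""
    if source_result != target_result:
        raise AssertionError(
            "validation mismatch on " + table + ": "
            + repr(source_result) + " vs " + repr(target_result)
        )

print(aurora_checksum_sql("core.transactions", CRITICAL_COLUMNS["core.transactions"]))
&lt;/code&gt;&lt;/pre&gt;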

&lt;p&gt;Two tables had minor row count discrepancies on first run — both traced to in-flight transactions committed in the milliseconds between source and target count queries. Rerunning during a low-transaction period confirmed equivalence. Run validation during known low-traffic windows, not during peak processing.&lt;br&gt;
The 14-Day CDC Window&lt;br&gt;
The 14-day validation period served three purposes simultaneously:&lt;/p&gt;

&lt;p&gt;Application teams ran full regression testing against Aurora using real production data&lt;br&gt;
The CDC pipeline's behavior was observed across a complete two-week transaction cycle including payroll, weekends, and month-end batch&lt;br&gt;
Validation scripts were executed and verified before the cutover decision was made&lt;/p&gt;

&lt;p&gt;Key Takeaways for Engineers Planning Similar Migrations&lt;br&gt;
Ask the right network question first. What IP address does AWS actually see when traffic leaves your environment? Build everything around the post-NAT address.&lt;br&gt;
Run Reachability Analyzer before any DMS task starts. The cost is negligible. The cost of discovering a routing gap after migration tasks have started is not.&lt;br&gt;
Isolate your highest-volume table CDC on a dedicated instance. Do not let it compete for resources with your bulk load.&lt;br&gt;
Validate content, not just row counts. Checksum validation caught LOB truncation that row count checks would have missed entirely.&lt;br&gt;
Pre-assessment is not optional in regulated environments. Discovering the foreign_key_checks issue after a full load has started on a billion-row table is not recoverable within an eight-hour window.&lt;br&gt;
Predictive monitoring is not about preventing every lag event. It is about converting unpredictable events into manageable ones — autonomous handling of known patterns, human escalation for genuinely novel ones.&lt;br&gt;
The full framework — including the Holt-Winters forecasting methodology, parallel DMS partition design, and SOC 2 audit trail architecture — is written up as peer-reviewed research for the SRE and cloud engineering community. Migration patterns like this should be documented, not just passed around as tribal knowledge.&lt;br&gt;
What's the hardest part of large database migrations for your team — data volume, CDC lag management, cutover coordination, or post-migration validation?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieutspzuwckoi2fvjze3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieutspzuwckoi2fvjze3.jpg" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;br&gt;
I also shared a high-level architecture overview of this migration on LinkedIn — you can find it here &lt;a href="https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>database</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
