aarhamforensics
Posted on • Originally published at twarx.com

AI Technology Governance: Why Amazon Says Human-in-the-Loop Fails

Originally published at twarx.com - read the full interactive version there.

Last Updated: June 20, 2026

The Register just published the most contrarian AI technology governance take of 2026: Amazon's own security VP says putting a human in the loop makes your AI less safe, not more. Most AI workflows are solving the wrong problem entirely. If you ship agents, this single reframing changes how you should architect oversight starting today.

This matters right now because every enterprise shipping agents, whether through LangGraph, Anthropic's MCP, AWS Bedrock, or CrewAI, has been sold the same lie: bolt a human onto the approval step and you're covered. Amazon, Google, Microsoft and IBM are now publicly abandoning that model in the same week. All four. Simultaneously.

By the end of this piece you'll understand the systems failure underneath it — what I call the AI Coordination Gap — and exactly when human review helps versus when it actively destroys reliability. For broader context, see our coverage of AI governance trends.

Figure: Eric Brandwine, distinguished engineer and VP at Amazon Security, argues humans are non-deterministic too, the core of the AI Coordination Gap framing. Source: The Register

Overview: What Amazon Actually Said

On Saturday, June 20, 2026, at 15:25 UTC, The Register's Jessica Lyons published an interview with Eric Brandwine, distinguished engineer and VP at Amazon Security, under a blunt headline: 'Why Amazon hates human-in-the-loop AI governance.'

The argument is deceptively simple. Brandwine told The Register that we're 'a little bit precious about humans' — we assume we're consistent, disciplined, reliable. We're not. 'When you actually get down to it, humans are not terribly consistent,' he said. 'Humans, like AI agents and systems, are non-deterministic.' Neither a person nor a large language model is guaranteed to produce the same output given the same input twice. Both make mistakes. Both confabulate. This isn't a hot take — it's just true, and the industry spent years not saying it out loud.

The difference, Brandwine argues, is experience: 'We know how humans fail. We're comfortable with it. So human-in-the-loop isn't necessarily the gold standard.' That single sentence is the most quietly radical statement a hyperscaler security executive has made about AI technology governance this year.

For years, vendors sold human-in-the-loop as the universal safety patch. The advice got louder with modern LLMs and hit 'a fever pitch when enterprises started deploying agents into their IT environments,' per The Register. Now the largest cloud providers on earth are reversing course in unison. Google Cloud COO Francis deSouza described a migration from 'human-led' to 'human-in-the-loop' to 'AI-led defense overseen by humans.' Microsoft CEO Satya Nadella argued for 'loop learning' instead of step-by-step human checks. IBM executives called for 'human accountability — not humans in the loop.'

This article reframes that news through a systems lens. The real story isn't that humans are bad reviewers — it's that we keep inserting humans at the wrong layer of the agent stack, where they degrade under repetition, while leaving the actual coordination problem completely unsolved. That's the gap.

Coined Framework

The AI Coordination Gap

The AI Coordination Gap is the structural distance between where governance is placed (a human approving individual agent actions) and where reliability is actually determined (the orchestration layer that sequences, validates, and reconciles agent outputs). Most failures live in the gap — and no amount of human approval clicks closes it.

A six-step agent pipeline where each step is 97% reliable is only about 83% reliable end-to-end (0.97^6 ≈ 0.833, assuming the steps fail independently). Adding a human who clicks 'approve' 200 times a day doesn't fix the math; it just adds a seventh non-deterministic node.
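The arithmetic behind that 83% figure is worth running once. This is a minimal sketch that assumes every step fails independently (real pipelines often have correlated failures, so treat the result as an approximation). The `chain_reliability` name and the 95% reviewer figure are illustrative assumptions, and modeling the reviewer as one more node in series simply follows the framing above.

```python
import math

def chain_reliability(step_reliabilities):
    """End-to-end success probability, assuming steps fail independently."""
    return math.prod(step_reliabilities)

six_steps = [0.97] * 6
print(f"6 steps at 97%: {chain_reliability(six_steps):.1%}")

# Hypothetical: a fatigued reviewer treated as a seventh fallible node in series.
with_reviewer = six_steps + [0.95]
print(f"plus a 95% reviewer: {chain_reliability(with_reviewer):.1%}")
```

Under these assumptions the two lines print 83.3% and 79.1%. Swap in your own per-step numbers to see how quickly long chains erode.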

What Is It: Human-in-the-Loop and Why Amazon Is Walking Away

Let's define the thing plainly, because a small-business owner reading this deserves clarity, not a glossary.

Human-in-the-loop (HITL) is a control design where an AI system pauses before taking a consequential action and waits for a person to approve or reject it. Think of an AI agent that drafts a refund, then waits for a support manager to click 'send.' The human is the safety gate.

The intuition is comforting: a person catches the AI's mistakes. The reality, according to Brandwine, is that the person stops catching them. He invokes a concept he first presented at AWS re:Invent in 2017: normalization of deviance — a term coined by sociologist Diane Vaughan in her study of the Challenger disaster. It's the slow erosion of discipline that happens when shortcuts and skipped procedures produce no immediate catastrophe, so they quietly become the norm. I've watched this happen in production. It's not dramatic. It's just Tuesday, repeated until something expensive breaks.

His example is harrowing. In emergency rooms, machines beep constantly. 'Your first day on the job, you jump every single time one of the alarms beeps — but the patient is fine. It's a spurious alarm... over time, after enough of these false alarms... your discipline slips, and you stop responding. And eventually some tragic outcome occurs.' The Register notes this alarm-fatigue phenomenon is documented among healthcare workers, firefighters, and even Army pilots.

Now map it to agents. 'If you put a human inside of this tight loop, and ask them to make approval decisions for agentic tools repeatedly, time after time, they'll do a good job,' Brandwine said. 'And then they'll do an okay job. And pretty quickly they'll be doing a poor job.' That's why, in his words, 'we're not huge fans of human-in-the-loop' at Amazon. 'It's something that you should use judiciously, where you absolutely need it. But it's not something that you can do at high velocity.'

The killer insight: HITL doesn't fail because humans are dumb. It fails because approval is a repetitive cognitive task, and repetitive cognitive tasks are exactly where human reliability decays fastest. You're using your most expensive, least scalable resource for its single worst use case.

  • 83%: End-to-end reliability of a 6-step pipeline where each step is 97% reliable (0.97^6). Source: [Compounding error in agent chains, arXiv](https://arxiv.org/abs/2210.03629)

  • 2017: Year Brandwine first presented 'normalization of deviance' at AWS re:Invent. Source: [The Register, 2026](https://www.theregister.com/security/2026/06/20/why-amazon-hates-human-in-the-loop-ai-governance/5258639)

  • 4: Hyperscalers (Amazon, Google, Microsoft, IBM) publicly rethinking HITL in one week. Source: [The Register, 2026](https://www.theregister.com/security/2026/06/20/why-amazon-hates-human-in-the-loop-ai-governance/5258639)

How It Works: The Mechanism Behind the Coordination Gap

To understand why Amazon's position is correct from a systems standpoint, you have to see where governance gets placed versus where reliability is actually determined.

In a naive agent deployment, the architecture looks like a straight line: the agent reasons, calls a tool, a human approves, the tool executes. The human sits between reasoning and action. That feels safe. But the human is a single non-deterministic checkpoint reviewing the output of another non-deterministic system — and crucially, they review it in isolation, with no visibility into the upstream context that produced the action or the downstream consequences it's about to trigger. I'd argue this is architecturally worse than having no checkpoint at all, because it creates the illusion of safety without the substance.

The Broken Loop: Where HITL Actually Sits

  1. **LLM Reasoning (Claude / GPT)**: Agent plans an action from a prompt + retrieved context. Non-deterministic output: the same input may yield different plans.

  2. **Tool Call Proposal (MCP)**: Agent proposes a concrete action via Model Context Protocol, e.g. 'issue $400 refund', 'delete S3 bucket'.

  3. **Human Approval Gate (the HITL node)**: A person clicks approve/reject for the 200th time today. Discipline decays (normalization of deviance). Reviews in isolation, no upstream context.

  4. **Tool Execution**: Action fires irreversibly. Any error is now in production. The 'gap' between approval placement and reliability determination is fully exposed.

The human sits at step 3 — a single fatiguing checkpoint reviewing one decision at a time — while reliability is actually decided by how steps 1, 2 and 4 are coordinated.

The fix isn't to remove humans. It's to move governance out of the per-action approval slot and into the orchestration layer — where decisions can be validated deterministically, reconciled against policy, logged for accountability, and escalated to humans only on genuine exceptions. That's the difference between Nadella's 'loop learning' and a tired analyst rubber-stamping refunds. See how this plays out in our agent architecture deep dive.

Coined Framework

The AI Coordination Gap — Layer View

Closing the gap means replacing one human checkpoint with four coordination layers: deterministic guardrails, policy reconciliation, outcome evaluation, and exception-only human accountability. The human stops being a gate and becomes a backstop.

The Coordination Layer: Amazon's Implied Alternative

  1. **Deterministic Guardrails**: Hard limits enforced in code, not judgment. 'No refund over $X', 'no destructive AWS calls without a tag'. Fires every time, never fatigues.

  2. **Policy Reconciliation (Orchestrator)**: LangGraph / AWS Bedrock orchestrator checks the proposed action against organizational policy and prior traces before it can execute.

  3. **Outcome Evaluation (Private Evals)**: Nadella's 'private evals': measure whether the agent is improving against business outcomes, on real internal traces, not public benchmarks.

  4. **Exception-Only Human Accountability**: IBM's model: a named human owns the system and reviews only flagged anomalies, meaning high-signal decisions, not 200 routine clicks.

Governance moves from a single fatiguing gate to four layers; the human reviews high-signal exceptions, preserving discipline where it matters.

Figure: The before/after of closing the AI Coordination Gap, with one fatiguing human approval gate replaced by four-layer orchestration governance: deterministic guardrails plus exception-only human accountability.

What It Means for Small Businesses

If you run a 10-person company and you've deployed an AI agent for customer support, invoicing, or scheduling, this news is directly actionable — and it can save you real money.

The opportunity: most small businesses use HITL by default because it feels responsible. But every approval click costs an employee's attention. If your support lead spends 90 minutes a day approving AI-drafted responses, that's roughly 30 hours a month — at a $40/hour loaded cost, that's $1,200/month of expensive human attention spent on a task they're getting progressively worse at.

The reframe: keep humans on the 2% of actions that are genuinely irreversible or high-value — refunds over $500, contract changes, anything that touches legal — and let deterministic rules plus orchestration handle the routine 98%. A plumbing company using an agent to confirm appointments doesn't need a human approving every text. It needs a hard rule that the agent can never double-book and an alert when something looks off. That's it.

You're not paying a human to catch AI mistakes. You're paying them to stay alert for the rare mistake — and the science says they can't, if you make them approve the routine ones all day.

The risk: blindly removing humans without building the coordination layers first. The Coordination Gap cuts both ways. Yank the human out without adding deterministic guardrails and exception alerts and you've removed the only safety net you had. Learn more in our guide to enterprise AI governance and workflow automation for lean teams.

Who Are Its Prime Users

The organizations this reframing benefits most:

  • Security operations teams (SOCs) — exactly the audience Brandwine and Google's deSouza are addressing. Routine triage at 'machine pace' with human oversight on escalations only.

  • High-volume customer support orgs — where approval fatigue hits hardest and the sheer volume makes per-action HITL economically absurd.

  • Fintech and payments — where deterministic guardrails like transaction limits and fraud rules already exist and slot cleanly into the coordination layer.

  • DevOps / platform teams deploying agents that touch infrastructure — see our piece on AI agents in production environments.

  • Mid-market companies (50–500 employees) scaling agents beyond pilots, where HITL stops being feasible almost immediately.

Roles that benefit most: senior engineers, AI leads, heads of security, and ops managers who own the reliability number. If your job is to make agents trustworthy at scale, this is your problem to solve — and nobody's coming to solve it for you. You can browse battle-tested starting points in our AI agent library.

When to Use It (and When Not To)

Brandwine's own framing is the rule: use human-in-the-loop 'judiciously, where you absolutely need it.' Here's the concrete mapping.

Keep the human in the loop when:

  • The action is irreversible and high-consequence — wiring money, deleting production data, terminating an account.

  • Decision frequency is low — a handful of times a day, so discipline doesn't have room to decay.

  • Regulation demands a named human approver — certain medical, legal, or financial actions where there's no way around it.

Remove the per-action human (and use coordination layers) when:

  • Volume is high — hundreds of decisions per shift. This is where normalization of deviance guarantees failure, full stop.

  • Actions are reversible or bounded — a draft, a low-value refund, a calendar hold.

  • Deterministic rules can express the constraint better than human judgment can.

The decision rule in one line: frequency × reversibility. High-frequency + reversible = automate with guardrails. Low-frequency + irreversible = keep the human. Everything in between needs the orchestration layer to decide which bucket a given action falls into at runtime.
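That rule can be written down as a small routing function. This is a sketch only: the `Route` names, the `route_action` function, and the cutoff of 20 decisions per day are assumptions made for illustration, so pick thresholds from your own volume data.

```python
from enum import Enum

class Route(Enum):
    AUTOMATE_WITH_GUARDRAILS = "automate_with_guardrails"
    KEEP_HUMAN = "keep_human"
    DECIDE_AT_RUNTIME = "decide_at_runtime"

HIGH_FREQUENCY_PER_DAY = 20  # assumed cutoff, not a figure from the article

def route_action(decisions_per_day: int, reversible: bool) -> Route:
    """Map an action class to a governance bucket by frequency and reversibility."""
    high_frequency = decisions_per_day >= HIGH_FREQUENCY_PER_DAY
    if high_frequency and reversible:
        return Route.AUTOMATE_WITH_GUARDRAILS
    if not high_frequency and not reversible:
        return Route.KEEP_HUMAN
    # Mixed cases: let the orchestration layer inspect the specific action.
    return Route.DECIDE_AT_RUNTIME

print(route_action(300, reversible=True))    # appointment confirmations
print(route_action(3, reversible=False))     # wire transfers
print(route_action(300, reversible=False))   # needs a per-action policy check
```

The third case is the interesting one: high-volume but irreversible actions are where the per-action limits in the worked demonstration earn their keep.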

Head-to-Head: Four Governance Models Compared

| Model | Who Champions It | Human's Role | Best For | Failure Mode |
| --- | --- | --- | --- | --- |
| Human-in-the-loop (per action) | Legacy vendor advice | Approves every action | Low-volume, irreversible decisions | Normalization of deviance: discipline decays |
| AI-led, human-overseen | Google Cloud (deSouza) | Oversees a fleet, handles escalations | High-velocity SOC / cyber defense | Poorly tuned escalation = missed real threats |
| Loop learning | Microsoft (Nadella) | Designs evals, RL environments | Workflows that improve with each use | Bad private evals optimize the wrong outcome |
| Human accountability | IBM execs | Owns outcomes, not clicks | Regulated, auditable deployments | Accountability without authority = theater |

Notice the convergence: all four move the human away from the per-action gate and toward a higher-leverage position — oversight, eval design, or accountability. That's the Coordination Gap being closed from four different directions in a single week. When Amazon, Google, Microsoft, and IBM land on the same architectural conclusion independently, that's not a trend. That's a correction.

How to Use It: A Worked Demonstration

Let's build the alternative concretely. Below is a minimal pattern, written in plain Python so it can sit inside a LangGraph node or any other orchestrator, that replaces a per-action human gate with deterministic guardrails plus exception-only escalation. Treat it as a starting sketch rather than finished production code; you can build the same flow visually in n8n if you prefer no-code. For ready-made building blocks, explore our AI agent library.

Sample input: An AI support agent proposes: issue_refund(order_id='A-9921', amount=420.00, reason='late delivery')

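Python: coordination layer for an orchestrator node (illustrative; the helper functions are stand-ins)

    # Replace per-action HITL with deterministic guardrails + exception escalation
    from dataclasses import dataclass, field

    REFUND_AUTO_LIMIT = 250.00      # deterministic policy, never fatigues
    REFUND_HARD_CEILING = 1000.00

    @dataclass
    class Action:
        type: str
        order_id: str
        amount: float
        reason: str = ''
        full_context: dict = field(default_factory=dict)

    # Stand-in helpers so the sketch runs; replace with your real integrations.
    def reject(message):
        return {'decision': 'reject', 'reason': message}

    def execute(action):
        return {'decision': 'execute', 'order_id': action.order_id}

    def log_trace(action, decision):
        print(f'TRACE {action.order_id}: {decision}')

    def escalate_to_human(action, context, reason):
        return {'decision': 'escalate_to_human', 'reason': reason, 'context': context}

    def coordinate_action(action):
        if action.type != 'issue_refund':
            # Unknown action types never execute silently.
            return escalate_to_human(action, action.full_context, 'Unrecognized action type')
        # Layer 1: deterministic guardrail
        if action.amount > REFUND_HARD_CEILING:
            return reject('Exceeds hard ceiling, auto-blocked')
        # Layer 2: policy reconciliation
        if action.amount <= REFUND_AUTO_LIMIT:
            log_trace(action, decision='auto-approved')
            return execute(action)  # 98% path, no human
        # Layer 4: exception-only human accountability (2% path, high-signal)
        return escalate_to_human(action, context=action.full_context,
                                 reason='Refund $250-$1000 requires named approver')

    # Input: issue_refund(order_id='A-9921', amount=420.00, reason='late delivery')
    refund_action = Action('issue_refund', 'A-9921', 420.00, 'late delivery')
    result = coordinate_action(refund_action)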

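Illustrative result for the $420 refund, as a reviewer's queue might display it (the attached-context and queue-depth fields are assumed to come from your escalation tooling, not from the snippet itself):

    DECISION: escalate_to_human
    REASON: Refund $250-$1000 requires named approver
    CONTEXT_ATTACHED: order history, prior refunds, customer LTV
    HUMAN_QUEUE_DEPTH_TODAY: 3   # not 200, so discipline is preserved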

The human still sees this $420 decision — because it's genuinely in the judgment zone — but they see only 3 of these today, each with full context attached, instead of clicking 'approve' 200 times on autopilot. That is the entire difference between a fatigued rubber-stamp and meaningful oversight. Three decisions with context versus two hundred decisions in isolation. For deeper orchestration patterns see our multi-agent systems and orchestration guides, and browse our AI agent library for templates.

Figure: The worked demonstration in action. Routine refunds auto-execute under deterministic limits while only judgment-zone amounts escalate to a human, preserving the discipline HITL destroys.

Good Practices and Common Pitfalls

❌ Mistake: Treating HITL as a universal safety patch

Bolting a human approval step onto every agent action because it 'feels safe.' At high velocity this triggers normalization of deviance: the reviewer's discipline decays until they're approving everything, including the one action that actually mattered.

Fix: Reserve human approval for low-frequency, irreversible actions. Use deterministic guardrails in LangGraph or AWS Bedrock for everything else.

❌ Mistake: Removing humans without coordination layers

Reading 'Amazon hates HITL' and ripping out human oversight entirely, leaving a non-deterministic agent firing irreversible actions with no guardrails. You've widened the Coordination Gap, not closed it. I've seen this exact mistake made by teams who read the headline and skipped the architecture.

Fix: Add deterministic limits + policy reconciliation + exception escalation BEFORE removing any human gate. Humans become backstops, never absent.

❌ Mistake: Reviewing decisions without context

Showing a human a bare action ('approve refund $420?') with no upstream reasoning, customer history, or downstream impact attached. They can't make a good call, so they default to approve. Every time.

Fix: Attach full trace context to every escalation. Use MCP to pass the reasoning chain and retrieved evidence alongside the proposed action.
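One way to make 'full trace context' concrete is an escalation record that cannot be built without it. This is a minimal sketch in plain Python: the `Escalation` fields and the `build_escalation` helper are hypothetical names, not part of MCP or any framework, which would only transport this data.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class Escalation:
    proposed_action: str
    reasoning_summary: str   # upstream: why the agent proposed this
    evidence: tuple          # retrieved records that grounded the proposal
    downstream_impact: str   # what happens if the reviewer approves

def build_escalation(proposed_action, reasoning_summary, evidence, downstream_impact):
    """Refuse to queue a bare action with no context attached."""
    required = {
        "reasoning_summary": reasoning_summary,
        "evidence": evidence,
        "downstream_impact": downstream_impact,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"escalation missing context: {', '.join(missing)}")
    return Escalation(proposed_action, reasoning_summary, tuple(evidence), downstream_impact)

ticket = build_escalation(
    proposed_action="issue_refund order A-9921 $420.00",
    reasoning_summary="Customer reports late delivery; carrier data confirms it.",
    evidence=["order A-9921 delivered 6 days late", "no prior refunds"],
    downstream_impact="Refund is sent to the original payment method.",
)
print(json.dumps(asdict(ticket), indent=2))
```

Making the context mandatory at construction time means a reviewer never sees the bare 'approve refund $420?' prompt described above.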

❌ Mistake: Optimizing against public benchmarks

Measuring agent quality on MMLU or public leaderboards instead of your own outcomes. Nadella's explicit warning: external benchmarks don't capture whether the model improves on what matters to YOUR business. A model that scores well on MMLU can still fail your specific workflow in ways the benchmark will never surface.

Fix: Build private evals on real internal traces. Measure business outcomes (resolution rate, error cost), not generic scores.
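A private eval can start as a few lines over your own logged traces. This is a sketch under stated assumptions: the trace fields (`resolved`, `error_cost_usd`) and the four sample records are invented for illustration, so substitute whatever outcome data your system actually logs.

```python
from statistics import mean

# Hypothetical internal traces; in practice these come from your trace store.
traces = [
    {"ticket": "T-1", "resolved": True,  "error_cost_usd": 0.0},
    {"ticket": "T-2", "resolved": True,  "error_cost_usd": 0.0},
    {"ticket": "T-3", "resolved": False, "error_cost_usd": 35.0},
    {"ticket": "T-4", "resolved": True,  "error_cost_usd": 0.0},
]

def evaluate(traces):
    """Score an agent on business outcomes instead of a public benchmark."""
    return {
        "resolution_rate": mean(1.0 if t["resolved"] else 0.0 for t in traces),
        "error_cost_per_ticket": mean(t["error_cost_usd"] for t in traces),
    }

report = evaluate(traces)
print(report)
```

Rerunning the same function after each prompt or model change gives you a trend line on the two numbers that matter to the business.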

Average Expense to Use It

Here's the realistic total cost of ownership for moving from per-action HITL to a coordination layer, for a mid-sized support operation.

  • The hidden cost you're already paying: ~30 hours/month of approval clicking at $40/hour loaded cost = $1,200/month of degrading human attention. This is real money leaving your P&L for work that makes your team worse at their jobs.

  • LLM inference: Using Claude or GPT-class models at roughly $3–$15 per million input tokens; a support agent handling 5,000 conversations/month typically runs $150–$600/month.

  • Orchestration: LangGraph is open-source (free to self-host); LangGraph Platform managed tiers and n8n start around $20–$50/seat/month. Self-hosted n8n is free.

  • Vector DB for RAG context: Pinecone serverless starts free, scaling to $50–$300/month for production workloads.

  • Engineering setup: 1–3 weeks of senior engineer time to build guardrails + escalation logic. One-time cost.

The math that sells this internally: if you move from 200 daily approvals to ~10 escalations, you reclaim roughly $1,000/month of human attention while spending under $700/month on inference + orchestration — and your reliability goes up, because the human is now alert when it counts.
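Those figures are easy to rerun with your own inputs. This is a back-of-envelope sketch: it assumes review time scales linearly with the number of approvals, which is a simplification (context-rich escalations take longer each, which is one reason to round the result down), and the function and variable names are mine.

```python
HOURS_PER_MONTH_ON_APPROVALS = 30    # about 90 minutes a day
LOADED_COST_PER_HOUR = 40.0          # USD
APPROVALS_PER_DAY_BEFORE = 200
ESCALATIONS_PER_DAY_AFTER = 10

def reclaimed_cost(hours, hourly_cost, before, after):
    """Monthly cost of attention freed by shrinking the review queue."""
    current = hours * hourly_cost
    remaining = current * (after / before)
    return current - remaining

saved = reclaimed_cost(
    HOURS_PER_MONTH_ON_APPROVALS,
    LOADED_COST_PER_HOUR,
    APPROVALS_PER_DAY_BEFORE,
    ESCALATIONS_PER_DAY_AFTER,
)
print(f"Reclaimed attention: ${saved:,.0f}/month")
```

With the numbers above this prints $1,140/month, in line with the rough $1,000 estimate.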

Industry Impact: Who Wins and Who Loses

Winners: Orchestration platforms (LangChain/LangGraph, CrewAI, AutoGen), the hyperscalers selling agent infrastructure (AWS Bedrock, Google Cloud, Azure), and eval/observability tooling vendors. Anyone selling the coordination layer just had four of the biggest companies on earth validate their pitch in one week.

Losers: Vendors whose entire AI safety story was 'add a human approval step.' That positioning aged badly overnight. Also at risk: BPO and outsourced review operations whose business model is selling human approval clicks at volume — exactly the work Brandwine says humans do poorly. I wouldn't want to be renewing those contracts right now.

What changes for builders: The new reliability bar isn't 'did a human approve it' — it's 'can you show the deterministic guardrails, the private evals, and the named accountable owner.' If you build agents, your governance story now has to be architectural, not procedural. Read more in our enterprise AI coverage and our breakdown of AI reliability at scale.

The companies winning with AI agents aren't the ones with a human approving every action. They're the ones who moved governance into the orchestration layer — and freed their humans to catch the failures that actually matter.

Reactions: What the Industry Is Saying

This isn't one executive's hot take. It's a convergent industry shift documented by The Register in a single week of June 2026.

  • Eric Brandwine, distinguished engineer & VP, Amazon Security: 'We know how humans fail. We're comfortable with it. So human-in-the-loop isn't necessarily the gold standard.' (The Register)

  • Francis deSouza, COO, Google Cloud: 'We have moved from a human-led defense strategy, to a human-in-the-loop defense strategy, to an AI-led defense strategy that's overseen by humans... an agentic fleet that does a lot of the routine cyber security work at a machine pace.' (stated ahead of Google Cloud Next, April 2026)

  • Satya Nadella, CEO, Microsoft: argued for 'loop learning' on X — 'Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use... Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!).' (via X)

  • IBM executives: called for 'human accountability — not humans in the loop — at all stages of AI development, deployment, and governance.' (IBM Think)

Figure: Four hyperscalers (Amazon, Google, Microsoft, IBM) converging in one week. The AI Coordination Gap is being closed from multiple directions, with oversight, eval design, and accountability replacing the approval gate.

What Happens Next: Predictions

2026 H2: **'Private evals' becomes a product category**

Following Nadella's explicit call for private reinforcement learning environments on real internal traces, expect AWS, Azure, and LangChain to ship managed private-eval tooling. The pieces already exist in LangSmith and Bedrock — this is more about packaging than invention.

2027: **'Human accountability' overtakes 'human-in-the-loop' in compliance language**

IBM's framing maps cleanly to how regulators think about ownership. Expect AI governance frameworks like the NIST AI Risk Management Framework and audit standards to require a named accountable owner rather than per-action sign-off.

2027–2028: **Deterministic guardrails become the default agent safety layer**

As MCP standardizes tool access, the safest place to enforce policy shifts to code-level guardrails on the tool layer — non-fatiguing, auditable, and immune to normalization of deviance.
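A code-level guardrail on the tool layer can be as small as a wrapper that checks policy before the tool body runs. This is a sketch in plain Python, not an MCP API: the `guarded` decorator, `PolicyViolation`, and the `delete_bucket` tool are hypothetical names, and the tag rule echoes the 'no destructive calls without a tag' example from earlier.

```python
import functools

class PolicyViolation(Exception):
    """Raised when a tool call is blocked by a deterministic rule."""

def guarded(check):
    """Wrap a tool so check(**kwargs) must return None before it executes."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(**kwargs):
            problem = check(**kwargs)
            if problem:
                raise PolicyViolation(f"{tool.__name__} blocked: {problem}")
            return tool(**kwargs)
        return wrapper
    return decorator

def require_deletable_tag(bucket, tags):
    if tags.get("deletable") != "true":
        return f"bucket {bucket!r} lacks tag deletable=true"
    return None

@guarded(require_deletable_tag)
def delete_bucket(bucket, tags):
    # Stand-in for a real destructive call.
    return f"deleted {bucket}"

print(delete_bucket(bucket="tmp-exports", tags={"deletable": "true"}))
try:
    delete_bucket(bucket="prod-ledger", tags={})
except PolicyViolation as err:
    print(err)
```

Because the check runs inside the wrapper, it applies no matter which agent or prompt produced the call, and it behaves the same on the first invocation and the ten-thousandth.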

Confirmed facts here are the four executive statements and dates reported by The Register. The predictions above are my analysis, grounded in those statements and current tooling trajectories — clearly labeled as forecast, not fact.

Frequently Asked Questions

What is agentic AI technology?

Agentic AI technology refers to systems built on large language models that can plan multi-step tasks, call external tools, and take actions in the world — not just generate text. Unlike a chatbot that answers a question, an agent might read a ticket, query a database, issue a refund, and send a confirmation autonomously. Frameworks like LangGraph, AutoGen, and CrewAI orchestrate these agents. The governance challenge Amazon's Brandwine describes is specific to agentic AI: because agents take real actions, the question of who approves them — a fatiguing human gate or a deterministic coordination layer — directly determines reliability. Agents are non-deterministic, so they need bounded permissions and exception-based oversight rather than blanket trust or blanket per-action human approval.

How does multi-agent orchestration work?

Multi-agent orchestration coordinates several specialized agents — a planner, a researcher, an executor — through a control layer that routes tasks, passes context, and reconciles outputs. In LangGraph this is modeled as a stateful graph where nodes are agents and edges are conditional transitions. The orchestrator is exactly where governance belongs, per the AI Coordination Gap: it can enforce deterministic guardrails, check actions against policy, log traces, and escalate only genuine exceptions to humans. The key benefit over a single agent is reliability through decomposition — but it also compounds error, since a 6-step chain at 97% per-step reliability lands near 83% end-to-end. That's why orchestration must include validation between steps, not just hand-offs.
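'Validation between steps' can be sketched without any framework: each step's output has to pass a check before the next step sees it. The `run_pipeline` helper and the three toy steps below are illustrative assumptions rather than LangGraph code; in LangGraph the same checks would typically live in nodes or conditional edges.

```python
def run_pipeline(steps, state):
    """Run (name, step, validator) triples, halting at the first failed check."""
    for name, step, is_valid in steps:
        state = step(state)
        if not is_valid(state):
            return {"status": "halted", "failed_step": name, "state": state}
    return {"status": "ok", "state": state}

steps = [
    # Planner picks an action type from an allowed set.
    ("plan", lambda s: {**s, "plan": "refund"},
             lambda s: s["plan"] in {"refund", "reply"}),
    # Researcher looks up the order total.
    ("research", lambda s: {**s, "order_total": 120.0},
                 lambda s: s["order_total"] > 0),
    # Executor proposes a refund; it must not exceed what the customer paid.
    ("execute", lambda s: {**s, "refund": 420.0},
                lambda s: s["refund"] <= s["order_total"]),
]

result = run_pipeline(steps, {"ticket": "T-77"})
print(result["status"], result.get("failed_step"))
```

Here the run halts at the `execute` step because the proposed refund exceeds the order total, which is the kind of error a plain hand-off between agents would pass straight through.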

What companies are using AI agents?

As of mid-2026, the hyperscalers themselves are the largest documented users. Amazon Security deploys agents in its IT environments; Google Cloud, per COO Francis deSouza, runs an 'agentic fleet' for cybersecurity at machine pace; Microsoft embeds agents across its enterprise stack; and IBM deploys them with formal human-accountability governance. Beyond big tech, fintech, customer support, and DevOps organizations are the heaviest adopters because their workflows are high-volume and rule-bound. The common thread among successful deployments is not raw model power — it's that they've moved governance into the orchestration layer rather than relying on per-action human approval, which Amazon explicitly warns degrades under repetition.

What is the difference between RAG and fine-tuning?

RAG (Retrieval-Augmented Generation) injects relevant documents into the model's context at query time, pulling from a vector database like Pinecone. Fine-tuning permanently adjusts the model's weights by training on examples. RAG is better for fast-changing, factual knowledge — you update the database, not the model — and it's cheaper and auditable, since you can see which document grounded an answer. Fine-tuning is better for teaching style, format, or domain-specific behavior that won't change often. In the context of this article, Nadella's 'private reinforcement learning environments' point beyond both: training models on real internal traces to improve against business outcomes. Most production agent stacks use RAG for knowledge and reserve fine-tuning for narrow behavioral tuning.

How do I get started with LangGraph?

Install with pip install langgraph langchain, then define a stateful graph where nodes are functions or agents and edges are transitions. Start with a single agent plus one tool, add a deterministic guardrail node (like the refund-limit check in this article), then a conditional escalation edge for exceptions. The official LangGraph docs have quickstarts, and LangSmith provides tracing so you can see every step — essential for building the private evals Nadella describes. For a no-code alternative, n8n lets you wire the same flow visually. Build the coordination layer first — guardrails and exception escalation — before you let any agent take irreversible actions. You can also start from templates in our agent library.

What are the biggest AI failures to learn from?

The failure pattern Amazon's Brandwine highlights is the most instructive: normalization of deviance, where human reviewers in a tight approval loop progressively stop catching errors — documented among healthcare workers facing alarm fatigue, firefighters, and Army pilots. Applied to AI, the classic failure is deploying agents with per-action human approval at high volume and assuming you're safe, when the reviewer has long since defaulted to clicking 'approve.' Other major failures include compounding error in long agent chains (each step degrades the whole), optimizing against public benchmarks instead of business outcomes, and removing human oversight without adding deterministic guardrails. The lesson across all of them: governance must be architectural and exception-based, not a single fatiguing human gate.

What is MCP in AI?

MCP (Model Context Protocol) is an open standard, introduced by Anthropic, that defines how AI models connect to external tools, data sources, and systems in a consistent, secure way. Think of it as a universal adapter: instead of writing custom integrations for every tool, an agent speaks MCP to access files, databases, or APIs. In the governance context of this article, MCP matters because it's the layer where tool calls are proposed and where deterministic guardrails can live — the safest place to enforce policy is at the protocol boundary, before an action executes. As MCP standardizes, the coordination layer that Amazon, Google, and Microsoft are all gesturing toward becomes far easier to build consistently across an organization's agent fleet.

The takeaway isn't 'fire the humans.' It's that we've spent years placing our most valuable, least scalable resource — human judgment — at the exact point where it degrades fastest, and calling it safety. Amazon, Google, Microsoft, and IBM just said so out loud, in the same week. Close the AI Coordination Gap: deterministic guardrails for the routine, named accountability for the consequential, and humans alert for the moment that actually matters.

About the Author

Rushil Shah

AI Systems Builder & Founder, Twarx

Rushil Shah is the founder of Twarx and an AI systems builder who has spent years designing autonomous workflows, multi-agent architectures, and AI-powered business tools. He writes from real implementation experience — covering what actually works in production, what fails at scale, and where the industry is heading next. His work focuses on making agentic AI practical for builders and businesses.


This article was originally published on Twarx. Follow for daily deep dives on AI agents and automation.
