<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: dnyandeo bharambe</title>
    <description>The latest articles on DEV Community by dnyandeo bharambe (@dnyandeo).</description>
    <link>https://dev.to/dnyandeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957330%2Fc2d5d5df-e458-4f24-b141-328cb2c3e1f0.png</url>
      <title>DEV Community: dnyandeo bharambe</title>
      <link>https://dev.to/dnyandeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dnyandeo"/>
    <language>en</language>
    <item>
      <title>Semantic Layer vs MCP: Why Direct ERP Write Access Is an Enterprise Security Risk</title>
      <dc:creator>dnyandeo bharambe</dc:creator>
      <pubDate>Tue, 16 Jun 2026 23:13:08 +0000</pubDate>
      <link>https://dev.to/dnyandeo/semantic-layer-vs-mcp-why-direct-erp-write-access-is-an-enterprise-security-risk-3po8</link>
      <guid>https://dev.to/dnyandeo/semantic-layer-vs-mcp-why-direct-erp-write-access-is-an-enterprise-security-risk-3po8</guid>
      <description>&lt;p&gt;⚠️ &lt;strong&gt;Failure Mode:&lt;/strong&gt; Most architects celebrating MCP's ability to connect LLMs directly to enterprise systems are skipping a critical question: what happens when the prompt is malicious? One crafted instruction can execute a real write operation against your ERP with no alarm, no gate, and no human review.&lt;/p&gt;




&lt;p&gt;Enterprise teams are giving AI systems write access to their most critical business data — pricing, inventory, financials, vendor records — without a governance layer. When this goes wrong it does not look like a technology failure. It looks like a business crisis. Orders placed at zero price. Payments routed to wrong accounts. Inventory records wiped. And no audit trail to explain how it happened.&lt;/p&gt;

&lt;p&gt;The technology enabling this is MCP (Model Context Protocol) — powerful, fast to implement, and dangerously easy to deploy without the architecture it requires. This post breaks down the boundary between semantic layer and MCP, shows exactly where the attack surface opens, and defines the three governance layers every enterprise AI system needs before going anywhere near write access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The semantic layer is safe — but read-only
&lt;/h2&gt;

&lt;p&gt;A semantic layer sits between your LLM and your data. It exposes a curated, permission-aware model — metrics, dimensions, KPIs — and nothing else. An LLM can query it freely because the worst outcome is reading data it is already permitted to see. Role-based permissions are enforced at the semantic layer. The data model is defined and controlled. The blast radius of any mistake is limited to a bad query result.&lt;/p&gt;

&lt;p&gt;The limitation is fundamental: semantic layers are read-only by design. The moment you need AI to act — update a record, trigger a workflow, modify a price, approve a purchase order — you need write access. That is where MCP enters the picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pe0xj4o2fcqt9qmn46v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pe0xj4o2fcqt9qmn46v.png" alt=" " width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP enables write access — and that changes everything
&lt;/h2&gt;

&lt;p&gt;MCP servers can execute real operations against live enterprise systems. That power is also the risk. An LLM processing a crafted prompt injection can issue a perfectly valid MCP tool call with malicious intent. Without a governance layer, that instruction goes straight to your ERP. No validation. No human review. No audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A prompt is injected into a customer-facing AI assistant&lt;/li&gt;
&lt;li&gt;The injected instruction tells the LLM to set all product catalog prices to $0.00&lt;/li&gt;
&lt;li&gt;The MCP server receives a valid &lt;code&gt;update_product_price&lt;/code&gt; tool call&lt;/li&gt;
&lt;li&gt;The ERP executes it. No alarm fires.&lt;/li&gt;
&lt;li&gt;By the time anyone notices, thousands of orders have been placed at zero price&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi28txcqdycgx2259zwcx.png" alt=" " width="800" height="588"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The three layers every MCP write path requires
&lt;/h2&gt;

&lt;p&gt;Every MCP tool call that touches a write operation in an enterprise system must pass through three governance controls before execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prompt validation layer
&lt;/h3&gt;

&lt;p&gt;Before any write intent reaches the MCP server, validate the instruction independently of the LLM. Check: does this action match the user's stated goal? Is the scope within expected bounds? Do the parameter values fall within acceptable ranges? Flag and block anything outside the expected envelope before it gets further.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔒 &lt;strong&gt;Security Pattern:&lt;/strong&gt; Treat the prompt validation layer as an untrusted-input boundary. Apply the same rigor you would to any external API input — schema validation, value range checks, anomaly detection, and rejection of anything malformed or out-of-scope.&lt;/p&gt;
&lt;/blockquote&gt;
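
&lt;p&gt;A minimal sketch of such a boundary check, written independently of the LLM. The &lt;code&gt;ProposedWrite&lt;/code&gt; structure, the SKU pattern, and the price bounds are all assumptions made for illustration; they do not come from a real ERP or MCP SDK.&lt;/p&gt;

```python
import re
from dataclasses import dataclass

# Hypothetical policy values, chosen only for this example.
ALLOWED_TOOLS = {"update_product_price"}
SKU_PATTERN = r"[A-Z0-9-]{3,20}"
MIN_PRICE = 0.01
MAX_PRICE = 10_000.00


@dataclass(frozen=True)
class ProposedWrite:
    tool: str
    sku: str
    new_price: float


def validate_write(proposal):
    """Return a list of violations; an empty list means the write may proceed."""
    violations = []
    if proposal.tool not in ALLOWED_TOOLS:
        violations.append("tool not allowed")
    if re.fullmatch(SKU_PATTERN, proposal.sku) is None:
        violations.append("malformed sku")
    # Clamp into the allowed range; any difference means the value was outside it.
    clamped = min(max(proposal.new_price, MIN_PRICE), MAX_PRICE)
    if clamped != proposal.new_price:
        violations.append("price out of range")
    return violations
```

&lt;p&gt;With these assumed bounds, the zero-price instruction from the earlier example is stopped here, before it ever reaches the MCP server.&lt;/p&gt;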

&lt;h3&gt;
  
  
  2. MCP server input and output schema enforcement
&lt;/h3&gt;

&lt;p&gt;The MCP server itself must enforce a strict schema on every incoming tool call. Define explicit contracts for each tool: expected parameter types, value ranges, allowed record types, and maximum operation scope. Reject anything outside the contract at the server boundary, before execution. Apply the same validation to output — never return raw ERP data to the LLM without sanitization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔄 &lt;strong&gt;Resilience Pattern:&lt;/strong&gt; Design MCP tool contracts to be narrow and specific. A tool that can only update price for one SKU at a time with a defined min/max range is far safer than a bulk update endpoint. Scope restriction is your first blast-radius limiter.&lt;/p&gt;
&lt;/blockquote&gt;
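
&lt;p&gt;One way to express a narrow tool contract is a small table that the server checks before executing anything. This is a sketch under assumptions: the &lt;code&gt;CONTRACTS&lt;/code&gt; table and &lt;code&gt;check_call&lt;/code&gt; function are hypothetical names, and a real MCP server would usually declare the same constraints in each tool's input schema.&lt;/p&gt;

```python
# Hypothetical contract table: tool name mapped to its allowed parameters.
CONTRACTS = {
    "update_product_price": {
        "sku": {"type": str},
        "new_price": {"type": float, "min": 0.01, "max": 10_000.0},
    }
}


def check_call(tool_name, args):
    """Return (True, "") if the call fits its contract, else (False, reason)."""
    contract = CONTRACTS.get(tool_name)
    if contract is None:
        return False, "unknown tool"
    # Exact key match: no missing parameters and no extras.
    if not isinstance(args, dict) or set(args) != set(contract):
        return False, "parameters do not match contract"
    for name, rule in contract.items():
        value = args[name]
        # type() rather than isinstance() so a list of SKUs or a bool is rejected.
        if type(value) is not rule["type"]:
            return False, name + ": wrong type"
        if "min" in rule:
            clamped = min(max(value, rule["min"]), rule["max"])
            if clamped != value:
                return False, name + ": out of range"
    return True, ""
```

&lt;p&gt;Because the contract accepts a single string SKU, a call that passes a list of SKUs fails the type check, which is the scope restriction described above.&lt;/p&gt;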

&lt;h3&gt;
  
  
  3. HITL governance gate
&lt;/h3&gt;

&lt;p&gt;For any write operation touching critical business data — pricing, financial records, inventory, vendor data, purchase orders — require human approval before execution. This gate is not optional and must not be bypassable by the LLM. The agent submits the proposed operation and waits. A human reviews and approves or rejects. Only on explicit approval does the MCP server execute.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤖 &lt;strong&gt;Agent Pattern:&lt;/strong&gt; Implement HITL as a hard gate in your LangGraph workflow state machine. The agent transitions to a &lt;code&gt;PENDING_APPROVAL&lt;/code&gt; state and cannot proceed to &lt;code&gt;EXECUTING&lt;/code&gt; until the approval signal is received from an authenticated human actor. No LLM reasoning path should be able to bypass this transition.&lt;/p&gt;
&lt;/blockquote&gt;
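
&lt;p&gt;The gate can be sketched as a tiny state machine in plain Python. This is not LangGraph's API; the &lt;code&gt;WriteRequest&lt;/code&gt; class, the &lt;code&gt;PROPOSED&lt;/code&gt; and &lt;code&gt;REJECTED&lt;/code&gt; states, and the approver allowlist are assumptions standing in for a real workflow engine and a real authentication check.&lt;/p&gt;

```python
from enum import Enum


class State(Enum):
    PROPOSED = "PROPOSED"
    PENDING_APPROVAL = "PENDING_APPROVAL"
    EXECUTING = "EXECUTING"
    REJECTED = "REJECTED"


# Hypothetical allowlist standing in for real authentication.
AUTHENTICATED_APPROVERS = {"alice@example.com"}


class WriteRequest:
    def __init__(self, operation):
        self.operation = operation
        self.state = State.PROPOSED
        self.approved_by = None

    def submit(self):
        """Agent-facing call: the only transition the agent can trigger."""
        if self.state is not State.PROPOSED:
            raise RuntimeError("already submitted")
        self.state = State.PENDING_APPROVAL

    def decide(self, actor, approve):
        """Human-facing call: the only way out of PENDING_APPROVAL."""
        if self.state is not State.PENDING_APPROVAL:
            raise RuntimeError("nothing pending")
        if actor not in AUTHENTICATED_APPROVERS:
            raise PermissionError("actor is not an authenticated approver")
        self.approved_by = actor if approve else None
        self.state = State.EXECUTING if approve else State.REJECTED
```

&lt;p&gt;The design point is that the agent only ever holds a reference to &lt;code&gt;submit&lt;/code&gt;. The transition into &lt;code&gt;EXECUTING&lt;/code&gt; lives in a separate call that checks who is asking.&lt;/p&gt;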




&lt;h2&gt;
  
  
  The observability requirement
&lt;/h2&gt;

&lt;p&gt;Governance without observability is theater. Every MCP write operation must generate a complete audit trail: who or what initiated it, the full prompt context that led to it, the validation decisions made at each layer, the human approval record, the exact operation executed, and the result. This trail must be immutable and queryable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔍 &lt;strong&gt;Observability Pattern:&lt;/strong&gt; Use AgentOps or LangSmith to capture full session traces for every agentic workflow touching write operations. Tag every trace with the MCP tool name, parameter hash, validation outcome, approval actor, and ERP transaction ID. When something goes wrong — and it will — you need forensic replay, not log archaeology.&lt;/p&gt;
&lt;/blockquote&gt;
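
&lt;p&gt;As an illustration of what a tamper-evident trail can look like, here is a minimal hash-chained log using only the Python standard library. It is a sketch, not the AgentOps or LangSmith API, and the record fields are hypothetical. A hash chain detects edits but does not prevent them, so real immutability still needs write-once storage.&lt;/p&gt;

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash, record):
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log, record):
    """Append a record whose hash also covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry_hash = _digest(prev_hash, record)
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})
    return entry_hash


def verify_chain(log):
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```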




&lt;h2&gt;
  
  
  The architecture rule
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Semantic layer for read. MCP for write. But every MCP write path must pass through prompt validation, MCP schema enforcement, and a HITL governance gate before touching any enterprise system.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The leaders celebrating "AI talking directly to ERP via MCP" are describing a system one prompt injection away from a business crisis. The architects who understand this gap will be the ones called in to clean up the mess in 2027, and Gartner has already predicted that over 40% of agentic AI projects will be canceled by the end of that year. Governance is not a feature. It is the foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coming next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A2A vs MCP — and where Google's UCP and AP2 fit in the agentic protocol landscape&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>ai</category>
      <category>mcp</category>
      <category>security</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Building a production LLM Judge: lessons from the enterprise audit engine</title>
      <dc:creator>dnyandeo bharambe</dc:creator>
      <pubDate>Sun, 07 Jun 2026 16:56:30 +0000</pubDate>
      <link>https://dev.to/dnyandeo/building-a-production-llm-judge-lessons-from-the-enterprise-audit-engine-3c74</link>
      <guid>https://dev.to/dnyandeo/building-a-production-llm-judge-lessons-from-the-enterprise-audit-engine-3c74</guid>
      <description>&lt;p&gt;When I was building the enterprise audit engine, the LLM Judge was the last thing I &lt;br&gt;
planned to add. It felt like over-engineering. The main agent already had MCP tool &lt;br&gt;
access to live device state, a policy file to reason against, and a LangGraph state &lt;br&gt;
machine keeping it on track. That felt like enough. &lt;br&gt;
Then during testing, the agent correctly flagged a non-compliant firmware version — &lt;br&gt;
and recommended a remediation action that belonged to a completely different device &lt;br&gt;
category. The reasoning was internally consistent. It just used the wrong rule. &lt;br&gt;
Nothing in the pipeline caught it. The output looked clean. Without a second check, that &lt;br&gt;
would have gone to the user. &lt;br&gt;
That's when I added the Judge. Here's how it works and what I learned building it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Judge actually does&lt;/strong&gt;&lt;br&gt;
The Judge is a separate LLM call that runs after every agent response, before anything reaches the user. It gets two inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent's output: the proposed compliance verdict and any suggested remediation&lt;/li&gt;
&lt;li&gt;A fresh read of the policy file, read independently rather than pulled from the agent's context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its job is to compare those two things and decide whether the agent's reasoning holds up against the actual policy. It's not re-running the audit. It's checking whether the answer the agent produced is consistent with the rules the agent was supposed to apply.&lt;br&gt;
If the reasoning checks out, the output moves forward. If something doesn't line up, the Judge blocks it, logs the mismatch, and returns an error instead of a verdict.&lt;/p&gt;
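
&lt;p&gt;A minimal sketch of that check, assuming the model is reachable through a plain callable. The function name &lt;code&gt;run_judge&lt;/code&gt;, the prompt wording, and the PASS/FAIL reply convention are illustrative assumptions, not the project's actual implementation.&lt;/p&gt;

```python
from pathlib import Path


def run_judge(agent_output, policy_path, llm):
    """Check the agent's verdict against a fresh read of the policy file.

    `llm` is any callable that takes a prompt string and returns text;
    it stands in for a real model client.
    """
    # Read from disk on every call, never from the agent's context or a cache.
    policy_text = Path(policy_path).read_text(encoding="utf-8")
    prompt = (
        "Policy:\n" + policy_text + "\n\n"
        "Proposed verdict and remediation:\n" + agent_output + "\n\n"
        "Answer PASS if the verdict is consistent with the policy, "
        "otherwise answer FAIL followed by the mismatch."
    )
    reply = llm(prompt).strip()
    passed = reply.upper().startswith("PASS")
    return {"passed": passed, "detail": reply}
```

&lt;p&gt;Passing the model in as a callable keeps the Judge testable with a stub and makes it obvious that the only inputs are the agent's output and the policy file.&lt;/p&gt;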

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbd9fzdbvlvc1xg6j577.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbd9fzdbvlvc1xg6j577.png" alt=" " width="799" height="362"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Why independence matters more than intelligence&lt;/strong&gt;&lt;br&gt;
The design decision that makes the Judge actually useful is a simple one: it doesn't &lt;br&gt;
share context with the main agent. &lt;br&gt;
Most multi-step agent pipelines accumulate context as they run — retrieved chunks, &lt;br&gt;
intermediate reasoning, tool outputs. By the time the agent produces its final answer, it's &lt;br&gt;
been reasoning inside a specific context window for several steps. That context shapes &lt;br&gt;
what the agent sees as plausible. &lt;br&gt;
If the Judge reads from that same context window, it's not really checking anything &lt;br&gt;
independently. It's just re-reading the same information through a slightly different &lt;br&gt;
prompt. Whatever bias or error the agent accumulated, the Judge inherits it too. That's a &lt;br&gt;
rubber stamp, not a check. &lt;br&gt;
The fix is straightforward. The Judge reads the policy file directly. Not from cache, not &lt;br&gt;
from whatever the agent retrieved — from the file. Every time. That's what gives it the &lt;br&gt;
ability to catch the category mismatch the main agent missed. The agent had &lt;br&gt;
accumulated enough context during its reasoning loop that the wrong rule seemed &lt;br&gt;
plausible. The Judge, reading the policy fresh, saw the inconsistency immediately. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxka12hzwtu4ke29c9fcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxka12hzwtu4ke29c9fcg.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it sits in the LangGraph state machine&lt;/strong&gt; &lt;br&gt;
The Judge is a node in the LangGraph graph, not a wrapper around the graph. That &lt;br&gt;
distinction matters for how state flows through the system. &lt;br&gt;
After check_compliance runs and produces a verdict, the graph routes to llm_judge &lt;br&gt;
before doing anything else. The Judge node reads the policy file, compares it against &lt;br&gt;
the proposed verdict, and sets a pass/fail flag in the graph state. &lt;br&gt;
From there the graph branches: &lt;br&gt;
• FAIL — the output is blocked, the error is logged to AgentOps, and the response &lt;br&gt;
returned to the user explains that the verdict couldn't be verified &lt;br&gt;
• PASS — the graph moves to suggest_remediation, where the agent proposes &lt;br&gt;
the corrective action &lt;br&gt;
After suggest_remediation, there's a hard gate. The graph does not automatically &lt;br&gt;
proceed to execute_remediation. The only path forward is a human clicking Approve in &lt;br&gt;
the Streamlit UI. That approval event triggers the final node, which is the only place in &lt;br&gt;
the entire graph where a write operation touches the database. &lt;br&gt;
The Judge and the HITL gate are separate concerns. The Judge is about correctness: did the agent apply the right rule? The gate is about authorization: did a human sign off on the action? Both are necessary and neither replaces the other.&lt;/p&gt;
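
&lt;p&gt;The routing described above can be sketched in plain Python. This deliberately avoids LangGraph's own API; the node names come from the post, while the &lt;code&gt;blocked&lt;/code&gt; node and the &lt;code&gt;judge_passed&lt;/code&gt; and &lt;code&gt;human_approved&lt;/code&gt; state keys are hypothetical.&lt;/p&gt;

```python
def next_node(current, state):
    """Return the node that follows `current`, or None when the run stops."""
    if current == "check_compliance":
        return "llm_judge"
    if current == "llm_judge":
        # FAIL ends the run with an unverified-verdict response.
        return "suggest_remediation" if state.get("judge_passed") else "blocked"
    if current == "suggest_remediation":
        # Hard gate: only an explicit human approval reaches the write node.
        return "execute_remediation" if state.get("human_approved") else None
    return None


def run(state):
    """Walk the graph from check_compliance and return the visited nodes."""
    path = []
    node = "check_compliance"
    while node is not None:
        path.append(node)
        node = next_node(node, state)
    return path
```

&lt;p&gt;Note that a passing Judge alone never reaches &lt;code&gt;execute_remediation&lt;/code&gt;. The run stops at &lt;code&gt;suggest_remediation&lt;/code&gt; until the approval flag is set from outside the agent.&lt;/p&gt;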

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1mjzcnqj5odar04zcsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1mjzcnqj5odar04zcsc.png" alt=" " width="799" height="628"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What it logs&lt;/strong&gt;&lt;br&gt;
Every Judge decision gets written to AgentOps as a structured event: the agent's &lt;br&gt;
proposed verdict, the policy sections the Judge checked against, the final pass/fail, and &lt;br&gt;
if it failed, the specific mismatch it found. &lt;br&gt;
That logging turned out to be more useful than I expected. During development it made &lt;br&gt;
debugging much faster — you could see exactly what the Judge was checking and why &lt;br&gt;
it made the call it did. For a production enterprise system, it also creates a full audit trail &lt;br&gt;
for every compliance decision the system touches. &lt;br&gt;
If someone asks later why a device was flagged or cleared, there's a complete record: &lt;br&gt;
what the agent found, what the Judge verified, and who approved the remediation. That &lt;br&gt;
kind of traceability is hard to retrofit once a system is in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd do differently&lt;/strong&gt; &lt;br&gt;
The main thing I'd change is adding a confidence threshold to the Judge output rather &lt;br&gt;
than a binary pass/fail. &lt;br&gt;
Right now the Judge either passes or blocks. That works for clear-cut cases, but there's &lt;br&gt;
a middle category — outputs where the reasoning is mostly right but one detail is &lt;br&gt;
uncertain — where a hard block isn't the right response. A confidence score would let &lt;br&gt;
the system surface those cases to the human reviewer with a flag rather than just &lt;br&gt;
blocking them silently.&lt;br&gt;
The other thing is latency. Adding a second LLM call to every response adds time, and &lt;br&gt;
for simple queries that don't need a full audit, that overhead isn't worth it. The intent &lt;br&gt;
classifier at the FastAPI gateway already routes simple queries to the NIM worker and &lt;br&gt;
skips the full agent. Extending that logic to also skip the Judge for low-stakes queries &lt;br&gt;
would help.&lt;br&gt;
&lt;strong&gt;The short version&lt;/strong&gt;&lt;br&gt;
The Judge is useful not because it's smarter than the main agent, but because it's &lt;br&gt;
independent of it. Reading the policy fresh, without inheriting whatever context the &lt;br&gt;
agent accumulated during its reasoning loop, is what gives it the ability to catch errors &lt;br&gt;
the agent can't see from inside its own context window. &lt;br&gt;
It adds latency and cost. For a system making compliance decisions on production &lt;br&gt;
infrastructure, that's a reasonable trade. &lt;br&gt;
Code is on GitHub. The Judge implementation is in the governance layer — happy to &lt;br&gt;
walk through the prompt design or the AgentOps integration in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langgraph</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Why I chose MCP over RAG for live infrastructure auditing</title>
      <dc:creator>dnyandeo bharambe</dc:creator>
      <pubDate>Thu, 28 May 2026 22:41:53 +0000</pubDate>
      <link>https://dev.to/dnyandeo/why-i-chose-mcp-over-rag-for-live-infrastructure-auditing-1ce8</link>
      <guid>https://dev.to/dnyandeo/why-i-chose-mcp-over-rag-for-live-infrastructure-auditing-1ce8</guid>
      <description>&lt;p&gt;I've been working on a project to audit distributed hardware infrastructure — devices &lt;br&gt;
spread across multiple sites, each running firmware that needs to stay compliant with a &lt;br&gt;
central policy. Pretty standard enterprise ops problem. &lt;br&gt;
My first instinct was RAG. Everyone reaches for RAG. You embed your documents, &lt;br&gt;
stand up a vector store, and your agent can reason over your data. I've built RAG &lt;br&gt;
pipelines before, they work well, so I started there. &lt;br&gt;
Three days in, I switched direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moment I realized RAG wasn't the right fit
&lt;/h2&gt;

&lt;p&gt;I was testing the agent against a scenario where a device had failed a firmware check at &lt;br&gt;
2am. The agent reported it as compliant. &lt;br&gt;
The problem wasn't the model. The problem was that the data the agent was reasoning &lt;br&gt;
over was from an embedded snapshot I'd generated two days earlier. The device had &lt;br&gt;
drifted since then. The vector store didn't know — it can't know. It's a snapshot by &lt;br&gt;
design. &lt;br&gt;
That works fine for a documentation assistant. For infrastructure audit it's a problem, &lt;br&gt;
because you need to know what's happening now, not what was true when you last ran &lt;br&gt;
the embedding pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I needed wasn't retrieval — it was access
&lt;/h2&gt;

&lt;p&gt;Here's the reframe that changed how I thought about this. &lt;br&gt;
RAG answers the question: what documents are relevant to this query? &lt;br&gt;
What I actually needed to answer was: what is the current state of device X right now? &lt;br&gt;
Those are different questions. One is a search problem. The other is a database query. I &lt;br&gt;
was using the wrong tool. &lt;br&gt;
The inventory — firmware versions, device health, site assignments — lives in a SQLite &lt;br&gt;
database. The compliance policy lives in a structured text file. Neither of these is a &lt;br&gt;
document in any meaningful sense. Chunking them and embedding them into a vector &lt;br&gt;
store was me forcing square data into a round hole because that's what I knew how to do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfmk9zlzvta2wcnty185.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfmk9zlzvta2wcnty185.png" alt="Figure 1 — RAG vs MCP: why retrieval falls short for live infrastructure data " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead, I put the database and the policy file behind an MCP server that exposes them as tools the agent can call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;get_inventory()&lt;/code&gt;: returns live device state, current to the second&lt;/li&gt;
&lt;li&gt;&lt;code&gt;query_policy()&lt;/code&gt;: reads the policy file and returns the requirements&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flag_violation()&lt;/code&gt;: marks a device non-compliant with structured metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent calls these the same way your application code calls an API. No embedding pipeline. No staleness problem. No guessing at similarity scores for what is fundamentally a structured query.&lt;/p&gt;
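
&lt;p&gt;To make the contrast with retrieval concrete, here is a sketch of two of those tools as plain functions over SQLite. The table schema, the sample rows, and the function signatures are assumptions for illustration, and registering the functions with an MCP server is left out.&lt;/p&gt;

```python
import sqlite3


def make_demo_db():
    """Build an in-memory inventory table with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE devices (id INTEGER PRIMARY KEY, site TEXT, "
        "firmware TEXT, compliant INTEGER DEFAULT 1, violation TEXT)"
    )
    conn.executemany(
        "INSERT INTO devices (id, site, firmware) VALUES (?, ?, ?)",
        [(1, "Bellevue", "2.1.0"), (2, "Bellevue", "1.9.4")],
    )
    conn.commit()
    return conn


def get_inventory(conn, site=None):
    """Read device state straight from the database at call time."""
    sql = "SELECT id, site, firmware, compliant FROM devices"
    params = ()
    if site is not None:
        sql += " WHERE site = ?"
        params = (site,)
    return conn.execute(sql + " ORDER BY id", params).fetchall()


def flag_violation(conn, device_id, reason):
    """Mark one device non-compliant and record why; True if a row changed."""
    cur = conn.execute(
        "UPDATE devices SET compliant = 0, violation = ? WHERE id = ?",
        (reason, device_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

&lt;p&gt;Every call runs a query against the current rows, so a device that drifted a minute ago shows up in the next &lt;code&gt;get_inventory&lt;/code&gt; result without any re-embedding step.&lt;/p&gt;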

&lt;h2&gt;
  
  
  The gateway nobody talks about
&lt;/h2&gt;

&lt;p&gt;One thing I'd push back on in most agent tutorials — they wire the LLM directly to the &lt;br&gt;
frontend and call it done. &lt;br&gt;
I put a FastAPI gateway in between, and I'd do it again every time. &lt;br&gt;
The practical reason: NVIDIA NIM credits aren't free. A misconfigured client or a &lt;br&gt;
runaway loop can drain your quota in minutes if there's nothing between the UI and the &lt;br&gt;
model. The gateway enforces rate limits per IP before a single token is generated. &lt;br&gt;
Saved me actual money during development. &lt;br&gt;
The better reason: not every query needs the full audit agent. Simple questions — how &lt;br&gt;
many nodes are in Bellevue? — don't need a multi-step LangGraph agent burning &lt;br&gt;
Gemini 2.5 tokens. The gateway classifies intent and routes accordingly. Simple queries &lt;br&gt;
go to a lighter NIM worker. Full compliance audits go to the Gemini agent. &lt;br&gt;
It also centralises auth and logging in one place, which matters when you need to show &lt;br&gt;
a security team exactly what the agent did and when. &lt;/p&gt;
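
&lt;p&gt;A rough sketch of the two gateway duties, per-IP rate limiting and intent routing, in plain Python rather than FastAPI. The fixed-window strategy, the keyword list, and the names &lt;code&gt;FixedWindowLimiter&lt;/code&gt; and &lt;code&gt;route&lt;/code&gt; are assumptions; the post does not describe how the real limiter or classifier works.&lt;/p&gt;

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per IP in each `window`-second bucket."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counts = {}  # old buckets are never evicted in this sketch

    def allow(self, ip):
        bucket = (ip, int(self.clock() // self.window))
        used = self.counts.get(bucket, 0)
        # The count rises by one per call, so equality marks the ceiling.
        if used == self.limit:
            return False
        self.counts[bucket] = used + 1
        return True


# Hypothetical keyword heuristic standing in for the real intent classifier.
AUDIT_KEYWORDS = ("audit", "compliance", "compliant", "firmware", "violation")


def route(query):
    """Send audit-style questions to the full agent, the rest to the light worker."""
    text = query.lower()
    if any(word in text for word in AUDIT_KEYWORDS):
        return "audit_agent"
    return "light_worker"
```

&lt;p&gt;The limiter runs before routing, so a runaway client is rejected before any model is called at all.&lt;/p&gt;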

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazh8ak0xkgcnanbg6oto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazh8ak0xkgcnanbg6oto.png" alt="Figure 2 — Full system architecture: gateway, dual-model routing, MCP sensor layer, and LLM Judge" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Judge
&lt;/h2&gt;

&lt;p&gt;This is the piece I'm most glad I built, and the one I almost skipped. &lt;br&gt;
Every response — whether it came from the NIM worker or the Gemini agent — passes &lt;br&gt;
through a secondary LLM before it reaches the user. I call it the Judge. Its only job is to &lt;br&gt;
read the agent's output, check it independently against the policy file, and decide &lt;br&gt;
whether the reasoning holds up. &lt;br&gt;
During testing, the Judge caught something the main agent missed. The agent had &lt;br&gt;
correctly identified a non-compliant firmware version, but applied a remediation rule that &lt;br&gt;
belonged to a different device category. The logic was sound — it just used the wrong &lt;br&gt;
rule. The Judge caught it because it reads the policy independently, without inheriting &lt;br&gt;
whatever context the main agent had accumulated during its reasoning loop. &lt;br&gt;
That independence is the point. If the Judge just re-reads the agent's own context, it's &lt;br&gt;
not really checking anything. You want it reading from the source, fresh. &lt;/p&gt;

&lt;h2&gt;
  
  
  Humans stay in the loop
&lt;/h2&gt;

&lt;p&gt;The agent can suggest remediation — here's the CLI command to fix the firmware drift &lt;br&gt;
on node 7. It cannot run it. &lt;br&gt;
There's a hard gate in the LangGraph state machine. Suggest remediation and execute &lt;br&gt;
remediation are separate nodes, and the only path between them runs through a human &lt;br&gt;
decision in the UI. An architect clicks Approve. Then and only then does the write &lt;br&gt;
operation touch the database. &lt;br&gt;
For infrastructure this felt like the right call. The cost of a false positive — a remediation &lt;br&gt;
that runs when it shouldn't — is much higher than the cost of an extra approval click. &lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Two things. &lt;br&gt;
I'd instrument RAGAS metrics from day one. I ended up retrofitting evaluation on the &lt;br&gt;
agent's audit outputs and found gaps I'd been manually poking at for weeks. &lt;br&gt;
Faithfulness and context relevancy scores would have surfaced those faster. &lt;br&gt;
And I'd write the red-team report in parallel, not after. I know what failure modes the &lt;br&gt;
Judge catches now, but I reconstructed most of that knowledge from memory rather &lt;br&gt;
than documenting it as I found it. A live failure log from the start would've made that &lt;br&gt;
report much sharper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;RAG is the right tool for knowledge retrieval over static content. It's a less natural fit &lt;br&gt;
when your agent needs to query live structured data and act on what it finds. &lt;br&gt;
MCP let me give the agent real database access through a typed tool interface — no &lt;br&gt;
embedding pipeline, no staleness, no similarity search on what is fundamentally a &lt;br&gt;
relational query. For infrastructure audit, that was the right call. &lt;br&gt;
Code is on GitHub if you want to dig into the architecture. Happy to go deeper on the &lt;br&gt;
LangGraph state machine or the Judge design in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
