Prasad Thiriveedi

Governing AI Agent Decisions with MCP: How I Built Dead Letter Oracle

Dead Letter Oracle turns failed events into governed replay decisions.

[Figure: Dead Letter Oracle architecture]


The Problem Nobody Solves

A failed message hits the DLQ. The fix looks obvious. The replay still breaks production.

Every on-call engineer who has manually replayed a DLQ message and watched it break production again knows this problem.

In event-driven systems, messages fail silently. They land in a dead-letter queue with a vague error and an angry on-call engineer staring at them. The diagnosis is manual. The fix is a guess. The replay decision, whether to reprocess the message, is made without confidence scoring, without governance, and without an audit trail.

Most AI agent demos show you the happy path: the agent gets it right on the first try.

Dead Letter Oracle is not that demo.


The Closed Loop

Dead Letter Oracle turns failed events into governed replay decisions. It does not just diagnose. It reasons through a fix, tests it, revises when confidence is too low, makes a governed ALLOW/WARN/BLOCK decision, and shows every step of its reasoning.

The full loop:

  1. Read the failed DLQ message via dlq_read_message
  2. Validate the payload via schema_validate
  3. LLM proposes an initial fix (plausible, but high-level)
  4. replay_simulate tests the fix: confidence 0.28, too low to proceed
  5. LLM revises with a concrete, operationally safe fix
  6. replay_simulate re-evaluates: confidence 0.91
  7. Gatekeeper issues WARN: production requires manual approval before live replay
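The seven steps above can be sketched as a single pipeline function. The tool names match the post; the function signatures, the stubbed `tools`/`llm`/`gatekeeper` objects, and the 0.80 threshold are assumptions for illustration, not the project's actual code:

```python
# Sketch of the 7-step loop. Tool names come from the post; signatures,
# the threshold value, and the injected objects are assumptions.
CONFIDENCE_THRESHOLD = 0.80

def run_incident(file_path, tools, llm, gatekeeper):
    trace = []

    # [1] Read the failed DLQ message
    message = tools.dlq_read_message(file_path)
    trace.append(("READ MESSAGE", message["error"]))

    # [2] Validate the payload against the expected schema
    errors = tools.schema_validate(message["payload"], message["schema"])
    trace.append(("VALIDATE", errors))

    # [3] LLM proposes an initial fix (plausible, but possibly high-level)
    fix = llm.propose_fix(message, errors)
    trace.append(("PROPOSE FIX", fix))

    # [4] Simulate the replay; a low score forces a revision pass
    result = tools.replay_simulate(message, fix)
    trace.append(("SIMULATE (1)", result["confidence"]))

    # [5]-[6] Revise once if confidence is too low, then re-simulate
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        fix = llm.revise_fix(message, fix, result)
        trace.append(("REVISE FIX", fix))
        result = tools.replay_simulate(message, fix)
        trace.append(("SIMULATE (2)", result["confidence"]))

    # [7] Gatekeeper issues the governed ALLOW/WARN/BLOCK decision
    decision = gatekeeper.decide(message, fix, result)
    trace.append(("GOVERN", decision))
    return decision, trace
```

The key structural point is that the revision pass is conditional on a measured score, not on the LLM's self-assessment.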

The deliberate first-fix failure is the core design moment. The first fix is plausible but not operationally safe. Simulation catches the weakness. Revision becomes concrete. Governance still restrains production replay even at 0.91.

A system that always succeeds on the first try is not reasoning. It is pattern-matching.


The Governance Layer

Most agent demos skip governance. Dead Letter Oracle makes it the centerpiece.

The Gatekeeper evaluates four independent factors before issuing a decision:

| Factor | What it checks |
| --- | --- |
| Schema | Is the original mismatch resolved by the proposed fix? |
| Simulation | What is the replay confidence score? |
| Fix | Has a confirmed, operationally specific fix been applied? |
| Environment | Is this production or staging? |

Why WARN instead of ALLOW at 0.91? Because the environment is production. The Gatekeeper applies a higher confidence threshold in production than in staging. A 0.91 confidence fix in staging gets ALLOW. The same fix in production gets WARN. A human operator reviews before the live replay proceeds.

When does BLOCK trigger? If simulation confidence stays below threshold after revision, or if no confirmed fix was applied. The Gatekeeper does not reward effort. It rewards verified outcomes.

This is not a hardcoded if/else. It is multi-factor evaluation, the same pattern used in access control and fraud detection systems. The Gatekeeper is the governance layer, not a convenience wrapper.
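A minimal sketch of that multi-factor evaluation follows. The ALLOW/WARN/BLOCK outcomes and the "0.91 in production gets WARN, in staging gets ALLOW" behavior come from the post; the specific threshold values and parameter names are assumptions:

```python
# Sketch of the Gatekeeper's multi-factor decision. The four factors
# come from the post; threshold values and names are assumptions.
STAGING_THRESHOLD = 0.80
PROD_THRESHOLD = 0.95  # production demands a higher bar than staging

def gatekeeper_decide(schema_resolved, confidence, fix_confirmed, environment):
    # BLOCK: no verified outcome -- unresolved schema, no confirmed fix,
    # or confidence still below threshold after revision
    if not fix_confirmed or not schema_resolved:
        return "BLOCK"
    if confidence < STAGING_THRESHOLD:
        return "BLOCK"
    # WARN: the fix is validated, but production requires a human review
    if environment == "production" and confidence < PROD_THRESHOLD:
        return "WARN"
    return "ALLOW"
```

Each factor can veto independently, which is what distinguishes this from a single hardcoded branch on the confidence score.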


Why the MCP Protocol Boundary Matters

The agent and the MCP server run as separate processes communicating over stdio. The protocol boundary is real.

This matters because the tools are genuinely callable by any MCP-compatible client, not just this agent. The tools are a contract, not an implementation detail.

Four MCP tools: three deterministic, one orchestration.

| Tool | Type | Input | Output |
| --- | --- | --- | --- |
| dlq_read_message | Deterministic | file path | parsed DLQ message |
| schema_validate | Deterministic | payload, expected schema | valid/errors |
| replay_simulate | Deterministic | original message, proposed fix | confidence score, likelihood, reason |
| agent_run_incident | Orchestration | file path | gatekeeper decision + 7-step trace |

The LLM is the interpretation layer only. It proposes and revises. The deterministic tools measure and verify. The orchestration tool composes them into a governed pipeline callable from any MCP client.

How is the confidence score calculated? replay_simulate evaluates schema validity of the proposed fix, fix specificity (concrete value vs high-level direction), and replay rule alignment. A high-level fix like "align producer schema" scores low because it describes intent, not action. A concrete fix like user_id="12345" scores high because it is directly verifiable.
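The scoring idea can be sketched as a weighted sum over the three factors named above. The factors come from the post; the weights, the boolean inputs, and the resulting numbers are illustrative assumptions, not the actual replay_simulate internals:

```python
# Sketch of a replay_simulate-style confidence score. The three factors
# come from the post; the weights and heuristics are assumptions.
def confidence_score(fix_passes_schema, fix_is_concrete, rules_aligned):
    score = 0.0
    score += 0.40 if fix_passes_schema else 0.0   # does the fixed payload validate?
    score += 0.35 if fix_is_concrete else 0.05    # concrete value vs high-level intent
    score += 0.25 if rules_aligned else 0.0       # does the fix match replay rules?
    return round(score, 2)

# A high-level fix ("align producer schema") fails the first two checks
# and scores low; a concrete fix (user_id="12345") passes all three.
```

The design point is that specificity is scored, not just validity: a fix that only describes intent cannot earn a high score no matter how reasonable it sounds.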


AgentGateway: Real HTTP Transport

Dead Letter Oracle ships with an AgentGateway configuration that exposes all four MCP tools behind a production-grade HTTP proxy.

agentgateway -f agentgateway/config.yaml

The gateway adds CORS, session tracking, and a live playground UI at localhost:15000/ui. Any client (a browser, a remote agent, or a CI pipeline) can invoke the tools at http://localhost:3000/ without spawning a subprocess.

Open the playground, connect to http://localhost:3000/, select agent_run_incident, and invoke it with {"file_path": "data/sample_dlq.json"}. The full governed pipeline runs and returns both simulations, the gatekeeper decision, and the complete 7-step trace, all from a browser. No CLI required.

The agent runtime is HTTP-first: it probes the gateway URL before each tool call batch and falls back to stdio if the gateway is not running. The system works in both modes. The transport layer is transparent to the planner.
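An HTTP-first probe with stdio fallback can be sketched as follows. The gateway URL matches the post; the probe logic and the injected `http_call`/`stdio_call` callables are assumptions standing in for the real transport implementations:

```python
# Sketch of an HTTP-first transport probe with stdio fallback.
# The gateway URL comes from the post; the probe logic is an assumption.
import urllib.error
import urllib.request

GATEWAY_URL = "http://localhost:3000/"

def gateway_available(url=GATEWAY_URL, timeout=0.5):
    """Return True if the AgentGateway answers the probe, else False."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (urllib.error.URLError, OSError):
        return False

def call_tool(name, args, http_call, stdio_call):
    """Dispatch to HTTP when the gateway is up, else fall back to stdio.

    http_call / stdio_call are injected callables (hypothetical here);
    the planner never sees which transport was chosen.
    """
    if gateway_available():
        return http_call(name, args)
    return stdio_call(name, args)
```

Because the probe result only selects a transport, the planner's tool calls stay identical in both modes.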


The BlackBox Reasoning Trace

Every run produces a structured 7-step audit record:

[1] READ MESSAGE     event=user_created, error=Schema validation failed
[2] VALIDATE        user_id: expected string, got int
[3] PROPOSE FIX     Align producer schema: cast user_id to string
[4] SIMULATE (1)    confidence=0.28, likelihood=low
[5] REVISE FIX      Set user_id="12345" in payload before replay
[6] SIMULATE (2)    confidence=0.91, likelihood=high
[7] GOVERN          WARN: fix validated, prod environment requires manual approval

This is not a log. It is a structured audit record: every tool call, every LLM step, every policy trigger, in order. In production, this is what you attach to the incident ticket. It is the difference between "the agent decided to replay" and "here is exactly why."
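A minimal sketch of one such structured step, modeled on the trace above; the field names and serialization format are assumptions, not the project's actual schema:

```python
# Sketch of a structured trace step; field names are assumptions
# modeled on the 7-step record shown above.
import json
from dataclasses import asdict, dataclass

@dataclass
class TraceStep:
    step: int
    phase: str        # e.g. "SIMULATE", "GOVERN"
    detail: str       # tool output, LLM fix text, or policy trigger

steps = [
    TraceStep(4, "SIMULATE", "confidence=0.28, likelihood=low"),
    TraceStep(7, "GOVERN", "WARN: prod environment requires manual approval"),
]

# Serialize the trace for attachment to an incident ticket
record = json.dumps([asdict(s) for s in steps], indent=2)
```

Serializing the steps as data rather than log lines is what makes the trace attachable, queryable, and diffable across incidents.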


Business Value

Dead Letter Oracle reduces four categories of operational risk:

  • Risky manual replays: confidence scoring and governance replace gut-feel decisions
  • MTTR for DLQ incidents: the full loop runs in seconds, not hours of manual debugging
  • Repeated failure loops: simulation catches fixes that would fail again before they reach production
  • Audit gaps: every decision is traceable, every step is recorded

Three Entry Points, One Implementation

# Entry point 1: AgentGateway playground (browser, no setup)
# Open http://localhost:15000/ui/playground/
# Invoke agent_run_incident with {"file_path": "data/sample_dlq.json"}

# Entry point 2: HTTP API
curl -X POST http://localhost:8000/run-incident \
  -H "Content-Type: application/json" \
  -d '{"file_path": "data/sample_dlq.json"}'

# Entry point 3: CLI
python main.py

One implementation (mcp_server/tools.run_incident), three surfaces. The MCP tool, the HTTP API, and the CLI all call the same function.
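The "one implementation, three surfaces" shape can be sketched like this. Only the name run_incident comes from the post (mcp_server/tools.run_incident); the wrapper names and the stubbed return value are assumptions:

```python
# Sketch of one implementation behind three surfaces. Only run_incident
# is named in the post; the wrappers and stub result are assumptions.
import json

def run_incident(file_path):
    """The single shared implementation (stubbed here)."""
    return {"decision": "WARN", "steps": 7}

def cli_main(file_path="data/sample_dlq.json"):
    # CLI surface: print the same result the HTTP API would return
    print(json.dumps(run_incident(file_path)))

def http_handler(body):
    # HTTP surface: parse the request body, delegate, serialize
    args = json.loads(body)
    return json.dumps(run_incident(args["file_path"]))

# The MCP surface would register run_incident as a tool with the
# MCP server, delegating to the same function.
```

Keeping the surfaces as thin parse-and-delegate wrappers is what guarantees the three entry points cannot drift apart.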


Try It

git clone https://github.com/tvprasad/dead-letter-oracle
cd dead-letter-oracle
pip install -r requirements.txt
cp .env.example .env
# Set LLM_PROVIDER and credentials (azure_openai, anthropic, or ollama)
python main.py

GitHub: github.com/tvprasad/dead-letter-oracle


ADR-Driven Build

Every architectural decision was documented in an Architecture Decision Record before a line of code was written. Nine ADRs cover the MCP transport strategy, the deterministic vs orchestration tool distinction, Gatekeeper multi-factor evaluation, the BlackBox audit trace, AgentGateway integration, and the Agent HTTP API. The ADRs are in the repo.


By the Numbers

  • Confidence delta per run: 0.28 to 0.91 (deliberate two-pass design)
  • 7-step structured audit trace per incident
  • 23 tests: zero flaky, LLM fully mocked
  • Full pipeline completes in under 2 seconds on local Ollama
  • 4 MCP tools: 3 deterministic, 1 orchestration

Production-Grade Standards

  • ruff lint and format enforced in CI
  • GitHub Actions: test matrix on Python 3.12 and 3.13
  • Branch protection: CI must pass before merge
  • Apache 2.0 license

Built by Prasad Thiriveedi, VPL Solutions LLC

Submitted to MCP_HACK//26, Secure & Govern MCP track
