DEV Community: Sasi Sundar

The Tool Call Succeeded. The Outcome Failed.

Sasi Sundar — Fri, 19 Jun 2026 19:25:28 +0000

Most engineering teams are trained to think about failures the wrong way.

We look for crashes.

We look for exceptions.

We look for alerts.

We look for red dashboards.

But some of the most damaging failures don't look like failures at all.

They look like success.

A few months ago, while working with AI agents and MCP servers, I noticed a pattern that kept repeating itself.

The agent would call a tool.

The tool would return a successful response.

No error.

No exception.

No timeout.

Everything looked healthy.

But the task wasn't completed.

The action never happened.

The user received the wrong outcome.

The customer discovered the problem before the engineering team did.

This is a very different type of failure.

And it's becoming increasingly common as AI systems move into production.

The Assumption Every Engineer Makes

Most software systems are built around a simple assumption:

If the request succeeded, the outcome succeeded.

That assumption works surprisingly well until external systems enter the picture.

Modern AI agents depend on APIs, MCP servers, databases, SaaS platforms, search systems, and dozens of external tools.

Every additional dependency creates another opportunity for the request and the outcome to diverge.

The system reports success.

Reality reports failure.

Four Ways This Happens

1. Null Responses

The tool returns successfully.

The response is technically valid.

The actual result is empty.

{
  "result": null
}

The agent continues.

The user receives incomplete information.

Nobody notices immediately.

2. Partial Execution

The request triggered three actions.

Only one completed.

The tool reports success anyway.

The workflow is now in an inconsistent state.

3. Stale Data

The response arrives successfully.

The information is hours old.

The agent makes a decision based on outdated reality.

4. Schema Drift

A field changes.

A response format evolves.

The system still receives data.

The meaning of the data changes.

The workflow silently breaks.
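The four patterns above can each be checked on the agent side before a tool call is treated as done. Here is a minimal sketch in Python, assuming a hypothetical response shape: the field names `result`, `actions_requested`, `actions_completed`, and `fetched_at` are illustrations, not part of MCP or any specific tool.

```python
from datetime import datetime, timedelta, timezone

def validate_outcome(response, required_fields, max_age, now=None):
    """Return a list of problems found in a 'successful' tool response.

    The response shape is hypothetical: 'result' holds the payload (a dict),
    'actions_requested'/'actions_completed' count side effects, and
    'fetched_at' is an ISO 8601 timestamp for the data.
    """
    now = now or datetime.now(timezone.utc)
    problems = []

    result = response.get("result")
    # 1. Null response: the call succeeded but carried no payload.
    if result is None:
        return ["null result"]

    # 2. Partial execution: fewer actions completed than requested.
    requested = response.get("actions_requested", 0)
    completed = response.get("actions_completed", 0)
    if completed < requested:
        problems.append(f"partial execution: {completed}/{requested} actions")

    # 3. Stale data: the payload is older than the caller tolerates.
    fetched_at = response.get("fetched_at")
    if fetched_at is not None:
        age = now - datetime.fromisoformat(fetched_at)
        if age > max_age:
            problems.append(f"stale data: {age} old")

    # 4. Schema drift: a field the workflow depends on is missing.
    missing = [name for name in required_fields if name not in result]
    if missing:
        problems.append(f"schema drift: missing {missing}")

    return problems
```

An empty list means none of the four checks fired. It does not prove the outcome happened, only that the most common silent failures were ruled out.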

Why These Failures Are Expensive

Crashes are visible.

Silent failures are invisible.

A crash gets reported instantly.

A silent failure continues operating.

Users lose trust.

Engineers spend hours debugging.

Teams reconstruct events after the damage is already done.

The investigation usually starts with:

"A customer said something looked wrong."

That's one of the most expensive ways to discover a reliability problem.

The Lesson

The lesson is simple.

Stop trusting successful requests.

Start validating successful outcomes.

Those are not the same thing.

A response code tells you whether communication happened.

It does not tell you whether the desired outcome occurred.

As AI systems become more dependent on tools and external systems, that distinction becomes increasingly important.

One Action You Can Take Today

Review the last 10 production tool calls in your system.

For each one, ask:

Did the request succeed?
Did the intended outcome actually occur?
How would we know if it didn't?

If those answers are different, you've already found a reliability gap.

And chances are your users will find it eventually if you don't.

Your MCP Agent Is Logging "success: true" While the Task Never Ran

Sasi Sundar — Mon, 15 Jun 2026 17:05:58 +0000

You build an agent. It calls an MCP tool, gets a response, logs success: true, and moves on. Thirty-three minutes later a customer emails asking why their ticket was never created.

You pull the logs. Every entry says the call succeeded. You check the MCP server. It received the request. HTTP 200 came back. So where did the task go?

It went nowhere. The task never ran. The MCP server returned a null result inside a 200 response, and your agent treated the status code as truth and discarded the body.

This is not a fringe case. It is the default failure mode of MCP.

Why this happens at the protocol level

MCP messages are JSON-RPC 2.0, and remote servers typically carry them over HTTP. The error surface is split across two layers. HTTP carries the transport status. JSON-RPC carries the application status. They do not have to agree, and often they do not.

A well-behaved MCP error looks like this:

{"jsonrpc": "2.0", "id": 1, "error": {"code": -32000, "message": "tool execution failed"}}

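A silent failure looks different. The body is {"jsonrpc": "2.0", "id": 1, "result": null}, also with HTTP 200, and there is no error field. The server finished handling the request and returned nothing. Most agent frameworks check res.ok or the HTTP status code and move on. The null goes unexamined.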
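To make the two layers concrete, here is a small sketch of a check that looks past the HTTP status and into the JSON-RPC body. The function name and the category labels it returns are my own, chosen for illustration; they do not come from MCP or from any framework.

```python
import json

def classify_tool_response(http_status, body_text):
    """Classify a tools/call reply by looking past the HTTP status."""
    # Transport layer: anything outside 2xx failed before JSON-RPC applies.
    if not 200 <= http_status < 300:
        return "transport_error"
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(body, dict):
        return "invalid_json"
    # Application layer: a JSON-RPC error object is an explicit failure.
    if "error" in body:
        return "rpc_error"
    # Silent failure: HTTP 200, no error field, and no usable result.
    if body.get("result") is None:
        return "null_result"
    return "ok"
```

Only the last branch should be logged as a success. The `null_result` branch is the case that a bare status-code check lets through.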

Three failure patterns worth knowing

1. Null result propagation

A tools/call returns {"result": null} with HTTP 200. The agent has no indication anything went wrong. It logs the call as complete. The downstream system never receives the expected payload. This is the most common pattern and the hardest to catch because every layer reports success.

2. Retry masking

Your agent is configured with retries. The upstream MCP server is behind a deduplication layer. The agent fires a tools/call, times out, and retries three times. Each retry is deduplicated by the upstream: it executes the mutation once and silently acknowledges the subsequent requests. The agent sees success on attempt four. Your audit log shows four calls. The task ran once, but the agent has no idea which attempt was real.
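The retry scenario is easier to reason about with a toy simulation. In the sketch below, `DedupUpstream`, the request key, and the `replayed` flag are all hypothetical. They stand in for whatever deduplication layer and acknowledgement format a real upstream uses.

```python
class DedupUpstream:
    """Toy stand-in for an upstream that deduplicates by request key."""

    def __init__(self):
        self.executed_keys = set()
        self.mutations = 0

    def call(self, key):
        # The first time a key is seen, the mutation really runs.
        if key not in self.executed_keys:
            self.executed_keys.add(key)
            self.mutations += 1
            return {"ok": True, "replayed": False}
        # Later requests with the same key are acknowledged, not re-run.
        return {"ok": True, "replayed": True}


def call_with_retries(upstream, key, lost_replies, max_attempts=4):
    """Retry until a reply arrives and record what each attempt observed.

    lost_replies is the set of attempt numbers whose reply never reaches
    the agent (a simulated timeout). The upstream still does the work.
    """
    audit = []
    for attempt in range(1, max_attempts + 1):
        reply = upstream.call(key)
        if attempt in lost_replies:
            audit.append({"attempt": attempt, "outcome": "timeout"})
            continue
        audit.append({"attempt": attempt, "outcome": "ok",
                      "replayed": reply["replayed"]})
        return audit
    return audit
```

Calling `call_with_retries(DedupUpstream(), "ticket-123", lost_replies={1, 2, 3})` produces four audit entries while the mutation counter stays at one. Without something like the `replayed` flag in the acknowledgement, the agent cannot tell that the real execution happened on an attempt it recorded as a timeout.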

3. Content schema drift in multi-agent workflows

MCP's tools/call result schema requires a content array where each item has a type field. Servers that were written quickly omit type. The consuming agent deserializes the array, tries to route by type, finds undefined, and either silently drops the item or throws an uncaught exception that gets swallowed by an outer try-catch. The agent logs the call as successful. The data was there. The schema was wrong.
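One way to avoid silently dropping untyped items is to validate the content array up front and report what was rejected. This is a sketch under the assumption that the result has already been parsed into a Python dict; `validate_content` is a hypothetical helper, not part of any MCP SDK.

```python
def validate_content(result):
    """Return (valid_items, problems) for a tools/call result dict."""
    problems = []
    content = result.get("content") if isinstance(result, dict) else None
    if not isinstance(content, list):
        return [], ["content is missing or not an array"]
    valid = []
    for index, item in enumerate(content):
        # Every item must be an object carrying a string 'type' to route on.
        if not isinstance(item, dict) or not isinstance(item.get("type"), str):
            problems.append(f"content[{index}] has no string 'type' field")
            continue
        valid.append(item)
    return valid, problems
```

Returning the problems alongside the valid items lets the caller decide whether a partially valid response counts as a success, instead of leaving that decision to an outer try-catch.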

What Vouqis does

Vouqis is a proxy that sits between your agent and your MCP server. Every request and response passes through it. It validates both sides and writes a structured audit log.

npm install -g @vouqis/cli
vouqis proxy --upstream https://your-mcp-server.com

Your agent points at http://127.0.0.1:4444 instead of the MCP server directly. Every event (allow, block, retry, rewrite) is written to vouqis-audit.log as NDJSON with timestamp, method, tool name, latency, attempt number, and reason.

The audit event shape:

{
  "timestamp": "2026-06-15T10:04:22.341Z",
  "method": "tools/call",
  "tool": "create_ticket",
  "decision": "block",
  "latency_ms": 187,
  "reason": "tools/call result is null or missing",
  "attempt": 1
}

That is the event you would have wanted at 10:04 AM instead of finding out from a customer complaint at 10:37 AM.
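Because the audit log is NDJSON, it can be summarized with a few lines of code. The sketch below assumes only the event fields shown above (`decision`, `tool`, `reason`). The helper name `summarize_blocks` is hypothetical and is not part of the Vouqis CLI.

```python
import json
from collections import Counter

def summarize_blocks(ndjson_text):
    """Count blocked events per (tool, reason) from NDJSON audit text."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("decision") == "block":
            counts[(event.get("tool"), event.get("reason"))] += 1
    return counts
```

To use it, read the contents of vouqis-audit.log into a string and pass it in. The result maps each tool and reason pair to the number of blocked calls.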

If you have hit this

If you are running LangGraph, a custom MCP integration, or a multi-agent workflow in production and you have debugged a silent MCP failure — drop a comment with the pattern you hit.

The project is at https://vouqis.tech. The source is at https://github.com/Sasisundar2211/Vouqis.

Building VOUQIS: How We Built the Trust Layer for the MCP Ecosystem

Sasi Sundar — Wed, 27 May 2026 08:52:55 +0000

Your AI agent calls a Model Context Protocol (MCP) server. The server returns a standard 200 OK. The agent logs a generic "success" message. But your customer sees an empty UI, a hung loading spinner, or worse—a catastrophic failure.

This is the hidden crisis of the burgeoning AI agent ecosystem.

When we ran a stress test across 100 production MCP servers, the data exposed a brutal reality:

The median server passes only 71% of tool calls. The rest return silent, empty responses with zero explicit errors.
Chained dependencies compound this. Running 5 tools in sequence at a 71% success rate drops your end-to-end reliability to a measly 18%.
Standard API monitoring tools remain completely blind to this. Because the network and HTTP layers look perfectly healthy, your uptime dashboards stay green while your user experience burns.
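The 18% figure for chained calls follows from multiplying the per-call success rate by itself once per call. This short sketch reproduces the arithmetic, assuming each call fails independently of the others (real failures are often correlated, so treat it as an approximation).

```python
def chain_success_rate(per_call_rate, calls):
    """Probability that every call in a sequential chain succeeds,
    assuming each call fails independently of the others."""
    return per_call_rate ** calls

# Print the end-to-end rate for chains of 1 to 5 calls at 71% each.
for n in range(1, 6):
    print(n, f"{chain_success_rate(0.71, n):.1%}")
```

The last line printed is `5 18.0%`, since 0.71 to the fifth power is about 0.180.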

We built Vouqis to fix this. It is a zero-setup, 100% deterministic reliability engine that scores and gates MCP servers before they break your production stack. No SDK installations, no LLM call overhead, and no server-side changes required. Just paste a URL, run the probes, and protect your agents.

Here is the story of why the protocol is breaking in production, how we built a lightweight testing framework to solve it, and the engineering trade-offs we encountered along the way.

The Genesis: Falling Through the Protocol Cracks

The Model Context Protocol is a massive leap forward for agentic workflows. It gives LLMs a clean, standardized interface to interact with external data and tools. But standardizing the interface does not automatically standardize runtime behavior or engineering quality.

A few months ago, while orchestrating multi-agent suites for enterprise accounting and procurement automation, we hit a wall. Agents would work flawlessly in sandbox environments, but throw tantrums in production. They would stall out on basic tool executions or misinterpret empty arrays as valid context.

When we dug into the JSON-RPC layer, we realized that traditional monitoring tools are fundamentally unsuited for MCP. Traditional tools track latency and HTTP status codes. If an MCP server accepts a malformed payload but returns a 200 OK housing a silent protocol error, your monitoring suite logs it as a win.

The industry is waking up to these gaps. The 2026 Zuplo MCP Report explicitly noted that 38% of MCP developers name security and reliability concerns as the primary blocker to production adoption. We witnessed documented production vulnerabilities across the ecosystem:

High-profile path traversals exposing thousands of hosted API keys.
Critical CVEs (like CVE-2025-6514 in mcp-remote) introducing massive Remote Code Execution attack surfaces.
Cross-tenant data leaks exposing client environments for weeks.

We needed a tool that could fire real protocol probes directly at the JSON-RPC layer to audit compliance, stress-test boundaries, and generate a clear, actionable trust score. When we couldn't find one, we spent a fast-paced few weeks building it ourselves.

What Vouqis Does (And What It Tests)

Vouqis runs 10 deterministic probes across 5 specific failure modes in under 30 seconds. It doesn’t use flaky LLM calls to test your infrastructure; it relies entirely on rigid protocol validation.

The Anatomy of an Audit Run

When you point the Vouqis CLI at an MCP server URL, it auto-discovers the available tools, constructs minimal valid inputs based on the exposed schemas, and intentionally injects edge cases.

Take a live audit run against a production instance like mcp.exa.ai/mcp. Running a basic test yields immediate, definitive insights:

vouqis audit https://mcp.exa.ai/mcp

The CLI prints a clean, interactive breakdown to the terminal:

VOUQIS — audit — https://mcp.exa.ai/mcp

✓ Connected – found 2 tools
Running 10 reliability tests against https://mcp.exa.ai/mcp

[██████████████████████████░░░░░░] 10 / 10
✓ 9 X 1

Vouqis Trust Score Report

Server https://mcp.exa.ai/mcp
Score 92 / 100 [██████████████████████████████░░]
Tests passed 9 of 10 (90%)
Response time 691ms typical · target <500ms

What failed:
X Did not reject invalid requests · 1 time
– Server accepted malformed JSON-RPC (HTTP 202)

report written → ./vouqis-report.json
view traces: https://www.vouqis.tech

✓ APPROVED – this server passed all reliability tests

What this tells you immediately: The Exa server is highly resilient (it scores 92/100, which earns an APPROVED verdict), but it silently accepts malformed JSON-RPC envelopes with an HTTP 202 instead of explicitly returning a protocol error. Any agent sending poorly formed requests will hit a silent wall instead of getting a clear failure signal. That is likely a small fix on the server side, and a single audit is enough to catch it.

Deep Dive: The Trust Score Algorithm

To make these audits useful for CI/CD gates, we couldn’t just output a wall of logs. We needed a single, standardized, deterministic index. Every Vouqis run calculates a 0–100 Trust Score based on three distinct, weighted operational signals:

Pass Rate (50% Weight): The fraction of the 10 core protocol probes answered correctly.
Response Time (30% Weight): The median (P50) response time across all tool calls.
Error Spread (20% Weight): A penalty based on how many distinct failure modes are triggered.

Why Median (P50) Latency Matters

A common question we get is why we anchor our latency scoring to P50 instead of P95 or P99. Across the wider industry, MCP server P50 latencies frequently run up to 1,840ms, and P99 spikes can easily clear 6,200ms.

If your P50 response time is already above 500ms during a basic audit probe, your P99 tail latency, which can never be lower than the median, is very likely to cause visible degradation in a multi-turn agent conversation.

We map the P50 value directly to a latency score and its point contribution:

P50 Response Time	Latency Score	Points Contributed
<= 500ms	100	30.0 pts
<= 1,000ms	90	27.0 pts
<= 2,000ms	75	22.5 pts
<= 4,000ms	50	15.0 pts
<= 8,000ms	25	7.5 pts

Calculating the Error Spread Penalty

A server that fails 4 times under a single failure mode (e.g., a systemic timeout issue) usually points to a single bottleneck. A server that fails 4 times across 4 completely different failure modes is architecturally fragile.

To account for this, the Error Spread score drops sharply as more distinct failure categories are tripped:

0 or 1 Failure Modes: 100 Error Score = 20.0 pts
2 Failure Modes: 80 Error Score = 16.0 pts
3 Failure Modes: 60 Error Score = 12.0 pts
4 Failure Modes: 40 Error Score = 8.0 pts
5 Failure Modes: 20 Error Score = 4.0 pts

The Final Verdict

By combining these three signals, Vouqis categorizes servers into three explicit operational tiers:

80–100: ✓ APPROVED. Stable, compliant, safe to integrate directly into production workflows.
50–79: △ RISKY. Functional, but contains edge-case vulnerabilities or latency spikes that require engineering attention before exposure to live users.
0–49: × DO NOT INTEGRATE. Fundamental protocol violations or severe fragility. The server will actively degrade your agent suites.
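The weights, latency bands, error spread values, and tiers described above can be combined into a short worked example. This is a reconstruction from the numbers in this post, not the actual Vouqis source. Two details are assumptions: a P50 above 8,000ms scores 0 (no band is given for it), and scores falling between two tiers take the lower tier.

```python
# (upper bound in ms, latency score) taken from the P50 table above.
LATENCY_BANDS = [(500, 100), (1000, 90), (2000, 75), (4000, 50), (8000, 25)]
# Distinct failure modes triggered -> error spread score.
ERROR_SPREAD = {0: 100, 1: 100, 2: 80, 3: 60, 4: 40, 5: 20}

def latency_score(p50_ms):
    for limit_ms, score in LATENCY_BANDS:
        if p50_ms <= limit_ms:
            return score
    return 0  # assumption: the post gives no band above 8,000ms

def trust_score(passed, total, p50_ms, failure_modes):
    """Weighted sum: 50% pass rate, 30% latency, 20% error spread."""
    pass_rate = 100 * passed / total
    score = (0.5 * pass_rate
             + 0.3 * latency_score(p50_ms)
             + 0.2 * ERROR_SPREAD[failure_modes])
    return round(score, 1)

def verdict(score):
    if score >= 80:
        return "APPROVED"
    if score >= 50:
        return "RISKY"
    return "DO NOT INTEGRATE"
```

Plugging in the Exa run from earlier (9 of 10 probes passed, 691ms typical response time, one failure mode) gives 45 + 27 + 20 = 92, which matches the reported score and the APPROVED verdict.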

Architectural Lessons & Startup Realities

Building a lightweight dev tool sounds straightforward, but keeping it 100% deterministic while mapping an incredibly dynamic landscape forced a few tough engineering trade-offs.

1. Resisting the Temptation of LLM-Based Testing

When designing the testing harness, the easiest path would have been using an LLM to generate creative test cases based on the target server's schema. We intentionally rejected that approach.

Using an LLM introduces non-deterministic flake, increases test runtime from seconds to minutes, and introduces external API cost barriers. By building raw, deterministic JSON-RPC injection templates directly in TypeScript, we kept the engine incredibly fast, completely free to run locally, and perfectly reproducible in isolated CI/CD pipelines.

2. The Monorepo and Workspace Playbook

We built Vouqis as a clean, unified TypeScript monorepo splitting the codebase into distinct packages: the core testing engine, the CLI harness, and the web platform layout.

In our early iterations, managing dependency linking across local packages caused major compilation friction during build pipelines. Migrating directly to native npm/Yarn workspaces and configuring explicit root-level scripts for typechecking and cross-building stabilized our local environment and streamlined package publication.

3. Fighting the "Silent Success" Epidemic

The biggest challenge wasn't writing the probes—it was parsing the wildly unpredictable ways different engineering teams implement the MCP specification.

Many custom-built servers don't follow proper error reporting paradigms; they intercept a crash and bubble up an empty string inside a successful payload structure. Teaching our core engine to treat an implicit "empty content success" as a structural failure required writing strict validation rules that inspect the deep structural schema of the response, rather than trusting the top-level status keys.
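As an illustration of that kind of rule, here is a sketch of a check that treats an "empty content success" as a failure. It is written in Python for brevity and is not the Vouqis engine's actual code. The helper name is hypothetical, and the choice to count any non-text item as real content is an assumption.

```python
def is_empty_success(result):
    """True when a 'successful' result carries no usable content."""
    if not isinstance(result, dict):
        return True  # null or non-object result
    content = result.get("content")
    if not isinstance(content, list) or not content:
        return True  # missing or empty content array
    for item in content:
        if not isinstance(item, dict):
            continue  # malformed items do not count as content
        # Assumption: non-text items (images, resources) are real content.
        if item.get("type") != "text":
            return False
        text = item.get("text")
        if isinstance(text, str) and text.strip():
            return False
    return True
```

A result whose only item is a text block containing an empty or whitespace-only string is reported as empty, which is the case a top-level status check misses.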

Getting Started in 3 Steps

We wanted the developer experience to feel as frictionless as possible. There are no API keys to configure, no local configuration files to manage, and no dependencies to stitch together.

Step 1: Install the CLI globally

npm install -g @vouqis/cli

Step 2: Run an audit against any live server URL

vouqis audit https://mcp.exa.ai/mcp

Step 3: Block broken updates in your CI/CD workflow

You can integrate Vouqis directly into your GitHub Actions or deployment pipelines to automatically drop builds if a dependent server's reliability dips below your quality threshold:

Fail the pipeline if the server trust score drops below 80:

vouqis audit https://mcp.exa.ai/mcp --fail-below 80

Save full structural probe results directly to a JSON file for custom reporting:

vouqis audit https://mcp.exa.ai/mcp --json-path ./results.json

Extract the raw numeric score directly for custom shell scripting:

vouqis score https://mcp.exa.ai/mcp

The Road Ahead

Building an open ecosystem requires building a transparent infrastructure layer. As AI agent architectures migrate from cool weekend projects into core business operations, the tools powering them must be held to traditional software engineering standards.

We are actively expanding Vouqis to support deeper stateful tracking, security fuzzing templates, and real-time proxy monitoring.

Have you encountered silent failures while working with custom or third-party MCP servers?
What metrics do you care about most when integrating external tools into your agent workflows?

Install the CLI, run an audit against your active server setups, and share your terminal outputs or feedback in the comments below! Let’s build a more reliable agentic ecosystem together.
