Sasi Sundar

Posted on May 27

Building VOUQIS: How We Built the Trust Layer for the MCP Ecosystem

#ai #opensource #api #startup

Your AI agent calls a Model Context Protocol (MCP) server. The server returns a standard 200 OK. The agent logs a generic "success" message. But your customer sees an empty UI, a hung loading spinner, or worse—a catastrophic failure.

This is the hidden crisis of the burgeoning AI agent ecosystem.

When we ran a stress test across 100 production MCP servers, the data exposed a brutal reality:

The median server passes only 71% of tool calls. The rest return silent, empty responses with zero explicit errors.
Chained dependencies compound this. Running 5 tools in sequence at a 71% success rate drops your end-to-end reliability to a measly 18%.
Standard API monitoring tools remain completely blind to this. Because the network and HTTP layers look perfectly healthy, your uptime dashboards stay green while your user experience burns.
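The 18% figure follows from multiplying the per-call success rate across the chain. Here is a minimal sketch of that arithmetic, assuming each call succeeds independently at the same rate (the function name is ours, for illustration only):

```typescript
// Probability that every call in a sequential chain succeeds, assuming each
// call succeeds independently with the same per-call success rate.
function chainReliability(perCallSuccess: number, chainLength: number): number {
  if (perCallSuccess < 0 || perCallSuccess > 1) {
    throw new RangeError("perCallSuccess must be between 0 and 1");
  }
  if (!Number.isInteger(chainLength) || chainLength < 0) {
    throw new RangeError("chainLength must be a non-negative integer");
  }
  return Math.pow(perCallSuccess, chainLength);
}

// Five tools in sequence at a 71% per-call success rate.
const fiveToolChain = chainReliability(0.71, 5);
console.log(`${(fiveToolChain * 100).toFixed(1)}%`); // prints "18.0%"
```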

We built Vouqis to fix this. It is a zero-setup, 100% deterministic reliability engine that scores and gates MCP servers before they break your production stack. No SDK installations, no LLM call overhead, and no server-side changes required. Just paste a URL, run the probes, and protect your agents.

Here is the story of why the protocol is breaking in production, how we built a lightweight testing framework to solve it, and the engineering trade-offs we encountered along the way.

The Genesis: Falling Through the Protocol Cracks

The Model Context Protocol is a massive leap forward for agentic workflows. It gives LLMs a clean, standardized interface to interact with external data and tools. But standardizing the interface does not automatically standardize runtime behavior or engineering quality.

A few months ago, while orchestrating multi-agent suites for enterprise accounting and procurement automation, we hit a wall. Agents would work flawlessly in sandbox environments, but throw tantrums in production. They would stall out on basic tool executions or misinterpret empty arrays as valid context.

When we dug into the JSON-RPC layer, we realized that traditional monitoring tools are fundamentally unsuited for MCP. Traditional tools track latency and HTTP status codes. If an MCP server accepts a malformed payload but returns a 200 OK housing a silent protocol error, your monitoring suite logs it as a win.
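To make the gap concrete, here is a minimal sketch of the check an HTTP-level monitor skips: looking inside the JSON-RPC body even when the status code is 2xx. The type and function names are hypothetical; the `error` member comes from JSON-RPC 2.0 and the `isError` flag from MCP tool results.

```typescript
// Minimal shape of a JSON-RPC 2.0 response that may carry an MCP tool result.
interface JsonRpcResponse {
  jsonrpc?: unknown;
  id?: unknown;
  result?: { isError?: boolean; content?: unknown[] };
  error?: { code: number; message: string };
}

type Outcome = "ok" | "http_error" | "protocol_error" | "tool_error";

// Classify a response by looking past the HTTP status into the envelope.
function classifyResponse(httpStatus: number, body: JsonRpcResponse): Outcome {
  if (httpStatus < 200 || httpStatus >= 300) return "http_error";
  // A 2xx response can still carry a JSON-RPC error, or no result at all.
  if (body.error !== undefined || body.result === undefined) return "protocol_error";
  // MCP tools report execution failures inside a "successful" result.
  if (body.result.isError === true) return "tool_error";
  return "ok";
}

// HTTP says 200 OK, but the envelope says the call failed.
const outcome = classifyResponse(200, {
  jsonrpc: "2.0",
  id: 1,
  error: { code: -32602, message: "Invalid params" },
});
console.log(outcome); // prints "protocol_error"
```

A monitor that stops at the status code would record this call as healthy.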

The industry is waking up to these gaps. The 2026 Zuplo MCP Report noted that 38% of MCP developers name security and reliability concerns as the primary blocker to production adoption. There are also documented production vulnerabilities across the ecosystem:

High-profile path traversals exposing thousands of hosted API keys.
Critical CVEs (like CVE-2025-6514 in mcp-remote) introducing massive Remote Code Execution attack surfaces.
Cross-tenant data leaks exposing client environments for weeks.

We needed a tool that could fire real protocol probes directly at the JSON-RPC layer to audit compliance, stress-test boundaries, and generate a clear, actionable trust score. When we couldn't find one, we spent a fast-paced few weeks building it ourselves.

What Vouqis Does (And What It Tests)

Vouqis runs 10 deterministic probes across 5 specific failure modes in under 30 seconds. It doesn’t use flaky LLM calls to test your infrastructure; it relies entirely on rigid protocol validation.

The Anatomy of an Audit Run

When you point the Vouqis CLI at an MCP server URL, it auto-discovers the available tools, constructs minimal valid inputs based on the exposed schemas, and intentionally injects edge cases.
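As an illustration of the "minimal valid input" step, here is a sketch that builds the smallest argument object satisfying a tool's JSON Schema. It is not the Vouqis implementation: the `SchemaNode` type, the `minimalValue` function, and the placeholder values are assumptions, and it handles only a small subset of JSON Schema.

```typescript
// A small subset of JSON Schema, enough for typical MCP tool input schemas.
interface SchemaNode {
  type?: string;
  properties?: Record<string, SchemaNode>;
  required?: string[];
  enum?: unknown[];
  default?: unknown;
}

// Build the smallest value that satisfies the schema: fill only required
// properties, and prefer declared defaults and enum members over placeholders.
function minimalValue(schema: SchemaNode): unknown {
  if (schema.default !== undefined) return schema.default;
  if (schema.enum !== undefined && schema.enum.length > 0) return schema.enum[0];
  switch (schema.type) {
    case "string":
      return "test"; // placeholder value, chosen arbitrarily
    case "number":
    case "integer":
      return 1;
    case "boolean":
      return false;
    case "array":
      return [];
    case "object": {
      const out: Record<string, unknown> = {};
      for (const key of schema.required ?? []) {
        out[key] = minimalValue(schema.properties?.[key] ?? {});
      }
      return out;
    }
    default:
      return null;
  }
}

// Hypothetical search tool: only "query" is required, so only it is filled in.
const searchSchema: SchemaNode = {
  type: "object",
  properties: { query: { type: "string" }, numResults: { type: "integer" } },
  required: ["query"],
};
console.log(JSON.stringify(minimalValue(searchSchema))); // prints {"query":"test"}
```

Edge cases can then be derived from this baseline, for example by dropping a required key or swapping a value's type.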

Take a live audit run against a production instance like mcp.exa.ai/mcp. Running a basic test yields immediate, definitive insights:

vouqis audit https://mcp.exa.ai/mcp

The CLI prints a clean, interactive breakdown to the terminal:

VOUQIS — audit — https://mcp.exa.ai/mcp

✓ Connected – found 2 tools
Running 10 reliability tests against https://mcp.exa.ai/mcp

[██████████████████████████░░░░░░] 10 / 10
✓ 9 X 1

Vouqis Trust Score Report

Server https://mcp.exa.ai/mcp
Score 92 / 100 [██████████████████████████████░░]
Tests passed 9 of 10 (90%)
Response time 691ms typical · target <500ms

What failed:
X Did not reject invalid requests · 1 time
– Server accepted malformed JSON-RPC (HTTP 202)

report written → ./vouqis-report.json
view traces: https://www.vouqis.tech

✓ APPROVED – this server passed all reliability tests

What this tells you immediately: The Exa server is highly resilient (scoring 92/100, which earns an APPROVED verdict), but it silently accepts malformed JSON-RPC envelopes with an HTTP 202 instead of explicitly returning a protocol error. Any agent sending poorly formed requests will hit a silent wall instead of getting a clear failure signal. That is a one-line fix, and it takes one audit to catch it.

Deep Dive: The Trust Score Algorithm

To make these audits useful for CI/CD gates, we couldn’t just output a wall of logs. We needed a single, standardized, deterministic index. Every Vouqis run calculates a 0–100 Trust Score based on three distinct, weighted operational signals:

Pass Rate (50% Weight): The fraction of the 10 core protocol probes answered correctly.
Response Time (30% Weight): The median (P50) response time across all tool calls.
Error Spread (20% Weight): A penalty based on how many distinct failure modes are triggered.

Why Median (P50) Latency Matters

A common question we get is why we anchor our latency scoring to P50 instead of P95 or P99. Across the wider industry, MCP server P50 latencies frequently run up to 1,840ms, and P99 spikes can easily clear 6,200ms.

If your P50 is already above 500ms during a basic audit probe, your P99 tail latency, which by definition is at least as high and in practice is often several times higher, is very likely to cause visible degradation in a multi-turn agent conversation.

We map the P50 metric directly to point deductions:

P50 Response Time	Latency Score	Points Contributed
<= 500ms	100	30.0 pts
<= 1,000ms	90	27.0 pts
<= 2,000ms	75	22.5 pts
<= 4,000ms	50	15.0 pts
<= 8,000ms	25	7.5 pts
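The table can be read as a simple band lookup on the median. Here is a minimal sketch of that mapping, not the Vouqis source: the function names and sample latencies are ours, and because the table lists no band above 8,000ms, the score of 0 for slower servers is an assumption.

```typescript
// Median (P50) of a list of response times in milliseconds.
function median(samplesMs: number[]): number {
  if (samplesMs.length === 0) throw new Error("need at least one sample");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Upper bound of each band (inclusive) and the latency score it earns.
const LATENCY_BANDS: ReadonlyArray<{ maxMs: number; score: number }> = [
  { maxMs: 500, score: 100 },
  { maxMs: 1000, score: 90 },
  { maxMs: 2000, score: 75 },
  { maxMs: 4000, score: 50 },
  { maxMs: 8000, score: 25 },
];

function latencyScore(p50Ms: number): number {
  for (const band of LATENCY_BANDS) {
    if (p50Ms <= band.maxMs) return band.score;
  }
  return 0; // ASSUMPTION: the table does not define a band above 8,000ms
}

// Latency carries 30% of the Trust Score, so a score of 100 is worth 30 points.
function latencyPoints(p50Ms: number): number {
  return (latencyScore(p50Ms) * 30) / 100;
}

// Illustrative samples whose median matches the 691ms from the Exa run.
const p50 = median([640, 691, 720]);
console.log(p50, latencyScore(p50), latencyPoints(p50)); // prints 691 90 27
```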

Calculating the Error Spread Penalty

A server that fails 4 times under a single failure mode (e.g., a systemic timeout issue) usually points to a single bottleneck. A server that fails 4 times across 4 completely different failure modes is architecturally fragile.

To account for this, the Error Spread score drops sharply as more distinct failure categories are tripped:

0 or 1 Failure Modes: 100 Error Score = 20.0 pts
2 Failure Modes: 80 Error Score = 16.0 pts
3 Failure Modes: 60 Error Score = 12.0 pts
4 Failure Modes: 40 Error Score = 8.0 pts
5 Failure Modes: 20 Error Score = 4.0 pts
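The steps above amount to counting distinct failure categories rather than raw failures. Here is a minimal sketch under that reading; the function names and the failure-mode labels in the example are hypothetical.

```typescript
// Error Spread score from the number of distinct failure modes (0 to 5).
function errorSpreadScore(distinctFailureModes: number): number {
  if (!Number.isInteger(distinctFailureModes) || distinctFailureModes < 0 || distinctFailureModes > 5) {
    throw new RangeError("distinctFailureModes must be an integer from 0 to 5");
  }
  // 0 or 1 modes keep the full score; each additional mode costs 20.
  return distinctFailureModes <= 1 ? 100 : 100 - 20 * (distinctFailureModes - 1);
}

// Error Spread carries 20% of the Trust Score.
function errorSpreadPoints(failedProbeModes: string[]): number {
  const distinct = new Set(failedProbeModes).size;
  return (errorSpreadScore(distinct) * 20) / 100;
}

// Four failures in one mode versus four failures in four different modes.
const oneBottleneck = errorSpreadPoints(["timeout", "timeout", "timeout", "timeout"]);
const fragile = errorSpreadPoints(["timeout", "empty_response", "schema_violation", "bad_envelope"]);
console.log(oneBottleneck, fragile); // prints 20 8
```

Both servers fail four probes, so their pass rates are identical; the spread term is what separates them.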

The Final Verdict

By combining these three signals, Vouqis categorizes servers into three explicit operational tiers:

80–100: ✓ APPROVED. Stable, compliant, safe to integrate directly into production workflows.
50–79: △ RISKY. Functional, but contains edge-case vulnerabilities or latency spikes that require engineering attention before exposure to live users.
0–49: × DO NOT INTEGRATE. Fundamental protocol violations or severe fragility. The server will actively degrade your agent suites.
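Putting the weights and tiers together, here is a minimal sketch that reproduces the Exa result from the audit above. The function names are ours, and comparing the unrounded score against the tier boundaries is an assumption; the weights and thresholds are the ones stated in this post.

```typescript
type Verdict = "APPROVED" | "RISKY" | "DO NOT INTEGRATE";

// Weighted sum of the three sub-scores, each on a 0-100 scale.
function trustScore(passRatePct: number, latencyScore: number, errorSpreadScore: number): number {
  for (const value of [passRatePct, latencyScore, errorSpreadScore]) {
    if (value < 0 || value > 100) throw new RangeError("sub-scores must be between 0 and 100");
  }
  return (passRatePct * 50 + latencyScore * 30 + errorSpreadScore * 20) / 100;
}

// ASSUMPTION: tiers are checked against the unrounded score.
function verdictFor(score: number): Verdict {
  if (score >= 80) return "APPROVED";
  if (score >= 50) return "RISKY";
  return "DO NOT INTEGRATE";
}

// Exa run: 9 of 10 probes passed (90), P50 of 691ms (latency score 90),
// one distinct failure mode (error spread score 100).
const exaScore = trustScore(90, 90, 100); // 45 + 27 + 20
console.log(exaScore, verdictFor(exaScore)); // prints 92 APPROVED
```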

Architectural Lessons & Startup Realities

Building a lightweight dev tool sounds straightforward, but keeping it 100% deterministic while mapping an incredibly dynamic landscape forced a few tough engineering trade-offs.

1. Resisting the Temptation of LLM-Based Testing

When designing the testing harness, the easiest path would have been using an LLM to generate creative test cases based on the target server's schema. We intentionally rejected that approach.

Using an LLM introduces non-deterministic flakiness, increases test runtime from seconds to minutes, and adds external API costs. By writing raw, deterministic JSON-RPC injection templates directly in TypeScript, we kept the engine fast, free to run locally, and reproducible in isolated CI/CD pipelines.
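To show what a deterministic injection template can look like, here is a minimal sketch. The template list and names are hypothetical rather than the actual Vouqis probe set; the error codes -32700 (parse error) and -32600 (invalid request) come from the JSON-RPC 2.0 specification.

```typescript
interface ProbeTemplate {
  name: string;
  body: string; // raw request body, sent byte-for-byte on every run
}

// Fixed malformed envelopes: the same inputs every time, so results are reproducible.
const MALFORMED_ENVELOPES: readonly ProbeTemplate[] = [
  { name: "missing jsonrpc version", body: JSON.stringify({ id: 1, method: "tools/list" }) },
  { name: "wrong jsonrpc version", body: JSON.stringify({ jsonrpc: "1.0", id: 1, method: "tools/list" }) },
  { name: "missing method", body: JSON.stringify({ jsonrpc: "2.0", id: 1 }) },
  { name: "non-string method", body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: 42 }) },
  { name: "truncated JSON", body: '{"jsonrpc":"2.0","id":1,"method":' },
];

// A probe passes only if the server answers with an explicit JSON-RPC error.
function rejectedCorrectly(responseBody: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(responseBody);
  } catch {
    return false; // empty or non-JSON body: no explicit rejection
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const error = (parsed as { error?: unknown }).error;
  if (typeof error !== "object" || error === null) return false;
  const code = (error as { code?: unknown }).code;
  return code === -32700 || code === -32600;
}

console.log(MALFORMED_ENVELOPES.map((probe) => probe.name).join(", "));
console.log(rejectedCorrectly("")); // prints false: a silent accept fails the probe
```

Because nothing here depends on a model call, the same server state always yields the same pass or fail.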

2. The Monorepo and Workspace Playbook

We built Vouqis as a unified TypeScript monorepo, splitting the codebase into distinct packages: the core testing engine, the CLI harness, and the web platform.

In our early iterations, managing dependency linking across local packages caused major compilation friction during build pipelines. Migrating directly to native npm/Yarn workspaces and configuring explicit root-level scripts for typechecking and cross-building stabilized our local environment and streamlined package publication.

3. Fighting the "Silent Success" Epidemic

The biggest challenge wasn't writing the probes—it was parsing the wildly unpredictable ways different engineering teams implement the MCP specification.

Many custom-built servers don't follow proper error reporting paradigms; they intercept a crash and bubble up an empty string inside a successful payload structure. Teaching our core engine to treat an implicit "empty content success" as a structural failure required writing strict validation rules that inspect the deep structural schema of the response, rather than trusting the top-level status keys.
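A validation rule of this kind can be sketched as follows. This is an illustration, not the rule set Vouqis ships: the type and function names are ours, while the `content` array and `isError` flag follow the MCP tool result shape.

```typescript
interface ToolContentItem {
  type: string;
  text?: string;
  [key: string]: unknown;
}

interface ToolResult {
  isError?: boolean;
  content?: ToolContentItem[];
}

// True when a result claims success but carries nothing an agent could use.
function isSilentEmptySuccess(result: ToolResult): boolean {
  if (result.isError === true) return false; // explicit failure, not silent
  const items = result.content ?? [];
  if (items.length === 0) return true;
  // Non-text items (images, resources) count as real content; text items
  // count only if they contain something other than whitespace.
  return items.every((item) => item.type === "text" && (item.text ?? "").trim() === "");
}

// A swallowed crash surfaced as an empty string inside a "successful" payload.
const swallowedCrash: ToolResult = { content: [{ type: "text", text: "" }] };
console.log(isSilentEmptySuccess(swallowedCrash)); // prints true
```

The check deliberately ignores the top-level status and inspects only what the payload actually contains.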

Getting Started in 3 Steps

We wanted the developer experience to feel as frictionless as possible. There are no API keys to configure, no local configuration files to manage, and no dependencies to stitch together.

Step 1: Install the CLI globally

npm install -g @vouqis/cli

Step 2: Run an audit against any live server URL

vouqis audit https://mcp.exa.ai/mcp

Step 3: Block broken updates in your CI/CD workflow

You can integrate Vouqis directly into your GitHub Actions or deployment pipelines to automatically fail builds if a dependent server's reliability dips below your quality threshold:

# Fail the pipeline if the server trust score drops below 80
vouqis audit https://mcp.exa.ai/mcp --fail-below 80

# Save full structural probe results directly to a JSON file for custom reporting
vouqis audit https://mcp.exa.ai/mcp --json-path ./results.json

# Extract the raw numeric score directly for custom shell scripting
vouqis score https://mcp.exa.ai/mcp

The Road Ahead

Building an open ecosystem requires building a transparent infrastructure layer. As AI agent architectures migrate from cool weekend projects into core business operations, the tools powering them must be held to traditional software engineering standards.

We are actively expanding Vouqis to support deeper stateful tracking, security fuzzing templates, and real-time proxy monitoring.

Have you encountered silent failures while working with custom or third-party MCP servers?
What metrics do you care about most when integrating external tools into your agent workflows?

Install the CLI, run an audit against your active server setups, and share your terminal output or feedback in the comments below! Let’s build a more reliable agentic ecosystem together.

Top comments (1)

Harjot Singh • May 31

A trust layer for MCP is exactly the gap the ecosystem needs. As MCP servers proliferate, "can I trust this server with my data and my agent's actions" becomes the blocking question, and right now it's mostly vibes. The hard part: trust in MCP isn't just auth, it's provenance (who built this, has it been tampered), capability scoping (what can it actually do), and auditability (what did it actually do). An agent calling an untrusted MCP is a supply-chain risk with extra steps. Solving that is genuinely valuable. I care about exactly this in Moonshift: agents touching real systems need verifiable, scoped trust. What's the core trust primitive: signed servers, a registry, or runtime attestation?