Kunal Thorat

Posted on Mar 17

OpenAI Just Acquired the Best AI Testing Tool. MCP Developers Are on Their Own.

Last week, OpenAI acquired Promptfoo — the open-source platform that 130,000 developers and 25% of the Fortune 500 relied on to test, red-team, and secure their AI applications. The 23-person team, backed by a16z and Insight Partners, is joining OpenAI to build security testing into their enterprise platform, OpenAI Frontier.

Promptfoo will stay open-source. But make no mistake: its roadmap now serves OpenAI's priorities.

This raises an uncomfortable question for anyone building on the Model Context Protocol: who's testing your MCP servers?

The MCP Quality Crisis Nobody Talks About

MCP has won. 97 million monthly SDK downloads. Adopted by Anthropic, OpenAI, Google, Microsoft, Apple. Over 16,000 servers registered across npm and GitHub. Every major AI agent framework speaks MCP.

But quantity is not quality. Independent research tells a grim story:

92% exploitation probability when an agent loads just 10 MCP plugins (VentureBeat)
The first malicious MCP server was found on npm in September 2025 — it silently BCC'd every email to an attacker
A trojanized health data MCP server appeared in February 2026
MCPTox (academic research) found a 72.8% attack success rate for tool poisoning on real MCP servers using o1-mini
88% of MCP servers require credentials, and 53% store them as insecure static secrets

The MCP Inspector — Anthropic's official debugging tool — is great for interactive exploration. But it doesn't do automated testing. It doesn't scan for security vulnerabilities. It doesn't run in CI. It doesn't generate mock servers for your team.

There is no Testing Working Group in the MCP governance structure. No official test framework. No quality gates.

If you're shipping an MCP server today, you're probably testing it with console.log and hope.

What Promptfoo Did (and Didn't Do)

Promptfoo was excellent at testing LLM applications broadly — prompt evaluation, red-teaming, jailbreak detection, regression testing across model versions. It worked with OpenAI, Anthropic, Gemini, local models.

But Promptfoo was never built for MCP. It didn't understand MCP's transport layer (stdio, SSE, streamable-HTTP). It couldn't introspect MCP tool schemas. It didn't detect MCP-specific vulnerabilities like Tool Poisoning — where malicious instructions are hidden in tool descriptions that LLMs blindly follow.

MCP servers have a fundamentally different testing surface than prompt chains:

What you need to test	Prompt chains (Promptfoo)	MCP servers
Input/output correctness	Prompt → response	Tool call → structured result
Schema validation	N/A	JSON Schema for every tool input
Transport reliability	HTTP only	stdio, SSE, HTTP — each with different failure modes
Security surface	Prompt injection, jailbreaks	Tool Poisoning, Excessive Agency, path traversal, injection, auth bypass
Regression detection	Output drift across model versions	Response drift across server versions
CI/CD integration	Model-dependent, non-deterministic	Deterministic — no LLM in the loop

MCP server testing is a different problem. It needs a different tool.

MCPSpec: The Testing Platform MCP Has Been Missing

MCPSpec is an open-source CLI that does for MCP servers what Promptfoo did for LLM applications — testing, security scanning, performance profiling, and CI/CD integration — but purpose-built for the Model Context Protocol.

No LLMs in the loop. Deterministic and fast. Here's what it does:

Record, Replay, Mock — No Test Code Required

# Record a session against your real server
mcpspec record start "npx my-server"
mcpspec> .call get_user {"id": "1"}
mcpspec> .call list_items {}
mcpspec> .save my-api

# Ship a new version? Replay and see what changed
mcpspec record replay my-api "npx my-server-v2"
# Output: 2 matched, 1 changed, 0 added, 0 removed

# Generate a mock for CI — no API keys, no live server
mcpspec mock my-api --generate ./mocks/server.js

Your team runs tests against the mock. Your CI pipeline gates on it. Nobody needs credentials for the real service.

Security Audit — Catch Tool Poisoning Before It Catches You

mcpspec audit "npx my-server" --fail-on medium

8 security rules including two MCP-specific threats that no other tool checks:

Tool Poisoning — Detects prompt injection hidden in tool descriptions: suspicious instructions ("ignore previous instructions"), hidden Unicode characters, cross-tool manipulation, embedded code blocks
Excessive Agency — Flags destructive tools (delete_*, drop_*) without confirmation parameters, tools that accept arbitrary code, overly broad schemas

Passive mode analyzes metadata only — safe to run against production. Active mode sends test payloads (with confirmation prompts and auto-skip for destructive tools).

MCP Score — A Quality Rating for Every Server

mcpspec score "npx my-server" --badge ./badge.svg

A 0-100 quality score across 5 categories:

Category	Weight	What it measures
Documentation	25%	Tool descriptions, parameter docs
Schema Quality	25%	Types, constraints, naming conventions
Error Handling	20%	Graceful failures, informative errors
Responsiveness	15%	Latency under load
Security	15%	Vulnerability scan results

Generate a badge for your README. Fail CI builds below a threshold. Give users a reason to trust your server.

CI/CD — One Command

mcpspec ci-init --platform github --checks test,audit,score

Generates a complete GitHub Actions workflow (or GitLab CI, or shell script) with test, security audit, and quality score gates. Deterministic exit codes. JUnit/JSON/TAP reporters.

Test Collections — When You Need More Control

name: My Server Tests
server: npx my-mcp-server

tests:
  - name: Read a file
    call: read_file
    with:
      path: /tmp/test.txt
    expect:
      - exists: $.content
      - type: $.content
        expected: string

  - name: Handle missing file gracefully
    call: read_file
    with:
      path: /tmp/nonexistent.txt
    expectError: true

10 assertion types. Environments and variables. Tags for filtering. Parallel execution. Retries. Baseline comparisons. Ships with 70 pre-built tests for 7 popular MCP servers.

Why This Matters Now

The Promptfoo acquisition confirms what was already obvious: AI testing and security is not optional infrastructure. It's a requirement.

OpenAI spent millions to acquire it. Every Fortune 500 company evaluating AI agents asks the same question: "How do we know this is safe?"

For MCP specifically, there is no answer today. The protocol is everywhere. The quality infrastructure is nowhere.

MCPSpec is MIT-licensed, CLI-first, works offline, and runs without an account. It's built for the developers who are actually shipping MCP servers and need them to be reliable.

Get started:

npm install -g mcpspec

# Try it on the filesystem server in 10 seconds
mcpspec inspect "npx @modelcontextprotocol/server-filesystem /tmp"

MCPSpec is an independent open-source project. It is not affiliated with OpenAI, Anthropic, or the Promptfoo team.

DEV Community