We built 118 MCP tools for our SaaS and organized them into 19 files. Then we ran 722 manual test scenarios by talking to Claude. What we found changed how we think about MCP servers.
Unit tests verify that code does what you wrote. They do not verify that your data makes sense.
The accidental discovery
We built the MCP server for data entry - "record this expense," "create this invoice." Standard operations. Then we started asking different questions. Not "create X" but "show me everything that looks wrong."
The first one: "Show me all transactions without a category." Claude called list-transactions and returned entries we had forgotten about. Expenses recorded but never categorized. Not a code bug. A data gap. Invisible to every test we run.
That was the moment we realized: an AI with read access to multiple domains is a QA engineer. Not the kind that writes Playwright tests - the kind that audits your actual production data.
Cross-domain queries humans don't think to run
Here are the checks we now run regularly. Each one targets the boundary between two domains where no single test has jurisdiction.
Stuck state machines
"Which invoices have been in Sent status for more than 30 days?"
Claude calls list-invoices filtered by status, then checks dates. An invoice sitting in "Sent" for 47 days should have transitioned to "Overdue" automatically. If it didn't, either the scheduled job failed, the client record was deleted, or there's a timezone bug. The AI doesn't fix these - it surfaces them.
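The check itself is trivial once the data is in hand. A minimal sketch of the logic Claude effectively performs - the Invoice shape and field names here are illustrative, not our actual schema:

```typescript
// Hypothetical invoice shape; field names are illustrative.
interface Invoice {
  id: string;
  status: "Draft" | "Sent" | "Paid" | "Overdue";
  sentAt: Date | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Invoices stuck in "Sent" longer than maxDays: candidates for a
// failed transition job rather than legitimately pending payments.
function stuckInvoices(invoices: Invoice[], now: Date, maxDays = 30): Invoice[] {
  return invoices.filter(
    (inv) =>
      inv.status === "Sent" &&
      inv.sentAt !== null &&
      now.getTime() - inv.sentAt.getTime() > maxDays * DAY_MS
  );
}
```

The point is not the code - it's that no one had written this filter anywhere, because no single feature needed it.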
Balance reconciliation
"Does the bank account balance match the sum of its transactions?"
Claude calls get-account-balances for current balances, then list-transactions filtered by account to sum recorded entries. If the numbers diverge, a transaction was missed during import or a balance sync failed silently. Both systems pass their own tests. The inconsistency lives between them.
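In code terms the reconciliation reduces to a sum-and-compare. A sketch with invented types (the real tools return richer objects, and real money code should use integer cents, not floats):

```typescript
interface Transaction {
  accountId: string;
  amount: number; // signed: deposits positive, withdrawals negative
}

// Compare a reported balance against the sum of recorded transactions.
// A nonzero result means a transaction was missed or a sync failed silently.
function balanceDrift(
  reported: number,
  transactions: Transaction[],
  accountId: string
): number {
  const recorded = transactions
    .filter((t) => t.accountId === accountId)
    .reduce((sum, t) => sum + t.amount, 0);
  return reported - recorded;
}
```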
Revenue consistency
"Compare total invoiced revenue with total recorded income in accounting."
Invoice totals are computed from line items. Accounting revenue comes from payment transactions. If these two numbers disagree, a payment was recorded without matching an invoice, or an invoice was paid but the income transaction was never created.
Ghost references
"Show me products referenced in invoices but marked as archived."
An archived product shouldn't appear on active invoices. If it does, the archive operation didn't cascade properly, or the invoice was created in a narrow window between the product lookup and the archive. Edge cases that unit tests don't model because they cross entity boundaries.
Dormant clients
"Which clients have zero invoices in the last 6 months?"
Not a bug - a business insight. But it's the same query pattern: cross-domain, read-only, using filters that no UI was designed to combine.
722 test scenarios: what we actually found
We didn't stop at ad hoc questions. We wrote 722 acceptance test scenarios and ran every one through Claude. Each tool got happy path tests, error handling, boundary values, unicode input, and cross-tool E2E flows.
We categorized results as PASS, SKIP, BUG-FIXED, or KNOWN-LIMITATION. The BUG-FIXED category is where it gets interesting.
Bug class 1: Zod coercion
MCP sends parameters as strings over JSON-RPC. Six of our list tools used z.number() for pagination params. Claude sends "3" as a string. Zod rejects it. The tools worked in our test suite (which passes native numbers) but failed through Claude.
Fix: z.number() to z.coerce.number() across all pagination params.
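To see the failure mode without the rest of the stack, here's a hand-rolled stand-in for the two schema behaviors (the real fix is the one-line Zod change above; these function names are ours):

```typescript
// Over JSON-RPC, every parameter arrives as a string.
const wireParams = { limit: "3" };

// Equivalent of z.number(): reject anything that isn't already a number.
function strictNumber(value: unknown): number {
  if (typeof value !== "number") throw new Error("Expected number");
  return value;
}

// Equivalent of z.coerce.number(): convert first, then validate.
function coercedNumber(value: unknown): number {
  const n = Number(value);
  if (Number.isNaN(n)) throw new Error("Expected number");
  return n;
}

// strictNumber(wireParams.limit)  -> throws: the tool breaks through Claude
// coercedNumber(wireParams.limit) -> 3: works for both transports
```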
This is the kind of bug that automated tests miss because they test the code path, not the transport path. The Zod schema was correct TypeScript. It just didn't match the real-world input format.
Bug class 2: Required fields on partial updates
Three update tools (update-company, update-client, update-product) had name: z.string() in their schema. Name was required. But for a partial update, you might only want to change the phone number. Claude would send { phone: "..." } and Zod would reject it because name was missing.
Fix: z.string() to z.string().optional() for every field that shouldn't be mandatory on update.
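The before/after behavior, sketched without Zod - validateClientUpdate and its two fields are illustrative stand-ins for the real schemas:

```typescript
interface ClientUpdate {
  name?: string;
  phone?: string;
}

// Before the fix: name was required, so a phone-only update was rejected.
// After: every field is optional, but an empty update is still an error.
function validateClientUpdate(input: Record<string, unknown>): ClientUpdate {
  const update: ClientUpdate = {};
  if (typeof input.name === "string") update.name = input.name;
  if (typeof input.phone === "string") update.phone = input.phone;
  if (Object.keys(update).length === 0) {
    throw new Error("At least one field is required");
  }
  return update;
}
```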
Bug class 3: Missing ownership checks (IDOR)
update-document-status didn't verify that the status belonged to the authenticated team. A user could theoretically update another team's status by guessing the UUID. The web UI never exposed this because it only showed the user's own statuses. The MCP tool, being a raw API, had no such guard.
Fix: add teamId to the WHERE clause in the repository query. Same fix applied to delete-document-status.
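The shape of the fix, sketched over an in-memory store - the real code goes through Prisma, and these names are illustrative:

```typescript
interface DocumentStatus {
  id: string;
  teamId: string;
  name: string;
}

// Before: the lookup matched on id alone, so any valid UUID worked.
// After: ownership is part of the lookup itself, so a foreign UUID
// behaves exactly like a nonexistent one.
function updateStatus(
  store: DocumentStatus[],
  id: string,
  teamId: string,
  name: string
): DocumentStatus | null {
  const status = store.find((s) => s.id === id && s.teamId === teamId);
  if (!status) return null; // not found OR not yours: same answer
  status.name = name;
  return status;
}
```

Returning the same "not found" result for both cases also avoids leaking which UUIDs exist.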
This is the most important category. MCP tools are public API endpoints. If your web UI has implicit security (only showing the user's data), your MCP tools need explicit security (checking ownership in every query). The acceptance tests caught two IDOR vulnerabilities that would have passed code review.
Bug class 4: Method name mismatches
update-invoice called service.update() but the actual method was service.updateDocument(). TypeScript didn't catch it because the DI wiring used as never to bypass a complex generic. The tool compiled fine and would crash at runtime.
convert-estimate-to-invoice used require() instead of a static import, hiding another type mismatch. ESM context + require() = silent failure.
Fix: remove require(), use proper DI factory functions, fix the method names.
Bug class 5: Archived entities in list results
list-clients, list-companies, and list-products returned archived entities by default. The web UI filtered them out in the component. The MCP tool didn't. Claude would show the user a client they had already archived, they'd try to create an invoice for it, and get a confusing error.
Fix: filter archived entities by default, add includeArchived param for explicit requests.
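The default-filtering pattern, sketched - listClients and the flag name mirror the fix, but the types are invented:

```typescript
interface Client {
  id: string;
  name: string;
  archivedAt: Date | null;
}

// Archived entities are hidden unless the caller explicitly opts in.
function listClients(
  clients: Client[],
  opts: { includeArchived?: boolean } = {}
): Client[] {
  if (opts.includeArchived) return clients;
  return clients.filter((c) => c.archivedAt === null);
}
```

The safe behavior is the default; the surprising behavior requires an explicit flag.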
The pattern: MCP tests the full stack
What these 5 bug classes have in common: they all live at integration boundaries that unit tests don't cross.
Unit test: UseCase -> Mock Repository -> Assert result
MCP test: Claude -> JSON-RPC -> Zod -> UseCase -> Prisma -> Response -> Claude
What unit tests verify:
- Business logic
- Input validation
- Error handling
What MCP tests add:
- Transport coercion (string -> number)
- Schema completeness (required vs optional)
- Authorization at the API boundary (not UI boundary)
- Response format (can the AI actually use the output?)
- Cross-tool data integrity (tool A's output feeds tool B)
What MCP gives you that SQL doesn't
You could run consistency checks with SQL. The difference isn't capability - it's friction.
Authorization is built into every tool. Every MCP tool in our server filters by team automatically. With raw SQL, you write WHERE team_id = X on every query and hope nobody forgets.
No schema knowledge required. In our database, Account is the NextAuth OAuth model while FinancialAccount is a bank account, categories have polymorphic ownership (teamId OR userId, never both), and invoices and estimates live in separate tables with different schemas. The MCP tools abstract all of this away. You ask for "invoices" and get invoices.
Checks evolve without code changes. When you think of a new consistency check, you type it in English. "Which clients have invoices but no recorded payments in the last 90 days?" That check didn't exist before you asked - and it required no engineering work to create.
How to do this with your own MCP server
1. Identify boundary questions. Where should two domains agree? Revenue in invoices vs. revenue in accounting. Document access logs vs. NDA signatures. User counts vs. subscription seats.
2. Run 722 scenarios (or start with 10). Pick your most complex tool. Test: happy path, missing required params, invalid UUIDs, cross-team access, boundary values (amount: 0, limit: 101), unicode (Cyrillic descriptions, emoji). You'll find bugs on the first day.
3. Categorize what you find. PASS, BUG-FIXED, KNOWN-LIMITATION, SKIP. Track it. The BUG-FIXED count tells you how much value this approach adds. Ours was high enough that we made acceptance testing a standard step before every MCP release.
4. Make it routine. Ask your AI the boundary questions weekly. The data drifts. The questions stay relevant.
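If you want to track outcomes in code rather than a spreadsheet, a minimal record shape is enough - this structure is our suggestion, not a standard:

```typescript
type Outcome = "PASS" | "SKIP" | "BUG-FIXED" | "KNOWN-LIMITATION";

interface Scenario {
  tool: string;
  description: string;
  outcome: Outcome;
}

// The BUG-FIXED count is the headline number: it measures how many
// real defects the acceptance pass surfaced.
function tally(scenarios: Scenario[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = {
    PASS: 0,
    SKIP: 0,
    "BUG-FIXED": 0,
    "KNOWN-LIMITATION": 0,
  };
  for (const s of scenarios) counts[s.outcome]++;
  return counts;
}
```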
Try it
claude mcp add --transport http paperlink https://mcp.paperlink.online/api/mcp/mcp