I'm a QA engineer. After Claude wrote # TODO in my 100th test, I built an MCP server.

TL;DR: mk-qa-master is an open-source MCP server that lets Claude / Cursor / Codex / Gemini drive your real test suite — pytest, Jest, Cypress, Go test, and Maestro for mobile. 16 tools, 5 categories, a three-layer QA knowledge architecture. uvx-installable. MIT.


The moment I stopped blaming the model

The 5th time Claude wrote # TODO: add real selector here in a generated test, I tried a smarter prompt. The 20th time, I switched models. The 100th time, I stopped blaming the LLM.

I'm a QA engineer. I've watched LLMs write beautiful-looking test scaffolds for two years now, and every one of them collapses at the same place:

The model can read your code. It cannot see your live DOM, your mobile view hierarchy, your last 10 test runs, or that checkout-flow.spec.ts has been red 7 times in 14 days.

So it guesses. Guesses are how you get # TODO.

The fix isn't a smarter prompt. It's giving the LLM access to the things it's currently guessing about.

That's what the Model Context Protocol (MCP) is for. And that's why I built mk-qa-master.


What "AI for QA" usually means

Most AI-for-testing products today fall into one of three buckets:

  1. IDE plugins that emit test files — Copilot Tests, Cursor's test generator. Great in a screenshot. They write the file, you fix the selectors.
  2. "Just prompt ChatGPT" tutorials — works for one test, falls apart at ten. No persistence, no awareness of what's actually flaky, no runtime feedback.
  3. End-to-end AI testing SaaS — record-and-playback wrappers. They own your test infrastructure, charge per seat, and you're locked in.

What's missing from all three: the AI never touches the runner. It writes code; you run; you debug; you tell the AI what broke. It's a chatbot pretending to be an engineer.

The reframe: stop asking AI to write tests. Make it drive your test runner.


What MCP changes

MCP (introduced by Anthropic in late 2024, now adopted by Cursor, Codex CLI, Gemini CLI, Zed, Cline and others) lets an AI client call tools — not just see text, but trigger actions, read structured responses, chain them.

An MCP server is just a process that exposes tools. Drop it into your client config:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/project"
      }
    }
  }
}

…and now Claude has 16 new things it can do in your project: probe the DOM of a live URL, list your existing tests, generate new ones with real selectors, run them, read JUnit XML, write an optimization plan based on the last N runs.

Your runner just became part of the AI's tool surface.
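
If you're curious what's on the wire: every tool invocation is a plain JSON-RPC 2.0 message. Here's one sketched as a Python dict; the run_tests argument names are my illustration, not the server's actual schema:

# One tool call as the client sends it (MCP speaks JSON-RPC 2.0).
# The "arguments" keys below are illustrative; check each tool's real schema.
request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "run_tests",
        "arguments": {"path": "tests/test_login.py"},
    },
}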


mk-qa-master in 60 seconds

16 tools across 5 categories. You don't need to memorize names; the README has a cookbook of natural-language prompts that map to each chain.

| Category | Tools | What it does |
| --- | --- | --- |
| Discover | get_runner_info · list_tests · analyze_url · analyze_screen | Which framework is active. What tests exist. Probe a URL or a live mobile screen for form / nav / CTA modules with real selectors. |
| Generate | generate_test · auto_generate_tests · codegen · init_qa_knowledge · get_qa_context | Emit runnable pytest .py or Maestro .yaml. Not # TODO placeholders. |
| Run | run_tests · run_failed | Drive pytest / Jest / Cypress / Go test / Maestro. Auto-retry, JUnit XML, screenshots, Playwright trace.zip, Maestro recordings. |
| Report | get_test_report · get_failure_details · generate_html_report · get_test_history | Outcome history, error signatures, per-test flake scores. |
| Advise | get_optimization_plan | Three lenses: suite quality (flaky vs broken vs slow), MCP usability, AI effectiveness. Output is a ranked action list — what to fix next, with evidence. |

Switch frameworks with a single env var: QA_RUNNER=pytest | jest | cypress | go | maestro. Web and mobile share the same MCP surface — analyze_screen works on iOS Simulator, Android Emulator, real devices, and (yes) BlueStacks via adb connect.
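
The BlueStacks path is just adb. Connect to the port your instance exposes (5555 is a common default, but check the BlueStacks settings), then switch the runner:

adb connect 127.0.0.1:5555
# then set "QA_RUNNER": "maestro" in the mcpServers env block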


The part nobody else builds: a three-layer QA knowledge architecture

This is what makes mk-qa-master not monkey-testing.

A DOM-only analyzer produces "empty field should error" for every form on the internet. That's not testing, it's noise. To produce a test that means anything, the generator needs domain context. So I layered three:

Layer 1 — Built-in

ISTQB's seven principles, equivalence partitioning, decision tables, state transitions, the test pyramid, shift-left, mobile testing checklists, QA metrics — baked into the server. The AI gets methodology by default, not by accident.

Layer 2 — Your project's qa-knowledge.md

Drop a file at your project root with your business rules, historical bugs, standard assertion copy, user-journey snippets, technical constraints. init_qa_knowledge scaffolds one. The MCP loads it on every relevant tool call. This is where the "AI doesn't know my business" problem actually gets solved.
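
A few lines go a long way. An illustrative excerpt, not a real project's file:

## Business rules
- Discounts always render as "$X.XX". "NaN" anywhere in checkout is a bug.
- After 3 failed logins, show the account-recovery link, not a generic error.

## Standard assertion copy
- Empty password -> "Password is required"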

Layer 3 — Per-test inline

Pass a business_context slice into generate_test. It gets printed as a # Business context: block inside the generated test, so the next reviewer sees why this test exists without leaving the file.
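
Put together, a generated file looks roughly like this. A sketch assuming a pytest + Playwright setup (run_tests already collects Playwright traces); the selectors and copy are illustrative, not literal generate_test output:

# tests/test_login.py (illustrative sketch)
# Business context: after 3 failed logins, users must see the
# account-recovery link, not a generic error.
from playwright.sync_api import Page


def test_lockout_shows_recovery_link(page: Page):  # pytest-playwright fixture
    page.goto("https://your-site/login")
    for _ in range(3):
        page.fill("#email", "qa@example.com")      # selector from analyze_url
        page.fill("#password", "wrong-password")
        page.click("button[type=submit]")
    assert page.locator("a.recovery-link").is_visible()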

Three layers of context. One MCP. Pile them up and the AI stops producing "click the button, see something happen" garbage.


A real session

Here's what a Monday morning with this looks like:

you ▸ Test https://your-site/login — one runnable case per module

  → analyze_url ✓ 4 modules · 12 endpoints · 18 candidate cases
  → generate_test ✓ tests/test_login.py (4 cases)
  → run_tests ⚠ 3 passed, 1 failed
  → get_optimization_plan ✓ next priorities:
      🔴 broken  · checkout-coupon-rule (same signature × 3 runs = real bug)
      🟡 flaky   · login-with-2fa (PFPFP outcome string, 60% flake score)
      🟢 stable  · all 12 nav-menu cases

you ▸ Fix the broken one first. Show me the failure.

  → get_failure_details ✓ checkout-coupon-rule:
      Expected: "Discount applied: $5.00"
      Got:      "Discount applied: NaN"
      First failed: 3 runs ago, on PR #142

Notice what's happening here:

  • The AI doesn't ask which test is flaky — it pulls flake history from tests-history/.
  • The AI doesn't guess selectors — analyze_url gave it real selectors from the live page.
  • The AI doesn't just run tests — it returns a ranked action list. "This is broken, this is flaky, this is stable." Evidence, not gut feel.

This isn't AI writing tests. This is AI doing QA.
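
A note on that 60% flake score, because it looks suspiciously precise: a flake score is just a number squeezed out of outcome history. One plausible way to compute it, not necessarily mk-qa-master's exact formula, is the flip rate between consecutive runs:

def flake_score(outcomes: str) -> float:
    """Fraction of consecutive runs whose verdict flipped.

    `outcomes` is a history string like "PFPFP" (P = pass, F = fail).
    A stable test scores 0.0; one that alternates every run scores 1.0.
    """
    if len(outcomes) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return flips / (len(outcomes) - 1)

By this naive measure PFPFP scores 1.0, so the tool's own formula clearly blends in more signal; the point stands that the number comes from recorded history, not vibes.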


What this deliberately is not

| Not | Use this instead |
| --- | --- |
| A test framework | You bring pytest / Jest / Cypress / Go test / Maestro — mk-qa-master drives them |
| An LLM | Your AI client (Claude / Cursor / Codex / Gemini) does the reasoning |
| A CI runner | Runs locally, produces JUnit XML; pipe to GitHub Actions / Jenkins as usual |
| A source-code analyzer | Looks at live DOM and view hierarchy, not your repo's source |
| A SaaS dashboard | MCP-native, lives in your AI client. HTML reports are self-contained .html files |

Knowing what a tool isn't is half of trust.
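
One concrete consequence of the CI row: the JUnit XML that run_tests writes is the same artifact your CI already parses. With pytest underneath, the by-hand equivalent is just this (the path is my assumption):

pytest --junitxml=reports/junit.xml   # archive this file in CI as usual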


Quick start

uvx mk-qa-master
# or: pip install mk-qa-master

Claude Desktop config lives at:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/project"
      }
    }
  }
}

Restart your client. Then in any AI session, say:

"Test https://your-site/login — one runnable case per module, then tell me which existing test is most likely flaky."

That's the whole UX. No menus. No buttons. The AI chains the tools.


This is one of three

mk-qa-master is the execution end of a family I'm building solo:

  • mk-plan-master — turns a pile of 30–200 raw ideas into RICE-scored, spec-draft-ready initiatives. Hands off to ↓
  • mk-spec-master — parses specs into scenarios, keeps a live spec ↔ test coverage matrix, grades the specs themselves. Hands off to ↓
  • mk-qa-master — drives the runner, generates tests, advises on what's broken vs flaky vs slow.

Together they form an end-to-end AI dev pipeline:

Idea → Plan → Spec → Code (your IDE) → Test → Coverage → Coach
       mk-plan mk-spec your IDE       mk-qa  mk-spec     both

The family wraps the rails; code-writing stays in your IDE (Claude Code / Cursor / Copilot). I deliberately don't try to rebuild what your IDE already does well.

The other two MCPs get their own posts. Follow if that pipeline sounds useful.


Links

If your team is QA-heavy and you've been frustrated by AI tools that write # TODO instead of real tests — give it a try. If you've found a better way to do this, I'd genuinely love to hear about it in the comments. This is an opinionated tool and I'm still iterating.

A star helps the algorithm find people like you. Feedback helps more.

— Jack Kao, building solo.
