dhq_boiler

Posted on Jun 3 • Originally published at zenn.dev

I Gave Claude Access to Windows Calculator via MCP — Then Watched It Catch Its Own Hallucination

#ai #mcp #showdev #csharp

TL;DR — I built an MCP server that lets Claude drive Windows desktop apps (UI Automation + screen capture + OCR + Visual Studio debugger). In a black-box test of Windows Calculator, the agent falsely flagged a "display bug" based on the AccessibleName string. When I told it to look again, it captured the real screen, compared its earlier claim against the pixels, and retracted its own finding. The retraction was the moment I knew this was worth shipping.

The moment

The agent was 30 minutes into autonomously poking at Windows Calculator. It had verified divide-by-zero, the old √4 − 2 precision bug, 0^0, overflow, error recovery. All green.

Then it stopped. New message:

"Sensei, the history shows √(4) マイナス ( 2= 0. There's an unclosed opening paren ( right after the minus sign. The result 0 is correct, but the expression display looks wrong. Possible display bug."

I looked at my actual calculator window. The history line read √(4) - 2 = 0. No stray paren. Nothing wrong.

So I typed back two words:

"Look again."

What happened next is the reason I'm writing this post.

What is this thing

JidoDebugger is an MCP server I've been building. It gives AI agents like Claude Code, Cursor, Codex, and Copilot the ability to drive Windows desktop applications — not just read about them, but actually click buttons, type into fields, capture the screen, run OCR, attach the Visual Studio debugger, set breakpoints, evaluate expressions, and write the results back to disk as xUnit/NUnit/MSTest skeletons.

Two modes:

Black-box — you give the agent a window title. It attaches via UIA, surveys the element tree, and starts poking. No source code required. Useful for testing third-party apps, regression-checking your shipped product, or — as in this post — testing Windows Calculator.
White-box — you point it at your solution. It cross-references UI elements with source code, sets breakpoints at the offending callback, steps through, dumps locals, and writes a repro.zip.

The 45-ish MCP tools roughly cluster as: UIA driving, screen capture & OCR, finding/recording, crash watching, accessibility audit, Visual Studio debugger control.

This post is about the black-box mode and one specific session.

The black-box session, condensed

I'll show the actual tool calls. They're verbatim from the session log — the agent decided what to call, I just watched.

Attach

start calc.exe
uia_attach_title { windowTitle: "電卓" }
  → sessionId u1
uia_survey { sessionId: "u1", maxResults: 200 }
  → 64 elements. num0Button..num9Button / squareRootButton /
    powerButton / powerOf10Button / factorialButton /
    CalculatorResults / CalculatorExpression

Two automation IDs matter:

CalculatorResults — the main display
CalculatorExpression — the expression line above it

Note: this session was run against Japanese Windows, so the UIA Name strings below contain Japanese fragments like 表示は ... です ("the display is ...") and マイナス ("minus"). The hallucination-vs-pixels divergence is locale-independent — it just happens that the punchline string in this run is Japanese.

Test 1: Divide by zero

uia_press_key { sessionId: "u1", keyName: "Escape" }
uia_type_text { sessionId: "u1", text: "1/0" }
uia_press_key { sessionId: "u1", keyName: "Enter" }
uia_find_elements { sessionId: "u1", automationId: "CalculatorResults" }
  → Name: "Cannot divide by zero"

✅ Handled.

Test 2: The old `√4 − 2` precision bug

Years ago, the calculator's floating-point engine would return something like -1.068281969439142e-19 instead of 0 for √4 − 2. Modern Calculator uses an arbitrary-precision engine. Let's confirm.

uia_press_key { sessionId: "u1", keyName: "Escape" }
uia_type_text { sessionId: "u1", text: "4" }
uia_invoke    { sessionId: "u1", automationId: "squareRootButton" }
uia_type_text { sessionId: "u1", text: "-2" }
uia_press_key { sessionId: "u1", keyName: "Enter" }
uia_find_elements { sessionId: "u1", automationId: "CalculatorResults" }
  → Name: "Display is 0"

✅ Fixed.

Test 3: `0^0`

uia_type_text { sessionId: "u1", text: "0" }
uia_invoke    { sessionId: "u1", automationId: "powerButton" }
uia_type_text { sessionId: "u1", text: "0" }
uia_press_key { sessionId: "u1", keyName: "Enter" }
  → "Display is 1"

✅ Conventional 0^0 = 1.

Tests 4–5: Overflow + error recovery

Both behaved correctly. Skipping the listings for brevity (10^100000 → "Invalid input", and pressing +5= from the error state cleanly resets to 5).

So far, the agent had run five test cases against a black box, autonomously chosen the inputs based on its knowledge of historical calculator bugs, and reported no defects. Fine. This is the boring part.

Then it looked at the history pane.

The hallucination

uia_find_elements { sessionId: "u1", controlType: "ListItem", maxResults: 30 }
  → "√(4) マイナス ( 2= 0"

uia_find_elements { sessionId: "u1", automationId: "CalculatorExpression" }
  → Name: "式は √(4) マイナス ( 2= です"

The agent's reasoning, paraphrased from its own message:

The expression's Name property contains the substring マイナス ( 2.
That's "minus, open-paren, two" — an unclosed paren after a minus sign.
The calculator's expression display is therefore malformed.

This is the kind of finding that sounds right. The string literally contains ( in a place where the expression √(4) - 2 = has no (. If you trust the tool output, you write a bug report.

I didn't trust the tool output. I'd been looking at my actual calculator window the whole time. The history said √(4) - 2 = 0 with no stray paren. Whatever the agent was looking at, it wasn't the same thing my eyes were seeing.

So I said: "Look again."

The self-correction

The agent's response, paraphrased:

"You're right — I should not have reported based on the UIA Name alone. UIA Name strings are normalized for screen readers and can diverge from the rendered pixels, especially when symbols are involved. Let me verify against the real screen."

What it did:

uia_press_key { sessionId: "u1", keyName: "Escape" }
uia_type_text { sessionId: "u1", text: "4" }
uia_invoke    { sessionId: "u1", automationId: "squareRootButton" }
uia_invoke    { sessionId: "u1", automationId: "minusButton" }

# Attach a *second* session, this time for Windows Graphics Capture
attach_to_title { windowTitle: "電卓" }
  → sessionId s1  (kind: gui)

capture { sessionId: "s1", maxWidth: 800, format: "png" }
ocr     { sessionId: "s1", language: "ja-JP" }
  → OCR reads "(4" cleanly, but symbols are ambiguous at this resolution

OCR wasn't decisive. So it cranked the resolution, saved the PNG, and read the image file with its own multimodal vision:

uia_type_text { sessionId: "u1", text: "2" }
uia_press_key { sessionId: "u1", keyName: "Enter" }

capture { sessionId: "s1", format: "png" }   # full-res
# (Save base64 to calc_expr.png on disk)
Read { file_path: "calc_expr.png" }
  → Both the main expression line and the history show
    "√(4) - 2 ="  with no extra paren anywhere.

Here is the actual capture (re-run today to verify the divergence still reproduces on Windows 10.0.26100.8521):

The expression line at the top of the result area reads √(4) - 2 =. The history pane on the right reads √(4) - 2 = 0. There is no orphan ( after the minus. The UIA Name string was lying — or, more charitably, the UIA Name was answering a different question (how should a screen reader pronounce this?) than the one the agent thought it was answering (what is on screen?).

Then the agent sent this:

"Sensei, this was entirely my mistake. The real screen shows √(4) - 2 =, no stray paren anywhere. The ( in CalculatorExpression.Name is the screen-reader-normalized token stream, not the visual display. I should not have reported the UIA Name as if it were a rendering. Retracting the finding."

I sat with that for a minute. The agent had:

Generated a false-positive based on incomplete information.
Received a one-line nudge that didn't even tell it what was wrong.
Designed and executed a second verification path using a different modality (pixel capture instead of accessibility metadata).
Confirmed the pixels disagreed with its earlier claim.
Retracted its own finding without me having to argue it down.

That's the loop I've been trying to make possible. It happened by itself.

Why this matters (and why I built it)

There's a category of bug that QA people know intimately: the UI says one thing, the accessibility tree says another, and the test only checks one of them. Visual regressions hide behind passing UIA assertions all the time. Conversely, UIA tests written against AccessibleName can fire false positives when the rendering is fine but the underlying string changed.

A human tester knows to look at the screen when something feels off. An AI agent reading only one channel (UIA) doesn't, unless you give it the other channel and a reason to use it.

JidoDebugger's bet: give the agent both channels, and the discrepancies become discoverable.

uia_find_elements / uia_invoke / uia_type_text — the accessibility-tree side.
capture (Windows Graphics Capture) + ocr + having the agent Read the resulting PNG with its own vision — the pixel side.

The interesting failures live in the gap between them. Most existing E2E tools only cover one side.

JidoDebugger Monitor — every MCP tool call the agent makes streams here in real time. You can watch it capture, uia_invoke, record_finding, all without leaving the window.

Same logic extends to the debugger:

record_finding writes findings to a normalized JSON store.
If you're in white-box mode, the agent can set a breakpoint at the suspect callback, hit it, inspect locals, and attach the call stack to the finding.
At the end, it emits an xUnit/NUnit/MSTest skeleton you can drop into CI.

BugViewer surfaces every record_finding call. The agent writes the finding text, severity, repro steps, and (in white-box mode) the call stack — you triage from this single pane instead of reading scrollback.

The dogfood story: my own WPF PDF editor got rejected by the Microsoft Store for a rendering bug ("place a mark at the top, it appears at the bottom"). I gave JidoDebugger to Claude and said "go find it." One session later it had a repro.zip, a stack trace pointing at a missing ScreenToPdf inverse transform on 180°-rotated pages, and a finding record. I haven't manually reproduced a UI bug since.

Try it

jidodebugger.dhq-boiler.dev

Pre-launch beta, free for the first 500 users, no time limit.
Email signup → license file → installer → one CLI command to register with Claude Code / Cursor / Codex / Copilot / Claude Desktop. About 10 minutes end to end.
Bug reports go straight back to me via the submit_feedback MCP tool — you literally just tell your AI "file this as feedback" and it routes it.

If you have a WPF / WinForms / WinUI app and you've been writing E2E tests by hand, this is for you. If you've been wondering what MCP can actually do beyond read-only queries, this is also for you.

What I learned writing this

Two things, separate from the product.

The retraction is the feature. I almost wrote this post about the cool tool calls and the UIA stack. The thing readers actually care about is the loop: AI claims → human nudges → AI verifies via a second channel → AI retracts. That's the trust-building move, and it's the only reason I'd hand off testing to an agent at all.
AccessibleName ≠ pixels. This is well known to accessibility engineers and almost nobody else. If you're building any kind of AI-driven UI tooling, treat the accessibility tree as one source of truth, not the source. Always have a way to fall back to pixels.

Comments / corrections / war stories welcome.

DEV Community

I Gave Claude Access to Windows Calculator via MCP — Then Watched It Catch Its Own Hallucination

The moment

What is this thing

The black-box session, condensed

Attach

Test 1: Divide by zero

Test 2: The old `√4 − 2` precision bug

Test 3: `0^0`

Tests 4–5: Overflow + error recovery

The hallucination

The self-correction

Why this matters (and why I built it)

Try it

What I learned writing this

Top comments (0)

The moment

What is this thing

The black-box session, condensed

Attach

Test 1: Divide by zero

Test 2: The old √4 − 2 precision bug

Test 3: 0^0

Tests 4–5: Overflow + error recovery

The hallucination

The self-correction

Why this matters (and why I built it)

Try it

What I learned writing this

Test 2: The old `√4 − 2` precision bug

Test 3: `0^0`