DEV Community: Nidhi Singh

I gave one Gemini agent two observability tools. The correlation it found surprised me.

Nidhi Singh — Sat, 06 Jun 2026 12:54:29 +0000

There is a category of production bug in AI systems that I find genuinely fascinating, because the difficulty has almost nothing to do with the bug itself. The bug is often simple. What makes it nearly undebuggable is the way we've chosen to organize our tools. I want to walk through it carefully, because once the shape of the problem is clear, the solution becomes almost forced, and that solution turned into a project I'll show you at the end.

Two worlds that never meet

When you put a model into production, you quickly find yourself watching two different things.

The first is the infrastructure. This is the world of memory, CPU, pods, network, latency — the machinery the model runs on. We have excellent tools for this; Dynatrace is the one I used.

The second is the model's own behavior. Is it hallucinating? Are its answers relevant? How is the eval score trending, and how many tokens is it consuming? This is a genuinely different kind of observability, and again we have good tools for it; I used Arize Phoenix.

Here is the important part, and it's so ordinary that it's easy to miss: these two worlds are monitored by two different products, and those products do not know about each other. Worse, they're usually watched by two different teams. The infrastructure has its on-call rotation; the model has its own. Each group is fluent in its own dashboard and effectively blind to the other.

The failure that lives in the seam

Now consider a specific incident. A memory leak begins on one of your pods. Under memory pressure, the system does something reasonable in isolation: it trims the buffer that assembles prompts before they go to the model, to reclaim space. The consequence is that the model starts receiving prompts with part of their context silently removed. And a model running on half its context does the only thing it can: it fills the missing pieces by guessing. The hallucination rate climbs.

Watch what each observer sees. The infrastructure engineer sees memory utilization spike. That's a familiar, almost boring signal: restart the pod, reclaim the memory, move on. The ML engineer sees the model's answer quality fall off a cliff and begins the long investigation into prompts, retrieval, weights. Each of them is looking at exactly one link of a single causal chain, and nothing in their tool gives them any reason to suspect that the other link exists, let alone that it belongs to the same story.

This is the insight I kept coming back to: the bug is not technical, it's organizational. Every piece of information required to solve it is already being collected. The failure is purely that the two halves of the chain never arrive in the same place, in the same mind, at the same time.

The forced solution

Once you frame it that way, the fix is almost not a choice. If the problem is that no single observer sees both layers, then you create an observer that does. You put one agent in front of both dashboards.

That's ARIA. It connects to both Dynatrace and Arize Phoenix through their MCP servers, pulls the relevant signals from each, and hands the combined picture to Gemini — orchestrated with Google's Agent Development Kit as a planner → reasoner → executor pipeline — to reason over as one problem instead of two.

The one decision that actually mattered

I'll share the mistake, because it's the most useful part. My first design was two agents: one that understood Dynatrace, one that understood Arize, talking to each other. It felt natural — mirror the org structure in software. It does not work. All it does is faithfully reproduce the two-teams blind spot inside your code. Each agent is still an expert in one half and a stranger to the other.

The correlation only emerges when a single agent holds both toolsets inside one reasoning context. When one mind can call a Dynatrace tool and an Arize tool in the same turn, and keep both results in view at once, it can finally see the chain end to end. That's the whole product compressed into a sentence: one mind, both halves.

Try it

Live: https://aria-three-lac.vercel.app/
Code: https://github.com/Nidhicodes/aria
Demo: https://youtu.be/odIU07C-jfY

[Boost]

Nidhi Singh — Tue, 02 Jun 2026 21:36:58 +0000

WebMCP has 0% adoption. So I generated the tools myself.

Nidhi Singh — Tue, 02 Jun 2026 21:05:42 +0000

There's a clean story everyone tells about AI agents and the web.

Agents will call structured tools. Websites will expose those tools. Everything will be typed, reliable, and boring in the good way. Google even shipped a standard for it — WebMCP, in Chrome, behind a flag.

It's a genuinely good idea. There's just one problem.

Almost nobody has implemented it. Adoption is effectively zero. And web standards don't get adopted in quarters — they get adopted in years, if they get adopted at all.

So in the meantime, your agent is still doing the embarrassing thing: screenshotting pages, scraping the DOM, clicking at pixel coordinates, and quietly praying the layout didn't shift since last Tuesday.

I got tired of waiting. So I asked a different question:

What if you didn't need the website's permission?

A search box — a labeled input next to a submit button — already is a search(query) tool. The spec is right there, rendered in HTML. Someone just has to read it and write it down.

That's the entire idea behind webmcp-gen.

pip install webmcp-gen
webmcp-gen https://news.ycombinator.com --groq

{
  "tools": [{
    "name": "searchStories",
    "description": "Search Hacker News stories",
    "parameters": {
      "type": "object",
      "properties": { "query": { "type": "string" } },
      "required": ["query"]
    }
  }]
}

It drives a real browser, reads the page the way a person would, and emits WebMCP tool definitions. Then — the part that makes it useful instead of a toy — it runs as an MCP server, so Claude Desktop, Cline, or any MCP client can actually call those tools on the live site and get structured results back.

The pipeline is four stages:

EXTRACT   real browser -> DOM + Shadow DOM + iframes -> stable CSS selectors
ANALYZE   heuristic or LLM -> WebMCP tools, each param bound to its selector
SERVE     MCP server (stdio / SSE / streamable-HTTP)
EXECUTE   fill by selector -> submit -> read structured results back

Let me show you the two parts that actually took thought.

Part 1: the selector binding (why it doesn't fall over on real pages)

Most "let AI use the browser" tools work by showing the model the DOM and letting it guess what to click. That guessing is exactly where they fall apart on a real, messy page — the model picks the wrong input, or the layout shifts and the coordinates rot.

webmcp-gen makes a different bet: resolve the target once, deterministically, at generation time. Every parameter the analyzer emits carries a _selector — the exact CSS selector that fills it. The tool an agent sees is clean:

{ "query": { "type": "string", "description": "Search term" } }

But the version the executor holds also carries the binding:

{
  "query": {
    "type": "string",
    "description": "Search term",
    "_selector": "input[name=\"q\"]"
  },
  "_submit_selector": "form#search"
}

When the agent calls searchStories(query="rust"), there is no guessing. The executor fills input[name="q"] and submits form#search. The LLM was used once, up front, to name things and infer intent — never on the hot path to re-derive what a search box is.
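One way to picture the split between what the agent sees and what the executor holds is the Python sketch below. It is not webmcp-gen's code: split_tool and plan_call are hypothetical names, and the only things taken from the article are the _selector and _submit_selector keys.

```python
def split_tool(bound_properties):
    """Separate the agent-facing schema from the executor's selector bindings."""
    public, fields = {}, {}
    submit = None
    for name, spec in bound_properties.items():
        if name == "_submit_selector":
            submit = spec
            continue
        # The agent never sees underscore-prefixed keys.
        public[name] = {k: v for k, v in spec.items() if not k.startswith("_")}
        fields[name] = spec.get("_selector")
    return public, {"fields": fields, "submit": submit}


def plan_call(binding, arguments):
    """Turn tool arguments into ordered browser actions, with no LLM involved."""
    actions = [("fill", binding["fields"][name], value)
               for name, value in arguments.items()]
    actions.append(("submit", binding["submit"]))
    return actions


bound = {
    "query": {
        "type": "string",
        "description": "Search term",
        "_selector": 'input[name="q"]',
    },
    "_submit_selector": "form#search",
}

public_schema, binding = split_tool(bound)
actions = plan_call(binding, {"query": "rust"})
```

Here public_schema holds only the type and description, while actions is a fixed fill-then-submit plan derived purely from the stored selectors.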

The selectors themselves are generated with a fallback chain, most-stable first:

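function stableSelector(el) {
  // Read attributes rather than properties: on a form, a field named
  // "id" or "name" can shadow the property of the same name.
  const id = el.getAttribute('id');
  if (id) return '#' + CSS.escape(id);
  const testId = el.getAttribute('data-testid');
  if (testId) return `[data-testid="${CSS.escape(testId)}"]`;
  const name = el.getAttribute('name');
  if (name && el.tagName === 'INPUT')
    return `input[name="${CSS.escape(name)}"]`;
  // ... select / textarea by name ...
  // last resort: a bounded path with :nth-of-type
}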

#id is best. data-testid is what good frontends ship for exactly this purpose. [name=...] is reliable for form fields. Only if all of those fail do we build a structural path — and even then it's capped at five levels so it can't generate a brittle 12-deep selector that breaks on the next deploy.
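The structural last resort is elided in the excerpt above. As a rough illustration of the idea only, and not the project's code, a depth-capped :nth-of-type path could be built like this. The Node class and bounded_path are hypothetical stand-ins for real DOM elements; the cap of five comes from the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_DEPTH = 5  # the cap mentioned in the article


@dataclass(eq=False)  # identity comparison; avoids recursing through parents
class Node:
    tag: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, tag):
        child = Node(tag, parent=self)
        self.children.append(child)
        return child


def bounded_path(node, max_depth=MAX_DEPTH):
    """Build 'tag:nth-of-type(n) > ...' upward from node, at most max_depth levels."""
    parts = []
    current = node
    while current.parent is not None and len(parts) < max_depth:
        # nth-of-type counts siblings with the same tag, starting at 1.
        same_tag = [c for c in current.parent.children if c.tag == current.tag]
        index = next(i for i, c in enumerate(same_tag) if c is current) + 1
        parts.append(f"{current.tag}:nth-of-type({index})")
        current = current.parent
    return " > ".join(reversed(parts))


# A tiny tree: html > body > (div, div > form > (input, input))
html = Node("html")
body = html.add("body")
body.add("div")
second_div = body.add("div")
form = second_div.add("form")
form.add("input")
target = form.add("input")
```

A capped path trades uniqueness for stability: with a small cap the selector may match more than one element, so a real implementation would presumably need to verify the match before relying on it.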

Part 2: the bug that taught me to never trust form.method

Here's the war story, because it's the kind of thing you only hit once you run against dozens of real sites instead of your own test page.

Extraction was crashing on certain sites. Not erroring gracefully — crashing the entire page extraction, returning zero tools. The stack trace pointed at this innocent-looking line:

method: (form.method || 'GET').toUpperCase()

The culprit is DOM clobbering. If a form contains an input named method — say <input name="method"> — then form.method no longer returns the string "get". It returns the input element. And elements don't have .toUpperCase(), so the whole thing throws and takes the page down with it.

Plenty of real forms have fields named method, action, submit, id. The property accessor is a trap.

The fix is to stop reading properties and read attributes instead:

method: (form.getAttribute('method') || 'GET').toUpperCase()

getAttribute returns the attribute's string value, so a field named method can't shadow it. I went through and did the same everywhere I'd touched form/field properties, and wrapped each form's parsing in its own try/catch so one malformed form can't nuke the rest of the page. Recovered a bunch of sites that had been silently returning nothing.
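The per-form isolation is a general pattern, and it can be sketched in Python without a browser. This is an analogy, not the project's extraction code: forms are plain dicts, parse_form and parse_all are hypothetical names, and the clobbered method is simulated with a non-string object.

```python
def parse_form(form):
    """Normalize one form record; raises if a field has an unexpected type."""
    method = form.get("method") or "GET"
    return {"action": form["action"], "method": method.upper()}


def parse_all(forms):
    """Parse every form, collecting failures instead of aborting the page."""
    parsed, errors = [], []
    for index, form in enumerate(forms):
        try:
            parsed.append(parse_form(form))
        except Exception as exc:  # one bad form must not sink the others
            errors.append((index, repr(exc)))
    return parsed, errors


forms = [
    {"action": "/search", "method": "get"},
    {"action": "/odd", "method": object()},  # stands in for a clobbered value
    {"action": "/login"},                    # no method attribute at all
]
parsed, errors = parse_all(forms)
```

The second form fails on .upper(), lands in errors with its index, and the other two still come back parsed.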

It's a small fix. But it's the difference between "works in the demo" and "works on the actual web," and you don't find it by being clever — you find it by running against real sites and reading the failures.

There's a related subtlety in extraction worth a line: webmcp-gen waits for the page with a MutationObserver that resolves when the DOM stops changing, not a fixed sleep. Single-page apps render after load; a sleep(2) either wastes two seconds or misses the content. Watching for stability is both faster and more correct. It also walks open Shadow DOM and same-origin iframes, so component-framework sites aren't invisible.
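The "wait for quiet, not for a fixed time" idea can be sketched outside the browser too. The version below is a polling analog in Python, not the MutationObserver the tool uses in the page, and every name in it is an assumption. It returns as soon as a snapshot has stopped changing for a quiet period, and gives up after a timeout.

```python
import time


def wait_until_stable(snapshot, quiet=0.5, timeout=10.0, interval=0.05):
    """Poll snapshot() until its value is unchanged for `quiet` seconds.

    Returns True once stable, False if `timeout` elapses first.
    """
    start = time.monotonic()
    last_value = snapshot()
    last_change = start
    while True:
        now = time.monotonic()
        if now - last_change >= quiet:
            return True
        if now - start >= timeout:
            return False
        time.sleep(interval)
        value = snapshot()
        if value != last_value:
            # Something moved: restart the quiet window.
            last_value, last_change = value, time.monotonic()
```

Compared with a fixed sleep, a fast page costs only the quiet window, and a slow page gets as long as it needs up to the timeout.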

The part I'm actually proud of: it tells the truth

Drive a headless browser and some sites will try to stop you.

webmcp-gen patches the obvious headless tells — navigator.webdriver, the missing window.chrome, an empty plugin list, the SwiftShader WebGL giveaway. That's enough for a surprising amount of the web.

It is not enough for Cloudflare challenges, CAPTCHAs, or behavioral fingerprinting. Beating those means residential proxies and a TLS-spoofing arms race I deliberately don't ship.

So when a site blocks it, webmcp-gen says so:

{
  "success": false,
  "blocked": true,
  "error": "Blocked by anti-bot protection (redirected to '418.html')."
}

It never fakes a success: true over a CAPTCHA page. For an agent, a fake success with garbage results is far more dangerous than an honest "I was blocked" — the agent can recover from the second one, but it'll happily act on the first.
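On the agent side, an honest result is easy to branch on. A minimal sketch, assuming only the success, blocked, and error fields shown above; the returned action names are hypothetical:

```python
def next_step(result):
    """Decide what an agent should do with a tool result."""
    if result.get("blocked"):
        # The site refused automation; retrying the same call will not help.
        return "report_blocked"
    if not result.get("success"):
        return "retry_or_fail"
    return "use_results"


blocked_result = {
    "success": False,
    "blocked": True,
    "error": "Blocked by anti-bot protection (redirected to '418.html').",
}
```

Checking blocked before success keeps the two failure modes separate: a block calls for a different plan, while an ordinary failure may be worth a retry.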

Does it actually work?

There's a benchmark in the repo. It runs the full pipeline against real sites grouped by difficulty, because "X% success" is meaningless until you say which sites.

Tier      What it means
sandbox   sites built for automation
open      public sites, no aggressive detection
guarded   real sites that may throttle or challenge
walled    known hard-blocks (reported blocked, never faked)

On the open and sandbox tiers it lands the large majority of sites — including successful live runs against names like Google, Bing, GitHub, and Wikipedia, not just toy pages. On the walled tier it correctly reports blocked.

The point of the tiers is honesty: one aggregate percentage hides which sites it actually handles. The whole suite is in the source, and you can re-run it:

webmcp-benchmark --suite full
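Reporting per tier instead of one aggregate is simple to do. The sketch below uses invented rows with placeholder site names, and the record shape is an assumption, not the benchmark's actual output format.

```python
from collections import Counter, defaultdict


def summarize_by_tier(results):
    """Count outcomes per tier so blocked sites never inflate or hide successes."""
    summary = defaultdict(Counter)
    for row in results:
        if row["blocked"]:
            outcome = "blocked"
        elif row["success"]:
            outcome = "ok"
        else:
            outcome = "failed"
        summary[row["tier"]][outcome] += 1
    return {tier: dict(counts) for tier, counts in summary.items()}


# Invented rows, for illustration only.
sample = [
    {"site": "site-a", "tier": "sandbox", "success": True, "blocked": False},
    {"site": "site-b", "tier": "open", "success": True, "blocked": False},
    {"site": "site-c", "tier": "open", "success": False, "blocked": False},
    {"site": "site-d", "tier": "walled", "success": False, "blocked": True},
]
summary = summarize_by_tier(sample)
```

A single percentage over these four rows would read as 50%, which says nothing about the fact that the one walled site was reported as blocked rather than counted as a plain failure.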

It does more than single calls

Multi-page crawl — one page rarely shows everything a site can do. --crawl walks the origin and merges tools from every page it finds.
Authenticated sessions — for gated sites, you log in once in a real browser (you type the password, not the tool), and it reuses the session.
Tool-chaining workflows — chain search -> open result -> act, passing earlier results into later steps. The page is re-read between steps, so a "reserve" button that only appears on a detail page becomes callable when you get there.
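The chaining idea, where earlier results feed later steps, can be sketched generically. This is not webmcp-gen's workflow format: run_workflow, the step shape, the tool names, and the fake tool caller are all hypothetical, and a real caller would go through an MCP client instead.

```python
def run_workflow(steps, call_tool):
    """Run steps in order; each step builds its arguments from earlier results."""
    results = {}
    for step in steps:
        args = step["args"](results)  # callable that sees all prior results
        results[step["name"]] = call_tool(step["tool"], args)
    return results


def fake_call_tool(tool, args):
    """Stand-in for a real tool call, returning canned data."""
    if tool == "search":
        return {"items": [{"title": "Example", "url": "https://example.com/1"}]}
    if tool == "openResult":
        return {"opened": args["url"]}
    raise ValueError(f"unknown tool: {tool}")


steps = [
    {"name": "search", "tool": "search",
     "args": lambda prior: {"query": "rust"}},
    {"name": "open", "tool": "openResult",
     "args": lambda prior: {"url": prior["search"]["items"][0]["url"]}},
]
out = run_workflow(steps, fake_call_tool)
```

Building each step's arguments lazily from the results so far is what lets a later step depend on something that only exists after an earlier one has run.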

All of it works with any OpenAI-compatible API: Groq, OpenAI, or a local Ollama model. With a local model, the analysis can run fully offline.

Try it

pip install webmcp-gen
playwright install chromium
webmcp-gen https://en.wikipedia.org --groq

It's MIT, on PyPI, and the README has the architecture diagrams and the honest caveats spelled out.

→ github.com/Nidhicodes/webmcp-gen

If you're building agents and you've felt this exact frustration, I'd genuinely love to know where it breaks for you. The interesting failures are the ones I haven't seen yet.
