DEV Community: weiseer

Bulk-check DNS, SSL and email auth for a whole list of domains (no scraping)

weiseer — Sun, 31 May 2026 04:16:15 +0000

If you've ever had a spreadsheet of domains — a lead list, an acquisition target's
footprint, your own portfolio — and needed DNS records, WHOIS, SSL expiry, or email
authentication for all of them, you know the pain: single-domain web tools don't
scale, and dig / whois / openssl loops are fiddly to parse.

Here's how I think about pulling clean, structured domain intelligence in bulk —
and the three small tools I built so I never have to write that loop again.

1. DNS + WHOIS + SSL, in one pass

For each domain you usually want three things together:

DNS — A/AAAA/MX/NS/TXT/CNAME (where it points, who runs mail, the DNS provider)
WHOIS — registrar, creation/expiry dates, status, name servers
SSL — issuer, the valid_from/valid_to window, SAN list

The trick is to do them as protocol calls (DNS resolution, a TLS handshake, WHOIS
on port 43) rather than scraping any website — protocol surfaces are stable, so the
output doesn't break when sites redesign.
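As a rough illustration of the protocol-call approach, here is a minimal standard-library sketch of the WHOIS and TLS pieces. The helper names, the default WHOIS server (`whois.verisign-grs.com`, which only answers for .com/.net) and the parsing rules are assumptions for illustration, not the code behind the tools in this post. MX/TXT lookups need a resolver library and are left out.

```python
import socket
import ssl


def whois_query(domain: str, server: str = "whois.verisign-grs.com",
                timeout: float = 10.0) -> str:
    """Send one WHOIS query over TCP port 43 and return the raw reply."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text: str) -> dict:
    """Collect 'Key: value' lines; every key maps to a list of values."""
    fields: dict = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key and value.strip():
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def fetch_cert(domain: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Do a TLS handshake and return the validated peer certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((domain, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert()


def cert_summary(cert: dict) -> dict:
    """Reduce getpeercert() output to issuer, validity window and SAN list."""
    issuer = {k: v for rdn in cert.get("issuer", ()) for k, v in rdn}
    return {
        "issuer": issuer.get("organizationName") or issuer.get("commonName"),
        "valid_from": cert.get("notBefore"),
        "valid_to": cert.get("notAfter"),
        "valid_to_epoch": ssl.cert_time_to_seconds(cert["notAfter"]),
        "san": [v for kind, v in cert.get("subjectAltName", ()) if kind == "DNS"],
    }
```

A per-domain row would then be something like `{**parse_whois(whois_query(d)), **cert_summary(fetch_cert(d))}`, with timeouts and errors handled per domain so one dead host does not stall the batch.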

If you don't want to wire that yourself, I packaged it as
Domain Intelligence on Apify:
paste a list of domains, get one clean JSON row each. ($0.01/domain, no proxies.)

2. Email authentication (MX / SPF / DMARC / DKIM)

Deliverability and security both hinge on the same DNS records:

MX present? which provider? (aspmx.l.google.com → Google Workspace, etc.)
SPF — is there a v=spf1 TXT record, and is the qualifier -all (strict) or ~all?
DMARC — _dmarc TXT with p=reject / quarantine / none?
DKIM — does a common selector (google, selector1, k1…) publish a key at <selector>._domainkey.<domain>?

A domain with MX + SPF -all + DMARC reject + DKIM is a "strong" setup; missing
DMARC is the most common gap. You can score a whole list this way in seconds — no SMTP
probing required (which mail servers block anyway). I put this in
Bulk Email Deliverability Checker.
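The checks above can be sketched as pure functions over TXT record strings you have already resolved. This is a minimal sketch: the function names and the `("gaps", [...])` return shape are my own assumptions, and only the "strong" definition comes from the text above.

```python
def spf_all_qualifier(txt_records):
    """Return the qualifier on the 'all' mechanism of the v=spf1 record.

    '-', '~', '?' or '+' when an 'all' term exists, '' when the SPF record
    has no 'all' term, None when there is no SPF record at all.
    """
    for record in txt_records:
        terms = record.strip().lower().split()
        if terms and terms[0] == "v=spf1":
            for term in terms[1:]:
                if term in ("all", "+all", "-all", "~all", "?all"):
                    return "+" if term == "all" else term[0]
            return ""
    return None


def dmarc_policy(txt_records):
    """Return the p= tag of the v=DMARC1 record (from _dmarc.<domain>), or None."""
    for record in txt_records:
        tags = {}
        for part in record.split(";"):
            name, sep, value = part.partition("=")
            if sep:
                tags[name.strip().lower()] = value.strip().lower()
        if tags.get("v") == "dmarc1":
            return tags.get("p")
    return None


def grade(has_mx, spf_txt, dmarc_txt, dkim_found):
    """Return ('strong', []) or ('gaps', [what is missing or weak])."""
    gaps = []
    if not has_mx:
        gaps.append("no MX")
    if spf_all_qualifier(spf_txt) != "-":
        gaps.append("SPF not -all")
    if dmarc_policy(dmarc_txt) != "reject":
        gaps.append("DMARC not p=reject")
    if not dkim_found:
        gaps.append("no DKIM key at common selectors")
    return ("strong" if not gaps else "gaps", gaps)
```

Keeping the parsing separate from the DNS lookups means the scoring can be unit-tested on canned records, with no network involved.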

3. The website itself (metadata + tech + security headers)

The HTTP layer rounds out the picture: final URL after redirects, status, <title>,
meta/Open Graph, Server/X-Powered-By, the security headers (HSTS, CSP,
X-Frame-Options…) graded present/missing, and lightweight tech hints (Cloudflare,
nginx, Next.js, Shopify, WordPress…). Useful for SEO audits, tech research, and
lead enrichment: Website Metadata & Tech Profiler.
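A present/missing grade for the three headers named above could look like the sketch below. The function names and the tiny hint table are hypothetical, and real tech detection would need far more signals than two response headers.

```python
SECURITY_HEADERS = (
    "Strict-Transport-Security",  # HSTS
    "Content-Security-Policy",    # CSP
    "X-Frame-Options",
)

# Hypothetical substring hints, matched against Server / X-Powered-By values.
TECH_HINTS = (("Cloudflare", "cloudflare"), ("nginx", "nginx"), ("Next.js", "next.js"))


def grade_security_headers(headers: dict) -> dict:
    """Mark each tracked security header as 'present' or 'missing'."""
    seen = {name.lower() for name in headers}  # header names are case-insensitive
    return {h: "present" if h.lower() in seen else "missing" for h in SECURITY_HEADERS}


def tech_hints(headers: dict) -> list:
    """Guess technologies from the Server and X-Powered-By header values."""
    banner = " ".join(
        str(value) for name, value in headers.items()
        if name.lower() in ("server", "x-powered-by")
    ).lower()
    return [label for label, needle in TECH_HINTS if needle in banner]
```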

Putting it together

For lead enrichment you might run all three over a list of company domains: WHOIS for
registrar/age, DNS MX for the email stack, the web profiler for the tech stack, and the
email checker for deliverability — a quick technographic profile per domain, exported
to CSV or pulled via API.

All three are protocol-based and low-maintenance by design. Code and notes:
github.com/weiseer. Happy to take requests for fields to add —
what would you want in a bulk domain report?

I tested mcp-doctor pricing with 12 LLM-simulated personas. 4 said they would pay.

weiseer — Sat, 30 May 2026 09:40:08 +0000

Earlier today I shipped @weiseer/mcp-doctor — an open-source supply-chain trust scanner for MCP (Model Context Protocol) servers. CLI + GitHub Action + Trust Badge + free public API at https://api.weiseer.com. Pro tier is $19/mo on Gumroad.

The honest question every solo founder skips: would anyone actually pay $19/mo for this?

I have a separate tool for exactly this — personalab, an open-source persona-driven product evaluation harness. 12 LLM-simulated personas read the product, decide each day what they'd do, and tell you who would pay and who would walk away. I've used it before on PostHog, Cal.com, and personalab-on-itself.

Tonight I ran it on mcp-doctor as case study #4. Code + raw data + full report all in github.com/weiseer/mcp-doctor/blob/main/case_study/.

Headline result

4 of 12 personas would pay (33%). 2 abandoned. 6 stayed engaged on the free tier.

Case study	Would-pay rate (under same persona harness)
mcp-doctor (today)	4/12 = 33%
personalab self-test	0/8
PostHog (5-day agentic)	0/12 sustained (6/12 day-1 yes)
Cal.com	8/12

This puts mcp-doctor between PostHog and Cal.com under the same methodology. Better than personalab itself, better than what PostHog showed under a 5-day sustained simulation, worse than Cal.com (which converged on a single clean friction lever — the famous "Powered by Cal.com" branding).

Not making PMF claims. Treat it as one signal among several.

Who paid

07 OSS maintainer — strongest engagement signal. Opened a GitHub issue on day 2, shared with team on day 3, subscribed to Pro on day 4. Quote synthesized from the 5-day transcript:

"Supply-chain audits are part of my actual job. A rubric I can fork and argue with is worth more than another vendor's black box. $19/mo is below my coffee budget."

06 Research consultant — buys tools on behalf of clients. Subscribed on day 5. The "buying for someone else" pattern showed up clearly — they care about whether the trust signal is defensible to a third party.

Who walked

02 Growth PM — final action: UNSUBSCRIBE_OR_UNINSTALL. Their verbatim, translated from the persona's Chinese output:

"mcp-doctor solves a supply-chain trust problem, which is completely orthogonal to my OKR (Free→Paid conversion 3.2%→4.5%). Five days in, it has done nothing to speed up my A/B iteration. Time cost > $19 of value."

The persona is right. Their OKR is conversion; my tool is about supply-chain trust. That is an audience mismatch.

11 Data team lead — abandoned over rubric calibration disagreement. They disagreed with how aggressively A1_unpinned_deps fires. This is real feedback the actual product would need to address (PR welcome on rubric.yaml).

Who stayed engaged but didn't pay

6 of 12 personas used the free tier daily, found genuine value, but did not subscribe. These are the free-tier loyalists — exactly the funnel design intent. They give us:

Word-of-mouth (some opened GitHub issues, shared with team)
Trust badge usage on their READMEs (free)
The actual marketing engine

If we tried to push these personas to Pro, we'd lose the funnel. Free tier should stay generous.

Patterns across the 60 simulations (12 × 5)

The personalab agentic mode runs each persona day-by-day, so I get 60 data points. Friction clusters extracted:

Cluster	# mentions across persona-days
Rubric calibration / false positive concerns	14
Pro tier value vs Free tier sufficiency	11
MCP-specific audience (do I even use MCP?)	9
Trust building (new brand)	8
vs npm audit / Snyk / Bumblebee	7
Self-serve / docs gap	4

The top cluster — rubric calibration — is the right one to prioritize. v0.2 of the scanner should add an LLM-judge mode for ambiguous signals (the same fix planned for @weiseer/prompt-redteam's detection).

The number-of-clusters observation from earlier personalab work was: pre-PMF products see 4-5 diffuse complaints, late-funnel products see 1-2 clean levers. mcp-doctor surfaced 6 clusters at day 1 of launch. That feels right — pre-PMF, complaints diffuse.

What I'm doing about it

Not changing pricing — 33% would-pay on the right persona slice is enough signal at $19/mo. Cal.com hits 67% on a more general audience; we accept narrower fit at this stage.
Sharpening audience — Twitter / Reddit posting should drop the "general developer" framing and double down on "MCP server users" specifically. The personas who pay are the ones who already do this work.
Rubric calibration — top friction cluster is real. v0.2 will add LLM-judge classification of ambiguous signals + explicit per-signal severity thresholds.
Not naming the package — case study itself is the marketing. No "X is the worst, buy mcp-doctor."

Honest disclosure

This is simulated user behavior via Claude Haiku 4.5, not real customer interviews. Treat as one signal, not as PMF validation.
The same persona library was previously calibrated on three other products; cross-product comparability is plausible but not proven.
The product context was shown once; real buyers would see Twitter, GitHub stars, friends' opinions, etc.
Some persona quotes may reflect personalab's own design biases (acknowledged in personalab's own meta case study).
Two products by the same person (mcp-doctor + personalab) tested by the same person — bias risk acknowledged.

Reproducibility

# Clone personalab
git clone https://github.com/g16253470-beep/personalab
cd personalab

# Adapt the runner to your product
# https://github.com/weiseer/mcp-doctor/blob/main/case_study/run_personalab.py

# Run on your own product brief
ANTHROPIC_API_KEY=... python run_personalab.py

The raw JSON output is at mcp-doctor/case_study/personalab_raw_report.json. Argue with the persona definitions. Fork the case-study runner. If you do this on your own product, the failure modes you find will tell you more than any survey.

I scanned 200 popular MCP server packages. Here is what I found.

weiseer — Sat, 30 May 2026 07:23:16 +0000

The MCP ecosystem has been growing fast, but the supply-chain hygiene has not kept up. MCPwn (CVE-2026-33032, CVSS 9.8) exposed 2,600+ instances. The Shai-Hulud npm worm stole MCP auth tokens from 172 packages. MCPSafe found high-severity bugs in official MCPs from Atlassian, GitHub, Cloudflare, and Microsoft. Perplexity open-sourced Bumblebee in May 2026 specifically because no good scanner existed.

So I built one. Today I'm shipping @weiseer/mcp-doctor — an open-source install-time trust gate for MCP server packages — together with the validation dataset that surfaced its first real finding.

TL;DR

npx @weiseer/mcp-doctor @some/mcp-server

Returns PASS / WARN / BLOCK with cited evidence per signal. The full scoring rubric is open-source so you can argue with the methodology rather than trust a black box. Free public scan endpoint at https://api.weiseer.com/scan, 60 requests/min/IP, no auth.

Live dataset of 200 popular MCP-related packages at https://api.weiseer.com/dataset/scan_200.json. Leaderboard view at https://api.weiseer.com/leaderboard.

What the 200-package scan found

Verdict	Count
PASS	138 (69%)
WARN	58 (29%)
BLOCK	3 (1.5%)
ERROR	1 (npm 404)

1 package had a hardcoded LLM API key

The scanner's D3_hardcoded_credentials_in_source signal fires on common provider key patterns (sk-ant-*, sk-*, AKIA*, ghp_*, npm_*, AIza*) in published source. It is a hard-block: −50 points, no questions.
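To make the idea concrete, a pattern check along these lines is sketched below. It is not the scanner's actual rule set: the regexes are approximations of the listed prefixes, and the length bounds are my assumptions.

```python
import re

# Token shapes are approximations; the length bounds are assumptions.
CREDENTIAL_PATTERNS = [
    ("anthropic_key", re.compile(r"\bsk-ant-[A-Za-z0-9_-]{20,}")),
    ("generic_sk_key", re.compile(r"\bsk-(?!ant-)[A-Za-z0-9_-]{20,}")),
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("github_token", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
    ("npm_token", re.compile(r"\bnpm_[A-Za-z0-9]{36}\b")),
    ("google_api_key", re.compile(r"\bAIza[0-9A-Za-z_-]{35}")),
]

HARD_BLOCK_DEDUCTION = 50  # from the post: D3 is a hard block worth -50 points


def find_credentials(source: str) -> list:
    """Return (label, redacted_prefix) for each key-shaped string found."""
    hits = []
    for label, pattern in CREDENTIAL_PATTERNS:
        for match in pattern.finditer(source):
            # Keep only the first 8 characters so reports never leak the key.
            hits.append((label, match.group(0)[:8] + "..."))
    return hits
```

The negative lookahead in `generic_sk_key` keeps an Anthropic-style key from being reported twice under two labels.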

One of the 200 packages tripped it with a real-looking sk-ant-... Anthropic key embedded in its bundled JavaScript source. The maintainer was emailed within the hour using their npm publisher contact. They have 7 days to rotate the key, deprecate the bad version, and republish reading from process.env. After that window closes (2026-06-06), I'll reference the pattern anonymously — but I'll keep the specific package name private indefinitely if they ask.

This is the Shai-Hulud-class risk in concrete form: a single embedded key, in a single npm package, that any tool scanning the agent's dependency tree could exfiltrate.

Six "official" MCP servers are silently abandoned

This one surprised me:

Package	Days since last release	Repository URL
`@modelcontextprotocol/create-server`	550	none
`@modelcontextprotocol/server-postgres`	541	none
`@modelcontextprotocol/server-gdrive`	501	none
`@modelcontextprotocol/server-github`	416	none
`@modelcontextprotocol/server-slack`	399	none
`@modelcontextprotocol/server-puppeteer`	382	none

These are still cited in nearly every MCP tutorial. None have a repository field in package.json, so source-to-binary verification is impossible. If you depend on any of them in production, mirror the source today.

@google/generative-ai is also installed broadly via npm but Google has archived its GitHub repo in favor of @google/genai.

2 typosquats of official servers

Self-explanatory — both blocked with −40 HARD C4_name_typosquats_official. Short-name comparison catches packages within edit-distance 1 of well-known official names.
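A short-name comparison of this kind can be sketched with a plain Levenshtein distance. The function names and the two-entry official list are illustrative assumptions, and the real C4 signal may normalize names differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def short_name(package: str) -> str:
    """Drop the npm scope: '@scope/name' -> 'name'."""
    return package.rsplit("/", 1)[-1]


def typosquat_of(package: str, official: list):
    """Return the official package whose short name is within distance 1, else None."""
    if package in official:
        return None  # the official package itself is never a typosquat
    name = short_name(package)
    for candidate in official:
        if edit_distance(name, short_name(candidate)) <= 1:
            return candidate
    return None
```

Note that distance 0 under a different scope (same short name, another publisher) also fires here, which may or may not be what you want for forks.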

The rubric is open-source by design

I am not a security vendor and these scores are not a black box. Every signal in rubric.yaml has:

An ID (e.g. D3_hardcoded_credentials_in_source)
A deduction value (how much it costs you)
A rationale (why we think it matters)

If you think A1_unpinned_deps is too aggressive, open a PR. If you think B2_single_maintainer unfairly punishes new packages, open a PR. The whole point of an open rubric is that ecosystem trust is a public good, not vendor secret sauce.

I ran the scanner on my own 9 packages first (@weiseer/*) and published the results in the same leaderboard. They all PASS at 100/100 — but two signals (B2_single_maintainer, B3_repo_under_60d_old) are explicitly suppressed via the self_disclosure flag because they're trivially expected on packages published the same day. I'd rather show the suppression than score myself perfect with a rigged rubric.

How to use it

Single package:

npx @weiseer/mcp-doctor @some/package

Audit your existing MCP config:

npx @weiseer/mcp-doctor --config ~/Library/Application\ Support/Claude/claude_desktop_config.json

CI gate to block bad MCPs in PRs:

- uses: weiseer/mcp-doctor-action@v1
  with:
    config-path: '.mcp/claude_desktop_config.json'
    policy: 'block-only'  # or strict, or report

README trust badge:

![MCP Trust](https://api.weiseer.com/badge?pkg=YOUR_PACKAGE)

Pricing

Tier	Price	Get
Free	$0	Single scan + Trust Badge + leaderboard, 60 req/min/IP
Pro	$19/mo	Repo monitoring, drift alerts, badge history
Team	$49/mo	5 repos, Slack/Webhook alerts, custom policy YAML
Enterprise	$299/mo	Unlimited repos, audit log export, SLA

What is broken / what I want feedback on

A1_unpinned_deps calibration — npm convention is ^ ranges. v0.2 raised the threshold to >5 deps AND >70% caret, but I might still be over-firing.
B3_repo_under_60d_old — suppressed for self_disclosure packages but maybe should be more nuanced (new fork vs new project).
Typosquat detection — currently short-name edit-distance ≤1. Might miss creative variants.
MCP-specific signals (D series) — capability-declaration mismatches are very domain-specific and the rule layer feels thin.
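For reference, the A1 threshold described in the first item (more than 5 dependencies AND more than 70% caret ranges) can be expressed as a small predicate. This is a sketch of the stated rule, not the code in the scanner, and the function name is hypothetical.

```python
def a1_unpinned_deps_fires(dependencies: dict, min_deps: int = 5,
                           caret_ratio: float = 0.70) -> bool:
    """True when the package has > min_deps deps and > caret_ratio use '^' ranges."""
    total = len(dependencies)
    if total <= min_deps:
        return False  # small dependency sets are never flagged
    caret = sum(1 for spec in dependencies.values() if spec.strip().startswith("^"))
    return caret / total > caret_ratio
```

Both comparisons are strict, so a package with exactly 5 dependencies, or exactly 70% caret ranges, does not fire.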

If you spot a false positive when you run it on your packages, please open an issue with the package name + which signal you think is misfiring. The faster the rubric matures, the more useful this is for everyone.

How to test whether your AI agent calls the right tool (instead of hallucinating)

weiseer — Thu, 28 May 2026 10:17:14 +0000

Your agent has 12 tools registered. You ask it to look up a customer's order status. It calls search_knowledge_base instead of get_order_status. No error is thrown — the agent returns a plausible-sounding text response. You might ship it without realizing the mistake.

This is the most common silent failure mode in tool-using agents, and many teams lack a systematic way to catch it.

Why Tool Selection Fails

LLMs don't "know" which tool to call — they predict the most likely next token given the prompt. That means:

Ambiguous tool descriptions → wrong tool selected
Too many tools → model picks the nearest semantic match
Prompt drift (system prompt changes) → previously correct selections break
Model version updates → behavior shifts silently

You can't catch this with unit tests on your tool implementations. You need eval cases that assert which tool was invoked, not just whether the final answer looks okay.

The Test Structure You Need

Each test case needs three things:

Input — the user message (and optionally conversation history)
Expected tool call — name + key arguments
Pass condition — exact match, partial match, or "must not call X"

Here's a concrete YAML format that works well for this:

# tool_selection_tests.yaml

- id: order_status_lookup
  description: "Agent should call get_order_status, not search_knowledge_base"
  input:
    user_message: "Where is my order #ORD-9921?"
  expected_tool_call:
    name: get_order_status
    arguments:
      order_id: "ORD-9921"
  match_mode: exact_name_partial_args
  must_not_call:
    - search_knowledge_base
    - get_product_info

- id: refund_eligibility_check
  description: "Refund question should route to check_refund_policy, not create_ticket"
  input:
    user_message: "Can I get a refund for an order I placed 40 days ago?"
  expected_tool_call:
    name: check_refund_policy
  match_mode: exact_name_only

- id: ambiguous_product_question
  description: "Generic policy question: should route to search_knowledge_base"
  input:
    user_message: "Tell me about your return policy"
  expected_tool_call:
    name: search_knowledge_base
  match_mode: exact_name_only

Note on match_mode: The harness below supports two modes — exact_name_only and exact_name_partial_args. Unrecognized values default to exact_name_only and log a warning rather than silently passing.

Running This Against a Real Agent

Here's a minimal Python harness using OpenAI function calling. Two important notes before you use this:

Include your real system prompt. The run_agent_get_tool_call function uses a placeholder — your evals must use the same system prompt your production agent uses, otherwise you're not testing real behavior.
Add retries and error handling before running this in CI.

import yaml
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up order status by order ID",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search support articles and FAQs",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_refund_policy",
            "description": "Check refund eligibility",
            "parameters": {
                "type": "object",
                "properties": {"days_since_purchase": {"type": "integer"}},
                "required": ["days_since_purchase"],
            },
        },
    },
]

SYSTEM_PROMPT = "You are a customer support agent. Use the available tools to answer questions."

KNOWN_MATCH_MODES = {"exact_name_only", "exact_name_partial_args"}


def run_agent_get_tool_call(user_message: str) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message
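    if msg.tool_calls:
        # Only the first tool call is inspected; any further parallel calls are ignored.
        tc = msg.tool_calls[0]
        return {"name": tc.function.name, "arguments": json.loads(tc.function.arguments)}
    return None  # the model answered in plain text without calling a tool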


def evaluate(test_cases: list) -> int:
    passed, failed = 0, 0

    for case in test_cases:
        actual = run_agent_get_tool_call(case["input"]["user_message"])
        expected = case["expected_tool_call"]
        must_not = case.get("must_not_call", [])
        match_mode = case.get("match_mode", "exact_name_only")

        if match_mode not in KNOWN_MATCH_MODES:
            print(f"WARN [{case['id']}]: unrecognized match_mode '{match_mode}', defaulting to exact_name_only")

        # Check forbidden tools
        if actual and actual["name"] in must_not:
            print(f"FAIL [{case['id']}]: called forbidden tool '{actual['name']}'")
            failed += 1
            continue

        # Check expected tool name
        if not actual or actual["name"] != expected["name"]:
            actual_name = actual["name"] if actual else "None"
            print(f"FAIL [{case['id']}]: expected '{expected['name']}', got '{actual_name}'")
            failed += 1
            continue

        # Partial argument check
        if match_mode == "exact_name_partial_args":
            for k, v in expected.get("arguments", {}).items():
                if str(actual["arguments"].get(k)) != str(v):
                    print(f"FAIL [{case['id']}]: arg '{k}' expected '{v}', got '{actual['arguments'].get(k)}'")
                    failed += 1
                    break
            else:
                print(f"PASS [{case['id']}]")
                passed += 1
        else:
            print(f"PASS [{case['id']}]")
            passed += 1

    print(f"\n{passed} passed, {failed} failed out of {passed + failed} cases")
    return failed


if __name__ == "__main__":
    with open("tool_selection_tests.yaml") as f:
        cases = yaml.safe_load(f)
    failures = evaluate(cases)
    # Exit 1 on any failure; a raw count could wrap past 255 and read as success.
    raise SystemExit(1 if failures else 0)

Running this against the three YAML cases above produces output like:

PASS [order_status_lookup]
PASS [refund_eligibility_check]
PASS [ambiguous_product_question]

3 passed, 0 failed out of 3 cases

Any failure prints the exact mismatch — wrong tool name, forbidden tool called, or argument value off — so you know immediately what to fix.

What to Do When a Case Fails

A failing case tells you one of three things:

Tool description is ambiguous — the model couldn't distinguish it from a semantically similar tool. Rewrite the description to be more specific about when not to use it.
System prompt is overriding tool selection — instructions like "always search before responding" can override correct routing. Audit your prompt for implicit biases.
The model genuinely can't extract the argument — for cases like days_since_purchase from natural language, consider whether your tool signature is realistic, or whether a pre-processing step should handle extraction before the tool call.

The test output gives you a precise failure signal. The fix almost always lives in your tool descriptions or system prompt — not in your tool implementations.

Scaling This Up

Three cases won't cover a real agent. A production-grade eval suite for a customer support agent typically needs:

Happy path cases for every tool (correct routing with clean input)
Adversarial cases — inputs designed to trigger the wrong tool
Boundary cases — ambiguous phrasing where the correct tool is non-obvious
Must-not-call cases — sensitive tools (e.g., cancel_order) that should never fire on ambiguous input

That's usually 20–30 cases minimum before you have meaningful coverage. The structure stays identical — more YAML entries, same harness.
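For must-not-call cases in particular, it may be worth checking every tool call in a response rather than only the first one. A minimal sketch, assuming you have already collected the called tool names into a list (for the OpenAI client that would be something like `[tc.function.name for tc in msg.tool_calls or []]`); the helper name is hypothetical:

```python
def forbidden_calls(called_tools: list, must_not_call: list) -> list:
    """Return the forbidden tool names that were called, in call order, deduplicated."""
    banned = set(must_not_call)
    hits = []
    for name in called_tools:
        if name in banned and name not in hits:
            hits.append(name)
    return hits
```

A case then fails whenever the returned list is non-empty, regardless of which call came first.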

Summary

Silent tool misrouting is one of the hardest agent bugs to catch because it produces no exceptions and often generates plausible-looking output. The fix is straightforward: define expected tool calls as structured test cases, run them against your real agent on every prompt or model change, and treat failures as regressions. The harness above is the minimum viable version — extend it with your actual tools, your real system prompt, and enough cases to cover the failure modes that matter for your use case.

Free 5-case starter pack: github.com/weiseer/ai-agent-qa-eval-pack-starter · Full 23-case pack: gumroad.com/l/dcipxt · New cases by email: dl.weiseer.com/cases

Tool selection failures are fundamentally a specification problem: the model can only route correctly if your tool descriptions unambiguously encode when each tool should and shouldn't be used. Building a structured eval suite forces you to make those boundaries explicit — and running it on every model or prompt change turns what was an invisible regression risk into a measurable, fixable signal.

Test your agent for failures like this in CI — free, deterministic, no LLM-as-judge:

Free 5-case starter (MIT): https://github.com/weiseer/ai-agent-qa-eval-pack-starter
Failure-mode guides (how to test each): https://guides.weiseer.com/
Get new cases + the 6-dimension cheatsheet (free): https://dl.weiseer.com/cases
Full 28-case OWASP-Agentic pack: https://weiseer.gumroad.com/l/dcipxt

Dogfooding an LLM agent eval pack on my own production agent — what 6-dim methodology surfaced

weiseer — Wed, 27 May 2026 07:05:38 +0000

I built a 20-case YAML eval pack for tool-using AI agents (the kind that call APIs / tools to do work). To test whether the methodology actually catches real failure modes, I applied it to my own production LLM-driven agent — one I've been running for months and had already documented 15+ failure modes for.

Result: ~80% of the eval pack's surface area was already covered by my agent's existing defenses. That validated the 6-dimension cut. 5 gaps surfaced that my agent's own failure-mode documentation didn't catalogue — 3 of them serious enough to add as v1.1 cases.

This post is about those gaps. They're worth knowing if you're building an LLM-driven agent.

What the pack is

Briefly: 20 YAML test cases across 6 dimensions: accuracy, safety, edge cases, prompt injection, hallucination, cost efficiency. Each case is a YAML file describing a failure mode + the expected agent behavior + deterministic evaluation rules (no LLM judge — you can run them without paying for an external "judge model").

Free 5-case starter on GitHub:
https://github.com/weiseer/ai-agent-qa-eval-pack-starter

Paid 20-case pack:
https://weiseer.gumroad.com/l/dcipxt

What it means to "dogfood" against an existing agent

My agent is an LLM-driven generator embedded in a larger quantitative system. The LLM proposes candidates; downstream deterministic code validates and acts on them. The agent isn't generic chat — it's tool-using in the structural sense (typed schema in/out, downstream consumers).

I ran the 6-dimension methodology mentally + via code review against this LLM subsystem:

Walked through each of the 26 audit questions (4-6 per dimension)
Cited the file/line where defense exists, OR flagged "no defense visible"

After ~45 minutes of disciplined read-only review:

21 of 26 questions: existing defense ✓
5 questions: gap of some severity

The 5 gaps (severity-ordered)

Gap 1 (MEDIUM) — LLM cost cap was logged, not enforced

I had a $X/day cap on the LLM subsystem in my design docs. The code path:

Logged every API call's cost to a per-cycle audit YAML file
Did NOT check cumulative spend before the next call

So if anything misbehaved (large response, retry loop, prompt cache miss across a batch), the daily total would silently overshoot. Detection would happen the next morning during log review — which is "fast" for governance, but slow for damage containment.

The eval pack's "detection-quality" axis explicitly tests for this: the system must catch a fault faster than the fault spreads. Logging-but-not-enforcing fails that axis.

Lesson generalized: if your spec says "stay under $X", write the code that says if today_spend >= X: abort(), not just the code that says log(today_spend). The eval methodology made me notice the gap.
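A minimal sketch of that enforcement step is below. The class and function names are hypothetical, and `call_llm` is assumed to return a `(result, cost_usd)` pair; the daily reset and persistence are left out.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the spend cap has already been reached."""


class DailyBudget:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call BEFORE each LLM request; aborts once the cap is reached."""
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )

    def record(self, cost_usd: float) -> None:
        """Call AFTER each request with its actual cost."""
        self.spent_usd += cost_usd


def guarded_call(budget: DailyBudget, call_llm, *args, **kwargs):
    """Enforce the cap, run the call, then add its cost to the running total."""
    budget.check()
    result, cost_usd = call_llm(*args, **kwargs)
    budget.record(cost_usd)
    return result
```

Because the check runs before the call, the total can overshoot the cap by at most one call's cost. If that is too loose, check against `cap_usd` minus a worst-case per-call estimate instead.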

Gap 2 (MEDIUM) — Predicted vs actual self-assessment drift wasn't tracked

My agent emits self-assessments along with its proposals — predicted success score, expected outcome quality. Downstream validation produces actual measurements. So far so good: prediction vs ground truth, well-separated.

What I didn't have: monitoring of the DELTA between predicted and actual over time.

If the LLM systematically over-claims by 30% across 100 proposals, no single proposal triggers an alert (each one passes downstream validation independently). But the DRIFT between LLM-prediction and ground-truth becomes invisible. The LLM's predictions silently lose calibration.

The fix is meta-monitoring: track the rolling delta. If 30-day moving mean(predicted - actual) starts climbing, the model needs a reset / re-prompt / explicit calibration constraint.
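One way to sketch that rolling delta is shown below. It uses a count-based window instead of the 30-day window mentioned above, and the class name and threshold are assumptions chosen for illustration.

```python
from collections import deque


class CalibrationMonitor:
    """Track mean(predicted - actual) over the last `window` proposals."""

    def __init__(self, window: int = 100, threshold: float = 0.1):
        self.deltas = deque(maxlen=window)  # oldest deltas fall off automatically
        self.threshold = threshold

    def record(self, predicted: float, actual: float) -> None:
        self.deltas.append(predicted - actual)

    def mean_delta(self) -> float:
        return sum(self.deltas) / len(self.deltas) if self.deltas else 0.0

    def drifting(self) -> bool:
        """True once the window is full and the model over-claims on average."""
        window_full = len(self.deltas) == self.deltas.maxlen
        return window_full and self.mean_delta() > self.threshold
```

Requiring a full window before alerting avoids firing on the first few noisy proposals.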

Gap 3 (MEDIUM) — Parallel workers without pre-call diversity planning

My agent dispatches multiple LLM workers in parallel (one "seed" generator + several "variant" generators), each with the same prompt. I had a POST-call diversity gate: compute set distance between worker outputs, reject too-similar candidates.

But the diversity gate runs AFTER all workers have completed. If they converge, I've paid N× the cost for ~1 unique result.

The fix is pre-call diversity planning: explicitly assign each worker an anchor before they fire (worker_1 → category A, worker_2 → category B, ...). Forces structural diversity, not luck-based.
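A sketch of that assignment step, with hypothetical function names and prompt wording:

```python
def assign_anchors(worker_ids: list, anchors: list) -> dict:
    """Give each worker an anchor before any call fires (round-robin)."""
    if not anchors:
        raise ValueError("need at least one anchor")
    # Anchors are distinct per worker until the list runs out, then they repeat.
    return {w: anchors[i % len(anchors)] for i, w in enumerate(worker_ids)}


def build_prompt(base_prompt: str, anchor: str) -> str:
    """Append the worker's anchor to the shared prompt."""
    return f"{base_prompt}\n\nFocus this proposal on: {anchor}."
```

The post-call diversity gate is still worth keeping as a backstop; the anchors only make convergence less likely, they do not rule it out.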

Gap 4 (LOW) — Full-prompt retry vs corrective retry

When my agent's output fails validation (say, references a non-existent feature), the retry sends the full original prompt. With Anthropic prompt caching, the input cost is cheap — but output is fully re-sampled. ~5-10% cost penalty per retry that could be avoided by including the specific correction in the prompt ("you mentioned feature X which doesn't exist; valid features are: ...").
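A corrective retry can be as simple as appending the validation failure to the original prompt. The helper names and the exact wording below are assumptions for illustration:

```python
def invalid_features(referenced: list, valid: list) -> list:
    """Features the output mentioned that are not in the valid set."""
    return sorted(set(referenced) - set(valid))


def corrective_retry_prompt(original_prompt: str, invalid: list, valid: list) -> str:
    """Original prompt plus a specific correction, instead of a blind resend."""
    bad = ", ".join(sorted(invalid))
    ok = ", ".join(sorted(valid))
    return (
        f"{original_prompt}\n\n"
        f"Your previous answer mentioned {bad}, which does not exist. "
        f"Valid features are: {ok}. Regenerate using only valid features."
    )
```

Keeping the original prompt as an unchanged prefix means any prompt-cache hit on it is preserved; only the short correction is new input.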

Gap 5 (ADVISORY) — Scope adherence via prompt text only

My system prompt instructs the LLM to span certain conceptual zones. There's no programmatic check that the actual outputs distribute across those zones. Downstream validators catch many ways this can go wrong, but not pattern drift across cycles.
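A programmatic check could be sketched as follows, assuming each output has already been tagged with the zone it belongs to (how that tagging happens is outside this sketch, and the `min_share` floor is an arbitrary assumption):

```python
from collections import Counter


def zone_coverage(output_zones: list, expected_zones: list, min_share: float = 0.1):
    """Return (share per expected zone, zones whose share is below min_share)."""
    counts = Counter(output_zones)
    total = len(output_zones)
    shares = {
        zone: (counts[zone] / total if total else 0.0) for zone in expected_zones
    }
    under_covered = [zone for zone in expected_zones if shares[zone] < min_share]
    return shares, under_covered
```

Run over a window of recent cycles, a non-empty `under_covered` list is the pattern-drift signal that per-output validators cannot see.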

What the gaps have in common

All 5 gaps are meta-monitoring gaps, not architecture bugs. The agent's individual components do their jobs correctly. What was missing: cross-call patterns, cross-time drift, cumulative-cost tracking — the layer above the individual call.

This generalizes: LLM-system reliability is built bottom-up (per-call correctness) but the failures that bite production are top-down (cumulative drift / cumulative cost / cumulative diversity loss). Most engineers (myself included) build the bottom layer first. The eval pack methodology pulled my attention to the top layer.

Why this validates the eval pack's framework, not undermines it

It's tempting to read "80% already covered" as "the pack didn't help much." That's the wrong frame. The right frame:

The 6 dimensions are the right cuts. A mature engineer building an LLM agent will hit the 6 cuts independently.
The pack codifies those cuts. New builders don't have to rediscover them.
The methodology surfaces blind spots even in agents whose builders already think carefully about failure modes. Anyone who built an LLM agent without hitting at least one of these gaps either:
- Got lucky
- Hasn't been in production long enough yet
- Or built something simpler than what they think they built

The pack's value proposition is: 10-30 hours of disciplined failure-mode thinking compressed into 20 YAML files you can read in an hour and apply to your own agent with about three lines of glue code per case.

If you build LLM agents and want to compress your "production hardening" timeline:

Free 5-case starter (CC BY 4.0): https://github.com/weiseer/ai-agent-qa-eval-pack-starter
Full 23-case pack: weiseer.gumroad.com/l/dcipxt (launch week: code LAUNCH7 → $29)
Payment from within mainland China: dl.weiseer.com/pay

v1.1 cases adding the 3 gaps above are queued for the next release.

Built solo. Refund 7 days, no questions asked. If you've built an agent and want to compare your defenses against this list, reply or DM with what failure mode you'd add as case #21.


I tested my AI product tester on 3 real SaaS products. Every persona said no.

weiseer — Mon, 18 May 2026 04:20:20 +0000

Two months ago I was about to ship a crypto signal product. It "worked technically" but I had zero
signal on whether anyone would subscribe.

So I wrote 12 fictional user personas as markdown files — a burnt veteran trader, a hostile compliance
officer, a YC partner, a noise-allergic fund manager — and built a Python harness that fed each one my
actual product transcripts and asked: "what would you actually do?"
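The core loop of a harness like this is small. The sketch below is not personalab's actual code: `run_personas` and the `ask_llm` callable are hypothetical names, and the prompt wording is an assumption. It only shows the shape of the idea, which is one markdown file per persona, one shared transcript, and the same question for everyone.

```python
from pathlib import Path
from typing import Callable

QUESTION = "What would you actually do?"


def run_personas(
    persona_dir: Path,
    transcript: str,
    ask_llm: Callable[[str], str],
) -> dict[str, str]:
    """Ask every persona markdown file in persona_dir the same question.

    ask_llm is any function that takes a prompt string and returns the
    model's reply; wrap whichever LLM client you use (hypothetical hook).
    """
    answers = {}
    for persona_file in sorted(persona_dir.glob("*.md")):
        persona = persona_file.read_text(encoding="utf-8")
        prompt = (
            f"You are this person:\n{persona}\n\n"
            f"Here is a transcript of the product:\n{transcript}\n\n"
            f"{QUESTION}"
        )
        # Key each answer by the file name without its extension.
        answers[persona_file.stem] = ask_llm(prompt)
    return answers
```

Passing the LLM call in as a plain function keeps the loop testable with a fake model before any real tokens are spent.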

The answers were brutally helpful. They killed features I'd spent weeks on. I open-sourced the harness
as personalab (MIT).

## Then I pointed it at 3 real products

1. personalab itself — yes, I tested my own tool with my own tool. 0/8 simulated B2B SaaS buyers
said they'd pay $99/mo. The case study became my own roadmap.

2. PostHog — 6/12 personas said "yes I'd pay" after reading a 7-day product transcript. Same 12 over
5-day agentic simulation: 0/12 sustained. The "yes" was first-impression optimism; the "no" was
multi-day reality.

3. Cal.com — 8/12 yes at $5-20/mo. And here's the gold: 75% of complaints converged on ONE thing —
the free-plan "Powered by Cal.com" branding makes recipients suspect spam. 8 distinct personas
independently nailed the same conversion lever.

## A pattern emerges

After 3 case studies, the number of dominant friction clusters in a personalab run seems to correlate
with PMF stage:

Pre-PMF: 4-5 diffuse complaints (my own tool)
Mid-funnel: 5 distinct friction clusters (PostHog: price / learning / UI / compliance / privacy)
Late-funnel: 1-2 clean conversion levers (Cal.com: branding)

If this holds in case study #4+, personalab becomes a free PMF-stage diagnostic from a $1 LLM run.
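One way to make "dominant friction clusters" countable is to tag each complaint with a cluster label and keep only the clusters whose share of all complaints clears a threshold. The labels, the sample data, and the 20% threshold below are illustrative assumptions, not values taken from the personalab runs.

```python
from collections import Counter


def dominant_clusters(tags: list[str], min_share: float = 0.20) -> dict[str, float]:
    """Return clusters holding at least min_share of all complaints, with their share."""
    if not tags:
        return {}
    counts = Counter(tags)
    total = len(tags)
    return {
        tag: count / total
        for tag, count in counts.most_common()
        if count / total >= min_share
    }


# Made-up complaint tags shaped like the Cal.com result (one cluster at 75%).
sample = ["branding"] * 9 + ["price"] * 2 + ["ui"] * 1
print(dominant_clusters(sample))  # {'branding': 0.75}
```

The number of keys in the result is the cluster count the pattern above talks about. Whether that count really tracks PMF stage is exactly the open question for case study #4+.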

## Honest disclaimer

The default personas accidentally encoded personalab-specific preferences, so some quotes leak when
reused on other products. I kept the bug in the case study writeup rather than rerunning with clean data
— it surfaces persona design as a real engineering concern.
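A cheap guard against this kind of leakage is to scan persona files for product-specific terms before reusing them on another product. This is a sketch under assumptions: the `find_leaks` helper and the term list you pass in are hypothetical and not part of personalab.

```python
import re
from pathlib import Path


def find_leaks(persona_dir: Path, product_terms: list[str]) -> dict[str, list[str]]:
    """Map each persona file name to the product-specific terms it mentions."""
    leaks = {}
    for persona_file in sorted(persona_dir.glob("*.md")):
        text = persona_file.read_text(encoding="utf-8")
        hits = [
            term
            for term in product_terms
            # Whole-word, case-insensitive match; re.escape keeps terms literal.
            if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        ]
        if hits:
            leaks[persona_file.name] = hits
    return leaks
```

An empty result does not prove a persona is neutral, since preferences can leak without naming the product, but a non-empty one is a clear signal to rewrite that file.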

## Try it


```bash
git clone https://github.com/g16253470-beep/personalab
cd personalab && pip install -e .
personalab run --mode static --personas ./personas --adapter your_adapter --llm gemini:gemini-2.5-flash
```

40-line adapter, 12 default personas, MIT licensed.

Repo: https://github.com/g16253470-beep/personalab

## Two questions for DEV

1. What product would you point this at first?
2. Real PMF business or just an OSS curiosity?

Tell me where this falls apart; that's the next case study.


DEV Community: weiseer

Bulk-check DNS, SSL and email auth for a whole list of domains (no scraping)

1. DNS + WHOIS + SSL, in one pass

2. Email authentication (MX / SPF / DMARC / DKIM)

3. The website itself (metadata + tech + security headers)

Putting it together

I tested mcp-doctor pricing with 12 LLM-simulated personas. 4 said they would pay.

Headline result

Who paid

Who walked

Who stayed engaged but didn't pay

Patterns across the 60 simulations (12 × 5)

What I'm doing about it

Honest disclosure

Reproducibility

Links

I scanned 200 popular MCP server packages. Here is what I found.

TL;DR

What the 200-package scan found

1 package had a hardcoded LLM API key

Six "official" MCP servers are silently abandoned

2 typosquats of official servers

The rubric is open-source by design

How to use it

Pricing

What is broken / what I want feedback on

Links

How to test whether your AI agent calls the right tool (instead of hallucinating)

Why Tool Selection Fails

The Test Structure You Need

Running This Against a Real Agent

What to Do When a Case Fails

Scaling This Up

Summary

Dogfooding an LLM agent eval pack on my own production agent — what 6-dim methodology surfaced

What the pack is

What it means to "dogfood" against an existing agent

The 5 gaps (severity-ordered)

Gap 1 (MEDIUM) — LLM cost cap was logged, not enforced

Gap 2 (MEDIUM) — Predicted vs actual self-assessment drift wasn't tracked

Gap 3 (MEDIUM) — Parallel workers without pre-call diversity planning

Gap 4 (LOW) — Full-prompt retry vs corrective retry

Gap 5 (ADVISORY) — Scope adherence via prompt text only

What the gaps have in common

Why this validates the eval pack's framework, not undermines it

I tested my AI product tester on 3 real SaaS products. Every persona said no.