<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: weiseer</title>
    <description>The latest articles on DEV Community by weiseer (@weiseer).</description>
    <link>https://dev.to/weiseer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3937149%2F05c9f686-d8e1-4bd8-8427-8e4d2a6966c0.png</url>
      <title>DEV Community: weiseer</title>
      <link>https://dev.to/weiseer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/weiseer"/>
    <language>en</language>
    <item>
      <title>Bulk-check DNS, SSL and email auth for a whole list of domains (no scraping)</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Sun, 31 May 2026 04:16:15 +0000</pubDate>
      <link>https://dev.to/weiseer/bulk-check-dns-ssl-and-email-auth-for-a-whole-list-of-domains-no-scraping-37pd</link>
      <guid>https://dev.to/weiseer/bulk-check-dns-ssl-and-email-auth-for-a-whole-list-of-domains-no-scraping-37pd</guid>
      <description>&lt;p&gt;If you've ever had a spreadsheet of domains (a lead list, an acquisition target's footprint, your own portfolio) and needed DNS records, WHOIS, SSL expiry, or email authentication for &lt;em&gt;all&lt;/em&gt; of them, you know the pain: single-domain web tools don't scale, and &lt;code&gt;dig&lt;/code&gt; / &lt;code&gt;whois&lt;/code&gt; / &lt;code&gt;openssl&lt;/code&gt; loops are fiddly to parse.&lt;/p&gt;

&lt;p&gt;Here's how I think about pulling clean, structured domain intelligence in bulk, and the three small tools I built so I never have to write that loop again.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. DNS + WHOIS + SSL, in one pass
&lt;/h2&gt;

&lt;p&gt;For each domain you usually want three things together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DNS&lt;/strong&gt; — A/AAAA/MX/NS/TXT/CNAME (where it points, who runs mail, the DNS provider)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WHOIS&lt;/strong&gt; — registrar, creation/expiry dates, status, name servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSL&lt;/strong&gt; — issuer, the &lt;code&gt;valid_from&lt;/code&gt;/&lt;code&gt;valid_to&lt;/code&gt; window, SAN list&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick is to do them as &lt;strong&gt;protocol calls&lt;/strong&gt; (DNS resolution, a TLS handshake, WHOIS on port 43) rather than scraping any website: protocol surfaces are stable, so the output doesn't break when sites redesign.&lt;/p&gt;
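&lt;p&gt;As a rough illustration of the TLS part, here is a minimal Python sketch using only the standard library. It is not the code behind the tool: &lt;code&gt;summarize_cert&lt;/code&gt; and &lt;code&gt;fetch_cert&lt;/code&gt; are hypothetical helper names, and the output keys simply mirror the &lt;code&gt;valid_from&lt;/code&gt;/&lt;code&gt;valid_to&lt;/code&gt; fields mentioned above.&lt;/p&gt;

```python
import socket
import ssl
from datetime import datetime, timezone


def summarize_cert(cert):
    """Flatten a dict shaped like ssl.SSLSocket.getpeercert() output into one row."""
    # The issuer is a tuple of RDNs, each RDN a tuple of (key, value) pairs.
    issuer = {key: value for rdn in cert.get("issuer", ()) for key, value in rdn}
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

    def to_iso(stamp):
        # cert_time_to_seconds parses the "Jan 15 00:00:00 2026 GMT" format.
        seconds = ssl.cert_time_to_seconds(stamp)
        return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

    return {
        "issuer": issuer.get("organizationName") or issuer.get("commonName"),
        "valid_from": to_iso(cert["notBefore"]),
        "valid_to": to_iso(cert["notAfter"]),
        "san": sans,
    }


def fetch_cert(domain, port=443, timeout=5.0):
    """Do a TLS handshake and return the peer certificate as a dict (hypothetical helper)."""
    context = ssl.create_default_context()
    with socket.create_connection((domain, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=domain) as tls:
            return tls.getpeercert()
```

&lt;p&gt;Calling &lt;code&gt;summarize_cert(fetch_cert("example.com"))&lt;/code&gt; would give one flat row per domain. Keeping the parsing separate from the network call makes the parsing testable offline.&lt;/p&gt;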

&lt;p&gt;If you don't want to wire that yourself, I packaged it as &lt;a href="https://apify.com/weiseer/domain-intel-scraper" rel="noopener noreferrer"&gt;Domain Intelligence on Apify&lt;/a&gt;: paste a list of domains, get one clean JSON row each. ($0.01/domain, no proxies.)&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Email authentication (MX / SPF / DMARC / DKIM)
&lt;/h2&gt;

&lt;p&gt;Deliverability and security both hinge on the same DNS records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MX&lt;/strong&gt; present? which provider? (&lt;code&gt;aspmx.l.google.com&lt;/code&gt; → Google Workspace, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPF&lt;/strong&gt;: is there a &lt;code&gt;v=spf1&lt;/code&gt; TXT record, and does it end in &lt;code&gt;-all&lt;/code&gt; (hard fail, strict) or &lt;code&gt;~all&lt;/code&gt; (soft fail)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DMARC&lt;/strong&gt; — &lt;code&gt;_dmarc&lt;/code&gt; TXT with &lt;code&gt;p=reject&lt;/code&gt; / &lt;code&gt;quarantine&lt;/code&gt; / &lt;code&gt;none&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DKIM&lt;/strong&gt; — does a common selector (&lt;code&gt;google&lt;/code&gt;, &lt;code&gt;selector1&lt;/code&gt;, &lt;code&gt;k1&lt;/code&gt;…) publish a key at
&lt;code&gt;&amp;lt;selector&amp;gt;._domainkey.&amp;lt;domain&amp;gt;&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A domain with MX + SPF &lt;code&gt;-all&lt;/code&gt; + DMARC &lt;code&gt;reject&lt;/code&gt; + DKIM is a "strong" setup; missing DMARC is the most common gap. You can score a whole list this way in seconds, with no SMTP probing required (which mail servers block anyway). I put this in &lt;a href="https://apify.com/weiseer/email-deliverability-checker" rel="noopener noreferrer"&gt;Bulk Email Deliverability Checker&lt;/a&gt;.&lt;/p&gt;
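&lt;p&gt;The "strong" rule above can be sketched in a few lines of Python once you already have the TXT records in hand. This is an illustration under assumptions, not the checker's actual logic: the function names and the &lt;code&gt;"gaps"&lt;/code&gt; label are hypothetical, and the DNS lookups themselves are left out.&lt;/p&gt;

```python
def spf_all_term(txt_records):
    """Return the SPF 'all' term ('-all', '~all', ...), 'no-all', or None if no SPF record."""
    for record in txt_records:
        terms = record.strip().split()
        if terms and terms[0].lower() == "v=spf1":
            for term in terms[1:]:
                if term.lower() in ("-all", "~all", "?all", "+all", "all"):
                    return term.lower()
            return "no-all"  # SPF record present, but it never states an 'all' mechanism
    return None


def dmarc_policy(txt_records):
    """Return the p= value of a _dmarc TXT record, or None if there is none."""
    for record in txt_records:
        tags = [part.strip() for part in record.split(";")]
        if tags and tags[0].replace(" ", "").lower() == "v=dmarc1":
            for tag in tags[1:]:
                name, _, value = tag.partition("=")
                if name.strip().lower() == "p":
                    return value.strip().lower()
    return None


def grade(has_mx, spf_txt, dmarc_txt, has_dkim):
    """Apply the MX + SPF -all + DMARC reject + DKIM rule; labels are assumptions."""
    spf = spf_all_term(spf_txt)
    dmarc = dmarc_policy(dmarc_txt)
    strong = bool(has_mx and spf == "-all" and dmarc == "reject" and has_dkim)
    return {"spf": spf, "dmarc": dmarc, "grade": "strong" if strong else "gaps"}
```

&lt;p&gt;Because everything here is string parsing on DNS answers, scoring a long list is cheap; the slow part is the lookups themselves.&lt;/p&gt;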

&lt;h2&gt;
  
  
  3. The website itself (metadata + tech + security headers)
&lt;/h2&gt;

&lt;p&gt;The HTTP layer rounds out the picture: final URL after redirects, status, &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;, meta/Open Graph, &lt;code&gt;Server&lt;/code&gt;/&lt;code&gt;X-Powered-By&lt;/code&gt;, the &lt;strong&gt;security headers&lt;/strong&gt; (HSTS, CSP, X-Frame-Options…) graded present/missing, and lightweight tech hints (Cloudflare, nginx, Next.js, Shopify, WordPress…). Useful for SEO audits, tech research, and lead enrichment: &lt;a href="https://apify.com/weiseer/website-metadata-tech-profiler" rel="noopener noreferrer"&gt;Website Metadata &amp;amp; Tech Profiler&lt;/a&gt;.&lt;/p&gt;
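&lt;p&gt;The present/missing grading of security headers is simple enough to sketch. The snippet below is a hypothetical illustration, not the profiler's code; it only checks the three headers named above and assumes you already have the response headers as a dict.&lt;/p&gt;

```python
# Only the headers named in the text; a real audit would likely check more.
SECURITY_HEADERS = (
    "Strict-Transport-Security",  # HSTS
    "Content-Security-Policy",    # CSP
    "X-Frame-Options",
)


def grade_security_headers(headers):
    """Map each tracked header to 'present' or 'missing' (header names are case-insensitive)."""
    seen = {name.lower() for name in headers}
    return {
        name: ("present" if name.lower() in seen else "missing")
        for name in SECURITY_HEADERS
    }
```

&lt;p&gt;The case-insensitive comparison matters because servers and HTTP/2 clients often report header names in lowercase.&lt;/p&gt;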

&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;For lead enrichment you might run all three over a list of company domains: WHOIS for registrar/age, DNS MX for the email stack, the web profiler for the tech stack, and the email checker for deliverability. That gives a quick technographic profile per domain, exported to CSV or pulled via API.&lt;/p&gt;

&lt;p&gt;All three are protocol-based and low-maintenance by design. Code and notes: &lt;a href="https://github.com/weiseer" rel="noopener noreferrer"&gt;github.com/weiseer&lt;/a&gt;. Happy to take requests for fields to add: what would you want in a bulk domain report?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I tested mcp-doctor pricing with 12 LLM-simulated personas. 4 said they would pay.</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Sat, 30 May 2026 09:40:08 +0000</pubDate>
      <link>https://dev.to/weiseer/i-tested-mcp-doctor-pricing-with-12-llm-simulated-personas-4-said-they-would-pay-4a2</link>
      <guid>https://dev.to/weiseer/i-tested-mcp-doctor-pricing-with-12-llm-simulated-personas-4-said-they-would-pay-4a2</guid>
      <description>&lt;p&gt;Earlier today I shipped &lt;a href="https://github.com/weiseer/mcp-doctor" rel="noopener noreferrer"&gt;&lt;code&gt;@weiseer/mcp-doctor&lt;/code&gt;&lt;/a&gt; — an open-source supply-chain trust scanner for MCP (Model Context Protocol) servers. CLI + GitHub Action + Trust Badge + free public API at &lt;a href="https://api.weiseer.com" rel="noopener noreferrer"&gt;https://api.weiseer.com&lt;/a&gt;. Pro tier is $19/mo on Gumroad.&lt;/p&gt;

&lt;p&gt;The honest question every solo founder skips: &lt;strong&gt;would anyone actually pay $19/mo for this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I have a separate tool for exactly this — &lt;code&gt;personalab&lt;/code&gt;, an open-source persona-driven product evaluation harness. 12 LLM-simulated personas read the product, decide each day what they'd do, and tell you who would pay and who would walk away. I've used it before on PostHog, Cal.com, and personalab-on-itself.&lt;/p&gt;

&lt;p&gt;Tonight I ran it on mcp-doctor as case study #4. Code, raw data, and the full report are all in &lt;a href="https://github.com/weiseer/mcp-doctor/tree/main/case_study" rel="noopener noreferrer"&gt;github.com/weiseer/mcp-doctor/tree/main/case_study/&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headline result
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;4 of 12 personas would pay (33%).&lt;/strong&gt; 2 abandoned. 6 stayed engaged on the free tier.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case study&lt;/th&gt;
&lt;th&gt;Would-pay rate (under same persona harness)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mcp-doctor (today)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4/12 = 33%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;personalab self-test&lt;/td&gt;
&lt;td&gt;0/8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostHog (5-day agentic)&lt;/td&gt;
&lt;td&gt;0/12 sustained (6/12 day-1 yes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cal.com&lt;/td&gt;
&lt;td&gt;8/12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This puts mcp-doctor between PostHog and Cal.com under the same methodology. &lt;strong&gt;Better than personalab itself&lt;/strong&gt;, better than what PostHog showed under a 5-day sustained simulation, &lt;strong&gt;worse than Cal.com&lt;/strong&gt; (which converged on a single clean friction lever — the famous "Powered by Cal.com" branding).&lt;/p&gt;

&lt;p&gt;Not making PMF claims. Treat it as one signal among several.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who paid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;07 OSS maintainer&lt;/strong&gt; — strongest engagement signal. Opened a GitHub issue on day 2, shared with team on day 3, subscribed to Pro on day 4. Quote synthesized from the 5-day transcript:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Supply-chain audits are part of my actual job. A rubric I can fork and argue with is worth more than another vendor's black box. $19/mo is below my coffee budget."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;06 Research consultant&lt;/strong&gt; — buys tools on behalf of clients. Subscribed on day 5. The "buying for someone else" pattern showed up clearly — they care about whether the trust signal is defensible to a third party.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who walked
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;02 Growth PM&lt;/strong&gt; — final action: UNSUBSCRIBE_OR_UNINSTALL. Their verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"mcp-doctor solves a supply-chain trust problem, which is completely orthogonal to my OKR (Free→Paid conversion 3.2%→4.5%). Five days in, it has done nothing to speed up my A/B iteration. Time cost &amp;gt; $19 of value."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(Translated from the persona's original Chinese output.)&lt;/p&gt;

&lt;p&gt;The persona is right: their OKR is conversion and my tool is about supply-chain trust, so this is an audience mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;11 Data team lead&lt;/strong&gt; — abandoned over rubric calibration disagreement. They disagreed with how aggressively &lt;code&gt;A1_unpinned_deps&lt;/code&gt; fires. This is real feedback the actual product would need to address (PR welcome on &lt;a href="https://github.com/weiseer/mcp-doctor/blob/main/rubric.yaml" rel="noopener noreferrer"&gt;rubric.yaml&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Who stayed engaged but didn't pay
&lt;/h2&gt;

&lt;p&gt;6 of 12 personas used the free tier daily and found genuine value, but did not subscribe. These are the &lt;strong&gt;free-tier loyalists&lt;/strong&gt;, which is exactly what the funnel is designed to produce. They give us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Word-of-mouth (some opened GitHub issues, shared with team)&lt;/li&gt;
&lt;li&gt;Trust badge usage on their READMEs (free)&lt;/li&gt;
&lt;li&gt;The actual marketing engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we tried to push these personas to Pro, we'd lose the funnel. Free tier should stay generous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns across the 60 simulations (12 × 5)
&lt;/h2&gt;

&lt;p&gt;The personalab agentic mode runs each persona day-by-day, so I get 60 data points. Friction clusters extracted:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;# mentions across persona-days&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rubric calibration / false positive concerns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro tier value vs Free tier sufficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP-specific audience (do I even use MCP?)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trust building (new brand)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vs npm audit / Snyk / Bumblebee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-serve / docs gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The top cluster — rubric calibration — is the right one to prioritize. v0.2 of the scanner should add an &lt;code&gt;LLM-judge&lt;/code&gt; mode for ambiguous signals (the same fix planned for &lt;a href="https://github.com/weiseer/prompt-redteam" rel="noopener noreferrer"&gt;@weiseer/prompt-redteam&lt;/a&gt;'s detection).&lt;/p&gt;

&lt;p&gt;Earlier personalab runs suggested a rule of thumb: pre-PMF products draw 4-5 diffuse complaints, while late-funnel products draw 1-2 clean levers. mcp-doctor surfaced 6 clusters on day 1 of launch, which fits the pre-PMF pattern of diffuse complaints.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm doing about it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Not changing pricing&lt;/strong&gt; — 33% would-pay on the right persona slice is enough signal at $19/mo. Cal.com hits 67% on a more general audience; we accept narrower fit at this stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharpening audience&lt;/strong&gt; — Twitter / Reddit posting should drop the "general developer" framing and double down on "MCP server users" specifically. The personas who pay are the ones who already do this work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric calibration&lt;/strong&gt; — top friction cluster is real. v0.2 will add LLM-judge classification of ambiguous signals + explicit per-signal severity thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not naming the package&lt;/strong&gt; — case study itself is the marketing. No "X is the worst, buy mcp-doctor."&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Honest disclosure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This is &lt;strong&gt;simulated user behavior via Claude Haiku 4.5&lt;/strong&gt;, not real customer interviews. Treat as one signal, not as PMF validation.&lt;/li&gt;
&lt;li&gt;The same persona library was previously calibrated on three other products; cross-product comparability is plausible but not proven.&lt;/li&gt;
&lt;li&gt;The product context was shown once; real buyers would see Twitter, GitHub stars, friends' opinions, etc.&lt;/li&gt;
&lt;li&gt;Some persona quotes may reflect personalab's own design biases (acknowledged in personalab's own meta case study).&lt;/li&gt;
&lt;li&gt;Two products by the same person (mcp-doctor + personalab) tested by the same person — bias risk acknowledged.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone personalab&lt;/span&gt;
git clone https://github.com/g16253470-beep/personalab
&lt;span class="nb"&gt;cd &lt;/span&gt;personalab

&lt;span class="c"&gt;# Adapt the runner to your product&lt;/span&gt;
&lt;span class="c"&gt;# https://github.com/weiseer/mcp-doctor/blob/main/case_study/run_personalab.py&lt;/span&gt;

&lt;span class="c"&gt;# Run on your own product brief&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;... python run_personalab.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The raw JSON output is at &lt;a href="https://github.com/weiseer/mcp-doctor/blob/main/case_study/personalab_raw_report.json" rel="noopener noreferrer"&gt;mcp-doctor/case_study/personalab_raw_report.json&lt;/a&gt;. Argue with the persona definitions. Fork the case-study runner. If you do this on your own product, the failure modes you find will tell you more than any survey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;mcp-doctor: &lt;a href="https://github.com/weiseer/mcp-doctor" rel="noopener noreferrer"&gt;github.com/weiseer/mcp-doctor&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;personalab: &lt;a href="https://github.com/g16253470-beep/personalab" rel="noopener noreferrer"&gt;github.com/g16253470-beep/personalab&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full case study (raw + markdown): &lt;a href="https://github.com/weiseer/mcp-doctor/tree/main/case_study" rel="noopener noreferrer"&gt;mcp-doctor/case_study/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Launch postmortem: &lt;a href="https://github.com/weiseer/launch-postmortem" rel="noopener noreferrer"&gt;weiseer/launch-postmortem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pro tier (if you actually want to pay): &lt;a href="https://weiseer.gumroad.com/l/hxmty" rel="noopener noreferrer"&gt;weiseer.gumroad.com/l/hxmty&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;— wei&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Test your agent for failures like this in CI — free, deterministic, no LLM-as-judge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free 5-case starter (MIT): &lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Failure-mode guides (how to test each): &lt;a href="https://guides.weiseer.com/" rel="noopener noreferrer"&gt;https://guides.weiseer.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Get new cases + the 6-dimension cheatsheet (free): &lt;a href="https://dl.weiseer.com/cases" rel="noopener noreferrer"&gt;https://dl.weiseer.com/cases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full 28-case OWASP-Agentic pack: &lt;a href="https://weiseer.gumroad.com/l/dcipxt" rel="noopener noreferrer"&gt;https://weiseer.gumroad.com/l/dcipxt&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>I scanned 200 popular MCP server packages. Here is what I found.</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Sat, 30 May 2026 07:23:16 +0000</pubDate>
      <link>https://dev.to/weiseer/i-scanned-200-popular-mcp-server-packages-here-is-what-i-found-4g03</link>
      <guid>https://dev.to/weiseer/i-scanned-200-popular-mcp-server-packages-here-is-what-i-found-4g03</guid>
      <description>&lt;p&gt;The MCP ecosystem has been growing fast, but the supply-chain hygiene has not kept up. MCPwn (CVE-2026-33032, CVSS 9.8) exposed 2,600+ instances. The Shai-Hulud npm worm stole MCP auth tokens from 172 packages. MCPSafe found high-severity bugs in &lt;em&gt;official&lt;/em&gt; MCPs from Atlassian, GitHub, Cloudflare, and Microsoft. Perplexity open-sourced Bumblebee in May 2026 specifically because no good scanner existed.&lt;/p&gt;

&lt;p&gt;So I built one. Today I'm shipping &lt;code&gt;@weiseer/mcp-doctor&lt;/code&gt; — an open-source install-time trust gate for MCP server packages — together with the validation dataset that surfaced its first real finding.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @weiseer/mcp-doctor @some/mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns &lt;code&gt;PASS / WARN / BLOCK&lt;/code&gt; with cited evidence per signal. The full scoring rubric is open-source so you can argue with the methodology rather than trust a black-box. Free public scan endpoint at &lt;a href="https://api.weiseer.com/scan" rel="noopener noreferrer"&gt;https://api.weiseer.com/scan&lt;/a&gt;, 60 requests/min/IP, no auth.&lt;/p&gt;

&lt;p&gt;Live dataset of 200 popular MCP-related packages at &lt;a href="https://api.weiseer.com/dataset/scan_200.json" rel="noopener noreferrer"&gt;https://api.weiseer.com/dataset/scan_200.json&lt;/a&gt;. Leaderboard view at &lt;a href="https://api.weiseer.com/leaderboard" rel="noopener noreferrer"&gt;https://api.weiseer.com/leaderboard&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 200-package scan found
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;138 (69%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WARN&lt;/td&gt;
&lt;td&gt;58 (29%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOCK&lt;/td&gt;
&lt;td&gt;3 (1.5%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ERROR&lt;/td&gt;
&lt;td&gt;1 (npm 404)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1 package had a hardcoded LLM API key
&lt;/h3&gt;

&lt;p&gt;The scanner's &lt;code&gt;D3_hardcoded_credentials_in_source&lt;/code&gt; signal fires on common provider key patterns (&lt;code&gt;sk-ant-*&lt;/code&gt;, &lt;code&gt;sk-*&lt;/code&gt;, &lt;code&gt;AKIA*&lt;/code&gt;, &lt;code&gt;ghp_*&lt;/code&gt;, &lt;code&gt;npm_*&lt;/code&gt;, &lt;code&gt;AIza*&lt;/code&gt;) in published source. It is a hard-block: −50 points, no questions.&lt;/p&gt;
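&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of prefix-based key detection. It is not the scanner's implementation: the regex, the &lt;code&gt;MIN_TAIL&lt;/code&gt; length, and the function name are assumptions chosen for illustration, and only the prefixes come from the list above.&lt;/p&gt;

```python
import re

# Prefixes taken from the text; everything else in this sketch is an assumption.
KEY_PREFIXES = ("sk-ant-", "sk-", "AKIA", "ghp_", "npm_", "AIza")
MIN_TAIL = 16  # assumed minimum number of key characters after the prefix

_ALTERNATION = "|".join(re.escape(prefix) for prefix in KEY_PREFIXES)
# \b keeps ordinary words such as "task-..." from matching the "sk-" prefix.
_PATTERN = re.compile(r"\b(?:" + _ALTERNATION + r")[A-Za-z0-9_\-]{%d,}" % MIN_TAIL)


def find_key_candidates(source_text):
    """Return redacted candidates (first 8 characters only) found in source text."""
    return [match.group(0)[:8] + "..." for match in _PATTERN.finditer(source_text)]
```

&lt;p&gt;Returning only a redacted prefix keeps the scanner's own output from becoming a second copy of the leaked secret. A pattern this loose will produce false positives, so treat hits as candidates to review.&lt;/p&gt;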

&lt;p&gt;One of the 200 packages tripped it with a real-looking &lt;code&gt;sk-ant-...&lt;/code&gt; Anthropic key embedded in its bundled JavaScript source. The maintainer was emailed within the hour using their npm publisher contact. They have 7 days to rotate the key, deprecate the bad version, and republish reading from &lt;code&gt;process.env&lt;/code&gt;. After that window closes (2026-06-06), I'll reference the pattern anonymously — but I'll keep the specific package name private indefinitely if they ask.&lt;/p&gt;

&lt;p&gt;This is the Shai-Hulud-class risk in concrete form: a single embedded key, in a single npm package, that any tool scanning the agent's dependency tree could exfiltrate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Six "official" MCP servers are silently abandoned
&lt;/h3&gt;

&lt;p&gt;This one surprised me:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Days since last release&lt;/th&gt;
&lt;th&gt;Repository URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/create-server&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;550&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/server-postgres&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;541&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/server-gdrive&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/server-github&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;416&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/server-slack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;399&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@modelcontextprotocol/server-puppeteer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;382&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are still cited in nearly every MCP tutorial. None have a &lt;code&gt;repository&lt;/code&gt; field in &lt;code&gt;package.json&lt;/code&gt;, so source-to-binary verification is impossible. If you depend on any of them in production, mirror the source today.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@google/generative-ai&lt;/code&gt; is also installed broadly via npm but Google has archived its GitHub repo in favor of &lt;code&gt;@google/genai&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2 typosquats of official servers
&lt;/h3&gt;

&lt;p&gt;Self-explanatory — both blocked with &lt;code&gt;−40 HARD C4_name_typosquats_official&lt;/code&gt;. Short-name comparison catches packages within edit-distance 1 of well-known official names.&lt;/p&gt;
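&lt;p&gt;A short-name edit-distance check can be sketched as follows. This is an illustration of the technique under assumptions, not the &lt;code&gt;C4&lt;/code&gt; implementation: the helper names are hypothetical, and treating a same-named package under a different scope as suspicious is my reading of "within edit-distance 1".&lt;/p&gt;

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def short_name(package):
    """Drop an npm scope: '@scope/name' becomes 'name'."""
    return package.rsplit("/", 1)[-1]


def looks_like_typosquat(candidate, official_names):
    """True if the candidate is not official but its short name is within distance 1 of one."""
    if candidate in official_names:
        return False
    name = short_name(candidate)
    return any(
        edit_distance(name, short_name(official)) in (0, 1)
        for official in official_names
    )
```

&lt;p&gt;Distance 1 catches a single swapped, missing, or extra character; it will not catch longer rewrites such as added suffixes.&lt;/p&gt;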

&lt;h2&gt;
  
  
  The rubric is open-source by design
&lt;/h2&gt;

&lt;p&gt;I am not a security vendor and these scores are not a black box. Every signal in &lt;a href="https://github.com/weiseer/mcp-doctor/blob/main/rubric.yaml" rel="noopener noreferrer"&gt;rubric.yaml&lt;/a&gt; has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;ID&lt;/strong&gt; (e.g. &lt;code&gt;D3_hardcoded_credentials_in_source&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;deduction value&lt;/strong&gt; (how much it costs you)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;rationale&lt;/strong&gt; (why we think it matters)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you think &lt;code&gt;A1_unpinned_deps&lt;/code&gt; is too aggressive, open a PR. If you think &lt;code&gt;B2_single_maintainer&lt;/code&gt; unfairly punishes new packages, open a PR. The whole point of an open rubric is that ecosystem trust is a public good, not vendor secret sauce.&lt;/p&gt;

&lt;p&gt;I ran the scanner on my own 9 packages first (&lt;code&gt;@weiseer/*&lt;/code&gt;) and published the results in the same leaderboard. They all PASS at 100/100 — but two signals (&lt;code&gt;B2_single_maintainer&lt;/code&gt;, &lt;code&gt;B3_repo_under_60d_old&lt;/code&gt;) are explicitly suppressed via the &lt;code&gt;self_disclosure&lt;/code&gt; flag because they're trivially expected on packages published the same day. I'd rather show the suppression than score myself perfect with a rigged rubric.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Single package:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @weiseer/mcp-doctor @some/package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audit your existing MCP config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @weiseer/mcp-doctor &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CI gate to block bad MCPs in PRs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weiseer/mcp-doctor-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.mcp/claude_desktop_config.json'&lt;/span&gt;
    &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;block-only'&lt;/span&gt;  &lt;span class="c1"&gt;# or strict, or report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;README trust badge:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;MCP Trust&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://api.weiseer.com/badge?pkg=YOUR_PACKAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Get&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Single scan + Trust Badge + leaderboard, 60 req/min/IP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;$19/mo&lt;/td&gt;
&lt;td&gt;Repo monitoring, drift alerts, badge history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team&lt;/td&gt;
&lt;td&gt;$49/mo&lt;/td&gt;
&lt;td&gt;5 repos, Slack/Webhook alerts, custom policy YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;$299/mo&lt;/td&gt;
&lt;td&gt;Unlimited repos, audit log export, SLA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What is broken / what I want feedback on
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A1_unpinned_deps calibration&lt;/strong&gt; — npm convention is &lt;code&gt;^&lt;/code&gt; ranges. v0.2 raised the threshold to &amp;gt;5 deps AND &amp;gt;70% caret, but I might still be over-firing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B3_repo_under_60d_old&lt;/strong&gt; — suppressed for self_disclosure packages but maybe should be more nuanced (new fork vs new project).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typosquat detection&lt;/strong&gt; — currently short-name edit-distance ≤1. Might miss creative variants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP-specific signals (D series)&lt;/strong&gt; — capability-declaration mismatches are very domain-specific and the rule layer feels thin.&lt;/li&gt;
&lt;/ul&gt;
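&lt;p&gt;For anyone who wants to argue with the &lt;code&gt;A1_unpinned_deps&lt;/code&gt; threshold, here is how I would write the "&amp;gt;5 deps AND &amp;gt;70% caret" rule as a standalone Python function. It is a sketch, not the scanner's code: the function name and parameters are hypothetical, and it only looks at a &lt;code&gt;dependencies&lt;/code&gt; mapping like the one in &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;

```python
def a1_unpinned_deps_fires(dependencies, min_deps=5, caret_share=0.70):
    """Fire only when there are more than min_deps deps AND more than caret_share use ^ ranges."""
    total = len(dependencies)
    if total == 0:
        return False
    caret = sum(1 for spec in dependencies.values() if spec.strip().startswith("^"))
    return total > min_deps and caret / total > caret_share
```

&lt;p&gt;Both comparisons are strict, so a package with exactly 5 dependencies, or exactly 70% caret ranges, does not trip the signal.&lt;/p&gt;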

&lt;p&gt;If you spot a false positive when you run it on your packages, please open an issue with the package name + which signal you think is misfiring. The faster the rubric matures, the more useful this is for everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/@weiseer/mcp-doctor" rel="noopener noreferrer"&gt;&lt;code&gt;@weiseer/mcp-doctor&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/weiseer/mcp-doctor" rel="noopener noreferrer"&gt;github.com/weiseer/mcp-doctor&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Public API: &lt;a href="https://api.weiseer.com" rel="noopener noreferrer"&gt;api.weiseer.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Postmortem repo (story of the build): &lt;a href="https://github.com/weiseer/launch-postmortem" rel="noopener noreferrer"&gt;github.com/weiseer/launch-postmortem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Author: &lt;a href="mailto:wei@weiseer.com"&gt;wei@weiseer.com&lt;/a&gt; · &lt;a href="https://github.com/weiseer" rel="noopener noreferrer"&gt;github.com/weiseer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache-2.0.&lt;/p&gt;





</description>
      <category>mcp</category>
      <category>security</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to test whether your AI agent calls the right tool (instead of hallucinating)</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Thu, 28 May 2026 10:17:14 +0000</pubDate>
      <link>https://dev.to/weiseer/how-to-test-whether-your-ai-agent-calls-the-right-tool-instead-of-hallucinating-bc0</link>
      <guid>https://dev.to/weiseer/how-to-test-whether-your-ai-agent-calls-the-right-tool-instead-of-hallucinating-bc0</guid>
      <description>&lt;p&gt;Your agent has 12 tools registered. You ask it to look up a customer's order status. It calls &lt;code&gt;search_knowledge_base&lt;/code&gt; instead of &lt;code&gt;get_order_status&lt;/code&gt;. No error is thrown — the agent returns a plausible-sounding text response. You might ship it without realizing the mistake.&lt;/p&gt;

&lt;p&gt;This is one of the most common silent failure modes in tool-using agents, and many teams lack a systematic way to catch it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Tool Selection Fails
&lt;/h2&gt;

&lt;p&gt;LLMs don't "know" which tool to call — they predict the most likely next token given the prompt. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous tool descriptions&lt;/strong&gt; → wrong tool selected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too many tools&lt;/strong&gt; → model picks the nearest semantic match&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt drift&lt;/strong&gt; (system prompt changes) → previously correct selections break&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model version updates&lt;/strong&gt; → behavior shifts silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't catch this with unit tests on your tool implementations. You need eval cases that assert &lt;em&gt;which tool was invoked&lt;/em&gt;, not just whether the final answer looks okay.&lt;/p&gt;
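
&lt;p&gt;As a minimal sketch of that difference, assume a hypothetical &lt;code&gt;AgentResult&lt;/code&gt; record that captures both the tool the agent invoked and the text it returned. Neither the class nor the two check functions come from a real SDK; they only illustrate why a check on the final answer can pass while a check on the invoked tool fails.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentResult:
    """Hypothetical record of one agent turn (illustrative only)."""
    tool_name: Optional[str]  # name of the tool the agent invoked, if any
    final_answer: str         # text shown to the user


def answer_looks_ok(result):
    # Weak check: only inspects the final text for a plausible keyword.
    return "order" in result.final_answer.lower()


def called_expected_tool(result, expected_tool):
    # Strong check: asserts which tool was actually invoked.
    return result.tool_name == expected_tool


# The agent picked the wrong tool but still produced a plausible answer.
wrong_tool = AgentResult(
    tool_name="search_knowledge_base",
    final_answer="Your order should arrive within 5 business days.",
)

print(answer_looks_ok(wrong_tool))                           # True
print(called_expected_tool(wrong_tool, "get_order_status"))  # False
```

&lt;p&gt;The text-only check passes on a wrong tool call, which is exactly the failure a tool-selection eval is meant to surface.&lt;/p&gt;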




&lt;h2&gt;
  
  
  The Test Structure You Need
&lt;/h2&gt;

&lt;p&gt;Each test case needs three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt; — the user message (and optionally conversation history)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected tool call&lt;/strong&gt; — name + key arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass condition&lt;/strong&gt; — exact match, partial match, or "must not call X"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a concrete YAML format that works well for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tool_selection_tests.yaml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_status_lookup&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;call&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;get_order_status,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base"&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;user_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Where&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;#ORD-9921?"&lt;/span&gt;
  &lt;span class="na"&gt;expected_tool_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_order_status&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-9921"&lt;/span&gt;
  &lt;span class="na"&gt;match_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exact_name_partial_args&lt;/span&gt;
  &lt;span class="na"&gt;must_not_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;get_product_info&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refund_eligibility_check&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;check_refund_policy,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;create_ticket"&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;user_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;placed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;40&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ago?"&lt;/span&gt;
  &lt;span class="na"&gt;expected_tool_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check_refund_policy&lt;/span&gt;
  &lt;span class="na"&gt;match_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exact_name_only&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ambiguous_product_question&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;General&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base"&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;user_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;me&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;about&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;return&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policy"&lt;/span&gt;
  &lt;span class="na"&gt;expected_tool_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;
  &lt;span class="na"&gt;match_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exact_name_only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
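
&lt;p&gt;Before sending cases to a model, it can help to reject malformed ones early. The following sketch assumes each case has already been parsed from YAML into a plain dict with the keys shown above; &lt;code&gt;validate_case&lt;/code&gt; is a hypothetical helper, not part of the harness.&lt;/p&gt;

```python
def validate_case(case):
    """Return a list of problems in one parsed test case (empty if valid)."""
    problems = []

    if not case.get("id"):
        problems.append("missing id")

    # Input: the user message is required.
    user_message = (case.get("input") or {}).get("user_message")
    if not user_message:
        problems.append("missing input.user_message")

    # Expected tool call: at minimum a tool name is required.
    expected_name = (case.get("expected_tool_call") or {}).get("name")
    if not expected_name:
        problems.append("missing expected_tool_call.name")

    # A tool cannot be both expected and forbidden in the same case.
    forbidden = case.get("must_not_call") or []
    if expected_name and expected_name in forbidden:
        problems.append("expected tool also listed in must_not_call")

    return problems


good = {
    "id": "order_status_lookup",
    "input": {"user_message": "Where is my order #ORD-9921?"},
    "expected_tool_call": {"name": "get_order_status"},
    "must_not_call": ["search_knowledge_base"],
}
bad = {
    "id": "broken_case",
    "input": {},
    "expected_tool_call": {"name": "get_order_status"},
    "must_not_call": ["get_order_status"],
}

print(validate_case(good))  # []
print(validate_case(bad))
```

&lt;p&gt;Returning a list of problems rather than raising on the first one makes it easier to report every broken case in a single run.&lt;/p&gt;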



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on &lt;code&gt;match_mode&lt;/code&gt;:&lt;/strong&gt; The harness below supports two modes: &lt;code&gt;exact_name_only&lt;/code&gt; (only the tool name is compared) and &lt;code&gt;exact_name_partial_args&lt;/code&gt; (the tool name plus every argument listed in the case). Any other value logs a warning and is treated as &lt;code&gt;exact_name_only&lt;/code&gt;, so a typo in the mode never goes unnoticed.&lt;/p&gt;
&lt;/blockquote&gt;
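
&lt;p&gt;The two modes can be sketched in isolation. This is an illustration under stated assumptions: &lt;code&gt;resolve_match_mode&lt;/code&gt; and &lt;code&gt;partial_args_match&lt;/code&gt; are hypothetical helper names, and "partial" is taken to mean that every expected argument must be present with an equal value while extra arguments from the model are ignored.&lt;/p&gt;

```python
import logging

KNOWN_MATCH_MODES = {"exact_name_only", "exact_name_partial_args"}
logger = logging.getLogger("tool_selection_eval")


def resolve_match_mode(raw_mode):
    """Return a supported mode, falling back to exact_name_only with a warning."""
    if raw_mode in KNOWN_MATCH_MODES:
        return raw_mode
    logger.warning("unrecognized match_mode %r, using exact_name_only", raw_mode)
    return "exact_name_only"


def partial_args_match(expected_args, actual_args):
    """True when every expected key appears in actual_args with an equal value.

    Extra keys in actual_args are ignored (assumed meaning of "partial").
    """
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    )


print(resolve_match_mode("exact_name_partial_args"))  # exact_name_partial_args
print(resolve_match_mode("fuzzy"))                    # exact_name_only
print(partial_args_match({"order_id": "ORD-9921"},
                         {"order_id": "ORD-9921", "locale": "en"}))  # True
print(partial_args_match({"order_id": "ORD-9921"},
                         {"order_id": "ORD-0000"}))                  # False
```

&lt;p&gt;Checking key presence explicitly avoids treating a missing argument as equal to an expected value of &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;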




&lt;h2&gt;
  
  
  Running This Against a Real Agent
&lt;/h2&gt;

&lt;p&gt;Here's a minimal Python harness using OpenAI function calling. Two important notes before you use this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Include your real system prompt.&lt;/strong&gt; The &lt;code&gt;run_agent_get_tool_call&lt;/code&gt; function uses a placeholder — your evals must use the same system prompt your production agent uses, otherwise you're not testing real behavior.&lt;/li&gt;
&lt;li&gt;
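&lt;strong&gt;Add retries and error handling&lt;/strong&gt; before running this in CI. API calls can fail transiently (rate limits, timeouts), and a single flaky request should not fail the whole suite.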
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up order status by order ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search support articles and FAQs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_refund_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check refund eligibility&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_since_purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_since_purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a customer support agent. Use the available tools to answer questions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;KNOWN_MATCH_MODES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_name_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_name_partial_args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent_get_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Only the first tool call is inspected; any parallel calls after it are ignored.&lt;/span&gt;
        &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent_get_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;must_not&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_not_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="n"&gt;match_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_name_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match_mode&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;KNOWN_MATCH_MODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: unrecognized match_mode &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match_mode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, defaulting to exact_name_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;match_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_name_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Check forbidden tools
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_not&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: called forbidden tool &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Check expected tool name
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;actual_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: expected &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, got &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Partial argument check
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match_mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_name_partial_args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: arg &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; expected &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, got &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; passed, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed out of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; cases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_selection_tests.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
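    &lt;span class="c1"&gt;# Exit 1 on any failure: a raw failure count wraps modulo 256 as an exit status
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;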
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this against the three YAML cases above produces output like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PASS [order_status_lookup]
PASS [refund_eligibility_check]
PASS [ambiguous_product_question]

3 passed, 0 failed out of 3 cases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any failure prints the exact mismatch — wrong tool name, forbidden tool called, or argument value off — so you know immediately what to fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do When a Case Fails
&lt;/h2&gt;

&lt;p&gt;A failing case tells you one of three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool description is ambiguous&lt;/strong&gt; — the model couldn't distinguish it from a semantically similar tool. Rewrite the description to be more specific about when &lt;em&gt;not&lt;/em&gt; to use it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System prompt is overriding tool selection&lt;/strong&gt; — instructions like "always search before responding" can override correct routing. Audit your prompt for implicit biases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model genuinely can't extract the argument&lt;/strong&gt; — for cases like &lt;code&gt;days_since_purchase&lt;/code&gt; from natural language, consider whether your tool signature is realistic, or whether a pre-processing step should handle extraction before the tool call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test output gives you a precise failure signal. The fix almost always lives in your tool descriptions or system prompt — not in your tool implementations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scaling This Up
&lt;/h2&gt;

&lt;p&gt;Three cases won't cover a real agent. A production-grade eval suite for a customer support agent typically needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Happy path cases&lt;/strong&gt; for every tool (correct routing with clean input)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial cases&lt;/strong&gt; — inputs designed to trigger the wrong tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary cases&lt;/strong&gt; — ambiguous phrasing where the correct tool is non-obvious&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Must-not-call cases&lt;/strong&gt; — sensitive tools (e.g., &lt;code&gt;cancel_order&lt;/code&gt;) that should never fire on ambiguous input&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's usually 20–30 cases minimum before you have meaningful coverage. The structure stays identical — more YAML entries, same harness.&lt;/p&gt;
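&lt;p&gt;One way to tell whether a growing suite really covers every tool is to count cases per expected tool name. The sketch below is only an illustration: &lt;code&gt;coverage_by_tool&lt;/code&gt;, the &lt;code&gt;expected&lt;/code&gt; key and the tool name &lt;code&gt;get_order_status&lt;/code&gt; are assumptions, so adapt them to whatever field names your YAML cases use.&lt;/p&gt;

```python
from collections import Counter


def coverage_by_tool(cases, tool_names, expected_key="expected"):
    """Count cases per expected tool and list tools that have no case at all.

    `expected_key` is a hypothetical field name; cases without an expected
    call (for example must-not-call cases) are skipped in the count.
    """
    counts = Counter(
        case[expected_key]["name"] for case in cases if case.get(expected_key)
    )
    missing = sorted(set(tool_names) - set(counts))
    return dict(counts), missing


if __name__ == "__main__":
    demo_cases = [
        {"id": "a", "expected": {"name": "get_order_status"}},
        {"id": "b", "expected": {"name": "get_order_status"}},
        {"id": "c", "expected": None},
    ]
    counts, missing = coverage_by_tool(
        demo_cases, ["get_order_status", "cancel_order"]
    )
    print(counts, missing)
```

&lt;p&gt;Any tool that shows up in the missing list has no happy-path case yet, which is a cheap signal for where to write the next YAML entries.&lt;/p&gt;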




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Silent tool misrouting is one of the hardest agent bugs to catch because it produces no exceptions and often generates plausible-looking output. The fix is straightforward: define expected tool calls as structured test cases, run them against your real agent on every prompt or model change, and treat failures as regressions. The harness above is the minimum viable version — extend it with your actual tools, your real system prompt, and enough cases to cover the failure modes that matter for your use case.&lt;/p&gt;




&lt;p&gt;Free 5-case starter pack: &lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt; · Full 23-case pack: &lt;a href="https://gumroad.com/l/dcipxt" rel="noopener noreferrer"&gt;gumroad.com/l/dcipxt&lt;/a&gt; · New cases by email: &lt;a href="https://dl.weiseer.com/cases" rel="noopener noreferrer"&gt;dl.weiseer.com/cases&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Tool selection failures are fundamentally a specification problem: the model can only route correctly if your tool descriptions unambiguously encode when each tool should and shouldn't be used. Building a structured eval suite forces you to make those boundaries explicit — and running it on every model or prompt change turns what was an invisible regression risk into a measurable, fixable signal.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Test your agent for failures like this in CI — free, deterministic, no LLM-as-judge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free 5-case starter (MIT): &lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Failure-mode guides (how to test each): &lt;a href="https://guides.weiseer.com/" rel="noopener noreferrer"&gt;https://guides.weiseer.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Get new cases + the 6-dimension cheatsheet (free): &lt;a href="https://dl.weiseer.com/cases" rel="noopener noreferrer"&gt;https://dl.weiseer.com/cases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full 28-case OWASP-Agentic pack: &lt;a href="https://weiseer.gumroad.com/l/dcipxt" rel="noopener noreferrer"&gt;https://weiseer.gumroad.com/l/dcipxt&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>testing</category>
      <category>python</category>
    </item>
    <item>
      <title>Dogfooding an LLM agent eval pack on my own production agent — what 6-dim methodology surfaced</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Wed, 27 May 2026 07:05:38 +0000</pubDate>
      <link>https://dev.to/weiseer/dogfooding-an-llm-agent-eval-pack-on-my-own-production-agent-what-6-dim-methodology-surfaced-2hff</link>
      <guid>https://dev.to/weiseer/dogfooding-an-llm-agent-eval-pack-on-my-own-production-agent-what-6-dim-methodology-surfaced-2hff</guid>
      <description>&lt;p&gt;I built a 20-case YAML eval pack for tool-using AI agents (the kind that call APIs / tools to do work). To test whether the methodology actually catches real failure modes, I applied it to my own production LLM-driven agent — one I've been running for months and had already documented 15+ failure modes for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ~80% of the eval pack's surface area was already covered by my agent's existing defenses. That validated the 6-dimension cut. &lt;strong&gt;5 gaps surfaced&lt;/strong&gt; that my agent's own failure-mode documentation didn't catalogue — 3 of them serious enough to add as v1.1 cases.&lt;/p&gt;

&lt;p&gt;This post is about those gaps. They're worth knowing if you're building an LLM-driven agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the pack is
&lt;/h2&gt;

&lt;p&gt;Briefly, the pack is 20 YAML test cases across 6 dimensions: accuracy, safety, edge cases, prompt injection, hallucination, and cost efficiency. Each case is a YAML file describing a failure mode, the expected agent behavior, and deterministic evaluation rules (no LLM judge, so you can run them without paying for an external "judge model").&lt;/p&gt;

&lt;p&gt;Free 5-case starter on GitHub:&lt;br&gt;
&lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paid 20-case pack:&lt;br&gt;
&lt;a href="https://weiseer.gumroad.com/l/dcipxt" rel="noopener noreferrer"&gt;https://weiseer.gumroad.com/l/dcipxt&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What it means to "dogfood" against an existing agent
&lt;/h2&gt;

&lt;p&gt;My agent is an LLM-driven generator embedded in a larger quantitative system. The LLM proposes candidates; downstream deterministic code validates and acts on them. The agent isn't generic chat — it's tool-using in the structural sense (typed schema in/out, downstream consumers).&lt;/p&gt;

&lt;p&gt;I applied the 6-dimension methodology to this LLM subsystem as a manual walkthrough backed by code review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walked through each of the 26 audit questions (4-6 per dimension)&lt;/li&gt;
&lt;li&gt;Cited the file/line where defense exists, OR flagged "no defense visible"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After ~45 minutes of disciplined read-only review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;21 of 26 questions: existing defense ✓&lt;/li&gt;
&lt;li&gt;5 questions: gap of some severity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 5 gaps (severity-ordered)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1 (MEDIUM) — LLM cost cap was logged, not enforced
&lt;/h3&gt;

&lt;p&gt;I had a $X/day cap on the LLM subsystem in my design docs. The code path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logged every API call's cost to a per-cycle audit YAML file&lt;/li&gt;
&lt;li&gt;Did NOT check cumulative spend before the next call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if anything misbehaved (large response, retry loop, prompt cache miss across a batch), the daily total would silently overshoot. Detection would happen the next morning during log review — which is "fast" for governance, but slow for damage containment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The eval pack's "detection-quality" axis explicitly tests for this&lt;/strong&gt;: the system must catch a fault faster than the fault spreads. Logging-but-not-enforcing fails that axis.&lt;/p&gt;

&lt;p&gt;Lesson generalized: if your spec says "stay under $X", write the code that says &lt;code&gt;if today_spend &amp;gt;= X: abort()&lt;/code&gt;, not just the code that says &lt;code&gt;log(today_spend)&lt;/code&gt;. The eval methodology made me notice the gap.&lt;/p&gt;
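&lt;p&gt;A minimal sketch of what "enforced, not just logged" could look like. Everything here is illustrative rather than my production code: &lt;code&gt;DailyBudget&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt; and &lt;code&gt;guarded_call&lt;/code&gt; are hypothetical names, and the wrapped call is assumed to return a &lt;code&gt;(result, cost_usd)&lt;/code&gt; pair.&lt;/p&gt;

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a call is attempted after the daily cap has been reached."""


@dataclass
class DailyBudget:
    cap_usd: float          # hypothetical daily cap (the "$X" from the spec)
    spent_usd: float = 0.0  # cumulative spend for the current day

    def check(self) -> None:
        # Enforce BEFORE the call: abort once cumulative spend reaches the cap.
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(
                f"daily spend {self.spent_usd:.2f} reached cap {self.cap_usd:.2f}"
            )

    def record(self, cost_usd: float) -> None:
        # Recording is the "logging" half; on its own it enforces nothing.
        self.spent_usd += cost_usd


def guarded_call(budget: DailyBudget, call, *args, **kwargs):
    """Run `call` only if the budget allows it.

    Assumption: `call` returns a (result, cost_usd) tuple.
    """
    budget.check()
    result, cost_usd = call(*args, **kwargs)
    budget.record(cost_usd)
    return result
```

&lt;p&gt;Note that this check can still overshoot by the cost of one call, because the cost is only known afterwards. If that matters, check against an estimated cost before the call instead.&lt;/p&gt;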

&lt;h3&gt;
  
  
  Gap 2 (MEDIUM) — Predicted vs actual self-assessment drift wasn't tracked
&lt;/h3&gt;

&lt;p&gt;My agent emits self-assessments along with its proposals — predicted success score, expected outcome quality. Downstream validation produces actual measurements. So far so good: prediction vs ground truth, well-separated.&lt;/p&gt;

&lt;p&gt;What I didn't have: &lt;strong&gt;monitoring of the DELTA between predicted and actual over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the LLM systematically over-claims by 30% across 100 proposals, no single proposal triggers an alert, because each one passes downstream validation independently. The drift between LLM prediction and ground truth therefore stays invisible, and the LLM's predictions silently lose calibration.&lt;/p&gt;

&lt;p&gt;The fix is meta-monitoring: track the rolling delta. If the 30-day moving mean of (predicted - actual) starts climbing, the model needs a reset, a re-prompt, or an explicit calibration constraint.&lt;/p&gt;
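&lt;p&gt;Here is a rough sketch of that rolling delta, under stated assumptions: &lt;code&gt;CalibrationMonitor&lt;/code&gt; is a hypothetical name, the window counts proposals rather than calendar days, and the threshold value is a placeholder rather than a tuned number.&lt;/p&gt;

```python
from collections import deque
from statistics import fmean


class CalibrationMonitor:
    """Tracks the rolling mean of (predicted - actual) over recent proposals."""

    def __init__(self, window: int = 100, threshold: float = 0.1):
        # `window` and `threshold` are illustrative defaults, not tuned values.
        self.deltas = deque(maxlen=window)  # old entries drop off automatically
        self.threshold = threshold

    def record(self, predicted: float, actual: float) -> None:
        self.deltas.append(predicted - actual)

    def mean_delta(self) -> float:
        return fmean(self.deltas) if self.deltas else 0.0

    def is_drifting(self) -> bool:
        # A positive mean delta means the model over-claims on average.
        return self.mean_delta() > self.threshold
```

&lt;p&gt;Call &lt;code&gt;record&lt;/code&gt; once per validated proposal and alert when &lt;code&gt;is_drifting&lt;/code&gt; turns true. A time-based window would need timestamps stored alongside each delta.&lt;/p&gt;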

&lt;h3&gt;
  
  
  Gap 3 (MEDIUM) — Parallel workers without pre-call diversity planning
&lt;/h3&gt;

&lt;p&gt;My agent dispatches multiple LLM workers in parallel (one "seed" generator + several "variant" generators), each with the same prompt. I had a POST-call diversity gate: compute set distance between worker outputs, reject too-similar candidates.&lt;/p&gt;

&lt;p&gt;But the diversity gate runs AFTER all workers have completed. If they converge, I've paid N× the cost for ~1 unique result.&lt;/p&gt;

&lt;p&gt;The fix is pre-call diversity planning: explicitly assign each worker an anchor before it fires (worker_1 → category A, worker_2 → category B, ...). This forces structural diversity instead of relying on luck.&lt;/p&gt;
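&lt;p&gt;Anchor assignment can be as small as a round-robin mapping done before any worker is dispatched. In this sketch &lt;code&gt;assign_anchors&lt;/code&gt; and &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical helpers, and the wording of the constraint appended to the prompt is a placeholder.&lt;/p&gt;

```python
from itertools import cycle


def assign_anchors(worker_ids, categories):
    """Give each worker a category before any call fires (round-robin)."""
    if not categories:
        raise ValueError("need at least one category")
    # zip stops at the end of worker_ids; cycle repeats categories as needed.
    return dict(zip(worker_ids, cycle(categories)))


def build_prompt(base_prompt, anchor):
    # Append the anchor as an explicit constraint; the phrasing is illustrative.
    return f"{base_prompt}\n\nFocus only on candidates in this category: {anchor}"


if __name__ == "__main__":
    plan = assign_anchors(["worker_1", "worker_2", "worker_3"], ["A", "B"])
    for worker, anchor in plan.items():
        print(worker, "gets", repr(build_prompt("Propose a candidate.", anchor)))
```

&lt;p&gt;The post-call diversity gate still has value as a backstop, but with anchors in place it should reject far fewer paid-for results.&lt;/p&gt;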

&lt;h3&gt;
  
  
  Gap 4 (LOW) — Full-prompt retry vs corrective retry
&lt;/h3&gt;

&lt;p&gt;When my agent's output fails validation (say, it references a non-existent feature), the retry sends the full original prompt unchanged. With Anthropic prompt caching the input cost is cheap, but the output is fully re-sampled. That is a ~5-10% cost penalty per retry, which could be avoided by including the specific correction in the retry prompt ("you mentioned feature X which doesn't exist; valid features are: ...").&lt;/p&gt;
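&lt;p&gt;A corrective retry might be built like the sketch below. The helper names &lt;code&gt;find_invalid_features&lt;/code&gt; and &lt;code&gt;build_corrective_prompt&lt;/code&gt; are made up for illustration, and the correction text is only an example. Appending the correction at the end keeps the original prompt as an unchanged prefix, which is my assumption about what keeps a prefix-based cache useful.&lt;/p&gt;

```python
def find_invalid_features(referenced, valid_features):
    """Return the referenced features that are not in the known-good set."""
    return set(referenced) - set(valid_features)


def build_corrective_prompt(original_prompt, invalid_features, valid_features):
    """Append a targeted correction instead of resending the prompt unchanged."""
    invalid = ", ".join(sorted(invalid_features))
    valid = ", ".join(sorted(valid_features))
    correction = (
        f"Your previous answer mentioned features that do not exist: {invalid}. "
        f"Valid features are: {valid}. Revise the answer using only valid features."
    )
    # The original prompt stays first so it remains an unchanged prefix.
    return f"{original_prompt}\n\n{correction}"


if __name__ == "__main__":
    bad = find_invalid_features(["rsi", "foo"], ["rsi", "macd"])
    print(build_corrective_prompt("BASE PROMPT", bad, ["rsi", "macd"]))
```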

&lt;h3&gt;
  
  
  Gap 5 (ADVISORY) — Scope adherence via prompt text only
&lt;/h3&gt;

&lt;p&gt;My system prompt instructs the LLM to span certain conceptual zones. There's no programmatic check that the actual outputs distribute across those zones. Downstream validators catch many ways this can go wrong, but not pattern drift across cycles.&lt;/p&gt;
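&lt;p&gt;A programmatic check could be as simple as measuring each zone's share of outputs per cycle. This is a sketch under assumptions: &lt;code&gt;zone_coverage&lt;/code&gt; and &lt;code&gt;underrepresented&lt;/code&gt; are hypothetical names, the &lt;code&gt;classify&lt;/code&gt; callable that maps an output to a zone is something you would have to supply, and &lt;code&gt;min_share&lt;/code&gt; is an arbitrary floor.&lt;/p&gt;

```python
from collections import Counter


def zone_coverage(outputs, zones, classify):
    """Return each zone's share of outputs.

    `classify` is a caller-supplied function mapping one output to a zone name.
    """
    counts = Counter(classify(o) for o in outputs)
    total = len(outputs)
    return {z: (counts[z] / total if total else 0.0) for z in zones}


def underrepresented(coverage, min_share):
    """List zones whose share falls under `min_share` (an illustrative floor)."""
    return sorted(z for z, share in coverage.items() if min_share > share)


if __name__ == "__main__":
    outs = [{"zone": "A"}, {"zone": "A"}, {"zone": "A"}, {"zone": "B"}]
    cov = zone_coverage(outs, ["A", "B", "C"], lambda o: o["zone"])
    print(cov, underrepresented(cov, 0.2))
```

&lt;p&gt;Comparing these shares across cycles is what would expose pattern drift that per-output validators cannot see.&lt;/p&gt;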




&lt;h2&gt;
  
  
  What the gaps have in common
&lt;/h2&gt;

&lt;p&gt;All 5 gaps are &lt;strong&gt;meta-monitoring&lt;/strong&gt; gaps, not architecture bugs. The agent's individual components do their jobs correctly. What was missing: cross-call patterns, cross-time drift, cumulative-cost tracking — the layer above the individual call.&lt;/p&gt;

&lt;p&gt;This generalizes: &lt;strong&gt;LLM-system reliability is built bottom-up (per-call correctness) but the failures that bite production are top-down (cumulative drift / cumulative cost / cumulative diversity loss).&lt;/strong&gt; Most engineers (myself included) build the bottom layer first. The eval pack methodology pulled my attention to the top layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this validates the eval pack's framework rather than undermining it
&lt;/h2&gt;

&lt;p&gt;It's tempting to read "80% already covered" as "the pack didn't help much." That's the wrong frame. The right frame:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 6 dimensions are the right cuts. A mature engineer building an LLM agent will hit the 6 cuts independently.&lt;/li&gt;
&lt;li&gt;The pack codifies those cuts. New builders don't have to rediscover them.&lt;/li&gt;
&lt;li&gt;The methodology surfaces blind spots even in agents whose builders already think carefully about failure modes. Anyone who built an LLM agent without hitting at least one of these gaps either:

&lt;ul&gt;
&lt;li&gt;Got lucky&lt;/li&gt;
&lt;li&gt;Hasn't been in production long enough yet&lt;/li&gt;
&lt;li&gt;Or built something simpler than what they think they built&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The pack's value proposition is: 10-30 hours of disciplined failure-mode thinking compressed into 20 YAML files that you can read in an hour and apply to your own agent with about 3 lines of glue code per case.&lt;/p&gt;

&lt;p&gt;If you build LLM agents and want to compress your "production hardening" timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Free 5-case starter (CC BY 4.0): &lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full 23-case pack: weiseer.gumroad.com/l/dcipxt (launch week: code LAUNCH7 → $29)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Payment from within China: dl.weiseer.com/pay&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1 cases covering the 3 serious gaps above are queued for the next release.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built solo. 7-day refund, no questions asked. If you've built an agent and want to compare your defenses against this list, reply or DM with the failure mode you'd add as case #21.&lt;/em&gt;&lt;/p&gt;





</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>evaluation</category>
    </item>
    <item>
      <title>I tested my AI product tester on 3 real SaaS products. Every persona said no.</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Mon, 18 May 2026 04:20:20 +0000</pubDate>
      <link>https://dev.to/weiseer/i-tested-my-ai-product-tester-on-3-real-saas-products-every-persona-said-no-26ci</link>
      <guid>https://dev.to/weiseer/i-tested-my-ai-product-tester-on-3-real-saas-products-every-persona-said-no-26ci</guid>
      <description>&lt;p&gt;Two months ago I was about to ship a crypto signal product. It "worked technically" but I had zero&lt;br&gt;
  signal on whether anyone would subscribe.&lt;/p&gt;

&lt;p&gt;So I wrote 12 fictional user personas as markdown files (a burnt veteran trader, a hostile compliance officer, a YC partner, a noise-allergic fund manager) and built a Python harness that fed each one my actual product transcripts and asked: "what would you actually do?"&lt;/p&gt;

&lt;p&gt;The answers were brutally helpful. They killed features I'd spent weeks on. I open-sourced the harness as &lt;strong&gt;personalab&lt;/strong&gt; (MIT).&lt;/p&gt;

&lt;h2&gt;
  
  
  Then I pointed it at 3 real products
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. personalab itself&lt;/strong&gt;: yes, I tested my own tool with my own tool. 0/8 simulated B2B SaaS buyers said they'd pay $99/mo. The case study became my own roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. PostHog&lt;/strong&gt;: 6/12 personas said "yes I'd pay" after reading a 7-day product transcript. The same 12 personas over a 5-day agentic simulation: &lt;strong&gt;0/12 sustained&lt;/strong&gt;. The "yes" was first-impression optimism; the "no" was multi-day reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cal.com&lt;/strong&gt;: 8/12 yes at $5-20/mo. And here's the gold: 75% of complaints converged on ONE thing, namely that the free-plan "Powered by Cal.com" branding makes recipients suspect spam. 8 distinct personas independently nailed the same conversion lever.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pattern emerges
&lt;/h2&gt;

&lt;p&gt;After 3 case studies, the number of &lt;em&gt;dominant friction clusters&lt;/em&gt; in a personalab run seems to correlate with PMF stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-PMF&lt;/strong&gt;: 4-5 diffuse complaints (my own tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-funnel&lt;/strong&gt;: 5 distinct friction clusters (PostHog: price / learning / UI / compliance / privacy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late-funnel&lt;/strong&gt;: 1-2 clean conversion levers (Cal.com: branding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this holds in case study #4+, personalab becomes a &lt;strong&gt;free PMF-stage diagnostic from a $1 LLM run&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest disclaimer
&lt;/h2&gt;

&lt;p&gt;The default personas accidentally encoded personalab-specific preferences, so some quotes leak when reused on other products. I kept the bug in the case study writeup rather than rerunning with clean data, because it surfaces persona design as a real engineering concern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/g16253470-beep/personalab
cd personalab &amp;amp;&amp;amp; pip install -e .
personalab run --mode static --personas ./personas --adapter your_adapter --llm gemini:gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;40-line adapter, 12 default personas, MIT licensed.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/g16253470-beep/personalab" rel="noopener noreferrer"&gt;https://github.com/g16253470-beep/personalab&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two questions for DEV
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What product would you point this at first?&lt;/li&gt;
&lt;li&gt;Real PMF business or just an OSS curiosity?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tell me where this falls apart; that's the next case study.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>python</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
