<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: weiseer</title>
    <description>The latest articles on DEV Community by weiseer (@weiseer).</description>
    <link>https://dev.to/weiseer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3937149%2F05c9f686-d8e1-4bd8-8427-8e4d2a6966c0.png</url>
      <title>DEV Community: weiseer</title>
      <link>https://dev.to/weiseer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/weiseer"/>
    <language>en</language>
    <item>
      <title>Dogfooding an LLM agent eval pack on my own production agent — what 6-dim methodology surfaced</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Wed, 27 May 2026 07:05:38 +0000</pubDate>
      <link>https://dev.to/weiseer/dogfooding-an-llm-agent-eval-pack-on-my-own-production-agent-what-6-dim-methodology-surfaced-2hff</link>
      <guid>https://dev.to/weiseer/dogfooding-an-llm-agent-eval-pack-on-my-own-production-agent-what-6-dim-methodology-surfaced-2hff</guid>
      <description>&lt;p&gt;I built a 20-case YAML eval pack for tool-using AI agents (the kind that call APIs / tools to do work). To test whether the methodology actually catches real failure modes, I applied it to my own production LLM-driven agent — one I've been running for months and had already documented 15+ failure modes for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: ~80% of the eval pack's surface area was already covered by my agent's existing defenses. That validated the 6-dimension cut. &lt;strong&gt;5 gaps surfaced&lt;/strong&gt; that my agent's own failure-mode documentation didn't catalogue — 3 of them serious enough to add as v1.1 cases.&lt;/p&gt;

&lt;p&gt;This post is about those gaps. They're worth knowing if you're building an LLM-driven agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the pack is
&lt;/h2&gt;

&lt;p&gt;Briefly: 20 YAML test cases across 6 dimensions (accuracy, safety, edge cases, prompt injection, hallucination, cost efficiency). Each case is a YAML file describing a failure mode, the expected agent behavior, and deterministic evaluation rules. There is no LLM judge, so you can run the cases without paying for an external "judge model".&lt;/p&gt;

&lt;p&gt;Free 5-case starter on GitHub:&lt;br&gt;
&lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paid 20-case pack:&lt;br&gt;
&lt;a href="https://weiseer.gumroad.com/l/dcipxt" rel="noopener noreferrer"&gt;https://weiseer.gumroad.com/l/dcipxt&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What it means to "dogfood" against an existing agent
&lt;/h2&gt;

&lt;p&gt;My agent is an LLM-driven generator embedded in a larger quantitative system. The LLM proposes candidates; downstream deterministic code validates and acts on them. The agent isn't generic chat — it's tool-using in the structural sense (typed schema in/out, downstream consumers).&lt;/p&gt;

&lt;p&gt;I applied the 6-dimension methodology to this LLM subsystem through a mental walkthrough plus code review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walked through each of the 26 audit questions (4-6 per dimension)&lt;/li&gt;
&lt;li&gt;Cited the file/line where defense exists, OR flagged "no defense visible"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After ~45 minutes of disciplined read-only review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;21 of 26 questions: existing defense ✓&lt;/li&gt;
&lt;li&gt;5 questions: gap of some severity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 5 gaps (severity-ordered)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1 (MEDIUM) — LLM cost cap was logged, not enforced
&lt;/h3&gt;

&lt;p&gt;I had a $X/day cap on the LLM subsystem in my design docs. The code path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logged every API call's cost to a per-cycle audit YAML file&lt;/li&gt;
&lt;li&gt;Did NOT check cumulative spend before the next call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if anything misbehaved (large response, retry loop, prompt cache miss across a batch), the daily total would silently overshoot. Detection would happen the next morning during log review — which is "fast" for governance, but slow for damage containment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The eval pack's "detection-quality" axis explicitly tests for this&lt;/strong&gt;: the system must catch a fault faster than the fault spreads. Logging-but-not-enforcing fails that axis.&lt;/p&gt;

&lt;p&gt;Lesson generalized: if your spec says "stay under $X", write the code that says &lt;code&gt;if today_spend &amp;gt;= X: abort()&lt;/code&gt;, not just the code that says &lt;code&gt;log(today_spend)&lt;/code&gt;. The eval methodology made me notice the gap.&lt;/p&gt;
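
&lt;p&gt;A minimal sketch of that lesson, under stated assumptions: &lt;code&gt;DailyBudget&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, and &lt;code&gt;guarded_call&lt;/code&gt; are hypothetical names (not part of the eval pack or my agent), the caller supplies a per-call cost estimate, and resetting the counter at the day boundary is left out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spend past the daily cap."""


class DailyBudget:
    """Hypothetical in-memory spend tracker for one day."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_cost_usd):
        # Enforce BEFORE the call: refuse if the projected total passes the cap.
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.cap_usd:
            raise BudgetExceeded(
                f"projected {projected:.2f} USD exceeds cap {self.cap_usd:.2f} USD"
            )

    def record(self, actual_cost_usd):
        # Keep accumulating the real cost after the call returns.
        self.spent_usd += actual_cost_usd


def guarded_call(budget, call_llm, estimated_cost_usd):
    """Run call_llm only if the budget allows it.

    call_llm is any zero-argument callable returning (result, actual_cost_usd).
    """
    budget.reserve(estimated_cost_usd)
    result, actual_cost_usd = call_llm()
    budget.record(actual_cost_usd)
    return result
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The check runs before the call, so a retry loop stops at the cap instead of showing up in the next morning's log review. Logging can stay exactly as it is; the guard is an addition, not a replacement.&lt;/p&gt;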

&lt;h3&gt;
  
  
  Gap 2 (MEDIUM) — Predicted vs actual self-assessment drift wasn't tracked
&lt;/h3&gt;

&lt;p&gt;My agent emits self-assessments along with its proposals — predicted success score, expected outcome quality. Downstream validation produces actual measurements. So far so good: prediction vs ground truth, well-separated.&lt;/p&gt;

&lt;p&gt;What I didn't have: &lt;strong&gt;monitoring of the DELTA between predicted and actual over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the LLM systematically over-claims by 30% across 100 proposals, no single proposal triggers an alert, because each one passes downstream validation independently. The drift between LLM prediction and ground truth stays invisible, and the LLM's predictions silently lose calibration.&lt;/p&gt;

&lt;p&gt;The fix is meta-monitoring: track the rolling delta. If the 30-day moving mean of (predicted - actual) starts climbing, the model needs a reset, a re-prompt, or an explicit calibration constraint.&lt;/p&gt;
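
&lt;p&gt;A rough sketch of such a monitor, with assumptions called out: &lt;code&gt;CalibrationMonitor&lt;/code&gt; is a hypothetical name, the window is counted in observations instead of days to keep the example short, and the alert threshold is an arbitrary placeholder to tune against your own score scale.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from collections import deque
from statistics import fmean


class CalibrationMonitor:
    """Tracks the rolling mean of (predicted - actual) over recent proposals."""

    def __init__(self, window=100, max_mean_delta=0.1):
        # deque(maxlen=...) drops the oldest delta automatically.
        self.deltas = deque(maxlen=window)
        self.max_mean_delta = max_mean_delta  # assumed threshold

    def observe(self, predicted, actual):
        # Positive delta means the model over-claimed on this proposal.
        self.deltas.append(predicted - actual)

    def mean_delta(self):
        return fmean(self.deltas) if self.deltas else 0.0

    def is_drifting(self):
        # Only systematic over-claiming is flagged in this sketch.
        return self.mean_delta() > self.max_mean_delta
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Call &lt;code&gt;observe&lt;/code&gt; once per proposal after downstream validation produces the actual measurement, and check &lt;code&gt;is_drifting&lt;/code&gt; once per cycle. A signed mean catches one-directional bias; tracking the absolute delta as well would catch noisy predictions that cancel out on average.&lt;/p&gt;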

&lt;h3&gt;
  
  
  Gap 3 (MEDIUM) — Parallel workers without pre-call diversity planning
&lt;/h3&gt;

&lt;p&gt;My agent dispatches multiple LLM workers in parallel (one "seed" generator + several "variant" generators), each with the same prompt. I had a POST-call diversity gate: compute set distance between worker outputs, reject too-similar candidates.&lt;/p&gt;

&lt;p&gt;But the diversity gate runs AFTER all workers have completed. If they converge, I've paid N× the cost for ~1 unique result.&lt;/p&gt;

&lt;p&gt;The fix is pre-call diversity planning: explicitly assign each worker an anchor before it fires (worker_1 → category A, worker_2 → category B, ...). That forces diversity structurally instead of leaving it to luck.&lt;/p&gt;
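
&lt;p&gt;One way the anchor assignment could look, as a sketch only: &lt;code&gt;plan_anchors&lt;/code&gt; and &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical helpers, and the round-robin policy and the "Focus this proposal on" wording are assumptions, not what my agent ships.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from itertools import cycle


def plan_anchors(worker_ids, anchors):
    """Assign each worker one anchor, round-robin, before any LLM call is made."""
    if not anchors:
        raise ValueError("at least one anchor is required")
    # cycle() repeats the anchors when there are more workers than anchors.
    return dict(zip(worker_ids, cycle(anchors)))


def build_prompt(base_prompt, anchor):
    # Every worker shares the base prompt; the appended anchor steers
    # each one toward a different region of the output space.
    return f"{base_prompt}\n\nFocus this proposal on: {anchor}"
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The post-call diversity gate still earns its place as a backstop. The difference is that it should now rarely fire, because convergence is prevented before the spend instead of detected after it.&lt;/p&gt;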

&lt;h3&gt;
  
  
  Gap 4 (LOW) — Full-prompt retry vs corrective retry
&lt;/h3&gt;

&lt;p&gt;When my agent's output fails validation (say, it references a non-existent feature), the retry sends the full original prompt. With Anthropic prompt caching, the input cost is cheap, but the output is fully re-sampled. That is a ~5-10% cost penalty per retry, which could be avoided by including the specific correction in the prompt ("you mentioned feature X which doesn't exist; valid features are: ...").&lt;/p&gt;
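
&lt;p&gt;A small sketch of a corrective retry, assuming validation failures can be expressed as a set of invalid feature names. &lt;code&gt;find_invalid&lt;/code&gt; and &lt;code&gt;corrective_retry_prompt&lt;/code&gt; are hypothetical helpers, and the correction wording is only an example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
def find_invalid(referenced, valid_names):
    # Set difference: anything referenced that is not in the valid set.
    return set(referenced) - set(valid_names)


def corrective_retry_prompt(original_prompt, invalid_names, valid_names):
    """Build a retry prompt that names the exact mistake.

    The original prompt stays as an unchanged prefix; the correction
    is appended after it.
    """
    bad = ", ".join(sorted(invalid_names))
    good = ", ".join(sorted(valid_names))
    correction = (
        f"Your previous answer referenced features that do not exist: {bad}. "
        f"Valid features are: {good}. Use only valid features."
    )
    return f"{original_prompt}\n\n{correction}"
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Appending the correction after the original prompt, instead of editing the prompt itself, is a deliberate choice in this sketch: the shared prefix stays byte-identical across retries.&lt;/p&gt;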

&lt;h3&gt;
  
  
  Gap 5 (ADVISORY) — Scope adherence via prompt text only
&lt;/h3&gt;

&lt;p&gt;My system prompt instructs the LLM to span certain conceptual zones. There's no programmatic check that the actual outputs distribute across those zones. Downstream validators catch many ways this can go wrong, but not pattern drift across cycles.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the gaps have in common
&lt;/h2&gt;

&lt;p&gt;All 5 gaps are &lt;strong&gt;meta-monitoring&lt;/strong&gt; gaps, not architecture bugs. The agent's individual components do their jobs correctly. What was missing: cross-call patterns, cross-time drift, cumulative-cost tracking — the layer above the individual call.&lt;/p&gt;

&lt;p&gt;This generalizes: &lt;strong&gt;LLM-system reliability is built bottom-up (per-call correctness) but the failures that bite production are top-down (cumulative drift / cumulative cost / cumulative diversity loss).&lt;/strong&gt; Most engineers (myself included) build the bottom layer first. The eval pack methodology pulled my attention to the top layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this validates the eval pack's framework, not undermines it
&lt;/h2&gt;

&lt;p&gt;It's tempting to read "80% already covered" as "the pack didn't help much." That's the wrong frame. The right frame:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 6 dimensions are the right cuts. A mature engineer building an LLM agent will hit the 6 cuts independently.&lt;/li&gt;
&lt;li&gt;The pack codifies those cuts. New builders don't have to rediscover them.&lt;/li&gt;
&lt;li&gt;The methodology surfaces blind spots even in agents whose builders already think carefully about failure modes. Anyone who built an LLM agent without hitting at least one of these gaps either:

&lt;ul&gt;
&lt;li&gt;Got lucky&lt;/li&gt;
&lt;li&gt;Hasn't been in production long enough yet&lt;/li&gt;
&lt;li&gt;Or built something simpler than what they think they built&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The pack's value proposition is: 10-30 hours of disciplined failure-mode thinking compressed into 20 YAML files that you can read in an hour and apply to your own agent with about 3 lines of glue code per case.&lt;/p&gt;

&lt;p&gt;If you build LLM agents and want to compress your "production hardening" timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Free 5-case starter (CC BY 4.0): &lt;a href="https://github.com/weiseer/ai-agent-qa-eval-pack-starter" rel="noopener noreferrer"&gt;https://github.com/weiseer/ai-agent-qa-eval-pack-starter&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full 20-case pack: weiseer.gumroad.com/l/dcipxt (launch week: code LAUNCH7 → $29)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Payment within mainland China: dl.weiseer.com/pay&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1 cases adding the 3 gaps above are queued for the next release.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built solo. 7-day refund, no questions asked. If you've built an agent and want to compare your defenses against this list, reply or DM with the failure mode you'd add as case #21.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>evaluation</category>
    </item>
    <item>
      <title>I tested my AI product tester on 3 real SaaS products. Every persona said no.</title>
      <dc:creator>weiseer</dc:creator>
      <pubDate>Mon, 18 May 2026 04:20:20 +0000</pubDate>
      <link>https://dev.to/weiseer/i-tested-my-ai-product-tester-on-3-real-saas-products-every-persona-said-no-26ci</link>
      <guid>https://dev.to/weiseer/i-tested-my-ai-product-tester-on-3-real-saas-products-every-persona-said-no-26ci</guid>
      <description>&lt;p&gt;Two months ago I was about to ship a crypto signal product. It "worked technically" but I had zero&lt;br&gt;
  signal on whether anyone would subscribe.&lt;/p&gt;

&lt;p&gt;So I wrote 12 fictional user personas as markdown files — a burnt veteran trader, a hostile compliance&lt;br&gt;
  officer, a YC partner, a noise-allergic fund manager — and built a Python harness that fed each one my&lt;br&gt;
  actual product transcripts and asked: "what would you actually do?"&lt;/p&gt;

&lt;p&gt;The answers were brutally helpful. They killed features I'd spent weeks on. I open-sourced the harness&lt;br&gt;
  as &lt;strong&gt;personalab&lt;/strong&gt; (MIT).&lt;/p&gt;

&lt;h2&gt;
  Then I pointed it at 3 real products
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. personalab itself&lt;/strong&gt;: yes, I tested my own tool with my own tool. 0/8 simulated B2B SaaS buyers said they'd pay $99/mo. The case study became my own roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. PostHog&lt;/strong&gt;: 6/12 personas said "yes I'd pay" after reading a 7-day product transcript. The same 12 over a 5-day agentic simulation: &lt;strong&gt;0/12 sustained&lt;/strong&gt;. The "yes" was first-impression optimism; the "no" was multi-day reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cal.com&lt;/strong&gt;: 8/12 yes at $5-20/mo. And here's the gold: 75% of complaints converged on ONE thing, namely that the free-plan "Powered by Cal.com" branding makes recipients suspect spam. 8 distinct personas independently nailed the same conversion lever.&lt;/p&gt;

&lt;h2&gt;
  A pattern emerges
&lt;/h2&gt;

&lt;p&gt;After 3 case studies, the number of &lt;em&gt;dominant friction clusters&lt;/em&gt; in a personalab run seems to correlate with PMF stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-PMF&lt;/strong&gt;: 4-5 diffuse complaints (my own tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-funnel&lt;/strong&gt;: 5 distinct friction clusters (PostHog: price / learning / UI / compliance / privacy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late-funnel&lt;/strong&gt;: 1-2 clean conversion levers (Cal.com: branding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this holds in case study #4+, personalab becomes a &lt;strong&gt;free PMF-stage diagnostic from a $1 LLM run&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  Honest disclaimer
&lt;/h2&gt;

&lt;p&gt;The default personas accidentally encoded personalab-specific preferences, so some quotes leak when reused on other products. I kept the bug in the case study writeup rather than rerunning with clean data, because it surfaces persona design as a real engineering concern.&lt;/p&gt;

&lt;h2&gt;
  Try it
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/g16253470-beep/personalab
cd personalab &amp;amp;&amp;amp; pip install -e .
personalab run --mode static --personas ./personas --adapter your_adapter --llm gemini:gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;40-line adapter, 12 default personas, MIT licensed.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/g16253470-beep/personalab" rel="noopener noreferrer"&gt;https://github.com/g16253470-beep/personalab&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  Two questions for DEV
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What product would you point this at first?&lt;/li&gt;
&lt;li&gt;Real PMF business or just an OSS curiosity?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tell me where this falls apart; that's the next case study.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>python</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
