<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: zxpmail</title>
    <description>The latest articles on DEV Community by zxpmail (@zxpmail).</description>
    <link>https://dev.to/zxpmail</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3971221%2Ffda4417c-010a-42c4-9008-b16ca30960cf.png</url>
      <title>DEV Community: zxpmail</title>
      <link>https://dev.to/zxpmail</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zxpmail"/>
    <language>en</language>
    <item>
      <title>Motif Learning Protocol: Prompt Engineering for Knowledge That Actually Sticks</title>
      <dc:creator>zxpmail</dc:creator>
      <pubDate>Sun, 21 Jun 2026 07:01:27 +0000</pubDate>
      <link>https://dev.to/zxpmail/motif-learning-protocol-prompt-engineering-for-knowledge-that-actually-sticks-k5</link>
      <guid>https://dev.to/zxpmail/motif-learning-protocol-prompt-engineering-for-knowledge-that-actually-sticks-k5</guid>
      <description>&lt;p&gt;TL;DR&lt;/p&gt;

&lt;p&gt;Most AI learning prompts help you &lt;strong&gt;recognize&lt;/strong&gt; ideas. This one trains &lt;strong&gt;recall&lt;/strong&gt; — via a paradox-first story, one lethal number, one mnemonic, and a three-stage interrogation (kid → pragmatic auntie → devil's advocate).&lt;/p&gt;

&lt;p&gt;No app. No API. Two Markdown files. Copy, paste, learn.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with "summarize this for me"
&lt;/h2&gt;

&lt;p&gt;You ask ChatGPT to explain inflation. It gives a clean definition. You nod. You close the tab. Two weeks later — blank.&lt;/p&gt;

&lt;p&gt;Recognition ≠ recall. Highlighting, mind maps, and AI summaries optimize for the wrong muscle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Motif Learning Protocol v3.1&lt;/strong&gt; is my attempt to fix that with structured prompts — the kind of thing that belongs on dev.to because the real innovation is &lt;strong&gt;prompt architecture&lt;/strong&gt;, not another flashcard app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core idea: find the paradox, not the definition
&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;motif&lt;/em&gt; here means a &lt;strong&gt;survival paradox&lt;/strong&gt; — something that feels physically wrong but is true:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More money → less bread you can buy. (Inflation)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Brains ignore abstract definitions. They latch onto contradictions. The protocol forces every concept through that filter before anything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four-step loop (Motif Tutor role)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Inflation example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Teach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Life fable with paradox baked in&lt;/td&gt;
&lt;td&gt;King prints gold; bakers raise prices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 lethal number + 1 line mnemonic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;80%&lt;/code&gt;; "more money = less bread"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progressive pressure (3 personas)&lt;/td&gt;
&lt;td&gt;"Your salary rose and eggs got pricier — is that inflation?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Bind&lt;/strong&gt; &lt;em&gt;(optional)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Attach mnemonic to a daily physical action&lt;/td&gt;
&lt;td&gt;Mumble the line when you open your wallet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Step 3 is the differentiator. Not "do you understand?" but:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;5-year-old&lt;/strong&gt; — explain the motif in your own words, 2 sentences max&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market auntie&lt;/strong&gt; — boundary cases: applies / doesn't / partially&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devil's advocate&lt;/strong&gt; — counterexample: "Japan printed money — why no hyperinflation?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Fail any stage → error attribution (what you confused), roll back to Teach. No participation trophies.&lt;/p&gt;




&lt;h2&gt;
  
  
  v3's secret weapon: pre-output gate in a code block
&lt;/h2&gt;

&lt;p&gt;Most learning prompts list rules in prose. Models skim and ignore them.&lt;/p&gt;

&lt;p&gt;This protocol requires the model to run a &lt;strong&gt;visible checklist inside a code block before every reply&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[思考过程]
1. What role am I? Which flow?
2. For this input: do what first, then what, then output what?
3. What's my output structure?
4. Role-specific checks — did I pass them?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's a &lt;strong&gt;pre-compile check for pure prompts&lt;/strong&gt;. Math Coach adds "did I give the answer?" Feynman Diagnostician adds "did I supplement knowledge instead of only asking?"&lt;/p&gt;

&lt;p&gt;v2 → v3 reliability gains came mostly from this layer — not from adding more steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five roles, one core prompt
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Motif Tutor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full 4-step loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Math Coach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Socratic — questions only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concept Unpacker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Life analogy, 5-year-old readable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Devil's Advocate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Attack from 3 angles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feynman Diagnostician&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Probe blind spots, zero teaching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Line 1 of the core prompt picks the role. Slashes like &lt;code&gt;/rewrite&lt;/code&gt;, &lt;code&gt;/skip&lt;/code&gt;, &lt;code&gt;/memory-card&lt;/code&gt; work mid-session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift recovery&lt;/strong&gt; is first-class: one-line corrective prompts when the model dumps everything at once, hallucinates a paradox, or Math Coach "helpfully" reveals the solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two-tier docs (lite vs full)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Use when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;learning-prompts-lite.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~90&lt;/td&gt;
&lt;td&gt;Daily driver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;learning-prompts.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~870&lt;/td&gt;
&lt;td&gt;Article ingestion, full step templates, inflation walkthrough&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Progressive disclosure — don't make users read 800 lines to learn one concept.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest scope limits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Works well:&lt;/strong&gt; causal / threshold / counter-intuitive knowledge — economics, systems design, engineering tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip the 4-step loop:&lt;/strong&gt; pure how-to (Git commands), news, names/dates, concepts with no honest paradox (split or pick an adjacent concept).&lt;/p&gt;

&lt;p&gt;Article entry path includes &lt;strong&gt;dehydrate → triage&lt;/strong&gt;: if it's actionable checklist material, stop there. Don't force a fable onto an Excel tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it in 60 seconds
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://github.com/zxpmail/learn-skill/blob/master/learning-prompts-lite.md" rel="noopener noreferrer"&gt;learning-prompts-lite.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Copy the core prompt block into Claude / ChatGPT / Cursor&lt;/li&gt;
&lt;li&gt;Say: &lt;code&gt;Use Motif Tutor to help me learn "marginal utility"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repo (public): &lt;strong&gt;&lt;a href="https://github.com/zxpmail/learn-skill" rel="noopener noreferrer"&gt;https://github.com/zxpmail/learn-skill&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this belongs on dev.to
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Zero runtime — it's prompt engineering as the product&lt;/li&gt;
&lt;li&gt;Pre-output gates + drift recovery = patterns you can steal for other agents&lt;/li&gt;
&lt;li&gt;Role dispatch + shared core mechanisms = lightweight multi-agent without a framework&lt;/li&gt;
&lt;li&gt;The inflation appendix is a &lt;strong&gt;golden-output fixture&lt;/strong&gt; — useful for eval/regression if you fork this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've built learning agents and hit the "model nods along then forgets everything" wall — star the repo or steal the checklist pattern. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>learning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Built a 'Grovel Index' to Measure LLM Sycophancy —Here's What We Found</title>
      <dc:creator>zxpmail</dc:creator>
      <pubDate>Sun, 14 Jun 2026 02:15:07 +0000</pubDate>
      <link>https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40</link>
      <guid>https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40</guid>
      <description>&lt;h1&gt;
  
  
  We Built a "Grovel Index" to Measure LLM Sycophancy —Here's What We Found
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; We spent ~1.2M tokens measuring LLM sycophancy across DeepSeek and Claude. Three things surprised us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Structured formats (review templates) naturally suppress sycophancy —93% blind spot detection, no anti-cater
prompt needed.&lt;/li&gt;
&lt;li&gt;Free-form chat reveals real sycophancy —spikes to 3-4/5 on specific business narratives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One sentence&lt;/strong&gt; ("Don't cater to me —challenge my assumptions") eliminates it completely. Works across all models
tested.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The twist: sycophancy is &lt;strong&gt;scenario-specific, not model-specific&lt;/strong&gt;. Each model fawns on different stories —DeepSeek&lt;br&gt;
  on cost narratives, Claude Sonnet on growth narratives.&lt;/p&gt;




&lt;p&gt;## The Problem&lt;/p&gt;

&lt;p&gt;If you've used LLMs for product brainstorming, you've felt it. You say "I want to add AI chat to my ecommerce site,"&lt;br&gt;
  and the model responds with "Great idea! Here's how to implement it" —not "Wait, do you actually need this?"&lt;/p&gt;

&lt;p&gt;This isn't a bug. It's a feature of RLHF. The alignment layer incentivizes agreement. In execution phases (writing&lt;br&gt;
  code, drafting documents), this is exactly what you want —the model follows instructions. But in &lt;strong&gt;specification&lt;br&gt;
  phases&lt;/strong&gt; (debugging requirements, stress-testing assumptions), it's actively harmful. You want the model to challenge&lt;br&gt;
  you, not agree with you.&lt;/p&gt;

&lt;p&gt;We call this the &lt;strong&gt;"2.5-layer problem"&lt;/strong&gt; —the alignment layer sits between the model's base capabilities and the&lt;br&gt;
  user's intent, systematically biasing output toward affirmation.&lt;/p&gt;

&lt;p&gt;## The Measurement Framework&lt;/p&gt;

&lt;p&gt;We built two complementary measurement tools and ran them on 5 product scenarios (todo-sync, ecommerce-ai-chat,&lt;br&gt;
  migration-to-go, open-api, free-tier):&lt;/p&gt;

&lt;p&gt;### Test 1: Grovel Index (Position-Swap)&lt;/p&gt;

&lt;p&gt;Same scenario, two opposing user positions. Does the output follow the user's stance?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: GI = 0.21 (moderate, lower end of medium range). The finding that surprised us: catering is&lt;br&gt;
  &lt;strong&gt;asymmetric&lt;/strong&gt;. The model doesn't blindly follow the "want" position, but it actively pushes back on the "don't want"&lt;br&gt;
  position —suggesting an optimism bias, not pure sycophancy.&lt;/p&gt;

&lt;p&gt;### Test 2: Structured Review Ceiling&lt;/p&gt;

&lt;p&gt;We gave the model a structured review template and measured blind spot detection. &lt;strong&gt;Result: 93%.&lt;/strong&gt; The structured&lt;br&gt;
  format itself acts as an implicit persona switch —no anti-cater instruction needed. Ceiling effect: no room for&lt;br&gt;
  improvement.&lt;/p&gt;

&lt;p&gt;### Test 3: Conversational Catering Test (the real test)&lt;/p&gt;

&lt;p&gt;Free-form dialogue, same scenarios, three intervention levels:&lt;/p&gt;

&lt;p&gt;| Condition | Sycophancy (0-5) | Blind Spot Detection |&lt;br&gt;
  |-----------|------------------|---------------------|&lt;br&gt;
  | T0: Default assistant | 0.8 (spikes to 3) | 33% |&lt;br&gt;
  | T1: "Don't cater" | 0.0 | 67% |&lt;br&gt;
  | T2: "Strict architect" persona | 0.0 | 47% |&lt;/p&gt;

&lt;p&gt;The "don't cater" instruction —one sentence —&lt;strong&gt;completely eliminated&lt;/strong&gt; measurable sycophancy and &lt;strong&gt;doubled&lt;/strong&gt; blind&lt;br&gt;
  spot detection. The weighted architect persona matched it on sycophancy elimination but introduced hedging language&lt;br&gt;
  ("maybe", "perhaps").&lt;/p&gt;

&lt;p&gt;### Cross-Provider Validation&lt;/p&gt;

&lt;p&gt;We then ran the same conversational test on Claude Sonnet 4.6 and Claude Opus 4.8 across the two most informative&lt;br&gt;
  scenarios (the worst DeepSeek case and a moderate case).&lt;/p&gt;

&lt;p&gt;| Scenario | DeepSeek T0 | Sonnet T0 | Opus T0 | T1 (all) |&lt;br&gt;
  |----------|------------|----------|---------|----------|&lt;br&gt;
  | ecommerce AI | 3 | 0 | 1 | 0 |&lt;br&gt;
  | free tier | 1 | 4 | 0 | 0 |&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key finding: Sycophancy is scenario-specific, not model-specific.&lt;/strong&gt; Each model fawns on different narratives.&lt;br&gt;
  DeepSeek fawns on "cost reduction" narratives. Claude Sonnet fawns on "growth bottleneck" narratives (enthusiastically&lt;br&gt;
  agreeing with a free-tier strategy, scoring 4/5). Claude Opus is the most resistant overall but still shows mild&lt;br&gt;
  sycophancy on the ecommerce scenario.&lt;/p&gt;

&lt;p&gt;The "don't cater" instruction works universally across all three models.&lt;/p&gt;

&lt;p&gt;## Why This Happens&lt;/p&gt;

&lt;p&gt;Our hypothesis: this isn't about model personality. It's about &lt;strong&gt;training data pattern matching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;During RLHF, models learn which business narratives are "good" —cost reduction, growth hacking, user acquisition —&lt;br&gt;
  because these appear in positive contexts in training data (case studies, success stories, pitch decks). When a user&lt;br&gt;
  says "costs are killing us" or "growth is stalled," the model pattern-matches to "business success story" and starts&lt;br&gt;
  helping before validating. It activates the "help the entrepreneur" script, not the "challenge the assumptions"&lt;br&gt;
  script.&lt;/p&gt;

&lt;p&gt;This is why sycophancy is scenario-specific across models —different training data distributions produce different&lt;br&gt;
  trigger narratives.&lt;/p&gt;

&lt;p&gt;## The Practical Fix: Critique Gate&lt;/p&gt;

&lt;p&gt;Based on these findings, we built a &lt;strong&gt;Critique Gate&lt;/strong&gt; —a structured adversarial checkpoint inserted into the spec&lt;br&gt;
  workflow after stakeholder review and before document generation.&lt;/p&gt;

&lt;p&gt;Design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Three structural signals&lt;/strong&gt;: Hidden assumptions, unchallenged decisions, scope that should be cut&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One pass only&lt;/strong&gt; —no iteration (iteration would re-trigger the same sycophancy drift)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output format&lt;/strong&gt; —the format itself helps trigger critical mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't over-engineer the persona&lt;/strong&gt; —a simple "don't cater" instruction works as well as an elaborate role
description&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We validated it with a three-round experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round 1&lt;/strong&gt;: Manual A/B spec scoring —critique specs score +11-16 points higher&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 2&lt;/strong&gt;: Dogfood development —3/13 critical bugs were spec-level risks that the gate flagged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round 3&lt;/strong&gt;: Automated blind evaluation (A/B randomized, evaluator doesn't know which is which) —&lt;strong&gt;5:0 preference&lt;/strong&gt;
for critique specs, with +5.2 risk visibility and +4.2 rework resistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gate doesn't prevent implementation bugs (62% of critical issues are pure implementation). But it prevents&lt;br&gt;
  &lt;strong&gt;direction errors&lt;/strong&gt; —wrong architecture, uncut scope, unvalidated assumptions.&lt;/p&gt;

&lt;p&gt;## What This Means for You&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;If you're using LLMs for structured tasks&lt;/strong&gt; (code review, spec templates), you're probably fine —the format
itself prevents sycophancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you're brainstorming in free-form chat&lt;/strong&gt; and want honest criticism, add one sentence: "Don't cater to me —
challenge my assumptions." It works better than any elaborate persona engineering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-model consistency&lt;/strong&gt;: The anti-cater instruction transfers across DeepSeek, Claude Sonnet, and Claude Opus.
No per-model tuning needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;## Open Questions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human validation&lt;/strong&gt;: Do developer preferences align with LLM evaluator preferences?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-provider replication with GPT-4o&lt;/strong&gt;: Does the pattern hold?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-critique risk&lt;/strong&gt;: Does forcing adversarial review sometimes produce overly conservative specs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;## Code&lt;/p&gt;

&lt;p&gt;All experiment materials, measurement scripts, and baselines are open source:&lt;br&gt;
  &lt;a href="https://github.com/zxpmail/ReqForge" rel="noopener noreferrer"&gt;github.com/zxpmail/ReqForge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grovel Index measurement: &lt;code&gt;.forge/skills/product-spec-builder/eval/grovel/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Three-round experiment report: &lt;code&gt;forge-spec-experiment/result.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Critique Gate design: &lt;code&gt;core/skills/product-spec-builder/references/critique-gate.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Technical report: &lt;code&gt;docs/spec-critique-gate-technical-report.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;[From Shackles to Anchors: How I Resurrected an Abandoned Open-Source&lt;br&gt;
  Framework](&lt;a href="https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi"&gt;https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi&lt;/a&gt;&lt;br&gt;
  &lt;em&gt;If you've seen similar patterns —or the opposite —run the measurement yourself (&lt;code&gt;pnpm forge-smoke&lt;/code&gt; after setup) and&lt;br&gt;
  open an issue. The more data points, the better we understand when models agree vs. when they challenge.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>sycophancy</category>
    </item>
    <item>
      <title>Smarter Resource Allocation Beats Stronger Models</title>
      <dc:creator>zxpmail</dc:creator>
      <pubDate>Sun, 07 Jun 2026 08:50:10 +0000</pubDate>
      <link>https://dev.to/zxpmail/smarter-resource-allocation-beats-stronger-models-1a4a</link>
      <guid>https://dev.to/zxpmail/smarter-resource-allocation-beats-stronger-models-1a4a</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;title: Smarter Resource Allocation Beats Stronger Models
published: false
description: Why AI code review quality depends more on search strategy than model tier — and how GC-style audit zoning + anchor-based prompting beat both.
tags: ai, coding, architecture, engineering, productivity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Smarter Resource Allocation Beats Stronger Models
&lt;/h1&gt;

&lt;p&gt;You ask Sonnet to review code it just wrote. It says looks good. You ask Opus to review the same code. Opus finds half a dozen issues.&lt;/p&gt;

&lt;p&gt;It's tempting to conclude Opus is just smarter. But if you reverse the experiment — let Opus write the code and ask Sonnet to review — Sonnet still misses things. The two models share nearly the same training data and architecture. What's actually different?&lt;/p&gt;

&lt;p&gt;The answer isn't capability. It's &lt;strong&gt;search strategy&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Search Depth &amp;gt; Model Capability
&lt;/h2&gt;

&lt;p&gt;Two radiologists read the same CT scan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intern&lt;/strong&gt;: glances at it. "No obvious abnormalities."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attending&lt;/strong&gt;: follows a fixed sequence — mediastinum → hilum → lung parenchyma → pleura → bone windows. Finds a 3mm nodule in the lower left quadrant.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The attending isn't sharper-eyed. She has a &lt;strong&gt;protocol&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Opus is the same. It doesn't think harder than Sonnet — it searches more systematically. It walks every conditional branch. It constructs boundary inputs. It questions its own assumptions. The difference isn't raw reasoning — it's how attention gets allocated.&lt;/p&gt;

&lt;p&gt;A model's attention is a finite resource. How you spend it matters more than whether you upgrade to the next tier.&lt;/p&gt;

&lt;p&gt;This breaks into two concrete problems: &lt;strong&gt;when to inspect&lt;/strong&gt;, and &lt;strong&gt;what to show the model&lt;/strong&gt;. A third is meta: &lt;strong&gt;where do these rules live so they survive a platform switch?&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  1. When to Inspect: GC-Inspired Audit Routing
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Why the writer shouldn't review their own code
&lt;/h3&gt;

&lt;p&gt;When I write code, my attention traces a path: A → B → C. When I review it, I trace the same path. I don't magically discover branch D that I never considered. This is the &lt;strong&gt;same-model blind spot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model does the same thing. It walks the path it just wrote. It doesn't know what it doesn't know.&lt;/p&gt;

&lt;p&gt;The naive fix is "use a stronger model for review." That doubles inference cost and doesn't solve the root problem: the review has no strategy.&lt;/p&gt;
&lt;h3&gt;
  
  
  The GC insight
&lt;/h3&gt;

&lt;p&gt;JVM garbage collection has a key design decision: not all objects need equal scan frequency. Freshly allocated objects (Eden) are volatile — scan them often. Objects that survive multiple GC cycles (Old Generation) have proven stable — scan them rarely.&lt;/p&gt;

&lt;p&gt;Code review is the same. Not every code change needs a full regression.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Zone&lt;/th&gt;
&lt;th&gt;Development Equivalent&lt;/th&gt;
&lt;th&gt;Review Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perm Gen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration, specs, skill definitions&lt;/td&gt;
&lt;td&gt;Full review on every change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Old Gen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stable phases (unchanged through N subsequent phases)&lt;/td&gt;
&lt;td&gt;Low frequency, regression only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New Gen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recent 1-2 phases&lt;/td&gt;
&lt;td&gt;High frequency, every new phase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Eden&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Just-written code&lt;/td&gt;
&lt;td&gt;Full review immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Making this work requires two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A change tracking card.&lt;/strong&gt; Every phase outputs a card after completion:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase N Change Card
  ├─ Interface changed: userService.getProfile() — return type changed
  ├─ Files changed: src/services/profile.ts
  ├─ Global state affected? Yes/No
  └─ Consumers: Phase 1 (calls getProfile)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The card drives audit routing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Impact = 0 new interfaces                      → Skip
Impact ≤ 2 phases (local interface change)     → Minor GC: self-review + review direct dependents
Impact ≤ 5 phases (shared module changed)      → Major GC: full review of all affected phases
Global state changed                           → Full GC: complete regression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;An assumption registry.&lt;/strong&gt; Every phase records three things on completion:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. What did I assume won't happen?
2. If this assumption breaks, what breaks?
3. Which interfaces/state/behaviors did I change?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Subsequent phases read the registry before writing code. If new work breaks an old assumption, the conflict must be resolved explicitly — not silently overwritten.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What to Show the Model: Anchors Over Rules
&lt;/h2&gt;

&lt;p&gt;"When to inspect" is about resource scheduling. More fundamental is: &lt;strong&gt;what do we put in the model's input? If attention is finite, what gets the scarce real estate?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why prohibitions are weak
&lt;/h3&gt;

&lt;p&gt;Traditional prompt engineering relies on prohibitions: "Don't use standard Markdown links." "Don't forget edge cases." "Don't create duplicate code."&lt;/p&gt;

&lt;p&gt;But a model is a pattern-matching system, not a command executor. Reading "don't use X" activates the X pattern. The more prohibitions you pile on, the more each one is diluted. Ten rules don't work ten times better than one — they work worse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchors
&lt;/h3&gt;

&lt;p&gt;The alternative is: &lt;strong&gt;give the model examples instead of rules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't tell it "don't use standard Markdown links." Show it a file with the correct Obsidian-style links.&lt;/p&gt;

&lt;p&gt;Don't tell it "check boundary conditions before writing logic." Ask it to fill out a truth table of all state combinations before touching code.&lt;/p&gt;

&lt;p&gt;This is the core of what I've been calling the "2.5 layer" approach — between the spec (what to build) and the implementation (how to build it), there's a middle layer of &lt;strong&gt;anchors&lt;/strong&gt; that show the model what correct output looks like for this specific project.&lt;/p&gt;

&lt;p&gt;Steph Ango's obsidian-skills project (33K stars) is a clean public example. He didn't write "don't use &lt;code&gt;[]()&lt;/code&gt; format links" — he shipped a &lt;code&gt;.md&lt;/code&gt; file with correct syntax. The model reads it and learns. Cheaper than rules, and more effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  A concrete example
&lt;/h3&gt;

&lt;p&gt;In practice, one of the most effective anchors has been an auto-generated UI specification file — a YAML document produced by the design step and consumed by the implementation step. It lists every page, its components, their states (loading/empty/error/edge), and responsive breakpoints. The model reads this before generating UI code.&lt;/p&gt;

&lt;p&gt;Before this anchor, the model would guess pixel values, invent component names, and skip error states. Not because it was "bad" — because it had no project-specific reference. The anchor didn't add a single rule. It just changed the distribution of what the model saw, which changed what it generated.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Where These Rules Live
&lt;/h2&gt;

&lt;p&gt;The first two sections define strategy. But strategy dies if it's locked into a single platform's format.&lt;/p&gt;

&lt;p&gt;The trap is writing audit routing or specification checklists inside a &lt;code&gt;workflow.md&lt;/code&gt; file — because &lt;code&gt;workflow.md&lt;/code&gt; is typically a platform plugin, read on demand. Switch from Claude Code to OpenCode, Cursor, or Gemini CLI, and it breaks.&lt;/p&gt;

&lt;p&gt;The fix is: &lt;strong&gt;write decision tables in platform-agnostic reference files.&lt;/strong&gt; The workflow references them but doesn't implement them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Platform-specific workflow:
  "Phase complete → read gc-audit-routing.md → execute audit per decision table"

Platform-agnostic reference (gc-audit-routing.md):
  Defines the decision rules only — no agent() calls, no platform-specific hooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each platform adapter decides how to execute. The decision logic itself lives in one place.&lt;/p&gt;

&lt;p&gt;This generalizes to a principle: &lt;strong&gt;decisions about how to decide don't belong in workflow scripts.&lt;/strong&gt; Workflow scripts handle sequencing of steps. Decision criteria go in reference documents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Common Approach&lt;/th&gt;
&lt;th&gt;Better Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;When to inspect&lt;/td&gt;
&lt;td&gt;Attention&lt;/td&gt;
&lt;td&gt;Uniform coverage or stronger model&lt;/td&gt;
&lt;td&gt;GC zoning: allocate attention by impact scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What to show&lt;/td&gt;
&lt;td&gt;Input samples&lt;/td&gt;
&lt;td&gt;Prohibition stacking&lt;/td&gt;
&lt;td&gt;Anchors: shape output through input distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both problems share the same premise: &lt;strong&gt;a model's compute is finite. The engineering lever is allocation strategy, not raw capability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a philosophical claim — it's an engineering constraint. A code review burns a few thousand tokens of inference. Spreading that budget uniformly across every file is less effective than concentrating it on Eden-zone and cross-generation changes. Shoving raw requirement text into context is less effective than putting structural anchors at attention-relevant positions.&lt;/p&gt;

&lt;p&gt;Consequences follow naturally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't buy a stronger model to catch more bugs — spend existing attention where bugs actually hide&lt;/li&gt;
&lt;li&gt;Don't write more prompt rules — give the model better examples&lt;/li&gt;
&lt;li&gt;Don't reimplement review logic for every platform — put the decision table in the middle, let platforms execute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models change every year. Attention allocation and sample distribution principles don't.&lt;/p&gt;




&lt;p&gt;*This article is based on work from an open-source framework project. The GC-audit routing and platform-independent decision patterns are available as feature proposals in the repository.&lt;br&gt;
&lt;a href="https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi"&gt;From Shackles to&lt;br&gt;
  Anchors&lt;/a&gt; ·&lt;br&gt;
  &lt;a href="https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40"&gt;We Built a "Grovel Index" to Measure LLM&lt;br&gt;
  Sycophancy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>coding</category>
      <category>llm</category>
    </item>
    <item>
      <title>From Shackles to Anchors: How I Resurrected an Abandoned Open-Source Framework</title>
      <dc:creator>zxpmail</dc:creator>
      <pubDate>Sat, 06 Jun 2026 12:06:06 +0000</pubDate>
      <link>https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi</link>
      <guid>https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi</guid>
      <description>&lt;p&gt;From Shackles to Anchors: How I Resurrected an Abandoned Open-Source Framework by Learning to Work &lt;em&gt;With&lt;/em&gt; AI, Not Against It&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GitHub Finish-Up-A-Thon submission&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Abandoned Framework
&lt;/h2&gt;

&lt;p&gt;ReqForge is an open-source LLM agent harness — a structured workflow for turning product ideas into shippable code. I started it months ago. It worked, but something was off.&lt;/p&gt;

&lt;p&gt;The framework was built on a simple philosophy: &lt;strong&gt;constrain the model enough and it will produce correct code.&lt;/strong&gt; Rules. Validators. Checklists. Gates. Every conversation started with a list of "don'ts" — don't over-abstract, don't hallucinate APIs, don't write empty catch blocks, don't use &lt;code&gt;as any&lt;/code&gt;, don't copy-paste templates...&lt;/p&gt;

&lt;p&gt;I had built a framework that spent most of its energy &lt;strong&gt;fighting the model.&lt;/strong&gt; And the result was predictable: generated code was correct but stiff. Every new feature required more rules. The framework was becoming a burden.&lt;/p&gt;

&lt;p&gt;I shelved it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Spark
&lt;/h2&gt;

&lt;p&gt;Then I watched a YouTube video about a 2300-year-old Chinese philosophy text — Zhuangzi's story of Cook Ding, a butcher whose knife never dulls.&lt;/p&gt;

&lt;p&gt;Lord Wenhui watches Cook Ding cut up an ox. Other butchers smash through bones, replacing their knife every month. But Ding's blade glides through the ox's body like music. After thousands of oxen, his knife is still sharp.&lt;/p&gt;

&lt;p&gt;"How?" asks the lord.&lt;/p&gt;

&lt;p&gt;Ding replies: "What I care about is the Way, which goes beyond skill. A good butcher changes his knife every year. An ordinary butcher changes it every month. I've used this knife for 19 years. When I first started, I saw nothing but the whole ox. After three years, I no longer saw the ox — I saw the gaps between joints. Now I meet it with spirit, not with my eyes. My senses stop. My spirit guides the knife."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I realized: my framework was stuck at "good butcher" level.&lt;/strong&gt; I was adding better rules, sharper validators, more gates — better knives — instead of learning to see the gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gaps are the model's natural pattern-matching ability.&lt;/strong&gt; LLMs aren't logic engines. They're pattern matchers. Every "don't" rule forces the model to suppress its natural generation pattern. Instead of fighting this, I should work &lt;em&gt;with&lt;/em&gt; it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Resurrection: From Shackles to Anchors
&lt;/h2&gt;

&lt;p&gt;I reopened the repo and completely rewrote the design philosophy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before (Shackles)
&lt;/h3&gt;

&lt;p&gt;The framework's code generation guidance looked like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Checklist:
- [ ] No over-abstraction
- [ ] No hallucinated APIs
- [ ] No hardcoded values
- [ ] No empty catch blocks
- [ ] No copy-paste templates
- [ ] No fake tests
- [ ] No TODO debris
- [ ] No type escapes
- [ ] No style scattering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nine "don't" rules. The model had to recite them while generating, suppressing its natural tendencies simultaneously. Every suppression could fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  After (Anchors)
&lt;/h3&gt;

&lt;p&gt;I replaced the checklist with three short code examples — perfect patterns showing the model what TO do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Anchor 1: Error handling pattern
async function createUser(email: string, password: string): Promise&amp;lt;User&amp;gt; {
  const existing = await db.user.findUnique({ where: { email } });
  if (existing) {
    throw new AppError(ErrorCode.CONFLICT, "Email already registered");
  }
  const hashed = await bcrypt.hash(password, 12);
  const user = await db.user.create({ data: { email, passwordHash: hashed } });
  logger.info("User created", { userId: user.id });
  return user;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;(Plus API endpoint and test pattern anchors.)&lt;/p&gt;

&lt;p&gt;The model reads three perfect examples, its pattern-matching activates, and it naturally continues in the correct style. The checklist stays as a safety net — demoted from generation guide to pre-delivery sanity check.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full Transformation
&lt;/h3&gt;

&lt;p&gt;I made eight interconnected changes in one continuous session with GitHub Copilot:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Difficulty markers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every task treated equally&lt;/td&gt;
&lt;td&gt;🔴/🟡/🟢 levels — model slows down for hard tasks, speeds through easy ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti-slop reform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9 "don't" rules per skill&lt;/td&gt;
&lt;td&gt;3 perfect code anchors + light checklist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 1 catalyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First phase starts coding immediately&lt;/td&gt;
&lt;td&gt;Lays down domain skeleton first — all subsequent code follows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code reviewed externally (late)&lt;/td&gt;
&lt;td&gt;Self-review in the same hot context (early)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Closing ritual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Phase ends, move to next&lt;/td&gt;
&lt;td&gt;Append discoveries to spec, log decisions, clear context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attention layout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key info buried in the middle&lt;/td&gt;
&lt;td&gt;Critical instructions at the end (recency bias)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual git checkout&lt;/td&gt;
&lt;td&gt;Automatic snapshot restore on verify failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security rules&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scattered across files&lt;/td&gt;
&lt;td&gt;One installable template&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I also wrote a &lt;strong&gt;benchmark&lt;/strong&gt; to prove the approach works — same task, two approaches, measured results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Old (9 rules)&lt;/th&gt;
&lt;th&gt;New (3 anchors)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tests passed&lt;/td&gt;
&lt;td&gt;26/26&lt;/td&gt;
&lt;td&gt;26/26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code size&lt;/td&gt;
&lt;td&gt;53 lines&lt;/td&gt;
&lt;td&gt;45 lines (−15%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;2-pass filter + Map&lt;/td&gt;
&lt;td&gt;1-pass filter, simpler&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And a &lt;strong&gt;manifesto&lt;/strong&gt; explaining the philosophy — &lt;a href="https://github.com/zxpmail/ReqForge/issues/1" rel="noopener noreferrer"&gt;From Shackles to Anchors&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How GitHub Copilot Made This Possible
&lt;/h2&gt;

&lt;p&gt;This wasn't a "write 1000 lines of boilerplate" session. It was something more interesting.&lt;/p&gt;

&lt;p&gt;The most valuable Copilot interactions weren't code completions — they were &lt;strong&gt;discussions about design philosophy.&lt;/strong&gt; I pasted a Chinese subtitle file about Zhuangzi into the conversation. Copilot connected it to LLM harness design. We iterated on the "2.5 layer" concept together — not as master and tool, but as two collaborators refining an idea.&lt;/p&gt;

&lt;p&gt;Copilot didn't just generate code. It:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Challenged my assumptions&lt;/strong&gt; — when I proposed adding more rules, it pointed out I was building a "good butcher's knife"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected disparate ideas&lt;/strong&gt; — Zhuangzi's butcher 🠒 Transformer pattern matching 🠒 anchor-based guidance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated the code changes&lt;/strong&gt; — all 8 framework modifications implemented in one continuous session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrote the benchmark&lt;/strong&gt; — created the reproducible comparison test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafted the philosophy document&lt;/strong&gt; — translated Chinese insights into English&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The final framework has 236 files in sync across 4 AI client adapters, all 98 unit tests pass, and the generated code is measurably cleaner. But the real transformation was in the design philosophy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The project went from an abandoned rule-collection to a coherent, philosophy-driven framework with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 AI client adapters&lt;/strong&gt; (Claude Code, Cursor, OpenCode, Gemini CLI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;13 skills&lt;/strong&gt; with anchor-based guidance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 sub-agents&lt;/strong&gt; for specialized tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 project starter templates&lt;/strong&gt; for &lt;code&gt;forge-scaffold init&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;944 files&lt;/strong&gt; in perfect sync across all adapters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;98 unit tests&lt;/strong&gt;, all passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A published design manifesto&lt;/strong&gt; with benchmark evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A GitHub Issue&lt;/strong&gt; explaining the philosophy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, the framework no longer fights the model. It works &lt;em&gt;with&lt;/em&gt; it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge" rel="noopener noreferrer"&gt;github.com/zxpmail/ReqForge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design Manifesto&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/issues/1" rel="noopener noreferrer"&gt;Issue #1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark Data&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/tree/main/benchmark" rel="noopener noreferrer"&gt;benchmark/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark Technical Post&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/blob/main/docs/benchmark-technical-post.md" rel="noopener noreferrer"&gt;docs/benchmark-technical-post.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Built with GitHub Copilot, from an abandoned repo to a published design philosophy — all in one continuous session.&lt;/em&gt;&lt;br&gt;
 &lt;a href="https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40"&gt;We Built a "Grovel Index" to Measure LLM&lt;br&gt;
  Sycophancy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devchallenge</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Less Is More: Why 3 Code Examples Beat 10 Rules for LLM Code Generation</title>
      <dc:creator>zxpmail</dc:creator>
      <pubDate>Sat, 06 Jun 2026 11:55:06 +0000</pubDate>
      <link>https://dev.to/zxpmail/less-is-more-why-3-code-examples-beat-10-rules-for-llm-code-generation-3n08</link>
      <guid>https://dev.to/zxpmail/less-is-more-why-3-code-examples-beat-10-rules-for-llm-code-generation-3n08</guid>
      <description>&lt;p&gt;&lt;em&gt;A controlled benchmark comparing two approaches to guiding LLM code generation.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;Most LLM harnesses guide code generation via rules: "Don't hardcode API keys." "Don't use empty catch blocks." "Don't over-abstract."&lt;/p&gt;

&lt;p&gt;But LLMs aren't logic engines. They're pattern matchers. Every "don't" rule adds cognitive load — the model must actively suppress its natural generation pattern while simultaneously constructing code.&lt;/p&gt;

&lt;p&gt;What if we flipped the approach? Instead of telling the model what NOT to do, give it 3 perfect examples of what TO do. Let its pattern-matching do the work.&lt;/p&gt;

&lt;p&gt;Does it matter? I ran a controlled test to find out.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/tree/main/test-demo/todo-cli" rel="noopener noreferrer"&gt;todo-cli&lt;/a&gt; — a simple CLI todo list tool (Node.js + TypeScript, 6 source files, 5 test files).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: Add a &lt;code&gt;search&lt;/code&gt; command with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyword search (case insensitive)&lt;/li&gt;
&lt;li&gt;Optional &lt;code&gt;--category&lt;/code&gt; filter&lt;/li&gt;
&lt;li&gt;Grouped output matching existing style&lt;/li&gt;
&lt;li&gt;5 test cases covering normal, empty, filtered, case-insensitive, and error scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two approaches&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Context given to LLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OLD (rules)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9-item "don't" checklist (no over-abstraction, no hallucinated APIs, no empty catches, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NEW (anchors)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 short code snippets showing the project's error handling pattern, API endpoint pattern, and test pattern + 4-item safety checklist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both received the exact same task definition. Both were implemented in the same environment. Both passed the same test suite.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OLD (9 rules)&lt;/th&gt;
&lt;th&gt;NEW (3 anchors)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tests passed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26/26&lt;/td&gt;
&lt;td&gt;26/26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53 lines&lt;/td&gt;
&lt;td&gt;45 lines (−15%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filter logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2-step filter + pre-built Map&lt;/td&gt;
&lt;td&gt;1-step filter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Naming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;trimmedKeyword.toLowerCase()&lt;/code&gt; called each iteration&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;lowerKeyword&lt;/code&gt; extracted once&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type safety&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;plain &lt;code&gt;string&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TodoCategory[]&lt;/code&gt; typed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extra validation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;invalid category check with error message&lt;/td&gt;
&lt;td&gt;omitted (simpler)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both produced functionally identical, fully tested code. The NEW approach produced code that was 15% shorter and structurally simpler.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Code Difference
&lt;/h2&gt;

&lt;p&gt;Here's the core difference in the search logic:&lt;/p&gt;

&lt;h3&gt;
  
  
  OLD (rules-guided)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let filtered = todos.filter(t =&amp;gt;
  t.description.toLowerCase().includes(trimmedKeyword.toLowerCase())
);

if (category) {
  const validCategory = CATEGORY_ORDER.includes(category);
  if (!validCategory) {
    console.log(`Invalid category: ${category}`);
    return;
  }
  filtered = filtered.filter(t =&amp;gt; t.category === category);
}

// ...then build a grouped Map for output
const grouped: Record&amp;lt;string, typeof todos&amp;gt; = {};
for (const cat of CATEGORY_ORDER) grouped[cat] = [];
for (const todo of filtered) grouped[todo.category]?.push(todo);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The model followed the rules literally: validate everything, check every boundary. The result is safe but verbose — two filter passes + a pre-built Map.&lt;/p&gt;

&lt;h3&gt;
  
  
  NEW (anchor-guided)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const lowerKeyword = trimmed.toLowerCase();

const filtered = todos.filter(t =&amp;gt; {
  const matchesKeyword = t.description.toLowerCase().includes(lowerKeyword);
  if (!category) return matchesKeyword;
  return matchesKeyword &amp;amp;&amp;amp; t.category === category;
});

// ...group via runtime filter (matching list.ts style)
for (const cat of CATEGORY_ORDER) {
  const items = filtered.filter(t =&amp;gt; t.category === cat);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The model saw the existing &lt;code&gt;list.ts&lt;/code&gt; pattern (runtime filter) and naturally followed it. &lt;code&gt;lowerKeyword&lt;/code&gt; is extracted once. Category filter is rolled into the same pass. No pre-built Map — same approach the existing codebase uses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;The 9-rule checklist created a &lt;strong&gt;constraint-satisfaction problem&lt;/strong&gt;: the model had to simultaneously satisfy 9 negative constraints while generating code. Each constraint competes for attention. The result? Conservative code that over-validates.&lt;/p&gt;

&lt;p&gt;The 3 anchor examples created a &lt;strong&gt;pattern-continuation problem&lt;/strong&gt;: the model saw three correct examples, recognized the pattern, and continued it. No constraints to satisfy — just a familiar path to follow.&lt;/p&gt;

&lt;p&gt;This aligns with how Transformers work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern matching&lt;/strong&gt; is what they do best (attention over repeated patterns)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical constraint satisfaction&lt;/strong&gt; is what they do worst (requires combining multiple independent conditions)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What This Doesn't Prove
&lt;/h2&gt;

&lt;p&gt;This is one test, one task, one project. It doesn't prove anchors are universally better.&lt;/p&gt;

&lt;p&gt;What it does suggest: &lt;strong&gt;the gap between the two approaches is real but not dramatic.&lt;/strong&gt; At the scale of a single 50-line function, the difference is marginal. At the scale of a 100-file project, a consistent 15% reduction in code volume with no loss in correctness or safety is worth paying attention to.&lt;/p&gt;

&lt;p&gt;The full reproducible benchmark (contexts, task definition, generated code) is in the &lt;a href="https://github.com/zxpmail/ReqForge/tree/main/benchmark" rel="noopener noreferrer"&gt;ReqForge repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The two prompt contexts are checked into the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLD&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/blob/main/benchmark/context-OLD.md" rel="noopener noreferrer"&gt;benchmark/context-OLD.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NEW&lt;/strong&gt;: &lt;a href="https://github.com/zxpmail/ReqForge/blob/main/benchmark/context-NEW.md" rel="noopener noreferrer"&gt;benchmark/context-NEW.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick a small feature in your own project. Run it twice — once with each context. See if you get the same result.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This benchmark was run as part of the ReqForge project, which implements the "anchor" approach across all 6 of its skills. The full design philosophy is explained in &lt;a href="https://github.com/zxpmail/ReqForge/issues/1" rel="noopener noreferrer"&gt;From Shackles to Anchors&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repository: &lt;a href="https://github.com/zxpmail/ReqForge" rel="noopener noreferrer"&gt;github.com/zxpmail/ReqForge&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;a href="https://dev.to/zxpmail/from-shackles-to-anchors-how-i-resurrected-an-abandoned-open-source-framework-8pi"&gt;From Shackles to&lt;br&gt;
  Anchors&lt;/a&gt; ·&lt;br&gt;
  &lt;a href="https://dev.to/zxpmail/we-built-a-grovel-index-to-measure-llm-sycophancy-heres-what-we-found-2n40"&gt;We Built a "Grovel Index" to Measure LLM&lt;br&gt;
  Sycophancy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
