<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hector Haung</title>
    <description>The latest articles on DEV Community by Hector Haung (@hector_haung_da45eb10a814).</description>
    <link>https://dev.to/hector_haung_da45eb10a814</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908486%2F734fd308-b868-4601-a25c-28e1a5c69861.jpg</url>
      <title>DEV Community: Hector Haung</title>
      <link>https://dev.to/hector_haung_da45eb10a814</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hector_haung_da45eb10a814"/>
    <language>en</language>
    <item>
      <title>AI Is Very Good at Implementing Bad Plans</title>
      <dc:creator>Hector Haung</dc:creator>
      <pubDate>Sat, 02 May 2026 08:57:55 +0000</pubDate>
      <link>https://dev.to/hector_haung_da45eb10a814/ai-is-very-good-at-implementing-bad-plans-4d80</link>
      <guid>https://dev.to/hector_haung_da45eb10a814/ai-is-very-good-at-implementing-bad-plans-4d80</guid>
      <description>&lt;p&gt;Most AI coding posts focus on the code: which model writes cleaner functions, which one needs less prompting, which one hallucinates less.&lt;/p&gt;

&lt;p&gt;But the code isn't usually where my projects break.&lt;/p&gt;

&lt;p&gt;The plan is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fzd8oaxie7mvvn4r0gf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fzd8oaxie7mvvn4r0gf.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
   &lt;a href="https://github.com/permoon/multi-model-redteam" rel="noopener noreferrer"&gt;https://github.com/permoon/multi-model-redteam&lt;/a&gt; 
&lt;/h2&gt;

&lt;h2&gt;
  
  
  A pipeline that almost shipped
&lt;/h2&gt;

&lt;p&gt;A few weeks ago I asked Claude Code to plan a BigQuery dedup pipeline. Routine stuff. Pull events from Postgres into GCS, load into BigQuery, dedup by event ID, impute some missing checkout rows.&lt;/p&gt;

&lt;p&gt;The plan came back in maybe 90 seconds. Six steps, clean SQL, sensible-looking error handling. I almost just told it to start coding.&lt;/p&gt;

&lt;p&gt;Then I tried something. I sent the same plan to Codex and Gemini, and asked each one separately to break it.&lt;/p&gt;

&lt;p&gt;Three models. Same plan. No shared context. None of them knew what the others wrote.&lt;/p&gt;

&lt;p&gt;Here's what came back.&lt;/p&gt;




&lt;h2&gt;
  
  
  What three models found
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All three caught the same dedup bug:&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;INSERT INTO order_events_dedup&lt;/code&gt; step wasn't idempotent: any retry would double yesterday's rows. And the existing alert ("less than 50% of expected") is one-sided, so it would never fire on an over-count.&lt;/p&gt;
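
&lt;p&gt;For contrast, here's roughly what an idempotent version of that step could look like. This is a sketch, not the plan's actual SQL: only &lt;code&gt;order_events_dedup&lt;/code&gt; and the event-ID key come from the post; the staging table, columns, and &lt;code&gt;@run_date&lt;/code&gt; parameter are stand-ins.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch only: order_events_dedup and the event-ID key come from the plan;
-- the staging table, column names, and @run_date are invented here.
MERGE INTO order_events_dedup AS target
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) AS rn
    FROM order_events_staging
    WHERE DATE(ingested_at) = @run_date
  )
  WHERE rn = 1
) AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT ROW;
-- Retrying this statement can't double rows: already-inserted event IDs simply match.
-- Pair it with a two-sided count check (alert outside, say, 50%-150% of expected)
-- so over-counts fire as well as under-counts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;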

&lt;p&gt;That's the easy one. The interesting findings were the ones only one model caught.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only Claude caught this:&lt;/strong&gt;&lt;br&gt;
Step D's correlated subquery had an unqualified column reference. Under BigQuery's scoping rules, the bare &lt;code&gt;user_id&lt;/code&gt; in &lt;code&gt;WHERE m2.user_id = user_id&lt;/code&gt; resolves to &lt;code&gt;m2&lt;/code&gt;'s own column rather than the outer table's, so the subquery never actually correlates. The imputation step would silently do nothing after day one. The pipeline's &lt;em&gt;whole purpose&lt;/em&gt; (filling in missing checkout events) would fail invisibly for 2–8 weeks before anyone noticed.&lt;/p&gt;

&lt;p&gt;Codex and Gemini both &lt;em&gt;quoted&lt;/em&gt; this exact SQL block in their reviews. Neither checked whether the correlation actually binds.&lt;/p&gt;
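
&lt;p&gt;To make the scoping problem concrete (with stand-in table names, since the post doesn't show Step D's actual SQL):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative only: orders and checkout_events are stand-ins, not the plan's schema.

-- Broken shape: the unqualified user_id resolves to m2's own column,
-- so the predicate is trivially true and the subquery never correlates
-- with the outer row.
SELECT o.*
FROM orders AS o
WHERE NOT EXISTS (
  SELECT 1
  FROM checkout_events AS m2
  WHERE m2.user_id = user_id      -- binds to m2.user_id, not o.user_id
);

-- Intended shape: qualify the outer column explicitly.
SELECT o.*
FROM orders AS o
WHERE NOT EXISTS (
  SELECT 1
  FROM checkout_events AS m2
  WHERE m2.user_id = o.user_id
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;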

&lt;p&gt;&lt;strong&gt;Only Gemini caught this:&lt;/strong&gt;&lt;br&gt;
A midnight-boundary race. The same event, delivered at 23:59:59 on Day 1 and retried at 00:00:02 on Day 2, lands in two different daily partitions. Step C's &lt;code&gt;GROUP BY&lt;/code&gt; only looks within a single partition, so the cross-partition pair never gets deduped.&lt;/p&gt;
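
&lt;p&gt;One common mitigation, sketched with the same invented staging names as above: widen the dedup window so it straddles the partition boundary (the event-ID-keyed write above closes the same gap from the write side).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch: dedup over a one-day lookback window instead of a single daily
-- partition, so a retry landing just after midnight still collapses onto
-- the Day 1 copy of the same event_id. Names are illustrative.
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at) AS rn
  FROM order_events_staging
  WHERE DATE(ingested_at) BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
)
WHERE rn = 1
  AND DATE(ingested_at) = @run_date   -- still emit only the current day's rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;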

&lt;p&gt;&lt;strong&gt;Only Codex caught this:&lt;/strong&gt;&lt;br&gt;
A truncated CSV export in GCS that BigQuery loads without error. Because the truncated file is still syntactically valid CSV, up to 50% of the data can be silently lost while the row-count alert still passes.&lt;/p&gt;
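
&lt;p&gt;One way to catch this class of failure, again with invented names, assuming the export step can record an expected row count somewhere BigQuery can read it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch: load_manifest is a hypothetical control table the export step
-- writes its row count into. Reconciling against it makes a truncated
-- but syntactically valid CSV fail loudly.
SELECT
  m.expected_rows,
  COUNT(s.event_id) AS loaded_rows,
  SAFE_DIVIDE(COUNT(s.event_id), m.expected_rows) AS load_ratio
FROM load_manifest AS m
LEFT JOIN order_events_staging AS s
  ON DATE(s.ingested_at) = m.run_date
WHERE m.run_date = @run_date
GROUP BY m.expected_rows;
-- Alert when load_ratio falls outside a tight band (say 0.99-1.01),
-- not just when it drops below 0.5.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;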

&lt;p&gt;Three different blind spots. Three different models. If I'd just gone with any one model's review, I'd have shipped two of these bugs.&lt;/p&gt;


&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROMPT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prompts/system-prompt.md plan.md&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | claude &lt;span class="nt"&gt;--print&lt;/span&gt;                   &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; out/claude.md &amp;amp;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--skip-git-repo-check&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; out/codex.md  &amp;amp;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | gemini &lt;span class="nt"&gt;--skip-trust&lt;/span&gt;              &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; out/gemini.md &amp;amp;
&lt;span class="nb"&gt;wait&lt;/span&gt;

&lt;span class="c"&gt;# 4th call merges and ranks the three reviews&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;cat &lt;/span&gt;prompts/consolidation-prompt.md &lt;span class="se"&gt;\&lt;/span&gt;
       out/claude.md out/codex.md out/gemini.md&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | claude &lt;span class="nt"&gt;--print&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; out/ranked.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three CLIs in parallel. Same prompt. No shared context. A fourth call to merge.&lt;/p&gt;

&lt;p&gt;Wall time: 5–15 minutes (the merge step dominates). Cost: about $0.10–0.20 for a sample plan, $0.50–2.00 for production-size.&lt;/p&gt;


&lt;h2&gt;
  
  
  The prompt that does the work
&lt;/h2&gt;

&lt;p&gt;The prompt sent to all three models has one job: force concrete failure scenarios, reject abstract advice. Five dimensions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. HIDDEN ASSUMPTIONS — ordering, uniqueness, atomicity, data
   freshness, caller behavior. What does this design implicitly
   depend on?
2. DEPENDENCY FAILURES — upstream/downstream services, external
   APIs, databases, messaging. What breaks if a dependency
   degrades?
3. BOUNDARY INPUTS — empty, single, huge batch, malicious,
   malformed.
4. MISUSE PATHS — caller misbehavior, user skipping steps,
   out-of-order operations.
5. ROLLBACK &amp;amp; BLAST RADIUS — how to recover, scope of damage.
   5-minute detection vs 5-day detection?

For each scenario:
- TRIGGER: what causes it
- IMPACT: who is affected, how badly
- DETECTABILITY: how long until noticed

Reject abstract advice like "add monitoring". Specify what
metric, what threshold, what alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last paragraph is doing most of the work. Without it you get "consider rate limiting" and "ensure proper error handling." With it you get the midnight-boundary race.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I actually learned
&lt;/h2&gt;

&lt;p&gt;Three models in parallel isn't impressive. Anyone can run three CLIs. The thing that surprised me is &lt;strong&gt;how little the three reviews overlap once you get past the obvious findings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude tends to over-warn. It flags five defensive checks that aren't really bugs. But it actually reads the SQL.&lt;/p&gt;

&lt;p&gt;Codex is concise. It skips integration details, but it notices file-format and infra failure modes the others gloss over.&lt;/p&gt;

&lt;p&gt;Gemini stays surface-level a lot of the time. But when it does dig in, it's often a concurrency or partition issue the others missed.&lt;/p&gt;

&lt;p&gt;You don't get this from ensemble averaging. The &lt;em&gt;consensus&lt;/em&gt; findings are the obvious ones. The &lt;em&gt;unique&lt;/em&gt; findings are the ones a single-model review would have quietly missed.&lt;/p&gt;

&lt;p&gt;That's the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Not a multi-agent framework
&lt;/h2&gt;

&lt;p&gt;This is a workflow, not a system. No orchestrator, no shared scratchpad, no consensus protocol, no agent class hierarchy. Three CLIs in parallel. A fourth call to merge.&lt;/p&gt;

&lt;p&gt;If you want an installed framework with marketplace plugins, there are several. This is the opposite shape: ~30 lines you paste into your &lt;code&gt;CLAUDE.md&lt;/code&gt;, and the next time you ask Claude Code to review a plan, it fans out to Codex and Gemini in parallel and brings back a merged report.&lt;/p&gt;




&lt;h2&gt;
  
  
  I wrote it up
&lt;/h2&gt;

&lt;p&gt;The full method, both case studies (the BigQuery pipeline above plus a Cloud Run + Workflows deploy), and the 100-line &lt;code&gt;redteam.sh&lt;/code&gt; are in a small repo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/permoon/multi-model-redteam" rel="noopener noreferrer"&gt;https://github.com/permoon/multi-model-redteam&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three install tiers depending on what you have set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 0&lt;/strong&gt;: paste 30 lines into &lt;code&gt;CLAUDE.md&lt;/code&gt;. No install.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1&lt;/strong&gt;: &lt;code&gt;git clone&lt;/code&gt; and run the bash script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2&lt;/strong&gt;: copy the prompt into Claude / ChatGPT / Gemini's chat UI. One model only, but better than no frame.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's also a teaching repo. Seven chapters, from "why one LLM isn't enough" to the parallel script.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open question
&lt;/h2&gt;

&lt;p&gt;The reason I'm posting this: I want to know if other people are doing something similar.&lt;/p&gt;

&lt;p&gt;Are you red-teaming AI-generated plans before letting the model implement them? With one model? Multiple? Or are you mostly trusting the plan and reviewing the code afterward?&lt;/p&gt;

&lt;p&gt;If you've tried this and it didn't work for you, I'd especially like to hear that.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>programming</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
