You know that thing where you ask an AI to review your code and it finds three real problems... then spends the next 400 words telling you what a great job you did?
Or when you ask it to pick between two approaches and it gives you "both have merits" like a politician dodging a question?
I kept running into this. Not because the AI was dumb — it clearly knew enough to give me a real answer. It just... wouldn't. Every time I asked for honest feedback, I got a compliment sandwich. Every time I asked it to cut scope, it added three more features "just in case."
So I started experimenting.
What if the problem isn't what it knows, but how it behaves?
I tried the obvious stuff first. "Be brutally honest." "Don't hold back." "Think critically." None of it worked for more than a few messages before the model slid back into its comfort zone — agreeable, hedging, pattern-matching from the most popular answer instead of actually thinking about my specific situation.
Then I tried something different. Instead of telling it what to think, I wrote rules for how to think. Specific ones:
- Before saying something doesn't exist, you have to say where you looked for it
- Every finding needs a binary verdict — yes or no, no "it depends"
- You have 9 named techniques and you can't use the same one twice
- If something survives your stress test, say so — don't manufacture problems that aren't there
I wrapped these rules in characters (because honestly, "the chaos monkey who loves breaking things" is more fun to work with than "analytical framework #7"). I ended up with 12 of them — one for each behavioral problem I kept hitting. I called them the zodiac.
## Here's what they actually do
Each one is a single markdown file. No code. No API. No fine-tuning. Just constraints that any AI tool can read.
| Animal | What it breaks | What it does | Verdict |
|---|---|---|---|
| 猴 Monkey | Agreeableness | Stress-tests from 9 angles — Delete Probe, Hostile Input, Scale Shift... | Survived: yes/no |
| 虎 Tiger | Premature convergence | Attacks your solution to find the fatal flaw before prod does | Burned: yes/no |
| 鸡 Rooster | Confident BS | Fact-checks every claim, cites where it looked | Verified: yes/no |
| 蛇 Snake | Scope creep | Cuts everything that hasn't earned its place | Earned: yes/no |
| 鼠 Rat | Linear thinking | Traces second and third-order consequences you missed | Contained: yes/no |
| 狗 Dog | Context drift | Catches when a project wandered from the original plan | Aligned: yes/no |
| 猪 Pig | Political hedging | Says the thing everyone's thinking but nobody will say | Hedged: yes/no |
| 牛 Ox | Pattern matching | Questions whether the chosen pattern is actually warranted | Warranted: yes/no |
| 龙 Dragon | Short-term thinking | Maps what a decision locks in for the next 5-10 years | Farsighted: yes/no |
| 马 Horse | Analysis paralysis | Cuts through overthinking to find what's actually blocking | Clear: yes/no |
| 羊 Goat | Convergent thinking | Explores alternatives nobody considered | Fertile: yes/no |
| 兔 Rabbit | Audience-blind output | Orchestrates the others, shapes everything for your audience | — |
Every verdict is binary. Survived: yes means "I stress-tested this and it held up" — not "I found nothing." The model can't hide in "it depends."
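To make that concrete, here's roughly what a single finding looks like on the page. This is an illustrative sketch of the format, not a verbatim excerpt from the repo — the exact field names vary by animal:

```markdown
### Finding 4 — Existence Question
Claim: new tables are publicly readable until RLS is manually enabled
Checked: the official docs on Row Level Security
Survived: no
Confidence: 82
```

The last two lines are the point: a committed verdict plus a number telling you how seriously to take it.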
## The Monkey — the one that started it all
The Monkey was the first animal I built, and she's still the one I reach for most. She breaks the default I hate most: agreeableness.
Here's how she works. You give her a target — an architecture, a plan, a proposal, a PR, anything. She attacks it with 9 named techniques:
- Assumption Flip — find the assumption everything rests on, then flip it. "What if most deployments AREN'T localhost?"
- Hostile Input — what's the worst thing a user, attacker, or edge case could feed this?
- Delete Probe — remove the thing entirely. What actually changes? (This one kills sacred cows fast.)
- Scale Shift — what breaks at 100x users, 100x data, 100x traffic?
- Existence Question — before claiming a safeguard exists, say where you checked. Before claiming it doesn't, same.
- Cross-Seam Probe — where two systems meet, what falls through the gap?
- Boundary Probe — what lives right at the edge of a definition, a threshold, a rule?
- Stress Test — apply sustained pressure on the weakest point
- Replay Probe — what if this happens twice, or in a different order?
She has to use different techniques across findings. No repeating. This is the single thing that stops her from finding the same 3 issues every time.
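In the file, that constraint is just a couple of blunt lines. Paraphrased rather than quoted — the actual wording in the repo may differ:

```markdown
## Techniques
Assumption Flip · Hostile Input · Delete Probe · Scale Shift · Existence Question ·
Cross-Seam Probe · Boundary Probe · Stress Test · Replay Probe

## Rules
- Each finding must use a different technique. Never repeat one across findings.
```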
And every finding ends with Survived: yes or Survived: no. She can't hedge. She can't say "this could potentially be an issue under certain circumstances." She has to commit. When she ran against Supabase's architecture and found that RLS is off by default on every new table, it wasn't "this could be a concern" — it was Survived: no, confidence 82. And security researchers had already proven her right with a CVE affecting 170+ production apps.
The Monkey also has one rule that saved the whole project: anti-fabrication on absence claims. Before she says something doesn't exist, she has to say where she looked. This rule exists because early Monkey runs on darktable (an open-source photo editor) produced three high-confidence findings claiming the project had no security practices. All three were wrong — darktable has extensive fuzzing and parser hardening, clearly documented. The Monkey just hadn't checked.
After I added the rule, false absence claims dropped to near zero. Confidence 80+ now requires citing the specific source. This one rule made the difference between "interesting experiment" and "tool I actually trust."
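The rule itself is short. Again paraphrased — check the repo for the real wording:

```markdown
- Never claim something is absent without stating where you looked
  ("checked the docs and CI config; found no fuzzing setup" — not "no fuzzing exists").
- Confidence above 80 requires citing the specific source you verified.
```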
She produces 9 findings (more than any other animal) because she has 9 techniques. The others produce 5. The Monkey is the broadest net — if something is wrong, she'll probably catch it from at least one angle.
## Cool concept. Does it actually work?
Here's the Monkey pointed at Supabase's architecture. Her headline finding — Survived: no, confidence 82 — says every new Supabase table is publicly exposed by default because RLS is off. The Monkey didn't guess. She checked the docs, confirmed RLS must be manually enabled, and noted that the Studio SQL editor runs as superuser (bypassing all RLS), so developers test in an environment that actively misleads them about production behavior.
Was she right? CVE-2025-48757. Security researchers at DeepStrike documented hacking thousands of misconfigured Supabase instances at scale. 170+ Lovable-generated apps were exposed. The finding wasn't theoretical — it was already happening in the wild.
That's one finding on one target. I wanted to know if this held up more broadly.
## The full test: 40 findings on the EU AI Act
I pointed the zodiac at the complete EU AI Act — 113 articles, 573 kilobytes of dense legal text. The Rabbit orchestrator broke it into sections, picked 8 animals, ran two passes, and produced 40 findings.
Then I checked every single one against published legal scholarship. Law firms, Commission guidelines, academic journals, policy think tanks. I wasn't trying to prove myself right — I was looking for where the zodiac was wrong.
It wasn't wrong anywhere: zero of the 40 findings were contradicted, and 17 were directly confirmed by published experts.
Two results floored me:
The Rat predicted that the EU would face a bottleneck because there weren't enough certified bodies to actually assess AI systems. Pure inference — the regulation creates assessment requirements, the infrastructure doesn't exist yet, therefore bottleneck. Six months later, the European Association of Medical Devices Notified Bodies came out publicly saying this exact shortage "could massively hinder" AI regulation. A markdown file on my laptop called it before the industry did.
The Dog found that the Act drifted from its original purpose — it started as a product-safety regulation in 2021 and ended up as a fundamental-rights regulation in 2024, but kept the product-safety tools. You can certify that a toy is safe. You can't "certify" that an AI doesn't discriminate.
That same finding was independently published in the Common Market Law Review — one of the top EU law journals — under the title "The EU AI Act: Between the rock of product safety and the hard place of fundamental rights." Legal scholars did traditional academic research. The Dog used a markdown file. Same conclusion.
## "Can't you just write a good prompt?"
Fair question. I tested that too.
Same AI model. Same input document (a policy framework released two days earlier, so neither run could be working from memorized analysis). Same request: give me a rigorous critical analysis.
The "vanilla" prompt was solid — not some lazy one-liner. It was a detailed system prompt telling the AI to be a senior policy analyst, be thorough, be critical, don't hedge.
Both runs got the big picture right. Both identified the major structural problems.
But the zodiac found 10 things the vanilla missed. The best one: the Monkey ran a "Delete Probe" — remove 4 of the framework's 6 main pillars. What changes about its real-world impact? Nothing. That's a falsifiable claim. You can check it yourself.
The vanilla found 5 things the zodiac missed too (I'm being honest about this — it's in the full comparison). But its findings were essays. The zodiac's were diagnostic tests with confidence scores attached. When the Monkey says confidence 82 on one finding and 55 on another, you know which one to act on first.
## What surprised me building this
The character stuff barely matters. I spent weeks on the Monkey's personality in early versions. Didn't change the output at all. What changed the output was the operational rules — "distrust claims of absence," "cite the specific source for confidence above 80." The character makes it more fun to read. The rules make it more correct.
Named techniques are the whole game. Without them, every animal found the same 2-3 things regardless of what I pointed them at. The moment I gave the Monkey 9 named techniques (Assumption Flip, Hostile Input, Delete Probe...) and said "don't repeat," it started finding things I hadn't thought of. The technique names force the model onto unfamiliar ground where its default patterns don't apply.
The anti-fabrication rule saved the project. The darktable incident above is the short version: three confident findings, all wrong, because the Monkey never looked. Forcing every animal to say "I looked in X and didn't find Y" instead of "Y doesn't exist" is what dropped false claims to basically zero.
Forcing positive findings was the non-obvious insight. When you ask an AI to critique something, it piles on. Every finding is negative because it thinks that's the job. The binary verdict (Survived: yes is a valid answer) forces the animal to genuinely look for what works. Some of the best insights came from understanding why something survived — not just what's broken.
## Want to try it?
`npx zodiac-skills`
That installs them as skills for Claude Code, Codex, or both. Then just type `/monkey` to stress-test something, `/tiger` to attack a solution, or `/rabbit` for the full multi-animal treatment.
Everything is on GitHub — the 12 animals, 11 showcases (EU AI Act, Supabase architecture, Tesla/Apple earnings, pitch decks, a head-to-head vs vanilla prompting), and all the verification data:
https://github.com/sidtheone/zodiac-skills
Every animal is a single markdown file you can open and read. The constraints are transparent. If you think the Monkey should distrust something else, edit the file. That's the whole idea — the quality lives in the constraints, and the constraints are yours to change.
There's also a card-style overview of all 12 animals: https://sidtheone.github.io/zodiac-skills/cards.html
MIT licensed. Works with Claude Code, Codex, Cursor, Gemini CLI, or anything that reads SKILL.md files.