GnomeMan4201

Posted on May 12

Prompting Is Not Magic. It Is Control.

#ai #promptengineering #security #llm

Anti-prompts that make model failure visible

Most prompt books optimize for better answers. I wanted prompts that fail visibly.

Most prompt collections are fine if all you need is a nicer answer. They save time. If you've never thought about how to ask an AI to reformat a table or draft a meeting summary, someone compiled a list and you can paste from it. Useful. Fine.

But that is not the problem I was trying to solve.

I Got Tired of Prompt Dumps

The problem I had was this: I was using LLMs for serious work. Building tools that run in production. Running security research. Publishing findings I have to defend. The AI-assisted parts of that workflow needed to hold up — not just in demos, not just on clean inputs, but under pressure.

Messy inputs. Adversarial conditions. Situations where a confident-sounding wrong answer is worse than no answer. Systems where someone might actively try to manipulate what the model does.

A prompt dump gives you words to paste. A field manual gives you a way to test what comes back.

That distinction is the whole thing. Most prompt collections stop at the first part. I needed the second.

So I wrote The GNOME Prompt Field Manual: Prompts That Survive Pressure — and this article is an introduction to how it thinks, not a preview of everything in it.

A Prompt Is a Control Surface

Here's where most of the discussion about prompts goes wrong: it treats a prompt as a request. You ask something, you get something back, you judge whether it sounds good.

A prompt is more than that.

A prompt is a control surface. It shapes what the model sees. What it ignores. What kind of answer is structurally allowed by the framing. What failure looks like — and whether you'll recognize failure when it happens. Whether the output can be tested, cited, committed, or published.

When you understand a prompt that way, three things follow.

First: a prompt is a thinking tool. The framing of a question determines what kinds of answers are possible. A prompt that asks for a steelman produces different cognitive output than one that asks for a critique — not because the model changed, but because the operation changed. Getting this right matters.

Second: a prompt is an attack surface. Anywhere a prompt accepts external input, it can be manipulated. Prompt injection is a real attack class. RAG poisoning is a documented threat. Evaluator capture — where a model used as a judge gets gamed into inflating scores — is a real vulnerability in AI evaluation pipelines. A prompt that doesn't account for adversarial use is not production-ready.

Third: a prompt is a quality filter. A well-constructed prompt raises the floor of what you'll accept as output. If you can't specify what failure looks like, you won't catch it when it happens. Some of the most useful prompts in the manual are designed specifically to detect failure — in the output of other prompts.

Most collections only address the first of these.

The Admission Filter

Every prompt in the manual was tested against ten inclusion criteria before it was allowed in. A prompt had to do at least one of the following:

Reveal a hidden assumption
Convert failure into a test
Separate signal from noise
Harden an idea against attack
Protect a system from bad inputs
Turn messy work into a reusable artifact
Expose a trust boundary
Improve decision quality under uncertainty
Help a builder ship safer or cleaner
Produce output that can be reused, tested, cited, published, or committed

"It's useful" is not on the list. "It's interesting" is not on the list.

A prompt that just gets you a better answer to a question you already knew how to ask is a fine prompt. It belongs somewhere else.

This filter eliminated a lot of prompts that felt good. The first draft had 22 prompts. Every one of them passed the criteria when checked in isolation. But "passes at least one criterion" turned out to be a floor, not a standard. A prompt can technically reveal a hidden assumption while being indistinguishable from every other "list your assumptions" prompt in every other AI book.

I ran a written audit. Each prompt got a verdict: KEEP, UPGRADE, CUT, or MERGE — with specific documented reasons. Not vibes. Specific deficiencies, specific fixes. If the fix wasn't visible in the revised version, the upgrade didn't happen.

A prompt had to do serious work before it earned a slot.

Example: The Idea Stress-Test

Here's one of the core entries — not the full manual text, just enough to show how the framework works in practice.

The Idea Stress-Test is a six-lens pressure test. Each lens is a distinct attack angle, designed to assault the idea from a non-overlapping position. The output is a structured report ending in a single weakest-link verdict and a minimum test requirement.

Lens 1 — Assumption Lens
What are the load-bearing assumptions this idea requires to be true? Which assumption is most likely false? Which is least verifiable?

Lens 2 — Adversarial User Lens
Who would use this idea in ways it was not intended? What would a motivated bad actor, competitor, or non-compliant user do with it, to it, or against it? How does that use break the idea's core value?

Lens 3 — Historical Analog Lens
What similar idea has been tried before? What happened? Where did it succeed and fail? What does the analog predict about this idea's failure mode?

Lens 4 — Incentive Lens
Who benefits if this succeeds? Who is harmed or threatened by it? Where are the misaligned incentives that will generate resistance, gaming, or sabotage?

Lens 5 — Failure Cascade Lens
If this idea fails, what fails because of it? Map the downstream collapse. What is the maximum realistic blast radius?

Lens 6 — Weakest Link Lens
Given the five lenses above: state one falsifiable condition — "This idea fails if [specific condition]." Why this link and not others?

A short version of the prompt:

Run a six-lens stress test on the following idea.

Each lens is a distinct attack angle. Do not merge lenses or 
treat one as a variant of another.

For each lens:
- state the attack angle
- produce a specific finding, not a generic risk
- rate severity HIGH / MEDIUM / LOW
- recommend one action

After all six lenses:
- identify the single weakest link as a falsifiable condition
- explain why this link matters more than the others
- state the minimum test that determines whether it holds
- give a verdict: PROCEED / REVISE / ABANDON

Idea: [IDEA]

The key instruction buried in the full entry: every finding must be specific to the idea under test. If you can swap the idea out and the findings still read as true, the prompt failed. Rerun with: "No finding should be applicable to a different idea without modification."

That's what I mean by a control surface. The prompt isn't just asking for analysis. It's structuring what kind of analysis is allowed, what counts as failure, and what the output needs to demonstrate before it's worth using.

Why Failure Modes Are Not Optional

Every serious prompt in the manual includes a documented failure mode. Not as a caveat. Not as a footnote. As part of the entry.

A prompt that cannot tell you how it fails is not reliable enough for serious work.

For the Idea Stress-Test, the documented failure mode is specific: the model will list generic risks instead of running the actual lenses. Signs of failure include findings that could apply to any idea — "the assumptions may not hold," "competitors could react negatively." If the adversarial user lens and the incentive lens produce findings that overlap substantially, the lenses weren't actually distinct. The failure mode tells you exactly what to look for and what to do about it.

For the Dense-to-Clear rewrite prompt, the documented failure is different: the model produces a fluent rewrite that loses technical force, compresses meaning, or weakens claims — and it does this smoothly enough that the loss isn't obvious unless you check. The rewrite is easier to read and less accurate. That's why the prompt mandates a rule-application log: a written record of every change made and why. Without the log, there is no accountability.

The failure mode is how you test the prompt. Without it, you're running a process you can't verify.

Anti-Prompts

The manual contains twenty anti-prompts. These are not regular prompts.

Anti-prompts are diagnostic tools run on the output of other prompts. You use them when you suspect a model has given you something that looks right but isn't. They are not about generating better output. They are about catching bad output before it becomes a decision, a deployment, or a published claim.

A few examples of what they detect:

Over-Smoothing Detector — Catches the failure where an AI rewrite has averaged away the specificity, friction, and technical precision that made the original worth using. Checks for specificity loss, tone flattening, claim weakening, missing content. Returns a verdict: PRESERVED / SLIGHTLY SMOOTHED / SIGNIFICANTLY SMOOTHED / GUTTED.

Confidence Laundering Probe — Detects the specific technique where uncertain or weak evidence is made to appear strong through structure, repetition, or rhetorical framing — without any actual improvement in the underlying evidence. Six named techniques: citation laundering, consensus laundering, repetition-as-evidence, precision-as-confidence, structure-as-authority, appeal to publication.

Sycophancy Tripwire — Probes whether a model is agreeing with your framing rather than evaluating it. Sycophancy failure is specific: the model detected what you wanted to hear and gave you that instead of what you asked for. The tripwire surfaces it.

Injection Residue Check — Checks whether prompt injection residue is still observable in model outputs after a structural refactor. Structural separation alone doesn't block semantic injection. This is the follow-on instrument after you've done the architectural work.

Hallucinated Structure Detector — Audits AI output for organization and structure that sounds rigorous but wasn't present in the source material and doesn't hold up when traced.

The point of an anti-prompt is not to distrust every output. It's to have a test you can run when something feels off — and to run it before you ship, publish, or commit based on the output.

The point is not better-sounding output. The point is output that can survive pressure.

Why the Manual Is Organized by Operational Trigger

The manual is not organized by model capability, topic area, or difficulty level.

It's organized by what you need to do right now.

The sections are: Idea Hardening. Dense-to-Clear Without Weakening. Turning Raw Work Into Evidence. Build, Break, Observe. Prompts That Survive Attack. Publishing Serious Artifacts. Then the chains. Then the anti-prompts. Then the field cards.

I don't care which model you're using. I care whether the output can be tested, cited, committed, published, or safely used. Those are operational questions, not capability questions. The right entry point to the manual is: what problem just surfaced, and what needs to hold up?

The security section — prompt injection recognition, RAG poisoning audits, evaluator capture scanning, command/data separation, trust boundary mapping, tool-agent offense probing — exists because those are real failure modes in real deployed systems, not theoretical edge cases. If you're building anything that processes external inputs with an LLM, these prompts belong in your workflow. Not as theory. As instruments.

The field cards in the last section are for live use. When you're in the middle of something and don't have time to read a full entry, the field card gives you the prompt and the failure mode. That's all you need under pressure.

That's what makes it a field manual.

Closing

There are plenty of ways to make AI produce better-sounding output. Better framing, better context, better format instructions. That work is real.

But "sounds better" and "can be trusted" are different things. The difference shows up when inputs are bad. When someone is trying to break the system. When you need to cite what came back. When a confident-sounding wrong answer has real consequences.

The prompts in this manual were selected because they work under conditions that generic prompts don't. Because they fail visibly when they fail. Because they produce work you can actually test.

Prompts should not just make AI sound smarter. They should make failure harder to hide.

Top comments (36)

Mykola Kondratiuk • May 16

I'd push back on 'control surface' here. control surfaces imply stable I/O. prompts don't have that - model updates and temperature shift outputs on identical inputs. visible failure is a good goal, but 'control surface' implies predictability you haven't actually earned.

GnomeMan4201 • May 16 • Edited

Actually, I fully agree and had not fully thought of it that way. I appreciate you providing that perspective.

You are right that “control surface” implies a level of stable I/O that prompts have not really earned. I think what I was reaching for is more the exposed layer where intent, constraints, product logic, policy, and model behavior all collide.

Is there already a better term or concept for that? Something that describes the place where we try to apply control, even though the system underneath is probabilistic, shifting, and not fully predictable?

Mykola Kondratiuk • May 16

yeah, "configuration layer" or "directive plane" might be closer - captures the intent-setting role without implying stability you don't have.

GnomeMan4201 • May 19

Appreciate that

Mykola Kondratiuk • May 19

yeah, naming these layers correctly matters more than it seems - once you call it a policy plane vs a config file, the whole conversation about ownership and versioning changes

Ken W Alger • May 19

This is a refreshing, high-signal sanity check. Treating a prompt as a deterministic control surface rather than a magical text incantation is exactly where the dividing line sits between sandbox toy builders and production systems engineers.

Your point about a prompt being an active attack surface is incredibly understated in the current ecosystem. Most product teams treat user input strings like pure data payloads, forgetting that in an LLM-native architecture, data is code.

When you accept an external input into a prompt boundary without designing it as a strict, pressure-tested control interface, you aren't just running a query—you are opening a raw execution shell to the open web. It's why we're seeing indirect prompt injections easily trigger everything from OAuth fraud to memory-corruption loops in autonomous agents.

The way I’ve been approaching this on the infrastructure side mirrors your 'Field Manual' philosophy exactly: we have to move away from treating prompts as hidden, volatile variables inside our codebases. They need to be treated as formal, version-controlled Policy Decision Points (PDPs).

If a prompt doesn't explicitly mandate a structured failure-mode log, an invariant output schema, and an active linkage-risk check for incoming telemetry, it isn't production-ready infrastructure; it's a security liability waiting to happen.

Phenomenal write-up. It's rare to see someone mapping out prompt construction through the cold lens of adversarial pressure and runtime drift instead of generic 'use these power words' filler. Keeping a close eye on your SHENRON work as well.

GnomeMan4201 • May 19

Appreciate that. The PDP framing is exactly right and it exposes the core problem: we've been treating prompts like environment variables when they're closer to policy contracts. Version-controlled, schema-validated, pressure-tested against adversarial input before they touch production. Anything less is just hoping the model behaves.

The indirect injection angle is something I've been building into SHENRON specifically — synthetic pressure patterns that probe whether a prompt boundary actually holds or just looks like it does. Most don't hold. They deflect.

Interesting that we named it 'engineering' before we built the constraints that make something an engineering discipline. Makes you wonder what else we've labeled prematurely.

Ken W Alger • May 19

“Prompts as policy contracts” is the exact phrase the industry needs right now. It shifts the entire mental model from volatile software configuration to hard, structural compliance and boundary enforcement.

Your point about labeling this 'engineering' prematurely hits the nail on the head. Real engineering is defined by its constraints, its failure tolerances, and its deterministic boundaries. Until we treat prompt architecture with the same adversarial rigor we apply to network firewalls or database schemas, it’s just glorified script-kidding.

The concept of synthetic pressure patterns in SHENRON for testing boundary deflection is a major step toward actually building those constraints. If a boundary can be bent via semantic drift or indirect payloads, it isn't a boundary—it's a suggestion.

Phenomenal perspective, GnomeMan4201. This dialogue has been a masterclass. Looking forward to watching SHENRON push the discipline toward actual engineering

CapeStart • May 15

“Magic spell” thinking is why so many AI workflows break. The prompt matters, but it’s only one lever.

HARD IN SOFT OUT • May 13

Could we visualise the prompt as a set of sliders — temperature, role, style, constraint weight — and let non‑technical users manipulate the control surface without writing a single word? That would democratise the “operator” role.

I agreed with this immediately. I’ve watched colleagues treat prompts like incantations and then get frustrated when the AI “misbehaves”. Your framing as a control surface is exactly the mental model shift that’s missing in most prompt‑engineering guides.

When the control surface mixes natural language with structured parameters (like JSON schema constraints), it becomes ambiguous where to tune. Have you found a reliable way to separate “soft intent” from “hard constraints” in your prompts, or is it always a messy overlap?

GnomeMan4201 • May 13

In my head, “soft intent” and “hard constraints” feel separate. But once they hit the model, they blur together fast. Something I meant as a preference can start acting like a rule, and something I meant as a hard boundary can get treated like a suggestion.

HARD IN SOFT OUT • May 13

That blur you described — intent becoming rule, boundary becoming suggestion — is exactly the problem sliders could surface. Not by solving it, but by showing you what the model actually received, not what you thought you sent.

Picture a simple feedback column next to each slider after a test run: "Treated as hard constraint" vs "Treated as weak preference." That exposes the gap you just named, and lets you correct it before shipping the prompt to real users.

Sliders aren't the fix. They're the mirror. And that mirror might be what makes prompt debugging finally teachable.

GnomeMan4201 • May 13

I think sliders are the right mental direction. Not because they solve everything, but because they force the hidden parts of prompting into the open.

Vic Chen • May 13

This is a strong framing. I like the shift from treating prompts as mystical incantations to treating them as interface design and control logic. In practice, a lot of product teams over-index on prompt wording when the bigger leverage is usually structure: context boundaries, constraints, memory, and evaluation loops. That lens feels much more useful for people building serious AI systems.

GnomeMan4201 • May 14

Thanks for the framing compliment.

this could just be me but one of the biggest challenges in this space is thinking toward things before knowing the established terms for them. A lot of the work becomes feeling around the boundary between prompt design, product logic, and AI governance while building language for what is happening in practice.

Vic Chen • May 15

Yeah, that is a real bottleneck. A lot of the useful product language shows up after teams hit the same failure mode a few times and finally need a name for it. Before that, it is half prompt craft, half system design, and half governance hygiene all mashed together.

I think that is also why postmortems matter so much in agent work. They turn fuzzy instinct into reusable vocabulary. Once a team can name the boundary it is crossing, they can usually design for it instead of rediscovering it every sprint.

GnomeMan4201 • May 15

That is also why I have made it a routine to keep making artifacts around this space… writing, diagrams, sketches, and even drawing stupid comics as both an artistic release and another layer of thinking.

It helps me conceptually. Sometimes the language comes after the artifact. The drawing or diagram lets me see the shape of the idea before I fully know what to call it.

Everything around AI is moving so fast that I already feel two steps behind the pack. Sometimes you almost have to isolate the information coming in, not to shut it out, but to give your imagination enough space to run before the official language catches up. Maybe that is how you find a perspective or angle not everyone has thought about yet, and that gives you one more foothold in the ocean of AI capabilities.

Vic Chen • May 15

I like that framing a lot. In practice, the artifact often becomes the first stable interface for thinking. Once a sketch or diagram exists, people can finally argue about structure instead of vague intuitions.

I see the same thing when parsing 13F data or building agent workflows. The useful language usually appears after a few real collisions with edge cases, not before. Writing things down early is less about polish and more about giving the idea somewhere to survive long enough to evolve.

Vic Chen • May 14

Yeah, that naming gap is real. A lot of the useful work starts as pattern recognition before the vocabulary catches up. I have found failure modes are often what force the language into existence. Once a prompt bug turns into a product bug or a governance bug, the boundary gets much easier to describe. It would be interesting to see more people document those boundary cases directly instead of waiting for cleaner theory.

buildbasekit • May 15

This is one of the few AI posts that actually thinks in failure modes instead of “10 prompts to get better answers.”

The anti-prompt idea is especially strong.

Most people optimize for prettier output.
Very few optimize for catching confident nonsense before it reaches production.

“Prompt is a control surface” is a solid framing.

One thing I’d push further though: model behavior changes fast, so some prompt patterns decay quickly. The lasting value is less the exact prompts and more the evaluation mindset behind them.

Good read.

GnomeMan4201 • May 19

I really agree with that last part. The prompt list is less important than the mindset behind it.

That is one reason I write about what I am working on. Hopefully it helps people look at AI, coding, and cybersecurity from a different angle. The tools move fast, but the way we learn to evaluate them is what lasts

GnomeMan4201 • May 12

This post is the framing layer for the manual.

The part I’m most interested in building out publicly is the anti-prompt side: prompts that test whether another prompt failed.

If there’s interest, I’ll write the next post around one concrete anti-prompt probably Over-Smoothing Detector or Confidence Laundering Probe and show it running against a real AI-generated output.

Siyu • May 13 • Edited

In agent development we obsess over the happy path and treat failure as something to patch later. Your argument that a prompt is not reliable unless it can describe exactly how it fails flipped a switch for me. I have been auditing my agent prompts against this standard ever since.
When multiple prompts chain together in an agent architecture, failure modes compound in ways no single prompt test can catch. Have you explored how this control surface approach scales to system level failure modes that only emerge from interactions between prompts rather than from any individual prompt? I suspect the anti-prompt concept could be extended into a kind of integration test for prompt pipelines but I am curious whether you have already experimented with that direction.

GnomeMan4201 • May 13

Yeah, the system-level failure mode problem is real and it’s where most agent auditing frameworks quietly fall apart. Individual prompt reliability doesn’t compose — a chain of 90% reliable prompts doesn’t give you a 90% reliable pipeline.
I haven’t built a full integration test harness for prompt pipelines yet, but the direction I’ve been moving is treating the handoff contract between prompts as the actual unit under test. Not “does prompt A behave correctly” but “does the output of A stay within the input assumptions of B under adversarial or degraded conditions.” That’s where the anti-prompt idea extends naturally — you’re not just probing a single prompt’s failure envelope, you’re probing whether the downstream prompt inherits upstream failure modes or amplifies them.
The compound failure issue gets especially ugly when one prompt’s edge case output becomes another prompt’s normal-case input and neither prompt flags it. That’s the silent corruption problem. I suspect the integration test version of anti-prompts needs a few things single-prompt testing doesn’t: shared context mutation tracking, an explicit failure propagation model, and probably adversarial injection at the seam points rather than just at the entry prompt.
Haven’t published anything on that yet but it’s an active thread. If you end up prototyping something I’d genuinely want to see it.

Ievgen Bondarenko • May 24

The "attack surface" framing lands. From the serving side I keep seeing this come back as a layered problem: the prompt is one attack surface, but the harness the prompt runs in is another, and the boundary where the harness fetches external content (RAG store, tool result, user-supplied URL) is where most real-world exploitation has been. Lmdeploy VL's multimodal endpoint and trust_remote_code class of CVEs in LLM serving were prompts where the attacker controlled what the model SAW, not what the user asked. The "control surface" mental model holds, just expands: the model sees a sum of (user prompt, system prompt, tool outputs, retrieved docs, multimodal inputs), and any of those can be the control-surface attacker.

Interested in what the field manual says about contract testing for prompts that take external input.

GnomeMan4201 • May 24

This is exactly the expansion needed and you named the precise failure mode: the attacker doesn't touch the user prompt at all, they poison what the model sees. The Lmdeploy VL / trust_remote_code class is a perfect example because the exploit surface wasn't the inference logic, it was the content pipeline upstream of it.

The mental model I use in the field manual is "control surface = everything the model treats as authoritative input at inference time." That includes tool outputs, retrieved chunks, multimodal payloads, even structured schema hints — all of it lands in the same context window the model reasons from, and most serving architectures don't apply differential trust to any of it.

On contract testing for prompts that take external input: the chapter covers what I call invariant contracts assertions that should hold regardless of what external content gets injected. Think of it less like unit testing the prompt and more like fuzzing the input boundary. The tests aren't "does this prompt return the right answer" but "does this prompt maintain its behavioral envelope when I stuff the RAG slot with adversarial content?"

The three invariant categories I test against:

Role persistence — does the model stay in its assigned role when retrieved docs contain conflicting persona instructions?
Instruction dominance — does the system prompt's constraint survive a tool result that explicitly tries to override it?
Leakage gates — does external content get regurgitated verbatim in ways that could exfiltrate system prompt material?

You're right that the harness boundary is where most real exploitation lands. The prompt is almost a distraction at that point...it's the trust model of the serving layer that's the actual vulnerability class.

Andy Stewart • May 13

Deeply relate to this. In production, "sounding good" is a disaster; "control" is the engineering baseline. Those Anti-Prompts that make failure visible are exactly what's needed to take AI from a demo to real-world deployment. Looking forward to the test cases for the Confidence Laundering Probe.

GnomeMan4201 • May 13

I ran the Confidence Laundering Probe on my own draft planning document for the manual. It returned SIGNIFICANT LAUNDERING at 5/6 techniques detected.

That result is correct. Here's what happened and why it matters.

The planning document used "Part 0 audit: KEEP" labels throughout decision markers that looked like verified judgments. The probe caught that those labels were functioning as authority without showing the evidence behind them. Specifically:

CITATION LAUNDERING — PRESENT
The document repeatedly invoked "Part 0 audit" as proof that entries earned their slot, without embedding the audit criteria or scoring method. "Part 0 audit: KEEP institutional interest mapping is rarely included" sounds verified. The audit isn't shown. That's citation laundering.

CONSENSUS LAUNDERING — PRESENT
"Institutional interest mapping is rarely included" implies broad comparative knowledge of the prompt book landscape without naming the comparison set. "'Simplify my writing' is the most common LLM request" is a popularity claim with no attribution. Both convert assertions into apparent consensus.

REPETITION AS EVIDENCE — PRESENT
KEEP appears across inventory rows, caution cards, and per-entry briefs. "Part 0 audit: KEEP," "The Part 0 audit confirmed KEEP," "Part 0 audit: KEEP with specific upgrade." Repetition of the label creates cumulative authority. The supporting evidence doesn't appear once.

PRECISION AS CONFIDENCE — PRESENT
"22 pending entries · 10 batches · drafting authority" and "Production state: 70 drafted · 0 restore · 22 pending" present exact counts with the confidence of verified state. The document itself admits a discrepancy: the actual pending count differs from 22, preserved only until a renumbering pass is made. The precision is stronger than the underlying state.

STRUCTURE AS AUTHORITY — PRESENT
Formal apparatus Section headers, Recommended Drafting Order, Batch Grouping Logic, Special Caution Entries, KEEP/UPGRADE/RESOLVE badges makes planning judgments look verified. Entries labeled "New entry from v3 master selection" carried no prior audit basis at all.

APPEAL TO PUBLICATION — ABSENT
Clean on this one. Safety language about publication consequences appeared but wasn't used as a correctness proxy.

Verdict: SIGNIFICANT LAUNDERING

Most damaging instance: the entire document's decision logic depended on "Part 0 audit: KEEP" as an authority marker, while the audit criteria, scoring method, and counterarguments were never shown inside the output.

The planning document and the manual are different things. The manual itself is structurally clean the W-06 entry defines each technique precisely, includes a failure mode, and doesn't overclaim. The probe would return a different result on the manual.

But the planning document laundered. The probe caught it correctly.

This is the result I wanted. A probe that clears your own work when you expect it to fail isn't a probe it's a mirror that only shows you what you want to see. The point of building failure visibility into the tooling is that it has to be willing to fire on the builder.

It fired. The finding stands.

Andy Stewart • May 13

This self-test is rock-solid. Logic must be traceable, not forgeable.

The fact that the probe dared to "fire" on its own developer proves it isn’t just a flattering mirror, but a genuine logical auditor. Catching this "laundry" of authority red-handed is exactly the technical certainty needed.

Disregard status, focus on evidence. This tool is the real deal.

GnomeMan4201 • May 13

A lot of this started as me writing alongside my own projects, mostly so I could understand what I was actually building. Once you start engineering AI behavior and looking seriously at “prompt space,” it gets strange fast. You end up writing prompts about prompts, tests for prompts, probes for tests, and then explanations of why the probe judged the test the way it did. After a while it feels like Inception, but with instruction layers instead of dream layers.

Writing it out is how I slow it down enough to figure out which layer I’m actually standing on.

View full discussion (36 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.