Instead of writing rules agents ignore, we describe the behaviors we want. A look at how Gherkin scenarios, message envelopes, and structured output formats work together to make AI agents reliably do the right thing.
When we first started building more complex prompts, we brought a programming mindset to prompting: we wrote instructions, added more rules when something broke, and ended up with prompts that read like policy documents. After several iterations of watching agents ignore nuanced rules in favor of fluent-sounding output, we started looking for a different approach.
We shifted to a behavioral science mindset. Instead of specifying what the agent must do, we describe the context it operates in, the outcome we want, and concrete examples of what success and failure look like. It is harder to design — each scenario requires thinking through not just the happy path but the edge cases and failure modes. But in production, it is much more reliable in our experience.
This post covers three specific techniques we use: Gherkin-style prompt structure, the message envelope DSL, and a structured completion signal format. None of these are original. But together, they give agents a much cleaner operating contract in our experience.
The problem with rules-based prompting
When an agent misbehaves, the instinct is to add a rule. "Do not X." "Always do Y before Z." After a few iterations of this, you end up with something like:
You are an AI assistant. You must read files before editing them. You must not create a new file when revising. You must not signal completion if tests are failing. You must always include the file_id in your response. You must not post the file content as a comment instead of saving it…
The model reads this, acknowledges it, and then does whatever the rest of its training predicts. Long rule lists dilute attention. The model does not treat them as hard constraints — it treats them as context. If the rule conflicts with what feels fluent given everything else in the prompt, fluency wins.
The behavioral science approach reframes the problem: instead of listing rules, you describe a situation, a trigger, and an expected behavior. That maps much more naturally onto how large models were trained.
Gherkin: specifying behavior without specifying instructions
Gherkin is a plain-English format from behavior-driven development. Each scenario has a Given/When/Then structure: the precondition, the trigger, and the expected outcome.
We adapted it to build agent prompts. Here is an example of the kind of Gherkin prompt we mean:
Feature: Plan Narration
  As a user, I want to hear the plan before work begins so I can follow along.

  Rule: Narrate the plan after receipt, before the first tool call

    Scenario: User sends a non-trivial prompt requiring tool calls
      Given I have spoken the receipt confirmation
      When I am about to make my first tool call
      Then I call speak() a second time with my intended approach
      And the narration is two to four short spoken sentences maximum
      And I use plain language with no bullet points, no markdown, and no unexplained jargon
      And I state what I will do and why, in the order I will do it
      And I end with a natural transition phrase

      Examples of good transition phrases:
        | "Let's go."     |
        | "Here we go."   |
        | "Starting now." |

      Example of a good plan narration:
        | "I'm going to read the agent runner file first to understand the current flow, |
        | then add the new tool to the schema, and finally hook it up in the executor.   |
        | Should only take a moment. Here we go."                                        |
Several things are worth noticing here.
It is actually Gherkin. The structure starts with a Feature, adds a Rule, and then defines a concrete Scenario. That matters because the prompt is specifying behavior in the same shape Gherkin is designed for, rather than borrowing only the surface style of Given/When/Then.
Context comes first. The Given line establishes the exact state the agent is already in: receipt has been confirmed, and the next boundary is the first tool call. This is not just flavor text — it narrows the behavior to a specific moment instead of leaving it as a vague instruction.
The trigger is singular. "When I am about to make my first tool call" identifies one decision point. The agent does not have to infer when narration should happen.
The expected behavior is concrete. The Then lines define the constraints that matter: short narration, plain language, ordered explanation, and a natural handoff into action. The examples at the end do the rest of the work. They show what "good" looks like in a form the model can imitate directly.
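The scenario-first structure also lends itself to programmatic assembly. As a rough sketch (the `Scenario` dataclass and `render` function here are illustrative, not ProjectBrain's actual prompt builder), scenarios can be kept as data and rendered into the prompt text:

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One Given/When/Then behavioral contract."""
    title: str
    given: list[str]
    when: str
    then: list[str]


def render(scenario: Scenario) -> str:
    """Render a scenario in Gherkin form for inclusion in a prompt."""
    lines = [f"Scenario: {scenario.title}"]
    lines += [f"  Given {g}" for g in scenario.given]
    lines.append(f"  When {scenario.when}")
    # First expected behavior is a Then; the rest chain with And.
    lines += [f"  {'Then' if i == 0 else 'And'} {t}"
              for i, t in enumerate(scenario.then)]
    return "\n".join(lines)


narration = Scenario(
    title="User sends a non-trivial prompt requiring tool calls",
    given=["I have spoken the receipt confirmation"],
    when="I am about to make my first tool call",
    then=[
        "I call speak() a second time with my intended approach",
        "the narration is two to four short spoken sentences maximum",
    ],
)
print(render(narration))
```

Keeping scenarios as data rather than prose makes it straightforward to share one prompt builder across agents while swapping in role-specific scenarios.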
The message envelope: separating signal from prose
The other half of the problem is output. You can get agents to behave well 80% of the time, but the remaining 20% shows up as verbose non-sequiturs, duplicated content, and format drift. When you are parsing agent output programmatically, format drift is a real cost.
ProjectBrain uses a message envelope DSL for agent-to-system communication. The format looks like an email header — structured key-value pairs above a --- separator, with optional free prose below:
ACTION: approve
COMMENT: Added cursor pagination to /facts endpoint; all 43 tests pass.
FILE_ID: 3a9f2c1d-...
---
The implementation uses keyset pagination on (created_at, id) to ensure
stable ordering under concurrent inserts. I chose this over pure cursor
because the facts table is insert-heavy...
The runner reads only what is above the ---. Everything after it is ignored for routing purposes. This means agents can write as much explanatory prose as they want — it costs nothing and is discarded cleanly. The structured part stays small and parseable.
We use this same format in both directions. When the runner dispatches tasks to agents, it sends structured preamble above ---. When agents respond, they use the same format. The symmetry is intentional: it is easier to teach a model to produce a format it already sees being consumed.
The runner extracts the envelope with a simple regex anchored to line starts:
import re

_ENVELOPE_RE = re.compile(
    r"^(ACTION|COMMENT|FILE_ID|PR_URL):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
Why not JSON?
We started with JSON as the completion signal format:
{
  "action": "approve",
  "comment": "Added cursor pagination; all tests pass.",
  "file_id": "3a9f2c1d-..."
}
This worked, but had two problems. First, models can produce malformed JSON — unescaped quotes in the comment, trailing commas, or a JSON block that got split across a markdown code fence. Second, agents tended to repeat the comment content both in their prose and in the JSON object, creating verbose duplication.
The envelope format is more natural for models to produce because it looks like structured text, not a programming construct. Line-start anchoring (^) means indented examples in the reasoning cannot contaminate the actual signal. And the --- separator creates a clear moment of transition: everything above is machine-readable, everything below is human-readable. Models navigate that transition naturally.
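A quick sketch makes the anchoring point concrete: an envelope line quoted inside indented reasoning never matches, while a line-start occurrence does. (The sample agent output here is invented for the demonstration.)

```python
import re

ENVELOPE_RE = re.compile(
    r"^(ACTION|COMMENT|FILE_ID|PR_URL):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

# An agent quoting the format inside indented reasoning does not
# contaminate the signal: ^ with re.MULTILINE only matches at
# true line starts, never after leading whitespace.
output = (
    "Here is how I would normally respond:\n"
    "    ACTION: reject\n"   # indented example: ignored
    "But my actual verdict follows.\n"
    "ACTION: approve\n"      # at line start: matched
)
matches = ENVELOPE_RE.findall(output)
print(matches)  # only the un-indented ACTION line is captured
```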
Evaluator agents: the same structure, different role
When ProjectBrain routes a task to a reviewer, the same prompt builder is used — but the behavioral contract changes:
Given I am a senior software engineer operating via the ProjectBrain runner
And the submission to evaluate is:
"""
[content of the submitted draft]
"""
When I evaluate the submission
Then I evaluate against the rubric criteria below
And I provide specific, actionable findings in my verdict
And I end my response with a completion envelope (see Completion Signal)
And the scenarios shift accordingly:
Scenario: submission meets all criteria
Given all rubric criteria are satisfied
When I evaluate
Then I end with:
ACTION: approve
COMMENT: Clear summary of why it passes.
Scenario: submission needs targeted changes
Given specific, fixable issues were found
When I evaluate
Then I end with:
ACTION: request_changes
COMMENT: • Issue 1\n• Issue 2\n• What the author should do next.
Scenario: issues require human judgement
Given the submission has fundamental problems beyond targeted fixes
When I evaluate
Then I end with:
ACTION: escalate
COMMENT: Reason: [what human review is needed]
The rubric — what to evaluate against — is also injected into the prompt by the platform, not hardcoded into the agent. This means a single evaluator agent can apply different quality bars for different workflows without needing separate deployments.
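As a sketch of what that injection could look like (`build_evaluator_prompt` and the rubric items are hypothetical, not ProjectBrain's real API), the rubric arrives as data and lands below the behavioral contract:

```python
def build_evaluator_prompt(submission: str, rubric: list[str]) -> str:
    """Assemble the evaluator contract with a platform-injected rubric."""
    rubric_block = "\n".join(f"- {criterion}" for criterion in rubric)
    return (
        "Given I am a senior software engineer operating via the ProjectBrain runner\n"
        "And the submission to evaluate is:\n"
        '"""\n'
        f"{submission}\n"
        '"""\n'
        "When I evaluate the submission\n"
        "Then I evaluate against the rubric criteria below\n\n"
        f"Rubric:\n{rubric_block}\n"
    )


# Hypothetical rubric items for illustration only.
prompt = build_evaluator_prompt(
    "def add(a, b): return a + b",
    ["All new endpoints are paginated", "Tests cover the failure path"],
)
print(prompt)
```

Swapping the rubric list per workflow is what lets one evaluator deployment apply different quality bars.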
Behavioral science vs. rules: the practical difference
When you write rules, you are trying to enumerate failure modes in advance. The list is always incomplete. Models find the gaps — not through adversarial intent, but because they optimize for fluent, plausible output, and fluent output does not always respect unstated constraints.
When you write behavioral scenarios, you are doing something closer to training by example. You are showing the model a situation, a decision point, and an outcome. Models generalize from examples much better than they comply with rule lists. And when you include explicit failure scenarios, you close off the most common failure paths without needing to enumerate every possible variant.
The combination of Given/When/Then structure, concrete examples, and a machine-readable completion format gives each agent a clean operating contract: what situation it is in, what decision it is being asked to make, what output to produce, and how to signal when it is done. Each piece reinforces the others.
The result is not a perfectly obedient agent — that does not exist. But it is an agent that fails in predictable ways, recovers cleanly, and can be improved incrementally as you observe what goes wrong.
What we would do differently
The biggest lesson from running this experiment is that prompt quality compounds. A vague brief leads to unfocused work, which leads to an approval decision based on the wrong criteria. Every stage amplifies whatever is unclear in the stage before it.
Investing in clean behavioral contracts at the start of each workflow pays off more than refining the prompt for any single stage. And when something goes wrong, the first place to look is not the agent — it is the scenario. Was the failure a case the scenario covered? If not, add it. The scenarios are your tests for agent behavior, and they should grow the same way a test suite does.
The envelope format also taught us something more general: make the machine-readable part as small as possible. An action name, a one-sentence comment, an optional file reference. That is all the agent needs. Everything else can go below the separator. The less structure you ask the model to maintain, the more reliably it maintains it.