DEV Community

Li-Hsuan Lung

Posted on • Originally published at blog.projectbrain.tools

How We Use Gherkin, Envelopes, and Schemas to Shape Agent Behavior

Behavioral science over ignored rule lists

Instead of writing rules agents ignore, we describe the behaviors we want. A look at how Gherkin scenarios, message envelopes, and structured output formats work together to make AI agents reliably do the right thing.


When we first started building more complex prompts, we took the programming mindset to prompting: we wrote instructions, added more rules when something broke, and ended up with prompts that read like policy documents. After several iterations of watching agents ignore nuanced rules in favor of fluent-sounding output, we started looking for a different approach.

We shifted to a behavioral science mindset. Instead of specifying what the agent must do, we describe the context it operates in, the outcome we want, and concrete examples of what success and failure look like. It is harder to design — each scenario requires thinking through not just the happy path but the edge cases and failure modes. But in production, it is much more reliable in our experience.

This post covers three specific techniques we use: Gherkin-style prompt structure, the message envelope DSL, and a structured completion signal format. None of these are original. But together, they give agents a much cleaner operating contract.

The problem with rules-based prompting

When an agent misbehaves, the instinct is to add a rule. "Do not X." "Always do Y before Z." After a few iterations of this, you end up with something like:

You are an AI assistant. You must read files before editing them. You must not create a new file when revising. You must not signal completion if tests are failing. You must always include the file_id in your response. You must not post the file content as a comment instead of saving it…

The model reads this, acknowledges it, and then does whatever the rest of its training predicts. Long rule lists dilute attention. The model does not treat them as hard constraints — it treats them as context. If the rule conflicts with what feels fluent given everything else in the prompt, fluency wins.

The behavioral science approach reframes the problem: instead of listing rules, you describe a situation, a trigger, and an expected behavior. That maps much more naturally onto how large models were trained.

Gherkin: specifying behavior without specifying instructions

Gherkin is a plain-English format from behavior-driven development. Each scenario has a Given/When/Then structure: the precondition, the trigger, and the expected outcome.

We adapted it to build agent prompts. Here is an example of the kind of Gherkin prompt we mean:

Feature: Plan Narration
  As a user, I want to hear the plan before work begins so I can follow along.

  Rule: Narrate the plan after receipt, before the first tool call

    Scenario: User sends a non-trivial prompt requiring tool calls
      Given I have spoken the receipt confirmation
      When I am about to make my first tool call
      Then I call speak() a second time with my intended approach
      And the narration is two to four short spoken sentences maximum
      And I use plain language with no bullet points, no markdown, and no unexplained jargon
      And I state what I will do and why, in the order I will do it
      And I end with a natural transition phrase

    Examples of good transition phrases:
      | "Let's go."    |
      | "Here we go."  |
      | "Starting now."|

    Example of a good plan narration:
      | "I'm going to read the agent runner file first to understand the current flow, |
      |  then add the new tool to the schema, and finally hook it up in the executor. |
      |  Should only take a moment. Here we go."                                      |

Several things are worth noticing here.

It is actually Gherkin. The structure starts with a Feature, adds a Rule, and then defines a concrete Scenario. That matters because the prompt is specifying behavior in the same shape Gherkin is designed for, rather than borrowing only the surface style of Given/When/Then.

Context comes first. The Given line establishes the exact state the agent is already in: receipt has been confirmed, and the next boundary is the first tool call. This is not just flavor text — it narrows the behavior to a specific moment instead of leaving it as a vague instruction.

The trigger is singular. When I am about to make my first tool call identifies one decision point. The agent does not have to infer when narration should happen.

The expected behavior is concrete. The Then lines define the constraints that matter: short narration, plain language, ordered explanation, and a natural handoff into action. The examples at the end do the rest of the work. They show what "good" looks like in a form the model can imitate directly.

The message envelope: separating signal from prose

The other half of the problem is output. You can get agents to behave well 80% of the time, but the remaining 20% shows up as verbose non-sequiturs, duplicated content, and format drift. When you are parsing agent output programmatically, format drift is a real cost.

ProjectBrain uses a message envelope DSL for agent-to-system communication. The format looks like an email header — structured key-value pairs above a --- separator, with optional free prose below:

ACTION: approve
COMMENT: Added cursor pagination to /facts endpoint; all 43 tests pass.
FILE_ID: 3a9f2c1d-...
---
The implementation uses keyset pagination on (created_at, id) to ensure
stable ordering under concurrent inserts. I chose this over pure cursor
because the facts table is insert-heavy...

The runner reads only what is above the ---. Everything after it is ignored for routing purposes. This means agents can write as much explanatory prose as they want — it costs nothing and discards cleanly. The structured part stays small and parseable.

We use this same format in both directions. When the runner dispatches tasks to agents, it sends structured preamble above ---. When agents respond, they use the same format. The symmetry is intentional: it is easier to teach a model to produce a format it already sees being consumed.
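As a rough sketch, producing an envelope on the dispatch side can be simple string assembly. The `format_envelope` helper and its field names below are illustrative, not ProjectBrain's actual API:

```python
# Illustrative helper, not ProjectBrain's actual code: build an envelope
# from structured fields plus optional free prose below the separator.
def format_envelope(fields: dict[str, str], body: str = "") -> str:
    header = "\n".join(f"{key.upper()}: {value}" for key, value in fields.items())
    return f"{header}\n---\n{body}" if body else header

msg = format_envelope(
    {"action": "approve", "comment": "All 43 tests pass."},
    body="Keyset pagination keeps ordering stable under concurrent inserts.",
)
```

Because the same shape is used in both directions, the dispatch side and the response side can share this kind of formatting logic.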

The runner extracts the envelope with a simple regex anchored to line starts:

_ENVELOPE_RE = re.compile(
    r"^(ACTION|COMMENT|FILE_ID|PR_URL):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
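Combined with a split at the separator, a minimal parsing step might look like this. The `parse_envelope` helper is a sketch under the assumptions above, not the actual runner code:

```python
import re

_ENVELOPE_RE = re.compile(
    r"^(ACTION|COMMENT|FILE_ID|PR_URL):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_envelope(text: str) -> tuple[dict[str, str], str]:
    # Only the part above the first `---` is treated as machine-readable;
    # everything below it is free prose for humans.
    head, _sep, body = text.partition("\n---\n")
    fields = {key.upper(): value.strip() for key, value in _ENVELOPE_RE.findall(head)}
    return fields, body.strip()

fields, prose = parse_envelope(
    "ACTION: approve\nCOMMENT: all 43 tests pass\n---\nLonger prose explanation here."
)
```

Splitting first and matching second keeps the prose from ever reaching the field extractor.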

Why not JSON?

We started with JSON as the completion signal format:

{
  "action": "approve",
  "comment": "Added cursor pagination; all tests pass.",
  "file_id": "3a9f2c1d-..."
}

This worked, but had two problems. First, models can produce malformed JSON — unescaped quotes in the comment, trailing commas, or a JSON block that got split across a markdown code fence. Second, agents tended to repeat the comment content both in their prose and in the JSON object, creating verbose duplication.

The envelope format is more natural for models to produce because it looks like structured text, not a programming construct. Line-start anchoring (^) means indented examples in the reasoning cannot contaminate the actual signal. And the --- separator creates a clear moment of transition: everything above is machine-readable, everything below is human-readable. Models navigate that transition naturally.
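A quick way to see the anchoring at work (the sample reply here is invented): even when the prose quotes an envelope line, indentation keeps it from matching.

```python
import re

_ENVELOPE_RE = re.compile(
    r"^(ACTION|COMMENT|FILE_ID|PR_URL):\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

reply = (
    "ACTION: approve\n"
    "COMMENT: done\n"
    "---\n"
    "For comparison, a rejected draft would have ended with:\n"
    "    ACTION: request_changes\n"  # indented, so `^` never matches it
)

# Only the real header lines match; the quoted example in the prose does not.
matches = _ENVELOPE_RE.findall(reply)
```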

Evaluator agents: the same structure, different role

When ProjectBrain routes a task to a reviewer, the same prompt builder is used — but the behavioral contract changes:

Given I am a senior software engineer operating via the ProjectBrain runner
  And the submission to evaluate is:
    """
    [content of the submitted draft]
    """

When I evaluate the submission

Then I evaluate against the rubric criteria below
  And I provide specific, actionable findings in my verdict
  And I end my response with a completion envelope (see Completion Signal)

And the scenarios shift accordingly:

Scenario: submission meets all criteria
  Given all rubric criteria are satisfied
  When I evaluate
  Then I end with:
    ACTION: approve
    COMMENT: Clear summary of why it passes.

Scenario: submission needs targeted changes
  Given specific, fixable issues were found
  When I evaluate
  Then I end with:
    ACTION: request_changes
    COMMENT: • Issue 1\n• Issue 2\n• What the author should do next.

Scenario: issues require human judgement
  Given the submission has fundamental problems beyond targeted fixes
  When I evaluate
  Then I end with:
    ACTION: escalate
    COMMENT: Reason: [what human review is needed]

The rubric — what to evaluate against — is also injected into the prompt by the platform, not hardcoded into the agent. This means a single evaluator agent can apply different quality bars for different workflows without needing separate deployments.
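A sketch of what that injection could look like on the platform side. The template and function names here are my own illustration, not ProjectBrain's actual code:

```python
# Hypothetical prompt builder: the evaluator prompt is a fixed template,
# and the rubric criteria are supplied per workflow at dispatch time.
EVALUATOR_TEMPLATE = """\
Given I am a senior software engineer operating via the ProjectBrain runner
  And the submission to evaluate is:
    \"\"\"
    {submission}
    \"\"\"

When I evaluate the submission

Then I evaluate against the rubric criteria below
{rubric}
"""

def build_evaluator_prompt(submission: str, rubric_criteria: list[str]) -> str:
    rubric = "\n".join(f"  * {criterion}" for criterion in rubric_criteria)
    return EVALUATOR_TEMPLATE.format(submission=submission, rubric=rubric)

prompt = build_evaluator_prompt("My draft text", ["All tests pass", "No secrets in git"])
```

Because the agent body never changes, swapping the rubric list is enough to repurpose the same evaluator for a different workflow.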

Behavioral science vs. rules: the practical difference

When you write rules, you are trying to enumerate failure modes in advance. The list is always incomplete. Models find the gaps — not through adversarial intent, but because they optimize for fluent, plausible output, and fluent output does not always respect unstated constraints.

When you write behavioral scenarios, you are doing something closer to training by example. You are showing the model a situation, a decision point, and an outcome. Models generalize from examples much better than they comply with rule lists. And when you include explicit failure scenarios, you close off the most common failure paths without needing to enumerate every possible variant.

The combination of Given/When/Then structure, concrete examples, and a machine-readable completion format gives each agent a clean operating contract: what situation it is in, what decision it is being asked to make, what output to produce, and how to signal when it is done. Each piece reinforces the others.

The result is not a perfectly obedient agent — that does not exist. But it is an agent that fails in predictable ways, recovers cleanly, and can be improved incrementally as you observe what goes wrong.

What we would do differently

The biggest lesson from running this experiment is that prompt quality compounds. A vague brief leads to unfocused work leads to an approval decision based on wrong criteria. Every stage amplifies whatever is unclear in the stage before it.

Investing in clean behavioral contracts at the start of each workflow pays off more than refining the prompt for any single stage. And when something goes wrong, the first place to look is not the agent — it is the scenario. Was the failure a case the scenario covered? If not, add it. The scenarios are your tests for agent behavior, and they should grow the same way a test suite does.

The envelope format also taught us something more general: make the machine-readable part as small as possible. An action name, a one-sentence comment, an optional file reference. That is all the agent needs. Everything else can go below the separator. The less structure you ask the model to maintain, the more reliably it maintains it.

Top comments (4)

Thomas Landgraf

The behavioral science reframing is the part that lands hardest for me — "describe the situation and expected outcome" vs "list rules the model will ignore." I've run into the exact same failure mode on the spec side: long rule lists in CLAUDE.md that the agent acknowledges and then doesn't apply when the rule conflicts with the fluent next token.

The approach I've been taking on SPECLAN (disclosure: I'm the creator, it's a VS Code extension for spec management) is to push Gherkin-style scenarios down into the requirements layer itself. Each requirement gets scenarios with Given/When/Then in the frontmatter, and the agent implementing the requirement loads only those scenarios — not a giant policy document. It's basically your message envelope idea, but at the spec granularity rather than the prompt wrapper.

The one thing I'm still figuring out is how to handle scenarios that describe system-wide constraints (like "never store secrets in git") that apply across all requirements. Those don't belong in a single requirement's frontmatter, but they also can't live in a rule list because of the exact problem your article describes. Curious how you handle cross-cutting behavioral constraints with the envelope DSL — do they get attached to every envelope, or is there a separate scope?

Li-Hsuan Lung

@thlandgraf That's an interesting angle! I don't know if this will help, but here's an approach I took for a different project (text adventure with LLM). Basically, I looked into how we feed prompt into an LLM API, and there are two parts: system prompt and user message. I use the system prompt to specify "narrator identity," "writing principles," and "world constraints" to ground the LLM to the world I am building. Anthropic's API lets you cache the system prompt, so that it is less costly to include in every conversation. Perhaps you can use the same technique for system-wide constraints? Thanks for reading my article!

Thomas Landgraf

The system prompt / user message split is the cleanest framing I've seen for this. It maps pretty well to the two layers I've been wrestling with — project-wide constraints live in CLAUDE.md / .claude/rules/ at the repo root (the "system" layer), and per-requirement scenarios live in the spec frontmatter that gets loaded into the user message when the agent implements that requirement.

The thing I'm still unsure about is whether caching the system prompt actually makes the agent respect the constraints more reliably, or just makes it cheaper to include them. In my experience the agent will still ignore cached system instructions when they conflict with the fluent next token — the exact failure mode your article describes. Have you seen caching change the adherence, or is it purely a cost optimization?

One wrinkle I hit: I run specs through multiple providers (Claude, OpenAI, Google ADK), and prompt caching semantics differ across them. That nudges me toward putting critical constraints in the content the agent must read, not just in a cached system slot — though that has its own cost in tokens and in giving the agent more surface area to selectively ignore.

The world-grounding angle for your text adventure is a nice bridge between system prompt and behavioral contracts — thanks for sharing it.

Li-Hsuan Lung

@thlandgraf The agent did treat system prompts with higher priority in my experience, but I haven't done enough research to know for sure.

One other thing I tried was to set up a review loop: the narrator agent submits a draft, and an editor agent reviews it, either approving it or rejecting it with the constraint violations it found. The narrator then reads the violations and rewrites until either the editor approves or we reach the maximum number of loops.

This improved the rule adherence quite a bit, but it was not a good fit for my text adventure game because the narrator focused too much on rule adherence and the narration quality dropped. But it may not be a problem for your use case?