Sebastian Schürmann

Posted on Jun 8

It's Time We All Eat Some More Cucumber!

#ai #tdd #bdd #darkfactory

Everyone's writing specs for AI now. We hand the model a markdown file, tell it what we want, and hope it builds the right thing. It mostly works — until it doesn't.

Markdown has quietly become the spec language. People reach for it as the DSL for their AI-driven workflows — headings, bullet lists, the odd table — and treat that loose structure as if it were a contract. The thing is, it isn't a DSL. It's markdown. It's prose formatting with no grammar to enforce, no structure you can execute, no shared vocabulary, and no way to tell whether the spec and the code still agree. You're leaning on a document format to do a job it was never built for, and you hit the limit the moment you want the spec to actually mean something a machine can check.

Before you go down that road, I want to make a small, slightly absurd suggestion.

Eat a cucumber.

What I actually mean

Gherkin is the plain-text language behind Cucumber, a tool that's been around for years in the behavior-driven development (BDD) world. It looks like this:

Feature: User login

  Scenario: Successful login with valid credentials
    Given a registered user "ada@example.com"
    When she logs in with the correct password
    Then she should land on her dashboard
    And she should see a welcome message

  Scenario: Rejected login with wrong password
    Given a registered user "ada@example.com"
    When she logs in with an incorrect password
    Then she should see an "invalid credentials" error
    And she should remain on the login page

That's it. Feature, Scenario, Given/When/Then. Structured enough that a machine can parse it, loose enough that a product manager can write it.
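To make "a machine can parse it" concrete, here is a minimal sketch in Python. It handles only a tiny subset of Gherkin (Feature, Scenario, and step lines), and every name in it (`parse_feature`, `Feature`, `Scenario`) is made up for illustration. It is not how Cucumber's own parser works, which covers far more of the grammar.

```python
from dataclasses import dataclass, field

STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


@dataclass
class Scenario:
    name: str
    steps: list = field(default_factory=list)  # (keyword, text) pairs


@dataclass
class Feature:
    name: str
    scenarios: list = field(default_factory=list)


def parse_feature(text: str) -> Feature:
    """Parse a tiny subset of Gherkin: Feature, Scenario, and step lines."""
    feature = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("Feature:"):
            feature = Feature(line[len("Feature:"):].strip())
        elif line.startswith("Scenario:"):
            if feature is None:
                raise ValueError("Scenario before Feature")
            feature.scenarios.append(Scenario(line[len("Scenario:"):].strip()))
        else:
            keyword, _, rest = line.partition(" ")
            if keyword not in STEP_KEYWORDS or feature is None or not feature.scenarios:
                raise ValueError(f"unexpected line: {line!r}")
            feature.scenarios[-1].steps.append((keyword, rest))
    if feature is None:
        raise ValueError("no Feature found")
    return feature


LOGIN = '''
Feature: User login

  Scenario: Successful login with valid credentials
    Given a registered user "ada@example.com"
    When she logs in with the correct password
    Then she should land on her dashboard
    And she should see a welcome message
'''

feature = parse_feature(LOGIN)
```

A few dozen lines get you from text to a structure you can loop over. A line of freeform prose would raise an error here instead of being silently accepted.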

The gap it bridges

Most specs live at one of two extremes.

On one end you have written specs: docs, tickets, markdown files. Readable by anyone, but inert. Nothing checks whether they're still true. They rot the moment the code moves on.

On the other end you have tests: precise, executable, always honest — but written in code, illegible to half the people who actually care about the behavior.

Gherkin sits in the middle. It's prose with just enough skeleton that you can wire each Given/When/Then step to real code. The same file is both the human-readable spec and the thing your test runner executes. When the behavior changes, the scenario fails. The spec can't quietly drift away from reality, because the spec is the check.

That's what people mean by "living specs": documentation that can't lie to you because it runs.
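The wiring itself is less magic than it sounds. Below is a rough sketch in plain Python of the idea: a registry that maps step sentences to functions. The `step` decorator, `run_scenario`, and the in-memory `login` service are all hypothetical stand-ins I made up for this post. Real Cucumber bindings give you the same shape with a lot more polish.

```python
import re

STEPS = []  # (compiled pattern, function) pairs


def step(pattern):
    """Register a function as the implementation of steps matching pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


# Hypothetical application under test: an in-memory login service.
USERS = {"ada@example.com": "s3cret"}


def login(email, password):
    return "dashboard" if USERS.get(email) == password else "login"


@step(r'a registered user "(.+)"')
def given_user(ctx, email):
    assert email in USERS
    ctx["email"] = email


@step(r"she logs in with the correct password")
def when_correct(ctx):
    ctx["page"] = login(ctx["email"], USERS[ctx["email"]])


@step(r"she logs in with an incorrect password")
def when_incorrect(ctx):
    ctx["page"] = login(ctx["email"], "wrong")


@step(r"she should land on her dashboard")
def then_dashboard(ctx):
    assert ctx["page"] == "dashboard"


@step(r"she should remain on the login page")
def then_login(ctx):
    assert ctx["page"] == "login"


def run_scenario(lines):
    """Run step lines in order; return True if every step matched and passed."""
    ctx = {}
    for line in lines:
        text = line.strip().split(" ", 1)[1]  # drop Given/When/Then/And
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                try:
                    func(ctx, *match.groups())
                except AssertionError:
                    return False
                break
        else:
            raise LookupError(f"no step definition for: {text!r}")
    return True


ok = run_scenario([
    'Given a registered user "ada@example.com"',
    "When she logs in with the correct password",
    "Then she should land on her dashboard",
])
```

The quoted email in the scenario becomes a function argument through the regex capture group, which is how a sentence carries data into code. A step nobody has implemented raises an error instead of passing quietly.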

The signals you've outgrown markdown

You don't switch formats on principle. You switch when the document starts straining against what it can express. Three signals tell you you're there.

When the markdown gets too specific. A spec starts as a paragraph of intent. Then someone adds a bullet for the edge case, then a sub-bullet for the exception to the edge case, then a parenthetical for what happens on a leap year. The prose is now pretending to be a state machine. Each of those specifics is really a scenario — a concrete given/when/then — and markdown gives you no way to mark it as one, let alone check it. Gherkin does: every specific behavior becomes its own named scenario you can point at, run, and reason about in isolation.

When tables of example data start to appear. This is the clearest tell. The moment your spec contains a table — input A gives output X, input B gives output Y — you've stopped describing behavior and started enumerating cases. A markdown table just sits there; nothing verifies that row three is still true. Gherkin has this built in with Scenario Outline and Examples:

Scenario Outline: Discount applied by tier
  Given a customer on the "<tier>" plan
  When they check out with a cart total of <total>
  Then the discount applied should be <discount>

  Examples:
    | tier    | total | discount |
    | free    | 100   | 0        |
    | pro     | 100   | 10       |
    | premium | 100   | 25       |

Same readable table — except every row is now an executable test case. Add a row, you've added a test. The data table is the spec.
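If you're wondering how one outline turns into several tests, the mechanism is plain placeholder substitution. Here is a small Python sketch of it. `parse_table` and `expand_outline` are names I invented for illustration, not functions from any Cucumber library.

```python
def parse_table(rows):
    """Turn '| a | b |' lines into a list of dicts keyed by the header row."""
    cells = [[c.strip() for c in row.strip().strip("|").split("|")] for row in rows]
    header, *body = cells
    return [dict(zip(header, values)) for values in body]


def expand_outline(step_templates, examples):
    """Yield one concrete list of steps per Examples row."""
    for row in examples:
        steps = []
        for template in step_templates:
            for name, value in row.items():
                template = template.replace(f"<{name}>", value)
            steps.append(template)
        yield steps


TEMPLATES = [
    'Given a customer on the "<tier>" plan',
    "When they check out with a cart total of <total>",
    "Then the discount applied should be <discount>",
]
EXAMPLES = parse_table([
    "| tier    | total | discount |",
    "| free    | 100   | 0        |",
    "| pro     | 100   | 10       |",
    "| premium | 100   | 25       |",
])

concrete = list(expand_outline(TEMPLATES, EXAMPLES))
```

Three rows in, three concrete scenarios out, each of which then runs through the same step definitions as any hand-written scenario.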

When requirements are subject to change and re-evaluation. If a requirement will never move, a comment in the code is fine. The cost of a living spec only pays off when things change — and AI-driven work changes constantly. When a rule shifts, you want one place to edit that immediately tells you what broke. With markdown, you change the prose and hope someone updates the code to match; nothing flags the drift. With Gherkin, you change the scenario, run it, and the failures show you exactly where reality no longer agrees. The spec becomes the thing you re-evaluate against, not a stale artifact you forget to update.
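Here is a sketch of what "the failures show you where" could look like, in Python. It assumes the discount is a percentage of the cart total (the table above doesn't actually say), and it assumes the premium rule was just changed from 25 to 30 in the spec. `discount_for` and `find_drift` are hypothetical names.

```python
# Spec rows after a rule change: premium now gets 30 instead of 25.
SPEC = [
    {"tier": "free", "total": 100, "discount": 0},
    {"tier": "pro", "total": 100, "discount": 10},
    {"tier": "premium", "total": 100, "discount": 30},
]

# Hypothetical implementation that still encodes the old rule,
# treating the discount as a whole-number percentage of the total.
PERCENT = {"free": 0, "pro": 10, "premium": 25}


def discount_for(tier, total):
    return total * PERCENT[tier] // 100


def find_drift(spec, implementation):
    """Return the spec rows whose expected discount the code no longer produces."""
    failures = []
    for row in spec:
        actual = implementation(row["tier"], row["total"])
        if actual != row["discount"]:
            failures.append({**row, "actual": actual})
    return failures


drift = find_drift(SPEC, discount_for)
```

Only the premium row comes back, with the expected and actual values side by side. That is the row to fix, and nobody had to remember to go looking for it.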

Why this matters for AI work

If you're feeding specs to an LLM to generate or verify code, Gherkin gives you three things a freeform markdown file doesn't:

A fixed grammar. The model doesn't have to guess your invented structure. Gherkin's vocabulary is small, well-documented, and almost certainly already in the model's training data.
Executable acceptance criteria. The AI can generate code, and once the steps are wired to that code you can run the scenarios to see whether it actually did the job, with no human re-reading the markdown to judge.
A round trip. You can ask the model to write scenarios from a description, write code from scenarios, or check that existing code satisfies them. Each direction has a clear, checkable artifact.

You get the readability you wanted from markdown, plus a definition of "done" that a machine can enforce.
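One way to picture that machine-enforced "done" is a generate-and-check loop. The Python sketch below is only an illustration under loud assumptions: `fake_model` stands in for a real model call (no actual LLM API is used), `refine` and `check` are invented names, and the rows are the discount examples from earlier, read as percentages.

```python
def refine(generate, check, max_rounds=3):
    """Ask for a candidate, run the checks, feed failures back, repeat."""
    feedback = []
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = check(candidate)
        if not feedback:
            return candidate, round_number
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


# Examples table rows as (tier, total, expected discount).
ROWS = [("free", 100, 0), ("pro", 100, 10), ("premium", 100, 25)]


def check(candidate):
    """Return a readable failure message for each row the candidate gets wrong."""
    return [
        f"{tier}/{total}: expected {expected}, got {candidate(tier, total)}"
        for tier, total, expected in ROWS
        if candidate(tier, total) != expected
    ]


def fake_model(feedback):
    """Stand-in for a model call: wrong on the first try, right once it sees failures."""
    percent = {"free": 0, "pro": 10, "premium": 25 if feedback else 20}
    return lambda tier, total: total * percent[tier] // 100


accepted, rounds = refine(fake_model, check)
```

The loop stops when the examples pass, not when the output merely looks plausible, and the failure messages double as the next prompt.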

Start small

You don't need to adopt all of BDD or restructure your team. Take one feature you're about to spec for your AI tooling and write it as a .feature file instead of a markdown blob. See how it feels to have the spec and the test be the same document.

You're already using a document format as a spec language. This one was actually built to be one — it already exists, already has tooling in every major language, and already solved the problem markdown keeps quietly failing at.

So: put the markdown back where it belongs. Eat the cucumber.

Top comments (9)

phinn markson • Jun 17

Not gonna lie.. I excitedly hoped for IRL cukes because it's my first time growing them and I don't even like them. I'm growing them for my partner. But fine... Guess I'll learn some more stuff. Shakes head grumbling 😉

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • Jun 8

Quality post lol.

Good reminder for me to eat a cucumber today. Regardless, good work lol

Martin Morrey • Jun 17

Great article. As a product manager / owner I've been using gherkin for product specs for a few years, partly because it can help QA do their job, but mainly because as an ex-programmer I really like having a syntax to work to.

I think the specification-by-example approach it brings you to is very powerful, especially when you start using tables of data as described. As a product manager/owner it really makes you think hard about the acceptance test scenarios.

This does mean writing the spec itself can take longer, but I truly believe it opens up potential to save a lot of time down the line.

In my last role I specced up a new look-up API in gherkin, and the dev team (starting with Claude Code and fettling its output) were able to get the API running in a day, when the estimate had been a week.

Doug Wilson • Jun 18

Great article. No idea why people insist on reinventing the wheel. Gherkin's been around for years and does a great job when used well. Excellent examples of the "sad path" (unsuccessful) scenario too. We should always describe the outcome when not everything goes to plan. Excellent!

Mardeg • Jun 18 • Edited

Well I looked at the format and thought: How would the data and logic for a text adventure from 1976 fit into this?

Probably the wrong question but here it is anyway, Colossal Cave Adventure!

Andreas Müller • Jun 20

It's worth pointing out for those that don't know Cucumber that you have to write code to make the steps executable. Of course AI can do that for you, but as presented it can seem like Cucumber does some magic to execute the scenarios. It really doesn't: you describe the sentences you want with placeholders in annotations to methods (at least that's how it is done in Java) and then you write the method to execute code based on the parameters in the sentences you defined. Cucumber only gives you the minimal structure of given-when-then syntax and a framework to write the code in, but you still have to write the code which executes the steps.

You addressed in your "roundtrip" point at the end exactly what I was thinking: Could you use AI to generate the Cucumber scenarios for you? I see a bit of a problem here given what I outlined above: I can see how the AI could derive the scenarios in Gherkin syntax from a markdown spec, but what about the implementation of the sentences? That I think is a bit harder than just the scenario definitions. For example, if you want it to send messages to a Kafka topic, but you need to give it a dependency so it can even construct the message correctly, how would the AI know which dependency to use if you don't specify it in some way? But if you go to the level of specifying the dependency, then you might just as well include it yourself in your pom...

I think there is an opportunity here to use AI in a smart way if you're willing to write some of the code and set up some infrastructure yourself. I've been experimenting with this approach lately in implementing functionality, and I find that for me it is the best approach. Do some groundwork yourself, and then let the AI do the rest. That way you stay in charge of design more than if you let the AI do everything.

Also, I had the pleasure of setting up a Cucumber-based test repository last year, which our tester has now inherited from me, and what I found is that getting the sentences right so that they're re-usable actually takes quite a bit of thought and design work. In that respect, too, it seems to me that going straight from a spec to Cucumber scenarios might require some back and forth with the AI, rather than just feeding it the spec and hoping it gets it right.

Sebastian Schürmann • Jun 22

Claude Opus can do all 3 parts of a nice cucumber test suite with no problems. The 3 layers are kind of helpful as they uncover inconsistencies while you build.

Vitaliy Potapov • Jun 21

Thanks for the post. I have exactly the same feeling, and as a maintainer of playwright-bdd, I'm seeing growing interest in this topic.

In spec-driven flows, people often get buried under tons of markdown artifacts that nobody reads. Feature files are different: they are a valuable outcome because they are both readable and executable.

In my projects, I use a skill that forces the agent to write BDD scenarios during planning. Even without automation, this is already useful because it helps humans and agents confirm they have the same understanding of the requirements.

With automation, it becomes even more powerful because the agent can autonomously verify the implementation by running tests. But there is no magic here. As mentioned in other comments, you still need to write correct step implementations - with AI as well, of course. In the playwright-bdd skill, we encourage agents to reuse existing steps as much as possible, which helps avoid unnecessary new code.

I believe BDD is getting a fresh boost in our AI-assisted flows.

Sebastian Schürmann • Jun 22

great to see this in use. ;)