We’ve all been there. You adopt AI coding agents (like Cursor, Copilot, or custom LLM pipelines) with high hopes of 10x productivity. For the first week, it feels like magic.
Then, the cracks start to show.
I was building a multi-tenant B2B SaaS product in Go. We took testing seriously: clean Hexagonal Architecture, strict Domain-Driven Design (DDD), hundreds of BDD Gherkin scenarios, and a disciplined Red-Green-Refactor workflow wired into our CI.
Then we let the AI loose on implementing our step definitions and domain logic.
Confidently and swiftly, the AI began quietly sabotaging our codebase:
- The Float Precision Flaw: It generated `float64` fields for monetary calculations, completely ignoring rounding errors.
- The Architectural Bleed: It lazily imported `database/sql` or web router utilities directly into pure domain aggregates, shattering our hexagonal boundaries.
- The UI Rot: It read a step like `Given I click the Submit button` and generated a brittle, DOM-coupled backend test that crashed the next time a frontend dev changed a CSS class.
- The Mocking Loophole: It wrote step definitions that bypassed the real application services and called mocks directly, creating beautiful, passing test suites that proved absolutely nothing in production.
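To make the float precision flaw concrete, here is a minimal sketch. The function names `sumFloat` and `sumCents` are made up for illustration and are not from our codebase; the point is only that summing `float64` amounts drifts, while summing integer minor units does not.

```go
package main

import "fmt"

// sumFloat adds amounts held as float64 (the flawed approach).
func sumFloat(amounts []float64) float64 {
	var total float64
	for _, a := range amounts {
		total += a
	}
	return total
}

// sumCents adds amounts held as int64 minor units (for example, cents).
func sumCents(amounts []int64) int64 {
	var total int64
	for _, a := range amounts {
		total += a
	}
	return total
}

func main() {
	f := sumFloat([]float64{0.10, 0.20})
	c := sumCents([]int64{10, 20})

	// Binary floating point cannot represent 0.1 or 0.2 exactly,
	// so the runtime sum does not equal the float64 closest to 0.3.
	fmt.Println(f == 0.30) // false

	// Integer arithmetic is exact (ignoring overflow).
	fmt.Println(c == 30) // true
}
```

An equality assertion against an exact integer, like the ones in our feature files, catches this class of bug immediately.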
We were spending more time correcting AI hallucinations and rolling back messy pull requests than we were writing actual features.
But as I sat there debugging yet another floating-point error, I realized something: The root cause wasn’t the AI. It was our specifications. Our Gherkin files were written exclusively for humans, leaving far too many gaps for an LLM to guess.
The Paradigm Shift: Dual-Audience Gherkin
When humans read ambiguous requirements, we use context clues and common sense. An LLM doesn't have common sense; it has token probability. If you give it a vague specification, it takes the path of least resistance (which usually means messy, unidiomatic code).
What if a .feature file could simultaneously speak two completely different languages?
- Plain English that a product owner or business stakeholder can seamlessly read and validate.
- A Deterministic Technical Contract that an AI agent can parse with absolute mathematical precision, leaving zero room for guessing.
This is the core philosophy behind GherkinForge, an open-source experiment to enforce a strict "airlock" between raw human intent and AI generation. We realized that by leveraging Gherkin's native DataTables and DocStrings, we could anchor the AI's boundaries directly inside the specification.
Take a look at how this looks in practice:
@business
Feature: Carbon Emission Entry

  Background: System Initialization
    Given the platform emission factor catalogue contains the following entries:
      | activity_year | activity_type | factor_mg_co2e_per_unit | unit |
      | 2026          | electricity   | 233000                  | kWh  |

  Scenario: Successfully submitting carbon activity calculates exact emissions
    When the manager submits a carbon entry with the following details:
      """json
      {
        "manager_id": "SAM-001",
        "activity_year": 2026,
        "activity_type": "electricity",
        "quantity": 100
      }
      """
    Then the carbon entry aggregate should be successfully created
    And the calculated carbon figure in mg_co2e should be 23300000
    And a "carbon.entry.submitted" domain event is published to the broker
Look closely at what this setup accomplishes:
- DataTable headers become Go struct fields. The column name `factor_mg_co2e_per_unit` explicitly maps to a Go property name.
- DocString JSON defines the exact command payload schema. The AI doesn't guess the input types; it has a literal structural template.
- The math is locked in. 100 × 233,000 = 23,300,000. Because the `Then` clause demands the explicit value `23300000`, any sloppy implementation or floating-point mutation instantly fails the test runner.
The Automated Airlock: A Three-Tiered Pipeline
Prompt engineering alone is just a soft suggestion. A truly resilient AI-assisted workflow requires runtime enforcement. To handle this, we wrapped this methodology into a lightweight CLI tool (gforge) and a series of pipeline guardrails to act as a definitive gatekeeper.
features/
├── business/      @business     → godog + hand-written test doubles (Pure Domain)
├── integration/   @integration  → testcontainers-go + real infrastructure
└── nfr/           @nfr          → Go benchmarks + fuzz testing
Before an AI agent is even allowed to look at a feature file or generate code, the gforge lint utility parses the Gherkin Abstract Syntax Tree (AST) to evaluate our Zero-Trust Rules:
1. Ruthless Vocabulary Bans
If a developer or a product owner accidentally writes a step containing words like click, button, input field, browser, or page inside a @business feature file, the linter instantly throws an error and halts execution. Why? Because UI vocabulary couples backend specifications to cosmetic frontend layouts.
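To give a feel for the rule, here is a deliberately simplified sketch. It is not how gforge lint is implemented: the real tool walks the Gherkin AST, whereas this version scans lines with regular expressions and assumes the caller only passes in @business feature text. The names `LintBusinessFeature` and `Violation` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bannedUI matches the UI vocabulary listed above as whole words,
// case-insensitively. Inflected forms such as "clicks" are not covered here.
var bannedUI = regexp.MustCompile(`(?i)\b(click|button|input field|browser|page)\b`)

// stepKeyword recognises lines that start with a Gherkin step keyword.
var stepKeyword = regexp.MustCompile(`^(Given|When|Then|And|But)\s`)

// Violation records one banned term found in a step.
type Violation struct {
	Line int    // 1-based line number in the feature text
	Term string // banned term, lower-cased
	Step string // the offending step, trimmed
}

// LintBusinessFeature returns every banned UI term found in step lines.
// It does not understand DocStrings or tags; a real linter would use the AST.
func LintBusinessFeature(src string) []Violation {
	var out []Violation
	for i, raw := range strings.Split(src, "\n") {
		step := strings.TrimSpace(raw)
		if !stepKeyword.MatchString(step) {
			continue
		}
		for _, term := range bannedUI.FindAllString(step, -1) {
			out = append(out, Violation{Line: i + 1, Term: strings.ToLower(term), Step: step})
		}
	}
	return out
}

func main() {
	feature := "Feature: Checkout\n" +
		"  Scenario: Submit\n" +
		"    Given I click the Submit button\n" +
		"    Then the order is created\n"

	for _, v := range LintBusinessFeature(feature) {
		fmt.Printf("line %d: banned UI term %q in step %q\n", v.Line, v.Term, v.Step)
	}
}
```

In a pipeline, a non-empty result would translate into a non-zero exit code so the agent never sees the offending file.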
2. Affirmative AI Constraints
Instead of telling an AI agent what not to do (e.g., "Don't use global variables"), which often backfires because LLMs heavily weight the tokens you tell them to avoid, we feed them highly explicit, positive constraints via .cursor/rules/ configs:
- “Every method signature MUST accept `ctx context.Context` as its first parameter.”
- “Exclusively use `int64` for all measurements and monetary balances.”
- “All current time values must be retrieved via an injected `Clock` interface to guarantee deterministic testing.”
3. Isolated Transaction Rollbacks
For @integration suites, the framework automatically wraps each scenario in an isolated SQL transaction that unconditionally rolls back when the scenario ends. The AI can mutate the database as heavily as it wants; nothing written through that transaction is ever committed, so state cannot leak across tests.
The Payoff: What Clean Generation Looks Like
When you couple an AST-level Gherkin linter with clear, positive architectural instructions, the AI agent stops acting like an erratic intern and starts acting like an elite staff engineer.
Because we forced the specs to use exact integer thresholds and explicit domain event assertions, the code scaffolded by the framework naturally respects complex Go idioms, wraps errors to preserve call stacks, and builds rich domain aggregates instead of anemic models.
We moved away from babysitting LLM hallucinations and transitioned to a high-velocity flow where our feature files act as true mathematical invariants.
Where do we go from here?
GherkinForge isn't a silver bullet, nor is it a rigid dogma. It's an exploration of how we can build tooling that acknowledges the reality of AI-driven development without sacrificing architectural purity.
If your team has been hitting a wall with AI agents generating sloppy, unmaintainable code, shift your focus away from refining prompts and start locking down the inputs to the engine.
The project is fully open-source, and I’d love to hear how other teams are drawing the line between human intent and automated execution. Check out the repository, run the linter against your own specs, and let me know: How are you keeping your architecture safe in the age of AI agents?
GitHub Repository: github.com/SpannerSync/gherkinforge