Antoniel Magalhães

My Agentic Engineering Process: From Vibe Code to BDD

TL;DR I use BDD to turn specification into validated code. Scenarios define behavior; tests cover it; mutation checks that tests are actually useful. Once validated, I can refactor the implementation without fear of regression.

Introduction

This post is about how I currently approach agentic engineering and how I try to build features with AI agents (Cursor, Claude Code, or any other) through a defined methodology—specifically BDD (Behavior-Driven Development)—to guide code production and ensure correctness.

Vibe Code vs Agent Engineering

The first time I saw the term agentic engineering was in Simon Willison's article on agentic engineering patterns. I was already doing it, but kept calling it vibe coding; I liked the new name—it sounds more serious and well-defined.

Vibe coding

When you don't pay attention to the code in general

You don't pay attention to how things are being done. The focus is on shipping, not on establishing patterns or validating the quality of what the agent produces.

Agentic engineering

When you create and establish norms

You create norms and guidelines, and you define how you want the AI to write the code. You still don't type the code yourself, but you actively influence how it is produced.

Exploring new agentic engineering patterns

Day to day, it's hard to give up productivity or quality relative to the baseline just to experiment with agentic engineering patterns. So what I usually do is create small side projects I want to build and use them to explore these capabilities with the agent.

I started the project with my established agentic engineering workflow: I have an intention, I create a specification using SpecKit or OpenSpec, from the specification I create a plan, and from that plan I create the implementation.

```
Intention → Specification → Plan → Implementation
```

A simple pipeline, but it works well for me. The "problem" is that I need to pay attention at every step: the intention is in my head, so I have to confirm the specification reflects what I imagined, verify the implementation plan follows it, and finally do a code review of the implementation. The process demands a lot of my attention, and I wanted to explore patterns that could reduce my attention time per task while keeping the same quality.

TDD

I had already seen the pattern of using automated tests to validate the implementation and create counterpressure on the agent so it validates its own work. I think it's valid. The problem, in my experience, is that when I let the agent create tests on its own, without adequate supervision, it takes shortcuts: tests that cover trivial things and implementation details, things like `expect(1).toBe(1)`.

Simon documents the red/green TDD pattern in his guide: make the test fail first, then make it pass, ensuring the test is actually useful. Given my experience, I still didn't feel comfortable trusting the agent's goodwill not to edit the test just to make it pass (@effectfully). My worst case: I implement TDD, spend n minutes, go to review the code, specifically the tests, and the tests don't make sense—or I run a manual test and discover the feature doesn't work, yet the tests pass.

Before: Just Go

Previous flow without formal specification

Open the agent, explain the task, refine over iterations. No spec, no formal validation. The risk: trivial tests, implementation that passes but doesn't deliver what's expected.

```
Explain task → Agent implements → Review → OK? → (No → Agent implements) | (Yes → Done)
```

After: BDD

Specification to action

Spec → .feature → tony-bdd-test → mutation. Approved scenarios become the source of truth; tests validate behavior; mutation ensures tests don't pass by accident.

```
Spec → .feature → tony-bdd-test → Mutation
```

BDD

A while back, in pre-LLM times, I ran into a requirement: how do you ensure code quality through tests that talk to the product? There was a gap between what was asked in the ticket and what the team delivered. The solution: BDD (Behavior-Driven Development), as documented by Cucumber. Once the .feature files and scenarios were approved, the team had to ensure the tests for each scenario passed.

That idea came back as my side project grew. The time to validate all features increased in step with the functionality I added. The question reappeared: how do I consolidate what I have now and ensure I can add new things in a structured way, with tests that actually validate behavior?

The answer was BDD again. My approach was to create two /commands in Cursor: one to generate the .feature files and another to implement the feature and its tests.

/tony-bdd

The command takes a file (a component, a route, a handler) and extracts the scenarios that matter from it. Instead of organizing by page or screen, it organizes by domain: auth, checkout, layout, whatever makes sense for the code's vocabulary. It starts with a few high-impact scenarios, the journeys that prove the thing works end-to-end. Only later, if needed, does it unfold into rules and variations. Each scenario gets a stable @id. A script sweeps the .feature files and checks that there's a corresponding test for each declared scenario. It doesn't use Cucumber; the .feature files are just the behavior source. The command doesn't touch code or tests; it only writes specs.
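The sweep script can be sketched in TypeScript. Everything here is illustrative: the `@id` tag format and the heuristic of matching ids by plain text search are my assumptions, and file reading is left out so the core check stays visible.

```typescript
// Sketch of the scenario-coverage sweep. Assumed convention: each scenario
// carries a stable tag like @checkout-001, and each test mentions that id
// somewhere in its title. Works on raw text; wiring it to fs.readFileSync
// over the real .feature and test files is the only missing piece.

const SCENARIO_TAG = /@([a-z]+-\d+)/g;

// Collect every @id declared in a .feature file's text.
function declaredIds(featureText: string): string[] {
  return [...featureText.matchAll(SCENARIO_TAG)].map((m) => m[1]);
}

// Return the declared ids that no test file mentions.
function uncoveredIds(featureText: string, testTexts: string[]): string[] {
  const haystack = testTexts.join("\n");
  return declaredIds(featureText).filter((id) => !haystack.includes(id));
}

// Example: one scenario covered, one not.
const feature = `
Feature: Checkout
  @checkout-001
  Scenario: Pay with saved card
  @checkout-002
  Scenario: Pay with a new card
`;
const tests = [`it("@checkout-001 pays with saved card", () => {})`];

console.log(uncoveredIds(feature, tests)); // → ["checkout-002"]
```

In the real workflow this would run in CI, failing the build when any declared scenario has no matching test.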

/tony-bdd-test

This command takes the Gherkin scenarios and implements the tests. It places tests next to the code under test, with minimal mocks (MSW when it's a web project). For each test it writes, it requires a mutation sanity check: break the implementation on purpose, run the test, confirm it fails, revert. The goal is to ensure tests don't pass by accident.
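The mutation sanity check boils down to one property: the same behavioral assertion must pass against the real code and fail against a deliberately broken copy. A minimal sketch, with a hypothetical discount rule standing in for real code under test (the function names, the coupon, and the scenario id are made up):

```typescript
// Code under test, hypothetically tied to a scenario like @checkout-003:
// "a valid SAVE10 coupon takes 10 off the total".
function applyDiscount(total: number, coupon: string): number {
  return coupon === "SAVE10" ? total - 10 : total;
}

// Deliberate mutation: the coupon branch is disabled. In the workflow the
// agent makes a temporary edit like this, runs the test, then reverts it.
function applyDiscountMutated(total: number, _coupon: string): number {
  return total;
}

// The behavioral assertion extracted from the scenario.
function scenarioHolds(impl: (t: number, c: string) => number): boolean {
  return impl(100, "SAVE10") === 90 && impl(100, "NONE") === 100;
}

console.log(scenarioHolds(applyDiscount));        // true: test passes on real code
console.log(scenarioHolds(applyDiscountMutated)); // false: the mutation is caught
```

If the second check came back `true`, the test would be passing by accident, and the command would have to reinforce it before moving on.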

```
tony-bdd (.feature) → tony-bdd-test (Tests) → Mutation → Test fails? → (Yes → Revert → Done) | (No → Reinforce → Mutation)
```

/tony-workflow

The base commands let you create .feature files and implement them. How do you chain the two? That's where /tony-workflow comes in. The workflow runs /tony-bdd and then /tony-bdd-test, but leverages context engineering and subagents. One of its instructions is to parallelize whenever possible. For features that touch different parts of the system, or that can be split into smaller work units, once the .feature files are generated, each can be launched in a subagent in parallel, in the background. This doesn't block the main thread or pollute the main context.

```
Main thread: tony-bdd → .feature
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
Subagent 1      Subagent 2      Subagent N
    ↓               ↓               ↓
tony-bdd-test  tony-bdd-test   tony-bdd-test
    └───────────────┼───────────────┘
                    ↓
               background
```
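The fan-out above can be sketched as a plain `Promise.all`, with a hypothetical `runSubagent()` stub standing in for the real Cursor subagent call (the function names and return values are invented for illustration):

```typescript
type FeatureFile = { path: string };

// Hypothetical subagent call: implement the tests for one .feature file
// in isolation. In the real workflow this is a background Cursor subagent
// running /tony-bdd-test; here a stub keeps the structure visible.
async function runSubagent(feature: FeatureFile): Promise<string> {
  return `tests written for ${feature.path}`;
}

// Main thread: after tony-bdd generates the .feature files, fan out one
// subagent per file and wait for all of them to finish.
async function tonyWorkflow(features: FeatureFile[]): Promise<string[]> {
  return Promise.all(features.map((f) => runSubagent(f)));
}

tonyWorkflow([{ path: "auth.feature" }, { path: "checkout.feature" }]).then(
  (results) => console.log(results),
);
```

The key design point is that each subagent only sees its own .feature file, so the main context stays small no matter how many units run in parallel.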

Specs First vs BDD

So is BDD better than spec first? No. In practice they can be used together. You can generate the specification (SpecKit, OpenSpec) and define what should be built. Then that spec becomes input for tony-bdd. The .feature scenarios are born from that material. The spec says what; BDD extracts the testable behavior.

Conclusion

This methodology isn't about teaching the agent how you want the code to be written. It creates tools to ensure that, once the code is correct and validated, you won't have regressions over time. This workflow should be used together with other techniques: skills, rules, guardrails, agents.md.

My result: once the outcome is ready and manually validated, reviewing the created interfaces, tests, and implementation takes less time than if I had to review everything in Spec First. Why? Because with everything validated, tests in place, and correct implementation, I can change the implementation freely. What I test is the interface, not the internals. That gives me flexibility to refactor without fear since it's all covered by tests. If I reach the point where I look at the implementation and see it's not as good as it could be, I can change it completely knowing the tests guarantee the behavior.

Results

These numbers come from the project where I applied the BDD workflow. A script cross-references the .feature scenarios with Vitest coverage reports; the 0 in Uncovered means every scenario has at least one associated test—none were left without coverage.

| Metric | Value |
| --- | --- |
| Feature files | 19 |
| BDD scenarios | 94 |
| Test files | 15 |
| Tests | 176 |
| Uncovered scenarios | 0 |
| Coverage (Vitest) | 42% |

Although 42% is low coverage, the idea isn't to be exhaustive in tests, but precise in what to test: ensure the behavior that matters, not cover every line.
