The Subtle Art of Herding Cats: How I Turned Chaos Into a Repeatable Test Process (Part 3 of 4)

Proof of Concept: Does This Actually Work?

In Part 2, I found that gold standards are more effective than rulebooks, and that lazy loading helps hold off Context Rot. Part 3 shows how that works in practice, with illustrative (entirely fictional) examples and an honest look at what happens when the cats meet reality.

The Universal BDD Vision: Two Car Companies Principle

The Core Philosophy

If two companies do the same thing, like BMW and Mercedes-Benz with car configurators, you can take a requirement from either and create the same BDD scenario.

The scenario shouldn't contain:

  • Implementation details: REST APIs, microservices, specific databases
  • System names: ConfiguratorService v2.1, PricingEngine, ValidationAPI
  • Technical artefacts: JSON responses, event handlers, component states

Instead, it should focus on:

  • User intent: What does the person want to accomplish?
  • User actions: What do they actually do?
  • Observable results: What do they see happen?

The code behind them is different, but the human need is identical.

The Problem: Implementation-Contaminated Scenarios

Here's what BDD scenarios look like when they're contaminated with implementation details:

# BMW's contaminated approach
Feature: BMW iDrive ConfiguratorService Integration [SPEC-BMW-123]
Background:
  Given the BMW ConnectedDrive API is initialized
  And the user authenticates via BMW ID OAuth
  And the PricingEngine microservice is available

Scenario: M Sport Package selection triggers pricing recalculation
  Given I have loaded the 3-series configurator via iDrive interface
  When I POST to /api/bmw/packages/m-sport with authentication headers
  Then the PricingCalculatorService should return updated totals
  And the frontend should display BMW-specific pricing components
  And the ConfiguratorState should persist to BMW backend systems

# Mercedes contaminated approach
Feature: Mercedes MBUX Configurator Integration [SPEC-MB-456]
Background:
  Given the Mercedes me connect platform is active
  And MBUX infotainment system is responsive
  And the pricing validation service confirms availability

Scenario: AMG package selection updates Mercedes pricing display
  Given I access the C-Class configurator through MBUX interface
  When the system processes AMG package selection via Mercedes API
  Then the integrated pricing module recalculates total cost
  And Mercedes-specific UI components reflect package changes
  And the selection persists in Mercedes customer profile system

The Problem: These scenarios test implementation details, not user behaviour. A tester needs different technical knowledge to understand the BMW and the Mercedes scenarios, even though users are performing exactly the same task in both.

📌 Universal Behavior Insight: When configuring a BMW 3-series or a Mercedes C-Class, users want to choose packages, check pricing updates, and identify conflicts. The implementation shows significant differences, but the user experience remains fundamentally the same.

The Solution: Universal, Human-Focused Scenarios

Here's what the same functionality looks like when focused on universal human behaviour:

# Works for BMW, Mercedes, Audi, or any car configurator
Feature: Vehicle Package Configuration [SPEC-123]

Scenario: Premium package selection updates pricing
  Given I am on the vehicle configuration page
  When I select the premium package
  Then I should see the updated total price
  And the premium package should be marked as selected

Scenario: Package conflict prevention
  Given I have selected a premium package
  When I attempt to select a conflicting economy package
  Then I should see a conflict warning message
  And the economy package should remain unselected
  And my original premium selection should be preserved

Scenario: Package removal affects pricing
  Given I have selected multiple packages
  When I remove the premium package
  Then the total price should decrease
  And the premium package should no longer appear selected
  And any dependent options should be automatically removed

Domain Configuration Separation

Behind the scenes, each company uses their specific domain configuration:

BMW Domain Config:

{
  "navigation_url": "https://bmw.com/configurator",
  "premium_package": "M Sport Package",
  "economy_package": "Efficiency Package",
  "api_endpoint": "BMW ConnectedDrive API",
  "pricing_currency": "EUR"
}

Mercedes Domain Config:

{
  "navigation_url": "https://mercedes-benz.com/configurator",
  "premium_package": "AMG Line Package",
  "economy_package": "Eco Package",
  "api_endpoint": "Mercedes me connect API",
  "pricing_currency": "EUR"
}

The same universal scenarios have different domain implementations. Testers understand them easily because they focus on human behaviour, not on technical details.
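
To make the separation concrete, here's a minimal sketch of how a universal step could be resolved against a brand's domain config, assuming Python, one JSON file per brand, and function names of my own invention rather than the article's actual tooling:

import json
from pathlib import Path

def load_domain_config(brand: str) -> dict:
    """Load the brand-specific domain configuration (e.g. domains/bmw.json)."""
    return json.loads(Path(f"domains/{brand}.json").read_text())

def resolve_step(step: str, config: dict) -> str:
    """Swap universal terms in a Gherkin step for the brand's own names."""
    replacements = {
        "premium package": config["premium_package"],
        "economy package": config["economy_package"],
    }
    for universal, specific in replacements.items():
        step = step.replace(universal, specific)
    return step

bmw = load_domain_config("bmw")
print(resolve_step("When I select the premium package", bmw))
# -> When I select the M Sport Package

The universal feature file never changes; only the config does, which is why the same scenarios can drive BMW, Mercedes, or any other configurator.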

Garbage In, Garbage Out

Something that became clear was that some of our tickets were not so good. They're written in a Given/When/Then format, but it's buried in a table and full of bullet points; rather than clean Gherkin, it's a blend of syntaxes. A single Then can carry ten compound results, and some of them contradict each other.

Example of a "Bad Ticket":

Given the user is on the config page 
When they add the M Sport Package 
Then
• The price updates
• The UI shows "M Sport"
• The PricingEngine service is called with SKU-123
• A confirmation modal appears (unless they are premium users)
• The total must not exceed credit limit in UserDB
• The economy package is disabled
• Loading spinner shows during calculation
• Error handling for network failures

This ticket mixes UI behaviour, API calls, business rules, and error handling, all in one "Then" clause. Some of the requirements clash with each other, and several test implementation details rather than user behaviour.

From this, I made the tool check the ticket before we let the AI try to make scenarios from it (a rough sketch of this gate follows the list below). It has to:

  • read it
  • assess it
  • apply the rules
  • let the user decide what to do:
    • Accept the badness
    • See what the scenarios would look like
    • Rewrite them using the single responsibility principle
    • Stop
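
Here's that rough sketch, assuming Python and two deliberately crude example rules (compound Then clauses and implementation leakage). The real assessment rules are richer, and the function and option names are mine, not the tool's:

from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept the badness"
    PREVIEW = "see what the scenarios would look like"
    REWRITE = "rewrite using the single responsibility principle"
    STOP = "stop"

def assess_ticket(ticket_text: str) -> list[str]:
    """Apply simple quality rules and return a list of findings."""
    findings = []
    then_clause = ticket_text.split("Then", 1)[-1]
    bullets = then_clause.count("•")
    if bullets > 3:
        findings.append(f"'Then' clause has {bullets} compound results")
    if any(term in ticket_text for term in ("API", "service", "SKU", "DB")):
        findings.append("implementation details leaking into the behaviour")
    return findings

def gate(ticket_text: str) -> Verdict:
    """Stop before generation and let the human pick one of the four options."""
    findings = assess_ticket(ticket_text)
    if not findings:
        return Verdict.PREVIEW
    print("Ticket problems found:")
    for finding in findings:
        print(f"  - {finding}")
    # In the real workflow a person chooses; defaulting to STOP keeps the human in charge.
    return Verdict.STOP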

The Complete Workflow: From Jira Ticket to Executable Tests

Task 1: Context Extraction in Action

When a Jira ticket arrives, the AI agent (loaded only with analysis rules and domain context) creates a structured conversation log:

## Requirements Analysis
- REQ-001: User can select vehicle packages
- REQ-002: Package selection updates total pricing
- REQ-003: Conflicting packages show warning messages
- REQ-004: Package removal updates pricing and dependencies

## Positive Test Scenarios Identified
- Premium package selection with pricing update
- Multiple package selection and total calculation
- Package upgrade scenarios

## Negative Test Scenarios Identified
- Conflicting package selection attempts
- Invalid package combinations
- Network error during selection

## Inferred Requirements (Agent Additions)
- Loading states during price calculation
- Confirmation for expensive package selections
- Package dependency validation
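A minimal sketch of how that structured log might be persisted so the next task can load it on its own, assuming one markdown file per ticket; the paths and headings are illustrative, not the tool's real format:

from pathlib import Path

def write_analysis_log(ticket_id: str, sections: dict[str, list[str]]) -> Path:
    """Persist Task 1's structured conversation log so Task 2 can load just this file."""
    log_path = Path(f"logs/{ticket_id}_analysis.md")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for heading, items in sections.items():
        lines.append(f"## {heading}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    log_path.write_text("\n".join(lines))
    return log_path

write_analysis_log("SPEC-123", {
    "Requirements Analysis": ["REQ-001: User can select vehicle packages"],
    "Positive Test Scenarios Identified": ["Premium package selection with pricing update"],
})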

Task 2: BDD Generation with Pattern-Led Prompting

The agent then loads BDD generation rules (and only those rules) plus the Task 1 output. It uses the gold standards approach to create clear scenarios that focus on business language and user actions.

The key insight: The AI already knows BDD structure. I didn't need to teach "Given-When-Then." I just needed to steer it toward business-focused language and user-observable outcomes.
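
In practice, that steering amounts to assembling a small, deliberate prompt. The sketch below assumes the gold standards and the Task 2 rules live in files named by me for illustration; the point is that nothing else gets loaded into context:

from pathlib import Path

def build_task2_prompt(task1_log: Path) -> str:
    """Load only the gold standards, the BDD rules, and the Task 1 output."""
    gold_standards = Path("rules/gold_standard_scenarios.feature").read_text()
    bdd_rules = Path("rules/task2_bdd_generation.md").read_text()
    analysis = task1_log.read_text()
    return "\n\n".join([
        "Write BDD scenarios in business language, mirroring these examples:",
        gold_standards,
        "Follow these rules exactly:",
        bdd_rules,
        "Requirements analysis from Task 1:",
        analysis,
    ])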

Task 3a: Behavioural Assessment - The Testing Filter

This is where the Context Smartness approach really shines. The agent loads only assessment criteria and applies strict behavioural filters:

Include for Automation:

  • Multi-step user workflows
  • Cross-component integration tests
  • Business process validation
  • State persistence across actions

Exclude from Automation:

  • Single component behaviour (unit test territory)
  • Subjective UX validation
  • Accessibility testing (specialised tools needed)
  • Performance without specific metrics

Golden Rule: Only test what you can control. Avoid putting product prices or names in automation since they change. Check that prices show correctly and names appear consistently. Focus on the user experience, not the system's internals. This aligns with "intent-based testing" principles, but I see it as common sense.
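
Below is a minimal sketch of that filter, assuming the assessment task has already tagged each scenario with the categories above; the tag strings and triage buckets are illustrative, not the tool's real output:

AUTOMATE = {
    "multi-step workflow",
    "cross-component integration",
    "business process validation",
    "state persistence",
}
DO_NOT_AUTOMATE = {
    "single component behaviour",
    "subjective ux",
    "accessibility",
    "performance without metrics",
}

def triage(scenario: dict) -> str:
    """Decide whether a generated scenario belongs in the automation backlog."""
    tags = set(scenario.get("tags", []))
    if tags & DO_NOT_AUTOMATE:
        return "manual or specialist testing"
    if tags & AUTOMATE:
        return "automation backlog"
    return "needs human review"

print(triage({"name": "Package conflict prevention", "tags": ["multi-step workflow"]}))
# -> automation backlog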

Task 3b: TAF Generation - From Human to Machine

The final task loads technical patterns and turns approved behaviour scenarios into working automation code.

Generated Test Automation Framework (TAF) Code:

Scenario: Premium package selection updates pricing
  Given I navigate to the package configuration page
  When I select the premium package option
  Then the pricing display should show updated costs
  And the premium package should appear selected

Generated Infrastructure Report:

## Required Page Objects
- PackageConfigurationPage
  - premiumPackageOption (data-testid="premium-package")
  - pricingDisplay (data-testid="pricing-total")
  - packageSelectionIndicator (data-testid="selected-indicator")

## Missing Step Definitions
- "I select the premium package option"
- "the premium package should appear selected"

The agent gets you 80-90% of the way there, then humans add the final details.
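
For example, filling in the two missing step definitions from the report might look roughly like this. The behave + Selenium stack and the PackageConfigurationPage class are my assumptions; the article doesn't name its actual TAF:

from behave import when, then
from selenium.webdriver.common.by import By

class PackageConfigurationPage:
    """Page object backing the steps; selectors come from the generated report."""
    def __init__(self, driver):
        self.driver = driver

    def select_premium_package(self):
        self.driver.find_element(
            By.CSS_SELECTOR, '[data-testid="premium-package"]').click()

    def premium_package_selected(self) -> bool:
        return self.driver.find_element(
            By.CSS_SELECTOR, '[data-testid="selected-indicator"]').is_displayed()

# context.driver is assumed to be set up in behave's environment.py hooks.
@when("I select the premium package option")
def step_select_premium(context):
    PackageConfigurationPage(context.driver).select_premium_package()

@then("the premium package should appear selected")
def step_premium_selected(context):
    assert PackageConfigurationPage(context.driver).premium_package_selected()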

Using State Diagrams So You (and the AI) Know What It Does

The Problem: AI Doesn't Know Your Application States

The LLM only knows what it knows. If you ask it to write API requests from a spec snippet, it will try. But the result often looks fine even though it's completely wrong, because it doesn't know the states of your application.

The Solution: Plain English State Description

I began explaining application states and process flows in plain English. I often used Figma designs since they show the actual state changes clearly.

Then I asked the agent to create Mermaid state diagrams from the scenarios:

graph TD
    A[Configuration Page] --> B[Premium Selected]
    B --> C[Pricing Updated]
    B --> D[Try Economy Selection]
    D --> E[Conflict Warning Displayed]
    E --> B

These diagrams showed missing state transitions that weren't clear in Jira stories but were visible in Figma designs. The AI became better at identifying incomplete workflows and suggesting additional test scenarios.
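
One cheap way to exploit the diagrams is a coverage check of which transitions actually have a scenario behind them. The sketch below assumes the transitions have already been pulled out of the Mermaid source as simple pairs; everything named here is illustrative:

DIAGRAM_TRANSITIONS = {
    ("Configuration Page", "Premium Selected"),
    ("Premium Selected", "Pricing Updated"),
    ("Premium Selected", "Try Economy Selection"),
    ("Try Economy Selection", "Conflict Warning Displayed"),
    ("Conflict Warning Displayed", "Premium Selected"),
}

# Transitions that an existing scenario already exercises.
COVERED_TRANSITIONS = {
    ("Configuration Page", "Premium Selected"),
    ("Premium Selected", "Pricing Updated"),
}

for source, target in sorted(DIAGRAM_TRANSITIONS - COVERED_TRANSITIONS):
    print(f"No scenario covers: {source} -> {target}")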

Making the Tasks Self-Documenting

I was much like a university student writing their software plan after the code. I had this amazing system, but only I knew what it did, and I'd only remember it for a while. So I asked the AI to produce flow charts, using Mermaid again.

This allows others to understand what it does without reading a bunch of pseudo code. It also lets me follow the paths through and debug problems. I quickly realised its value when the flow chart showed I kept loading the domain: I loaded it, checked it, made a decision, and then loaded the domain again for a more thorough check. 😒

Honest Assessment: What Actually Works

It's all new to me

I jumped into this without much preparation; I'll discuss that more in Part 4. The first version was a mess, but it worked and showed the approach was possible. When we needed to roll it out, I realised I couldn't, because it was tied to my own area of work.

I learned what worked as I went along.

We launched what I'll call version 2. This version is domain-aware and loads context. But it had some early bugs. One major issue was that ticket assessment would always fail because the domain context wasn't being loaded. It was definitely an "it works on my machine" problem.

I tried to strengthen the wording, but I know this only helps so much. The AI doesn't read like us; it sees everything as one long sentence. It also gives more weight to recent information.

I needed to change how I executed tasks. In V3, everything is pseudo code. I'm considering whether this could all become real code, with just a basic markdown file left for the AI.

Lesson 1: The Ripple Effect (F1 Car Analogy)

I'm a big fan of F1, more for the design and off-track stuff; the races can be pretty dull. What is clear is that changing the front wing affects other areas of the car.

Making the domain loading perfect had an unintended effect: the AI started treating it as an override for garbage tickets. It would still allow poor tickets through because it believed the extra domain context improved them. It didn't; it just meant they were nonsense with the correct names.

Lesson 2: You Can't Automate Quality Control

The domain stuff is seasoning on your pasta; it enhances the bland scenarios to reflect your specific business, but it cannot remedy fundamentally bad ingredients.

So, the assessment had to change to pseudo code so that it understood the rules. I did try putting rule 5 before rule 4 (changed the numbers and everything), but the AI ignored it!

Enforcing this meant a failing ticket stopped processing altogether. In a perfect world that wouldn't happen, because everyone would create perfect scenarios. So I had to make the hard stop optional.

Lesson 3: The Human Must Be the Final Arbiter

There's one thing the system has to adhere to: the human makes the decisions. That's why the generated test cases aren't limited; the tool produces everything, in priority order, and a person decides what to keep. When something goes wrong, nobody is going to tell off the AI.

So, that's where the options I mentioned earlier came from.

After this, rather than mess around, we changed all the other tasks to pseudo code.

Turning the AI on Itself: Unit Testing the Rules

I tested all these changes manually, which was frustrating. Then I got the AI to create some unit tests. It took good and bad examples, including tickets from outside my domain, generated expectations, and produced repeatable tests. Now, when the rules change, we can run these tests to check for regressions. We also recreate the Mermaid diagrams, so we can see whether the flow still makes sense.
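
A minimal sketch of what those repeatable tests could look like, assuming pytest and the hypothetical assess_ticket() gate sketched earlier; the example tickets and expectations here are illustrative, not the real fixtures:

from ticket_gate import assess_ticket  # hypothetical module holding the gate sketched earlier

GOOD_TICKET = """Given I am on the vehicle configuration page
When I select the premium package
Then I should see the updated total price"""

BAD_TICKET = """Given the user is on the config page
When they add the M Sport Package
Then
• The PricingEngine service is called with SKU-123
• The total must not exceed the credit limit in UserDB
• Error handling for network failures"""

def test_good_ticket_passes_assessment():
    assert assess_ticket(GOOD_TICKET) == []

def test_bad_ticket_is_flagged():
    findings = assess_ticket(BAD_TICKET)
    assert any("implementation" in finding for finding in findings)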

The Wins ✅

Consistency: Generated scenarios follow the same patterns every time. No more confusion about why one tester says "Given I navigate to" and another says "Given the user accesses."

Speed: Minutes instead of hours for complex features. What used to take an afternoon of careful scenario writing now happens in the time it takes to make coffee.

Creativity: The agent spots edge cases that humans often overlook. It often detects state changes, error conditions, and user journey differences not included in the original requirements. When you focus on user behaviour instead of technical details, you naturally uncover more realistic test scenarios. This is what intent-based testing advocates have always claimed.

Documentation: Creates the specifications that were missing. Generated BDD scenarios are often clearer than the original Jira tickets. They become the source of truth for what the feature does.

Onboarding: New team members understand features faster. Universal, behaviour-focused scenarios are self-documenting in ways that implementation-specific tests aren't.

The Ongoing Challenges ⚠️

Domain Drift: The AI loves to just go for it. Before you know it, domain-specific details are creeping into what should be universal patterns. You have to watch what the AI does whenever you change the rules.

Edge Case Handling: Still needs human review for unusual scenarios. The AI excels at common patterns but struggles with genuinely unique business logic.

Context Maintenance: Domain configurations need regular updates. As products evolve, the mappings between universal patterns and specific implementations require ongoing care.

What It Doesn't Fix ❌

Process Problems: Technical solutions don't fix workflow issues. If your requirements are unclear, arrive late, or change constantly, AI won't solve those basic communication problems.

Human Communication: Still need clear specs and acceptance criteria. The AI amplifies the quality of your inputs - it doesn't create clarity from chaos.

Domain Expertise: Agent can't replace understanding your business. It can apply patterns consistently, but someone still needs to know whether the business logic makes sense.

The Practical Implementation Guide

Create Your Gold Standards

  1. Pick your best existing BDD scenarios
  2. Clean them to perfection
  3. Document why they're good
  4. Use these as training examples

Build Task-Based Rules

  1. Extract minimal rules from gold standards
  2. Create focused rule sets per task
  3. Test with lazy loading approach
  4. Measure consistency improvements

Implement the Full Workflow

  1. Task 1: Context extraction and analysis
  2. Task 2: Human-readable BDD generation
  3. Task 3a: Behavioural assessment
  4. Task 3b: Automation code generation

Measure and Refine

  1. Compare generated vs manual scenarios
  2. Track consistency metrics
  3. Identify remaining edge cases
  4. Refine rules based on actual usage

What's Coming in Part 4

The framework works, the cats stay in formation, and the scenarios are consistent. But here's what really got me thinking: I accidentally solved fundamental AI problems that every developer faces.

It's annoying when you're talking to an AI and it takes off half-cocked, doing something you don't want. It turns out I wasn't alone in this frustration.

In Part 4, I'll reveal:

  • The Context Rot discovery: How I identified performance degradation months before it was documented
  • The market irony: Why simple solutions to real problems get overlooked while flashy AI tools get all the funding
  • What I learned about AI reliability that applies to any system trying to get consistent behavior
  • From frustrated tester to accidentally solving problems I didn't know had names

The real breakthrough wasn't just herding cats - it was discovering that my specific frustrations were actually universal AI challenges.


Paul Coles is a software tester who proved that universal BDD patterns work across domains when separated from implementation details. In Part 3, he demonstrates the complete framework in action with real examples and honest assessment of what works and what doesn't. His cat now stays mostly in the designated areas.

