Originally published at wintrover.github.io

Testing in the Age of AI Agents: How I Kept QA from Collapsing

AI agents changed my development tempo overnight. I can ship more code in a day than I used to in a week, and that sounds great until the first time a tiny edge case takes down an entire flow.

At that speed, QA becomes either a competitive advantage or a constant fire drill. I chose the former and rebuilt my testing approach in my Orchestrator project around a small set of test design techniques that scale with code volume:

  • TDD
  • EP-BVA (Equivalence Partitioning + Boundary Value Analysis)
  • Pairwise (Combinatorial Testing)
  • State Transition Testing

(Image: Testing Toolbox Diagram)


1. Why I Needed “Test Design,” Not Just “More Tests”

When code volume grows, the problem is not only “coverage.” The real problem is that the space of possible inputs and states grows faster than my time.

So I stopped asking:

  • “Did I write tests for this function?”

And I started asking:

  • “Did I select test cases that actually represent the failure surface?”

That mindset is what pushed me toward structured test design techniques.


2. TDD: Design for Testability from Day One

The Principle: TDD (Test-Driven Development) flips the traditional "write code, then test" workflow. It follows the Red-Green-Refactor cycle:

  1. Red: Write a test for a new requirement and watch it fail. This confirms the test actually checks something and that the requirement isn't already met.
  2. Green: Write the minimal amount of code to make the test pass. Avoid "over-engineering" at this stage.
  3. Refactor: Clean up the code while ensuring the tests stay green.

In Orchestrator:
Since AI agents can generate complex business logic rapidly, I used TDD to ensure that the logic was testable by design. For example, when implementing the RetryPolicy for our Temporal workflows, I started with the test cases for exponential backoff before writing a single line of the policy logic.

# Simplified TDD example for retry logic (Red phase: ExponentialRetry did not exist yet)
def test_retry_interval_calculation():
    policy = ExponentialRetry(base_delay=1.0, max_delay=10.0)
    # 1st attempt: 1.0s
    assert policy.get_delay(attempt=1) == 1.0
    # 2nd attempt: 2.0s
    assert policy.get_delay(attempt=2) == 2.0
    # Capped at 10.0s
    assert policy.get_delay(attempt=10) == 10.0

This forced me to separate the calculation of delays from the execution of the retry, making the system modular and robust.
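
For completeness, here is the kind of minimal Green-phase implementation that makes the test above pass. ExponentialRetry is my own class name from the test, not Temporal's built-in RetryPolicy; treat this as a sketch rather than the production code.

# Minimal "Green" implementation satisfying the test above (illustrative sketch)
class ExponentialRetry:
    def __init__(self, base_delay: float, max_delay: float):
        self.base_delay = base_delay
        self.max_delay = max_delay

    def get_delay(self, attempt: int) -> float:
        # Double the delay on each attempt (1.0s, 2.0s, 4.0s, ...)
        # and cap the result at max_delay.
        return min(self.base_delay * 2 ** (attempt - 1), self.max_delay)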


3. EP-BVA: Efficiency through Mathematical Selection

The Principle:

  • Equivalence Partitioning (EP): Instead of testing every possible value, you divide the input domain into groups (partitions) where the system is expected to behave identically. You only need to test one value from each group.
  • Boundary Value Analysis (BVA): Bugs often hide at the "edges" of these partitions. BVA focuses on testing the exact boundaries, and values just inside and outside of them.

In Orchestrator:
When handling user-uploaded files, we have strict size limits (e.g., 1MB to 10MB).

  • Partitions:
    • Invalid (< 1MB)
    • Valid (1MB - 10MB)
    • Invalid (> 10MB)
  • BVA Points: 0.99MB, 1.0MB, 1.01MB, 9.99MB, 10.0MB, 10.01MB.
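
Those partitions and boundary points map one-to-one onto a parametrized test. The sketch below assumes a hypothetical validate_file_size helper (the real validator in Orchestrator carries more context), but the input selection is exactly the EP-BVA set listed above:

import pytest

MB = 1024 * 1024

def validate_file_size(size_bytes: int) -> bool:
    # Hypothetical validator for illustration: accept 1MB to 10MB inclusive.
    return 1 * MB <= size_bytes <= 10 * MB

# One representative per partition, plus the boundary points.
@pytest.mark.parametrize("size_mb, expected", [
    (0.99, False),   # just below the lower boundary
    (1.00, True),    # lower boundary
    (1.01, True),    # just inside the lower boundary
    (9.99, True),    # just inside the upper boundary
    (10.00, True),   # upper boundary
    (10.01, False),  # just above the upper boundary
])
def test_file_size_boundaries(size_mb, expected):
    assert validate_file_size(int(size_mb * MB)) is expected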

A critical real-world example I applied was the 72-byte limit of bcrypt. Many developers don't realize that bcrypt ignores any characters after the 72nd byte.

# apps/backend/tests/test_auth_service.py
def test_password_length_boundaries(self, auth_service):
    # Boundary: 72 bytes
    p72 = "a" * 72
    h72 = auth_service.get_password_hash(p72)

    # Just above the boundary: 73 bytes
    p73 = p72 + "b"
    # Bcrypt treats p73 the same as p72 because only the first 72 bytes are hashed
    assert auth_service.verify_password(p73, h72) is True

By focusing on these specific points, I reduced hundreds of potential test cases to just 6-10 highly effective ones.


4. Pairwise: Taming the Combinatorial Explosion

The Principle: Most bugs are caused by either a single input parameter or the interaction between two parameters. Pairwise Testing is a combinatorial method that ensures every possible pair of input parameters is tested at least once. This drastically reduces the number of test cases while maintaining high defect detection.

In Orchestrator:
Our AI Inference engine has multiple configuration axes:

  • Execution Provider: [CUDA, CPU, OpenVINO] (3)
  • Model Size: [Small, Medium, Large] (3)
  • Quantization: [INT8, FP16, FP32] (3)
  • Async Mode: [Enabled, Disabled] (2)

Total combinations: 3 × 3 × 3 × 2 = 54 cases.
Using Pairwise, we can cover all interactions between any two settings in roughly 12-15 cases.

# Using allpairspy to generate the matrix
from allpairspy import AllPairs

parameters = [
    ["CUDA", "CPU", "OpenVINO"],
    ["Small", "Medium", "Large"],
    ["INT8", "FP16", "FP32"],
    ["Enabled", "Disabled"]
]

for i, combo in enumerate(AllPairs(parameters)):
    print(f"Test Case {i}: {combo}")

This allows us to maintain high confidence in our hardware compatibility matrix without running the full 54-case suite on every PR.
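
In CI, the generated combinations can be fed straight into pytest. This is a sketch under the assumption of a simple smoke-test entry point; run_smoke_inference is a hypothetical stand-in, not our real engine API:

import pytest
from allpairspy import AllPairs

PARAMETERS = [
    ["CUDA", "CPU", "OpenVINO"],   # Execution Provider
    ["Small", "Medium", "Large"],  # Model Size
    ["INT8", "FP16", "FP32"],      # Quantization
    ["Enabled", "Disabled"],       # Async Mode
]

# Materialize the reduced matrix once; AllPairs guarantees that every
# pair of values across any two axes appears in at least one row.
PAIRWISE_CASES = [tuple(case) for case in AllPairs(PARAMETERS)]

def run_smoke_inference(provider, size, quant, async_mode) -> bool:
    # Hypothetical stand-in for invoking the real inference engine.
    return True

@pytest.mark.parametrize("provider, size, quant, async_mode", PAIRWISE_CASES)
def test_inference_config_matrix(provider, size, quant, async_mode):
    assert run_smoke_inference(provider, size, quant, async_mode)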


5. State Transition Testing: Mapping the Life of a Process

The Principle: This technique is used when the system's behavior depends on its current state and the events that occur. We map out a State Transition Diagram and ensure that:

  1. All valid transitions are possible.
  2. All invalid transitions are properly blocked (Negative Testing).
  3. The system ends in the correct final state.

In Orchestrator:
The KYC (Know Your Customer) verification workflow is a complex state machine. A user's document moves through:
PENDING → UPLOADING → PROCESSING → VERIFIED or REJECTED.

I implemented tests to ensure a REJECTED document cannot suddenly jump to VERIFIED without going through PROCESSING again.

# apps/backend/tests/test_integration_kyc_workflow.py
import pytest

def test_invalid_state_transitions(workflow_engine):
    workflow_engine.set_state(ImageStatus.REJECTED)

    # This should be blocked by the business logic
    with pytest.raises(IllegalStateError):
        workflow_engine.transition_to(ImageStatus.VERIFIED)

This is crucial for AI agents that might try to "short-circuit" logic. By strictly testing the state machine, we ensure the integrity of the entire business process.
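
For reference, the guard itself can be as simple as a whitelist of transitions. This is a minimal sketch of the pattern; the real engine in Orchestrator is more involved, but the names mirror the test above:

from enum import Enum, auto

class ImageStatus(Enum):
    PENDING = auto()
    UPLOADING = auto()
    PROCESSING = auto()
    VERIFIED = auto()
    REJECTED = auto()

class IllegalStateError(Exception):
    pass

# Whitelist of valid transitions; anything not listed is blocked by default.
VALID_TRANSITIONS = {
    ImageStatus.PENDING: {ImageStatus.UPLOADING},
    ImageStatus.UPLOADING: {ImageStatus.PROCESSING},
    ImageStatus.PROCESSING: {ImageStatus.VERIFIED, ImageStatus.REJECTED},
    ImageStatus.REJECTED: {ImageStatus.PROCESSING},  # re-verification must re-enter PROCESSING
    ImageStatus.VERIFIED: set(),                     # terminal state
}

class WorkflowEngine:
    def __init__(self, state: ImageStatus = ImageStatus.PENDING):
        self.state = state

    def set_state(self, state: ImageStatus) -> None:
        # Test helper: place the machine in an arbitrary state.
        self.state = state

    def transition_to(self, new_state: ImageStatus) -> None:
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise IllegalStateError(f"{self.state.name} -> {new_state.name} is not allowed")
        self.state = new_state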


Conclusion

In the AI-agent era, code is cheap. Trust is not.

What kept my QA from collapsing was not writing more tests, but adopting test design techniques that scale:

  • TDD for fast feedback and safer refactors
  • EP-BVA to systematize edge cases
  • Pairwise to tame combinatorial growth
  • State Transition Testing to validate real workflows

This is the testing toolbox I expect to keep using as my code volume keeps accelerating.
