In Part 2 of this series, we looked at boundary mutations — the category with the highest detection rate (63.8%). The numbers were reassuring, with a catch: 36.2% of boundary bugs still survived, and the ones that slipped through were the ones that mattered most.
Logic mutations are a harder problem.
In our benchmark of 195 AI-run sessions against the SDET Code challenge library, logic bugs were caught only 47.5% of the time. That is the second lowest detection rate of any category, beaten to the bottom only by type-related bugs at 28.6%.
What does that mean concretely? It means that if you injected a hundred plausible operator mutations into your production code and relied on your existing test suite to catch them, more than half would ship.
The reason is mechanical. Boundary mutations break obvious things — edge values produce obviously wrong outputs. Logic mutations break subtle things — the output is plausible, the function runs, all existing assertions still pass, and the bug only manifests under specific combinations of inputs that your test matrix happened not to cover.
This article is about that second category. How logic mutations work, why standard test design misses them, and the systematic techniques that close the gap.
The Four Shapes of Logic Mutations
Logic mutations fall into four common patterns. Each one looks different at the code level, but they share a property: the resulting code is still syntactically valid and semantically plausible.
1. Operator Swap
The simplest form. One comparison operator is replaced with a neighboring one.
```python
# Original
if user_age >= 18 and country_code == "US":
    return True

# Mutation: >= becomes >
if user_age > 18 and country_code == "US":
    return True
```
The function still compiles. It still returns a boolean. The only difference is behavior when user_age == 18.
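The kill test is the boundary value itself: an assertion at exactly `user_age == 18` distinguishes the two versions, while any other age passes both. A minimal sketch (the function names here are illustrative wrappers, not from the original code):

```python
def is_adult_us(user_age, country_code):
    # Original condition: inclusive comparison
    return user_age >= 18 and country_code == "US"

def is_adult_us_mutant(user_age, country_code):
    # Mutation: >= became >
    return user_age > 18 and country_code == "US"

def test_exact_boundary_age():
    # The single input where original and mutant disagree
    assert is_adult_us(18, "US") == True
```

The mutant returns `False` for this input, so the test fails against it; at 17 or 19 the two versions are indistinguishable.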
2. Logical Connective Swap
The and becomes or, or vice versa.
```python
# Original
if user_is_premium and cart_total > 100:
    apply_free_shipping()

# Mutation: and becomes or
if user_is_premium or cart_total > 100:
    apply_free_shipping()
```
The intent was: premium customers AND high-value orders get free shipping. The mutation says: either one is enough. Now every premium user gets free shipping regardless of cart value, and every high-value cart gets free shipping regardless of membership. Revenue impact: silent.
3. Condition Inversion
A condition is negated.
```python
# Original
if payment_status == "success":
    send_receipt()

# Mutation: == becomes !=
if payment_status != "success":
    send_receipt()
```
Receipts now go out for failed payments. Successful payments get silence. This is not a theoretical example — bugs of exactly this shape have shipped to production in real systems at scale.
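One way to make this decision observable in a test is to extract the condition into a pure function. A sketch — `should_send_receipt` is a hypothetical refactor of the snippet above, not part of the original code:

```python
def should_send_receipt(payment_status):
    # Pure decision function: extracted so the condition is testable
    # without mocking the receipt-sending side effect
    return payment_status == "success"

def test_receipt_on_success():
    assert should_send_receipt("success") == True

def test_no_receipt_on_failure():
    assert should_send_receipt("failed") == False
```

Because the inversion flips the result for every input, either test alone kills it — but only a suite that asserts both directions documents the intent.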
4. Branch Removal
An entire logical branch is deleted.
```python
# Original
def calculate_fee(amount: float, account_type: str) -> float:
    if account_type == "premium":
        return 0.0
    elif account_type == "standard":
        return amount * 0.025
    else:
        return amount * 0.05

# Mutation: premium branch removed
def calculate_fee(amount: float, account_type: str) -> float:
    if account_type == "standard":
        return amount * 0.025
    else:
        return amount * 0.05
```
Premium accounts now fall through to the else branch and pay the 5% default fee. The function still returns a number. Every existing test that runs calculate_fee(100, "standard") or calculate_fee(100, "unknown") still passes.
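Killing branch removal requires one assertion per tier — a suite that exercises only two of the three tiers cannot see the missing branch. A sketch, with the mutant written out alongside the original for comparison:

```python
def calculate_fee(amount, account_type):
    if account_type == "premium":
        return 0.0
    elif account_type == "standard":
        return amount * 0.025
    else:
        return amount * 0.05

def calculate_fee_mutant(amount, account_type):
    # Mutation: premium branch removed; premium falls into the else
    if account_type == "standard":
        return amount * 0.025
    else:
        return amount * 0.05

def test_every_tier():
    # One assertion per tier; the premium assertion is the kill test
    assert calculate_fee(100, "premium") == 0.0
    assert calculate_fee(100, "standard") == 2.5
    assert calculate_fee(100, "unknown") == 5.0
```

Against the mutant, `calculate_fee_mutant(100, "premium")` returns 5.0 instead of 0.0, so `test_every_tier` fails — the other two assertions pass against both versions.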
All four of these mutations have something in common: a test suite that never deliberately probes the specific combination of inputs that distinguishes the correct behavior from the mutant will pass against both.
Why "Coverage" Misses These
Line coverage tools will report 100% for a test suite that misses every logic mutation above. That is not a flaw in the tooling — the tool is doing exactly what it says. The test ran every line. It just did not distinguish correct output from incorrect output.
Here is the concrete version of the problem. Take the free-shipping example:
```python
def should_offer_free_shipping(user_is_premium: bool, cart_total: float) -> bool:
    if user_is_premium and cart_total > 100:
        return True
    return False
```
A test suite with 100% line coverage:
```python
def test_premium_high_cart():
    assert should_offer_free_shipping(True, 150) == True

def test_not_premium_low_cart():
    assert should_offer_free_shipping(False, 50) == False
```
Every line of the function runs across these two tests. Coverage tool says 100%. Now inject the and → or mutation:
```python
if user_is_premium or cart_total > 100:
    return True
```
- `test_premium_high_cart`: `True or True` → `True`. Passes.
- `test_not_premium_low_cart`: `False or False` → `False`. Passes.
The mutation survives. The line coverage number was meaningless in the face of this bug.
The problem is that logic mutations live in the space between inputs, not at the lines. You need tests that specifically target the distinguishing conditions. For an and → or swap, that means testing the combinations where one operand is true and the other is false — the two cases that produce different outputs between the correct and mutated versions.
The Truth Table Technique
The most reliable way to kill connective mutations is the truth table method. For every compound boolean condition, write tests that cover every combination of the operands' truth values.
For A and B, the truth table has four rows:
| A | B | A and B (expected) |
|---|---|---|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | F |
A test suite that covers all four rows has killed and vs or mutations by definition. Here is what that looks like:
```python
# Row TT — both true
def test_premium_and_high_cart():
    assert should_offer_free_shipping(True, 150) == True

# Row TF — premium but low cart (distinguishes and from or)
def test_premium_but_low_cart():
    assert should_offer_free_shipping(True, 50) == False

# Row FT — not premium but high cart (distinguishes and from or)
def test_not_premium_but_high_cart():
    assert should_offer_free_shipping(False, 150) == False

# Row FF — neither
def test_neither_premium_nor_high_cart():
    assert should_offer_free_shipping(False, 50) == False
```
Under the and → or mutation:
- Row TF produces `True` instead of `False`. Mutation killed.
- Row FT also produces `True` instead of `False`. Mutation killed.
Two tests that looked redundant were the ones doing the real work. This is the pattern. In almost every logic mutation kill, there is a test that feels like it "tests the same thing" as another — right up until it is the one that exposes the bug.
The truth table technique generalizes. For A or B, the or → and mutation is killed by the same two mixed rows — T, F and F, T — where or returns True but and returns False. For nested conditions, the number of rows multiplies. For (A and B) or C, the full table has eight rows; in practice you need at least the rows where the result differs between the correct version and any plausible mutation.
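For nested conditions, the distinguishing rows can be enumerated mechanically rather than by inspection. A sketch for (A and B) or C, using one plausible mutation (the inner and becoming or) — the helper names are ours:

```python
from itertools import product

def correct(a, b, c):
    return (a and b) or c

def mutant_inner_or(a, b, c):
    # Plausible mutation: inner "and" became "or"
    return (a or b) or c

def distinguishing_rows(f, g):
    # Rows of the 8-row truth table where the two versions disagree;
    # a test asserting the correct output on any of these kills the mutant
    return [(a, b, c)
            for a, b, c in product([True, False], repeat=3)
            if f(a, b, c) != g(a, b, c)]

rows = distinguishing_rows(correct, mutant_inner_or)
# Only rows where c is False and exactly one of a, b is True differ:
# (True, False, False) and (False, True, False)
```

Out of eight rows, only two can kill this mutant — which is exactly why an unfocused suite can run all eight lines of a nested condition and still miss it.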
Equivalence Partitioning Doesn't Save You
A common response when engineers see the truth table requirement is to push back: "That is a lot of tests. Equivalence partitioning says we do not need to test every combination."
Equivalence partitioning is a good technique for input coverage. It tells you that if the function treats values 18, 25, and 45 identically (all "adult"), you only need one test from that partition.
It does not help with logic mutations.
Because the mutation is in the connective, not the input. Premium and non-premium are different partitions on the user dimension. High-cart and low-cart are different partitions on the total dimension. A truth table test is not redundant with a partition-based test — it is testing something orthogonal: whether the combination of partitions produces the correct output.
Mutation testing surfaces the gap that equivalence partitioning was never designed to cover. The two techniques are complementary, not competing.
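The point can be made executable. One representative per partition, chosen the "obvious" way, satisfies both the correct function and the mutant — so the partition-based suite proves nothing about the connective. A sketch reusing the free-shipping example:

```python
def should_offer_free_shipping(user_is_premium, cart_total):
    return user_is_premium and cart_total > 100

def should_offer_free_shipping_mutant(user_is_premium, cart_total):
    # Mutation: and became or
    return user_is_premium or cart_total > 100

# One representative per partition pair, picked the intuitive way
partition_cases = [
    ((True, 150), True),    # premium x high-cart
    ((False, 50), False),   # non-premium x low-cart
]

for args, expected in partition_cases:
    assert should_offer_free_shipping(*args) == expected
    # The mutant passes the exact same cases:
    assert should_offer_free_shipping_mutant(*args) == expected
```

The mixed cases — `(True, 50)` and `(False, 150)` — are where the two versions disagree, and neither appears in the partition-based selection.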
A Harder Example: Nested Logic
Here is a function with nested logic that is harder to test systematically:
```python
def can_withdraw(
    balance: float,
    daily_limit_used: float,
    account_is_frozen: bool,
    kyc_verified: bool
) -> bool:
    """
    Allow withdrawal if:
    - Account is not frozen, AND
    - KYC verified, AND
    - Either balance is above $500 OR daily limit remaining is above $200
    """
    if account_is_frozen:
        return False
    if not kyc_verified:
        return False
    daily_limit_remaining = 1000 - daily_limit_used
    return balance > 500 or daily_limit_remaining > 200
```
Mutations that a mutation testing system might inject:
Mutation A — if account_is_frozen: becomes if not account_is_frozen:. Frozen accounts can withdraw; unfrozen cannot. Obvious catastrophe. Easy to catch with a single test on a frozen account.
Mutation B — if not kyc_verified: becomes if kyc_verified:. Unverified users can withdraw; verified users cannot. Same shape as above.
Mutation C — balance > 500 or daily_limit_remaining > 200 becomes balance > 500 and daily_limit_remaining > 200. The or becomes and. Accounts that should qualify through either condition now need both.
Mutation D — balance > 500 becomes balance >= 500. A $500 balance now qualifies. Boundary mutation.
Mutation E — daily_limit_remaining > 200 becomes daily_limit_remaining < 200. Inverts the condition. Accounts with low remaining limit now qualify; accounts with high remaining limit do not.
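To make the kill analysis below concrete, Mutation C — the one that turns out to matter most — can be written out as a full function you can run any suite against. A sketch; `can_withdraw_mutant_c` is our name for it, not part of the original code:

```python
def can_withdraw_mutant_c(balance, daily_limit_used,
                          account_is_frozen, kyc_verified):
    # Identical to can_withdraw except the final "or" became "and"
    if account_is_frozen:
        return False
    if not kyc_verified:
        return False
    daily_limit_remaining = 1000 - daily_limit_used
    return balance > 500 and daily_limit_remaining > 200
```

An account qualifying through only one of the two conditions — high balance with the daily limit nearly used up, or a modest balance with plenty of limit remaining — is where this mutant diverges from the original.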
A test suite focused on "happy path" and "one failure per guard":
```python
def test_successful_withdrawal():
    assert can_withdraw(1000, 0, False, True) == True

def test_frozen_account_denied():
    assert can_withdraw(1000, 0, True, True) == False

def test_unverified_account_denied():
    assert can_withdraw(1000, 0, False, False) == False

def test_low_balance_and_low_limit():
    assert can_withdraw(300, 900, False, True) == False
```
Kill analysis:
- Mutation A: `test_frozen_account_denied` expects `False`, mutant returns `True` (because `not account_is_frozen` is `False` for a frozen account, skipping the early return). Kills.
- Mutation B: `test_unverified_account_denied` kills it.
- Mutation C (`or` → `and`): `test_successful_withdrawal` has `balance=1000` (>500) AND `daily_limit_remaining=1000` (>200). Both conditions true, so `and` still returns `True`. Survives.
- Mutation D (`>` → `>=`): No test uses `balance == 500`. Survives.
- Mutation E: `test_low_balance_and_low_limit` has `balance=300` (not > 500) and `daily_limit_remaining=100` (less than 200, so the inverted condition `< 200` evaluates to `True`). Under the mutation, the return becomes `False or True` = `True`. Expected `False`. Fails. Kills.
Kill ratio: 3 out of 5.
Now here is the same suite with targeted logic tests added:
```python
# Kill Mutation C: qualify via balance only
def test_high_balance_but_low_daily_remaining():
    assert can_withdraw(1000, 900, False, True) == True
    # balance=1000 (>500): qualifies
    # daily_limit_remaining=100 (not >200): does not qualify via limit
    # "or" returns True. Mutation "and" returns False. Kills.

# Kill Mutation C again: qualify via limit only
def test_low_balance_but_high_daily_remaining():
    assert can_withdraw(300, 0, False, True) == True
    # balance=300 (not >500): does not qualify via balance
    # daily_limit_remaining=1000 (>200): qualifies via limit
    # "or" returns True. Mutation "and" returns False. Kills.

# Kill Mutation D: boundary test on balance
def test_balance_exactly_at_boundary():
    assert can_withdraw(500, 900, False, True) == False
    # balance=500 (not >500): does not qualify
    # daily_limit_remaining=100 (not >200): does not qualify
    # Returns False. Mutation ">=" returns True for balance=500. Kills.
```
Three targeted tests close the gap. Each one targets a specific mutation class by probing a specific combination of input states.
Why This Matters for AI-Generated Code
There is a reason I am writing about logic mutations now, in the specific context of AI.
The benchmark I mentioned at the top of this article — 47.5% detection rate for logic bugs — was measured by running AI models through the challenge library as test writers. The interesting asymmetry is on the other side: when AI models generate code rather than tests, logic bugs are the most common failure mode we see.
GPT-class models are quite good at boundary handling when the boundary is stated explicitly in the prompt ("fee of 2.5% for orders above $100"). They are quite bad at logic correctness when multiple conditions interact — the exact domain where truth-table thinking is required.
Common patterns we observed in AI-generated code:
- `and` where `or` was intended when combining permission checks
- Negation inconsistencies in guard clauses (especially with `not in` vs `not` outside an `in`)
- Operator swaps in range checks (`< limit` where `<= limit` was meant)
- Dropped branches where a specification had three tiers but the generated code covered only two
If you have ever reviewed code from an AI assistant and had the vague feeling that "something is off" without being able to immediately point to what — there is a high chance it was a logic mutation shape. The code reads cleanly, the variables are well-named, the types line up. The bug is in the invisible space between the operators.
This is why mutation-style test thinking is becoming a critical skill for working with AI-generated code. The bug patterns are shifting, but the detection technique — adversarial probing of the specific combinations that distinguish correct from incorrect — is the same.
Try It on AI-Generated Code
We built a practice mode around exactly this skill. AI Verifier on SDET Code gives you functions generated by GPT-class models, some of which contain intentional subtle logic bugs. Your job is to design inputs — in a lightweight, non-pytest format — that expose the incorrect behavior.
It is a different interaction model from the mutation-scoring mode. Instead of writing a pytest suite and reading a kill ratio, you propose inputs, see the function's output, and identify whether the output matches the specification. The skill being practiced is the same one: probing the distinguishing conditions that separate correct logic from plausible-looking mutations.
The problem library includes specifically the logic mutation classes covered in this article — operator swaps, connective swaps, condition inversions, branch removals — applied to realistic business functions across fintech, e-commerce, and platform domains.
Everything runs in your browser (Pyodide + WebAssembly, no install needed). Free to try without signing up.
Recap
Logic mutations — wrong operators, swapped connectives, inverted conditions, removed branches — were the second-hardest category in our benchmark, with a 47.5% detection rate. They survive coverage-based test design because the mutated code still runs, still returns the right type, and often still passes inputs the test suite happened to choose.
The truth table technique closes most of the connective gap. For every compound boolean condition, test each combination of operand truth values. The two rows where operands disagree — T, F and F, T — are what distinguish and from or. Skip those rows and you cannot tell the two apart by observation.
For nested or multi-condition logic, the technique generalizes: identify the specific input combinations where the correct and mutated versions would produce different outputs, and write tests that hit exactly those combinations. Boundary triplets still apply at the leaf operators.
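When the relevant input space is small, that identification step can itself be automated: diff the correct version against a candidate mutant over every combination and keep the inputs where they disagree. A small generic sketch (the helper name is ours):

```python
from itertools import product

def killing_inputs(correct, mutant, input_space):
    # Inputs on which the two versions disagree -- a test asserting
    # the correct output on any one of these kills the mutant
    return [args for args in input_space
            if correct(*args) != mutant(*args)]

# Example: the and -> or connective swap on two operands
rows = killing_inputs(
    lambda a, b: a and b,
    lambda a, b: a or b,
    product([True, False], repeat=2),
)
# rows == [(True, False), (False, True)] -- the two mixed rows
```

For boolean operands this is just the truth table method mechanized; for numeric inputs you would sample boundary values and partition representatives rather than enumerate exhaustively.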
None of this is new theory. It is deliberate application of logic, not a trick. But it becomes an automatic habit only through practice — which is why benchmarks show this category getting missed at a high rate even by engineers who could explain the technique if asked.
The next time you write a test that feels redundant with an earlier one, check: is it the F-T row to the other's T-F? That redundancy might be the only thing standing between your suite and a silent logic bug.
This is Part 3 of the "Mutation Testing for QA Engineers" series. Part 4 will cover the AI Verifier workflow in depth — how to design input probes that reliably catch AI-generated logic bugs in realistic business code.