Leon Pennings

Originally published at blog.leonpennings.com

Software Testing: You’re Probably Doing It Wrong

Software testing has become one of the most ritualized practices in modern development.

That is not because testing is unimportant. Quite the opposite.

Testing matters.

But in many teams, testing has quietly expanded beyond its actual role. It is no longer treated as a tool for verifying software behavior. It is increasingly treated as a proxy for understanding, a proxy for design, and even a proxy for quality itself.

And that is where the problem begins.

Because testing can verify behavior.

But it cannot replace engineering.


Testing in Software Is a Verification Discipline

At its core, testing in software has a very specific role:

to verify whether a system behaves acceptably under certain conditions.

That is valuable. Necessary, even.

A good test can help answer questions like:

  • Does this behavior still work?

  • Does this input still lead to the expected output?

  • Did this change introduce a regression?

That is where testing is strong.

But notice what testing does not answer:

  • Is the design coherent?

  • Is the architecture proportional to the problem?

  • Is the model a good representation of the domain?

  • Is this implementation economical to evolve?

Those are engineering questions.

And when teams start treating test suites as if they answer them, behavioral verification gets confused with software quality itself.

That is a costly mistake.


The Math Test Problem

A great deal of modern team testing resembles cheating on a math exam.

Imagine students defining the exam questions during class, together with the teacher, while learning the material. By the time the exam arrives, the goal is no longer to understand the mathematics. The goal is to reproduce the answers that were already agreed upon.

Something very similar happens in software teams.

During refinement, development, or collaborative scenario-writing sessions, expected behavior is often defined in detail in advance. Tests are written, scenarios are formalized, and the team aligns around them.

In theory, this sounds excellent.

In practice, it introduces a subtle distortion:

the implementation target shifts from understanding the business domain to passing the agreed test scenarios.

That is a very different goal.

The result is not necessarily a bad system. But it is often a system optimized for compliance rather than understanding.

And the danger is obvious:

how often is the first interpretation of a business need fully correct?

If the test scenarios are based on incomplete understanding, then all the rigor in the world only helps build the wrong thing more reliably.


Verification Is Not Validation

This is the distinction many teams lose.

Testing is very good at verification:

  • Did the implementation behave as intended?

  • Does the system still behave as expected?

But verification is not the same as validation:

  • Was the right thing built?

  • Is this actually a fitting solution for the domain?

A system can satisfy every agreed scenario and still be fundamentally wrong.

It can behave correctly while being poorly modeled.

It can produce the expected output while being overcomplicated.

It can pass every acceptance test while solving the wrong problem in the wrong way.

In other words:

A passing test suite proves behavioral agreement—not solution fitness.

And that distinction matters far more than many teams admit.
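The gap between the two can be made concrete with a minimal sketch. The names and the shipping rule below are hypothetical, invented purely for illustration:

```python
# Hypothetical agreed scenario: "free shipping above 100".
# Both the test and the code encode exactly that number.
def shipping_cost(order_total):
    return 0 if order_total > 100 else 5

def test_agreed_scenario():
    assert shipping_cost(150) == 0  # passes: verification succeeds
    assert shipping_cost(80) == 5   # passes: verification succeeds

# But suppose the business actually meant "above 100 *including*
# tax". Every agreed test still passes, and the wrong rule ships.
# Verification cannot surface this; only validation against the
# domain can.
```

The suite is green, the scenario is satisfied, and the system is still wrong.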


The Ferrari in the Field

A Ferrari F40 can absolutely move across a field.

It can produce motion. It can get from one side to the other. It can, in the most literal sense, “do the job.”

That does not make it a tractor.

The same is true in software.

A system can satisfy all functional expectations and still be the wrong machine for the domain. It can be too expensive to change, too fragile to extend, too over-engineered for the actual need, or too structurally rigid to survive evolving business requirements.

Testing does not expose that.

Because testing can tell whether the machine moves.

It cannot tell whether it is the right machine.

And that is not a trivial distinction.

That is the distinction between working software and good engineering.


When Tests Stop Following Behavior and Start Following Structure

This is where testing often becomes actively harmful.

If testing is a behavioral verification discipline, then it should limit itself to verifying behavior.

But many modern testing practices go deeper than that.

They start testing:

  • local call structures

  • internal collaborations

  • class-level decomposition

  • implementation fragments in isolation

At that point, the tests are no longer verifying the system in any meaningful way.

They are verifying the current shape of the code.

That is not the same thing.

And once that happens, the test suite stops protecting change and starts resisting it.

The moment a test depends on how the behavior is achieved instead of what behavior is observed, it becomes a brake on refactoring.

That is one of the most under-discussed quality problems in software teams.

Because now every structural improvement becomes expensive:

  • rename a collaborator → tests break

  • merge responsibilities → tests break

  • simplify orchestration → tests break

  • move logic to a better abstraction → tests break

Not because behavior changed.

But because the test suite was never really about behavior to begin with.
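The difference shows up clearly in a small sketch. The classes and the tax rule here are hypothetical, chosen only to contrast the two testing styles:

```python
from unittest.mock import Mock

# Hypothetical domain: an order total with tax applied.
class TaxService:
    def tax_for(self, amount):
        return amount * 0.21

class Checkout:
    def __init__(self, tax_service):
        self._tax = tax_service

    def total(self, amount):
        return amount + self._tax.tax_for(amount)

# Structure-coupled test: pins HOW the result is produced.
# Inline the tax rule or rename tax_for, and this breaks,
# even though the observable behavior is identical.
def test_checkout_calls_tax_service():
    tax = Mock()
    tax.tax_for.return_value = 21.0
    assert Checkout(tax).total(100.0) == 121.0
    tax.tax_for.assert_called_once_with(100.0)

# Behavior-level test: pins WHAT the caller observes.
# Any internal restructuring that preserves the total still passes.
def test_checkout_total_includes_tax():
    assert Checkout(TaxService()).total(100.0) == 121.0
```

The first test fails under refactoring; the second fails only under behavioral change. Only the second protects what actually matters.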


Why Isolated Class Testing Often Misses the Point

One of the clearest examples of this problem is isolated class testing.

A class exists in code. Therefore, many teams assume it should be testable independently.

But a technical unit is not automatically a meaningful behavioral unit.

That assumption is rarely challenged.

Take something like a PDF information extractor.

That behavior does not meaningfully exist in a vacuum. It depends on:

  • parsing logic

  • normalization logic

  • extraction rules

  • object interpretation

  • domain-level decisions

Yet what often happens?

A single class gets tested in isolation.

Its collaborators are mocked.

Its environment is simulated.

Its context is stripped away.

Now the test no longer asks:

“Can the system reliably extract useful information from PDFs?”

Instead, it asks something far weaker:

“Does this one implementation fragment behave under synthetic scaffolding?”

That is not meaningful verification.

That is structural rehearsal.

And the cost is not just conceptual—it is practical.

Because now the test suite is coupled to a local decomposition that may not even survive the next decent refactor.

We end up with a test suite that passes perfectly even if the integration between those fragments is fundamentally broken—because we’ve tested the components, but ignored the composition.
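A simplified sketch of the same shape, with real PDF parsing replaced by plain text so the example stays self-contained (the pipeline steps and names are hypothetical):

```python
import re

# Hypothetical pipeline: parse -> normalize -> extract.
def parse(raw):
    return raw.decode("utf-8")

def normalize(text):
    return " ".join(text.split()).lower()

def extract_invoice_number(text):
    match = re.search(r"invoice no\.\s*(\d+)", text)
    return match.group(1) if match else None

# The behavioral unit is the composition, not any single step.
def invoice_number_from(raw):
    return extract_invoice_number(normalize(parse(raw)))

# One composed test catches integration breakage that three
# mocked per-function tests would each pass over: if normalize
# stopped lowercasing, each fragment test could still be green
# while this one fails.
def test_extracts_invoice_number():
    raw = b"  Invoice   No. 4711 \n Total: EUR 100"
    assert invoice_number_from(raw) == "4711"
```

Testing at the composed boundary leaves the internal decomposition free to change.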


Coverage Is Not Confidence

Test coverage is another example of verification ritual turning into proxy engineering.

Coverage has become a metric in its own right.

Teams report it. Managers ask for it. Pipelines display it as if it were a signal of quality.

But coverage says only one thing:

this code was executed while a test ran.

That’s it.

It does not tell you:

  • whether the test is meaningful

  • whether important behavior is protected

  • whether the assertions matter

  • whether the design is safe to evolve

And yet teams optimize for it anyway.

That leads to the usual absurdities:

  • getter/setter tests

  • trivial constructor tests

  • one-line branch inflation

  • synthetic assertions written only to satisfy the metric

This is not quality.

It is administrative theater.

Coverage is a measure of execution, not a measure of insight.

And once a team starts chasing the number instead of the confidence, the metric has already failed.
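The gap between execution and insight is easy to demonstrate. A hypothetical discount rule, with two tests that produce identical coverage numbers:

```python
# Hypothetical rule: members get 10% off.
def discount(price, is_member):
    return price * 0.9 if is_member else price

# Executes every line (100% coverage) but asserts nothing about
# the values: any behavioral change here would still pass.
def test_discount_runs():
    discount(100, True)
    discount(100, False)

# Same coverage, but it actually protects the rule.
def test_member_gets_ten_percent_off():
    assert discount(100, True) == 90
    assert discount(100, False) == 100
```

Both tests report the same coverage. Only one of them would catch a bug.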


Testing Is Not a Design Discipline

This may be the most important point of all.

Testing can verify whether software behaves as expected.

It cannot tell whether the software is well-designed.

It cannot tell whether:

  • the abstraction boundaries are good

  • the model is coherent

  • the architecture is sustainable

  • the implementation cost is proportional to the value

  • future stories will remain easy to add

Those are not test outcomes.

Those are design and engineering concerns.

And if a team replaces those concerns with:

  • framework templates

  • scenario scripts

  • coverage thresholds

  • pipeline greenness

…then better engineering is not happening.

Judgment is simply being outsourced to artifacts.

That may feel safer.

It may even look more rigorous.

But it is still a substitute for actual thought.


What Testing Is Actually For

Testing does have a real and valuable place.

Used well, testing is for:

  • verifying externally observable behavior

  • protecting against meaningful regressions

  • increasing confidence during change

  • supporting safe evolution of a system

That is already enough.

Testing does not need to become:

  • a replacement for design

  • a replacement for domain understanding

  • a replacement for architecture

  • a replacement for engineering judgment

The moment testing is asked to do those things, it becomes overloaded.

And overloaded tools do not become more powerful.

They become more misleading.


The Cost of a Misaligned System

A system does not need to be broken to be expensive.

It only needs to be misaligned.

That is one of the most dangerous illusions in software development: if the system behaves correctly, it is easy to assume the engineering must also be sound.

But a system can pass tests, satisfy stories, and still be fundamentally costly in all the places that matter over time.

It can be:

  • too expensive to extend

  • too brittle to refactor

  • too complex to reason about

  • too rigid to absorb new requirements cleanly

This is the software equivalent of using a Ferrari F40 to plow a field.

The machine moves.

The task gets completed.

But every future change becomes more expensive than it should be.

That cost rarely appears in the first implementation. It appears later:

  • in slower feature development

  • in rising maintenance effort

  • in increasingly fragile changes

  • in the growing difficulty of correcting earlier assumptions

And this is precisely where testing, on its own, offers very little protection.

Because testing can confirm that a system still behaves the same.

It cannot tell whether that behavior is now trapped inside the wrong machine.

That is an engineering problem.

And when that distinction is missed, software quality gets reduced to present-day correctness while long-term adaptability quietly deteriorates.


Conclusion

Software engineering has become increasingly comfortable with proxies.

Metrics are used as substitutes for judgment.

Artifacts are used as substitutes for understanding.

Test suites are used as substitutes for design confidence.

And in doing so, many teams create the appearance of rigor while quietly undermining the adaptability of the system itself.

Testing is valuable.

But only when it stays in its lane.

Testing should verify software behavior. It should not define the software, freeze its structure, or pretend to certify its design.

Because the moment verification starts replacing engineering, pipelines may still signal green — but better systems do not follow.

Ferraris get built where tractors would have been enough.
