keeper

Posted on May 31

From 'How to Test AI Code' to 'What Makes Us Human'

#ai #philosophy #software #testing

The Conversation That Started It All

It began with a practical question: How do you test code generated by AI?

Simple enough, right? We've been testing software for decades. Unit tests, integration tests, E2E tests, property-based testing, fuzzing — the toolkit is mature and battle-tested.

But the deeper I dug, the more I realized this question doesn't stay contained. It metastasizes. From testing strategy, it bleeds into software engineering epistemology, then into cognitive science, and finally — if you follow the thread far enough — into a question that has haunted philosophers for millennia.

What makes us human?

This article traces that thread. It's the public version of a long, raw conversation with a friend who refused to accept surface-level answers. By the end, I hope you'll see that the "AI testing problem" is not a technical bug — it's a philosophical revelation wearing work clothes.

Part I: The 60x Scissors Gap

The Asymmetry Nobody Talks About

Here's the fundamental tension of engineering in the AI era:

Dimension	AI Code Generation	Human Verification
Speed	Seconds per task	Minutes to hours per task
Cost per unit	~$0.0003/token	~$50-200/hour (engineer salary)
Scalability	Horizontal at near-zero marginal cost	Hard bottlenecked by human attention

In practice, this creates a scissors gap of 60x or more. A task that takes an LLM 5 seconds to generate takes an experienced engineer 5-20 minutes to properly review, test, and validate, which works out to somewhere between 60x and 240x.
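The arithmetic behind that gap is easy to check. This is only an illustrative sketch: the helper name `review_to_generation_ratio` is hypothetical, and the timings are the ones quoted above.

```python
def review_to_generation_ratio(generation_seconds: float, review_minutes: float) -> float:
    """Return how many times longer human review takes than generation."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return (review_minutes * 60) / generation_seconds


low = review_to_generation_ratio(5, 5)    # 5 s to generate, 5 min to review
high = review_to_generation_ratio(5, 20)  # 5 s to generate, 20 min to review
print(f"{low:.0f}x to {high:.0f}x")
```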

Let's make this concrete:

# LLM generates this in ~8 seconds
def process_transactions(transactions: list[dict]) -> dict:
    result = {"total": 0, "count": 0, "by_category": {}}
    for t in transactions:
        result["total"] += t["amount"]
        result["count"] += 1
        cat = t.get("category", "uncategorized")
        result["by_category"][cat] = result["by_category"].get(cat, 0) + t["amount"]
    return result

Looks fine. A human glances at it — looks fine. But the real scrutiny requires:

# Things you'd need to test that the LLM didn't think about
import pytest
from decimal import Decimal

def test_floating_point_accumulation():
    """LLMs love float. Finance hates float."""
    transactions = [
        {"amount": 0.1, "category": "food"},
        {"amount": 0.2, "category": "food"},
    ]
    result = process_transactions(transactions)
    # 0.1 + 0.2 = 0.30000000000000004 in IEEE 754
    assert result["total"] == 0.3  # This FAILS!

def test_missing_amount_key():
    """What if a transaction dict is malformed?"""
    transactions = [{"category": "food"}]  # no 'amount'
    # The generated code does no validation, so this surfaces as a bare KeyError
    with pytest.raises(KeyError):
        process_transactions(transactions)

def test_empty_transactions():
    """Edge case: empty list"""
    assert process_transactions([]) == {"total": 0, "count": 0, "by_category": {}}

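def test_negative_amounts():
    """Refunds? Corrections? The generated code never says how negatives should behave."""
    transactions = [{"amount": -5, "category": "refund"}]
    result = process_transactions(transactions)
    # Negatives are silently netted into the total; confirm that is the intended rule
    assert result["total"] == -5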

Each test case catches something the LLM could have handled if prompted better. But time is money, and the pipeline is moving. The ratio holds: ~8 seconds of generation versus ~15 minutes of test writing, thinking time included. That's roughly 112x (900 seconds / 8 seconds).
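For comparison, here is a sketch of what a stricter version of the function might look like once those gaps are addressed. The name `process_transactions_strict` is hypothetical, and the policy choices in it are assumptions rather than requirements from any real spec: malformed rows are rejected with `ValueError`, and negative amounts are accepted as refunds.

```python
from decimal import Decimal, InvalidOperation


def process_transactions_strict(transactions: list[dict]) -> dict:
    """Aggregate transactions using Decimal and explicit validation."""
    result = {"total": Decimal("0"), "count": 0, "by_category": {}}
    for index, t in enumerate(transactions):
        if "amount" not in t:
            raise ValueError(f"transaction {index} has no 'amount'")
        try:
            # Going through str() avoids carrying binary float error into Decimal:
            # Decimal(0.1) is inexact, Decimal(str(0.1)) is exactly 0.1
            amount = Decimal(str(t["amount"]))
        except InvalidOperation as exc:
            raise ValueError(f"transaction {index} has a non-numeric amount") from exc
        cat = t.get("category", "uncategorized")
        result["total"] += amount
        result["count"] += 1
        result["by_category"][cat] = result["by_category"].get(cat, Decimal("0")) + amount
    return result


result = process_transactions_strict([
    {"amount": 0.1, "category": "food"},
    {"amount": 0.2, "category": "food"},
    {"amount": "-0.05", "category": "refund"},  # assumed to be a refund
])
print(result["total"])
```

Every one of those decisions (reject or skip bad rows, allow or forbid negatives) is a business rule that someone has to own. The code can encode the rule; it cannot choose it.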

The Crisis Is Not What You Think

Most people see this and say: "We need better AI-powered testing tools!" And sure, we do — that's exactly the space I'm building in. But that's a tactical response to a strategic problem.

The real crisis is that we're optimizing the wrong variable. The industry treats AI code generation as a productivity multiplier — "10x engineers!" — without realizing it's actually an accountability amplifier. Every piece of AI-generated code carries a liability that someone, somewhere, must assume.

And the capacity to carry that liability scales with human attention, not with generation speed.

Part II: Why Verification Can't Just "Speed Up"

You might think: "Well, make AI do the testing too. AI generates tests, AI runs them, job done."

This is where the problem pivots from engineering to epistemology.

The Oracle Problem

In software testing, an oracle is the mechanism that determines whether a test passes or fails. For most human-written code, the oracle is the specification — the requirements document, the acceptance criteria, the business rules.

When AI generates code from a prompt like "write a function that processes transactions," there is no formal specification. The prompt is the spec, and it's ambiguous by nature. An AI-generated test against AI-generated code is checking internal consistency, not behavioral correctness.

# AI-generated test for AI-generated code
def test_process_transactions():
    transactions = [{"amount": 10, "category": "food"}, 
                    {"amount": 20, "category": "transport"}]
    result = process_transactions(transactions)
    # What's the ground truth? The AI "knows" what it intended...
    assert result["total"] == 30
    assert result["count"] == 2

This test passes. But does it tell us the code is correct? No — it tells us the code is self-consistent. The difference is everything.
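To show the contrast, here is a minimal sketch of a check that draws on rules written down by a human before looking at the code. The function `violated_rules` and the "every transaction must carry a category" rule are hypothetical stand-ins for a real specification. The generated function from Part I is repeated so the sketch runs on its own.

```python
def process_transactions(transactions: list[dict]) -> dict:
    """The generated function from Part I, repeated so this runs standalone."""
    result = {"total": 0, "count": 0, "by_category": {}}
    for t in transactions:
        result["total"] += t["amount"]
        result["count"] += 1
        cat = t.get("category", "uncategorized")
        result["by_category"][cat] = result["by_category"].get(cat, 0) + t["amount"]
    return result


def violated_rules(transactions: list[dict], result: dict) -> list[str]:
    """Check a result against human-stated rules; an empty list means none broken."""
    violations = []
    if result["count"] != len(transactions):
        violations.append("count must equal the number of input transactions")
    if sum(result["by_category"].values()) != result["total"]:
        violations.append("category subtotals must add up to the total")
    # Hypothetical business rule: every transaction must carry a category
    if "uncategorized" in result["by_category"]:
        violations.append("uncategorized transactions are not allowed")
    return violations


# Amounts are integer cents so the equality checks are exact
ledger = [
    {"amount": 1000, "category": "food"},
    {"amount": 2000, "category": "transport"},
    {"amount": 500},  # missing category
]
problems = violated_rules(ledger, process_transactions(ledger))
print(problems)
```

The first two rules are consistency checks the code happens to satisfy. The third is the interesting one: the code behaves exactly as written, and it is still wrong under that rule.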

State Space Explosion

The second reason verification can't keep up: combinatorial state space.

For a typical web application with:

50 database states
20 authentication states
30 UI states
10 external API response modes

The total state space is 50 × 20 × 30 × 10 = 300,000 combinations. Even at 1 second per test, that's ~83 hours of testing for what an LLM generates in 30 seconds of prompting.
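The numbers can be reproduced in a few lines. The dictionary keys below are just labels for the four dimensions listed above. The last figure is an addition of mine, not part of the original argument: it is a lower bound on the number of tests needed to cover every pair of values at least once (the pairwise testing idea), shown only to give a sense of how far sampling can shrink the problem.

```python
import math

# State counts per dimension, taken from the list above
dimensions = {"database": 50, "authentication": 20, "ui": 30, "external_api": 10}

total_states = math.prod(dimensions.values())
hours_at_one_second_each = total_states / 3600

# Covering all pairs of the two largest dimensions needs at least this many tests
largest, second_largest = sorted(dimensions.values(), reverse=True)[:2]
pairwise_lower_bound = largest * second_largest

print(total_states, round(hours_at_one_second_each, 1), pairwise_lower_bound)
```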

Non-Deterministic Outputs

The third reason is subtle but devastating: LLMs are not deterministic functions. Give the same prompt twice, and you might get two different implementations. Even with temperature = 0, outputs can still vary, because floating-point arithmetic on GPUs is not associative and parallel kernels or request batching can change the order of operations.

This breaks the fundamental assumption of traditional quality engineering: reproducibility. If you can't reproduce a bug, you can't fix it. If you can't fix it, you can't trust the system.
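One partial mitigation is to stop treating the prompt as the source of truth and pin the generated output itself. The sketch below is an assumption about how that could look, not a description of any existing tool: `GenerationRecord` and the `example-model` identifier are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """Everything needed to identify one specific generated artifact."""
    prompt: str
    model: str  # hypothetical model identifier
    temperature: float
    code: str

    @property
    def code_sha256(self) -> str:
        # Hash the exact text that was generated, not the prompt that produced it
        return hashlib.sha256(self.code.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        payload = asdict(self)
        payload["code_sha256"] = self.code_sha256
        return json.dumps(payload, sort_keys=True)


first = GenerationRecord(
    "write add(a, b)", "example-model", 0.0, "def add(a, b):\n    return a + b\n"
)
second = GenerationRecord(
    "write add(a, b)", "example-model", 0.0, "def add(x, y):\n    return x + y\n"
)

# Same prompt and settings, different output: the hash tells the two apart
print(first.code_sha256 == second.code_sha256)  # False
```

This does not make generation reproducible. It only makes sure a bug report can point at the exact artifact that misbehaved.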

Part III: The Five Layers of Knowledge

As I wrestled with why verification can't simply "scale," I developed a framework about what kinds of knowledge are involved — and which ones AI can genuinely possess.

Layer 1: Application Domain Knowledge

What it is: Specific facts about a problem domain. Tax codes, medical procedures, API documentation, business rules.

Can AI do this? Yes, increasingly well. LLMs ingest massive corpora of domain-specific text and can recall and apply it with surprising accuracy. This is the "memorize the manual" layer.

Example: An LLM knows that PCI-DSS requires stored card numbers to be rendered unreadable, never kept in plaintext. It will usually generate code that encrypts, tokenizes, or hashes them.

Layer 2: Software Engineering Knowledge

What it is: Design patterns, testing strategies, architectural principles, language idioms, performance optimization.

Can AI do this? Approaching human-level in many cases. LLMs are trained on enormous amounts of Stack Overflow answers, design patterns literature, and open-source code. They can suggest appropriate patterns and avoid common pitfalls.

Example: An LLM can suggest using a builder pattern for complex object construction, or recommend connection pooling for database access.

Layer 3: Meta-Domain Knowledge

What it is: Understanding how to create knowledge frameworks for new domains. Seeing the pattern in how domains are structured and formalized.

Can AI do this? This is where it gets interesting. LLMs can mimic this — they can generate a taxonomy for a new domain that looks plausible. But they cannot calibrate it. They can't run the empirical cycle of: hypothesize a framework → test it against reality → discover contradictions → revise the framework.

Why it matters: Every genuinely novel system requires meta-domain knowledge. When you build a distributed database from scratch, you're not applying existing patterns — you're discovering new ones. The LLM can't do that because it has no loop into reality.

Layer 4: Meta-Cognitive Generation

What it is: The ability to generate new frameworks of thinking. Not just applying patterns, but creating entirely new categories, new paradigms, new ways of slicing reality.

Can AI do this? It can simulate it. Give an LLM a prompt like "create a new paradigm for thinking about software quality" and it will generate something that reads like a new paradigm. But it's recombination of existing ideas, not genuine generation.

The key difference: A human who invents a new paradigm knows why the old ones failed. They lived through the contradictions. The LLM can describe the contradictions (it read about them) but it didn't suffer them.

Layer 5: Embodied Grounding

What it is: Knowledge rooted in physical existence — proprioception, pain, pleasure, time pressure, social dynamics, mortality, love, fear, the weight of a decision that has real consequences.

Can AI do this? No. And this is not a matter of "more data" or "bigger models." It's a fundamental architectural constraint.

An LLM can write a beautiful essay about grief. It cannot grieve. It can describe the feeling of holding a newborn child. It has never held anything. It can write code for a medical device that keeps someone alive, but it has never been afraid of dying.

Part IV: The Embodiment Frontier

How Close Are We?

The rise of embodied AI — robots trained with reinforcement learning on physical tasks — is closing the gap. Boston Dynamics' robots can navigate rough terrain. Figure's humanoid robots can assemble car parts. Neuralink can read motor cortex signals.

But here's the crucial distinction: embodiment is not the same as groundedness.

A robot that learns to walk by falling 10,000 times in simulation has experienced falling, but it hasn't experienced pain or humiliation or the fear of permanent injury. The reward function is a scalar value; the human experience is a multidimensional tragedy.

The Compressed Life Package

There's a thought experiment I call the Compressed Life Package:

If you could compress a human life — all its experiences, pains, joys, mistakes, and growth — into a dataset and train a model on it, would the model have lived that life?

The intuitive answer is no. The model has the record but not the experience. This is the difference between a biography and a life.

But this distinction is under philosophical attack. If consciousness is just computation, and experience is just information processing, then a sufficiently rich model is having the experience. The debate between functionalism and phenomenology is alive and well.

Where I Land

I don't think the compressed life package works. Here's why:

Time is not just a dimension. It's a constraint.

A human life unfolds in real time with real stakes. Every decision closes off alternatives. Every path not taken is genuinely lost. This irreversibility — this finitude — is what gives human experience its texture. An AI that processes a lifetime of data in milliseconds hasn't lived that time. It has scanned it.

The difference is the difference between reading a recipe and eating the meal.

Part V: What Is Time's Gift?

And here we arrive at the philosophical terminus of the quality engineering question.

The original question: How do we test AI-generated code?
The intermediate answer: We can't, not at the same speed it's generated, because verification requires grounded knowledge that AI lacks.
The deeper answer: The grounded knowledge AI lacks is earned through time — through the accumulation of experience that cannot be compressed.

What Time Gives Us

Context that cannot be serialized. The senior engineer who says "this approach won't work because I've seen it fail three times" isn't just recalling data. They're feeling the pattern of failure. Their body knows something their brain can't fully articulate.
Wisdom that cannot be prompted. You can ask an LLM "what are the common failure modes of distributed systems?" and get a list. But the engineer who lived through a PagerDuty alert at 3 AM for a cascading failure in their own system knows something different. They know the texture of that failure.
Judgment that cannot be calibrated without skin in the game. When an engineer decides "this is good enough to ship," they're balancing quality, time, cost, team morale, business pressure, and their own reputation. An LLM has no reputation to lose.
Creativity that emerges from constraint. The best solutions come from working within real constraints — deadlines, budgets, broken tools, tired teammates. These constraints are not bugs in the creative process; they are features. AI operates in a frictionless plane of infinite compute.

The Uncomfortable Conclusion

Here it is, plain and direct:

The thing that makes humans irreplaceable in the AI era is not what we can do better than machines. It's what we can only do because we are finite, embodied, time-bound creatures who suffer and rejoice and make mistakes and learn from them.

The quality engineer who catches a subtle race condition is not just applying a checklist. They have earned that pattern recognition through years of debugging at 2 AM.

The architect who says "don't use microservices for this" is not just recalling a blog post. They have scars from a distributed monolith that collapsed under its own complexity.

And the philosopher who asks "what makes us human?" is not just arranging words. They are afraid of the answer — afraid of being replaceable, afraid of being meaningless, afraid of being outcompeted by their own creation.

That fear is the gift. That fear is what time has given us.

Part VI: A Practical Path Forward

I don't want to leave you with only philosophy. Here's what this means in practice.

For Quality Engineers

Shift from "testing code" to "testing understanding." Your job is not to verify that AI output is correct. Your job is to verify that the human-AI pair has a correct understanding of the problem.
Build oracle-rich environments. The more formal specifications you can create (property-based tests, type systems, invariants), the more you can leverage AI for generation while keeping verification tractable.
Invest in test infrastructure that can run at AI speed. Property-based testing with tools like Hypothesis can generate thousands of test cases from a single specification. This is your force multiplier.

import math

from hypothesis import given, strategies as st
# from ai_qc import LLMVerifier  # hypothetical package, not needed for this test

@given(st.lists(
    st.fixed_dictionaries({
        "amount": st.floats(min_value=-1e6, max_value=1e6, allow_nan=False),
        "category": st.sampled_from(["food", "transport", None]),
    }),
    min_size=0, max_size=1000
))
def test_process_transactions_property_based(transactions):
    """Property: total should equal sum of individual amounts"""
    result = process_transactions(transactions)
    # fsum gives a correctly rounded reference that does not depend on summation order
    expected_total = math.fsum(t["amount"] for t in transactions)
    # A running float total over up to 1000 values of magnitude 1e6 can drift
    # far more than 1e-9, so the tolerance is scaled to the input range
    assert math.isclose(result["total"], expected_total, rel_tol=1e-9, abs_tol=1e-3)

For Engineers Who Use AI Tools

Never trust an AI-generated test that only tests AI-generated code. You need a human-authored oracle — even if it's just a mental model of correctness.
Treat AI as a junior engineer who types really fast. Review their work. Question their assumptions. They will confidently generate incorrect code with perfect grammar.
Build in verification hooks. Assertions, invariants, runtime checks, monitoring, observability. For some classes of failure, verifying in production (via observability) is cheaper than verifying pre-deployment (via testing).
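A verification hook can be as small as a decorator that checks a result against a stated invariant and logs when it breaks. This is a minimal sketch under my own assumptions: `check_result` and `order_total` are hypothetical names, and logging rather than raising is a design choice that will not suit every code path.

```python
import functools
import logging

logger = logging.getLogger("invariants")


def check_result(predicate, message):
    """Decorator: log a warning whenever predicate(result) is false."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not predicate(result):
                # Log instead of raising so production traffic keeps flowing;
                # raising may be the better choice on critical paths
                logger.warning("invariant violated in %s: %s", func.__name__, message)
            return result
        return wrapper
    return decorator


@check_result(lambda total: total >= 0, "order total must not be negative")
def order_total(prices_in_cents: list[int]) -> int:
    return sum(prices_in_cents)


order_total([100, 250])   # passes silently
order_total([100, -250])  # logs a warning and still returns the value
```

The invariant itself is the human-authored part. The decorator only makes sure it is checked every time the function runs.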

For Leaders and Managers

Stop measuring productivity in lines of code or story points. Start measuring in verified, deployed, observed value. The generation is cheap; the verification is expensive.
Invest in your senior engineers' judgment, not just their output. The ability to say "no, this approach is wrong" becomes the highest-leverage skill in an AI-augmented organization.
Accept that some things will always be slower. The speed of trust cannot be accelerated. If you want reliable systems, you must pay the time cost of building shared understanding.

Epilogue: The Unanswered Question

I started this journey asking "how do we test AI code?" I'm ending it with a different question, one that I can answer only partially:

What cannot be compressed into a prompt?

Here is my incomplete list:

The feeling of shipping something you built with your own hands
The shame of a bug you introduced that cost your company money
The joy of a colleague saying "that was a brilliant design"
The exhaustion of a 72-hour outage
The pride of mentoring a junior engineer who surpasses you
The fear of being wrong when the stakes are real
The humility of realizing you were wrong
The patience that only comes from having failed enough times
The intuition that whispers "something is off" before you can articulate why
The wisdom that comes from having lived through it

These are not bugs in the human operating system. They are features. They are what time — real, irreversible, finite time — gives us.

And they are the only things that cannot be generated, no matter how many tokens you throw at the problem.

This article emerged from a conversation that refused to stop at surface answers. If you've read this far, you're the kind of engineer who asks "why" one more time than is comfortable. Keep asking. The answers get better the deeper you go.

Code examples reference concepts from the ai-qc package (in development) for property-based verification of LLM-generated code.

Tags: AI, Software Engineering, Philosophy, Quality Assurance

DEV Community