Liz Acosta

Don’t Make Assumptions About Assertions: Even with AI you still have to write your unit tests

This blog post talks about why making assumptions about assertions makes not one “ass,” but two – which is especially true in this new age of AI. You’re not going to like this, but guess what? You still have to write your unit tests. (Sorry!)

But that’s not necessarily a bad thing! If this blog post doesn’t convince you of that, then I at least hope to invite you to reconsider your feelings about unit testing by taking a closer look at the most important component of the AAA test pattern: the assertion.

How to get the most out of this blog post

This blog post is designed to accommodate many different learning styles, so you can choose your own adventure.

AI generated code: The key to human productivity or a “disconcerting trend”?

In June 2022, GitHub Copilot went GA. A year later, in June 2023, a GitHub blog post proclaimed that “AI developer productivity benefits could boost global GDP by over $1.5 trillion,” citing a February 2023 research paper about an experiment in which a group of programmers with “access to GitHub Copilot was able to complete the task [of implementing an HTTP server in JavaScript as quickly as possible] 55.8% faster” than a control group without help from AI.

It is worth noting that of the paper’s four authors, three of them are associated with GitHub or its parent company Microsoft.

In response to this “promise to increase human productivity,” GitClear, a software engineering intelligence platform, asked the question, “How does this profusion of LLM generated code affect quality and maintainability?” To answer that question, GitClear collected 153 million changed lines of code authored between January 2020 and December 2023, and evaluated the data for differences in code quality. At the time, it was the largest known database of highly structured code change data used for this purpose, and included repos owned by Google, Microsoft, Meta, and enterprise C-Corps.

What GitClear found was “disconcerting trends for maintainability.” The authors of the report drew this conclusion from two factors in which they observed notable changes following the general availability of GitHub Copilot:

Code churn

“Code churn” refers to the percentage of lines that are reverted or updated less than two weeks after being authored. In other words, this is code that was probably authored in one sprint and then reverted or updated in the next sprint because the changes were either incomplete or erroneous. The report noted that code churn increased around the same time as the release of GitHub Copilot and projected that it would continue to do so.

Added and copy-pasted code

This refers to code that is newly authored instead of code that is “updated,” “deleted,” or “moved.” The report goes on to explain that an increase in adding new code instead of refactoring existing code “resembles an itinerant contributor, prone to violate the DRY-ness of the repos visited.” (DRY is an acronym for the “Don’t Repeat Yourself” tenet of software engineering.) And who might that “itinerant contributor” be? Yup – AI.

(No, AI didn’t write that. I’ve always been a prolific em-dash user so AI probably stole its usage from me.)

So what can we, as developers, take away from this?

While neither the GitHub nor the GitClear report tries to hide its bias or content marketing intentions, we can still glean some useful insights from them:

  • You’re probably going to encounter AI generated code – whether you’re the one adding it or you’re reading/reviewing AI generated code that someone else added.
  • I’m sorry, but you still have to write your unit tests. Now more so than ever.

… but that’s not a bad thing. (Stay with me here.)

A quick review of unit tests

What is unit testing?

Unit testing is the process of evaluating and verifying that the smallest functional units of code do what they are supposed to do individually and independently. So if we have a web application that allows a user to view collections of Pokemon according to ability, color, or type, we might have a Pokemon object with methods that perform tasks such as calling a Pokemon API endpoint, processing whatever response we get back from that endpoint, and then transforming that processed response into something we can display in a browser for the user. Our unit tests would act on each method individually and independently to verify that each method performs as expected.
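
To make that concrete, here’s a minimal sketch of what a unit test for one of those small units might look like. (The format_pokemon_name helper below is hypothetical, invented just for illustration; it isn’t part of the app we’ll look at later.)

import unittest


def format_pokemon_name(raw_name):
    # Hypothetical "smallest unit": turn an API name like "mr-mime" into "Mr Mime"
    return raw_name.replace("-", " ").title()


class TestFormatPokemonName(unittest.TestCase):
    def test_format_pokemon_name(self):
        # The unit is exercised on its own, independent of the API or the browser
        self.assertEqual(format_pokemon_name("mr-mime"), "Mr Mime")


if __name__ == "__main__":
    unittest.main()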

The idea is that if each “ingredient” of a whole application is correct, then we can assume the end result will turn out the way we want. We can assume that if we have the right kind of tomato sauce, crust, and toppings, our pizza will be edible and delicious.

Unit tests are just one kind of software testing. There are lots of different types of tests that try to answer different types of questions such as, “Do all the different parts of this system actually work together?” and, “What happens if I throw in this totally wild edge case – will my system survive?”

(This is where I trot out my software testing alignment chart because it’s probably one of the most clever things I’ve ever created and people seem to really like it!)

A Dungeons and Dragons style alignment chart that lists different kinds of software testing and where they fall on the spectrum of good to evil and lawful to chaotic.

The benefits of unit tests

The benefits of unit tests include:

  • Preventing bugs
  • Accelerating development
  • Saving money

But the most compelling benefit of unit tests is they help us become better engineers. Unit tests force us to ask ourselves, “What actually is the expected behavior of this method?” In the best case scenario, unit tests reveal code smells or redundancy or unnecessary complexity that motivate us to refactor the code under test.

A LinkedIn post that says: I do not understand the average developer's revulsion toward refactoring. Refactoring has been my favorite part of being a software engineer because You learn about software architecture through re-structuring, You learn about new libraries to solve problems you are facing, You learn best practices studying how others solved your problem, You learn how to document and make your code readable, You learn to take pride in your work. I am obsessed with refactoring because when you're refactoring, you're literally learning how to write better code.

A LinkedIn post in praise of refactoring and why.

Unit tests and AI

“But Liz!” you say, “Writing unit tests is so tedious and boring! Won’t I be more productive if I get AI to write them?”

Maybe.

After all, if these LLMs are trained on millions of lines of code and all of the internet, isn’t using AI to write unit tests kind of exactly like copying and pasting a solution from Stack Overflow? That time-honored tradition of software engineering?

If someone else has already figured it out, why not reuse their solution? Is that not in alignment with the DRY principle?

What could go wrong?

To answer that, here’s an excerpt from the GitHub paper on AI powered productivity mentioned above:

“We included a test suite in the repository, comprising twelve checks for submission correctness. If a submission passed all twelve tests, we counted it as successfully completed. Participants could see the tests but were unable to alter them.”

In order to ensure that both the AI enabled and control groups of programmers tasked with spinning up a server did so correctly, the tests were written first. Whether the code was human or AI generated, it was verified with tests provided by the researchers.

And anyway, would you really trust an LLM trained on the tests most developers write?

A LinkedIn post that says: A cruel irony of coding agents is that everyone who blew off automated testing for the past 20 years is now telling the AI to do TDD all the time. But because LLMs were trained on decades of their shitty tests, the agents are also terrible at testing.

A LinkedIn post describing why LLMs are bad at writing tests.

Writing unit tests doesn’t have to be horrible

Personally, I would love to one day achieve the disciplined zen of test driven development, but jumping right into the application code is just so much more seductive. It’s like eating dessert first, and while eating dessert first isn’t necessarily “bad” (we’re adults who can make our own decisions), it’s probably not great for us nutritionally in the long run. So how can we write unit tests in a way that is efficient and optimized? Unit tests that are modular and maintainable and leverage all of the tools in our toolkit?

The AAAs of testing

Typically, a test follows this pattern:

  1. Arrange: Set up the test environment. This can include fixtures, mocks, or context managers – whatever is needed to execute the code under test. When it comes to unit tests, the test environments for each test should be isolated from each other.
  2. Act: Execute the code under test. If it’s an end-to-end test, this might mean kicking off a workflow that includes multiple services and dependencies. In a unit test, however, this should be a single method.
  3. Assert: Verify the results. Compare the expected result with the test results – did the code do what you want it to do? This is the most important part of the test and in unit tests, it is (usually) best practice to have one precise assertion per test.

Keeping this pattern in mind can help make it easier to write unit tests.
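
Here’s what that pattern can look like in unittest form. This is just a sketch; the parse_names function is a made-up stand-in for whatever code you’re actually testing.

import unittest


def parse_names(payload):
    # Stand-in code under test: pull the "name" field out of each result
    return [item["name"] for item in payload["results"]]


class TestParseNames(unittest.TestCase):
    def test_parse_names(self):
        # Arrange: set up an isolated test environment
        payload = {"results": [{"name": "fire"}, {"name": "water"}]}

        # Act: execute the code under test (a single function)
        result = parse_names(payload)

        # Assert: verify the result with one precise assertion
        self.assertEqual(result, ["fire", "water"])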

Don’t make assumptions about assertions

When you make assumptions about assertions, you end up with not one “ass,” but two. Just because you have 100% test coverage and everything is passing, it doesn’t mean your tests are actually meaningful or – and here’s the “galaxy brain” revelation for you – maintainable.

In Python’s unittest specifically, assert methods come included with the TestCase class. These methods check for and report failures. You are probably familiar with the tried and true assertEqual method, which compares one argument with another and, if the two do not match, reports a test failure … but did you know that there are many more specific and precise assertions available to you? All out of the box?

Take a look at these!

Most common assert methods:

| Method | Checks that ... |
| --- | --- |
| assertEqual(a, b) | a == b |
| assertNotEqual(a, b) | a != b |
| assertTrue(x) | bool(x) is True |
| assertFalse(x) | bool(x) is False |
| assertIs(a, b) | a is b |
| assertIsNot(a, b) | a is not b |
| assertIsNone(x) | x is None |
| assertIsNotNone(x) | x is not None |
| assertIn(a, b) | a in b |
| assertNotIn(a, b) | a not in b |
| assertIsInstance(a, b) | isinstance(a, b) |
| assertNotIsInstance(a, b) | not isinstance(a, b) |
| assertIsSubclass(a, b) | issubclass(a, b) |
| assertNotIsSubclass(a, b) | not issubclass(a, b) |

Assert methods that check the production of exceptions, warnings, and log messages:

| Method | Checks that ... |
| --- | --- |
| assertRaises(exc, fun, *args, **kwds) | fun(*args, **kwds) raises exc |
| assertRaisesRegex(exc, r, fun, *args, **kwds) | fun(*args, **kwds) raises exc and the message matches regex r |
| assertWarns(warn, fun, *args, **kwds) | fun(*args, **kwds) raises warn |
| assertWarnsRegex(warn, r, fun, *args, **kwds) | fun(*args, **kwds) raises warn and the message matches regex r |
| assertLogs(logger, level) | The with block logs on logger with minimum level |
| assertNoLogs(logger, level) | The with block does not log on logger with minimum level |
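
assertRaises, assertRaisesRegex, assertWarns, and assertWarnsRegex can also be used as context managers, which is often the more readable form. A quick sketch (the divide function is just a stand-in for your own code under test):

import unittest


def divide(a, b):
    # Stand-in code under test: raises ZeroDivisionError on bad input
    return a / b


class TestDivide(unittest.TestCase):
    def test_divide_by_zero_raises(self):
        # Context-manager form of assertRaises
        with self.assertRaises(ZeroDivisionError):
            divide(1, 0)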

Even more specific checks:

| Method | Checks that ... |
| --- | --- |
| assertAlmostEqual(a, b) | round(a-b, 7) == 0 |
| assertNotAlmostEqual(a, b) | round(a-b, 7) != 0 |
| assertGreater(a, b) | a > b |
| assertGreaterEqual(a, b) | a >= b |
| assertLess(a, b) | a < b |
| assertLessEqual(a, b) | a <= b |
| assertRegex(s, r) | r.search(s) |
| assertNotRegex(s, r) | not r.search(s) |
| assertCountEqual(a, b) | a and b have the same elements in the same number, regardless of their order |
| assertStartsWith(a, b) | a.startswith(b) |
| assertNotStartsWith(a, b) | not a.startswith(b) |
| assertEndsWith(a, b) | a.endswith(b) |
| assertNotEndsWith(a, b) | not a.endswith(b) |
| assertHasAttr(a, b) | hasattr(a, b) |
| assertNotHasAttr(a, b) | not hasattr(a, b) |

Type specific assertEqual methods (these are what assertEqual dispatches to automatically when both arguments are the same type):

| Method | Used to compare |
| --- | --- |
| assertMultiLineEqual(a, b) | strings |
| assertSequenceEqual(a, b) | sequences |
| assertListEqual(a, b) | lists |
| assertTupleEqual(a, b) | tuples |
| assertSetEqual(a, b) | sets or frozensets |
| assertDictEqual(a, b) | dicts |

Using a more precise assert method can help refine your unit tests and make the work of writing them more efficient and optimized.
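
For example, a generic assertTrue technically works, but it throws away information when the test fails; a more precise method like assertIn reports exactly what went wrong. A small (deliberately failing) illustration:

import unittest


class TestPrecision(unittest.TestCase):
    def test_generic_assertion(self):
        result = ["fire", "water"]
        # Fails with only: AssertionError: False is not true
        self.assertTrue("grass" in result)

    def test_precise_assertion(self):
        result = ["fire", "water"]
        # Fails with: AssertionError: 'grass' not found in ['fire', 'water']
        self.assertIn("grass", result)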

Getting your hands dirty: Using more precise assert methods to write better unit tests

What I appreciate most about developers as an audience is the emphasis on showing rather than telling because personally, I need to see something before I believe it, too. It’s even better when I get to run the code myself and arrive at that “Aha!” moment on my own. Hands-on is the best way to learn.

An artisanal, handcrafted, slow coded Pokemon Flask app

You’ve heard of “no code,” right? Well, get ready for “slow code.”

I wanted to see if I could use Cursor, an AI-powered code editor, to write my unit tests, but I needed some code to test first. I decided to code – by hand – a very simple Pokedex Flask app. Sure, I could have prompted Cursor to do it for me, but that seemed to defeat the purpose of the experiment. Nor does it really simulate a real world use case since most professional developers are probably working with existing pre-AI code, and, more than that, I wanted to write some Python. Isn’t that why I do this? Because it’s enjoyable?

Yeah, it’s “slow code” – and it’s important. Programming is a muscle, and if you don’t exercise it regularly, it atrophies. I understand that the craft of code is often not as important as the profit it produces, but at what cost? I could have prompted an LLM to generate this blog post, but I didn’t, because I like writing. Every blog post I write myself makes me a better writer; every line of code I write makes me a better programmer. It’s that hands-on learning thing.

So I wrote my app by hand, using a forked minimal Flask template to avoid the boilerplate code. I ended up with a web app that uses an API endpoint to view collections of Pokemon according to ability, color, or type. I muddled through the limited JavaScript the app implements and used a Python-wrapped Bootstrap library for the styling. It’s not very complicated so using Cursor to write the unit tests should be a simple task – right?

A screenshot from the handcrafted Flask app showing all the pink Pokemon

All the pink Pokemon all in one place.

A look at the AI generated unit tests

My prompt was simple: generate unit tests for pokemon.py using unittest

Let’s take a look at what we ended up with. Feel free to pull down the code here and check it out.

To start things off, let’s see if the tests pass and what kind of coverage they provide.

----------------------------------------------------------------------

Ran 12 tests in 0.008s

OK

Name              Stmts   Miss  Cover

-------------------------------------

pokemon.py           29      0   100%

test_pokemon.py     147      1    99%

-------------------------------------

TOTAL               176      1    99%


Passing tests and 100% coverage of the code under test. We’re off to a promising start … or are we? Not all coverage is created equal, so it’s worth investigating the tests themselves.

Funky fixtures

def setUp(self):
    """Set up test fixtures before each test method."""
    self.mock_response = Mock()
    self.mock_response.json.return_value = {}

def tearDown(self):
    """Clean up after each test method."""
    pass

We begin benignly enough with some test fixtures in which a mock response is created for every test method in the class. Things start to get a little “smelly” when we examine the tearDown method, which is just a pass. In this particular case, the mock object would already be inaccessible beyond the test function it is created within, so while the teardown ensures it is truly gone, it’s excessive and renders the whole fixture moot. Test fixtures can be very useful, especially when creating isolated, independent test environments, but in this scenario, this one doesn’t add to the meaningfulness of the tests.

Furthermore, mock responses are created in each of the test functions, making the fixture even more redundant. (Read more about test fixtures and mocks.)
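
If a fixture is worth keeping at all, one possible refactor is to have setUp hold only what the tests genuinely share, such as the canned API payload, and to drop the empty tearDown entirely. This is just a sketch, assuming the test module imports get_attributes from pokemon the way the generated tests do:

import unittest
from unittest.mock import Mock, patch

from pokemon import get_attributes


class TestGetAttributes(unittest.TestCase):
    def setUp(self):
        """Only what the tests genuinely share: a canned API payload."""
        self.fake_payload = {"results": [{"name": "fire"},
                                         {"name": "water"},
                                         {"name": "grass"}]}
    # No tearDown needed: each test's Mock objects go out of scope on their own.

    @patch("pokemon.requests.get")
    def test_get_attributes_success(self, mock_get):
        mock_response = Mock()
        mock_response.json.return_value = self.fake_payload
        mock_get.return_value = mock_response
        self.assertListEqual(get_attributes("type"), ["fire", "water", "grass"])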

So already we find ourselves having to refactor AI generated code.

Now let’s take a look at the first test and its assertions.

More maintainable, human friendly tests

Like their source code, tests will need to be updated as a system evolves. The more “finding and replacing” you need to conduct, the more brittle and unreliable your tests are. Using variables instead of “magic values” can reduce the number of instances that require updating. In this example, we’ve replaced the magic test values and expected values with variables. Our test is now more modular and easier to maintain.

# Act: Execute the code under test
# Test the function

test_input = "type"
test_result = get_attributes(test_input)

# Assert: Check the results

# Add a message for the assert methods
assert_message = f"For test values: {test_input} " \
    f"the function produced: {test_result}."

# Define the expected result once so it is easy to update later
expected_result = ["fire", "water", "grass"]

self.assertEqual(test_result, expected_result)

The TestCase assert methods also accept a message argument: assertEqual(first, second, msg=None). The value provided for msg is included in the output when a test fails. This can give us more information about a test failure, which makes it easier to fix or debug.

Let’s add a test that will fail:

# Add a test failure to demonstrate output
self.assertIsNone("something")

Without a message, this is what our test failure looks like:

AssertionError: 'something' is not None

With a message, and utilizing the variables we created, even our failures become helpful:

# Add a test failure to demonstrate output
self.assertIsNone("something", msg=assert_message)
AssertionError: 'something' is not None : For test values: type the function produced: ['fire', 'water', 'grass'].

Too many assertions

There are competing philosophies on the number of assertions a test should contain. Some people will tell you that a unit test should have only one assertion, and others might tell you that more than one is okay. When writing tests, it’s important to remember that the assert methods provided by TestCase “check for and report failures.” Imagine if each of these assertions resulted in a failure you then had to fix or debug. Do these failures actually tell us anything about the code under test?

# Assertions
self.assertIsInstance(result, list)
self.assertEqual(len(result), 3)
self.assertIn("fire", result)
self.assertIn("water", result)
self.assertIn("grass", result)
self.assertEqual(result, ["fire", "water", "grass"])

If we look at the code under test, a list of strings is returned. That’s it, that’s all that happens. While this code is not code you’d want to push to production, it is the code we are testing.

def get_attributes(attribute):
    response = requests.get(BASE_URL + attribute)
    response_json = response.json()

    attributes_list = [item["name"] for item in response_json["results"]]

    return attributes_list

Do we really need to assert on the specific contents of the list? Especially if this particular function doesn’t do anything with those contents? We probably want to reduce the number of assertions in this test. We really only need to test whether or not the code produces a list.

We can do that with assertEqual(test_result, expected_result, msg=assert_message), or we can eliminate yet another assertion (the assertIsInstance) with assertListEqual, which will not only compare the lists, but also verify the list type.

self.assertListEqual(test_result, expected_result, msg=assert_message)

Don’t believe me? Let’s change expected_result to a string and see what happens when we use assertListEqual:

# Change `expected_result` to a string
self.assertListEqual(test_result, "'fire', 'water', 'grass'", msg=assert_message)
AssertionError: Second sequence is not a list: "'fire', 'water', 'grass'"

The test fails. Now we’ve verified not only the test result itself, but the test result type as well.

Can we eliminate another assertion? Let’s see!

Let’s say we want to also make sure we don’t end up with an empty list, even though we might not know the exact number of list elements we will end up with. This is where we can use assertGreater and create a variable list_minimum = 0 for the minimum length, so the test fails if the function returns an empty list.

self.assertGreater(len(test_result), list_minimum, msg=assert_message)

No comment please

This is just a nit, but the AI generated tests included this comment:

self.assertEqual(result, ["fire", "water", "grass"])  # Order matters for this function

Nothing in the code suggests that, so it’s just a random comment. In response, I added my own useless comment: # No, it doesn't

(I don’t cover the rest of the tests here, but if you check out the code, I’ve commented on the parts of the tests that I would have to refactor if I wanted to make this code production ready.)

Before and after

Comparing the before and after, the new test is a lot more succinct, meaningful, and maintainable. Now, no matter how the source code evolves, we can rely on this test to tell us if we’ve introduced any breaking changes.

Before:

@patch("pokemon.requests.get")
def test_get_attributes_success(self, mock_get):
    """Test get_attributes function with successful API response."""

    # Mock the response
    mock_response = Mock()
    mock_response.json.return_value = {"results":[
                                                 {"name": "fire"},               
                                                 {"name": "water"},
                                                 {"name": "grass"}]}
    mock_get.return_value = mock_response

    # Test the function
    result = get_attributes("type")

    # Assertions
    self.assertIsInstance(result, list)
    self.assertEqual(len(result), 3)
    self.assertIn("fire", result)
    self.assertIn("water", result)
    self.assertIn("grass", result)
    self.assertEqual(result, ["fire", "water", "grass"])  # Order matters for this function
    mock_get.assert_called_once_with(BASE_URL + "type")

After:

@patch("pokemon.requests.get")
def test_get_attributes_success(self, mock_get):
    """Test get_attributes function with successful API response."""

    # Mock the response
    mock_response = Mock()
    mock_response.json.return_value = {"results":[
                                                 {"name": "fire"},               
                                                 {"name": "water"},
                                                 {"name": "grass"}]}
    mock_get.return_value = mock_response

    # Test the function
    test_input = "type"
    test_result = get_attributes(test_input)

    # Assert
    assert_message = f"For test values: {test_input} " \
        f"the function produced: {test_result}."

    expected_result = ['fire', 'water', 'grass']
    list_minimum = 0

    self.assertGreater(len(test_result), list_minimum, msg=assert_message)
    self.assertListEqual(test_result, expected_result, msg=assert_message)
    mock_get.assert_called_once_with(BASE_URL + 'type')

In conclusion: Unit tests are the human in the loop

Whether your code is meticulously typed out character by character, copied and pasted from Stack Overflow, or generated by an LLM, unit tests are the quickest way to verify it operates as expected. Moreover, when we start with unit tests that are written with as much care and intention as the source code itself, we lay the foundation for efficiency and optimization, which makes writing the next set of unit tests much less laborious and tedious. Solid unit tests are an investment in future productivity. While AI can “hallucinate,” it has no imagination or empathy, so it cannot write tests for the humans who will eventually be stuck deciphering test failures.

What do you think? Do you think AI will get better at writing unit tests? Do you feel inspired to try out other assert methods in your testing?

Resources and references

❤️ If you found this blog post helpful, please consider buying me a coffee.
