Carolyn Stransky for Meeshkan

Posted on Apr 6, 2020 • Edited on Apr 9, 2020 • Originally published at meeshkan.com

From 1 to 10,000 test cases in under an hour: A beginner's guide to property-based testing

#python #testing #tutorial #beginners

This guide was co-authored by Fredrik Fornwall.

Testing your software takes time... a lot of time. When you're writing tests, you're often stuck trying to manually reproduce every potential sequence of events. But what if you wanted to test hundreds (or thousands or even millions) of cases at once? We have an answer: Property-based testing.

Maybe you've written unit tests before, but this is the first time you've heard about property-based testing. Or maybe you've heard the term, but still don't really get what it's about. Either way, we've got you.

Throughout this guide, we'll cover the fundamentals of property-based testing. We'll walk you through practical examples and how to structure your tests. Finally, you'll learn how to use property-based testing to find bugs in your code and what existing libraries are out there.

What's in this guide

Traditional unit tests based on examples
- Limitations of example-based testing
Introduction to property-based testing
Example properties and how to test for them
Finding bugs with property-based testing
Available libraries
Conclusion

⚠️ Prerequisites:

A general understanding of what unit tests are.
(Optional) Python 3+* if you want to follow along in your own IDE.

* This guide will use Python for code examples, but the concepts aren't limited to any specific programming language. So even if you don't know Python, we'd encourage you to read along anyway.

💻 References:
We've created a GitHub repository to go with this guide. All of the featured code examples exist there as unit tests and include instructions for how to execute them.

Traditional unit tests based on examples

Most often, software testing is done using example-based testing. This means you test that for a given argument, you get a known return value. This return value is known because, well, you provided that exact value as a sample. So when you run the function or test system, it then asserts the actual result against that sample return value.

Let's look at an example. Say you want to write a function called sort_this_list. This function will take a list as an argument and return the same list organized in ascending order. To do this, you'll use the built-in Python sorted function.

It might look like the following:

# test_sorted_list.py
def sort_this_list(input_list):
    sorted_list = sorted(input_list)
    return sorted_list

Now that you have your sort_this_list function, let's test it.

To test this using example-based testing, you need to (manually) provide the test function with inputs and the return values you expect for them. For example, the list [5, 3, 1, 4, 2] should return [1, 2, 3, 4, 5] after it's sorted.

# test_sorted_list.py
def test_sort_this_list():
    assert sort_this_list([5, 3, 1, 4, 2]) == [1, 2, 3, 4, 5] # True
    assert sort_this_list(['a', 'd', 'c', 'e', 'b']) == ['a', 'b', 'c', 'd', 'e'] # True

And with that, you have a passing example-based test 🎉

Limitations of example-based testing

While example-based tests work well in many situations and provide an (arguably) low barrier to entry for testing, they do have downsides. In particular, you have to create every test case yourself, and you can only test as many cases as you're willing to write. The fewer you write, the more likely it is that your tests will miss bugs in your code.

To show why this could be a problem, let's look at the test for the sort_this_list function from the last section:

# test_sorted_list.py
def test_sort_this_list():
    assert sort_this_list([5, 3, 1, 4, 2]) == [1, 2, 3, 4, 5] # True
    assert sort_this_list(['a', 'd', 'c', 'e', 'b']) == ['a', 'b', 'c', 'd', 'e'] # True

Both of these assertions pass. So if you only tested these two values, you might believe that the sort_this_list function always returns the desired result.

But if you add a third test case:

# test_sorted_list.py
def test_sort_this_list():
    assert sort_this_list([5, 3, 1, 4, 2]) == [1, 2, 3, 4, 5] 
    assert sort_this_list(['a', 'd', 'c', 'e', 'b']) == ['a', 'b', 'c', 'd', 'e'] 
    # Add a new test case:
    assert sort_this_list(['a', 2, 'c', 3, 'b', 1]) == ['a', 'b', 'c', 1, 2, 3]

And then run the test... you'll hit an error:

TypeError: '<' not supported between instances of 'int' and 'str'

Turns out the sort_this_list function doesn't work as expected when the list contains both integers and strings. Maybe you already knew that, but maybe you would've never known that without a specific test case.

Even with these limitations, example-based testing will continue to be the norm in software testing. Throughout the rest of this guide, though, we'll explore another technique. One designed to complement your existing (likely example-based) tests and improve the test coverage of your code.

Introduction to property-based testing

When thinking about the limitations of example-based testing, many questions come to mind. What if you want to test hundreds of cases? Or ones that you could never dream of coming up with yourself?

Property-based testing is a different approach that helps with this. With property-based testing, you don't write the exact values manually. Instead, the computer generates them automatically.

As the developer, what you have to do is:

Specify what value to generate.
Assert on guarantees (or properties) that are true regardless of the exact value.
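To make these two steps concrete before bringing in a library, here's a minimal hand-rolled sketch that uses only Python's built-in random module. The helper names (random_int_list, check_sort_properties), the value ranges and the fixed seed are our own assumptions for illustration. A dedicated library does this far more cleverly, but the basic loop looks roughly like this:

```python
import random


def random_int_list(rng, max_len=20):
    """Step 1: describe what to generate (here: a list of integers)."""
    length = rng.randint(0, max_len)
    return [rng.randint(-1000, 1000) for _ in range(length)]


def check_sort_properties(runs=1000, seed=0):
    """Step 2: assert properties that hold for every generated value."""
    rng = random.Random(seed)  # Fixed seed so a failure can be reproduced
    for _ in range(runs):
        values = random_int_list(rng)
        result = sorted(values)
        # Sorting never changes the number of elements:
        assert len(result) == len(values)
        # Each element is lower than or equal to the one after it:
        assert all(a <= b for a, b in zip(result, result[1:]))
    return runs


checked = check_sort_properties()
print(f"{checked} generated cases passed")
```

Notice that no expected output is written down anywhere. The test only states what must be true of any output.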

An example using Hypothesis

To put property-based testing into practice, let's look at an example using Hypothesis, a Python library for generative test cases. We chose Hypothesis mostly because we're using Python - but also because the documentation is clear and thorough.

Let's use the sort_this_list function from earlier. As a reminder, here's what that looked like:

# test_sorted_list.py
def sort_this_list(input_list):
    sorted_list = sorted(input_list)
    return sorted_list

Now let's write a property-based test using Hypothesis. To limit the scope, you'll only test for lists of integers:

# test_sorted_list.py
import hypothesis.strategies as some
from hypothesis import given, settings

# Use the @given decorator to guide Hypothesis to the input value needed:
@given(input_list=some.lists(some.integers()))
# Use the @settings object to set the number of cases to run:
@settings(max_examples=10000)
def test_sort_this_list_properties(input_list):
    sorted_list = sort_this_list(input_list)

    # Regardless of input, sorting should never change the size:
    assert len(sorted_list) == len(input_list)

    # Regardless of input, sorting should never change the set of distinct elements:
    assert set(sorted_list) == set(input_list)

    # Regardless of input, each element in the sorted list should be
    # lower or equal to the value that comes after it:
    for i in range(len(sorted_list) - 1):
        assert sorted_list[i] <= sorted_list[i + 1]

Note: If you're following along on your machine, make sure to install Hypothesis and then you can run the tests using pytest.

And there you have it, your first property-based test 🎉

What's especially important here is the use of the @given function decorator:

@given(input_list=some.lists(some.integers()))

This specifies that you want lists of random integers as the input value. The assertions in the test body then check properties that are true regardless of the exact input.

Another significant feature of this test is the use of the @settings object:

@settings(max_examples=10000)

Here, it's using the max_examples setting to indicate the maximum number of satisfying test cases that will run before terminating. The default value is 100 and in this case, it's set to 10000.

At first, running tens of thousands of test cases might feel excessive - but these numbers are reasonable in the property-based testing realm. Even the Hypothesis documentation recommends setting this value well above the default or else it may miss uncommon bugs.

Going back to the example test, if you add a print(input_list) statement, you can peek at the 10,000 different generated input values:

[]
[92]
[66, 24, -25219, 94, -28953, 31131]
[-16316, -367479896]
[-7336253322929551029, -7336253322929551029, 27974, -24308, -64]
...

Note: Your values will likely be different from our example - and that's ok. You also might not want to print 10,000 example lists - and that's ok too. You can take our word for it.

The number of runs and specifics of the generated data can be configured. More on that later on.
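As a small preview, here's a sketch of what that configuration might look like. The min_value, max_value, min_size and max_size arguments are Hypothesis strategy parameters, while the specific bounds, the small_lists name and the test name are assumptions we picked for illustration:

```python
import hypothesis.strategies as some
from hypothesis import given, settings

# Hypothetical bounds: lists of 1 to 10 items, each between 0 and 100.
small_lists = some.lists(
    some.integers(min_value=0, max_value=100),
    min_size=1,
    max_size=10,
)


@given(input_list=small_lists)
@settings(max_examples=200)
def test_sorted_small_lists(input_list):
    sorted_list = sorted(input_list)

    # Because every list has at least one item, min() and max() are safe here.
    # The smallest and largest elements should end up at the two ends:
    assert sorted_list[0] == min(input_list)
    assert sorted_list[-1] == max(input_list)
```

Restricting the input like this is a trade-off: the properties get simpler to write, but anything outside the bounds (such as the empty list) is no longer tested.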

What can be a property?

With this style of testing, a property is something that's true about the function being tested, regardless of the exact input.

Let's see this definition applied to assertion examples from the previous test_sort_this_list_properties function:

len(sorted_list) == len(input_list): The property tested here is the list length. The length of the sorted list is always the same as the original list (regardless of the specific list items).
set(sorted_list) == set(input_list): The property tested here is the set of distinct elements. Sorting should never add a new element or lose an existing one.
sorted_list[i] <= sorted_list[i + 1]: The property tested here is that the elements of the sorted list are in ascending order. No matter the contents of the original list, this should be true.
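One way to see why it helps to assert several properties is to run them against a deliberately broken implementation. The sort_dropping_duplicates and is_ascending functions below are hypothetical helpers we made up for this sketch. They show a bug that two of the assertions from test_sort_this_list_properties would miss and one would catch:

```python
def sort_dropping_duplicates(input_list):
    """A deliberately buggy 'sort' that silently removes duplicates."""
    return sorted(set(input_list))


def is_ascending(items):
    """True if each element is lower than or equal to the one after it."""
    return all(a <= b for a, b in zip(items, items[1:]))


sample = [3, 1, 3, 2]
result = sort_dropping_duplicates(sample)  # [1, 2, 3]

print(is_ascending(result))        # True: the ordering property holds
print(set(result) == set(sample))  # True: the distinct elements match
print(len(result) == len(sample))  # False: the length property catches the bug
```

Each property rules out a different family of bugs, so a weak property on its own can give a false sense of safety.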

How does property-based testing differ from example-based?

While they're from different concepts, property-based tests share many characteristics with example-based tests. This is illustrated in the following comparison of steps you'd take to write a given test:

Example based	Property based
1. Set up some example data	1. Define data type matching a specification
2. Perform some operations on the data	2. Perform some operations on the data
3. Assert something about the result	3. Assert properties about the result

There are several instances where it would be worthwhile to use property-based testing. But the same can be said for example-based testing. They can, and very likely will, co-exist in the same codebase.

So if you're stressed about having to rewrite your entire test suite to try out property-based testing, don't worry. We wouldn't recommend that.

Example properties and how to test for them

By now, you've written your first property-based test and many lists were sorted 🎉 But sorting lists isn't a representative example of how you'd use property-based testing in the real world. So we've gathered three example properties and in this section, we'll guide you through how they might be used to test software.

All of the examples will continue to use the Hypothesis testing library and its @given function decorator.

Unexpected exceptions should never be thrown

Something that was tested by default in the previous test_sort_this_list_properties function was that the code didn't throw any exceptions. The fact that the code doesn't throw any exceptions (or more generally, only expected and documented exceptions, and that it never causes a segmentation fault) is a property. And this property can be a convenient one to test, especially if the code has a lot of internal assertions.

As an example, let's use the json.loads function from the Python standard library. Then, let's test that the json.loads function never throws any exceptions other than json.JSONDecodeError - regardless of input:

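# test_json_decode.py
import json

import hypothesis.strategies as some
from hypothesis import given

@given(some.text())
def test_json_loads(input_string):
    try:
        json.loads(input_string)
    except json.JSONDecodeError:
        # Malformed input is expected to raise this documented exception.
        return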

When you run the test file, it passes 🎉 So the beliefs held up under testing!

Values shouldn't change after encoding and then decoding

A commonly tested property is called symmetry. Symmetry means that, for certain pairs of operations, decoding an encoded value always results in the original value.

Let's apply it to base32-crockford, a Python library for the Base32 encoding format:

# test_base32_crockford.py
import base32_crockford
import hypothesis.strategies as some
from hypothesis import given

@given(some.integers(min_value=0))
def test_base32_crockford(input_int):
    assert base32_crockford.decode(base32_crockford.encode(input_int)) == input_int

Because this encoding scheme only works for non-negative integers, you need to specify the generation strategy of your input data. That's why, in this example, some.integers(min_value=0) is added to the @given decorator. It restricts Hypothesis to only generate integers with a minimum value of zero.

Once again, the test passes 🎉
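If you'd like to try the same round-trip idea without installing anything, here's a rough sketch using only the standard library. Note the assumptions: it uses the standard Base32 encoding from Python's base64 module (which works on bytes), not Crockford's variant, and the check_base32_round_trip helper, payload sizes and seed are made up for illustration:

```python
import base64
import random


def check_base32_round_trip(runs=1000, seed=0):
    """Encode then decode random byte strings and compare with the original."""
    rng = random.Random(seed)
    for _ in range(runs):
        # Generate between 0 and 32 random bytes:
        payload = bytes(rng.randrange(256) for _ in range(rng.randint(0, 32)))
        encoded = base64.b32encode(payload)
        # Symmetry: decoding the encoded value gives back the input.
        assert base64.b32decode(encoded) == payload
    return runs


print(check_base32_round_trip(), "round trips passed")
```

The shape of the test is the same as before. Only the pair of operations and the kind of generated input have changed.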

A naive method should still give the same result

Sometimes, you can get the desired solution through a naive, impractical way that isn't acceptable to use in production code. This might be because the execution time is too slow, the memory consumption is too high or it requires specific dependencies that aren't acceptable to install in production.

For example, consider counting the number of set bits in an (arbitrarily sized) integer, where you have an optimized solution from the gmpy2 library.

Let's compare this with a slower solution that converts the integer to a binary string (using the bin function in the Python standard library) and then counts the occurrences of the string "1" inside of it:

# test_gmpy_popcount.py
import gmpy2
import hypothesis.strategies as some
from hypothesis import given, settings

def count_bits_slow(input_int):
    return bin(input_int).count("1")

@given(some.integers(min_value=0))
@settings(max_examples=500)
def test_gmpy2_popcount(input_int):
    assert count_bits_slow(input_int) == gmpy2.popcount(input_int)

For illustrative purposes, this example specifies a @settings(max_examples=500) decorator to tweak the default number of input values to generate.

The test passes 🎉 - showing that the optimized, hard-to-follow code of gmpy2.popcount gives the same results as the slower but less complex count_bits_slow function.

Note: If this was the only reason to bring in gmpy2 as a dependency, it'd be wise to benchmark if the performance improvements of it really would outweigh the cost and weight of the dependency.
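A quick benchmark along those lines could be sketched with the timeit module from the standard library. To keep the sketch free of dependencies, it compares count_bits_slow against count_bits_loop, a hypothetical hand-written stand-in for the "optimized" candidate (it is not gmpy2). The input value and the number of calls are also arbitrary choices:

```python
import timeit


def count_bits_slow(input_int):
    return bin(input_int).count("1")


def count_bits_loop(input_int):
    """Stand-in candidate: clears the lowest set bit on every iteration."""
    count = 0
    while input_int:
        input_int &= input_int - 1
        count += 1
    return count


value = 2**256 - 12345

# Check that both agree before comparing their speed:
assert count_bits_slow(value) == count_bits_loop(value)

for fn in (count_bits_slow, count_bits_loop):
    seconds = timeit.timeit(lambda: fn(value), number=1000)
    print(f"{fn.__name__}: {seconds:.4f}s for 1000 calls")
```

The timings depend on your machine and Python version, and the "clever" version isn't guaranteed to win. That's exactly why it's worth measuring before taking on a new dependency.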

Finding bugs with property-based testing

We've gone over the concepts of property-based testing and seen various properties in action - this is great. But one of the selling points of property-based tests is that they're supposed to help us find more bugs. And we haven't found any bugs yet.

So let's go hunting.

For this example, let's use the json5 library for JSON5 serialization. It's a bit niche, sure, but it's also a younger project. This means that you're more likely to uncover a bug compared to a more established library.

With the json5 library, there are two properties you can test:

JSON5 is a superset of JSON, so any valid JSON string should also parse as JSON5.
Deserializing a serialized string should give you back the original object.

Let's use those properties in a test:

# test_json5_decode.py
import json
from string import printable

import hypothesis.strategies as some
import json5
from hypothesis import example, given, settings

# Construct a generator of arbitrary objects to test serialization on:
some_object = some.recursive(
    some.none() | some.booleans() | some.floats(allow_nan=False) | some.text(printable),
    lambda children: some.lists(children, min_size=1)
    | some.dictionaries(some.text(printable), children, min_size=1),
)

@given(some_object)
def test_json5_loads(input_object):
    dumped_json_string = json.dumps(input_object)
    dumped_json5_string = json5.dumps(input_object)

    parsed_object_from_json = json5.loads(dumped_json_string)
    parsed_object_from_json5 = json5.loads(dumped_json5_string)

    assert parsed_object_from_json == input_object
    assert parsed_object_from_json5 == input_object

After creating a some_object generator of arbitrary objects, you can verify aspects of the previously mentioned properties: You serialize the input using both json and json5. Then, you deserialize those two objects back using the json5 library and assert that the original object was obtained.

But doing this, you'll run into a problem. At the json5.dumps(input_object) statement, you get an exception inside the internals of the json5 library:

    def _is_ident(k):
        k = str(k)
>       if not _is_id_start(k[0]) and k[0] not in (u'$', u'_'):
E       IndexError: string index out of range

Note: If you're following along and you want to discover the bug yourself, uncomment json5==0.9.3 and remove json5 from the requirements.txt file.

Besides showing the stack trace, as usual, you also get an informative message showing the falsifying example - that is, the generated data that caused the test to fail:

# ------------- Hypothesis ------------- #
Falsifying example: test_json5_loads(
    input_object={'': None},
)

The {'': None} input (a dictionary with an empty string as its key) triggered the bug, which we promptly reported. Finding the issue led to fixing the bug. This fix has since been released in version 0.9.4 of the json5 library.

We want to add that the fix was merged and released within 20 minutes of reporting. Very impressive work by json5 maintainer @dpranke 👏
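To see why an empty key is enough to trigger this kind of error, here's a simplified reconstruction. Both functions are hypothetical stand-ins written for this guide. They are not the actual json5 source, and the guarded version is only one plausible way to handle the case, not necessarily the fix that was released:

```python
def is_ident_unguarded(key):
    """Simplified stand-in: indexes the first character unconditionally."""
    key = str(key)
    return key[0].isalpha() or key[0] in ("$", "_")


def is_ident_guarded(key):
    """Same check, but an empty key is treated as 'not an identifier'."""
    key = str(key)
    if not key:
        return False
    return key[0].isalpha() or key[0] in ("$", "_")


try:
    is_ident_unguarded("")
except IndexError as error:
    print(f"IndexError: {error}")  # IndexError: string index out of range

print(is_ident_guarded(""))      # False
print(is_ident_guarded("name"))  # True
```

An empty string is a perfectly valid dictionary key, but it's also the kind of input that rarely makes it into hand-written examples.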

But what about the future, how can you be sure that the problem will never resurface?

Because the data currently generated contains the troublesome input ({'': None}), you want to ensure that this input is always used. This should be true even if someone tweaks the some_object generator or updates the version of Hypothesis used.

The following fix uses the @example decorator to add a hard-coded example to the generated input:

--- test_json5_decode_orig.py   2020-03-27 09:48:24.000000000 +0100
+++ test_json5_decode.py    2020-03-27 09:48:32.000000000 +0100
@@ -14,6 +14,7 @@

 @given(some_object)
 @settings(max_examples=500)
+@example({"": None})
 def test_json5_loads(input_object):
     dumped_json_string = json.dumps(input_object)
     dumped_json5_string = json5.dumps(input_object)

Bug found ✅ Bug fixed ✅ You're good to go 🎉
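If you want to try the @example decorator without a specific json5 version installed, here's a minimal sketch of the same pattern using the json module from the standard library. The test name and the strategy are our own choices for illustration, and stacking two @example decorators is only there to show that you can pin more than one input:

```python
import json

import hypothesis.strategies as some
from hypothesis import example, given


@given(some.dictionaries(some.text(), some.none() | some.booleans()))
@example({"": None})  # Pinned: the input that once uncovered a bug
@example({})          # Pinned: another edge case worth always checking
def test_json_round_trip(input_object):
    # Deserializing a serialized object should give back the original:
    assert json.loads(json.dumps(input_object)) == input_object
```

The pinned inputs run on every test run, in addition to whatever Hypothesis generates, so they behave like small example-based tests living inside your property-based test.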

Available libraries

This guide uses the Hypothesis library for Python. But there's a lot of functionality that wasn't covered and, as previously mentioned, the documentation is nice. We'd recommend checking it out if you're a Python user.

If you're not using Python, no problem. There are several other libraries built for property-based testing, in a variety of languages:

fast-check: TypeScript
FsCheck: .NET
jqwik: Java
PropCheck: Elixir
PropEr: Erlang
RapidCheck: C++
QuickCheck: Haskell
QuickCheck ported to Rust: Rust

Conclusion

Example-based testing isn't going anywhere any time soon. But we hope that, after reading this guide, you're motivated to incorporate some property-based tests into your codebase.

Some lingering questions from us...

Why weren't you using property-based testing before?
After reading through this guide, would you be willing to try?
Are you interested in seeing another article expanding on the topic?

At Meeshkan, we're working to improve how people test their products. So no matter if you loved or loathed this guide, we want to hear from you. Leave us a comment below, tweet at us or reach out on Gitter to let us know what you think.

Top comments (10)

Artur Neumann • Apr 7 '20

@settings(max_examples=10000)
def test_sort_this_list_properties(input_list):
    sorted_list = sorted(input_list)

shouldn't that be

@settings(max_examples=10000)
def test_sort_this_list_properties(input_list):
    sorted_list = sort_this_list(input_list)

Carolyn Stransky • Apr 7 '20

Good catch! The outcome is the same because all the sort_this_list function does is execute the built-in sorted function 😆 But for clarity, I'll update. Thanks!

Cat • Apr 7 '20 • Edited

This type of testing looks very useful, I'll try it for myself.
Why do so many property-based testing library names allude to speed? Fast, Quick, Rapid...

Carolyn Stransky • Apr 7 '20

Yay 🎉 Curious, had you heard of property-based testing before reading this guide?

About the library names... that's very true 😂 I don't know exactly why. I could assume they are really trying to emphasize the 'speed' value of property-based testing - both in that the runtime is quick relative to the number of test cases and you save time by not writing all of these cases manually 🤔

Cat • Apr 7 '20

I had not heard of property-based testing before this, no.
Interestingly, the usage of post-conditions reminds me of a computer science class I had about formal program proofs. We used Frama-C for C and JML for Java and one of the exercises was about proving that a given sorting function is correct. Doing so involved writing the same post-conditions as this article.

As for the names, I think they're relative to how long it would take to write 100,000 tests by hand, yes. It's not wrong... But it's also not saying much 😂

Fredrik Fornwall • Apr 8 '20

Thanks for the discussion!

Regarding how design by contract (which at least JML is about) might relate to property-based testing I found a post which actually combines hypothesis (for property-based testing) with dpcontract (for design by contract):

Maybe they complement each other: “your code doesn’t violate any contracts” counts as a PBT invariant. Maybe we can generate inputs that match the preconditions and confirm that all of the postconditions (and preconditions of other called functions) are preserved.

I also like the following comment (from softwareengineering.stackexchange....

The relationship is that the combination of design by contract and the testing methods are attempting to substitute for a correctness proof, which would be the ultimate goal, if it were feasible.

Artur Neumann • Apr 7 '20

What would you say is the difference to fuzzy-testing?

Scott Simontis • Apr 7 '20

Isn't fuzzy testing just randomly generating values with no rhyme or reason (or perhaps within a large range)? It seems like property-based testing is more about guaranteeing that logical assertions hold throughout a large number of test cases versus ensuring that random input is handled correctly.

I suppose I need to go learn about fuzzy testing now!

Artur Neumann • Apr 7 '20

looks to me that the input values in the examples in this blog are pretty random (in the defined scope e.g. int or json) that is why you need to use @example to make sure this specific value will always be tested.
Maybe verifying that the results are correct is more the point in "property-based testing", you have random (fuzzy) input and because of that you cannot really check the details of the output so you "only" make sure properties (that you can be sure of) of the output are correct

Fredrik Fornwall • Apr 7 '20

In general, a property based test can be formulated as:

for all (x, y, ...)
such that precondition(x, y, ...) holds
property(x, y, ...) is true

One way to look at it (which not everyone agrees with, and is more theoretical than practical) is that fuzzing is a subset of property-based testing which focuses on the property that code should never crash (segfault, throw an exception, give an internal error response code, overflow buffers, etc) regardless of input.

While we have generated rather random inputs in this article, one will often use custom generators (such as the some_object generator in the test_json5_loads example) to build up more restricted, higher level input. It's also possible to use a mechanism like hypothesis.assume to filter away certain generated input that doesn't match the desired precondition.