Let Hypothesis Break Your Python Code Before Your Users Do

You need to take testing your code seriously. You may write unit tests with pytest, mock dependencies, and strive for high code coverage. If you're like me, though, you may have a nagging question lingering in the back of your mind after you finish coding a test suite.

“Have I thought of all the edge cases?”

You may test your inputs with positive numbers, negative numbers, zero, and empty strings. But what about weird Unicode characters? Or floating-point numbers that might be NaN or infinity? What about a list of lists of empty strings or complex nested JSON? The space of possible inputs is large, and it's hard to think of all the ways your code could break, especially if you're under time pressure.

Property-based testing shifts that burden from you to the tooling. Instead of hand-picking examples, you state a property: a truth that must hold for all inputs. The Hypothesis library then generates inputs (several hundred if required), hunts for counterexamples and, if it finds one, shrinks it to the simplest failing case.

In this article, I'll introduce you to the powerful concept of property-based testing and its implementation in Hypothesis. We'll go beyond simple functions and show you how to test complex data structures and stateful classes, as well as how to fine-tune Hypothesis for robust and efficient testing.

So, what exactly is property-based testing?

Property-based testing is a technique where, instead of writing tests for specific, hardcoded examples, you define the general “properties” or “invariants” of your code. A property is a high-level statement about the behaviour of your code that should hold for all valid inputs. You then use a testing framework, like Hypothesis, which intelligently generates a wide range of inputs and tries to find a “counter-example”: a specific input for which your stated property is false.

Some key points of property-based testing with Hypothesis include:

Generative Testing. Hypothesis generates test cases for you, from the simple to the weird, exploring edge cases you would likely miss.
Property-Driven. It shifts your mindset from “what's the output for this specific input?” to “what are the universal truths about my function's behaviour?”
Shrinking. This is Hypothesis's killer feature. When it finds a failing test case (which may be large and complex), it doesn't just report it. It automatically “shrinks” the input down to the smallest and simplest possible example that still causes the failure, often making debugging dramatically easier.
Stateful Testing. Hypothesis can test not only pure functions, but also the interactions and state changes of complex objects over a sequence of method calls.
Extensible Strategies. Hypothesis provides a rich library of “strategies” for generating data, and allows you to compose them or build entirely new ones to match your application's data models.
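
To make the ideas of generating inputs, checking a property, and shrinking a failure more concrete, here is a toy sketch written with the standard library only. It is not how Hypothesis is implemented, and the helper names (find_counterexample, shrink, and the two property functions) are invented for this illustration; Hypothesis's real generation and shrinking are far more sophisticated.

```python
import random


def find_counterexample(prop, rng, tries=200):
    """Return the first random list for which prop is false, else None."""
    for _ in range(tries):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(0, 8))]
        if not prop(data):
            return data
    return None


def shrink(prop, data):
    """Greedily drop single elements for as long as the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(data)):
            candidate = data[:i] + data[i + 1:]
            if not prop(candidate):
                data, changed = candidate, True
                break
    return data


def sort_is_idempotent(xs):
    # A true property: sorting twice gives the same result as sorting once.
    return sorted(sorted(xs)) == sorted(xs)


def reverse_is_identity(xs):
    # A deliberately false property: reversing a list leaves it unchanged.
    return list(reversed(xs)) == xs


rng = random.Random(0)  # fixed seed so the run is repeatable

# No counterexample is found for the true property.
assert find_counterexample(sort_is_idempotent, rng) is None

# The false property is refuted, and the failing input is then reduced.
failing = find_counterexample(reverse_is_identity, rng)
minimal = shrink(reverse_is_identity, failing)
print("found:", failing, "shrunk to:", minimal)
```

The shrunk list always ends up with just two different elements, which is the kind of small, readable failure report that shrinking is meant to give you.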

Why Hypothesis Matters / Common Use Cases

The primary advantage of property-based testing is its ability to find subtle bugs and increase your confidence in the correctness of your code far beyond what's possible with example-based testing alone. It forces you to think more deeply about your code's contracts and assumptions.

Hypothesis is especially effective for testing:

Serialisation/Deserialisation. A classic property is that for any object x, decode(encode(x)) should be equal to x. This is ideal for testing functions that work with JSON or custom binary formats.
Complex Business Logic. Any function with complex conditional logic is a great candidate. Hypothesis will explore paths through your code that you may not have considered.
Stateful Systems. Testing classes and objects to make sure that no sequence of valid operations can put the object into a corrupted or invalid state.
Testing against a reference implementation. You can state the property that your new, optimised function should always produce the same result as a simpler, known-good reference implementation.
Functions that accept complex data models. Testing functions that take Pydantic models, dataclasses, or other custom objects as input.

Setting up a development environment

All you need is Python and pip. We'll install pytest as our test runner, hypothesis itself, and pydantic for one of our advanced examples.

(base) tom@tpr-desktop:~$ python -m venv hyp-env
(base) tom@tpr-desktop:~$ source hyp-env/bin/activate
(hyp-env) (base) tom@tpr-desktop:~$

# Install pytest, hypothesis, and pydantic
(hyp-env) (base) tom@tpr-desktop:~$ pip install pytest hypothesis pydantic 

# Create a new folder to hold your Python code
(hyp-env) (base) tom@tpr-desktop:~$ mkdir hyp-project

Hypothesis is best run through an established test runner such as pytest, so that's what we'll do here.

Code example 1 — A straightforward test

In this simplest of examples, we have a function that calculates the area of a rectangle. It should take two integer parameters, each greater than zero, and return their product.

Hypothesis tests are defined using two things: the @given decorator and a strategy, which is passed to the decorator. Think of a strategy as a description of the kind of data that Hypothesis will generate to test your function. Here's a simple example. First, we define the function we want to test.

# my_geometry.py

def calculate_rectangle_area(length: int, width: int) -> int:
  """
  Calculates the area of a rectangle given its length and width.

  Raises a TypeError if either dimension is not an integer, and a
  ValueError if either dimension is not positive.
  """
  if not isinstance(length, int) or not isinstance(width, int):
    raise TypeError("Length and width must be integers.")

  if length <= 0 or width <= 0:
    raise ValueError("Length and width must be positive.")
  
  return length * width

Next is the testing function.

# test_my_geometry.py

from my_geometry import calculate_rectangle_area
from hypothesis import given, strategies as st
import pytest

# By using st.integers(min_value=1) for both arguments, we guarantee
# that Hypothesis will only generate valid inputs for our function.
@given(
    length=st.integers(min_value=1), 
    width=st.integers(min_value=1)
)
def test_rectangle_area_with_valid_inputs(length, width):
    """
    Property: For any positive integers length and width, the area
    should be equal to their product.
    
    This test ensures the core multiplication logic is correct.
    """
    print(f"Testing with valid inputs: length={length}, width={width}")
    
    # The property we're checking is the mathematical definition of area.
    assert calculate_rectangle_area(length, width) == length * width

Adding the @given decorator to the function turns it into a Hypothesis test. Passing the strategy (st.integers) to the decorator tells Hypothesis to generate random integers for the length and width arguments, and min_value=1 further constrains them so that neither integer can be less than one.

We can run this test like this.

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_geometry.py

=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_my_geometry.py Testing with valid inputs: length=1, width=1
Testing with valid inputs: length=6541, width=1
Testing with valid inputs: length=6541, width=28545
Testing with valid inputs: length=1295885530, width=1
Testing with valid inputs: length=1295885530, width=25191
Testing with valid inputs: length=14538, width=1
Testing with valid inputs: length=14538, width=15503
Testing with valid inputs: length=7997, width=1
...
...

Testing with valid inputs: length=19378, width=22512
Testing with valid inputs: length=22512, width=22512
Testing with valid inputs: length=3392, width=44
Testing with valid inputs: length=44, width=44
.

============================================ 1 passed in 0.10s =============================================

By default, Hypothesis will run your test function against 100 different inputs. You can increase or decrease this by using the settings decorator. For instance:

from hypothesis import given, strategies as st, settings
...
...
@given(
    length=st.integers(min_value=1), 
    width=st.integers(min_value=1)
)
@settings(max_examples=3)
def test_rectangle_area_with_valid_inputs(length, width):
...
...

#
# Outputs
#
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_geometry.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_my_geometry.py 
Testing with valid inputs: length=1, width=1
Testing with valid inputs: length=1870, width=5773964720159522347
Testing with valid inputs: length=61, width=25429
.

============================================ 1 passed in 0.06s =============================================

Code Example 2 — Testing the Classic “Round-Trip” Property

Let's look at a classic property: serialisation and deserialisation should be reversible. In short, decode(encode(x)) should return x.

We'll write a function that takes a dictionary and encodes it into a URL query string.

Create a file in your hyp-project folder named my_encoders.py.

# my_encoders.py
import urllib.parse

def encode_dict_to_querystring(data: dict) -> str:
    # A bug exists here: it doesn't handle nested structures well
    return urllib.parse.urlencode(data)

def decode_querystring_to_dict(qs: str) -> dict:
    return dict(urllib.parse.parse_qsl(qs))

These are two elementary functions. What could go wrong with them? Now let's test them in test_encoders.py:

# test_encoders.py

from my_encoders import encode_dict_to_querystring, decode_querystring_to_dict
from hypothesis import given, strategies as st

# A strategy for generating dictionaries with simple text keys and values
simple_dict_strategy = st.dictionaries(keys=st.text(), values=st.text())

@given(data=simple_dict_strategy)
def test_querystring_roundtrip(data):
    """Property: decoding an encoded dict should yield the original dict."""
    encoded = encode_dict_to_querystring(data)
    decoded = decode_querystring_to_dict(encoded)

    # We have to be careful with types: parse_qsl returns string values,
    # so we convert our original values to strings for a fair comparison.
    original_as_str = {k: str(v) for k, v in data.items()}

    assert decoded == original_as_str

Now we can run our test.

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_encoders.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_encoders.py F

================================================= FAILURES =================================================
_______________________________________ test_for_nesting_limitation ________________________________________

    @given(data=st.recursive(
>       # Base case: A flat dictionary of text keys and straightforward values (text or integers).
                   ^^^
        st.dictionaries(st.text(), st.integers() | st.text()),
        # Recursive step: Allow values to be dictionaries themselves.
        lambda children: st.dictionaries(st.text(), children)
    ))

test_encoders.py:7:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

data = {'': {}}

    @given(data=st.recursive(
        # Base case: A flat dictionary of text keys and straightforward values (text or integers).
        st.dictionaries(st.text(), st.integers() | st.text()),
        # Recursive step: Allow values to be dictionaries themselves.
        lambda children: st.dictionaries(st.text(), children)
    ))
    def test_for_nesting_limitation(data):
        """
        This test asserts that the decoded data structure matches the original.
        It will fail because urlencode flattens nested structures.
        """
        encoded = encode_dict_to_querystring(data)
        decoded = decode_querystring_to_dict(encoded)

        # This is a deliberately simple assertion. It will fail for nested
        # dictionaries because the `decoded` version will have a stringified
        # inner dict, while the `data` version will have a real inner dict.
        # This is how we reveal the bug.
>       assert decoded == data
E       AssertionError: assert {'': '{}'} == {'': {}}
E
E         Differing items:
E         {'': '{}'} != {'': {}}
E         Use -v to get more diff
E       Falsifying example: test_for_nesting_limitation(
E           data={'': {}},
E       )

test_encoders.py:24: AssertionError
========================================= short test summary info ==========================================
FAILED test_encoders.py::test_for_nesting_limitation - AssertionError: assert {'': '{}'} == {'': {}}

Okay, that was unexpected. Let's try to decipher what went wrong. Note that the failing test in this output, test_for_nesting_limitation, uses a recursive strategy that can also generate nested dictionaries. The TL;DR is that the encode/decode functions don't work correctly for nested dictionaries.

The Falsifying Example. The most important clue is at the very bottom. Hypothesis is telling us the exact input that breaks the code.

test_for_nesting_limitation(
    data={'': {}},
)

The input is a dictionary where the key is an empty string and the value is an empty dictionary. This is a classic edge case that a human might overlook.

The Assertion Error. The test failed because of a failed assert statement:

AssertionError: assert {'': '{}'} == {'': {}}

This is the core of the issue. The original data that went into the test was {'': {}}. The decoded result that came out of our functions was {'': '{}'}. This shows that for the key '', the values are different:

In decoded, the value is the string '{}'.
In data, the value is the dictionary {}.

A string is not equal to a dictionary, so the assertion decoded == data is false, and the test fails.

Tracing the Bug Step-by-Step

Our encode_dict_to_querystring function uses urllib.parse.urlencode. When urlencode sees a value that is a dictionary (like {}), it doesn't know how to handle it, so it just converts it to its string representation ('{}').

The information about the value's original type (that it was a dict) is lost for good.

When the decode_querystring_to_dict function reads the data back, it correctly decodes the value as the string '{}'. It has no way of knowing that it was originally a dictionary.
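
The trace above can be reproduced in a few lines using only the standard library. This small sketch feeds the falsifying example straight through urlencode and parse_qsl, the same two calls our functions wrap, so you can see where the dictionary turns into a string.

```python
import urllib.parse

original = {"": {}}

# urlencode calls str() on the inner dict, so '{}' is what gets percent-encoded.
encoded = urllib.parse.urlencode(original)

# parse_qsl can only hand back strings; the fact that the value was a dict is gone.
decoded = dict(urllib.parse.parse_qsl(encoded))

print(encoded)              # =%7B%7D
print(decoded)              # {'': '{}'}
print(decoded == original)  # False
```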

The Solution: Encode Nested Values as JSON Strings

The solution is simple:

Encode. Before URL-encoding, check each value in your dictionary. If a value is a dict or a list, convert it into a JSON string first.
Decode. After URL-decoding, check each value. If a value looks like a JSON string (e.g., starts with { or [), parse it back into a Python object.
Make our testing more comprehensive. Our @given decorator becomes more complex. In simple terms, it tells Hypothesis to generate dictionaries that can contain other dictionaries as values, allowing for nested data structures of varying depth. For example:

A simple, flat dictionary: {'name': 'Alice', 'city': 'London'}
A one-level nested dictionary: {'user': {'id': '123', 'name': 'Tom'}}
A two-level nested dictionary: {'config': {'database': {'host': 'localhost'}}}
And so on…
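
The encode and decode steps described above could be sketched roughly as follows. This is one possible reading of those steps rather than a definitive implementation: the use of keep_blank_values=True and the fallback to the raw string when JSON parsing fails are my own assumptions, added so that empty values and non-JSON strings survive the trip.

```python
# my_encoders.py (a possible fixed version, written as a sketch)
import json
import urllib.parse


def encode_dict_to_querystring(data: dict) -> str:
    # JSON-encode nested containers so their structure survives urlencode.
    prepared = {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in data.items()
    }
    return urllib.parse.urlencode(prepared)


def decode_querystring_to_dict(qs: str) -> dict:
    decoded = {}
    # Assumption: keep blank values so that empty strings are not dropped.
    for key, value in urllib.parse.parse_qsl(qs, keep_blank_values=True):
        if value.startswith(("{", "[")):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                pass  # It only looked like JSON; keep the raw string.
        decoded[key] = value
    return decoded


nested = {"user": {"id": "123", "name": "Tom"}, "city": "London"}
assert decode_querystring_to_dict(encode_dict_to_querystring(nested)) == nested
```

One caveat with this heuristic is worth keeping in mind: a plain string value that happens to be valid JSON starting with { or [ (for example the text "[]") would come back as a Python object rather than a string, so the round-trip is only guaranteed for data that avoids such values.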

Here is the updated test file.

# test_encoders.py

from my_encoders import encode_dict_to_querystring, decode_querystring_to_dict
from hypothesis import given, strategies as st

# =========================================================================
# TEST 1: This test proves that the NESTING logic is correct.
# It uses a strategy that ONLY generates strings, so we don't have to
# worry about type conversion. This test will PASS.
# =========================================================================
@given(data=st.recursive(
    st.dictionaries(st.text(), st.text()),
    lambda children: st.dictionaries(st.text(), children)
))
def test_roundtrip_preserves_nested_structure(data):
    """Property: The encode/decode round-trip should preserve nested structures."""
    encoded = encode_dict_to_querystring(data)
    decoded = decode_querystring_to_dict(encoded)
    assert decoded == data

# =========================================================================
# TEST 2: This test proves that the TYPE CONVERSION logic is correct
# for simple, FLAT dictionaries. This test will also PASS.
# =========================================================================
@given(data=st.dictionaries(st.text(), st.integers() | st.text()))
def test_roundtrip_stringifies_simple_values(data):
    """
    Property: The round-trip should convert simple values (like ints)
    to strings.
    """
    encoded = encode_dict_to_querystring(data)
    decoded = decode_querystring_to_dict(encoded)

    # Create the model of what we expect: a dictionary with stringified values.
    expected_data = {k: str(v) for k, v in data.items()}
    assert decoded == expected_data

Now, if we rerun our tests, we get this:

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_encoders.py .                                                                                   [100%]

============================================ 1 passed in 0.16s =============================================

What we worked through there is a classic example of how useful testing with Hypothesis can be. What we thought were two simple, error-free functions turned out to be anything but.

Code Example 3— Constructing a Custom Strategy for a Pydantic Model

Many real-world functions don't just take simple dictionaries; they take structured objects like Pydantic models. Hypothesis can build strategies for these custom types, too.

Let’s define a model in my_models.py.

# my_models.py
from pydantic import BaseModel, Field
from typing import List

class Product(BaseModel):
    id: int = Field(gt=0)
    name: str = Field(min_length=1)
    tags: List[str]


def calculate_shipping_cost(product: Product, weight_kg: float) -> float:
    # A buggy shipping cost calculator
    cost = 10.0 + (weight_kg * 1.5)
    if "fragile" in product.tags:
        cost *= 1.5 # Extra cost for fragile items
    if weight_kg > 10:
        cost += 20 # Surcharge for heavy items
    # Bug: what if cost is negative?
    return cost

Now, in test_shipping.py, we'll build a strategy to generate Product instances and test our buggy function.

# test_shipping.py
from my_models import Product, calculate_shipping_cost
from hypothesis import given, strategies as st

# Build a strategy for our Product model
product_strategy = st.builds(
    Product,
    id=st.integers(min_value=1),
    name=st.text(min_size=1),
    tags=st.lists(st.sampled_from(["electronics", "books", "fragile", "clothing"]))
)

@given(
    product=product_strategy,
    weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
)
def test_shipping_cost_is_always_positive(product, weight_kg):
    """Property: The shipping cost should never be negative."""
    cost = calculate_shipping_cost(product, weight_kg)
    assert cost >= 0

And the test output?

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_shipping.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_shipping.py F

=============================================================== FAILURES ===============================================================
________________________________________________ test_shipping_cost_is_always_positive _________________________________________________

    @given(
>       product=product_strategy,
                   ^^^
        weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
    )

test_shipping.py:13:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

product = Product(id=1, name='0', tags=[]), weight_kg = -7.0

    @given(
        product=product_strategy,
        weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
    )
    def test_shipping_cost_is_always_positive(product, weight_kg):
        """Property: The shipping cost should never be negative."""
        cost = calculate_shipping_cost(product, weight_kg)
>       assert cost >= 0
E       assert -0.5 >= 0
E       Falsifying example: test_shipping_cost_is_always_positive(
E           product=Product(id=1, name='0', tags=[]),
E           weight_kg=-7.0,
E       )

test_shipping.py:19: AssertionError
======================================================= short test summary info ========================================================
FAILED test_shipping.py::test_shipping_cost_is_always_positive - assert -0.5 >= 0
========================================================== 1 failed in 0.12s ===========================================================

When you run this with pytest, Hypothesis quickly finds a falsifying example: a negative weight_kg can result in a negative shipping cost (here, 10.0 + (-7.0 * 1.5) = -0.5). This is an edge case we may not have considered, but Hypothesis found it automatically.
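
One plausible way to close this gap, assuming a negative weight should be treated as invalid input rather than priced, is to reject it up front. In the sketch below a plain dataclass stands in for the Pydantic Product model so that the snippet has no third-party dependencies; the pricing rules are copied from the function above, and the error message is my own.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    # Dependency-free stand-in for the Pydantic model in my_models.py.
    id: int
    name: str
    tags: list[str] = field(default_factory=list)


def calculate_shipping_cost(product: Product, weight_kg: float) -> float:
    # Assumption: a negative weight is a caller error, so fail loudly.
    if weight_kg < 0:
        raise ValueError("weight_kg must not be negative")
    cost = 10.0 + (weight_kg * 1.5)
    if "fragile" in product.tags:
        cost *= 1.5  # Extra cost for fragile items
    if weight_kg > 10:
        cost += 20  # Surcharge for heavy items
    return cost


print(calculate_shipping_cost(Product(id=1, name="0"), 0.0))  # 10.0
```

With a guard like this in place, the property test would need to draw weight_kg with min_value=0, or a second test could assert that negative weights raise ValueError.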

Code Example 4— Testing Stateful Classes

Hypothesis can do more than test pure functions. It can test classes with internal state by generating sequences of method calls to try to break them. Let's test a simple custom LimitedCache class.

my_cache.py

# my_cache.py
class LimitedCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("Capacity must be positive")
        self._cache = {}
        self._capacity = capacity
        # Bug: This should probably be a deque or ordered dict for proper LRU
        self._keys_in_order = []

    def put(self, key, value):
        if key not in self._cache and len(self._cache) >= self._capacity:
            # Evict the oldest item
            key_to_evict = self._keys_in_order.pop(0)
            del self._cache[key_to_evict]
        
        if key not in self._keys_in_order:
            self._keys_in_order.append(key)
        self._cache[key] = value

    def get(self, key):
        return self._cache.get(key)
   
    @property
    def size(self):
        return len(self._cache)

This cache has several potential bugs related to its eviction policy. Let’s test it using a Hypothesis Rule-Based State Machine, which is designed for testing objects with internal state by generating random sequences of method calls to discover bugs that only appear after specific interactions.

Create the file test_cache.py.

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition
from my_cache import LimitedCache

class CacheMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.cache = LimitedCache(capacity=3)

    # This rule adds 3 initial items to fill the cache
    @rule(
        k1=st.just('a'), k2=st.just('b'), k3=st.just('c'),
        v1=st.integers(), v2=st.integers(), v3=st.integers()
    )
    def fill_cache(self, k1, v1, k2, v2, k3, v3):
        self.cache.put(k1, v1)
        self.cache.put(k2, v2)
        self.cache.put(k3, v3)

    # This rule can only run AFTER the cache has been filled.
    # It tests the core logic of LRU vs FIFO.
    @precondition(lambda self: self.cache.size == 3)
    @rule()
    def test_update_behavior(self):
        """
        Property: Updating the oldest item ('a') should make it the most recent,
        so the subsequent eviction should remove the second-oldest item ('b').
        Our buggy FIFO cache will incorrectly remove 'a' anyway.
        """
        # At this point, keys_in_order is ['a', 'b', 'c'].
        # 'a' is the oldest.

        # We "use" 'a' again by updating it. In a correct LRU cache,
        # this would make 'a' the most recently used item.
        self.cache.put('a', 999)

        # Now, we add a new key, which should force an eviction.
        self.cache.put('d', 4)

        # A correct LRU cache would evict 'b'.
        # Our buggy FIFO cache will evict 'a'.
        # This assertion checks the state of 'a'.
        # In our buggy cache, get('a') will be None, so this will fail.
        assert self.cache.get('a') is not None, "Item 'a' was incorrectly evicted"
        
# This tells pytest to run the state machine test
TestCache = CacheMachine.TestCase

Hypothesis will generate sequences of calls to the rules defined above. It will quickly find a sequence that causes the cache's eviction to behave differently from what we expect of an LRU cache, thereby revealing the bug in our implementation.

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_cache.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_cache.py F

=============================================================== FAILURES ===============================================================
__________________________________________________________ TestCache.runTest ___________________________________________________________

self = 

    def runTest(self):
>       run_state_machine_as_test(cls, settings=self.settings)

../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:476:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:258: in run_state_machine_as_test
    state_machine_test(state_machine_factory)
../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:115: in run_state_machine
    @given(st.data())
               ^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = CacheMachine({})

    @precondition(lambda self: self.cache.size == 3)
    @rule()
    def test_update_behavior(self):
        """
        Property: Updating the oldest item ('a') should make it the most recent,
        so the subsequent eviction should remove the second-oldest item ('b').
        Our buggy FIFO cache will incorrectly remove 'a' anyway.
        """
        # At this point, keys_in_order is ['a', 'b', 'c'].
        # 'a' is the oldest.

        # We "use" 'a' again by updating it. In a correct LRU cache,
        # this would make 'a' the most recently used item.
        self.cache.put('a', 999)

        # Now, we add a new key, which should force an eviction.
        self.cache.put('d', 4)

        # A correct LRU cache would evict 'b'.
        # Our buggy FIFO cache will evict 'a'.
        # This assertion checks the state of 'a'.
        # In our buggy cache, get('a') will be None, so this will fail.
>       assert self.cache.get('a') is not None, "Item 'a' was incorrectly evicted"
E       AssertionError: Item 'a' was incorrectly evicted
E       assert None is not None
E        +  where None = get('a')
E        +    where get = .get
E        +      where  = CacheMachine({}).cache
E       Falsifying example:
E       state = CacheMachine()
E       state.fill_cache(k1='a', k2='b', k3='c', v1=0, v2=0, v3=0)
E       state.test_update_behavior()
E       state.teardown()

test_cache.py:44: AssertionError
======================================================= short test summary info ========================================================
FAILED test_cache.py::TestCache::runTest - AssertionError: Item 'a' was incorrectly evicted
========================================================== 1 failed in 0.20s ===========================================================

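The above output highlights a bug in the code. In simple terms, it shows that the cache is not a correct “Least Recently Used” (LRU) cache. It has the following significant flaw: updating an existing key does not mark it as recently used, so the cache behaves as a first-in, first-out (FIFO) queue and evicts 'a' even though it was just updated.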
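
If a true LRU policy is what you want, one way to get it is to keep the entries in a collections.OrderedDict and move a key to the end whenever it is used. The sketch below is a minimal illustration under that assumption; the class name LRUCache is hypothetical, and treating a get as a “use” is a design choice the original class does not make.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("Capacity must be positive")
        self._capacity = capacity
        self._cache = OrderedDict()  # oldest entry first, newest last

    def put(self, key, value):
        if key in self._cache:
            # Updating an existing key counts as a use: refresh its position.
            self._cache.move_to_end(key)
        elif len(self._cache) >= self._capacity:
            # Evict the least recently used entry (the first one).
            self._cache.popitem(last=False)
        self._cache[key] = value

    def get(self, key):
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)  # reading also counts as a use
        return self._cache[key]

    @property
    def size(self):
        return len(self._cache)


cache = LRUCache(capacity=3)
for k in ("a", "b", "c"):
    cache.put(k, 0)
cache.put("a", 999)  # 'a' becomes the most recently used
cache.put("d", 4)    # forces an eviction: 'b' goes, 'a' stays
print(cache.get("a"), cache.get("b"))  # 999 None
```

Swapping a class like this into the state machine above should let the test_update_behavior rule pass, since 'a' now survives the eviction.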

Code Example 5 — Testing Against a Simpler, Reference Implementation

For our final example, we'll look at a common situation. Often, developers write functions that are supposed to replace older, slower, but otherwise perfectly correct functions. Your new function must produce the same outputs as the old function for the same inputs. Hypothesis can make this kind of testing much easier.

Let's say we have a simple function, sum_list_simple, and a new, “optimised” sum_list_fast that has a bug.

my_sums.py

# my_sums.py
def sum_list_simple(data: list[int]) -> int:
    # This is our simple, correct reference implementation
    return sum(data)

def sum_list_fast(data: list[int]) -> int:
    # A new "fast" implementation with a bug (e.g., integer overflow for big numbers)
    # or, in this case, a simple mistake.
    total = 0
    for x in data:
        # Bug: This should be +=
        total = x
    return total

test_my_sums.py

# test_my_sums.py
from my_sums import sum_list_simple, sum_list_fast
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_fast_sum_matches_simple_sum(data):
    """
    Property: The result of the new, fast function should always match
    the result of the simple reference function.
    """
    assert sum_list_fast(data) == sum_list_simple(data)

Hypothesis will quickly find that for most lists with more than one element, the new function fails. Let's try it out.

(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_sums.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item

test_my_sums.py F

================================================= FAILURES =================================================
_____________________________________ test_fast_sum_matches_simple_sum _____________________________________

    @given(st.lists(st.integers()))
>   def test_fast_sum_matches_simple_sum(data):
                   ^^^

test_my_sums.py:6:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

data = [1, 0]

    @given(st.lists(st.integers()))
    def test_fast_sum_matches_simple_sum(data):
        """
        Property: The result of the new, fast function should always match
        the result of the simple reference function.
        """
>       assert sum_list_fast(data) == sum_list_simple(data)
E       assert 0 == 1
E        +  where 0 = sum_list_fast([1, 0])
E        +  and   1 = sum_list_simple([1, 0])
E       Falsifying example: test_fast_sum_matches_simple_sum(
E           data=[1, 0],
E       )

test_my_sums.py:11: AssertionError
========================================= short test summary info ==========================================
FAILED test_my_sums.py::test_fast_sum_matches_simple_sum - assert 0 == 1
============================================ 1 failed in 0.17s =============================================

So, the test failed because the “fast” sum function gave the wrong answer (0) for the input list [1, 0], while the correct answer, provided by the “simple” sum function, was 1. Notice how small the reported input is: Hypothesis shrinks whatever counterexample it first finds down to a minimal failing case. Now that you know what the problem is, you can take steps to fix it.
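As a minimal sketch of that fix, the loop only needs to accumulate rather than overwrite. The body of sum_list_simple shown here is an assumption (a thin wrapper around the built-in sum()), included only so the snippet runs on its own; the spot checks at the end are illustrative and do not replace re-running the property test.

```python
def sum_list_simple(data: list[int]) -> int:
    # Assumed reference implementation: defer to the built-in sum().
    return sum(data)


def sum_list_fast(data: list[int]) -> int:
    total = 0
    for x in data:
        # Fixed: accumulate each element instead of overwriting the total.
        total += x
    return total


# Re-check the falsifying example reported above, plus a few edge cases.
for sample in ([1, 0], [], [5], [-3, 3, 7]):
    assert sum_list_fast(sample) == sum_list_simple(sample)
```

With this change in my_sums.py, re-running pytest on test_my_sums.py should let Hypothesis search again for a list where the two functions disagree.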

Summary

In this article, we took a deep dive into the world of property-based testing with Hypothesis, moving beyond simple examples to show how it can be applied to real-world testing challenges. We saw that by defining the invariants of our code, we can uncover subtle bugs that traditional testing would likely miss. We learned how to:

Test the “round-trip” property and see how more complex data strategies can reveal limitations in our code.
Build custom strategies to generate instances of complex Pydantic models for testing business logic.
Use a RuleBasedStateMachine to test the behaviour of stateful classes by generating sequences of method calls.
Validate a complex, optimised function by testing it against a simpler, known-good reference implementation.

Adding property-based tests to your toolkit won’t replace all of your existing tests. Still, it will profoundly augment them, forcing you to think more clearly about your code’s contracts and giving you a much higher degree of confidence in its correctness. I encourage you to pick a function or class in your codebase, think about its fundamental properties, and let Hypothesis try its best to prove you wrong. You’ll be a better developer for it.

I’ve only scratched the surface of what Hypothesis can do for your testing. For more information, refer to the official documentation, available via the link below.

https://hypothesis.readthedocs.io/en/latest
