Cannot MBT with PBT

Property-based testing can look a lot like model-based testing: it generates command sequences, tracks state, replays failures, and shrinks them. But there is a hard boundary when the next valid input depends on values the running system just returned. This post walks through that boundary with a tiny payment system and Hypothesis.

❦ ❦ ❦

Whenever I describe model-based testing (MBT), someone eventually asks:

“So, Hypothesis?”

Not quite. Hypothesis is excellent at stateful property-based testing (PBT), and that gets surprisingly close. The mismatch appears when we try to do full-scale MBT:

MBT means driving a system under test (SUT) from a separate model. The model state tracks what we have observed from the SUT, and the next input to the SUT can depend on that observed state. The model also acts as a test oracle, checking that SUT responses match the model’s expectations.

MBT is naturally suited for complex systems with external nondeterminism: for example, a payment system that interacts with external card networks. Similar patterns show up in cloud-native applications, microservice architectures, and distributed systems such as databases or etcd.

The mismatch I’m talking about is not something that can easily be fixed. Nor is it a singular flaw of Hypothesis; other popular PBT tools such as fast-check, PropEr, QuickCheck, and proptest share similar characteristics. This is because the tension is fundamental: property-based testing first and foremost wants to generate a replayable, shrinkable artifact. For example, if your PBT test of a stack data structure fails with

stack.push(48)
stack.push(17)
stack.pop()  # returns 17
stack.pop()  # returns 48
stack.pop()  # returns None, but expected an exception

PBT tools will try to shrink that test case to a smaller one that still fails, such as:

stack.push(1)
stack.pop()  # returns 1
stack.pop()  # returns None, but expected an exception

This is a fundamental strength of PBT, one that all PBT tools inherit. But it also creates a tension with the kind of MBT this post focuses on: tests where the next valid input may depend on values returned by the running SUT. The usual PBT solutions to this problem work by making the generated artifact symbolic or adaptive, which is a compromise.

Let’s try to test a small payment system.

The Payment System

Consider a card-payment flow with two commands:

authorize(amount) reserves funds and creates a transaction.
capture(transaction_id) finalizes that transaction.

Let’s look at TinyPay, a small in-memory implementation of that flow, as our SUT. TinyPay is not a real payment system, and omits some important features. But it is sufficient to illustrate the testing challenges.

The important part is that our SUT creates transaction IDs during authorization:

def authorize(self, amount_cents: int) -> TransactionResult:
    """Create an AUTHORIZED transaction and return its new transaction ID."""
    transaction_id = uuid4()
    self._transactions[transaction_id] = Transaction(
        transaction_id=transaction_id,
        amount_cents=amount_cents,
        status=TransactionStatus.AUTHORIZED,
    )
    return TransactionResult(
        transaction_id=transaction_id,
        amount_cents=amount_cents,
        status=TransactionStatus.AUTHORIZED,
    )

When we capture, TinyPay expects us to provide an existing transaction ID (that is, one that was returned by authorize). TinyPay looks up the provided transaction and checks its status to avoid double-capture, then moves it to the captured state:

def capture(self, transaction_id: UUID) -> TransactionResult:
    """Move an AUTHORIZED transaction to CAPTURED and return the capture."""
    try:
        transaction = self._transactions[transaction_id]
    except KeyError:
        raise UnknownTransactionError(transaction_id) from None

    if transaction.status == TransactionStatus.CAPTURED:
        raise AlreadyCapturedError(transaction_id)

    transaction.status = TransactionStatus.CAPTURED
    return TransactionResult(
        transaction_id=transaction.transaction_id,
        amount_cents=transaction.amount_cents,
        status=transaction.status,
    )

The protocol is:

Sequence diagram showing authorization creating a transaction ID, and capture using that transaction ID.

The gist: A valid capture input depends on a value that does not exist until we have executed authorize.

𐫱

A Stateful Property-Based Test

To model this system in a property-based test, we need to implement some kind of state machine that tracks which transactions have been authorized and captured so far.

For this kind of stateful testing, Hypothesis provides RuleBasedStateMachine. This is a powerful tool for writing MBT-style tests, but it also reveals some of the limitations of PBT when the model has to use values observed from the SUT.

For our tests, we want to keep a small model of what we have observed from the SUT:

class PaymentMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.sut = PaymentSystem()
        self.authorized_transactions = []
        self.captured_transaction_ids = set()

We also provide some helper methods to interact with the SUT and update our model.

For the sake of simplicity, we directly call our SUT TinyPay; a real system would of course use a client library or REST API calls. We also directly include assertions to verify that the SUT responses match our expectations.

def authorize_payment(self, amount_cents: int) -> TransactionResult:
    # call SUT
    authorized = self.sut.authorize(amount_cents)

    # verify SUT response
    assert authorized.amount_cents == amount_cents
    assert authorized.status == TransactionStatus.AUTHORIZED

    # update model state
    self.authorized_transactions.append(
        AuthorizedTransaction(
            transaction_id=authorized.transaction_id,
            amount_cents=amount_cents,
        )
    )
    return authorized

def capture_payment(
    self,
    transaction_id: UUID,
    expected_amount_cents: int,
) -> TransactionResult:
    # call SUT
    captured = self.sut.capture(transaction_id)

    # verify SUT response
    assert captured.transaction_id == transaction_id
    assert captured.amount_cents == expected_amount_cents
    assert captured.status == TransactionStatus.CAPTURED

    # update model state
    self.captured_transaction_ids.add(transaction_id)
    return captured

Now we can explore the design space of our PBT-driven, MBT-style test.

❦ ❦ ❦

1. A Direct Attempt: Generate The Transaction ID

The simplest idea is to just generate a transaction ID directly:

class ConcreteTransactionMachine(PaymentMachine):
    @rule(amount_cents=AMOUNTS)
    def authorize(self, amount_cents):
        self.authorize_payment(amount_cents)

    @rule(generated_transaction_id=st.uuids())
    def capture_with_generated_concrete_transaction_id(self, generated_transaction_id):
        self.sut.capture(generated_transaction_id)
        # ^-- this will (almost) always fail with UnknownTransactionError

Of course, this does not work. Given that UUIDs are essentially unique, the generated transaction ID will almost certainly not be one that the SUT created during an authorize step. When we call sut.capture(), TinyPay tries to look up that transaction ID, does not find it, and raises UnknownTransactionError.

This is the shape we would get from Hypothesis from a failing test:

authorize(100)      -> returns transaction_id=a0d4...
capture(3f3c...)    -> uses unrelated transaction_id

The generated UUID is (almost) certainly not the UUID returned by authorize. Our test mostly exercises “unknown transaction”, not capture.

Tradeoff

The generator is asked to produce a value that only the SUT can create. It has no meaningful way to recreate the SUT’s behavior.

So instead of generating the transaction ID inside the property-based test, we should generate a way to select one of the transaction IDs we have already observed.

2. Dependent Generation: Generate An Index Into Model State

The first selector approach is to generate a valid index into the current model state. We extend PaymentMachine with a helper method to get the list of uncaptured transactions:

class PaymentMachine(RuleBasedStateMachine):
    #... (same as before) ...
    def uncaptured_transactions(self) -> list[AuthorizedTransaction]:
        return [
            transaction
            for transaction in self.authorized_transactions
            if transaction.transaction_id not in self.captured_transaction_ids
        ]

    def has_uncaptured_transactions(self) -> bool:
        return bool(self.uncaptured_transactions())

This returns a list of uncaptured transactions that we can select from. In our property-based test, we can generate an index into that list:

class IndexedTransactionMachine(PaymentMachine):
    """Generate an index, then resolve it against observed model state."""

    @rule(amount_cents=AMOUNTS)
    def authorize(self, amount_cents):
        self.authorize_payment(amount_cents)

    @rule(data=st.data())
    @precondition(lambda self: self.has_uncaptured_transactions())
    def capture_using_generated_index(self, data):
        uncaptured_transactions = self.uncaptured_transactions()
        index = data.draw(
            st.integers(min_value=0, max_value=len(uncaptured_transactions) - 1),
            label="transaction_index",
        )
        transaction = uncaptured_transactions[index]
        self.capture_payment(transaction.transaction_id, transaction.amount_cents)

This works. Hypothesis only runs capture_using_generated_index() when there is at least one uncaptured transaction, and then generates an index that is valid for the current number of uncaptured transactions.

Tradeoff

The generated value is not the SUT input. Actually, we’re passing Hypothesis’ raw data generation strategy object into the rule, and use it to sample a selector (the index) into the model state at runtime.

We have switched from producing fully concrete input traces to an adaptive test driver. We have done so at the price of cluttering the rule body with dependent generation:

index = data.draw(...)

We also need the @precondition to keep the selector valid. Without it, Hypothesis may try to run capture_using_generated_index() before any transaction has been authorized, leaving us with an empty list and no valid index to draw. In that case, len(uncaptured_transactions) - 1 is -1, so we would ask Hypothesis to sample an integer from [0, -1].

Imagine sampling from more than one list, or applying more complex predicates to filter the model state. The rule body (which should remain focused on the high-level behavior being tested) quickly becomes a tangle of dependent generation and filtering logic.

We also had to add the uncaptured_transactions() and has_uncaptured_transactions() helpers to filter for authorized and uncaptured transactions, and to disable the rule when no such transaction exists.

(There are some other options that we will discuss later. We could generate a transaction ID and use assume() to discard invalid ones, but that can lead to many discarded examples and worse shrinking when valid values are sparse. We could also generate an index into authorized_transactions, and have two branches within the rule, one for capture and one for double-capture. That is a bit cleaner, but it still has the same basic shape of dependent generation and branching.)

This is a practical Hypothesis idiom, but it clutters the rule with dependent generation. We might try to make the rule look more like normal PBT by putting the strategy directly in the decorator.

3. Modulo Indexing: Generate An Integer And Map It Into Model State

A riff on our previous attempt: Instead of drawing a state-sized index inside the rule, we can generate any integer up front and map it into the current model state by using modulo arithmetic:

class ModuloIndexTransactionMachine(PaymentMachine):
    """Generate any integer, then modulo it into observed model state."""

    @rule(amount_cents=AMOUNTS)
    def authorize(self, amount_cents):
        self.authorize_payment(amount_cents)

    @rule(generated_index=st.integers())
    @precondition(lambda self: self.has_uncaptured_transactions())
    def capture_using_modulo_index(self, generated_index):
        uncaptured_transactions = self.uncaptured_transactions()
        index = generated_index % len(uncaptured_transactions)
        transaction = uncaptured_transactions[index]
        self.capture_payment(transaction.transaction_id, transaction.amount_cents)

This also works. We still need the same precondition to ensure that modulo arithmetic has a non-empty list to operate on. The strategy is now ordinary, but the generated integer is transformed at runtime:

index = generated_index % len(uncaptured_transactions)

Tradeoff

The generated integer is not the semantic input.

If Hypothesis reports:

generated_index=17

that does not mean “capture transaction 17”. It means:

capture uncaptured_transactions[17 % len(uncaptured_transactions)]

The mapping changes as the model state changes. The same generated index may select a different transaction in a different state.

Shrinking also becomes less meaningful, because shrinking the generated integer does not necessarily shrink the SUT action in a useful semantic sense.

If there are three uncaptured transactions:

generated_index
0, 3, 6, 9      -> index 0
1, 4, 7, 10     -> index 1
2, 5, 8, 11     -> index 2

For example, shrinking generated_index=17 to generated_index=2 tells you nothing useful: both map to index 2 when there are three uncaptured transactions, so the developer sees the same failing action but gains no insight into why a smaller or simpler input triggers the bug.

Modulo indexing avoids the data.draw() clutter, but it hides the same adaptivity behind arithmetic. The next step is to make the data dependency explicit.

4. Explicit Dependency: Use A Bundle To Pass Values Between Rules

Hypothesis’ cleanest built-in solution is a Bundle.

class BundleTransactionMachine(PaymentMachine):
    transactions = Bundle("transactions")

    @rule(target=transactions, amount_cents=AMOUNTS)
    def authorize(self, amount_cents):
        return self.authorize_payment(amount_cents)

    @rule(transaction=transactions)
    def capture_using_bundled_transaction(self, transaction):
        self.capture_payment(transaction.transaction_id, transaction.amount_cents)

By specifying target=transactions, we tell Hypothesis to automatically put the result of authorize() into the transactions bundle. By specifying transaction=transactions, we tell Hypothesis to automatically take a value from the transactions bundle and pass it into capture_using_bundled_transaction().

If the test fails, Hypothesis reports a symbolic trace:

transactions_0 = state.authorize(amount_cents=1)
state.capture_using_bundled_transaction(transaction=transactions_0)

Upon generation, Hypothesis no longer passes a concrete value into capture_using_bundled_transaction(). Instead, it passes a reference to the result of authorize(). This is clean. It shows the dependency directly.

But it again comes with a compromise.

Where Bundles Hit Their Limits

Suppose we have to extend authorization a bit. It is no longer deterministic, but can return one of three shapes:

authorize(amount, card)
  -> Authorized(transaction_id)
  -> Declined(reason)
  -> Requires3DS(challenge_id)

Now the next valid action depends on which shape the SUT returned.

The Impossible Option: Bundle-Aware Preconditions

What we might want is:

@rule(transaction=transactions)
@precondition(lambda self, transaction: transaction.status == AUTHORIZED)
def capture(self, transaction):
    ...

But Hypothesis preconditions do not work that way.

A @precondition runs before Hypothesis draws the rule arguments. It sees the state machine, but it does not see the generated values.

Is there some transaction that is valid for capture?

It cannot answer:

Is this particular bundled transaction that will be passed into the rule valid for capture?

That is the core limitation.

There are several possible alternatives, but they all come with tradeoffs.

Option A: Generate Indexes Or Selectors

Keep detailed model state yourself, filter it, and generate a selector:

@rule(data=st.data())
@precondition(lambda self: bool(self.authorized_transactions()))
def capture(self, data):
    authorized = self.authorized_transactions()
    index = data.draw(st.integers(0, len(authorized) - 1))
    transaction = authorized[index]
    self.capture_payment(transaction.transaction_id, transaction.amount_cents)

Again, the precondition is part of the selector machinery. It prevents the rule from running when the filtered model state is empty.

This is the most flexible shape. You can express arbitrary model predicates.

The cost is as discussed above: the rule body becomes cluttered with helper code that performs dependent generation, plus preconditions that keep those generated selectors valid. The generated value is a selector into observed model state, not the SUT input.

Verdict: Flexible, but noisy. As we will see, this is the most expressive option because the dependent generation is explicit. The cost is that test rules start to look like generators rather than descriptions of behavior.

Option B: Make The SUT Deterministic Or Controllable

For a payment system, we can inject a fake gateway:

approved card  -> Authorized(transaction_id)
declined card  -> Declined(reason)
3DS card       -> Requires3DS(challenge_id)

Then separate bundles work again because the test input controls the outcome.

This is often the right engineering move. But it changes what we are testing. We are no longer exploring external nondeterminism; we are testing a SUT under a controlled environment.

Verdict: Useful when the environment is legitimately controllable. This may be the right engineering call if your model-based test targets the unit and integration levels. But it is not always possible (the clean mapping I’ve given above may not exist in your domain), and it is not the same as testing a real, complex system with external nondeterminism.

Option C: Separate Bundles By State

Use different bundles:

authorized_transactions = Bundle("authorized_transactions")
declined_attempts = Bundle("declined_attempts")
challenges = Bundle("challenges")

Then write separate rules:

@rule(target=authorized_transactions, amount=APPROVABLE_AMOUNTS)
def authorize_success(self, amount):
    ...

@rule(target=declined_attempts, amount=DECLINED_AMOUNTS)
def authorize_decline(self, amount):
    ...

The limitation is that bundle targets are static. A single rule cannot decide at runtime which bundle its result should go into.

This works okay with a controllable or fully mocked gateway, but even then, we’re importing knowledge about the (mocked behavior of the) SUT into our test design. We now have to predict a priori which inputs will lead to which outcomes, and design our rules accordingly.

Not a stable method for testing a real, complex system.

Verdict: Fragile. It requires us to predict SUT behavior ahead of time, which becomes brittle as the system evolves.

Option D: One Broad Bundle, Branch In The Rule

Put all results in one bundle:

transactions = Bundle("transactions")

Then branch later:

@rule(transaction=transactions)
def capture(self, transaction):
    if transaction.status == AUTHORIZED:
        ...
    elif transaction.status == DECLINED:
        ...
    elif transaction.status == REQUIRES_3DS:
        ...

This is flexible, but the rule no longer describes a single behavior. It has become a tangle of branching logic that handles multiple behaviors. Imagine if the outcome of capture() not only depends on the transaction status, but also other aspects of the model state. The rule body would quickly become a dispatch table, rather than describing behavior at a high level.

We also cannot dynamically retarget the value into more specific downstream bundles inside the branches, since the target parameter is static:

@rule(transaction=transactions, target=captured_transactions)
def capture(self, transaction):
    # whatever we produce, even in branches, will go into captured_transactions
    ...

Verdict: Hard to read and hard to route. The rule becomes a dispatch table instead of describing one behavior, and it still cannot dynamically retarget values into more specific downstream bundles.

Option E: Use `consumes()`

consumes() is useful for one-way transitions:

authorized -> captured

It removes a value from the bundle once it has been consumed:

@rule(transaction=consumes(transactions))
def capture(self, transaction):
    ...

This handles lifetime, but cannot handle classification: It allows us to remove a transaction from the “authorized” bundle once we capture it, but it does not allow us to classify transactions into “authorized”, “declined”, and “3DS” downstream bundles.

Verdict: Limited. As we’ve seen, consumes() handles lifetime, not classification, so it only gets us partway.

Option F: Use `assume()` Or Early Returns

Another option is to draw from a broad bundle and discard invalid choices:

@rule(payment=payments)
def capture(self, payment):
    assume(payment.status == AUTHORIZED)
    ...

or:

if payment.status != AUTHORIZED:
    return

This is a simple and flexible option, at the cost of test efficiency. If valid values are sparse, assume() will produce many discarded examples. Early returns are semantically no-op steps, so a generated trace may look extremely busy while testing very little.

Verdict: Unreliable. If valid values are sparse, assume() will produce many discarded examples, giving false confidence.

❦ ❦ ❦

The Point

The practical conclusion is not that Hypothesis is bad. On the contrary, the bundle solution actually goes very far in providing MBT-style testing within a PBT framework.

The point is that PBT and this observation-dependent style of MBT have different goals, and that there is a tension between them. The generated artifact in PBT is meant to be replayable and shrinkable, which often means it cannot directly represent the next SUT input when that input depends on values observed during execution.

The concrete test tried to generate:

capture(transaction_id)

but failed. The working tests generate one of these instead:

capture(model_state[index])
capture(model_state[generated_integer % len(model_state)])
capture(result_of_previous_rule)

Those are not fully concrete SUT inputs. They are selectors, references, or expressions over values observed during execution.

That is the tension between stateful PBT and observation-dependent MBT:

Shrinkable PBT wants a replayable generated artifact. But observation-dependent MBT often needs the next input to depend on values returned by the SUT. PBT tools can support this style of testing on the surface, but inherit fundamental limitations due to their core design.

Get in Touch

Trying to decide whether a stateful PBT suite is enough, or whether your system needs a separate model? Get in touch. I like comparing notes on testing complex systems, external nondeterminism, and where formal models start to pay for themselves. You can also subscribe below for future posts.

Cannot MBT with PBT

The Payment System

A Stateful Property-Based Test

1. A Direct Attempt: Generate The Transaction ID

Tradeoff

2. Dependent Generation: Generate An Index Into Model State

Tradeoff

3. Modulo Indexing: Generate An Integer And Map It Into Model State

Tradeoff

4. Explicit Dependency: Use A Bundle To Pass Values Between Rules

Where Bundles Hit Their Limits

The Impossible Option: Bundle-Aware Preconditions

Option A: Generate Indexes Or Selectors

Option B: Make The SUT Deterministic Or Controllable

Option C: Separate Bundles By State

Option D: One Broad Bundle, Branch In The Rule

Option E: Use consumes()

Option F: Use assume() Or Early Returns

The Point

Get in Touch

Option E: Use `consumes()`

Option F: Use `assume()` Or Early Returns