What are executable specifications?

Executable specifications, not just Markdown docs.

The short version

A Markdown spec says what should happen. An executable spec can try things and check whether they still work.

That matters when bugs depend on ordering, retries, stale reads, timeouts, garbage collection, leader changes, or partial failure.

For system verification and software verification, the important part is practical: the spec is runnable code, produces tests, and goes into your CI.

Why this matters

Complex systems fail in behavior, not just inputs. Executable specs make that behavior runnable.

01

Rare schedules cause real outages

Production failures often happen on paths nobody wrote tests for: message interleavings, retries, timeouts, stale reads, crashes, and partial failures.
02

AI-generated code still needs an oracle

LLMs can produce plausible infrastructure and protocol code quickly. They also miss race conditions, atomicity assumptions, and system invariants. An executable spec gives humans and agents a reference to test against.

What is in an executable spec?

An executable spec is a small code-level reference for the behavior you care about. It does not mirror every line of the implementation. It captures the parts that decide whether the system behaved as expected.

01

State

The information that matters for the behavior under test: plans, versions, balances, leaders, locks, queues, records, or pending work.
02

Actions

The operations that can change that state: requests, retries, failovers, reconciliations, commits, rollbacks, timeouts, or background jobs.
03

Failures and observations

The messy parts that ordinary examples tend to skip: stale reads, delayed messages, duplicate requests, crashes, partial failure, and recovery.
04

System invariants

The rules that must keep holding after every generated scenario. These become executable checks, not comments in a design doc.

Example: a DNS balancer control plane

This example models a DNS load balancer control plane with garbage collection of stale plans. The full model has several actions; this snippet shows just one of them.

@action(inline=False)
def deployer_gc(c: Context[DnsBalancerState], deployer: Annotated[Expr, str]):
    """Deployer cleans up old plans."""
    s = c.state
    next_plan = s.deployer_next_plan[deployer]
    old_indices = s.planner_updates.keys.filter(
        lambda i: next_plan > i + s.MAX_PLAN_AGE
    )
    old_plan_entries = Set(
        s.planner_updates[i].creates for i in old_indices
    ).flattened

    # avoid unrealistic no-op cleanups
    c.assume(~(old_plan_entries & s.zone_records).is_empty)

    # implementation may do this in a loop;
    # the spec models the abstract effect
    s.zone_records -= old_plan_entries


@invariant
def no_inconsistent_root(s: DnsBalancerState):
    """Root must always point to an existing plan."""
    cnames = SetIf(r.kind == CNAME for r in s.zone_records)
    plans = Set(r.value[0] for r in SetIf(r2.name == ZONE for r2 in cnames))
    return ~((plans & Set(r.name for r in cnames)).is_empty) & (plans.size == 1)

From design intent to executable behavior

In executable specifications, we focus on the high-level effects: what changes in the system, not every API call or loop. The invariant at the bottom is now an executable check: after any sequence of actions, the root still points to a plan that exists. A Markdown sentence can describe that rule. An executable check can catch the regression.

One executable model, four uses

01

Exploration

Generate schedules that people would never write by hand, before the system reaches production.
02

Test generation

Generate scenarios and expected outcomes, then replay them against APIs, CLIs, simulators, or test harnesses.
03

Review artifact

Make atomicity, freshness, retries, rollback, and garbage-collection assumptions visible to the team.
04

Living documentation

Keep a runnable reference for behavior the team depends on, instead of a stale diagram no one trusts.

When executable specs are worth it

1
Your system manages complex state. It implements a control loop, scheduler, protocol, database, controller, or reconciliation process.
2
Expected behavior depends on timing or ordering. Retries, stale observations, partial failure, and delayed messages can change the outcome.
3
Incidents are hard to reproduce. The failing path is rare, long, or depends on a schedule your tests do not cover.
4
Important invariants are implicit. They live in docs, code comments, tribal knowledge, or review checklists instead of a runnable reference.

Book a 30-minute fit call

Not sure where to start?
Send a short description of the system and what worries you. I’ll suggest a workshop, pilot, risk review, or another testing approach.

Book a 30-minute fit call

Best fit: critical stateful systems where failures are expensive.
Remote or on-site. No prior formal-methods experience required.
Not a fit if you only need more unit tests, broad code coverage, or generic security scanning.