Fuzzing and property testing both involve generating random inputs, and then checking if a program misbehaves on those inputs. This description should probably leave you raising your eyebrow slightly: if you start being vague enough, lots of entirely different things sound similar. But there are some real similarities between these two techniques.

Let’s start with the hard distinction between the two techniques:

  • Fuzzing is generally a black-box method, meaning we don’t try to inform it too much about how to go about generating those inputs. Fuzzers usually use instrumentation to inspect how an input makes the program behave. This gives the fuzzer guidance on guessing new inputs that might lead to a significant change in behavior. (You could argue this isn’t “black box” anymore, but I’m speaking from the programmer’s perspective.) The properties being checked—at first glance—are also pretty “dumb”: generally just “does the program crash? Y/N?” Fuzzers also generally need to run for a long time (hours, days, weeks, or even months!).

  • Randomized property testing, by contrast, requires deep familiarity with the system under test. The programmer needs both to specify the properties to test and to describe a rough “shape” of inputs that are “interesting,” as in the sketch below. The advantage of this extra work is that only a few examples really need to be generated from that space, and as a result property tests run quickly—just like unit tests.
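
Here’s a minimal sketch of what that looks like in Python with the Hypothesis library. (The names are made up for illustration; my_sort stands in for whatever code is actually under test.) The strategy describes the input space, and the assertions describe the properties.

    from hypothesis import given, strategies as st

    def my_sort(xs):
        # Stand-in for the real code under test.
        return sorted(xs)

    # The rough "shape" of interesting inputs: arbitrary lists of integers.
    @given(st.lists(st.integers()))
    def test_sort_properties(xs):
        out = my_sort(xs)
        assert len(out) == len(xs)                        # nothing gained or lost
        assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
        assert my_sort(out) == out                        # sorting is idempotent

Hypothesis generates a modest number of examples from that space (100 per test by default), so the whole thing runs in roughly the time a unit test does.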

So these are two very different techniques.

Similarities start to surface

One of the observations I made a while back about property testing (especially of imperative code) was that assertions in the code under test are extremely synergistic with property testing. This turns out to be a fully general benefit, for both fuzzing and property testing.

There’s a good reason I said that fuzzing only seems to check “dumb” properties “at first glance.” Upon revisiting that idea, it falls apart:

  • Assertions can check invariants throughout the operation of the program. Assertion failures are treated as crashes, so fuzzing is in effect checking all of these invariants. This kind of error checking can be almost arbitrarily powerful: a greater quantity and quality of assertions creates new opportunities for fuzzers and property tests alike to find bugs. (There’s a sketch of this after the list.)

  • Dynamic analysis tools can automatically instrument a program with additional invariants to check. Clang and GCC both offer sanitizers for memory errors, undefined behavior, and even detecting potential race conditions or other concurrency errors. More of these analyzers are likely in the works: the effectiveness of these techniques has only become widely understood relatively recently. (Or perhaps: the availability of these tools is only recent. Whichever.)
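
To make that first point concrete, here’s a sketch: a made-up Account class whose internal assertions encode its invariants. Any fuzzer or property test that drives this code gets all of those checks for free, because the moment a generated input violates one, the run fails.

    class Account:
        def __init__(self, balance=0):
            self.balance = balance
            self._check()

        def _check(self):
            # Invariant: the balance never goes negative. A fuzzer or property
            # test that trips this sees an immediate failure, even though no
            # test explicitly asserted anything about balances.
            assert self.balance >= 0, "balance went negative"

        def deposit(self, amount):
            assert amount > 0, "deposits must be positive"
            self.balance += amount
            self._check()

        def withdraw(self, amount):
            assert amount > 0, "withdrawals must be positive"
            if amount > self.balance:
                raise ValueError("insufficient funds")
            self.balance -= amount
            self._check()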

A while back, I recall reading a story about the early development of these dynamic analyzers. The authors ran the test suites of various open source projects under their analyzers, and then reported bugs for the problems they observed.

Some of the projects objected. Apparently, a lot of static analysis tools were being developed around the same time, and some developers were (presumably) sick of being used as free labor, trying to figure out whether a static analyzer’s report was a real bug or a false positive. (False positives are the worst thing for any kind of analysis tool, because they hide any real signal amidst the noise.)

It took a while to explain that, no, a dynamic analyzer isn’t subject to false positives. This was a real, observed error that happened while running the test suite, not a hypothetical situation.

I had lost the link to this story, but a friendly reader pointed out that it came from John Regehr’s blog: “Static Analysis Fatigue”. Thanks, Andy!

In the face of these tools for invariant checking, the extra properties checked by property testing can start to look relatively minor in comparison. We can even eliminate that difference: instead of directly fuzzing the program under test, we can construct a special program to fuzz. That program can take the generated inputs, call functions on those inputs, and… assert properties about the results! (See the harness sketch below.)
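
Here’s a rough sketch of such a harness, written against Google’s Atheris fuzzer (the choice of Atheris, and my_sort, are just assumptions for illustration; a libFuzzer-style harness in C or Rust has the same structure). The fuzzer hands us raw bytes; we turn them into an input, call the code, and assert properties about the result.

    import sys
    import atheris

    def my_sort(xs):
        # Stand-in for the real code under test.
        return sorted(xs)

    def TestOneInput(data: bytes):
        # Decode the fuzzer's raw bytes into an input for the code under test.
        xs = list(data)
        out = my_sort(xs)
        # Property checks: any assertion failure is reported as a crash.
        assert len(out) == len(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))

    if __name__ == "__main__":
        atheris.Setup(sys.argv, TestOneInput)
        atheris.Fuzz()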

So in the end, I think the actual difference between these techniques is just how they’re used. You’re doing property testing when you’re also spending time describing the space of inputs that should be generated, so that the tests run quickly.

When to fuzz, when to property test?

To a first approximation, always property test. The technique is very powerful, and criminally under-used. If you’ve got an example-based test suite, you should consider adopting property testing.

As part of the book, besides mostly trying to outline how to think about design, I also want to teach a small number of concepts that are less well known than they should be. Concurrency was one, which is why I had several articles about async/await. Property testing is another.

If you have thoughts on what I could write that would help you adopt property testing, send me a suggestion on Twitter or Patreon! So far, I have the post where I property test some C code via FFI from Python.

But the two techniques are best applicable in different circumstances:

  • Because property testing involves the programmer describing “interesting” inputs to test, bizarre (and security-critical) inputs may not be considered. Fuzzing is therefore the tool of choice for ensuring software is robust to attacks, as it bakes in no such assumptions.

  • Because fuzzing requires so much time to run, it cannot act as a replacement for example-based unit testing, while property testing can. When it comes to testing almost any property outside of security, property testing will generally be superior.

Co-design of properties, invariants, and code

There’s another important distinction: a benefit that almost always comes from property testing, but rarely from fuzzing. And that’s the impact on the design of the code under test. I’ve written before about how thinking about the properties of our abstractions can give us better designs. I’d like to describe two very practical (and sometimes quite magical) techniques you can employ with property testing to improve code and test quality.

The techniques are all about co-design. We don’t just write the code, then test it. We alternate between writing, testing, and finding new invariants that can be enforced. Each time we do one of these things, we learn things that affect how we then do the others.

Technique 1: Querying your code

Suppose we start with a reasonable suite of property tests for the code we’re actively working on. This acts like any other unit test suite might, except that we also get a new superpower. In addition to getting immediate feedback on whether we’ve broken our code, we can also deliberately attempt to break the code, to help ourselves understand it better.

If I think of a possible new assertion of an invariant in my code, but I’m dubious about whether it always holds, I can simply add the assertion and run the test suite. Unlike example-based testing, where we might not have thought of the right inputs to run afoul of the assertion, property testing is a lot more likely to generate an input that hits that case.

That means our property test suite will discover and reply with an example of a situation where that assertion would fail. It’s like we’ve asked a magical oracle to tell us when our guess wouldn’t be true. Or perhaps it passes, and we have some extra confidence that the assertion is actually true. (Obviously, we can’t be completely sure. That’s the trade-off of testing versus more robust formal methods.) Regardless, this is profoundly helpful in trying to discover how some code actually works, which can guide us when it comes to improving its design.
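
Here’s a sketch of what that looks like, using an abbreviated copy of the made-up Account class from earlier. Say I’m wondering whether an account can ever be drained to exactly zero. Rather than reasoning it out, I add the assertion and let the property suite answer.

    from hypothesis import given, strategies as st

    class Account:
        # Same as the earlier sketch, except withdraw() grows one
        # speculative assertion.
        def __init__(self, balance=0):
            self.balance = balance

        def withdraw(self, amount):
            assert amount > 0, "withdrawals must be positive"
            if amount > self.balance:
                raise ValueError("insufficient funds")
            self.balance -= amount
            # Speculative assertion: "surely nothing ever drains the account
            # completely?" If that's wrong, the suite replies with a concrete
            # counterexample (say, balance=5, withdraw(5)) rather than leaving
            # me to puzzle it out by hand.
            assert self.balance > 0

    @given(st.integers(min_value=0, max_value=1000),
           st.integers(min_value=1, max_value=1000))
    def test_withdraw(start, amount):
        acct = Account(start)
        try:
            acct.withdraw(amount)
        except ValueError:
            pass  # rejecting an overdraft is fine; the assertions still ran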

Honestly, you should experience this. It’s amazing to look at something, go “hmm, can this happen?” and then have the computer just tell you, “yeah, in this situation.” Magic.

Technique 2: Querying your tests

And we can sort of apply this technique in the reverse direction, too. One problem we frequently face is trying to assess how effective our testing actually is. There are some cheap and crude tools for this, like coverage metrics, and some expensive and slightly less crude tools, like mutation testing.

But property testing gives us a cheap and effective (though local) approach to doing this. We can add an assertion we know is violated in some situations, and then check whether our property tests generate such a situation and find the assertion failure. If they don’t, then we know we’ve messed up somewhere in defining the “interesting space” of possible inputs, and we can work on improving the test case generation until it does spot the assertion failure. Then we can remove the bogus assertion and commit our improvements to the test suite.
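
Continuing the same made-up Account example, a sketch of this might look like the following: plant an assertion we know is wrong for some perfectly legal inputs, and watch whether the generated examples actually catch it.

    class Account:
        # Same as the earlier sketches, except withdraw() grows a check we
        # know is wrong on purpose.
        def __init__(self, balance=0):
            self.balance = balance

        def withdraw(self, amount):
            assert amount > 0, "withdrawals must be positive"
            if amount > self.balance:
                raise ValueError("insufficient funds")
            # Deliberately bogus assertion: withdrawing the entire balance is
            # perfectly legal, so a healthy generator should trip this quickly.
            # If the suite stays green, our strategies never produce that case,
            # and the input description needs work before we trust the tests.
            assert amount < self.balance
            self.balance -= amount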

This technique is employable with example-based testing, too, except that you’re a lot more likely to just discover that no test actually exercises that case. You’re also more likely to have a good idea of whether such a test already exists. Sometimes the space of generated inputs with a property test can be a bit opaque, so this helps us understand whether it’s doing a good job.

The technique more usually applied with example-based testing is “first write a failing test.” Both are methods for ensuring you’re testing what you think you are.

End Notes

  • David R. MacIver disagrees with me slightly on what the distinction between fuzzing and property testing is. I think he focuses too much on what the techniques are; it’s a lot better to distinguish them by how they’re used.
  • Others use “fuzzing” to mean any random generation, and distinguish the two approaches as “coverage-guided” vs “generative.” I hope these names are partly just an accident of the terms being coined before these approaches caught on in practice, because I don’t like them. (Generative how? And we can do more than just coverage guidance…)