Features All Test Frameworks Should Have

EDIT 2015-11-02: I added a couple more nice-to-haves which I think are pretty important. See the end of the list.

I’ve used half a dozen unit testing frameworks, and written nearly that many more. Here is the set of features that I consider a requirement in any test framework.

You’d think some of the following requirements are so obvious as to not need mentioning… I, too, have been shocked before. :)


  • Minimal boilerplate. Writing tests should be frictionless, so setting up a new test file should be little more than a single import and maybe a top-level function call.

  • Similarly, each test name should only have to be uttered once. Some frameworks have you, after writing all of your tests, enumerate them in a test list. I’m even aware of a Haskell project that repeated each test name THREE times: once for the type signature, once for the test itself, and once for the test list.

  • Assertion failures should provide a stack trace, including filename and line number. There are test frameworks that make you name each assertion rather than giving a file and line number. Never make a human do what a computer can. :)

  • Assertion failures should also include any relevant values. For example, when asserting that x and y are equal, if they are not, the values of x and y should be shown.

  • Test file names and cases should be autodiscovered. It’s too easy to accidentally not run a bunch of tests because you forgot to register them.

  • The framework should support test fixtures — that is, common setup and teardown code per test. In addition, and this is commonly missed, fixtures should be nestable: test setup code should run from the most base fixture to the most derived fixture, then all the teardown code should run in the reverse order. Nested fixtures allow reusing common environments across many tests. The BDD frameworks tend to support that because nested contexts are one of their selling points.

  • It should be possible to define what I call “superfixtures”: code that runs before and after each test, whether or not the test specifies a fixture. This is useful for making general assertions across the code base (or regions thereof), such as “no test leaks memory”.

  • Support for abstract test cases. Abstract tests let you define a set of N tests that each operate on an interface and a set of M fixtures, each providing an implementation of that interface. This runs M*N tests total. This makes it easy to test that a bunch of implementations all expose the same behavior.

  • A rich set of comparison operators. For example, equality, identity, membership, and string matching. This allows tests to provide more context upon failure, but also makes it easy for programmers to write appropriate and concise tests in the first place. (Bonus points: there are frameworks like py.test that have a single assertion form, but examine the assertion expression to automatically print any relevant context upon failure.)

  • Printing to stdout should be unbuffered and interleaved properly with the test reporter’s output. I only include this because Tasty utterly fails this test. :)

is same as not caching:                OK (1.45s)He[lmo! [ 3T2h;i2s2 mis
 a [vmery[ 3i7n;n2o2cmen t   p u+t+S+t rOLKn,.  p aIs sheodp e1 0i0t  tdeosetssn.'
t a[fmfe c tt etshte  etxetsetr noault pmuetm.o
ize happy path:      OK


  • Customizable test reporting, for two reasons. The first, colored test output, is a nice-to-have, but it’s a huge one, as it probably shaves a few seconds off of each visual scan of the test results. The second is integrating test output with continuous integration software, which is a big win too.

  • Parallelism. The built-in ability to run tests in parallel is a nice way to reduce testing turnaround time. Either opt-in or opt-out parallelism is okay. That said, it’s easy to work around the lack of parallelism and still make efficient use of test hardware: divide the tests into even slices and run the slices on multiple machines or VMs.

  • Property-based testing, a la QuickCheck. While QuickCheck is amazing, and property-based testing will change your life, the bread and butter of your test suite will be unit tests.

  • Direct, convenient support for disabling tests. Without this capability, people just comment out the test, but commented-out tests don’t show up in the test metrics, so they tend to get forgotten. Jasmine handles this very well: simply prefix the disabled fixture or test with “x”. As in, if a test is spelled it('should return 2', function() { ... }), disabling it is as easy as changing it to xit.
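
To make the nested-fixture ordering from the list above concrete, here is a minimal sketch using Python's built-in unittest. It approximates nesting with inheritance plus addCleanup, which is only one of several ways to do it. The fixture names, the test name, and the run_case helper are all made up for illustration.

```python
import io
import unittest

events = []  # records the order in which fixture code runs


class DatabaseFixture(unittest.TestCase):
    """Hypothetical base fixture."""

    def setUp(self):
        super().setUp()
        events.append("open database")
        # Cleanups run in reverse order of registration.
        self.addCleanup(events.append, "close database")


class LoggedInFixture(DatabaseFixture):
    """Hypothetical derived fixture, nested inside DatabaseFixture."""

    def setUp(self):
        super().setUp()  # base setup runs first
        events.append("log in")
        self.addCleanup(events.append, "log out")


class ProfileTest(LoggedInFixture):
    def test_profile_is_visible(self):
        events.append("run test")


def run_case(case_class):
    """Run one TestCase class quietly and return its TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case_class)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

After run_case(ProfileTest), events reads: open database, log in, run test, log out, close database. Setup runs from the most base fixture to the most derived, and teardown runs in the reverse order.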

I could build the feature matrix across the test frameworks I’ve used, but only a handful are complete out of the box. (If anyone would like me to take a crack at filling out a feature matrix, let me know.)

The Python unit testing ecosystem is pretty great. Even the built-in unittest package has almost every feature. (I believe I had to manually extend TestCase to provide support for superfixtures.) The JavaScript world, until recently, was pretty anemic. QUnit was a wreck, last time I used it — there is no excuse for not including stack traces in test failures. Jasmine, on the other hand, supports almost everything I care about. (At IMVU, we ended up building imvujstest, part of imvujs.)

In the C++ world, UnitTest++ comes very close to being great. The only capabilities I’ve had to add were superfixtures, nested fixtures, and abstract test cases. In hindsight, I wish I’d open sourced that bit of C++ macros and templates while I could have. :)
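
Of those three additions, abstract test cases are probably the least familiar. Here is a rough illustration of the idea in Python rather than C++, and not the macros and templates mentioned above. The "interface" is just append and pop, the two implementations are the built-in list and collections.deque, and all class names are invented for the example.

```python
import io
import unittest
from collections import deque


class StackBehavior:
    """N = 2 tests written only against the append/pop interface."""

    def make_stack(self):
        raise NotImplementedError  # supplied by each concrete fixture

    def test_pop_returns_most_recent_item(self):
        stack = self.make_stack()
        stack.append(1)
        stack.append(2)
        self.assertEqual(2, stack.pop())

    def test_pop_on_empty_stack_raises(self):
        with self.assertRaises(IndexError):
            self.make_stack().pop()


# M = 2 fixtures, each providing one implementation of the interface.
class ListStackTest(StackBehavior, unittest.TestCase):
    def make_stack(self):
        return []


class DequeStackTest(StackBehavior, unittest.TestCase):
    def make_stack(self):
        return deque()


def run_cases(*case_classes):
    """Run the given TestCase classes quietly and return the TestResult."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite(
        loader.loadTestsFromTestCase(c) for c in case_classes
    )
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Because StackBehavior is not itself a TestCase, it is never collected on its own. Only the M*N = 4 concrete combinations run.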

go test by itself is way too simplistic to be used for a sophisticated test suite. Fortunately, the gocheck package is pretty good. It’s possible to make abstract tests work in gocheck, at the cost of some boilerplate. However, today, gocheck doesn’t support nested fixtures. I suspect they’d be amenable to a patch if anyone wants to take that on.

The Haskell unit testing ecosystem is less than ideal. Getting a proper framework that satisfies the above requirements takes considerably more effort than the other examples I’ve given. Everything I’ve described is possible with HUnit and various Template Haskell packages, but it takes quite a lot of package dependencies and language extensions. I have dreams of building my ideal Haskell unit test framework… perhaps the next time I work on a large Haskell project.

If you’re building a test framework, the most important thing to focus on is a rapid iteration flow: write a test, watch it fail, modify the code, watch the test pass. It should be easy for anyone to write a test, and easy for anyone to run them and interpret their output. The faster you can iterate, the more your mind stays focused on the actual problems at hand.

EDIT: More Nice-to-Haves

  • Copy and paste test names back into the runner. It’s pretty common to want to run a single test again. The easiest way to support this is to allow the exact test name to be passed as a command line argument to the runner. Test frameworks that automatically strip underscores from test names or that output fixture names in some funky format automatically fail this. BDD frameworks fail this too because of their weird English-ish test name structure.
  • Test times. Tests that take longer than, say, one millisecond should have their running times output with the test result. Test times always creep up over time so it’s important to keep this visible.
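
Both nice-to-haves can be sketched with Python's built-in unittest. The TimedResult class and run_timed helper below are hypothetical, and this is only one way to do it. The sketch prints each slow test's test.id(), which for tests in an importable module is the dotted name that `python -m unittest package.module.Class.test_name` accepts, so it can be pasted straight back into the runner.

```python
import io
import time
import unittest


class TimedResult(unittest.TextTestResult):
    """Reports the id and duration of any test slower than SLOW_SECONDS."""

    SLOW_SECONDS = 0.001  # the one-millisecond threshold suggested above

    def startTest(self, test):
        super().startTest(test)
        self._started_at = time.perf_counter()

    def stopTest(self, test):
        elapsed = time.perf_counter() - self._started_at
        if elapsed > self.SLOW_SECONDS:
            # test.id() is the full dotted name: module.Class.method
            self.stream.write("%s took %.1f ms\n" % (test.id(), elapsed * 1000.0))
        super().stopTest(test)


def run_timed(case_class, stream):
    """Run one TestCase class with TimedResult, writing the report to stream."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case_class)
    runner = unittest.TextTestRunner(stream=stream, resultclass=TimedResult)
    return runner.run(suite)


class SleepyTest(unittest.TestCase):
    def test_sleeps_for_ten_milliseconds(self):
        time.sleep(0.01)
```

Calling run_timed(SleepyTest, sys.stderr) prints a line naming SleepyTest.test_sleeps_for_ten_milliseconds along with its measured time, while tests under the threshold stay quiet.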

Don’t Write Test-Only Methods

When writing unit tests for new code or existing code, it’s often tempting to call test-only methods to get at private implementation details.

Here I will argue that test-only methods are a mistake and that it’s critical to distinguish between testing code and testing the interface’s behavior.

As an example, let’s look at a simple Python thread pool implementation.

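import queue
import threading

class ThreadPool:
    def __init__(self):
        self._queue = queue.Queue()
        self._threads = []
        self._activeThreads = []

    def run(self, fn):
        self._queue.put(fn)
        # Start a new worker only if every existing thread is busy.
        if len(self._activeThreads) == len(self._threads):
            t = threading.Thread(target=self.__thread, daemon=True)
            t.start()
            self._threads.append(t)

    def __thread(self):
        while True:
            q = self._queue.get()
            self._activeThreads.append(None)
            try:
                q()
            finally:
                self._activeThreads.pop()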

How would I unit test such a thing? Maybe something like:

def test_queuing_a_job_starts_a_thread():
    t = ThreadPool()
    t.run(lambda: None)
    assertEqual(1, len(t._threads))

That test would pass! But it’s also a bad test for several reasons:

  • It doesn’t assert that the job is run, meaning the test could pass if the implementation is broken.
  • It assumes implementation details: that the number of threads increases to one the first time a job is added.
  • Finally, the test has not protected itself from valid refactorings. If someone renamed or eliminated the private _threads variable, the test would fail, even if the thread pool still behaved as advertised.

And there’s the important point: behaved as advertised. Tests should verify that the object does what it says it does. That is, an object does work through its public interface, either by producing a side effect or returning a value.

Let’s write a better test for the same functionality.

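def test_queueing_a_job_starts_a_thread():
    begin = threading.Semaphore(0)
    end = threading.Semaphore(0)
    calls = []
    def job():
        # Blocks until the main thread releases `begin`; a synchronous run() would hang here.
        begin.acquire()
        calls.append(None)
        end.release()
    t = ThreadPool()
    t.run(job)
    begin.release()
    end.acquire()
    assertEqual(1, len(calls))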

This test no longer needs to reference any private fields or methods: it simply asserts that the given job is run and that said job does not block the main thread.

An aside: a simple implementation of run could be def run(self, fn): fn() but this test would hang (and thus fail) if fn were run on the same thread.
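
A hanging test is an awkward way to fail, so here is a variation that uses timeouts to turn the hang into a plain False. It is a sketch under assumptions: InlinePool and BackgroundPool are hypothetical stand-ins for a broken and a working pool, and the default timeout is arbitrary.

```python
import threading


class InlinePool:
    """Hypothetical broken pool: runs the job on the caller's thread."""

    def run(self, fn):
        fn()


class BackgroundPool:
    """Hypothetical working pool: runs each job on its own thread."""

    def run(self, fn):
        threading.Thread(target=fn, daemon=True).start()


def job_runs_off_calling_thread(pool, timeout=0.5):
    """Return True only if the job ran without blocking the caller."""
    begin = threading.Semaphore(0)
    end = threading.Semaphore(0)
    calls = []

    def job():
        # Only the caller can release `begin`, and only after run() returns.
        if begin.acquire(timeout=timeout):
            calls.append(None)
        end.release()

    pool.run(job)
    begin.release()
    finished = end.acquire(timeout=timeout)
    return bool(finished and len(calls) == 1)
```

With InlinePool the job waits on begin until the timeout expires and records nothing, so the function returns False. With BackgroundPool it returns True. Either way, only the public run method is touched.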

Writing tests against a public interface also validates that interface. It requires you to understand the object’s interface, and the tests double as examples of how to use your object elsewhere in the production system.

It also means the implementation of the object’s invariants is protected: refactoring internals will never break tests unless the object itself no longer meets its public interface.

What if the code under test is too hard to reach through the public interface?

And here we come to a common argument in support of testing internals: “What if the code under test is too hard to reach through the public interface?” Well, maybe your object is too complicated! Split the object into multiple objects with simple public interfaces and compose them.

Otherwise objects will end up with combinatoric invariants. Imagine the implementation of list.append. It probably has a typical fast path plus the rare occasion where it must reallocate the list’s memory.

Note that the ThreadPool class doesn’t implement its own list allocation logic (nor a thread-safe queue): it reuses existing objects with clear and sensible interfaces.

Thus the threadpool tests can simply ignore the implementation of list.append and assume it works. list.append is tested elsewhere.

Refactoring is Hard Enough

Refactoring is hard enough. It’s critical that you build your systems to allow refactoring of internals. Any bit of friction, like incorrectly failing tests, may get in the way of engineers improving the code.

But what about caching and performance?

Caching and performance are an interesting case. Imagine a perspective camera object in a 3D renderer. It takes field-of-view, aspect ratio, and perhaps near and far distances as inputs and spits out a projection matrix.

As an optimization, you may be tempted to cache the projection matrix to avoid computation if it’s requested multiple times with the same input. How do you test such a cache?

# pseudocode
class Camera:
    # ... setters and such go here

    def getProjectionMatrix():
        if _projectionMatrixDirty: recalculate()
        return _projectionMatrixCache
    bool _projectionMatrixDirty
    Mat4f _projectionMatrixCache

It’s tempting to set inputs and assert _projectionMatrixDirty or perhaps assert the contents of the _projectionMatrixCache field.

However, these unit tests would not actually assert the desired behavior! Yes, you can maybe assert that you’re caching data correctly, but nobody actually cares about that. Instead you want to test the motivating intent behind the cache: performance.

Try writing performance tests! Call getProjectionMatrix() multiple times and assert it takes less than N microseconds or cycles or whatever.
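
Here is what such a performance test might look like in Python. This is a sketch, not a definitive implementation: the Camera class, its constructor, setFieldOfView, and the timing helper are all assumptions, the matrix is the usual OpenGL-style perspective projection, and any time budget you assert against depends on the machine, so it should be chosen generously.

```python
import math
import time


class Camera:
    """Hypothetical camera that caches its projection matrix."""

    def __init__(self, fov_y, aspect, near, far):
        self._params = (fov_y, aspect, near, far)
        self._matrix = None  # None means the cache is dirty

    def setFieldOfView(self, fov_y):
        _, aspect, near, far = self._params
        self._params = (fov_y, aspect, near, far)
        self._matrix = None  # invalidate the cache

    def getProjectionMatrix(self):
        if self._matrix is None:
            self._matrix = self._recalculate()
        return self._matrix

    def _recalculate(self):
        fov_y, aspect, near, far = self._params
        f = 1.0 / math.tan(fov_y / 2.0)
        depth = near - far
        return (
            (f / aspect, 0.0, 0.0, 0.0),
            (0.0, f, 0.0, 0.0),
            (0.0, 0.0, (far + near) / depth, 2.0 * far * near / depth),
            (0.0, 0.0, -1.0, 0.0),
        )


def seconds_for_repeated_calls(camera, calls):
    """Time `calls` back-to-back requests for the projection matrix."""
    start = time.perf_counter()
    for _ in range(calls):
        camera.getProjectionMatrix()
    return time.perf_counter() - start
```

A test would then assert something like seconds_for_repeated_calls(camera, 10000) < budget, plus a behavioral check that changing the field of view changes the returned matrix. Neither assertion touches the private cache fields, so the cache can be rewritten or removed without breaking the tests.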

In some cases it may even be faster to recompute the matrix than to cache it, and a performance test will tell you so.


Here’s my rule of thumb: if a method on an object is only called by tests, it’s dead. Excise it and its tests.

Be rigorous and clear with your public interfaces. Protect your code with tests. Don’t let internal object refactoring break your unit tests!