JSON Never Dies – An Efficient, Queryable Binary Encoding

A familiar story: Team starts project. Team uses JSON for interchange; it’s easy and ubiquitous and debuggable. Project grows. Entity types multiply. Data sizes increase. Finally, team notices hundreds of kilobytes — or even megabytes — of JSON is being generated, transferred, and parsed.

“Just use a binary format!” the Internet cries.

But by the time a project gets to this stage, it’s often a nontrivial amount of work to switch to Protocol Buffers or Cap’n Proto or Thrift or whatever. There might be thousands of lines of code for mapping model objects to and from JSON (that is arrays, objects, numbers, strings, and booleans). And if you’re talking about some hand-rolled binary format, it’s even worse: you need implementations for all of your languages and to make sure they’re fuzzed and secure.

The fact is, the activation energy required to switch from JSON to something else is high enough that it rarely happens. The JSON data model tends to stick. However, if we could insert a single function into the network layer to represent JSON more efficiently, then that could be an easy, meaningful win.

“Another binary JSON? We already have CBOR, MsgPack, BSON, and UBSON!” the Internet cries.

True, but, at first glance, none of those formats actually seem all that great. JSON documents tend to repeat the same strings over and over again, but those formats don’t support reusing string values or object shapes.

This led me to perform some experiments. What might a tighter binary representation of JSON look like?

What’s in a format?

First, what are some interesting properties of a file format?

  • Flexibility and Evolvability Well, we’re talking about representing JSON here, so they’re all going to be similar. However, some of the binary JSON replacements also have support for dates, binary blobs, and 64-bit integers.
  • Size How efficiently is the information encoded in memory? Uncompressed size matters because it affects peak memory usage and it’s how much data the post-decompression parser has to touch.
  • Compressibility Since you often get some sort of LZ compression “for free” in the network or storage interface, there’s value in the representation being amenable to those compression algorithms.
  • Code Size How simple are the encoders and decoders? Beyond the code size benefits, simple encoders and decoders are easier to audit for correctness, resource consumption bounds, and security vulnerabilities.
  • Decoder Speed How quickly can the entire file be scanned or processed? For comparison, JSON can be parsed at a rate of hundreds of MB/s.
  • Queryability Often we only want a subset of the data given to us. Does the format allow O(1) or O(lg N) path queries? Can we read the format without first parsing it into memory?

Size and parse-less queryability were my primary goals with JND. My hypothesis was that, since many JSON documents have repeating common structures (including string keys), storing strings and object shapes in a table would result in significant size wins.

Quickly glancing at the mainstream binary JSON encodings…


Each value starts with a tag byte followed by its payload. e.g. “0xdc indicates array, followed by a 16-bit length, followed by N values”. Big-endian integers.

Must be parsed before querying.


Per spec, does not actually seem to be a superset of JSON? Disallows nuls in object keys, and does not support arrays as root elements, as far as I can tell.

Otherwise, similar encoding as MsgPack. Each value has a tag byte followed by a payload. At least it uses little-endian integers.


Same idea. One-byte tags for each value, followed by a payload. Notably, lists and objects have terminators, and may not have an explicit length. Kind of an odd decision IMO.

Big-endian again. Weird.


IETF standard. Has a very intentional specification with documented rationales. Supports arrays of known sizes and arrays that terminate with a special “break” element. Smart 3-bit major tag followed by 5-bit length with special values for “length follows in next 4 bytes”, etc.

Big-endian again… Endianness doesn’t matter all that much, but it’s kind of weird to see formats using the endianness that’s less common these days.

CBOR does not support string or object shape tables, but at first glance it does not seem like CBOR sucks. I can imagine legitimate technical reasons to use it, though it is a quite complicated specification.


Okay! All of those formats have roughly the same shape. One byte prefixes on every value, value payloads in line (and thus values are variable-width).

Now it’s time to look at the format I sketched up.

The file consists of a simple header marking the locations and sizes of three tables: values, strings, and object shapes.

The string table consists of raw UTF-8 string data.

In the value table, every value starts with a tag byte (sound familiar?). The high nibble encodes the type. The low nibble contains two 2-bit values, k and n.

Consider strings:

0011 kknn [1<<k bytes] [1<<n bytes] - string: offset, length into string table

0011 indicates this is a string value. k and n are “size tags” which indicate how many bytes encode the integers. The string offset is 1 << k bytes (little-endian) and the string length is 1 << n bytes. Once the size tags are decoded, the actual offset and length values are read following the tag byte, and the resulting indices are used to retrieve the UTF-8 text at the given offset and length from the string table.

Now let’s look at objects:

1000 kknn [1<<k bytes] - object, object index, then <length> values of width n

The following 1 << k bytes encode the index into the object shape table, which holds the number and sorted list of object keys. Afterwards is a simple list of indices into the value table, each of size 1 << n bytes. The values are matched up with the keys in the object shape table.

Arrays are similar, except that instead of using an object index, they simply store their length.

This encoding has the property that lookups into an array are O(1) and lookups into an object are O(lg N), giving efficient support for path-based queries.

But there’s a pretty big downside relative to MsgPack, CBOR, and the like. The cost of efficient random access is that the elements of arrays and objects must have a known size. Thus, the (variable-width) values themselves cannot be stored directly into the array’s element list. Instead, arrays and objects have a list of fixed-width numeric offsets into the value table. This adds a level of indirection, and thus overhead, to JND. The payoff is that once a particular value is written (like a string or double-precision float), its index can be reused and referenced multiple times.

So how does this play out in practice? Net win or loss?

Size Benchmarks

I used sajson’s standard corpus of test JSON documents.

  • apache_builds Root object, largely consists of an array containing many three-element objects.
  • github_events Mix of objects, arrays, and long URL strings.
  • getinitialstate I can’t share the contents of this document as it’s from a project at work, but it’s 1.7 MB of various entity definitions, where each entity type is an object with maybe a dozen or two fields.
  • instruments A huge number of repeated structures and data — very compressible.
  • mesh 3D geometry. Basically a giant list of floating point and integer numbers.
  • svg_menu Only 600 bytes – used to test startup and base overhead costs.
  • twitter List of fairly large objects, many long strings.
  • update-center From Jenkins. Mostly consists of an object representing a mapping from plugin name to plugin description.


We can draw a few conclusions from these results.

  • As a wire replacement for the JSON data model, there’s no apparent reason to use BSON or UBSON. I’d probably default to MsgPack because it’s simple, but CBOR might be okay too.
  • The current crop of binary formats don’t compress any better than JSON itself, so if size over the wire is your primary concern, you might as well stick with JSON.
  • Except for small documents, where JND’s increased representation overhead dominates, JND is pretty much always smaller uncompressed. As predicted, reusing strings and object shapes is a big win in practice.
  • LZ-style compression algorithms don’t like JND much. Not a big surprise, since they don’t do a good job with sequences of numeric values. I expect delta coding value offsets in arrays and objects would help a lot, at the cost of needing to do a linear pass from delta to absolute at encode and decode time.

JND’s disadvantage is clear in the above graphs: while it’s smaller uncompressed, it does not compress as well as JSON or MsgPack. (Except in cases where its uncompressed size is dramatically smaller because of huge amounts of string or object shape reuse.)

Where would something like JND be useful? JND’s big advantage is that it can be read directly without allocating or parsing into another data structure. In addition, the uncompressed data is relatively compact. So I could imagine it being used if IO is relatively cheap but tight memory bounds are required.

Another advantage of JND is a bit less obvious. In my experience using sajson from other languages, the parse itself is cheaper than converting the parsed AST into values in the host language. For example, constructing Swift String objects from binary UTF-8 data was an order of magnitude slower than the parse itself. In JND, every unique string would naturally be translated into host language String objects once.

If you’d like to play with my experiments, they’re on GitHub at https://github.com/chadaustin/jnd.

Future Work?

It was more work than I could justify for this experiment, but I’d love to see how Thrift or Protocol Buffers compare to JND. JND is a self-describing format, so it’s going to lose on small messages, but when data sizes get big I wouldn’t be surprised if it at least ties protobufs.

Update: Forgot to mention decode times. I didn’t have time to set up a proper benchmark with high-performance implementations of each of these formats (rather than whatever pip gave me), but I think we can safely assume CBOR, MsgPack, BSON, and UBSON all have similar parse times, since the main structure of the decoder loop would be the same. The biggest question would be: how does that compare to JSON? And how does JND compare to both?

sajson: Building the Parse Tree

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don’t know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was
slow. The scanning phase was simpler than parsing, but not THAT
simpler. Since parsing involves reading one byte and
branching on its value, parsing in two phases was effectively
half the speed. Second, the scanning phase duplicated a fair amount
of parsing code, making it harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:


The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack.)

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array’s location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.


out now gives us the root of the parse tree.

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It’s possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let’s examine the same example but where we store a reference
to the outer ‘in-construction’ object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0

# seen [[0],0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array
record will be 1+N records, including its length. The temporary array
storage is also 1+N, where we don’t yet know its length but we do have
a reference to the enclosing array or object. Thus, we have room for
outer references in the parse tree.

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed inputs crash.

That said, if you give sajson a try, let me know how it goes.

sajson: Why the Parse Tree is Big Enough

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst

Given that value types are stored in 3-bit tags, it’s intuitive that N
characters of JSON data requires N words in the parse tree. Let’s
consider the parse tree representation of each JSON type individually:


Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;
    size_t end;


Arrays are represented by their length followed by a relative offset
to each element.

struct Array {
    size_t length;
    size_t element[length]; // type:offset

A zero-length array is 2 characters ([]) but consumes one word in the
parse tree.

An array of length 1, not including its element, is also 2 characters
([""]) but consumes two words in the parse tree.

An array of length N is N+1 characters, counting commas. The
representation described above consumes N+1 words in the parse tree,
so arrays fit.


Objects are represented by their length followed by an array of element records
each containing two pointers for the key string as well as a relative
offset to the element value.

struct ObjectElement {
    size_t key_begin;
    size_t key_end;
    size_t value; // type:offset

struct Object {
    size_t length;
    ObjectElement element[length];

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.

Numbers: Integers

The only hat trick required to fit sajson’s parse tree
representation in input_text_length * sizeof(size_t)
bytes is representing integers differently than doubles.

struct Integer {
    intptr_t value;

That’s because, on 32-bit architectures, doubles are two words. If
single-digit numbers such as 0 consumed two words in the
parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the parse
tree at all.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse

struct Double {
    size_t first_half;
    size_t second_half;

On 64-bit architectures, a double consumes a single word.

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). It’s clear that they fit into
the parse tree.

null, true, and false

null, true, and false contribute 4, 4, and 5 characters of input text,
respectively. They are represented simply by their type tags: no
parse tree data necessary.

We’ve now covered every JSON data type and shown that, in no case,
will the parse tree representation use more words than the JSON text
representation uses characters.

Single-Allocation JSON Parser

Over the holiday break, as mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation,
cache locality, and thread contention are the common arguments
against. But I focused on a different issue: what is the worst case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case

Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.

Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming dom-style parser in
that it translates a complete buffer of characters into a contiguous
parse tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.

sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.

For example, the JSON text…

[null, 0, ["foo"]]

… would parse into…

3 # length
5 # offset to first element
6 # offset to second element
9 # offset to third element
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
1 # length
3 # offset to first element
12 # offset into source document of string start
15 # offset into source document of string end

… where each line is a pointer-sized (aka size_t) value and <> represents named type constants.

For the above representation, the parse tree’s worst-case size is
sizeof(size_t) * input_length * 2. I won’t derive that
here, but the worst-case document is a list of single-digit numbers:

[0,0,0,0] # 9 characters

# 9*2 = 18 'slots' of structure
6 # relative offset to first element

But we can do better!

Using a full size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)

A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.

Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.

[0,0,0,0] # 9 characters

# root bit determines root is array
4 # length of array

Let’s quickly check another example:

[[[[]]]] # 8 characters

# root bit determines root is array

With the above changes, the parse tree size is cut in half! It now
fits in sizeof(size_t) * input_length.

Next time I’ll describe the challenges in building said parse tree
without a-priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three
characters are "[[["

Tracing Leaks in Python: Find the Nearest Root

Garbage Collection Doesn’t Mean You Can Ignore Memory Altogether…

This post is available on the IMVU Engineering Blog.

Garbage collection removes a great deal of burden from programming. In fact, garbage collection is a critical language feature for all languages where abstractions such as functional closures or coroutines are common, as they frequently create reference cycles.

IMVU is a mix of C++ and Python. The C++ code generally consists of small, cohesive objects with a clear ownership chain. An Avatar SceneObject owns a ModelInstance which owns a set of Meshes which own Materials which own Textures and so on… Since there are no cycles in this object graph, reference-counting with shared_ptr suffices.

The Python code, however, is full of messy object cycles. An asynchronous operation may hold a reference to a Room, while the Room may be holding a reference to the asynchronous operation. Often two related objects will be listening for events from the other. While Python’s garbage collector will happily take care of cycles, it’s still possible to leak objects.

Imagine these scenarios:

  • a leaked or living C++ object has a strong reference to a Python object.
  • a global cache has a reference to an instance’s bound method, which implicitly contains a reference to the instance.
  • two objects with __del__ methods participate in a cycle with each other, and Python refuses to decide which should destruct first

To detect these types of memory leaks, we use a LifeTimeMonitor utility:

a = SomeObject()
lm = LifeTimeMonitor(a)
del a
lm.assertDead() # succeeds

b = SomeObject()
lm = LifeTimeMonitor(b)
lm.assertDead() # raises ObjectNotDead

We use LifeTimeMonitor’s assertDead facility at key events, such as when a user closes a dialog box or 3D window. Take 3D windows as an example. Since they’re the root of an entire object subgraph, we would hate to inadvertently leak them. LifeTimeMonitor’s assertDead prevents us from introducing an object leak.

It’s good to know that an object leaked, but how can you determine why it can’t be collected?

Python’s Garbage Collection Algorithm

Let’s go over the basics of automatic garbage collection. In a garbage-collected system there are objects and objects can reference each other. Some objects are roots; that is, if an object is referenced by a root, it cannot be collected. Example roots are the stacks of live threads and the global module list. The graph formed by objects and their references is the object graph.

In SpiderMonkey, Mozilla’s JavaScript engine, the root set is explicitly-managed. SpiderMonkey’s GC traverses the object graph from the root set. If the GC does not reach an object, that object is destroyed. If C code creates a root object but fails to add it to the root set, it risks the GC deallocating the object while it’s still in use.

In Python however, the root set is implicit. All Python objects are ref-counted, and any that can refer to other objects — and potentially participate in an object cycle — are added to a global list upon construction. Each GC-tracked object can be queried for its referents. Python’s root set is implicit because anyone can create a root simply by incrementing an object’s refcount.

Since Python’s root set is implicit, its garbage collection algorithm differs slightly from SpiderMonkey’s. Python begins by setting GCRefs(o) to CurrentRefCount(o) for each GC-tracked PyObject o. Then it traverses all referents r of all GC-tracked PyObjects and subtracts 1 from GCRefs(r). Then, if GCRefs(o) is nonzero, o is an unknown reference, and thus a root. Python traverses the now-known root set and increments GCRefs(o) for any traversed objects. If any object o remains where GCRefs(o) == 0, that object is unreachable and thus collectible.

Finding a Path From the Nearest Root to the Leaked Object

Now that we know how Python’s garbage collector works, we can ask it for its set of roots by calculating GCRefs(o) for all objects o in gc.get_objects(). Then we perform a breadth-first-search from the root set to the leaked object. If the root set directly or indirectly refers to the leaked object, we return the path our search took.

Sounds simple, but there’s a catch! Imagine that the search function has signature:

PyObject* findPathToNearestRoot(PyObject* leakedObject);

leakedObject is a reference (incremented within Python’s function-call machinery itself) to the leaked object, making leakedObject a root!

To work around this, change findPathToNearestRoot so it accepts a singleton list containing a reference to the leaked object. findPathToNearestRoot can borrow that reference and clear the list, ensuring that leakedObject has no untracked references.

findPathToNearestRoot will find paths to expected Python roots like thread entry points and module objects. But, since it directly mirrors the behavior of Python’s GC, it will also find paths to leaked C references! Obviously, it can’t directly point you to the C code that leaked the reference, but the reference path should be enough of a clue to figure it out.

The Code

template<typename ArgType>
void traverse(PyObject* o, int (*visit)(PyObject* visitee, ArgType* arg), ArgType* arg) {
    if (Py_TYPE(o)->tp_traverse) {
        Py_TYPE(o)->tp_traverse(o, (visitproc)visit, arg);

typedef std::map<PyObject*, int> GCRefs;

static int subtractKnownReferences(PyObject* visitee, GCRefs* gcrefs) {
    if (gcrefs->count(visitee)) {
    return 0;

typedef int Backlink; // -1 = none

typedef std::vector< std::pair<Backlink, PyObject*> > ReferenceList;
struct Referents {
    std::set<PyObject*>& seen;
    Backlink backlink;
    ReferenceList& referenceList;

static int addReferents(PyObject* visitee, Referents* referents) {
    if (!referents->seen.count(visitee) && PyObject_IS_GC(visitee)) {
        referents->referenceList.push_back(std::make_pair(referents->backlink, visitee));
    return 0;

static Backlink findNextLevel(
    std::vector<PyObject*>& chain,
    const ReferenceList& roots,
    PyObject* goal,
    std::set<PyObject*>& seen
) {
    if (roots.empty()) {
        return -1;

    for (size_t i = 0; i < roots.size(); ++i) {
        if (roots[i].first != -1) {
            if (goal == roots[i].second) {
                return roots[i].first;

    ReferenceList nextLevel;
    for (size_t i = 0; i < roots.size(); ++i) {
        Referents referents = {seen, i, nextLevel};
        traverse(roots[i].second, &addReferents, &referents);

    Backlink backlink = findNextLevel(chain, nextLevel, goal, seen);
    if (backlink == -1) {
        return -1;

    return roots[backlink].first;

static std::vector<PyObject*> findReferenceChain(
    const std::vector<PyObject*>& roots,
    PyObject* goal
) {
    std::set<PyObject*> seen;
    ReferenceList unknownReferrer;
    for (size_t i = 0; i < roots.size(); ++i) {
        unknownReferrer.push_back(std::make_pair<Backlink>(-1, roots[i]));
    std::vector<PyObject*> rv;
    // going to return -1 no matter what: no backlink from roots
    findNextLevel(rv, unknownReferrer, goal, seen);
    return rv;

static object findPathToNearestRoot(const object& o) {
    if (!PyList_Check(o.ptr()) || PyList_GET_SIZE(o.ptr()) != 1) {
        PyErr_SetString(PyExc_TypeError, "findNearestRoot must take a list of length 1");

    // target = o.pop()
    object target(handle<>(borrowed(PyList_GET_ITEM(o.ptr(), 0))));
    if (-1 == PyList_SetSlice(o.ptr(), 0, 1, 0)) {

    object gc_module(handle<>(PyImport_ImportModule("gc")));
    object tracked_objects_list = gc_module.attr("get_objects")();
    // allocating the returned list may have run a GC, but tracked_objects won't be in the list

    std::vector<PyObject*> tracked_objects(len(tracked_objects_list));
    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        object to = tracked_objects_list[i];
        tracked_objects[i] = to.ptr();
    tracked_objects_list = object();

    GCRefs gcrefs;
    // TODO: store allocation/gc count per generation

    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        gcrefs[tracked_objects[i]] = tracked_objects[i]->ob_refcnt;

    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        traverse(tracked_objects[i], subtractKnownReferences, &gcrefs);

    // BFS time
    std::vector<PyObject*> roots;
    for (GCRefs::const_iterator i = gcrefs.begin(); i != gcrefs.end(); ++i) {
        if (i->second && i->first != target.ptr()) { // Don't count the target as a root.
    std::vector<PyObject*> chain = findReferenceChain(roots, target.ptr());

    // TODO: assert that allocation/gc count per generation didn't change

    list rv;
    for (size_t i = 0; i < chain.size(); ++i) {

    return rv;