sajson: Building the Parse Tree

The least trivial algorithm for building sajson's parse tree is allocating (or should I say, reserving?) the space in the parse tree for an array's or object's element list without knowing the length in advance.

Let's consider an eleven-character JSON text. Imagine we've parsed three characters: [[[. At this point we know two things: 1) we can fit the parse tree in eleven words and 2) there are at least three arrays.

We don't know the length of the arrays, so we cannot begin writing the parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass scanned the JSON text for arrays and objects and temporarily stored their lengths into safe locations in the parse tree array. The second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was slow. The scanning phase was simpler than parsing, but not THAT simpler. Since parsing involves reading one byte and branching on its value, parsing in two phases was effectively half the speed. Second, the scanning phase duplicated a fair amount of parsing code, making it harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at the front and one at the back. Call the one at the front temp, for temporary storage, and the one at the back out, for the actual parse tree data.

When encountering the beginning of an array or object, remember the current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current temp pointer to its value when beginning the array or object. The difference is the length of the array or object. Now that we know the length and each element reference, we can move the element references out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the C stack.)

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move the references from temp to out. We write the new array's location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain relative to the array record.

We then encounter another zero and place its payload in out and location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.

Eliminating the Recursion

You may have noticed the previous implementation stores the start address of each array or object on the C stack. This is liable to overflow in the case of a JSON file consisting of N [s followed by N ]s for some large N. The JSON standard allows parsers to limit the maximum depth they handle, but we can do better.

It's possible to eliminate the recursion by storing the value of temp into the temp side of the parse tree at the start of every array or object. When reaching the end of an array or object, its length is calculated, the record is written into out, and the previous value of temp is restored. If the previous value of temp is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these outer references?

First, let's examine the same example but where we store a reference to the outer 'in-construction' object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0
[<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

# seen [[0],0]
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array record will be 1+N records, including its length. The temporary array storage is also 1+N, where we don't yet know its length but we do have a reference to the enclosing array or object. Thus, we have room for outer references in the parse tree.

Actual Code

The result is an implementation whose parsing loop is almost entirely inlined, and on architectures with a reasonable number of registers (even AMD64), very little spills to the stack.

sajson is available under the MIT license, but at the time of this writing, it is primarily a proof of concept. The API is not stable and it does not support string escapes. It also needs a security review to guarantee that no malformed inputs crash.

That said, if you give sajson a try, let me know how it goes.