sajson: Building the Parse Tree

by Chad Austin on January 7th, 2013

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don't know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was
slow. The scanning phase was simpler than parsing, but not THAT
simpler. Since parsing involves reading one byte and
branching on its value, parsing in two phases was effectively
half the speed. Second, the scanning phase duplicated a fair amount
of parsing code, making it harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack.)

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array's location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It's possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let's examine the same example but where we store a reference
to the outer 'in-construction' object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0
[<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

# seen [[0],0]
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array
record will be 1+N records, including its length. The temporary array
storage is also 1+N, where we don't yet know its length but we do have
a reference to the enclosing array or object. Thus, we have room for
outer references in the parse tree.

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed inputs crash.

That said, if you give sajson a try, let me know how it goes.

Leave a Reply