sajson: Building the Parse Tree
The least trivial algorithm for building sajson's parse tree is allocating (or should I say, reserving?) the space in the parse tree for an array's or object's element list without knowing the length in advance.
Let's consider an eleven-character JSON text. Imagine we've parsed
three characters: [[[
. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.
We don't know the length of the arrays, so we cannot begin writing the parse tree records yet.
The file could be [[[0,0,0]]]
or [[[[[0]]]]]
or [[[0,[0]]]]
all of
which have quite different parse tree representations.
My first attempt involved parsing in two passes. The first pass scanned the JSON text for arrays and objects and temporarily stored their lengths into safe locations in the parse tree array. The second pass used that information to correctly lay out the parse tree.
Parsing in two passes worked but had two major disadvantages. First, it was slow. The scanning phase was simpler than parsing, but not THAT simpler. Since parsing involves reading one byte and branching on its value, parsing in two phases was effectively half the speed. Second, the scanning phase duplicated a fair amount of parsing code, making it harder to reason about and maintain.
Mike Ralston and I worked out a simpler approach at the cost of two memory writes per array/object element record.
The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp
, for
temporary storage, and the one at the back out
, for the actual parse
tree data.
When encountering the beginning of an array or object, remember the
current temp
pointer.
When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out
and its location in temp
.
When encountering the end of an array or object, compare the current
temp
pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp
and into out
.
It may help to work through a simple example:
[[0],0]
The JSON text is 7 characters. Thus we have 7 words of parse tree to work with:
[ ][ ][ ][ ][ ][ ][ ] ^ ^ temp out
Encountering the first [, we store the current value of temp (on the C stack).
Encountering the second [, we store the current value of temp (on the C stack.)
At this point, nothing has been written to the parse tree.
Then we see the first zero and place its payload at out
and its
type+location reference at temp
.
[<Integer>:6][ ][ ][ ][ ][ ][0] ^ ^ temp out
Encountering the first ], we calculate the array length, 1, and move
the references from temp
to out
. We write
the new array's location to temp
, giving:
[<Array>:4][ ][ ][ ][1][<Integer>:2][0] ^ ^ temp out
We were careful to adjust the element references so they remain relative to the array record.
We then encounter another zero and place its payload in out
and
location in temp
:
[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0] ^ temp out
Closing the outer array, we calculate its length (2), again move
the references from temp
to out
, and write the final array record.
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0] ^ out
out
now gives us the root of the parse tree.
Eliminating the Recursion
You may have noticed the previous implementation stores the start address of each array or object on the C stack. This is liable to overflow in the case of a JSON file consisting of N [s followed by N ]s for some large N. The JSON standard allows parsers to limit the maximum depth they handle, but we can do better.
It's possible to eliminate the recursion by storing the value of
temp
into the temp
side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out
, and the
previous value of temp
is restored. If the previous value of temp
is a special root marker, parsing is complete.
Does the parse tree, even during construction, have room for these outer references?
First, let's examine the same example but where we store a reference
to the outer 'in-construction' object in temp
:
# seen [ [<Array>:<ROOT>][ ][ ][ ][ ][ ][ ] # seen [[ [<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ] # seen [[0 [<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0] # seen [[0], [<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0] # seen [[0],0 [<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0] # seen [[0],0] [2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
In this example, there is room. But will there always be?
An easy conceptualization is that the final size of an array record will be 1+N records, including its length. The temporary array storage is also 1+N, where we don't yet know its length but we do have a reference to the enclosing array or object. Thus, we have room for outer references in the parse tree.
Actual Code
The result is an implementation whose parsing loop is almost entirely inlined, and on architectures with a reasonable number of registers (even AMD64), very little spills to the stack.
sajson is available under the MIT license, but at the time of this writing, it is primarily a proof of concept. The API is not stable and it does not support string escapes. It also needs a security review to guarantee that no malformed inputs crash.
That said, if you give sajson a try, let me know how it goes.