sajson: Why the Parse Tree is Big Enough

by Chad Austin on January 6th, 2013

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst
case.

Given that value types are stored in 3-bit tags, it’s intuitive that N
characters of JSON data requires N words in the parse tree. Let’s
consider the parse tree representation of each JSON type individually:

Strings

Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;
    size_t end;
};

Arrays

Arrays are represented by their length followed by a relative offset
to each element.

struct Array {
    size_t length;
    size_t element[length]; // type:offset
};

A zero-length array is 2 characters ([]) but consumes one word in the
parse tree.

An array of length 1, not including its element, is also 2 characters
([""]) but consumes two words in the parse tree.

An array of length N is N+1 characters, counting commas. The
representation described above consumes N+1 words in the parse tree,
so arrays fit.

Objects

Objects are represented by their length followed by an array of element records
each containing two pointers for the key string as well as a relative
offset to the element value.

struct ObjectElement {
    size_t key_begin;
    size_t key_end;
    size_t value; // type:offset
};

struct Object {
    size_t length;
    ObjectElement element[length];
};

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.

Numbers: Integers

The only hat trick required to fit sajson’s parse tree
representation in input_text_length * sizeof(size_t)
bytes is representing integers differently than doubles.

struct Integer {
    intptr_t value;
};

That’s because, on 32-bit architectures, doubles are two words. If
single-digit numbers such as 0 consumed two words in the
parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the parse
tree at all.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse
tree.

struct Double {
    size_t first_half;
    size_t second_half;
};

On 64-bit architectures, a double consumes a single word.

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). It’s clear that they fit into
the parse tree.

null, true, and false

null, true, and false contribute 4, 4, and 5 characters of input text,
respectively. They are represented simply by their type tags: no
parse tree data necessary.

We’ve now covered every JSON data type and shown that, in no case,
will the parse tree representation use more words than the JSON text
representation uses characters.

Leave a Reply