sajson: Why the Parse Tree is Big Enough

Last week, I described a JSON parse tree data structure that, worst case, requires N words for N characters of JSON text. I want to explain the algorithm used to generate said parse tree, but first I will describe the parse tree data structure in detail. Simultaneously, I will show that the parse tree will fit in the worst case.

Given that value types are stored in 3-bit tags, it's intuitive that N characters of JSON data requires N words in the parse tree. Let's consider the parse tree representation of each JSON type individually:

Strings

Strings are represented in the parse tree by two pointers: one to the beginning of the string and one to the end. In the source text, these correspond to the string's quotes. The empty string, "", is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;
    size_t end;
};

Arrays

Arrays are represented by their length followed by a relative offset to each element.

struct Array {
    size_t length;
    size_t element[length]; // type:offset
};

A zero-length array is 2 characters ([]) but consumes one word in the parse tree.

An array of length 1, not including its element, is also 2 characters ([""]) but consumes two words in the parse tree.

An array of length N is N+1 characters, counting commas. The representation described above consumes N+1 words in the parse tree, so arrays fit.

Objects

Objects are represented by their length followed by an array of element records each containing two pointers for the key string as well as a relative offset to the element value.

struct ObjectElement {
    size_t key_begin;
    size_t key_end;
    size_t value; // type:offset
};

struct Object {
    size_t length;
    ObjectElement element[length];
};

The smallest object, {}, is 2 characters but its representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key string (but not the value string!), the object is five characters in the input text. Its parse tree representation is four words: the length plus three for its element.

Each additional element in the input text adds four characters (a comma, a colon, and two quotes) but requires only three words in the parse tree.

Numbers: Integers

The only hat trick required to fit sajson's parse tree representation in input_text_length * sizeof(size_t) bytes is representing integers differently than doubles.

struct Integer {
    intptr_t value;
};

That's because, on 32-bit architectures, doubles are two words. If single-digit numbers such as 0 consumed two words in the parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use one character of input text and one word of parse tree structure.

It's not worth the complexity, but the astute may notice that if we limit integers to 29 bits, we don't need to consume words in the parse tree at all.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse tree.

struct Double {
    size_t first_half;
    size_t second_half;
};

On 64-bit architectures, a double consumes a single word.

The shortest JSON doubles are a single digit followed by a decimal point followed by a single digit (example: 0.0) or a single digit with a single-digit exponent (example: 9e9). It's clear that they fit into the parse tree.

null, true, and false

null, true, and false contribute 4, 4, and 5 characters of input text, respectively. They are represented simply by their type tags: no parse tree data necessary.

We've now covered every JSON data type and shown that, in no case, will the parse tree representation use more words than the JSON text representation uses characters.

Imported Comments [?]

Writing a Really, Really Fast JSON Parser – Chad Austin on May 25, 2017

[…] holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I […]