Single-Allocation JSON Parser

Over the holiday break, as mental exercise, I wrote a single-allocation JSON parser, sajson. Why single-allocation? To me, software that fits within a precise resource budget, especially memory, is elegant. Most C or C++ JSON parsers allocate memory per node and use hash tables to store objects. Even if said parsers use efficient pool allocators or hash table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation, cache locality, and thread contention are the common arguments against. But I focused on a different issue: what is the worst case memory usage to parse, say, a 200 MB JSON file? With a JSON parser that dynamically allocates, it's challenging to prove the worst case consumption.

Before we calculate the worst case memory consumption of a JSON parser, let's cover some basics.

Parsers convert input text, a stream of characters, into a data structure or event stream suitable for reading or processing in some way. In this instance, sajson is a non-streaming dom-style parser in that it translates a complete buffer of characters into a contiguous parse tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and false. Two are scalars: numbers and strings. Finally, arrays and objects are composites: they contain references to other values. The root element of a JSON document can only be an array or object.

sajson's goal is to convert a stream of JSON text into a contiguous data structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type enumeration followed by the type's payload.

For example, the JSON text...

[null, 0, ["foo"]]

... would parse into...

<Array>
3 # length
5 # offset to first element
6 # offset to second element
9 # offset to third element
<Null>
<Number>
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
<Array>
1 # length
3 # offset to first element
<String>
12 # offset into source document of string start
15 # offset into source document of string end

... where each line is a pointer-sized (aka size_t) value and <> represents named type constants.

For the above representation, the parse tree's worst-case size is sizeof(size_t) * input_length * 2. I won't derive that here, but the worst-case document is a list of single-digit numbers:

[0,0,0,0] # 9 characters

# 9*2 = 18 'slots' of structure
<Array>
4
6 # relative offset to first element
9
12
15
<Number>
0
0
<Number>
0
0
<Number>
0
0
<Number>
0
0

But we can do better!

Using a full size_t to store a 3-bit type constant is rather wasteful. (Remember there are seven JSON types.) Because sajson only targets 32-bit and 64-bit architectures, each array or object element offset has three bits to spare and thus can include the element's type. The document needs one bit to determine the type of the root element. (Remember the root element must be an array or an object.)

A further optimization exists: rather than storing all numbers as IEEE 64-bit doubles, we can add an extra type tag: <Integer>. Single-digit JSON numbers must be integers, and thus consume less structural storage.

Let's consider the same example above with tagged element references, where <tag>:offset delimits the tag from the offset.

[0,0,0,0] # 9 characters

# root bit determines root is array
4 # length of array
<Integer>:5
<Integer>:6
<Integer>:7
<Integer>:8
0
0
0
0

Let's quickly check another example:

[[[[]]]] # 8 characters

# root bit determines root is array
1
<Array>:2
1
<Array>:2
1
<Array>:2
0

With the above changes, the parse tree size is cut in half! It now fits in sizeof(size_t) * input_length.

Next time I'll describe the challenges in building said parse tree without a-priori knowledge of array length. Here's a hint: imagine you know the input text is 20 characters long. The first three characters are "[[["...