Over the holiday break, as a mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.
Dynamic memory allocation has disadvantages: fragmentation,
poor cache locality, and thread contention are the common arguments
against. But I focused on a different issue: what is the worst case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case.
Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.
Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming DOM-style parser in
that it translates a complete buffer of characters into a contiguous
parse tree that supports both enumeration and random access.
JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.
sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.
My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.
For example, the JSON text…
[null, 0, ["foo"]]
… would parse into…
    <Array>
    3        # length
    5        # offset to first element
    6        # offset to second element
    9        # offset to third element
    <Null>
    <Number>
    0        # first 32 bits of IEEE double value
    0        # second 32 bits of value
    <Array>
    1        # length
    3        # offset to first element
    <String>
    12       # offset into source document of string start
    15       # offset into source document of string end
… where each line is a pointer-sized (aka
size_t) value and <> represents named type constants.
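To make the layout concrete, here is a minimal sketch of that slot array and of how an element lookup could work. It is not sajson's actual code: the `Type` constants, `example_tree`, and `element_at` are names I made up for illustration, and I'm assuming each element offset is relative to the slot holding its array's type tag, which is what the numbers in the listing suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type constants; names and values are assumptions.
enum Type : std::size_t {
    TYPE_NULL, TYPE_FALSE, TYPE_TRUE, TYPE_NUMBER,
    TYPE_STRING, TYPE_ARRAY, TYPE_OBJECT
};

// Slots for [null, 0, ["foo"]], copied from the listing above.
inline std::vector<std::size_t> example_tree() {
    return {
        TYPE_ARRAY, 3, 5, 6, 9,  // slots 0-4: tag, length, three element offsets
        TYPE_NULL,               // slot 5
        TYPE_NUMBER, 0, 0,       // slots 6-8: tag plus two payload slots
        TYPE_ARRAY, 1, 3,        // slots 9-11: nested array with one element
        TYPE_STRING, 12, 15      // slots 12-14: source offsets of the string
    };
}

// Returns the slot index of element i of the array whose tag is at array_at.
inline std::size_t element_at(const std::vector<std::size_t>& tree,
                              std::size_t array_at, std::size_t i) {
    assert(tree[array_at] == TYPE_ARRAY);
    assert(i < tree[array_at + 1]);             // check against the stored length
    return array_at + tree[array_at + 2 + i];   // offsets are relative to the array
}
```

Random access is then a single addition: `element_at(tree, 0, 2)` lands on the nested array's tag, and calling `element_at` again from there reaches the string.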
For the above representation, the parse tree’s worst-case size is
sizeof(size_t) * input_length * 2. I won’t derive that
here, but the worst-case document is a list of single-digit numbers:
    [0,0,0,0] # 9 characters
    # 9*2 = 18 'slots' of structure
    <Array>
    4
    6 # relative offset to first element
    9
    12
    15
    <Number>
    0
    0
    <Number>
    0
    0
    <Number>
    0
    0
    <Number>
    0
    0
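This is not a proof of the bound, but the arithmetic for this particular document is easy to check. The sketch below counts characters and slots for a list of n single-digit numbers; `digit_list_length` and `untagged_slots` are hypothetical helper names, not part of sajson.

```cpp
#include <cstddef>

// Characters in a JSON array of n single-digit numbers:
// n digits, n-1 commas, 2 brackets (just "[]" when n is 0).
constexpr std::size_t digit_list_length(std::size_t n) {
    return n == 0 ? 2 : 2 * n + 1;
}

// Slots in the untagged layout: <Array> tag, length, n offsets,
// then 3 slots (<Number> tag plus two payload slots) per number.
constexpr std::size_t untagged_slots(std::size_t n) {
    return 2 + n + 3 * n;
}

static_assert(digit_list_length(4) == 9, "[0,0,0,0] is 9 characters");
static_assert(untagged_slots(4) == 18, "matches the 18 slots listed above");
static_assert(untagged_slots(1000) == 2 * digit_list_length(1000),
              "4n + 2 slots is exactly twice 2n + 1 characters");
```

For any n of at least 1, the slot count is 4n + 2, which is exactly twice the 2n + 1 input characters, so this document shape sits right on the stated bound.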
But we can do better!
Using a full
size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)
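One way to pack a type into an element offset is sketched below. I'm assuming here that the tag sits in the low three bits and the offset in the remaining bits; the real bit placement is an implementation detail I'm not describing, and `Tag`, `make_element`, `element_tag`, and `element_offset` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags; seven JSON types fit in three bits.
enum Tag : std::size_t {
    TAG_NULL, TAG_FALSE, TAG_TRUE, TAG_NUMBER,
    TAG_STRING, TAG_ARRAY, TAG_OBJECT
};

constexpr std::size_t TAG_BITS = 3;
constexpr std::size_t TAG_MASK = (std::size_t(1) << TAG_BITS) - 1;

// Combines a tag and an offset into one size_t slot.
inline std::size_t make_element(Tag tag, std::size_t offset) {
    // The offset must fit in the bits left over after the tag.
    assert(offset <= (~std::size_t(0) >> TAG_BITS));
    return (offset << TAG_BITS) | tag;
}

// Extracts the tag from the low three bits.
inline Tag element_tag(std::size_t element) {
    return static_cast<Tag>(element & TAG_MASK);
}

// Extracts the offset from the remaining high bits.
inline std::size_t element_offset(std::size_t element) {
    return element >> TAG_BITS;
}
```

With this packing, `make_element(TAG_ARRAY, 2)` occupies a single slot, where the earlier layout needed a separate slot just for the type constant.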
A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.
Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.
    [0,0,0,0] # 9 characters
    # root bit determines root is array
    4 # length of array
    <Integer>:5
    <Integer>:6
    <Integer>:7
    <Integer>:8
    0
    0
    0
    0
Let’s quickly check another example:
    [[[]]] # 6 characters
    # root bit determines root is array
    1
    <Array>:2
    1
    <Array>:2
    0
With the above changes, the parse tree size is cut in half! It is now
sizeof(size_t) * input_length.
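That bound answers the question from the start of the post. As a rough sketch, here is the arithmetic for a 200 MB document; `worst_case_bytes`, `tagged_slots`, and `MB` are hypothetical names, `word_size` stands in for sizeof(size_t) on the target, and I'm assuming decimal megabytes.

```cpp
#include <cstddef>
#include <cstdint>

// Worst-case parse tree size in bytes for the tagged layout:
// one size_t slot per input character.
constexpr std::uint64_t worst_case_bytes(std::uint64_t input_length,
                                         std::uint64_t word_size) {
    return input_length * word_size;
}

// Slots used by a list of n single-digit integers in the tagged layout:
// 1 length slot, n tagged references, n integer payloads.
constexpr std::size_t tagged_slots(std::size_t n) {
    return 1 + n + n;
}

constexpr std::uint64_t MB = 1000 * 1000;  // assumption: decimal megabytes

static_assert(tagged_slots(4) == 9, "[0,0,0,0]: 9 slots for 9 characters");
static_assert(worst_case_bytes(200 * MB, 8) == 1600 * MB,
              "200 MB input with 8-byte size_t: 1.6 GB");
static_assert(worst_case_bytes(200 * MB, 4) == 800 * MB,
              "200 MB input with 4-byte size_t: 800 MB");
```

So the answer is known before parsing starts: at most 1.6 GB of structure on a 64-bit target and 800 MB on a 32-bit one, in addition to the input text itself.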
Next time I’ll describe the challenges in building said parse tree
without a-priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three