I thought this paper was really cool: Semi-Indexing
Semi-Structured Data in Tiny Space. The idea is pretty simple but it uses two ideas I wasn’t familiar with: Elias-Fano coding and balanced parentheses coding.
Imagine you have a bunch of JSON documents that you want to keep around without reencoding in some way, and you want to occasionally query the document for a selection of fields. Parsing JSON is not terribly efficient, and you have to parse most of the file just to read, say, a single key.
The key observation is that you can store a bitstream of information alongside the original JSON document that describes enough of the parse tree to be able to find a key without parsing the entire document. JSON is LL(1) so the type of each node is determined by its first byte in the file. Thus, the locations of each JSON node are a monotonically-increasing set of integer indices into the source file. Elias-Fano codes are a very efficient (quasi-succinct) encoding for such a set. In addition, the nesting hierarchy is encoding with a balanced parentheses bit pattern code.
The result is that it’s possible to preparse the document into a data structure about 25% of the size of the original JSON document that allows direct field lookups.
The author also provides an implementation of this preparsing algorithm for JSON on GitHub.
All of that said, I do slightly question this approach’s utility: it’s probably a bigger win to come up with an isomorphic encoding of JSON that allows direct access and still allows reconstruction of the original document if necessary.