Working on Eden’s timestamp storage.

Hm, I wonder what ext4’s possible timestamp range is?

Time to read https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inode_Timestamps.

That’s a cute way to extend the legacy timestamp format. One extra 32-bit value gives nanosecond precision and extends the range out to year 2446.
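The extension scheme is compact enough to sketch in a few lines. This paraphrases the layout described on the wiki page (helper names are mine, not the kernel's): the low two bits of the extra 32-bit field extend the epoch by multiples of 2^32 seconds, and the remaining 30 bits hold nanoseconds.

```cpp
#include <cstdint>

// Sketch of ext4's extended timestamp decoding. The classic inode stores a
// signed 32-bit seconds-since-epoch field; the "extra" 32-bit field packs:
//   bits 1..0  - epoch extension: adds up to 3 multiples of 2^32 seconds
//   bits 31..2 - nanoseconds
int64_t decode_seconds(int32_t sec, uint32_t extra) {
    uint64_t epoch_bits = extra & 0x3;
    return static_cast<int64_t>(sec) + (static_cast<int64_t>(epoch_bits) << 32);
}

uint32_t decode_nanoseconds(uint32_t extra) {
    return extra >> 2;
}
```

With both epoch bits set and a maximal signed 32-bit seconds value, you get 15,032,385,535 seconds past the epoch, which lands in year 2446.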

What happens if we exceed the range?

dev:~/tmp/timestamps $ df -T .
Filesystem     Type    ...
/dev/vda3      ext4    ...
dev:~/tmp/timestamps $ touch -d '1900-01-01 01:02' a
dev:~/tmp/timestamps $ ls -l
total 0
-rw-r--r--. 1 chadaustin users 0 Jan  1  1900 a

Huh, that’s weird. ext4 shouldn’t be able to represent 1900.

dev:~/tmp/timestamps $ touch -d '2800-01-01 01:02' b
dev:~/tmp/timestamps $ ls -l
total 0
-rw-r--r--. 1 chadaustin users 0 Jan  1  1900 a
-rw-r--r--. 1 chadaustin users 0 Jan  1  2800 b

And it definitely shouldn’t be able to represent year 2800…

dev:~/tmp/timestamps $ sudo bash -c 'echo 2 > /proc/sys/vm/drop_caches'
dev:~/tmp/timestamps $ ls -l
total 0
-rw-r--r--. 1 chadaustin users 0 May 29  2444 a
-rw-r--r--. 1 chadaustin users 0 Aug  5  2255 b

… oh.

Digitizing Home Videos

Several years back, my father started a project to digitize our home videos. He purchased an old computer from the IMVU automated builds cluster, bought an AVerMedia C027 capture card, digitized a few tapes… and then his digitization workstation sat there for years, untouched.

Sadly, he passed away last year, so I picked up the project. There were four types of analog media to digitize: VHS tapes, 8mm tapes, audio cassettes, and old Super 8 film reels.

Super 8

The Super 8 film went to a service in Redwood City. I don’t have any relevant equipment to play it and they do a good job — they clean the film, take high-resolution photos of each frame, and then apply color adjustments to correct for any age-related fading, overexposure, or underexposure. The output format is up to you. I selected an MP4 movie file and a 1080p JPEG for every captured frame. (30 GB of 1080p JPEGs for 20 minutes of video!)

The service worked out pretty well. My only complaint was that I gave them seven individually labeled 3″ film reels but, presumably to make it easier to process, they taped six of the reels into one larger 6″ reel, so I had to split the files back up. Avidemux made lossless splitting on the I-frame boundaries trivial.

Audio Cassettes

The audio was similarly easy. ION makes an inexpensive tape deck that advertises itself as a stereo USB microphone. You can capture the audio straight into Audacity and clip, process, and encode as needed.

VHS and 8mm

The bulk of the project was VHS and 8mm: we had two medium-sized moving boxes plus a shoebox of VHS tapes and a medium-sized box of 8mm. Probably close to 100 tapes in all.

Home videos are not worth much if nobody can watch them, so my primary goal was to make the video conveniently accessible to family. I also wanted to minimize unnecessary re-encodes and quality loss. The film and VHS had already degraded over time. Some quality loss, unfortunately, is inevitable without spending $$$ on dedicated equipment that captures frames from the tape.

My parents happened to own a very high-quality VCR that’s still in great shape. The capture sequence ended up something like this:

Video Cassette -> VCR -> Composite Cables -> Capture Card -> MPEG-2

Since each tape contained a hodgepodge of home videos (sometimes interleaved with TV recordings!), they had to be split up. The excellent, open source dvbcut software is perfect for this: it has a quadratic slider for frame-accurate scrubbing and it only recompresses frames when your splits don’t line up precisely with I-frames. I recommend doing your dvbcut work on an SSD. Scrubbing is painful on a spinny disk.

Converting the 8mm tapes was similar except replace VCR with the (again, still in great shape) Sony camcorder in playback mode. Also, since the 8mm tapes are mono but the capture card always records in stereo, you have an option. You can either run a post-split ffmpeg -map_channel step to convert the stereo MPEG-2 files into mono. (This has to happen after splitting because dvbcut can’t read videos after ffmpeg processes them for some reason.) Or you can tell HandBrake to mixdown the audio to mono from the right channel only. The latter avoids an audio re-encode, but it’s easier to forget when setting up long HandBrake jobs.

Finally, because the captured MPEG-2 files are large (4 GB per hour of video), I recompressed in HandBrake to H.264. I don’t notice a material quality difference (besides some “free” noise reduction), and the H.264 MP4 files are smaller and have more responsive seeking.

In the end, the steps that involve quality loss are:

  1. Real-time playback. Tracking glitches, for example, result in a few missed frames. But, like I mentioned, it would take $$$ to do a precise, frame-accurate digitization of each VHS frame.
  2. Composite cables instead of S-Video. I couldn’t find a VCR on Craigslist that supported S-Video output.
  3. Capturing in MPEG-2. I’m not convinced the real-time AVerMedia MPEG-2 encoder is very good – I’d occasionally notice strips of artifacty blocks in high-frequency regions like tree lines.
  4. A few frames of dvbcut’s re-encoding at the beginning and end of every split.
  5. YouTube / HandBrake. Might be slightly better to upload the split MPEG-2 into YouTube and let it recompress, but uploading 2 TB of video to YouTube didn’t seem very fun.

The bulk of the time in this project went towards capturing the video. It has to play in real time. Each 8mm cassette was 2 hours, and VHS tapes range between 2 and 8 hours.

The bulk of the effort, on the other hand, went into splitting, labeling, and organizing. I had to rely on clues to figure out when and where some videos were set. There were many duplicate recordings, too, so I had to determine which was higher quality.

Now that all that’s done, I plan to upload everything to YouTube and make a Google Doc to share with family members, in case anyone wants to write stories about the videos or tag people in them.

JSON Never Dies – An Efficient, Queryable Binary Encoding

A familiar story: Team starts project. Team uses JSON for interchange; it’s easy and ubiquitous and debuggable. Project grows. Entity types multiply. Data sizes increase. Finally, team notices hundreds of kilobytes — or even megabytes — of JSON is being generated, transferred, and parsed.

“Just use a binary format!” the Internet cries.

But by the time a project gets to this stage, it’s often a nontrivial amount of work to switch to Protocol Buffers or Cap’n Proto or Thrift or whatever. There might be thousands of lines of code for mapping model objects to and from JSON (that is, arrays, objects, numbers, strings, and booleans). And if you’re talking about some hand-rolled binary format, it’s even worse: you need implementations for all of your languages and you have to make sure they’re fuzzed and secure.

The fact is, the activation energy required to switch from JSON to something else is high enough that it rarely happens. The JSON data model tends to stick. However, if we could insert a single function into the network layer to represent JSON more efficiently, then that could be an easy, meaningful win.

“Another binary JSON? We already have CBOR, MsgPack, BSON, and UBSON!” the Internet cries.

True, but, at first glance, none of those formats actually seem all that great. JSON documents tend to repeat the same strings over and over again, but those formats don’t support reusing string values or object shapes.

This led me to perform some experiments. What might a tighter binary representation of JSON look like?

What’s in a format?

First, what are some interesting properties of a file format?

  • Flexibility and Evolvability: Well, we’re talking about representing JSON here, so they’re all going to be similar. However, some of the binary JSON replacements also have support for dates, binary blobs, and 64-bit integers.
  • Size: How efficiently is the information encoded in memory? Uncompressed size matters because it affects peak memory usage and it’s how much data the post-decompression parser has to touch.
  • Compressibility: Since you often get some sort of LZ compression “for free” in the network or storage interface, there’s value in the representation being amenable to those compression algorithms.
  • Code Size: How simple are the encoders and decoders? Beyond the code size benefits, simple encoders and decoders are easier to audit for correctness, resource consumption bounds, and security vulnerabilities.
  • Decoder Speed: How quickly can the entire file be scanned or processed? For comparison, JSON can be parsed at a rate of hundreds of MB/s.
  • Queryability: Often we only want a subset of the data given to us. Does the format allow O(1) or O(lg N) path queries? Can we read the format without first parsing it into memory?

Size and parse-less queryability were my primary goals with JND. My hypothesis was that, since many JSON documents have repeating common structures (including string keys), storing strings and object shapes in a table would result in significant size wins.

Quickly glancing at the mainstream binary JSON encodings…

MsgPack

Each value starts with a tag byte followed by its payload. For example, 0xdc indicates an array, followed by a 16-bit length, followed by N values. Integers are big-endian.

Must be parsed before querying.

BSON

Per spec, does not actually seem to be a superset of JSON? Disallows nuls in object keys, and does not support arrays as root elements, as far as I can tell.

Otherwise, similar encoding as MsgPack. Each value has a tag byte followed by a payload. At least it uses little-endian integers.

UBSON

Same idea. One-byte tags for each value, followed by a payload. Notably, lists and objects have terminators, and may not have an explicit length. Kind of an odd decision IMO.

Big-endian again. Weird.

CBOR

IETF standard. Has a very intentional specification with documented rationales. Supports arrays of known sizes and arrays that terminate with a special “break” element. Smart 3-bit major tag followed by 5-bit length with special values for “length follows in next 4 bytes”, etc.

Big-endian again… Endianness doesn’t matter all that much, but it’s kind of weird to see formats using the endianness that’s less common these days.

CBOR does not support string or object shape tables, but at first glance it does not seem like CBOR sucks. I can imagine legitimate technical reasons to use it, though it is a quite complicated specification.
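The 3-bit/5-bit scheme is easy to see in code. Here is a sketch of decoding a CBOR initial byte per the spec (struct and function names are mine; reserved additional-info values 28-30 and the indefinite-length marker 31 are omitted):

```cpp
#include <cstdint>
#include <cstddef>

// CBOR initial byte: high 3 bits = major type, low 5 bits = additional info.
// Additional info 0-23 encodes the argument inline; 24-27 say the argument
// follows in 1, 2, 4, or 8 big-endian bytes.
struct Head {
    uint8_t major;     // 0=uint, 2=byte string, 3=text, 4=array, 5=map, ...
    uint64_t argument; // length or value
    size_t consumed;   // bytes consumed from the input
};

Head decode_head(const uint8_t* p) {
    uint8_t major = p[0] >> 5;
    uint8_t info = p[0] & 0x1f;
    if (info < 24) return {major, info, 1};
    size_t extra = size_t{1} << (info - 24); // 24->1, 25->2, 26->4, 27->8
    uint64_t arg = 0;
    for (size_t i = 0; i < extra; ++i)
        arg = (arg << 8) | p[1 + i]; // big-endian
    return {major, arg, 1 + extra};
}
```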


Okay! All of those formats have roughly the same shape: one-byte prefixes on every value, with payloads inline (and thus variable-width values).

Now it’s time to look at the format I sketched up.

The file consists of a simple header marking the locations and sizes of three tables: values, strings, and object shapes.

The string table consists of raw UTF-8 string data.

In the value table, every value starts with a tag byte (sound familiar?). The high nibble encodes the type. The low nibble contains two 2-bit values, k and n.

Consider strings:

0011 kknn [1<<k bytes] [1<<n bytes] - string: offset, length into string table

0011 indicates this is a string value. k and n are “size tags” which indicate how many bytes encode the integers. The string offset is 1 << k bytes (little-endian) and the string length is 1 << n bytes. Once the size tags are decoded, the actual offset and length values are read following the tag byte, and the resulting indices are used to retrieve the UTF-8 text at the given offset and length from the string table.
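Putting those pieces together, a minimal decoder for this string encoding might look like the following (helper names are mine; the real code is in the repository):

```cpp
#include <cstdint>
#include <cstddef>
#include <string>

// Read a little-endian integer of the given width.
uint64_t read_le(const uint8_t* p, size_t nbytes) {
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; ++i)
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    return v;
}

// Decode a string value: tag byte 0011kknn, then a (1 << k)-byte offset and
// a (1 << n)-byte length into the string table.
std::string decode_string(const uint8_t* value, const char* string_table) {
    uint8_t tag = value[0];
    // (tag >> 4) == 0b0011 would assert this is a string value
    size_t k = (tag >> 2) & 0x3;
    size_t n = tag & 0x3;
    uint64_t offset = read_le(value + 1, size_t{1} << k);
    uint64_t length = read_le(value + 1 + (size_t{1} << k), size_t{1} << n);
    return std::string(string_table + offset, length);
}
```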

Now let’s look at objects:

1000 kknn [1<<k bytes] - object, object index, then <length> values of width n

The following 1 << k bytes encode the index into the object shape table, which holds the number and sorted list of object keys. Afterwards is a simple list of indices into the value table, each of size 1 << n bytes. The values are matched up with the keys in the object shape table.

Arrays are similar, except that instead of using an object index, they simply store their length.

This encoding has the property that lookups into an array are O(1) and lookups into an object are O(lg N), giving efficient support for path-based queries.

But there’s a pretty big downside relative to MsgPack, CBOR, and the like. The cost of efficient random access is that the elements of arrays and objects must have a known size. Thus, the (variable-width) values themselves cannot be stored directly into the array’s element list. Instead, arrays and objects have a list of fixed-width numeric offsets into the value table. This adds a level of indirection, and thus overhead, to JND. The payoff is that once a particular value is written (like a string or double-precision float), its index can be reused and referenced multiple times.

So how does this play out in practice? Net win or loss?

Size Benchmarks

I used sajson’s standard corpus of test JSON documents.

  • apache_builds Root object, largely consists of an array containing many three-element objects.
  • github_events Mix of objects, arrays, and long URL strings.
  • getinitialstate I can’t share the contents of this document as it’s from a project at work, but it’s 1.7 MB of various entity definitions, where each entity type is an object with maybe a dozen or two fields.
  • instruments A huge number of repeated structures and data — very compressible.
  • mesh 3D geometry. Basically a giant list of floating point and integer numbers.
  • svg_menu Only 600 bytes – used to test startup and base overhead costs.
  • twitter List of fairly large objects, many long strings.
  • update-center From Jenkins. Mostly consists of an object representing a mapping from plugin name to plugin description.


We can draw a few conclusions from these results.

  • As a wire replacement for the JSON data model, there’s no apparent reason to use BSON or UBSON. I’d probably default to MsgPack because it’s simple, but CBOR might be okay too.
  • The current crop of binary formats don’t compress any better than JSON itself, so if size over the wire is your primary concern, you might as well stick with JSON.
  • Except for small documents, where JND’s increased representation overhead dominates, JND is pretty much always smaller uncompressed. As predicted, reusing strings and object shapes is a big win in practice.
  • LZ-style compression algorithms don’t like JND much. Not a big surprise, since they don’t do a good job with sequences of numeric values. I expect delta coding value offsets in arrays and objects would help a lot, at the cost of needing to do a linear pass from delta to absolute at encode and decode time.
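The delta-coding idea from the last bullet can be sketched directly; the only cost is the linear pass at each end:

```cpp
#include <cstdint>
#include <vector>

// Store the difference between consecutive value offsets (small numbers
// compress better), then prefix-sum back to absolute offsets on decode.
std::vector<uint32_t> delta_encode(const std::vector<uint32_t>& offsets) {
    std::vector<uint32_t> out;
    uint32_t prev = 0;
    for (uint32_t o : offsets) { out.push_back(o - prev); prev = o; }
    return out;
}

std::vector<uint32_t> delta_decode(const std::vector<uint32_t>& deltas) {
    std::vector<uint32_t> out;
    uint32_t acc = 0;
    for (uint32_t d : deltas) { acc += d; out.push_back(acc); }
    return out;
}
```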

JND’s disadvantage is clear in the above graphs: while it’s smaller uncompressed, it does not compress as well as JSON or MsgPack. (Except in cases where its uncompressed size is dramatically smaller because of huge amounts of string or object shape reuse.)

Where would something like JND be useful? JND’s big advantage is that it can be read directly without allocating or parsing into another data structure. In addition, the uncompressed data is relatively compact. So I could imagine it being used if IO is relatively cheap but tight memory bounds are required.

Another advantage of JND is a bit less obvious. In my experience using sajson from other languages, the parse itself is cheaper than converting the parsed AST into values in the host language. For example, constructing Swift String objects from binary UTF-8 data was an order of magnitude slower than the parse itself. In JND, every unique string would naturally be translated into host language String objects once.

If you’d like to play with my experiments, they’re on GitHub at https://github.com/chadaustin/jnd.

Future Work?

It was more work than I could justify for this experiment, but I’d love to see how Thrift or Protocol Buffers compare to JND. JND is a self-describing format, so it’s going to lose on small messages, but when data sizes get big I wouldn’t be surprised if it at least ties protobufs.

Update: Forgot to mention decode times. I didn’t have time to set up a proper benchmark with high-performance implementations of each of these formats (rather than whatever pip gave me), but I think we can safely assume CBOR, MsgPack, BSON, and UBSON all have similar parse times, since the main structure of the decoder loop would be the same. The biggest question would be: how does that compare to JSON? And how does JND compare to both?

Writing a Really, Really Fast JSON Parser

Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life.

Well, at the end of 2016, Rich Geldreich shared that he’d also written a high-performance JSON parser.

I dropped his pjson into my benchmarking harness and discovered it was twice as fast as both RapidJSON and sajson! Coupled with some major JSON parsing performance problems in a project at work, I was nerd-sniped again.

I started reading his code but nothing looked particularly extraordinary. In fact, several things looked like they ought to be slower than sajson… Oh wait, what’s this?

do {
    c = pStr[0]; if (globals::s_parse_flags[c] & 1)
        { ++pStr; break; }
    c = pStr[1]; if (globals::s_parse_flags[c] & 1)
        { pStr += 2; break; }
    c = pStr[2]; if (globals::s_parse_flags[c] & 1)
        { pStr += 3; break; }
    c = pStr[3];
    pStr += 4;
} while (!(globals::s_parse_flags[c] & 1));

Why is unrolling that loop a win? How is the parse-flags lookup (even an L1 hit costs 3-4 cycles of latency) better than a few independent comparisons?

Guess it was time to do some instruction dependency analysis!

Fast String Parsing

The problem at hand: given a pointer to the first byte after the opening quote of a string, find the first special byte, where special bytes are “, \, <0x20, or >0x7f.

The loop:

for (;;) {
    if (*p < 0x20 || *p > 127 || *p == '"' || *p == '\\') {
        break;
    }
    ++p;
}

roughly breaks down into:

    load c, [p]
    compare c, 0x20
    jump-if-less exit_loop
    compare c, 0x7f
    jump-if-greater exit_loop
    compare c, '"'
    jump-if-equal exit_loop
    compare c, '\\'
    jump-if-equal exit_loop
    increment p
    jump begin_loop

How do we think about the performance of this loop? Remember that mainstream CPUs are out-of-order and can execute four (or more!) instructions in parallel. An approximate mental model for reasoning about CPU performance is that the frontend decodes multiple instructions per cycle and feeds them into an execution engine that can execute multiple instructions per cycle. Instructions will execute simultaneously if they’re independent of each other. If an instruction depends on the result of another, it must wait N cycles, where N is the first instruction’s latency.

So, assuming all branches are correctly predicted (branch predictors operate in the frontend and don’t count towards dependency chains), let’s do a rough estimate of the cost of the loop above:

load c, [p] # depends on value of p
# assuming L1, ~4 cycle latency
# all four comparisons can be done in parallel
# assumption: jumps are all predicted
increment p
# jump to beginning

increment p is the only instruction on the critical path – it carries a dependency across iterations of the loop. Is there enough work to satisfy the execution resources during the increment? Well, the comparisons are independent so they can all be done in parallel, and there are four of them, so we can probably keep the execution units busy. But it does mean that, at most, we can only check one byte per cycle. In reality, we need to issue the load and increment too, so we’re looking at a loop overhead of about 2-3 cycles per byte.

Now let’s look more closely at Rich’s code.

Replacing the comparisons with a lookup table increases comparison latency (3-4 cycles from L1) but also increases comparison throughput (multiple loads can be issued per cycle).

So let’s spell out the instructions for Rich’s code (reordered for clarity):

    load c0, [p]
    load c1, [p+1]
    load c2, [p+2]
    load c3, [p+3]
    load t0, [is_plain_string+c0]
    load t1, [is_plain_string+c1]
    load t2, [is_plain_string+c2]
    load t3, [is_plain_string+c3]
    compare t0, 0
    jump-if-equal exit_loop
    compare t1, 0
    jump-if-equal increment_1_and_exit_loop
    compare t2, 0
    jump-if-equal increment_2_and_exit_loop
    compare t3, 0
    jump-if-equal increment_3_and_exit_loop
    add p, 4
    jump begin_loop
    add p, 1
    jump exit_loop
    add p, 2
    jump exit_loop
    add p, 3
  • Load four bytes per iteration [not on critical execution path]
  • Load four LUT entries [not on critical execution path]
  • Issue four comparisons per cycle [not on critical execution path]
  • add p, 4

Again, the critical path is only add p, 4, but we still need to issue the other instructions. The difference is that now all of the loads happen in parallel and the comparisons for 4 bytes happen in parallel too, rather than doing four comparisons per byte.

It’s still hard to say if this is a win on paper — Haswell can only issue 2 loads per cycle, though the loads can overlap with the comparisons from previous bytes. However, we still have to issue all of these instructions. So maybe we’re looking at something closer to 2 cycles per byte?

Empirically, at least on my Broadwell, replacing four comparisons with a LUT was definitely a win. 55cc213

But is unrolling the loop necessary? If I take out the unrolling but leave the LUT, clang gets slower but gcc stays the same. I checked – neither gcc nor clang does any automatic unrolling here. What’s going on? Branch predictor effects? Also, the Intel Architecture Code Analyzer tool says that the unrolled and non-unrolled loops are both frontend-bound and have the same loop throughput.

I’m not yet convinced that a tight dependency chain is why unrolling is a win. More investigation is required here. But the important thing to keep in mind is to pay attention to the critical path latency through a loop body as well as the number of independent operations that can soak up spare execution bandwidth.

Lead Bullets

After playing around with LUTs and unrolling, I started implementing a bunch of small optimizations that I’d long known were available but didn’t consider to be that important.

Well, as it often turns out in performance projects, a sequence of 2% gains adds up to significant wins over time! If you’ve read Ben Horowitz’s lead bullets story or how SQLite achieved a 50% speed up, this will sound familiar.

Here are the optimizations that mattered:

  • Moved the input and output pointers into locals instead of members, which helps VC++ and Clang understand that they can be placed in registers. (gcc was already performing that optimization.) 71078d3 4a07c77
  • Replaced the entire parse loop with a goto state machine. Surprisingly, not only was this a performance win, but it actually made the code clearer. 3828769 05b3ec8 c02cb31
  • Changed an 8-entry enum to a uint8_t instead of the pointer-sized value it was defaulting to. 799c55f
  • Duplicated a bit of code to avoid needing to call a function pointer. 44ec0df
  • Tiny microoptimizations like avoiding branches and unnecessary data flow. 235d330 002ba31 193b183 c23fa23
  • Stored the tag bits at the bottom of the element index instead of the top, which avoids a shift on 64-bit. e7f2351

Static Branch Prediction

I also spent a bit of time on static branch prediction. It’s a questionable optimization; in theory, you should just use profile-guided optimization (PGO) and/or rely on the CPU’s branch predictors, but in practice few people actually bother to set up PGO. Plus, even though the CPU will quickly learn which branches are taken, the compiler doesn’t know. Thus, by using static branch prediction hints, the compiler can line up the instructions so the hot path flows in a straight line and all the cold paths are off somewhere else, sometimes saving register motion in the hot path.

For some examples, look at these uses of SAJSON_LIKELY and SAJSON_UNLIKELY.

I can’t recommend spending a lot of time on annotating your branches, but it does show up as a small but measurable win in benchmarks, especially on smaller and simpler CPUs.
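For reference, this class of hint is typically defined on top of gcc/clang's __builtin_expect, which tells the compiler which way a branch usually goes so it can keep the hot path as straight-line fall-through code. A sketch (sajson's actual macro definitions may differ):

```cpp
// Typical likely/unlikely hint macros (names here are illustrative).
#if defined(__GNUC__) || defined(__clang__)
#define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
#define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define MY_LIKELY(x)   (x)
#define MY_UNLIKELY(x) (x)
#endif

// Count leading plain characters; the terminator/escape case is marked cold.
int count_plain(const unsigned char* p, const unsigned char* end) {
    int n = 0;
    while (p != end) {
        if (MY_UNLIKELY(*p == '"' || *p == '\\')) // rare: end of string or escape
            break;
        ++n; // hot path: plain character
        ++p;
    }
    return n;
}
```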

Things I Didn’t Do

  • Unlike Rich’s parser and RapidJSON, I chose not to optimize whitespace skipping. Why? Not worth the code size increase – the first thing someone who cares about JSON parsing performance does is minify the JSON. b05082b
  • I haven’t yet optimized number parsing. Both RapidJSON and Rich’s parser are measurably faster there, and it would be straightforward to apply the techniques. But the documents I regularly deal with are strings and objects and rarely contain numbers.

Benchmark Results

I tested on four devices: my desktop (high-clocked Haswell), laptop (low-clocked Broadwell), iPhone SE, and Atom-based home server.


[Benchmark charts: Dell XPS 13, iPhone SE, Atom D2700]

The charts aren’t very attractive, but if you look closely, you’ll notice a few things:

  • Parsing JSON on modern CPUs can be done at a rate of hundreds of megabytes per second.
  • gcc does a much better job with RapidJSON than either clang or MSVC.
  • JSON parsing benefits from x64 – it’s not a memory-bound or cache-bound problem, and the extra registers help a lot.
  • The iPhone SE is not much slower than my laptop’s Broadwell. :)

The Remainder of the Delta

As you can see in the charts above, sajson is often faster than RapidJSON, but still not as fast as Rich’s pjson. Here are the reasons why:

  1. sajson does not require the input buffer to be null-terminated, meaning that every pointer increment requires a comparison with the end of the buffer (to detect eof) in addition to the byte comparison itself. I’ve thought about changing this policy (or even adding a compile-time option) but I really like the idea that I can take a buffer straight from a disk mmap or database and pass it straight to sajson without copying. On the other hand, I measured about a 10% performance boost from avoiding length checks.

  2. sajson sorts object keys so that object lookup takes logarithmic rather than linear time. The other high-performance parsers have linear-time object lookup by key. This is an optimization that, while not necessary for most use cases, avoids any accidental worst-case quadratic-time usage.

  3. sajson’s contiguous AST design requires, for every array, shifting N words in memory where N is the number of elements in the array. The alternative would be to use growable arrays in the AST (requiring that they get shifted as the array is realloc’d). Hard to say how much this matters.

Aside: The “You Can’t Beat the Compiler” Myth

There’s this persistent idea that compilers are smart and will magically turn your code into something that maps efficiently to the machine. That’s only approximately true. It’s really hard for compilers to prove the safety (or profitability) of certain transformations and, glancing through the produced code for sajson, I frequently noticed just plain dumb code generation. Like, instead of writing a constant into a known memory address, it would load a constant into a register, then jump to another location to OR it with another constant, and then jump somewhere else to write it to memory.

Also, just look at the charts – there are sometimes significant differences between the compilers on the same code. Compilers aren’t perfect and they appreciate all the help you can give!


Measuring the effect of microoptimizations on modern computers can be tricky. With dynamic clock frequencies and all of today’s background tasks, the easiest way to get stable performance numbers is to take the fastest time from all the runs. Run your function a couple thousand times and record the minimum. Even tiny optimizations will show up this way.

(If a library that uses statistical techniques to automatically achieve precise results is readily available, consider using that.)
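A minimal sketch of the minimum-of-many-runs technique (names are mine): noise from clock scaling and background tasks only ever makes a run slower, so the minimum over many runs is the most stable estimate of the true cost.

```cpp
#include <chrono>

// Run f many times and return the fastest observed wall time in nanoseconds.
template <typename F>
double best_time_ns(F&& f, int runs = 1000) {
    using clock = std::chrono::steady_clock;
    double best = 1e300;
    for (int i = 0; i < runs; ++i) {
        auto start = clock::now();
        f();
        double elapsed =
            std::chrono::duration<double, std::nano>(clock::now() - start).count();
        if (elapsed < best) best = elapsed;
    }
    return best;
}
```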

I also had my JSON benchmarks report MB/s, which normalizes the differences between test files of different sizes. It also helps understand parser startup cost (when testing small files) and the differences in parse speed between files with lots of numbers, strings, huge objects, etc.

Swift Bindings

Dropbox (primarily @aeidelson) contributed high-quality Swift bindings for sajson. The challenge in writing these bindings was finding a way to efficiently expose the sajson parse tree to Swift. It turns out that constructing Swift arrays and objects is REALLY expensive; we once benchmarked 10 ms in sajson’s parser and 400 ms of Swift data structure construction.

Fortunately, Swift has decent APIs for unsafely manipulating pointers and memory, and with those we implemented the ability to access sajson AST nodes through a close-to-natural ValueReader type.

Converting sajson strings (e.g. ranges of UTF-8 memory) into Swift Strings is still expensive, but Swift 4 might improve the situation.

An iOS team at Dropbox replaced JSONSerialization with sajson and cut their initial load times by two thirds!


I used to think JSON parsing was not something you ever wanted in your application’s critical path. It’s certainly not the kind of algorithm that modern computers love (read byte, branch, read byte, branch). That said, this problem has been beaten to death. We now have multiple parsers that can parse data at hundreds of megabytes per second — around the same rate as SHA-256! If we relaxed some of the constraints on sajson, it could even go faster.

So how fast was Rich’s parser, after all? When compiled with Clang and MSVC, quite a lot faster, actually. But when measured in GCC, RapidJSON, sajson, and pjson were (and remain) very similar. Many of the differences come down to naive, manually-inlined code, which we know the compiler can reliably convert to a fast sequence of instructions. It’s annoying to eschew abstractions and even duplicate logic across several locations, but, in the real world, people use real compilers. For critical infrastructure, it’s worth taking on some manual optimization so your users don’t have to care.


Arseny Kapoulkine pointed out an awesome trick to me: replace the last byte of the buffer with 0, which lets you avoid comparing against the end of the buffer. When you come across a 0 byte, if it’s the last one, treat it as if it’s the byte that got replaced. This would be a material performance win in sajson _and_ avoid the need for specifying null-terminated buffers.
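A sketch of the sentinel idea on a toy scanning problem (this is an illustration, not sajson's implementation): stash the buffer's last byte, overwrite it with 0, and scan without a bounds check; only when a 0 byte is hit do we check whether it is the real end of the buffer.

```cpp
#include <cassert>
#include <cstddef>

// Count nonzero bytes in buf using a 0 sentinel in place of an end-of-buffer
// comparison on every step. The saved last byte is treated as the final
// input byte and restored before returning.
size_t count_nonzero(char* buf, size_t len) {
    assert(len > 0);
    char saved = buf[len - 1];
    buf[len - 1] = 0;
    char* p = buf;
    size_t n = 0;
    for (;;) {
        while (*p != 0) { ++n; ++p; } // no end-of-buffer comparison here
        if (p == buf + len - 1) {     // hit the sentinel: handle the saved byte
            if (saved != 0) ++n;
            break;
        }
        ++p; // a genuine 0 byte in the middle of the input
    }
    buf[len - 1] = saved;
    return n;
}
```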

Assume Good Intentions

Email and text messaging are cold. It’s easy to assume the person on the other side of the screen is aggressive or upset or thinks poorly of you, and respond in kind.

Many years ago, I worked on the SCons build system. Much of the development occurred on mailing lists. Every so often someone would come in, guns blazing, asking why X was so bad or why Y didn’t work. The project lead, Steven Knight, always managed to either get useful information from the person or, if they were truly trolling, deflect and ignore the onslaught.

I once asked him how he did such a great job keeping the community calm. He said “I assume everyone has a good heart.” If a person comes in with a problem or does something you don’t like, gently figure out what the real issue is. It’s probably something you can solve.

Quasi-Succinct Data Structures

I thought this paper was really cool: Semi-Indexing Semi-Structured Data in Tiny Space. The idea is pretty simple but it uses two ideas I wasn’t familiar with: Elias-Fano coding and balanced parentheses coding.

Imagine you have a bunch of JSON documents that you want to keep around without reencoding in some way, and you want to occasionally query the document for a selection of fields. Parsing JSON is not terribly efficient, and you have to parse most of the file just to read, say, a single key.

The key observation is that you can store a bitstream of information alongside the original JSON document that describes enough of the parse tree to be able to find a key without parsing the entire document. JSON is LL(1), so the type of each node is determined by its first byte in the file. Thus, the locations of the JSON nodes form a monotonically-increasing set of integer indices into the source file. Elias-Fano codes are a very efficient (quasi-succinct) encoding for such a set. In addition, the nesting hierarchy is encoded with a balanced parentheses bit pattern code.
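To make the Elias-Fano part concrete, here is a toy Python sketch (my own illustration, not the paper’s implementation — real versions pack everything into bit arrays with constant-time select). Each of the n values below some bound u keeps its low l ≈ log₂(u/n) bits verbatim; the high bits go into a second bitvector in unary:

```python
import math

def elias_fano_encode(values, upper_bound):
    """Encode a monotone list of ints < upper_bound as
    (low-bit count, low-bit list, unary high-bit list)."""
    n = len(values)
    l = max(0, int(math.floor(math.log2(upper_bound / n))))
    lows = [v & ((1 << l) - 1) for v in values]
    # Element i contributes a 1 bit at position (v >> l) + i
    # in the upper bitvector; gaps between high parts become 0s.
    highs = []
    prev = 0
    for v in values:
        h = v >> l
        highs.extend([0] * (h - prev))
        highs.append(1)
        prev = h
    return l, lows, highs

def elias_fano_get(i, l, lows, highs):
    """Recover the i-th value: locate the (i+1)-th set bit in `highs`.

    A succinct structure answers this select() query in O(1);
    the linear scan keeps the sketch short.
    """
    ones = 0
    for pos, bit in enumerate(highs):
        if bit:
            if ones == i:
                return ((pos - i) << l) | lows[i]
            ones += 1
    raise IndexError(i)
```

The upper bitvector holds n ones and at most u/2^l zeros, so the whole structure costs roughly 2 + log₂(u/n) bits per element — hence “quasi-succinct.”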

The result is that it’s possible to preparse the document into a data structure about 25% of the size of the original JSON document that allows direct field lookups.
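The balanced-parentheses half is also easy to sketch. Again, this is my own toy Python illustration, not the paper’s code: nesting becomes a bitstring with 1 for “enter node” and 0 for “leave node,” and matching an open paren to its close is the primitive a succinct structure would answer in constant time (the sketch just scans):

```python
def to_bp(node):
    """Encode a nested-list 'parse tree' as balanced parentheses:
    1 on entering a node, 0 on leaving it."""
    bits = [1]
    if isinstance(node, list):
        for child in node:
            bits.extend(to_bp(child))
    bits.append(0)
    return bits

def find_close(bits, i):
    """Index of the 0 matching the 1 at position i (linear scan;
    succinct BP structures answer this in O(1))."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced")
```

With find_close you can skip an entire subtree — the analogue of jumping over a whole JSON object without parsing its contents.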

The Quasi-Succinct Indices paper gives some background information on the encoding and its efficient implementation. An implementation of a set of succinct data structures is available on GitHub.

The author also provides an implementation of this preparsing algorithm for JSON on GitHub.

All of that said, I do slightly question this approach’s utility: it’s probably a bigger win to come up with an isomorphic encoding of JSON that allows direct access and still allows reconstruction of the original document if necessary.

The s/ Directory

Companies are built of many projects and repositories. Each tends to develop its own culture and tools for interacting with it.

The client is compiled with MSVC, and tests are run with ./runner.py --test foo/bar/baz_test.py.

The website builds with SCons and tests are run with bin/run-tests.sh.

The utility library has ./build and ./test.

Over time, some projects put scripts in the root, some in bin, maybe some in a directory named scripts. Some projects might use all three.

This makes it hard for people coming to the project to know 1) how to do common project tasks and 2) what the set of available commands even is.

When code is collectively owned and engineers contribute across the entire stack, it’s important that anyone can easily check out a repository and start developing in it, no matter the language or tooling.

At IMVU we solved this problem by introducing a directory to every project named s/.

  • How do you build the project? s/build
  • How do you run the tests? s/test
  • How do you deploy? s/deploy
  • How do you lint? s/lint
  • How do you launch the program? s/run
  • What if there’s a server component? s/server
  • How do you see what commands are available? ls s

At first there was some resistance, primarily to the name. “s is too short. Nobody will know what it means.” But after living with it, it was perfect:

  • It’s short.
  • It’s on home row for both qwerty and dvorak.
  • Unlike bin and scripts, s doesn’t already have a semantic meaning. bin, for example, is often used for programs generated by the build script. Conflating that with project tooling is confusing.

s is a home only for project commands – the interface that developers use to interact with the project. These commands should use the same nomenclature as your company (e.g. if people say build, call it s/build; if people say compile, call it s/compile).

The scripts shouldn’t have extensions, because, importantly, the programming language is an implementation detail. That is, s/build.sh or s/build.py are wrong. s/build lets you be consistent across projects and have the option to migrate from Python to Bash or whatever.

s/ is a simple trick, but it goes a long way towards helping people migrate between projects!

Thinking About Performance – Notes

I gave an internal presentation at Dropbox (sorry, video is not sharable publicly) about engineering software for performance. Here are a bunch of resources that went into the presentation.

Similar Presentations

The Economics of Performance

Reaction Times

Response Time

Perception of Time

Examples of High-Performance Code

CPU architecture

In the talk I intentionally left out some detail – technically the branch predictor and branch target predictor are different things.

Agner Fog has amazing resources for CPU optimization, including his famous x86 instruction tables. It’s helpful to scan the latencies and reciprocal throughputs of common instructions.


Apple A9

Caches, Memory, and Atomics

Memory Bandwidth

Branch Prediction

Designing Crux – Record Traits

I’ve intentionally tried not to break much new ground with Crux’s type system. I mostly want the language to be a pleasant, well-executed, modern language and type system for the web.

That said, when porting some web APIs to Crux, I ran into a bothersome issue with an interesting solution. I’ll motivate the problem a bit and then share the solution below.

Records and Row Polymorphism

Crux has row-polymorphic record types. Think of records as anonymous little structs that happen to be represented by JavaScript objects. Row polymorphism means that functions can be written to accept many different record types, as long as they have some subset of properties.

fun length(vec) {
    math.sqrt(vec.x * vec.x + vec.y * vec.y)
}

length({x: 10, y: 20})
length({x: 3, y: 4, name: "p1"})
length({x: 15})               // type error, missing property y
length({x: "one", y: "two"})  // type error, x and y must be numbers

I lived in a TypeScript codebase for nine months or so and the convenience of being able to quickly define anonymous record types is wonderful. Sometimes you simply want to return a value that contains three named fields, and defining a new type and giving it a name is busywork that contributes to the common belief that statically typed languages feel “heavy”.

Crux supports both nominal types (String, CustomerID, Array, …) and anonymous records, sometimes called structural types.

The problem here is how structural types interact with traits.

Imagine coming to Crux for the first time and trying to encode some values as JSON.

json.encode(10)              // returns "10"
json.encode("hello")         // returns "\"hello\""
json.encode(True)            // "true"
json.encode([15, 30, 12])    // "[15,30,12]"
json.encode(js.Null)         // "null"

json.encode({code: 200, text: "OK"})
// Type error: records don't implement traits

Wait, what? Why does everything work but records?

Records and Traits

json.encode’s type is fun encode<V: ToJSON>(value: V) — that is, it accepts any value which implements the ToJSON trait. Why don’t records implement traits? Well, record types are anonymous, as described before. They don’t necessarily have unique definitions or names, so how would we define a ToJSON instance for this record?

Haskell, and other statically-typed languages, simply have the programmer manually construct a HashMap and JSON-encode that. In Crux, that approach might look something like:

let map = hashmap.new()
map["x"] = 10
map["y"] = 20

Not too bad, but not as nice as using record syntax here.

To solve this problem, which is “only” a human factors problem (but remember that Crux is intended to be delightful), I came up with something that I believe to be novel. I haven’t seen this technique before in the row polymorphism literature or any languages I’ve studied.

There are two parts to this problem. First we want to use record literals to construct key-value maps, and then we want to define trait instances that can use said maps.


As I mentioned, Crux records are implemented as JavaScript objects. This is pretty convenient when interfacing with JavaScript APIs.

data jsffi ReadyState {
    UNSENT = 0,
    OPENED = 1,
    HEADERS_RECEIVED = 2,
    LOADING = 3,
    DONE = 4,
}

type XMLHttpRequest = {
    mutable onreadystatechange: () => (),
    readyState: ReadyState,
    responseText: JSOption<String>,
    // ...
}
Sometimes, however, JavaScript objects are used as key-value maps instead of records. Crux provides a Dict type for that case. Dict is like Map, except keys are restricted to strings to allow implementation on a plain JavaScript object.

let d = dict.new()
d["happy"] = True
d["sad"] = False
json.encode(d)  // "{\"happy\":true,\"sad\":false}"

There is a ToJSON instance for any Dict<V> where V: ToJSON. But can we provide better syntax for creating one? It would be nicer to write:

let d = dict.from({
    happy: True,
    sad: False,
})
To implement dict.from, we need a new type of record constraint: one that says, I accept any set of fields, as long as every value has a consistent type.

fun from<V>(record: {...: V}): Dict<V> {
    // some unsafe JS code to copy the record into a mutable JS object
}

Read ... as “arbitrary set of fields” and : V as “has type V”. Thus, {...: V} is the new type of record constraint, and it’s read as “record with arbitrary set of fields where each field’s value has type V”.

So now we can write:

let d = dict.from({
    happy: True,
    sad: False,
})

Better, but we’re still not done. For one, dict.from requires the values to have the same type — not necessary for JSON encoding. Two, it’s annoying to have to call two functions to quickly create a JSON object.

Defining Record Instances

Here’s where record trait instances come in. First we need to convert all the record’s field values into a common type T. Then, once the record has been converted into a record of type {...: T}, it can be passed to dict.from and then json.encode. The resulting syntax is:

// Read as: implement ToJSON for records where…
impl json.ToJSON {...} {
    // …this code runs across each property of the record,
    // producing a value of type {...: json.Value}…
    for fieldValue {
        json.toJSON(fieldValue)
    }

    // …which is then encoded into a JSON object.
    toJSON(record) {
        json.toJSON(dict.from(record))
    }
}

And there you have it, a ToJSON instance for any record, at least as long as the record’s field types implement ToJSON themselves.

json.encode({
    code: 200,
    status: "OK",
})

I’ve never seen a feature like this in the literature or languages I’ve studied, but if you have, please let me know!