Writing a Really, Really Fast JSON Parser

Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life.

Well, at the end of 2016, Rich Geldreich shared that he’d also written a high-performance JSON parser.

I dropped his pjson into my benchmarking harness and discovered it was twice as fast as both RapidJSON and sajson! Coupled with some major JSON parsing performance problems in a project at work, I was nerd-sniped again.

I started reading his code but nothing looked particularly extraordinary. In fact, several things looked like they ought to be slower than sajson… Oh wait, what’s this?

do {
    c = pStr[0]; if (globals::s_parse_flags[c] & 1)
        { ++pStr; break; }
    c = pStr[1]; if (globals::s_parse_flags[c] & 1)
        { pStr += 2; break; }
    c = pStr[2]; if (globals::s_parse_flags[c] & 1)
        { pStr += 3; break; }
    c = pStr[3];
    pStr += 4;
} while (!(globals::s_parse_flags[c] & 1));

Why is unrolling that loop a win? How is the parse flags look-up (even L1 is 3-4 cycle latency) better than some independent comparisons?

Guess it was time to do some instruction dependency analysis!

Fast String Parsing

The problem at hand: given a pointer to the first byte after the opening quote of a string, find the first special byte, where special bytes are “, \, <0x20, or >0x7f.

The loop:

for (;;) {
    if (*p < 0x20 || *p > 127 || *p == '"' || *p == '\\') {
        break;
    }
    ++p;
}

roughly breaks down into:

begin_loop:
    load c, [p]
    compare c, 0x20
    jump-if-less exit_loop
    compare c, 0x7f
    jump-if-greater exit_loop
    compare c, '"'
    jump-if-equal exit_loop
    compare c, '\\'
    jump-if-equal exit_loop
    increment p
    jump begin_loop
exit_loop:

How do we think about the performance of this loop? Remember that mainstream CPUs are out-of-order and can execute four (or more!) instructions in parallel. So an approximate mental model for reasoning about CPU performance is that they understand multiple instructions per cycle, then stick them all into an execution engine that can execute multiple instructions per cycle. Instructions will execute simultaneously if they’re independent of each other. If an instruction depends on the result of another, it must wait N cycles, where N is the first instruction’s latency.

So, assuming all branches are correctly predicted (branch predictors operate in the frontend and don’t count towards dependency chains), let’s do a rough estimate of the cost of the loop above:

load c, [p] # depends on value of p
# assuming L1, ~4 cycle latency
# all four comparisons can be done in parallel
# assumption: jumps are all predicted
increment p
# jump to beginning

increment p is the only instruction on the critical path – it carries a dependency across iterations of the loop. Is there enough work to satisfy the execution resources during the increment? Well, the comparisons are independent so they can all be done in parallel, and there are four of them, so we can probably keep the execution units busy. But it does mean that, at most, we can only check one byte per cycle. In reality, we need to issue the load and increment too, so we’re looking at a loop overhead of about 2-3 cycles per byte.

Now let’s look more closely at Rich’s code.

Replacing the comparisons with a lookup table increases comparison latency (3-4 cycles latency from L1) and increased comparison throughput (multiple loads can be issued per cycle).

So let’s spell out the instructions for Rich’s code (reordered for clarity):

begin_loop:
    load c0, [p]
    load c1, [p+1]
    load c2, [p+2]
    load c3, [p+3]
    load t0, [is_plain_string+c0]
    load t1, [is_plain_string+c1]
    load t2, [is_plain_string+c2]
    load t3, [is_plain_string+c3]
    compare t0, 0
    jump-if-equal exit_loop
    compare t1, 0
    jump-if-equal increment_1_and_exit_loop
    compare t2, 0
    jump-if-equal increment_2_and_exit_loop
    compare t3, 0
    jump-if-equal increment_3_and_exit_loop
    add p, 4
    jump begin_loop
increment_1_and_exit_loop:
    add p, 1
    jump exit_loop
increment_2_and_exit_loop:
    add p, 2
    jump exit_loop
increment_3_and_exit_loop:
    add p, 3
exit_loop:
  • Load four bytes per iteration [not on critical execution path]
  • Load four LUT entries [not on critical execution path]
  • Issue four comparisons per cycle [not on critical execution path]
  • add p, 4

Again, the critical path is only add p, 4, but we still need to issue the other instructions. The difference is that now all of the loads happen in parallel and the comparisons for 4 bytes happen in parallel too, rather than doing four comparisons per byte.

It’s still hard to say if this is a win on paper — Haswell can only issue 2 loads per cycle. But the loads can overlap with the comparisons from previous bytes). However, we still have to issue all of these instructions. So maybe we’re looking at something closer to 2 cycles per byte?

Empirically, at least on my Broadwell, replacing four comparisons with a LUT was definitely a win. 55cc213

But is unrolling the loop necessary? If I take out the unrolling but leave the LUT, clang gets slower but gcc stays the same. I checked – neither gcc nor clang do any automatic unrolling here. What’s going on? Branch predictor effects? Also, Intel’s Intel Architecture Code Analyzer tool says that the unrolled and non-unrolled loops are both frontend-bound and have the same loop throughput.

I’m not yet convinced that a tight dependency chain is why unrolling is a win. More investigation is required here. But the important thing to keep in mind is to pay attention to the critical path latency through a loop body as well as the number of independent operations that can soak up spare execution bandwidth.

Lead Bullets

After playing around with LUTs and unrolling, I started implementing a bunch of small optimizations that I’d long known were available but didn’t consider to be that important.

Well, as it often turns out in performance projects, a sequence of 2% gains adds up to significant wins over time! If you’ve read Ben Horowitz’s lead bullets story or how SQLite achieved a 50% speed up, this will sound familiar.

Here are the optimizations that mattered:

  • Moved the input and output pointers into locals instead of members, which helps VC++ and Clang understand that they can be placed in registers. (gcc was already performing that optimization.) 71078d3 4a07c77
  • Replaced the entire parse loop with a goto state machine. Surprisingly, not only was this a performance win, but it actually made the code clearer. 3828769 05b3ec8 c02cb31
  • Change an 8-entry enum to a uint8_t instead of the pointer-sized value it was defaulting to. 799c55f
  • Duplicated a bit of code to avoid needing to call a function pointer. 44ec0df
  • Tiny microoptimizations like avoiding branches and unnecessary data flow. 235d330 002ba31 193b183 c23fa23
  • Store the tag bits at the bottom of the element index instead of the top, which avoids a shift on 64-bit. e7f2351

Static Branch Prediction

I also spent a bit of time on static branch prediction. It’s a questionable optimization; in theory, you should just use profile-guided optimization (PGO) and/or rely on the CPU’s branch predictors, but in practice few people actually bother to set up PGO. Plus, even though the CPU will quickly learn which branches are taken, the compiler doesn’t know. Thus, by using static branch prediction hints, the compiler can line up the instructions so the hot path flows in a straight line and all the cold paths are off somewhere else, sometimes saving register motion in the hot path.

For some examples, look at these uses of SAJSON_LIKELY and SAJSON_UNLIKELY.

I can’t recommend spending a lot of time on annotating your branches, but it does show up as a small but measurable win in benchmarks, especially on smaller and simpler CPUs.

Things I Didn’t Do

  • Unlike Rich’s parser and RapidJSON, I chose not to optimize whitespace skipping. Why? Not worth the code size increase – the first thing someone who cares about JSON parsing performance does is minify the JSON. b05082b
  • I haven’t yet optimized number parsing. Both RapidJSON and Rich’s parser are measurably faster there, and it would be straightforward to apply the techniques. But the documents I regularly deal with are strings and objects and rarely contain numbers.

Benchmark Results

I tested on four devices: my desktop (high-clocked Haswell), laptop (low-clocked Broadwell), iPhone SE, and Atom-based home server.

Desktop

Dell XPS 13

iPhone SE

Atom D2700

The charts aren’t very attractive, but if you look closely, you’ll notice a few things:

  • Parsing JSON on modern CPUs can be done at a rate of hundreds of megabytes per second.
  • gcc does a much better job with RapidJSON than either clang or MSVC.
  • JSON parsing benefits from x64 – it’s not a memory-bound or cache-bound problem, and the extra registers help a lot.
  • The iPhone SE is not much slower than my laptop’s Broadwell. :)

The Remainder of the Delta

As you can see in the charts above, sajson is often faster than RapidJSON, but still not as fast as Rich’s pjson. Here are the reasons why:

  1. sajson does not require the input buffer to be null-terminated, meaning that every pointer increment requires a comparison with the end of the buffer (to detect eof) in addition to the byte comparison itself. I’ve thought about changing this policy (or even adding a compile-time option) but I really like the idea that I can take a buffer straight from a disk mmap or database and pass it straight to sajson without copying. On the other hand, I measured about a 10% performance boost from avoiding length checks.

  2. sajson sorts object keys so that object lookup takes logarithmic rather than linear time. The other high-performance parsers have linear-time object lookup by key. This is an optimization that, while not necessary for most use cases, avoids any accidental worst-case quadratic-time usage.

  3. sajson’s contiguous AST design requires, for every array, shifting N words in memory where N is the number of elements in the array. The alternative would be to use growable arrays in the AST (requiring that they get shifted as the array is realloc’d). Hard to say how much this matters.

Aside: The “You Can’t Beat the Compiler” Myth

There’s this persistent idea that compilers are smart and will magically turn your code into something that maps efficiently to the machine. That’s only approximately true. It’s really hard for compilers to prove the safety (or profitability) of certain transformations and, glancing through the produced code for sajson, I frequently noticed just plain dumb code generation. Like, instead of writing a constant into a known memory address, it would load a constant into a register, then jump to another location to OR it with another constant, and then jump somewhere else to write it to memory.

Also, just look at the charts – there are sometimes significant differences between the compilers on the same code. Compilers aren’t perfect and they appreciate all the help you can give!

Benchmarking

Measuring the effect of microoptimizations on modern computers can be tricky. With dynamic clock frequencies and all of today’s background tasks, the easiest way to get stable performance numbers is to take the fastest time from all the runs. Run your function a couple thousand times and record the minimum. Even tiny optimizations will show up this way.

(If a library that uses statistical techniques to automatically achieve precise results is readily available, consider using that.)

I also had my JSON benchmarks report MB/s, which normalizes the differences between test files of different sizes. It also helps understand parser startup cost (when testing small files) and the differences in parse speed between files with lots of numbers, strings, huge objects, etc.

Swift Bindings

Dropbox (primarily @aeidelson) contributed high-quality Swift bindings for sajson. The challenge in writing these bindings was finding a way to efficiently expose the sajson parse tree to Swift. It turns out that constructing Swift arrays and objects is REALLY expensive; we once benchmarked 10 ms in sajson’s parser and 400 ms of Swift data structure construction.

Fortunately, Swift has decent APIs for unsafely manipulating pointers and memory, and with those we implemented the ability to access sajson AST nodes through a close-to-natural ValueReader type.

Converting sajson strings (e.g. ranges of UTF-8 memory) into Swift Strings is still expensive, but Swift 4 might improve the situation:

An iOS team at Dropbox replaced JSONSerialization with sajson and cut their initial load times by two thirds!

Summary

I used to think JSON parsing was not something you ever wanted in your application’s critical path. It’s certainly not the kind of algorithm that modern computers love (read byte, branch, read byte, branch). That said, this problem has been beaten to death. We now have multiple parsers that can parse data at hundreds of megabytes per second — around the same rate as SHA-256! If we relaxed some of the constraints on sajson, it could even go faster.

So how fast was Rich’s parser, after all? When measured in Clang and MSVC, quite a lot, actually. But when measured in GCC, RapidJSON, sajson, and pjson were (and remain) very similar. Many of the differences come down to naive, manually-inlined code, which we know the compiler can reliably convert to a fast sequence of instructions. It’s annoying to eschew abstractions and even duplicate logic across several locations, but, in the real world, people use real compilers. For critical infrastructure, it’s worth taking on some manual optimization so your users don’t have to care.

Update

Arseny Kapoulkine pointed out an awesome trick to me: replace the last byte of the buffer with 0, which lets you avoid comparing against the end of the buffer. When you come across a 0 byte, if it’s the last one, treat it as if it’s the byte that got replaced. This would be a material performance win in sajson _and_ avoid the need for specifying null-terminated buffers.

Assume Good Intentions

Email and text messaging are cold. It’s easy to assume the person on the other side of the screen is aggressive or upset or thinks poorly of you, and respond in kind.

Many years ago, I worked on the SCons build system. Much of the development occurred on mailing lists. Every so often someone would come in, guns blazing, asking why X was so bad or why Y didn’t work. The project lead, Steven Knight, always managed to either get useful information from the person or, if they were truly trolling, deflect and ignore the onslaught.

I once asked him how he did such a great job keeping the community calm. He said “I assume everyone has a good heart.” If a person comes in with a problem or does something you don’t like, gently figure what the real issue is. It’s probably something you can solve.

Quasi-Succinct Data Structures

I thought this paper was really cool: Semi-Indexing
Semi-Structured Data in Tiny Space
. The idea is pretty simple but it uses two ideas I wasn’t familiar with: Elias-Fano coding and balanced parentheses coding.

Imagine you have a bunch of JSON documents that you want to keep around without reencoding in some way, and you want to occasionally query the document for a selection of fields. Parsing JSON is not terribly efficient, and you have to parse most of the file just to read, say, a single key.

The key observation is that you can store a bitstream of information alongside the original JSON document that describes enough of the parse tree to be able to find a key without parsing the entire document. JSON is LL(1) so the type of each node is determined by its first byte in the file. Thus, the locations of each JSON node are a monotonically-increasing set of integer indices into the source file. Elias-Fano codes are a very efficient (quasi-succinct) encoding for such a set. In addition, the nesting hierarchy is encoding with a balanced parentheses bit pattern code.

The result is that it’s possible to preparse the document into a data structure about 25% of the size of the original JSON document that allows direct field lookups.

The Quasi-Succinct Indices paper gives some background information on the encoding and its efficient implementation. An implementation of a set of succinct data structures is available on GitHub.

The author also provides an implementation of this preparsing algorithm for JSON on GitHub.

All of that said, I do slightly question this approach’s utility: it’s probably a bigger win to come up with an isomorphic encoding of JSON that allows direct access and still allows reconstruction of the original document if necessary.

s/

Companies are built of many projects and repositories. Each tends to develop its own culture and tools for interacting with it.

The client is compiled with MSVC, and tests are run with ./runner.py --test foo/bar/baz_test.py.

The website builds with SCons and tests are run with bin/run-tests.sh.

The utility library has ./build and ./test.

Over time, some projects put scripts in the root, some in bin, maybe some in a directory named scripts. Some projects might use all three.

This makes it hard for people coming to the project to know 1) how to do common project tasks and 2) what the set of available commands even is.

When code is collectively owned and engineers contribute across the entire stack, it’s important that anyone can easily check out a repository and start developing in it, no matter the language or tooling.

At IMVU we solved this problem by introducing a directory to every project named s/.

  • How do you build the project? s/build
  • How do you run the tests? s/test
  • How do you deploy? s/deploy
  • How do you lint? s/lint
  • How do you launch the program? s/run
  • What if there’s a server component? s/server
  • How do you see what commands are available? ls s

At first there was some resistance, primarily to the name. “s is too short. Nobody will know what it means.” But after living with it, it was perfect:

  • It’s short.
  • It’s on home row for both qwerty and dvorak.
  • Unlike bin and scripts, s doesn’t already have a semantic meaning. bin, for example, is often used for programs generated by the build script. Conflating that with project tooling is confusing.

s is a home only for project commands – the interface that developers use to interact with the project. These commands should use the same nomenclature as your company (e.g. if people say build, call it s/build; if people say compile, call it s/compile).

The scripts shouldn’t have extensions, because, importantly, the programming language is an implementation detail. That is, s/build.sh or s/build.py are wrong. s/build lets you be consistent across projects and have the option to migrate from Python to Bash or whatever.

s/ is a simple trick, but it goes a long way towards helping people migrate between projects!

Thinking About Performance – Notes

I gave an internal presentation at Dropbox (sorry, video is not sharable publicly) about engineering software for performance. Here are a bunch of resources that went into the presentation.

Similar Presentations

The Economics of Performance

Humans

Vision

Touch

Reaction Times

Response Time

Perception of Time

Computers

Latency

Examples of High-Performance Code

CPU architecture

In the talk I intentionally left out some detail – technically the branch predictor and branch target predictor are different things.

Agner Fog has amazing resources for CPU optimization, including his famous x86 instruction tables. It’s helpful to scan the latencies and reciprocal throughputs of common instructions.

Haswell

Apple A9

Caches, Memory, and Atomics

Memory Bandwidth

Branch Prediction

Miscellaneous

Designing Crux – Record Traits

I’ve intentionally tried not to break much new ground with Crux’s type system. I mostly want the language to be a pleasant, well-executed, modern language and type system for the web.

That said, when porting some web APIs to Crux, I ran into an bothersome issue with an interesting solution. I’ll motivate the problem a bit and then share the solution below.

Records and Row Polymorphism

Crux has row-polymorphic record types. Think of records as anonymous little structs that happen to be represented by JavaScript objects. Row polymorphism means that functions can be written to accept many different record types, as long as they have some subset of properties.

fun length(vec) {
    math.sqrt(vec.x * vec.x + vec.y * vec.y)
}

length({x: 10, y: 20})
length({x: 3, y: 4, name: "p1"})
length({x: 15})               // type error, missing property y
length({x: "one", y: "two"})  // type error, x and y must be numbers

I lived in a TypeScript codebase for nine months or so and the convenience of being able to quickly define anonymous record types is wonderful. Sometimes you simply want to return a value that contains three named fields, and defining a new type and giving it a name is busywork that contributes to the common belief that statically typed languages feel “heavy”.

Crux supports both nominal types (String, CustomerID, Array, …) and anonymous records, sometimes called structural types.

The problem here is how structural types interact with traits.

Imagine coming to Crux for the first time and trying to encode some values as JSON.

json.encode(10)              // returns "10"
json.encode("hello")         // returns "\"hello\""
json.encode(True)            // "true"
json.encode([15, 30, 12])    // "[15,30,12]"
json.encode(js.Null)         // "null"

json.encode({code: 200, text: "OK"})
// Type error: records don't implement traits

Wait, what? Why does everything work but records?

Records and Traits

json.encode‘s type is fun encode<V: ToJSON>(value: V) — that is, it accepts any value which implements the ToJSON trait. Why don’t records implement traits? Well, record types are anonymous, as described before. They don’t necessarily have unique definitions or names, so how would we define a ToJSON instance for this record?

Haskell, and other statically-typed languages, simply have the programmer manually construct a HashMap and JSON-encode that. In Crux, that approach might look something like:

let map = hashmap.new()
map["x"] = 10
map["y"] = 20
json.encode(map)

Not too bad, but not as nice as using record syntax here.

To solve this problem, which is “only” a human factors problem (but remember that Crux is intended to be delightful), I came up with something that I believe to be novel. I haven’t seen this technique before in the row polymorphism literature or any languages I’ve studied.

There are two parts to this problem. First we want to use record literals to construct key-value maps, and then we want to define trait instances that can use said maps.

Dictionaries

As I mentioned, Crux records are implemented as JavaScript objects. This is pretty convenient when interfacing with JavaScript APIs.

data jsffi ReadyState {
    UNSENT = 0,
    OPENED = 1,
    HEADERS_RECEIVED = 2,
    LOADING = 3,
    DONE = 4,
}

type XMLHttpRequest = {
    mutable onreadystatechange: () => (),
    readyState: ReadyState,
    responseText: JSOption<String>,
    // ...
}

Sometimes, however, JavaScript objects are used as key-value maps instead of records. Crux provides a Dict type for that case. Dict is like Map, except keys are restricted to strings to allow implementation on a plain JavaScript object.

let d = dict.new()
d["happy"] = True
d["sad"] = False
json.encode(d)  // "{\"happy\":true,\"sad\":false}"

There is a ToJSON instance for any Dict<V> where V: ToJSON. But can we provide better syntax for creating one? It would be nicer to write:

let d = dict.from({
    happy: True,
    sad: False,
})

To implement dict.from, we need a new type of record constraint: one that says, I accept any set of fields, as long as every value has a consistent type.

fun from<V>(record: {...: V}): Dict<V> {
    // some unsafe JS code to copy the record into a mutable JS object
}

Read ... as “arbitrary set of fields” and : V as “has type V”. Thus, {...: V} is the new type of record constraint, and it’s read as “record with arbitrary set of fields where each field’s value has type V”.

So now we can write:

json.encode(dict.from({
    happy: True,
    sad: False,
}))

Better, but we’re still not done. For one, dict.from requires the values to have the same type — not necessary for JSON encoding. Two, it’s annoying to have to call two functions to quickly create a JSON object.

Defining Record Instances

Here’s where record trait instances come in. First we need to convert all the record’s field values into a common type T. Then, once the record has been converted into a record of type {...: T}, it can be passed to dict.from and then json.encode. The resulting syntax is:

“`crux
// Read as: implement ToJSON for records where…
impl json.ToJSON {…} {
// …this code runs across each property of the record,
// producing a value of type {…: json.Value}…
for fieldValue {
json.toJSON(fieldValue)
}

// which is then encoded into a JSON object.
toJSON(record) {
    json.toJSON(dict.from(record))
}

}
“`

And there you have it, a ToJSON instance for any record, at least as long as the record’s field types implement ToJSON themselves.

json.encode({
    code: 200,
    status: "OK",
})

I’ve never seen a feature like this in the literature or languages I’ve studied, but if you have, please let me know!

Why isn’t Crux a pure functional language?

Nicolas Hery asked on Twitter: “[Why isn’t Crux] pure like Haskell/Elm/etc.?” Good question, and one worth writing about.

First of all, the term “pure functional language” isn’t terribly enlightening. It’s certainly not useful for communication, because everyone has or invents their own definition on the fly.

There are a few independent ideas here: mutable vs. immutable data structures, mutable vs. immutable names/variables, and whether effects are tracked.

Regarding mutable vs. immutable data structures: in reality, mutable data structures are simply unavoidable. At some point you need to update your in-memory representation. There are ways to encode that with immutable data structures, but they get awkward fast. In addition, constantly producing new immutable values places a great deal of pressure on the garbage collector – often it’s fastest just to update your data in-place. Immutable data is a fantastic tool, used appropriately, but it sucks when it’s a straitjacket.

Regarding mutable vs. immutable local variables: imagine you want to sum up the elements in an array. If Crux were pure-functional, that
could be written something like:

fun sum_(list, index) {
  if index >= len(list) {
    0
  } else {
    list[index] + sum_(list, index + 1)
  }
}
fun sum(list) {
  sum_(list, 0)
}

Oh wait, that’s not tail-recursive, so it’ll blow up the stack with a long enough list. Let’s rewrite it:

fun sum_(list, index, total) {
  if index >= len(list) {
    total
  } else {
    sum_(list, index + 1, total + list[index])
  }
}
fun sum(list) {
  sum_(list, 0, 0)
}

OK, now we’re tail recursive. For efficiency, however, the compiler needs to turn that into a tight loop of instructions, storing the accumulators in registers. Wait, so you’re saying the language forced me to translate a simple imperative function (sum the elements in a list) into a tail-recursive loop, which the compiler then has to convert back into imperative code for efficiency? Why can’t the compiler just turn my natural imperative loop into a tail-recursive implementation, if that’s what it really wants? (It definitely could.) And it gets even worse the larger your functions become, especially if they have early exits and multiple “mutable variables”. In short, tail calls are great, when appropriate, but always programming with tail recursion is like someone rejecting all your commits with “Nope, you must contort your code for my pleasure.” And I shudder to think of what a tail-recursive implementation of a huge interpreter loop with lots of mutable variables would look like.

Tail calls are great, but being forced to write tail-recursive loops is dumb and one of my biggest annoyances with Haskell. Instead, why not just write:

fun sum(list) {
  let mutable total = 0
  for i in list {
      total += i
  }
  return total
}

Crux makes that natural.

However! Everything I said above is independent of what the defaults are. Crux variables and values are immutable by default, since that’s the common case. Sum types are always immutable. The only things that can be mutable are record fields and locals, but, again, you must opt into that functionality. It’s not annoying in practice; since everything’s an expression in Crux, there’s less need to mutate locals.

Okay, so instead let’s presume that most people consider pure functional to mean “has restricted side effects”, like Haskell.

Personally, I’m a huge fan of monads in Haskell, and the amazing
things they enable.

However, they come at a large cost: since nearly everything is a Monad, the error messages get really complicated, especially when the compiler thinks you meant the Maybe monad or a function monad. (Isn’t it crazy how, in Haskell, Integers aren’t a Monoid by default, but all functions are Monads?)

Eventually I’d love Crux to have an effect tracking system. Different syntax between pure code and effectful code is unacceptable, so the Haskell approach is a no-go. I am excited, however, about row-typed effect systems like Koka. (Notably, it should be okay for pure functions to use mutable locals, like the sum function above.) Future work!

To summarize, I think we can achieve most of the benefits of “pure functional programming” while retaining accessible syntax and allowing mutability as desired and convenient.

The Story of Haskell at IMVU

An interesting blog post and its corresponding comment thread on Hacker News made me realize that, while I’d told the story of Haskell at IMVU to many people, I’d never written it down. :)

IMVU’s backend had been written in PHP from the beginning. Startup code is always gross, but after years of learning PHP’s ins and outs, we bent it to our will. Part of me looks back and says “It wasn’t that bad”, and indeed, it has a lot of great properties. But then I remember the horror.

The terrible straight-line performance, the lack of concurrent IO, the insane memory usage… Once the codebase reached about a million lines, the latency on our REST service response times was over half a second and in the worst case several seconds. For some types of services, that’s okay, but IMVU strives to be a real-time, delightful experience, and service latency on interactive things like dressing up, browsing the catalog, sending messages to your friends, and shopping is really important.

Some of the performance-minded of us tried to get PHP to be fast, but it was basically intractable, and we couldn’t afford to fork PHP with a high-performance JIT like Facebook ended up doing years later.

The combination of PHP’s awful performance, terrifying pitfalls, and the fact that millions of lines of code in a dynamic language (even with copious test coverage) makes it really hard to refactor, led us to evaluate alternatives. Someone suggested Java (which would have been totally fine in hindsight) but the infrastructure team decided to adopt this newfangled Node.js thing. Node sucked. This predated ES6, promises, and the rich ecosystem of npm, so we spent far too long rebuilding basic, basic, infrastructure (like propagating stack traces across async operations) in Node that we already had in PHP, and it still wasn’t as good. In the end we decided to drop Node and stick with PHP.

Fast forward a year or two…

There were some people at IMVU who had dabbled with Haskell, but it always seemed too idealistic and not very practical. In fact, one weekend, Andy went home with the explicit goal of proving that Haskell was stupid and not good for writing production software.

He came back on Monday glowing. “Haskell is perfect for web services! We should use it! It’s fast and safe and concurrent!” We smiled and nodded (sometimes Andy gets enthusiastic about new things) and went on with our daily work. But in the evenings and weekends he plugged away, and built a prototype web service that was 1) dramatically faster than the corresponding PHP services 2) actually quite short in implementation and 3) surprisingly clear. Huh, maybe this Haskell thing did have legs.

OK, so let’s consider the possibility that Haskell could be something IMVU could adopt. How we would vet it?

We started by, during the next hack week, forming a small team of everyday engineers to learn Haskell and work on a feature. We learned two things: teaching Haskell was a nontrivial problem. BUT, Haskell newcomers hacking on the codebase only caused one regression, and that regression actually exposed a bug in some of our infrastructure. Well that’s pretty cool. For years, IMVU had a tradition of throwing new engineers into the deep end, after which they’d invariably take down the site, and then we’d demonstrate that we’re egoless and all failures are an opportunity to improve our process with a postmortem. What if Haskell could help us avoid some of these problems?

Over time we slowly ramped up Haskell within the company. At first we had growing pains. The infrastructure in our PHP stack was extremely mature and well-baked. The corresponding infrastructure in our Haskell (error reporting, DB access, unit testing) was not. It took us a while to sort out exactly what our Haskell best practices would be, and by that point a few people started to get a negative vibe about Haskell. But we kept pushing forward.

(By the way, I keep saying “we”, but at this point I hadn’t written any Haskell yet, and was not particularly for or against it, even though I did think PHP was terrible.)

By this point we’d shipped at least two sets of mobile services on Haskell, and many people considered it a success. However… Tensions about Haskell had really started to heat up, with some people saying things like “I never want to even look at Haskell again” or “at this stage in my career I don’t want another programming language” or “I think we’re making a horrible mistake by building another stack, especially in Haskell.” People were complaining to their managers, and the managers were coming to me. Some people loved Haskell, some people hated it, but very few were in the middle.

As a neutral third party, and as someone with a lot of respect in the organization, the engineering managers sent me out to gather ground truth. I took my little black notebook and a pen and interviewed something like half the engineering team on their opinions on Haskell, whether they were fans or not. I spent an hour with each person, asking what they liked, what they didn’t like, what they thought we should do, etc.

I then collated all the feedback and presented back to the engineering managers and leadership team (of which I was a part too, but since I had no direct reports and had no experience with Haskell at IMVU, I was fairly unbiased). I don’t have my report handy, but it went approximately like this:

75% of the people I interviewed were positive about or fine with Haskell.

25% of the people had major concerns. The concerns, in roughly decreasing priority, were:

  • Haskell strings suck
    • Specifically there’s a lot of cognitive overhead in switching between the five common string representations as needed
  • Syntactic concerns
    • Haskell’s syntax is iconoclastic: parens, $, /= instead of !=
    • some people in the Haskell community love point-free style
    • cognitive overhead in switching between pure and monadic style
    • let vs. where
  • concerns about a second backend stack
    • duplicate infrastructure across PHP and Haskell
  • concerns about it not being a mainstream language
    • teaching the whole organization
    • interaction between Haskell experts and novices sometimes leaves the novices feeling dumb or inadequate
    • bus factor to organization on key infrastructure
    • Haskell is less debuggable by ops engineers

The wins were:

  • GHC’s support for safe concurrent programming is world-class
  • Sound type safety led to refactoring being a breeze
  • Fun, stretches the mind (for some people)

Ultimately, we settled on a middle ground. We would not deprecate PHP, but we would continue to support Haskell as a technology stack. We had specific services in mind that needed high throughput and low latency (<100 ms) and we knew that PHP could not achieve those. That said, we respected the fact that several engineers, even those in leadership roles, put their foot down and said they didn’t want to touch Haskell. (From my perspective it was kind of a weird experience. Highly-skilled people that I respect just got really angry and irrational and basically refused to even try to learn Haskell, which resulted in a culture divide in the company.)

In the end, I never did manage to convince some of my peers to try it. That said, there were many engineers who loved Haskell and we built a great deal of critical infrastructure in it. The Haskell infrastructure basically hums like a well-oiled machine. Perfectly reliable unit tests. Concurrent batched IO. Abstractions for things like timeouts and running tasks on other threads while correctly bubbling failures. Fantastic support for safely encoding and decoding JSON. One of the best benchmarking tools I’ve ever used.

Since one of the biggest concerns about Haskell was education, I taught a sequence of four or five classes for the engineering team with my take on how it should be taught. Traditional Haskell materials tend to talk about laziness and currying and all this FP junk that is certainly interesting and different, but mostly irrelevant to people trying to come in hot and get the job done. Instead, I focused on specifics first, and then generalizations.

“You know how you can add two integers and get a new integer back? Well, you can also add two floating point numbers. Int and Float are instances of Num.”

“You know how you can concatenate strings? And concatenating with the empty string gives you the same value? It turns out many things satisfy these properties. They’re called Monoids.”

“You know how you can map across a list? We do it all the time in PHP or Python. Well you can also map across Maybe. Think of Maybe as a list that can have either zero or one element. Things that can be mapped are called Functors.”

etc. etc.

We started with specific examples of how to do stuff with Haskell, then talked about type classes, and the final optional class was about how Haskell is compiled down to the machine. You’d have to ask the attendees how much they remember, but the classes were well-reviewed at the time. :)

In summary, I definitely wouldn’t say that Haskell was a failure. GHC is an amazing piece of technology, and IMVU built infrastructure in it that wouldn’t have been possible in PHP. I also can’t say Haskell was a resounding success. I was truly disturbed by the culture drama that it brought. And, frankly, saddened and confused by how some people closed their minds. Personally, I would have happily traded a lot of Haskell’s safety for avoiding drama. I sometimes wonder what would have happened had we chosen Java instead of Node.js back in 2010. We could have hit our performance goals in Java too.

Oh well. Many of us still love Haskell. It’s not going away. I left IMVU last year (needed some more experience elsewhere) but I personally miss Haskell a great deal. If I were starting a new project with people that I trusted, I’d definitely reach for Haskell first. The expressiveness, brevity, and safety let you move extremely quickly, and Haskell is definitely mature enough for production use.

Designing Crux – Import Statements

Manipulating a module’s import list is a high-frequency activity in modern programming languages. Sadly, few popular languages perfect the usability of the import list.

To do a good job in Crux, let’s start from the use cases, roughly sorted by importance:

  • import a module and refer to it by a qualified import name
  • import a list of names unqualified into this module’s scope
  • import everything unqualified into this module’s scope
  • import a module solely for its side effects, without importing any symbols
  • visually scan the import list
  • sort the import list (with Emacs’s sort-lines or the like)
  • and, of course, tools must easily parse the import list

Designing for usability means optimizing for the most common use cases, while making sure the less common use cases are still possible.

The most common use case is to import a module and refer to it qualified. Consider this hypothetical syntax:

import fs.path
path.combine("/usr", "local")

The qualified import reference (here, path) defaults to the last segment of the module name. Now, for this to work well, the basename has to be short, distinct, and meaningful.

Haskell gets this wrong.

import qualified Data.ByteString as BS

From left to right: import qualified is long to start with. Then you have most packages in some kind of arbitary namespace like Data. Then, since ByteString is too long to be convenient (especially with the mixed case), people typically name the import BS. Typically being the key word. Sometimes people name it Bytes. Sometimes ByteString. Sometimes B. Sometimes Data.ByteString.Char8 is imported as BS. The fact that the programmer has to make a choice here naturally leads to inconsistency.

Go does a great job here. In Go, you import like this:

import "fs/path"

Afterwards, path functions can be accessed as path.Function. Short and meaningful.

Go goes even further and provides guidelines for naming functions in the context of a module. Conventions tend to be copied, so by setting a good standard early on, Go positively affected every project built then on.

ES6 modules are also pretty interesting. They are statically analyzable and syntactically lightweight, but they have a fairly significant problem at scale: sorting imports.

import Dispatcher from 'flux/dispatcher';
import {foo, bazzle} from 'app/utils';

The problem with having the list of imported symbols first is that they can’t easily be machine-sorted. Every ES6 project I’ve worked on has encouraged its engineers to sort the imports by hand. Also, the import list coming first gets annoying when import lists are long.

import {
  foo, bar, baz,
  qux, harf, barf,
  hurp, floop
} from 'my_module';

With Crux, we learned from all of the above import systems and came up with something that meets every use case. Here is the proposed syntax:

import {
  bytes,
  flux.dispatcher,
  my_encoding_system as mes,
  imported_only_for_side_effects as _,
  utils(foo, bar, baz),
  more_utils(
    foo2,
    bar2,
    baz2,
  ),
  all_the_utils(...), // import everything into scope
}

The easiest import syntax is most common one, but other types of imports, like renaming the import or importing some symbols unqualified, are straightforward too.

This directly satisfies all of our use cases!

We may change our minds and make the commas after each import optional in the future.

Side Note: Implicit vs. Explicit Imports

OCaml and Java allow qualifying names at their use sites, making import statements unnecessary. Import statements are syntax sugar at that point. I don’t have a strong justification here, but Crux went with explicit import statements (like Go, Python, Haskell, ES6) because it’s easier for humans and machines to see a module’s dependencies at a glance.