Dropbox Hack Week: GraphQL Server in Haskell

Last week was hack week at Dropbox. I took the opportunity to explore the implementation of a GraphQL server that does optimal IO batching and concurrency.

Serial IO

A common performance problem in the implementation of web services is serial IO. Let’s say you have a service that returns the names of all of your friends. It’s easy and natural to implement it like this:

me = fetch_user_info(my_user_id)
friend_names = []
for friend_id in me.friends:
  friend_names.append(fetch_user_info(friend_id).name)
return friend_names

The problem is that each user info lookup for my friends occurs sequentially. Even with TCP connection pooling, that’s still a packet round-trip in the datacenter. “But that should be faster than 10 ms right?” Even if it is, it doesn’t take many friends to blow your performance budget and send your response time across the threshold of perceptible delay.

Moreover, this problem compounds on itself. If your inner functions have serial IO and you call them in a loop, you’ve just added potentially thousands of round-trips to your backend data sources. I would hazard a guess and say serial IO is the largest contributor to service response latencies.

Manually Batched IO

One solution is to always batch IO. Every function takes a list and returns a list. This can be made to work (indeed, I’ve achieved excellent performance by careful, manual batching) but doesn’t compose as your services scale. Sometimes it’s just too hard to know all of the data you will need to fetch, and the dependencies between that data.
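
Here is a minimal sketch of that list-in, list-out style, reusing the example above. fetch_user_infos is a hypothetical batched counterpart to fetch_user_info: it accepts any number of user ids and performs a single round-trip.

def friend_names(my_user_id):
    [me] = fetch_user_infos([my_user_id])
    friends = fetch_user_infos(me.friends)  # one batched round-trip, not one per friend
    return [friend.name for friend in friends]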

OK, so serial IO is bad. More on that in a bit. Let’s talk about REST now.

REST

At IMVU, we went all-in on REST, building a rather substantial framework to make it easy to create REST services. We anticipated that REST services tend to require many round-trips from clients, so we built a response denormalization system, where the service can anticipate the other services a client will need to hit and include their responses too.

This sounded great at first, and in fact was mostly an improvement over the prior state of affairs. But, at scale, REST has some challenging performance problems. For one, REST services have a defined schema, and they always return the data they advertise. As I mentioned, if a service doesn’t return all the data a client needs, the client needs to hit more services, increasing the number of client-to-server round-trips. In addition, because the service always returns the same set of data, it must query the backend database to fetch more data than the client even cares about.

This phenomenon tends to happen most frequently with core domain objects like users. Because users are so important, they accumulate relationships with so many pieces of data (e.g. list of subscribed experiments, contact lists, shopping carts, etc.), almost all of which is irrelevant to most clients.

Why GraphQL?

This is where GraphQL comes in. In GraphQL, the client specifies the data that it wants. There is only one request, even if the data is deeply nested, and the server only has to fetch exactly what’s needed.

Consider the query:

query HeroNameQuery {
    newhope_hero: hero(episode: NEWHOPE) {
        name
    }
    empire_hero: hero(episode: EMPIRE) {
        name
    }
    jedi_hero: hero(episode: JEDI) {
        name
    }
}

It looks up the hero of each of the first three Star Wars movies, fetches any information it needs from the backend, and returns only what is requested:

"data": {
    "HeroNameQuery": {
        "jedi_hero": {
            "name": "R2-D2"
        },
        "empire_hero": {
            "name": "Luke Skywalker"
        },
        "newhope_hero": {
            "name": "R2-D2"
        }
    }
}

There are GraphQL implementations for many languages but many of them don’t solve the serial IO problem I described to start this post. In fact, a naive GraphQL implementation might issue IO per field of each object requested.

For hack week, I wanted to explore the design of a GraphQL server that issued all of its backend IO in optimally concurrent batches.

Why Haskell?

Dropbox doesn’t use Haskell, but I find it to be a great language for exploring design spaces, particularly around execution models. Also, Facebook open sourced their excellent Haxl library which converts code written with serial IO into efficient batched requests. Haxl provides an effect type that, when it can understand that two data fetches are independent, runs them both in parallel. Only when all Haxl operations are blocked on backend data fetches does it issue the batched requests to the backends. My prototype GraphQL resolvers are surprisingly naive, specified with sequential code. Haxl automatically batches up the requests and hands them to the DataSource for execution.

In addition, there is nothing clever about the GraphQL request handler or graph traversal and filtering — all cleverness is handled by Haxl.

At this point, you might be thinking “So was anything challenging about this project?” On the data fetch side, no, not really. :) However, I did run into one unexpected snag when using Haxl for data updates: because Haxl tries really hard, in a general and composable way, to run your IO in parallel, you must be careful about how writes and reads are sequenced together. If you leave out the () <- on line 105, Haskell sequences the operations with >> instead of >>=, and Haxl’s >> uses Applicative sequencing rather than the monadic bind, so it assumes it can run the two operations in parallel. And, as you might expect, issuing a write and a read to the same data concurrently doesn’t end well. :)

Conclusions

I am very thankful for jdnavarro’s excellent GraphQL query parser. With it, in four days, I was able to get a prototype GraphQL server up and running. Using Haxl and Hedis, I have it hooked up to a Redis data source, and it correctly batches all independent IO reads and runs them concurrently.

The Star Wars hero names query above results in two batched backend requests:

fetch star wars batch of size 3:
["FetchEpisode(Jedi)","FetchEpisode(Empire)","FetchEpisode(NewHope)"]
fetch star wars batch of size 2:
["FetchCharacter(1000)","FetchCharacter(2001)"]

You can even see that it noticed that R2-D2 is the hero of both movies, and only requested its info once.

The performance is pretty good: on some AWS instances, I measured about 3500 queries per second per machine and a query latency averaging 3 ms. Of course, that could worsen as permissions checks and so on are implemented. On the other hand, the code is completely unoptimized, full of lazy data structures and an unoptimized parser without a parser cache.

The prototype code is open sourced on the Dropbox GitHub.

It’s probably possible to build something like Haxl in Python with generators, but you’d have to give up standard loops and list comprehensions, instead using some kind of parallel map operation. You also wouldn’t benefit from anything that teases concurrency out of sequential-looking code, like GHC’s upcoming ApplicativeDo extension. There are some things that Haskell’s restricted effects are uniquely good at. :)
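
To make that concrete, here is a rough, hypothetical sketch of a Haxl-like batching layer in Python built on generators. It is purely illustrative, not how Haxl works internally: each task yields the key it needs, and a scheduler gathers every blocked task’s key into one batched fetch per round.

def fetch_batch(keys):
    # Stand-in for one batched backend round-trip (e.g. a Redis MGET).
    return {key: "info for %s" % key for key in keys}

def run_all(tasks):
    results = [None] * len(tasks)
    waiting = {}
    # Advance every generator to the first key it requests.
    for index, task in enumerate(tasks):
        try:
            waiting[index] = next(task)
        except StopIteration as stop:
            results[index] = stop.value
    # Each iteration issues exactly one batch for everything currently blocked.
    while waiting:
        batch = fetch_batch(set(waiting.values()))
        still_waiting = {}
        for index, key in waiting.items():
            try:
                still_waiting[index] = tasks[index].send(batch[key])
            except StopIteration as stop:
                results[index] = stop.value
        waiting = still_waiting
    return results

def friend_name(friend_id):
    # yield plays the role of a blocking fetch; the scheduler batches them.
    info = yield friend_id
    return info

print(run_all([friend_name(1), friend_name(2), friend_name(3)]))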

I’d guess a good implementation would be even trickier in Go, since the programmer has less control over goroutine scheduling than over Python’s generators or Haskell’s monads. That said, perhaps someone will discover a clean, optimal implementation. We should pay attention to efforts like this.

I think GraphQL will be a huge deal for high-performance web services. It’s harder to implement than REST or traditional RPC services, but the reduction in response latencies could be substantial. Naive implementations aren’t going to have good performance, but if you are interested in aiming straight at the finish line, Haskell and Haxl will give you a head start.

Thinking About Performance

Update: Bill Gropp covers this topic in depth in a presentation at the University of Illinois: Engineering for Performance in High-Performance Computing.

Two things happened recently that made me think my approach to performance may not be widely shared.

First, when I announced BufferBuilder on reddit, one early commenter asked "Did you run your Haskell through a profiler to help identify the slow parts of your code?" A fairly reasonable question, but no, I didn’t. And the reason is that, if stuff shows up in a Haskell profiler, it’s already going to be too slow. I knew, up front, that an efficient way to build up buffers is a bounds check followed by a store (in the case of a single byte) or copy (in the case of a bytestring). Any instructions beyond that are heat, not light. Thus, I started BufferBuilder by reading the generated assembly and building a mental model of how Haskell translates to instructions, not by running benchmarks and profilers. (Benchmarks came later.) Any "typical" CPU-bound Haskell program is going to spend most of its time in stg_upd_frame_info and stg_ap_*_info anyway. ;)

Next, on my local programming IRC channel, I shared a little trick I’d seen on Twitter for replacing a div with a shift: (i+1)%3 == (1<<i)&3 for i in [0, 2]. One person strenuously objected to the idea of using this trick. Paraphrasing, their argument went something like "the meaning of the code is not clear, so tricks like that should be left to the compiler, but it doesn’t work for all values of i, so a compiler would never actually substitute the left for the right. Just don’t bother." We went back and forth, after which it became clear that what the person REALLY meant was "don’t use this trick unless you’re sure it matters". And then we got to "you have to run a profiler to know it matters." I contend it’s possible to use your eyes and brain to see that a div is on the critical path in an inner loop without fancy tools. You just have to have a rough sense of operation cost and how out-of-order CPUs work.
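
For what it’s worth, the identity is easy to check mechanically; a quick script shows it holds exactly for i in {0, 1, 2} and breaks at i = 3, which was the objector’s point about generality.

for i in range(4):
    print(i, (i + 1) % 3, (1 << i) & 3, (i + 1) % 3 == (1 << i) & 3)
# 0 1 1 True
# 1 2 2 True
# 2 0 0 True
# 3 1 0 False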

Over the years I have built a mental framework for thinking about performance beyond the oft-recommended "get the program working, then profile and optimize the hot spots". My mental framework comes from three sources:

  1. Rico Mariani has written a lot about designing for performance so you aren’t caught, late in the project, unable to hit your perf targets. That is, understand the machine and problem constraints, and sketch out on paper the solution so you can make sure it fits. Use prototypes as necessary.
  2. I’ve always been interested in how our programs run on physical silicon. The Pentium Chronicles: The People, Passion, and Politics Behind Intel’s Landmark Chips is an excellent history of the design of Intel’s out-of-order P6 line.
  3. I’ve been involved with too many projects run with the philosophy of "get it working, profile and optimize later". It’s very easy for this strategy to fail, resulting in a program that’s uniformly sluggish and uneconomical to fix.

Performance is Engineering

To hit your performance goals, you first need to define your goals. Consider what you’re trying to accomplish. Some example performance goals:

  • Interactive VR on a mobile phone
  • Time between mouse click in menus and action taken is under 100 ms
  • Load a full 3D chat room in five seconds
  • First page of search results from a mountain in Hawaii in half a second
  • Latest experimental analyses in half a day

Also, understand why you want to hit those goals. Todd Hoff’s excellent Latency is Everywhere and it Costs You Sales has several links to research showing how performance can affect your business metrics and customer satisfaction.

For BufferBuilder, our goal was to match the performance, in Haskell, of a naive C++ buffer-building library.

Now that you have a goal in mind, and presumably you understand the problem you’re trying to solve, there’s one final piece to understand: what you might call the "fundamental constants of human-computer interaction", which can be split into the human and computer halves.

Very roughly:

  • ~16 ms – frame budget for interactive animation
  • 100 ms – user action feels instantaneous
  • 200 ms – typical human reaction time
  • 500+ ms – perceptible delay
  • 3+ seconds – perceived sluggishness
  • 10+ seconds – attention span is broken

And on the computer:

  • 1 cycle – latency of simple bitwise operations
  • a few – maximum number of independent instructions that can be retired per cycle
  • 3-4 cycles – latency of L1 cache hit
  • ~dozen cycles – latency of L2 cache hit
  • ~couple dozen cycles – integer division on modern x86
  • couple hundred cycles – round-trip to main memory
  • 50-250 us – round-trip to SSD
  • 250-1000 us – in-network ethernet/tcp round-trip
  • 10 ms – spinny disk seek
  • 150 ms – IP latency between Australia and USA

Throughput numbers:

  • 8.5 GB/s – memory bandwidth on iPhone 5
  • ~100 MB/s – saturated gigabit ethernet connection
  • 50-100 MB/s – spinny disk transfer speed
  • 4.5 Mb/s – global average connection speed
  • 3.2 Mb/s – average mobile bandwidth in USA

These numbers can be composed into higher-level numbers. For example:

  • 1 ms – parsing 100 KB of JSON on a modern, native parser such as RapidJSON or sajson.

It’s unlikely JSON parsing will become appreciably faster in the future – parsing is a frustrating, latency-bound problem for computers. "Read one byte. Branch. Read next byte. Branch."

While throughput numbers increase over time, latency has only inched downwards. Thus, on most typical programs, you’re likely to find yourself latency-bound before being throughput-bound.


Designing for Performance

Once you’ve defined your performance goal, you need to make the numbers fit. If your goal is interactive animation, you can barely afford a single blocking round-trip to a spinny disk per frame. (Don’t do that.) If your goal is an interactive AJAX application that feels instant from all around the world, you must pay VERY close attention to the number of IP round-trips required. Bandwidth, modulo TCP windowing, is usually available in spades – but accumulated, serial round-trips will rapidly blow your experience budget. If you want a WebGL scene to load in five seconds on a typical global internet connection, the data had better fit in (5s * 4.5 Mb/s) = 2.8 MB, minus initial round-trip time.
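
The WebGL budget above is just arithmetic; a quick back-of-the-envelope sketch using the numbers from this post:

# How much data fits in a five-second load on a 4.5 Mb/s connection?
seconds = 5
megabits_per_second = 4.5
budget_megabytes = seconds * megabits_per_second / 8  # bits -> bytes
print(budget_megabytes)  # ~2.8 MB, before subtracting round-trip time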

For BufferBuilder, since we knew our goal was to match (or at least come reasonably close to) C++ performance in Haskell, we didn’t need a profiler. We knew that appending a single byte to a buffer could be minimally expressed with a load of the current output pointer, a (predicted) bounds check, and a write of the new output pointer. Appending a buffer to another is a (predicted) bounds check, a memcpy, and updating the output pointer.

A profiler is not needed to achieve the desired performance characteristics. An understanding of the problem, an understanding of the constraints, and careful attention to the generated code is all you need.

We can apply this approach at almost any scale. Want a rich website that uses less than 100 MB of RAM yet has high-resolution images, gradients, transitions, and effects? Build prototypes to measure the costs of each component, build up a quantitative understanding, and make the numbers fit. (Of course, don’t forget to actually measure real memory consumption on the resulting page, lest you be surprised by something you missed.)

Want to design a mobile application that goes from tap to displayed network data in 200 ms? Well, good luck, because Android devices have touchscreen latency of 100+ ms. (Edit: mdwrigh2 on Hacker News points out that this data is out of date for modern Android handsets.) And it can take a second or more to spin up the radio!

To summarize, given a rough understanding of the latencies of various layers of a system, and knowledge of the critical path through the code, simply sum the numbers together and you should have a close approximation to the total latency. Periodically double-check your understanding against real measurements, of course. :)
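
As a toy example of "summing the numbers", here is a hypothetical estimate for one request served from a US datacenter to a user in Australia, using rough constants from this post. The breakdown itself is invented; the technique is just addition.

wan_round_trip_ms  = 150.0    # Australia <-> USA IP latency
backend_fetches_ms = 3 * 0.5  # three serial in-datacenter round-trips
json_parse_ms      = 1.0      # ~100 KB of JSON on a fast native parser
total_ms = wan_round_trip_ms + backend_fetches_ms + json_parse_ms
print(total_ms)  # ~152.5 ms: the WAN round-trip dominates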

FAQ

Are you saying profilers are useless?!

Absolutely not. Profilers are awesome. I have spent a great deal of time with the Python profiler. I particularly love AMD’s old CodeAnalyst tool. (Not the new CodeXL thing.) Sampling profilers in general are great. I’ve also built a bunch of intrusive profilers for various purposes.

But always keep in mind that profilers are for exploring and learning something new. By the time you’ve built your application, you may not actually be able to "profile your way to success".

Should I unilaterally apply this advice, irrespective of context?

Of course not. A large number of programs, on a modern CPU, trivially fit in all the performance budgets you care about. In these situations, by all means, write O(n) loops, linked lists, call malloc() inside every function, use Python, whatever. At that point your bottleneck is human development speed, so don’t worry about it.

But continue to develop an understanding of the costs of various things. I remember a particular instance when someone replaced hundreds of <img> tags on a page with <canvas> for some kind of visual effect. Bam, the page became horrifically slow and consumed an enormous amount of memory. Browsers are very smart about minimizing working set in the presence of lots of images (flushing decoded JPEGs from memory until they’re in view is one possible technique), but <canvas> is freeform and consumes at least width*height*4 bytes of pagefile-backed memory.

What about algorithmic complexity?

Improving algorithmic complexity can be a significant win. Especially when you’re Accidentally Quadratic. Reach for those wins first. But once you get into O(n) vs. O(lg n), you’re almost certainly limited by constant factors.

In any context, you should aim for the biggest wins first. Let’s say you’re writing a web service that talks to a database, fetching info for 100 customers. By far the biggest optimization there is to run one query to fetch 100 customers (as a single round-trip can be 1-10 ms) instead of 100 queries of one customer each. Whenever someone says "My new web service is taking 1.5 seconds!" I can almost guarantee this is why. Both approaches are technically O(n), but the query issue overhead is a HUGE constant factor.
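
To illustrate, here is a rough sketch of the two approaches, assuming a DB-API style cursor and a hypothetical customers table; the SQL details don’t matter, the round-trip count does.

def fetch_names_slow(cursor, ids):
    # 100 ids -> 100 round-trips; each query pays the per-query overhead.
    names = []
    for customer_id in ids:
        cursor.execute("SELECT name FROM customers WHERE id = %s", (customer_id,))
        names.append(cursor.fetchone()[0])
    return names

def fetch_names_fast(cursor, ids):
    # 100 ids -> 1 round-trip; one query returns every row at once.
    placeholders = ", ".join(["%s"] * len(ids))
    cursor.execute("SELECT name FROM customers WHERE id IN (%s)" % placeholders, list(ids))
    return [row[0] for row in cursor.fetchall()]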

In interviews I sometimes ask candidates about the performance difference between two algorithms that have the same algorithmic complexity. There is no precisely correct answer to my question, but "the execution time is the same" is wrong. Constant factors can be very large, and optimizing them can result in multiple-orders-of-magnitude improvements.

But Knuth said…!

Read what he actually said. Also, since then, we’ve developed a much deeper understanding of when mature optimizations are appropriate. Passing const std::string& instead of std::string is not a premature optimization – it’s just basic hygiene you should get right.

Announcing BufferBuilder: Encode JSON in Haskell 2-5x faster than Aeson

Conclusion

This tale is a little long and detailed, so for those of you who want immediate payoff…

Andy and I wrote a Haskell library called Data.BufferBuilder (including Data.BufferBuilder.Utf8 and Data.BufferBuilder.Json) which makes it easy to safely and efficiently encode Haskell data structures into JSON. In our benchmarks, using Data.BufferBuilder.Json to encode JSON is 4-5x as fast as using Aeson.

Even if you’re already using Aeson, you can benefit somewhat from BufferBuilder’s improved encoding performance. The buffer-builder-aeson package adds a ToJson instance for Aeson’s Value type, which our benchmarks show is 50% to 100% faster than Aeson’s built-in encoder. All you need to do is call Data.BufferBuilder.Json.encodeJson instead of Data.Aeson.encode!

Why did we build BufferBuilder?

Some of IMVU’s backend services are written in Haskell. While Haskell is incredible for many use cases, we ran into an unexpected bottleneck: JSON encoding. Our service response structure produces quite a lot of JSON, and much of that JSON contains URLs encoded into JSON strings.

Amazingly, URL and JSON encoding was showing up as a significant cost center when generating JSON responses. Some services spent over a second encoding JSON!

When faced with a performance problem, my first instinct is to either pound the problem into the ground or unmake the problem so the code isn’t necessary in the first place.

So let’s look at the problem holistically:

  • A JSON response is represented in memory as a collection of varied, in-memory data structures. The response happens to contain many URLs — sometimes more than a hundred.
  • URLs are represented by a data structure consisting of the protocol, hostname, path segments, query string, and so on.
  • Each URL becomes a JSON string.
  • Each in-memory data structure is converted to a JSON object whose properties depend on the type of the corresponding data structure.

Using Aeson to encode all of this results in the following steps:

  • ToJSON instances convert Haskell data types to an AST of Aeson Values.
  • The keys of an Aeson object are Text values. In memory, Text is encoded in UTF-16. Thus, URLs must be translated from their in-memory representation (ours is ASCII) into UTF-16 before they fit into the Aeson AST.
  • Then, the entity bodies are converted into JSON objects, where the keys are known at compile-time, but must be encoded as UTF-16.
  • The entire AST is encoded into a Lazy Text Builder.
  • Then the Lazy Text is encoded into a Lazy ByteString containing UTF-8.
  • Then, to generate the response body, the Lazy ByteString is concatenated into a single strict ByteString.

That’s a lot of steps! To summarize:

(URLs and Custom Data Types) -> Aeson AST -> Lazy Text -> Lazy UTF-8 ByteString -> Strict ByteString

The Design of BufferBuilder

This is where Andy and I sat down to create an API to let us cleanly express JSON encoding without sacrificing type safety OR performance.

We know that a fast, if not the fastest, way to build up a buffer of bytes is to allocate a chunk of memory, stream writes to it, and either chunk or realloc() as needed. Obviously, this kind of code can be trivially expressed in C:

void buffer_append(buffer* b, const char* s, size_t length) {
    if (!b->has_room(length)) {
        b->grow(length);
    }
    memcpy(b->data + b->size, s, length);
    b->size += length;  /* advance the write position past the appended bytes */
}

Because I’d been told that bouncing between Haskell and C with the foreign function interface can be slow, my first approach was to attempt to build a Haskell monad that grabbed the RealWorld token out of IO (IO a is basically a newtype around RealWorld -> (RealWorld, a)), augmented it with some extra “local variables” like the output ptr, capacity, and current size, and manually implemented allocation and memory writes with GHC.Prim APIs. GHC did not like this at all. The generated code ran 20 times slower than naive usage of Data.ByteString.Builder. Nonetheless, it was an interesting technique, so maybe I’ll write about it another time.

Surely it was possible to do better. So I tried the foreign function interface after all.

I wrote a tiny C API that allocated the underlying growable buffer. It provided APIs for appending bytes, buffers, UTF-16-to-UTF-8 transcoding, and so on. These FFI calls can only happen within IO actions, but building buffers is fundamentally a pure operation, and the provided Haskell API should be effectively pure. The solution is to offer a restricted state monad like ST which limits the operations within the monad to safe buffer-building operations.

This approach was by far the fastest of any that I tried. In fact, in a sequence of appendBS operations, if the arguments are known-strict ByteStrings, GHC will compile the appendBS sequence directly into a series of cheap C function calls. For example, the following code:

data BSTriple = BSTriple !ByteString !ByteString !ByteString

writeBSTriple :: BSTriple -> BufferBuilder ()
writeBSTriple !(BSTriple a b c) = do
    appendBS a
    appendBS b
    appendBS c

compiles into something like:


movq %rbx,%rdi
movq 120(%rsp),%rax
movq %rax,%rsi
movq 96(%rsp),%rax
movq %rax,%rdx
movq 112(%rsp),%rax
addq %rax,%rdx
subq $8,%rsp
movl $0,%eax
call bw_append_bs
addq $8,%rsp
movq %rbx,%rdi
movq 152(%rsp),%rax
movq %rax,%rsi
movq 128(%rsp),%rax
movq %rax,%rdx
movq 144(%rsp),%rax
addq %rax,%rdx
subq $8,%rsp
movl $0,%eax
call bw_append_bs
addq $8,%rsp
movq %rbx,%rdi
movq 216(%rsp),%rax
movq %rax,%rsi
movq 160(%rsp),%rax
movq %rax,%rdx
movq 208(%rsp),%rax
addq %rax,%rdx
subq $8,%rsp
movl $0,%eax
call bw_append_bs
addq $8,%rsp

Obviously GHC’s code generator has some room to improve, but the main thrust is exactly right. An equivalent C++ API would generate much the same kind of code, modulo the poor instruction and register selection, which overall doesn’t matter too much here.

In almost all situations, Haskell isn’t going to be as fast as straightforward C, but with some work and appropriate use of FFI, it’s possible to come close.

Data.BufferBuilder.Utf8

Once we had an API to safely and efficiently build up buffers of bytes, we wanted to build safe APIs on top for constructing valid UTF-8 buffers and valid JSON.

Utf8Builder is a newtype around BufferBuilder with a restricted API. If you only call safe functions in Data.BufferBuilder.Utf8, the result is guaranteed to be valid UTF-8. Unsafe functions are provided for when you know precisely what you’re doing.

Data.BufferBuilder.Json

Data.BufferBuilder.Json is built on top of Data.BufferBuilder.Utf8. Data.BufferBuilder.Json’s Value type is a newtype around Utf8Builder, meaning there’s no Aeson-style AST. Each Value simply knows how to write itself into an output buffer. Just like how the safe Utf8Builder functions guarantee the output is legal UTF-8, the safe JsonBuilder functions guarantee (almost) that the output is a legal JSON document. (There are a couple caveats, see the documentation for details.)

I suspect Data.BufferBuilder.Json is approaching the limit of how fast a JSON encoder can be. And thanks to the beauty of Haskell, it’s convenient and safe!

If you’re using Aeson and encoding performance matters to you, give BufferBuilder a shot!

HTTP/2 Request Priorities – A Summary

Previously, I’d written about a particular use case we’d run into where we needed to make thousands of HTTP requests to download 3D assets to populate a WebGL scene.

Many of these assets (skeletons, meshes, lower-mip-level textures, poses) are higher priority than others (higher-mip-level textures, animations). In addition, objects closer to the camera should be prioritized higher than objects farther away.

The following table defines the priority of each asset download:

Asset Type           Base Priority
Object Description   100
Skeleton             90
Mesh                 90
Low-res texture      90
Pose                 90
Animation            50
High-res texture     10

Plus a modifier based on the requesting object:

Object          Priority modifier
Room            +9
My Avatar       +8
Other Avatars   +[0,7], by distance to camera

Asset type determines the base priority. Within an asset type, the referencing 3D object adds a priority modifier. The room gets highest priority, your avatar next, and all remaining avatars are sorted by camera distance.

That is, object descriptions trump everything else. (They’re small and contain links to other assets.) Skeletons, meshes, low-res textures, and poses must be loaded before the object to be rendered at all, so they claim the next-highest-level priority. Finally, animations and high-res textures can come in later.

Bandwidth-Delay Product

Let me take a brief digression to explain why priority is important.

The bandwidth-delay product measures the amount of data that must be in flight to fully utilize a pipe. In 2012, the average US Internet connection was 6.7 Mbps. Average round-trip to Google in US was, say, 55 ms. This places the bandwidth-delay product at 6.7 Mbps * 55 ms = 46 KB. If you cannot keep 46 KB in flight at a time, you are wasting bandwidth. Of course, not everyone has Google’s data centers, and latency is jittery, and you likely want worldwide efficiency, so aim higher. Targeting a 150 KB bandwidth delay product, or even higher if you’re worldwide, is not a bad idea.
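
The bandwidth-delay product calculation itself is one line of arithmetic; with the numbers above:

bandwidth_bits_per_second = 6.7e6  # average US connection, 2012
round_trip_seconds = 0.055         # ~55 ms round-trip to Google
bdp_bytes = bandwidth_bits_per_second * round_trip_seconds / 8
print(bdp_bytes)  # ~46,000 bytes: roughly 46 KB must be in flight to fill the pipe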

Now, because the pipe should have a lot of data in flight at any one moment, a large number of requests should be sent to the server right away. Since the client doesn’t know how big responses are, if the server returned low-priority responses first, the page would load more slowly than if low-priority requests weren’t sent until all high-priority responses were downloaded. However, having the client wait to issue low-priority requests does not make good use of available bandwidth. The best option is to give the server enough information to prioritize responses, letting it send down high-priority responses first, making everything load faster. Experience with SPDY shows that it’s critical that priority work well.

An ideal prioritization solution would satisfy the following objectives:

  • make full use of network bandwidth
  • retrieve responses for high-priority requests before low-priority requests
  • immediately — on the current or next frame — retrieve any responses already cached by the browser

On the web, today, there is no way to accomplish all of these goals.

Again, for as much as people like to point out that priority is optional or advisory [1] [2], it is critically important. Multiplexed connections without priority are slower than HTTP/1! Browsers already prioritize HTTP/1 requests, so if they didn’t prioritize SPDY or HTTP/2 requests, low-priority resources would contend for bandwidth with high-priority resources, causing initial load to be slower.

After whining about all of this to a friend, he casually said “Why don’t you take it up with the standards bodies?” (Damn you, jfb!) I’ve not had good luck with web standards bodies in the past — to the point that I have the impression some key people don’t care a whole lot about real-world use cases — but I thought I’d give it a shot.

It turns out that exposing a simple priority field on XMLHttpRequest (W3C WebApps WG), while easy, obvious, and an immediate solution to many use cases, does not satisfy everyone. In fact, it’s been proposed before. The WebApps WG instead suggested that prioritization should be solved at the “Fetch” layer, so regular DOM fetches like <img> and <script> tags can benefit too. This increases the scope of the proposal, but fair enough…

So I went to WHATWG and discovered a proposal that, on first blush, looked absurdly complicated. It’s based on HTTP/2’s dependency tree priority model. I spent an evening or two wrapping my head around the dependency tree model and ran into some problems.

I now see that, for all of the varied use cases for prioritization on the web, there is no apparent solution that satisfies them all. (But one is very close, as you’ll see.)

Use Cases

Let’s work through a few illustrative use cases for prioritizing requests on a multiplexed connection.

Basic Page Load

The ideal scenario for a page load is that enough of the HTML arrives to link to the appropriate CSS and JS resources, after which the JS and CSS are downloaded (with resources completing serially so that response processing on the CPU can happen in parallel with transfer), followed by the remainder of the HTML.

This requires dynamic reprioritization of the HTML document mid-transfer.

Any referenced images are not critical to the initial load, and thus have lower priority. They are also transferred in parallel rather than serially, under the assumption that they can be progressively rendered. Two half-complete JavaScript files are not useful at all, but two half-complete JPEGs certainly could be.

This use case illustrates sequential transfer (JS and CSS) and parallel transfer (images). Sequential transfer is ideal when a resource doesn’t provide value until it’s completely downloaded. Parallel transfer is ideal when resources support progressive display or when trying to be fair to competing requests.

Multiple Tabs

Consider a tabbed browser. Each tab may issue dozens of its own requests. The browser ought to prioritize requests from the focused tab. Background tabs receive the remainder of the network resources.

Switching tabs ought to efficiently reprioritize requests from the old and new tabs appropriately.

Proxy

Proxies take many requests from multiple clients and multiplex them on fewer backend connections. Proxies ought to be fair: if client A asks for low-priority requests and client B asks for high-priority requests, the proxy’s resources should be split evenly between them.

Streaming Video

A video stream (e.g. DASH or M3U) consists of a sequence of short video segments. The current segment has the highest priority, followed by the next, and so on.

Reprioritization occurs on seek, though it’s probably fine to disregard any outstanding requests upon seek.

Loading a 3D Scene

This is the use case described at the top. Unless the connection is otherwise idle, not a single byte of low-priority responses should come before high-priority responses. Requests should be periodically reprioritized as the objects move closer to the camera.

Large Page of Images

Imagine a web page that consists of 10,000 img tags. Ideally, they would download in priority order, where priority is distance from the current viewport. On scroll, requests would be reprioritized.

Server Push

I personally have no interest in server push and many smart people are hashing out how priority interacts with server-pushed data in various forums, so I will refrain from further discussion here.

Prioritization Primitives

We can take the use cases above and break them down into the components that a prioritization algorithm should support.

  1. Sequential in specified order (A before B before C …)
  2. Sequential in no particular order (A, B, C, but all of one before the next)
  3. Parallel (A, B, C, interleaved on the connection)
  4. Fairness (resources allocated evenly between {A, B, C} and {X, Y, Z}.)
  5. Reprioritization
  6. Parallel group dependencies (all of {A, B, C}, in parallel, before {D, E, F}, also in parallel)
  7. And because non-prioritized multiplexing is worse than HTTP/1, advertised knowledge of prioritization support helps clients decide whether to blast out all requests at once or be more conservative :)

The Proposed Priority Models

HTTP/1

First, let’s talk about the status quo: legacy HTTP served by N concurrent TCP connections.

Firefox maintains a priority queue on 32-bit signed integers fed from the nsISupportsPriority interface.

Chromium supports a handful of priorities but I can’t find the code that implements the priority queue in there. All evidence says it exists though. :)

Most browsers also implement a first-paint optimization, where they distinguish resources that are required for the first paint (HTML, JS, and CSS) and download those first. Anything else, such as images or XMLHttpRequests, is blocked until after the first paint. This is slightly inefficient: it leaves the connection idle during the first paint.

An HTTP/1 priority queue served by N TCP connections is inefficient for a couple reasons:

  1. round-trips and limited concurrency make it hard to keep the pipe full (especially on high-BDP connections)
  2. cannot reprioritize an in-flight request
  3. TCP flow control works better if you have one saturated connection rather than 8 unsaturated connections

SPDY/3

The version of SPDY currently in widespread use, SPDY/3, associates a 3-bit priority field with every stream. Streams cannot be reprioritized. “The sender and recipient SHOULD use best-effort to process streams in the order of highest priority to lowest priority.”

Original SPDY/4 Proposal

Two years ago, Google proposed switching SPDY/4’s priority model to a system based on stream dependencies and weights. The idea is that, rather than using numeric priorities, streams form a DAG which represents a partial ordering relation.

At each depth “level” of the DAG, weights are used to distribute available resources across streams.

In this model, streams could have multiple parents. You could say, for example, C depends on A and B, placing C at a lower priority than both A and B.

This proposal also introduced reprioritization.

SPDY/4 advertises on the connection whether prioritization has an effect.

One of the gnarlier bits of the stream dependency model is that it introduces some number of race conditions. After a stream is closed, how long is its information, for the purposes of future reprioritization packets, retained?

Microsoft’s HTTP/2 Proposal

In a simplified form, the SPDY/4 dependencies and weights proposal made it into the HTTP/2 draft. However, prioritization was still a somewhat controversial topic. Osama Mazahir, on behalf of Microsoft, proposed an alternate priority design that appeared to garner some support. See the replies in that thread.

The basic idea is that there are groups of streams. Each group is assigned a weight, and within each group, streams are assigned priority.

Osama does a better job describing the tradeoffs, but the main argument is that it’s simpler and supports more use cases more naturally than dependencies and weights.

The large open questions seem to be: “So what priority values do people choose? What if you want to prioritize resource B between A and C, but A’s and C’s priorities differ by only 1?” If you use numeric priorities, you have to select arbitrary values for people to use, and leave enough room between values to insert new priorities later.

After this thread, I never saw mention of it again. What happened?

HTTP/2 Draft 15

HTTP/2 Draft 15 goes into great detail about the stream dependencies and weights priority model.

Every stream is given one 31-bit parent and an 8-bit weight in [1, 256]. The server SHOULD process streams from the root of the tree down, dividing available resources among streams by their relative weights. If a parent stream closes, it is removed from the tree and its weight is divided among its children.

The default stream parent is zero – the root.
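
To make the weight mechanics concrete, here is a minimal sketch (mine, not from the specification) of how a sender might divide available bytes among sibling streams in proportion to their weights:

def share_bandwidth(bytes_available, weights_by_stream):
    # weights_by_stream: dict of stream id -> weight in [1, 256]
    total_weight = sum(weights_by_stream.values())
    return {
        stream_id: bytes_available * weight // total_weight
        for stream_id, weight in weights_by_stream.items()
    }

# Siblings with weights 256, 128, and 1 get roughly 2/3, 1/3, and a trickle,
# matching the example discussed below.
print(share_bandwidth(385000, {1: 256, 3: 128, 5: 1}))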

One repeated bit of confusion I’ve seen on the mailing lists is the difference between weights and priorities. Priorities are used to specify relative ordering: A before B. Weights are used to divide resources among many competing requests. If I wanted {A0,A1,A2,A3}[weight=256] before {B0,B1,B2,B3}[weight=128] before {C0,C1,C2,C3}[weight=1], is the expectation that the gateway proxy would open 12 simultaneous backend connections and divide the connection with 2/3 of the frames going to A, 1/3 to B, and a trickle to C? That would be silly: we want all As before any Bs. So then is the expectation that proxies would “tend to choose” streams of higher weight and run them to completion? If so, that violates at least the spirit of the algorithm described in HTTP/2.

In terms of server decision-making and client bandwidth utilization, weights make a very poor substitute for true prioritization, so it’s important to treat the two separately. Priorities specify ordering; weights specify resource allocation.

However, there doesn’t seem to be much practical implementation experience with the HTTP/2 priority model. Mozilla’s current implementation ignores dependencies entirely, which would have unexpected results against a compliant connection, as low-priority resources would consume bandwidth while the page is waiting for high-priority resources. (To be fair, this implementation is actively in progress.)

Which Priority Protocol is Best?

Let’s enumerate the prioritization primitives again. The goal is to list whether a particular primitive is expressible in a given protocol. We won’t even consider HTTP/1 because it doesn’t solve for efficient bandwidth usage on a high-latency connection. The priority protocols under consideration are SPDY/3, SPDY/4 as of 2012, Microsoft’s proposal, and HTTP/2.

SPDY/3 is easy to eliminate from the running too: 3 bits of priority is far too limiting for many use cases, including, for example, streaming video chunks. After 8 chunks, you would be out of priority levels.

Use Case                                                                                 SPDY/4 2012   Microsoft   HTTP/2
Sequential in specified order (A before B before C …)                                    👍            👍          👍
Sequential in no particular order^                                                       –             –           –
Parallel (A, B, C in parallel)                                                           👍            👍          👍
Fairness (resources allocated evenly between {A, B, C} and {X, Y, Z})                    👍            👍          👍
Reprioritization                                                                         👍            👍          👍
Parallel group dependencies ({A, B, C} in parallel before {D, E, F}, also in parallel)   👍            👍          –
Advertisement of prioritization support                                                  👍            –           –

^ None of the protocols allow the requestor to specify that responses should be sent in their entirety, one before the next, without also defining a specific sequence. I suppose the assumption is that the server can make this determination by content type? I wonder whether a bit that indicates “I don’t care about the order, but once you start transmitting a resource, please finish it before starting the next” would be useful in practice.

Many of the desired use cases are supported in HTTP/2, but Microsoft’s proposal appears to satisfy strictly more. I wonder what happened to it in committee? I will reach out to its authors and supporters and see.

The biggest issues with HTTP/2 are that:

  • There is no way to express “download {A,B,C} in parallel then download {D,E,F} in parallel”
  • If priority is truly optional, and there is no mechanism by which the server advertises support, then clients will need to play it safe, reducing the possible upside from a multiplexed connection.

Can HTTP/2 be fixed?

Can we modify HTTP/2 to add support for the {A,B,C} in parallel followed by {D,E,F} use case?

The idea of dummy placeholder streams has been bandied about. Let’s see what that might look like:

  • 0 is the root.
  • P1 is the high-priority dummy stream.
  • P2 is the low-priority dummy stream.

If we defined a tree as such:

P1 -> 0
{A, B, C, P2} -> P1
{D, E, F} -> P2

and changed the HTTP/2 specification such that dummy nodes never closed and streams closer to the root were serviced first, then the “Parallel groups” use case is satisfied. However, watch what happens to fairness in the proxy / multiple tabs use case.

Consider two inbound connections to a proxy that are being multiplexed on one outbound connection.

Inbound connection A has three priority levels, thus three dummy streams. Inbound connection B has two priority levels, thus two dummy streams.

0 <- A_P1 <- A_P2 <- A_P3
0 <- B_P1 <- B_P2

Now let’s assume that inbound connection A has two requests: A_S1 at priority A_P1 and A_S2 at priority A_P3. Inbound connection B also has two requests: B_S1 at priority B_P1 and B_S2 at B_P2.

A_P1 <- A_S1
A_P3 <- A_S2
B_P1 <- B_S1
B_P2 <- B_S2

Note that A’s priority levels go deeper than B’s. Per the proposed spec modification above, A_S1 and B_S1 would have equal priority and be serviced fairly. However, if two streams are at unequal depth in the tree, the one closer to the root would win. B_S2 would unfairly starve A_S2. So it seems that fairness and group prioritization are at odds in a collapsing, tree-based dependency model.

What now?

It’s surprising to me how hard it is to get priorities correct over multiplexed streams. Microsoft’s proposal seems to be the most useful, but I have no insight into what caused it to fail in committee.

I don’t have any ideas for how to save a tree-based model. We need either weighted stream groups or we need to convert the tree into a DAG — that is, if nodes had multiple parents, the dependency model would work swimmingly.

Getting priority right is critical — HTTP/2 will be the most popular protocol on the Internet, and the upside potential in both page-load efficiency and performance of new web applications is huge.

Pre-emptive Rebuttals

Why not define your own priority extension?

The big advantage to using HTTP/2 is that it’s a standard. CDNs, caching proxies, gateways, and even load balancers will implement it. A custom protocol would be prohibitive in practice.

Just get over it and use weights as priorities.

I hope I demonstrated earlier why weights don’t solve the prioritization problem. “But it’s close enough!” The speed of light isn’t getting any faster. Page load optimization is getting increasingly expensive. Projects like SPDY, QUIC, and HTTP/2 are ambitious, protocol-level efforts for arguably marginal wins. Being “close enough” is no longer sufficient. Defining a protocol that can specify the ideal transfer schedule has a huge return on investment, assuming the protocol is adopted and implemented well.

Optimizing WebGL Shaders by Reading D3D Shader Assembly

This post is a mirror of the IMVU Engineering Blog.

We are optimizing WebGL shaders for the Intel GMA 950 chipset, which is basically the slowest WebGL-capable device we care about. Unfortunately, it’s a fairly common chipset too. On the plus side, if we run well on the GMA 950, we should basically run well anywhere. :)

When you’re writing GLSL in WebGL on Windows, your code is three layers of abstraction away from what actually runs on the GPU. First, ANGLE translates your GLSL into HLSL. Then, D3DX compiles the HLSL into optimized shader assembly bytecode. That shader bytecode is given to the driver, where it’s translated into hardware instructions for execution on the silicon.

Ideally, when writing GLSL, we’d like to see at least the resulting D3D shader assembly.

With a great deal of effort and luck, I have finally succeeded at extracting Direct3D shader instructions from WebGL. I installed NVIDIA Nsight and Visual Studio 2013 in Boot Camp on my Mac. To use Nsight, you need to create a dummy Visual Studio project. Without a project, it won’t launch at all. Once you have your dummy project open, configure Nsight (via Nsight User Properties under Project Settings) to launch Firefox.exe instead. Firefox is easier to debug than Chrome because it runs in a single process.

If you’re lucky — and I’m not sure why it’s so unreliable — the Nsight UI will show up inside Firefox. If it doesn’t show up, try launching it again. Move the window around, open various menus. Eventually you should have the ability to capture a frame. If you try to capture a frame before the Nsight UI shows up in Firefox, Firefox will hang.

Once you capture a frame, find an interesting draw call, make sure the geometry is from your WebGL application, and then begin looking at the shader. (Note: again, this tool is unreliable. Sometimes you get to look at the HLSL that ANGLE produced, which you can compile and disassemble with fxc.exe, and sometimes you get the raw shader assembly.)

Anyway, check this out. We’re going to focus on the array lookup in the following bone skinning GLSL:

attribute vec4 vBoneIndices;
uniform vec4 uBones[3 * 68];

ivec3 i = ivec3(vBoneIndices.xyz) * 3;
vec4 row0 = uBones[i.x];

ANGLE translates the array lookup into HLSL. Note the added bounds check for security. (Why? D3D claims it already bounds-checks array accesses.)

int3 _i = (ivec3(_vBoneIndices) * 3);
float4 _row0 = _uBones[int(clamp(float(_i.x), 0.0, 203.0))];

This generates the following shader instructions:

# NOTE: v0 is vBoneIndices
# The actual load isn't shown here.  This is just index calculation.

def c0, 2.00787401, -1, 3, 0
def c218, 203, 2, 0.5, 0

# r1 = v0, truncated towards zero
slt r1.xyz, v0, -v0
frc r2.xyz, v0
add r3.xyz, -r2, v0
slt r2.xyz, -r2, r2
mad r1.xyz, r1, r2, r3

mul r2.xyz, r1, c0.z # times three

# clamp
max r2.xyz, r2, c0.w
min r2.xyz, r2, c218.x

# get ready to load, using a0.x as index into uBones
mova a0.x, r2.y

That blob of instructions that implements truncation towards zero? Let’s decode that.

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = v0 - floor(v0)
r3.xyz = v0 - r2
r2.xyz = (-r2 < r2) ? 1 : 0
r1.xyz = r1 * r2 + r3

Simplified further:

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = (floor(v0) < v0) ? 1 : 0
r1.xyz = (r1 * r2) + floor(v0)

In short, r1 = floor(v0), UNLESS v0 < 0 and floor(v0) < v0, in which case r1 = floor(v0) + 1.

That’s a lot of instructions just to calculate an array index. After the index is calculated, it’s multiplied by three, clamped to the array boundaries (securitah!), and loaded into the address register.

Can we convince ANGLE and HLSL that the array index will never be negative? (It should know this, since it’s already clamping later on, but whatever.) Perhaps avoid a bunch of generated code? Let’s tweak the GLSL a bit.

ivec3 i = ivec3(max(vec3(0), vBoneIndices.xyz)) * 3;
vec4 row0 = uBones[i.x];

Now the instruction stream is substantially reduced!

def c0, 2.00787401, -1, 0, 3
def c218, 203, 1, 2, 0.5

# clamp v0 positive
max r1, c0.z, v0.xyzx

# r1 = floor(r1)
frc r2, r1.wyzw
add r1, r1, -r2

mul r1, r1, c0.w # times three

# bound-check against array
min r2.xyz, r1.wyzw, c218.x
mova a0.x, r2.y

By clamping the bone indices against zero before converting to integer, the shader optimizer eliminates several instructions.

Can we get rid of the two floor instructions? We know that the mova instruction rounds to the nearest integer when converting a float to an index. Given that knowledge, I tried to eliminate the floor by making my GLSL match mova semantics, but the HLSL compiler didn’t seem smart enough to elide the two floor instructions. If you can figure this out, please let me know!

Either way, I wanted to demonstrate that, by reading the generated Direct3D shader code from your WebGL shaders, you may find small changes that result in big wins.

Web Platform Limitations, Part 2 – Web Audio API is a Mess

The Web Audio API is a beautiful example of how bizarre web APIs can get.

Originally, Mozilla proposed their Audio Data API, which allowed programmatic creation of streams of audio sample data. You simply opened an output stream, specified the number of channels and sample frequency, and wrote sample data to it. It was simple, beautiful, low-latency, and exposed a minimum baseline set of functionality, meaning that each browser would likely have had a high-quality implementation. The Audio Data API had another nice feature: the application developer specified the desired sample rate. In practice, every OS already has a high-quality resampler and mixer, so there’s no penalty for using a non-native sample rate.

I don’t know exactly why the Audio Data API lost, but I suspect it had something to do with the idea that JavaScript is too slow or high-latency to generate sample data in small buffers on demand.

Of course, with typed arrays, asm.js, and the upcoming SIMD.js spec (more on that later), that concern is not very well-founded.

Either way, the Audio Data API lost and the Web Audio API won.

The Web Audio API is an enormous mess. It has a huge API surface with a signal processing graph containing HRTFs, sound cones, doppler, convolutions, oscillators, filters, and so on. Quoth the spec: “It is a goal of this specification to include the capabilities found in modern game audio engines as well as some of the mixing, processing, and filtering tasks that are found in modern desktop audio production applications.”

We may end up with at least four separate implementations: Blink, WebKit, Gecko, and IE. Some friends of mine and I have this suspicion that, since each browser has to implement such a wide API, in practice, browsers won’t always do a great job with each node, so serious audio people will simply generate samples in JavaScript (or Emscripten-compiled C++) anyway.

I created a little demo that shows how poorly browsers handle the basic task of playing a handful of buffers, with fixed sample rates, at exact, scheduled times. I did not find a single browser on either Windows or Mac that could play five 200 Hz sine wave buffers without glitching. You can try the demo yourself.

And I would be remiss without linking the jsmess plea for help.

These are almost certainly solvable problems (the Web Audio API allows you to specify when buffers begin playback in seconds at 64-bit floating point precision), but given that the Web Audio API has been in development for years, you’d think the fundamentals of playing gapless buffers would be solved by now. All the media graph stuff could be optional and layered on.

Final Thoughts

For posterity, audio APIs should work much like DirectSound: at some fairly high, reliable frequency, your application should query the current ‘play’ and ‘write’ heads in a circular audio buffer, fill the buffer from the write head up to the play head, and ta da. Streaming audio. Let the OS’s audio stack handle the final resampling and mixing, and let the particular application perform whatever transforms it needs to. Of course, the actual polling and buffer-filling could happen in a browser thread, and JavaScript could stream arbitrary numbers of samples in advance if latency doesn’t matter that much.
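
As a sketch of that model, here is roughly what the polling loop looks like; the device object and its play_cursor/write methods are hypothetical stand-ins for a DirectSound-style API, not a real library.

import time

BUFFER_SIZE = 48000 * 2 * 2  # one second of 48 kHz, 16-bit stereo, in bytes

def stream_audio(device, synthesize):
    write_pos = 0
    while True:
        play_pos = device.play_cursor()  # where the hardware is currently reading
        # Fill from our last write position up to (but not past) the play cursor.
        available = (play_pos - write_pos) % BUFFER_SIZE
        if available:
            device.write(write_pos, synthesize(available))  # application-specific DSP
            write_pos = (write_pos + available) % BUFFER_SIZE
        time.sleep(0.005)  # poll at a reliable, fairly high frequency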

(Side note: OpenAL is terrible too. Audio is a stream, not a collection of buffers that you queue.)

“But, but, JavaScript performance!”

If your concern is truly about JavaScript performance, then we’re all in trouble anyway. Why should only audio benefit from SIMD and low latency callbacks? If it will never be the case that JavaScript numerical performance will approach native numerical performance, the Web Audio graph could be replaced entirely by a handful of signal processing functions that are 1) mathematically defined and 2) work on arbitrary typed arrays. This would be simpler for browsers to implement, easier to verify, AND have more flexibility.

Web Platform Limitations, Part 1 – XMLHttpRequest Priority

Note: this post is a mirror of the post at the IMVU Engineering Blog.

The web is inching towards being a general-purpose applications platform that rivals native apps for performance and functionality. However, to this day, API gaps make certain use cases hard or impossible.

Today I want to talk about streaming network resources for a 3D world.

Streaming Data

In IMVU, when joining a 3D room, hundreds of resources need to be downloaded: object descriptions, 3D mesh geometry, textures, animations, and skeletons. The size of this data is nontrivial: a room might contain 40 MB of data, even after gzip. The largest assets are textures.

To improve load times, we first request low-resolution thumbnail versions of each texture. Then, once the scene is fully loaded and interactive, high-resolution textures are downloaded in the background. High-resolution textures are larger and not critical for interactivity. That is, each resource is requested with a priority:

High Priority             Low Priority
Skeletons                 High-resolution textures
Meshes
Low-resolution textures
Animations

Browser Connection Limits

Let’s imagine what would happen if we opened an XMLHttpRequest for each resource right away. What happens depends on whether the browser is using plain HTTP or SPDY.

HTTP

Browsers limit the number of simultaneous TCP connections to each domain. That is, if the browser’s limit is 8 connections per domain, and we open 50 XMLHttpRequests, the 9th would not even be submitted until one of the first 8 finished. (In theory, HTTP pipelining helps, but browsers don’t enable it by default.) Since there is no way to indicate priority in the XMLHttpRequest API, you would have to be careful to issue XMLHttpRequests in order of decreasing priority, and hope no higher-priority requests would arrive later. (Otherwise, they would be blocked by low-priority requests.) This assumes the browser issues requests in sequence, of course. If not, all bets are off.

There is a way to approximate a prioritizing network atop HTTP XMLHttpRequest. At the application layer, limit the number of open XMLHttpRequests to 8 or so and have them pull the next request from a priority queue as requests finish.
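
A sketch of that application-layer approximation, with start_request standing in for actually issuing an XMLHttpRequest (the scheduling is the point, not the transport):

import heapq

MAX_CONCURRENT = 8

class PrioritizingNetwork:
    def __init__(self):
        self.queue = []      # (priority, url); lower number = higher priority
        self.in_flight = 0

    def request(self, priority, url):
        heapq.heappush(self.queue, (priority, url))
        self._pump()

    def _pump(self):
        # Keep up to MAX_CONCURRENT requests open, highest priority first.
        while self.in_flight < MAX_CONCURRENT and self.queue:
            _, url = heapq.heappop(self.queue)
            self.in_flight += 1
            start_request(url, on_done=self._on_done)  # hypothetical transport call

    def _on_done(self):
        self.in_flight -= 1
        self._pump()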

Soon I’ll show why this doesn’t work that well in practice.

SPDY

If the browser and server both support SPDY, then there is no theoretical limit on the number of simultaneous requests to a server. The browser could happily blast out hundreds of HTTP requests, and the responses will make full use of your bandwidth. However, a low-priority response might burn bandwidth that could otherwise be spent on a high-priority response.

SPDY has a mechanism for prioritizing requests. However, that mechanism is not exposed to JavaScript, so, like HTTP, you either issue requests from high priority to low priority or you build the prioritizing network approximation described above.

However, the prioritizing network reduces effective bandwidth utilization by limiting the number of concurrent requests at any given time.

Prioritizing Network Approximation

Let’s consider the prioritizing network implementation described above. Besides the fact that it doesn’t make good use of the browser’s available bandwidth, it has another serious problem: imagine we’re loading a 3D scene with some meshes, 100 low-resolution textures, and 100 high-resolution textures. Imagine the high-resolution textures are already in the browser’s disk cache, but the low-resolution textures aren’t.

Even though the high-resolution textures could be displayed immediately (and would trump the low-resolution textures), because they have low priority, the prioritizing network won’t even check the disk cache until all low-resolution textures have been downloaded.

That is, even though the customer’s computer has all of the high-resolution textures on disk, they won’t show up for several seconds! This is an unnecessarily poor experience.

Browser Knows Best

In short, the browser has all of the information needed to effectively prioritize HTTP requests. It knows whether it’s using HTTP or SPDY. It knows what’s in its cache and what isn’t.

It would be super fantastic if browsers let you tell them each request’s priority. I’ve seen some discussions about adding priority hints, but they seem to have languished.

tl;dr Not being able to tell the browser about request priority prevents us from making effective use of available bandwidth.

FAQ

Why not download all resources in one large zip file or package?

Each resource lives at its own URL, which maximizes utilization of HTTP caches and lets resources be trivially shared across objects. If we downloaded resources in a zip file, we wouldn’t be able to leverage CDNs and the rest of the HTTP ecosystem. Plus, with protocols like SPDY, per-request overhead is greatly reduced.

Benchmarking JSON Parsing: Emscripten vs. Native

This post concludes my obsession with JSON parsing. In fact, the entire reason I wrote a JSON parser was to show these graphs. I wanted to see whether I could write a JSON parser faster than any other when run in Emscripten. Since vjson is typically faster, I did not succeed, unless I requalify my goal as writing the fastest-in-Emscripten JSON parser that produces a useful parse tree.

This benchmark’s code is on GitHub. I encourage you to reproduce my results and search for flaws.

All benchmarks were run on a 2010 MacBook Pro, 2.53 GHz Core i5, 8 GB 1067 MHz DDR3.

Native vs. Emscripten

First, let’s compare native JSON parsing performance (clang 3.1, -O2) with both stable and nightly versions of Firefox and Chrome.

Two things are immediately clear from this graph. First, native code is still 5-10x faster than Emscripten/JavaScript. Second, yajl and jansson are dramatically slower than rapidjson, sajson, and vjson. Native yajl and jansson are even slower than Emscripten’d sajson. Henceforth I will not include them.

Looking closely at the browser results, a third conclusion is clear. Firefox runs Emscripten’d code much faster than Chrome.

Finally, sajson consistently performs better than rapidjson in my Emscripten tests, and vjson always parses the fastest. I believe this is because Emscripten and browser JITs punish larger code sizes.

The previous graph only shows parse rates by browser and parser for a single file. Next let’s look at parse rates by file.

Yep, native code consistently wins. At this point I want to dig into differences between the browsers, so I will show the same graph but without native code.

Firefox vs. Chrome

Not only is Firefox consistently faster than Chrome, it’s faster by a factor of 2-5x!

Finally, here is the same graph but normalized against Firefox 18.

If I were a Firefox JS developer, I’d be feeling pretty proud right now, assuming this experiment holds up to scrutiny. Even so, these results match my experience with Bullet/Emscripten in Chrome: Chrome takes a very long time to stabilize its JIT, and I’ve even seen it fail to stabilize, constantly deopting and then reoptimizing code. In contrast, Firefox may take longer to JIT up front, but performance is smooth afterwards.

Further work is necessary to test the hypothesis that code size is the biggest predictor of Emscripten performance.

Preemptive answers to predicted questions follow:

Well, duh. You shouldn’t use an Emscripten’d JSON parser. You should use the browser’s built-in JSON.parse function.

This isn’t about parsing JSON. This is about seeing whether the web can compete with native code under common workloads. After all, Mozillians have been claiming JavaScript is or will be fast enough to replace native apps. If parsing JSON through a blessed browser API is the only way to do it quickly, then developers are disempowered. What if a new format comes along? Must developers wait on the browser vendors? The web has to be better than that.

Shouldn’t you have compiled the Emscripten code with -fno-exceptions?

Yes. Oops. That said, after generating the graphs, I did recompile with -fno-exceptions and it made no difference to the benchmark results in either Chrome or Firefox.

JSON Parser Benchmarking

With the caveat that each parser provides different functionality and access to the resulting parse tree, I benchmarked sajson, rapidjson, vjson, YAJL, and Jansson. My methodology was simple: given large-ish real-world JSON files, parse them as many times as possible in one second. To include the cost of reading the parse tree in the benchmark, I then iterated over the entire document and summed the number of strings, numbers, object values, array elements, etc.
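
The harness looks roughly like the sketch below. This is not the benchmark’s actual code; parse_and_count_nodes is a hypothetical stand-in for one parser’s parse-plus-traversal step, supplied per parser.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>

// Parse the document as many times as possible in one second and report
// parses per second.  parse_and_count_nodes stands in for one parser's
// "parse, then walk the tree and count strings/numbers/elements" step.
double parses_per_second(const std::string& json,
                         const std::function<size_t(const std::string&)>& parse_and_count_nodes) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const auto deadline = start + std::chrono::seconds(1);

    size_t parses = 0;
    size_t nodes = 0;   // keeps the traversal from being optimized away
    while (clock::now() < deadline) {
        nodes += parse_and_count_nodes(json);
        ++parses;
    }

    const std::chrono::duration<double> elapsed = clock::now() - start;
    std::printf("visited %zu nodes\n", nodes);
    return static_cast<double>(parses) / elapsed.count();
}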

The documents are:

  • apache_builds.json: Data from the Apache Jenkins installation. Mostly it’s an array of three-element objects, with string keys and string values.
  • github_events.json: JSON data from GitHub’s events feature. Nested objects several levels deep, mostly strings, but also contains booleans, nulls, and numbers.
  • instruments.json: A very long array of many-key objects.
  • mesh.json: 3D mesh data. Almost entirely consists of floating point numbers.
  • update-center.json: Also from Jenkins, though I’m not sure where I found it.

apache_builds.json, github_events.json, and instruments.json are pretty-printed with a great deal of interelement whitespace.

Now for the results. The Y axis is parses per second. Thus, higher is faster.

Core 2 Duo E6850, Windows 7, Visual Studio 2010, x86

Core 2 Duo E6850, Windows 7, Visual Studio 2010, AMD64

Atom D2700, Ubuntu 12.04, clang 3.0, AMD64

Raspberry Pi

Conclusions

sajson compares favorably to rapidjson and vjson, all of which stomp the C-based YAJL and Jansson parsers. 64-bit builds are noticeably faster, presumably because the additional registers help. Raspberry Pis are slow. :)

sajson: Building the Parse Tree

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don’t know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]], all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it
was slow. The scanning phase was simpler than parsing, but not THAT much
simpler: parsing largely amounts to reading a byte and branching on its
value, so a scanning pass costs nearly as much as a full parse, and
parsing in two phases was effectively half the speed. Second, the
scanning phase duplicated a fair amount of parsing code, making it
harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack).

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array’s location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.
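
In code, this recursive (C-stack) formulation looks roughly like the following toy sketch. It is not sajson itself: it handles only arrays of single-digit integers and packs a one-bit type tag into each reference, but the temp/out bookkeeping is the same idea.

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy sketch of the temp/out parse tree layout described above.  Handles only
// arrays of single-digit integers, e.g. "[[0],0]".
class ToyParser {
public:
    explicit ToyParser(const std::string& text)
        : text_(text)
        , tree_(text.size())        // the parse tree fits in one word per input byte
        , out_(tree_.size()) {}

    // Parses the text and returns the index of the root array record in tree().
    size_t parse() {
        size_t root = parse_array();
        assert(pos_ == text_.size());
        return root;
    }

    const std::vector<size_t>& tree() const { return tree_; }

private:
    // Each element reference packs a type tag into its low bit.
    enum Tag : size_t { kInteger = 0, kArray = 1 };
    static size_t make_ref(Tag tag, size_t location) { return (location << 1) | tag; }

    void parse_value() {
        if (text_[pos_] == '[') {
            size_t record = parse_array();
            tree_[temp_++] = make_ref(kArray, record);                 // reference goes into temp
        } else {
            assert(text_[pos_] >= '0' && text_[pos_] <= '9');
            tree_[--out_] = static_cast<size_t>(text_[pos_++] - '0');  // payload goes into out
            tree_[temp_++] = make_ref(kInteger, out_);                 // reference goes into temp
        }
    }

    size_t parse_array() {
        assert(text_[pos_] == '[');
        ++pos_;
        const size_t temp_start = temp_;    // remembered "on the C stack"
        while (text_[pos_] != ']') {
            parse_value();
            if (text_[pos_] == ',') ++pos_;
        }
        ++pos_;                             // consume ']'

        // We now know the length: it is how many references accumulated in temp.
        const size_t length = temp_ - temp_start;

        // Move the references from temp into out, back to front, then write the
        // length, so the record reads [length][ref 0]...[ref N-1].
        for (size_t i = length; i-- > 0;)
            tree_[--out_] = tree_[temp_start + i];
        tree_[--out_] = length;

        // Adjust the references so their locations are relative to the record.
        for (size_t i = 1; i <= length; ++i)
            tree_[out_ + i] -= (out_ << 1);

        temp_ = temp_start;                 // release the temporary references
        return out_;                        // location of this array's record
    }

    const std::string text_;
    std::vector<size_t> tree_;
    size_t pos_ = 0;
    size_t temp_ = 0;    // grows forward from the front of tree_
    size_t out_;         // grows backward from the back of tree_
};

ToyParser("[[0],0]").parse() reproduces the example above: it returns 0, and the tree holds [2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0], with the type tags packed into the low bits.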

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It’s possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let’s examine the same example but where we store a reference
to the outer ‘in-construction’ object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0
[<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

# seen [[0],0]
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array record is
1+N words: its length plus N element references. The temporary storage
for that array is also 1+N words: we don’t yet know its length, but we
do hold a reference to the enclosing array or object. Thus, we have
room for outer references in the parse tree.
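
As a sketch, here is how that change would look in the toy parser above. The routine below replaces parse(), parse_array(), and parse_value(); kRootMarker stands in for the special root marker, and each temp-side link slot holds the enclosing array’s slot index instead of living on the C stack.

// Recursion-free variant of the toy parser's parse routine.  The slot at
// `current` holds the saved link for the array being built.
size_t parse_iterative() {
    static constexpr size_t kRootMarker = ~static_cast<size_t>(0);
    size_t current = kRootMarker;   // no enclosing array yet
    for (;;) {
        if (text_[pos_] == '[') {
            ++pos_;
            tree_[temp_] = current;          // save the outer link in temp, not on the C stack
            current = temp_++;
        } else if (text_[pos_] == ']') {
            ++pos_;
            const size_t temp_start = current + 1;     // element references follow the saved link
            const size_t length = temp_ - temp_start;
            const size_t enclosing = tree_[current];   // read the link before the record can overwrite it

            // Identical record construction to the recursive version.
            for (size_t i = length; i-- > 0;)
                tree_[--out_] = tree_[temp_start + i];
            tree_[--out_] = length;
            for (size_t i = 1; i <= length; ++i)
                tree_[out_ + i] -= (out_ << 1);

            temp_ = current;                           // restore the previous temp
            if (enclosing == kRootMarker)
                return out_;                           // parsing is complete; out_ is the root
            tree_[temp_++] = make_ref(kArray, out_);   // this array becomes an element of the enclosing one
            current = enclosing;
        } else if (text_[pos_] == ',') {
            ++pos_;
        } else {
            assert(text_[pos_] >= '0' && text_[pos_] <= '9');
            tree_[--out_] = static_cast<size_t>(text_[pos_++] - '0');
            tree_[temp_++] = make_ref(kInteger, out_);
        }
    }
}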

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that malformed inputs cannot cause crashes.

That said, if you give sajson a try, let me know how it goes.