Multiplatform C++ on the Web with Emscripten

by Chad Austin
Apr 5 13

[This post is also available at the IMVU engineering blog.]

At GDC 2013, IMVU shared the analysis that led us to using Emscripten as a key component of our technology strategy to bring our rich 3D virtual goods catalog to new platforms, including web browsers.

Here are the slides from said presentation.

C++11 on Windows (Hah!)

by Chad Austin
Feb 22 13

It’s 2013, and for all Herb Sutter’s talk about C++11, it’s surprisingly hard to get an off-the-shelf C++11 development toolchain on Windows, at least as of today. By off-the-shelf I mean suitable for an engineering team to get up and running quickly. Of course I could perform unnatural acts and compile my own packages of whatever, but no thanks.

Cygwin runs gcc 4.5, which is too old for most C++11 features. Cygwin does provide a clang 3.1 package, but it uses the gcc 4.5 libstdc++ headers, lacking most of C++11’s standard library.

I could attempt to compile my own libcxx but libcxx is only known to work on Mac OS X.

In November, Microsoft released a Community Technology Preview increasing Visual Studio 2012’s C++11 support, but it requires modifying your project to use the CTP toolchain. I’m using SCons and I have no idea how to convince it to use the CTP.

MinGW offers gcc 4.7 and, in theory, complete C++11 support, but a known bug disables std::to_string.  At least until gcc 4.8.
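A common stopgap for the missing std::to_string is a hand-rolled shim that routes any streamable value through an ostringstream. A minimal sketch (the name to_string_compat is mine, not a standard API):

```cpp
#include <sstream>
#include <string>

// Stopgap for toolchains missing std::to_string (e.g. MinGW gcc 4.7):
// format any streamable value through an ostringstream.
template <typename T>
std::string to_string_compat(const T& value) {
    std::ostringstream stream;
    stream << value;
    return stream.str();
}
```

It produces less precise control over floating-point formatting than the real std::to_string, but tides code over until gcc 4.8.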

Blah. I think I’ll abandon Windows for development and focus on Mac and Linux until this situation has improved.

Benchmarking JSON Parsing: Emscripten vs. Native

by Chad Austin
Feb 11 13

This post concludes my obsession with JSON parsing. In fact, the entire reason I wrote a JSON parser was to show these graphs. I wanted to see whether I could write a JSON parser faster than any other when run in Emscripten. As vjson is typically faster, I did not succeed unless I requalify my goal as writing the fastest-in-Emscripten JSON parser with a useful parse tree.

This benchmark’s code is on GitHub. I encourage you to reproduce my results and search for flaws.

All benchmarks were run on a 2010 MacBook Pro, 2.53 GHz Core i5, 8 GB 1067 MHz DDR3.

Native vs. Emscripten

First, let’s compare native JSON parsing performance (clang 3.1, -O2) with both stable and nightly versions of Firefox and Chrome.

Two things are immediately clear from this graph. First, native code is still 5-10x faster than Emscripten/JavaScript. Second, yajl and jansson are dramatically slower than rapidjson, sajson, and vjson. Native yajl and jansson are even slower than Emscripten’d sajson. Henceforth I will not include them.

Looking closely at the browser results, a third conclusion is clear. Firefox runs Emscripten’d code much faster than Chrome.

Finally, sajson consistently performs better than rapidjson in my Emscripten tests. And vjson always parses the fastest. I believe this is because Emscripten and browser JITs punish larger code sizes.

The previous graph only shows parse rates by browser and parser for a single file. Next let’s look at parse rates by file.

Yep, native code consistently wins. At this point I want to dig into differences between the browsers, so I will show the same graph but without native code.

Firefox vs. Chrome

Not only is Firefox consistently faster than Chrome, it’s faster by a factor of 2-5x!

Finally, here is the same graph but normalized against Firefox 18.

If I were a Firefox JS developer, I’d be feeling pretty proud right now, assuming this experiment holds up to scrutiny. Even so, these results match my experience with Bullet/Emscripten in Chrome: Chrome takes a very long time to stabilize its JIT, and I’ve even seen it fail to stabilize, constantly deopting and then reoptimizing code. In contrast, Firefox may take longer to JIT up front, but performance is smooth afterwards.

Further work is necessary to test the hypothesis that code size is the biggest predictor of Emscripten performance.

Preemptive answers to predicted questions follow:

Well, duh. You shouldn’t use an Emscripten’d JSON parser. You should use the browser’s built-in JSON.parse function.

This isn’t about parsing JSON. This is about seeing whether the web can compete with native code under common workloads. After all, Mozillians have been claiming JavaScript is or will be fast enough to replace native apps. If parsing JSON through a blessed browser API is the only way to do it quickly, then developers are disempowered. What if a new format comes along? Must developers wait on the browser vendors? The web has to be better than that.

Shouldn’t you have compiled the Emscripten code with -fno-exceptions?

Yes. Oops. That said, after generating the graphs, I did recompile with -fno-exceptions and it made no difference to the benchmark results in either Chrome or Firefox.

JSON Parser Benchmarking

by Chad Austin
Jan 31 13

With the caveat that each parser provides different functionality and access to the resulting parse tree, I benchmarked sajson, rapidjson, vjson, YAJL, and Jansson. My methodology was simple: given large-ish real-world JSON files, parse them as many times as possible in one second. To include the cost of reading the parse tree in the benchmark, I then iterated over the entire document and summed the number of strings, numbers, object values, array elements, etc.
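The timing loop can be sketched as follows, with parse_fn standing in for any of the benchmarked parsers (names here are mine, not from the benchmark harness itself); parse_fn is expected to both parse the document and walk the resulting tree:

```cpp
#include <chrono>
#include <string>

// Sketch of the methodology above: run parse_fn on the document as many
// times as possible in roughly one second, then report parses per second.
template <typename ParseFn>
double parses_per_second(const std::string& document, ParseFn parse_fn) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const auto deadline = start + std::chrono::seconds(1);
    int parses = 0;
    while (clock::now() < deadline) {
        parse_fn(document);  // parse and walk the resulting tree
        ++parses;
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return parses / elapsed.count();
}
```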

The documents are:

  • apache_builds.json: Data from the Apache Jenkins installation.  Mostly it’s an array of three-element objects, with string keys and string values.
  • github_events.json: JSON data from GitHub’s events feature.  Nested objects several levels deep, mostly strings, but also contains booleans, nulls, and numbers.
  • instruments.json: A very long array of many-key objects.
  • mesh.json: 3D mesh data. Almost entirely consists of floating point numbers.
  • update-center.json: Also from Jenkins though I’m not sure where I found it.

apache_builds.json, github_events.json, and instruments.json are pretty-printed with a great deal of inter-element whitespace.

Now for the results. The Y axis is parses per second. Thus, higher is faster.

Core 2 Duo E6850, Windows 7, Visual Studio 2010, x86

Core 2 Duo E6850, Windows 7, Visual Studio 2010, AMD64

Atom D2700, Ubuntu 12.04, clang 3.0, AMD64

Raspberry Pi

Conclusions

sajson compares favorably to rapidjson and vjson, all of which stomp the C-based YAJL and Jansson parsers. 64-bit builds are noticeably faster: presumably because the additional registers are helpful. Raspberry Pis are slow. :)

Comparing JSON Parse Trees

by Chad Austin
Jan 25 13

I will soon post performance benchmarks of sajson, rapidjson, vjson, YAJL, and Jansson.

However, before talking about performance, I want to compare the parse trees generated by each library.

sajson

  • Differentiates between integers and doubles
  • Array lengths are known.
  • Arrays are random-access.
  • Object lengths are known.
  • Object keys are sorted.  Thus, random access of object elements is O(lg N).
  • Stored contiguously in memory.
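To illustrate why sorted keys give O(lg N) lookup: binary search over a contiguous, key-sorted element list. This sketch mimics the idea only; it is not sajson’s actual API, and the record layout is simplified to key/value pairs:

```cpp
#include <string>
#include <utility>
#include <vector>

// Binary search over an object's sorted element records.
// Returns the index of the matching record, or -1 if absent.
int find_key(const std::vector<std::pair<std::string, int>>& records,
             const std::string& key) {
    int lo = 0;
    int hi = static_cast<int>(records.size());
    while (lo < hi) {
        const int mid = lo + (hi - lo) / 2;
        if (records[mid].first < key) {
            lo = mid + 1;  // key, if present, is in the upper half
        } else {
            hi = mid;
        }
    }
    if (lo < static_cast<int>(records.size()) && records[lo].first == key) {
        return lo;
    }
    return -1;  // key not present
}
```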

rapidjson

  • At its root, rapidjson is a SAX-style parser.  However, rapidjson does provide a default DOM implementation.
  • Array and Object length are not directly accessible in the API but are accessible via End() - Begin() or MemberEnd() - MemberBegin().
  • Arrays support random access.
  • Object lookup is O(N).
  • Differentiates between integers, unsigned integers, 64-bit integers, 64-bit unsigned integers, and doubles.
  • Parse tree is mutable.

vjson

  • Namespaces its symbols under the generic “json” prefix.
  • Differentiates between integers and floats.  No support for doubles.
  • Array and Object lengths are not stored.  Instead a linked list must be traversed to calculate the length.
  • Random array access is O(N).
  • Object element lookup is O(N), by walking a linked list.
  • Very simple interface.

YAJL

  • Differentiates between integers and doubles.
  • Array and Object lengths are available.
  • Object element lookup is O(N).
  • Parse tree is mutable.

Jansson

  • Differentiates between integers and doubles.
  • Array and Object lengths are available.
  • Array access is O(1).
  • Object element access is O(1) via a hash table.
  • The parse tree is mutable.

sajson 1.0 is ready to go

by Chad Austin
Jan 22 13

I’ve fixed all the known bugs and error cases in sajson. If you end up using it, I’d love to hear from you.

I’ll capture some scattered thoughts before my JSON parsing obsession dies down.

There’s not much opportunity to take advantage of a modern CPU’s superscalar execution pipeline when parsing JSON. Parsing JSON generally involves reading a character and branching. Searching for a string’s ending quotes could be implemented with vector instructions (see strchr in Agner Fog’s optimization manual or this tutorial) but the ROI doesn’t seem high to me.

Could we use multiple threads? Maybe, if we split the input text into chunks and had each thread discover the next { or [ and optimistically begin parsing future elements. Almost certainly not worth it.

Either way, sajson is approximately done. I will post benchmarks soon.

sajson: Building the Parse Tree

by Chad Austin
Jan 7 13

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don't know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was
slow. The scanning phase was simpler than parsing, but not THAT much
simpler. Since parsing involves reading one byte and branching on its
value, parsing in two phases was effectively half the speed. Second,
the scanning phase duplicated a fair amount of parsing code, making it
harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack).

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array's location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.
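The walkthrough above can be condensed into a compilable sketch, restricted to arrays and single-digit integers. Type tags are packed into the low 3 bits of each reference, and an explicit vector stands in for the C stack; all names are mine, not sajson’s:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum Tag { TAG_INTEGER = 0, TAG_ARRAY = 1 };

size_t make_ref(Tag tag, size_t index) { return (index << 3) | tag; }

// Returns the parse tree and the index of the root array record.
std::pair<std::vector<size_t>, size_t> parse(const std::string& text) {
    std::vector<size_t> tree(text.size(), 0);
    size_t temp = 0;               // next free slot at the front
    size_t out = text.size() - 1;  // next free slot at the back
    size_t root = 0;
    std::vector<size_t> starts;    // saved temp positions (the "C stack")
    for (const char c : text) {    // ',' needs no action
        if (c == '[') {
            starts.push_back(temp);
        } else if (c >= '0' && c <= '9') {
            tree[out] = static_cast<size_t>(c - '0');     // payload at out
            tree[temp++] = make_ref(TAG_INTEGER, out--);  // reference at temp
        } else if (c == ']') {
            const size_t start = starts.back();
            starts.pop_back();
            const size_t length = temp - start;
            const size_t record = out - length;  // length word plus refs end at out
            // Move references from temp to out, back to front so the
            // overlapping copy is safe, rewriting each absolute index
            // relative to the array record.
            for (size_t i = length; i-- > 0;) {
                const size_t ref = tree[start + i];
                tree[record + 1 + i] =
                    make_ref(static_cast<Tag>(ref & 7), (ref >> 3) - record);
            }
            tree[record] = length;
            out = record - 1;  // wraps at the root; never read again
            temp = start;
            if (starts.empty()) {
                root = record;  // outermost array: nothing encloses it
            } else {
                tree[temp++] = make_ref(TAG_ARRAY, record);
            }
        }
    }
    return std::make_pair(tree, root);
}
```

Parsing "[[0],0]" reproduces the seven-word layout of the walkthrough, with the root record at index 0.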

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It's possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let's examine the same example but where we store a reference
to the outer 'in-construction' object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0
[<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

# seen [[0],0]
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array
record will be 1+N words, including its length. The temporary
storage is also 1+N words: N element references plus, in place of the
not-yet-known length, a reference to the enclosing array or object.
Thus, we have room for outer references in the parse tree.

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed inputs crash.

That said, if you give sajson a try, let me know how it goes.

sajson: Why the Parse Tree is Big Enough

by Chad Austin
Jan 6 13

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst
case.

Given that value types are stored in 3-bit tags, it’s intuitive that N
characters of JSON data requires N words in the parse tree. Let’s
consider the parse tree representation of each JSON type individually:

Strings

Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;
    size_t end;
};

Arrays

Arrays are represented by their length followed by a relative offset
to each element.

struct Array {
    size_t length;
    size_t element[length]; // type:offset
};

A zero-length array is 2 characters ([]) but consumes one word in the
parse tree.

An array of length 1, not including its element, is also 2 characters
([""]) but consumes two words in the parse tree.

An array of length N is N+1 characters, counting commas. The
representation described above consumes N+1 words in the parse tree,
so arrays fit.
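That accounting can be spot-checked mechanically for the interesting case of an array of n single-digit integers (brackets, digits, and commas on one side; an array record plus integer payloads on the other). A quick sketch, with function names mine:

```cpp
#include <cstddef>

// Characters of JSON text for an array of n single-digit integers:
// "[]" is 2 characters; otherwise n digits, n-1 commas, 2 brackets.
size_t json_chars(size_t n) {
    return n == 0 ? 2 : 2 * n + 1;
}

// Words of parse tree: the array record (length + one ref per element)
// plus one payload word per integer.
size_t tree_words(size_t n) {
    return (1 + n) + n;
}
```

For n ≥ 1 the two quantities are equal, which is why a list of single-digit numbers is the worst case.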

Objects

Objects are represented by their length followed by an array of element records
each containing two pointers for the key string as well as a relative
offset to the element value.

struct ObjectElement {
    size_t key_begin;
    size_t key_end;
    size_t value; // type:offset
};

struct Object {
    size_t length;
    ObjectElement element[length];
};

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.

Numbers: Integers

The only hat trick required to fit sajson’s parse tree
representation in input_text_length * sizeof(size_t)
bytes is representing integers differently than doubles.

struct Integer {
    intptr_t value;
};

That’s because, on 32-bit architectures, doubles are two words. If
single-digit numbers such as 0 consumed two words in the
parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the parse
tree at all.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse
tree.

struct Double {
    size_t first_half;
    size_t second_half;
};

On 64-bit architectures, a double consumes a single word.
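Since parse tree words are size_t-sized but a double may straddle two of them on 32-bit targets, memcpy is the portable way to perform the unaligned store and load. A sketch (function names are mine):

```cpp
#include <cstddef>
#include <cstring>

// Write a double into the parse tree: fills two words on 32-bit
// targets, one word on 64-bit targets, regardless of alignment.
void store_double(size_t* words, double value) {
    std::memcpy(words, &value, sizeof(double));
}

// Read it back the same way.
double load_double(const size_t* words) {
    double value;
    std::memcpy(&value, words, sizeof(double));
    return value;
}
```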

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). It’s clear that they fit into
the parse tree.

null, true, and false

null, true, and false contribute 4, 4, and 5 characters of input text,
respectively. They are represented simply by their type tags: no
parse tree data necessary.

We’ve now covered every JSON data type and shown that, in no case,
will the parse tree representation use more words than the JSON text
representation uses characters.

Single-Allocation JSON Parser

by Chad Austin
Jan 1 13

Over the holiday break, as mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation,
cache locality, and thread contention are the common arguments
against. But I focused on a different issue: what is the worst case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case
consumption.

Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.

Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming DOM-style parser in
that it translates a complete buffer of characters into a contiguous
parse tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.

sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.

For example, the JSON text…

[null, 0, ["foo"]]

… would parse into…

<Array>
3 # length
5 # offset to first element
6 # offset to second element
9 # offset to third element
<Null>
<Number>
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
<Array>
1 # length
3 # offset to first element
<String>
12 # offset into source document of string start
15 # offset into source document of string end

… where each line is a pointer-sized (aka size_t) value and <> represents named type constants.

For the above representation, the parse tree’s worst-case size is
sizeof(size_t) * input_length * 2. I won’t derive that
here, but the worst-case document is a list of single-digit numbers:

[0,0,0,0] # 9 characters

# 9*2 = 18 'slots' of structure
<Array>
4
6 # relative offset to first element
9
12
15
<Number>
0
0
<Number>
0
0
<Number>
0
0
<Number>
0
0

But we can do better!

Using a full size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)
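One way to realize that packing is to shift the word offset left by three and keep the type in the low bits. The tag names below are illustrative, not sajson’s actual constants; note the seven JSON types fill seven of the eight available 3-bit values, leaving one spare:

```cpp
#include <cstddef>

enum Tag {
    TAG_NULL, TAG_FALSE, TAG_TRUE, TAG_NUMBER,
    TAG_STRING, TAG_ARRAY, TAG_OBJECT
};

// Pack a 3-bit type tag into an element reference word.
size_t make_element(Tag tag, size_t offset) {
    return (offset << 3) | static_cast<size_t>(tag);
}

Tag element_tag(size_t element) {
    return static_cast<Tag>(element & 7);
}

size_t element_offset(size_t element) {
    return element >> 3;
}
```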

A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.

Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.

[0,0,0,0] # 9 characters

# root bit determines root is array
4 # length of array
<Integer>:5
<Integer>:6
<Integer>:7
<Integer>:8
0
0
0
0

Let’s quickly check another example:

[[[[]]]] # 8 characters

# root bit determines root is array
1
<Array>:2
1
<Array>:2
1
<Array>:2
0

With the above changes, the parse tree size is cut in half! It now
fits in sizeof(size_t) * input_length.

Next time I’ll describe the challenges in building said parse tree
without a priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three
characters are "[[[".

Don’t Write Test-Only Methods

by Chad Austin
Dec 17 12

When writing unit tests for new code or existing code, it’s often tempting to call test-only methods to get at private implementation details.

Here I will argue that test-only methods are a mistake and that it’s critical to distinguish between testing code and testing the interface’s behavior.

As an example, let’s look at a simple Python thread pool implementation.

import Queue
import threading

class ThreadPool:
    def __init__(self):
        self._queue = Queue.Queue()
        self._threads = []
        self._activeThreads = []

    def run(self, fn):
        self._queue.put(fn)
        if len(self._activeThreads) == len(self._threads):
            t = threading.Thread(target=self.__thread)
            t.setDaemon(True)
            t.start()
            self._threads.append(t)

    def __thread(self):
        while True:
            q = self._queue.get()
            self._activeThreads.append(None)
            q()
            self._activeThreads.pop()

How would I unit test such a thing? Maybe something like:

def test_queuing_a_job_starts_a_thread():
    t = ThreadPool()
    t.run(lambda: None)
    assertEqual(1, len(t._threads))

That test would pass! But it’s also a bad test for several reasons:

  • It doesn’t assert that the job is run, meaning the test could pass if the implementation is broken.
  • It assumes implementation details: that the number of threads increases to one the first time a job is added.
  • Finally, the test has not protected itself from valid refactorings. If someone renamed or eliminated the private _activeThreads variable, the test would fail, even if the thread pool still behaved as advertised.

And there’s the important point: behaved as advertised. Tests should verify that the object does what it says it does. That is, an object does work through its public interface, either by producing a side effect or returning a value.

Let’s write a better test for the same functionality.

def test_queueing_a_job_starts_a_thread():
    begin = threading.Semaphore(0)
    end = threading.Semaphore(0)
    calls = []
    def job():
        begin.acquire()
        calls.append(None)
        end.release()
    t = ThreadPool()
    t.run(job)
    begin.release()
    end.acquire()
    assertEqual(1, len(calls))

This test no longer needs to reference any private fields or methods: it simply asserts that the given job is run and that said job does not block the main thread.

An aside: a simple implementation of run could be def run(fn): fn() but this test would hang (and thus fail) if fn were run on the same thread.

Writing tests against a public interface is validation for your object’s public interface. It requires you to understand the object’s interface. It shows examples for how to use your object elsewhere in the production system.

It also means the implementation of the object’s invariants are protected: refactoring internals will never break tests unless the object itself no longer meets its public interface.

What if the code under test is too hard to reach through the public interface?

And here we come to a common argument in support of testing internals: “What if the code under test is too hard to reach through the public interface?” Well, maybe your object is too complicated! Split the object into multiple objects with simple public interfaces and compose them.

Otherwise objects will end up with a combinatorial number of invariants. Imagine the implementation of list.append. It probably has a typical fast path plus the rare occasion where it must reallocate the list’s memory.

Note that the ThreadPool class doesn’t implement its own list allocation logic (nor a thread-safe queue): it reuses existing objects with clear and sensible interfaces.

Thus the threadpool tests can simply ignore the implementation of list.append and assume it works. list.append is tested elsewhere.

Refactoring is Hard Enough

Refactoring is hard enough. It’s critical that you build your systems to allow refactoring of internals. Any bit of friction, like incorrectly failing tests, may get in the way of engineers improving the code.

But what about caching and performance?

Caching and performance are an interesting case. Imagine a perspective camera object in a 3D renderer. It takes field-of-view, aspect ratio, and perhaps near and far distances as inputs and spits out a projection matrix.

As an optimization, you may be tempted to cache the projection matrix to avoid computation if it’s requested multiple times with the same input. How do you test such a cache?

# pseudocode
class Camera:
    # ... setters and such go here

    def getProjectionMatrix():
        if _projectionMatrixDirty: recalculate()
        return _projectionMatrixCache
    bool _projectionMatrixDirty
    Mat4f _projectionMatrixCache

It’s tempting to set inputs and assert _projectionMatrixDirty or perhaps assert the contents of the _projectionMatrixCache field.

However, these unit tests would not actually assert the desired behavior! Yes, you can maybe assert that you’re caching data correctly, but nobody actually cares about that. Instead you want to test the motivating intent behind the cache: performance.

Try writing performance tests! Call getProjectionMatrix() multiple times and assert it takes less than N microseconds or cycles or whatever.

In some cases it may even be faster to recompute the matrix than cache it.
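To make that concrete, here is a hypothetical sketch of such a performance test in C++. Camera here is a stand-in matching the pseudocode above (recalculate() is placeholder math, not a real projection), and the threshold in any real test would be tuned per target machine:

```cpp
#include <chrono>

struct Mat4f { float m[16]; };

// Stand-in for the cached Camera described above.
class Camera {
public:
    void setFieldOfView(float fov) { fov_ = fov; dirty_ = true; }
    const Mat4f& getProjectionMatrix() {
        if (dirty_) { recalculate(); dirty_ = false; }
        return cache_;
    }
private:
    void recalculate() {
        for (int i = 0; i < 16; ++i) cache_.m[i] = fov_;  // placeholder math
    }
    float fov_ = 1.0f;
    bool dirty_ = true;
    Mat4f cache_;
};

// Average nanoseconds per getProjectionMatrix() call. After the first
// call, every lookup should hit the cache; a perf test asserts this
// average stays under a per-machine threshold.
double average_lookup_ns(Camera& camera, int calls) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < calls; ++i) {
        camera.getProjectionMatrix();
    }
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        clock::now() - start).count();
    return static_cast<double>(ns) / calls;
}
```

This asserts the behavior people actually care about (fast repeated lookups) without ever touching _projectionMatrixDirty.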

tl;dr

Here’s my rule of thumb: if a method on an object is only called by tests, it’s dead. Excise it and its tests.

Be rigorous and clear with your public interfaces. Protect your code with tests. Don’t let internal object refactoring break your unit tests!