sajson: Building the Parse Tree

by Chad Austin
Jan 7 13

The least trivial part of building sajson’s parse tree is allocating
(or should I say, reserving?) the space in the parse tree for an
array’s or object’s element list without knowing its length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don't know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]], all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First,
it was slow. The scanning phase was simpler than parsing, but not
THAT much simpler: both phases spend their time reading one byte and
branching on its value, so scanning the text and then parsing it was
effectively half the speed. Second, the scanning phase duplicated a
fair amount of parsing code, making it harder to reason about and
maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack).

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array's location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.
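
To make the bookkeeping concrete, here is a minimal sketch of the
scheme for a toy subset of JSON (arrays and single-digit integers
only). The names, tag values, and reference encoding are
illustrative, not sajson’s actual code; the start of each open array
is remembered on an explicit stack, standing in for the C stack of a
recursive parser.

#include <cstddef>
#include <string>
#include <vector>

enum Tag : size_t { TAG_INTEGER = 0, TAG_ARRAY = 1 };

size_t make_ref(Tag tag, size_t where) { return (where << 3) | tag; }

// Parses text into tree (one word per input character) and returns
// the index of the root array's record.
size_t parse(const std::string& text, std::vector<size_t>& tree) {
    tree.assign(text.size(), 0);
    size_t temp = 0;             // next free slot at the front
    size_t out = tree.size();    // one past the next write at the back
    std::vector<size_t> starts;  // saved temp per open array

    for (char c : text) {
        if (c == '[') {
            starts.push_back(temp);
        } else if (c == ']') {
            size_t saved = starts.back();
            starts.pop_back();
            size_t length = temp - saved;
            out -= length + 1;        // room for the length plus the refs
            size_t record = out;
            // Copy back-to-front: the temp refs and the record can overlap.
            for (size_t i = length; i-- > 0; ) {
                size_t ref = tree[saved + i];
                // Element locations become relative to the record.
                tree[record + 1 + i] =
                    make_ref(Tag(ref & 7), (ref >> 3) - record);
            }
            tree[record] = length;
            temp = saved;
            if (starts.empty()) return record;  // closed the root: done
            tree[temp++] = make_ref(TAG_ARRAY, record);
        } else if (c >= '0' && c <= '9') {
            tree[--out] = size_t(c - '0');              // payload at out
            tree[temp++] = make_ref(TAG_INTEGER, out);  // reference at temp
        }  // ',' requires no tree writes
    }
    return 0;  // unreachable for well-formed input
}

Running this on "[[0],0]" produces exactly the seven words from the
walkthrough: [2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0].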

Eliminating the Recursion

You may have noticed the implementation above stores the start
address of each array or object on the C stack. That stack is liable
to overflow on a JSON file consisting of N [s followed by N ]s for
some large N. The JSON standard allows parsers to limit the maximum
depth they handle, but we can do better.

It's possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let's examine the same example but where we store a reference
to the outer 'in-construction' object in temp:

# seen [
[<Array>:<ROOT>][ ][ ][ ][ ][ ][ ]

# seen [[
[<Array>:<ROOT>][<Array>:0][ ][ ][ ][ ][ ]

# seen [[0
[<Array>:<ROOT>][<Array>:0][<Integer>:6][ ][ ][ ][0]

# seen [[0],
[<Array>:<ROOT>][<Array>:4][ ][ ][1][<Integer>:2][0]

# seen [[0],0
[<Array>:<ROOT>][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

# seen [[0],0]
[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]

In this example, there is room. But will there always be?

An easy conceptualization is that the final size of an array record
is 1+N words: its length plus N element references. The temporary
storage for that array is also 1+N words: N element references plus
one word that, while the length is still unknown, holds a reference
to the enclosing array or object. Thus, we have room for
outer references in the parse tree.
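
Continuing the earlier sketch (again illustrative, not sajson’s
actual code), eliminating the recursion amounts to replacing the
explicit starts stack: each open array’s first temp slot links to the
enclosing open array. The walkthrough above draws this link as a
tagged <Array> reference; a raw index plus a ROOT sentinel works just
as well here.

const size_t ROOT = ~size_t(0);  // sentinel: no enclosing array
size_t open_ = ROOT;             // temp index of the innermost open array

// Inside the character loop of parse(), replacing the stack logic:
if (c == '[') {
    tree[temp] = open_;   // link to the enclosing open array
    open_ = temp++;       // this array is now the innermost one
} else if (c == ']') {
    size_t saved = open_;
    size_t enclosing = tree[saved];    // read before the slot is reused
    size_t length = temp - saved - 1;  // slot `saved` holds the link
    out -= length + 1;
    size_t record = out;
    for (size_t i = length; i-- > 0; ) {
        size_t ref = tree[saved + 1 + i];
        tree[record + 1 + i] = make_ref(Tag(ref & 7), (ref >> 3) - record);
    }
    tree[record] = length;
    temp = saved;
    open_ = enclosing;
    if (open_ == ROOT) return record;  // closed the root: parsing complete
    tree[temp++] = make_ref(TAG_ARRAY, record);
}

However deeply the input nests, the parser now needs only constant
stack space; the bookkeeping that previously lived on the C stack
lives in the not-yet-used portion of the parse tree.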

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed input can crash it.

That said, if you give sajson a try, let me know how it goes.

sajson: Why the Parse Tree is Big Enough

by Chad Austin
Jan 6 13

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst
case.

Given that value types are stored in 3-bit tags, it’s not obvious
that N characters of JSON data will fit in N words of parse tree.
Let’s consider the parse tree representation of each JSON type
individually:

Strings

Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;
    size_t end;
};

Arrays

Arrays are represented by their length followed by a relative offset
to each element.

struct Array {
    size_t length;
    size_t element[length]; // type:offset
};

A zero-length array is 2 characters ([]) but consumes one word in the
parse tree.

An array of length 1, not including its element, is also 2 characters
([""]) but consumes two words in the parse tree.

An array of length N contributes N+1 characters, counting its
brackets and commas but not its elements. The representation
described above consumes N+1 words in the parse tree, so arrays fit.

Objects

Objects are represented by their length followed by an array of
element records, each containing two pointers for the key string as
well as a relative offset to the element value.

struct ObjectElement {
    size_t key_begin;
    size_t key_end;
    size_t value; // type:offset
};

struct Object {
    size_t length;
    ObjectElement element[length];
};

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.

Numbers: Integers

The only trick required to fit sajson’s parse tree
representation in input_text_length * sizeof(size_t)
bytes is representing integers differently than doubles.

struct Integer {
    intptr_t value;
};

That’s because, on 32-bit architectures, doubles are two words. If
single-digit numbers such as 0 consumed two words in the
parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the
parse tree at all: the value could be packed directly into the
element’s tagged reference.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) in the parse
tree.

struct Double {
    size_t first_half;
    size_t second_half;
};

On 64-bit architectures, a double consumes a single word.

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). It’s clear that they fit into
the parse tree.

null, true, and false

null, true, and false contribute 4, 4, and 5 characters of input text,
respectively. They are represented simply by their type tags: no
parse tree data necessary.

We’ve now covered every JSON data type and shown that in no case will
the parse tree representation use more words than the JSON text
representation uses characters.

Single-Allocation JSON Parser

by Chad Austin
Jan 1 13

Over the holiday break, as a mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation,
cache locality, and thread contention are the common arguments
against. But I focused on a different issue: what is the worst case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case
consumption.

Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.

Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming, DOM-style parser:
it translates a complete buffer of characters into a contiguous parse
tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.

sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.

For example, the JSON text…

[null, 0, ["foo"]]

… would parse into…

<Array>
3 # length
5 # offset to first element
6 # offset to second element
9 # offset to third element
<Null>
<Number>
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
<Array>
1 # length
3 # offset to first element
<String>
12 # offset into source document of string start
15 # offset into source document of string end

… where each line is a pointer-sized (aka size_t) value and <> represents named type constants.

For the above representation, the parse tree’s worst-case size is
sizeof(size_t) * input_length * 2. I won’t derive that
here, but the worst-case document is a list of single-digit numbers:

[0,0,0,0] # 9 characters

# 9*2 = 18 'slots' of structure
<Array>
4
6 # relative offset to first element
9
12
15
<Number>
0
0
<Number>
0
0
<Number>
0
0
<Number>
0
0

But we can do better!

Using a full size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)

A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.
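
Incidentally, the seven JSON types plus <Integer> make exactly eight
tags, which is why three bits suffice. Here is a sketch of the
packing; the tag values are illustrative, not necessarily sajson’s:

#include <cstddef>

enum Tag : size_t {
    TAG_INTEGER = 0, TAG_DOUBLE = 1, TAG_STRING = 2, TAG_ARRAY = 3,
    TAG_OBJECT  = 4, TAG_NULL   = 5, TAG_FALSE  = 6, TAG_TRUE  = 7,
};

// Three bits of tag leave 29 bits of word offset on a 32-bit
// architecture (61 bits on 64-bit).
size_t make_element(Tag tag, size_t offset) { return (offset << 3) | tag; }
Tag    get_tag(size_t element)              { return Tag(element & 7); }
size_t get_offset(size_t element)           { return element >> 3; }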

Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.

[0,0,0,0] # 9 characters

# root bit determines root is array
4 # length of array
<Integer>:5
<Integer>:6
<Integer>:7
<Integer>:8
0
0
0
0

Let’s quickly check another example:

[[[[]]]] # 8 characters

# root bit determines root is array
1
<Array>:2
1
<Array>:2
1
<Array>:2
0

With the above changes, the parse tree size is cut in half! It now
fits in sizeof(size_t) * input_length.

Next time I’ll describe the challenges in building said parse tree
without a priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three
characters are "[[[".

Don’t Write Test-Only Methods

by Chad Austin
Dec 17 12

When writing unit tests for new code or existing code, it’s often tempting to call test-only methods to get at private implementation details.

Here I will argue that test-only methods are a mistake: it’s critical to distinguish between testing an object’s implementation and testing its behavior through its public interface.

As an example, let’s look at a simple Python thread pool implementation.

import Queue
import threading

class ThreadPool:
    def __init__(self):
        self._queue = Queue.Queue()
        self._threads = []
        self._activeThreads = []

    def run(self, fn):
        self._queue.put(fn)
        if len(self._activeThreads) == len(self._threads):
            t = threading.Thread(target=self.__thread)
            t.setDaemon(True)
            t.start()
            self._threads.append(t)

    def __thread(self):
        while True:
            q = self._queue.get()
            self._activeThreads.append(None)
            q()
            self._activeThreads.pop()

How would I unit test such a thing? Maybe something like:

def test_queuing_a_job_starts_a_thread():
    t = ThreadPool()
    t.run(lambda: None)
    assertEqual(1, len(t._threads))

That test would pass! But it’s also a bad test for several reasons:

  • It doesn’t assert that the job is run, meaning the test could pass if the implementation is broken.
  • It assumes implementation details: that the number of threads increases to one the first time a job is added.
  • Finally, the test has not protected itself from valid refactorings. If someone renamed or eliminated the private _threads variable, the test would fail, even if the thread pool still behaved as advertised.

And there’s the important point: behaved as advertised. Tests should verify that the object does what it says it does. That is, an object does work through its public interface, either by producing a side effect or returning a value.

Let’s write a better test for the same functionality.

def test_queueing_a_job_starts_a_thread():
    begin = threading.Semaphore(0)
    end = threading.Semaphore(0)
    calls = []
    def job():
        begin.acquire()
        calls.append(None)
        end.release()
    t = ThreadPool()
    t.run(job)
    begin.release()
    end.acquire()
    assertEqual(1, len(calls))

This test no longer needs to reference any private fields or methods: it simply asserts that the given job is run and that said job does not block the main thread.

An aside: a simple implementation of run could be def run(fn): fn() but this test would hang (and thus fail) if fn was run on the same thread.

Writing tests against a public interface validates your object’s design. It requires you to understand the object’s interface. It shows examples for how to use your object elsewhere in the production system.

It also means the object’s internals are protected: refactoring them will never break tests unless the object itself no longer meets its public interface.

What if the code under test is too hard to reach through the public interface?

And here we come to a common argument in support of testing internals: “What if the code under test is too hard to reach through the public interface?” Well, maybe your object is too complicated! Split the object into multiple objects with simple public interfaces and compose them.

Otherwise objects end up with combinatorial invariants: every code path of one internal mechanism multiplied by every code path of the next. Imagine the implementation of list.append. It probably has a typical fast path plus the rare occasion where it must reallocate the list’s memory.

Note that the ThreadPool class doesn’t implement its own list allocation logic (nor a thread-safe queue): it reuses existing objects with clear and sensible interfaces.

Thus the threadpool tests can simply ignore the implementation of list.append and assume it works. list.append is tested elsewhere.

Refactoring is Hard Enough

Refactoring is hard enough. It’s critical that you build your systems to allow refactoring of internals. Any bit of friction, like incorrectly failing tests, may get in the way of engineers improving the code.

But what about caching and performance?

Caching and performance are an interesting case. Imagine a perspective camera object in a 3D renderer. It takes field-of-view, aspect ratio, and perhaps near and far distances as inputs and spits out a projection matrix.

As an optimization, you may be tempted to cache the projection matrix to avoid computation if it’s requested multiple times with the same input. How do you test such a cache?

# pseudocode
class Camera:
    # ... setters and such go here

    def getProjectionMatrix(self):
        if self._projectionMatrixDirty:
            self._recalculate()
        return self._projectionMatrixCache

    # fields: _projectionMatrixDirty, _projectionMatrixCache

It’s tempting to set inputs and assert _projectionMatrixDirty or perhaps assert the contents of the _projectionMatrixCache field.

However, these unit tests would not actually assert the desired behavior! Yes, you can maybe assert that you’re caching data correctly, but nobody actually cares about that. Instead you want to test the motivating intent behind the cache: performance.

Try writing performance tests! Call getProjectionMatrix() multiple times and assert it takes less than N microseconds or cycles or whatever.

In some cases it may even be faster to recompute the matrix than cache it.
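
As a concrete illustration, here is a sketch of such a test in C++, against a hypothetical Camera class with the interface from the pseudocode above; the call count and time bound are arbitrary and would need calibration for the machines the tests run on.

#include <cassert>
#include <chrono>

void test_projection_matrix_is_cached() {
    Camera camera;                       // hypothetical class from above
    camera.setFieldOfView(60.0f);        // hypothetical setters
    camera.setAspectRatio(16.0f / 9.0f);
    camera.getProjectionMatrix();        // first call pays for recalculation

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        camera.getProjectionMatrix();    // these should all hit the cache
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    // A thousand cached lookups should be nearly free.
    assert(elapsed < std::chrono::microseconds(500));
}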

tl;dr

Here’s my rule of thumb: if a method on an object is only called by tests, it’s dead. Excise it and its tests.

Be rigorous and clear with your public interfaces. Protect your code with tests. Don’t let internal object refactoring break your unit tests!

JavaScript’s new Operator (or: How to Read ECMA-262)

by Chad Austin
Dec 10 12

Before we dig in, I must briefly describe the state of JavaScript, aka ECMAScript, aka ECMA-262.

Two versions of JavaScript are in wide use today. ECMA-262, 3rd edition (commonly known as ES3) was standardized in 1999 and is supported by pretty much anything that claims to support JavaScript. ECMA-262, 5th edition (commonly known as ES5) is a slightly more modern language that, while common among the latest web browsers, is not ubiquitous. Note that ECMA-262, 5.1 edition, is a bugfix release of the ES5 standard.

The ECMA-262 standards are quite approachable and freely available online.

(This is a tangent but it’s cool. You should know the test262 project provides an ES5.1 conformance suite, and it appears that IE10 is the most compliant implementation at this time.)

On to the topic at hand, which is the behavior of JavaScript’s new operator and why you might want a JavaScript implementation of it.

Open the ES5 spec and examine section 11.2.2. Don’t worry that there are two versions of the new operator: the specification merely distinguishes between new X and new X(Y).

The key point is that new evaluates its argument expression, which must result in an Object, and calls said object’s internal [[Construct]] method. Only Function objects have a [[Construct]] method, and its behavior is precisely defined in section 13.2.2.

I will demonstrate a JavaScript function that implements new. The first argument is the constructor.

function new_(T) {
    if (!(T instanceof Function)) {
        throw new TypeError(T + ' is not a constructor');
    }
    var proto = T.prototype;
    if (!(proto instanceof Object)) {
        proto = Object.prototype;
    }
    var obj = Object.create(proto);
    var result = T.apply(obj, [].slice.call(arguments, 1));
    return (result instanceof Object) ? result : obj;
}

Why is such a function useful?

First, and primarily, new_ can take a variable list of arguments, via new_.apply. That can’t be done with the traditional new operator.

Secondly, new_ can be bound to produce factory functions. Let’s say you have a constructor SomeClass. You can create a SomeClass factory as such:

var SomeClassFactory = new_.bind(undefined, SomeClass);
var instance = SomeClassFactory();
(instance instanceof SomeClass); // true

Thirdly, new_ illustrates some non-intuitive behaviors of the new operator. For example, the created object’s prototype is set to Object.prototype if the constructor’s prototype is not an object. Consider:

function foo() {}
foo.prototype = null;
var instance = new foo;
Object.getPrototypeOf(instance) === Object.prototype; // true

A lesser-known behavior is that constructors can themselves return values. If the value returned is an Object (remember that Functions are Objects too), new returns it instead of the object passed to the constructor as this. For example:

function Wooo() { return Wooo; }
Wooo === new new new new new new new new Wooo; // true

I don’t recommend it, but this technique could be used to write constructors that behave identically whether they are called directly or as a constructor:

function Flexible(arg) {
    var this_ = Object.create(Flexible.prototype);
    this_.arg = arg;
    return this_;
}
Flexible.prototype.method = function() {
    return this.arg;
}

var a = Flexible('success');
var b = new Flexible('success');
'success' === a.method(); // true
'success' === b.method(); // true

I hope that you’ve learned something about JavaScript’s new operator. Before sending you on your way, I’d like to point out that Object.create is an ES5 method and thus unsupported in ES3 implementations. So could we write an implementation of new_ that runs in old browsers by avoiding Object.create?

Object.create is effectively a primitive operation on which the new operator is built. And indeed, we can bypass the need for Object.create by using some of new’s behavior to create an object with a specified prototype. Here is the ES3 version of new_:

function new_(T) {
    if (!(T instanceof Function)) {
        throw new TypeError(T + ' is not a constructor');
    }
    var proto = T.prototype;
    if (!(proto instanceof Object)) {
        proto = Object.prototype;
    }
    function dummy() {}
    dummy.prototype = proto;
    var obj = new dummy; // Object.getPrototypeOf(obj) === proto
    var result = T.apply(obj, [].slice.call(arguments, 1));
    return (result instanceof Object) ? result : obj;
}

Emscripten Results: Firefox 19 shows dramatic improvement

by Chad Austin
Nov 28 12

Last time, we looked at Emscripten’s performance with current JS JITs on an in-order Atom core and found a penalty relative to out-of-order cores.

However, I told @js_dev I’d give updated numbers on a more typical out-of-order x86 core like my 2010 MacBook Pro’s i5.

There are a couple interesting things here: Firefox 19 shows substantial Emscripten performance improvements over Firefox 17, and Emscripten output now runs on par with hand-written JavaScript. While JavaScript JITs are still an order of magnitude away from native code performance, Emscripten meets or exceeds the performance of hand-written JavaScript. Progress!

The machine is a 2010 MacBook Pro, Core i5 2.53 GHz, OS X 10.6.

For each compiler, I compiled with -O0, -O1, -O2, -O3, and picked the best result.

Language        Compiler            Variant       Vertex Rate  Slowdown
C++             clang -O2           SSE           100142197    1
C++             gcc -O3             SSE           93109180     1.08
C++             gcc -O3             scalar        60398333     1.66
C++             clang -O2           scalar        58324308     1.72
JavaScript      Chrome 23           untyped       9510489      10.5
Emscripten -O3  Aurora 19.0a2       scalar        7666000      13.1
Emscripten -O3  Firefox 17          scalar        6044000      16.6
JavaScript      Chrome 23           typed arrays  5890000      17
Emscripten -O3  Chrome 25.0 canary  scalar        5733706      17.5
JavaScript      Firefox 17          untyped       5264735      19
JavaScript      Firefox 17          typed arrays  5240000      19.1
Emscripten -O2  Chrome 23           scalar        4586165      21.8
Emscripten -O1  nodejs 0.8.10       scalar        4453109      22.5
Emscripten -O2  nodejs 0.8.10       scalar        1483406      67.5
Emscripten -O3  nodejs 0.8.10       scalar        668796       150

Here are the results for various Emscripten optimization levels:

Browser             Compilation Level  Vertex Rate
Firefox 17          emscripten -O0     2451509
Firefox 17          emscripten -O1     4080000
Firefox 17          emscripten -O2     5146000
Firefox 17          emscripten -O3     6044000
Chrome 23           emscripten -O0     1229754
Chrome 23           emscripten -O1     4152339
Chrome 23           emscripten -O2     4586165
Chrome 23           emscripten -O3     465162
Aurora 19.0a2       emscripten -O0     2062762
Aurora 19.0a2       emscripten -O1     4900000
Aurora 19.0a2       emscripten -O2     6214757
Aurora 19.0a2       emscripten -O3     7666000
Chrome 25.0 canary  emscripten -O0     3001399
Chrome 25.0 canary  emscripten -O1     4410235
Chrome 25.0 canary  emscripten -O2     5482000
Chrome 25.0 canary  emscripten -O3     5733706

I updated the benchmark to automate compiling and running the native and node.js builds.

JavaScript, Emscripten, and the Atom D2700

by Chad Austin
Nov 25 12

Lately I’ve been doing some work with Emscripten. As predicted, the quality of Emscripten’s generated code is improving and JITs are learning to understand its generated code. I have high hopes for asm.js, a formalization of high-performance, low-level JavaScript. I now believe it’s conceivable that Emscripten could approach the same level of performance as PNaCl, though whether that happens remains to be seen.

However, having a rough understanding of how today’s JavaScript JITs work, I’ve always wondered whether Emscripten-generated code would be especially penalized relative to native code on an in-order core like Intel Atom. Having recently built an Intel Atom home server, I figured I’d update my recent Emscripten skinning benchmark results and find out.

First I’ll describe the hardware. The CPU is an Atom D2700 on the Intel D2700DC board. 1066 MHz DDR3 memory. Two cores hyperthreaded. Running Ubuntu 12.04 Server. Firefox and Chromium packages are stock. Node.js and clang 3.1 are x64 Linux binaries downloaded from their respective websites. Emscripten is commit 26250471b46a68204711f037f33790bfb4ba37c7 in the master branch.

Now the results. Remember there are three JavaScript implementations: hand-written JS with untyped arrays and objects (“untyped”), hand-written JS with typed arrays (“typed arrays”), and Emscripten-compiled C++ (“scalar”). Emscripten’s compiler was invoked with -O1; I saw significant performance drop-offs with -O2 and -O3.

Language    Compiler       Variant       Vertex Rate  Slowdown
C++         gcc 4.6.3 -O3  SSE           24040000     1
C++         clang 3.1 -O3  SSE           22530000     1.07
C++         gcc 4.6.3 -O3  scalar        18730000     1.28
C++         clang 3.1 -O3  scalar        13040000     1.84
JavaScript  Chromium 20.0  untyped       3150000      7.63
JavaScript  Firefox 17     typed arrays  2437562      9.86
JavaScript  Firefox 17     untyped       1084577      22.2
Emscripten  Firefox 17     scalar        944333       25.5
JavaScript  Chromium 20.0  typed arrays  807577       29.8
Emscripten  node 0.8.14    scalar        679802       35.4
Emscripten  Chromium 20.0  scalar        677966       35.5

Based on the previous benchmark results and my recent experience with Emscripten, it appears that JITted JavaScript code indeed pays a penalty relative to native code on in-order cores, or at least on the Atom D2700.

Next time I hope to update these benchmarks on a high-end desktop CPU.

As always, if you’d like to reproduce these results or question them, the code is available on my github.

Digging into JavaScript Performance, Part 2

by Chad Austin
Nov 6 11

UPDATE. After I posted these numbers, Alon Zakai, Emscripten’s author, pointed out options for generating optimized JavaScript. I reran my benchmarks; check out the updated table below and the script used to generate the new results.

At the beginning of the year, I tried to justify my claim that JavaScript has a long way to go before it can compete with the performance of native code.

Well, 10 months have passed. WebGL is catching on, Native Client has been launched, Unreal Engine 3 targets Flash 11, and Crytek has announced they might target Flash 11 too. Exciting times!

On the GPU front, we’re in a good place. With WebGL, iOS, and Flash 11 all roughly exposing shader model 2.0, it’s not a ton of work to target all of the above. Even on the desktop you can’t assume higher than shader model 2.0: the Intel GMA 950 is still at the top.

However, shader model 2.0 isn’t general enough to offload all of your compute-intensive workloads to the GPU. With 16 vertex attributes and no vertex texture fetch, you simply can’t get enough data into your vertex shaders to do everything you need, e.g. blending morph targets.

Thus, for the foreseeable future, we’ll need to write fast CPU code that can run on the web, mobile devices, and the desktop. Today, that means at least JavaScript and a native language like C++. And, because Microsoft has not implemented WebGL, the Firefox and Chrome WebGL blacklists are so strict, and no major browsers fall back on software, you probably care about targeting Flash 11 too. (It does have a software fallback!) If you care about Flash 11, then your code had better target ActionScript 3 / AVM2 too.

How can we target native platforms, the web, and Flash at the same time?

Native platforms are easy: C++ is well-supported on Windows, Mac, iOS, and Android. SSE2 is ubiquitous on x86, ARM NEON is widely available, and both have high-quality intrinsics-based implementations.

As for Flash… I’m just counting on Adobe Alchemy to ship.

On the web, you have two choices. Write your code in C++ and cross-compile it to JavaScript with Emscripten or write it in JavaScript and run via your native JavaScript engine. Ideally, cross-compiling C++ to JS via Emscripten would be as fast as writing your code in JavaScript. If it is, then targeting all platforms is easy: just use C++ and the browsers will do as well as they would with native JavaScript.

Over the last two evenings, while weathering a dust storm, I set about updating my skeletal animation benchmark results: for math-heavy code, how does JavaScript compare to C++ today? And how does Emscripten compare to hand-written JavaScript?

If you’d like, take a look at the raw results.

Language    Compiler     Variant       Vertex Rate  Slowdown
C++         clang 2.9    SSE           101580000    1
C++         gcc 4.2      SSE           96420454     1.05
C++         gcc 4.2      scalar        63355501     1.6
C++         clang 2.9    scalar        62928175     1.61
JavaScript  Chrome 15    untyped       10210000     9.95
JavaScript  Firefox 7    typed arrays  8401598      12.1
JavaScript  Chrome 15    typed arrays  5790000      17.5
Emscripten  Chrome 15    scalar        5184815      19.6
JavaScript  Firefox 7    untyped       5104895      19.9
JavaScript  Firefox 9a2  untyped       2005988      50.6
JavaScript  Firefox 9a2  typed arrays  1932271      52.6
Emscripten  Firefox 9a2  scalar        734126       138
Emscripten  Firefox 7    scalar        729270       139

Conclusions?

  • JavaScript is still a factor of 10-20 away from well-written native code. Adding SIMD support to JavaScript will help, but obviously that’s not the whole story…
  • It’s bizarre that Chrome and Firefox disagree on whether typed arrays or untyped arrays are faster.
  • Firefox 9 clearly has performance issues that need to be worked out. I wanted to benchmark its type inference capabilities.
  • Emscripten… ouch :( I wish it were even comparable to hand-written JavaScript, but it’s another factor of 10-20 slower…
  • Emscripten on Chrome 15 is within a factor of two of hand-written JavaScript. I think that means you can target all platforms with C++, because hand-written JavaScript won’t be that much faster than cross-compiled C++.
  • Emscripten on Firefox 7 and 9 still has issues, but Alon Zakai informs me that the trunk version of SpiderMonkey is much faster.

In the future, I’d love to run the same test on Flash 11 / Alchemy and Native Client but the former hasn’t shipped and the latter remains a small market.

One final note: it’s very possible my test methodology is screwed up, my benchmarks are wrong, or I suck at copy/pasting numbers. Science should be reproducible: please try to reproduce these results yourself!

More High Order Bits

by Chad Austin
Oct 25 11

One of our biggest failings as humans is our inability to ignore small, unimportant decisions. Keyword: unimportant. I’m not referring to details that add up to something greater, like the tiny details that sum to make this computer a joy to use.

We would all be well-served by focusing on the high-order bits in our lives.

  • Creativity and Purpose over following the rules
  • Velocity over predictability
  • Reducing Consumption over recycling
  • Scalable Algorithms over optimal machine code
  • Eating less, Exercising more over antioxidants, carbohydrates, high-fructose corn syrup, etc.

I value the items on the right, but I value the items on the left more.

High Order Bits (or: mostly correct but in the right direction)

by Chad Austin
Oct 25 11

First, the moral: Unit tests are good. But reliable design is better.

Even if you have to deal with short-term pain. Even if you haven’t figured out all of the edge cases.

Let me back up. I love automated tests. I’ve been test-driving code at IMVU since I started. We buy new engineers a copy of Test-Driven Development: By Example. Whenever there is a bug, we write tests to make sure it never happens again.

After years of working this way, seeing projects succeed and fail, I’d like to refine my perspective. Let me share a story.

IMVU was originally a bolt-on addition to AOL Instant Messenger. Two IMVU clients communicated with each other by manipulating AOL IM’s UI and scanning the window for new text messages, much like a screen reader would. This architecture propagated some implications through our entire codebase:

1) The messaging layer was inherently unreliable. AOL IM chat windows could be manipulated by the user or other programs. Thus, our chat protocol was built around eventual consistency.

2) We could not depend on an authoritative source of truth. Since text-over-IM is peer-to-peer, no client has a true view into where all of the avatars are sitting or who currently owns the room.

Thus, in 2008, long after we’d dropped support for integration with third-party IM clients and replaced it with an authoritative state database, we continued to have severe state consistency bugs. The client’s architecture still pretended like the chat protocol was unreliable and state was peer-to-peer instead of authoritative.

To address these bugs, we wrote copious test coverage. These were deep tests: start up a local Apache and MySQL instance, connect a couple ClientApp Python processes to them, have one invite another to chat, and assert that their scene graphs were consistent. We have dozens of these tests, for all kinds of edge cases. And we thought we’d fixed the bugs for good…

But the bugs returned. These tests are still running and passing, mind you, but differences in timing and sequencing result in the same state consistency issues we saw in 2008. It’s clear that test coverage is not sufficient to prevent these types of bugs.

So what’s the other ingredient in reliable software? I argue that, in agile software development, correct-by-design systems are underemphasized.

Doesn’t Test Driven Development guide me to build correct-by-design systems?

TDD prescribes a “red, green, refactor” rhythm, where you write a failing test, do the simplest work to make it pass, and then refactor the code so it’s high quality. TDD helps you reach the “I haven’t seen it fail” stage, by verifying that yes, your code can pass these tests. But just because you’ve written some tests doesn’t mean your code will always work.

So there’s another level of reliability: “I have considered the ways it can fail, but I can’t think of any.” This statement is stronger, assuming you’re sufficiently imaginative. Even still, you won’t think of everything, especially if you’re working at the edge of your human capacity (as you should be).

Nonetheless, thoughtfulness is better than nothing. I recommend adding a fourth step to your TDD rhythm: “Red, green, refactor, what else could go wrong?” In that fourth step, you deeply examine the code and think of additional tests to write.

The strongest level of software correctness is not about finding possible failure conditions; it’s about proving that your system works in the presence of all inputs. Correctness proofs for non-trivial algorithms are too challenging for all of the code we write, but in a critical subsystem like chat state management, the time spent on a lightweight proof will easily pay for itself. Again, I’m not advocating that we always prove the correctness of our software, but we should at least generally be convinced of its correctness and investigate facts that indicate otherwise. TDD by itself is not enough.

OK, so we can’t easily test-drive or refactor our way out of the chat system mess we got ourselves into, because it’s simply too flawed, so what can we do? The solution is especially tricky, because in situations like this, there are always features that depend on subtleties of the poor design. A rewrite would break those features, which is unacceptable, right? Even if breaking those features is acceptable to the company, there are political challenges. Imagine the look on your product owner’s face when you announce “Hey I have a new architecture that will break your feature but provide no customer benefit yet.”

The ancient saying “You can’t make an omelette without breaking some eggs” applies directly here. Preserving 100% feature compatibility is less important than fixing deep flaws.

Why? High-order bits are hardest to change, but in the end, are all that matters. The low-order bits are easy to change, and any competent organization will fix the small things over time.

I can’t help but recall the original iPhone. Everyone said “What?! No copy and paste?!” Indeed, the iPhone couldn’t copy and paste until 18 months and two major OS releases later. Even still, the iPhone reshaped the mobile industry. Clearly 100% feature compatibility is not a requirement for success.

My attitude towards unit testing has changed. While I write, run, and love unit testing, I value correct-by-design subsystems even more. When it comes down to it, tests are low-order bits compared to code that just doesn’t break.

For those curious, what are we doing about the chat system? I’ll let Jon’s GDC presentation speak for itself.