Single-Allocation JSON Parser

Over the holiday break, as a mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation, poor
cache locality, and thread contention are the common arguments
against it. But I focused on a different issue: what is the worst-case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case.

Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.

Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming DOM-style parser in
that it translates a complete buffer of characters into a contiguous
parse tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.

sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.

For example, the JSON text…

[null, 0, ["foo"]]

… would parse into…

<Array>
3 # length
5 # offset to first element
6 # offset to second element
9 # offset to third element
<Null>
<Number>
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
<Array>
1 # length
3 # offset to first element
<String>
12 # offset into source document of string start
15 # offset into source document of string end

… where each line is a pointer-sized (aka size_t) value and a name in <> represents a named type constant.

For the above representation, the parse tree’s worst-case size is
sizeof(size_t) * input_length * 2. I won’t derive that
here, but the worst-case document is a list of single-digit numbers:

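[0,0,0,0] # 9 characters

# 9*2 = 18 'slots' of structure
<Array>
4 # length
6 # relative offset to first element
9 # relative offset to second element
12 # relative offset to third element
15 # relative offset to fourth element
<Number>
0 # first 32 bits of IEEE double value
0 # second 32 bits of value
# ... followed by three more <Number> entries of three slots each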

But we can do better!

Using a full size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)
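
A minimal sketch of that packing, assuming the tag sits in the low three bits of each element reference and the offset in the remaining bits. The text above does not say which bits sajson actually uses, and the tag values and function names here are made up for illustration:

```python
TYPE_BITS = 3
TYPE_MASK = (1 << TYPE_BITS) - 1

# Hypothetical tag values; any seven distinct values below 8 would do.
TAG_NULL, TAG_FALSE, TAG_TRUE, TAG_NUMBER, TAG_STRING, TAG_ARRAY, TAG_OBJECT = range(7)


def make_element(tag, offset, word_bits=32):
    """Pack a type tag and an offset into one word-sized integer."""
    if not 0 <= tag <= TYPE_MASK:
        raise ValueError("tag does not fit in three bits")
    if not 0 <= offset < (1 << (word_bits - TYPE_BITS)):
        raise ValueError("offset does not fit in the remaining bits")
    return (offset << TYPE_BITS) | tag


def element_tag(element):
    """Recover the type tag from a packed element reference."""
    return element & TYPE_MASK


def element_offset(element):
    """Recover the offset from a packed element reference."""
    return element >> TYPE_BITS
```

Packing an array reference at offset 9 and unpacking it again returns the same tag and offset, at the cost of three bits of offset range.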

A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.

Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.

[0,0,0,0] # 9 characters

# root bit determines root is array
4 # length of array

Let’s quickly check another example:

[[[[]]]] # 8 characters

# root bit determines root is array

With the above changes, the parse tree size is cut in half! It now
fits in sizeof(size_t) * input_length.
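
To sanity-check those bounds, here is a rough slot counter for both layouts. It is a sketch under assumptions: slots are 32 bits wide as in the listings above (so a double takes two), null/true/false need no payload once their tag rides in the parent's element reference, and objects are left out because their layout isn't described here.

```python
def first_attempt_slots(value):
    """Slots used when every value is a type tag followed by its payload."""
    if value is None or isinstance(value, bool):
        return 1  # tag only
    if isinstance(value, (int, float)):
        return 1 + 2  # tag + IEEE double split across two 32-bit slots
    if isinstance(value, str):
        return 1 + 2  # tag + start and end offsets into the source
    if isinstance(value, list):
        # tag + length + one offset per element + the elements themselves
        return 2 + len(value) + sum(first_attempt_slots(v) for v in value)
    raise TypeError("objects are not covered by this sketch")


def tagged_slots(value):
    """Slots used when the tag is folded into the parent's element reference."""
    if value is None or isinstance(value, bool):
        return 0  # the tag in the reference says everything
    if isinstance(value, int):
        return 1  # <Integer> payload
    if isinstance(value, float):
        return 2  # double split across two 32-bit slots
    if isinstance(value, str):
        return 2  # start and end offsets into the source
    if isinstance(value, list):
        # length + one tagged reference per element + the element payloads
        return 1 + len(value) + sum(tagged_slots(v) for v in value)
    raise TypeError("objects are not covered by this sketch")


flat = [0, 0, 0, 0]      # "[0,0,0,0]" is 9 characters
nested = [[[[]]]]        # "[[[[]]]]" is 8 characters
print(first_attempt_slots(flat), tagged_slots(flat))      # 18 9
print(first_attempt_slots(nested), tagged_slots(nested))  # 11 7
```

Under these assumptions the flat array of single digits hits the bound exactly in both layouts, while the nested array stays under it.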

Next time I’ll describe the challenges in building said parse tree
without a-priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three
characters are "[[[".

Emscripten Results: Firefox 19 shows dramatic improvement

Last time, we looked at Emscripten’s performance with current JS JITs on an in-order Atom core and found a penalty relative to out-of-order cores.

However, I told @js_dev I’d give updated numbers on a more typical out-of-order x86 core like my 2010 MacBook Pro’s i5.

Two things stand out here. First, Firefox 19 shows substantial Emscripten performance improvements over Firefox 17, which was itself already on par with hand-written JavaScript. Second, while JavaScript JITs are still an order of magnitude away from native code performance, Emscripten’s performance now meets or exceeds that of hand-written JavaScript. Progress!

The machine is a 2010 MacBook Pro, Core i5 2.53 GHz, OS X 10.6.

For each compiler, I compiled with -O0, -O1, -O2, -O3, and picked the best result.

| Language | Compiler | Variant | Vertex Rate | Slowdown |
| --- | --- | --- | --- | --- |
| C++ | clang -O2 | SSE | 100142197 | 1 |
| C++ | gcc -O3 | SSE | 93109180 | 1.08 |
| C++ | gcc -O3 | scalar | 60398333 | 1.66 |
| C++ | clang -O2 | scalar | 58324308 | 1.72 |
| JavaScript | Chrome 23 | untyped | 9510489 | 10.5 |
| Emscripten -O3 | Aurora 19.0a2 | scalar | 7666000 | 13.1 |
| Emscripten -O3 | Firefox 17 | scalar | 6044000 | 16.6 |
| JavaScript | Chrome 23 | typed arrays | 5890000 | 17 |
| Emscripten -O3 | Chrome 25.0 canary | scalar | 5733706 | 17.5 |
| JavaScript | Firefox 17 | untyped | 5264735 | 19 |
| JavaScript | Firefox 17 | typed arrays | 5240000 | 19.1 |
| Emscripten -O2 | Chrome 23 | scalar | 4586165 | 21.8 |
| Emscripten -O1 | nodejs 0.8.10 | scalar | 4453109 | 22.5 |
| Emscripten -O2 | nodejs 0.8.10 | scalar | 1483406 | 67.5 |
| Emscripten -O3 | nodejs 0.8.10 | scalar | 668796 | 150 |
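
The Slowdown column is consistent with dividing the fastest vertex rate by each row's rate and rounding to three significant figures. A quick sketch to check a few rows (the function name is just for illustration):

```python
def slowdown(fastest_rate, rate):
    """Ratio of the fastest vertex rate to this row's rate, to 3 significant figures."""
    return format(fastest_rate / rate, ".3g")


FASTEST = 100142197  # C++ clang -O2 SSE

print(slowdown(FASTEST, 93109180))  # 1.08  (C++ gcc -O3 SSE)
print(slowdown(FASTEST, 7666000))   # 13.1  (Emscripten -O3, Aurora 19.0a2)
print(slowdown(FASTEST, 668796))    # 150   (Emscripten -O3, nodejs 0.8.10)
```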

Here are the results for various Emscripten optimization levels:

| Browser | Compilation Level | Vertex Rate |
| --- | --- | --- |
| Firefox 17 | emscripten -O0 | 2451509 |
| Firefox 17 | emscripten -O1 | 4080000 |
| Firefox 17 | emscripten -O2 | 5146000 |
| Firefox 17 | emscripten -O3 | 6044000 |
| Chrome 23 | emscripten -O0 | 1229754 |
| Chrome 23 | emscripten -O1 | 4152339 |
| Chrome 23 | emscripten -O2 | 4586165 |
| Chrome 23 | emscripten -O3 | 465162 |
| Aurora 19.0a2 | emscripten -O0 | 2062762 |
| Aurora 19.0a2 | emscripten -O1 | 4900000 |
| Aurora 19.0a2 | emscripten -O2 | 6214757 |
| Aurora 19.0a2 | emscripten -O3 | 7666000 |
| Chrome 25.0 canary | emscripten -O0 | 3001399 |
| Chrome 25.0 canary | emscripten -O1 | 4410235 |
| Chrome 25.0 canary | emscripten -O2 | 5482000 |
| Chrome 25.0 canary | emscripten -O3 | 5733706 |

I updated the benchmark to automate compiling and running the native and node.js builds.

JavaScript, Emscripten, and the Atom D2700

Lately I’ve been doing some work with Emscripten. As predicted, the quality of Emscripten’s generated code is improving and JITs are learning to understand its generated code. I have high hopes for asm.js, a formalization of high-performance, low-level JavaScript. I now believe it’s conceivable that Emscripten could approach the same level of performance as PNaCl, though whether that happens remains to be seen.

However, having a rough understanding of how today’s JavaScript JITs work, I’ve always wondered whether Emscripten-generated code would be especially penalized relative to native code on an in-order core like Intel Atom. Having recently built an Intel Atom home server, I figured I’d update my recent Emscripten skinning benchmark results and find out.

First I’ll describe the hardware. The CPU is an Atom D2700 on the Intel D2700DC board. 1066 MHz DDR3 memory. Two cores hyperthreaded. Running Ubuntu 12.04 Server. Firefox and Chromium packages are stock. Node.js and clang 3.1 are x64 Linux binaries downloaded from their respective websites. Emscripten is commit 26250471b46a68204711f037f33790bfb4ba37c7 in the master branch.

Now the results. Remember there are three JavaScript implementations: hand-written JS with untyped arrays and objects (“untyped”), hand-written JS with typed arrays (“typed arrays”), and Emscripten-compiled C++ (“scalar”). Emscripten’s compiler was invoked with -O1. I saw significant performance drop-offs with -O2 and -O3.

| Language | Compiler | Variant | Vertex Rate | Slowdown |
| --- | --- | --- | --- | --- |
| C++ | gcc 4.6.3 -O3 | SSE | 24040000 | 1 |
| C++ | clang 3.1 -O3 | SSE | 22530000 | 1.07 |
| C++ | gcc 4.6.3 -O3 | scalar | 18730000 | 1.28 |
| C++ | clang 3.1 -O3 | scalar | 13040000 | 1.84 |
| JavaScript | Chromium 20.0 | untyped | 3150000 | 7.63 |
| JavaScript | Firefox 17 | typed arrays | 2437562 | 9.86 |
| JavaScript | Firefox 17 | untyped | 1084577 | 22.2 |
| Emscripten | Firefox 17 | scalar | 944333 | 25.5 |
| JavaScript | Chromium 20.0 | typed arrays | 807577 | 29.8 |
| Emscripten | node 0.8.14 | scalar | 679802 | 35.4 |
| Emscripten | Chromium 20.0 | scalar | 677966 | 35.5 |

Based on the previous benchmark results and my recent experience with Emscripten, it appears that JavaScript JITted code does indeed pay a penalty relative to native code on in-order cores, or at least on the Atom D2700.

Next time I hope to update these benchmarks on a high-end desktop CPU.

As always, if you’d like to reproduce these results or question them, the code is available on my github.

Digging into JavaScript Performance, Part 2

UPDATE. After I posted these numbers, Alon Zakai, Emscripten’s author, pointed out options for generating optimized JavaScript. I reran my benchmarks; check out the updated table below and the script used to generate the new results.

At the beginning of the year, I tried to justify my claim that JavaScript has a long way to go before it can compete with the performance of native code.

Well, 10 months have passed. WebGL is catching on, Native Client has been launched, Unreal Engine 3 targets Flash 11, and Crytek has announced they might target Flash 11 too. Exciting times!

On the GPU front, we’re in a good place. With WebGL, iOS, and Flash 11 all roughly exposing shader model 2.0, it’s not a ton of work to target all of the above. Even on the desktop you can’t assume higher than shader model 2.0: the Intel GMA 950 is still at the top.

However, shader model 2.0 isn’t general enough to offload all of your compute-intensive workloads to the GPU. With 16 vertex attributes and no vertex texture fetch, you simply can’t get enough data into your vertex shaders to do everything you need, e.g. blending morph targets.

Thus, for the foreseeable future, we’ll need to write fast CPU code that can run on the web, mobile devices, and the desktop. Today, that means at least JavaScript and a native language like C++. And, because Microsoft has not implemented WebGL, the Firefox and Chrome WebGL blacklists are so strict, and no major browsers fall back on software, you probably care about targeting Flash 11 too. (It does have a software fallback!) If you care about Flash 11, then your code had better target ActionScript 3 / AVM2 too.

How can we target native platforms, the web, and Flash at the same time?

Native platforms are easy: C++ is well-supported on Windows, Mac, iOS, and Android. SSE2 is ubiquitous on x86, ARM NEON is widely available, and both have high-quality intrinsics-based implementations.

As for Flash… I’m just counting on Adobe Alchemy to ship.

On the web, you have two choices. Write your code in C++ and cross-compile it to JavaScript with Emscripten or write it in JavaScript and run via your native JavaScript engine. Ideally, cross-compiling C++ to JS via Emscripten would be as fast as writing your code in JavaScript. If it is, then targeting all platforms is easy: just use C++ and the browsers will do as well as they would with native JavaScript.

Over the last two evenings, while weathering a dust storm, I set about updating my skeletal animation benchmark results: for math-heavy code, how does JavaScript compare to C++ today? And how does Emscripten compare to hand-written JavaScript?

If you’d like, take a look at the raw results.

| Language | Compiler | Variant | Vertex Rate | Slowdown |
| --- | --- | --- | --- | --- |
| C++ | clang 2.9 | SSE | 101580000 | 1 |
| C++ | gcc 4.2 | SSE | 96420454 | 1.05 |
| C++ | gcc 4.2 | scalar | 63355501 | 1.6 |
| C++ | clang 2.9 | scalar | 62928175 | 1.61 |
| JavaScript | Chrome 15 | untyped | 10210000 | 9.95 |
| JavaScript | Firefox 7 | typed arrays | 8401598 | 12.1 |
| JavaScript | Chrome 15 | typed arrays | 5790000 | 17.5 |
| Emscripten | Chrome 15 | scalar | 5184815 | 19.6 |
| JavaScript | Firefox 7 | untyped | 5104895 | 19.9 |
| JavaScript | Firefox 9a2 | untyped | 2005988 | 50.6 |
| JavaScript | Firefox 9a2 | typed arrays | 1932271 | 52.6 |
| Emscripten | Firefox 9a2 | scalar | 734126 | 138 |
| Emscripten | Firefox 7 | scalar | 729270 | 139 |


  • JavaScript is still a factor of 10-20 away from well-written native code. Adding SIMD support to JavaScript will help, but obviously that’s not the whole story…
  • It’s bizarre that Chrome and Firefox disagree on whether typed arrays are faster than untyped ones.
  • Firefox 9 clearly has performance issues that need to be worked out. I wanted to benchmark its type inference capabilities.
  • Emscripten… ouch :( I wish it were even comparable to hand-written JavaScript, but it’s another factor of 10-20 slower…
  • Emscripten on Chrome 15 is within a factor of two of hand-written JavaScript. I think that means you can target all platforms with C++, because hand-written JavaScript won’t be that much faster than cross-compiled C++.
  • Emscripten on Firefox 7 and 9 still has issues, but Alon Zakai informs me that the trunk version of SpiderMonkey is much faster.

In the future, I’d love to run the same test on Flash 11 / Alchemy and Native Client but the former hasn’t shipped and the latter remains a small market.

One final note: it’s very possible my test methodology is screwed up, my benchmarks are wrong, or I suck at copy/pasting numbers. Science should be reproducible: please try to reproduce these results yourself!

Tracing Leaks in Python: Find the Nearest Root

Garbage Collection Doesn’t Mean You Can Ignore Memory Altogether…

This post is available on the IMVU Engineering Blog.

Garbage collection removes a great deal of burden from programming. In fact, garbage collection is a critical language feature for all languages where abstractions such as functional closures or coroutines are common, as they frequently create reference cycles.

IMVU is a mix of C++ and Python. The C++ code generally consists of small, cohesive objects with a clear ownership chain. An Avatar SceneObject owns a ModelInstance which owns a set of Meshes which own Materials which own Textures and so on… Since there are no cycles in this object graph, reference-counting with shared_ptr suffices.

The Python code, however, is full of messy object cycles. An asynchronous operation may hold a reference to a Room, while the Room may be holding a reference to the asynchronous operation. Often two related objects will be listening for events from the other. While Python’s garbage collector will happily take care of cycles, it’s still possible to leak objects.

Imagine these scenarios:

  • a leaked or living C++ object has a strong reference to a Python object.
  • a global cache has a reference to an instance’s bound method, which implicitly contains a reference to the instance.
  • two objects with __del__ methods participate in a cycle with each other, and Python refuses to decide which should destruct first.

To detect these types of memory leaks, we use a LifeTimeMonitor utility:

a = SomeObject()
lm = LifeTimeMonitor(a)
del a
lm.assertDead() # succeeds

b = SomeObject()
lm = LifeTimeMonitor(b)
lm.assertDead() # raises ObjectNotDead

We use LifeTimeMonitor’s assertDead facility at key events, such as when a user closes a dialog box or 3D window. Take 3D windows as an example. Since they’re the root of an entire object subgraph, we would hate to inadvertently leak them. LifeTimeMonitor’s assertDead prevents us from introducing an object leak.
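
LifeTimeMonitor’s implementation isn’t shown here. As a rough sketch of how such a utility could work, assuming a weak reference plus a forced collection is enough (the real class may well differ):

```python
import gc
import weakref


class ObjectNotDead(Exception):
    """Raised when a monitored object is still alive."""


class LifeTimeMonitor:
    def __init__(self, obj):
        # A weak reference observes the object without keeping it alive.
        self._ref = weakref.ref(obj)

    def assertDead(self):
        # Collect cycles first so only genuinely reachable objects survive.
        gc.collect()
        survivor = self._ref()
        if survivor is not None:
            raise ObjectNotDead(repr(survivor))
```

This only works for objects that support weak references, which covers instances of ordinary Python classes.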

It’s good to know that an object leaked, but how can you determine why it can’t be collected?

Python’s Garbage Collection Algorithm

Let’s go over the basics of automatic garbage collection. In a garbage-collected system there are objects, and objects can reference each other. Some objects are roots: a root is always considered live, and any object reachable from a root cannot be collected. Example roots are the stacks of live threads and the global module list. The graph formed by objects and their references is the object graph.

In SpiderMonkey, Mozilla’s JavaScript engine, the root set is explicitly-managed. SpiderMonkey’s GC traverses the object graph from the root set. If the GC does not reach an object, that object is destroyed. If C code creates a root object but fails to add it to the root set, it risks the GC deallocating the object while it’s still in use.

In Python however, the root set is implicit. All Python objects are ref-counted, and any that can refer to other objects — and potentially participate in an object cycle — are added to a global list upon construction. Each GC-tracked object can be queried for its referents. Python’s root set is implicit because anyone can create a root simply by incrementing an object’s refcount.

Since Python’s root set is implicit, its garbage collection algorithm differs slightly from SpiderMonkey’s. Python begins by setting GCRefs(o) to CurrentRefCount(o) for each GC-tracked PyObject o. Then it traverses all referents r of all GC-tracked PyObjects and subtracts 1 from GCRefs(r) for each reference it finds. After that pass, if GCRefs(o) is still nonzero, o is referenced from somewhere the GC cannot see (an unknown referrer), and is thus a root. Python then traverses the object graph from the now-known root set and increments GCRefs(o) for every object it reaches. If any object o remains where GCRefs(o) == 0, that object is unreachable and thus collectible.
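
The steps above can be sketched on a toy object graph. This is a simulation, not CPython’s implementation: objects are just names, reference counts are supplied by hand, and the final pass records reachability in a set rather than incrementing GCRefs.

```python
def find_roots_and_garbage(refcounts, referents):
    """refcounts: {name: total refcount}; referents: {name: [names it refers to]}."""
    # Step 1: GCRefs(o) starts as the object's current reference count.
    gc_refs = dict(refcounts)

    # Step 2: subtract every reference that comes from a tracked object.
    for owner in referents:
        for r in referents[owner]:
            gc_refs[r] -= 1

    # Step 3: anything still above zero has an unknown referrer, so it is a root.
    roots = {o for o, n in gc_refs.items() if n > 0}

    # Step 4: everything reachable from a root is live.
    reachable = set()
    pending = list(roots)
    while pending:
        o = pending.pop()
        if o not in reachable:
            reachable.add(o)
            pending.extend(referents.get(o, []))

    return roots, set(gc_refs) - reachable


# "room" and "op" form a cycle, and something outside the graph also holds "room".
# "x" and "y" form a cycle that nothing else refers to.
refcounts = {"room": 2, "op": 1, "x": 1, "y": 1}
referents = {"room": ["op"], "op": ["room"], "x": ["y"], "y": ["x"]}
roots, garbage = find_roots_and_garbage(refcounts, referents)
print(sorted(roots), sorted(garbage))  # ['room'] ['x', 'y']
```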

Finding a Path From the Nearest Root to the Leaked Object

Now that we know how Python’s garbage collector works, we can ask it for its set of roots by calculating GCRefs(o) for all objects o in gc.get_objects(). Then we perform a breadth-first-search from the root set to the leaked object. If the root set directly or indirectly refers to the leaked object, we return the path our search took.

Sounds simple, but there’s a catch! Imagine that the search function has signature:

PyObject* findPathToNearestRoot(PyObject* leakedObject);

leakedObject is a reference (incremented within Python’s function-call machinery itself) to the leaked object, making leakedObject a root!

To work around this, change findPathToNearestRoot so it accepts a singleton list containing a reference to the leaked object. findPathToNearestRoot can borrow that reference and clear the list, ensuring that leakedObject has no untracked references.

findPathToNearestRoot will find paths to expected Python roots like thread entry points and module objects. But, since it directly mirrors the behavior of Python’s GC, it will also find paths to leaked C references! Obviously, it can’t directly point you to the C code that leaked the reference, but the reference path should be enough of a clue to figure it out.
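
The search itself can be sketched in pure Python with gc.get_referents, which calls the same tp_traverse slot. This sketch takes the roots as an argument instead of computing them from reference counts, so it does not have to deal with the extra references the search holds; find_path is an illustrative name, not part of the C++ code below.

```python
import gc
from collections import deque


def find_path(roots, target):
    """Breadth-first search from roots to target; returns the path or None."""
    seen = {id(r) for r in roots}
    queue = deque([r] for r in roots)  # each entry is the path taken so far
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node is target:
            return path
        for referent in gc.get_referents(node):
            if id(referent) not in seen:
                seen.add(id(referent))
                queue.append(path + [referent])
    return None


class Room(object):
    pass


class Operation(object):
    pass


room = Room()
op = Operation()
room.op = op                # the Room refers to the operation
op.room = room              # and the operation refers back
cache = {"lobby": room}     # a global cache keeps the whole cycle alive

path = find_path([cache], op)
print([type(p).__name__ for p in path])
```

The printed path starts at the cache dict and ends at the Operation. Whether an instance __dict__ appears between the Room and the Operation depends on the Python version.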

The Code

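// The listing below uses Boost.Python (object, handle<>, borrowed, list, len).
#include <map>
#include <set>
#include <utility>
#include <vector>
#include <boost/python.hpp>

using namespace boost::python;

// Calls visit(referent, arg) for each referent of o, if o supports traversal.
template<typename ArgType>
void traverse(PyObject* o, int (*visit)(PyObject* visitee, ArgType* arg), ArgType* arg) {
    if (Py_TYPE(o)->tp_traverse) {
        Py_TYPE(o)->tp_traverse(o, (visitproc)visit, arg);
    }
}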

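typedef std::map<PyObject*, int> GCRefs;

// tp_traverse callback: a reference held by a GC-tracked object is a known
// reference, so subtract it from the visitee's count.
static int subtractKnownReferences(PyObject* visitee, GCRefs* gcrefs) {
    if (gcrefs->count(visitee)) {
        --(*gcrefs)[visitee];
    }
    return 0;
}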

typedef int Backlink; // -1 = none

typedef std::vector< std::pair<Backlink, PyObject*> > ReferenceList;
struct Referents {
    std::set<PyObject*>& seen;
    Backlink backlink;
    ReferenceList& referenceList;
};

// tp_traverse callback: queue each unseen, GC-tracked referent for the next
// search level, remembering which entry of the current level referred to it.
static int addReferents(PyObject* visitee, Referents* referents) {
    if (!referents->seen.count(visitee) && PyObject_IS_GC(visitee)) {
        referents->referenceList.push_back(std::make_pair(referents->backlink, visitee));
    }
    return 0;
}

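// Searches one breadth-first level. Returns the backlink of the entry in
// roots that leads to goal, or -1 if goal is unreachable. On success, chain
// is filled from goal back toward the root as the recursion unwinds.
static Backlink findNextLevel(
    std::vector<PyObject*>& chain,
    const ReferenceList& roots,
    PyObject* goal,
    std::set<PyObject*>& seen
) {
    if (roots.empty()) {
        return -1;
    }

    for (size_t i = 0; i < roots.size(); ++i) {
        if (roots[i].first != -1) {
            if (goal == roots[i].second) {
                // Assumed bookkeeping: start the chain at the goal.
                chain.push_back(goal);
                return roots[i].first;
            }
        }
        // Assumed bookkeeping: mark this level visited so cycles cannot requeue it.
        seen.insert(roots[i].second);
    }

    ReferenceList nextLevel;
    for (size_t i = 0; i < roots.size(); ++i) {
        Referents referents = {seen, static_cast<Backlink>(i), nextLevel};
        traverse(roots[i].second, &addReferents, &referents);
    }

    Backlink backlink = findNextLevel(chain, nextLevel, goal, seen);
    if (backlink == -1) {
        return -1;
    }

    // Assumed bookkeeping: record the referrer found at this level.
    chain.push_back(roots[backlink].second);
    return roots[backlink].first;
}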

static std::vector<PyObject*> findReferenceChain(
    const std::vector<PyObject*>& roots,
    PyObject* goal
) {
    std::set<PyObject*> seen;
    ReferenceList unknownReferrer;
    for (size_t i = 0; i < roots.size(); ++i) {
        unknownReferrer.push_back(std::make_pair(Backlink(-1), roots[i]));
    }
    std::vector<PyObject*> rv;
    // going to return -1 no matter what: no backlink from roots
    findNextLevel(rv, unknownReferrer, goal, seen);
    return rv;
}

static object findPathToNearestRoot(const object& o) {
    if (!PyList_Check(o.ptr()) || PyList_GET_SIZE(o.ptr()) != 1) {
        PyErr_SetString(PyExc_TypeError, "findNearestRoot must take a list of length 1");
        throw_error_already_set(); // assumed error path
    }

    // target = o.pop()
    object target(handle<>(borrowed(PyList_GET_ITEM(o.ptr(), 0))));
    if (-1 == PyList_SetSlice(o.ptr(), 0, 1, 0)) {
        throw_error_already_set(); // assumed error path
    }

    object gc_module(handle<>(PyImport_ImportModule("gc")));
    object tracked_objects_list = gc_module.attr("get_objects")();
    // allocating the returned list may have run a GC, but tracked_objects won't be in the list

    std::vector<PyObject*> tracked_objects(len(tracked_objects_list));
    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        object to = tracked_objects_list[i];
        tracked_objects[i] = to.ptr();
    }
    tracked_objects_list = object();

    GCRefs gcrefs;
    // TODO: store allocation/gc count per generation

    // GCRefs(o) = CurrentRefCount(o)
    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        gcrefs[tracked_objects[i]] = tracked_objects[i]->ob_refcnt;
    }

    // Subtract every reference that originates from a tracked object.
    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        traverse(tracked_objects[i], subtractKnownReferences, &gcrefs);
    }

    // BFS time
    std::vector<PyObject*> roots;
    for (GCRefs::const_iterator i = gcrefs.begin(); i != gcrefs.end(); ++i) {
        if (i->second && i->first != target.ptr()) { // Don't count the target as a root.
            roots.push_back(i->first);
        }
    }
    std::vector<PyObject*> chain = findReferenceChain(roots, target.ptr());

    // TODO: assert that allocation/gc count per generation didn't change

    list rv;
    for (size_t i = 0; i < chain.size(); ++i) {
        // Assumed conversion: hand each chain entry back to Python.
        rv.append(object(handle<>(borrowed(chain[i]))));
    }

    return rv;
}

How to Write an Interactive, 60 Hz Desktop Application

This post is available on the IMVU Engineering Blog.

IMVU’s client application doesn’t fit neatly into a single development paradigm:

  • IMVU is a Windows desktop application. Mouse clicks, window resizes, and dialog boxes must all respond with imperceptible latency. Running IMVU should not significantly affect laptop battery life.
  • IMVU is an interactive 3D game. The 3D scene must be simulated and drawn at smooth, interactive frame rates, 60 Hz if possible.
  • IMVU is a networked application. Sending and receiving network packets must happen quickly and the UI should never have to wait for I/O.

Thus, let us clarify some specific requirements:

  • Minimal CPU usage (and thus battery consumption) when the application is minimized or obscured.
  • Minimal CPU usage in low-complexity scenes. Unlike most games, IMVU must never unnecessarily consume battery life while waiting in spin loops.
  • Animation must continue while modal dialog boxes and menus are visible. You don’t have control over these modal event loops, but it looks terrible if animation pauses while menus and dialogs are visible.
  • Animation must be accurate and precise. It looks much better if every frame takes 22 milliseconds (45 Hz) than if some frames take 30 milliseconds and some take 15 milliseconds (averaging 45 Hz).
  • Animation must degrade gracefully. In a really complex room with a dozen avatars, IMVU can easily spend all of a core’s CPU trying to animate the scene. In this case, the frame rate should gradually drop while the application remains responsive to mouse clicks and other input events.
  • Support for Windows XP, Vista, and 7.

Naive Approach #1

Windows applications typically have a main loop that looks something like:

MSG msg;
while (GetMessage(&msg, 0, 0, 0) > 0) {
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}

What went wrong

Using SetTimer/WM_TIMER sounds like a good idea for simulation and painting, but it’s way too imprecise for interactive applications.

Naive Approach #2

Games typically have a main loop that looks something like the following:

while (running) {
    // process input events
    MSG msg;
    while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    if (frame_interval_has_elapsed) {
        // simulate the world and paint a frame
    }
}

What went wrong

The above loop never sleeps, draining the user’s battery and burning her legs.

Clever Approach #1: Standard Event Loop + timeSetEvent

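void runMainLoop() {
    MSG msg;
    while (GetMessage(&msg, 0, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
}

void customWindowProc(...) {
    if (message == timerMessage) {
        // step the simulation here
        // schedules paint with InvalidateRect
        // assumed: clear the flag so TimerProc can post the next tick
        InterlockedExchange(&inFlight, 0);
    }
}

// timeSetEvent callback; runs on the multimedia timer thread
void CALLBACK TimerProc(...) {
    if (0 == InterlockedExchange(&inFlight, 1)) {
        PostMessage(frameTimerWindow, timerMessage, 0, 0);
    }
}

void startFrameTimer() {
    RegisterClass(customWindowProc, ...);
    frameTimerWindow = CreateWindow(...);
    timeSetEvent(FRAME_INTERVAL, 0, &TimerProc, 0, TIME_PERIODIC);
}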

What went wrong

The main loop’s GetMessage call always returns messages in a priority order. Slightly oversimplified, posted messages come first, then WM_PAINT messages, then WM_TIMER. Since timerMessage is a normal message, it will preempt any scheduled paints. This would be fine for us, since simulations are cheap, but the dealbreaker is that if we fail to maintain frame rate, WM_TIMER messages are entirely starved. This violates our graceful degradation requirement. When frame rate begins to degrade, code dependent on WM_TIMER shouldn’t stop entirely.

Even worse, the modal dialog loop has a freaky historic detail. It waits for the message queue to be empty before displaying modal dialogs. When painting can’t keep up, modal dialogs simply don’t appear.

We tried a bunch of variations, setting flags when stepping or painting, but they all had critical flaws. Some continued to starve timers and dialog boxes and some degraded by ping-ponging between 30 Hz and 15 Hz, which looked terrible.

Clever Approach #2: PostThreadMessage + WM_ENTERIDLE

A standard message loop didn’t seem to be getting us anywhere, so we changed our timeSetEvent callback to PostThreadMessage a custom message to the main loop, which knew how to handle it. Messages sent via PostThreadMessage don’t go to a window, so the event loop needs to process them directly. Since DialogBox and TrackPopupMenu modal loops won’t understand this custom message, we fell back on a different mechanism for them.

DialogBox and TrackPopupMenu send WM_ENTERIDLE to their owning windows. Any window in IMVU that can host a dialog box or popup menu handles WM_ENTERIDLE by notifying a global idle handler, which can decide to schedule a new frame immediately or in N milliseconds, depending on how much time has elapsed.

What went wrong

So close! In our testing under realistic workloads, timeSetEvent had horrible pauses and jitter. Sometimes the multimedia thread would go 250 ms between notifications. Otherwise, the custom event loop + WM_ENTERIDLE approach seemed sound. I tried timeSetEvent with several flags, but they all had accuracy and precision problems.

What Finally Worked

Finally, we settled on MsgWaitForMultipleObjects with a calculated timeout.

Assuming the existence of a FrameTimeoutCalculator object which returns the number of milliseconds until the next frame:

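int runApp() {
    FrameTimeoutCalculator ftc;

    for (;;) {
        // Sleep until the next frame is due or a message arrives.
        const DWORD timeout = ftc.getTimeout();
        DWORD result = (timeout
            ? MsgWaitForMultipleObjects(0, 0, TRUE, timeout, QS_ALLEVENTS)
            : WAIT_TIMEOUT);
        if (result == WAIT_TIMEOUT) {
            // Frame deadline reached: step the simulation and paint here.
        }

        MSG msg;
        while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) {
            if (msg.message == WM_QUIT) {
                return static_cast<int>(msg.wParam);
            }
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }
}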
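
FrameTimeoutCalculator itself isn’t shown. One way it might compute its timeout, sketched in Python for brevity: keep a deadline for the next frame, report the milliseconds remaining, and never schedule a deadline in the past. The method names, the injectable clock, and the catch-up policy are all assumptions, not the actual IMVU class.

```python
import math
import time


class FrameTimeoutCalculator:
    def __init__(self, frame_interval_ms=1000.0 / 60.0, clock=None):
        # clock returns the current time in milliseconds; injectable for testing.
        self._clock = clock or (lambda: time.monotonic() * 1000.0)
        self._interval = frame_interval_ms
        self._next_frame = self._clock() + self._interval

    def get_timeout(self):
        """Whole milliseconds until the next frame is due (0 if it is due now)."""
        remaining = self._next_frame - self._clock()
        return max(0, math.ceil(remaining))

    def step(self):
        """Call after a frame has been simulated and painted."""
        # Advance from the previous deadline to avoid drift, but never schedule
        # in the past: a slow frame just means the next one is due immediately.
        self._next_frame = max(self._next_frame + self._interval, self._clock())
```

With this policy, a frame that takes 40 ms makes the next timeout 0, so the loop runs back-to-back frames at about 25 Hz instead of snapping to a lower fixed rate.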


Well, what about modal dialogs?

Since we rely on a custom message loop to animate 3D scenes, how do we handle standard message loops such as the modal DialogBox and TrackPopupMenu calls? Fortunately, DialogBox and TrackPopupMenu provide us with the hook required to implement frame updates: WM_ENTERIDLE.

When the standard DialogBox and TrackPopupMenu modal message loops go idle, they send their parent window a WM_ENTERIDLE message. Upon receiving WM_ENTERIDLE, the parent window determines whether it’s time to render a new frame. If so, we animate all visible 3D windows, which will trigger a WM_PAINT, which triggers a subsequent WM_ENTERIDLE.

On the other hand, if it’s not time to render a new frame, we call timeSetEvent with TIME_ONESHOT to schedule a frame update in the future.

As we saw previously, timeSetEvent isn’t as reliable as a custom loop using MsgWaitForMultipleObjects, but if a modal dialog or popup menu is visible, the user probably isn’t paying very close attention anyway. All that matters is that the UI remains responsive and animation continues while modal loops are open. Code follows:

LRESULT CALLBACK ModalFrameSchedulerWndProc(HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (message == idleMessage) {
        // Time for a frame: step and paint the visible 3D windows here.
    }
    return DefWindowProc(hwnd, message, wparam, lparam);
}

struct AlmostMSG {
    HWND hwnd;
    UINT message;
    WPARAM wparam;
    LPARAM lparam;
};

// timeSetEvent callback: posts the deferred message, then frees it.
void CALLBACK timeForPost(UINT, UINT, DWORD_PTR user_data, DWORD_PTR, DWORD_PTR) {
    AlmostMSG* msg = reinterpret_cast<AlmostMSG*>(user_data);
    PostMessage(msg->hwnd, msg->message, msg->wparam, msg->lparam);
    delete msg;
}

// Posts the message now if timeout is 0, otherwise after timeout milliseconds.
void PostMessageIn(DWORD timeout, HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (timeout) {
        AlmostMSG* msg = new AlmostMSG;
        msg->hwnd = hwnd;
        msg->message = message;
        msg->wparam = wparam;
        msg->lparam = lparam;
        timeSetEvent(timeout, 1, timeForPost, reinterpret_cast<DWORD_PTR>(msg), TIME_ONESHOT | TIME_CALLBACK_FUNCTION);
    } else {
        PostMessage(hwnd, message, wparam, lparam);
    }
}

class ModalFrameScheduler : public IFrameListener {
public:
    ModalFrameScheduler() { stepping = false; }

    // Call when WM_ENTERIDLE is received.
    void onIdle() {
        if (!frameListenerWindow) {
            idleMessage = RegisterWindowMessageW(L"IMVU_ScheduleFrame");

            WNDCLASSW wc;
            ZeroMemory(&wc, sizeof(wc));
            wc.hInstance = GetModuleHandle(0);
            wc.lpfnWndProc = ModalFrameSchedulerWndProc;
            wc.lpszClassName = L"IMVUModalFrameScheduler";
            RegisterClassW(&wc);

            frameListenerWindow = CreateWindowW(
                L"IMVUModalFrameScheduler", L"IMVUModalFrameScheduler", // window title assumed
                0, 0, 0, 0, 0, 0, 0,
                GetModuleHandle(0), 0);
        }

        if (!stepping) {
            const unsigned timeout = ftc.getTimeout();
            stepping = true;
            PostMessageIn(timeout, frameListenerWindow, idleMessage, 0, 0);
        }
    }

    // Presumably the IFrameListener callback, invoked once a frame has been stepped.
    void step() { stepping = false; }

private:
    bool stepping;
    FrameTimeoutCalculator ftc;
};

How has it worked out?

A custom message loop and WM_ENTERIDLE neatly meet all of the goals we laid out:

  • No unnecessary polling, and thus increased battery life and performance.
  • When possible, the 3D windows animate at 60 Hz.
  • Even degradation. If painting a frame takes 40 ms, the frame rate will drop from 60 Hz to 25 Hz, not from 60 Hz to 15 Hz, as some of the implementations did.
  • Animation continues to play, even while modal dialogs and popup menus are visible.
  • This code runs well on XP, Vista, and Windows 7.

Extracting Color and Transparency from Flash

The original source of this post is at the IMVU engineering blog. Subscribe now!

For clarity, I slightly oversimplified my previous discussion on efficiently rendering Flash in a 3D scene. The sticky bit is extracting transparency information from the Flash framebuffer so we can composite the overlay into the scene.

Flash does not give you direct access to its framebuffer. It does, with IViewObject::Draw, allow you to composite the Flash framebuffer onto a DIB section of your choice.

Remembering your Porter-Duff, composition of source over dest is:

Color = SourceColor * SourceAlpha + DestColor * (1 - SourceAlpha)

If the source color is premultiplied, you get:

Color = SourceColor + DestColor * (1 - SourceAlpha)

Assuming we want premultiplied color and alpha from Flash for efficient rendering in the 3D scene, applying the above requires solving for FlashAlpha and FlashColor:

RenderedColor = FlashColor * FlashAlpha + DestColor * (1 - FlashAlpha)

RenderedColor = FlashColor * FlashAlpha + DestColor - DestColor * FlashAlpha

RenderedColor - DestColor = FlashColor * FlashAlpha - DestColor * FlashAlpha

RenderedColor - DestColor = FlashAlpha * (FlashColor - DestColor)

FlashAlpha = (RenderedColor - DestColor) / (FlashColor - DestColor)

If FlashColor and DestColor are equal, then FlashAlpha is undefined. Intuitively, this makes sense. If you render a translucent black SWF on a black background, you can’t know the transparency data because all of the pixels are still black. This doesn’t matter, as I’ll show in a moment.

FlashColor is trivial:

RenderedColor = FlashColor * FlashAlpha + DestColor * (1 - FlashAlpha)

RenderedColor - DestColor * (1 - FlashAlpha) = FlashColor * FlashAlpha

FlashColor = (RenderedColor - DestColor * (1 - FlashAlpha)) / FlashAlpha

FlashColor is undefined if FlashAlpha is 0. Transparency has no color.

What do these equations give us? We know RenderedColor, since it’s the result of calling IViewObject::Draw. We have control over DestColor, since we configure the DIB Flash is drawn atop. What happens if we set DestColor to black (0)?

FlashColor = (RenderedColorOnBlack) / FlashAlpha

What happens if we set it to white (1)?

FlashColor = (RenderedColorOnWhite - (1 - FlashAlpha)) / FlashAlpha

Now we’re getting somewhere! Since FlashColor and FlashAlpha are constant, we can define a relationship between FlashAlpha and RenderedColorOnBlack and RenderedColorOnWhite:

(RenderedColorOnBlack) / FlashAlpha = (RenderedColorOnWhite - (1 - FlashAlpha)) / FlashAlpha

RenderedColorOnBlack = RenderedColorOnWhite - 1 + FlashAlpha

FlashAlpha = RenderedColorOnBlack - RenderedColorOnWhite + 1

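FlashAlpha = 1 - (RenderedColorOnWhite - RenderedColorOnBlack)

So all we have to do is render the SWF on a white background and a black background, subtract the black render from the white render, and invert the difference to extract the alpha channel. An opaque pixel looks the same on both backgrounds (difference 0, alpha 1); a fully transparent pixel shows pure white and pure black (difference 1, alpha 0).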

Now what about color? Just plug the calculated FlashAlpha into the following when rendering on black.

FlashColor = (RenderedColor - DestColor * (1 - FlashAlpha)) / FlashAlpha

FlashColor = RenderedColorOnBlack / FlashAlpha

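Since we want premultiplied color, multiply both sides by FlashAlpha. The division cancels, and the premultiplied color is simply what Flash rendered on black:

FlashColor * FlashAlpha = RenderedColorOnBlack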

Now that we know FlashColor and FlashAlpha for the overlay, we can copy it into a texture and render the scene!
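
As a rough illustration of the derivation above, here is a minimal sketch of the extraction on 8-bit pixels. It starts from FlashAlpha = RenderedColorOnBlack - RenderedColorOnWhite + 1, scaled to 0..255. The Pixel layout, the function names, and the choice of the green channel are assumptions for this example, not IMVU's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 8-bit-per-channel pixel. The layout is an assumption for this sketch.
struct Pixel {
    std::uint8_t r, g, b, a;
};

// Recover alpha from one channel captured on black and on white:
// FlashAlpha = RenderedColorOnBlack - RenderedColorOnWhite + 1, in 0..255.
std::uint8_t extractAlpha(std::uint8_t onBlack, std::uint8_t onWhite) {
    const int alpha = static_cast<int>(onBlack) - static_cast<int>(onWhite) + 255;
    // Clamp in case the two captures disagree slightly due to rounding.
    return static_cast<std::uint8_t>(std::max(0, std::min(255, alpha)));
}

// Combine two captures of the same frame into premultiplied RGBA.
// The color captured on black is already FlashColor * FlashAlpha.
std::vector<Pixel> extractPremultiplied(const std::vector<Pixel>& onBlack,
                                        const std::vector<Pixel>& onWhite) {
    assert(onBlack.size() == onWhite.size());
    std::vector<Pixel> out(onBlack.size());
    for (std::size_t i = 0; i < onBlack.size(); ++i) {
        const Pixel& k = onBlack[i];
        const Pixel& w = onWhite[i];
        // Any channel yields the same alpha in theory; green is an arbitrary pick.
        out[i] = Pixel{k.r, k.g, k.b, extractAlpha(k.g, w.g)};
    }
    return out;
}
```

Feeding it a 50% translucent red pixel, captured as (128, 0, 0) on black and (255, 127, 127) on white, yields premultiplied color (128, 0, 0) with alpha 128.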

Efficiently Rendering Flash in a 3D Scene

The original source of this post is at the IMVU engineering blog. Subscribe now!

Last time, I talked about how to embed Flash into your desktop application, for UI flexibility and development speed. This time, I’ll discuss efficient rendering into a 3D scene.

Rendering Flash as a 3D Overlay (The Naive Way)

At first blush, rendering Flash on top of a 3D scene sounds easy. Every frame:

  1. Create a DIB section the size of your 3D viewport
  2. Render Flash into the DIB section with IViewObject::Draw
  3. Copy the DIB section into an IDirect3DTexture9
  4. Render the texture on the top of the scene

[Figure: Naive Flash Rendering]

Ta da! But your frame rate dropped to 2 frames per second? Ouch. It turns out this implementation is horribly slow. There are a few reasons.

First, asking the Adobe Flash player to render into a DIB isn’t a cheap operation. In our measurements, drawing even a simple SWF takes on the order of 10 milliseconds. Since most UI doesn’t animate every frame, we should be able to cache the captured framebuffer.

Second, main memory and graphics memory are on different components in your computer. You want to avoid wasting time and bus traffic by unnecessarily copying data from the CPU to the GPU every frame. If only the lower-right corner of a SWF changes, we should limit our memory copies to that region.

Third, modern GPUs are fast, but not everyone has them. Let’s say you have a giant mostly-empty SWF and want to render it on top of your 3D scene. On slower GPUs, it would be ideal if you could limit your texture draws to the regions of the screen that are non-transparent.

Rendering Flash as a 3D Overlay (The Fast Way)

Disclaimer: I can’t take credit for these algorithms. They were jointly developed over years by many smart engineers at IMVU.

First, let’s reduce an embedded Flash player to its principles:

  • Flash exposes an IShockwaveFlash interface through which you can load and play movies.
  • Flash maintains its own frame buffer. You can read these pixels with IViewObject::Draw.
  • When a SWF updates regions of the frame buffer, it notifies you through IOleInPlaceSiteWindowless::InvalidateRect.

In addition, we’d like the Flash overlay system to fit within these performance constraints:

  • Each SWF is rendered over the entire window. For example, implementing a ball that bounces around the screen or a draggable UI component should not require any special IMVU APIs.
  • If a SWF is not animating, we do not copy its pixels to the GPU every frame.
  • We do not render the overlay in transparent regions. That is, if no Flash content is visible, rendering is free.
  • Memory consumption (ignoring memory used by individual SWFs) for the overlay is O(framebuffer), not O(framebuffer * SWFs). That is, loading three SWFs should not require allocation of three screen-sized textures.
  • If Flash notifies of multiple changed regions per frame, only call IViewObject::Draw once.

Without further ado, let’s look at the fast algorithm:

[Figure: Fast Flash Rendering]

Flash notifies us of visual changes via IOleInPlaceSiteWindowless::InvalidateRect. We take any updated rectangles and add them to a per-frame dirty region. When it’s time to render a frame, there are four possibilities:

  • The dirty region is empty and the opaque region is empty. This case is basically free, because nothing need be drawn.
  • The dirty region is empty and the opaque region is nonempty. In this case, we just need to render our cached textures for the opaque regions of the screen. This case is the most common. Since a video memory blit is fast, there’s not much we could do to further speed it up.
  • The dirty region is nonempty. We must IViewObject::Draw into our Overlay DIB, with one tricky bit. Since we’re only storing one overlay texture, we need to render each loaded Flash overlay SWF into the DIB, not just the one that changed. Imagine an animating SWF underneath another translucent SWF. The top SWF must be composited with the bottom SWF’s updates. After rendering each SWF, we scan the updated DIB for a minimalish opaque region. Why not just render the dirty region? Imagine a SWF with a bouncing ball. If we naively rendered every dirty rectangle, eventually we’d be rendering the entire screen. Scanning for minimal opaque regions enables recalculation of what’s actually visible.
  • The dirty region is nonempty, but the updated pixels are all transparent. If this occurs, we no longer need to render anything at all until Flash content reappears.
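
The four cases above can be sketched as a small state machine. This is a simplified illustration, not IMVU's implementation: it tracks each region as a single bounding rectangle instead of a set of rectangles, and every name here (Rect, OverlayState, FrameAction, scanOpaqueBounds) is hypothetical. The actual IViewObject::Draw call and texture upload are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rect {
    int left, top, right, bottom;  // right and bottom are exclusive
    bool empty() const { return right <= left || bottom <= top; }
};

// Smallest rectangle containing both inputs.
Rect unite(const Rect& a, const Rect& b) {
    if (a.empty()) return b;
    if (b.empty()) return a;
    return Rect{std::min(a.left, b.left), std::min(a.top, b.top),
                std::max(a.right, b.right), std::max(a.bottom, b.bottom)};
}

// Bounding box of every pixel with nonzero alpha in a width * height plane.
Rect scanOpaqueBounds(const std::vector<std::uint8_t>& alpha, int width, int height) {
    assert(alpha.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    Rect r{width, height, 0, 0};  // starts out empty
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (alpha[static_cast<std::size_t>(y) * width + x] == 0) continue;
            r.left = std::min(r.left, x);
            r.top = std::min(r.top, y);
            r.right = std::max(r.right, x + 1);
            r.bottom = std::max(r.bottom, y + 1);
        }
    }
    return r.empty() ? Rect{} : r;
}

enum class FrameAction { Nothing, DrawCached, Recapture };

class OverlayState {
public:
    // Called from the InvalidateRect notification.
    void invalidate(const Rect& r) { dirty_ = unite(dirty_, r); }

    // Decide what this frame has to do.
    FrameAction beginFrame() const {
        if (!dirty_.empty()) return FrameAction::Recapture;
        return opaque_.empty() ? FrameAction::Nothing : FrameAction::DrawCached;
    }

    // Called after every loaded SWF has been redrawn into the shared DIB.
    void finishRecapture(const std::vector<std::uint8_t>& alpha, int width, int height) {
        opaque_ = scanOpaqueBounds(alpha, width, height);
        dirty_ = Rect{};
    }

    const Rect& opaque() const { return opaque_; }

private:
    Rect dirty_{};
    Rect opaque_{};
};
```

Rescanning for the opaque bounds after each recapture is what keeps the bouncing-ball case cheap: the opaque rectangle follows the ball instead of growing to cover everything it ever touched.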

This algorithm has proven efficient. It supports multiple overlapping SWFs while minimizing memory consumption and CPU/GPU draw calls per frame. Until recently, we used Flash for several of our UI components, giving us a standard toolchain and a great deal of flexibility. Flash was the bridge that took us from the dark ages of C++ UI code to UI on which we could actually iterate.

How to Embed Flash Into Your 3D Application

The original source of this post is at the IMVU engineering blog. Subscribe now!

[I wrote this post last year when IMVU still used Flash for a significant portion of our UI. Even though we now embed Gecko, I believe embedding Flash is still valuable.]

Writing user interfaces is hard. Writing usable interfaces is harder. Yet, the design of your interface is your product.

Products are living entities. They always want to grow, adapting to their users as users adapt to them. In that light, why build your user interface in a static technology like C++ or Java? It won’t be perfect the first time you build it, so prepare for change.

IMVU employs two technologies for rapidly iterating on and refining our client UIs: Flash and Gecko/HTML. Sure, integrating these technologies has a sizable up-front cost, but the iteration speed they provide easily pays for them. Rapid iteration has some obvious benefits:

  1. reduces development cost
  2. reduces time to market

and some less-obvious benefits:

  1. better product/market fit: when you can change your UI, you will.
  2. improved product quality: little details distinguish mediocre products from great products. Make changing details cheap and your Pinto will become a Cadillac.
  3. improved morale: both engineers and designers love watching their creations appear on the screen right before them. It’s why so many programmers create games!

I will show you how integrating Flash into a 3D application is easier than it sounds.

Should I use Adobe Flash or Scaleform GFx?

The two most common Flash implementations are Adobe’s ActiveX control (which has a 97% installed base!) and Scaleform GFx.

Adobe’s control has perfect compatibility with their tool chain (go figure!) but is closed-source and good luck getting help from Adobe.

Scaleform GFx is an alternate implementation of Flash designed to be embedded in 3D applications, but, last I checked, is not efficient on machines without GPUs. (Disclaimer: this information is two years old, so I encourage you to make your own evaluation.)

IMVU chose to embed Adobe’s player.

Deploying the Flash Runtime

Assuming you’re using Adobe’s Flash player, how will you deploy their runtime? Well, given Flash’s install base, you can get away with loading the Flash player already installed on the user’s computer. If they don’t have Flash, just require that they install it from your download page. Simple and easy.

Down the road, when Flash version incompatibilities and that last 5% of your possible market become important, you can request permission from Adobe to deploy the Flash player with your application.

Displaying SWFs

IMVU displays Flash in two contexts: traditional HWND windows and 2D overlays atop the 3D scene.

[Figure: IMVU Flash Window]
[Figure: IMVU Flash Overlay]

If you want to have something up and running in a day, buy f_in_box. Besides its awesome name, it’s cheap, comes with source code, and the support forums are fantastic. It’s a perfect way to bootstrap. After a weekend of playing with f_in_box, Dusty and I had a YouTube video playing in a texture on top of our 3D scene.

Once you run into f_in_box’s limitations, you can use the IShockwaveFlash and IOleInPlaceObjectWindowless COM interfaces directly. See Igor Makarav’s excellent tutorial and CFlashWnd class.

Rendering Flash as an HWND

For top-level UI elements, use f_in_box or CFlashWnd directly. They’re perfectly suited for this. Seriously, it’s just a few lines of code. Look at their samples and go.

Rendering Flash as a 3D Overlay

Rendering Flash to a 3D window gets a bit tricky… Wait for Part 2 of this post!

Scalable Build Systems: An Analysis of Tup

I previously argued that any tool whose running time is proportional to the number of files in a project scales quadratically with time. Bluem00 on Hacker News pointed me towards Tup, a scalable build system with goals similar to ibb.

Mike Shal, Tup’s author, wrote Build System Rules and Algorithms, formalizing the algorithmic deficiencies with existing build systems and describing Tup’s implementation, a significant improvement over the status quo. I would like to document my analysis of Tup and whether I think it replaces ibb.

Before we get started, I’d like to thank Mike Shal for being receptive to my comments. I sent him a draft of my analysis and his responses were thoughtful and complete. With his permission, I have folded his thoughts into the discussion below.

Is Tup suitable as a general-purpose build system? Will it replace SCons or Jam or Make anytime soon? Should I continue working on ibb?

Remember our criteria for a scalable build system, one that enables test-driven development at arbitrary project sizes:

  1. O(1) no-op builds
  2. O(changes) incremental builds
  3. Accessible dependency DAG atop which a variety of tools can be built

Without further ado, my thoughts on Tup follow:


Tup defines its own declarative syntax, similar to Make or Jam. At first glance, the Tup syntax looks semantically equivalent to Make. From the examples:

: hello.c |> gcc hello.c -o hello |> hello

Read the dependency graph from left to right: hello.c is compiled by gcc into a hello executable. Tup supports variable substitution and limited flow control.

Build systems are inherently declarative, but I think Tup’s syntax has two flaws:

  1. Inventing a new syntax unnecessarily slows adoption: by implementing GNU Make’s syntax, Tup would be a huge drop-in improvement to existing build systems.
  2. Even though specifying dependency graphs is naturally declarative, I think a declarative syntax is a mistake. Build systems are a first-class component of your software and your team’s workflow. You should be able to develop them in a well-known, high-level language such as Python or Ruby, especially since those languages come with rich libraries. As an example, SCons gets this right: it’s trivial for me to write CPU autodetection logic for parallel builds in a build script if that makes sense. Or I can extend SCons’s Node system to download source files from the web.

Implementation Language

Tup is 15,000 lines of C. There’s no inherent problem with C, but I do think a community-supported project is more likely to thrive in a safer language that is faster to develop in, such as Python or Ruby. Having worked with teams of engineers, it’s clear that most engineers can safely work in Python with hardly any spin-up time. I can’t say the same of C.

Git is an interesting case study: The core performance-sensitive data structures and algorithms are written in C, but many of its interesting features are written in Perl or sh, including git-stash, git-svn, and git-bisect. Unlike Git, I claim Python and Ruby are plenty efficient for the entirety of a scalable build system. Worst case, the dependency graph could live in C and everything else could stay in Python.

Scanning Implicit Dependencies

The Tup paper mentions offhand that it’s trivial to monitor a compiler’s file accesses and thus determine its true dependencies for generating a particular set of outputs. The existing implementation uses an LD_PRELOAD shim to monitor all file accesses attempted by, say, gcc, and treats those as canonical input files. Clever!

This is a great example of lateral, scrappy thinking. It has a couple of huge advantages:

  1. No implicit dependencies (such as C++ header file includes) need be specified — if all dependencies come from the command line or a file, Tup will know them all.
  2. It’s easy to implement. Tup’s ldpreload.c is a mere 500 lines.

And a few disadvantages:

  1. Any realistic build system must treat Windows as a first-class citizen. Perhaps, on Windows, Tup could use something like Detours. I’ll have to investigate that.
  2. Intercepting system calls is reliable when the set of system calls is known and finite. However, there’s nothing stopping the OS vendor from adding new system calls that modify files.
  3. Incremental linking / external PDB files: these Visual C++ features both read and write the same file in one compile command. SCons calls this a SideEffect: commands that share a SideEffect cannot parallelize. A build system that does not support incremental linking or external symbols would face resistance among Visual C++ users.

And some open questions:

  1. I haven’t completely thought this through, but it may be important to support user-defined dependency scanners that run before command execution, enabling tools such as graph debugging.
  2. I don’t have a realistic example, but imagine a compiler whose set of files read changes spuriously from run to run; say, a compiler that only checks its license file every other day.

Stepping back, I think the core build system should not be responsible for dependency scanning. By focusing on dependency graph semantics and leaving dependency scanning up to individual tools (which may or may not use LD_PRELOAD or similar techniques), a build system can generalize to uses beyond compiling software, as I mentioned in my previous blog post.

Dependency Graph

Tup’s dependency DAG contains two types of nodes: Commands and Files. Files depend on Commands and Commands depend on other Files. I prefer Tup’s design over SCons’s DAG-edges-are-commands design for two reasons:

  1. It simplifies the representation of multiple-input multiple-output commands.
  2. Some commands, such as “run-test foo” or “search-regex some.*regex” depend on source files but produce no files as output. Since they fit naturally into the build DAG, commands are a first-class concept.
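
To make the two-node-type design concrete, here is a minimal sketch of such a graph. It is my own illustration, not Tup's data structure: files are identified by path, each command lists the files it reads and writes, and the names CommandNode, BuildGraph, and affectedCommands are hypothetical. A command with no outputs, such as a test run, fits in without special handling.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A command node: reads input files, writes zero or more output files.
struct CommandNode {
    std::string commandLine;
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;  // may be empty, e.g. for "run-test foo"
};

class BuildGraph {
public:
    // Register a command and index it under each file it reads.
    std::size_t addCommand(CommandNode c) {
        const std::size_t id = commands_.size();
        for (const std::string& in : c.inputs) consumers_[in].push_back(id);
        commands_.push_back(std::move(c));
        return id;
    }

    const CommandNode& command(std::size_t id) const {
        assert(id < commands_.size());
        return commands_[id];
    }

    // Walk forward from the changed files and collect every command that
    // transitively depends on them, in discovery order. Untouched parts of
    // the graph are never visited. A real scheduler would still need to
    // order the result topologically before running it.
    std::vector<std::size_t> affectedCommands(const std::vector<std::string>& changed) const {
        std::deque<std::string> queue(changed.begin(), changed.end());
        std::set<std::size_t> seen;
        std::vector<std::size_t> result;
        while (!queue.empty()) {
            const std::string file = queue.front();
            queue.pop_front();
            const auto it = consumers_.find(file);
            if (it == consumers_.end()) continue;
            for (const std::size_t id : it->second) {
                if (!seen.insert(id).second) continue;  // already scheduled
                result.push_back(id);
                for (const std::string& out : commands_[id].outputs) queue.push_back(out);
            }
        }
        return result;
    }

private:
    std::vector<CommandNode> commands_;
    std::map<std::string, std::vector<std::size_t>> consumers_;
};
```

With a compile, a link, and a "run-test" command chained on hello.c, changing hello.c selects all three, while changing an unrelated file selects nothing.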

Build Reliability

Tup, like SCons, places a huge emphasis on build reliability. This is key and I couldn’t agree more. In the half-decade I’ve used SCons, I can count the number of broken builds on one hand. Sadly, many software developers are used to typing “make clean” or clicking “full rebuild” when something is weird. What a huge source of waste! Developers should trust the build system as much as their compiler, and the build system should go out of its way to help engineers specify complete and accurate dependencies.

Reliable builds imply:

  1. Changes are tracked by file contents, not timestamps.
  2. The dependency graph, including implicit dependencies such as header files and build commands, is complete and accurate by default.
  3. Compiler command lines are included in the DAG. Put another way: if the command used to build a file changes, the file must be rebuilt.

Tup takes a strict functional approach and formalizes build state as a set of files and their contents. (I would argue build state also includes file metadata such as file names and timestamps, at least if the compiler uses such information.) If the build state does not change between invocations, then no work must be done.

Tup even takes build reliability one step further than SCons: If you rename a target file in the build script, Tup actually deletes the old built target before rebuilding the new one. Thus, you will never have stale target executables lying around in your build tree.

Nonetheless, there are situations where a project may choose to sacrifice absolute reliability for significant improvements in build speed, such as incremental linking discussed above.

Core vs. Community

A build system is a critical component of any software team’s development process. Since every team is different, it’s essential that a build system is flexible and extensible. SCons, for example, correctly chose to implement build scripts in a high-level language (Python) with a declarative API for specifying nodes and edges in the dependency graph.

However, I think SCons did not succeed at separating its core engine from its community. SCons tightly couples the underlying dependency graph with support for tools like Visual C++, gcc, and version control. The frozen and documented SCons API is fairly high-level while the (interesting) internals are treated as private APIs. It should be the opposite: a dependency graph is a narrow, stable, and general API. By simplifying and documenting the DAG API, SCons could enable broader uses, such as unit test execution.


I agree with Tup’s author that build autoconfiguration (such as autoconf or SCons’s Configure support) should not live in the core build system. Autoconfiguration is simply an argument that build scripts should be specified in a general programming language and that the community should develop competing autoconfiguration systems. If a particular autoconfiguration system succeeds in the marketplace, then, by all means, ship it with your build tool. Either way, it shouldn’t have access to any internal APIs. Configuration mechanisms are highly environment-sensitive and are best maintained by the community anyway.

DAG post-process optimizations

Another argument for defining a build tool in a general-purpose language is to allow user-defined DAG optimizations and sort orders. I can think of two such use cases:

  1. Visual C++ greatly improves compile times when multiple C++ files are specified on one command line. In fact, the benefit of batched builds can exceed the benefit of PCH. A DAG optimizer would search for a set of C++ source files that produce object files in the same directory and rewrite the individual command lines into one.
  2. When rapidly iterating, it would be valuable for a build system or test runner to sort such that the most-recently-failed compile or test runs first. However, when hunting test interdependencies as part of a nightly build, you may want to shuffle test runs. On machines with many cores but slow disks, you want to schedule expensive links as soon as possible to mitigate the risk that multiple links will execute concurrently and thrash your disk.
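
The first use case might look something like the sketch below: a pass that groups per-file compile steps by output directory and emits one batched command per group. The types and function names are hypothetical, paths are assumed to use forward slashes, and the generated cl command line (with /c and a trailing-slash /Fo directory) is illustrative only and would need the project's real flags.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One source-to-object compile step as it might appear in the original DAG.
struct CompileStep {
    std::string source;
    std::string object;
};

// A rewritten step: several sources compiled into one output directory.
struct BatchedCompile {
    std::string outputDir;
    std::vector<std::string> sources;
};

// Directory part of a forward-slash path, or "." if there is none.
std::string directoryOf(const std::string& path) {
    const std::string::size_type slash = path.find_last_of('/');
    return slash == std::string::npos ? std::string(".") : path.substr(0, slash);
}

// Group compile steps whose object files land in the same directory.
std::vector<BatchedCompile> batchByOutputDirectory(const std::vector<CompileStep>& steps) {
    std::map<std::string, std::vector<std::string>> byDir;
    for (const CompileStep& s : steps) {
        assert(!s.source.empty() && !s.object.empty());
        byDir[directoryOf(s.object)].push_back(s.source);
    }
    std::vector<BatchedCompile> batches;
    for (const auto& entry : byDir) {
        batches.push_back(BatchedCompile{entry.first, entry.second});
    }
    return batches;
}

// Render one batch as a single compiler invocation (illustrative flags only).
std::string toCommandLine(const BatchedCompile& batch) {
    std::string cmd = "cl /c /Fo" + batch.outputDir + "/";
    for (const std::string& src : batch.sources) cmd += " " + src;
    return cmd;
}
```

Because the pass only reads and rewrites command nodes, it could live entirely outside the core build system, which is the point of exposing the DAG to user code.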


Tup is a significant improvement over the status quo, and I have personally confirmed its performance — it’s lightning fast and it scales to arbitrary project sizes.

However, without out-of-the-box Windows support, a mainstream general-purpose language, and a model for community contribution, I don’t see Tup rapidly gaining traction. With the changes I suggest, it could certainly replace Make and perhaps change the way we iterate on software entirely.

Next, I intend to analyze prebake.