Web Platform Limitations, Part 1 – XMLHttpRequest Priority

by Chad Austin
Aug 31 14

The web is inching towards being a general-purpose applications platform that rivals native apps for performance and functionality. However, to this day, API gaps make certain use cases hard or impossible.

Today I want to talk about streaming network resources for a 3D world.

Streaming Data

In IMVU, when joining a 3D room, hundreds of resources need to be downloaded: object descriptions, 3D mesh geometry, textures, animations, and skeletons. The size of this data is nontrivial: a room might contain 40 MB of data, even after gzip. The largest assets are textures.

To improve load times, we first request low-resolution thumbnail versions of each texture. Then, once the scene is fully loaded and interactive, high-resolution textures are downloaded in the background. High-resolution textures are larger and not critical for interactivity. That is, each resource is requested with a priority:

High priority: skeletons, meshes, low-resolution textures, animations
Low priority: high-resolution textures

Browser Connection Limits

Let’s imagine what would happen if we opened an XMLHttpRequest for each resource right away. What happens depends on whether the browser is using plain HTTP or SPDY.

HTTP

Browsers limit the number of simultaneous TCP connections to each domain. That is, if the browser's limit is 8 connections per domain and we open 50 XMLHttpRequests, the 9th request will not even be submitted until one of the first 8 finishes. (In theory, HTTP pipelining helps, but browsers don't enable it by default.) Since there is no way to indicate priority in the XMLHttpRequest API, you would have to be careful to issue XMLHttpRequests in order of decreasing priority and hope that no higher-priority requests arrive later. (Otherwise, they would be blocked behind low-priority requests.) This assumes the browser issues requests in sequence, of course. If not, all bets are off.

There is a way to approximate a prioritizing network atop HTTP XMLHttpRequest. At the application layer, limit the number of open XMLHttpRequests to 8 or so and have them pull the next request from a priority queue as requests finish.
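Here is a minimal sketch of that idea in C++. Request, PrioritizingFetcher, and the injected FetchFn are illustrative names, not a real API; the point is the bounded pool that always starts the highest-priority pending request:

#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Request {
    int priority; // larger value = more urgent
    std::string url;
    std::function<void(const std::string& body)> onLoaded;
};

struct LowerPriority {
    bool operator()(const Request& a, const Request& b) const {
        return a.priority < b.priority; // max-heap: highest priority on top
    }
};

class PrioritizingFetcher {
public:
    // fetch(url, done) must invoke done exactly once, when the response arrives
    typedef std::function<void(
        const std::string&, std::function<void(const std::string&)>)> FetchFn;

    PrioritizingFetcher(size_t maxConcurrent, FetchFn fetch)
        : maxConcurrent_(maxConcurrent), inFlight_(0), fetch_(std::move(fetch)) {}

    void request(Request r) {
        pending_.push(std::move(r));
        pump();
    }

private:
    void pump() {
        while (inFlight_ < maxConcurrent_ && !pending_.empty()) {
            Request r = pending_.top();
            pending_.pop();
            ++inFlight_;
            fetch_(r.url, [this, r](const std::string& body) {
                r.onLoaded(body);
                --inFlight_;
                pump(); // a slot opened: start the next-highest-priority request
            });
        }
    }

    size_t maxConcurrent_;
    size_t inFlight_;
    FetchFn fetch_;
    std::priority_queue<Request, std::vector<Request>, LowerPriority> pending_;
};

With 8 or so slots, high-priority requests stop waiting behind the long tail of low-priority ones, at the cost of the bandwidth problems described next.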

Soon I’ll show why this doesn’t work that well in practice.

SPDY

If the browser and server both support SPDY, then there is no theoretical limit on the number of simultaneous requests to a server. The browser could happily blast out hundreds of HTTP requests, and the responses would make full use of your bandwidth. However, a low-priority response might burn bandwidth that could otherwise be spent on a high-priority response.

SPDY has a mechanism for prioritizing requests. However, that mechanism is not exposed to JavaScript, so, like HTTP, you either issue requests from high priority to low priority or you build the prioritizing network approximation described above.

However, the prioritizing network reduces effective bandwidth utilization by limiting the number of concurrent requests at any given time.

Prioritizing Network Approximation

Let’s consider the prioritizing network implementation described above. Besides the fact that it doesn’t make good use of the browser’s available bandwidth, it has another serious problem: imagine we’re loading a 3D scene with some meshes, 100 low-resolution textures, and 100 high-resolution textures. Imagine the high-resolution textures are already in the browser’s disk cache, but the low-resolution textures aren’t.

Even though the high-resolution textures could be displayed immediately (and would trump the low-resolution textures), because they have low priority, the prioritizing network won’t even check the disk cache until all low-resolution textures have been downloaded.

That is, even though the customer’s computer has all of the high-resolution textures on disk, they won’t show up for several seconds! This is an unnecessarily poor experience.

Browser Knows Best

In short, the browser has all of the information needed to effectively prioritize HTTP requests. It knows whether it’s using HTTP or SPDY. It knows what’s in cache and not.

It would be super fantastic if browsers let you tell them each request's priority. I've seen some discussions about adding priority hints to the platform, but they seem to have languished.

tl;dr Not being able to tell the browser about request priority prevents us from making effective use of available bandwidth.

FAQ

Why not download all resources in one large zip file or package?

Each resource lives at its own URL, which maximizes utilization of HTTP caches and data sharing. If we downloaded resources in a zip file, we wouldn’t be able to leverage CDNs and the rest of the HTTP ecosystem. In addition, HTTP allows trivially sharing resources across objects. Plus, with protocols like SPDY, per-request overhead is greatly reduced.

CppCon 2014: Embind and Emscripten: Blending C++11, JavaScript, and the Web Browser

by Chad Austin
Aug 23 14

On September 9th, at CppCon 2014, I am presenting on the design of Embind, a JavaScript-to-C++ binding API for Emscripten. Embind was authored by IMVU Inc. and myself.

See the session abstract at the conference site.

In this talk, you’ll learn where Embind fits in the overall space of solutions for connecting C++ and JavaScript, why generated code size is so important, how Embind works hard to keep code size small, and several of the C++11 techniques I learned for this project.

C++ Casts

by Chad Austin
Jul 22 14

Once you explain something enough, it’s worth writing it down, simply so you can link it. :)

There are five cast operators in C++: static_cast, dynamic_cast, const_cast, reinterpret_cast, and the C-style cast.

cppreference.com does a great job documenting the semantics of the various cast operators, so I won’t cover them here. Instead, let’s define some rules that will keep you and your teammates sane and prevent a class of bug-causing accidents.

Enable Warnings For Implicit Conversions

Enable your compiler’s warnings for implicit conversions of float -> int, int -> float, int -> char, and so on. On the hopefully-rare occasion you actually want said conversion, use a static_cast.
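For example, a conversion you actually intend becomes explicit and self-documenting:

float ratio = 0.75f;
int percent = static_cast<int>(ratio * 100.0f); // deliberate float -> int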

Avoid Casts

When possible, avoid casts entirely. If you find yourself wanting dynamic_cast, consider whether a virtual function will suffice. Virtual functions are cheaper and your design becomes more flexible. Instead of:

if (auto p = dynamic_cast<Console*>(output)) {
    p->setColor(RED);
} else {
    output->write("FAIL: ");
}

Try:

output->onFailure();

Also, consider restructuring your code so type knowledge is explicit. Sometimes, instead of a single list of base class pointers, it works out better to store a list per subtype. That is, instead of:

std::vector<Material*> materials;

the following design might work a little more smoothly in practice:

std::vector<StaticMaterial*> staticMaterials;
std::vector<DynamicMaterial*> dynamicMaterials;

If you’re converting between floats and ints all the time, see if you can stick with one or the other. Some platforms severely penalize said conversions.

In general, frequent use of cast operators indicates the software’s design can be improved.

Use Weaker Casts

Use the most restrictive cast operator for the situation. Casts should be precise, specific, and should fail if the code is changed in a way that makes the conversion meaningless.

Prefer static_cast over reinterpret_cast

static_cast will give a compile-time error when attempting to convert C* to D* where neither C nor D derives from the other. reinterpret_cast and C-style casts would both allow said conversion.
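For instance, given two unrelated hypothetical types:

struct C { int x; };
struct D { float y; };

void example(C* c) {
    // D* a = static_cast<D*>(c);   // compile error: C and D are unrelated
    D* b = reinterpret_cast<D*>(c); // compiles, and is almost certainly a bug
    (void)b;
}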

Prefer reinterpret_cast over C-style Casts

reinterpret_cast does not allow casting away constness. You must use const_cast for that. C-style casts, again, let you do anything. Prefer the weaker cast operation.
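For example:

const char* message = "hello";
// char* a = reinterpret_cast<char*>(message); // compile error: casts away const
char* b = const_cast<char*>(message);          // explicit and greppable
char* c = (char*)message;                      // silently allowed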

Avoid const_cast

I’m not sure I’ve ever seen a justified use of const_cast. It’s almost always some kind of clever hack. Just get your constness right in the first place.

Avoid C-style Casts

I'm going to repeat myself. Don't use C-style casts. They let anything through. When you refactor something and find yourself debugging an insane issue that should have been a compiler error, you'll wish you had used a weaker cast operator.

Don’t Cast Function Pointers to void*

It's undefined behavior: the standard does not guarantee that a function pointer can even be represented as a void*, and on some platforms function pointers and object pointers have different sizes.

Emscripten, Callbacks, and C++11 Lambdas

by Chad Austin
Jun 20 14

The web is an asynchronous world built on asynchronous APIs. Thus, typical web applications are full of callbacks. onload and onerror for XMLHttpRequest, the callback argument to setTimeout, and messages from Web Workers are common examples.

Using asynchronous APIs is relatively natural in JavaScript. JavaScript is garbage-collected and supports anonymous functions that close over their scope.

However, when writing Emscripten-compiled C++ to interact with asynchronous web APIs, callbacks are less natural. First, JavaScript must call into a C++ interface. Embind works well for this purpose. Then, the C++ must respond to the callback appropriately. This post is about the latter component: writing efficient and clean asynchronous code in C++11.

I won’t go into detail here about how that works, but imagine you have an interface for fetching URLs with XMLHttpRequest.

class XHRCallback {
public:
    virtual ~XHRCallback() {}
    virtual void onLoaded(const Response& response) = 0;
};
void fetch(const std::string& url, XHRCallback* callback);

Imagine Response is a struct that embodies the HTTP response and onLoaded runs when the XMLHttpRequest ‘load’ event fires.

To fetch data from the network, you would instantiate an implementation of the XHRCallback interface and pass it into the XHR object. I'm not going to cover in this article how to connect these interfaces up to JavaScript; instead, we will look at various implementations of XHRCallback on the C++ side.

For the purposes of this example, let’s imagine we want to fetch some JSON, parse it, and store the result in a model.

Approach 1

A simple approach is to write an implementation of the interface that knows about the destination Model and updates it after parsing the body. Something like:

class MyXHRCallback : public XHRCallback {
public:
    Model* model;

    MyXHRCallback(Model* model) : model(model) {}

    virtual void onLoaded(const Response& response) override {
        model->update(parseJSON(response.body));
    }
};

void updateModelFromURL(const std::string& url, Model* model) {
    fetch(url, new MyXHRCallback(model));
}

This is quite doable, but it's a real pain to write a new class, field list, and constructor for every callback.

What if we tried to simplify the API with C++11 lambdas?

Approach 2

class LambdaXHRCallback : public XHRCallback {
public:
    std::function<void(const Response&)> onLoadedFunction;
    virtual void onLoaded(const Response& response) override {
        if (onLoadedFunction) {
            onLoadedFunction(response);
        }
    }
};

Above is boilerplate per interface. Below is use.

void updateModelFromURL(const std::string& url, Model* model) {
    auto callback = new LambdaXHRCallback;
    callback->onLoadedFunction = [model](const Response& response) {
        model->update(parseJSON(response.body));
    };
    fetch(url, callback);
}

Ignoring the implementation of LambdaXHRCallback, the API’s a little cleaner to use. This approach requires backing the callback interface with an implementation that delegates to a std::function. The std::function can be bound to a local lambda, keeping the callback logic lexically near the code issuing the request.

From a clarity perspective, this is an improvement. However, because Emscripten requires that your customers download and parse the entire program during page load (in some browsers, parsing happens on every page load!), code size is a huge deal. Even code in rarely-used code paths is worth paying attention to.

std::function, being implemented with its own abstract “implementation-knowledge-erasing” interface that is allocated upon initialization or assignment, tends to result in rather fat code. The default 16-byte backing storage in 32-bit libc++ doesn’t help either.

Can we achieve clear asynchronous code without paying the std::function penalty? Yes, in fact!

Approach 3

template<typename OnLoad>
class LambdaXHRCallback : public XHRCallback {
public:
    // perfect forwarding
    LambdaXHRCallback(OnLoad&& onLoad)
    : onLoad_(std::forward<OnLoad>(onLoad))
    {}

    virtual void onLoaded(const Response& response) override {
        onLoad_(response);
    }
    
private:
    OnLoad onLoad_;
};

// this function exists so OnLoad’s type can be inferred,
// as lambdas have anonymous types
template<typename OnLoad>
LambdaXHRCallback<OnLoad>* makeXHRCallback(OnLoad&& onLoad) {
    return new LambdaXHRCallback<OnLoad>(std::forward<OnLoad>(onLoad));
}

Above is boilerplate per interface. Below is use.

void updateModelFromURL(const std::string& url, Model* model) {
    fetch(url, makeXHRCallback(
        [model](const Response& response) {
            model->update(parseJSON(response.body));
        }
    ));
}

But… but… there are templates here, how is that any better than std::function? Well, first of all, now we only have one virtual call: the XHRCallback interface itself. Previously, we would have a virtual call into LambdaXHRCallback and then again through the std::function.

Second, in C++11, lambdas are syntax sugar for an anonymous class type with an operator(). Since the lambda is immediately given to the LambdaXHRCallback template and stored directly as a member variable, in practice the two types are merged during link-time optimization.
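As a reminder of that desugaring, the capturing lambda above behaves roughly like this hand-written class (the name is made up; real lambda types are anonymous):

// Approximately what the compiler generates for
// [model](const Response& response) { model->update(parseJSON(response.body)); }
class AnonymousLambda {
public:
    explicit AnonymousLambda(Model* model) : model_(model) {}
    void operator()(const Response& response) const {
        model_->update(parseJSON(response.body));
    }
private:
    Model* model_;
};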

I ported a dozen or so network callbacks from std::function to the template lambda implementation and saw a 39 KB reduction in the size of the resulting minified JavaScript.

I won’t go so far as to recommend avoiding std::function in Emscripten projects, but I would suggest asking whether there are better ways to accomplish your goals.

View-Independent Translucent Triangle Sorting

by Chad Austin
Feb 26 14

Given a large, potentially self-intersecting, translucent, two-sided triangle mesh, we’d like to render it with as few visual artifacts as possible. The issue is that OpenGL and Direct3D will render triangles in exactly the order they’re given, and since alpha blending is not commutative, alpha-blended triangles must be rendered back to front.

My colleague, Jon Watte, pointed me to an article he’d written that describes an O(n^2) algorithm that splits all double-sided triangles into two, which are then sorted based on a coverage factor: how many other triangles “shadow” this one. Triangles that are farther back are sorted lower in the list than triangles in front of them. This algorithm is view-independent.

My understanding is that Torque Engine and Unreal Engine implement similar algorithms.

The idea works well in practice, but O(n^2) is too slow for large meshes, especially approaching the 100,000 triangle mark.

I set out to see if we could get similar results from an algorithm with subquadratic or even linear-ish running time.

The basic idea is, if we can define a strict weak ordering on triangles, then we can run a standard sort algorithm across the triangles and have a perfectly-ordered, view-independent, translucent mesh.

What does it mean to order two translucent triangles in a mesh?

bool compare(Triangle A, Triangle B) {
    a_behind_b = all points in A behind B's plane
    b_behind_a = all points in B behind A's plane
    if (a_behind_b && b_behind_a) return false; // facing away: order irrelevant
    if (!a_behind_b && !b_behind_a) {
        // facing each other (order irrelevant) or intersecting:
        // some kind of weighted area comparison goes here
        return false;
    }
    if (a_behind_b) return true;
    return false; // b_behind_a
}

If two triangles are facing away from each other — that is, each is behind the other — then the order in which they’re rendered does not matter.

A = B.

If two triangles face each other, their order also does not matter.

Again, here, A = B.

If two triangles point the same direction, but triangle A is in front of triangle B, then A should always be rendered before B.

A < B.

If two triangles intersect, then without splitting them there is no correct rendering order. We'd like to sort them by how much each covers the other, which can be calculated by slicing each triangle with the other's plane.

In this case, A < B because B covers A more than A covers B.

So far we have a strict weak ordering. However, even ignoring the possibility of intersecting triangles, there’s a problem. Strict weak ordering implies and requires transitivity. (Indeed, that’s how O(n lg n) comparison sorting algorithms work.) However, consider the following three triangles, A, B, and C.

A > B
B > C
C > A

Without splitting, there is no ordering that can render this mesh correctly from all viewpoints.

Rats, this means our triangle comparison does not implement a strict weak ordering. What if we passed it to std::sort or std::stable_sort anyway? Undefined behavior, according to my reading of the C++ 2003 spec. But let's say we implemented a merge sort that has no undefined behavior even when given a non-strict-weak ordering. What's the worst ordering that could happen?

Given that comparison sorts rely on transitivity, consider our failure case from before, this time with a few extra “easy” triangles (labeled X and Y) thrown in.

The sort may compare A>B and B>C, assuming “A>C”. Then, worst case, given any triangle X where X>A, “X>C”, which could be incorrect. In addition, given any triangle Y where C>Y, “A>Y”, which could also be incorrect. The circled Xs and Ys in the diagram above indicate incorrectly-ordered triangles if the sort assumes “A>C”. That is, even the smallest cycle could result in several triangles being sorted incorrectly. A possible sort order would be YYYCBAXXX, even though, ideally, one of the Ys > A and two of the Xs < C.

A body of research named consensus clustering or rank aggregation explores approximating the optimal ordering given conflicting constraints. The goal is to minimize the number of conflicting orderings. Finding an optimal ordering is NP-hard, but I propose the following stochastic solution, where k and j are tunable constants:

for k iterations:
    shuffle triangle list
    merge sort
    pick j random pairs of triangles, count number of pairs where (second < first)
    keep ordering that minimizes incorrect orderings

This algorithm runs in O(k * (n lg n + j)), which is subquadratic, as was the stated goal. Also, given sufficient iterations, stochastic algorithms tend to avoid worst-case scenarios, which is good enough here.
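Here is a minimal C++ sketch of that loop. Triangle and compare are the (pseudocode) pieces from above, and mergeSort stands in for a hypothetical merge sort that stays well-defined even with a non-strict-weak comparator:

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

std::vector<Triangle> bestOrdering(std::vector<Triangle> tris, int k, int j) {
    std::mt19937 rng((std::random_device())());
    std::vector<Triangle> best = tris;
    size_t bestViolations = (size_t)-1;
    std::uniform_int_distribution<size_t> pick(0, tris.size() - 1);
    for (int iteration = 0; iteration < k; ++iteration) {
        std::shuffle(tris.begin(), tris.end(), rng);
        mergeSort(tris, compare); // hypothetical UB-free merge sort
        size_t violations = 0;
        for (int sample = 0; sample < j; ++sample) {
            size_t lo = pick(rng), hi = pick(rng);
            if (lo < hi && compare(tris[hi], tris[lo])) {
                ++violations; // tris[hi] should have rendered before tris[lo]
            }
        }
        if (violations < bestViolations) { // keep the least-wrong ordering
            bestViolations = violations;
            best = tris;
        }
    }
    return best;
}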

Now, if we could find cycles in subquadratic time, we could break them by selectively splitting triangles or by removing cycles entirely from the mesh and merging them separately, once the strict weak ordering on non-cycles is determined. Cycle detection is O(E+V) in general, but total ordering implies E=V^2.

Future work: Could we use the fact that some triangles can be rendered in any order relative to other triangles to optimize vertex cache hits?

Thanks to a bunch of people who explored this problem with me.

Multiplatform C++ on the Web with Emscripten

by Chad Austin
Apr 5 13

[This post is also available at the IMVU engineering blog.]

At GDC 2013, IMVU shared the analysis that led us to using Emscripten as a key component of our technology strategy to bring our rich 3D virtual goods catalog to new platforms, including web browsers.

Here are the slides from said presentation.

C++11 on Windows (Hah!)

by Chad Austin
Feb 22 13

For it being 2013 and as much as Herb Sutter has talked about C++11, it’s surprisingly hard to get an off-the-shelf C++11 development toolchain on Windows, at least as of today.  By off-the-shelf I mean suitable for an engineering team to get up and running quickly. Of course I could perform unnatural acts and compile my own packages of whatever, but no thanks.

Cygwin runs gcc 4.5, which is too old for most C++11 features. Cygwin does provide a clang 3.1 package, but it uses the gcc 4.5 libstdc++ headers, lacking most of C++11's standard library.

I could attempt to compile my own libcxx but libcxx is only known to work on Mac OS X.

In November, Microsoft released a Community Technology Preview increasing Visual Studio 2012's C++11 support, but it requires modifying your project to use the CTP toolchain. I'm using SCons and I have no idea how to convince it to use the CTP.

MinGW offers gcc 4.7 and, in theory, complete C++11 support, but a known bug disables std::to_string.  At least until gcc 4.8.

Blah. I think I’ll abandon Windows for development and focus on Mac and Linux until this situation has improved.

Benchmarking JSON Parsing: Emscripten vs. Native

by Chad Austin
Feb 11 13

This post concludes my obsession with JSON parsing. In fact, the entire reason I wrote a JSON parser was to show these graphs. I wanted to see whether I could write a JSON parser faster than any other when run in Emscripten. As vjson is typically faster, I did not succeed unless I requalify my goal as writing the fastest-in-Emscripten JSON parser with a useful parse tree.

This benchmark’s code is on GitHub. I encourage you to reproduce my results and search for flaws.

All benchmarks were run on a 2010 MacBook Pro (2.53 GHz Core i5, 8 GB 1067 MHz DDR3).

Native vs. Emscripten

First, let’s compare native JSON parsing performance (clang 3.1, -O2) with both stable and nightly versions of Firefox and Chrome.

Two things are immediately clear from this graph. First, native code is still 5-10x faster than Emscripten/JavaScript. Second, yajl and jansson are dramatically slower than rapidjson, sajson, and vjson. Native yajl and jansson are even slower than Emscripten’d sajson. Henceforth I will not include them.

Looking closely at the browser results, a third conclusion is clear. Firefox runs Emscripten’d code much faster than Chrome.

Finally, sajson consistently performs better than rapidjson in my Emscripten tests. And vjson always parses the fastest. I believe this is because Emscripten and browser JITs punish larger code sizes.

The previous graph only shows parse rates by browser and parser for a single file. Next let’s look at parse rates by file.

Yep, native code consistently wins. At this point I want to dig into differences between the browsers, so I will show the same graph but without native code.

Firefox vs. Chrome

Not only is Firefox consistently faster than Chrome, it’s faster by a factor of 2-5x!

Finally, here is the same graph, but normalized against Firefox 18.

If I were a Firefox JS developer, I’d be feeling pretty proud right now, assuming this experiment holds up to scrutiny. Even so, these results match my experience with Bullet/Emscripten in Chrome: Chrome takes a very long time to stabilize its JIT, and I’ve even seen it fail to stabilize, constantly deopting and then reoptimizing code. In contrast, Firefox may take longer to JIT up front, but performance is smooth afterwards.

Further work is necessary to test the hypothesis that code size is the biggest predictor of Emscripten performance.

Preemptive answers to predicted questions follow:

Well, duh. You shouldn’t use an Emscripten’d JSON parser. You should use the browser’s built-in JSON.parse function.

This isn’t about parsing JSON. This is about seeing whether the web can compete with native code under common workloads. After all, Mozillians have been claiming JavaScript is or will be fast enough to replace native apps. If parsing JSON through a blessed browser API is the only way to do it quickly, then developers are disempowered. What if a new format comes along? Must developers wait on the browser vendors? The web has to be better than that.

Shouldn’t you have compiled the Emscripten code with -fno-exceptions?

Yes. Oops. That said, after generating the graphs, I did recompile with -fno-exceptions and it made no difference to the benchmark results in either Chrome or Firefox.

JSON Parser Benchmarking

by Chad Austin
Jan 31 13

With the caveat that each parser provides different functionality and access to the resulting parse tree, I benchmarked sajson, rapidjson, vjson, YAJL, and Jansson. My methodology was simple: given large-ish real-world JSON files, parse them as many times as possible in one second. To include the cost of reading the parse tree in the benchmark, I then iterated over the entire document and summed the number of strings, numbers, object values, array elements, etc.
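For concreteness, here is a minimal sketch of that timing loop; parseAndTouch is a hypothetical callable standing in for each parser's parse-plus-traversal step:

#include <chrono>
#include <functional>
#include <string>

double parsesPerSecond(const std::string& body,
                       const std::function<void(const std::string&)>& parseAndTouch) {
    using Clock = std::chrono::steady_clock;
    Clock::time_point start = Clock::now();
    int count = 0;
    while (Clock::now() - start < std::chrono::seconds(1)) {
        parseAndTouch(body); // parse, then walk the tree and sum element counts
        ++count;
    }
    std::chrono::duration<double> elapsed = Clock::now() - start;
    return count / elapsed.count();
}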

The documents are:

  • apache_builds.json: Data from the Apache Jenkins installation. Mostly it's an array of three-element objects, with string keys and string values.
  • github_events.json: JSON data from GitHub’s events feature.  Nested objects several levels deep, mostly strings, but also contains booleans, nulls, and numbers.
  • instruments.json: A very long array of many-key objects.
  • mesh.json: 3D mesh data. Almost entirely consists of floating point numbers.
  • update-center.json: Also from Jenkins, though I'm not sure where I found it.

apache_builds.json, github_events.json, and instruments.json are pretty-printed with a great deal of interelement whitespace.

Now for the results. The Y axis is parses per second. Thus, higher is faster.

Core 2 Duo E6850, Windows 7, Visual Studio 2010, x86

Core 2 Duo E6850, Windows 7, Visual Studio 2010, AMD64

Atom D2700, Ubuntu 12.04, clang 3.0, AMD64

Raspberry Pi

Conclusions

sajson compares favorably to rapidjson and vjson, all of which stomp the C-based YAJL and Jansson parsers. 64-bit builds are noticeably faster: presumably because the additional registers are helpful. Raspberry Pis are slow. :)

Comparing JSON Parse Trees

by Chad Austin
Jan 25 13

I will soon post performance benchmarks of sajson, rapidjson, vjson, YAJL, and Jansson.

However, before talking about performance, I want to compare the parse trees generated by each library.

sajson

  • Differentiates between integers and doubles.
  • Array lengths are known.
  • Arrays are random-access.
  • Object lengths are known.
  • Object keys are sorted. Thus, random access of object elements is O(lg N).
  • Stored contiguously in memory.

rapidjson

  • At its root, rapidjson is a SAX-style parser.  However, rapidjson does provide a default DOM implementation.
  • Array and Object lengths are not directly accessible in the API, but are available via End() - Begin() or MemberBegin() - MemberEnd().
  • Arrays support random access.
  • Object lookup is O(N).
  • Differentiates between integers, unsigned integers, 64-bit integers, 64-bit unsigned integers, and doubles.
  • Parse tree is mutable.

vjson

  • Namespaces its symbols under the generic “json” prefix.
  • Differentiates between integers and floats.  No support for doubles.
  • Array and Object lengths are not stored.  Instead a linked list must be traversed to calculate the length.
  • Random array access is O(N).
  • Object element lookup is O(N), by walking a linked list.
  • Very simple interface.

YAJL

  • Differentiates between integers and doubles.
  • Array and Object lengths are available.
  • Object element lookup is O(N).
  • Parse tree is mutable.

Jansson

  • Differentiates between integers and doubles.
  • Array and Object lengths are available.
  • Array access is O(1).
  • Object element access is O(1) via a hash table.
  • The parse tree is mutable.