HTTP/2 Request Priorities – A Summary

by Chad Austin
Oct 30 14

Previously, I’d written about a particular use case we’d run into where we needed to make thousands of HTTP requests to download 3D assets to populate a WebGL scene.

Many of these assets (skeletons, meshes, lower-mip-level textures, poses) are higher priority than others (higher-mip-level textures, animations). In addition, objects closer to the camera should be prioritized higher than objects farther away.

The following table defines the priority of each asset download:

Asset Type Base Priority
Object Description 100
Skeleton 90
Mesh 90
Low-res texture 90
Pose 90
Animation 50
High-res texture 10

Plus a modifier based on the requesting object:

Object Priority modifier
Room +9
My Avatar +8
Other Avatars +[0,7] — distance to camera

Asset type determines the base priority. Within an asset type, the referencing 3D object adds a priority modifier. The room gets highest priority, your avatar next, and all remaining avatars are sorted by camera distance.

That is, object descriptions trump everything else. (They’re small and contain links to other assets.) Skeletons, meshes, low-res textures, and poses must be loaded before the object can be rendered at all, so they claim the next-highest priority. Finally, animations and high-res textures can come in later.

Bandwidth-Delay Product

Let me take a brief digression to explain why priority is important.

The bandwidth-delay product measures the amount of data that must be in flight to fully utilize a pipe. In 2012, the average US Internet connection was 6.7 Mbps. Average round-trip to Google in US was, say, 55 ms. This places the bandwidth-delay product at 6.7 Mbps * 55 ms = 46 KB. If you cannot keep 46 KB in flight at a time, you are wasting bandwidth. Of course, not everyone has Google’s data centers, and latency is jittery, and you likely want worldwide efficiency, so aim higher. Targeting a 150 KB bandwidth delay product, or even higher if you’re worldwide, is not a bad idea.
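
Spelling out the unit conversion (the figures above are assumptions about a typical US connection, not universal constants), a tiny sketch:

#include <cstdio>

int main() {
    // Assumed link characteristics from the paragraph above.
    const double bandwidth_bits_per_sec = 6.7e6;  // 6.7 Mbps
    const double round_trip_sec = 0.055;          // 55 ms

    // Bytes that must be in flight to keep the pipe full.
    const double bdp_bytes = bandwidth_bits_per_sec * round_trip_sec / 8.0;
    std::printf("bandwidth-delay product: %.0f bytes (~%.0f KB)\n",
                bdp_bytes, bdp_bytes / 1000.0);
    // Prints roughly 46062 bytes, i.e. about 46 KB.
}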

Now, because the pipe should have a lot of data in flight at any one moment, a large number of requests should be sent to the server right away. Since the client doesn’t know how big responses are, if the server returned low-priority responses first, the page would load more slowly than if low-priority requests weren’t sent until all high-priority responses were downloaded. However, having the client wait to issue low-priority requests does not make good use of available bandwidth. The best option is to give the server enough information to prioritize responses, letting it send down high-priority responses first, making everything load faster. Experience with SPDY shows that it’s critical that priority work well.

An ideal prioritization solution would satisfy the following objectives:

  • make full use of network bandwidth
  • retrieve responses for high-priority requests before low-priority requests
  • immediately — on the current or next frame — retrieve any responses already cached by the browser

On the web, today, there is no way to accomplish all of these goals.

Again, for as much as people like to point out that priority is optional or advisory [1] [2], it is critically important. Multiplexed connections without priority are slower than HTTP/1! Browsers already prioritize HTTP/1 requests, so if they didn’t prioritize SPDY or HTTP/2 requests, low-priority resources would contend for bandwidth with high-priority resources, causing initial load to be slower.

After whining about all of this to a friend, he casually said “Why don’t you take it up with the standards bodies?” (Damn you, jfb!) I’ve not had good luck with web standards bodies in the past — to the point that I have the impression some key people don’t care a whole lot about real-world use cases — but I thought I’d give it a shot.

It turns out that exposing a simple priority field on XMLHttpRequest (W3C WebApps WG), while easy, obvious, and an immediate solution to many use cases, does not satisfy everyone. In fact, it’s been proposed before. The WebApps WG instead suggested that prioritization should be solved at the “Fetch” layer, so regular DOM fetches like <img> and <script> tags can benefit too. This increases the scope of the proposal, but fair enough…

So I went to the WHATWG and discovered a proposal that, at first blush, looked absurdly complicated. It’s based on HTTP/2’s dependency tree priority model. I spent an evening or two wrapping my head around the dependency tree model and ran into some problems.

I now see that, for all of the varied use cases for prioritization on the web, there is no apparent solution that satisfies them all. (But one is very close, as you’ll see.)

Use Cases

Let’s work through a few illustrative use cases for prioritizing requests on a multiplexed connection.

Basic Page Load

The ideal scenario for a page load is that enough of the HTML arrives to link to the appropriate CSS and JS resources, after which the JS and CSS are downloaded (with resources completing serially so that response processing on the CPU can happen in parallel with transfer), followed by the remainder of the HTML.

This requires dynamic reprioritization of the HTML document mid-transfer.

Any referenced images are not critical to the initial load and thus have lower priority. They are also transferred in parallel rather than serially, under the assumption that they can be progressively rendered. Two half-complete JavaScript files are not useful at all, but two half-complete JPEGs certainly could be.

This use case illustrates sequential transfer (JS and CSS) and parallel transfer (images). Sequential transfer is ideal when a resource doesn’t provide value until it’s completely downloaded. Parallel transfer is ideal when resources support progressive display or when trying to be fair to competing requests.

Multiple Tabs

Consider a tabbed browser. Each tab may issue dozens of its own requests. The browser ought to prioritize requests from the focused tab. Background tabs receive the remainder of the network resources.

Switching tabs ought to efficiently reprioritize requests from the old and new tabs appropriately.

Proxy

Proxies take many requests from multiple clients and multiplex them on fewer backend connections. Proxies ought to be fair: if client A asks for low-priority requests and client B asks for high-priority requests, the proxy’s resources should be split evenly between them.

Streaming Video

A video stream (e.g. DASH or M3U) consists of a sequence of short video segments. The current segment has the highest priority, followed by the next, and so on.

Reprioritization occurs on seek, though it’s probably fine to disregard any outstanding requests upon seek.

Loading a 3D Scene

This is the use case described at the top. Unless the connection is otherwise idle, not a single byte of low-priority responses should come before high-priority responses. Requests should be periodically reprioritized as the objects move closer to the camera.

Large Page of Images

Imagine a web page that consists of 10,000 img tags. Ideally, they would download in priority order, where priority is distance from the current viewport. On scroll, requests would be reprioritized.

Server Push

I personally have no interest in server push and many smart people are hashing out how priority interacts with server-pushed data in various forums, so I will refrain from further discussion here.

Prioritization Primitives

We can take the use cases above and break them down into the primitives that a prioritization algorithm should support.

  1. Sequential in specified order (A before B before C …)
  2. Sequential in no particular order (A, B, C, but all of one before the next)
  3. Parallel (A, B, C, interleaved on the connection)
  4. Fairness (resources allocated evenly between {A, B, C} and {X, Y, Z}.)
  5. Reprioritization
  6. Parallel group dependencies (all of {A, B, C}, in parallel, before {D, E, F}, also in parallel)
  7. And because non-prioritized mux is worse than HTTP/1, advertised knowledge of prioritization support helps clients decide whether to blast all requests at once or be more conservative :)

The Proposed Priority Models

HTTP/1

First, let’s talk about the status quo: legacy HTTP served by N concurrent TCP connections.

Firefox maintains a priority queue on 32-bit signed integers fed from the nsISupportsPriority interface.

Chromium supports a handful of priorities but I can’t find the code that implements the priority queue in there. All evidence says it exists though. :)

Most browsers also implement a first-paint optimization: they identify resources that are required for the first paint, like HTML, JS, and CSS, and download those first. Everything else (images, XMLHttpRequests) is blocked until after the first paint. This is slightly inefficient: it leaves the connection idle during the first paint.

An HTTP/1 priority queue served by N TCP connections is inefficient for a couple reasons:

  1. round-trips and limited concurrency make it hard to keep the pipe full (especially on high-BDP connections)
  2. cannot reprioritize an in-flight request
  3. TCP flow control works better if you have one saturated connection rather than 8 unsaturated connections

SPDY/3

The version of SPDY currently in widespread use, SPDY/3, associates a 3-bit priority field with every stream. Streams cannot be reprioritized. “The sender and recipient SHOULD use best-effort to process streams in the order of highest priority to lowest priority.”

Original SPDY/4 Proposal

Two years ago, Google proposed switching SPDY/4’s priority model to a system based on stream dependencies and weights. The idea is that, rather than using numeric priorities, streams form a DAG which represents a partial ordering relation.

At each depth “level” of the DAG, weights are used to distribute available resources across streams.

In this model, streams could have multiple parents. You could say, for example, C depends on A and B, placing C at a lower priority than both A and B.

This proposal also introduced reprioritization.

SPDY/4 advertises on the connection whether prioritization has an effect.

One of the gnarlier bits of the stream dependency model is that it introduces some number of race conditions. After a stream is closed, how long is its information, for the purposes of future reprioritization packets, retained?

Microsoft’s HTTP/2 Proposal

In a simplified form, the SPDY/4 dependencies and weights proposal made it into the HTTP/2 draft. However, prioritization was still a somewhat controversial topic. Osama Mazahir, on behalf of Microsoft, proposed an alternate priority design that appeared to garner some support. See the replies in that thread.

The basic idea is that there are groups of streams. Each group is assigned a weight, and within each group, streams are assigned priority.

Osama does a better job describing the tradeoffs, but the main argument is that it’s simpler and supports more use cases more naturally than dependencies and weights.

The large open questions seem to be: “So what priority values do people choose? What if you want to prioritize resource B between A and C, but A’s and C’s priorities differ by only 1?” If you use numeric priorities, you have to select arbitrary values and hope you’ve left enough room between them.

After this thread, I never saw mention of it again. What happened?

HTTP/2 Draft 15

HTTP/2 Draft 15 goes into great detail about the stream dependencies and weights priority model.

Every stream is given one 31-bit parent and an 8-bit weight in [1, 256]. The server SHOULD process streams from the root of the tree down, dividing available resources among streams by their relative weights. If a parent stream closes, it is removed from the tree and its weight is divided among its children.

The default stream parent is zero – the root.

One repeated bit of confusion I’ve seen on the mailing lists is the difference between weights and priorities. Priorities are used to specify relative ordering: A before B. Weights are used to divide resources among many competing requests. If I wanted {A0,A1,A2,A3}[weight=256] before {B0,B1,B2,B3}[weight=128] before {C0,C1,C2,C3}[weight=1], is the expectation that the gateway proxy would open 12 simultaneous backend connections and divide the connection so that 2/3 of the frames serve A, 1/3 serve B, and a trickle serves C? That would be silly: we want all As before any Bs. So then is the expectation that proxies would “tend to choose” streams of higher weight and run them to completion? If so, that violates at least the spirit of the algorithm described in HTTP/2.

In terms of server decision-making and client bandwidth utilization, weights make a very poor substitute for true prioritization, so it’s important to treat them separately. Priorities specify ordering; weights specify resource allocation.
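
To make the distinction concrete, here is a minimal sketch of the two policies (my own illustration, not anything from the spec): a weight-proportional scheduler gives every open stream some share of each round’s budget, while a strict-priority scheduler drains the most important streams first.

#include <algorithm>
#include <string>
#include <vector>

struct Stream {
    std::string name;
    int weight;               // HTTP/2-style weight: relative share of bandwidth
    int priority;             // hypothetical strict priority: higher value is served first
    long long pending_bytes;
};

// Weight-based allocation: every open stream gets some of this round's budget,
// so low-weight streams still interleave with high-weight ones.
void allocateByWeight(std::vector<Stream>& streams, long long budget) {
    long long totalWeight = 0;
    for (const auto& s : streams)
        if (s.pending_bytes > 0) totalWeight += s.weight;
    for (auto& s : streams) {
        if (s.pending_bytes == 0 || totalWeight == 0) continue;
        long long share = budget * s.weight / totalWeight;
        s.pending_bytes -= std::min(s.pending_bytes, share);
    }
}

// Priority-based allocation: spend the whole budget on the most urgent
// streams; lower-priority streams get nothing until those finish.
void allocateByPriority(std::vector<Stream>& streams, long long budget) {
    std::sort(streams.begin(), streams.end(),
              [](const Stream& a, const Stream& b) { return a.priority > b.priority; });
    for (auto& s : streams) {
        long long sent = std::min(s.pending_bytes, budget);
        s.pending_bytes -= sent;
        budget -= sent;
        if (budget == 0) break;
    }
}

With weights 256, 128, and 1, the first policy still trickles bytes to the C streams while the A streams are outstanding; the second sends nothing for B or C until A is finished, which is what use cases like the 3D scene above actually want.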

However, there doesn’t seem to be much practical implementation experience with the HTTP/2 priority model. Mozilla’s current implementation ignores dependencies entirely, which would have unexpected results against a compliant connection, as low-priority resources would consume bandwidth while the page is waiting for high-priority resources. (To be fair, this implementation is actively in progress.)

Which Priority Protocol is Best?

Let’s enumerate the prioritization primitives again. The goal is to list whether a particular primitive is expressible in a given protocol. We won’t even consider HTTP/1 because it doesn’t solve for efficient bandwidth usage on a high-latency connection. The four priority protocols under consideration are SPDY/3, SPDY/4 as of 2012, Microsoft’s proposal, and HTTP/2.

SPDY/3 is easy to eliminate from the running too: 3 bits of priority is far too limiting for many use cases, streaming video chunks for example. After 8 chunks, you would be out of priority levels.

Use Case                                                    SPDY/4 2012   Microsoft   HTTP/2
Sequential in specified order (A before B before C …)       👍            👍          👍
Sequential in no particular order^                          -             -           -
Parallel (A, B, C in parallel)                              👍            👍          👍
Fairness (even split between {A, B, C} and {X, Y, Z})       👍            👍          👍
Reprioritization                                            👍            👍          👍
Parallel group dependencies ({A, B, C} before {D, E, F})    👍            👍          -
Advertisement of prioritization support                     👍            -           -

^ None of the protocols allow the requestor to specify that responses should be sent in their entirety before the next response is sent without also defining a specific sequence. I suppose the assumption is that the server can make this determination by content type? I wonder whether a bit to indicate “I don’t care about the order, but once you start transmitting a resource, please finish it before starting the next” would be useful in practice.

Many of the desired use cases are supported in HTTP/2, but Microsoft’s proposal appears to satisfy strictly more. I wonder what happened to it in committee? I will reach out to its authors and supporters and see.

The biggest issues with HTTP/2 are that:

  • There is no way to express “download {A,B,C} in parallel then download {D,E,F} in parallel”
  • If priority is truly optional, and there is no mechanism by which the server advertises support, then clients will need to play it safe, reducing the possible upside from a multiplexed connection.

Can HTTP/2 be fixed?

Can we modify HTTP/2 to add support for the {A,B,C} in parallel followed by {D,E,F} use case?

The idea of dummy placeholder streams has been bandied about. Let’s see what that might look like:

  • 0 is the root.
  • P1 is the high-priority dummy stream.
  • P2 is the low-priority dummy stream.

If we defined a tree as such:

P1 -> 0
{A, B, C, P2} -> P1
{D, E, F} -> P2

and changed the HTTP/2 specification such that dummy nodes never closed and streams closer to the root were serviced first, then the “Parallel groups” use case is satisfied. However, watch what happens to fairness in the proxy / multiple tabs use case.

Consider two inbound connections to a proxy that are being multiplexed on one outbound connection.

Inbound connection A has three priority levels, thus three dummy streams. Inbound connection B has two priority levels, thus two dummy streams.

0 <- A_P1 <- A_P2 <- A_P3
0 <- B_P1 <- B_P2

Now let’s assume that inbound connection A has two requests: A_S1 at priority A_P1 and A_S2 at priority A_P3. Inbound connection B also has two requests: B_S1 at priority B_P1 and B_S2 at B_P2.

A_P1 <- A_S1
A_P3 <- A_S2
B_P1 <- B_S1
B_P2 <- B_S2

Note that A’s priority levels go deeper than B’s. Per the proposed spec modification above, A_S1 and B_S1 would have equal priority and be serviced fairly. However, if two streams are at unequal depth in the tree, the one closer to the root would win. B_S2 would unfairly starve A_S2. So it seems that fairness and group prioritization are at odds in a collapsing, tree-based dependency model.

What now?

It’s surprising to me how hard it is to get priorities correct over multiplexed streams. Microsoft’s proposal seems to be the most useful, but I have no insight into what caused it to fail in committee.

I don’t have any ideas for how to save a tree-based model. We need either weighted stream groups or we need to convert the tree into a DAG — that is, if nodes had multiple parents, the dependency model would work swimmingly.

Getting priority right is critical — HTTP/2 will be the most popular protocol on the Internet, and the upside potential in both page-load efficiency and performance of new web applications is huge.

Pre-emptive Rebuttals

Why not define your own priority extension?

The big advantage to using HTTP/2 is that it’s a standard. CDNs, caching proxies, gateways, and even load balancers will implement it. A custom priority protocol would be prohibitively hard to deploy in practice.

Just get over it and use weights as priorities.

I hope I demonstrated earlier why weights don’t solve the prioritization problem. “But it’s close enough!” The speed of light isn’t getting any faster. Page load optimization is getting increasingly expensive. Projects like SPDY, QUIC, and HTTP/2 are ambitious, protocol-level efforts for arguably marginal wins. Being “close enough” is no longer sufficient. Defining a protocol that can specify the ideal transfer schedule has a huge return on investment, assuming the protocol is adopted and implemented well.

The Gamepad API

by Chad Austin
Oct 12 14

The Gamepad API is a general API supporting a large number of possible input devices. However, it’s named after the most common use case: gamepad controllers. It could definitely support IR remote controls, switches, audio mixers, and so on… Maybe it should be named the “Input Device API”. :) It’s not as general as DirectInput or USB HID though. No access to positional information (like mouse deltas, finger coordinates on a trackpad, or 3D tracking), no force feedback, and no access to device accelerometers.

Overall, the Gamepad API is narrow in scope. With a little more specification effort and implementation complexity (copy the good bits of DirectInput or Raw Input?), the API could support a huge number of use cases.

That said, within its narrow scope, the Gamepad API is well-designed. Devices can have an arbitrary number of buttons and axes.

I’d make a few changes, however.

Thoughts

Remove Fingerprinting Mitigation

Gamepads MUST only appear in the list if they are currently connected to the user agent, and at least one device has been interacted with by the user. If no devices have been interacted with, devices MUST NOT appear in the list to avoid a malicious page from fingerprinting the user.

That’s such a lost cause it’s not even funny. There are dozens of ways to retrieve identifying information from a user (and companies that license implementations of such). As long as there are machine learning algorithms and any differences at all between different computers and browsers, users will be fingerprinted. My favorite example is when researchers demonstrated a fingerprinting technique by rendering text into a canvas and analyzing the anti-aliasing and font rendering. Attempting to mitigate fingerprinting by penalizing the user experience seems like the wrong tradeoff here, except perhaps on an opt-in basis.

Allow Standard-Mapped Devices to have Extra Buttons or Axes

Two sentences in the spec imply that an input device that is recognized to implement a standard mapping will not expose more functionality than is defined by the mapping object.

When the user agent recognizes the attached device, it is RECOMMENDED that it be remapped to a canonical ordering when possible.

and

The standard gamepad has 4 axes, and up to 17 buttons.

Simply changing the wording to “has at least 4 axes, and at least 17 buttons” would allow games that default to standard mapping inputs to work with devices that support additional capabilities, like racing wheels with standard-mapped controls in the middle.

Add support for multiple standard mappings per device.

Input controller idioms come and go. Some become entrenched in a generation’s gamepads, and some fade away.

Assigning a single canonical mapping per device limits the discovery of useful structure. For example, an SNES controller, Wiimote, or Logitech Driving Force wheel wouldn’t satisfy the “Standard Gamepad” mapping, but all of them have a directional pad. If there were a “Directional Pad” mapping, and devices implemented multiple mappings, then any game that relied on a Directional Pad would work out of the box.

Add an event-based input mechanism.

The Gamepad API’s sole input discovery mechanism relies on JavaScript polling the current gamepad state at high frequency to detect input events from the gamepad. For interactive applications like games, this is usually fine. However, polling at 60 Hz in JavaScript is excessive if you just need to know when a button was pressed. We don’t poll for mouse or keyboard events – why are gamepads different? If the argument is convenience, libraries can always offer that.

Moreover, underneath it all, some high-priority operating system thread is polling the device, and translating the current device state into an event stream. This is why, even though XInput is a polling-based API, if your game drops a few frames, button presses aren’t lost.

For certain classes of problems, like mental chronometry, you need to know when the button press occurred so you can measure elapsed time within a millisecond or so. If JavaScript is polling button state, it doesn’t know when the input event originated – perhaps 10-50 ms of latency has elapsed by the time it sees the button change – but the underlying high-priority polling thread knows. (Or at least has a better sense.)

Let’s say I’m playing a game running at 10 Hz and I press a button to open a bit of UI. Display of the UI hitches (maybe WebGL textures are being uploaded), pausing the game’s polling loop for one second or so, blocking requestAnimationFrame from polling the device. If the Gamepad API isn’t polling the device at a higher frequency under the hood, any buttons pressed and released within that one second period would not register. So we have to assume that, to avoid missed button presses, they have to be queued. But, since only ONE change can be measured each frame, each press has to be queued for two frames (button DOWN, button UP).

Thus, even though the presses occurred at T, T+100ms, and T+200ms, JavaScript won’t see them until T+1000ms, T+1200ms, and T+1400ms. In games that rely on high-precision gestures, this is the difference between recognizing gestures even in the presence of frame drops and missing them entirely.

If you think dropped events aren’t a problem in “real games”, try playing SimCity 2013 on an average Mac sometime… The low frame rate would be completely tolerable except for the dropped events.

In contrast, an API where JavaScript would ask “Give me all the input events since I last asked” would associate a timestamp with each event, allowing for accurate combination recognition. If the Gamepad API is backed by Windows’s Raw Input API, this data can be retrieved with GetMessageTime().
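
Here is a rough sketch, written in C++ for concreteness, of what such a “drain the events since I last asked” interface could look like; the names and the dedicated polling thread are hypothetical, not part of the Gamepad API.

#include <deque>
#include <mutex>
#include <vector>

// A timestamped button transition, recorded by a high-frequency polling
// thread the moment the device state changes.
struct ButtonEvent {
    double timestampMs;  // when the transition was observed, not when the app polled
    int buttonIndex;
    bool pressed;
};

class GamepadEventQueue {
public:
    // Called from the device-polling thread whenever the state changes.
    void push(const ButtonEvent& e) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(e);
    }

    // Called by the application: "give me all the input events since I last asked".
    // Nothing is lost if the application stalls for a second; the events are
    // simply delivered late, with their original timestamps intact.
    std::vector<ButtonEvent> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<ButtonEvent> out(events_.begin(), events_.end());
        events_.clear();
        return out;
    }

private:
    std::mutex mutex_;
    std::deque<ButtonEvent> events_;
};

With timestamps captured by the polling thread, a one-second hitch in the page delays delivery of the events but not the recorded press times.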

The Gamepad API is definitely a step in the right direction, but it feels like it ought to be a little lower-level and more general to avoid being yet another Almost Good web API.

Optimizing WebGL Shaders by Reading D3D Shader Assembly

by Chad Austin
Sep 26 14

This post is a mirror of the IMVU Engineering Blog.

We are optimizing WebGL shaders for the Intel GMA 950 chipset, which is basically the slowest WebGL-capable device we care about. Unfortunately, it’s a fairly common chipset too. On the plus side, if we run well on the GMA 950, we should basically run well anywhere. :)

When you’re writing GLSL in WebGL on Windows, your code is three layers of abstraction away from what actually runs on the GPU. First, ANGLE translates your GLSL into HLSL. Then, D3DX compiles the HLSL into optimized shader assembly bytecode. That shader bytecode is given to the driver, where it’s translated into hardware instructions for execution on the silicon.

Ideally, when writing GLSL, we’d like to see at least the resulting D3D shader assembly.

With a great deal of effort and luck, I have finally succeeded at extracting Direct3D shader instructions from WebGL. I installed NVIDIA Nsight and Visual Studio 2013 in Boot Camp on my Mac. To use Nsight, you need to create a dummy Visual Studio project. Without a project, it won’t launch at all. Once you have your dummy project open, configure Nsight (via Nsight User Properties under Project Settings) to launch Firefox.exe instead. Firefox is easier to debug than Chrome because it runs in a single process.

If you’re lucky — and I’m not sure why it’s so unreliable — the Nsight UI will show up inside Firefox. If it doesn’t show up, try launching it again. Move the window around, open various menus. Eventually you should have the ability to capture a frame. If you try to capture a frame before the Nsight UI shows up in Firefox, Firefox will hang.

Once you capture a frame, find an interesting draw call, make sure the geometry is from your WebGL application, and then begin looking at the shader. (Note: again, this tool is unreliable. Sometimes you get to look at the HLSL that ANGLE produced, which you can compile and disassemble with fxc.exe, and sometimes you get the raw shader assembly.)

Anyway, check this out. We’re going to focus on the array lookup in the following bone skinning GLSL:

attribute vec4 vBoneIndices;
uniform vec4 uBones[3 * 68];

ivec3 i = ivec3(vBoneIndices.xyz) * 3;
vec4 row0 = uBones[i.x];

ANGLE translates the array lookup into HLSL. Note the added bounds check for security. (Why? D3D claims it already bounds-checks array accesses.)

int3 _i = (ivec3(_vBoneIndices) * 3);
float4 _row0 = _uBones[int(clamp(float(_i.x), 0.0, 203.0))];

This generates the following shader instructions:

# NOTE: v0 is vBoneIndices
# The actual load isn't shown here.  This is just index calculation.

def c0, 2.00787401, -1, 3, 0
def c218, 203, 2, 0.5, 0

# r1 = v0, truncated towards zero
slt r1.xyz, v0, -v0
frc r2.xyz, v0
add r3.xyz, -r2, v0
slt r2.xyz, -r2, r2
mad r1.xyz, r1, r2, r3

mul r2.xyz, r1, c0.z # times three

# clamp
max r2.xyz, r2, c0.w
min r2.xyz, r2, c218.x

# get ready to load, using a0.x as index into uBones
mova a0.x, r2.y

That blob of instructions that implements truncation towards zero? Let’s decode that.

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = v0 - floor(v0)
r3.xyz = v0 - r2
r2.xyz = (-r2 < r2) ? 1 : 0
r1.xyz = r1 * r2 + r3

Simplified further:

r1.xyz = (v0 < 0) ? 1 : 0
r2.xyz = (floor(v0) < v0) ? 1 : 0
r1.xyz = (r1 * r2) + floor(v0)

In short, r1 = floor(v0), UNLESS v0 < 0 and floor(v0) < v0, in which case r1 = floor(v0) + 1.

That’s a lot of instructions just to calculate an array index. After the index is calculated, it’s multiplied by three, clamped to the array boundaries (securitah!), and loaded into the address register.

Can we convince ANGLE and HLSL that the array index will never be negative? (It should know this, since it’s already clamping later on, but whatever.) Perhaps avoid a bunch of generated code? Let’s tweak the GLSL a bit.

ivec3 i = ivec3(max(vec3(0), vBoneIndices.xyz)) * 3;
vec4 row0 = uBones[i.x];

Now the instruction stream is substantially reduced!

def c0, 2.00787401, -1, 0, 3
def c218, 203, 1, 2, 0.5

# clamp v0 positive
max r1, c0.z, v0.xyzx

# r1 = floor(r1)
frc r2, r1.wyzw
add r1, r1, -r2

mul r1, r1, c0.w # times three

# bound-check against array
min r2.xyz, r1.wyzw, c218.x
mova a0.x, r2.y

By clamping the bone indices against zero before converting to integer, the shader optimizer eliminates several instructions.

Can we get rid of the two floor instructions? We know that the mova instruction rounds to the nearest integer when converting a float to an index. Given that knowledge, I tried to eliminate the floor by making my GLSL match mova semantics, but the HLSL compiler didn’t seem smart enough to elide the two floor instructions. If you can figure this out, please let me know!

Either way, I wanted to demonstrate that, by reading the generated Direct3D shader code from your WebGL shaders, you may find small changes that result in big wins.

Web Platform Limitations, Part 2 – Web Audio API is a Mess

by Chad Austin
Sep 17 14

The Web Audio API is a beautiful example of how bizarre web APIs can get.

Originally, Mozilla proposed their Audio Data API, which allowed programmatic creation of a stream of audio sample data. You simply opened an output stream, specified the number of channels and sample frequency, and wrote sample data to it. It was simple, beautiful, low-latency, and exposed a minimum baseline set of functionality, meaning that each browser would have likely had a high-quality implementation. The Audio Data API had another nice feature: the application developer specified the desired sample rate. In practice, every OS already has a high-quality resampler and mixer, so there’s no penalty when using a non-native sample rate.

I don’t know exactly why the Audio Data API lost, but I suspect it had something to do with the idea that JavaScript is too slow or high-latency to generate sample data in small buffers on demand.

Of course, with typed arrays, asm.js, and the upcoming SIMD.js spec (more on that later), that concern is not very well-founded.

Either way, the Audio Data API lost and the Web Audio API won.

The Web Audio API is an enormous mess. It has a huge API surface with a signal processing graph containing HRTFs, sound cones, doppler, convolutions, oscillators, filters, and so on. Quoth the spec: “It is a goal of this specification to include the capabilities found in modern game audio engines as well as some of the mixing, processing, and filtering tasks that are found in modern desktop audio production applications.”

We may end up with at least four separate implementations: Blink, WebKit, Gecko, and IE. Some friends of mine and I have this suspicion that, since each browser has to implement such a wide API, in practice, browsers won’t always do a great job with each node, so serious audio people will simply generate samples in JavaScript (or Emscripten-compiled C++) anyway.

I created a little demo that shows how poorly browsers handle the basic task of playing a handful of buffers, with fixed sample rates, at exact, scheduled times. I did not find a single browser on either Windows or Mac that could play five 200 Hz sine wave buffers without glitching. You can try the demo yourself.

And I would be remiss without linking the jsmess plea for help.

These are almost certainly solvable problems (the Web Audio API allows you to specify when buffers begin playback in seconds at 64-bit floating point precision), but given that the Web Audio API has been in development for years, you’d think the fundamentals of playing gapless buffers would be solved by now. All the media graph stuff could be optional and layered on.

Final Thoughts

For posterity, audio APIs should work much like DirectSound: at some fairly high, reliable frequency, your application should query the current ‘play’ and ‘write’ heads in a circular audio buffer, fill the buffer from the write head up to the play head, and ta da. Streaming audio. Let the OS’s audio stack handle the final resampling and mixing, and let the particular application perform whatever transforms it needs to. Of course, the actual polling and buffer-filling could happen in a browser thread, and JavaScript could stream arbitrary numbers of samples in advance if latency doesn’t matter that much.
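
Here is a minimal sketch of that fill loop; the class and names are mine, under the assumption that the audio stack reports a play-head position and consumes frames behind it.

#include <cstddef>
#include <vector>

// DirectSound-style circular buffer: the audio stack advances the play head as
// it consumes frames; the application keeps the region from the write head up
// to (just short of) the play head filled with fresh samples.
class RingBufferWriter {
public:
    explicit RingBufferWriter(std::size_t frames) : buffer_(frames), writeHead_(0) {}

    // Called at some reliable frequency; playHead is reported by the audio stack.
    template <typename SynthesizeFrame>
    void fill(std::size_t playHead, SynthesizeFrame synthesizeFrame) {
        // Frames we may write without overrunning the play head (one frame of
        // slack keeps the write head from catching it exactly).
        std::size_t writable =
            (playHead + buffer_.size() - writeHead_ - 1) % buffer_.size();
        for (std::size_t i = 0; i < writable; ++i) {
            buffer_[writeHead_] = synthesizeFrame();  // application-defined DSP
            writeHead_ = (writeHead_ + 1) % buffer_.size();
        }
    }

private:
    std::vector<float> buffer_;
    std::size_t writeHead_;
};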

(Side note: OpenAL is terrible too. Audio is a stream, not a collection of buffers that you queue.)

“But, but, JavaScript performance!”

If your concern is truly about JavaScript performance, then we’re all in trouble anyway. Why should only audio benefit from SIMD and low latency callbacks? If it will never be the case that JavaScript numerical performance will approach native numerical performance, the Web Audio graph could be replaced entirely by a handful of signal processing functions that are 1) mathematically defined and 2) work on arbitrary typed arrays. This would be simpler for browsers to implement, easier to verify, AND have more flexibility.

Connecting C++ and JavaScript on the Web with Embind

by Chad Austin
Sep 12 14

At CppCon 2014 I gave a presentation on the design of Embind, its feature set, and some little C++11 tricks we developed to make Embind a little smaller and nicer.

I understand the video will be available after the conference, but for now, the annotated slides are up:

Web Platform Limitations, Part 1 – XMLHttpRequest Priority

by Chad Austin
Aug 31 14

Note: this post is a mirror of the post at the IMVU Engineering Blog.

The web is inching towards being a general-purpose applications platform that rivals native apps for performance and functionality. However, to this day, API gaps make certain use cases hard or impossible.

Today I want to talk about streaming network resources for a 3D world.

Streaming Data

In IMVU, when joining a 3D room, hundreds of resources need to be downloaded: object descriptions, 3D mesh geometry, textures, animations, and skeletons. The size of this data is nontrivial: a room might contain 40 MB of data, even after gzip. The largest assets are textures.

To improve load times, we first request low-resolution thumbnail versions of each texture. Then, once the scene is fully loaded and interactive, high-resolution textures are downloaded in the background. High-resolution textures are larger and not critical for interactivity. That is, each resource is requested with a priority:

High priority: skeletons, meshes, low-resolution textures, animations
Low priority: high-resolution textures

Browser Connection Limits

Let’s imagine what would happen if we opened an XMLHttpRequest for each resource right away. What happens depends on whether the browser is using plain HTTP or SPDY.

HTTP

Browsers limit the number of simultaneous TCP connections to each domain. That is, if the browser’s limit is 8 connections per domain, and we open 50 XMLHttpRequests, the 9th would not even be submitted until one of the first 8 finished. (In theory, HTTP pipelining helps, but browsers don’t enable it by default.) Since there is no way to indicate priority in the XMLHttpRequest API, you would have to be careful to issue XMLHttpRequests in order of decreasing priority, and hope no higher-priority requests would arrive later. (Otherwise, they would be blocked by low-priority requests.) This assumes the browser issues requests in sequence, of course. If not, all bets are off.

There is a way to approximate a prioritizing network atop HTTP XMLHttpRequest. At the application layer, limit the number of open XMLHttpRequests to 8 or so and have them pull the next request from a priority queue as requests finish.
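
A sketch of that approximation, with the actual XMLHttpRequest plumbing stubbed out behind a hypothetical startRequest callback; the scheduling logic is the point:

#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Application-level approximation of a prioritizing network: keep at most
// maxConcurrent requests open, and always start the highest-priority pending
// request when a slot frees up. startRequest stands in for whatever actually
// issues the request; it must invoke onDone when the response arrives.
class PriorityFetcher {
public:
    using StartFn = std::function<void(const std::string&, std::function<void()>)>;  // (url, onDone)

    PriorityFetcher(std::size_t maxConcurrent, StartFn startRequest)
        : maxConcurrent_(maxConcurrent), startRequest_(std::move(startRequest)) {}

    void enqueue(int priority, const std::string& url) {
        pending_.push(Pending{priority, url});
        pump();
    }

private:
    struct Pending {
        int priority;
        std::string url;
        bool operator<(const Pending& other) const {
            return priority < other.priority;  // larger priority value pops first
        }
    };

    void pump() {
        while (inFlight_ < maxConcurrent_ && !pending_.empty()) {
            Pending next = pending_.top();
            pending_.pop();
            ++inFlight_;
            startRequest_(next.url, [this] {
                --inFlight_;
                pump();  // a slot freed up: start the next-highest-priority request
            });
        }
    }

    std::size_t maxConcurrent_;
    StartFn startRequest_;
    std::size_t inFlight_ = 0;
    std::priority_queue<Pending> pending_;
};

Note that the cap on concurrent requests is also what keeps this approximation from using all available bandwidth.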

Soon I’ll show why this doesn’t work that well in practice.

SPDY

If the browser and server both support SPDY, then there is no theoretical limit on the number of simultaneous requests to a server. The browser could happily blast out hundreds of HTTP requests, and the responses will make full use of your bandwidth. However, a low-priority response might burn bandwidth that could otherwise be spent on a high-priority response.

SPDY has a mechanism for prioritizing requests. However, that mechanism is not exposed to JavaScript, so, like HTTP, you either issue requests from high priority to low priority or you build the prioritizing network approximation described above.

However, the prioritizing network reduces effective bandwidth utilization by limiting the number of concurrent requests at any given time.

Prioritizing Network Approximation

Let’s consider the prioritizing network implementation described above. Besides the fact that it doesn’t make good use of the browser’s available bandwidth, it has another serious problem: imagine we’re loading a 3D scene with some meshes, 100 low-resolution textures, and 100 high-resolution textures. Imagine the high-resolution textures are already in the browser’s disk cache, but the low-resolution textures aren’t.

Even though the high-resolution textures could be displayed immediately (and would trump the low-resolution textures), because they have low priority, the prioritizing network won’t even check the disk cache until all low-resolution textures have been downloaded.

That is, even though the customer’s computer has all of the high-resolution textures on disk, they won’t show up for several seconds! This is an unnecessarily poor experience.

Browser Knows Best

In short, the browser has all of the information needed to effectively prioritize HTTP requests. It knows whether it’s using HTTP or SPDY. It knows what’s in cache and not.

It would be super fantastic if browsers let you tell them how to prioritize each request. I’ve seen some discussions about adding priority hints, but they seem to have languished.

tl;dr Not being able to tell the browser about request priority prevents us from making effective use of available bandwidth.

FAQ

Why not download all resources in one large zip file or package?

Each resource lives at its own URL, which maximizes utilization of HTTP caches and data sharing. If we downloaded resources in a zip file, we wouldn’t be able to leverage CDNs and the rest of the HTTP ecosystem. In addition, HTTP allows trivially sharing resources across objects. Plus, with protocols like SPDY, per-request overhead is greatly reduced.

CppCon 2014: Embind and Emscripten: Blending C++11, JavaScript, and the Web Browser

by Chad Austin
Aug 23 14

On September 9th, at CppCon 2014, I am presenting on the design of Embind, a JavaScript-to-C++ binding API for Emscripten. Embind was authored by IMVU Inc. and myself.

See the session abstract at the conference site.

In this talk, you’ll learn where Embind fits in the overall space of solutions for connecting C++ and JavaScript, why generated code size is so important, how Embind works hard to keep code size small, and several of the C++11 techniques I learned for this project.

C++ Casts

by Chad Austin
Jul 22 14

Once you explain something enough, it’s worth writing it down, simply so you can link it. :)

There are five cast operators in C++: static_cast, dynamic_cast, const_cast, reinterpret_cast, and the C-style cast.

cppreference.com does a great job documenting the semantics of the various cast operators, so I won’t cover them here. Instead, let’s define some rules that will keep you and your teammates sane and prevent a class of bug-causing accidents.

Enable Warnings For Implicit Conversions

Enable your compiler’s warnings for implicit conversions of float -> int, int -> float, int -> char, and so on. On the hopefully-rare occasion you actually want said conversion, use a static_cast.
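
For example, a trivial illustration of making an intentional narrowing conversion explicit:

int toPercent(double ratio) {
    // The truncation is deliberate, so spell it out; written as an implicit
    // conversion, this is exactly what the warning above would flag.
    return static_cast<int>(ratio * 100.0);
}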

Avoid Casts

When possible, avoid casts entirely. If you find yourself wanting dynamic_cast, consider whether a virtual function will suffice. Virtual functions are cheaper and your design becomes more flexible. Instead of:

if (auto p = dynamic_cast<Console*>(output)) {
    p->setColor(RED);
} else {
    output->output("FAIL: ");  // p is null in this branch, so call through the base pointer
}

Try:

output->onFailure();

Also, consider restructuring your code so type knowledge is explicit. Sometimes, instead of a single list of base class pointers, it works out better to store a list per subtype. That is, instead of std::vector<Material*> materials;, the following design might work a little more smoothly in practice: std::vector<StaticMaterial*> staticMaterials; std::vector<DynamicMaterial*> dynamicMaterials;

If you’re converting between floats and ints all the time, see if you can stick with one or the other. Some platforms severely penalize said conversions.

In general, frequent use of cast operators indicates the software’s design can be improved.

Use Weaker Casts

Use the most restrictive cast operator for the situation. Casts should be precise, specific, and should fail if the code is changed in a way that makes the conversion meaningless.

Prefer static_cast over reinterpret_cast

static_cast will give a compile-time error if attempting to convert C* to D* where neither C nor D derives from the other. reinterpret_cast and C-style casts would both allow said conversion.
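
A quick illustration, with two unrelated, made-up types:

struct Console { int color; };
struct Socket  { int fd; };

void example(Console* c) {
    // Socket* s = static_cast<Socket*>(c);    // compile error: unrelated types
    Socket* s = reinterpret_cast<Socket*>(c);  // compiles, and is almost certainly a bug
    (void)s;
}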

Prefer reinterpret_cast over C-style Casts

reinterpret_cast does not allow casting away constness. You must use const_cast for that. C-style casts, again, let you do anything. Prefer the weaker cast operation.

Avoid const_cast

I’m not sure I’ve ever seen a justified use of const_cast. It’s almost always some kind of clever hack. Just get your constness right in the first place.

Avoid C-style Casts

I’m going to repeat myself. Don’t use C-style casts. They let anything by. When you refactor something and you find yourself debugging some insane issue that should have been a compiler error, you’ll wish you used a weaker cast operator.

Don’t Cast Function Pointers to void*

It’s undefined behavior.

Emscripten, Callbacks, and C++11 Lambdas

by Chad Austin
Jun 20 14

The web is an asynchronous world built on asynchronous APIs. Thus, typical web applications are full of callbacks. onload and onerror for XMLHttpRequest, the callback argument to setTimeout, and messages from Web Workers are common examples.

Using asynchronous APIs is relatively natural in JavaScript. JavaScript is garbage-collected and supports anonymous functions that close over their scope.

However, when writing Emscripten-compiled C++ to interact with asynchronous web APIs, callbacks are less natural. First, JavaScript must call into a C++ interface. Embind works well for this purpose. Then, the C++ must respond to the callback appropriately. This post is about the latter component: writing efficient and clean asynchronous code in C++11.

I won’t go into detail here about how that works, but imagine you have an interface for fetching URLs with XMLHttpRequest.

class XHRCallback {
public:
    virtual ~XHRCallback() {}
    virtual void onLoaded(const Response& response) = 0;
};
void fetch(const std::string& url, XHRCallback* callback);

Imagine Response is a struct that embodies the HTTP response and onLoaded runs when the XMLHttpRequest ‘load’ event fires.

To fetch data from the network, you would instantiate an implementation of the XHRCallback interface and pass it into the XHR object. I’m not going to cover in this article how to connect these interfaces up to JavaScript, but instead we will look at various implementations of XHRCallback on the C++ side.

For the purposes of this example, let’s imagine we want to fetch some JSON, parse it, and store the result in a model.

Approach 1

A simple approach is to write an implementation of the interface that knows about the destination Model and updates it after parsing the body. Something like:

class MyXHRCallback : public XHRCallback {
public:
    Model* model;

    MyXHRCallback(Model* model) : model(model) {}

    virtual void onLoaded(const Response& response) override {
        model->update(parseJSON(response.body));
    }
};

void updateModelFromURL(const std::string& url, Model* model) {
    fetch(url, new MyXHRCallback(model));
}

This is quite doable, but it’s a real pain to write a new class instance, field list, and constructor for every callback.

What if we tried to simplify the API with C++11 lambdas?

Approach 2

class LambdaXHRCallback : public XHRCallback {
public:
    std::function<void(const Response&)> onLoadedFunction;
    virtual void onLoaded(const Response& response) override {
        if (onLoadedFunction) {
            onLoadedFunction(response);
        }
    }
};

Above is boilerplate per interface. Below is use.

void updateModelFromURL(const std::string& url, Model* model) {
    auto callback = new LambdaXHRCallback;
    callback->onLoadedFunction = [model](const Response& response) {
        model->update(parseJSON(response.body));
    };
    fetch(url, callback);
}

Ignoring the implementation of LambdaXHRCallback, the API’s a little cleaner to use. This approach requires backing the callback interface with an implementation that delegates to a std::function. The std::function can be bound to a local lambda, keeping the callback logic lexically near the code issuing the request.

From a clarity perspective, this is an improvement. However, because Emscripten requires that your customers download and parse the entire program during page load (in some browsers, parsing happens on every pageload!), code size is a huge deal. Even code in rarely-used code paths is worth paying attention to.

std::function, being implemented with its own abstract “implementation-knowledge-erasing” interface that is allocated upon initialization or assignment, tends to result in rather fat code. The default 16-byte backing storage in 32-bit libc++ doesn’t help either.

Can we achieve clear asynchronous code without paying the std::function penalty? Yes, in fact!

Approach 3

template<typename OnLoad>
class LambdaXHRCallback : public XHRCallback {
public:
    // perfect forwarding
    LambdaXHRCallback(OnLoad&& onLoad)
    : onLoad_(std::forward<OnLoad>(onLoad))
    {}

    virtual void onLoaded(const Response& response) override {
        onLoad_(response);
    }
    
private:
    OnLoad onLoad_;
};

// this function exists so OnLoad’s type can be inferred,
// as lambdas have anonymous types
template<typename OnLoad>
LambdaXHRCallback<OnLoad>* makeXHRCallback(OnLoad&& onLoad) {
    return new LambdaXHRCallback<OnLoad>(std::forward<OnLoad>(onLoad));
}

Above is boilerplate per interface. Below is use.

void updateModelFromURL(const std::string& url, Model* model) {
    fetch(url, makeXHRCallback(
        [model](const Response& response) {
            model->update(parseJSON(response.body));
        }
    ));
}

But… but… there are templates here, how is that any better than std::function? Well, first of all, now we only have one virtual call: the XHRCallback interface itself. Previously, we would have a virtual call into LambdaXHRCallback and then again through the std::function.

Second, in C++11, lambdas are syntax sugar for an anonymous class type with an operator(). Since the lambda’s immediately given to the LambdaXHRCallback template and stored directly as a member variable, in practice, the types are merged during link-time optimization.

I ported a dozen or so network callbacks from std::function to the template lambda implementation and saw a 39 KB reduction in the size of the resulting minified JavaScript.

I won’t go so far as to recommend avoiding std::function in Emscripten projects, but I would suggest asking whether there are better ways to accomplish your goals.

View-Independent Translucent Triangle Sorting

by Chad Austin
Feb 26 14

Given a large, potentially self-intersecting, translucent, two-sided triangle mesh, we’d like to render it with as few visual artifacts as possible. The issue is that OpenGL and Direct3D will render triangles in exactly the order they’re given, and since alpha blending is not commutative, alpha-blended triangles must be rendered back to front.

My colleague, Jon Watte, pointed me to an article he’d written that describes an O(n^2) algorithm that splits all double-sided triangles into two, which are then sorted based on a coverage factor: how many other triangles “shadow” this one. Triangles that are farther back are sorted lower in the list than triangles in front of them. This algorithm is view-independent.

My understanding is that Torque Engine and Unreal Engine implement similar algorithms.

The idea works well in practice, but O(n^2) is too slow for large meshes, especially approaching the 100,000 triangle mark.

I set out to see if we could get similar results from an algorithm with subquadratic or even linear-ish running time.

The basic idea is, if we can define a strict weak ordering on triangles, then we can run a standard sort algorithm across the triangles and have a perfectly-ordered, view-independent, translucent mesh.

What does it mean to order two translucent triangles in a mesh?

bool compare(Triangle A, Triangle B) {
    a_behind_b = all points in A behind B
    b_behind_a = all points in B behind A
    if a_behind_b and b_behind_a: return false
    if !(a_behind_b or b_behind_a): return false
    if a_behind_b: return true
    if b_behind_a: return false
    // some kind of weighted area comparison
}

If two triangles are facing away from each other — that is, each is behind the other — then the order in which they’re rendered does not matter.

A = B.

If two triangles face each other, their order also does not matter.

Again, here, A = B.

If two triangles point the same direction, but triangle A is in front of triangle B, then A should always be rendered before B.

A < B.

If two triangles intersect, then without splitting there is no correct rendering order. We’d like to sort them by each one’s coverage of the other, which can be calculated by intersecting each triangle’s plane with the other triangle.

In this case, A < B because B covers A more than A covers B.
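
For concreteness, here is a minimal sketch of the “all points of A behind B” test that the pseudocode above relies on, assuming a simple Triangle type that carries its three vertices and a face normal:

#include <array>

struct Vec3 {
    float x, y, z;
};

struct Triangle {
    std::array<Vec3, 3> v;
    Vec3 normal;  // face normal; the triangle's plane passes through v[0]
};

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// True if every vertex of A lies on the back side of B's plane
// (within a small epsilon so shared edges don't flip the result).
bool allPointsBehind(const Triangle& a, const Triangle& b) {
    const float epsilon = 1e-5f;
    for (const Vec3& p : a.v) {
        Vec3 toP{p.x - b.v[0].x, p.y - b.v[0].y, p.z - b.v[0].z};
        if (dot(toP, b.normal) > epsilon) {
            return false;  // this vertex is in front of B's plane
        }
    }
    return true;
}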

So far we have a strict weak ordering. However, even ignoring the possibility of intersecting triangles, there’s a problem. Strict weak ordering implies and requires transitivity. (Indeed, that’s how O(n lg n) comparison sorting algorithms work.) However, consider the following three triangles, A, B, and C.

A > B
B > C
C > A

Without splitting, there is no ordering that can render this mesh correctly from all viewpoints.

Rats, this means our triangle comparison does not implement a strict weak ordering. What if we passed it to std::sort or std::stable_sort anyway? Undefined behavior, according to my reading of the C++ 2003 spec. But let’s say we implemented a merge sort without undefined behavior if given a non-strict-weak ordering. What’s the worst ordering that could happen?

Given that comparison sorts rely on transitivity, consider our failure case from before, this time with a few extra “easy” triangles (labeled X and Y) thrown in.

The sort may compare A>B and B>C, assuming “A>C”. Then, worst case, given any triangle X where X>A, “X>C” which could be incorrect. In addition, given any triangle Y where C>Y, “A>Y” which could also be incorrect. The circled Xs and Ys in the diagram above indicate incorrectly-ordered triangles if the sort assumes “A>C”. That is, even the smallest cycle could result in several triangles being sorted incorrectly. A possible sort order would be: YYYCBAXXX, even though, ideally, one of the Ys > A and two of the Xs < C.

A body of research named consensus clustering or rank aggregation explores approximating the optimal ordering given conflicting constraints. The goal is to minimize conflicting orderings. An optimal solution is NP-complete, but I propose the following stochastic solution, where k and j are tunable constants:

for k iterations:
    shuffle triangle list
    merge sort
    pick j random pairs of triangles, count number of pairs where (second < first)
    keep ordering that minimizes incorrect orderings

This algorithm runs in O(k * (n lg n + j)), which is better than quadratic, the stated goal. Also, given sufficient iterations, stochastic algorithms tend to avoid worst-case scenarios, which is good enough here.
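
Here is a concrete sketch of that loop; the merge sort is written out so a non-strict-weak comparison merely degrades the result rather than invoking undefined behavior, and the comparison itself is passed in (e.g. the compare() above).

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Merge sort written out explicitly: if the comparison is not a strict weak
// ordering, the result is merely imperfectly ordered instead of the undefined
// behavior std::sort could give.
template <typename T, typename Compare>
void mergeSort(std::vector<T>& items, Compare less) {
    if (items.size() < 2) return;
    std::vector<T> left(items.begin(), items.begin() + items.size() / 2);
    std::vector<T> right(items.begin() + items.size() / 2, items.end());
    mergeSort(left, less);
    mergeSort(right, less);
    std::size_t i = 0, j = 0, k = 0;
    while (i < left.size() && j < right.size())
        items[k++] = less(right[j], left[i]) ? right[j++] : left[i++];
    while (i < left.size()) items[k++] = left[i++];
    while (j < right.size()) items[k++] = right[j++];
}

// k shuffle-and-sort passes; keep whichever ordering has the fewest sampled
// out-of-order pairs (j random pairs per pass).
template <typename T, typename Compare>
std::vector<T> approximateOrder(std::vector<T> items, Compare less, int k, int j) {
    if (items.size() < 2) return items;
    std::mt19937 rng(12345);
    std::vector<T> best = items;
    std::size_t bestBad = SIZE_MAX;
    std::uniform_int_distribution<std::size_t> pick(0, items.size() - 1);
    for (int pass = 0; pass < k; ++pass) {
        std::shuffle(items.begin(), items.end(), rng);
        mergeSort(items, less);
        std::size_t bad = 0;
        for (int sample = 0; sample < j; ++sample) {
            std::size_t a = pick(rng), b = pick(rng);
            if (a < b && less(items[b], items[a])) ++bad;  // pair out of order
        }
        if (bad < bestBad) { bestBad = bad; best = items; }
    }
    return best;
}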

Now, if we could find cycles in subquadratic time, we could break them by selectively splitting triangles or by removing cycles entirely from the mesh and merging them separately, once the strict weak ordering on non-cycles is determined. Cycle detection is O(E+V) in general, but here every pair of triangles may be related, so E is O(V^2).

Future work: Could we use the fact that some triangles can be rendered in any order relative to other triangles to optimize vertex cache hits?

Thanks to a bunch of people who explored this problem with me.