My Political Views

by Chad Austin
Dec 30 14

I lean towards empiricism. We use empirical evidence to judge the effectiveness of drugs and the safety of our food. Yet we enact legislation based on wishful thinking, hope, and what feels right. Most of my standpoints stem from what has generally been shown to work — though I also support social experiments like The Kansas Experiment.

Education is critical for the country’s long-term health. Higher education is important but preschool, per dollar, is at least equally valuable. I support public funding of preschool, especially in lower-income settings. College education is also valuable, but I actually think it’s overpriced and trade schools and apprenticeships ought to come back into vogue. In addition, much of the cost of college education appears to stem from the increase in “staff costs” such as health care, as teaching is a lower-leverage activity than factory production or software development.

Science education in particular is critical. There appears to be a growing antiscience movement seeded from the far right that jeopardizes the USA’s future as a world leader in innovation.

Public statements made by politicians should be publicly fact-checked and advertised as such. False advertising is illegal for product advertisements – why is it not illegal for those running for office? It’s disturbing how easily humans are swayed by information that is not true, even after it’s been shown to be untrue. This is my biggest fear about our modern hyper-connected world.

Solving energy is critical for a sustainable world. We should continue investing in fusion just in case it works. I support the development and maintenance of a nuclear power infrastructure as it’s cleaner than coal. It’s a good idea to continue investment into renewable energy, as it’s quite viable and hydrocarbon externalities are not priced into the market like they are with renewables.

I do think that free markets are effective at optimizing efficiency. However, they must be monitored every so often because corporations love to form monopolies and/or capture their regulators. Effective regulation can prevent monopoly formation and enable private markets to succeed. Consider the Swiss healthcare system and UK forcing British Telecom to open its wires.

If legislation has a very small effect, it should not be enacted.

Government systems should maximize transparency and checks and balances to mitigate the inevitable long-term drift towards corruption.

I believe that a solid and practical understanding of systems theory is one of western society’s biggest blind spots. I might go as far as saying systems thinking is a new literacy. As our world becomes increasingly connected and interdependent, systems theory’s relevance will increase.

Empathy is key. In general, people either have good intentions or believe they have good intentions (is there a difference?). Trying to understand their viewpoints is a good idea. (Though there’s a point where you have to give up. Once, during a debate, my counterpart literally said “I’m sorry, I won’t read that information. It’s against my beliefs.”)

The US Department of Defense is dramatically over-funded. It’s totally not clear that meddling in the affairs of other countries has been beneficial to us, especially relative to the enormous cost. I’d love to see a portion of the DOD’s budget redirected towards the NSF and other basic research organizations. Basic research pays long-term dividends.

I think it’s worth preserving the environment. The environment is a broad concern over time and space, thus should be managed at a higher level than individuals, local communities, or corporations. The cost of environmental regulation is peanuts compared to the long-term benefits. [citation needed ;)]

All people should have equal basic rights. This includes people of any race, gender, socioeconomic status, sexual orientation, age, or ability.

Immigration policy should be relaxed, especially to skilled immigrants.

We should strive to demonstrate, through our actions, character, wisdom, and culture, that we are a great nation. I’m not the most patriotic person, but there is value in striving to be great in competition with others.

More Thoughts on Haskell

by Chad Austin
Dec 29 14

I’d like to follow up on my previous post with some points that are more nuanced and probably only make sense to people who’ve written Haskell before.


Some syntax elements in Haskell were annoying at first, but I adjusted to them quite easily, and I now believe my annoyance was simply unfamiliarity. Separating arguments by spaces and using the $ operator (that is, a $ b c as shorthand for a (b c) comes naturally now. The only thing I still get wrong is operator precedence.


Regarding category theory terminology: when I even mention words like functor or monad, I’ve seen engineers’ eyes glaze over. They instantly think “Oh, that’s too complicated to understand.” I swear, if Functor was named Mappable and Monad was named Chainable or something like that, it would make Haskell seem much less intimidating to beginners. Explaining monads doesn’t require any fancy tutorials or stretch analogies. Similarly, explaining functors doesn’t require using the words “lifting” or “context”. It’s as simple as “Well, you know how you can map across a list in every other language? Well, consider Maybe as a list of zero or one elements. You can map across that too. What about a promise or future? You can map across that too. Functions themselves can be mapped, transforming their return values. Anything that can be mapped is a functor.”

In general, I do think it’s a good idea for programmers to use the same terminology as mathematicians, as they’re generally more rigorous and precise, but… I’ll just quote An Introduction to Category Theory:

In her charming Italian style she asked me (instructed me) to cut out all the jokes. This was quite difficult since some of the official categorical terminology is a joke, but I have done my best.


This is a relatively minor nit: Haskell module import syntax is cluttered. I always feel like import qualified Data.Vector as Vector would be better written in Python style: from Data import Vector. This would have the side benefit of mitigating a common, and in my opinion unfortunate, pattern you see in Haskell code: importing modules by abbreviations. import qualified Data.Vector as V. For some common modules, like Data.ByteString as BS and Data.ByteString.Char8 as BSC, the abbreviations are common enough that everyone knows in context what module has been imported. However, for your own modules, you should import with explicit, non-abbreviated names, so it’s clear in context which module is being referenced.


I’m unsure about laziness by default. Laziness can be really great, and it’s somewhat cheap in Haskell, but there are many situations where, if you’re going to compute anything for a data structure, you might as well compute it all right then, while the caches are still hot. The theoretical benefits of only computing the values necessary have a nontrivial cost: lazy thunked data structures have branching and dereferencing costs that strict languages can avoid.

Streaming and Chunking

I feel like Haskell streaming and chunking libraries, like Conduit, have the same problem as laziness for most reasonably-sized data structures. If your machine has 32 GiB of RAM and 50 GB/s of memory bandwidth, who cares if you allocate 100 MB of data and dump it all on a socket in one go. Chunking only matters much if your peak memory usage times concurrency doesn’t fit. I’ve seen similar performance issues with with Python 2’s xrange() function. range() is frequently faster in context because it avoids the extra iteration overhead.

Partial Functions

Some people have vehemently argued that pattern matching is so superior to conditionals and dereferencing than Java and C++ are fundamentally flawed. That is:

case p of
  Nothing -> bar
  Just v -> foo v

is safer than:

if (ptr) {
} else {

I have some sympathy for this argument. Pattern matching syntax does mean that the dereferenced value is only visible in the scope where it is valid. However, the real problem with languages like C++ and Java is not the lack of pervasive pattern matching. It’s that Java and C++ don’t have a way to distinguish in the type system between a boxed value (aka T*) and a potentially-null boxed value (aka T*).

Pattern matching by itself is nice, but the real benefit comes from its combination with sum types.

What is the Haskell equivalent of NullPointerException? Partial functions!

Let’s imagine I wrote the above code as follows, except assuming p is not Nothing:

foo $ fromJust p

If p is Nothing, this code throws an error. fromJust is a partial function in that, when the input has an unexpected value, it throws an error. Haskell has many partial functions.

In your code, you should strive to write total functions. However, I am of the opinion that Haskell should distinguish, in the type system, between the two types of bottom. There are two ways that a function in Haskell can have no value. It can either throw an exception or it can enter an infinite loop. Partial functions throw exceptions.

In practice, infinite loops are rare (except through accidental recursive references). However, it would be useful to disallow certain classes of pure code from throwing an exceptions. This would make it illegal for pure code to call partial functions. However, certain types of errors could still occur, especially out of memory errors and stack overflows. Thus, the language would have to draw a distinction between synchronous errors such as partial matches and asynchronous errors such as resource exhaustion.

I can understand why the language adopted a simpler model, even though it would be nice to guarantee that some pure functions cannot throw non-resource-exhaustion errors.

Purity is a vague concept. Some useful attributes can be applied to a call graph: “has no observable side effects”. “does not allocate heap memory”. “fits within a defined resource budget”. “does not throw exceptions.” I’m not aware of any mainstream languages that distinguish between these notions of purity.

Thoughts on Haskell

by Chad Austin
Dec 28 14

My colleague Andy recently wrote about what it’s like to use Haskell at IMVU. Since then, I’ve started writing some Haskell, and I wanted to share my thoughts on the language and ecosystem.

I am by no means a language expert. But I’ve gotten over the beginner’s hump and have written a fair amount of real code, so I have a decent sense of what the idioms are.

Haskell is Awesome

Overall, I find Haskell to be great. It’s not that fun to write — the common complaint that Haskell makes you think a lot up front is totally true — but once the code is written it is so easy to safely refactor. This is likely due to type inference, Haskell’s brevity, and the ease of creating new, lightweight types. At the technology strategy level, I value long-term maintainability and evolvability over ease of writing new code, so Haskell has a big advantage here.

With Warp and Wai, Haskell is particularly excellent for writing HTTP services. (Michael Snoyman generally has solid design taste and performance sense.) In just a few lines of code, you can write a scalable, concurrent, low-latency HTTP server. Better yet, you can deploy this HTTP server to your production servers as a static binary: no fancy configuration or package management necessary!

The GHC runtime is both innovative and mature. It has excellent concurrency support. With STM, avoiding deadlocks in concurrent algorithm is easy. Green threads and sparks enable spinning off hundreds of thousands of lightweight jobs, soaking up any available CPU capacity. The parallel I/O manager makes it efficient to concurrently handle multiple requests in a single process.

GHC’s benefits go beyond the language and runtime: with GHC’s foreign function interface, it’s easy to call into C functions from Haskell.

Enough about the runtime, let me discuss the language a bit.

When digging into an unfamiliar codebase, the hardest part is building an understanding of the possible states and data flows. That is, understanding the big picture. Sum types help a lot here. They explicitly denote the possible states. Pattern matching, plus warnings for partial matches, make sure that all states are properly handled, largely preventing any kind of accidental null pointer errors. (Remember: Maybe a is a sum type: Just a | Nothing.)

Haskell draws a sharp line between pure code and effectful code. This makes it much easier to understand how information flows through the code. When you see a pure function, you know it is only calculating a result, and thus only need to read the type signature to understand its effect. Also, separating pure and effectful code prevents a common class of performance regression: accidentally adding expensive I/O to otherwise pure data transformation code. (The compiler also benefits from knowledge of which functions are pure: it can evaluate in arbitrary order, lift common function calls to outer scopes, and apply rewrite rules.)

In addition, through custom Monad types, Haskell makes it easy to restrict effects to specific contexts. For example, you can enforce rules like “No MySQL queries in Redis transactions” or “No arbitrary I/O in HTTP request handlers, only specific restricted actions.” By limiting your code to a restricted interface, it’s even possible to guarantee that your unit tests will be deterministic. This is a real problem in languages like Python, C++, and JavaScript, where a single unmocked call to “get the current time” can make your tests intermittent.

Monads, in general, let you use the same syntax for imperative code, asynchronous code, code with automatic error handling. You don’t need generators or new keywords to have convenient syntax for asynchronous programs. Monads and ‘do sugar’ by themselves are sufficient. See slide 121 and beyond in Scott Wlaschin’s excellent Functional Design Patterns slides.

Type classes are awesome too! Traditional OO languages allow polymorphism through the first argument (this pointer, vtable). But type classes allow polymorphism on single values (see Data.Default) and across multiple arguments, even return values. See Eq.

Type classes can be created independently from their resident types. Types can be specified independently from type class implementations. This allows layering concepts into third-party libraries that weren’t designed with them in mind.

Type classes are especially powerful when you start building application frameworks with them. For example, consider the type of HTTP request handlers:

myRequestHandler :: ToHTTPResponse a => HTTPRequest -> a

That is, a request handler is a function that takes an HTTP request and returns any value that can be converted to an HTTP response. This means request handlers don’t themselves need to build the HTTP response: just return any compatible object and let the framework convert.

Haskell is Not Awesome

Haskell is not all roses. In fact, I’m not sure it can even become the Next Big Language. And for a rather sad, superficial reason: the syntax is a too idiosyncratic compared to every other mainstream language. I’m not talking about lets, operators, or separating functions from arguments by whitespace. Those things can be learned pretty quickly.

My biggest syntax problem with Haskell is that there is too much variety. I’m a strong believer in the Style is Substance thesis, which is that our precious brain cells and decision-making units shouldn’t be spent on stylistic choices — instead, we should all have a fairly consistent style and spend our energy on more important decisions.

Haskell, on the other hand, has WAY too many nerd knobs. You can use “let … in …” syntax, “expr … where …” syntax, do syntax (even in non-monadic contexts). You can write functions with explicit arguments or in point-free style.

In your own code, or even within an organization, you can define a coding style and aggressively follow it. But if Haskell syntax was as consistent and tight as, for example, Java or Python, it would have a better shot at being mainstream.

I have some other syntax nits too: the verbosity of module imports is annoying and some common libraries (Aeson) use too many custom operators.

Some extensions should definitely be promoted to the core language by now. String should basically never be used in production code (more on that later), and ByteString and Text aren’t usable without the OverloadedStrings extension, which allows specifying ByteString and Text values with string literals. OverloadedLists should also be enabled. Really, any literal should be overloaded. It’s very convenient to be able to use existing literal syntax to initialize values of many different types.

“ScopedTypeVariables simplifies real code. TupleSections are commonly-used. There’s little risk in folding both into the core language. I feel like RecordWildCards should also be enabled by default, but records are a mess in Haskell and that part of the language is actively evolving. ExistentialQuantification is also critical in most real codebases.

Now let’s talk about the Prelude. The Prelude probably made sense in the 80s and 90s when Haskell was a simple research language, but parts of it are showing its age. String should never be used in production code. A lazy linked list of 32-bit Unicode code points is… elegant? … but pretty much the least efficient data structure for text. Sadly, a lot of underlying APIs in Haskell use String, just because it’s standard and the original data type for text.

The Num typeclass is a mess. Just because a value can be added does not also mean it can be multiplied or has an absolute value or can be converted from a numeric program literal. Generally, type classes should be small and precise, and Num is none of those. The problems with Num are well-documented.

The Prelude in general should use more type classes. This is a point of contention within the Haskell community. My opinion, however, falls squarely on the “more polymorphic functions!” side of the argument. It really sucks to have to teach people about both map and fmap when they’re the same function. It generally goes like this:

“What is fmap?”

“It’s generalized map. It works on Maybe and Vector and HashMap.”

“Why doesn’t map work on those?”

“Well, some people in the community believe that type classes make learning harder, so the Prelude should stick to less general function definitions.”

I haven’t seen any actual human factors research that says type classes make learning the language harder. And you can’t really learn Haskell without understanding type classes, so a monomorphic Prelude simply delays some education that is necessary anyway.

The C++ standard library is a good counterexample: you can use std::vector without understanding how to write templates. Python too. You can call len on any container without having to know how to implement the __len__ protocol. Haskell could have this same exact feature if people would stop inventing usability reasons not to. See Michael Snoyman’s work on ClassyPrelude.

My coworker, Jonathan Fischoff, makes a similar point on Reddit.

This brings up a higher-level issue with the design of most Haskell APIs: little regard is paid to the number of symbols exposed. Ideal APIs are memorable and predictable. Once you know how to map across any collection, you should know how to map across all collections. Once you know how to calculate the length of a list, you should know how to calculate the length of a vector or map. In Haskell, today, you need separate functions for each of those (e.g. Prelude.length vs. Data.Vector.length vs. Data.HashMap.Strict.size). There is so much to remember that you need about six browser tabs open to get any useful work done.

In summary: type classes are an excellent language feature. But they’re not used enough or well enough in the base Haskell libraries. And that’s all I’ll say on the subject for now. :)

Now I will discuss integrating with libraries.

I think it’s time that certain libraries get promoted into the standard: ByteString, Data.Maybe, Data.Map, Data.Set, Data.Text. It’s not realistic to write a real application without using them, so they should be made ubiquitous.

Finally, I have one more major complaint about the Haskell ecosystem. Fortunately, it’s a solvable one. Cabal sucks. I don’t know enough about it to say specifically why, but literally every time I come back to our build system after time away, something goes horribly wrong. I end up having to manually deleted generated files, or manually uninstall some packages. The error messages are always inscrutable. Sometimes I have to pull in one of the Cabal experts at the company and it’s a coin flip as to whether they’ll be able to figure it out.

I assume Cabal is fixable, but it’s somewhat boggling to me that a community that thinks so much about precision and mathematical laws and reliability can produce a tool so flaky.

Haskell is Worth Using

That was a long list of complaints – longer than I spent on the upsides. :) But I want to be clear: I still think Haskell is pretty great. It’s certainly worth learning. It’s also a great platform for writing HTTP services. Unless someone with GvR-esque taste attacks the syntax, Prelude, and standard libraries with gusto, Haskell likely won’t become as popular as Java or PHP or Python. Nonetheless, Haskell’s on a rapid upward trajectory, and as the community becomes less academic and more practical, many of these problems will get addressed. Even if the community is slow to adapt, as with any language, you can define your own internal conventions and base libraries, including your own Prelude if desired.

Update on Sorting Translucent Triangles

by Chad Austin
Dec 20 14

In my last article on the subject, I searched for an algorithm to sort translucent triangle meshes that ran in less than quadratic time. Sadly, on large organic meshes, the proposed algorithm didn’t perform very well. In particular, when triangles in one continuous patch of geometry sort differently, the individual triangles of the mesh become visible, which is visually worse than not sorting at all.

Thus, I gave up on that approach and tried some others. If the algorithm has to be quadratic, then we have some options:

Topological Sort

First, build a digraph that represents whether a triangle should be rendered before another.
A triangle A should be rendered before triangle B if A is in front of B and B is facing A but A is not facing B.

Once the graph is built, cycles must be broken, ideally using a heuristic that minimizes visual errors. (If a set of triangles form a cycle, there is no ordering that has no errors.) Finally, once all cycles are broken, the graph can be topologically sorted using a depth-first traversal.

This algorithm is O(n^2) in CPU and O(n^2) in memory.

Topologically sorting works well on geometric meshes with a clear triangle ordering. But on organic meshes, there are a great number of cycles, and the heuristic cycle breaking produces visual artifacts where individual triangles become visible.

Coverage Sort

For all pairs of triangles (A, B), if A is in front of B, calculate the form factor of A onto B. Form factor between triangles is fairly hard to calculate, so I tried a simple approximation using area of covering triangle, distance between triangle centers, and incident angles. Then, using a comparison sort, order the triangles such that triangles covered the most are first in the mesh.

This algorithm is O(n^2) in CPU and O(n) in memory.

I am not sure whether my heuristic was that good, but this also produced somewhat weak results on large organic meshes.

Sort by Distance and Angle to Center

This is a cheapo algorithm that works pretty well on convex shapes. Sort all triangles by the dot product of the triangle’s normal vector and the vector from the mesh’s center to the triangle’s center.

It’s pretty easy to come up with cases where this algorithm will sort incorrectly, but for most convex meshes it does well. More importantly, errors occur across entire slabs, which looks better than errors showing individual triangles.

Jon Watte’s “Coverage Peeling” Algorithm

Jon’s original algorithm works by building a graph of which triangles are in front of others, just like the topological sort above. It also accumulates a per-triangle-pair coverage metric, and accumulates them into a per-triangle coverage sum.

Then it peels triangles off, one at a time, by finding the triangle with the lowest coverage value. After peeling a triangle off, it reduces coverage values by the corresponding amount.

This algorithm works well, even in the presence of triangle cycles, but it has a fairly high constant factor. However, it runs out of memory on large meshes where the topological sort doesn’t.

I still plan to see if I can get an implementation to complete on a 60,000 triangle mesh in a few seconds and under a few GB of memory.


Algorithm Time Complexity Memory Constant Factors Errors
Topological Sort O(n^2) O(n^2) Low Per-triangle
Coverage Sort O(n^2) O(n) Low Per-triangle
Distance and Angle Sort O(n lg n) O(n) Low Consistent regions of the mesh
Coverage Peeling O(n^2) O(n^2) High Per-triangle, but fewer than the other algorithms

The Parsing Problem

by Chad Austin
Nov 20 14

Computers are insanely wide these days.  Not only can out-of-order cores issue and retire several instructions per clock cycle, but vector instructions get wider and wider.  SSE has 128-bit registers.  AVX widened that to 256 bits.  AVX2 widens the vector registers further — to 512 bits!  On top of that, modern CPUs have several cores, and some of those CPUs support running a couple hardware threads per core.

Given a parallelizable problem and a bit of work, it’s amazing how much computation you can achieve per clock.

However, there’s a class of problems that don’t benefit much, if at all, from instruction-level parallelism or vectorization.  I call this “the parsing problem”.

Consider a high-performance JSON parser like sajson or rapidjson.  Fundamentally, a JSON parser converts a stream of bytes into a higher-level structure.  The algorithm looks something like:

read a byte
switch on byte value to see whether we're opening an object, closing an object, opening or closing a string, etc.

This kind of byte-by-byte, branch-heavy code is not amenable to vectorization, especially for particularly dense structure like JSON.  It IS possible to accelerate XML parsers with SIMD: see Parabix.

Parabix uses SIMD to rapidly scan characters for the next XML closing tag.  Similar techniques can be applied to parsing JSON, but minified JSON has a higher ratio of structural characters to text than XML.

In my sajson benchmarks, my fancy new Haswell 4771 (3.9 GHz during benchmark run) is only a few times faster than my years-old, in-order Atom server, clocked at 2.1 GHz.

The only point of this article is that some problems lend themselves to taking advantage of wide vector units, instruction-level parallelism, and multithreading.  Others, like JSON parsing, don’t.  :)

The Web is a Frustrating Application Platform

by Chad Austin
Nov 19 14

Preface: In this article, I am not talking about the web as the document-based hypermedia system. I’m talking about the web platform as a potential replacement for desktop and mobile applications.

I’ve spent the last few years trying to bring a rich client application to the web platform. The web has enormous promise. It runs almost everywhere. It’s secure. It’s seamless – customers click a link and land in your application. Deployment doesn’t require any installers or code signing tools. HTML and CSS are fantastic for UIs, flows, transitions, typography, and layout. HTTP has excellent tooling.

At the same time, the web is enormously frustrating. I cut my teeth on Windows and Linux. I love having rich control over the quality of the experience delivered to customers. For comparison, on Windows and Linux:

  • APIs are low-level. Higher-level functionality is built by the community into user-mode libraries. The platform vendor provides capabilities and the community provides convenience. A web analogue would be jQuery on top of the DOM, or Three.js on top of WebGL.
  • APIs are capable. With few exceptions, the entire functionality of the machine is accessible to programs. There are a few security concerns here, but in many cases, it’s possible to provide capability without sacrificing security.
  • For the sake of performance and flexibility, native platform vendors are somewhat willing to leave a bit of undefined behavior on the table. The specific behavior of an allocator, for example, is left undefined between releases.

Before I get into what I think has to improve at a high level, let me describe some of my specific complaints about the web platform:

  • On WebGL, there’s no way to detect available VRAM. The current recommendation appears to be to use a “reasonable amount” of VRAM and back off if you run into problems…
  • Web Workers had the same problem – there was no way to query the number of available cores. I believe an API is coming that finally exposes this, however.
  • Lack of direct access to the browser’s message loop. For some bizarre reason, both Mozilla and Google were highly resistant to implementing setImmediate, resulting in some… interesting… polyfills.
  • STILL no fast typed array memcpy! This is just ridiculous. The best option I’ve seen allocates a new aliasing Uint8Array object so it can call .set(). The proposed .copyWithin() doesn’t allow copying between two TypedArrays, which means you can’t use it for copying XMLHttpRequest data into the asm.js heap.
  • CORS, thanks to preflight OPTIONS requests, turns out to be quite expensive to use in practice. For every cross-origin request with credentials or using custom headers, you pay an extra network round-trip for the preflight OPTIONS request. For high performance, you’ll want to keep your JSON services on the same domain as your site and bypass CORS entirely. If you don’t need credentials, avoid custom headers and stick with query parameters.
  • I can’t find the mailing list thread, but there was a discussion where it came up that ES6 Promises cannot be implemented in user code, and es-discuss folks considered that acceptable.
  • JavaScript stack traces are not standardized and may never be so, as there are apparently security concerns with allowing JavaScript code to inspect the names of functions higher on the stack. In general, handling and reporting application errors on the web is a disaster.
  • No weak references. The web API community is resistant to anything that could possibly expose garbage collection implementation details.
  • No way to specify priority on XMLHttpRequests, for any supported protocols (HTTP, SPDY, HTTP/2). If you read this blog, you already know the sordid details.
  • vector minimum and maximum in SIMD.js will likely sacrifice performance for the sake of cross-platform consistent behavior in the presence of NaNs. Personally, I’d prefer this be left as undefined behavior for maximum performance, but JavaScript convention is that every operation is exactly defined, to the point that all browsers have, in practice, attempted to standardize the order of iteration through hash tables.
  • No efficient way to encode and decode between Strings and ArrayBuffer. See the Twitter thread.

Why is the web like this? Why isn’t the web more like a typical native platform? My hypothesis follows.

Software and standards are a product of time and the pressures acting upon them. Browser vendors have been burned in the past by trying to advance the platform in a backwards-incompatible way. Thus, there is enormous fear of exposing any undefined behavior. Couple that with security concerns (some well-founded, some not), and you have a recipe for APIs that don’t actually expose the full set of functionality required for rich, high-performance applications.

Moreover, there is widespread ambiguity on whether the web should provide high-level or low-level APIs. In mailing list discussions, it seems generally expected that web programmers will call JavaScript APIs directly. Yet, to expose maximum functionality and performance, the APIs should be low-level. This tension was a large source of the discussion around my request to get direct access to protocol priorities in XMLHttpRequest.

Couple all of this with some strong personalities, and you end up with something that’s actually rather hostile to systems programmers like myself.

I feel the web standards community is disempowering to developers. Paraphrasing: “You don’t want text encodings.” “Let’s not even bother supporting HTTP/1.” “Weak references are too dangerous to put in peoples’ hands.” [private communication]

Part of me wishes browser vendors would stop playing nice with each other and instead court developers as aggressively as possible, even if that means new incompatible APIs. For example, perhaps asm.js would have never existed without Native Client. :)

I am not alone in this point of view! I recently discovered the Extensible Web Manifesto which aims to correct many of my gripes by focusing on low-level APIs that advance the capabilities of the web platform.

I still think the web has promise. But now I see that the problem isn’t that browsers have to catch up with native platform capabilities. Instead, it’s politics. The web standards authors need to focus on exposing raw functionality, even if it allows programmers to shoot themselves in the foot.

It’s up to you to decide whether the web is the right platform for your app. But I never thought I would say “I miss Microsoft’s low-level APIs.” :)

Update on HTTP/2 Priority

by Chad Austin
Nov 16 14

After some great offline discussions with various folks involved in the specification of HTTP/2 priority, I wrote an email to the list expressing my concern about the design as written.

In response, Martin Thomson agreed with my concerns and proposed a small tweak to the spec that allows for the allocation of dummy nodes, which enables rough approximation of a priority hierarchy.

The tweak makes it possible to send a PRIORITY frame to a stream in any state. Given two streams {A, B} at a higher priority than streams {C, D}, this allows construction of a tree as follows:

A{w=256} -> 0
B{w=256} -> 0
Dummy{w=1} -> 0
C{w=256} -> Dummy
D{w=256} -> Dummy

We can probably assume that a well-behaved server would not spend any resources on the Dummy subtree — since it has weight=1 — until A and B are completed, which places C and D at a lower priority.

This protocol tweak appears to be headed for acceptance. Here are the relevant meeting minutes.

Offline, several people expressed general nervousness about HTTP/2 priority, as there is no production experience with a tree-based priority model. I was told that, should HTTP/2 priority not work out in practice, it will be deprecated and some kind of alternate protocol or extension will be proposed instead.

Ultimately, I’m thrilled that, with Martin’s tweak, it’s plausible that HTTP/2 priority can solve most use cases. I do wish I’d gotten involved in this topic nine months ago, as I would have had a realistic shot at getting Microsoft’s proposal adopted. At this point, the standard will ship with only minor changes.

HTTP/2 Request Priorities – A Summary

by Chad Austin
Oct 30 14

Previously, I’d written about a particular use case we’d run into where we needed to make thousands of HTTP requests to download 3D assets to populate a WebGL scene.

Many of these assets (skeletons, meshes, lower-mip-level textures, poses) are higher priority than others (higher-mip-level textures, animations). In addition, objects closer to the camera should be prioritized higher than objects farther away.

The following table defines the priority of each asset download:

Asset Type Base Priority
Object Description 100
Skeleton 90
Mesh 90
Low-res texture 90
Pose 90
Animation 50
High-res texture 10

Plus a modifier based on the requesting object:

Object Priority modifier
Room +9
My Avatar +8
Other Avatars +[0,7] — distance to camera

Asset type determines the base priority. Within an asset type, the referencing 3D object adds a priority modifier. The room gets highest priority, your avatar next, and all remaining avatars are sorted by camera distance.

That is, object descriptions trump everything else. (They’re small and contain links to other assets.) Skeletons, meshes, low-res textures, and poses must be loaded before the object to be rendered at all, so they claim the next-highest-level priority. Finally, animations and high-res textures can come in later.

Bandwidth-Delay Product

Let me take a brief digression to explain why priority is important.

The bandwidth-delay product measures the amount of data that must be in flight to fully utilize a pipe. In 2012, the average US Internet connection was 6.7 Mbps. Average round-trip to Google in US was, say, 55 ms. This places the bandwidth-delay product at 6.7 Mbps * 55 ms = 46 KB. If you cannot keep 46 KB in flight at a time, you are wasting bandwidth. Of course, not everyone has Google’s data centers, and latency is jittery, and you likely want worldwide efficiency, so aim higher. Targeting a 150 KB bandwidth delay product, or even higher if you’re worldwide, is not a bad idea.

Now, because the pipe should have a lot of data in flight at any one moment, a large number of requests should be sent to the server right away. Since the client doesn’t know how big responses are, if the server returned low-priority responses first, the page would load more slowly than if low-priority requests weren’t sent until all high-priority responses were downloaded. However, having the client wait to issue low-priority requests does not make good use of available bandwidth. The best option is to give the server enough information to prioritize responses, letting it send down high-priority responses first, making everything load faster. Experience with SPDY shows that it’s critical that priority work well.

An ideal prioritization solution would satisfy the following objectives:

  • make full use of network bandwidth
  • retrieve responses for high-priority requests before low-priority requests
  • immediately — on the current or next frame — retrieve any responses already cached by the browser

On the web, today, there is no way to accomplish all of these goals.

Again, for as much as people like to point out that priority is optional or advisory [1] [2], it is critically important. Multiplexed connections without priority are slower than HTTP/1! Browsers already prioritize HTTP/1 requests, so if they didn’t prioritize SPDY or HTTP/2 requests, low-priority resources would contend for bandwidth with high-priority resources, causing initial load to be slower.

After whining about all of this to a friend, he casually said “Why don’t you take it up with the standards bodies?” (Damn you, jfb!) I’ve not had good luck with web standards bodies in the past — to the point that I have the impression some key people don’t care a whole lot about real-world use cases — but I thought I’d give it a shot.

It turns out that exposing a simple priority field on XMLHttpRequest (W3C WebApps WG), while easy, obvious, and an immediate solution to many use cases, does not satisfy everyone. In fact, it’s been proposed before. The WebApps WG instead suggested that prioritization should be solved at the “Fetch” layer, so regular DOM fetches like <img> and >script< tags can benefit too. This increases the scope of the proposal, but fair enough…

So I went to WHATWG and discovered a proposal that, on first blush, looked absurdly complicated. It’s based on HTTP/2’s dependency tree priority model. I spent an evening or two wrapping my head around the dependency tree model and ran into some problems.

I now see that, for all of the varied use cases for prioritization on the web, there is no apparent solution that satisfies them all. (But one is very close, as you’ll see.)

Use Cases

Let’s work through a few illustrative use cases for prioritizing requests on a multiplexed connection.

Basic Page Load

The ideal scenario for a page load is that enough of the HTML arrives to link to the appropriate CSS and JS resources, after which the JS and CSS are downloaded (with resources completing serially so that response processing on the CPU can happen in parallel with transfer), followed by the remainder of the HTML.

This requires dynamic reprioritization of the HTML document mid-transfer.

Any referenced images are not critical to the initial load, thus have lower priority. They are also transferred in parallel rather than serially under the assumption that they can be progressively-rendered. Two half-complete JavaScript files is not useful at all, but two half-complete JPEGs certainly could be useful.

This use case illustrates sequential transfer (JS and CSS) and parallel transfer (images). Sequential transfer is ideal when a resource doesn’t provide value until it’s completely downloaded. Parallel transfer is ideal when resources support progressive display or when trying to be fair to competing requests.

Multiple Tabs

Consider a tabbed browser. Each tab may issue dozens of its own requests. The browser ought to prioritize requests from the focused tab. Background tabs receive the remainder of the network resources.

Switching tabs ought to efficiently reprioritize requests from the old and new tabs appropriately.


Proxies take many requests from multiple clients and multiplex them on fewer backend connections. Proxies ought to be fair: if client A asks for low-priority requests and client B asks for high-priority requests, the proxy’s resources should be split evenly between them.

Streaming Video

A video stream (e.g. DASH or M3U) consists of a sequence of short video segments. The current segment has the highest priority, followed by the next, and so on.

Reprioritization occurs on seek, though it’s probably fine to disregard any outstanding requests upon seek.

Loading a 3D Scene

This is the use case described at the top. Unless the connection is otherwise idle, not a single byte of low-priority responses should come before high-priority responses. Requests should be periodically reprioritized as the objects move closer to the camera.

Large Page of Images

Image a web page that consists of 10,000 img tags. Ideally, they would download in priority order where priority is distance from current viewport. On scroll, requests would be reprioritized.

Server Push

I personally have no interest in server push and many smart people are hashing out how priority interacts with server-pushed data in various forums, so I will refrain from further discussion here.

Prioritization Primitives

We can take the uses cases above and break them down into the components that a prioritization algorithm should support.

  1. Sequential in specified order (A before B before C …)
  2. Sequential in no particular order (A, B, C, but all of one before the next)
  3. Parallel (A, B, C, interleaved on the connection)
  4. Fairness (resources allocated evenly between {A, B, C} and {X, Y, Z}.)
  5. Reprioritization
  6. Parallel group dependencies (all of {A, B, C}, in parallel, before {D, E, F}, also in parallel)
  7. And because non-prioritized mux is worse than HTTP/1, advertised knowledge of prioritization support helps clients decide whether to blast all requests all at once or be more conservative :)

The Proposed Priority Models


First, let’s talk about the status quo: legacy HTTP served by N concurrent TCP connections.

Firefox maintains a priority queue on 32-bit signed integers fed from the nsISupportsPriority interface.

Chromium supports a handful of priorities but I can’t find the code that implements the priority queue in there. All evidence says it exists though. :)

Most browsers also implement a first-paint optimization, where they distinguish between resources that are required for the first paint, like HTML, JS, and CSS, and download those first. Anything else: images or XMLHttpRequests is blocked until after the first paint. This is slightly inefficient: it leaves the connection idle during the first paint.

An HTTP/1 priority queue served by N TCP connections is inefficient for a couple reasons:

  1. round-trips and limited concurrency make it hard to keep the pipe full (especially on high-BDP connections)
  2. cannot reprioritize an in-flight request
  3. TCP flow control works better if you have one saturated connection rather than 8 unsaturated connections


The version of SPDY currently in widespread use, SPDY/3, associates a 3-bit priority field with every stream. Streams cannot be reprioritized. “The sender and recipient SHOULD use best-effort to process streams in the order of highest priority to lowest priority.”

Original SPDY/4 Proposal

Two years ago, Google proposed switching SPDY/4’s dependency model to a system based on stream dependencies and weights. The idea is that, rather than using numeric priorities, streams form a DAG which represents a partial ordering relation.

At each depth “level” of the DAG, weights are used to distribute available resources across streams.

In this model, streams could have multiple parents. You could say, for example, C depends on A and B, placing C at a lower priority than both A and B.

This proposal also introduced reprioritization.

SPDY/4 advertises on the connection whether prioritization has an effect.

One of the gnarlier bits of the stream dependency model is that it introduces some number of race conditions. After a stream is closed, how long is its information, for the purposes of future reprioritization packets, retained?

Microsoft’s HTTP/2 Proposal

In a simplified form, the SPDY/4 dependencies and weights proposal made it into the HTTP/2 draft. However, prioritization was still a somewhat controversial topic. Osama Mazahir, on behalf of Microsoft, proposed an alternate priority design that appeared to garner some support. See the replies in that thread.

The basic idea is that there are groups of streams. Each group is assigned a weight, and within each group, streams are assigned priority.

Osama does a better job describing the tradeoffs, but the main argument is that it’s simpler and supports more use cases more naturally than dependencies and weights.

The large open questions seem to be: “So what priority values do people choose? What if you want to prioritize resource B between A and C, but A’s and C’s priorities differ by only 1?” If you use numeric priorities, you have to select arbitrary values for people to

After this thread, I never saw mention of it again. What happened?

HTTP/2 Draft 15

HTTP/2 Draft 15 goes into great detail about the stream dependencies and weights priority model.

Every stream is given one 31-bit parent and an 8-bit weight in [1, 256]. The server SHOULD process streams from the root of the tree down, dividing available resources among streams by their relative weights. If a parent stream closes, it is removed from the tree and its weight is divided among its children.

The default stream parent is zero – the root.

One repeated bit of confusion I’ve seen on the mailing lists is the difference between weights and priorities. Priorities are used to specify relative ordering: A before B. Weights are used to divide resources among many competing requests. If I wanted {A0,A1,A2,A3}[weight=256] before {B0,B1,B2,B3}[weights=128] before {C0,C1,C2,C3}[weights=1], is the expectation that the gateway proxy would open 12 simultaneous backend connections, and divide the connection with 2/3 of the frames solving for A, 1/3 solving for B, and a trickle for C? That would be silly: we want all As before any Bs. So then is the expectation that proxies would “tend to choose” streams of higher weight and run them to completion? If so, that violates at least the spirit of the algorithm described in HTTP/2.

In terms of server decision-making and client bandwidth utilization, weighting make a very poor substitute for true prioritization, so it’s important to treat them separately. Priorities specify ordering, weights specify resource allocation.

However, there doesn’t seem to be much practical implementation experience with the HTTP 2 priority model. Mozilla’s current implementation ignores dependencies entirely, which would have unexpected results against a compliant connection, as low-priority resources would consume bandwidth while the page is waiting for high-priority resources. (To be fair, this implementation is actively in progress.)

Which Priority Protocol is Best?

Let’s enumerate the prioritization primitives again. The goal is to list whether a particular primitive is expressible in a given protocol. We won’t even consider HTTP/1 because it doesn’t solve for efficient bandwidth usage on a high-latency connection. The three priority protocols under consideration are: SPDY/3, SPDY/4 as of 2012, Microsoft, and HTTP/2.

SPDY/3 is easy to eliminate from the running too: 3 bits of priority is way too limiting in many use cases, including, for example, in the use case of streaming video chunks. After 8 chunks, you would be out of priority levels.

Use Case SPDY/4 2012 Microsoft HTTP/2
Sequential in specified order (A before B before C …) 👍 👍 👍
Sequential in no particular order^
Parallel (A, B, C in parallel) 👍 👍 👍
Fairness (resources allocated evenly between {A, B, C} and {X, Y, Z}.) 👍 👍 👍
Reprioritization 👍 👍 👍
Parallel group dependencies ({A, B, C} in parallel before {D, E, F}, also in parallel) 👍 👍
Advertisement of prioritization support 👍

^ None of the protocols allow the requestor to specify that responses should be sent in their entirety before the next response is sent without also defining a specific sequence. I suppose the assumption is the server can make this determination by content type? I wonder whether a bit to indicate “I don’t care about the order, but once you start transmitting a resource, please finish before starting the next.” would be useful in practice.

Many of the desired use cases are supported in HTTP/2, but Microsoft’s proposal appears to satisfy strictly more. I wonder what happened to it in committee? I will reach out to its authors and supporters and see.

The biggest issues with HTTP/2 are that:

  • There is no way to express “download {A,B,C} in parallel then download {D,E,F} in parallel”
  • If priority is truly optional, and there is no mechanism by which the server advertises support, then clients will need to play it safe, reducing the possible upside from a multiplexed connection.

Can HTTP/2 be fixed?

Can we modify HTTP/2 to add support for the {A,B,C} in parallel followed by {D,E,F} use case?

The idea of dummy placeholder streams has been bandied about. Let’s see what that might look like:

  • 0 is the root.
  • P1 is the high-priority dummy stream.
  • P2 is the low-priority dummy stream.

If we defined a tree as such:

P1 -> 0
{A, B, C, P2} -> P1
{D, E, F} -> P2

and changed the HTTP/2 specification such that dummy nodes never closed and streams closer to the root were serviced first, then the “Parallel groups” use case is satisfied. However, watch what happens to fairness in the proxy / multiple tabs use case.

Consider two inbound connections to a proxy that are being multiplexed on one outbound connection.

Inbound connection A has three priority levels, thus three dummy streams. Inbound connection B has two priority levels, thus two dummy streams.

0 <- A_P1 <- A_P2 <- A_P3
0 <- B_P1 <- B_P2

Now let’s assume that inbound connection A has two requests: A_S1 at priority A_P1 and A_S2 at priority A_P3. Inbound connection B also has two requests: B_S1 at priority B_P1 and B_S2 at B_P2.

A_P1 <- A_S1
A_P3 <- A_S2
B_P1 <- B_S1
B_P2 <- B_S2

Note that A’s priority levels go deeper than B’s. Per the proposed spec modification above, A_S1 and B_S1 would have equal priority and be serviced fairly. However, if two streams are at unequal depth in the tree, the one closer to the root would win. B_S2 would unfairly starve A_S2. So it seems that fairness and group prioritization are at odds in a collapsing, tree-based dependency model.

What now?

It’s surprising to me how hard it is to get priorities correct over multiplexed streams. Microsoft’s proposal seems to be the most useful, but I have no insight into what caused it to fail in committee.

I don’t have any ideas for how to save a tree-based model. We need either weighted stream groups or we need to convert the tree into a DAG — that is, if nodes had multiple parents, the dependency model would work swimmingly.

Getting priority right is critical — HTTP/2 will be the most popular protocol on the Internet, and the upside potential in both page-load efficiency and performance of new web applications is huge.

Pre-emptive Rebuttals

Why not define your own priority extension?

The big advantage to using HTTP/2 is that it’s a standard. CDNs, caching proxies, gateways, and even load balancers will implement it. A custom protocol would be prohibitive in practice.

Just get over it and use weights as priorities.

I hope I demonstrated earlier why weights don’t solve the prioritization problem. “But it’s close enough!” The speed of light isn’t getting any faster. Page load optimization is getting increasingly expensive. Projects like SPDY, QUIC, and HTTP/2 are ambitious, protocol-level efforts for arguably marginal wins. Being “close enough” is no longer sufficient. Defining a protocol that can specify the ideal transfer schedule has a huge return on investment, assuming the protocol is adopted and implemented well.

The Gamepad API

by Chad Austin
Oct 12 14

The Gamepad API is a general API supporting a large number of possible input devices. However, it’s named after the most common use case: gamepad controllers. It could definitely support IR remote controls, switches, audio mixers, and so on… Maybe it should be named the “Input Device API”. :) It’s not as general as DirectInput or USB HID though. No access to positional information (like mouse deltas, finger coordinates on a trackpad, or 3D tracking), no force feedback, and no access to device accelerometers.

Overall, the Gamepad API is narrow in scope. With a little more specification effort and implementation complexity (copy the good bits of DirectInput or Raw Input?), the API could support a huge number of use cases.

That said, within its narrow scope, the Gamepad API is well-designed. Devices can have an arbitrary number of either buttons or axes.

I’d make a few changes, however.


Remove Fingerprinting Mitigation

Gamepads MUST only appear in the list if they are currently connected to the user agent, and at least one device has been interacted with by the user. If no devices have been interacted with, devices MUST NOT appear in the list to avoid a malicious page from fingerprinting the user.

That’s such a lost cause it’s not even funny. There are dozens of ways to retrieve identifying information from a user (and companies that license implementations of such). As long as there are machine learning algoritms and any differences at all between different computers and browsers, users will be fingerprinted. My favorite example is when researchers demonstrated a fingerprinting technique by rendering text into a canvas and analyzing the anti-aliasing and font rendering. Attempting to mitigate fingerprinting by penalizing the user experience seems like the wrong tradeoff here, except perhaps on an opt-in basis.

Allow Standard-Mapped Devices to have Extra Buttons or Axes

Two sentences in the spec imply that an input device that is recognized to implement a standard mapping will not expose more functionality than is defined by the mapping object.

When the user agent recognizes the attached device, it is RECOMMENDED that it be remapped to a canonical ordering when possible.


The standard gamepad has 4 axes, and up to 17 buttons.

Simply changing the wording to “has at least 4 axes, and at least 17 buttons” would allow games that default to standard mapping inputs to work with devices that support additional capabilities, like racing wheels with standard-mapped controls in the middle.

Add support for multiple standard mappings per device.

Input controller idioms come and go. Some become entrenched in a generation’s gamepads, and some fade away.

Assigning a single canonical mapping per device limits the discovery of useful structure. For example, an SNES controller, Wiimote, or Logitech Driving Force wheel wouldn’t satisfy the “Standard Gamepad” mapping, but all of them have a directional pad. If there was a “Directional Pad” mapping, and devices implemented multiple mappings, then any game that relied on a Directional Pad would work out of the box.

Add an event-based input mechanism.

The Gamepad API’s sole input discovery mechanism relies on JavaScript polling the current gamepad state at high frequency to detect input events from the gamepad. For interactive applications like games, this is usually fine. However, polling at 60 Hz in JavaScript is excessive if you just need to know when a button was pressed. We don’t poll for the mouse or keyboard events – why are gamepads different? If the argument is convenience, libraries can always offer that.

Moreover, underneath it all, some high-priority operating system thread is polling the device, and translating the current device state into an event stream. This is why, even though XInput is a polling-based API, if your game drops a few frames, button presses aren’t lost.

For certain classes of problems, like mental chronometry, you need to know when the button press occurred so you can measure elapsed time within a millisecond or so. If JavaScript is polling button state, it doesn’t know when the input event originated – perhaps 10-50 ms of latency has elapsed by the time it sees the button change – but the underlying high-priority polling thread knows. (Or at least has a better sense.)

Let’s say I’m playing a game running at 10 Hz and I press a button to open a bit of UI. Display of the UI hitches (maybe WebGL textures are being uploaded), pausing the game’s polling loop for one second or so, blocking requestAnimationFrame from polling the device. If the Gamepad API isn’t polling the device at a higher frequency under the hood, any buttons pressed and released within that one second period would not register. So we have to assume that, to avoid missed button presses, they have to be queued. But, since only ONE change can be measured each frame, each press has to be queued for two frames (button DOWN, button UP).

Thus, even though the presses occurred at T, T+100ms, and T+200ms, JavaScript won’t see them until T+1000ms, T+1200ms, and T+1400ms. In games that rely on high-precision gestures, this is the difference between it recognizing gestures even in the presence of frame drops, and it missing gestures entirely.

If you think dropped events aren’t a problem in “real games”, try playing SimCity 2013 on an average Mac sometime… The low frame rate would be completely tolerable except for the dropped events.

In contrast, an API where JavaScript would ask “Give me all the input events since I last asked” would associate a timestamp with each event, allowing for accurate combination recognition. If the Gamepad API is backed by Windows’s Raw Input API, this data can be retrieved with GetMessageTime().

The Gamepad API is definitely a step in the right direction, but it feels like it ought to be a little lower-level and more general to avoid being yet another Almost Good web API.

Optimizing WebGL Shaders by Reading D3D Shader Assembly

by Chad Austin
Sep 26 14

This post is a mirror of the IMVU Engineering Blog.

We are optimizing WebGL shaders for the Intel GMA 950 chipset, which is basically the slowest WebGL-capable device we care about. Unfortunately, it’s a fairly common chipset too. On the plus side, if we run well on the GMA 950, we should basically run well anywhere. :)

When you’re writing GLSL in WebGL on Windows, your code is three layers of abstraction away from what actually runs on the GPU. First, ANGLE translates your GLSL into HLSL. Then, D3DX compiles the HLSL into optimized shader assembly bytecode. That shader bytecode is given to the driver, where it’s translated into hardware instructions for execution on the silicon.

Ideally, when writing GLSL, we’d like to see at least the resulting D3D shader assembly.

With a great deal of effort and luck, I have finally succeeded at extracting Direct3D shader instructions from WebGL. I installed NVIDIA Nsight and Visual Studio 2013 in Boot Camp on my Mac. To use Nsight, you need to create a dummy Visual Studio project. Without a project, it won’t launch at all. Once you have your dummy project open, configure Nsight (via Nsight User Properties under Project Settings) to launch Firefox.exe instead. Firefox is easier to debug than Chrome because it runs in a single process.

If you’re lucky — and I’m not sure why it’s so unreliable — the Nsight UI will show up inside Firefox. If it doesn’t show up, try launching it again. Move the window around, open various menus. Eventually you should have the ability to capture a frame. If you try to capture a frame before the Nsight UI shows up in Firefox, Firefox will hang.

Once you capture a frame, find an interesting draw call, make sure the geometry is from your WebGL application, and then begin looking at the shader. (Note: again, this tool is unreliable. Sometimes you get to look at the HLSL that ANGLE produced, which you can compile and disassemble with fxc.exe, and sometimes you get the raw shader assembly.)

Anyway, check this out. We’re going to focus on the array lookup in the following bone skinning GLSL:

attribute vec4 vBoneIndices;
uniform vec4 uBones[3 * 68];

ivec3 i = ivec3( * 3;
vec4 row0 = uBones[i.x];

ANGLE translates the array lookup into HLSL. Note the added bounds check for security. (Why? D3D claims it already bounds-checks array accesses.)

int3 _i = (ivec3(_vBoneIndices) * 3);
float4 _row0 = _uBones[int(clamp(float(_i.x), 0.0, 203.0))]

This generates the following shader instructions:

# NOTE: v0 is vBoneIndices
# The actual load isn't shown here.  This is just index calculation.

def c0, 2.00787401, -1, 3, 0
def c218, 203, 2, 0.5, 0

# r1 = v0, truncated towards zero
slt, v0, -v0
frc, v0
add, -r2, v0
slt, -r2, r2
mad, r1, r2, r3

mul, r1, c0.z # times three

# clamp
max, r2, c0.w
min, r2, c218.x

# get ready to load, using a0.x as index into uBones
mova a0.x, r2.y

That blob of instructions that implements truncation towards zero? Let’s decode that. = (v0 < 0) ? 1 : 0 = v0 - floor(v0) = v0 - r2 = (-r2 < r2) ? 1 : 0 = r1 * r2 + r3

Simplified further: = (v0 < 0) ? 1 : 0 = (floor(v0) < v0) ? 1 : 0 = (r1 * r2) + floor(v0)

In short, r1 = floor(v0), UNLESS v0 < 0 and floor(v0) < v0, in which case r1 = floor(v0) + 1.

That’s a lot of instructions just to calculate an array index. After the index is calculated, it’s multiplied by three, clamped to the array boundaries (securitah!), and loaded into the address register.

Can we convince ANGLE and HLSL that the array index will never be negative? (It should know this, since it’s already clamping later on, but whatever.) Perhaps avoid a bunch of generated code? Let’s tweak the GLSL a bit.

ivec3 i = ivec3(max(vec3(0), * 3;
vec4 row0 = uBones[i.x];

Now the instruction stream is substantially reduced!

def c0, 2.00787401, -1, 0, 3
def c218, 203, 1, 2, 0.5

# clamp v0 positive
max r1, c0.z, v0.xyzx

# r1 = floor(r1)
frc r2, r1.wyzw
add r1, r1, -r2

mul r1, r1, c0.w # times three

# bound-check against array
min, r1.wyzw, c218.x
mova a0.x, r2.y

By clamping the bone indices against zero before converting to integer, the shader optimizer eliminates several instructions.

Can we get rid of the two floor instructions? We know that the mova instruction rounds to the nearest integer when converting a float to an index. Given that knowledge, I tried to eliminate the floor by making my GLSL match mova semantics, but the HLSL compiler didn’t seem smart enough to elide the two floor instructions. If you can figure this out, please let me know!

Either way, I wanted to demonstrate that, by reading the generated Direct3D shader code from your WebGL shaders, you may find small changes that result in big wins.