Digging Into JavaScript Performance
While JavaScript implementations have been improving by leaps and bounds, I predict that they still won't meet the performance of native code within the next couple of years, even when plenty of memory is available and the algorithms are restricted to long, homogeneous loops. (Death-by-1000-cuts situations, where your profile is completely flat and function call overhead dominates, may be permanently relegated to statically compiled languages.)
Thus, I really want to see Native Client succeed, as it neatly jumps to a world where it's possible to have code within 5-10% of the performance of native code, securely deployed on the web. I wrote a slightly inflammatory post about why the web should compete at the same level as native desktop applications, and why Native Client is important for getting us there.
Mike Shaver called me out. "Write a benchmark that's important to you, submit it as a bug, and we'll make it fast." So I took the Cal3D skinning loop and wrote four versions: C++ with SSE intrinsics, C++ with scalar math, JavaScript, and JavaScript with typed arrays. I tested on a MacBook Pro, Core i5, 2.5 GHz, with gcc and Firefox 4.0 beta 8.
First, the code is on github.
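For context, the kernel in question is vertex skinning: each output position is a bone transform (three four-float rows of a 3x4 matrix) applied to a source position, accumulated across weighted influences. Here's a minimal scalar sketch of a single transform in modern JavaScript — illustrative only, with made-up names, not the actual Cal3D code:

```javascript
// Illustrative scalar skinning step (not the actual Cal3D loop).
// Each bone is a 3x4 matrix stored row-major as 12 floats; transforming
// a source position (sx, sy, sz) is three dot products plus a translation.
function skinVertex(boneMatrices, boneIndex, sx, sy, sz) {
  const m = boneIndex * 12;
  const b = boneMatrices;
  return [
    b[m + 0] * sx + b[m + 1] * sy + b[m + 2] * sz + b[m + 3],
    b[m + 4] * sx + b[m + 5] * sy + b[m + 6] * sz + b[m + 7],
    b[m + 8] * sx + b[m + 9] * sy + b[m + 10] * sz + b[m + 11],
  ];
}

// Identity transform for bone 0:
const mats = new Float32Array(12);
mats[0] = 1; mats[5] = 1; mats[10] = 1;
console.log(skinVertex(mats, 0, 1, 2, 3)); // [1, 2, 3]
```

The real loop runs this over thousands of vertices per frame, which is why per-element overhead dominates.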
The numbers:
- C++ w/ SSE intrinsics: 98.3
- C++ w/ scalars: 61.2
- JavaScript: 5.1
- JavaScript w/ typed arrays: 8.4
It's clear we've got a ways to go until JavaScript can match native code, but the Mozilla team is confident they can improve this benchmark. Even late on a Sunday night, Vlad took a look and found some suspiciously inefficient code generation. If JavaScript grows SIMD intrinsics, that will help a lot.
From a coding style perspective, writing high-performance JavaScript is a challenge. In C++, it's easy to express that a BoneTransform contains three four-float vectors, and they're all stored contiguously in memory. In JavaScript, that involves using typed arrays and being very careful with your offsets. I would love to be able to specify memory layout without changing all property access to indices and offsets.
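To make that concrete, here's a sketch of what the typed-array approach forces on you. The names and layout are illustrative, not the benchmark's actual code: a BoneTransform's three four-float rows packed contiguously into one Float32Array, with every access computed as an offset rather than a property lookup:

```javascript
// Illustrative layout: each bone is three four-float rows (12 floats),
// packed back-to-back in a single contiguous Float32Array.
const FLOATS_PER_BONE = 12;

function makeBoneBuffer(boneCount) {
  return new Float32Array(boneCount * FLOATS_PER_BONE);
}

// In C++ this would read as bones[b].rows[r][c]; in JavaScript you
// compute the flat offset by hand.
function boneElement(buffer, b, r, c) {
  return buffer[b * FLOATS_PER_BONE + r * 4 + c];
}

const bones = makeBoneBuffer(2);
bones[1 * FLOATS_PER_BONE + 2 * 4 + 3] = 7.5; // write bone 1, row 2, col 3
console.log(boneElement(bones, 1, 2, 3)); // 7.5
```

Every field access in the hot loop turns into this kind of index arithmetic, which is exactly the readability cost the paragraph above is complaining about.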
Finally, if you want to track Mozilla's investigation into this benchmark, here is the bug. I'm excited to see what they can do.
In a modern game engine that SSE version would also run in parallel across all HW threads to further improve performance and reduce latency.
It would be interesting to add that to the benchmark as well :) Ideally on, say, a quad core, which a lot of gamers have.
@repi: except JavaScript doesn't really have a shared-memory threading model, AFAIK. There are Web Workers, but they're based on message passing between threads (which is good, because synchronization is much easier!), and they can only pass a very limited set of data: strings or JSON objects (yuck!). So my guess is that while the computation would scale across CPUs, getting data in and out of the "jobs" would be a serious hassle/bottleneck.
Looks like the SSE copyright notice is in the wrong cpp file... :)
Thanks for the benchmark! I wasn't even aware of typed arrays until now. I made a version of jsmem for typed arrays (Float64Array) and got only about a 25% performance gain, vs. 64% in your case. But at least the curve looks much less ragged (among the browsers I tested, only FF4 has the "ragged" problem, which probably means the numbers are allocated as objects and there is a lot of GC activity):
http://www.chr-breitkopf.de/comp/jsmem
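For readers curious what such a comparison looks like, here's a minimal sketch of the idea in modern JavaScript — summing doubles from a plain Array vs. a Float64Array. This is illustrative only; jsmem's actual workload may differ:

```javascript
// Hypothetical micro-benchmark sketch (jsmem's real workload may differ):
// the same sequential sum over a plain Array and a Float64Array.
function sumArray(a) {
  let total = 0;
  for (let i = 0; i < a.length; i++) total += a[i];
  return total;
}

const n = 1 << 20;
const plain = new Array(n).fill(1.5);
const typed = new Float64Array(n).fill(1.5);

console.log(sumArray(plain) === sumArray(typed)); // true
// Timing each call (e.g. with Date.now() before/after) shows whether the
// typed version avoids per-element boxing and the GC churn that causes
// the "ragged" curve.
```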
Not even going two times faster when vectorizing floats? Looks like you didn't do it right...