Visualizing Python Import Dependencies

In a large Python program such as the IMVU client, startup time is dominated by Python module imports. Take these warm timings:

$ time python -c 'None'

real    0m0.096s
user    0m0.077s
sys     0m0.061s

$ time python -c 'import urllib2'

real    0m0.314s
user    0m0.155s
sys     0m0.186s

That’s over 200ms of import time for one basic standard-library module, on top of interpreter startup. Importing the entire IMVU client takes 1.5s warm and 20s cold on a typical user’s machine.

Loading

The IMVU client’s loading progress bar imports modules bottom-up: modules with no dependencies are imported first, each module is imported only after the modules it depends on, and the root module is imported last.

Implementing a bottom-up import sequence requires generating a graph of dependencies between modules:

import __main__
import sys
import types

def get_dependencies(module_name):
    """\
    Takes a module name as input (e.g. 'xml.dom') and returns a set of
    (lhs, rhs) tuples where lhs and rhs are module names and lhs
    imports rhs.
    """
    
    # module_from_key is a dict from a module key, an arbitrary
    # object, to a module object.  While importing, we discover
    # dependencies before we have access to the actual module objects.
    
    # import_dependencies is a list of (lhs, rhs) tuples where lhs and
    # rhs are module keys, and module_from_key[lhs] imported
    # module_from_key[rhs].

    root_key = object()
    module_from_key = {root_key: __main__}
    import_dependencies = []
    stack = [root_key]

    def import_in_stack(key, name, globals, locals, fromlist, level):
        stack.append(key)
        try:
            return original_import(name, globals, locals, fromlist, level)
        finally:
            stack.pop()

    import __builtin__
    original_import = __builtin__.__import__

    def my_import(name, globals=globals(), locals=locals(), fromlist=[], level=-1):
        # fromlist is a whore.  Most of the complexity in this
        # function stems from fromlist's semantics.  See
        # http://docs.python.org/library/functions.html#__import__
        
        # If a module imports 'xml.dom', then the module depends on
        # both 'xml' and 'xml.dom' modules.
        dotted = name.split('.')
        for i in range(1, len(dotted)):
            my_import('.'.join(dotted[0:i]), globals, locals, [], level)

        module_key = object()
        parent_key = stack[-1]

        def add_dependency_from_parent(key, m):
            module_from_key[key] = m
            import_dependencies.append((parent_key, key))

        submodule = import_in_stack(module_key, name, globals, locals, ['__name__'], level)
        add_dependency_from_parent(module_key, submodule)

        for f in (fromlist or []):
            from_key = object()
            module = import_in_stack(from_key, name, globals, locals, [f], level)
            if f == '*':
                continue
            submodule = getattr(module, f)
            if isinstance(submodule, types.ModuleType):
                add_dependency_from_parent(from_key, submodule)

        return original_import(name, globals, locals, fromlist, level)

    # Import module_name with import hook enabled.
    original_modules = sys.modules.copy()
    __builtin__.__import__ = my_import
    try:
        my_import(module_name)
    finally:
        __builtin__.__import__ = original_import
        sys.modules.clear()
        sys.modules.update(original_modules)

    assert stack == [root_key], stack

    return sorted(set(
        (module_from_key[lhs].__name__, module_from_key[rhs].__name__)
        for lhs, rhs in import_dependencies
    ))

(You can view all of the code at SourceForge).

First, we install an __import__ hook that discovers import dependencies between modules, and convert those relationships into a directed graph:

[Graph: xml.dom.minidom import dependencies]

Then, we merge cycles: each cycle (strongly connected component) collapses into a single node. If module A imports B, B imports C, and C imports A, then it doesn’t matter which you import first; importing A is equivalent to importing B or C. After this step, we have a DAG:

[Graph: xml.dom.minidom import dependencies with cycles merged (DAG)]

Finally, we traverse the DAG in postorder to determine a good import sequence, along with a cost per import (approximated as the number of modules in the merged cycle):

1 xml
3 xml.dom
1 copy_reg
1 types
1 copy
1 xml.dom.NodeFilter
1 xml.dom.xmlbuilder
1 xml.dom.minidom
1 __main__

Now let’s look at some less-trivial examples. urllib2:

[Graph: urllib2 import dependencies]

SimpleXMLRPCServer:

[Graph: SimpleXMLRPCServer import dependencies]

imvu.task:

[Graph: imvu.task import dependencies]

Final notes: Other people have solved this problem with bytecode scanning, but we wanted to know the modules actually imported so the progress bar would be accurate. A simpler __import__ hook could have calculated the correct import sequence, but I find a visual representation of module dependencies has additional benefits.

Flushing the Windows Disk Cache

Occasionally, I want to test the performance of a program after a cold boot, or maybe after the computer has been idle for hours and the program has been paged out. For example, the IMVU client starts relatively quickly when the disk cache is warm, but at system boot, it can take quite a while for the login dialog to even appear. Iterating in these situations is a pain in the butt because you have to reboot or leave your computer idle for hours.

I’m sure there exists a program which will flush the disk caches and force programs out of memory and into the page file, but I can’t find it. So I wrote one.

First, a caveat: programs these days rarely handle out-of-memory situations, so running flushmem.exe might cause open applications to explode like popcorn. Buyer beware, etc.

After running flushmem.exe, you should find that your computer becomes painfully slow as applications are paged back into memory and the disk cache is refilled. Perfect. Now I can realistically simulate the experiences of our users.

You can download the program here or on the FlushMem page.

Implementation details: in Windows, each process has a 2 GB user mode address space limit by default. If physical memory + page file size is greater than 2 GB, flushmem spawns multiple processes. Each process allocates memory in 64 KiB chunks until it can’t anymore, and then writes to each page, forcing older pages out to the page file.
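
Here is a minimal, single-process sketch of that allocation loop (not the actual flushmem source; the real tool also spawns multiple processes to cover more than 2 GB, which this sketch omits):

#include <windows.h>
#include <vector>

int main() {
    const size_t CHUNK = 64 * 1024;   // allocate address space in 64 KiB chunks
    const size_t PAGE  = 4096;
    std::vector<char*> chunks;
    chunks.reserve(32768);            // enough bookkeeping for 2 GB of chunks

    // Grab and commit memory until the address space (or commit limit) runs out.
    for (;;) {
        void* p = VirtualAlloc(0, CHUNK, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (!p) break;
        chunks.push_back(static_cast<char*>(p));
    }

    // Touch every page, forcing older pages (other programs, the disk cache)
    // out to the page file.
    for (size_t i = 0; i < chunks.size(); ++i) {
        for (size_t offset = 0; offset < CHUNK; offset += PAGE) {
            chunks[i][offset] = 1;
        }
    }
    return 0;
}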

Latency vs. Throughput

This is my last post about processors and performance, I swear! Plus,
my wrists are starting to hurt from this bloodpact thing (I’ve been
diagnosed with RSI), so I think this will be a light one.

As I’ve discussed previously, modern desktop processors work really
hard to exploit the inherent parallelism in your programs. This is
called instruction-level parallelism, and is one of the techniques
processors use to get more performance out of slower clock rates
(along with data-level parallelism (SIMD) or multiple cores (MIMD)*).
Previously, I waved my hands a bit and said “The processor makes
independent operations run in parallel.” Now I’m going to teach you
how to count cycles in the presence of latency and parallelism.

Traditionally, when analyzing the cost of an algorithm, you would
simply count the operations involved, sum their costs in cycles, and
call it a day. These days, it’s not that easy. Instructions have two
costs: dependency chain latency and reciprocal throughput.

Reciprocal throughput is simply the reciprocal of the maximum
throughput of a particular instruction. Throughput is measured in
instructions/cycle, so reciprocal throughput is cycles/instruction.

OK, that sounds like the way we’ve always measured performance. So
what’s dependency chain latency? When the results of a previous
calculation are needed for another calculation, you have a dependency
chain. In a dependency chain, you measure the cost of an instruction
by its latency, not its reciprocal throughput. Remember that our
processors are working really hard to exploit parallelism in our code.
When there is no instruction-level parallelism available, we get
penalized.

Let’s go back to our sum 10000 numbers example, but unroll it a bit:

float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;

In x86:

xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:
addss xmm0, [esi+0]
addss xmm0, [esi+4]
addss xmm0, [esi+8]
addss xmm0, [esi+12]
addss xmm0, [esi+16]
addss xmm0, [esi+20]
addss xmm0, [esi+24]
addss xmm0, [esi+28]

add esi, 32
add ecx, 8 ; i += 8 (eight floats per iteration)
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; xmm0 = total sum

Since each addition to sum in the loop depends on the previous
addition, these instructions are a dependency chain. On a modern
processor, let’s say the reciprocal throughput of addss is 1 cycle.
However, the minimum latency is 4 cycles. Since every instruction
depends on the previous, each addition costs 4 cycles.

As before, let’s try summing with four temporary sums:

xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

begin:
addss xmm0, [esi+0]
addss xmm1, [esi+4]
addss xmm2, [esi+8]
addss xmm3, [esi+12]
addss xmm0, [esi+16]
addss xmm1, [esi+20]
addss xmm2, [esi+24]
addss xmm3, [esi+28]

add esi, 32
add ecx, 8 ; i += 8 (eight floats per iteration)
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; accumulate sums
addss xmm0, xmm1
addss xmm2, xmm3 ; this instruction happens in parallel with the one above
addss xmm0, xmm2

Here, the additions in the loop that depend on each other are 4 cycles apart,
meaning the minimum latency is no longer a problem. This lets us hit
the maximum addition rate of one per cycle.

Removing dependency chains is a critical part of optimizing on today’s
processors. The Core 2 processor has six execution units,
three of which are fully 128-bit SIMD ALUs. If you can restructure
your algorithm so calculations happen independently, you can take
advantage of all of them. (And if you can pull off making full use of
the Core 2’s ALU capacity, you win.)
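
To make that concrete, here is a rough sketch (not IMVU’s actual code) of combining the two ideas, SIMD and independent accumulators, with SSE intrinsics; it assumes count is a multiple of 16:

#include <xmmintrin.h> // SSE
#include <cstddef>

// Sum `count` floats using four independent 4-wide accumulators, so the
// adds in each iteration don't form one long dependency chain.
float sum_floats(const float* array, std::size_t count) {
    __m128 sum0 = _mm_setzero_ps();
    __m128 sum1 = _mm_setzero_ps();
    __m128 sum2 = _mm_setzero_ps();
    __m128 sum3 = _mm_setzero_ps();

    for (std::size_t i = 0; i < count; i += 16) {
        sum0 = _mm_add_ps(sum0, _mm_loadu_ps(array + i));
        sum1 = _mm_add_ps(sum1, _mm_loadu_ps(array + i + 4));
        sum2 = _mm_add_ps(sum2, _mm_loadu_ps(array + i + 8));
        sum3 = _mm_add_ps(sum3, _mm_loadu_ps(array + i + 12));
    }

    // Combine the accumulators, then reduce the final vector to a scalar.
    __m128 total = _mm_add_ps(_mm_add_ps(sum0, sum1), _mm_add_ps(sum2, sum3));
    float lanes[4];
    _mm_storeu_ps(lanes, total);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}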

* BTW, it’s sort of unrelated, but I couldn’t help but link this article.
Greg Pfister has an interesting comparison and history of SIMD
vs. MIMD here. Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.

Running Time -> Algebra -> Hardware

I’m going to talk about something which should be obvious, but I continue to see people optimizing code in the wrong order (*cough* including myself *cough*). So, here’s a reminder. When you’re optimizing a bit of code…

FIRST

Make sure your algorithmic running time is right. O(1) is almost always faster than O(N), and O(N^2) is right out. Often these optimization exercises involve changing some O(N) to O(M), where M is smaller than N.

I’ll give an example. Drawing a frame of 3D graphics in IMVU is O(N) where N is the sum of all vertices from all products on all objects loaded into a scene. We recently implemented view frustum culling, which skips drawing objects that are known to be off-screen. This reduces the rendering time from O(N) to O(M) where M<N and M is the number of vertices from products that are visible. If we implemented View Independent Progressive Meshes, we could reduce the time to O(P) where P is the number of vertices that contribute to the visible detail of the scene, and P<M<N.

However, make sure to avoid algorithms with good running times but huge constants. This is why, now that CPUs are fast and random memory accesses are slow, searching an O(N) array (or std::vector) is often faster than searching an O(log N) tree (or std::map): the tree will miss cache far more often.
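
As an illustration (a sketch to time on your own data, not a benchmark result), the two lookups might look like this:

#include <algorithm>
#include <map>
#include <vector>

// Linear scan over contiguous memory: more comparisons, but cache-friendly,
// predictable accesses.
bool in_vector(const std::vector<int>& values, int key) {
    return std::find(values.begin(), values.end(), key) != values.end();
}

// Balanced-tree lookup: O(log N) comparisons, but each one chases a pointer
// that may miss cache.
bool in_map(const std::map<int, int>& values, int key) {
    return values.find(key) != values.end();
}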

SECOND

Then, use all of the algebra, set theory, and logic you know to reduce the number of operations required, in order of operation cost.

Let’s say we’re going to calculate the diffuse reflectance on a surface: N dot L, where N and L are three-vectors, N is the normal of the surface, and L is the direction to the light.

The naive normalize(N) dot normalize(L) is…

float lengthN = sqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
float lengthL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
float dot =
    (N.x / lengthN) * (L.x / lengthL) +
    (N.y / lengthN) * (L.y / lengthL) +
    (N.z / lengthN) * (L.z / lengthL);

… which turns out to be 6 additions, 9 multiplications, 6 divisions,
and 2 square roots. Let’s say additions and multiplications are 2
cycles, and divisions and square roots are 40 cycles. This gives us a
total of 6*2 + 9*2 + 6*40 + 2*40 = 350 cycles.

Instead, let’s do a bit of algebra:

  normalize(N) dot normalize(L)
= N/|N| dot L/|L|
= (N dot L) / (|N||L|)
= (N dot L) / sqrt((N dot N) * (L dot L))

The new calculation is…

float lengthSquaredN = N.x*N.x + N.y*N.y + N.z*N.z;
float lengthSquaredL = L.x*L.x + L.y*L.y + L.z*L.z;
float NdotL          = N.x*L.x + N.y*L.y + N.z*L.z;
float dot            = NdotL / sqrtf(lengthSquaredN * lengthSquaredL);

… 6 additions, 10 multiplications, 1 division, and 1 sqrt: 6*2 +
10*2 + 1*40 + 1*40 = 112 cycles. Huge improvement just by applying basic math.

THIRD

Once you’re done optimizing algebraically, read your
processor manuals
and take full advantage of the hardware. If you’ve got SSE4, you can
do the dot products in one instruction (DPPS), and an approximate
reciprocal square root in another (RSQRTSS), which can give another
huge improvement.
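
As a rough sketch of that idea (not the code we shipped), using SSE4.1 intrinsics and assuming N and L are packed into __m128 registers with x, y, z in the first three lanes:

#include <smmintrin.h> // SSE4.1

// Approximate normalize(N) dot normalize(L) in a handful of instructions.
float normalized_dot(__m128 N, __m128 L) {
    __m128 NdotL = _mm_dp_ps(N, L, 0x71); // dot of x,y,z; result in lane 0
    __m128 NdotN = _mm_dp_ps(N, N, 0x71);
    __m128 LdotL = _mm_dp_ps(L, L, 0x71);
    // rsqrtss: approximate 1/sqrt(|N|^2 * |L|^2), ~12 bits of precision.
    __m128 invLen = _mm_rsqrt_ss(_mm_mul_ss(NdotN, LdotL));
    return _mm_cvtss_f32(_mm_mul_ss(NdotL, invLen));
}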

The reason you want to optimize in this order is that algorithmic improvements reduce the amount of work you have to do, making it less important to make that work fast. A hardware-optimized O(N^2) algorithm can be easily beaten by an unoptimized O(N log N) algorithm. Remember, Chad, the next time you schedule optimization projects, consider downstream effects such as these.

A Simple Introduction to Superscalar, Out-of-Order Processors

Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors. I’d heard these terms a million times, but didn’t know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel’s Landmark Chips (Practitioners). (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)

Superscalar basically means “more than one instruction per clock”, implying that a superscalar processor can run code faster than its clock speed would suggest. Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle, which is 12 billion integer additions per second!

Out-of-order means just that – the processor looks at your code at runtime and executes it in parallel if it can. For example, imagine this code:

// three independent, non-null pointers, initialized elsewhere
extern int* p; extern int* q; extern int* r;
extern const int flag1, flag2, flag3;

if (*p & flag1) {
    if (*q & flag2) {
        if (*r & flag3) {
            do_work();
        }
    }
}

The processor can’t assume that q is a valid pointer until it checks p, and the same for r and q. Accessing main memory costs ~200 cycles, so if none of the pointers point to cached memory, you just spent 600 cycles determining whether to do_work(). This is called a “dependency chain”, where the result of a later calculation depends on the previous. But what if you know that p, q, and r will all be valid pointers? You can rewrite as:

const int x = *p;
const int y = *q;
const int z = *r;
if ((x & flag1) && (y & flag2) && (z & flag3)) {
    do_work();
}

Now, the processor knows that all of those memory fetches are independent, so it runs them in parallel. Then, it runs the ANDs in parallel too, since they’re independent. Your 600-cycle check just became 200 cycles.

Similarly, let’s say you want to add 10,000 numbers.

int sum = 0;
for (int i = 0; i < 10000; ++i) {
    sum += array[i];
}
return sum;

Let’s assume the loop overhead and memory accesses are free, and each addition takes one cycle. Since each addition depends on the previous value of sum, they must be executed serially, taking 10000 cycles. However, since you know that addition is associative, you can sum with two variables:

int sum1 = 0;
int sum2 = 0;
for (int i = 0; i < 10000; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
return sum1 + sum2;

Now you have two independent additions, which can be executed in parallel! The loop takes 5000 cycles now. If you independently sum in sum1, sum2, sum3, and sum4, the loop will take 2500 cycles. And so on, until you’ve hit the IPC (instructions per cycle) limit on your processor. If you’re making effective use of your SIMD units, you’d be surprised at how much work you can do in parallel…
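
For concreteness, the four-accumulator version mentioned above might look like this:

int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
int sum4 = 0;
for (int i = 0; i < 10000; i += 4) {
    sum1 += array[i];
    sum2 += array[i+1];
    sum3 += array[i+2];
    sum4 += array[i+3];
}
return sum1 + sum2 + sum3 + sum4;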

And that’s what an out-of-order, superscalar processor can do for you!

Logic vs. Array Processing

I’ve always been amused by the Java vs. C++ performance arguments:

  • “Java’s faster than C++!”
  • “No it’s not!”
  • “Yeah it is, look at this benchmark!”
  • “Well look how much longer the Java version of the program takes to start!”

Back and forth and back and forth. The fact is, they’re both right, and here’s why. I mentally separate code into either of two categories, logic or array processing:

  1. 3D rasterization is obviously array processing.
  2. Video playback is also array processing.
  3. Calculating your tax refund is logic.
  4. Loading a PDF is definitely logic.

Often the line is blurry, but array processing involves running a relatively small set of rules over a lot of homogeneous data. Computers are very, very good at this kind of computation, and specialized hardware such as a GPU can increase performance by orders of magnitude. Ignoring memory bandwidth, a desktop CPU can multiply billions of floating point numbers per second, and a fast GPU can multiply trillions.

At the other extreme, logic code tends to be full of branches, function calls, dependent memory accesses, and often it executes code that hasn’t been run in minutes. Just think about the set of operations that happen when you open a file in Word. Computers aren’t so good at these types of operations, and as Moore’s Law continues, they tend not to improve as rapidly as array computation does.

Back to Java vs. C++. The synthetic benchmarks that compare Java and C++ performance tend to be tight loops, simply because accurate measurement requires it. This gives the JVM time to prime its JIT/prediction engines/what have you, so I’d expect a good result. Heck, I’d expect a good result from the modern JavaScript tracing engines.*

The lesson here is that, for array processing, it’s very little work to make full use of the hardware at hand. Because the amount of code is limited (and the amount of data is large), time spent on optimization has high leverage.

On the other hand, logic code is messy and spread out, often written by entire teams of people. Its performance is dominated by your programming language and the team’s vocabulary of idioms. Truly optimizing this kind of code is hard or impossible. It can be done, but you often have to retrain your team to make sure the benefits stick.

This is a reason that the choice of programming language(s) and libraries has such a big effect on the responsiveness of a desktop application, and one of the reasons why people can “feel” the programming language in which a project was written. Typical desktop application usage patterns are dominated by random, temporally sparse actions, so code size, “directness”, and working set are primary performance factors. (Anecdote: Andy‘s rewriting the IMVU client’s windowing framework so it’s a bajillion times simpler, and when he had the client running again, he exclaimed “Hey, resizing the 3D window is twice as responsive!”)

Perhaps there’s an argument here for the creation of more project-specific programming languages (GOAL, TreeHydra, DSLs), so that performance improvements can be applied universally across the codebase.

With disk and memory speeds improving so much more slowly than CPU speeds, the difference between a snappy desktop application and a sluggish application is a handful of page faults. When choosing a technology platform for a project, it’s worth considering the impact to overall responsiveness down the road. And I’m pretty sure I just recommended writing your entire application in C++, which sounds insane, even to me. I’ll leave it at that.

* By the way, I’m not picking on Java or promoting C++ in particular. You could make these same arguments between any “native” language and “managed” language. The blocking and tackling of loading applications, calling functions, and keeping memory footprint low are important.

A Global Approach to Optimization

Since joining IMVU, I have had two people tell me “Profilers (especially VTune) suck. They don’t help you optimize anything.” That didn’t make sense to me: how could a tool that gives such amazingly detailed data about the execution of your program be useless?

Now, these are two very smart people, so I knew there had to be some truth in what they were saying. I just had to look for it. At least a year later, after I’d dramatically improved the performance of some key subsystems in the IMVU client, I realized what they meant. They should have said: “Don’t just run a profiler to find out which functions are taking the most time, and make them execute faster. Take a global approach to optimization. Prevent those functions from being called at all.”

There you have it: take a global approach to optimization. But how does that work? First, let me ramble a bit about the benefits of performance.

There are two types of features:

  1. interactive
  2. not interactive (i.e. slow)

Searching on Google, opening a Word document, and firing a gun in Team Fortress 2 are all interactive.

Compressing large files, factoring large numbers, and downloading HD movies are not.

We wish all features were interactive, but computers can’t do everything instantly. Sometimes, however, a feature switches from non-interactive to interactive, to dramatic effect. Remember way back? Before YouTube? Downloading videos took forever, and probably wouldn’t even play. YouTube made video consumption so fast and so easy that it changed the shape of the internet. Similarly, thanks to Google, it’s faster to search the internet for something than it is to search your own hard drive in Windows XP.

Anyway, if you truly want to make something as fast as it can be, you need to think like this:

  • What’s the starting state A?
  • What’s the ending state B?
  • What’s the minimal set of operations to get from A to B, and how do I execute them as fast as possible?

Optimizing your existing, naive code won’t get you there. You’ll have to build your application around these goals. There’s plenty of room for out-of-the-box thinking here. Take MacOS X’s resumption-from-hibernation feature:

  • The starting state: the computer is off, the memory is saved to disk.
  • The ending state: the computer is on and the user is productive.

MacOS X takes advantage of the fact that this is not purely a technology problem. The user has to remember what they were doing and become reattached to the computer. Thus, they show you a copy of what was on your screen last to remind you what was happening while the computer prepares for your actions. Opportunities for this kind of parallelism abound: why is it that operating system installers ask you questions, download packages, and install them serially? There is dramatic room for improvement there.

I don’t claim that IMVU’s website is the fastest website out there, but here’s an example of a type of optimization that takes the whole picture into account: when you start loading a page on IMVU.com, it optimistically fetches hundreds of keys from our memcache servers, before even looking up your customer information. It’s probable that you’ll need many of those keys, and it’s faster to get them all at once than to fetch them as you need them.

Someday, I hope to apply this global optimization approach to a software build system (a la Make, SCons, MSBuild). It’s insane that we don’t have a build system with all of the flexibility of SCons and instantaneous performance. Sure, the first build may need to scan for dependencies, but there’s no reason that a second build couldn’t reuse the information from the first build and start instantaneously. Just have a server process that watches for changes to files and updates the internal dependency graph. On large projects, I’ve seen SCons take minutes to figure out that it has to rebuild one file, which is simply crazy.

When optimizing a feature, take the user into consideration, and write down the minimum set of steps between the starting state and ending state. Execute those steps as fast as you can, and run them in parallel if it helps. If technology has advanced enough, maybe you have just transformed something non-interactive into an interactive feature.

The Real Benefit of Inlining Functions (or: Floating Point Calling Conventions)

My mental model for the performance benefit of inlining a function call was:

  1. code size increases
  2. the overhead of the call, including argument and return value marshalling, is eliminated
  3. the compiler knows more information, so it can generate better code

I had dramatically underestimated the value of #3, so this entry is an attempt to give a concrete example of how inlining can help.

As alluded to in my previous entry, you can’t just leave the floating point state willy-nilly across function calls. Every function should be able to make full use of the floating point register stack, which doesn’t work if somebody has left stale values on it. In general, these rules are called calling conventions. Agner Fog has excellent coverage of the topic, as usual.

Anyway, back to inlining. The specifics aren’t that important, but we had a really simple function in the IMVU client which continued to show up in the profiles. It looked something like this:

#include <vector>

std::vector<float> array;

float function() {
    float sum = 0.0f;
    for (size_t i = 0; i < array.size(); ++i) {
        sum += array[i];
    }
    return sum;
}

This function never operated on very large lists, and it also wasn’t called very often, so why was it consistently in the profiles? A peek at the assembly showed (again, something like):

fldz
fstp dword ptr [sum] ; sum = 0.0

xor ecx, ecx ; i = 0
jmp cmp

loop:

push ecx
call array.operator[]

fadd [sum] ; return value of operator[] in ST(0)
fstp [sum] ; why the load and the store??

add ecx, 1

cmp:

call array.size()
cmp ecx, eax
jb loop ; continue if i < return value

fld [sum] ; return value

First of all, why all of the function calls? Shouldn't std::vector be inlined? But more importantly, why does the compiler keep spilling sum out to the stack? Surely it could keep the sum in a floating point register for the entire calculation.

This is when I realized: due to the calling convention requirements on function calls, the floating point stack must be empty upon entry into the function. The stack is in L1 cache, but still, that's three cycles per access, plus a bunch of pointless load and store uops.

Now, I actually know why std::vector isn't inlined. For faster bug detection, we compile and ship with bounds checking enabled on STL containers and iterators. But in this particular situation, the bounds checking isn't helpful, since we're iterating over the entire container. I rewrote the function as:

#include <vector>

std::vector<float> array;

float function() {
    const float* p = &array[0];
    size_t count = array.size();
    float sum = 0.0f;
    while (count--) {
        sum += *p++;
    }
    return sum;
}

And the compiler generated the much more reasonable:

call array.size()
mov ecx, eax ; ecx = count

push 0
call array.operator[]
mov esi, eax ; esi = p

fldz ; ST(0) = sum

jmp cmp
loop:

fadd [esi] ; sum += *p

add esi, 4 ; p++
sub ecx, 1 ; count--

cmp:
cmp ecx, 0
jne loop

; return ST(0)

This is the real benefit of inlining. Modern compilers are awesome at making nearly-optimal use of the CPU, but only when they have enough information. Inlining functions gives them that information.

p.s. I apologize if my pseudo-assembly had mistakes. I wrote it from memory.

#IND and #QNaN with /fp:fast

The other day Timothy and I were optimizing some floating-point-intensive lighting code. Looking at the generated code, I realized we weren’t compiling with /fp:fast. Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for the sake of consistent results.

See, the problem is that the floating point registers on x86 are 80 bits wide, so if you compile “float x, y, z, w; w = (x + y) * z” as…

fld [x]  ; ST0 = x
fadd [y] ; ST0 = x + y
fmul [z] ; ST0 = (x + y) * z
fstp [w] ; w = (x + y) * z

… the temporary results are always stored in ST0 with 80 bits of precision. However, since floats only have 32 bits of precision, you can wind up with different results depending on compilers, optimization settings, register allocation, etc. We often had problems like this at VRAC. Some poor engineering student would send out a panicked e-mail at 9:00 p.m. asking why his program started producing different results in release mode than it did in debug mode.
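
Here’s a contrived sketch of the kind of discrepancy involved; the exact behavior depends on your compiler and flags, but the point is that keeping the intermediate at 80 bits versus rounding it to a 32-bit float changes the answer:

float sum3(float x, float y, float z) {
    // If (x + y) stays on the x87 stack at 80-bit precision, the final result
    // differs from the one you get if it's rounded to a 32-bit float first.
    return (x + y) + z;
}

// For example, sum3(16777216.0f, 1.0f, -16777216.0f) -- 2^24 is the edge of
// float precision -- yields 1.0f with the 80-bit intermediate, but 0.0f if the
// intermediate is truncated to float.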

Thus, Visual C++ takes a more cautious approach. By default, it stores float intermediates back to memory to truncate them to 32 bits of precision:

fld [x]
fadd [y]
fstp [temp] ; truncate precision
fld [temp]
fmul [z]
fstp [w]

Tiny differences in precision don’t matter in IMVU, so enabling /fp:fast saved 50-100 CPU cycles per vertex in our vertex lighting loop. However, with this option turned on, our automated tests started failing with crazy #IND and #QNAN errors!

After some investigation, we discovered that our 4×4 matrix inversion routine (which calculates several 2×2 and 3×3 determinants) was using all 8 floating point registers with /fp:fast enabled. The x87 registers are stored in a “stack”, where ST0 is the top of the stack and STi is the i’th entry. Load operations like fld, fld1, and fldz push entries on the stack. Arithmetic operations like fadd and fmul combine the top of the stack with a value in memory, storing the result back on the stack.

But what if the x87 register stack overflows? In this case, an “indefinite” NAN is loaded instead of the value you requested, indicating that you have lost information. (The data at the bottom of the stack was lost.) Here’s an example:

fldz  ; ST0 = 0
fld1  ; ST0 = 1, ST1 = 0
fldpi ; ST0 = pi, ST1 = 1, ST2 = 0
fldz
fldz
fldz
fldz
fldz  ; ST0-4 = 0, ST5 = pi, ST6 = 1, ST7 = 0
fldz  ; ST0 = IND!

Whoops, there’s a bug in your code! You shouldn’t overflow the x87 register stack, so the processor has given you IND.

Indeed, this is what happened in our matrix inversion routine. But why?

Using a debugger, we determined that the x87 stack contained one value at the start of the function. Moreover, it contained a value at the start of the test! Something was fishy. Somebody was leaving the x87 stack dirty, and we needed to find out who.

void verify_x87_stack_empty() {
    unsigned z[8];
    __asm {
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fstp dword ptr [z+0x00]
        fstp dword ptr [z+0x04]
        fstp dword ptr [z+0x08]
        fstp dword ptr [z+0x0c]
        fstp dword ptr [z+0x10]
        fstp dword ptr [z+0x14]
        fstp dword ptr [z+0x18]
        fstp dword ptr [z+0x1c]
    }

    // Verify bit patterns. 0 = 0.0
    for (unsigned i = 0; i < 8; ++i) {
        CHECK_EQUAL(z[i], 0);
    }
}

The previous function, called before and after every test, discovered the culprit: we had a test that intentionally called printf() and frexp() with NaN values, which had the side effect of leaving the floating point stack in an unpredictable state.

Adding __asm emms to the end of the test fixed our problem: thereafter, /fp:fast worked wonderfully. Case closed.
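
For reference, the cleanup was essentially that one instruction, wrapped here in a hypothetical helper (MSVC 32-bit inline assembly); emms marks every x87 register as empty, so later code starts with a clean floating point stack:

void reset_x87_state() {
    __asm emms
}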