The other day Timothy and I were optimizing some floating-point-intensive lighting code. Looking at the generated code, I realized we weren't compiling with /fp:fast. Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for the sake of consistent results.

See, the problem is that the floating point registers on x86 are 80 bits wide, so if you compile "float x, y, z, w; w = (x + y) * z" as...
fld [x]  ; ST0 = x
fadd [y] ; ST0 = x + y
fmul [z] ; ST0 = (x + y) * z
fstp [w] ; w = (x + y) * z

... the temporary results are always stored in ST0 with 80 bits of precision. However, since floats only have 32 bits of precision, you can wind up with different results depending on compilers, optimization settings, register allocation, etc. We often had problems like this at VRAC. Some poor engineering student would send out a panicked e-mail at 9:00 p.m. asking why his program started producing different results in release mode than it did in debug mode.

Thus, Visual C++ takes a more cautious approach. By default, it stores float intermediates back to memory to truncate them to 32 bits of precision:

fld [x]
fadd [y]
fstp [temp] ; truncate precision
fld [temp]
fmul [z]
fstp [w]

Tiny differences in precision don't matter in IMVU, so enabling /fp:fast saved 50-100 CPU cycles per vertex in our vertex lighting loop. However, with this option turned on, our automated tests started failing with crazy #IND and #QNAN errors!

After some investigation, we discovered that our 4x4 matrix inversion routine (which calculates several 2x2 and 3x3 determinants) was using all 8 floating point registers with /fp:fast enabled. The x87 registers are stored in a "stack", where ST0 is the top of the stack and STi is the i'th entry. Load operations like fld, fld1, and fldz push entries on the stack. Arithmetic operations like fadd and fmul operate on the top of the stack with the value in memory, storing the result back on the stack.

But what if the x87 register stack overflows? In this case, an "indefinite" NAN is loaded instead of the value you requested, indicating that you have lost information. (The data at the bottom of the stack was lost.) Here's an example:

fldz  ; ST0 = 0
fld1  ; ST0 = 1, ST1 = 0
fldpi ; ST0 = pi, ST1 = 1, ST2 = 0
fldz
fldz
fldz
fldz
fldz  ; ST0-4 = 0, ST5 = pi, ST6 = 1, ST7 = 0
fldz  ; ST0 = IND!

Woops, there's a bug in your code! You shouldn't overflow the x87 register stack, so the processor has given you IND.

Indeed, this is what happened in our matrix inversion routine. But why?

Using a debugger, we determined that the x87 stack contained one value at the start of the function. Moreover, it contained a value at the start of the test! Something was fishy. Somebody was leaving the x87 stack dirty, and we needed to find out who.

void verify_x87_stack_empty() {
    unsigned z[8];
    __asm {
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fstp dword ptr [z+0x00]
        fstp dword ptr [z+0x04]
        fstp dword ptr [z+0x08]
        fstp dword ptr [z+0x0c]
        fstp dword ptr [z+0x10]
        fstp dword ptr [z+0x14]
        fstp dword ptr [z+0x18]
        fstp dword ptr [z+0x1c]
    }

    // Verify bit patterns. 0 = 0.0
    for (unsigned i = 0; i < 8; ++i) {
        CHECK_EQUAL(z[i], 0);
    }
}

The previous function, called before and after every test, discovered the culprit: we had a test that intentionally called printf() and frexp() with NaN values, which had the side effect of leaving the floating point stack in an unpredictable state.

Adding __asm emms to the end of the test fixed our problem: thereafter, /fp:fast worked wonderfully. Case closed.