The Real Benefit of Inlining Functions (or: Floating Point Calling Conventions)

My mental model for the performance effects of inlining a function call was:

1. code size increases
2. the overhead of the call, including argument and return value marshalling, is eliminated
3. the compiler has more information, so it can generate better code

I had dramatically underestimated the value of #3, so this entry is an attempt to give a concrete example of how inlining can help.

As I alluded to in my previous entry, you can't just leave floating point state lying around willy-nilly across function calls. Every function should be able to make full use of the floating point register stack, which doesn't work if somebody has left stale values on it. In general, these rules are called calling conventions. Agner Fog has excellent coverage of the topic, as usual.

Anyway, back to inlining. The specifics aren't that important, but we had a really simple function in the IMVU client which continued to show up in the profiles. It looked something like this:

std::vector<float> array;

float function() {
    float sum = 0.0f;
    for (size_t i = 0; i < array.size(); ++i) {
        sum += array[i];
    }
    return sum;
}

This function never operated on very large lists, and it also wasn't called very often, so why was it consistently in the profiles? A peek at the assembly showed (again, something like):

fldz
fstp dword ptr [sum] ; sum = 0.0

xor ecx, ecx ; i = 0
jmp cmp

loop:

push ecx
call array.operator[]

fld dword ptr [eax] ; operator[] returns a reference in eax; load array[i] into ST(0)
fadd [sum] ; ST(0) += sum, reloaded from the stack
fstp [sum] ; why the load and the store??

add ecx, 1

cmp:

call array.size()
cmp ecx, eax
jb loop ; continue if i < return value

fld [sum] ; return value

First of all, why all of the function calls? Shouldn't std::vector be inlined? But more importantly, why does the compiler keep spilling sum out to the stack? Surely it could keep the sum in a floating point register for the entire calculation.

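This is when I realized: the calling convention requires the floating point register stack to be empty on entry to any function, so the compiler can't keep sum in a floating point register across the calls to operator[] and size(). It has to store sum to memory before each call and reload it afterward. That stack slot is in L1 cache, but still, that's about three cycles per access, plus a bunch of pointless load and store uops.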
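To make the shape of the problem easier to reproduce, here is a minimal sketch, not the actual client code. The names `checked_at` and `sum_with_calls` are hypothetical, and the `NOINLINE` macro relies on compiler-specific attributes to stand in for an accessor the compiler can't see through. Whether the accumulator actually gets spilled depends on the compiler, target, and flags; the x87 behavior described above applies to 32-bit x86 builds.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Compiler-specific way to forbid inlining (an assumption for illustration).
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Hypothetical stand-in for a bounds-checked element accessor that is
// opaque to the optimizer at the call site.
NOINLINE float checked_at(const std::vector<float>& v, std::size_t i) {
    if (i >= v.size()) {
        throw std::out_of_range("checked_at: index out of range");
    }
    return v[i];
}

// Each iteration makes an opaque call. Under a convention that requires an
// empty x87 register stack at function entry, the running sum cannot stay
// in a floating point register across that call.
float sum_with_calls(const std::vector<float>& v) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += checked_at(v, i);
    }
    return sum;
}
```

Compiling this for 32-bit x86 and reading the generated assembly is one way to check whether your compiler stores and reloads the sum around each call.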

Now, I actually know why std::vector isn't inlined. For faster bug detection, we compile and ship with bounds checking enabled on STL containers and iterators. But in this particular situation, the bounds checking isn't helpful, since we're iterating over the entire container. I rewrote the function as:

std::vector<float> array;

float function() {
    const float* p = &array[0]; // assumes array is non-empty
    size_t count = array.size();
    float sum = 0.0f;
    while (count--) {
        sum += *p++;
    }
    return sum;
}

And the compiler generated the much more reasonable:

call array.size()
mov ecx, eax ; ecx = count

push 0
call array.operator[]
mov esi, eax ; esi = p

fldz ; ST(0) = sum

jmp cmp
loop:

fadd [esi] ; sum += *p

add esi, 4 ; p++
sub ecx, 1 ; count--

cmp:
cmp ecx, 0
jne loop

; return ST(0)

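With no calls left inside the loop, the compiler can keep sum in ST(0) the whole time, which is exactly what inlining operator[] and size() would have allowed in the original version. This is the real benefit of inlining. Modern compilers are awesome at making nearly-optimal use of the CPU, but only when they have enough information. Inlining functions gives them that information.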
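One caveat about the pointer version above: `&array[0]` is not valid on an empty vector, and a bounds-checked operator[] would flag it. Here is a minimal sketch that keeps the call-free loop while handling the empty case. It assumes C++11's `std::vector::data()` is available, and the name `sum_all` is hypothetical, not something from the original code.

```cpp
#include <cstddef>
#include <vector>

// Sums the elements through raw pointers so the loop body contains no
// function calls. data() and size() are each called once, before the loop.
float sum_all(const std::vector<float>& values) {
    const float* p = values.data();          // may be null when empty
    const float* end = p + values.size();    // p + 0 == p for an empty vector
    float sum = 0.0f;
    for (; p != end; ++p) {
        sum += *p;
    }
    return sum;
}
```

Taking the vector by reference instead of using a global is a choice made only to keep the sketch self-contained.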

p.s. I apologize if my pseudo-assembly had mistakes. I wrote it from memory.