Since the Pentium Pro/Pentium II, we have all been using heavily superscalar, out-of-order processors. I'd heard these terms a million times, but didn't know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners). (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)

Superscalar basically means "more than one": a superscalar processor can issue more than one instruction per clock cycle, so it runs code faster than its clock speed alone would suggest. Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle - that's 12 billion integer additions per second!
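To make "independent" concrete, here's a minimal sketch (the function and its arguments are hypothetical, purely for illustration): four additions where no result feeds into another, so a wide core is free to issue all of them in the same cycle.

int four_independent_adds(int w0, int x0, int w1, int x1,
                          int w2, int x2, int w3, int x3) {
    int a = w0 + x0;  // none of these four additions reads
    int b = w1 + x1;  // another's result, so the processor
    int c = w2 + x2;  // can execute all of them at once
    int d = w3 + x3;
    return a ^ b ^ c ^ d;  // combine results so none is dead code
}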

Out-of-order means just that - the processor doesn't have to execute your instructions in the order they're written. At runtime, it looks ahead, finds instructions whose inputs are ready, and executes them in parallel when it can. For example, imagine this code:

// p, q, and r are three independent, non-null pointers
void do_work();

void check(const int* p, const int* q, const int* r,
           int flag1, int flag2, int flag3) {
    if (*p & flag1) {
        if (*q & flag2) {
            if (*r & flag3) {
                do_work();
            }
        }
    }
}

The processor can't dereference q until the check on *p resolves, and it can't dereference r until the check on *q resolves. Accessing main memory costs ~200 cycles, so if none of the pointers points to cached memory, you just spent 600 cycles determining whether to call do_work(). This is called a "dependency chain": each calculation depends on the result of the one before it. But what if you know that p, q, and r will all be valid pointers? You can rewrite the check as:

void check(const int* p, const int* q, const int* r,
           int flag1, int flag2, int flag3) {
    // hoist all three loads up front so none waits on a branch
    const int x = *p;
    const int y = *q;
    const int z = *r;
    if ((x & flag1) && (y & flag2) && (z & flag3)) {
        do_work();
    }
}

Now the processor knows that all of those memory fetches are independent, so it issues them in parallel and the three cache misses overlap. Then it runs the ANDs in parallel too, since they're also independent. Your 600-cycle check just became ~200 cycles.

Similarly, let's say you want to add 10,000 numbers.

int sum_array(const int* array) {
    int sum = 0;
    for (int i = 0; i < 10000; ++i) {
        sum += array[i];
    }
    return sum;
}

Let's assume the loop overhead and memory accesses are free, and that each addition takes one cycle. Since each addition depends on the previous value of sum, the additions must execute serially, taking 10,000 cycles. However, since you know that addition is associative, you can sum into two variables:

int sum_array(const int* array) {
    int sum1 = 0;
    int sum2 = 0;
    for (int i = 0; i < 10000; i += 2) {
        sum1 += array[i];
        sum2 += array[i + 1];
    }
    return sum1 + sum2;
}

Now you have two independent addition chains that can execute in parallel! The loop takes 5000 cycles now. If you sum independently into sum1, sum2, sum3, and sum4, the loop takes 2500 cycles. And so on, until you've hit the IPC (instructions per cycle) limit of your processor. And if you're making effective use of your SIMD units, you'd be surprised at how much work you can do in parallel...
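For a taste of the SIMD side, here's a minimal sketch (assuming SSE2, which the Pentium 4 introduced; the function name is mine, and the element count must be a multiple of 4). One vector register holds four independent 32-bit accumulators, so each vector add advances four sums at once:

#include <emmintrin.h>  // SSE2 intrinsics

int sum_array_sse2(const int* array) {
    __m128i acc = _mm_setzero_si128();  // four 32-bit accumulators
    for (int i = 0; i < 10000; i += 4) {
        // one vector add advances all four accumulators
        __m128i v = _mm_loadu_si128((const __m128i*)&array[i]);
        acc = _mm_add_epi32(acc, v);
    }
    // horizontal sum of the four lanes
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

And the multiple-accumulator trick applies here too: with two __m128i accumulators, you'd have eight independent sums in flight.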

And that's what an out-of-order, superscalar processor can do for you!