Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors. I’d heard these terms a million times, but didn’t know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel’s Landmark Chips (Practitioners). (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)

Superscalar basically means “more than one instruction per clock cycle”, implying that a superscalar processor can run code faster than its clock speed would suggest. Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle, which is 12 billion integer additions per second!

Out-of-order means just that – at runtime, the processor looks at the instructions in your code and executes them out of program order, and in parallel, whenever they don’t depend on each other. For example, imagine this code:

```
// three independent, non-null pointers
int* p;
int* q;
int* r;

// example flag values
const int flag1 = 0x1, flag2 = 0x2, flag3 = 0x4;

if (*p & flag1) {
    if (*q & flag2) {
        if (*r & flag3) {
            do_work();
        }
    }
}
```

The processor can’t assume that `q` is a valid pointer until it has checked `p`, and the same goes for `r` and `q`. Accessing main memory costs ~200 cycles, so if none of the pointers point to cached memory, you just spent 600 cycles determining whether to `do_work()`. This is called a “dependency chain”, where each calculation depends on the result of the previous one. But what if you know that `p`, `q`, and `r` will all be valid pointers? You can rewrite it as:

```
const int x = *p;
const int y = *q;
const int z = *r;
if ((x & flag1) && (y & flag2) && (z & flag3)) {
    do_work();
}
```

Now, the processor knows that all of those memory fetches are independent, so it runs them in parallel. Then, it runs the `AND`s in parallel too, since they’re independent. Your 600-cycle check just became 200 cycles.

Similarly, let’s say you want to add 10,000 numbers.

```
int sum = 0;
for (int i = 0; i < 10000; ++i) {
    sum += array[i];
}
return sum;
```

Let’s assume the loop overhead and memory accesses are free, and that each addition takes one cycle. Since each addition depends on the previous value of `sum`, they must be executed serially, taking 10000 cycles. However, since you know that addition is associative, you can sum with two variables:

```
int sum1 = 0;
int sum2 = 0;
for (int i = 0; i < 10000; i += 2) {
    sum1 += array[i];
    sum2 += array[i + 1];
}
return sum1 + sum2;
```

Now you have two independent additions, which can be executed in parallel! The loop now takes 5000 cycles. If you independently sum into `sum1`, `sum2`, `sum3`, and `sum4`, the loop will take 2500 cycles. And so on, until you’ve hit the IPC (instructions per cycle) limit of your processor. If you’re making effective use of your SIMD units, you’d be surprised at how much work you can do in parallel…
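
To make that concrete, here’s a rough sketch of where that line of thinking ends up with SSE2 intrinsics (the `simd_sum` name and the assumption that the length is a multiple of 8 are just for illustration): each `__m128i` accumulator holds four independent 32-bit sums, and keeping two accumulators live gives the processor two independent add chains on top of that.

```
#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative only: sum `count` ints (assumed to be a multiple of 8)
// using two vector accumulators, i.e. eight independent partial sums.
int simd_sum(const int* array, int count) {
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (int i = 0; i < count; i += 8) {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((const __m128i*)&array[i]));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((const __m128i*)&array[i + 4]));
    }
    // Combine the two accumulators, then add the four 32-bit lanes together.
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, _mm_add_epi32(acc0, acc1));
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```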

And that’s what an out-of-order, superscalar processor can do for you!

Now, how many of these cases does MSVC actually recognize? I.e., what is the assembly generated for each of these examples?

Out of curiosity, I checked the naive case, and it looks like VC++ unrolls the sum loop, but does not sum into different accumulators. (In reality, you’re probably memory-bound anyway. The examples are illustrative.)

Interestingly, in the `i += 2` example, it doesn’t unroll at all, but it keeps both accumulators in registers.

This is probably a useless comment, but that last piece of code just reminded me of Duff’s device. I’d assume you could do something like that (with slight deviations, as sketched below) to capture the parallelism of your sum (fill in N and SIZE to your liking). Or maybe I’m just off my rocker, thinking about compiler/processor optimization in the ’80s…
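
Something like this, maybe – just a rough sketch, with the unroll factor N fixed at 4 and `duff_sum`, `array`, and `SIZE` as placeholder names; giving each slot of the unrolled body its own accumulator is the “slight deviation” that keeps the adds independent:

```
// Hypothetical sketch: Duff's device unrolled by N = 4, with one accumulator
// per slot so the additions don't form a single dependency chain.
int duff_sum(const int* array, int SIZE) {
    if (SIZE <= 0) {
        return 0;
    }
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    int n = (SIZE + 3) / 4;          // trips through the unrolled body
    switch (SIZE % 4) {              // fall-through handles the remainder
    case 0: do { s0 += array[i++];
    case 3:      s1 += array[i++];
    case 2:      s2 += array[i++];
    case 1:      s3 += array[i++];
            } while (--n > 0);
    }
    return s0 + s1 + s2 + s3;
}
```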

hah, I’d be curious to see if that’s a win.