Since the Pentium Pro/Pentium II, we have all been using heavily superscalar, out-of-order processors. I'd heard these terms a million times, but didn't know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners). (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)

Superscalar basically means "more than one": a superscalar processor can issue more than one instruction per clock cycle, so it runs code faster than its clock speed alone would suggest. Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle - that's 12 billion integer additions per second!
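To make "independent" concrete, here's a minimal sketch (the function and its arguments are hypothetical, purely for illustration): four additions where no result feeds into another, so a wide core is free to issue all of them in the same cycle.

int four_independent_adds(int w0, int x0, int w1, int x1,
                          int w2, int x2, int w3, int x3) {
    int a = w0 + x0;  // none of these four additions reads
    int b = w1 + x1;  // another's result, so the processor
    int c = w2 + x2;  // can execute all of them at once
    int d = w3 + x3;
    return a ^ b ^ c ^ d;  // combine results so none is dead code
}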

Out-of-order means just that - the processor doesn't have to execute your instructions in the order they're written. At runtime, it looks ahead, finds instructions whose inputs are ready, and executes them in parallel when it can. For example, imagine this code:

// p, q, and r are three independent, non-null pointers
void do_work();

void check(const int* p, const int* q, const int* r,
           int flag1, int flag2, int flag3) {
    if (*p & flag1) {
        if (*q & flag2) {
            if (*r & flag3) {
                do_work();
            }
        }
    }
}

The processor can't dereference q until the check on *p resolves, and it can't dereference r until the check on *q resolves. Accessing main memory costs ~200 cycles, so if none of the pointers points to cached memory, you just spent 600 cycles determining whether to call do_work(). This is called a "dependency chain": each calculation depends on the result of the one before it. But what if you know that p, q, and r will all be valid pointers? You can rewrite the check as:

void check(const int* p, const int* q, const int* r,
           int flag1, int flag2, int flag3) {
    // hoist all three loads up front so none waits on a branch
    const int x = *p;
    const int y = *q;
    const int z = *r;
    if ((x & flag1) && (y & flag2) && (z & flag3)) {
        do_work();
    }
}

Now the processor knows that all of those memory fetches are independent, so it issues them in parallel and the three cache misses overlap. Then it runs the ANDs in parallel too, since they're also independent. Your 600-cycle check just became ~200 cycles.

Similarly, let's say you want to add 10,000 numbers.

int sum_array(const int* array) {
    int sum = 0;
    for (int i = 0; i < 10000; ++i) {
        sum += array[i];
    }
    return sum;
}

Let's assume the loop overhead and memory accesses are free, and that each addition takes one cycle. Since each addition depends on the previous value of sum, the additions must execute serially, taking 10,000 cycles. However, since you know that addition is associative, you can sum into two variables:

int sum_array(const int* array) {
    int sum1 = 0;
    int sum2 = 0;
    for (int i = 0; i < 10000; i += 2) {
        sum1 += array[i];
        sum2 += array[i + 1];
    }
    return sum1 + sum2;
}

Now you have two independent addition chains that can execute in parallel! The loop takes 5000 cycles now. If you sum independently into sum1, sum2, sum3, and sum4, the loop takes 2500 cycles. And so on, until you've hit the IPC (instructions per cycle) limit of your processor. And if you're making effective use of your SIMD units, you'd be surprised at how much work you can do in parallel...
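For a taste of the SIMD side, here's a minimal sketch (assuming SSE2, which the Pentium 4 introduced; the function name is mine, and the element count must be a multiple of 4). One vector register holds four independent 32-bit accumulators, so each vector add advances four sums at once:

#include <emmintrin.h>  // SSE2 intrinsics

int sum_array_sse2(const int* array) {
    __m128i acc = _mm_setzero_si128();  // four 32-bit accumulators
    for (int i = 0; i < 10000; i += 4) {
        // one vector add advances all four accumulators
        __m128i v = _mm_loadu_si128((const __m128i*)&array[i]);
        acc = _mm_add_epi32(acc, v);
    }
    // horizontal sum of the four lanes
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

And the multiple-accumulator trick applies here too: with two __m128i accumulators, you'd have eight independent sums in flight.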

And that's what an out-of-order, superscalar processor can do for you!