This is my last post about processors and performance, I swear! Plus, my wrists are starting to hurt from this blood pact thing (as I’m diagnosed with RSI), so I think this will be a light one.

As I’ve discussed previously, modern desktop processors work really hard to exploit the inherent parallelism in your programs. This is called instruction-level parallelism, and it is one of the techniques processors use to get more performance out of slower clock rates (along with data-level parallelism (SIMD) and multiple cores (MIMD)*). Previously, I waved my hands a bit and said “The processor makes independent operations run in parallel.” Now I’m going to teach you how to count cycles in the presence of latency and parallelism.

Traditionally, when analyzing the cost of an algorithm, you would simply count the operations involved, sum their costs in cycles, and call it a day. These days, it’s not that easy. Instructions have two costs: dependency chain latency and reciprocal throughput.

Reciprocal throughput is simply the reciprocal of the maximum throughput of a particular instruction. Throughput is measured in instructions/cycle, so reciprocal throughput is cycles/instruction.

OK, that sounds like the way we’ve always measured performance. So what’s dependency chain latency? When the results of a previous calculation are needed for another calculation, you have a dependency chain. In a dependency chain, you measure the cost of an instruction by its latency, not its reciprocal throughput. Remember that our processors are working really hard to exploit parallelism in our code. When there is no instruction-level parallelism available, we get penalized.
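To make that concrete at the source level, here’s a tiny sketch of the difference (the function names are mine):

```c
/* One dependency chain: each addition must wait for the previous result. */
float chained(float a, float b, float c, float d) {
    float t = a + b;   /* result needed by the next add...   */
    t = t + c;         /* ...whose result is needed here too */
    return t + d;
}

/* Two independent additions that can issue in parallel, then one final add. */
float parallel(float a, float b, float c, float d) {
    float t1 = a + b;  /* independent of t2 */
    float t2 = c + d;  /* independent of t1 */
    return t1 + t2;
}
```

Both functions compute the same sum, but the second one gives the processor two additions it can run at the same time.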

Let’s go back to our sum 10000 numbers example, but unroll it a bit:

```c
float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;
```

In x86:

```asm
        xor   ecx, ecx        ; ecx = i = 0
        mov   esi, array
        xorps xmm0, xmm0      ; xmm0 = sum = 0.0
begin:
        addss xmm0, [esi+0]
        addss xmm0, [esi+4]
        addss xmm0, [esi+8]
        addss xmm0, [esi+12]
        addss xmm0, [esi+16]
        addss xmm0, [esi+20]
        addss xmm0, [esi+24]
        addss xmm0, [esi+28]
        add   esi, 32
        add   ecx, 8
        cmp   ecx, 10000
        jl    begin           ; if ecx < 10000, goto begin
        ; xmm0 = total sum
```

Since each addition to `sum` in the loop depends on the previous addition, these instructions form a dependency chain. On a modern processor, let’s say the reciprocal throughput of `addss` is 1 cycle. However, the minimum latency is 4 cycles. Since every instruction depends on the previous, each addition costs 4 cycles.
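Under those figures, a toy cost model shows the gap (the 4-cycle latency and 1-cycle reciprocal throughput are the illustrative numbers above, not measurements of any particular chip):

```c
/* Illustrative per-addss costs from the text above. */
#define ADD_LATENCY      4   /* cycles until the result is usable       */
#define ADD_RTHROUGHPUT  1   /* cycles between independent addss issues */

/* n additions in one dependency chain: each waits the full latency. */
long cycles_one_chain(long n) {
    return n * ADD_LATENCY;
}

/* n additions spread over k independent chains: the chains overlap, so
   we pay latency only along the longest chain, but we can never beat
   the issue rate of one addition per cycle. */
long cycles_k_chains(long n, long k) {
    long chain_cost      = (n + k - 1) / k * ADD_LATENCY;
    long throughput_cost = n * ADD_RTHROUGHPUT;
    return chain_cost > throughput_cost ? chain_cost : throughput_cost;
}
```

With n = 10000, this model predicts 40000 cycles for a single chain, but 10000 cycles once the work is split across four independent chains, i.e. the maximum rate of one addition per cycle.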

As before, let’s try summing with four temporary sums:

```asm
        xor   ecx, ecx        ; ecx = i = 0
        mov   esi, array
        xorps xmm0, xmm0      ; xmm0 = sum1 = 0.0
        xorps xmm1, xmm1      ; xmm1 = sum2 = 0.0
        xorps xmm2, xmm2      ; xmm2 = sum3 = 0.0
        xorps xmm3, xmm3      ; xmm3 = sum4 = 0.0
begin:
        addss xmm0, [esi+0]
        addss xmm1, [esi+4]
        addss xmm2, [esi+8]
        addss xmm3, [esi+12]
        addss xmm0, [esi+16]
        addss xmm1, [esi+20]
        addss xmm2, [esi+24]
        addss xmm3, [esi+28]
        add   esi, 32
        add   ecx, 8
        cmp   ecx, 10000
        jl    begin           ; if ecx < 10000, goto begin
        ; accumulate the partial sums
        addss xmm0, xmm1
        addss xmm2, xmm3      ; happens in parallel with the add above
        addss xmm0, xmm2
```

Here, the additions in the loop that depend on each other are 4 cycles apart, meaning the minimum latency is no longer a problem. This lets us hit the maximum addition rate of one per cycle.
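Back in C, the four-accumulator loop looks something like this (a sketch of what the assembly above computes; the function name is mine, and note that a compiler is only allowed to make this transformation on its own under relaxed floating-point rules such as `-ffast-math`, because it reorders the additions):

```c
/* Sum n floats using four independent dependency chains.
   n is assumed to be a multiple of 8, as in the unrolled loop above. */
float sum_four_chains(const float *array, int n) {
    float sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f, sum4 = 0.0f;
    for (int i = 0; i < n; i += 8) {
        sum1 += array[i + 0];
        sum2 += array[i + 1];
        sum3 += array[i + 2];
        sum4 += array[i + 3];
        sum1 += array[i + 4];
        sum2 += array[i + 5];
        sum3 += array[i + 6];
        sum4 += array[i + 7];
    }
    /* These two adds are independent, then one final add combines them. */
    return (sum1 + sum2) + (sum3 + sum4);
}
```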

Removing dependency chains is a critical part of optimizing on today’s processors. The Core 2 processor has *six* execution units, three of which are fully 128-bit SIMD ALUs. If you can restructure your algorithm so calculations happen independently, you can take advantage of all of them. (And if you can pull off making full use of the Core 2’s ALU capacity, you win.)

* BTW, it’s sort of unrelated, but I couldn’t help but link this article: Greg Pfister has an interesting comparison and history of SIMD vs. MIMD here. Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.

Trying to understand your post, and wondering: what is the benefit of an unroll of 8 vs. an unroll of 4?

What I am actually wondering is, suppose I have more complex operations in my loop:

```c
for (int i = 0; i < 10000; i += 1) {
    sum += array1[i] * array2[i] * array3[i];
}
```

And as I don’t know how many cycles a multiplication costs, assume it costs 6 cycles. Would 18 partial sums and an unroll of 18 then benefit the performance of such a sum?

candyman: it depends on the number of execution ports on your CPU, but in general, the more complex the loop body, the more inherent parallelism you might get, so the less you need to unroll. That’s not always true of course… why not measure and see? :)
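For what it’s worth, candyman’s loop split across four partial sums might look something like this (a sketch only; the choice of four accumulators and the function name are my assumptions for illustration, and whether 4, 6, or 18 chains wins depends on the actual latencies of your CPU, so measure):

```c
/* Sum of three-way products using four independent dependency chains.
   n is assumed to be a multiple of 4. */
float mul_sum_four_chains(const float *a1, const float *a2,
                          const float *a3, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a1[i + 0] * a2[i + 0] * a3[i + 0];
        s1 += a1[i + 1] * a2[i + 1] * a3[i + 1];
        s2 += a1[i + 2] * a2[i + 2] * a3[i + 2];
        s3 += a1[i + 3] * a2[i + 3] * a3[i + 3];
    }
    /* Combine the independent partial sums at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

The multiplies within one iteration are already independent of the running sums, so much of their latency hides on its own; it’s the `+=` chain that benefits from being split.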