Latency vs. Throughput
This is my last post about processors and performance, I swear! Plus, my wrists are starting to hurt from this blood pact thing (I've been diagnosed with RSI), so I think this will be a light one.
As I've discussed previously, modern desktop processors work really hard to exploit the inherent parallelism in your programs. This is called instruction-level parallelism, and is one of the techniques processors use to get more performance out of slower clock rates (along with data-level parallelism (SIMD) or multiple cores (MIMD)*). Previously, I waved my hands a bit and said "The processor makes independent operations run in parallel." Now I'm going to teach you how to count cycles in the presence of latency and parallelism.
Traditionally, when analyzing the cost of an algorithm, you would simply count the operations involved, sum their costs in cycles, and call it a day. These days, it's not that easy. Instructions have two costs: dependency chain latency and reciprocal throughput.
Reciprocal throughput is simply the reciprocal of the maximum throughput of a particular instruction. Throughput is measured in instructions/cycle, so reciprocal throughput is cycles/instruction.
OK, that sounds like the way we've always measured performance. So what's dependency chain latency? When the results of a previous calculation are needed for another calculation, you have a dependency chain. In a dependency chain, you measure the cost of an instruction by its latency, not its reciprocal throughput. Remember that our processors are working really hard to exploit parallelism in our code. When there is no instruction-level parallelism available, we get penalized.
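Here's a minimal illustration before the real example (these two functions are hypothetical, just to show the difference):

float chained(float x, float y, float z, float w) {
    float a = x * y;   /* start of a dependency chain */
    float b = a + z;   /* must wait for a, so it pays a's full latency */
    return b + w;      /* must wait for b: total cost is the sum of latencies */
}

float independent(float x, float y, float z, float w) {
    float p = x + y;   /* p and q don't depend on each other, */
    float q = z * w;   /* so the processor can execute them in parallel */
    return p + q;      /* only this final add has to wait for both */
}

In chained, each operation is priced at its latency; in independent, the first two operations are priced closer to their reciprocal throughput.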
Let's go back to our sum 10000 numbers example, but unroll it a bit:
float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;

In x86:
    xor ecx, ecx          ; ecx = i = 0
    mov esi, array
    xorps xmm0, xmm0      ; xmm0 = sum = 0.0
begin:
    addss xmm0, [esi+0]
    addss xmm0, [esi+4]
    addss xmm0, [esi+8]
    addss xmm0, [esi+12]
    addss xmm0, [esi+16]
    addss xmm0, [esi+20]
    addss xmm0, [esi+24]
    addss xmm0, [esi+28]
    add esi, 32
    add ecx, 8            ; i += 8, matching the C loop
    cmp ecx, 10000
    jl begin              ; if ecx < 10000, goto begin
    ; xmm0 = total sum
Since each addition to sum in the loop depends on the previous addition, these instructions form a dependency chain. On a modern processor, let's say the reciprocal throughput of addss is 1 cycle, but its minimum latency is 4 cycles. Because every addition has to wait for the previous one, each addition costs the full 4-cycle latency: the eight addss instructions in the loop body take about 32 cycles instead of the 8 that throughput alone would suggest.
As before, let's try summing with four temporary sums:
    xor ecx, ecx          ; ecx = i = 0
    mov esi, array
    xorps xmm0, xmm0      ; xmm0 = sum1 = 0.0
    xorps xmm1, xmm1      ; xmm1 = sum2 = 0.0
    xorps xmm2, xmm2      ; xmm2 = sum3 = 0.0
    xorps xmm3, xmm3      ; xmm3 = sum4 = 0.0
begin:
    addss xmm0, [esi+0]
    addss xmm1, [esi+4]
    addss xmm2, [esi+8]
    addss xmm3, [esi+12]
    addss xmm0, [esi+16]
    addss xmm1, [esi+20]
    addss xmm2, [esi+24]
    addss xmm3, [esi+28]
    add esi, 32
    add ecx, 8            ; i += 8, matching the C loop
    cmp ecx, 10000
    jl begin              ; if ecx < 10000, goto begin

    ; accumulate the partial sums
    addss xmm0, xmm1
    addss xmm2, xmm3      ; this instruction happens in parallel with the one above
    addss xmm0, xmm2      ; xmm0 = total sum
Here, additions to the same accumulator are four instructions apart, so each result is ready by the time it's needed again and the 4-cycle minimum latency is no longer a problem. This lets us hit the maximum addition rate of one per cycle.
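In C, the four-accumulator version looks roughly like this (a sketch; the function name and final reduction order are mine, and note that reordering floating-point additions can change the result slightly, which is one reason compilers won't make this transformation for you automatically):

float sum_array(const float *array) {
    /* four independent accumulators, so dependent adds are four apart */
    float sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f, sum4 = 0.0f;
    for (int i = 0; i < 10000; i += 8) {
        sum1 += array[i+0];
        sum2 += array[i+1];
        sum3 += array[i+2];
        sum4 += array[i+3];
        sum1 += array[i+4];
        sum2 += array[i+5];
        sum3 += array[i+6];
        sum4 += array[i+7];
    }
    /* combine the partial sums at the end */
    return (sum1 + sum2) + (sum3 + sum4);
}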
Removing dependency chains is a critical part of optimizing on today's processors. The Core 2 processor has six execution units, three of which are fully 128-bit SIMD ALUs. If you can restructure your algorithm so calculations happen independently, you can take advantage of all of them. (And if you can pull off making full use of the Core 2's ALU capacity, you win.)
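For instance, the same sum can combine data-level and instruction-level parallelism with SSE intrinsics. This is only a sketch under the same assumptions as above (10000 elements, which happens to be divisible by 16); the function name and the choice of four vector accumulators are mine:

#include <xmmintrin.h>   /* SSE intrinsics */

float sum_array_sse(const float *array) {
    /* four floats per addps, and four independent accumulators
       to cover the addps latency */
    __m128 sum1 = _mm_setzero_ps();
    __m128 sum2 = _mm_setzero_ps();
    __m128 sum3 = _mm_setzero_ps();
    __m128 sum4 = _mm_setzero_ps();
    for (int i = 0; i < 10000; i += 16) {
        sum1 = _mm_add_ps(sum1, _mm_loadu_ps(array + i + 0));
        sum2 = _mm_add_ps(sum2, _mm_loadu_ps(array + i + 4));
        sum3 = _mm_add_ps(sum3, _mm_loadu_ps(array + i + 8));
        sum4 = _mm_add_ps(sum4, _mm_loadu_ps(array + i + 12));
    }
    /* combine the vector accumulators, then reduce the four lanes */
    __m128 total = _mm_add_ps(_mm_add_ps(sum1, sum2), _mm_add_ps(sum3, sum4));
    float lanes[4];
    _mm_storeu_ps(lanes, total);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}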
* BTW, it's sort of unrelated, but I couldn't help but link this article. Greg Pfister has an interesting comparison and history of SIMD vs. MIMD here. Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.
Trying to understand your post, and wondering: what is the benefit of unrolling by 8 vs. unrolling by 4?
What I am actually wondering is: suppose I have more complex operations in my loop:

for (int i = 0; i < 10000; i += 1) {
    sum += array1[i] * array2[i] * array3[i];
}

and as I don't know how many cycles a multiplication costs, assume it costs 6 cycles. Would 18 partial sums and an unroll of 18 benefit the performance of such a sum?
candyman: it depends on the number of execution ports on your CPU, but in general, the more complex the loop body, the more inherent parallelism you might get, so the less you need to unroll. That's not always true of course... why not measure and see? :)
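For concreteness, here is a minimal C sketch of the loop candyman describes, with four partial sums (the unroll factor of 4 and the function name are arbitrary; as the reply above says, measuring is the only way to know what wins on a given CPU):

float dot3(const float *array1, const float *array2, const float *array3, int n) {
    /* assumes n is a multiple of 4 */
    float sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f, sum4 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        /* the multiplies in each line are independent of the other lines;
           only the += on each accumulator sits on a dependency chain */
        sum1 += array1[i+0] * array2[i+0] * array3[i+0];
        sum2 += array1[i+1] * array2[i+1] * array3[i+1];
        sum3 += array1[i+2] * array2[i+2] * array3[i+2];
        sum4 += array1[i+3] * array2[i+3] * array3[i+3];
    }
    return (sum1 + sum2) + (sum3 + sum4);
}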