This is my last post about processors and performance, I swear! Plus, my wrists are starting to hurt from this blood pact thing (I have RSI), so I think this will be a light one.

As I've discussed previously, modern desktop processors work really hard to exploit the inherent parallelism in your programs. This is called instruction-level parallelism, and is one of the techniques processors use to get more performance out of slower clock rates (along with data-level parallelism (SIMD) or multiple cores (MIMD)*). Previously, I waved my hands a bit and said "The processor makes independent operations run in parallel." Now I'm going to teach you how to count cycles in the presence of latency and parallelism.

Traditionally, when analyzing the cost of an algorithm, you would simply count the operations involved, sum their costs in cycles, and call it a day. These days, it's not that easy. Instructions have two costs: dependency chain latency and reciprocal throughput.

Reciprocal throughput is simply the reciprocal of the maximum throughput of a particular instruction. Throughput is measured in instructions/cycle, so reciprocal throughput is cycles/instruction.

OK, that sounds like the way we've always measured performance. So what's dependency chain latency? When the results of a previous calculation are needed for another calculation, you have a dependency chain. In a dependency chain, you measure the cost of an instruction by its latency, not its reciprocal throughput. Remember that our processors are working really hard to exploit parallelism in our code. When there is no instruction-level parallelism available, we get penalized.

Let's go back to our sum 10000 numbers example, but unroll it a bit:

```
float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;
```
In x86:
```
xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:

addss xmm0, [esi+0]  ; sum += array[i+0]
addss xmm0, [esi+4]  ; sum += array[i+1]
addss xmm0, [esi+8]  ; sum += array[i+2]
addss xmm0, [esi+12] ; sum += array[i+3]
addss xmm0, [esi+16] ; sum += array[i+4]
addss xmm0, [esi+20] ; sum += array[i+5]
addss xmm0, [esi+24] ; sum += array[i+6]
addss xmm0, [esi+28] ; sum += array[i+7]

add esi, 32 ; advance pointer by 8 floats
add ecx, 8  ; i += 8
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; xmm0 = total sum
```

Since each addition to `sum` in the loop depends on the previous addition, these instructions form a dependency chain. On a modern processor, let's say the reciprocal throughput of `addss` is 1 cycle. However, its minimum latency is 4 cycles. Since every instruction depends on the previous one, each addition costs 4 cycles, and the loop runs at a quarter of the rate the adder could sustain.

As before, let's try summing with four temporary sums:

```
xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

begin:

addss xmm0, [esi+0]  ; sum1 += array[i+0]
addss xmm1, [esi+4]  ; sum2 += array[i+1]
addss xmm2, [esi+8]  ; sum3 += array[i+2]
addss xmm3, [esi+12] ; sum4 += array[i+3]
addss xmm0, [esi+16] ; sum1 += array[i+4]
addss xmm1, [esi+20] ; sum2 += array[i+5]
addss xmm2, [esi+24] ; sum3 += array[i+6]
addss xmm3, [esi+28] ; sum4 += array[i+7]

add esi, 32 ; advance pointer by 8 floats
add ecx, 8  ; i += 8
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; accumulate sums
addss xmm0, xmm1
addss xmm2, xmm3
addss xmm0, xmm2
; xmm0 = total sum
```