# Latency vs. Throughput

This is my last post about processors and performance, I swear! Plus,
my wrists are starting to hurt from this bloodpact thing (as I’m
diagnosed with RSI), so I think this will be a light one.

As I’ve discussed previously,
modern desktop processors work really hard to exploit the inherent
parallelism in your programs. This is called instruction-level
parallelism
, and is one of the techniques processors use to get
more performance out of slower clock rates (along with data-level
parallelism (SIMD) or multiple cores (MIMD)*). Previously, I waved my
hands a bit and said “The processor makes independent operations run
in parallel.” Now I’m going to teach you how to count cycles in the presence of latency and parallelism.

Traditionally, when analyzing the cost of an algorithm, you would
simply count the operations involved, sum their costs in cycles, and
call it a day. These days, it’s not that easy. Instructions have two
costs: dependency chain latency and reciprocal throughput.

Reciprocal throughput is simply the reciprocal of the maximum
throughput of a particular instruction. Throughput is measured in
instructions/cycle, so reciprocal throughput is cycles/instruction.

OK, that sounds like the way we’ve always measured performance. So
what’s dependency chain latency? When the results of a previous
calculation are needed for another calculation, you have a dependency
chain. In a dependency chain, you measure the cost of an instruction
by its latency, not its reciprocal throughput. Remember that our
processors are working really hard to exploit parallelism in our code.
When there is no instruction-level parallelism available, we get
penalized.

Let’s go back to our sum 10000 numbers example, but unroll it a bit:

```float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
sum += array[i+0];
sum += array[i+1];
sum += array[i+2];
sum += array[i+3];
sum += array[i+4];
sum += array[i+5];
sum += array[i+6];
sum += array[i+7];
}
return sum;
```

In x86:

```xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:

cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; xmm0 = total sum
```

Since each addition to `sum` in the loop depends on the previous
addition, these instructions are a dependency chain. On a modern
processor, let’s say the reciprocal throughput of `addss` is 1 cycle.
However, the minimum latency is 4 cycles. Since every instruction
depends on the previous, each addition costs 4 cycles.

As before, let’s try summing with four temporary sums:

```xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

; top = sum0

begin:

cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; accumulate sums
addss xmm2, xmm3 ; this instruction happens in parallel with the one above
```

Here, the additions in the loop that depend on each other are 4 cycles apart,
meaning the minimum latency is no longer a problem. This lets us hit
the maximum addition rate of one per cycle.

Removing dependency chains is a critical part of optimizing on today’s
processors. The Core 2 processor has six execution units,
three of which are fully 128-bit SIMD ALUs. If you can restructure
your algorithm so calculations happen independently, you can take
advantage of all of them. (And if you can pull off making full use of
the Core 2’s ALU capacity, you win.)

Greg Pfister has an interesting comparison and history of SIMD
vs. MIMD here. Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.

## 2 thoughts on “Latency vs. Throughput”

1. candyman says:

Trying to understand your post, and wondering what is the benefit of unrolling of 8 vs unrolling of 4?

what i am actually wondering is, suppose i have more complex operations in my loop:
for (int i = 0; i < 10000; i += 1) {
sum += array1[i]*array2[i]*array3[i]
}
and as i dont know how many cycles a multiplication costs, assume it costs 6 cycles, would then 18 partial sums and an unroll of 18 benefit the performance of such a sum?

2. candyman: it depends on the number of execution ports on your CPU, but in general, the more complex the loop body, the more inherent parallelism you might get, so the less you need to unroll. That’s not always true of course… why not measure and see? :)