I'm going to talk about something which should be obvious, but I continue to see people optimizing code in the wrong order (*cough* including myself *cough*). So, here's a reminder. When you're optimizing a bit of code...


Make sure your algorithmic running time is right. O(1) is almost always faster than O(N), and O(N^2) is right out. Often these optimization exercises involve changing some O(N) to O(M), where M is smaller than N.

I'll give an example. Drawing a frame of 3D graphics in IMVU is O(N) where N is the sum of all vertices from all products on all objects loaded into a scene. We recently implemented view frustum culling, which skips drawing objects that are known to be off-screen. This reduces the rendering time from O(N) to O(M) where M<N and M is the number of vertices from products that are visible. If we implemented View Independent Progressive Meshes, we could reduce the time to O(P) where P is the number of vertices that contribute to the visible detail of the scene, and P<M<N.
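The core of a frustum-culling pass can be sketched in a few lines. This is not IMVU's implementation, just a minimal illustration assuming each object carries a bounding sphere and the six frustum planes are stored with inward-facing normals (all names here are hypothetical):

```cpp
#include <cstddef>

// Hypothetical types for illustration. A plane is stored as (n, d) with
// the convention that points inside the frustum satisfy dot(n, p) + d >= 0.
struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };
struct Sphere { Vec3 center; float radius; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// Returns true if the bounding sphere lies entirely outside any one of
// the six frustum planes. In that case the object cannot be visible, so
// we skip drawing all of its vertices -- turning O(N) work into O(M).
bool outsideFrustum(const Sphere& s, const Plane planes[6]) {
    for (size_t i = 0; i < 6; ++i) {
        if (dot(planes[i].n, s.center) + planes[i].d < -s.radius) {
            return true;
        }
    }
    return false;
}
```

The per-object test is a handful of multiplies and adds, which is cheap compared to submitting thousands of vertices to the GPU.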

However, make sure to avoid algorithms with good asymptotic running times but huge constants. This is why, now that CPUs are fast and random memory accesses are slow, an O(N) linear search through an array (or std::vector) is often faster than an O(log N) search through a tree (or std::map): the tree will miss cache far more often.
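A toy illustration of the two lookups. The asymptotic analysis favors the tree, but the vector scan walks contiguous memory while each tree node dereference is a likely cache miss, so for modest N the "worse" algorithm often wins in practice:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// O(N), but touches contiguous memory: the prefetcher loves this.
bool containsLinear(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();
}

// O(log N), but each step chases a pointer to a heap-allocated node,
// which is a potential cache miss.
bool containsTree(const std::map<int, bool>& m, int key) {
    return m.find(key) != m.end();
}
```

The crossover point depends on element size and access pattern, so measure before committing to either.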


Then, use all of the algebra, set theory, and logic you know to reduce the number of operations required, eliminating the most expensive operations first.

Let's say we're going to calculate the diffuse reflectance on a surface: N dot L, where N and L are three-vectors, N is the normal of the surface, and L is the direction to the light.

The naive normalize(N) dot normalize(L) is...

float lengthN = sqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
float lengthL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
float dot =
    (N.x / lengthN) * (L.x / lengthL) +
    (N.y / lengthN) * (L.y / lengthL) +
    (N.z / lengthN) * (L.z / lengthL);

... which turns out to be 6 additions, 9 multiplications, 6 divisions, and 2 square roots. Let's say additions and multiplications are 2 cycles, and divisions and square roots are 40 cycles. This gives us a total of 6*2 + 9*2 + 6*40 + 2*40 = 350 cycles.

Instead, let's do a bit of algebra:

  normalize(N) dot normalize(L)
= N/|N| dot L/|L|
= (N dot L) / (|N||L|)
= (N dot L) / sqrt((N dot N) * (L dot L))

The new calculation is...

float lengthSquaredN = N.x*N.x + N.y*N.y + N.z*N.z;
float lengthSquaredL = L.x*L.x + L.y*L.y + L.z*L.z;
float NdotL          = N.x*L.x + N.y*L.y + N.z*L.z;
float dot            = NdotL / sqrtf(lengthSquaredN * lengthSquaredL);

... 6 additions, 10 multiplications, 1 division, and 1 sqrt: 6*2 + 10*2 + 1*40 + 1*40 = 112 cycles. Huge improvement just by applying basic math.
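As a sanity check, here are both versions wrapped as complete functions; since the transformation was pure algebra, they should agree to within floating-point rounding:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Naive version: normalize both vectors, then dot them.
// 6 divisions and 2 square roots.
float diffuseNaive(const Vec3& N, const Vec3& L) {
    float lengthN = sqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
    float lengthL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
    return (N.x / lengthN) * (L.x / lengthL)
         + (N.y / lengthN) * (L.y / lengthL)
         + (N.z / lengthN) * (L.z / lengthL);
}

// Algebraically simplified version: 1 division and 1 square root.
float diffuseFast(const Vec3& N, const Vec3& L) {
    float lengthSquaredN = N.x*N.x + N.y*N.y + N.z*N.z;
    float lengthSquaredL = L.x*L.x + L.y*L.y + L.z*L.z;
    float NdotL          = N.x*L.x + N.y*L.y + N.z*L.z;
    return NdotL / sqrtf(lengthSquaredN * lengthSquaredL);
}
```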


Once you're done optimizing algebraically, read your processor manuals and take full advantage of the hardware. If you've got SSE4, you can do the dot products in one instruction (DPPS), and an approximate reciprocal square root in another (RSQRTSS), which can give another huge improvement.

The reason you want to optimize in this order is that algorithmic improvements reduce the amount of work you have to do, making it less important to make that work fast. A hardware-optimized O(N^2) algorithm can be easily beaten by an unoptimized O(N log N) algorithm. Remember, Chad, the next time you schedule optimization projects, consider downstream effects such as these.