Today I attended the Hot Chips 2009 conference, where I was introduced to OpenCL*. I'd never heard of OpenCL before, but it's totally hot and I will tell you why.

What is OpenCL?

OpenCL (Open Computing Language) is an open, royalty-free API that lets your programs efficiently use all of the computing power in your computer, including multiple CPU cores, vector computation units, GPUs, Cell coprocessors, DSPs, etc. OpenCL programs, called kernels, are written in a variant of C, and the OpenCL runtime schedules them across whichever of those devices you point it at. Your customers don't have a GPU but their processor has 2 or 4 cores? No problem: OpenCL will utilize that spare power.

In addition, OpenCL specifies 2-, 4-, 8-, and 16-wide SIMD vector types, meaning it can easily leverage both the vector computation units on GPUs and extended vector instruction sets on CPUs, such as SSE and AVX. (Looking forward to Larrabee.)
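
To make the vector types a bit more concrete, here is a rough sketch in plain C of what a 4-wide multiply amounts to. In OpenCL C the float4 type is built in and a * b multiplies component by component; the float4_emu struct and the mul4 and square_by_4 functions below are stand-ins invented for this illustration, not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL C's built-in float4 type. */
typedef struct { float s[4]; } float4_emu;

/* Component-wise multiply: roughly what `a * b` means for two float4 values. */
float4_emu mul4(float4_emu a, float4_emu b) {
    float4_emu r;
    for (size_t k = 0; k < 4; ++k) {
        r.s[k] = a.s[k] * b.s[k];
    }
    return r;
}

/* Square an array four lanes at a time; n must be a multiple of 4. */
void square_by_4(const float *in, float *out, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        float4_emu v;
        for (size_t k = 0; k < 4; ++k) {
            v.s[k] = in[i + k];
        }
        float4_emu sq = mul4(v, v);
        for (size_t k = 0; k < 4; ++k) {
            out[i + k] = sq.s[k];
        }
    }
}
```

On real hardware the inner four-lane loops would ideally map to a single SSE instruction or a GPU vector operation; whether that actually happens is up to the OpenCL compiler for each device.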

With OpenCL, Apple and the Khronos Group are cleanly solving two clear market needs: we have increasing amounts of computational hardware available to us and want to use all of it, but each piece of hardware is a bit different, with its own API, performance characteristics, and instruction set. As much as possible, OpenCL unifies the interface to this hardware.

It's true that GPUs and CPUs have different performance characteristics, but the same types of programs run efficiently on both: do way more math than memory loads and split your program into large parallel chunks.
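
One rough way to put a number on "more math than memory loads" is arithmetic intensity: floating-point operations per byte of memory traffic. The sketch below compares two hypothetical per-element workloads; the function names and the operation counts are assumptions made for illustration, and they ignore caches and everything else a real profiler would tell you.

```c
#include <assert.h>

/* Floating-point operations performed per byte moved to or from memory. */
double arithmetic_intensity(double flops_per_element, double bytes_per_element) {
    assert(bytes_per_element > 0.0);
    return flops_per_element / bytes_per_element;
}

/* Element-wise square of a float array:
   1 multiply, one 4-byte load, one 4-byte store. */
double square_intensity(void) {
    return arithmetic_intensity(1.0, 8.0);
}

/* A degree-16 polynomial evaluated per element with Horner's rule:
   16 multiplies + 16 adds, with the same 8 bytes of traffic. */
double horner16_intensity(void) {
    return arithmetic_intensity(32.0, 8.0);
}
```

Under these assumptions the square works out to 0.125 operations per byte and the polynomial to 4, so the second workload does 32 times as much math for the same memory traffic and is the kind of program that should suit both CPUs and GPUs.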

Where do I get it?

The major vendors are starting to provide OpenCL implementations:

Intel's on the way, I'm sure. I expect we'll see an OpenCL implementation on Larrabee as well.

How do I use it?

OpenCL programs are called kernels -- functions that run in parallel across large amounts of data. You're given a lot of freedom: OpenCL uses a variant of the C99 programming language, with only a handful of restrictions.

If you find that too onerous, you can use OpenCL with native functions you've already written! Just let it schedule your tasks and manage task dependencies to easily get started running code on multiple cores.

I won't go into the details of the memory model or API; you can look that up on Google. However, here's an example kernel:

kernel void square(
    global float* input,
    global float* output
) {
    // Each work-item squares the one element at its own global index.
    int i = get_global_id(0);
    output[i] = input[i] * input[i];
}

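It squares every number in an input array, storing the results in an output array; each invocation handles the one element picked out by get_global_id(0). Given a big enough array, even this trivial example will keep all of the compute power in your system busy, although with one multiply per load and store it will be waiting on memory far more than on math. (Defining compute to mean ALU + memory, of course...)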
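
If the get_global_id(0) call looks like magic, it may help to picture the runtime calling the kernel once per array element and handing each call a different index. Here is a serial, plain-C emulation of that idea. It is only a sketch: emu_get_global_id, square_kernel, and launch_square are made-up names, and a real OpenCL implementation is free to run the invocations in parallel and in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for get_global_id(0): the index of the
   invocation that is currently running. */
static size_t current_global_id;

size_t emu_get_global_id(void) {
    return current_global_id;
}

/* The kernel body from above, written as ordinary C. */
void square_kernel(const float *input, float *output) {
    size_t i = emu_get_global_id();
    output[i] = input[i] * input[i];
}

/* Serial emulation of a one-dimensional launch: run the kernel
   global_size times, once for each index from 0 to global_size - 1. */
void launch_square(const float *input, float *output, size_t global_size) {
    assert(input != NULL && output != NULL);
    for (size_t id = 0; id < global_size; ++id) {
        current_global_id = id;
        square_kernel(input, output);
    }
}
```

Because no invocation reads anything another invocation writes, the loop iterations are independent, which is what lets a runtime spread them across cores or GPU lanes.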

Next Steps

I hope that's enough information to get you excited about OpenCL! OpenCL is a huge step towards transparently leveraging additional cores, vector instructions, and GPUs.

* Full disclaimer: I have never actually used OpenCL, but I've definitely struggled with the problems it solves. The IMVU client has some optimized inner loops that we've hand-implemented for the GPU, for SSE, or as plain C compiled to x87 code. Unfortunately, we have had to manually select between those code paths based on the capabilities of the hardware. This sucks. With even decent OpenCL implementations, we could avoid all of that work and transparently benefit from future hardware improvements.