Computers are insanely wide these days.  Not only can out-of-order cores issue and retire several instructions per clock cycle, but vector instructions keep getting wider.  SSE has 128-bit registers.  AVX widened that to 256 bits.  AVX-512 widens the vector registers further -- to 512 bits!  On top of that, modern CPUs have several cores, and many of those CPUs support running two hardware threads per core.

Given a parallelizable problem and a bit of work, it's amazing how much computation you can achieve per clock.

However, there's a class of problems that don't benefit much, if at all, from instruction-level parallelism or vectorization.  I call this "the parsing problem".

Consider a high-performance JSON parser like sajson or rapidjson.  Fundamentally, a JSON parser converts a stream of bytes into a higher-level structure.  The algorithm looks something like this:

read a byte
switch on its value to see whether we're opening an object, closing an object, opening or closing a string, etc.
repeat until the input is consumed
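
To make the shape of that loop concrete, here's a minimal C++ sketch.  This is my own simplified illustration, not sajson's or rapidjson's actual code (the function and enum names are made up).  The point to notice is that every byte forces a data-dependent branch before the next byte can be dispatched:

```cpp
// Hypothetical, simplified inner loop -- not sajson's or rapidjson's code.
#include <cstddef>
#include <cstdio>
#include <cstring>

// Count structural characters the way a parser's inner loop dispatches on
// them.  Each iteration branches on the value of one byte, and the branch
// target depends on the data, so the branch predictor -- not the vector
// units -- limits throughput.
size_t scan(const char* data, size_t length) {
    size_t structural = 0;
    for (size_t i = 0; i < length; ++i) {
        switch (data[i]) {
        case '{': case '}':   // open/close object
        case '[': case ']':   // open/close array
        case '"':             // open/close string
        case ':': case ',':
            ++structural;
            break;
        default:              // string contents, numbers, whitespace
            break;
        }
    }
    return structural;
}

int main() {
    const char* json = "{\"key\":[1,2,3]}";
    printf("%zu structural characters\n", scan(json, strlen(json)));
}
```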

This kind of byte-by-byte, branch-heavy code is not amenable to vectorization, especially for a structurally dense format like JSON.  It IS possible to accelerate XML parsers with SIMD: see Parabix.

Parabix uses SIMD to rapidly scan characters for the next XML closing tag.  Similar techniques can be applied to parsing JSON, but minified JSON has a higher ratio of structural characters to text than XML, so there is less plain text to skip over in bulk.
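
Here's a hedged sketch of the simplest form of that idea: one SSE2 compare tests 16 bytes at once for the next quote character.  This is my illustration, not Parabix's actual approach (Parabix builds full parallel bit streams); find_quote and the test input are invented for this example, and it assumes a GCC/Clang compiler for __builtin_ctz:

```cpp
// Hypothetical sketch of SIMD character scanning -- not Parabix's code.
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdio>
#include <cstring>

// Returns the index of the next '"' at or after `start`, or `length` if
// none.  Real code must be more careful about reading past the buffer;
// here the tail is handled with a scalar loop.
size_t find_quote(const char* data, size_t length, size_t start) {
    const __m128i quote = _mm_set1_epi8('"');
    size_t i = start;
    for (; i + 16 <= length; i += 16) {
        // Compare 16 bytes against '"' in one instruction, then collapse
        // the per-byte results into a 16-bit mask.
        __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, quote));
        if (mask) {
            return i + __builtin_ctz(mask);  // lowest set bit = first match
        }
    }
    for (; i < length; ++i) {  // scalar tail
        if (data[i] == '"') return i;
    }
    return length;
}

int main() {
    const char* json = "{\"key\": \"a longer string value\"}";
    printf("first quote at index %zu\n", find_quote(json, strlen(json), 0));
}
```

A scan like this shines when there are long runs of uninteresting bytes to skip, as in XML character data; in minified JSON, a structural character turns up every few bytes, so the vector loop rarely gets to eat a full 16-byte chunk.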

In my sajson benchmarks, my fancy new Haswell Core i7-4771 (3.9 GHz during the benchmark run) is only a few times faster than my years-old, in-order Atom server clocked at 2.1 GHz.

The point of this article is simply that some problems lend themselves to wide vector units, instruction-level parallelism, and multithreading.  Others, like JSON parsing, don't.  :)