The Parsing Problem
Computers are insanely wide these days. Not only can out-of-order cores issue and retire several instructions per clock cycle, but vector instructions get wider and wider. SSE has 128-bit registers. AVX widened that to 256 bits. AVX2 widens the vector registers further -- to 512 bits! On top of that, modern CPUs have several cores, and some of those CPUs support running a couple hardware threads per core.
Given a parallelizable problem and a bit of work, it's amazing how much computation you can achieve per clock.
However, there's a class of problems that don't benefit much, if at all, from instruction-level parallelism or vectorization. I call this "the parsing problem".
Consider a high-performance JSON parser like sajson or rapidjson. Fundamentally, a JSON parser converts a stream of bytes into a higher-level structure. The algorithm looks something like:
read a byte switch on byte value to see whether we're opening an object, closing an object, opening or closing a string, etc.
This kind of byte-by-byte, branch-heavy code is not amenable to vectorization, especially for particularly dense structure like JSON. It IS possible to accelerate XML parsers with SIMD: see Parabix.
Parabix uses SIMD to rapidly scan characters for the next XML closing tag. Similar techniques can be applied to parsing JSON, but minified JSON has a higher ratio of structural characters to text than XML.
In my sajson benchmarks, my fancy new Haswell 4771 (3.9 GHz during benchmark run) is only a few times faster than my years-old, in-order Atom server, clocked at 2.1 GHz.
The only point of this article is that some problems lend themselves to taking advantage of wide vector units, instruction-level parallelism, and multithreading. Others, like JSON parsing, don't. :)