Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)


Well, if that doesn’t put off some readers, I don’t know what will…

A technique that we recently disclosed for our JSON parser (see the paper or recent blog post) is something I think I invented (please set me straight if this is known since the glory days of HAKMEM, or whatever).

We want to find all the space between quotes – so if we have an input like:

abc xxx "foobar" zzz "a"

We can easily have a bitvector:


… corresponding to the quote positions (reading left-to-right from input bytes to bits here).

However, we really want to know that the strings “foobar” and “a” are inside the quotes and nothing else is. What we really want is this:


… where the 1 bits appear over the things that are inside our quotes – I use ‘.’ to indicate that we don’t really care what the results are for the quote characters themselves (the technique that I detail later will set the opening quote to 1 in this mask and the closing quote to 0).

So how do we find this property?

One solution is to calculate the parallel prefix over XOR of our bits. So given bits b0, b1, b2, b3 our result is b0, b0^b1, b0^b1^b2, b0^b1^b2^b3, and so on. This means that if you get a ‘parity’ result – for a bit, it’s on if there are an odd-numbered count of bits that are on to its left (LSB side) otherwise it’s off.

A nice way to calculate this is to do a carryless multiply by an all-ones (or 2-complement -1 value). This is provided in Intel’s PCLMULQDQ instruction (part of the CLMUL extension, which is pretty specific!) . This instruction is a weird one – it takes two 64-bit values (selected out of a pair of 128-bit registers by using immediates to select which 64-bit value to get within each register) and produces a 128-bit result. The low-order bits are what we need here (oddly, the high-order bits after multiplying by -1 in this case work out as the parallel prefix over XOR from the opposite direction, a property that I haven’t thought of a good use for).

Here’s the code:

really_inline uint64_t find_quote_mask_and_bits(
__m256i input_lo, __m256i input_hi, uint64_t odd_ends,
uint64_t &prev_iter_inside_quote, uint64_t &quote_bits) {
quote_bits =
cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
quote_bits = quote_bits & ~odd_ends;
uint64_t quote_mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(
_mm_set_epi64x(0ULL, quote_bits), _mm_set1_epi8(0xFF), 0));
quote_mask ^= prev_iter_inside_quote;
// right shift of a signed value expected to be well-defined and standard
// compliant as of C++20,
// John Regher from Utah U. says this is fine code
prev_iter_inside_quote =
static_cast<uint64_t>(static_cast<int64_t>(quote_mask) >> 63);
return quote_mask;

In the above code, “odd ends” comes from a previous stage and represents quotes that need to be ‘escaped’ as they were preceded by an unescaped backslash, and “cmp_mask_against_input” is just the standard SIMD comparison against a character that we do to find a ‘”” character (the usual PCMPEQB->PMOVMSKB dance).

The only real weirdness in this code, above, is that we need to factor in our previous quoting status from the previous iteration. So if we ended inside a quote, the “prev_iter_inside_quote” value is all 1s – this allows us to invert the sense of our quote mask.

This exotic instruction (PCLMULQDQ) is pretty slow in terms of latency – on a modern Skylake processor it’s still latency 6. So, in simdjson, we needed to rearrange our code a bit to hide that latency and find some other work to do. But it’s still quite fast overall as long as you don’t follow it with anything that depends on its result (the reciprocal throughput of this instruction is 1 on Skylake).

If you read the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (pdf) (and who doesn’t?), you’ll note that this instruction is going to get extended on Icelake – it will be turned into a true SIMD instruction (with 4 carryless multiplies over a 512-bit register) rather than one that performs one operation and really just uses SIMD for operand size.

It’s a little tidbit, but I liked it. If you’re working over a pretty large chunk of data, with a lot of little strings in it, your alternative is to process one quote pair at a time, which will be quote slow by comparison. And, as per the name of the blog, the technique is branch free.

Let me know in the comments if you’ve found any other fun uses for carry-less multiply or have any war stories.

4 thoughts on “Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s