Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)


Well, if that doesn’t put off some readers, I don’t know what will…

A technique that we recently disclosed for our JSON parser (see the paper or recent blog post) is something I think I invented (please set me straight if this is known since the glory days of HAKMEM, or whatever).

We want to find all the space between quotes – so if we have an input like:

abc xxx "foobar" zzz "a"

We can easily have a bitvector:

000000001000000100000101

… corresponding to the quote positions (reading left-to-right from input bytes to bits here).

However, we really want to know that the strings “foobar” and “a” are inside the quotes and nothing else is. What we really want is this:

00000000.111111.00000.1.

… where the 1 bits appear over the things that are inside our quotes – I use ‘.’ to indicate that we don’t really care what the results are for the quote characters themselves (the technique that I detail later will set the opening quote to 1 in this mask and the closing quote to 0).

So how do we find this property?

One solution is to calculate the parallel prefix over XOR of our bits. So given bits b0, b1, b2, b3 our result is b0, b0^b1, b0^b1^b2, b0^b1^b2^b3, and so on. This gives us a ‘parity’ result: a bit is on if an odd number of bits at or below it (towards the LSB) are on, and off otherwise.
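
To make that concrete, here is a plain scalar version of the prefix XOR (an illustrative sketch, not the code simdjson uses): each shift/XOR step doubles the window of bits that has been folded in, so 64 bits take six steps.

#include <cstdint>

// Illustrative scalar sketch of the parallel prefix over XOR (not the simdjson
// code): after this runs, bit i of the result is b0 ^ b1 ^ ... ^ bi.
static inline uint64_t prefix_xor_scalar(uint64_t bits) {
  bits ^= bits << 1;   // each bit now covers a window of 2 input bits
  bits ^= bits << 2;   // ... window of 4
  bits ^= bits << 4;   // ... window of 8
  bits ^= bits << 8;   // ... window of 16
  bits ^= bits << 16;  // ... window of 32
  bits ^= bits << 32;  // ... window of 64
  return bits;
}

Applied to the quote bitvector from the example (quote bits at positions 8, 15, 21 and 23, with the first input byte as bit 0), this sets bits 8–14 and 21–22: the opening quotes plus everything between each pair.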

A nice way to calculate this is to do a carryless multiply by an all-ones (two’s-complement -1) value. This is provided in Intel’s PCLMULQDQ instruction (part of the CLMUL extension, which is pretty specific!). This instruction is a weird one – it takes two 64-bit values (selected out of a pair of 128-bit registers by using immediates to choose which 64-bit half of each register to use) and produces a 128-bit result. The low-order bits are what we need here (oddly, the high-order bits after multiplying by -1 in this case work out as the parallel prefix over XOR from the opposite direction, a property that I haven’t thought of a good use for).
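
In intrinsics terms, the whole prefix XOR is one carry-less multiply plus one extract. A minimal sketch of just that step (names are mine, not simdjson’s; assumes a compiler flag such as -mpclmul):

#include <immintrin.h>
#include <cstdint>

// Minimal sketch: prefix XOR of a 64-bit word via a carry-less multiply by an
// all-ones 64-bit value. The low 64 bits of the 128-bit product are the prefix
// XOR; the high bits are the prefix taken from the opposite direction.
static inline uint64_t prefix_xor_clmul(uint64_t bits) {
  __m128i b = _mm_set_epi64x(0, static_cast<int64_t>(bits));
  __m128i all_ones = _mm_set1_epi8(static_cast<char>(0xFF));
  __m128i product = _mm_clmulepi64_si128(b, all_ones, 0);
  return static_cast<uint64_t>(_mm_cvtsi128_si64(product));
}

This should return the same value as the scalar version above for any input.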

Here’s the code:


really_inline uint64_t find_quote_mask_and_bits(
    __m256i input_lo, __m256i input_hi, uint64_t odd_ends,
    uint64_t &prev_iter_inside_quote, uint64_t &quote_bits) {
  quote_bits =
      cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
  quote_bits = quote_bits & ~odd_ends;
  uint64_t quote_mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(
      _mm_set_epi64x(0ULL, quote_bits), _mm_set1_epi8(0xFF), 0));
  quote_mask ^= prev_iter_inside_quote;
  // right shift of a signed value is expected to be well-defined and
  // standard-compliant as of C++20;
  // John Regehr from Utah U. says this is fine code
  prev_iter_inside_quote =
      static_cast<uint64_t>(static_cast<int64_t>(quote_mask) >> 63);
  return quote_mask;
}

In the above code, “odd_ends” comes from a previous stage and represents quotes that are escaped (they were preceded by an unescaped backslash) and so must not be treated as string delimiters, and “cmp_mask_against_input” is just the standard SIMD comparison against a character that we do to find a ‘"’ character (the usual PCMPEQB->PMOVMSKB dance).
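
For completeness, here is a sketch of what such a helper might look like (the real simdjson version may differ in detail): two 32-byte compares whose per-byte results are packed into a single 64-bit mask.

static inline uint64_t cmp_mask_against_input(__m256i input_lo,
                                              __m256i input_hi, __m256i mask) {
  // Compare each of the 64 input bytes against the target character and pack
  // the per-byte results into one 64-bit mask (PCMPEQB then PMOVMSKB).
  __m256i cmp_lo = _mm256_cmpeq_epi8(input_lo, mask);
  uint64_t res_lo = static_cast<uint32_t>(_mm256_movemask_epi8(cmp_lo));
  __m256i cmp_hi = _mm256_cmpeq_epi8(input_hi, mask);
  uint64_t res_hi = static_cast<uint32_t>(_mm256_movemask_epi8(cmp_hi));
  return res_lo | (res_hi << 32);
}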

The only real weirdness in the code above is that we need to factor in our quoting status from the previous iteration. If we ended that iteration inside a quote, the “prev_iter_inside_quote” value is all 1s, which allows us to invert the sense of our quote mask for the current block.
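
As a usage illustration (a hypothetical driver, not the actual simdjson loop; the buffer handling and the source of odd_ends are glossed over), the state simply threads from one 64-byte block to the next:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the surrounding loop: process 64 bytes per iteration
// and carry prev_iter_inside_quote across blocks. Assumes len is a multiple of
// 64 and that odd_ends comes from the earlier backslash stage (not shown).
void scan_quotes(const uint8_t *buf, size_t len) {
  uint64_t prev_iter_inside_quote = 0ULL;  // all 1s if the last block ended mid-string
  for (size_t i = 0; i < len; i += 64) {
    __m256i input_lo =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(buf + i));
    __m256i input_hi =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(buf + i + 32));
    uint64_t odd_ends = 0;  // placeholder: escaped positions from the backslash stage
    uint64_t quote_bits;
    uint64_t quote_mask = find_quote_mask_and_bits(
        input_lo, input_hi, odd_ends, prev_iter_inside_quote, quote_bits);
    // quote_mask now has 1s over the bytes that sit inside quoted strings.
    (void)quote_mask;
  }
}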

This exotic instruction (PCLMULQDQ) is pretty slow in terms of latency – on a modern Skylake processor it’s still latency 6. So, in simdjson, we needed to rearrange our code a bit to hide that latency and find some other work to do. But it’s still quite fast overall as long as you don’t follow it with anything that depends on its result (the reciprocal throughput of this instruction is 1 on Skylake).

If you read the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (pdf) (and who doesn’t?), you’ll note that this instruction is going to get extended on Ice Lake – it will be turned into a true SIMD instruction (with 4 carryless multiplies over a 512-bit register) rather than one that performs a single operation and really just uses SIMD for operand size.
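
For what it’s worth, here is a sketch of what that buys you with the VPCLMULQDQ intrinsics (illustrative only; I haven’t tuned or benchmarked it, and it needs AVX-512 plus the VPCLMULQDQ extension): four independent 64-bit prefix XORs in one instruction, one per 128-bit lane. Stitching them into a single longer prefix would still require carrying the top bit of each lane into the next.

#include <immintrin.h>
#include <cstdint>

// Illustrative sketch: run four independent 64-bit prefix XORs at once using
// the 512-bit form of the carry-less multiply (one multiply per 128-bit lane).
static inline void prefix_xor_x4(const uint64_t in[4], uint64_t out[4]) {
  // Put each input word in the low qword of its own 128-bit lane.
  __m512i a = _mm512_set_epi64(0, (int64_t)in[3], 0, (int64_t)in[2],
                               0, (int64_t)in[1], 0, (int64_t)in[0]);
  __m512i all_ones = _mm512_set1_epi8(static_cast<char>(0xFF));
  __m512i product = _mm512_clmulepi64_epi128(a, all_ones, 0);
  alignas(64) uint64_t tmp[8];
  _mm512_store_si512(reinterpret_cast<__m512i *>(tmp), product);
  out[0] = tmp[0];  // low qword of lane 0 = prefix XOR of in[0]
  out[1] = tmp[2];
  out[2] = tmp[4];
  out[3] = tmp[6];
}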

It’s a little tidbit, but I liked it. If you’re working over a pretty large chunk of data with a lot of little strings in it, your alternative is to process one quote pair at a time, which will be quite slow by comparison. And, as per the name of the blog, the technique is branch free.

Let me know in the comments if you’ve found any other fun uses for carry-less multiply or have any war stories.

9 thoughts on “Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)”

  1. The xor scan ≠\ was well known in APL, but I don’t think the clmul method was. I added it to Dyalog APL 18.0 (see https://aplwiki.com/wiki/Dyalog_APL#Instruction_set_usage for example); previously Dyalog used Bob Bernecky’s “not Boolean” code, which he mentions in his talk “A Compendium of SIMD Boolean Array Algorithms in APL”. But it used successive shifts, the same as Hacker’s Delight section 5-2, “Parity”. I think I learned about the clmul method from some source related to simdjson, but I’d done it before the “Expanding Bits” article (https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/) so it couldn’t have been this post.


  2. Thanks for that detailed background. I emailed Daniel Lemire in March of 2018 telling him that I’d found that trick and cited https://bitmath.blogspot.com/2016/11/parallel-prefixsuffix-operations.html which might well be where you found it. I doubt that we would have had anything public in time for you to see it.

    To be clear, I only claim to have “thought of the application of this trick for quote-balancing”, not to have invented either XOR-prefix-sum (which would be preposterous) or even clmul by -1.

    That’s an interesting page on Dyalog. Have you looked into GFNI for boolean matrix transpose?


    1. I did end up doing some work on AVX2 bit transpose, but put it aside when I found out it takes a pretty large matrix (side length >100) for it to be consistently better than converting to bytes and back. Unaligned rows are not nice. I think GFNI is useful for large enough matrices but not a huge deal.
      Regardless of how fast you can do it, transposing 8×8 sub-matrices isn’t great because then you’re loading and storing individual bytes. I went with 32×32 for AVX2 and I think 64×64 would make sense for AVX-512. The way to think about such a transpose is to consider the kernel to be a 2x2x2x…2 array, where you want to exchange the (maybe) 6 axes on the left with the 6 on the right. It’s fine for those sets of axes to get jumbled up in the process, because that can be corrected by rearranging the loads or stores, which is free or close to it.
      With a 64×64 AVX-512 kernel you have 3 sub-byte axes, 4 axes between bytes and lanes, 2 axes between lanes and registers, and 3 axes for the 8 registers. The GFNI instruction lets you swap the lowest-order 3 axes with the 3 above them, so it moves a lot of axes, but it’s all staying on one side instead of moving axes between the sides. A sequence I came up with is 012345abcdef to 012abc345def with big shuffles, to abc012345def with GFNI, to abcdef045123 with three rounds of unpack instructions. I think that comes out better than anything I had not using GFNI, but overall it’s doing a fairly small portion of the work, especially when you account for the loads and stores.


  3. Ah, the quote-matching trick’s pretty well known in APL since xor-scan is such an important primitive. Searching “≠\ quote” on Github even turns up this for JSON by 2013 (MiServer is Brian Becker’s, likely it’s his code): https://github.com/okdistribute/mags/blob/a2ce218/websrv/Utils/JSON.dyalog#L620. There’s also an instance in Jd (J database) that looks like it comes from Chris Burke’s JDB, so that’s probably how I learned it, 2012 or 2013.

    I haven’t tried a GFNI transpose, and apparently missed it in your Ice Lake article! My current array language BQN has nice AVX2 kernel transposes for integer types, but any sort of boolean transpose is still to-do, so I suppose I’ll be looking at it soon enough. Incidentally, the reason I checked this page is for writing some notes on reductions and scans in BQN, https://mlochbaum.github.io/BQN/implementation/primitive/fold.html#booleans. There’s a page on transpose too, but it doesn’t say much about the boolean case yet.


    1. I have to laugh then. I certainly thought I was inventing that, but, like a long list of things that I thought I invented, there’s someone else already parked in that spot. Apparently I need to broaden my reading list to include more APL.

