Bits to indexes in BMI2 and AVX-512

[ Please bear with the terrible formatting of the table in this post; I was pretty startled at how limited my options were from a vanilla formatter. Opinions on a better method are welcome. ]

Daniel Lemire, in his post Iterating over set bits quickly (SIMD edition) discusses several techniques to iterate over set bits quickly – or more precisely, to turn a collection of bits into a variable-length buffer full of integers indicating which bits were set.

So, if your code gets given an array with the following 16-bit bitfields (assuming little-endian order):

0x1001, 0x0003, 0xffff

you would want the answer:

indexes = 0, 12, 16, 17, 32, 33, 34, ... , 46, 47

This is an important operation. While it’s a lot of fun to do clever things with SIMD, sooner or later you may need to do something specific with the bits you found in your SIMD registers. For example, we used a number of SIMD techniques in Hyperscan to search for strings, but eventually you would have to report that you’d found something to the rest of the system.

After reading Daniel’s post, and more importantly, taking some time to hack on an AVX-512 system that he generously shared access with me, I think I have invented a new, branch-free way of solving this problem for 64-bit integers. There is the small catch that you will have to have an AVX-512 capable system handy.

(I say I think I invented this as it’s quite possible that (a) I’ve absorbed this technique from somewhere and forgot, or (b) someone else has already independently invented this)

Here’s the technique.

Let’s rig up a bunch of masks with alternating blocks of one and zero bits:

uint64_t msk_1 = 0xffffffff00000000ULL;
uint64_t msk_2 = 0xffff0000ffff0000ULL;
uint64_t msk_3 = 0xff00ff00ff00ff00ULL;
uint64_t msk_4 = 0xf0f0f0f0f0f0f0f0ULL;
uint64_t msk_5 = 0xccccccccccccccccULL;
uint64_t msk_6 = 0xaaaaaaaaaaaaaaaaULL;

Now, suppose I have a bitvector in v that I’d like to turn into a bunch of indexes. I can get a start by doing this:

uint64_t v1 = _pext_u64(msk_1, v);
uint64_t v2 = _pext_u64(msk_2, v);
uint64_t v3 = _pext_u64(msk_3, v);
uint64_t v4 = _pext_u64(msk_4, v);
uint64_t v5 = _pext_u64(msk_5, v);
uint64_t v6 = _pext_u64(msk_6, v);

What did this achieve? Well, suppose I have the 11th bit set in v and nothing else. Looking into my masks, I can see that my PEXT operation (a fast bitwise extract) got me a 1-bit from msk_6, a 1-bit from msk_5, a 0-bit from msk_4, a 1-bit from msk_3 and 0-bits otherwise. These bits will all be deposited into the least significant bits of the v1 through 6 temporaries.

In other works, for each set bit, I’m extracting the bit pattern of its index from the masks and depositing that bit pattern at the lowest-significant bytes on my v1 through v6 temporary values.

So, in the unlikely event that you were hoping to get the right answers, annoyingly bit-smeared across 6 different uint64_t variables, you’re done. But that’s probably not very satisfying. We’ll get to that.

So how do we interleave these 6 values together? This looks pretty ugly – we’re looking at 384 total bits in the worst case of answers. So this doesn’t seem like something we can do fast in the General Purpose Registers. Let’s go to SIMD.

The principle we will apply is that we will use AVX-512’s facility to use 64-bit mask to control a SIMD computation. We will take our 6 values and use them to control the progressive adding of 32, 16, 8, 4, 2 and 1 into a result.

__m512i vec;
vec = _mm512_maskz_add_epi8(v1, v32_bit, _mm512_set1_epi8(0));
vec = _mm512_mask_add_epi8(vec, v2, v16_bit, vec);
vec = _mm512_mask_add_epi8(vec, v3, v8_bit, vec);
vec = _mm512_mask_add_epi8(vec, v4, v4_bit, vec);
vec = _mm512_mask_add_epi8(vec, v5, v2_bit, vec);
vec = _mm512_mask_add_epi8(vec, v6, v1_bit, vec);

Now vec holds the answer we wanted, if we just wanted a bunch of bytes on our output, ranging from 0..63. Unfortunately, we need to write some not very interesting code if we’re doing this over a large range, where we imagine that our offsets might be much larger than a byte. If we’re working continuously over inputs >64K, we would expect to need 4 byte answers. In order to write out up to 64 uint32_t offsets, we’re going to have to spread out our results over 4 registers (spreading the bytes over u32 units), add in a value ‘k’ representing the base offset of our 64-bit value to begin with, and write all 4 of these big registers out.

__m512i base = _mm512_set1_epi32(k*64);
__m512i r1 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,0));
__m512i r2 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,1));
__m512i r3 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,2));
__m512i r4 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,3));

r1 = _mm512_add_epi32(r1, base);
r2 = _mm512_add_epi32(r2, base);
r3 = _mm512_add_epi32(r3, base);
r4 = _mm512_add_epi32(r4, base);
_mm512_storeu_si512((__m512i *)out, r1);
_mm512_storeu_si512((__m512i *)(out + 16), r2);
_mm512_storeu_si512((__m512i *)(out + 32), r3);
_mm512_storeu_si512((__m512i *)(out + 48), r4);

(note that ‘out’ is a uint32_t so we are actually getting +64, +128, +192 bytes with those last three offsets).

Alert readers will note that this code is writing a lot of stuff out. What happens if we only had 1 bit set? Or 0? Well, this blog isn’t called “Branch Free” for nothing.

More seriously, the point is that it’s usually cheaper to do the same thing every time rather than run the risk of a branch mispredict. Looking back at the code above – sure, it looks like a giant bolus of code. But a branch miss on a modern architecture is around 14 cycles. That’s a lot of missed opportunity to do work.

Even if you accept my above philosophy of doing tons of potentially redundant work over risking a branch miss, there’s one more question – we need to know where our next write should be:

uint8_t advance = __builtin_popcountll(v);
out += advance

That just moves us up (remember ‘out’ is a uint32_t for pointer math purposes) to the last value that actually had something set. And we’re done!

Is it fast?

Here’s a rough spreadsheet of the results measured against several of the other methods described in Daniel’s article. It’s faster than most of the other methods, falling down only for very low ‘bitmap densities’. For these lower densities, taking a conditional branch with the prospect that the expected number of bits set in a word is very low is a winning proposition.

Bitmap density Method Cycles per index
0.03 bitmap_decode_ctz 3.852
bitmap_decode_avx2 10.116
bitmap_decode_avx2_turbo 14.363
bitmap_decode_avx2_turbo_thin 15.736
bitmap_decode_avx2_turbo_nopopcnt 12.624
bitmap_decode_bmi2_avx512 12.9
0.12 bitmap_decode_ctz 4.97
bitmap_decode_avx2 3.003
bitmap_decode_avx2_turbo 4.205
bitmap_decode_avx2_turbo_thin 4.547
bitmap_decode_avx2_turbo_nopopcnt 3.732
bitmap_decode_bmi2_avx512 2.481
0.25 bitmap_decode_ctz 4.251
bitmap_decode_avx2 1.52
bitmap_decode_avx2_turbo 2.09
bitmap_decode_avx2_turbo_thin 2.265
bitmap_decode_avx2_turbo_nopopcnt 1.861
bitmap_decode_bmi2_avx512 1.25
0.5 bitmap_decode_ctz 3.446
bitmap_decode_avx2 0.796
bitmap_decode_avx2_turbo 1.042
bitmap_decode_avx2_turbo_thin 1.131
bitmap_decode_avx2_turbo_nopopcnt 0.92
bitmap_decode_bmi2_avx512 0.616
0.9 bitmap_decode_ctz 3.037
bitmap_decode_avx2 0.444
bitmap_decode_avx2_turbo 0.574
bitmap_decode_avx2_turbo_thin 0.628
bitmap_decode_avx2_turbo_nopopcnt 0.509
bitmap_decode_bmi2_avx512 0.366

Is this a great idea? I don’t know.

There are no doubt other methods to use AVX512 to transform bit vectors in this fashion, and for a relatively low ‘population’ there are a number of ways the bitmap_decode_ctz code can be made to run faster (possibly the topic of another article).

I still think it’s an interesting ‘trick’ and it’s nice to take my second-favorite instruction (PEXT) out for a spin.

Let me know if you’ve seen this trick somewhere before and I’ll be happy to credit where credit is due. As I said, I think I invented it…

The code is available at Daniel Lemire’s Github with an error (my fault, apparently I thought 8+2 = 9) which will be corrected in due course.

ps. In the ‘dynamiting the trout stream’ category, I give you VPCOMPRESSB from Intel® Architecture Instruction Set Extensions Programming Reference (PDF) which will greatly simplify all the above trickery, once we have AVX512_VBMI2 capable machines (Ice Lake time-frame).

pps. There is also a branch-free means where VPCOMPRESSD can be used four times on 16-bit words to solve a similar problem on machines that are publicly available now. This can be left as an exercise for the reader. It might be faster than the BMI2 stuff, but it lacks style points.

10 thoughts on “Bits to indexes in BMI2 and AVX-512”

  1. The technique reminds me of “Reversing bits and bytes” from Hacker’s Delight 2nd edition, section 7.1 — at least in part, for the population count.

    Like

  2. Some updates:

    http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html has a technique by Zach Wegner, independently invented, that is very similar to the first part of the algorithm.

    Zach has some interesting links that prefigure this kind of trickery from the chess programming world, for those who want to delve into some history of this.

    https://www.chessprogramming.org/BitScan#Divide_and_Conquer
    http://talkchess.com/forum3/viewtopic.php?t=29333
    https://www.chessprogramming.org/Quad-Bitboards
    http://www.talkchess.com/forum3/viewtopic.php?t=40333

    Like

  3. I learned about the clever PEXT bit slicing part from hackernews:

    https://news.ycombinator.com/item?id=18836608

    (long after this article was written though).

    I had done something similar in the past with PDEP but prior to AVX-512 and without bit slicing: using an expanded bitmap (eg where each bit is duplicated 4 or 8 times), to do a single PEXT against 0,1,2,3,… but it’s worse than the bit slicing approach.

    So it seems AVX-512 has finally given us a good “reverse movmaskb” with the k regs and it’s flexible too.

    Like

  4. One of the problems with the code from an algorithmic perspective is accumulation of the pop count within the loop. This is merely the length of the final array and can be calculated at the end.

    Like

    1. I don’t understand this comment. The pop count is necessary to know how many actually valid values were written out and to ensure that the next loop iteration writes its output to the right spot. It’s quite cheap: at latency 3 it can easily be fit within the more expensive operations in the loop.

      Like

Leave a comment