Bits to indexes in BMI2 and AVX-512

[ Please bear with the terrible formatting of the table in this post; I was pretty startled at how limited my options were from a vanilla formatter. Opinions on a better method are welcome. ]

Daniel Lemire, in his post Iterating over set bits quickly (SIMD edition) discusses several techniques to iterate over set bits quickly – or more precisely, to turn a collection of bits into a variable-length buffer full of integers indicating which bits were set.

So, if your code gets given an array with the following 16-bit bitfields (assuming little-endian order):

0x1001, 0x0003, 0xffff

you would want the answer:

indexes = 0, 12, 16, 17, 32, 33, 34, ... , 46, 47

This is an important operation. While it’s a lot of fun to do clever things with SIMD, sooner or later you may need to do something specific with the bits you found in your SIMD registers. For example, we used a number of SIMD techniques in Hyperscan to search for strings, but eventually you would have to report that you’d found something to the rest of the system.

After reading Daniel’s post, and more importantly, taking some time to hack on an AVX-512 system that he generously shared access with me, I think I have invented a new, branch-free way of solving this problem for 64-bit integers. There is the small catch that you will have to have an AVX-512 capable system handy.

(I say I think I invented this as it’s quite possible that (a) I’ve absorbed this technique from somewhere and forgot, or (b) someone else has already independently invented this)

Here’s the technique.

Let’s rig up a bunch of masks with alternating blocks of one and zero bits:

uint64_t msk_1 = 0xffffffff00000000ULL;
uint64_t msk_2 = 0xffff0000ffff0000ULL;
uint64_t msk_3 = 0xff00ff00ff00ff00ULL;
uint64_t msk_4 = 0xf0f0f0f0f0f0f0f0ULL;
uint64_t msk_5 = 0xccccccccccccccccULL;
uint64_t msk_6 = 0xaaaaaaaaaaaaaaaaULL;

Now, suppose I have a bitvector in v that I’d like to turn into a bunch of indexes. I can get a start by doing this:

uint64_t v1 = _pext_u64(msk_1, v);
uint64_t v2 = _pext_u64(msk_2, v);
uint64_t v3 = _pext_u64(msk_3, v);
uint64_t v4 = _pext_u64(msk_4, v);
uint64_t v5 = _pext_u64(msk_5, v);
uint64_t v6 = _pext_u64(msk_6, v);

What did this achieve? Well, suppose I have the 11th bit set in v and nothing else. Looking into my masks, I can see that my PEXT operation (a fast bitwise extract) got me a 1-bit from msk_6, a 1-bit from msk_5, a 0-bit from msk_4, a 1-bit from msk_3 and 0-bits otherwise. These bits will all be deposited into the least significant bits of the v1 through 6 temporaries.

In other works, for each set bit, I’m extracting the bit pattern of its index from the masks and depositing that bit pattern at the lowest-significant bytes on my v1 through v6 temporary values.

So, in the unlikely event that you were hoping to get the right answers, annoyingly bit-smeared across 6 different uint64_t variables, you’re done. But that’s probably not very satisfying. We’ll get to that.

So how do we interleave these 6 values together? This looks pretty ugly – we’re looking at 384 total bits in the worst case of answers. So this doesn’t seem like something we can do fast in the General Purpose Registers. Let’s go to SIMD.

The principle we will apply is that we will use AVX-512’s facility to use 64-bit mask to control a SIMD computation. We will take our 6 values and use them to control the progressive adding of 32, 16, 8, 4, 2 and 1 into a result.

__m512i vec;
vec = _mm512_maskz_add_epi8(v1, v32_bit, _mm512_set1_epi8(0));
vec = _mm512_mask_add_epi8(vec, v2, v16_bit, vec);
vec = _mm512_mask_add_epi8(vec, v3, v8_bit, vec);
vec = _mm512_mask_add_epi8(vec, v4, v4_bit, vec);
vec = _mm512_mask_add_epi8(vec, v5, v2_bit, vec);
vec = _mm512_mask_add_epi8(vec, v6, v1_bit, vec);

Now vec holds the answer we wanted, if we just wanted a bunch of bytes on our output, ranging from 0..63. Unfortunately, we need to write some not very interesting code if we’re doing this over a large range, where we imagine that our offsets might be much larger than a byte. If we’re working continuously over inputs >64K, we would expect to need 4 byte answers. In order to write out up to 64 uint32_t offsets, we’re going to have to spread out our results over 4 registers (spreading the bytes over u32 units), add in a value ‘k’ representing the base offset of our 64-bit value to begin with, and write all 4 of these big registers out.

__m512i base = _mm512_set1_epi32(k*64);
__m512i r1 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,0));
__m512i r2 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,1));
__m512i r3 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,2));
__m512i r4 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,3));

r1 = _mm512_add_epi32(r1, base);
r2 = _mm512_add_epi32(r2, base);
r3 = _mm512_add_epi32(r3, base);
r4 = _mm512_add_epi32(r4, base);
_mm512_storeu_si512((__m512i *)out, r1);
_mm512_storeu_si512((__m512i *)(out + 16), r2);
_mm512_storeu_si512((__m512i *)(out + 32), r3);
_mm512_storeu_si512((__m512i *)(out + 48), r4);

(note that ‘out’ is a uint32_t so we are actually getting +64, +128, +192 bytes with those last three offsets).

Alert readers will note that this code is writing a lot of stuff out. What happens if we only had 1 bit set? Or 0? Well, this blog isn’t called “Branch Free” for nothing.

More seriously, the point is that it’s usually cheaper to do the same thing every time rather than run the risk of a branch mispredict. Looking back at the code above – sure, it looks like a giant bolus of code. But a branch miss on a modern architecture is around 14 cycles. That’s a lot of missed opportunity to do work.

Even if you accept my above philosophy of doing tons of potentially redundant work over risking a branch miss, there’s one more question – we need to know where our next write should be:

uint8_t advance = __builtin_popcountll(v);
out += advance

That just moves us up (remember ‘out’ is a uint32_t for pointer math purposes) to the last value that actually had something set. And we’re done!

Is it fast?

Here’s a rough spreadsheet of the results measured against several of the other methods described in Daniel’s article. It’s faster than most of the other methods, falling down only for very low ‘bitmap densities’. For these lower densities, taking a conditional branch with the prospect that the expected number of bits set in a word is very low is a winning proposition.

Bitmap density Method Cycles per index
0.03 bitmap_decode_ctz 3.852
bitmap_decode_avx2 10.116
bitmap_decode_avx2_turbo 14.363
bitmap_decode_avx2_turbo_thin 15.736
bitmap_decode_avx2_turbo_nopopcnt 12.624
bitmap_decode_bmi2_avx512 12.9
0.12 bitmap_decode_ctz 4.97
bitmap_decode_avx2 3.003
bitmap_decode_avx2_turbo 4.205
bitmap_decode_avx2_turbo_thin 4.547
bitmap_decode_avx2_turbo_nopopcnt 3.732
bitmap_decode_bmi2_avx512 2.481
0.25 bitmap_decode_ctz 4.251
bitmap_decode_avx2 1.52
bitmap_decode_avx2_turbo 2.09
bitmap_decode_avx2_turbo_thin 2.265
bitmap_decode_avx2_turbo_nopopcnt 1.861
bitmap_decode_bmi2_avx512 1.25
0.5 bitmap_decode_ctz 3.446
bitmap_decode_avx2 0.796
bitmap_decode_avx2_turbo 1.042
bitmap_decode_avx2_turbo_thin 1.131
bitmap_decode_avx2_turbo_nopopcnt 0.92
bitmap_decode_bmi2_avx512 0.616
0.9 bitmap_decode_ctz 3.037
bitmap_decode_avx2 0.444
bitmap_decode_avx2_turbo 0.574
bitmap_decode_avx2_turbo_thin 0.628
bitmap_decode_avx2_turbo_nopopcnt 0.509
bitmap_decode_bmi2_avx512 0.366

Is this a great idea? I don’t know.

There are no doubt other methods to use AVX512 to transform bit vectors in this fashion, and for a relatively low ‘population’ there are a number of ways the bitmap_decode_ctz code can be made to run faster (possibly the topic of another article).

I still think it’s an interesting ‘trick’ and it’s nice to take my second-favorite instruction (PEXT) out for a spin.

Let me know if you’ve seen this trick somewhere before and I’ll be happy to credit where credit is due. As I said, I think I invented it…

The code is available at Daniel Lemire’s Github with an error (my fault, apparently I thought 8+2 = 9) which will be corrected in due course.

ps. In the ‘dynamiting the trout stream’ category, I give you VPCOMPRESSB from Intel® Architecture Instruction Set Extensions Programming Reference (PDF) which will greatly simplify all the above trickery, once we have AVX512_VBMI2 capable machines (Ice Lake time-frame).

pps. There is also a branch-free means where VPCOMPRESSD can be used four times on 16-bit words to solve a similar problem on machines that are publicly available now. This can be left as an exercise for the reader. It might be faster than the BMI2 stuff, but it lacks style points.

Introduction and welcome

Hello, world.

This is my blog where I will talk about things that interest me (and a no doubt small collection of others). Topics that I’m interested in include:

  • Low-level and performance-oriented programming
  • Computer architecture (especially as it related to performance-oriented code)
  • Programming languages
  • Regular expression implementation and automata theory
  • … and of course, implementing things without branches! Thus the name.

I was the designer of the Hyperscan project. I built this system at Sensory Networks, which was acquired by Intel Corporation in 2013, and worked on Hyperscan at Intel for over 4 years.

I hope that I can show you some interesting things. I have a few things in the pipeline that I will show shortly, including some string matching work, fast Random Forest implementation and a lot of my favorite low-level coding tips and tricks.

I request that my readers can bear with me and forgive the (hopefully temporary) amateurish nature of the site; I am not an expert blogger or user of WordPress.