[ Please bear with the terrible formatting of the table in this post; I was pretty startled at how limited my options were from a vanilla formatter. Opinions on a better method are welcome. ]
Daniel Lemire, in his post Iterating over set bits quickly (SIMD edition) discusses several techniques to iterate over set bits quickly – or more precisely, to turn a collection of bits into a variable-length buffer full of integers indicating which bits were set.
So, if your code gets given an array with the following 16-bit bitfields (assuming little-endian order):
0x1001, 0x0003, 0xffff
you would want the answer:
indexes = 0, 12, 16, 17, 32, 33, 34, ... , 46, 47
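For reference, the simple scalar way to produce that answer is to peel off one set bit at a time with a count-trailing-zeros loop. This is my own sketch of that baseline (roughly the shape of the bitmap_decode_ctz method that shows up in the measurements below, not Daniel’s exact code):

#include <stdint.h>
#include <stdio.h>

// Minimal scalar reference: emit the index of every set bit across an
// array of 16-bit words. Returns the number of indexes written.
static size_t decode_ctz16(const uint16_t *bits, size_t n, uint32_t *out) {
    size_t count = 0;
    for (size_t k = 0; k < n; k++) {
        uint16_t w = bits[k];
        while (w) {
            out[count++] = (uint32_t)(k * 16 + __builtin_ctz(w));
            w &= (uint16_t)(w - 1);  // clear the lowest set bit
        }
    }
    return count;
}

int main(void) {
    uint16_t in[3] = {0x1001, 0x0003, 0xffff};
    uint32_t out[48];
    size_t n = decode_ctz16(in, 3, out);
    for (size_t i = 0; i < n; i++)
        printf("%u ", out[i]);  // 0 12 16 17 32 33 ... 47
    printf("\n");
    return 0;
}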
This is an important operation. While it’s a lot of fun to do clever things with SIMD, sooner or later you may need to do something specific with the bits you found in your SIMD registers. For example, we used a number of SIMD techniques in Hyperscan to search for strings, but eventually you would have to report that you’d found something to the rest of the system.
After reading Daniel’s post and, more importantly, taking some time to hack on an AVX-512 system to which he generously gave me access, I think I have invented a new, branch-free way of solving this problem for 64-bit integers. There is the small catch that you will need to have an AVX-512 capable system handy.
(I say I think I invented this as it’s quite possible that (a) I’ve absorbed this technique from somewhere and forgot, or (b) someone else has already independently invented this)
Here’s the technique.
Let’s rig up a bunch of masks with alternating blocks of one and zero bits:
uint64_t msk_1 = 0xffffffff00000000ULL;
uint64_t msk_2 = 0xffff0000ffff0000ULL;
uint64_t msk_3 = 0xff00ff00ff00ff00ULL;
uint64_t msk_4 = 0xf0f0f0f0f0f0f0f0ULL;
uint64_t msk_5 = 0xccccccccccccccccULL;
uint64_t msk_6 = 0xaaaaaaaaaaaaaaaaULL;
Now, suppose I have a bitvector in v that I’d like to turn into a bunch of indexes. I can get a start by doing this:
uint64_t v1 = _pext_u64(msk_1, v);
uint64_t v2 = _pext_u64(msk_2, v);
uint64_t v3 = _pext_u64(msk_3, v);
uint64_t v4 = _pext_u64(msk_4, v);
uint64_t v5 = _pext_u64(msk_5, v);
uint64_t v6 = _pext_u64(msk_6, v);
What did this achieve? Well, suppose I have bit 11 set in v and nothing else. Looking into my masks, I can see that my PEXT operation (a fast bitwise extract) got me a 1-bit from msk_6, a 1-bit from msk_5, a 0-bit from msk_4, a 1-bit from msk_3, and 0-bits otherwise. These bits will all be deposited into the least significant bits of the v1 through v6 temporaries.
In other words, for each set bit, I’m extracting the bit pattern of its index from the masks and depositing that bit pattern, one bit per mask, into the least significant bits of my v1 through v6 temporary values.
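To make that concrete, here is a throwaway check of the bit-11 case (my own sketch, not part of the real code; compile with -mbmi2). It glues the six extracted bits back together, most significant first, and recovers the index:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t v = 1ULL << 11;  // only bit 11 set
    // Each PEXT pulls bit 11 of the corresponding mask down to bit 0.
    unsigned idx = (unsigned)((_pext_u64(0xffffffff00000000ULL, v) & 1) << 5   // 0
                            | (_pext_u64(0xffff0000ffff0000ULL, v) & 1) << 4   // 0
                            | (_pext_u64(0xff00ff00ff00ff00ULL, v) & 1) << 3   // 1
                            | (_pext_u64(0xf0f0f0f0f0f0f0f0ULL, v) & 1) << 2   // 0
                            | (_pext_u64(0xccccccccccccccccULL, v) & 1) << 1   // 1
                            | (_pext_u64(0xaaaaaaaaaaaaaaaaULL, v) & 1));      // 1
    printf("%u\n", idx);  // prints 11
    return 0;
}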
So, in the unlikely event that you were hoping to get the right answers, annoyingly bit-smeared across 6 different uint64_t variables, you’re done. But that’s probably not very satisfying. We’ll get to that.
So how do we interleave these 6 values? This looks pretty ugly – we’re looking at up to 384 bits of answers in the worst case, so this doesn’t seem like something we can do quickly in the General Purpose Registers. Let’s go to SIMD.
The principle we will apply is to use AVX-512’s ability to have a 64-bit mask control a SIMD computation across 64 byte lanes. We will take our 6 values and use them as masks to control the progressive adding of 32, 16, 8, 4, 2 and 1 into a result.
__m512i vec;
vec = _mm512_maskz_add_epi8(v1, v32_bit, _mm512_set1_epi8(0));
vec = _mm512_mask_add_epi8(vec, v2, v16_bit, vec);
vec = _mm512_mask_add_epi8(vec, v3, v8_bit, vec);
vec = _mm512_mask_add_epi8(vec, v4, v4_bit, vec);
vec = _mm512_mask_add_epi8(vec, v5, v2_bit, vec);
vec = _mm512_mask_add_epi8(vec, v6, v1_bit, vec);
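(The v32_bit through v1_bit values used above are just byte-broadcast constants of 32 down to 1. I haven’t shown their definitions, but they would look something like this:)

const __m512i v32_bit = _mm512_set1_epi8(32);
const __m512i v16_bit = _mm512_set1_epi8(16);
const __m512i v8_bit  = _mm512_set1_epi8(8);
const __m512i v4_bit  = _mm512_set1_epi8(4);
const __m512i v2_bit  = _mm512_set1_epi8(2);
const __m512i v1_bit  = _mm512_set1_epi8(1);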
Now vec holds the answer we wanted, if we just wanted a bunch of bytes on our output, ranging from 0..63. Unfortunately, we need to write some not very interesting code if we’re doing this over a large range, where our offsets might not fit in a byte. If we’re working continuously over inputs bigger than 64K bits, we would expect to need 4-byte answers. In order to write out up to 64 uint32_t offsets, we’re going to have to spread our results over 4 registers (widening the bytes to u32 units), add in a base offset of k*64 (where k is the index of the 64-bit word we started from), and write all 4 of these big registers out.
__m512i base = _mm512_set1_epi32(k*64);
__m512i r1 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 0));
__m512i r2 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 1));
__m512i r3 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 2));
__m512i r4 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 3));
r1 = _mm512_add_epi32(r1, base);
r2 = _mm512_add_epi32(r2, base);
r3 = _mm512_add_epi32(r3, base);
r4 = _mm512_add_epi32(r4, base);
_mm512_storeu_si512((__m512i *)out, r1);
_mm512_storeu_si512((__m512i *)(out + 16), r2);
_mm512_storeu_si512((__m512i *)(out + 32), r3);
_mm512_storeu_si512((__m512i *)(out + 48), r4);
(note that ‘out’ is a uint32_t pointer, so those last three offsets actually land +64, +128 and +192 bytes into the output).
Alert readers will note that this code is writing a lot of stuff out. What happens if we only had 1 bit set? Or 0? Well, this blog isn’t called “Branch Free” for nothing.
More seriously, the point is that it’s usually cheaper to do the same thing every time rather than run the risk of a branch mispredict. Looking back at the code above – sure, it looks like a giant bolus of code. But a branch miss on a modern architecture is around 14 cycles. That’s a lot of missed opportunity to do work.
Even if you accept my above philosophy of doing tons of potentially redundant work over risking a branch miss, there’s one more question – we need to know where our next write should be:
uint8_t advance = __builtin_popcountll(v);
out += advance;
That just advances us (remember, ‘out’ is a uint32_t pointer, so the pointer math is in units of 4 bytes) past the values that were actually valid, so the next write lands in the right place. And we’re done!
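For convenience, here is the whole per-word step gathered into one place. This is my own consolidated sketch of what’s described above, with the broadcast constants written inline; the exact code in Daniel’s repo may differ in detail:

#include <immintrin.h>
#include <stdint.h>

// One 64-bit word 'v' at word index 'k'; writes up to 64 uint32_t indexes
// and returns 'out' advanced past the popcount(v) valid entries.
// Requires BMI2 and AVX-512BW.
static inline uint32_t *decode_word_bmi2_avx512(uint64_t v, uint32_t k, uint32_t *out) {
    // Extract the index bit patterns of the set bits into the low bits of v1..v6.
    uint64_t v1 = _pext_u64(0xffffffff00000000ULL, v);
    uint64_t v2 = _pext_u64(0xffff0000ffff0000ULL, v);
    uint64_t v3 = _pext_u64(0xff00ff00ff00ff00ULL, v);
    uint64_t v4 = _pext_u64(0xf0f0f0f0f0f0f0f0ULL, v);
    uint64_t v5 = _pext_u64(0xccccccccccccccccULL, v);
    uint64_t v6 = _pext_u64(0xaaaaaaaaaaaaaaaaULL, v);

    // Rebuild the indexes as bytes with mask-controlled adds of 32,16,8,4,2,1.
    __m512i vec;
    vec = _mm512_maskz_add_epi8(v1, _mm512_set1_epi8(32), _mm512_set1_epi8(0));
    vec = _mm512_mask_add_epi8(vec, v2, _mm512_set1_epi8(16), vec);
    vec = _mm512_mask_add_epi8(vec, v3, _mm512_set1_epi8(8), vec);
    vec = _mm512_mask_add_epi8(vec, v4, _mm512_set1_epi8(4), vec);
    vec = _mm512_mask_add_epi8(vec, v5, _mm512_set1_epi8(2), vec);
    vec = _mm512_mask_add_epi8(vec, v6, _mm512_set1_epi8(1), vec);

    // Widen each group of 16 bytes to u32, add the word's base offset, store.
    __m512i base = _mm512_set1_epi32(k * 64);
    _mm512_storeu_si512((__m512i *)(out +  0),
        _mm512_add_epi32(_mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 0)), base));
    _mm512_storeu_si512((__m512i *)(out + 16),
        _mm512_add_epi32(_mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 1)), base));
    _mm512_storeu_si512((__m512i *)(out + 32),
        _mm512_add_epi32(_mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 2)), base));
    _mm512_storeu_si512((__m512i *)(out + 48),
        _mm512_add_epi32(_mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec, 3)), base));

    // Only the first popcount(v) entries are meaningful; advance past them.
    return out + __builtin_popcountll(v);
}

A caller just loops over the bitmap’s 64-bit words, bumping k by one each time and letting out advance by however many bits were set.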
Is it fast?
Here’s a rough spreadsheet of the results measured against several of the other methods described in Daniel’s article. It’s faster than most of the other methods, falling down only for very low ‘bitmap densities’. For these lower densities, taking a conditional branch, on the expectation that the number of bits set in a word will be very low, is a winning proposition.
Bitmap density | Method                            | Cycles per index
0.03           | bitmap_decode_ctz                 | 3.852
0.03           | bitmap_decode_avx2                | 10.116
0.03           | bitmap_decode_avx2_turbo          | 14.363
0.03           | bitmap_decode_avx2_turbo_thin     | 15.736
0.03           | bitmap_decode_avx2_turbo_nopopcnt | 12.624
0.03           | bitmap_decode_bmi2_avx512         | 12.9
0.12           | bitmap_decode_ctz                 | 4.97
0.12           | bitmap_decode_avx2                | 3.003
0.12           | bitmap_decode_avx2_turbo          | 4.205
0.12           | bitmap_decode_avx2_turbo_thin     | 4.547
0.12           | bitmap_decode_avx2_turbo_nopopcnt | 3.732
0.12           | bitmap_decode_bmi2_avx512         | 2.481
0.25           | bitmap_decode_ctz                 | 4.251
0.25           | bitmap_decode_avx2                | 1.52
0.25           | bitmap_decode_avx2_turbo          | 2.09
0.25           | bitmap_decode_avx2_turbo_thin     | 2.265
0.25           | bitmap_decode_avx2_turbo_nopopcnt | 1.861
0.25           | bitmap_decode_bmi2_avx512         | 1.25
0.5            | bitmap_decode_ctz                 | 3.446
0.5            | bitmap_decode_avx2                | 0.796
0.5            | bitmap_decode_avx2_turbo          | 1.042
0.5            | bitmap_decode_avx2_turbo_thin     | 1.131
0.5            | bitmap_decode_avx2_turbo_nopopcnt | 0.92
0.5            | bitmap_decode_bmi2_avx512         | 0.616
0.9            | bitmap_decode_ctz                 | 3.037
0.9            | bitmap_decode_avx2                | 0.444
0.9            | bitmap_decode_avx2_turbo          | 0.574
0.9            | bitmap_decode_avx2_turbo_thin     | 0.628
0.9            | bitmap_decode_avx2_turbo_nopopcnt | 0.509
0.9            | bitmap_decode_bmi2_avx512         | 0.366
Is this a great idea? I don’t know.
There are no doubt other methods to use AVX-512 to transform bit vectors in this fashion, and for a relatively low ‘population’ there are a number of ways the bitmap_decode_ctz code can be made to run faster (possibly the topic of another article).
I still think it’s an interesting ‘trick’ and it’s nice to take my second-favorite instruction (PEXT) out for a spin.
Let me know if you’ve seen this trick somewhere before and I’ll be happy to credit where credit is due. As I said, I think I invented it…
The code is available at Daniel Lemire’s Github with an error (my fault, apparently I thought 8+2 = 9) which will be corrected in due course.
ps. In the ‘dynamiting the trout stream’ category, I give you VPCOMPRESSB from Intel® Architecture Instruction Set Extensions Programming Reference (PDF) which will greatly simplify all the above trickery, once we have AVX512_VBMI2 capable machines (Ice Lake time-frame).
pps. There is also a branch-free method where VPCOMPRESSD can be used four times, on 16-bit chunks of the mask, to solve a similar problem on machines that are publicly available now. It is mostly left as an exercise for the reader, but a rough sketch of the shape of it follows. It might be faster than the BMI2 stuff, but it lacks style points.
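This is just my back-of-the-envelope sketch, assuming a per-chunk compress-store is acceptable; the function name and layout are mine, and I haven’t benchmarked it:

#include <immintrin.h>
#include <stdint.h>

// Sketch: decode one 64-bit word 'v' at word index 'k' using VPCOMPRESSD
// on four 16-bit chunks of the mask. Requires AVX-512F.
static inline uint32_t *decode_word_compressd(uint64_t v, uint32_t k, uint32_t *out) {
    // Lane indexes 0..15, offset by the base position of this word.
    __m512i lane = _mm512_add_epi32(_mm512_set1_epi32(k * 64),
                                    _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                                      8, 9, 10, 11, 12, 13, 14, 15));
    for (int i = 0; i < 4; i++) {
        __mmask16 m = (__mmask16)(v >> (16 * i));
        // VPCOMPRESSD with a store: packs the selected lane indexes to 'out'.
        _mm512_mask_compressstoreu_epi32(out, m, lane);
        out += __builtin_popcount((unsigned)m);
        lane = _mm512_add_epi32(lane, _mm512_set1_epi32(16));
    }
    return out;
}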