Fitting My Head Through The ARM Holes or: Two Sequences to Substitute for the Missing PMOVMSKB Instruction on ARM NEON

(the author hard at work)

In the last post, I talked about some first impressions of programming SIMD for the ARM architecture. Since then, I've gotten simdjson working on our ARM box – a 3.3 GHz eMag from Ampere Computing.

I will post some very preliminary performance results for that shortly, but I imagine that will turn into a giant festival of misinterpretation (take your pick: “Intel’s lead in server space is doomed, says ex-Intel Principal Engineer” or “ARM NEON is DOA and will never work”) and fanboy opinions, so I’m going to stick to implementation details for now.

I had two major complaints in my last post. One was that SIMD on ARM is still stuck at 128-bit. As of now, there does not seem to be a clever way to work around this…

The other complaint, a little more tractable, was the absence of the old Intel SIMD standby, the PMOVMSKB instruction. This SSE instruction takes the high bit from every byte and packs it into the low 16 bits of a general purpose register. There is also an AVX2 version of it, called VPMOVMSKB, that sets 32 bits in similar fashion.
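For reference, via intrinsics the SSE2 version looks something like this (a minimal sketch of the semantics we're trying to reproduce; the helper name is mine, not anything from simdjson):

#include <emmintrin.h>
#include <stdint.h>

// PMOVMSKB through its SSE2 intrinsic: the high bit of each of the 16 bytes
// of 'compare_result' is packed into the low 16 bits of the return value.
static inline uint16_t movemask16(__m128i compare_result) {
  return (uint16_t)_mm_movemask_epi8(compare_result);
}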

Naturally, given the popularity of this instruction, AVX512 does something different and does all this – and much more – via ‘mask’ registers instead, but that’s a topic for another post.

At any rate, we want this operation a fair bit in simdjson (and more generally). We often have the results of a compare operation sitting in a SIMD register and would like to reduce it down to a concise form.

In fact, what we want for simdjson is not PMOVMSKB – we want 4 PMOVMSKBs in a row, with the results packed into a single 64-bit register. This is actually good news – the code to do this on ARM is much cheaper (amortized) if you have 4 of these to do and 1 destination register.

So, here's how to do it. For the purposes of this discussion, assume we have already filled each lane of each input vector with 0xff or 0x00. Strictly speaking, the sequences below aren't exactly analogous to PMOVMSKB, as they don't just pick out the high bit.

The Simple Variant


uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  const uint8x16_t bitmask = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                               0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
  // keep one distinct bit per byte, chosen by the byte's position within each 8-byte group
  uint8x16_t t0 = vandq_u8(p0, bitmask);
  uint8x16_t t1 = vandq_u8(p1, bitmask);
  uint8x16_t t2 = vandq_u8(p2, bitmask);
  uint8x16_t t3 = vandq_u8(p3, bitmask);
  // three stages of pairwise adds (4 instructions total): 512 -> 256 -> 128 -> 64 useful bits
  uint8x16_t sum0 = vpaddq_u8(t0, t1);
  uint8x16_t sum1 = vpaddq_u8(t2, t3);
  sum0 = vpaddq_u8(sum0, sum1);
  sum0 = vpaddq_u8(sum0, sum0);
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
}


This requires 4 logical operations (to mask off the unwanted bits), 4 paired-add instructions (it is 3 steps to go from 512->256->128->64 bits, but the first stage requires two separate 256->128 bit operations, so 4 total), and an extraction operation (to get the final result into our general purpose register).

Here’s the basic flow using ‘a’ for the result bit for our first lane of the vector, ‘b’ for the second, etc. We start with (all vectors written left to right from least significant bit to most significant bit, with “/” to delimit bytes):

a a a a a a a a / b b b b b b b b / c c c c c c c c / ...

Masking gets us:

(512 bits across 4 regs)
a 0 0 0 0 0 0 0 / 0 b 0 0 0 0 0 0 / 0 0 c 0 0 0 0 0 / ...

Those 3 stages of paired-add operations (the 4 vpaddq_u8 intrinsics) yield:

(256 bits across 2 regs)
a b 0 0 0 0 0 0 / 0 0 c d 0 0 0 0 / 0 0 0 0 e f 0 0 / ...

(128 bits across 1 reg)
a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 / ...

(64 bits across 1 reg; top half is a repeated 'dummy' copy)
a b c d e f g h / i j k l m n o p / q r s t u v w x / ...

… and then all we need to do is extract the first 64-bit lane. Note that doing this operation over a single 128-bit value to extract a 16-bit mask would not be anything like 1/4 as cheap as this – we would still require 3 vpaddq operations (though we could use the slightly cheaper 64-bit forms for the second and third operations).
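For concreteness, a caller might look something like this (an illustrative sketch, not simdjson's actual code; the function name and the "which bytes equal c" predicate are just for the example):

#include <arm_neon.h>
#include <stdint.h>

// Build a 64-bit mask of which of 64 consecutive input bytes equal 'c', using
// four ordinary 128-bit loads, four byte compares (0xff / 0x00 per lane), and
// the simple variant defined above. Bit i of the result is byte i of buf.
uint64_t eq_mask_64(const uint8_t *buf, uint8_t c) {
  uint8x16_t dup = vdupq_n_u8(c);
  uint8x16_t p0 = vceqq_u8(vld1q_u8(buf +  0), dup);
  uint8x16_t p1 = vceqq_u8(vld1q_u8(buf + 16), dup);
  uint8x16_t p2 = vceqq_u8(vld1q_u8(buf + 32), dup);
  uint8x16_t p3 = vceqq_u8(vld1q_u8(buf + 48), dup);
  return neonmovemask_bulk(p0, p1, p2, p3);
}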

It is possible to combine the results in fewer instructions with a mask and a 16-bit ADDV instruction (which adds results horizontally inside a SIMD register). This instruction, however, seems quite expensive, and I cannot think of a way to extract the predicate results in their original order without extra instructions.

The Interleaved Variant

However, there's a faster and more intriguing way to do this, one that isn't really analogous to anything you can do on the Intel SIMD side of the fence.

Let’s step back a moment. In simdjson, we want to calculate a bunch of different predicates on single-byte values – at different stages, we want to know if they are backslashes, or quotes, or illegal values inside a string (under 0x20), or whether they are in various character classes. All these things can be calculated byte by byte. So it doesn’t really matter what order we operate on our bytes, just as long as we can get our 64-bit mask back out cheaply in the original order.

So we can use an oddity (from an Intel programmer’s perspective) of ARM – the N-way load instructions. In this case, we use LD4 to load 4 vector registers – so 512 bits – in a single hit. Instead of loading these registers consecutively, the 0th, 4th, 8th, … bytes are packed into register 0, the 1st, 5th, 9th, … bytes are packed into register 1, etc.
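In intrinsics terms, the load itself is a one-liner (a sketch; the helper name is mine):

#include <arm_neon.h>
#include <stdint.h>

// LD4 via the vld4q_u8 intrinsic: one load fills four vector registers with
// de-interleaved bytes from a 64-byte chunk.
static inline uint8x16x4_t load64_interleaved(const uint8_t *buf) {
  uint8x16x4_t src = vld4q_u8(buf);
  // src.val[0] now holds bytes 0, 4, 8, ..., 60 of buf
  // src.val[1] holds bytes 1, 5, 9, ..., 61
  // src.val[2] holds bytes 2, 6, 10, ..., 62
  // src.val[3] holds bytes 3, 7, 11, ..., 63
  return src;
}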

In simdjson, we can operate on these bytes normally as before. It doesn’t matter what order they are in when we’re doing compares or looking up shuffle tables to test membership of character classes. However, at the end of it all, we need to reverse the effect of the way that LD4 has interleaved our results.

Here’s the ‘interleaved’ version:


uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  const uint8x16_t bitmask1 = { 0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10,
                                0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10 };
  const uint8x16_t bitmask2 = { 0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20,
                                0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20 };
  const uint8x16_t bitmask3 = { 0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40,
                                0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40 };
  const uint8x16_t bitmask4 = { 0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80,
                                0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80 };
  // one AND, then three bit selects: all 64 result bits end up in one accumulator
  uint8x16_t t0 = vandq_u8(p0, bitmask1);
  uint8x16_t t1 = vbslq_u8(bitmask2, p1, t0);
  uint8x16_t t2 = vbslq_u8(bitmask3, p2, t1);
  uint8x16_t tmp = vbslq_u8(bitmask4, p3, t2);
  // a single pairwise add puts the bits in original byte order in the low 64 bits
  uint8x16_t sum = vpaddq_u8(tmp, tmp);
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum), 0);
}

We start with compare results, with result bits designated as a, b, c, etc (where a is a 1-bit if the first byte-wise vector result was true, else a 0-bit).

4 registers, interleaved
a a a a a a a a / e e e e e e e e / i i i i i i i i / ...
b b b b b b b b / f f f f f f f f / j j j j j j j j / ...
c c c c c c c c / g g g g g g g g / k k k k k k k k / ...
d d d d d d d d / h h h h h h h h / l l l l l l l l / ...

We can start the process of deinterleaving and reducing by AND’ing our compare results for the first register by the repeated bit pattern { 0x01, 0x10 }, the second register by {0x02, 0x20}, the third by { 0x04, 0x40} and the fourth by { 0x08, 0x80 }. Nominally, this would look like this:

4 registers, interleaved, masked
a 0 0 0 0 0 0 0 / 0 0 0 0 e 0 0 0 / i 0 0 0 0 0 0 0 / ...
0 b 0 0 0 0 0 0 / 0 0 0 0 0 f 0 0 / 0 j 0 0 0 0 0 0 / ...
0 0 c 0 0 0 0 0 / 0 0 0 0 0 0 g 0 / 0 0 k 0 0 0 0 0 / ...
0 0 0 d 0 0 0 0 / 0 0 0 0 0 0 0 h / 0 0 0 l 0 0 0 0 / ...

In practice, while we have to AND off the very first register, each subsequent register can be masked and merged into the accumulator using one of ARM's "bit select" operations, which combines and masks in a single instruction (so the above picture never really exists in the registers). So with 4 operations (1 AND and 3 bit selects) we now have our desired 64 bits lined up in a single accumulator register, in a slightly awkward fashion.
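For anyone unfamiliar with BSL, the identity being relied on is roughly the following (a sketch; the helper name is mine):

#include <arm_neon.h>

// vbslq_u8(mask, a, b) takes bits from 'a' where 'mask' is set and from 'b'
// where it is clear, i.e. (mask & a) | (~mask & b). So each bit select both
// masks the next compare register and merges it into the accumulator.
static inline uint8x16_t bsl_equivalent(uint8x16_t mask, uint8x16_t a, uint8x16_t b) {
  return vorrq_u8(vandq_u8(mask, a), vbicq_u8(b, mask));
}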

If our result bits are designated as abcdef… etc., our accumulator register now holds (reading from LSB to MSB):

a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 ...

We can then use the paired-add routine to combine and narrow these bytes, yielding a 64-bit result.

So we need 4 logical operations, 1 paired add, and an extract. This is strictly cheaper than our original sequence. The LD4 operation (it is, after all, an instruction that loads 512 bits in a single hit) is also cheaper than 4 separate 128-bit vector loads.
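Putting the pieces together, the interleaved path might look like this (again, an illustrative sketch rather than simdjson's actual code); note that the bits of the result come out in the original byte order despite the interleaved processing:

#include <arm_neon.h>
#include <stdint.h>

// LD4 the 64 bytes, compare each de-interleaved register against 'c', then
// let the interleaved neonmovemask_bulk above undo the interleaving while
// producing the mask. Bit i of the result corresponds to byte i of buf.
uint64_t eq_mask_64_interleaved(const uint8_t *buf, uint8_t c) {
  uint8x16x4_t src = vld4q_u8(buf);
  uint8x16_t dup = vdupq_n_u8(c);
  uint8x16_t p0 = vceqq_u8(src.val[0], dup); // results for bytes 0, 4, 8, ...
  uint8x16_t p1 = vceqq_u8(src.val[1], dup); // bytes 1, 5, 9, ...
  uint8x16_t p2 = vceqq_u8(src.val[2], dup); // bytes 2, 6, 10, ...
  uint8x16_t p3 = vceqq_u8(src.val[3], dup); // bytes 3, 7, 11, ...
  return neonmovemask_bulk(p0, p1, p2, p3);
}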

The use of this transformation allows us to go from spending 4.23 cycles per byte in simdjson’s “stage1” to 3.86 cycles per byte, an almost 10% improvement. I haven’t quantified how much benefit we get from using LD4 versus how much benefit we get from this cheaper PMOVMSKB sequence.

Conclusion

There is a substitute for PMOVMSKB, especially at larger scale (it would still be painfully slow if you only needed a single 16-bit PMOVMSKB in comparison to the Intel operation). It’s a little faster to use the “interleaved” variant, if interleaving the bytes can be tolerated.

On a machine with 128-bit operations, requiring just 6 operations to do the equivalent of 4 PMOVMSKBs isn't all that bad – notably, if this were an SSE-based Intel machine, the 4 PMOVMSKB operations would need to be followed by 3 shift and 3 OR operations to be glued together into one register. Realistically, though, Intel has had 256-bit integer operations since Haswell (2013), so the comparison should really be against 2 VPMOVMSKB ops followed by a single shift and OR – or, if you really want to be mean to ARM, a single operation into the AVX-512 mask registers followed by a move to the GPRs.
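For comparison, the SSE2 glue sequence looks something like this (a sketch for illustration, not anything from simdjson):

#include <emmintrin.h>
#include <stdint.h>

// Four PMOVMSKBs glued into a single 64-bit mask with three shifts and
// three ORs, as described above.
static inline uint64_t sse_movemask_bulk(__m128i p0, __m128i p1, __m128i p2, __m128i p3) {
  uint64_t m0 = (uint32_t)_mm_movemask_epi8(p0);
  uint64_t m1 = (uint32_t)_mm_movemask_epi8(p1);
  uint64_t m2 = (uint32_t)_mm_movemask_epi8(p2);
  uint64_t m3 = (uint32_t)_mm_movemask_epi8(p3);
  return m0 | (m1 << 16) | (m2 << 32) | (m3 << 48);
}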

Still, it’s better than I thought it would be…

I think I thought up these tricks, but given that I've been coding on ARM for a matter of days, I'm sure there are plenty of people doing this or something similar. Please leave alternate implementations, or pointers to prior implementations, in the comments; I'm happy to give credit where it is due.

Side note: though it is still only a latent possibility in simdjson, the "interleaved" variant actually allows us to combine our results for adjacent values more cheaply than we would be able to if we had our input in non-interleaved order.

If we were evaluating a pair of predicates that are adjacent in our data, and wanted to do this combination in the SIMD side (as opposed to moving the results to the GPR side and calculating things there – in Hyperscan, we had occasion to do things both ways, depending on specifics), we can combine our results for bytes 0, 4, 8, 12 … with our results for bytes 1, 5, 9, 13 … with a simple AND operation. It is only for the combination of bytes 3, 7, 11, 15 with the subsequent bytes 4, 8, 12, 16 that we need to do a comparatively expensive vector shift operation (EXT in ARM terms, or PALIGNR or PSLLDQ for Intel folks).
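Here is roughly what that looks like for the same predicate evaluated on adjacent bytes (a sketch; the function name and the p0_next parameter – the p0 result for the following 64-byte block – are my own labels, not simdjson's):

#include <arm_neon.h>

// Combine a per-byte predicate with the same predicate on the following byte,
// staying entirely in the interleaved (LD4) domain. p0..p3 are the predicate
// results for one 64-byte block; p0_next is p0 for the next block.
static inline void combine_adjacent(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2,
                                    uint8x16_t p3, uint8x16_t p0_next,
                                    uint8x16_t out[4]) {
  out[0] = vandq_u8(p0, p1);                        // byte pairs (0,1), (4,5), (8,9), ...
  out[1] = vandq_u8(p1, p2);                        // byte pairs (1,2), (5,6), ...
  out[2] = vandq_u8(p2, p3);                        // byte pairs (2,3), (6,7), ...
  // only the pairs that cross a 32-bit boundary need the costlier EXT shift
  out[3] = vandq_u8(p3, vextq_u8(p0, p0_next, 1));  // byte pairs (3,4), (7,8), ...
}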

In a way, the cheapness of the ‘vertical’ combinations in this PMOVMSKB substitute hints at this capability: adjacent bytes are easier to combine, except across 32-bit boundaries (not an issue for the code in this post).

This would be a key consideration if porting a Hyperscan matcher such as “Teddy” to ARM. I might build up a demo of this for subsequent posts.

Comments

  1. Interleaved variant without using constant masks:

    uint64_t NEON_i8x64_MatchMask(const uint8_t* ptr, uint8_t match_byte)
    {
      uint8x16x4_t src = vld4q_u8(ptr);
      uint8x16_t dup = vdupq_n_u8(match_byte);
      uint8x16_t cmp0 = vceqq_u8(src.val[0], dup);
      uint8x16_t cmp1 = vceqq_u8(src.val[1], dup);
      uint8x16_t cmp2 = vceqq_u8(src.val[2], dup);
      uint8x16_t cmp3 = vceqq_u8(src.val[3], dup);

      uint8x16_t t0 = vsriq_n_u8(cmp1, cmp0, 1);
      uint8x16_t t1 = vsriq_n_u8(cmp3, cmp2, 1);
      uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
      uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
      uint8x8_t t4 = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
      return vget_lane_u64(vreinterpret_u64_u8(t4), 0);
    }

    1. This is really good! Thanks for the update. I haven’t really internalized the NEON instruction set the way I had the various Intel SIMD extensions, so I’m not surprised that people find better stuff than I did.
