Fitting My Head Through The ARM Holes or: Two Sequences to Substitute for the Missing PMOVMSKB Instruction on ARM NEON

(the author hard at work)

In the last post, I talked about some first impressions of programming SIMD for the ARM architecture. Since then, I've gotten simdjson working on our ARM box – a 3.3 GHz eMag from Ampere Computing.

I will post some very preliminary performance results for that shortly, but I imagine that will turn into a giant festival of misinterpretation (take your pick: “Intel’s lead in server space is doomed, says ex-Intel Principal Engineer” or “ARM NEON is DOA and will never work”) and fanboy opinions, so I’m going to stick to implementation details for now.

I had two major complaints in my last post. One was that SIMD on ARM is still stuck at 128-bit. As of now, there does not seem to be a clever way to work around this…

The other complaint, a little more tractable, was the absence of the old Intel SIMD standby, the PMOVMSKB instruction. This SSE instruction takes the high bit from every byte and packs it into the low 16 bits of a general purpose register. There is also an AVX2 version of it, called VPMOVMSKB, that sets 32 bits in similar fashion.
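For reference, via intrinsics the SSE2 version looks something like this (a minimal sketch of the semantics we're trying to reproduce; the helper name is mine, not anything from simdjson):

#include <emmintrin.h>
#include <stdint.h>

// PMOVMSKB through its SSE2 intrinsic: the high bit of each of the 16 bytes
// of 'compare_result' is packed into the low 16 bits of the return value.
static inline uint16_t movemask16(__m128i compare_result) {
  return (uint16_t)_mm_movemask_epi8(compare_result);
}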

Naturally, given the popularity of this instruction, AVX512 does something different and does all this – and much more – via ‘mask’ registers instead, but that’s a topic for another post.

At any rate, we want this operation a fair bit in simdjson (and more generally). We often have the results of a compare operation sitting in a SIMD register and would like to reduce it down to a concise form.

In fact, what we want for simdjson is not PMOVMSKB – we want 4 PMOVMSKBs in a row, with the results packed into a single 64-bit register. This is actually good news – the code to do this on ARM is much cheaper (amortized) if you have 4 of these to do and 1 destination register.

So, here's how to do it. For the purposes of this discussion, assume we have already filled each lane of each input vector with 0xff or 0x00. Strictly speaking, the sequences below aren't exactly analogous to PMOVMSKB, as they don't just pick out the high bit.

The Simple Variant


uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  const uint8x16_t bitmask = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                               0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
  // keep one distinct bit per byte, chosen by the byte's position within each 8-byte group
  uint8x16_t t0 = vandq_u8(p0, bitmask);
  uint8x16_t t1 = vandq_u8(p1, bitmask);
  uint8x16_t t2 = vandq_u8(p2, bitmask);
  uint8x16_t t3 = vandq_u8(p3, bitmask);
  // three stages of pairwise adds (4 instructions total): 512 -> 256 -> 128 -> 64 useful bits
  uint8x16_t sum0 = vpaddq_u8(t0, t1);
  uint8x16_t sum1 = vpaddq_u8(t2, t3);
  sum0 = vpaddq_u8(sum0, sum1);
  sum0 = vpaddq_u8(sum0, sum0);
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
}


This requires 4 logical operations (to mask off the unwanted bits), 4 paired-add instructions (it is 3 steps to go from 512->256->128->64 bits, but the first stage requires two separate 256->128 bit operations, so 4 total), and an extraction operation (to get the final result into our general purpose register).

Here’s the basic flow using ‘a’ for the result bit for our first lane of the vector, ‘b’ for the second, etc. We start with (all vectors written left to right from least significant bit to most significant bit, with “/” to delimit bytes):

a a a a a a a a / b b b b b b b b / c c c c c c c c / ...

Masking gets us:

(512 bits across 4 regs)
a 0 0 0 0 0 0 0 / 0 b 0 0 0 0 0 0 / 0 0 c 0 0 0 0 0 / ...

Those 3 stages of paired-add operations (the 4 vpaddq_u8 intrinsics) yield:

(256 bits across 2 regs)
a b 0 0 0 0 0 0 / 0 0 c d 0 0 0 0 / 0 0 0 0 e f 0 0 / ...

(128 bits across 1 reg)
a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 / ...

(64 bits across 1 reg; top half is a repeated 'dummy' copy)
a b c d e f g h / i j k l m n o p / q r s t u v w x / ...

… and then all we need to do is extract the first 64-bit lane. Note that doing this operation over a single 128-bit value to extract a 16-bit mask would not be anything like 1/4 as cheap as this – we would still require 3 vpaddq operations (though we could use the slightly cheaper 64-bit forms for the second and third operations).
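For concreteness, a caller might look something like this (an illustrative sketch, not simdjson's actual code; the function name and the "which bytes equal c" predicate are just for the example):

#include <arm_neon.h>
#include <stdint.h>

// Build a 64-bit mask of which of 64 consecutive input bytes equal 'c', using
// four ordinary 128-bit loads, four byte compares (0xff / 0x00 per lane), and
// the simple variant defined above. Bit i of the result is byte i of buf.
uint64_t eq_mask_64(const uint8_t *buf, uint8_t c) {
  uint8x16_t dup = vdupq_n_u8(c);
  uint8x16_t p0 = vceqq_u8(vld1q_u8(buf +  0), dup);
  uint8x16_t p1 = vceqq_u8(vld1q_u8(buf + 16), dup);
  uint8x16_t p2 = vceqq_u8(vld1q_u8(buf + 32), dup);
  uint8x16_t p3 = vceqq_u8(vld1q_u8(buf + 48), dup);
  return neonmovemask_bulk(p0, p1, p2, p3);
}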

It is possible to combine the results in fewer instructions with a mask and a 16-bit ADDV instruction (which adds results horizontally inside a SIMD register). This instruction, however, seems quite expensive, and I cannot think of a way to extract the predicate results in their original order without extra instructions.

The Interleaved Variant

However, there's a faster and more intriguing way to do this, one that isn't really analogous to anything you can do on the Intel SIMD side of the fence.

Let’s step back a moment. In simdjson, we want to calculate a bunch of different predicates on single-byte values – at different stages, we want to know if they are backslashes, or quotes, or illegal values inside a string (under 0x20), or whether they are in various character classes. All these things can be calculated byte by byte. So it doesn’t really matter what order we operate on our bytes, just as long as we can get our 64-bit mask back out cheaply in the original order.

So we can use an oddity (from an Intel programmer’s perspective) of ARM – the N-way load instructions. In this case, we use LD4 to load 4 vector registers – so 512 bits – in a single hit. Instead of loading these registers consecutively, the 0th, 4th, 8th, … bytes are packed into register 0, the 1st, 5th, 9th, … bytes are packed into register 1, etc.
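In intrinsics terms, the load itself is a one-liner (a sketch; the helper name is mine):

#include <arm_neon.h>
#include <stdint.h>

// LD4 via the vld4q_u8 intrinsic: one load fills four vector registers with
// de-interleaved bytes from a 64-byte chunk.
static inline uint8x16x4_t load64_interleaved(const uint8_t *buf) {
  uint8x16x4_t src = vld4q_u8(buf);
  // src.val[0] now holds bytes 0, 4, 8, ..., 60 of buf
  // src.val[1] holds bytes 1, 5, 9, ..., 61
  // src.val[2] holds bytes 2, 6, 10, ..., 62
  // src.val[3] holds bytes 3, 7, 11, ..., 63
  return src;
}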

In simdjson, we can operate on these bytes normally as before. It doesn’t matter what order they are in when we’re doing compares or looking up shuffle tables to test membership of character classes. However, at the end of it all, we need to reverse the effect of the way that LD4 has interleaved our results.

Here’s the ‘interleaved’ version:


uint64_t neonmovemask_bulk(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2, uint8x16_t p3) {
  const uint8x16_t bitmask1 = { 0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10,
                                0x01, 0x10, 0x01, 0x10, 0x01, 0x10, 0x01, 0x10 };
  const uint8x16_t bitmask2 = { 0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20,
                                0x02, 0x20, 0x02, 0x20, 0x02, 0x20, 0x02, 0x20 };
  const uint8x16_t bitmask3 = { 0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40,
                                0x04, 0x40, 0x04, 0x40, 0x04, 0x40, 0x04, 0x40 };
  const uint8x16_t bitmask4 = { 0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80,
                                0x08, 0x80, 0x08, 0x80, 0x08, 0x80, 0x08, 0x80 };
  // one AND, then three bit selects: all 64 result bits end up in one accumulator
  uint8x16_t t0 = vandq_u8(p0, bitmask1);
  uint8x16_t t1 = vbslq_u8(bitmask2, p1, t0);
  uint8x16_t t2 = vbslq_u8(bitmask3, p2, t1);
  uint8x16_t tmp = vbslq_u8(bitmask4, p3, t2);
  // a single pairwise add puts the bits in original byte order in the low 64 bits
  uint8x16_t sum = vpaddq_u8(tmp, tmp);
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum), 0);
}

We start with compare results, with result bits designated as a, b, c, etc (where a is a 1-bit if the first byte-wise vector result was true, else a 0-bit).

4 registers, interleaved
a a a a a a a a / e e e e e e e e / i i i i i i i i / ...
b b b b b b b b / f f f f f f f f / j j j j j j j j / ...
c c c c c c c c / g g g g g g g g / k k k k k k k k / ...
d d d d d d d d / h h h h h h h h / l l l l l l l l / ...

We can start the process of deinterleaving and reducing by AND’ing our compare results for the first register by the repeated bit pattern { 0x01, 0x10 }, the second register by {0x02, 0x20}, the third by { 0x04, 0x40} and the fourth by { 0x08, 0x80 }. Nominally, this would look like this:

4 registers, interleaved, masked
a 0 0 0 0 0 0 0 / 0 0 0 0 e 0 0 0 / i 0 0 0 0 0 0 0 / ...
0 b 0 0 0 0 0 0 / 0 0 0 0 0 f 0 0 / 0 j 0 0 0 0 0 0 / ...
0 0 c 0 0 0 0 0 / 0 0 0 0 0 0 g 0 / 0 0 k 0 0 0 0 0 / ...
0 0 0 d 0 0 0 0 / 0 0 0 0 0 0 0 h / 0 0 0 l 0 0 0 0 / ...

In practice, while we have to AND off the very first register, each subsequent register can be masked and merged into the accumulator using one of ARM's "bit select" operations, which combines and masks in a single instruction (so the above picture never really exists in the registers). So with 4 operations (1 AND and 3 bit selects) we now have our desired 64 bits lined up in a single accumulator register, in a slightly awkward fashion.
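For anyone unfamiliar with BSL, the identity being relied on is roughly the following (a sketch; the helper name is mine):

#include <arm_neon.h>

// vbslq_u8(mask, a, b) takes bits from 'a' where 'mask' is set and from 'b'
// where it is clear, i.e. (mask & a) | (~mask & b). So each bit select both
// masks the next compare register and merges it into the accumulator.
static inline uint8x16_t bsl_equivalent(uint8x16_t mask, uint8x16_t a, uint8x16_t b) {
  return vorrq_u8(vandq_u8(mask, a), vbicq_u8(b, mask));
}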

If our result bits are designated as abcdef… etc., our accumulator register now holds (reading from LSB to MSB):

a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 ...

We can then use the paired-add routine to combine and narrow these bytes, yielding a 64-bit result.

So we need 4 logical operations, 1 paired add, and an extract. This is strictly cheaper than our original sequence. The LD4 operation (it is, after all, an instruction that loads 512 bits in a single hit) is also cheaper than 4 separate 128-bit vector loads.
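Putting the pieces together, the interleaved path might look like this (again, an illustrative sketch rather than simdjson's actual code); note that the bits of the result come out in the original byte order despite the interleaved processing:

#include <arm_neon.h>
#include <stdint.h>

// LD4 the 64 bytes, compare each de-interleaved register against 'c', then
// let the interleaved neonmovemask_bulk above undo the interleaving while
// producing the mask. Bit i of the result corresponds to byte i of buf.
uint64_t eq_mask_64_interleaved(const uint8_t *buf, uint8_t c) {
  uint8x16x4_t src = vld4q_u8(buf);
  uint8x16_t dup = vdupq_n_u8(c);
  uint8x16_t p0 = vceqq_u8(src.val[0], dup); // results for bytes 0, 4, 8, ...
  uint8x16_t p1 = vceqq_u8(src.val[1], dup); // bytes 1, 5, 9, ...
  uint8x16_t p2 = vceqq_u8(src.val[2], dup); // bytes 2, 6, 10, ...
  uint8x16_t p3 = vceqq_u8(src.val[3], dup); // bytes 3, 7, 11, ...
  return neonmovemask_bulk(p0, p1, p2, p3);
}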

The use of this transformation allows us to go from spending 4.23 cycles per byte in simdjson’s “stage1” to 3.86 cycles per byte, an almost 10% improvement. I haven’t quantified how much benefit we get from using LD4 versus how much benefit we get from this cheaper PMOVMSKB sequence.

Conclusion

There is a substitute for PMOVMSKB, especially at larger scale (it would still be painfully slow if you only needed a single 16-bit PMOVMSKB in comparison to the Intel operation). It’s a little faster to use the “interleaved” variant, if interleaving the bytes can be tolerated.

On a machine with 128-bit operations, requiring just 6 operations to do the equivalent of 4 PMOVMSKBs isn't all that bad – notably, if this were an SSE-based Intel machine, the 4 PMOVMSKB operations would need to be followed by 3 shift and 3 OR operations to be glued together into one register. Realistically, though, Intel has had 256-bit integer operations since Haswell (2013), so the comparison should really be against 2 VPMOVMSKB ops followed by a single shift and OR – or, if you really want to be mean to ARM, a single operation into the AVX-512 mask registers followed by a move to the GPRs.
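For comparison, the SSE2 glue sequence looks something like this (a sketch for illustration, not anything from simdjson):

#include <emmintrin.h>
#include <stdint.h>

// Four PMOVMSKBs glued into a single 64-bit mask with three shifts and
// three ORs, as described above.
static inline uint64_t sse_movemask_bulk(__m128i p0, __m128i p1, __m128i p2, __m128i p3) {
  uint64_t m0 = (uint32_t)_mm_movemask_epi8(p0);
  uint64_t m1 = (uint32_t)_mm_movemask_epi8(p1);
  uint64_t m2 = (uint32_t)_mm_movemask_epi8(p2);
  uint64_t m3 = (uint32_t)_mm_movemask_epi8(p3);
  return m0 | (m1 << 16) | (m2 << 32) | (m3 << 48);
}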

Still, it’s better than I thought it would be…

I think I thought up these tricks, but given that I've been coding on ARM for a matter of days, I'm sure there are plenty of people doing this or something similar. Please leave alternate implementations, or pointers to prior implementations, in the comments; I'm happy to give credit where it is due.

Side note: though it is still only a latent possibility in simdjson, the "interleaved" variant actually allows us to combine our results for adjacent values more cheaply than we would be able to if we had our input in non-interleaved order.

If we were evaluating a pair of predicates that are adjacent in our data, and wanted to do this combination in the SIMD side (as opposed to moving the results to the GPR side and calculating things there – in Hyperscan, we had occasion to do things both ways, depending on specifics), we can combine our results for bytes 0, 4, 8, 12 … with our results for bytes 1, 5, 9, 13 … with a simple AND operation. It is only for the combination of bytes 3, 7, 11, 15 with the subsequent bytes 4, 8, 12, 16 that we need to do a comparatively expensive vector shift operation (EXT in ARM terms, or PALIGNR or PSLLDQ for Intel folks).
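Here is roughly what that looks like for the same predicate evaluated on adjacent bytes (a sketch; the function name and the p0_next parameter – the p0 result for the following 64-byte block – are my own labels, not simdjson's):

#include <arm_neon.h>

// Combine a per-byte predicate with the same predicate on the following byte,
// staying entirely in the interleaved (LD4) domain. p0..p3 are the predicate
// results for one 64-byte block; p0_next is p0 for the next block.
static inline void combine_adjacent(uint8x16_t p0, uint8x16_t p1, uint8x16_t p2,
                                    uint8x16_t p3, uint8x16_t p0_next,
                                    uint8x16_t out[4]) {
  out[0] = vandq_u8(p0, p1);                        // byte pairs (0,1), (4,5), (8,9), ...
  out[1] = vandq_u8(p1, p2);                        // byte pairs (1,2), (5,6), ...
  out[2] = vandq_u8(p2, p3);                        // byte pairs (2,3), (6,7), ...
  // only the pairs that cross a 32-bit boundary need the costlier EXT shift
  out[3] = vandq_u8(p3, vextq_u8(p0, p0_next, 1));  // byte pairs (3,4), (7,8), ...
}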

In a way, the cheapness of the ‘vertical’ combinations in this PMOVMSKB substitute hints at this capability: adjacent bytes are easier to combine, except across 32-bit boundaries (not an issue for the code in this post).

This would be a key consideration if porting a Hyperscan matcher such as “Teddy” to ARM. I might build up a demo of this for subsequent posts.

Comments

  1. Interleaved variant without using constant masks:

    uint64_t NEON_i8x64_MatchMask(const uint8_t* ptr, uint8_t match_byte)
    {
      uint8x16x4_t src = vld4q_u8(ptr);
      uint8x16_t dup = vdupq_n_u8(match_byte);
      uint8x16_t cmp0 = vceqq_u8(src.val[0], dup);
      uint8x16_t cmp1 = vceqq_u8(src.val[1], dup);
      uint8x16_t cmp2 = vceqq_u8(src.val[2], dup);
      uint8x16_t cmp3 = vceqq_u8(src.val[3], dup);

      uint8x16_t t0 = vsriq_n_u8(cmp1, cmp0, 1);
      uint8x16_t t1 = vsriq_n_u8(cmp3, cmp2, 1);
      uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
      uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
      uint8x8_t t4 = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
      return vget_lane_u64(vreinterpret_u64_u8(t4), 0);
    }

    1. This is really good! Thanks for the update. I haven’t really internalized the NEON instruction set the way I had the various Intel SIMD extensions, so I’m not surprised that people find better stuff than I did.
