Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)

2019-05-29T01:36:53+00:00

[…] of what SNC is bringing us as programmers, assuming we’re programmers who like using SIMD (I’ve dabbled a […]

2022-09-08T08:20:38+00:00

Don’t know about HAKMEM, but it was certainly used in APL circles. “Boolean Bob” Smith might know the history.

2023-12-15T23:25:33+00:00

The xor scan ≠\ was well known in APL, but I don’t think the clmul method was. I added it to Dyalog APL 18.0 (see https://aplwiki.com/wiki/Dyalog_APL#Instruction_set_usage for example); previously Dyalog used “not Boolean” Bob Bernecky’s code, which he mentions in his talk “A Compendium of SIMD Boolean Array Algorithms in APL”, but that code used successive shifts, the same as Hacker’s Delight section 5-2, “Parity”. I think I learned about the clmul method from some source related to simdjson, but I’d done it before the “Expanding Bits” article (https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/), so it couldn’t have been this post.
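
For readers who have not seen the successive-shifts approach, here is a minimal portable sketch of a 64-bit xor scan. The name `prefix_xor_shifts` is made up for this illustration; it is not taken from Dyalog, Bernecky's code, or Hacker's Delight, and only shows the general shape of the technique.

```cpp
#include <cassert>  // only needed for the checks in the usage note
#include <cstdint>

// Inclusive prefix XOR over the 64 bits of x, least significant bit first:
// bit i of the result is the XOR of bits 0..i of x.
// Each step doubles the span of input bits already folded into every output bit.
inline uint64_t prefix_xor_shifts(uint64_t x) {
    x ^= x << 1;   // each bit now covers 2 input bits
    x ^= x << 2;   // 4 input bits
    x ^= x << 4;   // 8 input bits
    x ^= x << 8;   // 16 input bits
    x ^= x << 16;  // 32 input bits
    x ^= x << 32;  // all 64 input bits
    return x;
}
```

For example, `prefix_xor_shifts(0x24)` (bits 2 and 5 set) gives `0x1C` (bits 2, 3 and 4 set): the scan turns on at the first set bit and off again at the second.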

2023-12-16T00:02:57+00:00

Thanks for that detailed background. I emailed Daniel Lemire in March of 2018 telling him that I’d found that trick and cited https://bitmath.blogspot.com/2016/11/parallel-prefixsuffix-operations.html which might well be where you found it. I doubt that we would have had anything public in time for you to see it.

To be clear, I only claim to have “thought of the application of this trick for quote-balancing”, not to have invented either XOR-prefix-sum (which would be preposterous) or even clmul by -1.
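
To make "clmul by -1" concrete, here is a hedged software sketch that needs no PCLMULQDQ support. `clmul_lo64` and `prefix_xor_clmul` are hypothetical helper names; the loop emulates only the low 64 bits of a carry-less product, which is the half the quote-mask code keeps.

```cpp
#include <cassert>  // only needed for the checks in the usage note
#include <cstdint>

// Low 64 bits of the carry-less product of a and b:
// like schoolbook multiplication, but partial products are combined with XOR
// instead of addition, so no carries propagate.
inline uint64_t clmul_lo64(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1) {
            result ^= a << i;
        }
    }
    return result;
}

// Multiplying by all ones XORs together every left shift of x, so bit i of the
// result is the XOR of bits 0..i of x: an inclusive prefix XOR.
inline uint64_t prefix_xor_clmul(uint64_t x) {
    return clmul_lo64(x, ~uint64_t{0});
}
```

As a check, `prefix_xor_clmul(0x24)` is `0x1C`, the same answer a shift-based xor scan gives for that input.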

That’s an interesting page on Dyalog. Have you looked into GFNI for boolean matrix transpose?

2024-01-19T20:34:24+00:00

I did end up doing some work on AVX2 bit transpose, but put it aside when I found out it takes a pretty large matrix (side length >100) for it to be consistently better than converting to bytes and back. Unaligned rows are not nice. I think GFNI is useful for large enough matrices but not a huge deal.
Regardless of how fast you can do it, transposing 8×8 sub-matrices isn’t great because then you’re loading and storing individual bytes. I went with 32×32 for AVX2 and I think 64×64 would make sense for AVX-512. The way to think about such a transpose is to consider the kernel to be a 2×2×2×…×2 array, where you want to exchange the (maybe) 6 axes on the left with the 6 on the right. It’s fine for those sets of axes to get jumbled up in the process, because that can be corrected by rearranging the loads or stores, which is free or close to it.
With a 64×64 AVX-512 kernel you have 3 sub-byte axes, 4 axes between bytes and lanes, 2 axes between lanes and registers, and 3 axes for the 8 registers. The GFNI instruction lets you swap the lowest-order 3 axes with the 3 above them, so it moves a lot of axes, but it’s all staying on one side instead of moving axes between the sides. A sequence I came up with is 012345abcdef to 012abc345def with big shuffles, to abc012345def with GFNI, to abcdef045123 with three rounds of unpack instructions. I think that comes out better than anything I had not using GFNI, but overall it’s doing a fairly small portion of the work, especially when you account for the loads and stores.
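
The axis-exchange view can be tried at small scale with a scalar 8×8 bit transpose held in one 64-bit word. This is the classic three-step delta-swap formulation, not the AVX2 or AVX-512 kernel described above; `transpose8x8` is a name chosen for this sketch, and the layout (bit index = 8 × row + column) is an assumption of the example.

```cpp
#include <cassert>  // only needed for the checks in the usage note
#include <cstdint>

// Transpose an 8x8 bit matrix stored as bit index p = 8*row + col.
// The 6-bit index p has column axes 0,1,2 (low bits) and row axes 3,4,5.
// Each step exchanges one column axis with one row axis.
inline uint64_t transpose8x8(uint64_t x) {
    uint64_t t;
    // exchange axis 0 with axis 3: swap p and p+7 where bit 0 is set, bit 3 clear
    t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AAULL;
    x ^= t ^ (t << 7);
    // exchange axis 1 with axis 4: swap p and p+14
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL;
    x ^= t ^ (t << 14);
    // exchange axis 2 with axis 5: swap p and p+28
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL;
    x ^= t ^ (t << 28);
    return x;
}
```

A single set bit at row `r`, column `c` should come back at row `c`, column `r`, and applying the function twice should return the input unchanged.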

2023-12-16T01:08:42+00:00

Ah, the quote-matching trick’s pretty well known in APL since xor-scan is such an important primitive. Searching “≠\ quote” on GitHub even turns up this for JSON by 2013 (MiServer is Brian Becker’s, likely it’s his code): https://github.com/okdistribute/mags/blob/a2ce218/websrv/Utils/JSON.dyalog#L620. There’s also an instance in Jd (J database) that looks like it comes from Chris Burke’s JDB, so that’s probably how I learned it, 2012 or 2013.

I haven’t tried a GFNI transpose, and apparently missed it in your Ice Lake article! My current array language BQN has nice AVX2 kernel transposes for integer types, but any sort of boolean transpose is still to-do, so I suppose I’ll be looking at it soon enough. Incidentally, the reason I checked this page is for writing some notes on reductions and scans in BQN, https://mlochbaum.github.io/BQN/implementation/primitive/fold.html#booleans. There’s a page on transpose too, but it doesn’t say much about the boolean case yet.

2023-12-16T01:20:47+00:00

I have to laugh then. I certainly thought I was inventing that, but, like a long list of things that I thought I invented, there’s someone else already parked in that spot. Apparently I need to broaden my reading list to include more APL.

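	// Returns a mask covering the quoted regions of a 64-byte block.
	// really_inline and cmp_mask_against_input are defined elsewhere in the
	// surrounding code.
	really_inline uint64_t find_quote_mask_and_bits(
	    __m256i input_lo, __m256i input_hi, uint64_t odd_ends,
	    uint64_t &prev_iter_inside_quote, uint64_t &quote_bits) {
	  // one bit per input byte, set where that byte is a double quote
	  quote_bits =
	      cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
	  // clear the quotes flagged in odd_ends (escaped quotes)
	  quote_bits = quote_bits & ~odd_ends;
	  // carry-less multiply by all ones is a prefix XOR: each bit becomes the
	  // parity of the quote bits at or below it
	  uint64_t quote_mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(
	      _mm_set_epi64x(0ULL, quote_bits), _mm_set1_epi8(0xFF), 0));
	  // flip the mask if the previous block ended inside a quote
	  quote_mask ^= prev_iter_inside_quote;
	  // right shift of a signed value expected to be well-defined and standard
	  // compliant as of C++20,
	  // John Regehr from the University of Utah says this is fine code
	  // (broadcasts bit 63 to all 64 bits as the carry for the next block)
	  prev_iter_inside_quote =
	      static_cast<uint64_t>(static_cast<int64_t>(quote_mask) >> 63);
	  return quote_mask;
	}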
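
To see what the mask looks like on real input without AVX2 or PCLMULQDQ, here is a portable scalar sketch of the same idea. It makes several assumptions: the names `quote_bits_scalar`, `prefix_xor` and `quote_mask_scalar` are invented for this example, blocks shorter than 64 bytes are treated as zero-padded, and escaped quotes (the `odd_ends` handling above) are ignored entirely.

```cpp
#include <cassert>  // only needed for the checks in the usage note
#include <cstddef>
#include <cstdint>
#include <string>

// Bit i is set when block[i] is a double quote (only the first 64 bytes are used).
inline uint64_t quote_bits_scalar(const std::string &block) {
    uint64_t bits = 0;
    for (std::size_t i = 0; i < block.size() && i < 64; ++i) {
        if (block[i] == '"') bits |= uint64_t{1} << i;
    }
    return bits;
}

// Inclusive prefix XOR by successive shifts; stands in for the clmul by all ones.
inline uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;
    x ^= x << 8;  x ^= x << 16; x ^= x << 32;
    return x;
}

// Mask of bytes inside a quoted region: the opening quote is included,
// the closing quote is not. prev_iter_inside_quote carries state between blocks
// and must be 0 or all ones.
inline uint64_t quote_mask_scalar(const std::string &block,
                                  uint64_t &prev_iter_inside_quote) {
    uint64_t mask = prefix_xor(quote_bits_scalar(block));
    mask ^= prev_iter_inside_quote;
    // 0 - 1 wraps to all ones, so this broadcasts bit 63 without a signed shift
    prev_iter_inside_quote = uint64_t{0} - (mask >> 63);
    return mask;
}
```

With a carry starting at 0, the block `a "bc" d` yields the mask `0x1C` (the opening quote, `b` and `c`). A block that leaves a quote open, such as `"abc`, yields an all-ones mask and sets the carry, so a following block `d" e` yields `0x1`: only `d` is still inside the quote.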

Published by geofflangdale

9 thoughts on “Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)”
