SMH: The Swiss Army Chainsaw of shuffle-based matching sequences

2018-06-05T05:32:20+00:00

[…] quick update on last week’s post SMH: The Swiss Army Chainsaw of shuffle-based matching sequences on performance […]


2018-07-21T10:59:23+00:00

Hey, pshufb’s my favorite instruction too! (I will admit that vpconflictd is a contender.) I discovered a similar trick some months ago while trying to find a vectorized variant of a B-tree that could handle variable-length prefixes. I eventually found a nice way to compare with a vectorfull of prefixes for both equality and inequality. I found that it’s most efficient to store each prefix reversed, that is, in big-endian order. Then the central idea is:

	bot = 2*top + 1;
	streq = top & eq & ((eq&~top)+bot); // Same as SMH tight fit
	strlt = top & (((eq>>1)&~top)+lt);

The popcount of (strlt<<16|streq) can be used to determine which node to go to next, as in poptrie. There are a few more tricks to figure out how long a prefix can be discarded and when to stop, but the whole thing comes out to about 10ns per level. If there aren't many strings, this is probably faster than a hash table, so I think it's suitable for name lookups. I haven't used it yet: one problem is handling insertions, but the real problem is dealing with all the code that expects names to be stored in a binary tree.
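A minimal scalar sketch of the streq/strlt formulas, under one reading of the comment: eq and lt are assumed to be per-byte comparison bitmasks (as produced by a movemask), each prefix occupies a contiguous run of bits with its most significant character at the highest bit, and top marks that highest bit of every prefix. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type: one bit per prefix, located at that prefix's top bit. */
typedef struct {
    uint32_t streq; /* prefix equals the key      */
    uint32_t strlt; /* prefix is less than the key */
} prefix_cmp;

/* eq/lt: per-byte "equal"/"less than" bits (assumed mutually exclusive).
 * top:   the most significant byte position of each prefix. */
static prefix_cmp compare_prefixes(uint32_t eq, uint32_t lt, uint32_t top) {
    prefix_cmp r;
    uint32_t bot = 2 * top + 1; /* lowest bit of every prefix */

    /* Adding bot carries into the top bit only if every lower byte is equal;
     * the top byte itself must be equal as well. */
    r.streq = top & eq & ((eq & ~top) + bot);

    /* Shifting eq down by one lines up each byte's lt bit with the eq bit of
     * the byte above it, so an lt bit carries to the top only through a run
     * of equal, more significant bytes. */
    r.strlt = top & (((eq >> 1) & ~top) + lt);
    return r;
}
```

For example, with two 3-byte prefixes (top = 0x24), eq = 0x2F and lt = 0x10 describe a first prefix that is fully equal and a second whose top byte is equal but whose middle byte is less, so the first reports streq and the second reports strlt.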

It took me a while to find the right approach to the question in subsection 3 (bit arithmetic), but it turns out to have a fairly simple answer. Ignore all the character stuff and just consider the value we get from movemask. What predicates can we express on that string of bits using the add instruction? The input is an n-bit number consisting of the results of n comparisons and a fixed n-bit string defining the predicate, and the output is bit n of their sum, that is, the carry out of the top position.

A predicate P on n+1 bits is composed of a predicate Q on the low n bits, plus one more bit b which acts on the corresponding comparison bit c. If b=0, then the addition carries only if Q and c are both true, so P≡Q∧c. If b=1, then the addition carries if either Q or c is true, and P≡Q∨c. At the base level, a 0-bit sum can only be 0, since nothing carries into the lowest bit. In BNF, we get

Pred ::= Cmp"∧("Pred")" | Cmp"∨("Pred")" | "0"

as the possible logical formulas for predicates, where Cmp is whatever you can put together with vector instructions (using ranged comparison, c<=str[i]&&str[i]<=d for any index i<16 and characters c and d).
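The recurrence above can be checked mechanically. The following is a small sketch, not part of the original comment, that computes the predicate once as the carry out of an addition and once by folding ∧/∨ over the bits, then compares the two exhaustively; all function names are made up for illustration.

```c
#include <stdint.h>

/* Bit n of (c + b), where c holds n comparison bits and b the n predicate bits. */
static int pred_by_add(uint32_t c, uint32_t b, int n) {
    return (int)((((uint64_t)c + (uint64_t)b) >> n) & 1);
}

/* The same predicate built from the recurrence:
 * start from Q = 0, then Q = Q AND c_i when b_i = 0, Q = Q OR c_i when b_i = 1. */
static int pred_by_formula(uint32_t c, uint32_t b, int n) {
    int q = 0;
    for (int i = 0; i < n; i++) {
        int ci = (int)((c >> i) & 1);
        int bi = (int)((b >> i) & 1);
        q = bi ? (q | ci) : (q & ci);
    }
    return q;
}

/* Returns 1 if both definitions agree for every pair of n-bit inputs
 * (keep n small: the loop visits 4^n pairs). */
static int preds_agree(int n) {
    uint32_t limit = (uint32_t)1 << n;
    for (uint32_t c = 0; c < limit; c++)
        for (uint32_t b = 0; b < limit; b++)
            if (pred_by_add(c, b, n) != pred_by_formula(c, b, n))
                return 0;
    return 1;
}
```

With b = 001 the formula reads c2∧(c1∧(c0∨0)), which is the familiar "all three comparisons matched" test from the SMH sequence.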

The full result of the addition also contains values from all the sub-predicates, which could be used in later operations like shifts, adds, and subtracts. The model rapidly gets more complicated, and I don't have a good idea of what classes of operations are possible in small numbers of operations. Figuring out the implications of carry-less multiply is left as an exercise for Claude Shannon.

Surely you're aware that double-width shuffle can be emulated pretty easily? My version uses five instructions, plus moves, and a constant register, assuming all indices are in range (unsigned-less-than 32).

	__m128i shuf2(__m128i x0, __m128i x1, __m128i sel) {
	    __m128i f0 = _mm_set1_epi8(0xf0);
	    sel = _mm_add_epi8(sel, f0);
	    return _mm_or_si128(
	        _mm_shuffle_epi8(x0, _mm_xor_si128(sel, f0)),
	        _mm_shuffle_epi8(x1, sel)
	    );
	}
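To see why the index arithmetic in shuf2 works, here is a portable scalar model of it (a sketch for checking the logic, not a replacement for the intrinsics). It assumes the documented pshufb behaviour: a selector byte with its high bit set produces 0, otherwise the low four bits pick a source byte. The names pshufb_model, shuf2_model and shuf2_model_ok are hypothetical.

```c
#include <stdint.h>

/* Scalar model of pshufb / _mm_shuffle_epi8. */
static void pshufb_model(const uint8_t x[16], const uint8_t sel[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (sel[i] & 0x80) ? 0 : x[sel[i] & 0x0f];
}

/* Mirrors shuf2 step by step: add 0xf0, xor 0xf0, two shuffles, or. */
static void shuf2_model(const uint8_t x0[16], const uint8_t x1[16],
                        const uint8_t sel[16], uint8_t out[16]) {
    uint8_t s0[16], s1[16], r0[16], r1[16];
    for (int i = 0; i < 16; i++) {
        /* 0..15 -> 0xf0..0xff (high bit set); 16..31 wraps to 0..15 */
        s1[i] = (uint8_t)(sel[i] + 0xf0);
        /* 0..15 -> 0..15; 16..31 -> 0xf0..0xff (high bit set) */
        s0[i] = (uint8_t)(s1[i] ^ 0xf0);
    }
    pshufb_model(x0, s0, r0); /* nonzero only for selectors 0..15  */
    pshufb_model(x1, s1, r1); /* nonzero only for selectors 16..31 */
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(r0[i] | r1[i]);
}

/* Returns 1 if every selector 0..31 picks the expected byte of x0:x1. */
static int shuf2_model_ok(void) {
    uint8_t x0[16], x1[16], sel[16], out[16];
    for (int i = 0; i < 16; i++) {
        x0[i] = (uint8_t)(i + 1);   /* distinct nonzero test bytes */
        x1[i] = (uint8_t)(i + 101);
    }
    for (int base = 0; base < 32; base++) {
        for (int i = 0; i < 16; i++)
            sel[i] = (uint8_t)((base + i) % 32);
        shuf2_model(x0, x1, sel, out);
        for (int i = 0; i < 16; i++) {
            uint8_t want = sel[i] < 16 ? x0[sel[i]] : x1[sel[i] - 16];
            if (out[i] != want)
                return 0;
        }
    }
    return 1;
}
```

Each selector ends up with its high bit set in exactly one of the two shuffles, so one side always contributes 0 and the OR merges them without a blend.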


2019-02-28T09:38:28+00:00

[…] assertions: we use a technique similar to my SMH matching engine to augment, upon receipt of a literal match, the power of the literal match by verifying that the […]


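	// m256 and u32 are the post's own typedefs (presumably __m256i and a
	// 32-bit unsigned integer; their definitions are not shown here).
	struct SIMD_SMH_PART {
	    m256 shuf_mask;
	    m256 cmp_mask;
	    m256 and_mask; // not yet used
	    m256 sub_mask; // not yet used

	    // Shuffle the input bytes into place, compare them with the expected
	    // bytes, and pack the 32 per-byte results into one bit each.
	    u32 doit(m256 d) {
	        return _mm256_movemask_epi8(
	            _mm256_cmpeq_epi8(_mm256_shuffle_epi8(d, shuf_mask),
	                              cmp_mask));
	    }
	};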
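The SIMD step can be modelled in portable C to follow the traces further down by hand. This sketch assumes the documented behaviour of _mm256_shuffle_epi8 (each 16-byte lane is shuffled independently, and a selector with its high bit set yields 0). The mask tables are copied from the shuf_mask and cmp_mask rows of those traces; decoded as ASCII, cmp_mask holds "moose", "mouse", "cat" and "dog". The function names and the match_word helper are hypothetical.

```c
#include <stdint.h>

/* Scalar model of _mm256_shuffle_epi8: lanes 0..15 and 16..31 never mix. */
static void shuffle256_model(const uint8_t d[32], const uint8_t sel[32], uint8_t out[32]) {
    for (int i = 0; i < 32; i++) {
        int lane = i & 16; /* base index of the lane this byte lives in */
        out[i] = (sel[i] & 0x80) ? 0 : d[lane + (sel[i] & 0x0f)];
    }
}

/* shuffle -> compare for equality -> one result bit per byte (bit i = byte i). */
static uint32_t simd_smh_model(const uint8_t d[32], const uint8_t shuf_mask[32],
                               const uint8_t cmp_mask[32]) {
    uint8_t s[32];
    uint32_t m = 0;
    shuffle256_model(d, shuf_mask, s);
    for (int i = 0; i < 32; i++)
        if (s[i] == cmp_mask[i])
            m |= (uint32_t)1 << i;
    return m;
}

/* Tables copied from the traces. 128 selects a zero byte, which can never
 * equal the 255 in cmp_mask, so those positions always produce a 0 bit. */
static const uint8_t SHUF_MASK[32] = {
    0, 1, 2, 3, 4, 128, 0, 1, 2, 3, 4, 128, 0, 1, 2, 128,
    0, 1, 2, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128
};
static const uint8_t CMP_MASK[32] = {
    109, 111, 111, 115, 101, 255, 109, 111, 117, 115, 101, 255, 99, 97, 116, 255,
    100, 111, 103, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
};

/* Copies a short word into both lanes, as the trace inputs do, and runs the model. */
static uint32_t match_word(const char *w) {
    uint8_t d[32] = {0};
    for (int i = 0; i < 16 && w[i] != '\0'; i++) {
        d[i] = (uint8_t)w[i];
        d[16 + i] = (uint8_t)w[i];
    }
    return simd_smh_model(d, SHUF_MASK, CMP_MASK);
}
```

The input is duplicated across both lanes because the shuffle cannot pull bytes from the other lane; the "dog" comparison sits in the upper lane and needs its own copy of the input.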

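	struct GPR_SMH_PART {
	    u64 hi;  // one result bit per string: the gap bit after it (loose fit)
	             // or its last character's bit (tight fit)
	    u64 low; // the first character's bit of each string

	    u64 doit(u64 m, bool loose_fit) {
	        if (loose_fit) {
	            // A full run of matches carries from low into the gap bit.
	            return (m + low) & hi;
	        } else {
	            // Carry through all but the last character, then also
	            // require the last character itself to match.
	            return ((m & ~hi) + low) & (m & hi);
	        }
	    }
	};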
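A plain-C sketch of the same arithmetic, for stepping through the traces below by hand. The loose-fit hi and low values are read off the traces (bit 0 is the leftmost column there). The tight-fit layout is a hypothetical repacking of the same four strings without gap bits, introduced only for illustration; gpr_smh and the macro names are likewise made up.

```c
#include <stdint.h>

#define BIT(n) ((uint64_t)1 << (n))

/* Same logic as GPR_SMH_PART::doit, as a free function. */
static uint64_t gpr_smh(uint64_t m, uint64_t hi, uint64_t low, int loose_fit) {
    if (loose_fit)
        return (m + low) & hi;
    return ((m & ~hi) + low) & (m & hi);
}

/* Loose fit, as in the traces: strings at bits 0-4, 6-10, 12-14, 16-18,
 * each followed by a gap bit that receives the carry. */
#define LOOSE_HI  (BIT(5) | BIT(11) | BIT(15) | BIT(19))
#define LOOSE_LOW (BIT(0) | BIT(6)  | BIT(12) | BIT(16))

/* Tight fit (hypothetical layout): the same strings packed back to back at
 * bits 0-4, 5-9, 10-12, 13-15; hi is each string's last character. */
#define TIGHT_HI  (BIT(4) | BIT(9) | BIT(12) | BIT(15))
#define TIGHT_LOW (BIT(0) | BIT(5) | BIT(10) | BIT(13))
```

In the first trace the comparison bits are 0x207DB; adding LOOSE_LOW carries through bits 6 to 10 into bit 11, so only the second string reports a match. In the second trace the bits are 0x7000 and the carry lands in bit 15. In the tight layout, the first input's comparison bits would be 0x43FB and the match shows up at bit 9.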

Fit	Predicate Count	ns per sequence (throughput)
Loose	32	0.888
Loose	64	1.38
Loose	128	2.82
Tight	32	1.14
Tight	64	1.65
Tight	128	3.37

	109 111 117 115:101 0 0 0| 0 0 0 0: 0 0 0 0|109 111 117 115:101 0 0 0| 0 0 0 0: 0 0 0 0| input
	0 1 2 3: 4 128 0 1| 2 3 4 128: 0 1 2 128| 0 1 2 128:128 128 128 128|128 128 128 128:128 128 128 128| shuf_mask
	109 111 117 115:101 0 109 111|117 115 101 0:109 111 117 0|109 111 117 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| shuf result
	109 111 111 115:101 255 109 111|117 115 101 255: 99 97 116 255|100 111 103 255:255 255 255 255|255 255 255 255:255 255 255 255| cmp_mask
	255 255 0 255:255 0 255 255|255 255 255 0: 0 0 0 0| 0 255 0 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| cmp result

	11_11_11111______1______________________________________________ input to gpr-smh
	_____1_____1___1___1____________________________________________ hi
	1_____1_____1___1_______________________________________________ low
	__111______11___11______________________________________________ after_add
	___________1____________________________________________________ ret

	Result: 25
	99 97 116 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| 99 97 116 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| input
	0 1 2 3: 4 128 0 1| 2 3 4 128: 0 1 2 128| 0 1 2 128:128 128 128 128|128 128 128 128:128 128 128 128| shuf_mask
	99 97 116 0: 0 0 99 97|116 0 0 0: 99 97 116 0| 99 97 116 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| shuf result
	109 111 111 115:101 255 109 111|117 115 101 255: 99 97 116 255|100 111 103 255:255 255 255 255|255 255 255 255:255 255 255 255| cmp_mask
	0 0 0 0: 0 0 0 0| 0 0 0 0:255 255 255 0| 0 0 0 0: 0 0 0 0| 0 0 0 0: 0 0 0 0| cmp result

	____________111_________________________________________________ input to gpr-smh
	_____1_____1___1___1____________________________________________ hi
	1_____1_____1___1_______________________________________________ low
	1_____1________11_______________________________________________ after_add
	_______________1________________________________________________ ret

SMH: The Swiss Army Chainsaw of shuffle-based matching sequences

Baseline Application: Prefix Matching

SMH: Full Sequence

1. Shuffle allows discontinuous things to be compared

2. The full sequence allows masking, ranged comparison and negation

3. The bit arithmetic at the end of the sequence can model more than just ADD

Future thingies

Summary: The Case for the Swiss Army Chainsaw

Postscript: A Note on Comparisons to Trent Nelson’s Prefix Matcher

Published by geofflangdale

3 thoughts on “SMH: The Swiss Army Chainsaw of shuffle-based matching sequences”
