“Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton

Deterministic Finite Automata (DFA, subsequently) are a fundamental structure. Most state machines that programmers build are some variant on a DFA, whether they are built by jumping around inside a switch statement or moving from state to state in a table structure.

They have uses all over the place; they are used heavily in regular expression implementation, and can be used in various validation schemes such as UTF-8 validation. I’m going to show a curious little DFA of my own invention* that we used in Hyperscan**. The presentation here will be an independent re-implementation as the version in Hyperscan is buried in some pretty complex code.

Sheng has some pretty tight limitations, especially in the version I’m presenting here:

  1. It cannot have more than 16 states.
  2. This version of Sheng is ‘silent’ – it calculates states but doesn’t have an ‘accept state’ that is actively raised, so you can’t detect a regular expression match and get a callback or an index telling you where it matched.
  3. This version of Sheng is also a bare DFA without a compiler. You need to put the transitions of the state machine in manually.
  4. This version of Sheng depends on x86 instructions, but the principles could allow the extension of Sheng to any system with a similar permute instruction, such as ARM NEON.

Most important: Sheng uses my favorite instruction, PSHUFB!

The Problem in Plain DFA Implementations: Memory Latency

A typical problem for DFA implementations is that each state transition involves, at best, a single memory access; more compact implementations may use several. Worse still, each state transition depends on the previous one, so a simple DFA cannot run faster than the latency of the lowest level of cache (often plus a cycle, if the value loaded from the transition table needs extra work to make it suitable for the next state transition).

This is the critical path of the DFA: the state-to-state transition. Other activities, such as remapping characters to a smaller character set to save space, or checking for accept states, are not on the critical path and are almost ‘free’ in a typical implementation – after all, we’re waiting for the state transition to finish. That’s a lot of free execute slots!

Here’s a not very interesting DFA implementation:


struct BasicDFA {
    typedef u8 State;
    u8 transitions[16][256];
    State start_state;

    BasicDFA(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
        // fill all transitions with the default state
        for (u32 st = 0; st < 16; ++st) {
            for (u32 c = 0; c < 256; ++c) {
                transitions[st][c] = default_state;
            }
        }
        // fill in the transition from state 'from' to state 'to' on character c
        for (auto p : trans_vec) {
            u32 from, to;
            u8 c;
            std::tie(from, to, c) = p;
            transitions[from][c] = (u8)to;
        }
        start_state = start_state_;
    }

    State apply(const u8 * data, size_t len, State s) {
        size_t i = 0;
        for (; i+7 < len; i+=8) {
            u8 c1 = data[i+0];
            u8 c2 = data[i+1];
            u8 c3 = data[i+2];
            u8 c4 = data[i+3];
            u8 c5 = data[i+4];
            u8 c6 = data[i+5];
            u8 c7 = data[i+6];
            u8 c8 = data[i+7];
            s = transitions[s][c1];
            s = transitions[s][c2];
            s = transitions[s][c3];
            s = transitions[s][c4];
            s = transitions[s][c5];
            s = transitions[s][c6];
            s = transitions[s][c7];
            s = transitions[s][c8];
        }
        for (; i < len; ++i) {
            s = transitions[s][data[i]];
        }
        return s;
    }
};

This isn’t a perfect “simple” DFA implementation; we waste at least 1 cycle of latency in our state-to-state transition on index arithmetic to look up that big array (better, but more obscure, would be to track our state as a location within the transition table).

Note that the full implementation unrolls the loop, too.

However, even given a wasted cycle or two of latency, this implementation is close to the limit of memory latency. The DFA is small (4K) so we will be getting it from L1 cache in the steady state, but that means a state-to-state transition at around 4-5 cycles minimum.
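
For the curious, here is a rough sketch of the “track our state as a location within the transition table” idea mentioned above. This is my own illustration, not the code in the repository: each table entry stores the byte offset of the next state’s row, so, at least in principle, the only work left on the critical path is the load itself (the character index can be folded into the load’s addressing mode).

#include <cstddef>
#include <cstdint>

struct OffsetDFA {
    uint16_t table[16 * 256]; // table[row + c] holds next_state * 256, i.e. the next row's offset

    uint8_t apply(const uint8_t *data, size_t len, uint8_t s) const {
        uint32_t row = (uint32_t)s * 256;
        for (size_t i = 0; i < len; ++i) {
            row = table[row + data[i]]; // load only: no multiply/shift on the critical path
        }
        return (uint8_t)(row >> 8); // convert the row offset back into a state number
    }
};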

Enter My Little Friend: Sheng

Sheng is a different approach. Sheng uses the PSHUFB instruction to implement state transitions: for each input character we look up a shuffle mask, and that mask maps the current state to the next one. Note that the mask lookup is not on the critical path, as we know our input characters well in advance.

As such, the critical path for Sheng is just 1 cycle on modern architectures; both recent Intel and AMD processors implement PSHUFB with a single cycle of latency.

The variant of Sheng presented is ‘silent’ – it allows us to calculate which state we’re in at a given point, but it has no facility to detect whether a match has occurred. We’ll cover how to build a non-silent Sheng later; sadly, the number of instructions required to check our state means that we have to add a lot of extra work – too much to manage 1 cycle per byte (not a critical path issue; it’s just hard to do that many operations in a cycle).


struct Sheng {
    typedef m128 State;
    m128 transitions[256];
    State start_state;

    Sheng(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
        // fill all transitions with default state
        for (u32 i = 0; i < 256; ++i) {
            transitions[i] = _mm_set1_epi8(default_state);
        }
        // fill in state transition for slot 'from' to point to 'to' for our character transition c
        // (set_byte_at_offset writes byte 'to' into lane 'from' of the vector)
        for (auto p : trans_vec) {
            u32 from, to;
            u8 c;
            std::tie(from, to, c) = p;
            set_byte_at_offset(transitions[c], from, to);
        }
        start_state = _mm_set1_epi8(start_state_); // put everyone into start state – why not?
    }

    State apply(const u8 * data, size_t len, State s) {
        size_t i = 0;
        for (; i+7 < len; i+=8) {
            u8 c1 = data[i+0];
            u8 c2 = data[i+1];
            u8 c3 = data[i+2];
            u8 c4 = data[i+3];
            u8 c5 = data[i+4];
            u8 c6 = data[i+5];
            u8 c7 = data[i+6];
            u8 c8 = data[i+7];
            s = _mm_shuffle_epi8(transitions[c1], s);
            s = _mm_shuffle_epi8(transitions[c2], s);
            s = _mm_shuffle_epi8(transitions[c3], s);
            s = _mm_shuffle_epi8(transitions[c4], s);
            s = _mm_shuffle_epi8(transitions[c5], s);
            s = _mm_shuffle_epi8(transitions[c6], s);
            s = _mm_shuffle_epi8(transitions[c7], s);
            s = _mm_shuffle_epi8(transitions[c8], s);
        }
        for (; i < len; ++i) {
            s = _mm_shuffle_epi8(transitions[data[i]], s);
        }
        return s;
    }
};

So this one is a little weird: we heavily depend on my favorite instruction, PSHUFB, included on most x86 processors since its introduction with SSSE3 (the catchily named “Supplemental Streaming SIMD Extensions 3”).

PSHUFB (_mm_shuffle_epi8 in this code) is a bytewise shuffle, using the low 4 bits of each byte in a control mask register to indicate which byte to copy from the source register to the destination (if the high bit of a control byte is set, the corresponding destination byte is zeroed instead). It can be used to permute data, but it can also be used, in effect, to look up a 16-wide table.
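
As a quick illustration of the “16-wide table” view (a toy example of my own, nothing to do with Sheng itself): here PSHUFB looks up a per-nibble popcount table, with each input byte (assumed to be in the range 0–15) selecting one of the 16 table entries.

#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

static inline __m128i popcount_low_nibbles(__m128i nibbles) {
    // 16-entry lookup table: entry i is the number of bits set in i
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    return _mm_shuffle_epi8(lut, nibbles); // each control byte selects a table entry
}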

In this usage, the PSHUFB masks are looked up on a per-character basis. We read a character from our input, fetch the mask for that character, and use our current state to index into it. For example, in the 5th unrolled iteration, the current state selects a byte from the mask for character c5 (“transitions[c5]”), and that byte is our new state.

We keep our canonical state in the bottom lane of the 128-bit register.
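
(When the scan is done, the state number can be read back out of that bottom lane; one way to do it, shown below, is an assumption of mine and not necessarily how the repository does it.)

#include <emmintrin.h> // SSE2
#include <cstdint>

static inline uint8_t canonical_state(__m128i s) {
    return (uint8_t)(_mm_cvtsi128_si32(s) & 0xff); // low byte of the bottom lane
}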

As a side note, we could actually be processing 16 DFAs at once, with an almost useless set of limitations:

  1. The DFAs all have to have the same structure and character transitions.
  2. The DFAs all have to be acting on the same data.

So really, all we can do is start the DFAs off in different states, crank those states, and see what happens. There is an interesting use of this (picture what happens when we initialize a register with [0,1,2,…,15] and process a block of data – we now have a function that can be applied as another shuffle mask!), sketched briefly below; details can wait for another followup blog post.
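
Concretely, that aside looks something like this (a sketch only; sheng, block, block_len and s_before are stand-ins, and Sheng::apply is the routine above):

m128 identity = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
// after scanning the block, lane i holds the state reached when starting from state i
m128 block_fn = sheng.apply(block, block_len, identity);
// block_fn is itself a PSHUFB mask: apply the whole block to any state vector in one shuffle
m128 s_after = _mm_shuffle_epi8(block_fn, s_before);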

So, what do we get from all this? The main advantage of doing this is speed – here’s the basic comparison of speed between the two systems (measured on a 4GHz Skylake client machine):


$ sudo nice --20 taskset -c 1 ./sheng
Sheng
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
final state: 1 bytes scanned: 1638400000 seconds: 0.417347
bytes per ns 3.92575
Basic DFA
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
final state: 1 bytes scanned: 1638400000 seconds: 2.73646
bytes per ns 0.59873

(There’s also a basic level of state tracing included here so that I could verify that the two state machines are sane and doing the same thing; see the code.)

So we’re processing 3.92 bytes per nanosecond (at 4GHz, roughly 1.02 cycles per byte, so pretty close to 1 cycle per byte) as opposed to around 0.6 bytes per nanosecond with a basic DFA implementation (roughly 6.7 cycles per byte, which could probably go about 10-20% faster with a more sophisticated table lookup, but not that much more). Sounds good – as long as we can live with the long list of limitations of Sheng.

Sheng has a lot of interesting properties, which I’ll follow up in later posts:

  • There are several strategies for having a “noisy” Sheng – that is, one that can stop, raise a callback, or write to a buffer whenever it encounters some “interesting” state (e.g. an accept state); a minimal sketch of the most obvious approach follows this list.
  • There are also a number of ways Sheng can be adapted to handle a larger portion of the pattern matching task.
  • There is nothing inherently x86-centric about Sheng. The TBL instructions on ARM NEON could be used to build the same facility, and the multiple-register variants of these instructions could be used to build 32-, 48- or 64-state DFAs.
  • An AVX2 machine can run two independent 16-state DFAs at once for the same cost, although there is no cost-free way for them to interact. An AVX512 adaptation of the same techniques allows 4 such independent 16-state DFAs.
  • AVX512 also allows other exotic structures, including larger DFAs built from the 16-bit permute operations, even the 2-source permutes.
  • AVX512 VBMI adds VPERMB and 2-source byte permutes, allowing this technique to be extended to 64 or even 128 states! However, the added latency of these permutes means that a simplistic implementation will be much slower.
  • Since PSHUFB is a permute, it’s possible to compute blocks of this operation out of order. This can be exploited to improve throughput wherever an instruction’s latency exceeds its inverse throughput – this is not the case for PSHUFB or VPSHUFB, but it is for some of the more recent shuffle instructions (for example, many of the AVX512 16-bit shuffles are latency=7, throughput=2) and will likely be true of the next generation of shuffle instructions.
    • Note that a 2-source permute is not straightforwardly handled by this, as in order to turn the permutes over a block of input into a function, we must calculate all possible outcomes over all states. This becomes prohibitively expensive with already-large operations.
    • This out-of-order computation is not particularly suitable where a “noisy” Sheng is required.
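
As a teaser for the “noisy” discussion, here is the shape of the most obvious approach, as a rough sketch only (assuming a single accept state accept and a hypothetical report_match(offset) callback, and reusing the transitions table from above). The extra compare and movemask per byte are exactly the kind of work that breaks the 1-cycle-per-byte budget.

const m128 accept_vec = _mm_set1_epi8(accept);
for (size_t i = 0; i < len; ++i) {
    s = _mm_shuffle_epi8(transitions[data[i]], s);
    // 0xFF in every lane sitting in the accept state; bit 0 of the movemask is the bottom lane
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(s, accept_vec)) & 1) {
        report_match(i); // hypothetical callback
    }
}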

Until then, I hope you enjoyed Sheng, and you can find the code on GitHub.

https://github.com/geofflangdale/sheng

[ please note: it is essentially a ‘sketch’, lacking many features, and there is approximately zero software engineering applied to it. The Sheng and BasicDFA structures should be related through static or dynamic polymorphism so that they can share test drivers, but I didn’t want to start designing a more generalized interface until I have built out more of the Sheng variants, so I used cut-n-paste polymorphism 🙂 ]

[ also note: yes, there are many ways to make DFAs run faster, including acceleration, gluing the characters together and various other tricks. There are also a bunch of ways to make DFAs run slower; typically by implementing them on specialized hardware add-in cards, then waiting geological ages to get the data to the cards and the matches back. ]

* I independently invented this technique along with some researchers at Microsoft Research; if anyone can recall the paper where this technique is documented, please let me know and I’ll put in a link and appropriate credit.

Update: Anuj Kalia, in comments, identified a Microsoft Research paper that’s possibly what I saw: “Data-Parallel Finite-State Machines” (Microsoft Research). For the 16-state case, I believe this approach converges to be functionally equivalent to Sheng. We discovered this work only when we went looking to establish the originality of Sheng…

** Anatoly Burakov wrote the first implementation of Sheng within Hyperscan. Alex Coyte later extended Sheng to work as part of a much larger DFA, a subsystem which he felt moved to dub “Shengy McShengface”, for reasons he may not be able to adequately explain.

12 thoughts on ““Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton”

  1. The Github code looks suspiciously similar to one of my early prototypes 🙂 I guess there aren’t many ways to write something that simple…

    1. That’s right. I worked without reference to the Hyperscan source base, but there’s only so many ways to write an engine with fundamentally one instruction. This one is pretty rudimentary as it’s ‘silent’ – a ‘noisy’ Sheng is more complex, and from what I recall, we never really did figure out a way of maintaining 1 cycle per byte in a ‘noisy’ Sheng (kinda hard to get state out in 1 cycle or stash it somewhere, especially if you’re using both load units and Port 5 already 😦 ).

    1. This seems right, although I don’t remember the paper very well. They are taking a different tack but the fundamental idea is the same. I will make an addendum to the blog post and add a link to their paper.

  2. Nice!
    I’m interested in your noisy experiments.
    I need to write a small dfa that reports the last position that was in state 0. What method/instructions would you recommend in order to get this “noisy” version?

    1. That’s a curious problem. The most straightforward way of doing this would be to have a PCMPEQB against 0 and to sweep out that result periodically, or to just move the LSB out of the SIMD register if you’re only doing “one channel”. Accumulating the LSB in another register by OR’ing it into a shift register (which you would then move with PSRLDQ) would allow you to accumulate 16 bytes worth of results before you bother to actually look at your state. Everything we did that was ‘noisy’ in Sheng was uncomfortably expensive – generally accessing the state halves the performance.

      In general adaptations to Sheng benefit from being very domain specific, if that makes sense. Knowing a bit more about the problem being solved often helps a good deal. There are a number of tricks you can use if, for example, you have a few states to spare (e.g. kind of a ‘skid zone’ after the thing you were looking for). Other tricks exist for using the weird potential of Sheng to be in multiple states at once.
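
      Roughly, in intrinsics, the accumulation idea looks like this (an untested sketch, single channel in lane 0, inspected once per 16 bytes):

      m128 lane0 = _mm_cvtsi32_si128(0xFF); // mask that keeps only the bottom lane
      m128 acc = _mm_setzero_si128();
      for (int j = 0; j < 16; ++j) {
          s = _mm_shuffle_epi8(transitions[data[i + j]], s);
          m128 in_state0 = _mm_cmpeq_epi8(s, _mm_setzero_si128()); // 0xFF where a lane is in state 0
          // shift the accumulator along a byte and OR in this byte's flag
          acc = _mm_or_si128(_mm_slli_si128(acc, 1), _mm_and_si128(in_state0, lane0));
      }
      // one PMOVMSKB per 16 input bytes: the lowest set bit b corresponds to the most
      // recent byte that was in state 0, at offset i + 15 - b
      int flags = _mm_movemask_epi8(acc);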

  3. Not sure if my previous question went through. Playing around with ctre, and trying to implement a noisy Sheng to search for a prefix string. Not sure if I’ve got it quite right. Definitely messed up a few attempts; right now I’m maybe 33% faster than what ctre generates for its search function, which would be a brute force loop of a string comparison, into the remainder of the regex.

    Currently it seems like I’m processing 1,500,000,000~ bytes/s w/ the dfa and 1,100,000,000 bytes/s natively. Does this sound about right? That’s around 0.64 ns/byte vs 0.86 ns/byte, you mentioned trying for a noisy sheng was roughly half as fast, so I think I’m close.

    Guess follow up question would be, since I’m learning things here, is there a data structure for a regex you would rather use for matching patterns if you could build it at compile time?

    1. These numbers seem reasonable, with the proviso that I don’t know enough about what you’re benchmarking on. Sheng, nominally, should be able to process 1 byte per cycle – that’s 3GB/s on a 3GHz machine, so 1.5GB/s isn’t crazy. That’s a decent result, especially comparing a DFA (albeit a small one) with a brute force string compare.

      It’s possible to shave down noisy Sheng with a lot of SIMD hacking, but it’s quite hard. Any use of the shuffle port on recent-but-not-ICL-recent Intel will nail you down to 2 cycles, no questions asked. I don’t know your method for reading out the state so I won’t deluge you with details, but there are some other options. It’s likely to be rather ‘fractional’ in its improvement.

      A domain-specific way of improving Noisy Sheng can be tried if you know for sure you only need (say) 15 states in your DFA, not 16 AND if you aren’t planning to handle overlapping matches (typically). If you have a spare state, and you’re looking for (say) /x\s\dy/ you would generate a DFA that (morally speaking) looks for /x\s\dy./ as well – essentially having a ‘skid’ state that matches anything and occurs right after the real match state. Now you can label both of the matching states as accepts and inspect your state half as often.

      This could even be generalized, but a problem starts happening if you’re planning to handle overlapping matches – you need to be able to handle an input like “x 1yx 2y” where two matches overlap – the ‘skid’ state then needs to do the work of handling the ‘skid accept’ and the regular work of detecting DFA states – this means you can consume a lot more states. I never bothered with this as 16 states is already very small. It might be worth revisiting with the bigger Shengs that you can build on ICL/TGL.
