“Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton

Deterministic Finite Automata (DFA, subsequently) are a fundamental structure. Most state machines that programmers build are some variant on a DFA, whether they are built by jumping around inside a switch statement or moving from state to state in a table structure.

They have uses all over the place; they are used heavily in regular expression implementation, and can be used in various validation schemes such as UTF-8 validation. I’m going to show a curious little DFA of my own invention* that we used in Hyperscan**. The presentation here will be an independent re-implementation as the version in Hyperscan is buried in some pretty complex code.

Sheng has some pretty tight limitations, especially in the version I’m presenting here:

  1. It cannot have more than 16 states.
  2. This version of Sheng is ‘silent’ – it calculates states but doesn’t have an ‘accept state’ that is actively raised, so you can’t detect a regular expression match and get a callback or an index telling you where it matched.
  3. This version of Sheng is also a bare DFA without a compiler. You need to put the transitions of the state machine in manually.
  4. This version of Sheng depends on x86 instructions, but the principles could allow the extension of Sheng to any system with a similar permute instruction, such as ARM NEON.

Most important: Sheng uses my favorite instruction, PSHUFB!

The Problem in Plain DFA Implementations: Memory Latency

A typical problem for DFA implementations is that each state transition involves, at best, a single memory access; more compact implementations may use several. Worse still, each state transition depends on the previous one, so a simple DFA cannot run faster than the latency of the lowest level of cache (often plus a cycle, if the value loaded from the transition table needs extra work to make it suitable for the next state transition).

This is the critical path of the DFA: the state-to-state transition. Other activities, such as remapping characters to a smaller character set to save space, or checking for accept states, are not on the critical path and are almost ‘free’ in a typical implementation – after all, we’re waiting for the state transition to finish. That’s a lot of free execute slots!

Here’s a not very interesting DFA implementation:


struct BasicDFA {
    typedef u8 State;
    u8 transitions[16][256];
    State start_state;

    BasicDFA(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
        // fill all transitions with the default state
        for (u32 st = 0; st < 16; ++st) {
            for (u32 c = 0; c < 256; ++c) {
                transitions[st][c] = default_state;
            }
        }
        // fill in the transition from state 'from' to state 'to' on character c
        for (auto p : trans_vec) {
            u32 from, to;
            u8 c;
            std::tie(from, to, c) = p;
            transitions[from][c] = (u8)to;
        }
        start_state = start_state_;
    }

    State apply(const u8 * data, size_t len, State s) {
        size_t i = 0;
        for (; i+7 < len; i+=8) {
            u8 c1 = data[i+0];
            u8 c2 = data[i+1];
            u8 c3 = data[i+2];
            u8 c4 = data[i+3];
            u8 c5 = data[i+4];
            u8 c6 = data[i+5];
            u8 c7 = data[i+6];
            u8 c8 = data[i+7];
            s = transitions[s][c1];
            s = transitions[s][c2];
            s = transitions[s][c3];
            s = transitions[s][c4];
            s = transitions[s][c5];
            s = transitions[s][c6];
            s = transitions[s][c7];
            s = transitions[s][c8];
        }
        for (; i < len; ++i) {
            s = transitions[s][data[i]];
        }
        return s;
    }
};

This isn’t a perfect “simple” DFA implementation; we waste at least 1 cycle of latency in our state-to-state transition on index arithmetic to look up that big array (better, but more obscure, would be to track our state as a location within the transition table).

Note that the full implementation unrolls the loop, too.

However, even given a wasted cycle or two of latency, this implementation is close to the limit of memory latency. The DFA is small (4K) so we will be getting it from L1 cache in the steady state, but that means a state-to-state transition at around 4-5 cycles minimum.
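
For the curious, here is a rough sketch of the “track our state as a location within the transition table” idea mentioned above. This is my own illustration, not the code in the repository: each table entry stores the byte offset of the next state’s row, so, at least in principle, the only work left on the critical path is the load itself (the character index can be folded into the load’s addressing mode).

#include <cstddef>
#include <cstdint>

struct OffsetDFA {
    uint16_t table[16 * 256]; // table[row + c] holds next_state * 256, i.e. the next row's offset

    uint8_t apply(const uint8_t *data, size_t len, uint8_t s) const {
        uint32_t row = (uint32_t)s * 256;
        for (size_t i = 0; i < len; ++i) {
            row = table[row + data[i]]; // load only: no multiply/shift on the critical path
        }
        return (uint8_t)(row >> 8); // convert the row offset back into a state number
    }
};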

Enter My Little Friend: Sheng

Sheng is a different approach. Sheng uses the PSHUFB instruction to implement state transitions: for each input character we look up a shuffle mask, and that mask maps the current state to the next one. Note that the mask lookup is not on the critical path, as we know our input characters well in advance.

As such, the critical path for Sheng is just 1 cycle on modern architectures; both recent Intel and AMD processors implement PSHUFB with a single cycle of latency.

The variant of Sheng presented is ‘silent’ – it allows us to calculate which state we’re in at a given point, but it has no facility to detect whether a match has occurred. We’ll cover how to build a non-silent Sheng later; sadly, the number of instructions required to check our state means that we have to add a lot of extra work – too much to manage 1 cycle per byte (not a critical path issue; it’s just hard to do that many operations in a cycle).


struct Sheng {
    typedef m128 State;
    m128 transitions[256];
    State start_state;

    Sheng(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
        // fill all transitions with default state
        for (u32 i = 0; i < 256; ++i) {
            transitions[i] = _mm_set1_epi8(default_state);
        }
        // fill in state transition for slot 'from' to point to 'to' for our character transition c
        // (set_byte_at_offset writes byte 'to' into lane 'from' of the vector)
        for (auto p : trans_vec) {
            u32 from, to;
            u8 c;
            std::tie(from, to, c) = p;
            set_byte_at_offset(transitions[c], from, to);
        }
        start_state = _mm_set1_epi8(start_state_); // put everyone into start state – why not?
    }

    State apply(const u8 * data, size_t len, State s) {
        size_t i = 0;
        for (; i+7 < len; i+=8) {
            u8 c1 = data[i+0];
            u8 c2 = data[i+1];
            u8 c3 = data[i+2];
            u8 c4 = data[i+3];
            u8 c5 = data[i+4];
            u8 c6 = data[i+5];
            u8 c7 = data[i+6];
            u8 c8 = data[i+7];
            s = _mm_shuffle_epi8(transitions[c1], s);
            s = _mm_shuffle_epi8(transitions[c2], s);
            s = _mm_shuffle_epi8(transitions[c3], s);
            s = _mm_shuffle_epi8(transitions[c4], s);
            s = _mm_shuffle_epi8(transitions[c5], s);
            s = _mm_shuffle_epi8(transitions[c6], s);
            s = _mm_shuffle_epi8(transitions[c7], s);
            s = _mm_shuffle_epi8(transitions[c8], s);
        }
        for (; i < len; ++i) {
            s = _mm_shuffle_epi8(transitions[data[i]], s);
        }
        return s;
    }
};

So this one is a little weird: we heavily depend on my favorite instruction, PSHUFB, included on most x86 processors since its introduction with SSSE3 (the catchily named “Supplemental Streaming SIMD Extensions 3”).

PSHUFB (_mm_shuffle_epi8 in this code) is a bytewise shuffle, using the low 4 bits of each byte in a control mask register to indicate which byte to copy from the source register to the destination (if the high bit of a control byte is set, the corresponding destination byte is zeroed instead). It can be used to permute data, but it can also be used, in effect, to look up a 16-wide table.
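
As a quick illustration of the “16-wide table” view (a toy example of my own, nothing to do with Sheng itself): here PSHUFB looks up a per-nibble popcount table, with each input byte (assumed to be in the range 0–15) selecting one of the 16 table entries.

#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

static inline __m128i popcount_low_nibbles(__m128i nibbles) {
    // 16-entry lookup table: entry i is the number of bits set in i
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    return _mm_shuffle_epi8(lut, nibbles); // each control byte selects a table entry
}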

In this usage, the PSHUFB masks are looked up on a per-character basis. We read a character from our input, fetch the mask for that character, and use our current state to index into it. For example, in the 5th unrolled iteration, the current state selects a byte from the mask for character c5 (“transitions[c5]”), and that byte is our new state.

We keep our canonical state in the bottom lane of the 128-bit register.
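
(When the scan is done, the state number can be read back out of that bottom lane; one way to do it, shown below, is an assumption of mine and not necessarily how the repository does it.)

#include <emmintrin.h> // SSE2
#include <cstdint>

static inline uint8_t canonical_state(__m128i s) {
    return (uint8_t)(_mm_cvtsi128_si32(s) & 0xff); // low byte of the bottom lane
}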

As a side note, we could actually be processing 16 DFAs at once, with an almost useless set of limitations:

  1. The DFAs all have to have the same structure and character transitions.
  2. The DFAs all have to be acting on the same data.

So really, all we can do is start the DFAs off in different states, crank those states, and see what happens. There is an interesting use of this (picture what happens when we initialize a register with [0,1,2,…,15] and process a block of data – we now have a function that can be applied as another shuffle mask!), sketched briefly below; details can wait for another followup blog post.
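
Concretely, that aside looks something like this (a sketch only; sheng, block, block_len and s_before are stand-ins, and Sheng::apply is the routine above):

m128 identity = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
// after scanning the block, lane i holds the state reached when starting from state i
m128 block_fn = sheng.apply(block, block_len, identity);
// block_fn is itself a PSHUFB mask: apply the whole block to any state vector in one shuffle
m128 s_after = _mm_shuffle_epi8(block_fn, s_before);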

So, what do we get from all this? The main advantage of doing this is speed – here’s the basic comparison of speed between the two systems (measured on a 4GHz Skylake client machine):


$ sudo nice --20 taskset -c 1 ./sheng
Sheng
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
final state: 1 bytes scanned: 1638400000 seconds: 0.417347
bytes per ns 3.92575
Basic DFA
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
final state: 1 bytes scanned: 1638400000 seconds: 2.73646
bytes per ns 0.59873

(There’s also a basic level of state tracing included here so that I could verify that the two state machines are sane and doing the same thing; see the code.)

So we’re processing 3.92 bytes per nanosecond (at 4GHz, roughly 1.02 cycles per byte, so pretty close to 1 cycle per byte) as opposed to around 0.6 bytes per nanosecond with a basic DFA implementation (roughly 6.7 cycles per byte, which could probably go about 10-20% faster with a more sophisticated table lookup, but not that much more). Sounds good – as long as we can live with the long list of limitations of Sheng.

Sheng has a lot of interesting properties, which I’ll follow up in later posts:

  • There are several strategies for having a “noisy” Sheng – that is, one that can stop, raise a callback, or write to a buffer whenever it encounters some “interesting” state (e.g. an accept state); a minimal sketch of the most obvious approach follows this list.
  • There are also a number of ways Sheng can be adapted to handle a larger portion of the pattern matching task.
  • There is nothing inherently x86-centric about Sheng. The TBL instructions on ARM NEON could be used to build the same facility, and the multiple-register variants of these instructions could be used to build 32-, 48- or 64-state DFAs.
  • An AVX2 machine can run two independent 16-state DFAs at once for the same cost, although there is no cost-free way for them to interact. An AVX512 adaptation of the same techniques allows 4 such independent 16-state DFAs.
  • AVX512 also allows other exotic structures, including larger DFAs built from the 16-bit permute operations, even the 2-source permutes.
  • AVX512 VBMI adds VPERMB and 2-source byte permutes, allowing this technique to be extended to 64 or even 128 states! However, the added latency of these permutes means that a simplistic implementation will be much slower.
  • Since PSHUFB is a permute, it’s possible to compute blocks of this operation out of order. This can be exploited to improve throughput wherever an instruction’s latency exceeds its inverse throughput – this is not the case for PSHUFB or VPSHUFB, but it is for some of the more recent shuffle instructions (for example, many of the AVX512 16-bit shuffles are latency=7, throughput=2) and will likely be true of the next generation of shuffle instructions.
    • Note that a 2-source permute is not straightforwardly handled by this, as in order to turn the permutes over a block of input into a function, we must calculate all possible outcomes over all states. This becomes prohibitively expensive with already-large operations.
    • This out-of-order computation is not particularly suitable where a “noisy” Sheng is required.
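
As a teaser for the “noisy” discussion, here is the shape of the most obvious approach, as a rough sketch only (assuming a single accept state accept and a hypothetical report_match(offset) callback, and reusing the transitions table from above). The extra compare and movemask per byte are exactly the kind of work that breaks the 1-cycle-per-byte budget.

const m128 accept_vec = _mm_set1_epi8(accept);
for (size_t i = 0; i < len; ++i) {
    s = _mm_shuffle_epi8(transitions[data[i]], s);
    // 0xFF in every lane sitting in the accept state; bit 0 of the movemask is the bottom lane
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(s, accept_vec)) & 1) {
        report_match(i); // hypothetical callback
    }
}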

Until then, I hope you enjoyed Sheng, and you can find the code on GitHub.

https://github.com/geofflangdale/sheng

[ please note: it is essentially a ‘sketch’, lacking many features, and there is approximately zero software engineering applied to it. The Sheng and BasicDFA structures should be related through static or dynamic polymorphism so that they can share test drivers, but I didn’t want to start designing a more generalized interface until I have built out more of the Sheng variants, so I used cut-n-paste polymorphism 🙂 ]

[ also note: yes, there are many ways to make DFAs run faster, including acceleration, gluing the characters together and various other tricks. There are also a bunch of ways to make DFAs run slower; typically by implementing them on specialized hardware add-in cards, then waiting geological ages to get the data to the cards and the matches back. ]

* I independently invented this technique along with some researchers at Microsoft Research; if anyone can recall the paper where this technique is documented, please let me know and I’ll put in a link and appropriate credit.

Update: Anuj Kalia, in comments, identified a Microsoft Research paper that’s possibly what I saw: “Data-Parallel Finite-State Machines” (Microsoft Research). For the 16-state case, I believe this approach converges to be functionally equivalent to Sheng. We discovered this work only when we went looking to establish the originality of Sheng…

** Anatoly Burakov wrote the first implementation of Sheng within Hyperscan. Alex Coyte later extended Sheng to work as part of a much larger DFA, a subsystem which he felt moved to dub “Shengy McShengface”, for reasons he may not be able to adequately explain.

12 thoughts on ““Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton”

  1. The Github code looks suspiciously similar to one of my early prototypes 🙂 I guess there aren’t many ways to write something that simple…

    1. That’s right. I worked without reference to the Hyperscan source base, but there’s only so many ways to write an engine with fundamentally one instruction. This one is pretty rudimentary as it’s ‘silent’ – a ‘noisy’ Sheng is more complex, and from what I recall, we never really did figure out a way of maintaining 1 cycle per byte in a ‘noisy’ Sheng (kinda hard to get state out in 1 cycle or stash it somewhere, especially if you’re using both load units and Port 5 already 😦 ).

    1. This seems right, although I don’t remember the paper very well. They are taking a different tack but the fundamental idea is the same. I will make an addendum to the blog post and add a link to their paper.

  2. Nice!
    I’m interested in your noisy experiments.
    I need to write a small dfa that reports the last position that was in state 0. What method/instructions would you recommend in order to get this “noisy” version?

    1. That’s a curious problem. The most straightforward way of doing this would be to have a PCMPEQB against 0 and to sweep out that result periodically, or to just move the LSB out of the SIMD register if you’re only doing “one channel”. Accumulating the LSB in another register by OR’ing it into a shift register (which you would then move with PSRLDQ) would allow you to accumulate 16 bytes worth of results before you bother to actually look at your state. Everything we did that was ‘noisy’ in Sheng was uncomfortably expensive – generally accessing the state halves the performance.

      In general adaptations to Sheng benefit from being very domain specific, if that makes sense. Knowing a bit more about the problem being solved often helps a good deal. There are a number of tricks you can use if, for example, you have a few states to spare (e.g. kind of a ‘skid zone’ after the thing you were looking for). Other tricks exist for using the weird potential of Sheng to be in multiple states at once.
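
      Roughly, in intrinsics, the accumulation idea looks like this (an untested sketch, single channel in lane 0, inspected once per 16 bytes):

      m128 lane0 = _mm_cvtsi32_si128(0xFF); // mask that keeps only the bottom lane
      m128 acc = _mm_setzero_si128();
      for (int j = 0; j < 16; ++j) {
          s = _mm_shuffle_epi8(transitions[data[i + j]], s);
          m128 in_state0 = _mm_cmpeq_epi8(s, _mm_setzero_si128()); // 0xFF where a lane is in state 0
          // shift the accumulator along a byte and OR in this byte's flag
          acc = _mm_or_si128(_mm_slli_si128(acc, 1), _mm_and_si128(in_state0, lane0));
      }
      // one PMOVMSKB per 16 input bytes: the lowest set bit b corresponds to the most
      // recent byte that was in state 0, at offset i + 15 - b
      int flags = _mm_movemask_epi8(acc);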

  3. Not sure if my previous question went through. Playing around with ctre, and trying to implement a noisy Sheng to search for a prefix string. Not sure if I’ve got it quite right. Definitely messed up a few attempts; right now I’m maybe 33% faster than what ctre generates for its search function, which would be a brute force loop of a string comparison, into the remainder of the regex.

    Currently it seems like I’m processing 1,500,000,000~ bytes/s w/ the dfa and 1,100,000,000 bytes/s natively. Does this sound about right? That’s around 0.64 ns/byte vs 0.86 ns/byte, you mentioned trying for a noisy sheng was roughly half as fast, so I think I’m close.

    Guess follow up question would be, since I’m learning things here, is there a data structure for a regex you would rather use for matching patterns if you could build it at compile time?

    1. These numbers seem reasonable, with the proviso that I don’t know enough about what you’re benchmarking on. Sheng, nominally, should be able to process 1 byte per cycle – that’s 3GB/s on a 3GHz machine, so 1.5GB/s isn’t crazy. That’s a decent result, especially comparing a DFA (albeit a small one) with a brute force string compare.

      It’s possible to shave down noisy Sheng with a lot of SIMD hacking, but it’s quite hard. Any use of the shuffle port on recent-but-not-ICL-recent Intel will nail you down to 2 cycles, no questions asked. I don’t know your method for reading out the state so I won’t deluge you with details, but there are some other options. It’s likely to be rather ‘fractional’ in its improvement.

      A domain-specific way of improving Noisy Sheng can be tried if you know for sure you only need (say) 15 states in your DFA, not 16 AND if you aren’t planning to handle overlapping matches (typically). If you have a spare state, and you’re looking for (say) /x\s\dy/ you would generate a DFA that (morally speaking) looks for /x\s\dy./ as well – essentially having a ‘skid’ state that matches anything and occurs right after the real match state. Now you can label both of the matching states as accepts and inspect your state half as often.

      This could even be generalized, but a problem starts happening if you’re planning to handle overlapping matches – you need to be able to handle an input like “x 1yx 2y” where two matches overlap – the ‘skid’ state then needs to do the work of handling the ‘skid accept’ and the regular work of detecting DFA states – this means you can consume a lot more states. I never bothered with this as 16 states is already very small. It might be worth revisiting with the bigger Shengs that you can build on ICL/TGL.
