Deterministic Finite Automata (DFA, subsequently) are a fundamental structure. Most state machines that programmers build are some variant on a DFA, whether they are built by jumping around inside a switch statement or moving from state to state in a table structure.
They have uses all over the place; they are used heavily in regular expression implementation, and can be used in various validation schemes such as UTF-8 validation. I’m going to show a curious little DFA of my own invention* that we used in Hyperscan**. The presentation here will be an independent re-implementation as the version in Hyperscan is buried in some pretty complex code.
Sheng has some pretty tight limitations, especially in the version I’m presenting here:
- It cannot have more than 16 states.
- This version of Sheng is ‘quiet’ – it calculates states but doesn’t have an ‘accept state’ that is actively raised. So you can’t detect a regular expression and get a callback or a index as to where it matched.
- This version of Sheng is also a bare DFA without a compiler. You need to put the transitions of the state machine in manually.
- This version of Sheng depends on x86 instructions, but the principles could allow the extension of Sheng to any system with a similar permute instruction, such as ARM NEON.
Most important: Sheng uses my favorite instruction, PSHUFB!
The Problem in Plain DFA implementations: Memory Latency
A typical problem for DFA implementation is that, at best, each DFA state transition typically involves a single memory access. More compact implementations may use several. Worse still, each of these state transitions depends on the previous state transition, so a simple DFA cannot run faster than the latency of the lowest level of cache (often plus a cycle, if there are things that need to be done to the loaded value from the transition table to make it suitable for another state transition).
This is the critical path of the DFA: the state-to-state transition. Other activities, such as remapping characters to a smaller character set to save space, or checking for accept states, are not on the critical path and are almost ‘free’ in a typical implementation – after all, we’re waiting for the state transition to finish. That’s a lot of free execute slots!
Here’s a not very interesting DFA implementation:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
struct BasicDFA { | |
typedef u8 State; | |
u8 transitions[16][256]; | |
State start_state; | |
BasicDFA(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) { | |
… | |
} | |
State apply(const u8 * data, size_t len, State s) { | |
size_t i = 0; | |
for (; i+7 < len; i+=8) { | |
u8 c1 = data[i+0]; | |
… | |
u8 c8 = data[i+7]; | |
s = transitions[s][c1]; | |
… | |
s = transitions[s][c8]; | |
} | |
for (; i < len; ++i) { | |
s = transitions[s][data[i]]; | |
} | |
return s; | |
} | |
}; |
This isn’t a perfect “simple” DFA implementation; we waste at least 1 cycle of latency in our state-to-state transition on index arithmetic to look up that big array (better, but more obscure, would be to track our state as a location within the transition table).
Note the implementation in full unrolls the loop, too.
However, even given a wasted cycle or two of latency, this implementation is close to the limit of memory latency. The DFA is small (4K) so we will be getting it from L1 cache in the steady state, but that means a state-to-state transition at around 4-5 cycles minimum.
Enter My Little Friend: Sheng
Sheng is a different approach. Sheng uses the PSHUFB instruction to implement the state transitions taken by looking up a shuffle mask for each input character. Note that the lookup operation is not on the critical path, as we know our input characters well in advance.
As such, the critical path for Sheng is just 1 cycle on modern architectures; both recent Intel and AMD processors implement PSHUFB with a single cycle of latency.
The variant of Sheng presented is ‘silent’ – it allows us to calculate which state we’re in at a given point but it has no facility to detect whether a match has occurred. We’ll cover the feature of a non-silent Sheng later; sadly, the number of instructions required to check our state means that we will have to add a lot of extra work – too much work to manage 1 cycle per byte (not a critical path issue – it’s just that it’s hard to do that many operations in a cycle).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
struct Sheng { | |
typedef m128 State; | |
m128 transitions[256]; | |
State start_state; | |
Sheng(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) { | |
// fill all transitions with default state | |
for (u32 i = 0; i < 256; ++i) { | |
transitions[i] = _mm_set1_epi8(default_state); | |
} | |
// fill in state transition for slot 'from' to point to 'to' for our character transition c | |
for (auto p : trans_vec) { | |
u32 from, to; | |
u8 c; | |
std::tie(from, to, c) = p; | |
set_byte_at_offset(transitions[c], from, to); | |
} | |
start_state = _mm_set1_epi8(start_state_); // put everyone into start state – why not? | |
} | |
State apply(const u8 * data, size_t len, State s) { | |
size_t i = 0; | |
for (; i+7 < len; i+=8) { | |
u8 c1 = data[i+0]; | |
u8 c2 = data[i+1]; | |
u8 c3 = data[i+2]; | |
u8 c4 = data[i+3]; | |
u8 c5 = data[i+4]; | |
u8 c6 = data[i+5]; | |
u8 c7 = data[i+6]; | |
u8 c8 = data[i+7]; | |
s = _mm_shuffle_epi8(transitions[c1], s); | |
s = _mm_shuffle_epi8(transitions[c2], s); | |
s = _mm_shuffle_epi8(transitions[c3], s); | |
s = _mm_shuffle_epi8(transitions[c4], s); | |
s = _mm_shuffle_epi8(transitions[c5], s); | |
s = _mm_shuffle_epi8(transitions[c6], s); | |
s = _mm_shuffle_epi8(transitions[c7], s); | |
s = _mm_shuffle_epi8(transitions[c8], s); | |
} | |
for (; i < len; ++i) { | |
s = _mm_shuffle_epi8(transitions[data[i]], s); | |
} | |
return s; | |
} |
So this one is a little weird: we heavily depend on my favorite instruction, PSHUFB, included on most x86 processors since its introduction with SSSE3 (the catchily named “Supplemental Streaming SIMD Extensions 3”).
PSHUFB (_mm_shuffle_epi8 in this code) is a bytewise shuffle, using the low 4 bits of each byte from a control mask register to indicate which byte to copy from the source register to the destination. It can be used to permute data, but it can also be used to effectively look up a 16-wide table.
In this usage, PSHUFB masks are found on a per-character basis. We look up a character from our input and use this mask to look up what our next state should be. For example, in the 5th unrolled iteration, our current state is used to index into this mask (“transitions[c5]”) and by permuting that mask, and this yields our new state.
We keep our canonical state in the bottom lane of the 128-bit register.
As a side note, we could actually be processing 16 DFAs at once, with an almost useless set of limitations:
- The DFAs all have to have the same structure and character transitions.
- The DFAs all have to be acting on the same data.
So really, all we can do is start the DFAs off in different states and then crank those states and see what happens. There is an interesting usage of this (picture what happens when we initialize a register with [0,1,2,3,…, 15] and process a block of data – we now have a function that can be applied as another shufle mask! Details can wait for another followup blog post.
So, what do we get from all this? The main advantage of doing this is speed – here’s the basic comparison of speed between the two systems (measured on a 4Ghz Skylake client machine):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ sudo nice –20 taskset -c 1 ./sheng | |
Sheng | |
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1 | |
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1 | |
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5 | |
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5 | |
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5 | |
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5 | |
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1 | |
final state: 1 bytes scanned: 1638400000 seconds: 0.417347 | |
bytes per ns 3.92575 | |
Basic DFA | |
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1 | |
10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1 | |
20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5 | |
30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5 | |
40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5 | |
50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5 | |
60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1 | |
final state: 1 bytes scanned: 1638400000 seconds: 2.73646 | |
bytes per ns 0.59873 |
(there’s also a basic-level of traces through states included here so that I could verify that the two state machines are basically sane and doing the same thing; see the code)
So we’re processing 3.92 bytes per nanosecond (pretty close to 1 cycle/byte) as opposed to around 0.6 bytes per nanosecond with a basic DFA implementation (which could probably go about 10-20% faster with a more sophisticated table lookup, but not that much more). Sounds good – as long as we can live with the long list of limitations of Sheng.
Sheng has a lot of interesting properties, which I’ll follow up in later posts:
- There are several strategies for having a “noisy” Sheng – that is, one that can stop, raise a callback, or write to a buffer whenever it encounters some “interesting” state (e.g. an accept state).
- There are also a number of ways Sheng can be adapted to handle a larger portion of the pattern matching task.
- These is nothing inherently x86-centric about Sheng. The TBL instructions on Neon could be used to build up the same facility on ARM, and the multiple register variants of these instructions could be used to build 32, 48 or 64-state DFAs.
- An AVX2 machine can run two independent 16-state DFAs at once for the same cost, although there is no cost-free way for them to interact. AVX 512 adaptation of the same techniques allows 4 such independent 16-state DFAs.
- AVX512 also allows other exotic structures, including larger DFAs using the 16-bit permute operations, including the 2-source permutes.
- AVX512 VBMI adds VPERMB and 2-source byte permutes, allowing this technique to be extended to 64 or even 128 states! However, the added latency of these permutes means that a simplistic implementation will be much slower.
- Since PSHUFB is a permute, it’s possible to compute blocks of this operation out-of-order. This can be exploited to improve throughput where latency of an operation is not equal to throughput – this is not true of PSHUFB or VPSHUFB but is true of some of the more recent shuffle instructions (for example, many of the AVX512 16-bit shuffles are latency=7 throughput=2) and will likely be true of the next generation of shuffle instructions.
- Note that a 2-source permute is not straightforwardly handled by this, as in order to turn permutes over a block on input into a function, we must calculate all possible outcomes on all states. This becomes prohibitively expensive with already large operations.
- This out-of-order computation is not particularly suitable where a “noisy” Sheng is required
Until then, I hope you enjoyed Sheng, and you can find the code on Github.
https://github.com/geofflangdale/sheng
[ please note: it is essentially a ‘sketch’, lacking many features and there is approximately zero software engineering applied to it. The Sheng and BasicDFA structures should related through static or dynamic polymorphism so that they can share test drivers, but I didn’t want to start designing a more generalized interface until I have built out more of the Sheng variants, so I used cut-n-paste-polymorphism 🙂 ]
[ also note: yes, there are many ways to make DFAs run faster, including acceleration, gluing the characters together and various other tricks. There are also a bunch of ways to make DFAs run slower; typically by implementing them on specialized hardware add-in cards, then waiting geological ages to get the data to the cards and the matches back. ]
* I independently invented this technique along with some researchers at Microsoft Research; if anyone can recall the paper where this technique is documented, please let me know and I’ll put in a link and appropriate credit.
Update: Anuj Kalia, in comments, identified a Microsoft Research paper that’s possibly what I saw as Data-Parallel Finite-State Machines – Microsoft Research – for the 16-state case, I believe this approach converges to be functionally equivalent to Sheng. We discovered this work only when we went looking to establish originality of Sheng…
** Anatoly Burakov wrote the first implementation of Sheng within Hyperscan. Alex Coyte later extended Sheng to work as part of a much larger DFA, a subsystem which he felt moved to dub “Shengy McShengface”, for reasons he may not be able to adequately explain.