Performance notes on SMH: measuring throughput vs latency of short C++ sequences

A quick update, on the subject of performance measurement, to last week’s post SMH: The Swiss Army Chainsaw of shuffle-based matching sequences.

In that post, I provided throughput numbers for these sequences but didn’t show latency. This is a critical distinction, and it doesn’t pay to be confused about the two. I would rather avoid the cringeworthy formulation from the Mythical Man Month (where women are “assigned” to the task of bearing children!) and stick to the metaphor of boiling eggs: a suitably large pot of boiling water could boil eggs at a throughput of an egg every 10 seconds, but cannot provide you with a 3-minute-boiled egg in less than 3 minutes.

It is important not to confuse the ability to do something in, say, 10 cycles vs the ability to do 1,000 somethings in 10,000 cycles. The former is always at least as hard and usually much harder. This distinction holds all the way down to the single operation level: for example, a modern x86 processor can launch a multiply operation every cycle, but requires 3 cycles to know the result of a given multiply.

Modern computer architecture conspires against us when we wish to measure latency. Attempting to measure the latency of a single short code sequence is quite error-prone due to the overhead of the various performance counter or clock measurement calls.

Throughput is easy to measure on a larger scale, as we can measure thousands of iterations and establish an average cost per iteration. However, well-written code will usually attempt to minimize dependencies from one iteration to the next. When we attempt to measure, say, the branch-free code of SMH, there is little to prevent a modern, out-of-order processor from getting on with the next iteration or two while the previous iteration is handled.

I tried two approaches, both attempting to measure the latency of the various SMH sequences. The first was to insert an LFENCE instruction between each SMH sequence but otherwise keep the code the same. Note that the LFENCE in this case can be switched on and off by a macro.

The second approach was to make the location that was read by an SMH sequence depend on the result of the previous SMH sequence. Since I didn’t want to introduce a spurious ‘jumping around memory’ component to the benchmark (which would always be absent from the equivalent throughput metric), I made sure that the previous SMH sequence always happened to return zero (no match): we know this, but the architecture and the compiler don’t.

Creating long chains of dependent operations is also how Agner Fog (and others) measure latency; those who have not yet seen Agner’s Software optimization resources are in for a treat.

The code to measure SMH latency is below (note that LFENCE is switched off by the preprocessor as needed and was not used in the latency-test version of this code at all):
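The gist isn’t reproduced in this copy of the post, so here is a minimal sketch of the dependent-load approach; the names (matcher, buf, len, USE_LFENCE) are placeholders of mine, not the repository’s API.

// Minimal sketch of the latency-measurement loop (hypothetical names).
// 'tmp' is always zero on this input, but neither the compiler nor the CPU
// can prove that, so iteration i+1 cannot start its load until iteration i
// has produced its match result.
uint32_t tmp = 0;
for (size_t i = 0; i + 16 <= len; i += 16) {
#ifdef USE_LFENCE
    _mm_lfence();                        // alternative: serialize explicitly
#endif
    tmp = matcher.match(buf + i + tmp);  // returns 0 (no match) by construction
}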

Observe the “tmp” variable in the code above; it is always zero, but we cannot safely start our matching operation until the architecture has the result of the previous match operation in hand (Intel Architecture has many fascinating optimizations, but generalized value prediction is not one of them).

This gives us somewhat of a hybrid creature: “steady-state” latency. The compiler and architecture are still free to load things into registers that don’t depend on the actual computation – so this latency number is perhaps unrepresentative of a ‘cold start’. However, it is a reasonable measurement of the latency of a single operation in a well-optimized code base.

SMH Variant    Metric            normal   no unroll   LFENCE
SMH32-loose    Throughput (ns)   0.89     0.98        10.62
               Latency (ns)      7.03     6.92        10.65
SMH32          Throughput (ns)   1.12     1.15        11.02
               Latency (ns)      7.25     7.30        10.89
SMH64-loose    Throughput (ns)   1.35     1.44        11.03
               Latency (ns)      7.63     7.61        11.36
SMH64          Throughput (ns)   1.62     1.66        11.67
               Latency (ns)      7.95     8.00        11.61
SMH128-loose   Throughput (ns)   2.80     2.67        12.39
               Latency (ns)      8.97     8.14        12.91
SMH128         Throughput (ns)   3.32     3.08        12.82
               Latency (ns)      9.78     8.55        12.91

The above numbers seem reasonable based on a walkthrough of the code. I also measured the effect of turning off my manual 8-way unroll. I had focused on smaller models and the metric of throughput as I tuned SMH; it’s marginally interesting, if not decisive, to note that latency is generally better without an unroll in the measurement loop.

The LFENCE results are hard to interpret – they seem to generally track the latency of the normal case plus around 3.5ns. More work is needed to confirm this; it would be nice to have a way of getting a latency number out of the system that doesn’t rely on an ability to introduce contrived data dependencies from one iteration to the next.

I feel reasonably confident that SMH can be said to do its work in 7-9 ns; note that the overlap of iterations required to hit the full throughput (looking at the above table) must be as many as 8 iterations for the cheapest cases. As always, this implies that being stuck in the ‘latency world’ is miserable – try to phrase your computations to stay in the ‘throughput world’ whenever you can.

Updated code is here

Thoughts on how to measure latency are welcome.

SMH: The Swiss Army Chainsaw of shuffle-based matching sequences

Today I’m going to share with you one of my favorite instruction sequences.

PSHUFB
PAND
PSUBB
PCMPGTB
PMOVMSKB
ANDN
ADD
AND
AND
LOAD

OK, got it? That’s the post. Have a nice day.

Perhaps a little more explanation is in order, especially if you weren’t next to my cubicle for what were no doubt 11 very long years during the creation of Hyperscan. The above sequence, which I have dubbed SMH (an acronym which stands for “I am Not Going to Explain”), appears in various forms in Hyperscan, and has a few expansions and variants which I’ll discuss. I think a full expansion of all the fun of SMH, and a codebase to match, will have to wait.

This sequence looks fairly banal – but it can accomplish a huge range of matching tasks, not all of which are traditional pattern matching tasks. Let’s walk through a basic application and we’ll get to more elaborate ones in later posts.

Keep in mind that the basic unit here is asking questions composed of a bunch of byte-level predicates comparing our input to queries we prepared earlier.

The code for this is up at Github at: Shuffle-based predicate matcher and all-round branch free swiss army chainsaw

Baseline Application: Prefix Matching

Everyone likes a good Prefix Matcher – Prefix or Suffix matchers crop up all over the place in pattern matching as a basic building block. There are plenty of traditional implementations of this (generally involving DFAs or hashing/Bloom Filters).

I’m going to focus on a case that allows SMH to be displayed without too much extra complexity: small-set Prefix matching of literal strings no greater than 16 characters.

Suppose we want to match a few animal names and map those to some sort of id: we have {“cat”, “dog”, “mouse”, “moose”, …}. Here’s a straightforward approach to doing this, assuming we have no more than 32 characters in total and a machine handy with AVX2 (if you don’t, buy yourself a better computer and give Grandpa back his Ivy Bridge, or translate everything I’m saying into Neon – there’s nothing here that’s all that Intel specific, but I don’t have a fast ARM workstation so y’all can suck it up and read AVX code):

  1. Load 16 bytes of our input from memory and broadcast it to high and low lanes of an AVX2 register
  2. Shuffle the input with a pre-prepared shuffle mask, so that we can consecutively compare the first 3 bytes against cat at bytes 0..2 of our register, the first 3 bytes again against ‘dog’ in bytes 3..5 of our register, etc.
  3. Compare the input against another pre-prepared mask, this one containing our strings: so the register would pack in ‘catdogmousemoose’.
  4. Use VPMOVMSKB to turn the compare results (which will be 0x00 or 0xff) into single bits in a general purpose register (now, you can put your SIMD units away).

    These steps are illustrated below:
    [figure: drawings-for-smh]

  5. Take the Most Significant Bit of each string in the mask. If we pack our strings from left to right (it doesn’t really matter which way we do the comparisons), the comparison of ‘cat’ puts ‘c’ in bit 0, ‘a’ in bit 1 and ‘t’ in bit 2; bit 2 is the Most Significant or “high” end of that string, so bits 2, 5, etc. form our high mask. We’re going to make a temporary mask which is the result of our VPMOVMSKB with the ‘high’ bits zeroed out.
    I refer to this as ‘digging a hole’. We’ll need that ‘hole’ later.
  6. Now take the corresponding ‘low’ mask (the LSB end of each string) and simply add it to our temporary.
    The effect of this will be that if, and only if, all our other bits in the mask are on, there will be an arithmetic carry into that ‘hole’ I mentioned (and you can see why we needed it – we certainly don’t want to carry beyond that point).
  7. Now AND in the original mask (not the high mask, but the original mask we got from VPMOVMSKB back in step 4). Now the high bits are on if and only if every comparison bit associated with the string is on.
  8. We then AND by our high mask again to turn off everything but our high bits.
  9. We can use LZCNT (leading zero count) to find the first high bit that’s set. We pack our strings in order of priority, so this ‘winning’ bit (in case two bits are on – possible if our strings or later patterns overlap) identifies the highest-priority match. We don’t have to do it this way – we might want to report all matches instead. For now, we’ll report a highest-priority match.
  10. This LZCNT is then used to read from a table that has ID entries only for the bits that have high bits set (the other entries in the table are never read – this wastes some space but means we don’t need to compress down our table).

    Steps 5-10 are illustrated here:
    [figure: drawings-for-smh-2]

So – that’s the basic algorithm.

This gist shows the edited highlights (no debugging guff – I have some code that prints out the process in greater detail if that’s what you’re into) of steps 1-9: there’s really not that much run-time code.
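Since the gist itself isn’t embedded here, the following is a condensed sketch of those steps for the 32-predicate ‘tight’ model. The variable names and the prepared-mask layout (shuf_mask, cmp_mask, hi_mask, lo_mask, ids) are assumptions of mine rather than the repository’s API.

// Condensed sketch of the steps above, 32-predicate 'tight' model.
__m256i in  = _mm256_broadcastsi128_si256(
                  _mm_loadu_si128((const __m128i *)buf));       // step 1
__m256i shf = _mm256_shuffle_epi8(in, shuf_mask);               // step 2
__m256i eq  = _mm256_cmpeq_epi8(shf, cmp_mask);                 // step 3
uint32_t m   = (uint32_t)_mm256_movemask_epi8(eq);              // step 4
uint32_t tmp = _andn_u32(hi_mask, m);                           // step 5: dig the holes
tmp += lo_mask;                                                 // step 6: carry into the holes
tmp &= m;                                                       // step 7: re-check original bits
tmp &= hi_mask;                                                 // step 8: keep only high bits
uint32_t bit = _lzcnt_u32(tmp);                                 // step 9
uint32_t id  = ids[bit];   // step 10: ids[32] would be the 'no match' entry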

There are obvious extensions to handle more than 32 ‘predicates’. We can build a 64-wide version of the above by simply gluing two copies of steps 1-4 together with a shift-by-32 and an OR at the end. A 128-wide version is four copies of steps 1-4 and two separate copies of steps 5-9, with a little bit of logic (a conditional move and an add) to turn the results of the two 64-bit LZCNTs into a single 128-bit LZCNT.
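That last bit of logic might look something like this (a sketch, assuming ‘hi’ holds bits 127..64 of the combined mask and ‘lo’ holds bits 63..0):

// Sketch: a 128-bit leading-zero count from two 64-bit LZCNTs.
uint64_t lz_hi = _lzcnt_u64(hi);
uint64_t lz_lo = _lzcnt_u64(lo);
// typically compiles to a conditional move plus an add, so it stays branch-free
uint64_t lz128 = (hi != 0) ? lz_hi : 64 + lz_lo;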

The above sequence is fast, and reliably so: it doesn’t have any branches and it doesn’t make complex use of memory, so its performance is pretty much constant regardless of input.

It can be made faster if we take a few shortcuts – suppose we have plenty of room in our model, whether it be 32, 64 or 128 bits. We might have, say, 4 strings with 5 characters each, so we’re consuming only 20 slots in a 32-bit model. In this case, why ‘dig holes’? Instead, we can reserve a slot (“gutters” instead of “holes”?) at the high end of the string with a guaranteed zero compare – this means that all we need to do is add our low mask and filter out the carries from the high end, so the ANDN/ADD/AND/AND sequence loses 2 instructions. We refer to this as the “loose fit” model as opposed to the “tight fit” model.
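In code, the difference between the two tails might look like this (a sketch, using the same assumed mask names as above; ‘gutter_mask’ marks the reserved always-zero slots):

// Tight fit: dig holes, carry, re-check, isolate (ANDN, ADD, AND, AND).
uint32_t tight = ((_andn_u32(hi_mask, m) + lo_mask) & m) & hi_mask;
// Loose fit: the guaranteed-zero 'gutter' above each string is already a hole,
// so only the carry and the isolation remain (ADD, AND).
uint32_t loose = (m + lo_mask) & gutter_mask;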

Here are the performance numbers in nanoseconds on a 4.0 Ghz SKL workstation:

Fit     Predicate Count   ns per sequence (throughput)
Loose   32                0.888
Loose   64                1.38
Loose   128               2.82
Tight   32                1.14
Tight   64                1.65
Tight   128               3.37

I describe these as throughput numbers, because you won’t get a single SMH lookup done at anything like these speeds – the latency of these sequences is much higher than the throughput you can get if you have lots of these sequences to do.

A future blog post will look into generalized ways to measure this latency specifically (I had an unconvincing experiment with LFENCE). In the meantime, be sure to think carefully about experiments where you “do something in N cycles” when you actually mean “doing 10,000 somethings is N*10,000 cycles”. I recommend the following three step program. 1) Cook 18 eggs in a pot for 3 minutes. 2) Convince yourself that throughput == latency. 3) Throw away your original eggs, cook another egg for 10 seconds and bon appétit!

Here’s a debug output showing a couple iterations of one of our models (SMH32-loose, the simplest model). I use underscore instead of zero as it’s easier to read:

The curious can go to Github and poke the system with a stick. I cannot guarantee that it won’t fall over; this project is very preliminary.

SMH: Full Sequence

(I will not illustrate this with a movie reference)

At this point, you might be excused for wondering why I called SMH the “Swiss Army Chainsaw”. The instantiation above is pretty simplistic and seems to be covering literals only. The codebase as it stands really only allows this; there’s no compiler to do anything but literal prefixes.

However…

1. Shuffle allows discontinuous things to be compared

Because we are using shuffle, we don’t have to select contiguous parts of a string. Better yet, we don’t have to pay for bits of the string we don’t select to compare on. Using regex notation, the very same sequence could just as easily match /a…b/s as /ab/ and both take the same number of predicate slots.

This means we can range over data structures (subject to our limit of 16 bytes; more on that later); we don’t just have to compare strings in buffers.

2. The full sequence allows masking, ranged comparison and negation

The full sequence adds a couple extra SIMD instructions to the front-end, and changes the nature of the comparison from ‘equal’ to ‘greater-than’.

Inserting an AND into our sequence allows us to carry out a ‘masked’ comparison. This allows us to do some fairly simple things – e.g. make caseless comparisons of alphabetic ASCII characters, check individual bits, or look for a range of values like 0x0-0x3 (but not an arbitrary range).

But even more fun – leave the AND in the sequence, then carry out a subtract and change the comparison to a ‘greater-than’ comparison (PAND, PSUBB, PCMPGTB).

This gives us all the old power we had before: if we want to target a given value for ‘equal-to’, we simply use PSUBB to ensure that it’s now at the maximum possible value (+127, as the AVX2 comparison is on signed bytes) and compare-greater-than with the value +126. However, we can now also detect ranges, and we can even negate single characters and ranges – it’s just a matter of picking the right subtract (really, PADDB works just as well) and compare values.
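As a sketch of how the constants could be chosen (the names are mine, and in the real sequence every byte slot carries its own constant rather than a single broadcast value):

// Equality with value V:  sub = V - 127,  cmp = 126   (only V maps to +127).
// Range [LO, HI]:         sub = HI - 127, cmp = LO - HI + 126
//                         (values in range map to [LO-HI+127 .. 127]; everything
//                          else wraps to something smaller as a signed byte).
__m256i t   = _mm256_and_si256(shuffled, and_mask);   // optional masking step
t           = _mm256_sub_epi8(t, sub_mask);
__m256i hit = _mm256_cmpgt_epi8(t, cmp_mask);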

I admit I have not really thought through whether there is interesting power granted by the combination of PAND and the PSUBB/PCMPGTB together; I have really only thought about these in a one-at-a-time fashion. It might be better to follow the PSUBB with the PAND. Insights welcome.

3. The bit arithmetic at the end of the sequence can model more than just ADD

The sequence – whether the ‘tight model’ (ANDN, ADD, AND, AND) or the ‘loose model’ (ADD, AND) – carried out over the general purpose registers is used to calculate a series of ANDs of variable-length bitfields and produce exactly one result per bitfield.

It has more possibilities:

  1. With a bit of work, it’s possible to handle overlap within the predicates. Suppose we have two strings “dogcow” and “dog”. We can overlap these as long as we are either (a) using the loose model (so we have a safe spot after ‘dogcow’ already) or (b) separating our ‘dig our holes’ masks from our ‘extract the final comparison’ masks. After all, if we overlap “dogcow” and “dog” we don’t want to ‘dig a hole’ at ‘g’, or else we don’t get a carry all the way to the ‘w’. So in a tight model we will still ‘dig a hole’ at ‘w’, but we will have to treat seeing a ‘1’ after that process is done differently – in fact, we will find a match at the positions of both ‘g’ and ‘w’ in “dogcow”.

    Note we also need to make sure that the index for successfully seeing “dog” is copied not just to the slot corresponding to “g”, but also “c” and “o”, as if we see “dogco” that’s a match for “dog” but not “dogcow”.

  2. Bizarrely, we can handle OR over some predicates, in some order, and even nest at least one AND-term within that OR (but not two).

    So, suppose we wanted to match (in regex notation) /ab[cx]/ – “a, followed by b, followed by c or x”. We could use 4 slots, comparing the first 2 characters in the typical way, then using 2 byte-predicates against the third. Then, instead of adding the equivalent of binary 0b0001 (the way we would for, say, /abcx/), we add binary 0b0011 – so either the ‘c’ matching or the ‘x’ matching will cause a carry out of our last 2 places (there’s a small worked sketch after this list). The only difference that results is what’s left behind in bits we don’t care about anyway.

    Even more odd: suppose we wanted to match /ab(cd|x)/. We can still do this – by ordering our predicates to match a, b, x, c and d in appropriate places. We then add 0b00101 to the mask, which gets the carry we need iff we have “cd” or “x”.

    It is not possible to do this trick for an arbitrary combination and something as simple as /ab(cd|xy)/ cannot be done. Only a boolean function where some ordering of the variables allows us to arrange all possible ‘true’ values above or below all possible ‘false’ values can be handled in this way.

    If anyone has any theoretical insight into how to express which functions can and can’t be modeled, please let me know!
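Here’s a small worked sketch of the /ab[cx]/ case from item 2 above; the packing (the two [cx] alternatives as the lowest predicate bits, then ‘b’, then ‘a’, with a guaranteed-zero ‘gutter’ above) is my own choice for illustration.

#include <cassert>
#include <cstdint>

int main() {
    // bit0 = (byte2 == 'c'), bit1 = (byte2 == 'x'),
    // bit2 = (byte1 == 'b'), bit3 = (byte0 == 'a'), bit4 = gutter (always 0)
    const uint32_t low_add = 0b00011;  // a 1 under each of the [cx] alternatives
    const uint32_t gutter  = 0b10000;
    uint32_t m_abx = 0b01110;          // input "abx": the a, b and x predicates are true
    uint32_t m_abq = 0b01100;          // input "abq": only a and b are true
    assert(((m_abx + low_add) & gutter) != 0);  // carry reached the gutter: match
    assert(((m_abq + low_add) & gutter) == 0);  // no carry: no match
    return 0;
}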

Future thingies

Needless to say, this can all be made far more powerful (in case it wasn’t powerful enough) with better instructions. While my long-standing love affair with PSHUFB has already been demonstrated and will be demonstrated again, the limitation of 16-character range is irritating. VBMI and VBMI2 in particular introduce more powerful capabilities, and ARM Neon machines already have the means to do generalized byte shuffles over 1-4 registers. There are also possibilities for using larger data items for both shuffles and for the compares (this yields a different set of functionality).

If anyone wants to send me a Cannonlake machine, or an ARM Neon workstation, I’ll give it a shot, without fear or favor.

A pre-shuffle using the existing AVX2 or AVX512 permutes could extend the range of the finer-grain PSHUFB shuffles, although the functionality provided is a bit complex and hard to characterize.

All this will be explored in due course, but not until the basic SMH models have been fleshed out and grown a slightly usable API.

Summary: The Case for the Swiss Army Chainsaw

Armed with the above features, it seems possible to handle an extraordinary number of possibilities. Note that range checks on larger data types than simple bytes can be (somewhat laboriously) composed from these features, and obviously the whole range of mask checks (as per many network operations) is easily available.

These comparisons could also be used to combine quantitative checks (e.g. is TTL field > some value) with checks of strings or portions of strings in fixed locations.

The fact that we have a logical combination of predicates could allow checking of a type field to be combined with checks that are specific only to the structure associated with that type field – so a data structure which leads with a type field and has several different interpretations for what follows could be checked for properties branchlessly.

Needless to say, actually doing this is hard. The run-time is easy (add 2-3 more instructions!); the compiler and a good API to express all this – not so easy. I’ve illustrated a simple application in this post, and can see more, but have to admit I don’t really quite understand the full possibilities and limitations of this sequence.

Postscript: A Note on Comparisons to Trent Nelson’s Prefix Matcher

This work has some similarities to the recent Is Prefix Of String In Table work by Trent Nelson. I’d point out a few issues:

  1. This work is measured in a tighter measurement loop than Trent allows himself, so these numbers are probably artificially better. I’m allowing myself to inline the matching sequence into my measurement code, as the SMH sequence here is in a header-only library. I don’t think Trent is doing the same thing, so he may be paying some function prologue/epilogue costs. If I get rid of my unrolls and inlines I lose a couple of cycles.
  2. The SMH work presented here isn’t really intended for a larger-scale standalone: it would be more typical to embed it as a second-stage after a first-stage lookup.
  3. Despite this, out of curiosity, I tried putting Trent’s testing strings into my matcher – one of the strings is too long (I can’t handle >16 length strings) but after that one is trimmed to length 16, the matcher can fit those strings in the 128-predicate ‘loose’ model at about 11.3 cycles throughput.
  4. If one were to select on (say) the first 11 bits or so of the buffer and look up a SMH data structure after 1 layer of indirection (SMH is too big to really make it fun to have 2048*sizeof(SMH) bytes, so indirection is needed) – a simple load, AND, load sequence, it seems obvious that the “loose 32” model could cover the case (3.6 cycles throughput plus added costs from the loads and AND, as well as whatever costs happen from having to go back and forth between multiple SMH sequences rather than using a statically determined set of sequences continuously). As a bonus, this arrangement could strip the first character out of the workload, allowing us to cover 17 characters.
  5. The SMH code is branch-free and won’t be affected by branch prediction. Trent’s performance analysis doesn’t really cover the effects of branch prediction as it involves benchmarking with the same string over and over again – it is, of course, very hard to come up with realistic benchmarks as there’s no real ‘natural population’ of input to draw from.
  6. I don’t yet bother to deal with length of the strings, which is naughty. Trent is more responsible. However, the more full SMH model is capable of including length information in its predicate calculations, and an alternative strategy of ‘poisoning’ our input (loading ‘out of range’ characters with values that cannot occur at that position in any valid string – not hard when you only have 16 different strings and no wildcards) is also available.


“Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton

Deterministic Finite Automata (DFA, subsequently) are a fundamental structure. Most state machines that programmers build are some variant on a DFA, whether they are built by jumping around inside a switch statement or moving from state to state in a table structure.

They have uses all over the place; they are used heavily in regular expression implementation, and can be used in various validation schemes such as UTF-8 validation. I’m going to show a curious little DFA of my own invention* that we used in Hyperscan**. The presentation here will be an independent re-implementation as the version in Hyperscan is buried in some pretty complex code.

Sheng has some pretty tight limitations, especially in the version I’m presenting here:

  1. It cannot have more than 16 states.
  2. This version of Sheng is ‘quiet’ – it calculates states but doesn’t have an ‘accept state’ that is actively raised. So you can’t detect a regular expression and get a callback or an index as to where it matched.
  3. This version of Sheng is also a bare DFA without a compiler. You need to put the transitions of the state machine in manually.
  4. This version of Sheng depends on x86 instructions, but the principles could allow the extension of Sheng to any system with a similar permute instruction, such as ARM NEON.

Most important: Sheng uses my favorite instruction, PSHUFB!

The Problem in Plain DFA implementations: Memory Latency

A typical problem for DFA implementations is that, at best, each state transition involves a single memory access. More compact implementations may use several. Worse still, each of these state transitions depends on the previous one, so a simple DFA cannot run faster than the latency of the lowest level of cache (often plus a cycle, if things need to be done to the value loaded from the transition table to make it suitable for another state transition).

This is the critical path of the DFA: the state-to-state transition. Other activities, such as remapping characters to a smaller character set to save space, or checking for accept states, are not on the critical path and are almost ‘free’ in a typical implementation – after all, we’re waiting for the state transition to finish. That’s a lot of free execute slots!

Here’s a not very interesting DFA implementation:
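The snippet isn’t embedded in this copy of the post; the following is a minimal sketch of the kind of table-driven implementation being described, assuming a 16-state by 256-character table with one byte per entry (4KB in total). The real code is in the repository linked below.

// Minimal sketch of a plain table-driven DFA (not the repository code).
uint8_t run_basic_dfa(const uint8_t table[16][256],
                      const uint8_t *buf, size_t len, uint8_t state) {
    for (size_t i = 0; i < len; ++i) {
        // one dependent L1 load per input byte: this is the critical path
        state = table[state][buf[i]];
    }
    return state;
}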

This isn’t a perfect “simple” DFA implementation; we waste at least 1 cycle of latency in our state-to-state transition on index arithmetic to look up that big array (better, but more obscure, would be to track our state as a location within the transition table).

Note that the full implementation unrolls the loop, too.

However, even given a wasted cycle or two of latency, this implementation is close to the limit of memory latency. The DFA is small (4K) so we will be getting it from L1 cache in the steady state, but that means a state-to-state transition at around 4-5 cycles minimum.

Enter My Little Friend: Sheng

Sheng is a different approach: it uses the PSHUFB instruction to implement state transitions, looking up a shuffle mask for each input character. Note that the mask lookup is not on the critical path, as we know our input characters well in advance.

As such, the critical path for Sheng is just 1 cycle on modern architectures; both recent Intel and AMD processors implement PSHUFB with a single cycle of latency.

The variant of Sheng presented is ‘silent’ – it allows us to calculate which state we’re in at a given point but it has no facility to detect whether a match has occurred. We’ll cover the feature of a non-silent Sheng later; sadly, the number of instructions required to check our state means that we will have to add a lot of extra work – too much work to manage 1 cycle per byte (not a critical path issue – it’s just that it’s hard to do that many operations in a cycle).

So this one is a little weird: we heavily depend on my favorite instruction, PSHUFB, included on most x86 processors since its introduction with SSSE3 (the catchily named “Supplemental Streaming SIMD Extensions 3”).

PSHUFB (_mm_shuffle_epi8 in this code) is a bytewise shuffle, using the low 4 bits of each byte from a control mask register to indicate which byte to copy from the source register to the destination. It can be used to permute data, but it can also be used to effectively look up a 16-wide table.

In this usage, PSHUFB masks are looked up on a per-character basis. We read a character from our input and use its mask to look up what our next state should be. For example, in the 5th unrolled iteration, the input character selects the mask (“transitions[c5]”), and the permute uses our current state to index into that mask, yielding our new state.

We keep our canonical state in the bottom lane of the 128-bit register.
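Putting that together, the heart of Sheng is roughly the following. This is a sketch with assumed names (the repository version is unrolled 8 ways and carries some extra bookkeeping); ‘masks[c]’ is the prepared PSHUFB mask for input character c, with lane i holding the next state for current state i.

// Sketch of the Sheng transition and its driving loop.
static inline __m128i sheng_step(__m128i state, const __m128i *masks, uint8_t c) {
    // the permute looks up masks[c][state] in every lane; only the bottom lane
    // carries the canonical state, the other lanes just come along for the ride
    return _mm_shuffle_epi8(masks[c], state);
}

__m128i state = _mm_setzero_si128();            // start state 0 in every lane
for (size_t i = 0; i < len; ++i) {
    state = sheng_step(state, masks, buf[i]);   // critical path: one PSHUFB per byte
}
uint8_t final_state = (uint8_t)_mm_extract_epi8(state, 0);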

As a side note, we could actually be processing 16 DFAs at once, albeit with a set of limitations that makes this almost useless:

  1. The DFAs all have to have the same structure and character transitions.
  2. The DFAs all have to be acting on the same data.

So really, all we can do is start the DFAs off in different states and then crank those states and see what happens. There is an interesting usage of this (picture what happens when we initialize a register with [0,1,2,3,…,15] and process a block of data – we now have a function that can be applied as another shuffle mask!). Details can wait for another follow-up blog post.

So, what do we get from all this? The main advantage of doing this is speed – here’s the basic comparison of speed between the two systems (measured on a 4Ghz Skylake client machine):

(there’s also a basic level of state-trace output included here so that I could verify that the two state machines are basically sane and doing the same thing; see the code)

So we’re processing 3.92 bytes per nanosecond (pretty close to 1 cycle/byte) as opposed to around 0.6 bytes per nanosecond with a basic DFA implementation (which could probably go about 10-20% faster with a more sophisticated table lookup, but not that much more). Sounds good –  as long as we can live with the long list of limitations of Sheng.

Sheng has a lot of interesting properties, which I’ll follow up in later posts:

  • There are several strategies for having a “noisy” Sheng – that is, one that can stop, raise a callback, or write to a buffer whenever it encounters some “interesting” state (e.g. an accept state).
  • There are also a number of ways Sheng can be adapted to handle a larger portion of the pattern matching task.
  • There is nothing inherently x86-centric about Sheng. The TBL instructions on Neon could be used to build up the same facility on ARM, and the multiple-register variants of these instructions could be used to build 32-, 48- or 64-state DFAs.
  • An AVX2 machine can run two independent 16-state DFAs at once for the same cost, although there is no cost-free way for them to interact. An AVX-512 adaptation of the same techniques allows 4 such independent 16-state DFAs.
  • AVX512 also allows other exotic structures, including larger DFAs using the 16-bit permute operations, including the 2-source permutes.
  • AVX512 VBMI adds VPERMB and 2-source byte permutes, allowing this technique to be extended to 64 or even 128 states! However, the added latency of these permutes means that a simplistic implementation will be much slower.
  • Since PSHUFB is a permute, it’s possible to compute blocks of this operation out-of-order. This can be exploited to improve throughput where latency of an operation is not equal to throughput – this is not true of PSHUFB or VPSHUFB but is true of some of the more recent shuffle instructions (for example, many of the AVX512 16-bit shuffles are latency=7 throughput=2) and will likely be true of the next generation of shuffle instructions.
    • Note that a 2-source permute is not straightforwardly handled by this, as in order to turn permutes over a block on input into a function, we must calculate all possible outcomes on all states. This becomes prohibitively expensive with already large operations.
    • This out-of-order computation is not particularly suitable where a “noisy” Sheng is required.

Until then, I hope you enjoyed Sheng, and you can find the code on Github.

https://github.com/geofflangdale/sheng

[ please note: it is essentially a ‘sketch’, lacking many features, and there is approximately zero software engineering applied to it. The Sheng and BasicDFA structures should be related through static or dynamic polymorphism so that they can share test drivers, but I didn’t want to start designing a more generalized interface until I have built out more of the Sheng variants, so I used cut-n-paste polymorphism 🙂 ]

[ also note: yes, there are many ways to make DFAs run faster, including acceleration, gluing the characters together and various other tricks. There are also a bunch of ways to make DFAs run slower; typically by implementing them on specialized hardware add-in cards, then waiting geological ages to get the data to the cards and the matches back. ]

* I independently invented this technique along with some researchers at Microsoft Research; if anyone can recall the paper where this technique is documented, please let me know and I’ll put in a link and appropriate credit.

Update: Anuj Kalia, in comments, identified a Microsoft Research paper that’s possibly what I saw as Data-Parallel Finite-State Machines – Microsoft Research – for the 16-state case, I believe this approach converges to be functionally equivalent to Sheng. We discovered this work only when we went looking to establish originality of Sheng…

** Anatoly Burakov wrote the first implementation of Sheng within Hyperscan. Alex Coyte later extended Sheng to work as part of a much larger DFA, a subsystem which he felt moved to dub “Shengy McShengface”, for reasons he may not be able to adequately explain.

Bits to indexes in BMI2 and AVX-512

[ Please bear with the terrible formatting of the table in this post; I was pretty startled at how limited my options were from a vanilla formatter. Opinions on a better method are welcome. ]

Daniel Lemire, in his post Iterating over set bits quickly (SIMD edition) discusses several techniques to iterate over set bits quickly – or more precisely, to turn a collection of bits into a variable-length buffer full of integers indicating which bits were set.

So, if your code gets given an array with the following 16-bit bitfields (assuming little-endian order):

0x1001, 0x0003, 0xffff

you would want the answer:

indexes = 0, 12, 16, 17, 32, 33, 34, ... , 46, 47

This is an important operation. While it’s a lot of fun to do clever things with SIMD, sooner or later you may need to do something specific with the bits you found in your SIMD registers. For example, we used a number of SIMD techniques in Hyperscan to search for strings, but eventually you would have to report that you’d found something to the rest of the system.

After reading Daniel’s post and, more importantly, taking some time to hack on an AVX-512 system that he generously shared access to, I think I have invented a new, branch-free way of solving this problem for 64-bit integers. There is the small catch that you will have to have an AVX-512 capable system handy.

(I say I think I invented this as it’s quite possible that (a) I’ve absorbed this technique from somewhere and forgot, or (b) someone else has already independently invented this)

Here’s the technique.

Let’s rig up a bunch of masks with alternating blocks of one and zero bits:

uint64_t msk_1 = 0xffffffff00000000ULL;
uint64_t msk_2 = 0xffff0000ffff0000ULL;
uint64_t msk_3 = 0xff00ff00ff00ff00ULL;
uint64_t msk_4 = 0xf0f0f0f0f0f0f0f0ULL;
uint64_t msk_5 = 0xccccccccccccccccULL;
uint64_t msk_6 = 0xaaaaaaaaaaaaaaaaULL;

Now, suppose I have a bitvector in v that I’d like to turn into a bunch of indexes. I can get a start by doing this:

uint64_t v1 = _pext_u64(msk_1, v);
uint64_t v2 = _pext_u64(msk_2, v);
uint64_t v3 = _pext_u64(msk_3, v);
uint64_t v4 = _pext_u64(msk_4, v);
uint64_t v5 = _pext_u64(msk_5, v);
uint64_t v6 = _pext_u64(msk_6, v);

What did this achieve? Well, suppose I have bit 11 set in v and nothing else. Looking into my masks, I can see that my PEXT operation (a fast bitwise extract) got me a 1-bit from msk_6, a 1-bit from msk_5, a 0-bit from msk_4, a 1-bit from msk_3 and 0-bits otherwise. These bits will all be deposited into the least significant bits of the v1 through v6 temporaries.

In other words, for each set bit, I’m extracting the bit pattern of its index from the masks and depositing that bit pattern in the least significant bits of my v1 through v6 temporary values.
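A tiny self-check of that claim, using the masks above (requires BMI2):

#include <cassert>
#include <cstdint>
#include <immintrin.h>

int main() {
    uint64_t v = 1ULL << 11;                           // only bit 11 set; 11 = 0b001011
    assert(_pext_u64(0xffffffff00000000ULL, v) == 0);  // 32s place
    assert(_pext_u64(0xffff0000ffff0000ULL, v) == 0);  // 16s place
    assert(_pext_u64(0xff00ff00ff00ff00ULL, v) == 1);  //  8s place
    assert(_pext_u64(0xf0f0f0f0f0f0f0f0ULL, v) == 0);  //  4s place
    assert(_pext_u64(0xccccccccccccccccULL, v) == 1);  //  2s place
    assert(_pext_u64(0xaaaaaaaaaaaaaaaaULL, v) == 1);  //  1s place
    return 0;
}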

So, in the unlikely event that you were hoping to get the right answers, annoyingly bit-smeared across 6 different uint64_t variables, you’re done. But that’s probably not very satisfying. We’ll get to that.

So how do we interleave these 6 values together? This looks pretty ugly – we’re looking at 384 total bits in the worst case of answers. So this doesn’t seem like something we can do fast in the General Purpose Registers. Let’s go to SIMD.

The principle we will apply is that we will use AVX-512’s facility to use 64-bit mask to control a SIMD computation. We will take our 6 values and use them to control the progressive adding of 32, 16, 8, 4, 2 and 1 into a result.
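The v32_bit through v1_bit values aren’t declared in the snippet below; presumably they are byte-broadcast constants along these lines:

const __m512i v32_bit = _mm512_set1_epi8(32);
const __m512i v16_bit = _mm512_set1_epi8(16);
const __m512i v8_bit  = _mm512_set1_epi8(8);
const __m512i v4_bit  = _mm512_set1_epi8(4);
const __m512i v2_bit  = _mm512_set1_epi8(2);
const __m512i v1_bit  = _mm512_set1_epi8(1);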

__m512i vec;
vec = _mm512_maskz_add_epi8(v1, v32_bit, _mm512_set1_epi8(0));
vec = _mm512_mask_add_epi8(vec, v2, v16_bit, vec);
vec = _mm512_mask_add_epi8(vec, v3, v8_bit, vec);
vec = _mm512_mask_add_epi8(vec, v4, v4_bit, vec);
vec = _mm512_mask_add_epi8(vec, v5, v2_bit, vec);
vec = _mm512_mask_add_epi8(vec, v6, v1_bit, vec);

Now vec holds the answer we wanted, if we just wanted a bunch of bytes on our output, ranging from 0..63. Unfortunately, we need to write some not very interesting code if we’re doing this over a large range, where we imagine that our offsets might be much larger than a byte. If we’re working continuously over inputs >64K, we would expect to need 4 byte answers. In order to write out up to 64 uint32_t offsets, we’re going to have to spread out our results over 4 registers (spreading the bytes over u32 units), add in a value ‘k’ representing the base offset of our 64-bit value to begin with, and write all 4 of these big registers out.

__m512i base = _mm512_set1_epi32(k*64);
__m512i r1 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,0));
__m512i r2 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,1));
__m512i r3 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,2));
__m512i r4 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,3));

r1 = _mm512_add_epi32(r1, base);
r2 = _mm512_add_epi32(r2, base);
r3 = _mm512_add_epi32(r3, base);
r4 = _mm512_add_epi32(r4, base);
_mm512_storeu_si512((__m512i *)out, r1);
_mm512_storeu_si512((__m512i *)(out + 16), r2);
_mm512_storeu_si512((__m512i *)(out + 32), r3);
_mm512_storeu_si512((__m512i *)(out + 48), r4);

(note that ‘out’ is a pointer to uint32_t, so those offsets of +16, +32 and +48 elements are actually +64, +128 and +192 bytes).

Alert readers will note that this code is writing a lot of stuff out. What happens if we only had 1 bit set? Or 0? Well, this blog isn’t called “Branch Free” for nothing.

More seriously, the point is that it’s usually cheaper to do the same thing every time rather than run the risk of a branch mispredict. Looking back at the code above – sure, it looks like a giant bolus of code. But a branch miss on a modern architecture is around 14 cycles. That’s a lot of missed opportunity to do work.

Even if you accept my above philosophy of doing tons of potentially redundant work over risking a branch miss, there’s one more question – we need to know where our next write should be:

uint8_t advance = __builtin_popcountll(v);
out += advance;

That just moves ‘out’ up (remember, ‘out’ is a pointer to uint32_t for pointer-math purposes) past the indexes we actually wrote. And we’re done!

Is it fast?

Here’s a rough spreadsheet of the results measured against several of the other methods described in Daniel’s article. It’s faster than most of the other methods, falling down only for very low ‘bitmap densities’. For these lower densities, taking a conditional branch with the prospect that the expected number of bits set in a word is very low is a winning proposition.

Bitmap density   Method                              Cycles per index
0.03             bitmap_decode_ctz                    3.852
                 bitmap_decode_avx2                  10.116
                 bitmap_decode_avx2_turbo            14.363
                 bitmap_decode_avx2_turbo_thin       15.736
                 bitmap_decode_avx2_turbo_nopopcnt   12.624
                 bitmap_decode_bmi2_avx512           12.9
0.12             bitmap_decode_ctz                    4.97
                 bitmap_decode_avx2                   3.003
                 bitmap_decode_avx2_turbo             4.205
                 bitmap_decode_avx2_turbo_thin        4.547
                 bitmap_decode_avx2_turbo_nopopcnt    3.732
                 bitmap_decode_bmi2_avx512            2.481
0.25             bitmap_decode_ctz                    4.251
                 bitmap_decode_avx2                   1.52
                 bitmap_decode_avx2_turbo             2.09
                 bitmap_decode_avx2_turbo_thin        2.265
                 bitmap_decode_avx2_turbo_nopopcnt    1.861
                 bitmap_decode_bmi2_avx512            1.25
0.5              bitmap_decode_ctz                    3.446
                 bitmap_decode_avx2                   0.796
                 bitmap_decode_avx2_turbo             1.042
                 bitmap_decode_avx2_turbo_thin        1.131
                 bitmap_decode_avx2_turbo_nopopcnt    0.92
                 bitmap_decode_bmi2_avx512            0.616
0.9              bitmap_decode_ctz                    3.037
                 bitmap_decode_avx2                   0.444
                 bitmap_decode_avx2_turbo             0.574
                 bitmap_decode_avx2_turbo_thin        0.628
                 bitmap_decode_avx2_turbo_nopopcnt    0.509
                 bitmap_decode_bmi2_avx512            0.366

Is this a great idea? I don’t know.

There are no doubt other methods to use AVX512 to transform bit vectors in this fashion, and for a relatively low ‘population’ there are a number of ways the bitmap_decode_ctz code can be made to run faster (possibly the topic of another article).

I still think it’s an interesting ‘trick’ and it’s nice to take my second-favorite instruction (PEXT) out for a spin.

Let me know if you’ve seen this trick somewhere before and I’ll be happy to credit where credit is due. As I said, I think I invented it…

The code is available at Daniel Lemire’s Github with an error (my fault, apparently I thought 8+2 = 9) which will be corrected in due course.

ps. In the ‘dynamiting the trout stream’ category, I give you VPCOMPRESSB from Intel® Architecture Instruction Set Extensions Programming Reference (PDF) which will greatly simplify all the above trickery, once we have AVX512_VBMI2 capable machines (Ice Lake time-frame).

pps. There is also a branch-free means where VPCOMPRESSD can be used four times on 16-bit words to solve a similar problem on machines that are publicly available now. This can be left as an exercise for the reader. It might be faster than the BMI2 stuff, but it lacks style points.

Introduction and welcome

Hello, world.

This is my blog where I will talk about things that interest me (and a no doubt small collection of others). Topics that I’m interested in include:

  • Low-level and performance-oriented programming
  • Computer architecture (especially as it relates to performance-oriented code)
  • Programming languages
  • Regular expression implementation and automata theory
  • … and of course, implementing things without branches! Thus the name.

I was the designer of the Hyperscan project. I built this system at Sensory Networks, which was acquired by Intel Corporation in 2013, and worked on Hyperscan at Intel for over 4 years.

I hope that I can show you some interesting things. I have a few things in the pipeline that I will show shortly, including some string matching work, fast Random Forest implementation and a lot of my favorite low-level coding tips and tricks.

I ask that my readers bear with me and forgive the (hopefully temporary) amateurish nature of the site; I am not an expert blogger or user of WordPress.