Paper: Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs

I’m pleased to report that Hyperscan, the regular expression matcher that ate my life from 2008 through 2018, finally has a paper (pdf) – it’s being presented this week at NSDI ’19.

Anyone who lacks a taste for my discursive bullshit can stop reading this blog post right there and go read the paper… Bye!

It’s a good paper and it works as a standalone read. I’d like to call out a few things in it as particularly interesting and worth reviewing. I’d also like to mention a few things in Hyperscan that we couldn’t write about for reasons of space – we were hard up against the space limits for any conceivable conference paper, so there were whole subsystems and implementation strategies that we couldn’t write about. I will also mention a few aspects of Hyperscan historically that didn’t make it into the Hyperscan 4.x release series (the first open source release of Hyperscan and a substantial API change) or the Hyperscan 5.x release series (a major release number that largely celebrates the now Australian-free aspect of Hyperscan).

Hyperscan Summary

It’s a software-based, large-scale regex matcher designed to match multiple patterns at once (up to tens of thousands of patterns) and to ‘stream’ (that is, to match patterns across many different ‘stream writes’ without holding on to all the data you’ve ever seen). To my knowledge this combination makes it unique.
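To make the streaming model concrete, here is a minimal sketch against the open source C API (made-up patterns and data, error checking elided; illustrative, not production code):

    #include <hs/hs.h>
    #include <cstdio>

    // Called for every match; offsets are global across all stream writes.
    static int on_match(unsigned int id, unsigned long long from,
                        unsigned long long to, unsigned int flags, void *ctx) {
        std::printf("pattern %u matched, ending at offset %llu\n", id, to);
        return 0; // returning non-zero would halt scanning
    }

    int main() {
        const char *exprs[] = {"foo\\d+", "bar[^\\n]*baz"};
        unsigned int flags[] = {HS_FLAG_DOTALL, HS_FLAG_DOTALL};
        unsigned int ids[] = {1, 2};

        hs_database_t *db = nullptr;
        hs_compile_error_t *err = nullptr;
        hs_compile_multi(exprs, flags, ids, 2, HS_MODE_STREAM,
                         nullptr, &db, &err);

        hs_scratch_t *scratch = nullptr;
        hs_alloc_scratch(db, &scratch);

        // A match may span these two writes; Hyperscan retains only its
        // (bounded) stream state between them, not the data itself.
        hs_stream_t *stream = nullptr;
        hs_open_stream(db, 0, &stream);
        hs_scan_stream(stream, "xxfoo1", 6, 0, scratch, on_match, nullptr);
        hs_scan_stream(stream, "23yy", 4, 0, scratch, on_match, nullptr);
        hs_close_stream(stream, scratch, on_match, nullptr);

        hs_free_scratch(scratch);
        hs_free_database(db);
        return 0;
    }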

RE2 is software-based but doesn’t scale to large numbers of patterns; nor does it stream (although it could). It occupies a fundamentally different niche to Hyperscan; we compared the performance of RE2::Set (the RE2 multiple-pattern interface) to Hyperscan a while back.
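(For readers who haven’t used it, RE2::Set usage looks roughly like this – the patterns here are made up:)

    #include <re2/set.h>
    #include <vector>

    int main() {
        RE2::Set set(RE2::DefaultOptions, RE2::UNANCHORED);
        set.Add("foo\\d+", nullptr); // returns this pattern's index (0)
        set.Add("bar", nullptr);     // index 1
        set.Compile();               // builds one combined automaton

        std::vector<int> matched;    // indices of patterns that matched
        bool any = set.Match("xx bar yy", &matched); // matched == {1}
        return any ? 0 : 1;
    }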

Most back-tracking matchers (such as libpcre) match one pattern at a time and are inherently incapable of streaming, due to their requirement to backtrack into arbitrary amounts of old input.

Hyperscan was the product of a ‘pivot’ for a startup called Sensory Networks that was founded in 2003 to do regex on FPGA. I joined in 2006 and spent some time doing regex on GPGPU, before we decided that implementing regular expressions on general-purpose CPUs was a better idea. We worked in a much diminished company from 2009 (5 staff members!) until our acquisition by Intel in October 2013.

Future Thingies

Hyperscan and RE2 continue in their niches. There are a number of other projects out there – even a few hardware regular expression matchers struggle on – but I’ll have to get to other projects in another post.

It’s possible that something much simpler than Hyperscan could get a considerable portion of the benefit of Hyperscan, while being a more tractable code base for adaptation, porting to other CPU architectures or exotic architectures (GPGPU, for example), alternate use cases like bioinformatics, etc. I have some ideas in this area under the dubious code name “Ultimate Regex Engine the Third” (ure3) – Hyperscan itself was originally “Ultimate Engine the Second”. If anyone is interested, get in contact.


Hyperscan Things That Made It Into The Paper

Pattern Decomposition (Section 3)

A key idea of Hyperscan is that if we have a complex regular expression – say:

42:/<\s*object[^>]*data\s*\x3A[^,>]*base64/smi

We can observe that many parts of this pattern are simply literal strings: “object”, “data”, “base64”. It’s a lot easier and cheaper to match literals than to run a full regular expression implementation, so breaking up this pattern (an easy one, as it is simply ‘one thing after another’) into several components allows us to remove the strings from the finite automata matching task.
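To make that concrete, here is a hypothetical sketch of what a decomposition of the example might produce (an illustrative data layout, not Hyperscan’s actual internal representation):

    #include <string>
    #include <vector>

    // One piece of the decomposed pattern: a literal handled by the string
    // matcher, plus the regex 'glue' an automaton must verify before it.
    struct Component {
        std::string literal; // cheap: handled by a literal matcher
        std::string glue;    // handled by an NFA/DFA engine
    };

    // /<\s*object[^>]*data\s*\x3A[^,>]*base64/smi decomposes roughly into:
    const std::vector<Component> parts = {
        {"object", "<\\s*"},           // prefix before the first literal
        {"data",   "[^>]*"},           // connector between the literals
        {"base64", "\\s*\\x3A[^,>]*"}, // connector before the final literal
    };
    // The literal matcher scans for all three strings at once; the small
    // glue automata run only around plausible literal matches.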

A rudimentary version of this is ‘pre-filtering’ – we could just make sure we saw all three strings, then run the pattern (although this is tricky in streaming mode – more details some other time). But this still means we’re matching all the characters in those three strings in an automaton. Given that our methods for matching NFAs are quite sensitive to the number of states, it’s better to have smaller automata.

String Matching: “FDR” (Section 4.1)

FDR is one of a long list of string matching engines that I’ve built. It’s been fairly effective, although it gets a bit cranky at large string counts. See the paper for details, as there’s only so many times I can explain what a “bucketed SIMD shift-or matcher with supercharacters” is.

The FDR matcher is used for moderate-to-large string match cases – typically from around 80 up to tens of thousands of strings – and uses SIMD not for scanning many characters at once, but for holding a much larger set of simple state machines than would fit in a general-purpose register.

This is worth noting given the debates about vector machines – not every use of SIMD is oriented towards applying simple operations over great long vectors of data. FDR is a bucketed shift-or matcher that works better at 128 bits (8 buckets, 16 wide) than it would at 64 bits – there’s no obviously urgent extension to 256, 512 or, for that matter, 2048 bits.

Freud said, sometimes a cigar is just a cigar; similarly, sometimes a SIMD register is just a SIMD register.
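For readers who haven’t met shift-or before, here is a minimal scalar, single-pattern version (my sketch, not FDR itself; FDR assigns each bit lane to a bucket of strings and keeps the state in a SIMD register rather than a uint64_t):

    #include <cstdint>
    #include <cstring>
    #include <string>

    // Classic shift-or: returns true if 'pattern' occurs in 'text'.
    // Assumes 1 <= pattern.size() <= 64 for this sketch.
    bool shift_or_find(const std::string &pattern, const std::string &text) {
        const size_t m = pattern.size();
        uint64_t B[256];
        std::memset(B, 0xff, sizeof(B)); // bit i of B[c] is 0 iff pattern[i] == c
        for (size_t i = 0; i < m; i++)
            B[(unsigned char)pattern[i]] &= ~(1ULL << i);

        uint64_t D = ~0ULL; // all ones: no partial match in progress
        for (unsigned char c : text) {
            D = (D << 1) | B[c];          // advance every partial match by one
            if (!(D & (1ULL << (m - 1)))) // bit m-1 clear: a match ends here
                return true;
        }
        return false;
    }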

Glushkov NFA implementation: “LimEx” (Section 4.2)

The Glushkov construction is a way of getting an ε-free (“epsilon-free”) NFA; that is, an NFA that does not make transitions from state to state without consuming a character. This makes it considerably easier to build state machines with simple bit-vector operations (although the successor function, showing which states can follow which other states, is not necessarily trivial).

The LimEx matcher is an evolution of several earlier matchers. The “Limited NFA” only allowed states to jump 0, 1 or 2 states ahead (eventually generalized to larger distances, but still not allowing irregular transitions). The “General NFA” used multiple partitions to allow arbitrary jumps, after the Method of Four Russians. The appallingly stupidly-named “LimGen” NFA combined the two, gathering only the states requiring a ‘non-limited’ transition using a PSHUFB instruction (yes, that one again, fans) and doing a table lookup.

All of these exotic techniques fell away to a simple strategy: do the ‘limited’ transitions (a certain number of fixed-distance jumps only), then work through all the remaining states and handle the “Exceptions” (the “Ex” in “LimEx”). This also covers quite a bit of plumbing for accept states, triggering and reading of bounded repeats, etc. It’s not pretty, but it works.
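A minimal sketch of the idea for a hypothetical 64-state machine (illustrative structure and field names, not Hyperscan’s real LimEx layout):

    #include <cstdint>

    struct Limex64 {
        uint64_t shift0_mask;        // states with a self-loop (advance by 0)
        uint64_t shift1_mask;        // states with an edge to the next state
        uint64_t shift2_mask;        // states with an edge two states ahead
        uint64_t exception_mask;     // states with irregular out-edges
        uint64_t exception_succ[64]; // successor set per exception state
        uint64_t reach[256];         // reach[c]: states whose class admits c
    };

    uint64_t limex_step(const Limex64 &nfa, uint64_t state, uint8_t c) {
        // Common path: 'limited' transitions are just masked shifts.
        uint64_t succ = (state & nfa.shift0_mask)
                      | (state & nfa.shift1_mask) << 1
                      | (state & nfa.shift2_mask) << 2;
        // Rare path: walk the exception states one set bit at a time.
        uint64_t ex = state & nfa.exception_mask;
        while (ex) {
            int i = __builtin_ctzll(ex); // lowest set bit (GCC/Clang builtin)
            succ |= nfa.exception_succ[i];
            ex &= ex - 1;                // clear that bit
        }
        // Finally, kill successor states that cannot consume byte c.
        return succ & nfa.reach[c];
    }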

Hyperscan Things That Didn’t Make It Into The Paper

The following list is not exhaustive; there are many more optimizations and transformations that didn’t make it in. A few are below:

  • Graph transformations: redundancy, unreachable components, etc. – we built an abstract NFA graph (also a Glushkov NFA, but tailored towards representing our workload rather than being a direct implementation) and applied transformations to it. We can detect redundant work in an NFA and eliminate it, expand out literal paths, etc.
  • Bounded Repeat Analysis and implementation: handling of single-character bounded repeats – e.g. /foo[^\n]{1024}bar/ – is quite challenging. We ‘discover’ such repeats in our NFA graph, and have specialized code to implement this class of repeats.
  • Other string matchers: we have a small-group literal matcher called “Teddy” (that uses SIMD; to this day, the best explication of Teddy is in the Rust reimplementation of it, even if they could benefit from using the rest of my compiler algorithm; my logic to merge multiple strings into a bucket is at least mildly sophisticated).
  • DFA implementation: Alex Coyte, one of our engineers (now at Google) always said “DFAs are awesome”. They are; they are faster and simpler than NFAs. We used them quite a bit (when we could, and when they made sense). We had a pretty straightforward implementation of DFAs without too many exotic components, but there are some tricks to making DFAs smaller.
  • NFA and DFA acceleration: when it’s possible to check that an automaton, whether NFA or DFA, can’t leave its local neighborhood of current states without seeing an easily detected ‘event’, it’s possible to use SIMD to scan for that event (see the first sketch after this list).
    For example, the pattern fragment /[^\n]*foo/ can scan for a newline, which breaks out of the [^\n] state, or an ‘f’, which may begin a ‘foo’.
    This extends even to NFAs, where we can take the closure of all possibly active ‘accelerable’ states and scan ahead for events.
  • Specialized engines, like “Sheng”, and many others. The ‘Castle’ engine, for example, efficiently handles many bounded repeats at once, improving overall performance but also radically reducing the amount of state we have to store.
  • Lookaround assertions: upon receipt of a literal match, we use a technique similar to my SMH matching engine to augment the power of the literal match, verifying that the characters around the match will match at the regex level. So if we extracted the string ‘foobar’ from a regex (to use a modified version of our earlier example) /<\s*object[^>]*\s\dfoobar\d[^,>]*base64/

    we can implement the character-class lookarounds to check that we’ve seen ‘\s\dfoobar\d’ before running a full-fledged NFA matcher for the other parts of the pattern (see the second sketch after this list).

  • … and many others
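Two of the items above promised sketches. First, acceleration, using the /[^\n]*foo/ example: while the automaton is parked in the [^\n]* self-loop, nothing can change until we see a ‘\n’ (which kills the loop) or an ‘f’ (which may start a ‘foo’), so we can skip ahead 16 bytes at a time. A hedged SSE2 sketch, not Hyperscan’s actual code:

    #include <emmintrin.h> // SSE2
    #include <cstdint>

    // Scan for the first '\n' or 'f' in [buf, end); the automaton can be
    // fast-forwarded to the returned position.
    const uint8_t *accel_scan(const uint8_t *buf, const uint8_t *end) {
        const __m128i nl = _mm_set1_epi8('\n');
        const __m128i f  = _mm_set1_epi8('f');
        for (; buf + 16 <= end; buf += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)buf);
            __m128i hits  = _mm_or_si128(_mm_cmpeq_epi8(chunk, nl),
                                         _mm_cmpeq_epi8(chunk, f));
            unsigned mask = _mm_movemask_epi8(hits);
            if (mask)
                return buf + __builtin_ctz(mask); // first interesting byte
        }
        for (; buf < end; buf++) // scalar tail
            if (*buf == '\n' || *buf == 'f')
                return buf;
        return end;
    }

Second, a lookaround check in the same spirit: once the literal matcher reports ‘foobar’, confirm the surrounding character classes before waking a full NFA. A hypothetical helper (the offsets assume the modified example pattern above):

    #include <cctype>
    #include <cstddef>
    #include <cstdint>

    // match_end is the offset just past a 'foobar' literal match.
    // Verify the \s\d before it and the \d after it, i.e. '\s\dfoobar\d'.
    bool lookaround_ok(const uint8_t *buf, size_t match_end, size_t len) {
        if (match_end < 8 || match_end >= len)
            return false;                 // need 2 bytes before, 1 after
        size_t start = match_end - 6;     // start of 'foobar'
        return std::isspace(buf[start - 2]) && std::isdigit(buf[start - 1])
            && std::isdigit(buf[match_end]);
    }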

Hyperscan Things That Didn’t Make It Into Open Source Hyperscan

  • Ports to non-x86 platforms. Hyperscan had ports to MIPS, ARM, PowerPC and Tilera at one stage. These ports were a lot of hassle to maintain, and I don’t think it takes a genius to see why they were not particularly strategic for Intel to devote resources to.
  • Capturing subexpressions: at one stage we had an experimental product that (in block mode only, not streaming) could accurately model libpcre’s capturing semantics. This worked by scanning the data backwards with a backwards version of the pattern, making a trace of the states visited, and then tracing forwards through the backward state trace in order to correctly model the behavior libpcre would have exhibited. This was possible because, when we’re tracing forwards through the backwards trace and following 1-bits in our states, we know they will lead to an accept state.

    Even more fun was implementing a version of this that checkpointed our backwards state traces periodically, allowing us to store O(sqrt(N)) states in our state trace, at the expense of having to scan twice as much (a generalization would have been to store O(k*pow(N, 1/k)) states at the cost of scanning k times as much).

    This had the feel of one of those tricky algorithm interview questions… Overall, capturing support was fun to work on, but no-one wanted it commercially. We probably wouldn’t do it this way today.

Acknowledgements

We have already mentioned the pioneering work of Victor Glushkov. His work was introduced to us at Sensory Networks by the excellent textbook Flexible Pattern Matching in Strings by Gonzalo Navarro and Mathieu Raffinot. Professor Navarro has also written many excellent papers over the years, so if you can’t afford the book, do check out his publications page. His output is by no means confined to regular expressions, but his treatment of NFA implementations is very good.

A great deal of hard work went into Hyperscan over the years. Version 1.0 was the initiative of Peter Duthie, whose startup Ground Labs is going strong. Alex Coyte, Matt Barr and Justin Viiret worked all the way through from the early days at Sensory Networks through to the end of the Sydney-based Intel DCG Lab. I’d also like to thank Xiang Wang, Yang Hong, Harry Chang, Jiayu Hu and Heqing Zhu from Intel, and KyoungSoo Park from KAIST, for their work on the paper, as well as the Hyperscan team in general, past and present.

20 thoughts on “Paper: Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs”

    1. I tend to agree. I would like “ure3” to be open. I also will need help. I designed Hyperscan but did not build it all by myself; generally, aside from the string matchers, which were my pets, the excellent folk in the Acknowledgements did the heavy lifting.

      If I build a new regex matching system it will be a lot smaller and simpler, but it will still be a huge effort.


      1. I’m a student doing research on regular expressions. I think Hyperscan’s performance is excellent. However, it doesn’t support the ARM architecture, and I want to do some work to port it to ARM. If “ure3” does the corresponding work, may I join the group?


      2. You certainly may, if I get to it. I have considered doing the port of Hyperscan to ARM directly, but I don’t know what purpose it would serve. A smaller, more tractable regular expression matcher with fewer moving parts would be a better research platform and also easier to port to a variety of architectures (not just ARM, but potentially emerging architectures and GPGPU).


      3. Thanks for your reply. I think implementing a smaller, more general RE matcher will affect performance. So targeting a specific platform may give a better chance that the matcher performs as well as Hyperscan.


      4. Well, it might take a performance hit. I think the main weakness that a smaller regex matcher would have relative to Hyperscan is that it would not implement the giant portfolio of special-case optimizations that Hyperscan has – e.g. for large cases, for bounded repeats, for very small cases, etc. I think general-purpose performance wouldn’t suffer all that much. “Special platforms” are interesting (e.g. GPGPU) but I wouldn’t automatically assume they’re always going to win. Many papers about amazing GPGPU or FPGA speedups (50x, 100x, 1000x) over software depend mostly on having bad software for comparison.


      5. Well, if you implement the same algorithm on different platforms, the performance will obviously differ. However, maybe there are good ideas that would let all the different platforms perform well. And I know a smaller general-purpose RE matcher has many advantages; it is more likely to succeed, since such software is more convenient to maintain and update and to apply to other products.


  1. […] Hyperscan is a library from Intel that searches for many regular expressions at once. It uses the subword-extraction heuristics for regular expressions that we wrote about, and a lot of SIMD for searching the Glushkov automaton (the algorithm is, it seems, called Teddy). […]


    1. It would be a fair amount of work. Older versions of Hyperscan (pre-open source) ran on ARM, PowerPC, Tilera and MIPS. ARM isn’t hard, as most targets are little-endian. Once SVE and SVE2 are in place it would be a more interesting project, but it’s certainly possible even with just NEON or, if you’re a masochist, general-purpose registers only.


      1. I have ported Hyperscan 5.2.0 to ARMv7 with simde. All 3746 unit test cases PASSED (run and tested with qemu-arm).

        My fork: https://github.com/zzqcn/hyperscan, and my commit: https://github.com/zzqcn/hyperscan/commit/249178abd421f2be6a5b28aecc53930976a92ce8

        But just as you have said, porting to a big-endian platform (e.g. MIPS) is too difficult and can’t use this ‘intrinsics translation’ approach. On the other hand, two years ago I separated the ‘SmallWrite’ path from Hyperscan and compiled & ran it on ARM/MIPS. The so-called ‘SmallWrite’ path just uses the ‘McClellan’ method and doesn’t use any SIMD intrinsics. The ‘SmallWrite’ version can run, but it can’t compile many complicated regexes, and the size of the compiled database is huge.

        I don’t know much about SSE, NEON, etc., so any suggestions or code review would be helpful to me.


      2. Congratulations! This is significant work – well done. I am not nearly current enough working on the Hyperscan source base to give you detailed code review, but this approach is very solid.

        I view the big-endian stuff as a lost cause. I don’t think there are many major architectures upcoming that are BE – historically, there have been many MIPS devices in networked IPS/IDS but I don’t see a big future there. There are so many data structures that would need to be fixed that you would have to touch things all over the code base. It is unlikely that the team would ever be able to accept such an intrusive patch and unlikely you would want to maintain it for a long time.

        The “small write” path tries to compile everything as a DFA, as you’ve probably noticed. The idea of compiling everything into a single DFA has a lot of potential for huge combinatorial explosions. It would be very unlikely to work – it would almost be better to start fresh with a new design – possibly keeping the compiler but building a completely new backend. I have thought along these lines, but I am not being paid to work on that sort of thing, and it would be a lot of work.

        On the ARM side – it would be possible – and better – to do everything natively in NEON rather than trying to pretend that NEON == SSSE3 (although that’s a good starting point). NEON has some interesting capabilities on its own (and is missing some of the things that SSE2 has, like PMOVMSKB). It would also depend on how forward-looking you are – maybe the ARM ecosystem will eventually give us a non-supercomputer version of SVE, which would also be a very interesting target.


  2. Thank you Geoff for a very informative post. The paper mentions that
    “Our evaluation shows Hyperscan greatly helps improve the performance of real-world DPI applications. It improves the performance of Snort by 8.7x for a real traffic trace.”
    We have been testing the integration of Snort version 3.0 with Hyperscan version 5.2.0 and running some tests on an Intel Atom C3558 chip. Compared to the stock Snort version 3.0, the performance remained the same. I am trying to figure out whether I am making a mistake in the configuration or execution, or whether Hyperscan would not produce a great deal of improvement given that the Intel Atom C3558 chip does not support AVX2 or BMI2.

    Do you have any benchmarks for Hyperscan on the Intel C3000 series with which I can compare?


    1. I’ve seen your exchanges with Xiang Wang on the Hyperscan mailing list.

      Historically Hyperscan has delivered performance uplift even on Atom – Atom still has SSSE3 allowing access to PSHUFB, as well as enough SSE2 to do string matching and most of the other acceleration code paths. However, a number of things are worth looking at:

      1. Snort’s string matcher may have improved enough to obviate the advantage that Hyperscan had on Atom.
      2. The workload may not spend much time in string matching or regex matching due to input or configuration.
      3. The workload may contain arbitrary content that neutralizes the advantage of Hyperscan. Naive string matchers often do very well on pathological inputs – random data and/or floods of one character. While you shouldn’t be scanning ‘in-the-clear’ signatures against random/encrypted/compressed data (what would you expect to match?), it’s quite possible to wind up doing that. Most Aho-Corasick string matchers perform quite well against data that’s effectively random as they can often just bounce around in a set of ‘early’ states, never getting close to a match and never leaving L1 cache. The performance gap between HS and AC narrows or vanishes on random data for simple literal search.

      Those are my 3 most likely things to look into, but you will have to do your own detailed performance benchmarking if you want to uncover what is happening.


      1. Thank you Geoff.

        I will investigate our test setup against the three points that you have mentioned.

