Some opinions about “algorithms startups”, from a sample size of approximately 1

June 12, 2018 geofflangdale

Something a little different today. For my regular readers, I promise to try to keep the number of “opinion/rant” posts to a minimum and we’ll be back on our regular technical content in a few days. It’s pretty easy to just whack the keys and issue Epic Pronouncements on things, but the effect is limited.

In any case, I have had this post kicking around in my brain in some form for years.

Preliminary Comments, Background, Disclaimers

I’m going to talk about “algorithms startups”: this is a vague term to mean a startup that is oriented around building and selling (in some form) an algorithm – as opposed to building a complete solution and trying to make money directly from customers. I don’t mean a “pure” IP play where you invent something, patent the hell out of it, and try to extract money from world+dog. I’m assuming we’re talking about inventing something that didn’t exist before, writing the code yourselves, and trying to make money more or less directly from the code.

My experience (short version): I joined a startup (Sensory Networks, founded in 2003) while it was quite large in 2006, watched it lose traction until the end of 2008, and formed part of a small team (5 people, at the start of 2009) which took a small chunk of additional funding and took the business to a decent exit (I claim ‘decent’, in terms of the scale and funding of the startup since 2009) in 2013.

We built a software-based regular expression matcher called Hyperscan which we sold as a closed-source commercial library. Hyperscan was later (2015) open-sourced at Intel. I don’t know how to make money directly off open source so if you’re hoping for insights there I don’t have any experience.

Sensory Networks wasn’t planned to be a pure ‘algorithms startup’ – we just wound up there by default; focusing on the core of the task was the only viable way forward from 2009 onward.

I should note that many – most, even – of the interesting things that happened at Sensory Networks and subsequently at Intel are commercial-in-confidence. So, boringly, I am not going to be reeling forth exciting details of evaluations and commercial deals made with brand name companies That You’ve Probably Heard Of. There will be no exciting insider revelations, just affirmation of principles you’ve probably heard 50 times before for the most part. I will also not discuss acquisition mechanics.

I draw my experience both from Sensory Networks and my continued experiences with the Hyperscan product but also from watching closely a lot of other startups in the area. While we did some things right, we got a lot of stuff wrong, too. Unfortunately, a bunch of the things we didn’t do right are tied up with things that I can’t talk about or they are speculative (it’s easy to speculate about things you should have done but hard to tell whether pursuing alternate strategies would have worked better).

I’m assuming that most readers have already heard about the idea of continuous integration, fixing bugs first, etc. so we can take that stuff as read.

Many of the principles here were applied by much better software engineers than I am; I may talk a great line about testability and API design and fuzzing and so on, but most of the real work in this area was done by the core Sensory Networks team of developers from the restart in 2009 through to the Intel acquisition and beyond: Matt Barr, Alex Coyte, and Justin Viiret.

It’s also clear that the continued good qualities of Hyperscan and the freedom to pursue the strategy of open-sourcing the product are due to many good people at Intel. I don’t want to make it sound like the story of the product is over. What we learned is captured in the existing Hyperscan product and the processes around it. This post doesn’t focus on the post-acquisition side of things; the privilege of being able to give away your software while working in a large company is a very different story than the process of getting to that point. It’s also a story that you’re not usually allowed to tell! 🙂

Opinions are my own and not that of any other person, past employer, etc.

So, in no particular order, some opinions about ‘algorithms startups’.

Doing an algorithms startup is a lot of fun

First of all, while there were parts of the process that were awful, if you like computer science, this kind of startup can yield an enjoyable experience. This may vary for different team members and at different times, of course. If you want to work on interesting algorithms and have picked a market where that’s actually rewarded, you might enjoy your work.

Doing an algorithms startup won’t necessarily make you tons of money

Obviously, no startup is guaranteed to make tons of money. But algorithms startups have some extra downsides.

You are attempting to make money from other businesses. You’re going to be paid a pittance relative to what they are getting, for good reason. They are building the user interfaces, supporting thousands of customers, building all the boring code you aren’t interested in or couldn’t possibly write yourself.
It’s also very likely that you’ll get paid very slowly. Try not to die in the interim!

The lifesaver for you is that once you get your system accepted by other businesses, they will keep paying for it and you can go and sell the same code to lots of other companies (“Doctrine of Multiple Use”).
You are competing with Free and Open Source if you are closed-source, or you are trying to make money off a product that people can get for free and dissect if not.
I have no experience trying to make money off FOSS software so I can’t speculate about how hard that is.

Competing with FOSS (while still asking for money for closed-source software) is difficult, and you need an enormous advantage. There were a number of FOSS regular expression matchers around when Hyperscan was closed-source, but none of them were close to providing what our customers wanted (large scale pattern matching and ‘streaming’ matching).

I think a startup of this kind can make a fair bit of money, but I would be surprised to hear that it’s in the ‘hyper-growth’ category.

Speculative: What should an algorithms startup do after capturing most of the Total Addressable Market for that algorithm? When are you ready to do that?

This gets into unexplored territory: our answer turned out to be “get acquired”. I would hazard a guess that it’s at least possible for an algorithms startup with a good structure to move into adjacent markets and continue growing. Maybe if you’re good enough at this you could make something big…

Equally speculative would be answers to questions like “when is your core algorithmic product essentially ‘done’?” We continued to tune Hyperscan, always aware of gaps in performance, excessive costs of various kinds (bytecode size, stream size, compile time) and gaps in functionality that might be expected from a regular expression matcher (unsupported constructs).

So we never answered either of these questions – at least not directly – but that answer would be pretty important for a similar startup in a similar place several years in.

Don’t drift into being a consulting business

Stick to the Doctrine of Multiple Use; don’t build special-purpose builds of your software if you can help it, and definitely don’t just wander into consulting if you didn’t intend to have a consulting business.

We had some extra help with this – the Australian government had a nice R&D scheme (now the “Research and Development Tax Incentive”). This mandated a doctrine of “multiple sales” – we couldn’t get a generous credit for work done for just one company. This ‘restraint’ helped us in the long term (not just the money, but the discipline).

We did add a few features in the pre-Hyperscan 4.0 era (before the open source release) that were each ultimately needed by just one customer. These features were always theoretically interesting more broadly and we didn’t do special-purpose builds for single customers; these single-customer features were made available to all. However, they never really got wide adoption.

Ultimately these features were dead-ends – adding a big testing load (adding weird new modes or API functions often increased the test load geometrically) while never getting much use. On the flip side, some of these features were needed to stay alive commercially.

Iterate, and release a Minimum Viable Product (MVP) early, but make the MVP actually Viable

You have to offer something much better than the alternative. A critical functionality improvement or 5-10x on some metric will get you noticed – and unless you’re a drop-in replacement for something else, you’ll probably need that big improvement.

The idea that you build a Minimum Viable Product is now a cliché. It’s harder than it sounds, even when you plan to do it. For an algorithmic startup, there’s a fine line between “unintentionally trivial” and “minimal”.

When we built Hyperscan, the first iteration of what became the successful product (Hyperscan 2.0 – 1.0 was built on different lines and very little aside from the parser was retained) was pretty awful in many respects. Tons of different regular expression constructs would be either slow or not supported (“Pattern too large error”). An extremely early evaluation version even occasionally printed curse words on the console, a behaviour not normally desired in software libraries.

However, we did have some killer features:

Supporting lots of regular expressions at a time (alternatives like libpcre or Java regex only supported 1 at a time),
Streaming (the ability to suspend and resume a regex match, as you would need to if matching a regex across a stream of packets), and
High performance (we were typically quite a bit faster than alternatives – 5-10x was typical).

People were willing to live with a lot of quirks and edge cases if we could deliver on those three items. Over time most of the more obvious quirks and edge cases went away (especially compared to the competition).

We weren’t a drop-in replacement for any other regular expression matcher, so a modest increase in performance was always offset against developer effort at our customers. Evaluations where we couldn’t deliver a big speedup or some substantial new functionality almost always failed. They even failed later, when we were an open source product and were giving Hyperscan away for free.

If your key selling point is performance, but you’re only offering 20% better, you’re in trouble – especially if you’re not a straightforward drop-in replacement for someone else’s product.

Your product will have gaps, but the earlier your customers discover them, the better

Aside from the elevator pitch (hardly the time to tell people how much Hyperscan sucks), we were careful to set expectations early. For us, there was a hierarchy of when the bad news gets found out:

During early discussions (“Your product isn’t a white-box IPS system? Oh.”)
During a technical deep dive (“Your product doesn’t support back-references? No, thank you!”)
During the evaluation when your customer tried to integrate your code (‘doesn’t compile’, ‘API wasn’t actually what we expected’)
When the customer tried to load signatures into our regex engine (“fail at compile time”)
When the customer ran our engine during evaluations (“performance or overheads not good at run-time”)
After the customer has signed a contract and shipped something with our product in it to their customers.

There are a number of terrible strategies that many startups use that push the ‘bad news discovery’ downward in this hierarchy. Some of these terrible strategies are technical, some are marketing related.

It’s better to eat the pain early; most developers understand the principle that you’re better off getting a nasty message from the compiler than a crash at run-time. This principle of “bad news early” is good practice beyond that. You won’t screw your customers; you’ll pleasantly surprise them in the evaluations and you’ll get a well-deserved reputation for honesty. You also won’t waste time in meetings or evaluations that can’t end well.

Maybe if you don’t waste their time now, they’ll be more interested in you when your offering is better aligned with what they want.

Testing

Work clean and test everything

It’s tempting to cut corners when you’re a struggling startup. However, you’re actually less set up to get away with cutting corners than a big corporation. If you mess things up, that becomes your reputation – you can’t send a VP out with a few more talking points for his or her weekly golf game with his good buddies who are all VPs at the customer whose product you just stuffed up. If you mess up, you’re dead.

Don’t mess up.

We did this once – we disabled a test (unusually large inputs) after we made a few changes with the intent of turning it back on shortly after (this only affected evaluation versions of our code, not commercially shipping versions of our code). As per Murphy’s Law, naturally this bug was found, not by us, but by an evaluation team at one of the biggest networking companies in the world, on the second day of their first evaluation of our product. The evaluation continued, but with an air of forced smiles and gritted teeth, and didn’t go much further.

Assume anything you don’t test is broken.

You will need to test your code relentlessly, and designing your code for “testability” is critical. We rejected some features strictly because we didn’t know how to quickly and programmatically find a ‘ground truth’ for how they should behave (needed for our fuzz testing). Other features had their design influenced or dictated by testing requirements.

For example, our ‘streaming’ feature has always been guaranteed to behave identically, in terms of matches generated, to the matches generated by block mode writes. This was very hard – many other regular expression implementations either don’t do streaming at all, or sort of ‘fake’ it (i.e. you can get matches as long as they aren’t spread out too far in the buffer, or too spread out over multiple packets, or on some regular expressions you get them accurately but not all, etc).

By sticking to a strong principle (streaming always works the same as block mode) we could test our stream mode programmatically without having a poorly defined notion of when we were and weren’t expecting to be correct.
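A minimal sketch of that kind of programmatic check, written in Python rather than against Hyperscan itself. `ToyStreamMatcher` is a hypothetical stand-in that only handles literal byte patterns, and none of these names come from the real product. The point is the shape of the test: chop the same input at random places, feed the pieces through the streaming path, and require exactly the matches that a single block scan reports.

```python
import random

class ToyStreamMatcher:
    """Hypothetical streaming matcher for literal byte patterns.

    Reports (pattern_id, end_offset) pairs, with offsets counted from the
    start of the whole stream rather than the current chunk.
    """
    def __init__(self, patterns):
        self.patterns = list(patterns)          # non-empty byte strings
        self.keep = max(len(p) for p in self.patterns) - 1
        self.tail = b""                         # stream state carried between writes
        self.consumed = 0                       # bytes seen in earlier writes

    def scan(self, chunk):
        window = self.tail + chunk
        base = self.consumed - len(self.tail)   # stream offset of window[0]
        matches = []
        for pid, pat in enumerate(self.patterns):
            start = window.find(pat)
            while start != -1:
                end = start + len(pat)
                if end > len(self.tail):        # ends in new data, so not yet reported
                    matches.append((pid, base + end))
                start = window.find(pat, start + 1)
        self.consumed += len(chunk)
        self.tail = window[-self.keep:] if self.keep else b""
        return matches

def block_scan(patterns, data):
    """Ground truth: brute-force scan of the whole buffer in one go."""
    return sorted((pid, end)
                  for pid, pat in enumerate(patterns)
                  for end in range(len(pat), len(data) + 1)
                  if data[end - len(pat):end] == pat)

def stream_scan(patterns, data, cuts):
    """Feed `data` to the streaming matcher, split at the given offsets."""
    matcher = ToyStreamMatcher(patterns)
    out, prev = [], 0
    for cut in list(cuts) + [len(data)]:
        out.extend(matcher.scan(data[prev:cut]))
        prev = cut
    return sorted(out)

def check_equivalence(patterns, trials=200, seed=0):
    """Random inputs, random split points: streaming must equal block mode."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = bytes(rng.choice(b"ab") for _ in range(rng.randint(0, 40)))
        cuts = sorted(rng.randint(0, len(data)) for _ in range(rng.randint(0, 5)))
        assert stream_scan(patterns, data, cuts) == block_scan(patterns, data), (data, cuts)
    return True
```

Because the rule is "always identical to block mode", the check needs no table of exceptions: any difference at any split point is a bug.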

The ability to ‘fuzz’ a complex system is a lifesaver, but it comes with a trap

Fuzzing is great. I met a couple Microsoft employees at RSA in 2009 and they asked me: “do you fuzz-test your system”? I admitted “no, we don’t, but I’ll try that when I get back”. We found a lot of stuff – before our real customers did.

We invested a lot of effort into the idea of figuring out how to most effectively test regular expressions – they have a complex structure in themselves, and then you’ve got to figure out what sort of inputs will make interesting things happen inside the bytecodes that we built in Hyperscan. There’s no point testing regular expressions with random data – all those nice optimizations that allow you to skip the hard stuff whenever a required “literal factor” isn’t there will “protect” you from finding your bug. Great for performance, bad for fuzzing. Thus, we put a lot of work into building positive and near-miss negative regular expression test cases. We built systems that were every bit as complex and (arguably) interesting as the regex optimizations themselves.
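A rough sketch of the positive / near-miss idea, in Python. This is not how our test generators worked in detail; it is a toy under stated assumptions. Python’s `re` module plays the role of the ‘ground truth’ oracle, `prefiltered_engine` is a hypothetical engine under test that skips the real scan when a required literal is absent, and the caller supplies a sample string known to match the pattern.

```python
import random
import re

def _noise(rng, max_len=8):
    # Lowercase filler bytes placed around the interesting part of the input.
    return bytes(rng.randrange(97, 123) for _ in range(rng.randint(0, max_len)))

def positive_case(sample, rng):
    """A known-matching sample buried in noise."""
    return _noise(rng) + sample + _noise(rng)

def near_miss_case(sample, rng):
    """The same shape with one byte of the sample flipped.

    Usually a non-match, but not always, so the oracle decides.
    """
    i = rng.randrange(len(sample))
    broken = sample[:i] + bytes([sample[i] ^ 0x01]) + sample[i + 1:]
    return _noise(rng) + broken + _noise(rng)

def prefiltered_engine(pattern, literal, data):
    """Toy engine under test: skip the real scan if the required literal is absent."""
    if literal not in data:
        return False
    return re.search(pattern, data) is not None

def fuzz(pattern, sample, engine, trials=500, seed=0):
    """Compare `engine(data)` against the oracle; return how many inputs matched."""
    rng = random.Random(seed)
    oracle = re.compile(pattern)
    matched = 0
    for _ in range(trials):
        make = positive_case if rng.random() < 0.5 else near_miss_case
        data = make(sample, rng)
        expected = oracle.search(data) is not None
        assert engine(data) == expected, data
        matched += expected
    return matched
```

With a pattern such as `rb"user=[a-z]+;"`, the sample `b"user=bob;"` and the literal `b"user="`, nearly every generated input gets past the literal check or fails it by a single byte, which is where the interesting bugs tend to hide. Purely random bytes would almost never contain the literal at all.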

Get interested in innovative ways to test your product. This is not a second-class activity for the “lesser developers” (many other firms have discovered this).

The trap: having a good fuzzer gives you a sense of safety, allowing you to build a more complex system than you might have dared to otherwise. Possibly this is dangerous; I’m still thinking about this point. It’s said that people who think their cars are safer are more likely to drive like maniacs…

Assume every metric that isn’t measured is Bad News for you.

In the same way that everything that isn’t tested is broken, any performance metric you don’t regularly measure (and regularly look at the measurements) is ugly news, showing that your system is bad and getting worse.

Assume everything you don’t measure is probably bad

Long after we supposedly knew what we were doing, we managed to regress our main public benchmark case for open source Hyperscan without noticing. It was differently structured than our normal performance runs, so we didn’t put it in our regular performance matrix – so out of the 21,000 numbers generated per night by our Continuous Performance Monitoring infrastructure, we managed to mess up our ‘brag numbers’. It wasn’t hard to fix, and the performance change resulted from a restructure that likely made sense (most performance numbers improved, and these numbers went back on track long-term), but it was a fresh illustration of a principle that we should have grasped already.
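The lesson can be sketched as a nightly check that treats an unmeasured metric as a failure, not as silence. This is an illustrative Python fragment, not our Continuous Performance Monitoring code; the metric names, the dictionary layout and the 5% tolerance are all assumptions.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: reason} for metrics that are missing or worse than baseline.

    Both arguments map a metric name to a throughput figure (higher is better).
    A metric that exists in the baseline but was not measured this time is
    reported too: not measured counts as bad news.
    """
    problems = {}
    for name, old in baseline.items():
        if name not in current:
            problems[name] = "not measured"
        elif current[name] < old * (1 - tolerance):
            drop = 100 * (old - current[name]) / old
            problems[name] = f"down {drop:.1f}%"
    return problems
```

Had our public benchmark case been listed in the baseline, a report like this would have flagged it on the first night it went missing from the regular runs.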

Team Issues

Watch out for Individual Contributor “Tourists”

We all know them. These folks are heading for management by the shortest route possible. They don’t like coding or grunt-work and the minute they can stop, they will be telling people what to do. Computer scientists should be skilled professionals, but many people enter the field with the goal of doing as little of that work as possible and getting into management as soon as they can.

I would be stunned to hear that an architect with 3-4 years professional experience (or a structural engineer, or a doctor, etc.) would deem themselves ready to go lead a team of professionals (often with more experience than they have), but a lot of people coming through computing degrees are expressly on that path.

These people are dangerous in startups because there are few reasonable outlets for their ambition. There’s just not that much of a hierarchy to climb; don’t let them make one to suit themselves.

Conversely, reward your Individual Contributors and don’t dead-end them on compensation.

The converse of this comes from the motivation of many of the “tourists” to get out of these individual contributor jobs: the pay sucks. A mediocre manager is usually paid far better than a really great individual contributor. A well-rewarded ‘technical leadership’ track is a good ideal – rather than dead-ending your technical people or hoping that they’ll magically turn into good person-managers.

Of course, this is a nice trick, given as a startup you probably won’t have any money for a while. But it would be good to think about it, especially before you thoughtlessly splash out a salary $20K per year higher to a random VP of Something-or-Other or a Director of Important Sounding Stuff than you pay your absolute best developer.

A good team is not comprised of 100% ‘A’ players on some “Most Awesome Geek” standard.

It’s actually OK to hire people who are ‘B’ or even ‘C’ players in some areas. The right analogy is closer to one of those team sports with relatively specialized players – being an Australian the natural analogies are cricket or Rugby Union, but our American readers might think NFL. A team full of the ‘best all-rounders you can find’ would be mediocre in most sports; and a team full of the ‘best quarterbacks/fast bowlers/etc. you can find’ would be terrible.

Even a small startup needs a diversity of skills. If you put everyone through an algorithms-on-the-whiteboard exam and take the top performers, you might wind up with 5 algorithms / compiler / systems nerds and no-one who knows how to talk to customers, write documentation, test your system or do releases and builds.

In the Computer Science world, there’s a snootiness about certain skills trumping all the others. You need to hire people who are excellent at something you need and willing to learn some new things.

API Design

Build only what you need

It’s a lot easier to hear complaints from customers that your API doesn’t do enough, and fix that, than it will be to wean them off stupid things you put in your API back when you didn’t know what you were doing.

We saw a number of preposterously complex APIs for regular expression matching go by over the years. A minimalist API was popular with customers and easier to test.

We made some decisions that meant our API was not necessarily tiny – having streaming, multiple regular expression support, and having to completely avoid dynamic memory allocation meant that Hyperscan’s API is quite a bit more complex than, say, RE2, but we converged pretty quickly to a small API that we were broadly happy with.

Don’t throw extra features in there if you aren’t sure customers really want them. If you have to do it, mark them experimental and kill them off if you don’t hear much about them.

Listen to your customers but don’t let them design your API for you

We had a lot of really valuable feedback over the years from customers. Getting information about their use case was hugely valuable. However, an exercise that never went well was trying to co-design API features with them. It didn’t seem to work. They don’t know enough about how your system operates to make good suggestions.

Capture significant use cases, even when you don’t have a brilliant solution for the use case.

One thing that worked well was to identify important use cases and capture them in an API even if our implementation wasn’t great. For example, a lot of users wanted to be able to identify matches that occurred in a range of the output – e.g. “This regex /<R>/ matches only if the end of the match is between the 100th and 200th byte”. The user could have been told “hey, we don’t have any particularly good way of handling this – why don’t you do that check yourself, as our solution for this will be pretty much equivalent”. However, over time, implementing optimizations for this case is something we did – which we would not have been able to do if we told our users to go away and bake the solution for the problem into their code, which we wouldn’t see.

So if the API requested creates information you can use, it may make sense to capture the requirement even before you have a good solution.
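For illustration, the “do that check yourself” workaround looks roughly like this on the user’s side. The sketch uses Python’s `re` module as a stand-in, not Hyperscan, and `ends_in_range` is a hypothetical helper name. Note that `re.finditer` reports non-overlapping matches, which is not the same as reporting every match end offset.

```python
import re

def ends_in_range(pattern, data, min_end, max_end):
    """Yield (start, end) for matches whose end offset lies in [min_end, max_end]."""
    for m in re.finditer(pattern, data):
        if min_end <= m.end() <= max_end:
            yield m.span()
```

A filter like this lives entirely in the caller’s code. The engine still does all the work of finding matches that will be thrown away, and the library authors never learn that the offset constraint exists, so they cannot optimize for it.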

An example of where we didn’t get this right (initially) was regular expression ordering. Due to the way we initially implemented things, we didn’t return regular expression matches in order by ending offset, nor did we guarantee that the user would not get the occasional duplicate match (pretty bad, but it turned out that these things were OK in an MVP). One problem, though, was that users who picked up Hyperscan 2.0 (2.1 added ordering and duplicate suppression) built layers of code that dealt with our inadequacies – these layers of code get baked-in and often sprout other functionality, so even after we guaranteed ordering, those layers of code were there, sucking up performance for a task that was mostly no longer even needed.

This isn’t a license to just build castles in the sky – the requirements that you’re capturing should be important. This principle contradicts minimalism, so be careful.

Miscellaneous Issues

Don’t Bog Down on Trivial Stuff Immediately (or at all)

There are a lot of decisions to be made early in a startup. One pretentious thing you can do is decide that, because your startup is going to grow to take over the world and be really awesome right from the start, you should definitely spend a nice constructive period of weeks arguing over things like coding standards (and maybe some company values and a mission statement). You will find that Parkinson’s Law of Triviality takes over – everyone has an opinion on this kind of stuff and you’ll get a tedious all-in brawl for weeks, resulting in some standards that everyone will go ahead and ignore.

This didn’t apply to programming languages for us (this was more or less dictated by the level of complexity of the compiler, dictating C++, and the harsh environment of the run-time, dictating C, and the huge variety of platforms and tool-chains we needed to support – ruling out pretty much everything else). But I imagine that a nice knock-down-drag-out pissing contest (not a nice combination of mental images, is it?) about programming languages would be another great way to waste the first 2-4 weeks (months?) of your investors’ money.

Be aware of the risks of ‘bikeshedding’ at all times, not just starting out. However, it seems particularly unpleasant to get stuck in this phase early – the temptation will be strong when the startup isn’t really working yet.

Work Clean – Legals

Another area where it’s imperative to work clean, as a small startup, is legally. I am not qualified to provide legal advice, but it is of enormous benefit to think about this from Day 1. Do you own your code? Can you prove that? Have you dragged in random fragments of code that you don’t know the licenses for? Have you hired corner-cutters whose code will be revealed to be 50% copypasta from Stack Overflow and 40% fragments of unacknowledged GPL code?

I’m not specifically recommending you use a service for automated detection of this (Black Duck seems to do well, but I don’t know whether a small startup would want to spend their money on this); just don’t hire people who do that sort of thing, and remind junior developers that it’s not OK.

Similarly, a lot of startups join consortia and relentlessly announce partnerships that amount to little more than a press release and an exchange of banners on your website. These agreements may not bring you much more, but bear in mind, every bit of paper you accumulate will be something that you’ll be hearing about again during due diligence.

Every bit of paper you sign is a potential millstone. Don’t do a whole pile of important-sounding ‘businessing’ stuff that doesn’t get you anything and involves you signing tons of legals.

Think really carefully before you splash out small shareholdings to random people. You’ll need to go back to these people during an acquisition.

Dance like no-one’s watching; enter into agreements like every single thing you have ever done will be meticulously examined by one or more teams of lawyers working on behalf of a Fortune 500 company, as well as your own team of lawyers, who will be billing you for the time.

Work clean – Static and dynamic analysis

In our experience, running every static and dynamic analysis tool you can lay your hands on is worth trying. Both customers and acquirers down the track will thank you. Some tools are garbage, but as a rule, being clean on things like valgrind and clang static analysis and running with all warnings switched on and set to stop compilation was worth the trouble.

This is a day-to-day hit; you will occasionally have to do Weird Things to satisfy these tools. That’s a steady dull pain, but it’s better than the sharp pain you’ll experience if one of these tools could have caught something and didn’t.

Build in a niche appropriate to your scale; don’t take your tricycle out on the expressway

One of the keys to our success is that hardly anyone attempted to muscle in on our territory. While it seems that a good quarter of the world’s serious computer scientists have a pet regular expression project, very few of these projects are ever built out as a commercial product. There were a number of regular expression libraries that had quite decent performance on some of our key use cases, but none of these libraries had the work done to make them robust and high-performing across the use cases we handled.

What competition did exist, fortunately, thought hardware-accelerated regular expressions were a great idea. Perhaps this is a stroke of luck that happens only once in a career.

Our job was doable with a small team over a number of years because ‘high-speed software regular expressions’ was a niche: profitable enough, but not too crowded. I’m glad we hadn’t decided that “video compression” or “neural networks” or “machine translation” was actually our niche.

Expect to fail evaluations and keep trying

We had evaluations at big name companies that failed 4 or 5 times before finally getting a win. Sometimes the teams wander away, sometimes your product is just not good enough, sometimes they were just kicking the tires with no intent of ever doing business.

If you go single-threaded with the intent of landing that amazing nameplate customer, it might well kill your company. They might say ‘no’. Worse still, they might say ‘yes’, but you have invested so much time in them, and waited so long for revenue, that you’ll wish you failed the evaluation.

Persist and chase many opportunities; also try to find out what went wrong (in case there’s a next time, or in case the mistakes you made will affect you elsewhere). The latter is surprisingly difficult; in fact, it’s often hard to elicit feedback of any kind – even from a successful evaluation. After bad – or even good – results, you may be like these two gentlemen from the Coen Brothers’ “Burn After Reading” (caution: strong language)

Build a huge database of benchmarks and actually look at them

One of the big advantages that we built over the years at Sensory Networks was a huge database of regular expression patterns that customers had shared with us. We treated this with great care and didn’t leak information from it – but we used it relentlessly to try to improve performance, even on cases where customers had wandered away.

Subsequent dealings with other companies often left us amazed at how little data our competitors had on the complex workload we were all supposedly trying to make go faster/better.

This took a fair bit of pleading with customers to get this information. One of the main selling points was that “if you share your use case with us in enough detail – or something that looks enough like it – we will measure performance on your case and if we mess up our code base relative to your usage we will discover it in 12 hours, not 4 months after we make the mistake and 2 weeks after we send you the release”.

This worked well, but not perfectly. Some of our best customers never, ever showed us their workloads.

As mentioned above, while it’s nice to have all these benchmarks, it helps to look at the results of running them, too. If there are 24,000 metrics on your dashboard you’re probably not looking at them any more.

Expect to be evaluated by the person whose code will be replaced by yours if the evaluation succeeds

If you are an algorithms library, the person who evaluates you will probably be the person who previously wrote the library to do whatever your product does – good luck! They are the domain expert, and if you’re unlucky, they Hate You Already.

There are a surprising number of honest and self-critical computer scientists out there working at big companies who will give respect where it’s due, even when this means admitting that someone else wrote better code (and sometimes, people were glad to give up the role and move on). Sadly, this isn’t universal. Expect to have the goal-posts moved frequently: you will often be competing with someone else’s system that’s being ‘generously benchmarked’ while your system is being ‘adversarially benchmarked’. This means that you really can’t afford to have glaring weaknesses in secondary metrics.

Our primary metric was essentially raw performance. However, there were a host of secondary metrics (size of pattern matching byte code, size of stream state, pattern compile time, etc.) and it was impossible to tell in advance who cared about what. Even worse, in an adversarial benchmark situation, you can expect whoever is doing the evaluation to suddenly ‘care’ about whichever metric makes your code look the worst.

Bonus anti-pattern to look out for: finding out that for months you have been talking to one evaluator who has 100% control of process and is hiding their results away from the rest of their company; you will go back through the email chain and notice that no other email address has ever appeared. Who is their boss? Who is their coworker? If this happens to you, stay not on the order of your going but Go At Once!

Evaluations seem to go a lot better if they are bottom-up and engineer-driven rather than top-down and manager-driven

We had a number of very successful evaluations at companies where the engineers were on our side and they persuaded their management that spending money on us was a good idea. Later on we had a number of evaluations where management of a company descended on their engineers and told them “use Hyperscan”. These evaluations were typically disasters, even though we had a better product and on paper the opportunities were promising. When it comes down to it, engineers don’t like being told what to do.

Expect to not be able to announce successes

For the entire history of Sensory Networks, we were almost never allowed to announce “design wins”. Most vendors who used Hyperscan were adamant that this not be mentioned publicly. I expect this would be similar for most algorithmic startups – too many announcements of this kind are presumably a free invitation to the competitors of those vendors to duplicate their functionality (we use signatures from X, a pattern match engine from Y, hardware from Z, and …).

So, expect your ‘News’ section on your website to be a bit more threadbare than you hoped.

Contract negotiations: don’t lose your nerve

Expect people to try stuff on. Many – most, in fact – of our customers dealt fairly with us as a small company. A few people, at a few companies, tried outrageous last-minute surprises in contracts. Keep your nerve; if that make-or-break deal for your company gets a horrifying provision added at the last minute, tell them to go away and do better.

Trying to impose exclusivity or various other limits on our freedom of action to sell Hyperscan as we pleased was a popular pastime, but no-one really insisted.

Some things that didn’t seem to be missed

A nice looking website.
Help from people who have nebulous jobs “helping out startups” (I don’t mean lawyers or accountants, I mean the Picks and Shovels crew that seem to know the real way to make money in a gold rush).
Having a roadmap that stretched more than about 2-3 releases and 6-9 months into the future; we almost never achieved any of the ‘long term’ items on our roadmap.
Finishing off emulating all the weird bits of libpcre, which was our ‘reference’ library for regular expression semantics (and generally an excellent base for semantics), or supporting a host of other syntaxes and semantics.
Joining important-sounding consortia that just amount to having a banner on someone else’s website in exchange for having your banner on their website. Does anyone care? The same goes double for being awarded ridiculous startup or small business prizes (“East Sydney’s Most Agile Startup 3 Quarters Running!”), exchanging physical plaques (!), sponsoring random things, etc.
Getting all sorts of mysterious certifications about how great our development methodology was, which often seemed to amount to telling some organization “our development methodology is pretty great”, writing a cheque, and getting the certification, without anyone ever actually looking at our code. Odd.

Conclusions, Sort Of

So, that was a bit of a stream-of-consciousness series of opinionated “hints and tips”. I don’t think there’s a really solid conclusion here – we got some things right-ish and some things wrong-ish and didn’t do too badly.

I’d be lying if I said that I thought that doing this type of startup was a route to enormous startup wealth. I’d be surprised to hear that a company can become a 1000X type Silicon Valley success story from algorithms alone; I’m pretty sure that you have to capture a lot more of the value than can be captured if you ship a nifty library and go home. I do think that this kind of startup can yield a reasonable outcome and someone sufficiently interested in their work can have a pretty nice time and learn a lot, while getting paid reasonably for it.

I’d be interested to hear comments or criticisms or links to other similar startup stories. I’d be particularly interested to hear stories of what it’s like on the open source side of the fence; the path taken by Sensory Networks now seems somewhat of a closed-source anachronism.

Performance notes on SMH: measuring throughput vs latency of short C++ sequences

June 5, 2018 geofflangdale 4 Comments

A quick update on last week’s post SMH: The Swiss Army Chainsaw of shuffle-based matching sequences on performance measurement.

In that post, I provided throughput numbers for these sequences but didn’t show latency. This is a critical distinction, and it doesn’t pay to be confused about the two. I would rather avoid the rather cringeworthy formulation from the Mythical Man Month (where women are “assigned” to the task of bearing children!) and stick to the metaphor of boiling eggs: a suitably large pot of boiling water could boil eggs at a throughput of an egg every 10 seconds, but cannot provide you with a 3-minute-boiled egg in less than 3 minutes.

It is important not to confuse the ability to do something in, say, 10 cycles vs the ability to do 1,000 somethings in 10,000 cycles. The former is always at least as hard and usually much harder. This distinction holds all the way down to the single operation level: for example, a modern x86 processor can launch a multiply operation every cycle, but requires 3 cycles to know the result of a given multiply.
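The multiply example can be sketched in plain C++. This is only an illustration: the function names and the constant K are made up for the purpose, and a compiler is free to transform either loop, so check the generated assembly before trusting any timing taken from it. Both functions execute the same number of multiplies; the first forms one long dependent chain, the second forms four chains that never read each other's results.

```cpp
#include <cassert>
#include <cstdint>

static const uint64_t K = 0x9E3779B97F4A7C15ULL;  // arbitrary odd multiplier

// Latency-bound: every multiply needs the previous result, so the chain
// cannot run faster than one multiply latency per step.
uint64_t dependent_chain(uint64_t seed, uint64_t n) {
    uint64_t x = seed;
    for (uint64_t i = 0; i < n; ++i) {
        x *= K;
    }
    return x;
}

// Throughput-bound: four chains with no dependencies between them, so an
// out-of-order core is free to overlap their multiplies.
uint64_t independent_chains(uint64_t seed, uint64_t n) {
    assert(n % 4 == 0);  // keeps the sketch simple: n counts multiplies
    uint64_t a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    for (uint64_t i = 0; i < n; i += 4) {
        a *= K;
        b *= K;
        c *= K;
        d *= K;
    }
    return a ^ b ^ c ^ d;
}
```

Timing the two with the same n should show the gap between the "10 cycles for one thing" world and the "10,000 cycles for 1,000 things" world, provided the compiler has left the loops recognisable.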

Modern computer architecture conspires against us when we wish to measure latency. Attempting to measure the latency of a single short code sequence is quite error-prone due to the overhead of the various performance counter or clock measurement calls.

Throughput is easy to measure on a larger scale, as we can measure thousands of iterations and establish an average cost per iteration. However, well-written code will usually attempt to minimize dependencies from one iteration to the next. When we attempt to measure, say, the branch-free code of SMH, there is little to prevent a modern, out-of-order processor from getting on with the next iteration or two while the previous iteration is handled.
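A minimal sketch of that kind of averaged measurement, using only the standard library. The helper name ns_per_iteration is hypothetical and is not part of the SMH code. Because nothing here stops one call from overlapping with the next, the number it returns is a throughput figure, not a latency figure.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `op` `iters` times and returns the mean
// wall-clock cost of one call in nanoseconds. The clock is read only twice,
// so its overhead is amortised over all iterations.
template <typename F>
double ns_per_iteration(F &&op, uint64_t iters) {
    assert(iters > 0);
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iters; ++i) {
        op(i);
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

The callable should write its result somewhere the compiler cannot ignore (a results vector, as in the code that follows, or a volatile sink), otherwise the work being measured may be optimised away.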

I tried two approaches both attempting to measure the latency of the various SMH sequences. The first was to insert an LFENCE instruction between each SMH sequence but otherwise keep the code the same. Note that LFENCE in this case can be switched on and off by a macro.

	template <typename T>
	void match_multiple_smh(T & smh, std::vector<u8 *> & buffers, std::vector<size_t> & lengths,
	std::vector<u32> & results) {
	u32 i = 0;
	#ifndef NO_UNROLL
	for (; i+7 < buffers.size(); i+=8) {
	results[i+0] = smh.match(buffers[i+0], lengths[i+0]); LFENCE
	results[i+1] = smh.match(buffers[i+1], lengths[i+1]); LFENCE
	results[i+2] = smh.match(buffers[i+2], lengths[i+2]); LFENCE
	results[i+3] = smh.match(buffers[i+3], lengths[i+3]); LFENCE
	results[i+4] = smh.match(buffers[i+4], lengths[i+4]); LFENCE
	results[i+5] = smh.match(buffers[i+5], lengths[i+5]); LFENCE
	results[i+6] = smh.match(buffers[i+6], lengths[i+6]); LFENCE
	results[i+7] = smh.match(buffers[i+7], lengths[i+7]); LFENCE
	}
	#endif
	for (; i < buffers.size(); ++i) {
	results[i] = smh.match(buffers[i], lengths[i]); LFENCE
	}
	}


The second approach was to make the location that was read by an SMH sequence depend on the result of the previous SMH sequence. Since I didn’t want to introduce a spurious ‘jumping around memory’ component to the benchmark (which would always be absent from the equivalent throughput metric), I made sure that the previous SMH sequence always happened to return zero (no match): we know this, but the architecture and the compiler don’t.

Creating long chains of dependent operations is also how Agner Fog (and others) measure latency; those who have not yet seen Agner’s Software optimization resources are in for a treat.

The code to measure SMH latency is below (note that LFENCE is switched off by the preprocessor as needed and was not used in the latency-test version of this code at all):

	template <typename T>
	void match_multiple_smh_latency_test(T & smh, std::vector<u8 *> & buffers, std::vector<size_t> & lengths,
	std::vector<u32> & results) {
	u32 i = 0;
	u32 tmp = 0;
	#ifndef NO_UNROLL
	// NOTE: experimental code only. Note that the addition of 'tmp' – being the id of a possible
	// match – could take us RIGHT outside our buffer if we actually matched something. We aren't
	// in this particular run, but so it goes. Saner would be to build up an all-zero id vector
	for (; i+7 < buffers.size(); i+=8) {
	tmp = results[i+0] = smh.match(buffers[i+0 + tmp], lengths[i+0] + tmp); LFENCE
	tmp = results[i+1] = smh.match(buffers[i+1 + tmp], lengths[i+1] + tmp); LFENCE
	tmp = results[i+2] = smh.match(buffers[i+2 + tmp], lengths[i+2] + tmp); LFENCE
	tmp = results[i+3] = smh.match(buffers[i+3 + tmp], lengths[i+3] + tmp); LFENCE
	tmp = results[i+4] = smh.match(buffers[i+4 + tmp], lengths[i+4] + tmp); LFENCE
	tmp = results[i+5] = smh.match(buffers[i+5 + tmp], lengths[i+5] + tmp); LFENCE
	tmp = results[i+6] = smh.match(buffers[i+6 + tmp], lengths[i+6] + tmp); LFENCE
	tmp = results[i+7] = smh.match(buffers[i+7 + tmp], lengths[i+7] + tmp); LFENCE
	}
	#endif
	for (; i < buffers.size(); ++i) {
	tmp = results[i] = smh.match(buffers[i + tmp], lengths[i + tmp]); LFENCE
	}
	}


Observe the “tmp” variable in the gist above; it is always zero, but we cannot safely start our matching operation until the architecture has the result of the previous match operation in hand (Intel Architecture has many fascinating optimizations, but generalized value prediction is not one of them).

This gives us somewhat of a hybrid creature: “steady-state” latency. The compiler and architecture are still free to load things into registers that don’t depend on the actual computation – so this latency number is perhaps unrepresentative of a ‘cold start’. However, it is a reasonable measurement of the latency of a single operation in a well-optimized code base.

SMH Variant	Metric	normal	no unroll	LFENCE
SMH32-loose	Throughput (ns)	0.89	0.98	10.62
	Latency (ns)	7.03	6.92	10.65
SMH32	Throughput (ns)	1.12	1.15	11.02
	Latency (ns)	7.25	7.30	10.89
SMH64-loose	Throughput (ns)	1.35	1.44	11.03
	Latency (ns)	7.63	7.61	11.36
SMH64	Throughput (ns)	1.62	1.66	11.67
	Latency (ns)	7.95	8.00	11.61
SMH128-loose	Throughput (ns)	2.80	2.67	12.39
	Latency (ns)	8.97	8.14	12.91
SMH128	Throughput (ns)	3.32	3.08	12.82
	Latency (ns)	9.78	8.55	12.91

The above numbers seem reasonable based on a walkthrough of the code. I also measured the effect of turning off my manual 8-way unroll. I had focused on smaller models and the metric of throughput as I tuned SMH; it’s marginally interesting to note that latency is generally, if not decisively, better without an unroll in the measurement loop.

The LFENCE results are hard to interpret – they seem to generally track the latency of the normal case plus around 3.5ns. More work is needed to confirm this; it would be nice to have a way of getting a latency number out of the system that doesn’t rely on an ability to introduce contrived data dependencies from one iteration to the next.

I feel reasonably confident that SMH can be said to do its work in 7-9 ns; note that, going by the above table, as many as 8 iterations must overlap to hit the full throughput in the cheapest cases (for SMH32-loose, 7.03 ns of latency against 0.89 ns per operation). As always, this implies that being stuck in the ‘latency world’ is miserable – try to phrase your computations to stay in the ‘throughput world’ whenever you can.

Updated code is here

Thoughts on how to measure latency are welcome.

“Say Hello To My Little Friend”: Sheng, a small but fast Deterministic Finite Automaton

May 25, 2018 June 4, 2018 geofflangdale 12 Comments

Deterministic Finite Automata (DFA, subsequently) are a fundamental structure. Most state machines that programmers build are some variant on a DFA, whether they are built by jumping around inside a switch statement or moving from state to state in a table structure.

They have uses all over the place; they are used heavily in regular expression implementation, and can be used in various validation schemes such as UTF-8 validation. I’m going to show a curious little DFA of my own invention* that we used in Hyperscan**. The presentation here will be an independent re-implementation as the version in Hyperscan is buried in some pretty complex code.

Sheng has some pretty tight limitations, especially in the version I’m presenting here:

It cannot have more than 16 states.
This version of Sheng is ‘quiet’ – it calculates states but doesn’t have an ‘accept state’ that is actively raised. So you can’t detect a regular expression and get a callback or an index as to where it matched.
This version of Sheng is also a bare DFA without a compiler. You need to put the transitions of the state machine in manually.
This version of Sheng depends on x86 instructions, but the principles could allow the extension of Sheng to any system with a similar permute instruction, such as ARM NEON.

Most important: Sheng uses my favorite instruction, PSHUFB!

The Problem in Plain DFA implementations: Memory Latency

A typical problem for DFA implementation is that, at best, each DFA state transition typically involves a single memory access. More compact implementations may use several. Worse still, each of these state transitions depends on the previous state transition, so a simple DFA cannot run faster than the latency of the lowest level of cache (often plus a cycle, if there are things that need to be done to the loaded value from the transition table to make it suitable for another state transition).

This is the critical path of the DFA: the state-to-state transition. Other activities, such as remapping characters to a smaller character set to save space, or checking for accept states, are not on the critical path and are almost ‘free’ in a typical implementation – after all, we’re waiting for the state transition to finish. That’s a lot of free execute slots!

Here’s a not very interesting DFA implementation:

	struct BasicDFA {
	typedef u8 State;
	u8 transitions[16][256];
	State start_state;
	BasicDFA(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
	…
	}
	State apply(const u8 * data, size_t len, State s) {
	size_t i = 0;
	for (; i+7 < len; i+=8) {
	u8 c1 = data[i+0];
	…
	u8 c8 = data[i+7];
	s = transitions[s][c1];
	…
	s = transitions[s][c8];
	}
	for (; i < len; ++i) {
	s = transitions[s][data[i]];
	}
	return s;
	}
	};


This isn’t a perfect “simple” DFA implementation; we waste at least 1 cycle of latency in our state-to-state transition on index arithmetic to look up that big array (better, but more obscure, would be to track our state as a location within the transition table).

Note that the full implementation unrolls the loop, too.

However, even given a wasted cycle or two of latency, this implementation is close to the limit of memory latency. The DFA is small (4K) so we will be getting it from L1 cache in the steady state, but that means a state-to-state transition at around 4-5 cycles minimum.
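The "track our state as a location within the transition table" idea mentioned above could look something like the following. This is a sketch under my own assumptions, not code from the Sheng repository: the PointerDFA name, its constructor arguments and the Row type are all hypothetical. Each table entry holds a pointer to the next state's row, so the only work on the critical path is one dependent load per character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

struct PointerDFA {
    struct Row { const Row *next[256]; };
    std::vector<Row> rows;  // one row per state; never resized after construction

    // trans_vec holds (from, to, character) triples, as in BasicDFA.
    PointerDFA(const std::vector<std::tuple<uint32_t, uint32_t, uint8_t>> &trans_vec,
               uint32_t n_states, uint32_t default_state) : rows(n_states) {
        assert(n_states > 0 && default_state < n_states);
        for (auto &r : rows)
            for (auto &p : r.next) p = &rows[default_state];
        for (const auto &t : trans_vec) {
            uint32_t from, to;
            uint8_t c;
            std::tie(from, to, c) = t;
            assert(from < n_states && to < n_states);
            rows[from].next[c] = &rows[to];
        }
    }
    // Rows point at each other, so copying would leave dangling pointers.
    PointerDFA(const PointerDFA &) = delete;
    PointerDFA &operator=(const PointerDFA &) = delete;

    // The state is a pointer to a row; each step is a single dependent load.
    uint32_t apply(const uint8_t *data, size_t len, uint32_t start) const {
        const Row *s = &rows[start];
        for (size_t i = 0; i < len; ++i) s = s->next[data[i]];
        return static_cast<uint32_t>(s - rows.data());
    }
};
```

The trade-off is size: with 8-byte pointers a 16-state table grows from 4K to 32K, which may cost more in cache pressure than the saved index arithmetic gains. I have not measured this variant.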

Enter My Little Friend: Sheng

Sheng is a different approach. Sheng uses the PSHUFB instruction to implement the state transitions taken by looking up a shuffle mask for each input character. Note that the lookup operation is not on the critical path, as we know our input characters well in advance.

As such, the critical path for Sheng is just 1 cycle on modern architectures; both recent Intel and AMD processors implement PSHUFB with a single cycle of latency.

The variant of Sheng presented is ‘silent’ – it allows us to calculate which state we’re in at a given point but it has no facility to detect whether a match has occurred. We’ll cover the feature of a non-silent Sheng later; sadly, the number of instructions required to check our state means that we will have to add a lot of extra work – too much work to manage 1 cycle per byte (not a critical path issue – it’s just that it’s hard to do that many operations in a cycle).

	struct Sheng {
	typedef m128 State;
	m128 transitions[256];
	State start_state;

	Sheng(std::vector<std::tuple<u32, u32, u8>> & trans_vec, u8 start_state_, u8 default_state) {
	// fill all transitions with default state
	for (u32 i = 0; i < 256; ++i) {
	transitions[i] = _mm_set1_epi8(default_state);
	}
	// fill in state transition for slot 'from' to point to 'to' for our character transition c
	for (auto p : trans_vec) {
	u32 from, to;
	u8 c;
	std::tie(from, to, c) = p;
	set_byte_at_offset(transitions[c], from, to);
	}
	start_state = _mm_set1_epi8(start_state_); // put everyone into start state – why not?
	}

	State apply(const u8 * data, size_t len, State s) {
	size_t i = 0;
	for (; i+7 < len; i+=8) {
	u8 c1 = data[i+0];
	u8 c2 = data[i+1];
	u8 c3 = data[i+2];
	u8 c4 = data[i+3];
	u8 c5 = data[i+4];
	u8 c6 = data[i+5];
	u8 c7 = data[i+6];
	u8 c8 = data[i+7];
	s = _mm_shuffle_epi8(transitions[c1], s);
	s = _mm_shuffle_epi8(transitions[c2], s);
	s = _mm_shuffle_epi8(transitions[c3], s);
	s = _mm_shuffle_epi8(transitions[c4], s);
	s = _mm_shuffle_epi8(transitions[c5], s);
	s = _mm_shuffle_epi8(transitions[c6], s);
	s = _mm_shuffle_epi8(transitions[c7], s);
	s = _mm_shuffle_epi8(transitions[c8], s);
	}
	for (; i < len; ++i) {
	s = _mm_shuffle_epi8(transitions[data[i]], s);
	}
	return s;
	}
	};


So this one is a little weird: we heavily depend on my favorite instruction, PSHUFB, included on most x86 processors since its introduction with SSSE3 (the catchily named “Supplemental Streaming SIMD Extensions 3”).

PSHUFB (_mm_shuffle_epi8 in this code) is a bytewise shuffle, using the low 4 bits of each byte from a control mask register to indicate which byte to copy from the source register to the destination. It can be used to permute data, but it can also be used to effectively look up a 16-wide table.

In this usage, PSHUFB masks are found on a per-character basis. We look up a mask for each character of our input and use this mask to look up what our next state should be. For example, in the 5th unrolled iteration, our current state is used to index into the mask for that character (“transitions[c5]”); permuting the mask by the current state yields our new state.

We keep our canonical state in the bottom lane of the 128-bit register.
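For readers without the instruction reference to hand, here is a scalar model of what PSHUFB does when used as a 16-entry table lookup. It is a portable stand-in for illustration, not the intrinsic itself, and the names pshufb_model, sheng_step and Bytes16 are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Scalar model of _mm_shuffle_epi8(table, control): lane i of the result is
// table[control[i] & 0x0f], or zero when the top bit of control[i] is set.
Bytes16 pshufb_model(const Bytes16 &table, const Bytes16 &control) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = (control[i] & 0x80) ? 0 : table[control[i] & 0x0f];
    }
    return out;
}

// One Sheng transition in this model: every lane looks up its next state in
// the mask chosen by the input character.
Bytes16 sheng_step(const Bytes16 &mask_for_char, const Bytes16 &state) {
    for (uint8_t s : state) assert(s < 16);  // states must fit in 4 bits
    return pshufb_model(mask_for_char, state);
}
```

Here mask_for_char plays the role of transitions[c] and state plays the role of s in the Sheng code above; lane 0 of the result is the canonical state.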

As a side note, we could actually be processing 16 DFAs at once, with an almost useless set of limitations:

The DFAs all have to have the same structure and character transitions.
The DFAs all have to be acting on the same data.

So really, all we can do is start the DFAs off in different states and then crank those states and see what happens. There is an interesting usage of this: picture what happens when we initialize a register with [0,1,2,3,…, 15] and process a block of data – we now have a function that can be applied as another shuffle mask! Details can wait for a followup blog post.
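The identity-register trick can be checked with a scalar model. What follows is a sketch with hypothetical names (ShengModel, shuffle, summarise), using a plain array in place of an SSE register; it is not the Hyperscan code. Running a block from the start vector [0, 1, …, 15] produces a 16-entry table mapping each start state to its end state, and two such tables compose with one more shuffle.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Scalar stand-in for PSHUFB, valid for control bytes in 0..15.
Bytes16 shuffle(const Bytes16 &table, const Bytes16 &control) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) out[i] = table[control[i] & 0x0f];
    return out;
}

struct ShengModel {
    std::array<Bytes16, 256> transitions;  // transitions[c][from] == to

    // Runs all 16 lanes over the block; lane i starts in state s[i].
    Bytes16 apply(const uint8_t *data, size_t len, Bytes16 s) const {
        for (size_t i = 0; i < len; ++i) s = shuffle(transitions[data[i]], s);
        return s;
    }

    // Summarises a block as a table: result[s] is the state reached when the
    // block is processed starting from state s.
    Bytes16 summarise(const uint8_t *data, size_t len) const {
        Bytes16 identity;
        for (int i = 0; i < 16; ++i) identity[i] = static_cast<uint8_t>(i);
        return apply(data, len, identity);
    }
};
```

Given summaries f1 and f2 of two consecutive blocks, shuffle(f2, f1) is the summary of both blocks together, and shuffle(f1, start) gives the states reached from any start vector without rescanning the data.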

So, what do we get from all this? The main advantage of doing this is speed – here’s the basic comparison of speed between the two systems (measured on a 4GHz Skylake client machine):

	$ sudo nice --20 taskset -c 1 ./sheng

	Sheng
	0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
	10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
	20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
	30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
	40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
	50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
	60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
	final state: 1 bytes scanned: 1638400000 seconds: 0.417347
	bytes per ns 3.92575

	Basic DFA
	0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1
	10/1 11/1 12/1 13/1 14/1 15/1 16/1 17/1 18/1 19/1
	20/1 21/1 22/1 23/1 24/1 25/2 26/3 27/4 28/5 29/5
	30/5 31/5 32/5 33/5 34/5 35/5 36/5 37/5 38/5 39/5
	40/5 41/5 42/5 43/5 44/5 45/5 46/5 47/5 48/5 49/5
	50/5 51/5 52/5 53/5 54/5 55/5 56/5 57/5 58/5 59/5
	60/6 61/7 62/8 63/9 64/10 65/1 66/1 67/1 68/1
	final state: 1 bytes scanned: 1638400000 seconds: 2.73646
	bytes per ns 0.59873


(there’s also a basic-level of traces through states included here so that I could verify that the two state machines are basically sane and doing the same thing; see the code)

So we’re processing 3.92 bytes per nanosecond (pretty close to 1 cycle/byte) as opposed to around 0.6 bytes per nanosecond with a basic DFA implementation (which could probably go about 10-20% faster with a more sophisticated table lookup, but not that much more). Sounds good – as long as we can live with the long list of limitations of Sheng.

Sheng has a lot of interesting properties, which I’ll follow up in later posts:

There are several strategies for having a “noisy” Sheng – that is, one that can stop, raise a callback, or write to a buffer whenever it encounters some “interesting” state (e.g. an accept state).
There are also a number of ways Sheng can be adapted to handle a larger portion of the pattern matching task.
There is nothing inherently x86-centric about Sheng. The TBL instructions on Neon could be used to build up the same facility on ARM, and the multiple register variants of these instructions could be used to build 32, 48 or 64-state DFAs.
An AVX2 machine can run two independent 16-state DFAs at once for the same cost, although there is no cost-free way for them to interact. AVX 512 adaptation of the same techniques allows 4 such independent 16-state DFAs.
AVX512 also allows other exotic structures, including larger DFAs using the 16-bit permute operations, including the 2-source permutes.
AVX512 VBMI adds VPERMB and 2-source byte permutes, allowing this technique to be extended to 64 or even 128 states! However, the added latency of these permutes means that a simplistic implementation will be much slower.
Since PSHUFB is a permute, it’s possible to compute blocks of this operation out-of-order. This can be exploited to improve throughput where latency of an operation is not equal to throughput – this is not true of PSHUFB or VPSHUFB but is true of some of the more recent shuffle instructions (for example, many of the AVX512 16-bit shuffles are latency=7 throughput=2) and will likely be true of the next generation of shuffle instructions.
- Note that a 2-source permute is not straightforwardly handled by this, as in order to turn permutes over a block of input into a function, we must calculate all possible outcomes on all states. This becomes prohibitively expensive with already large operations.
- This out-of-order computation is not particularly suitable where a “noisy” Sheng is required

Until then, I hope you enjoyed Sheng, and you can find the code on Github.

https://github.com/geofflangdale/sheng

[ please note: it is essentially a ‘sketch’, lacking many features and there is approximately zero software engineering applied to it. The Sheng and BasicDFA structures should be related through static or dynamic polymorphism so that they can share test drivers, but I didn’t want to start designing a more generalized interface until I have built out more of the Sheng variants, so I used cut-n-paste-polymorphism 🙂 ]

[ also note: yes, there are many ways to make DFAs run faster, including acceleration, gluing the characters together and various other tricks. There are also a bunch of ways to make DFAs run slower; typically by implementing them on specialized hardware add-in cards, then waiting geological ages to get the data to the cards and the matches back. ]

* I independently invented this technique along with some researchers at Microsoft Research; if anyone can recall the paper where this technique is documented, please let me know and I’ll put in a link and appropriate credit.

Update: Anuj Kalia, in comments, identified a Microsoft Research paper that’s possibly what I saw as Data-Parallel Finite-State Machines – Microsoft Research – for the 16-state case, I believe this approach converges to be functionally equivalent to Sheng. We discovered this work only when we went looking to establish originality of Sheng…

** Anatoly Burakov wrote the first implementation of Sheng within Hyperscan. Alex Coyte later extended Sheng to work as part of a much larger DFA, a subsystem which he felt moved to dub “Shengy McShengface”, for reasons he may not be able to adequately explain.

Bits to indexes in BMI2 and AVX-512

May 22, 2018 geofflangdale 10 Comments

[ Please bear with the terrible formatting of the table in this post; I was pretty startled at how limited my options were from a vanilla formatter. Opinions on a better method are welcome. ]

Daniel Lemire, in his post Iterating over set bits quickly (SIMD edition) discusses several techniques to iterate over set bits quickly – or more precisely, to turn a collection of bits into a variable-length buffer full of integers indicating which bits were set.

So, if your code gets given an array with the following 16-bit bitfields (assuming little-endian order):

0x1001, 0x0003, 0xffff

you would want the answer:

indexes = 0, 12, 16, 17, 32, 33, 34, ... , 46, 47

This is an important operation. While it’s a lot of fun to do clever things with SIMD, sooner or later you may need to do something specific with the bits you found in your SIMD registers. For example, we used a number of SIMD techniques in Hyperscan to search for strings, but eventually you would have to report that you’d found something to the rest of the system.
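To pin down exactly what output is wanted, here is a small reference decoder for the 16-bit example above. It is the obvious branchy loop, not the technique this post is about, and the function name is hypothetical. It relies on the GCC/Clang builtin __builtin_ctz.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference decoder: appends the position of every set bit in `words`,
// counting bit 0 of words[0] as index 0 and bit 0 of words[1] as index 16.
std::vector<uint32_t> decode_indexes(const std::vector<uint16_t> &words) {
    assert(words.size() <= UINT32_MAX / 16);  // indexes must fit in 32 bits
    std::vector<uint32_t> out;
    for (size_t k = 0; k < words.size(); ++k) {
        uint32_t w = words[k];
        while (w != 0) {  // __builtin_ctz is undefined for zero
            out.push_back(static_cast<uint32_t>(k * 16 + __builtin_ctz(w)));
            w &= w - 1;   // clear the lowest set bit
        }
    }
    return out;
}
```

Feeding it 0x1001, 0x0003, 0xffff yields the 20 indexes listed above. The while loop runs once per set bit, which is exactly the data-dependent branching the rest of this post tries to avoid.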

After reading Daniel’s post, and more importantly, taking some time to hack on an AVX-512 system that he generously gave me access to, I think I have invented a new, branch-free way of solving this problem for 64-bit integers. There is the small catch that you will have to have an AVX-512 capable system handy.

(I say I think I invented this as it’s quite possible that (a) I’ve absorbed this technique from somewhere and forgot, or (b) someone else has already independently invented this)

Here’s the technique.

Let’s rig up a bunch of masks with alternating blocks of one and zero bits:

uint64_t msk_1 = 0xffffffff00000000ULL;
uint64_t msk_2 = 0xffff0000ffff0000ULL;
uint64_t msk_3 = 0xff00ff00ff00ff00ULL;
uint64_t msk_4 = 0xf0f0f0f0f0f0f0f0ULL;
uint64_t msk_5 = 0xccccccccccccccccULL;
uint64_t msk_6 = 0xaaaaaaaaaaaaaaaaULL;

Now, suppose I have a bitvector in v that I’d like to turn into a bunch of indexes. I can get a start by doing this:

uint64_t v1 = _pext_u64(msk_1, v);
uint64_t v2 = _pext_u64(msk_2, v);
uint64_t v3 = _pext_u64(msk_3, v);
uint64_t v4 = _pext_u64(msk_4, v);
uint64_t v5 = _pext_u64(msk_5, v);
uint64_t v6 = _pext_u64(msk_6, v);

What did this achieve? Well, suppose I have bit 11 (counting from zero, so binary 001011) set in v and nothing else. Looking into my masks, I can see that my PEXT operation (a fast bitwise extract) got me a 1-bit from msk_6, a 1-bit from msk_5, a 0-bit from msk_4, a 1-bit from msk_3 and 0-bits otherwise. These bits will all be deposited into the least significant bits of the v1 through v6 temporaries.

In other words, for each set bit, I’m extracting the bit pattern of its index from the masks and depositing that bit pattern in the least-significant bits of my v1 through v6 temporary values.

So, in the unlikely event that you were hoping to get the right answers, annoyingly bit-smeared across 6 different uint64_t variables, you’re done. But that’s probably not very satisfying. We’ll get to that.
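The bit-smearing can be checked without BMI2 hardware by modelling PEXT in portable code. This is a sketch: pext_model and index_of_jth_set_bit are hypothetical names, and the loop-based PEXT is only a stand-in for the single instruction. The second function reads bit j out of each of the six results and reassembles the index of the j-th set bit of v.

```cpp
#include <cassert>
#include <cstdint>

// Portable model of _pext_u64(src, mask): gathers the bits of `src` at the
// positions selected by `mask` and packs them into the low bits of the result.
uint64_t pext_model(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    int out_bit = 0;
    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1);  // isolate the lowest set bit
        if (src & lowest) result |= (uint64_t{1} << out_bit);
        ++out_bit;
        mask &= mask - 1;                      // drop that bit from the mask
    }
    return result;
}

// Rebuilds the index of the j-th set bit of v (j counts from 0) from the six
// PEXT results. Only meaningful when v has more than j bits set.
unsigned index_of_jth_set_bit(uint64_t v, unsigned j) {
    assert(j < 64);
    static const uint64_t masks[6] = {
        0xffffffff00000000ULL, 0xffff0000ffff0000ULL, 0xff00ff00ff00ff00ULL,
        0xf0f0f0f0f0f0f0f0ULL, 0xccccccccccccccccULL, 0xaaaaaaaaaaaaaaaaULL};
    static const unsigned weights[6] = {32, 16, 8, 4, 2, 1};
    unsigned index = 0;
    for (int m = 0; m < 6; ++m) {
        uint64_t vm = pext_model(masks[m], v);  // same argument order as above
        index += weights[m] * static_cast<unsigned>((vm >> j) & 1);
    }
    return index;
}
```

Note the argument order: the constant mask is the value being extracted from, and the input bitvector v selects which positions to extract.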

So how do we interleave these 6 values together? This looks pretty ugly – we’re looking at 384 total bits in the worst case of answers. So this doesn’t seem like something we can do fast in the General Purpose Registers. Let’s go to SIMD.

The principle we will apply is that we will use AVX-512’s facility to use a 64-bit mask to control a SIMD computation. We will take our 6 values and use them to control the progressive adding of 32, 16, 8, 4, 2 and 1 into a result.

__m512i vec;
vec = _mm512_maskz_add_epi8(v1, v32_bit, _mm512_set1_epi8(0));
vec = _mm512_mask_add_epi8(vec, v2, v16_bit, vec);
vec = _mm512_mask_add_epi8(vec, v3, v8_bit, vec);
vec = _mm512_mask_add_epi8(vec, v4, v4_bit, vec);
vec = _mm512_mask_add_epi8(vec, v5, v2_bit, vec);
vec = _mm512_mask_add_epi8(vec, v6, v1_bit, vec);
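In scalar terms, the six masked adds amount to the loop below. This is a model for illustration only, and it assumes that v32_bit through v1_bit in the snippet above are vectors with every byte set to 32, 16, 8, 4, 2 and 1 respectively (they are not defined in this post). The name assemble_lanes is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of the masked byte adds: v[0]..v[5] stand for v1..v6, and
// lane j of the result gains a weight whenever bit j of the matching mask
// value is set.
std::array<uint8_t, 64> assemble_lanes(const std::array<uint64_t, 6> &v) {
    static const uint8_t weights[6] = {32, 16, 8, 4, 2, 1};
    std::array<uint8_t, 64> lanes{};  // the maskz form starts every lane at zero
    for (int m = 0; m < 6; ++m) {
        for (int j = 0; j < 64; ++j) {
            if ((v[m] >> j) & 1) {
                assert(lanes[j] + weights[m] <= 63);  // each weight is added at most once
                lanes[j] = static_cast<uint8_t>(lanes[j] + weights[m]);
            }
        }
    }
    return lanes;
}
```

Here lanes plays the role of vec. For an input with bits 0, 11 and 63 set, the six PEXT results are 4, 4, 6, 4, 6, 6, and the first three lanes come out as 0, 11 and 63; the remaining lanes stay at zero.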

Now vec holds the answer we wanted, if we just wanted a bunch of bytes on our output, ranging from 0..63. Unfortunately, we need to write some not very interesting code if we’re doing this over a large range, where we imagine that our offsets might be much larger than a byte. If we’re working continuously over inputs >64K, we would expect to need 4 byte answers. In order to write out up to 64 uint32_t offsets, we’re going to have to spread out our results over 4 registers (spreading the bytes over u32 units), add in a value ‘k’ representing the base offset of our 64-bit value to begin with, and write all 4 of these big registers out.

__m512i base = _mm512_set1_epi32(k*64);
__m512i r1 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,0));
__m512i r2 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,1));
__m512i r3 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,2));
__m512i r4 = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(vec,3));

r1 = _mm512_add_epi32(r1, base);
r2 = _mm512_add_epi32(r2, base);
r3 = _mm512_add_epi32(r3, base);
r4 = _mm512_add_epi32(r4, base);
_mm512_storeu_si512((__m512i *)out, r1);
_mm512_storeu_si512((__m512i *)(out + 16), r2);
_mm512_storeu_si512((__m512i *)(out + 32), r3);
_mm512_storeu_si512((__m512i *)(out + 48), r4);

(note that ‘out’ is a uint32_t pointer, so we are actually getting +64, +128, +192 bytes with those last three offsets).

Alert readers will note that this code is writing a lot of stuff out. What happens if we only had 1 bit set? Or 0? Well, this blog isn’t called “Branch Free” for nothing.

More seriously, the point is that it’s usually cheaper to do the same thing every time rather than run the risk of a branch mispredict. Looking back at the code above – sure, it looks like a giant bolus of code. But a branch miss on a modern architecture is around 14 cycles. That’s a lot of missed opportunity to do work.

Even if you accept my above philosophy of doing tons of potentially redundant work over risking a branch miss, there’s one more question – we need to know where our next write should be:

uint8_t advance = __builtin_popcountll(v);
out += advance;

That just moves us up (remember ‘out’ is a uint32_t pointer for pointer math purposes) to just past the last value that actually had something set. And we’re done!
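The "write everything, then advance by the popcount" pattern can be sketched in scalar code. To be clear, this swaps the PEXT and AVX-512 machinery for a fixed-length ctz loop, so it only illustrates the output-cursor handling; decode_word and decode_all are hypothetical names, and the builtins are GCC/Clang specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stores 64 candidate indexes for 64-bit word number k, whatever v holds, and
// returns how many of them are real (the popcount). The loop has a fixed trip
// count, so nothing here branches on the data.
size_t decode_word(uint64_t v, uint32_t k, uint32_t *out) {
    const uint64_t sentinel = uint64_t{1} << 63;  // keeps ctz defined once v runs dry
    uint64_t rest = v;
    for (int j = 0; j < 64; ++j) {
        out[j] = k * 64u + static_cast<uint32_t>(__builtin_ctzll(rest | sentinel));
        rest &= rest - 1;  // clear the lowest set bit; stays 0 once empty
    }
    return static_cast<size_t>(__builtin_popcountll(v));
}

std::vector<uint32_t> decode_all(const std::vector<uint64_t> &words) {
    // Every call writes 64 slots from the cursor, so the buffer needs 64
    // entries of slack beyond the real output; 64 per word always suffices.
    std::vector<uint32_t> buf(words.size() * 64);
    uint32_t *out = buf.data();
    for (size_t k = 0; k < words.size(); ++k) {
        out += decode_word(words[k], static_cast<uint32_t>(k), out);
        assert(out <= buf.data() + buf.size());
    }
    buf.resize(static_cast<size_t>(out - buf.data()));  // drop the leftovers
    return buf;
}
```

The leftover entries past the cursor are simply overwritten by the next word's store, which is why the output buffer must always have room for a full 64 entries beyond the current position.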

Is it fast?

Here’s a rough spreadsheet of the results measured against several of the other methods described in Daniel’s article. It’s faster than most of the other methods, falling down only for very low ‘bitmap densities’. For these lower densities, taking a conditional branch with the prospect that the expected number of bits set in a word is very low is a winning proposition.

Bitmap density	Method	Cycles per index
0.03	bitmap_decode_ctz	3.852
	bitmap_decode_avx2	10.116
	bitmap_decode_avx2_turbo	14.363
	bitmap_decode_avx2_turbo_thin	15.736
	bitmap_decode_avx2_turbo_nopopcnt	12.624
	bitmap_decode_bmi2_avx512	12.9
0.12	bitmap_decode_ctz	4.97
	bitmap_decode_avx2	3.003
	bitmap_decode_avx2_turbo	4.205
	bitmap_decode_avx2_turbo_thin	4.547
	bitmap_decode_avx2_turbo_nopopcnt	3.732
	bitmap_decode_bmi2_avx512	2.481
0.25	bitmap_decode_ctz	4.251
	bitmap_decode_avx2	1.52
	bitmap_decode_avx2_turbo	2.09
	bitmap_decode_avx2_turbo_thin	2.265
	bitmap_decode_avx2_turbo_nopopcnt	1.861
	bitmap_decode_bmi2_avx512	1.25
0.5	bitmap_decode_ctz	3.446
	bitmap_decode_avx2	0.796
	bitmap_decode_avx2_turbo	1.042
	bitmap_decode_avx2_turbo_thin	1.131
	bitmap_decode_avx2_turbo_nopopcnt	0.92
	bitmap_decode_bmi2_avx512	0.616
0.9	bitmap_decode_ctz	3.037
	bitmap_decode_avx2	0.444
	bitmap_decode_avx2_turbo	0.574
	bitmap_decode_avx2_turbo_thin	0.628
	bitmap_decode_avx2_turbo_nopopcnt	0.509
	bitmap_decode_bmi2_avx512	0.366
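
For comparison, the conditional-branch style that wins at very low densities can be sketched roughly as follows. This is my assumption of the general shape of a count-trailing-zeros decoder, not necessarily the exact bitmap_decode_ctz code that was benchmarked; the name decode_ctz and its signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ctz-style decoder: one inner-loop iteration per set bit.
 * Returns the number of indices written to 'out', which must have room
 * for every set bit in the input. */
static size_t decode_ctz(const uint64_t *words, size_t nwords, uint32_t *out) {
    size_t pos = 0;
    for (size_t k = 0; k < nwords; k++) {
        uint64_t w = words[k];
        while (w != 0) {                    /* data-dependent branch */
            out[pos++] = (uint32_t)(k * 64) + (uint32_t)__builtin_ctzll(w);
            w &= w - 1;                     /* clear the lowest set bit */
        }
    }
    return pos;
}
```

When most words are empty or hold a single bit, the inner loop exits almost immediately, so very little work is done per word. As the density rises, the exit branch becomes harder to predict and the fixed-work approach pulls ahead.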

Is this a great idea? I don’t know.

There are no doubt other methods to use AVX512 to transform bit vectors in this fashion, and for a relatively low ‘population’ there are a number of ways the bitmap_decode_ctz code can be made to run faster (possibly the topic of another article).

I still think it’s an interesting ‘trick’ and it’s nice to take my second-favorite instruction (PEXT) out for a spin.

Let me know if you’ve seen this trick somewhere before and I’ll be happy to credit where credit is due. As I said, I think I invented it…

The code is available at Daniel Lemire’s Github with an error (my fault, apparently I thought 8+2 = 9) which will be corrected in due course.

ps. In the ‘dynamiting the trout stream’ category, I give you VPCOMPRESSB from Intel® Architecture Instruction Set Extensions Programming Reference (PDF) which will greatly simplify all the above trickery, once we have AVX512_VBMI2 capable machines (Ice Lake time-frame).

pps. There is also a branch-free means where VPCOMPRESSD can be used four times on 16-bit words to solve a similar problem on machines that are publicly available now. This can be left as an exercise for the reader. It might be faster than the BMI2 stuff, but it lacks style points.

Introduction and welcome

May 22, 2018 geofflangdale 5 Comments

Hello, world.

This is my blog where I will talk about things that interest me (and a no doubt small collection of others). Topics that I’m interested in include:

Low-level and performance-oriented programming
Computer architecture (especially as it relates to performance-oriented code)
Programming languages
Regular expression implementation and automata theory
… and of course, implementing things without branches! Thus the name.

I was the designer of the Hyperscan project. I built this system at Sensory Networks, which was acquired by Intel Corporation in 2013, and worked on Hyperscan at Intel for over 4 years.

I hope that I can show you some interesting things. I have a few things in the pipeline that I will show shortly, including some string matching work, a fast Random Forest implementation and a lot of my favorite low-level coding tips and tricks.

I ask that my readers bear with me and forgive the (hopefully temporary) amateurish nature of the site; I am not an expert blogger or user of WordPress.

Preliminary Comments, Background, Disclaimers

Doing an algorithms startup is a lot of fun

Doing an algorithms startup won’t necessarily make you tons of money

Speculative: What should an algorithms startup do after capturing most of the Total Addressable Market for that algorithm? When are you ready to do that?

Don’t drift into being a consulting business

Iterate, and release a Minimum Viable Product (MVP) early, but make the MVP actually Viable

Your product will have gaps, but the earlier your customers discover them, the better

Testing

Work clean and test everything

The ability to ‘fuzz’ a complex system is a lifesaver, but it comes with a trap

Assume every metric that isn’t measured is Bad News for you.

Team Issues

Watch out for Individual Contributor “Tourists”

Conversely, reward your Individual Contributors and don’t dead-end them on compensation.

A good team is not comprised of 100% ‘A’ players on some “Most Awesome Geek” standard.

API Design

Build only what you need

Listen to your customers but don’t let them design your API for you

Capture significant use cases, even when you don’t have a brilliant solution for the use case.

Miscellaneous Issues

Don’t Bog Down on Trivial Stuff Immediately (or at all)

Work Clean – Legals

Work clean – Static and dynamic analysis

Build in a niche appropriate to your scale; don’t take your tricycle out on the expressway

Expect to fail evaluations and keep trying

Build a huge database of benchmarks and actually look at them

Expect to be evaluated by the person whose code will be replaced by yours if the evaluation succeeds

Evaluations seem to go a lot better if they are bottom-up and engineer-driven rather than top-down and manager-driven

Expect to not be able to announce successes

Contract negotiations: don’t lose your nerve

Some things that didn’t seem to be missed

Conclusions, Sort Of


The Problem in Plain DFA implementations: Memory Latency

Enter My Little Friend: Sheng
