Why Ice Lake is Important (a bit-basher’s perspective)

With Computex, there’s been a ton of news about Ice Lake (hereafter ICL) and the Sunny Cove core (SNC). Wikichip, Extremetech and Anandtech among many others have coverage (and there will be a lot more), so I won’t rehash this.

I also don’t want to embark on a long discussion about 14nm vs 10nm, Intel’s problems with clock speeds, etc. Much of this is covered elsewhere both well and badly. I’d rather focus on something I find interesting: a summary of what SNC is bringing us as programmers, assuming we’re programmers who like using SIMD (I’ve dabbled a bit).

I’m also going to take it as a given that Cannonlake (CNL) didn’t really happen (for all practical purposes). A few NUCs got into the hands of smart people who did some helpful measurements of VBMI, but from the perspective of the mass market, the transition is really going to be from AVX2 to “what SNC can do”.

While I’m strewing provisos around, I will also admit that I’m not very interested in deep learning, floating point or crypto (not that there’s anything wrong with that; there are just plenty of other people willing to analyze how many FMAs or crypto ops/second new processors have).

Also interesting: SNC is a wider machine with some new ports and new capabilities on existing ports. I don’t know enough about this to comment, and what I do remember from my time at Intel isn’t public, so I’m only going to talk about things that are on the public record. In any case, I can’t predict the implications of this sort of change without getting to play with a machine.

So, leaving all that aside, here are the interesting bits:

  • ICL will add AVX-512 capability to a consumer chip for the first time. Before ICL, you couldn't buy a consumer-grade computer that ran AVX-512 – there were various Xeon D chips that you could buy to get AVX-512, or you could buy the obscure CNL NUC or laptop (maybe), but there wasn't a mainstream development platform for this.
    • So we’re going straight from AVX2, which is an incremental SSE update that didn’t really extend SSE across to 256 bits in a general way (looking at VPSHUFB and VPALIGNR makes it starkly clear that, from the perspective of a bit/byte-basher, AVX2 is 2x128b) – all the way to AVX-512.
    • AVX-512 capability isn’t just an extension of AVX2 across more bits. The way that nearly all operations can be controlled by ‘mask’ registers allows us to do all sorts of cool things that previously would have required mental gymnastics to do in SIMD. The short version of the story is that AVX-512 introduces 8 mask registers that allow most operations to be conditionally controlled by whether the corresponding bit in the mask register is on or off (including the ability to suppress loads and stores that might otherwise introduce faults).
  • Further, the AVX-512 capabilities in ICL are a big update on what’s there in Skylake Server (SKX), the mainstream server platform that would be the main way that most people would have encountered AVX-512. When you get SNC, you’re getting not just SKX-type AVX-512, you’re getting a host of interesting add-ons. These include, but are not limited to (I told you I don’t care about floating point, neural nets or crypto):
    • VBMI
    • VBMI2
    • VPOPCNTDQ
    • BITALG
    • GFNI
    • VPCLMULQDQ

So, what’s so good about all this? Well, armed with a few handy manuals (the Extensions Reference manual and Volume 2 of the Intel Architecture Manual set), let’s dig in.

VBMI

This one is the only extension that we’ve seen before – it’s in Cannonlake. VBMI adds the capability of dynamically (not just a fixed pattern shuffle) shuffling bytes as well as words, “doublewords” (Intel-ese for 32-bits), and “quadwords” (64 bits). All the other granularities of shuffle up to a 512-bit size are there in AVX-512 but bytes don’t make it until AVX512_VBMI.

Not only that, the shuffles can have 2-register forms, so you can pull in values over a 2x512b pair of registers. So in addition to VPERMB we add VPERMT2B and VPERMI2B (the 2-register source shuffles have 2 variants depending on what thing gets overwritten by the shuffle result).

This is significant both for the ‘traditional’ sense of shuffles (suppose you have a bunch of bytes you want to rearrange before processing) but also for table lookup. If you treat the shuffle instructions as ‘table lookups’, the byte operations in VBMI allow you to look up 64 different bytes at once out of a 64-byte table, or 64 different bytes at once out of a 128-byte table in the 2-register form.
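As a hedged sketch of that table-lookup view (the intrinsics are the real VBMI ones, but the wrapper functions are purely illustrative):

#include <immintrin.h>

// 64-byte table: each byte of 'indices' picks a byte of 'table' (only the low 6 bits matter).
__m512i lookup64(__m512i indices, __m512i table) {
    return _mm512_permutexvar_epi8(indices, table);
}

// 128-byte table via the 2-register form: the low 7 bits of each index select across
// the concatenation of table_lo and table_hi.
__m512i lookup128(__m512i indices, __m512i table_lo, __m512i table_hi) {
    return _mm512_permutex2var_epi8(table_lo, indices, table_hi);
}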

The throughput and latency costs on CNL shuffles are pretty good, too, so I expect these operations to be fairly cheap on SNC (I don’t know yet).

VBMI also adds VPMULTISHIFTQB – this one works on 64-bit lanes and allows unaligned selection of any 8-bit field from the corresponding 64-bit lane. A pretty handy one for people pulling things out of packed columnar databases, where someone might have packed annoying sized values (say 5 or 6 bits) into a dense representation.
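For instance, here’s a sketch of unpacking eight 6-bit fields from the low 48 bits of each quadword – the control constant reflects my reading of the instruction’s definition, so treat the details as an assumption rather than gospel:

#include <immintrin.h>

__m512i unpack_6bit_fields(__m512i packed) {
    // Control bytes 0,6,12,...,42: each output byte grabs 8 bits starting at that
    // offset within its 64-bit lane.
    const __m512i offsets = _mm512_set1_epi64(0x2A241E18120C0600ULL);
    __m512i fields = _mm512_multishift_epi64_epi8(offsets, packed);
    // Keep only the low 6 bits of each extracted byte.
    return _mm512_and_si512(fields, _mm512_set1_epi8(0x3F));
}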

VBMI2

VBMI2 extends the VPCOMPRESS/VPEXPAND instructions to byte and word granularity. This allows a mask register to be used to extract (or deposit) only the appropriate data elements either out of (or into) another SIMD register or memory.

VPCOMPRESSB pretty much ‘dynamites the trout stream’ for the transformation of bits to indexes, ruining all the cleverness (?) I perpetrated here with AVX512 and BMI2 (not the vector kind, the PDEP/PEXT kind).
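To make that concrete, here’s a minimal sketch (assuming VBMI2; the function and names are mine, not from the linked post) of how VPCOMPRESSB turns a match mask straight into packed byte indexes:

#include <immintrin.h>

// Compressing a vector of the constants 0..63 under the match mask leaves the positions
// of the 1-bits packed at the bottom of the result; the number of valid indexes is just
// the popcount of the mask.
__m512i bits_to_indexes(__mmask64 matches) {
    const __m512i iota = _mm512_set_epi8(
        63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48,
        47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
    return _mm512_maskz_compress_epi8(matches, iota);
}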

VBMI2 also adds a new collection of instructions: VPSHLD, VPSHLDV, VPSHRD, VPSHRDV. These instructions allow left/right logical double shifts, either by a fixed amount or a variable amount (thus the 4 variants), across 2 SIMD registers at once. So, for either 16, 32 or 64-bit granularity, we can effectively concatenate a pair of corresponding elements, shift them (variably or not, left or right) and extract the result. This is a handy building block and would have been nice to have while building Hyperscan – we spent a lot of time working around the fact that it’s hard to move bits around a SIMD register (one of these double-shifts, plus a coarse-grained shuffle, would allow bit-shuffles across SIMD registers).
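As an illustration of that last point, here’s a hedged sketch of a 512-bit left-shift-by-one built from VPSHLDQ plus a coarse 64-bit-granularity shuffle (the carry handling is my own assumption about how you’d stitch blocks together):

#include <immintrin.h>

// Shift a whole 512-bit register left by one bit, pulling bits across 64-bit lane
// boundaries; 'carry_in' is the register that logically precedes 'x' (its top lane's
// top bit becomes the new bit 0).
__m512i shift512_left_1(__m512i x, __m512i carry_in) {
    // neighbour[i] = x[i-1] for i >= 1, neighbour[0] = carry_in[7] (a lane-granularity shuffle).
    __m512i neighbour = _mm512_alignr_epi64(x, carry_in, 7);
    // Per lane: take the 128-bit concatenation (x[i]:neighbour[i]), shift left by 1,
    // keep the top 64 bits - i.e. (x[i] << 1) | (neighbour[i] >> 63).
    return _mm512_shldi_epi64(x, neighbour, 1);
}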

VPOPCNTDQ/BITALG

I’m grouping these together, as VPOPCNTDQ is older (from the MIC product line) but the BITALG capabilities that arrive together with VPOPCNTDQ for everyone barring the Knights* nerds nicely round out the capabilities.

VPOPCNT does what it says on the tin: a bitwise population count for everything from bytes and words (BITALG) up to doublewords and quadwords (VPOPCNTDQ). We like counting bits. Yay.

VPSHUFBITQMB, also introduced with BITALG, is a lot like VPMULTISHIFTQB, except that it extracts 8 single bits from each 64-bit lane and deposits them in a mask register.

GFNI

OK, I said I didn’t care about crypto. I don’t! However, even a lowly bit-basher can get something out of these ones. If I’ve mangled the details, or am missing some obvious better ways of presenting this, let me know – be aware that the below picture is an accurate presentation of what I look like when dealing with these instructions:

GF2P8AFFINEINVQB I’ll pass over in silence; I think it’s for the real crypto folks, not a monkey with a stick like me.

GF2P8AFFINEQB on the other hand is likely awesome. It takes each 8-bit value and ‘matrix multiplies’ it, in a carryless multiply sense, with an 8×8 bit matrix held in the same 64-bit lane as the 8-bit value came from.

This can do some nice stuff. Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of any bit from the source byte. So if you wanted to replace (b0, b1, .. b7) with (b7^b6, b6^b5, … b0^b0) you could. Trivially, of course, this also gets you 8-bit shift and rotate (not operations that exist on Intel SIMD otherwise). This use of the instruction effectively assumes the 64-bit value is ‘fixed’ and our 8-bit values are coming from an unknown input.
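For instance, the familiar reverse-the-bits-in-every-byte trick falls out directly; a minimal sketch (the matrix constant reflects my understanding of the instruction’s row layout):

#include <immintrin.h>

// Reverse the bit order within every byte: each 64-bit lane of the second operand is
// an 8x8 bit matrix, and this particular matrix maps input bit i to output bit 7-i.
__m512i bit_reverse_bytes(__m512i x) {
    const __m512i reverse_matrix = _mm512_set1_epi64(0x8040201008040201ULL);
    return _mm512_gf2p8affine_epi64_epi8(x, reverse_matrix, 0);
}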

One could also view GF2P8AFFINEQB as something where the 8-bit values are ‘fixed’ and the 64-bit values are unknown – this would allow the user to, say, extract bits 0,8,16… from a 64-bit value and put it in byte 0, as well as 1,9,17,… and put it in byte 1, etc. – thus doing a 8×8 bit matrix transpose of our 64-bit values.

I don’t have too much useful stuff for GF2P8MULB outside crypto, but it is worth noting that there aren’t that many cheap byte->byte transformations that can be done over 8 bits that aren’t straightforward arithmetic or logic (add, and, or, etc) – notably no lanewise multiplies. So this might come in handy, in a kind of monkey-with-a-stick fashion.

VPCLMULQDQ

OK, I rhapsodized already about a use of carry-less multiply to find quote pairs.

So the addition of a vectorized version of this instruction – VPCLMULQDQ – that allows us to not just use SIMD registers to hold the results of a 64b x 64b->128b multiply, but to carry out up to 4 of them at once, could be straightforwardly handy.

Longer term, carryless multiply works as a good substitute for some uses of PEXT. While it would be nice to have a vectorized PEXT/PDEP (make it happen, Intel!), it is possible to get a poor-man’s version of PEXT via AND + PCLMULQDQ – we can’t get a clean bitwise extract, but, we can get a 1:1 pattern of extracted bits by carefully choosing our carryless multiply multiplier. This is probably worth a separate blog post.

I have a few nice string matching algorithms that rest on PEXT and take advantage of PEXT as a useful ‘hash function’ (not really a hash function, of course). The advantage in the string matching world of using PEXT and not a hash function is the ability to ‘hash’ simultaneously over ‘a’, ‘abc’ and ‘xyz’, while ensuring that the effectively ‘wild-carded’ nature of ‘a’ in a hash big enough to cover ‘abc’ and ‘xyz’ doesn’t ruin the hash table completely.

Conclusion

So, what’s so great about all this? From an integer-focused programmer’s perspective, ICL/SNC adds a huge collection of instructions that allow us – in many cases for the first time – to move bits and bytes around within SIMD registers in complex and useful ways. This radically expands the number of operations that can be done in SIMD – without branching and potentially without having to go out to memory for table accesses.

It’s my contention that this kind of SIMD programming is hugely important. There are plenty of ways to do ‘bulk data processing’ – on SIMD, on GPGPU, etc. This approach is the traditional “well, I had to multiply a huge vector by a huge matrix” type problem. Setup costs, latencies – all this stuff is less important if we can amortize over thousands of elements.

On the other hand, doing scrappy little bits of SIMD within otherwise scalar code can yield all sorts of speedups and new capabilities. The overhead of mixing in some SIMD tricks that do things that are extremely expensive in general purpose registers is very low on Intel. It’s time to get exploring (or will be, in July).

Conclusion Caveats

Plenty of things could go wrong, and this whole project is a long-term thing. Obviously we are a long way away from having SNC across the whole product range at Intel, much less across a significant proportion of the installed base. Some of the above instructions might be too expensive for the uses I’d like to put them to (see also: SSE4.2, which was never a good idea). The processors might clock down too much with AVX-512, although this seems to be less and less of a problem.

If you have some corrections, or interesting ideas about these instructions, hit me up in the comments! If you just want to complain about Intel (or, for that matter, anyone or anything else), I can suggest some alternative venues.

Question: Is matching fixed regexes with Back-references in P?

There is a persistent meme out there that matching regular expressions with back-references is NP-Hard. There is a post about this and the claim is repeated by Russ Cox so this is now part of received wisdom.

None of these claims are false; they just don’t apply to regular expression matching in the sense that most people would imagine (any more than, say, someone would claim, “colloquially” that summing a list of N integers is O(N^2) since it’s quite possible that each integer might be N bits long). It depends on the generally unfamiliar notion that the regular expression being matched might be arbitrarily varied to add more back-references.

These constructions rely on being able to add more things to the regular expression as the size of the problem that’s being reduced to ‘regex matching with back-references’ gets bigger.

Suppose, instead, as per more common practice, we are considering the difficulty of matching a fixed regular expression with one or more back-references against an input of size N.

Is this task in P? That is, is there a polynomial-time algorithm in the size of the input that will tell us whether this back-reference-containing regular expression matched?

Note that back-references in a regular expression don’t “lock” – so the pattern /((\wx)\2)z/ will match “axaxbxbxz” (EDIT: sorry, I originally fat-fingered this example). So, sadly, we can’t just enumerate all start and end positions of every back-reference (say there are k backreferences) to get a bad but polynomial-time algorithm (this would be O(N^(2k)) runs of our algorithm without back-references, so if we had an O(N) algorithm we could solve it in O(N^(2k+1))). This construction doesn’t work because the capturing parentheses that the back-references refer to keep updating as the match proceeds, and so there can be numerous instances of them.

Note that even a lousy algorithm for establishing that this is possible suffices. So if there’s a construction that shows that we can match regular expressions with k backreferences in O(N^(100k^2+10000)) we’d still be in P, even if the algorithm is rubbish. I’ve read (I forget the source) that, informally, a lousy poly-time algorithm can often be improved, but an exponential-time algorithm is intractable. So knowing that this problem was in P would be helpful.

So I’m curious – are there any either (a) results showing that fixed regex matching with back-references is also NP-hard, or (b) results, possibly the construction of a dreadfully naive algorithm, showing that it can be polynomial?

Fitting My Head Through The ARM Holes or: Two Sequences to Substitute for the Missing PMOVMSKB Instruction on ARM NEON


(the author hard at work)

In the last post, I talked about some first impressions of programming SIMD for the ARM architecture. Since then, I’ve gotten simdjson working on our ARM box – a 3.3 GHz eMag from Ampere Computing.

I will post some very preliminary performance results for that shortly, but I imagine that will turn into a giant festival of misinterpretation (take your pick: “Intel’s lead in server space is doomed, says ex-Intel Principal Engineer” or “ARM NEON is DOA and will never work”) and fanboy opinions, so I’m going to stick to implementation details for now.

I had two major complaints in my last post. One was that SIMD on ARM is still stuck at 128-bit. As of now, there does not seem to be a clever way to work around this…

The other complaint, a little more tractable, was the absence of the old Intel SIMD standby, the PMOVMSKB instruction. This SSE instruction takes the high bit from every byte and packs it into the low 16 bits of a general purpose register. There is also an AVX2 version of it, called VPMOVMSKB, that sets 32 bits in similar fashion.

Naturally, given the popularity of this instruction, AVX512 does something different and does all this – and much more – via ‘mask’ registers instead, but that’s a topic for another post.

At any rate, we want this operation a fair bit in simdjson (and more generally). We often have the results of a compare operation sitting in a SIMD register and would like to reduce it down to a concise form.

In fact, what we want for simdjson is not PMOVMSKB – we want 4 PMOVMSKB’s in a row and want the results to be put in a single 64-bit register. This is actually good news – the code to do this on ARM is much cheaper (amortized) if you have 4 of these to do and 1 destination register.

So, here’s how to do it. For the purposes of this discussion assume we have already filled each lane of each input vector with 0xff or 0x00. Strictly speaking the below sequences aren’t exactly analogous to PMOVMSKB as they don’t just pick out the high bit.

The Simple Variant

This requires 4 logical operations (to mask off the unwanted bits), 4 paired-add instructions (it is 3 steps to go from 512->256->128->64 bits, but the first stage requires two separate 256->128 bit operations, so 4 total), and an extraction operation (to get the final result into our general purpose register).

Here’s the basic flow using ‘a’ for the result bit for our first lane of the vector, ‘b’ for the second, etc. We start with (all vectors written left to right from least significant bit to most significant bit, with “/” to delimit bytes):

a a a a a a a a / b b b b b b b b / c c c c c c c c / ...

Masking gets us:

(512 bits across 4 regs)
a 0 0 0 0 0 0 0 / 0 b 0 0 0 0 0 0 / 0 0 c 0 0 0 0 0 / ...

Those 3 stages of paired-add operations (the 4 vpaddq_u8 intrinsics) yield:

(256 bits across 2 regs)
a b 0 0 0 0 0 0 / 0 0 c d 0 0 0 0 / 0 0 0 0 e f 0 0 / ...

(128 bits across 1 reg)
a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 / ...

(64 bits across 1 reg; top half is a repeated 'dummy' copy)
a b c d e f g h / i j k l m n o p / q r s t u v w x / ...

… and then all we need to do is extract the first 64-bit lane. Note that doing this operation over a single 128-bit value to extract a 16-bit mask would not be anything like 1/4 as cheap as this – we would still require 3 vpaddq operations (we could use the slightly cheaper 64-bit forms for the second and third versions).
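In NEON intrinsics, the simple variant looks roughly like this (a sketch in the spirit of the simdjson code, with illustrative names; the inputs are the four compare-result vectors):

#include <arm_neon.h>
#include <stdint.h>

static inline uint64_t neon_movemask_bulk(uint8x16_t p0, uint8x16_t p1,
                                          uint8x16_t p2, uint8x16_t p3) {
    // One distinct bit per byte position within each 8-byte group.
    const uint8x16_t bit_mask = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
    uint8x16_t t0 = vandq_u8(p0, bit_mask);
    uint8x16_t t1 = vandq_u8(p1, bit_mask);
    uint8x16_t t2 = vandq_u8(p2, bit_mask);
    uint8x16_t t3 = vandq_u8(p3, bit_mask);
    // Three stages of paired adds (four vpaddq_u8 in total): 512 -> 256 -> 128 -> 64 bits.
    uint8x16_t sum0 = vpaddq_u8(t0, t1);
    uint8x16_t sum1 = vpaddq_u8(t2, t3);
    sum0 = vpaddq_u8(sum0, sum1);
    sum0 = vpaddq_u8(sum0, sum0);
    // Extract the low 64-bit lane (the high lane is the repeated 'dummy' copy).
    return vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
}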

It is possible to combine the results in fewer instructions with a mask and a 16-bit ADDV instruction (which adds results horizontally inside a SIMD register). This instruction, however, seems quite expensive, and I cannot think of a way to extract the predicate results in their original order without extra instructions.

The Interleaved Variant

However, there’s a faster, and intriguing way to do this, that isn’t really analogous to anything you can do on the Intel SIMD side of the fence.

Let’s step back a moment. In simdjson, we want to calculate a bunch of different predicates on single-byte values – at different stages, we want to know if they are backslashes, or quotes, or illegal values inside a string (under 0x20), or whether they are in various character classes. All these things can be calculated byte by byte. So it doesn’t really matter what order we operate on our bytes, just as long as we can get our 64-bit mask back out cheaply in the original order.

So we can use an oddity (from an Intel programmer’s perspective) of ARM – the N-way load instructions. In this case, we use LD4 to load 4 vector registers – so 512 bits – in a single hit. Instead of loading these registers consecutively, the 0th, 4th, 8th, … bytes are packed into register 0, the 1st, 5th, 9th, … bytes are packed into register 1, etc.

In simdjson, we can operate on these bytes normally as before. It doesn’t matter what order they are in when we’re doing compares or looking up shuffle tables to test membership of character classes. However, at the end of it all, we need to reverse the effect of the way that LD4 has interleaved our results.

Here’s the ‘interleaved’ version:

We start with compare results, with result bits designated as a, b, c, etc (where a is a 1-bit if the first byte-wise vector result was true, else a 0-bit).

4 registers, interleaved
a a a a a a a a / e e e e e e e e / i i i i i i i i / ...
b b b b b b b b / f f f f f f f f / j j j j j j j j / ...
c c c c c c c c / g g g g g g g g / k k k k k k k k / ...
d d d d d d d d / h h h h h h h h / l l l l l l l l / ...

We can start the process of deinterleaving and reducing by AND’ing our compare results for the first register by the repeated bit pattern { 0x01, 0x10 }, the second register by {0x02, 0x20}, the third by { 0x04, 0x40} and the fourth by { 0x08, 0x80 }. Nominally, this would look like this:

4 registers, interleaved, masked
a 0 0 0 0 0 0 0 / 0 0 0 0 e 0 0 0 / i 0 0 0 0 0 0 0 / ...
0 b 0 0 0 0 0 0 / 0 0 0 0 0 f 0 0 / 0 j 0 0 0 0 0 0 / ...
0 0 c 0 0 0 0 0 / 0 0 0 0 0 0 g 0 / 0 0 k 0 0 0 0 0 / ...
0 0 0 d 0 0 0 0 / 0 0 0 0 0 0 0 h / 0 0 0 l 0 0 0 0 / ...

In practice, while we have to AND off the very first register, the second and subsequent registers can be combined and masked with the first register using one of ARM’s “bit select” operations, which allows us to combine and mask the registers in one operation (so the above picture never really exists in the registers). So with 4 operations (1 AND and 3 bit selects) we now have our desired 64 bits lined up in a single accumulator register, in a slightly awkward fashion.

If our result bits are designated as abcdef… etc., our accumulator register now holds (reading from LSB to MSB):

a b c d 0 0 0 0 / 0 0 0 0 e f g h / i j k l 0 0 0 0 ...

We can then use the paired-add routine to combine and narrow these bytes, yielding a 64-bit result.

So we need 4 logical operations, 1 paired add, and an extract. This is strictly cheaper than our original sequence. The LD4 operation (it is, after all, an instruction that loads 512 bits in one go) is also cheaper than 4 128-bit vector loads.
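Here’s a sketch of the interleaved variant (again with illustrative names; it assumes v0..v3 came from an LD4-style load and that we’re on a little-endian AArch64 target, so the 16-bit constants expand to the byte pairs described above):

#include <arm_neon.h>
#include <stdint.h>

static inline uint64_t neon_movemask_interleaved(uint8x16_t v0, uint8x16_t v1,
                                                 uint8x16_t v2, uint8x16_t v3) {
    // Repeated byte pairs {0x01,0x10}, {0x02,0x20}, {0x04,0x40}, {0x08,0x80}.
    const uint8x16_t mask0 = vreinterpretq_u8_u16(vdupq_n_u16(0x1001));
    const uint8x16_t mask1 = vreinterpretq_u8_u16(vdupq_n_u16(0x2002));
    const uint8x16_t mask2 = vreinterpretq_u8_u16(vdupq_n_u16(0x4004));
    const uint8x16_t mask3 = vreinterpretq_u8_u16(vdupq_n_u16(0x8008));
    // AND off the first register, then fold in the others with bit selects: BSL takes
    // bits from the second operand where the mask is set, else keeps the accumulator.
    uint8x16_t acc = vandq_u8(v0, mask0);
    acc = vbslq_u8(mask1, v1, acc);
    acc = vbslq_u8(mask2, v2, acc);
    acc = vbslq_u8(mask3, v3, acc);
    // One stage of paired adds glues the half-filled byte pairs into whole bytes.
    acc = vpaddq_u8(acc, acc);
    return vgetq_lane_u64(vreinterpretq_u64_u8(acc), 0);
}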

The use of this transformation allows us to go from spending 4.23 cycles per byte in simdjson’s “stage1” to 3.86 cycles per byte, an almost 10% improvement. I haven’t quantified how much benefit we get from using LD4 versus how much benefit we get from this cheaper PMOVMSKB sequence.

Conclusion

There is a substitute for PMOVMSKB, especially at larger scale (it would still be painfully slow if you only needed a single 16-bit PMOVMSKB in comparison to the Intel operation). It’s a little faster to use the “interleaved” variant, if interleaving the bytes can be tolerated.

On a machine with 128-bit operations, requiring just 6 operations to do the equivalent of 4 PMOVMSKBs isn’t all that bad – notably, if this was an SSE-based Intel machine, the 4 PMOVMSKB operations would need to be followed by 3 shift and 3 OR operations to be glued together into one register. Realistically, though, Intel has had 256-bit integer operations since Haswell (2013), so the comparison should really be against 2 VPMOVMSKB ops followed by 1 shift+or combination – or, if you really want to be mean to ARM, a single operation to the AVX-512 mask registers followed by a move to the GPRs.

Still, it’s better than I thought it would be…

I think I thought up these tricks, but given that I’ve been coding on ARM for a matter of days, I’m sure there’s plenty of people doing this or similar. Please leave alternate implementations or pointers to earlier versions of my implementation in the comments; I’m happy to give credit where it is due.

Side note: though it is a latent possibility in simdjson, the “interleaved” variant actually allows us to combine our results for adjacent values more cheaply than we would be able to if we had our input in non-interleaved fashion.

If we were evaluating a pair of predicates that are adjacent in our data, and wanted to do this combination in the SIMD side (as opposed to moving the results to the GPR side and calculating things there – in Hyperscan, we had occasion to do things both ways, depending on specifics), we can combine our results for bytes 0, 4, 8, 12 … with our results for bytes 1, 5, 9, 13 … with a simple AND operation. It is only for the combination of bytes 3, 7, 11, 15 with the subsequent bytes 4, 8, 12, 16 that we need to do a comparatively expensive vector shift operation (EXT in ARM terms, or PALIGNR or PSLLDQ for Intel folks).

In a way, the cheapness of the ‘vertical’ combinations in this PMOVMSKB substitute hints at this capability: adjacent bytes are easier to combine, except across 32-bit boundaries (not an issue for the code in this post).

This would be a key consideration if porting a Hyperscan matcher such as “Teddy” to ARM. I might build up a demo of this for subsequent posts.

An Intel Programmer Jumps Over the Wall: First Impressions of ARM SIMD Programming

(the pictured dish is apparently materials for “Buddha Jumps Over The Wall”, named for its ability to seduce away vegetarians – sadly it uses shark fin so has some ethical issues…)

[ UPDATE: I have, at least partly, dealt with the lack of PMOVMSKB and written a new post about it ]

I’ve done a lot of SIMD coding. However, aside from dabbling with a bit of 32-bit ARM coding during the early Hyperscan era (back before the Intel acquisition of Sensory Networks), it’s been nearly all Intel SIMD – SSE2 through to AVX512.

Recently I’ve been working on an ARM port of simdjson, our fast (Gigabytes/second) SIMD parser. I’ll be uploading preliminary ARM SIMD code for this soon. While the experience is fresh in my mind, I thought I’d write up some first impressions of ARM AArch64 SIMD programming (good, bad and just plain ugly).

The Good

First of all, orthogonality. It’s really nice to program with a SIMD instruction set where one (usually) doesn’t have to wonder whether there will be an operation for a given data size. Every time I went looking for an operation on bytes, I found it (this is by contrast to Intel SIMD programming, where a whole raft of operations don’t exist for bytes and some don’t exist for 16-bit “word” quantities either).

[ many of these missing byte operations will finally appear with SNC, assuming GFNI is fast enough; the catchily named GF2P8AFFINEQB will allow arbitrary bit permutes, thus including rotates and shifts – see Intel® Architecture Instruction Set Extensions and Future Features Programming Reference for details ]

Orthogonality of these operations is a huge relief to the programmer – it’s less work to commit things to memory, and fewer “what the hell” moments later when you realize something that should exist doesn’t. For example, when doing the “bits to indexes” work my original paper notes on the code happily had an operation that didn’t exist (masked 512-bit OR using a byte-granularity kreg).

Second, multiple-table permutes: TBL and TBX can take multiple registers – up to 4 – as inputs. Thus, you can permute over up to 512 bits. This is a leap-frogging race here – with VBMI and Cannonlake, Intel will allow 2 512-bit registers to be used in a VPERMI2B or VPERMT2B. More on latencies later (I would like to praise these ARM SIMD operations more but, despite many happy claims about how fast these ops are in some architectures – e.g. A12 – I can’t find any documentation).

Note for Intel people – TBL/TBX yield “zero” or “no operation” on out of range indexes, which is a contrast to PSHUFB (with its odd behavior governed by the high bit) or VPERM*, where only the low-order bits affect what the permute does. This seems to be a neutral change; sometimes the PSHUFB behavior is annoying, sometimes it’s useful.
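As a small example of why the multi-register forms are handy, a 64-byte table lookup is a single TBL once the table is in registers (a sketch; names are illustrative):

#include <arm_neon.h>
#include <stdint.h>

uint8x16_t lookup64(uint8x16_t indices, const uint8_t table[64]) {
    uint8x16x4_t t;
    t.val[0] = vld1q_u8(table);
    t.val[1] = vld1q_u8(table + 16);
    t.val[2] = vld1q_u8(table + 32);
    t.val[3] = vld1q_u8(table + 48);
    // Out-of-range indices (>= 64) come back as zero, per the TBL behavior noted above.
    return vqtbl4q_u8(t, indices);
}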

Third, horizontal operations and pairwise operations. This is something that exists spottily on Intel SIMD, but ARM allows a wide variety of operations to be either applied across the whole vector or done pairwise. ADD and MAX/MIN are pretty handy in both contexts.

Fourth, multiple vector interleaved load and store. This is pretty awesome, and the latency/throughput numbers aren’t too bad for at least A75.

Some elements of ARM SIMD coding are pleasant but no longer a significant advantage. The comprehensive range of compare operations is now matched by AVX512. Overall, it’s still a very pleasant instruction set to work with.

The Bad

There is no equivalent of PMOVMSKB on ARM. People have been complaining about this for years. I have a couple not-too-terrible workarounds for this – especially if one has lots of these operations to do at once (e.g. 4×128 bulk PMOVMSKB equivalent to a 64-bit register) which will be the topic of a blog post in the near future. There is at least a decent version involving masking and a series of paired-add operations. So this can be worked around.

It’s 2019 and we’re still doing 128-bit SIMD. I’m excited to see Scalable Vector Extensions (SVE), but… it was announced late 2016 and the only place you can use this extension is a Fujitsu supercomputer? The absence of SVE from the Neoverse announcement was worrying: here is a processor that seemed like a logical fit for SVE, shipping almost 4 years after SVE was announced, without it. ARM really needs to announce a roadmap for SVE. Is anyone going to support it?

The Ugly

Documentation. OK, so we have tons of different ARM variants out there supporting AArch64. Why do none of them – aside from ARM itself – publish tables of instruction latency and throughput? Plenty of people complain about Intel’s documentation, but the Software Optimization Guide coupled to all the supplementary information (Agner Fog, uops.info) is a wonderful source of information by comparison.

ARM apparently has made real strides in openness – I can get a lot of information off their site without registering or being force-signed-up for marketing material (there are still some regrettable exceptions to this – there’s a Neon programmers guide that forces you to sign up for marketing emails, then turns out to be old…). However, most of the other vendors (Apple, Marvell, Ampere) seem to provide zero information to the public (there might be a super-secret NDA version?). This is depressing: you guys do understand that it helps you to have people write good code for your architecture, right?

Also, holes in the tables of throughput and latency are never welcome, no matter which vendor you are. I’d like to wax more enthusiastic about TBL and TBX but even ARM’s data on this has holes (no throughput data).

Conclusion

All up it’s been fairly pleasant to port simdjson to ARM. Missing operations are counter-balanced by a lot of nice new tricks. I’d say the biggest source of pain at this stage is how scarce information is on non-ARM implementations of the instruction set.

If anyone has better software optimization resources for ARM (or if there’s a secretive ARM equivalent of Agner Fog or uops.info lurking out there), please comment or send me an email.

Code Fragment: Finding quote pairs with carry-less multiply (PCLMULQDQ)


Well, if that doesn’t put off some readers, I don’t know what will…

A technique that we recently disclosed for our JSON parser (see the paper or recent blog post) is something I think I invented (please set me straight if this is known since the glory days of HAKMEM, or whatever).

We want to find all the space between quotes – so if we have an input like:

abc xxx "foobar" zzz "a"

We can easily have a bitvector:

000000001000000100000101

… corresponding to the quote positions (reading left-to-right from input bytes to bits here).

However, we really want to know that the strings “foobar” and “a” are inside the quotes and nothing else is. What we really want is this:

00000000.111111.00000.1.

… where the 1 bits appear over the things that are inside our quotes – I use ‘.’ to indicate that we don’t really care what the results are for the quote characters themselves (the technique that I detail later will set the opening quote to 1 in this mask and the closing quote to 0).

So how do we find this property?

One solution is to calculate the parallel prefix over XOR of our bits. So given bits b0, b1, b2, b3 our result is b0, b0^b1, b0^b1^b2, b0^b1^b2^b3, and so on. This gives a ‘parity’ result: a bit is on if the count of on-bits up to and including its position (reading from the LSB side) is odd, otherwise it’s off.

A nice way to calculate this is to do a carryless multiply by an all-ones (or two’s-complement -1) value. This is provided in Intel’s PCLMULQDQ instruction (part of the CLMUL extension, which is pretty specific!). This instruction is a weird one – it takes two 64-bit values (selected out of a pair of 128-bit registers by using immediates to select which 64-bit value to get within each register) and produces a 128-bit result. The low-order bits are what we need here (oddly, the high-order bits after multiplying by -1 in this case work out as the parallel prefix over XOR from the opposite direction, a property that I haven’t thought of a good use for).

Here’s the code:
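(The embedded snippet doesn’t render here; what follows is a hedged reconstruction from the description below, so treat it as a sketch rather than the verbatim original.)

#include <immintrin.h>
#include <stdint.h>

// quote_bits: output of cmp_mask_against_input() for the '"' character.
// odd_ends: quotes escaped by a preceding backslash, from the previous stage.
// prev_iter_inside_quote: all-ones if the previous 64-byte block ended inside a string.
static inline uint64_t find_quote_mask(uint64_t quote_bits, uint64_t odd_ends,
                                       uint64_t *prev_iter_inside_quote) {
    quote_bits &= ~odd_ends; // drop escaped quotes
    // Parallel prefix XOR over the quote bits: carry-less multiply by an all-ones value.
    uint64_t quote_mask = (uint64_t)_mm_cvtsi128_si64(_mm_clmulepi64_si128(
        _mm_set_epi64x(0LL, (long long)quote_bits), _mm_set1_epi8((char)0xFF), 0));
    quote_mask ^= *prev_iter_inside_quote; // carry in the previous iteration's quoting state
    // Sign-extend the top bit: all-ones if this block ends inside a quote, else all-zeroes.
    *prev_iter_inside_quote = (uint64_t)((int64_t)quote_mask >> 63);
    return quote_mask;
}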

In the above code, “odd ends” comes from a previous stage and represents quotes that need to be ‘escaped’ as they were preceded by an unescaped backslash, and “cmp_mask_against_input” is just the standard SIMD comparison against a character that we do to find a ‘"’ character (the usual PCMPEQB->PMOVMSKB dance).

The only real weirdness in this code, above, is that we need to factor in our previous quoting status from the previous iteration. So if we ended inside a quote, the “prev_iter_inside_quote” value is all 1s – this allows us to invert the sense of our quote mask.

This exotic instruction (PCLMULQDQ) is pretty slow in terms of latency – on a modern Skylake processor it’s still latency 6. So, in simdjson, we needed to rearrange our code a bit to hide that latency and find some other work to do. But it’s still quite fast overall as long as you don’t follow it with anything that depends on its result (the reciprocal throughput of this instruction is 1 on Skylake).

If you read the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference (pdf) (and who doesn’t?), you’ll note that this instruction is going to get extended on Icelake – it will be turned into a true SIMD instruction (with 4 carryless multiplies over a 512-bit register) rather than one that performs one operation and really just uses SIMD for operand size.

It’s a little tidbit, but I liked it. If you’re working over a pretty large chunk of data, with a lot of little strings in it, your alternative is to process one quote pair at a time, which will be quite slow by comparison. And, as per the name of the blog, the technique is branch free.

Let me know in the comments if you’ve found any other fun uses for carry-less multiply or have any war stories.

Paper: Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs

I’m pleased to report that Hyperscan, the regular expression matcher that ate my life from 2008 through 2018, finally has a paper (pdf) – it’s being presented this week at NSDI ’19.

Anyone who lacks a taste for my discursive bullshit could just stop reading this blog post there, and just go read the paper…. Bye!

It’s a good paper and it works as a standalone read. I’d like to call out a few things in it as particularly interesting and worth reviewing. I’d also like to call out a few things in Hyperscan that we couldn’t write about for reasons of space – we were hard up against the space limits for any conceivable conference paper, so there were whole subsystems and implementation strategies that we couldn’t write about. I will also mention a few aspects of Hyperscan historically that didn’t make it into the Hyperscan 4.x release series (the first open source release of Hyperscan and a substantial API change) nor the Hyperscan 5.x release series (a major release number that largely celebrates the now Australian-free aspect of Hyperscan).

Hyperscan Summary

It’s a software based, large-scale regex matcher designed to match multiple patterns at once (up to tens of thousands of patterns at once) and to ‘stream‘ (that is, match patterns across many different ‘stream writes’ without holding on to all the data you’ve ever seen). To my knowledge this makes it unique.

RE2 is software based but doesn’t scale to large numbers of patterns; nor does it stream (although it could). It occupies a fundamentally different niche to Hyperscan; we compared the performance of RE2::Set (the RE2 multiple pattern interface) to Hyperscan a while back.

Most back-tracking matchers (such as libpcre) are one pattern at a time and are inherently incapable of streaming, due to their requirement to backtrack into arbitrary amounts of old input.

Hyperscan was the product of a ‘pivot’ for a startup called Sensory Networks that was founded in 2003 to do regex on FPGA. I joined in 2006 and spent some time doing regex on GPGPU, before we decided that implementing regular expressions on GPCPU was a better idea. We worked in a much diminished company from 2009 (5 staff members!) until our acquisition by Intel in October 2013.

Future Thingies

Hyperscan and RE2 continue in their niches. There are a number of other projects out there – even a few hardware regular expression matchers struggle on – but I’ll have to get to other projects in another post.

It’s possible that something much simpler than Hyperscan could get a considerable portion of the benefit of Hyperscan, while being a more tractable code base for adaptation, porting to other CPU architectures or exotic architectures (GPGPU, for example), alternate use cases like bioinformatics, etc. I have some ideas in this area under the dubious code name “Ultimate Regex Engine the Third” (ure3) – Hyperscan itself was, originally “Ultimate Engine the Second”. If anyone is interested, get in contact.

Pattern Matching Techniques Roundup

Hyperscan Things That Made It Into The Paper

Pattern Decomposition (Section 3)

A key idea of Hyperscan is that if we have a complex regular expression – say:

42:/<\s*object[^>]*data\s*\x3A[^,>]*base64/smi

We can observe that there are a lot of parts of this pattern that are simply literal strings: “object”, “data”, “base64”. It’s a lot easier and cheaper to match literals rather than do a full regular expression implementation, so breaking up this pattern (an easy one, as it is simply ‘one thing after another’) into several components allows us to remove the strings from the finite automata matching task.

A rudimentary version of this is ‘pre-filtering’ – we could just make sure we saw all 3 strings, then run the pattern (although this is tricky in streaming mode – more details at some other time). But this still means we’re matching all the characters in those three strings in an automata. Given that our methods for matching NFAs are quite sensitive to the number of states, it’s better to have smaller automata.

String Matching: “FDR” (Section 4.1)

FDR is one of a long list of string matching engines that I’ve built. It’s been fairly effective, although it gets a bit cranky at large string counts. See the paper for details, as there’s only so many times I can explain what a “bucketed SIMD shift-or matcher with supercharacters” is.

The FDR matcher is used for moderately large string counts – 80+ strings – and uses SIMD not for scanning many characters at once, but for holding a much larger set of simple state machines than could be done with a general purpose register.

This is worth noting given the debates about vector machines – not every use of SIMD is oriented towards applying simple operations over great long vectors of data. FDR is a bucketed shift-or matcher that works better at 128 bits (8 buckets, 16 wide) than it would at 64 bits – there’s no obviously urgent extension to 256, 512 or for that matter, 2048 bits.

Freud said, sometimes a cigar is just a cigar; similarly, sometimes a SIMD register is just a SIMD register.

Glushkov NFA implementation: “LimEx” (Section 4.2)

The Glushkov construction is a way of getting an  ε-free NFA (“epsilon free”); that is, an NFA that does not make transitions from state to state without processing a character. This makes it considerably easier to build state machines with simple bit vector operations (although the successor function, showing which states can follow other states, is not necessarily trivial).

The LimEx matcher is an evolution of several earlier matchers. The “Limited NFA” could only allow states to jump to 0,1 or 2 states ahead (generalized to a larger number, eventually, but still not allowing irregular transformations). The “General NFA” used multiple partitions to allow arbitrary jumps after the Method of Four Russians. The appallingly stupidly-named “LimGen” NFA combined the two, gathering only states requiring a ‘non-limited’ transition using a PSHUFB instruction (yes, that one again, fans), and doing a table lookup.

All of these exotic techniques fell away to the simple strategy: do ‘limited’ transitions (a certain number of fixed-distance jumps only) then work through all the remaining states and do “Exceptions” (the “Ex” in “LimEx”). This covers quite a bit of plumbing for accept states, triggering and reading of bounded repeats, etc. It’s not pretty, but it works.

Hyperscan Things That Didn’t Make It Into The Paper

The following sections are not an exhaustive list; there are many more optimizations and transformations that don’t make it in. A few are below:

  • Graph transformations: redundancy, unreachable components, etc. – we built an abstract NFA graph (also a Glushkov NFA, but tailored towards representing our workload, not a direct implementation) and apply transformations. We can detect redundant work in an NFA and eliminate it, expand out literal paths, etc.
  • Bounded Repeat Analysis and implementation: handling of single-character bounded repeats: e.g. /foo[^\n]{1024}bar/ is quite challenging. We ‘discover’ such repeats in our NFA graph, and have specialized code to implement this class of repeats.
  • Other string matchers: we have a small-group literal matcher called “Teddy” (that uses SIMD; to this day, the best explication of Teddy is in the Rust reimplementation of it, even if they could benefit from using the rest of my compiler algorithm; my logic to merge multiple strings into a bucket is at least mildly sophisticated).
  • DFA implementation: Alex Coyte, one of our engineers (now at Google) always said “DFAs are awesome”. They are; they are faster and simpler than NFAs. We used them quite a bit (when we could, and when they made sense). We had a pretty straightforward implementation of DFAs without too many exotic components, but there are some tricks to making DFAs smaller.
  • NFA and DFA acceleration: when it’s possible to check that an automata, whether NFA or DFA, can’t leave its local neighborhood of current states without seeing easily detected ‘event’, it’s possible to use SIMD to scan for that event.
    For example, the pattern fragment /[^\n]*foo/ can scan for a newline, which breaks out of the \n state, or an ‘f’, which may commence the start of a ‘foo’.
    This extends even to NFAs, where we can take the closure of all possibly active ‘accelerable’ states and scan ahead for events.
  • Specialized engines, like “Sheng“, and many others. The ‘Castle’ engine, for example, efficiently handles many bounded repeats at once, improving overall performance but also radically reducing the amount of state we have to store.
  • Lookaround assertions: we use a technique similar to my SMH matching engine: upon receipt of a literal match, we augment the power of the literal match by verifying that the characters around the match will match at the regex level. So if we extracted the string ‘foobar’ from a regex (to use a modified version of our earlier example)

    ‘/<\s*object[^>]*\s\dfoobar\d[^,>]*base64/’

    we can implement the character class lookarounds to check that we’ve seen ‘\s\dfoobar\d’ before running a full-fledged NFA matcher for the other parts of the pattern.

  • … and many others

Hyperscan Things That Didn’t Make It Into Open Source Hyperscan

  • Ports to non-x86 platforms. Hyperscan had ports to MIPS, ARM, PowerPC and Tilera at one stage. These ports were a lot of hassle to maintain, and I don’t think it takes a genius to see how these ports were not particularly strategic for Intel to devote resources to.
  • Capturing subexpressions: at one stage we had an experimental product that (in block mode only, not streaming) could accurately model libpcre’s capturing semantics. This worked by scanning the data backwards with a backwards version of the pattern, making a trace of the states visited, and tracing forwards through the backward state trace in order to correctly model the behavior that libpcre would have done. This was possible since when we’re tracing forwards through the backwards trace and following 1-bits in our states, we know they will lead to an accept state.

    Even more fun was implementing a version of this that checkpointed our backwards state traces periodically, allowing us to store O(sqrt(N)) states in our state trace, at the expense of having to scan twice as much (a generalization would have been to allow us to store O(k*pow(N, 1/k)) states with a cost of scanning k times as much).

    This had the feel of one of those tricky algorithm interview questions… overall, capturing support was fun to work on, but no-one wanted it commercially. We probably wouldn’t do it this way today.


Acknowledgements

We have already mentioned the pioneering work of Victor Glushkov. His work was introduced to us at Sensory Networks by the excellent textbook Flexible Pattern Matching in Strings by Gonzalo Navarro and Mathieu Raffinot. Professor Navarro has also written many excellent papers over the years, so if you can’t afford the book, do check out his publications page. His output is by no means confined to regular expressions, but his treatment of NFA implementations is very good.

A great deal of hard work went into Hyperscan over the years. Version 1.0 was the initiative of Peter Duthie, whose startup Ground Labs is going strong. Alex Coyte, Matt Barr and Justin Viiret worked all the way through from the early days at Sensory Networks through to the end of the Sydney-based Intel DCG Lab. I’d also like to thank Xiang Wang, Yang Hong, Harry Chang, Jiayu Hu and Heqing Zhu from Intel; KyoungSoo Park from KAIST for their work on the paper, as well as the Hyperscan team in general, past and present.

Paper: Parsing Gigabytes of JSON per Second

Daniel Lemire and I have spent some time this year working on a fast JSON parser. You can read about it in paper form at arXiv here and the repo is here.
In this blog post, I’ll provide an informal summary of the paper and some background as to the thinking behind the system.

What it is

simdjson is a parser from JSON input to an immutable tree form (a “DOM” in XML terminology). It’s not a full-featured library; it doesn’t allow you to modify the JSON document and write it back out.
simdjson is a full parser, though – it doesn’t just scan through bits of the JSON looking for particular fields. That’s an extension which we have thought about, but we don’t really have good benchmarks for it.
simdjson validates its input. Being generous about accepting malformed input is a rabbit hole – how do you define the semantics of handling a malformed document?
Here’s our performance, against the closest C/C++ competitors we measured, on a range of inputs on an Intel Skylake processor (an i7-6700 running at 3.4 GHz in 64-bit mode) – for full performance details, read the paper!
(chart: parsing throughput in GB/s, simdjson vs. competing parsers, across the test inputs)
It’s pretty fast – we seem to be 2-3x faster than the next most reasonable alternative (the primary sources of variability are string and number handling – different JSON documents have different amounts of this kind of work to do). We’re probably competitive with Mison (one of our major inspirations) but we don’t have source code so we can’t tell – and we’re doing a lot more work than Mison. Specifically, Mison isn’t validating and doesn’t build a parse tree.
We provide everything required to repeat our experiments, so if you want to verify our results or test them under different assumptions or against different workloads or parsers, it should be very easy to do that.

Why it is fast

The chief point of difference between simdjson and systems such as RapidJSON and sajson lies in the “Stage 1” of simdjson; the part of the system that detects the locations of structural and pseudo-structural characters from the input. This system operates with large scale and regular operations over 64 bytes of the input at once, and can take advantage of both SIMD operations (when examining characters and character classes) as well as 64-bit bitwise operations (when applying transformations on the masks obtained from the SIMD operations). As such, it can achieve “economies of scale” as compared to a step-by-step approach.
For example, an input that consists of a section comprising two key/value pairs:
“result_type”: “recent”,
“iso_language_code”: “ja”
… enters and leaves the condition of ‘being in a string’ 4 times in under 64 characters. A loop which involves step-by-step scanning of our input would typically not benefit from SIMD due to the relatively short strings, and will need to perform relatively slow operations to detect the next string delimiter. By contrast, our stage 1 code would detect all string begin and end markers with a fixed-cost set of operations (albeit an admittedly complex set). Further, our code can do this without data-dependent conditional branches, which will bring our pipeline to a crashing halt on the first mispredict.

How it works

Just like the old Soviet-era joke about ‘mat’ (the amazing sub-dialect of Russian swearing) usage by workers in the missile factory, we have two stages, and one goes into the other.
There used to be four stages, and this may show up in a few places in old issues and comments.
Stage 1 uses SIMD to attempt to discover the significant parts of the document that need to be fed into the parser, which is Stage 2. The job of Stage 1 is not necessarily to validate what happens. Instead, the job of Stage 1 is to expose what Stage 2 needs to see – so we need to find the set of structural characters such as [, ], {, }, comma, colon, etc. but we also need to find potential starting characters for atoms such as “true”, “false”, “null” and the numbers. Further, we need to find the starts of errors – so if whitespace or one of these other structural characters is followed by some erroneous string, we need to expose that string to stage 2 to reveal an error.
These subsidiary not-quite-structural characters we refer to as “pseudo-structural characters”: things falling after whitespace or other structural characters, that are not safely tucked inside strings. There’s a fair bit of bit-bashing to find them, but it is all quite mechanical.
To get all the way through stage 1 is a fairly mechanical straight-line process. It is almost entirely branch-free, with a couple exceptions (UTF-8 checking and handling of large numbers of structural characters at the end of stage 1). Stage 1 proceeds as follows:
  1. Validation of UTF-8 across the whole input.
  2. Detection of odd-length sequences of backslashes (that will result in escaping the subsequent character)
  3. Detection of which characters are “inside” quotes, by filtering out escaped quote characters from the previous step, then doing a parallel prefix sum over XOR (using the PCLMULQDQ instruction with an argument of -1 as the multiplier) to turn our escaped quote mask into a mask showing which bits are between a pair of quotes.
  4. Detection of structural characters and whitespace via table-based lookup (implemented with a pair of VPSHUFB instructions).
  5. Detection of pseudo-structural characters (those characters I talked about in the summary of stage 1 that need to be exposed to the subsequent stage 2 for error detection and atom handling).
  6. Conversion of the bitmask containing structural and pseudo-structural characters into a series of indexes.
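As a sketch of step 6 (the real code is unrolled to write several indexes per iteration and stay branch-light; this is just the core idiom, using the GCC/Clang count-trailing-zeros builtin):

#include <stdint.h>

// Flatten a 64-bit structural-character mask into byte indexes, 'idx' being the offset
// of this 64-byte block within the document. Returns the advanced write pointer.
static inline uint32_t *flatten_bits(uint32_t *out, uint32_t idx, uint64_t bits) {
    while (bits != 0) {
        *out++ = idx + (uint32_t)__builtin_ctzll(bits); // position of the lowest set bit
        bits &= bits - 1;                               // clear that bit and continue
    }
    return out;
}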
Stage 2 is considerably less clean and branch free. It operates as a goto-based automaton with a stack and validates that the sequence of structural and pseudo-structural characters that has been passed in corresponds to valid JSON. It also handles atom, number and string validation and, as it goes, constructs our ‘tape’ representation (a navigable, if immutable, tree of JSON elements).
If there’s enough interest, I may do a detailed code walk-through. It’s my belief that many of the steps are novel, or at least fairly novel; to my knowledge, I invented the use of PCLMULQDQ to balance quote pairs for this work and the PSHUFB-based table lookup was also my invention while I worked on the Hyperscan project (which continues, at Intel, unburdened of its awkward load of Australians). However, it would not surprise me to find that many of the techniques were independently invented somewhere else: we found that we had independently invented the techniques used in the remarkable icgrep and the technique that I was so proud of in “Sheng” had been invented by Microsoft before. So maybe one day I’ll invent something of my own for real…

Why it exists – what’s the idea here?

We were curious about how far parsing can be done with SIMD instructions. Stage 1 represents the current limit for us – Stage 2 is only sporadically SIMD and is extremely branchy. So we got as far as we could.
It’s worth stepping back and talking about the original 4-stage design. Initially the bits->indexes transformation occupied a separate stage, but why were there stages 3 and 4? Initially, stage 3 was branch free. It operated over the indexes, but through a series of supposedly clever conditional moves and awkward design, it was able to process and validate almost everything about the input without ever taking a branch (for example, it used a DFA to validate most of the local correctness of the sequence of structural and pseudo-structural characters). It also built a very awkward predecessor of our ‘tape’ structure, with just enough information to traverse our tapes. It then deferred all awkward ‘branchy’ work to a stage 4 (handling of validation of individual atoms, ensuring that {} and [] pairs matched, number and string validation, and cleanup of the awkwardness of the tape structure allowing a more conventional traversal of the tapes).
Fun fact: stage 3 was originally called the “ape_machine” (being a combination of a state_machine and a tape_machine), while stage 4 was the “shovel_machine”, which followed the ape_machine and cleaned up its messes.
Those who have worked with me before may recognize the penchant for cutesey code-names; simdjson doesn’t really have any, now that the ape and shovel machines are put out to pasture. We couldn’t even think of a particularly good name (sadly “Euphemus”, helmsman for Jason and the Argonauts, has already been taken by another project with a similar general task area, and none of the other Argonauts have names that are nearly as amusing).
While stage 3 wasn’t SIMD, it was branch-free and thus avoided the awkwardness of branchy, conditional-heavy processing. Thus, in theory, on a modern architecture, it could run very fast, overlapping the handling of one index with the handling of the next index. The problem is that it pushes a bunch of awkward, branchy work into stage 4.
The sad truth is: sooner or later, you have to take the conditional branches. So pushing them into a late stage didn’t solve the problem. If there is a certain amount of awkward conditional work to do, eventually one must do it – this is the software equivalent of the psychoanalytic “return of the repressed“.

Thus, our new “stage 2” is the pragmatic alternative. If we’re going to take unpredictable branches, we may as well do all the work there.

I’m still on the hunt for more parallel and/or branch-free ways of handling the work done in stage 2. Some aspects of what we are doing could be done in parallel (computation of depth and/or final tape location could likely be done with parallel prefix sum) or at least fast (we could go back to a fast automata to handle much of the global validation or the local validation of things like numbers and strings). But the current system is pretty fast, so the parallel system would have to work really well to be competitive.

It’s still worth thinking about this for more modern or upcoming architectures (AVX 512, ARM’s SVE) or unusual architectures (the massive parallelism of GPGPU). This would be a good research project (see below).

What next? (and/or “I’m interested, how can I help?”)

There are a few interesting projects in different directions. Similar tricks to what we’ve done could be used for other formats (notably CSV parsing, which should be doable with something analogous to our stage 1). We could also try to extend these techniques more generally – it’s our hope that a more systematic version of our SIMD tricks could be picked up and incorporated into a more general parser.
There’s quite a bit of software engineering still to do with the JSON parser if someone wants to use it as a generally usable library. It’s a few steps beyond the typical caricature of an academic project (“it worked once on my graduate student’s laptop”) but it isn’t really battle-hardened.
The structure of simdjson is not particularly oriented towards reuse. A helpful set of transformations would be to break it into smaller, reusable components without compromising performance. Particularly egregious is the use of many hard-coded tables for the VPSHUFB-based character matcher; not only does this hard-code the particular characters and character classes, it cannot be reused as-is in a number of situations (e.g. overly numerous character classes, or ones where a desired character class includes a character with the high bit set).
The aforementioned work with retargeting this work to AVX512, ARM NEON, ARM SVE or GPGPU (CUDA or OpenCL) would be interesting as an exercise. These are varying degrees of difficulty.
Note that there are some opportunities to engage in SIMD on a more massive scale – we could use SIMD not just for character class detection but for many of the intermediate steps. So we would process, on AVX512, 512 characters at once, then do our various manipulations on 512-bit vectors instead of 64-bit words. There are some nice possibilities here (including a fun sequence to calculate XOR-based parallel prefix over 512 bits rather than 64; a nice challenge that I have worked out on paper but not in practice). We also have examined bits->indexes for AVX512 as well. However, this may fall into the category of the drunk man “looking for their keys under the lamppost not because he dropped them there, but because the light is better”. It is Stage 2 that needs to be made parallel and/or regular, not Stage 1!
An ambitious person could attack Stage 2 and make it considerably more parallel. I confess to having failed here, but a patient soul may benefit from my copious notes and half-baked ideas. If you are serious, get in touch.

Acknowledgements

Daniel Lemire provided most of the initial impetus towards this project, put up with a ludicrous churn of ideas and half-formed implementations from me, handled many of the tasks that I was too fidgety to do properly (number handling, UTF-8 validation) and did nearly all the real software engineering present in this project. Without his ceaseless work, this code would be a few half-formed ideas floating around in a notebook.
The original Mison paper (authors: Yinan Li, Nikos R. Katsipoulakis, Badrish Chandramouli, Jonathan Goldstein, Donald Kossmann) provided inspiration for the overall structure and data-flow of the earlier parts of ‘stage 1’ – our implementations are different but the early steps of our Stage 1 follow Mison very closely in structure.
The vectorized UTF-8 validation was motivated by a blog  post by O. Goffart. K. Willets helped design the current vectorized UTF-8 validation. In particular, he provided the algorithm and code to check that sequences of  two, three and four non-ASCII bytes match the leading byte. The authors are grateful to W. Mula for sharing related number-parsing code online.

Post-script: Frequently Unanswered Questions

  • How much faster could this run if the JSON was in some weird format e.g. “one record per line”?
  • Why do people use so much JSON, anyhow? Why don’t they use a more efficient format?
  • Why not write this in <insert favorite language X here>? Wouldn’t a rewrite in X be just inherently better/faster/safer?