Why Ice Lake is Important (a bit-basher’s perspective)

With Computex, there’s been a ton of news about Ice Lake (hereafter ICL) and the Sunny Cove core (SNC). Wikichip, Extremetech and Anandtech among many others have coverage (and there will be a lot more), so I won’t rehash this.

I also don’t want to embark on a long discussion about 14nm vs 10nm, Intel’s problems with clock speeds, etc. Much of this is covered elsewhere both well and badly. I’d rather focus on something I find interesting: a summary of what SNC is bringing us as programmers, assuming we’re programmers who like using SIMD (I’ve dabbled a bit).

I’m also going to take it as a given that Cannonlake (CNL) didn’t really happen (for all practical purposes). A few NUCs got into the hands of smart people who did some helpful measurements of VBMI, but from the perspective of the mass market, the transition is really going to be from AVX2 to “what SNC can do”.

While I’m strewing provisos around, I will also admit that I’m not very interested in deep learning, floating point or crypto (not that there’s anything wrong with that; there are just plenty of other people willing to analyze how many FMAs or crypto ops/second new processors have).

Also interesting: SNC is a wider machine with some new ports and new capabilities on existing ports. I don’t know enough about this to comment, and what I do remember from my time at Intel isn’t public, so I’m only going to talk about things that are on the public record. In any case, I can’t predict the implications of this sort of change without getting to play with a machine.

So, leaving all that aside, here are the interesting bits:

  • ICL will add AVX-512 capability to a consumer chip for the first time. Before ICL, you couldn’t buy a consumer-grade computer that ran AVX-512 – there were various Xeon D chips you could buy to get AVX-512, or the obscure CNL NUC or laptop (maybe), but there wasn’t a mainstream development platform for this.
    • So we’re going straight from AVX2, which is an incremental SSE update that didn’t really extend SSE across to 256 bits in a general way (looking at VPSHUFB and VPALIGNR makes it starkly clear that, from the perspective of a bit/byte-basher, AVX2 is 2x128b) – all the way to AVX-512.
    • AVX-512 capability isn’t just an extension of AVX2 across more bits. The way that nearly all operations can be controlled by ‘mask’ registers allows us to do all sorts of cool things that previously would have required mental gymnastics to do in SIMD. The short version of the story is that AVX-512 introduces 8 mask registers that allow most operations to be conditionally controlled by whether the corresponding bit in the mask register is on or off (including the ability to suppress loads and stores that might otherwise introduce faults) – see the sketch just after this list.
  • Further, the AVX-512 capabilities in ICL are a big update on what’s there in Skylake Server (SKX), the mainstream server platform that would be the main way most people would have encountered AVX-512. When you get SNC, you’re getting not just SKX-type AVX-512, you’re getting a host of interesting add-ons. These include, but are not limited to (I told you I don’t care about floating point, neural nets or crypto):
    • VBMI
    • VBMI2
    • VPOPCNTDQ
    • BITALG
    • GFNI
    • VPCLMULQDQ
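
To make the mask idea concrete, here’s a minimal sketch (AVX-512F assumed; intrinsics as in the Intel intrinsics guide, the function name is mine) of a conditionally-controlled operation:

    #include <immintrin.h>

    // Add 1 only in the dword lanes of 'v' selected by mask 'k'; the
    // other lanes pass through from 'v' unchanged. The same mask style
    // can gate loads and stores, suppressing faults in masked-off lanes.
    static inline __m512i add_one_where(__m512i v, __mmask16 k) {
        return _mm512_mask_add_epi32(v, k, v, _mm512_set1_epi32(1));
    }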

So, what’s so good about all this? Well, armed with a few handy manuals (the Extensions Reference manual and Volume 2 of the Intel Architecture Manual set), let’s dig in.

VBMI

This one is the only extension that we’ve seen before – it’s in Cannonlake. VBMI adds dynamic (not just fixed-pattern) byte shuffles alongside the existing word (16-bit), “doubleword” (Intel-ese for 32 bits) and “quadword” (64-bit) ones. All the other granularities of shuffle up to a 512-bit size are there in AVX-512, but bytes don’t make it until AVX512_VBMI.

Not only that, the shuffles can have 2-register forms, so you can pull in values over a 2x512b pair of registers. So in addition to VPERMB we add VPERMT2B and VPERMI2B (the 2-register source shuffles have 2 variants depending on what thing gets overwritten by the shuffle result).

This is significant both for the ‘traditional’ use of shuffles (suppose you have a bunch of bytes you want to rearrange before processing) and for table lookup. If you treat the shuffle instructions as ‘table lookups’, the byte operations in VBMI allow you to look up 64 different bytes at once out of a 64-byte table, or 64 different bytes at once out of a 128-byte table in the 2-register form.
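
As a sketch of the table-lookup view (AVX512_VBMI assumed; intrinsic names from the Intel intrinsics guide, function names mine):

    #include <immintrin.h>

    // 64 parallel lookups into a 64-byte table: the low 6 bits of each
    // byte of 'idx' select a byte of 'table'. This is VPERMB.
    static inline __m512i lookup64(__m512i table, __m512i idx) {
        return _mm512_permutexvar_epi8(idx, table);
    }

    // 64 parallel lookups into a 128-byte table held in two registers:
    // bit 6 of each index byte picks 'lo' vs 'hi'. This is VPERMT2B.
    static inline __m512i lookup128(__m512i lo, __m512i hi, __m512i idx) {
        return _mm512_permutex2var_epi8(lo, idx, hi);
    }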

The throughput and latency costs on CNL shuffles are pretty good, too, so I expect these operations to be fairly cheap on SNC (I don’t know yet).

VBMI also adds VPMULTISHIFTQB – this one works on 64-bit lanes and allows unaligned selection of any 8-bit field from the corresponding 64-bit lane. A pretty handy one for people pulling things out of packed columnar databases, where someone might have packed annoyingly sized values (say, 5 or 6 bits) into a dense representation.
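
For instance, here’s a minimal sketch (VBMI assumed; constants and name are mine) of unpacking eight 6-bit fields from the low 48 bits of each 64-bit lane:

    #include <immintrin.h>

    // Byte i of each control qword gives the bit offset to read from:
    // 0, 6, 12, ..., 42. VPMULTISHIFTQB fetches an (unaligned) 8-bit
    // field per control byte; the AND then trims each to 6 bits.
    static inline __m512i unpack_6bit_fields(__m512i packed) {
        const __m512i ctrl = _mm512_set1_epi64(0x2A241E18120C0600ULL);
        __m512i fields = _mm512_multishift_epi64_epi8(ctrl, packed);
        return _mm512_and_si512(fields, _mm512_set1_epi8(0x3F));
    }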

VBMI2

VBMI2 extends the VPCOMPRESS/VPEXPAND instructions to byte and word granularity. This allows a mask register to be used to extract (or deposit) only the appropriate data elements either out of (or into) another SIMD register or memory.

VPCOMPRESSB pretty much ‘dynamites the trout stream’ for the transformation of bits to indexes, ruining all the cleverness (?) I perpetrated here with AVX512 and BMI2 (not the vector kind, the PDEP/PEXT kind).
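
A minimal sketch of that bits-to-indexes transformation under VBMI2 (an identity byte vector compressed by the match mask; names are mine):

    #include <immintrin.h>
    #include <stdint.h>

    // Turn a 64-bit mask of matching positions into a list of byte
    // indexes, lowest position first; returns the number of matches.
    // Note this writes a full 64 bytes to 'out'; only the first
    // 'count' of them are meaningful.
    static inline int mask_to_indexes(uint64_t mask, uint8_t *out) {
        const __m512i iota = _mm512_set_epi8(
            63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48,
            47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
            31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
            15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
        _mm512_storeu_si512(out, _mm512_maskz_compress_epi8(mask, iota));
        return (int)_mm_popcnt_u64(mask);
    }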

VBMI2 also adds a new collection of instructions: VPSHLD, VPSHLDV, VPSHRD, VPSHRDV. These instructions allow left/right logical double shifts, either by a fixed amount or a variable amount (thus the 4 variants), across 2 SIMD registers at once. So, for either 16, 32 or 64-bit granularity, we can effectively concatenate a pair of corresponding elements, shift them (variably or not, left or right) and extract the result. This is a handy building block and would have been nice to have while building Hyperscan – we spent a lot of time working around the fact that it’s hard to move bits around a SIMD register (one of these double-shifts, plus a coarse-grained shuffle, would allow bit-shuffles across SIMD registers).
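
A sketch of the fixed-amount variant (VBMI2 assumed; the name is mine):

    #include <immintrin.h>

    // For each 64-bit lane: conceptually concatenate hi:lo into 128
    // bits, shift left by 5 and keep the upper 64 bits - so bits vacate
    // 'hi' at the top and flow in from 'lo' at the bottom. This is
    // VPSHLDQ; the VPSHLDVQ form takes a per-lane variable count.
    static inline __m512i shld_by_5(__m512i hi, __m512i lo) {
        return _mm512_shldi_epi64(hi, lo, 5);
    }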

VPOPCNTDQ/BITALG

I’m grouping these together: VPOPCNTDQ is older (from the MIC product line), but the BITALG capabilities – which arrive together with VPOPCNTDQ for everyone barring the Knights* nerds – nicely round it out.

VPOPCNT does what it says on the tin: a bitwise population count for everything from bytes and words (BITALG) up to doublewords and quadwords (VPOPCNTDQ). We like counting bits. Yay.
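
In intrinsic terms (BITALG for the byte form, VPOPCNTDQ for the quadword one; function names are mine):

    #include <immintrin.h>

    // Per-byte population count over a whole 512-bit register (BITALG;
    // a word-granularity form also exists).
    static inline __m512i popcnt_bytes(__m512i v) {
        return _mm512_popcnt_epi8(v);
    }

    // Per-quadword population count (VPOPCNTDQ; a doubleword form too).
    static inline __m512i popcnt_qwords(__m512i v) {
        return _mm512_popcnt_epi64(v);
    }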

VPSHUFBITQMB, also introduced with BITALG, is a lot like VPMULTISHIFTQB, except that it extracts 8 single bits from each 64-bit lane and deposits them in a mask register.
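
A sketch (BITALG assumed; the name is mine):

    #include <immintrin.h>

    // The low 6 bits of each control byte pick one bit out of the
    // corresponding 64-bit lane of 'v'; the 64 selected bits land in a
    // mask register. This is VPSHUFBITQMB.
    static inline __mmask64 gather_bits(__m512i v, __m512i ctrl) {
        return _mm512_bitshuffle_epi64_mask(v, ctrl);
    }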

GFNI

OK, I said I didn’t care about crypto. I don’t! However, even a lowly bit-basher can get something out of these ones. If I’ve mangled the details, or am missing some obvious better ways of presenting this, let me know – when dealing with these instructions, picture a monkey with a stick and you’ll have an accurate presentation of what I look like.

GF2P8AFFINEINVQB I’ll pass over in silence; I think it’s for the real crypto folks, not a monkey with a stick like me.

GF2P8AFFINEQB, on the other hand, is likely awesome. It takes each 8-bit value and ‘matrix multiplies’ it, in a carryless multiply sense, with an 8×8 bit matrix held in the same 64-bit lane as the 8-bit value came from.

This can do some nice stuff. Notably, it can permute bits within each byte or, speaking more generally, replace each bit with an arbitrary XOR of any subset of bits from the source byte. So if you wanted to replace (b0, b1, .. b7) with (b7^b6, b6^b5, … b0^b0) you could. Trivially, of course, this also gets you 8-bit shift and rotate (not operations that exist on Intel SIMD otherwise). This use of the instruction effectively assumes the 64-bit value is ‘fixed’ and our 8-bit values are coming from an unknown input.
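
As one concrete, well-known instance: reversing the bits within every byte is a single affine transform. A sketch, assuming GFNI (the matrix constant maps bit i to bit 7-i; the name is mine):

    #include <immintrin.h>

    // Reverse the bits of each byte of 'v'. Each 64-bit lane of the
    // second operand is an 8x8 bit matrix; the final immediate is a
    // constant byte XORed into every result byte.
    static inline __m512i bitreverse_bytes(__m512i v) {
        const __m512i m = _mm512_set1_epi64(0x8040201008040201ULL);
        return _mm512_gf2p8affine_epi64_epi8(v, m, 0);
    }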

One could also view GF2P8AFFINEQB as something where the 8-bit values are ‘fixed’ and the 64-bit values are unknown – this would allow the user to, say, extract bits 0,8,16,… of a 64-bit value and put them in byte 0, bits 1,9,17,… in byte 1, etc. – thus doing an 8×8 bit matrix transpose of our 64-bit values.

I don’t have too much useful stuff for GF2P8MULB outside crypto, but it is worth noting that there aren’t that many cheap byte->byte transformations that can be done over 8 bits that aren’t straightforward arithmetic or logic (add, and, or, etc) – notably no lanewise multiplies. So this might come in handy, in a kind of monkey-with-a-stick fashion.

VPCLMULQDQ

OK, I rhapsodized already about a use of carry-less multiply to find quote pairs.

So the addition of a vectorized version of this instruction – VPCLMULQDQ – that allows us not just to use SIMD registers to hold the results of a 64b x 64b->128b multiply, but to carry out up to 4 of them at once, could be straightforwardly handy.
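
A sketch of the vector form (VPCLMULQDQ assumed; names are mine). The immediate picks which qword of each 128-bit lane to use from each source, so 0x00 multiplies the low qwords:

    #include <immintrin.h>

    // Four independent 64b x 64b -> 128b carryless multiplies, one per
    // 128-bit lane.
    static inline __m512i clmul_lo_x4(__m512i a, __m512i b) {
        return _mm512_clmulepi64_epi128(a, b, 0x00);
    }

    // The quote-pair trick: carryless-multiplying a bitmask by all-ones
    // leaves the prefix-XOR of the mask in the low 64 bits of each lane
    // (bits set between pairs of 1s) - computed per-lane here.
    static inline __m512i prefix_xor_x4(__m512i quote_bits) {
        return _mm512_clmulepi64_epi128(quote_bits,
                                        _mm512_set1_epi64(-1), 0x00);
    }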

Longer term, carryless multiply works as a good substitute for some uses of PEXT. While it would be nice to have a vectorized PEXT/PDEP (make it happen, Intel!), it is possible to get a poor man’s version of PEXT via AND + PCLMULQDQ – we can’t get a clean bitwise extract, but we can get a 1:1 pattern of extracted bits by carefully choosing our carryless multiply multiplier. This is probably worth a separate blog post.

I have a few nice string matching algorithms that rest on PEXT and take advantage of PEXT as a useful ‘hash function’ (not really a hash function, of course). The advantage in the string matching world of using PEXT and not a hash function is the ability to ‘hash’ simultaneously over ‘a’, ‘abc’ and ‘xyz’, while ensuring that the effectively ‘wild-carded’ nature of ‘a’ in a hash big enough to cover ‘abc’ and ‘xyz’ doesn’t ruin the hash table completely.

Conclusion

So, what’s so great about all this? From an integer-focused programmer’s perspective, ICL/SNC adds a huge collection of instructions that allow us – in many cases for the first time – to move bits and bytes around within SIMD registers in complex and useful ways. This radically expands the number of operations that can be done in SIMD – without branching and potentially without having to go out to memory for table accesses.

It’s my contention that this kind of SIMD programming is hugely important. There are plenty of ways to do ‘bulk data processing’ – on SIMD, on GPGPU, etc. – for the traditional “well, I had to multiply a huge vector by a huge matrix” type of problem. Setup costs, latencies – all this stuff matters less if we can amortize over thousands of elements.

On the other hand, doing scrappy little bits of SIMD within otherwise scalar code can yield all sorts of speedups and new capabilities. The overhead of mixing in some SIMD tricks that do things that are extremely expensive in general purpose registers is very low on Intel. It’s time to get exploring (or will be, in July).

Conclusion Caveats

Plenty of things could go wrong, and this whole project is a long-term thing. Obviously we are a long way away from having SNC across the whole product range at Intel, much less across a significant proportion of the installed base. Some of the above instructions might be too expensive for the uses I’d like to put them to (see also: SSE4.2, which was never a good idea). The processors might clock down too much with AVX-512, although this seems to be less and less of a problem.

If you have some corrections, or interesting ideas about these instructions, hit me up in the comments! If you just want to complain about Intel (or, for that matter, anyone or anything else), I can suggest some alternative venues.

12 thoughts on “Why Ice Lake is Important (a bit-basher’s perspective)”

  1. VNNI actually might have uses outside of neural networks, for example, if you want to do horizontal summing of bytes/words. There’s PSADBW for bytes, but if you need to accumulate results, it requires an extra add.

    GF2P8AFFINEQB may also be useful (assuming the instruction is fast enough) for broadcasting an immediate byte if, for whatever reason, you don’t want to touch memory. E.g. to broadcast the byte 0x42: PXOR xmm0, xmm0; GF2P8AFFINEQB xmm0, xmm0, 0x42
    GFNI can be useful for some non-crypto data coding stuff like Reed Solomon and maybe checksumming (Wiki lists some uses: https://en.wikipedia.org/wiki/Finite_field_arithmetic), but these are all fairly application specific.

    With AVX512-VL, you can always use these new instructions at 128/256-bit widths, avoiding AVX512 speed penalties (which, if you’re only doing a little bit of SIMD here and there, may not be worth incurring).
    Personally, AVX512’s lack of compares (e.g. PCMPEQ*) giving a vector result has been annoying. Mask manipulation is rather limited and quite slow on Skylake-X (one port only, crossing bits has a 4 cycle latency), and moving them back to a vector incurs additional penalty. If using AVX512-VL though, fortunately, you can stick to the AVX2 compares.
    Sunny Cove also adds a second 256-bit shuffle port, which should reduce the penalty of using 256-bit SIMD relative to 512-bit.

    Complete speculation on my part, but I have a feeling that AMD’s Zen2 won’t support AVX512, so we’ll have to see how much adoption it gets. If they surprise me and do include it, it’ll be interesting to see which AVX512 extensions they support.

  2. Agreed on the aggravation with PCMPEQ* always going to masks. This felt high-handed – there were plenty of useful things that you could do with the “0xff…ff vs 0x0 in a lane” compare style, including the neat trick of counting things by just vertically summing those results. Having everything go into masks wasn’t my favorite and it broke the way people used to write this code.

    Travis Downs has speculated that the second shuffle port is just a way of getting at half the 512-bit unit on port 5 when you’re not using 512b shuffles, but we have yet to see any details. We’ll have to wait until one of these machines comes out. Looking forward to getting my hands on one but also seeing Agner Fog get *his* hands on one.

    Good point about VNNI and horizontal summing. There probably are some bit-basher tricks possible there too.

  3. “With AVX512-VL, you can always use these new instructions at 128/256-bit widths, avoiding AVX512 speed penalties”

    AVX512-VL may be the killer feature.

    This being said…

    We will have to see whether downclocking remains as it is. On Cannon Lake, I have not been able to measure any downclocking (at all). On servers (Skylake X), you get downclocking easily. I am guessing that new chips will be more like Cannon Lake than like Skylake X. Not that I am predicting downclocking will go away… but I think it will be much less of a concern.

    1. Agreed.

      Historically this has been how things have gone. I’m sure a few years ago anti-AVX2 folks were doing wild end zone dances about how AVX2 was Way Too Expensive To Ever Use.

      VL’s other advantage is that one can get at the AVX512 features (masks, new instructions, etc) while issuing SIMD uops on 3 ports.

    2. The issue I have with downclocking is that different SKUs may act differently, which makes it hard to make general statements about performance if your code only uses a little bit of SIMD.

      I think the ‘AVX512 frequency’ on Skylake only kicks in with ‘AVX512 heavy’ instructions. Cannon Lake only has 1 port capable of executing ‘AVX512 heavy’ instructions (presumably) so may never trigger ‘AVX512 frequencies’, but probably will trigger ‘AVX2 frequencies’. Interesting that you don’t detect any downclocking though.
      Other thing to note is that, from memory, Core i3s don’t have turbo clocks enabled, so the no downclocking part could be difficult to determine if the chip isn’t even trying to boost at all.
      Desktop SKUs seem to throttle less than the server SKUs for AVX512, likely because the latter is more power/heat constrained. As Ice Lake won’t have desktop SKUs, I wouldn’t be surprised if AVX512 throttling shows its worst there.

      Will be interested to see if Zen2 throttles under AVX, since it presumably has 4x 256-bit ports.

      On another note, ICL’s successor, Tiger Lake, has only one more SIMD extension announced at this stage, so it probably won’t be as exciting as ICL. VP2INTERSECT does also seem to have some similarity with VPCONFLICT, which isn’t particularly fast on SKX/CNL…

      1. Oh, it looks like the i3 8121U does have turbo clocks; I think it’s the desktop i3s which don’t, so forget what I said there.

  4. It’s true that Sunny Cove seems appealing for the wider 8-bit shuffle and the additional port capable of performing shuffles. I have been implementing and experimenting with high-performance nearest neighbor search which relies heavily on shuffles, and we’ve obtained significant speedups over state-of-the-art implementations (Facebook’s FAISS). For this project, I am curious about the latencies of VPERMB on Cannonlake – do you have some references for these?

    The open-source implementation of our nearest neighbor search system is available at https://github.com/nlescoua/faiss-quickeradc with an under-review publication at https://arxiv.org/pdf/1812.09162.pdf and is based on our previous work Quick ADC published at ICMR’17.

    By the way, thanks for your blog and your work, which I discovered after Hyperscan’s publication at USENIX ATC. It’s very interesting and well documented. It’s nice to see performance evaluation of real implementations rather than weak comparisons of algorithms implemented in pure Python.

    1. Thanks for the kind words. I’m very fond of pure Python, but hardly for low-level performance code.

      According to https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0060663_CannonLake_InstLatX64.txt, VPERMB is L3 T1 for all sizes. A few people, including my collaborator on simdjson, Daniel Lemire, have purchased the notorious Cannonlake NUC – aside from some obscure China-market laptops, it is the only way to try out VBMI. Personally, I am waiting for Ice Lake.

      The paper looks interesting. I’m not familiar with the background so it would take a bit to digest, but I’m always interested in SIMD table lookup problems.

      As a not-very-important side note, “Parralelism” is misspelled in Table 1.

  5. “GF2P8AFFINEINVQB I’ll pass over in silence” I was already loving this blog (my first time stumbling across it), but did you really just throw in a reference to Wittgenstein’s *tractatus*? If so, five sigmas say best blog ever.

    1. Why, thank you. I did study the Tractatus in my misspent youth (I was a Philosophy/Computer Science major) so the turn of phrase I think is always lurking there. I will try not to start blog posts with “The world is everything that is the case” though.
