A (Draft) Taxonomy of SIMD Usage

I’ve used SIMD in many different ways over the years, and occasionally even written about them (on this blog, in papers, and on Twitter). For reference, I’d like to set down my taxonomy of SIMD usage so I don’t keep writing it up (poorly) on Twitter every few months. This post is a draft intended to elicit discussion: I’m aware that the taxonomy I present here is a bit woolly and amorphous.

SIMD (single-instruction, multiple-data) is a fundamental extension to most instruction sets. It gains considerable usefulness from the fact that instructions which can handle 256 or 512 bits worth of data at a time are not 4x or 8x as expensive to implement as instructions that work on 64 bits of data at once – while the execution units and registers are bigger, much of the complexity on an out-of-order architecture is there regardless of whether you’re working on an 8-bit value or a 512-bit one.

In my taxonomy, we use SIMD in 3 different ways:

Type 1: Horizontal (“Do lots of simple things at once”)

This is the most common form of SIMD processing. In this world, we have a relatively simple thing we want to do – for example, we might want to detect whether a particular character (or set of characters) appears somewhere in our input. We have a small task, and we want to repeat it, over and over again.

A single-character search done with SIMD operations is a canonical example of this, as are many of the conventional uses of SIMD and vector programming (e.g. running through a gigantic array and performing a vector or scalar multiply over it, a technique that has recently achieved a certain degree of importance).
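To make this concrete, here’s a minimal sketch of a single-character search in C with SSE2 intrinsics. This is mine, not from any particular library – `find_byte` is a hypothetical helper, the tail handling is omitted for brevity, and `__builtin_ctz` is a GCC/Clang-ism:

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Hypothetical sketch: return the index of the first occurrence of `c`
 * in `buf`, or -1. Assumes len is a multiple of 16 for brevity; a real
 * implementation would also handle the unaligned tail. */
static ptrdiff_t find_byte(const unsigned char *buf, size_t len, unsigned char c)
{
    __m128i needle = _mm_set1_epi8((char)c); /* broadcast c into all 16 lanes */
    for (size_t i = 0; i < len; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(buf + i));
        /* lanes equal to c become 0xFF; movemask gathers their high bits */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, needle));
        if (mask)
            return (ptrdiff_t)i + __builtin_ctz(mask); /* lowest set bit = first match */
    }
    return -1;
}
```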

A slightly muddier example (muddier since the table lookups we’re performing here themselves depend on a relatively large in-register structure – arguably loosely fitting into my other “Types”) is rapidly matching a set of characters, as described in http://0x80.pl/articles/simd-byte-lookup.html and https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/ (and appearing in many other papers and code bases).
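A rough sketch of the nibble trick those articles describe, under the assumption that the two 16-byte tables (`lo_tbl`, `hi_tbl`) have been precomputed from the character class so that a byte b is in the class iff lo_tbl[b & 0xF] and hi_tbl[b >> 4] share a set bit:

```c
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Sketch: classify 16 input bytes at once against a character class
 * encoded in two 16-entry nibble tables. A nonzero result lane means
 * "byte is in the class". Table construction is omitted here. */
static __m128i classify(__m128i input, __m128i lo_tbl, __m128i hi_tbl)
{
    __m128i lo_nib = _mm_and_si128(input, _mm_set1_epi8(0x0F));
    /* there is no 8-bit shift, so shift 16-bit lanes and re-mask to 4 bits */
    __m128i hi_nib = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F));
    __m128i lo_bits = _mm_shuffle_epi8(lo_tbl, lo_nib); /* in-register table lookup */
    __m128i hi_bits = _mm_shuffle_epi8(hi_tbl, hi_nib);
    return _mm_and_si128(lo_bits, hi_bits);
}
```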

Another (quite related) engine that uses this kind of SIMD parallelism is “Teddy”, used in Hyperscan to match smaller numbers of literals (from around 2 to 72 literals). We don’t use SIMD parallelism for matching very large numbers of literals, since we only have so much register space – eventually, a literal matcher that’s searching for tens of thousands of literals will need tables bigger than can fit in registers.

Type 2: Vertical (“Do a really big thing using our SIMD resources”)

A less common, but very useful, way of using SIMD is to take advantage of the fact that we can handle very large data structures in-register. For example, Hyperscan’s NFA engine can implement Glushkov NFAs with up to 512 states, using SIMD registers. See https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-multi-pattern-regex-matcher-for-modern-cpus/ and the Hyperscan paper for more details.

These process input 1 character at a time, but handle a more complex task. Frequently the SIMD usage in this case is large and comparatively irregular – we’re using SIMD to simulate having an enormous general-purpose register, not to perform a homogeneous task over a lot of similarly-sized data items.

The FDR literal matcher in the Hyperscan paper is another example of this – it reads 1 character at a time, and uses SIMD parallelism to implement 8 16-wide ‘shift-or’ buckets (using a 128b register).
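For reference, the ‘shift-or’ idea underneath FDR looks like this in its classic single-pattern, general-purpose-register form (this is the textbook algorithm, not FDR itself – FDR’s trick is packing 8 bucketed state machines into one 128b register):

```c
#include <stdint.h>
#include <string.h>

/* Textbook shift-or: bit i of `state` is 0 iff pat[0..i] matches a suffix
 * of the text seen so far. Assumes 1 <= strlen(pat) <= 64. */
static int shift_or_match(const char *pat, const char *text)
{
    size_t m = strlen(pat), n = strlen(text);
    uint64_t masks[256];
    memset(masks, 0xFF, sizeof(masks)); /* all-ones: "no position matches" */
    for (size_t i = 0; i < m; i++)
        masks[(unsigned char)pat[i]] &= ~(1ULL << i); /* clear bit i where pat[i] occurs */

    uint64_t state = ~0ULL;
    for (size_t i = 0; i < n; i++) {
        state = (state << 1) | masks[(unsigned char)text[i]];
        if (!(state & (1ULL << (m - 1)))) /* bit m-1 clear => whole pattern matched */
            return 1;
    }
    return 0;
}
```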

Type 3: Exotic (“Take advantage of functionality that’s only available as SIMD instructions”)

Less easy to characterize are SIMD-using subsystems where we make use of SIMD functionality not because we have a task we want to do over and over again – or a task that’s scaled to “SIMD size” – but because the task can use hardware functionality that’s only available on the SIMD units.

For example, in simdjson (https://github.com/simdjson/simdjson) we used the PCLMULQDQ instruction to match up quote pairs via a parallel-prefix XOR. It’s possible to imagine an alternate architecture with an integer-multiply-style instruction that takes 2 64-bit registers and yields a 128-bit result spread across two registers (in the fashion of a typical multiply operation like MULX on Intel Architecture). But that’s not the architecture we have: the functionality is there on the SIMD units.
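A minimal sketch of that trick, assuming we already have a 64-bit mask with a 1 at each quote position (a real implementation, like simdjson’s, also has to exclude escaped quotes and carry the in-quotes state from one block to the next):

```c
#include <immintrin.h> /* PCLMULQDQ: compile with -mpclmul */
#include <stdint.h>

/* Carry-less multiply of the quote mask by an all-ones operand computes
 * a prefix XOR: bit i of the result is the XOR of bits 0..i of the input,
 * i.e. 1 exactly between an odd quote and the following even quote. */
static uint64_t in_quotes(uint64_t quote_mask)
{
    __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i q = _mm_set_epi64x(0, (long long)quote_mask);
    __m128i prod = _mm_clmulepi64_si128(q, ones, 0); /* low 64b x low 64b */
    return (uint64_t)_mm_cvtsi128_si64(prod); /* keep the low 64 bits */
}
```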

Other exotic functionality on Intel Architecture includes:

  • Cryptographic instructions (AES New Instructions, SHA New Instructions, etc.)
  • Galois Field instructions like GF2P8AFFINEQB and GF2P8MULB (also, I guess, cryptographic instructions, but with considerably more general usages)
  • PCLMULQDQ (carry-less multiply) as mentioned above

There are also exotic instructions such as the permutes and shuffles (VPERMB, PSHUFB, …), the compress/expand instructions, VPMULTISHIFTQB, etc. We could implement the same functionality via “normal C code”, but the implementation would be essentially different in non-SIMD instructions: specifically, it would require branches and memory operations. Using tables held in SIMD registers, we can build and look up non-trivial data structures with no memory lookups, or with significantly fewer of them. Using SIMD techniques such as masking, we can have multiple paths through our computations without using branches. This isn’t quite as fundamental as getting an operation like GF2P8AFFINEQB or PCLMULQDQ in a single instruction, but there is almost a qualitative difference between code that can implement complicated functionality with few memory accesses and branches and regular C code.
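As a tiny illustration of the masking point – a branch-free per-lane select, computing a 32-bit absolute value (chosen only because it’s small; SSSE3 actually has PABSD for this exact job):

```c
#include <emmintrin.h> /* SSE2 */

/* Branch-free "two paths" via masking: per 32-bit lane, pick -x where
 * x < 0 and x otherwise. The compare yields an all-ones/all-zeros mask
 * per lane, which selects between the two precomputed candidates.
 * (Like PABSD, INT_MIN maps to itself.) */
static __m128i abs_epi32_sketch(__m128i x)
{
    __m128i neg  = _mm_sub_epi32(_mm_setzero_si128(), x);   /* candidate: -x */
    __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), x); /* all-ones where x < 0 */
    return _mm_or_si128(_mm_and_si128(mask, neg),           /* (mask & -x) | (~mask & x) */
                        _mm_andnot_si128(mask, x));
}
```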

A number of techniques that fall in this middle ground are somewhat possible on General Purpose Registers, but are considerably more limited in scope. For example, it’s possible to use conditional moves and other branch-free programming tricks on many architectures, and to use 32b or 64b registers as lookup tables. SWAR (“SIMD Within A Register”) tricks continue to be useful. These can even be applied in SIMD registers, which I suppose makes them “SWASR” – “SIMD Within A SIMD Register”. If you want to, say, handle 168 3-bit quantities in a single register, you might need SWASR!
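A classic SWAR example on a general-purpose register – testing whether any of the eight bytes of a 64-bit word is zero, with no per-byte loop or branch:

```c
#include <stdint.h>

/* SWAR: treat a 64-bit GPR as eight 8-bit lanes. The subtraction can only
 * carry a borrow into a byte's high bit if that byte was zero; the ~v mask
 * strips bytes whose high bit was already set in the input. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```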

Discussion

This taxonomy isn’t hard and fast – we might combine these techniques. For example, we might build a literal matching algorithm where we apply Hyperscan’s “FDR” matcher across 4 different blocks at once in an AVX-512 register – this would take a bit from Type 1 and a bit from Type 2. There’s also a certain muddiness to the categories when we’re, say, depending on the size of a register to do a big table lookup (Type 3?) across many values at once (Type 1).

We also combine different styles of coding in the same code base – so a Type 1 SIMD “acceleration” matcher might skip over large blocks of characters while driving a Type 2 NFA or DFA implementation.

An interesting point about these different uses of SIMD is that Type 2/3 usages frequently do not benefit in the same way from wider SIMD (e.g. going from 128b to 256b to 512b) – or from moving to a “vector” model as in ARM’s Scalable Vector Extension or RISC-V’s Vector Extensions. In these usages, we aren’t just doing the same thing over and over again. If there are 193 states in your Glushkov NFA, a 256b register suffices for a typical bit-parallel implementation, and you will not inherently run twice as fast on a 512b register. Nor will we receive any benefit from the ability to seamlessly scale to a 2048-bit underlying architecture. Our quote-balancing use of PCLMULQDQ in simdjson isn’t going to run faster by doing 4 lanes of PCLMULQDQ in parallel.

A similar argument might apply to the attitude that SIMD architectures are just undersized GPUs – frequently, especially in the Type 2/3 usages, we are intermixing SIMD code with single-threaded code. On this point, I must respectfully disagree with the viewpoint in https://www.sigarch.org/simd-instructions-considered-harmful/ – such an argument may be very strong for regular cases like the DAXPY example that is used heavily in that article (a Type 1 usage), but it is much weaker for the less regular Type 2/3 usages.

8 thoughts on “A (Draft) Taxonomy of SIMD Usage”

  1. Nice to see a new article here!

    Another usage could be the extra register space. This was particularly prominent on 32-bit x86, where you’ve got <=7 GPRs – throwing stuff onto MMX/SSE registers means less spilling and/or more ability to pipeline operations.
    This is perhaps less prevalent these days.

    Perhaps a bit of a mix with your 1 and 2, but if you need to do something, say, two times, SIMD can be useful. It's not "a lot", like in the traditional vector sense, but it's also more than one, justifying the use of SIMD.

    Extension to your 3: many x86 chips have single-cycle-latency SIMD instructions, so it can be viable to do even latency-sensitive scalar workloads on SIMD, and beneficial if it can save you an operation here or there (e.g. through VPTERNLOG). On the other hand, many ARM chips have a minimum two-cycle latency on NEON, which can make this sort of use case unviable.


    1. Interesting remark. The added space in the SIMD registers is itself helpful, like you say – although calling conventions mean it’s a bit of a crapshoot. A language with an aggressive SIMD-first calling convention would be interesting (whole structures in there?)

      Doing things “a small number of times” in SIMD: a danger is that it’s usually quicker to get to load/store (unless you can load/store the small-multiplicity items in one SIMD operation) and the flags (for conditional branches) via INT than VEC. The glory era of “parity” for SIMD in terms of port count is also gone (at the peak of the Sandy Bridge era, we had 3 ports for both INT and VEC, so doing 8b was no more or less expensive than doing 128b or 256b if you could manage AVX). Now, Lion Cove has 6 integer units vs 4 vector units, and leaving VEC off is a power saving (and potentially even a frequency saving, depending on your uarch). I would be very hesitant to go down the SIMD path without a more solid use case than “hey, I can do 3 ops at once for this stray calculation”.

      The single cycle latency is really good and I’ve heard people at Intel be very defensive of some cherished latency=1 operations. VPTERNLOG is also great, since there’s no equivalent on INT – that’s a good catch – kind of a quantitative difference but it can really add up. Doing ‘trees’ of AND/OR, for example, is much cheaper with ternaries.


      1. For the “extra register space” case, I was mostly referring to cases where you’re processing a lot of data, so calling convention overhead isn’t so much of an issue there. An example would be lookup heavy/constrained workloads – if you’ve got a 4 cycle L1 latency, with 2 loads per clock, you ideally want to sustain 8 in-flight independent loads, which is likely impossible with only <=7 GPRs.

        Is there any uArch where SSE runs at a lower frequency than integer? Power cost is harder to measure, and one could argue that having fewer instructions + race to sleep may offset the expense of using the vector units. I typically go for speed, due to difficulty in determining power.
        Of course, it’s all workload dependent, but I’ve gotten >10% speed gains from doing some scalar workloads on SIMD, even though I’m only using a single lane, due to saving some ops here and there. But if you need to do stuff like branching, flags etc, it’d probably hurt more.

        Using only two lanes on SIMD can sometimes yield a decent win, particularly on low throughput uArchs like Atom/Jaguar or Cortex A55, since you can be port constrained otherwise.


      2. With respect to VPTERNLOG, is there any impediment to implementing it in the INT pipeline?

        I read a lot of people mentioning that there is a preference for having all instructions that read only two registers and write only one for architectural reasons, but given that CMOV already exists for x64, that shouldn’t be a problem?


  2. I think “horizontal” should be split into two categories: shuffle (horizontal data movement) and reduction (horizontal arithmetic). These two have very different performance characteristics and use cases. Special cases include things like PCMPISTRM or ARM’s MATCH instruction, which perform horizontal arithmetic but produce a vector result.


    1. I’m quite nervous about having used “vertical” and “horizontal” at all, and might get rid of that whole concept, since it uncomfortably intersects with other usages of “horizontal”. I’d move something like PCMPISTRM into the “wacky special purpose” category.

      In any case, this was a draft to try to get started on this, not the final word. I’m leaning towards revamping the categories in a second version that’s more focused on something actionable (“Why should you use SIMD”) rather than just putting SIMD code into boxes for no apparent reason.


  3. Perhaps “vector” vs. “array” processor distinction for SIMD also makes sense, as well as relating or contrasting these to VLIW, SIMT & SPMD; cf. https://safari.ethz.ch/architecture/fall2022/lib/exe/fetch.php?media=onur-comparch-fall2022-lecture25-simd-processors-and-gpu-beforelecture.pdf

    • Array processor: Instruction operates on multiple data elements at the same time using different spaces (PEs)
    • Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space (PE)

