An Intel Programmer Jumps Over the Wall: First Impressions of ARM SIMD Programming

(the pictured dish is apparently materials for “Buddha Jumps Over The Wall”, named for its ability to seduce away vegetarians – sadly it uses shark fin so has some ethical issues…)

[ UPDATE: I have, at least partly, dealt with the lack of PMOVMSKB and written a new post about it ]

I’ve done a lot of SIMD coding. However, aside from dabbling with a bit of 32-bit ARM coding during the early Hyperscan era (back before the Intel acquisition of Sensory Networks), it’s been nearly all Intel SIMD – SSE2 through to AVX512.

Recently I’ve been working on an ARM port of simdjson, our fast (Gigabytes/second) SIMD parser. I’ll be uploading preliminary ARM SIMD code for this soon. While the experience is fresh in my mind, I thought I’d write up some first impressions of ARM AArch64 SIMD programming (good, bad and just plain ugly).

The Good

First of all, orthogonality. It’s really nice to program with a SIMD instruction set where one (usually) doesn’t have to wonder whether there will be an operation for a given data size. Every time I went looking for an operation on bytes, I found it (in contrast to Intel SIMD programming, where a whole raft of operations don’t exist for bytes and some don’t exist for 16-bit “word” quantities either).

[ many of these missing byte operations will finally appear with SNC, assuming GFNI is fast enough; the catchily named GF2P8AFFINEQB will allow arbitrary bit permutes, thus including rotates and shifts – see Intel® Architecture Instruction Set Extensions and Future Features Programming Reference for details ]
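To make that concrete, here’s a minimal sketch (my own illustration, not anything out of simdjson): byte-granularity shifts and multiplies are each a single NEON intrinsic, whereas SSE/AVX2 has no byte shifts or byte multiply and you end up synthesizing them from 16-bit operations.

```c
#include <arm_neon.h>

/* Per-byte shifts and multiplies exist directly on NEON; on SSE/AVX2 you
   would have to fake them with 16-bit ops plus masking. */
uint8x16_t swap_nibbles_and_scale(uint8x16_t v) {
    uint8x16_t hi = vshrq_n_u8(v, 4);           /* per-byte logical shift right */
    uint8x16_t lo = vshlq_n_u8(v, 4);           /* per-byte shift left */
    uint8x16_t swapped = vorrq_u8(hi, lo);      /* swap nibbles within each byte */
    return vmulq_u8(swapped, vdupq_n_u8(3));    /* per-byte multiply */
}
```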

Orthogonality of these operations is a huge relief to the programmer – it’s less work to commit things to memory, and fewer “what the hell” moments later when you realize something that should exist doesn’t. For example, when doing the “bits to indexes” work my original paper notes on the code happily had an operation that didn’t exist (masked 512-bit OR using a byte-granularity kreg).

Second, multiple-table permutes: TBL and TBX can take multiple registers – up to 4 – as inputs. Thus, you can permute over up to 512 bits. There’s a leap-frogging race here – with VBMI and Cannonlake, Intel will allow two 512-bit registers to be used in a VPERMI2B or VPERMT2B. More on latencies later (I would like to praise these ARM SIMD operations more but, despite many happy claims about how fast these ops are on some architectures – e.g. the A12 – I can’t find any documentation).

Note for Intel people – TBL/TBX yield “zero” or “no operation” (TBL and TBX respectively) on out-of-range indexes, in contrast to PSHUFB (with its odd behavior governed by the high bit) or VPERM*, where only the low-order bits affect what the permute does. This seems to be a neutral change; sometimes the PSHUFB behavior is annoying, sometimes it’s useful.
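A tiny sketch of the four-register form (my own example, not from any particular codebase) – one intrinsic covers a 64-byte table, and out-of-range indexes simply come back as zero:

```c
#include <arm_neon.h>

/* TBL over four source registers: each index byte selects one of 64 table
   bytes; any index >= 64 produces 0 rather than wrapping the way PSHUFB or
   VPERM* would. */
uint8x16_t lookup64(uint8x16x4_t table, uint8x16_t idx) {
    return vqtbl4q_u8(table, idx);
}
```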

Third, horizontal and pairwise operations. This is something that exists only spottily in Intel SIMD, but ARM allows a wide variety of operations to be either applied across the whole vector or done pairwise. ADD and MAX/MIN are pretty handy in both forms.
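For example (again a sketch of my own), the across-vector and pairwise forms are each a single intrinsic for bytes:

```c
#include <arm_neon.h>

/* Across-vector ("horizontal") reductions and pairwise ops on bytes. */
uint16_t byte_sum(uint8x16_t v) { return vaddlvq_u8(v); } /* UADDLV: widening sum of all 16 lanes */
uint8_t  byte_max(uint8x16_t v) { return vmaxvq_u8(v);  } /* UMAXV: max across all 16 lanes */
uint8x16_t pair_add(uint8x16_t a, uint8x16_t b) {
    return vpaddq_u8(a, b);  /* ADDP: sums of adjacent lanes of a, then of b */
}
```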

Fourth, multiple-vector interleaved load and store. This is pretty awesome, and the latency/throughput numbers aren’t too bad, at least on the A75.
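A typical use is de-interleaving channel data – here’s a hedged sketch (the function name and the multiple-of-16 assumption are mine) using LD3/ST3 on packed RGB:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* LD3/ST3 de-interleave and re-interleave three byte streams per instruction.
   Assumes n_pixels is a multiple of 16. */
void swap_red_blue(uint8_t *rgb, size_t n_pixels) {
    for (size_t i = 0; i < n_pixels; i += 16) {
        uint8x16x3_t px = vld3q_u8(rgb + 3 * i);  /* px.val[0]=R, [1]=G, [2]=B */
        uint8x16_t r = px.val[0];
        px.val[0] = px.val[2];                    /* swap R and B channels */
        px.val[2] = r;
        vst3q_u8(rgb + 3 * i, px);
    }
}
```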

Some elements of ARM SIMD coding are pleasant but no longer a significant advantage. The comprehensive range of compare operations is now matched by AVX512. Overall, it’s still a very pleasant instruction set to work with.

The Bad

There is no equivalent of PMOVMSKB on ARM. People have been complaining about this for years. I have a couple of not-too-terrible workarounds – especially if one has lots of these operations to do at once (e.g. a bulk 4×128-bit PMOVMSKB equivalent yielding a 64-bit register) – which will be the topic of a blog post in the near future. There is at least a decent version involving masking and a series of paired-add operations, so this can be worked around.
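Here’s the single-vector shape of that masking-and-paired-add idea, as a sketch under my own naming (the bulk 4×128 version amortizes the folding better and is what the follow-up post is about):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Collapse a 16-lane compare result (0xFF or 0x00 per lane) into a 16-bit
   mask, lane 0 -> bit 0. Mask then pairwise-add: three ADDP steps fold
   16 bytes down to 2. */
static inline uint16_t neon_movemask(uint8x16_t cmp) {
    const uint8_t bit_weights[16] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                      0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
    uint8x16_t m = vandq_u8(cmp, vld1q_u8(bit_weights));  /* one distinct bit per lane */
    m = vpaddq_u8(m, m);
    m = vpaddq_u8(m, m);
    m = vpaddq_u8(m, m);
    return vgetq_lane_u16(vreinterpretq_u16_u8(m), 0);    /* low byte = lanes 0-7, high = 8-15 */
}
```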

It’s 2019 and we’re still doing 128-bit SIMD. I’m excited to see the Scalable Vector Extension (SVE), but… it was announced in late 2016 and the only place you can use it is a Fujitsu supercomputer? The absence of SVE from the Neoverse announcement was worrying: here is a processor that seemed like a logical fit for SVE, yet it will ship almost four years after SVE was announced without it. ARM really needs to announce a roadmap for SVE. Is anyone going to support it?

The Ugly

Documentation. OK, so we have tons of different ARM variants out there supporting AArch64. Why do none of them – aside from ARM itself – publish tables of instruction latency and throughput? Plenty of people complain about Intel’s documentation, but the Software Optimization Guide coupled with all the supplementary information (Agner Fog, uops.info) is a wonderful source of information by comparison.

ARM apparently has made real strides in openness – I can get a lot of information off their site without registering or being force-signed-up for marketing material (there are still some regrettable exceptions to this – there’s a Neon programmer’s guide that forces you to sign up for marketing emails, then turns out to be old…). However, most of the other vendors (Apple, Marvell, Ampere) seem to provide zero information to the public (there might be a super-secret NDA version?). This is depressing: you guys do understand that it helps you to have people write good code for your architecture, right?

Also, holes in the tables of throughput and latency are never welcome, no matter which vendor you are. I’d like to wax more enthusiastic about TBL and TBX but even ARM’s data on this has holes (no throughput data).

Conclusion

All up, it’s been fairly pleasant to port simdjson to ARM. Missing operations are counter-balanced by a lot of nice new tricks. I’d say the biggest source of pain at this stage is how scarce information is on non-ARM implementations of the instruction set.

If anyone has better software optimization resources for ARM (or if there’s a secretive ARM equivalent of Agner Fog or uops.info lurking out there), please comment or send me an email.

19 thoughts on “An Intel Programmer Jumps Over the Wall: First Impressions of ARM SIMD Programming”

  1. Have you looked at the LLVM machine models?
    These are hardly ideal (they’re certainly not that easy to figure out!) but something can be gleaned from them.
    For example the Cyclone (Apple A7) model is here:
    https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedCyclone.td

    Apple is clearly a special case, making a lot of effort to keep things secret. (One can complain about this all one wants; it’s the way things are. IMHO this is a reaction to the Design Patent system doing a lousy job of protecting Apple’s UI innovations, so Apple has given up on the legal system [which means given up on openness] and resorted to using trade secrets to protect its IP — be careful what you wish for, Internet…)

    But other companies more or less have to use the open LLVM codebase (as opposed to Apple which occasionally merges open LLVM into its secret LLVM codebase, and which has kept every machine model after Cyclone secret) and there is a fair bit of info in the LLVM machine models.

    There’s also the danger that you are committing the Linus error, the insistence that the only right way to do things is the way that you’re used to doing things 🙂
    I get the appeal of this deep-dive documentation, but I’m willing to concede that the vendors have a point insofar as they don’t want to be tied to it, and they don’t want developers tied to it. ARM and Apple seem to modify their µArch a lot more frequently than Intel; throw in big.LITTLE and other per-design variants (like cache sizes), and it’s not clear that it’s really in anyone’s interest to be super-optimizing for the precise details of one particular single-company 2018 model of the core, rather than just using some sort of “common sense” to create something that runs at the 95% performance level everywhere rather than at 100% here and 85% everywhere else.

    There are, of course, even deeper issues at play for some vendors. Apple’s bitcode, in particular, passed its first serious test (the WatchOS ARMv7 to v8 transition) with flying colors, and I expect Apple wants bitcode (with the advantages that brings for continual optimization) to become ever more important.

    That doesn’t help you if your company goal is something like “sell this bespoke assembly code for megadollars to enterprise to run on these particular servers”, but it explains the thinking of the various companies. Everyone (MS in their way, Google in theirs, Apple in theirs) wants to get off the fragility and hassle of perpetual backward binary compatibility…

    As for SVE, again you have to look at the agendas of the various companies.
    ARM’s agenda has always been (and remains, IMHO too much so) good enough performance in MINIMAL area. SVE doesn’t work with that, so SVE for ARM LLC has so far been something for their other customers, not for their core business.
    As for Apple, who knows exactly how they schedule things, but a plausible story would be that their best CPU design team has been spending the past few years working on a core appropriate to the desktop+server (obviously much in common with A12 etc, but choices biased more towards performance, less towards area and power), AND that this team was also targeted with adding SVE, AND that SVE will be retrofitted to A13 or A14 or whatever once it ships on the desktop.
    Point is, yes, delays are frustrating. But I don’t see these delays as reflecting lack of interest or failure so much as the fact that ARM’s market was what it was (ie phones and downward), the moves upward (official ARM server announcements and Apple desktop) were always scheduled to be around 2020 or so, and that was always going to gate the shipping of features like SVE.

    One consequence is that SVE will likely ship with powerful support in the compiler from day one. That’s not THAT interesting to an asm-level hacker like you. But it is very different (and likely more useful for the community as a whole) than what we saw with every previous SIMD introduction (eg I remember writing so much AltiVec first as assembly, then as intrinsics, because there was no compiler support. And as a consequence, yes, it was useful for a few narrowly targeted apps like QuickTime, but not generally useful to all developers…)

    1. The problem is that, some uArch specific optimisations can lead to more than just a 5% difference in speed. I’ve seen 2-3x gains by targeting an algorithm for a uArch. For example, on Intel Silvermont, PSHUFB costs 5 cycles, whilst it’s 1 on all Core i3/i5/i7 and most AMD CPUs. For an algorithm where the inner-loop relies on PSHUFB performance, I’ve found that using an alternative, though usually slower, algorithm which doesn’t rely on shuffles runs significantly faster on Silvermont.

      And it’s not like not having good documentation prevents people doing this stuff anyway, so the point is largely moot. There’s absolutely no harm in letting developers have a general idea of how to optimise for particular uArchs. I doubt the info would reveal anything secret, after all, someone determined could figure it out from microbenchmarking anyway.

      1. I agree. People are imagining that I’m getting my undies bunched over 5-10% performance “here and there”. But in practice, sometimes the difference between an operation being 2 uops and 1 uop is 100% performance on some hot loop! Some code that I’ve written in the past more or less packs character matching into 5 uops, of which one pair manages to fuse, and the 4 fused-domain uops issue to different ports, so 1 cycle/byte. If *any* of those operations was actually 2 uops on *any* port – or if I was wrong about which port something goes to – we have a 2 cycle/byte scan. I had to wrestle a fair bit with the algorithm to make that happen; it’s not something that just might emerge from a compiler. A naive version, straight out of the compiler with no low-level redesign at the algorithm level, would probably have managed to get 3 cycles/byte.

        (note all of this is intrinsics, not asm)

        Also agree that the information is essentially public anyhow. Eventually someone will figure out uops.info for ARM. Making a software optimization guide is a chance to get ahead of any issues rather than have a 3rd party build up a document that might make you look bad, too – suggest workarounds, show best practices, etc.

    2. I hope you’re right that we see SVE eventually. I am aware that compiler support is well underway; this is less unusual these days (I could be coding for ICX on SDE, and even using gcc, and I suppose I would be if I was still aligned to Intel’s interests 🙂 ).

      I’m certainly not taking ARM vendors to task for not including SVE in mobile right away (although I keep hearing about how scalable it is…). But Neoverse seems like a natural fit. Hopefully v2 of it will get SVE out the door.

      As for Apple keeping architecture details under wraps – I dunno. Maybe. You tend to attribute everything Apple does to them playing 8-dimensional chess. Maybe they just aren’t great at developer docs and can’t be bothered? I’d be very startled if you couldn’t reverse-engineer the .td files out of their LLVM (not to mention derive this stuff by benchmarking the hardware). Look at what Agner Fog or uops.info gets done – this should be easier on ARM, not harder (aside from, of course, handling the proliferation of cores).

      It’s really not – like another respondent posted – the choice between 95% and 100% performance. There are *factors* available when you have a hot loop and you are close to the metal. And for this kind of payoff, it can be worth writing your important code over and over again. If you have good developers, a good CI system, etc. it can easily be economically viable to pull on your big boy/girl pants every few months and just deal with the new core whatever it is. I have a few inklings of trying to automate this process (beyond the level of just “hey, how about using a compiler and trusting what it gives you” – not bad, but never quite enough) and this might be a project for me in the next few years (although I will have to figure out how to pay the bills until then).

  2. Thanks for this. Google’s news algorithm put this in my feed, which is normally filled with click bait, and I truly learned something useful. Nicely written.

    1. LLVM has a tool called llvm-mca which will use the .td files to put timing data on assembly code. An easy way to play with this is to use godbolt.org. Hopefully this will encourage LLVM’s timing models to be accurate on a range of architectures.

  3. As someone working on the RISC-V Vector extension I’m curious what your use case for PMOVMSKB is. *Why* do you want to extract every 8th bit and compress them together? What are you going to use that for?

    1. Most commonly it results from having the result of a compare in each byte – 0xFF if true, 0x00 if false (not exclusively – there are a few tricks, like detecting conditions by using an add, saturating or otherwise, to carry into the MSB). You then want to go do something on the General Purpose Register side of the fence – take a conditional, loop through all the set bits and do something, look up an index in some table (maybe a “4 Russians” table a la https://en.wikipedia.org/wiki/Method_of_Four_Russians ). Or, potentially, you will do this a bunch of times and move these bit vectors back to SIMD.
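
      A typical GPR-side consumer, for what it’s worth (my own sketch – handle_match is a hypothetical callback), just walks the set bits:

      ```c
      #include <stdint.h>

      /* Visit each set bit of a movemask-style result, lowest first.
         handle_match is a hypothetical callback, purely for illustration. */
      void for_each_match(uint64_t mask, void (*handle_match)(int pos)) {
          while (mask) {
              int pos = __builtin_ctzll(mask);  /* index of lowest set bit */
              handle_match(pos);
              mask &= mask - 1;                 /* clear lowest set bit */
          }
      }
      ```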

    2. I actually find PMOVMSKB to be quite a useful instruction, and have seen it used in a variety of places. Converting a SIMD mask to scalar mask itself has uses, and you get to do scalar stuff on them (as mentioned above). I’ve also used it for bit shuffling or bit gathers, and the like.

  4. It’s important not to overstate the benefit of microarchitectural optimization. Firstly one must start with a good algorithm. This is where most software fails, and no amount of assembler coding or optimization can rescue you from that. If you know how to write good code, modern compilers will get close to optimal performance. I’ve written lots of generic C code that beat highly optimized assembler code by large factors. Many assembler functions have been removed from GLIBC and replaced by C in recent years. So a good algorithm + well written code + modern compiler beats just about anything.

    If you absolutely need that last 5-10% performance then you have to use micro architectural optimization (if you find it’s much more then it’s likely the wrong algorithm). While instruction latencies and throughput are very useful, they aren’t nearly enough. You need all the details on how instructions are fetched, decoded, renamed, dispatched and executed. You also need to know which instructions are split into micro-ops at each point. Also due to low level implementation details there are quite complex interactions between instruction sequences. Most of this is kept secret for obvious reasons. However if you haven’t ever seen detailed execution traces from a cycle-accurate simulator then you won’t fully understand how a modern OoO core actually executes code.

    So generally I’d say don’t even bother with microarchitectural optimization unless you’re one of the few experts – but even then it’s a lot of work for little gain!

    1. I agree that microarchitectural optimization should not be the first thing anyone reaches for, and some people shouldn’t do it at all. We had a big block of code in early Hyperscan from a former employee in the “#else” branch of a big macro “#ifdef I_DONT_CARE_ABOUT_PERFORMANCE”, with the plain C code as the “naive” alternative. Amusingly, the C code was *faster* than the ASM on anything that wasn’t a Pentium 4.

      That being said, microarchitectural optimization is not a matter of 5-10%. A lot of the time people are already stuck in the “linear world” (or the nlogn world, or whatever). Even if you’re not dealing with “small N” (surprisingly common) you are often already in the big-O ballpark – say, you’re looking for a minimum value on some criteria in a vector that you don’t control so can’t sort or make use of another method of data organization (say, a heap). In this case, a SIMD version with prefetch and tuning might run 10x faster than a naive version. *Maybe* you’ll get an auto-vectorized version of your code from a smart compiler if your code is simple enough, but I wouldn’t want to count on it if I really needed the performance.

      For most cases I run into, the compiler just doesn’t know enough about the problem to restructure it for peak performance.

      You’re also vastly overstating the difficulty, but it is somewhat of a dark art. Even a simplified model is better than being completely naive – for example, knowing that an algorithm that is, essentially, one data-dependent read after another is going to take N cycles per step (where N is a few cycles plus cache latency), while an algorithm that makes independent accesses can potentially run an OOM faster (if the first algorithm is hitting L2, L3 or memory). A lot of the time you don’t need to know all that guff you’re talking about – you just need to know what your “tentpole” is (port 5 pressure, too many uops to successfully execute in your time budget, front-end issues, branch misses, etc). For example, we learned practically nothing about front-end issues (excluding branch misses) in Hyperscan because the front end was almost never an issue.

      As for “little gain”. Well, this was about half of the work we put into Hyperscan (the other half being the algorithmic stuff). So I’m intrigued to hear you and Maynard turn up and tell me how it can’t be done, or shouldn’t really be done, or is only 5-10%. It is not. It is _factors_ (as long as all hot spots are amenable to such optimization; obviously, for a big enough code base, eventually you spend all your time in ugly glue code). I’m glad you guys weren’t around to persuade us not to do this work back in 2008-2013.

      1. I haven’t seen evidence there is scope for more than 5-10% overall gain due to micro architectural optimizations. Even for very small code samples like memcpy it’s extremely hard to get a few percent despite a lot of micro architecture tuning. But choosing the correct algorithm? That easily makes 2x difference on memcpy. On math functions we have shown it to be 5x or more. With those kind of factors any improvement due to microarchitecture is simply noise. And this is why “optimized” assembler code is often slower – even if it happens to be highly optimized, it’s typically the wrong algorithm!

        Note microarchitecture optimization does not include high level optimizations like vectorization, unrolling or even using intrinsics. It means scheduling, counting micro ops, aligning code, adding extra instructions or dependencies to avoid particular schedules, choosing execution units by understanding how decode/dispatch work, zeroing some registers to increase the renaming window etc. It implies getting better performance on one specific microarchitecture, but at the detriment of others.

        As an aside, since you mention O notation: a quadratic algorithm can be significantly faster than a linear one, even for fairly large N. A good example is string matching – my strstr algorithm is ~20x faster than a linear-time algorithm despite being quadratic in the worst case. It’s always fun explaining why it doesn’t matter and how it’s trivial to make it behave linearly in the worst case…

      2. OK – to clarify, as I can’t go deeper than 3 layers of comments: I think Wilco and I are using different definitions of microarchitectural optimization – I view many of the high-level optimizations like unrolling as a form of microarchitectural optimization (similarly, transforming a problem to be less dependent on memory latency and more on throughput). So this does come down to definition.

        I think there are obviously cases where the 5-10% limit is pretty much spot-on. If the code is big, complex and irregular, uarch optimization stuff is impractical. If you’re bottlenecked on something the compiler is doing pretty well with anyhow (e.g. FMA, or memory throughput, etc) 5-10% might be as good as it gets.

        But in my experience, I’ve seen 2X – the difference between packing everything neatly into 1 cycle per task (scanning literal bytes) and having just *one* thing that adds more pressure on a port. Bingo, 2 cycles per task instead. Not super-common, but I imagine a lot of Intel SIMD hackers have succumbed to this sort of thing – especially Port 5 pressure – over the years. I’ve done all sorts of shenanigans – like rearranging my data in memory – to allow (say) an x86 addressing-mode calculation to be done for free, rather than having to have an extra shift operation in a critical loop.

        I think it’s a matter of definition whether this is microarchitectural optimization but it’s the sort of thing I had in mind.

  5. Thanks for this post!

    I have done lots of AVX2 assembly a few years ago and I was curious how things were done in the ARM space. There isn’t much documentation about. Your article presents a nice overview.

    I can’t imagine programming in assembly without knowing the instruction latency and throughput. Agner Fog did more for Intel than any of Intel’s actual employees when it comes to that.

    I was wondering how usable SVE is in practice. This bit about “TBL” isn’t reassuring: “Since the index values can select any element in a vector this operation is not naturally vector length agnostic.”

    1. There’s a fair bit of information wandering around about ARM latency and throughput, but no centralized source for it. Some vendors clearly are against publishing this kind of detail or keep it only for registered SDK customers.

      I guess we’re about to find out about SVE’s usability in a big way. I don’t think the TBL issue is that big – surely that’s how it would *have* to work, right? Otherwise it would become more and more difficult to implement and wind up just being a microcoded monster.

        1. Right. Cross-lane shuffles inherently assume a fixed register width. Once you assume it, vector-length agnosticism goes out of the window. I don’t see a good solution to this. It feels like you have to choose between a super useful feature (there is a reason port 5 is often the bottleneck on Intel) and portability.

        Thanks for the reply!

    1. Good point. I still think we’re waiting on a repeatable process that’s done across different ARM architectures (in the way that Agner works with multiple x86 processors from different vendors), but I imagine this will turn up at some point.
