Performance notes on SMH: measuring throughput vs latency of short C++ sequences

A quick update on last week’s post SMH: The Swiss Army Chainsaw of shuffle-based matching sequences on performance measurement.

During that post, I provided throughput numbers for these sequences but didn’t show latency. This is a critical distinction, and it doesn’t pay to be confused about the the two. I would rather avoid the rather cringeworthy formulation from the Mythical Man Month (where women are “assigned” to the task of bearing children!) and stick to the metaphor of boiling eggs: a suitably large pot of boiling water could boil eggs at a throughput of an egg every 10 seconds, but cannot provide you with a 3-minute-boiled egg in less than 3 minutes.

It is important not to confuse the ability to do something in, say, 10 cycles vs the ability to do 1,000 somethings in 10,000 cycles. The former is always at least as hard and usually much harder. This distinction holds all the way down to the single operation level: for example, a modern x86 processor can launch a multiply operation every cycle, but requires 3 cycles to know the result of a given multiply.

Modern computer architecture conspires against us when we wish to measure latency. Attempting to measure the latency of a single short code sequence is quite error-prone due to the overhead of the various performance counter or clock measurement calls.

Throughput is easy to measure on a larger scale, as we can measure thousands of iterations and establish an average cost per iteration. However, well-written code will usually attempt to minimize dependencies from one iteration to the next. When we attempt to measure, say, the branch-free code of SMH, there is little to prevent a modern, out-of-order processor from getting on with the next iteration or two while the previous iteration is handled.

I tried two approaches both attempting to measure the latency of the various SMH sequences. The first was to insert an LFENCE instruction between each SMH sequence but otherwise keep the code the same. Note that LFENCE in this case can be switched on and off by a macro.

The second approach was to make the location that was read by an SMH sequence depend on the result of the previous SMH sequence. Since I didn’t want to introduce a spurious ‘jumping around memory’ component to the benchmark (which would always be absent from the equivalent throughput metric), I made sure that the previous SMH sequence always happened to return zero (no match): we know this, but the architecture and the compiler don’t.

Creating long chains of dependent operations is also how Agner Fog (and others) measure latency; those who have not yet seen Agner’s Software optimization resources are in for a treat.

The code to measure SMH latency is below (note that LFENCE is switched off by the preprocessor as needed and was not used in the latency-test version of this code at all):

Observe the “tmp” variable in the gist above; it is always zero, but we cannot safely start our matching operation until the architecture has the result of the previous match operation in hand (Intel Architecture has many fascinating optimizations, but generalized value prediction is not one of them).

This gives us somewhat of a hybrid creature: “steady-state” latency. The compiler and architecture are still free to load things into registers that don’t depend on the actual computation – so this latency number is perhaps unrepresentative of a ‘cold start’. However, it is a reasonable measurement of the latency of a single operation in a well-optimized code base.

SMH Variant normal no unroll LFENCE
SMH32-loose Throughput (ns) 0.89 0.98 10.62
Latency (ns) 7.03 6.92 10.65
SMH32 Throughput (ns) 1.12 1.15 11.02
Latency (ns) 7.25 7.30 10.89
SMH64-loose Throughput (ns) 1.35 1.44 11.03
Latency (ns) 7.63 7.61 11.36
SMH64 Throughput (ns) 1.62 1.66 11.67
Latency (ns) 7.95 8.00 11.61
SMH128-loose Throughput (ns) 2.80 2.67 12.39
Latency (ns) 8.97 8.14 12.91
SMH128 Throughput (ns) 3.32 3.08 12.82
Latency (ns) 9.78 8.55 12.91

The above numbers seem reasonable based on a walkthough of the code. I also measured the effect of turning off my manual 8-way unroll. I had focused on smaller models and the metric of throughput as I tuned SMH; it’s marginally interesting to note that latency is generally better without an unroll in the measurement loop if not decisive.

The LFENCE results are hard to interpret – they seem to generally track the latency of the normal case plus around 3.5ns. More work is needed to confirm this; it would be nice to have a way of getting a latency number out of the system that doesn’t rely on an ability to introduce contrived data dependencies from one iteration to the next.

I feel reasonably confident that SMH can be said to do its work in 7-9 cycles; note that the overlap of iterations required to hit the full throughput (looking at the above table) must have to be as many as 8 iterations for the cheapest cases. As always, this implies that being stuck in the ‘latency world’ is miserable – try to phrase your computations to stay in the ‘throughput world’ whenever you can.

Updated code is here

Thoughts on how to measure latency are welcome.

3 thoughts on “Performance notes on SMH: measuring throughput vs latency of short C++ sequences”

  1. I think the right way to measure latency is the “make the input depend on the output approach” that you outline. After, all this is the most reflective of how a sequence of instructions will part of a latency chain in the real application (even though of course you might not be using the particular function you are testing in a back-to-back manner).

    So introducing an output-to-input dependency seems to be the most realistic way to do it. You don’t necessarily have to organize for your test function to return zero: a more general way to do it is just to have a zero value in a stack variable zero (usually initialized from a zero() function which always returns zero but cannot be optimized away), then you could just do `input += output & zero` to make input depend on output. This is nice because it’s generic and usually only adds 2 instructions (cycles) to the latency (which you could perhaps cancel out if you had a calibration step).

    It is also worth noting that there isn’t actually just a single latency for a function: in principle there is an I * O matrix of latencies for a function with I inputs and O outputs (often O is one – i.e., any function that returns a single scalar). Sometimes the latencies are quite different depending on which output chains to which input. This even applies at the instruction level: some functions with asymmetric arguments (non-commutative, I guess) have different latencies depending on which input operand the dependency flows though: many of the two uop instructions are like this: one operand is used by the first uop (e.g., to set up a mask or other control) and then the second uop consumes that result and the other operand: so the latency is 2 for one operand and 1 for the other.

    The “forced dependency” approach lets you build the full matrix of latencies if you want.

    I don’t like the lfence approach although I’ve certainly used it before. You are using the normal OoO machinery, but a special instruction that works in an unspecified way, so it’s hard to say if you are getting a true latency measurement. For example, I found that the lfence behavior was not as you might expect for short instruction sequences: e.g., a back-to-back lfence had timings of say 6 cycles, but adding a single 1-cycle intruction would increase it to 9 cycles, but then you could add more 1 cycle instructions to a dependent chain and it would say at 9 cycles until you had 4 instructions, then it jumped again, etc. So you really rely on specific uarch details and you’ll have to characterize it for each arch, because it probably changes. Also, on AMD it will probably also be quite different.

    LFENCE is also a bit pessimistic when it comes to measuring latency: *every* instruction must retire, but in real code there can be some work which is not part of the dependency chain to the output (verification of conditional branches comes to mind), and in the real world this work can freely overlap between iterations, and it should be free to do so since that will happen in real life. Similarly, there can be some chains that don’t depend on the input (e.g,. loading some constants, whatever) and these should be free to execute earlier and across iterations. LFENCE blocks all that, unrealistically.

    You mention, I think, that latency is always at least as long as (inverse) throughput, but there are actually interesting counter-examples. I want to nit-pick that :). At the instruction level, you have the x86 shl/shr instructions:

    https://www.agner.org/optimize/blog/read.php?i=860&v=t

    At the function level you can get this effect too: when enough work to do, but short dependency chains such that latency can be less than the throughput. Consider for example a function like:

    int foo(int x) {
    if (x == 42) return 1;
    if (x == 52) return 2;
    if (x == 62) return 3;
    return x;
    }

    It checks its input against 3 values, with the branches (usually) not taken and if no match just returns its argument. This will can be simply implemented as something like mov eax, edi for the fall-through case and 3 cmp/je pairs for the 3 checks. The checks don’t participate in the data dependency chain latency chain, so this function usually has a latency of 0 cycle (mov elimination worked) or 1 cycle (didn’t work), but the throughput is worse than that since you can only check 2 untaken branches a cycle, so the throughput is at best 1.5 cycles/iteration.

    Of course, you can’t *sustain* a latency of 0 or 1, back-to-back, since you’ll hit the throughput bottleneck, but the properly defined latency still is truly 0 or 1 (indeed, real code will probably not run this back-to-back so there will free slots and you might achieve the 0/1 latency), and you can additionally distinguish something like the “steady state” latency as you mention at the end, which caps the latency based on the throughput. Which one you measure in a benchmark depends on the design: in the case where you have a couple cycles of instructions to create the fake dependency, you get enough free slots to measure 0/1 after calibration, but if you don’t have those you’ll measure a higher one.

    Like

    1. Thanks for this comment. I was not previously aware of the SHL/SHR example and hadn’t reasoned through the concept of the cases where latency is actually better than reciprocal throughput. It seems like an exotic case but it’s well worth thinking through.

      I’m not a big fan of LFENCE either. I think the problem of forcing dependency will rapidly turn into a gigantic pain if there are multiple inputs and outputs, so it’s probably safe to say that trying to benchmark complex functions for latency at a very fine grain is tough. I think you might need to introduce multiple forced dependencies to deal with a function that has multiple inputs and outputs. At some point it becomes an approximation game and the best approach is probably to try a bunch of different methodologies and establish that they aren’t too insanely divergent (IACA, LFENCE, forced dependencies and, probably, scrawling little pictures of dependency graphs and likely port assignments on paper and hoping for the best).

      Like

      1. Yeah, it is relatively exotic – mostly because with an execution width of 4 it’s hard to arrange with just plain instructions which all contribute to the output. You can do it though: consider a tree of single latency instructions, e.g., N inputs combined pair-wise into N/2 outputs and so on until you get one output (e.g., a function that simply adds together all inputs, tournament-style). This has N-1 add operations but only has latency lg(N), so for N = 32 you have 31 add ops, which takes at 7.75 cycles inverse throughput, but the latency is only 5. Of course, x86 doesn’t have 32 scalar registers, so it gets a bit more complicated to actually write a working example (it’s easier if you use instructions that execute on fewer than 4 units, since then you have a lower throughput bar to clear).

        The real world cases of such a function are much more likely to be related to functions that have “out of line” code (instructions which aren’t on the input-to-output latency path at all), usually associated with branches – like the example I gave. A function with lower latency than recip. throughput isn’t actually too unusual in that case (but still it doesn’t mater “much”).

        I don’t actually agree about the “gigantic pain” regarding forcing dependencies. I think you can do it in a fairly rote and simply, probably in a one-liner with some macro or template meta-programming. The basic idea is not to try to force your output to zero or some other special value that makes for appropriate input, but just to take your output, AND it with a zero value, and then add (or any operation that is a no-op with zero) it to all your new input values. This never changes the value so is compatible with whatever mechanism you were already using generate the values in the first place, works for scalars, pointers, etc. For structs yeah you need some kind of recursion to apply it to each (relevant member of the struct). For the common case of 1 output from a function and N inputs, this takes only N extra add instructions and one extra AND instruction.

        That’s not just theoretical: I use that approach with success in various benchmarks: it makes to easy to have your “throughput” and “latency” benchmark loops identical except for that one line that creates the dependency.

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s