A quick update on last week’s post, SMH: The Swiss Army Chainsaw of shuffle-based matching sequences, this time on performance measurement.
In that post, I provided throughput numbers for these sequences but didn’t show latency. This is a critical distinction, and it doesn’t pay to be confused about the two. I would rather avoid the cringeworthy formulation from The Mythical Man-Month (where women are “assigned” to the task of bearing children!) and stick to the metaphor of boiling eggs: a suitably large pot of boiling water can boil eggs at a throughput of an egg every 10 seconds, but it cannot provide you with a 3-minute-boiled egg in less than 3 minutes.
It is important not to confuse the ability to do something in, say, 10 cycles vs the ability to do 1,000 somethings in 10,000 cycles. The former is always at least as hard and usually much harder. This distinction holds all the way down to the single operation level: for example, a modern x86 processor can launch a multiply operation every cycle, but requires 3 cycles to know the result of a given multiply.
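To make the distinction concrete, here is a minimal sketch (assuming a core with roughly 3-cycle multiply latency that can issue one multiply per cycle): the first loop chains each multiply on the previous result and so runs at multiply latency, while the second keeps three independent chains in flight and so runs at multiply throughput.

```cpp
#include <cstdint>

// Each multiply must wait for the previous result: ~3 cycles per iteration.
uint64_t latency_bound(uint64_t x, int n) {
    for (int i = 0; i < n; i++)
        x *= 0x9E3779B97F4A7C15ULL;
    return x;
}

// Three independent chains overlap in the pipeline: ~1 cycle per multiply.
uint64_t throughput_bound(uint64_t a, uint64_t b, uint64_t c, int n) {
    for (int i = 0; i < n; i++) {
        a *= 0x9E3779B97F4A7C15ULL;
        b *= 0xC2B2AE3D27D4EB4FULL;
        c *= 0x165667B19E3779F9ULL;
    }
    return a ^ b ^ c;
}
```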
Modern computer architecture conspires against us when we wish to measure latency. Attempting to measure the latency of a single short code sequence is quite error-prone due to the overhead of the various performance counter or clock measurement calls.
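For example, a naive single-shot measurement with RDTSC is dominated by the timer itself (a sketch; smh_match here is a hypothetical stand-in for one SMH matching sequence):

```cpp
#include <x86intrin.h>
#include <cstddef>
#include <cstdint>

int smh_match(const char *buf, size_t len); // hypothetical SMH match call

uint64_t time_one_match(const char *buf, size_t len) {
    uint64_t t0 = __rdtsc();
    volatile int m = smh_match(buf, len); // keep the call from being elided
    (void)m;
    uint64_t t1 = __rdtsc();
    // __rdtsc() back-to-back costs on the order of tens of cycles, which
    // swamps a sequence taking under 10 cycles; the delta also reflects
    // out-of-order reordering around the timer reads.
    return t1 - t0;
}
```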
Throughput is easy to measure on a larger scale, as we can run thousands of iterations and establish an average cost per iteration. However, well-written code will usually attempt to minimize dependencies from one iteration to the next. When we attempt to measure, say, the branch-free code of SMH, there is little to prevent a modern, out-of-order processor from getting on with the next iteration or two while the previous iteration is handled.
I tried two approaches, both attempting to measure the latency of the various SMH sequences. The first was to insert an LFENCE instruction between each SMH sequence but otherwise keep the code the same. Note that the LFENCE in this case can be switched on and off by a macro.
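A minimal sketch of that approach, assuming the x86 _mm_lfence intrinsic (MEASURE_LATENCY and smh_match are illustrative names, not the actual code):

```cpp
#include <x86intrin.h>
#include <cstddef>

int smh_match(const char *buf, size_t len); // hypothetical SMH match call

#ifdef MEASURE_LATENCY
// LFENCE does not dispatch until all prior instructions have completed
// locally, so it stops the core from overlapping adjacent iterations.
#define SERIALIZE() _mm_lfence()
#else
#define SERIALIZE() ((void)0) // compiles away for throughput runs
#endif

int measure(const char *buf, size_t len, size_t iterations) {
    int sum = 0;
    for (size_t i = 0; i < iterations; i++) {
        sum += smh_match(buf, len); // one SMH matching sequence
        SERIALIZE();                // optionally fence between iterations
    }
    return sum;
}
```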
The second approach was to make the location that was read by an SMH sequence depend on the result of the previous SMH sequence. Since I didn’t want to introduce a spurious ‘jumping around memory’ component to the benchmark (which would always be absent from the equivalent throughput metric), I made sure that the previous SMH sequence always happened to return zero (no match): we know this, but the architecture and the compiler don’t.
Creating long chains of dependent operations is also how Agner Fog (and others) measure latency; those who have not yet seen Agner’s Software optimization resources are in for a treat.
The code to measure SMH latency is below (note that LFENCE is switched off by the preprocessor as needed and was not used in the latency-test version of this code at all):
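In outline, the trick looks like this (a sketch; smh_match is an illustrative stand-in for the real matching sequence, which returns zero on no-match):

```cpp
#include <cstddef>

int smh_match(const char *buf, size_t len); // hypothetical; returns 0 on no-match

int measure_latency(const char *buf, size_t len, size_t iterations) {
    int tmp = 0; // always zero, but neither compiler nor CPU can prove it
    int sum = 0;
    for (size_t i = 0; i < iterations; i++) {
        // The load address for this iteration depends on the previous
        // match result, so the processor cannot start iteration i+1
        // before iteration i has completed.
        tmp = smh_match(buf + tmp, len);
        sum += tmp;
    }
    return sum;
}
```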
Observe the “tmp” variable in the code above; it is always zero, but we cannot safely start our matching operation until the architecture has the result of the previous match operation in hand (Intel Architecture has many fascinating optimizations, but generalized value prediction is not one of them).
This gives us something of a hybrid creature: “steady-state” latency. The compiler and architecture are still free to load things into registers that don’t depend on the actual computation, so this latency number is perhaps unrepresentative of a ‘cold start’. However, it is a reasonable measurement of the latency of a single operation in a well-optimized code base.
| SMH Variant | normal | no unroll | LFENCE |
| --- | --- | --- | --- |
The above numbers seem reasonable based on a walkthrough of the code. I also measured the effect of turning off my manual 8-way unroll. I had focused on smaller models and the metric of throughput as I tuned SMH; it’s marginally interesting, if not decisive, to note that latency is generally better without an unroll in the measurement loop.
The LFENCE results are hard to interpret – they seem, in general, to track the latency of the normal case plus around 3.5 ns. More work is needed to confirm this; it would be nice to have a way of getting a latency number out of the system that doesn’t rely on the ability to introduce contrived data dependencies from one iteration to the next.
I feel reasonably confident that SMH can be said to do its work in 7-9 cycles; note that the overlap of iterations required to hit full throughput (looking at the above table) must be as many as 8 iterations for the cheapest cases. As always, this implies that being stuck in the ‘latency world’ is miserable – try to phrase your computations to stay in the ‘throughput world’ whenever you can.
Updated code is here.
Thoughts on how to measure latency are welcome.