Performance notes on SMH: measuring throughput vs latency of short C++ sequences

2018-07-24T05:39:05+00:00

I think the right way to measure latency is the “make the input depend on the output approach” that you outline. After, all this is the most reflective of how a sequence of instructions will part of a latency chain in the real application (even though of course you might not be using the particular function you are testing in a back-to-back manner).

So introducing an output-to-input dependency seems to be the most realistic way to do it. You don’t necessarily have to organize for your test function to return zero: a more general way to do it is just to have a zero value in a stack variable zero (usually initialized from a zero() function which always returns zero but cannot be optimized away), then you could just do `input += output & zero` to make input depend on output. This is nice because it’s generic and usually only adds 2 instructions (cycles) to the latency (which you could perhaps cancel out if you had a calibration step).

It is also worth noting that there isn’t actually just a single latency for a function: in principle there is an I * O matrix of latencies for a function with I inputs and O outputs (often O is one – i.e., any function that returns a single scalar). Sometimes the latencies are quite different depending on which output chains to which input. This even applies at the instruction level: some functions with asymmetric arguments (non-commutative, I guess) have different latencies depending on which input operand the dependency flows though: many of the two uop instructions are like this: one operand is used by the first uop (e.g., to set up a mask or other control) and then the second uop consumes that result and the other operand: so the latency is 2 for one operand and 1 for the other.

The “forced dependency” approach lets you build the full matrix of latencies if you want.

I don’t like the lfence approach although I’ve certainly used it before. You are using the normal OoO machinery, but a special instruction that works in an unspecified way, so it’s hard to say if you are getting a true latency measurement. For example, I found that the lfence behavior was not as you might expect for short instruction sequences: e.g., a back-to-back lfence had timings of say 6 cycles, but adding a single 1-cycle intruction would increase it to 9 cycles, but then you could add more 1 cycle instructions to a dependent chain and it would say at 9 cycles until you had 4 instructions, then it jumped again, etc. So you really rely on specific uarch details and you’ll have to characterize it for each arch, because it probably changes. Also, on AMD it will probably also be quite different.

LFENCE is also a bit pessimistic when it comes to measuring latency: *every* instruction must retire, but in real code there can be some work which is not part of the dependency chain to the output (verification of conditional branches comes to mind), and in the real world this work can freely overlap between iterations, and it should be free to do so since that will happen in real life. Similarly, there can be some chains that don’t depend on the input (e.g,. loading some constants, whatever) and these should be free to execute earlier and across iterations. LFENCE blocks all that, unrealistically.

You mention, I think, that latency is always at least as long as (inverse) throughput, but there are actually interesting counter-examples. I want to nit-pick that :). At the instruction level, you have the x86 shl/shr instructions:

https://www.agner.org/optimize/blog/read.php?i=860&v=t

At the function level you can get this effect too: when enough work to do, but short dependency chains such that latency can be less than the throughput. Consider for example a function like:

int foo(int x) {
if (x == 42) return 1;
if (x == 52) return 2;
if (x == 62) return 3;
return x;
}

It checks its input against 3 values, with the branches (usually) not taken and if no match just returns its argument. This will can be simply implemented as something like mov eax, edi for the fall-through case and 3 cmp/je pairs for the 3 checks. The checks don’t participate in the data dependency chain latency chain, so this function usually has a latency of 0 cycle (mov elimination worked) or 1 cycle (didn’t work), but the throughput is worse than that since you can only check 2 untaken branches a cycle, so the throughput is at best 1.5 cycles/iteration.

Of course, you can’t *sustain* a latency of 0 or 1, back-to-back, since you’ll hit the throughput bottleneck, but the properly defined latency still is truly 0 or 1 (indeed, real code will probably not run this back-to-back so there will free slots and you might achieve the 0/1 latency), and you can additionally distinguish something like the “steady state” latency as you mention at the end, which caps the latency based on the throughput. Which one you measure in a benchmark depends on the design: in the case where you have a couple cycles of instructions to create the fake dependency, you get enough free slots to measure 0/1 after calibration, but if you don’t have those you’ll measure a higher one.

LikeLike

2018-07-24T10:26:24+00:00

Thanks for this comment. I was not previously aware of the SHL/SHR example and hadn’t reasoned through the concept of the cases where latency is actually better than reciprocal throughput. It seems like an exotic case but it’s well worth thinking through.

I’m not a big fan of LFENCE either. I think the problem of forcing dependency will rapidly turn into a gigantic pain if there are multiple inputs and outputs, so it’s probably safe to say that trying to benchmark complex functions for latency at a very fine grain is tough. I think you might need to introduce multiple forced dependencies to deal with a function that has multiple inputs and outputs. At some point it becomes an approximation game and the best approach is probably to try a bunch of different methodologies and establish that they aren’t too insanely divergent (IACA, LFENCE, forced dependencies and, probably, scrawling little pictures of dependency graphs and likely port assignments on paper and hoping for the best).

LikeLike

2018-07-25T02:05:48+00:00

Yeah, it is relatively exotic – mostly because with an execution width of 4 it’s hard to arrange with just plain instructions which all contribute to the output. You can do it though: consider a tree of single latency instructions, e.g., N inputs combined pair-wise into N/2 outputs and so on until you get one output (e.g., a function that simply adds together all inputs, tournament-style). This has N-1 add operations but only has latency lg(N), so for N = 32 you have 31 add ops, which takes at 7.75 cycles inverse throughput, but the latency is only 5. Of course, x86 doesn’t have 32 scalar registers, so it gets a bit more complicated to actually write a working example (it’s easier if you use instructions that execute on fewer than 4 units, since then you have a lower throughput bar to clear).

The real world cases of such a function are much more likely to be related to functions that have “out of line” code (instructions which aren’t on the input-to-output latency path at all), usually associated with branches – like the example I gave. A function with lower latency than recip. throughput isn’t actually too unusual in that case (but still it doesn’t mater “much”).

I don’t actually agree about the “gigantic pain” regarding forcing dependencies. I think you can do it in a fairly rote and simply, probably in a one-liner with some macro or template meta-programming. The basic idea is not to try to force your output to zero or some other special value that makes for appropriate input, but just to take your output, AND it with a zero value, and then add (or any operation that is a no-op with zero) it to all your new input values. This never changes the value so is compatible with whatever mechanism you were already using generate the values in the first place, works for scalars, pointers, etc. For structs yeah you need some kind of recursion to apply it to each (relevant member of the struct). For the common case of 1 output from a function and N inputs, this takes only N extra add instructions and one extra AND instruction.

That’s not just theoretical: I use that approach with success in various benchmarks: it makes to easy to have your “throughput” and “latency” benchmark loops identical except for that one line that creates the dependency.

LikeLike

2019-03-21T19:40:31+00:00

Some truly nice stuff on this site, I love it.

LikeLike

	template <typename T>
	void match_multiple_smh(T & smh, std::vector<u8 *> & buffers, std::vector<size_t> & lengths,
	std::vector<u32> & results) {
	u32 i = 0;
	#ifndef NO_UNROLL
	for (; i+7 < buffers.size(); i+=8) {
	results[i+0] = smh.match(buffers[i+0], lengths[i+0]); LFENCE
	results[i+1] = smh.match(buffers[i+1], lengths[i+1]); LFENCE
	results[i+2] = smh.match(buffers[i+2], lengths[i+2]); LFENCE
	results[i+3] = smh.match(buffers[i+3], lengths[i+3]); LFENCE
	results[i+4] = smh.match(buffers[i+4], lengths[i+4]); LFENCE
	results[i+5] = smh.match(buffers[i+5], lengths[i+5]); LFENCE
	results[i+6] = smh.match(buffers[i+6], lengths[i+6]); LFENCE
	results[i+7] = smh.match(buffers[i+7], lengths[i+7]); LFENCE
	}
	#endif
	for (; i < buffers.size(); ++i) {
	results[i] = smh.match(buffers[i], lengths[i]); LFENCE
	}
	}

	template <typename T>
	void match_multiple_smh_latency_test(T & smh, std::vector<u8 *> & buffers, std::vector<size_t> & lengths,
	std::vector<u32> & results) {
	u32 i = 0;
	u32 tmp = 0;
	#ifndef NO_UNROLL
	// NOTE: experimental code only. Note that the addition of 'tmp' – being the id of a possible
	// match – could take us RIGHT outside our buffer if we actually matched something. We aren't
	// in this particular run, but so it goes. Saner would be to build up an all-zero id vector
	for (; i+7 < buffers.size(); i+=8) {
	tmp = results[i+0] = smh.match(buffers[i+0 + tmp], lengths[i+0] + tmp); LFENCE
	tmp = results[i+1] = smh.match(buffers[i+1 + tmp], lengths[i+1] + tmp); LFENCE
	tmp = results[i+2] = smh.match(buffers[i+2 + tmp], lengths[i+2] + tmp); LFENCE
	tmp = results[i+3] = smh.match(buffers[i+3 + tmp], lengths[i+3] + tmp); LFENCE
	tmp = results[i+4] = smh.match(buffers[i+4 + tmp], lengths[i+4] + tmp); LFENCE
	tmp = results[i+5] = smh.match(buffers[i+5 + tmp], lengths[i+5] + tmp); LFENCE
	tmp = results[i+6] = smh.match(buffers[i+6 + tmp], lengths[i+6] + tmp); LFENCE
	tmp = results[i+7] = smh.match(buffers[i+7 + tmp], lengths[i+7] + tmp); LFENCE
	}
	#endif
	for (; i < buffers.size(); ++i) {
	tmp = results[i] = smh.match(buffers[i + tmp], lengths[i + tmp]); LFENCE
	}
	}

SMH Variant		normal	no unroll	LFENCE
SMH32-loose	Throughput (ns)	0.89	0.98	10.62
	Latency (ns)	7.03	6.92	10.65
SMH32	Throughput (ns)	1.12	1.15	11.02
	Latency (ns)	7.25	7.30	10.89
SMH64-loose	Throughput (ns)	1.35	1.44	11.03
	Latency (ns)	7.63	7.61	11.36
SMH64	Throughput (ns)	1.62	1.66	11.67
	Latency (ns)	7.95	8.00	11.61
SMH128-loose	Throughput (ns)	2.80	2.67	12.39
	Latency (ns)	8.97	8.14	12.91
SMH128	Throughput (ns)	3.32	3.08	12.82
	Latency (ns)	9.78	8.55	12.91

Performance notes on SMH: measuring throughput vs latency of short C++ sequences

Published by geofflangdale

4 thoughts on “Performance notes on SMH: measuring throughput vs latency of short C++ sequences”

Leave a comment Cancel reply

Share this:

Related

Published by geofflangdale

4 thoughts on “Performance notes on SMH: measuring throughput vs latency of short C++ sequences”

Leave a comment Cancel reply