[cfarm-users] GCC135/Power9 performance?

Sun Apr 12 23:25:40 CEST 2020

>> As already said POWER9 is "allergic" to mixing scalar and vector
>> instructions. And since you will always have scalar instructions in the
>> mix, most notably to calculate effective addresses, vector code is
>> effectively bound to perform suboptimally. It's just the way POWER9 is,
>> ...
> 
> Yeah, I've been thinking about that. How would the following perform
> for loop control?
> 
> /* Run 10 iterations: x is the counter, l the limit, s the step */
> vector unsigned int x, l, s;
> x = vec_splats(0u);
> l = vec_splats(10u);
> s = vec_splats(1u);
> while (vec_all_ne(x, l))
> {
>     ...
>     x = vec_add(x, s);
> }
> 
> The return value from vec_all_ne is an int. Will using a vector as
> loop control improve performance?

I wouldn't actually consider it a viable question, most notably because
there is no way to answer it in the general case, with '...' in the middle.
Though one can ask: if the number of iterations is known and small, then why
isn't the loop unrolled? And if you rely on the compiler to unroll it for you,
then what difference would the specifics of the condition make? On the other
hand, if '...' takes so much time that unrolling makes no sense, then the
condition (and possible misprediction penalties) wouldn't make much of a
difference on the "grand scale"...
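
To make the unrolling point a bit more concrete, here is a minimal sketch
in plain C. It assumes the trip count really is fixed at 10; round_step is
a made-up stand-in for '...', and all function names are hypothetical.
Once the loop is unrolled, no loop condition is left, scalar or vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the '...' loop body: one cheap mixing step. */
static uint32_t round_step(uint32_t state, uint32_t i)
{
    return (state ^ i) * 2654435761u;
}

/* Rolled form: a counter update, a compare and a branch per iteration. */
uint32_t run_rolled(uint32_t state)
{
    for (uint32_t i = 0; i < 10; i++)
        state = round_step(state, i);
    return state;
}

/* Unrolled by hand: the trip count is known, so no loop condition remains. */
uint32_t run_unrolled(uint32_t state)
{
    state = round_step(state, 0);
    state = round_step(state, 1);
    state = round_step(state, 2);
    state = round_step(state, 3);
    state = round_step(state, 4);
    state = round_step(state, 5);
    state = round_step(state, 6);
    state = round_step(state, 7);
    state = round_step(state, 8);
    state = round_step(state, 9);
    return state;
}

/* Sanity check: both forms must compute the same value. */
void check_equivalence(uint32_t seed)
{
    assert(run_rolled(seed) == run_unrolled(seed));
}
```

Whether the hand-unrolled form is actually faster depends entirely on what
'...' is, which is the point above.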

>> What
>> one can do is to calculate as many effective addresses as possible in
>> advance and group those instructions, as opposed to spreading them
>> throughout the loop. And of course, if you rely on compiler intrinsics, you
>> are at the compiler's mercy, and are not exactly in a position to control
>> effective address calculations (and complain ;-).
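
A rough sketch of what "grouping" means at source level, in plain C with
hypothetical function names. Keep in mind that this only expresses intent:
the compiler is free to reschedule the address arithmetic as it sees fit,
which is exactly the "at the compiler's mercy" part. In assembly the
ordering is whatever you write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address arithmetic interleaved with the work that consumes it. */
uint32_t sum_interleaved(const uint32_t *base, size_t stride)
{
    assert(base != NULL);
    uint32_t acc = 0;
    acc += base[0 * stride];
    acc += base[1 * stride];
    acc += base[2 * stride];
    acc += base[3 * stride];
    return acc;
}

/* All effective addresses computed up front, as one group. */
uint32_t sum_grouped(const uint32_t *base, size_t stride)
{
    assert(base != NULL);
    const uint32_t *p0 = base;
    const uint32_t *p1 = base + stride;
    const uint32_t *p2 = base + 2 * stride;
    const uint32_t *p3 = base + 3 * stride;

    /* From here on only loads and adds, no more address calculations. */
    return *p0 + *p1 + *p2 + *p3;
}
```
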
> 
> Yeah, it is a shame intrinsics are second class citizens.

Allow me to rephrase. I didn't mean to suggest that intrinsics are
second-class citizens, but rather implied that cryptographic algorithms
kind of are. But there is no reason to feel bad about it; it's just the
way things are, naturally, so to say. Because a) there is no compiler
that would be equally good at *all* algorithms; b) cryptographic
algorithms tend to be unique in themselves; c) hence an "omnipotent"
compiler would have to recognize them one by one. Now, given c), you can
ask which is harder: modifying all compilers to recognize algorithms
one by one, or writing the implementation in assembly?

Cheers.