[cfarm-users] GCC135/Power9 performance?

Sun Apr 12 16:18:50 CEST 2020

>>> GCC135 is a Power9 machine. Benchmarking on the machine shows
>>> performance is off. For example, here are some numbers for AES in ECB
>>> mode:
>>>
>>> GCC112 (Linux, ppc64le, 3.7 GHz, GCC 8.2):
>>>    * 1.12 cpb, 2851 MB/s
>>>
>>> GCC119 (AIX, ppc64be, 4.1 GHz, GCC 8.2):
>>>    * 0.54 cpb, 7242 MB/s

Just in case: AIX misreports resource usage (getrusage), and the
presented result appears to be consistent with that misreporting. I mean,
if you calculate performance from the skewed getrusage values, you'll
customarily observe cycles-per-byte results that look ~2x "better". In
essence, you should either use wallclock time or disregard the AIX
results. But don't ask me why AIX does it, as I don't know...
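
To illustrate the wallclock-versus-getrusage point, here is a minimal
sketch that times one call both ways and converts each to cycles per
byte. It is not how Crypto++ or OpenSSL measure. The names measure() and
cycles_per_byte(), the SHA-256 stand-in workload and the 3.8 GHz figure
passed as cpu_hz are assumptions for illustration only, and the
conversion assumes the core actually runs at the nominal clock.

```python
import hashlib
import resource
import time


def cycles_per_byte(seconds, nbytes, cpu_hz):
    """Convert elapsed seconds over nbytes into cycles per byte."""
    return seconds * cpu_hz / nbytes


def measure(fn, data, cpu_hz):
    """Time fn(data) once by wallclock and by getrusage CPU time."""
    ru0 = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    fn(data)
    t1 = time.perf_counter()
    ru1 = resource.getrusage(resource.RUSAGE_SELF)

    wall = t1 - t0
    # User + system CPU time charged to this process during the call.
    cpu = (ru1.ru_utime + ru1.ru_stime) - (ru0.ru_utime + ru0.ru_stime)
    n = len(data)
    return {
        "wall_cpb": cycles_per_byte(wall, n, cpu_hz),
        "rusage_cpb": cycles_per_byte(cpu, n, cpu_hz),
        "wall_mib_per_s": n / wall / 2**20,
    }


if __name__ == "__main__":
    buf = bytes(64 * 1024 * 1024)  # 64 MiB of zeros as dummy input
    # cpu_hz is an assumed nominal clock, not a measured one.
    print(measure(lambda d: hashlib.sha256(d).digest(), buf, cpu_hz=3.8e9))
```

On an idle system with a single thread the two cpb figures should
roughly agree. If the getrusage-based figure is consistently about half
the wallclock one, that is the kind of skew described above.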

>>> GCC135 (Linux, ppc64le, 3.8 GHz, GCC 8.3):
>>>    * 1.94 cpb, 1815 MB/s
>>
>> What source code did you use for this test?
> 
> I used Crypto++ (https://github.com/weidai11/cryptopp) for the test.
> 
> I also spoke with Andy Polyakov. OpenSSL is observing the same issue.

As already said, POWER9 is "allergic" to mixing scalar and vector
instructions. And since you will always have scalar instructions in the
mix, most notably to calculate effective addresses, vector code is
effectively bound to perform suboptimally. It's just the way POWER9 is,
and complaining about it would be like complaining about the weather.
What one can do is calculate as many effective addresses as possible in
advance and group those instructions, as opposed to spreading them
throughout the loop. And of course, if you rely on compiler intrinsics,
you are at the compiler's mercy, and not exactly in a position to control
effective address calculations (and complain ;-).

>>> All algorithms show a similar slowdown. SHA is so slow I am
>>> considering disabling in-core crypto for SHA and going back to the
>>> integer unit.

While IBM screwed up the vector-scalar mix, they did improve scalar
performance in POWER9, significantly[!]. So the gap between vector and
equivalent scalar implementations shrinks from both ends. I mean vector
is slower and scalar is faster, both in comparison to POWER8, that is.
And more so from the scalar end. Properly optimized (a.k.a. hand-written)
vector SHA is faster than scalar, but by a mere 12%, so if you let the
compiler calculate effective addresses, it shouldn't come as a surprise
if vector turns out slower than scalar. And again, it's just the way
POWER9 is, just accept it.

>>> What is different about GCC135? Is the Power9 hardware really that slow?
>>
>> Generally no: https://www.ibm.com/downloads/cas/K90RQOW8

One should distinguish between multi-user/capacity suite-specific
benchmarks (like SPEC/SAP/Oracle/etc.) and cryptographic primitive
cycles-per-byte benchmarks for a single thread on an idle system. In
addition, cryptographic algorithms are to a certain degree a special case
even in a single-thread context, because they customarily have relatively
short dependencies between steps, which results in all kinds of special
and non-formalizable relations between the compiler (or assembler
programmer) and the hardware :-) In other words, better SPECrates are not
a guaranteed indication of better cycles-per-byte for crypto primitives.

Cheers.