[cfarm-users] GCC135/Power9 performance?

Jeffrey Walton noloader at gmail.com
Sun Apr 12 17:15:24 CEST 2020


On Sun, Apr 12, 2020 at 10:20 AM Andy Polyakov via cfarm-users
<cfarm-users at lists.tetaneutral.net> wrote:
>
> >>> GCC135 is a Power9 machine. Benchmarking on the machine shows
> >>> performance is off. For example, here are some numbers for AES in ECB
> >>> mode:
> >>>
> >>> GCC112 (Linux, ppc64le, 3.7 GHz, GCC 8.2):
> >>>    * 1.12 cpb, 2851 MB/s
> >>>
> >>> GCC119 (AIX, ppc64be, 4.1 GHz, GCC 8.2):
> >>>    * 0.54 cpb, 7242 MB/s
>
> Just in case: AIX misreports resource usage (getrusage), and the
> presented result appears consistent with that misreporting. I mean, if
> you calculate performance from the skewed getrusage values, you'll
> customarily observe ~2x "better" cycles-per-byte results. In essence
> you should either use wallclock or disregard AIX results. But don't
> ask me why AIX does it, as I don't know...

Yeah, we use simple wall clock via clock() calls. It seems to perform
most reliably on most platforms as long as high precision is not
needed.
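
For reference, here is a minimal sketch of how a cycles-per-byte figure
can be derived from a timed run. It is not the Crypto++ benchmark code;
the function names are hypothetical and the core frequency is assumed
to be fixed and known. Note that ISO C defines clock() as processor
time, so the sketch uses POSIX clock_gettime(CLOCK_MONOTONIC) for the
wall-clock reading and keeps clock() as a separate CPU-time reading.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime under strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Seconds on a monotonic wall clock (POSIX clock_gettime). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Seconds of processor time used by this program so far (ISO C clock). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Cycles per byte, assuming a fixed core frequency given in Hz. */
static double cycles_per_byte(double seconds, double hz, size_t bytes)
{
    assert(bytes > 0 && seconds >= 0.0);
    return seconds * hz / (double)bytes;
}
```

Sampling both clocks before and after the timed loop and comparing the
two deltas is one way to spot a platform whose CPU-time accounting
disagrees with wall time.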

> >>> GCC135 (Linux, ppc64le, 3.8 GHz, GCC 8.3):
> >>>    * 1.94 cpb, 1815 MB/s
> >>
> >> What source code did you use for this test?
> >
> > I used Crypto++ (https://github.com/weidai11/cryptopp) for the test.
> >
> > I also spoke with Andy Polyakov. OpenSSL is observing the same issue.
>
> As already said POWER9 is "allergic" to mixing scalar and vector
> instructions. And since you will always have scalar instructions in the
> mix, most notably to calculate effective addresses, vector code is
> effectively bound to perform suboptimally. It's just the way POWER9 is,
> ...

Yeah, I've been thinking about that. How would the following perform
for loop control?

/* Run 10 iterations */
vector unsigned int x, l, s;
x = vec_splats(0u);   /* counter */
l = vec_splats(10u);  /* limit */
s = vec_splats(1u);   /* step */
while (vec_all_ne(x, l))
{
    ...
    x = vec_add(x, s);
}

The return value from vec_all_ne is an int. Will using a vector as
loop control improve performance?

> What
> one can do is to calculate as many effective addresses as possible in
> advance and group those instructions, as opposed to spreading them
> throughout the loop. And of course, if you rely on compiler intrinsics,
> you are at the compiler's mercy, and are not exactly in a position to
> control effective address calculations (and complain ;-).

Yeah, it is a shame intrinsics are second-class citizens.

Clang calculates effective addresses using indexes (pointer math)
while GCC and XLC use offsets (integer math). That makes the advice
harder to follow.
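
For what it's worth, here is a rough sketch of the "group the address
calculations" advice in portable C. The function names are
hypothetical, and a plain 16-byte XOR stands in for the vector
operation. The compiler remains free to reschedule all of this, so the
generated assembly still needs to be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR one 16-byte block of src into dst; stands in for a vector op. */
static void xor16(uint8_t *dst, const uint8_t *src)
{
    for (int k = 0; k < 16; k++)
        dst[k] ^= src[k];
}

/* Hypothetical kernel: XOR src into dst, four 16-byte blocks per pass. */
static void xor_blocks_grouped(uint8_t *dst, const uint8_t *src, size_t nblocks)
{
    assert(nblocks % 4 == 0);
    for (size_t i = 0; i < nblocks; i += 4) {
        /* All address arithmetic for this pass, grouped up front. */
        const uint8_t *s0 = src + 16 * i;
        const uint8_t *s1 = s0 + 16;
        const uint8_t *s2 = s0 + 32;
        const uint8_t *s3 = s0 + 48;
        uint8_t *d0 = dst + 16 * i;
        uint8_t *d1 = d0 + 16;
        uint8_t *d2 = d0 + 32;
        uint8_t *d3 = d0 + 48;

        /* Data work follows; the pointers are already computed. */
        xor16(d0, s0);
        xor16(d1, s1);
        xor16(d2, s2);
        xor16(d3, s3);
    }
}

/* Returns 1 if the grouped kernel matches a plain byte-wise XOR. */
static int grouped_matches_reference(void)
{
    uint8_t src[64], dst[64], ref[64];
    for (int k = 0; k < 64; k++) {
        src[k] = (uint8_t)(3 * k + 1);
        dst[k] = ref[k] = (uint8_t)(7 * k);
    }
    for (int k = 0; k < 64; k++)
        ref[k] ^= src[k];
    xor_blocks_grouped(dst, src, 4);
    for (int k = 0; k < 64; k++)
        if (dst[k] != ref[k])
            return 0;
    return 1;
}
```

Whether the grouping survives into the object code depends on the
compiler and optimization level, which is the "at the compiler's mercy"
problem again.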

I thought Power9 was going to make things easier due to the vector
char and vector short loads, but they don't matter much when the
machine runs code more slowly.

> >>> All algorithms show a similar slowdown. SHA is so slow I am
> >>> considering disabling in-core crypto for SHA and going back to the
> >>> integer unit.
>
> While IBM screwed up the vector-scalar mix, they did improve scalar
> performance in POWER9, significantly[!]. So the gap between vector
> and equivalent scalar implementations gets reduced from both ends. I
> mean vector is slower, and scalar is faster, both in comparison to
> POWER8 that is. And it's more so from the scalar end. Properly
> optimized (a.k.a. hand-written) vector SHA is faster than scalar, but
> by a mere 12%, so if you let the compiler calculate effective
> addresses, it shouldn't come as a surprise if vector turns out slower
> than scalar. And again, it's just the way POWER9 is, just accept it.
>
> >>> What is different about GCC135? Is the Power9 hardware really that slow?
> >>
> >> Generally no: https://www.ibm.com/downloads/cas/K90RQOW8
>
> One should make a distinction between multi-user/capacity
> suite-specific benchmarks (like SPEC/SAP/Oracle/etc.) and cryptographic
> primitive cycles-per-byte benchmarks for a single thread on an idle
> system. In addition, cryptographic algorithms are to a certain degree a
> special case even in single-thread context, because they customarily
> have relatively short dependencies between steps, which results in all
> kinds of special and non-formalize-able relations between compiler (or
> assembler programmer) and hardware :-) In other words, better SPECrates
> are not a guaranteed indication of better cycles-per-byte for crypto
> primitives.

Thanks.

