[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs

Zach van Rijn me at zv.io
Sun Dec 11 18:01:17 CET 2022


On Sun, 2022-12-11 at 09:09 -0600, Segher Boessenkool wrote:
> Hi Zach,
> 
> On Fri, Dec 09, 2022 at 09:12:06AM -0600, Zach van Rijn via
> cfarm-users wrote:
> > On Fri, 2022-12-09 at 15:42 +0100, Pierre Muller via
> > cfarm-users wrote:
> > > ...
> > > 
> > >   It still seems that there are CPU lockups :-(
> 
> *Soft* lockups.  Tasks that were unresponsive for more than
> 20s.  This has been "normal" on bigger Linux systems (a hundred
> cores or so) for very many years, and although it is scary,
> it does not harm much (there typically are other cores still
> available to do any work).  Some of the scalability problems
> are solved over time, but new ones crop up as well.

Under load, sure. The issue here is, even 'sshd' locks up at idle
within a few hours and the system is unresponsive. The local
console stops working. The "soft" lockup is an early indicator
that the system will be offline within just a few more minutes.

Linux 5.18 seemed fine for 6 months until a reboot, and then 5.18
showed this behavior. Any kernel, same thing. That isn't normal.

There's no service contract or budget to diagnose/repair/replace
parts if it isn't immediately clear what the problem is.


> Question.  Does it have ECC RAM?  It should of course; so how
> can RAM be undetected bad then?

It is ECC, yes. As for how, maybe other farm users could posit a
theory about what the underlying issue is. I don't know.


> I have no opinion.  Any larger machine works for us, I think.
> I think people did find it useful, yes.  And it certainly is
> good to have more than one machine.

Agreed.

Our options will depend on budget/donations.

I can cover the cost of a T4-2 (somehow cheaper than a T3-2 now),
however for anything else I'll need some help. My opinion is that
a T5-2 or later would be better. T7-2 units can get expensive.

If a different architecture is desired, now is the time to make
requests / make known your willingness to help out financially.


> The cfarm is not there for benchmarketing or any other kind of
> comparative system evaluation, it is there to help open source
> developers do their thing; what kind of comparison thing do you
> have in mind?  Not criticising you, just confused what your
> goal is here :-)

Open source developers often need to answer the question, "What
is different about this machine that is causing some behavior?",
and sometimes the hardware variable needs to be held constant
while the operating system changes. 32-/64-bit, Linux/Solaris...

That's easier for an admin to do than a farm user, unlike
changing the libc, an allocator, or code, which a developer does.

The T3-2 is normally capable of VM-like "logical domains" (LDOMs)
that make these tasks easier for admins as well. My particular
unit has always had a bug where this crashes the network stack,
so I dedicated the full machine to the compile farm instead.


ZV


