[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs

Pierre Muller pierre at freepascal.org
Wed Nov 30 15:45:21 CET 2022



At least 'ppc2' and 'fpmake' are most probably executables that are
generated by my cron jobs and run under my user account.


Maybe it would be wise to check if the machine is stable if my cron jobs are disabled.

   I am currently unable to log in to gcc102.
If you restart the machine, please also disable the cron jobs of user muller (myself),
and let's check if the machine is stable without my jobs.

   It could be that my jobs are generating some illegal instructions...
Of course, on a stable kernel, this should never lead to instabilities of the
system itself...

   If no lockup appears within a few days, we could try to re-enable
my jobs and see if this correlates with the appearance of lockups.

Pierre

On 30/11/2022 at 15:25, Zach van Rijn wrote:
> On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
> wrote:
>> Just got this:
>> Message from syslogd at gcc102 at Nov 30 04:31:20 ...
>>    kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
>> stuck for 48s! [ppc2:203070]
>>
>> Can I do anything to help figuring out the problem?
> 
> Not sure. Ideas are certainly welcome. Had you used this machine
> much before a few days ago? This issue had occurred maybe twice
> in the last two years.
> 
> Here is what is being printed, over and over again:
> 
> [60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for
> 11919s! [ppc2:203070]
> [60140.608860] watchdog: BUG: soft lockup - CPU#136 stuck for
> 10914s! [in:imklog:2885]
> [60148.356060] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11700s! [sshd:103658]
> [60152.803603] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10530s! [exim4:3249]
> [60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11830s! [kworker/u512:1:103663]
> [60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for
> 11942s! [ppc2:203070]
> [60164.398409] rcu: INFO: rcu_sched detected stalls on
> CPUs/tasks:
> [60164.410066] rcu:     65-...0: (1 GPs behind)
> idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1067360
> [60164.428982] rcu:     175-...0: (6 ticks this GP)
> idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1067361
> [60164.606387] watchdog: BUG: soft lockup - CPU#136 stuck for
> 10936s! [in:imklog:2885]
> [60172.353588] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11722s! [sshd:103658]
> [60176.801129] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10553s! [exim4:3249]
> [60184.344352] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11853s! [kworker/u512:1:103663]
> [60188.199955] watchdog: BUG: soft lockup - CPU#2 stuck for
> 11964s! [ppc2:203070]
> [60188.603913] watchdog: BUG: soft lockup - CPU#136 stuck for
> 10959s! [in:imklog:2885]
> [60196.351113] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11745s! [sshd:103658]
> [60200.798657] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10575s! [exim4:3249]
> [60208.341880] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11875s! [kworker/u512:1:103663]
> [60212.197482] watchdog: BUG: soft lockup - CPU#2 stuck for
> 11986s! [ppc2:203070]
> [60212.601441] watchdog: BUG: soft lockup - CPU#136 stuck for
> 10981s! [in:imklog:2885]
> [60220.348642] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11767s! [sshd:103658]
> [60224.796184] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10597s! [exim4:3249]
> [60227.463909] rcu: INFO: rcu_sched detected stalls on
> CPUs/tasks:
> [60227.475543] rcu:     65-...0: (1 GPs behind)
> idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1072610
> [60227.494469] rcu:     175-...0: (6 ticks this GP)
> idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1072611
> [60232.339409] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11897s! [kworker/u512:1:103663]
> [60236.195010] watchdog: BUG: soft lockup - CPU#2 stuck for
> 12009s! [ppc2:203070]
> [60236.598968] watchdog: BUG: soft lockup - CPU#136 stuck for
> 11003s! [in:imklog:2885]
> [60244.346169] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11789s! [sshd:103658]
> [60248.793715] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10620s! [exim4:3249]
> [60256.336936] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11920s! [kworker/u512:1:103663]
> [60260.192538] watchdog: BUG: soft lockup - CPU#2 stuck for
> 12031s! [ppc2:203070]
> [60260.596497] watchdog: BUG: soft lockup - CPU#136 stuck for
> 11026s! [in:imklog:2885]
> [60268.343699] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11812s! [sshd:103658]
> [60272.791241] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10642s! [exim4:3249]
> [60280.334465] watchdog: BUG: soft lockup - CPU#51 stuck for
> 11942s! [kworker/u512:1:103663]
> [60284.190067] watchdog: BUG: soft lockup - CPU#2 stuck for
> 12054s! [ppc2:203070]
> [60284.594026] watchdog: BUG: soft lockup - CPU#136 stuck for
> 11048s! [in:imklog:2885]
> [60290.525418] rcu: INFO: rcu_sched detected stalls on
> CPUs/tasks:
> [60290.537054] rcu:     65-...0: (1 GPs behind)
> idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1077860
> [60290.555978] rcu:     175-...0: (6 ticks this GP)
> idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1077861
> [60292.341228] watchdog: BUG: soft lockup - CPU#54 stuck for
> 11834s! [sshd:103658]
> [60296.788771] watchdog: BUG: soft lockup - CPU#195 stuck for
> 10664s! [exim4:3249]
> 
> Serial console stopped.
> 
> 0-> set /HOST send_break_action=break
> Set 'send_break_action' to 'break'
> 
> 0-> start /host/console
> Are you sure you want to start /HOST/console (y/n)? y
> 
> Serial console started.  To stop, type #.
> [60395.353420] sysrq: HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-
> registers(p) show-all-timers(q) unraw(r) sync(s) show-task-
> states(t) unmount(u) show-blocked-tasks(w) global-pmu(x) global-
> regs(y) dump-ftrace-buffer(z)
> 
> But I can't get it to respond to any of these. It's usually dead
> too soon for me to try to get anything from 'dmesg'. That usually
> happens within a few minutes of the first stall.
> 
> From a few days ago, 'fpmake' is the task that was implicated:
> 
> https://paste.debian.net/plainh/1162e193
> 
> Hoping to see if there are any kernel traces so I can report it
> to the relevant kernel lists.
> 
> It's also a huge waste of time trying to reset the machine since
> it (a) takes forever to boot, and (b) only boots less than 1/2
> the time, and (c) requires interaction to perform the resets.
> 
> So resetting the machine could turn into a 30-minute ordeal:
> 
> https://paste.debian.net/plainh/fb20bd17
> 
> I don't want to jump to the conclusion that it is a hardware
> issue, but having other hardware to test would be helpful, should
> anyone have spare memory for it or want to donate $ for research.
> 
> 
> ZV

