[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs

Zach van Rijn me at zv.io
Wed Nov 30 15:25:49 CET 2022


On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
wrote:
> Just got this:
> Message from syslogd at gcc102 at Nov 30 04:31:20 ...
>   kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
> stuck for 48s! [ppc2:203070]
> 
> Can I do anything to help figuring out the problem?

Not sure. Ideas are certainly welcome. Had you used this machine
much before the last few days? This issue has occurred maybe
twice in the last two years.

Here is what is being printed, over and over again:

[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for
11919s! [ppc2:203070]
[60140.608860] watchdog: BUG: soft lockup - CPU#136 stuck for
10914s! [in:imklog:2885]
[60148.356060] watchdog: BUG: soft lockup - CPU#54 stuck for
11700s! [sshd:103658]
[60152.803603] watchdog: BUG: soft lockup - CPU#195 stuck for
10530s! [exim4:3249]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for
11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for
11942s! [ppc2:203070]
[60164.398409] rcu: INFO: rcu_sched detected stalls on
CPUs/tasks:
[60164.410066] rcu:     65-...0: (1 GPs behind)
idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1067360
[60164.428982] rcu:     175-...0: (6 ticks this GP)
idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1067361
[60164.606387] watchdog: BUG: soft lockup - CPU#136 stuck for
10936s! [in:imklog:2885]
[60172.353588] watchdog: BUG: soft lockup - CPU#54 stuck for
11722s! [sshd:103658]
[60176.801129] watchdog: BUG: soft lockup - CPU#195 stuck for
10553s! [exim4:3249]
[60184.344352] watchdog: BUG: soft lockup - CPU#51 stuck for
11853s! [kworker/u512:1:103663]
[60188.199955] watchdog: BUG: soft lockup - CPU#2 stuck for
11964s! [ppc2:203070]
[60188.603913] watchdog: BUG: soft lockup - CPU#136 stuck for
10959s! [in:imklog:2885]
[60196.351113] watchdog: BUG: soft lockup - CPU#54 stuck for
11745s! [sshd:103658]
[60200.798657] watchdog: BUG: soft lockup - CPU#195 stuck for
10575s! [exim4:3249]
[60208.341880] watchdog: BUG: soft lockup - CPU#51 stuck for
11875s! [kworker/u512:1:103663]
[60212.197482] watchdog: BUG: soft lockup - CPU#2 stuck for
11986s! [ppc2:203070]
[60212.601441] watchdog: BUG: soft lockup - CPU#136 stuck for
10981s! [in:imklog:2885]
[60220.348642] watchdog: BUG: soft lockup - CPU#54 stuck for
11767s! [sshd:103658]
[60224.796184] watchdog: BUG: soft lockup - CPU#195 stuck for
10597s! [exim4:3249]
[60227.463909] rcu: INFO: rcu_sched detected stalls on
CPUs/tasks:
[60227.475543] rcu:     65-...0: (1 GPs behind)
idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1072610
[60227.494469] rcu:     175-...0: (6 ticks this GP)
idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1072611
[60232.339409] watchdog: BUG: soft lockup - CPU#51 stuck for
11897s! [kworker/u512:1:103663]
[60236.195010] watchdog: BUG: soft lockup - CPU#2 stuck for
12009s! [ppc2:203070]
[60236.598968] watchdog: BUG: soft lockup - CPU#136 stuck for
11003s! [in:imklog:2885]
[60244.346169] watchdog: BUG: soft lockup - CPU#54 stuck for
11789s! [sshd:103658]
[60248.793715] watchdog: BUG: soft lockup - CPU#195 stuck for
10620s! [exim4:3249]
[60256.336936] watchdog: BUG: soft lockup - CPU#51 stuck for
11920s! [kworker/u512:1:103663]
[60260.192538] watchdog: BUG: soft lockup - CPU#2 stuck for
12031s! [ppc2:203070]
[60260.596497] watchdog: BUG: soft lockup - CPU#136 stuck for
11026s! [in:imklog:2885]
[60268.343699] watchdog: BUG: soft lockup - CPU#54 stuck for
11812s! [sshd:103658]
[60272.791241] watchdog: BUG: soft lockup - CPU#195 stuck for
10642s! [exim4:3249]
[60280.334465] watchdog: BUG: soft lockup - CPU#51 stuck for
11942s! [kworker/u512:1:103663]
[60284.190067] watchdog: BUG: soft lockup - CPU#2 stuck for
12054s! [ppc2:203070]
[60284.594026] watchdog: BUG: soft lockup - CPU#136 stuck for
11048s! [in:imklog:2885]
[60290.525418] rcu: INFO: rcu_sched detected stalls on
CPUs/tasks:
[60290.537054] rcu:     65-...0: (1 GPs behind)
idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1077860
[60290.555978] rcu:     175-...0: (6 ticks this GP)
idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1077861
[60292.341228] watchdog: BUG: soft lockup - CPU#54 stuck for
11834s! [sshd:103658]
[60296.788771] watchdog: BUG: soft lockup - CPU#195 stuck for
10664s! [exim4:3249]
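
For anyone who wants to triage a capture like the one above, here is a rough sketch (not something run on gcc102) that rejoins the mail-wrapped lines and tallies the soft-lockup reports per CPU. The names `rejoin_wrapped`, `summarize_lockups` and `SAMPLE` are made up for illustration; the sample text is copied from the log above, and the regular expression assumes the message format shown there.

```python
import re

# A few entries copied from the log above, still wrapped as in the mail.
SAMPLE = """\
[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for
11919s! [ppc2:203070]
[60140.608860] watchdog: BUG: soft lockup - CPU#136 stuck for
10914s! [in:imklog:2885]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for
11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for
11942s! [ppc2:203070]
[60164.398409] rcu: INFO: rcu_sched detected stalls on
CPUs/tasks:
"""

TIMESTAMP_RE = re.compile(r"\[\s*\d+\.\d+\]")

# Matches one soft-lockup report once wrapped lines have been rejoined.
# The task name may itself contain ':' (e.g. "kworker/u512:1"), so the
# PID is taken as the digits after the last colon.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def rejoin_wrapped(text):
    """Treat any line without a leading [timestamp] as a continuation."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize_lockups(text):
    """Return {cpu: {"task", "pid", "reports", "max_stuck_s"}}."""
    summary = {}
    for entry in rejoin_wrapped(text):
        m = LOCKUP_RE.search(entry)
        if m is None:
            continue  # rcu stall lines and anything else are skipped
        rec = summary.setdefault(
            int(m["cpu"]),
            {"task": m["task"], "pid": int(m["pid"]),
             "reports": 0, "max_stuck_s": 0},
        )
        rec["reports"] += 1
        rec["max_stuck_s"] = max(rec["max_stuck_s"], int(m["secs"]))
    return summary
```

Calling `summarize_lockups(SAMPLE)` gives one record per stuck CPU, which makes it easier to see that the same handful of CPUs and tasks repeat throughout the capture.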

Serial console stopped.

0-> set /HOST send_break_action=break
Set 'send_break_action' to 'break'

0-> start /host/console
Are you sure you want to start /HOST/console (y/n)? y

Serial console started.  To stop, type #.
[60395.353420] sysrq: HELP : loglevel(0-9) reboot(b) crash(c)
terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i)
thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o)
show-registers(p) show-all-timers(q) unraw(r) sync(s)
show-task-states(t) unmount(u) show-blocked-tasks(w) global-pmu(x)
global-regs(y) dump-ftrace-buffer(z)

But I can't get it to respond to any of these. The machine is
usually dead too soon for me to get anything from 'dmesg'; it
typically stops responding within a few minutes of the first
stall.

In the output from a few days ago, 'fpmake' was the task implicated:

https://paste.debian.net/plainh/1162e193

Hoping to see if there are any kernel traces so I can report it
to the relevant kernel lists.
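
One possible way to get more out of the next stall, offered only as an untested sketch: turn on the kernel's all-CPU backtrace for soft lockups ahead of time, so the traces go to the console before the machine dies. `kernel.sysrq` and `kernel.softlockup_all_cpu_backtrace` are standard Linux sysctls, but whether the second one is exposed depends on the kernel configuration, and nothing here has been tried on gcc102. The names `TRACE_SETTINGS` and `apply_settings` are hypothetical.

```python
from pathlib import Path

# Standard Linux sysctls, written as paths relative to /proc/sys.
# Assumption: the running kernel exposes both; missing ones are
# reported as not applied instead of raising.
TRACE_SETTINGS = {
    "kernel/sysrq": "1",                         # enable all SysRq functions
    "kernel/softlockup_all_cpu_backtrace": "1",  # dump every CPU on a lockup
}


def apply_settings(settings, root="/proc/sys"):
    """Write each value under `root`; return {name: True/False}."""
    results = {}
    for name, value in settings.items():
        path = Path(root) / name
        if not path.exists():
            results[name] = False  # sysctl not available on this kernel
            continue
        try:
            path.write_text(value + "\n")
            results[name] = True
        except OSError:
            results[name] = False  # e.g. not running as root
    return results
```

Run as root shortly after boot, `apply_settings(TRACE_SETTINGS)` returns which settings took effect. The `root` parameter exists only so the function can be exercised against a scratch directory instead of the live `/proc/sys`.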

It's also a huge waste of time trying to reset the machine, since
it (a) takes forever to boot, (b) boots successfully less than
half the time, and (c) requires interaction to perform the resets.

So resetting the machine could turn into a 30-minute ordeal:

https://paste.debian.net/plainh/fb20bd17

I don't want to jump to the conclusion that it is a hardware
issue, but having other hardware to test would be helpful, should
anyone have spare memory for it or want to donate $ for research.


ZV


