[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs
Jacob Bachmeyer
jcb62281 at gmail.com
Thu Dec 1 04:21:16 CET 2022
Zach van Rijn via cfarm-users wrote:
> On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
> wrote:
>
>> Just got this:
>> Message from syslogd at gcc102 at Nov 30 04:31:20 ...
>> kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
>> stuck for 48s! [ppc2:203070]
>>
>> Can I do anything to help figuring out the problem?
>>
>
> Not sure. Ideas are certainly welcome. Had you used this machine
> much before a few days ago? This issue had occurred maybe twice
> in the last two years.
>
Do you have logs farther back? Pierre Muller reported a message
broadcast from syslogd at 47393.509723; at that time CPU#2 had already
been stuck (nominally) for 48 seconds, so the previous minute or so of
the kernel logs might be helpful.
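For example, something along these lines should pull out the minute or
so before the first report (a minimal sketch, assuming the broken-up
log lines have been rejoined so each message keeps its bracketed uptime,
as in the excerpt below, and that the raw log is in a file called `log`;
the 47330/47400 bounds are just a guess at a useful window around
Pierre's report):
8<------
awk '{ t = $0                               # work on a copy of the line
       sub(/^[^[]*\[/, "", t)               # strip everything up to the opening bracket
       sub(/\].*/, "", t)                   # strip the closing bracket and the rest
       if (t+0 >= 47330 && t+0 <= 47400) print }' log
8<------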
After stitching the split lines back together, a quick analysis (with
`awk '$2 == "watchdog:" { $2 = $1; sub(/\[/,"",$2); sub(/\.[0-9]+\]/,"",$2);
sub(/CPU#/, "", $7); sub(/s!/, "", $10);
printf "%s %3d %6d -> %6d (%s)\n",$1,$7,$10,$2-$10,$11 }' log | sort -k 2`)
yields:
8<------
[60140.204902] 2 11919 -> 48221 ([ppc2:203070])
[60164.202428] 2 11942 -> 48222 ([ppc2:203070])
[60188.199955] 2 11964 -> 48224 ([ppc2:203070])
[60212.197482] 2 11986 -> 48226 ([ppc2:203070])
[60236.195010] 2 12009 -> 48227 ([ppc2:203070])
[60260.192538] 2 12031 -> 48229 ([ppc2:203070])
[60284.190067] 2 12054 -> 48230 ([ppc2:203070])
[60160.346825] 51 11830 -> 48330 ([kworker/u512:1:103663])
[60184.344352] 51 11853 -> 48331 ([kworker/u512:1:103663])
[60208.341880] 51 11875 -> 48333 ([kworker/u512:1:103663])
[60232.339409] 51 11897 -> 48335 ([kworker/u512:1:103663])
[60256.336936] 51 11920 -> 48336 ([kworker/u512:1:103663])
[60280.334465] 51 11942 -> 48338 ([kworker/u512:1:103663])
[60148.356060] 54 11700 -> 48448 ([sshd:103658])
[60172.353588] 54 11722 -> 48450 ([sshd:103658])
[60196.351113] 54 11745 -> 48451 ([sshd:103658])
[60220.348642] 54 11767 -> 48453 ([sshd:103658])
[60244.346169] 54 11789 -> 48455 ([sshd:103658])
[60268.343699] 54 11812 -> 48456 ([sshd:103658])
[60292.341228] 54 11834 -> 48458 ([sshd:103658])
[60140.608860] 136 10914 -> 49226 ([in:imklog:2885])
[60164.606387] 136 10936 -> 49228 ([in:imklog:2885])
[60188.603913] 136 10959 -> 49229 ([in:imklog:2885])
[60212.601441] 136 10981 -> 49231 ([in:imklog:2885])
[60236.598968] 136 11003 -> 49233 ([in:imklog:2885])
[60260.596497] 136 11026 -> 49234 ([in:imklog:2885])
[60284.594026] 136 11048 -> 49236 ([in:imklog:2885])
[60152.803603] 195 10530 -> 49622 ([exim4:3249])
[60176.801129] 195 10553 -> 49623 ([exim4:3249])
[60200.798657] 195 10575 -> 49625 ([exim4:3249])
[60224.796184] 195 10597 -> 49627 ([exim4:3249])
[60248.793715] 195 10620 -> 49628 ([exim4:3249])
[60272.791241] 195 10642 -> 49630 ([exim4:3249])
[60296.788771] 195 10664 -> 49632 ([exim4:3249])
8<------
The columns above are the timestamp from the log, the CPU number, the
number of seconds that CPU is reported stuck, a projected time at which
the processor became stuck, and the reported stuck process. This looks
like kernel bugs are involved, since the projected times slip by about
one to three seconds with each report. Interestingly, sorting this log
on column 2 (the CPU number, used above for report convenience) and on
column 5 (the projected lockup time) produces the same order. I have no
idea whether this is a coincidence, or whether the problem starts on
lower-numbered CPUs and spreads "upward" through the machine.
Further extrapolation from this analysis could probably locate the
actual lockup time for each processor, and from Pierre's report we know
that CPU#2 first locked up close to 47345. The kernel log from around
that time might be very interesting, especially if there is an "oops" in
there.
Speculation (based on a bug I encountered long ago in the ext3 driver):
ppc2 crashed its processor while holding some kernel resource (that this
is possible is itself a kernel bug); some time later a kworker
deadlocked waiting for that resource to be released; later still, the
other affected processes deadlocked either on that same resource or on
something held by a previously deadlocked process, while themselves
holding kernel resources. As more kernel resources are caught in
deadlocks, the odds that the entire system hangs steadily increase. The
ext3 bug I mentioned would cause the process that triggered it to take a
kernel oops, but locks relevant to the filesystem mount were not
released, so all further accesses to the affected filesystem would hang.
Since an orderly shutdown or reboot attempts to sync filesystems, it too
would hang. The only way to recover control of the system was SysRq; on
that machine at that time, that meant using "echo b
>/proc/sysrq-trigger" to force a reboot. Note that the SysRq emergency
sync would /also/ hang with that bug, and the affected machine was run
as a network server with no local console attached.
If PID 103658 was the master sshd process and not a user's (dead)
connection, CPU#54 locking up killed further SSH access to the machine.
It is possible that the "failed" reboots you mentioned are simply sshd
getting caught on a locked-up CPU before you are able to log in.
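Next time this happens (while the machine is still up), something like
the following would show whether the stuck sshd is the listener or a
per-connection child; 103658 here is just the PID from the log above and
is only meaningful on that boot, and `ss -tlnp` generally needs root to
show which process owns a socket:
8<------
ps -o pid,ppid,user,args -p 103658     # PPID 1 (or the service manager) suggests the master listener
ss -tlnp | grep -w 103658              # does that PID hold the listening socket (port 22)?
8<------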
-- Jacob