[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs
Jacob Bachmeyer
jcb62281 at gmail.com
Thu Dec 1 04:21:16 CET 2022
Zach van Rijn via cfarm-users wrote:
> On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
> wrote:
>
>> Just got this:
>> Message from syslogd at gcc102 at Nov 30 04:31:20 ...
>> kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
>> stuck for 48s! [ppc2:203070]
>>
>> Can I do anything to help figuring out the problem?
>>
>
> Not sure. Ideas are certainly welcome. Had you used this machine
> much before a few days ago? This issue had occurred maybe twice
> in the last two years.
>
Do you have logs farther back? Pierre Muller reported a message
broadcast from syslogd at 47393.509723; at that time CPU#2 had already
been stuck (nominally) for 48 seconds, so the previous minute or so of
the kernel logs might be helpful.
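For example, something along these lines should pull out the minute or
so before the first report (a minimal sketch, assuming the broken-up
log lines have been rejoined so each message keeps its bracketed uptime,
as in the excerpt below, and that the raw log is in a file called `log`;
the 47330/47400 bounds are just a guess at a useful window around
Pierre's report):
8<------
awk '{ t = $0                               # work on a copy of the line
       sub(/^[^[]*\[/, "", t)               # strip everything up to the opening bracket
       sub(/\].*/, "", t)                   # strip the closing bracket and the rest
       if (t+0 >= 47330 && t+0 <= 47400) print }' log
8<------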
After stitching the split lines back together, a quick analysis (with
`awk '$2 == "watchdog:" { $2 = $1; sub(/\[/,"",$2); sub(/\.[0-9]+\]/,"",$2);
sub(/CPU#/, "", $7); sub(/s!/, "", $10);
printf "%s %3d %6d -> %6d (%s)\n",$1,$7,$10,$2-$10,$11 }' log | sort -k 2`)
yields:
8<------
[60140.204902] 2 11919 -> 48221 ([ppc2:203070])
[60164.202428] 2 11942 -> 48222 ([ppc2:203070])
[60188.199955] 2 11964 -> 48224 ([ppc2:203070])
[60212.197482] 2 11986 -> 48226 ([ppc2:203070])
[60236.195010] 2 12009 -> 48227 ([ppc2:203070])
[60260.192538] 2 12031 -> 48229 ([ppc2:203070])
[60284.190067] 2 12054 -> 48230 ([ppc2:203070])
[60160.346825] 51 11830 -> 48330 ([kworker/u512:1:103663])
[60184.344352] 51 11853 -> 48331 ([kworker/u512:1:103663])
[60208.341880] 51 11875 -> 48333 ([kworker/u512:1:103663])
[60232.339409] 51 11897 -> 48335 ([kworker/u512:1:103663])
[60256.336936] 51 11920 -> 48336 ([kworker/u512:1:103663])
[60280.334465] 51 11942 -> 48338 ([kworker/u512:1:103663])
[60148.356060] 54 11700 -> 48448 ([sshd:103658])
[60172.353588] 54 11722 -> 48450 ([sshd:103658])
[60196.351113] 54 11745 -> 48451 ([sshd:103658])
[60220.348642] 54 11767 -> 48453 ([sshd:103658])
[60244.346169] 54 11789 -> 48455 ([sshd:103658])
[60268.343699] 54 11812 -> 48456 ([sshd:103658])
[60292.341228] 54 11834 -> 48458 ([sshd:103658])
[60140.608860] 136 10914 -> 49226 ([in:imklog:2885])
[60164.606387] 136 10936 -> 49228 ([in:imklog:2885])
[60188.603913] 136 10959 -> 49229 ([in:imklog:2885])
[60212.601441] 136 10981 -> 49231 ([in:imklog:2885])
[60236.598968] 136 11003 -> 49233 ([in:imklog:2885])
[60260.596497] 136 11026 -> 49234 ([in:imklog:2885])
[60284.594026] 136 11048 -> 49236 ([in:imklog:2885])
[60152.803603] 195 10530 -> 49622 ([exim4:3249])
[60176.801129] 195 10553 -> 49623 ([exim4:3249])
[60200.798657] 195 10575 -> 49625 ([exim4:3249])
[60224.796184] 195 10597 -> 49627 ([exim4:3249])
[60248.793715] 195 10620 -> 49628 ([exim4:3249])
[60272.791241] 195 10642 -> 49630 ([exim4:3249])
[60296.788771] 195 10664 -> 49632 ([exim4:3249])
8<------
The columns above are the timestamp from the log, the CPU number, the
number of seconds that CPU is reported stuck, a projected time at which
the processor became stuck, and the reported stuck process. This looks
like kernel bugs are involved, since the projected times slip by about
one to three seconds with each report. Interestingly, sorting this log
on column 2 (the CPU number, used above for report convenience) and on
column 5 (the projected lockup time) produces the same order. I have no
idea whether this is a coincidence, or whether the problem starts on
lower-numbered CPUs and spreads "upward" through the machine.
Further extrapolation from this analysis could probably locate the
actual lockup time for each processor, and from Pierre's report we know
that CPU#2 first locked up close to 47345. The kernel log from around
that time might be very interesting, especially if there is an "oops" in
there.
Speculation (based on a bug I encountered long ago in the ext3 driver):
ppc2 crashed its processor while holding some kernel resource (that this
is possible is itself a kernel bug); some time later a kworker
deadlocked waiting for that resource to be released; later still, the
other affected processes deadlocked either on that same resource or on
something held by a previously deadlocked process, while themselves
holding kernel resources. As more kernel resources are caught in
deadlocks, the odds that the entire system hangs steadily increase. The
ext3 bug I mentioned would cause the process that triggered it to take a
kernel oops, but locks relevant to the filesystem mount were not
released, so all further accesses to the affected filesystem would hang.
Since an orderly shutdown or reboot attempts to sync filesystems, it too
would hang. The only way to recover control of the system was SysRq; on
that machine at that time, that meant using "echo b
>/proc/sysrq-trigger" to force a reboot. Note that the SysRq emergency
sync would /also/ hang with that bug, and the affected machine was run
as a network server with no local console attached.
If PID 103658 was the master sshd process and not a user's (dead)
connection, CPU#54 locking up killed further SSH access to the machine.
It is possible that the "failed" reboots you mentioned are simply sshd
getting caught on a locked-up CPU before you are able to log in.
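Next time this happens (while the machine is still up), something like
the following would show whether the stuck sshd is the listener or a
per-connection child; 103658 here is just the PID from the log above and
is only meaningful on that boot, and `ss -tlnp` generally needs root to
show which process owns a socket:
8<------
ps -o pid,ppid,user,args -p 103658     # PPID 1 (or the service manager) suggests the master listener
ss -tlnp | grep -w 103658              # does that PID hold the listening socket (port 22)?
8<------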
-- Jacob