[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs
Jacob Bachmeyer
jcb62281 at gmail.com
Fri Dec 2 06:22:03 CET 2022
Zach van Rijn wrote:
> On Wed, 2022-11-30 at 21:21 -0600, Jacob Bachmeyer wrote:
>
>> ...
>> Do you have logs farther back?
>>
>
>
> Yes; I've attached some going back about ten days. Thank you for
> the analysis, by the way. It is an interesting theory. I would
> tend to agree with Bruno that memory should be checked before
> drawing any conclusions from the logs.
>
I had tried to find a possible cause other than bad hardware, since you
had stated that you did not want to jump to that conclusion.
While I am not intimately familiar with Linux-on-SPARC, the logs you
attached appear to contain a large number of oopses, which, combined with...
> When I mentioned failed reboots, I meant that when the system
> first comes up, the kernel panics like so:
>
> https://paste.debian.net/plainh/fb20bd17
>
> This happens between 20-80% of the time and requires resetting
> over and over again until it boots into userspace.
>
...those panics during early boot, strongly point to bad RAM, as Bruno
Haible suggested. If the machine actually has OpenFirmware, you could
(or so I understand) write a small RAM tester in OF Forth, feed it in at
the boot monitor console, and pin the problem down to the bad module.
However, I do not know the details of programming that environment,
whether later SPARC machines still have those capabilities, or whether
OpenFirmware can reach all memory on larger SPARC systems. (If I am
reading the log and guessing correctly, though, your problem seems to be
in relatively low memory, in an area the kernel allocates for its data
early on, before access to higher memory is set up, if that is even an
issue.)
> Once it's up it's fine, except for this recent soft lock error.
>
Most of the oopses I noticed in a quick look at those logs seem to be
associated with the process table (which is not an actual contiguous
table in Linux), suggesting that there is bad RAM in an area where the
kernel allocates its data structures. (Linux's task structures are
quite large and thus have a fairly good chance of spanning a faulty RAM
cell compared to smaller structures.) The panic you mentioned is the
kernel detecting stack corruption, so if you can identify the physical
addresses and corresponding module(s) used for that kernel stack, you
should be able to pull it/them.
Can the machine operate with reduced RAM, or does it need every module
currently installed to start? If it needs all the modules, you might
still be able to shuffle them and move the fault away from the kernel's
data area and into user areas. You could then either write a small
program to run at early boot that allocates memory until it gets the bad
pages and holds onto them while releasing the rest, or use the Linux
"BadRAM" feature/patch, if it is available on SPARC, to tell the kernel
not to give those pages to userspace. Either should at least hold long
enough for you to be able to get more RAM modules. :-/
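To make the "allocate until you get the bad pages, then hold them" idea
concrete, here is a minimal sketch in Python. Everything in it is my own
illustration rather than an existing tool: the names grab_bad_chunks and
chunk_is_good, the 1 MiB chunk size, and the two test patterns are all
assumptions. It also glosses over real problems: the pages are not
locked, so the kernel could swap them out and reuse the faulty frames,
and reads may be satisfied from CPU cache rather than RAM. A real
version would probably be written in C, lock its memory, and stay
running for as long as the machine is up.

```python
import mmap

CHUNK = 1 << 20          # 1 MiB per chunk; an arbitrary choice for this sketch
PATTERNS = (0x55, 0xAA)  # alternating-bit patterns, also an arbitrary choice


def chunk_is_good(buf, size):
    """Write each pattern over the whole chunk and read it back."""
    for pattern in PATTERNS:
        expected = bytes([pattern]) * size
        buf.seek(0)
        buf.write(expected)
        buf.seek(0)
        if buf.read(size) != expected:
            return False
    return True


def grab_bad_chunks(total_bytes, chunk=CHUNK, check=chunk_is_good):
    """Allocate total_bytes in chunks, keep the bad ones, release the rest.

    Good chunks are released only after everything has been allocated,
    so that a freed chunk is not handed straight back and tested twice.
    """
    held, good = [], []
    for _ in range(total_bytes // chunk):
        buf = mmap.mmap(-1, chunk)  # anonymous mapping, not backed by a file
        (held if not check(buf, chunk) else good).append(buf)
    for buf in good:
        buf.close()
    return held
```

The caller would keep the returned list alive (and the process running)
so that the suspect chunks are never returned to the kernel.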
Speculation: Whatever contains the first 1.5GB or so of RAM likely has
the fault, since the kernel appears to be claiming about that much for
itself during early boot ("Memory: 1547008K/133671000K available (8913K
kernel code, 1456K rwdata, 2464K rodata, 672K init, 530K bss, 1355504K
reserved, 0K cma-reserved)") and I am guessing that that gets allocated
from the low end of the physical address space. Swapping it with the
top bank is the move most likely to give the machine better chances to
boot (guessing that memory is first used working from low-to-high or
high-to-low), but it will cause user programs to see unstable memory
once the kernel starts handing out the faulty pages to userspace.
Swapping the first and last banks also hedges the guess that Linux takes
its RAM from the low end of the address space against the possibility of
Linux starting at the top. If the machine is stabilized by this, you
would then be able to use the test program Bruno Haible suggested to
find the exact fault. As things stand right now, if the fault is in an
area that the kernel reserves for itself (as appears to be the case), I
do not think a userspace program would be able to find it.
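For anyone who wants to check the arithmetic on that boot line, a small
sketch follows. The helper names (parse_memory_line, kib_to_gib) are
mine, the two figures either side of the slash are given neutral key
names rather than an interpretation, and the regular expressions are
only known to fit the one line quoted above.

```python
import re

# The boot line quoted above, with the mail's line wrapping removed.
LINE = ("Memory: 1547008K/133671000K available (8913K kernel code, "
        "1456K rwdata, 2464K rodata, 672K init, 530K bss, "
        "1355504K reserved, 0K cma-reserved)")


def parse_memory_line(line):
    """Return every size on a 'Memory:' boot line, in KiB, keyed by name."""
    m = re.search(r"Memory:\s*(\d+)K/(\d+)K available \((.*)\)", line)
    if m is None:
        raise ValueError("not a 'Memory:' boot line")
    fields = {"first_kib": int(m.group(1)), "second_kib": int(m.group(2))}
    # Inside the parentheses each entry looks like "<number>K <name>".
    for size, name in re.findall(r"(\d+)K ([^,]+)", m.group(3)):
        fields[name.strip()] = int(size)
    return fields


def kib_to_gib(kib):
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    for name, kib in parse_memory_line(LINE).items():
        print(f"{name:>14}: {kib:>10} KiB = {kib_to_gib(kib):8.3f} GiB")
```

The first figure works out to a little under 1.5 GiB and the "reserved"
figure to roughly 1.3 GiB, which is where the "1.5GB or so" above comes
from.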
-- Jacob