[cfarm-users] gcc102 (sparc64) down again for maintenance/repairs

Fri Dec 2 06:22:03 CET 2022

Zach van Rijn wrote:
> On Wed, 2022-11-30 at 21:21 -0600, Jacob Bachmeyer wrote:
>   
>> ...
>> Do you have logs farther back?
>>     
>
>
> Yes; I've attached some going back about ten days. Thank you for
> the analysis, by the way. It is an interesting theory. I would
> tend to agree with Bruno that memory should be checked before
> drawing any conclusions from the logs.
>   

I had tried to find a possible cause other than bad hardware, since you 
had stated that you did not want to jump to that conclusion.

While I am not intimately familiar with Linux-on-SPARC, the logs you 
attached appear to contain a large number of oopses, which combined with...

> When I mentioned failed reboots, I meant that when the system
> first comes up, the kernel panics like so:
>
>     https://paste.debian.net/plainh/fb20bd17
>
> This happens between 20-80% of the time and requires resetting
> over and over again until it boots into userspace.
>   

...those panics during early boot, strongly suggest bad RAM, as Bruno
Haible proposed.  If the machine actually has OpenFirmware, you could
(or so I understand) write a small RAM tester in OF Forth, feed it in at
the boot monitor console, and pin the problem down to the bad module.
However, I do not know the details of programming that environment,
whether later SPARC machines still have those capabilities, or whether
OpenFirmware can reach all memory on larger SPARC systems.  (If I am
reading the log and guessing correctly, though, your problem seems to be
in relatively low memory, in an area the kernel allocates for its data
early on, before access to higher memory is set up, if that is even an
issue.)

> Once it's up it's fine, except for this recent soft lock error.
>   

Most of the oopses I noticed in a quick look at those logs seem to be 
associated with the process table (which is not an actual contiguous
table in Linux), suggesting that there is bad RAM in an area where the
kernel allocates its data structures.  (Linux's task structures are 
quite large and thus have a fairly good chance to span a faulty RAM cell 
compared to smaller structures.)  The panic you mentioned is the kernel 
detecting stack corruption, so if you can identify the physical 
addresses and corresponding module(s) used for that kernel stack, you 
should be able to pull it/them.

Can the machine operate with reduced RAM, or does it need every module
currently installed to start?  If it needs all the modules, you might
still be able to shuffle them and move the fault away from the kernel's
data area and into user areas.  Then you could either write a small
program to run at early boot that allocates memory until it gets the bad
pages and holds onto them while releasing the rest, or use the Linux
"badRAM" feature/patch, if it is available on SPARC, to tell the kernel
not to give those pages to userspace.  Either should at least hold long
enough for you to be able to get more RAM modules.  :-/
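
As a rough illustration of the allocate, test, and hold idea, here is a
minimal sketch in Python.  The names (chunk_is_good,
quarantine_bad_chunks, PATTERNS) and the chunk size are made up for
illustration, and the pattern test is far simpler than a real memory
tester would use.

```python
import mmap

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # simple fill patterns; real testers use more


def chunk_is_good(buf, patterns=PATTERNS):
    """Fill buf with each pattern byte and read it back; False on any mismatch."""
    size = len(buf)
    for pattern in patterns:
        expected = bytes([pattern]) * size
        buf[:] = expected
        if bytes(buf[:]) != expected:
            return False
    return True


def quarantine_bad_chunks(n_chunks, chunk_size=64 * mmap.PAGESIZE, allocate=None):
    """Allocate n_chunks buffers, test each, release the good ones, and
    return the failing ones so the caller can keep holding them."""
    if allocate is None:
        # Assumption: an anonymous mapping stands in for "a chunk of RAM".
        allocate = lambda size: mmap.mmap(-1, size)
    # Allocate everything before testing, so that a released chunk cannot
    # be handed straight back to us while a bad one goes unvisited.
    chunks = [allocate(chunk_size) for _ in range(n_chunks)]
    bad = []
    for chunk in chunks:
        if chunk_is_good(chunk):
            chunk.close()      # release good memory back to the system
        else:
            bad.append(chunk)  # keep the mapping alive to hold the bad pages
    return bad
```

The caller would keep the returned list alive for as long as the machine
is up.  Note the limits of this sketch: it sees only virtual addresses,
Python's mmap module offers no way to lock pages in place (a real tool
would presumably need mlock(2), most easily from C), so the kernel is
free to swap a held page out and reuse the faulty frame, and a read-back
may be satisfied from CPU cache rather than from the RAM itself.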

Speculation:  Whatever contains the first 1.5GB or so of RAM likely has 
the fault, since the kernel appears to be claiming about that much for 
itself during early boot ("Memory: 1547008K/133671000K available (8913K 
kernel code, 1456K rwdata, 2464K rodata, 672K init, 530K bss, 1355504K 
reserved, 0K cma-reserved)") and I am guessing that that gets allocated 
from the low end of the physical address space.  Swapping it with the
top bank (guessing that memory is first used working either low-to-high
or high-to-low) is most likely to give the machine a better chance to
boot, but will cause user programs to see unstable memory once the
kernel starts handing out the faulty pages to userspace.  Swapping the first 
and last banks also hedges the guess that Linux takes its RAM from the 
low end of the address space against the possibility of Linux starting 
at the top.  If the machine is stabilized by this, you would then be 
able to use the test program Bruno Haible suggested to find the exact 
fault.  As things stand right now, if the fault is in an area that the
kernel reserves for itself (as appears to be so), I do not think a
userspace program would be able to find it.
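
The "about 1.5GB" figure can be checked by pulling the numbers out of
the boot message quoted above.  This is a small sketch, again in Python;
parse_memory_line and kib_to_gib are hypothetical helper names, and the
code only extracts the numbers without trying to interpret what the
kernel means by each field.

```python
import re

# The boot message quoted above, rejoined onto one line.
BOOT_LINE = ("Memory: 1547008K/133671000K available (8913K kernel code, "
             "1456K rwdata, 2464K rodata, 672K init, 530K bss, "
             "1355504K reserved, 0K cma-reserved)")


def parse_memory_line(line):
    """Return the KiB figures from a 'Memory:' boot line as a dict."""
    m = re.search(r"Memory:\s*(\d+)K/(\d+)K available \((.*)\)", line)
    if m is None:
        raise ValueError("not a 'Memory:' boot line")
    fields = {"before_slash": int(m.group(1)), "after_slash": int(m.group(2))}
    # Each item in the parentheses looks like "8913K kernel code".
    for item in m.group(3).split(", "):
        size, name = item.split("K ", 1)
        fields[name] = int(size)
    return fields


def kib_to_gib(kib):
    """Convert a KiB count to GiB."""
    return kib / (1024 * 1024)
```

With these, kib_to_gib(parse_memory_line(BOOT_LINE)["before_slash"])
comes to a little under 1.5 GiB, and the figure after the slash to
roughly 127.5 GiB.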

-- Jacob