[cfarm-users] general instability of cfarm433..440 (ppc)

Thu Apr 23 09:43:42 CEST 2026

Hi all,

For the past two weeks or so, cfarm433..440 have not been very stable and 
have been crashing. cfarm439 and cfarm440, the two FreeBSD VMs, are 
suffering the most: at this point they are even more unstable than the 
Hurds (cfarm431,432).

There are a few different reasons:

イ. cfarm439: the qemu process crashes on the host and the VM has to be 
manually restarted (timezone: UTC+9):

Apr 02 06:51:56 talos2 kernel: kvm[3726753]: segfault (11) at 3110 nip 
10864a264 lr 108cb0b0c code 1 in 
qemu-system-ppc64[54a264,108100000+11e0000]
Apr 02 06:51:56 talos2 kernel: kvm[3726753]: code: 3842d4d0 fbe1fff8 
fbc1fff0 2c040000 7c7f1b78 f821ffc1 e9230030 ebc90008
Apr 02 06:51:56 talos2 kernel: kvm[3726753]: code: 4180001c 81230038 
2c090000 41800010 <a13e3110> 2c090000 40820024 7fe3fb78
...
Apr 17 14:53:29 talos2 kernel: kvm[42910]: segfault (11) at 53960 nip 
11f8426d4 lr 11f849e10 code 1 in 
qemu-system-ppc64[5426d4,11f300000+11e0000]
Apr 17 14:53:29 talos2 kernel: kvm[42910]: code: 7d4a4a79 39200000 
e93f0000 40820134 e9490000 a129000a 382100a0 7fe4fb78
Apr 17 14:53:29 talos2 kernel: kvm[42910]: code: ebc1fff0 ebe1fff8 
e94a5a28 79291f24 <7c6a482a> 4bfffbe0 60420000 7c0802a6
...
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: segfault (11) at 
7ffb6810e961 nip 105b69d08 lr 1061d0b0c code 2 in 
qemu-system-ppc64[549d08,105620000+11e0000]
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: code: 60000000 60420000 
39090001 80fb0000 e95f0040 79292708 911f004c 2c070000
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: code: 7d2a4a14 83c90004 
eba90008 4082012c <813a1f6c> 7c1e4840 418100c0 7bde0020

I don't have a guess as to what is going on here. Judging from the 
timestamps, it doesn't look like a cron job gone wrong.
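
To compare the crash sites across reports, the addresses in the kernel lines can be reduced to offsets inside the qemu-system-ppc64 mapping. The following is only a minimal sketch: the regular expression is written against the three reports quoted above (with the mail's line wrapping undone), and the function and field names are made up for illustration.

```python
import re

# Pattern for the segfault lines quoted above, after wrapped lines are rejoined.
# Assumption: other kernels or architectures may format this line differently.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault \((?P<sig>\d+)\) "
    r"at (?P<addr>[0-9a-f]+) nip (?P<nip>[0-9a-f]+) lr (?P<lr>[0-9a-f]+) "
    r"code (?P<code>\d+) in "
    r"(?P<obj>[^\[\s]+)\[(?P<off>[0-9a-f]+),(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(text):
    """Parse one segfault report; return a dict of fields, or None if no match."""
    # Collapse all whitespace so that reports wrapped by the mailer still match.
    m = SEGFAULT_RE.search(" ".join(text.split()))
    if m is None:
        return None
    nip = int(m["nip"], 16)
    lr = int(m["lr"], 16)
    base = int(m["base"], 16)
    return {
        "pid": int(m["pid"]),
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offsets relative to the start of the mapping named in the report.
        "nip_offset": nip - base,
        "lr_offset": lr - base,
        # The offset the kernel printed itself, kept for cross-checking.
        "reported_offset": int(m["off"], 16),
    }


FIRST_REPORT = """Apr 02 06:51:56 talos2 kernel: kvm[3726753]: segfault (11) at 3110 nip
10864a264 lr 108cb0b0c code 1 in
qemu-system-ppc64[54a264,108100000+11e0000]"""

if __name__ == "__main__":
    info = parse_segfault(FIRST_REPORT)
    print(hex(info["nip_offset"]), hex(info["lr_offset"]))
```

For the three reports above this gives nip offsets 0x54a264, 0x5426d4 and 0x549d08, matching the values the kernel printed. The first and third reports also share the same lr offset (0xbb0b0c), which may or may not be significant. Given a qemu build with debug symbols, and assuming the mapping starts at file offset 0, an offset could then be passed to a tool such as addr2line.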

ロ. cfarm439,440: multiple FreeBSD bugs, e.g. a recent regression where the 
FreeBSD kernel fails to initialize virtio-scsi on powerpc64, plus other 
unfixed bugs.

ハ. cfarm433..440: /home gets remounted read-only. Unfortunately this 
appears to be a known Linux kernel bug (?). On cfarm433..440, /home is 
served using NVMe-over-TCP. Under intensive disk I/O, the NVMe-oF target 
might fail to allocate memory, and the NVMe controller then fails [1]. In 
this case, async writes are lost and the VM needs to be cold rebooted; 
/home cannot simply be remounted rw because of the errors. I have tuned 
vm.min_free_kbytes and zfs_arc_max on the NVMe-oF target so that it 
happens less often. If even that doesn't work, we can always go back to 
the old, reliable iSCSI.
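
From inside a Linux VM, whether /home has flipped to read-only can be checked from the mount table. The following is a rough sketch only: it assumes a Linux guest where /proc/mounts is available, and the function names are hypothetical.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    Returns None when the mount point does not appear at all.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: later entries shadow earlier ones.
            found = fields[3].split(",")
    return found


def is_read_only(mounts_text, mount_point="/home"):
    """True if mount_point is mounted and carries the 'ro' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "ro" in opts


def home_is_read_only(mounts_path="/proc/mounts"):
    """Convenience wrapper that reads the live mount table (Linux only)."""
    with open(mounts_path) as f:
        return is_read_only(f.read())
```

Calling home_is_read_only() on an affected machine should return True once the remount has happened. A check like this only detects the state; as noted above, recovery still needs a cold reboot.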

ス. This one is more or less my fault: the Talos 2 machine overheats, and 
the CPUs throttle so much that they look frozen. All VMs become 
unresponsive, of course. I guess it's really the time of year when I 
should leave the air conditioner on 24 hours a day instead of just 
during the day.

[1] 
https://wiki.archlinux.org/title/NVMe_over_Fabrics#Page_allocation_failure

Cheers,
-- 
Luke Yasuda
GPG Fingerprint: 4E09 8D19 00AA 3F72 1899 2614 09B3 316E 13A1 1EFC