[cfarm-users] general instability of cfarm433..440 (ppc)
Luke Yasuda
jing at jing.rocks
Thu Apr 23 09:43:42 CEST 2026
Hi all,
For the past two weeks or so, cfarm433..440 have not been very stable
and have been crashing. cfarm439 and cfarm440 are the worst: the two
FreeBSD VMs are suffering, and at this point they are even more unstable
than the Hurds (cfarm431,432).
There are a few different reasons:
イ. cfarm439: the qemu process crashes on the host and the VM has to be
manually restarted (timezone: UTC+9):
Apr 02 06:51:56 talos2 kernel: kvm[3726753]: segfault (11) at 3110 nip
10864a264 lr 108cb0b0c code 1 in
qemu-system-ppc64[54a264,108100000+11e0000]
Apr 02 06:51:56 talos2 kernel: kvm[3726753]: code: 3842d4d0 fbe1fff8
fbc1fff0 2c040000 7c7f1b78 f821ffc1 e9230030 ebc90008
Apr 02 06:51:56 talos2 kernel: kvm[3726753]: code: 4180001c 81230038
2c090000 41800010 <a13e3110> 2c090000 40820024 7fe3fb78
...
Apr 17 14:53:29 talos2 kernel: kvm[42910]: segfault (11) at 53960 nip
11f8426d4 lr 11f849e10 code 1 in
qemu-system-ppc64[5426d4,11f300000+11e0000]
Apr 17 14:53:29 talos2 kernel: kvm[42910]: code: 7d4a4a79 39200000
e93f0000 40820134 e9490000 a129000a 382100a0 7fe4fb78
Apr 17 14:53:29 talos2 kernel: kvm[42910]: code: ebc1fff0 ebe1fff8
e94a5a28 79291f24 <7c6a482a> 4bfffbe0 60420000 7c0802a6
...
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: segfault (11) at
7ffb6810e961 nip 105b69d08 lr 1061d0b0c code 2 in
qemu-system-ppc64[549d08,105620000+11e0000]
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: code: 60000000 60420000
39090001 80fb0000 e95f0040 79292708 911f004c 2c070000
Apr 23 15:01:42 talos2 kernel: kvm[4019663]: code: 7d2a4a14 83c90004
eba90008 4082012c <813a1f6c> 7c1e4840 418100c0 7bde0020
I don't have a guess as to what is going on here. From the timestamps,
it doesn't look like a cron job gone wrong.
ロ. cfarm439,440: multiple FreeBSD bugs, e.g. a recent regression where
the FreeBSD kernel fails to initialize virtio-scsi on powerpc64, plus
other unfixed bugs.
ハ. cfarm433..440: /home gets remounted read-only. Unfortunately this is
a known Linux kernel bug (?). On cfarm433..440, /home is served using
NVMe-over-TCP. Under intensive disk I/O, the NVMe-oF target might fail
to allocate memory, and the NVMe controller then fails [1]. In this
case, async writes are lost and the VM needs to be cold rebooted; /home
cannot simply be remounted rw because of the errors. I have tuned
vm.min_free_kbytes and zfs_arc_max on the NVMe-oF target so that it
happens less often. If even that doesn't work, we can always go back to
the old, reliable iSCSI.
ス. This one is more or less my fault: the Talos 2 machine overheats, and
the CPUs throttle so much that they look frozen. All VMs then become
unresponsive, of course. I guess it's really the time of year when I
should leave the air conditioner on 24 hours a day instead of just
during the day.
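For anyone who wants to compare the three qemu crashes quoted under イ, here is a minimal Python sketch that pulls the addresses out of the kernel "segfault" lines and turns nip and lr into offsets relative to the load base of qemu-system-ppc64. It assumes the wrapped log lines have been joined back into single lines. The names SEGFAULT_RE, SAMPLE_LINES and parse_segfault are made up for this example, and the regular expression only targets the message format shown in this mail.

```python
import re

# Matches the "segfault" kernel message as quoted in this mail, once the
# wrapped lines are joined. Assumption: the bracketed part has the form
# [offset,base+size], as it does in the three lines above.
SEGFAULT_RE = re.compile(
    r"segfault \(\d+\) at (?P<addr>[0-9a-f]+) "
    r"nip (?P<nip>[0-9a-f]+) lr (?P<lr>[0-9a-f]+) code (?P<code>\d+) in "
    r"(?P<binary>[^\[]+)"
    r"\[(?P<off>[0-9a-f]+),(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

# The three crash lines from this mail, joined into one line each.
SAMPLE_LINES = [
    "Apr 02 06:51:56 talos2 kernel: kvm[3726753]: segfault (11) at 3110 "
    "nip 10864a264 lr 108cb0b0c code 1 in "
    "qemu-system-ppc64[54a264,108100000+11e0000]",
    "Apr 17 14:53:29 talos2 kernel: kvm[42910]: segfault (11) at 53960 "
    "nip 11f8426d4 lr 11f849e10 code 1 in "
    "qemu-system-ppc64[5426d4,11f300000+11e0000]",
    "Apr 23 15:01:42 talos2 kernel: kvm[4019663]: segfault (11) at "
    "7ffb6810e961 nip 105b69d08 lr 1061d0b0c code 2 in "
    "qemu-system-ppc64[549d08,105620000+11e0000]",
]


def parse_segfault(line):
    """Return a dict of offsets for one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    base = int(m["base"], 16)
    return {
        "binary": m["binary"],
        "fault_addr": int(m["addr"], 16),
        # Offsets relative to the mapping base, comparable across runs.
        "nip_offset": int(m["nip"], 16) - base,
        "lr_offset": int(m["lr"], 16) - base,
        # Offset the kernel printed itself, kept for cross-checking.
        "printed_offset": int(m["off"], 16),
    }


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        info = parse_segfault(line)
        print(f"nip+{info['nip_offset']:#x}  lr+{info['lr_offset']:#x}  "
              f"fault at {info['fault_addr']:#x}")
```

On the three lines above this gives nip offsets 0x54a264, 0x5426d4 and 0x549d08, matching the bracketed offsets the kernel already prints, and lr offsets 0xbb0b0c, 0x549e10 and 0xbb0b0c. With debug symbols for the same qemu build, these offsets could then be fed to a symbolizer; whether such symbols are available on the host is not something this mail states.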
[1] https://wiki.archlinux.org/title/NVMe_over_Fabrics#Page_allocation_failure
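To notice the read-only remount described under ハ before a build fails halfway, a user on one of the Linux guests could check the mount options of /home. The following is a minimal Python sketch, assuming a Linux guest where /proc/mounts is available (so not the FreeBSD or Hurd VMs). The function names are made up for this example, and the device name and filesystem type in the comment are hypothetical, not taken from the actual machines.

```python
def mount_options(mounts_text, mountpoint):
    """Return the option list of the last mount on `mountpoint`, or None.

    `mounts_text` is the content of /proc/mounts, whose lines look like
    "/dev/nvme1n1 /home ext4 ro,relatime 0 0" (hypothetical device and fs).
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mountpoint, fstype, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mountpoint:
            # Later entries shadow earlier ones, so keep the last match.
            found = fields[3].split(",")
    return found


def is_read_only(mounts_text, mountpoint="/home"):
    """True if `mountpoint` is mounted and carries the "ro" option."""
    opts = mount_options(mounts_text, mountpoint)
    return opts is not None and "ro" in opts


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print("/home read-only:", is_read_only(f.read()))
```

This only reports the symptom; as noted above, a cold reboot of the VM is still needed once it happens.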
Cheers,
--
Luke Yasuda
GPG Fingerprint: 4E09 8D19 00AA 3F72 1899 2614 09B3 316E 13A1 1EFC