I've got a weird issue with the platform this site runs on. It is an i9-10980XE on a Gigabyte X299 Legend motherboard with 256 GB of RAM.
The machine will run for some period of time and then hang hard. When it hangs, not even the magic SysRq key works; the only thing that works at that point is power cycling.
I have crash dumping on kernel oops set up, but it never leaves a vmcore, so that's not getting me anywhere. I've replaced the power supply twice and the motherboard four times. I've tested the memory in other machines and it is fine there. I've run memtest86 for three days solid with no errors.
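Since the hang leaves no vmcore, one cheap way to at least pin down when each hang happens is a heartbeat file that is forced to disk on every write. This is only a minimal sketch, not something already running on this machine: the function names, the log path, and the 10-second interval are all hypothetical, and it records nothing beyond a timestamp and the load average.

```python
import os
import time


def write_heartbeat(path: str) -> str:
    """Append one timestamped line to `path` and force it onto the disk."""
    load1, load5, load15 = os.getloadavg()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S%z")
    line = f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to push it to the device
    return line


def run(path: str = "heartbeat.log", interval_s: float = 10.0, count=None) -> None:
    """Write a heartbeat every `interval_s` seconds; loop forever if count is None."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        time.sleep(interval_s)
```

Calling `run()` from a service or a screen session would leave the last line of the file within one interval of the moment the machine froze, which can then be lined up against other logs.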
I have the generic memory setting set to "stable" in the Gigabyte UEFI BIOS, and I manually slowed the timings from 15-15-15-35 to 17-17-17-39. Since all four memory channels are occupied with two DIMMs each, I am aware it can be difficult for the i9-10980XE memory controllers to operate at full speed when loaded this heavily, so I raised the latency hoping that giving things more time to settle would help. Any other ideas?
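To put the timing change in concrete terms, the cycle counts can be converted to nanoseconds. The sketch below assumes the usual DDR relationship (two transfers per clock, so the clock period in ns is 2000 divided by the data rate in MT/s). The data rate itself is not stated above, so `ASSUMED_DATA_RATE_MTS` is a placeholder to be replaced with the real value, and `timings_to_ns` is a helper name made up for this example.

```python
def timings_to_ns(timings: str, data_rate_mts: float) -> list[float]:
    """Convert a 'CL-tRCD-tRP-tRAS' string in clock cycles to nanoseconds."""
    # DDR moves data twice per clock, so period (ns) = 2000 / (MT/s).
    period_ns = 2000.0 / data_rate_mts
    return [int(cycles) * period_ns for cycles in timings.split("-")]


# Assumption: placeholder only, the actual DIMM data rate is not given.
ASSUMED_DATA_RATE_MTS = 2666.0

for label, timings in (("old", "15-15-15-35"), ("new", "17-17-17-39")):
    ns = timings_to_ns(timings, ASSUMED_DATA_RATE_MTS)
    print(label, timings, [round(v, 2) for v in ns])
```

Whatever the data rate is, going from 15 to 17 cycles gives CL, tRCD and tRP each about 13% more time, and going from 35 to 39 gives tRAS about 11% more.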
The problem seems to be exacerbated by the 6.12.x kernels: 6.11.x will run for 18 days on average, but 6.12.x rarely makes it more than a day, even though I run this same kernel on my other servers without issue.
Presently the CPU is clocked at 4.4 GHz with a Vcore of 1.35 V, which is quite conservative for this CPU; it was previously stable at 4.8 GHz and would run, though not 100% stable, at 5 GHz. At this point I'm kind of leaning towards a damaged CPU, even though it's always had adequate cooling. Because these are expensive chips, I want to eliminate every other possibility before replacing it.