friendica.eskimo.com

Hardware Troubleshooting

I've got a weird issue with the platform this site runs on. It is an i9-10980XE on a Gigabyte X299 Legend motherboard, with 256GB of RAM.

The machine will run for some period of time and then hang hard. When it hangs, not even the magic SysRq key works; the only thing that works at that point is power cycling.

I have kernel oops capture set up, but it never leaves a vmcore, so that's not getting me anywhere. I've replaced the power supply twice and the motherboard four times. I've tested the memory in other machines, where it is fine. I've run memtest86 for three days solid with no errors.
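Since no vmcore ever appears, it may be worth confirming that the crash-capture pieces are actually armed on the running kernel. What follows is only a minimal sketch, assuming a Linux host with kexec/kdump support: it reads `/proc/sys/kernel/sysrq`, `/sys/kernel/kexec_crash_loaded`, and `/proc/cmdline`. The function names `read_text` and `crash_capture_status` are made up for illustration.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def crash_capture_status(
    sysrq_path="/proc/sys/kernel/sysrq",
    crash_loaded_path="/sys/kernel/kexec_crash_loaded",
    cmdline_path="/proc/cmdline",
):
    """Report whether SysRq and a kdump crash kernel appear to be armed.

    The default paths assume a Linux kernel built with kexec support;
    they are parameters so the check can be pointed at other files.
    """
    sysrq = read_text(sysrq_path)
    crash_loaded = read_text(crash_loaded_path)
    cmdline = read_text(cmdline_path) or ""
    return {
        # 0 disables SysRq; any other value enables at least some functions.
        "sysrq_enabled": sysrq is not None and sysrq != "0",
        # "1" means a crash kernel has been loaded and can be booted on panic.
        "crash_kernel_loaded": crash_loaded == "1",
        # Memory for the crash kernel is reserved via crashkernel= at boot.
        "crashkernel_reserved": any(
            arg.startswith("crashkernel=") for arg in cmdline.split()
        ),
    }


if __name__ == "__main__":
    for name, ok in crash_capture_status().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

If all three report yes and there is still no vmcore, that would be consistent with a hang so hard that the kernel never gets to run its panic path, which points back at hardware rather than at the dump setup.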

I have the generic memory setting at "stable" in the Gigabyte UEFI BIOS, and I manually loosened the timings from 15-15-15-35 to 17-17-17-39. All four memory channels are occupied with two DIMMs each, and I am aware it can be difficult for the i9-10980XE's memory controllers to operate at full speed when loaded this heavily, so I raised the latencies hoping that giving things more time to settle will help. Any other ideas?

The problem seems to be exacerbated by the 6.12.x kernels: 6.11.x will run 18 days on average, but 6.12.x rarely makes it more than a day, even though I run this same kernel on my other servers without issue.

Presently the CPU is clocked at 4.4 GHz with a vCore of 1.35 V. This is quite conservative for this CPU: it was previously stable at 4.8 GHz, and at 5 GHz it would run but was not 100% stable. At this point I'm leaning towards a damaged CPU, although it has always had adequate cooling. Because these are expensive chips, I want to eliminate every other possibility before replacing it.

Shoreline, WA, USA
Sorry for the downtime. I am in the hospital with a leg infection, so I was unable to reboot the machine when it crashed. I got someone at the co-lo to reboot it, but had to wait until Monday for them to be available. YaCy is still broken because it has corrupted its database. Because I am not all that familiar with Solr, it may take me a while to figure it out.
At some point a system update left the Java infrastructure missing some functionality the old Solr database relied on. I was unable to save the old index, so it is gone. The machine is still unstable, and I have a new CPU on order. Once the CPU is replaced, we'll resume more intensive indexing.