@psf I followed that through to the Google and FB papers; they're quite good reads; the FB one especially - that must have been a *fun* debug session. I'm suspecting most of these are silicon level problems rather than microarchitectural.
I remember we had a problem like this when running a few dozen Pentium 933MHz machines: one CPU on one machine would give incorrect results for one particular run (out of thousands). That would have been in 2001 or so.
And earlier than that, I remember people working at Sun saying they'd re-run any failing job (out of tens or hundreds of thousands) and only mark it as failed if the re-run (or possibly even the re-re-run) also failed.
@EdS @psf First time I've heard of that on Pentiums; around the same time Sun did have a known problem with cache on one series of SPARCs; https://www.theregister.com/2001/03/07/sun_suffers_ultrasparc_ii_cache/
Having now read the fine article...
Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered by production test (or by other workloads).
Whereas this is less repeatable: some machines fail, some of the time, while most do not, and the failures react to the environment.
Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.