@psf I followed that through to the Google and FB papers; they're quite good reads, the FB one especially - that must have been a *fun* debug session. I suspect most of these are silicon-level problems rather than microarchitectural ones.
@EdS @psf First time I've heard of that on Pentiums; around the same time Sun did have a known problem with cache on one series of SPARCs; https://www.theregister.com/2001/03/07/sun_suffers_ultrasparc_ii_cache/
Having now read the fine article...
Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered by production test (or by other workloads).
Whereas this is less repeatable. Some machines fail, some of the time, while most do not. And the failures react to the environment.
Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.
Could be some ageing effect. Clearly it doesn't lead to an easily reproducible failure, though, just to something unlikely but possible.
I did read that Intel is cutting down on validation effort, in which case they are designing-in more bugs:
"CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort"
Search within for "sheer panic"