Unrecoverable hardware errors on T5140s
January 5, 2011
Generally speaking, SPARC hardware is extremely reliable, and I have very few unplanned outages due to hardware.
However, out of the 60 T5140s/T5240s I'm running, we have had a handful of "Unrecoverable hardware error" kernel panics on T5140s in the last couple of months.
Each time, support has had us replace memory.
I had blamed it on the cpudiag FMA modules, which have a tendency to fail and not restart on their own, but I recently had this occur on a system running the 09/10 release of Solaris 10, which should not have these issues.
I personally don't have a huge issue with memory errors or failures, as long as the hardware and OS handle the error and prevent a system crash (it is SPARC, after all).
Having kernel panics due to these hardware errors has been really concerning.
I was recently alerted by support to a bug fixed in the newest 5x40 firmware that sounds like it may finally crush this issue.
CR 6983478 Multi-node systems crashing after CE due to incorrect rerouting code.
It is fixed in the 7.3.0 firmware release for T2+ systems.
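With 60 of these boxes, the first question is which ones are still below 7.3.0. Here is a minimal Python sketch of that check. It assumes you have already collected each system's firmware version as a plain string; the hostnames and the `inventory` dict are made up for illustration, and the parsing just reads the first three numeric components of whatever string you give it, so adjust it to how your versions are actually recorded.

```python
import re

# Firmware release that contains the fix for CR 6983478, per the note above.
FIXED_IN = (7, 3, 0)


def parse_version(text):
    """Turn a version string such as '7.2.10' into a tuple of three ints.

    Assumption: the first three numeric components are major, minor, patch.
    Missing components are treated as 0; anything after them is ignored.
    """
    numbers = [int(n) for n in re.findall(r"\d+", text)][:3]
    if not numbers:
        raise ValueError("no version number found in %r" % text)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def needs_upgrade(version, fixed_in=FIXED_IN):
    """Return True if the given firmware version is older than the fix."""
    return parse_version(version) < fixed_in


def hosts_to_upgrade(inventory):
    """Return a sorted list of hosts whose firmware predates the fix.

    inventory is a hypothetical mapping of hostname -> version string.
    """
    return sorted(host for host, ver in inventory.items() if needs_upgrade(ver))


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own collected data.
    inventory = {"t5140-a": "7.2.8", "t5140-b": "7.3.0", "t5240-a": "7.4.1"}
    for host in hosts_to_upgrade(inventory):
        print(host, "is below 7.3.0")
```

Comparing tuples of integers rather than raw strings avoids the usual trap where "7.2.10" sorts before "7.2.8" as text.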
I have long suspected there was an issue with the T2+ systems, as my T5120s are rock solid and I have been having tons and tons of memory issues on the T2+ based systems (compared to the earlier CMT and M-series servers).
Hopefully this will finally put those problems to rest.