rageek

A place for Unix Thoughts and Ideas

Unrecoverable hardware errors on T5140s

Generally speaking, SPARC hardware is extremely reliable and I have very few unplanned outages due to hardware.

However, out of the 60 T5140s/T5240s I’m running, we have had a handful of “Unrecoverable hardware error” kernel panics on T5140s in the last couple of months.

Every time, support has had us replace memory.

I had blamed it on the cpudiag FMA module, which has a tendency to fail and not restart on its own, but I recently had this occur on a system running the 09/10 release of Solaris 10, which should not have that issue.

I personally don’t have a huge issue with memory errors or failures, as long as the hardware/OS handles the error and prevents a system crash (it is SPARC after all).

Having kernel panics due to these hardware errors was really concerning.

I was recently alerted by support to a bug fixed in the newest 5x40 firmware that sounds like it may finally crush this issue.

CR 6983478 Multi-node systems crashing after CE due to incorrect rerouting code.

It is fixed in the 7.3.0 firmware release for T2+ systems.
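With 60 or so of these boxes, it helps to know which ones are still below the fixed release. Here is a minimal sketch of that check, assuming you have already collected each host's system firmware version as a string. The host names, the version strings, and the helper names (`parse_version`, `needs_update`) are all made up for illustration, and the handling of a trailing letter suffix is my assumption about how to compare such strings, not anything from the firmware documentation.

```python
import re

# Release that contains the fix for CR 6983478, per the note above.
FIXED_IN = (7, 3, 0)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple.

    Parsing stops at the first component that does not start with a digit,
    so a trailing letter suffix (assumed format, e.g. "7.2.10.a") is ignored.
    """
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    if not parts:
        raise ValueError("no numeric version found in %r" % text)
    return tuple(parts)


def needs_update(version, fixed_in=FIXED_IN):
    """True if the given version is older than the release with the fix."""
    parsed = parse_version(version)
    # Pad both tuples with zeros so "7.3" compares equal to "7.3.0".
    width = max(len(parsed), len(fixed_in))
    parsed += (0,) * (width - len(parsed))
    fixed = tuple(fixed_in) + (0,) * (width - len(fixed_in))
    return parsed < fixed


if __name__ == "__main__":
    # Hypothetical inventory: host name -> firmware version string.
    inventory = {
        "t5140-a": "7.2.10.a",
        "t5140-b": "7.3.0",
        "t5240-a": "7.2.7",
    }
    for host, version in sorted(inventory.items()):
        status = "needs update" if needs_update(version) else "ok"
        print(host, version, status)
```

How you gather the version strings is up to you; the sketch only does the comparison.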

I have long suspected there was an issue with the T2+ systems, as my T5120s are rock solid and I have been having tons and tons of memory issues on the T2+ based systems (compared to the earlier CMT and M-series servers).

Hopefully this will finally put those problems to rest.
