rageek

A place for Unix Thoughts and Ideas

T-series Memory upgrade issues

Memory replacement and upgrades on the T2/T2 Plus systems (T5120/T5140/T5220/T5240) can be a royal pain.

Despite all the parts having matching Sun/Oracle part numbers, the system would reject many combinations of DIMMs as an invalid configuration and spit back errors like:

Chassis | major: Jan 25 22:11:31 ERROR: MB/CMP0/BR1/CH1/D1 initialization failed: DRAM init, disabled
 Chassis | major: Jan 25 22:11:32 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:32 2012. Jan 25 22:11:32 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:32 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:32 ERROR: System DRAM Available: 032768 MB
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:34 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced fail (DRAM Init)
 Chassis | major: Jan 25 22:11:44 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:45 2012. Jan 25 22:11:45 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:45 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:45 ERROR: System DRAM Available: 032768 MB
 Chassis | major: Jan 25 22:11:56 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:57 2012. Jan 25 22:11:57 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:57 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:57 ERROR: System DRAM Available: 032768 MB
 and
 Chassis | major: Jan 25 22:12:41 ERROR: POST errors detected
 Chassis | major: Jan 25 22:12:42 ERROR: [CMP0] Branch 1: neither channel populated with DIMM0, Branch 1 not configured
 Chassis | major: Jan 25 22:12:42 ERROR: Degraded configuration: system operating at reduced capacity
 Chassis | major: Jan 25 22:12:42 ERROR: [CMP1 ] memc_1_1 unused because associated L2 banks on CMP0 cannot be used
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:42 2012. Jan 25 22:12:42 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:12:42 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:12:42 ERROR: System DRAM Available: 024576 MB
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:43 2012. /SYS/MB/CMP0/BR1/CH0/D0 Forced Fail (POST)
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:43 2012. /SYS/MB/CMP0/BR1/CH1/D0 Forced Fail (POST)
 Chassis | major: Jan 25 22:12:44 ERROR: [CMP0 ] Only 4 cores, up to 32 cpus are configured because some L2_BANKS are unusable
 Chassis | major: Jan 25 22:12:44 ERROR: [CMP1 ] Only 4 cores, up to 32 cpus are configured because some L2_BANKS are unusable

Initially we were simply trying not to mix blue DIMMs with silver DIMMs (Micron vs. Samsung memory), but it got so bad that we went as far as matching up the RAM manufacturer date codes to try to get a working setup.

Recently I took one of the T5140s I had pulled out of service, along with a stack of RAM DIMMs, and tried to figure out what the key to configuring memory was.

Previous conversations with my Sun FE revealed that you can mix 1.8V and 1.5V DIMMs in a system as long as they are on different CPUs, and if you read the system configuration guide, it says that DIMMs in the same branch have to be the same part number.

My takeaway was that, at a bare minimum, I had to have sets of 4 matching DIMMs.

After a couple of different memory configurations, it was clear that I could mix some date codes, but I noticed that if I swapped certain DIMMs that appeared to have the same date codes between slot 0 and slot 1, the system would still mark it as an invalid memory configuration.

I then ran showfru and noticed that, although some of my DIMMs looked identical and had the exact same date codes, the manufacturer part number they reported was sometimes the Sun FRU number and sometimes the actual manufacturer part number.

/SYS/MB/CMP0/BR1/CH0/D1 (container)
 SEGMENT: SP
 /SPD_FBDIMM_R
 /SPD_FBDIMM_R/SPD_Bytes_Written_SPDMemory: 92
 /SPD_FBDIMM_R/SPD_Data_Revision_Code: 11
 /SPD_FBDIMM_R/SPD_Fundamental_Memory_Type: FBDIMM
 /SPD_FBDIMM_R/SPD_Mod_Voltage_Interface: 12
 /SPD_FBDIMM_R/SPD_SDRAM_Addressing: 49
 /SPD_FBDIMM_R/SPD_Module_Physical_Attributes: 24
 /SPD_FBDIMM_R/SPD_Module_Type_Thickness: 07
 /SPD_FBDIMM_R/SPD_Module_Organization: 10
 /SPD_FBDIMM_R/SPD_FBDIMM_Specific:
 /SPD_FBDIMM_R/Vendor_Name: Samsung
 /SPD_FBDIMM_R/SPD_Man_Loc: 1
 /SPD_FBDIMM_R/SPD_Manufacturing_Date: 201116
 ………
 /SPD_FBDIMM_R/SPD_CRC16: 511F
 /SPD_FBDIMM_R/SPD_Manufacturer_Part_No: 501-7954-01 Rev 51
 /SPD_FBDIMM_R/SPD_Module_Rev_Code: 0000
 /SPD_FBDIMM_R/SPD_SDRAM_Vendor_Name: Samsung
/SYS/MB/CMP0/BR0/CH1/D1 (container)
 SEGMENT: SP
 /SPD_FBDIMM_R
 /SPD_FBDIMM_R/SPD_Bytes_Written_SPDMemory: 92
 /SPD_FBDIMM_R/SPD_Data_Revision_Code: 11
 /SPD_FBDIMM_R/SPD_Fundamental_Memory_Type: FBDIMM
 /SPD_FBDIMM_R/SPD_Mod_Voltage_Interface: 12
 /SPD_FBDIMM_R/SPD_SDRAM_Addressing: 49
 /SPD_FBDIMM_R/SPD_Module_Physical_Attributes: 24
 /SPD_FBDIMM_R/SPD_Module_Type_Thickness: 07
 /SPD_FBDIMM_R/SPD_Module_Organization: 10
 /SPD_FBDIMM_R/SPD_FBDIMM_Specific:
 /SPD_FBDIMM_R/Vendor_Name: Samsung
 /SPD_FBDIMM_R/SPD_Man_Loc: 1
 /SPD_FBDIMM_R/SPD_Manufacturing_Date: 200850
 ………
 /SPD_FBDIMM_R/SPD_CRC16: 511F
 /SPD_FBDIMM_R/SPD_Manufacturer_Part_No: M395T5160QZ4-CE66
 /SPD_FBDIMM_R/SPD_Module_Rev_Code: 0000
 /SPD_FBDIMM_R/SPD_SDRAM_Vendor_Name: Samsung
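
Comparing these listings by eye gets tedious with a full set of DIMMs, so here is a rough Python sketch that pulls the reported part number for each slot out of saved showfru output. It assumes the output looks like the excerpt above (a slot path followed by "(container)", then an SPD_Manufacturer_Part_No line); the function name part_numbers_by_slot is my own and not part of any Sun tool.

```python
import re

# Assumed layout, based only on the excerpt above:
#   "/SYS/MB/CMP0/BR1/CH0/D1 (container)" starts a DIMM record,
#   and a later line carries "SPD_Manufacturer_Part_No: <value>".
SLOT_RE = re.compile(r"^(/SYS/\S+)\s+\(container\)")
PART_RE = re.compile(r"SPD_Manufacturer_Part_No:\s*(.+?)\s*$")


def part_numbers_by_slot(showfru_text):
    """Map each DIMM slot path to the part number string it reports."""
    parts = {}
    slot = None
    for raw in showfru_text.splitlines():
        line = raw.strip()
        slot_match = SLOT_RE.match(line)
        if slot_match:
            # Remember which slot the following SPD lines belong to.
            slot = slot_match.group(1)
            continue
        part_match = PART_RE.search(line)
        if part_match and slot is not None:
            parts[slot] = part_match.group(1)
    return parts
```

Feeding it the two records shown above should give one DIMM reporting 501-7954-01 Rev 51 and the other reporting M395T5160QZ4-CE66, which is exactly the difference that is easy to miss when the DIMMs look identical.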

I then noticed that if you have a DIMM in the DIMM 0 slot for a specific channel and it reports a Sun FRU number as its part number, the system would reject the configuration if the DIMM in DIMM slot 1 didn’t also report a Sun FRU number.

Building off of that, I took some Micron memory (blue DIMMs) and populated it in the DIMM 0 location for branches that had Samsung memory, and the system accepted the configuration.

Only time will tell if this works consistently on all my servers, but I feel that it can help me avoid invalid memory configurations and failed system upgrades at 2am.

BTW, if you happen to get a rejected configuration, run clearasrdb in ALOM mode and then unplug/replug the server to clear the error. Sometimes I had to do this twice.
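
The DIMM 0 / DIMM 1 observation can be turned into a quick pre-check before a 2am upgrade. This is only a sketch of what I saw on one T5140, not a documented rule: it takes a dict of slot path to reported part number (typed in by hand or collected from showfru), and flags channels where D0 reports a Sun FRU number and D1 does not. The pattern used to recognize a Sun FRU number (three digits, four digits, two digits, as in 501-7954-01) is my assumption, and the function names are made up for illustration.

```python
import re

# Assumption: a Sun FRU number looks like "501-7954-01", optionally
# followed by a revision such as "Rev 51".
SUN_FRU_RE = re.compile(r"^\d{3}-\d{4}-\d{2}\b")


def reports_sun_fru(part_no):
    """True if the reported part number looks like a Sun FRU number."""
    return bool(SUN_FRU_RE.match(part_no.strip()))


def suspect_channels(parts):
    """Return channels where D0 reports a Sun FRU number but D1 does not.

    `parts` maps slot paths such as "/SYS/MB/CMP0/BR1/CH0/D0" to the
    part number string that slot reports.
    """
    suspects = []
    for slot, d0_part in sorted(parts.items()):
        if not slot.endswith("/D0"):
            continue
        channel = slot[: -len("/D0")]
        d1_part = parts.get(channel + "/D1")
        if d1_part is None:
            continue  # D1 not populated, nothing to compare
        if reports_sun_fru(d0_part) and not reports_sun_fru(d1_part):
            suspects.append(channel)
    return suspects
```

An empty list does not prove the configuration is valid; it only means this one pattern was not found. The other rules (matching part numbers within a branch, not mixing voltages on the same CPU) still apply.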
