rageek

A place for Unix Thoughts and Ideas

Monthly Archives: January 2012

Storage Foundation 5.1 SP1 issues with local zones

When Storage Foundation 5.1 SP1 was released last year, I finally decided that it was time to start upgrading my servers to the new version.

I immediately found that the software was having issues with ODM.

The issues differed depending on whether it was an upgrade or a fresh install.

For fresh installs of 5.1, the vxodm service will fail to start:

root@testzone-01 # svcs -xv
svc:/system/vxodm:default (VERITAS Oracle Disk Manager)
State: offline since Fri Jan 27 13:27:52 2012
Reason: Dependency svc:/system/vxfs/vxfsldlic is absent.
See: http://sun.com/msg/SMF-8000-E2
See: man -M /opt/VRTS/man -s 1 mount_odm
Impact: This service is not running.

This is due to the vxfsldlic manifest missing in the local zone.

For upgrades, if you do an upgrade on attach, you will see the following error in the update_log for the installation of VRTSodm:

===== VRTSodm ====
/var/tmp//installcgaGlC/checkinstallegaGlC: /tmp/sh144290: cannot create
pkgadd: ERROR: checkinstall script did not complete successfully

Installation of on zone failed.
No changes were made to the system.

To fix this, you have to remove and reinstall VRTSodm and then apply its patch from the global zone. This is in addition to the fixes for vxfsldlic.
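A sketch of that recovery from the global zone (the media path and patch location below are placeholders; use the package directory from your SF 5.1 SP1 media and the VRTSodm patch that shipped with it):

```shell
# Remove the broken VRTSodm package (run from the global zone;
# pkgrm recurses into the zones by default)
pkgrm VRTSodm

# Reinstall it from the Storage Foundation 5.1 SP1 media
# (media path is a placeholder)
pkgadd -d /mnt/sf51sp1/pkgs VRTSodm

# Re-apply the VRTSodm patch that came with the release
# (patch directory is a placeholder)
patchadd /var/tmp/VRTSodm-patch
```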

For fresh installs, here is a script I wrote to fix the zone configuration. The script copies the vxfsldlic manifest into the zone, imports it, and then enables it.
Run it with your zones attached and booted. You will probably need to run it for any zones you create in the future.
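In outline, the fix amounts to something like this (the manifest path is an assumption; check where the VxFS packages put vxfsldlic.xml in your global zone):

```shell
#!/bin/sh
# Sketch: push the vxfsldlic service into each running local zone.
# Assumes the manifest lives at this path in the global zone.
MANIFEST=/var/svc/manifest/system/vxfs/vxfsldlic.xml

for ZONE in `zoneadm list | grep -v '^global$'`
do
        # Find the zone's root on disk
        ZONEPATH=`zonecfg -z $ZONE info zonepath | awk '{print $2}'`

        # Copy the manifest into the zone
        mkdir -p $ZONEPATH/root/var/svc/manifest/system/vxfs
        cp $MANIFEST $ZONEPATH/root/var/svc/manifest/system/vxfs/

        # Import and enable it inside the zone, then kick vxodm
        zlogin $ZONE svccfg import /var/svc/manifest/system/vxfs/vxfsldlic.xml
        zlogin $ZONE svcadm enable svc:/system/vxfs/vxfsldlic
        zlogin $ZONE svcadm enable svc:/system/vxodm:default
done
```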


Lucreate failure on S10 u10 with Zones with legacy mounts

Despite having updated my baselines to Solaris 10 update 10, I have been limiting the upgrades of my existing servers to update 9 due to a nasty bug in update 10 where it is unable to create a live upgrade environment if you have any ZFS datasets with legacy mounts assigned to a non-global zone.
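For context, the setups that trip the bug look roughly like this: a ZFS dataset with a legacy mountpoint assigned to a non-global zone as an fs resource (the pool, dataset, and zone names here are made up):

```shell
# Dataset with a legacy mountpoint, so ZFS won't auto-mount it
zfs create -o mountpoint=legacy datapool/testzone-01_legacy

# Assign it to the zone as a legacy-mounted fs resource
zonecfg -z testzone-01 <<EOF
add fs
set dir=/legacy
set special=datapool/testzone-01_legacy
set type=zfs
end
commit
EOF
```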

If you try to create a live upgrade environment, you might get an error similar to the following:

Mounting ABE .
ERROR: mount: /zones/testzone-01/root/legacy: No such file or directory
ERROR: cannot mount mount point device
ERROR: failed to mount file system on
ERROR: unmounting partially mounted boot environment file systems
ERROR: cannot mount boot environment by icf file
ERROR: Failed to mount ABE.
Reverting state of zones in PBE .
ERROR: Unable to copy file systems from boot environment to BE .
ERROR: Unable to populate file systems on boot environment .
Removing incomplete BE .
ERROR: Cannot make file systems for boot environment .

This appears to be bug 7078384, which is fixed in patch 121430-68 ( http://wesunsolve.net/patch/id/121430-68 ).

This patch was released in November, but it hasn’t made it into the Recommended Patch Set yet.
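Before upgrading a host, you can check whether the fix is already installed by filtering the `showrev -p` output; a small sketch (the helper name is mine):

```shell
# latest_rev: read `showrev -p` output on stdin and print the
# highest installed revision of patch 121430 (empty if absent)
latest_rev() {
    grep '^Patch: 121430-' | awk '{print $2}' | sort | tail -1
}

# On a live system:
#   showrev -p | latest_rev
```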

On my test server, after applying this patch I was able to create and activate my BEs, even the ones with zones with loop-backed filesystems.

Now I can finally fully use live upgrade on my hosts with local zones.

Solaris 10 update 9 zpool woes

Years ago I used to have a huge issue with the Oracle OEM agents core dumping constantly and rapidly filling up 100GB+ filesystems in a couple of hours.

My solution at the time was to consolidate the core dumps in /var/core and make /var/core a compressed 5GB ZFS filesystem.

Since ZFS root wasn’t supported at the time, the zpool was backed by a file in /var. This worked extremely well.
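The setup looked roughly like this (the pool name and backing-file path are illustrative):

```shell
# 5GB file on the UFS /var to back the pool
mkfile 5g /var/corepool.img

# Build the pool on the file, compress it, and mount it at /var/core
zpool create corepool /var/corepool.img
zfs set compression=on corepool
zfs set mountpoint=/var/core corepool
```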

When I moved all my databases into local zones, I implemented a similar scheme, except that the zone roots were on VxFS filesystems instead of UFS. The core filesystem was again backed by a file that existed in the zone root, and it was included as a legacy ZFS mount through a dataset definition.

This also worked well until update 9 of Solaris 10, where some subtle changes in the startup services caused all of my zpools to be onlined before my VxFS filesystems came online. The end result was that all of my zpools came up faulted.

This was easily remedied by running a zpool clear:
for i in `zpool list | grep FAULTED | awk '{print $1}'`; do zpool clear $i; done

But this requires manual intervention and delayed the starting of my zones.

The fix for this issue was to create a Solaris service, started right after system/filesystem/local, which clears any faulted zpools.

Here is my service:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!--
    Copyright 2004 Sun Microsystems, Inc.  All rights reserved.
    Use is subject to license terms.

    pragma ident        "@(#)zpool_clear.xml 1.2     04/08/09 SMI"
-->
<service_bundle type='manifest' name='zpool_clear'>

<service
    name='system/filesystem/zpool_clear'
    type='service'
    version='1'>

    <single_instance/>
        <dependency
            name='usr'
            type='service'
            grouping='require_all'
            restart_on='none'>
            <service_fmri value='svc:/system/filesystem/local'/>
        </dependency>

        <exec_method
            type='method'
            name='start'
            exec='/lib/svc/method/zpool_clear.sh start'
            timeout_seconds='30' />

         <exec_method
            type='method'
            name='stop'
            exec='/lib/svc/method/zpool_clear.sh stop'
            timeout_seconds='30' />
        <property_group name='startd' type='framework'>
                <propval name='duration' type='astring' value='transient' />
        </property_group>

        <instance name='default' enabled='true' />

        <stability value='Unstable' />

        <template>
                <common_name>
                        <loctext xml:lang='C'>
                                Zpool Service
                        </loctext>
                </common_name>
        </template>
</service>
</service_bundle>

And my startup script:

#!/bin/sh
#
# zpool_clear.sh
#
case "$1" in
'start')
        for i in `zpool list | grep FAULTED | awk '{print $1}'`
        do
                echo "clearing FAULTED status on Zpool $i"
                zpool clear $i
        done

        zfs mount -a
        ;;
*)
        echo "Usage: $0 start"
        ;;
esac
exit 0

Installation:
cp zpool_clear.sh /lib/svc/method/zpool_clear.sh
cp zpool_clear.xml /var/svc/manifest/site
chmod +x /lib/svc/method/zpool_clear.sh

svccfg import /var/svc/manifest/site/zpool_clear.xml
svcadm enable zpool_clear

Updated 4/16/2012: added stop method to manifest to suppress errors while importing on Solaris 11
Updated 8/20/2012: added zfs mount -a to catch auto-mounting zfs datasets

T-series Memory upgrade issues

Memory replacement and upgrades on the T2 systems (T5120/T5140/T5220/T5240) can be a royal pain.

Despite the parts all having matching Sun/Oracle part numbers, many combinations of DIMMs would be rejected as an invalid configuration, with errors like:

Chassis | major: Jan 25 22:11:31 ERROR: MB/CMP0/BR1/CH1/D1 initialization failed: DRAM init, disabled
 Chassis | major: Jan 25 22:11:32 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:32 2012. Jan 25 22:11:32 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:32 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:32 ERROR: System DRAM Available: 032768 MB
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:34 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced fail (DRAM Init)
 Chassis | major: Jan 25 22:11:44 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:45 2012. Jan 25 22:11:45 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:45 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:45 ERROR: System DRAM Available: 032768 MB
 Chassis | major: Jan 25 22:11:56 ERROR: [CMP0] Branch 0: DIMM depth reduced, limited by capacity of memory on other MCUs.
 Fault | critical: SP detected fault at time Wed Jan 25 22:11:57 2012. Jan 25 22:11:57 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:57 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:11:57 ERROR: System DRAM Available: 032768 MB
and
 Chassis | major: Jan 25 22:12:41 ERROR: POST errors detected
 Chassis | major: Jan 25 22:12:42 ERROR: [CMP0] Branch 1: neither channel populated with DIMM0, Branch 1 not configured
 Chassis | major: Jan 25 22:12:42 ERROR: Degraded configuration: system operating at reduced capacity
 Chassis | major: Jan 25 22:12:42 ERROR: [CMP1 ] memc_1_1 unused because associated L2 banks on CMP0 cannot be used
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:42 2012. Jan 25 22:12:42 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:12:42 ERROR: Operating with a degraded memory configuration.
 Chassis | major: Jan 25 22:12:42 ERROR: System DRAM Available: 024576 MB
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:43 2012. /SYS/MB/CMP0/BR1/CH0/D0 Forced Fail (POST)
 Fault | critical: SP detected fault at time Wed Jan 25 22:12:43 2012. /SYS/MB/CMP0/BR1/CH1/D0 Forced Fail (POST)
 Chassis | major: Jan 25 22:12:44 ERROR: [CMP0 ] Only 4 cores, up to 32 cpus are configured because some L2_BANKS are unusable
 Chassis | major: Jan 25 22:12:44 ERROR: [CMP1 ] Only 4 cores, up to 32 cpus are configured because some L2_BANKS are unusable

Initially we were simply trying not to mix blue DIMMs with silver DIMMs (Micron vs. Samsung memory), but it got so bad that we went as far as matching up DIMM manufacturing date codes to try to get a working setup.

Recently I took one of the T5140s I had pulled out of service, along with a stack of DIMMs, and tried to figure out what the key to a valid memory configuration was.

Previous conversations with my Sun FE revealed that you can mix 1.8V and 1.5V DIMMs in a system as long as they are on different CPUs, and if you read the system configuration guide, it says that DIMMs in the same branch have to be the same part number.

So I took away that at a bare minimum I had to have sets of 4 matching DIMMs.

After a couple of different memory configurations, it was clear that I could mix some date codes, but I noticed that if I swapped certain DIMMs that appeared to have the same date codes between slot 0 and slot 1, the system would still mark it as an invalid memory configuration.

I then ran showfru and observed that although some of my DIMMs looked identical, with the exact same date codes, the manufacturer part number reported was sometimes the Sun FRU number and sometimes the actual manufacturer part number.

/SYS/MB/CMP0/BR1/CH0/D1 (container)
 SEGMENT: SP
 /SPD_FBDIMM_R
 /SPD_FBDIMM_R/SPD_Bytes_Written_SPDMemory: 92
 /SPD_FBDIMM_R/SPD_Data_Revision_Code: 11
 /SPD_FBDIMM_R/SPD_Fundamental_Memory_Type: FBDIMM
 /SPD_FBDIMM_R/SPD_Mod_Voltage_Interface: 12
 /SPD_FBDIMM_R/SPD_SDRAM_Addressing: 49
 /SPD_FBDIMM_R/SPD_Module_Physical_Attributes: 24
 /SPD_FBDIMM_R/SPD_Module_Type_Thickness: 07
 /SPD_FBDIMM_R/SPD_Module_Organization: 10
 /SPD_FBDIMM_R/SPD_FBDIMM_Specific:
 /SPD_FBDIMM_R/Vendor_Name: Samsung
 /SPD_FBDIMM_R/SPD_Man_Loc: 1
 /SPD_FBDIMM_R/SPD_Manufacturing_Date: 201116
 ………
 /SPD_FBDIMM_R/SPD_CRC16: 511F
 /SPD_FBDIMM_R/SPD_Manufacturer_Part_No: 501-7954-01 Rev 51
 /SPD_FBDIMM_R/SPD_Module_Rev_Code: 0000
 /SPD_FBDIMM_R/SPD_SDRAM_Vendor_Name: Samsung
/SYS/MB/CMP0/BR0/CH1/D1 (container)
 SEGMENT: SP
 /SPD_FBDIMM_R
 /SPD_FBDIMM_R/SPD_Bytes_Written_SPDMemory: 92
 /SPD_FBDIMM_R/SPD_Data_Revision_Code: 11
 /SPD_FBDIMM_R/SPD_Fundamental_Memory_Type: FBDIMM
 /SPD_FBDIMM_R/SPD_Mod_Voltage_Interface: 12
 /SPD_FBDIMM_R/SPD_SDRAM_Addressing: 49
 /SPD_FBDIMM_R/SPD_Module_Physical_Attributes: 24
 /SPD_FBDIMM_R/SPD_Module_Type_Thickness: 07
 /SPD_FBDIMM_R/SPD_Module_Organization: 10
 /SPD_FBDIMM_R/SPD_FBDIMM_Specific:
 /SPD_FBDIMM_R/Vendor_Name: Samsung
 /SPD_FBDIMM_R/SPD_Man_Loc: 1
 /SPD_FBDIMM_R/SPD_Manufacturing_Date: 200850
 ………
 /SPD_FBDIMM_R/SPD_CRC16: 511F
 /SPD_FBDIMM_R/SPD_Manufacturer_Part_No: M395T5160QZ4-CE66
 /SPD_FBDIMM_R/SPD_Module_Rev_Code: 0000
 /SPD_FBDIMM_R/SPD_SDRAM_Vendor_Name: Samsung

I then noticed that if the DIMM in the DIMM 0 slot of a given channel reports a Sun FRU number as its part number, the system would reject the configuration if the DIMM in the DIMM 1 slot didn’t also report a Sun FRU number.

Building off of that, I took some Micron memory (blue DIMMs) and populated the DIMM 0 locations of branches that had Samsung memory, and the system accepted the configuration.

Only time will tell if this works consistently on all my servers, but I feel this can help me avoid invalid memory configurations and failed system upgrades at 2am.

BTW, if you happen to get a rejected configuration, run clearasrdb in ALOM mode and then unplug/replug the server to clear the error. Sometimes I had to do this twice.

Updating HBA Firmware post patch 145098-02

I recently noticed the following messages appearing in my messages log.

Jan 22 11:07:19 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 1.0340]emlxs0:WARNING:1540: Firmware update required. (A manual HBA reset or link reset (using luxadm or fcadm) is required.)
Jan 22 11:07:21 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 1.0340]emlxs1:WARNING:1540: Firmware update required. (A manual HBA reset or link reset (using luxadm or fcadm) is required.)

After some quick research, it appears that after patch 145098-02, the emlxs driver switched from automatically upgrading the firmware of the HBAs to making it a manual process.

The firmware can be easily upgraded through the following process.

1. Determine device paths
root@test01 # luxadm -e port
/devices/pci@0,600000/pci@0/pci@8/SUNW,emlxs@0/fp@0,0:devctl CONNECTED
/devices/pci@1,700000/pci@0/pci@0/SUNW,emlxs@0/fp@0,0:devctl CONNECTED

2. Disable path in DMP (if applicable)

Determine the controller number
root@test01 # vxdmpadm getctlr
LNAME PNAME VENDOR CTLR-ID
=============================================================================================
c2 /pci@1,700000/pci@0/pci@0/SUNW,emlxs@0/fp@0,0 Emulex 10:00:00:00:c9:8c:0f:a1
c1 /pci@0,600000/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 Emulex 10:00:00:00:c9:8c:5c:74

Disable Path
root@test01 # vxdmpadm disable ctlr=c2

This can be seen in the messages file as:
Jan 23 13:37:09 test01.testdomain.com vxdmp: [ID 575547 kern.notice] NOTICE: VxVM vxdmp V-5-0-0 disabled controller /pci@1,700000/pci@0/pci@0/SUNW,emlxs@0/fp@0,0 connected to disk array 18290

3. Reset the HBA to upgrade the firmware
root@test01 # luxadm -e forcelip /devices/pci@1,700000/pci@0/pci@0/SUNW,emlxs@0/fp@0,0:devctl

Jan 23 13:39:34 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 5.0334]emlxs1: NOTICE: 710: Link down.
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [13.02C0]emlxs1: NOTICE: 200: Adapter initialization. (Firmware update needed. Updating. id=24 fw=4)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0ECB]emlxs1: NOTICE:1520: Firmware download. (AWC file: KERN: old=1.20a9 new=1.21a0 Update.)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0EEB]emlxs1: NOTICE:1520: Firmware download. (DWC file: TEST: new=1.02a3 Update.)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0EFF]emlxs1: NOTICE:1520: Firmware download. (DWC file: STUB: old=2.80a4 new=2.82a4 Update.)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0F0E]emlxs1: NOTICE:1520: Firmware download. (DWC file: SLI1: old=2.80a4 new=2.82a3 Update.)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0F1D]emlxs1: NOTICE:1520: Firmware download. (DWC file: SLI2: old=2.80a4 new=2.82a4 Update.)
Jan 23 13:39:38 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0F2C]emlxs1: NOTICE:1520: Firmware download. (DWC file: SLI3: old=2.80a4 new=2.82a4 Update.)
Jan 23 13:39:54 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 3.0143]emlxs1: NOTICE:1521: Firmware download complete. (Status good.)
Jan 23 13:39:59 test01.testdomain.com emlxs: [ID 349649 kern.info] [ 5.055E]emlxs1: NOTICE: 720: Link up. (2Gb, fabric, initiator)

4. Re-enable the path in DMP (if applicable)
root@test01 # vxdmpadm enable ctlr=c2
Jan 23 13:41:44 test01.testdomain.com vxdmp: [ID 575547 kern.notice] NOTICE: VxVM vxdmp V-5-0-0 enabled controller /pci@1,700000/pci@0/pci@0/SUNW,emlxs@0/fp@0,0 connected to disk array 18290

5. Double-check that all paths recovered in DMP.

root@test01 # vxdmpadm getdmpnode
NAME STATE ENCLR-TYPE PATHS ENBL DSBL ENCLR-NAME
==============================================================================
hitachi_usp-v0_1063 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1064 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1065 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1066 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1067 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1068 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1069 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0

6. Repeat steps 2-5 on the remaining HBA
root@test01 # vxdmpadm disable ctlr=c1
root@test01 # luxadm -e forcelip /devices/pci@0,600000/pci@0/pci@8/SUNW,emlxs@0/fp@0,0:devctl
root@test01 # vxdmpadm enable ctlr=c1

root@test01 # vxdmpadm getdmpnode
NAME STATE ENCLR-TYPE PATHS ENBL DSBL ENCLR-NAME
==============================================================================
hitachi_usp-v0_1063 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1064 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1065 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1066 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1067 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1068 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0
hitachi_usp-v0_1069 ENABLED Hitachi_USP-V 2 2 0 hitachi_usp-v0

Getting a Cricket / ZTE A605 Broadband card working on Lion

I recently switched from a Sprint Broadband card to a Cricket Broadband card.

Surprisingly, my new card works significantly better than my old one.

However, I found that the installation program and drivers didn’t work on Lion.

After some poking around, I figured out that the problem is that the installer doesn’t recognize the OS revision and installs the modules/scripts improperly.

I doesn’t look like Lion is listed as support for any of the broadband cards and I wouldn’t be surprised if they all suffer from the same issue.

Here are the steps to get it working:

1. Install the software normally.
2. Before you reboot, do the following in a terminal window:

sudo su -
cd /System/Library/Extensions/
rm -rf ACTScom*_panther
mv ACTScomCDFree_tiger* ACTScomCDFree.kext
mv ACTScomVsp_tiger* ACTScomVsp.kext
find ACTScom* -exec chmod go-w {} \;
find ACTScom* -type f -exec chmod -x {} \;
cd /Library/Modem\ Scripts
rm -rf ACTScom*_panther
mv ACTScom\ Modem\ Script_tiger* ACTScom\ Modem\ Script

3. Then clear the extension cache with this command and reboot:
sudo rm -rf /System/Library/Caches/com.apple.kext.caches

4. The device will now properly enumerate and show up in the network preferences, but it will not be recognized by the broadband app. So, after you reboot, clear the extension cache again and reboot; the device will then show up in the Cricket broadband application.
sudo rm -rf /System/Library/Caches/com.apple.kext.caches