A place for Unix Thoughts and Ideas
The mapping of bay numbers on the HP-UX Integrity servers is not very straightforward, and if you make a mistake, you could lose critical data.
You can use the sasmgr command to get the information.
Here are 2 examples:
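As a sketch of what those two examples look like (the controller device file /dev/sasd0 is a hypothetical example; find the actual instance on your system by driver first):

```shell
# Locate the SAS controller device file by driver
ioscan -kfnd sasd

# Example 1: list all LUNs seen by the controller
sasmgr get_info -D /dev/sasd0 -q lun=all

# Example 2: include the physical location (enclosure/bay) for each LUN
sasmgr get_info -D /dev/sasd0 -q lun=all -q lun_locate
```

The lun_locate output is what ties a disk device file back to a physical bay, so check it before pulling a drive.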
Solaris Fault Management is a great feature, but it lacks basic reporting functionality on Solaris 10.
Here is a script I put together a couple years ago which will email alerts as they occur.
It is run from cron and will email any alerts encountered in the last X minutes.
Here is an example of the output:
Fri Dec 10 18:30:00 PST 2010
Fault Management Events Discovered on badserver.testdomain.com.com in the last 5 minutes:
Dec 10 18:25:42.7028 6f05831c-e318-6771-fafd-ecb888797fed SUN4V-8000-X2
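A minimal sketch of such a cron script, assuming fmdump's relative-time syntax ("5min" for the last five minutes) on your Solaris 10 release; the recipient address and interval are placeholders:

```shell
#!/bin/sh
# Sketch of a cron-driven FMA alert mailer; adjust MINUTES and RECIP,
# and verify the fmdump relative-time syntax on your release.
MINUTES=5
RECIP=admin@example.com
TMP=/tmp/fma_check.$$

# fmdump -t accepts relative times; "5min" selects events at or after
# five minutes ago.
/usr/sbin/fmdump -t ${MINUTES}min > $TMP 2>/dev/null

# fmdump prints a header line even when no events match, so more than
# one line of output means new events were found.
if [ "`wc -l < $TMP`" -gt 1 ]; then
    (
        date
        echo "Fault Management Events Discovered on `hostname` in the last $MINUTES minutes:"
        cat $TMP
    ) | mailx -s "FMA events on `hostname`" $RECIP
fi
rm -f $TMP
```

Run it from cron at the same interval as MINUTES so events are neither missed nor double-reported.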
There are certain packages that should not be installed in local zones, and they are designated in the /usr/lib/brand/native/bad_patches file.
This problem was identified a while back with Solaris 10 and zones, but it still persists.
Here is an example:
root@suntest-01 # zoneadm -z sun_zonetest3 attach -u
zoneadm: zone ‘sun_zonetest3’: ERROR: attempt to downgrade package SUNWvtss, the source had patches but this system does not. Patches:
I have found that they usually are related to the Solaris VTS Packages.
Here is the fix.
echo 142138 >> /usr/lib/brand/native/bad_patches
What I really don’t get is why this file isn’t updated as part of the patch process.
It would only make sense for the file to be updated as part of a Recommended bundle or rollup when new “bad” patches are known.
Since I’m currently writing about RAC clusters I thought I would mention this.
A long time ago, when I was implementing 11gR1 for the first time, I ran into a very weird issue: the cluster seemed to be working fine, but we could not get the database instance to start on more than one node of the cluster.
The errors were not very helpful, Oracle Support wasn’t any help, and the configurations were identical.
We would start the instance on node 1, and when it was started on node 2, it would immediately go down.
Now we had 2 clusters, one in a local data center and one remote. The one in the local datacenter worked fine, the remote cluster did not.
It turned out that the remote cluster was mis-cabled.
Each set of private heartbeat links was on its own dedicated private VLAN, so everything “looked” OK from a VCS/CRS perspective.
However, each link was actually run to a separate switch and had to cross uplinks in order to reach the other node (an unsupported/incorrect configuration).
This introduced latency and packet fragmentation, which caused Oracle to terminate the secondary instance.
After rewiring so that each set of private links was on the same switch (primaries to switch 1, secondaries to switch 2), everything worked perfectly.
A couple months ago I became very intimate with ZFS live upgrade and Veritas Filesystem checkpoints as I repeatedly tried in vain to upgrade my 11gR1 installation to 11gR2.
New installs worked fine, but the upgrades would hang and/or fail during the root.sh execution on the 2nd node.
After attempting this upgrade 20 times (thank goodness for ZFS and checkpoints; failed Oracle upgrades are painful to roll back), I stumbled upon the cause of my pain.
The Oracle installer fails on root.sh if the CRS cluster address is not the native address on the private network adapters. For some odd reason, when VCS brought the private NICs online, the IPs were getting added in the wrong order. It didn’t happen on the first node, but it definitely did on the second.
Here is how to check:
For my example system, the private network addresses are as follows:
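As a sketch of the check (the interface names, the $ORA_CRS_HOME variable, and the addresses are hypothetical examples; substitute your own private NICs):

```shell
# See what clusterware expects for the private interconnect
$ORA_CRS_HOME/bin/oifcfg getif

# Then check how the addresses are actually plumbed. The native address is
# the one on the base interface (e1000g1); addresses added afterwards show
# up as logical interfaces (e1000g1:1, e1000g1:2, ...).
ifconfig e1000g1
ifconfig e1000g1:1
```

If the address oifcfg reports for the private network shows up on a logical interface rather than the base one, the IPs came up in the wrong order and root.sh will fail.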
Generally speaking, SPARC hardware is extremely reliable, and I have very few unplanned outages due to hardware.
However, out of the 60 T5140/T5240s I’m running, we have had a handful of “Unrecoverable hardware error” kernel panics on T5140s in the last couple of months.
Every time support has had us replace memory.
I have blamed it on the cpudiag FMA modules, which have a tendency to fail and not restart on their own, but I recently had this occur on a system running the 09/10 release of Solaris 10, which should not have these issues.
I personally don’t have a huge issue with memory errors/failures, as long as the hardware/OS handles the error and prevents a system crash (it is SPARC, after all).
Having kernel panics due to these hardware errors was really concerning.
I was recently alerted by support to a bug fixed in the newest 5x40 firmware that sounds like it may finally crush this issue.
CR 6983478 Multi-node systems crashing after CE due to incorrect rerouting code.
It is fixed in the 7.3.0 firmware release for T2+ systems.
I have long suspected there was an issue with the T2+ systems, as my T5120s have been rock solid while the T2+-based systems have had tons of memory issues (compared to the earlier CMT and M-series servers).
Hopefully this will finally put those problems to rest.
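To check whether a system is already on the fixed firmware, you can read the firmware version from the service processor (prompts shown for illustration; use whichever mode your SP runs):

```shell
# From the ILOM CLI: the running system firmware version
show /HOST sysfw_version

# Or from the ALOM compatibility shell, which reports the
# "Sun System Firmware" level along with other host info
showhost
```

Anything reporting System Firmware 7.3.0 or later on a T2+ box should include the CR 6983478 fix.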
Here are some quick commands for setting up email alerting on M-series and T-series servers.
M-Series (XSCF)
setsmtp -s mailserver=10.0.0.100 -s port=25 -s replyaddress=firstname.lastname@example.org
setemailreport -s enable=yes -s recipient=email@example.com
T-Series (ALOM Mode)
setsc mgt_mailhost 10.0.0.100
setsc mgt_mailalert firstname.lastname@example.org 1
T-Series (ILOM Mode) or x86 SunFire Servers
set /SP/clients/smtp address=10.0.0.100
set /SP/alertmgmt/rules/1 type=email destination=email@example.com
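To confirm delivery end to end, ILOM can fire a test alert through a configured rule; the testrule property is from ILOM 3.0 as I remember it, so verify against your SP firmware's documentation, and the rule number is just an example:

```shell
# Send a test alert through rule 1 (ILOM 3.0 syntax; verify on your firmware)
set /SP/alertmgmt/rules/1 testrule=true
```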