Weird 11g RAC database instance failures

Since I’m currently writing about RAC clusters, I thought I would mention this.

A long time ago, when I was implementing 11gR1 for the first time, I ran into a very weird issue: the cluster seemed to be working fine, but we could not get the database instance to start on more than one node of the cluster.

The errors were not very helpful, Oracle wasn’t any help, and the configurations were identical.

We would start the instance on node 1, but when we started the instance on node 2, that second instance would immediately go down.

Now, we had two clusters: one in a local data center and one remote. The one in the local data center worked fine; the remote cluster did not.

It turned out that the remote cluster was mis-cabled.

Each set of private heartbeat links was on its own dedicated private VLAN, so everything “looked” ok from a VCS/CRS perspective.

However, each link was actually being run to separate switches and had to cross uplinks in order to reach the other node (an unsupported/incorrect configuration).

This introduced latency and packet fragmentation, which caused Oracle to terminate the secondary instance.

After rewiring so each set of private links was on the same switch (primaries to switch 1, secondaries to switch 2), everything worked perfectly.
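The rule is simple enough to check from a cabling inventory. The sketch below is purely illustrative: the data layout (node name, then link role, then switch name) and the function name are my own assumptions, not something produced by VCS or CRS.

```python
from typing import Dict, List, Set


def links_crossing_switches(wiring: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the link roles whose ends land on more than one switch.

    wiring maps node name -> {link role -> switch name}.
    An empty result means every role (e.g. "primary") shares one switch.
    """
    switches_by_role: Dict[str, Set[str]] = {}
    for links in wiring.values():
        for role, switch in links.items():
            switches_by_role.setdefault(role, set()).add(switch)
    return sorted(role for role, switches in switches_by_role.items()
                  if len(switches) > 1)


# Hypothetical inventory matching the corrected layout described above.
fixed = {
    "node1": {"primary": "switch1", "secondary": "switch2"},
    "node2": {"primary": "switch1", "secondary": "switch2"},
}
print(links_crossing_switches(fixed))
```

Any role listed in the result has to cross an uplink to reach the other node, which is the situation we had to rewire.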


