A place for Unix Thoughts and Ideas

Weird 11g RAC database instance failures

Since I’m currently writing about RAC clusters I thought I would mention this.

A long time ago when I was implementing 11gR1 for the first time, I ran into a very weird issue where the Cluster seemed to be working fine, but we could not get the database instance to start on more than 1 node of the cluster.

The errors were not helpful, Oracle Support offered little insight, and the configurations on both nodes were identical.

We could start the instance on node 1, but as soon as we started it on node 2, the second instance would immediately go down.

Now we had 2 clusters, one in a local data center and one remote. The one in the local datacenter worked fine, the remote cluster did not.

It turned out that the remote cluster was mis-cabled.

Each set of private heartbeat links was on its own dedicated private VLAN, so everything “looked” OK from a VCS/CRS perspective.

However, each link was actually run to a separate switch and had to cross switch uplinks in order to reach the other node (an unsupported, incorrect configuration).

This introduced latency and packet fragmentation on the interconnect, which caused Oracle to terminate the secondary instance.

After rewiring so that each set of private links was on the same switch (primaries to switch 1, secondaries to switch 2), everything worked perfectly.
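One quick way to catch this kind of mis-cabling is to test whether full-size packets cross the private interconnect without fragmenting. Below is a minimal sketch using Linux iputils `ping`; the `NODE` variable is a placeholder (it defaults to loopback here), and on a real cluster you would set it to the peer node's private-interconnect hostname.

```shell
# Placeholder target: set NODE to the other node's private hostname
# (e.g. node2-priv) when running on an actual cluster.
NODE=${NODE:-127.0.0.1}

# Send packets with the Don't Fragment bit set (-M do). If a switch uplink
# in the path forces fragmentation, ping reports "Message too long" rather
# than a normal reply. 1472 = 1500-byte MTU minus 28 bytes of IP/ICMP headers.
ping -c 3 -M do -s 1472 "$NODE"
```

The round-trip times from the same command are also telling: nodes whose private links share a switch should show sub-millisecond latency, while traffic detouring across uplinks will be noticeably slower.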

