Channel: Ask the Core Team

Having a problem with nodes being removed from active Failover Cluster membership?


Welcome to the AskCore blog. Today, we are going to talk about nodes being randomly removed from active Failover Cluster membership. If a node is being removed from membership, you will see events like this logged in your System Event Log:

[Image: System Event Log entry showing Event ID 1135 — the node was removed from the active failover cluster membership]

This event will be logged on all nodes in the Cluster except for the node that was removed. This event is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows Server 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. The responding node then sends out a heartbeat request of its own and waits for a response. This completes one heartbeat. The example below should help clarify this:

[Image: Diagram of one heartbeat exchange — each node sends a request packet and waits for the other node's response]

If any one of these packets is lost, then the specific heartbeat is considered failed. By default, Cluster nodes have a limit of 5 failures before the connection is marked down. Once all connections are marked down for a node, then the node is removed from active Failover Cluster membership and the 1135 event is logged on all other nodes. On the node that is removed from active Failover Cluster membership, the Cluster service is terminated and then started so it can try to rejoin the Cluster.
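The failure-counting behavior described above can be sketched roughly as follows. This is illustrative Python only, with hypothetical names — the real logic lives inside the Windows Cluster service — but it shows the rule: five consecutive failures mark a connection down, and only when every connection is down is the node removed from membership.

```python
# Illustrative sketch of heartbeat accounting (hypothetical names; the
# actual implementation is internal to the Windows Cluster service).

FAILURE_THRESHOLD = 5  # default SameSubnetThreshold: 5 missed heartbeats


class NodeConnections:
    """Tracks heartbeat state for every network connection to one remote node."""

    def __init__(self, networks):
        # consecutive missed-heartbeat count and down flag per network
        self.missed = {net: 0 for net in networks}
        self.down = {net: False for net in networks}

    def heartbeat_result(self, net, replied):
        """Record one heartbeat round trip (request + response) on a network."""
        if replied:
            self.missed[net] = 0          # a successful exchange resets the count
        else:
            self.missed[net] += 1
            if self.missed[net] >= FAILURE_THRESHOLD:
                self.down[net] = True     # connection marked down after 5 failures

    def node_removed(self):
        """The node leaves active membership only when ALL connections are down."""
        return all(self.down.values())


conns = NodeConnections(["Cluster Network 1", "Cluster Network 2"])
for _ in range(5):
    conns.heartbeat_result("Cluster Network 1", replied=False)
print(conns.down["Cluster Network 1"])  # True: one connection is down
print(conns.node_removed())             # False: Cluster Network 2 is still up
```

Note that a single lost packet does not mark the connection down; it takes the full run of consecutive failures on every available network before the 1135 event fires.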

Now that we know how the heartbeat process works, what are some of the known causes for the process to fail?

1. Actual network hardware failures. If the packet is lost on the wire somewhere between the nodes, then the heartbeats will fail. A network trace from both nodes involved will reveal this.

2. The profile for your network connections could possibly be bouncing from Domain to Public and back to Domain again. During the transition of these changes, network I/O can be blocked. You can check to see if this is the case by looking at the Network Profile Operational log. You can find this log by opening the Event Viewer and navigating to: Applications and Services Logs\Microsoft\Windows\NetworkProfile\Operational. Look at the events in this log on the node that was mentioned in the Event ID: 1135 and see if the profile was changing at this time. If so, please check out the KB article “The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2”.

3. You have IPv6 enabled on the servers, but have the following two rules disabled for Inbound and Outbound in the Windows Firewall:

  • Core Networking - Neighbor Discovery Advertisement
  • Core Networking - Neighbor Discovery Solicitation

4. Anti-virus software could be interfering with this process also. If you suspect this, test by disabling or uninstalling the software. Do this at your own risk because you will be unprotected from viruses at this point.

5. Latency on your network could also cause this to happen. The packets may not be lost between the nodes, but they may not get to the nodes fast enough before the timeout period expires.

6. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. If there are switches, firewalls, or routers not configured properly to allow this traffic through, you can see issues like this.
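To illustrate just the transport, here is a minimal UDP request/response round trip using Python's standard socket API. The real heartbeat payload and protocol are internal to the Cluster service; this sketch only demonstrates the kind of unicast traffic that must be allowed through on UDP port 3343. It binds an ephemeral localhost port (rather than 3343 itself) so it can run anywhere:

```python
import socket
import threading

CLUSTER_HEARTBEAT_PORT = 3343  # real clusters use UDP 3343; this demo binds
                               # an ephemeral localhost port instead


def responder(sock):
    """Echo a reply for each heartbeat request, as the remote node would."""
    data, addr = sock.recvfrom(64)
    if data == b"heartbeat-request":
        sock.sendto(b"heartbeat-reply", addr)


# "Remote node" socket on an ephemeral port (stand-in for UDP 3343)
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
threading.Thread(target=responder, args=(server,), daemon=True).start()

# "Local node" sends a request and waits for the response; no response
# within the timeout would count as one failed heartbeat
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(1.0)
client.sendto(b"heartbeat-request", server.getsockname())
try:
    reply, _ = client.recvfrom(64)
    print("heartbeat ok:", reply == b"heartbeat-reply")
except socket.timeout:
    print("heartbeat ok: False")  # a dropped or blocked packet looks like this
```

A firewall or router that silently drops UDP 3343 produces exactly the timeout path above, which from the cluster's perspective is indistinguishable from a dead node.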

These are the main reasons that these events are logged, but there could be other causes as well. The point of this blog is to give you some insight into the process and some ideas of what to look for. Some administrators raise the following settings to their maximum values to try to make the problem stop.

Parameter              Default            Range
SameSubnetDelay        1000 milliseconds  250-2000 milliseconds
CrossSubnetDelay       1000 milliseconds  250-4000 milliseconds
SameSubnetThreshold    5                  3-10
CrossSubnetThreshold   5                  3-10

Increasing these values to their maximum may make the event and node removal go away, but it just masks the problem.  It does not fix anything.  The best thing to do is find out the root cause of the heartbeat failures and get it fixed.  The only real need for increasing these values is in a multi-site scenario where nodes reside in different locations and network latency cannot be overcome.
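The practical effect of these settings is the detection window: roughly the heartbeat delay multiplied by the threshold of consecutive failures before a connection is marked down. A quick back-of-the-envelope calculation (my arithmetic from the table above, not an official formula) shows why the defaults flag a failed connection in about five seconds, and why maxing the settings only stretches that window:

```python
def detection_window_ms(delay_ms, threshold):
    """Approximate time before a connection is marked down:
    one heartbeat every delay_ms, threshold consecutive failures."""
    return delay_ms * threshold


# Defaults: SameSubnetDelay=1000 ms, SameSubnetThreshold=5
print(detection_window_ms(1000, 5))   # 5000 ms, about 5 seconds
# Maximums: SameSubnetDelay=2000 ms, SameSubnetThreshold=10
print(detection_window_ms(2000, 10))  # 20000 ms, about 20 seconds
# Cross-subnet maximums: CrossSubnetDelay=4000 ms, CrossSubnetThreshold=10
print(detection_window_ms(4000, 10))  # 40000 ms, about 40 seconds
```

Stretching the window from 5 to 20 (or 40) seconds also means a genuinely failed node stays in membership that much longer before failover begins, which is another reason to fix the underlying network problem instead.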

I hope that this post helps you!

Thanks,
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group

