Channel: Ask the Core Team

Having a problem with nodes being removed from active Failover Cluster membership?


Welcome to the AskCore blog. Today, we are going to talk about nodes being randomly removed from active Failover Cluster membership. If a node is being removed from membership, you will see events like this logged in your System Event Log:

[Image: System Event Log entry showing Event ID 1135 — the node was removed from the active failover cluster membership]

This event will be logged on all nodes in the Cluster except for the node that was removed. This event is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows Server 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. The responding node then sends out a heartbeat request of its own and waits for a response. This completes one heartbeat. The example below should help clarify this:

[Image: Diagram of one heartbeat exchange — each node sends a request packet and waits for the other node's response]

If any one of these packets is lost, then the specific heartbeat is considered failed. By default, Cluster nodes have a limit of 5 failures before the connection is marked down. Once all connections are marked down for a node, then the node is removed from active Failover Cluster membership and the 1135 event is logged on all other nodes. On the node that is removed from active Failover Cluster membership, the Cluster service is terminated and then started so it can try to rejoin the Cluster.
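The failure-counting behavior described above can be sketched roughly as follows. This is illustrative Python only, with hypothetical names — the real logic lives inside the Windows Cluster service — but it shows the rule: five consecutive failures mark a connection down, and only when every connection is down is the node removed from membership.

```python
# Illustrative sketch of heartbeat accounting (hypothetical names; the
# actual implementation is internal to the Windows Cluster service).

FAILURE_THRESHOLD = 5  # default SameSubnetThreshold: 5 missed heartbeats


class NodeConnections:
    """Tracks heartbeat state for every network connection to one remote node."""

    def __init__(self, networks):
        # consecutive missed-heartbeat count and down flag per network
        self.missed = {net: 0 for net in networks}
        self.down = {net: False for net in networks}

    def heartbeat_result(self, net, replied):
        """Record one heartbeat round trip (request + response) on a network."""
        if replied:
            self.missed[net] = 0          # a successful exchange resets the count
        else:
            self.missed[net] += 1
            if self.missed[net] >= FAILURE_THRESHOLD:
                self.down[net] = True     # connection marked down after 5 failures

    def node_removed(self):
        """The node leaves active membership only when ALL connections are down."""
        return all(self.down.values())


conns = NodeConnections(["Cluster Network 1", "Cluster Network 2"])
for _ in range(5):
    conns.heartbeat_result("Cluster Network 1", replied=False)
print(conns.down["Cluster Network 1"])  # True: one connection is down
print(conns.node_removed())             # False: Cluster Network 2 is still up
```

Note that a single lost packet does not mark the connection down; it takes the full run of consecutive failures on every available network before the 1135 event fires.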

Now that we know how the heartbeat process works, what are some of the known causes for the process to fail?

1. Actual network hardware failures. If the packet is lost on the wire somewhere between the nodes, then the heartbeats will fail. A network trace from both nodes involved will reveal this.

2. The profile for your network connections could possibly be bouncing from Domain to Public and back to Domain again. During the transition of these changes, network I/O can be blocked. You can check to see if this is the case by looking at the Network Profile Operational log. You can find this log by opening the Event Viewer and navigating to: Applications and Services Logs\Microsoft\Windows\NetworkProfile\Operational. Look at the events in this log on the node that was mentioned in the Event ID: 1135 and see if the profile was changing at this time. If so, please check out the KB article “The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2”.

3. You have IPv6 enabled on the servers, but have the following two rules disabled for Inbound and Outbound in the Windows Firewall:

  • Core Networking - Neighbor Discovery Advertisement
  • Core Networking - Neighbor Discovery Solicitation

4. Anti-virus software could be interfering with this process also. If you suspect this, test by disabling or uninstalling the software. Do this at your own risk because you will be unprotected from viruses at this point.

5. Latency on your network could also cause this to happen. The packets may not be lost between the nodes, but they may not get to the nodes fast enough before the timeout period expires.

6. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. If there are switches, firewalls, or routers not configured properly to allow this traffic through, you can see issues like this.
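To illustrate just the transport, here is a minimal UDP request/response round trip using Python's standard socket API. The real heartbeat payload and protocol are internal to the Cluster service; this sketch only demonstrates the kind of unicast traffic that must be allowed through on UDP port 3343. It binds an ephemeral localhost port (rather than 3343 itself) so it can run anywhere:

```python
import socket
import threading

CLUSTER_HEARTBEAT_PORT = 3343  # real clusters use UDP 3343; this demo binds
                               # an ephemeral localhost port instead


def responder(sock):
    """Echo a reply for each heartbeat request, as the remote node would."""
    data, addr = sock.recvfrom(64)
    if data == b"heartbeat-request":
        sock.sendto(b"heartbeat-reply", addr)


# "Remote node" socket on an ephemeral port (stand-in for UDP 3343)
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
threading.Thread(target=responder, args=(server,), daemon=True).start()

# "Local node" sends a request and waits for the response; no response
# within the timeout would count as one failed heartbeat
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(1.0)
client.sendto(b"heartbeat-request", server.getsockname())
try:
    reply, _ = client.recvfrom(64)
    print("heartbeat ok:", reply == b"heartbeat-reply")
except socket.timeout:
    print("heartbeat ok: False")  # a dropped or blocked packet looks like this
```

A firewall or router that silently drops UDP 3343 produces exactly the timeout path above, which from the cluster's perspective is indistinguishable from a dead node.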

These are the main reasons that these events are logged, but there could be other causes as well. The point of this blog is to give you some insight into the process and some ideas of what to look for. Some administrators raise the following settings to their maximum values to try to make the problem stop.

Parameter              Default            Range
SameSubnetDelay        1000 milliseconds  250-2000 milliseconds
CrossSubnetDelay       1000 milliseconds  250-4000 milliseconds
SameSubnetThreshold    5                  3-10
CrossSubnetThreshold   5                  3-10

Increasing these values to their maximum may make the event and node removal go away, but it just masks the problem.  It does not fix anything.  The best thing to do is find out the root cause of the heartbeat failures and get it fixed.  The only real need for increasing these values is in a multi-site scenario where nodes reside in different locations and network latency cannot be overcome.
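The practical effect of these settings is the detection window: roughly the heartbeat delay multiplied by the threshold of consecutive failures before a connection is marked down. A quick back-of-the-envelope calculation (my arithmetic from the table above, not an official formula) shows why the defaults flag a failed connection in about five seconds, and why maxing the settings only stretches that window:

```python
def detection_window_ms(delay_ms, threshold):
    """Approximate time before a connection is marked down:
    one heartbeat every delay_ms, threshold consecutive failures."""
    return delay_ms * threshold


# Defaults: SameSubnetDelay=1000 ms, SameSubnetThreshold=5
print(detection_window_ms(1000, 5))   # 5000 ms, about 5 seconds
# Maximums: SameSubnetDelay=2000 ms, SameSubnetThreshold=10
print(detection_window_ms(2000, 10))  # 20000 ms, about 20 seconds
# Cross-subnet maximums: CrossSubnetDelay=4000 ms, CrossSubnetThreshold=10
print(detection_window_ms(4000, 10))  # 40000 ms, about 40 seconds
```

Stretching the window from 5 to 20 (or 40) seconds also means a genuinely failed node stays in membership that much longer before failover begins, which is another reason to fix the underlying network problem instead.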

I hope that this post helps you!

Thanks,
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group

