DBInsight’s Blogs

VMware Snapshot Kills Availability Group Cluster

Posted by Rob Risetto on January 20, 2016

Recently I was asked to investigate recurring cluster failures underlying a SQL Server 2012 AlwaysOn Availability Group. The topology was a two-node Windows Server 2012 cluster with a file share tie-breaker, using the Node and File Share Majority quorum model. The cluster servers were VMware 5.1 guests, and SQL Server 2012 Enterprise was the SQL engine version/edition.

Roughly once a day the Windows cluster would fail: the event messages indicated that communications were lost between the servers, then quorum was lost, and finally the cluster service was terminated. Pretty drastic, right?

Typical event messages looked like this:

Cluster node ‘XXXXXXXX’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

The Cluster Service service terminated with the following service-specific error:

A quorum of cluster nodes was not present to form a cluster.

As part of the investigation I retrieved the cluster log from each server using the following PowerShell commands:

Import-Module FailoverClusters

Get-ClusterLog -Node <servername> -Destination "C:\temp" -TimeSpan 3000 -UseLocalTime

I went back 3000 minutes and specified the -UseLocalTime option so that the timestamps were easier to correlate than UTC timestamps.

The cluster log messages below indicate that network connectivity was lost between the nodes, but interestingly the connections were gracefully closed ("GracefulClose(1226)"), i.e. shut down by an application-layer process rather than dropped by a hardware fault.

000017e8.000021b4::2014/06/18-21:09:47.403 INFO  [CHANNEL fe80::9414:1d47:56b7:7f83%14:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_IO_PENDING(997)

000017e8.000021b4::2014/06/18-21:09:47.403 WARN  [PULLER Server 2] ReadObject failed with 'GracefulClose(1226)' because of 'channel to remote endpoint fe80::9414:1d47:56b7:7f83%14:~3343~ is closed'

000017e8.000021b4::2014/06/18-21:09:47.403 ERR   [NODE] Node 1: Connection to Node 2 is broken. Reason 'GracefulClose(1226)' because of 'channel to remote endpoint fe80::9414:1d47:56b7:7f83%14:~3343~ is closed'

The following messages show that the closed connections caused the cluster to lose quorum, after which the Cluster service shut down.

000017e8.00000d94::2014/06/18-21:09:48.307 INFO  [GUM] Node 1 some of the active nodes went down. Launching dummy update
000017e8.00000d94::2014/06/18-21:09:48.307 INFO  [RCM] director node 2 went down, resetting director to null
000017e8.00000d94::2014/06/18-21:09:48.307 INFO  [RCM] director node changed from 0 to 1
000017e8.00000d94::2014/06/18-21:09:48.307 ERR   [QUORUM] Node 1: Lost quorum (1)
000017e8.00000d94::2014/06/18-21:09:48.307 ERR   [QUORUM] Node 1: goingAway: 0, core.IsServiceShutdown: 0
000017e8.00000d94::2014/06/18-21:09:48.307 ERR   lost quorum (status = 5925)

After I provided the VMware administrator with the event times, the cause became obvious: the "stun" effect of creating and then removing (consolidating) VMware snapshots of the cluster server VMs. During a stun the guest is briefly frozen, so it stops sending cluster heartbeats for the duration of the freeze.

As a workaround, the cluster heartbeat tolerance was increased so that a short snapshot "stun" no longer exceeds it. The PowerShell statements used to configure the cluster parameters are listed below.

Import-Module FailoverClusters

(Get-Cluster).SameSubnetDelay = 2000

(Get-Cluster).SameSubnetThreshold = 10

(Get-Cluster).RouteHistoryLength = 20
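For context: on Windows Server 2012 the defaults are, to the best of my recollection, SameSubnetDelay = 1000 ms and SameSubnetThreshold = 5, so a node is declared down after about 1000 ms × 5 = 5 seconds of missed heartbeats. The values above stretch that window to 2000 ms × 10 = 20 seconds, long enough to ride out a typical snapshot stun. The new settings can be confirmed with:

```powershell
Import-Module FailoverClusters

# Display the current heartbeat settings (delay in milliseconds, threshold in heartbeats)
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, RouteHistoryLength
```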

This is not the first time I have seen VMware snapshots cause problems for SQL Server. At another site, an application's database queries would time out during the snapshot removal phase, causing the application interface to fail. Sure, the developer could have provided better error handling and longer query timeouts, but they didn't expect the VM to freeze for a prolonged period.
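As a rough illustration only (not the site's actual code), a retry wrapper along these lines would have let the application ride out a short stun; the instance name, database, query, and timeout values here are all hypothetical stand-ins:

```powershell
# Hypothetical retry wrapper for a query that may time out during a snapshot stun.
# Server, database, query and timings are illustrative, not from the original incident.
$maxAttempts = 3
for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        $result = Invoke-Sqlcmd -ServerInstance "SQLVM01" -Database "AppDB" `
            -Query "SELECT TOP 1 OrderId FROM dbo.Orders" -QueryTimeout 60
        break   # success - stop retrying
    }
    catch {
        if ($attempt -eq $maxAttempts) { throw }   # give up after the last attempt
        Start-Sleep -Seconds 15                    # wait out a possible VM stun, then retry
    }
}
```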
