Windows cluster packet loss causing failover
The Windows failover cluster diagnostic logs may show error 2051, along with an entry like this:
[RES] SQL Server <SQL Server (OPSQL19C1VS1)>: [sqsrvres] Failure detected, diagnostics heartbeat is lost
To check, use perfmon to add the counter “Network Interface\Packets Received Discarded”. If it has a non-zero value, the network adapter is discarding incoming packets and we have an issue.
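To fix, increase the following receive buffer settings in the advanced properties of the VMXNET3 network adapter (Device Manager, adapter Properties, Advanced tab):
- Small Rx Buffers: increase the value (the maximum is 8192).
- Rx Ring #1 Size: increase the value (the maximum is 4096).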
With larger buffers, packets that cannot be processed immediately are held in the buffer and processed when possible instead of being discarded. There can be a small performance impact.
These settings do not require a reboot, but applying them may cause a brief network drop, so the change should be made during maintenance hours.
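The buffer change can also be scripted rather than clicked through. This is a rough sketch that builds PowerShell `Set-NetAdapterAdvancedProperty` commands for the two settings above. It assumes the driver exposes them under exactly those display names and accepts the maximum values as display values. The adapter name `Ethernet0` and the helper names are hypothetical. By default it only prints the commands.

```python
import subprocess

# Display names and maximum values as listed in this note.
RX_SETTINGS = {
    "Small Rx Buffers": 8192,
    "Rx Ring #1 Size": 4096,
}


def ps_quote(value):
    """Wrap a value in PowerShell single quotes, doubling embedded quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_commands(adapter_name, settings=RX_SETTINGS):
    """Return one PowerShell command string per advanced property."""
    return [
        "Set-NetAdapterAdvancedProperty"
        f" -Name {ps_quote(adapter_name)}"
        f" -DisplayName {ps_quote(display_name)}"
        f" -DisplayValue {ps_quote(value)}"
        for display_name, value in settings.items()
    ]


def apply_settings(adapter_name, dry_run=True):
    """Print the commands, or run them when dry_run is False (Windows only)."""
    for command in build_commands(adapter_name):
        if dry_run:
            print(command)
        else:
            # Expect a brief network drop while the adapter applies the change.
            subprocess.run(
                ["powershell", "-NoProfile", "-Command", command], check=True
            )
```

Calling `apply_settings("Ethernet0")` prints the two commands for review. Pass `dry_run=False` only during a maintenance window.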
References:
- Nodes being removed from Failover Cluster membership on VMWare ESX? | Microsoft Learn
- Large packet loss in the guest OS using VMXNET3 in ESXi (2039495) (vmware.com)
- windows - VMXNET3 receive buffer sizing and memory usage - Server Fault