The curious case of the slowly starting cluster

I came across an interesting case today that I thought was worth a quick blog post. I was working with a customer who was migrating onto their newly built SQL Server 2014 clustered instance. Part of the pre-migration testing was to fail the SQL Server service over between the two nodes to ensure it started correctly. What we found was that the SQL Server service wouldn’t accept incoming client connections, or appear “online” in Failover Cluster Manager, for around 75 seconds. Given the number of databases, virtual log files and resources involved, I would have expected failover to happen much more quickly.

Reviewing the SQL Server error log, we found that many scripts “upgrading databases” were being executed before the message “SQL Server is now ready for client connections. This is an informational message; no user action is required.” appeared. On further investigation, we found that one of the nodes was running SQL Server 2014 Service Pack 1 while the other was running SQL Server 2014 RTM.

To confirm this, we ran the following query on the currently active node:

SELECT
    SERVERPROPERTY('ProductVersion') AS ProductVersion,
    SERVERPROPERTY('ProductLevel') AS ProductLevel,
    SERVERPROPERTY('Edition') AS Edition;

We then failed the service over onto the other node and ran the query again. The product level on the two nodes was different: one reported RTM, the other SP1. Once the RTM node was upgraded to SP1, we were able to fail the SQL Server service over between the two nodes in under 15 seconds.
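
When repeating this check before and after a failover, it can help to record which physical node answered the query. The following is a minimal sketch rather than the exact query we used: it assumes you want the node name alongside the version details, and the column aliases are my own choice. ComputerNamePhysicalNetBIOS is a standard SERVERPROPERTY that, on a clustered instance, returns the name of the node currently hosting the instance.

```sql
-- Sketch: report the node currently hosting the instance together with its patch level.
-- Run once on each node (fail over in between) and compare the results.
SELECT
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS ActiveNode,      -- node that currently owns the instance
    SERVERPROPERTY('ProductVersion')              AS ProductVersion,  -- build number
    SERVERPROPERTY('ProductLevel')                AS ProductLevel;    -- e.g. RTM or SP1
```

If ProductVersion or ProductLevel differs between the two runs, the nodes are not at the same patch level.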

Always make sure that nodes in your SQL Server clusters are running at the same patch level!