At one of my customers, the deployment of some new test-servers uncovered a nice chain of issues. Some Windows Server 2012 R2 machines were deployed but the application they were intended for, could not connect to some of the SQL Servers (on a cluster). However, when trying to reproduce the problem for a specific case made of this problem, the application did work as expected. Then after Windows updates were applied to the cluster, the case was reopened again and now reproduction succeeded (as in the application indeed failed to connect).
Troubleshooting at that stage narrowed it down to the fact that Windows 2012 R2 servers could not connect due to authentication issues, whereas Windows Server 2008 R2 machines with the same application had no issues. Must be Windows 2012 R2, or a policy setting that is different. It turned out to be the latter of the two and it was concluded that the policy setting “Network Security: Restrict NTLM: Outgoing NTLM traffic to remote servers” to Deny all, caused the behavior. Bottom line, Kerberos authentication didn’t work, NTLM did.
Now with Kerberos authentication to SQL Server being an issue, the case followed its way to me. Well, Kerberos authentication should work. All clustered SQL Servers were configured and tested for Kerberos. But a little more about this particular cluster;
- 4 Nodes (Windows Server 2008 R2). Originally a 2-node cluster, extended about a year ago.
- 8 Virtual SQL Servers (2008 R2). Install dates of those virtual servers varying between 3 years to 1 month ago.
- Each virtual server is installed using a domain account for that particular server.
- Install is pretty much default, so the installer determines the dynamic tcp-port.
- After installation, the required SPNs are created.
- The state for this cluster is TEST/ACCEPTANCE.
This particular case included two clustered SQL Servers having the problem;
To verify the situation, I connected to the node for INS2 and confirmed there were SPNs registered for the service accounts for INS2 and INS5 (SetSPN –L “domain_name\svc_acnt_name”) which returns results like
C:\Users\admin_075>SetSPN -L "organization\srv_sql_ins2"
Registered ServicePrincipalNames for CN=SRV_SQL_INS2,OU=CL2_SQL,OU=SQL_Clusters,OU=Servers,DC=organization,DC=local:
In other words, it looked just like it should, the SPN for INS2 was 12345 and for INS5 was 54321. Next I opened the SQL Server Configuration Manager and retrieved the TCP Port for IPAll for both INS2 and INS5. INS5 was at TCP Dynamic Ports = 54321 (as expected), but INS2 was at TCP Port = 1433! Now this explains why INS2 isn’t working on Kerberos, it’s not registered for 1433. Of course there now is the question why INS2 is forced to listen on 1433 instead of 54321, but at that stage I was more concerned with why INS5 wasn’t authenticating on Kerberos despite the port configuration. Since the Windows Server I currently was looking at, was the active host for INS2 but not for INS5, I connected to the host for INS5 and checked the TCP Port settings. There is was in for yet another surprise: INS2 = tcp-port 12345 and INS5 = tcp-port 1433!
This is the situation that is not supposed to be!!! The active hosts for the respective virtual servers listen at 1433, but the passive hosts are correct (well, turned out INS2 had 2 hosts configured for 1433). I learned these were probably remnants from earlier troubleshooting activities. However I’m still surprised as in a cluster, when one edits the tcp-port number on the active host, it should propagate with a failover. So I edited the port on the active host for INS2 to be 12345 instead of 1433 and brought the resource offline and online again, port remained at 12345 (= OK). Next I did a failover to the other host that still had 1433 listed, expected that the port would become 12345. It didn’t, it remained at 1433. What is broken here (or has never worked correctly at all)?
The broken, or better said, missing part turned out to be checkpoints. Not that this is the first case of missing checkpoints SQL Server as it seems, the way to fix them is hidden inside the kb-article How to change SQL Server parameters in a clustered environment when SQL Server is not online. The title seems to suggest that this only applies when SQL is not online, but the described method 2 has no restriction on SQL Server being online or not. After checking with cluster . resource /checkpoints it turned out that none of the SQL Server Services had a checkpoint. And since the path HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.INS2\MSSQLServer\SuperSocketNetLib\Tcp\IPAll wasn’t covered by the another checkpoint, it is obvious why tcp-port numbers were inconsistent throughout the nodes. After adding a checkpoint for INS2 (from the host where INS2 was running on 1433) and editing the tcp-port number to become a static (not dynamic as on the other host) 12345. Next I conducted a failover to the previous host and verified that it too now had a static IPAll for port 12345. Issue of the missing checkpoints solved. The other 7 instances went the same way.
CAUTION: At step 4 for Method 2, for SQL Server 2008 the syntax is listed as cluster . resource "SQL Server (<InstanceName>)" /addcheckpoints:"
HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER". Be sure to leave out the HKEY_LOCAL_MACHINE\ part as this results in invalid syntax and take a good look how to replace the MSSQL.x as this would apply to SQL Server 2005, whereas MSSQL10.<InstanceName> would apply to SQL 2008 and MSSQL10_50.<InstanceName> to SQL Server 2008 R2. Hopefully this will be updated based on the feedback I submitted.
For more background on the checkpoints, I found the Windows 2008 (R2) documentation on clustering is rather limited to the “What’s new” stuff and to be honest, even the Windows 2003 documentation is not as clear on the subject as Exchange 2003 on Windows Cluster Architecture (sections Checkpoint Manager and Database Manager under Cluster Service Components).