I am a SQL Server DBA and came across a strange scenario. Our cluster had 10 nodes, 5 primary and 5 secondary, and each node hosted a SQL role. For example, Ams1pd11 to Ams1pd15 were the primaries, and on the Ams3 side pd11 to pd15 were the secondaries. In this scenario the whole cluster behaved abnormally: 2 nodes went down and the availability groups on all nodes became inaccessible, leading to a multi-customer outage.
Here is the scenario as it unfolded in real time while I was on shift:
Ams1pd12, which was hosting primary role A, went down. Role A was automatically failed over to the best possible node, and the cluster chose Ams1pd11.
But Ams1pd11 already had one role on it, say B. So with Ams1pd12 down, Ams1pd11 was now hosting both A and B.
Having both primary roles on one node was a risk, so to balance things I failed role B over to one of the secondary nodes on the Ams3 side. Now it was balanced, and I was just about to investigate why Ams1pd12 had gone down.
But suddenly Ams1pd11 also went down, and the role it was hosting did not fail over; it got stuck there.
So now 2 nodes out of 10 were down and one role was stuck, so the customers on that role were impacted.
While troubleshooting this with Microsoft, we noticed that the other nodes were showing as up, but the availability groups on all of them were stuck: they would not open and stayed in the expanding state.
So all the nodes in that cluster were impacted, and because of this our backups stopped. There was data loss.
The one stuck role and the customers on that node faced only 15 minutes of data loss, because the service was down for us and for them at the same time.
But for the nodes that were showing as up with inaccessible availability groups, the weird thing was that users who were already logged in could still modify data. Only new connections were refused; existing connections stayed active.
So if the issue started and backups stopped at 7 am, and most customers could still access the database until 6 pm, there were 11 hours of data loss.
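To make that arithmetic concrete, here is a minimal Python sketch of how the data-loss window works out. The helper name `data_loss_window` and the calendar date are made up for illustration; only the 7 am and 6 pm clock times come from the incident.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, last_write: datetime) -> timedelta:
    """Span of changes not covered by any backup (hypothetical helper)."""
    # If writes stopped before the last backup, nothing is unprotected.
    return max(last_write - last_backup, timedelta(0))


# The calendar date is a placeholder; only the clock times are from the incident.
backups_stopped = datetime(2000, 1, 1, 7, 0)   # backups stopped at 7 am
last_user_write = datetime(2000, 1, 1, 18, 0)  # existing sessions wrote until 6 pm

window = data_loss_window(backups_stopped, last_user_write)
print(window.total_seconds() / 3600)  # 11.0 hours
```

The point of the sketch is that the exposure grows for as long as existing connections keep writing after backups have stopped, which is why the nodes that stayed up lost more data than the stuck role.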
It took a lot of effort to restore the databases manually. The databases on the online nodes were easy to recover, as we just had to attach them after migrating the data and log files, but the databases on the stuck role had to be recovered manually.
Please suggest the best strategies to follow in this type of disaster, and how we can achieve a speedy recovery.