Friday, March 19, 2010

Exchange 2010 DAG datacenter failure

One of the neat features in Exchange 2010 is the DAG (Database Availability Group). Most of its built-in datacenter failover features seem to be designed for solutions where the datacenters are spread across AD sites, but a lot of companies out there don't have multiple AD sites.
One customer I'm at right now doesn't, so we are testing the datacenter failure scenario. My experience is that getting the solution up and running again takes some manual steps, and it's important to do them in the right order. This is described below.

The scenario: one datacenter fails, and the cluster loses its majority (Majority Node Set quorum).
The failed datacenter contains 2 MBX servers and 1 HUB/CAS server, which is the FSW (file share witness) for the DAG.
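
Why the surviving datacenter can't keep the cluster up comes down to vote counting. Here is a minimal sketch in Python, not anything Exchange itself runs. It assumes a hypothetical DAG with two MBX servers in each datacenter (the number of members in the secondary datacenter is my assumption), and that the witness only gets a vote when the DAG has an even number of members, which is my understanding of how the DAG picks its quorum model. The function names are made up for illustration.

```python
def votes_needed(total_votes):
    """A strict majority: more than half of all configured votes."""
    return total_votes // 2 + 1


def has_quorum(nodes_total, nodes_up, witness_up):
    """Return True if the surviving voters still form a majority.

    Assumption: the file share witness counts as one vote only when
    the number of DAG members is even.
    """
    uses_witness = nodes_total % 2 == 0
    total_votes = nodes_total + (1 if uses_witness else 0)
    votes_up = nodes_up + (1 if uses_witness and witness_up else 0)
    return votes_up >= votes_needed(total_votes)


# Hypothetical 4-member DAG: the failed datacenter takes 2 MBX servers
# and the FSW with it, so only 2 of 5 votes remain.
lost_primary = has_quorum(nodes_total=4, nodes_up=2, witness_up=False)

# Same DAG if the witness had been in the surviving datacenter instead.
kept_witness = has_quorum(nodes_total=4, nodes_up=2, witness_up=True)
```

With these assumed numbers, two surviving votes out of five is not a majority, which is why the cluster has to be forced to start in the steps that follow.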

To recover from this failure, these are the steps:
- Stop the Cluster service on the remaining DAG members in the secondary datacenter
- On one DAG member, run net start clussvc /forcequorum
o In my case, the databases were already mounted at this point

Changing the FSW on the DAG at this point will not work, because the DAG can't communicate with the failed DAG servers.
To remove the affected DAG members and change the FSW, you have to complete the following:
- Start the cluster admin tool and evict the failed DAG member servers
- Remove all database copies on the failed servers
o Get-MailboxDatabaseCopyStatus -Server failedmbxserver
o Remove-MailboxDatabaseCopy databasename\failedmbxserver
- Remove-DatabaseAvailabilityGroupServer DAGNAME -MailboxServer failedmbxserver

Now you will be able to change the FSW for the DAG.
- Set-DatabaseAvailabilityGroup DAGNAME -WitnessServer FQDN

Now everything should be cleaned up and in order.

One thing I noted is what happens when you do these steps in the wrong order, that is, running Remove-DatabaseAvailabilityGroupServer before evicting the nodes from the cluster.
The databases that had a copy on the failed MBX server ended up with the following incorrect values in their properties. This was hell to clean up.

Get-MailboxDatabase databasename | fl

Server : ActiveMBXServer
MasterServerOrAvailabilityGroup : FAILEDMBXServer
MasterType : Server

To clean this up, I had to rejoin the failed MBX servers to the DAG and enable database replication.
Even though the replication seemed to work fine, it didn't. It wasn't possible to switch the active database between the servers in the DAG because of these properties on the database.
I then removed the old failed MBX server from the DAG again, so the existing database was left on a single MBX server. After rejoining the MBX server to the DAG, the properties were OK.

When a database is a member of a DAG, the properties should look like this:

Server : ActiveMBXServer
MasterServerOrAvailabilityGroup : DAGNAME
MasterType : DatabaseAvailabilityGroup
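
If you save the fl output to a text file, the check is easy to automate. This is a minimal sketch in Python that assumes the "Name : Value" layout shown above. The helper names parse_fl and is_dag_owned are made up for this example and are not part of any Exchange tooling.

```python
def parse_fl(text):
    """Turn 'Name : Value' lines from Format-List output into a dict."""
    props = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or wrapped lines
        name, _, value = line.partition(":")
        props[name.strip()] = value.strip()
    return props


def is_dag_owned(props, dag_name):
    """True when the database's master is the DAG itself, not a server."""
    return (
        props.get("MasterType") == "DatabaseAvailabilityGroup"
        and props.get("MasterServerOrAvailabilityGroup") == dag_name
    )


# The two states described in this post, using the same placeholder names.
BROKEN = """Server : ActiveMBXServer
MasterServerOrAvailabilityGroup : FAILEDMBXServer
MasterType : Server
"""

HEALTHY = """Server : ActiveMBXServer
MasterServerOrAvailabilityGroup : DAGNAME
MasterType : DatabaseAvailabilityGroup
"""

broken_ok = is_dag_owned(parse_fl(BROKEN), "DAGNAME")
healthy_ok = is_dag_owned(parse_fl(HEALTHY), "DAGNAME")
```

Running something like this against every database after a recovery would flag the ones that still point at the failed server before you try a switchover.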

This is my experience with the datacenter failure, but I'll post more later.
