Root: Find and fix the root cause code
Recover: Recover the client experience
Repair: Repair the symptom
Remove: Remove complexity
Easier deployment and management
Fewer things that can fail
Forget about replacing disks as they fail
Probability you’ll need to replace more than monthly:
=(1-BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE))*servers
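The spreadsheet formula above can be reproduced outside Excel. The following is a minimal sketch in Python, assuming that `BINOM.DIST(k, n, p, TRUE)` is the cumulative binomial probability P(X <= k) and that AFR is an annualized failure rate expressed as a fraction. The function names and the input values are hypothetical and chosen only for illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); mirrors BINOM.DIST(k, n, p, TRUE)."""
    if k < 0:
        return 0.0
    k = min(k, n)  # the cumulative probability is 1 once k reaches n
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def monthly_replacement_estimate(spares: int, disks_per_server: int,
                                 afr: float, servers: int) -> float:
    """Direct translation of the slide formula:
    (1 - BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE)) * servers
    """
    monthly_failure_rate = afr / 12  # AFR spread evenly over 12 months
    tail = 1 - binom_cdf(spares + 1, disks_per_server, monthly_failure_rate)
    return tail * servers


# Hypothetical inputs, not taken from the deck.
estimate = monthly_replacement_estimate(
    spares=1, disks_per_server=12, afr=0.05, servers=100
)
print(f"{estimate:.6f}")
```

Raising `spares` shrinks the tail probability, which is the point of the slide: with enough spares per server, disk replacement can be batched rather than done per failure.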
One Copy Min Per Database
[Chart: HA weekly recovery actions per server, Feb '10 to Nov '13. Series: ServerOneCopyInternalMonitorServiceRestart, ServiceHealthMSExchangeReplEndpointRestart, ClusterEndpointRestart, ServiceHealthMSExchangeReplEndpointFailover, ServerOneCopyInternalMonitorForceReboot, ServiceHealthActiveManagerRestartService, ServiceHealthActiveManagerForceReboot, ServiceHealthMSExchangeReplForceReboot, ServiceHealthMSExchangeReplEndpointRestartSecondTrial]
Restart service
Server failover
Reboot
Monitor engine: contains business logic to evaluate health of customer-impacting features
Probe engine: data collection and notifications mechanism, feeding into…
Responder engine: set of recovery actions that can be taken to recover degraded state of the monitored resource
Get-ServerHealth
Get-HealthReport
DatabaseHealthDbCopyStalledMonitor
ClusterHangMonitor
EseDbTimeTooNewMonitor
EseDbTimeTooOldMonitor
EseInconsistentDataMonitor
EseLostFlushMonitor
StorageDbIoHardFailureItemMonitor
LowLogVolumeSpaceMonitor
DatabaseHealthUnMonitoredDatabaseMonitor
OneCopyMonitor
UNHEALTHY
…
Maximum Active Databases
Dismount instead of failover
Maximum Preferred Actives
Designed optimum
Optimized for load
Still allows mount
Result of RedistributeActiveDatabases.ps1
Example: 14
Example: 19
Example: 28
Move-ActiveMailboxDatabase -SkipMaximumActiveDatabaseChecks skips both
Tool: Suspend-MailboxDatabaseCopy
Parameter: ActivationOnly
Value: N/A
Instance: Per database copy
Usage:
• Keep active off a working but questionable drive

Tool: Set-MailboxServer
Parameter: DatabaseCopyAutoActivationPolicy
Value: “Blocked” or “Unrestricted”
Instance: Per server
Usage:
• Used to control active/passive SR configurations and maintenance
• Can force admin move

Tool: Set-MailboxServer
Parameter: DatabaseCopyActivationDisabledAndMoveNow
Value: $true or $false
Instance: Per server
Usage:
• Used to do faster site failovers and maintain database availability
• Databases are not blocked from failing back
• Continuous move-off operation
[Diagram series: cluster members failing one at a time (X marks a failed member; 0 and 1 are the vote values shown per member). Labels in order: "Majority of 7 required" (4 required), "Majority of 3 required", "Majority of 2 required"]
Get-ClusterNode

Name  DynamicWeight  NodeWeight  State
----  -------------  ----------  -----
EX1   1              1           Up
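The "Majority of N required" labels follow simple arithmetic: more than half of the current votes must be present. The following is a minimal sketch in Python, assuming a strict-majority rule in which every counted member holds one vote. The function names are hypothetical, and the sketch does not model how the cluster adjusts DynamicWeight on its own.

```python
def votes_required(total_votes: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True when the votes still present form a majority."""
    return votes_present >= votes_required(total_votes)


for total in (7, 3, 2):
    print(total, votes_required(total))
# 7 4
# 3 2
# 2 2
```

The first line of output matches the slide label: with 7 votes, 4 are required.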
Deployment scenario: DAG(s) deployed in a single datacenter
Recommendation: Locate witness server in the same datacenter as DAG members; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two datacenters; no additional locations available
Recommendation: Locate witness server in primary datacenter; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two+ datacenters
Recommendation: Locate witness server in third location; can share one server across DAGs
Windows Server 2012 R2 and later
Witness Offline: Witness vote gets removed by the cluster
Witness Failure: Witness vote gets removed by the cluster
Witness Online: If necessary, Witness vote is added back by the cluster
[Diagram: cas1 and cas2 in Redmond; cas3 and cas4 in Portland]
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur.
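Why the witness matters in this scenario can be checked with a vote count. The following is a minimal sketch in Python, assuming static votes: one per DAG member and one for the witness, five in total. The site assignment of mbx1 to mbx4 follows the diagram labels. The function names are hypothetical, and dynamic quorum adjustments are not modeled.

```python
# Site of each DAG member, as labeled in the diagram.
MEMBERS = {
    "mbx1": "Redmond",
    "mbx2": "Redmond",
    "mbx3": "Portland",
    "mbx4": "Portland",
}
WITNESS_VOTES = 1  # assumption: the witness counts as a single vote


def votes_required(total_votes: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return total_votes // 2 + 1


def site_keeps_quorum(site: str, witness_reachable: bool) -> bool:
    """Can the members of one site alone hold a majority of all votes?"""
    total_votes = len(MEMBERS) + WITNESS_VOTES
    surviving = sum(1 for member_site in MEMBERS.values() if member_site == site)
    if witness_reachable:
        surviving += WITNESS_VOTES
    return surviving >= votes_required(total_votes)


print(site_keeps_quorum("Portland", True))   # True: 3 of 5 votes
print(site_keeps_quorum("Portland", False))  # False: 2 of 5 votes
```

When the surviving site cannot reach the witness, it holds only 2 of 5 votes, which is the case where a manual datacenter switchover is needed.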
[Diagram: mbx1, mbx2, mbx3, mbx4, and a witness server across Redmond and Portland]
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
[Diagram: mbx1 and mbx2 in Redmond; mbx3 and mbx4 in Portland]