Find and fix the Root cause code
Recover the client experience
Repair the symptom
Remove complexity

root, recover, repair, remove
Remove complexity:
• Easier deployment and management
• Fewer things that can fail

Forget about replacing disks as they fail

Probability you’ll need to replace more than monthly:
=(1-BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE))*servers

[Charts: "One Copy Min Per Database" and "HA weekly recovery actions per server", plotted monthly from Feb '10 to Nov '13. Series: ServerOneCopyInternalMonitorServiceRestart, ServiceHealthMSExchangeReplEndpointRestart, ClusterEndpointRestart, ServiceHealthMSExchangeReplEndpointFailover, ServerOneCopyInternalMonitorForceReboot, ServiceHealthActiveManagerRestartService, ServiceHealthActiveManagerForceReboot, ServiceHealthMSExchangeReplForceReboot, ServiceHealthMSExchangeReplEndpointRestartSecondTrial.]

Recovery actions:
1. Restart service
2. Server failover
3. Reboot
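The spreadsheet formula for spare-disk planning can be reproduced outside Excel. The following is a minimal Python sketch, assuming BINOM.DIST(k, n, p, TRUE) is the cumulative binomial probability P(X <= k) and that disks fail independently with a monthly probability of AFR/12. The function names and the sample fleet values are hypothetical and are not taken from the slides. Read literally, the formula multiplies a per-server probability by the server count, so the result is an expected number of servers per month rather than a single probability.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), like Excel's BINOM.DIST(k, n, p, TRUE)."""
    if k < 0:
        return 0.0
    k = min(k, n)  # the CDF is 1 once k reaches n
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def servers_exceeding_spares(spares: int, disks_per_server: int,
                             afr: float, servers: int) -> float:
    """Direct translation of:
    =(1-BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE))*servers
    """
    monthly_failure_prob = afr / 12  # annualized failure rate spread over 12 months
    p_exceed = 1 - binom_cdf(spares + 1, disks_per_server, monthly_failure_prob)
    return p_exceed * servers


# Hypothetical inputs: 2 spares, 12 disks per server, 5% AFR, 100 servers.
estimate = servers_exceeding_spares(spares=2, disks_per_server=12,
                                    afr=0.05, servers=100)
print(f"Expected servers per month exceeding their spares: {estimate:.6f}")
```

Raising the number of spares in the call shows how quickly the estimate falls, which is the argument behind not replacing disks as they fail.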
Monitor engine: contains business logic to evaluate health of customer-impacting features
Probe engine: data collection and notifications mechanism, feeding into…
Responder engine: set of recovery actions that can be taken to recover degraded state of the monitored resource

Get-ServerHealth
Get-HealthReport

DatabaseHealthDbCopyStalledMonitor
ClusterHangMonitor
EseDbTimeTooNewMonitor
EseDbTimeTooOldMonitor
EseInconsistentDataMonitor
EseLostFlushMonitor
StorageDbIoHardFailureItemMonitor
LowLogVolumeSpaceMonitor
DatabaseHealthUnMonitoredDatabaseMonitor
OneCopyMonitor
UNHEALTHY …

Maximum Active Databases
Dismount instead of failover
Maximum Preferred Actives
Designed optimum
Optimized for load
Still allows mount
Result of RedistributeActiveDatabases.ps1
Example: 14
Example: 19
Example: 28
Move-ActiveMailboxDatabase -SkipMaximumActiveDatabaseChecks skips both

Tool: Suspend-MailboxDatabaseCopy
Parameter: ActivationOnly
Value: N/A
Instance: Per database copy
Usage:
• Keep active off a working but questionable drive

Tool: Set-MailboxServer
Parameter: DatabaseCopyAutoActivationPolicy
Value: “Blocked” or “Unrestricted”
Instance: Per server
Usage:
• Used to control active/passive SR configurations and maintenance
• Can force admin move

Tool: Set-MailboxServer
Parameter: DatabaseCopyActivationDisabledAndMoveNow
Value: $true or $false
Instance: Per server
Usage:
• Used to do faster site failovers and maintain database availability
• Databases are not blocked from failing back
• Continuous move-off operation

[Diagram sequence: quorum in a seven-member cluster. The labels read "Majority of 7 required" (4 required), then "Majority of 3 required" after three members fail (X), then "Majority of 2 required" as further members fail, with the surviving nodes shown with dynamic weights of 0 and 1 in varying arrangements.]

Get-ClusterNode

Name  DynamicWeight  NodeWeight  State
----  -------------  ----------  -----
EX1               1           1  Up

Deployment scenario: DAG(s) deployed in a single datacenter
Recommendations: Locate witness server in the same datacenter as DAG members; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two datacenters; no additional locations available
Recommendations: Locate witness server in primary datacenter; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two+ datacenters
Recommendations: Locate witness server in third location; can share one server across DAGs

Windows Server 2012 R2 and later
Witness Offline: Witness vote gets removed by the cluster
Witness Failure: Witness vote gets removed by the cluster
Witness Online: If necessary, Witness vote is added back by the cluster

[Diagram: cas1 and cas2 in Redmond; cas3 and cas4 in Portland; mbx1, mbx2, mbx3, mbx4 and the witness across Redmond and Portland.]

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur.

[Diagram: mbx1 and mbx2 in Redmond; mbx3 and mbx4 in Portland.]

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service clussvc
3. Activate DAG members in 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland

http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://microsoft.com/msdn
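The majority requirements shown in the quorum slides follow simple vote arithmetic. The following is a minimal Python sketch, assuming a static model in which every DAG member and the witness holds one vote and quorum needs more than half of the configured votes. The function names are hypothetical, and the sketch does not model the dynamic quorum or dynamic witness behavior of Windows Server 2012 and later, where the cluster adjusts votes as nodes fail.

```python
def votes_required(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority of total_votes."""
    return total_votes // 2 + 1


def has_quorum(votes_online: int, total_votes: int) -> bool:
    """True when the surviving voters still form a majority."""
    return votes_online >= votes_required(total_votes)


# Seven voters, as in the first quorum diagram: 4 are required.
print(votes_required(7))

# Four-member DAG plus witness (5 votes) after the Redmond members are lost.
# mbx3 and mbx4 are up and one of them can lock the witness: 3 of 5 votes.
print(has_quorum(votes_online=3, total_votes=5))

# Same failure, but the witness is also unreachable: 2 of 5 votes.
print(has_quorum(votes_online=2, total_votes=5))
```

Under this model, the second case keeps quorum and corresponds to the automatic failover described for MBX3 and MBX4, while the third case loses quorum, which is the situation the three-step Stop-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup procedure addresses.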