Exchange Server 2013 High Availability

Scott Schnoll
Exchange Server 2013
High Availability
Agenda
• DAG Architecture
• HA Changes in Exchange 2013
• Monitoring and Server Maintenance
• Best Copy and Server Selection
DAG Architecture
DAG Replication Service
• Introduced in Exchange 2007 RTM
• Microsoft Exchange Replication service | MSExchangeRepl
• MSExchangeRepl.exe
• Runs on all Mailbox servers (not just DAG members)
• Communicates with Active Directory and other DAG members
• Includes 16 components
• Active Directory lookup
• Replay RPC server wrapper
• TPR API manager
• Copy status lookup
• Remote data provider wrapper
• Support API manager
• Replay core manager
• VssWriter
• Server locator manager
• Seed manager
• Active Manager
• Health state tracker
• Autoreseed manager
• Active Manager RPC server wrapper
• Disk reclaimer manager
• Failure item manager
DAG Management Service
• Introduced in Exchange 2013 RTM CU2
• Microsoft Exchange DAG Management service | MSExchangeDagMgmt
• MSExchangeDagMgmt.exe
• Runs on all Mailbox servers (not just DAG members)
• Communicates with Active Directory and other DAG members
• Includes 4 components
• Active Directory lookup
• Copy status lookup
• Monitoring
• Tracer instance
DAG Management Service
• Writes events to same place as Replication service
• Application event log (source of MSExchangeRepl)
• HighAvailability crimson channel
• Created for two primary reasons:
• so the Replication service can have more focused functionality
• so Managed Availability actions can kill lower-priority activities
• Other functions will move to this service
• AutoReseed
• Disk Reclaimer
• Future AutoDAG copy layout and mobility features
Cluster Service
• Introduced in Windows NT Server 4.0, Enterprise Edition (1997)
• Cluster Service | ClusSvc
• Clussvc.exe
• Exchange DAGs use several Cluster components
• Quorum
• Membership and Node Management
• Networks and Heartbeating
• Cluster Registry
Cluster Service
• Quorum is required in order to mount databases
• Quorum is based on votes, not membership
• Voting can be rigged
• Votes can be taken away manually or dynamically
• Exchange manages quorum model, not quorum
• Exchange management of quorum model based on nodes, not votes
• Removing votes requires manual configuration of quorum model
• Exchange will make incorrect quorum model management decisions if votes are manually removed at the cluster level
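The gap between a node-based quorum model and vote-based quorum can be sketched in Python. This is an illustration only, not Exchange or cluster code: the function names are hypothetical, and the model rule (Node Majority for an odd member count, Node and File Share Majority for an even one) is an assumption about how the quorum model is picked from the node count.

```python
def choose_quorum_model(node_count: int) -> str:
    """Pick the quorum model from the number of nodes (assumed rule, see lead-in)."""
    return "NodeMajority" if node_count % 2 == 1 else "NodeAndFileShareMajority"


def total_votes(node_votes: list[int], model: str) -> int:
    """Total votes: the per-node votes, plus one for the witness if the model uses it."""
    witness = 1 if model == "NodeAndFileShareMajority" else 0
    return sum(node_votes) + witness


def has_quorum(votes_online: int, votes_total: int) -> bool:
    """Quorum needs a strict majority of votes, not of members."""
    return votes_online > votes_total // 2


# Five members, all voting: the model is NodeMajority and the vote total is odd.
votes = [1, 1, 1, 1, 1]
model = choose_quorum_model(len(votes))
healthy_total = total_votes(votes, model)

# One vote removed by hand at the cluster level. The model is still chosen
# from 5 nodes, so no witness is added and the vote total becomes even.
rigged = [1, 1, 1, 1, 0]
rigged_model = choose_quorum_model(len(rigged))
rigged_total = total_votes(rigged, rigged_model)
```

With the vote removed, two of four votes online is a tie rather than a majority, which is the kind of mismatch the last bullet warns about.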
Cluster Registry
• Active Manager stores database / server information in the cluster registry for DAG members
• Registry changes are replicated immediately to all DAG members
• Stored information is used as part of Best Copy and Server Selection (BCSS)
Cluster Registry
IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True*
• ActiveServer
• Name of the server where the database is currently mounted or is expected to be mounted when mount operations complete
• LastMountedServer
• The name of the server where the database was last successfully mounted
• LastMountedTime
• The date and time stamp of the last time the database was mounted
Cluster Registry
IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True*
• MountStatus
• The current mount status for the database
• Possible values are mounted / dismounted
• IsAdminDismounted
• Designates whether the current dismounted status of the database is the result of administrator action
• Possible values are True / False
• IsAutomaticActionsAllowed
• Designates whether the database can be automatically activated by AM
• Possible values are True / False
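The stored value is a flat string of key?value pairs separated by asterisks, so it can be split into a dictionary. A minimal Python sketch follows; the function name is hypothetical, and the sample is the string from the slide with the timestamp written as 2013-07-15T22:29:39, which is an assumption about the original value.

```python
def parse_db_state(raw: str) -> dict[str, str]:
    """Split 'key?value*key?value*' into a dictionary of strings."""
    state = {}
    for pair in raw.split("*"):
        if not pair:
            continue  # the trailing '*' leaves an empty piece
        key, _, value = pair.partition("?")
        state[key] = value
    return state


raw = (
    "IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*"
    "LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*"
    "IsAdminDismounted?False*IsAutomaticActionsAllowed?True*"
)
state = parse_db_state(raw)

# Values stay as strings; compare against "True" / "False" explicitly.
can_auto_activate = state["IsAutomaticActionsAllowed"] == "True"
```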
Cluster Registry
• Last Log
• Entry for each database copy in the DAG (named by the database GUID)
• Stores the sequence number of the last generated log (in decimal)
Crimson Channel
• Applications and Services logs
• Area of Windows Server event log used by applications for logging and internal communication
• These logs store events from a single application or component rather than events that might have system-wide impact
• This is referred to as an application's crimson channel
• Exchange 2013 has multiple channels
• ActiveMonitoring
• HighAvailability
• MailboxDatabaseFailureItems
• ManagedAvailability
• PushNotifications
• Troubleshooters
HA Changes in Exchange 2013
HA Changes in Exchange Server 2013
• Exchange can automatically recover from:
• Disk Failures
• Network Failures
• Server Failures
• Datacenter Failures
• Failover time reduced by 50% compared with Exchange 2010
• 58% faster reseeds when using multiple databases per volume
DAGs without Cluster Admin Access Points
Easier deployment and management
Fewer things that can fail
Lagged Copy Management
Lagged copies can automatically play down log files when:
• A database has a bad page and needs a patch
• There isn’t enough space to keep all the logs
• There is a risk of losing all available copies of a database
AutoReseed overview
AutoReseed – why?
=(1-BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE))*servers
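The spreadsheet formula above can be transcribed into Python for experimentation. This is a sketch under stated assumptions: AFR is read as an annualized disk failure rate (so AFR/12 is a monthly rate), the function names are hypothetical, and the input values in the usage lines are made up for illustration rather than taken from the talk.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), like BINOM.DIST(k, n, p, TRUE).

    Unlike the spreadsheet function, k is clamped to n instead of raising an error.
    """
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def autoreseed_formula(spares: int, disks_per_server: int, afr: float, servers: int) -> float:
    """Direct transcription of the slide's formula."""
    monthly_rate = afr / 12
    return (1 - binom_cdf(spares + 1, disks_per_server, monthly_rate)) * servers


# Illustrative inputs only (not from the slide).
with_one_spare = autoreseed_formula(spares=1, disks_per_server=12, afr=0.05, servers=1000)
with_two_spares = autoreseed_formula(spares=2, disks_per_server=12, afr=0.05, servers=1000)
```

On this reading, the result is the expected number of servers per month on which more than spares + 1 disks fail, so adding spares can only lower it.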
Recovering from storage failures
ESE database hung IO (4 min)
Crimson channel heartbeat (30s)
System disk heartbeat (2 min)
System bad state (5 min)
Long I/O times (.6 min)
MSExchangeRepl.exe memory threshold (4 GB)
Replication service won’t restart (65 min)
Store timeout (1 min)
Cluster service repeated crashes (60 min)
Monitoring and Server Maintenance
Managed Availability
Probe engine: data collection and notification mechanism, feeding into…
Monitor engine: contains business logic to evaluate the health of customer-impacting features
Responder engine: set of recovery actions that can be taken to recover the degraded state of the monitored resource
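The three engines form a pipeline: probes collect data, a monitor turns that data into a health state, and responders act on a degraded state. The Python sketch below only illustrates that flow. Every name in it is hypothetical, the failure threshold is invented, and real responders are staged over time rather than run back to back as they are here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    resource: str
    succeeded: bool


def monitor(results: list[ProbeResult], max_failures: int = 0) -> str:
    """Business logic: turn raw probe results into a health state."""
    failures = sum(1 for r in results if not r.succeeded)
    return "Healthy" if failures <= max_failures else "Unhealthy"


def respond(state: str, recovery_actions: list[Callable[[], str]]) -> list[str]:
    """Run recovery actions, in order, only when the state is degraded."""
    if state == "Healthy":
        return []
    return [action() for action in recovery_actions]


# Hypothetical probe samples and recovery actions.
samples = [ProbeResult("MSExchangeRepl", True), ProbeResult("MSExchangeRepl", False)]
state = monitor(samples)
taken = respond(state, [lambda: "restart service", lambda: "fail over databases"])
```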
Managed Availability
Get-ServerHealth
Get-HealthReport
HA Managed Availability
ServerOneCopyMonitor
Copy is (Healthy || Mounted) &&
ServerComponentState is NOT Offline &&
Copy is NOT Activation Blocked &&
Server is NOT exceeding MaxActive &&
Copy Queue Length < MountDial &&
Server is NOT Activation Disabled
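The conditions above read as a single boolean predicate over one database copy. A minimal Python sketch of that predicate follows; the class and field names are hypothetical stand-ins for the values named on the slide, and how the monitor consumes the result is not shown in the slide, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class CopyState:
    status: str                       # e.g. "Healthy", "Mounted", "Failed"
    server_component_offline: bool    # ServerComponentState is Offline
    activation_blocked: bool          # copy is Activation Blocked
    server_exceeds_max_active: bool   # server is over MaxActive
    copy_queue_length: int
    mount_dial: int                   # threshold the copy queue is compared with
    server_activation_disabled: bool


def meets_one_copy_conditions(c: CopyState) -> bool:
    """True only when every condition listed on the slide holds."""
    return (
        c.status in ("Healthy", "Mounted")
        and not c.server_component_offline
        and not c.activation_blocked
        and not c.server_exceeds_max_active
        and c.copy_queue_length < c.mount_dial
        and not c.server_activation_disabled
    )


good = CopyState("Healthy", False, False, False, 2, 6, False)
lagging = CopyState("Healthy", False, False, False, 9, 6, False)
```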
HA Monitors – ServerOneCopyMonitor
OneCopyMonitor
UNHEALTHY
…
ServiceHealthMSExchangeReplEndpointMonitor
ReplEndpointProbe\RPC
ReplEndpointProbe\TCP
ReplEndpointProbe\ServerLocator
DAG Member Server Maintenance
Mailbox Server has multiple roles installed
Exchange 2013 server maintenance
Set Transport and UM to drain their queues
Set messaging redirection to (preferably) another server in the DAG
Suspend cluster node
Set server to be Activation Disabled
Set server to be Activation Blocked
Set all ServerComponentStates Offline
All ServerComponentStates are offline
Server is activation blocked and activation disabled
Cluster node is “Paused”
Transport queues are empty
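The last four lines describe the state a server should be in once the maintenance steps have completed. A small Python sketch of a checker for that end state follows. It is not an Exchange API: the class, the field names, and the component names in the usage lines are illustrative assumptions, and the state values would have to be collected by other means.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    component_states: dict[str, str]  # component name -> state, e.g. "Offline"
    activation_blocked: bool
    activation_disabled: bool
    cluster_node_state: str           # e.g. "Up" or "Paused"
    queued_messages: int


def maintenance_blockers(s: ServerState) -> list[str]:
    """Return the end-state conditions from the slide that are not yet met."""
    blockers = []
    if any(state != "Offline" for state in s.component_states.values()):
        blockers.append("some ServerComponentStates are not Offline")
    if not (s.activation_blocked and s.activation_disabled):
        blockers.append("server is not both activation blocked and activation disabled")
    if s.cluster_node_state != "Paused":
        blockers.append("cluster node is not Paused")
    if s.queued_messages > 0:
        blockers.append("transport queues are not empty")
    return blockers


ready = ServerState({"HubTransport": "Offline", "UMCallRouter": "Offline"}, True, True, "Paused", 0)
not_ready = ServerState({"HubTransport": "Draining"}, True, False, "Up", 3)
```

An empty list from `maintenance_blockers` means all four conditions hold.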
Best Copy and Server Selection
Best Copy and Server Selection
Still an Active Manager algorithm
Performed at *over (failover or switchover) time
Uses extracted system health
Same replication criteria and phases
Cap replay queue to limit mount time
New max actives soft limit
BCS criteria includes protocol stack health
Protocol health prioritized to control impact
Tuned replication health criteria thresholds
Managed Availability (MA) failover responder targets a server that is not worse than the source
Activation controls
Controls server max load
Controls server usage
Prevent copy activation – questionable database copy?
Load management limits
Hard limit for activation – i.e., worst case
Enforced by BCS
Dismount databases over limit
Control “exceptional failure” load
Set to most databases you want per server
Follow role requirements calculator guidance
Soft limit for activation – added in SP1
Copies deprioritized in BCS
Catalog and copy queue health
Failovers can exceed limit
Load balancing optimizes to this limit
Move-ActiveMailboxDatabase -SkipMaximumActiveDatabasesChecks skips both limits
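The hard and soft limits can be pictured as two thresholds on the number of databases already active on a candidate server. The Python sketch below is an illustration under assumptions, not the Active Manager algorithm: the function names and tier labels are hypothetical, a server at or above the hard limit is treated as ineligible, and a server at or above the soft limit is kept but sorted last.

```python
def activation_tier(active_count: int, soft_limit: int, hard_limit: int,
                    skip_checks: bool = False) -> str:
    """Classify a target server by how many databases it already has active."""
    if skip_checks:
        return "preferred"      # both limits ignored, as with the skip switch
    if active_count >= hard_limit:
        return "blocked"        # hard limit: no further activation
    if active_count >= soft_limit:
        return "deprioritized"  # soft limit: still usable, but ranked last
    return "preferred"


def order_targets(active_counts: dict[str, int], soft_limit: int, hard_limit: int) -> list[str]:
    """Drop blocked servers and list preferred ones before deprioritized ones."""
    tiers = {
        name: activation_tier(count, soft_limit, hard_limit)
        for name, count in active_counts.items()
    }
    allowed = [name for name, tier in tiers.items() if tier != "blocked"]
    # sorted() is stable, so servers keep their input order within a tier.
    return sorted(allowed, key=lambda name: tiers[name] == "deprioritized")


# Hypothetical active-database counts per server.
targets = order_targets({"ex1": 12, "ex2": 4, "ex3": 20, "ex4": 9}, soft_limit=10, hard_limit=20)
```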
Best Copy and Protocol Health
1. All health sets healthy
2. All medium priority health sets and above are healthy
3. All health sets on target are better than source
4. All health sets on target are the same as source
5. Server health not considered
1. Skip target if not better than source
2. All health sets healthy
3. All medium priority and above are healthy
4. All health sets better than source server
Questions?