Spring 2006 Connections Conference Template

Book Drawing
Make sure you
leave me a business
card or a piece of
paper with your
name on it for the
drawing at the end
of the session.
Exchange High Availability
Without Clustering
Jim McBee
ITCS Hawaii
[email protected]
Setting the stage….
“Approximately 80 percent of unplanned
downtime is caused by people and
process issues, while the remainder is
caused by technology failures and
disasters”
-Gartner Group study, March 16, 1999
Who is Jim McBee!!??
• Consultant, Writer, MCSE, MVP and MCT – Honolulu,
Hawaii (Aloha!)
• Principal clients
● USPACOM J2
● USARPAC G6
• Author – Exchange 2003 24Seven (Sybex)
• Contributor – Exchange and Outlook Administrator
• Blog
● http://mostlyexchange.blogspot.com
• Free eBook
● http://nexus.realtimepublishers.com/ttgsm.htm
This session’s coverage
• Introduction to me and the topic
• Presentation – About 60 minutes
• Book give away – Drop off your business
card or write your name on a slip of paper
• Questions and answers – 10 - 15 minutes
Audience Assumptions
• You have at least a few months’ experience running Exchange 5.5, 2000, or 2003.
• You have worked with Active Directory
• You can install and configure a Windows
2000 / 2003 server
Presentation coverage
• Defining…
● Availability, reliability, fault tolerance
● Estimated costs of clustering
• Common causes of downtime
• Your friend, the SLA
• Preventing disasters
• Configuration recommendations
• Minimizing the effects of downtime
• Daily operations
• Backup plans
• This presentation will be posted to my blog after April 30, 2006 – http://mostlyexchange.blogspot.com
If you take nothing else from this
session, take this:
Formula for better availability
• Get good training and have good
reference material
• Set yourself up for predictable operations
• Monitor your system to ensure it stays
within the boundaries you establish
High Availability - 101
• Determine the causes of unplanned
downtime
• Focus on preventing ‘disasters’
• Predictable daily operations
• Catch problems before they affect the
users
Myths of high availability
• Failure to meet 24x7x365 is a technical
problem
• More hardware = better availability
• Training is not necessary
• Existing procedures and processes are
good enough
• High availability can be bought off the
shelf
• Can be achieved without ‘investment’
In search of 5 nines (99.999%)
• The percentage of uptime you have during
your scheduled hours of operation
• Stated hours of operation 24x7x365?
● 99% up time = 3.7 days of downtime per year
● 99.7% up time = 1 day
● 99.9% up time = 8.8 hours
● 99.99% up time = 52 minutes
● 99.999% up time = 5.3 minutes
• Hopefully you are not promising 24x7x365!
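The downtime figures above follow directly from the availability percentage. A minimal Python sketch of the arithmetic, assuming a 24x7x365 schedule of 8,760 hours per year; the function and parameter names are illustrative and not taken from any tool:

```python
# Downtime allowed per year at a given availability percentage.
# Assumption: a 24x7x365 schedule, i.e. 8,760 scheduled hours per year.
HOURS_PER_YEAR = 24 * 365


def downtime_hours(availability_pct, scheduled_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted at the given availability."""
    return scheduled_hours * (1 - availability_pct / 100)


for pct in (99, 99.7, 99.9, 99.99, 99.999):
    hours = downtime_hours(pct)
    print(f"{pct}% up time = {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

If the stated hours of operation are shorter than 24x7x365, pass them in instead, for example `downtime_hours(99.9, scheduled_hours=10 * 5 * 52)` for a 10-hour day, 5 days a week.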
Availability and Reliability…
• Availability…
● The percent of time that Exchange is accessible to the user community within the stated schedule of operations
● The proportion of time that a system can be used for productive work
● Lets you keep your job
• Reliability…
● An application or service provides the same results under similar load
● Provides consistent, correct results
● Lets you sleep a little better at night
Availability and Reliability…
• Don’t sacrifice reliability for availability!!!
• Don’t put off service pack application or critical system maintenance (e.g. replacing a dead disk) just so your availability numbers look good
• In general, 8 hours of scheduled, off-peak
downtime or degraded service is more
acceptable to users than 1 hour of unplanned
downtime in the middle of the business day.
Fault Tolerance versus High Availability
• Fault tolerance
● Components that keep an application functioning in the event of a component failure
• Disks (RAID 1, 5, 0+1)
• Redundant Power Supplies
• UPS
• High Availability
● Does not necessarily guarantee 100% availability, just higher availability
• Moving an application to an alternate server
So, what are WE talking about today?
• We are going to focus on:
● Reliability
● Fault tolerance
● Preventing ‘disasters’
● Increasing availability through better reliability, fault tolerance, and procedures
What is an Exchange disaster?
• Answers vary from organization to organization
● Typically loss of data
● Loss of messaging services for more than one or two hours during scheduled operations?
● Loss of a single mailbox?
● Failure of a specific service?
• Microsoft measures downtime based on the number of users affected!
● 1000 users on a server that is down for 5 minutes would be 5000 minutes of downtime!
● That kind of downtime does NOT look good on a resume
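The per-user measure above is a simple product of users affected and outage length. A small Python sketch of that calculation; the function names and the second outage in the usage example are hypothetical:

```python
def user_minutes_of_downtime(users_affected, outage_minutes):
    """Downtime weighted by the number of users who could not work."""
    return users_affected * outage_minutes


def total_user_minutes(outages):
    """Sum user-minutes over (users_affected, outage_minutes) pairs."""
    return sum(user_minutes_of_downtime(u, m) for u, m in outages)


# The slide's example: 1000 users on a server that is down for 5 minutes.
print(user_minutes_of_downtime(1000, 5))
```

Summing over several outages, for example `total_user_minutes([(1000, 5), (250, 60)])`, shows how a long outage on a small server can outweigh a short outage on a large one.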
Appraise the cost of downtime
• User productivity
• Missed contractual obligations
• Missed sales or customer contact
• Loss of customer confidence
• Loss of end user good will
• Loss of credibility
• Loss of your job!
Clustering 101
• Provides higher availability
• Clustering does exactly what it claims to do; it protects your organization against hardware failures.
• Clustering gets a bad rap for a number of reasons:
● Improper operations
● Lofty expectations or assumptions
• Allows the passive node to be shut down or rebooted for maintenance
Non-clustered configuration costs
• Possible configuration:
● Dell Dual Xeon 2.8GHz
• 4GB RAM
• 700GB disks
• 160/320GB SDLT Tape
● Windows 2003 Standard Server
● Exchange 2003 Enterprise Edition
● 1,500 Exchange CALs
● Veritas Backup Exec w/Exchange Agent
• Cost = approximately $91,000
Clustered configuration costs
• Possible configuration:
● 2 Dell Dual Xeon 2.8GHz
• 4GB RAM
• 700GB disks
● 2 copies Windows 2003 Advanced Server
● 1 copy Exchange 2003 Enterprise Edition
● 1,500 Exchange CALs
● Veritas Backup Exec w/Exchange Agent
● Veritas SAN Option
● Dell rack
● Dell fiber-based SAN and SAN connected 160/320GB SDLT Tape Drive
• Cost = approximately $190,000
To cluster or not to cluster….
• Price potentially doubles!
• Complexity triples!
● You must understand Windows / Active Directory / Exchange / Clustering / SANs
• Layer 8 problems – The Political layer
● Management expectations are higher!
● Danger Will Robinson! Danger!
• Layer 9 problems - The Bozone layer
● Snuffy the Network Admin
• Fail-over is NOT instantaneous (at best 2 – 3 minutes)
• Still have single points of failure (the SAN, the network infrastructure)
To cluster or not to cluster…
• If you don’t have 99.7% (1 day of
downtime) availability right NOW,
clustering won’t help.
• People and procedures are the highest
sources of failures.
“High availability starts from within,
grasshopper”
Downtime Common Causes: 13 customers and 25 outages
• 4 virus outbreaks requiring a shutdown
• 4 SAN failures
• 4 Shutdowns due to insufficient disk space
• 1 Exceeded 16GB limit on Exchange standard
• 1 File based A/V software corrupted EDB
• 1 Admin applied wrong security template
• 1 Operator could not restore database – 5 days!
• 1 Database corrupt, 1018 error (device driver)
• 1 Database corrupt, operator plugged external SCSI subsystem in while live
• 1 Loss of organization’s only global catalog
• 1 Loss of organization’s only DNS server
• 1 Administrator incorrectly configured directory replication – loss of GAL
• 1 Server blue screening every few hours (service pack / firmware issue)
• 1 Motherboard failure
• 1 SCSI controller failure
• 1 Power to the campus data center failed
Ooops…
• All but 3 of these outages could have been
prevented with better procedures, training,
and reliability preparedness.
• Only 2 of these could have been
prevented with clustering.
• Many of these were prolonged or made
worse due to insufficient training or
procedures.
• Exchange was not directly to blame
Change and Configuration Control
• Never make changes without a process in place:
● Document the changes to be made or patches to be applied.
● Test the change in your lab
● Responsible parties should review / approve
● Notify affected parties
● Schedule and give notice to the users
● Implement
• “Process” is going to become omnipresent for IT
Service Level Agreements (SLA)
• Many types of SLAs
● From vendor to customer
● From IT Department to management/users
• For an IT department, the SLA may provide:
● Published hours of operation
● Expected system responsiveness
● Guidelines for operation and recovery
● Sets expectations for the user community
● Guideline for planning server hardware and configuration
● May provide mechanism for reporting and accountability
SLA: Defining Recovery Time
• SLA states that in the event of corruption, it takes 4 hours to get a mailbox store back online
● Largest store size is 75GB
● DLT tape restores at 10GB per hour
● The BEST restore time you can expect for the largest store is > 8 hours!
● It is time to re-think store sizes, backup / restore devices, the distribution of mailboxes, or the SLA!
• Estimated recovery time may not accurately account for transaction log replay, either.
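The gap between the SLA and the hardware can be checked with the figures above. A minimal Python sketch, assuming restore time is simply store size divided by restore rate; it deliberately ignores transaction log replay and any other overhead, so real restores will take longer. The function names are illustrative:

```python
def restore_hours(store_gb, restore_rate_gb_per_hour):
    """Raw transfer time only; log replay and other overhead are not modeled."""
    return store_gb / restore_rate_gb_per_hour


def max_store_gb(sla_hours, restore_rate_gb_per_hour):
    """Largest store that can be restored within the SLA at this rate."""
    return sla_hours * restore_rate_gb_per_hour


def meets_sla(store_gb, restore_rate_gb_per_hour, sla_hours):
    return restore_hours(store_gb, restore_rate_gb_per_hour) <= sla_hours


# Figures from the slide: 75GB store, 10GB per hour from DLT, 4-hour SLA.
print(restore_hours(75, 10))   # raw transfer time in hours
print(max_store_gb(4, 10))     # largest store the 4-hour SLA can support
print(meets_sla(75, 10, 4))
```

Raw transfer alone comes to 7.5 hours for the 75GB store, already well past the 4-hour SLA before any overhead is added. At 10GB per hour, a 4-hour SLA supports a store of at most 40GB.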
Sample SLAs and information
• Intermedia
● http://www.intermedia.net/legal/shared_sla
• http://www.service-level-agreement.net
• http://servicelevelbooks.com
• http://www.oakland.edu/uts/helpdesk/docs/emailservicelevel.pdf
An ounce of prevention…
• Eliminate single points of failure
• Reliable servers / server configuration
• UPS capacity - 30 minutes
• Exchange configuration
• Monitoring
• Virus protection
• Regular, reliable backups
• Documentation
Where are your single points of failure?
• DNS
• Domain controllers
• Global catalog servers
• Front-end servers
• Storage redundancy
• Network infrastructure
● Backbone
● WAN links
● Inbound / outbound SMTP mail
Server Configuration
• Environment factors
● Potential heat or water damage?
● Physically secure
● It should be really hard to hit the power button
• Flash BIOS updates / firmware / device driver updates
● Motherboard, disk controllers, tape devices, SANs
● Check with your hardware vendor – The latest is not always the greatest
• Use good quality cables for networking, fiber, and SCSI connections
● Label and neatly tie-wrap them down!
• Caching controllers
● Use write caching only if battery backup exists; disable entirely otherwise
• Budget for a ‘cold standby’ server with identical hardware
Server Configuration - Disks
• SCSI disks provide better performance than IDE!
• Disk redundancy
● All disks should have redundancy (RAID 1, 5, 0+1)
• On database disks, keep the disks less than 50% full
● Improves restore performance
● Provides capacity for unexpected growth
● Allows for ESEUTIL repair
● Don’t forget enough disk space for RSG
• On transaction log disks, plan for at least a week
of transaction logs
• Never compress Exchange logs or databases!
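The 50% guideline for database disks is easy to check from a script. A minimal Python sketch using the standard library's `shutil.disk_usage`; the volume list is a placeholder assumption (`"."` is used so the sketch runs anywhere) and would be replaced with the actual database volumes:

```python
import shutil

# Placeholder: replace "." with the volumes that hold Exchange databases.
DATABASE_VOLUMES = ["."]


def percent_used(path):
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def over_threshold(path, max_percent=50.0):
    """True if the volume is fuller than the guideline allows."""
    return percent_used(path) > max_percent


for volume in DATABASE_VOLUMES:
    status = "OVER 50%" if over_threshold(volume) else "ok"
    print(f"{volume}: {percent_used(volume):.1f}% used ({status})")
```

Running this on a schedule and alerting on the result fits the daily disk space check; the threshold is a parameter so transaction log volumes can use a different limit.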
Server Configuration - Software
• Latest service pack, critical fixes, and
updates
• Device drivers – consult manufacturer
● Buggy disk device drivers (and controller firmware) are a common cause of corrupt databases
• Monitor security fixes
● Evaluate each security / critical update to see if it applies to you and how quickly it should be applied.
Server Configuration - Batteries Go Bad!
• Consult manufacturer for recommended schedule to replace:
● UPS batteries
● Caching controller batteries
Server Configuration - Consistency
• Organize Exchange servers into OUs
• Use OU policy for
● Auditing policy
● Event log sizes and overwrite configuration
● Security options
● Disabled services
• Custom registry settings
● Information Store MAPI ports
● System Attendant DS MAPI ports
● W3SVC service dependencies
● These can be included in the SCEREGVL.INF file – See KB 214752
• Avoid server-by-server registry changes if possible
• Avoid security templates that overly restrict the local security
settings or make file system permission changes.
Server Configuration – Gold Build
• Get your servers, software, and
configuration to a ‘gold build’
• Except for critical updates, don’t change
the configuration frequently
Change is the enemy of availability,
grasshopper!
Exchange Configuration
• Necessary to limit Exchange usage to
prevent out-of-control or unexpected
growth, viruses spreading, as well as
system abuse.
• Limit:
● Message sizes
● Recipient limits
● Mailbox sizes
Exchange Configuration – Message Delivery
Exchange Configuration – Mailbox Limits
Exchange Configuration – Misc.
• Configure deleted item recovery on all stores
• Configure deleted mailbox recovery
• Teach help desk how to recover ‘hard deleted’ items –
KB 178630
• Direct Exchange databases to RAID 5 or RAID 0+1 volumes
• Direct Exchange transaction logs to RAID 1 or RAID 0+1 volumes
● Preferably on a separate disk controller from the databases
• Do not rely on PSTs as primary mechanism for mail storage.
● PST = BAD
Exchange Configuration: Role Segmentation
• Dedicate Exchange servers to specific tasks:
● Mailbox servers
● Public folder servers
● Routing group / Internet / X.400 bridgehead
• Foreign mail system connectors (MS Mail, Notes)
• Wireless, fax, SMS, and pager gateways
● Front-end servers
• Segmentation can:
● Simplify the complexity of your environment
● Minimize impact of a server failure
● Reduce recovery times
• Often not practical in the ‘age of consolidation’
● If consolidating, keep mailbox servers separate from everything else
We can’t all be clairvoyant ..
• …but we can monitor…
• Implement some type of monitoring even if you
can’t afford NetIQ, OmniAnalyzer, MOM, etc… You will be glad you did!
• Exchange System Manager’s Status and Notifications is free! Recommend monitoring:
● Critical services
● Disk space
● Queue growth
● CPU usage
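Even a home-grown monitor comes down to comparing readings against limits you have set. A minimal Python sketch of that comparison, assuming the readings have already been collected by some other means; the metric names and threshold values are illustrative assumptions, not recommendations from this session:

```python
def find_alerts(sample, thresholds, stopped_services=()):
    """Compare one set of readings against limits and return alert messages.

    sample: dict of metric name -> current reading (collected elsewhere)
    thresholds: dict of metric name -> highest acceptable value
    stopped_services: names of critical services found not running
    """
    alerts = [f"service stopped: {name}" for name in stopped_services]
    for metric, limit in thresholds.items():
        value = sample.get(metric)
        if value is None:
            # A missing reading is itself worth investigating.
            alerts.append(f"no reading for {metric}")
        elif value > limit:
            alerts.append(f"{metric} is {value}, above limit {limit}")
    return alerts


# Hypothetical limits and readings, for illustration only.
thresholds = {"disk_used_percent": 50, "queue_length": 100, "cpu_percent": 80}
sample = {"disk_used_percent": 62, "queue_length": 12}
for alert in find_alerts(sample, thresholds, stopped_services=["MSExchangeIS"]):
    print(alert)
```

Treating a missing reading as an alert is a deliberate choice here: a monitor that silently stops collecting data is as bad as no monitor.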
Operational Procedures
• Follow standardized and documented procedures
• Keep logs of all changes, updates, and problems with Exchange
servers
• Whenever possible, do not work at the Exchange server console. Do
office administration and automation tasks at your desktop!
• Never use beta software from any vendor
• Never install an e-mail client on the Exchange server.
• Perform complete backups before any changes
• Do not apply service packs or updates immediately after release
• Do not delete user accounts and mailboxes right away. Set account
expiration to the day the user left and wait a month or two.
• Never set file-based virus scanning software to scan the M:\ drive or
any Exchange data or transaction log directories.
• If enabled, never use backup software to back up the M:\ drive
I just gotta defrag!
• Squash the urge to ‘over administer’ Exchange.
• Rarely a reason to perform offline maintenance or offline defrags
● Deleted or moved many mailboxes
● Users have recently performed a ‘purge’
• If you need to get away from your kids/spouse
and come in on weekends, use that time to test
your restoration or disaster recovery procedures
on a test network.
Daily operations
• The Big 5 daily tasks
● Perform and verify successful backups
● Check available disk space
● Update virus signatures / scanning engine
● Check the SMTP and X.400 queues
● Check the event logs
Events to watch for…
• Anything that indicates a problem or error must be investigated.
• Nightly successful backups
● NTBackup # 8001 – SG backed up
● ESE # 213 – SG backed up
● ESE # 224 – Log files being purged for SG
• Online maintenance (daily)
● ESE # 701 – Completed online defrag
● MSExchangeIS Mailbox # 1207 – Purged deleted items
● MSExchangeIS Mailbox # 9535 – Purged deleted mailboxes
● MSExchangeIS # 1221 – White space report
• Performance suffers if online maintenance does not complete.
● Make sure that online backups do not overlap online maintenance
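Because these events signal success, their absence is what needs attention. A minimal Python sketch that reports which of the events listed above did not appear, assuming the last day's events have already been exported as (source, event ID) pairs; how that export is produced is not shown and the function name is illustrative:

```python
# Events that should appear every day, taken from the list above.
EXPECTED_EVENTS = {
    ("NTBackup", 8001): "SG backed up",
    ("ESE", 213): "SG backed up",
    ("ESE", 224): "Log files being purged for SG",
    ("ESE", 701): "Completed online defrag",
    ("MSExchangeIS Mailbox", 1207): "Purged deleted items",
    ("MSExchangeIS Mailbox", 9535): "Purged deleted mailboxes",
    ("MSExchangeIS", 1221): "White space report",
}


def missing_events(seen):
    """Return the expected events that are absent from `seen`.

    seen: iterable of (source, event_id) pairs from the last 24 hours.
    """
    seen = set(seen)
    return {key: desc for key, desc in EXPECTED_EVENTS.items() if key not in seen}


# Hypothetical export in which online defrag never completed.
seen_today = [key for key in EXPECTED_EVENTS if key != ("ESE", 701)]
for (source, event_id), description in missing_events(seen_today).items():
    print(f"MISSING: {source} # {event_id} – {description}")
```

An empty result means every expected event was logged; anything else is a prompt to check whether backups or online maintenance were interrupted.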
Weekly or monthly operations
• If enabled, purge the BADMAIL directory
• Check the log file generation
• Purge / archive the protocol logging
directories
• Archive event logs
Virus protection
• Virus protection is mandatory in Exchange
environments!
• On the Exchange server, use an AVAPI 2.0 / 2.5 enabled virus scanner
• Keep the signatures up-to-date – daily!
• Client-side antivirus scanning is important,
too
• Publish a ‘forbidden attachment list’
Forbidden Attachment List – Minimal
• EXE
• COM
• CMD
• BAT
• CHM
• REG
• SCR
• VBS
• VB
• ASP
• EML
• HTM
• PIF
• HTML
• JS
• SHS
• WSH
• WSC
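A forbidden attachment list is ultimately a lookup on the file extension. A minimal Python sketch of that check using the minimal list above; it looks only at the final extension of the file name, does not inspect file contents, and is not how any particular scanning product implements blocking:

```python
import os

# The minimal forbidden list from the slide, lowercased for comparison.
FORBIDDEN_EXTENSIONS = {
    "exe", "com", "cmd", "bat", "chm", "reg", "scr", "vbs", "vb",
    "asp", "eml", "htm", "pif", "html", "js", "shs", "wsh", "wsc",
}


def is_forbidden(filename):
    """True if the file name's final extension is on the forbidden list."""
    _, extension = os.path.splitext(filename)
    return extension.lstrip(".").lower() in FORBIDDEN_EXTENSIONS


for name in ("invoice.pdf.exe", "Photo.SCR", "report.doc"):
    print(name, "blocked" if is_forbidden(name) else "allowed")
```

Checking the final extension catches the double-extension trick (`invoice.pdf.exe`), and lowercasing avoids missing `Photo.SCR`.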
Other Forbidden Attachments
• MPG
• MPEG
• MP3
• AVI
• WAV
• WMV
• And other file types that are large and / or possibly unbusiness-like.
Consider a Multi-Tier Approach
• Block unwanted content BEFORE it enters your
mailbox servers
• Scan for viruses at:
● A managed provider
● The perimeter network
● The Exchange server
● The client
• Use different scanning engines
• Possibly differing rules implemented for internal
versus external mail
Backup Procedures
“Our customers don’t buy a tape backup
solution, they buy a data restoration
solution” – David Tobey
• Base your backup plan and capacity requirements on:
● What you need to restore
● How much data you can reasonably sacrifice
● How quickly you need to restore
● Required restore times
Backup types
• Online (strongly recommended)
● Requires an Exchange ‘agent’
● Exchange backup APIs permit the database to be backed up ‘page-by-page’
● Users can continue to work, but online maintenance is halted
● Each page’s CRC is checked during an online backup.
● Backup halts if corruption discovered (it’s a feature, not a bug)
● Transaction logs are purged
• Off-line backup
● Stores must be dismounted
● Transaction logs must be manually purged (strongly discouraged!)
● No CRC check of database pages
Critical Online Backup Error
Backup Strategy 1
• Daily online backups of all data
● Entire storage group
● Additional Exchange data (KMS, SRS, files)
● System state
● All local files
Backup Strategy 2
• A paranoid approach
● Nightly full backup of all Exchange data
● Nightly backup of system state
● Differential backups every two hours
Backup Strategy 3
• Insufficient tape capacity
● Full backup of Exchange data once per week
● Nightly incremental or differential backups
Backup Strategy 4 – I get a warm and fuzzy feeling with this one!
• Designed for quick restores
● Nightly full backup to disk file
● Keep one or two previous backups on the local disk
● Back up the disk backup files to tape nightly
• Assumes lots of free disk space
• Most recent restore is very quick!
● 30 – 100GB per hour!
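Keeping only the latest disk backup plus one or two previous ones means something has to decide which files are old enough to remove. A minimal Python sketch of that selection, assuming backups are `.bkf` files in a single directory; the directory layout, file names, and function names are hypothetical, and the sketch only lists candidates without deleting anything:

```python
from pathlib import Path


def list_backups(backup_dir, pattern="*.bkf"):
    """Return (file name, modified time) pairs for backups in a directory."""
    return [(p.name, p.stat().st_mtime) for p in Path(backup_dir).glob(pattern)]


def backups_to_prune(backups, keep=3):
    """Return names of backups older than the `keep` most recent, newest first.

    backups: iterable of (file name, modified timestamp) pairs.
    keep=3 retains the latest backup plus two previous ones.
    """
    newest_first = sorted(backups, key=lambda b: b[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]


# Hypothetical week of nightly backups with increasing timestamps.
week = [("mon.bkf", 1), ("tue.bkf", 2), ("wed.bkf", 3), ("thu.bkf", 4), ("fri.bkf", 5)]
print(backups_to_prune(week, keep=3))
```

In practice the list would come from `list_backups(...)`, and a real job would presumably confirm the nightly copy to tape succeeded before removing anything.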
Minimize Downtime
• Have an escalation procedure and
decision matrix
• Keep your user community notified
• Keep a disaster recovery kit handy
• Practice makes perfect
Escalation
• What to do in the event of:
● Virus outbreak
● Network / Infrastructure / WAN outage
● Lost mailbox or message item
• Do you really want to restore an entire mailbox store to a recovery server?
● Corrupted mailbox or public folder store
● Wholesale server failure
• Who makes the decision of what to do next?
• And how soon?
● Avoid the “let’s try just one more thing” syndrome.
Disaster Recovery Kit
• Printed telephone list, operations procedures, and
escalation procedures
• Server hardware / Windows / Exchange documentation
• Product keys / activation codes / key disks
• Windows and Exchange product CDs
● Don’t forget the service pack CDs
• Current versions of all device drivers
• Third party CDs (antivirus, gateway, fax servers, etc..)
• Emergency repair disk for each server
• Keep the kit up to date.
• Do not loan this kit or the contents to anyone
Drawing for book giveaway
• Did you get your business card to me?
Questions?
Thanks for attending!
More information…
• Exchange 2000 Support Home Page
● http://support.microsoft.com/default.aspx?scid=fh;EN-US;exch2k
• Exchange 2003 Support Home Page
● http://support.microsoft.com/default.aspx?scid=fh;EN-US;exch2003
• Slipstick Systems
● http://www.slipstick.com
• My own links and info
● http://www.somorita.com
• “7 Daily Checks to Keep Exchange 2000 Running Smoothly” by Joe Neubauer
● http://www.exchangeadmin.com InstantDoc #26185
• My blog
● http://mostlyexchange.blogspot.com