Anatomy of a disaster recovery
Jim McBee
ITCS Hawaii
jim@somorita.com
Setting the stage
“Approximately 80 percent of unplanned
downtime is caused by people and
process issues, while the remainder is
caused by technology failures and
disasters”
Gartner Group study, March 16, 1999
Jim McBee – Shameless self-promotion
• Consultant, Writer, MCSE, MVP, and MCT – Honolulu, Hawaii
• Principal clients: SAIC, Dell, and Microsoft
• Author – Exchange 2003 24Seven (Sybex)
• Contributor – Exchange and Outlook Administrator
• Blog – Mostly Exchange – http://mostlyexchange.blogspot.com
Audience Assumptions
• Level 200 session
• You have at least a few months’ experience running Exchange 5.5, 2000, or 2003
• You have worked with Active Directory
• You can install and configure Windows and Exchange
Session’s coverage
• Presentation – About 60 minutes
  ● Common causes of downtime
  ● Case studies
  ● Summary of common things that delay recovery
  ● Some things that can speed recovery
• Book giveaway – Drop off your business card or write your name on a slip of paper
• Questions and answers – 15 to 20 minutes
• Catch me afterwards also, I’m here all week
Getting us on some common ground
• Disaster means different things to different people
• The word “disaster” usually carries the connotation of data loss or possibly financial loss
• In this presentation, disaster recovery is really restoration of service
Quick poll…
• Over the last two years, how many of you have had unplanned downtime of:
  ● 4 hours?
  ● 8 hours?
  ● Entire day?
  ● More than a day?
• Think about the positive and negative factors that contributed to the downtime and its recovery.
Common disaster causes
• 80% of unplanned downtime – People and Processes
• Infrastructure problems (DNS, DCs, GCs, LAN, WAN, storage)
Common Exchange failure reasons
(the number in front of each item is how many times it occurred)
● 5 File-based A/V software corrupted EDB
● 4 Virus outbreaks requiring a shutdown
● 4 SAN failures
● 4 Shutdowns due to insufficient disk space
● 3 OUs were deleted that contained user accounts
● 1 Exceeded 16GB limit on Exchange standard
● 1 Admin applied wrong security template
● 1 Operator could not restore database – 5 days!
● 1 Database corrupt, 1018 error (device driver)
● 1 Database corrupt, operator plugged external SCSI subsystem in while live
● 1 Loss of organization’s only global catalog
● 1 Loss of organization’s only DNS server
● 1 Administrator incorrectly configured directory replication – loss of GAL
● 1 Server blue screening every few hours (service pack / firmware issue)
● 1 Motherboard failure
● 1 SCSI controller failure
● 1 Power to the campus data center failed
What is the cost for delaying restoration of e-mail service?
• User productivity
• Missed contractual obligations
• Missed sales or customer contact
• Failure to respond to customers promptly
• Loss of end user good will
• Loss of credibility (the company’s and your own)
• Loss of your job!
Common causes of delays in restoration of service
• People, processes, training
• Lack of resources
• Not asking for help soon enough
Let’s look at some real-world situations (I hope some of these are therapeutic)
Case Study 1 - Anything that can go wrong will go wrong
• Exchange 2000 A/P cluster, with 800 mailboxes across multiple locations
• Administrator deletes entire OU (approximately 600 users)
• Some mailboxes still active, cannot restore from backup
• Previous backup tape was overwritten accidentally, next most recent in off-site storage
• Server administrator gets locked out of computer room and cannot get back in
• Tape device had to be moved from production network to recovery network.
Case Study 2 - DNS Failure
• Single E2K3 server, 700 mailboxes, and 2 W2K3 domain controllers
• One of the two domain controllers failed; it was hosting the only functioning DNS
• They were under the impression that they had redundancy
• Lack of DNS troubleshooting skills delayed repair by 4 hours
• DNS was never set up on the second domain controller even though it was defined on the clients and member servers as an alternate
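The gap in this case (an alternate DNS server that was configured on clients but never actually answered queries) is the kind of thing a periodic check can catch before an outage does. The following is a minimal sketch, not a tool from the presentation: it sends a hand-built DNS query over UDP to each configured server and reports which ones reply without error. The function names, the probe hostname, and the server addresses in the usage note are all hypothetical.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary transaction ID used to match the reply


def build_query(hostname, txn_id=QUERY_ID):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", txn_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def dns_server_answers(server_ip, hostname, timeout=2.0):
    """Return True if server_ip replies to the query with RCODE 0 (no error)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == QUERY_ID and is_response and rcode == 0


def check_configured_servers(servers, hostname, probe=dns_server_answers):
    """Map each configured DNS server to whether it actually answers."""
    return {ip: probe(ip, hostname) for ip in servers}
```

For example, `check_configured_servers(["10.0.0.10", "10.0.0.11"], "exchange01.example.local")` (made-up addresses and hostname) would return a dictionary showing which of the two servers really resolves names. Any server that shows `False` is redundancy that exists only on paper.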
Case Study 3 - Database corruption
• Exchange server database corruption, database would not mount
• “Reboot” mindset: reboot and hope the problem will go away
• Delayed calling for help for over a day. Affected 500+ mailboxes
• Don’t be afraid to call for help
Case Study 4 - Operator ineptitude
• Exchange 5.5 server with 350 mailboxes
• User deletes important public folder
• Inexperienced operator spends the next 4 business days trying to restore
• Got an error each time they restored and tried to start the store
• Boss did not want to front $245.00 for a PSS call
• Error was due to GUID mismatch. Run ISINTEG -PATCH.
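To put the $245.00 decision in perspective, here is a small worked calculation. It is only a sketch: the 8-hour business day is an assumption, and only the operator's time is counted, not the cost to the users who were waiting for the public folder.

```python
PSS_CALL_COST = 245.00     # from the case study
BUSINESS_DAYS_SPENT = 4    # from the case study
HOURS_PER_DAY = 8          # assumption: a standard 8-hour business day


def break_even_hourly_rate(call_cost, days, hours_per_day):
    """Hourly labor cost above which the support call would have been cheaper."""
    return call_cost / (days * hours_per_day)


rate = break_even_hourly_rate(PSS_CALL_COST, BUSINESS_DAYS_SPENT, HOURS_PER_DAY)
print(f"Break-even operator cost: ${rate:.2f} per hour")
```

Under that assumption the break-even point is roughly $7.66 per hour. If the operator costs the organization more than that, the four days of trial and error were more expensive than the call.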
Case Study 5 - Generic problem - Server out of disk space
• Server runs out of disk space - Very generic
• Almost always due to transaction logs
• Often a low-level Exchange admin may take hours to diagnose this problem.
• Event logs are helpful here
• My solution: Select some of the older log files and move them to another disk. Exchange usually does not have outstanding transactions in logs older than a few minutes. Pick something from hours or days ago.
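The "move the older logs" step can be sketched as a short script. This is an illustration under stated assumptions, not a supported procedure: it assumes the first storage group's log prefix (`E00`), treats `E00.log` as the current log and never touches it, and uses the file's modification time as the age test. The function name and the paths in the usage note are hypothetical. Files are moved, not deleted, so they can be put back if Exchange or a later restore turns out to need them.

```python
import shutil
import time
from pathlib import Path


def move_old_logs(log_dir, dest_dir, min_age_hours=24.0,
                  pattern="E00*.log", current_log="E00.log", now=None):
    """Move transaction logs older than min_age_hours to dest_dir.

    Returns the names of the files that were moved, oldest name first.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for log in sorted(Path(log_dir).glob(pattern)):
        if log.name.lower() == current_log.lower():
            continue  # never move the log Exchange is currently writing to
        if log.stat().st_mtime < cutoff:
            shutil.move(str(log), str(dest / log.name))
            moved.append(log.name)
    return moved
```

A call such as `move_old_logs(r"D:\Exchsrvr\mdbdata", r"E:\SpareLogs", min_age_hours=24)` (example paths only) frees space on the log drive while keeping every log file available on the other disk.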
Case Study 6 - All your eggs in one basket
• Exchange 2003 server with 300 mailboxes
• Single 150GB mailbox store
• Sales organization with a few key mailboxes used for customer communications
• Dial-tone restore
• Restore entire mailbox store to RSG
• Could not / did not segment users that could be restored.
• Server ran out of transaction log space during merge back into store (while D/R team was at lunch)
• Database files exceeded storage limits due to loss of single instance store
Case Study 7 - Recovery Server restoration
• Restored 5000 mailboxes to a recovery server
• Recovery server was on a test network that had a 10Mb/s connection to main network
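A quick back-of-the-envelope calculation shows why the 10Mb/s link matters. The slide does not give the amount of data involved, so the 150 GB figure below is borrowed from Case Study 6 purely as an illustration, and the link is assumed to run at full line rate with no protocol overhead (which real networks never achieve).

```python
def transfer_hours(data_gb, link_mbps, efficiency=1.0):
    """Hours needed to move data_gb (decimal gigabytes) over a link_mbps link.

    efficiency is the fraction of the line rate actually achieved (0 < e <= 1).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000          # 1 GB = 8000 megabits (decimal units)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Illustrative size only; the real data volume is not given in the case study.
print(f"{transfer_hours(150, 10):.1f} hours at 10 Mb/s")
print(f"{transfer_hours(150, 100):.1f} hours at 100 Mb/s")
```

Under these assumptions, 150 GB takes about 33 hours at 10Mb/s and a little over 3 hours at 100Mb/s. Knowing the bandwidth between the recovery server and production before a disaster is part of setting realistic restore expectations.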
Key Factors that Slowed Recovery – Human Factors
• Indecision / no one managing the crisis
  ● No clear escalation path / No SLA to guide recovery process
• Not calling for help in a timely fashion
  ● Timelines for escalation not established (at time X, call PSS, at time Y, ask for escalation)
  ● Everyone tends to get caught up in the “fire”
• Lack of training
• Employee fatigue
  ● Large scale interruptions of service may take 24+ hours to recover
  ● Bad decisions tend to be made when everyone gets tired
• Poor communications with users and management
• Doing further harm (deleting database files or event logs)
• Poor planning
  ● Incorrect / unrealistic expectations with respect to time, restore rates, data restored
  ● Blame-storming first
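The "at time X, call PSS, at time Y, ask for escalation" idea can be written down ahead of time as a simple table, so nobody has to make that call while caught up in the fire. The sketch below is one hypothetical way to express it. The one-hour and four-hour thresholds are placeholders, not recommendations from the presentation, and each organization would set its own.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: time since the outage began, and the action then due.
ESCALATION_PLAN = [
    (timedelta(hours=1), "Call PSS"),
    (timedelta(hours=4), "Ask PSS for escalation"),
]


def actions_due(outage_start, now, plan=ESCALATION_PLAN):
    """Return every action whose threshold has passed since outage_start."""
    elapsed = now - outage_start
    return [action for threshold, action in plan if elapsed >= threshold]


start = datetime(2005, 1, 10, 9, 0)
print(actions_due(start, start + timedelta(hours=2)))
```

Two hours into this example outage, the only action due is the PSS call. The point is less the code than the fact that the thresholds were agreed on before anyone was tired.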
Key Factors that Slowed Recovery – Documentation / Infrastructure
• Unknowns in your environment
  ● Infrastructure
  ● Time to restore, retrieve tapes, get decisions made
  ● Service levels for infrastructure such as LAN, WAN, storage
• Inadequate spare / replacement hardware
• Lack of a good, recent backup or cannot locate tapes
• No documentation on how to rebuild
• Large environments often have separate backup personnel that must be available when restore operations need to take place
• Lack of resources to do a disaster recovery (CD-ROMs, license keys, documentation)
• Server complexity (servers handling multiple roles)
Options that can speed recovery
• Training
• Practice, Practice, Practice
• Written D/R plan and escalation procedures
• Keep a written journal of everything you are doing to restore service no matter how mundane.
• Dial-tone restoration
• Rapidly available replacement hardware
• Documentation
• Creating a disaster recovery kit
• Restore critical mailboxes first (either using Dial-tone and RSG, or segment users to different mailbox stores)
• Reading the event logs
Disaster Recovery Kit (Crash cart)
• Printed telephone list, operations procedures, and escalation procedures
• Server hardware / Windows / Exchange documentation
• Product keys / activation codes / key disks
• Windows and Exchange product CDs
• Don’t forget the service pack CDs
• Current versions of all device drivers
• Third party CDs (antivirus, gateway, fax servers, etc.)
• Emergency repair disk for each server
• Keep the kit up to date.
• Do not loan this kit or the contents to anyone
Demonstrations
• Exchange outage
• Dial-tone restore
• Use Recovery Storage Group
Recovery Storage Group and Dial-Tone Recovery
• Mount “empty” database
  ● Hereafter known as the “dial-tone” database
  ● Users can go back to work
• Restore last backup to Recovery Storage Group
  ● Hereafter known as the “RSG” database
• Dismount “RSG” database
• Dismount “dial-tone” database
• Move the RSG database to production location and rename to production file names
• Move dial-tone database to RSG location and rename to RSG database name
  ● Rename the physical file names
  ● Using ESEUTIL /Y is the fastest way to move/copy files if on separate disks
• Mount dial-tone and RSG databases
• Use merge tools to merge changes from the database that is NOW in the RSG to the database that is NOW in production
• Production database is now up-to-date!
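The file-swap part of this procedure (the two "move and rename" steps) can be sketched in a few lines. This is only an illustration of the ordering, under the assumption that both databases are already dismounted. The function names and the `.swap` suffix are hypothetical, and it uses Python's `shutil.move` as a stand-in for the ESEUTIL /Y copy mentioned above, not as a replacement for it. Mounting, dismounting, and the merge itself are done with the Exchange tools and are not shown.

```python
import shutil
from pathlib import Path


def swap_files(production_path, rsg_path):
    """Exchange the contents of two file locations via a temporary name."""
    prod, rsg = Path(production_path), Path(rsg_path)
    parked = prod.with_name(prod.name + ".swap")   # hypothetical temp name
    if parked.exists():
        raise FileExistsError(f"leftover temporary file: {parked}")
    shutil.move(str(prod), str(parked))  # park the dial-tone file
    shutil.move(str(rsg), str(prod))     # restored file takes the production name
    shutil.move(str(parked), str(rsg))   # dial-tone file takes the RSG name


def swap_database_files(pairs):
    """Swap every (production_path, rsg_path) pair, e.g. the .edb and .stm files."""
    for production_path, rsg_path in pairs:
        swap_files(production_path, rsg_path)
```

For an Exchange 2003 mailbox store this would typically be called with two pairs, one for the `.edb` file and one for the `.stm` file (for example `priv1.edb` and `priv1.stm`; the actual names and paths depend on the server). After the swap, the restored data sits in the production location and the dial-tone data sits in the RSG, ready for the merge.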
Why do the database swap?
• Merging RSG database into dial-tone database destroys SIS (single instance store)
• Dial-tone database will not get mailbox metadata such as rules and permissions
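The effect of losing single instance storage can be illustrated with simple arithmetic. With SIS, a message sent to many mailboxes in the same store is kept once; when mailbox contents are merged back in one mailbox at a time, each recipient ends up with a separate copy. The message sizes and recipient counts below are invented for the example and do not come from the case studies.

```python
def store_size_mb(messages, single_instance):
    """Total storage for (size_mb, recipient_count) messages in one store."""
    total = 0.0
    for size_mb, recipients in messages:
        copies = 1 if single_instance else recipients
        total += size_mb * copies
    return total


# Hypothetical messages: (size in MB, number of recipients in the same store)
sample = [(5.0, 50), (0.1, 3), (2.0, 10)]
print(f"With SIS:    {store_size_mb(sample, single_instance=True):.1f} MB")
print(f"Without SIS: {store_size_mb(sample, single_instance=False):.1f} MB")
```

For these made-up numbers the same mail grows from about 7 MB to about 270 MB. Merging the smaller dial-tone database into the restored database keeps that growth confined to the mail received during the outage.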
Neat 3rd party tools
• OnTrack PowerControls
  ● http://www.ontrack.com/powercontrols/
• Quest Recovery Manager for Exchange
  ● http://wm.quest.com/products/Exchange/
Additional information
• The Exchange 2003 Technical Library contains a number of documents on disaster recovery and preparedness.
  ● http://tinyurl.com/2pua2
• The Definitive Guide to Exchange Disaster Recovery and Availability by Paul Robichaux
  ● http://tinyurl.com/73ghr
• Support Web Cast: Recovery Storage Groups and Disaster Recovery in Microsoft Exchange Server 2003
  ● KB 832436
• Understanding and analyzing -1018, -1019, and -1022 Exchange database errors
  ● KB 314917
Book Giveaway
• Has everyone given me something to draw from?
Questions?
• You can always catch me this week if you
don’t get your questions answered.
• Thanks for attending!
• My blog is Mostly Exchange –
http://mostlyexchange.blogspot.com