The DBA Disaster Diary Real-world Oracle failures: How they were resolved, and how they could have been prevented Copyright © 2004 by Jeremiah Wilton.

Download Report

Transcript The DBA Disaster Diary Real-world Oracle failures: How they were resolved, and how they could have been prevented Copyright © 2004 by Jeremiah Wilton.

The DBA Disaster Diary
Real-world Oracle failures:
How they were resolved,
and how they could have been prevented
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Jeremiah Wilton, OCM
www.speakeasy.net/~jwilton
[email protected]
Independent Oracle Professional
Seattle, Washington
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
High Availability
• Many companies pursuing HA
• Prolonged projects
• Researching, designing and implementing
•
•
•
•
RAC
Dataguard
Clusters
Remote DR sites
• Huge capital expenditures
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Benefits of HA Big-Iron
4-week periods around rollout
12
10
8
Hours
6
down
Planned
Unplanned
4
2
0
Before HA
During
After HA
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
For my average outage in the past 12 months,
would RAC or Dataguard have helped me?
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #1
• 2-node RAC cluster
• Array controller fails
• Replacement parts 6 hours
away
• No value from RAC in this
outage
• Dataguard might have
helped…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #2
DataGuard
• Same controller failure with 2-site Dataguard
• Standby is missing last few transactions unless using
expensive zero data loss
• Management: “Wait for the part, save the data.”
• Dataguard provides no value in this outage
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #3
•
•
•
•
Same 2-site Dataguard
Developer’s backfill is missing a where clause
All customer names updated to “Fred Flintstone”
Developer takes 30 minutes to realize the mistake
and alert the DBA
• Too late; the standby has committed the changes
• Dataguard provides no value in this outage
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Solutions for hypothetical #3
• Flashback Query
– Extension of CR mechanism
– SMU only
– Retention period
• Logminer
– Generate undo SQL
– Apply one rowid at a time.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #4
Zero data loss
DataGuard
•
•
•
•
RAC
Zero data-loss dataguard w/3-day apply delay
SMU for flashback query
Logminer at the ready
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #4
•
•
•
•
•
Belt
Suspenders
Duct tape
Dry suit
Kevlar helmet
Welcome to the Disaster Diary…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Chapter 1: Double Failure
Setting: Internet Startup
Circa: June 1997
Staff: 1 novice DBA, 1 systems engineer
Hardware: DEC Alphaserver 4800 8 procs
Storage: 2 DEC Storageworks HSZ50 arrays of 100Gb of
RAID0+1
Operating System: Digital Unix 4.0B
Oracle version: 7.3.2.3
Oracle Support: Silver (US)
Backups: Daily hot tablespace backup to DLT tape
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Some company background
• Independent thinkers
• Intelligent
• Specific knowledge,
certifications less
important
• Critical thinking
• Work ethic
• New DBA after two
years of developer
management
• Developers had
managed well
–
–
–
–
–
Hot backups
v$sesstat tuning tools
v$session_event tools
Lock monitoring
Space management
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Vigilant at investigating new features
• 7.3 Maxextents unlimited
• Prevent senseless DML failures
select 'alter '||segment_type||' '||segment_name||
' storage (maxextents unlimited);'
from dba_segments
where segment_type in ('TABLE','INDEX');
• Seemed to work well…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Planned Outage
• During the DBA’s third week
• Modify and add some tables (simple DDL)
• Planned for late Thurs. PM/Fri. AM
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
DLT problems
• Thursday AM before outage, backup failed
with a DLT drive error:
/opt/app/oracle/product/7.3.2/local/bin/gnutar: Cannot write to /dev/nrmt0h: I/O error
BUS free ERROR - os_std, os_type = 11, std_type = 10 (from uerf)
• Outage on Fri. AM during backup window
• Last good backup before outage is Wed. AM.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
15-Minute Outage
1.
2.
3.
4.
5.
6.
7.
8.
9.
Put out the closed sign on the website
Shut down the listeners
Shut down the instance (immediate)
Start up the instance in restrict mode
Make schema changes
Start up the listeners
Restart the webserver software so it reconnects to Oracle
Make sure the site is working internally
Take down the closed sign and let the public back in
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
• Friday 01:00
SE begins outage steps; gets stuck reopening database with
ORA-00600 [4000].
• Friday 01:15
SE trying to figure out the problem.
–
–
–
–
Checks host logs
alert file
trace file from the ORA-00600
Searches Alta Vista and newsgroups for similar occurrences; no positive
results.
• Friday 01:45
SE in Silver Support severity-1 call queue.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Friday 02:00
SE gets Oracle Support in the UK. Analyst opens a TAR
– Requests alert and trace files via email.
– Will call back when received
•
Friday 02:45
Oracle Support has not called back
– SE calls back in on the TAR
– Analyst says she did not receive any email! “Sometimes their email systems can be
slow.”
– Around this time, SE receives his email bounced by Oracle’s mail server for being too
large.
Oracle and SE agree on FTP
– SE uploads the files
– Support analyst will examine and call back shortly
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Friday 04:30
DBA, CTO, VP Tech. arrive. Site now down 3.5 hours.
–
–
–
–
•
VP starts calling people at Oracle
The DBA is horrified.
DBA & SE call Oracle Support again since no callback yet
Support says “corruption of some sort”; restore from backup and roll forward
Friday 05:00
DBA & SE begin restore
•
Friday 06:15
Restore completes
– DBA & SE begin roll forward through 48 hours of logs
– Unaware of parallel recovery feature; recovering serially
•
Friday 06:45
– Recovery time projected to take nine hours
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
• Friday 12:00
A “friend” of the company knows Ellison; calls him;
– TAR transferred to bug diagnostics and escalation (BDE)
– Local Oracle office offers to send personnel on site
• Friday 15:30
Roll forward completes.
A cold backup is taken before attempting to open the database.
• Friday 16:00
• Recovered DB fails on open with ORA-00600 [4000] again
– Restore/recover had no effect
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
• Friday 16:45
16 hours into the outage
– BDE divulges a known bug:
[BUG:434596] prevents a database from being opened if BOOTSTRAP$ has
its MAXEXTENTS value modified. Although it is unlikely anyone would
intentionally change this, a number of users have run scripts to change the
storage on ALL tables and in doing so have encountered this problem. Fixed
in 7.3.4.
– Restore/Recover was a waste of time
– Original Support recommendation lacked specific understanding of problem
Roll forward did nothing to fix several-week-old BOOTSTRAP$ extent
modification
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Remain calm, think clearly
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Macro timeline
• This is a “sleeper”
– Data-related problem introduced many backup cycles
ago
– Sits on disk until database restart
– Too much transpired since last “good” backup
– No real recovery strategy available to end user
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
• Friday 17:30
BDE requests dial-in access
– “Work” copy of the DB placed new DEC Storageworks HSZ50 array for BDE
– Original broken DB preserved in place
Local Oracle office says on-site support analyst on the way
• Friday 18:00
Oracle Support BDE dialed in and working
• Friday 19:00
On-site Oracle Support arrives
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Friday 19:53
Media starts to take notice; Reuters report leads to articles in:
– Wall Street Journal
– TV networks
– Online news sites
Web site suffers outage
LOS ANGELES, June 6 (Reuter) - The World Wide Web site of [the company] went
down Friday, preventing Internet customers from ordering … from the online retailer.
According to a notice posted on the company's Web page, work was being done on the
system and customers should try back later. A spokesman for [the company] was not
immediately available to comment on the reason for the outage or how long it would
last.
•
Friday 22:00
BDE uses BBED to patch blocks of BOOSTRAP$’s extent map in datafile 1 of
SYSTEM tablespace
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
12:00
18:00
Saturday
00:00
06:00
• Friday 22:15
DB successfully opens!
– BDE repaired DB on new
storage array
– Now running on the new
array
• Friday 22:30
Sanity testing begins
• Friday 23:00
DB host crashes with Unix kernel
fault!
– 20 minutes to reboot
12:00
18:00
Sunday
00:00
06:00
• Friday 23:30
Site back up and available on
the Internet
Total outage time: 22 hours
and 30 minutes
Reduced YTD availability from
99.986% down to 99.729%
• Saturday 00:00
Hot backup of repaired DB
started
• Saturday 01:00
Everyone goes home.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Saturday 06:15
DBA & SE paged with corruption errors in alert file:
Corrupt block dba: 0x4c0530a6 file=19. blocknum=340134. found during buffer read on disk type:6. ver:1.
dba: 0x4c0530a6 inc:0x000004e0 seq:0x0000026c incseq:0x00110000
Reread of block=4c0530a6 file=19. blocknum=340134. found same corupted data
Errors coming fast, but website is still up
–
•
Small number of read failures
Saturday 06:45
SE back on phone with Oracle
DBA & CTO on their way back
•
Saturday 08:00
DBA arrives and joins call with Oracle
–
–
–
–
BDE engaged
Everyone suspects Oracle problem related to block-munching by BDE
Non-website internal tools are disabled
BDE looks at block dumps; log in to examine the problem
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Saturday 12:00
BDE says new RAID array randomly and steadily corrupting blocks
The DBA is horrified.
SE makes a point:
–
–
–
Although blocks on the disk getting corrupted, redologs are on good array
Redologs not getting corrupted
No corrupted data being read and rewritten
DBA & SE conclude that a good version of each successful transaction preserved in the
redologs
Middle of day and peak website period
–
–
–
DBA & SE decide to keep the database up
Will run and in hobbled state until period of low activity
Then restore most recent backup to working array & roll forward
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday
00:00
06:00
Saturday 16:30
DBA & SE copy cold backup from before first BDE patch to good array
– Most recent hot backup taken from bad RAID array
– Previous one taken after initial restore and recover but prior to patching
– BDE will have to log in and re-patch this copy
•
Saturday 21:00
Restore & recover in parallel with running DB
– DBA & SE start applying Saturday’s logs to the restored database
– Corrupting copy still running
– Parallel recovery this time
•
Sunday 03:00
Recovery catches up to the corrupted open production DB
– BDE logs in and reapplies block patches for BOOTSTRAP$ bug
– Recovered copy still not open; Corrupted copy still running
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
Friday
01:00 06:00
•
12:00
18:00
Saturday
00:00
06:00
12:00
18:00
Sunday 04:00
Site and running corrupted DB taken down
– DBA & SE take cold backup of recovered and repaired database
– Last few logs generated by the corrupted system applied
•
Sunday 06:00
Repaired DB opened; testing begins
•
Sunday 08:50
Testing complete; Site back up
•
Sunday 09:00
CEO arrives and takes pictures with disposable camera
•
Sunday 10:00
VP Tech. takes IT staff out for breakfast but nobody hungry
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Sunday
00:00
06:00
Causes of outage,
Avoidance strategies
• BOOTSTRAP$ bug “sleeper”
–
–
–
–
Watch for bugs (no good way)
Doing things to dictionary tables (not documented)
Reasonably, this was an unavoidable problem
Could have been avoided by not trying to be smart
• Array corruption (never verified – controller
replaced)
– Write programs to exercise and verify new arrays
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Sleepers
• Introduction of latent problem long before
manifestation makes recovery not an option
• Changes to many dictionary tables only read
at initialization
• Extremely insidious; no good recovery path
other than BBED
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Time Spent
Sum of Minutes
Activity
Applying redologs
Cold backup
Initial outage
Kernel Panic/Reboot
Restoring from tape
Testing
Troubleshooting
Waiting for Oracle Support
Working with Oracle Support
Top time sinks for this outage
• 34%
Applying redologs
• 23%
Waiting for Oracle Support
• 12%
Troubleshooting
• 10%
Testing
• 8%
Working with Oracle Support
• 5%
Restoring from tape
• 5%
Taking cold backups
• 2%
Kernel panic/reboot
• 1%
Initial outage
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Redolog apply time – 34%
• Lack of understanding of the problem
• Lack of more recent backup
• Ignorance of parallel recovery
–
–
–
–
Learn about new features
Practice variety of recoveries frequently
Tune recovery: wait events
Become an expert in recovery
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Waiting for Oracle Support – 23%
• Managing support is a science
• Create your own TAR to avoid errors
– MetaLink iTAR
– Dial-in to TAR entry system
•
•
•
•
•
Upload files to FTP site before calling in
Call in sev-1 and judge analyst’s skill
Escalate immediately if necessary
Find out when shift ends
Watch analyst annotations in TAR and add your
own
• Make your own judgments on validity of solution
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Troubleshooting – 12%
• Too much time spent fiddling
– Should establish rules and time limits
– When in doubt start the TAR while
troubleshooting
– Escalate to other staff quickly
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Time Spent
Sum of Minutes
Activity
Applying redologs
Cold backup
Initial outage
Kernel Panic/Reboot
Restoring from tape
Testing
Troubleshooting
Waiting for Oracle Support
Working with Oracle Support
Top time sinks for this outage
• 34%
Applying redologs
• 23%
Waiting for Oracle Support
• 12%
Troubleshooting
• 10%
Testing
• 8%
Working with Oracle Support
• 5%
Restoring from tape
• 5%
Taking cold backups
• 2%
Kernel panic/reboot
• 1%
Initial outage
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Lessons
• What technology would have helped?
– RAC?
– DataGuard?
– Backups?
• Hypothetical belt & Suspenders, helmet, etc.?
Zero data loss
DataGuard
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Lessons
• Most valuable availability assets:
– Solid understanding of internals
– Support evasion countermeasures
– Excellent support contract and personal contacts
• Attend conferences and seek to meet Oracle dev folks
• Never do anything to SYS-owned objects without
complete understanding
• Perform testing on new equipment before placing in
production
• Troll MetaLink, mailing lists and newsgroups for
reports of trouble with new features
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Real Failures
• Real failures never resemble the examples in
the RAC and DataGuard marketing
literature
• Real failures are weird, unpredictable and
require fast critical thinking
• Invest in equipment conservatively and
srategically
• Invest in education liberally
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Future Disaster Diary Chapters
• Share your disasters with me
– Anonymity
– Acknowledgements
– Preferably Oracle-related
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
• Half-day Crisis Management and Disaster
Recovery Seminars
• Week-long Basic and Advanced Oracle
Administration Classes
• Remote Emergency DBA Services
www.speakeasy.net/~jwilton
[email protected]
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Q&A
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.